Codeq NLP API Tutorial 5

Part 5. Named Entities

By Rodrigo Alarcón, Computational Linguist

We started this series of tutorials to show you how to call different modules of Codeq’s NLP API. So far, we have covered the following topics:

In this tutorial we will detail the NLP modules for extracting and disambiguating named entities in texts. Named entities refer to real-world entities, that is, things in the world, whether concrete or abstract (e.g., persons, organizations, locations). Identifying named entities plays an important role in many NLP use cases, for example text summarization and the monitoring of brands or products.

The complete list of modules we offer can be found in our documentation:

Codeq NLP API Documentation

Define an NLP pipeline and analyze a text

We initialize an instance of the Codeq Client and declare a pipeline containing the desired annotators. After that, we can send a text to the client and retrieve a Document object with a list of Sentences, where the information related to named entities is stored. To print a quick overview of the results, you can use the method document.pretty_print(), which we will explain in detail in the following sections.

For each annotator of this pipeline we are going to explain:

the keyword (KEY) used to call the annotator,
the attribute (ATTR) where the output is stored,
the Output Labels of the annotator, if applicable.
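
The KEY and ATTR pairs covered in this tutorial can be collected in a small lookup table. The sketch below is only an illustration: ANNOTATOR_ATTRS and collect_outputs are hypothetical names that are not part of the Codeq client, and the sentence object is a stand-in built with SimpleNamespace so that the snippet runs without calling the API.

```python
from types import SimpleNamespace

# KEY -> ATTR pairs as listed in the sections of this tutorial.
ANNOTATOR_ATTRS = {
    "ner": "named_entities",
    "nel": "named_entities_linked",
    "salience": "named_entities_salience",
    "date": "dates",
    "coreference": "coreferences",
}


def collect_outputs(sentence, pipeline):
    """Return {KEY: output} for every annotator key in the pipeline."""
    outputs = {}
    for key in pipeline:
        if key not in ANNOTATOR_ATTRS:
            raise ValueError("Unknown annotator key: %s" % key)
        # Attributes that are not present fall back to None.
        outputs[key] = getattr(sentence, ANNOTATOR_ATTRS[key], None)
    return outputs


# Stand-in for a Sentence object (hypothetical data, not an API response).
fake_sentence = SimpleNamespace(named_entities=[["Tesla", "ORG", ["0"]]])
print(collect_outputs(fake_sentence, ["ner", "nel"]))
```

Keeping the mapping in one place makes it easier to loop over the annotators of a pipeline instead of hard-coding each attribute name.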

Named Entity Recognition

This annotator extracts named entities from texts and provides the tokens of the entity, its type and its position in the tokenized sentence.

KEY: ner
ATTR: sentence.named_entities

Output Labels

PER (person)
LOC (location)
ORG (organization)
MISC (miscellaneous)
DATE
MONEY
URL
PHONE
EMAIL
TWITTERNAME
TRACKINGNUMBER
AIRLINECODE
AIRLINENAME
AIRPORTCODE
AIRPORTNAME
EMOJI
SMILIE

pipe = ["ner"]

text = 'Tesla almost died earlier this year, Elon Musk said last Friday ' \
       'in an interview with Axios that aired on HBO. Musk said the company ' \
       'was "bleeding money like crazy" as it worked through the Model 3 ' \
       'production ramp in the spring and summer. He said the company ' \
       '"came within single-digit" weeks of death before it was able ' \
       'to meet its Model 3 production goals.'

document = client.analyze(text, pipeline=pipe)

for sentence in document.sentences:
    named_entities = sentence.named_entities
    for ne in named_entities:
        ne_tokens, ne_type, ne_position = ne
        print("ne_tokens: %s" % ne_tokens)
        print("ne_type: %s" % ne_type)
        print("ne_position: %s\n" % ne_position)

# Output:
# 
# ne_tokens: Tesla
# ne_type: ORG
# ne_position: ['0']
#
# ne_tokens: Elon Musk
# ne_type: PER
# ne_position: ['7', '8']
#
# ne_tokens: last Friday
# ne_type: DATE
# ne_position: ['10', '11']
#
# ne_tokens: Axios
# ne_type: ORG
# ne_position: ['16']
#
# ne_tokens: HBO
# ne_type: ORG
# ne_position: ['20']
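
Once the entities are extracted, a common next step is to group them by type. The following is a minimal sketch that works on a hard-coded copy of the output above rather than on a live API response; group_by_type and position_as_ints are hypothetical helper names introduced for illustration.

```python
from collections import defaultdict

# Entities copied from the example output above, in the
# (tokens, type, position) layout stored in sentence.named_entities.
named_entities = [
    ["Tesla", "ORG", ["0"]],
    ["Elon Musk", "PER", ["7", "8"]],
    ["last Friday", "DATE", ["10", "11"]],
    ["Axios", "ORG", ["16"]],
    ["HBO", "ORG", ["20"]],
]


def group_by_type(entities):
    """Map each entity type to the entity strings of that type."""
    grouped = defaultdict(list)
    for ne_tokens, ne_type, _ne_position in entities:
        grouped[ne_type].append(ne_tokens)
    return dict(grouped)


def position_as_ints(ne_position):
    """Convert string positions such as ['7', '8'] to integers."""
    return [int(index) for index in ne_position]


print(group_by_type(named_entities))
print(position_as_ints(["7", "8"]))
```

Note that the positions in the output above are strings, so they need to be converted before they can be used to index into a list of tokens.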

Named Entity Linking

Named Entity Linking (also called Named Entity Disambiguation) refers to the process of mapping a mention of an entity extracted from a text to a unique entry in a knowledge base.

Given the high ambiguity of language, a named entity can have multiple names and a name can be linked to different named entities. Hence, the main goal of this annotator is to disambiguate the entity mentions in their textual contexts and identify a concrete referent, using Wikipedia and Wikidata as knowledge bases.

KEY: nel
ATTR: sentence.named_entities_linked

pipe = [
    "ner", "nel"
]

document = client.analyze(text, pipeline=pipe)

for sentence in document.sentences:
    named_entities = sentence.named_entities
    named_entities_linked = sentence.named_entities_linked
    for i, ne in enumerate(named_entities):
        print("ne: %s" % ne)
        ne_linked = named_entities_linked[i]
        if ne_linked:
            label = ne_linked['label']
            description = ne_linked['description']
            wikipedia_link = ne_linked['wikipedia_link']
            wikidata_link = ne_linked['wikidata_link']
            print("- label: %s" % label)
            print("- description: %s" % description)
            print("- wikipedia: %s" % wikipedia_link)
            print("- wikidata: %s\n" % wikidata_link)
        else:
            # Entities that could not be linked are stored as None
            print("- None\n")

# Output:
# 
# ne: ['Tesla', 'ORG', ['0']]
# - label: Tesla
# - description: American automotive, energy storage and solar power company
# - wikipedia: https://en.wikipedia.org/wiki/Tesla,_Inc.
# - wikidata: https://www.wikidata.org/wiki/Q478214
#
# ne: ['Elon Musk', 'PER', ['7', '8']]
# - label: Elon Musk
# - description: South African-born American entrepreneur
# - wikipedia: https://en.wikipedia.org/wiki/Elon_Musk
# - wikidata: https://www.wikidata.org/wiki/Q317521
# 
# ne: ['last Friday', 'DATE', ['10', '11']]
# - None
# 
# ne: ['Axios', 'ORG', ['16']]
# - label: AXIOS Media
# - description: American news and information website
# - wikipedia: https://en.wikipedia.org/wiki/Axios_(website)
# - wikidata: https://www.wikidata.org/wiki/Q28230873
#
# ne: ['HBO', 'ORG', ['20']]
# - label: HBO
# - description: American pay television network
# - wikipedia: https://en.wikipedia.org/wiki/HBO
# - wikidata: https://www.wikidata.org/wiki/Q23633

The output stored in sentence.named_entities_linked is a list with the same number of elements as the list of entities found by the Named Entity Recognition module, which is stored in sentence.named_entities. If it is not possible to disambiguate a named entity (for example entities of type DATE, or cases where no referent is found in the knowledge base), then the corresponding element in sentence.named_entities_linked will be None.
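
Because both lists are aligned by index, they can be walked together with zip while skipping the None placeholders. The sketch below uses a hard-coded subset of the output above instead of a live response, and linked_only is a hypothetical helper name.

```python
# A subset of the example output above (not a live API response).
named_entities = [
    ["Tesla", "ORG", ["0"]],
    ["last Friday", "DATE", ["10", "11"]],
    ["HBO", "ORG", ["20"]],
]
named_entities_linked = [
    {
        "label": "Tesla",
        "description": "American automotive, energy storage and solar power company",
        "wikipedia_link": "https://en.wikipedia.org/wiki/Tesla,_Inc.",
        "wikidata_link": "https://www.wikidata.org/wiki/Q478214",
    },
    None,  # DATE entities are not linked
    {
        "label": "HBO",
        "description": "American pay television network",
        "wikipedia_link": "https://en.wikipedia.org/wiki/HBO",
        "wikidata_link": "https://www.wikidata.org/wiki/Q23633",
    },
]


def linked_only(entities, entities_linked):
    """Return (entity string, Wikidata URL) pairs for linked entities."""
    if len(entities) != len(entities_linked):
        raise ValueError("Both lists must have the same length")
    return [
        (ne[0], linked["wikidata_link"])
        for ne, linked in zip(entities, entities_linked)
        if linked is not None
    ]


for name, url in linked_only(named_entities, named_entities_linked):
    print("%s -> %s" % (name, url))
```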

Named Entity Salience

The goal of this annotator is to indicate the salience of named entities, that is, how relevant they are to the content of the input document. This annotator produces a tuple for each named entity, indicating whether the entity is salient (1) or not (0) and its salience score.

KEY: salience
ATTR: sentence.named_entities_salience

pipe = [
    "ner", "salience"
]

document = client.analyze(text, pipeline=pipe)

for sentence in document.sentences:
    named_entities = sentence.named_entities
    named_entities_salience = sentence.named_entities_salience
    for i, ne in enumerate(named_entities):
        ne_salience = named_entities_salience[i]
        is_salient, salience_score = ne_salience
        print("\nne: %s" % ne)
        print("- is_salient: %s" % is_salient)
        print("- score: %s" % salience_score)

# Output:
# 
# ne: ['Tesla', 'ORG', ['0']]
# - is_salient: 1
# - score: 0.4574693441390991
#
# ne: ['Elon Musk', 'PER', ['7', '8']]
# - is_salient: 0
# - score: 0.2523033320903778
#
# ne: ['last Friday', 'DATE', ['10', '11']]
# - is_salient: 0
# - score: 0
#
# ne: ['Axios', 'ORG', ['16']]
# - is_salient: 0
# - score: 0.31706923246383667
#
# ne: ['HBO', 'ORG', ['20']]
# - is_salient: 0
# - score: 0.25251904129981995
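
The salience tuples can be used to rank the entities of a sentence or to keep only the salient ones. This is a minimal sketch over a hard-coded copy of the output above; rank_by_salience and salient_entities are hypothetical helper names.

```python
# Entities and (is_salient, score) tuples copied from the output above.
named_entities = [
    ["Tesla", "ORG", ["0"]],
    ["Elon Musk", "PER", ["7", "8"]],
    ["last Friday", "DATE", ["10", "11"]],
    ["Axios", "ORG", ["16"]],
    ["HBO", "ORG", ["20"]],
]
named_entities_salience = [
    (1, 0.4574693441390991),
    (0, 0.2523033320903778),
    (0, 0),
    (0, 0.31706923246383667),
    (0, 0.25251904129981995),
]


def rank_by_salience(entities, saliences):
    """Return (entity string, score) pairs, highest score first."""
    paired = zip(entities, saliences)
    ranked = sorted(paired, key=lambda pair: pair[1][1], reverse=True)
    return [(ne[0], score) for ne, (_is_salient, score) in ranked]


def salient_entities(entities, saliences):
    """Return the entity strings flagged as salient (is_salient == 1)."""
    return [ne[0] for ne, (is_salient, _score) in zip(entities, saliences)
            if is_salient == 1]


for name, score in rank_by_salience(named_entities, named_entities_salience):
    print("%-12s %.3f" % (name, score))
print(salient_entities(named_entities, named_entities_salience))
```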

Date Resolution

This annotator tries to resolve temporal expressions in natural language (e.g., next Friday, last Monday, etc.) to specific dates. Expressions are resolved relative to a reference date, which defaults to today. The output includes the date entity, its token span and the resolved timestamp.

KEY: date
ATTR: sentence.dates
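
To make the idea of resolving a relative expression concrete, here is a small standalone sketch that uses only Python's datetime module. It does not call the API and is not how the annotator is implemented. It assumes that "last Friday" means the most recent Friday strictly before the reference date and that "next Friday" means the first Friday strictly after it; last_weekday and next_weekday are hypothetical helper names, and the reference date is chosen arbitrarily.

```python
from datetime import date, timedelta

FRIDAY = 4  # date.weekday() numbering: Monday is 0, Sunday is 6


def last_weekday(reference, weekday):
    """Most recent date strictly before `reference` on `weekday`."""
    days_back = (reference.weekday() - weekday) % 7
    if days_back == 0:
        days_back = 7  # the reference day itself does not count
    return reference - timedelta(days=days_back)


def next_weekday(reference, weekday):
    """First date strictly after `reference` on `weekday`."""
    days_ahead = (weekday - reference.weekday()) % 7
    if days_ahead == 0:
        days_ahead = 7  # the reference day itself does not count
    return reference + timedelta(days=days_ahead)


reference = date(2018, 11, 27)  # an arbitrary Tuesday used as "today"
print("last Friday:", last_weekday(reference, FRIDAY).isoformat())
print("next Friday:", next_weekday(reference, FRIDAY).isoformat())
```

The same expression resolves to a different date for every reference date, which is why the reference date matters when analyzing texts that were written in the past.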

Coreference Resolution

A coreference occurs when two or more mentions in a text refer to the same entity using different words. For example, in the sentence:

“John is working today, he is at the office.”

The pronoun he refers to the entity John.

Our coreference resolution module tries to resolve pronominal coreferences, i.e., to find the entities that personal pronouns (he, she, him, her, etc.) refer to. The annotator produces a resolved coreference including the mention of the entity, its referent and a coreference chain with all the elements that point to the same entity.

KEY: coreference
ATTR: sentence.coreferences

pipe = [
    "ner", "coreference"
]

document = client.analyze(text, pipeline=pipe)

for sentence in document.sentences:
    print("\n%s" % sentence.raw_sentence)
    coreferences = sentence.coreferences
    for coreference in coreferences:
        mention = coreference['mention']
        referent = coreference['referent']
        referent_chain = coreference['referent_chain']
        print("- mention: %s" % mention)
        print("- referent: %s" % referent)
        print("- referent_chain: %s\n" % referent_chain)

# Output:
# 
# Tesla almost died earlier this year, Elon Musk said last Friday in an interview with Axios that aired on HBO.
# 
# - mention: None
# - referent: ['00_07', ['Elon', 'Musk'], [7, 8]]
# - referent_chain: []
#
# Musk said the company was "bleeding money like crazy" as it worked through the Model 3 production ramp in the spring and summer.
# 
# - mention: ['01_00', ['Musk'], [0]]
# - referent: ['00_07', ['Elon', 'Musk'], [7, 8]]
# - referent_chain: ['00_07']
#
# - mention: None
# - referent: ['01_16', ['Model', '3'], [16, 17]]
# - referent_chain: []
#
# He said the company "came within single-digit" weeks of death before it was able to meet its Model 3 production goals.
# 
# - mention: ['02_00', ['He'], [0]]
# - referent: ['01_00', ['Musk'], [0]]
# - referent_chain: ['01_00', '00_07']
#
# - mention: ['02_19', ['Model', '3'], [19, 20]]
# - referent: ['01_16', ['Model', '3'], [16, 17]]
# - referent_chain: ['01_16']

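As we can observe from the output above, a coreference element can contain a null mention if it is the first referent found in a text. For example, the first element for Elon Musk has the mention None, the referent ['00_07', ['Elon', 'Musk'], [7, 8]] and an empty referent_chain.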

The referent_chain indicates the ids of the related coreference elements. For example, the mention He (02_00) has the referent_chain ['01_00', '00_07'].

In the output above, the referent chain ['01_00', '00_07'] of the mention He (02_00) means that all the related tokens, including the current mention, are He (02_00), Musk (01_00) and Elon Musk (00_07).
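
A chain of ids can be expanded back into tokens by first indexing every mention and referent by its id. The sketch below works on a hard-coded copy of three coreference elements from the output above, flattened into a single list across sentences; index_tokens and expand_chain are hypothetical helper names.

```python
# Coreference elements for Elon Musk, copied from the output above.
coreferences = [
    {"mention": None,
     "referent": ["00_07", ["Elon", "Musk"], [7, 8]],
     "referent_chain": []},
    {"mention": ["01_00", ["Musk"], [0]],
     "referent": ["00_07", ["Elon", "Musk"], [7, 8]],
     "referent_chain": ["00_07"]},
    {"mention": ["02_00", ["He"], [0]],
     "referent": ["01_00", ["Musk"], [0]],
     "referent_chain": ["01_00", "00_07"]},
]


def index_tokens(elements):
    """Map every element id (e.g. '00_07') to its list of tokens."""
    tokens_by_id = {}
    for coreference in elements:
        for element in (coreference["mention"], coreference["referent"]):
            if element is not None:  # first referents have no mention
                element_id, tokens, _positions = element
                tokens_by_id[element_id] = tokens
    return tokens_by_id


def expand_chain(coreference, tokens_by_id):
    """Return (id, tokens) for the mention and each id in its chain."""
    mention = coreference["mention"]
    ids = [mention[0]] if mention is not None else []
    ids = ids + coreference["referent_chain"]
    return [(element_id, tokens_by_id[element_id]) for element_id in ids]


tokens_by_id = index_tokens(coreferences)
for element_id, tokens in expand_chain(coreferences[2], tokens_by_id):
    print(element_id, " ".join(tokens))
```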

Wrap Up

In this tutorial we described different modules related to the extraction and disambiguation of named entities. The code below summarizes how to call the annotators explained here and access their output.

from codeq_nlp_api import CodeqClient

client = CodeqClient(user_id="USER_ID", user_key="USER_KEY")

pipe = [
    "ner", "nel", "salience",
    "date", "coreference"
]

text = 'Tesla almost died earlier this year, Elon Musk said last Friday in an interview with Axios that aired on HBO. Musk said the company was "bleeding money like crazy" as it worked through the Model 3 production ramp in the spring and summer. He said the company "came within single-digit" weeks of death before it was able to meet its Model 3 production goals.'

document = client.analyze(text, pipeline=pipe)

for sentence in document.sentences:
    raw_sentence = sentence.raw_sentence
    named_entities = sentence.named_entities
    named_entities_linked = sentence.named_entities_linked
    named_entities_salience = sentence.named_entities_salience
    dates = sentence.dates
    coreferences = sentence.coreferences

    print("\nraw_sentence: %s" % raw_sentence)
    print("nes:")
    for i, ne in enumerate(named_entities):
        ne_linked = named_entities_linked[i]
        salience = named_entities_salience[i]
        print("  ne: %s" % ne)
        print("  ne_linked: %s" % ne_linked)
        print("  salience: %s" % salience)
        print("")

    print("dates:")
    for date in dates:
        print("  date: %s" % date)

    print("coreferences:")
    for coreference in coreferences:
        print("  coreference: %s" % coreference)

Take a look at our documentation to learn more about the NLP tools we provide.

Do you need inspiration? Go to our use case demos and see how you can integrate different tools.

In our NLP demos section you can also try our tools and find examples of the output of each module.
