Document Summarization Inspiration

By: Chloe Kestermont, Product Manager

The news can be a source of overwhelming information, and helping humans manage massive amounts of textual data is what NLP does best. Our news cycle seems to be rapidly shortening, due mostly to our shift to consuming news online (when was the last time you actually held a newspaper?). Time and mental bandwidth are stretched thin, and keeping up with current events sometimes seems impossible. We’ve got information coming at us literally from left and right, and media bias can influence the direction of our opinions. Join me in a thought experiment to explore an interesting use case of the Codeq NLP API as I delve into how we can manage this rising tide of information with Natural Language Processing.

Let’s create an app focused on consolidating multiple news sources into one convenient location; we’ll call it “Next Level News.” For this theoretical app, we’ll use the Text Summarization, Named Entity Recognition, Named Entity Linking, Named Entity Salience, and Keyphrase Extraction modules. The Codeq API is a robust set of text understanding tools, containing more than 25 modules that extract rich representations from unstructured textual data, and it can be easily customized. Let’s create our own NLP pipeline, based on the linguistic tools we’ll require for our application, with the following Python code:

from codeq_nlp_api import CodeqClient

# Authenticate with the credentials from your Codeq account.
client = CodeqClient(user_id="YOUR_USER_ID", user_key="YOUR_USER_KEY")

text = "YOUR TEXT TO BE ANALYZED"

# Comma-separated list of the modules to run on the text.
pipeline = "summarize, ner, nel, salience, coreference, keyphrases"
document = client.analyze(text, pipeline)

# Document-level results.
print('Document summary:')
print(document.summary)
print()
print('Document keyphrases:')
print(document.keyphrases)
print()

# Sentence-level results.
for sentence in document.sentences:
    print('Sentence: ', sentence.raw_sentence)
    print('Named entities: ', sentence.named_entities)
    print('Linked named entities: ', sentence.named_entities_linked)
    print('Named entities salience: ', sentence.named_entities_salience)
    print()
    print('#' * 50)
    print()

Our mutually imagined app pulls in news from sources covering the entire political spectrum. In these increasingly fraught times, our app will help our users evaluate online articles by labeling each one with the media bias of its source, as determined by comparing media bias maps from multiple third-party organizations. This will give our users a broad view of the news available and help them assess the sources for themselves.
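To make the labeling idea concrete, here is a minimal sketch of how ratings from several bias maps might be combined into one label per source. Everything in it is an assumption made for illustration: the five-point scale, the label names, the `consensus_bias` helper, and the sample outlets are hypothetical and do not come from any real rating organization or from the Codeq API.

```python
from statistics import median

# Hypothetical five-point scale: -2 (left) through 0 (center) to +2 (right).
LABELS = {-2: "left", -1: "lean left", 0: "center", 1: "lean right", 2: "right"}


def consensus_bias(source, bias_maps):
    """Return a consensus label for `source`, or None if no map rates it.

    `bias_maps` is a list of dicts, one per rating organization, each
    mapping a source name to an integer score on the scale above.
    """
    ratings = [bias_map[source] for bias_map in bias_maps if source in bias_map]
    if not ratings:
        return None
    # The median is less sensitive to a single outlying rating than the mean.
    # With an even number of ratings the median can fall halfway between two
    # labels; round() then resolves the tie with round-half-to-even.
    return LABELS[round(median(ratings))]


# Made-up ratings from three imaginary organizations.
bias_maps = [
    {"Example Daily": -1, "Sample Post": 2},
    {"Example Daily": -1},
    {"Example Daily": 0, "Sample Post": 1},
]
print(consensus_bias("Example Daily", bias_maps))
```

Using the median rather than a single organization's verdict means no one map decides the label on its own, which fits the goal of letting users assess sources for themselves.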

As a key part of our media bias feature, our app will display summaries of news articles to help our users digest and compare information from a wide variety of sources. After preprocessing the content to extract only the text of the article, we’ll call the Text Summarization module (summarize) to automatically generate extractive summaries that contain the most relevant sentences of these news stories. Alternatively, if we want to shorten the summaries further, we could use the Summarization with Compression module (summarize_compress), a feature you won’t find in other NLP APIs. Where applicable, this module removes extraneous clauses without disturbing the main point of the sentences in our summaries, producing even more condensed results.

To improve the organization of information in our app, let’s call the Named Entity Recognition module (ner), which will scour the analyzed articles for named entities found in the sentences of our summaries, highlighting people, places, organizations, dates, and more. In conjunction with this module, we’ll call the Named Entity Linking module (nel), which will produce a list of disambiguated named entities and links to their respective Wikipedia pages. This will help us create distinct profiles for homonymous entities (for example, Michelle Williams the actress and Michelle Williams the singer). We’ll also call the Named Entity Salience module (salience), which will automatically detect the named entities most pertinent to the text, so our app will highlight only the most relevant entities found in each news summary. We can use that information to automatically create deep links that will take our users to an index of current content related to that entity.
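As a rough sketch of the deep-link index described above, the snippet below keeps only salient entities and groups article ids by the entity's Wikipedia link. The input format is an assumption made for illustration: each entity is a plain dict with `name`, `wiki`, and `salience` keys, which is not necessarily how the Codeq API returns its results, and the helper names, threshold, and sample data are hypothetical.

```python
from collections import defaultdict


def salient_entities(entities, threshold=0.5):
    """Keep only the entities whose salience score meets the threshold."""
    return [entity for entity in entities if entity["salience"] >= threshold]


def build_entity_index(articles, threshold=0.5):
    """Map each salient entity's Wikipedia link to the ids of its articles.

    `articles` maps an article id to a list of entity dicts.
    """
    index = defaultdict(list)
    for article_id, entities in articles.items():
        for entity in salient_entities(entities, threshold):
            # Keying on the link rather than the surface name keeps
            # homonymous entities in separate profiles.
            if article_id not in index[entity["wiki"]]:
                index[entity["wiki"]].append(article_id)
    return dict(index)


# Illustrative data only; the scores are invented.
ACTRESS = "https://en.wikipedia.org/wiki/Michelle_Williams_(actress)"
SINGER = "https://en.wikipedia.org/wiki/Michelle_Williams_(singer)"
articles = {
    "article-1": [
        {"name": "Michelle Williams", "wiki": ACTRESS, "salience": 0.9},
        {"name": "Los Angeles", "wiki": "https://en.wikipedia.org/wiki/Los_Angeles", "salience": 0.1},
    ],
    "article-2": [
        {"name": "Michelle Williams", "wiki": SINGER, "salience": 0.8},
    ],
}
index = build_entity_index(articles)
for link, article_ids in index.items():
    print(link, article_ids)
```

Both articles mention "Michelle Williams," but because the index is keyed on the linked page, each one lands under a different profile, and the low-salience mention of Los Angeles is left out.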

A quick mock-up of our thought experiment

To direct our users’ attention to the most pressing topics, let’s add a trending stories feature to the landing page our users first see when they open our app. To create a list of ranked topics, we’ll call on the Keyphrase Extraction module (keyphrases), which will generate a list of keyphrases, ordered from most to least relevant, that capture the issues covered by the articles summarized in our app. Our trending stories algorithm will then compile the top-ranked topics from these lists; selecting a topic will take our users to an index of articles from a wide variety of sources.
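One possible way to compile the top-ranked topics is sketched below. It assumes each article contributes a plain list of keyphrase strings already ordered from most to least relevant, and it scores each phrase by the sum of its reciprocal ranks across articles. That scoring rule and the `trending_topics` helper are illustrative choices, not part of the Codeq API, and the sample phrases are made up.

```python
from collections import defaultdict


def trending_topics(keyphrase_lists, top_n=3):
    """Rank keyphrases across articles by summed reciprocal rank.

    A phrase ranked first in an article adds 1 to its score, a phrase
    ranked second adds 1/2, and so on, so phrases that rank highly in
    many articles rise to the top.
    """
    scores = defaultdict(float)
    for phrases in keyphrase_lists:
        for rank, phrase in enumerate(phrases, start=1):
            # Lowercase so that capitalization differences do not split a topic.
            scores[phrase.lower()] += 1.0 / rank
    # Sort by descending score, breaking ties alphabetically for stable output.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [phrase for phrase, _ in ranked[:top_n]]


# Made-up keyphrase lists for three articles.
keyphrase_lists = [
    ["climate summit", "carbon tax", "elections"],
    ["elections", "climate summit"],
    ["Climate Summit", "trade deal"],
]
print(trending_topics(keyphrase_lists))
```

Reciprocal rank is only one reasonable weighting; a simple count of how many articles mention a phrase would also work if the order within each list matters less.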

Information management has always been an important part of critical thinking, but it is more important now than ever before. As online citizens, we depend on digital sources to keep us informed, yet keeping up with current events in 2020 can be almost a full-time concern. Summarization of news sources can help us avoid information overload by identifying relevant content in text-based news and presenting it in a concise, digestible format.

At Codeq, we’re proud to be making linguistic tools that can make a theoretical app like this possible. Try out this use case for yourself with our Document Summarization Demo and sign up to create your own experiments. What will you create with NLP?
