Need for Natural Language Processing (NLP)
Much of the recent growth in the volume and variety of data comes from unstructured text; in fact, up to 80% of all enterprise data is unstructured text. Companies collect huge numbers of documents, emails, social media posts, and other text-based information to get to know their customers better, offer services, or market their products. However, most of this data remains unused and untouched.
Text analytics, through the use of natural language processing (NLP), holds the key to unlocking the business value within these vast data resources.
In the era of big data, businesses can fully utilize this potential by taking advantage of the latest parallel text analytics and NLP algorithms packaged in a variety of open-source software such as R and Python.
What is NLP?
Natural Language Processing is the scientific discipline concerned with making natural language accessible to machines. NLP addresses tasks such as identifying sentence boundaries in documents, extracting relationships from documents, and searching for and retrieving documents, among others. NLP is a necessary means to facilitate text analytics by establishing structure in unstructured text, enabling further analysis.
Common tasks for NLP
An example of a common NLP task is the identification of paragraph, sentence, and word boundaries within a document.
For Example: “Apple has generated sufficient profits this year.”
NLP tries to understand such a sentence by choosing the most likely interpretation given sufficient context: it determines whether 'Apple' refers to the company or to the fruit.
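This kind of context-based disambiguation can be sketched very simply: count context words associated with each sense and pick the stronger match. The function name, the clue sets, and the scoring are all illustrative assumptions, not a real disambiguation algorithm.

```python
# Toy word-sense disambiguation for "Apple": count context clues for
# each sense and return the sense with the most overlapping clue words.
# (Illustrative sketch only; real systems use trained statistical models.)
SENSE_CLUES = {
    "company": {"profits", "shares", "iphone", "ceo", "stock", "year"},
    "fruit": {"tree", "juice", "eat", "ripe", "orchard", "pie"},
}

def disambiguate_apple(sentence):
    words = {w.strip(".,").lower() for w in sentence.split()}
    scores = {sense: len(words & clues)
              for sense, clues in SENSE_CLUES.items()}
    return max(scores, key=scores.get)

print(disambiguate_apple("Apple has generated sufficient profits this year."))
```

With the example sentence above, the clue words 'profits' and 'year' outweigh any fruit-related clues, so the company sense wins.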
The table below illustrates examples of the following common NLP tasks:
Sentence segmentation identifies where one sentence ends and another begins. Punctuation often marks sentence boundaries, but there are many exceptions in the usage of language.
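A minimal punctuation-based segmenter illustrates both the rule and its exceptions. This sketch splits on sentence-final punctuation followed by a capitalized word, while skipping a small hand-picked list of abbreviations; the function name and abbreviation list are assumptions for illustration.

```python
import re

# Minimal sentence segmenter: split after ., !, or ? when followed by
# whitespace and an uppercase letter, skipping common abbreviations.
ABBREVIATIONS = {"Dr.", "Mr.", "Mrs.", "Ms.", "etc."}

def segment_sentences(text):
    sentences, start = [], 0
    for match in re.finditer(r"[.!?]\s+(?=[A-Z])", text):
        candidate = text[start:match.end()].strip()
        if candidate.split()[-1] in ABBREVIATIONS:
            continue  # period belongs to an abbreviation, not a boundary
        sentences.append(candidate)
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences

print(segment_sentences("Dr. Smith arrived. He gave a talk. Everyone listened!"))
```

Note how 'Dr.' does not trigger a boundary even though it ends in a period; real segmenters learn such exceptions from data rather than listing them by hand.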
Tokenization is the process of identifying individual words, numbers, and other single coherent constructs. Hashtags in Twitter feeds are an example of constructs consisting of special and alphanumeric characters that should be treated as a single coherent token.
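A single regular expression can capture this hashtag behavior. This is a rough sketch assuming a simple token grammar, not a full Twitter-aware tokenizer; the pattern and function name are illustrative.

```python
import re

# Tokenizer that keeps hashtags and @mentions as single tokens,
# while splitting off other punctuation as separate tokens.
TOKEN_PATTERN = re.compile(r"[#@]\w+|\w+(?:'\w+)?|[^\w\s]")

def tokenize(text):
    return TOKEN_PATTERN.findall(text)

print(tokenize("Loving the new phone! #TechNews2024"))
```

The hashtag '#TechNews2024' survives as one token because the pattern's first alternative matches it before the punctuation rule can split off the '#'.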
Stemming removes word endings (suffixes). This process is often used in sentiment analysis or semantic analysis, for example to retrieve documents regardless of whether a word appears as 'innovations' or 'innovates'.
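The idea can be sketched with naive suffix stripping: try a list of suffixes from longest to shortest and cut the first one that fits. The suffix list and function name are assumptions for illustration; production systems use the Porter or Snowball algorithms, which apply ordered rewrite rules.

```python
# Naive suffix-stripping stemmer (toy sketch, not the Porter algorithm).
# Suffixes are ordered longest first so the most specific rule wins.
SUFFIXES = ["ations", "ation", "ates", "ions", "ing", "es", "s"]

def stem(word):
    for suffix in SUFFIXES:
        # Require a stem of at least 3 letters to avoid over-stripping
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print(stem("innovations"), stem("innovates"))
```

Both 'innovations' and 'innovates' reduce to the same stem, which is exactly what lets a retrieval system match documents containing either form.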
Part-of-Speech (PoS) tagging processes a sequence of words and attaches a part-of-speech tag, such as verb, noun, or adjective, to each word in a sentence. A commonly used set of PoS tags can be found in the Penn Treebank tag set, and the example in the table above uses this tag set.
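A minimal tagger can be sketched as a lexicon lookup with a crude fallback for unknown words. The tiny lexicon and the back-off rule are illustrative assumptions; real taggers are statistical and use sentence context, but the output format with Penn Treebank tags (DT, NN, VBZ, RB, NNP) matches what such tools produce.

```python
# Tiny lookup-based PoS tagger using Penn Treebank tags.
LEXICON = {
    "the": "DT", "a": "DT",
    "dog": "NN", "park": "NN",
    "runs": "VBZ", "went": "VBD",
    "quickly": "RB", "happy": "JJ",
}

def pos_tag(tokens):
    tagged = []
    for token in tokens:
        tag = LEXICON.get(token.lower())
        if tag is None:
            # Back off: capitalized unknown -> proper noun, else noun
            tag = "NNP" if token[0].isupper() else "NN"
        tagged.append((token, tag))
    return tagged

print(pos_tag(["The", "dog", "runs", "quickly"]))
```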
Parsing identifies the grammatical structure of a sentence. The example in the table supports the conclusion that 'Katie and Sam' are to be treated as a conjunctive noun phrase (NP) and that both of them were involved in the action 'went.' Parsing is often a prerequisite for other NLP tasks such as named entity recognition.
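The 'Katie and Sam' case can be sketched as a toy chunker over PoS-tagged input: whenever a proper noun is followed by 'and' and another proper noun, group the three tokens into one conjunctive NP. This is a hypothetical shallow-chunking sketch, far simpler than a real parser, which builds full syntactic trees.

```python
# Toy chunker: groups "NNP and NNP" sequences into one conjunctive NP,
# as in "Katie and Sam went home."
def chunk_conjoined_np(tagged):
    chunks, i = [], 0
    while i < len(tagged):
        if (i + 2 < len(tagged)
                and tagged[i][1] == "NNP"
                and tagged[i + 1][0].lower() == "and"
                and tagged[i + 2][1] == "NNP"):
            phrase = " ".join(tok for tok, _ in tagged[i:i + 3])
            chunks.append(("NP", phrase))
            i += 3
        else:
            chunks.append((tagged[i][1], tagged[i][0]))
            i += 1
    return chunks

tagged = [("Katie", "NNP"), ("and", "CC"), ("Sam", "NNP"),
          ("went", "VBD"), ("home", "NN")]
print(chunk_conjoined_np(tagged))
```

The single NP chunk makes explicit that both Katie and Sam share the verb 'went,' which is the conclusion the parsing example in the table is meant to support.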
Named entity recognition identifies entities such as persons, locations, and times within documents. After an entity has been introduced in a text, language commonly makes use of pronouns such as 'he,' 'she,' 'it,' or 'them' instead of repeating the noun. Reference resolution attempts to identify multiple mentions of an entity in a sentence or document and mark them as the same instance.
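A deliberately naive sketch of reference resolution links each pronoun to the most recently mentioned entity. The function name and the recency heuristic are assumptions for illustration; real coreference systems also check gender, number, and syntactic constraints.

```python
# Naive reference resolution: link each pronoun to the most recently
# mentioned known entity (recency heuristic only).
PRONOUNS = {"he", "she", "it", "they", "him", "her", "them"}

def resolve_references(tokens, entities):
    last_entity, links = None, []
    for token in tokens:
        if token in entities:
            last_entity = token  # remember the latest entity mention
        elif token.lower() in PRONOUNS and last_entity is not None:
            links.append((token, last_entity))
    return links

tokens = "Katie went home . She was tired .".split()
print(resolve_references(tokens, {"Katie"}))
```

Here 'She' is marked as another mention of 'Katie,' so both refer to the same instance, which is precisely what reference resolution is meant to establish.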
Open source software
The table below illustrates a sample of open-source tools used to accomplish several common NLP tasks and use cases.
Found this interesting? Explore Jigsaw's new course, Text Mining with R.
Also check out the video Text Analytics from the course.