7 Natural Language Processing Techniques for Extracting Information

  • Dinesh Kumawat
  • Nov 18, 2019
  • NLP

Natural language processing (NLP), as the name suggests, deals with the processing of human language. NLP comprises two major functions: the first is “human-to-machine translation” (Natural Language Understanding, NLU), and the second is “machine-to-human translation” (Natural Language Generation, NLG). This blog introduces NLP and covers different NLP techniques for drawing inferences, mainly from sentiment data.


Let me start with a brief history of NLP. It began back in 1950 (yeah, that old :D) when Alan Turing published an article titled “Computing Machinery and Intelligence”, which proposed what is now known as the “Turing test”. The article considered the question “Can machines think?”. Since that question contains ambiguous words like “machines” and “think”, Turing suggested replacing it with another, closely related question expressed in unambiguous words.


In the 1960s, some early natural language processing systems such as SHRDLU were developed, alongside the work of Chomsky and others on formal language theory and generative syntax. From the late 1980s, the field changed markedly with the introduction of Machine Learning algorithms for language processing. Later, in the 2000s, massive amounts of audio and textual data became available to everyone.


Techniques of Natural Language Processing Covered 



  1. Named Entity Recognition (NER)

  2. Tokenization

  3. Stemming and Lemmatization

  4. Bag of Words

  5. Natural language generation

  6. Sentiment Analysis 

  7. Sentence Segmentation



Named Entity Recognition (NER)


This technique is one of the most popular and useful techniques in semantic analysis, where semantics is the meaning conveyed by the text. The algorithm takes a phrase or paragraph as input and identifies the named entities in it, that is, proper nouns such as the names of people, organizations and places.


This algorithm has many popular use cases; some everyday ones are listed below:


  1. News Categorization: The algorithm automatically scans news articles and extracts information such as the individuals, companies, organizations, celebrities and places mentioned in them. With these entities we can easily classify news content into different categories.


  2. Efficient Search Engine: The named entity recognition algorithm is applied to all articles, results and news items to extract relevant tags, which are stored separately. These tags speed up searching and make for an efficient search engine.


  3. Customer Support: Thousands of people post feedback about heavy-traffic areas on Twitter every day. With a Named Entity Recognition API, we can easily pull out the keywords (or tags), such as locations, and inform the concerned traffic police departments.
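

To make the idea concrete, here is a minimal sketch of entity extraction. It assumes a tiny hand-made lookup table (the hypothetical `GAZETTEER` below) instead of a trained model, so it only finds names it already knows. Real NER systems use statistical or neural models that also recognize unseen names.

```python
import re

# Hypothetical gazetteer: a tiny hand-made table of known entities and their types.
GAZETTEER = {
    "Alan Turing": "PERSON",
    "Analytics Steps": "ORG",
    "Twitter": "ORG",
    "New Delhi": "LOC",
}

def find_entities(text):
    """Return (entity, label, start_offset) tuples found in text, in reading order."""
    found = []
    for name, label in GAZETTEER.items():
        # \b keeps "Twitter" from matching inside a longer word.
        for match in re.finditer(r"\b" + re.escape(name) + r"\b", text):
            found.append((match.group(), label, match.start()))
    return sorted(found, key=lambda item: item[2])

entities = find_entities("Alan Turing never saw Twitter traffic reports from New Delhi.")
for entity, label, start in entities:
    print(entity, label, start)
```

The extracted (entity, label) pairs are the kind of tags that the use cases above would store for categorization or search.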





Tokenization


First of all, what does tokenization mean? It is basically the splitting of a whole text into a list of tokens, where tokens can be anything such as words, sentences, characters, numbers, punctuation, etc. Tokenization has two main advantages: it significantly reduces the search effort, and it makes effective use of storage space.


Mapping a document from characters to strings and from strings to words is the basic first step of any NLP problem, because to understand a text or document we need to interpret the words and sentences present in it.


Tokenization is an integral part of any Information Retrieval (IR) system: it not only pre-processes the text but also generates the tokens used in the indexing/ranking process. Various tokenization techniques are available. Note that Porter’s Algorithm, often mentioned alongside tokenization, is actually a stemming algorithm that is applied to tokens after they have been produced.
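

As a rough illustration, word-level tokenization can be sketched with a single regular expression. The `tokenize` function and its pattern are assumptions made for this example; library tokenizers handle many more edge cases (URLs, hyphenation, emoji and so on).

```python
import re

def tokenize(text):
    """Split text into word, number, and punctuation tokens."""
    # \w+(?:'\w+)? keeps contractions such as "Don't" in one piece;
    # [^\w\s] emits every punctuation mark as its own token.
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", text)

tokens = tokenize("Don't panic: NLP started in 1950!")
print(tokens)
```

Here words, numbers and punctuation marks each become separate tokens, which is the list form that indexing or later NLP steps work on.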



Stemming and Lemmatization


The amount of data and information on the web has been at an all-time high over the past couple of years. This huge volume demands tools and techniques that can extract inferences with ease.


“Stemming is the process of reducing inflected (or sometimes derived) words to their word stem, base or root form - generally a written form of the word.” In other words, stemming basically cuts off suffixes. After stemming, the word “playing” becomes “play” and “asked” becomes “ask”.


[Image: the difference between stemming and lemmatization]


Lemmatization usually refers to doing things properly, using a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma. In simple words, lemmatization reduces a word to its lemma after taking into account the word’s part of speech (POS) or context in the document.
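

The difference between the two can be sketched in a few lines. This is a toy version under stated assumptions: the `SUFFIXES` tuple and the `LEMMAS` mini-dictionary are made up for illustration, whereas real stemmers (such as Porter's) apply many more rules and real lemmatizers consult a full vocabulary.

```python
# Suffixes to strip, checked in this order (an assumption for this sketch).
SUFFIXES = ("ing", "ed", "es", "s")

# Hypothetical mini-dictionary keyed by (word, part of speech).
LEMMAS = {
    ("better", "ADJ"): "good",
    ("was", "VERB"): "be",
    ("meeting", "NOUN"): "meeting",
    ("meeting", "VERB"): "meet",
}

def stem(word):
    """Crude stemmer: chop the first matching suffix, keeping at least 3 letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def lemmatize(word, pos):
    """Look the word up by part of speech; fall back to the word itself."""
    return LEMMAS.get((word, pos), word)

print(stem("playing"), stem("asked"))
print(stem("meeting"), lemmatize("meeting", "NOUN"), lemmatize("meeting", "VERB"))
print(stem("better"), lemmatize("better", "ADJ"))
```

The stemmer always turns “meeting” into “meet”, while the lemmatizer keeps the noun intact and only reduces the verb. It can also map “better” to “good”, which no amount of suffix stripping would achieve.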



Bag of Words


The bag-of-words technique is used to pre-process text and extract features from a text document for use in Machine Learning modeling. It is a representation of text that describes the occurrence of words within a document. It is called a “bag” because of how it works: it is only concerned with whether (or how often) known words occur in the document, not where they occur.


Let’s take an example to understand bag-of-words in more detail. Below, we take two text documents:


“Neha was angry on Sunil and he was angry on Ramesh.”

“Neha love animals.”


Above are two documents that together form our corpus. We treat each document as a separate entity and make a list of all the words present in both documents, excluding punctuation:


“Neha”, “was”, “angry”, “on”, “Sunil”, “and”, “he”, “Ramesh”, “love”, “animals”


Then we turn these documents into vectors (converting text into numbers is called vectorization in ML) for further modelling.


“Neha was angry on Sunil and he was angry on Ramesh” is represented in vector form as [1,1,1,1,1,1,1,1,0,0], and “Neha love animals” as [1,0,0,0,0,0,0,0,1,1], where each position marks whether the corresponding word from the list above is present. If we count occurrences instead, the first vector becomes [1,2,2,2,1,1,1,1,0,0]. So, the bag-of-words technique is mainly used for feature generation from text data.
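

The same example can be reproduced with a short sketch in plain Python. The helper names (`words`, `binary_vector`, `count_vector`) are made up for this illustration; in practice a library vectorizer would normally be used, and it would usually lowercase the text as well.

```python
import re

docs = [
    "Neha was angry on Sunil and he was angry on Ramesh.",
    "Neha love animals.",
]

def words(text):
    """Extract the words of a text, dropping punctuation."""
    return re.findall(r"\w+", text)

# Build the vocabulary in first-seen order across all documents.
vocab = []
for doc in docs:
    for word in words(doc):
        if word not in vocab:
            vocab.append(word)

def binary_vector(text):
    """1 if the vocabulary word occurs in the text, else 0."""
    present = set(words(text))
    return [1 if word in present else 0 for word in vocab]

def count_vector(text):
    """Number of times each vocabulary word occurs in the text."""
    tokens = words(text)
    return [tokens.count(word) for word in vocab]

print(vocab)
for doc in docs:
    print(binary_vector(doc), count_vector(doc))
```

Every vector has one position per vocabulary word, so both documents end up with the same length regardless of how many words they contain.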



Natural Language Generation


Natural language generation (NLG) is a technique that converts raw structured data into plain English (or any other language). It is also called data storytelling. This technique is very helpful in organizations that handle large amounts of data, since it turns structured data into natural language for a better understanding of patterns and detailed insights into the business.


NLG can be viewed as the opposite of Natural Language Understanding (NLU), mentioned in the introduction. NLG makes data understandable to everyone by producing data-driven reports such as stock-market and financial reports, meeting memos, reports on product requirements, etc.


An NLG system typically works in several stages:


  1. Content Determination: Deciding which main content or information should be represented in the text.

  2. Document Structuring: Deciding the overall structure of the information to convey.

  3. Aggregation: Merging sentences to improve understanding and readability.

  4. Lexical Choice: Choosing appropriate words to convey the meaning of the sentence more clearly.

  5. Referring Expression Generation: Creating references that properly identify the main objects and regions in the text.

  6. Realization: Creating and optimizing text that follows all the norms of grammar (like syntax, morphology, orthography).
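

A few of these stages can be sketched with a simple template-based generator. Everything here is an assumption for illustration: the stock `records`, the 1% reporting threshold and the verb choices are invented, and production NLG systems are far more elaborate.

```python
# Hypothetical structured input: one record per stock.
records = [
    {"name": "ACME", "change_pct": 4.2},
    {"name": "Globex", "change_pct": -1.3},
    {"name": "Initech", "change_pct": 0.1},
]

def determine_content(rows, threshold=1.0):
    """Content determination: keep only moves large enough to report."""
    return [row for row in rows if abs(row["change_pct"]) >= threshold]

def choose_verb(change_pct):
    """Lexical choice: pick a verb matching the direction and size of the move."""
    if change_pct >= 3:
        return "jumped"
    if change_pct > 0:
        return "rose"
    if change_pct <= -3:
        return "plunged"
    return "fell"

def realize(rows):
    """Aggregation and realization: merge the clauses into one sentence."""
    clauses = [
        f"{row['name']} {choose_verb(row['change_pct'])} {abs(row['change_pct'])}%"
        for row in rows
    ]
    if not clauses:
        return "No notable moves today."
    if len(clauses) == 1:
        return clauses[0] + "."
    return ", ".join(clauses[:-1]) + " and " + clauses[-1] + "."

report = realize(determine_content(records))
print(report)
```

With the sample data this prints “ACME jumped 4.2% and Globex fell 1.3%.”, and the small Initech move is dropped at the content determination step.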



Sentiment Analysis


It is one of the most common natural language processing techniques. With sentiment analysis, we can understand the emotion or feeling expressed in a written text. Sentiment analysis is also known as Emotion AI or Opinion Mining.


The basic task of sentiment analysis is to find out whether the opinions expressed in a document, sentence, social media post or review are positive, negative, or neutral. This is also called finding the polarity of the text.


[Image: analysing sentiments as positive, negative and neutral]


Sentiment analysis usually works best on subjective text data rather than objective text data. Objective text generally consists of statements or facts that do not express any emotion or feeling. Subjective text, on the other hand, is usually written by humans to express emotions and feelings.


For example, Twitter is full of sentiment: users share their reactions and opinions on every possible topic. To access users’ tweets in real time, there is a powerful Python library called “Tweepy”.
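

Finding the polarity of a text can be sketched with a small word list. This is a deliberately naive example: the `POSITIVE` and `NEGATIVE` sets are hypothetical, and the approach ignores negation (“not good”), sarcasm and context, which real sentiment models have to handle.

```python
import re

# Hypothetical mini-lexicons of opinion words.
POSITIVE = {"good", "great", "love", "happy", "excellent"}
NEGATIVE = {"bad", "terrible", "hate", "angry", "awful"}

def polarity(text):
    """Classify text as positive, negative or neutral by counting opinion words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

for sample in ["I love this great phone", "Neha was angry on Sunil", "The meeting is at noon"]:
    print(sample, "->", polarity(sample))
```

The last sample is an objective statement with no opinion words, so it comes out as neutral, which matches the point that sentiment analysis has little to work with on objective text.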



Sentence Segmentation


The fundamental task of this technique is to divide a text into meaningful sentences or phrases. This involves identifying sentence boundaries between words in text documents. Almost all languages have punctuation marks that appear at sentence boundaries, but the same marks can also appear elsewhere (for example, the period in an abbreviation), so sentence segmentation is also referred to as sentence boundary detection, sentence boundary disambiguation or sentence boundary recognition.


There are many libraries available for sentence segmentation, such as NLTK, spaCy and Stanford CoreNLP, which provide specific functions for the task.
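

To show what such functions have to deal with, here is a minimal rule-based sketch in plain Python. It does not use any of the libraries above; the `ABBREVIATIONS` set is a small hypothetical list, and real segmenters rely on much richer rules or trained models.

```python
# Hypothetical list of abbreviations whose trailing period does not end a sentence.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "etc.", "e.g.", "i.e."}

def split_sentences(text):
    """Split text at '.', '!' or '?' unless the token is a known abbreviation."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        ends_sentence = token.endswith((".", "!", "?"))
        if ends_sentence and token.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:
        # Keep any trailing text that has no closing punctuation.
        sentences.append(" ".join(current))
    return sentences

sentences = split_sentences("Dr. Rao uses NLTK. Does it split well? Mostly, yes!")
for sentence in sentences:
    print(sentence)
```

The period after “Dr.” is skipped while the one after “NLTK.” closes a sentence, which is exactly the boundary disambiguation problem in miniature.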





In this blog, we have covered seven natural language processing techniques. There are other techniques as well, such as Natural Language Understanding, Aspect Modelling, Topic Modelling, Text Summarization and Decompounding, which help machines understand text more clearly.




