What is Stemming and Lemmatization in NLP?

  • Tanesh Balodi
  • Jul 14, 2020
  • NLP
  • Updated on: Jul 15, 2020

This blog covers stemming and lemmatization in detail, including how they differ and where they are applied in practice.

 

In short, stemming and lemmatization are text normalization techniques used in NLP as part of text preprocessing.

 

Let’s look at them in depth!

 

Introduction

 

A word can have multiple meanings depending on how it is used in a text. Conversely, different forms of a word often convey closely related meanings: “toy” and “toys”, for example, refer to the same thing.

 

You would probably see no difference in intent between a search for “toy” and a search for “toys”. This kind of variation between the forms of a word is called “inflection”, and it creates problems when interpreting queries.

 

Now consider the words “came” and “camel”: their search intents are entirely different, even though the two strings look as if they share a root. Similarly, if you search for the word “Love” on Google, the results also include related forms such as “Loves”, “Loved”, and “Loving”.


Stems of the word “Love”


Stemming and lemmatization are the strategies used to simplify such search queries.

 

Stemming and lemmatization were first developed in the 1960s. They are text normalization procedures from Natural Language Processing and text mining that prepare text, words, and documents for further processing. They are widely used in tagging, SEO, web search results, and information retrieval.

 

When implementing NLP, you will regularly run into words that share a root but appear in different surface forms. For example, a crude suffix-stripping stemmer can reduce the word “caring” to “car”, whereas lemmatization returns “care”.

 

What is Stemming?

 

We already know that a word has one root or base form with several variations: “play” is the base word, and “playing”, “played”, and “plays” are different forms of it. When these forms are stripped down carelessly, the result can carry an incorrect meaning or contain other errors.

 

The process of reducing inflected words to their root form is called stemming. It maps a group of related words to the same stem, even if that stem is not a meaningful word on its own.
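
The definition above can be sketched with a toy suffix stripper. This is only an illustration: the function name `toy_stem` and the suffix list are assumptions made for this example, not a published stemming algorithm.

```python
# Toy suffix-stripping stemmer (illustrative only, not a published algorithm).
SUFFIXES = ("ing", "ed", "es", "s")  # checked in this order


def toy_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


for w in ["playing", "played", "plays", "caring", "troubled"]:
    print(w, "->", toy_stem(w))
# playing -> play
# played -> play
# plays -> play
# caring -> car
# troubled -> troubl
```

The three forms of “play” land on the same stem, while “caring” and “troubled” show how blind stripping can produce a misleading word or a non-word.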

 

Moreover:

  • Stemming is a rule-based approach: it slices prefixes or suffixes off inflected words as needed, using a list of commonly used affixes such as “-ing”, “-ed”, “-es”, and “pre-”. The result may not be an actual word.

  • Two main errors occur while performing stemming: over-stemming and under-stemming. Over-stemming occurs when two words with different roots are reduced to the same stem. Under-stemming occurs when two words with the same root are reduced to different stems.

Two widely used stemmers, the Porter Stemmer and the Snowball Stemmer, are described below.
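
The two error types can be made concrete with a small sketch. It assumes a toy suffix stripper (`toy_stem`) and a helper (`same_stem`), both hypothetical names written for this illustration; real stemmers make the same kinds of mistakes on other word pairs.

```python
# Toy suffix stripper used only to illustrate the two error types.
SUFFIXES = ("ing", "ed", "es", "s")


def toy_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def same_stem(a: str, b: str) -> bool:
    """True when both words are reduced to the same stem."""
    return toy_stem(a) == toy_stem(b)


# Over-stemming: unrelated words collapse to one stem ("car").
print(same_stem("caring", "cars"))    # True
# Under-stemming: related words end up with different stems ("study" vs "studi").
print(same_stem("study", "studies"))  # False
```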

 

Defining Porter Stemmer

 

Porter Stemmer uses suffix stripping to produce stems. It does not apply linguistic rules to derive the stem in each case, which is why the stems it generates are not always actual English words.

 

It applies a fixed sequence of rules to produce stems, and each rule carries a condition that decides whether it is wise to strip the suffix or not. A computer program or subroutine that stems words may be called a stemming program, stemming algorithm, or stemmer.
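
One such condition in Porter's algorithm is the "measure" m of a stem: the number of vowel-consonant sequences in the pattern [C](VC)^m[V]. The sketch below computes m and applies a single rule from the algorithm, "(m > 1) EMENT -> (nothing)". It is not the full stemmer; the function names are chosen for this illustration, and lowercase input is assumed.

```python
VOWELS = set("aeiou")


def is_consonant(word: str, i: int) -> bool:
    """Porter's definition: 'y' is a vowel only when it follows a consonant."""
    ch = word[i]
    if ch in VOWELS:
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem: str) -> int:
    """Count the vowel-to-consonant transitions, i.e. m in [C](VC)^m[V]."""
    m = 0
    prev_vowel = False
    for i in range(len(stem)):
        cons = is_consonant(stem, i)
        if cons and prev_vowel:
            m += 1
        prev_vowel = not cons
    return m


def strip_ement(word: str) -> str:
    """Apply the single rule '(m > 1) EMENT -> (nothing)'."""
    if word.endswith("ement"):
        stem = word[: -len("ement")]
        if measure(stem) > 1:
            return stem
    return word


print(measure("tree"), measure("trouble"), measure("troubles"))  # 0 1 2
print(strip_ement("replacement"))  # replac
print(strip_ement("cement"))       # cement
```

The suffix is removed from “replacement” because the remaining stem is long enough (m = 2), but “cement” is left alone because stripping would leave only “c” (m = 0).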

 

Defining Snowball Stemmer

 

Martin Porter, the creator of the Snowball programming language, developed Snowball so that stemmers could also be written for languages other than English. Its English stemmer is an improved version of the Porter Stemmer and is also known as the Porter2 Stemmer.

 

For example, if you stem the word “badly” with Snowball's English stemmer and with the original Porter algorithm, you get different results. Consider the code below:

 

from nltk.stem import SnowballStemmer  # NLTK expects lowercase language names
print(SnowballStemmer("english").stem("badly"))

Output: bad

 

Here, the word “badly” is stemmed with Snowball's English algorithm, and the output is “bad”. When the Snowball stemmer is instead told to use the original Porter algorithm on the same word, the output is “badli”:

 

print(SnowballStemmer("porter").stem("badly"))

Output: badli

 

In this example, the Snowball (Porter2) stemmer produces a real English word where the original Porter algorithm does not.

 

What is Lemmatization?

 

Put simply, lemmatization is a method that converts any form of a word to its base or root form.

 

In other words, lemmatization groups the different inflected forms of a word under a root form that has the same meaning. It is similar to stemming, but the word it returns has a dictionary meaning. Extracting the correct lemma of each word requires morphological analysis.

 

For example, lemmatization correctly reduces ‘troubled’ to the base form ‘trouble’, which is a meaningful word, whereas stemming cuts off the ‘ed’ and produces ‘troubl’, which is misspelled and has no meaning.

 

‘troubled’ -> Lemmatization -> ‘trouble’

‘troubled’ -> Stemming -> ‘troubl’
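
A minimal sketch of the dictionary-based idea behind lemmatization follows. It assumes a tiny hand-written lexicon keyed by word and part of speech; `LEXICON` and `toy_lemmatize` are hypothetical names for this illustration, and a real lemmatizer relies on a full dictionary and morphological analysis instead.

```python
# Hypothetical mini-lexicon: (inflected form, part of speech) -> lemma.
LEXICON = {
    ("troubled", "verb"): "trouble",
    ("studies", "noun"): "study",
    ("studies", "verb"): "study",
    ("caring", "verb"): "care",
    ("saw", "verb"): "see",
    ("saw", "noun"): "saw",
}


def toy_lemmatize(word: str, pos: str) -> str:
    """Look the word up with its part of speech; fall back to the word itself."""
    key = (word.lower(), pos)
    return LEXICON.get(key, word.lower())


print(toy_lemmatize("troubled", "verb"))  # trouble
print(toy_lemmatize("saw", "verb"))       # see
print(toy_lemmatize("saw", "noun"))       # saw
```

Every returned value is a dictionary word, and the same surface form (“saw”) maps to different lemmas depending on its role in the sentence.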

 

 

What is the Difference Between Stemming and Lemmatization?

 

| S.No | Stemming | Lemmatization |
|------|----------|---------------|
| 1 | Stemming is faster because it chops words without knowing the context of the word in the given sentence. | Lemmatization is slower than stemming, but it takes the context of the word into account before proceeding. |
| 2 | It is a rule-based approach. | It is a dictionary-based approach. |
| 3 | Accuracy is lower. | Accuracy is higher than with stemming. |
| 4 | When a word is converted to its root form, stemming may produce a form that does not exist as a word. | Lemmatization always returns a dictionary word when converting to the root form. |
| 5 | Stemming is preferred when the meaning of the word is not important for the analysis. Example: Spam Detection | Lemmatization is recommended when the meaning of the word is important for the analysis. Example: Question Answering |
| 6 | For example: “Studies” => “Studi” | For example: “Studies” => “Study” |

 



Difference between Stemming and Lemmatization


What are the applications of Stemming and Lemmatization?

 

Stemming and lemmatization are widely used in text mining, the process of analyzing text written in natural language and extracting high-quality information from it.

 

Text mining tasks include text categorization, text clustering, the creation of granular taxonomies, sentiment analysis, document summarization, and entity relation modeling.

 

The most useful applications are the following:

 

  1. Information Retrieval (IR)

Stemming and lemmatization are used to map documents to common topics and to serve search results from an index as the number of documents grows.

 

Query expansion is used in search systems: the system takes the user’s query and expands it so that it matches additional documents.

 

For example, someone who searches for 'marketing' may not be satisfied with results that show only 'markets' and not 'marketing'. With the help of stemming, and with a suitable choice of stemming algorithm, the results can be improved. Google Search adopted word stemming in 2003.
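
The retrieval idea can be sketched with a small inverted index whose keys are stems, so that a query matches documents containing any inflected form of the query term. The documents, the toy stemmer, and the function names are all assumptions made for this illustration; a production search engine would use an established stemmer and a far richer index.

```python
from collections import defaultdict

SUFFIXES = ("ing", "ed", "es", "s")


def toy_stem(word: str) -> str:
    """Toy suffix stripper (illustrative only)."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Hypothetical document collection: id -> text.
DOCS = {
    1: "marketing strategies for small shops",
    2: "local markets open on sunday",
    3: "stock prices fell",
}

# Build the inverted index: stem -> set of document ids.
index = defaultdict(set)
for doc_id, text in DOCS.items():
    for token in text.split():
        index[toy_stem(token)].add(doc_id)


def search(query: str) -> list:
    """Return ids of documents that contain every stemmed query term."""
    term_sets = [index.get(toy_stem(t), set()) for t in query.split()]
    if not term_sets:
        return []
    return sorted(set.intersection(*term_sets))


print(search("marketing"))  # [1, 2]
print(search("shop"))       # [1]
```

Because “marketing” and “markets” share the stem “market”, either query retrieves both documents, which is the behavior described above.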

 

  2. Sentiment Analysis

Sentiment analysis, the analysis of reviews and comments that users leave about something, is commonly used to evaluate products, for example by online retail shops. Stemming and lemmatization are used as a text-preparation step before the text is interpreted.

 

  3. Document Clustering

 

Document clustering (or text clustering) is the application of cluster analysis to textual documents. Its essential applications range from automatic document organization and topic extraction to fast information retrieval.

 

Stemming and lemmatization are applied to reduce the number of tokens needed to convey the same information, which speeds up the whole process. After this preprocessing step, features are computed by counting the frequency of each token, and then clustering methods are applied.
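
The preprocessing step described above can be sketched as follows: tokens are stemmed, and token frequencies are then counted as features. The two sample documents and the toy stemmer are assumptions for this illustration, and the clustering step itself is left out.

```python
from collections import Counter

SUFFIXES = ("ing", "ed", "es", "s")


def toy_stem(word: str) -> str:
    """Toy suffix stripper (illustrative only)."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def stemmed_counts(doc: str) -> Counter:
    """Frequency of each stemmed token in one document."""
    return Counter(toy_stem(token) for token in doc.split())


docs = ["the player played two plays", "players keep playing"]

raw_vocab = {token for doc in docs for token in doc.split()}
stem_vocab = {toy_stem(token) for doc in docs for token in doc.split()}

print(len(raw_vocab), len(stem_vocab))  # 8 5
print(stemmed_counts(docs[0])["play"])  # 2
```

The vocabulary shrinks from eight distinct tokens to five stems, so the feature vectors passed to a clustering method are smaller and related forms contribute to the same count.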


 

Conclusion

 

In conclusion, we have seen the pros and cons of both stemming and lemmatization, along with the differences between them. Building the dictionary that lets a lemmatization algorithm look up the proper form of each word requires strong linguistic knowledge.

 

Stemming and lemmatization have many other applications, such as text categorization, text clustering and extraction, sentiment analysis, entity relation modeling, and document summarization.
