
Introduction to Natural Language Processing: Text Cleaning & Preprocessing

  • Rohit Dwivedi
  • May 27, 2020
  • Updated on: Jul 05, 2021

Introduction

 

Natural language processing (NLP) is the convergence of linguistics, computer science, and artificial intelligence. It mainly aims to bridge natural languages and computers, that is, to analyse and model high volumes of natural language data.

 

As you already know, computers understand numbers far better than they understand words. A great deal of research and development is happening in the domain of NLP every day.

 

A huge number of applications work today because of NLP. For example, Alexa and Apple's Siri can both understand a question and answer it, whether it is a request to play music or to fetch the weather. Spam mail filtering is another example of NLP.

 

Being in the data science world, we can use NLP for text classification, sentiment analysis (classifying sentiments as positive or negative), text summarization, and other classification tasks, among many applications.

 

(Must read: Top NLP python libraries)


Generally, whether the data is scraped or handed over for analysis, it will be in its natural human format of sentences or paragraphs. Before doing any analysis, we need to transform and clean that language so that the computer can understand it in the desired format.

 

Data preprocessing is a fundamental step while building a machine learning model. If the data is properly pre-processed, the results will be reliable. In NLP, the first step before building the machine learning model is to pre-process the data.

 

(Suggested read: Top NLP trends)

 

 

Data Preprocessing in NLP

 

Let’s see the different steps that are followed while preprocessing the data, several of which also help with dimensionality reduction:

 

  1. Tokenization 

  2. Lower casing

  3. Stop words removal

  4. Stemming

  5. Lemmatization

 

In the vector space model, each distinct term is an axis: texts or documents are represented as vectors in multi-dimensional space, and the number of distinct words determines the number of dimensions.
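As a quick illustration of this idea, here is a minimal bag-of-words sketch in plain Python (the two sample documents are made up for demonstration):

# Each distinct word becomes one axis; each document becomes a count vector
docs = ["natural language processing", "language models process language"]
vocab = sorted(set(word for doc in docs for word in doc.split()))
vectors = [[doc.split().count(word) for word in vocab] for doc in docs]
print(vocab)    # ['language', 'models', 'natural', 'process', 'processing']
print(vectors)  # [[1, 0, 1, 0, 1], [2, 1, 0, 1, 0]]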

 

The Python library used for the preprocessing tasks in NLP below is NLTK. You can install the package with “pip install nltk”.
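NLTK’s tokenizers, stop word lists, and lemmatizer rely on data packages that are downloaded separately from the library itself. A typical one-time setup for the examples in this article looks like this (resource names as of recent NLTK releases):

import nltk

# One-time downloads of the data packages used below
nltk.download('punkt')      # tokenizer models (newer NLTK versions may also need 'punkt_tab')
nltk.download('stopwords')  # stop word lists
nltk.download('wordnet')    # lexical database behind WordNetLemmatizer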

 

1. Tokenization:  

 

It is the method by which sentences are split into individual words, called tokens.


import nltk
from nltk.tokenize import word_tokenize

# Split the sentence into word-level tokens
token = word_tokenize("My Email address is: taneshbalodi8@gmail.com")
token

[Image: Tokenization output]
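Tokenization can also be done at the sentence level. As a small supplementary sketch (the sample text is made up), NLTK’s sent_tokenize splits raw text into sentences:

from nltk.tokenize import sent_tokenize

text = "NLP is fun. It powers assistants like Siri and Alexa."
print(sent_tokenize(text))  # ['NLP is fun.', 'It powers assistants like Siri and Alexa.']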


(Read also: Sentiment Analysis of YouTube Comments)

 

2. Lowercasing

 

Here we convert the tokenized words into lowercase format (NLU -> nlu). Words with the same meaning, such as nlp and NLP, would otherwise be treated as non-identical words in the vector space model.


# Convert every token to lowercase
lowercase_tokens = [t.lower() for t in token]
lowercase_tokens

[Image: Lowercasing output]
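To see why lowercasing matters for the vector space model, here is a tiny illustrative check (the word list is hypothetical) of how many distinct features survive the conversion:

words = ["NLP", "nlp", "Language", "language"]
print(len(set(words)))                  # 4 distinct tokens before lowercasing
print(len({w.lower() for w in words}))  # 2 distinct tokens after lowercasing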



3. Stop words removal:

 

Stop words are the most frequently used words that carry little significance when distinguishing between two documents (a, an, the, etc.), so they are removed. For instance, from the sentence “Introduction to Natural Language Processing” the word “to” would be removed.


from nltk.corpus import stopwords
from string import punctuation

stop_words = stopwords.words('english')  # common English words: 'a', 'an', 'the', ...
punct = list(punctuation)                # punctuation marks to strip as well

# `dataset` is assumed to be a collection of quote records loaded earlier
print(dataset[1]['quote'])
tokens = word_tokenize(dataset[1]['quote'])
len(tokens)

[Image: dataset quote and token count without stopword removal]


We got 50 tokens without removing stopwords; now we shall remove them.

 

# Keep only tokens that are neither stopwords nor punctuation
cleaned_tokens = [token for token in tokens
                  if token not in stop_words and token not in punct]
len(cleaned_tokens)

 

By removing the stopwords and punctuation, we are left with 24 tokens.
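To reproduce the smaller example mentioned above on a single sentence, we can compare each token (lowercased, since the stop word list is lowercase) against the list:

example = word_tokenize("Introduction to Natural Language Processing")
print([t for t in example if t.lower() not in stop_words])
# ['Introduction', 'Natural', 'Language', 'Processing']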


(Referred blog: What is SqueezeBERT in NLP?)

 

4. Stemming

 

It is the process in which words are reduced to their base form, usually by chopping off suffixes; the result is not always a valid dictionary word. Check the code implementation below, where several words are reduced to their stems.


from nltk.stem import PorterStemmer

ps = PorterStemmer()
print(ps.stem('jumping'))  # jump
print(ps.stem('lately'))   # late
print(ps.stem('assess'))   # assess
print(ps.stem('ran'))      # ran (stemming cannot handle irregular forms)

[Image: Stemming output]
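In practice the stemmer is applied to a whole token list rather than to individual words; continuing with the cleaned_tokens produced in the stop word step, a one-liner does the job:

# Stem every token that survived stop word removal
stemmed_tokens = [ps.stem(t) for t in cleaned_tokens]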


 

5. Lemmatization

 

Different from stemming, lemmatization reduces a word to its lemma, a real word in the language, using vocabulary and part-of-speech information. The part of speech matters: with the verb tag, “is” becomes “be”, whereas without any tag the lemmatizer treats “has” as a plural noun and returns “ha”. In the code below, the verb “ran” is mapped to “run” and the adjective “better” to “good”.


from nltk import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('ran', 'v'))     # run  ('v' marks a verb)
print(lemmatizer.lemmatize('better', 'a'))  # good ('a' marks an adjective)

[Image: Lemmatization output]
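The lemmatizer treats every word as a noun unless told otherwise, so in practice it is usually paired with a part-of-speech tagger. Below is a minimal sketch of that pattern; the wordnet_pos helper and the sample sentence are illustrative additions, not from the original article:

import nltk
from nltk import pos_tag, word_tokenize
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

nltk.download('averaged_perceptron_tagger')  # POS tagger model (resource name may vary across NLTK versions)

def wordnet_pos(treebank_tag):
    # Map Penn Treebank tags to the POS constants that lemmatize() understands
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    if treebank_tag.startswith('V'):
        return wordnet.VERB
    if treebank_tag.startswith('R'):
        return wordnet.ADV
    return wordnet.NOUN  # lemmatize() defaults to nouns

lemmatizer = WordNetLemmatizer()
tokens = word_tokenize("The children were running after the mice")
print([lemmatizer.lemmatize(t, wordnet_pos(tag)) for t, tag in pos_tag(tokens)])

With the POS tags supplied, “children” becomes “child”, “were” becomes “be”, “running” becomes “run”, and “mice” becomes “mouse”.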


 

Conclusion

 

In this blog, I have discussed the fundamental preprocessing steps that are required before building models in natural language processing: tokenization, lowercasing the text, stop word removal, stemming, and lemmatization. Although the range of problems to which natural language processing can be applied is wide, much research is still going on in this topic.

 

(Must read: 7 Natural Language Processing Techniques for Extracting Information)
