Introduction to Natural Language Processing: Text Cleaning & Preprocessing

  • Rohit Dwivedi
  • May 27, 2020
  • NLP

Natural language processing, also known as NLP, is the convergence of linguistics, computer science and artificial intelligence. It mainly aims at the interconnection between natural languages and computers, that is, how to analyse and model high volumes of natural language data.

 

As you already know, computers are able to understand numbers much better than words. There is a lot of research and development happening in the domain of NLP every day.

 

A huge number of applications work today because of NLP. Alexa and Apple's Siri, for example, can both understand a question whenever someone asks one, and can even answer it, whether the request is to play music or to fetch weather updates. Spam mail filtering is another example of NLP.

 

In the data science world, we can use NLP for text classification, sentiment analysis (classifying sentiments as positive or negative), text summarization and other classification models, among other applications.


Generally, whether the data is scraped or given to us for analysis, it will be in its natural human format of sentences, paragraphs, etc. Before doing any analysis, we need to transform and clean that language so that the computer can understand it in the desired format.

 

Data preprocessing is a fundamental step while building a machine learning model. If the data is properly pre-processed, the results will be reliable. In NLP, the first step before building the machine learning model is to pre-process the data. Let's see the various steps that are followed while preprocessing the data, which also help with dimensionality reduction.

 

  1. Tokenization 

  2. Lower casing

  3. Stop words removal

  4. Stemming

  5. Lemmatization

 

Each term is an axis in the vector space model. In this multi-dimensional space, texts or documents are represented as vectors, and the number of distinct words gives the number of dimensions.

 

The Python library that is used for these preprocessing tasks in NLP is NLTK. You can install the nltk package using “pip install nltk”.

 

 

Tokenization

 

Tokenization is the process in which sentences are split into individual words (tokens).


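A minimal sketch of this step using NLTK's word_tokenize (the example sentence is my own):

import nltk
from nltk.tokenize import word_tokenize

nltk.download('punkt')  # one-time download of the tokenizer models

sentence = "Introduction to Natural Language Processing"
tokens = word_tokenize(sentence)
print(tokens)
# ['Introduction', 'to', 'Natural', 'Language', 'Processing']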


 

Lowercasing

 

Convert the tokenized words into lowercase format (e.g. NLU -> nlu). Words having the same meaning, like nlp and NLP, will be treated as non-identical words in the vector space model if they are not converted to lowercase.


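A minimal sketch using Python's built-in str.lower(), applied to the tokens from the previous step:

tokens = ['Introduction', 'to', 'Natural', 'Language', 'Processing']
lower_tokens = [token.lower() for token in tokens]
print(lower_tokens)
# ['introduction', 'to', 'natural', 'language', 'processing']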



Stop words removal

 

Stop words are the most frequently used words that have no significance in distinguishing two different documents (a, an, the, etc.), so they are removed. For example, from the sentence “Introduction to Natural Language Processing”, the word “to” is removed.


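A minimal sketch using NLTK's English stop word list:

import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')  # one-time download of the stop word lists

stop_words = set(stopwords.words('english'))
tokens = ['introduction', 'to', 'natural', 'language', 'processing']
filtered = [token for token in tokens if token not in stop_words]
print(filtered)
# ['introduction', 'natural', 'language', 'processing']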


 

Stemming

 

It is the process in which words are converted to their base form. Check the code sketch below, where the words of a sentence are converted to their base form.


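A minimal sketch using NLTK's PorterStemmer:

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ['introduction', 'natural', 'language', 'processing']
stems = [stemmer.stem(word) for word in words]
print(stems)
# ['introduct', 'natur', 'languag', 'process']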


 

Lemmatization

 

Different from stemming, lemmatization reduces a word to a lemma, a word that exists in the language; in the original example, has and is are changed to ha and be respectively.


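A minimal sketch using NLTK's WordNetLemmatizer; note that the lemmatizer needs the part of speech (pos) to resolve verbs properly, which is likely why the original example shows ha for has:

import nltk
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')  # one-time download of the WordNet data

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('has', pos='v'))  # 'have'
print(lemmatizer.lemmatize('is', pos='v'))   # 'be'
print(lemmatizer.lemmatize('has'))           # without pos it defaults to noun and the lemma differs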


These are the steps used for text preprocessing in NLP. Python libraries like nltk, spacy and textblob can be used for these tasks; you can check their documentation and use them.

 

 

Twitter Sentiment Analysis

 

To get further into pre-processing textual data for analysis, let us quickly dive into the code. The data consists of users' tweets, and the task is to classify these tweets into positive and negative sentiments. You can download the dataset from here. For this task, I will work in a Jupyter Notebook.


 

Steps

 

  1. Importing the dataset 

 

Initially, the necessary Python libraries like numpy and pandas are imported. Then, the dataset is imported using the standard way of importing CSV files, and a bit of EDA is done. To know more about the EDA that can be done to get familiar with the data, you can visit here. The shape of the data is (9093, 3).


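A minimal sketch of the import step; the filename tweets.csv is an assumption, so use the path of the file you downloaded:

import numpy as np
import pandas as pd

df = pd.read_csv('tweets.csv')  # hypothetical filename
print(df.shape)  # (9093, 3)
df.head()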


  2. Dropping null values

 

All the rows having null values are removed. There were 5802 null values present in the emotion_in_tweet_is_directed_at column.


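A minimal sketch of this step:

# count the missing values per column, then drop every row that contains any
print(df.isnull().sum())  # emotion_in_tweet_is_directed_at shows 5802 nulls
df = df.dropna()
print(df.shape)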


 

  3. Preprocessing the data

 

  • Converting all the text to lowercase using the str.lower() function.

  • Keeping only numbers, alphabets and the characters #+_ in the text data using the re.sub() function.

  • Stripping surrounding whitespace from the text using .strip(), as sketched below.


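A minimal sketch of these three steps; the exact regular expression used in the original is not shown, so the pattern below simply keeps lowercase letters, digits, whitespace and the characters #+_ as described:

import re

def clean_text(text):
    text = str(text).lower()                    # lowercase
    text = re.sub(r'[^a-z0-9#+_\s]', '', text)  # keep only the selected characters
    return text.strip()                         # strip surrounding whitespace

df['tweet_text'] = df['tweet_text'].apply(clean_text)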

 



After these steps the data is cleaned and preprocessed: the text is converted to lowercase, only the selected characters are kept, and surrounding whitespace is stripped.


 

  4. Encoding the target

 

Encode the target column so that a positive sentiment becomes 1 and a negative sentiment becomes 0.


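A minimal sketch of the encoding; the exact label strings are assumptions based on typical versions of this dataset:

target_col = 'is_there_an_emotion_directed_at_a_brand_or_product'

# keep only the positive / negative tweets and map them to 1 / 0
# (label strings below are assumed, not confirmed by the original)
df = df[df[target_col].isin(['Positive emotion', 'Negative emotion'])]
df[target_col] = df[target_col].map({'Positive emotion': 1, 'Negative emotion': 0})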


  5. Features and labels

 

  • "tweet_text" is the independent attribute and "is_there_an_emotion_directed_at_a_brand_or_product" is the target column. 

  • Splitted the data into training and testing data.

  • X_train.shape() is (2392,)

  • X_test.shape() is (798,)


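A minimal sketch using scikit-learn's train_test_split; a 75/25 split on these row counts matches the shapes reported above (the random_state is an assumption):

from sklearn.model_selection import train_test_split

X = df['tweet_text']
y = df['is_there_an_emotion_directed_at_a_brand_or_product']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
print(X_train.shape, X_test.shape)  # should be close to (2392,) and (798,)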


 

  6. Vectorizing the data

 

Convert the text data into vector form using CountVectorizer(), as sketched below.

 

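A minimal sketch using scikit-learn's CountVectorizer; the vocabulary is learned on the training data only:

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
X_train_vec = vectorizer.fit_transform(X_train)  # learn the vocabulary and vectorize
X_test_vec = vectorizer.transform(X_test)        # vectorize with the same vocabulary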


  7. Creating the classification models

 

Two different classification models, Logistic Regression and Multinomial Naive Bayes, were used to classify the sentiments as positive and negative. The prediction score of the logistic regression model came out to be 97%, whereas MultinomialNB gave a score of 85%.


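A minimal sketch of the two models; the hyperparameters used in the original are not shown, so scikit-learn defaults are assumed (max_iter is raised only to help convergence):

from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB

lr = LogisticRegression(max_iter=1000)
lr.fit(X_train_vec, y_train)
print('Logistic Regression score:', lr.score(X_test_vec, y_test))

nb = MultinomialNB()
nb.fit(X_train_vec, y_train)
print('MultinomialNB score:', nb.score(X_test_vec, y_test))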


You can read more about NLP and its different applications here in this blog.

 

 

Conclusion

 

In this blog, I have discussed the fundamental preprocessing steps that are required before building models in natural language processing. These include tokenization, lowercasing the text, stop word removal, stemming and lemmatization.

 

Lastly, a hands-on problem on sentiment analysis was discussed, where the aim was to classify tweets into positive and negative sentiments.
