Natural language processing (NLP) sits at the intersection of linguistics, computer science, and artificial intelligence. Its main aim is to bridge natural languages and computers, that is, to analyse and model high volumes of natural language data.
As you may already know, computers understand numbers far better than words, and a great deal of research and development is happening in the NLP domain every day.
Many applications in use today are powered by NLP. Amazon's Alexa and Apple's Siri, for example, can understand a spoken question and answer it, whether the request is to play music or fetch a weather update. Spam mail filtering is another example of NLP.
In the data science world, we can use NLP for text classification, sentiment analysis (classifying sentiments as positive or negative), text summarization, and many other classification tasks, among other applications.
Generally, whether the data is scraped or handed to us for analysis, it arrives in its natural human format of sentences and paragraphs. Before analysing it, we need to clean and transform that language so the computer can work with it in the desired format.
Data preprocessing is a fundamental step in building a machine learning model: if the data is properly pre-processed, the results will be reliable. In NLP, preprocessing the data is therefore the first step before building the model. Let's look at the different steps followed while preprocessing the data, which also help with dimensionality reduction.
In the vector space model, each distinct term is an axis: texts or documents are represented as vectors in a multi-dimensional space, so the number of distinct words determines the number of dimensions. Preprocessing steps such as stop words removal therefore also reduce dimensionality.
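To make this concrete, here is a small sketch (plain Python, with toy documents of my own) that counts the distinct words across a corpus, i.e. the number of dimensions the vector space would have:

```python
# Each distinct word across the corpus becomes one axis (dimension)
# in the vector space model.
docs = [
    "introduction to natural language processing",
    "natural language processing with python",
]

vocabulary = sorted({word for doc in docs for word in doc.split()})
print(vocabulary)        # one entry per axis
print(len(vocabulary))   # number of dimensions -> 7
```

Removing low-value words (such as "to") directly shrinks this vocabulary, which is why preprocessing doubles as dimensionality reduction.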
The Python library commonly used for preprocessing tasks in NLP is NLTK. You can install the nltk package using "pip install nltk".
Tokenization
Tokenization is the method by which sentences are split into individual words (tokens).
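A minimal sketch with NLTK (the sentence is my own example; TreebankWordTokenizer is used here because it works without extra data downloads, whereas nltk.word_tokenize needs the punkt model to be downloaded first):

```python
from nltk.tokenize import TreebankWordTokenizer

# Split a sentence into word tokens.
tokenizer = TreebankWordTokenizer()
tokens = tokenizer.tokenize("Introduction to Natural Language Processing")
print(tokens)  # ['Introduction', 'to', 'Natural', 'Language', 'Processing']
```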
Lowercasing
Next, the tokenized words are converted to lower case (NLU -> nlu). Words with the same meaning, such as nlp and NLP, would otherwise be treated as non-identical words in the vector space model.
Stop words removal
Stop words are the most frequently used words that carry no significance in distinguishing two documents (a, an, the, etc.), so they are removed. For example, from the sentence "Introduction to Natural Language Processing", the word "to" is removed.
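A sketch using a small hand-written stop-word set (NLTK ships a fuller English list in nltk.corpus.stopwords, which needs a one-time nltk.download("stopwords")):

```python
# Toy stop-word set for illustration; nltk.corpus.stopwords is more complete.
stop_words = {"a", "an", "the", "to", "is", "of"}

tokens = ["introduction", "to", "natural", "language", "processing"]
filtered = [t for t in tokens if t not in stop_words]
print(filtered)  # ['introduction', 'natural', 'language', 'processing']
```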
Stemming
Stemming is the process by which words are reduced to their base form, usually by chopping off suffixes.
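A sketch with NLTK's PorterStemmer (the word list is my own example):

```python
from nltk.stem import PorterStemmer

# Reduce each word to its (crude) stem,
# e.g. 'has' -> 'ha', 'running' -> 'run'.
stemmer = PorterStemmer()
words = ["has", "running", "processing", "languages"]
print([stemmer.stem(w) for w in words])
```

Note that stems such as "ha" are not necessarily valid words; that is the trade-off lemmatization addresses.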
Lemmatization
Unlike stemming, lemmatization reduces a word to its lemma, a valid word in the language. For example, lemmatization maps "is" to "be", whereas a stemmer crudely chops "has" to "ha".
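In NLTK the usual tool is WordNetLemmatizer, which requires a one-time nltk.download("wordnet"). To keep this sketch self-contained, a toy lemma table of my own stands in for WordNet:

```python
# Toy lemma lookup for illustration only; nltk.stem.WordNetLemmatizer with
# the wordnet corpus does this properly, with part-of-speech awareness.
lemmas = {"is": "be", "are": "be", "has": "have", "better": "good"}

def lemmatize(word):
    # Return the dictionary form if known, otherwise the word unchanged.
    return lemmas.get(word, word)

print([lemmatize(w) for w in ["is", "has", "language"]])
# ['be', 'have', 'language']
```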
To get deeper into pre-processing textual data for analysis, let us quickly dive into the code. The data consists of users' tweets, and the task is to classify these tweets into positive and negative sentiments. You can download the dataset from here. For this task, I will work in a Jupyter Notebook.
First, the necessary Python libraries, such as numpy and pandas, are imported. The dataset is then loaded in the standard way of importing CSV files, and a bit of EDA is done. To learn more about the EDA you can do to get familiar with the data, you can visit here. The shape of the data is (9093, 3).
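A sketch of the loading step; the column names follow the tweet dataset described above, but a tiny inline CSV of my own stands in for the real file (the actual file would be loaded with pd.read_csv on its path and has 9093 rows):

```python
import io

import pandas as pd

# Inline stand-in for the real CSV file; same three columns as the dataset.
csv_data = io.StringIO(
    "tweet_text,emotion_in_tweet_is_directed_at,"
    "is_there_an_emotion_directed_at_a_brand_or_product\n"
    "I love my iPhone,iPhone,Positive emotion\n"
    "Battery life is awful,iPad,Negative emotion\n"
)
df = pd.read_csv(csv_data)

print(df.shape)            # (2, 3) here; (9093, 3) on the real file
print(df.head())           # quick look at the first rows
print(df.isnull().sum())   # null counts per column (basic EDA)
```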
All rows containing null values are removed. There were 5802 null values in the emotion_in_tweet_is_directed_at column.
Rows with null values removed
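The null-handling step can be sketched as follows (toy frame of my own; on the real data this is what removes the rows behind the 5802 nulls):

```python
import pandas as pd

df = pd.DataFrame({
    "tweet_text": ["great phone", "meh", None],
    "emotion_in_tweet_is_directed_at": ["iPhone", None, "iPad"],
})

print(df.isnull().sum())  # null counts per column
df = df.dropna()          # drop every row containing a null
print(df.shape)           # (1, 2): only the fully populated row survives
```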
Preprocess the data
Convert all the text to lowercase using the str.lower() function.
Keep only numbers, letters, and the characters #+_ in the text using the re.sub() function.
Strip surrounding whitespace from the text using .strip().
View of the pre-processed data
The image above shows the cleaned and preprocessed data: the text is converted to lowercase, only the characters to be kept are retained, and surrounding whitespace is stripped.
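The three cleaning steps can be sketched as follows (the example tweet and exact regex are my own; the pattern keeps only lowercase letters, digits, #+_ and spaces, matching the selection described above):

```python
import re

def clean(text):
    text = text.lower()                        # 1. lowercase
    text = re.sub(r"[^a-z0-9#+_ ]", "", text)  # 2. keep letters, digits, #+_
    return text.strip()                        # 3. strip whitespace

print(clean("  Loving my new #iPhone!!! @work :) "))
# 'loving my new #iphone work'
```

On a DataFrame, the same function would be applied column-wise, e.g. df["tweet_text"].apply(clean).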
The target column is encoded so that positive sentiment becomes 1 and negative sentiment becomes 0.
Label encoding of the target
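A sketch of the encoding with a pandas map (the exact label strings "Positive emotion" and "Negative emotion" are assumed from the dataset's target column; adjust them to whatever the column actually contains):

```python
import pandas as pd

# Map the string labels to 1 (positive) and 0 (negative).
labels = pd.Series(["Positive emotion", "Negative emotion", "Positive emotion"])
encoded = labels.map({"Positive emotion": 1, "Negative emotion": 0})
print(encoded.tolist())  # [1, 0, 1]
```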
"tweet_text" is the independent attribute and "is_there_an_emotion_directed_at_a_brand_or_product" is the target column.
The data is split into training and testing sets.
X_train.shape is (2392,)
X_test.shape is (798,)
Independent feature and target
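The split can be sketched with scikit-learn (toy data of my own; the 2392/798 shapes above come from the real cleaned dataset). Note that shape is an attribute, not a method, so it is written without parentheses:

```python
from sklearn.model_selection import train_test_split

X = ["tweet %d" % i for i in range(10)]  # stand-in tweet texts
y = [i % 2 for i in range(10)]           # stand-in 0/1 sentiments

# Hold out 25% of the rows for testing; random_state makes it reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
print(len(X_train), len(X_test))  # 7 3
```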
The text data is converted into vector form using CountVectorizer().
Two different classification models, Logistic Regression and Multinomial Naive Bayes, are used to classify the sentiments as positive or negative. The logistic regression model's prediction score came out to be 97%, whereas Multinomial Naive Bayes scored 85%.
Classification of tweets using Logistic Regression and MultinomialNB
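A sketch of the modelling step on a toy corpus of my own (the 97% and 85% figures above come from the real train/test split; here both models are simply fit and scored on the same tiny vectorized data):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB

texts = ["love this phone", "great battery", "awful screen", "hate the lag"]
y = [1, 1, 0, 0]  # 1 = positive, 0 = negative

X = CountVectorizer().fit_transform(texts)  # bag-of-words features

# Fit each classifier and report its training accuracy.
for model in (LogisticRegression(), MultinomialNB()):
    model.fit(X, y)
    print(type(model).__name__, model.score(X, y))
```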
You can read more about NLP and different applications here in this blog.
In this blog, I have discussed the fundamental preprocessing steps required before building models in natural language processing: tokenization, lowercasing the text, stop words removal, stemming, and lemmatization.
Finally, a hands-on sentiment analysis problem was discussed, where the aim was to classify tweets into positive and negative sentiments.