Natural language processing, also known as NLP, sits at the intersection of linguistics, computer science, and artificial intelligence. Its main concern is the interaction between natural languages and computers, that is, how to analyse and model large volumes of natural language data.
As you may already know, computers handle numbers far better than they handle words. Research and development in the NLP domain is ongoing every day.
A huge number of applications in use today rely on NLP. For example, Alexa and Apple’s Siri can both understand a spoken question and respond to it, whether the request is to play music or to give a weather update. Spam mail filtering is another example of NLP.
In the data science world, we can use NLP for text classification, sentiment analysis (classifying sentiments as positive or negative), text summarization, and other classification models, among other applications.
Generally, whether the data is scraped or supplied for analysis, it arrives in its natural human format of sentences, paragraphs, and so on. Before analysing it, we need to clean and transform the text into a format the computer can work with.
Data preprocessing is a fundamental step in building a machine learning model. If the data is well pre-processed, the results are more reliable. In NLP, pre-processing the data is the first step before building the machine learning model. Let’s look at the steps followed while preprocessing the data, which also help with dimensionality reduction.
Tokenization
Lower casing
Stop words removal
Stemming
Lemmatization
In the vector space model, each term is an axis. Texts or documents are represented as vectors in a multi-dimensional space, and the number of distinct words gives the number of dimensions. This is why steps that merge or drop words also reduce dimensionality.
The Python library used for these preprocessing tasks is NLTK. You can install it using “pip install nltk”.
Tokenization is a method in which sentences are split into words.
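To make the vector space idea concrete, here is a minimal sketch using only the Python standard library. The two documents and the helper names build_vocabulary and to_vector are made up for illustration and are not part of NLTK.

```python
from collections import Counter


def build_vocabulary(documents):
    """Return a sorted list of the distinct terms across all documents."""
    return sorted({term for doc in documents for term in doc})


def to_vector(document, vocabulary):
    """Count how often each vocabulary term occurs in one document."""
    counts = Counter(document)
    return [counts[term] for term in vocabulary]


# Two tiny, already-tokenized documents (invented for this example).
docs = [
    ["nlp", "is", "fun"],
    ["nlp", "is", "nlp"],
]

vocabulary = build_vocabulary(docs)  # one axis per distinct term
vectors = [to_vector(doc, vocabulary) for doc in docs]

print(vocabulary)  # ['fun', 'is', 'nlp']
print(vectors)     # [[1, 1, 1], [0, 1, 2]]
```

Three distinct words give three dimensions, so every preprocessing step that shrinks the vocabulary also shrinks the vectors.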
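import nltk
from nltk.tokenize import word_tokenize
# word_tokenize needs the Punkt tokenizer data (newer NLTK releases may ask for 'punkt_tab' instead)
nltk.download('punkt')
# Split the sentence into word and punctuation tokens
token = word_tokenize("My Email address is: taneshbalodi8@gmail.com")
token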
Tokenization
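To see why tokenization is more than splitting on spaces, here is a rough sketch with a regular expression. This is not the algorithm NLTK uses; the pattern and the function name simple_tokenize are assumptions made for illustration.

```python
import re

# Match either a run of word characters or a single non-space symbol.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def simple_tokenize(text):
    """Split text into word tokens and separate punctuation tokens."""
    return TOKEN_PATTERN.findall(text)


sentence = "NLP is fun, right?"

# Plain whitespace splitting leaves punctuation glued to the words.
print(sentence.split())           # ['NLP', 'is', 'fun,', 'right?']

# The regex version separates the punctuation into its own tokens.
print(simple_tokenize(sentence))  # ['NLP', 'is', 'fun', ',', 'right', '?']
```

A real tokenizer handles many more cases (contractions, abbreviations, URLs), which is the reason to prefer a library for actual work.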
Lowercasing converts the tokenized words into lower case format (NLU -> nlu). If words with the same meaning, like nlp and NLP, are not converted to lowercase, they count as two different words in the vector space model.
# Lowercase every token produced by the tokenization step
Lowercase = []
for lowercase in token:
    Lowercase.append(lowercase.lower())
Lowercase
Lowercasing
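The effect on the vocabulary can be checked with a short sketch. The token list here is invented for the example.

```python
# Tokens that differ only by letter case (made up for illustration).
tokens = ["NLP", "nlp", "Nlp", "is", "Is", "fun"]

# Without lowercasing, every spelling variant is its own dimension.
vocab_before = set(tokens)

# After lowercasing, the variants collapse into a single term each.
vocab_after = {t.lower() for t in tokens}

print(len(vocab_before))    # 6
print(len(vocab_after))     # 3
print(sorted(vocab_after))  # ['fun', 'is', 'nlp']
```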
Stop words are the most frequently used words (a, an, the, etc.). They carry little significance when distinguishing two documents, so they are removed. For example, from the sentence “Introduction to Natural Language Processing” the word “to” is removed.
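import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from string import punctuation
nltk.download('stopwords')  # the stop word lists are a separate NLTK download
stop_words = stopwords.words('english')
punct = list(punctuation)
# `dataset` is not defined in this post; it is assumed to be a collection of records with a 'quote' field
print(dataset[1]['quote'])
tokens = word_tokenize(dataset[1]['quote'])
len(tokens)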
Without removing Stopwords
Without removing stop words we get 50 tokens. Now we shall remove the stop words.
# Keep only tokens that are neither stop words nor punctuation marks
cleaned_tokens = [token for token in tokens
                  if token not in stop_words and token not in punct]
len(cleaned_tokens)
Stopwords removal
After removing the stop words and punctuation, 24 tokens remain.
Stemming is the process in which words are reduced to their base form. The code below reduces a few words to their stems.
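One detail worth keeping in mind: NLTK's English stop word list is all lowercase, so a capitalized token such as "The" will not match unless it is compared in lowercase. Below is a small self-contained sketch of that idea. The short STOP_WORDS set is a hand-picked stand-in for the much longer NLTK list, and remove_stop_words is a hypothetical helper name.

```python
from string import punctuation

# Tiny hand-picked stop list for illustration only.
STOP_WORDS = {"a", "an", "the", "to", "is", "of"}
PUNCT = set(punctuation)


def remove_stop_words(tokens):
    """Drop stop words (compared in lowercase) and single punctuation marks."""
    return [t for t in tokens
            if t.lower() not in STOP_WORDS and t not in PUNCT]


tokens = ["The", "Introduction", "to", "Natural", "Language", "Processing", "."]
print(remove_stop_words(tokens))
# ['Introduction', 'Natural', 'Language', 'Processing']
```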
from nltk.stem import PorterStemmer
ps = PorterStemmer()
print(ps.stem('jumping'))
print(ps.stem('lately'))
print(ps.stem('assess'))
print(ps.stem('ran'))
Stemming
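To give a feel for what a stemmer does internally, here is a deliberately naive suffix-stripping sketch. It is not the Porter algorithm, which applies a much richer set of rules; the suffix list and the name toy_stem are assumptions chosen for this example.

```python
# Suffixes tried in order; a toy list, far smaller than any real stemmer's rules.
SUFFIXES = ("ing", "ly", "ed", "s")


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


for word in ["jumping", "lately", "walked", "ran"]:
    print(word, "->", toy_stem(word))
# jumping -> jump
# lately -> late
# walked -> walk
# ran -> ran
```

Rule-based stripping cannot relate an irregular form like "ran" to "run", which is the gap lemmatization fills.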
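Unlike stemming, lemmatization reduces a word to a valid word in the language (its lemma). For example, in the image below the words "has" and "is" are changed to "ha" and "be" respectively. The odd result "ha" is what the lemmatizer returns when "has" is treated as a noun; passing the verb part of speech gives "have".
import nltk
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')  # the lemmatizer looks words up in the WordNet corpus
lemmatizer = WordNetLemmatizer()
# The second argument is the part of speech: 'v' = verb, 'a' = adjective
print(lemmatizer.lemmatize('ran', 'v'))
print(lemmatizer.lemmatize('better', 'a'))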
Lemmatization
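The role of the part-of-speech argument can be illustrated with a toy dictionary lookup. The LEMMAS table and the function toy_lemmatize are hypothetical and only stand in for the WordNet lookup that NLTK performs.

```python
# Hypothetical mini-lexicon mapping (word, part of speech) to its lemma.
LEMMAS = {
    ("ran", "v"): "run",
    ("is", "v"): "be",
    ("has", "v"): "have",
    ("better", "a"): "good",
    ("meeting", "v"): "meet",
    ("meeting", "n"): "meeting",
}


def toy_lemmatize(word, pos):
    """Look up the lemma for (word, pos); fall back to the lowercased word."""
    key = (word.lower(), pos)
    return LEMMAS.get(key, word.lower())


print(toy_lemmatize("ran", "v"))      # run
print(toy_lemmatize("better", "a"))   # good
print(toy_lemmatize("meeting", "v"))  # meet
print(toy_lemmatize("meeting", "n"))  # meeting
```

The same surface form can map to different lemmas depending on its part of speech, which is why supplying the tag matters.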
In this blog, I have discussed the fundamental preprocessing steps required before building models in natural language processing: tokenization, lowercasing, stop word removal, stemming, and lemmatization. The range of applications for natural language processing is wide, and much research is ongoing in this area.