Natural language processing (NLP) is a field that deals with human languages. The aim is to make machines learn human languages so that they can perform language tasks themselves, saving time and giving better results.
Recent advances in technology have brought this goal largely within reach.
A beginner who wants to learn natural language processing should start from the basics and progress steadily through the field. Several Python packages help ease the journey; one of the most important, built specifically for the basic tasks of natural language processing, is NLTK.
Let's learn about it in detail.
NLTK Python Tutorial
Whether you are a beginner or doing research in NLP, NLTK is a Python package that handles most common NLP tasks with ease. NLTK stands for Natural Language Toolkit; it supports research work in NLP, cognitive science, artificial intelligence, machine learning, and more.
This NLTK tutorial will help you implement various NLP techniques such as word tokenization, stemming, lemmatization, removing stop words and punctuation, n-grams, POS tagging, and more.
Word tokenization using NLTK
Word tokenization breaks a paragraph into individual string tokens. For example, if the sentence is ‘my name is tanesh balodi’, the tokens are [‘my’, ‘name’, ‘is’, ‘tanesh’, ‘balodi’]. Let’s implement word tokenization using NLTK.
from nltk.tokenize import word_tokenize
word_tokenize("My Email address is: email@example.com")
['My', 'Email', 'address', 'is', ':', 'email', '@', 'example.com']
The output of the above code contains word tokens as well as special characters such as ‘:’ and ‘@’. In the same way, we can also perform sentence tokenization.
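As a minimal sketch, the snippet below tokenizes text with NLTK's TreebankWordTokenizer class, which applies rule-based Treebank conventions and, unlike word_tokenize, does not need the Punkt sentence tokenizer data to be downloaded first. The second sample sentence is an assumption chosen for illustration.

```python
from nltk.tokenize import TreebankWordTokenizer

tokenizer = TreebankWordTokenizer()

# Plain words separated by spaces become one token each
name_tokens = tokenizer.tokenize("My name is tanesh balodi")
print(name_tokens)  # ['My', 'name', 'is', 'tanesh', 'balodi']

# Contractions and the sentence-final period become separate tokens
contraction_tokens = tokenizer.tokenize("I don't know.")
print(contraction_tokens)  # ['I', 'do', "n't", 'know', '.']
```

Splitting "don't" into "do" and "n't" follows the Penn Treebank convention, which is useful to know when counting tokens or matching them against a stop-word list.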
Sentence tokenization using NLTK
Sentence tokenization is an NLP technique that treats each sentence in a paragraph as a token. Let’s see how we can implement it using NLTK.
from nltk.tokenize import sent_tokenize
sent_tokenize("My name is tanesh balodi. I am a machine learning engineer.")
['My name is tanesh balodi.', 'I am a machine learning engineer.']
The paragraph above contains two sentences, and the tokenizer returned each of them as a separate token.
Loading a JSON file and tokenizing it using NLTK
import json

# Read the JSON file and parse it into a Python object
with open('quotes.json', 'r') as f:
    dataset = json.load(f)
To load and read a JSON file, first import the json package, then open the file as shown above and assign the parsed contents to a variable.
But how do we tokenize the text in this file? One option is Python’s built-in split function. Below, we take the first quote of the dataset and apply split to it to test the results.
quote = dataset[0]['quote']  # first quote; assumes the file holds a list of {'quote': ...} records
quote.split()
The split function gives us word tokens. Can we also apply word_tokenize and sent_tokenize to the whole dataset? Yes: import both functions as shown below and apply them to each quote.
from nltk.tokenize import word_tokenize, sent_tokenize
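The following is a minimal sketch of applying a tokenizer to every quote in a dataset. Because the contents of quotes.json are not shown here, the inline JSON string, the "quote" field layout, and the helper name tokenize_quotes are assumptions made for illustration. The demo passes str.split so that it runs without any NLTK data; word_tokenize or sent_tokenize could be passed in the same way once the tokenizer data has been downloaded.

```python
import json

# Hypothetical stand-in for the contents of quotes.json
raw = '[{"quote": "Practice makes progress. Keep going."}, {"quote": "Small steps add up."}]'
dataset = json.loads(raw)


def tokenize_quotes(records, tokenizer, field="quote"):
    """Apply `tokenizer` to the text stored under `field` in every record."""
    return [tokenizer(record[field]) for record in records]


# str.split breaks on whitespace only, so punctuation stays attached to words
split_tokens = tokenize_quotes(dataset, str.split)
print(split_tokens)
# [['Practice', 'makes', 'progress.', 'Keep', 'going.'], ['Small', 'steps', 'add', 'up.']]
```

Keeping the tokenizer as a parameter makes it easy to compare whitespace splitting with NLTK's tokenizers on the same data.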
We have now learned how to tokenize words and sentences using the Natural Language Toolkit (NLTK). Next, we will remove stopwords and punctuation from the content.
Removing Stop Words and Punctuation Using NLTK
Stopwords and punctuation generally contribute little to information retrieval and learning, so removing them not only reduces the number of tokens but also speeds up retrieval and learning.
To remove stopwords with NLTK, we first download the stop-word lists using nltk.download('stopwords'). We then specify the language whose stopwords we want to remove, here with stopwords.words('english'), and save the result in a variable. Finally, we import punctuation from the string library.
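from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from string import punctuation

# English stop-word list; requires nltk.download('stopwords') once
stop_words = stopwords.words('english')
punctuation = list(punctuation)
# Second quote; assumes the dataset is a list of {'quote': ...} records
tokens = word_tokenize(dataset[1]['quote'])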
A total of 50 tokens are produced from the second quote of the dataset. Removing all stopwords and punctuation reduces this number significantly; let’s see how.
cleaned_tokens = [token for token in tokens if token not in stop_words
and token not in punctuation]
After removing stopwords and punctuation, the number of tokens drops from 50 to just 24.
NLTK also provides a tokenizer that can be customized to your needs: the regular-expression tokenizer, RegexpTokenizer.
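One detail worth noting is that NLTK's English stop-word list is lowercase, so a capitalized token such as "The" passes through a case-sensitive filter. The sketch below lowercases each token before the comparison and uses sets for faster lookups. The helper name, the tiny stop-word list, and the sample tokens are assumptions for illustration; in practice the list would come from stopwords.words('english').

```python
from string import punctuation


def remove_stopwords_and_punctuation(tokens, stop_words):
    """Drop stop words (case-insensitively) and single punctuation characters."""
    stop_set = {word.lower() for word in stop_words}
    punct_set = set(punctuation)
    return [
        token for token in tokens
        if token.lower() not in stop_set and token not in punct_set
    ]


# Hypothetical mini stop-word list and pre-tokenized sentence
demo_stop_words = ["the", "is", "a", "of"]
tokens = ["The", "purpose", "of", "life", "is", "a", "life", "of", "purpose", "."]

cleaned_tokens = remove_stopwords_and_punctuation(tokens, demo_stop_words)
print(len(tokens), "->", len(cleaned_tokens))  # 10 -> 4
print(cleaned_tokens)  # ['purpose', 'life', 'life', 'purpose']
```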
from nltk.tokenize import RegexpTokenizer
# Keep only the tokens that look like email addresses
regex = RegexpTokenizer(r'\w+@[A-Za-z]+\.[A-Za-z]+')
regex.tokenize("My email addresses are: firstname.lastname@example.org email@example.com")
RegexpTokenizer is a customizable tokenizer that uses a regular expression to decide what counts as a token in the sentence.
Let’s move on to stemming and lemmatization now.
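As a hedged illustration of how the pattern controls the result, the sketch below compares the email pattern used above with a slightly broader variant, and also shows a word-only pattern that discards punctuation. Note that \w does not match a dot, so the original pattern captures only the part of an address after the last dot in the local name. The variable names and the broader pattern are assumptions for illustration, not a complete email matcher.

```python
from nltk.tokenize import RegexpTokenizer

text = "My email addresses are: firstname.lastname@example.org email@example.com"

# Pattern from the article: \w+ stops at the dot in "firstname.lastname"
original = RegexpTokenizer(r'\w+@[A-Za-z]+\.[A-Za-z]+')
original_tokens = original.tokenize(text)
print(original_tokens)  # ['lastname@example.org', 'email@example.com']

# Allowing dots before the @ keeps the full local part
with_dots = RegexpTokenizer(r'[\w.]+@[A-Za-z]+\.[A-Za-z]+')
full_tokens = with_dots.tokenize(text)
print(full_tokens)  # ['firstname.lastname@example.org', 'email@example.com']

# A word-only pattern tokenizes and drops punctuation in one step
words_only = RegexpTokenizer(r'\w+')
word_tokens = words_only.tokenize("Hello, world!")
print(word_tokens)  # ['Hello', 'world']
```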
Stemming and lemmatization using NLTK
Stemming reduces a word to its stem by stripping suffixes. For example, given the word ‘lately’, the stemmer cuts ‘ly’ and returns ‘late’. This groups related word forms together for information retrieval and reduces the size of the vocabulary.
Lemmatization, in contrast, maps a word to its dictionary form (its lemma) using a vocabulary and the word’s part of speech rather than simply cutting off endings. For example, the adjective ‘better’ becomes ‘good’.
Let’s see how we can perform stemming and lemmatization:
from nltk.stem import PorterStemmer, SnowballStemmer, WordNetLemmatizer
ps = PorterStemmer()
ss = SnowballStemmer('english')
lemmatizer = WordNetLemmatizer()
The code above imports PorterStemmer and SnowballStemmer for stemming and WordNetLemmatizer for lemmatization from NLTK, then creates one instance of each: ps, ss, and lemmatizer.
Calling ps.stem or ss.stem on a word returns its stem, and lemmatizer.lemmatize returns its lemma, in line with the definitions of stemming and lemmatization given above.
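A minimal sketch of calling these objects is shown below. The word list and the helper lemmatize_or_none are assumptions for illustration. The lemmatizer needs the WordNet corpus (installed with nltk.download('wordnet')), so the helper returns None when that data is missing instead of failing.

```python
from nltk.stem import PorterStemmer, SnowballStemmer, WordNetLemmatizer

ps = PorterStemmer()
ss = SnowballStemmer('english')
lemmatizer = WordNetLemmatizer()

words = ["running", "connected", "connection"]

# Both stemmers strip suffixes with hand-written rules; no corpus is needed
porter_stems = [ps.stem(word) for word in words]
snowball_stems = [ss.stem(word) for word in words]
print(porter_stems)    # ['run', 'connect', 'connect']
print(snowball_stems)  # ['run', 'connect', 'connect']


def lemmatize_or_none(word, pos):
    """Return the WordNet lemma, or None if the WordNet corpus is not installed."""
    try:
        return lemmatizer.lemmatize(word, pos=pos)
    except LookupError:
        return None


# pos='a' tells the lemmatizer to treat the word as an adjective
print(lemmatize_or_none("better", "a"))
```

Passing the part of speech matters: lemmatize treats a word as a noun by default, so an adjective such as ‘better’ is only mapped to its lemma when pos is set accordingly.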
The Natural Language Toolkit is an all-in-one NLP package that covers most basic NLP operations and is easy to use.
The operations we performed using NLTK pave the way for beginners to understand and implement further NLP techniques such as bag of words and word embeddings.