Introduction to Text Analytics and Models in Natural Language Processing

Rohit Dwivedi
May 31, 2020
Updated on: Jul 05, 2021

Text Mining

Text mining, also called text analytics, is an AI technology that uses NLP techniques to transform unstructured text from documents and databases into normalized, structured data that is suitable for analysis or for driving machine learning algorithms.

In particular, text mining identifies facts, associations, and assertions. Once extracted, this information is analyzed further or presented directly via clustered HTML tables, mind maps, charts, etc.

Text mining is the process of extracting useful insights and information from unstructured data. The volume of data generated every day keeps growing, and today the ratio of unstructured to structured data is roughly 90:10. The main sources of textual data include email, surveys, social media, and many similar platforms, and this data holds much of the information.

Text mining has various applications, including:

Automated processing of large volumes of emails and messages.
Classifying mail as spam or non-spam.
Inspecting insurance claims to flag dubious ones.
Summarization of judgments.
Identification of similar judgments.

Figure: Graph of increasing data by year.

Textual data cannot be used directly for predictive modelling, which means it cannot be passed as-is to a machine learning model. The data first needs to be cleaned and converted into a form the computer can work with.

First, the text needs to be split into individual words, a step called tokenization. The words are then encoded as integers that can be fed as input to a machine learning algorithm, a step called vectorization. The Python library scikit-learn (sklearn) has tools for both tokenization and vectorization.
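
To make the two steps concrete, here is a minimal sketch in plain Python. The helper names tokenize and build_vocabulary, the regular expression, and the sample sentences are all assumptions made for illustration; they are not what sklearn does internally.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def build_vocabulary(documents):
    """Give each distinct token an integer id, in order of first appearance."""
    vocabulary = {}
    for doc in documents:
        for token in tokenize(doc):
            if token not in vocabulary:
                vocabulary[token] = len(vocabulary)
    return vocabulary

# Hypothetical sample documents
docs = ["The cat sat on the mat.", "The dog chased the cat."]

vocab = build_vocabulary(docs)
# Each document becomes a sequence of integer ids
encoded = [[vocab[token] for token in tokenize(doc)] for doc in docs]

print(vocab)
print(encoded)
```

Note that the same word always maps to the same integer, which is what lets a model treat repeated words consistently.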

Text Mining Techniques

Information Extraction - It is the process of identifying entities and the relationships between them in unstructured text.
Categorization - It is the process of classifying documents into predefined categories, such as spam or non-spam.
Clustering - It is the process of grouping similar documents, for example finding similar queries in tech support databases so that their resolution can be automated.
Summarization - It is the process of finding the key points of a document, or what the document is about, and summarizing them.

Text Mining Challenges

Relationships between word forms and their meanings:

Homonymy - Same form but unrelated meanings. Example - bank (river bank, financial institution)
Polysemy - Same form with related meanings. Example - bank (blood bank, financial institution)
Synonymy - Similar meaning but different forms. Example - (singer, vocalist)
Hyponymy - When one word denotes a subclass of another. Example - (breakfast, meal)

Zipf’s law: word frequencies in text follow a power-law distribution.

A small number of words are used very often, and these are usually of little use.
A large number of words occur with low frequency, and these are the useful words.
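
The shape of this distribution can be seen by counting words and ranking them by frequency. The sketch below uses a made-up toy text, so it only hints at the pattern; a real corpus would be needed to observe a proper power law.

```python
import re
from collections import Counter

# Hypothetical toy text; real corpora are far larger
text = (
    "the cat and the dog and the bird saw the fish "
    "the cat chased the bird and the dog slept"
)

counts = Counter(re.findall(r"[a-z']+", text.lower()))
ranked = counts.most_common()  # (word, frequency) pairs, most frequent first

for rank, (word, freq) in enumerate(ranked, start=1):
    print(rank, word, freq)

# Words that occur only once: the long, low-frequency tail
rare_words = [word for word, freq in counts.items() if freq == 1]
print(rare_words)
```

Even in this tiny sample, a single word ("the") accounts for 7 of the 20 tokens, while four of the nine distinct words occur only once.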

Stopwords in Text Analytics

Many of the most frequently used words in English, such as articles and prepositions (a, the, to, etc.), are of little use in text analytics. There are around 500 such words. Additional stopwords specific to a domain can also be defined.

Why remove stop words?

It reduces the size of the data file.
Stopwords make up 20-30% of the total word count.
They play little role in text mining or in the search process.
Efficiency increases once stopwords are removed.
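
Stopword removal can be sketched as a simple filter over the tokens. The stopword set below is a deliberately tiny, hypothetical list chosen for this example, not a standard one, and the function name remove_stopwords is likewise an assumption.

```python
import re

# Hypothetical, deliberately tiny stopword list; real lists hold hundreds of words
STOPWORDS = {"a", "an", "the", "to", "of", "in", "on", "is", "and"}

def remove_stopwords(text, stopwords=STOPWORDS):
    """Tokenize the text and keep only the tokens that are not stopwords."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [token for token in tokens if token not in stopwords]

sentence = "The claim is sent to the insurer in a letter"
total = len(re.findall(r"[a-z']+", sentence.lower()))
kept = remove_stopwords(sentence)

print(kept)
print(f"removed {total - len(kept)} of {total} tokens")
```

A domain-specific list could be handled the same way by passing a larger set as the stopwords argument.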

Bag of Words Model

Textual data cannot be passed to a machine directly, so it must first be converted to numeric form. In a document classification task, each document is an input and its class label is the output that the predictive algorithm must learn.

Because algorithms take vectors of numbers as input, documents must be converted to fixed-length numeric vectors. A simple and effective method for doing this with text documents is the bag-of-words model (BoW). BoW discards all information about the order of the words and focuses only on the occurrences of words in the document. This is achieved by assigning each word a unique number.

Any document can then be encoded as a fixed-length vector whose length is the size of the vocabulary of known words. The value at each position in the vector is the count or frequency of the corresponding word in the encoded document. This is the bag-of-words model: the encoding records only which words are present in a document and to what degree, without any information about their arrangement.
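
As a rough illustration of the encoding, the sketch below builds count vectors by hand. The function bag_of_words and the sample sentences are assumptions made for this example; the vocabulary is sorted alphabetically only so that the positions are easy to read.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def bag_of_words(documents):
    """Return (vocabulary, vectors) with one fixed-length count vector per document."""
    tokenized = [tokenize(doc) for doc in documents]
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # One position per vocabulary word, holding that word's count
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors

# Hypothetical sample documents
docs = ["the dog bit the man", "the man bit the dog", "the man slept"]
vocabulary, vectors = bag_of_words(docs)

print(vocabulary)
for vector in vectors:
    print(vector)
```

The first two documents mean very different things but receive identical vectors, which shows exactly what is lost when word order is discarded.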

Word Counts with CountVectorizer

CountVectorizer is used to tokenize a collection of text documents and build a vocabulary of known words.

Create an instance of the CountVectorizer class.
Use the fit() function to learn a vocabulary from one or more documents.
Use the transform() function on one or more documents to encode each of them as a vector.
An encoded vector is returned whose length is the size of the whole vocabulary and which holds an integer count of how often each word occurs in the document.
transform() returns sparse vectors, which can be converted back to NumPy arrays by calling toarray().
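
The steps above can be sketched as follows, assuming scikit-learn is installed. The sample sentences are made up for this example, and the default CountVectorizer settings are used (lowercasing, and tokens of two or more characters).

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical sample documents
docs = [
    "The quick brown fox",
    "The lazy dog",
    "The quick dog jumped over the lazy fox",
]

vectorizer = CountVectorizer()         # create the instance
vectorizer.fit(docs)                   # learn the vocabulary
encoded = vectorizer.transform(docs)   # encode each document as a sparse count vector

print(vectorizer.vocabulary_)          # mapping from each word to its column index
print(encoded.shape)                   # (number of documents, vocabulary size)
print(encoded.toarray())               # dense NumPy array of word counts
```

The two calls can also be combined into a single fit_transform(docs) call when the same documents are used for both steps.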

Document Term Matrix

In the context of text mining, the document term matrix is the most often used form of representation.

Term - A single word, but it can also be a phrase.
Document - A unit of text that is to be retrieved.
Can be very large - the number of documents can exceed 50K and may even run into the billions.
Can be binary - each entry records only whether a term is present or absent.

The image below shows a matrix of 5 documents and 4 terms.

Document Term Matrix

In a DTM, the semantics of the text are disregarded. For this reason, equivalent terms should be made to look alike before the DTM is computed: words should be reduced to their root form (stemming) and stopwords should be removed.

Distances in DT matrices

Distances between documents can be computed from their rows in the document term matrix, whose elements may be either 0 or 1. Common measures are cosine distance and Euclidean distance.

Cosine similarity is the cosine of the angle between two document vectors, and cosine distance is 1 minus that similarity. It has proven to work well.
If two documents have no terms in common, their cosine similarity is 0; if they contain the same terms in the same proportions, it is 1.
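
Both measures can be sketched in a few lines of plain Python. The function names and the binary sample vectors are assumptions made for illustration, and returning 0.0 for an all-zero vector is a convention chosen here, not a standard.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for documents with no terms
    return dot / (norm_u * norm_v)

def euclidean_distance(u, v):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Hypothetical rows of a binary document term matrix with 4 terms
doc_a = [1, 1, 0, 0]
doc_b = [1, 1, 0, 0]  # same terms as doc_a
doc_c = [0, 0, 1, 1]  # no terms in common with doc_a

print(cosine_similarity(doc_a, doc_b), euclidean_distance(doc_a, doc_b))
print(cosine_similarity(doc_a, doc_c), euclidean_distance(doc_a, doc_c))
```

Cosine similarity depends only on the direction of the vectors, so a long document and a short one with the same mix of terms still score as similar, whereas Euclidean distance grows with the difference in counts.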

Bag of words / Boolean Method

Example of Bag of words/ Boolean method

Every row represents one term and shows which documents that term is present in.
Every column represents one document and shows which terms are present in that document.
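
A small Boolean matrix of this kind can be built as follows. The three documents and their ids (d1 to d3) are hypothetical, and splitting on whitespace stands in for a real tokenizer.

```python
# Hypothetical documents keyed by id
docs = {
    "d1": "spam offer money",
    "d2": "meeting money report",
    "d3": "spam spam offer",
}

doc_ids = list(docs)
terms = sorted({term for text in docs.values() for term in text.split()})

# Rows are terms, columns are documents; 1 means the term occurs in the document
matrix = [
    [1 if term in docs[doc_id].split() else 0 for doc_id in doc_ids]
    for term in terms
]

for term, row in zip(terms, matrix):
    print(f"{term:8s}", row)

# Reading a row: which documents contain "spam"?
spam_row = matrix[terms.index("spam")]
print([doc_id for doc_id, flag in zip(doc_ids, spam_row) if flag])

# Reading a column: which terms does d2 contain?
d2_column = [row[doc_ids.index("d2")] for row in matrix]
print([term for term, flag in zip(terms, d2_column) if flag])
```

Because the entries are Boolean, the repeated "spam" in d3 still counts as a single 1.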

Bag of words / Vector Space model

Every document is represented as a vector. The term weights are no longer restricted to 0 or 1.
Every term weight is calculated on the basis of term frequency. But raw term frequency can be misleading, because some terms occur very often in all documents across all classes.
TF-IDF weighting addresses this by reducing the weight of terms that occur in most documents.

“The quick brown fox jumped over the lazy dog’s back” - Document

= [ 1 1 1 1 1 1 1 1 2 ] - Vector in feature space

Vector Space Model

Term Frequency (TF) - It is the number of times a term occurs in a document.

A word is likely to appear many more times in a long document than in a short one, because documents differ in length.
Therefore, the TF is often divided by the length of the document (the total number of terms in it).

Inverse Document Frequency (IDF) - It measures the importance of a term.

Some words have little importance even though they appear many times across the documents.
The following weight scales down frequent terms while scaling up rare ones:
- IDF(t) = log_e(Total number of documents / Number of documents with term t in it).
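
The two quantities can be combined into a TF-IDF weight by multiplying them. The sketch below follows the definitions given here (length-normalized TF and natural-log IDF); the function names and the three sample documents are assumptions, and returning 0.0 for a term that appears in no document is a convention chosen for this example. Library implementations often use smoothed variants of the formula, so their numbers may differ.

```python
import math
from collections import Counter

# Hypothetical, already tokenized documents
docs = [
    "the cat sat".split(),
    "the dog sat".split(),
    "the dog barked loudly".split(),
]

def term_frequency(term, doc):
    """Occurrences of the term divided by the length of the document."""
    return Counter(doc)[term] / len(doc)

def inverse_document_frequency(term, docs):
    """log_e(total number of documents / number of documents containing the term)."""
    containing = sum(1 for doc in docs if term in doc)
    if containing == 0:
        return 0.0  # convention chosen here for terms that appear nowhere
    return math.log(len(docs) / containing)

def tf_idf(term, doc, docs):
    """TF-IDF weight of a term in one document of the collection."""
    return term_frequency(term, doc) * inverse_document_frequency(term, docs)

for term in ["the", "sat", "cat"]:
    print(term, round(tf_idf(term, docs[0], docs), 4))
```

In this toy collection "the" appears in every document, so its IDF is log_e(3/3) = 0 and its weight vanishes, while "cat" appears in only one document and receives the largest weight.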

Conclusion

To conclude: the volume of unstructured data keeps increasing, so to make predictions from text or perform text analytics, it is essential to convert that data into a form a machine can understand.

In this blog, I have discussed text mining, its applications and techniques, and stopwords in text analytics. I have also introduced the bag-of-words model, covering the Boolean model, the vector space model, and the calculation of distances between documents.