Text mining is the process of extracting useful insights and information from unstructured data. The volume of data generated every day keeps growing, and today the ratio of unstructured to structured data is roughly 90:10. The main sources of textual data include email, surveys, social media and many other such platforms, and this data holds much of the information.
Text mining has various applications, including:
Automated processing of large volumes of emails and messages.
Classifying mail as spam or non-spam.
Inspecting insurance claims to flag dubious ones.
Summarization of judgments.
Identification of similar judgments.
Graph of increasing data by years; Source
Textual data cannot be used directly for predictive modelling, that is, it cannot be passed as-is to a machine learning model. The data first needs to be cleaned and converted into a form the computer can work with. To begin with, the text is split into individual words (tokens), a step called tokenization. The words are then encoded in integer form so that they can be fed as input to a machine learning algorithm, a step called vectorization. The Python library scikit-learn (sklearn) has tools for both tokenization and vectorization.
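As a minimal sketch of these two steps, the snippet below uses scikit-learn's CountVectorizer to split a sentence into tokens and then map each token to an integer id. The sample sentence and the variable names are made up for illustration, and the default analyzer (lowercasing, tokens of two or more word characters) is only one possible tokenization choice.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Illustrative sentence, not taken from any real dataset.
text = "Text mining extracts useful information from unstructured text."

vectorizer = CountVectorizer()

# Tokenization: build_analyzer() returns the callable CountVectorizer uses
# to lowercase the text and split it into word tokens.
analyzer = vectorizer.build_analyzer()
tokens = analyzer(text)
print(tokens)

# Integer encoding: fit() learns a vocabulary that maps each distinct
# token to an integer index.
vectorizer.fit([text])
token_ids = [vectorizer.vocabulary_[token] for token in tokens]
print(token_ids)
```

Both occurrences of "text" map to the same id, which is what later allows word occurrences to be counted per document.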
If you want to learn how to pre-process your textual data for predictive modelling, you can refer to this blog.
Information Extraction - The process of identifying entities and the relationships between them in unstructured text.
Categorization - The process of classifying documents into predefined categories, such as spam or non-spam.
Clustering - The process of grouping similar documents, for example finding similar queries in tech support databases so that their resolution can be automated.
Summarization - The process of finding the key points of a document, or what the document is about, and condensing them into a summary.
Relationships between word forms and their meanings.
Homonymy - Same form but unrelated meanings. Example - bank (river bank, financial institution)
Polysemy - Same form with related meanings. Example - bank (blood bank, financial institution)
Synonymy - Different forms with similar meanings. Example - (singer, vocalist)
Hyponymy - One word denotes a subclass of the other. Example - (breakfast, meal)
Zipf’s law: Word frequencies in text follow a power-law distribution.
A small number of words are used very often, and these are usually uninformative.
A large number of words occur with low frequency, and these are the useful words.
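The frequency pattern described above can be inspected with a simple word count. The sketch below uses a made-up toy sentence, which is far too small to show a real power law, but it illustrates how a few words dominate the counts while most words are rare.

```python
from collections import Counter

# Toy text for illustration only; a real corpus would be much larger.
text = (
    "the cat sat on the mat and the dog sat on the rug "
    "while the bird sang to the cat and the dog"
)

word_counts = Counter(text.split())

# Rank words from most to least frequent. Under Zipf's law, frequency
# falls off roughly in proportion to 1 / rank in a large corpus.
ranked = word_counts.most_common()
for rank, (word, freq) in enumerate(ranked, start=1):
    print(rank, word, freq)

# The long tail: words that occur exactly once.
rare_words = [word for word, freq in ranked if freq == 1]
print(rare_words)
```

Here "the" is by far the most frequent word and carries little meaning, while the words that occur once (such as "bird" or "rug") say more about the content.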
Many of the most frequently used words in English, such as articles and prepositions (a, the, to, etc.), are of little use in text analytics. These words are called stopwords, and there are around 500 of them. Stopword lists specific to the domain at hand can also be set up. Removing stopwords matters for the following reasons:
It reduces the size of the data file.
Stopwords account for 20-30% of the total word count.
They play no role in text mining or in the search process.
Efficiency increases once stopwords are removed.
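A minimal sketch of stopword removal is shown below. It assumes scikit-learn's built-in English list (ENGLISH_STOP_WORDS), which is smaller than the roughly 500 words mentioned above. The extra domain-specific words are hypothetical examples for an email corpus.

```python
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

text = "the quick brown fox jumped over the lazy dog"
tokens = text.split()

# Keep only the tokens that are not in the built-in stopword list.
filtered = [token for token in tokens if token not in ENGLISH_STOP_WORDS]
print(filtered)

# A domain-specific list can extend the general one.
# "regards" and "subject" are hypothetical additions for an email corpus.
domain_stop_words = ENGLISH_STOP_WORDS.union({"regards", "subject"})
print(len(ENGLISH_STOP_WORDS), len(domain_stop_words))
```

The same list can also be passed to a vectorizer, for example CountVectorizer(stop_words="english"), so that stopwords are dropped while the vocabulary is built.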
It is very important to convert textual data to numeric form so that the machine can work with it, because textual data cannot be passed directly to the machine. If we need to classify documents, then each document is an input and the class label is the output (target) for the predictive algorithm.
The documents need to be converted to fixed-length vectors of numbers, because the algorithm takes vectors of numbers as input. A simple and effective method for dealing with text documents is the “Bag-of-Words” (BoW) model. BoW discards all information about the order of the words and focuses only on the occurrences of words in the document.
This is achieved by assigning each word a unique number. Any document can then be encoded as a fixed-length vector whose length is the size of the vocabulary of known words. The value at each position in the vector is filled with the count or frequency of the corresponding word in the encoded document. This is the bag-of-words model, where the only concern is an encoding scheme that represents which words are present in the encoded document, and to what degree, without any information about their order.
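The encoding just described can be written out by hand in a few lines. This is a minimal sketch using two made-up documents and whitespace tokenization. The helper name encode is hypothetical.

```python
docs = ["the cat sat on the mat", "the dog sat"]

# Assign each distinct word a unique index (sorted so the result is stable).
all_words = sorted({word for doc in docs for word in doc.split()})
vocabulary = {word: index for index, word in enumerate(all_words)}

def encode(doc, vocabulary):
    """Encode a document as a fixed-length vector of word counts."""
    vector = [0] * len(vocabulary)
    for word in doc.split():
        if word in vocabulary:  # words outside the vocabulary are ignored
            vector[vocabulary[word]] += 1
    return vector

vectors = [encode(doc, vocabulary) for doc in docs]
print(vocabulary)  # {'cat': 0, 'dog': 1, 'mat': 2, 'on': 3, 'sat': 4, 'the': 5}
print(vectors)     # [[1, 0, 1, 1, 1, 2], [0, 1, 0, 0, 1, 1]]
```

Both vectors have the same length (the vocabulary size) even though the documents differ in length, and word order plays no part in the result.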
CountVectorizer is used to tokenize a collection of text documents and build a vocabulary of known words.
Create an instance of the CountVectorizer class.
Use the fit() function to learn a vocabulary from one or more documents.
Use the transform() function on one or more documents to encode each of them as a vector.
An encoded vector is returned whose length is that of the whole vocabulary, holding an integer count of the number of times each word appears in the document.
Calling transform() returns a sparse vector, which can be converted to a NumPy array by calling toarray().
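The steps above can be sketched as follows. The three documents are made up for illustration, and the vectorizer is left at its defaults (lowercasing, tokens of two or more word characters).

```python
from sklearn.feature_extraction.text import CountVectorizer

# Illustrative documents.
docs = [
    "The quick brown fox",
    "The lazy dog",
    "The quick dog jumped over the lazy fox",
]

# Step 1: create an instance of the CountVectorizer class.
vectorizer = CountVectorizer()

# Step 2: learn the vocabulary from the documents.
vectorizer.fit(docs)
print(vectorizer.vocabulary_)

# Step 3: encode the documents as count vectors (a sparse matrix).
encoded = vectorizer.transform(docs)
print(encoded.shape)  # (number of documents, vocabulary size)

# Step 4: convert the sparse result to a dense NumPy array for inspection.
dense = encoded.toarray()
print(dense)
```

Each row of dense is one document and each column is one vocabulary word. The word "the" appears twice in the third document, so its count there is 2.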
In the context of text mining, the document-term matrix (DTM) is the most commonly used form of representation.
Term - A single word, but it can also be a phrase.
Document - A unit of text that is to be retrieved.
Can be very large - a collection can hold more than 50K documents, or even billions.
Can be binary - entries record only whether a term is present (1) or absent (0).
The image below shows a matrix of 5 documents and 4 terms.
Document Term Matrix
In a DTM, the semantics of the text are disregarded. Variants of the same term should be made to look alike before computing the DTM. Words should also be reduced to their root form (stemming), and stopwords should be removed.
Distances between documents can be computed from the document-term matrix, whose elements could be either 0 or 1. They can be calculated using cosine similarity or Euclidean distance.
Cosine similarity is the cosine of the angle between two document vectors. It has proven to work well.
If the documents have nothing in common the cosine similarity is 0, and if they are identical it is 1.
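Both measures are short to write with NumPy. The sketch below uses three made-up binary document vectors. The function names are illustrative, and the cosine function assumes neither vector is all zeros.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (assumes non-zero vectors)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def euclidean_distance(a, b):
    """Straight-line distance between two vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.linalg.norm(a - b))

# Rows of a binary document-term matrix (illustrative values).
doc_a = [1, 1, 0, 0]
doc_b = [1, 1, 0, 0]  # same terms as doc_a
doc_c = [0, 0, 1, 1]  # no terms in common with doc_a

print(cosine_similarity(doc_a, doc_b))   # close to 1: identical documents
print(cosine_similarity(doc_a, doc_c))   # 0: nothing in common
print(euclidean_distance(doc_a, doc_c))  # 2.0
```

Unlike Euclidean distance, cosine similarity depends only on the direction of the vectors, so a long document and a short one with the same mix of terms still score as similar.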
Boolean method of BOW: Source
Example of Bag of words/ Boolean method; Source
Every row represents the occurrences of the respective term, that is, which documents the term is present in.
Every column represents the respective document, that is, which terms are present in it.
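A Boolean matrix laid out this way (terms as rows, documents as columns) can be sketched with CountVectorizer by setting binary=True and transposing the result. The documents below are made up, and this is only one possible way to build such a matrix.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Illustrative documents.
docs = ["data mining and text mining", "text analytics", "mining data"]

# binary=True records presence (1) or absence (0) instead of counts.
vectorizer = CountVectorizer(binary=True)
dtm = vectorizer.fit_transform(docs).toarray()  # rows = documents

# Transpose so that rows = terms and columns = documents.
incidence = dtm.T

# List the terms in column-index order of the learned vocabulary.
terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
for term, row in zip(terms, incidence):
    print(term, row)
```

Reading along the row for "mining" shows which documents contain it, and reading down a column shows which terms a given document contains. "mining" occurs twice in the first document but is still recorded as 1.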
Every document is represented as a vector. The weights of the terms are no longer restricted to 0 or 1.
Every term weight is calculated on the basis of term frequency. But term frequency alone can be misleading, because some terms occur very often in all documents across all classes.
Such terms are weighted down using the concept of weighting in TF-IDF.
“The quick brown fox jumped over the lazy dog’s back” - Document
= [ 1 1 1 1 1 1 1 1 2 ] - Vector in feature space
Vector Space Model; Image Source
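The vector for the example sentence can be reproduced with a short count. The sketch below assumes lowercasing, whitespace tokenization and alphabetical ordering of the terms. These choices are assumptions that happen to reproduce the vector shown, not necessarily how the original figure was produced.

```python
from collections import Counter

document = "The quick brown fox jumped over the lazy dog's back"

# Lowercase so that "The" and "the" count as the same term.
tokens = document.lower().split()
counts = Counter(tokens)

# One dimension per distinct term, in alphabetical order.
terms = sorted(counts)
vector = [counts[term] for term in terms]

print(terms)
print(vector)  # [1, 1, 1, 1, 1, 1, 1, 1, 2]
```

The only weight above 1 belongs to "the", which appears twice and sorts last alphabetically.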
Term Frequency (TF) - The number of times a term appears in a document.
Because documents differ in length, a word is likely to appear many more times in a long document than in a short one.
Therefore, the TF is frequently divided by the length of the document.
Inverse Document Frequency (IDF) - It measures the importance of a term.
Some words have little importance even though they appear many times in the documents.
The following weight scales down the frequent terms while scaling up the rare ones.
IDF(t) = log_e(Total number of documents / Number of documents with term t in it).
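The two quantities can be combined into a TF-IDF weight by multiplying them. The sketch below follows the definitions given here (TF divided by document length, IDF as the natural log of the document ratio) on three made-up documents. The function names are illustrative, and idf assumes the term occurs in at least one document.

```python
import math

# Illustrative, pre-tokenized documents.
docs = [
    "the cat sat on the mat".split(),
    "the dog sat".split(),
    "the bird sang".split(),
]

def tf(term, doc):
    """Term count divided by the length of the document."""
    return doc.count(term) / len(doc)

def idf(term, docs):
    """log_e(total documents / documents containing the term)."""
    n_containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / n_containing)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

print(tf_idf("the", docs[0], docs))  # 0.0: "the" occurs in every document
print(tf_idf("cat", docs[0], docs))  # positive: "cat" occurs in one document only
```

Even though "the" is the most frequent word in the first document, its weight drops to zero because it appears in all documents. Library implementations such as scikit-learn's TfidfVectorizer use a smoothed variant of this formula by default, so their numbers will differ from this sketch.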
I will conclude this blog by stating that unstructured data keeps growing, so to make text predictions or run text analytics it is very important to convert the data into a form that the machine can understand.
In this blog, I have discussed text mining, its applications, the techniques used for it, and stopwords in text analytics. I have also introduced the bag-of-words model, covering the Boolean model, the vector space model, and the calculation of distances.
Data Science enthusiast who is currently pursuing a Post Graduate Program in Machine Learning and Artificial Intelligence from Great Learning. He has experience in Data Analytics, Machine Learning, Neural Networks, Computer Vision, and Natural Language Processing, and has done various projects in the domain of analytics. His goal is to build use cases using the power of Artificial Intelligence and Machine Learning and to solve business problems.