
What is Latent Dirichlet Allocation (LDA) in NLP?

  • Bhumika Dutta
  • Jan 24, 2022

Natural Language Processing (NLP) has gained huge popularity in recent years due to its wide range of applications across different sectors. Topic modelling is one such unsupervised machine learning technique in NLP, used by many organizations to discover the themes running through a collection of documents by identifying the sets of keywords that characterize them. 

 

With this technique, one can establish semantic relationships between words, find documents or books from a text summary, enhance customer service, and much more. 

 

Latent Dirichlet Allocation or LDA is an algorithm that is used in topic modelling. In this blog, we are going to learn everything about LDA.


 

About Latent Dirichlet Allocation (LDA):

 

If we look at the name of this algorithm, we can see that the word 'Latent' indicates that the model finds hidden topics in documents, and the word 'Dirichlet' indicates that LDA assumes that the distribution of topics in a document and the distribution of words in topics are both Dirichlet distributions.

 

To extract themes from a corpus, Latent Dirichlet Allocation (LDA) is a popular topic modelling approach. 

 

The Dirichlet distribution is a distribution over distributions, which means that each draw from it is a distribution in and of itself. In other words, it is a probability distribution whose range is a collection of probability distributions.
 

 

Understanding the Algorithm:

 

As explained by Great Learning, LDA assumes that each document is created via a statistical generative process. That is, each document is a mixture of topics, and each topic is a mixture of words. 

 

There are three hyperparameters in LDA:

 

  1. Document-topic density factor (‘α’)

  2. Topic-word density factor (‘β’)

  3. The number of topics to be considered (K).

 

The ‘α’ hyperparameter determines how many topics should be included in a document. A low value of ‘α’ indicates that the documents should have fewer topics in the mix, while a greater value indicates that the documents should have more topics in the mix.

 

The ‘β’ hyperparameter determines how words are distributed within each topic. Topics with lower values of ‘β’ will generally be made up of fewer words, whereas topics with higher values will likely contain more words. 

 

The 'K' hyperparameter defines the expected number of topics in the document corpus. K is usually assigned a value based on domain expertise. Another option is to train several LDA models with different values of K and then compare their 'Coherence Scores', as sketched below.
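As a rough illustration of this tuning loop, here is a minimal sketch (assuming the gensim library; the tiny tokenized corpus is purely illustrative) that trains LDA for a few values of K and compares their coherence scores:

```python
# Minimal sketch: pick K by comparing coherence scores (gensim assumed).
from gensim.corpora import Dictionary
from gensim.models import LdaModel, CoherenceModel

# Toy tokenized corpus; in practice this comes from your preprocessing step.
texts = [
    ["cat", "dog", "pet", "animal"],
    ["dog", "puppy", "training", "pet"],
    ["python", "code", "programming", "software"],
    ["software", "bug", "code", "testing"],
]

dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(doc) for doc in texts]

for k in (2, 3, 4):
    # alpha and eta correspond to the document-topic and topic-word densities.
    lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=k,
                   alpha="auto", eta="auto", passes=10, random_state=0)
    coherence = CoherenceModel(model=lda, texts=texts, dictionary=dictionary,
                               coherence="c_v").get_coherence()
    print(k, coherence)
```

The value of K with the highest coherence score is usually a sensible starting point.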

 

(Related reading: Best Data Mining Techniques)

 

Working process of the algorithm:

 

Two fundamental assumptions are made by LDA:

 

Documents are made up of a mixture of topics, while topics are made up of a mixture of tokens (or words),

 

and the words in these topics are produced according to probability distributions. In statistical terms, documents are described as probability distributions over topics, and topics as probability distributions over words.

 

To begin, LDA applies these two key assumptions to the corpus at hand. Any corpus, which is a collection of documents, may be represented as a document-word matrix (or document-term matrix), commonly known as a DTM. The first step in working with text data is to clean, preprocess, and tokenize it. After preprocessing the texts, we get the following document-word matrix.


5×8 document-word matrix (source)


The five documents are D1, D2, D3, D4, and D5, and the words are represented by W1 to W8, so there are eight distinct words. The corpus is therefore represented by the preprocessed document-word matrix above, in which each row represents a document and each column represents a token (word).
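As a rough sketch of this step (assuming the scikit-learn library; the sample sentences are purely illustrative), a document-word matrix can be built like this:

```python
# Minimal sketch: build a document-word matrix (DTM) with scikit-learn.
from sklearn.feature_extraction.text import CountVectorizer

documents = [
    "the cat sat on the mat",
    "dogs and cats are pets",
    "python code and software bugs",
    "testing software code",
    "pets like dogs need training",
]

vectorizer = CountVectorizer(stop_words="english")
dtm = vectorizer.fit_transform(documents)        # rows = documents, columns = words
print(dtm.shape)                                 # (5, number_of_distinct_words)
print(vectorizer.get_feature_names_out())        # the token behind each column
```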

 

LDA converts this document-word matrix into two additional matrices: a document-topic matrix and a topic-word matrix, as sketched below.
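A hedged sketch of this factorization (again using scikit-learn; the tiny corpus is illustrative) looks like this:

```python
# Minimal sketch: LDA factorizes the DTM into a document-topic matrix
# and a topic-word matrix (scikit-learn assumed).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

documents = ["cats and dogs are pets", "dogs need training",
             "python code and software", "software testing and bugs"]
dtm = CountVectorizer(stop_words="english").fit_transform(documents)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topic = lda.fit_transform(dtm)   # shape: (number_of_documents, K)
topic_word = lda.components_         # shape: (K, number_of_words)
print(doc_topic.shape, topic_word.shape)
```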

 

(Must catch: Textless NLP: Definition & Benefits)


 

Dirichlet Distribution:

 

Dirichlet's distribution is a probability density for a vector-valued input with the same properties as theta (θ), our multinomial parameter. It takes non-zero values only where the entries of the input are non-negative and sum to one (each θ_i ≥ 0 and the θ_i sum to 1).

 

The Dirichlet distribution is parameterized by the vector α, which has the same number of elements (K) as our multinomial parameter θ.

 

Since the Dirichlet distribution is parameterized by α, we may interpret p(θ|α) as answering the question: "what is the probability density associated with the multinomial distribution θ, given that our Dirichlet distribution has parameter α?"
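As a small numerical illustration (assuming SciPy; the α and θ values are arbitrary), the Dirichlet density and its samples can be inspected directly:

```python
# Minimal sketch: evaluate p(theta | alpha) and draw samples (SciPy assumed).
import numpy as np
from scipy.stats import dirichlet

alpha = np.array([2.0, 3.0, 4.0])   # one concentration value per topic (K = 3)
theta = np.array([0.2, 0.3, 0.5])   # a point on the simplex: non-negative, sums to 1

print(dirichlet.pdf(theta, alpha))   # density of this particular multinomial parameter
print(dirichlet.rvs(alpha, size=2))  # each draw is itself a distribution over 3 topics
```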


Dirichlet distribution (source)


For our purposes, we may suppose that the corners/vertices of the triangle represent the topics, and the points inside the triangle represent the words (a word lies closer to a topic when it is strongly associated with it), or vice versa. 

 

This distribution generalizes beyond three dimensions: we may picture a tetrahedron for four dimensions and a higher-dimensional simplex for additional dimensions.

 

The similarity between LDA and PCA:

 

As written by Analytics Vidhya, topic modelling and Principal Component Analysis (PCA) are quite similar. PCA is a dimensionality reduction technique used on numerical data: the model is built using linear combinations of variables, from which components are derived. 

 

PCA reduces the dimensions by decomposing a bigger matrix into smaller matrices (via singular value decomposition). LDA is similar to PCA in that it works in much the same way. 

 

LDA is applied to text data. It operates by splitting the corpus document-word matrix (the big matrix) into two smaller matrices: a document-topic matrix and a topic-word matrix. As a result, like PCA, LDA is a matrix factorization method.

 

Without LDA or any other topic modelling approach, we would work directly with the tokenized words (obtained after the text preparation steps). These words would then be used as features to build a model, possibly alongside other features. 

 

If, instead, we split the initial document-word matrix into two parts, one describing the words per topic and the other the topics per document, we can use these in place of the tokenized features produced by the Bag of Words vectorizer.

 

Breaking down the bigger word-feature matrix in this way lets us categorize the documents by topic while drastically reducing the number of features used to build the model, as illustrated below. Since LDA divides the corpus document-word matrix into smaller matrices, topic modelling and related approaches can also be used for dimensionality reduction.
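To make the dimensionality-reduction effect concrete, here is a small sketch (scikit-learn assumed; the documents are illustrative) that replaces the word features with topic features and reads off the top words per topic:

```python
# Minimal sketch: use the document-topic matrix as a compact feature set
# and inspect the top words of each topic (scikit-learn assumed).
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

documents = ["cats and dogs are pets", "dogs need training and care",
             "python code and software", "software testing finds bugs"]
vectorizer = CountVectorizer(stop_words="english")
dtm = vectorizer.fit_transform(documents)            # many word features

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topic = lda.fit_transform(dtm)                   # only K = 2 features per document
words = vectorizer.get_feature_names_out()

for k, weights in enumerate(lda.components_):
    top = [words[i] for i in np.argsort(weights)[::-1][:3]]
    print(f"topic {k}: {top}")
```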

 

Applications of LDA:

 

Traditionally, LDA has been used to detect thematic word clusters, or topics, in text data. Beyond that, LDA has been employed as a component in more complex applications. The following are some of them.

 

  • Cascaded LDA for taxonomy construction:

 

An online content generation system that organizes and manages a large number of different types of material. The Smart Online Content Generation System has further information about this application.

 

  • Recommendation system based on LDA:

 

A book recommendation engine based on books' Wikipedia entries. The LDA-based recommendation engine has further information on this application.

 

  • Gene Expression Classification:

 

Understanding the significance of differential gene expression in cancer aetiology and cellular processes. A Novel Approach for Classifying Gene Expression Data Using provides further information about this application.


 

This concludes our discussion of Latent Dirichlet Allocation. LDA's uses are not limited to Natural Language Processing; there are other applications as well. However, for the purposes of this article, we have concentrated on LDA for topic modelling, which is a branch of NLP.
