Category
>Machine Learning

Expectation-Maximization (EM) Algorithm in Machine Learning

Neelam Tyagi
Mar 28, 2021

Introduction

The expectation-maximization algorithm is a widely applicable method for iterative computation of maximum likelihood estimates. The K-means algorithm is the most famous variant of this algorithm.

At every iteration of this algorithm, two steps are followed, known as Expectation step (E-Step) and the Maximization step (M-step), and hence the algorithm named as EM algorithm, firstly by Dempster et al in 1977, hereafter this paper is referred as DLR paper, or DLR simply.

The prime idea of EM algorithms is to check the problem of measuring the maximum likelihood estimates (MLE) of a statistical model for the conditions when the latent variables are involved and the data is missing or incomplete. from the data, it is computationally easier if data had been observed on some randomly chosen variables.

Therefore, the blog covers;

Background & Terminologies
What is the EM algorithm?
How does the EM algorithm work?
Properties of EM algorithm
Drawbacks of EM algorithm
Applications of Em algorithm,
EM algorithm with Gaussian Mixture Model (GMM)

Background & Terminologies

Under the domain of statistics, Maximum Likelihood Estimation is the approach of estimating the parameters of a probability distribution through maximizing the likelihood function to make the observed data most probable for the statistical modelling. There is a limitation with MLE, it considers that data is complete and fully observable, and assumes that all the model-associated variables are present already. Instead, in most of the cases, some relevant variables might be hidden that makes inconsistencies. Such unobserved or hidden data variables are known as Latent variables.
Probability density estimation is the forming of the estimates on the basis of observed data that incorporates picking a probability distribution function and the parameters of that function to explain the joint probability of the observed data.
Convergence is simply the instinct on the basis of probability, suppose there is a very small difference of probability between the two random variables, then it is said to be converged. Hare, convergence implies the values match with each other.
A latent variable model consists of observable and unobservable variables. Observed variables are ones that can be measured or recorded, and latent/ hidden variables are those that can’t be observed directly instead need to be inferred from the observed variables.

(Most Related: What is Conditional Probability?)

What is an EM algorithm?

The EM algorithm is the technique that can be deployed in order to determine the local maximum likelihood estimates/ parameters (MLE) or maximum a posteriori (MAP) estimates/parameters for latent variables (unobservable variables that are inferred from observable variables) in statistical models.

Or simply, the EM algorithm in machine learning uses observable instances of latent variables in order to predict values in instances, unobservable for learning, and continues till the convergence of the values takes place.

EM algorithm is the procedure of performing maximum likelihood estimation when the latent variables are presented, as explained in this.

Being an iterative approach, the EM algorithm revolves amid two modes, the first mode estimates the missing or latent variables, called E-step, and the second step optimizes the parameters of the model that explains data more clearly, called M-step. i.e.

E-step: Estimates the missing values in the dataset,
M-step: Maximize the model parameters while the data is present.

The algorithm is used for predicting these values or in computing missing or incomplete data, given the generalized form of probability distribution that is connected with these latent variables.

How does the EM Algorithm Work?

Now, let’s understand the working mechanism of this algorithm,

Workflow of EM algorithm

Step 1: As having one set of missing or incomplete data and another set of starting parameters, we assume that observed data or initial values of the parameters are produced from the specific model.

Therefore, an entire set of incomplete observed data is provided to the system, assuming that an observed data comes from a specific model.

Step 2: Depending on the observable value of the observable instances of the available data, next the values of unobservable instances, or missing data are predicted or estimated. This step is known as Expectation step, or E-step.

Basically, in this step, the observed data is used for estimating or guessing missing or incomplete data values that are used to update the variables.

Step 3: By the produced data from E-step, next we update the parameters and complete the data set. This step is known as Maximization step, or M-step. And we update the hypothesis.

Working of E-step and M-step

As the last step, we check whether the values are converging or not, if yes, stop the process.

If not, then step 2 and step 3 will be imitated until the state of convergence is achieved, or simply, we will repeat E-step and M-step if the values are not converging. (From)

Properties of EM Algorithm

The EM algorithm has multiple applicable properties

It has numerical stability, with EM iteration increasing the likelihood.
In underlying-favorable circumstances, it has reliable-universal convergence.
Being implemented analytically and computationally, It is very easy to program and seek a tiny storage slot. Simply, by observing the continuance expansion in likelihood, in case computed easily, it becomes easier to regulate convergence and programming eros. (As stated in McLachlan and Krishnan, 1997, Section 1.7)
Having minor cost at per iteration, maximum number of iterations can be counterbalanced as such required for the EM algorithm as compared to other methods.
Most of the time, solutions to the M-steps reside in the closed form.
It can be globally accepted to obtain estimates of missing data.

Drawbacks of EM Algorithm

Besides that, some drawbacks are listed below;

It is unable to give automated estimates of the covariance matrix of the parameter estimates, yet such a drawback can be eliminated by applying appropriate methodology, concerning the EM algorithm.
Sometimes, it becomes very slow at convergence, and makes convergence only to local optima.
In a few cases, the E- and M-step could be unmanageable analytically.
It demands both forward and backward probabilities (as numerical optimization needs forward probability only).(Source)

(Also check: What is Confusion Matrix?)

Applications of EM Algorithm

The algorithm has plenty of applications of real-world applications in machine learning, some of them are;

Chosen in unsupervised data clustering and psychiatric analysis.
It has numerous uses in NLP, computer vision, and quantitative analysis of genetics.
Widely used in image reconstruction in the realm of medicine and structural engineering.
Adopted to measure the Gaussian density of a function.
Used in predicting the Hidden Markov Model (HMM) parameters and similar mixed models.
Used in filling missing data in a sample.
Explore the value of latent variables.

EM algorithm with Gaussian Mixture Model (GMM)

A model that consists of an undetermined/unspecified blend of several probability distribution functions is termed a mixture model, and in general, a learning algorithm is employed for estimating the parameters of the probability distributions that fits correctly to the density of a training dataset, provided.

Similarly, a Gaussian Mixture Model is that type of mixture model that takes the combination of Gaussian probability distributions and demands the estimation of mean and standard deviation parameters for each. A plenty of techniques are available that estimate the parameters for GMM, and MLE is very common in that.

For example, the dataset involves several number of data points, to be generated from the two different processes where the data points have gaussian probability distribution for each process.

Since, the data is combined and have identical distributions such that it becomes hard to identify that a given point belongs to which distributions. Here the processes used to generate the data points depict a latent variable, like, process 0 and process 1. In this case, the EM algorithm is the excellent technique to use for estimating the parameters of the distributions. Let’s learn how.

Through this algorithm, the E-step determines the value for the process latent variable for each data point, and M-step optimizes the parameters of the probability distributions in order to capture the density of data.

The process is repeated until the appropriate set of the latent valus and a maximum likelihood are obtained that fits the data.