
How do LSTM and GRU work in deep learning?

Tanesh Balodi
Dec 10, 2019
Updated on: Mar 15, 2021

Introduction

Recurrent Neural Networks (RNNs) are very powerful, but they suffer from the problem of short-term memory. Over a long data sequence, an RNN has difficulty carrying information from earlier steps to later ones. If it processes a paragraph of text to make a prediction, it may lose important information from the beginning.

In addition, during backpropagation an RNN suffers from the vanishing gradient problem. Gradients are the values used to update the weights of a neural network.

In brief, the vanishing gradient problem occurs when the gradient shrinks as it is back-propagated through time; once it becomes too small, it contributes almost nothing to the learning process.

In an RNN, the earlier time steps (the early layers of the unrolled network) that receive the smallest gradients therefore stop learning. Since they don't learn, the RNN forgets what it observed early in a long data sequence, and this is the short-term memory problem.
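To make the shrinking concrete, here is a minimal sketch that assumes a one-unit RNN of the form h_t = tanh(w * h_t-1 + x_t). The function name, the weight 0.5, and the constant input 0.1 are illustrative assumptions, not part of any library. By the chain rule, the gradient of the last hidden state with respect to the first is a product of one factor per time step, so it decays quickly when each factor is below 1.

```python
import math

def gradient_through_time(w, inputs, h0=0.0):
    """Return d h_T / d h_0 for the scalar RNN h_t = tanh(w * h_{t-1} + x_t)."""
    h, grad = h0, 1.0
    for x in inputs:
        h = math.tanh(w * h + x)
        # Chain rule: each step multiplies the gradient by w * (1 - h_t^2),
        # because the derivative of tanh(z) is 1 - tanh(z)^2.
        grad *= w * (1.0 - h * h)
    return grad

grad_short = gradient_through_time(w=0.5, inputs=[0.1] * 5)
grad_long = gradient_through_time(w=0.5, inputs=[0.1] * 50)
print(f"gradient after 5 steps:  {grad_short:.3e}")
print(f"gradient after 50 steps: {grad_long:.3e}")
```

With these assumed values every factor is at most 0.5, so the gradient over 50 steps is many orders of magnitude smaller than over 5 steps, and a weight update based on it would barely change the earliest steps.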

To deal with this short-term memory, LSTM and GRU emerged as solutions, and both are the core points of discussion in this blog.

Topics Covered

What are GRUs?
LSTM Networks
Why do we use LSTM and GRU?
Working of LSTM
Working of GRU
Applications of LSTM and GRU

What are GRUs?

A Gated Recurrent Unit (GRU) is a variant of the RNN architecture that uses a gating mechanism to control the flow of information between cells in the neural network.

Introduced in 2014 by Cho et al., the GRU can capture dependencies in long sequential data without discarding information from the earlier portion of the sequence.

It does this through gated units that mitigate the vanishing gradient problem of conventional RNNs. These gates regulate which information is kept and which is discarded at each step.

LSTM Networks

Developed by Hochreiter & Schmidhuber (1997), LSTMs, or Long Short Term Memory networks, are a special kind of RNN capable of learning long-term dependencies, which makes them a good choice for a wide variety of problems.

They are designed specifically to cope with the long-term dependency problem; remembering information over long spans of time is their default behavior.

Like a standard RNN, an LSTM has a chain-like structure of repeating modules, but the repeating module is built differently: instead of a single neural network layer, it has a set of four layers that interact with each other in a special way.

Why do we use GRU and LSTM?

Both LSTM and GRU are used mainly to address the vanishing gradient issue. GRU handles the vanishing gradient problem well with a simpler design, so it is often chosen over other similar methods.

The vanishing gradient problem is a loss of learning signal that occurs while training a model with gradient descent or other techniques that minimize a loss.

As the error is back-propagated through the network, the gradients become smaller at each step, so the earliest parts of the network receive almost no update. The model therefore cannot use the information in the sequence to its full potential, and this causes poor accuracy.

Let's quickly move towards the working of LSTM and GRU.

Working of LSTM and GRU

LSTM has a more complex design than GRU. Both have gates, and their whole working depends on these gates; GRU has fewer, simpler gates, which makes it easier to understand.
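One rough way to quantify the difference in complexity is to count trainable parameters. The sketch below assumes the common formulation in which an LSTM layer has four weight blocks (forget, input, candidate, output) and a GRU layer has three (update, reset, candidate), each with one bias vector. The function names are hypothetical, and specific libraries may count biases differently.

```python
def gate_block_params(input_size, hidden_size):
    """Weights acting on [h_prev, x_t] plus one bias vector for a single block."""
    return hidden_size * (input_size + hidden_size) + hidden_size

def lstm_params(input_size, hidden_size):
    # Assumed blocks: forget gate, input gate, candidate, output gate.
    return 4 * gate_block_params(input_size, hidden_size)

def gru_params(input_size, hidden_size):
    # Assumed blocks: update gate, reset gate, candidate.
    return 3 * gate_block_params(input_size, hidden_size)

lstm_count = lstm_params(input_size=10, hidden_size=20)
gru_count = gru_params(input_size=10, hidden_size=20)
print(lstm_count)  # 2480
print(gru_count)   # 1860
```

Under these assumptions a GRU layer has three quarters of the parameters of an LSTM layer of the same size.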

Below is the structure of LSTM. It has five components:

Forget Gate
Input Gate
Cell State
Output Gate
Hidden State Output

Working model of Long Short Term Memory (LSTM)

1. Forget Gate

Structure of the Forget Gate of LSTM

This gate separates relevant from irrelevant information and pushes only the relevant information forward to the cell state. The previous hidden state h_t-1 and the current input x_t are combined (written here as h_t-1 + x_t; in the standard formulation the two are concatenated and multiplied by a learned weight matrix) and passed to the forget gate, where a sigmoid function converts each value into the range 0 to 1. This output is the forget gate value f_t.

So how does it sort out irrelevant information from relevant information?

A value of f_t close to zero marks the corresponding information as irrelevant, so it is not pushed forward; a value close to one is pushed forward to the cell state of the LSTM structure. At the cell state, this contribution is therefore (c_t-1) * (f_t), where (c_t-1) is the previous cell state.
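The filtering described above can be sketched in a few lines. This is an illustration under stated assumptions: the previous hidden state and the input are concatenated and multiplied by a weight matrix W_f with bias b_f, and all numeric values are hand-picked so that one gate value lands near 1 and the other near 0. Plain Python lists are used to keep the sketch dependency-free.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forget_gate(h_prev, x_t, W_f, b_f):
    """f_t = sigmoid(W_f . [h_prev, x_t] + b_f), one value per cell-state entry."""
    combined = h_prev + x_t  # list concatenation: [h_{t-1}, x_t]
    return [sigmoid(sum(w * v for w, v in zip(row, combined)) + b)
            for row, b in zip(W_f, b_f)]

h_prev = [0.5, -0.2]       # previous hidden state (assumed values)
x_t = [1.0]                # current input (assumed value)
W_f = [[4.0, 0.0, 4.0],    # row 1: pre-activation 6.0, gate close to 1
       [0.0, 0.0, -6.0]]   # row 2: pre-activation -6.0, gate close to 0
b_f = [0.0, 0.0]
c_prev = [2.0, 2.0]        # previous cell state

f_t = forget_gate(h_prev, x_t, W_f, b_f)
kept = [f * c for f, c in zip(f_t, c_prev)]  # (c_t-1) * (f_t), element-wise
print(f_t)
print(kept)
```

The first cell-state entry passes through almost unchanged, while the second is nearly erased.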

2. Input Gate

Structure of the Input Gate of LSTM

This gate also processes (h_t-1 + x_t), usually with the sigmoid activation function, and produces a value in the range 0 to 1. Unlike the forget gate, its job is to decide how much new information to write, not what to discard.

The result is known as the input gate output (i_t). In parallel, the same (h_t-1 + x_t) is processed with the tanh activation function, which generates the candidate input (c'_t) with values in the range -1 to 1.

The input gate output is then multiplied with the candidate input, which gives (c'_t) * (i_t). This information is also passed to the cell state.
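As a small sketch of this step, the code below computes the sigmoid gate value, the tanh candidate, and their product. It assumes separate weight matrices W_i and W_c (with biases) applied to the concatenated hidden state and input; the names and numbers are illustrative only.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dense(W, v, b):
    """Matrix-vector product plus bias on plain Python lists."""
    return [sum(w * x for w, x in zip(row, v)) + bias for row, bias in zip(W, b)]

def input_gate(h_prev, x_t, W_i, b_i, W_c, b_c):
    """Return the gate value, the tanh candidate, and their element-wise product."""
    combined = h_prev + x_t  # list concatenation: [h_{t-1}, x_t]
    i_t = [sigmoid(z) for z in dense(W_i, combined, b_i)]        # range 0 to 1
    candidate = [math.tanh(z) for z in dense(W_c, combined, b_c)]  # range -1 to 1
    return i_t, candidate, [i * c for i, c in zip(i_t, candidate)]

# One hidden unit, one input feature; all values are assumptions.
i_t, candidate, new_info = input_gate(
    h_prev=[0.0], x_t=[1.0],
    W_i=[[0.0, 0.0]], b_i=[0.0],   # zero pre-activation, so the gate is 0.5
    W_c=[[0.0, 2.0]], b_c=[0.0],   # pre-activation 2.0, so the candidate is tanh(2)
)
print(i_t, candidate, new_info)
```

Here the gate lets half of the candidate through, so the amount written to the cell state is 0.5 * tanh(2).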

3. Cell State

Structure of the Cell State of LSTM

The information coming from the forget gate, i.e. (c_t-1) * (f_t), is added to the information coming from the input gate, i.e. (c'_t) * (i_t), where c'_t is the tanh candidate input. The whole equation becomes:

c_t = (c_t-1) * (f_t) + (c'_t) * (i_t)

From the equation and the step-by-step guide, you can see that the cell state acts as the memory unit of the LSTM structure, as it contains all the relevant information.
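A short worked example of the cell state equation, using hand-picked gate values (assumptions chosen to show the two extreme cases): the first entry keeps its old memory and writes nothing new, while the second forgets its old memory and stores the new candidate.

```python
def update_cell_state(c_prev, f_t, i_t, candidate):
    """New cell state: previous state scaled by f_t plus candidate scaled by i_t."""
    return [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, candidate)]

c_prev = [2.0, -1.0]     # previous cell state
f_t = [1.0, 0.0]         # keep entry 0, forget entry 1
i_t = [0.0, 1.0]         # write nothing to entry 0, write fully to entry 1
candidate = [0.9, 0.5]   # tanh candidate values

c_t = update_cell_state(c_prev, f_t, i_t, candidate)
print(c_t)  # [2.0, 0.5]
```

In practice the gate values fall between 0 and 1, so each entry is a blend of old memory and new content.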

4. Output Gate

Processing of an Output Gate of LSTM

The output gate processes (h_t-1 + x_t) with the sigmoid activation function, which squeezes the information into the range 0 to 1 and gives the output information (o_t).

The cell state information is processed with the tanh activation function and is then multiplied with the output information (o_t). All the multiplications here are Hadamard products, which are simply element-wise multiplications.

5. Hidden state output

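The hidden state output is the product of the tanh-processed cell state information and the output information:

h_t = (o_t) * tanh (c_t)

Here c_t is the cell state computed in step 3, i.e. the forget gate contribution plus the input gate contribution.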
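Putting the five components together, a single LSTM time step might look like the sketch below. It assumes the standard formulation in which each gate has its own weight matrix applied to the concatenated [h_t-1, x_t], and it reads the hidden state as o_t times tanh of the new cell state. The parameter names, the random initialisation, and the input sequence are all illustrative assumptions.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dense(W, v, b):
    """Matrix-vector product plus bias on plain Python lists."""
    return [sum(w * x for w, x in zip(row, v)) + bias for row, bias in zip(W, b)]

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM time step; returns the new hidden state and the new cell state."""
    combined = h_prev + x_t  # list concatenation: [h_{t-1}, x_t]
    f_t = [sigmoid(z) for z in dense(p["W_f"], combined, p["b_f"])]       # forget gate
    i_t = [sigmoid(z) for z in dense(p["W_i"], combined, p["b_i"])]       # input gate
    cand = [math.tanh(z) for z in dense(p["W_c"], combined, p["b_c"])]    # candidate
    o_t = [sigmoid(z) for z in dense(p["W_o"], combined, p["b_o"])]       # output gate
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, cand)]   # cell state
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]                    # hidden state
    return h_t, c_t

def init_params(input_size, hidden_size, seed=0):
    """Small random weights and zero biases, for illustration only."""
    rng = random.Random(seed)
    p = {}
    for name in ("f", "i", "c", "o"):
        p["W_" + name] = [[rng.uniform(-1.0, 1.0) for _ in range(hidden_size + input_size)]
                          for _ in range(hidden_size)]
        p["b_" + name] = [0.0] * hidden_size
    return p

params = init_params(input_size=1, hidden_size=2)
h, c = [0.0, 0.0], [0.0, 0.0]
for x in [[1.0], [0.5], [-0.3]]:   # a toy sequence of three inputs
    h, c = lstm_step(x, h, c, params)
print(h, c)
```

Because the hidden state is a sigmoid value times a tanh value, every entry of h stays strictly between -1 and 1, while the cell state itself is not bounded in that way.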

Gating is not the only remedy. In many models, experiments with activation functions such as ReLU and leaky ReLU have given better results, because these activation functions help in avoiding the vanishing gradient problem.

Working of GRU

Now let's see how GRU works.

The working architecture of Gated Recurrent Unit(GRU)

The main problem with the plain recurrent neural network was that it could not remember older information, whether relevant or irrelevant, which goes against the very idea of an RNN.

A GRU, in contrast, can carry information across many time steps and learns through its gates which parts to keep and which to drop, so it stays true to the idea of a recurrent neural network.

GRU also uses gates like LSTM, but fewer of them: an update gate and a reset gate. The main components of GRU are:

1. Update Gate

Presentation of an Update Gate of GRU

At this gate, the network learns how much past information to push forward. The input (x_t) is multiplied by its weight (w_t), and the result is added to the product of the previous hidden state (h_t-1) and its weight (w_h).

This sum is then processed with the sigmoid function to compress the value into the range 0 to 1. Because a value near 1 lets the previous state pass through almost unchanged, the update gate helps in reducing the vanishing gradient issue.

u_t = sigmoid [ (x_t) * (w_t) + (h_t-1) * (w_h) ]
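The update gate equation translates almost directly into code. The sketch below uses scalars and omits a bias term, as the equation above does; the weights and inputs are assumed values chosen to show a gate that is nearly open and one that is nearly closed.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def update_gate(x_t, h_prev, w_t, w_h):
    """u_t = sigmoid(x_t * w_t + h_prev * w_h), scalar form without bias."""
    return sigmoid(x_t * w_t + h_prev * w_h)

# Positive weights: pre-activation 1.0 * 2.0 + 0.5 * 4.0 = 4.0, gate close to 1.
u_open = update_gate(x_t=1.0, h_prev=0.5, w_t=2.0, w_h=4.0)
# Negative weights: pre-activation -4.0, gate close to 0.
u_closed = update_gate(x_t=1.0, h_prev=0.5, w_t=-2.0, w_h=-4.0)
print(round(u_open, 3), round(u_closed, 3))
```

During training the weights are learned, so the network itself decides for which inputs the gate should stay open.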

2. Reset Gate

Structure of the Reset Gate of GRU

The reset gate is similar to the forget gate of LSTM: it identifies irrelevant data and tells the model to forget this data and move forward without it. Its formula is almost the same as that of the update gate; it differs only in its weights and in how its output is used.

r_t = sigmoid [ (x_t) * (w_r) + (h_t-1) * (w_hr) ]

If we observe closely, its role complements that of the update gate: the update gate decides how much past information to carry forward, while the reset gate decides how much of it to forget. Here too, the sigmoid function converts the value into the range 0 to 1; past information with a gate value closer to 0 is not used further, while information with a value closer to 1 is processed forward.

3. Current Memory Content

Let's consider an example of a movie review,

“‘Chhichhore’ has a relevant message on the inherent attitude towards academic success and failure that will connect with many youngsters and parents of today. It tells you that the journey is far more important than the destination and that losing is as critical a life lesson as winning. The film scores high on many accounts and is certainly worth watching.”

As we can see, the last line gives the conclusion of the review, so our neural network can learn, with the help of the reset gate, to ignore the other information written above it. The current memory content uses the reset gate in this way for tasks like sentiment analysis. Let's see its equation:

cm_t = tanh (w * x_t + r_t x (y * h_t-1))

Here, x_t is multiplied by its weight w and added to the Hadamard product (written x) of the reset gate r_t and the previous hidden state h_t-1 multiplied by its weight y. The tanh function makes the values fit in the range -1 to 1. This quantity appears as h'_t in the final memory equation.
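To see what the reset gate does to the current memory content, the sketch below evaluates the equation in scalar form for a reset value of 0 and a reset value of 1. The weights w and y and the inputs are assumed values for illustration.

```python
import math

def current_memory(x_t, h_prev, r_t, w, y):
    """cm_t = tanh(w * x_t + r_t * (y * h_prev)), scalar form."""
    return math.tanh(w * x_t + r_t * (y * h_prev))

x_t, h_prev = 0.5, 0.8
w, y = 1.0, 2.0

# Reset gate at 0: the previous hidden state is ignored entirely.
cm_ignore_past = current_memory(x_t, h_prev, r_t=0.0, w=w, y=y)
# Reset gate at 1: the previous hidden state contributes fully.
cm_use_past = current_memory(x_t, h_prev, r_t=1.0, w=w, y=y)
print(round(cm_ignore_past, 4), round(cm_use_past, 4))
```

With the reset gate at 0 the result depends only on the current input, which is the behaviour needed in the movie review example when earlier text should be ignored.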

4. Final Memory at Current time state

Let's suppose that, while doing sentiment analysis of a movie review, we find the best information in the very first line and all other information in the text is of no use. Our model must then be able to take the sentiment from the first line and ignore the rest of the text.

The final memory decides how much of the previous state to carry forward and how much current content to add. This needs the update gate, hence the update gate is an essential prerequisite for this final stage. Let's see its equation:

h_t = u_t x (h_t-1) + (1-u_t) x h'_t

Here, 'x' denotes the Hadamard product. The Hadamard product of the update gate u_t and the previous hidden state h_t-1 is added to the Hadamard product of (1-u_t) and h'_t, where h'_t is the current memory content computed in the previous step.
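The four GRU components can be combined into one time step. The sketch below follows the scalar equations of this section (update gate, reset gate, current memory content, final memory) without bias terms; the parameter names mirror the symbols used above, and their values and the input sequence are assumptions for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gru_step(x_t, h_prev, p):
    """One scalar GRU time step; returns the new hidden state."""
    u_t = sigmoid(x_t * p["w_t"] + h_prev * p["w_h"])          # update gate
    r_t = sigmoid(x_t * p["w_r"] + h_prev * p["w_hr"])         # reset gate
    cand = math.tanh(p["w"] * x_t + r_t * (p["y"] * h_prev))   # current memory content
    # Final memory: blend of the previous state and the current memory content.
    return u_t * h_prev + (1.0 - u_t) * cand

params = {"w_t": 0.5, "w_h": -0.3, "w_r": 0.8, "w_hr": 0.1, "w": 1.2, "y": 0.7}

h = 0.0
history = []
for x in [1.0, -0.5, 0.25, 0.0]:   # a toy input sequence
    h = gru_step(x, h, params)
    history.append(h)
print([round(v, 4) for v in history])
```

Since the final memory is a weighted average of the previous state and a tanh output, the hidden state always stays between -1 and 1, and an update gate value near 1 simply copies the previous state forward.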

Applications of LSTM and GRU

Speech Recognition
Speech Synthesis
Sentiment Analysis
Stock Price Prediction
Machine Comprehension