If you have studied Natural Language Processing (NLP) and know techniques such as bag of words, tokenization, stemming, and lemmatization, you probably know that a language model can predict the next word from the rest of a sentence, or classify the sentiment of a movie review. But what if I told you there is an NLP model that was released only in part, because its results were so strong that its creators feared the full model might be misused? That model is our topic for today: Generative Pre-trained Transformer-2 (GPT-2), released by researchers at OpenAI in early 2019.
Let us look at this model more closely.
An OpenAI research team trained a model on about 40GB of internet text, and its performance was hard to believe. The NLP techniques and models we knew could predict the next piece of text, but this model is powerful enough to produce a whole article or story from only a few sentences or words, and the result is so convincing that it is hard to tell it was generated by a machine. Here is a sample from OpenAI’s research.
The model was given a short human-written prompt about the Civil War, and the machine-generated answer was startling for any NLP model. Here is part of the result:
It is easy to identify why the Civil War happened because so many people and so many books and so much television and films tell us that it was the cause, that it has something to do with race or economics or religion. There is much agreement that it was essentially a war of slavery on behalf of capitalism, about a century of slavery.
But that’s not what most people think of when they think of the Civil War. Many people think of the war as about states’ rights. You’re not wrong about that, but it’s kind of misleading to say that the Civil War was a conflict between states’ rights and federalism. So let’s try again. What’s the idea that has inspired so many people to go onto the battlefields of Gettysburg and Antietam and Gettysburg and Petersburg and Fredericksburg? The American ideal of the republic--a notion of limited government--is a great part of the history.
This is not even the whole answer generated by GPT-2; it is just an excerpt. The model fulfils the aim of natural language processing to a remarkable extent, since the whole idea is to have a machine interpret and understand raw text much as a human would.
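The idea of continuing a text from a short prompt can be sketched with a toy model. The example below is only an illustration under strong assumptions: it uses word-pair counts rather than a Transformer, and the names `train_bigram` and `generate` are hypothetical. What it shares with GPT-2 is the loop itself: predict one next word, append it, and repeat.

```python
import random
from collections import defaultdict


def train_bigram(text):
    """Record which words follow each word; a toy stand-in for a language model."""
    words = text.split()
    followers = defaultdict(list)
    for current, nxt in zip(words, words[1:]):
        followers[current].append(nxt)
    return followers


def generate(followers, prompt, max_new_words=10, seed=0):
    """Extend the prompt one word at a time, always conditioning on the last word."""
    rng = random.Random(seed)
    output = prompt.split()
    for _ in range(max_new_words):
        candidates = followers.get(output[-1])
        if not candidates:  # no known continuation, so stop early
            break
        output.append(rng.choice(candidates))
    return " ".join(output)


corpus = "the war was long and the war was costly and the peace was short"
model = train_bigram(corpus)
print(generate(model, "the war", max_new_words=8))
```

A real model like GPT-2 conditions on the whole preceding text rather than just the last word, which is what lets it stay on topic over several paragraphs.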
The full GPT-2 model was trained on a 40GB dataset and has more than 1.5 billion parameters across 48 layers. As a precaution, the first released version had only about 117 million parameters and 12 layers, which limits its performance and accuracy. A second released version has about 345 million parameters and performs better.
The model also beat previous records by a substantial margin. For example, the previous best model achieved 85.7% accuracy on the “Children’s Book Test Common Nouns” dataset, whereas GPT-2 reached 93.30% on the same dataset, a gain of 7.6 percentage points that leaves it less than 3% short of human-level accuracy.
Let’s go through the name Generative Pre-Trained Transformer. ‘Generative’ describes the generative nature of the model: it takes in text and produces new text that is coherent and meaningful. ‘Pre-Trained’ means the model is trained in advance on a huge body of text before it is applied to any specific task. ‘Transformer’ is the most important part of the name, as it describes the architecture, which we discuss next.
The Transformer architecture does the heavy lifting on the text. It is built from several layers, each with a different purpose, and the outputs it provides are a text prediction and a text classifier.
The huge dataset is fed to this Transformer over a very large number of training steps, which is the reason behind its success in language modeling, machine translation, and automatic text generation. The Transformer can be called the foundation stone of this very efficient model. It was originally introduced as an architecture for machine translation, and in this model it serves as the core that delivers strong results in natural language processing.
To implement the Transformer, there are four main steps to follow:
Inserting Input: We feed every word of the text document to the Transformer. Embedding words is common practice in neural machine translation: in this step, every word is assigned a vector known as its embedding vector.
Positional Encoding: Positional encoding adds information about each word’s position in the sequence to the embedding vector from the previous step, since the embedding alone carries no word order.
Creating Masks: Masks are used in both the encoder and the decoder. In the decoder, the mask hides later words so that each next word is predicted only from the words that come before it.
Feed Forward Layer: The feed-forward layer in the Transformer applies linear operations combined with two important operations, the ReLU activation function and dropout. After these operations, normalization is applied, which is important for keeping the results on a uniform scale.
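The four steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the GPT-2 implementation: the sinusoidal positional encoding follows the original Transformer design, the attention layer that would consume the mask is left out, dropout is skipped as it would be at inference time, and all sizes, weights, and function names are hypothetical.

```python
import math
import random


def embed(tokens, vocab, table):
    """Step 1: look up one embedding vector per word."""
    return [list(table[vocab[token]]) for token in tokens]


def positional_encoding(seq_len, d_model):
    """Step 2: sinusoidal position vectors (original Transformer scheme)."""
    encoding = []
    for pos in range(seq_len):
        row = []
        for i in range(d_model):
            angle = pos / (10000 ** (2 * (i // 2) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        encoding.append(row)
    return encoding


def causal_mask(seq_len):
    """Step 3: 1 where position i may look at position j (j <= i), else 0."""
    return [[1 if j <= i else 0 for j in range(seq_len)] for i in range(seq_len)]


def linear(x, weights, bias):
    """One linear layer; each row of `weights` feeds one output unit."""
    return [sum(xi * w for xi, w in zip(x, row)) + b for row, b in zip(weights, bias)]


def feed_forward(x, w1, b1, w2, b2):
    """Step 4: linear -> ReLU -> linear (dropout omitted in this sketch)."""
    hidden = [max(0.0, h) for h in linear(x, w1, b1)]
    return linear(hidden, w2, b2)


def layer_norm(x, eps=1e-5):
    """Rescale a vector to zero mean and roughly unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


# Toy sizes and random weights, chosen only for illustration.
rng = random.Random(0)
d_model, d_hidden = 4, 8
vocab = {"the": 0, "civil": 1, "war": 2}
table = [[rng.uniform(-1, 1) for _ in range(d_model)] for _ in vocab]
w1 = [[rng.uniform(-1, 1) for _ in range(d_model)] for _ in range(d_hidden)]
b1 = [0.0] * d_hidden
w2 = [[rng.uniform(-1, 1) for _ in range(d_hidden)] for _ in range(d_model)]
b2 = [0.0] * d_model

tokens = ["the", "civil", "war"]
positions = positional_encoding(len(tokens), d_model)
x = [[e + p for e, p in zip(emb, pos)]
     for emb, pos in zip(embed(tokens, vocab, table), positions)]
mask = causal_mask(len(tokens))

# Residual connection followed by normalization, one position at a time.
out = [layer_norm([a + f for a, f in zip(row, feed_forward(row, w1, b1, w2, b2))])
       for row in x]
print(mask)
print([[round(v, 3) for v in row] for row in out])
```

The mask printed first is lower-triangular: the first word sees only itself, while the last word sees the whole sentence so far.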
As we know, most widely used models rely on supervised learning and have achieved major success with it, and most algorithms are best suited to supervised learning. So why is unsupervised learning preferred in GPT-2, OpenAI’s most advanced natural language processing model? The reason is very practical. In their notes they wrote: “Since unsupervised learning removes the bottleneck of explicit human labeling, it also scales well with current trends of increasing computation, and availability of raw data.
Unsupervised learning is a very active area of research but practical uses of it are often still limited”. Now you know why they prefer unsupervised learning. One more reason is that labeled and cleaned data is expensive, so choosing unsupervised learning is a clever choice.
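To see why no human labeling is needed, note that raw text supplies its own training targets: each word is the "label" for the words before it. The sketch below is a simplified illustration; the function name `next_token_pairs` and the whitespace tokenization are assumptions made for the example, not how GPT-2 tokenizes text.

```python
def next_token_pairs(text, context_size=3):
    """Turn raw text into (context, next word) training pairs without any labels."""
    tokens = text.split()
    pairs = []
    for i in range(1, len(tokens)):
        context = tokens[max(0, i - context_size):i]
        pairs.append((context, tokens[i]))
    return pairs


for context, target in next_token_pairs("the war was long and costly"):
    print(context, "->", target)
```

Every sentence on the internet yields training pairs of this kind for free, which is why this approach scales with the availability of raw data.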
False Information: Generative Pre-trained Transformer-2 is trained on text from millions of web pages, but the correctness of the content on those pages cannot be guaranteed. Because the model is trained on such a dataset, it can pick up and exploit biases in the data distribution.
Heavy Computation: Compared with earlier language models, which could be trained on a single GPU, OpenAI’s approach requires a heavy computational setup. OpenAI’s notes on the original GPT report about one month of pre-training on 8 GPUs for a 37-layer (12-block) Transformer, and the full GPT-2, with 48 layers and 1.5 billion parameters, is larger still. These numbers give a sense of the amount of computation involved.
Unpredictable Generalization: According to the OpenAI research team, this text generator has performed really well on almost every dataset, but they have seen counterintuitive behavior when it is evaluated in an out-of-distribution way.
GPT-2, according to its creators, is among the most advanced text generators built for language modeling and next-token prediction, but it was also widely described as “the AI that is too dangerous to release”. That phrase conveys the potential of this NLP model: one possible application of GPT-2 is creating fake text or information, which could make it nearly impossible to tell whether a piece of text was generated by a machine or written by a human.
That may be the reason OpenAI did not release the full 1.5 billion parameter pre-trained model, given the possible threat of misuse. GPT-2 can be considered one of the best text generation models created so far. Further advances are still needed, but seeing its potential, we can assume that we are getting close to an ideal predictive and generative text model.
Why did we feel the need for GPT-2 when there was already GPT? Soon after GPT, Google released its own natural language processing model, BERT, which performed better than OpenAI’s GPT. It could fill in the blanks in the middle of a sentence, which was a big achievement. Later, OpenAI came back with GPT-2, which uses essentially the same earlier model; the main upgrade was scale: more GPUs, far more parameters, and about 40 gigabytes of internet text.
As a result, it performed phenomenally, outdoing Google’s BERT at text generation by producing a whole document from as little as a sentence, and sometimes even a single word. Another earlier model was ELMo, from the Allen Institute for AI, which, like Google’s work on “semi-supervised sequence learning”, achieved good accuracy. BERT, for its part, stands for Bidirectional Encoder Representations from Transformers; it achieved about 86.7% accuracy on the MultiNLI dataset, a 4.6% absolute improvement over previous results. The success of Google’s model led the OpenAI team to think about a new way to approach natural language processing.
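The difference between filling in a blank and generating what comes next can be shown with the inputs each kind of model sees. This is a hedged sketch of the two training setups only; the helper names and the `[MASK]` placeholder handling are assumptions for illustration, and no real model is involved.

```python
def masked_example(tokens, position, mask_token="[MASK]"):
    """BERT-style: hide one word; context on both sides stays visible."""
    visible = list(tokens)
    target = visible[position]
    visible[position] = mask_token
    return visible, target


def next_word_example(tokens, position):
    """GPT-style: only the words to the left are visible; predict the next one."""
    return tokens[:position], tokens[position]


sentence = "the model writes a whole story".split()
print(masked_example(sentence, 2))
print(next_word_example(sentence, 2))
```

Because the GPT-style setup never looks to the right, the same model can keep extending its own output, which is what makes document-length generation possible.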
Generative Pre-Trained Transformer is among the most impressive pieces of research in the field of Natural Language Processing, and there is every chance of further substantial advances, which seem to be arriving earlier than predicted. With this research by the OpenAI team, fine-tuning has improved and models generalize better than ever.
We hope to see new marvels in Natural Language Processing techniques in the future. Without wanting to predict too much, in my opinion we are very near to the ideal result we are hoping for. For more blogs on analytics and new technologies, do read Analytics Steps.