Sentiment analysis is a Natural Language Processing technique used to determine the sentiment behind a text, e.g. tweets, movie reviews, YouTube comments, or any incoming message.
For example, the Grammarly extension corrects the grammar in a document or text and also reports how the document sounds overall, with feedback such as informative, positive, optimistic, and more. This feedback also relies on sentiment analysis of the whole document or text.
While other deep learning and NLP models are also used for sentiment analysis, recurrent neural networks are widely used to implement it.
It is used on a large scale: many big firms use it to examine customer reviews of their products and services on social media and on their own websites, which helps them maintain their brand value. Some of the big corporations using this technique to strengthen customer engagement across their service areas:
Other applications of sentiment analysis include:
In this blog, you will perform sentiment analysis on the comments of a YouTube video. The first thing you need is the comments on that video, which can be extracted by following this blog.
If you missed that step, follow it first to extract the comments from YouTube and pre-process them.
With sentiment analysis of the comments, a creator can learn how the community receives their channel or video and maintain content quality accordingly.
Another use case is analyzing trending videos. Videos often trend because of high view and like counts, but sentiment analysis of the comments can help you find the most useful video of a particular channel, celebrity, category, etc.
The dataset was prepared and pre-processed by removing emojis from all comments and keeping only English comments, as discussed in the previous blog.
For better model results, the comments of three videos were combined into one dataset. It contains the video ID, comment ID, and comment text from YouTube, as shown here,
Data preprocessing is one of the most important steps in building any machine learning model, as it directly affects the model's results. Cleaner, more consistent input generally helps the model perform better.
1. Data Labelling
The dataset is unlabelled because the API only lets you extract the comments, not their polarity. Polarity is a score between -1.0 and 1.0 that indicates the emotion of a sentence based on the words it contains. It can be computed with the TextBlob module for Python, which provides a polarity score as follows,
Code illustration: 1
Once TextBlob has produced the polarity score, you can apply a threshold to derive a new feature, i.e., the sentiment of each comment, either positive (1) or negative (-1), as shown below:
Code illustration: 2
Code illustration: 3
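The thresholding and the class count can be sketched as follows. This is an illustration under stated assumptions: the column names `polarity` and `sentiment`, the sample scores, and the threshold of 0 (with a score of exactly 0 counted as negative) are not taken from the original code.

```python
import numpy as np
import pandas as pd

# Hypothetical polarity scores standing in for the TextBlob output
df = pd.DataFrame(
    {
        "comment": ["great video", "terrible audio", "first", "nice tutorial"],
        "polarity": [0.8, -0.6, 0.0, 0.35],
    }
)

THRESHOLD = 0.0  # assumption: scores above 0 are positive, everything else negative

# 1 for positive comments, -1 for negative comments
df["sentiment"] = np.where(df["polarity"] > THRESHOLD, 1, -1)

# Number of comments in each class
counts = df["sentiment"].value_counts()
print(counts)
```

Checking the class counts at this point shows whether the labels are roughly balanced, which matters later when interpreting accuracy.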
From the output above, you can see that almost half of the comments are positive and the other half are negative.
2. Lowercase the Comments
The next preprocessing step is to lowercase the words in every comment. Without this step, the system treats differently cased forms of the same word, such as "Good" and "good", as two different words, which adds redundant information and can lead to results other than the desired ones.
Code illustration: 4
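A minimal sketch of lowercasing with pandas, assuming the comments live in a column named `comment` (a hypothetical name). It also shows how differently cased duplicates collapse into one value.

```python
import pandas as pd

# Hypothetical comments that differ only in casing
df = pd.DataFrame({"comment": ["Great Video", "great video", "GREAT tutorial"]})

unique_before = df["comment"].nunique()

# Vectorized lowercasing of every comment
df["comment"] = df["comment"].str.lower()

unique_after = df["comment"].nunique()
print(unique_before, unique_after)
```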
3. Strip Whitespace
Now, using Python's strip() method, remove all leading and trailing spaces from the comments.
Code illustration: 5
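A short sketch of the stripping step, assuming a pandas column named `comment` (hypothetical). `Series.str.strip()` applies Python's `strip()` to every value, so both leading and trailing whitespace are removed while spaces inside the comment are kept.

```python
import pandas as pd

# Hypothetical comments with stray whitespace around the text
df = pd.DataFrame({"comment": ["  nice video ", "thanks a lot\n", "helpful"]})

# Remove leading and trailing whitespace from every comment
df["comment"] = df["comment"].str.strip()

print(df["comment"].tolist())
```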
4. Stop Words
Stop words are words of little importance, i.e., commonly used words like is, am, the, are, a, etc., which add no useful information for the analysis. Removing them makes the data more concise, with fewer but more significant features.
Stop words are removed using the nltk module, which provides lists of stop words in different languages. Follow the code below to remove stop words and write the updated comments into a new column/feature,
Code illustration: 6
Code illustration: 7
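The removal itself can be sketched as below. To keep the example runnable without downloads, it uses a tiny hand-written stop-word set as a stand-in; with nltk, the set would instead come from `stopwords.words("english")` after running `nltk.download("stopwords")`. The helper `remove_stopwords` and the column name `clean_comment` are hypothetical.

```python
import pandas as pd

# Stand-in for the nltk list, i.e. set(stopwords.words("english"))
STOP_WORDS = {"is", "am", "the", "are", "a", "this", "i"}


def remove_stopwords(text: str, stop_words=STOP_WORDS) -> str:
    """Drop every whitespace-separated word that is a stop word."""
    return " ".join(word for word in text.split() if word not in stop_words)


# Hypothetical, already lowercased comments
df = pd.DataFrame({"comment": ["this is a great video", "i am the biggest fan"]})

# Write the cleaned text into a new column, keeping the original comment
df["clean_comment"] = df["comment"].apply(remove_stopwords)
print(df)
```

Keeping the stop words in a set makes each membership check cheap, which helps when the dataset holds many comments.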
5. Data Splitting
So far you have been working with a single dataset, but to test the accuracy of the trained model you need a separate test (validation) dataset.
So, split the dataset into training data and test data using the train_test_split function of scikit-learn, which splits the dataset into two parts in the required proportion. After you run the cell below, two datasets will be available, i.e. (X_train, y_train) and (X_test, y_test).
Code illustration: 8
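A minimal sketch of the split. The 80/20 proportion, the `random_state`, the stratification, and the column names are assumptions chosen for illustration; the original proportion is not stated.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical labelled comments: five positive, five negative
df = pd.DataFrame(
    {
        "comment": [f"comment {i}" for i in range(10)],
        "sentiment": [1, -1] * 5,
    }
)

X_train, X_test, y_train, y_test = train_test_split(
    df["comment"],
    df["sentiment"],
    test_size=0.2,             # assumption: hold out 20 % for testing
    random_state=42,           # fixed seed so the split is reproducible
    stratify=df["sentiment"],  # keep the class ratio equal in both parts
)

print(len(X_train), len(X_test))
```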
6. Feature Extraction from Text Data
Now it's time to extract features from the text data, i.e. convert it to integer or floating-point values, since a machine learning model can't be applied directly to text.
Use the CountVectorizer class from scikit-learn, which builds a vocabulary from the text data and counts how many times each word appears in each document.
CountVectorizer mainly performs three basic steps,
Tokenization: Tokenize the text into words,
Vocabulary: Build a vocabulary from all the words present in the text/document,
Encoding: Encode each document as a vector with the same length as the vocabulary.
Follow the implementation of vectorization below,
Code illustration: 9
The length of the encoded vector is 3239, which means the vocabulary contains 3239 words. You can also inspect the words in the vocabulary with the code below,
Code illustration: 10
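The vectorization and the vocabulary check can be sketched together on toy data. The sample comments are hypothetical; `tf_train` and `tf_test` follow the names used in the text. The vocabulary is learned from the training comments only, so words that appear only in the test data are ignored.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical pre-processed comments
X_train = ["great video", "bad video", "great tutorial"]
X_test = ["great great unknown"]

vectorizer = CountVectorizer()

# Learn the vocabulary from the training comments and encode them
tf_train = vectorizer.fit_transform(X_train)

# Encode the test comments with the vocabulary learned above
tf_test = vectorizer.transform(X_test)

# Mapping of each word to its column index in the encoded vectors
vocabulary = vectorizer.vocabulary_

print(tf_train.shape)       # (number of comments, vocabulary size)
print(sorted(vocabulary))   # words in the vocabulary
print(tf_test.toarray())    # word counts for the test comment
```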
You now have the pre-processed data, consisting of the training (tf_train, y_train) and test (tf_test, y_test) datasets. The next step is to select an appropriate machine learning algorithm for classification, here Logistic Regression.
1. Logistic Regression
Logistic regression is a predictive modeling algorithm for classification on a labeled dataset with a categorical target variable. It is a supervised machine learning algorithm.
It predicts the probability of each outcome, for binary or multi-class classification. Examples of logistic regression include spam classification, customer churn prediction, tumor prediction, etc.
These are some well-known examples, but the same approach applies to other cases such as this sentiment analysis, where there are two classes, i.e. positive (1) or negative (-1).
scikit-learn will be used to implement logistic regression for the sentiment classification. Follow the implementation below,
Code illustration: 11
After training, the accuracy score on the training dataset is 97.9 %, which means the model predicts 97.9 % of the training comments correctly. Training accuracy alone can be optimistic, so in the next step you check the accuracy score on the test dataset,
Code illustration: 12
The accuracy of the model on the test dataset is 89.7 %, which means the model predicts 89.7 % of the unseen comments correctly.
Make predictions on the test dataset with the model trained above, using the cell below.
Code illustration: 13
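The training, scoring, and prediction steps above can be sketched end to end on toy data. The comments, labels, and the `max_iter` setting are assumptions for illustration, so the scores printed here will not match the 97.9 % and 89.7 % reported for the real dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical labelled comments: 1 = positive, -1 = negative
train_comments = [
    "great video", "love tutorial", "really helpful explanation", "awesome content",
    "terrible audio", "boring video", "worst tutorial", "bad explanation",
]
y_train = [1, 1, 1, 1, -1, -1, -1, -1]
test_comments = ["helpful video", "boring audio"]
y_test = [1, -1]

# Encode the text as word counts
vectorizer = CountVectorizer()
tf_train = vectorizer.fit_transform(train_comments)
tf_test = vectorizer.transform(test_comments)

# Train the classifier on the training data
model = LogisticRegression(max_iter=1000)
model.fit(tf_train, y_train)

# Accuracy on the data the model has seen and on unseen data
train_accuracy = model.score(tf_train, y_train)
test_accuracy = model.score(tf_test, y_test)

# Predicted sentiment for each test comment
y_pred = model.predict(tf_test)

print(train_accuracy, test_accuracy, y_pred)
```

A large gap between training and test accuracy is a sign of overfitting, which is why both scores are reported.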
2. Confusion Matrix
A confusion matrix helps developers assess the performance of their model on a test dataset for which the true outputs are already known. It tabulates how many comments of each true class were predicted as each class.
In the example below you can see the elements of the matrix:
Code illustration: 14
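A minimal sketch of computing the matrix with scikit-learn. The true and predicted labels are hypothetical stand-ins for `y_test` and the model's predictions; passing `labels=[-1, 1]` fixes the row and column order so that rows are true classes and columns are predicted classes.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted sentiments
y_test = [1, 1, 1, -1, -1, -1, 1, -1]
y_pred = [1, 1, -1, -1, -1, 1, 1, -1]

# Rows: true class (-1, then 1); columns: predicted class (-1, then 1)
cm = confusion_matrix(y_test, y_pred, labels=[-1, 1])

# With this label order the flattened matrix reads TN, FP, FN, TP
tn, fp, fn, tp = cm.ravel()

print(cm)
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
```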
3. Classification Report
The classification report is another tool that summarizes the quality of predictions by reporting precision, recall, and F1 score for each class.
Code illustration: 15
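A short sketch of generating the report, using hypothetical labels in place of the real test data. The class names "negative" and "positive" are display names chosen here for the labels -1 and 1.

```python
from sklearn.metrics import classification_report

# Hypothetical true and predicted sentiments
y_test = [1, 1, 1, -1, -1, -1, 1, -1]
y_pred = [1, 1, -1, -1, -1, 1, 1, -1]

# Human-readable table with precision, recall and F1 score per class
report = classification_report(
    y_test, y_pred, labels=[-1, 1], target_names=["negative", "positive"]
)
print(report)

# The same numbers as a nested dictionary, convenient for further processing
report_dict = classification_report(
    y_test, y_pred, labels=[-1, 1], target_names=["negative", "positive"],
    output_dict=True,
)
print(report_dict["positive"]["precision"])
```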
4. F1 Score
The F1 score is equal to the harmonic mean of precision and recall, i.e. F1 = 2 × precision × recall / (precision + recall), so it is high only when both are high.
It is often more informative than accuracy when the class distribution is imbalanced.
Follow the code below to find the F1 score of your model,
Code illustration: 16
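A minimal sketch of the F1 computation on hypothetical labels, with the positive class (1) as the class of interest. It also checks the score against the harmonic mean of precision and recall and prints accuracy for comparison.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Hypothetical true and predicted sentiments
y_test = [1, 1, 1, 1, -1, -1, -1, -1]
y_pred = [1, 1, 1, -1, 1, 1, -1, -1]

# Precision and recall for the positive class
precision = precision_score(y_test, y_pred, pos_label=1)
recall = recall_score(y_test, y_pred, pos_label=1)

# F1 score as reported by scikit-learn
f1 = f1_score(y_test, y_pred, pos_label=1)

# The same value computed by hand as the harmonic mean
harmonic_mean = 2 * precision * recall / (precision + recall)

accuracy = accuracy_score(y_test, y_pred)

print(f"precision={precision:.3f} recall={recall:.3f}")
print(f"f1={f1:.3f} harmonic_mean={harmonic_mean:.3f} accuracy={accuracy:.3f}")
```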
Using this model further can be valuable for finding the most beneficial video among the bulk of videos on YouTube, for example when you have to choose the best machine learning tutorial out of many videos.
For the complete code, go through this GitHub repository.
A machine learning model has been trained for sentiment analysis of YouTube comments after pre-processing the dataset. Preprocessing includes data labeling, lowercasing of the text, stop-word removal, data splitting, and feature extraction.
For classifying sentiment into two classes, positive and negative, the Logistic Regression classification algorithm was used. It achieved an accuracy score of 97.9 % on training data and 89.73 % on test data, with an F1 score of 0.904, which complements accuracy as a measure of model quality.