Sentiment Analysis of YouTube Comments

  • Ripul Agrawal
  • Jun 15, 2020
  • NLP
  • Updated on: Jun 15, 2020

Sentiment analysis is a Natural Language Processing technique used to determine the sentiment behind a piece of text, e.g. tweets, movie reviews, YouTube comments, or any incoming message.

 

It is used on a large scale: many big firms use it to examine customer reviews of their products and services on social media and on their own websites, which helps them maintain their brand value. Some of the big corporations using this technique to improve customer engagement in their service areas are:

 

  • Trip Advisor

  • Google

  • Apple

  • KFC

 

Other applications of sentiment analysis include:

 


In this blog, you will perform sentiment analysis on the comments of a YouTube video. To do so, the first thing you need is the comments on that video, which can be extracted by following this blog.

 

If you have missed that step, follow it first to extract the comments from YouTube and carry out the pre-processing described there.

 

With the help of sentiment analysis of the comments, a creator can learn how well the community accepts their channel or video and, based on that, maintain the quality of their content.

 

Another use case is analyzing trending videos. Many videos trend because they have more views and likes, but with sentiment analysis you can more easily find the most useful video for a particular channel, celebrity, category, etc.


 

Dataset

 

The dataset has been prepared and pre-processed by removing emojis from all of the comments and keeping only English comments, as discussed in the previous blog.

 

For better model results, we have combined the comments of three videos into one dataset. It contains the video ID, the comment ID, and the comment text from YouTube, as shown here:


This image shows an overview of the dataset and the features present in it.

Dataset Presentation


 

Data Pre-processing

 

Data preprocessing is one of the most important steps in building any machine learning model, as it directly affects the model's results. Careful pre-processing generally helps the model perform more accurately.

 

1. Data Labelling

 

The dataset is unlabelled because the API lets you extract only the comments, not their polarity. Polarity is a score that indicates the emotion of a sentence based on the words present in it. It can be computed using the TextBlob module for Python, which exposes a polarity score as follows,


The code demonstrates the finding of the polarity of all the comments.

Code illustration: 1


Now that TextBlob has computed the polarity score, you can use a threshold to derive a new feature, i.e., the sentiment of each comment, either positive (1) or negative (-1), as seen below:


The code is demonstrating the data labeling of all the comments using their individual polarity by some threshold.

Code illustration: 2
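The threshold idea can be sketched in a few lines of plain Python. The threshold of 0, the handling of exactly neutral scores, and the name label_sentiment are assumptions for illustration, since the article does not state the cut-off it used.

```python
from collections import Counter


def label_sentiment(polarity: float, threshold: float = 0.0) -> int:
    """Map a polarity score to 1 (positive) or -1 (negative)."""
    # Assumption: scores above the threshold are positive; everything else,
    # including exactly neutral scores, is labelled negative.
    return 1 if polarity > threshold else -1


# Hypothetical polarity scores, e.g. as produced by TextBlob.
polarities = [0.65, -0.40, 0.00, 0.10, -0.85]
sentiments = [label_sentiment(p) for p in polarities]

print(sentiments)           # [1, -1, -1, 1, -1]
print(Counter(sentiments))  # how many comments fall into each class
```

Counting the labels, as in the last line, is a quick way to check whether the two classes are roughly balanced.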


 


The code is demonstrating the distribution of sentiments i.e. positive and negative in the dataset.

Code illustration: 3


From the distribution above, you can see that almost half of the comments are positive and the other half are negative.

 

2. LowerCase the Comments

 

The next data preprocessing step is lowercasing the words in every comment. This matters because, without it, the system treats differently cased forms of the same word (e.g. "Good" and "good") as two different words, which adds redundant information and can lead to results different from the desired ones.


This code is demonstrating the lower-casing of all the text i.e. comments as the second step of data preprocessing.

Code illustration: 4


 

3. Strip

 

Now, using Python's strip() method, remove all the leading and trailing spaces from the comments.


This code is demonstrating the removal of all the leading and trailing spaces from the comments.

Code illustration: 5
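The two normalisation steps above, lowercasing and stripping, can be sketched together on a plain list of strings. The sample comments are made up; on a pandas column the equivalent calls would be Series.str.lower() and Series.str.strip().

```python
# Hypothetical raw comments with mixed case and stray whitespace.
comments = ["  Great VIDEO  ", "great video", "\tGREAT Video\n"]

# str.lower() folds case; str.strip() removes leading and trailing whitespace.
cleaned = [text.lower().strip() for text in comments]

print(cleaned)            # ['great video', 'great video', 'great video']
print(len(set(cleaned)))  # 1
```

After both steps, the three variants collapse into a single form, so the model no longer sees them as different texts.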


 

4. Stop Words

 

Stop words are words of less importance, i.e., commonly used words like is, am, the, are, a, etc., which don't add any useful information for the analysis. By removing them, the data becomes more concise, with fewer but more significant features.

 

Stopword removal will be done using the nltk module, as it provides lists of stopwords in different languages. Follow the code below to remove stopwords and write the updated comments into a new column/feature,


This code is demonstrating the removal of stop-words from the comments using "stopwords" by "nltk".

Code illustration: 6
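As a rough sketch of the filtering itself, the example below uses a tiny hand-written stop-word set as a stand-in, so it runs without downloading anything. With nltk, nltk.corpus.stopwords.words("english") returns the full English list once the corpus has been fetched with nltk.download("stopwords"). The function and variable names are illustrative only.

```python
# Assumption: a tiny hand-written stop-word set standing in for nltk's list.
STOP_WORDS = {"is", "am", "the", "are", "a", "this", "i", "of"}


def remove_stop_words(comment: str, stop_words=STOP_WORDS) -> str:
    """Drop stop words from a lowercased, whitespace-separated comment."""
    kept = [word for word in comment.split() if word not in stop_words]
    return " ".join(kept)


comments = ["this is a great video", "i am the biggest fan of the channel"]
comments_clean = [remove_stop_words(c) for c in comments]

print(comments_clean)  # ['great video', 'biggest fan channel']
```

Using a set for the stop words keeps each membership check fast, which matters once the dataset holds thousands of comments.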


 


This code is demonstrating the dataset after some data pre-processing.

Code illustration: 7


5. Data Splitting

 

So far you have been working with a single dataset, but to make predictions and test the accuracy of the trained model, you need a test/validation dataset.

 

So, split the dataset into training data and test data using the train_test_split function of scikit-learn. It splits the dataset into two parts in the required proportion. After you run the cell below, two datasets will be available, i.e. (X_train, y_train) and (X_test, y_test).


This code is demonstrating the data splitting into training and test dataset using "train_test_split" by "sklearn".

Code illustration: 8
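A minimal sketch of the split, assuming scikit-learn is installed. The toy comments, the 80/20 proportion, the random_state, and the use of stratify are assumptions for illustration; the article does not state the settings it used.

```python
from sklearn.model_selection import train_test_split

# Hypothetical pre-processed comments and their sentiment labels.
comments = ["great video", "bad audio", "love tutorial", "boring content",
            "really helpful", "waste time", "awesome work", "terrible quality",
            "nice explanation", "poor editing"]
labels = [1, -1, 1, -1, 1, -1, 1, -1, 1, -1]

# Assumption: an 80/20 split. stratify keeps the positive/negative ratio
# the same in both parts; random_state makes the shuffle reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    comments, labels, test_size=0.2, random_state=42, stratify=labels
)

print(len(X_train), len(X_test))  # 8 2
```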


6. Feature Extraction from Text Data

 

Now it's time to extract features from the text data, i.e. convert it to integer or floating-point values, as a machine learning model can't be applied directly to text.

 

Use the CountVectorizer class of scikit-learn, which creates a vocabulary from the text data and stores the count of each word, i.e. how many times it appears in each text.

 

CountVectorizer mainly performs three basic steps,

 

  • Tokenization: Tokenize the text into words,

  • Vocabulary: Build vocabulary with all the words present in the text/document,

  • Encoding: Encode each document as a vector with the same length as the vocabulary.
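To make the three steps concrete, here is a hand-rolled sketch in plain Python. It is not how scikit-learn is implemented internally, only an illustration of the idea; the token pattern mirrors CountVectorizer's default, which keeps tokens of two or more word characters.

```python
import re

docs = ["great video great content", "bad video"]

# 1. Tokenization: split each document into lowercase word tokens.
tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in docs]

# 2. Vocabulary: map every distinct word to a column index (sorted order).
all_words = sorted({word for tokens in tokenized for word in tokens})
vocabulary = {word: index for index, word in enumerate(all_words)}

# 3. Encoding: one count vector per document, as long as the vocabulary.
vectors = []
for tokens in tokenized:
    vector = [0] * len(vocabulary)
    for word in tokens:
        vector[vocabulary[word]] += 1
    vectors.append(vector)

print(vocabulary)  # {'bad': 0, 'content': 1, 'great': 2, 'video': 3}
print(vectors)     # [[0, 1, 2, 1], [1, 0, 0, 1]]
```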

 

Follow the below implementation of vectorization,


This code is demonstrating the feature extraction from all the comments so as to apply machine learning algorithm.

Code illustration: 9
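A short sketch of the vectorization with scikit-learn, using toy stand-ins for the X_train and X_test comment lists. Fitting the vocabulary on the training comments only and then reusing it for the test comments is an assumption about the workflow, suggested by the tf_train / tf_test names used later in the article.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-ins for the X_train / X_test comment lists.
X_train = ["great video great content", "bad video", "love content"]
X_test = ["great unseen video"]

vectorizer = CountVectorizer()
# Learn the vocabulary on the training comments only, then encode both sets.
tf_train = vectorizer.fit_transform(X_train)
tf_test = vectorizer.transform(X_test)

print(tf_train.shape)          # (3, 5): 3 comments, 5 vocabulary words
print(vectorizer.vocabulary_)  # mapping of word -> column index
print(tf_test.toarray())       # [[0 0 1 0 1]]; "unseen" is ignored
```

Words that appear only in the test comments are dropped at transform time, because they have no column in the learned vocabulary.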


The length of the encoded vector is 3239, which means the vocabulary contains 3239 words. You can also inspect the words present in the vocabulary with the code below,


This code is demonstrating the step to print the vocabulary created using the vectorizer from the text, i.e. the comments.

Code illustration: 10



 

Sentiment Classification

 

Now you have the pre-processed data, consisting of the training (tf_train, y_train) and test (tf_test, y_test) datasets. The next step is to select an appropriate machine learning algorithm for classification, in this case Logistic Regression.

 

1. Logistic Regression

 

It's a predictive modeling algorithm for classification, used where there is a labeled dataset with a categorical target variable. It falls into the category of supervised machine learning algorithms.

 

It predicts the probability of each outcome and can be used for binary or multi-class classification. Examples of logistic regression applications include spam classification, customer churn prediction, tumor prediction, etc.

 

These are some well-known examples, but you can apply it to other cases too, as in this instance of sentiment analysis, where there are two classes, i.e. positive (1) and negative (-1).

 

scikit-learn will be used to implement logistic regression for the sentiment classification. Follow the below implementation,


This code is demonstrating the model training for sentiment analysis, using a logistic regression algorithm.

Code illustration: 11


After training, the accuracy score on the training dataset is 97.9%, meaning the model predicts the correct label for 97.9% of the comments it was trained on. In the next step, you can find the accuracy score on the test dataset,


This code is demonstrating the accuracy score of logistic regression model on the test dataset.

Code illustration: 12


The accuracy of the model on the test dataset is 89.7%, which means the model predicts the correct sentiment for 89.7% of the unseen comments.

Make predictions on the test dataset with the model trained above, using the below cell.


This code is demonstrating the predictions of the sentiments on the test dataset using a model trained.

Code illustration: 13
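The three steps above (training, scoring, and predicting) can be sketched end to end as follows, assuming scikit-learn is installed. The toy comments and labels are made up, and LogisticRegression is used with its default settings, since the article does not state any hyperparameters; the accuracy figures quoted in this post come from the real dataset, not from this toy example.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data standing in for the real comments and labels.
X_train = ["great video", "love this tutorial", "awesome helpful content",
           "bad audio", "hate this boring video", "terrible waste"]
y_train = [1, 1, 1, -1, -1, -1]
X_test = ["great helpful tutorial", "boring terrible audio"]
y_test = [1, -1]

vectorizer = CountVectorizer()
tf_train = vectorizer.fit_transform(X_train)
tf_test = vectorizer.transform(X_test)

# Assumption: default hyperparameters.
model = LogisticRegression()
model.fit(tf_train, y_train)

# score() returns mean accuracy: the fraction of correctly predicted labels.
train_accuracy = model.score(tf_train, y_train)
test_accuracy = model.score(tf_test, y_test)
predictions = model.predict(tf_test)

print(f"train accuracy: {train_accuracy:.3f}")
print(f"test accuracy:  {test_accuracy:.3f}")
print(predictions)
```

A large gap between the training and test accuracy is a hint that the model is overfitting the training comments.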


 

2. Confusion Matrix

 

A confusion matrix helps developers understand the performance of their model on a test dataset for which the true labels are already known.

 

In the example below you can see the elements present in the matrix:


This image is displaying the example of the confusion matrix with the elements to be present.

Confusion Matrix


 

This code is demonstrating the calculation and plotting of the confusion matrix

Code illustration: 14
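To show where the four cells of the matrix come from, here is a small sketch that counts them by hand. The labels and predictions are made up; sklearn.metrics.confusion_matrix(y_true, y_pred) would return the same layout (rows are true classes, columns are predicted classes, in sorted label order).

```python
# Hypothetical true labels and model predictions for eight test comments.
y_true = [1, 1, 1, -1, -1, 1, -1, -1]
y_pred = [1, 1, -1, -1, 1, 1, -1, -1]

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)    # true positives
tn = sum(1 for t, p in pairs if t == -1 and p == -1)  # true negatives
fp = sum(1 for t, p in pairs if t == -1 and p == 1)   # false positives
fn = sum(1 for t, p in pairs if t == 1 and p == -1)   # false negatives

# Rows are true classes, columns are predicted classes, ordered (-1, 1).
matrix = [[tn, fp],
          [fn, tp]]
print(matrix)  # [[3, 1], [1, 3]]
```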


3. Classification Report

 

This is another evaluation tool that summarizes the quality of predictions by reporting precision, recall, and F1 score for each class.


This code is demonstrating the calculation of the classification report, i.e. one of the evaluation metrics for the model trained.

Code illustration: 15
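A minimal sketch of the report, assuming scikit-learn's classification_report and made-up labels and predictions. The target_names are illustrative display names for the -1 and 1 classes.

```python
from sklearn.metrics import classification_report

# Hypothetical true labels and model predictions for eight test comments.
y_true = [1, 1, 1, -1, -1, 1, -1, -1]
y_pred = [1, 1, -1, -1, 1, 1, -1, -1]

# labels fixes the row order; target_names gives the rows readable names.
report = classification_report(
    y_true, y_pred, labels=[-1, 1], target_names=["negative", "positive"]
)
print(report)
```

The report lists precision, recall, F1 score, and support (the number of true examples) for each class, followed by overall averages.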


 

4. F1 Score

 

It can give a better measure of the incorrectly classified cases than the accuracy metric. It is equal to the harmonic mean of precision and recall.

 

Another advantage is that it provides a more informative evaluation than accuracy when the class distribution is imbalanced:


F1 = 2 × (Precision × Recall) / (Precision + Recall)

F1 score


Follow the below code to find the f1 score of your model,


This code is demonstrating the calculation of the F1 score of the model.

Code illustration: 16
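As a worked example of the harmonic-mean definition, the sketch below computes the F1 score directly from confusion-matrix counts. The function name and the counts are made up for illustration; sklearn.metrics.f1_score computes the same quantity from the label arrays.

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 score as the harmonic mean of precision and recall."""
    # Assumption: at least one true positive, so neither precision nor
    # recall is zero and no division by zero can occur.
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# 3 true positives, 1 false positive, 1 false negative:
# precision = 0.75 and recall = 0.75, so F1 = 0.75.
print(f1_from_counts(tp=3, fp=1, fn=1))  # 0.75
```

Because the harmonic mean is pulled towards the smaller of the two values, a model cannot reach a high F1 score with good precision but poor recall, or the other way round.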


Using this model further can be valuable for finding the most beneficial video among the bulk of videos on YouTube, for example when you have to choose the best machine learning tutorial out of many videos.


 

For the complete code, go through this GitHub repository.

 

Conclusion

 

A machine learning model has been trained for sentiment analysis of YouTube comments after pre-processing the dataset. Preprocessing includes data labeling, lowercasing of the text, stripping whitespace, stopword removal, data splitting, and feature extraction.


For the sentiment classification into two classes, positive and negative, the Logistic Regression classification algorithm was used. It achieved an accuracy score of 97.9% on the training data and 89.73% on the test data, with an F1 score of 0.904, which is a more informative measure than accuracy when the classes are imbalanced.


Comments

  • rajkumarsingh1985

    Jun 17, 2020

    hi , can you please share this data set , please rajkumarsingh1985@gmail.com

    Ripul Agrawal

    Jun 17, 2020

    You can follow the guidelines mention above to get the dataset, that's an easy process.