Sentiment Analysis of YouTube Comments

Ripul Agrawal
Jun 15, 2020
Updated on: Nov 21, 2020

Sentiment analysis is a Natural Language Processing technique used to determine the sentiment behind a piece of text, e.g. tweets, movie reviews, YouTube comments, or any incoming message.

For example, the Grammarly extension corrects the grammar in a document or text, and it also reports how the document sounds overall, giving feedback such as informative, positive, optimistic, and more. This feedback also relies on sentiment analysis of the whole document or text.

While other deep learning and NLP models are also used to implement sentiment analysis, recurrent neural networks are widely used for it.

It is in use on a large scale: many big firms use it to examine customer reviews of their products and services on social media and on their own websites, which helps them maintain their brand value. Some of the big corporations using this technique to improve customer engagement with their services:

Trip Advisor
Google
Apple
KFC

Other applications of sentiment analysis include:

Twitter sentiment analysis,
IMDB movie ratings,
Amazon customer reviews,
YouTube video comments

In this blog, you will perform sentiment analysis on the comments of a YouTube video. The first thing you need for this is the comments on that video, which can be extracted by following this blog.

If you have missed that step, follow it first to extract the comments from YouTube and pre-process them.

With sentiment analysis of the comments, creators can gauge how the community receives their channel or video and maintain their content quality accordingly.

Another use case is analyzing trending videos. Videos often trend because of high view and like counts, but sentiment analysis of the comments can help you find the most useful video of a particular channel, celebrity, category, etc.

Dataset

The dataset was prepared and pre-processed by removing emojis from all of the comments and keeping only English comments, as discussed in the previous blog.

For better model results, we have combined the comments of three videos into one dataset. It contains the video ID, comment ID, and comment text from YouTube, as shown here,

Dataset Presentation

Data Pre-processing

Data preprocessing is one of the most important steps in building any machine learning model, as it directly affects the model's results. Careful preprocessing generally makes the model more accurate.

1. Data Labelling

The dataset is unlabelled because the API returns only the comments, not their polarity. Polarity is a score that indicates the emotion of a sentence based on the words present in it. It can be computed using the TextBlob module for Python, which exposes a polarity score as follows,

Code illustration: 1

Once TextBlob has found the polarity score, you can apply a threshold to derive a new feature, the sentiment of each comment, as either positive (1) or negative (-1), as seen below:

Code illustration: 2

Code illustration: 3
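The thresholding and class count could be sketched as follows. The threshold of 0.0, the choice to treat a score of exactly 0 as negative, and the sample scores are all assumptions made for this example; the original code may use a different rule.

```python
from collections import Counter


def label_sentiment(polarity: float, threshold: float = 0.0) -> int:
    """Map a polarity score to 1 (positive) or -1 (negative).

    Assumption: only scores strictly above `threshold` count as positive.
    """
    return 1 if polarity > threshold else -1


# Hypothetical polarity scores, e.g. as returned by TextBlob.
polarities = [0.8, -0.5, 0.1, 0.0, -0.9, 0.35]
sentiments = [label_sentiment(p) for p in polarities]

# Count how many comments fall into each class.
class_counts = Counter(sentiments)
print(sentiments)    # [1, -1, 1, -1, -1, 1]
print(class_counts)
```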

From the above code, you can see that almost half of the comments are positive and the remaining half are negative.

2. LowerCase the Comments

The next preprocessing step is lowercasing the words in every comment. This matters because otherwise the system treats differently cased forms of the same word, such as "Good" and "good", as two different words, which adds redundant information and can lead to results different from the desired ones.

Code illustration: 4
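A minimal sketch of lowercasing with pandas is shown below. The DataFrame and the column name comment are assumptions for illustration, not the names used in the original dataset.

```python
import pandas as pd

# Hypothetical data; the column name "comment" is an assumption.
df = pd.DataFrame({"comment": ["Great Video", "great video", "GREAT VIDEO"]})

# Before lowercasing, the three spellings count as three distinct values.
print(df["comment"].nunique())  # 3

# Lowercase every comment in the column.
df["comment"] = df["comment"].str.lower()
print(df["comment"].nunique())  # 1
```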

3. Strip

Now use Python's strip() method to remove all leading and trailing whitespace from the comments.

Code illustration: 5
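Applied to a pandas column, stripping might look like the sketch below. The column name comment and the sample values are hypothetical.

```python
import pandas as pd

# Hypothetical comments with stray spaces, tabs and newlines.
df = pd.DataFrame({"comment": ["  nice explanation ", "thanks\n", "\tsubscribed"]})

# .str.strip() removes leading and trailing whitespace from each value.
df["comment"] = df["comment"].str.strip()
print(df["comment"].tolist())  # ['nice explanation', 'thanks', 'subscribed']
```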

4. Stop Words

Stop words are words of little importance, i.e., commonly used words like is, am, the, are, a, etc., which add no useful information for the analysis. Removing them makes the data more concise, with fewer but more significant features.

Stopword removal is done using the nltk module, which provides lists of stopwords in different languages. Follow the below code to remove stopwords and write the updated comments into a new column/feature,

Code illustration: 6

Code illustration: 7
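The idea can be sketched without any downloads, as below. Note the swap: instead of nltk's stopword list, this example uses a small hand-written set so that it runs offline. The function name remove_stopwords and the sample comments are hypothetical.

```python
# Small hand-written stop word set so the sketch runs offline. With nltk you
# would instead build the set from:
#   from nltk.corpus import stopwords; stopwords.words("english")
# after a one-time nltk.download("stopwords").
STOP_WORDS = {"is", "am", "the", "are", "a", "an", "this", "of", "and", "to"}


def remove_stopwords(comment: str, stop_words: set = STOP_WORDS) -> str:
    """Drop stop words from a lowercased, whitespace-separated comment."""
    kept = [word for word in comment.split() if word not in stop_words]
    return " ".join(kept)


comments = ["this is the best tutorial", "the audio is a bit low"]
cleaned = [remove_stopwords(c) for c in comments]
print(cleaned)  # ['best tutorial', 'audio bit low']
```

In a DataFrame, the same function could be applied to the comment column and the result stored in a new column.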

5. Data Splitting

So far, you have been working with a single dataset, but to test the accuracy of the trained model you need a separate test/validation dataset.

So, split the dataset into training data and test data using the train_test_split function of scikit-learn. It splits the dataset into two parts in the required proportion. After you run the below cell, two datasets will be available, i.e. (X_train, y_train) & (X_test, y_test).

Code illustration: 8
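A split along these lines could be sketched as follows. The 20% test size, the fixed random seed, the stratification by label, and the sample data are assumptions for this example; the original split proportion is not stated.

```python
from sklearn.model_selection import train_test_split

# Hypothetical cleaned comments and their labels (1 = positive, -1 = negative).
comments = ["great video", "bad audio", "love it", "waste of time",
            "very helpful", "boring", "nice work", "not good",
            "awesome tutorial", "poor quality"]
labels = [1, -1, 1, -1, 1, -1, 1, -1, 1, -1]

# Assumptions: a 20% test split, stratified by label, with a fixed seed.
X_train, X_test, y_train, y_test = train_test_split(
    comments, labels, test_size=0.2, random_state=42, stratify=labels
)
print(len(X_train), len(X_test))  # 8 2
```

Stratifying keeps the positive/negative ratio roughly the same in both parts, which matters when the classes are not perfectly balanced.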

6. Feature Extraction from Text Data

Now it's time to extract features from the text data, i.e. convert it to integer or floating-point values, as a machine learning model can't be applied directly to text.

Use the CountVectorizer class from scikit-learn, which builds a vocabulary from the text data and counts how many times each word appears in each document.

CountVectorizer mainly performs three basic steps,

Tokenization: Tokenize the text into words,
Vocabulary: Build vocabulary with all the words present in the text/document,
Encoding: Encode each document as a vector with the same length as the vocabulary.

Follow the below implementation of vectorization,

Code illustration: 9
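The three steps can be seen in a small sketch like the one below. The sample comments are hypothetical stand-ins for X_train and X_test, and default CountVectorizer settings are assumed.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical train/test comments standing in for X_train and X_test.
X_train = ["great video", "bad video", "great great tutorial"]
X_test = ["great unseen tutorial"]

vectorizer = CountVectorizer()
# Learn the vocabulary on the training comments only, then encode them.
tf_train = vectorizer.fit_transform(X_train)
# Reuse that vocabulary for the test comments; unknown words are ignored.
tf_test = vectorizer.transform(X_test)

print(tf_train.shape)     # (3, 4)
print(tf_train.toarray())
print(tf_test.toarray())  # [[0 1 1 0]]
```

Fitting on the training data only keeps information from the test comments out of the vocabulary.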

The length of the encoded vector is 3239, which means the vocabulary contains 3239 words. You can also inspect the words present in the vocabulary by following the below code,

Code illustration: 10
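One way to inspect the vocabulary is through the fitted vectorizer's vocabulary_ attribute, as in this sketch on hypothetical comments:

```python
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
vectorizer.fit(["great video", "bad video", "great tutorial"])

# vocabulary_ maps each word to its column index in the encoded vectors.
vocabulary = vectorizer.vocabulary_
print(len(vocabulary))             # 4
print(sorted(vocabulary.items()))  # [('bad', 0), ('great', 1), ('tutorial', 2), ('video', 3)]
```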

Sentiment Classification

You now have pre-processed data consisting of the training (tf_train, y_train) and test (tf_test, y_test) datasets. The next step is to select an appropriate machine learning algorithm for classification, here Logistic Regression.

1. Logistic Regression

It's a predictive modeling algorithm for classification, used when there is a labeled dataset with a categorical target variable. It is a supervised machine learning algorithm.

It predicts the probability of each outcome and can be used for binary or multi-class classification. Examples of logistic regression include spam classification, customer churn prediction, tumor prediction, etc.

These are some well-known examples, and you can apply it to other cases too, such as this instance of sentiment analysis, where there are two classes, i.e. positive (1) and negative (-1).

scikit-learn is used to implement logistic regression for the sentiment classification. Follow the below implementation,

Code illustration: 11

After model training, the accuracy score on the training dataset is 97.9 %, which means the model predicts the correct label for 97.9 % of the training comments. Training accuracy alone does not show how well the model generalizes, so in the next step you can find the accuracy score on the test dataset,

Code illustration: 12

The accuracy of the model on the test dataset is 89.7 %, which means the model predicts the correct label for 89.7 % of the unseen comments.

Make predictions on the test dataset with the model trained above, using the below cell.

Code illustration: 13
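The training, scoring and prediction steps above can be sketched together as follows. The comments and labels are a tiny hypothetical dataset and default LogisticRegression settings are assumed, so the accuracies printed here will not match the 97.9 % and 89.7 % reported for the real data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical labelled comments (1 = positive, -1 = negative).
X_train = ["great video", "love this tutorial", "very helpful", "nice work",
           "bad audio", "waste of time", "boring video", "poor quality"]
y_train = [1, 1, 1, 1, -1, -1, -1, -1]
X_test = ["great tutorial", "boring audio"]
y_test = [1, -1]

# Encode the text as word counts, fitting the vocabulary on training data only.
vectorizer = CountVectorizer()
tf_train = vectorizer.fit_transform(X_train)
tf_test = vectorizer.transform(X_test)

# Train the classifier on the encoded training comments.
model = LogisticRegression()
model.fit(tf_train, y_train)

# score() returns the mean accuracy on the given data and labels.
train_accuracy = model.score(tf_train, y_train)
test_accuracy = model.score(tf_test, y_test)

# predict() returns one label (1 or -1) per test comment.
predictions = model.predict(tf_test)
print(train_accuracy, test_accuracy, predictions)
```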

2. Confusion Matrix

A confusion matrix shows developers how their model performs on a test dataset for which the true labels are already known.

In the below example you can see the elements present in the matrix:

Confusion Matrix

Code illustration: 14
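As a small worked example, the sketch below builds a confusion matrix from hypothetical true labels and predictions; these values are not the article's results.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true labels and model predictions (1 = positive, -1 = negative).
y_test = [1, 1, 1, -1, -1, -1, 1, -1]
predictions = [1, 1, -1, -1, -1, 1, 1, -1]

# Rows are true classes, columns are predicted classes, in `labels` order.
matrix = confusion_matrix(y_test, predictions, labels=[-1, 1])
print(matrix)
# [[3 1]
#  [1 3]]
```

Here the first row says that three negative comments were classified correctly and one was mistaken for positive; the second row says one positive comment was missed and three were found.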

3. Classification Report

This is another report that describes the quality of the predictions through precision, recall, and F1 score.

Code illustration: 15
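A report of this kind can be produced with scikit-learn's classification_report, sketched here on hypothetical labels and predictions. The class names negative and positive are chosen for readability.

```python
from sklearn.metrics import classification_report

# Hypothetical true labels and model predictions (1 = positive, -1 = negative).
y_test = [1, 1, 1, -1, -1, -1, 1, -1]
predictions = [1, 1, -1, -1, -1, 1, 1, -1]

# One row per class with its precision, recall, F1 score and support.
report = classification_report(
    y_test, predictions, labels=[-1, 1], target_names=["negative", "positive"]
)
print(report)
```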

4. F1 Score

It is often a better measure of classification performance than accuracy. It is equal to the harmonic mean of precision & recall.

Another advantage is that it gives a more informative evaluation than accuracy when the class distribution is imbalanced:

F1 score

Follow the below code to find the F1 score of your model,

Code illustration: 16
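To make the harmonic-mean definition concrete, the sketch below computes F1 = 2 * precision * recall / (precision + recall) by hand and compares it with scikit-learn's f1_score. The labels and predictions are hypothetical, and the positive class (1) is taken as the class of interest.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Hypothetical true labels and model predictions (1 = positive, -1 = negative).
y_test = [1, 1, 1, 1, -1, -1, -1, -1]
predictions = [1, 1, 1, -1, -1, -1, 1, 1]

# Precision: 3 of the 5 predicted positives are correct.
precision = precision_score(y_test, predictions, pos_label=1)
# Recall: 3 of the 4 actual positives are found.
recall = recall_score(y_test, predictions, pos_label=1)

# Harmonic mean of precision and recall, computed by hand.
manual_f1 = 2 * precision * recall / (precision + recall)
# The same value from scikit-learn.
f1 = f1_score(y_test, predictions, pos_label=1)

print(round(precision, 3), round(recall, 3), round(f1, 3))  # 0.6 0.75 0.667
```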

This model can be valuable for finding the most useful video among the many videos on YouTube, for example when you have to choose the best machine learning tutorial out of many.

For the complete code, go through this GitHub repository.

Conclusion

A machine learning model has been trained for sentiment analysis of YouTube comments after pre-processing the dataset. Preprocessing includes data labeling, lowercasing of the text, stopword removal, data splitting, and feature extraction.

For classifying sentiment into two classes, positive and negative, Logistic Regression, a machine learning classification algorithm, was used. It achieved an accuracy score of 97.9 % on training data and 89.73 % on test data, with an F1 score of 0.904, a metric that is less sensitive to class imbalance than accuracy.

Latest Comments

rajkumarsingh1985

Jun 17, 2020

hi , can you please share this data set , please rajkumarsingh1985@gmail.com

1 Reply

Ripul Agrawal

Jun 17, 2020

You can follow the guidelines mention above to get the dataset, that's an easy process.

varshneynehal30

Nov 19, 2020

Where can i get the dataset ?

1 Reply

Ripul Agrawal

Dec 08, 2020

Hey, sorry for the delay! I have collected data from youtube itself using YouTube API. You can follow the steps mentioned above. Let me know if you faces any issue.

360digitmgsk

Dec 02, 2020

Very nice blog and articles. I am realy very happy to visit your blog. Now I am found which I actually want. I check your blog everyday and try to learn something from your blog. Thank you and waiting for your new post.

1 Reply

Ripul Agrawal

Dec 08, 2020

I am glad to know that. Thanks! You can find my other blogs too on the same page

Ripul Agrawal

Dec 09, 2020

For any queries, please reach out to me at ripulagrawal98@gmail.com

mosman.hsrw.bd

Sep 23, 2021

Hallo, thanks for your wonderful article. Sent you a query please check your email.
