Extracting & Pre-processing the YouTube Comments

  • Ripul Agrawal
  • Jun 10, 2020
  • Machine Learning
  • Updated on: Jun 09, 2020

The previous blog covered the extraction of YouTube channel data using the YouTube Data API v3, including the channel title, channel ID, channel videos, video titles, comments, likes, and view counts.


In continuation of that blog, it's time to get insights into the comments viewers post on particular videos or channels. Once you extract the comments from YouTube, whether for particular videos, videos in a particular category, or an entire channel, you can analyze them further using the likes/dislikes on each comment or by running sentiment analysis on the comment text.

 

All of this helps you understand how people are reacting to the channel or its videos, and you can gauge community acceptance by analyzing the comments. If you want to go deeper, you can also look for a relationship between comments and views and estimate viewer engagement on your videos.

 

Prerequisite

 

  • Google Account

  • Python 3

  • Anaconda or Google Colab: If you want to use your local machine, install Anaconda and start a Jupyter Notebook there. Alternatively, use Google Colab to save installation time and memory usage on your local machine, as it runs in the cloud and provides a GPU.


 

Getting Started with YouTube API

 

The last blog covered the activation of the YouTube API from the Google console; if you missed it, read it first for the API key generation. The difference is that it covered only the activation of the API, without setting up the OAuth 2.0 consent screen.

 

OAuth 2.0 is used for authentication and authorization when the API is used to make changes on the channel, e.g. to reply to a comment or to insert/delete a video on YouTube directly from code.

 

Setting up OAuth 2.0 

 

Some of the steps are the same as in the previous API key activation, while the new steps are described below. You can also refer to the official documentation for reference.

 

While you are on the APIs & Services dashboard enabling the YouTube API, first click on the OAuth consent screen, as shown in the following figure.


The Image is showing the API dashboard to set credentials & OAuth screen.

Google Developer Dashboard


 

Once clicked, it will take you to the following page; fill in the application name and the email address linked with your Google account, then save the details.

 


    The image is showing the OAuth Consent screen setup page.

OAuth 2.0 Consent Screen



The next step is to create credentials: click Create Credentials, then select OAuth client ID.


    This image is showing the Credentials page where the OAuth client ID will be generated for the authorization.

Credentials Page - click on OAuth client ID


Select the application type as Desktop app from the dropdown, enter the project name, and click Create.


     This image is displaying the steps, to add project details for authorization.

Create an OAuth client ID


 

The last step is to download the JSON file by clicking the download icon in the OAuth 2.0 client IDs section, as in the figure below. Rename it client_secret.json and save it in the same path where you will store your code.


 

       This image is showing the page, from where OAuth 2.0 client ID can be downloaded.

Download - OAuth Client ID JSON file



 

Extract YouTube Comments

 

  1. Link Google Colab with Google Drive.

  2. Mount Google Drive on Google Colab.

  3. Install the required libraries. Last time you used google-api-python-client to access the YouTube data and pandas for the analysis. In addition to these, you now need some extra libraries, including demoji and langdetect.

 

 

     Follow the below code for installation.


This code image demonstrates the installation of packages used for comments extraction and data cleaning. 

Code illustration: 1
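The installation step can be sketched as a single pip command (the package names are assumed from the libraries used later in this post; google-auth-oauthlib is an assumption for the OAuth flow used below):

```shell
pip install google-api-python-client google-auth-oauthlib pandas demoji langdetect
```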


 

  4. Import the installed packages

 


 This code demonstrates importing the packages used for further pre-processing of the comments.

Code illustration: 2
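A plausible set of imports for the rest of the walkthrough, assuming the packages above are installed (the exact import style may differ from the original notebook):

```python
# Google API client and OAuth helper
from googleapiclient.discovery import build
from google_auth_oauthlib.flow import InstalledAppFlow

# Data handling and cleaning
import re                      # special-character removal
import pandas as pd            # tabular analysis
import demoji                  # emoji removal
from langdetect import detect  # language detection
```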


 

  5. Restrict Access and set YouTube Parameters

 

First, specify the path to the credentials file named client_secret.json, then restrict the API's access to YouTube only by specifying the scopes, followed by setting the YouTube parameters.

 


The code demonstrates specifying the path of the credentials file and restricting the API's access by setting its scope.

Code illustration: 3
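The setup can be sketched like this; the file name matches the client_secret.json saved earlier, while the youtube.force-ssl scope (which covers reading and managing comments) is an assumption about the exact scope used:

```python
# Path to the OAuth client secret downloaded from the Google console
CLIENT_SECRET_FILE = "client_secret.json"

# Restrict the token to YouTube operations only
SCOPES = ["https://www.googleapis.com/auth/youtube.force-ssl"]

# YouTube Data API parameters
API_SERVICE_NAME = "youtube"
API_VERSION = "v3"
```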


 

  6. Build the service and get the access token

 

Follow the code below to build the service so that you can use the API to extract YouTube comments. After you run the cell, click the URL in the cell's output, get the access token, and continue.

 


The code demonstrates building the service required to use the API. This step also asks for authorization with your Google account.

Code illustration: 4
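A minimal sketch of building the service with google-auth-oauthlib, assuming the constants from the previous step; run_console() prints the authorization URL whose token you paste back (newer library versions replace it with run_local_server()):

```python
from google_auth_oauthlib.flow import InstalledAppFlow
from googleapiclient.discovery import build

CLIENT_SECRET_FILE = "client_secret.json"
SCOPES = ["https://www.googleapis.com/auth/youtube.force-ssl"]

# Start the OAuth flow from the downloaded client secret
flow = InstalledAppFlow.from_client_secrets_file(CLIENT_SECRET_FILE, SCOPES)

# Prints a URL: open it, authorize, and paste the access code back
credentials = flow.run_console()

# Build the YouTube Data API v3 service object
youtube = build("youtube", "v3", credentials=credentials)
```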


 

  7. Perform YouTube Search on query

 

Set the query for YouTube search by providing the video title for which you want to extract the comments.

 

After this, run the code below to perform a YouTube search and get the snippet of the related YouTube video, which is stored in a dictionary after executing the list request, as in the figure below.

 


 This code demonstrates how to set a query with a particular video title, or any term, to perform a YouTube search.

Code illustration: 5
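The search call can be sketched as below, assuming the youtube service object from the previous step; the query string is a placeholder:

```python
query = "some video title"  # hypothetical search term

# Each item in the response carries a "snippet" with basic video details
search_response = youtube.search().list(
    q=query,
    part="id,snippet",
    type="video",       # restrict results to videos
    maxResults=5,
).execute()

snippets = search_response["items"]
```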


 

In the screen below you can check how the data is extracted. Basically, it only extracts basic video details such as the video ID, channel, description, etc.

 


This code demonstrates the output of the YouTube search result corresponding to the query set by the user.

Code illustration: 6


 

  8. Extract Video Details - videoId, channelId, title, description

 

Run the cell below to extract the video details from the data returned in the snippet. As you only require the details of one video, just collect the first entry from the list.

 


   This code demonstrates extracting the video details from the "snippet" result of the YouTube search in the last step.

Code illustration: 7, Code Credit
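Pulling the details out of the first search result is plain dictionary access; a sketch (the field names follow the YouTube search resource, while the function name is my own):

```python
def video_details(search_items):
    """Return basic details of the first video in a search response."""
    first = search_items[0]          # only the first match is needed
    return {
        "videoId": first["id"]["videoId"],
        "channelId": first["snippet"]["channelId"],
        "title": first["snippet"]["title"],
        "description": first["snippet"]["description"],
    }
```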


 

  9. Extract Video Details - comments
     

Now let’s move to the next step, i.e. the collection of the comments. To extract the comments from videos, you need to use the commentThreads resource of the YouTube Data API v3.

 

However, per the documentation, it can only return up to 100 comments per page, so to extract every comment you can make use of the nextPageToken, just as you used it in the previous blog to extract all the videos at once.

 

Apart from the comments themselves, you will also save the comment ID, the number of replies to each comment, the like count on each comment, and other video-related details in their respective lists, as follows.

 


This code shows the initialization of the lists that store the results, i.e. comment details including the comments, likes/replies, video details, etc.

Code illustration: 8


The code demonstrates the extraction of all the comments, including their other details, for the desired video/channel using the "nextPageToken".

Code illustration: 9
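The pagination loop can be sketched as follows; it gathers the fields mentioned above into one list of dicts rather than separate lists, and the function name and exact field choices are assumptions:

```python
def fetch_all_comments(youtube, video_id):
    """Collect every top-level comment of a video by following nextPageToken."""
    comments, page_token = [], None
    while True:
        response = youtube.commentThreads().list(
            part="snippet",
            videoId=video_id,
            maxResults=100,          # API ceiling per page
            pageToken=page_token,
            textFormat="plainText",
        ).execute()
        for item in response.get("items", []):
            top = item["snippet"]["topLevelComment"]
            comments.append({
                "videoId": video_id,
                "commentId": top["id"],
                "comment": top["snippet"]["textDisplay"],
                "likeCount": top["snippet"]["likeCount"],
                "replyCount": item["snippet"]["totalReplyCount"],
            })
        page_token = response.get("nextPageToken")
        if not page_token:           # last page reached
            return comments
```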


 

  10. Store comments in CSV file

Now follow the code below to create a data frame.

 


 The code demonstrates the creation of the data frame to structure the comments extracted with their corresponding details in one place.

Code illustration: 10
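Building the data frame is then one constructor call; a sketch with a couple of hypothetical rows standing in for the real extracted list:

```python
import pandas as pd

# Hypothetical output of the extraction step (list of dicts)
comments = [
    {"videoId": "v1", "commentId": "c1", "comment": "Great video!",
     "likeCount": 3, "replyCount": 1},
    {"videoId": "v1", "commentId": "c2", "comment": "Great video!",
     "likeCount": 0, "replyCount": 0},
]

# One row per comment, one column per detail
df = pd.DataFrame(comments)
```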


 

In the cell output below, observe the duplicate comments for a single video.

 


The code demonstrates how the data is stored in a data frame, with all comments in one place along with their respective details.

Code illustration: 11


 

Follow the code below to analyze the duplicate comments. You can observe that the majority of the comments are repeated, so you can drop them using the drop_duplicates function and then create a new data frame named unique_df.

 


   This code confirms the presence of duplicate comments in the extracted data and then removes all the duplicates.

Code illustration: 12
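The duplicate analysis and removal can be sketched on a tiny hypothetical data frame:

```python
import pandas as pd

df = pd.DataFrame({
    "commentId": ["c1", "c2", "c3"],
    "comment": ["Nice!", "Nice!", "Very helpful"],
})

# How many comment texts are repeats of an earlier row?
dup_count = int(df.duplicated(subset="comment").sum())

# Keep one copy of each comment text
unique_df = df.drop_duplicates(subset="comment").reset_index(drop=True)
```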


 

Finally, write the data into a CSV file.

 


This code writes the data frame of comments into a CSV file on the local disk.

Code illustration: 13


 

Data Cleaning

 

Now that you have extracted all the comments from the YouTube video, it's time to clean them, as they contain many redundant characters that add nothing to the final goal, i.e. sentiment analysis.

 

  1. Remove Emojis

Use the demoji library to remove emojis from the text, i.e. the comments.

So first read the CSV file using pandas, then remove all the emojis from the comments with the demoji library and create a new feature, clean_comments, as follows:

 


This code shows the removal of emojis from the comments using the "demoji" library.

Code illustration: 14
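A sketch of the emoji-removal step; the file name comments.csv and the comment column name are assumptions about the CSV written earlier (older demoji versions also require a one-time demoji.download_codes() call before use):

```python
import demoji
import pandas as pd

df = pd.read_csv("comments.csv")  # file and column names assumed

# demoji.replace() deletes every emoji it finds in the string
df["clean_comments"] = df["comment"].astype(str).apply(
    lambda text: demoji.replace(text, ""))
```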


 

  2. Language Detection

 

After removing emojis from the comments, you need to keep only the English comments, so that further analysis does not get overly complex. To detect the language of the comments, you can use the langdetect library. Run the cell below to detect the language and create a new feature, language.

Note: some of the extracted comments contain only numbers, which can't be treated as text, so use a try/except statement to handle this error, as below.

 


This code demonstrates detecting the language of each comment and storing it as a separate feature.

Code illustration: 15
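The try/except wrapper can be sketched as below; LangDetectException is raised for inputs langdetect cannot classify, such as comments that are numbers only (the helper name and column names are my own):

```python
from langdetect import detect, LangDetectException

def detect_language(text):
    """Return a language code like 'en', or None when detection fails."""
    try:
        return detect(str(text))
    except LangDetectException:
        return None

# Store the detected language as a new feature (column names assumed)
df["language"] = df["clean_comments"].apply(detect_language)
```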



 

Check the output below to see how the languages are detected.


 

This shows the data frame after language detection and emoji removal.

Code illustration: 16


 

Now you need to extract only the English comments, i.e. ‘en’. Run the cell below and write them into a separate CSV file as follows.


 

  The code demonstrates filtering the comments to keep only English ("en") comments using the "langdetect" module, and then writing them into a new CSV file.

Code illustration: 17
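Filtering down to English and writing the result can be sketched on a tiny hypothetical frame (real code would reuse the frame from the previous step; the output file name is assumed):

```python
import pandas as pd

df = pd.DataFrame({
    "clean_comments": ["Great video", "Sehr gutes Video"],
    "language": ["en", "de"],
})

# Keep only the rows detected as English, then persist them
english_df = df[df["language"] == "en"].reset_index(drop=True)
english_df.to_csv("english_comments.csv", index=False)
```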


 

  3. Remove Special Characters - using RegEx

 

So far you have removed the emojis and extracted only the English-language comments, so the data now resembles other commonly available standard text datasets that contain only text and no other redundant content.

 

So now you have to perform some common pre-processing steps, including the removal of brackets and special characters from the comments.

 

This step is carried out using regex. Here you set an expression that finds all the brackets and special characters, excluding numbers and letters, and using the re module you can replace them with a space or some other character.

 

Follow the code below to create a new column that saves the clean comments without brackets and special characters.

 


   This code demonstrates cleaning the data by removing special characters from the comments with a regular expression and saving the result into a new feature.

Code illustration: 18
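The regex step can be sketched with re.sub; the pattern keeps letters, digits, and spaces and turns everything else into a space (the helper and column names are my own):

```python
import re

def strip_special_chars(text):
    """Replace every character that is not a letter, digit, or space."""
    return re.sub(r"[^A-Za-z0-9 ]+", " ", str(text)).strip()

# Applied to the data frame it would look like:
# df["regular_comments"] = df["clean_comments"].apply(strip_special_chars)
```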


 

  4. Prepare the Dataset - into CSV file

 

Now that all the preprocessing is done, make a new data frame with only the Video ID, Comment ID, and the regular (cleaned) comments from the previous step, and write it into a new CSV file.


 The final step of data processing: forming a new data frame including the video ID, comment ID, and comments, and writing the dataset into a separate CSV file.

Code illustration: 19
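The final packaging step can be sketched as a column selection plus to_csv (column and file names assumed):

```python
import pandas as pd

# Hypothetical cleaned frame from the previous steps
df = pd.DataFrame({
    "videoId": ["v1"],
    "commentId": ["c1"],
    "regular_comments": ["great video"],
    "likeCount": [3],
})

# Keep only the columns needed for the sentiment-analysis stage
final_df = df[["videoId", "commentId", "regular_comments"]]
final_df.to_csv("final_comments.csv", index=False)
```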


 

Note: The next part will cover the Sentiment Analysis of YouTube comments. 

 

For the complete code, go through this GitHub repository.

 

Conclusion

 

The YouTube Data API v3 has been used to extract the comments from particular YouTube videos, with the API's access restricted to YouTube only by specifying the scope.

 

A service was built to get the access token, so that the API could be used to extract comments from YouTube videos/channels via the commentThreads resource of the YouTube API.

 

The extracted comments were then pre-processed: emojis were removed using the Python demoji module, only English comments were kept using the langdetect module, and special characters (everything except numbers and letters) were removed using regex.
