Extracting & Pre-processing the YouTube Comments

  • Ripul Agrawal
  • Jun 10, 2020
  • Machine Learning
  • Updated on: Jun 09, 2020

The previous blog covered the extraction of YouTube channel data using the YouTube Data API v3, including the channel title, channel ID, channel videos, video titles, comments, likes, and view counts.


In continuation of that blog, it's time to get insights into the comments viewers post on particular videos or channels. Once you extract the comments from YouTube, whether for particular videos, videos in a particular category, or an entire channel, you can analyze them further using the likes/dislikes on each comment or by running sentiment analysis on the comment text.

 

All of this helps you understand how people are reacting to the channel or its videos, and you can gauge community acceptance by analyzing the comments. If you want to go deeper, you can also look for a relationship between comments and views and estimate viewer engagement on your videos.

 

Prerequisite

 

  • Google Account

  • Python 3

  • Anaconda or Google Colab: If you want to use your local machine, install Anaconda and start a Jupyter Notebook there. Alternatively, use Google Colab to save installation time and memory usage on your local machine, as it runs in the cloud and provides a GPU.


 

Getting Started with YouTube API

 

The last blog covered the activation of the YouTube API from the Google console; if you missed it, read it first for the API key generation. The difference is that it covered only the activation of the API, without setting up the OAuth 2.0 consent screen.

 

OAuth 2.0 is used for authentication and authorization when the API is used to make changes on the channel, e.g. to reply to a comment or to insert/delete a video on YouTube directly from code.

 

Setting up OAuth 2.0 

 

Some of the steps are the same as in the previous API key activation, while the new steps are described below. You can also refer to the official documentation for reference.

 

While you are on the APIs & Services dashboard enabling the YouTube API, first click on the OAuth consent screen, as shown in the following figure.


The Image is showing the API dashboard to set credentials & OAuth screen.

Google Developer Dashboard


 

Once clicked, it will take you to the following page; fill in the application name and the email address linked with your Google account, then save the details.

 


    The image is showing the OAuth Consent screen setup page.

OAuth 2.0 Consent Screen



The next step is to create credentials: click Create Credentials, then select OAuth client ID.


    This image is showing the Credentials page where the OAuth client ID will be generated for the authorization.

Credentials Page - click on OAuth client ID


Select the application type as Desktop app from the dropdown, enter the project name, and click Create.


     This image is displaying the steps, to add project details for authorization.

Create an OAuth client ID


 

The last step is to download the JSON file by clicking the download icon in the OAuth 2.0 client IDs section, as in the figure below. Rename it client_secret.json and save it in the same path where you will store your code.


 

       This image is showing the page, from where OAuth 2.0 client ID can be downloaded.

Download - OAuth Client ID JSON file



 

Extract YouTube Comments

 

  1. Link Google Colab with Google Drive.

  2. Mount Google Drive on Google Colab.

  3. Install the required libraries. Last time you used google-api-python-client to access the YouTube data and pandas for the analysis. In addition to these, you now need some extra libraries, including demoji and langdetect.

 

 

     Follow the below code for installation.


This code image demonstrates the installation of packages used for comments extraction and data cleaning. 

Code illustration: 1
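The installation step can be sketched as a single pip command (the package names are assumed from the libraries used later in this post; google-auth-oauthlib is an assumption for the OAuth flow used below):

```shell
pip install google-api-python-client google-auth-oauthlib pandas demoji langdetect
```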


 

  4. Import the installed packages

 


 This code demonstrates importing the packages used for further pre-processing of the comments.

Code illustration: 2
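A plausible set of imports for the rest of the walkthrough, assuming the packages above are installed (the exact import style may differ from the original notebook):

```python
# Google API client and OAuth helper
from googleapiclient.discovery import build
from google_auth_oauthlib.flow import InstalledAppFlow

# Data handling and cleaning
import re                      # special-character removal
import pandas as pd            # tabular analysis
import demoji                  # emoji removal
from langdetect import detect  # language detection
```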


 

  5. Restrict Access and set YouTube Parameters

 

First, specify the path to the credentials file named client_secret.json, then restrict the API's access to YouTube only by specifying the scopes, followed by setting the YouTube parameters.

 


The code demonstrates specifying the path of the credentials file and restricting the API's access by setting its scope.

Code illustration: 3
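The setup can be sketched like this; the file name matches the client_secret.json saved earlier, while the youtube.force-ssl scope (which covers reading and managing comments) is an assumption about the exact scope used:

```python
# Path to the OAuth client secret downloaded from the Google console
CLIENT_SECRET_FILE = "client_secret.json"

# Restrict the token to YouTube operations only
SCOPES = ["https://www.googleapis.com/auth/youtube.force-ssl"]

# YouTube Data API parameters
API_SERVICE_NAME = "youtube"
API_VERSION = "v3"
```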


 

  6. Build the service and get the access token

 

Follow the code below to build the service so that you can use the API to extract YouTube comments. After you run the cell, click the URL in the cell's output, get the access token, and continue.

 


The code demonstrates building the service required to use the API. This step also asks for authorization with your Google account.

Code illustration: 4
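A minimal sketch of building the service with google-auth-oauthlib, assuming the constants from the previous step; run_console() prints the authorization URL whose token you paste back (newer library versions replace it with run_local_server()):

```python
from google_auth_oauthlib.flow import InstalledAppFlow
from googleapiclient.discovery import build

CLIENT_SECRET_FILE = "client_secret.json"
SCOPES = ["https://www.googleapis.com/auth/youtube.force-ssl"]

# Start the OAuth flow from the downloaded client secret
flow = InstalledAppFlow.from_client_secrets_file(CLIENT_SECRET_FILE, SCOPES)

# Prints a URL: open it, authorize, and paste the access code back
credentials = flow.run_console()

# Build the YouTube Data API v3 service object
youtube = build("youtube", "v3", credentials=credentials)
```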


 

  7. Perform YouTube Search on query

 

Set the query for YouTube search by providing the video title for which you want to extract the comments.

 

After this, run the code below to perform a YouTube search and get the snippet of the related YouTube video, which is stored in a dictionary after executing the list request, as in the figure below.

 


 This code demonstrates how to set a query with a particular video title, or any term, to perform a YouTube search.

Code illustration: 5
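The search call can be sketched as below, assuming the youtube service object from the previous step; the query string is a placeholder:

```python
query = "some video title"  # hypothetical search term

# Each item in the response carries a "snippet" with basic video details
search_response = youtube.search().list(
    q=query,
    part="id,snippet",
    type="video",       # restrict results to videos
    maxResults=5,
).execute()

snippets = search_response["items"]
```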


 

In the screen below you can check how the data is extracted. Basically, it only extracts basic video details such as the video ID, channel, description, etc.

 


This code demonstrates the output of the YouTube search result corresponding to the query set by the user.

Code illustration: 6


 

  8. Extract Video Details - videoId, channelId, title, description

 

Run the cell below to extract the video details from the data returned in the snippet. As you only require the details of one video, just collect the first entry from the list.

 


   This code demonstrates extracting the video details from the "snippet" result of the YouTube search in the last step.

Code illustration: 7, Code Credit
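Pulling the details out of the first search result is plain dictionary access; a sketch (the field names follow the YouTube search resource, while the function name is my own):

```python
def video_details(search_items):
    """Return basic details of the first video in a search response."""
    first = search_items[0]          # only the first match is needed
    return {
        "videoId": first["id"]["videoId"],
        "channelId": first["snippet"]["channelId"],
        "title": first["snippet"]["title"],
        "description": first["snippet"]["description"],
    }
```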


 

  9. Extract Video Details - comments
     

Now let’s move to the next step, i.e. the collection of the comments. To extract the comments from videos, you need to use the commentThreads resource of the YouTube Data API v3.

 

However, per the documentation, it can only return up to 100 comments per page, so to extract every comment you can make use of the nextPageToken, just as you used it in the previous blog to extract all the videos at once.

 

Apart from the comments themselves, you will also save the comment ID, the number of replies to each comment, the like count on each comment, and other video-related details in their respective lists, as follows.

 


This code shows the initialization of the lists that store the results, i.e. comment details including the comments, likes/replies, video details, etc.

Code illustration: 8


The code demonstrates the extraction of all the comments, including their other details, for the desired video/channel using the "nextPageToken".

Code illustration: 9
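The pagination loop can be sketched as follows; it gathers the fields mentioned above into one list of dicts rather than separate lists, and the function name and exact field choices are assumptions:

```python
def fetch_all_comments(youtube, video_id):
    """Collect every top-level comment of a video by following nextPageToken."""
    comments, page_token = [], None
    while True:
        response = youtube.commentThreads().list(
            part="snippet",
            videoId=video_id,
            maxResults=100,          # API ceiling per page
            pageToken=page_token,
            textFormat="plainText",
        ).execute()
        for item in response.get("items", []):
            top = item["snippet"]["topLevelComment"]
            comments.append({
                "videoId": video_id,
                "commentId": top["id"],
                "comment": top["snippet"]["textDisplay"],
                "likeCount": top["snippet"]["likeCount"],
                "replyCount": item["snippet"]["totalReplyCount"],
            })
        page_token = response.get("nextPageToken")
        if not page_token:           # last page reached
            return comments
```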


 

  10. Store comments in CSV file

Now follow the code below to create a data frame.

 


 The code demonstrates the creation of the data frame to structure the comments extracted with their corresponding details in one place.

Code illustration: 10
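Building the data frame is then one constructor call; a sketch with a couple of hypothetical rows standing in for the real extracted list:

```python
import pandas as pd

# Hypothetical output of the extraction step (list of dicts)
comments = [
    {"videoId": "v1", "commentId": "c1", "comment": "Great video!",
     "likeCount": 3, "replyCount": 1},
    {"videoId": "v1", "commentId": "c2", "comment": "Great video!",
     "likeCount": 0, "replyCount": 0},
]

# One row per comment, one column per detail
df = pd.DataFrame(comments)
```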


 

In the cell output below, observe the duplicate comments for a single video.

 


The code demonstrates how the data is stored in a data frame, with all comments in one place along with their respective details.

Code illustration: 11


 

Follow the code below to analyze the duplicate comments. You can observe that the majority of the comments are repeated, so you can drop them using the drop_duplicates function and then create a new data frame named unique_df.

 


   This code confirms the presence of duplicate comments in the extracted data and then removes all the duplicates.

Code illustration: 12
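The duplicate analysis and removal can be sketched on a tiny hypothetical data frame:

```python
import pandas as pd

df = pd.DataFrame({
    "commentId": ["c1", "c2", "c3"],
    "comment": ["Nice!", "Nice!", "Very helpful"],
})

# How many comment texts are repeats of an earlier row?
dup_count = int(df.duplicated(subset="comment").sum())

# Keep one copy of each comment text
unique_df = df.drop_duplicates(subset="comment").reset_index(drop=True)
```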


 

Finally, write the data into a CSV file.

 


This code writes the data frame of comments into a CSV file on the local disk.

Code illustration: 13


 

Data Cleaning

 

Now that you have extracted all the comments from the YouTube video, it's time to clean them, as they contain many redundant characters that add nothing to the final goal, i.e. sentiment analysis.

 

  1. Remove Emojis

Use the demoji library to remove emojis from the text, i.e. the comments.

So first read the CSV file using pandas, then remove all the emojis from the comments with the demoji library and create a new feature, clean_comments, as follows:

 


This code shows the removal of emojis from the comments using the "demoji" library.

Code illustration: 14
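A sketch of the emoji-removal step; the file name comments.csv and the comment column name are assumptions about the CSV written earlier (older demoji versions also require a one-time demoji.download_codes() call before use):

```python
import demoji
import pandas as pd

df = pd.read_csv("comments.csv")  # file and column names assumed

# demoji.replace() deletes every emoji it finds in the string
df["clean_comments"] = df["comment"].astype(str).apply(
    lambda text: demoji.replace(text, ""))
```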


 

  2. Language Detection

 

After removing emojis from the comments, you need to keep only the English comments, so that further analysis does not get overly complex. To detect the language of the comments, you can use the langdetect library. Run the cell below to detect the language and create a new feature, language.

Note: some of the extracted comments contain only numbers, which can't be treated as text, so use a try/except statement to handle this error, as below.

 


This code demonstrates detecting the language of each comment and storing it as a separate feature.

Code illustration: 15
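The try/except wrapper can be sketched as below; LangDetectException is raised for inputs langdetect cannot classify, such as comments that are numbers only (the helper name and column names are my own):

```python
from langdetect import detect, LangDetectException

def detect_language(text):
    """Return a language code like 'en', or None when detection fails."""
    try:
        return detect(str(text))
    except LangDetectException:
        return None

# Store the detected language as a new feature (column names assumed)
df["language"] = df["clean_comments"].apply(detect_language)
```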



 

Check the output below to see how the languages are detected.


 

This shows the data frame after language detection and emoji removal.

Code illustration: 16


 

Now you need to extract only the English comments, i.e. ‘en’. Run the cell below and write them into a separate CSV file as follows.


 

  The code demonstrates filtering the comments to keep only English ("en") comments using the "langdetect" module, and then writing them into a new CSV file.

Code illustration: 17
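Filtering down to English and writing the result can be sketched on a tiny hypothetical frame (real code would reuse the frame from the previous step; the output file name is assumed):

```python
import pandas as pd

df = pd.DataFrame({
    "clean_comments": ["Great video", "Sehr gutes Video"],
    "language": ["en", "de"],
})

# Keep only the rows detected as English, then persist them
english_df = df[df["language"] == "en"].reset_index(drop=True)
english_df.to_csv("english_comments.csv", index=False)
```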


 

  3. Remove Special Characters - using RegEx

 

So far you have removed the emojis and extracted only the English-language comments, so the data now resembles other commonly available standard text datasets that contain only text and no other redundant content.

 

So now you have to perform some common pre-processing steps, including the removal of brackets and special characters from the comments.

 

This step is carried out using regex. Here you set an expression that finds all the brackets and special characters, excluding numbers and letters, and using the re module you can replace them with a space or some other character.

 

Follow the code below to create a new column that saves the clean comments without brackets and special characters.

 


   This code demonstrates cleaning the data by removing special characters from the comments with a regular expression and saving the result into a new feature.

Code illustration: 18
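The regex step can be sketched with re.sub; the pattern keeps letters, digits, and spaces and turns everything else into a space (the helper and column names are my own):

```python
import re

def strip_special_chars(text):
    """Replace every character that is not a letter, digit, or space."""
    return re.sub(r"[^A-Za-z0-9 ]+", " ", str(text)).strip()

# Applied to the data frame it would look like:
# df["regular_comments"] = df["clean_comments"].apply(strip_special_chars)
```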


 

  4. Prepare the Dataset - into CSV file

 

Now that all the preprocessing is done, make a new data frame with only the Video ID, Comment ID, and the regular (cleaned) comments from the previous step, and write it into a new CSV file.


 The final step of data processing: forming a new data frame including the video ID, comment ID, and comments, and writing the dataset into a separate CSV file.

Code illustration: 19
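The final packaging step can be sketched as a column selection plus to_csv (column and file names assumed):

```python
import pandas as pd

# Hypothetical cleaned frame from the previous steps
df = pd.DataFrame({
    "videoId": ["v1"],
    "commentId": ["c1"],
    "regular_comments": ["great video"],
    "likeCount": [3],
})

# Keep only the columns needed for the sentiment-analysis stage
final_df = df[["videoId", "commentId", "regular_comments"]]
final_df.to_csv("final_comments.csv", index=False)
```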


 

Note: The next part will cover the Sentiment Analysis of YouTube comments. 

 

For the complete code, go through this GitHub repository.

 

Conclusion

 

The YouTube Data API v3 has been used to extract the comments from particular YouTube videos, with the API's access restricted to YouTube only by specifying the scope.

 

A service was built to get the access token, so that the API could be used to extract comments from YouTube videos/channels via the commentThreads resource of the YouTube API.

 

The extracted comments were then pre-processed: emojis were removed using the Python demoji module, only English comments were kept using the langdetect module, and special characters (everything except numbers and letters) were removed using regex.
