The Internet is full of data. Many websites provide ready-made datasets that you can download and use in your own projects. However, real life is not always that comfortable: a direct data file may not be available. Sometimes you need to dig through highly unstructured data that is embedded in web pages written in HTML. Extracting data this way is called web scraping.
In this article, we will discuss web scraping in R programming. Web scraping often gives us the opportunity to create a dataset from scratch.
Taking a Look at a Web Page
Before we start looking into web scraping itself, we should have a basic overview of a web page. This will give us the basics of HTML, the language used to build web pages.
HTML was not developed as a programming language. Instead, it is a markup language for describing the structure and layout of a web page. Most elements on an HTML page are written with a start tag and an end tag, and the content we wish to add to the web page sits between the two. HTML tags are always enclosed within <> symbols.
The following are common tags used in HTML when developing a web page:
<h1>, ... , <h6> - Denote headings, from the largest heading (<h1>) down to the smallest (<h6>).
<p> - Represents a paragraph.
<table> - Represents a table.
<body> - Represents the body of the web page.
There are many more tags in HTML, but listing all of them is not important here because this article is about R programming rather than HTML. If you haven’t read our previous articles on R programming, you can read those here at AnalyticSteps.
Scraping HTML Data in R
Now the interesting part begins. “rvest” is a relatively new package developed by Hadley Wickham. It takes inspiration from libraries such as Python’s Beautiful Soup, which is widely used for web scraping. The package offers many features. However, in this article, our focus will be on the functions that help us scrape HTML data from the web.
If you are new to scraping and want to start from something basic, you should read our earlier article, Data Scrapping in R Programming: Part 1. In that article, we looked at relatively simple scraping tasks such as reading a CSV/Excel file from the web.
You will need to install the “rvest” package using install.packages() and then load it into your workspace using the library() function.
Installing and importing the “rvest” package into R workspace
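The installation step described above might look like the following. This is a minimal sketch, assuming a CRAN mirror is already configured; the requireNamespace() guard is an addition of ours so that the package is only installed when it is missing.

```r
# Install rvest only if it is not already available on this machine.
if (!requireNamespace("rvest", quietly = TRUE)) {
  install.packages("rvest")
}

# Attach the package so that read_html(), html_node(), html_text()
# and the %>% pipe operator can be used without a prefix.
library(rvest)
```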
The first thing we need to do is read the data from a web page. Here, we are using the Web scraping Wikipedia page, which has the URL https://en.wikipedia.org/wiki/Web_scraping. To read this page into the R workspace, we use the read_html() function, which downloads and parses the page at this link.
This function returns a parsed HTML document object. Printing it shows the top-level nodes of the page, typically <head> and <body>, which in turn contain tags similar to the ones we discussed above, such as paragraphs and tables.
Example code with output for reading HTML data through a webpage in R
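A sketch of the reading step is shown here. The variable names `web_page` and `sample_html` are hypothetical, not taken from the original code. The call against the Wikipedia URL is left as a comment because it needs an internet connection, and a tiny inline HTML string with made-up content stands in for the page so the example runs anywhere.

```r
library(rvest)

# The article's target page (needs an internet connection):
# web_page <- read_html("https://en.wikipedia.org/wiki/Web_scraping")

# Offline stand-in with made-up content, used only for illustration.
sample_html <- "<html><head><title>Web scraping</title></head>
<body><h1>Web scraping</h1><p>First paragraph.</p></body></html>"

# read_html() also accepts a string of literal HTML.
web_page <- read_html(sample_html)

# Printing shows the document and its top-level nodes (<head> and <body>).
print(web_page)

# The result is a parsed document object, not a plain list of tags.
class(web_page)
```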
The next interesting part is extracting the actual information from the web page. As you can see in the previous example, read_html() only lists basic information about the web page and its content.
Suppose we need to extract the <h1> heading from the web page. We can use the pipe operator (%>%) in combination with the html_node() function to get the h1 node. See the code below:
Extracting information about the node using html_node() and pipeline operator
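A minimal sketch of selecting the h1 node follows. It again uses a made-up inline page in place of the Wikipedia URL, and the names `web_page` and `h1_node` are hypothetical. Note that in rvest 1.0.0 and later, html_node() is superseded by html_element(), although html_node() still works.

```r
library(rvest)

# Offline stand-in for the Wikipedia page (made-up content).
web_page <- read_html(
  "<html><body><h1>Web scraping</h1><p>Some text.</p></body></html>"
)

# Pipe the parsed page into html_node(), which returns the first
# node that matches the CSS selector "h1".
h1_node <- web_page %>% html_node("h1")

# Printing the node shows its full HTML markup, tags included.
print(h1_node)
```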
If you are new to the pipe operator, don’t worry: we will cover how it works in the next few articles. For now, just remember that the pipe operator passes the result of one step into the next, which lets us navigate through the components of the web page data step by step.
In the example above, the entire HTML code is returned for the h1 node. However, we are not interested in the markup and only want the text that the node contains. To get it, we use the html_text() function in combination with the pipe operator. See the example code below:
Extracting text from a node of the given HTML page
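The text extraction step could be sketched as follows, under the same assumptions as before: a made-up inline page instead of the live URL, and hypothetical variable names `web_page` and `h1_text`.

```r
library(rvest)

# Offline stand-in for the Wikipedia page (made-up content).
web_page <- read_html(
  "<html><body><h1>Web scraping</h1><p>Some text.</p></body></html>"
)

# Select the h1 node, then keep only the text it contains.
h1_text <- web_page %>%
  html_node("h1") %>%
  html_text()

print(h1_text)  # "Web scraping"
```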
We can also extract the paragraphs that are present on the web page we are using.
This can be done using the same process we used to scrape the largest heading from the web page. Instead of the <h1> node, we select the <p> node, which represents a paragraph. Because a page usually contains many paragraphs, we use html_nodes(), which returns all matching nodes, rather than html_node(), which returns only the first match. We will store the scraped paragraph nodes in a new variable named “para_node”.
Scraping paragraph nodes in R Programming
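A sketch of collecting the paragraph nodes is given here. `para_node` is the variable name used in the article, while `web_page` and the three-paragraph inline page are hypothetical stand-ins for the Wikipedia page.

```r
library(rvest)

# Offline stand-in for the Wikipedia page (made-up content).
web_page <- read_html(
  "<html><body><h1>Web scraping</h1>
  <p>First paragraph.</p><p>Second paragraph.</p><p>Third paragraph.</p>
  </body></html>"
)

# html_nodes() returns every node matching the selector "p".
para_node <- web_page %>% html_nodes("p")

# The result is a node set with one entry per paragraph.
print(para_node)
length(para_node)
```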
Now, let us extract the text from the paragraph nodes that we collected from the web page. We will use the same html_text() function from the previous example to scrape the text from the paragraph nodes.
This code shows how to extract text from paragraph nodes
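The paragraph text step might be sketched like this. `para_node` and `para_text` are the names used in the article. The inline page is a made-up stand-in, so the second paragraph here is not the one on Wikipedia.

```r
library(rvest)

# Offline stand-in for the Wikipedia page (made-up content).
web_page <- read_html(
  "<html><body><h1>Web scraping</h1>
  <p>First paragraph.</p><p>Second paragraph.</p><p>Third paragraph.</p>
  </body></html>"
)

para_node <- web_page %>% html_nodes("p")

# html_text() returns a character vector, one element per paragraph.
para_text <- para_node %>% html_text()

# Index the vector to pick out the second paragraph only.
print(para_text[2])  # "Second paragraph."
```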
Here, we have used the html_text() function to capture the text from the paragraph nodes. Since there are many paragraph nodes, the result is a character vector with one element per paragraph, so we index it to get the text of the second paragraph from the “para_text” variable.
If you look at the actual web page, there is a lot of text in bulleted format that is not captured under the paragraph nodes. This is because bulleted lists are marked up with the <ul> tag rather than <p>. For example, there is a bulleted list after the Techniques paragraph. See below:
The text under the <ul> node on the Wikipedia link
Let us see how we can extract the text under the <ul> node. See the code below:
Scraping the text from <ul> nodes in R
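A sketch of scraping the bulleted lists is shown here. The variable names `ul_node`, `ul_text` and `li_text`, as well as the inline page and its list items, are made up for illustration. The extra "ul li" selector is an addition of ours: it returns each bullet separately, whereas the text of a whole <ul> node comes back as a single string.

```r
library(rvest)

# Offline stand-in for the Wikipedia page (made-up content).
web_page <- read_html(
  "<html><body>
  <p>Techniques</p>
  <ul><li>Human copy-and-paste</li><li>Text pattern matching</li></ul>
  </body></html>"
)

# Select every <ul> node and extract its text (one string per list).
ul_node <- web_page %>% html_nodes("ul")
ul_text <- ul_node %>% html_text()
print(ul_text)

# Optionally, select the individual <li> items inside the lists
# to get one string per bullet.
li_text <- web_page %>% html_nodes("ul li") %>% html_text()
print(li_text)
```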
This is how web scraping is done for an HTML page. In our next article, we will walk you through more about web scraping, namely the process of cleaning the scraped data.
Web scraping is a way of collecting data from web pages when no ready-made data file is available.
HTML is not a general-purpose programming language. It is a markup language used to structure the content of a web page and determine how it is laid out.
Several packages are available for web scraping in R, but “rvest”, developed by Hadley Wickham, is one of the simplest to use.
For web scraping with rvest, we use the pipe operator “%>%”, which passes the result of one step to the next so that we can access the content of a web page step by step.
The read_html() function reads a web page from a URL.
html_nodes() is a function that extracts the nodes (tags in HTML) matching a selector from the HTML code.
html_text() allows us to read the text from each node that we access through the html_nodes() function.
This article ends here. However, there are many more articles varying from SQL to Machine Learning and Artificial Intelligence which you can access through our website.