Data Scraping in R Programming: Part 1

  • Lalit Salunkhe
  • Sep 24, 2020
  • R Programming

The world has changed rapidly, and so has the way we store, share, and manage data. Initially, "data" mostly meant Excel files, documents, and the like being passed around. However, the internet, the World Wide Web, and the human urge to go online for everything have changed the equation.

 

Nowadays, much of what we look at is a web page, and web pages hold a huge amount of information that we could never have imagined using for analysis. At the same time, tools have evolved to help you manage data from web pages.

 

This method of extracting data from web pages or URLs is called data scraping. It is rarely an easy task, because data stored online comes in different formats (such as HTML, XML, and JSON), and different sources call for different extraction techniques.


 

We are taking up this topic with a simple intention: to introduce you to web scraping. In this article, we introduce the concept and get some hands-on practice reading CSV, Excel, and zip files from a web URL. This article is only introductory and should not be considered an end-to-end guide, because web scraping is a vast and deep topic. If you are new to R, read our article Introduction to R Programming.


 

Reading a CSV File From the Web

 

The data files we most often use for analysis are CSV or text files. Reading them from a URL may not strictly count as web scraping, but it is good hands-on practice for beginners. You don't need to download the file to your local system every time; you can point R at the URL and use the file stored there directly in your R session.

 

See an example below where we read a CSV file stored at a URL and load it into our R environment.


Reading csv file


The data set stored at the URL contains "sanitized" sales transactions for the month of Jan 2020.

 

We use the read.csv() function to read this data into the R environment. The stringsAsFactors = FALSE argument ensures that string columns are kept as character vectors rather than converted to factors (this has been the default since R 4.0.0, but not in earlier versions). The head() function shows the first six rows of the data, since the full data set is large (998 rows of transactions). Let us run this code in R and see how the data looks.
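A minimal sketch of the call described above. The URL is a placeholder assumption, not the address used in this article, and read_sales_csv() is a hypothetical helper name introduced only for illustration.

```r
# Hypothetical placeholder URL; replace it with the address of a real CSV file.
csv_url <- "https://example.com/sales_jan2020.csv"

# Read a CSV from a URL (or a local path), keeping strings as character vectors.
read_sales_csv <- function(url) {
  read.csv(url, stringsAsFactors = FALSE)
}

# Usage (requires network access and a valid URL):
# sales <- read_sales_csv(csv_url)
# head(sales)   # first six rows
# nrow(sales)   # number of rows read
```

Because read.csv() treats a URL the same way it treats a file path, the same helper also works on a local copy of the file.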


Example code with output for reading the csv data file from the web


You can see the data being read, with the first six rows of each column visible on the console. We can use this data for further basic descriptive statistical analysis (see Descriptive Statistics in R). If you are not familiar with how read.csv() works, see our article Importing Data into R.


 

Reading an Excel File From the URL in R

 

Reading an Excel file from a URL is not quite as easy as it first seems. You hit an obstacle right at the beginning.

 

Multiple R packages can read an Excel file from a URL; the gdata package is one of the most popular and does everything required. Install the gdata package on your system as shown below:


Installing the gdata package in R


The second thing you need to install is Perl. Without Perl on your local system, the gdata package cannot read Excel files from the web. You can install Perl from http://www.activestate.com/activeperl/. The installation steps are pretty straightforward, and you should be able to install Perl on your own. Make sure you have a stable internet connection for the download and installation.
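Before going further, it can help to check whether R can already see a Perl executable. The following is a small sketch using base R's Sys.which(); the messages are illustrative only.

```r
# Look for a Perl executable on the system search path.
perl_path <- Sys.which("perl")

if (nzchar(perl_path)) {
  message("Perl found at: ", perl_path)
} else {
  message("Perl not found on the search path; install it or note its full path.")
}
```

If Perl is installed but not on the search path, you will need its full path when reading the Excel file.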

 

Note: Installing Perl is mandatory for this approach. Without it, gdata cannot read Excel files from the web. Now, use the code below to read the Excel file that's stored at the URL Download me.


Code for reading an excel file from the web


Here we use the read.xls() function from the gdata package. The perl = argument (lowercase) tells read.xls() where to find the Perl executable when it reads the file from the URL; specify it whenever Perl is not on your system search path. The standard path for Perl on Windows is the one used in the code (C:/Perl64/bin/perl.exe).
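A minimal sketch of such a call follows. The URL is a placeholder assumption, read_excel_from_url() is a hypothetical helper name, and the default Perl path is the Windows location mentioned above; adjust all three to your setup.

```r
# install.packages("gdata")  # run once if gdata is not installed

# Hypothetical placeholder URL; replace it with a real .xls file address.
xls_url <- "https://example.com/sample.xls"

# Read the first sheet of an Excel file from a URL with gdata::read.xls().
# perl_path defaults to the Windows location mentioned in this article.
read_excel_from_url <- function(url, perl_path = "C:/Perl64/bin/perl.exe") {
  gdata::read.xls(url, sheet = 1, perl = perl_path)
}

# Usage (requires gdata, Perl, and network access):
# sales_xls <- read_excel_from_url(xls_url)
# head(sales_xls)
```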

 

Let us see the output for this code above.


Code for reading an excel file from the web URL


Here, the file at the URL is downloaded and read into the R environment. Since the dataset is large, we use the head() function to show its first six rows as an overview.


 

Reading a CSV Through a Zip File From the URL

 

There are cases when you need to extract a zip file and read the data from it. This is not an everyday situation, but it does occur. For such situations, there is a workaround in R.

 

You can use the download.file() function to download the zip file and the unzip() function to extract it. Finally, you can use the read.csv() function to read the CSV file extracted from the zip archive.


Code for reading the data file through a zip file


The "mydata.zip" argument to download.file() is the destination file (destfile); the name itself is your choice. It is the name under which the downloaded zip file is saved. Given as a relative name like this, the file is stored in the current working directory of your R session.
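The three steps can be sketched as one helper. The function name read_csv_from_zip_url() and its arguments are hypothetical, and you would supply your own zip URL and the name of the CSV inside it.

```r
# Download a zip, extract one CSV from it, and read that CSV.
read_csv_from_zip_url <- function(zip_url, csv_name, destfile = "mydata.zip") {
  # mode = "wb" keeps the binary zip intact, which matters on Windows.
  download.file(zip_url, destfile = destfile, mode = "wb")
  # Extract only the requested file into a fresh temporary directory.
  out_dir <- tempfile("unzipped_")
  extracted <- unzip(destfile, files = csv_name, exdir = out_dir)
  read.csv(extracted[1], stringsAsFactors = FALSE)
}

# Usage (hypothetical URL and file name; requires network access):
# sales <- read_csv_from_zip_url("https://example.com/data.zip", "sales.csv")
# head(sales)
```

Note that the downloaded zip stays on disk under the destfile name after the call.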

 

Let us see the output for this code as given below:


Example output for the data reading from zip file


There might be situations where the zip file is big and you do not want to keep it on your disk just to read one file from it. In that case, you can download the zip to a temporary file, read the file you need from it, and then delete the temporary file.

 

See an example below where we use the tempfile() function to store the zip file temporarily and then access the desired CSV file from it.


Reading a data file from zip that is downloaded temporarily


We first create a temporary file path named temp to hold the zip file. Then we download the zip file to that temporary path. Next, we combine read.csv() with unz() to read a CSV file directly from the zip archive without extracting it to disk. Finally, we unlink the temp file, which means no more space-consuming zip files on your system. This is another way of accessing data files inside a zip without storing the zip permanently on your disk.
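A minimal sketch of this pattern follows. The helper name read_csv_via_temp_zip() is hypothetical, and it uses on.exit() for the cleanup, which is a slight variation on calling unlink() as the last step.

```r
# Download a zip to a temporary file, read one CSV inside it, then clean up.
read_csv_via_temp_zip <- function(zip_url, csv_name) {
  temp <- tempfile(fileext = ".zip")
  # Delete the temporary zip when the function exits, even on error.
  on.exit(unlink(temp), add = TRUE)
  download.file(zip_url, destfile = temp, mode = "wb")
  # unz() opens a connection to a single file inside the archive.
  read.csv(unz(temp, csv_name), stringsAsFactors = FALSE)
}

# Usage (hypothetical URL and file name; requires network access):
# sales <- read_csv_via_temp_zip("https://example.com/data.zip", "sales.csv")
# head(sales)
```

Registering the cleanup with on.exit() means the temporary zip is removed even if the download or the read fails partway through.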


 

Summary

 

  • Accessing data from web URLs is one of the core tasks nowadays, as the definition of data has changed.

  • We can access data from web pages and web links in different formats such as HTML, XML, and JSON. In this article, though, we focused on grabbing CSV, Excel, and zip files from a web URL.

  • Reading a CSV file from the web can be done using the read.csv() function, which is one of the core functions in R. (To read more blogs on R programming, check here)

  • To read an Excel file from a web URL, multiple packages can help. Of these, gdata is a trusted choice. You also need to install Perl so that gdata can read the Excel file from the web.

  • To read a zip file and load data from it into the R environment, use download.file() to download the zip, unzip() to extract it, and read.csv() to read the extracted file.

  • If the zip file is big and you don't want to keep it on your local system, you can download it to a temporary file, read the file of interest, and then unlink the zip. This way, you save disk space on your system.

 

That is it for this article. In our next article, we will take a look at actual web scraping, where we scrape data from web pages in HTML or XML format into the R environment. Stay tuned for the next article in this series. Until then, stay safe!
