
How Do We Implement Beautiful Soup For Web Scraping?

Tanesh Balodi
Jun 24, 2021

Introduction

Web scraping is a method for extracting information from web pages. Doing this manually takes a lot of time, so we need an automated way to scrape data quickly.

There are several tools and libraries that help in scraping websites, such as Selenium, Scrapy, Beautiful Soup, and more. In this article, we discuss web scraping using Beautiful Soup.

One example of web scraping is a review-based website that reviews products, veggies, movies, and more. Such a website could either write its reviews manually, or it could scrape user reviews and automate the process.

Use Cases of Web Scraping

Web scraping is used by many types of websites. The modern web requires several kinds of automation, and web scraping is one of them. Four prominent use cases of web scraping are:

Use cases of Web scraping

Market Analysis: Web scraping supports competitor analysis and reviewing rivals' strategies, whether that means reviewing their SEO strategy, comparing prices, or studying their funding.

E-commerce website: E-commerce websites can use web scraping in many ways, from price comparison and review analysis to improving sales strategy and analyzing customer behavior, all of which can support the growth of an e-commerce website.

Review-detail website: Web scraping is a boon for review-based websites. Tools like Beautiful Soup can scrape large numbers of reviews to support a well-founded conclusion.

Social media website: Influencing the market through social media requires proper research into target audiences and the latest trends. Done manually, this could take months, while web scraping makes it far quicker.

4-Step Process for Web Scraping

There are broadly four steps in web scraping:

Steps for web scraping

Loading the Document: The very first step in web scraping is loading the document, which is simply the HTML page from which we need to scrape the data.
Parsing the HTML Document: The HTML document needs to be parsed using a parser. Python offers several parsers that are handy to use.
Extracting the Desired Information: After parsing the HTML document, the next step is to extract the information we need, for example, prices, names, and more.
Conversion of extracted data into useful format: This step is not mandatory, but if the extracted data is not in the format the user wants, it needs to be converted.
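
The four steps above can be sketched end to end on a small example. This is only an illustration under stated assumptions: the HTML string, the class names "item", "name", and "price", and the values in it are made up for the sketch, and a hard-coded string stands in for a downloaded page.

```python
import json

from bs4 import BeautifulSoup

# Step 1: load the document. A made-up HTML string stands in for a real page.
html = """
<ul>
  <li class="item"><span class="name">Pen</span> <span class="price">1.50</span></li>
  <li class="item"><span class="name">Notebook</span> <span class="price">3.25</span></li>
</ul>
"""

# Step 2: parse the HTML document with Python's built-in parser.
soup = BeautifulSoup(html, "html.parser")

# Step 3: extract the desired information (here, names and prices).
items = []
for li in soup.select("li.item"):
    items.append({
        "name": li.select_one("span.name").get_text(strip=True),
        "price": float(li.select_one("span.price").get_text(strip=True)),
    })

# Step 4 (optional): convert the extracted data into a useful format.
as_json = json.dumps(items, indent=2)
print(as_json)
```

JSON is just one possible target format; the same list of dictionaries could be written to CSV or a database instead.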

(Related blog: Data scraping in R: Part 1)

Beautiful Soup

HTML and XML files are structured documents, and Beautiful Soup extracts data from them. Beautiful Soup is written purely in Python, which also makes it easy to use. It automatically converts incoming documents to Unicode and outgoing documents to UTF-8.

You can install Beautiful Soup with the following command:
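
As a small illustration of the encoding handling, the sketch below feeds Beautiful Soup some bytes that are not UTF-8 and reads them back. The markup is made up, and the encoding is passed explicitly through the from_encoding argument so the example does not depend on automatic detection.

```python
from bs4 import BeautifulSoup

# Made-up markup, deliberately encoded as Latin-1 rather than UTF-8.
raw_bytes = "<p>café</p>".encode("latin-1")

# from_encoding tells Beautiful Soup how to decode the incoming bytes.
soup = BeautifulSoup(raw_bytes, "html.parser", from_encoding="latin-1")

text = soup.p.get_text()           # a regular Python (Unicode) string
utf8_bytes = soup.encode("utf-8")  # the document re-encoded as UTF-8 bytes

print(text)
print(utf8_bytes)
```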

pip install beautifulsoup4

You will also need a parser to interpret the HTML. Python ships with a built-in parser, ‘html.parser’, which this tutorial uses; two popular third-party parsers are ‘lxml’ and ‘html5lib’. These two can be installed with the pip commands below:

pip install lxml
pip install html5lib
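
The parser is chosen by name when the BeautifulSoup object is created. As a hedged sketch, the helper below (usable_parsers is a hypothetical name, not part of Beautiful Soup) tries each parser and reports which ones are available on your machine; it relies on the FeatureNotFound exception that Beautiful Soup raises when a requested parser is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def usable_parsers(candidates=("html.parser", "lxml", "html5lib")):
    """Return the parser names from candidates that Beautiful Soup can load."""
    found = []
    for name in candidates:
        try:
            BeautifulSoup("<p>test</p>", name)
        except FeatureNotFound:
            # Raised when the requested parser library is not installed.
            continue
        found.append(name)
    return found


print(usable_parsers())
```

The built-in 'html.parser' should always appear in the result, since it ships with Python.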

Beautiful Soup Tutorial

Let’s put Beautiful Soup to work by scraping some popular quotes from the Goodreads website.

Popular quotes from GoodReads

Above is a screenshot demonstrating a few quotes out of many that we need to scrape.

import requests

from bs4 import BeautifulSoup

BASE_URL = 'https://www.goodreads.com/quotes/?page='

We import the requests module to send HTTP requests and Beautiful Soup to parse the response. BASE_URL is a variable holding the URL of the page we want to scrape, without the page number. With the libraries imported and the base URL defined, we are good to move forward.

n = input()

r = requests.get(BASE_URL + n)


soup = BeautifulSoup(r.text, 'html.parser')

In this step, we read a page number from the user, use the requests module to fetch that page, and use a parser to parse the returned HTML document.

The next step is manual: we need to find the name of the ‘div’ class that contains the quotes. To do this, inspect the page with the browser's developer tools using ‘ctrl + shift + i’.

In this case, after inspecting, we found that the div with class “quotes” contains all the quotes. You can find it by inspecting the page and hovering over the div elements; when we hover over the div with class “quotes”, the following container is highlighted on the page:
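
A slightly more defensive version of this step might validate the page number and check the HTTP status before parsing. This is a sketch rather than the tutorial's own code: build_page_url and fetch_soup are hypothetical helper names, and the 10-second timeout is an arbitrary choice.

```python
import requests
from bs4 import BeautifulSoup

BASE_URL = 'https://www.goodreads.com/quotes/?page='


def build_page_url(page):
    """Return the URL for a page number, rejecting non-positive or non-numeric input."""
    page_number = int(page)  # raises ValueError for input such as 'abc'
    if page_number < 1:
        raise ValueError('page must be 1 or greater')
    return BASE_URL + str(page_number)


def fetch_soup(page, timeout=10):
    """Download one page and return it parsed; raises on HTTP errors."""
    response = requests.get(build_page_url(page), timeout=timeout)
    response.raise_for_status()  # turn 4xx/5xx responses into exceptions
    return BeautifulSoup(response.text, 'html.parser')
```

With these helpers, the step above becomes soup = fetch_soup(input()), and a failed download raises an exception instead of silently parsing an error page.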

Result after hovering div container

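In the next step, use Beautiful Soup to find and select the desired element. In this case, we have to find the div that has the class “quotes”. Both calls below locate it: find() returns the first match, while select() takes a CSS selector and returns a list of matches.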


_ = soup.find('div', attrs={'class': 'quotes'})

_ = soup.select('div.quotes')

QUOTES = []
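
To see how these lookups differ, here is a small sketch on a made-up HTML snippet (the markup only imitates the class names used here). It compares find(), select(), and select_one(), including what happens when nothing matches.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration only.
html = """
<div class="quotes">
  <div class="quote">first</div>
  <div class="quote">second</div>
</div>
"""
demo_soup = BeautifulSoup(html, "html.parser")

container = demo_soup.find("div", attrs={"class": "quotes"})  # first match, or None
matches = demo_soup.select("div.quotes")                      # list of all matches
first_quote = demo_soup.select_one("div.quote")               # first match, or None
missing = demo_soup.find("div", attrs={"class": "absent"})    # nothing matches

print(len(matches))
print(first_quote.get_text())
print(missing)
```

Because find() and select_one() return None when nothing matches, it is worth checking the result before calling methods on it.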

Similarly, we are going to use Beautiful Soup to find details such as the author name, quote, tags, and likes. We also need a function that can scrape multiple pages in one go.

def parse(html_blob):
    """Parse one page of quotes and append each quote's details to QUOTES."""
    soup = BeautifulSoup(html_blob, 'html.parser')
    div_quotes = soup.find('div', attrs={'class': 'quotes'})
    if div_quotes is None:
        # The expected container is missing (for example, the page layout changed).
        return

    # Equivalent: div_quotes.find_all('div', attrs={'class': 'quote'})
    for div_quote in div_quotes.select('div.quote'):
        div_quoteText = div_quote.select_one('div.quoteText')
        div_quoteFooter = div_quote.select_one('div.quoteFooter')
        div_tags = div_quoteFooter.find('div', attrs={'class': ['greyText', 'smallText', 'left']})
        # Look up the likes link outside the tags check so it is always defined.
        anchor_likes = div_quoteFooter.find('a', attrs={'class': 'smallText'})

        tags = []
        if div_tags:
            anchor_tags = div_tags.find_all('a')
            tags = [a.text for a in anchor_tags]

        # split() breaks the text into [quote, author]; strip() removes the
        # surrounding whitespace, and [1:-1] drops the enclosing quotation marks.
        quote = div_quoteText.text.strip().split('―')
        quote, author = quote[0].strip()[1:-1], quote[1].strip()
        author = author.split(',')[0]

        likes = int(anchor_likes.text.replace(' likes', '')) if anchor_likes else 0

        # Store the author name, quote, tags, and likes for this quote.
        QUOTES.append({'author': author, 'quote': quote, 'tags': tags, 'likes': likes})


def scrape(n_pages):
    """Scrape the first n_pages pages of quotes."""
    for i in range(n_pages):
        print('Onto Scraping Page No.', i+1)
        url = BASE_URL + str(i+1)
        r = requests.get(url)
        parse(r.text)
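
The extraction logic can also be tried offline, without sending any requests. The sketch below is an assumption-laden illustration: SAMPLE_HTML is made-up markup that only imitates the class names used in this tutorial (the real page may differ), and parse_quote is a hypothetical helper that handles a single quote block.

```python
from bs4 import BeautifulSoup

# Made-up markup imitating the class names used in this tutorial.
SAMPLE_HTML = """
<div class="quotes">
  <div class="quote">
    <div class="quoteText">
      “An example quote.”
      ―
      <span class="authorOrTitle">Example Author</span>
    </div>
    <div class="quoteFooter">
      <div class="greyText smallText left">
        tags: <a href="#">example</a>, <a href="#">sample</a>
      </div>
      <div class="right"><a class="smallText" href="#">42 likes</a></div>
    </div>
  </div>
</div>
"""


def parse_quote(div_quote):
    """Extract author, quote, tags, and likes from one quote block."""
    text = div_quote.select_one('div.quoteText').get_text()
    # Split once on the horizontal bar that separates quote and author.
    quote_part, _, author_part = text.partition('―')
    quote = quote_part.strip().strip('“”')
    author = author_part.strip().split(',')[0].strip()

    footer = div_quote.select_one('div.quoteFooter')
    tags = [a.get_text(strip=True) for a in footer.select('div.left a')]
    likes_link = footer.select_one('a.smallText')
    likes = int(likes_link.get_text().replace('likes', '').strip()) if likes_link else 0
    return {'author': author, 'quote': quote, 'tags': tags, 'likes': likes}


sample_soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
result = parse_quote(sample_soup.select_one('div.quote'))
print(result)
```

Testing against a saved or hand-written snippet like this makes it easier to adjust the selectors when a site's markup changes.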

Now we shall test the code step by step on a single quote and scrape the quote text, author name, tags, and likes.

div_quotes = soup.find('div', attrs={'class': 'quotes'})
all_quotes = div_quotes.select('div.quote')

all_quotes[0].find('div', attrs={'class': ['greyText', 'smallText', 'left']})

div_quote = all_quotes[0]
div_quoteFooter = div_quote.select_one('div.quoteFooter')

q, a = div_quote.select_one('div.quoteText').text.strip().split('―')
q.strip(), a.strip()

div_tags = div_quoteFooter.find('div', attrs={'class': ['greyText', 'smallText', 'left']})
anchor_tags = div_tags.find_all('a')
anchor_tags[0].text

'greatness'

anchor_likes = div_quoteFooter.find('a', attrs={'class': 'smallText'})
anchor_likes.text

'12328 likes'

We have successfully scraped the desired information from the website of our choice. The process remains the same for scraping any website, but the attributes and div tags have to be found manually because they differ from site to site.
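
The optional fourth step, converting the extracted data into a useful format, could look like the sketch below. It assumes rows shaped like the QUOTES entries built above; sample_quotes holds made-up data, quotes_to_csv is a hypothetical helper, and joining tags with '|' is just one possible convention.

```python
import csv
import io

# Made-up rows in the same shape as the QUOTES entries.
sample_quotes = [
    {'author': 'Example Author', 'quote': 'An example quote.',
     'tags': ['example', 'sample'], 'likes': 42},
]


def quotes_to_csv(quotes):
    """Return the quotes as CSV text, joining each tag list into one cell."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=['author', 'quote', 'tags', 'likes'])
    writer.writeheader()
    for row in quotes:
        writer.writerow({**row, 'tags': '|'.join(row['tags'])})
    return buffer.getvalue()


csv_text = quotes_to_csv(sample_quotes)
print(csv_text)
```

To save the result to disk, the returned text can be written to a file opened with newline='' so that the csv line endings are preserved.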

(Also read: Data Scraping in R programming: Part-2 & Part-3)

Conclusion

The Python-based Beautiful Soup library is a powerful tool for web scraping. However, there are other tools such as Selenium and Scrapy that can have an edge over Beautiful Soup; Scrapy, for instance, is a full crawling framework that can fetch many pages concurrently.

The library could be improved to make it faster and more competitive. Even so, its ease of use makes it a good choice for beginners who want to learn web scraping.