Building a simple web scraper with BeautifulSoup in just 8 lines of code

Aditya R
3 min read · Jan 9, 2021

BeautifulSoup is a Python library for parsing HTML. Its feature set is fairly basic, but it is nonetheless a powerful tool for web scraping. It is useful in many situations where you want to collect data from a website that has a poor API or none at all. You can also use it for personal projects or SEO data analysis. Once you know how to use it, it’s pretty easy to implement for any use case.

A good site for beginners to scrape is the HackerNews front page. Since the site does not use frontend frameworks like React and has meaningful class names and ids, it’s a good starting point for trying out web scraping. Plus, they’re not actively trying to prevent people from scraping their site.

Before starting, make sure you have Python installed on your computer and, optionally, a good text editor or IDE.

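Start by installing the Python requests library (to fetch the website’s HTML) and BeautifulSoup, which is imported as bs4 (to parse the HTML and get data from it), by typing the command below into your command line. The official package name is beautifulsoup4; the bs4 package on PyPI is a thin alias that installs it.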

pip install requests bs4

Next we need to decide what data to scrape from the website. Chrome/Firefox DevTools make this very easy.

Go to the HackerNews front page and open up your browser’s DevTools by pressing Ctrl + Shift + I.

Now click on the arrow button at the top-left of the DevTools panel and then hover over the data you plan to collect.

You should be able to see that each headline is shown as a.storylink (at the time of writing). This means that the headline is an <a> tag (anchor link) with a class of storylink. Note that . means class and # means id; many elements can share the same class, but an id is unique to one element. Now let’s scrape that with BeautifulSoup.
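If you want to check this selector notation outside the browser, here is a small sketch that runs the same kind of selectors against a made-up HTML snippet. The markup, ids, and class names below are invented for illustration and are not copied from HackerNews. It uses BeautifulSoup’s select() and select_one() methods, which accept the tag.class and #id syntax that DevTools shows.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration only; not the real HackerNews HTML.
html = """
<div id="main">
  <a class="storylink" href="https://example.com/a">First story</a>
  <a class="storylink" href="https://example.com/b">Second story</a>
  <a class="morelink" href="/more">More</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# "a.storylink" matches every <a> tag whose class list contains "storylink".
story_links = soup.select("a.storylink")

# "#main" matches the single element whose id is "main".
main_div = soup.select_one("#main")

story_titles = [a.get_text() for a in story_links]
print(story_titles)
print(main_div.name)
```

Two links share the storylink class, so the class selector returns both, while the id selector returns exactly one element.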

Open up your text editor and start by importing the bs4 and requests packages you just installed

from bs4 import BeautifulSoup
import requests

Now we have to get the page from the internet

url = 'https://news.ycombinator.com/news'
rawData = requests.get(url).text
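The two lines above assume the request always succeeds. As a slightly more defensive sketch, the hypothetical helper fetch_html below (the name and the 10-second default are my own choices, not part of requests) adds a timeout and raises an error on a failed response instead of silently returning an error page.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the page body as text, raising if the request fails."""
    # A timeout keeps the script from hanging forever on a slow server.
    response = requests.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx/5xx responses.
    response.raise_for_status()
    return response.text


# Usage (needs network access):
# rawData = fetch_html('https://news.ycombinator.com/news')
```

With this helper, a blocked or broken request stops the script with a clear exception rather than handing error HTML to the parser.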

Next we pass the rawData into BeautifulSoup and save it inside a soup variable

soup = BeautifulSoup(rawData, 'html.parser')

The 'html.parser' argument tells BeautifulSoup to parse the raw HTML with html.parser, the parser built into Python’s standard library. You can also use lxml (which has to be installed separately), but I find html.parser simpler. After this, we can extract any data from the soup variable.
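If you would like to try lxml without making it a hard requirement, one possible approach is to fall back to html.parser when lxml is missing. This is only a sketch: make_soup is a hypothetical helper name, and it relies on BeautifulSoup raising FeatureNotFound when the requested parser is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(html):
    """Parse html with lxml if it is installed, otherwise with html.parser."""
    try:
        return BeautifulSoup(html, "lxml")
    except FeatureNotFound:
        # lxml is a separate package (pip install lxml); use the built-in parser.
        return BeautifulSoup(html, "html.parser")


# Made-up sample page for illustration only.
page = "<html><head><title>Demo page</title></head><body><p>Hi</p></body></html>"
soup = make_soup(page)
print(soup.title.string)
```

For well-formed HTML like this sample, both parsers produce the same tree, so the rest of the script does not need to know which one was used.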

headlines = soup.find_all('a', class_="storylink")

Make sure to use class_ instead of class, as class is a reserved keyword in Python.
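The trailing underscore is only a workaround for the keyword clash. As a quick illustration on made-up markup (not real HackerNews HTML), the same filter can also be written with an attrs dictionary, where the key is a plain string and no underscore is needed.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration only.
html = '<a class="storylink" href="/1">One</a><a class="other" href="/2">Two</a>'
soup = BeautifulSoup(html, "html.parser")

# Keyword form: class_ avoids the reserved word "class".
by_keyword = soup.find_all("a", class_="storylink")

# Dictionary form: the attribute name is a string, so "class" is fine here.
by_attrs = soup.find_all("a", attrs={"class": "storylink"})

print([a.text for a in by_keyword])
print([a.text for a in by_attrs])
```

Both calls select the same single link, so which form you use is a matter of taste.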

Now we can just print each headline with a for loop.

for headline in headlines:
    print(headline.text)

You should get the headlines once you run the program.
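As a small extension of the same idea, the sketch below collects each headline together with its link. The function name extract_stories and its css_class parameter are hypothetical, and the sample markup is made up so the code runs without network access; on the live page you would pass in the HTML you fetched with requests.

```python
from bs4 import BeautifulSoup


def extract_stories(html, css_class="storylink"):
    """Return (title, url) pairs for every <a> tag with the given class."""
    soup = BeautifulSoup(html, "html.parser")
    stories = []
    for link in soup.find_all("a", class_=css_class):
        # .get("href") returns None instead of raising if the attribute is missing.
        stories.append((link.get_text(strip=True), link.get("href")))
    return stories


# Made-up sample markup so the sketch runs without network access.
sample = """
<a class="storylink" href="https://example.com/a"> First story </a>
<a class="storylink" href="https://example.com/b">Second story</a>
"""
stories = extract_stories(sample)
for title, url in stories:
    print(title, url)
```

If the function returns an empty list on the live page, the site’s markup has probably changed; check the class name in DevTools again and pass it as css_class.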

Building your first web scraper was as simple as that.

I’m not planning to teach each function in BeautifulSoup as the BeautifulSoup documentation does that pretty well. This was just a simple starter project to get people interested in web scraping.

I will be making more posts about more advanced scraping tools like MechanicalSoup, Puppeteer, Selenium, etc., so stay tuned.
