Building a simple web scraper with BeautifulSoup in just 8 lines of code
BeautifulSoup is a Python library for parsing HTML. Its feature set is fairly basic, but it is nonetheless a powerful tool for web scraping. It is useful in many situations where you want to collect data from a website that has a bad API or none at all. You can also use it for personal projects or SEO data analysis. Once you know how to use it, it’s pretty easy to implement for any use case.
A good site for beginners to scrape is the Hacker News front page. Since the site does not use frontend frameworks like React and also has meaningful class names and ids, it’s a good starting point for trying out web scraping. Plus, they’re not actively trying to prevent people from scraping their site.
Before starting, make sure you have Python installed on your computer and, optionally, a good text editor or IDE.
Start by installing the Python requests library (to fetch the website’s HTML) and BeautifulSoup’s bs4 library (to parse the HTML and get data from it) by typing the command below into your command line.
pip install requests bs4
Now we need to decide what data to scrape from the website. The Chrome and Firefox DevTools make this very easy.
Go to the Hacker News front page and open your browser’s DevTools by pressing Ctrl + Shift + I.
Now click the arrow button at the top-left of the DevTools panel and then hover over the data you plan to collect.
You should be able to figure out that each headline is shown as a.storylink. This means that the headline is an <a> tag (anchor link) with a class of storylink. (Note that . means class and # means id. Many elements can share the same class, but an id is unique to one element.) Now let’s scrape that with BeautifulSoup.
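To make the . and # notation concrete, here is a minimal sketch that runs the same kind of selectors against a small HTML string. The snippet is made up for illustration and is not the real Hacker News markup; select() and select_one() are BeautifulSoup methods that accept CSS selectors like the ones DevTools shows.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet, made up for illustration (not the real site markup).
sample_html = """
<div id="main">
  <a class="storylink" href="https://example.com/one">First story</a>
  <a class="storylink" href="https://example.com/two">Second story</a>
  <a class="morelink" href="/news?p=2">More</a>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# "a.storylink" matches every <a> tag that has the class "storylink".
story_links = soup.select("a.storylink")

# "#main" matches the single element whose id is "main".
main_div = soup.select_one("#main")

print([link.text for link in story_links])
print(main_div.name)
```

Two links share the storylink class, while only one element carries the main id, which is the class/id difference described above.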
Open your text editor and start by importing the bs4 and requests packages you just installed:
from bs4 import BeautifulSoup
import requests
Now we have to fetch the page from the internet:
url = 'https://news.ycombinator.com/news'
rawData = requests.get(url).text
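The one-liner above assumes the request always succeeds. As a minimal sketch of a slightly more defensive version, the helper below adds a timeout and checks the HTTP status. The function name fetch_html and the 10-second default are my own choices for illustration, not part of the original script.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the HTML of `url` as text, raising an exception on failure.

    fetch_html is a hypothetical helper name used only for this example.
    """
    # A timeout stops the script from hanging forever on an unresponsive server.
    response = requests.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx/5xx responses.
    response.raise_for_status()
    return response.text
```

You could then write rawData = fetch_html(url) in place of the line above, and the rest of the script stays the same.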
Next we pass rawData into BeautifulSoup and save the result in a soup variable:
soup = BeautifulSoup(rawData, 'html.parser')
The 'html.parser' argument tells BeautifulSoup to parse the raw HTML with Python’s built-in html.parser. You can also use lxml, but I find html.parser simpler. After this, we can extract any data from the soup variable:
headlines = soup.find_all('a', class_="storylink")
Make sure to use class_ instead of class, as class is a Python keyword.
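As a small illustration of the class_ point, the sketch below shows two equivalent ways to filter by class, plus find(), which returns only the first match. The HTML string is made up for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet, made up for illustration.
sample_html = """
<a class="storylink" href="/one">First story</a>
<a class="storylink" href="/two">Second story</a>
<a class="morelink" href="/more">More</a>
"""
soup = BeautifulSoup(sample_html, "html.parser")

# class_ avoids the clash with Python's reserved word `class`.
by_keyword = soup.find_all("a", class_="storylink")

# Passing a dict through attrs is an equivalent way to filter by class.
by_attrs = soup.find_all("a", attrs={"class": "storylink"})

# find() returns only the first match (or None if nothing matches).
first_match = soup.find("a", class_="storylink")

print([a.text for a in by_keyword])
print([a.text for a in by_attrs])
print(first_match.text)
```

Both find_all() calls return the same two links, so which form you use is mostly a matter of taste.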
Now we can just print each headline with a for loop:
for headline in headlines:
    print(headline.text)
You should get the headlines once you run the program.
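If you also want the URL behind each headline, one possible extension is sketched below. The function name extract_headlines and the sample HTML are hypothetical and exist only for this example. The class name is a parameter because a site's markup can change over time; if the result comes back empty, inspect the page in DevTools again and adjust it.

```python
from bs4 import BeautifulSoup


def extract_headlines(html, css_class="storylink"):
    """Return (title, url) pairs for every <a> tag with the given class.

    extract_headlines is a hypothetical helper, not part of BeautifulSoup.
    """
    soup = BeautifulSoup(html, "html.parser")
    links = soup.find_all("a", class_=css_class)
    # .get("href") returns None instead of raising if the attribute is missing.
    return [(link.text.strip(), link.get("href")) for link in links]


# Hypothetical HTML used in place of a live page so the example runs offline.
sample_html = """
<a class="storylink" href="https://example.com/one">First story</a>
<a class="storylink" href="https://example.com/two">Second story</a>
"""

results = extract_headlines(sample_html)
for title, link_url in results:
    print(f"{title} -> {link_url}")
```

To run it against the live page, pass in the rawData string fetched earlier instead of sample_html.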
Building your first web scraper was as simple as that.
I’m not planning to teach every function in BeautifulSoup, as the BeautifulSoup documentation does that pretty well. This was just a simple starter project to get people interested in web scraping.
I will be making more posts about more advanced scraping tools like MechanicalSoup, Puppeteer, and Selenium, so stay tuned.