Beautiful Soup 101
Hello and welcome! It’s time to put that Beautiful Soup library we’ve been talking about to the test. This time, we are going to start working on a more complex application that will extract pricing information from a page on Data Alliance’s website.
We’re going to fetch the price of the “Alfa AWUS036H 2000mW Long-Range WiFi USB Adapter”, create a graph that shows price variations over time, and even have the application alert us via email when the price changes.
We could visit the site any time we wanted to check whether that Wi-Fi adapter is on sale, of course, but it is much more elegant, and much faster, to run an application and see the product price instantly. Not to mention that if we are interested in buying ten different products from ten different sites, a custom-made application is far more effective: our program will have ten slots, one for each product, that can be monitored every day.
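To give you an idea of where we're heading, here's a minimal sketch of the shape such a price monitor could take. Everything in it is hypothetical at this point: the product names, the URLs and the get_price() helper are just placeholders that we will fill in as the tutorial progresses.
# A hypothetical sketch of the price monitor we are building towards.
# The product list, the URLs and get_price() are placeholders for now.
products = {
    "Alfa AWUS036H WiFi Adapter": "https://example.com/product-1",
    "Another gadget": "https://example.com/product-2",
    # ...up to ten products from ten different sites
}

def get_price(url):
    # Later on we'll implement this with urllib + Beautiful Soup;
    # for now it's only a stub.
    return None

for name, url in products.items():
    print(name, "->", get_price(url))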
First, we will need to download and install Beautiful Soup 4 from the following URL:
https://pypi.org/project/beautifulsoup4/#files
We can also install the package using Python’s PIP package manager from the command prompt:
pip install beautifulsoup4
An important tip: some of our students reported that Beautiful Soup couldn't be imported after installation, so the programs that make use of it wouldn't run. To fix this issue, copy the bs4 folder that is created during installation into the same folder as your application.
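Before moving on, it's worth checking that the package can actually be imported. A one-line command from the command prompt will tell you right away:
python -c "import bs4; print(bs4.__version__)"
If this prints a version number (something like 4.x.x), the library is ready to use; if it raises an ImportError instead, revisit the installation step or apply the folder-copy workaround described above.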
Let’s start with a few simple examples. We are going to scrape various types of data from Wikipedia first, to ensure that we don’t overload other people’s servers due to our bad code. Hopefully, you’ll get everything right on your first try ;)
Our first program displays the text paragraphs on Wikipedia’s main page:
import urllib.request
from bs4 import BeautifulSoup
url = "https://en.wikipedia.org/wiki/Main_Page"
response = urllib.request.urlopen(url)
html = response.read()
soup = BeautifulSoup(html, 'html.parser')
for paragraph in soup.find_all('p'):  # extract and print the text paragraphs on the page
    print(paragraph.text)
The code uses the urllib library to open the desired URL and stores the retrieved HTML data in the html variable. Strictly speaking, html is a 'bytes' object, and if we printed it we would get the raw HTML content of the page.
We then use Beautiful Soup to parse that HTML, and we end up with clean, nicely formatted text that can be printed out one paragraph at a time. Python runs quite fast, though, so we'll see all the text being displayed at once, just like in the image below.
On a side note, you're probably going to get different text when you run the code, because Wikipedia changes at least some of the content on this page each day.
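If you're curious about what the data looks like before Beautiful Soup steps in, the short sketch below (using the same URL) prints the type of the raw response and a small slice of it. This is purely an illustration; the exact bytes you see will depend on what Wikipedia serves at that moment.
import urllib.request

url = "https://en.wikipedia.org/wiki/Main_Page"
html = urllib.request.urlopen(url).read()

print(type(html))                  # <class 'bytes'> - raw, undecoded HTML
print(html[:100])                  # the first 100 bytes of the page source
print(html.decode('utf-8')[:100])  # the same data decoded into a string (Wikipedia serves UTF-8)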
The second example scrapes the same page, then retrieves and displays a list of all the image URLs on it. We will use a similar snippet to populate the table that contains the ten products we're interested in tracking.
import urllib.request
from bs4 import BeautifulSoup
url = "https://en.wikipedia.org/wiki/Main_Page"
response = urllib.request.urlopen(url)
html = response.read()
soup = BeautifulSoup(html, 'html.parser')
for img in soup.find_all('img'):  # find and display all the image URLs
    print(img.get('src'))
As you can see, the code is very similar to the one used in the first example. The difference lies in the last two lines, which go through the parsed HTML and grab the 'src' attribute from every element with the 'img' tag.
Here’s what’s displayed when we run this piece of code:
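One thing worth noticing in the output is that many of Wikipedia's image URLs are relative or protocol-relative (they start with "/" or "//") rather than complete addresses. When we later store product data in our own table we will want full URLs, and the sketch below shows one way to build them with urljoin from the standard library:
import urllib.request
from urllib.parse import urljoin
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/Main_Page"
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html, 'html.parser')

for img in soup.find_all('img'):
    src = img.get('src')
    if src:  # skip any img tags that have no src attribute
        print(urljoin(url, src))  # turns relative and protocol-relative paths into absolute URLs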
So far, we have learned to display the entire content of a page, and a list with the elements of some type on a certain page. It’s time to move on to more complex stuff and read some data that’s specific to a particular element.
Don’t worry, Beautiful Soup continues to keep everything simple for us:
import urllib.request
from bs4 import BeautifulSoup
url = "https://en.wikipedia.org/wiki/Main_Page"
response = urllib.request.urlopen(url)
html = response.read()
soup = BeautifulSoup(html, 'html.parser')
div_content = soup.find("div", {"id": "mp-other-content"})  # use your own div here
if div_content:
    print(div_content.text.strip())
This time we use the web scraping library to find the desired ‘div’ element and display its content. To do that, we have to view the HTML source for the page we’re interested in. Fortunately, it’s easy to do so by right-clicking it in a
browser, and then choosing “View Source” from the menu. Alternatively, you can copy/paste the URL below in the browser:
view-source:https://en.wikipedia.org/wiki/Main_Page
When the HTML source is loaded, press Ctrl + F and then type in ‘div’ to find all the div elements on the page:
As you can see, I chose "mp-other-content", which displays the “Other areas of Wikipedia” section:
You can choose any other div, of course, and we encourage you to test the code using various elements.
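As a small bonus, Beautiful Soup also understands CSS selectors through its select() method, which is often a more compact way of expressing the same lookup. The sketch below grabs the same div by its id and lists the links inside it; keep in mind that Wikipedia reorganizes its main page from time to time, so the "mp-other-content" id may not exist forever.
import urllib.request
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/Main_Page"
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html, 'html.parser')

# '#mp-other-content a' selects every link inside the div with that id
for link in soup.select('#mp-other-content a'):
    print(link.text, '->', link.get('href'))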
It was a long tutorial, but we have learned quite a bit, haven’t we? Next time we will go even deeper, scraping product names, pricing information, and more.