An Introduction to Web Scraping
Hey there, fellow coders! Python is a fantastic programming language, and it shines when it comes to accessing web resources. It’s amazing what you can do with only a few lines of code, so let’s get started!
We are going to scrape Wikipedia’s homepage in this tutorial, displaying all of its outbound links. Keep in mind that we shouldn’t hammer third-party websites with code that could bring their servers down. Just visit the page(s) you want, grab the info, and then leave; don’t build programs that use ‘for’ loops to request the same pages over and over. And if you genuinely need several pages, pace your requests, as in the sketch below.
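Here’s a minimal example of polite pacing, assuming a short, hypothetical list of URLs and an illustrative two-second pause:

import time
import urllib.request

# Hypothetical URL list; replace with the pages you actually need.
urls = [
    "https://www.wikipedia.org/",
    "https://en.wikipedia.org/wiki/Web_scraping",
]

for url in urls:
    with urllib.request.urlopen(url) as response:
        print(url, response.status)  # confirm each page answered
    time.sleep(2)  # pause between requests so we don't hammer the server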
Let’s look at the entire source code:
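import urllib.request
import re

def fetch_html(url):
    try:
        with urllib.request.urlopen(url) as response:
            return response.read().decode('utf-8')
    except Exception as e:
        print(f"Error fetching HTML: {e}")
        return None

# Main program
if __name__ == "__main__":
    html = fetch_html("https://www.wikipedia.org/")
    if html:
        link_pattern = r'href=["\'](https?://[^"\']+)["\']'
        links = re.findall(link_pattern, html)
        for link in links:
            print(link)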
See? It’s not complicated at all.
This lesson introduces some new concepts, so let’s discuss them one line at a time.
First, we import two modules:
import urllib.request
import re
The urllib.request module includes the functions that open and read the desired URLs. We will also use the ‘re’ module, Python’s regular-expression (‘RegEx’) engine, to create a search pattern that filters the outbound links.
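If you haven’t used ‘re’ before, here’s a tiny, self-contained taste of the findall function we’ll rely on (the HTML snippet is made up for illustration):

import re

sample = 'Visit <a href="https://example.com/">Example</a> today!'
# findall returns a list of every substring captured by the group
print(re.findall(r'href="([^"]+)"', sample))  # ['https://example.com/']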
The function below runs when we start the application and tries to get the HTML content from Wikipedia’s homepage.
def fetch_html(url):
    try:
        with urllib.request.urlopen(url) as response:
            return response.read().decode('utf-8')
    except Exception as e:
        print(f"Error fetching HTML: {e}")
        return None
If the request succeeds, the function returns the page content decoded as UTF-8 text, which Python can work with directly; otherwise, it prints an error message and returns None.
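You can trigger the error path yourself by pointing fetch_html at an address that can’t resolve (the .invalid hostname below is deliberately bogus):

html = fetch_html("https://this-domain-does-not-exist.invalid/")
print(html)  # prints None, right after the error message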
But let’s assume that everything went okay; Wikipedia was online, and our Internet connection worked great. In this case, we’ve got an HTML data chunk we can play with.
# Main program
if __name__ == "__main__":
    html = fetch_html("https://www.wikipedia.org/")
The following lines of code create a regular expression that extracts the actual URLs from HTML links. We have added a dedicated RegEx tutorial to the "Python Data Structures" course, so you can find a detailed explanation there.
    if html:
        link_pattern = r'href=["\'](https?://[^"\']+)["\']'
        links = re.findall(link_pattern, html)
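In short, the pattern matches href= followed by an opening quote (single or double), captures everything from http:// or https:// up to the next quote, and then matches the closing quote. Here’s a quick sanity check on a made-up anchor tag:

import re

link_pattern = r'href=["\'](https?://[^"\']+)["\']'
snippet = '<a href="https://en.wikipedia.org/">English</a>'
print(re.findall(link_pattern, snippet))  # ['https://en.wikipedia.org/']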
Finally, we print the links one per line.
        for link in links:
            print(link)
Here’s how the output looks: a plain list of absolute URLs, one per line (the exact links change whenever Wikipedia updates its homepage).
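One optional tweak: homepages often repeat links, so if you’d rather see each URL only once, a minor variation on the loop above sorts and deduplicates them first:

for link in sorted(set(links)):
    print(link)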
We hope you liked this tutorial. The code is clean, but extracting data from HTML with regular expressions isn’t the best approach to scraping. For complex projects, and for sites that change their structure regularly, regex-based scrapers may eventually fail.
Fortunately, Python has several libraries that make web scraping even easier, tolerate the accidental HTML errors that web developers introduce, and do so much more. The most popular of them, Beautiful Soup, will be the star of our next tutorial.