Scraping basics with Python 3 and urllib
Scraping means using a program to extract data from a source. When the source is a website or a blog, we call it web scraping, and today we will see how Python can be used to quickly scrape online content.
In this article, we will only use tools available in the Python standard library, so there is nothing to install. In an upcoming article, we will see how popular external libraries such as BeautifulSoup and Scrapy can improve our scripts.
Usually, a scraping script is used to fetch multiple pages at once. In this article, we will craft a quick script to retrieve all the HTML pages of a given website.
Fetching a webpage
The first step is to use the urllib library to fetch the webpage.
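A minimal sketch using urllib.request; the url here is a placeholder, point it at the site you actually want to scrape.

```python
from urllib.request import urlopen

# Placeholder url; replace with the website you want to scrape.
url = "https://example.com/"

with urlopen(url) as response:
    html = response.read()       # the raw bytes of the page
    status = response.status     # the HTTP status code
```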
Accessing the content
The best way to access the HTML content would be to parse it using the BeautifulSoup library. But if you like to keep things simple, you can read it like this.
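A simple sketch: read() gives bytes, which we decode to a str. The charset lookup via the response headers is a reasonable default, with utf-8 as a fallback; the url is again a placeholder.

```python
from urllib.request import urlopen

# Placeholder url; replace with the page you want to read.
with urlopen("https://example.com/") as response:
    # read() returns bytes; decode them to get a str.
    # The page may declare its charset in the headers; fall back to utf-8.
    charset = response.headers.get_content_charset() or "utf-8"
    html = response.read().decode(charset)
```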
Since our goal is to traverse the whole website, we need a way to extract the links contained in the webpage. The best way would be to use the BeautifulSoup library, but if you like to keep things simple, here’s how to do it using regexes.
As very well explained on this page, you need to use a binary regex to search a binary string, which is why I'm prefixing the regex with b'...'. The regex comes from this webpage, and it was good enough for my needs.
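Here is a sketch of the idea with a simplified pattern (the href-matching regex below is an illustration, not the exact one mentioned above), run against a small hypothetical snippet of HTML bytes:

```python
import re

# Hypothetical raw bytes, standing in for a page fetched with urlopen(...).read()
html = b'<a href="https://example.com/page">link</a> <a href=\'/about\'>about</a>'

# A binary regex (note the b prefix) since we search raw bytes.
# Simplified pattern: it grabs the value of each href attribute.
link_re = re.compile(rb'href=["\']?([^"\' >]+)')

links = [match.decode("utf-8") for match in link_re.findall(html)]
print(links)  # ['https://example.com/page', '/about']
```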
In order to stay on the same website, you need to filter the links and keep only the internal ones. The best way would be to use the tldextract library, but we can manage something decent in pure Python using urllib once more.
Here is how to use it.
And we see that in order to remain on the same website, we have to filter out URLs whose netloc is external. That's what we do in the next snippet.
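A sketch of that filter, with a hypothetical list of extracted links; urljoin resolves relative links against the base url before we compare netlocs:

```python
from urllib.parse import urljoin, urlparse

base = "https://example.com/"           # placeholder base url
base_netloc = urlparse(base).netloc

# Hypothetical links extracted from a page (absolute, relative, and external).
links = ["https://example.com/about", "/contact", "https://other.org/page"]

internal = []
for link in links:
    absolute = urljoin(base, link)      # resolve relative links
    if urlparse(absolute).netloc == base_netloc:
        internal.append(absolute)

print(internal)  # ['https://example.com/about', 'https://example.com/contact']
```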
Also, we want to avoid image files, so we can filter using common file extensions. This is not perfect, but as is common in engineering, good enough is good enough.
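One way to sketch that filter; the extension list is an assumption, extend it to taste:

```python
# A rough filter: skip urls whose path ends with a common non-HTML extension.
SKIPPED = (".png", ".jpg", ".jpeg", ".gif", ".svg", ".css", ".js", ".pdf", ".zip")

def looks_like_html(url):
    path = url.lower().split("?", 1)[0]  # drop any query string
    return not path.endswith(SKIPPED)

print(looks_like_html("https://example.com/post.html"))  # True
print(looks_like_html("https://example.com/logo.png"))   # False
```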
So, we know how to retrieve an HTML webpage and parse the URLs it contains. The next step is to iterate and visit those new URLs. Here's a simple loop.
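A minimal sketch of the visiting loop: to_visit holds the pages not fetched yet, and visited remembers what we have already seen so no page is fetched twice. The fetching and link extraction steps are left as comments here.

```python
to_visit = {"https://example.com/"}  # placeholder start url
visited = set()

while to_visit:
    url = to_visit.pop()
    visited.add(url)
    # Here: fetch `url`, extract its internal links, then queue the new ones:
    # to_visit |= internal_links - visited
```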
Simply visiting webpages without processing them is useless, so you'll likely want to actually do something with the HTML content we fetched.
Here is the loop.
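A sketch combining the previous pieces; process is a placeholder hook where your own logic would go, and OSError catches urllib's URLError (a subclass) for pages that fail to load:

```python
import re
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen

LINK_RE = re.compile(rb'href=["\']?([^"\' >]+)')  # simplified href pattern

def process(url, html):
    # Placeholder: save the page to disk, index it, extract data...
    print(url, len(html), "bytes")

def crawl(start):
    base_netloc = urlparse(start).netloc
    to_visit, visited = {start}, set()
    while to_visit:
        url = to_visit.pop()
        visited.add(url)
        try:
            with urlopen(url) as response:
                html = response.read()
        except OSError:
            continue  # skip pages that fail to load
        process(url, html)
        for raw in LINK_RE.findall(html):
            absolute = urljoin(url, raw.decode("utf-8", "ignore"))
            if urlparse(absolute).netloc == base_netloc and absolute not in visited:
                to_visit.add(absolute)

# crawl("https://example.com/")  # uncomment with a real start url
```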
→ Being able to interrupt your script and resume it without losing your data is highly desirable here. Check out this article to see how I'm doing it.
→ Parse HTML instead of using complicated regexes: Scraping with BeautifulSoup.
→ Use a bulletproof framework instead of writing your own event loop and error logic: Scraping with Scrapy (soon)
Putting it all together
Here is the final script.
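A sketch assembling all the snippets above into one script; the start url is a placeholder, the regex is the simplified pattern used earlier, and process is the hook to replace with your own logic.

```python
"""Crawl a website and process every internal HTML page it links to."""
import re
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen

LINK_RE = re.compile(rb'href=["\']?([^"\' >]+)')  # simplified href pattern
SKIPPED = (".png", ".jpg", ".jpeg", ".gif", ".svg", ".css", ".js", ".pdf", ".zip")

def looks_like_html(url):
    # Rough extension-based filter for non-HTML resources.
    return not url.lower().split("?", 1)[0].endswith(SKIPPED)

def process(url, html):
    # Placeholder: replace with whatever you want to do with each page.
    print(url, len(html), "bytes")

def crawl(start):
    base_netloc = urlparse(start).netloc
    to_visit, visited = {start}, set()
    while to_visit:
        url = to_visit.pop()
        visited.add(url)
        try:
            with urlopen(url) as response:
                html = response.read()
        except OSError:
            continue  # skip pages that fail to load
        process(url, html)
        for raw in LINK_RE.findall(html):
            absolute = urljoin(url, raw.decode("utf-8", "ignore"))
            if (urlparse(absolute).netloc == base_netloc
                    and looks_like_html(absolute)
                    and absolute not in visited):
                to_visit.add(absolute)

if __name__ == "__main__":
    crawl("https://example.com/")  # placeholder start url
```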