Python 5 minutes

Scraping basics with python3 and urllib

Scraping means using a program to extract data from a source. When the source is a website or a blog, we call it web scraping, and today we will see how python can be used to quickly scrape online content.

In this article, we will use only the tools available in the python standard library so that there is nothing to install. In an upcoming article, we will see how to use popular external libraries such as BeautifulSoup and Scrapy to improve our scripts.

Usually, a scraping script is used to fetch multiple pages at once. In this article, we will craft a quick script to retrieve all the html pages on a given website.

Fetching a webpage

The first step is to use the urllib library to fetch the webpage:

from urllib.request import Request, urlopen
from urllib.error import URLError

def get_html(url):
    # construct an http request for the given url
    req = Request(url,
                  data=None,
                  headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'})

    # send the request and fetch the response
    response = None
    try:
        response = urlopen(req)
    except URLError as e:
        # HTTPError is a subclass of URLError and carries a status code
        if hasattr(e, 'code'):
            print('The server couldn\'t fulfill the request.')
            print('Error code: ', e.code)
        else:
            print('We failed to reach a server.')
            print('Reason: ', e.reason)

    # on error, simply return an empty binary string
    if response is None:
        print('Failed to fetch', url)
        html = b''

    # on success, read the html content into a binary string
    else:
        html = response.read()

    return html

Accessing the content

The best way to access the html content would be to parse it using the BeautifulSoup library. But if you like to keep things simple, you can read it like this.

url = 'https://google.com'
html_binary = get_html(url)

# use the proper encoding here ('utf-8', 'ascii', ...)
html = html_binary.decode('utf-8')
print(html)
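
If you don't know the page's encoding in advance, a forgiving fallback is to replace the bytes that can't be decoded instead of raising an error:

# replace undecodable bytes instead of raising UnicodeDecodeError
html = html_binary.decode('utf-8', errors='replace')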

Since our goal is to traverse the whole website, we need a way to extract the links contained in the webpage. The best way would be to use the BeautifulSoup library, but if you like to keep things simple, here’s how to do it using regexes.

import re

url_binary_regex = rb'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'

def find_urls(html_binary):
    # findall on a binary string returns binary matches, so decode them back to str
    urls = re.findall(url_binary_regex, html_binary)
    return [url.decode('utf-8') for url in urls]

As very well explained on this page, you need to use a binary regex to search a binary string, which is why the pattern is prefixed with rb'...' (b for binary, r so the backslashes are left alone). The regex itself comes from this webpage, and it was good enough for my needs. Note that the matches come back as binary strings too, hence the decode at the end of find_urls.
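
To see the difference on a small made-up example:

import re

html_binary = b'<a href="https://www.google.com/page1.php">a link</a>'

# a text (str) pattern cannot be applied to a bytes object:
# re.findall(r'https?://[^"]+', html_binary)
# -> TypeError: cannot use a string pattern on a bytes-like object

# with a binary pattern, matching works and the results are binary strings too
print(re.findall(rb'https?://[^"]+', html_binary))
# [b'https://www.google.com/page1.php']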

In order to stay on the same website, you need to filter the links and keep only the internal ones. The best way would be to use the tldextract library, but we can manage something decent in pure python using urllib once more.

Here is how to use urlparse:

>>> from urllib.parse import urlparse
>>> urlparse('https://www.google.com/page1.php?q=5')
ParseResult(scheme='https', netloc='www.google.com', path='/page1.php', params='', query='q=5', fragment='')

And we see that in order to remain on the same website, we have to keep only the urls whose netloc matches the one of our start page and drop the external ones. That's what we do in the next snippet.

from urllib.parse import urlparse

def filter_urls(urls, netloc):
    return [url for url in urls if urlparse(url).netloc == netloc]

Also, we want to avoid image files so we can filter using common file extensions. This is not perfect, but as is common in engineering, good enough is good enough.

def has_bad_format(url):
    exts = ['.gif', '.png', '.jpg']
    return any(ext in url for ext in exts)

def filter_urls(urls, netloc):
    urls = [url for url in urls if urlparse(url).netloc == netloc]
    urls = [url for url in urls if not has_bad_format(url)]
    return urls
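
A quick sanity check with a few made-up links:

links = ['https://www.google.com/page1.php',
         'https://www.google.com/logo.png',
         'https://en.wikipedia.org/wiki/Web_scraping']

print(filter_urls(links, 'www.google.com'))
# ['https://www.google.com/page1.php']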

Looping

So, we know how to retrieve an HTML webpage and extract the urls it contains. The next step is to iterate and visit those new urls with a simple loop.

Simply visiting webpages without processing them is useless, so you'll likely want to actually do something with the html content we fetch.

def process_html(url, b_html):
    # do something useful with the page content here
    pass
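
For instance (this is just one possibility, independent of the scraping logic itself), you could dump every page to a local folder:

import os
from urllib.parse import urlparse

def process_html(url, b_html):
    # save the page under ./pages/, named after its path
    os.makedirs('pages', exist_ok=True)
    name = urlparse(url).path.strip('/').replace('/', '_') or 'index'
    with open(os.path.join('pages', name + '.html'), 'wb') as f:
        f.write(b_html)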

Here is the loop.

start_url = 'https://www.google.com/'
to_visit = set([start_url])
visited = set()

while to_visit:
    url = to_visit.pop()
    visited.add(url)

    html = get_html(url)
    process_html(url, html)
    
    links = find_urls(html)
    links = filter_urls(links, 'www.google.com')
    links = set(links)
    newlinks = (links - visited) - to_visit
    
    to_visit = to_visit | newlinks

→ Being able to interrupt your script and resume it without losing your data is highly desirable here. Check out this article to see how I'm doing it.

→ Parse HTML instead of using complicated regexes: Scraping with BeautifulSoup.

→ Use a bulletproof framework instead of writing your own event loop and error logic: Scraping with Scrapy (soon)

Putting it all together

Here is the final script.
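
It simply puts the previous snippets together; adapt start_url, the netloc passed to filter_urls, and the body of process_html to your own needs.

import re
from urllib.request import Request, urlopen
from urllib.error import URLError
from urllib.parse import urlparse


def get_html(url):
    req = Request(url,
                  data=None,
                  headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'})

    response = None
    try:
        response = urlopen(req)
    except URLError as e:
        if hasattr(e, 'code'):
            print('The server couldn\'t fulfill the request.')
            print('Error code: ', e.code)
        else:
            print('We failed to reach a server.')
            print('Reason: ', e.reason)

    if response is None:
        print('Failed to fetch', url)
        return b''

    return response.read()


url_binary_regex = rb'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'


def find_urls(html_binary):
    urls = re.findall(url_binary_regex, html_binary)
    return [url.decode('utf-8') for url in urls]


def has_bad_format(url):
    exts = ['.gif', '.png', '.jpg']
    return any(ext in url for ext in exts)


def filter_urls(urls, netloc):
    urls = [url for url in urls if urlparse(url).netloc == netloc]
    urls = [url for url in urls if not has_bad_format(url)]
    return urls


def process_html(url, b_html):
    # do something useful with the page content here
    pass


start_url = 'https://www.google.com/'
to_visit = set([start_url])
visited = set()

while to_visit:
    url = to_visit.pop()
    visited.add(url)

    html = get_html(url)
    process_html(url, html)

    links = filter_urls(find_urls(html), 'www.google.com')
    newlinks = (set(links) - visited) - to_visit

    to_visit = to_visit | newlinks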