Web Scraping, Part 2

You have installed BeautifulSoup (bs4) and tried some basic scraping.

If you have not yet installed the Requests module, do it now (in your virtual environment).

pip install requests

If you have not made a virtual environment yet, see these instructions.

The code for this chapter is here.

Using select() instead of find() or find_all()

In the previous section we covered several commonly used commands for scraping with BeautifulSoup:

soup.h1.text
soup.find_all( "td", class_="city" )
soup.find_all("img")
soup.find(id="call")

In chapter 12 of Automate the Boring Stuff with Python (second edition), the author covers another command, the select() method. More info: Read the docs.

This method might hold special appeal to people used to working with JavaScript, because the selector you write inside the parentheses of select() uses the same CSS selector syntax as this commonly used JavaScript method:

document.querySelectorAll()

So instead of ( "td", class_="city" ), we would write ( "td.city" ), and instead of (id="call"), we would write ("#call").

Note that select() always returns a list, even when only one item is found.

Be mindful that the way you write out what you’re looking for depends on whether you are calling select() or you are calling find() or find_all(). You’ll get errors if you mix up the syntax.
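Here are the same queries written both ways, using the examples from the previous section (soup is assumed to be the same BeautifulSoup object as above):

soup.find_all( "td", class_="city" )   # find_all() syntax
soup.select( "td.city" )               # select() syntax, like CSS

soup.find( id="call" )                 # find() returns one Tag (or None)
soup.select( "#call" )                 # select() returns a list, even for one match

BeautifulSoup also provides select_one(), which takes the same CSS-style selector but returns only the first matching Tag (or None), much like find().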

Working with lists of Tag objects

Both find_all() and select() always return a Python list. Each item in the list is a BeautifulSoup Tag object. You can access any list item using its index — just as you would with any normal Python list.

Try this code in the Python shell:

from bs4 import BeautifulSoup
import requests
url = "https://weimergeeks.com/examples/scraping/example1.html"
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')
images = soup.select('img')
print(images)

You’ll see that you have a Python list of IMG elements.
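Because it is a normal list, you can check how many Tag objects it holds with len() and pull out a single one with an index:

print( len(images) )
print( images[0] )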

You can call .get_text() on a Tag object, or use its .text attribute, to get only the text inside the element. To get the text from just one Tag object in a list, use its list index:

cities = soup.select('td.city')
print(cities[0])
print(cities[0].text)

To get the text from all the items in the list, you need a for-loop:

for city in cities:
    print(city.text)

If an element has attributes, you can get a Python dictionary containing all of them — again, use an index to see just one item from the list:

images = soup.select('img')
print( images[0].attrs )

For example, here is how those commands run in the Python shell:

>>> print( images[0] )
<img alt="thumbnail" src="images/thumbnails/park_structures.jpg"/>
>>> print( images[0].attrs )
{'src': 'images/thumbnails/park_structures.jpg', 'alt': 'thumbnail'}

To get a particular attribute for all the IMG elements, you need a for-loop:

for image in images:
    print( image.attrs['src'] )

Again, here’s how that would run in the Python shell:

>>> for image in images:
...   print( image.attrs['src'] )
...
images/thumbnails/park_structures.jpg
images/thumbnails/building.jpg
images/thumbnails/mosque.jpg
images/thumbnails/turrets.jpg
images/thumbnails/russia.jpg
>>>

View the example web page to get a clear idea of where those attributes came from:

https://weimergeeks.com/examples/scraping/example1.html

Another way to get a particular attribute from a Tag object is with .get():

for image in images:
    print( image.get('src') )

As you see, there are various ways to do the same thing with BeautifulSoup. If you find it confusing, choose one way and stick with it.
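A third equivalent form you may see treats the Tag object itself like a dictionary and looks up the attribute with square brackets:

for image in images:
    print( image['src'] )

The difference: image['src'] raises a KeyError if the attribute is missing, while image.get('src') quietly returns None.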

When in doubt, refer to the BeautifulSoup documentation — it’s all on one page, so search it with Command-F.

Finding inside a Tag object

The methods find(), find_all(), and select() work on Tag objects as well as BeautifulSoup objects (types of objects are covered here). Here is an example:

../python_code_examples/scraping/table_scrape.py
 1  from bs4 import BeautifulSoup
 2  import requests
 3
 4  url = "https://en.wikipedia.org/wiki/List_of_Scottish_monarchs"
 5  page = requests.get(url)
 6  soup = BeautifulSoup(page.text, 'html.parser')
 7
 8  # get the first table in the article
 9  table = soup.find( 'table', class_='wikitable' )
10
11  # get a list of all rows in that table
12  rows = table.find_all('tr')
13
14  # loop over all rows, get all cells
15  for row in rows:
16      try:
17          cells = row.find_all('td')
18          # print contents of the first cell in the row
19          print( cells[0].text )
20      except:
21          pass

Once we’ve got table out of soup (line 9 above), we can go on to find elements inside the Tag object table. First we get a list of all rows (line 12). Then we can loop over the list of row objects (starting on line 15) and make a list of all table cells in each row (line 17). From that list, we can extract the contents of one or more cells. In the for-loop, by printing cells[0].text (line 19), we will see a list of all Scottish monarchs in the first table on the page.

It’s as if we are taking apart a set of nested boxes. We go inside the table to get the rows. We go inside a row to get its cells.

Note

Using try / except in the script above enables us to skip over the header row of the table, where the HTML tags are th instead of td.
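If you prefer to avoid try / except, an equivalent sketch (not the author's version) checks whether the row produced any td cells before using the list:

for row in rows:
    cells = row.find_all('td')
    # a header row contains th cells, so this list will be empty
    if cells:
        print( cells[0].text )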

Since the Scottish monarchs page has multiple tables, the code above should be modified to get them all:

tables = soup.find_all( 'table', class_='wikitable' )

And then we will need to loop through the tables:

for table in tables:
    rows = table.find_all('tr')
    for row in rows:
        try:
            cells = row.find_all('td')
            # print contents of the first cell in the row
            print( cells[0].text )
        except:
            pass

Our set of nested boxes actually begins with the page. Inside the page are several tables. Inside each table, we find rows, and inside each row, we find cells. Inside the first cell in each row, we find the name of a king.

Note

The revised script works perfectly on the Scottish monarchs page because the tables in that page are formatted consistently. On many web pages containing multiple tables, this would not be so.

Moving from page to page while scraping

In chapter 12 of Automate the Boring Stuff with Python (second edition), Sweigart provides a script to scrape the XKCD comics website (“Project: Downloading All XKCD Comics”). The code in steps 3 and 4, which is part of a longer while-loop, gets the URL from an element on the page that links to the previous comic. In this way, the script starts on the home page of the site, downloads one comic, and then moves to the previous comic’s page and downloads the comic there. The script repeats this, moving to the previous page each time, until all comics have been downloaded.

Note

This method is often exactly what you need to scrape the data you want. In many cases, though, you will move forward through a set of pages, using a “next” link instead of a “previous” link.

The trick is to determine exactly how to get the URL that leads to the next page to be scraped.

In the case of the XKCD site, this code works:

prevLink = soup.select('a[rel="prev"]')[0]
url = 'https://xkcd.com' + prevLink.get('href')

The code select('a[rel="prev"]') gets all a elements on the page that contain the attribute rel with the value "prev" — that is, rel="prev". This code returns a list, so it’s necessary to use the list index [0] to get the first list item.

The next line extracts the value of the href attribute from that A element and concatenates it with the base URL, https://xkcd.com.
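If you would rather not build the URL by hand, the standard library function urljoin() from urllib.parse does the same concatenation for you; a small sketch:

from urllib.parse import urljoin

# the base is the URL of the page you just scraped; the partial href is resolved against it
url = urljoin('https://xkcd.com/2260/', prevLink.get('href'))
# urljoin('https://xkcd.com/2260/', '/2259/') produces 'https://xkcd.com/2259/'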

If you inspect the HTML on any XKCD page with Developer Tools, you can find this A element.

[Screenshot: Developer Tools open on an XKCD web page]

To understand this code better, you can run it in the Python shell. Here I have started on the page at https://xkcd.com/2260/

>>> from bs4 import BeautifulSoup
>>> import requests
>>> url = 'https://xkcd.com/2260/'
>>> page = requests.get(url)
>>> soup = BeautifulSoup(page.text, 'html.parser')

Note

I am not starting at the home page and I am not looping, because I want to provide a simple demonstration of what the code is getting for us.

Then I continued with the code to get only the “Prev” link from that one page:

>>> prevLink = soup.select('a[rel="prev"]')[0]
>>> print(prevLink)
<a accesskey="p" href="/2259/" rel="prev">&lt; Prev</a>
>>> print( prevLink.get('href') )
/2259/
>>> url = 'https://xkcd.com' + prevLink.get('href')
>>> print(url)
https://xkcd.com/2259/
>>>

Above, I have printed three things — prevLink, prevLink.get('href') and url — so I can see exactly what is being extracted and used.

In the complete script in chapter 12, once the script has that last URL, the while-loop restarts. It opens that page — https://xkcd.com/2259/ — and downloads the comic from it.

This practice of printing the value each time is a way of testing your code as you go — to make sure you are getting what you intend to get. If you have an error, then you must modify the line and print it again, and repeat until that line of code gets what you want.
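To see the whole pattern in one place, here is a minimal sketch of that page-to-page loop. It only prints each page’s URL rather than downloading the comic image, and the stopping test assumes that on the very first comic, the “Prev” link’s href is just "#" (this is also how the chapter 12 script knows when to stop):

from bs4 import BeautifulSoup
import requests

url = 'https://xkcd.com'
# this will visit every comic page, all the way back to the first one
while not url.endswith('#'):
    print(url)
    page = requests.get(url)
    soup = BeautifulSoup(page.text, 'html.parser')
    prevLink = soup.select('a[rel="prev"]')[0]
    url = 'https://xkcd.com' + prevLink.get('href')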

Important

You must understand that every website is different, so probably no other website in the world has the same HTML as the XKCD website. However, many websites do have Previous and Next buttons. It is necessary to inspect the HTML and determine how to extract the next- or previous-page URL (or partial URL) from the HTML on the button.

Some websites use JavaScript to activate their Previous and Next buttons. In those cases, you will need to use the Selenium module to navigate while scraping. Selenium is covered in the next chapter.

Moving from page to page while scraping, PART 2

I have posted two example scripts to help you with sites where moving from page to page involves something like this:

[Screenshot: paginated page links on a website]

Note

Usually you do NOT need Selenium for these links. They are a lot like the XKCD example discussed above.

The two scripts are:

  • mls_pages.py — This one uses the “Go to next page” link until there is no next page.

  • mls_pages_v2 — This one uses for i in range(), which works if you know how many pages there are.

Note that the two scripts do the same thing on one particular website, the Players section of the Major League Soccer site (the code of that website has changed, and these scripts no longer work). The difference is in the way each script gets the link to the next page.
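Since the MLS scripts no longer run, here is only the general shape of the first approach: a sketch with a made-up starting URL and a made-up selector (a.next-page) standing in for whatever the real site uses:

from bs4 import BeautifulSoup
import requests

url = 'https://example.com/players?page=1'  # hypothetical starting page
while url:
    page = requests.get(url)
    soup = BeautifulSoup(page.text, 'html.parser')
    # ... scrape what you need from this page here ...
    next_link = soup.select('a.next-page')   # hypothetical "Go to next page" link
    if next_link:
        url = next_link[0].get('href')       # assumes the href is a complete URL
    else:
        url = None                           # no next link: we are on the last page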

Important

Remember that every website is different, so probably no other website in the world has the same HTML as the MLS website. However, many websites use a similar set of links to pages. The MLS site does not have a single Players list anymore but instead provides player lists on each team roster page, such as this one: Atlanta United

Harvesting multiple URLs from one page

In some cases, you will want to get all the URLs from one page and save them in a file. You would then use another script to open them one by one and scrape multiple pages.

../python_code_examples/scraping/harvest_urls.py
 1  """scrape all the URLs from the article segment of a Wikipedia page
 2     and write them into a plain-text file - gets all links - even
 3     those that begin with #
 4  """
 5
 6  from bs4 import BeautifulSoup
 7  import requests
 8
 9  # get the contents of one page
10  start = 'http://en.wikipedia.org/wiki/Harrison_Ford'
11  page = requests.get(start)
12  soup = BeautifulSoup(page.text, 'html.parser')
13
14  # name the text file that will be created or overwritten
15  filename = 'myfile.txt'
16
17  def capture_urls(filename, soup):
18      """harvest the URLs and write them to file"""
19      # create and open the file for writing - note, with 'w' this will
20      # delete all contents of the file if it already exists
21      myfile = open(filename, 'w')
22
23      # get all contents of only the article
24      article = soup.find(id='mw-content-text')
25
26      # get all <a> elements
27      links_list = article.find_all('a')
28
29      # get contents of all href='' attributes with a loop
30      for link in links_list:
31          if 'href' in link.attrs:
32              # write one href into the text file - '\n' is newline
33              myfile.write(link.attrs['href'] + '\n')
34
35      # close and save the file after loop ends
36      myfile.close()
37
38  # call the function
39  capture_urls(filename, soup)

It is likely that you do not want header and footer links from the page. You need to inspect the HTML and ascertain what element holds the main text. For a Wikipedia article, there’s an id attribute with the value 'mw-content-text', so that’s what we start with in line 24.

When we get all the a elements with links_list = article.find_all('a') (line 27), we are getting only the a elements that are inside the DIV element with id='mw-content-text' — because the variable article here is a Tag object containing that entire DIV.

Then we use a loop (lines 30–33) to look at each item in links_list. We check if an href attribute exists in the item with this line — if 'href' in link.attrs: — and if there is an HREF, then we write the value of that HREF into the file (line 33).

The script above writes more than 1,400 partial URLs into a file.

As with the XKCD script in the previous section, here we would also concatenate a base URL with the partial URL in a scraping script:

base_url = 'https://en.wikipedia.org'
url = base_url + '/wiki/Blade_Runner'

Note that the harvest_urls.py script collects a lot of partial URLs we would never want, such as internal anchors that link to parts of the same page (#cite_note-1 and #Early_career, for example). To prevent those from being written to the file, we could use:

for link in links_list:
    if 'href' in link.attrs:
        # eliminate internal anchor links
        if link.attrs['href'][:6] == '/wiki/':
            # eliminate Wikipedia photo and template links
            if link.attrs['href'][6:11] != 'File:' and link.attrs['href'][6:14] != 'Template':
                # write one href into the text file - \n is newline
                myfile.write(link.attrs['href'] + '\n')

That is a bit clunky, but if you look up how to slice strings with Python (Command-F search there for “Slice indices”), I think the code will make sense to you.

As a result, we have about 900 partial URLs instead of more than 1,400.
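An equivalent but slightly more readable way to write those checks (not the original script’s version) uses the string method startswith():

for link in links_list:
    if 'href' in link.attrs:
        href = link.attrs['href']
        # keep article links only; skip anchors, File: pages and Template pages
        if href.startswith('/wiki/') and not href.startswith('/wiki/File:') and not href.startswith('/wiki/Template'):
            myfile.write(href + '\n')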

Important

You will always need to inspect the HTML of a page to figure out how best to harvest URLs from that particular page.

Scrape multiple pages with one script

This example shows how you can scrape multiple items from multiple pages, not using a Previous and Next button but (instead) using a collected list of partial URLs.

../python_code_examples/scraping/scrape_several_pages.py
 1  """scrape a heading and part of a paragraph from multiple pages,
 2     using a list of partial URLs
 3  """
 4
 5  from bs4 import BeautifulSoup
 6  import requests
 7
 8  # links from http://en.wikipedia.org/wiki/Harrison_Ford
 9  link_list = ['/wiki/Melissa_Mathison',
10      '/wiki/Calista_Flockhart',
11      '/wiki/Han_Solo',
12      '/wiki/Star_Wars_Trilogy',
13      '/wiki/Indiana_Jones',
14      '/wiki/Air_Force_One_(film)',
15      '/wiki/Blade_Runner',
16      '/wiki/Carrie_Fisher']
17
18  # harvest data from each URL
19  def get_info(page_url):
20      page = requests.get('https://en.wikipedia.org' + page_url)
21      soup = BeautifulSoup(page.text, 'html.parser')
22
23      try:
24          print(soup.h1.get_text())
25          # get all paragraphs in the main article
26          paragraphs = soup.find(id='mw-content-text').find_all('p')
27          for p in paragraphs:
28              # skip any paragraph that has attributes
29              if not p.attrs:
30                  # print 280 characters from the first real paragraph on the page
31                  print( p.text[0:280] )
32                  break
33          print() # blank line
34      except:
35          print(page_url + ' is missing something!')
36
37  # call the function for each URL in the list
38  for link in link_list:
39      get_info(link)

We are just printing the H1 and the paragraph (rather than saving them to a file or a database) for the sake of simplicity. We are using a list of only eight partial URLs for the same reason; normally you would probably have a longer list of pages to scrape.

The key is to write a function that scrapes all the data you want from one page (lines 19–35 above). Then call that function inside a for-loop that feeds each URL into the function (lines 38–39).
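If you wanted to save the results instead of printing them, one possibility (a sketch, not part of the original script) is to have the function return its two values and write them to a CSV file with Python’s csv module:

import csv
import requests
from bs4 import BeautifulSoup

link_list = ['/wiki/Blade_Runner', '/wiki/Carrie_Fisher']  # shortened version of the list above

def get_info(page_url):
    """return the heading and the start of the first real paragraph"""
    page = requests.get('https://en.wikipedia.org' + page_url)
    soup = BeautifulSoup(page.text, 'html.parser')
    heading = soup.h1.get_text()
    first_text = ''
    for p in soup.find(id='mw-content-text').find_all('p'):
        if not p.attrs:
            first_text = p.text[0:280]
            break
    return heading, first_text

with open('results.csv', 'w', newline='', encoding='utf-8') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(['heading', 'first_paragraph'])
    for link in link_list:
        writer.writerow(get_info(link))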

Tip

To create a Python list from a file such as myfile2.txt, use the readlines() method. See Reading and Writing Files for details.
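For example, here is a sketch that turns the file written by harvest_urls.py (myfile.txt) into a Python list of partial URLs; strip() removes the newline at the end of each line:

with open('myfile.txt') as f:
    link_list = [line.strip() for line in f.readlines()]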

Moving onward

In the next chapter, we’ll look at how to handle more complex scraping situations with BeautifulSoup, Requests, and Selenium.
