Beginning Data Science with Tom Waits and Python: Data Gathering

January 2018 · 5 minute read

Why?

I have been a fan of Tom Waits for years, and over that time I have wondered a few things about his songs, plus lots of things that Google couldn’t answer no matter how I asked. So I figured I would work it out with Python and get better at writing Python along the way.

TL;DR: To answer some Tom Waits questions and get better at Python.

Who is this for?

Anyone new to Python, or to small data projects with it, who wants a concrete example to follow along with.

Prepping the Environment

We don’t want to dirty up our base machine with a thousand dependencies. If you are new to Python, the default is to install every dependency user-wide or system-wide, where every project shares (and fights over) the same packages. There are a variety of third-party solutions to this.

Our main goal is to nicely encapsulate all the dependencies per project.

Enter Pipenv. It’s a lovely tool that drives pip and virtualenv together automatically, eschewing requirements.txt in favor of a Pipfile. That leaves us with a text file listing all of this project’s dependencies, one that can be checked into version control.

All of this is a long way of saying: it’s become much easier to manage per-project libraries and environments. Thanks, Kenneth Reitz!

So, I’m on Ubuntu. If you’re not, things will change slightly.


$ apt install python3 python3-pip
...Python installs...
$ pip3 install pipenv
...Pipenv installs for Python 3...
$ pipenv install bs4
...Installing...
$ pipenv install requests
...Installing...
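
From here on, the project’s dependencies live in their own virtualenv, so anything we run has to go through Pipenv (gather.py below is just a placeholder name for whatever script you end up writing):

$ pipenv run python3 gather.py
...Runs the script inside the project's virtualenv...
$ pipenv shell
...Or spawn a shell with the virtualenv activated...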

For now we are using another Kenneth Reitz hit, [Requests](http://docs.python-requests.org/en/master/), to fetch the web pages, and [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) to turn the HTML into something we can parse programmatically.
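
As a tiny illustration of how the two fit together (a throwaway sketch, not part of the scraper; example.com is a stand-in URL):

import requests
from bs4 import BeautifulSoup

response = requests.get("https://example.com")       # Requests fetches the raw HTML.
soup = BeautifulSoup(response.text, "html.parser")   # BeautifulSoup parses it into a tree.

print(soup.title.string)                             # Now the page is queryable.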

Gathering the Data

There’s a lot of garbage on the Internet. Tons in fact. Type “song lyrics” into Google and be mystified.

So I asked Tom’s website directly. Everything I ask with code is beholden to the data: garbage in, garbage out. I might as well use the listings of songs and lyrics from his own website. I figure if he lists a song twice, he considers them truly, canonically, different. And that’s good enough for me.

from bs4 import BeautifulSoup   # Parsing fetched web pages.
import requests                 # Getting web data.
import shelve                   # Storing parsed data.
import os.path                  # Checking that parsed data exists.


# Returns a dict holding a requests.get response ('request') and a parsed BeautifulSoup object ('bs').
def create_request_soup(url):
    combined_url_return = {}
    combined_url_return['request'] = requests.get(url)
    combined_url_return['bs'] = BeautifulSoup(combined_url_return['request'].text, "html.parser")

    return combined_url_return


def create_names_list(url):
    names = []

    soup_names = create_request_soup(url)['bs']
    names_table = soup_names.find("table")

    # Skip the header row, then take the name from the first column of each row.
    for name in names_table.find_all("tr")[1:]:
        col = name.find_all("td")

        names.append(col[0].string.extract())

    return names


tom_waits_song_database_location = "/home/chris/Projects/tom_waits/tws.db"

if os.path.exists(tom_waits_song_database_location):
    print("ERROR: Tom Waits Local Database already exists")
    exit()

# Get male and female names
male_names_url = "https://exampleurl/malenames.htm"
female_names_url = "https://exampleurl/femalenames.htm"

names = {'male': create_names_list(male_names_url), 'female': create_names_list(female_names_url)}

song_listing_page = requests.get('http://www.tomwaits.com/songs/')
# Chop the path off the final URL to keep just scheme and host
# (e.g. 'http://www.tomwaits.com') for building absolute song URLs later.
bare_url = song_listing_page.url[:len(song_listing_page.url) - len(song_listing_page.request.path_url)]

parsed_song_listing_page = BeautifulSoup(song_listing_page.text, "html.parser")

songs = {}

song_urls = parsed_song_listing_page.find_all(href=True)

# Get albums and years
album_listing_page = requests.get("http://www.tomwaits.com/albums/#/albums/album/34/Bad_As_Me/")
parsed_album_listing_page = BeautifulSoup(album_listing_page.text, "html.parser")

albums = {}

album_urls = parsed_album_listing_page.find_all(href=True)
for url in album_urls:
    if '/album/' in url['href']:
        # Link text looks like 'Album Name- Year'; split it into name and year.
        parts = url.text.split("- ")
        albums[parts[0]] = {'year': parts[1], 'songs': []}


# Format as of 12/21/2017
# www.tomwaits.com/songs/song/###/_Song_Name/
for url in song_urls:
    if '/song/' in url['href']:
        song = {}
        song['name'] = url.text
        song['url'] = bare_url + url['href']

        # Fetch the song's lyrics page; its header names the album,
        # which tells us where to file the song.
        lyrics_page = requests.get(song['url'])
        parsed_lyrics_page = BeautifulSoup(lyrics_page.text, "html.parser")
        current_album = parsed_lyrics_page.find('div', class_='songs-chosen-hd').h4.a.string.extract()

        song['lyrics'] = parsed_lyrics_page.find('div', class_="songs-lyrics").get_text(" ", strip=True)

        albums[current_album]['songs'].append(song)


tom_waits_song_database = shelve.open(tom_waits_song_database_location)

tom_waits_song_database['albums'] = albums
tom_waits_song_database['songs'] = songs
tom_waits_song_database['us_names'] = names

tom_waits_song_database.close()
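
Once the script has run, it’s easy to sanity-check what landed in the shelf from another session. A minimal sketch, assuming the same database path as above:

import shelve

# Re-open the shelf and poke at what we stored.
with shelve.open("/home/chris/Projects/tom_waits/tws.db") as db:
    albums = db['albums']

for album, details in albums.items():
    print(album, details['year'], len(details['songs']), "songs")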

Just like that we have our baseline dataset that includes all album names, years of release, song titles, lyrics, and the top 5000 or so most popular male and female names in the US in 1990.

Nothing has been massaged or changed yet. That means a search for “I” won’t match “I’ll,” “I’m,” and so on, and stray punctuation like semicolons is still stuck to the words around it.
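
A quick sketch of the problem with a made-up line:

lyric = "I'll be gone; I promise"
words = lyric.lower().split()

print(words)             # ["i'll", 'be', 'gone;', 'i', 'promise']
print(words.count("i"))  # 1 -- "i'll" doesn't match, and "gone;" keeps its semicolon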

But we have our data. So, let’s get to a minimum amount of massaging and start pulling out some interesting data.

Code Thoughts

You can see above that I created the create_request_soup function. I started to refactor it into something special, but then I realized I don’t care that much. I wrote this to get the data, and I have the data.

One of the little things I will always love about programming is that if it seems like there should be an easier way to do something – there usually is.

Update, Sun Jan 21 2018

As I started double-checking my data for the second part in the series, I found that using song names as dictionary keys was a poor idea: the same song appears on many albums, so each repeat silently overwrote the entry before it. It gave me pause because it made some albums look shorter than they should be.
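
The failure mode is easy to reproduce in isolation (a toy sketch; the album names are stand-ins):

songs = {}
songs["Ol' 55"] = {'album': 'Closing Time', 'lyrics': '...'}
songs["Ol' 55"] = {'album': 'The Early Years', 'lyrics': '...'}

print(len(songs))                 # 1 -- the second assignment silently replaced the first
print(songs["Ol' 55"]['album'])   # 'The Early Years'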

Let’s instead base the structure around albums, where each album holds a list of dictionaries, one per song.
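
Roughly, the target shape looks like this (a hand-written illustration, not real scraped output):

albums = {
    'Bad As Me': {
        'year': '2011',
        'songs': [
            {'name': 'Chicago', 'url': '...', 'lyrics': '...'},
            # ...one dictionary per song on the album...
        ],
    },
    # ...one entry per album...
}

Here is the reworked gathering code: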


albums = {}

album_urls = parsed_album_listing_page.find_all(href=True)
for url in album_urls:
    if '/album/' in url['href']:
        # Link text looks like 'Album Name- Year'; split it into name and year.
        parts = url.text.split("- ")
        albums[parts[0]] = {'year': parts[1], 'songs': []}


# Format as of 12/21/2017
# www.tomwaits.com/songs/song/###/_Song_Name/
for url in song_urls:
    if '/song/' in url['href']:
        song = {}
        song['name'] = url.text
        song['url'] = bare_url + url['href']

        # Fetch the song's lyrics page; its header names the album,
        # which tells us where to file the song.
        lyrics_page = requests.get(song['url'])
        parsed_lyrics_page = BeautifulSoup(lyrics_page.text, "html.parser")
        current_album = parsed_lyrics_page.find('div', class_='songs-chosen-hd').h4.a.string.extract()

        song['lyrics'] = parsed_lyrics_page.find('div', class_="songs-lyrics").get_text(" ", strip=True)

        albums[current_album]['songs'].append(song)


# Add year of album release to each song. With the new structure the songs
# live inside each album, so walk the albums rather than a flat songs dict.
for album in albums:
    for song in albums[album]['songs']:
        song['year'] = albums[album]['year']

This leaves us with a structure that won’t let duplicate song names overwrite each other. Instead of basing the structure on songs, we base it on albums. For a visual of what this looks like, and an early stab at answering data questions with Python 3, see part two.