A Tutorial From Semalt: Web Scraping In Python
I recently visited KinoPoisk (the Russian version of IMDb) and discovered that over the years I have managed to rate more than 1000 movies. I thought it would be interesting to explore this data in more detail: have my movie tastes changed over time? During which seasons of the year do I watch more movies?
But before we analyze anything and build beautiful graphs, we need to get the data. Unfortunately, many services don't have a public API, so you have to roll up your sleeves and parse the HTML pages.
This article is intended for those who have always wanted to learn web scraping but never got around to it or didn't know where to start.
Our task is to extract the data about the movies already seen: each movie's title, the date and time it was watched, and the user's rating.
In fact, our work is going to be done in two stages:
Stage 1: download and save the HTML pages
Stage 2: parse the HTML into a format suitable for further analysis (CSV, JSON, pandas DataFrame, etc.)
There are a lot of Python libraries for sending HTTP requests. The most famous and very handy one is Requests.
It is also necessary to choose a library for HTML parsing. The two most popular ones are BeautifulSoup and lxml, and choosing between them is largely a matter of personal preference. Moreover, these libraries are closely connected: BeautifulSoup started using lxml as an internal parser for speed, and lxml gained a soupparser module. To compare the approaches, I will parse the data both with BeautifulSoup and with XPath selectors in the lxml.html module.
Let's start downloading data. First of all, let's just try to fetch the page by its URL and save it to a local file.
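A minimal sketch of this first step. The user id and the URL pattern below are assumptions based on how KinoPoisk profile pages are typically addressed; substitute your own values.

```python
import requests

# Hypothetical profile URL; replace user_id with your own KinoPoisk id.
user_id = 12345
url = f'https://www.kinopoisk.ru/user/{user_id}/votes/list/ord/date/page/1/'

def download_page(url, path):
    """Fetch a page over HTTP and save the raw HTML to a local file."""
    response = requests.get(url)
    response.raise_for_status()  # fail loudly on 4xx/5xx
    with open(path, 'w', encoding='utf-8') as f:
        f.write(response.text)
```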
We open the resulting file and see that it's not that simple: the site decided we were a robot and won't show us the data.
Let's Find Out How The Site Works
The browser has no problem getting information from the site. Let's see how exactly it sends its request. To do this we use the "Network" panel of the browser's developer tools (I use Firebug); usually, the request we need is the longest one.
As we can see, the browser also sends a User-Agent header, cookies, and a number of other parameters. First, we'll simply try sending a correct User-Agent header.
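This is how such a request might look. The User-Agent string below is just an example value copied from a desktop browser; any recent browser UA should work.

```python
import requests

# An example desktop User-Agent string (an assumption; use the one your
# browser actually sends, visible in the "Network" panel).
headers = {
    'User-Agent': ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                   'AppleWebKit/537.36 (KHTML, like Gecko) '
                   'Chrome/124.0 Safari/537.36'),
}

def fetch(url):
    # With a browser-like User-Agent, the site serves the real page
    # instead of refusing us as a robot.
    return requests.get(url, headers=headers)
```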
This time we succeed, and the site returns the necessary data. It's worth noting that sometimes the site also checks the validity of cookies, in which case sessions in the Requests library will help.
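A sketch of the session approach: a `requests.Session` stores cookies set by earlier responses and sends them back automatically, along with any default headers we configure.

```python
import requests

# A Session keeps cookies between requests and reuses the same headers,
# so the site sees us as one continuous visitor rather than a fresh
# anonymous client on every request.
session = requests.Session()
session.headers.update({'User-Agent': 'Mozilla/5.0'})

# session.get(url) would now send both the header and any cookies
# the site set on previous responses.
```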
Download All Ratings
Now we are able to save a single page of ratings. But a user usually has many ratings, so we need to iterate over all the pages. The page number we are interested in is easy to pass directly in the URL.
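The pagination loop can be sketched like this. The URL pattern with the page number embedded in the path is an assumption based on what the browser's address bar shows while paging through the list.

```python
def build_page_url(user_id, page):
    # Assumed URL pattern: the page number sits in the path itself.
    return (f'https://www.kinopoisk.ru/user/{user_id}'
            f'/votes/list/ord/date/page/{page}/')

def page_urls(user_id, n_pages):
    """URLs for pages 1..n_pages, ready to be fetched one by one."""
    return [build_page_url(user_id, p) for p in range(1, n_pages + 1)]
```

Each URL can then be fetched with the session from the previous step and saved to its own local file, so parsing never has to hit the network twice.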
Collecting Data From HTML
Now let's get directly to collecting the data from the HTML. The easiest way to understand how an HTML page is structured is to use the "Inspect element" function in the browser. In this case, everything is quite simple: the entire table of ratings lives in a div with the class profileFilmsList. Let's select this node:
from bs4 import BeautifulSoup
from lxml import html
# Beautiful Soup: pass the parser explicitly to avoid a warning
soup = BeautifulSoup(text, 'lxml')
film_list = soup.find('div', {'class': 'profileFilmsList'})
# lxml: the same node selected with an XPath expression
tree = html.fromstring(text)
film_list_lxml = tree.xpath('//div[@class = "profileFilmsList"]')
Let's learn how to extract the Russian title of the movie and a link to the movie's page (and, along the way, how to get a node's text and the value of an attribute).
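Here is a side-by-side sketch of both approaches. The snippet of HTML below is a simplified stand-in that mimics one entry of the profileFilmsList block (the real markup is richer); the class names nameRus and nameEng come from the page structure discussed above.

```python
from bs4 import BeautifulSoup
from lxml import html

# A minimal mock of one entry in the ratings list (assumed structure).
text = '''
<div class="profileFilmsList">
  <div class="item">
    <div class="nameRus"><a href="/film/326/">Побег из Шоушенка (1994)</a></div>
    <div class="nameEng">The Shawshank Redemption</div>
  </div>
</div>
'''

# Beautiful Soup: .text gives a node's text, ['href'] an attribute's value.
soup = BeautifulSoup(text, 'lxml')
movie = soup.find('div', {'class': 'nameRus'}).find('a')
title_bs, link_bs = movie.text, movie['href']

# lxml: text_content() for the text, .get() for the attribute.
tree = html.fromstring(text)
node = tree.xpath('//div[@class="nameRus"]/a')[0]
title_lxml, link_lxml = node.text_content(), node.get('href')
```

Both approaches pull out identical values; which one reads better is, again, a matter of taste.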
If you need to extract the title in English, just change "nameRus" to "nameEng".
We have learned how to parse websites, got acquainted with the Requests, BeautifulSoup, and lxml libraries, and obtained data about the movies seen on KinoPoisk in a form suitable for further analysis.