Spidering the web with Python

- September 14, 2017

Introduction

We will be talking about

Spidering/Scraping
How to do it elegantly in python
Limitations and restrictions

In the previous posts, I shared some methods of text mining and analytics, but one of the most important tasks before any analysis is getting the data we want to analyze.

Text data is everywhere in the form of blogs, articles, news, social feeds, posts, etc., and most of it is distributed to users through APIs, RSS feeds, bulk downloads and subscriptions.

Some sites do not provide any means of pulling the data programmatically; this is where scraping comes into the picture.

Note: Scraping information from sites that are not free or not publicly available can have serious consequences.

Web Scraping is a technique of getting a web page in the form of HTML and parsing it to get the desired information.

HTML is very complex in itself due to loose rules and a large number of attributes. Information can be scraped in two ways:

Manually filtering using regular expressions
Python's way: Beautiful Soup

In this post, we will be discussing Beautiful Soup's way of scraping.

Beautiful Soup

As per the definition in its documentation

"Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching and modifying the parse tree. It commonly saves programmers hours or days of work."

If you have ever tried parsing text and HTML documents by hand, you will understand how brilliantly this module is built and how much work and time it saves programmers.

Let's start with Beautiful Soup.

Installation

I hope Python is installed on your system. To install Beautiful Soup you can use pip:

pip install beautifulsoup4

Getting Started

Problem 1: Getting all the links from a page.

For this problem, we will use a sample HTML string which has some links; our goal is to get all of them.

html_doc = """
<html>
<body>
<h1>Sample Links</h1>
<br>
<a href="https://www.google.com">Google</a>
<br>
<a href="https://www.apple.com">Apple</a>
<br>
<a href="https://www.yahoo.com">Yahoo</a>
<br>
<a href="https://www.msdn.com">MSDN</a>
</body>
</html>
"""

#to import the package 
from bs4 import BeautifulSoup


#creating a BeautifulSoup object by passing 2 parameters
#1)the html to be scanned
#2)the parser to be used (html.parser, lxml, etc.)
soup=BeautifulSoup(html_doc,"html.parser")


#to find all the anchor tags in the html string
#find_all returns a list of tags, in this case anchors (to get only the first one we can use find)
#findAll is the older spelling of the same method
anchors=soup.find_all('a')

#getting links from anchor tags
for a in anchors:
    print(a.get('href')) #get is used to read an attribute of a tag
    #print(a['href']) can also be used to access the attribute of a tag

That's it: just 5-6 lines to get any tag from the HTML, iterate over the results and read some attributes.
Can you think of doing this with regular expressions? It would be one heck of a job. It shows how well the module is built to perform all these functions.
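
To make the comparison with regular expressions concrete, here is a small sketch that pulls the links both ways. The sample markup and the regex pattern are assumptions made up for illustration; the pattern only understands lower-case tags with double-quoted href values, which is the kind of fragility Beautiful Soup avoids.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample: the second link uses upper-case names and single quotes.
sample_html = """
<a href="https://www.google.com">Google</a>
<A HREF='https://www.apple.com'>Apple</A>
"""

# A naive regex that only matches lower-case, double-quoted href attributes.
naive_pattern = re.compile(r'<a\s+href="([^"]+)"')
regex_links = naive_pattern.findall(sample_html)

# Beautiful Soup (with html.parser) lower-cases tag and attribute names
# and accepts either quote style.
soup = BeautifulSoup(sample_html, "html.parser")
soup_links = [a.get("href") for a in soup.find_all("a")]

print(regex_links)  # ['https://www.google.com']
print(soup_links)   # ['https://www.google.com', 'https://www.apple.com']
```

The regex could of course be extended to cover more cases, but every new variation in the markup needs another tweak, while the parser-based version stays the same.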

Talking about parsers (the one we passed while creating the Beautiful Soup object), we have multiple choices.

This table summarizes the advantages and disadvantages of each parser library:

Parser	Typical usage	Advantages	Disadvantages
Python’s html.parser	`BeautifulSoup(markup, "html.parser")`	Batteries included; decent speed; lenient (as of Python 2.7.3 and 3.2.2)	Not very lenient (before Python 2.7.3 or 3.2.2)
lxml’s HTML parser	`BeautifulSoup(markup, "lxml")`	Very fast; lenient	External C dependency
lxml’s XML parser	`BeautifulSoup(markup, "lxml-xml")` or `BeautifulSoup(markup, "xml")`	Very fast; the only currently supported XML parser	External C dependency
html5lib	`BeautifulSoup(markup, "html5lib")`	Extremely lenient; parses pages the same way a web browser does; creates valid HTML5	Very slow; external Python dependency
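
Since lxml and html5lib are optional installs, one possible way to use the table is to prefer a fast parser and fall back to the built-in one. The sketch below assumes a hypothetical helper named make_soup; it relies on Beautiful Soup raising FeatureNotFound when the requested parser is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound

def make_soup(markup, preferred=("lxml", "html.parser")):
    """Return (soup, parser_name) using the first parser in `preferred` that is installed."""
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser), parser
        except FeatureNotFound:
            # This parser is not installed; try the next one.
            continue
    raise RuntimeError("none of the requested parsers is available")

# Deliberately sloppy markup: the <b> tag is never closed.
soup, used = make_soup("<p>Hello <b>world</p>")
print(used)               # "lxml" if it is installed, otherwise "html.parser"
print(soup.b.get_text())
```

Keep in mind that different parsers can build different trees from the same broken markup, so it is worth fixing the parser choice once a scraper is in regular use.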

Other methods and Usage

Beautiful Soup is a vast library and can do, in a single line, things that would otherwise be quite difficult.
Some of the methods for searching tags in HTML are:


#the regex search below needs the re module
import re

#finding by ID
soup.find(id='abc')

#finding tags whose name matches a regex
#limit the return to 2 tags
soup.find_all(re.compile("^a"),limit=2)

#finding multiple tags
soup.find_all(['a','h1'])

#find by custom or built-in attributes
soup.find_all(attrs={'data':'abc'})
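
To see what these searches return, here is a small runnable sketch. The markup is a made-up sample chosen so that every search has something to match; it is not taken from a real page.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only.
sample_html = """
<h1 id="abc">Sample Links</h1>
<a href="https://www.google.com" data="abc">Google</a>
<a href="https://www.apple.com">Apple</a>
<article>Some text</article>
"""
soup = BeautifulSoup(sample_html, "html.parser")

by_id = soup.find(id="abc")                          # the <h1> tag
by_regex = soup.find_all(re.compile("^a"), limit=2)  # first two tags whose name starts with "a"
by_names = soup.find_all(["a", "h1"])                # every <a> and <h1> tag
by_attr = soup.find_all(attrs={"data": "abc"})       # tags carrying data="abc"

print(by_id.name, len(by_regex), len(by_names), len(by_attr))  # h1 2 3 1
```

Note that the regex is matched against tag names, so without the limit it would also return the article tag.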

Problem 2: Getting the HTML of a page from its URL.

In the above example we parsed an HTML string. Now we will see how to hit a URL, get the HTML for that page and parse it in the same manner as the HTML string above.

For this we will be using the urllib3 package. It can be easily installed with the following command:

pip install urllib3

See the urllib3 documentation for more details.

import urllib3
http = urllib3.PoolManager()
#hitting the url
r = http.request('GET', 'https://en.wikipedia.org/wiki/India')

#creating a soup object using html from the link
soup=BeautifulSoup(r.data,"html.parser")

#getting whole text from the wiki page
text=soup.text

#getting all the links from wiki page
links=soup.find_all('a')

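#iterating over the linked pages and getting text from them
#this can be done in a recursive fashion to parse a large number of pages
for link in links:
    href=link.get('href')
    #skip anchors without an href and links that are not internal wiki pages
    if not href or not href.startswith('/wiki/'):
        continue
    new_url='https://en.wikipedia.org'+href
    #reusing the PoolManager created above
    r_new = http.request('GET', new_url)
    #do something with new page
    #urllib3 exposes the response body as r_new.data, so parse it to get the text
    new_text=BeautifulSoup(r_new.data,"html.parser").text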


#getting source of all the images
srcs=[img.get('src') for img in soup.find_all('img')]
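
The idea of following links repeatedly can be sketched as a small breadth-first crawler. This is only one possible shape, under a few assumptions: extract_links and crawl are hypothetical helper names, the fetch argument is any callable that returns a page's HTML, and the pages dictionary stands in for a real site so the example runs without network access.

```python
from collections import deque
from urllib.parse import urljoin, urldefrag, urlparse
from bs4 import BeautifulSoup

def extract_links(html, base_url):
    """Return absolute, fragment-free URLs for every <a href=...> in `html`."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for a in soup.find_all("a", href=True):
        absolute, _fragment = urldefrag(urljoin(base_url, a["href"]))
        links.append(absolute)
    return links

def crawl(start_url, fetch, max_pages=10):
    """Breadth-first crawl that stays on the starting site.

    `fetch(url)` must return the page HTML (str or bytes).
    Returns a dict mapping each visited URL to its text.
    """
    site = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([start_url])
    texts = {}
    while queue and len(texts) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        texts[url] = BeautifulSoup(html, "html.parser").get_text()
        for link in extract_links(html, url):
            # Never leave the site and never queue a page twice.
            if urlparse(link).netloc == site and link not in seen:
                seen.add(link)
                queue.append(link)
    return texts

# Made-up pages standing in for a real website.
fake_site = {
    "https://example.com/": '<a href="/a">A</a> <a href="https://other.org/">X</a>',
    "https://example.com/a": '<a href="/#top">Home</a> page A',
}
pages = crawl("https://example.com/", fake_site.get)
print(sorted(pages))  # ['https://example.com/', 'https://example.com/a']
```

With urllib3, the fetch argument could be something like lambda url: http.request('GET', url).data. The max_pages cap and the visited set keep the crawl from running away or looping, which matters on heavily interlinked sites such as Wikipedia.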

This was just a basic introduction to web scraping using Python. Much more can be achieved using the packages used in this tutorial. This article can serve as a starting point.

Points to Remember

Web scraping is very useful for gathering data for purposes like data mining, knowledge creation and data analysis, but it should be done with care.
As a basic rule of thumb, we should not scrape paid content. That said, we should also comply with the robots.txt file of the site, which tells us the areas that may be crawled.
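
Checking robots.txt can be automated with Python's standard urllib.robotparser module. The sketch below is a minimal illustration: the rules, the example.com URLs and the "my-crawler" user agent are all made up, and the rules are parsed from a string so that the example runs offline.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content. A real crawler would download the file
# from the site instead (RobotFileParser.set_url() followed by read()).
robots_txt = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

def allowed(url, user_agent="my-crawler"):
    """Return True if the parsed rules let `user_agent` fetch `url`."""
    return parser.can_fetch(user_agent, url)

print(allowed("https://example.com/articles/1"))    # True
print(allowed("https://example.com/private/data"))  # False
```

Calling a check like this before every request is a cheap way to keep a scraper within the areas a site allows.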