Spidering the web with Python

Introduction


We will be talking about:
  • Spidering/Scraping
  • How to do it elegantly in Python
  • Limitations and restrictions

In the previous posts, I shared some methods of text mining and analytics, but one of the most important tasks before any analysis is getting the data we want to analyze.
Text data is everywhere in the form of blogs, articles, news, social feeds, posts, etc., and most of it is distributed to users through APIs, RSS feeds, bulk downloads and subscriptions.

Some sites do not provide any means of pulling the data programmatically; this is where scraping comes into the picture.

Note: Scraping information from sites that are not free or not publicly available can have serious consequences.

Web scraping is the technique of fetching a web page as HTML and parsing it to extract the desired information.

HTML is hard to parse by hand because of its loose rules and large number of attributes. Information can be scraped in two ways:

  • Manually filtering using regular expressions
  • Python's way: Beautiful Soup

In this post, we will be discussing Beautiful Soup's way of scraping.


Beautiful Soup

As per the definition in its documentation:

"Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching and modifying the parse tree. It commonly saves programmers hours or days of work."


If you have ever tried parsing text and HTML documents by hand, then you will understand how brilliantly this module is built and how much work and time it saves programmers.

Let's start with Beautiful Soup.

Installation

I assume Python is installed on your system. To install Beautiful Soup, you can use pip:

pip install beautifulsoup4

Getting Started

Problem 1: Getting all the links from a page.
For this problem, we will use a sample HTML string that contains some links; our goal is to extract all of them.
html_doc = """
<html>
<body>
<h1>Sample Links</h1>
<br>
<a href="https://www.google.com">Google</a>
<br>
<a href="https://www.apple.com">Apple</a>
<br>
<a href="https://www.yahoo.com">Yahoo</a>
<br>
<a href="https://www.msdn.com">MSDN</a>
</body>
</html>
"""


# import the package
from bs4 import BeautifulSoup


# create a BeautifulSoup object, passing 2 parameters:
# 1) the HTML to be scanned
# 2) the parser to be used (html.parser, lxml, etc.)
soup = BeautifulSoup(html_doc, "html.parser")


# find all the anchor tags in the HTML string
# find_all returns a list of tags, in this case anchors (use find to get only the first one)
anchors = soup.find_all('a')

# get the links from the anchor tags
for a in anchors:
    print(a.get('href'))  # get returns the value of a tag's attribute
    # print(a['href']) can also be used to access the attribute of a tag


That's it: just 5-6 lines to get any tag from the HTML, iterate over the results and read their attributes.
Can you think of doing this with regular expressions? It would be one heck of a job. It shows how well the module is designed to perform all these functions.
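
To make the comparison concrete, here is a small sketch. The sample markup and the deliberately naive pattern are made up for illustration and are not taken from a real page. The pattern only matches lowercase tags with a double-quoted href as the first attribute, so it misses variations that a parser handles without any extra effort.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup with three equally valid ways of writing a link.
messy_html = """
<a href="https://www.google.com">Google</a>
<a class="ext" href='https://www.apple.com'>Apple</a>
<A HREF="https://www.yahoo.com">Yahoo</A>
"""

# A deliberately naive pattern: lowercase tag, href first, double quotes only.
naive_pattern = re.compile(r'<a href="([^"]+)"')
regex_links = naive_pattern.findall(messy_html)

# Beautiful Soup relies on the parser, which normalizes tag and attribute names.
soup = BeautifulSoup(messy_html, "html.parser")
soup_links = [a.get("href") for a in soup.find_all("a")]

print(len(regex_links))  # 1 (only the first link matches the pattern)
print(len(soup_links))   # 3
```

The pattern could of course be extended to cover quotes, case and attribute order, but every new variation makes it harder to read and maintain.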

Talking about parsers (the one we passed while creating the BeautifulSoup object), we have multiple choices.

This table summarizes the advantages and disadvantages of each parser library:

| Parser | Typical usage | Advantages | Disadvantages |
|---|---|---|---|
| Python’s html.parser | BeautifulSoup(markup, "html.parser") | Batteries included; decent speed; lenient (as of Python 2.7.3 and 3.2) | Not very lenient (before Python 2.7.3 or 3.2.2) |
| lxml’s HTML parser | BeautifulSoup(markup, "lxml") | Very fast; lenient | External C dependency |
| lxml’s XML parser | BeautifulSoup(markup, "lxml-xml") or BeautifulSoup(markup, "xml") | Very fast; the only currently supported XML parser | External C dependency |
| html5lib | BeautifulSoup(markup, "html5lib") | Extremely lenient; parses pages the same way a web browser does; creates valid HTML5 | Very slow; external Python dependency |
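
Since lxml and html5lib are external dependencies, a script may want to prefer a faster parser when it is installed and fall back to the built-in one otherwise. The following is only a sketch of that idea: make_soup and its preferred parameter are hypothetical names, not part of Beautiful Soup. It relies on the FeatureNotFound exception that Beautiful Soup raises when the requested parser is not available.

```python
from bs4 import BeautifulSoup, FeatureNotFound

def make_soup(markup, preferred=("lxml", "html.parser")):
    """Parse markup with the first parser in `preferred` that is installed.

    Hypothetical helper written for this example.
    """
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser)
        except FeatureNotFound:
            # This parser library is not installed; try the next one.
            continue
    raise RuntimeError("none of the requested parsers is available")

# The <b> tag is never closed; the parser repairs the tree for us.
soup = make_soup("<p>Hello <b>world</p>")
print(soup.b.get_text())
```

Keep in mind that different parsers can build different trees from the same invalid markup, so it is safer to settle on one parser for a given project.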

Other methods and Usage

Beautiful Soup is a vast library and can do in a single line things that would otherwise be quite difficult.
Some of the methods for searching tags in HTML are:

import re

# find by ID
soup.find(id='abc')
# find through a regex on the tag name (here, names starting with "a"),
# limiting the result to 2 tags
soup.find_all(re.compile("^a"), limit=2)
# find multiple tags
soup.find_all(['a', 'h1'])
# find by custom or built-in attributes
soup.find_all(attrs={'data': 'abc'})
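
To see what these searches return, here is a small runnable sketch. The markup is made up so that every search has something to find; the id and data values simply mirror the ones used above.

```python
import re
from bs4 import BeautifulSoup

# Made-up markup for illustration only.
sample = """
<div id="abc">
  <h1>Title</h1>
  <a href="/one" data="abc">One</a>
  <abbr title="example">Ex</abbr>
  <a href="/two">Two</a>
</div>
"""
soup = BeautifulSoup(sample, "html.parser")

by_id = soup.find(id="abc")                          # a single tag (or None)
by_regex = soup.find_all(re.compile("^a"), limit=2)  # tag names starting with "a"
by_list = soup.find_all(["a", "h1"])                 # any of the listed tag names
by_attr = soup.find_all(attrs={"data": "abc"})       # match on attribute values

print(by_id.name)                           # div
print([tag.name for tag in by_regex])       # ['a', 'abbr']
print([tag.name for tag in by_list])        # ['h1', 'a', 'a']
print([tag.get_text() for tag in by_attr])  # ['One']
```

Note that the regular expression is matched against the tag name, not the tag's text, which is why the abbr tag shows up in the second result.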

Problem 2: Getting the HTML of a page from its URL.

In the above example, we parsed an HTML string. Now we will see how to request a URL, get the HTML for that page and parse it in the same manner as the string above.

For this we will be using the urllib3 package. It can be easily installed with the following command:

pip install urllib3

The urllib3 documentation covers the package in more detail.

import urllib3
from bs4 import BeautifulSoup

http = urllib3.PoolManager()
# hitting the URL
r = http.request('GET', 'https://en.wikipedia.org/wiki/India')

# creating a soup object using the HTML from the link
soup = BeautifulSoup(r.data, "html.parser")

# getting the whole text from the wiki page
text = soup.text

# getting all the links from the wiki page
links = soup.find_all('a')

# iterating over the linked pages and getting text from them
# this can be done in a recursive fashion to parse a large number of pages
for link in links:
    href = link.get('href')
    # skip anchors without an href and links that are not internal wiki pages
    if not href or not href.startswith('/wiki/'):
        continue
    new_url = 'https://en.wikipedia.org' + href
    r_new = http.request('GET', new_url)  # reuse the same PoolManager
    # do something with the new page
    new_text = BeautifulSoup(r_new.data, "html.parser").text


# getting the source of all the images
srcs = [img.get('src') for img in soup.find_all('img')]
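
Following links page after page needs a few safeguards: a record of the pages already visited, a cap on the number of pages, and a rule for which links to follow. The sketch below is one possible way to do it, not a definitive crawler. The crawl function, its fetch and max_pages parameters, and the fake_site dictionary are all hypothetical names invented for this example. The fetching step is passed in as a function so the logic can be tried without touching the network.

```python
from collections import deque
from urllib.parse import urljoin, urldefrag, urlparse
from bs4 import BeautifulSoup

def crawl(start_url, fetch, max_pages=10):
    """Breadth-first crawl of same-host pages, returning {url: page text}.

    `fetch` is any callable that takes a URL and returns its HTML.
    """
    host = urlparse(start_url).netloc
    queue = deque([start_url])
    seen = {start_url}
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        soup = BeautifulSoup(fetch(url), "html.parser")
        pages[url] = soup.get_text()
        for a in soup.find_all("a", href=True):
            # Resolve relative links and drop any "#fragment" part.
            link, _ = urldefrag(urljoin(url, a["href"]))
            # Stay on the starting host and never queue a page twice.
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages

# A made-up three-page "site" standing in for real HTTP responses.
fake_site = {
    "https://example.com/": '<a href="/a">A</a> <a href="https://other.org/">ext</a>',
    "https://example.com/a": '<a href="/">home</a> <a href="/b#top">B</a>',
    "https://example.com/b": "<p>leaf page</p>",
}
pages = crawl("https://example.com/", fake_site.__getitem__, max_pages=5)
print(sorted(pages))
```

With urllib3, fetch could be something like lambda url: http.request('GET', url).data. For a real site it is also worth pausing between requests (for example with time.sleep) so the server is not flooded.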


This was just a basic introduction to web scraping using Python. Much more can be achieved using the packages used in this tutorial. This article can serve as a starting point.

Points to Remember


  • Web scraping is very useful for gathering data for purposes like data mining, knowledge creation, data analysis, etc., but it should be done with care.
  • As a basic rule of thumb, we should not scrape paid content. That said, we should also comply with the site's robots.txt file, which tells crawlers which areas may be crawled.
  • It is very important to look into the legal implications before scraping.
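
Checking robots.txt can be automated with the urllib.robotparser module from Python's standard library. The sketch below parses a hypothetical robots.txt given as a list of lines, so it runs without network access. The rules, the example.com URLs and the "MyCrawler" user agent name are all assumptions made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real one lives at <site>/robots.txt.
robots_lines = [
    "User-agent: *",
    "Disallow: /private/",
]

parser = RobotFileParser()
parser.parse(robots_lines)

# "MyCrawler" is a made-up user agent name.
allowed = parser.can_fetch("MyCrawler", "https://example.com/articles/1")
blocked = parser.can_fetch("MyCrawler", "https://example.com/private/data")

print(allowed)  # True
print(blocked)  # False
```

For a live site, parser.set_url("https://example.com/robots.txt") followed by parser.read() downloads and parses the file instead. Calling can_fetch before each request keeps a crawler out of the disallowed areas.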



Hope the article was informative.
--
TechScouter (JSC)

