Spidering the web with Python

Introduction


We will be talking about:
  • Spidering/scraping
  • How to do it elegantly in Python
  • Limitations and restrictions

In the previous posts, I shared some methods of text mining and analytics, but one of the most important tasks before any analysis is getting the data we want to analyze.
Text data is everywhere in the form of blogs, articles, news, social feeds, posts, etc., and most of it is distributed to users through APIs, RSS feeds, bulk downloads, and subscriptions.

Some sites do not provide any means of pulling the data programmatically; this is where scraping comes into the picture.

Note: Scraping information from sites that are not free or not publicly available can have serious consequences.

Web Scraping is a technique of getting a web page in the form of HTML and parsing it to get the desired information.

HTML is hard to process by hand because of its loose rules and large number of attributes. Information can be scraped in two ways:

  • Manually filtering using regular expressions
  • Python's way: Beautiful Soup

In this post, we will be discussing Beautiful Soup's way of scraping.


Beautiful Soup

As per the definition in its documentation:

"Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching and modifying the parse tree. It commonly saves programmers hours or days of work."


If you have ever tried parsing text and HTML documents by hand, you will understand how well this module is built and how much work and time it saves programmers.

Let's start with Beautiful Soup.

Installation

This assumes Python is already installed on your system. To install Beautiful Soup, you can use pip:

pip install beautifulsoup4

Getting Started

Problem 1: Getting all the links from a page.
For this problem, we will use a sample HTML string which has some links, and our goal is to get all of them.
html_doc = """
<html>
<body>
<h1>Sample Links</h1>
<br>
<a href="https://www.google.com">Google</a>
<br>
<a href="https://www.apple.com">Apple</a>
<br>
<a href="https://www.yahoo.com">Yahoo</a>
<br>
<a href="https://www.msdn.com">MSDN</a>
</body>
</html>
"""


# import the package
from bs4 import BeautifulSoup


# create a BeautifulSoup object, passing 2 parameters:
# 1) the HTML to be scanned
# 2) the parser to be used (html.parser, lxml, etc.)
soup = BeautifulSoup(html_doc, "html.parser")


# find all the anchor tags in the HTML string
# find_all returns a list of tags, in this case anchors (to get only the first one, use find)
anchors = soup.find_all('a')

# get the links from the anchor tags
for a in anchors:
    print(a.get('href'))  # get reads an attribute of a tag
    # print(a['href']) can also be used to access the attribute of a tag


That is it: just 5-6 lines to get any tag from the HTML, iterate over the results, and read some attributes.
Can you think of doing this with regular expressions? It would be one heck of a job. That gives a sense of how well the module is built to perform all these functions.
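
For comparison, here is a rough sketch of what the regular-expression route looks like. The sample string tricky_html and the pattern are illustrative assumptions, not part of the example above; the point is only that a hand-written pattern misses ordinary variations in markup that a parser handles.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical sample: three ways of writing a link, plus one that is commented out
tricky_html = """
<a href="https://www.google.com">Google</a>
<a class="ext" href='https://www.apple.com'>Apple</a>
<A HREF="https://www.yahoo.com">Yahoo</A>
<!-- <a href="https://www.msdn.com">MSDN</a> -->
"""

# A naive pattern: lowercase <a, href as the first attribute, double quotes only
naive_pattern = re.compile(r'<a href="([^"]+)"')
regex_links = naive_pattern.findall(tricky_html)

# The parser copes with quoting style, attribute order, tag case and comments
soup = BeautifulSoup(tricky_html, "html.parser")
soup_links = [a.get("href") for a in soup.find_all("a")]

print(regex_links)
print(soup_links)
```

The naive pattern skips the Apple and Yahoo links and picks up the commented-out MSDN link, while Beautiful Soup returns the three live links. The pattern could be extended to cover each case, but every new variation means another revision.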

Talking about the parsers (the one we passed while creating the BeautifulSoup object), we have multiple choices of parsers.

The advantages and disadvantages of each parser library are summarized below:

Python’s html.parser
  Typical usage: BeautifulSoup(markup, "html.parser")
  Advantages:
  • Batteries included
  • Decent speed
  • Lenient (as of Python 2.7.3 and 3.2)
  Disadvantages:
  • Not very lenient (before Python 2.7.3 or 3.2.2)

lxml’s HTML parser
  Typical usage: BeautifulSoup(markup, "lxml")
  Advantages:
  • Very fast
  • Lenient
  Disadvantages:
  • External C dependency

lxml’s XML parser
  Typical usage: BeautifulSoup(markup, "lxml-xml") or BeautifulSoup(markup, "xml")
  Advantages:
  • Very fast
  • The only currently supported XML parser
  Disadvantages:
  • External C dependency

html5lib
  Typical usage: BeautifulSoup(markup, "html5lib")
  Advantages:
  • Extremely lenient
  • Parses pages the same way a web browser does
  • Creates valid HTML5
  Disadvantages:
  • Very slow
  • External Python dependency
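
To see the leniency differences in practice, the sketch below feeds the same invalid markup to each parser that happens to be installed. The markup string and the results dictionary are illustrative assumptions; lxml and html5lib are optional extras, so the code skips them when Beautiful Soup reports them as missing.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Deliberately invalid markup: a stray closing </p> inside an unclosed <a>
broken_markup = "<a></p>"

results = {}
for parser_name in ("html.parser", "lxml", "html5lib"):
    try:
        soup = BeautifulSoup(broken_markup, parser_name)
    except FeatureNotFound:
        # Raised by Beautiful Soup when the requested parser is not installed
        results[parser_name] = None
        continue
    results[parser_name] = str(soup)

for parser_name, output in results.items():
    print(parser_name, "->", output if output is not None else "(not installed)")
```

Each parser repairs the broken input differently, so the printed trees will not be identical. If a script depends on the exact tree shape, it is safer to name the parser explicitly, as done here, rather than rely on whichever one is available.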

Other methods and Usage

Beautiful Soup is a vast library and can do in a single line things that would otherwise be quite difficult.
Some of the methods for searching tags in HTML are:

import re  # needed for the regex search below

# find by ID
soup.find(id='abc')
# find through a regex matched against tag names
# limit the return to 2 tags
soup.find_all(re.compile("^a"), limit=2)
# find multiple tags
soup.find_all(['a', 'h1'])
# find by custom or built-in attributes
soup.find_all(attrs={'data': 'abc'})
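
To make the results of these calls concrete, here is a small runnable sketch. The document in sample_html is a made-up example (the id and data values are assumptions chosen to match the calls above).

```python
import re

from bs4 import BeautifulSoup

# Hypothetical sample document for trying out the search methods
sample_html = """
<html><body>
<h1 id="abc">Sample Links</h1>
<a href="https://www.google.com" data="abc">Google</a>
<a href="https://www.apple.com">Apple</a>
<address>1 Example Street</address>
</body></html>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# find returns the first match (or None when nothing matches)
heading = soup.find(id="abc")
print(heading.name, heading.get_text())

# A compiled regex is matched against tag names; limit caps the result count
first_two = soup.find_all(re.compile("^a"), limit=2)
print([tag.name for tag in first_two])

# A list matches any of the given tag names, in document order
mixed = soup.find_all(["a", "h1"])
print([tag.name for tag in mixed])

# attrs filters on attribute values, including custom attributes
tagged = soup.find_all(attrs={"data": "abc"})
print([tag.get_text() for tag in tagged])
```

Note that the regex "^a" would also match the address tag; it is only left out here because limit=2 stops the search after the two anchors.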

Problem 2:

In the above example, we parsed an HTML string. Now we will see how to hit a URL, get the HTML for that page, and parse it in the same manner as the HTML string above.

For this we will be using the urllib3 package for Python. It can be easily installed with the following command:

pip install urllib3

Documentation for urllib3 is available online.

import urllib3
from bs4 import BeautifulSoup

# one PoolManager can be reused for every request
http = urllib3.PoolManager()

# hitting the URL
r = http.request('GET', 'https://en.wikipedia.org/wiki/India')

# creating a soup object using the HTML from the link
soup = BeautifulSoup(r.data, "html.parser")

# getting the whole text from the wiki page
text = soup.text

# getting all the links from the wiki page
links = soup.find_all('a')

# iterating over the linked pages and getting text from them
# this can be done in a recursive fashion to parse a large number of pages
for link in links:
    href = link.get('href')
    # skip anchors without an href and anything that is not an internal wiki link
    if href is None or not href.startswith('/wiki/'):
        continue
    new_url = 'https://en.wikipedia.org' + href
    r_new = http.request('GET', new_url)
    # do something with the new page
    new_text = BeautifulSoup(r_new.data, "html.parser").text


# getting the source of all the images
srcs = [img.get('src') for img in soup.find_all('img')]
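
The idea of following links across many pages can be sketched as below. This is one possible approach, not the only one: it uses an iterative queue with a visited set instead of literal recursion, so that pages linking back to each other do not cause an endless loop. The crawl function, its fetch parameter, and the in-memory fake_site are all hypothetical names introduced for illustration.

```python
from collections import deque
from urllib.parse import urljoin, urlparse

from bs4 import BeautifulSoup


def crawl(start_url, fetch, max_pages=10):
    """Breadth-first crawl from start_url, staying on the same host.

    fetch is any callable that takes a URL and returns its HTML (str or bytes),
    or None if the page could not be retrieved.
    Returns a dict mapping each visited URL to the page's text.
    """
    host = urlparse(start_url).netloc
    queue = deque([start_url])
    seen = {start_url}
    pages = {}

    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        soup = BeautifulSoup(html, "html.parser")
        pages[url] = soup.get_text()

        for anchor in soup.find_all("a"):
            href = anchor.get("href")
            if not href:
                continue
            # Resolve relative links and drop any #fragment
            target = urljoin(url, href).split("#")[0]
            if urlparse(target).netloc == host and target not in seen:
                seen.add(target)
                queue.append(target)
    return pages


# Tiny in-memory "site" so the sketch runs without network access
fake_site = {
    "https://example.com/": '<a href="/a">A</a> <a href="https://other.example/x">X</a>',
    "https://example.com/a": '<p>Page A</p> <a href="/">home</a> <a href="/b#top">B</a>',
    "https://example.com/b": "<p>Page B</p>",
}

pages = crawl("https://example.com/", fake_site.get, max_pages=5)
print(sorted(pages))
```

For real pages, fetch could be a small function that calls http.request('GET', url) with urllib3 and returns the response's data. The max_pages cap matters: without it, a crawl starting from a heavily linked page would keep requesting pages for a very long time.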


This was just a basic introduction to web scraping using Python. Much more can be achieved using the packages used in this tutorial. This article can serve as a starting point.

Points to Remember


  • Web scraping is very useful for gathering data for purposes like data mining, knowledge creation, and data analysis, but it should be done with care.
  • As a basic rule of thumb, we should not scrape anything which is paid content. That said, we should also comply with the site's robots.txt file to know which areas can be crawled.
  • It is very important to look into the legal implications before scraping.
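
Checking robots.txt can be automated with Python's standard library. The sketch below uses made-up rules and a placeholder user-agent name ("my-crawler") so it runs offline; both are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would load the site's own file
# with parser.set_url("https://<site>/robots.txt") followed by parser.read()
robots_lines = [
    "User-agent: *",
    "Disallow: /private/",
    "Allow: /",
]

parser = RobotFileParser()
parser.parse(robots_lines)

# "my-crawler" is a placeholder user-agent name
allowed_article = parser.can_fetch("my-crawler", "https://example.com/articles/1")
allowed_private = parser.can_fetch("my-crawler", "https://example.com/private/data")

print(allowed_article)
print(allowed_private)
```

A scraper can call can_fetch before every request and skip any URL for which it returns False. Passing this check only covers the site's crawling rules; it does not settle the legal questions mentioned above.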



Hope the article was informative.
--
TechScouter (JSC)

