Spidering the web with Python
- How to do it elegantly in Python
- Limitations and restrictions
HTML is quite complex in itself due to its loose rules and large number of attributes. Information can be scraped in two ways:
- Manually filtering using regular expressions
- The Python way: Beautiful Soup
Can you imagine doing this with the help of regular expressions? It would be one heck of a job. It makes you appreciate how well the module is coded to perform all these functions.
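To make the contrast concrete, here is a minimal sketch (assuming the beautifulsoup4 package is installed; the HTML snippet is invented for illustration) that extracts the same links with a regular expression and with Beautiful Soup:

```python
import re
from bs4 import BeautifulSoup  # assumes beautifulsoup4 is installed

html = ('<div class="item"><a href="/page1">First</a></div>'
        '<div class="item"><a href="/page2">Second</a></div>')

# Regular-expression approach: brittle -- it breaks as soon as attributes
# are reordered, quoted differently, or spread across lines.
links_re = re.findall(r'<a href="([^"]+)">', html)

# Beautiful Soup approach: one readable line, tolerant of messy markup.
links_bs = [a["href"] for a in BeautifulSoup(html, "html.parser").find_all("a")]

print(links_re)  # ['/page1', '/page2']
print(links_bs)  # ['/page1', '/page2']
```

The regex version only works here because the input is unusually tidy; on real-world pages the Beautiful Soup version keeps working while the pattern would need constant patching.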
Talking about parsers (the one we passed while creating the Beautiful Soup object), we have multiple choices of parser:
- lxml’s HTML parser
- lxml’s XML parser
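The parser is chosen with the second argument to the `BeautifulSoup` constructor. A small sketch (the lxml-based parsers are shown commented out because they require the separate lxml package):

```python
from bs4 import BeautifulSoup  # assumes beautifulsoup4 is installed

broken_html = "<p>Unclosed paragraph<b>bold"

# Python's built-in parser: no extra dependency, reasonably lenient.
soup_builtin = BeautifulSoup(broken_html, "html.parser")

# lxml's HTML parser -- very fast and lenient (requires the lxml package):
#   soup_lxml = BeautifulSoup(broken_html, "lxml")
# lxml's XML parser -- Beautiful Soup's only XML parser (also requires lxml):
#   soup_xml = BeautifulSoup("<doc><item/></doc>", "xml")

print(soup_builtin.b.get_text())  # bold
```

Note that the input above is deliberately malformed; each parser has its own strategy for repairing such markup, so different parsers can yield slightly different trees.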
Other methods and usage
Beautiful Soup is a vast library and can do things, in just a single line, that would otherwise be very difficult.
Some of the methods for searching tags in HTML are:
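As a sketch of the most common search methods (the HTML fragment is invented for illustration), `find` returns the first match, `find_all` returns every match, and `select` searches by CSS selector:

```python
from bs4 import BeautifulSoup  # assumes beautifulsoup4 is installed

html = """
<ul>
  <li class="lang">Python</li>
  <li class="lang">Go</li>
  <li id="last">Rust</li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

first = soup.find("li")                         # first matching tag
all_langs = soup.find_all("li", class_="lang")  # every matching tag
by_css = soup.select("li#last")                 # CSS-selector search

print(first.get_text())                   # Python
print([t.get_text() for t in all_langs])  # ['Python', 'Go']
print(by_css[0].get_text())               # Rust
```

The `class_` keyword (with a trailing underscore) is Beautiful Soup's way of filtering by CSS class, since `class` is a reserved word in Python.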
Problem 2: In the above example, we used an HTML string for parsing. Now we will see how to hit a URL, get the HTML for that page, and then parse it in the same manner as we did for the HTML string above.
For this we will be using the urllib3 package of Python, which can be easily installed with pip (pip install urllib3).
Documentation for urllib3 can be found on its official documentation site.
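Putting the two packages together, a minimal sketch might look like the following. The helper name `fetch_and_parse` and the example URL are illustrative choices, not part of either library, and actually calling the function requires network access:

```python
import urllib3
from bs4 import BeautifulSoup  # assumes beautifulsoup4 is installed

def fetch_and_parse(url):
    """Fetch a page with urllib3 and return it parsed by Beautiful Soup."""
    http = urllib3.PoolManager()         # reusable connection pool
    response = http.request("GET", url)  # response.data holds the raw bytes
    return BeautifulSoup(response.data, "html.parser")

# Example usage (requires network access):
#   soup = fetch_and_parse("https://example.com")
#   print(soup.title.get_text())
```

Once the response bytes are handed to Beautiful Soup, everything shown earlier for HTML strings works unchanged.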
This was just a basic introduction to web scraping using Python. Much more can be achieved using the packages used in this tutorial. This article can serve as a starting point.
Points to Remember
- Web scraping is very useful for gathering data for different purposes like data mining, knowledge creation, and data analysis, but it should be done with care.
- As a basic rule of thumb, we should not scrape anything that is paid content. That said, we should also comply with the site's robots.txt file to know which areas can be crawled.
- It is very important to look into the legal implications before scraping.
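Python's standard library can check robots.txt rules for you via urllib.robotparser. A small offline sketch (the rules below are invented; against a real site you would call set_url and read instead of parse):

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
# Normally you would call rp.set_url("https://example.com/robots.txt")
# followed by rp.read(); here we parse the rules directly to stay offline.
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

print(rp.can_fetch("*", "https://example.com/public/page"))   # True
print(rp.can_fetch("*", "https://example.com/private/data"))  # False
```

Calling can_fetch before every request is a cheap way to stay within the areas the site owner has opened to crawlers.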
Hope the article was informative.