Spidering the web with Python

Introduction

We will be talking about spidering/scraping, how to do it elegantly in Python, and its limitations and restrictions.

In the previous posts, I shared some methods of text mining and analytics, but one of the most important tasks before any analytics is getting the data we want to analyze. Text data is everywhere in the form of blogs, articles, news, social feeds, posts, etc., and most of it is distributed to users through APIs, RSS feeds, bulk downloads and subscriptions. Some sites do not provide any means of pulling the data programmatically; this is where scraping comes into the picture.

Note: Scraping information from sites that are not free or not publicly available can have serious consequences.

Web scraping is a technique of getting a web page in the form of HTML and parsing it to extract the desired information. HTML is hard to parse reliably because of its loose rules and large number of attributes. Inform
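
To make the "fetch the HTML, then parse it" idea concrete, here is a minimal sketch that uses only the Python standard library. It is not necessarily the approach this post goes on to use. The names `LinkCollector`, `fetch_html` and `extract_links`, the `example-spider/0.1` user agent and the sample markup are all assumptions made up for illustration.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag it sees (illustrative class)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch_html(url, timeout=10):
    """Download a page and return its body as text (hypothetical helper)."""
    # The User-Agent string is a made-up placeholder.
    request = Request(url, headers={"User-Agent": "example-spider/0.1"})
    with urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def extract_links(html):
    """Parse an HTML string and return the link targets in document order."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


# Loosely written HTML: an upper-case tag and an unclosed <p> still parse.
sample = '<p>See <a href="/posts/1">one</a> and <A HREF="/posts/2">two</a><p>unclosed'
print(extract_links(sample))  # ['/posts/1', '/posts/2']
```

For a real page, `extract_links(fetch_html(url))` would chain the two steps, assuming the site permits automated access.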