Posts

Showing posts from 2017

Word Vectorization

Introduction: Machine learning has become the hottest topic in the data industry, with increasing demand for professionals who can work in this domain. A large amount of textual data sits on the internet and on giant servers around the world. Just for some facts: 1,209,600 new data-producing social media users each day. 656 million tweets per day. More than 4 million hours of content uploaded to YouTube every day, with users watching 5.97 billion hours of YouTube videos each day. 67,305,600 Instagram posts uploaded each day. There are over 2 billion monthly active Facebook users, compared to 1.44 billion at the start of 2015 and 1.65 billion at the start of 2016. Facebook has 1.32 billion daily active users on average as of June 2017. 4.3 billion Facebook messages posted daily. 5.75 billion Facebook likes every day. 22 billion texts sent every day. 5.2 billion daily Google searches in 2017. Need for Vectorization: The amount of textual data is massive, and the problem with textual dat

Spidering the web with Python

Introduction: We will be talking about spidering/scraping, how to do it elegantly in Python, and its limitations and restrictions. In the previous posts, I shared some methods of text mining and analytics, but one of the most important tasks before analytics is getting the data we want to analyze. Text data is present everywhere in the form of blogs, articles, news, social feeds, posts, etc., and most of it is distributed to users through APIs, RSS feeds, bulk downloads, and subscriptions. Some sites do not provide any means of pulling the data programmatically; this is where scraping comes into the picture. Note: Scraping information from sites that are not free or not publicly available can have serious consequences. Web scraping is a technique of fetching a web page in the form of HTML and parsing it to get the desired information. HTML is very complex in itself due to loose rules and a large number of attributes. Inform
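The "get the HTML, then parse it" idea from this excerpt can be sketched with Python's standard library alone. This is only a minimal illustration, not the approach the full post necessarily uses: the `LinkExtractor` class, the `extract_links` helper, and the sample page are all hypothetical names and data made up for the example, and the HTML is an inline string so nothing is fetched from a real site.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect the href value of every <a> tag seen while parsing."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value can be None.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    """Parse an HTML string and return the link targets in document order."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical sample page (an assumption for illustration only).
SAMPLE_HTML = """
<html><body>
  <h1>Posts</h1>
  <a href="/word-vectorization">Word Vectorization</a>
  <a class="nav" href="/spidering">Spidering the web</a>
  <a name="top">anchor without a link target</a>
</body></html>
"""

if __name__ == "__main__":
    print(extract_links(SAMPLE_HTML))
```

In a real spider the HTML string would come from an HTTP request, and the site's terms of use should be checked first, as the note in the excerpt warns.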

Text Analytics-Part 2

Hi readers, In the previous post, I wrote about gaining knowledge from text, which is available from many sources. In this post, I will be writing about topic mining. Introduction: Topic mining can be described as finding the words in a group of words that best describe the group. Textual data in raw form is not associated with any context. A human can easily identify the context or topic of an article by reading it and categorise it into one category or another, such as politics, sports, economics, or crime. One of the factors any human will consider while classifying a text into one of the topics is the knowledge of how a word is associated with a topic, e.g. "India won over Sri Lanka in the test match." "World Badminton Championships: When and where to watch Kidambi Srikanth’s first round, live TV coverage, time in IST, live streaming." Here we may not find the word "sports" explicitly in the sentences, but the words marked in bold are associated
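The word-to-topic association described in this excerpt can be illustrated with a very small keyword-overlap sketch. This is a stand-in for real topic mining, not the method from the full post: the `TOPIC_WORDS` vocabulary and the `guess_topic` function are hypothetical, and the word lists are hand-picked assumptions rather than anything learned from data.

```python
import re
from collections import Counter

# Hypothetical, hand-picked vocabulary (an assumption for illustration).
# A real system would learn these associations from a corpus.
TOPIC_WORDS = {
    "sports": {"won", "match", "test", "championships", "badminton", "round"},
    "economics": {"inflation", "market", "budget", "tax"},
}


def guess_topic(text):
    """Return the topic whose word list overlaps the text most, or None."""
    tokens = re.findall(r"[a-z]+", text.lower())
    scores = Counter()
    for topic, words in TOPIC_WORDS.items():
        # Count every token that belongs to this topic's word list.
        scores[topic] = sum(1 for token in tokens if token in words)
    topic, score = scores.most_common(1)[0]
    return topic if score > 0 else None


if __name__ == "__main__":
    print(guess_topic("India won over Sri Lanka in the test match"))
```

Neither example sentence contains the word "sports", yet the overlap with sports-related words is enough to pick the topic, which is the intuition the excerpt describes.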

Text Analysis -Part 1

Hi Readers, Recently I was going through some text analytics activities at work and learned some techniques for text analytics and mining. In this series of posts, I will be sharing my experiences and learnings. Introduction: Firstly, the jargon. Text Analysis and Text Mining are subdomains of Data Mining, and the two terms are used interchangeably in most scenarios. Broadly, Text Analysis refers to extracting information from textual data while keeping in mind the problem for which we want the data, and Text Mining refers to the process of getting textual data. Nowadays, a large quantity of data is produced by humans. Data is growing faster than ever before, and by the year 2020, about 1.7 megabytes of new information will be created every second for every human being on the planet, and one of the main components of this data will be textual data. Some of the main sources of textual data are: Blogs, Articles, Websites, Facebook Comments, Discussion Forums, Reviews

Sentiment Analysis-Are we there???

This one took long because of the analysis work I was doing for this post. There is a lot of work going on in the field of sentiment analysis, so I decided to compare the accuracy of the products. Let's start with some basics... NLP: Natural Language Processing. Natural language processing is a very interesting topic, and its accuracy is a subject of debate. Natural language is very ambiguous, as the same sentence can have different meanings, for example: "I saw a man on a hill with a telescope." It seems like a simple statement until you begin to unpack the many alternate meanings: There’s a man on a hill, and I’m watching him with my telescope. There’s a man on a hill, who I’m seeing, and he has a telescope. There’s a man, and he’s on a hill that also has a telescope on it. I’m on a hill, and I saw a man using a telescope. There’s a man on a hill, and I’m seeing him with a telescope. Sarcasm is that component of the language that is diffi

Machine Learning -Solution or Problem

The article will be divided into the following sections: Introduction to Machine Learning, Types of Solutions, and Classification using Naive Bayes. A brief about Machine Learning: According to the definition on Wikipedia, machine learning is the subfield of computer science that, according to Arthur Samuel in 1959, gives "computers the ability to learn without being explicitly programmed." Machine learning covers a set of problems whose solutions have to be evolved from the data by applying some algorithm. One factor that has to be kept in mind while defining a solution through ML is accuracy. Accuracy is very critical if you are developing a solution in the medical domain (cancer detection, for example). A threshold should be set for every solution, based on the percentage of risk that is acceptable. A useful cheat sheet from Microsoft's site sums up the use of different ML algorithms for different types of problems. Types of solution Machine Lear
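The accuracy threshold mentioned in this excerpt can be made concrete with a short check. This is a sketch under stated assumptions, not the article's own procedure: the function names `accuracy` and `meets_threshold` are hypothetical, and "acceptable risk" is interpreted here as the tolerated error rate, so the required accuracy is one minus that rate.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one label is required")
    correct = sum(1 for truth, guess in zip(y_true, y_pred) if truth == guess)
    return correct / len(y_true)


def meets_threshold(y_true, y_pred, acceptable_risk):
    """Check accuracy against 1 - acceptable_risk.

    acceptable_risk is an assumed interpretation: the tolerated error
    rate as a fraction, e.g. 0.05 for a 5% risk.
    """
    return accuracy(y_true, y_pred) >= 1.0 - acceptable_risk


if __name__ == "__main__":
    labels = [1, 0, 1, 1]
    predictions = [1, 0, 0, 1]
    print(accuracy(labels, predictions))
    print(meets_threshold(labels, predictions, acceptable_risk=0.05))
```

For a high-stakes domain such as the cancer detection case mentioned above, plain accuracy is rarely enough on its own; the same pattern can be applied to whichever metric the acceptable risk is defined on.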