Posts

Celery Optimization For Workloads

Introduction: Firstly, a brief background about myself. I work as a Software Engineer at an alternative asset management organization (handling 1.4 trillion with our product suite), where I am responsible for maintaining and developing a product called ALT Data Analyzer. My work is focused on keeping the engine running and feeding the machines their food. This article explains the problems we faced while scaling up our architecture and the solutions we followed. I am dividing the blog into the following sections: Product Brief; Current Architecture; Problem With Data Loads and Sleepless Nights; Solutions Tried; The Final Destination. Product Brief: The idea behind building this product was to give users an aggregated view of the WEB around a company. By WEB I mean the data that flows freely over all the decentralized nodes of the internet. We try to capture all the financial, technical and fundamental data for the companies, pass that data through our massaging and analyz

Getting Started with Glove

What is GloVe? GloVe stands for Global Vectors for Word Representation. It is an unsupervised learning algorithm developed at Stanford that generates word embeddings by aggregating a global word-word co-occurrence matrix from a corpus. The resulting embeddings show interesting linear substructures of the word vector space. Examples of linear substructures are: These results are pretty impressive. This type of representation can be very useful in many machine learning algorithms. To learn more about word vectorization, you can read my other article. In this blog post, we will learn about a GloVe implementation in Python. So, let's get started. Let's create the embeddings. Installing glove-python: A Python implementation of GloVe is available in the glove-python library, which can be installed with pip install glove_python. Text Preprocessing: In this step, we will pre-process the text, for example by removing stop words and lemmatizing the words. You can perform different steps based on your requirements
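As a rough illustration of the preprocessing step described above, here is a minimal sketch that uses only the Python standard library. The `preprocess` function and the tiny `STOP_WORDS` set are hypothetical names made up for this example and are not part of glove-python. Lemmatization is left out because it would need an external library such as NLTK or spaCy.

```python
import re

# Hypothetical, deliberately tiny stop-word list for illustration only;
# a real pipeline would use a fuller list (for example from NLTK or spaCy).
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "on", "the", "to"}


def preprocess(text, stop_words=STOP_WORDS):
    """Lowercase the text, split it into alphabetic tokens, drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [token for token in tokens if token not in stop_words]


# Each document becomes a list of tokens, the usual input shape for
# building a word-word co-occurrence matrix.
corpus = ["The cat sat on the mat.", "A dog is in the garden."]
sentences = [preprocess(line) for line in corpus]
print(sentences)
```

The resulting list of token lists is the kind of input a co-occurrence-based embedding model is typically trained on; which extra steps to add (lemmatization, number handling, and so on) depends on your data.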

Word Vectorization

Introduction: Machine learning has become the hottest topic in the data industry, with increasing demand for professionals who can work in this domain. There is a large amount of textual data on the internet and on giant servers around the world. Just for some facts: 1,209,600 new data-producing social media users each day; 656 million tweets per day; more than 4 million hours of content uploaded to YouTube every day, with users watching 5.97 billion hours of YouTube videos each day; 67,305,600 Instagram posts uploaded each day; over 2 billion monthly active Facebook users, compared to 1.44 billion at the start of 2015 and 1.65 billion at the start of 2016; 1.32 billion daily active Facebook users on average as of June 2017; 4.3 billion Facebook messages posted daily; 5.75 billion Facebook likes every day; 22 billion texts sent every day; 5.2 billion daily Google searches in 2017. Need for Vectorization: The amount of textual data is massive, and the problem with textual dat

Spidering the web with Python

Introduction: We will be talking about spidering/scraping, how to do it elegantly in Python, and its limitations and restrictions. In the previous posts, I shared some methods of text mining and analytics, but one of the most important tasks before analytics is getting the data we want to analyze. Text data is present everywhere in the form of blogs, articles, news, social feeds, posts, etc., and most of it is distributed to users through APIs, RSS feeds, bulk downloads and subscriptions. Some sites do not provide any means of pulling the data programmatically; this is where scraping comes into the picture. Note: Scraping information from sites that are not free or not publicly available can have serious consequences. Web scraping is a technique of fetching a web page in the form of HTML and parsing it to get the desired information. HTML is very complex in itself due to loose rules and a large number of attributes. Inform

Text Analytics-Part 2

Hi readers, in the previous post I wrote about gaining knowledge from text, which is available from many sources. In this post, I will be writing about Topic Mining. Introduction: Topic mining can be described as finding, within a group of words, the words that best describe the group. Textual data in raw form is not associated with any context. A human can easily identify the context or topic of an article by reading it and can place it in one category or another, such as politics, sports, economics or crime. One of the factors a human considers while classifying text into a topic is knowledge of how a word is associated with a topic, e.g. "India won over Sri Lanka in the test match." and "World Badminton Championships: When and where to watch Kidambi Srikanth’s first round, live TV coverage, time in IST, live streaming". Here we may not find the word sports explicitly in the sentences but the words marked in bold are associated