Posts

Pandas - Series

Introduction

Python is a powerful general-purpose language: web applications, desktop applications, ML models, and more can be built with it. Before Pandas, Python was used mainly for data munging and preparation and contributed little to data analysis itself. Pandas closed that gap. With Pandas we can carry out the five typical steps of data processing and analysis, regardless of where the data comes from: load, prepare, manipulate, model, and analyze. Python with Pandas is used across academic and commercial domains, including finance, economics, statistics, and analytics. This blog provides code for basic usage of the Series data type in Pandas.

Data types in Pandas

Series: 1D labeled homogeneous array, size-immutable.
DataFrame: general 2D labeled, size-mutable tabular structure with potentially heterogeneously typed columns.
Panel: general 3D labeled, size-mutable array (removed from the library as of pandas 0.25).

Seri
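As a minimal sketch of the Series type described above, the snippet below builds a small labeled, single-dtype array and reads values from it. The values, labels, and variable names are made up for illustration and do not come from the original post.

```python
import pandas as pd

# A Series is a one-dimensional labeled array holding values of one dtype.
# The prices and day labels here are hypothetical sample data.
prices = pd.Series([101.5, 99.0, 103.25], index=["mon", "tue", "wed"], name="close")

# Values can be read by label or by integer position.
by_label = prices["tue"]
by_position = prices.iloc[0]

# Arithmetic is applied element-wise and keeps the index labels.
doubled = prices * 2

# Building a Series from a dict uses the dict keys as labels.
counts = pd.Series({"a": 1, "b": 2, "c": 3})
total = counts.sum()
```

Label-based access is the main practical difference from a plain list or NumPy array: the index travels with the data through most operations.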

Doc2Vec Document Vectorization and clustering

Introduction

In my previous blog posts, I wrote about word vectorization, with implementation and use cases. You can read about it here. But many times we need to mine the relationships between phrases rather than between individual words. To take an example:

John has taken many leaves this year
Leaves are falling off the tree

In these two sentences, the common word "leaves" has a different meaning depending on the sentence in which it is used. This meaning can only be captured when we take the context of the complete phrase into account. We may also want to measure the similarity of phrases and cluster them under one name. This post focuses on implementing doc2vec in Python rather than going into the details of the algorithms. The algorithms use either hierarchical softmax or negative sampling; see Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean: “Efficient Estimation of Word Representations in Vector Space”, in Proceedings of Workshop at ICLR, 20

Celery with heavy workloads: Deep Dive in Solution

Introduction

This is about my experience with Celery, a distributed framework in Python, in a heavy-workload environment. I am dividing the blog into the following sections:

Product Brief
Current Architecture
Problem With Data Loads and Sleepless Nights
Solutions Tried
The Final Destination

Product Brief

We are working on a product that fetches data from multiple sources and aggregates it to generate insights. Some of the data sources we support as of now are:

Stock News from multiple sources
RSS feed
Twitter
Reddit
Yahoo News
Yahoo Finance
Earnings Calendars
Report Filings
Company Financials
Stock Prices

Current Architecture

Broad View

We knew that the problem we are solving has to deal with the cruel, decentralized Internet, and that we need to divide the large task of getting data from the web and analyzing it into small tasks.

Fig 1

On exploring different projects and technologies and analyzing the community support we cam