
Doc2Vec: Document Vectorization and Clustering

Introduction

In my previous blog posts, I have written about word vectorization, with implementation details and use cases; you can read about it here. But many times we need to mine the relationships between phrases rather than individual words. To take an example:

John has taken many leaves this year.
Leaves are falling off the tree.

In these two sentences, the common word "leaves" has a different meaning depending on the sentence in which it is used. This meaning can only be captured when we take the context of the complete phrase into account. We may also want to measure the similarity of phrases and cluster them under one name. This post is more of an implementation of doc2vec in Python than a deep dive into the algorithms. The algorithms use either hierarchical softmax or negative sampling; see Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean: "Efficient Estimation of Word Representations in Vector Space", in Proceedings of Workshop at ICLR, 2013.

Celery Optimization For Workloads

Introduction

First, a brief background about myself. I work as a Software Engineer at an alternative asset management organization (handling 1.4 trillion with our product suite), responsible for maintaining and developing a product called ALT Data Analyzer. My work is focused on keeping the engine running and feeding the machines with their food. This article explains the problems we faced while scaling up our architecture and the solution we followed.

I am dividing the blog into the following sections:

Product Brief
Current Architecture
Problem With Data Loads and Sleepless Nights
Solutions Tried
The Final Destination

Product Brief

The idea behind building this product was to give users an aggregated view of the WEB around a company. By WEB I mean the data that is flowing freely over all the decentralized nodes of the internet. We try to capture all the financial, technical, and fundamental data for companies and pass that data through our massaging and analyz