Doc2Vec: Document Vectorization and Clustering
Introduction
In my previous blog posts, I have written about word vectorization with implementation and use cases. You can read about it here. But many times we need to mine the relationships between complete phrases rather than individual words. To take an example:
- John has taken many leaves this year
- Leaves are falling off the tree
In these two sentences, the common word "leaves" has a different meaning depending on the sentence in which it is used. This meaning can only be captured by taking the context of the complete phrase into account. We may also want to measure the similarity of phrases and cluster similar ones together.
This post is going to be more of an implementation of doc2vec in Python than a deep dive into the details of the algorithms. The algorithms use either hierarchical softmax or negative sampling; see Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean: "Efficient Estimation of Word Representations in Vector Space", in Proceedings of Workshop at ICLR, 2013, and Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean: "Distributed Representations of Words and Phrases and their Compositionality", in Proceedings of NIPS, 2013.
Requirements:
I will be using the Python package gensim to implement doc2vec on a set of news headlines, and then use K-means clustering to group similar documents together. The following packages are required for this implementation:
- gensim (installation details)
- sklearn (installation details)
- nltk (installation details)
- pandas
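If you don't already have these, they can all be installed with pip (assuming a standard Python environment; note that sklearn is distributed as scikit-learn):

```
pip install gensim scikit-learn nltk pandas
```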
Preparing the Data
I have used a set of 60 news article headlines in this blog; the dataset can be found here.
Data used in its raw format can rarely generate good results. Remember the principle of "garbage in, garbage out".
We will perform some cleaning steps to normalize the news titles. First, let's load the file into memory.
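As a minimal sketch, assuming the dataset is a CSV file named news_headlines.csv with a single column title (the file and column names are assumptions; adjust them to match your copy of the data):

```python
import pandas as pd

# Load the news headlines into a data frame.
# The file name and column name are assumptions;
# change them to match your dataset.
df = pd.read_csv("news_headlines.csv")

print(df.shape)    # expect (60, 1) for the 60 headlines
print(df.head())
```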
Now that we have the file loaded into memory, let's perform the cleaning steps.
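A typical cleaning pass lowercases each title, tokenizes it with nltk, and drops stopwords and non-alphabetic tokens. This is a sketch; the exact steps worth applying depend on your data:

```python
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# One-time downloads of the nltk resources used below.
nltk.download("punkt")
nltk.download("stopwords")

stop_words = set(stopwords.words("english"))

def clean_text(text):
    # Lowercase, tokenize, and keep only alphabetic non-stopword tokens.
    tokens = word_tokenize(text.lower())
    return [t for t in tokens if t.isalpha() and t not in stop_words]

# "title" is the assumed column name from the loading step above.
df["tokens"] = df["title"].apply(clean_text)
```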
Creating the doc2vec model
Now that we have clean data, let's transform it into vectors. Gensim's implementation of doc2vec requires each document to be wrapped in a TaggedDocument object from gensim.models.doc2vec.
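Each document is wrapped in a TaggedDocument together with a unique tag; here is a sketch that builds them from the tokens column created above, using the row index as the tag:

```python
from gensim.models.doc2vec import TaggedDocument

# Tag each document with its row index so its vector
# can be looked up by that index later.
tagged_docs = [TaggedDocument(words=tokens, tags=[i])
               for i, tokens in enumerate(df["tokens"])]
```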
We have successfully cleaned and tagged the documents, so let's create the model.
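A minimal training setup using the gensim 4.x API; the hyperparameters below (vector_size, window, min_count, epochs) are illustrative values, not tuned for this dataset:

```python
from gensim.models.doc2vec import Doc2Vec

model = Doc2Vec(vector_size=50, window=2, min_count=1,
                workers=4, epochs=40)

# Build the vocabulary from the tagged documents, then train.
model.build_vocab(tagged_docs)
model.train(tagged_docs,
            total_examples=model.corpus_count,
            epochs=model.epochs)
```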
By now we have a fully trained doc2vec model with a vector for every document in our data frame.
To print all the vectors:
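In gensim 4.x the trained document vectors live in model.dv, keyed by the tags assigned above:

```python
# Print the learned vector for every document.
for i in range(len(tagged_docs)):
    print(i, model.dv[i])
```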
These vectors now contain the embeddings, and with them the semantic meaning, of the documents. We can use methods on the model to find similar news articles.
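For example, most_similar on the document vectors returns the tags of the closest documents by cosine similarity (a sketch, relying on the integer tags assigned above):

```python
# Find the 5 headlines most similar to headline 0.
for tag, score in model.dv.most_similar(0, topn=5):
    print(round(score, 3), df["title"].iloc[tag])
```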
Clustering the Documents
We will use the vectors created in the previous section to generate clusters using the K-means clustering algorithm. An implementation of K-means is available in sklearn, so I will be using that.
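Here is a sketch of the clustering step: stack the document vectors into a matrix, fit sklearn's KMeans, and collect the headlines of each cluster into a dictionary named cluster:

```python
import numpy as np
from sklearn.cluster import KMeans

# Stack all document vectors into one matrix for sklearn.
X = np.array([model.dv[i] for i in range(len(tagged_docs))])

# Fit K-means with 4 clusters (the number used in this post).
kmeans = KMeans(n_clusters=4, random_state=42, n_init=10)
labels = kmeans.fit_predict(X)

# Group the headlines by their assigned cluster label.
cluster = {}
for i, label in enumerate(labels):
    cluster.setdefault(label, []).append(df["title"].iloc[i])

for label, titles in cluster.items():
    print("Cluster", label, ":", titles[:3])
```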
This gives us 4 clusters, with the news items belonging to each cluster stored under the corresponding key of the dictionary "cluster".
Here I have taken 4 clusters; you can choose a different number based on your data set.
I hope the blog was helpful. Feel free to comment.