Doc2Vec Document Vectorization and Clustering

Introduction


In my previous blog posts, I have written about word vectorization, with implementation and use cases. You can read about it here. But many times we need to mine relationships between whole phrases or documents rather than individual words. To take an example:
  • John has taken many leaves this year
  • Leaves are falling off the tree
In these two sentences, the common word "leaves" has a different meaning depending on the sentence in which it appears. That meaning can only be captured by taking the context of the complete phrase into account. We may also want to measure the similarity of phrases and cluster similar ones together.



Requirements:

I will be using the Python package gensim to implement doc2vec on a set of news headlines, and then K-means clustering to group similar documents together. The following packages are required for this implementation: gensim, nltk, pandas, scikit-learn, and numpy.
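If you are using the NLTK stop word list and WordNet lemmatizer for the first time, you may also need a one-time corpus download before running the cleaning code below:

#one-time download of the NLTK data used in this post
import nltk
nltk.download('stopwords')
nltk.download('wordnet')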

Preparing the Data

I have used a set of 60 news article headlines in this blog; the dataset can be found here.
Data in its raw form rarely generates good results. Remember the principle of "Garbage in, Garbage out".

We will perform some cleaning steps to normalize the news titles. For this, let's first load the file into memory.


#import the libraries
import pandas as pd

df = pd.read_csv('path_to_dir/s_news.csv')

#drop the NaN rows
df.dropna(inplace=True)

Now that we have the file loaded in memory, let's perform the cleaning.

#import the libraries
import re
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
import gensim
lemma = WordNetLemmatizer()
stopword_set = set(stopwords.words('english')+['a','of','at','s','for','share','stock'])

def process(string):
    #pad the title with spaces
    string = ' ' + string + ' '
    #remove stop words
    string = ' '.join([word if word not in stopword_set else '' for word in string.split()])
    #strip @mentions, punctuation and digits
    string = re.sub(r'\@\w*', ' ', string)
    string = re.sub(r'\.', ' ', string)
    string = re.sub(r"[,#'-\(\):$;\?%]", ' ', string)
    string = re.sub(r"\d", ' ', string)
    string = string.lower()
    #remove exchange/company boilerplate and non-ASCII characters
    string = re.sub("nyse", ' ', string)
    string = re.sub("inc", ' ', string)
    string = re.sub(r'[^\x00-\x7F]+', ' ', string)
    #remove leftover common words that survived the stop word step
    string = re.sub(' for ', ' ', string)
    string = re.sub(' s ', ' ', string)
    string = re.sub(' the ', ' ', string)
    string = re.sub(' a ', ' ', string)
    string = re.sub(' with ', ' ', string)
    string = re.sub(' is ', ' ', string)
    string = re.sub(' at ', ' ', string)
    string = re.sub(' to ', ' ', string)
    string = re.sub(' by ', ' ', string)
    string = re.sub(' & ', ' ', string)
    string = re.sub(' of ', ' ', string)
    string = re.sub(' are ', ' ', string)
    string = re.sub(' co ', ' ', string)
    string = re.sub(' stock ', ' ', string)
    string = re.sub(' share ', ' ', string)
    string = re.sub(' stake ', ' ', string)
    string = re.sub(' corporation ', ' ', string)
    #lemmatize each remaining token
    string = " ".join(lemma.lemmatize(word) for word in string.split())
    #drop one- and two-character tokens and collapse whitespace
    string = re.sub(r'( [\w]{1,2} )', ' ', string)
    string = re.sub(r"\s+", ' ', string)

    return string.split()

#drop duplicate news titles
df.drop_duplicates(subset=['raw.title'],keep='last',inplace=True)

#reindex the data frame
df.index=range(0,len(df))

#apply the process function to the news titles
df['title_l']=df['raw.title'].apply(process)
df_new=df
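As a quick sanity check, we can run process on one of the example sentences from the introduction. The exact output may vary slightly with your NLTK version, but it should look roughly like this:

#quick check of the cleaning function (illustrative output)
print(process("Leaves are falling off the tree"))
# -> something like ['leaf', 'falling', 'tree']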


Creating the doc2vec model

Now that we have clean data, let's transform it into vectors. Gensim's doc2vec implementation expects each document as a TaggedDocument object, which pairs the list of tokens with a tag; here we simply use the row index as the tag.

#import the modules
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

documents = [TaggedDocument(doc, [i]) for i, doc in enumerate(list(df_new['title_l']))]

With the documents wrapped as TaggedDocument objects, let's create the model:

model = Doc2Vec(documents, vector_size=25, window=2, min_count=1, workers=4)  #older gensim releases use size instead of vector_size

We now have a trained doc2vec model containing a vector for every document in our data frame.
To collect and print all the vectors:

#collect all the document vectors in a list for clustering
X = []
for i in range(len(df_new)):
    X.append(model.docvecs[i])
    print(model.docvecs[i])

These vectors are the document embeddings and capture the semantic meaning of the documents. We can use the model's similarity methods to find related news articles.

#infer a vector for a new, unseen title
vector = model.infer_vector(process("Merger news with verizon"))

#find the documents most similar to the inferred vector
model.docvecs.most_similar([vector])

#find the words most similar to the words of document 1
model.wv.most_similar(documents[1].words)

#find the documents most similar to document 1
model.docvecs.most_similar(1)
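Each of these most_similar calls returns a list of (tag, similarity) pairs. Since we used the data frame row index as the document tag, the results can be mapped back to the cleaned headlines, roughly like this (a small sketch, assuming the tags line up with the rows of df_new as above):

#map similar-document tags back to the cleaned headlines
for tag, score in model.docvecs.most_similar(1):
    print(round(score, 3), ' '.join(df_new.loc[tag, 'title_l']))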

Clustering the Documents

We will use the vectors created in the previous section to generate clusters with the K-means clustering algorithm. An implementation of K-means is available in scikit-learn, so I will be using that.

#import the modules
from sklearn.cluster import KMeans
import numpy as np

#create the kmeans object with the vectors created previously
kmeans = KMeans(n_clusters=4, random_state=0).fit(X)

#print all the labels
print(kmeans.labels_)

#create a dictionary to collect the news titles in each cluster
clusters = {0: [], 1: [], 2: [], 3: []}
for i in range(len(df_new)):
    clusters[kmeans.labels_[i]].append(' '.join(df_new.loc[i, 'title_l']))
print(clusters)

This gives us 4 clusters, with the news items belonging to each cluster stored under the corresponding key of the clusters dictionary.
I have taken 4 clusters here; you can choose a different number based on your data set, for example by comparing silhouette scores as sketched below.
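One simple way to pick the number of clusters is to compare the silhouette score for a few candidate values of k; higher scores generally indicate better-separated clusters. A rough sketch using scikit-learn:

#compare silhouette scores for a few candidate cluster counts
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

for k in range(2, 10):
    labels = KMeans(n_clusters=k, random_state=0).fit_predict(X)
    print(k, silhouette_score(X, labels))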


I hope the blog was helpful. Feel free to comment.

