Doc2Vec Document Vectorization and Clustering

Introduction


In my previous blog posts, I have written about word vectorization, with implementation and use cases. You can read about it here. But many times we need to mine the meaning of whole phrases rather than individual words. To take an example:
  • John has taken many leaves this year
  • Leaves are falling off the tree
In these two sentences, the common word "leaves" has a different meaning depending on the sentence in which it is used. This meaning can only be captured when we take the context of the complete phrase into account. We may also want to measure the similarity of phrases and cluster them under one name.



Requirements:

I will be using the Python package gensim to implement doc2vec on a set of news headlines, and then K-means clustering to group similar documents together. The following packages will be required for this implementation.
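
The packages used in the code below are pandas, nltk, gensim, scikit-learn and numpy (all installable with pip). The nltk stopword list and WordNet data used during cleaning have to be downloaded once:

#one-time download of the nltk resources used for cleaning
import nltk
nltk.download('stopwords')
nltk.download('wordnet')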

Preparing the Data

I have used a set of 60 news article headlines in this blog; the dataset can be found here.
Data used in its raw format can rarely generate good results. Remember the principle of "Garbage in, Garbage out".

We will perform some cleaning steps to normalize and clean the news titles. First, let's load the file into memory.


#import the libraries
import pandas as pd

df = pd.read_csv('path_to_dir/s_news.csv')

#drop the NaN rows
df.dropna(inplace=True)

Now that we have the file loaded in memory, let's perform the cleaning.

#import the libraries
import re
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
import gensim

lemma = WordNetLemmatizer()
stopword_set = set(stopwords.words('english') + ['a', 'of', 'at', 's', 'for', 'share', 'stock'])

def process(string):
    #pad the string so the single-token patterns below also match at the edges
    string = ' ' + string + ' '
    #remove stop words (this runs before lowercasing, so capitalised stop words survive)
    string = ' '.join([word if word not in stopword_set else '' for word in string.split()])
    #strip @mentions, periods, punctuation and digits
    string = re.sub(r'@\w*', ' ', string)
    string = re.sub(r'\.', ' ', string)
    string = re.sub(r"[,#'\-():$;?%]", ' ', string)
    string = re.sub(r'\d', ' ', string)
    string = string.lower()
    #drop exchange/entity tokens and non-ASCII characters
    string = re.sub('nyse', ' ', string)
    string = re.sub('inc', ' ', string)
    string = re.sub(r'[^\x00-\x7F]+', ' ', string)
    #remove leftover stop words and domain words that survived the first pass
    for token in [' for ', ' s ', ' the ', ' a ', ' with ', ' is ', ' at ', ' to ',
                  ' by ', ' & ', ' of ', ' are ', ' co ', ' stock ', ' share ',
                  ' stake ', ' corporation ']:
        string = re.sub(token, ' ', string)
    #lemmatize the remaining words
    string = " ".join(lemma.lemmatize(word) for word in string.split())
    #drop tokens of one or two characters and collapse whitespace
    string = re.sub(r'( [\w]{1,2} )', ' ', string)
    string = re.sub(r'\s+', ' ', string)

    return string.split()

#drop the duplicate values of news 
df.drop_duplicates(subset=['raw.title'],keep='last',inplace=True)

#reindex the data frame
df.index=range(0,len(df))

#apply the process function to the news titles
df['title_l']=df['raw.title'].apply(process)
df_new=df
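
To eyeball what the cleaner produced, it helps to look at a few raw headlines next to their cleaned token lists; a quick sanity check using the columns created above:

#compare a few raw headlines with their cleaned token lists
print(df_new[['raw.title', 'title_l']].head())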


Creating the doc2vec model

Now that we have clean data, let's transform it into vectors. Gensim's implementation of doc2vec needs objects of the TaggedDocument class from gensim.

#import the modules
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

documents = [TaggedDocument(doc, [i]) for i, doc in enumerate(list(df_new['title_l']))]

We have successfully tagged the documents; now let's create the model.

model = Doc2Vec(documents, size=25, window=2, min_count=1, workers=4)
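
Passing the documents to the constructor builds the vocabulary and trains the model in one go. If you want more control over training, for example over the number of epochs, you can do the two steps explicitly; a minimal sketch using the same parameters as above (the epoch count is only an illustration, and gensim 4.x renames size to vector_size):

#build the vocabulary and train explicitly instead of passing documents to the constructor
model = Doc2Vec(size=25, window=2, min_count=1, workers=4)
model.build_vocab(documents)
model.train(documents, total_examples=model.corpus_count, epochs=20)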

By now we have a trained doc2vec model containing one vector for every document in our data frame.
To print all the vectors:
#appending all the vectors in a list for training
X = []
for i in range(len(df_new)):
    X.append(model.docvecs[i])
    print(model.docvecs[i])

These vectors contain the embeddings of the documents and capture their semantic meaning. We can use methods on the model to find similar news articles.

#to create a vector for a new, unseen piece of text
vector = model.infer_vector(process("Merger news with verizon"))

#to find the documents most similar to that inferred vector
model.docvecs.most_similar([vector])

#to find the words most similar to the words in document 1
model.wv.most_similar(documents[1][0])

#find the documents most similar to document 1
model.docvecs.most_similar(1)
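
Because each document was tagged with its row index, the (tag, similarity) pairs returned by docvecs.most_similar can be mapped back to the original headlines; a small sketch assuming the df_new data frame from earlier:

#look up the headlines of the documents most similar to document 1
for doc_id, similarity in model.docvecs.most_similar(1):
    print(doc_id, similarity, df_new.loc[doc_id, 'raw.title'])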

Clustering the Documents

We will use the vectors created in the previous section to generate clusters with the K-means clustering algorithm. An implementation of K-means is available in sklearn, so I will be using that.

#import the modules
from sklearn.cluster import KMeans
import numpy as np
#create the kmeans object with the vectors created previously
kmeans = KMeans(n_clusters=4, random_state=0).fit(X)

#print all the labels
print(kmeans.labels_)

#create a dictionary to collect the news items of each cluster
clusters = {0: [], 1: [], 2: [], 3: []}
for i in range(len(df_new)):
    clusters[kmeans.labels_[i]].append(' '.join(df_new.loc[i, 'title_l']))
print(clusters)

This gives us 4 clusters, with all the news items of each cluster stored under the corresponding key of the dictionary "clusters".
Here I have taken 4 clusters; you can choose a different number based on your data set.
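
A common way to pick the number of clusters is the elbow method: fit K-means for a range of k values and look for the point where the inertia (within-cluster sum of squares) stops dropping sharply. A small sketch of that idea, reusing the list X from above (the range of k values is only an illustration):

#fit K-means for several values of k and print the inertia of each fit
for k in range(2, 10):
    km = KMeans(n_clusters=k, random_state=0).fit(X)
    print(k, km.inertia_)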


I hope the blog was helpful. Feel free to comment.

