
Word Vectorization


Introduction

Machine Learning has become one of the hottest topics in the data industry, with increasing demand for professionals who can work in this domain.
There is a massive amount of textual data on the internet and on giant servers around the world.
Here are some facts:

  • 1,209,600 new data producing social media users each day.
  • 656 million tweets per day!
  • More than 4 million hours of content uploaded to YouTube every day, with users watching 5.97 billion hours of YouTube videos each day.
  • 67,305,600 Instagram posts uploaded each day
  • There are over 2 billion monthly active Facebook users, compared to 1.44 billion at the start of 2015 and 1.65 billion at the start of 2016.
  • Facebook has 1.32 billion daily active users on average as of June 2017
  • 4.3 BILLION Facebook messages posted daily!
  • 5.75 BILLION Facebook likes every day.
  • 22 billion texts sent every day.
  • 5.2 BILLION daily Google Searches in 2017.

Need for Vectorization

The amount of textual data is massive, and the problem with textual data is that it needs to be represented in a format that can be used mathematically to solve a problem.
In simple words, we need a numeric representation of each word. There are simple to complex ways to solve this problem.


One of the easiest ways to solve the problem is to create a simple word-to-integer mapping.

#sentence whose words are to be mapped to integers
line="Hello this is a tutorial on how to convert the word in an integer format"

#dictionary to hold the words
word_list={}

#counter for assigning an integer to each new word
counter=0

#iterate over the words
for word in line.split():
 #check if the word is already in the dict
 if word not in word_list:
  word_list[word]=counter
  #update the counter
  counter+=1
  


This gives us a dictionary of words with their corresponding integer representations.
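For instance, printing the dictionary built above should produce a mapping along these lines (the exact ordering depends on the sentence used):

print(word_list)
#{'Hello': 0, 'this': 1, 'is': 2, 'a': 3, 'tutorial': 4, 'on': 5, 'how': 6, 'to': 7, 'convert': 8, 'the': 9, 'word': 10, 'in': 11, 'an': 12, 'integer': 13, 'format': 14}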

Another way to get these numbers is by using TF-IDF

TF-IDF stands for term frequency-inverse document frequency. It assigns a weight to each word based on the number of occurrences in a document, while also taking into account how frequently the word appears across all the documents. This approach is better than the previous one because it lowers the weight of words that occur too often in all the sentences, like 'a', 'the', 'as', etc., and increases the weight of words that can be important in a sentence.
This is useful in scenarios where we want to extract the important words from all the documents.
This approach is also used in topic modelling.
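As a quick illustration of TF-IDF (this sketch uses scikit-learn, which is not otherwise used in this article), the weights for a small set of sentences could be computed like this:

from sklearn.feature_extraction.text import TfidfVectorizer

#small corpus of sentences
docs=["this is a beautiful day",
      "this day is a holiday",
      "Jack is going to office"]

#fit the vectorizer and transform the corpus into a TF-IDF matrix
vectorizer=TfidfVectorizer()
tfidf_matrix=vectorizer.fit_transform(docs)

#each row is a sentence, each column a word, each value a TF-IDF weight
print(vectorizer.vocabulary_)
print(tfidf_matrix.toarray())

Common words such as 'is' and 'this' should end up with lower weights than words like 'beautiful' or 'holiday' that appear in only one sentence.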

The third approach, and the one this article will focus on, is Word2Vec

Word2vec is a group of related models that are used to produce so-called word embeddings. These models are shallow, two-layer neural networks that are trained to reconstruct the linguistic contexts of words.

After training, word2vec models can be used to map each word to a vector of typically several hundred elements, which represent that word's relation to other words. This vector is the neural network's hidden layer.

Word2vec relies on either skip-grams or continuous bag of words (CBOW) to create neural word embeddings. It was created by a team of researchers led by Tomas Mikolov at Google. The algorithm has been subsequently analyzed and explained by other researchers.

Luckily, if you want to use this model in your work, you don't have to write these algorithms yourself.
Gensim is a Python library that has many of the features required for text processing and Natural Language Processing. In the rest of the article, we will learn to use this awesome library for word vectorization.

Installing Gensim

pip install --upgrade gensim

It has three major dependencies

  • Python
  • NumPy
  • SciPy
Make sure you install the dependencies before installing gensim.
Let's get our hands dirty with the code.

Text Preprocessing:

In this step, we will pre-process the text, for example by removing stop words and lemmatizing the words.
You can perform different steps based on your requirements.

I will use the nltk stopword corpus for stop word removal and the nltk WordNet lemmatizer for finding lemmas.
In order to use the nltk corpora you will need to download them using the following commands.

Downloading the corpus

import nltk
nltk.download()
#this will open a GUI from which you can download the corpus
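If you prefer not to use the GUI, the two resources needed in this article can also be downloaded directly (these are the standard nltk package names for the stopword list and the WordNet data):

import nltk

#download only the corpora used below
nltk.download('stopwords')
nltk.download('wordnet')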

Input initialization

#list of sentences to be vectorized
lines=["Hello this is a tutorial on how to convert the word in an integer format",
"this is a beautiful day","Jack is going to office"]

Removing the Stop Words

from nltk.corpus import stopwords
stop_words=set(stopwords.words('english'))

lines_without_stopwords=[]
#stop_words contains the set of stop words
for line in lines:
 temp_line=[]
 for word in line.split():
  if word not in stop_words:
   temp_line.append(word)
 string=' '
 lines_without_stopwords.append(string.join(temp_line))

lines=lines_without_stopwords

Lemmatization

#import WordNet Lemmatizer from nltk
from nltk.stem import WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()


lines_with_lemmas=[]
#lemmatize every word in each line
for line in lines:
 temp_line=[]
 for word in line.split():
  temp_line.append(wordnet_lemmatizer.lemmatize(word))
 string=' '
 lines_with_lemmas.append(string.join(temp_line))
lines=lines_with_lemmas
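To get a feel for what the lemmatizer does, here is a quick check on a single word (by default WordNetLemmatizer treats every word as a noun):

#plural noun reduced to its lemma
print(wordnet_lemmatizer.lemmatize('leaves'))
#leaf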




Now we have done the basic preprocessing of the text. Any other preprocessing steps can be applied in a similar way.

Preparing Input

We have our input in the form of a list of lines. In order for the model to process the data, we need to convert our input into a list of lists of words.

Our Input
lines=["Hello this is a tutorial on how to convert the word in an integer format",
"this is a beautiful day","Jack is going to office"]

New Input
lines=[['Hello', 'this', 'tutorial', 'on', 'how', 'convert', 'word', 'integer', 'format'],
['this', 'beautiful', 'day'], ['Jack', 'going', 'office']]

new_lines=[]
for line in lines:
 new_lines.append(line.split())
#new_lines has the new format
lines=new_lines

Building the WORD2VEC Model

Building a model with gensim is just a piece of cake.

#import the gensim package
import gensim

#train the word2vec model on our corpus
model = gensim.models.Word2Vec(lines, min_count=1, size=2)


Here it is important to understand the hyperparameters that can be used to train the model.
The Word2vec model constructor is defined as:

gensim.models.word2vec.Word2Vec(sentences=None, size=100, 
alpha=0.025, window=5, min_count=5, max_vocab_size=None, 
sample=0.001, seed=1, workers=3, min_alpha=0.0001, sg=0, 
hs=0, negative=5, cbow_mean=1, hashfxn=<built-in function hash>, 
iter=5, null_word=0, trim_rule=None, sorted_vocab=1,
batch_words=10000, compute_loss=False)


sentences = the input corpus, provided as a list of lists of words (as prepared above)

size = the dimensionality of the vector each word is converted to ('Hello'=[ ?, ?, ? ] if size=3)

alpha= It is the initial learning rate (will linearly drop to min_alpha as training progresses).

window= It is the maximum distance between the current and predicted word within a sentence.

min_count= ignore all words with total frequency lower than this.

max_vocab_size = limit RAM during vocabulary building; if there are more unique words than this, then prune the infrequent ones. Every 10 million word types need about 1GB of RAM. Set to None for no limit (default).

sample = threshold for configuring which higher-frequency words are randomly downsampled;
default is 1e-3, useful range is (0, 1e-5).

workers = use this many worker threads to train the model (=faster training with multicore machines).

hs = if 1, hierarchical softmax will be used for model training. If set to 0 (default), and negative is non-zero, negative sampling will be used.

negative = if > 0, negative sampling will be used, the int for negative specifies how many “noise words” should be drawn (usually between 5-20). Default is 5. If set to 0, no negative sampling is used.

cbow_mean = if 0, use the sum of the context word vectors. If 1 (default), use the mean. Only applies when cbow is used.

hashfxn = hash function to use to randomly initialize weights, for increased training reproducibility. The default is Python’s rudimentary built-in hash function.

iter = number of iterations (epochs) over the corpus. Default is 5.

trim_rule = vocabulary trimming rule specifies whether certain words should remain in the vocabulary, be trimmed away, or handled using the default (discard if word count < min_count). Can be None (min_count will be used), or a callable that accepts parameters (word, count, min_count) and returns either utils.RULE_DISCARD, utils.RULE_KEEP or utils.RULE_DEFAULT. Note: The rule, if given, is only used to prune vocabulary during build_vocab() and is not stored as part of the model.

sorted_vocab = if 1 (default), sort the vocabulary by descending frequency before assigning word indexes.

batch_words = target size (in words) for batches of examples passed to worker threads (and thus cython routines). Default is 10000. (Larger batches will be passed if individual texts are longer than 10000 words, but the standard cython code truncates to that maximum.)
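As an example of using a few of these hyperparameters together, a skip-gram model with larger vectors and more epochs could be trained as follows (the values are purely illustrative, not tuned recommendations):

import gensim

#skip-gram model (sg=1) with 100-dimensional vectors,
#a context window of 5 and 10 passes over the corpus
model = gensim.models.Word2Vec(lines,
                               size=100,     #dimensionality of the word vectors
                               window=5,     #maximum distance between current and predicted word
                               min_count=1,  #keep even rare words for this tiny corpus
                               workers=3,    #number of worker threads
                               sg=1,         #1 = skip-gram, 0 = CBOW
                               iter=10)      #number of epochs over the corpus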

Using the model

#saving the model for persistence
model.save('model.bin')

#loading the model back
model = gensim.models.Word2Vec.load('model.bin')

#getting the most similar words
model.most_similar(positive=['beautiful', 'world'], negative=['convert'], topn=1)

#finding the odd one out
model.doesnt_match("bullish holding stock".split())

#getting the vector for any word
model[word]

#finding the similarity between words
model.similarity('woman', 'man')
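Keep in mind that most_similar, doesnt_match and similarity only work for words that are present in the trained vocabulary, so with the tiny three-sentence corpus above a realistic call looks more like this (the numbers will vary between runs):

#most similar words to a word from our corpus
print(model.most_similar(positive=['beautiful'], topn=2))

#vector for a single word (its length equals the size parameter used during training)
print(model['day'])

#similarity between two words from the corpus
print(model.similarity('beautiful', 'day'))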




For more details, you can read the gensim word2vec documentation.
