
Word Vectorization


Introduction

Machine Learning has become one of the hottest topics in the data industry, with increasing demand for professionals who can work in this domain.
There is a massive amount of textual data on the internet and on giant servers around the world.
Here are some facts:

  • 1,209,600 new data producing social media users each day.
  • 656 million tweets per day!
  • More than 4 million hours of content uploaded to YouTube every day, with users watching 5.97 billion hours of YouTube videos each day.
  • 67,305,600 Instagram posts uploaded each day
  • There are over 2 billion monthly active Facebook users, compared to 1.44 billion at the start of 2015 and 1.65 billion at the start of 2016.
  • Facebook has 1.32 billion daily active users on average as of June 2017
  • 4.3 BILLION Facebook messages posted daily!
  • 5.75 BILLION Facebook likes every day.
  • 22 billion texts sent every day.
  • 5.2 BILLION daily Google Searches in 2017.

Need for Vectorization

The amount of textual data is massive, and the problem with textual data is that it needs to be represented in a format that can be used mathematically to solve a problem.
In simple words, we need a numeric representation of each word. There are simple to complex ways to solve this problem.


One of the easiest ways to solve the problem is to create a simple word-to-integer mapping.

#sentence whose words are to be mapped to integers
line="Hello this is a tutorial on how to convert the word in an integer format"

#dictionary to hold the words
word_list={}

#counter for assigning an integer to each new word
counter=0

#iterate over the words
for word in line.split():
 #check if the word is already in the dict
 if word not in word_list:
  word_list[word]=counter
  #update the counter
  counter+=1
  


This gives us a dictionary of words with their corresponding integer representations.
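For instance, printing the dictionary built above should produce a mapping along these lines (the exact ordering depends on the sentence used):

print(word_list)
#{'Hello': 0, 'this': 1, 'is': 2, 'a': 3, 'tutorial': 4, 'on': 5, 'how': 6, 'to': 7, 'convert': 8, 'the': 9, 'word': 10, 'in': 11, 'an': 12, 'integer': 13, 'format': 14}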

Another way to get these numbers is by using TF-IDF

TF-IDF stands for term frequency-inverse document frequency. It assigns a weight to each word based on the number of occurrences in a document, while also taking into account how frequently the word appears across all the documents. This approach is better than the previous one because it lowers the weight of words that occur too often in all the sentences, like 'a', 'the', 'as', etc., and increases the weight of words that can be important in a sentence.
This is useful in scenarios where we want to extract the important words from all the documents.
This approach is also used in topic modelling.
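As a quick illustration of TF-IDF (this sketch uses scikit-learn, which is not otherwise used in this article), the weights for a small set of sentences could be computed like this:

from sklearn.feature_extraction.text import TfidfVectorizer

#small corpus of sentences
docs=["this is a beautiful day",
      "this day is a holiday",
      "Jack is going to office"]

#fit the vectorizer and transform the corpus into a TF-IDF matrix
vectorizer=TfidfVectorizer()
tfidf_matrix=vectorizer.fit_transform(docs)

#each row is a sentence, each column a word, each value a TF-IDF weight
print(vectorizer.vocabulary_)
print(tfidf_matrix.toarray())

Common words such as 'is' and 'this' should end up with lower weights than words like 'beautiful' or 'holiday' that appear in only one sentence.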

The third approach, and the one this article will focus on, is Word2Vec

Word2vec is a group of related models that are used to produce so-called word embeddings. These models are shallow, two-layer neural networks that are trained to reconstruct the linguistic contexts of words.

After training, word2vec models can be used to map each word to a vector of typically several hundred elements, which represent that word's relation to other words. This vector is the neural network's hidden layer.

Word2vec relies on either skip-grams or continuous bag of words (CBOW) to create neural word embeddings. It was created by a team of researchers led by Tomas Mikolov at Google. The algorithm has been subsequently analyzed and explained by other researchers.

Luckily, if you want to use this model in your work, you don't have to write these algorithms yourself.
Gensim is a Python library that has many of the features required for text processing and Natural Language Processing. In the rest of the article, we will learn to use this awesome library for word vectorization.

Installing Gensim

pip install --upgrade gensim

It has three major dependencies

  • Python
  • NumPy
  • SciPy
Make sure you install the dependencies before installing gensim.
Let's get our hands dirty with the code.

Text Preprocessing:

In this step, we will pre-process the text, for example by removing stop words and lemmatizing the words.
You can perform different steps based on your requirements.

I will use the nltk stopword corpus for stop word removal and the nltk WordNet lemmatizer for finding lemmas.
In order to use the nltk corpora you will need to download them using the following commands.

Downloading the corpus

import nltk
nltk.download()
#this will open a GUI from which you can download the corpus
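If you prefer not to use the GUI, the two resources needed in this article can also be downloaded directly (these are the standard nltk package names for the stopword list and the WordNet data):

import nltk

#download only the corpora used below
nltk.download('stopwords')
nltk.download('wordnet')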

Input initialization

#list of sentences to be vectorized
lines=["Hello this is a tutorial on how to convert the word in an integer format",
"this is a beautiful day","Jack is going to office"]

Removing the Stop Words

from nltk.corpus import stopwords
stop_words=set(stopwords.words('english'))

lines_without_stopwords=[]
#stop_words contains the set of stop words
for line in lines:
 temp_line=[]
 for word in line.split():
  if word not in stop_words:
   temp_line.append(word)
 string=' '
 lines_without_stopwords.append(string.join(temp_line))

lines=lines_without_stopwords

Lemmatization

#import WordNet Lemmatizer from nltk
from nltk.stem import WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()


lines_with_lemmas=[]
#lemmatize every word in each line
for line in lines:
 temp_line=[]
 for word in line.split():
  temp_line.append(wordnet_lemmatizer.lemmatize(word))
 string=' '
 lines_with_lemmas.append(string.join(temp_line))
lines=lines_with_lemmas
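To get a feel for what the lemmatizer does, here is a quick check on a single word (by default WordNetLemmatizer treats every word as a noun):

#plural noun reduced to its lemma
print(wordnet_lemmatizer.lemmatize('leaves'))
#leaf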




Now we have done the basic preprocessing of the text. Any other preprocessing steps can be applied in a similar way.

Preparing Input

We have our input in the form of a list of lines. In order for the model to process the data, we need to convert our input into a list of lists of words.

Our Input
lines=["Hello this is a tutorial on how to convert the word in an integer format",
"this is a beautiful day","Jack is going to office"]

New Input
lines=[['Hello', 'this', 'tutorial', 'on', 'how', 'convert', 'word', 'integer', 'format'],
['this', 'beautiful', 'day'], ['Jack', 'going', 'office']]

new_lines=[]
for line in lines:
 new_lines.append(line.split())
#new_lines has the new format
lines=new_lines

Building the WORD2VEC Model

Building a model with gensim is just a piece of cake.

#import the gensim package
import gensim

#train the word2vec model on our corpus
model = gensim.models.Word2Vec(lines, min_count=1, size=2)


Here it is important to understand the hyperparameters that can be used to train the model.
The Word2vec model constructor is defined as:

gensim.models.word2vec.Word2Vec(sentences=None, size=100, 
alpha=0.025, window=5, min_count=5, max_vocab_size=None, 
sample=0.001, seed=1, workers=3, min_alpha=0.0001, sg=0, 
hs=0, negative=5, cbow_mean=1, hashfxn=<built-in function hash>, 
iter=5, null_word=0, trim_rule=None, sorted_vocab=1,
batch_words=10000, compute_loss=False)


sentences = the input corpus, provided as a list of lists of words (as prepared above)

size = the dimensionality of the vector each word is converted to ('Hello'=[ ?, ?, ? ] if size=3)

alpha= It is the initial learning rate (will linearly drop to min_alpha as training progresses).

window= It is the maximum distance between the current and predicted word within a sentence.

min_count= ignore all words with total frequency lower than this.

max_vocab_size = limit RAM during vocabulary building; if there are more unique words than this, then prune the infrequent ones. Every 10 million word types need about 1GB of RAM. Set to None for no limit (default).

sample = threshold for configuring which higher-frequency words are randomly downsampled;
default is 1e-3, useful range is (0, 1e-5).

workers = use this many worker threads to train the model (=faster training with multicore machines).

hs = if 1, hierarchical softmax will be used for model training. If set to 0 (default), and negative is non-zero, negative sampling will be used.

negative = if > 0, negative sampling will be used, the int for negative specifies how many “noise words” should be drawn (usually between 5-20). Default is 5. If set to 0, no negative sampling is used.

cbow_mean = if 0, use the sum of the context word vectors. If 1 (default), use the mean. Only applies when cbow is used.

hashfxn = hash function to use to randomly initialize weights, for increased training reproducibility. The default is Python’s rudimentary built-in hash function.

iter = number of iterations (epochs) over the corpus. Default is 5.

trim_rule = vocabulary trimming rule specifies whether certain words should remain in the vocabulary, be trimmed away, or handled using the default (discard if word count < min_count). Can be None (min_count will be used), or a callable that accepts parameters (word, count, min_count) and returns either utils.RULE_DISCARD, utils.RULE_KEEP or utils.RULE_DEFAULT. Note: The rule, if given, is only used to prune vocabulary during build_vocab() and is not stored as part of the model.

sorted_vocab = if 1 (default), sort the vocabulary by descending frequency before assigning word indexes.

batch_words = target size (in words) for batches of examples passed to worker threads (and thus cython routines). Default is 10000. (Larger batches will be passed if individual texts are longer than 10000 words, but the standard cython code truncates to that maximum.)
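As an example of using a few of these hyperparameters together, a skip-gram model with larger vectors and more epochs could be trained as follows (the values are purely illustrative, not tuned recommendations):

import gensim

#skip-gram model (sg=1) with 100-dimensional vectors,
#a context window of 5 and 10 passes over the corpus
model = gensim.models.Word2Vec(lines,
                               size=100,     #dimensionality of the word vectors
                               window=5,     #maximum distance between current and predicted word
                               min_count=1,  #keep even rare words for this tiny corpus
                               workers=3,    #number of worker threads
                               sg=1,         #1 = skip-gram, 0 = CBOW
                               iter=10)      #number of epochs over the corpus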

Using the model

#saving the model for persistence
model.save('model.bin')

#loading the model back
model = gensim.models.Word2Vec.load('model.bin')

#getting the most similar words
model.most_similar(positive=['beautiful', 'world'], negative=['convert'], topn=1)

#finding the odd one out
model.doesnt_match("bullish holding stock".split())

#getting the vector for any word
model[word]

#finding the similarity between words
model.similarity('woman', 'man')
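Keep in mind that most_similar, doesnt_match and similarity only work for words that are present in the trained vocabulary, so with the tiny three-sentence corpus above a realistic call looks more like this (the numbers will vary between runs):

#most similar words to a word from our corpus
print(model.most_similar(positive=['beautiful'], topn=2))

#vector for a single word (its length equals the size parameter used during training)
print(model['day'])

#similarity between two words from the corpus
print(model.similarity('beautiful', 'day'))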




For more details, you can read the gensim word2vec documentation.
