Machine Learning has become one of the hottest topics in the data industry, with rising demand for professionals who can work in this domain. The scale of data being generated every day makes it easy to see why:
- 1,209,600 new data producing social media users each day.
- 656 million tweets per day!
- More than 4 million hours of content uploaded to YouTube every day, with users watching 5.97 billion hours of YouTube videos each day.
- 67,305,600 Instagram posts uploaded each day
- There are over 2 billion monthly active Facebook users, compared to 1.44 billion at the start of 2015 and 1.65 billion at the start of 2016.
- Facebook has 1.32 billion daily active users on average as of June 2017
- 4.3 BILLION Facebook messages posted daily!
- 5.75 BILLION Facebook likes every day.
- 22 billion texts sent every day.
- 5.2 BILLION daily Google Searches in 2017.
Need for Vectorization
One of the easiest ways to turn text into numbers is to create a simple word-to-integer mapping.
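To make this concrete, here is a minimal sketch of such a mapping in plain Python (the toy sentence is purely illustrative):

```python
# Toy corpus; each unique word gets an integer id in order of first appearance.
corpus = "the quick brown fox jumps over the lazy dog".split()

word_to_id = {}
for word in corpus:
    if word not in word_to_id:
        word_to_id[word] = len(word_to_id)

# Encode the sentence as a list of integers.
encoded = [word_to_id[word] for word in corpus]
print(encoded)  # → [0, 1, 2, 3, 4, 5, 0, 6, 7]
```

The drawback is that these integers are arbitrary: they carry no information about how similar two words are to each other.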
Another way to get these numbers is by using TF-IDF (Term Frequency-Inverse Document Frequency).
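As a rough illustration (the toy documents and the bare-bones scoring function below are assumptions for demonstration, not a production implementation), TF-IDF weighs a term by how often it appears in a document, discounted by how common it is across all documents:

```python
import math

# Two toy documents, already tokenized.
docs = [
    ["word", "embeddings", "capture", "meaning"],
    ["tf", "idf", "weights", "capture", "importance"],
]

def tf_idf(term, doc, docs):
    # Term frequency: share of the document taken up by this term.
    tf = doc.count(term) / len(doc)
    # Inverse document frequency: rarer terms across the corpus score higher.
    df = sum(1 for d in docs if term in d)
    idf = math.log(len(docs) / df)
    return tf * idf

print(tf_idf("capture", docs[0], docs))  # → 0.0 (appears in every document)
print(tf_idf("meaning", docs[0], docs))  # unique to docs[0], so score > 0
```

Unlike the plain integer mapping, TF-IDF scores reflect how informative a term is, but each word is still treated independently of its context.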
The third approach, and the one this article will focus on, is Word2Vec. Word2vec is a group of related models that are used to produce so-called word embeddings. These models are shallow, two-layer neural networks that are trained to reconstruct the linguistic contexts of words.
After training, word2vec models can be used to map each word to a vector of typically several hundred elements, which represent that word's relation to other words. This vector is the neural network's hidden layer.
Word2vec relies on either skip-grams or continuous bag of words (CBOW) to create neural word embeddings. It was created by a team of researchers led by Tomas Mikolov at Google. The algorithm has been subsequently analyzed and explained by other researchers.
Luckily, if you want to use this model in your work, you don't have to implement these algorithms yourself.
Gensim is a Python library that offers some of the awesome features required for text processing and Natural Language Processing. In the rest of the article, we will learn to use this library for word vectorization.
Gensim can be installed (or upgraded) with pip:

```
pip install --upgrade gensim
```
It has three major dependencies.
That completes the basic preprocessing of the text; any other preprocessing steps can be handled in a similar way.