Text Analytics-Part 2
In the previous post, I wrote about extracting knowledge from text, which is available from many sources. In this post, I will be writing about Topic Mining.
Introduction
Topic Mining can be described as finding, within a group of words, the words that best describe the group.
Textual data in raw form is not associated with any context. A human can easily identify the context or topic of an article by reading it, and can then place it in a category such as politics, sports, economics or crime.
One of the factors a human considers while classifying text into a topic is knowledge of how a word is associated with that topic, e.g.
- India won over Sri Lanka in the test match.
- World Badminton Championships: When and where to watch Kidambi Srikanth’s first round, live TV coverage, time in IST, live streaming
Here the word "sports" does not appear explicitly in either sentence, but several of their words (for example "test match" and "Badminton") are strongly associated with sports.
Topic modelling can be broadly categorised into two types:
- Rule-Based topic modelling
- Unsupervised topic modelling
Rule-Based Topic Modelling
As the name suggests, rule-based topic modelling depends on rules that associate a given text with a topic.
In the simplest rule-based approach, we just search for certain words in the text and associate the text with a topic, e.g. finding the word "sports" associates the text with the topic Sports, and finding "travelling" associates it with the topic Travel.
This approach can be extended so that a topic is represented as a set of words with given probabilities, e.g. for the category Sports we can have a set of words with a weight assigned to each word.
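The keyword lookup described above can be sketched in a few lines of Python. This is only an illustrative sketch: the rule table `KEYWORD_RULES` and the helper `match_topics` are hypothetical names, and the keywords are just the ones mentioned in this post.

```python
# Hypothetical rule table: a topic is assigned when any of its keywords appears.
KEYWORD_RULES = {
    "sports": {"sports"},
    "travel": {"travel", "travelling"},
}


def match_topics(text, rules=KEYWORD_RULES):
    """Return the sorted list of topics whose keywords occur in the text."""
    # Lower-case the words and strip surrounding punctuation before matching.
    words = {w.strip(".,:;!?\"'").lower() for w in text.split()}
    return sorted(topic for topic, keywords in rules.items() if words & keywords)


print(match_topics("Badminton players are travelling to UK for the tournament"))
print(match_topics("India won over Sri Lanka in the test match."))
```

With these rules the first sentence is matched only to travel and the second sentence is matched to nothing at all, which shows how brittle a pure keyword search is.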
Topic: Sports {"sports": 0.4, "cricket": 0.1, "badminton": 0.1, "travelling": 0.05, ...}
Topic: Travel {"travel": 0.4, "hiking": 0.1, "train": 0.05, "travelling": 0.20, ...}
Notice the word "travelling": it occurs in both topics but with a different weight in each.
Take the sentence "Badminton players are travelling to UK for the tournament". With the simple approach of searching for topic words, this sentence goes under the topic Travel. The second approach improves the prediction by combining the weights of all matching words, in this case "Badminton" and "travelling" together with any other weighted sports words the sentence contains, and so can arrive at the more accurate result, Sports.
The main disadvantage of the rule-based approach is that all the topics have to be known in advance, and the probabilities have to be determined and verified by hand. This rules out the possibility of discovering a new topic in the text corpus.
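A minimal sketch of the weighted approach, assuming that a sentence's score for a topic is simply the sum of the weights of its matching words (the post does not say how the weights are combined). The entries for "tournament" and "players" are assumptions standing in for the elided part of the Sports list, and `score_topics` and `classify` are hypothetical helper names.

```python
TOPIC_WEIGHTS = {
    "Sports": {
        "sports": 0.4, "cricket": 0.1, "badminton": 0.1, "travelling": 0.05,
        # Assumed entries, standing in for the elided part of the list:
        "tournament": 0.1, "players": 0.05,
    },
    "Travel": {"travel": 0.4, "hiking": 0.1, "train": 0.05, "travelling": 0.20},
}


def tokenize(text):
    """Lower-case the text and strip surrounding punctuation from each word."""
    return [w.strip(".,:;!?\"'").lower() for w in text.split()]


def score_topics(text, topic_weights=TOPIC_WEIGHTS):
    """Sum, per topic, the weights of the words that occur in the text."""
    tokens = tokenize(text)
    return {
        topic: sum(weights.get(token, 0.0) for token in tokens)
        for topic, weights in topic_weights.items()
    }


def classify(text, topic_weights=TOPIC_WEIGHTS):
    """Return the highest-scoring topic, or None if no word matched."""
    scores = score_topics(text, topic_weights)
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


sentence = "Badminton players are travelling to UK for the tournament"
print(score_topics(sentence))
print(classify(sentence))
```

Note that with only the weights listed in the post ("badminton" 0.1 plus "travelling" 0.05 against "travelling" 0.20), Travel would still score higher under this summing rule, so the outcome depends on the remaining entries of each list.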
Unsupervised Topic Modelling
The topic of a sentence largely depends on the words used in it, and unsupervised topic modelling techniques exploit this property to extract topics from sentences.
It largely relies on the Bayesian Inference Model.
Bayesian Inference Model
It is a method by which we can calculate the probability of an event based on some common-sense assumptions and the outcomes of previous related events.
It also allows us to use new observations to improve the model, by going through many iterations in which a prior probability is updated with the observed evidence to produce a new and improved posterior probability.
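The prior-to-posterior update can be illustrated with a small sketch of Bayes' rule, where the posterior is proportional to the prior multiplied by the likelihood of the evidence. The helper `bayes_update` and all the numbers in `WORD_LIKELIHOODS` are made up for illustration; they are not taken from any real corpus.

```python
def bayes_update(prior, likelihood):
    """Apply Bayes' rule once.

    prior:      dict mapping hypothesis -> P(hypothesis)
    likelihood: dict mapping hypothesis -> P(evidence | hypothesis)
    Returns the normalised posterior P(hypothesis | evidence).
    """
    unnormalised = {h: prior[h] * likelihood[h] for h in prior}
    evidence = sum(unnormalised.values())
    return {h: p / evidence for h, p in unnormalised.items()}


# Illustrative (assumed) probabilities of seeing each word under each topic.
WORD_LIKELIHOODS = {
    "badminton": {"sports": 0.10, "travel": 0.01},
    "travelling": {"sports": 0.05, "travel": 0.20},
    "tournament": {"sports": 0.10, "travel": 0.01},
}

# Start with no preference between the two topics.
belief = {"sports": 0.5, "travel": 0.5}

# Each observed word turns the current belief (prior) into a new posterior.
for word in ["badminton", "travelling", "tournament"]:
    belief = bayes_update(belief, WORD_LIKELIHOODS[word])
    print(word, belief)
```

Each pass through the loop uses the previous posterior as the new prior, which is the iterative refinement described above.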
Some of the techniques for Unsupervised Topic Modelling are:
- TF-IDF
- Latent Semantic Indexing
- Latent Dirichlet Allocation
All of these approaches use the vector space representation of documents. In the vector space model, each document is represented as a vector of term weights, and the corpus as a whole forms a document-term matrix.
The parameters of LdaModel can be tuned to improve the results. Also, by simply replacing LdaModel with LsiModel, one can use the LSI technique instead.
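As a rough sketch of this representation, the following pure-Python code builds a document-term matrix of raw counts and then applies one common TF-IDF variant (raw term count multiplied by log(N / document frequency)). The function names and the three sample documents are assumptions for illustration; libraries typically use smoothed or normalised variants of this weighting.

```python
import math


def tokenize(text):
    """Lower-case the text, strip punctuation and drop empty tokens."""
    words = (w.strip(".,:;!?\"'").lower() for w in text.split())
    return [w for w in words if w]


def document_term_matrix(docs):
    """Return (vocabulary, matrix) where matrix[i][j] counts term j in doc i."""
    tokenized = [tokenize(doc) for doc in docs]
    vocab = sorted({token for doc in tokenized for token in doc})
    index = {term: j for j, term in enumerate(vocab)}
    matrix = [[0] * len(vocab) for _ in docs]
    for i, doc in enumerate(tokenized):
        for token in doc:
            matrix[i][index[token]] += 1
    return vocab, matrix


def tfidf(matrix):
    """Weight each count by log(N / df), so rarer terms score higher."""
    n_docs = len(matrix)
    n_terms = len(matrix[0]) if matrix else 0
    # Document frequency: in how many documents does each term occur?
    df = [sum(1 for row in matrix if row[j] > 0) for j in range(n_terms)]
    return [
        [row[j] * math.log(n_docs / df[j]) for j in range(n_terms)]
        for row in matrix
    ]


docs = [
    "India won the test match",
    "Badminton players won the tournament",
    "Players are travelling by train",
]
vocab, counts = document_term_matrix(docs)
weights = tfidf(counts)
print(vocab)
print(counts[0])
print(weights[0])
```

Each row of the matrix is the vector for one document, and each column corresponds to one term of the vocabulary.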