Text Analysis - Part 1
Recently I was going through some text-analytics activities at work and picked up a few techniques for analysing and mining text.
In this series of posts, I will share my experiences and learnings.
Introduction
Firstly, the terms Text Analysis and Text Mining are sub-domains of Data Mining and are used interchangeably in most scenarios.
Broadly, Text Analysis refers to
extracting information from textual data, keeping in mind the problem we want the data to address,
and Text Mining refers to
the process of obtaining textual data.
Textual data comes from many sources, e.g. discussion forums.
In the scenario described in the image above, every observer may perceive the real world differently and express it accordingly; e.g. different people may hold different opinions on the same topic.
So, while analysing textual data, this kind of bias has to be kept in mind.
The types of information that can be extracted from text and turned into actionable knowledge include:
- Mining the content of the data
- Mining knowledge about the language
- Mining knowledge about the observer
- Inferring other real-world variables
This is a fairly complex problem, as understanding natural language involves common sense, which machines lack. Many sentences spoken by humans rely on common sense and are spoken in some context that determines their meaning. E.g.:
- He is the apple of my eye. (idiom)
- John is the star. (movie star or celestial body?)
Analyzing natural language can be broadly classified into 4 types of analysis:
- Lexical Analysis - identifying parts of speech, word associations, topic mining
- Syntactic Analysis - connecting words into phrases, and then connecting phrases
- Semantic Analysis - extracting knowledge about the sentence
- Pragmatic Analysis - getting the intent of the sentence
Lexical analysis, for example, can yield:
- POS tags (nouns, verbs, adverbs, etc.)
- Word associations (how two words are linked)
- Topic (context of the paragraph)
Word associations come in two flavours:
- Syntagmatic relations - words that tend to co-occur; in "Jack drives a car", "car" is associated with "drive".
- Paradigmatic relations - words that can substitute for each other; in "I love cats" and "I love dogs", "cats" is related to "dogs" because either word yields a valid sentence.
Some more sentences containing syntagmatic pairs:
- Bob eats food
- John drives a car
- He sits on a sofa
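Syntagmatic relations like these can be surfaced by simple co-occurrence counting. Here is a minimal sketch in Python (the toy corpus and function name are illustrative, not from any particular library):

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(sentences):
    """Count how often each unordered pair of words appears in the same sentence."""
    counts = Counter()
    for s in sentences:
        # sorted(set(...)) deduplicates words and gives each pair a canonical order
        words = sorted(set(s.lower().split()))
        counts.update(combinations(words, 2))
    return counts

sentences = [
    "Bob eats food",
    "John drives a car",
    "Jack drives a car",
]
counts = cooccurrence_counts(sentences)
# ("car", "drives") co-occurs in two sentences, hinting at a syntagmatic link
print(counts[("car", "drives")])  # 2
```

Real systems normalise these counts (e.g. with mutual information) rather than using raw frequencies, but the idea is the same: words that keep showing up together are syntagmatically related.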
Xw is a binary random variable indicating whether the word w is present in a sentence (Xw = 1 if present, Xw = 0 if not).
H(Xw) is the entropy of the variable Xw, defined as
H(Xw) = -P(Xw=0) log P(Xw=0) - P(Xw=1) log P(Xw=1)
where P(Xw=V) is the probability that the word is present (V=1) or absent (V=0), and log P(Xw=V) is the natural logarithm of that probability.
"Higher the entropy more difficult it is to predict the syntagmatic relation"
Now that we know the concept of entropy and how it captures the difficulty of predicting a word's presence in a sentence, let's introduce one more concept, Conditional Entropy.
It is defined as the entropy of a word w when it is already known whether another word v has occurred in the sentence.
Conditioning on this knowledge can only reduce the entropy, which in turn reduces the randomness of the random variable and makes the word easier to predict.
For the words "meat" and "eats", Conditional Entropy can be defined as
H(Xmeat | Xeats=1) = -P(Xmeat=0 | Xeats=1) log P(Xmeat=0 | Xeats=1) - P(Xmeat=1 | Xeats=1) log P(Xmeat=1 | Xeats=1)
This measures how uncertain the occurrence of "meat" is once we know "eats" has occurred; a low value indicates a strong syntagmatic relation between the two words.