Using GROK for Information Extraction from Text
What is information extraction from text?
One of the key parts of working with text data is extracting structured information from the raw text.
Let's take an example of a line of text that belongs to some dataset and has data in the following form.
The information extracted from this text would look like this:
This information can then be used further in any Machine Learning model.
Generally, we perform this step in the very early stages of data preprocessing. There are many advanced ways to deal with it, but the old way of using regex remains the undefeated champion.
REGEX plays an important role whenever we work with text data. Here, we will discuss two ways to handle this data extraction:
- REGEX
- GROK
The REGEX Approach
Regex is defined by regular-expressions.info as follows: "A regular expression (regex or regexp for short) is a special text string for describing a search pattern. You can think of regular expressions as wildcards on steroids."
This site has some very good resources to learn regex if you are not familiar with it.
Now, we will create a regular expression and use named groups to extract this information. One very useful tool for building and testing your regex is regex101.
This will extract the information from the text as expected.
One of the problems with regex is that a pattern can become really complex, which hampers readability and in turn affects the quality of your code.
One clean alternative to regex is GROK patterns.
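As a minimal sketch of the named-group approach in Python: the sample line and the field names (`date`, `time`, `level`, `ip`, `message`) below are hypothetical and chosen only for illustration, so adapt both to your own data.

```python
import re

# Hypothetical sample line; the layout and field names are assumptions.
text = "2021-03-14 10:15:32 ERROR 192.168.1.10 Connection timed out"

# Each (?P<name>...) is a named group, so every match is labelled by field.
pattern = re.compile(
    r"(?P<date>\d{4}-\d{2}-\d{2})\s+"
    r"(?P<time>\d{2}:\d{2}:\d{2})\s+"
    r"(?P<level>[A-Z]+)\s+"
    r"(?P<ip>\d{1,3}(?:\.\d{1,3}){3})\s+"
    r"(?P<message>.*)"
)

match = pattern.match(text)
# groupdict() returns the named groups as a plain dictionary.
extracted = match.groupdict() if match else {}
print(extracted)
```

The resulting dictionary can be turned into a row of features for later preprocessing steps.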
The GROK Approach
As per the ELK documentation, "Grok is a great way to parse unstructured log data into something structured and queryable. This tool is perfect for syslog logs, apache and other webserver logs, mysql logs, and in general, any log format that is generally written for humans and not computer consumption."
Too technical a definition, isn't it?
Let's have a look at how we can write a GROK pattern for the same data we used above.
Internally, GROK works the same way as regex, but it makes the pattern more readable, and one doesn't have to understand regular expressions to build a GROK pattern (although that understanding is required to build good ones).
One can use simple, English-like names that stand for regular expressions to parse the data.
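To illustrate the idea that a GROK pattern is a readable layer over regex, here is a rough sketch that expands `%{NAME:field}` tokens into named regex groups using only Python's `re` module. This is not how any particular GROK library is implemented. The `PATTERNS` dictionary borrows names from the common pattern set, but its definitions are simplified assumptions, and `grok_to_regex`, `grok_match`, and the sample line are hypothetical.

```python
import re

# Tiny stand-in pattern library. The regexes are simplified assumptions,
# not the definitions shipped with real GROK implementations.
PATTERNS = {
    "DATE": r"\d{4}-\d{2}-\d{2}",
    "TIME": r"\d{2}:\d{2}:\d{2}",
    "LOGLEVEL": r"[A-Z]+",
    "IP": r"\d{1,3}(?:\.\d{1,3}){3}",
    "GREEDYDATA": r".*",
}

# Matches %{NAME} or %{NAME:field}.
GROK_TOKEN = re.compile(r"%\{(\w+)(?::(\w+))?\}")

def grok_to_regex(grok):
    """Expand each %{NAME:field} token into a named regex group."""
    def expand(m):
        name, field = m.group(1), m.group(2)
        body = PATTERNS[name]
        # Without a field name, match the text but do not capture it.
        return f"(?P<{field}>{body})" if field else f"(?:{body})"
    return GROK_TOKEN.sub(expand, grok)

def grok_match(grok, text):
    """Return the extracted fields as a dict, or None if there is no match."""
    m = re.match(grok_to_regex(grok), text)
    return m.groupdict() if m else None

grok = "%{DATE:date} %{TIME:time} %{LOGLEVEL:level} %{IP:ip} %{GREEDYDATA:message}"
line = "2021-03-14 10:15:32 ERROR 192.168.1.10 Connection timed out"
result = grok_match(grok, line)
print(result)
```

The GROK string reads almost like a description of the line, while the regex it expands to stays hidden in the pattern library.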
There is a large list of predefined patterns available that can be explored at this link, and this list can be extended further using a suitable tool.
Once we have a GROK pattern ready, we can use a GROK Debugger to find any errors in it.
I will write one more post explaining the usage of GROK patterns in Python.
I hope you enjoyed reading.
Keep learning.