Using GROK for Information Extraction from Text

What is information extraction from text?

A key part of working with text data is extracting information from the raw text.
As an example, take a sentence that belongs to some dataset and has the following form.

Details are: Name Japneet Singh Age 27 Profession Software Engineer

The information extracted from this text would look like this:

Name: Japneet Singh
Age: 27
Profession: Software Engineer

This information can then be used further in any machine learning model.

Generally, we perform this step in the very early stages of data preprocessing. There are many advanced ways to deal with it, but the old way of using regex remains the undefeated champion.

Regex plays an important role whenever we work with text data. Here, we will discuss two ways to extract the information:
  1. REGEX
  2. GROK

The REGEX Approach

Regular-Expressions.info defines regex as follows: "A regular expression (regex or regexp for short) is a special text string for describing a search pattern. You can think of regular expressions as wildcards on steroids."

That site has some very good resources for learning regex if you are not familiar with it.

Now we will create a regular expression that uses named groups to extract this information. A very useful tool for building and testing a regex is regex101.

^Details are: Name (?P<Name>.*?) Age (?P<Age>\d*?) Profession (?P<Profession>.*?)$

This extracts the information from the text as expected.
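To try the pattern outside of an online tester, here is a minimal sketch using Python's built-in re module. The variable names (PATTERN, text, extracted) are only illustrative and are not part of the original example.

```python
import re

# The pattern from the post: each named group captures one field.
PATTERN = re.compile(
    r"^Details are: Name (?P<Name>.*?) Age (?P<Age>\d*?) Profession (?P<Profession>.*?)$"
)

text = "Details are: Name Japneet Singh Age 27 Profession Software Engineer"

# match() returns None when the text does not fit the pattern,
# so fall back to an empty dict in that case.
match = PATTERN.match(text)
extracted = match.groupdict() if match else {}

for field, value in extracted.items():
    print(f"{field}: {value}")
```

Note that every captured value, including Age, comes back as a string, so a numeric field needs an explicit conversion such as int(extracted["Age"]).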


One of the problems with regex is that it can become really complex, which hampers readability and in turn affects the quality of your code.
One clean alternative to regex is GROK patterns.

The GROK Approach

As per ELK, "Grok is a great way to parse unstructured log data into something structured and queryable. This tool is perfect for syslog logs, Apache and other web server logs, MySQL logs, and in general, any log format that is generally written for humans and not computer consumption."

A rather technical definition, isn't it?

Let's have a look at how we can write a GROK pattern for the same data we used above:

Details are: Name %{DATA:Name} Age %{NUMBER:Age} Profession %{GREEDYDATA:Profession}

Internally, GROK works the same way as regex, but it makes the pattern more readable, and one doesn't have to understand regular expressions to build a GROK pattern (although that understanding is required to build good ones).
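To make the "same way as regex" point concrete, here is a rough sketch that expands the GROK pattern above into a regex with named groups. It is not how any particular GROK library is implemented: the GROK_PATTERNS table and the grok_to_regex helper are hypothetical, and the NUMBER definition is a simplified stand-in for the more elaborate one that ships with real pattern lists.

```python
import re

# Simplified stand-ins for three grok patterns (assumption: the real
# NUMBER definition is more elaborate than this one).
GROK_PATTERNS = {
    "DATA": r".*?",
    "GREEDYDATA": r".*",
    "NUMBER": r"[+-]?\d+(?:\.\d+)?",
}

def grok_to_regex(grok: str) -> str:
    """Expand each %{PATTERN:field} token into a named regex group."""
    def expand(token: re.Match) -> str:
        name, field = token.group(1), token.group(2)
        return f"(?P<{field}>{GROK_PATTERNS[name]})"
    # Literal text between tokens is left untouched (not escaped) in this sketch.
    return re.sub(r"%\{(\w+):(\w+)\}", expand, grok)

grok = "Details are: Name %{DATA:Name} Age %{NUMBER:Age} Profession %{GREEDYDATA:Profession}"
regex = grok_to_regex(grok)

text = "Details are: Name Japneet Singh Age 27 Profession Software Engineer"
match = re.fullmatch(regex, text)
fields = match.groupdict() if match else {}

print(regex)
print(fields)
```

The printed regex has the same shape as the hand-written one from the REGEX section, which is why the GROK version can be read as a friendlier spelling of the same pattern.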

One can use simple English names in place of regular expressions to parse the data.
A large list of predefined patterns is available and can be explored at this link, and the list can be extended further using a suitable tool.

Once we have a GROK pattern ready, we can use a GROK Debugger to find any errors in it.


I will write one more post explaining the usage of GROK patterns in Python.
I hope you enjoyed reading.
Keep learning.






