Information Extraction using GROKS in Python

Groks in Python

In my previous blog, I wrote about information extraction using GROKs and regex.
If you have not read it yet, I encourage you to go through that post first.
An important aspect of any tool is the ability to use it in different environments and to automate tasks with it.

In this post, we will look at using GROKs in Python with the pygrok library.
By now we know that GROKs are a more readable form of regular expressions: named, reusable patterns that expand into ordinary regexes.
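To make the "readable regex" idea concrete, here is a minimal sketch of how a `%{NAME:field}` placeholder can be expanded into a named regex group using only the standard `re` module. The two pattern definitions and the helper `grok_to_regex` are simplified assumptions made for illustration; they are not pygrok's internals or its shipped pattern definitions.

```python
import re

# Tiny hand-picked pattern library. These simplified definitions are
# assumptions for illustration, not the ones that ship with pygrok.
PATTERNS = {
    "WORD": r"\b\w+\b",
    "NUMBER": r"[+-]?\d+(?:\.\d+)?",
}

def grok_to_regex(grok_pattern):
    """Replace each %{NAME:field} with a named group (?P<field>...)."""
    def expand(match):
        name, field = match.group(1), match.group(2)
        return "(?P<{}>{})".format(field, PATTERNS[name])
    return re.sub(r"%\{(\w+):(\w+)\}", expand, grok_pattern)

# Expand a GROK-style pattern, then use it like any other regex.
regex = grok_to_regex("%{WORD:name} is %{NUMBER:age} years old")
result = re.search(regex, "gary is 25 years old").groupdict()
print(regex)
print(result)
```

The GROK-style pattern stays short and descriptive, while the expanded regex is what actually does the matching.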

Installation

Pygrok is a Python implementation of GROK patterns, and it is available through pip:

pip install pygrok

Usage

The library works with the pre-built GROK patterns as well as with our own custom-built ones.
Let's start with a very basic example:

Parsing Text 

#import the package
from pygrok import Grok
#text to be processed
text = 'gary is male, 25 years old and weighs 68.5 kilograms'
#pattern which you want to match
pattern = '%{WORD:name} is %{WORD:gender}, %{NUMBER:age} years old and weighs %{NUMBER:weight} kilograms'
#create a GROK object by giving the pattern
grok = Grok(pattern)
#use match function to get all the parsed fields (print() is a function in Python 3)
print(grok.match(text))

NOTE: match will also return a result when the pattern matches only part of the string, i.e. unmatched text at the start and end of the string is ignored.
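The difference between a partial match and a match over the complete text can be sketched with the standard `re` module. The regex below is a hand-written stand-in for an expanded GROK pattern (an assumption for illustration), and pygrok's behaviour may differ between versions, so treat this as a picture of the idea rather than of the library itself.

```python
import re

# Hand-written regex standing in for an expanded GROK pattern
# (an assumption for illustration).
regex = r"(?P<name>\w+) is (?P<age>\d+) years old"
text = "note: gary is 25 years old, apparently"

# search() accepts a match anywhere in the string (partial matching).
partial = re.search(regex, text)
# fullmatch() requires the whole string to match the pattern.
strict = re.fullmatch(regex, text)

partial_fields = partial.groupdict() if partial else None
strict_fields = strict.groupdict() if strict else None
print(partial_fields)
print(strict_fields)
```

If you need the whole line to match, anchoring the pattern with `^` and `$` gives a similar effect with a search-style match.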

The list of all the available GROK patterns can be seen here

Using Custom GROK patterns 

#import the package
from pygrok import Grok
#text to be processed
text = 'gary is male, 25 years old and weighs 68.5 kilograms'
#pattern which you want to match
pattern = '%{WORD:name} is %{WORD:gender}, %{NUMBER:age} years old and weighs %{NUMBER:weight} kilograms'
#directory holding your custom pattern files (placeholder path, replace it with your own)
pattern_dir_path = 'path/to/custom/patterns'
#custom patterns passed directly as key-value pairs
pat = {"S3_REQUEST_LINE": "(?:%{WORD:verb} %{NOTSPACE:request}(?: HTTP/%{NUMBER:httpversion})?|%{DATA:rawrequest})"}
#create a GROK object by giving the pattern together with the custom patterns
grok = Grok(pattern, custom_patterns_dir=pattern_dir_path, custom_patterns=pat)
#use match function to get all the parsed fields
print(grok.match(text))


We can provide a custom pattern directory using the custom_patterns_dir option. The directory is structured the same way as the one which can be seen here.

If you only have a few patterns to add, you can avoid the overhead of creating a directory and pass the patterns as key-value pairs in the custom_patterns field.
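If your patterns already live in a single file, one way to get such key-value pairs is to parse the file yourself. The sketch below assumes the conventional GROK pattern-file layout of one `NAME regex` pair per line with `#` for comments; the helper `parse_pattern_lines` and the two sample patterns are hypothetical and not part of pygrok.

```python
def parse_pattern_lines(lines):
    """Turn 'NAME regex' lines into a {name: regex} dictionary."""
    patterns = {}
    for line in lines:
        line = line.strip()
        # Skip blank lines and comments.
        if not line or line.startswith("#"):
            continue
        # The name ends at the first space; the rest is the regex.
        name, _, regex = line.partition(" ")
        patterns[name] = regex.strip()
    return patterns

# Hypothetical pattern definitions, inlined here instead of read from disk.
sample = r"""
# custom patterns
POSTCODE \d{5}
USERNAME [a-zA-Z0-9._-]+
""".splitlines()

custom_patterns = parse_pattern_lines(sample)
print(custom_patterns)
```

A dictionary built this way has the same shape as the key-value pairs passed in the custom_patterns field.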


I feel there is some functionality that could be added to pygrok, such as accepting a file path rather than a directory path, or parsing the complete text and returning None otherwise. I will try to contribute these to the project.

I hope this blog helps you in the parsing journey.
Happy Learning.


