Information Extraction using GROKS in Python
In my previous blog post, I wrote about information extraction using GROKs and regex. If you have not read it, I encourage you to go through that post first.
One of the important aspects of any tool is the ability to use it in different environments and to automate tasks.
In this post, we will look at using GROKs in Python with the pygrok library.
By now we know that GROKs are a more readable form of regular expressions.
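To make the readability point concrete, here is a minimal sketch using only Python's built-in re module. It spells out by hand roughly what a grok pattern such as '%{WORD:name} is %{WORD:gender}, %{NUMBER:age} years old and weighs %{NUMBER:weight} kilograms' stands for. The regular expressions used for WORD and NUMBER are simplified assumptions, not the exact definitions that ship with pygrok.

```python
import re

# Hand-written regex that roughly corresponds to the grok pattern
# '%{WORD:name} is %{WORD:gender}, %{NUMBER:age} years old and weighs %{NUMBER:weight} kilograms'.
# The WORD and NUMBER sub-expressions are simplified assumptions.
raw_regex = (
    r'(?P<name>\b\w+\b) is (?P<gender>\b\w+\b), '
    r'(?P<age>\d+(?:\.\d+)?) years old and weighs '
    r'(?P<weight>\d+(?:\.\d+)?) kilograms'
)

text = 'gary is male, 25 years old and weighs 68.5 kilograms'

# groupdict() returns the named groups as a dictionary of strings
result = re.search(raw_regex, text).groupdict()
print(result)
```

The grok version says the same thing with named building blocks, which is what makes it easier to read and reuse.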
Installation
Pygrok is an implementation of GROK patterns in Python and is available through pip:

    pip install pygrok
Usage
The library lets us use both the pre-built GROK patterns and our own custom-built ones.
Parsing Text
Let's start with a very basic example:
    # import the package
    from pygrok import Grok

    # text to be processed
    text = 'gary is male, 25 years old and weighs 68.5 kilograms'

    # pattern which you want to match
    pattern = '%{WORD:name} is %{WORD:gender}, %{NUMBER:age} years old and weighs %{NUMBER:weight} kilograms'

    # create a Grok object by giving the pattern
    grok = Grok(pattern)

    # use the match function to get all the parsed fields
    print(grok.match(text))
NOTE: This will also return a result when the pattern matches only part of the string, i.e. unmatched text at the start and end of the string is ignored.
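The difference between a partial match and a match on the complete text can be illustrated with plain regular expressions. The sketch below uses only Python's re module, not pygrok, and the short pattern and sample text are made up for illustration.

```python
import re

# A short named-group regex, used here only for illustration
regex = re.compile(r'(?P<name>\w+) is (?P<gender>\w+)')

# Sample text with extra content before and after the part we care about
text = 'note: gary is male, 25 years old'

# search() accepts a match anywhere in the string (partial match)
partial = regex.search(text)

# fullmatch() succeeds only if the whole string matches, else returns None
full = regex.fullmatch(text)

print(partial.groupdict())
print(full)
```

If you need to be sure that the entire string was parsed, checking the matched span or using a full-match style call is one way to guard against silently ignored text.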
The list of all available GROK patterns can be seen here.
Using Custom GROK patterns
    # import the package
    from pygrok import Grok

    # text to be processed
    text = 'gary is male, 25 years old and weighs 68.5 kilograms'

    # pattern which you want to match
    pattern = '%{WORD:name} is %{WORD:gender}, %{NUMBER:age} years old and weighs %{NUMBER:weight} kilograms'

    # directory holding your custom pattern files (placeholder path, replace with your own)
    pattern_dir_path = 'path/to/custom/patterns'

    # custom patterns given directly as key-value pairs
    pat = {"S3_REQUEST_LINE": "(?:%{WORD:verb} %{NOTSPACE:request}(?: HTTP/%{NUMBER:httpversion})?|%{DATA:rawrequest})"}

    # create a Grok object with the pattern, the custom directory and the custom patterns
    grok = Grok(pattern, custom_patterns_dir=pattern_dir_path, custom_patterns=pat)

    # use the match function to get all the parsed fields
    print(grok.match(text))
We can provide a custom pattern directory using the custom_patterns_dir option; the directory has the same layout as the one which can be seen here.
If you only have a few patterns to add, you can avoid the overhead of creating a directory and pass the patterns as key-value pairs in the custom_patterns argument.
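To see why key-value pairs are enough to define a custom pattern, here is a rough sketch of how a custom pattern that refers to other patterns can be expanded into a plain regular expression. It uses only Python's re module; the expand helper and the BUILTIN_PATTERNS table are hypothetical and simplified, and this is not how pygrok is actually implemented.

```python
import re

# Simplified stand-ins for a few built-in grok patterns (assumptions,
# not the exact definitions shipped with pygrok)
BUILTIN_PATTERNS = {
    'WORD': r'\b\w+\b',
    'NUMBER': r'[+-]?\d+(?:\.\d+)?',
    'NOTSPACE': r'\S+',
    'DATA': r'.*?',
}

# Custom pattern given as a key-value pair, as in the example above
custom_patterns = {
    'S3_REQUEST_LINE': '(?:%{WORD:verb} %{NOTSPACE:request}'
                       '(?: HTTP/%{NUMBER:httpversion})?|%{DATA:rawrequest})',
}

# Matches references of the form %{NAME} or %{NAME:field}
GROK_REF = re.compile(r'%\{(\w+)(?::(\w+))?\}')


def expand(pattern, patterns):
    """Recursively replace %{NAME} and %{NAME:field} with plain regex (hypothetical helper)."""
    def substitute(m):
        name, field = m.group(1), m.group(2)
        inner = expand(patterns[name], patterns)
        # A field name becomes a named group, otherwise a non-capturing group
        return f'(?P<{field}>{inner})' if field else f'(?:{inner})'
    return GROK_REF.sub(substitute, pattern)


# Custom patterns are merged over the built-in ones
all_patterns = dict(BUILTIN_PATTERNS, **custom_patterns)
regex = re.compile(expand('%{S3_REQUEST_LINE}', all_patterns))

fields = regex.fullmatch('GET /index.html HTTP/1.1').groupdict()
print(fields)
```

Because the custom entry is just a name mapped to a pattern string, it can reuse existing patterns such as WORD or NUMBER without any pattern file on disk.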
I feel some functionality could be added to pygrok, such as accepting a file path rather than a directory path, or parsing the complete text and returning None otherwise, which I will try to contribute to the project.
I hope this blog helps you in your parsing journey.
Happy Learning.