Posts

Showing posts with the label Information Extraction

Extracting text from PDF for NLP tasks

Introduction

Natural Language Processing involves collecting data from various sources, and you are not always lucky enough to get ready-made data. Often you have to extract the data yourself, and files are one such source. In this post, I will talk specifically about PDF files.

Getting the Guns ready

After some exploration on the internet, I came across the Python package PyPDF2, which sounded like a good contender for what we want (text extraction), although it can do more than we need. The package can also generate, decrypt, and merge PDF files. Its usage details are not that clear, which is why I thought of writing a post to explain it.

Installation

pip install PyPDF2

Reading the File and extracting Text

import PyPDF2
filename = 'complete path of your pdf file'
# opening the file
pdfFileObj = open(filename, 'rb')
# creating a pdf reader object
pdfReader = PyPDF2.PdfFileReader(pdf
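The excerpt breaks off while creating the reader object, so the rest of the original code is not shown here. As a rough sketch of where such a script typically goes next, the snippet below reads every page and joins the extracted text. It assumes a recent PyPDF2 release (2.x or later), where the reader class is named PdfReader and pages expose extract_text(); the excerpt itself uses the older PdfFileReader name. The helper names join_page_text and extract_pdf_text are made up for this illustration.

```python
def join_page_text(pages):
    """Concatenate the text of every page, separated by newlines."""
    # Defensive: treat a page with no extractable text as an empty string.
    return "\n".join((page.extract_text() or "") for page in pages)


def extract_pdf_text(path):
    """Return the text of the PDF at `path` (hypothetical helper)."""
    # Imported lazily so join_page_text can be used without PyPDF2 installed.
    # Assumption: PyPDF2 2.x or later, which provides PdfReader and .pages.
    import PyPDF2

    with open(path, "rb") as pdf_file:
        reader = PyPDF2.PdfReader(pdf_file)
        return join_page_text(reader.pages)


# Usage (the path is a placeholder, as in the excerpt):
# text = extract_pdf_text('complete path of your pdf file')
```

Scanned PDFs without a text layer generally yield empty strings with this approach, so it is worth checking the result before passing it on to an NLP pipeline.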

Information Extraction using GROKS in Python

Groks in Python

In my previous blog, I wrote about information extraction using GROKs and regex. If you have not read it, I encourage you to go through that blog first. One of the important aspects of any tool is the ability to use it in different environments and to automate tasks. In this post, we will look at the implementation of GROKs in Python using the pygrok library. By now we know that GROKs are a more readable form of regular expressions.

Installation

Pygrok is an implementation of GROK patterns in Python, available through pip:

pip install pygrok

Usage

The library is extremely useful for using the pre-built GROKs as well as our own custom-built GROKs. Let's start with a very basic example:

Parsing Text

# import the package
from pygrok import Grok
# text to be processed
text = 'gary is male, 25 years old and weighs 68.5 kilograms'
# pattern which you want to match
pattern = '%{WORD:
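To make the idea of "more readable regular expressions" concrete, here is a toy sketch of how a grok-style pattern can be expanded into an ordinary regex with named groups. This is not pygrok's implementation: the two base patterns, the helper names, and the full pattern string are assumptions written for this illustration, since the pattern in the excerpt is cut off.

```python
import re

# Tiny hand-picked subset of grok-style building blocks (illustrative only;
# the pattern library shipped with pygrok is much larger and may differ).
BASE_PATTERNS = {
    "WORD": r"\w+",
    "NUMBER": r"\d+(?:\.\d+)?",
}

# Matches tokens of the form %{NAME:field}
_GROK_TOKEN = re.compile(r"%\{(\w+):(\w+)\}")


def grok_to_regex(pattern):
    """Expand every %{NAME:field} token into a named regex group."""
    def expand(match):
        name, field = match.group(1), match.group(2)
        return "(?P<{}>{})".format(field, BASE_PATTERNS[name])
    return _GROK_TOKEN.sub(expand, pattern)


def toy_grok_match(pattern, text):
    """Return a dict of captured fields, or None if the text does not match."""
    match = re.search(grok_to_regex(pattern), text)
    return match.groupdict() if match else None


text = "gary is male, 25 years old and weighs 68.5 kilograms"
# Assumed pattern for this sketch; the excerpt's own pattern is truncated.
pattern = ("%{WORD:name} is %{WORD:gender}, %{NUMBER:age} years old "
           "and weighs %{NUMBER:weight} kilograms")
result = toy_grok_match(pattern, text)
print(result)
```

With pygrok itself, the matching step is done by building a Grok object from the pattern and calling its match method on the text, which likewise returns a dictionary of named fields.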

Using GROK for Information Extraction from Text

What is information extraction from text?

One of the key parts of working with text data is extracting information from the raw text. Let's take an example sentence that belongs to some dataset and holds data in the following form:

Details are: Name Japneet Singh Age 27 Profession Software Engineer

The information extracted from this text would look like:

Name: Japneet Singh
Age: 27
Profession: Software Engineer

This information can then be used in any machine learning model. Generally, we perform this step in the very early stages of data preprocessing. There are many advanced ways to deal with it, but the old way of using regex remains the undefeated champion. Regex plays an important role whenever we work with text data. Here, we will discuss two ways to extract this information: regex and GROK.

The REGEX Approach

Regex is defined by regular-expression.info as: A regular expressi
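For the sample sentence in this excerpt, the regex approach can be sketched roughly as follows. The pattern and variable names are assumptions made for illustration, not the expression used in the full post, and the pattern relies on the field labels appearing in the order Name, Age, Profession.

```python
import re

text = "Details are: Name Japneet Singh Age 27 Profession Software Engineer"

# Named groups capture the value that follows each field label.
# The lazy .+? stops the name at the first " Age <digits>" that follows it.
DETAILS = re.compile(
    r"Name\s+(?P<name>.+?)\s+"
    r"Age\s+(?P<age>\d+)\s+"
    r"Profession\s+(?P<profession>.+)$"
)

match = DETAILS.search(text)
info = match.groupdict() if match else {}

for field, value in info.items():
    print("{}: {}".format(field.capitalize(), value))
```

The hard-coded label order is the main weakness of this sketch: if the fields can appear in a different order or be missing, each label would need its own search.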