Posts

Showing posts with the label NLP

Information Extraction using GROKS in Python

Groks in Python: In my previous blog, I wrote about information extraction using GROKs and regex. If you have not read it, I encourage you to go through that post first. One of the important aspects of any tool is the ability to use it in different environments and to automate tasks. In this post, we will look at the implementation of GROKs in Python using the pygrok library. By now we know that GROKs are a more readable form of regular expressions. Installation: pygrok is an implementation of GROK patterns in Python, available through pip: pip install pygrok. Usage: The library is useful for working with the pre-built GROKs as well as our own custom-built GROKs. Let's start with a very basic example. Parsing Text: #import the package from pygrok import Grok #text to be processed text = 'gary is male, 25 years old and weighs 68.5 kilograms' #pattern which you want to match pattern = '%{WORD:
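
To make the "more readable regex" idea concrete, here is a rough sketch of how a grok-style pattern can be expanded into an ordinary regular expression with named groups. It uses only Python's built-in re module and is not pygrok's actual implementation; the names BASE_PATTERNS, grok_to_regex and grok_match, and the sample text, are made up for this illustration.

```python
import re

# Hypothetical, tiny pattern library for illustration only.
# Real grok libraries ship a much larger set of named patterns.
BASE_PATTERNS = {
    "WORD": r"\w+",
    "INT": r"[+-]?\d+",
    "NUMBER": r"[+-]?\d+(?:\.\d+)?",
}


def grok_to_regex(grok_pattern):
    """Expand every %{NAME:field} placeholder into a named regex group."""
    def expand(placeholder):
        pattern_name, field = placeholder.group(1), placeholder.group(2)
        return "(?P<%s>%s)" % (field, BASE_PATTERNS[pattern_name])

    return re.sub(r"%\{(\w+):(\w+)\}", expand, grok_pattern)


def grok_match(grok_pattern, text):
    """Return a dict of extracted fields, or None when the text does not match."""
    found = re.search(grok_to_regex(grok_pattern), text)
    return found.groupdict() if found else None


sample_pattern = "%{WORD:name} is %{INT:age} years old"
print(grok_to_regex(sample_pattern))
print(grok_match(sample_pattern, "alice is 30 years old"))
```

The grok pattern stays short and self-describing, while the regex it expands to is what actually does the matching. All extracted values come back as strings, so any numeric conversion is left to the caller in this sketch.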

Using GROK for Information Extraction from Text

What is information extraction from text? One of the key parts of working with text data is extracting information from the raw text. Let's take an example of a sentence that has data in the following form: Details are: Name Japneet Singh Age 27 Profession Software Engineer. The information extracted from this text would look like: Name: Japneet Singh, Age: 27, Profession: Software Engineer. This information can then be used in any machine learning model. Generally, we perform this step in the very early stages of data preprocessing. There are many advanced ways to deal with it, but the old way of using regex remains the undefeated champion. Regex plays an important role whenever we work with text data. Here, we will discuss two ways to extract the information: regex and GROK. The REGEX Approach: Regex is defined by regular-expressions.info as A regular expressi
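
As a minimal sketch of the regex approach on the sample sentence above, the snippet below uses Python's built-in re module with named groups. The pattern and the field names are assumptions chosen for this illustration and are not taken from the original post.

```python
import re

text = "Details are: Name Japneet Singh Age 27 Profession Software Engineer"

# Each named group captures the value that follows its label.
# The lazy ".+?" stops the name at the first " Age " label.
pattern = re.compile(
    r"Name\s+(?P<name>.+?)\s+Age\s+(?P<age>\d+)\s+Profession\s+(?P<profession>.+)$"
)

found = pattern.search(text)
info = found.groupdict() if found else {}

for field, value in info.items():
    print(field + ": " + value)
```

This works because the labels (Name, Age, Profession) appear in a fixed order. If the order or the labels vary, the pattern has to be adjusted, which is where a readable pattern layer such as GROK becomes attractive.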

Doc2Vec Document Vectorization and clustering

Introduction: In my previous blog posts, I have written about word vectorization with implementation and use cases. You can read about it here. But many times we need to mine the relationships between phrases rather than between individual words. To take an example: "John has taken many leaves this year" and "Leaves are falling off the tree". In these two sentences, the common word "leaves" has a different meaning depending on the sentence in which it is used. This meaning can only be captured when we take the context of the complete phrase into account. Or we may want to measure the similarity of phrases and cluster them under one name. This post is going to be more of an implementation of doc2vec in Python than a detailed look at the algorithms. The algorithms use either hierarchical softmax or negative sampling; see Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean: “Efficient Estimation of Word Representations in Vector Space”, in Proceedings of Workshop at ICLR, 20

Word Vectorization

Introduction: Machine learning has become the hottest topic in the data industry, with increasing demand for professionals who can work in this domain. There is a large amount of textual data on the internet and on giant servers around the world. Just for some facts: 1,209,600 new data-producing social media users each day. 656 million tweets per day. More than 4 million hours of content uploaded to YouTube every day, with users watching 5.97 billion hours of YouTube videos each day. 67,305,600 Instagram posts uploaded each day. There are over 2 billion monthly active Facebook users, compared to 1.44 billion at the start of 2015 and 1.65 billion at the start of 2016. Facebook has 1.32 billion daily active users on average as of June 2017. 4.3 billion Facebook messages posted daily. 5.75 billion Facebook likes every day. 22 billion texts sent every day. 5.2 billion daily Google searches in 2017. Need for Vectorization: The amount of textual data is massive, and the problem with textual dat

Text Analysis - Part 1

Hi Readers, Recently I was working on some text analytics activities at work and learned some techniques for text analytics and mining. In this series of posts, I will be sharing my experiences and learnings. Introduction: First, the jargon. Text Analysis and Text Mining are subdomains of Data Mining, and the two terms are used interchangeably in most scenarios. Broadly, Text Analysis refers to extracting information from textual data while keeping in mind the problem for which we want the data, and Text Mining refers to the process of getting textual data. Nowadays, a large quantity of data is produced by humans. Data is growing faster than ever before, and by the year 2020 about 1.7 megabytes of new information will be created every second for every human being on the planet. One of the main components of this data will be textual data. Some of the main sources of textual data are: Blogs Articles Websites Facebook Comments Discussion Forums Reviews