Posts

Showing posts with the label Text Analytics

Extracting text from PDF for NLP tasks

Introduction Natural Language Processing projects involve collecting data from many sources, and you are not always lucky enough to get it pre-baked. Often you have to extract the data yourself, and files are one such source. In this post, I will talk specifically about PDF files. Getting the Guns ready After some exploration on the internet, I came across a Python package, PyPDF, which looked like a good candidate for what we want (text extraction), although it can do more than we need. The package can also be used to generate, decrypt and merge PDF files, but its usage details are not well documented, which is why I thought of writing a post to explain it. Installation pip install PyPDF2 Reading the File and extracting Text import PyPDF2 filename = 'complete path of your pdf file'  #opening the file  pdfFileObj = open(filename,'rb') #creating a pdf reader object pdfReader = PyPDF2.PdfFileReader(pdf...
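The excerpt cuts off mid-snippet; below is a minimal sketch of the extraction loop it hints at, assuming the pre-3.0 PyPDF2 API the post uses (PdfFileReader, numPages, getPage, extractText). Newer PyPDF2 releases renamed these to PdfReader, reader.pages and page.extract_text(), so adjust if you are on a recent version.

# Minimal sketch, not the post's exact code: read a PDF and collect its text.
import PyPDF2

filename = 'complete path of your pdf file'  # placeholder path, as in the post

with open(filename, 'rb') as pdfFileObj:          # open the PDF in binary mode
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)  # create a PDF reader object
    pages = []
    for page_num in range(pdfReader.numPages):    # iterate over every page
        page = pdfReader.getPage(page_num)        # fetch a single page object
        pages.append(page.extractText())          # extract its raw text
    full_text = '\n'.join(pages)

print(full_text[:500])  # preview the first 500 characters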

Using GROK for Information Extraction from Text

What is information extraction from text? One of the key parts of working with text data is extracting information from the raw text. Let's take an example of a sentence that carries data in the following form. Details are: Name Japneet Singh Age 27 Profession Software Engineer The information extracted from this text would look like Name: Japneet Singh Age: 27 Profession: Software Engineer This information can then be fed into any Machine Learning model. Generally, we perform this step in the very early stages of data preprocessing; there are many advanced ways to deal with it, but the old way of using regex remains the undefeated champion. Regex plays an important role whenever we are working with text data. Here, we will discuss two ways to extract the information: REGEX and GROK. The REGEX Approach Regex is defined by regular-expression.info as A regular expressi...
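The excerpt stops before the pattern itself, so as a purely illustrative sketch (not the author's exact regex), the REGEX approach over the sample "Details are: Name ... Age ... Profession ..." text could look like this in Python, using named groups:

import re

# Sample text in the shape shown above; the exact source string is an assumption.
text = "Details are: Name Japneet Singh Age 27 Profession Software Engineer"

# Named groups capture each field; this pattern is illustrative, not the post's own.
pattern = r"Name\s+(?P<name>.+?)\s+Age\s+(?P<age>\d+)\s+Profession\s+(?P<profession>.+)"

match = re.search(pattern, text)
if match:
    info = match.groupdict()
    print(info)  # {'name': 'Japneet Singh', 'age': '27', 'profession': 'Software Engineer'}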

Doc2Vec Document Vectorization and clustering

Introduction In my previous blog posts, I have written about word vectorization with implementation and use cases. You can read about it here. But many times we need to mine the relationships between phrases rather than individual words. To take an example: John has taken many leaves this year. Leaves are falling off the tree. In these two sentences, the common word "leaves" has a different meaning depending on the sentence in which it is used. That meaning can only be captured when we take the context of the complete phrase into account. Or we may want to measure the similarity of phrases and cluster them under one name. This post is going to be more of an implementation of doc2vec in Python than a deep dive into the details of the algorithms.  The algorithms use either hierarchical softmax or negative sampling; see  Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean: "Efficient Estimation of Word Representations in Vector Space, in Proceedings of W...
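The excerpt ends before the implementation; a minimal sketch of what a doc2vec pipeline could look like for the two example sentences is below. It assumes gensim (the usual Python implementation of doc2vec) and illustrative parameter values, not necessarily the post's exact code.

# Minimal gensim Doc2Vec sketch; gensim and the parameter values are assumptions.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

phrases = [
    "John has taken many leaves this year",
    "Leaves are falling off the tree",
]

# Each phrase becomes a TaggedDocument with a unique tag.
documents = [TaggedDocument(words=p.lower().split(), tags=[i]) for i, p in enumerate(phrases)]

# Train a small model; vector_size/epochs are illustrative, not tuned.
model = Doc2Vec(documents, vector_size=50, window=2, min_count=1, epochs=40)

# Infer a vector for an unseen phrase and find the most similar training phrase.
# (model.dv is the gensim 4.x name; older releases exposed model.docvecs.)
vec = model.infer_vector("leaves on the tree".split())
print(model.dv.most_similar([vec], topn=1))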

Word Vectorization

Introduction Machine Learning has become the hottest topic in the data industry, with increasing demand for professionals who can work in this domain. There is a large amount of textual data on the internet and on giant servers around the world. Just for some facts:
1,209,600 new data-producing social media users each day.
656 million tweets per day!
More than 4 million hours of content uploaded to YouTube every day, with users watching 5.97 billion hours of YouTube videos each day.
67,305,600 Instagram posts uploaded each day.
There are over 2 billion monthly active Facebook users, compared to 1.44 billion at the start of 2015 and 1.65 billion at the start of 2016. Facebook has 1.32 billion daily active users on average as of June 2017.
4.3 billion Facebook messages posted daily!
5.75 billion Facebook likes every day.
22 billion texts sent every day.
5.2 billion daily Google searches in 2017.
Need for Vectorization The amount of textual data is massive, and the problem with textual dat...
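The excerpt cuts off before the vectorization techniques themselves, so as a purely illustrative sketch (it is an assumption that the post covers this particular method), turning words into dense vectors with gensim's Word2Vec could look like:

# Illustrative Word2Vec sketch on a toy corpus; sizes and parameters are not tuned.
from gensim.models import Word2Vec

sentences = [
    "machine learning has become the hottest topic".split(),
    "there is a large amount of textual data on the internet".split(),
]

# Train a tiny model; vector_size/min_count are illustrative only.
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, epochs=50)

print(model.wv["data"][:5])           # first 5 dimensions of the vector for "data"
print(model.wv.most_similar("data"))  # nearest words in this toy vector space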

Text Analysis -Part 1

Hi Readers, recently I was working on some text analytics activities at work and learned a few techniques for text analytics and mining. In this series of posts, I will be sharing my experiences and learnings. Introduction First, the jargon: Text Analysis and Text Mining are sub-domains of Data Mining and are used interchangeably in most scenarios. Broadly, Text Analysis refers to extracting information from textual data while keeping in mind the problem for which we want the data, and Text Mining refers to the process of getting that textual data. Nowadays, a large quantity of data is produced by humans. Data is growing faster than ever before, and by the year 2020 about 1.7 megabytes of new information will be created every second for every human being on the planet, and one of the main components of this data will be textual data. Some of the main sources of textual data are Blogs, Articles, Websites, Facebook Comments, Discussion...