Extracting text from PDF for NLP tasks

Introduction

Natural Language Processing involves collecting data from various sources, and you are not always lucky enough to get ready-made data. Often you have to extract the data yourself, and one common source is files. In this post, I will talk specifically about PDF files.

Getting the Guns ready

After some exploration on the internet, I came across a Python package, PyPDF2, which sounded like a good contender for what we want (text extraction), although it can do more than we need. The package can also be used to generate, decrypt, and merge PDF files. Its usage details are not that clear, which is why I thought of writing a post to explain it.

Installation

pip install PyPDF2

Reading the File and extracting Text

import PyPDF2

filename = 'complete path of your pdf file'

# opening the file
pdfFileObj = open(filename, 'rb')

# creating a pdf reader object
pdfReader = PyPDF2.PdfFileReader(pdf...
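
As a rough illustration of the read-and-extract step, here is a minimal sketch. It assumes PyPDF2 2.x or later, where `PdfReader`, `reader.pages`, and `page.extract_text()` take the place of the older `PdfFileReader` style of API. The helper names `extract_text` and `extract_text_from_file` are hypothetical ones chosen for this example and are not part of the package.

```python
def extract_text(reader):
    """Collect the text of every page from a PDF reader object.

    Assumption: `reader` exposes `.pages`, and each page has an
    `extract_text()` method, as PyPDF2.PdfReader does in PyPDF2 2.x/3.x.
    """
    page_texts = []
    for page in reader.pages:
        # Pages without a text layer (e.g. scanned images) may yield an
        # empty string or None, so fall back to "" before stripping.
        text = page.extract_text() or ""
        page_texts.append(text.strip())
    # Keep only non-empty pages, separated by a blank line.
    return "\n\n".join(t for t in page_texts if t)


def extract_text_from_file(filename):
    """Open `filename` in binary mode and return its extracted text."""
    # Imported here so extract_text() above has no hard dependency on PyPDF2.
    import PyPDF2

    # The context manager closes the file even if extraction fails.
    with open(filename, "rb") as pdf_file:
        reader = PyPDF2.PdfReader(pdf_file)
        return extract_text(reader)
```

Calling `extract_text_from_file('complete path of your pdf file')` would return one string with the pages separated by blank lines. Splitting the logic in two keeps the page loop independent of how the file is opened, so it can be tried out with any object that looks like a reader.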