Extracting Text from PDF Using Apache Tika – Learn NLP

Most NLP applications need to look beyond plain text and HTML documents, as information may be contained in PDF, ePub, or other formats. The Apache Tika toolkit extracts metadata and text from such document formats. It comes with a Python library that talks to a Tika REST server running in the background, so a Java runtime is required. In this example we’ll see how to extract text from a PDF using Apache Tika.

Tika Installation

pip install tika

Extracting Text

from tika import parser

# Replace document.pdf with the path to your file
parsed = parser.from_file('document.pdf')

# The result is a dict; the extracted text is stored under the 'content' key
print(parsed['content'])
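
The extracted text often starts with a run of blank lines, and the 'content' value can be None when a file has no extractable text (for example, a scanned PDF). A minimal sketch of a wrapper that handles both cases follows; the names content_or_empty and extract_text are hypothetical helpers, not part of the tika package.

```python
def content_or_empty(parsed):
    """Return the stripped text from a Tika-style result dict, or '' if there is none."""
    content = parsed.get("content")
    return content.strip() if content else ""


def extract_text(path):
    """Parse a file with Tika and return its text (hypothetical helper)."""
    # Imported here so content_or_empty can be used without Tika installed.
    from tika import parser
    return content_or_empty(parser.from_file(path))
```

Calling extract_text('document.pdf') then returns a plain string that is safe to pass to later processing steps.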

Tika makes it convenient to extract text not just from PDFs but from many other formats too: the Tika project describes support for over a thousand file types. The Apache Tika documentation has a list of all supported document formats.
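
Because the same call works across formats, a folder of mixed documents can be handled in one loop. The sketch below assumes the result dict carries a 'metadata' dict with a 'Content-Type' entry, which is what Tika normally reports; summarize and summarize_files are hypothetical names chosen for illustration.

```python
def summarize(parsed):
    """Return (content type, number of text characters) for a Tika-style result dict."""
    metadata = parsed.get("metadata") or {}
    content = parsed.get("content") or ""
    # 'Content-Type' is assumed to be present; fall back to 'unknown' otherwise.
    return metadata.get("Content-Type", "unknown"), len(content.strip())


def summarize_files(paths):
    """Parse each path with Tika and map it to its summary (hypothetical helper)."""
    # Imported here so summarize can be used without Tika installed.
    from tika import parser
    return {path: summarize(parser.from_file(path)) for path in paths}
```

For example, summarize_files(['report.pdf', 'book.epub']) would return one (type, length) pair per file, which is a quick way to spot documents that yielded no text.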

References:

  1. Apache Tika Home Page
  2. PyPi Tika 1.15 Package

2 Comments

raj tanna (Guest)

It is giving an error: “ImportError: cannot import name parser”.
Can you please suggest how to import parser?
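
One possible cause, offered as an assumption since the comment gives no further details, is a local script or folder named tika shadowing the installed package, so that Python imports the wrong module. A small sketch for checking which file a module name resolves to follows; module_origin is a hypothetical helper.

```python
import importlib.util


def module_origin(name):
    """Return the file a top-level module name resolves to, or None if it is not found."""
    spec = importlib.util.find_spec(name)
    return spec.origin if spec else None


# If this prints a path inside your own project rather than site-packages,
# a local file is probably shadowing the installed tika package.
print(module_origin("tika"))
```

If the path points at your own tika.py, renaming that file (and deleting any stale tika.pyc next to it) is worth trying before reinstalling the package.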