Extracting Text from PDF Using Apache Tika – Learn NLP

E

Most NLP applications need to look beyond text and HTML documents as information may be contained in PDF, ePub or other formats. Apache Tika toolkit extracts meta data and text from such document formats. It comes with a REST based Python library. In this example we’ll see extracting text from PDF using Apache Tika toolkit.

Tika Installation

pip install tika

Extracting Text

from tika import parser

#Replace document.pdf with filename
text = parser.from_file('document.pdf')

print (text ['content'])

Tika makes it very convenient to extract text not just from PDFs but more than ten formats. Here is a list of all supported document formats.

References:

  1. Apache Tika Home Page
  2. PyPi Tika 1.15 Package

See more posts like this

3 Comments

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Subscribe

Follow me on