Extracting Text from PDF Using Apache Tika - Learn NLP

Most NLP applications need to look beyond text and HTML documents as information may be contained in PDF, ePub or other formats. Apache Tika toolkit extracts meta data and text from such document formats. It comes with a REST based Python library. In this example we’ll see extracting text from PDF using Apache Tika toolkit.

Tika Installation
pip install tika

Extracting Text

1
2
3
4
5
6
from tika import parser

#Replace document.pdf with filename
text = parser.from_file('document.pdf')

print (text ['content'])

Tika makes it very convenient to extract text not just from PDFs but more than ten formats. Here is a list of all supported document formats.

References:

  1. Apache Tika Home Page
  2. PyPi Tika 1.15 Package