iDevji

Posted 2017-11-07 | Updated 2023-08-22 | NLP

Extracting Text from PDF Using Apache Tika - Learn NLP

Most NLP applications need to look beyond plain text and HTML documents, since information may be contained in PDF, ePub, or other formats. The Apache Tika toolkit extracts metadata and text from such document formats, and it comes with a REST-based Python library. In this example we’ll extract text from a PDF using Apache Tika.

Tika Installation
pip install tika

Extracting Text

from tika import parser

# Replace document.pdf with your filename.
# from_file returns a dict; the extracted text is under the 'content' key.
parsed = parser.from_file('document.pdf')

print(parsed['content'])
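
The extracted text often carries a lot of blank lines, and the 'content' value may be None when Tika finds no text (for example in a scanned, image-only PDF). Here is a minimal sketch of a cleanup step. The helper names clean_content and extract_text are my own for illustration, not part of the tika library, and the sketch assumes tika is installed along with a Java runtime.

```python
def clean_content(parsed):
    """Return the extracted text with blank lines removed, or '' if none."""
    # 'content' may be missing or None when no text was extracted.
    content = parsed.get('content') or ''
    lines = (line.strip() for line in content.splitlines())
    return '\n'.join(line for line in lines if line)


def extract_text(path):
    """Parse a file with Tika and return its cleaned text."""
    # Imported lazily so clean_content can be used without Tika installed.
    from tika import parser
    return clean_content(parser.from_file(path))
```

Calling extract_text('document.pdf') should then give text that is ready to pass to a tokenizer.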

Tika makes it very convenient to extract text not just from PDFs but from many other formats as well. The Apache Tika documentation has a list of all supported document formats.
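
Since the same call handles different formats, a mixed batch of files can go through one loop. The following is a rough sketch under a few assumptions: extract_many and its parse argument are hypothetical names of mine, the file names in the usage note are placeholders, and parse defaults to tika's parser.from_file so a stand-in can be swapped in for testing.

```python
from pathlib import Path


def extract_many(paths, parse=None):
    """Map each file name to its extracted text ('' when none is found)."""
    if parse is None:
        # Imported lazily; assumes `pip install tika` and a Java runtime.
        from tika import parser
        parse = parser.from_file
    results = {}
    for path in paths:
        parsed = parse(str(path))
        # 'content' may be None when no text could be extracted.
        results[Path(path).name] = parsed.get('content') or ''
    return results
```

For example, extract_many(['report.pdf', 'book.epub']) would return a dict keyed by file name, with each value holding that file's text.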

References:

Apache Tika Home Page
PyPI Tika 1.15 Package
