Most NLP applications need to look beyond plain text and HTML documents, since information may also be contained in PDF, ePub or other formats. The Apache Tika toolkit extracts metadata and text from such document formats, and it comes with a REST-based Python library. In this example we’ll see how to extract text from a PDF using Apache Tika.
pip install tika
from tika import parser

# Replace document.pdf with the name of your file
text = parser.from_file('document.pdf')
print(text['content'])
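The result of `parser.from_file` is a dictionary, and its `content` entry can be `None` when Tika finds no text (for example, a scanned PDF with no text layer). Raw extracted text also tends to contain many blank lines. Below is a minimal sketch of a wrapper that handles both cases. The helper names `clean_text` and `extract_text` are hypothetical and not part of the tika package.

```python
def clean_text(parsed):
    """Return the text from a Tika result dict, or '' if there is none.

    `parsed` is assumed to be the dictionary returned by parser.from_file.
    """
    # 'content' may be missing or None when no text could be extracted.
    content = parsed.get("content") or ""
    # Strip each line and drop the empty ones left over from page layout.
    lines = [line.strip() for line in content.splitlines()]
    return "\n".join(line for line in lines if line)


def extract_text(path):
    """Parse the file at `path` with Tika and return its cleaned text."""
    # Imported here so that clean_text can be used without Tika installed.
    from tika import parser

    parsed = parser.from_file(path)
    return clean_text(parsed)
```

Calling `extract_text('document.pdf')` would then return the cleaned text as a single string, or an empty string if nothing could be extracted.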
Tika makes it convenient to extract text not just from PDFs but from more than ten other formats as well. Here is a list of all supported document formats.