Extracting Text from PDF Using Apache Tika – Learn NLP

Most NLP applications need to look beyond plain text and HTML documents, since information may be contained in PDF, ePub, or other formats. The Apache Tika toolkit extracts metadata and text from such document formats, and it comes with a REST-based Python library. In this example we'll extract text from a PDF using the Apache Tika toolkit.

Tika Installation

    pip install tika

Extracting Text

    from tika import parser

    # Replace document.pdf with your filename
    text = parser.from_file('document.pdf')
    print(text['content'])

Tika makes it very convenient to extract text not just from PDFs but from more than ten formats. Here is a list of all supported document formats.
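The dictionary returned by parser.from_file also carries a 'metadata' entry, and 'content' can be None when a file has no extractable text (a scanned PDF, for example). Below is a minimal sketch of a wrapper that guards against that case and tidies the whitespace. The names summarize_parsed and extract_pdf are hypothetical helpers, not part of the tika package, and the sketch assumes a Java runtime is available, since the Python library talks to a local Tika server.

```python
def summarize_parsed(parsed):
    """Return (text, metadata) from a dict shaped like Tika's parser output.

    Hypothetical helper: it only assumes the 'content' and 'metadata' keys.
    """
    # 'content' may be None when no text could be extracted.
    content = parsed.get("content") or ""
    metadata = parsed.get("metadata") or {}

    # Strip each line and drop the empty ones that extraction often leaves behind.
    lines = [line.strip() for line in content.splitlines()]
    text = "\n".join(line for line in lines if line)
    return text, metadata


def extract_pdf(path):
    """Parse a file with Tika and return cleaned text plus metadata."""
    # Imported lazily so summarize_parsed can be used without tika installed.
    from tika import parser

    return summarize_parsed(parser.from_file(path))
```

Calling extract_pdf('document.pdf') returns the cleaned text together with the metadata dictionary, so that entries such as metadata.get('Content-Type') can be inspected before the text is passed on to an NLP pipeline.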

About the author

Devji Chhanga

I have taught computer science at the University of Kutch since 2011. Kutch is the westernmost district of India. At iDevji, I share tech stories that excite me. You will love reading the blog if you too believe in the disruptive power of technology. Some stories are purely technical, while others involve an empathetic approach to problem solving using technology.
