Hortonworks

Python MapReduce with Hadoop Streaming in Hortonworks Sandbox

The Hortonworks Sandbox for the Hortonworks Data Platform (HDP) is a quick and easy personal desktop environment for learning, developing, testing, and trying out new features. It saves the user from installing and configuring Hadoop and other tools. This article explains how to run a Python MapReduce word count example using Hadoop Streaming.
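As a rough illustration of the word count idea, here is a minimal sketch of a mapper and reducer in the style Hadoop Streaming expects: plain scripts that read lines from standard input and write tab-separated key/value pairs to standard output. The function names and the single-file layout with a `map`/`reduce` argument are assumptions made for this sketch, not necessarily how the article structures its scripts.

```python
import sys
from itertools import groupby


def map_words(lines):
    """Mapper: emit a (word, 1) pair for every whitespace-separated token."""
    for line in lines:
        for word in line.split():
            yield word, 1


def parse_pairs(lines):
    """Turn 'word<TAB>count' lines (the mapper's output) back into pairs."""
    for line in lines:
        word, count = line.rstrip("\n").split("\t", 1)
        yield word, int(count)


def reduce_counts(pairs):
    """Reducer: sum the counts of each word.

    Assumes the pairs arrive sorted by word, which Hadoop's shuffle phase
    provides between the map and reduce steps.
    """
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)


# Hypothetical convention for this sketch: the role is the first argument.
if __name__ == "__main__" and len(sys.argv) > 1:
    if sys.argv[1] == "map":
        results = map_words(sys.stdin)
    else:
        results = reduce_counts(parse_pairs(sys.stdin))
    for word, count in results:
        print(f"{word}\t{count}")
```

Saved as, say, `wordcount.py` (a hypothetical name), the same logic can be checked locally without a cluster: `cat input.txt | python wordcount.py map | sort | python wordcount.py reduce`. The `sort` step stands in for Hadoop's shuffle.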

Extracting Text from PDF Using Apache Tika

Most NLP applications need to look beyond plain text and HTML documents, as information is also contained in PDF, ePub, and other formats. Apache Tika is a toolkit that extracts metadata and text from documents. There is a REST-based Python library for Tika.
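A minimal sketch of what extraction with the `tika` Python package can look like, assuming the package is installed (`pip install tika`) and a Java runtime is available for the Tika server it talks to. The helper names and the whitespace cleanup step are illustrative additions, not part of the library.

```python
import re


def extract_pdf_text(path):
    """Return (metadata, text) for a document parsed through Tika."""
    # Imported lazily so the cleanup helper below works without tika installed.
    from tika import parser

    parsed = parser.from_file(path)
    # Either field can be missing or None, e.g. for image-only PDFs.
    return parsed.get("metadata") or {}, parsed.get("content") or ""


def normalize_whitespace(text):
    """Collapse repeated spaces and drop the blank lines extraction often leaves."""
    lines = (re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines())
    return "\n".join(line for line in lines if line)
```

A typical call would be `metadata, text = extract_pdf_text("paper.pdf")` followed by `normalize_whitespace(text)`, where `paper.pdf` is a placeholder file name.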

fastText

Tutorial: Text Classification With Python Using fastText

We start by training the classifier on labeled data: questions from cooking.stackexchange.com and their associated tags on the site. Let’s build a classifier that automatically recognizes the topic of a question and assigns a label to it.
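To make the training setup concrete, here is a small sketch assuming the `fasttext` Python package is installed. fastText's supervised mode reads one example per line, with each tag prefixed by `__label__`. The helper names, the sample question, and the file path are hypothetical.

```python
def to_fasttext_line(question, tags):
    """Format one question and its tags as a fastText training line."""
    labels = " ".join(f"__label__{tag}" for tag in tags)
    text = " ".join(question.split())  # keep each example on a single line
    return f"{labels} {text}"


def train_and_predict(train_path, question, k=1):
    """Train a supervised model on train_path and return the top-k labels."""
    # Imported lazily so the formatting helper works without fasttext installed.
    import fasttext

    model = fasttext.train_supervised(input=train_path)
    labels, probabilities = model.predict(question, k=k)
    return list(zip(labels, probabilities))
```

For example, `to_fasttext_line("How do I bake bread?", ["baking", "bread"])` produces `__label__baking __label__bread How do I bake bread?`, the shape of each line in the training file.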