Tutorial: Text Classification With Python Using fastText

fastext

Text classification is an important task with many applications including sentiment analysis and spam filtering. This article describes supervised text classification using fastText Python package. You may want to read Introduction to fastText first.

Note: Shell commands should not be confused with Python code.

Get the Training Data Set
We start by training the classifier with training data. It contains questions from cooking.stackexchange.com and their associated tags on the site. Let’s build a classifier that automatically recognize a topic of the question and assign a label to it. So, first we download the data.

wget https://s3-us-west-1.amazonaws.com/fasttext-vectors/cooking.stackexchange.tar.gz && tar xvzf cooking.stackexchange.tar.gz
 
head cooking.stackexchange.txt

As head command shows each line of the text file contains a list of labels followed by corresponding documents. fastText recognizes labels starting with __label__ but this file is alredy in shape. Next task is to train the classifier.

Training the Classifier
Let’s check the size of training data set:

wc cooking.stackexchange.txt
 
15404  169582 1401900 cooking.stackexchange.txt

It contains 15404 examples. Let’s split it into a training set of 12404 examples and a validation set of 3000 examples:

head -n 12404 cooking.stackexchange.txt > cooking.train
 
tail -n 3000 cooking.stackexchange.txt > cooking.valid

Now let’s train using cooking.train

classifier = fasttext.supervised('cooking.train', 'model_cooking')

Our First Prediction

label = classifier.predict('Which baking dish is best to bake a banana bread ?')
print label
 
label = classifier.predict('Why not put knives in the dishwasher? ')
print label

It may come up with something tag like baking and food-safety respectively! Second tag is not relevant which points out that our classifier is poor in quality. Let’s test it’s quality next.

Testing Precision and Recall
Precision and recall are used to measure quality of models in pattern recognition and information retrieval. See this Wikipedia article. Let’s test the model against cooking.valid data:

result = classifier.test ('cooking.valid')
print result.precision
print result.recall
print result.nexamples
There are a number of ways we can improve our classifier, See next post: Improving fastText Classifier

If you think this post was helpful, kindly share with others or say thank you in the comments below, it helps!

References:

  1. PyPi, fastext 0.7.0, Python.org
  2. fasText, Text classification with fastText
  3. Cooking StackExchange, cooking.stackexchange.com

Comments

avatar
  Subscribe  
Notify of