Tutorial: Text Classification With Python Using fastText

Text classification is an important task with many applications including sentiment analysis and spam filtering. This article describes supervised text classification using fastText Python package. You may want to read Introduction to fastText first. Note: Shell commands should not be confused with Python code.

Get the Training Data Set: We start by training the classifier with training data. It contains questions from cooking.stackexchange.com and their associated tags on the site. Let’s build a classifier that automatically recognize a topic of the question and assign a label to it. So, first we download the data.

1
2
3
wget https://s3-us-west-1.amazonaws.com/fasttext-vectors/cooking.stackexchange.tar.gz && tar xvzf cooking.stackexchange.tar.gz

head cooking.stackexchange.txt

As head command shows each line of the text file contains a list of labels followed by corresponding documents. fastText recognizes labels starting with __label__ but this file is alredy in shape. Next task is to train the classifier.

Training the Classifier: Let’s check the size of training data set:

wc cooking.stackexchange.txt

15404 169582 1401900 cooking.stackexchange.txt

It contains 15404 examples. Let’s split it into a training set of 12404 examples and a validation set of 3000 examples:

head -n 12404 cooking.stackexchange.txt > cooking.train

tail -n 3000 cooking.stackexchange.txt > cooking.valid

Now let’s train using cooking.train

classifier = fasttext.supervised('cooking.train', 'model\_cooking')

Our First Prediction:

1
2
3
4
5
label = classifier.predict('Which baking dish is best to bake a banana bread ?')
print label

label = classifier.predict('Why not put knives in the dishwasher? ')
print label

It may come up with something tag like baking and food-safety respectively! Second tag is not relevant which points out that our classifier is poor in quality. Let’s test it’s quality next.

**Testing Precision and Recall: ** Precision and recall are used to measure quality of models in pattern recognition and information retrieval. See this Wikipedia article. Let’s test the model against cooking.valid data:

1
2
3
print result.precision
print result.recall
print result.nexamples

There are a number of ways we can improve our classifier, See next post: Improving fastText Classifier

References:

  1. PyPi, fastext 0.7.0, Python.org
  2. fasText, Text classification with fastText
  3. Cooking StackExchange, cooking.stackexchange.com

If you think this post was helpful, kindly share with others or say thank you in the comments below, it helps!