Text classification is an important task with many applications, including sentiment analysis and spam filtering. This article describes supervised text classification using the fastText Python package. You may want to read Introduction to fastText first.
Note: Shell commands should not be confused with Python code.
Get the Training Data Set
We start with the training data. It contains questions from cooking.stackexchange.com and their associated tags on the site. Let’s build a classifier that automatically recognizes the topic of a question and assigns a label to it. So, first we download the data.
wget https://s3-us-west-1.amazonaws.com/fasttext-vectors/cooking.stackexchange.tar.gz && tar xvzf cooking.stackexchange.tar.gz
head cooking.stackexchange.txt
As the head command shows, each line of the text file contains a list of labels followed by the corresponding document. fastText recognizes labels by their __label__ prefix, and this file is already in that format. The next task is to train the classifier.
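To make the line format concrete, here is a minimal sketch that splits one line into its labels and its text. It assumes labels are whitespace-separated tokens starting with __label__; the helper name parse_line and the example line are made up for illustration and are not part of fastText or of the data set.

```python
LABEL_PREFIX = "__label__"  # prefix that marks a token as a label


def parse_line(line):
    """Split one training line into (labels, text).

    Hypothetical helper for illustration; not part of the fastText package.
    """
    labels, words = [], []
    for token in line.split():
        if token.startswith(LABEL_PREFIX):
            # Strip the prefix to get the bare tag name.
            labels.append(token[len(LABEL_PREFIX):])
        else:
            words.append(token)
    return labels, " ".join(words)


# Made-up example line in the same shape as the data file.
example = "__label__baking __label__bread How long should banana bread cool ?"
labels, text = parse_line(example)
print(labels)
print(text)
```

Running a helper like this over a few lines of cooking.stackexchange.txt is a quick way to confirm that every example carries at least one label before training.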
Training the Classifier
Let’s check the size of the training data set:
wc cooking.stackexchange.txt
15404 169582 1401900 cooking.stackexchange.txt
The first number is the line count, and each line is one example, so the file contains 15404 examples. Let’s split it into a training set of 12404 examples and a validation set of 3000 examples:
head -n 12404 cooking.stackexchange.txt > cooking.train
tail -n 3000 cooking.stackexchange.txt > cooking.valid
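For readers who prefer to stay in Python, the same head/tail split can be sketched with list slicing. This is only an illustration: the function name split_head_tail is hypothetical, and the ten toy lines stand in for the real file.

```python
def split_head_tail(lines, n_train, n_valid):
    """Mimic `head -n n_train` and `tail -n n_valid` on a list of lines."""
    train = lines[:n_train]
    # Guard against n_valid == 0, because lines[-0:] would return everything.
    valid = lines[-n_valid:] if n_valid > 0 else []
    return train, valid


# Toy stand-in for the data file: 10 made-up lines instead of 15404.
lines = ["__label__tag%d question %d" % (i % 3, i) for i in range(10)]
train, valid = split_head_tail(lines, 7, 3)
print(len(train))
print(len(valid))
```

The two parts do not overlap as long as n_train + n_valid is at most the number of lines. With 12404 and 3000 on a 15404-line file, they cover the file exactly once.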
Now let’s train using cooking.train:
import fasttext
classifier = fasttext.supervised('cooking.train', 'model_cooking')
Our First Prediction
label = classifier.predict('Which baking dish is best to bake a banana bread ?')
print(label)
label = classifier.predict('Why not put knives in the dishwasher? ')
print(label)
It may come up with tags like baking and food-safety, respectively. The second tag is not relevant, which suggests that our classifier is poor in quality. Let’s test its quality next.
Testing Precision and Recall
Precision and recall are used to measure the quality of models in pattern recognition and information retrieval. See this Wikipedia article. Let’s test the model against the cooking.valid data:
result = classifier.test('cooking.valid')
print(result.precision)
print(result.recall)
print(result.nexamples)
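To get a feel for what the two numbers mean, here is a small sketch that computes precision and recall at k by hand. It assumes the usual pooled definition: correct predictions divided by all predictions made (precision) and by all true labels (recall). The function name and the three example questions are invented for illustration, and this is not a claim about how fastText computes its figures internally.

```python
def precision_recall_at_k(predicted, gold, k=1):
    """Compute pooled precision and recall over all examples.

    predicted: list of ranked label lists, best guess first.
    gold: list of sets holding the true labels of each example.
    """
    n_correct = n_predicted = n_gold = 0
    for ranked, truth in zip(predicted, gold):
        top_k = ranked[:k]  # keep only the k best guesses
        n_correct += sum(1 for label in top_k if label in truth)
        n_predicted += len(top_k)
        n_gold += len(truth)
    return n_correct / float(n_predicted), n_correct / float(n_gold)


# Made-up predictions and true tags for three questions (illustration only).
predicted = [["baking"], ["food-safety"], ["bread"]]
gold = [{"baking", "bread"}, {"knives", "equipment"}, {"bread"}]
precision, recall = precision_recall_at_k(predicted, gold, k=1)
print(precision)
print(recall)
```

In this toy case two of the three predictions are correct, so precision is 2/3, while only two of the five true tags are recovered, so recall is 2/5. Questions with several tags pull recall down when the model predicts a single label.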
There are a number of ways we can improve our classifier. See the next post: Improving fastText Classifier