Text classification is an important task with many applications, including sentiment analysis and spam filtering. This article describes supervised text classification using the fastText Python package. You may want to read Introduction to fastText first. Note: shell commands should not be confused with Python code.
Get the Training Data Set: We start by getting the training data. It contains questions from cooking.stackexchange.com and their associated tags on the site. Let’s build a classifier that automatically recognizes the topic of a question and assigns a label to it. So, first we download the data:
wget https://s3-us-west-1.amazonaws.com/fasttext-vectors/cooking.stackexchange.tar.gz && tar xvzf cooking.stackexchange.tar.gz
As the head command shows, each line of the text file contains a list of labels followed by the corresponding document. fastText recognizes labels that start with the __label__ prefix, and this file is already in that format. The next task is to train the classifier.
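To make the line format concrete, here is a minimal Python sketch that splits one line into its labels and its text. The helper name parse_line and the sample line are made up for illustration (the sample is not copied from the data set); the only thing taken from the text above is the __label__ prefix.

```python
LABEL_PREFIX = "__label__"


def parse_line(line):
    """Split one training line into (labels, text).

    Every whitespace-separated token that starts with the prefix is
    treated as a label; all other tokens form the document text.
    """
    labels = []
    words = []
    for token in line.split():
        if token.startswith(LABEL_PREFIX):
            labels.append(token[len(LABEL_PREFIX):])
        else:
            words.append(token)
    return labels, " ".join(words)


# Hypothetical sample line in the assumed format
sample = "__label__baking __label__bread How long should banana bread cool ?"
labels, text = parse_line(sample)
print(labels)
print(text)
```

A check like this can be handy for spotting lines with no labels before training.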
Training the Classifier: Let’s check the size of the training data set with wc cooking.stackexchange.txt, which prints the number of lines, words and bytes:
15404 169582 1401900 cooking.stackexchange.txt
It contains 15404 examples. Let’s split it into a training set of 12404 examples and a validation set of 3000 examples:
head -n 12404 cooking.stackexchange.txt > cooking.train
tail -n 3000 cooking.stackexchange.txt > cooking.valid
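The same split can be sketched in Python, which also makes it easy to confirm that the two sets do not overlap: 12404 + 3000 equals the 15404 lines in the file, so the head and the tail never share a line. The function name split_head_tail and the toy data below are assumptions made for this example.

```python
def split_head_tail(lines, n_train, n_valid):
    """Mimic `head -n n_train` and `tail -n n_valid` on a list of lines."""
    if n_train + n_valid > len(lines):
        # head and tail would share lines, leaking validation data into training
        raise ValueError("training and validation sets would overlap")
    train = lines[:n_train]
    valid = lines[len(lines) - n_valid:]
    return train, valid


# Toy stand-in for the real file: 10 lines split 7 / 3
toy_lines = ["line %d" % i for i in range(10)]
train, valid = split_head_tail(toy_lines, 7, 3)
print(len(train), len(valid))
```

For the real data, the lines would come from reading cooking.stackexchange.txt, with n_train=12404 and n_valid=3000.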
Now let’s train the classifier on cooking.train:
import fasttext
classifier = fasttext.supervised('cooking.train', 'model_cooking')
Our First Prediction:
labels = classifier.predict(['Which baking dish is best to bake a banana bread ?'])  # predict expects a list of texts
It may come up with tags like baking and food-safety. The second tag is not relevant, which indicates that our classifier is of poor quality. Let’s test its quality next.
Testing Precision and Recall: Precision and recall are used to measure the quality of models in pattern recognition and information retrieval. See this Wikipedia article. Let’s test the model against the cooking.valid data.
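To make the two metrics concrete, here is a minimal sketch of how precision and recall can be computed from predicted and true labels. It is plain Python for illustration, not the fastText implementation; the function name precision_recall and the sample labels are made up for this example.

```python
def precision_recall(true_labels, predicted_labels):
    """Compute precision and recall over a set of examples.

    Both arguments are lists of label lists, one entry per example.
    precision = correct predictions / all predictions made
    recall    = correct predictions / all true labels
    """
    n_correct = 0
    n_predicted = 0
    n_true = 0
    for truth, predicted in zip(true_labels, predicted_labels):
        truth_set = set(truth)
        n_correct += sum(1 for label in predicted if label in truth_set)
        n_predicted += len(predicted)
        n_true += len(truth_set)
    precision = float(n_correct) / n_predicted if n_predicted else 0.0
    recall = float(n_correct) / n_true if n_true else 0.0
    return precision, recall


# Hypothetical labels: two questions, one predicted label each
truth = [["baking", "bread"], ["equipment"]]
predicted = [["baking"], ["food-safety"]]
precision, recall = precision_recall(truth, predicted)
print(precision, recall)
```

In this toy case one of the two predictions is correct, so precision is 1/2, while only one of the three true labels was found, so recall is 1/3.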
There are a number of ways to improve our classifier. See the next post: Improving fastText Classifier
References:
- PyPI, fasttext 0.7.0, Python.org
- fastText, Text classification with fastText
- Cooking StackExchange, cooking.stackexchange.com