Improving fastText Classifier


This post continues the previous post, Text Classification With Python Using fastText, and describes several techniques for improving the fastText classifier.

More on Precision and Recall

Precision: the number of correct labels out of the total number of labels predicted by the classifier.

Recall: the number of real labels that the classifier predicted, out of all real labels.

Example:

Why not put knives in the dishwasher?

This question has three labels on StackExchange: equipment, cleaning and knives.

Let us obtain the top five labels predicted by our model (k = number of top labels to predict):

text = ['Why not put knives in the dishwasher?']
# pass the list itself, not the string 'text'
labels = classifier.predict(text, k=5)
print(labels)

This gives food-safety, baking, equipment, substitutions and bread.

One out of five labels predicted by the model is correct, giving a precision of 0.20. Out of the three real labels, only one is predicted by the model, giving a recall of 0.33.
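
The arithmetic above can be reproduced in a few lines of Python. This is a minimal sketch and not part of the fasttext package: the function name precision_recall is made up for illustration, and the label lists are copied from the example.

```python
def precision_recall(predicted, actual):
    """Return (precision, recall) for one example, given predicted and real labels."""
    correct = len(set(predicted) & set(actual))
    precision = correct / float(len(predicted))
    recall = correct / float(len(actual))
    return precision, recall

# Labels predicted by the model and the real labels from StackExchange.
predicted = ['food-safety', 'baking', 'equipment', 'substitutions', 'bread']
actual = ['equipment', 'cleaning', 'knives']

precision, recall = precision_recall(predicted, actual)
print(round(precision, 2), round(recall, 2))
```

Only equipment appears in both lists, so the values match the 0.20 and 0.33 worked out by hand.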

Improving the Model

We ran the model with default parameters and the training data as is. Now let’s tweak it a little. We will employ the following techniques to improve the classifier:

  • Preprocessing the data
  • Changing the number of epochs (using the option epoch, standard range [5 – 50])
  • Changing the learning rate (using the option lr, standard range [0.1 – 1.0])
  • Using word n-grams (using the option wordNgrams, standard range [1 – 5])

We will apply these techniques and watch how precision and recall change at each stage.
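
One way to organise such experiments is to list candidate values for each option and try the combinations one by one. The sketch below only builds the list of settings. The candidate values are arbitrary picks from the ranges above, and the dictionary keys simply reuse the option names from the list (the Python wrapper may spell them differently).

```python
from itertools import product

# Candidate values picked from the ranges listed above (illustrative, not tuned).
epochs = [5, 25, 50]
learning_rates = [0.1, 0.5, 1.0]
word_ngrams = [1, 2]

# Every combination of the three options, as a list of dicts.
settings = [
    {'epoch': e, 'lr': lr, 'wordNgrams': n}
    for e, lr, n in product(epochs, learning_rates, word_ngrams)
]

print(len(settings))
print(settings[0])
```

Each entry can then drive one training run, with precision and recall recorded next to it.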

Preprocessing The Data
Preprocessing here means separating punctuation from words with spaces and converting the entire text to lower case.

cat cooking.stackexchange.txt | sed -e "s/\([.\!?,'/()]\)/ \1 /g" | tr "[:upper:]" "[:lower:]" > cooking.preprocessed.txt
head -n 12404 cooking.preprocessed.txt > cooking.train
tail -n 3000 cooking.preprocessed.txt > cooking.valid

import fasttext

# train on the preprocessed data with default parameters
classifier = fasttext.supervised('cooking.train', 'model_cooking')
result = classifier.test('cooking.valid')
print(result.precision)
    0.161
print(result.recall)
    0.0696266397578

So after preprocessing, precision and recall have improved over the run on the unprocessed data.
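
For readers without sed and tr, the same normalisation can be sketched in Python. This is an assumed, approximate equivalent of the shell pipeline above, not code from the fastText tutorial, and the function name normalize is hypothetical.

```python
import re

# Same character class as the sed expression: . ! ? , ' / ( )
PUNCTUATION = re.compile(r"([.!?,'/()])")

def normalize(line):
    """Put spaces around punctuation, then lower-case the line."""
    return PUNCTUATION.sub(r" \1 ", line).lower()

print(normalize("__label__knives Why not put knives in the dishwasher?"))
```

Applying normalize to every line of cooking.stackexchange.txt and writing the result out should give a file comparable to cooking.preprocessed.txt. The label prefix is untouched because underscores are not in the character class.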

More Epoch and Increased Learning Rate
The number of epochs can be set with the epoch parameter. The default value is 5; we are going to set it to 25. More epochs increase the training time, but the gain is worth it.

# train for 25 epochs instead of the default 5
classifier = fasttext.supervised('cooking.train', 'model_cooking', epoch=25)
result = classifier.test('cooking.valid')
print(result.precision)
    0.493
print(result.recall)
    0.213204555283

Now let’s change the learning rate with the lr parameter:

# default number of epochs, learning rate raised to 1.0
classifier = fasttext.supervised('cooking.train', 'model_cooking', lr=1.0)
result = classifier.test('cooking.valid')
print(result.precision)
    0.546
print(result.recall)
    0.236125126135

Results with both epoch and lr together:

# combine both settings
classifier = fasttext.supervised('cooking.train', 'model_cooking', epoch=25, lr=1.0)
result = classifier.test('cooking.valid')
print(result.precision)
    0.565
print(result.recall)
    0.244630243621
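
To compare the runs so far, the reported numbers can be collected in one place. The sketch below only rearranges the figures printed above (rounded to three decimals) and computes how much each run improves precision over the preprocessed baseline. The run names are labels chosen for illustration.

```python
# (run name, precision, recall) as reported above, rounded to three decimals.
runs = [
    ('preprocessed, defaults', 0.161, 0.070),
    ('epoch=25', 0.493, 0.213),
    ('lr=1.0', 0.546, 0.236),
    ('epoch=25, lr=1.0', 0.565, 0.245),
]

baseline_precision = runs[0][1]

# Print one line per run with its gain over the baseline precision.
for name, precision, recall in runs:
    gain = precision / baseline_precision
    print('%-24s precision=%.3f recall=%.3f gain=x%.2f' % (name, precision, recall, gain))

# The run with the highest precision.
best = max(runs, key=lambda run: run[1])
print(best[0])
```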

Using Word n-grams
Word n-grams are sequences of n consecutive tokens, so they capture some of the word order in the text. See examples of word n-grams on Wikipedia.
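
As a small illustration of the concept, the sketch below lists the word bigrams of the preprocessed example question. The helper word_ngrams is hypothetical and says nothing about how fastText stores or hashes n-grams internally.

```python
def word_ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# The example question after preprocessing, split on whitespace.
tokens = 'why not put knives in the dishwasher ?'.split()
bigrams = word_ngrams(tokens, 2)
print(bigrams[:3])
```

With n = 1 the helper returns the individual words, which corresponds to a model that ignores word order.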

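# word_ngrams is the Python keyword for the wordNgrams option; the value 2 (bigrams)
# is an assumption, because the original call left this argument out
classifier = fasttext.supervised('cooking.train', 'model_cooking', epoch=25, lr=1.0, word_ngrams=2)
result = classifier.test('cooking.valid')
print(result.precision)
    ???#
print(result.recall)
    ???#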

# I am unable to show results for word n-grams because Python on my system keeps crashing. I will update the post asap.

References:

  1. Introduction to fastText
  2. Text Classification With Python Using fastText
  3. PyPI, fasttext 0.7.0, Python.org
  4. fastText, Text classification with fastText
  5. Cooking StackExchange, cooking.stackexchange.com
