NLTK Example: Detecting the Geographic Setting of Sherlock Holmes Stories

As a young adult, nothing thrilled me more than Jeremy Brett’s performance as Sherlock Holmes. “You know my methods, apply them!” he would say. So let’s try to play Sherlock ourselves. We will use the Natural Language Toolkit (NLTK) to guess the setting of a Sherlock Holmes story in terms of its geographic location. Our approach in this NLTK example is very naive: identify the most frequent place mentioned in the story. We use Named Entity Recognition (NER) to identify geopolitical entities (GPEs) and then pick the most frequent one. This approach is naive because there is no preprocessing of the text, and GPEs may include concepts other than geographic locations, such as nationalities. But we want to keep this simple and fun. So here we go.

Code:

# NLTK example
# This code reads one text file at a time

from collections import Counter
from nltk import word_tokenize, pos_tag, ne_chunk

# read a text file
text = open('filepath/file.txt')

# replace newlines with spaces
data = text.read().replace('\n', ' ')

# tokenize, POS-tag, and chunk named entities in one pass
chunked = ne_chunk(pos_tag(word_tokenize(data)))

# extract GPEs (geopolitical entities)
extracted = []
for chunk in chunked:
    if hasattr(chunk, 'label'):
        if chunk.label() == 'GPE':
            # join the tokens of multi-word place names with spaces
            extracted.append(' '.join(c[0] for c in chunk))

# find the most frequent GPE
count = Counter(extracted)
print(count.most_common(1))
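
If you are running NLTK for the first time, the tokenizer, tagger, and chunker models have to be downloaded once. A one-time setup sketch (resource names as in the NLTK documentation):

import nltk

nltk.download('punkt')                       # tokenizer model
nltk.download('averaged_perceptron_tagger')  # POS tagger
nltk.download('maxent_ne_chunker')           # named-entity chunker
nltk.download('words')                       # word corpus used by the chunker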

Results:

| Sr. | Story                                 | Extracted Location | Actual Setting | Result  |
|-----|---------------------------------------|--------------------|----------------|---------|
| 1   | The Adventure of the Dancing Men      | [('Norfolk', 14)]  | Norfolk        | Success |
| 2   | The Adventure of the Solitary Cyclist | [('Farnham', 6)]   | Farnham        | Success |
| 3   | A Scandal in Bohemia                  | [('Bohemia', 6)]   | Bohemia        | Success |
| 4   | The Red-Headed League                 | [('London', 7)]    | London         | Success |
| 5   | The Final Problem                     | [('London', 8)]    | London         | Success |
| 6   | The Greek Interpreter                 | [('Greek', 15)]    | Greece         | Fail    |

We got 5 out of 6 predictions correct! These results are not discouraging, and we may think of using this code in a more serious application.

References:

  1. Sherlock Holmes Stories in Plain Text
  2. NLTK Documentation

Improving fastText Classifier

This post is a continuation of the previous post, Text Classification With Python Using fastText. It describes how to improve a fastText classifier using various techniques.

More on Precision and Recall

Precision: the number of correct labels among the labels predicted by the classifier. Recall: the number of real labels that the classifier successfully predicts. Example:

Why not put knives in the dishwasher?

This question has three labels on StackExchange: equipment, cleaning, and knives. Let us obtain the top five labels predicted by our model (k = the number of top labels to predict):

text = ['Why not put knives in the dishwasher?']
labels = classifier.predict(text, k=5)
print(labels)

This gives us food-safety, baking, equipment, substitutions and bread. One out of five labels predicted by the model is correct, giving a precision of 0.20. Out of the three real labels, only one is predicted by the model, giving a recall of 0.33.
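
To make the arithmetic concrete, here is a small sketch that checks these numbers. The helper function is ours for illustration, not part of fastText; the label sets are taken from the example above:

def precision_recall(predicted, actual):
    # precision = correct labels / labels predicted
    # recall = correct labels / real labels
    correct = len(set(predicted) & set(actual))
    return correct / float(len(predicted)), correct / float(len(actual))

predicted = ['food-safety', 'baking', 'equipment', 'substitutions', 'bread']
actual = ['equipment', 'cleaning', 'knives']
print(precision_recall(predicted, actual))
# (0.2, 0.3333333333333333)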

Improving the Model

We ran the model with default parameters and the training data as it is. Now let’s tweak a little bit. We will employ the following techniques to improve it:

  • Preprocessing the data
  • Changing the number of epochs (using the option epoch, standard range 5 - 50)
  • Changing the learning rate (using the option lr, standard range 0.1 - 1.0)
  • Using word n-grams (using the option wordNgrams, standard range 1 - 5)

We will apply these techniques one by one and see the improvement in precision and recall at each stage.

Preprocessing the Data

Preprocessing includes removing special characters and converting the entire text to lower case.

cat cooking.stackexchange.txt | sed -e "s/\([.\!?,'/()]\)/ \1 /g" | tr "[:upper:]" "[:lower:]" > cooking.preprocessed.txt
head -n 12404 cooking.preprocessed.txt > cooking.train
tail -n 3000 cooking.preprocessed.txt > cooking.valid

classifier = fasttext.supervised('cooking.train', 'model_cooking')
result = classifier.test('cooking.valid')
print(result.precision)
# 0.161
print(result.recall)
# 0.0696266397578
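
If you would rather stay in Python, a rough equivalent of the shell pipeline above could look like this (a sketch; the regular expression mirrors the sed expression):

import re

# pad punctuation with spaces and lowercase everything,
# just like the sed/tr pipeline above
def preprocess(line):
    line = re.sub(r"([.!?,'/()])", r" \1 ", line)
    return line.lower()

with open('cooking.stackexchange.txt') as src, open('cooking.preprocessed.txt', 'w') as dst:
    for line in src:
        dst.write(preprocess(line))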

So after preprocessing, both precision and recall have improved.

More Epochs and an Increased Learning Rate

The number of epochs can be set with the epoch parameter; the default value is 5, and we are going to set it to 25. More epochs mean longer training time, but it is worth it.

classifier = fasttext.supervised('cooking.train', 'model_cooking', epoch=25)
result = classifier.test('cooking.valid')
print(result.precision)
# 0.493
print(result.recall)
# 0.213204555283

Now let’s change the learning rate with the lr parameter:

classifier = fasttext.supervised('cooking.train', 'model_cooking', lr=1.0)
result = classifier.test('cooking.valid')
print(result.precision)
# 0.546
print(result.recall)
# 0.236125126135

Results with both epoch and lr together:

classifier = fasttext.supervised('cooking.train', 'model_cooking', epoch=25, lr=1.0)
result = classifier.test('cooking.valid')
print(result.precision)
# 0.565
print(result.recall)
# 0.244630243621

Using Word n-grams

Word n-grams deal with the sequencing of tokens in the text. See examples of word n-grams on Wikipedia.
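
To build intuition, here is a quick sketch of what the word bigrams (n = 2) of a sentence look like:

def word_ngrams(tokens, n):
    # every run of n consecutive tokens
    return [tokens[i:i + n] for i in range(len(tokens) - n + 1)]

tokens = 'Why not put knives in the dishwasher ?'.split()
print(word_ngrams(tokens, 2))
# [['Why', 'not'], ['not', 'put'], ['put', 'knives'], ...]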

# the command-line option wordNgrams corresponds to the
# word_ngrams parameter of this Python package
classifier = fasttext.supervised('cooking.train', 'model_cooking', epoch=25, lr=1.0, word_ngrams=2)
result = classifier.test('cooking.valid')
print(result.precision)
#???
print(result.recall)
#???

I am unable to show results for word n-grams because Python keeps crashing on my system. I will update the post as soon as possible.
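
If you run into the same crash, one workaround that may be worth trying (an assumption on my part based on reports against the fasttext 0.7.0 package, not something verified here) is to size the hashing table for n-gram features explicitly via the bucket parameter:

# untested workaround: give the n-gram feature table an explicit size;
# 2000000 is the fastText command-line default for -bucket
classifier = fasttext.supervised('cooking.train', 'model_cooking',
                                 epoch=25, lr=1.0, word_ngrams=2, bucket=2000000)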

References:

  1. Introduction to fastText
  2. Text Classification With Python Using fastText
  3. PyPI, fasttext 0.7.0, python.org
  4. fastText, Text classification with fastText
  5. Cooking StackExchange, cooking.stackexchange.com

Tutorial: Text Classification With Python Using fastText

Text classification is an important task with many applications, including sentiment analysis and spam filtering. This article describes supervised text classification using the fastText Python package. You may want to read Introduction to fastText first. Note: shell commands should not be confused with Python code.

Get the Training Data Set: We start by training the classifier with training data. It contains questions from cooking.stackexchange.com and their associated tags on the site. Let’s build a classifier that automatically recognizes the topic of a question and assigns a label to it. So, first we download the data.

wget https://s3-us-west-1.amazonaws.com/fasttext-vectors/cooking.stackexchange.tar.gz && tar xvzf cooking.stackexchange.tar.gz

head cooking.stackexchange.txt

As the head command shows, each line of the text file contains a list of labels followed by the corresponding document. fastText recognizes labels starting with __label__, and this file is already in shape. The next task is to train the classifier.
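
For instance, the first line of the file looks something like this (two labels followed by the question text):

__label__sauce __label__cheese How much does potato starch affect a cheese sauce recipe?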

Training the Classifier: Let’s check the size of the training data set:

wc cooking.stackexchange.txt

15404 169582 1401900 cooking.stackexchange.txt

It contains 15404 examples. Let’s split it into a training set of 12404 examples and a validation set of 3000 examples:

head -n 12404 cooking.stackexchange.txt > cooking.train

tail -n 3000 cooking.stackexchange.txt > cooking.valid

Now let’s train using cooking.train (remember to import the package first):

import fasttext

classifier = fasttext.supervised('cooking.train', 'model_cooking')
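
Training writes the model to disk under the output prefix we passed in. With the fasttext 0.7.0 package, the classifier can presumably be reloaded later like this (a sketch; the .bin file name is assumed from the output prefix):

# assumed: supervised() saved the model as model_cooking.bin
classifier = fasttext.load_model('model_cooking.bin', label_prefix='__label__')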

Our First Prediction:

# predict expects a list of texts
label = classifier.predict(['Which baking dish is best to bake a banana bread ?'])
print(label)

label = classifier.predict(['Why not put knives in the dishwasher?'])
print(label)

It may come up with tags like baking and food-safety respectively! The second tag is not relevant, which suggests that our classifier is poor in quality. Let’s test its quality next.

Testing Precision and Recall: Precision and recall are used to measure the quality of models in pattern recognition and information retrieval; see this Wikipedia article. Let’s test the model against the cooking.valid data:

result = classifier.test('cooking.valid')
print(result.precision)
print(result.recall)
print(result.nexamples)

There are a number of ways we can improve our classifier. See the next post: Improving fastText Classifier.

References:

  1. PyPI, fasttext 0.7.0, python.org
  2. fastText, Text classification with fastText
  3. Cooking StackExchange, cooking.stackexchange.com

If you think this post was helpful, kindly share with others or say thank you in the comments below, it helps!