NLTK Example: Detecting the Geographic Setting of Sherlock Holmes Stories


As a young adult, nothing thrilled me more than Jeremy Brett’s performance as Sherlock Holmes. “You know my methods, apply them!” he would say. So let’s try to play Sherlock ourselves. We use the Natural Language Toolkit (NLTK) to guess the setting of a Sherlock Holmes story in terms of its geographic location. In this NLTK example, our approach is very naive: identify the most frequently mentioned place in the story.

We use Named Entity Recognition (NER) to identify geopolitical entities (GPEs) and pick the most frequent one. This approach is very naive because the text is not pre-processed, and GPEs may include concepts other than geographic locations, such as nationalities. But we want to keep this really simple and fun. So here we go:

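Code:

# NLTK example
# This code reads one text file at a time.
# It needs NLTK's tokenizer, POS tagger, and named-entity chunker data,
# which can be fetched with nltk.download().

from collections import Counter

from nltk import word_tokenize, pos_tag, ne_chunk

# read a text file
with open('filepath/file.txt') as f:
    text = f.read()

# replace \n with a space
data = text.replace('\n', ' ')

# tokenize, tag parts of speech, then chunk named entities
chunked = ne_chunk(pos_tag(word_tokenize(data)))

# extract GPEs
extracted = []
for chunk in chunked:
    # named-entity chunks are subtrees with a label; plain tokens are tuples
    if hasattr(chunk, 'label') and chunk.label() == 'GPE':
        # join multi-word names such as "New York" with a space
        extracted.append(' '.join(c[0] for c in chunk))

# extract most frequent GPE
count = Counter(extracted)
print(count.most_common(1))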

Sr.  Story                                  Extracted Location  Actual Setting  Result
1.   The Adventure of the Dancing Men       [('Norfolk', 14)]   Norfolk         Success
2.   The Adventure of the Solitary Cyclist  [('Farnham', 6)]    Farnham         Success
3.   A Scandal in Bohemia                   [('Bohemia', 6)]    Bohemia         Success
4.   The Red-Headed League                  [('London', 7)]     London          Success
5.   The Final Problem                      [('London', 8)]     London          Success
6.   The Greek Interpreter                  [('Greek', 15)]     Greece          Fail
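
The Result column can be recomputed by comparing each extracted name with the actual setting. The sketch below types the six rows in by hand and assumes that a result counts as a success only on an exact string match; the `score` helper is a hypothetical name, not part of NLTK.

```python
# Rows copied by hand from the results table: (story, extracted, actual).
rows = [
    ("The Adventure of the Dancing Men", "Norfolk", "Norfolk"),
    ("The Adventure of the Solitary Cyclist", "Farnham", "Farnham"),
    ("A Scandal in Bohemia", "Bohemia", "Bohemia"),
    ("The Red-Headed League", "London", "London"),
    ("The Final Problem", "London", "London"),
    ("The Greek Interpreter", "Greek", "Greece"),
]


def score(rows):
    """Count exact matches between extracted and actual names (assumed rule)."""
    hits = sum(1 for _, extracted, actual in rows if extracted == actual)
    return hits, len(rows)


hits, total = score(rows)
print(f"{hits}/{total} correct")  # 5/6 correct
```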

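We got 5 of 6 predictions correct! The one miss, 'Greek', is a nationality tagged as a GPE, which is the weakness noted above. These results are encouraging for such a simple method, and we may think of using this code in a more serious application.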
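
One possible way to reduce nationality mix-ups is to drop known nationality words before counting. The following is only a sketch under stated assumptions: `NATIONALITY_WORDS` is a small hand-written stoplist, `most_frequent_place` is a hypothetical helper, and the sample list is made up for illustration rather than taken from any story.

```python
from collections import Counter

# Hypothetical hand-written stoplist of nationality words; not part of NLTK.
NATIONALITY_WORDS = {"Greek", "English", "French", "German", "American"}


def most_frequent_place(names, stoplist=NATIONALITY_WORDS):
    """Return [(name, count)] for the most frequent name not in the stoplist."""
    kept = [name for name in names if name not in stoplist]
    return Counter(kept).most_common(1)


# Made-up input for illustration only.
sample = ["Greek", "London", "Greek", "Athens", "Greek", "London"]
print(most_frequent_place(sample))  # [('London', 2)]
```

Whether the top name that remains matches the real setting would still need to be checked against each story.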

  1. Sherlock Holmes Stories in Plain Text
  2. NLTK Documentation
