NLTK Example: Detecting the Geographic Setting of Sherlock Holmes Stories

As a young adult, nothing thrilled me more than Jeremy Brett’s performance as Sherlock Holmes. “You know my methods, apply them!” he would say. So let’s try to play Sherlock ourselves. We use the Natural Language Toolkit (NLTK) to guess the setting of a Sherlock Holmes story in terms of its geographic location. In this NLTK example, our approach is very naive: identify the place mentioned most frequently in the story. We use Named Entity Recognition (NER) to identify geopolitical entities (GPEs) and pick the most frequent one. The approach is naive because the text gets no pre-processing and GPEs may include concepts other than geographic locations, such as nationalities. But we want to keep this really simple and fun. So here we go.

Code:

# NLTK example
# This code reads one text file at a time

from collections import Counter

from nltk import word_tokenize, pos_tag, ne_chunk

# Read a text file and replace each \n with a space
with open('filepath/file.txt') as text:
    data = text.read().replace('\n', ' ')

# Tokenize, tag parts of speech, then chunk named entities
chunked = ne_chunk(pos_tag(word_tokenize(data)))

# Extract GPEs
extracted = []
for chunk in chunked:
    # Named entities are subtrees with a label; ordinary tokens are plain (word, tag) tuples
    if hasattr(chunk, 'label') and chunk.label() == 'GPE':
        # Join multi-word names with a space so their words do not run together
        extracted.append(' '.join(c[0] for c in chunk))

# Extract the most frequent GPE
count = Counter(extracted)
print(count.most_common(1))
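The last step keeps only the single winner, which hides how close the runners-up were. As a minimal sketch (the helper name `top_places` and the sample list are made up for illustration, not part of the original code), the same `Counter` can report the top few places together with their share of all GPE mentions:

```python
from collections import Counter


def top_places(extracted, n=3):
    """Return the n most frequent places as (place, count, share of all mentions)."""
    counts = Counter(extracted)
    total = sum(counts.values())
    return [
        (place, hits, round(hits / total, 2))
        for place, hits in counts.most_common(n)
    ]


# Hypothetical GPE list, not taken from any of the stories
sample = ['London', 'Norfolk', 'Norfolk', 'America', 'Norfolk', 'London']
print(top_places(sample))
# [('Norfolk', 3, 0.5), ('London', 2, 0.33), ('America', 1, 0.17)]
```

A winner with a small share, or one barely ahead of second place, is a hint that the guess deserves less trust.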

Results:

| Sr. | Story | Extracted Location | Actual Setting | Result |
|-----|-------|--------------------|----------------|--------|
| 1 | The Adventure of the Dancing Men | [('Norfolk', 14)] | Norfolk | Success |
| 2 | The Adventure of the Solitary Cyclist | [('Farnham', 6)] | Farnham | Success |
| 3 | A Scandal in Bohemia | [('Bohemia', 6)] | Bohemia | Success |
| 4 | The Red-Headed League | [('London', 7)] | London | Success |
| 5 | The Final Problem | [('London', 8)] | London | Success |
| 6 | The Greek Interpreter | [('Greek', 15)] | Greece | Fail |
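The miss on The Greek Interpreter comes from a nationality word being tagged as a GPE. One possible way to soften this, sketched here under assumptions, is to drop such words before counting. The stoplist `NATIONALITY_WORDS`, the helper `most_frequent_place`, and the sample list are all hypothetical; the original code does no such filtering, and whether it would rescue the failed story depends on the remaining counts, which are not shown here.

```python
from collections import Counter

# Hypothetical, deliberately tiny stoplist; a real one would need many more entries
NATIONALITY_WORDS = {'Greek', 'English', 'French', 'German', 'American'}


def most_frequent_place(extracted, stoplist=NATIONALITY_WORDS):
    """Return (place, count) for the most frequent GPE not in the stoplist, or None."""
    counts = Counter(name for name in extracted if name not in stoplist)
    top = counts.most_common(1)
    return top[0] if top else None


# Invented GPE list for illustration only, not extracted from the story
sample = ['Greek', 'London', 'Greek', 'Athens', 'London', 'Greek']
print(most_frequent_place(sample))
# ('London', 2)
```

Returning `None` for an empty result avoids indexing into the empty list that `most_common(1)` gives when nothing is left after filtering.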

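We got 5 out of 6 predictions correct! The one miss, The Greek Interpreter, is the nationality problem mentioned earlier: “Greek” was tagged as a GPE and outnumbered every actual place name. These are not discouraging results, and we may think of using this code somewhere in a more serious application.

References: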

  1. Sherlock Holmes Stories in Plain Text
  2. NLTK Documentation