NLTK Example: Detecting the Geographic Setting of Sherlock Holmes Stories
As a young adult, nothing thrilled me more than Jeremy Brett’s performance as Sherlock Holmes. “You know my methods, apply them!” he would say. So let’s try to play Sherlock ourselves. We use the Natural Language Toolkit (NLTK) to guess the geographic setting of a Sherlock Holmes story. In this NLTK example, our approach is very naive: identify the place mentioned most frequently in the story. We use Named Entity Recognition (NER) to identify geopolitical entities (GPEs) and pick the most frequent one. The approach is naive because the text is not pre-processed and because GPEs can include concepts other than geographic locations, such as nationalities. But we want to keep this really simple and fun. So here we go:

Code:
# NLTK example
# This code reads one text file at a time.
# Assumes the NLTK tokenizer, tagger, and NE chunker data are already installed.
from collections import Counter

from nltk import word_tokenize, pos_tag, ne_chunk

# Read a text file
with open('filepath/file.txt') as text:
    # Replace \n with a space
    data = text.read().replace('\n', ' ')

# Tokenize, tag parts of speech, then chunk named entities
chunked = ne_chunk(pos_tag(word_tokenize(data)))

# Extract GPEs
extracted = []
for chunk in chunked:
    if hasattr(chunk, 'label'):
        if chunk.label() == 'GPE':
            # Join multi-word names with a space
            extracted.append(' '.join(c[0] for c in chunk))

# Extract the most frequent GPE
count = Counter(extracted)
print(count.most_common(1))
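The extraction and counting steps can be tried without running the tagger at all. The following is a minimal sketch, not the code used for the results: `FakeChunk` is a hypothetical stand-in that only mimics the two things the loop needs from an NLTK chunk (a `label()` method and iteration over `(word, tag)` pairs), `most_frequent_gpe` is an illustrative helper name, and the sample sentence is made up. It also assumes multi-word names should be joined with a space.

```python
from collections import Counter


class FakeChunk:
    """Hypothetical stand-in for an NLTK chunk: has label() and yields (word, tag) pairs."""

    def __init__(self, label, leaves):
        self._label = label
        self._leaves = leaves

    def label(self):
        return self._label

    def __iter__(self):
        return iter(self._leaves)


def most_frequent_gpe(chunked):
    """Return (name, count) for the most frequent GPE chunk, or None if there is none."""
    names = [
        " ".join(word for word, _tag in chunk)
        for chunk in chunked
        if hasattr(chunk, "label") and chunk.label() == "GPE"
    ]
    top = Counter(names).most_common(1)
    return top[0] if top else None


# Hand-built input: plain (word, tag) tuples are ordinary tokens,
# FakeChunk objects play the role of recognised entities.
sample = [
    FakeChunk("PERSON", [("Holmes", "NNP")]),
    ("went", "VBD"),
    ("to", "TO"),
    FakeChunk("GPE", [("Norfolk", "NNP")]),
    ("from", "IN"),
    FakeChunk("GPE", [("London", "NNP")]),
    ("and", "CC"),
    ("back", "RB"),
    ("to", "TO"),
    FakeChunk("GPE", [("Norfolk", "NNP")]),
]

print(most_frequent_gpe(sample))  # ('Norfolk', 2)
```

Plain tuples have no `label` attribute, so the `hasattr` check skips ordinary tokens, and the `PERSON` chunk is dropped by the label comparison.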
Results:
| Sr. | Story | Extracted Location | Actual Setting | Result |
| --- | --- | --- | --- | --- |
| 1 | The Adventure of the Dancing Men | [('Norfolk', 14)] | Norfolk | Success |
| 2 | The Adventure of the Solitary Cyclist | [('Farnham', 6)] | Farnham | Success |
| 3 | A Scandal in Bohemia | [('Bohemia', 6)] | Bohemia | Success |
| 4 | The Red-Headed League | [('London', 7)] | London | Success |
| 5 | The Final Problem | [('London', 8)] | London | Success |
| 6 | The Greek Interpreter | [('Greek', 15)] | Greece | Fail |
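The Success/Fail column is a plain string comparison between the extracted location and the actual setting. As a small sketch, the tally can be reproduced from the table values; the variable names here are illustrative.

```python
# (story, extracted location, actual setting), copied from the results table.
results = [
    ("The Adventure of the Dancing Men", "Norfolk", "Norfolk"),
    ("The Adventure of the Solitary Cyclist", "Farnham", "Farnham"),
    ("A Scandal in Bohemia", "Bohemia", "Bohemia"),
    ("The Red-Headed League", "London", "London"),
    ("The Final Problem", "London", "London"),
    ("The Greek Interpreter", "Greek", "Greece"),
]

# A prediction counts as correct only on an exact match.
correct = sum(1 for _story, predicted, actual in results if predicted == actual)
misses = [story for story, predicted, actual in results if predicted != actual]

print(f"{correct}/{len(results)} correct")  # 5/6 correct
print("Missed:", misses)
```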
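We got 5 of 6 predictions correct! The one miss, 'Greek', is a nationality rather than a place, which is the weakness of GPE tagging noted at the start. These results are not discouraging, and we may think of using this code in a more serious application.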
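One possible way to handle nationalities tagged as GPEs is to map them to the place they refer to before counting. This is only a sketch of the idea and was not part of the experiment above: `DEMONYM_TO_PLACE` is a hypothetical hand-written lookup, `normalise_places` is an illustrative helper name, and the entity list is made up.

```python
from collections import Counter

# Hypothetical, hand-written lookup; a real application would need a fuller list.
DEMONYM_TO_PLACE = {
    "Greek": "Greece",
    "French": "France",
    "German": "Germany",
    "English": "England",
}


def normalise_places(names):
    """Map nationality words to the place they refer to; leave other names unchanged."""
    return [DEMONYM_TO_PLACE.get(name, name) for name in names]


# Made-up entity list for illustration only, not output from a real story.
extracted = ["Greek", "London", "Greek", "Greece", "Greek", "London"]

count = Counter(normalise_places(extracted))
print(count.most_common(1))  # [('Greece', 4)]
```

Whether a nationality should be mapped to a place or simply discarded depends on the story, so this choice is itself an assumption.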