Map Reduce Word Count With Python : Learn Data Science


We spent multiple lectures at the university talking about Hadoop architecture. Yes, I even demonstrated the cool playing-cards example! In fact, we have an 18-page PDF from our data science lab covering the installation. Still, I saw students shy away, perhaps because of the complex installation process involved. This tutorial jumps straight into hands-on coding to help anyone get up and running with MapReduce. No Hadoop installation is required.

Problem : Counting word frequencies (word count) in a file.

Data :
Create a sample.txt file with the following lines. Preferably, create a directory for this tutorial and put all the files there, including this one.

my home is kolkata
but my real home is kutch

Mapper :

Create a file mapper.py and paste the code below into it. The mapper receives data from stdin, splits it into words, and prints one (word, count) pair per line. Any UNIX/Linux user knows the beauty of pipes; we'll later use a pipe to feed the data from sample.txt into stdin.

#!/usr/bin/env python
import sys

# Get input lines from stdin
for line in sys.stdin:
    # Remove whitespace from the beginning and end of the line
    line = line.strip()

    # Split it into words
    words = line.split()

    # Output (word, 1) tuples on stdout
    for word in words:
        print('%s\t%s' % (word, 1))
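If you want to check the mapper's logic without going through stdin, you can run the same loop over the sample lines in a plain Python session. This is just a sketch: the hard-coded `lines` list stands in for `sys.stdin`.

```python
# Simulate mapper.py on the sample lines (no stdin needed).
lines = ["my home is kolkata\n", "but my real home is kutch\n"]

pairs = []
for line in lines:
    # Same steps as mapper.py: strip, split, emit (word, 1) pairs
    line = line.strip()
    for word in line.split():
        pairs.append('%s\t%s' % (word, 1))

print('\n'.join(pairs))
```

The first pair printed is `my	1`, and the two sample lines yield ten pairs in total, one per word occurrence.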

Reducer :

Create a file reducer.py and paste the code below into it. The reducer reads the tuples generated by the mapper and aggregates them.

#!/usr/bin/env python
import sys

# Create a dictionary to map words to counts
wordcount = {}

# Get input from stdin
for line in sys.stdin:
    # Remove whitespace from the beginning and end of the line
    line = line.strip()

    # Parse the input from mapper.py
    word, count = line.split('\t', 1)

    # Convert count (currently a string) to int
    try:
        count = int(count)
    except ValueError:
        continue

    try:
        wordcount[word] = wordcount[word] + count
    except KeyError:
        wordcount[word] = count

# Write the tuples to stdout
# Currently the tuples are unsorted
for word in wordcount.keys():
    print('%s\t%s' % (word, wordcount[word]))
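The reducer can be exercised the same way, by feeding it mapper-style "word\tcount" lines from a list instead of stdin. This sketch uses the exact pairs the mapper emits for our two sample lines:

```python
# Simulate reducer.py on mapper-style "word\tcount" lines.
mapper_output = ['my\t1', 'home\t1', 'is\t1', 'kolkata\t1',
                 'but\t1', 'my\t1', 'real\t1', 'home\t1',
                 'is\t1', 'kutch\t1']

wordcount = {}
for line in mapper_output:
    # Same parsing as reducer.py: split on the tab, accumulate counts
    word, count = line.split('\t', 1)
    wordcount[word] = wordcount.get(word, 0) + int(count)

for word in wordcount:
    print('%s\t%s' % (word, wordcount[word]))
```

Using `dict.get(word, 0)` is an equivalent, slightly more compact alternative to the try/except KeyError pattern in reducer.py.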

Execution :
cd to the directory where all the files are kept and make both Python files executable:

chmod +x mapper.py
chmod +x reducer.py

And now we will feed the cat command to the mapper, and the mapper to the reducer, using pipes (|). That is, the output of cat goes to the mapper, and the mapper's output goes to the reducer. (Recall that the cat command is used to display the contents of a file.)

cat sample.txt | ./mapper.py | ./reducer.py

Output :

real	1
kutch	1
is	2
but	1
kolkata	1
home	2
my	2

Yay, so we get the word counts: my x 2, home x 2, is x 2, real x 1, kutch x 1, but x 1 and kolkata x 1! You can put your questions in the comments section below!
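One side note on the "unsorted" comment in reducer.py: in a real Hadoop Streaming job, the framework sorts the mapper's output by key before it reaches the reducer. If you want the final counts in alphabetical order, a small sketch (using the counts we just produced as a hard-coded dict):

```python
# Print the final counts sorted by word, mimicking the ordered
# output a Hadoop reducer would see after the shuffle/sort phase.
wordcount = {'real': 1, 'kutch': 1, 'is': 2, 'but': 1,
             'kolkata': 1, 'home': 2, 'my': 2}

for word in sorted(wordcount):
    print('%s\t%s' % (word, wordcount[word]))
```

With this change, `but` comes first and `real` last, regardless of the order in which the pairs arrived.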
