Map Reduce Word Count With Python : Learn Data Science

We spent multiple lectures talking about Hadoop architecture at the university. Yes, I even demonstrated the cool playing cards example! In fact, we have an 18-page PDF from our data science lab on the installation. Still, I saw students shy away, perhaps because of the complex installation process involved. This tutorial jumps straight to hands-on coding to help anyone get up and running with Map Reduce. No Hadoop installation is required.

Problem : Count word frequencies (word count) in a file. Data : Create a sample.txt file with the following lines. Preferably, create a directory for this tutorial and put all the files there, including this one.

my home is kolkata
but my real home is kutch

Mapper : Create a file mapper.py and paste the code below into it. The mapper receives data from stdin, splits it into words and prints one tab-separated (word, 1) tuple per word. Any UNIX/Linux user would know about the beauty of pipes. We’ll later use pipes to send the data from sample.txt to stdin.

#!/usr/bin/env python
import sys

# Get input lines from stdin
for line in sys.stdin:
    # Remove whitespace from the beginning and end of the line
    line = line.strip()

    # Split it into words
    words = line.split()

    # Output tab-separated (word, 1) tuples on stdout
    for word in words:
        print('%s\t%s' % (word, 1))
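
To check the mapper's logic without a shell, the same steps can be sketched as a plain function. This is a minimal sketch and not part of mapper.py: the name `map_words` is hypothetical, and it takes a list of lines instead of reading stdin.

```python
def map_words(lines):
    """Yield a (word, 1) tuple for every word in the given lines."""
    for line in lines:
        # strip() trims the line, split() breaks it on any whitespace
        for word in line.strip().split():
            yield (word, 1)


# The two lines of sample.txt, used here in place of stdin
sample = ["my home is kolkata\n", "but my real home is kutch\n"]

for word, count in map_words(sample):
    print('%s\t%s' % (word, count))
```

Every word gets the count 1 here, even if it repeats. Adding the counts up is left entirely to the reducer.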

Reducer : Create a file reducer.py and paste the code below into it. The reducer reads the tuples generated by the mapper and aggregates them by summing the counts for each word.

#!/usr/bin/env python
import sys

# Create a dictionary to map words to counts
wordcount = {}

# Get input from stdin
for line in sys.stdin:
    # Remove whitespace from the beginning and end of the line
    line = line.strip()

    # Parse the input from mapper.py
    try:
        word, count = line.split('\t', 1)
        # Convert count (currently a string) to int
        count = int(count)
    except ValueError:
        # Skip lines that are malformed or have a non-numeric count
        continue

    # Add to the running total, starting from 0 for a new word
    wordcount[word] = wordcount.get(word, 0) + count

# Write the tuples to stdout
# Currently tuples are unsorted
for word in wordcount:
    print('%s\t%s' % (word, wordcount[word]))
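
Since the reducer leaves its tuples unsorted, here is a minimal sketch of the same aggregation with alphabetically sorted output. It is an illustration and not part of reducer.py: the function name `reduce_counts` and the `mapper_output` list are hypothetical stand-ins for the script and its stdin.

```python
def reduce_counts(lines):
    """Sum the counts of tab-separated 'word<TAB>count' lines per word."""
    wordcount = {}
    for line in lines:
        try:
            word, count = line.strip().split('\t', 1)
            count = int(count)
        except ValueError:
            # Skip blank or malformed lines
            continue
        wordcount[word] = wordcount.get(word, 0) + count
    return wordcount


# A few mapper-style lines, used here in place of stdin
mapper_output = ["my\t1", "home\t1", "is\t1", "my\t1", "home\t1"]

totals = reduce_counts(mapper_output)
# sorted() orders the words alphabetically before printing
for word in sorted(totals):
    print('%s\t%s' % (word, totals[word]))
```

Sorting only at print time keeps the counting loop unchanged, so the dictionary can stay unordered while it is being filled.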

Execution : cd to the directory where all the files are kept and make both Python files executable:

chmod +x mapper.py
chmod +x reducer.py

Now we will feed the output of the cat command to the mapper, and the mapper’s output to the reducer, using the pipe (|). (Recall that the cat command displays the contents of a file.)

cat sample.txt | ./mapper.py | ./reducer.py

Output (the order of the lines may differ, since the tuples are unsorted) :

real	1
kutch	1
is	2
but	1
kolkata	1
home	2
my	2

Yay, so we get the word counts: real x 1, kutch x 1, is x 2, but x 1, kolkata x 1, home x 2 and my x 2! You can put your questions in the comments section below!
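
As a quick sanity check on these numbers, the same counts can be computed in-process with `collections.Counter` from the Python standard library. This sketch swaps the mapper/reducer pair for a single counter, so it verifies the result and is not itself Map Reduce. The sample text is inlined here so the check does not depend on sample.txt.

```python
from collections import Counter

# Contents of sample.txt, inlined so the check is self-contained
text = "my home is kolkata\nbut my real home is kutch\n"

# split() with no arguments breaks on any whitespace, including newlines
counts = Counter(text.split())

# most_common() lists the words from the highest count to the lowest
for word, count in counts.most_common():
    print('%s\t%s' % (word, count))
```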

Hello World in 10 Different Programming Languages

Writing Hello World is a custom I dare not break as I start off with this brand new blog. Here are Hello World programs in ten different programming languages.

1. C : C can be called the mother of most high-level languages, and it remains relevant to this day. It is arguably one of the most popular languages of all time.

#include <stdio.h>

int main(void) {
    printf("Hello World!");
    return 0;
}

2. C++ : Bjarne Stroustrup’s language gave programmers a new world view with object-oriented programming.

#include <iostream>

int main() {
    std::cout << "Hello World!";
    return 0;
}

3. UNIX Shell : The shell is like a ninja’s sword! You need to see this.

#!/bin/sh
echo "Hello World!"

4. Java : Java is perhaps the only truly ubiquitous multi-platform language we have seen.

public class HelloWorld {
    public static void main(String[] args) {
        System.out.print("Hello World!");
    }
}

5. C# : C# is a great C-family programming language from Microsoft.

public class HelloWorld {
    static void Main() {
        System.Console.Write("Hello World!");
    }
}

6. JavaScript : JS shaped the web platforms that employ perhaps half of India’s programmers.

alert("Hello World");

7. PHP : PHP is widely used to build the server-rendered pages of web applications.

<?php
echo "Hello World!";

8. Python : My favourite! Python is simple, readable and powerful.

print('Hello World!')

9. PL/SQL : When you cannot think in SQL, you use PL/SQL!

begin
    dbms_output.put_line('Hello World!');
end;

10. Arduino C : Arduino propels IoT development with affordable open-source prototyping boards for everyone. Arduino C is indeed C (strictly speaking, C++ under the hood)!

// setup() function runs only once
void setup() {
    // Use on-board LED as output
    pinMode(LED_BUILTIN, OUTPUT);
}

// loop() function runs forever
void loop() {
    digitalWrite(LED_BUILTIN, HIGH); // turn the LED on
    delay(1000); // wait a sec (or 1000 ms)
    digitalWrite(LED_BUILTIN, LOW); // turn the LED off
    delay(1000); // wait a sec (or 1000 ms)
}

So this is how the journey at iDevji starts. I hope it will create value, maybe a little, for everyone, especially my students!