MapReduce Real World Example in Python : Learn Data Science

MapReduce real world example on e-commerce transactions data is described here using Python streaming. The example does not require Hadoop installation. However, if you have Hadoop already installed it will run just fine on it. Python programming language is used because it is easy to read and understand. A real world e-commerce transactions dataset from a UK based retailer is used. The best way to learn with this example is to use an Ubuntu machine with Python 2 or 3 installed on it.

Outline

  1. The dataset consists of real world e-commerece data from UK based retailer
  2. The dataset is provided by Kaggle
  3. It contains 5.42k records (which is not small)
  4. Our goal is to find out country wise total sales
  5. Mapper multiplies quantity and unit price
  6. Mapper emits key-value pair as country, sales
  7. Reducer sums-up all pairs for same country
  8. Final output is country, sales for all countries

The Data

Download:Link to Kaggle DatasetSource: The dataset has real-life transaction data from a UK retailer. Format: CSV Size: 43.4 MB (5,42,000 records) Columns:

  1. InvoiceNo
  2. StockCode
  3. Description
  4. Quantity
  5. InvoiceDate
  6. UnitPrice
  7. CustomerID
  8. Country

The Problem

In this MapReduce real world example, we calculate total sales for each country from given dataset.

The Approach

Firstly, our data doesn’t have a Total column so it is to be computed using Quantity and UnitPrice columns as Total = Quantity * UnitPrice.

What Mapper Does

  1. Read the data
  2. Convert data into proper format
  3. Calculate total
  4. Print output as key-value pair CountryName:Total

What Reducer Does

  1. Read input from mapper
  2. Check for existing country key in the disctionary
  3. Add total to existing total value
  4. Print all key-value pairs

See this article on how to run this code

Python Code for Mapper (MapReduce Real World Example)

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
#!/usr/bin/env python
import sys

# Get input lines from stdin
for line in sys.stdin:
# Remove spaces from beginning and end of the line

line = line.strip()

# Split it into tokens

tokens = line.split(',')

#Get country, price and quantity values
try:
country = tokens\[7\]
price = float(tokens\[5\])
qty = int(tokens\[3\])
print '%s\\t%s' % (country, (price\*qty))
except ValueError:
pass

Python Code for Reducer (MapReduce Real World Example)

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
#!/usr/bin/env python
import sys

# Create a dictionary to map countries to totals
countrySales = {}

# Get input from stdin
for line in sys.stdin:
#Remove spaces from beginning and end of the line
line = line.strip()

# parse the input from mapper.py
country, total = line.split('\\t', 1)

# convert total (currently a string) to float
try:
total = float(total)
except ValueError:
pass

#update dictionary
try:
countrySales\[country\] = countrySales\[country\] + total
except:
countrySales\[country\] = total

# Write the tuples to stdout
for country in countrySales.keys():
print '%s\\t%s'% (country, countrySales\[country\])

Output

Country Score
Canada 3599.68
Brazil 1143.6
Italy 16506.03
Czech Republic 707.72
USA 1730.92
Lithuania 1661.06
Unspecified 4746.65
France 197194.15
Norway 34908.13
Bahrain 548.4
Israel 7867.42
Australia 135330.19
Singapore 9054.69
Iceland 4299.8
Channel Islands 19950.54
Germany 220791.78
Belgium 40752.83
European Community 1291.75
Hong Kong 10037.84
Spain 54632.86
EIRE 262112.48
Netherlands 283440.66
Denmark 18665.18
Poland 7193.34
Finland 22226.69
Saudi Arabia 131.17
Sweden 36374.15
Malta 2503.19
Switzerland 56199.23
Portugal 29272.34
United Arab Emirates 1877.08
Lebanon 1693.88
RSA 1002.31
United Kingdom 8148025.164
Austria 10149.28
Greece 4644.82
Japan 34616.06
Cyprus 12791.31

Conclusions

  • Mapper picks-up a record and emits country and total for that record
  • Mapper repeats this process for all 5.42k records
  • Now, we have 5.42k key value pairs
  • Reducer’s role is to combine these pairs until all keys are unique!

If you have questions, please feel free to comment below.

Arduino Serial Communication Basics : Learn Arduino

Arduino serial communication is one of the many ways Arduino talks to other devices. Arduino uses asynchronous serial communication to send-receive data to and from other devices. Arduino Uno supports serial communication via on-board UART port and Tx/Rx pins. Generally this transmission happens at 9600 bits per second which is termed as baud rate. [caption id=”attachment_840” align=”aligncenter” width=”424”] Arduino Uno Shown with UART port and Tx,Rx Pins[/caption] Arduino may communicate with many devices including:

  • Another Arduino
  • A computer
  • LCD display
  • Modem
  • Or any device that supports serial communication

How to do it? In this example we’ll connect Arduino and a computer using UART port. For this we are going to use Arduino IDE’s Serial Monitor. Serial Monitor will display data we sent from Arduino Uno. Code:

void setup() {
Serial.begin(9600);
}

void loop() {
Serial.println(“Hello World…”);
delay(1000);
}

Upload this code and select Tools > Serial Monitor (Shift+Ctrl+M). You should see this output. [caption id=”attachment_841” align=”alignnone” width=”579”] Serial Monitor output on the computer[/caption] Further Reading:

  1. Wikipedia: Serial communication
  2. Arduino Reference: Serial

Hello World in 10 Different Programming Languages

Writing Hello World is a custom I dare not to break as I start off with this brand new blog. Here are Hello World programs in ten different programming languages.

1. The C Not only C can be termed as mother of most high level languages but it remains relevant till date. It is undoubtedly the most popular language of all times.

1
2
3
#include void main() {
printf("Hello World!");
}

2. C ++ Bjarne Stroustrup’s language gave programmers a new world view with Object Oriented Programming.

1
2
3
#include void main() {
cout << "Hello World!";
}

3. UNIX Shell Shell is like a Ninja’s sword! You need to see this.

1
2
#!/bin/sh
echo "Hello World!"

4. Java Java is perhaps the only ubiquitous multi platform language we have seen.

1
2
3
4
5
public class HelloWorld {
public static void main(String args \[\]) {
System.out.print("Hello World!");
}
}

5. Hello World in C# C# is a great programming language based on C from Microsoft.

1
2
3
4
5
public class HelloWorld {
static void Main() {
System.Console.Write("Hello World!");
}
}

6. JavaScript JS shaped platforms which employ half of the Indian programmers.

1
alert ("Hello World");

7. PHP PHP makes great front-ends for web applications.

1
echo "Hello World!";

8. Python My favourite! Python is simple, readable and powerful.

1
print ('Hello World!')

9. PL/SQL When you cannot think in SQL, you use PL/SQL!

1
2
3
begin
dbms\_output.put\_line ('Hello World!');
end;

10. Arduino C Arduino propels IoT development with affordable open source prototyping boards for everyone. Arduino-C is indeed C!

1
2
3
4
5
6
7
8
9
10
11
12
13
// setup () function runs only once
void setup() {
// Use on-board LED as output
pinMode(LED\_BUILTIN, OUTPUT);
}

// loop function runs forever
void loop() {
digitalWrite(LED\_BUILTIN, HIGH); // turn the LED on
delay(1000); // wait a sec (or 1000 ms)
digitalWrite(LED\_BUILTIN, LOW); // turn the LED off
delay(1000); // wait a sec (or 1000 ms)
}

So this is how journey at iDevji starts, I hope it will create value, may be little, for everyone specially my students!