MapReduce Real World Example in Python : Learn Data Science

This real-world MapReduce example computes sales totals from e-commerce transaction data using Python scripts in the streaming style: each script reads from stdin and writes to stdout. The example does not require a Hadoop installation, but if Hadoop is already installed it will run there as well. Python is used because it is easy to read and understand.

The data is a real-world e-commerce transactions dataset from a UK-based retailer. The easiest way to follow along is on an Ubuntu machine with Python 2 or 3 installed.

Brief Outline

  1. The dataset consists of real-world e-commerce data from a UK-based retailer
  2. The dataset is provided by Kaggle
  3. It contains about 542,000 records (which is not small)
  4. Our goal is to find the total sales for each country
  5. Mapper multiplies quantity by unit price
  6. Mapper emits a key-value pair of country and sales
  7. Reducer sums up all pairs for the same country
  8. Final output is country and sales for all countries

The Data


Link to Kaggle Dataset

Source: The dataset has real-life transaction data from a UK retailer.
Format: CSV
Size: 43.4 MB (542,000 records)

Columns:

  1. InvoiceNo
  2. StockCode
  3. Description
  4. Quantity
  5. InvoiceDate
  6. UnitPrice
  7. CustomerID
  8. Country

The Problem

In this real-world MapReduce example, we calculate the total sales for each country from the given dataset.

The Approach

The data does not have a Total column, so the total for each record has to be computed from the Quantity and UnitPrice columns as Total = Quantity * UnitPrice.
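
As a quick illustration of the formula, here is a minimal sketch that computes the total for one record. The helper name `line_total` and the sample values are made up for this illustration and are not taken from the dataset.

```python
def line_total(quantity, unit_price):
    """Total for one record: Quantity * UnitPrice."""
    return quantity * unit_price


# Hypothetical record, for illustration only
row = {"Quantity": 8, "UnitPrice": 3.25, "Country": "France"}
total = line_total(row["Quantity"], row["UnitPrice"])
print("%s\t%s" % (row["Country"], total))  # France	26.0
```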

What Mapper Does

  1. Read the data line by line
  2. Convert the fields into the proper types
  3. Calculate the total
  4. Print the output as a tab-separated key-value pair: CountryName, Total
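
The steps above can be sketched as a single function. This version is a variation, not the article's script: it assumes that some Description values may be quoted and contain commas, so it parses each line with Python's `csv` module instead of a plain split. The function name `map_line` and the sample row are hypothetical.

```python
import csv


def map_line(line):
    """Return (country, total) for one CSV line, or None if it cannot be parsed."""
    fields = next(csv.reader([line]), [])
    if len(fields) != 8:
        return None  # not a complete 8-column record
    try:
        quantity = int(fields[3])
        unit_price = float(fields[5])
    except ValueError:
        return None  # header row or malformed numbers
    return fields[7], quantity * unit_price


# Hypothetical row with a comma inside the quoted description
sample = '10001,A100,"MUG, BLUE",4,12/1/2010 8:26,2.5,17850,France'
print(map_line(sample))  # ('France', 10.0)
```

A plain `split(',')` would break this sample row into nine tokens, which is why the sketch checks the field count after parsing.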

What Reducer Does

  1. Read the input from the mapper
  2. Check whether the country key already exists in the dictionary
  3. If it does, add the total to the existing value; otherwise, create the key with this total
  4. Print all key-value pairs

See this article on how to run this code

Python Code for Mapper (MapReduce Real World Example)

#!/usr/bin/env python
import sys

# Get input lines from stdin
for line in sys.stdin:
    # Remove whitespace from the beginning and end of the line
    line = line.strip()
    # Split it into tokens
    tokens = line.split(',')
    try:
        # Get country, price and quantity values
        country = tokens[7]
        price = float(tokens[5])
        qty = int(tokens[3])
        print('%s\t%s' % (country, price * qty))
    except (ValueError, IndexError):
        # Skip the header row and lines that do not parse cleanly
        pass

Python Code for Reducer (MapReduce Real World Example)

#!/usr/bin/env python
import sys

# Create a dictionary to map countries to totals
countrySales = {}

# Get input from stdin
for line in sys.stdin:
    # Remove whitespace from the beginning and end of the line
    line = line.strip()
    try:
        # Parse the input from the mapper
        country, total = line.split('\t', 1)
        # Convert total (currently a string) to float
        total = float(total)
    except ValueError:
        # Skip lines that are not a valid country/total pair
        continue
    # Update the dictionary
    if country in countrySales:
        countrySales[country] = countrySales[country] + total
    else:
        countrySales[country] = total

# Write the tuples to stdout
for country in countrySales.keys():
    print('%s\t%s' % (country, countrySales[country]))

Output (MapReduce Real World Example)


Canada	3599.68
Brazil	1143.6
Italy	16506.03
Czech Republic	707.72
USA	1730.92
Lithuania	1661.06
Unspecified	4746.65
France	197194.15
Norway	34908.13
Bahrain	548.4
Israel	7867.42
Australia	135330.19
Singapore	9054.69
Iceland	4299.8
Channel Islands	19950.54
Germany	220791.78
Belgium	40752.83
European Community	1291.75
Hong Kong	10037.84
Spain	54632.86
EIRE	262112.48
Netherlands	283440.66
Denmark	18665.18
Poland	7193.34
Finland	22226.69
Saudi Arabia	131.17
Sweden	36374.15
Malta	2503.19
Switzerland	56199.23
Portugal	29272.34
United Arab Emirates	1877.08
Lebanon	1693.88
RSA	1002.31
United Kingdom	8148025.164
Austria	10149.28
Greece	4644.82
Japan	34616.06
Cyprus	12791.31
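
The reducer prints countries in no particular order. As a small follow-up, here is a sketch that ranks tab-separated reducer output by sales. The function name `top_countries` is hypothetical, and the input is a handful of lines copied from the output above.

```python
def top_countries(lines, n=3):
    """Parse 'country<TAB>total' lines and return the n largest totals."""
    pairs = []
    for line in lines:
        country, total = line.rstrip("\n").split("\t", 1)
        pairs.append((country, float(total)))
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)[:n]


# A few lines taken from the reducer output
reducer_output = [
    "Germany\t220791.78",
    "EIRE\t262112.48",
    "Netherlands\t283440.66",
    "France\t197194.15",
    "Saudi Arabia\t131.17",
]

for country, total in top_countries(reducer_output):
    print("%s\t%.2f" % (country, total))
```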


  • Mapper picks up a record and emits the country and total for that record
  • Mapper repeats this process for all 542,000 records
  • Now, we have about 542,000 key-value pairs (fewer if some lines were skipped)
  • Reducer’s role is to combine these pairs until all keys are unique!
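
The combining step can also be sketched without a dictionary. Hadoop Streaming sorts mapper output by key before it reaches the reducer, so pairs with the same country arrive next to each other. The sketch below imitates that with `sorted` and `itertools.groupby`. It is an alternative to the dictionary-based reducer in this article, not a copy of it, and the name `reduce_sorted` and the sample pairs are made up.

```python
from itertools import groupby
from operator import itemgetter


def reduce_sorted(pairs):
    """Sum totals per country; sorting first mimics the shuffle between map and reduce."""
    ordered = sorted(pairs, key=itemgetter(0))
    return [
        (country, sum(total for _, total in group))
        for country, group in groupby(ordered, key=itemgetter(0))
    ]


# Hypothetical mapper output: one (country, total) pair per record
mapper_output = [("France", 10.0), ("Germany", 5.0), ("France", 2.5), ("Germany", 1.0)]
print(reduce_sorted(mapper_output))  # [('France', 12.5), ('Germany', 6.0)]
```

Four pairs go in and two come out, one per unique key.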

If you have questions, please feel free to comment below.

