MapReduce Real World Example in Python
This article walks through a real-world MapReduce example on e-commerce transaction data, written as Python streaming scripts. The example does not require a Hadoop installation, but it also runs on Hadoop if you already have it installed. Python is used because it is easy to read and understand. The data is a real-world e-commerce transactions dataset from a UK-based retailer. The easiest way to follow along is on an Ubuntu machine with Python 2 or 3 installed.
Outline
- The dataset consists of real-world e-commerce data from a UK-based retailer
- The dataset is provided by Kaggle
- It contains about 542,000 records (which is not small)
- Our goal is to find the total sales per country
- Mapper multiplies quantity and unit price
- Mapper emits a key-value pair (country, sales)
- Reducer sums up all pairs for the same country
- Final output is (country, sales) for all countries
The Data
Download: Link to Kaggle Dataset
Source: The dataset has real-life transaction data from a UK retailer.
Format: CSV
Size: 43.4 MB (about 542,000 records)
Columns:
- InvoiceNo
- StockCode
- Description
- Quantity
- InvoiceDate
- UnitPrice
- CustomerID
- Country
The Problem
In this MapReduce real-world example, we calculate the total sales for each country from the given dataset.
The Approach
Our data does not have a Total column, so the total for each record has to be computed from the Quantity and UnitPrice columns as Total = Quantity * UnitPrice.
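As a small illustration of the Total = Quantity * UnitPrice formula, the sketch below computes the total for two made-up records. The function name `line_total` and the sample values are assumptions for illustration only and are not taken from the dataset. Rounding to two decimals is also an assumption, added here to hide floating-point noise.

```python
def line_total(quantity, unit_price):
    """Total = Quantity * UnitPrice, rounded to 2 decimals for display."""
    return round(quantity * unit_price, 2)


# Made-up (quantity, unit_price) values, not taken from the dataset.
print(line_total(6, 2.55))
# A negative quantity (for example a return) gives a negative total.
print(line_total(-1, 4.25))
```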
What Mapper Does
- Read the data
- Convert data into proper format
- Calculate total
- Print output as a key-value pair CountryName:Total
What Reducer Does
- Read input from mapper
- Check for an existing country key in the dictionary
- Add total to the existing total value
- Print all key-value pairs
See this article on how to run this code
Python Code for Mapper (MapReduce Real World Example)
#!/usr/bin/env python
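The listing above shows only the interpreter line. A minimal sketch of a mapper that follows the steps under "What Mapper Does" could look like the following. It is not the original code. The names `map_row` and `main`, the tab separator between key and value (the default in Hadoop streaming), and the column positions (derived from the column list in "The Data") are assumptions.

```python
#!/usr/bin/env python
"""Sketch of a streaming mapper: CSV rows in, 'Country<TAB>Total' lines out."""
import csv
import sys

# Assumed column positions, based on the column order listed in "The Data".
QUANTITY, UNIT_PRICE, COUNTRY = 3, 5, 7


def map_row(row):
    """Return (country, total) for one parsed CSV row, or None to skip it."""
    if len(row) <= COUNTRY:
        return None  # short or malformed row
    try:
        total = float(row[QUANTITY]) * float(row[UNIT_PRICE])
    except ValueError:
        return None  # header row or non-numeric field
    return row[COUNTRY].strip(), total


def main(lines=sys.stdin, out=sys.stdout):
    # csv.reader copes with commas inside quoted Description values.
    for row in csv.reader(lines):
        pair = map_row(row)
        if pair is not None:
            out.write("%s\t%s\n" % pair)


if __name__ == "__main__":
    main()
```

Parsing with the `csv` module rather than splitting on commas is a deliberate choice here, since a product description may itself contain commas.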
Python Code for Reducer (MapReduce Real World Example)
#!/usr/bin/env python
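The listing above shows only the interpreter line. A minimal sketch of a reducer that follows the steps under "What Reducer Does" could look like the following. It is not the original code. The name `reduce_lines`, the tab-separated input format, and the rounding of the printed totals are assumptions.

```python
#!/usr/bin/env python
"""Sketch of a streaming reducer: sums 'Country<TAB>Total' lines per country."""
import sys


def reduce_lines(lines):
    """Accumulate totals per country in a dictionary and return it."""
    sales = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # Assumes the mapper separates key and value with a tab.
        country, _, value = line.partition("\t")
        try:
            total = float(value)
        except ValueError:
            continue  # skip lines that do not carry a numeric total
        # Add to the existing total, starting from 0.0 for a new country.
        sales[country] = sales.get(country, 0.0) + total
    return sales


if __name__ == "__main__":
    for country, total in reduce_lines(sys.stdin).items():
        print("%s\t%s" % (country, round(total, 2)))
```

Keeping every key in a dictionary is workable here because the number of distinct countries is small. It also means the reducer does not depend on its input arriving sorted by key.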
Output
Country | Total Sales |
---|---|
Canada | 3599.68 |
Brazil | 1143.6 |
Italy | 16506.03 |
Czech Republic | 707.72 |
USA | 1730.92 |
Lithuania | 1661.06 |
Unspecified | 4746.65 |
France | 197194.15 |
Norway | 34908.13 |
Bahrain | 548.4 |
Israel | 7867.42 |
Australia | 135330.19 |
Singapore | 9054.69 |
Iceland | 4299.8 |
Channel Islands | 19950.54 |
Germany | 220791.78 |
Belgium | 40752.83 |
European Community | 1291.75 |
Hong Kong | 10037.84 |
Spain | 54632.86 |
EIRE | 262112.48 |
Netherlands | 283440.66 |
Denmark | 18665.18 |
Poland | 7193.34 |
Finland | 22226.69 |
Saudi Arabia | 131.17 |
Sweden | 36374.15 |
Malta | 2503.19 |
Switzerland | 56199.23 |
Portugal | 29272.34 |
United Arab Emirates | 1877.08 |
Lebanon | 1693.88 |
RSA | 1002.31 |
United Kingdom | 8148025.164 |
Austria | 10149.28 |
Greece | 4644.82 |
Japan | 34616.06 |
Cyprus | 12791.31 |
Conclusions
- Mapper picks up a record and emits the country and total for that record
- Mapper repeats this process for all 542k records
- Now, we have 542k key-value pairs
- Reducer's role is to combine these pairs until all keys are unique!
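The flow summarized above can be tried end to end on a handful of made-up rows, without any files or Hadoop. The sketch below is an in-memory illustration only. The sample rows and the function names `map_phase`, `shuffle_phase` and `reduce_phase` are assumptions. Unlike the dictionary approach described earlier, it sorts the pairs by key and sums each run of equal keys with `itertools.groupby`, which mimics the sort that Hadoop performs between the map and reduce steps.

```python
from itertools import groupby
from operator import itemgetter

# Made-up (country, quantity, unit_price) rows, not taken from the dataset.
SAMPLE_ROWS = [
    ("United Kingdom", 6, 2.5),
    ("France", 2, 1.5),
    ("United Kingdom", 4, 0.25),
    ("France", 1, 10.0),
]


def map_phase(rows):
    """Emit one (country, total) pair per input row."""
    return [(country, quantity * unit_price)
            for country, quantity, unit_price in rows]


def shuffle_phase(pairs):
    """Sort pairs by key so that equal countries sit next to each other."""
    return sorted(pairs, key=itemgetter(0))


def reduce_phase(sorted_pairs):
    """Sum the totals of each run of identical keys."""
    return {country: sum(total for _, total in group)
            for country, group in groupby(sorted_pairs, key=itemgetter(0))}


if __name__ == "__main__":
    pairs = map_phase(SAMPLE_ROWS)
    totals = reduce_phase(shuffle_phase(pairs))
    print("%d pairs reduced to %d keys" % (len(pairs), len(totals)))
    for country, total in totals.items():
        print("%s\t%s" % (country, total))
```

Four input rows produce four key-value pairs, and the reduce step collapses them to one entry per country, which is the same shape as the output table above.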