An Introduction to Hadoop and the Hadoop Ecosystem

Welcome to the Hadoop and Big Data series! This is the first article in the series, where we present an introduction to Hadoop and its ecosystem.

In the beginning

In October 2003, Google published a paper titled "The Google File System" (Ghemawat et al.), describing the design and implementation of a scalable distributed file system. This paper, along with another paper on MapReduce, inspired Doug Cutting and Mike Cafarella to create what is now known as Hadoop. Development of the project was eventually taken over by the Apache Software Foundation, hence the name Apache Hadoop.

What is in the name?

The name Hadoop sparks curiosity, but it is not computing jargon and there is no deep logic behind the choice. Cutting could not come up with a name for the new project, so he called it Hadoop: the name his young son had given to his stuffed yellow elephant toy!

Why do we need Hadoop?

Hadoop shines when it comes to processing huge amounts (and we mean really huge!) of data. Without Hadoop, processing data at that scale was only practical on specialized hardware, in other words supercomputers. The key advantage Hadoop brings is that it runs on commodity hardware: you could set up a working Hadoop cluster using a couple of ordinary laptops.

Is Hadoop free?

Hadoop is completely free: free as in it costs nothing, and free as in you are free to modify it to suit your own needs. It is licensed under the Apache License 2.0.

Core components of Hadoop

  1. HDFS: HDFS, or the Hadoop Distributed File System, is the component responsible for storing files in a distributed manner. It is a robust file system that provides integrity, redundancy and other services. It has two main components: the NameNode and the DataNode (a few basic HDFS shell commands are shown after this list).
  2. MapReduce: MapReduce provides a programming model for parallel computation. It has two main operations: Map and Reduce. MapReduce 2.0 (MRv2), introduced in Hadoop 2, is sometimes referred to as YARN.
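
To get a feel for how you interact with HDFS, here are a few standard hdfs dfs shell commands, shown purely as an illustration (localfile.txt is a placeholder name; the commands become usable once the single-node cluster described below is running):

    hdfs dfs -mkdir -p /user/hduser/demo             # create a directory in HDFS
    hdfs dfs -put localfile.txt /user/hduser/demo/   # copy a local file into HDFS
    hdfs dfs -ls /user/hduser/demo                   # list the directory
    hdfs dfs -cat /user/hduser/demo/localfile.txt    # print the file's contents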

Introduction to Hadoop ecosystem

The Hadoop ecosystem refers to the collection of products that work alongside Hadoop, each addressing a different task. For example, with Ambari we can easily install and manage clusters. At this point there is no need to dive into the details of each product; the products discussed in this series come from the Apache Software Foundation and are free under the Apache License 2.0.

Setting up Apache Hadoop Single Node Cluster

This guide will help you install a single-node Apache Hadoop cluster on your machine.

System Requirements

  • Ubuntu 16.04
  • Java 8 Installed

1. Download Hadoop

wget https://archive.apache.org/dist/hadoop/core/hadoop-2.7.0/hadoop-2.7.0.tar.gz

2. Prepare for Installation

tar xfz hadoop-2.7.0.tar.gz
sudo mv hadoop-2.7.0 /usr/local/hadoop

3. Create Dedicated Group and User

sudo addgroup hadoop
sudo adduser --ingroup hadoop hduser
sudo adduser hduser sudo

4. Switch to Newly Created User Account

su - hduser

5. Add Variables to ~/.bashrc

#Begin Hadoop Variables

export JAVA_HOME=/usr/lib/jvm/java-8-oracle
export HADOOP_HOME=/usr/local/hadoop
export HADOOP_INSTALL=/usr/local/hadoop
export PATH=$PATH:$HADOOP_INSTALL/bin
export PATH=$PATH:$HADOOP_INSTALL/sbin
export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export HADOOP_HDFS_HOME=$HADOOP_INSTALL
export YARN_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_INSTALL/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_INSTALL/lib"

#End Hadoop Variables

6. Source ~/.bashrc

source ~/.bashrc
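
To confirm that the new variables have taken effect, you can print one of them and ask Hadoop to report its version (an optional sanity check):

    echo $HADOOP_HOME
    hadoop version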

7. Set Java Home for Hadoop

  • Open the file: /usr/local/hadoop/etc/hadoop/hadoop-env.sh
  • Find the JAVA_HOME line and edit it as:
export JAVA_HOME=/usr/lib/jvm/java-8-oracle

#### 8. Edit core-site.xml

* Open the file: /usr/local/hadoop/etc/hadoop/core-site.xml
* Add the following lines between the _<configuration> ... </configuration>_ tags.

<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:9000</value>
</property>

#### 9. Edit yarn-site.xml

* Open the file: /usr/local/hadoop/etc/hadoop/yarn-site.xml
* Add the following lines between the _<configuration> ... </configuration>_ tags.

<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
  <value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>


#### 10. Edit mapred-site.xml

* Copy the mapred-site.xml template first using:

cp /usr/local/hadoop/etc/hadoop/mapred-site.xml.template /usr/local/hadoop/etc/hadoop/mapred-site.xml

* Open the file: /usr/local/hadoop/etc/hadoop/mapred-site.xml
* Add the following lines between the _<configuration> ... </configuration>_ tags.

<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>


#### 11. Edit hdfs-site.xml

First, we create the following directories:

sudo mkdir -p /usr/local/hadoop_store/hdfs/namenode
sudo mkdir -p /usr/local/hadoop_store/hdfs/datanode
sudo chown hduser:hadoop -R /usr/local/hadoop_store
sudo chmod 777 -R /usr/local/hadoop_store


Now open /usr/local/hadoop/etc/hadoop/hdfs-site.xml and add the following content between the <configuration> ... </configuration> tags.

<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>
<property>
  <name>dfs.namenode.name.dir</name>
  <value>file:/usr/local/hadoop_store/hdfs/namenode</value>
</property>
<property>
  <name>dfs.datanode.data.dir</name>
  <value>file:/usr/local/hadoop_store/hdfs/datanode</value>
</property>


#### 12. Format NameNode

cd /usr/local/hadoop/
bin/hdfs namenode -format


#### 13. Start Hadoop Daemons

cd /usr/local/hadoop/
sbin/start-dfs.sh
sbin/start-yarn.sh


#### 14. Check Service Status

jps
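
If the daemons started correctly, jps should list roughly the following six services (the PIDs here are only illustrative and will differ on your machine):

    2481 NameNode
    2614 DataNode
    2798 SecondaryNameNode
    2951 ResourceManager
    3083 NodeManager
    3391 Jps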


#### 15. Check Running Jobs

Type the following in your browser's address bar:

http://localhost:8088


#### Done!

MapReduce Real World Example in Python: Learn Data Science

This article walks through a MapReduce real world example on e-commerce transaction data using Python streaming. The example does not require a Hadoop installation; however, if you already have Hadoop installed, it will run on it just fine. The Python programming language is used because it is easy to read and understand. A real world e-commerce transactions dataset from a UK based retailer is used. The best way to follow this example is on an Ubuntu machine with Python 2 or 3 installed.

Outline

  1. The dataset consists of real world e-commerce data from a UK based retailer
  2. The dataset is provided by Kaggle
  3. It contains 542,000 records (which is not small)
  4. Our goal is to find the country-wise total sales
  5. The mapper multiplies quantity and unit price
  6. The mapper emits a key-value pair of country, sales
  7. The reducer sums up all pairs for the same country
  8. The final output is country, sales for all countries

The Data

Download: Link to Kaggle Dataset
Source: The dataset has real-life transaction data from a UK retailer.
Format: CSV
Size: 43.4 MB (542,000 records)
Columns:

  1. InvoiceNo
  2. StockCode
  3. Description
  4. Quantity
  5. InvoiceDate
  6. UnitPrice
  7. CustomerID
  8. Country

The Problem

In this MapReduce real world example, we calculate the total sales for each country from the given dataset.

The Approach

Our data does not have a Total column, so it has to be computed from the Quantity and UnitPrice columns as Total = Quantity * UnitPrice.
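
For example (using made-up values in the dataset's column order), a record for the United Kingdom with Quantity 6 and UnitPrice 2.50 contributes 6 * 2.50 = 15.0 to that country's total, so the mapper would emit the line United Kingdom<TAB>15.0, where <TAB> stands for the tab character separating key and value.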

What Mapper Does

  1. Read the input data
  2. Convert the data into a proper format
  3. Calculate the total
  4. Print the output as a key-value pair: CountryName, Total

What Reducer Does

  1. Read the input from the mapper
  2. Check for an existing country key in the dictionary
  3. Add the total to the existing total value
  4. Print all key-value pairs

See the MapReduce Streaming Example section later in this series for how to run code like this on a Hadoop cluster.

Python Code for Mapper (MapReduce Real World Example)

#!/usr/bin/env python
import sys

# Read input lines from stdin
for line in sys.stdin:
    # Remove whitespace from the beginning and end of the line
    line = line.strip()

    # Split the line into tokens
    tokens = line.split(',')

    # Get the country, price and quantity values
    try:
        country = tokens[7]
        price = float(tokens[5])
        qty = int(tokens[3])
        print('%s\t%s' % (country, price * qty))
    except ValueError:
        # Skip the header row and malformed records
        pass

Python Code for Reducer (MapReduce Real World Example)

#!/usr/bin/env python
import sys

# Create a dictionary to map countries to totals
countrySales = {}

# Read input from stdin
for line in sys.stdin:
    # Remove whitespace from the beginning and end of the line
    line = line.strip()

    # Parse the input from mapper.py
    country, total = line.split('\t', 1)

    # Convert total (currently a string) to float
    try:
        total = float(total)
    except ValueError:
        # Skip lines whose total is not a number
        continue

    # Update the dictionary
    try:
        countrySales[country] = countrySales[country] + total
    except KeyError:
        countrySales[country] = total

# Write the country/total pairs to stdout
for country in countrySales.keys():
    print('%s\t%s' % (country, countrySales[country]))
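
To try the pair locally without Hadoop, you can chain the two scripts with a plain shell pipeline. This is a sketch that assumes the Kaggle CSV has been saved as data.csv (a placeholder file name) in the same directory as mapper.py and reducer.py:

    cat data.csv | python mapper.py | python reducer.py

No sort step is needed between the two scripts here, because this reducer accumulates totals in a dictionary instead of relying on input sorted by key.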

Output

Country Sales
Canada 3599.68
Brazil 1143.6
Italy 16506.03
Czech Republic 707.72
USA 1730.92
Lithuania 1661.06
Unspecified 4746.65
France 197194.15
Norway 34908.13
Bahrain 548.4
Israel 7867.42
Australia 135330.19
Singapore 9054.69
Iceland 4299.8
Channel Islands 19950.54
Germany 220791.78
Belgium 40752.83
European Community 1291.75
Hong Kong 10037.84
Spain 54632.86
EIRE 262112.48
Netherlands 283440.66
Denmark 18665.18
Poland 7193.34
Finland 22226.69
Saudi Arabia 131.17
Sweden 36374.15
Malta 2503.19
Switzerland 56199.23
Portugal 29272.34
United Arab Emirates 1877.08
Lebanon 1693.88
RSA 1002.31
United Kingdom 8148025.164
Austria 10149.28
Greece 4644.82
Japan 34616.06
Cyprus 12791.31

Conclusions

  • The mapper picks up a record and emits the country and total for that record
  • The mapper repeats this process for all 542,000 records
  • We then have 542,000 key-value pairs
  • The reducer combines these pairs, summing the totals so that each country key appears exactly once

If you have questions, please feel free to comment below.

MapReduce Streaming Example: Running Your First Job on Hadoop

This MapReduce streaming example will help you run a word count program using Hadoop Streaming. We use Python to write the mapper and reducer logic (a minimal sketch of such scripts follows the prerequisites below). The data is stored in a sample.txt file; the mapper, reducer and data can be downloaded as a bundle from the link provided. Prerequisites:

  • Hadoop 2.6.5
  • Python 2.7
  • Log in with your Hadoop user
  • Working directory should be set to /usr/local/hadoop.
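
The downloadable mapper.py and reducer.py implement a standard streaming word count. If you cannot fetch the bundle, a minimal sketch along the same lines (not necessarily identical to the downloaded files) looks like this for the mapper:

    #!/usr/bin/env python
    # mapper.py: emit "<word> 1" for every word read from stdin
    import sys

    for line in sys.stdin:
        for word in line.strip().split():
            print('%s\t%s' % (word, 1))

And for the reducer, which relies on Hadoop Streaming sorting the mapper output by key before it arrives:

    #!/usr/bin/env python
    # reducer.py: sum the counts for each word (input arrives sorted by key)
    import sys

    current_word = None
    current_count = 0

    for line in sys.stdin:
        word, count = line.strip().split('\t', 1)
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                print('%s\t%s' % (current_word, current_count))
            current_word = word
            current_count = int(count)

    if current_word is not None:
        print('%s\t%s' % (current_word, current_count))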

Steps:

  1. Make sure you're in /usr/local/hadoop; if not, use:

    cd /usr/local/hadoop

  2. Start HDFS:

    start-dfs.sh

  3. Start YARN:

    start-yarn.sh

  4. Check if everything is up (6 services should be running):

    jps

  5. Download data and code files to be used in this tutorial from here.

  6. Unzip contents of streaming.zip:

    unzip streaming.zip

  7. Move mapper and reducer Python files to /home/$USER/:

    mv streaming/mapper.py /home/$USER/mapper.py
    mv streaming/reducer.py /home/$USER/reducer.py

  8. Create working directories for storing data and downloading output when the Hadoop job finishes:

    mkdir /tmp/wordcount/
    mkdir /tmp/wordcount-output/

  9. Move sample.txt to /tmp/wordcount/:

    mv streaming/sample.txt /tmp/wordcount/

  10. Upload data to HDFS:

    hdfs dfs -copyFromLocal /tmp/wordcount /user/hduser/wordcount

  11. Submit the job (a local dry-run tip follows these steps):

    yarn jar share/hadoop/tools/lib/hadoop-streaming-2.6.5.jar \
    -file /home/$USER/mapper.py -mapper /home/$USER/mapper.py \
    -file /home/$USER/reducer.py -reducer /home/$USER/reducer.py \
    -input /user/$USER/wordcount/* \
    -output /user/$USER/wordcount-output

  12. When the job finishes, download the output data:

    hdfs dfs -getmerge /user/$USER/wordcount-output /tmp/wordcount-output/output.txt

  13. See the word count output in the terminal:

    cat /tmp/wordcount-output/output.txt
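
Tip: before submitting the job to the cluster (step 11), you can dry-run the scripts locally with a plain shell pipeline; the sort command stands in for the shuffle-and-sort phase Hadoop performs between map and reduce:

    cat /tmp/wordcount/sample.txt | python /home/$USER/mapper.py | sort | python /home/$USER/reducer.py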

Note: Many common errors are documented in the comments section below; please check there for help.

References

  1. Hadoop Streaming
  2. Hadoop Installation