An Introduction to Hadoop and Hadoop Ecosystem

Welcome to the Hadoop and Big Data series! This is the first article in the series, presenting an introduction to Hadoop and its ecosystem.

In the beginning

In October 2003, a paper titled The Google File System (Ghemawat et al.) was published. The paper describes the design and implementation of a scalable distributed file system. This paper, along with another paper on MapReduce, inspired Doug Cutting and Mike Cafarella to create what is now known as Hadoop. Eventually, development of the project was taken over by the Apache Software Foundation, hence the name Apache Hadoop.

What is in the name?

The choice of the name Hadoop sparks curiosity, but it is not computing jargon and there is no deeper logic behind it. Cutting couldn't find a name for the new project, so he named it Hadoop: "Hadoop" was the name his son had given to his stuffed yellow elephant toy!

Why do we need Hadoop?

Hadoop is really useful when it comes to processing huge amounts (and I mean really huge!) of data. Without Hadoop, processing data at that scale was only possible with specialized hardware, in other words supercomputers. The key advantage Hadoop brings is that it runs on commodity hardware; you could set up a working Hadoop cluster with a couple of ordinary laptops.

Is Hadoop free?

Hadoop is completely free: free as in it has no price, and free as in you are free to modify it to suit your own needs. It is licensed under the Apache License 2.0.

Core components of Hadoop

  1. HDFS : HDFS, or the Hadoop Distributed File System, is the component responsible for storing files in a distributed manner. It is a robust file system that provides integrity, redundancy and other services. It has two main components: the NameNode and the DataNode.
  2. MapReduce : MapReduce provides a programming model for parallel computation. It has two main operations: Map and Reduce (a rough sketch of the idea follows this list). MapReduce 2.0 (MRv2) runs on top of the YARN resource manager, and the two names are sometimes used interchangeably.
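To build intuition for the two operations without any Hadoop at all, the classic word count can be approximated with ordinary UNIX tools: tr emits one word per line (the map step), sort groups identical words (the shuffle), and uniq -c counts each group (the reduce step). This is only an analogy, not how Hadoop actually executes a job.

echo "my home is kolkata but my real home is kutch" | tr -s ' ' '\n' | sort | uniq -c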

Introduction to Hadoop ecosystem

The Hadoop ecosystem refers to the collection of products that work with Hadoop, each handling a different task. For example, with Ambari we can easily install and manage clusters. At this point there is no need to dive into the details of each product; all of them come from the Apache Software Foundation and are free under the Apache License 2.0.

Setting up Apache Hadoop Single Node Cluster

This guide will help you to install a single node Apache Hadoop cluster on your machine.

System Requirements

  • Ubuntu 16.04
  • Java 8 Installed
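A quick way to confirm the Java prerequisite before starting (the exact vendor and build string will differ on your machine):

java -version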

1. Download Hadoop

wget https://archive.apache.org/dist/hadoop/core/hadoop-2.7.0/hadoop-2.7.0.tar.gz

2. Prepare for Installation

tar xfz hadoop-2.7.0.tar.gz
sudo mv hadoop-2.7.0 /usr/local/hadoop

3. Create Dedicated Group and User

sudo addgroup hadoop
sudo adduser --ingroup hadoop hduser
sudo adduser hduser sudo

4. Switch to Newly Created User Account

su - hduser

5. Add Variables to ~/.bashrc

#Begin Hadoop Variables

export JAVA_HOME=/usr/lib/jvm/java-8-oracle
export HADOOP_HOME=/usr/local/hadoop
export HADOOP_INSTALL=/usr/local/hadoop
export PATH=$PATH:$HADOOP_INSTALL/bin
export PATH=$PATH:$HADOOP_INSTALL/sbin
export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export HADOOP_HDFS_HOME=$HADOOP_INSTALL
export YARN_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_INSTALL/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_INSTALL/lib"

#End Hadoop Variables

6. Source ~/.bashrc

source ~/.bashrc

7. Set Java Home for Hadoop

  • Open the file: /usr/local/hadoop/etc/hadoop/hadoop-env.sh
  • Find the JAVA_HOME line and edit it as follows:
export JAVA_HOME=/usr/lib/jvm/java-8-oracle

8. Edit core-site.xml

  • Open the file: /usr/local/hadoop/etc/hadoop/core-site.xml
  • Add the following lines between the <configuration> ... </configuration> tags:

<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:9000</value>
</property>

9. Edit yarn-site.xml

  • Open the file: /usr/local/hadoop/etc/hadoop/yarn-site.xml
  • Add the following lines between the <configuration> ... </configuration> tags:

<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
  <value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>


10. Edit mapred-site.xml

  • Copy the mapred-site.xml template first using:

cp /usr/local/hadoop/etc/hadoop/mapred-site.xml.template /usr/local/hadoop/etc/hadoop/mapred-site.xml

  • Open the file: /usr/local/hadoop/etc/hadoop/mapred-site.xml
  • Add the following lines between the <configuration> ... </configuration> tags:

<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>


11. Edit hdfs-site.xml

First, create the following directories:

sudo mkdir -p /usr/local/hadoop_store/hdfs/namenode
sudo mkdir -p /usr/local/hadoop_store/hdfs/datanode
sudo chown hduser:hadoop -R /usr/local/hadoop_store
sudo chmod 777 -R /usr/local/hadoop_store


Now open /usr/local/hadoop/etc/hadoop/hdfs-site.xml and add the following content between the <configuration> ... </configuration> tags:

<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>
<property>
  <name>dfs.namenode.name.dir</name>
  <value>file:/usr/local/hadoop_store/hdfs/namenode</value>
</property>
<property>
  <name>dfs.datanode.data.dir</name>
  <value>file:/usr/local/hadoop_store/hdfs/datanode</value>
</property>


12. Format NameNode

cd /usr/local/hadoop/
bin/hdfs namenode -format


13. Start Hadoop Daemons

cd /usr/local/hadoop/
sbin/start-dfs.sh
sbin/start-yarn.sh


14. Check Service Status

jps
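If the daemons came up correctly, the listing should show the NameNode, DataNode, SecondaryNameNode, ResourceManager and NodeManager along with Jps itself; a rough sketch of the expected output (the process IDs are placeholders):

2705 NameNode
2850 DataNode
3027 SecondaryNameNode
3190 ResourceManager
3305 NodeManager
3621 Jps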


15. Check Running Jobs

Type the following into your browser's address bar to open the ResourceManager web UI:

http://localhost:8088


Done!

Commonly Used HDFS Commands : Learn Data Science

Hadoop Distributed File System, or HDFS, is the underlying storage layer for all Hadoop applications. HDFS can be manipulated through APIs such as the Java API or the WebHDFS REST API, but the HDFS shell is the most commonly used option. A quick sketch of the REST route comes first, followed by a list of ten commonly used HDFS commands.
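A minimal illustration of that REST route, assuming WebHDFS is enabled and the NameNode web interface listens on the default Hadoop 2.x port 50070 (adjust the host and port for your cluster):

curl -i "http://localhost:50070/webhdfs/v1/user/hadoop?op=LISTSTATUS"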

1. Invoking the file system: The HDFS shell supports various file systems, not just HDFS. This means you can invoke file systems including the Local FS, HFTP FS, S3 FS, and others. Invoking the generic file system (any file system listed above):

hadoop fs
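The same shell can also address other file systems explicitly by spelling out the URI scheme, for example the local file system or a fully qualified HDFS path (the paths here are only illustrative):

hadoop fs -ls file:///tmp
hadoop fs -ls hdfs://localhost:9000/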

2. Listing Directory Contents

hdfs dfs -ls /user/hadoop

3. Creating Directories

hdfs dfs -mkdir /user/hadoop/dir1 /user/hadoop/dir2

4. Copying Local Files to HDFS

hdfs dfs -put localfile /user/hadoop/hadoopfile

5. Copying From HDFS to Local FS

hdfs dfs -get /user/hadoop/file1 localfile

6. Renaming or Moving Files

hdfs dfs -mv /user/hadoop/file1 /user/hadoop/file2

7. Copying Files Within HDFS

hdfs dfs -cp /user/hadoop/file1 /user/hadoop/file2

Tip: The distcp command is available for large inter- and intra-cluster copies.
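A sketch of a distcp invocation; the NameNode hostnames and port are placeholders for your own source and destination clusters:

hadoop distcp hdfs://namenode1:8020/user/hadoop/dir1 hdfs://namenode2:8020/user/hadoop/dir1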

8. Deleting Files From HDFS

hdfs dfs -rm /user/hadoop/file1

Tip: The -skipTrash option can be added to bypass the trash directory. Tip: Use -rm -r (or the older -rmr) for recursive deletion.
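For example, to delete a directory recursively while bypassing the trash (the path is illustrative):

hdfs dfs -rm -r -skipTrash /user/hadoop/dir1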

9. Display file contents

hdfs dfs -cat /user/hadoop/file1

Tip: Since the HDFS shell has no head command, we can pipe the -cat output to the native head.
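For example, to view only the first ten lines of a file:

hdfs dfs -cat /user/hadoop/file1 | head -n 10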

10. Empty Trash

hdfs dfs -expunge

Please feel free to ask questions in the comments!

Python MapReduce with Hadoop Streaming in Hortonworks Sandbox

The Hortonworks sandbox for the Hortonworks Data Platform (HDP) is a quick and easy personal desktop environment for learning, developing, testing and trying out new features. It saves the user from installing and configuring Hadoop and the surrounding tools. This article explains how to run the Python MapReduce word count example using Hadoop Streaming.

Requirements: The minimum system requirement is 8 GB+ of RAM; realistically you need 10 GB+ of RAM on the host to give the VM 8 GB. If you do not meet this requirement, you can try it on a cloud service such as Azure, AWS or Google Cloud. This article uses examples based on HDP 2.3.2 running on an Oracle VirtualBox VM hosted on Ubuntu 16.04.

Download and Installation: Follow this guide from Hortonworks to install the sandbox on Oracle VirtualBox.

Steps:

  1. Download example code and data from here

  2. Start sandbox image from VirtualBox

  3. From Ubuntu's web browser, log in to the dashboard at 127.0.0.1:8888 (username/password: raj_ops/raj_ops)

  4. From the dashboard GUI, create a directory named input

  5. Upload sample.txt to input using Ambari > Files View > Upload

  6. Again from the web browser, log in to the HDP shell at 127.0.0.1:4200 (username/password: root/password)

  7. From the shell, upload mapper.py and reducer.py using the following secure copy (scp) commands:

    scp -P 2222 /home/username/Downloads/mapper.py root@sandbox.hortonworks.com:/
    scp -P 2222 /home/username/Downloads/reducer.py root@sandbox.hortonworks.com:/

  8. Run the job using:

    hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-streaming.jar \
    -input /input -output /output -mapper /mapper.py -reducer /reducer.py

    Note: Do not create output directory in advance. Hadoop will create it.

  9. Test output:

    hadoop fs -cat /output/part-00000
    real 1
    my 2
    is 2
    but 1
    kolkata 1
    home 2
    kutch 2

References:

  1. Python MapReduce : Running Your First Hadoop Streaming Job
  2. Map Reduce Word Count With Python : The Simplest Tutorial

MapReduce Streaming Example : Running Your First Job on Hadoop

This MapReduce streaming example will walk you through running a word count program with Hadoop Streaming. We use Python for the mapper and reducer logic, and the data is stored in a sample.txt file. The mapper, reducer and data can be downloaded in a bundle from the link provided. Prerequisites:

  • Hadoop 2.6.5
  • Python 2.7
  • Log in with your Hadoop user
  • Working directory should be set to /usr/local/hadoop.

Steps:

  1. Make sure you are in /usr/local/hadoop; if not, use:

    cd /usr/local/hadoop

  2. Start HDFS:

    start-dfs.sh

  3. Start YARN:

    start-yarn.sh

  4. Check if everything is up (6 services should be running):

    jps

  5. Download data and code files to be used in this tutorial from here.

  6. Unzip contents of streaming.zip:

    unzip streaming.zip

  7. Move mapper and reducer Python files to /home/$USER/:

    mv streaming/mapper.py /home/$USER/mapper.py
    mv streaming/reducer.py /home/$USER/reducer.py

  8. Create working directories for storing data and downloading output when the Hadoop job finishes:

    mkdir /tmp/wordcount/
    mkdir /tmp/wordcount-output/

  9. Move sample.txt to /tmp

    mv streaming/sample.txt /tmp/wordcount/

  10. Upload data to HDFS:

    hdfs dfs -copyFromLocal /tmp/wordcount /user/hduser/wordcount

  11. Submit the job:

    yarn jar share/hadoop/tools/lib/hadoop-streaming-2.6.5.jar \
    -file /home/$USER/mapper.py -mapper /home/$USER/mapper.py \
    -file /home/$USER/reducer.py -reducer /home/$USER/reducer.py \
    -input /user/$USER/wordcount/* \
    -output /user/$USER/wordcount-output

  12. When the job finishes, download the output data:

    hdfs dfs -getmerge /user/$USER/wordcount-output /tmp/wordcount-output/output.txt

  13. See the word count output on the terminal:

    cat /tmp/wordcount-output/output.txt

Note: Many common errors are documented in the comments section; please see it for help.

References:

  1. Hadoop Streaming
  2. Hadoop Installation

Map Reduce Word Count With Python : Learn Data Science

We spent multiple lectures talking about Hadoop architecture at the university. Yes, I even demonstrated the cool playing cards example! In fact, we have an 18-page PDF from our data science lab on the installation. Still, I saw students shy away, perhaps because of the complex installation process involved. This tutorial jumps straight to hands-on coding to help anyone get up and running with MapReduce. No Hadoop installation is required.

Problem : Counting word frequencies (word count) in a file. Data : Create a sample.txt file with the following lines. Preferably, create a directory for this tutorial and put all the files there, including this one.

my home is kolkata
but my real home is kutch

Mapper : Create a file mapper.py and paste the code below into it. The mapper receives data from stdin, chunks it into words and prints the output. Any UNIX/Linux user would know about the beauty of pipes; we'll later use pipes to feed data from sample.txt to stdin.

#!/usr/bin/env python
import sys

# Get input lines from stdin
for line in sys.stdin:
    # Remove whitespace from the beginning and end of the line
    line = line.strip()

    # Split the line into words
    words = line.split()

    # Output (word, 1) tuples on stdout, tab-separated
    for word in words:
        print '%s\t%s' % (word, "1")

Reducer : Create a file reducer.py and paste the code below into it. The reducer reads the tuples generated by the mapper and aggregates them.

#!/usr/bin/env python
import sys

# Create a dictionary to map words to counts
wordcount = {}

# Get input from stdin
for line in sys.stdin:
    # Remove whitespace from the beginning and end of the line
    line = line.strip()

    # Parse the input from mapper.py
    word, count = line.split('\t', 1)

    # Convert count (currently a string) to int
    try:
        count = int(count)
    except ValueError:
        continue

    # Accumulate the count for this word
    try:
        wordcount[word] = wordcount[word] + count
    except KeyError:
        wordcount[word] = count

# Write the tuples to stdout
# Currently tuples are unsorted
for word in wordcount.keys():
    print '%s\t%s' % (word, wordcount[word])

Execution : cd to the directory where all the files are kept and make both Python files executable:

chmod +x mapper.py
chmod +x reducer.py

And now we will feed the output of cat to the mapper and the mapper's output to the reducer using the pipe operator (|). That is, the output of cat goes to the mapper and the mapper's output goes to the reducer. (Recall that the cat command is used to display the contents of a file.)

cat sample.txt | ./mapper.py | ./reducer.py

Output :

real	1
kutch	1
is	2
but	1
kolkata	1
home	2
my	2

Yay, so we get the word counts: real x 1, kutch x 1, is x 2, but x 1, kolkata x 1, home x 2 and my x 2! You can put your questions in the comments section below!