MapReduce Streaming Example: Running Your First Job on Hadoop


This MapReduce streaming example will help you run a word count program using Hadoop Streaming. We use Python to write the mapper and reducer logic, and the input data is stored in a sample.txt file. The mapper, reducer, and data can be downloaded as a bundle from the link provided in the steps below.
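
For orientation, here is a minimal sketch of what the mapper and reducer for a Hadoop Streaming word count typically look like in Python; the files in the download bundle may differ in detail, but the idea is the same. The mapper reads lines from standard input and emits one "word<TAB>1" pair per word:

    #!/usr/bin/env python
    # mapper.py - emit "<word>\t1" for every word read from standard input
    import sys

    for line in sys.stdin:
        for word in line.strip().split():
            print('%s\t%s' % (word, 1))

The reducer receives the mapper output sorted by key, so all counts for the same word arrive together and can be summed:

    #!/usr/bin/env python
    # reducer.py - sum the counts for each word (streaming delivers input sorted by word)
    import sys

    current_word = None
    current_count = 0

    for line in sys.stdin:
        word, count = line.strip().split('\t', 1)
        try:
            count = int(count)
        except ValueError:
            continue  # skip malformed lines
        if word == current_word:
            current_count += count
        else:
            if current_word is not None:
                print('%s\t%d' % (current_word, current_count))
            current_word = word
            current_count = count

    # emit the last word
    if current_word is not None:
        print('%s\t%d' % (current_word, current_count))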

Prerequisites:

  • Hadoop 2.6.5
  • Python 2.7
  • Log in with your Hadoop user
  • Working directory should be set to /usr/local/hadoop.
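
To verify the prerequisites before starting, a few quick checks can be used (assuming hadoop and python are on your PATH):

    hadoop version
    python --version
    echo $USER
    pwd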

Steps:

  1. Make sure you’re in /usr/local/hadoop; if not, run:
    cd /usr/local/hadoop
  2. Start HDFS:
    start-dfs.sh
  3. Start YARN:
    start-yarn.sh
  4. Check if everything is up (6 services should be running):
    jps
  5. Download the data and code files used in this tutorial from here.
  6. Unzip contents of streaming.zip:
    unzip streaming.zip
  7. Move mapper and reducer Python files to /home/$USER/:
    mv streaming/mapper.py /home/$USER/mapper.py
    mv streaming/reducer.py /home/$USER/reducer.py
  8. Create working directories for storing data and downloading output when the Hadoop job finishes:
    mkdir /tmp/wordcount/
    mkdir /tmp/wordcount-output/
  9. Move sample.txt to /tmp/wordcount/:
    mv streaming/sample.txt /tmp/wordcount/
  10. Upload data to HDFS:
    hdfs dfs -copyFromLocal /tmp/wordcount /user/$USER/wordcount
  11. Submit the job:
    yarn jar share/hadoop/tools/lib/hadoop-streaming-2.6.5.jar \
    	-file /home/$USER/mapper.py -mapper /home/$USER/mapper.py \
    	-file /home/$USER/reducer.py -reducer /home/$USER/reducer.py \
    	-input /user/$USER/wordcount/* \
    	-output /user/$USER/wordcount-output
  12. When the job finishes, download output data:
    hdfs dfs -getmerge /user/$USER/wordcount-output /tmp/wordcount-output/output.txt
  13. See word count output on terminal:
    cat /tmp/wordcount-output/output.txt
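
Assuming the standard streaming word count logic, output.txt contains one tab-separated word and count per line, sorted by word. The words and counts below are only a hypothetical illustration; the real values depend on the contents of sample.txt:

    Hadoop	3
    MapReduce	2
    streaming	5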

Note: Many common errors are documented in the comments section below; please check there for help.

References

  1. Hadoop Streaming
  2. Hadoop Installation

Comments

Ankit Yadav & Ashish Gusai:

Error: PipeMapRed.waitOutputThreads(): subprocess failed with code 127

Issue: When the Python file is created in a Windows environment, the newline character is CRLF. My Hadoop runs on Linux, which expects the LF newline character.

Solution: After changing CRLF to LF, the step ran successfully. For the conversion we can use dos2unix (also known as fromdos), which converts text files from the DOS format to the Unix format.

Step 1: Install dos2unix:
sudo apt install dos2unix
Step 2: Run the commands:
$ dos2unix /home/$USER/mapper.py
$ dos2unix /home/$USER/reducer.py
All done! Then try to…
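
If installing dos2unix is not possible, the same CRLF-to-LF conversion can usually be done with sed (an alternative not mentioned in the comment above; GNU sed assumed):

    sed -i 's/\r$//' /home/$USER/mapper.py
    sed -i 's/\r$//' /home/$USER/reducer.py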

Deep Chothani:

Hadoop: requires root's password after entering "start-all.sh"
Solution:

1) Generate ssh key without password
$ ssh-keygen -t rsa -P ""

2) Copy id_rsa.pub to authorized_keys
$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys

3) Start ssh localhost
$ ssh localhost

4) Now go to the Hadoop sbin directory and start Hadoop, then verify the daemons with jps:

$ jps
12373 Jps
11823 SecondaryNameNode
11643 DataNode
12278 NodeManager
11974 ResourceManager
11499 NameNode
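
As an alternative to manually appending the public key in step 2 (not part of the comment above), ssh-copy-id can install the key in one command:

    $ ssh-copy-id $USER@localhost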

Rupesh:

Hadoop: some folders cannot be deleted or removed from an HDFS directory.

Example: I created one "tmp" folder with one child folder tmp1, so it looks like "tmp/tmp1". If I delete this folder using the command
$ hdfs dfs -rm -r /tmp
it shows this error: "rm: Failed to get server trash configuration: null. Consider using -skipTrash option"

Solution:
$ hdfs dfs -rm -r -skipTrash /tmp
Deleted /tmp (output)

Rupesh:

Error: Name node is in safe mode
Issue: If I delete some folders in HDFS, this error appears.
For example:
$ hdfs dfs -rm -r /abc
Name node is in safe mode (error message)

Solution: type one of these commands:

$ bin/hadoop dfsadmin -safemode leave
or
$ hdfs dfsadmin -safemode leave
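
To check whether the NameNode is still in safe mode before or after leaving it, the standard status command can be used:

    $ hdfs dfsadmin -safemode get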

Mayur:

wget http://idevji.com/delivery/streaming.zip

This download link is not working; it returns 404 Not Found.