MapReduce Streaming Example: Running Your First Job on Hadoop

This MapReduce streaming example walks you through running a word count program with Hadoop streaming. We use Python to write the mapper and reducer logic. The input data is stored in a file named sample.txt. The mapper, reducer, and data can be downloaded as a bundle from the link provided.
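
For reference, here is a minimal sketch of what the mapper and reducer typically look like for a streaming word count, written for Python 2.7 to match the prerequisites; the bundled files may differ in detail. The contract is simple: the mapper emits one "word<TAB>1" line per input word, and the reducer, whose input Hadoop delivers sorted by key, sums the counts per word.

    #!/usr/bin/env python
    # mapper.py -- emit "word<TAB>1" for every word read from stdin
    import sys

    for line in sys.stdin:
        for word in line.strip().split():
            print '%s\t%d' % (word, 1)

And a matching reducer:

    #!/usr/bin/env python
    # reducer.py -- sum counts per word; input arrives sorted by key,
    # so all lines for a given word are adjacent
    import sys

    current_word = None
    current_count = 0

    for line in sys.stdin:
        word, count = line.strip().split('\t', 1)
        count = int(count)
        if word == current_word:
            current_count += count
        else:
            if current_word is not None:
                print '%s\t%d' % (current_word, current_count)
            current_word = word
            current_count = count

    # flush the last word
    if current_word is not None:
        print '%s\t%d' % (current_word, current_count)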

Prerequisites:

  • Hadoop 2.6.5
  • Python 2.7
  • Log in with your Hadoop user
  • Working directory should be set to /usr/local/hadoop.

Steps:

  1. Make sure you're in /usr/local/hadoop; if not, run:
    cd /usr/local/hadoop
  2. Start HDFS:
    start-dfs.sh
  3. Start YARN:
    start-yarn.sh
  4. Check that everything is up; jps should list six processes (NameNode, DataNode, SecondaryNameNode, ResourceManager, NodeManager, and Jps itself):
    jps
  5. Download the data and code files used in this tutorial from here.
  6. Unzip contents of streaming.zip:
    unzip streaming.zip
  7. Move mapper and reducer Python files to /home/$USER/:
    mv streaming/mapper.py /home/$USER/mapper.py
    mv streaming/reducer.py /home/$USER/reducer.py
    
  8. Create local working directories for the input data and for the output you will download when the Hadoop job finishes:
    mkdir /tmp/wordcount/
    mkdir /tmp/wordcount-output/
    
  9. Move sample.txt to /tmp/wordcount/:
    mv streaming/sample.txt /tmp/wordcount/
    
  10. Upload data to HDFS:
    hdfs dfs -copyFromLocal /tmp/wordcount /user/$USER/wordcount
    
  11. Submit the job (a quick local sanity check of the scripts is shown after this list):
    yarn jar share/hadoop/tools/lib/hadoop-streaming-2.6.5.jar \
        -file /home/$USER/mapper.py -mapper /home/$USER/mapper.py \
        -file /home/$USER/reducer.py -reducer /home/$USER/reducer.py \
        -input /user/$USER/wordcount/* \
        -output /user/$USER/wordcount-output
    
  12. When the job finishes, download output data:
    hdfs dfs -getmerge /user/$USER/wordcount-output /tmp/wordcount-output/output.txt
    
  13. View the word count output in the terminal:
    cat /tmp/wordcount-output/output.txt
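
Optional: before submitting the job in step 11, you can sanity-check the scripts locally. The pipeline below simulates the streaming job on a single machine, with sort standing in for Hadoop's shuffle-and-sort phase:

    cat /tmp/wordcount/sample.txt | python /home/$USER/mapper.py | sort | python /home/$USER/reducer.py

If this prints tab-separated word/count pairs, the same scripts should behave identically inside the streaming job.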
    

Note: Many common errors are documented in the comments section below.
References

  1. Hadoop Streaming
  2. Hadoop Installation

Comments

  • Error: PipeMapRed.waitOutputThreads(): subprocess failed with code 127
    Issue: When the Python files are created in a Windows environment, lines end with CRLF. My Hadoop runs on Linux, which expects LF, so the shebang line ends in a stray carriage return, the interpreter cannot be found, and the subprocess exits with code 127.
    Solution: After converting CRLF to LF, the step ran successfully.

    For the conversion we can use dos2unix (also known as fromdos), which converts text files from DOS format to Unix format.
    Step 1: Install dos2unix:
    $ sudo apt install dos2unix
    Step 2: Run the conversion:
    $ dos2unix /home/$USER/mapper.py
    $ dos2unix /home/$USER/reducer.py
    All done! Then run the Hadoop job again from /usr/local/hadoop:
    $ bin/hadoop jar share/hadoop/tools/lib/hadoop-*streaming*.jar \
        -file /home/hduser/mapper.py -mapper /home/hduser/mapper.py \
        -file /home/hduser/reducer.py -reducer /home/hduser/reducer.py \
        -input /user/hduser/wordcount/* \
        -output /user/hduser/wordcount-output
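
    If dos2unix is not available, the same conversion can be done with a few lines of Python 2.7 (matching the tutorial). The helper below is a sketch; strip_crlf.py is a hypothetical name, not part of the tutorial bundle:

    # strip_crlf.py -- hypothetical helper: rewrite each file given on the
    # command line in place with LF line endings (same effect as dos2unix)
    import sys

    for path in sys.argv[1:]:
        with open(path, 'rb') as f:
            data = f.read()
        with open(path, 'wb') as f:
            f.write(data.replace('\r\n', '\n'))

    Usage: python strip_crlf.py /home/$USER/mapper.py /home/$USER/reducer.py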

  • Hadoop: asks for root's password after running "start-all.sh"
    Solution:

    1) Generate an ssh key without a password:
    $ ssh-keygen -t rsa -P ""

    2) Append id_rsa.pub to authorized_keys:
    $ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys

    3) Check that ssh to localhost works without a password:
    $ ssh localhost

    4) Go to the Hadoop sbin directory, start Hadoop, and verify with jps:
    $ jps
    12373 Jps
    11823 SecondaryNameNode
    11643 DataNode
    12278 NodeManager
    11974 ResourceManager
    11499 NameNode

  • Hadoop: some folders cannot be removed from an HDFS directory
    Example: I created a "tmp" folder containing a child folder "tmp1", so the path looks like "tmp/tmp1". When I delete it with "hdfs dfs -rm -r /tmp", I get the error: "rm: Failed to get server trash configuration: null. Consider using -skipTrash option".

    Solution:
    $ hdfs dfs -rm -r -skipTrash /tmp
    Deleted /tmp
