MapReduce Streaming Example: Running Your First Job on Hadoop

This MapReduce streaming example will help you run a word count program using Hadoop streaming. We use Python to write the mapper and reducer logic, and the input data is stored in a file called sample.txt. The mapper, reducer and data can be downloaded as a bundle from the link provided.

Prerequisites:

  • Hadoop 2.6.5
  • Python 2.7
  • Log in with your Hadoop user
  • Working directory set to /usr/local/hadoop

Steps:

1. Make sure you are in /usr/local/hadoop; if not, use: cd /usr/local/hadoop
2. Start HDFS: start-dfs.sh
3. Start YARN: start-yarn.sh
4. Check that everything is up (six services should be running): jps
5. Download the data and code files used in this tutorial from here.
6. Unzip the contents of streaming.zip: unzip streaming.zip
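If you want a sense of what the mapper and reducer in the bundle do, the sketch below shows a minimal Hadoop streaming word count in Python. It assumes the standard streaming convention of tab-separated key/value pairs on standard input and output; the file names mapper.py and reducer.py match the tutorial, but the exact code in streaming.zip may differ.

mapper.py:

#!/usr/bin/env python
# Emit "word<TAB>1" for every word read from standard input.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print('%s\t%s' % (word, 1))

reducer.py:

#!/usr/bin/env python
# Sum the counts for each word. Hadoop streaming sorts the mapper
# output by key, so identical words arrive on consecutive lines.
import sys

current_word = None
current_count = 0

for line in sys.stdin:
    try:
        word, count = line.strip().split('\t', 1)
        count = int(count)
    except ValueError:
        continue  # skip malformed lines
    if word == current_word:
        current_count += count
    else:
        if current_word is not None:
            print('%s\t%s' % (current_word, current_count))
        current_word = word
        current_count = count

if current_word is not None:
    print('%s\t%s' % (current_word, current_count))

Because streaming simply pipes data through the scripts, you can sanity-check the pair locally before submitting the job:

$ cat sample.txt | python mapper.py | sort | python reducer.py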

About the author

Devji Chhanga

I have taught computer science at the University of Kutch, the westernmost district of India, since 2011. At iDevji, I share tech stories that excite me. You will love reading the blog if you too believe in the disruptive power of technology. Some stories are purely technical, while others take an empathetic approach to problem solving with technology.

6 Comments


  • Error: PipeMapRed.waitOutputThreads(): subprocess failed with code 127
    Issue: When a Python file is created in a Windows environment, the newline characters are CRLF.
    My Hadoop runs on Linux, which expects the newline character LF.
    Solution: After converting CRLF to LF, the step ran successfully.

    For the CRLF-to-LF conversion we can use dos2unix (also known as fromdos), which converts text files from the DOS format to the Unix format.
    Steps:
    Step 1: Install dos2unix:
    $ sudo apt install dos2unix
    Step 2: Run the command on both scripts:
    $ dos2unix /home/$USER/mapper.py
    $ dos2unix /home/$USER/reducer.py
    All done! (A plain-Python alternative is sketched after the comments.)
    Then run the Hadoop job again from /usr/local/hadoop:
    $ bin/hadoop jar share/hadoop/tools/lib/hadoop-*streaming*.jar -file /home/hduser/mapper.py -mapper /home/hduser/mapper.py -file /home/hduser/reducer.py -reducer /home/hduser/reducer.py -input /user/hduser/wordcount/* -output /user/hduser/wordcount-output

  • Hadoop: asks for root's password after running "start-all.sh"
    Solution:

    1) Generate an ssh key without a password:
    $ ssh-keygen -t rsa -P ""

    2) Append id_rsa.pub to authorized_keys:
    $ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys

    3) Check that ssh to localhost works without a password:
    $ ssh localhost

    4) Now go to the Hadoop sbin directory, start Hadoop, and verify with jps:

    $ jps
    12373 Jps
    11823 SecondaryNameNode
    11643 DataNode
    12278 NodeManager
    11974 ResourceManager
    11499 NameNode

  • Hadoop: some folders cannot be deleted or removed from an HDFS directory
    Example: I created a "tmp" folder with a child folder tmp1, so it looks like "tmp/tmp1". When I delete this folder using the command "hdfs dfs -rm -r /tmp", it shows this error: "rm: Failed to get server trash configuration: null. Consider using -skipTrash option".

    Solution:
    $ hdfs dfs -rm -r -skipTrash /tmp
    Deleted /tmp    (output)
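The dos2unix fix described in the first comment can also be done with a few lines of plain Python if dos2unix is not installed. This is only a sketch: the script name crlf_to_lf.py is hypothetical and not part of the tutorial bundle; it rewrites the file named on the command line in place, replacing CRLF with LF.

crlf_to_lf.py:

#!/usr/bin/env python
# Hypothetical helper (not part of streaming.zip): convert Windows CRLF
# line endings to Unix LF in the file given as the first argument.
import sys

path = sys.argv[1]
with open(path, 'rb') as f:
    data = f.read()
with open(path, 'wb') as f:
    f.write(data.replace(b'\r\n', b'\n'))

Usage:

$ python crlf_to_lf.py /home/$USER/mapper.py
$ python crlf_to_lf.py /home/$USER/reducer.py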
