MapReduce Streaming Example: Running Your First Job on Hadoop
This MapReduce streaming example walks you through running a word count program with Hadoop Streaming. We use Python to write the mapper and reducer logic, and the data is stored in a sample.txt file. The mapper, reducer, and data can be downloaded as a bundle from the link provided; sketches of what the mapper and reducer typically look like follow the prerequisites below. Prerequisites:
- Hadoop 2.6.5
- Python 2.7
- Log in with your Hadoop user
- Working directory should be set to /usr/local/hadoop.
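For reference, a typical Hadoop Streaming word count mapper and reducer look roughly like the following sketches; the mapper.py and reducer.py bundled with this tutorial may differ in the details.

The mapper reads lines from standard input and emits a tab-separated "word 1" pair for every word:

#!/usr/bin/env python
# mapper.py - minimal word count mapper for Hadoop Streaming
import sys

for line in sys.stdin:
    # Emit a "word<TAB>1" pair for every whitespace-separated token
    for word in line.strip().split():
        print '%s\t%s' % (word, 1)

The reducer receives the mapper output sorted by key and sums the counts per word:

#!/usr/bin/env python
# reducer.py - minimal word count reducer for Hadoop Streaming
import sys

current_word = None
current_count = 0

for line in sys.stdin:
    # Each input line is "word<TAB>count", grouped and sorted by word
    word, count = line.strip().split('\t', 1)
    count = int(count)
    if word == current_word:
        current_count += count
    else:
        if current_word is not None:
            # A new word has started, so emit the previous word's total
            print '%s\t%s' % (current_word, current_count)
        current_word = word
        current_count = count

# Emit the last word
if current_word is not None:
    print '%s\t%s' % (current_word, current_count)

You can sanity-check such scripts locally, without Hadoop, by piping the data through them:

cat sample.txt | python mapper.py | sort | python reducer.py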
Steps:
Make sure you’re in /usr/local/hadoop; if not, change into it:
cd /usr/local/hadoop
Start HDFS:
start-dfs.sh
Start YARN:
start-yarn.sh
Check that everything is up (six processes should be running: NameNode, DataNode, SecondaryNameNode, ResourceManager, NodeManager, and Jps itself):
jps
Download the data and code files (streaming.zip) used in this tutorial from the link provided.
Unzip contents of streaming.zip:
unzip streaming.zip
Move the mapper and reducer Python files to /home/$USER/:
mv streaming/mapper.py /home/$USER/mapper.py
mv streaming/reducer.py /home/$USER/reducer.py
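If the scripts are not already executable, give them execute permission so Hadoop Streaming can launch them directly (they should start with a #!/usr/bin/env python shebang line):
chmod +x /home/$USER/mapper.py /home/$USER/reducer.py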
Create working directories for storing the data and for downloading the output when the Hadoop job finishes:
mkdir /tmp/wordcount/
mkdir /tmp/wordcount-output/
Move sample.txt to /tmp/wordcount/:
mv streaming/sample.txt /tmp/wordcount/
Upload data to HDFS:
hdfs dfs -copyFromLocal /tmp/wordcount /user/$USER/wordcount
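Optionally, verify that the files landed in HDFS:
hdfs dfs -ls /user/$USER/wordcount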
Submit the job:
yarn jar share/hadoop/tools/lib/hadoop-streaming-2.6.5.jar \
-file /home/$USER/mapper.py -mapper /home/$USER/mapper.py \
-file /home/$USER/reducer.py -reducer /home/$USER/reducer.py \
-input /user/$USER/wordcount/* \
-output /user/$USER/wordcount-output
Here -file ships the scripts to the cluster nodes, -mapper and -reducer tell Streaming which commands to run for each phase, and -input and -output are HDFS paths.
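Note that the job will fail if the output directory already exists in HDFS. If you need to re-run the job, remove the old output first:
hdfs dfs -rm -r /user/$USER/wordcount-output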
When the job finishes, download the output data (getmerge concatenates the part files in HDFS into a single local file):
hdfs dfs -getmerge /user/$USER/wordcount-output /tmp/wordcount-output/output.txt
View the word count output in the terminal:
cat /tmp/wordcount-output/output.txt
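With a reducer like the sketch above, each line of output.txt is a word and its count separated by a tab. As an optional, purely illustrative helper (not part of the bundle; the name top_words.py is made up here), a short script can list the most frequent words:

#!/usr/bin/env python
# top_words.py - print the 10 most frequent words from the merged output (optional, illustrative)

counts = []
with open('/tmp/wordcount-output/output.txt') as f:
    for line in f:
        # Each line is "word<TAB>count"
        word, count = line.rstrip('\n').split('\t', 1)
        counts.append((int(count), word))

# Sort by count, highest first, and print the top 10
for count, word in sorted(counts, reverse=True)[:10]:
    print '%s\t%d' % (word, count)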
Note: Many common errors are documented in the comments section; please check there for help.