MapReduce Streaming Example: Running Your First Job on Hadoop

This MapReduce streaming example walks you through running a word count program with Hadoop Streaming. We use Python to write the mapper and reducer logic, and the input data is stored in a file named sample.txt. The mapper, the reducer, and the data can be downloaded as a bundle from the link provided. Prerequisites:

  • Hadoop 2.6.5
  • Python 2.7
  • Log in with your Hadoop user
  • Working directory should be set to /usr/local/hadoop.
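
You can quickly confirm the installed Hadoop and Python versions before starting:

    hadoop version
    python --version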

Steps:

  1. Make sure you are in /usr/local/hadoop; if not, change to it:

    cd /usr/local/hadoop

  2. Start HDFS:

    start-dfs.sh

  3. Start YARN:

    start-yarn.sh

  4. Check if everything is up (6 services should be running):

    jps
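
     In a typical single-node setup these six are the HDFS daemons (NameNode, DataNode, SecondaryNameNode), the YARN daemons (ResourceManager, NodeManager), and Jps itself, so the output should look roughly like this (PIDs will differ):

    12104 NameNode
    12235 DataNode
    12466 SecondaryNameNode
    12621 ResourceManager
    12748 NodeManager
    13317 Jps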

  5. Download the data and code files used in this tutorial from here.

  6. Unzip the contents of streaming.zip:

    unzip streaming.zip
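
     Judging by the later steps, the extracted streaming/ directory should contain at least mapper.py, reducer.py, and sample.txt; you can confirm with:

    ls streaming/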

  7. Move the mapper and reducer Python files to /home/$USER/ (a sketch of what these scripts typically contain follows the commands):

    mv streaming/mapper.py /home/$USER/mapper.py
    mv streaming/reducer.py /home/$USER/reducer.py
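
     The bundled scripts are the ones the job will actually use. For reference only, a typical Hadoop Streaming word-count mapper written in Python 2 looks roughly like this sketch (the bundled files may differ in detail):

    #!/usr/bin/env python
    # mapper.py: emit "word<TAB>1" for every word read from stdin
    import sys

    for line in sys.stdin:
        for word in line.strip().split():
            print '%s\t%s' % (word, 1)

     and a matching reducer, which relies on the streaming framework delivering the mapper output sorted by key:

    #!/usr/bin/env python
    # reducer.py: stdin arrives sorted by word, so counts can be
    # accumulated per word and flushed when the word changes
    import sys

    current_word = None
    current_count = 0

    for line in sys.stdin:
        word, _, count = line.strip().partition('\t')
        try:
            count = int(count)
        except ValueError:
            continue  # skip malformed lines
        if word == current_word:
            current_count += count
        else:
            if current_word is not None:
                print '%s\t%s' % (current_word, current_count)
            current_word = word
            current_count = count

    if current_word is not None:
        print '%s\t%s' % (current_word, current_count)

     Whichever versions you use, the scripts typically need a shebang line and execute permission (chmod +x) because the job command in step 11 invokes them directly.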

  8. Create local working directories for staging the input data and for collecting the output once the Hadoop job finishes:

    mkdir /tmp/wordcount/
    mkdir /tmp/wordcount-output/
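
     If you re-run the tutorial later, the plain mkdir calls will fail because the directories already exist; mkdir -p creates them only when missing:

    mkdir -p /tmp/wordcount/ /tmp/wordcount-output/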

  9. Move sample.txt to /tmp/wordcount/:

    mv streaming/sample.txt /tmp/wordcount/
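
     At this point you can sanity-check the scripts without Hadoop by mimicking the map/sort/reduce flow in a shell pipeline (this assumes the bundled scripts follow the usual stdin/stdout convention shown in the step 7 sketch):

    cat /tmp/wordcount/sample.txt | python /home/$USER/mapper.py | sort | python /home/$USER/reducer.py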

  10. Upload the data to HDFS:

    hdfs dfs -copyFromLocal /tmp/wordcount /user/$USER/wordcount
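
     You can confirm that the upload succeeded before submitting the job:

    hdfs dfs -ls /user/$USER/wordcount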

  11. Submit the job:

    yarn jar share/hadoop/tools/lib/hadoop-streaming-2.6.5.jar \
        -file /home/$USER/mapper.py -mapper /home/$USER/mapper.py \
        -file /home/$USER/reducer.py -reducer /home/$USER/reducer.py \
        -input /user/$USER/wordcount/\* \
        -output /user/$USER/wordcount-output
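
     While the job runs, you can watch its progress in the console output, or list running applications from another terminal:

    yarn application -list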

  12. When the job finishes, download output data:

    hdfs dfs -getmerge /user/$USER/wordcount-output /tmp/wordcount-output/output.txt

  13. View the word count output in the terminal:

    cat /tmp/wordcount-output/output.txt

Note: Many common errors are documented in the comments section; please check there for help.

References

  1. Hadoop Streaming
  2. Hadoop Installation