This MapReduce streaming example walks you through running a word count program with Hadoop Streaming. The mapper and reducer logic is written in Python, and the input data is stored in a file called sample.txt. The mapper, reducer, and data can be downloaded as a bundle from the link provided.
- Hadoop 2.6.5
- Python 2.7
- Log in as your Hadoop user
- Make sure your working directory is /usr/local/hadoop; if it is not, change to it:
cd /usr/local/hadoop
- Start HDFS:
start-dfs.sh
- Start YARN:
start-yarn.sh
- Check that everything is up (6 services should be listed):
jps
- Download the data and code files used in this tutorial from here.
- Unzip the contents of streaming.zip:
unzip streaming.zip
- Move the mapper and reducer Python files to /home/$USER/:
mv streaming/mapper.py /home/$USER/mapper.py
mv streaming/reducer.py /home/$USER/reducer.py
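For reference, typical Hadoop Streaming word-count logic in Python looks like the sketch below. This is a hypothetical implementation; the bundled mapper.py and reducer.py may differ in detail. The mapper emits one `word<TAB>1` record per word, and the reducer sums counts per word, relying on Hadoop's shuffle/sort phase to group identical keys together:

```python
# Sketch of streaming word-count logic (hypothetical; the bundled
# mapper.py/reducer.py may differ). In the real scripts, each function
# would be driven by `for line in sys.stdin` and its results printed.

def map_words(lines):
    # Mapper: emit "word<TAB>1" for every whitespace-separated word.
    for line in lines:
        for word in line.strip().split():
            yield "%s\t1" % word

def reduce_counts(lines):
    # Reducer: sum counts per word. Assumes input is sorted by key,
    # which Hadoop's shuffle/sort phase guarantees.
    current_word, current_count = None, 0
    for line in lines:
        word, count = line.strip().split("\t", 1)
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                yield "%s\t%d" % (current_word, current_count)
            current_word, current_count = word, int(count)
    if current_word is not None:
        yield "%s\t%d" % (current_word, current_count)

# Simulate map -> sort -> reduce on one line of input:
mapped = sorted(map_words(["hello world hello"]))
print(list(reduce_counts(mapped)))  # prints ['hello\t2', 'world\t1']
```

The same map-sort-reduce flow can also be tested on the command line with a plain Unix pipeline (mapper | sort | reducer) before submitting the job to the cluster.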
- Create working directories for storing the data and downloading the output when the Hadoop job finishes:
mkdir /tmp/wordcount/
mkdir /tmp/wordcount-output/
- Move sample.txt to /tmp/wordcount/:
mv streaming/sample.txt /tmp/wordcount/
- Upload data to HDFS:
hdfs dfs -copyFromLocal /tmp/wordcount /user/$USER/wordcount
- Submit the job:
yarn jar share/hadoop/tools/lib/hadoop-streaming-2.6.5.jar \
    -file /home/$USER/mapper.py -mapper /home/$USER/mapper.py \
    -file /home/$USER/reducer.py -reducer /home/$USER/reducer.py \
    -input /user/$USER/wordcount/* \
    -output /user/$USER/wordcount-output
- When the job finishes, download output data:
hdfs dfs -getmerge /user/$USER/wordcount-output /tmp/wordcount-output/output.txt
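Each line of the downloaded output.txt is a tab-separated word/count pair, as emitted by the reducer. If you want to post-process the results, a small Python helper (not part of the tutorial bundle; shown here assuming that `word<TAB>count` format) can load them into a dictionary:

```python
def parse_counts(lines):
    # Parse Hadoop Streaming word-count output ("word<TAB>count" per line)
    # into a {word: count} dictionary.
    counts = {}
    for line in lines:
        word, count = line.rstrip("\n").split("\t", 1)
        counts[word] = int(count)
    return counts

# Example with lines shaped like the reducer's output:
print(parse_counts(["hello\t2\n", "world\t1\n"]))
```

In practice you would pass an open file instead of a list, e.g. `parse_counts(open("/tmp/wordcount-output/output.txt"))`.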
- See the word count output in the terminal:
cat /tmp/wordcount-output/output.txt
Note: Many common errors are documented in the comments section; check there first if you run into trouble.