Python MapReduce with Hadoop Streaming in Hortonworks Sandbox
The Hortonworks sandbox for the Hortonworks Data Platform (HDP) is a quick and easy personal desktop environment for getting started with learning, developing, testing and trying out new features. It saves the user from installing and configuring Hadoop and the surrounding tools. This article explains how to run the Python MapReduce word count example using Hadoop Streaming.

Requirements: The minimum system requirement is 8 GB+ RAM. Only if the host has 10 GB+ RAM can you comfortably allocate 8 GB to the VM; if you do not meet this requirement, you can try it on a cloud service such as Azure, AWS or Google Cloud. The examples in this article are based on HDP 2.3.2 running on Oracle VirtualBox hosted on Ubuntu 16.04.

Download and Installation: Follow the Hortonworks guide to install the sandbox on Oracle VirtualBox.

Steps:
Start sandbox image from VirtualBox
From Ubuntu’s web browser, log in to the dashboard at 127.0.0.1:8888 (username/password: raj_ops/raj_ops)
From the dashboard GUI, create a directory named input
Upload sample.txt to input using Ambari > Files View > Upload
Again from the web browser, log in to the HDP shell at 127.0.0.1:4200 (username/password: root/password)
From the shell, upload mapper.py and reducer.py using the following secure copy (scp) commands:
scp -P 2222 /home/username/Downloads/mapper.py root@sandbox.hortonworks.com:/
scp -P 2222 /home/username/Downloads/reducer.py root@sandbox.hortonworks.com:/

Run the job using:
hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-streaming.jar \
-input /input -output /output -mapper /mapper.py -reducer /reducer.py

Note: Do not create the output directory in advance; Hadoop will create it.
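The article does not show the contents of mapper.py and reducer.py, so here is a minimal sketch of what a Hadoop Streaming word-count pair typically looks like (the function names and the demo input are illustrative, not from the original). Both scripts read from stdin and write tab-separated key/value pairs to stdout; Streaming sorts the mapper output by key before it reaches the reducer, so the reducer can sum counts for consecutive equal words. For Streaming to execute the scripts directly, each file needs a shebang line and execute permission (chmod +x).

```python
#!/usr/bin/env python
# Sketch of the two Streaming scripts in one file for illustration;
# in practice mapper() goes in mapper.py and reducer() in reducer.py,
# each reading sys.stdin and printing its output line by line.
import sys

def mapper(lines):
    # mapper.py logic: emit "word\t1" for every word on stdin
    for line in lines:
        for word in line.split():
            yield "%s\t1" % word

def reducer(lines):
    # reducer.py logic: input is sorted by word, so equal keys are
    # adjacent; sum the counts and emit "word\ttotal" on key change
    current, count = None, 0
    for line in lines:
        word, n = line.rstrip("\n").split("\t")
        if word == current:
            count += int(n)
        else:
            if current is not None:
                yield "%s\t%d" % (current, count)
            current, count = word, int(n)
    if current is not None:
        yield "%s\t%d" % (current, count)

if __name__ == "__main__":
    # Local dry run, equivalent to the shell pipeline:
    #   cat sample.txt | ./mapper.py | sort | ./reducer.py
    demo = ["hello hadoop", "hello streaming"]
    for out in reducer(sorted(mapper(demo))):
        print(out)  # hadoop 1, hello 2, streaming 1 (tab-separated)
```

Running the same pipeline in the sandbox shell (cat sample.txt | python mapper.py | sort | python reducer.py) is a handy way to check the scripts before submitting the Hadoop job.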
Test output:
hadoop fs -cat /output/part-00000
real 1
my 2
is 2
but 1
kolkata 1
home 2
kutch 2