Python MapReduce with Hadoop Streaming in Hortonworks Sandbox
The Hortonworks sandbox for the Hortonworks Data Platform (HDP) is a quick and easy personal desktop environment for getting started with learning, developing, testing and trying out new features. It saves the user from installing and configuring Hadoop and the surrounding tools. This article explains how to run the Python MapReduce word count example using Hadoop Streaming.

Requirements: The minimum system requirement is 8 GB+ RAM. Only if the host has 10 GB+ RAM can you comfortably allocate 8 GB to the VM; if you do not meet this requirement, you can try it on a cloud service such as Azure, AWS or Google Cloud. The examples in this article are based on HDP 2.3.2 running on Oracle VirtualBox hosted on Ubuntu 16.04.

Download and Installation: Follow the Hortonworks guide to install the sandbox on Oracle VirtualBox.

Steps:
Start sandbox image from VirtualBox
From Ubuntu’s web browser, log in to the dashboard at 127.0.0.1:8888 (username/password: raj_ops/raj_ops)
From the dashboard GUI, create a directory named input
Upload sample.txt to input using Ambari > Files View > Upload
Again from the web browser, log in to the HDP shell at 127.0.0.1:4200 (username/password: root/password)
From the shell, upload mapper.py and reducer.py using the following secure copy (scp) commands:
scp -P 2222 /home/username/Downloads/mapper.py root@sandbox.hortonworks.com:/
scp -P 2222 /home/username/Downloads/reducer.py root@sandbox.hortonworks.com:/

Run the job using:
hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-streaming.jar \
-input /input -output /output -mapper /mapper.py -reducer /reducer.py

Note: Do not create the output directory in advance; Hadoop will create it.
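The article does not show the contents of mapper.py and reducer.py, so here is a minimal sketch of what a Hadoop Streaming word-count pair typically looks like (the function names and the demo input are illustrative, not from the original). Both scripts read from stdin and write tab-separated key/value pairs to stdout; Streaming sorts the mapper output by key before it reaches the reducer, so the reducer can sum counts for consecutive equal words. For Streaming to execute the scripts directly, each file needs a shebang line and execute permission (chmod +x).

```python
#!/usr/bin/env python
# Sketch of the two Streaming scripts in one file for illustration;
# in practice mapper() goes in mapper.py and reducer() in reducer.py,
# each reading sys.stdin and printing its output line by line.
import sys

def mapper(lines):
    # mapper.py logic: emit "word\t1" for every word on stdin
    for line in lines:
        for word in line.split():
            yield "%s\t1" % word

def reducer(lines):
    # reducer.py logic: input is sorted by word, so equal keys are
    # adjacent; sum the counts and emit "word\ttotal" on key change
    current, count = None, 0
    for line in lines:
        word, n = line.rstrip("\n").split("\t")
        if word == current:
            count += int(n)
        else:
            if current is not None:
                yield "%s\t%d" % (current, count)
            current, count = word, int(n)
    if current is not None:
        yield "%s\t%d" % (current, count)

if __name__ == "__main__":
    # Local dry run, equivalent to the shell pipeline:
    #   cat sample.txt | ./mapper.py | sort | ./reducer.py
    demo = ["hello hadoop", "hello streaming"]
    for out in reducer(sorted(mapper(demo))):
        print(out)  # hadoop 1, hello 2, streaming 1 (tab-separated)
```

Running the same pipeline in the sandbox shell (cat sample.txt | python mapper.py | sort | python reducer.py) is a handy way to check the scripts before submitting the Hadoop job.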
Test output:
hadoop fs -cat /output/part-00000
real 1
my 2
is 2
but 1
kolkata 1
home 2
kutch 2