Python MapReduce with Hadoop Streaming in Hortonworks Sandbox


The Hortonworks Sandbox for Hortonworks Data Platform (HDP) is a quick and easy personal desktop environment for learning, developing, testing and trying out new features. It saves the user from installing and configuring Hadoop and other tools. This article explains how to run a Python MapReduce word count example using Hadoop Streaming.

Requirements:

The sandbox VM needs at least 8 GB of RAM, so in practice the host machine should have 10 GB or more to run it. If your machine does not meet this requirement, you can try the sandbox on a cloud service such as Azure, AWS or Google Cloud.

This article uses examples based on HDP 2.3.2 running in Oracle VirtualBox on an Ubuntu 16.04 host.

Download and Installation: Follow this guide from Hortonworks to install the sandbox on Oracle VirtualBox.
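The steps in this article use three ports on the host: 8888 for the dashboard, 4200 for the web shell and 2222 for SSH/scp. Once the VM has booted, a quick reachability check can help narrow down connection problems. The following is a minimal sketch, not part of the Hortonworks tooling: the helper name port_open is hypothetical, and it assumes the sandbox ports are forwarded to 127.0.0.1 on the host.

```python
import socket


def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        conn = socket.create_connection((host, port), timeout)
    except (socket.error, socket.timeout):
        # Refused, unreachable or timed out: treat all as "not reachable".
        return False
    conn.close()
    return True


if __name__ == "__main__":
    # Port numbers come from the steps in this article; labels are descriptive only.
    checks = [("dashboard", 8888), ("web shell", 4200), ("ssh/scp", 2222)]
    for label, port in checks:
        state = "reachable" if port_open("127.0.0.1", port) else "not reachable"
        print("%-10s port %d: %s" % (label, port, state))
```

If a port shows as not reachable, the VM may still be booting, or the port forwarding configured for the sandbox image may differ from what this article assumes.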

Steps:

  1. Download the example code and data from here
  2. Start the sandbox image from VirtualBox
  3. From Ubuntu’s web browser, log in to the dashboard at 127.0.0.1:8888 (username/password: raj_ops/raj_ops)
  4. From the dashboard GUI, create the directory /input
  5. Upload sample.txt to /input using Ambari > Files View > Upload
  6. Again from the web browser, log in to the HDP shell at 127.0.0.1:4200 (username/password: root/password)
  7. From a terminal on the Ubuntu host, copy mapper.py and reducer.py to the sandbox using the following secure copy (scp) commands:
    scp -P 2222 /home/username/Downloads/mapper.py root@sandbox.hortonworks.com:/
    scp -P 2222 /home/username/Downloads/reducer.py root@sandbox.hortonworks.com:/
  8. Run the job from the HDP shell using:
    hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-streaming.jar \
     -input /input -output /output -mapper /mapper.py -reducer /reducer.py

    Note: Do not create the output directory in advance. Hadoop creates it, and the job fails if the directory already exists.

  9. Check the output:
    hadoop fs -cat /output/part-00000
    real 1
    my 2
    is 2
    but 1
    kolkata 1
    home 2
    kutch 2
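The mapper.py and reducer.py scripts used in the steps above come from the download and are not reproduced here. As an illustration of the contract Hadoop Streaming expects (the mapper emits tab-separated key/value lines, the framework sorts them by key, and the reducer sums consecutive lines with the same key), here is a minimal sketch. The function names map_words, reduce_counts and run_locally are hypothetical, the demo input is made up, and the downloaded scripts may be written differently.

```python
from itertools import groupby


def map_words(lines):
    """Mapper logic: emit one 'word<TAB>1' record per word."""
    for line in lines:
        for word in line.strip().lower().split():
            yield "%s\t1" % word


def reduce_counts(sorted_records):
    """Reducer logic: sum the counts of consecutive records sharing a key."""
    pairs = (record.split("\t", 1) for record in sorted_records)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield word, sum(int(count) for _, count in group)


def run_locally(lines):
    """Imitate Streaming on one machine: map, sort by key, then reduce."""
    return dict(reduce_counts(sorted(map_words(lines))))


if __name__ == "__main__":
    # Made-up input for illustration; this is not the article's sample.txt.
    demo = ["my home is kutch", "my home is kolkata"]
    for word, total in sorted(run_locally(demo).items()):
        print("%s\t%d" % (word, total))
```

In an actual Streaming job the two functions would live in separate scripts, with the mapper reading sys.stdin and printing records, and the reducer doing the same with the sorted records. Streaming launches the mapper and reducer as executables, so scripts passed by path as in step 8 generally need a shebang line and execute permission. A common way to dry-run such scripts before submitting the job is a shell pipeline like cat sample.txt | python mapper.py | sort | python reducer.py.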

References:

  1. Python MapReduce : Running Your First Hadoop Streaming Job
  2. Map Reduce Word Count With Python : The Simplest Tutorial

Comments:

Vivian Anto (Guest):

Hello sir,
I'm not able to access the HDP shell using 127.0.0.1:4200 (the browser cannot display the website).
I have finished up to step 5.
Can you please help?

Emma (Guest):

I cannot do it either. It just prints:

“ssh: connect to host sandbox.hortonworks.com port 2222: Connection refused
lost connection”