Python MapReduce with Hadoop Streaming in Hortonworks Sandbox

Hortonworks Sandbox for Hortonworks Data Platform (HDP) is a quick and easy personal desktop environment to get started on learning, developing, testing and trying out new features. It saves the user from the installation and configuration of Hadoop and other tools. This article explains how to run a Python MapReduce word count example using Hadoop Streaming.

Requirements: The minimum system requirement is 8 GB+ RAM; realistically, only a host with 10 GB+ RAM can spare 8 GB for the VM. If you do not meet this requirement, you can try it on cloud services such as Azure, AWS or Google Cloud. This article uses examples based on HDP 2.3.2 running on Oracle VirtualBox hosted on Ubuntu 16.04.

Download and Installation: Follow this guide from Hortonworks to install the sandbox on Oracle VirtualBox.

Steps:

  1. Download example code and data from here

  2. Start sandbox image from VirtualBox

  3. From Ubuntu’s web browser, log in to the dashboard at 127.0.0.1:8888 (username/password: raj_ops/raj_ops)

  4. From the dashboard GUI, create a directory named input

  5. Upload sample.txt to input using Ambari > Files View > Upload

  6. Again from the web browser, log in to the HDP shell at 127.0.0.1:4200 (username/password: root/password)

  7. From the Ubuntu host’s terminal, upload mapper.py and reducer.py to the sandbox using the following secure copy (scp) commands:

    scp -P 2222 /home/username/Downloads/mapper.py root@sandbox.hortonworks.com:/
    scp -P 2222 /home/username/Downloads/reducer.py root@sandbox.hortonworks.com:/

  8. Run the job using:

    hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-streaming.jar \
    -input /input -output /output -mapper /mapper.py -reducer /reducer.py

    Note: Do not create the output directory in advance; Hadoop creates it, and the job fails if it already exists. Also make sure both scripts are executable (chmod +x /mapper.py /reducer.py) so the streaming job can launch them.

  9. Test output:

    hadoop fs -cat /output/part-00000
    real 1
    my 2
    is 2
    but 1
    kolkata 1
    home 2
    kutch 2

References:

  1. Python MapReduce : Running Your First Hadoop Streaming Job
  2. Map Reduce Word Count With Python : The Simplest Tutorial

MapReduce Streaming Example : Running Your First Job on Hadoop

This MapReduce streaming example will help you run a word count program using Hadoop Streaming. We use Python to write the mapper and reducer logic. The data is stored in a sample.txt file; the mapper, reducer and data can be downloaded in a bundle from the link provided. Prerequisites:

  • Hadoop 2.6.5
  • Python 2.7
  • Log in with your Hadoop user
  • Working directory should be set to /usr/local/hadoop.

Steps:

  1. Make sure you’re in /usr/local/hadoop; if not, use:

    cd /usr/local/hadoop

  2. Start HDFS:

    start-dfs.sh

  3. Start YARN:

    start-yarn.sh

  4. Check if everything is up; six services should be running (typically NameNode, DataNode, SecondaryNameNode, ResourceManager, NodeManager, and Jps itself):

    jps

  5. Download data and code files to be used in this tutorial from here.

  6. Unzip contents of streaming.zip:

    unzip streaming.zip

  7. Move mapper and reducer Python files to /home/$USER/:

    mv streaming/mapper.py /home/$USER/mapper.py
    mv streaming/reducer.py /home/$USER/reducer.py

  8. Create working directories for storing data and downloading output when the Hadoop job finishes:

    mkdir /tmp/wordcount/
    mkdir /tmp/wordcount-output/

  9. Move sample.txt to /tmp/wordcount/:

    mv streaming/sample.txt /tmp/wordcount/

  10. Upload data to HDFS:

    hdfs dfs -copyFromLocal /tmp/wordcount /user/$USER/wordcount

  11. Submit the job:

    yarn jar share/hadoop/tools/lib/hadoop-streaming-2.6.5.jar \
      -file /home/$USER/mapper.py -mapper /home/$USER/mapper.py \
      -file /home/$USER/reducer.py -reducer /home/$USER/reducer.py \
      -input /user/$USER/wordcount/* \
      -output /user/$USER/wordcount-output

  12. When the job finishes, download the output data:

    hdfs dfs -getmerge /user/$USER/wordcount-output /tmp/wordcount-output/output.txt

  13. See the word count output on the terminal:

    cat /tmp/wordcount-output/output.txt

Note: Many common errors are documented in the comments section; please see the comments for help.

References:

  1. Hadoop Streaming
  2. Hadoop Installation

Map Reduce Word Count With Python : Learn Data Science

We spent multiple lectures talking about Hadoop architecture at the university. Yes, I even demonstrated the cool playing cards example! In fact, we have an 18-page PDF from our data science lab on the installation. Still, I saw students shy away, perhaps because of the complex installation process involved. This tutorial jumps straight to hands-on coding to help anyone get up and running with Map Reduce. No Hadoop installation is required.

Problem : Counting word frequencies (word count) in a file. Data : Create a sample.txt file with the following lines. Preferably, create a directory for this tutorial and put all the files there, including this one.

my home is kolkata
but my real home is kutch

Mapper : Create a file mapper.py and paste the code below into it. The mapper receives data from stdin, chunks it into words, and prints one tuple per word. Any UNIX/Linux user knows about the beauty of pipes; we’ll later use pipes to throw data from sample.txt to stdin.

#!/usr/bin/env python
import sys

# Get input lines from stdin
for line in sys.stdin:
    # Remove spaces from beginning and end of the line
    line = line.strip()

    # Split it into words
    words = line.split()

    # Output tuples on stdout
    for word in words:
        print '%s\t%s' % (word, "1")
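
For example, the first line of sample.txt (my home is kolkata) makes the mapper emit one tab-separated pair per word:

my	1
home	1
is	1
kolkata	1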

Reducer : Create a file reducer.py and paste the code below into it. The reducer reads the tuples generated by the mapper and aggregates them.

#!/usr/bin/env python
import sys

# Create a dictionary to map words to counts
wordcount = {}

# Get input from stdin
for line in sys.stdin:
    # Remove spaces from beginning and end of the line
    line = line.strip()

    # Parse the input from mapper.py
    word, count = line.split('\t', 1)

    # Convert count (currently a string) to int
    try:
        count = int(count)
    except ValueError:
        continue

    try:
        wordcount[word] = wordcount[word] + count
    except KeyError:
        wordcount[word] = count

# Write the tuples to stdout
# Currently tuples are unsorted
for word in wordcount.keys():
    print '%s\t%s' % (word, wordcount[word])
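
The final loop prints words in whatever order the dictionary yields them. If you would rather see the counts sorted, highest first, a small variation on that loop does it (a sketch reusing the same wordcount dictionary):

# Sort by count, highest first, before printing
for word, count in sorted(wordcount.items(), key=lambda kv: kv[1], reverse=True):
    print '%s\t%s' % (word, count)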

Execution : cd to the directory where all the files are kept and make both Python files executable:

chmod +x mapper.py
chmod +x reducer.py

And now we will feed the cat command to the mapper, and the mapper to the reducer, using pipes (|). That is, the output of cat goes to the mapper, and the mapper’s output goes to the reducer. (Recall that the cat command is used to display the contents of a file.)

cat sample.txt | ./mapper.py | ./reducer.py

Output :

real	1
kutch	1
is	2
but	1
kolkata	1
home	2
my	2

Yay, so we get the word counts: real x 1, kutch x 1, is x 2, but x 1, kolkata x 1, home x 2 and my x 2! You can put your questions in the comments section below!
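
A closing note on how this differs from a real Hadoop run: when these scripts execute under Hadoop Streaming (as in the earlier sections), the framework sorts the mapper’s output by key before it reaches the reducer, so all lines for a given word arrive consecutively. A streaming reducer can therefore aggregate on the fly without a dictionary. Below is a minimal sketch of that style, using the same tab-separated protocol as above; to simulate the shuffle locally you would insert sort between the two scripts (cat sample.txt | ./mapper.py | sort | ./reducer.py).

#!/usr/bin/env python
import sys

current_word = None
current_count = 0

# Input arrives sorted by word, so all counts for one word are consecutive
for line in sys.stdin:
    word, count = line.strip().split('\t', 1)
    try:
        count = int(count)
    except ValueError:
        continue
    if word == current_word:
        current_count += count
    else:
        # Key changed: emit the finished word before starting the next one
        if current_word is not None:
            print '%s\t%s' % (current_word, current_count)
        current_word = word
        current_count = count

# Emit the last word
if current_word is not None:
    print '%s\t%s' % (current_word, current_count)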