An Introduction to Hadoop and the Hadoop Ecosystem

Welcome to the Hadoop and Big Data series! This is the first article in the series, where we present an introduction to Hadoop and its ecosystem.

In the beginning

In October 2003, a paper titled The Google File System (Ghemawat et al.) was published. The paper describes the design and implementation of a scalable distributed file system. This paper, along with another paper on MapReduce, inspired Doug Cutting and Mike Cafarella to create what is now known as Hadoop. Eventually project development was taken over by the Apache Software Foundation, hence the name Apache Hadoop.

What is in the name?

The choice of the name Hadoop sparks curiosity, but it is not computing jargon and there is no logic associated with the choice. Cutting could not find a name for the new project, so he called it Hadoop: "Hadoop" was the name his son had given to his stuffed yellow elephant toy!

Why do we need Hadoop?

When it comes to processing huge amounts (I mean really huge!) of data, Hadoop is really useful. Without Hadoop, processing such data was only possible with specialized hardware, call it a supercomputer! The key advantage Hadoop brings is that it runs on commodity hardware. You can actually set up a working Hadoop cluster using a couple of ordinary laptops.

Is Hadoop free?

Hadoop is completely free: free as in it costs nothing, and free as in you are free to modify it to suit your own needs. It is licensed under the Apache License 2.0.

Core components of Hadoop

  1. HDFS: HDFS, or the Hadoop Distributed File System, is the component responsible for storing files in a distributed manner. It is a robust file system which provides integrity, redundancy and other services. It has two main components: the NameNode and the DataNode.
  2. MapReduce: MapReduce provides a programming model for parallel computations. It has two main operations: Map and Reduce. MapReduce 2.0 is sometimes referred to as YARN. A toy Python illustration of the model follows this list.
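
To get a feel for the Map and Reduce operations before touching Hadoop itself, here is a toy Python sketch (plain Python, no Hadoop involved, names chosen only for illustration) that counts words with a map step, a grouping step and a reduce step:

from functools import reduce
from collections import defaultdict

lines = ["big data is big", "hadoop processes big data"]

# Map: emit a (word, 1) pair for every word in every line
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle: group the emitted values by key (the word)
grouped = defaultdict(list)
for word, one in mapped:
    grouped[word].append(one)

# Reduce: sum the grouped values for each key
counts = {word: reduce(lambda a, b: a + b, ones) for word, ones in grouped.items()}
print(counts)  # {'big': 3, 'data': 2, 'is': 1, 'hadoop': 1, 'processes': 1}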

Introduction to Hadoop ecosystem

The Hadoop ecosystem refers to the collection of products which work with Hadoop. Each product addresses a different task. For example, using Ambari we can easily install and manage clusters. At this point, there is no need to dive into the details of each product. All of these products are from the Apache Software Foundation and are free under the Apache License 2.0.

Setting up Apache Hadoop Single Node Cluster

This guide will help you install a single-node Apache Hadoop cluster on your machine.

System Requirements

  • Ubuntu 16.04
  • Java 8 Installed

1. Download Hadoop

wget https://archive.apache.org/dist/hadoop/core/hadoop-2.7.0/hadoop-2.7.0.tar.gz

2. Prepare for Installation

tar xfz hadoop-2.7.0.tar.gz
sudo mv hadoop-2.7.0 /usr/local/hadoop

3. Create Dedicated Group and User

sudo addgroup hadoop
sudo adduser --ingroup hadoop hduser
sudo adduser hduser sudo

4. Switch to Newly Created User Account

su - hduser

5. Add Variables to ~/.bashrc

#Begin Hadoop Variables

export JAVA_HOME=/usr/lib/jvm/java-8-oracle
export HADOOP_HOME=/usr/local/hadoop
export HADOOP_INSTALL=/usr/local/hadoop
export PATH=$PATH:$HADOOP_INSTALL/bin
export PATH=$PATH:$HADOOP_INSTALL/sbin
export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export HADOOP_HDFS_HOME=$HADOOP_INSTALL
export YARN_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_INSTALL/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_INSTALL/lib"

#End Hadoop Variables

6. Source ~/.bashrc

source ~/.bashrc

7. Set Java Home for Hadoop

  • Open the file: /usr/local/hadoop/etc/hadoop/hadoop-env.sh
  • Find and edit the line as:
export JAVA_HOME=/usr/lib/jvm/java-8-oracle

8. Edit core-site.xml

  • Open the file: /usr/local/hadoop/etc/hadoop/core-site.xml
  • Add the following lines between the <configuration> ... </configuration> tags:

<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:9000</value>
</property>

9. Edit yarn-site.xml

  • Open the file: /usr/local/hadoop/etc/hadoop/yarn-site.xml
  • Add the following lines between the <configuration> ... </configuration> tags:

<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
  <value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>


10. Edit mapred-site.xml

  • Copy the mapred-site.xml template first using:

cp /usr/local/hadoop/etc/hadoop/mapred-site.xml.template /usr/local/hadoop/etc/hadoop/mapred-site.xml

  • Open the file: /usr/local/hadoop/etc/hadoop/mapred-site.xml
  • Add the following lines between the <configuration> ... </configuration> tags:

<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>


11. Edit hdfs-site.xml

First, create the following directories:

sudo mkdir -p /usr/local/hadoop_store/hdfs/namenode
sudo mkdir -p /usr/local/hadoop_store/hdfs/datanode
sudo chown hduser:hadoop -R /usr/local/hadoop_store
sudo chmod 777 -R /usr/local/hadoop_store


Now open /usr/local/hadoop/etc/hadoop/hdfs-site.xml and add the following between the <configuration> ... </configuration> tags:

<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>
<property>
  <name>dfs.namenode.name.dir</name>
  <value>file:/usr/local/hadoop_store/hdfs/namenode</value>
</property>
<property>
  <name>dfs.datanode.data.dir</name>
  <value>file:/usr/local/hadoop_store/hdfs/datanode</value>
</property>


12. Format NameNode

cd /usr/local/hadoop/
bin/hdfs namenode -format


13. Start Hadoop Daemons

cd /usr/local/hadoop/
sbin/start-dfs.sh
sbin/start-yarn.sh


14. Check Service Status

jps


15. Check Running Jobs

Type in browser's address bar:

http://localhost:8088


Done!

Configure Anaconda on Emacs

Perhaps my quest for an ultimate IDE ends with Emacs. My goal was to use Emacs as a full-fledged Python IDE. This post describes how to set up Anaconda on Emacs. My setup:

OS: Trisquel 8.0
Emacs: GNU Emacs 25.3.2

Quick Key Guide (See full guide) :

C-x = Ctrl + x
M-x = Alt + x
RET = ENTER

1. Downloading and installing Anaconda

1.1 Download: Download Anaconda from here. You should download the Python 3.x version, as Python 2 runs out of support in 2020. You don't need Python 3.x already installed on your machine; the install script provides it.

1.2 Install:

cd ~/Downloads
bash Anaconda3-2018.12-Linux-x86.sh

2. Adding Anaconda to Emacs

2.1 Adding MELPA to Emacs: The Emacs package anaconda-mode can be used. This package is on the MELPA repository, and Emacs 25 requires this repository to be added explicitly. Important: follow this post on how to add MELPA to Emacs.

2.2 Installing the anaconda-mode package on Emacs

M-x package-install RET
anaconda-mode RET

2.3 Configure anaconda-mode in Emacs

echo "(add-hook 'python-mode-hook 'anaconda-mode)" >> ~/.emacs.d/init.el

3. Running your first script on Anaconda from Emacs

3.1 Create new .py file

C-x C-f
HelloWorld.py RET

3.2 Add the code

print("Hello World from Emacs")

3.3 Running it

C-c C-p
C-c C-c

Output

Python 3.7.1 (default, Dec 14 2018, 19:46:24)
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.

python.el: native completion setup loaded
Hello World from Emacs

I was encouraged to use Emacs by Codingquark; errors and omissions should be reported in the comments. Cheers!

MapReduce Real World Example in Python : Learn Data Science

A MapReduce real-world example on e-commerce transaction data is described here using Python streaming. The example does not require a Hadoop installation; however, if you already have Hadoop installed, it will run just fine on it. The Python programming language is used because it is easy to read and understand. A real-world e-commerce transaction dataset from a UK-based retailer is used. The best way to learn from this example is to use an Ubuntu machine with Python 2 or 3 installed on it.

Outline

  1. The dataset consists of real-world e-commerce data from a UK-based retailer
  2. The dataset is provided by Kaggle
  3. It contains 542,000 records (which is not small)
  4. Our goal is to find the country-wise total sales
  5. The mapper multiplies quantity and unit price
  6. The mapper emits a key-value pair of country, sales
  7. The reducer sums up all pairs for the same country
  8. The final output is country, sales for all countries

The Data

Download: Link to Kaggle Dataset
Source: The dataset has real-life transaction data from a UK retailer.
Format: CSV
Size: 43.4 MB (542,000 records)
Columns:

  1. InvoiceNo
  2. StockCode
  3. Description
  4. Quantity
  5. InvoiceDate
  6. UnitPrice
  7. CustomerID
  8. Country

The Problem

In this MapReduce real-world example, we calculate the total sales for each country from the given dataset.

The Approach

First, our data does not have a Total column, so it has to be computed from the Quantity and UnitPrice columns as Total = Quantity * UnitPrice.
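
As a quick illustration (the row below is made up to mimic the dataset's column order; it is not an actual record), the total for a single CSV row can be computed like this:

# Hypothetical row in the dataset's column order
row = ['536365', '85123A', 'WHITE HANGING HEART', '6', '12/01/2010 8:26', '2.55', '17850', 'United Kingdom']
total = int(row[3]) * float(row[5])  # Quantity * UnitPrice
print(row[7], total)                 # United Kingdom 15.3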

What Mapper Does

  1. Read the data
  2. Convert data into proper format
  3. Calculate total
  4. Print output as key-value pair CountryName:Total

What Reducer Does

  1. Read input from mapper
  2. Check for an existing country key in the dictionary
  3. Add total to existing total value
  4. Print all key-value pairs

See this article on how to run this code

Python Code for Mapper (MapReduce Real World Example)

#!/usr/bin/env python
import sys

# Get input lines from stdin
for line in sys.stdin:
    # Remove spaces from beginning and end of the line
    line = line.strip()

    # Split it into tokens
    tokens = line.split(',')

    # Get country, price and quantity values
    try:
        country = tokens[7]
        price = float(tokens[5])
        qty = int(tokens[3])
        print('%s\t%s' % (country, price * qty))
    except ValueError:
        pass

Python Code for Reducer (MapReduce Real World Example)

#!/usr/bin/env python
import sys

# Create a dictionary to map countries to totals
countrySales = {}

# Get input from stdin
for line in sys.stdin:
    # Remove spaces from beginning and end of the line
    line = line.strip()

    # Parse the input from mapper.py
    country, total = line.split('\t', 1)

    # Convert total (currently a string) to float; skip malformed lines
    try:
        total = float(total)
    except ValueError:
        continue

    # Update the running total for this country
    try:
        countrySales[country] = countrySales[country] + total
    except KeyError:
        countrySales[country] = total

# Write the tuples to stdout
for country in countrySales.keys():
    print('%s\t%s' % (country, countrySales[country]))

Output

Country Sales
Canada 3599.68
Brazil 1143.6
Italy 16506.03
Czech Republic 707.72
USA 1730.92
Lithuania 1661.06
Unspecified 4746.65
France 197194.15
Norway 34908.13
Bahrain 548.4
Israel 7867.42
Australia 135330.19
Singapore 9054.69
Iceland 4299.8
Channel Islands 19950.54
Germany 220791.78
Belgium 40752.83
European Community 1291.75
Hong Kong 10037.84
Spain 54632.86
EIRE 262112.48
Netherlands 283440.66
Denmark 18665.18
Poland 7193.34
Finland 22226.69
Saudi Arabia 131.17
Sweden 36374.15
Malta 2503.19
Switzerland 56199.23
Portugal 29272.34
United Arab Emirates 1877.08
Lebanon 1693.88
RSA 1002.31
United Kingdom 8148025.164
Austria 10149.28
Greece 4644.82
Japan 34616.06
Cyprus 12791.31

Conclusions

  • The mapper picks up a record and emits country and total for that record
  • The mapper repeats this process for all 542,000 records
  • Now, we have 542,000 key-value pairs
  • The reducer's role is to combine these pairs until all keys are unique!

If you have questions, please feel free to comment below.

Apache Log Visualization with Matplotlib : Learn Data Science

This post discusses Apache log visualization with the Matplotlib library. First, download the data file used in this example from here.

We will require numpy and matplotlib

In [1]:

import numpy as np
import matplotlib.pyplot as plt

numpy.loadtxt() can load a text file directly into an array. requests-idevji.txt contains only the hour at which each request was made; this is achieved by pre-processing the Apache log.

In [2]:

data = np.loadtxt('requests-idevji.txt')

We need 24 bins because we have 24 hours’ data. For other attributes of hist() see references.

In [3]:

plt.hist(data, bins=24)
plt.title("Requests @ iDevji")
plt.xlabel("Hours")
plt.ylabel("# Requests")

Out[3]:

Text(0,0.5,'# Requests')

In [4]:

plt.show()

[Figure: histogram of requests per hour, titled "Requests @ iDevji", with hours on the x-axis and number of requests on the y-axis]

References:

  1. numpy.loadtxt(), SciPy.org
  2. PyPlot API, Matplotlib.org

Building a Movie Recommendation Service with Apache Spark

In this tutorial I'll show you how to build a movie recommendation service with Apache Spark. Two users are alike if they rate products similarly. For example, if Alice rated a book 3/5 and Bob rated the same book 3.3/5, they are very much alike. Now if Bob buys another book and rates it 4/5, we should suggest that book to Alice; that is what a recommender system does. See the references if you want to know more about how recommender systems work. We are going to use the Alternating Least Squares method from MLlib and the MovieLens 100K dataset, which is only 5 MB in size. Download the dataset from https://grouplens.org/datasets/movielens/. Code:

from pyspark.mllib.recommendation import ALS,MatrixFactorizationModel, Rating
from pyspark import SparkContext

sc = SparkContext ()

#Replace filepath with appropriate data
movielens = sc.textFile("filepath/u.data")

movielens.first() #u'196\t242\t3\t881250949'
movielens.count() #100000

#Clean up the data by splitting it,
#movielens readme says the data is split by tabs and
#is user product rating timestamp
clean_data = movielens.map(lambda x: x.split('\t'))

#We’ll need to map the movielens data to a Ratings object
#A Ratings object is made up of (user, item, rating)
mls = movielens.map(lambda l: l.split('\t'))
ratings = mls.map(lambda x: Rating(int(x[0]),\
int(x[1]), float(x[2])))

#Setting up the parameters for ALS
rank = 5 # Latent Factors to be made
numIterations = 10 # Times to repeat process

#Need a training and test set, test set is not used in this example.
train, test = ratings.randomSplit([0.7,0.3],7856)

#Create the model on the training data
model = ALS.train(train, rank, numIterations)

#For product X, find N users to sell to

model.recommendUsers(242,100)

#For user Y, find N products to promote

model.recommendProducts(196,10)

#Predict Single Product for Single User
model.predict(196, 242)
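
The test split created above is not used in the post. Purely as a hedged aside, here is a minimal sketch of how the held-out ratings could be used to compute a mean squared error for the model; it assumes the model and test variables defined above:

# Sketch: evaluate the ALS model on the held-out test split
testdata = test.map(lambda r: (r.user, r.product))
predictions = model.predictAll(testdata).map(lambda r: ((r.user, r.product), r.rating))
ratesAndPreds = test.map(lambda r: ((r.user, r.product), r.rating)).join(predictions)
MSE = ratesAndPreds.map(lambda p: (p[1][0] - p[1][1]) ** 2).mean()
print('Test MSE = %s' % MSE)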

References:

  1. Building a Recommender System in Spark with ALS, LearnByMarketing.com
  2. MovieLens
  3. Video : Collaborative Filtering, Stanford University
  4. Matrix Factorisation and Dimensionality Reduction, Thierry Silbermann
  5. Building a Recommendation Engine with Spark, Nick Pentreath, Packt

GraphFrames PySpark Example : Learn Data Science

In this post, a GraphFrames PySpark example is discussed with a shortest path problem. GraphFrames is a Spark package that allows DataFrame-based graphs in Spark. Spark version 1.6.2 is used for all examples. Including the package with the PySpark shell:

pyspark --packages graphframes:graphframes:0.1.0-spark1.6

Code:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
from pyspark import SparkContext
from pyspark.sql import SQLContext
sc = SparkContext ()
sqlContext = SQLContext(sc)

# create vertex DataFrame for users with id and name attributes
v = sqlContext.createDataFrame([
("a", "Alice"),
("b", "Bob"),
("c", "Charlie"),
], ["id", "name"])

# create edge DataFrame with "src" and "dst" attributes
e = sqlContext.createDataFrame([
("a", "b", "friends"),
("b", "c", "follow"),
("c", "b", "follow"),
], ["src", "dst", "relationship"])

# create a GraphFrame with v, e
from graphframes import *
g = GraphFrame(v, e)

# example : getting in-degrees of each vertex
g.inDegrees.show()

Output:

id inDegree
b 2
c 1

example : getting “follow” relationships in the graph

1
g.edges.filter("relationship = 'follow'").count()

Output:

2

getting shortest paths to “a” from each vertex

1
2
results = g.shortestPaths(landmarks=["a"])
results.select("id", "distances").show()

Feel free to ask your questions in the comments section!

Logistic Regression with Spark : Learn Data Science

Logistic regression with Spark is achieved using MLlib. Logistic regression returns binary class labels, that is, "0" or "1". In this example, we consider a dataset that consists of only one variable, "study hours", and the class label is whether the student passed (1) or did not pass (0).

from pyspark import SparkContext
import numpy as np
from numpy import array
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.classification import LogisticRegressionWithLBFGS

sc = SparkContext()

def createLabeledPoints(label, points):
    return LabeledPoint(label, points)

studyHours = [
    [0, [0.5]],
    [0, [0.75]],
    [0, [1.0]],
    [0, [1.25]],
    [0, [1.5]],
    [0, [1.75]],
    [1, [1.75]],
    [0, [2.0]],
    [1, [2.25]],
    [0, [2.5]],
    [1, [2.75]],
    [0, [3.0]],
    [1, [3.25]],
    [0, [3.5]],
    [1, [4.0]],
    [1, [4.25]],
    [1, [4.5]],
    [1, [4.75]],
    [1, [5.0]],
    [1, [5.5]]
]

data = []

for x, y in studyHours:
    data.append(createLabeledPoints(x, y))

model = LogisticRegressionWithLBFGS.train(sc.parallelize(data))

print(model)

print(model.predict([1]))

Output:

spark-submit regression-mllib.py
(weights=[0.215546777333], intercept=0.0)
1
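
As a small, hedged addition (not part of the original post), you could also check how well the model fits the training points themselves; the sketch below assumes the sc, data and model objects from the code above:

# Sketch: training accuracy of the model trained above
points = sc.parallelize(data)
labelsAndPreds = points.map(lambda p: (p.label, model.predict(p.features)))
accuracy = labelsAndPreds.filter(lambda lp: lp[0] == lp[1]).count() / float(points.count())
print('Training accuracy = %s' % accuracy)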

References:

  1. Logistic Regression - Wikipedia.org
  2. See other posts in Learn Data Science

k-Means Clustering Spark Tutorial : Learn Data Science

k-Means clustering with Spark is easy to understand. MLlib comes bundled with a k-Means implementation (KMeans) which can be imported from the pyspark.mllib.clustering package. Here is a very simple example of clustering data with height and weight attributes.

Arguments to KMeans.train:

  1. k is the number of desired clusters
  2. maxIterations is the maximum number of iterations to run.
  3. runs is the number of times to run the k-means algorithm
  4. initializationMode can be either 'random' or 'k-means||'
from pyspark import SparkContext
from pyspark.mllib.clustering import KMeans
from numpy import array

sc = SparkContext()
sc.setLogLevel("ERROR")

#12 records with height, weight data
data = array([185,72, 170,56, 168,60, 179,68, 182,72, 188,77, 180,71, 180,70, 183,84, 180,88, 180,67, 177,76]).reshape(12,2)

#Generate Kmeans
model = KMeans.train(sc.parallelize(data), 2, runs=50, initializationMode="random")

#Print out the cluster of each data point
print (model.predict(array([185, 71])))
print (model.predict(array([170, 56])))
print (model.predict(array([168, 60])))
print (model.predict(array([179, 68])))
print (model.predict(array([182, 72])))
print (model.predict(array([188, 77])))
print (model.predict(array([180, 71])))
print (model.predict(array([180, 70])))
print (model.predict(array([183, 84])))
print (model.predict(array([180, 88])))
print (model.predict(array([180, 67])))
print (model.predict(array([177, 76])))

Output
0
1
1
0
0
0
0
0
0
0
0
0
(10 items go to cluster 0, whereas 2 items go to cluster 1)

Above is a very naive example in which we use the training dataset as input data too. In the real world we train a model, save it, and later use it to predict clusters of new input data. Here is how you can save a trained model and later load it for prediction.

Training and Storing the Model

from pyspark import SparkContext
from pyspark.mllib.clustering import KMeans
from numpy import array

sc = SparkContext()

#12 records with height, weight data
data = array([185,72, 170,56, 168,60, 179,68, 182,72, 188,77, 180,71, 180,70, 183,84, 180,88, 180,67, 177,76]).reshape(12,2)

#Generate Kmeans
model = KMeans.train(sc.parallelize(data), 2, runs=50, initializationMode="random")

model.save(sc, "savedModelDir")

This will create a directory, savedModelDir, with two subdirectories, data and metadata, where the model is stored.

Using an Already Trained Model for Predicting Clusters

Now, let's use the trained model by loading it. We need to import KMeansModel in order to load the model from file.

from pyspark import SparkContext
from pyspark.mllib.clustering import KMeans, KMeansModel
from numpy import array

sc = SparkContext()

#Generate Kmeans
model = KMeansModel.load(sc, "savedModelDir")

#Print out the cluster of each data point
print (model.predict(array([185, 71])))
print (model.predict(array([170, 56])))
print (model.predict(array([168, 60])))
print (model.predict(array([179, 68])))
print (model.predict(array([182, 72])))
print (model.predict(array([188, 77])))
print (model.predict(array([180, 71])))
print (model.predict(array([180, 70])))
print (model.predict(array([183, 84])))
print (model.predict(array([180, 88])))
print (model.predict(array([180, 67])))
print (model.predict(array([177, 76])))

References:

  1. Clustering and Feature Extraction in MLlib, UCLA
  2. k-Means Clustering Algorithm Explained, DnI Institute
  3. k-Means Clustering with Python, iDevji

Apriori Algorithm for Generating Frequent Itemsets

The Apriori algorithm is used for finding frequent itemsets. Identifying associations between items in a dataset of transactions can be useful in various data mining tasks. For example, a supermarket can make a better shelf arrangement if it knows which items are frequently purchased together. The challenge is: given a dataset D having T transactions, each with n attributes, how do we find the itemsets that appear frequently in D? This can be trivially solved by generating all possible itemsets and checking each candidate itemset against the support threshold, which is computationally expensive. The Apriori algorithm effectively eliminates the majority of itemsets without counting their support value. Non-frequent itemsets are known prior to the calculation of the support count, hence the name Apriori algorithm. It works on the following principle, which is also known as the apriori property:

If an itemset is frequent, then all of its subsets must also be frequent.

Suppose {c, d, e} is a frequent itemset. Clearly, any transaction that contains {c, d, e} must also contain its subsets {c, d}, {c, e}, {d, e}, {c}, {d}, and {e}. As a result, if {c, d, e} is frequent, then all subsets of {c, d, e} must also be frequent.

The algorithm works bottom-up: generate 1-itemsets, eliminate those below the support threshold, then generate 2-itemsets, eliminate … and so on. This way only frequent k-itemsets are used to generate (k+1)-itemsets, significantly reducing the number of candidates. Let's see an example. Our dataset contains four transactions with the following particulars:

TID    Items
100    1, 3, 4
200    2, 3, 5
300    1, 2, 3, 5
400    2, 5

The support of an itemset is calculated as support(X) = number of transactions in which X appears / total number of transactions; for example, {2, 5} appears in 3 of the 4 transactions, so support({2, 5}) = 3/4 = 0.75. For the sake of simplicity we use the support count, that is, the number of transactions in which an itemset appears, and we assume a support threshold of 2.

Step 1: Generate 1-itemsets, calculate the support for each, and mark itemsets below the support threshold.

Itemset    Support
{1}        2
{2}        3
{3}        3
{4}        1
{5}        3

Now, mark itemsets where the support is below the threshold:

Itemset    Support
{1}        2
{2}        3
{3}        3
{4}        1    (below threshold, eliminated)
{5}        3

Step 2: Generate 2-itemsets, calculate the support for each, and mark itemsets below the support threshold. Remember that for generating 2-itemsets we do not consider the eliminated 1-itemsets.

Itemset    Support
{1, 2}     1
{1, 3}     2
{1, 5}     1
{2, 3}     2
{2, 5}     3
{3, 5}     2

Marking 2-itemsets below the support threshold:

Itemset    Support
{1, 2}     1    (below threshold, eliminated)
{1, 3}     2
{1, 5}     1    (below threshold, eliminated)
{2, 3}     2
{2, 5}     3
{3, 5}     2

Step 3: Generate 3-itemsets, calculate the support for each, and mark itemsets below the support threshold.

Itemset      Support
{2, 3, 5}    2

Stop: We stop here because 4-itemsets cannot be generated, as only three items are left. The frequent itemsets we have generated, all above the support threshold, are: {1}, {2}, {3}, {5}, {1, 3}, {2, 3}, {2, 5}, {3, 5} and {2, 3, 5}. The itemsets generated but not qualified are: {4}, {1, 2} and {1, 5}. If we did not use Apriori and instead relied on a brute-force approach, the number of candidate itemsets generated would be much larger! Feel free to ask your questions in the comments section below!
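
To make the procedure concrete, here is a minimal, illustrative Python sketch of Apriori over the toy dataset above (the function and variable names are my own, not from any particular library):

from itertools import combinations

def apriori(transactions, min_support=2):
    transactions = [set(t) for t in transactions]
    items = sorted({i for t in transactions for i in t})

    def support(itemset):
        # support count = number of transactions containing the itemset
        return sum(1 for t in transactions if itemset <= t)

    frequent = {}
    candidates = [frozenset([i]) for i in items]  # 1-itemset candidates
    k = 1
    while candidates:
        # count support and keep only candidates meeting the threshold
        counts = {c: support(c) for c in candidates}
        survivors = {c: s for c, s in counts.items() if s >= min_support}
        frequent.update(survivors)

        # join step: build (k+1)-itemsets from frequent k-itemsets
        keys = list(survivors)
        candidates = {a | b for a in keys for b in keys if len(a | b) == k + 1}

        # prune step: every k-subset of a candidate must itself be frequent
        candidates = [c for c in candidates
                      if all(frozenset(s) in survivors for s in combinations(c, k))]
        k += 1
    return frequent

dataset = [[1, 3, 4], [2, 3, 5], [1, 2, 3, 5], [2, 5]]
for itemset, sup in sorted(apriori(dataset).items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), sup)

Running the sketch reproduces the frequent itemsets listed above, with {4}, {1, 2} and {1, 5} eliminated along the way.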