Logistic Regression with Spark : Learn Data Science

Logistic regression with Spark is achieved using MLlib. Logistic regression returns binary class labels, that is, "0" or "1". In this example, we consider a data set that consists of only one variable, "study hours", and the class label is whether the student passed (1) or did not pass (0).

from pyspark import SparkContext
import numpy as np
from numpy import array
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.classification import LogisticRegressionWithLBFGS

sc = SparkContext()

def createLabeledPoints(label, points):
    return LabeledPoint(label, points)

studyHours = [
    [0, [0.5]],
    [0, [0.75]],
    [0, [1.0]],
    [0, [1.25]],
    [0, [1.5]],
    [0, [1.75]],
    [1, [1.75]],
    [0, [2.0]],
    [1, [2.25]],
    [0, [2.5]],
    [1, [2.75]],
    [0, [3.0]],
    [1, [3.25]],
    [0, [3.5]],
    [1, [4.0]],
    [1, [4.25]],
    [1, [4.5]],
    [1, [4.75]],
    [1, [5.0]],
    [1, [5.5]]
]

data = []

for x, y in studyHours:
    data.append(createLabeledPoints(x, y))

model = LogisticRegressionWithLBFGS.train(sc.parallelize(data))

print (model)

print (model.predict([1]))

Output:

spark-submit regression-mllib.py
(weights=[0.215546777333], intercept=0.0)
1

References:

  1. Logistic Regression - Wikipedia.org
  2. See other posts in Learn Data Science

k-Means Clustering Spark Tutorial : Learn Data Science

k-Means clustering with Spark is easy to understand. MLlib comes bundled with a k-Means implementation (KMeans), which can be imported from the pyspark.mllib.clustering package. Here is a very simple example of clustering data with height and weight attributes.

Arguments to KMeans.train:

  1. k is the number of desired clusters.
  2. maxIterations is the maximum number of iterations to run.
  3. runs is the number of times to run the k-means algorithm.
  4. initializationMode can be either 'random' or 'k-means||'.
from pyspark import SparkContext
from pyspark.mllib.clustering import KMeans
from numpy import array

sc = SparkContext()
sc.setLogLevel("ERROR")

#12 records with height, weight data
data = array([185,72, 170,56, 168,60, 179,68, 182,72, 188,77, 180,71, 180,70, 183,84, 180,88, 180,67, 177,76]).reshape(12,2)

#Generate Kmeans
model = KMeans.train(sc.parallelize(data), 2, runs=50, initializationMode="random")

#Print out the cluster of each data point
print (model.predict(array([185, 71])))
print (model.predict(array([170, 56])))
print (model.predict(array([168, 60])))
print (model.predict(array([179, 68])))
print (model.predict(array([182, 72])))
print (model.predict(array([188, 77])))
print (model.predict(array([180, 71])))
print (model.predict(array([180, 70])))
print (model.predict(array([183, 84])))
print (model.predict(array([180, 88])))
print (model.predict(array([180, 67])))
print (model.predict(array([177, 76])))

Output
0
1
1
0
0
0
0
0
0
0
0
0
(10 items go to cluster 0, whereas 2 items go to cluster 1)

The above is a very naive example in which we use the training dataset as input data too. In the real world we will train a model, save it, and later use it for predicting clusters of new input data. So here is how you can save a trained model and later load it for prediction.

Training and Storing the Model

from pyspark import SparkContext
from pyspark.mllib.clustering import KMeans
from numpy import array

sc = SparkContext()

#12 records with height, weight data
data = array([185,72, 170,56, 168,60, 179,68, 182,72, 188,77, 180,71, 180,70, 183,84, 180,88, 180,67, 177,76]).reshape(12,2)

#Generate Kmeans
model = KMeans.train(sc.parallelize(data), 2, runs=50, initializationMode="random")

model.save(sc, "savedModelDir")

This will create a directory, _savedModelDir_, with two subdirectories _data_ and _metadata_ where the model is stored.

Using an Already Trained Model for Predicting Clusters

Now, let's use the trained model by loading it. We need to import KMeansModel in order to load the model from file.

from pyspark import SparkContext
from pyspark.mllib.clustering import KMeans, KMeansModel
from numpy import array

sc = SparkContext()

#Load the saved k-means model
model = KMeansModel.load(sc, "savedModelDir")

#Print out the cluster of each data point
print (model.predict(array([185, 71])))
print (model.predict(array([170, 56])))
print (model.predict(array([168, 60])))
print (model.predict(array([179, 68])))
print (model.predict(array([182, 72])))
print (model.predict(array([188, 77])))
print (model.predict(array([180, 71])))
print (model.predict(array([180, 70])))
print (model.predict(array([183, 84])))
print (model.predict(array([180, 88])))
print (model.predict(array([180, 67])))
print (model.predict(array([177, 76])))

References:

  1. Clustering and Feature Extraction in MLlib, UCLA
  2. k-Means Clustering Algorithm Explained, DnI Institute
  3. k-Means Clustering with Python, iDevji

Apriori Algorithm for Generating Frequent Itemsets

The Apriori algorithm is used to find frequent itemsets. Identifying associations between items in a dataset of transactions can be useful in various data mining tasks. For example, a supermarket can make better shelf arrangements if it knows which items are purchased together frequently. The challenge is: given a dataset D having T transactions, each with n attributes, how do we find itemsets that appear frequently in D? This can be trivially solved by generating all possible itemsets and checking each candidate itemset against the support threshold, which is computationally expensive. The Apriori algorithm effectively eliminates the majority of itemsets without counting their support values. Non-frequent itemsets are known prior to the calculation of the support count, hence the name Apriori algorithm. It works on the following principle, which is also known as the apriori property:

If an itemset is frequent, then all of its subsets must also be frequent.

  Suppose {c, d, e} is a frequent itemset. Clearly, any transaction that contains {c, d, e} must also contain its subsets: {c, d}, {c, e}, {d, e}, {c}, {d}, and {e}. As a result, if {c, d, e} is frequent, then all subsets of {c, d, e} must also be frequent. The algorithm works bottom-up: generate 1-itemsets, eliminate those below the support threshold, then generate 2-itemsets, eliminate, and so on. This way only frequent _k_-itemsets are used to generate _(k+1)_-itemsets, significantly reducing the number of candidates. Let's see an example. Our dataset contains four transactions with the following particulars:

| TID | Items      |
|-----|------------|
| 100 | 1, 3, 4    |
| 200 | 2, 3, 5    |
| 300 | 1, 2, 3, 5 |
| 400 | 2, 5       |

Support value is calculated as support (a -> b) = (number of transactions in which both a and b appear) / (total transactions). But for the sake of simplicity we use the support value as the number of transactions in which each itemset appears. We also assume a support threshold = 2.

Step 1 : Generate 1-itemsets, calculate support for each and mark itemsets below the support threshold

| Itemset | Support |
|---------|---------|
| {1}     | 2       |
| {2}     | 3       |
| {3}     | 3       |
| {4}     | 1       |
| {5}     | 3       |

Now, mark itemsets where support is below the threshold:

| Itemset | Support |
|---------|---------|
| {1}     | 2       |
| {2}     | 3       |
| {3}     | 3       |
| {4}     | 1 (below threshold, eliminated) |
| {5}     | 3       |

Step 2 : Generate 2-itemsets, calculate support for each and mark itemsets below the support threshold. Remember, for generating 2-itemsets we do not consider eliminated 1-itemsets.

| Itemset | Support |
|---------|---------|
| {1, 2}  | 1       |
| {1, 3}  | 2       |
| {1, 5}  | 1       |
| {2, 3}  | 2       |
| {2, 5}  | 3       |
| {3, 5}  | 2       |

Marking 2-itemsets below the support threshold:

| Itemset | Support |
|---------|---------|
| {1, 2}  | 1 (below threshold, eliminated) |
| {1, 3}  | 2       |
| {1, 5}  | 1 (below threshold, eliminated) |
| {2, 3}  | 2       |
| {2, 5}  | 3       |
| {3, 5}  | 2       |

Step 3 : Generate 3-itemsets, calculate support for each and mark itemsets below the support threshold

| Itemset   | Support |
|-----------|---------|
| {2, 3, 5} | 2       |

  Stop: We stop here because 4-itemsets cannot be generated, as only three items are left! Following are the frequent itemsets we have generated, all above the support threshold: {1}, {2}, {3}, {5}, {1, 3}, {2, 3}, {2, 5}, {3, 5} and {2, 3, 5}. Itemsets generated but not qualified are: {4}, {1, 2} and {1, 5}. If we did not use Apriori and instead relied on a brute-force approach, the number of candidate itemsets generated would be much larger! A short code sketch of the procedure follows. Feel free to ask your questions in the comments section below!
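To make the walkthrough concrete, here is a minimal, illustrative Python sketch of the same bottom-up procedure on the four transactions above. The function and variable names are ours, not from any library, and it simplifies candidate generation to unions of frequent (k-1)-itemsets, as described in the text, rather than the full Apriori pruning step.

# Transactions from the example above
transactions = [
    {1, 3, 4},      # TID 100
    {2, 3, 5},      # TID 200
    {1, 2, 3, 5},   # TID 300
    {2, 5},         # TID 400
]
min_support = 2     # support threshold assumed in the walkthrough

def support(itemset):
    """Number of transactions that contain every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t)

# Step 1: frequent 1-itemsets
items = sorted({i for t in transactions for i in t})
frequent = [frozenset([i]) for i in items if support(frozenset([i])) >= min_support]
all_frequent = list(frequent)

# Steps 2, 3, ...: build k-itemset candidates only from frequent (k-1)-itemsets
k = 2
while frequent:
    candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
    frequent = [c for c in candidates if support(c) >= min_support]
    all_frequent.extend(frequent)
    k += 1

print (sorted((sorted(s) for s in all_frequent), key=lambda s: (len(s), s)))
# [[1], [2], [3], [5], [1, 3], [2, 3], [2, 5], [3, 5], [2, 3, 5]]

Running this reproduces exactly the frequent itemsets listed above; {4}, {1, 2} and {1, 5} never reach the frequent list.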

Data Mining : Intuitive Partitioning of Data or 3-4-5 Rule

Introduction

Intuitive partitioning, or natural partitioning, is used in data discretization. Data discretization is the process of converting continuous values of an attribute into categorical data, partitions, or intervals. This helps reduce data size by reducing the number of possible values: instead of storing every observation, we store the partition range in which each observation falls. One of the easiest ways to partition numeric values is intuitive (natural) partitioning.

Intuitive partitioning for data discretization

  1. If an interval covers 3, 6, 7 or 9 distinct values at the most significant digit, then create 3 intervals. There can be 3 equal-width intervals for 3, 6, 9; and 3 intervals in a 2-3-2 grouping for 7.
  2. If it covers 2, 4 or 8 distinct values at the most significant digit, then create 4 equal-width sub-intervals.
  3. If it covers 1, 5 or 10 distinct values at the most significant digit, then partition the range into 5 equal-width intervals.
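These three rules can be summarised in a small helper function. This is only an illustrative sketch; the function name is ours, not part of any standard API.

def msd_partition_rule(distinct_values):
    """Map the number of distinct values covered at the most significant
    digit to the interval layout suggested by the 3-4-5 rule."""
    if distinct_values == 7:
        return "3 intervals grouped 2-3-2"
    if distinct_values in (3, 6, 9):
        return "3 equal-width intervals"
    if distinct_values in (2, 4, 8):
        return "4 equal-width intervals"
    if distinct_values in (1, 5, 10):
        return "5 equal-width intervals"
    return "rule does not apply directly"

print (msd_partition_rule(3))   # 3 equal-width intervals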

Let’s understand with an example:

Part I : The Data

Assume that we have records showing the profit made on each sale throughout a financial year. The profit data range is -351,976 to +47,000,896. A negative profit value is a loss ;)

Part II : Dealing with noisy data

To avoid the effect of noise, extremely high or extremely low values are not considered. So first we need to smooth out our data: let's discard the bottom 5% and the top 5% of values.

Part III : Finding MSD and interval range

  • Suppose that after discarding the above values, the new values are LOW = -159876 and HIGH = 1838761. Here, the most significant digit (MSD) is at the million position.
  • The next step is to round down LOW and round up HIGH at the MSD. So LOW = -1000000 and HIGH = 2000000: -1000000 is the nearest million below -159876 and 2000000 is the nearest million above 1838761.
  • Next we identify the range of this interval: Range = HIGH - LOW = 2000000 - (-1000000) = 3000000. We consider only the MSD here, so the range of this interval covers 3 distinct values.

Part IV : Applying rules

  • Now that we know the range covers 3 distinct values at the MSD, we can apply rule #1.
  • Rule #1 states that we can divide this interval into three equal-width intervals:
    • Interval 1 : -1000000 to 0
    • Interval 2 : 0 to 1000000
    • Interval 3 : 1000000 to 2000000
  • You might be wondering how 0 can be part of multiple intervals. You're right! We should represent the intervals as follows (a small code sketch of these steps follows the list):
    • Interval 1 : (-1000000 … 0]
    • Interval 2 : (0 … 1000000]
    • Interval 3 : (1000000 … 2000000]
    • Here (a … b] denotes a range that excludes a but includes b; ( , ] is the notation for a half-open interval.
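The arithmetic of Parts II-IV can be sketched in a few lines of Python. This is only an illustration of the steps above on the example's LOW and HIGH values; the variable names and the rounding approach are our own assumptions, not a standard API.

import math

# LOW and HIGH after discarding the 5% tails (Part III)
low, high = -159876, 1838761

# The MSD is at the million position; round LOW down and HIGH up at that digit
msd = 10 ** int(math.floor(math.log10(max(abs(low), abs(high)))))  # 1000000
low_rounded = math.floor(low / msd) * msd    # -1000000
high_rounded = math.ceil(high / msd) * msd   #  2000000

# The range at the MSD covers (2000000 - (-1000000)) / 1000000 = 3 values -> rule #1
n_intervals = 3
width = (high_rounded - low_rounded) // n_intervals
intervals = [(low_rounded + i * width, low_rounded + (i + 1) * width)
             for i in range(n_intervals)]
print (intervals)
# [(-1000000, 0), (0, 1000000), (1000000, 2000000)]  -> read each pair as (a … b]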

Conclusion

Now that we have partitions, we want to replace each profit data point with the partition in which it falls. This saves storage space and reduces complexity.
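As a small illustration of that replacement, the snippet below (a hypothetical helper of ours, continuing the sketch above) maps a profit value to the half-open interval it falls in:

def assign_partition(value, intervals):
    """Return the label of the half-open interval (low … high] containing value."""
    for low, high in intervals:
        if low < value <= high:
            return "({} … {}]".format(low, high)
    return None  # value lies outside all partitions

intervals = [(-1000000, 0), (0, 1000000), (1000000, 2000000)]
print (assign_partition(750000, intervals))   # (0 … 1000000]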

References

  1. Han, J., Kamber, M. (2006), Data Mining: Concepts and Techniques, Second Edition, pp. 91-94.
  2. The Range (Statistics), MathIsFun.com

k-means Clustering Algorithm with Python : Learn Data Science

The _k_-means clustering algorithm is used to group samples (items) into k clusters; k is specified by the user. The method works by computing the distance between each sample and the cluster centroids and assigning the sample to the nearest one; centroids are the means of their clusters, hence the name k-means clustering. Euclidean distance is used as the distance measure. See references for more information on the algorithm. This article describes the k-means clustering algorithm with Python.

About this implementation :

  • The first k samples are assigned as cluster centroids
  • Cluster IDs start with 1
  • Final assignments are written to a file named assignment-results.txt

Implementation :

import math

nsample = int (input ("Number of Samples: "))
nvar = int (input ("Number of Variables: "))
k = int (input ("Number of Clusters: "))

sampleList = [[0 for x in range(nvar)] for y in range(nsample)]

#Input samples
sampleCount = 0
for sample in sampleList:
    print ("\n\nCollecting Data for Sample #{}:".format(sampleCount+1))
    print ("----------------------------------------")
    i = 0
    while i < nvar:
        sample[i] = int (input ("Data for var-{} : ".format(i+1)))
        i += 1
    sampleCount += 1

#First k samples are chosen as cluster centroids
centroidList = [[0 for x in range(nvar)] for y in range(k)]
i = 0
while i < k:
    j = 0
    while j < nvar:
        centroidList[i][j] = sampleList[i][j]
        j += 1
    i += 1

# distanceList maintains the Euclidean distance of a given sample
# to each of the k cluster centroids
distanceList = [0.0 for x in range (k)]

#Open file for writing assignments
fileObject = open ("assignment-results.txt","w")

for sample in sampleList:
    n = 0
    for centroid in centroidList:
        var = 0
        total = 0
        while var < nvar:
            temp = (sample[var] - centroid[var]) ** 2
            total += temp
            var += 1
        distanceList[n] = math.sqrt (total)
        n += 1

    #Write the sample and its nearest cluster ID (cluster IDs start with 1)
    fileObject.write("{} \t {}\n".format(sample, distanceList.index(min(distanceList))+1))

#Close the file
fileObject.close()
print ("\n\n Final assignments successfully written to file! \n")

References :

  1. K Means Clustering Algorithm: Explained