k-Means Clustering Spark Tutorial : Learn Data Science
k-Means clustering with Spark is easy to understand. MLlib comes bundled with a k-Means implementation (KMeans), which can be imported from the pyspark.mllib.clustering package. Here is a very simple example of clustering data with height and weight attributes.
Arguments to KMeans.train:
- k is the number of desired clusters.
- maxIterations is the maximum number of iterations to run.
- runs is the number of times to run the k-means algorithm (this parameter has had no effect since Spark 2.0).
- initializationMode can be either 'random' or 'k-means||'.
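To make the roles of k, maxIterations and the 'random' initialization mode concrete, here is a minimal plain-Python sketch of the k-means loop. It is an illustration only and is not MLlib's implementation; the function names (`kmeans`, `nearest`, `squared_distance`) and the `seed` value are assumptions made for this example.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def nearest(point, centers):
    """Index of the center closest to the given point."""
    return min(range(len(centers)),
               key=lambda i: squared_distance(point, centers[i]))


def kmeans(points, k, max_iterations=100, seed=None):
    """Toy k-means with 'random' initialization (illustration only)."""
    rng = random.Random(seed)
    # 'random' mode: start from k data points picked at random.
    centers = rng.sample(points, k)
    for _ in range(max_iterations):
        # Assignment step: group points by their nearest center.
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centers)].append(p)
        # Update step: move each center to the mean of its group.
        # An empty group keeps its previous center.
        updated = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else centers[i]
            for i, g in enumerate(groups)
        ]
        if updated == centers:  # converged before reaching max_iterations
            break
        centers = updated
    return centers


# The same 12 (height, weight) records used in this tutorial.
points = [(185, 72), (170, 56), (168, 60), (179, 68), (182, 72), (188, 77),
          (180, 71), (180, 70), (183, 84), (180, 88), (180, 67), (177, 76)]
centers = kmeans(points, k=2, max_iterations=20, seed=0)
labels = [nearest(p, centers) for p in points]
print(centers)
print(labels)
```

Changing the seed changes the starting centers, which is why a randomly initialized model can label the same points differently from one run to the next.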
from pyspark import SparkContext
from pyspark.mllib.clustering import KMeans
from numpy import array
sc = SparkContext()
sc.setLogLevel("ERROR")
# 12 records of (height, weight) data
data = array([185,72, 170,56, 168,60, 179,68, 182,72, 188,77, 180,71, 180,70, 183,84, 180,88, 180,67, 177,76]).reshape(12,2)
# Train a k-means model with k=2 clusters and randomly chosen initial centers.
# The runs argument is left out: it has had no effect since Spark 2.0.
model = KMeans.train(sc.parallelize(data), 2, initializationMode="random")
# Print the cluster index assigned to each data point
for point in data:
    print(model.predict(point))
Output
0
1
1
0
0
0
0
0
0
0
0
0
(10 items go to cluster 0, whereas 2 items go to cluster 1. Because the initial centers are chosen at random, the labels and counts may differ between runs.)
The example above is deliberately naive: it uses the training dataset as the input data too. In practice, you would train a model, save it, and later load it to predict clusters for new input data. Here is how you can save a trained model and load it later for prediction.
Training and Storing the Model
from pyspark import SparkContext