k-Means clustering with Spark is easy to understand. MLlib ships with a k-means implementation (KMeans), which can be imported from the pyspark.mllib.clustering package. Here is a simple example that clusters data with height and weight attributes.
Arguments to KMeans.train:
k is the number of desired clusters.
maxIterations is the maximum number of iterations to run.
runs is the number of times to run the k-means algorithm.
initializationMode can be either 'random' or 'k-means||'.
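To see what these parameters control, here is a minimal NumPy sketch of the iterative loop that k-means performs (illustration only; Spark's implementation is distributed, and the sample points below are made up for the example):

```python
import numpy as np

def kmeans_sketch(points, k, max_iterations=20, seed=0):
    rng = np.random.default_rng(seed)
    # 'random' initializationMode: pick k data points as starting centers
    centers = points[rng.choice(len(points), k, replace=False)]
    for _ in range(max_iterations):  # maxIterations bounds this loop
        # assignment step: each point joins its nearest center
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # update step: each center moves to the mean of its points
        # (an empty cluster keeps its previous center)
        new_centers = np.array([points[labels == j].mean(axis=0)
                                if np.any(labels == j) else centers[j]
                                for j in range(k)])
        if np.allclose(new_centers, centers):  # converged early
            break
        centers = new_centers
    return centers, labels

# hypothetical height/weight sample with two visible groups
pts = np.array([[185.0, 72], [170, 56], [180, 71],
                [168, 60], [179, 68], [172, 58]])
centers, labels = kmeans_sketch(pts, k=2)
```

The runs parameter corresponds to repeating this whole procedure with different random starts and keeping the best result; 'k-means||' is a smarter initialization that spreads the starting centers out.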
from pyspark import SparkContext
from pyspark.mllib.clustering import KMeans
from numpy import array
sc = SparkContext()
sc.setLogLevel("ERROR")
# 12 records with height/weight data
data = array([185,72,