Building a Movie Recommendation Service with Apache Spark

In this tutorial I’ll show you how to build a movie recommendation service with Apache Spark. The core idea is that two users are alike if they rate products similarly. For example, if Alice rated a book 3/5 and Bob rated the same book 3.3/5, they are very much alike. Now if Bob buys another book and rates it 4/5, we should suggest that book to Alice; that is what a recommender system does. See the references if you want to know more about how recommender systems work. We are going to use the Alternating Least Squares (ALS) method from MLlib and the MovieLens 100K dataset, which is only 5 MB in size. Download the dataset from https://grouplens.org/datasets/movielens/. Code:

from pyspark.mllib.recommendation import ALS, MatrixFactorizationModel, Rating
from pyspark import SparkContext

sc = SparkContext()

#Replace filepath with the path to the MovieLens u.data file
movielens = sc.textFile("filepath/u.data")

movielens.first() #u'196\t242\t3\t881250949'
movielens.count() #100000

#Clean up the data by splitting it,
#movielens readme says the data is split by tabs and
#is user product rating timestamp
clean_data = movielens.map(lambda x: x.split('\t'))

#We’ll need to map the movielens data to a Ratings object
#A Ratings object is made up of (user, item, rating)
mls = movielens.map(lambda l: l.split(‘\t’))
ratings = mls.map(lambda x: Rating(int(x[0]), int(x[1]), float(x[2])))

#Setting up the parameters for ALS
rank = 5 # Number of latent factors
numIterations = 10 # Number of ALS iterations

#Need a training and test set, test set is not used in this example.
train, test = ratings.randomSplit([0.7,0.3],7856)

#Create the model on the training data
model = ALS.train(train, rank, numIterations)

#For product X, find N users to sell to

model.recommendUsers(242,100)

#For user Y, find N products to promote

model.recommendProducts(196,10)

#Predict Single Product for Single User
model.predict(196, 242)
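
The test split created earlier is not used by the code above. As a minimal sketch of how it could be used (assuming the model and test variables defined above), the model can be evaluated on the held-out ratings with RMSE:

#Sketch: evaluate on the held-out test set
testdata = test.map(lambda r: (r.user, r.product))
predictions = model.predictAll(testdata).map(lambda r: ((r.user, r.product), r.rating))
ratesAndPreds = test.map(lambda r: ((r.user, r.product), r.rating)).join(predictions)
MSE = ratesAndPreds.map(lambda r: (r[1][0] - r[1][1]) ** 2).mean()
print("Test RMSE = " + str(MSE ** 0.5))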

References:

  1. Building a Recommender System in Spark with ALS, LearnByMarketing.com
  2. MovieLens
  3. Video : Collaborative Filtering, Stanford University
  4. Matrix Factorisation and Dimensionality Reduction, Thierry Silbermann
  5. Building a Recommendation Engine with Spark, Nick Pentreath, Packt

Logistic Regression with Spark : Learn Data Science

Logistic regression with Spark is achieved using MLlib. Logistic regression returns binary class labels, that is, "0" or "1". In this example, we consider a data set that consists of only one variable, "study hours", and a class label indicating whether the student passed (1) or did not pass (0).

from pyspark import SparkContext
import numpy as np
from numpy import array
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.classification import LogisticRegressionWithLBFGS

sc = SparkContext()

def createLabeledPoints(label, points):
    return LabeledPoint(label, points)

studyHours = [
[ 0, [0.5]],
[ 0, [0.75]],
[ 0, [1.0]],
[ 0, [1.25]],
[ 0, [1.5]],
[ 0, [1.75]],
[ 1, [1.75]],
[ 0, [2.0]],
[ 1, [2.25]],
[ 0, [2.5]],
[ 1, [2.75]],
[ 0, [3.0]],
[ 1, [3.25]],
[ 0, [3.5]],
[ 1, [4.0]],
[ 1, [4.25]],
[ 1, [4.5]],
[ 1, [4.75]],
[ 1, [5.0]],
[ 1, [5.5]]
]

data = []

for x, y in studyHours:
    data.append(createLabeledPoints(x, y))

model = LogisticRegressionWithLBFGS.train( sc.parallelize(data) )

print (model)

print (model.predict([1]))

Output:

spark-submit regression-mllib.py
(weights=[0.215546777333], intercept=0.0)
1
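
As a quick sanity check (a sketch, not part of the original listing), the training accuracy can be computed by comparing the model's predictions against the labels:

#Sketch: training accuracy, assuming `model` and `data` from the listing above
rdd = sc.parallelize(data)
labelsAndPreds = rdd.map(lambda p: (p.label, model.predict(p.features)))
accuracy = labelsAndPreds.filter(lambda lp: lp[0] == lp[1]).count() / float(rdd.count())
print("Training accuracy: {}".format(accuracy))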


Apriori Algorithm for Generating Frequent Itemsets

The Apriori algorithm is used to find frequent itemsets. Identifying associations between items in a dataset of transactions is useful in various data mining tasks; for example, a supermarket can make better shelf arrangements if it knows which items are frequently purchased together. The challenge is: given a dataset D having T transactions, each with up to n attributes, how do we find the itemsets that appear frequently in D? This can be solved trivially by generating all possible itemsets and checking each candidate against the support threshold, but that is computationally expensive. The Apriori algorithm eliminates the majority of candidate itemsets without counting their support value: non-frequent itemsets are known prior to the calculation of the support count, hence the name Apriori. It works on the following principle, also known as the apriori property:

If an itemset is frequent, then all of its subsets must also be frequent.

Suppose {c, d, e} is a frequent itemset. Clearly, any transaction that contains {c, d, e} must also contain its subsets {c, d}, {c, e}, {d, e}, {c}, {d}, and {e}. As a result, if {c, d, e} is frequent, then all subsets of {c, d, e} must also be frequent. The algorithm works bottom-up: generate 1-itemsets and eliminate those below the support threshold, then generate 2-itemsets, eliminate again, and so on. This way only frequent k-itemsets are used to generate (k+1)-itemsets, which significantly reduces the number of candidates. Let’s see an example. Our dataset contains four transactions with the following particulars:

TID    Items
100    1, 3, 4
200    2, 3, 5
300    1, 2, 3, 5
400    2, 5

Support is calculated as support(a -> b) = number of transactions in which both a and b appear / total number of transactions. For the sake of simplicity, we use the support value as the raw count of transactions in which an itemset appears, and we assume a support threshold of 2. For example, itemset {1} appears in transactions 100 and 300, so its support count is 2. Step 1 : Generate 1-itemsets, calculate the support for each, and mark itemsets below the support threshold.

Itemset    Support
{1}        2
{2}        3
{3}        3
{4}        1
{5}        3

Now, mark itemsets whose support is below the threshold: only {4}, with support 1, falls below the threshold of 2 and is eliminated.

Step 2 : Generate 2-itemsets, calculate the support for each, and mark itemsets below the support threshold. Remember that eliminated 1-itemsets are not used when generating 2-itemsets.

Itemset    Support
{1, 2}     1
{1, 3}     2
{1, 5}     1
{2, 3}     2
{2, 5}     3
{3, 5}     2

Marking 2-itemsets below the support threshold: {1, 2} and {1, 5}, each with support 1, fall below the threshold and are eliminated.

Step 3 : Generate 3-itemsets, calculate the support for each, and mark itemsets below the support threshold. The only candidate whose 2-item subsets are all frequent is {2, 3, 5}.

Itemset      Support
{2, 3, 5}    2

Stop: We stop here because no 4-itemset can be generated; only one frequent 3-itemset remains. The frequent itemsets we have generated, all at or above the support threshold, are: {1}, {2}, {3}, {5}, {1, 3}, {2, 3}, {2, 5}, {3, 5} and {2, 3, 5}. Itemsets that were generated but did not qualify are {4}, {1, 2} and {1, 5}. Had we relied on a brute-force approach instead of Apriori, the number of candidate itemsets generated would have been much larger! Feel free to ask your questions in the comments section below!
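
For readers who prefer code, here is a minimal, self-contained Python sketch of the same bottom-up procedure on the toy dataset above; the function and variable names are my own, not part of any library:

from itertools import combinations

#The four transactions from the example (TIDs 100, 200, 300, 400)
transactions = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
min_support = 2  #absolute support count threshold used in the example

def support(itemset):
    #Count transactions that contain every item of the candidate itemset
    return sum(1 for t in transactions if itemset <= t)

#Start with the frequent 1-itemsets
items = sorted({i for t in transactions for i in t})
frequent = {frozenset([i]) for i in items if support(frozenset([i])) >= min_support}

allFrequent = {}
k = 1
while frequent:
    allFrequent.update({fs: support(fs) for fs in frequent})
    #Join frequent k-itemsets to build (k+1)-candidates, then prune any
    #candidate that has a non-frequent k-subset (the apriori property)
    candidates = {a | b for a in frequent for b in frequent if len(a | b) == k + 1}
    candidates = {c for c in candidates
                  if all(frozenset(s) in frequent for s in combinations(c, k))}
    frequent = {c for c in candidates if support(c) >= min_support}
    k += 1

for itemset, sup in sorted(allFrequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), sup)

Running this prints the same frequent itemsets listed above, from the frequent 1-itemsets up to {2, 3, 5}.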

k-means Clustering Algorithm with Python : Learn Data Science

The k-means clustering algorithm is used to group samples (items) into k clusters, where k is specified by the user. The method alternates between assigning each sample to the nearest cluster centroid and recomputing each centroid as the mean of the samples assigned to it, hence the name k-means. Euclidean distance is used as the distance measure. See the references for more information on the algorithm. This article describes the k-means clustering algorithm with an implementation in Python.

About this implementation :

  • The first k samples are used as the initial cluster centroids
  • Cluster IDs start at 1
  • Final assignments are written to a file named assignment-results.txt

Implementation :

import math

nsample = int (input ("Number of Samples: "))
nvar = int (input ("Number of Variables: "))
k = int (input ("Number of Clusters: "))

sampleList = [[0 for x in range(nvar)] for y in range(nsample)]

#Input samples
sampleCount = 0
for sample in sampleList:
    print ("\n\nCollecting Data for Sample #{}:".format(sampleCount+1))
    print ("----------------------------------------")
    i = 0
    while i < nvar:
        sample[i] = int (input ("Data for var-{} : ".format(i+1)))
        i += 1
    sampleCount += 1

#First k samples are chosen as cluster centroids
centroidList = [[0 for x in range(nvar)] for y in range(k)]
i = 0
while i < k:
    j = 0
    while j < nvar:
        centroidList[i][j] = sampleList[i][j]
        j += 1
    i += 1

# distanceList maintains Euclidean distance of given sample
# for all clusters k
distanceList = [0.0 for x in range (k)]

#Open file for writing assignments
fileObject = open ("assignment-results.txt","w")

for sample in sampleList:
    n = 0
    for centroid in centroidList:
        var = 0
        total = 0
        while var < nvar:
            temp = (sample[var] - centroid[var]) ** 2
            total += temp
            var += 1
        distanceList[n] = math.sqrt (total)
        n += 1

    #Write the sample and its 1-based cluster assignment to the file
    fileObject.write("{} \t {}\n".format(sample, distanceList.index(min(distanceList))+1))

#Close the file
fileObject.close()
print ("\n\n Final assignments successfully written to file! \n")
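
Note that the listing above performs a single assignment pass against the initial centroids. A full k-means run would alternate that assignment step with a centroid-update step until the assignments stop changing. A minimal sketch of the update step, assuming the sampleList and centroidList structures above and a list of 0-based cluster assignments, could look like this:

def updateCentroids(sampleList, assignments, centroidList, k, nvar):
    #Sum the samples assigned to each cluster
    sums = [[0.0 for x in range(nvar)] for y in range(k)]
    counts = [0 for x in range(k)]
    for sample, c in zip(sampleList, assignments):
        counts[c] += 1
        for j in range(nvar):
            sums[c][j] += sample[j]
    #New centroid = mean of its assigned samples; keep the old centroid
    #if a cluster received no samples
    newCentroids = []
    for c in range(k):
        if counts[c] == 0:
            newCentroids.append(centroidList[c][:])
        else:
            newCentroids.append([s / counts[c] for s in sums[c]])
    return newCentroids

Repeating the assignment pass with the returned centroids until nothing changes gives the standard k-means loop.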

References :

  1. K Means Clustering Algorithm: Explained