In this tutorial I’ll show you how to build a movie recommendation service with Apache Spark. Two users are alike if they rated the same product similarly. For example, if Alice rated a book 3/5 and Bob rated the same book 3.3/5, they are very much alike. Now if Bob buys another book and rates it 4/5, we should suggest that book to Alice. That’s what a recommender system does. See the references if you want to know more about how recommender systems work.
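To make the Alice and Bob intuition concrete, here is a tiny plain-Python sketch. The item names, the `similarity` score (one over one plus the mean rating gap) and the `min_rating` threshold are my own illustrative choices, not anything Spark does internally. ALS, which we use in this tutorial, learns latent factors rather than comparing users pairwise like this.

```python
# Hypothetical toy ratings: user -> {item: rating on a 1-5 scale}
ratings = {
    "Alice": {"book_a": 3.0},
    "Bob": {"book_a": 3.3, "book_b": 4.0},
}

def similarity(a, b):
    """Return a score in [0, 1]; 0 when the two users share no rated items."""
    shared = set(ratings[a]) & set(ratings[b])
    if not shared:
        return 0.0
    # Average absolute difference between the two users' ratings
    mean_gap = sum(abs(ratings[a][i] - ratings[b][i]) for i in shared) / len(shared)
    return 1.0 / (1.0 + mean_gap)

def suggest(user, other, min_rating=4.0):
    """Items that `other` rated highly and `user` has not rated yet."""
    return sorted(
        item for item, rating in ratings[other].items()
        if rating >= min_rating and item not in ratings[user]
    )

alice_bob = similarity("Alice", "Bob")
suggestions = suggest("Alice", "Bob")
```

Here Alice and Bob differ by only 0.3 on the book they both rated, so the score is close to 1, and the only suggestion for Alice is the second book that Bob rated 4/5.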
We are going to use the Alternating Least Squares (ALS) method from MLlib and the MovieLens 100K dataset, which is only 5 MB in size. Download the dataset from https://grouplens.org/datasets/movielens/. Extract the zip file and look for the file named u.data.
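If you are curious what "alternating" means, the sketch below fits a rank-1 factorization in plain Python: each rating is approximated by a user factor times an item factor, and we repeatedly solve for one side while holding the other fixed. This is a simplified illustration under my own assumptions (a single latent factor, a regularization weight I call `lam`, made-up toy ratings), not MLlib's implementation, which uses a configurable rank and runs distributed.

```python
def als_rank1(ratings, iterations=10, lam=0.1):
    """Fit rating[u, i] ~ x[u] * y[i] with alternating least-squares updates.

    `ratings` is a list of (user, item, rating) tuples.
    """
    users = {u for u, _, _ in ratings}
    items = {i for _, i, _ in ratings}
    x = {u: 1.0 for u in users}  # user factors
    y = {i: 1.0 for i in items}  # item factors
    for _ in range(iterations):
        # Hold item factors fixed and solve each user's factor in closed form
        for u in users:
            num = sum(r * y[i] for uu, i, r in ratings if uu == u)
            den = lam + sum(y[i] ** 2 for uu, i, _ in ratings if uu == u)
            x[u] = num / den
        # Hold user factors fixed and solve each item's factor in closed form
        for i in items:
            num = sum(r * x[u] for u, ii, r in ratings if ii == i)
            den = lam + sum(x[u] ** 2 for u, ii, _ in ratings if ii == i)
            y[i] = num / den
    return x, y

# Hypothetical toy data: user 3 has not rated item 20 yet
toy = [(1, 10, 4.0), (1, 20, 2.0), (2, 10, 4.0), (2, 20, 2.0), (3, 10, 4.0)]
x, y = als_rank1(toy)
predicted = x[3] * y[20]  # estimated rating of item 20 by user 3
```

Users 1 and 2 both rated item 20 about half as high as item 10, so the estimate for user 3 lands near 2, slightly reduced by the regularization term.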
from pyspark.mllib.recommendation import ALS, MatrixFactorizationModel, Rating
from pyspark import SparkContext

sc = SparkContext()

# Replace filepath with the directory that contains u.data
movielens = sc.textFile("filepath/u.data")

movielens.first()  # u'196\t242\t3\t881250949'
movielens.count()  # 100000

# Clean up the data by splitting it.
# The MovieLens readme says the fields are separated by tabs
# and are: user, product, rating, timestamp
clean_data = movielens.map(lambda x: x.split('\t'))

# We need to map the MovieLens data to Rating objects.
# A Rating object is made up of (user, product, rating)
ratings = clean_data.map(lambda x: Rating(int(x[0]), int(x[1]), float(x[2])))

# Set up the parameters for ALS
rank = 5            # number of latent factors
numIterations = 10  # number of times to repeat the process

# We need a training and a test set; the test set is not used in this example
train, test = ratings.randomSplit([0.7, 0.3], 7856)

# Create the model on the training data
model = ALS.train(train, rank, numIterations)

# For product X, find N users to sell to
model.recommendUsers(242, 100)

# For user Y, find N products to promote
model.recommendProducts(196, 10)

# Predict a single product rating for a single user
model.predict(196, 242)
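The test set is left unused above. One common way to use it is to compare the model's predictions with the held-out ratings and compute the mean squared error. The helper below is a plain-Python sketch of that calculation. The function name and the sample tuples are my own, and it skips pairs that have no prediction, since a model may not return one for users or products it never saw in training.

```python
def mean_squared_error(actual, predicted):
    """Mean squared error over (user, product) pairs present in both inputs.

    Both arguments are iterables of (user, product, rating) tuples.
    """
    predicted_by_key = {(u, p): r for u, p, r in predicted}
    squared_errors = [
        (r - predicted_by_key[(u, p)]) ** 2
        for u, p, r in actual
        if (u, p) in predicted_by_key
    ]
    if not squared_errors:
        raise ValueError("no overlapping (user, product) pairs")
    return sum(squared_errors) / len(squared_errors)

# Hypothetical held-out ratings and model predictions
held_out = [(196, 242, 3.0), (186, 302, 3.0), (22, 377, 1.0)]
predictions = [(196, 242, 3.5), (186, 302, 2.0)]  # nothing for (22, 377)
mse = mean_squared_error(held_out, predictions)
```

With the Spark objects from this tutorial, the inputs could come from test.collect() and model.predictAll(test.map(lambda r: (r[0], r[1]))).collect(). Collecting to the driver like this is only sensible for a dataset as small as MovieLens 100K.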