Movie Recommender System Based on Collaborative Filtering Using Apache Spark

Mohammed Fadhel Aljunid and D. H. Manjaiah
Abstract Recently, building recommender systems (RS) has become a significant
research area that attracts scientists and researchers across the world.
Recommender systems are used in a variety of areas including music, movies,
books, news, search queries, and commercial products. Collaborative filtering (CF)
is one of the most popular and successful RS techniques; it aims to find users
closely similar to the active one in order to recommend items. CF with the
alternating least squares (ALS) algorithm is one of the most important techniques
used for building a movie recommendation engine. ALS is a matrix factorization
model for CF that operates on the values of the user-item rating matrix. There is
a need to analyze the ALS algorithm under different parameter selections, which
can eventually help in building an efficient movie recommender engine. In this
paper, we propose a movie recommender system based on ALS using Apache Spark.
This research focuses on the selection of ALS parameters that can affect the
performance of a robust RS. The results lead to the conclusion that the selection
of ALS parameters affects the performance of a movie recommender engine. The
model is evaluated using metrics such as execution time, root mean squared error
(RMSE) of rating prediction, and the rank at which the best model was trained. Two
best cases are chosen based on the parameter selection from the experimental
results, which can lead to good rating predictions for a movie recommender.
Keywords Recommender systems · Collaborative filtering · Alternating Least
Squares · Apache Spark · Big data · MovieLens dataset
M. F. Aljunid
Mangalore University, Mangalore, Karnataka, India
e-mail: Ngm505@yahoo.com
D. H. Manjaiah
Department of Computer Science, Mangalore University, Mangalore,
Karnataka, India
e-mail: drmdh2014@gmail.com
©Springer Nature Singapore Pte Ltd. 2019
V. E. Balas et al. (eds.), Data Management, Analytics and Innovation,
Advances in Intelligent Systems and Computing 839,
https://doi.org/10.1007/978-981-13-1274-8_22
1 Introduction
In recent years, big data has become one of the most active research topics in
computer science and related fields, with the potential to radically change
companies and organizations that use information to improve the customer
experience and transform their business models. Big data is characterized by five
features: volume, velocity, variety, value, and veracity. Data at this scale is
difficult to manage with conventional tools, techniques, and procedures. Big data
analytics handles bulk quantities of data and is used to mine and extract
patterns, information, and knowledge from them effectively. It has become an
important trend for organizations and enterprises interested in innovative ways
of improving their business performance and decision-making. Recommender systems
(RS) are a group of techniques that filter large samples and information spaces
in order to give suggestions to users when needed. RS are now highly popular and
are used in areas such as movies, research articles, search queries, news, books,
social tags, and music. There are also RS for experts, collaborators, jokes,
restaurants and hotels, garments, financial services, life insurance, romantic
partners (online dating), and social media such as Twitter, LinkedIn, and
Facebook.
RS use a number of different technologies to filter out the best-suited results
and provide them to users to satisfy their information needs. RS are classified
into three broad groups: content-based systems, collaborative filtering systems,
and hybrid recommender systems [1]. Content-based systems examine the properties
of the items to be recommended. They learn the preferences of a user from the
information need expressed in the objects the user has rated. They are
keyword-specific RS in which keywords are used to describe the items. Thus, a
content-based RS recommends to users items comparable to those they liked in the
past or are browsing currently. For instance, if a MovieLens user has browsed
several comedy movies, the RS will recommend highly rated movies that the
database classifies as comedies. Collaborative filtering systems are based on
similarity measures between users' information needs and the items. The items
recommended to a new user are those liked by other similar users in their
previous browsing history. A collaborative filtering algorithm uses the average
ratings of objects, recognizes similarities between users on the basis of their
ratings, and generates new recommendations based on inter-user comparisons.
However, it faces several challenges and limitations: data sparsity, which arises
when users rate only a small part of a large item set; the difficulty of making
predictions with nearest-neighbor algorithms; scalability, as the numbers of
users and items both increase; and cold start, where there is too little
information to relate like-minded people. To address these challenges, we moved
to another approach, model-based collaborative filtering [2]. Hybrid RS combine
content-based and collaborative filtering techniques in a way that suits a
particular item. Hybrid recommender systems are regarded as the RS most
frequently used by companies, because they can eliminate the weaknesses that
arise when a single RS is employed and combine the strengths of two or more RS.
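The user-based collaborative filtering idea described above (find users whose
ratings resemble those of the active user, then recommend what they liked) can be
illustrated with a minimal sketch. The toy ratings, the function names, and the
choice of cosine similarity over co-rated movies are assumptions made for
illustration; they are not taken from the paper.

```python
from math import sqrt

# Hypothetical toy ratings: user -> {movie: rating}
ratings = {
    "alice": {"Toy Story": 5.0, "Heat": 1.0, "Casino": 2.0},
    "bob": {"Toy Story": 4.0, "Heat": 1.0, "Casino": 2.0, "Babe": 5.0, "Seven": 2.0},
    "carol": {"Toy Story": 1.0, "Heat": 5.0, "Casino": 4.0, "Babe": 2.0, "Seven": 5.0},
}


def cosine_similarity(a, b):
    """Cosine similarity computed over the movies both users have rated."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[m] * b[m] for m in common)
    norm_a = sqrt(sum(a[m] ** 2 for m in common))
    norm_b = sqrt(sum(b[m] ** 2 for m in common))
    return dot / (norm_a * norm_b)


def recommend(user, ratings, top_n=1):
    """Rank unseen movies by the similarity-weighted average rating of other users."""
    scores, weights = {}, {}
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        sim = cosine_similarity(ratings[user], other_ratings)
        for movie, rating in other_ratings.items():
            if movie not in ratings[user]:
                scores[movie] = scores.get(movie, 0.0) + sim * rating
                weights[movie] = weights.get(movie, 0.0) + sim
    predicted = {m: scores[m] / weights[m] for m in scores if weights[m] > 0}
    return sorted(predicted, key=predicted.get, reverse=True)[:top_n]


print(recommend("alice", ratings))
```

In this toy data, bob's ratings are closer to alice's than carol's are, so the
movie bob rated highly is ranked first for alice. A neighborhood method like this
must compare the active user with every other user, which is one reason the
scalability and sparsity limitations mentioned above motivate model-based
approaches.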
The main focus of this work is the collaborative filtering system. Collaborative
filtering can be described as a procedure in which automatic predictions (i.e.,
filtering) about the interests of a user are made by gathering taste or
preference information from many users. The underlying assumption of the
collaborative filtering approach is that if person A has the same opinion as
person B on one issue, A is more likely to share B's opinion on a different issue
X than to share the opinion on X of a randomly chosen person [3]. Take for
instance the movie RS depicted in Fig. 1, which starts with a matrix whose entries
are movie ratings by users. Each column represents a user (shown in green) and
each row represents a particular movie (shown in blue). Because not all users
have rated all movies, many entries in the matrix are unknown, which is why
collaborative filtering is needed: for each user, there are ratings for only a
subset of the movies. With collaborative filtering, the idea is to approximate
the rating matrix by factorizing it as the product of two matrices, one describing
the properties of each user (shown in green) and the other describing the
properties of each movie.
The two matrices are selected so that the error for the user/movie pairs with
known ratings is minimized. The alternating least squares (ALS) algorithm
achieves this by first randomly filling the users matrix with values and then
optimizing the values of the movies matrix. It then holds the movies matrix
constant and optimizes the values of the users matrix (Fig. 1). Given a fixed set
of user factors (i.e., values in the users matrix), the known ratings are used to
find the best values for the movie factors using the optimization written at the
top of the figure. The best user factors are then selected with the movie factors
fixed. This paper reports, for the first time, a movie recommendation system
based on collaborative filtering using Apache Spark. The performance analysis and
evaluation of the proposed approach are performed on a MovieLens dataset. From
the results obtained, it is concluded that the selection of ALS parameters can
affect the performance of the recommender engine.
Fig. 1 Low rank factorization matrix [3]
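The alternation described above can be sketched in plain Python. This is a
minimal illustration, not the Spark implementation used in the paper: the toy
ratings, the function names, and the plain L2 regularization term weighted by
lam are assumptions made here. Each update solves a small regularized
least-squares problem for one user (or movie) while the other factor matrix is
held constant.

```python
import random


def solve(a, b):
    """Solve the small dense system a.x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [a[i][:] + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def update(fixed, observed_by_row, rank, lam):
    """Recompute one factor matrix by least squares while `fixed` is held constant."""
    result = {}
    for row, observed in observed_by_row.items():
        # Normal equations: (sum of f f^T + lam * I) x = sum of rating * f
        a = [[lam if i == j else 0.0 for j in range(rank)] for i in range(rank)]
        b = [0.0] * rank
        for col, rating in observed:
            f = fixed[col]
            for i in range(rank):
                b[i] += rating * f[i]
                for j in range(rank):
                    a[i][j] += f[i] * f[j]
        result[row] = solve(a, b)
    return result


def loss(ratings, users, movies, lam):
    """Squared error on the known ratings plus the L2 penalty on all factors."""
    err = sum((r - sum(p * q for p, q in zip(users[u], movies[m]))) ** 2
              for u, m, r in ratings)
    reg = sum(v * v for f in list(users.values()) + list(movies.values()) for v in f)
    return err + lam * reg


def als(ratings, rank=2, lam=0.1, iterations=10, seed=0):
    rng = random.Random(seed)
    by_user, by_movie = {}, {}
    for u, m, r in ratings:
        by_user.setdefault(u, []).append((m, r))
        by_movie.setdefault(m, []).append((u, r))
    # Start from randomly filled user factors, as described in the text.
    users = {u: [rng.random() for _ in range(rank)] for u in by_user}
    history = []
    for _ in range(iterations):
        movies = update(users, by_movie, rank, lam)  # users held constant
        users = update(movies, by_user, rank, lam)   # movies held constant
        history.append(loss(ratings, users, movies, lam))
    return users, movies, history


# Hypothetical (user, movie, rating) triples; every user and movie has ratings.
ratings = [
    (0, 0, 5.0), (0, 1, 4.0), (0, 2, 1.0),
    (1, 0, 4.0), (1, 1, 5.0), (1, 3, 1.0),
    (2, 1, 1.0), (2, 2, 5.0), (2, 3, 4.0),
    (3, 0, 1.0), (3, 2, 4.0), (3, 3, 5.0),
]
users, movies, history = als(ratings)
# Predicted rating for an unobserved pair: user 0, movie 3.
print(sum(p * q for p, q in zip(users[0], movies[3])))
```

Because each half-step exactly minimizes the regularized objective over one
factor matrix, the values in `history` never increase from one iteration to the
next, which is the property that makes the alternation converge.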
The remainder of this paper is organized as follows: related work is reviewed in
Sect. 2. Section 3 introduces the proposed movie recommender system using
collaborative filtering with the ALS algorithm, while the experimental study is
presented in Sect. 4. Finally, the conclusion is given in Sect. 5.
2 Related Work
So far, several researchers have presented work on building recommendation
systems. Wei et al. [4] proposed a hybrid recommender model to address the cold
start problem of collaborative filtering; it takes item content features learned
by a deep learning neural network, the SDAE architecture, and applies them to the
time-aware timeSVD++ CF model. Kupisz and Unold [5] developed and compared an
item-based collaborative filtering algorithm on two cluster computing frameworks,
namely Hadoop's disk-based MapReduce paradigm and Spark's in-memory RDD
paradigm. To enhance reliability and scalability and to improve the processing
of large-scale data, Zeng et al. [6] proposed PLGM. Their work considered two
matrix factorization algorithms, ALS and SGD. A parallel matrix factorization
based on SGD was implemented on Spark and compared with ALS in MLlib for
performance, and the advantages and disadvantages of each model were analyzed
based on the test results. A variety of profile aggregation approaches were
studied and the one giving the best result was adopted. The PLGM and LGM models
were compared in terms of efficiency and accuracy. Ponnam et al. [7] used
item-based collaborative filtering. In this method, they first inspect the
user-item rating matrix, identify the relationships among different items, and
use these relationships to compute recommendations for the user. Halder et al.
[8] proposed a new concept, movie swarm mining, using frequent item mining and
two pruning rules. It addresses the problem of item recommendation and gives an
idea of user interests and popular movie trends. This technique can help movie
producers plan their new movies. In addition, a new algorithm was proposed to
recommend movies to a new user. Dev and Mohan [9] proposed a scalable method for
building recommender systems based on similarity join. The system was designed
on the MapReduce framework in order to work with big data applications. It
significantly reduces unnecessary computation overhead, such as redundant
comparisons in the similarity computing phase, using a method called extended
prefix filtering (REF). Chen et al. [10] used co-clustering with augmented
matrices (CCAM) to design several methods, including heuristic scoring, a
traditional classifier, and machine learning, to build a recommendation system
and integrate content-based and collaborative filtering into a hybrid
recommendation system. Similarly, Zhou et al. [11] proposed a collaborative
filtering algorithm based on ALS, a powerful matrix decomposition algorithm. They
found that it extends well to distributed computing and can address the data
sparsity problem.
3 Proposed Movie Recommender System
This section presents the proposed system, a movie recommender system based on
ALS using Apache Spark. The novelty of this work lies in the selection of the ALS
parameters that can affect the performance of a movie recommender system.
3.1 Proposed System Block Diagram
In this work, we use user ratings from datasets provided by popular websites such
as IMDB, Rotten Tomatoes, MovieLens, and Time Movie Ratings. These datasets are
available in many formats such as CSV files, text files, and databases. We can
either stream the data live from the websites or download and store them on our
local file system or HDFS. Spark Streaming is used to process real-time streaming
data from various sources such as Twitter, the stock market, and geographical
systems, and to perform powerful analytics for businesses. We use collaborative
filtering (CF) to predict the ratings of users for particular movies based on
their ratings for other movies, combined with other users' ratings for those
movies. We train the ALS algorithm using MovieLens data and get the results from
the machine learning model. We use Spark SQL's DataFrame, Dataset, and SQL
services to store the data. The result of the machine learning model is stored in
an RDBMS so that the web application can display the recommendations to a
particular user. The results of the movie recommendation system are also stored
on our local drive: we store the recommended movies along with their ratings in
text and CSV file formats. We prefer storing the result in an RDBMS so that the
web application can access it directly and display recommendations and top
movies, as shown in Fig. 2.
3.2 Proposed System Steps
This subsection provides the steps for applying the ALS algorithm to the
MovieLens dataset in order to train and test the selection of the best parameters
when building a movie recommendation system.
Movie Recommendation System using CF with ALS
Input: MovieLens Dataset
Output: Top Recommended Movies.
Procedures:
Procedure 1: Parsing and loading datasets
Procedure 2: Recognize the user as new or regular.
    If new user goto Procedure 5
Procedure 3: Load training and test data into the table (userId, movieId, rating)
    # sc is the SparkContext
    def parse_the_rating(line):
        # each line holds userId, movieId and rating separated by whitespace
        x = line.split()
        return (int(x[0]), int(x[1]), float(x[2]))
    training = sc.textFile("__").map(parse_the_rating).cache()
    test = sc.textFile("__").map(parse_the_rating)
Procedure 4: Train the recommender model.
    from pyspark.mllib.recommendation import ALS
    new_model = ALS.train(training, rank, iterations)
Procedure 5: Create predictions on (user, movie) pairs from the test data
    predictions = new_model.predictAll(test.map(lambda x: (x[0], x[1])))
Procedure 6: Adding new user ratings
Procedure 7: Display top N recommended movies.
Procedure 8: Save the new_model
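Procedure 7 can be illustrated with a small, Spark-free sketch. It assumes the
predicted ratings for one user have already been collected into a dictionary;
the function name, the sample predictions, and the rule of skipping movies the
user has already rated are assumptions made for illustration.

```python
import heapq


def top_n_recommendations(predictions, already_rated, n=25):
    """Return the n highest (movie_id, predicted_rating) pairs among unrated movies."""
    candidates = ((m, s) for m, s in predictions.items() if m not in already_rated)
    return heapq.nlargest(n, candidates, key=lambda pair: pair[1])


# Hypothetical predicted ratings for one user: movie_id -> predicted rating
predicted = {1: 4.7, 2: 3.1, 3: 4.9, 4: 2.0, 5: 4.2}
seen = {3}  # movies the user has already rated

top = top_n_recommendations(predicted, seen, n=3)
print(top)
```

Using `heapq.nlargest` avoids sorting the full candidate list when only a short
top-N list is needed.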
Fig. 2 Proposed movie recommendation system using CF with ALS
4 Experimental Study
This section presents the experimental setup, followed by the analysis and
discussion of the results.
4.1 Apache Spark
Apache Spark [12] is a fast, general-purpose cluster computing system. It
provides high-level application programming interfaces (APIs) in Java, Python,
Scala, and R, and an engine that supports general execution graphs. It also
supports a rich set of higher-level tools, including Spark SQL for structured
data processing, MLlib for machine learning, GraphX for graph processing, and
Spark Streaming for real-time applications. It extends the MapReduce model to
efficiently support more types of computation and can run on Hadoop clusters. A
Spark application runs as an independent set of processes on the cluster. All of
the distributed processes are coordinated by a SparkContext object in the driver
program. The SparkContext connects to one type of cluster manager (standalone,
YARN, or Mesos) for resource allocation. The cluster manager provides executors,
which are essentially JVM processes that run the logic and store application
data. The SparkContext object then sends the application code (JAR files or
Python scripts) to the executors. Finally, the SparkContext sends tasks to the
executors to run.
4.2 Data Preprocessing
The dataset used in this work is the MovieLens dataset. It contains 24 million
ratings and 670,000 tag applications applied to 40,000 movies by 260,000 users.
The dataset consists of four files: ratings.csv, movies.csv, tags.csv, and
links.csv. The ratings file has the format userId, movieId, rating, timestamp,
and the movies file has the format movieId, title, genres, where the genres have
the format Genre1|Genre2|Genre3. The tags file has the format userId, movieId,
tag, timestamp, and the links file has the format movieId, imdbId, tmdbId. Once
the files are loaded into RDDs, we parse their lines and split the data into
three portions: training, validation, and test data. Parsing the movies and
ratings files yields two RDDs. For each row in the ratings dataset, we create a
tuple (userId, movieId, rating); the timestamp attribute is dropped during
preprocessing because it is not needed for this recommender. Similarly, for each
row in the movies dataset, we create a tuple (movieId, title); the genres
attribute is dropped because it is not used for this recommender.
In order to determine the best ALS parameters for our experiments, we randomly
split the ratings RDD into three pieces: a training set with 60% of the data,
used to train models; a validation set with 20% of the data, used to choose the
best model; and a test set with 20% of the data, used for our experiments.
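The 60/20/20 split can be sketched without Spark as follows. The function name,
the seed, and the synthetic rows are assumptions made for illustration. On an
RDD the same idea is usually expressed with `randomSplit`, which assigns rows
probabilistically, so the resulting sizes are only approximately 60/20/20; the
sketch below shuffles and slices instead, which gives exact sizes.

```python
import random


def random_split(rows, weights=(60, 20, 20), seed=42):
    """Shuffle rows and cut them into consecutive pieces proportional to weights."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    total = sum(weights)
    pieces, start, cumulative = [], 0, 0
    for w in weights:
        cumulative += w
        end = len(rows) * cumulative // total
        pieces.append(rows[start:end])
        start = end
    return pieces


# Hypothetical (userId, movieId, rating) rows
rows = [(u, m, float((u + m) % 5 + 1)) for u in range(10) for m in range(10)]
training, validation, test = random_split(rows)
print(len(training), len(validation), len(test))
```

Fixing the seed keeps the three sets identical across runs, so that models
trained with different parameter values are compared on the same validation
data.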
4.3 Experimental Environment
The tests were run on a machine with the following specification: Ubuntu 14.04
LTS, 4 GB of memory, an Intel® Core i5-2400 CPU @ 3.10 GHz with 4 cores, and a
500 GB hard disk. Apache Spark version 2.1.1 is installed on this machine and is
used to develop the proposed system. The dataset used in this research work is
the MovieLens dataset [13]. In the proposed model, root mean squared error (RMSE)
is used as the performance measure. RMSE measures the difference between the
ratings that users actually gave and the ratings predicted by the model.
Equation (1) shows how RMSE is computed for the movie recommender system.
$$\mathrm{RMSE} = \sqrt{\frac{\sum_{i=1}^{n} \left(x_{ui} - y_{ui}\right)^{2}}{n}} \qquad (1)$$
where $x_{ui}$ is the rating that user $u$ gives to item $i$ in the experimental
data, $y_{ui}$ is the rating that the model predicts for user $u$ and item $i$,
and $n$ is the number of ratings in the test data.
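As a worked illustration of Eq. (1), the following sketch computes RMSE for a
handful of ratings. The function name and the sample values are assumptions made
for illustration and are not taken from the experiments.

```python
from math import sqrt


def rmse(actual, predicted):
    """Root mean squared error between actual and predicted ratings."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("actual and predicted must be non-empty and of equal length")
    squared_errors = ((x - y) ** 2 for x, y in zip(actual, predicted))
    return sqrt(sum(squared_errors) / len(actual))


# Hypothetical ratings: what users gave versus what a model predicted
actual = [4.0, 3.0, 5.0, 2.0]
predicted = [3.5, 3.0, 4.0, 2.5]

# Squared errors are 0.25, 0, 1 and 0.25, so RMSE = sqrt(1.5 / 4) = sqrt(0.375)
print(rmse(actual, predicted))
```

Because the errors are squared before averaging, one large miss (the 5.0 rating
predicted as 4.0) contributes more to the score than several small ones.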
4.4 Experimental Results Analysis and Discussion
Recommender systems (RS) are becoming increasingly popular. In this work, Apache
Spark is used to demonstrate an efficient parallel implementation of a
collaborative filtering method using ALS. ALS is used for dimensionality
reduction, which helps in overcoming limitations of collaborative filtering such
as data sparsity and scalability. Data sparsity appears in numerous situations. A
related problem arises when a new item or user has just been added to the system:
it is difficult to find similar ones since there is not sufficient information.
This is called the cold start problem [14, 15]. When ALS is selected as part of
building the proposed movie recommender system, there are basic parameters that
determine how well the ratings of users for given movies are predicted. These
parameters are rank, iterations, and lambda.
The contribution of this paper is to study the selection of parameters that
affect the performance of the ALS model in building a movie recommender system,
because the literature shows that little research has focused on how the
selection of ALS parameters affects its performance in building a movie
recommender engine using Apache Spark. The lambda and iterations parameters are
used to control and adjust the predictive capability of the ALS-based matrix
factorization, which in turn affects the evaluation of the movie RS. Lambda
specifies the regularization parameter in ALS, and iterations specifies the
number of iterations the model should run. The ALS algorithm typically converges
within 5 to 20 iterations.
The lambda and iterations parameters of the ALS model are set to different values
to observe the effect of matrix factorization performance on the recommendation
results and thus to select the most appropriate parameters for the following test
setups. Tables 1, 2 and 3 show the performance of the movie recommendation engine
based on ALS under different values of lambda and iterations. Table 1 shows the
execution time, Table 2 the rank of the best-trained model, and Table 3 the RMSE,
each as lambda and iterations change. The results in Table 1 indicate that when
lambda is 0.6 and the number of iterations is 10, the time is at its minimum,
1.41323 s, and the rank is 8, as shown in Table 2. The RMSE for this setting is
1.07424, as indicated in Table 3. On the other hand, Table 1 shows that when
lambda is 0.2 and the number of iterations is 15, the running time is 1.463743 s
and the rank is 12, as shown in Table 2. The RMSE for this setting is the
minimum, 0.9167, as presented in Table 3.
As mentioned above, the movie recommendation system is analyzed using three
quality metrics: RMSE, time, and rank. Using these three metrics, two cases are
obtained, as shown in Table 4: case 1 with low time and high RMSE, and case 2
with higher time and low RMSE. According to the results in Table 4, the
predictions for the top 25 movies are shown in Figs. 3 and 4.
Table 1 Time (s) of matrix factorization using lambda and iteration parameters
Lambda  Iteration
        5      10     15      20     25
0.1     1.489  1.454  1.473   1.469  1.512
0.2     1.485  1.437  1.464   1.438  1.514
0.3     1.472  1.481  1.4671  1.441  1.494
0.4     1.658  1.486  1.476   1.495  1.473
0.5     1.431  1.492  1.468   1.478  1.528
0.6     1.615  1.413  1.442   1.459  1.480
0.7     1.443  1.475  1.471   1.446  1.543
0.8     1.554  1.470  1.459   1.449  1.527
0.9     1.491  1.478  1.482   1.471  1.446
Table 2 Rank of matrix factorization using lambda and iteration parameters
Lambda  Iteration
        5   10  15  20  25
0.1     4   12  4   4   4
0.2     8   12  12  12  12
0.3     4   8   8   8   8
0.4     4   8   8   8   8
0.5     8   8   8   8   8
0.6     8   8   8   8   12
0.7     12  12  4   4   12
0.8     12  12  4   4   4
0.9     12  4   4   4   4
Table 3 RMSE of matrix factorization using lambda and iteration parameters
Lambda  Iteration
        5      10     15      20     25
0.1     0.947  0.942  0.940   0.938  0.938
0.2     0.919  0.917  0.9167  0.917  0.917
0.3     0.941  0.941  0.941   0.941  0.941
0.4     0.975  0.980  0.980   0.981  0.981
0.5     1.018  1.024  1.024   1.024  1.024
0.6     1.069  1.074  1.074   1.074  1.074
0.7     1.127  1.130  1.131   1.131  1.131
0.8     1.192  1.193  1.193   1.193  1.193
0.9     1.261  1.261  1.261   1.261  1.261
Table 4 Two cases for selecting parameters for ALS
Metrics   Case 1   Case 2
Time (s)  1.41323  1.463743
Rank      8        12
RMSE      1.07422  0.9167
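The way the two cases follow from the grids can be checked with a short sketch
that scans Tables 1 and 3 for their minima. The values are transcribed from the
tables as printed; the variable and function names are assumptions made for
illustration.

```python
iterations = [5, 10, 15, 20, 25]

# Table 1: execution time in seconds, one row per lambda
time_s = {
    0.1: [1.489, 1.454, 1.473, 1.469, 1.512],
    0.2: [1.485, 1.437, 1.464, 1.438, 1.514],
    0.3: [1.472, 1.481, 1.4671, 1.441, 1.494],
    0.4: [1.658, 1.486, 1.476, 1.495, 1.473],
    0.5: [1.431, 1.492, 1.468, 1.478, 1.528],
    0.6: [1.615, 1.413, 1.442, 1.459, 1.480],
    0.7: [1.443, 1.475, 1.471, 1.446, 1.543],
    0.8: [1.554, 1.470, 1.459, 1.449, 1.527],
    0.9: [1.491, 1.478, 1.482, 1.471, 1.446],
}

# Table 3: RMSE, one row per lambda
rmse_grid = {
    0.1: [0.947, 0.942, 0.940, 0.938, 0.938],
    0.2: [0.919, 0.917, 0.9167, 0.917, 0.917],
    0.3: [0.941, 0.941, 0.941, 0.941, 0.941],
    0.4: [0.975, 0.980, 0.980, 0.981, 0.981],
    0.5: [1.018, 1.024, 1.024, 1.024, 1.024],
    0.6: [1.069, 1.074, 1.074, 1.074, 1.074],
    0.7: [1.127, 1.130, 1.131, 1.131, 1.131],
    0.8: [1.192, 1.193, 1.193, 1.193, 1.193],
    0.9: [1.261, 1.261, 1.261, 1.261, 1.261],
}


def best_setting(grid):
    """Return (value, lambda, iterations) for the smallest value in the grid."""
    return min(
        (value, lam, it)
        for lam, row in grid.items()
        for it, value in zip(iterations, row)
    )


fastest = best_setting(time_s)       # case 1: lowest execution time
most_accurate = best_setting(rmse_grid)  # case 2: lowest RMSE
print(fastest, most_accurate)
```

The scan returns lambda 0.6 with 10 iterations as the fastest setting and lambda
0.2 with 15 iterations as the most accurate one, matching the two cases listed
in Table 4.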
Fig. 3 Prediction of top 25 movies for case 1
In general, the lowest RMSE is considered the best case for prediction in
building a recommendation system. Therefore, we adopt the second case as the best
case: its RMSE is smaller than that of the first case, and there is no
significant difference in execution time between the two cases. We can now obtain
the top recommended movies using the second case, which can be useful for
building recommendation engines that predict the top 25 ranked movies.
Fig. 4 Prediction of top 25 movies for case 2
5 Conclusion and Future Work
Movie recommender systems play a significant role in identifying a set of movies
for users based on their interests. Although many movie recommendation systems
are available, these systems have the limitation of not recommending movies
efficiently to existing users. This paper presented a movie recommender system
based on collaborative filtering using Apache Spark. The results show that the
selection of ALS parameters can affect the performance of a movie recommender
engine. The system was evaluated using metrics such as execution time, RMSE of
rating prediction, and the rank at which the best model was trained. Two best
cases were chosen based on the parameter selection from the experimental results,
which can lead to good rating predictions for a movie recommender engine. Of
these cases, the one with the lowest RMSE is considered the best for prediction.
Therefore, the second case is recommended: its RMSE is smaller than that of the
first case, and there is no significant difference in execution time between the
two cases. It can be useful for building recommendation engines that predict the
top 25 ranked movies. In future work, we plan to develop an improved loss
function to address the shortcomings of the ALS-based recommender algorithm,
starting from the parameters of the best case, which has the best RMSE, using
Apache Spark.
References
1. Verma, J. P., Patel, B., & Patel, A. (2015). Big data analysis: Recommendation system with
Hadoop framework. In 2015 IEEE International Conference on Computational Intelligence &
Communication Technology (CICT). IEEE.
2. Katarya, R., & Verma, O. P. (2016). A collaborative recommender system enhanced with
particle swarm optimization technique. Multimedia Tools and Applications, 75(15),
9225-9239.
3. https://docs.databricks.com/_static/notebooks/cs100x-2015-introduction-to-big-data/module-
5machine-learning-lab.html.
4. Wei, J., et al. (2016). Collaborative filtering and deep learning based hybrid recommendation
for cold start problem. In 2016 IEEE 14th International Conference on Dependable,
Autonomic and Secure Computing, 14th International Conference on Pervasive Intelligence
and Computing, 2nd International Conference on Big Data Intelligence and Computing and
Cyber Science and Technology Congress (DASC/PiCom/DataCom/CyberSciTech). IEEE.
5. Kupisz, B., & Unold, O. (2015). Collaborative filtering recommendation algorithm based on
Hadoop and Spark. In 2015 IEEE International Conference on Industrial Technology (ICIT).
IEEE.
6. Zeng, X., et al. (2016). Parallelization of latent group model for group recommendation
algorithm. In IEEE International Conference on Data Science in Cyberspace (DSC). IEEE.
7. Ponnam, L. T., et al. (2016). Movie recommender system using item based collaborative
filtering technique. In International Conference on Emerging Trends in Engineering,
Technology, and Science (ICETETS). IEEE.
8. Halder, S., Sarkar, A. M. J., & Lee, Y.-K. (2012). Movie recommendation system based on
movie swarm. In 2012 Second International Conference on Cloud and Green Computing
(CGC). IEEE.
9. Dev, A. V., & Mohan, A. (2016). Recommendation system for big data applications based on
set similarity of user preferences. In International Conference on Next Generation Intelligent
Systems (ICNGIS). IEEE.
10. Chen, Y.-C., et al. (2016). User behavior analysis and commodity recommendation for
point-earning apps. In 2016 Conference on Technologies and Applications of Artificial
Intelligence (TAAI). IEEE.
11. Zhou, Y. H., Wilkinson, D., & Schreiber, R. (2008). Large scale parallel collaborative
filtering for the Netflix prize. In Proceedings of 4th International Conference on Algorithmic
Aspects in Information and Management (pp. 337-348). Shanghai: Springer.
12. https://spark.apache.org/docs/latest/. Accessed March 10, 2017.
13. https://grouplens.org/datasets/movielens/. Accessed May 15, 2017.
14. Delgado, J. A. (2000, February). Agent-based information filtering and recommender systems
on the internet (Ph.D. thesis). Nagoya Institute of Technology.
15. Mooney, R. J., & Roy, L. (1999). Content-based book recommendation using learning for text
categorization. In Proceedings of the Workshop on Recommender Systems: Algorithms and
Evaluation (SIGIR '99). Berkeley, CA, USA.
Movie Recommender System Based on Collaborative Filtering 295
... They achieve this by leveraging the capabilities of Apache Spark and Apache Hadoop. M. Aljunid et al. 2019 [30] Proposed a movie recommender system based on the ALS algorithm with employing Spark. The research explores the optimal selection of the ALS algorithm parameters that can effectively enhance the performance of a robust Recommender System. ...
... Collaborative filtering (CF) is one of the most frequently used method in various fields to recommend items. In CF approach the recommendation had been done based on users and items [13]. The user to user or item to item similarity can be evaluated based on ratings. ...
Article
Full-text available
Across sites sharing travel details, an enormous number of reviews is posted every day. To recognize potential target customers quickly and effectively, hotels need to establish a customer recommender system. The data adopted in this study was provided by Trip Advisor, which lets customers rate a hotel on six criteria: Service, Sleep Quality, Value, Location, Cleanliness, and Room. This study proposes a multi-criteria recommender system to analyse the impact of contextual segments on the overall rating based on trip type and hotel class. We introduce an item-item collaborative filtering approach in which the adjusted cosine similarity measure is applied to estimate missing context values in the dataset. Significant contexts are selected using backward elimination with a multiple regression algorithm. Multi-collinearity among predictors is examined on the basis of the Variance Inflation Factor (VIF). In the experimental scenario, the results are reported by hotel class and trip type. The performance of the multiple regression model is evaluated with statistical measures such as R-square, MAE, MSE, and RMSE. In addition, an ANOVA study is conducted for different hotel classes and trip types under 2, 3, 4, and 5 star hotel classes.
... Apache Spark has been used to improve system performance, especially when the amount of data becomes large or when computational operations increase dramatically. In the literature, Apache Spark has been used in various fields such as the medical field [51], the machine learning field [52]-[55], and path-finding. In one recent path-finding study, Alazzam et al. applied a parallel A* algorithm using a Hadoop Insight cluster provided by Azure with six worker nodes to find the optimal path in a mountain climbing environment using Apache Spark. They evaluated the proposed algorithm in terms of runtime, cost, efficiency, and speedup on a generated dataset with different sizes. ...
Article
Finding the shortest path between two nodes is a common problem in many applications such as games, robotics, and real-life problems. Because it involves a large number of possibilities, parallel algorithms are well suited to this optimization problem, which has attracted many researchers from both industry and academia seeking the optimal path in terms of runtime, speedup, efficiency, and cost compared to sequential algorithms. In mountain climbing, finding the shortest path from the start node at the foot of the mountain to the destination node is a fundamental operation, and there are some interesting issues to be studied in mountain climbing that cannot be found in a traditional two-dimensional space search. We present a parallel Ant Colony Optimization (ACO) algorithm to find the shortest path in the mountain climbing problem using Apache Spark. The proposed algorithm guarantees the safety of the selected path by applying constraints that take into account the safe slope angle for the path. A generated dataset with variable sizes is used to evaluate the proposed algorithm in terms of runtime, speedup, efficiency, and cost. The experimental results show that the parallel ACO algorithm significantly (p < 0.05) outperformed the best sequential ACO. The parallel ACO algorithm was also compared with one of the most recent approaches in the literature for finding the best path in mountain climbing problems, a parallel A* algorithm with Apache Spark, and it significantly outperformed that algorithm.
... Most research on personalized recommendation services focuses on developing recommender systems that consider customers' preferences by using preference ratings, purchase records, and behavioral patterns directly assigned by such customers [8,19,20]. Since the Tapestry recommender system was first introduced by Goldberg et al. [21], recommender systems have emerged in various fields, including books [22][23][24], movies [25][26][27][28], music [29][30][31], and shopping malls [7,29,30]. Recently, with the gradual expansion of the food market, such as HMR, research on food recommender systems has been widely conducted. ...
Article
With the continuous growth of the Home Meal Replacement (HMR) market, recommender systems have become increasingly important for effectively recommending customized HMR products to each customer. The extant literature has mainly focused on enhancing the performance of recommender systems based on offline evaluations of customers' past purchase records. However, because existing offline evaluation methods measure how consistent the products on the recommendation list are with those purchased by customers in the test dataset, they cannot capture components such as serendipity and novelty that are also crucial in recommendation. Moreover, existing offline evaluation methods cannot measure rewards such as discount coupons, which may play a vital role in strengthening customers' desire to purchase and thereby stimulating purchases when a recommendation list is provided. In this study, we used an SOR model to verify the effect of personalized recommendation stimulus on a customer's response in an actual online environment. The results indicate that the customers' response rate was higher with personalized recommendations than with bestseller recommendations, and higher when customers were offered cash discounts rather than redeemable points. Meanwhile, the response rate to recommendations with higher volumes of rewards was not as high as expected, and the point pressure mechanism did not work either.
Conference Paper
Tasks critical to enterprise profitability, such as customer churn prediction, fraudulent account detection or customer lifetime value estimation, are often tackled by models trained on features engineered from customer data in tabular format. Application-specific feature engineering adds development, operationalization and maintenance costs over time. Recent advances in representation learning present an opportunity to simplify and generalize feature engineering across applications. When applying these advancements to tabular data researchers deal with data heterogeneity, variations in customer engagement history or the sheer volume of enterprise datasets. In this paper, we propose a novel approach to encode tabular data containing customer transactions, purchase history and other interactions into a generic representation of a customer's association with the business. We then evaluate these embeddings as features to train multiple models spanning a variety of applications. CASPR, Customer Activity Sequence-based Prediction and Representation, applies Transformer architecture to encode activity sequences to improve model performance and avoid bespoke feature engineering across applications. Our experiments at scale validate CASPR for both small & large enterprise applications.
Article
Given the growth of the Web and, consequently, of e-commerce, the application of recommendation systems is becoming more and more extensive. A good recommendation algorithm can provide a better user experience. Many existing approaches to collaborative filtering can neither handle very large datasets nor easily deal with users who have very few ratings. This paper therefore proposes an improved constrained Bayesian probability matrix factorization algorithm. The algorithm introduces a potential similarity constraint matrix for specific sparsely scored users to affect the user's feature vector, and uses the logistic function to express the nonlinear relationship of the potential factors, combined with the Markov chain Monte Carlo method for training. Finally, the data set is used for testing and comparative evaluation. Experiments on the MovieLens and Netflix datasets show that the model can be trained efficiently using Markov chain Monte Carlo methods. The experimental results show that the algorithm has better predictive performance and is suitable for solving the problem of a sparse rating matrix for specific users.
Article
In a web environment, recommendation systems (RS) are among the fastest-evolving applications. An RS is a subset of information filtering systems in which information about products, services, or people is categorized and recommended to the individual concerned. Most authors have designed collaborative movie recommendation systems using K-NN and K-means, but with the huge increase in the number of movies and users, neighbour selection is becoming more problematic. We propose a hybrid movie recommender model that uses a type division method to classify movie types according to users, which reduces computational complexity. K-Means provides initial parameters to particle swarm optimization (PSO) to improve its performance. PSO in turn provides the initial seed for fuzzy c-means (FCM) and optimizes it, giving a soft clustering of data items (users) instead of the strict clustering of K-Means. For the proposed model, we first adopted the type division method to reduce the dense multidimensional data space. We then looked for techniques that could give better results than K-Means and found FCM to be the solution. Because the genetic algorithm (GA) has the limitation of unguided mutation, we used PSO. Experiments on the MovieLens dataset illustrate that the proposed model may deliver high performance in terms of veracity, along with more predictable and personalized recommendations. Compared with existing methods, which have a mean absolute error (MAE) of 0.78, our approach achieves an MAE of 0.75, a 3.503% improvement.
Conference Paper
Movie recommendation is important in our social life because it provides enhanced entertainment. A movie recommendation system can suggest a set of movies to users based on their interests or on the popularity of the movies. Although a number of movie recommendation systems have been proposed, most of them either cannot recommend movies to existing users efficiently or cannot recommend movies to new users at all. In this paper we propose a movie recommendation system that can recommend movies to new users as well as existing ones. It mines movie databases to collect the important information required for recommendation, such as popularity and attractiveness. It generates movie swarms that are not only convenient for movie producers planning a new movie but also useful for movie recommendation. Experimental studies on real data reveal the efficiency and effectiveness of the proposed system.
Conference Paper
Many recommendation systems suggest items to users by utilizing the techniques of collaborative filtering (CF) based on historical records of items that the users have viewed, purchased, or rated. Two major problems that most CF approaches have to contend with are scalability and sparseness of the user profiles. To tackle these issues, in this paper, we describe a CF algorithm, alternating least squares with weighted-λ-regularization (ALS-WR), which is implemented on a parallel Matlab platform. We show empirically that the performance of ALS-WR (in terms of root mean squared error (RMSE)) monotonically improves with both the number of features and the number of ALS iterations. We applied the ALS-WR algorithm to a large-scale CF problem, the Netflix Challenge, with 1000 hidden features and obtained an RMSE score of 0.8985, which is one of the best results based on a pure method. In addition, by combining it with parallel versions of other known methods, we achieved a performance improvement of 5.91% over Netflix's own CineMatch recommendation system. Our method is simple and scales well to very large datasets.
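The alternating update behind ALS-WR can be sketched on a toy problem. The block below is a minimal pure-Python illustration, not the authors' parallel Matlab implementation: the function names, the toy ratings, the rank `k`, and the value of `lam` are all assumptions made for the example. Each half-step solves a small regularized least-squares system per user (or per item), with the penalty weighted by the number of ratings that user or item has.

```python
import random

def dot(x, y):
    return sum(p * q for p, q in zip(x, y))

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= f * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - s) / aug[r][r]
    return x

def update(fixed, rated_by_row, k, lam):
    """Re-solve every factor vector on one side while the other side is fixed."""
    out = []
    for rated in rated_by_row:          # rated: list of (other_index, rating)
        n = len(rated)
        if n == 0:                      # nothing observed: keep a zero vector
            out.append([0.0] * k)
            continue
        # Normal equations: (sum v v^T + lam * n * I) x = sum r * v
        a = [[lam * n if p == q else 0.0 for q in range(k)] for p in range(k)]
        b = [0.0] * k
        for j, r in rated:
            v = fixed[j]
            for p in range(k):
                b[p] += r * v[p]
                for q in range(k):
                    a[p][q] += v[p] * v[q]
        out.append(solve(a, b))
    return out

def rmse(ratings, users, items):
    se = [(r - dot(users[u], items[i])) ** 2 for u, i, r in ratings]
    return (sum(se) / len(se)) ** 0.5

def als_wr(ratings, n_users, n_items, k=2, lam=0.05, iters=10, seed=0):
    """Return user factors, item factors, and the regularized loss per iteration."""
    rng = random.Random(seed)
    by_user = [[] for _ in range(n_users)]
    by_item = [[] for _ in range(n_items)]
    for u, i, r in ratings:
        by_user[u].append((i, r))
        by_item[i].append((u, r))
    items = [[rng.random() for _ in range(k)] for _ in range(n_items)]
    users = [[0.0] * k for _ in range(n_users)]
    history = []
    for _ in range(iters):
        users = update(items, by_user, k, lam)
        items = update(users, by_item, k, lam)
        sse = sum((r - dot(users[u], items[i])) ** 2 for u, i, r in ratings)
        reg = sum(len(by_user[u]) * dot(v, v) for u, v in enumerate(users))
        reg += sum(len(by_item[i]) * dot(v, v) for i, v in enumerate(items))
        history.append(sse + lam * reg)
    return users, items, history

# Hypothetical toy data: (user index, item index, rating).
RATINGS = [(0, 0, 5), (0, 1, 4), (0, 2, 1), (1, 0, 4), (1, 1, 5),
           (1, 3, 2), (2, 0, 1), (2, 2, 5), (2, 3, 4), (3, 1, 5), (3, 2, 2)]

users, items, history = als_wr(RATINGS, n_users=4, n_items=4)
print("training RMSE:", rmse(RATINGS, users, items))
```

Because each half-step exactly minimizes the regularized loss over one block of variables, the values in `history` never increase. The training RMSE printed here is not a substitute for the held-out RMSE that the abstract reports.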
Conference Paper
In recent years, due to the rapid development of e-commerce, personalized recommendation systems have prevailed in product marketing. However, recommendation systems rely heavily on big data, creating a difficult situation for businesses at initial stages of development. We design several methods (a traditional classifier, heuristic scoring, and machine learning) to build a recommendation system and integrate content-based collaborative filtering for a hybrid recommendation system using Co-Clustering with Augmented Matrices (CCAM). The data sources include user personas derived from actions taken in the app and on Facebook, as well as product information derived from the web. For this particular app, more than 50% of users clicked fewer than 10 times in 1.5 years, leading to insufficient data. Thus, we face the challenge of a cold-start problem in analyzing user information. In order to obtain sufficient purchasing records, we analyzed frequent users and used web crawlers to enhance our item-based data, resulting in F-scores from 0.756 to 0.802. Heuristic scoring greatly enhances the efficiency of our recommendation system.
Conference Paper
Recommender system techniques are software techniques that provide users with suggestions about the items they may want to consume or use. The conventional approach is to treat this as a decision problem and solve it using rule-based techniques or cluster analysis. However, recommendation systems are mainly employed in applications such as online markets, which work with big data. Because performing data mining on big data is a tedious task due to its distributed nature and enormity, another method known as set-similarity join can be used instead. This paper proposes a solution for item recommendation in big data applications. The proposed work presents customized and personalized item recommendations and suggests the most suitable items to users. In particular, key terms are used to indicate users' preferences, and a user-based collaborative filtering algorithm is adopted to create suitable suggestions. The proposed work is designed to run on Hadoop, a widely used distributed computing platform based on the MapReduce framework.
Conference Paper
Recommender systems, a part of information filtering systems, are used to predict the preference or rating a user would give an item. Among the different kinds of recommendation approaches, collaborative filtering is very popular because of its effectiveness. Traditional collaborative filtering systems can work very effectively and produce good recommendations, even for wide-ranging problems. Collaborative filtering techniques, which recommend items based on the preferences of a user's neighbors, produce better suggestions than other approaches, whereas techniques such as content-based filtering suffer from poor accuracy, scalability, data sparsity, and large prediction errors. To explore this, we used an item-based collaborative filtering approach. In this technique, we first examine the user-item rating matrix and identify the relationships among items, and then use these relationships to compute recommendations for the user.
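The two steps named in this abstract, finding relationships among items in the user-item rating matrix and then using them to score unseen items, can be illustrated with a short sketch. The abstract does not say which similarity measure is used, so plain cosine similarity over co-rating users is an assumption here, as are the function names and the toy ratings.

```python
from math import sqrt

def item_similarity(ratings, a, b):
    """Cosine similarity between items a and b over users who rated both."""
    common = [u for u in ratings if a in ratings[u] and b in ratings[u]]
    if not common:
        return 0.0
    num = sum(ratings[u][a] * ratings[u][b] for u in common)
    norm_a = sqrt(sum(ratings[u][a] ** 2 for u in common))
    norm_b = sqrt(sum(ratings[u][b] ** 2 for u in common))
    return num / (norm_a * norm_b) if norm_a and norm_b else 0.0

def predict(ratings, user, target):
    """Similarity-weighted average of the ratings the user has already given."""
    num = den = 0.0
    for item, r in ratings[user].items():
        if item == target:
            continue
        s = item_similarity(ratings, item, target)
        num += s * r
        den += abs(s)
    return num / den if den else None

def recommend(ratings, user, n=3):
    """Rank the items the user has not rated by predicted rating."""
    all_items = {i for rated in ratings.values() for i in rated}
    scored = []
    for item in all_items - set(ratings[user]):
        p = predict(ratings, user, item)
        if p is not None:
            scored.append((p, item))
    scored.sort(key=lambda t: (-t[0], t[1]))  # best score first, ties by name
    return [item for _, item in scored[:n]]

# Hypothetical user -> {item: rating} matrix.
TOY = {
    "ann": {"A": 5, "B": 4, "C": 1},
    "bob": {"A": 4, "B": 5, "D": 2},
    "cat": {"A": 1, "C": 5, "D": 4},
    "dan": {"B": 5, "C": 2},
}

print(recommend(TOY, "dan"))
```

One design note: with raw cosine, two items that share a single co-rating user always get similarity 1, which is one reason practical systems often use adjusted cosine or require a minimum number of co-raters.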
Verma, J. P., Patel, B., & Patel, A. (2015). Big data analysis: Recommendation system with Hadoop framework. In 2015 IEEE International Conference on Computational Intelligence & Communication Technology (CICT). IEEE.