International Conference
AUTOMATICS AND INFORMATICS’2017
4-6 October 2017, Sofia, Bulgaria
JOHN ATANASOFF SOCIETY
OF AUTOMATICS AND INFORMATICS
COLLABORATIVE FILTERING TECHNIQUES FOR GENERATING
RECOMMENDATIONS ON BIG DATA
K. Al-BARZNJI, A. ATANASSOV
University of Chemical Technology and Metallurgy-Sofia, Kl. Ochridski Bul. 8, Sofia 1756, Bulgaria,
Tel: (+3592)8163329, E-mails: Kamal.barznji@raparinuni.org; naso@uctm.edu
Abstract: Recommender systems are found in many e-commerce applications today. A recommendation system learns a person’s taste
and finds new items that match it. As one of the most successful approaches to building recommender systems, collaborative
filtering (CF) uses the known preferences of a group of users to make recommendations or predictions of the unknown preferences
of other users. In this paper, we first introduce recommendation systems and CF, and then propose a recommendation system for
large amounts of data based on collaborative filtering techniques (user-based and item-based). These techniques require no
knowledge of the properties or characteristics of items; they use only the information in the rating matrix. We implemented
these recommendation algorithms on the Hadoop platform using Apache Mahout, a machine learning tool, to provide a scalable
system for processing large data sets efficiently. Finally, we combined the results (recommendations) to provide more useful
business intelligence.
Keywords: Recommendation System; Collaborative Filtering; User-based; Item-based; Hadoop Framework; Mahout.
1. INTRODUCTION
A recommendation is a suggestion that helps people make good
decisions faster, and so helps customers and businesses close
transactions faster. Large volumes of user feedback data are
produced every day in areas such as movies, food, and
electronics, and the amount of user data is increasing
dramatically with the growth of e-commerce websites. This
raises the question of how to store and how to understand all
this data. Big data platforms are among the best solutions for
storing such data and have stimulated interest in
recommendation systems as a way to understand it [1].
Recommendation is an application area of machine learning
that suggests items (movies, books, friends) by analysing
patterns in users’ behaviour or actions (likes, ratings,
purchases, views) on items [17]. Recommender Systems (RS),
also called recommendation systems, are software systems
designed to solve the problem of estimating user ratings, or
preferences, for items that the user has not yet seen [2].
Many companies, such as Netflix and Amazon, use recommender
systems, which are software that selects products to recommend
to individual customers. Successful RS use past product
purchase and satisfaction data to make high-quality
personalised recommendations. The volume of data available to
recommender systems today is staggering and forces a total
re-evaluation of the methods used to compute recommendations
[3]. RS have become popular over the last decade: as the
number of products has grown, so has the need for recommender
systems. A recommender system tries to predict the interests
of a user and to recommend products that match those interests
as accurately as possible. E-commerce businesses also profit,
because sales tend to increase when users are presented with
more items that are likely to match their interests. RS
typically produce a list of recommendations in one of two
ways: collaborative filtering or content-based filtering [4].
Content-based filtering matches the attributes of the user
(age, gender) with the attributes of items (for example, genre
for movies), whereas collaborative filtering finds patterns
between users and items [17]. By combining these two
approaches, hybrid recommendation systems can be developed
that consider both the user’s ratings and the items’ features
when recommending items. Efficient and accurate recommendation
techniques are very important for a system that is to provide
good and useful recommendations to its individual users. Fig.1
shows the different recommendation techniques [5].
Fig. 1. Recommendation Techniques
Collaborative Filtering (CF) is one of the most successful
techniques in recommender systems. CF techniques can be
divided into two categories: memory-based and model-based.
Memory-based CF algorithms use the entire user-item database,
or a sample of it, to generate a prediction. CF works by
building a database (user-item matrix) of preferences for
items by users. It then matches users with similar interests
and preferences by calculating similarities between their
profiles in order to make recommendations [5]. Many commercial
sites use CF algorithms to make recommendations for users
because CF is easy to implement and expands well [6].
2. BIG DATA PLATFORMS
2.1 HADOOP FRAMEWORK
Hadoop is an open-source Java framework for large-scale
processing and querying of vast amounts of data across
clusters of computers. It is an Apache project that was
started in 2006, with Yahoo leading much of its early
development. It is largely inspired by Google’s MapReduce and
Google File System (GFS). Hadoop is widely used by large
companies such as Yahoo, Microsoft, Facebook, and Amazon [7].
Hadoop can be used for data mining applications and for
recommendation algorithms through the Apache Mahout project.
The entire dataset is transferred to the Hadoop file system,
and a recommendation algorithm is then run over it on the
framework [1]. The Hadoop framework has two primary
components, which are also central to the Hadoop ecosystem:
the Hadoop Distributed File System (HDFS), a storage system,
and MapReduce, a processing system. HDFS is highly
fault-tolerant and is designed to be deployed on low-cost
hardware. It provides high-throughput access to application
data and suits applications that have big data sets [8].
MapReduce is a programming model and software framework
introduced by Google in 2004. It simplifies the processing of
large quantities of data in parallel on large clusters of
commodity hardware in a robust, reliable and fault-tolerant
way. It can handle petabytes of data on thousands of nodes
[7]. Fig.2 shows other Apache projects that are part of the
Hadoop ecosystem [16].
Fig.2. Apache Hadoop Ecosystem
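To make the MapReduce programming model more concrete, the following is a minimal single-process sketch in Python. It does not use Hadoop or any Hadoop API; the function names and the toy ratings are hypothetical and only illustrate the map, shuffle and reduce steps, here used to compute the mean rating per item.

```python
from collections import defaultdict


def map_phase(records):
    """Map step: emit one (item_id, rating) pair per input record."""
    for _user_id, item_id, rating in records:
        yield item_id, rating


def shuffle_phase(pairs):
    """Shuffle step: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce step: collapse each group to one value (here, the mean)."""
    return {key: sum(values) / len(values) for key, values in groups.items()}


def mean_rating_per_item(records):
    """Chain the three steps, as a framework would do across many nodes."""
    return reduce_phase(shuffle_phase(map_phase(records)))


# Hypothetical toy ratings in the form (user_id, item_id, rating).
toy_ratings = [(1, "A", 4.0), (2, "A", 2.0), (1, "B", 5.0)]
print(mean_rating_per_item(toy_ratings))
```

On a real cluster the map and reduce functions would run in parallel on different nodes and the framework would perform the shuffle; the sketch only shows how the work is divided.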
2.2 APACHE MAHOUT
Apache Mahout is an open-source project that provides free
implementations of scalable and distributed machine learning
algorithms in the areas of collaborative filtering, clustering
and classification. It provides both non-distributed and
distributed (MapReduce) algorithms for recommendation [9]. As
shown in Fig.3 [10], the Mahout library brings together many
similarity algorithms and lets developers integrate them into
collaborative filtering recommender systems in order to
identify neighbourhoods of similar users or to compute
similarities between items [11]. Today, the Mahout library
suits applications that must scale to large datasets, because
it was opened to contributions of implementations that run on
top of Apache Hadoop.
Algorithms: Classification; Clustering; Recommender/Collaborative Filtering; Evolutionary Algorithms; Pattern Mining; Regression; Dimension reduction; Similarity Vectors
Similarity Measures: Pearson Correlation; Spearman Correlation; Euclidean Distance; Tanimoto Coefficient; Log Likelihood Similarity
Neighborhood Measures: Nearest N Users Algorithm
Fig.3. List of Mahout Algorithms, Similarity and Measures
3. PROPOSED SYSTEM
In this paper, we propose combining the two collaborative
filtering techniques (user-based and item-based) as a hybrid
for generating recommendations on large amounts of data using
Apache Mahout, as shown in Fig.4. Our approach differs from
traditional hybrid recommendation systems, which combine
collaborative filtering algorithms with content-based
filtering techniques. We use only memory-based collaborative
filtering algorithms and combine both techniques, aiming to
produce better results by keeping the advantages of the two
techniques while offsetting their drawbacks.
To combine the recommendation results, we used one of the
hybrid recommender system categories, called the weighted
hybrid. This hybrid combines the scores from each component
using a linear formula. Each component must therefore be able
to produce a recommendation score that can be combined
linearly, and the components should perform uniformly, with
consistent relative accuracy, across the product space [15].
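The weighted hybrid can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the Mahout-based implementation used in this paper: the function names, the weight parameter `user_weight`, the toy scores, and the choice to treat an item missing from one component as scoring 0.0 are all hypothetical.

```python
def weighted_hybrid(user_scores, item_scores, user_weight=0.5):
    """Linearly combine two score dictionaries keyed by item id.

    Assumption of this sketch: an item that one component did not score
    contributes 0.0 from that component.
    """
    if not 0.0 <= user_weight <= 1.0:
        raise ValueError("user_weight must lie in [0, 1]")
    item_weight = 1.0 - user_weight
    combined = {}
    for item in set(user_scores) | set(item_scores):
        combined[item] = (user_weight * user_scores.get(item, 0.0)
                          + item_weight * item_scores.get(item, 0.0))
    return combined


def top_n(scores, n):
    """Return the n best item ids, ties broken by item id for determinism."""
    return sorted(scores, key=lambda item: (-scores[item], item))[:n]


# Hypothetical scores from the two components for one user.
user_based = {"m1": 4.5, "m2": 3.0}
item_based = {"m2": 4.0, "m3": 5.0}

combined = weighted_hybrid(user_based, item_based, user_weight=0.5)
print(top_n(combined, 2))
```

Because the combination is a union over both score sets, an item that only the item-based component could score still appears in the final list, which is useful when the user-based component finds no neighbours.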
Fig.4. Overview of Proposed Architecture for Big Data for
Generating Recommendations
Our system takes a big dataset as input. We then use two
algorithms to generate recommendations: first, user-based CF
is run on the dataset, and then item-based CF is run, both
using Mahout. These techniques require no knowledge of the
properties or characteristics of items; they use only the
information in the rating matrix. Finally, we combine the
results of both methods. User-user CF sometimes suffers from a
lack of nearest neighbours: when the preferences of the user
for whom recommendations are being built do not match those of
any other user, the results of item-item CF can help. In
addition, combining the results of the two algorithms provides
more useful business intelligence.
3.1 COLLABORATIVE FILTERING ALGORITHM
Collaborative filtering (CF) is a very popular recommendation
algorithm. Its basic idea is to work from the past behaviour
of users [12]. CF methods analyse a large amount of
information about users’ preferences and use the preferences
of similar users to recommend items [9]. The output of CF can
be either a prediction or a recommendation. A prediction is a
numerical value, while a recommendation is a list of the top N
items that the user will like the most, as shown in Fig.5 [5].
Fig.5. Collaborative filtering process
3.1.1 USER-BASED COLLABORATIVE FILTERING
User-user CF is a very straightforward algorithm. It searches
for users whose ratings are similar to those of the active
user and uses their preferences on other items to recommend
items to the active user [12]. The technique first finds the
user’s neighbours based on user similarities and then combines
the neighbours’ rating scores [4]. Fig.6 shows the pseudocode
of user-based CF. The similarity measure referred to in line 4
can be any similarity measure [14].
Fig.6. User-based collaborative filtering
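The user-based procedure described above can be sketched as follows. This is an illustrative Python sketch, not a transcription of the pseudocode in Fig.6 and not Mahout code. The similarity function `agreement` is a deliberately simple, hypothetical stand-in (any measure, such as PCC, could be passed in instead), and the toy rating dictionary is made up.

```python
def agreement(ratings_a, ratings_b):
    """Toy similarity in (0, 1]: 1 / (1 + mean absolute rating gap) over
    co-rated entries, or 0.0 when there is no overlap."""
    common = set(ratings_a) & set(ratings_b)
    if not common:
        return 0.0
    gap = sum(abs(ratings_a[k] - ratings_b[k]) for k in common) / len(common)
    return 1.0 / (1.0 + gap)


def predict_user_based(ratings, user, item, k=2, similarity=agreement):
    """Predict the user's rating of item from the k most similar users
    who have rated that item (similarity-weighted average)."""
    candidates = [
        (similarity(ratings[user], ratings[other]), other)
        for other in ratings
        if other != user and item in ratings[other]
    ]
    neighbours = sorted((c for c in candidates if c[0] > 0), reverse=True)[:k]
    if not neighbours:
        return None  # no usable neighbour for this user and item
    total = sum(sim for sim, _ in neighbours)
    return sum(sim * ratings[other][item] for sim, other in neighbours) / total


# Hypothetical rating matrix: user -> {item: rating}.
ratings = {
    "u1": {"A": 5, "B": 3},
    "u2": {"A": 5, "B": 3, "C": 4},
    "u3": {"A": 1, "B": 5, "C": 2},
}
print(predict_user_based(ratings, "u1", "C"))
```

Here u2 agrees with u1 on every co-rated item, so its rating of C dominates the prediction. When no other user has rated the item, the function returns None, which corresponds to the case where the item-based results are needed.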
3.1.2 ITEM-BASED COLLABORATIVE FILTERING
Item-based CF uses the similarities between items to make
recommendations. It is based on the past behaviour of the user
and recommends items that are similar to those the user liked
in the past [12]. The rating of an item by a user can be
predicted by averaging the ratings of other similar items
rated by that user [4]. This is illustrated in Fig.7. As in
user-based CF, the similarity measure referred to in line 4
can be any similarity measure [14].
Fig.7. Item-based collaborative filtering
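The item-based variant can be sketched in the same style. Again this is a hypothetical Python illustration rather than the pseudocode of Fig.7 or the Mahout implementation: `agreement` is a simple stand-in for any similarity measure, `item_vector` extracts one column of the rating matrix, and the toy data are made up.

```python
def agreement(ratings_a, ratings_b):
    """Toy similarity in (0, 1]: 1 / (1 + mean absolute rating gap) over
    co-rated entries, or 0.0 when there is no overlap."""
    common = set(ratings_a) & set(ratings_b)
    if not common:
        return 0.0
    gap = sum(abs(ratings_a[k] - ratings_b[k]) for k in common) / len(common)
    return 1.0 / (1.0 + gap)


def item_vector(ratings, item):
    """Column of the rating matrix: {user: rating} for one item."""
    return {user: row[item] for user, row in ratings.items() if item in row}


def predict_item_based(ratings, user, item, k=2, similarity=agreement):
    """Predict the user's rating of item as a similarity-weighted average
    of the user's own ratings on the k most similar items."""
    target = item_vector(ratings, item)
    candidates = [
        (similarity(target, item_vector(ratings, other)), other)
        for other in ratings[user]
        if other != item
    ]
    neighbours = sorted((c for c in candidates if c[0] > 0), reverse=True)[:k]
    if not neighbours:
        return None  # the user has rated no item similar to the target
    total = sum(sim for sim, _ in neighbours)
    return sum(sim * ratings[user][other] for sim, other in neighbours) / total


# Hypothetical rating matrix: user -> {item: rating}.
ratings = {
    "u1": {"A": 5, "B": 3},
    "u2": {"A": 5, "B": 3, "C": 4},
    "u3": {"A": 1, "B": 5, "C": 2},
}
print(predict_item_based(ratings, "u1", "C"))
```

The only structural difference from the user-based sketch is that similarity is computed between columns (items) instead of rows (users), and the weighted average runs over the active user's own ratings.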
4. EXPERIMENTAL EVALUATION
4.1 DATASET
The MovieLens data sets (downloaded from
https://grouplens.org/datasets/movielens/), which were
collected by the GroupLens Research Project at the University
of Minnesota, are commonly used for collaborative filtering
algorithms and recommendation systems. For this paper, we used
the MovieLens 100k (ML100k) dataset, focusing mostly on the
rating table. This dataset consists of 100,000 ratings (1-5)
from 943 users on 1682 movies, and each user has rated at
least 20 movies [13].
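As a rough illustration of how the rating table can be turned into a rating matrix, the sketch below parses tab-separated lines of the form user id, item id, rating, timestamp, which is the layout of the ratings file in the ML100k archive. The function names and the sample lines are hypothetical. The sketch also computes how densely the matrix is filled from the figures quoted above.

```python
def parse_ratings(lines):
    """Build {user_id: {item_id: rating}} from tab-separated rating lines
    (user id, item id, rating, timestamp); the timestamp is ignored."""
    ratings = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        user_id, item_id, rating, _timestamp = line.split("\t")
        ratings.setdefault(int(user_id), {})[int(item_id)] = float(rating)
    return ratings


def density(num_ratings, num_users, num_items):
    """Fraction of the user-item matrix that holds a rating."""
    return num_ratings / (num_users * num_items)


# Made-up sample lines in the same layout (not taken from the dataset).
sample = ["1\t10\t5\t0", "1\t20\t3\t0", "2\t10\t4\t0"]
print(parse_ratings(sample))

# With 100,000 ratings, 943 users and 1682 movies, about 6.3% of the
# matrix is filled; the remaining cells are the unknown ratings.
print(round(density(100_000, 943, 1682), 3))
```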
4.2 SIMILARITY MEASURES
Similarity measures are used in recommender systems to
determine the similarity between items and/or users within a
system. They are also commonly used in certain evaluation
metrics of recommender systems [14]. For this paper, the
Pearson Correlation Coefficient (PCC) similarity is computed
on the dataset. User preference values are the basis from
which this similarity is calculated between different users
and between different items, so it can be used in both
user-user and item-item CF to compute recommendations.
The PCC formula is:
$$ sim(i,j) = \frac{\sum_{u \in U} (r_{i,u} - \bar{r}_i)(r_{j,u} - \bar{r}_j)}{\sqrt{\sum_{u \in U} (r_{i,u} - \bar{r}_i)^2}\,\sqrt{\sum_{u \in U} (r_{j,u} - \bar{r}_j)^2}} \qquad (1) $$
where, in (1), $U$ is the set of items rated by both user i
and user j, $\bar{r}_i$ and $\bar{r}_j$ are the average rating
values of users i and j respectively, and $r_{i,u}$ and
$r_{j,u}$ denote the ratings of item u by users i and j
respectively [14].
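A minimal Python sketch of the PCC in (1) is given below. It makes two assumptions that the text does not fix: the averages are taken over the co-rated items only, and the similarity is defined as 0.0 when fewer than two items are co-rated or when either user's co-rated ratings have no variance. The function name and the toy ratings are hypothetical.

```python
from math import sqrt


def pearson(ratings_i, ratings_j):
    """Pearson correlation between two users' {item: rating} dictionaries,
    computed over the items both have rated."""
    common = set(ratings_i) & set(ratings_j)
    if len(common) < 2:
        return 0.0  # assumption: too little overlap means no similarity
    mean_i = sum(ratings_i[u] for u in common) / len(common)
    mean_j = sum(ratings_j[u] for u in common) / len(common)
    numerator = sum((ratings_i[u] - mean_i) * (ratings_j[u] - mean_j)
                    for u in common)
    sq_i = sum((ratings_i[u] - mean_i) ** 2 for u in common)
    sq_j = sum((ratings_j[u] - mean_j) ** 2 for u in common)
    if sq_i == 0 or sq_j == 0:
        return 0.0  # assumption: constant ratings give an undefined PCC
    return numerator / sqrt(sq_i * sq_j)


# Hypothetical users: b rates like a (shifted down by one), c rates opposite.
a = {"A": 5, "B": 3, "C": 4}
b = {"A": 4, "B": 2, "C": 3}
c = {"A": 1, "B": 5, "C": 3}
print(pearson(a, b), pearson(a, c))
```

The first pair illustrates that PCC ignores a constant offset between two users' rating scales, while the second pair shows the strongly negative value produced by opposite tastes.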
4.3 EVALUATION AND METRICS
For both the user-based and item-based recommender systems,
we evaluate Root Mean Square Error (RMSE), Precision, Recall,
and F1 Score, which have been widely used to compare and
measure the performance of recommendation systems. A fixed
number of items is also recommended for a particular user.
RMSE is known as a predictive accuracy or statistical accuracy
metric because it represents how accurately the RS estimates a
user’s preference for an item. In our movie dataset context,
RMSE evaluates how well the RS can predict a user’s rating for
a movie on a scale from one to five stars. RMSE is the square
root of the average squared deviation between a user’s
estimated rating and actual rating. The formula is [5]:
$$ RMSE = \sqrt{\frac{1}{N} \sum_{u,i} (p_{u,i} - r_{u,i})^2} \qquad (2) $$
where, in (2), $p_{u,i}$ is the predicted (estimated) rating
for user u on item i, $r_{u,i}$ is the actual rating, and $N$
is the total number of ratings on the item set.
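The RMSE in (2) can be computed as in the following short Python sketch. The function name and the toy predicted and actual ratings are hypothetical and are not taken from the experiments.

```python
from math import sqrt


def rmse(predicted, actual):
    """Root mean square error between paired predicted and actual ratings."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("need two non-empty sequences of equal length")
    squared_errors = [(p - r) ** 2 for p, r in zip(predicted, actual)]
    return sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical predictions for three (user, item) pairs on a 1-5 scale.
predicted = [4.0, 3.0, 5.0]
actual = [5.0, 3.0, 3.0]
print(rmse(predicted, actual))  # square root of (1 + 0 + 4) / 3
```

Because the errors are squared before averaging, the single two-star miss contributes four times as much as the one-star miss, so RMSE penalises large errors more heavily than small ones.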
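Precision is the fraction of recommended items that are
actually relevant to the user, while recall is the fraction of
relevant items that are also part of the set of recommended
items. They are computed as:
$$ Precision = \frac{|\text{relevant items} \cap \text{recommended items}|}{|\text{recommended items}|} \qquad (3) $$
$$ Recall = \frac{|\text{relevant items} \cap \text{recommended items}|}{|\text{relevant items}|} \qquad (4) $$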
The F-measure defined below combines precision and recall
into a single metric. The resulting value makes comparison
between algorithms and across data sets simple and
straightforward [5].
$$ F = \frac{2PR}{P + R} \qquad (5) $$
where, in (5), P is the precision and R is the recall.
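Precision, recall and the F-measure can be sketched in Python as follows, using the set-based definitions given in the text. The function names and the toy item sets are hypothetical; returning 0.0 for empty sets or a zero denominator is an assumption of this sketch.

```python
def precision_recall(recommended, relevant):
    """Precision and recall of a recommendation list against the set of
    items that are actually relevant to the user."""
    recommended, relevant = set(recommended), set(relevant)
    hits = len(recommended & relevant)
    precision = hits / len(recommended) if recommended else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def f_measure(precision, recall):
    """Harmonic mean of precision and recall, as in (5)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: 4 recommended items, 3 relevant items, 2 in common.
p, r = precision_recall(["A", "B", "C", "D"], ["B", "D", "E"])
print(p, r, f_measure(p, r))
```

As a consistency check, applying `f_measure` to the item-based precision and recall reported in Table 1 (0.0067 and 0.0068) gives a value that rounds to the reported F1 of 0.0067.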
4.4 RESULTS AND DISCUSSION
For each unknown rating, the recommender finds the most
similar items that have been rated by the same user (or the
most similar users who have rated the same item) and predicts
the rating as a weighted sum of the neighbours’ ratings.
Similarity is computed using the Pearson correlation
coefficient (PCC). The metrics evaluated are RMSE, Precision,
Recall, and F1 Score at 10. A fixed number of items (5 items)
is also recommended for a particular user (user_id=15). The
recommendation system is asked to estimate the preference
values for the test data, and the results are compared with
the actual preference values to measure the quality of
recommendation. The evaluation yields a score for each
recommender. For RMSE, a lower score is better, as it
indicates that the estimates are closer to the actual
preference values. Table 1 and Fig.8 show the evaluation
results. User-based CF takes the rows of the rating matrix and
item-based CF takes the columns for similarity measurement, so
that the similarities between users or between items are used
to compute recommendations. In addition, user-based CF
algorithms tend to perform very well on metrics such as
precision and recall. However, they are computationally
intensive and thus do not scale well. Item-based CF is
typically less computationally intensive than user-based CF,
but it tends to produce poorer-quality recommendations on
metrics such as precision and recall.
Table 1 Evaluation Results for CF Techniques
CF-Techniques   RMSE     Precision   Recall   F1
User-based      1.0686   0.0229      0.0229   0.0229
Item-based      1.0806   0.0067      0.0068   0.0067
Fig.8. Evaluation Results for CF Techniques
5. CONCLUSION
Recommender systems are a powerful new technology for
extracting additional value for a business. These systems help
users find items they want to buy or like from a business.
Collaborative filtering is a very popular recommendation
algorithm whose basic idea is to work from the past behaviour
of users. At the same time, people want an intelligent system
to assist them in the decision-making process in various
online environments, such as the university and commerce
domains. We have therefore proposed combining the
collaborative filtering techniques (user-based and item-based)
as a hybrid for generating recommendations on large amounts of
structured data using Apache Mahout. These techniques require
no knowledge of the properties or characteristics of items;
they use only the information in the rating matrix. Finally,
the different approaches are combined to form a recommender
system for better results, and the combination of the results
of the two algorithms provides more useful business
intelligence, which can add significant value to enterprises.
REFERENCES
[1] Y. Yengi and S. İ. Omurca, “Distributed Recommender Systems
with Sentiment Analysis Büyük Veride Tavsiye Sistemlerini
Duygu Analizi ile Desteklemek,” Eur. J. Sci. Technol., vol. 4,
no. 7, pp. 51–57, 2016.
[2] S. K. Zhuo Zhang, Paul Cuff, “Iterative Collaborative
Filtering for Recommender Systems with Sparse Data,” Princeton
University, Princeton, NJ 08544, IEEE Int., pp. 1–6, 2012.
[3] M. S. and G. K. David C. Anastasiu, Evangelia
Christakopoulou, Shaden Smith, “Big Data and Recommender
Systems,” Tech. Rep., no. September, pp. 1–26, 2016.
[4] M. Santhini, M. Balamurugan, and M. Govindaraj,
“Collaborative Filtering Approach for Big Data Applications
Based on Clustering,” Int. J. Recent Res. Math. Comput. Sci.
Inf. Technol., vol. 2, no. 1, pp. 202–208, 2015.
[5] B. A. O. F.O. Isinkaye, Y.O. Folajimi, “Recommendation
systems: Principles, methods and evaluation,” Egypt.
Informatics Journal, Elsevier, pp. 261–273, 2015.
[6] B. Wang and R. Wang, “A Collaborative Filtering Algorithm
Fusing User-based, Item-based and Social Networks,” IEEE Int.
Conf. Big Data (Big Data), pp. 2337–2343, 2015.
[7] Y. Sowmya, “Parallelizing K-Anonymity Algorithm for Privacy
Preserving Knowledge Discovery from Big Data,” Int. J. Appl.
Eng. Res., vol. 11, no. 2, pp. 1314–1321, 2016.
[8] J. Kim and S. Hwang, “Big Data Platform of a System
Recommendation in Cloud Environment,” Int. J. Softw. Eng. Its
Appl., vol. 9, no. 12, pp. 133–142, 2015.
[9] S. Bagchi, “Performance and Quality Assessment of Similarity
Measures in Collaborative Filtering Using Mahout,” Procedia
Comput. Sci., vol. 50, pp. 229–234, 2015.
[10] A. P. Jai Prakash Verma, Bankim Patel, “Big Data Analysis:
Recommendation System with Hadoop Framework,” IEEE Int. Conf.
Comput. Intell. Commun. Technol. Big, pp. 92–97, 2015.
[11] T. Arsan, “Comparison of Collaborative Filtering Algorithms
with Various Similarity Measures for Movie Recommendation,”
Int. J. Comput. Sci. Eng. Appl., vol. 6, no. 3, pp. 1–20, 2016.
[12] S. Sharma and M. Sethi, “Implimenting Collaborative
Filtering on Large Scale data,” Int. Res. J. Eng. Technol.,
pp. 102–106, 2015.
[13] F. Maxwell Harper and Joseph A. Konstan, “The MovieLens
Datasets: History and Context,” ACM Transactions on Interactive
Intelligent Systems (TiiS) 5, 4, Article 19, 19 pages, 2015.
[14] Chantal Fry, “A Comparison of Collaborative Filtering
Algorithms for Job Recommendations Using Apache Mahout,” Master
thesis, Computer Science Department, Faculty of California
State Polytechnic University, Pomona, 2016.
[15] R. Burke, “Hybrid Recommender Systems: Survey and
Experiments,” User Modelling and User-Adapted Interaction,
vol. 12, pp. 331–370, 2002.
[16] https://www.mssqltips.com/sqlservertip/3262/big-data-basics--part-6--related-apache-projects-in-hadoop-ecosystem/,
online Jul, 2017.
[17] https://www.zaizi.com/blog/movie-recommender-using-talend-machine-learning,
online Jul, 2017.