User Behavior Analysis with Machine Learning
Techniques in Cloud Computing Architectures
Matias Callara ∗, Patrice Wira ∗
∗Université de Haute-Alsace, IRIMAS Laboratory, 68093 Mulhouse, France
Email: {matias-ezequiel.callara, patrice.wira}@uha.fr
Abstract—This paper presents the use of machine learning
algorithms to analyze the behaviors of users working in a dis-
tributed computer environment. The objective consists in discrim-
inating groups of close users. These groups are composed of users
with similar behaviors. Events related to the users’ behaviors are
recorded and transferred to a database. An approach is developed
to determine the groups of the users. A non-parametric method
of estimating a probability density is used to predict application
launches and session openings in an individual way for each user.
These algorithms have been implemented and demonstrated their
effectiveness within a complete virtualization environment for
workstations and applications under real conditions in a hospital.
Index Terms—Machine learning, user behavior analytics, be-
havior analysis, user classification, prediction
I. INTRODUCTION
Working software environments, in the broadest sense of
the term, can no longer do without remote capabilities. More
recently, it has been seen that mobility and productivity are
linked, since the first improves the second. Indeed, in a form
of ultimate culmination of mobility, the user’s workstation
can be a notebook computer but also a smartphone or any
other connected object like a tablet, a connected television,
a specific terminal, or even a connected object from the Internet
of Things (IoT). In these terminals, not only the data have to
be loaded from and sent to the server but also the applications
and even complete working environments. Thus, Information
and Communication Technology (ICT) architectures must be
designed with the possibility to render the applications on the
terminal and without installation. This is the new generation
of a cloud computer architecture that involves deploying vir-
tualization software that combines workstation virtualization
and application virtualization. These virtualization solutions
provide mobility, or even a form of ultra-mobility, that is now
indispensable for users to be more effective in achieving their
tasks.
In virtual desktops, launching an application can take 15
to 20 seconds while loading a web page from the Internet
takes less than a second. The requirements of the users are
natural, the ICT architecture must evolve to satisfy them and
to tend towards real time. Therefore, the challenge consists
in the analysis of behaviors and the classification of users to
predict their future activities.
By predicting the users’ activities, it will be possible to
anticipate the opening of a session or the launch of an
application and thus, the data and other resources that are
necessary can be made available in advance, i.e., sessions and
applications can be pre-loaded. Predicting the users’ activities
is also a way to optimize the resources on the server side [1].
The prediction algorithms can be based on recent advances
in Machine Learning (ML) theories [2]. They are appropriate
to handle the large amount of information collected from the
user and to digest the heterogeneity of the generated data.
Obviously, ML techniques are able to evaluate their own ef-
fectiveness. The learning capabilities of these techniques allow
them to constantly adapt to changes in users’ behavior and to
achieve virtualization of workstations that evolve according to
the needs.
The rest of the paper is organized as follows. The next
section describes User Behavior Analytics (UBA) issues. Section III reviews ML approaches for UBA applications. Specific
learning techniques are presented in Section IV and implemen-
tation aspects and results are provided in Section V. Finally,
conclusions are drawn in Section VI.
II. BEHAVIOR ANALYSIS AND USER DETECTION
For several years, behavioral analysis has been the focus
of intense efforts in marketing applications [3]. Obviously, the
objective is to adopt some new specific and efficient marketing
strategies that are based on data, i.e., recorded information that
represent the past activities of potential clients. This is referred
to as data-based behavioral marketing. Behavioral analysis has
also found its usefulness in the fight against fraud and in
various other applications [4]. Now, it is not a surprise to
see that behavioral analysis can enhance ICT, organize more
efficiently production tools, detect internal threats like targeted
attacks, adapt software to the users, accelerate some repetitive
tasks, etc. However, it requires a certain level of acceptability
from the users [5].
A. Definition and Objective
A user model is a representation of a user or a group of
users in an ICT system [6], [7]. This model includes a set
of parameters and/or data that are representative of the user’s
past behavior.
The development of user models starts with the design
of systems able to collect all the data that are necessary
to represent the users. The data can be used to get a deep
understanding of the users [8]. In some contexts, the data
related to a single user are huge and it is necessary to
define reliable models of users or groups of users. These
M. Callara and P. Wira, "User Behavior Analysis with Machine Learning Techniques in Cloud Computing Architectures,"
International Conference on Applied Smart Systems (ICASS 2018), Médéa, Algeria, 24-25 November, 2018.
DOI: 10.1109/ICASS.2018.8651961
models are made of features and parameters that represent
the users or groups of users in their activities performed
through various applications. These models can then be used
as a basis for providing personalized services to them. Indeed,
missing information can be retrieved, specific categories can
be deduced, future activities and behaviors can be predicted
and all this helps to an interactive, adapted and personalized
interaction with the users. Applications are various, in natural
dialogue processing, in speech transcription systems, in new
tools for business strategy and marketing, in human resource
management, to detect security anomalies, etc. At the very
end, the main goal is always to improve the user experience.
B. Behavior Analysis
The UBA is the discipline of analyzing user behaviors. In an
operational way, it is essentially the collecting, monitoring and
processing of user data. The data sets collected from the users
are stored in databases, data log files, histories, directories,
and any other systems recording the user behaviors.
The purpose of this process is to provide parameters and to
build reliable and usable models of users, in other words, that
accurately characterize the users.
For example, the Internet has become a privileged space
for this type of application [9]. Indeed, technologies are now
mature, ready and spread out in order to collect and exploit
the present and past behavior of individual Internet users in
real time. The interests, the attendances, the facts and gestures,
the movements, the attitudes, the lifestyle, the living standard,
etc. are deduced from data sets being produced by surfing on
the Internet. Obviously, the status of a user can evolve and
change at any time. Techniques make it possible to adapt the
models on the basis of the experiment and according to the
evolutions of the collected data in real time.
The UBA relies on three pillars: data analysis, data integration and data presentation. Actually, analyzing and
processing the phenomenal amount of data is the most difficult challenge. The heterogeneity, volume and speed of data
generation are increasing rapidly. This is exacerbated with
the use of wireless networks, IoT sensors, smartphones and
the increasing activities on the Internet. Therefore, real time
UBA must be fast in processing the big amount of data and
ML algorithms should be appropriate candidates [10]. For that
purpose, ML algorithms must run in real time, access the
whole data sets, and adapt their own parameters, i.e., learn.
ML algorithms can also be interfaced with enterprise resource
planning software to get additional information about the
users and to combine them with their past and present activities
while processing. The idea is to enable the establishment of
self-adaptive models.
III. MACHINE LEARNING ALGORITHMS FOR UBA
The techniques of ML represent a branch of statistics and
computer science that studies the algorithms and architectures
capable of learning from the observed facts, i.e., measured
data [2], [11]–[13]. These techniques include artificial neural
networks with supervised learning, Bayesian decision the-
ory, parametric, semi-parametric and non-parametric methods,
multivariate analysis, hidden Markov models, reinforcement
learning, kernel estimators, graphical models, statistical tests...
ML methods, through a learning process, are able to self-adjust their own parameters from a data set. This set of data
contains all the coherent information that is necessary, for
example, to carry out a classification, modeling or prediction
task. Furthermore, it is often necessary to separate all the
data available in two sub-sets: 1) The learning set which is
used to learn or to calculate the optimal parameters of the
learning machine; 2) The test set which is used to verify the
performance of the machine after learning from the previous
set.
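The separation described above can be sketched as follows; this is a minimal illustration in Python, where the function name and the 80/20 ratio are our own choices, not taken from the paper:

```python
import random

def learning_test_split(data, test_ratio=0.2, seed=42):
    """Shuffle the data set and separate it into a learning set and a test set."""
    rng = random.Random(seed)
    shuffled = data[:]  # copy so the original data set is left untouched
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_ratio)
    # 1) learning set: used to calculate the optimal parameters;
    # 2) test set: used to verify the performance after learning.
    return shuffled[n_test:], shuffled[:n_test]

learning_set, test_set = learning_test_split(list(range(100)))
print(len(learning_set), len(test_set))  # 80 20
```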
The quickly growing amount of data collected via the
Internet and IoT has promoted the developments of ML
techniques [14]. Many companies have already their own data
harvesting tools and now they are faced with the challenge of
exploiting them in an effective and relevant way.
The user model that must be used varies according to the
applications and the objectives [15]. User models may seek to
describe:
1) The cognitive processes underlying the user’s actions;
2) A difference between user’s skills and expert skills;
3) Behavioral patterns or user preferences;
4) The characteristics of the user.
The first applications of ML techniques for UBA were
centered on the first two types of models. Most recently,
research activities focus on developing the third type of model
and try to find out user preferences. Finally, the applications
of ML techniques aimed at discovering the characteristics of
the users - i.e., related to the fourth type of the previous list -
remain scarce. Today it is the scientific issue that is the most
interesting to explore and that attracts the most attention. In the
design of user models, it is important to distinguish between
approaches to model individual users or communities, classes,
groups of users.
The major limitations in implementing and using automatic
ML techniques for UBA purposes are the following:
•The amount of data that sometimes requires very large
computing capabilities. Indeed, in most situations, ML
algorithms require a relatively large number of examples
to be precise.
•The validity of the data included in the learning set and
which is necessary for a ML algorithm to build user
models with an acceptable accuracy. In other words, how
can we be sure that a data set corresponds exactly to a
type of user, to one of its behaviors, to atypical and/or ab-
normal behaviors, to changes in the behavior of users...?
A simplistic strategy is to use a large amount of data to
compensate for uncertainties, exceptions, deviations, etc.
IV. UBA IN VIRTUAL DESK ENVIRONMENTS
A. Workstation Virtualization Context
Virtualization of workstations is a logical evolution of
digital transformation.

Fig. 1. ICT general architecture for virtualized applications (terminals such as a PC, a PDA or a tablet access the application, file, database, directory and management servers and the virtual PCs through an RDP channel and a load balancer).

It allows employees to work with fewer
constraints and at the same time it reduces the costs for the
ICT administrator. Indeed, this type of infrastructure has the
advantage of significantly reducing the maintenance and client-
side management tasks while providing the user employee
with its full working environment (settings, files, software,
etc.) and this whatever the material support. Such a cloud
computer architecture is represented by Fig. 1.
The benefits of workstation virtualization are multiple:
•It increases the employee productivity and mobility
through a single solution and without compromising the
security.
•It delivers remote access while respecting confidentiality,
compliance, and risk management standards.
•It reduces computer-related costs and complexity by centralizing application and workstation management, and by
automating other common tasks on the server side.
•It frees up the ICT resources by simplifying the management of applications, workstations and data.
The bulk of the configuration and calculation tasks is
then fully focused on the server side. Companies such as
Citrix1 and Systancia2 offer a comprehensive and complete
software solution for virtualized desktops and applications. In
this kind of cloud computer architecture, several servers are
used and they must be organized and designed to provide users
allocate the computing loads between servers is required and
its implementation is achieved by the management server in
Fig. 1. The efficiency of this strategy is improved by the
prediction of opening/closing sessions and of launching of
remote applications for each user. By predicting the behavior
of users, it becomes possible to improve their experience [16].
1www.citrix.com
2www.systancia.com
B. Proposed Algorithm for User Classification
Here, the objective is to classify each individual user according only to his previous behaviors. To do this, the instants
when a remote application has been launched by a user are
recorded. This makes it possible to build a histogram from the instants of
application launches for each user. Then, a dissimilarity matrix
is calculated using the Jensen-Shannon divergence [17]. This
dissimilarity matrix is then used to project each dot, i.e., user,
that represents a histogram in a multidimensional space in
a new two-dimensional space by ensuring that the distances
between the dots (the dissimilarities) are preserved [18].
Finally, the dots in the new space with reduced dimensions
can be grouped to set up classes or categories of users. The
K-means algorithm [19] is used to determine similar groups
of users.
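The pipeline above (per-user histograms, Jensen-Shannon dissimilarity matrix, distance-preserving 2-D projection, K-means grouping) can be sketched as follows. This is an illustrative reconstruction from scipy/scikit-learn building blocks, not the authors' implementation; note that scipy's `jensenshannon` returns the square root of the divergence, hence the squaring:

```python
import numpy as np
from scipy.spatial.distance import jensenshannon
from sklearn.manifold import MDS
from sklearn.cluster import KMeans

def classify_users(histograms, n_clusters=6, seed=0):
    """histograms: (n_users, n_bins) array of per-user launch-instant histograms."""
    # Normalize each histogram into a discrete probability distribution.
    p = histograms / histograms.sum(axis=1, keepdims=True)
    n = len(p)
    # Pairwise dissimilarity matrix from the Jensen-Shannon divergence.
    d = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            d[i, j] = d[j, i] = jensenshannon(p[i], p[j]) ** 2
    # Distance-preserving projection of each dot (user) into a 2-D space.
    xy = MDS(n_components=2, dissimilarity="precomputed",
             random_state=seed).fit_transform(d)
    # Group the 2-D dots into classes of similar users.
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=seed).fit_predict(xy)
    return xy, labels
```

The paper reports that this dissimilarity-matrix projection outperformed plain PCA or MDS on the raw coordinates; in this sketch that choice corresponds to feeding the Jensen-Shannon matrix to `MDS(dissimilarity="precomputed")`.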
C. Proposed Algorithm for the Prediction of Application
Launches
Now, the objective is to predict the instant when a user
will launch a remote application. The proposed prediction
algorithm is based only on the user’s past behaviors. The instants
when a user has launched a remote application have been
recorded and are available at any time.
As an example, the time interval in which the user will
start the first application in a day is predicted. A time
granularity is used to estimate the discrete probability distribution
P(H | WD, U), where H = 0, ..., 23 is the hour, WD =
Mo, Tu, We, Th, Fr, Sa, Su is the day of the week and U
represents a user. For each user, the opening of the day’s
applications is predicted by calculating the interval or po-
tentially multiple intervals in which the event is supposed
to occur. This is achieved with a very fine granularity by
estimating the probability distribution of launches over time by
a Kernel Density Estimator (KDE) [20].

Fig. 2. View of an application launch with the virtualization software AppliDis Fusion from Systancia.

It is a kernel estimation technique (the Parzen-Rosenblatt method), which is
a non-parametric method for estimating the probability density
of a random variable. The estimator is formed by the average
of the Gaussian curves called kernels. To take into account
the data periodicity, a circular distribution function has been
chosen as the basic function defining a kernel.
The periodicity is defined for a user by the average time
elapsed between two application launches. It can be very
different from one user to another one. We use an envelope
of several Gaussians to obtain an estimate of the probability
density.
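A minimal sketch of such a circular kernel estimator is given below. The paper only states that a circular distribution function defines the kernel; the von Mises kernel used here (the usual circular analogue of a Gaussian) and the fixed concentration parameter `kappa`, which acts as an inverse bandwidth, are our own assumptions:

```python
import numpy as np

def circular_kde(event_hours, period=24.0, kappa=8.0):
    """KDE on a circle of the given period (in hours).

    Each recorded launch instant contributes one von Mises kernel; the
    estimate is the average of these kernels, mirroring the envelope of
    curves described in the paper.
    """
    angles = 2.0 * np.pi * np.asarray(event_hours, dtype=float) / period

    def density(t):
        theta = 2.0 * np.pi * np.asarray(t, dtype=float) / period
        # von Mises kernels centred on each observed instant
        # (np.i0 is the modified Bessel function that normalizes the kernel).
        k = np.exp(kappa * np.cos(theta[..., None] - angles[None, ...]))
        k /= 2.0 * np.pi * np.i0(kappa)
        # Average the kernels and rescale from angle units to density per hour.
        return k.mean(axis=-1) * (2.0 * np.pi / period)

    return density

pdf = circular_kde([8.0, 8.5, 9.0, 14.0], period=24.0)
```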
1) Model Selection: The behavior of the users will, in
the general case, be a composition of motifs with different
periodicities. The challenge consists in finding a period T that
will generate a probability distribution with a low differential
entropy. For each period T, one or more bandwidths of the
kernels can be applied and we have chosen the cross-validation
technique to select the bandwidth [20].
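As an illustration of selecting the bandwidth by cross-validation, the following sketch scores candidate bandwidths by held-out log-likelihood with scikit-learn. It uses a plain Gaussian kernel on the real line rather than the circular kernel of the paper, since scikit-learn's `KernelDensity` offers no circular kernel; the candidate grid is our own choice:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KernelDensity

def select_bandwidth(launch_hours, bandwidths=np.linspace(0.25, 3.0, 12), cv=5):
    """Pick the kernel bandwidth maximizing the cross-validated log-likelihood."""
    x = np.asarray(launch_hours, dtype=float).reshape(-1, 1)
    # Each fold fits a KDE on the training launches and scores the
    # held-out launches by their total log-likelihood.
    search = GridSearchCV(KernelDensity(kernel="gaussian"),
                          {"bandwidth": bandwidths}, cv=cv)
    search.fit(x)
    return search.best_params_["bandwidth"]
```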
2) Optimization: The period T is determined by the resolution of an optimization problem based on a cost function
J. This makes it possible to adjust a compromise between the size of the interval, in order to increase the
likelihood of covering an application launch, and the computational
resources (CPU and server RAM). Of course, the larger the
interval, the better the prediction accuracy and the bulkier the
required computational costs. In the end, it is the system
administrator who settles this compromise. In practical terms,
for users this amounts on one side to preloaded applications
kept idle but fast to launch and on the other side to larger
delays for launching applications.
V. IMPLEMENTATION AND RESULTS
A. Practical Aspects
The algorithms have been implemented within AppliDis,
a software for the virtualization of workstations and applications
in a single management console.

Fig. 3. Number of logons per hour in a French university hospital over a period of a week.

Fig. 4. Two-dimensional projection of the users (1 dot for 1 user) with 6 clusters separated with the K-means algorithm for user classification.

Clients of such software
are companies, large accounts, hospitals, or other big orga-
nizations with a large amount of users, with users showing
different profiles (office staff, doctors, technicians, nurses,
etc.), with users requiring different needs and where some
users must access remote data and applications around the clock.
The ICT cloud architecture is the one of Fig. 1, and Fig. 2
shows the remote application launch through AppliDis Fusion
5 by a user on his workstation. It is the virtualization software
that allows access to the applications hosted on the servers. This
software also stores the instants when an application has been
requested, when it is available, when it is closed, etc.
Tests have been achieved on a real cloud computer system
with the virtualization of workstations and applications. In
practical terms, data have been collected from a French
university hospital which includes around 800 users accessing
110 applications distributed over 35 servers around the clock. The
resulting data set represents a period of approximately 12
months. The activities and the user behavior can be seen by
the histogram in Fig. 3 which shows the number of logons
during a full week with a resolution of 1h. In this particular
example, a user opens 5 applications per day in average. This
number can vary from 1 to 82 applications per day depending
on the user. The most frequent periodicities detected in all the
users are 24, 12, and 8h. This means that some applications
are launched every 24, 12 or 8 hours.
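One simple way to surface such periodicities from the hourly logon counts is to rank candidate lags by autocorrelation. The paper does not state how the periodicities were detected, so the following is purely our own illustration; the candidate list mirrors the reported 24, 12 and 8 h periods:

```python
import numpy as np

def dominant_periods(hourly_counts, candidates=(24, 12, 8), top=3):
    """Rank candidate periods (in hours) by autocorrelation of the log counts."""
    x = np.asarray(hourly_counts, dtype=float)
    x = x - x.mean()  # remove the constant activity level
    scores = {}
    for lag in candidates:
        if lag < len(x):
            # Normalized correlation of the signal with itself shifted by `lag`.
            scores[lag] = float(np.dot(x[:-lag], x[lag:]) / np.dot(x, x))
    return sorted(scores, key=scores.get, reverse=True)[:top]
```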
B. User Classification
The proposed user classification algorithm is evaluated on
the data set and context previously described.
The algorithm makes it possible to distinguish 108 groups of users.
The users have been projected in a high-dimensional space.
In this case, a 108-dimensional space has been used. A user
is represented by a dot where each coordinate represents the
utilization periodicity of a certain application. A dimension
reduction algorithm is used to convert the dots from the high-dimensional
into a lower-dimensional representation. This algorithm must
preserve the distances because 2 dots which are close in the
108-dimensional space represent 2 similar users. The result
of converting the 108-dimensional space into a plane (2D) is
presented by Fig. 4. The 2 dimensions have no units and are
not related to any physical parameters. In this figure, there
are 765 dots where each corresponds to a user. The simple K-means
algorithm has been used to group the dots into clusters.
We chose k = 6 clusters, and the color of each dot indicates its
cluster.
cluster.
We compared the results obtained with the dimension re-
duction algorithm based on the dissimilarity matrix calculated
with the Jensen-Shannon divergence to other dimension reduc-
tion techniques like Principal Component Analysis (PCA) or
Multidimensional Scaling (MDS). We noticed that the use of
the dissimilarity matrix generates the best results.
C. Prediction of Application Launches
The proposed prediction algorithm is evaluated on the same
data set in order to estimate the future instant of remote
application launches by users.
What characterizes a user is the period T. The choice of
the interval to predict the launch of an application is achieved
by knowing that this interval must cover a probability mass
greater than a certain threshold. For each user, the number of
predicted application launches is represented by a single num-
ber, the Area Under the Curve (AUC). This AUC varies from
0 to 1 and is directly related to the size of the interval taken
into account for the prediction. This is shown in Fig. 5. On this
figure, the yellow bars represent the probability distribution of
application launches by hour for a weekday (Friday) for the
user with id = 3, P(H | WD = Friday, U = 3). The red curve
is the complementary cumulative probability mass function,
that is, the probability P(H > h | WD = Friday, U = 3).
Obviously, the probability decreases as the time moves
forward to the end of the day. Finally, the blue bars
highlight the part of the probability distribution that is covered
by the predicted interval from Hour = 10 to Hour = 21, that
is, P(10 ≤ H ≤ 21 | WD = Friday, U = 3).
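The choice of a predicted interval covering a given probability mass, as described above, can be sketched as follows. This is an illustrative search for the smallest contiguous interval of hours whose mass exceeds the threshold; the threshold value is our own assumption and intervals wrapping around midnight are not handled:

```python
import numpy as np

def prediction_interval(hourly_probs, threshold=0.8):
    """Smallest contiguous hour interval whose probability mass meets the threshold.

    hourly_probs: the discrete distribution P(H | WD, U) over the 24 hours.
    Returns (start_hour, end_hour, covered_mass).
    """
    p = np.asarray(hourly_probs, dtype=float)
    best = None
    for start in range(len(p)):
        for end in range(start, len(p)):
            mass = p[start:end + 1].sum()
            if mass >= threshold:
                width = end - start
                if best is None or width < best[0]:
                    best = (width, start, end, mass)
                break  # widening further only grows the interval
    if best is None:  # no interval reaches the threshold: cover the whole day
        return 0, len(p) - 1, float(p.sum())
    return best[1], best[2], float(best[3])
```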
Fig. 5. Example of the cumulative probability and the probability distribution for the launch of an application during a user session (logins (AppliDis session) start dates histogram; UserId: 3; Weekday: Friday).

Fig. 6. Relationship between the cumulative probability and the duration of the interval, expressed as a percentage of the total period T, with examples for 2 different values of the AUC (0.657 and 0.769).

The entropy of the behavior allows the estimation of the
upper bound of the interval, and the algorithm will make the
predictions by trying to reach this limit. The exact value of the
interval’s upper bound cannot be known in advance, so it is not
possible to know whether a prediction is optimal or not. We
seek to improve the prediction performance at each iteration of
the algorithm by using an AUC as close as possible to 1. For
example, Fig. 6 shows the cumulative probability according to
a part (in percent) of the interval with two different values of
the AUC. Finally, in this test, application launches were predicted
for 97% of the users, which means that the applications will
then be preloaded on their terminals.
By loading the applications in advance with the prediction
algorithm, the user experiences a reduced waiting delay. For the
administrator of the cloud computer architecture, the issue
consists in finding an acceptable compromise between pre-
dictive performance, i.e., accelerated applications, and server
resources, i.e., CPU, RAM, and power consumption. Our
behavior analysis tools are integrated in the virtualization
software and are available to the administrator who can view
the system performances and additional indicators calculated
and predicted by the ML techniques.
VI. CONCLUDING REMARKS

We presented an implementation of ML algorithms for user
behavior analysis purposes in a cloud computing environment
that combines workstation virtualization and application virtualization. Our user behavior analysis consists in the classification of users and in the prediction of some of their activities,
such as application launches. The concept of UBA (User
Behavior Analytics) has been used in this specific context.
Dissimilarity measures and data clustering methods (K-
means) have enabled the identification of groups of similar
or closely related users. Then, the time interval in which a
user will launch an application has been predicted by using
a non-parametric method for estimating a probability density,
namely a Kernel Density Estimator (KDE).
These algorithms have been implemented within a work-
station and application virtualization software that is able
to track and visualize users’ activity and behavior in real
time. Thanks to the algorithms previously mentioned, the
virtualization software is thus also able to predict in real time
the openings of sessions and applications of users regardless
of the periodicity of the past information. This has been
verified in a working environment and under real operation
conditions. A performance analysis shows that the machine
learning techniques are effective in clustering the users and in
predicting their behaviors.
The proposed solution aims to ensure a fast remote access
to the applications for the user while reducing the maintenance
costs for the ICT architecture administrator.
ACKNOWLEDGMENT
The authors would like to thank the Systancia company for
supporting this work and providing anonymized data from a
cloud computer system.
REFERENCES
[1] G. Warkozek, V. Debusschere, and S. Bacha, “Automated parameters
retrieval for energetic model identification of servers in datacenters,” in
IEEE PowerTech, 2013, Conference Proceedings.
[2] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical
Learning: Data mining, inference and prediction, ser. Springer Series in
Statistics. Springer-Verlag, 2013.
[3] G. R. Foxall, “Behavior analysis and consumer psychology,” Journal of
Economic Psychology, vol. 15, no. 1, pp. 5 – 91, 1994.
[4] A. R. Baig and H. Jabeen, “Big data analytics for behavior monitoring
of students,” Procedia Computer Science, vol. 82, pp. 43–48, 2016.
[5] J. Barcenilla and J.-M.-C. Bastien, “Acceptability of innovative technolo-
gies: Relationship between ergonomics, usability, and user experience,”
Le travail humain, vol. 72, no. 4, pp. 311–331, 2009.
[6] A. Kobsa, “User modeling: Recent work, prospects and hazards,” Human
Factors in Information Technology, vol. 10, pp. 111–111, 1993.
[7] ——, “Generic user modeling systems,” User Modeling and User-
Adapted Interaction, vol. 11, no. 1, pp. 49–63, 2001.
[8] O. Bent, P. Dey, K. Weldemariam, and M. K. Mohania, “Modeling user
behavior data in systems of engagement,” Future Generation Computer
Systems, vol. 68, pp. 456–464, 2017.
[9] M. Pazzani and D. Billsus, “Learning and revising user profiles: The
identification of interesting web sites,” Machine Learning, vol. 27, no. 3,
pp. 313–331, 1997.
[10] R. F. Molanes, K. Amarasinghe, J. J. Rodriduez-Andina, and M. Manic,
“Deep learning and reconfigurable platforms in the internet of things,”
IEEE Industrial Electronics Magazine, vol. 12, no. 2, pp. 36–49, 2018.
[11] C. Bishop, Pattern Recognition and Machine Learning. Springer, 2007.
[12] K. P. Murphy, Machine Learning: A Probabilistic Perspective. The
MIT Press, 2012.
[13] E. Alpaydin, Machine Learning: The New AI. The MIT Press, 2016.
[14] I. Szilagyi and P. Wira, “An intelligent system for smart buildings using
machine learning and semantic technologies: A hybrid data-knowledge
approach,” in 1st IEEE International Conference on Industrial Cyber-
Physical Systems (ICPS 2018), 2018, pp. 20–25.
[15] G. I. Webb, M. J. Pazzani, and D. Billsus, “Machine learning for user
modeling,” User modeling and user-adapted interaction, vol. 11, no.
1-2, pp. 19–29, 2001.
[16] M. Callara and P. Wira, “Machine learning pour l’analyse de com-
portements et la classification d’utilisateurs,” in Congrès National de
la Recherche des IUT (CNRIUT’2017), 2017.
[17] D. Koller and N. Friedman, Probabilistic Graphical Models: Principles
and Techniques - Adaptive Computation and Machine Learning. The
MIT Press, 2009.
[18] T. F. Cox and M. A. A. Cox, Multidimensional scaling, 2nd ed., ser.
Monographs on statistics and applied probability. Boca Raton, Fla.:
Chapman & Hall/CRC, 2001, no. 88.
[19] R. Xu and D. C. Wunsch, “Survey of clustering algorithms,” IEEE
Transactions on Neural Networks, vol. 16, no. 3, pp. 645–678, 2005.
[20] C. O. Wu, “A Cross-Validation Bandwidth Choice for Kernel Density
Estimates with Selection Biased Data,” Journal of Multivariate Analysis,
vol. 61, no. 1, pp. 38 – 60, 1997.