
Online profiling and clustering of Facebook users


Abstract In a relatively short period of time, social media have acquired a prominent role in media and daily life. Although this development brought about several academic endeavors, the literature concerning the analysis of social media data to investigate one's customer base appears to be limited. In this paper, we show how data from the social network site Facebook can be operationalized to gain insight into the individuals connected to a company's Facebook site. In particular, we propose a data collection framework to obtain individual specific data and propose methodology to explore user profiles and identify segments based on these profiles. The proposed data collection framework can be used as an identification step in an analytical customer relationship management implementation that specifically focuses on potential customers. We illustrate our methodology by applying it to the Facebook page of an internationally well-known professional football (soccer) club. In our analysis, we identify four clusters of users that differ with respect to their indicated “liking” profiles.
Jan-Willem van Dam, Michel van de Velden
Econometric Institute, Erasmus University Rotterdam, P.O. Box 1738, 3000 DR Rotterdam, The Netherlands
Article info
Article history:
Received 20 November 2013
Received in revised form 22 September 2014
Accepted 1 December 2014
Available online 9 December 2014
Keywords:
Online profiling
Social networks
Customer relationship management
Correspondence analysis
Cluster analysis
Facebook
© 2014 Elsevier B.V. All rights reserved.
1. Introduction
The role that social networks play in daily life has increased considerably over the last few years.
As illustrated by the editorials of two recent special issues [3,8], recent academic publications cover
a broad spectrum of topics related to social media. Some examples concern the potential of social media
and its effect on customer loyalty [4]; how to use Facebook to activate customers in sharing
product/service recommendations [7,16,19]; the role of social networks, in particular Facebook, in
intentional social actions [6]; the relationship between personal networks and patterns of Facebook
usage [29]; and the effectiveness of user-generated content in stimulating sales [10,17,39]. This short
list of topics and references is by no means exhaustive. It only serves to illustrate the recent
interest and range of applications relating to firms and social networks. One common element among
extant literature is that none of these studies builds on directly observed individual-level social
network data. Instead, either aggregate data or focus groups and/or (online) surveys were employed in
order to answer the research questions. This limitation was recently also observed by [31]. In their
study of the effect of social media participation on visit frequency and profitability, survey
respondents were linked to their social media (i.e., Facebook) profiles by matching of names. In this
paper we add to the existing literature by explicitly considering the retrieval and analysis of profile
data directly obtained from social network sites. The proposed methodology can be implemented into an
analytical customer relationship management (CRM) framework aimed at the analysis of customer
characteristics that may help improve a firm's customer management strategies. Moreover, by focusing on
data from public profiles on the social media platform Facebook, we are able to identify potential
rather than actual customers. That is, in contrast to typical CRM implementations that rely on data
directly obtained from customers, we consider a much broader group of individuals who indicated an
interest in a firm even when an actual purchase has not yet been recorded.
The contribution of this paper is threefold. First, we show how Facebook users that “like” a firm can
be identified. As also observed by [31], this is not a trivial task. Second, using the information
volunteered by such Facebook users through their publicly available pages, we show how segments of
Facebook users can be identified through data visualization and cluster analysis methods. Clustering of
a firm's Facebook fans may improve understanding of strategic segmentation of social media users
connected to a firm [28]. Moreover, the cluster results and visualizations can be used to improve
targeting of marketing efforts. For example, a company may consider seeking cooperation with another
brand or a popular media figure based on the popularity of such a brand or person with the (potential)
customers. Such efforts could also be targeted directly at specific groups of (potential) customers
rather than at all (potential) customers. Third, we apply our methodology to a (large) data set of
Facebook users that indicated liking a popular and successful international football club. This
football club granted us administrator rights, under provision of not revealing the name of the club
and any results indicative of the club's name. The results of our analysis show that, based on the
Facebook users' liking behavior, clusters can be obtained. Given the differences between liking
patterns in these clusters, differentiated marketing strategies for the different clusters can be
developed.
Decision Support Systems 70 (2015) 60–72
Corresponding author: Michel van de Velden, Econometric Institute, Erasmus University Rotterdam,
P.O. Box 1738, 3000 DR Rotterdam, The Netherlands.
E-mail addresses: jwvdam@gmail.com (J.-W. van Dam), vandevelden@ese.eur.nl (M. van de Velden).
http://dx.doi.org/10.1016/j.dss.2014.12.001
0167-9236/© 2014 Elsevier B.V. All rights reserved.
The remainder of this paper is structured as follows. First, we briefly review previous research on
analytical CRM and online profiling. Next, we briefly discuss specific data considerations for
Facebook. We introduce some terminology and review Facebook's data analysis and programming
facilities. In Section 4, we show how specific Facebook users can be identified. Next, we analyze the
individual-level data using a combined multiple correspondence analysis and k-means cluster analysis
method. We show how results can be visualized and interpreted. The paper concludes with a discussion
of our results, implications for research and practice, and future directions.
2. Customer relationship management and online profiling
Customer relationship management (CRM) has become widely recognized as an important business approach
[27,31], where CRM is defined as an “enterprise approach to understanding and influencing customer
behavior through meaningful communications in order to improve customer acquisition, customer
retention, customer loyalty, and customer profitability.” Hence, customer acquisition (or:
identification) can be seen as the first step which, together with retention and customer development,
forms a complete customer relationship management cycle geared at creating a better understanding of
(potential) customers in order to increase long-term customer value to the firm [22,33,30].
Customer identification is typically based on information directly available to a firm [38]. For
example, customers may be required to provide certain background information upon purchasing a product.
In addition, companies may ask customers to volunteer information by completing a survey, or persuade
them to join a loyalty program. Based on the available data, a customer profile, that is, a model of
the customer, can be constructed. Based on such a customer profile, a marketeer decides on appropriate
strategies and tactics to meet the specific needs of the consumer [32]. Hence, possessing accurate
information about the preferences and background characteristics of (potential) customers makes it
possible to improve targeting of, possibly individual-specific, marketing efforts [25].
Obtaining direct customer information requires an existing relationship with the firm. That is,
customers need to have either purchased a product or made contact with the firm in such a way that
identification is possible so that additional information can be collected. In the case of as yet
unidentified potential customers, it is not possible to acquire data in this fashion. Moreover, except
for the observable transactional data (e.g., purchase times and amounts), customers may decline to
provide additional information.
Social media offer a new source of customer profile information. In particular, social media offer
opportunities to identify potential customers. Through social media, individuals often express
preferences for brands, products, services, persons, political parties, etc., in a free, unsolicited
way. Thus, if one is able to collect such information from potential customers of a certain firm, for
example by focusing on users that indicated an interest in that firm, online profiles can be created
that allow for better, individualized, targeting.
Although it has been suggested that the rise of new media requires novel approaches to successfully
manage customer relationships [18], applications in which customer background data from social network
sources are used to gain insight into customer backgrounds appear to be underrepresented in the
academic literature. There are some studies [24,23,15,5] in which social network data were used that
contained personal information of the users in the data set. However, the goal of these studies was to
study network ties [24] or privacy issues [5,15], or to relate the number of friends to the amount of
information available on a person's Facebook page [23]. None of the studies used the data for online
profiling: “the collection of information from the Internet for the purpose of formulating a profile
of users' habits and interests” [37]. In this paper, we fill this gap by proposing a data collection
framework for the purpose of online profiling.
Online profiling can be divided into two categories: reactive and non-reactive data collection [37].
Non-reactive data collection focuses on the collection of data concerning Web usage behavior, e.g., IP
addresses of visitors, time spent on certain Web pages, and clicking behavior information. These data
are used to gain insight into Web user behavior and, thus, into characteristics of individual visitors
or visitor groups. Non-reactive data form a large and potentially interesting source of online profile
information. However, for the construction of online user profiles, the observed usage behavior must
first be transformed into meaningful variables. The construction and definition of such variables is
not always an easy task. In our study, we therefore primarily focus on the retrieval and analysis of
online profiles based on reactive rather than non-reactive data.
Reactive data collection zooms in on visitor characteristics which cannot be collected through
tracking the Web usage behavior of a visitor. Instead, reactive information is collected by using forms
and selection menus, which have to be filled in by visitors themselves. Reactive data require little
to no recoding of the original variables and are immediately collected at the user level. Moreover, in
the case of Facebook, providing reactive data requires very little effort from the users. For example,
when joining Facebook, users are asked to provide certain personal background information (e.g., name,
gender, date of birth). Users provide this information by selecting the appropriate options. This
basic background information can be supplemented by more personal information concerning, for example,
hobbies, relationship status, etc. Finally, by “liking” other pages, personal preferences for persons
or objects can be indicated.
The resulting online profiles can be of great value for marketeers, as they can be used to identify
different (segments of) users (customers) that require different marketing approaches. Moreover, it
enables the company to know its potential customers, that is, individuals that indicated a preference
towards the product/brand by “liking” it on Facebook.
3. Facebook data
Facebook users put personal information on their Facebook page. Some examples are someone's name,
gender, date of birth, e-mail address, sexual orientation, marital status, interests, hobbies,
favorite sports team(s), favorite athlete(s), or favorite music. Furthermore, it is possible to
specify your Facebook friends, post messages, and publish pictures or other content. Consequently, the
potential value to marketeers and researchers of the information available through Facebook is
substantial. However, extracting the information is no trivial task as:
1. Facebook users are able to make certain information not publicly available and therefore not
visible to non-friends.
2. Facebook users are not obliged to fill in fields and therefore, many users do not specify all
possible information about themselves.
3. The default statistics that Facebook offers for Facebook page administrators are limited.
4. It is not obvious how Facebook users who “like” your page can be identified.
The first two points are a result of the design and policy of Facebook.com and therefore we take these
points as given. Instead, we focus on the extraction of available data from Facebook and consider a
user profile data collection framework taking into account the above-mentioned issues.
The Facebook data collection framework that we propose consists of three steps: 1) identification of
“fans” of the Facebook page, 2) retrieval of relevant data for the identified fans, and 3) preparation
of the data. The first step of this framework requires administrator rights to the page; in the other
steps, public information from the relevant pages needs to be collected. Before we show how to
implement the data collection framework, we briefly summarize some important aspects concerning
Facebook pages and the available data.
3.1. Facebook Insights
The owner of a Facebook page is in principle the administrator of the page. Personal pages are
typically managed only by the page owner. However, in the case of a company's Facebook pages, the page
administrator can also give other Facebook users these administrator rights. It is possible to have
multiple administrators for one Facebook page, e.g., multiple marketing and CRM employees may be page
administrators. A Facebook page administrator has certain privileges in comparison to regular users or
visitors of a Facebook page. An administrator can edit, publish and withdraw content, target
advertisements and install Facebook apps on the page. Also, administrators have access to Facebook
Insights, a dashboard which provides statistics on user growth, demographics, consumption of content,
and creation of content. However, the information made available through the dashboard is aggregated
over users who “liked” the page. Consequently, the possibilities concerning the analysis of
individual-specific data using this feature are limited.
On Facebook, users can indicate whether they “like” another Facebook page. Thus, they are able to
express a form of affinity with the company, person or product behind the Facebook page. Through
Facebook Insights it is possible to see how many users “like” your page, how this number evolves over
time, and whether these users are active on your page or not. (The definition of an active Facebook
user is as follows: users who, within a chosen time period, engaged with, viewed, or consumed content
generated by a Facebook page.) Furthermore, you can see which media on the page are most popular
(e.g., watching videos, listening to audio, or viewing photos). It is also possible to see which
Facebook tabs (e.g., the wall, information, photos, and events) are most popular and from which
external referrers visitors come. Additionally, one is able to see which page posts have been viewed
the most and which posts generated most user feedback.
The above-mentioned possibilities of the Facebook Insights dashboard all concern information related
to the Facebook page itself. Information about the Facebook users is only present through aggregated
breakdowns.
Fig. 1 gives an example of the breakdowns for the gender and age distributions based on the
information users provided on their Facebook pages. The home country and home city are determined
using the IP address from which users (who indicated “liking” the page) access the Facebook page. The
language is based on the users' default language setting when accessing Facebook.
Other personal information of users, such as, for example, relationship status, sexual orientation,
favorite brands, favorite music, liked pages, etc., is not accessible through the Facebook Insights
dashboard.
Fig. 1. Screenshot of Facebook Insights.
3.2. Facebook application programming interface (API)
Through the Facebook Developers Platform [9] it is possible to develop Web applications (or plugins)
which make use of the Facebook platform, e.g., mobile applications which make it possible to connect
to your Facebook page to post pictures, applications which integrate Facebook features in a Web site,
or applications which make it possible to find friends.
The Facebook Developers Platform consists of multiple application programming interfaces (APIs). The
Graph API is the core of the Facebook Platform, enabling one to read and write data to Facebook. It
provides a simple and consistent view of the social graph, uniformly representing objects (e.g.,
people, photos, events and pages) and the connections between them (e.g., friendships, likes and photo
tags) [9]. In addition to the Graph API, there is an Internationalization API, an Ads API, and a Chat
API. For our data collection framework, the Graph API is crucial.
The Graph API makes it possible for developers to integrate Facebook into Web sites and Web
applications, and to build Facebook applications. However, even when a Facebook user has a public
profile, which is accessible (online) by anyone, the data in this profile are not publicly accessible
through the API. In fact, only the following fields are always publicly available through the API:
user id; username; full name; first name; last name; gender; locale (i.e., the default language
setting); profile picture. For marketing or customer relationship management purposes, this list is
not very useful. In addition, if we compare this list with the complete list of fields that Facebook
provides, we observe that there is potentially much more relevant information available on the
Facebook pages.
Considering the complete list of fields that Facebook provides, we identify the following potentially
interesting characteristics that are not available through the API: date of birth; place of birth;
sexual orientation; political view; relationship status; education; work experience; contact
information; activities; interests; likes (i.e., internet pages “liked” by a Facebook user; these
could correspond to books, movies, or athletes, but also to friends' Facebook pages). This last field,
likes, is of particular interest in this study as we want to see if fans can be clustered according to
the preferences indicated in this field. As the API does not allow the retrieval of these data, we
need to develop alternative methods. In the next section, we consider how such data can be obtained.
4. Facebook user profile data collection framework
To gather a Facebook user's profile information relevant to customer relationship management and/or
for marketing purposes, the information resources described in the previous section must be combined.
Fig. 2 shows the user profile data collection framework. For convenience we introduce the term “fan”
for users who “liked” a Facebook page. In fact, Facebook itself originally gave users the option to
become a fan of other pages and changed this into “like”. The data collection framework consists of
three parts: 1) identifying the fans, 2) gathering the personal information, and 3) preparing and
structuring the gathered data.
4.1. Identifying Facebook page fans
Administrators of a Facebook page can list fans of their page by accessing
https://www.facebook.com/browse/?type=page_fans&page_id=1234567890, where 1234567890 should be
replaced by the page ID of the page one is interested in (and is administrator of). A screen shot of
the resulting page is given in Fig. 3. When one clicks on the “See more” button on the bottom of the
page, more “fans” are listed. However, after showing 500 fans, the button does not show up anymore.
After exploring the HTTP requests resulting from clicking the “See more” button, we conclude that:
• Facebook uses Asynchronous JavaScript and XML (AJAX) for its HTTP requests.
• Facebook uses two parameters (fb_dtsg and post_form_id) to prevent cross-site request forgery (CSRF)
in its HTTP requests.
• Only authenticated Facebook administrators have access to the page (cookies are used for
authentication).
• The response to the HTTP request is in JSON format and contains each fan's picture, name, URL, and
ID.
With these observations in mind, a PHP script was written to store the name, URL and ID of each
Facebook user in a (MySQL) database. The pseudo code for this script can be found as Algorithm 1 in
Appendix A. Running this algorithm yields, after removing duplicate Facebook IDs, 10,000 unique fans.
Facebook does not give information about how these “fans” are selected. By changing the fb_dtsg and
post_form_id parameters, a new set of 10,000 unique “fans” is obtained. Between these sets there
exists some overlap, but the greater part of the “fans” is different. As the algorithm always results
in 10,000 unique Facebook IDs, it will be difficult to obtain a list with all fans if the total number
of fans exceeds 10,000. A sample, however, can be obtained without too much difficulty by using
Algorithm 1 and varying the two parameters.
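The bookkeeping behind this sampling step, removing duplicate IDs and pooling several retrieved sets, can be illustrated with a minimal Python sketch. It is not the PHP script of Algorithm 1; the record layout (`id`, `name`, `url`) and the function names are assumptions made for illustration, and the batches are toy data rather than retrieved fans.

```python
from typing import Dict, Iterable, List

# Hypothetical record layout: {"id": ..., "name": ..., "url": ...}
Fan = Dict[str, str]


def merge_fan_batches(batches: Iterable[List[Fan]]) -> Dict[str, Fan]:
    """Pool several retrieved batches, keeping one record per Facebook ID."""
    fans: Dict[str, Fan] = {}
    for batch in batches:
        for fan in batch:
            # setdefault keeps the first record seen for an ID, so overlap
            # between (or duplicates within) batches adds no extra rows.
            fans.setdefault(fan["id"], fan)
    return fans


def overlap_share(first: List[Fan], second: List[Fan]) -> float:
    """Fraction of the unique IDs in `first` that also occur in `second`."""
    ids_first = {fan["id"] for fan in first}
    ids_second = {fan["id"] for fan in second}
    return len(ids_first & ids_second) / len(ids_first) if ids_first else 0.0


# Toy batches standing in for two runs with different parameter values.
batch_a = [
    {"id": "1", "name": "A", "url": "u1"},
    {"id": "2", "name": "B", "url": "u2"},
    {"id": "2", "name": "B", "url": "u2"},  # duplicate within a batch
]
batch_b = [
    {"id": "2", "name": "B", "url": "u2"},
    {"id": "3", "name": "C", "url": "u3"},
]

sample = merge_fan_batches([batch_a, batch_b])
print(len(sample), overlap_share(batch_a, batch_b))
```

Keying the pooled sample on the Facebook ID means that additional runs can be merged in at any time without rechecking earlier batches.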
4.2. Gathering fans' public profile information
The second step in the data collection process is gathering information of the Facebook “fans”
identified in the previous step. This can be done by visiting these Facebook pages and storing the
relevant, individual-level, data. Note that only public information can be obtained. The data we thus
obtain only concern users that granted public access to their pages.
4.3. Data preparation and storage
The third step in our profile data collection framework concerns preparation and structuring of the
gathered data. When one creates or updates his/her Facebook profile, personal information can (and in
some cases, such as name and date of birth, must) be provided by completing several fields. There are
text fields (e.g., name, language, interests), check boxes (e.g., sexual orientation), and lists
(e.g., gender, relationship status). For check boxes and lists, the options which can be selected are
limited. For text fields no such limitation exists and users can type in anything they like. We
distinguish between two sets of variables: background characteristics and “liking” data.
Fig. 2. Facebook data collection framework.
4.3.1. Background characteristics
From individual Facebook pages we are able to obtain personal background information of the users. In
particular, from the public profiles we can obtain the variables gender, date of birth, location,
relationship status and the number of Facebook friends. Gender and date of birth are straightforward
background variables. Concerning the other variables we briefly indicate how the data are available on
the Facebook pages and how we process these for our analysis.
4.3.2. Location
A user's location is represented by a string with the name of the city or town someone lives in and/or
comes from, together with a URL to the Facebook page of that location. Location may be useful when
analyzing fans of a page and we may be interested in more details about the location. In particular,
for a geographical overview of the fan base one needs to know the country, continent, and the
coordinates (latitude, longitude) of a location. This can be achieved by using the GeoNames
geographical database [36]. The GeoNames API has a fuzzy search engine which accepts all kinds of
input. For example, the engine accepts both “Rio de Janeiro” and “Rio Janeiro” as search terms for the
large Brazilian city. As output, GeoNames yields various details such as city name, country name,
latitude, longitude, number of inhabitants, etc.
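The enrichment step could look roughly like the following Python sketch, which only builds a request URL and parses a response and makes no network call. It assumes the GeoNames `searchJSON` web service with the `q`, `maxRows` and `username` parameters and a response carrying `name`, `countryName`, `lat` and `lng` fields; the function names, the user name `demo_user` and the sample payload are illustrative, and the payload is hand-written rather than real GeoNames output.

```python
import json
from typing import Optional
from urllib.parse import urlencode

# Assumed endpoint of the GeoNames search web service (JSON variant).
GEONAMES_SEARCH_URL = "http://api.geonames.org/searchJSON"


def build_search_url(location: str, username: str) -> str:
    """Build a search request for a free-text location string."""
    params = {"q": location, "maxRows": 1, "username": username}
    return GEONAMES_SEARCH_URL + "?" + urlencode(params)


def parse_first_hit(payload: str) -> Optional[dict]:
    """Extract city, country and coordinates from the first search hit."""
    hits = json.loads(payload).get("geonames", [])
    if not hits:
        return None  # the location string could not be resolved
    hit = hits[0]
    return {
        "city": hit.get("name"),
        "country": hit.get("countryName"),
        "latitude": float(hit["lat"]),
        "longitude": float(hit["lng"]),
    }


# Hand-written stand-in for a response; not real GeoNames output.
sample_payload = json.dumps(
    {"geonames": [{"name": "Rio de Janeiro", "countryName": "Brazil",
                   "lat": "-22.9", "lng": "-43.2"}]}
)

url = build_search_url("Rio Janeiro", username="demo_user")
place = parse_first_hit(sample_payload)
print(url)
print(place)
```

Returning `None` for unresolved strings keeps such fans in the data set while marking their coordinates as missing.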
For “fans” who do not publicly specify their location, we cannot discover their latitude and
longitude. However, through Facebook's API it is possible to gather a user's Facebook language
setting. Assuming that the language corresponds to the user's location, one could use the language to
determine, at country level, the user's location. That is, a Facebook user who is using Facebook in
Japanese is assumed to come from Japan. Although we believe that this assumption is not very
unrealistic, there are cases in which a language cannot be linked to one specific country. For
example, we cannot infer a country from a user with a language setting such as “Arabic,” “English,”
“French,” “German,” or “Spanish.”
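A minimal Python sketch of this fallback rule is given below. The lookup table is a deliberately tiny, hypothetical example (a real mapping would have to be compiled separately), and `infer_country` is an illustrative name, not part of the original implementation.

```python
from typing import Optional

# Hypothetical, deliberately small lookup of languages that point to a
# single country; a real table would have to be compiled separately.
LANGUAGE_TO_COUNTRY = {
    "Japanese": "Japan",
    "Thai": "Thailand",
    "Polish": "Poland",
}

# Languages named in the text as not linkable to one specific country.
AMBIGUOUS_LANGUAGES = {"Arabic", "English", "French", "German", "Spanish"}


def infer_country(location_country: Optional[str],
                  language: Optional[str]) -> Optional[str]:
    """Prefer the stated location; fall back on the language setting."""
    if location_country:
        return location_country
    if language is None or language in AMBIGUOUS_LANGUAGES:
        return None  # no country-level inference possible
    return LANGUAGE_TO_COUNTRY.get(language)


print(infer_country(None, "Japanese"))
print(infer_country(None, "English"))
print(infer_country("Brazil", "English"))
```

The stated location always takes precedence, so the language setting is only used for users who left the location field empty or private.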
4.3.3. Relationship status
A Facebook user can specify their relationship status by selecting one of the following options:
single, in a relationship, engaged, married, it's complicated, widowed, separated or divorced.
However, in our data set we also found values such as “in a complicated relationship,” “in an open
relationship,” or “civil partnership.” This is probably a result of the fact that Facebook changed the
possible values for the field “relationship status” over the years. At the time of our data gathering
process (2011), it was not possible to choose other values than the eight values listed above.
Therefore, we convert the values “in a complicated relationship” and “in an open relationship” into
“in a relationship,” and “civil partnership” becomes “married.”
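The recoding described above amounts to a small lookup, sketched here in Python. The mapping and the eight allowed values are taken from the text; the function name and the choice to return `None` for any other value are assumptions made for illustration.

```python
from typing import Optional

# The eight options available at the time of data collection (from the text).
ALLOWED_STATUSES = {
    "single", "in a relationship", "engaged", "married",
    "it's complicated", "widowed", "separated", "divorced",
}

# Legacy values found in the data and the values they are converted to.
LEGACY_STATUS_MAP = {
    "in a complicated relationship": "in a relationship",
    "in an open relationship": "in a relationship",
    "civil partnership": "married",
}


def normalize_status(raw: Optional[str]) -> Optional[str]:
    """Map a raw relationship status onto one of the eight allowed values."""
    if raw is None:
        return None
    value = raw.strip().lower()
    value = LEGACY_STATUS_MAP.get(value, value)
    # Assumption: anything still unrecognized is treated as missing.
    return value if value in ALLOWED_STATUSES else None


print(normalize_status("In an open relationship"))
print(normalize_status("Civil partnership"))
```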
Fig. 3. Facebook administrators: screen shot of Web page which lists people who “like” a Facebook page.
4.3.4. Number of Facebook friends
The number of Facebook friends can serve as a proxy for Facebook
activity or popularity of a user.
4.3.5. Liking data
In addition to the background information, which, with the exception of the number of Facebook
friends, is user supplied, we are able to find for each user which other Facebook pages are “liked.”
Based on this information we would like to see if clusters/segments of users can be identified. That
is, is it possible to distinguish groups of Facebook users with similar “liking” patterns? Similar
patterns could indicate similar preferences, and companies may be able to employ segment-specific
marketing strategies. For example, if a segment of users tends to like certain artists more often than
any other segment, promotions involving such an artist could be specifically targeted at that segment
alone. In the next section, we introduce methodology to find and interpret segments based solely on
the liking profiles of users.
5. Application: clustering Facebook fans
In the previous section we described in some detail how a Facebook page owner/administrator can obtain
data on the page's fans. A customer relationship manager or a marketing manager would like to make
these data operational by, for example, investigating whether fans can be segmented according to their
indicated preferences and/or background characteristics. That is, is it possible to identify groups of
fans on the basis of individual-specific like data? For example, are certain brands or celebrities
notably more popular in subgroups? Such information could be useful as it allows better targeting of
marketing strategies.
A large, internationally successful football (soccer) club granted us administrator rights to its
official Facebook site. This enabled us to extract the data using the framework introduced in
Section 4. For strategic purposes, the football club requested that its name, and any information that
could possibly lead readers to infer the name, be kept from the public. Consequently, in our data
analysis only a selection of general, not football-related, labels is used.
At the time of the data extraction, February 2011, the total number of Facebook fans of the club was
about 4 million. From these, we extracted data from over 40,000 fans. To check representativeness of
this sample we compared the gender and age distribution of our sample to that of the population as
obtained through Facebook Insights. The results, presented in Table 1, show only small differences,
indicating that our data set is representative with respect to gender and age distribution.
Furthermore, we see that the Facebook fans of the football club are, not surprisingly, predominantly
young males.
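One simple way to quantify such a comparison is the largest absolute difference between sample shares and population shares, as in the Python sketch below. This is only one possible summary, not necessarily the comparison underlying Table 1, and the counts and shares in the example are hypothetical numbers chosen for illustration, not the values reported in the paper.

```python
from typing import Dict


def shares(counts: Dict[str, int]) -> Dict[str, float]:
    """Convert category counts into proportions."""
    total = sum(counts.values())
    return {category: count / total for category, count in counts.items()}


def max_share_difference(sample_counts: Dict[str, int],
                         population_shares: Dict[str, float]) -> float:
    """Largest absolute gap between sample and population proportions."""
    sample_shares = shares(sample_counts)
    categories = set(sample_shares) | set(population_shares)
    return max(
        abs(sample_shares.get(c, 0.0) - population_shares.get(c, 0.0))
        for c in categories
    )


# Hypothetical numbers for illustration only (not the values of Table 1).
sample_counts = {"male": 800, "female": 200}
population_shares = {"male": 0.78, "female": 0.22}

gap = max_share_difference(sample_counts, population_shares)
print(round(gap, 3))
```

The same function can be applied to age brackets by passing the bracket counts of the sample and the bracket shares shown in the Insights dashboard.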
As explained in the previous section, Facebook users' data concerning their location can be enriched
with geographical identifiers such as latitude and longitude. In Fig. 4, the resulting concentrations
of fans are visualized in a heatmap created using the Google Maps API [13]. The figure shows a high
concentration of Facebook “fans” in Europe, India, Nigeria, South-east Asia, and Central America.
Large parts of Africa and Australia have a low “fan” concentration and in China there are hardly any
“fans” visible. For China and Africa, this is a result of the fact that Facebook's penetration there
is relatively low compared to that in other regions [2]. Australia has a relatively high general
Facebook penetration (46%) [2]. However, according to Fig. 4, there are hardly any “fans” of our
football club in Australia. Apparently, this football club is not popular in Australia on Facebook.
Recall that our data collection framework only allows for the retrieval of publicly published fields.
Table 2 shows a breakdown of users' background data available in our initial sample. We see that,
except for gender and language settings, the percentage of users providing the profile information
varies and is generally limited. We therefore exclude these variables when attempting to identify
subgroups. Instead, we focus on the “like” data. Our goal is to find clusters of Facebook fans based
on their liking data.
Facebook users can specify what/who they like on their profile page. For example, not only famous
movie stars, movies, sports, athletes, TV programs, and actors but also brands, restaurants or
personal friends may be liked. For our initial sample of 43,861 fans, we found that 176,381 unique
Facebook pages were liked. However, of these 176,381 pages, 77.5% were liked by only one user in our
sample. Often, these pages are simply personal friends' Facebook pages. For our purposes, such pages
are not interesting and a selection must be made.
We consider only the top 150 Facebook pages in terms of likes in
our sample. Selecting data corresponding to these 150 most popular
pages reduces our sample to 16,170 cases. However, the distribution
of the number of likes in this sample of 16,170 users is rather skewed,
as many people have only a few likes and only a few have many likes.
To allow for discrimination on the basis of the liking profiles, we only
consider users that liked at least 5 other pages. The resulting data set
consists of 11,712 individuals. Constructing a data matrix with individuals
as rows and the top 150 liked pages as columns yields a large
matrix with few observed likes. The sparsity of the data (i.e., the many zeros
indicating that a page is not liked) and the dimensionality of the data
set pose a serious problem for "normal" cluster analysis methods. We
therefore analyze the large data matrix by using a joint dimension
reduction and clustering approach.
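The selection steps described above can be sketched in a few lines of Python. This is only an illustration under assumptions of our own: the input format (a dictionary mapping a user identifier to the set of pages that user liked), the function name `build_like_matrix` and the alphabetical tie-breaking rule are hypothetical and not taken from the study, and the minimum number of likes is counted among the selected top pages.

```python
import numpy as np

def build_like_matrix(likes, n_top_pages=150, min_likes=5):
    """Hypothetical helper: likes maps user id -> set of liked page names."""
    # Count how many users in the sample like each page.
    counts = {}
    for pages in likes.values():
        for page in pages:
            counts[page] = counts.get(page, 0) + 1
    # Keep the most frequently liked pages (ties broken alphabetically).
    top = sorted(counts, key=lambda p: (-counts[p], p))[:n_top_pages]
    column = {page: j for j, page in enumerate(top)}
    # Keep users with at least `min_likes` likes among the selected pages.
    users = sorted(u for u, pages in likes.items()
                   if sum(p in column for p in pages) >= min_likes)
    # Binary users-by-pages matrix: 1 = liked, 0 = not liked.
    matrix = np.zeros((len(users), len(top)), dtype=np.int8)
    for i, user in enumerate(users):
        for page in likes[user]:
            if page in column:
                matrix[i, column[page]] = 1
    return users, top, matrix
```

With `n_top_pages=150` and `min_likes=5` the function mirrors the thresholds reported in the text; the share of zeros in the returned matrix gives a quick impression of how sparse the data are.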
5.1. MCA K-means
There exist several methods for clustering high-dimensional data.
One popular approach is to use a two-step procedure. In the first step,
a dimension reduction technique is used to reduce the dimensionality
of the data. In the second step, cluster analysis is applied to the data in
the reduced space. This method may be referred to as the tandem approach
[1]. As shown by [35], an important drawback of this method is
that the dimension reduction may distort or hide the cluster structure.
To overcome this problem, several methods have been proposed [20,
21,34]. Here we apply the joint MCA and K-means method proposed
by [20].
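For comparison, the two-step tandem procedure can be sketched as follows. This is a minimal illustration, not the method used in this paper: it uses principal component analysis (via a singular value decomposition) as the dimension reduction step instead of MCA, and a plain Lloyd-type K-means; the function names are hypothetical.

```python
import numpy as np

def kmeans(points, n_clusters, n_iter=100, seed=0):
    """Plain Lloyd-type K-means with randomly chosen initial centers."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=n_clusters, replace=False)]
    for _ in range(n_iter):
        # Squared Euclidean distance of every point to every center.
        d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Recompute the means; an empty cluster keeps its previous center.
        new_centers = np.array([
            points[labels == c].mean(axis=0) if np.any(labels == c) else centers[c]
            for c in range(n_clusters)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

def tandem_cluster(data, n_dims, n_clusters, seed=0):
    """Step 1: reduce the dimensionality (PCA). Step 2: cluster the scores."""
    centered = data - data.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    scores = u[:, :n_dims] * s[:n_dims]
    return kmeans(scores, n_clusters, seed=seed)
```

Because the first step ignores the cluster structure entirely, the retained dimensions need not be the ones along which the clusters separate, which is the drawback noted by [35].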
MCA, also known as homogeneity analysis [12], yields optimal scaling
values for the columns (i.e., quantifications for pages) in such a way
that pages differently assessed by the individuals receive dissimilar
scale values. Furthermore, rows (i.e., individuals) exhibiting dissimilar
patterns of liked pages receive dissimilar scale values. K-means clustering
[26] finds clusters by minimizing the sum of squared deviations between
the individual observations and their cluster means. [20]
proposed a joint method, from here on referred to as MCAKmeans,
that averages the MCA and K-means objective functions. An important
advantage of the MCAKmeans approach is that it enables visualization
of the data. A more formal formulation, as well as an efficient algorithm
useful for dealing with large data matrices, is given in Appendix B.
Table 1
Age and gender distributions: Insights' data versus our sample.
                    Overall  13-24  25-34  35-44  45-54  55+
Male population     0.82     0.77   0.17   0.03   0.01   0.02
Male sample         0.79     0.74   0.22   0.01   0.01   0.02
Female population   0.14     0.77   0.15   0.04   0.02   0.02
Female sample       0.19     0.78   0.17   0.03   0.01   0.01

5.2. Analysis
We apply MCAKmeans to the 11,712 observations with the 150 binary
variables indicating whether a page was or was not liked by an individual.
To decide upon the number of dimensions and clusters, we
inspect the changes in fit when more dimensions/clusters are added.
In particular, for the dimensionality, we consider the adjusted explained
inertia of the MCA solution, as defined by [14]. The adjusted explained
inertia takes into account the rather specific structure of the data
approximated in MCA. In particular, it corrects for the underestimation
of typical correspondence analysis fit measures when applied to a
(super)indicator matrix. Fig. 5 gives the cumulative explained inertia
for the MCA solutions with different dimensionality. Although the effect
is small, we can see that after three dimensions the effect of adding
more dimensions decreases. We therefore consider only three dimensions
in our analysis. An additional benefit of this choice is that it allows
for graphical representations.
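The text does not restate the adjustment formula. The sketch below uses a form that is commonly given in the correspondence analysis literature: principal inertias of the indicator matrix that exceed 1/Q are rescaled, and the result is expressed relative to an adjusted total inertia. Whether this matches the exact definition in [14] is an assumption, as are the function and argument names.

```python
import numpy as np

def adjusted_inertias(indicator_inertias, n_variables, n_categories):
    """Adjusted principal inertias and their share of the adjusted total.

    indicator_inertias: all non-trivial principal inertias of the indicator
    matrix; n_variables is Q, n_categories is J (the total over variables).
    """
    lam = np.asarray(indicator_inertias, dtype=float)
    Q, J = n_variables, n_categories
    # Only dimensions with an inertia above 1/Q are retained and rescaled.
    keep = lam > 1.0 / Q
    adjusted = (Q / (Q - 1.0)) ** 2 * (lam[keep] - 1.0 / Q) ** 2
    # Adjusted total inertia; the squared inertias sum to the Burt inertia.
    total = Q / (Q - 1.0) * (np.sum(lam ** 2) - (J - Q) / Q ** 2)
    return adjusted, adjusted / total

# Two binary variables (Q = 2, J = 4) with indicator inertias 0.8 and 0.2.
adjusted, share = adjusted_inertias([0.8, 0.2], n_variables=2, n_categories=4)
```

In this two-variable example the single retained dimension accounts for the whole adjusted inertia, which is the behaviour one expects when MCA is applied to only two variables.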
To select the number of clusters, we consider the value of the objective
function, using a three-dimensional solution, with different numbers
of clusters. In Fig. 6, the final objective function values are plotted
against the number of clusters. The decrease in objective function
value after four clusters is small and we therefore consider
three-dimensional solutions with four clusters.
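The selection rule amounts to inspecting how much the objective still drops when one more cluster is added. A small sketch, using made-up objective values that are not read from Fig. 6 and a hypothetical helper name:

```python
def successive_decreases(objective_by_k):
    """Map each K to the drop in the objective relative to the previous K."""
    ks = sorted(objective_by_k)
    return {k_next: objective_by_k[k_prev] - objective_by_k[k_next]
            for k_prev, k_next in zip(ks, ks[1:])}

# Hypothetical values for K = 2, ..., 5 (illustration only).
values = {2: 221.90, 3: 221.80, 4: 221.70, 5: 221.69}
drops = successive_decreases(values)
```

With these illustrative numbers the drop from four to five clusters is much smaller than the earlier drops, which is the kind of pattern that would lead to a four-cluster solution.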
5.3. Results
In Fig. 7, the solution using the first two dimensions of the MCAKmeans
solution is given. Cluster memberships are indicated by using
different colors and symbols. We see that in the first two dimensions
three clusters appear separated from each other whereas the fourth
cluster, situated around the origin, appears to overlap all three other
clusters. As can be verified from Fig. 8, this fourth cluster separates itself
from the other three clusters in the third dimension. Note that, in MCA,
the origin corresponds to the average profile, that is, the average
distribution of likes over Facebook pages. The fact that many attribute points
are situated close to the origin is partly due to the relatively large
number of "not liked" pages. That is, most users did not "like" more than
8 out of the 150 pages. Hence, each row of the data matrix contains
many zeros for the "like" columns and, consequently, many ones for
the corresponding "not liked" columns. This causes the attribute points
corresponding to the "not liked" pages to dominate the mean profile
and draws the corresponding points to the origin.
The spread and sizes of the clusters are nicely displayed in Figs. 7 and
8. However, the attributes (i.e., the Facebook pages) are not labeled to
avoid further cluttering. Consequently, interpretation of the clusters in
terms of the liked/not liked pages is not possible from these figures.
For a better interpretation of the clusters with respect to the pages,
Figs. 9 and 10 give joint plots of the cluster centers and the attributes.
Points close to the origin have not been labeled to avoid clutter and,
due to the confidentiality agreement, we have relabeled pages
corresponding to famous football players as FFP and pages corresponding to
football clubs as FC.
Fig. 4. Where do the football club's Facebook fans come from?

Table 2
Overview of the available profile information for the 43,861 users in our sample.
                       % FB users in our sample
Gender                 99.6
Date of birth          2.5
Relationship status    21.5
Sexual orientation     27.8
Location               54.2
Hometown               29.8
FB language setting    97.4
# FB friends           48.6
Education              37.1
Work experience        22.3

Fig. 5. Explained adjusted inertia as a function of number of MCA dimensions (horizontal axis: number of dimensions, 1 to 10; vertical axis: explained inertia in %, 0 to 100).
Looking at the positions of the attributes and the clusters, we see
that in cluster 1 (3549 observations) there appears to be a link with
Latin America. That is, the Facebook page Fútbol and pages corresponding
to entertainers particularly popular in Latin America (e.g., Daddy
Yankee, Wisin & Yandel) are relatively often liked. Also, the football
club near cluster 1 is in fact the only South American club present in
our data set. In cluster 2 (1734 observations) we find relatively many
likes of Southeast Asia related stars and topics (e.g., SCTV and RCTI are
Indonesian television stations, Upin and Ipin is a Malaysian television
series, and Timnas Indonesia is the Facebook page of the Indonesian
national football team). The three pages farthest removed from the origin
and relatively often associated with individuals in cluster 3 (4048
observations) are cricket and India related (e.g., Sachin Tendulkar is a famous
Indian cricket batsman). Other pages that are relatively often associated
with this cluster are chess, traveling and sleeping. Note that the cluster
mean for this cluster is not far from that of cluster 4 (2381 observations),
so we should be careful in interpreting the pages close to both cluster
centers on the basis of the first two dimensions. Instead, plotting the
second and third dimensions clarifies some differences, as the clusters
separate along the third dimension. Fig. 10 gives the corresponding
plot, where again, to avoid clutter, we removed some labels and use
the general labels for players and clubs. Individuals in the fourth
cluster relatively often like pages corresponding to American
entertainers (e.g., Vin Diesel, Selena Gomez, John Cena, Megan Fox). Also,
Disney, Jackie Chan, Mafia Wars (a multiplayer social network
game) and Facebook are liked more often than average in this
cluster.
It is important to note that the four clusters are characterized by
liked Facebook pages that predominantly are not immediately football
related. In fact, the clutter of football related pages close to the
origin indicates that in all clusters, these pages are liked as well.
This is not surprising as all individuals in our sample "liked" the football
club which granted us administrator rights, thus asserting their
interest in football. However, as the clusters differentiate themselves
through non-football related pages, opportunities arise for
cluster-specific marketing efforts.
The MCAKmeans approach emphasizes relative rather than absolute
differences. This means that if we look at the distribution of likes
in each cluster, the attributes closest to the cluster means in the plot
need not be the most often observed in the cluster. In fact, as indicated
before, for all clusters, the most often liked pages are predominantly
football related pages. Table 3 lists the 10 most often liked pages in
the four clusters. To distinguish between different football clubs and
players we numbered them. Note that differences among the most
popular Facebook pages are limited. The order of the clubs and
players varies, but these are small differences that are of no practical
significance.
Fig. 6. Value of MCAKmeans objective for 3 dimensional solutions for different numbers of clusters (horizontal axis: number of clusters K, 2 to 5; vertical axis: value of the objective function, roughly 221.6 to 221.9).

Fig. 7. MCAKmeans solution with attributes and subjects (axes: Dimension 1 and Dimension 2; legend: Cluster 1, Cluster 2, Cluster 3, Cluster 4, liked page).

The MCAKmeans results suggest that the clustering may be
linked to geographical factors. To further study this, we consider
the Facebook data concerning the locations of the individuals.
However, if individuals chose not to publish their locations, we cannot
determine the country of origin. The language settings could be
used to find plausible countries or regions for these data. On the other
hand, the fact that the information is missing may also be informative
in itself and we choose to leave the missing locations as they are.
Table 4 gives, for each cluster, the 10 most frequently occurring countries
and the corresponding percentages of occurrences per cluster. We see
that, as conjectured earlier, the first cluster has a clear Latin American
component. Cluster 2 is heavily dominated by Facebook users from
Indonesia. Note that this is the only cluster in which "unknown" is not
the most frequently observed country. Also, with over 62% of users from
Indonesia, it is by far the most homogeneous cluster concerning
nationalities. Facebook users from India are overrepresented in the third
cluster. For the fourth cluster there does not appear to be a strong
geographical link.
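A breakdown of this kind can be computed directly from the cluster labels and the published locations. The sketch below treats a missing location as its own "Unknown" category, as in the text; the input format and the function name are assumptions made for illustration.

```python
from collections import Counter

def country_shares(clusters, countries, top_n=10):
    """Per cluster: the top_n countries with their percentage of the cluster."""
    by_cluster = {}
    for cluster, country in zip(clusters, countries):
        # A missing location is kept as a category of its own.
        label = country if country else "Unknown"
        by_cluster.setdefault(cluster, Counter())[label] += 1
    shares = {}
    for cluster, counter in by_cluster.items():
        size = sum(counter.values())
        shares[cluster] = [(name, round(100.0 * count / size, 2))
                           for name, count in counter.most_common(top_n)]
    return shares
```

Keeping "Unknown" in the denominator means the percentages are relative to the full cluster size, so the listed shares of a cluster need not add up to 100.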
Fig. 8. MCAKmeans solution with attributes and subjects (axes: Dimension 2 and Dimension 3; legend: Cluster 1, Cluster 2, Cluster 3, Cluster 4, liked page).

Fig. 9. MCAKmeans plot with cluster means and liked Facebook pages (axes: Dimension 1 and Dimension 2). FC labels denote football clubs, FFP indicates famous players.

6. Conclusions
In a relatively short time, social network sites have become an
important part of daily life for millions of people. Consequently, such
sites are considered to be an important marketing tool. Interviews
with marketing and customer relationship managers reveal that a
clear strategy regarding the social network sites often does not exist
and managers are often unable to use the social network data in their
customer relationship management strategy.
In this paper, we formulated a data collection framework for
retrieving online profile data from Facebook users. In particular, we
showed how a Facebook page owner, that is, a person or company
with administrator rights to the Facebook page, can find other
Facebook users that indicated liking their page. Then, by visiting
the pages of such users, individual level data can be collected.
We applied the data collection framework to obtain a sample of
Facebook users who indicated "liking" a large international football
club. Then, using a joint dimension reduction and clustering approach,
clusters could be identified on the basis of the users' liking patterns.
Four clusters were obtained that differed with respect to the liking
patterns. Moreover, the visualizations immediately exposed how the
clusters differentiated themselves. In particular, differences in relative
popularity of non-football related Facebook pages characterize the
different clusters. Furthermore, the clusters appear to be separated along
geographical lines. That is, although no geographical data were used,
the clusters differentiated themselves along Facebook pages of locally
popular music/TV/sport stars. The popularity of certain pages in only
certain (or one) clusters could be used to formulate better targeted,
differentiated marketing strategies.
6.1. Implications for research
In the CRM literature [33,27,38], customer identification is considered
as the first step in a CRM cycle. Typically, the identification concerns
directly observable customer generated content (e.g., transaction data).
The identification of potential rather than actual customers, as
implemented in the data collection framework presented in this paper, offers
several new research opportunities. It would, for example, be interesting
to study the added value and incorporation of the proposed framework
into existing CRM systems. Merging the online profile data from the
(potential) customers as obtained from social media with actual transaction
data offers other research opportunities. Moreover, tracking the profiles
over time allows researchers to study effects of targeted marketing
efforts in a structural fashion.
The data collection framework presented in this paper was designed
specifically for Facebook. However, Facebook is not the only social
media platform on which individuals provide information about their
preferences and personal backgrounds. Similar ideas and methods can
perhaps be used to obtain publicly available data from other social
media platforms (e.g., Twitter, Instagram, Google Plus, LinkedIn). It
may in fact depend on the firm and its product which social media
outlet is the most interesting.
Fig. 10. MCAKmeans plot with cluster means and attributes (liked Facebook pages) (axes: Dimension 2 and Dimension 3). FC labels correspond to clubs, FFP labels correspond to famous football players.

Table 3
Top 10 pages per cluster.
Cluster 1       Cluster 2          Cluster 3       Cluster 4
Club 1          Club 5             Football        Club 3
Club 2          Player 2           Club 1          Club 2
Player 1        Club 2             Club 2          Player 2
Player 2        Club 1             Club 3          Club 1
Club 3          Timnas Indonesia   Player 2        Club 4
Fútbol          Harry Potter       Player 1        Harry Potter
Club 4          Player 3           Harry Potter    And 1 more
South Park      Club 4             Club 4          Player 5
The Simpsons    Player 1           Player 6        Player 1
Club 5          Player 4           AKON            Club 5

6.2. Implications for practice
Despite the often acknowledged potential of social network data,
most Facebook related marketing research relies on (online)
questionnaires and/or focus groups rather than directly exploiting social network
data. One reason for this situation concerns the limited possibilities for
directly retrieving data from social network sites. In this paper, we
formulated a data collection framework for retrieving online profile data
from Facebook users. We showed how a Facebook page owner, that is,
a person or company with administrator rights to the Facebook page,
can find other Facebook users that indicated liking their page. Then, by
visiting the pages of such users, individual level data can be collected.
The proposed data collection framework has direct potential for
marketing managers as it makes it possible to investigate whether
distinct clusters requiring distinct marketing efforts can be identified
among potential customers (that is, users that already showed some
form of affiliation to the company by "liking" it on Facebook). Hence,
the general framework presented in this paper can be used to improve
and enhance implementation of the identification phase in a firm's
CRM process. In particular, by focusing on potential rather than existing
customers, information becomes available that can be used to improve
marketing efforts aimed specifically at acquisition of new customers.
Ideally, a system should be implemented that merges the online profile
data with other available profile data (e.g., profiles based on transaction
data).
6.3. Limitations and future research directions
The proposed data collection framework only allows for the retrieval
of data from users with public profiles. Moreover, from the sample of
users with a public profile, we selected users that "liked" at least five
of the most popular 150 Facebook pages. The sample analyzed in this
paper therefore does not necessarily represent the population of
Facebook users who "liked" the football club. Instead, the sample only
represents active mainstream users.
Another important issue, inherent to some extent to Facebook and
other new media, concerns the rapid developments that may overtake
current research. In our case, between collecting the data (February 2011)
and finalizing this paper, the number of users who "liked" the Facebook
page increased from around 4 million to 21 million. More importantly,
however, given the steady increase of Facebook users, it may very well
be the case that current users differ from previous users in their usage
of Facebook. As we only received administrator rights for a short period
of time, we did not study such changes. However, the data collection
framework makes it possible to easily track such changes and act
upon them.
In this paper, we considered a clustering analysis based solely on
liking patterns of Facebook users. Although such indicated liking patterns
require very little effort from the users, they are considered as so-called
reactive data. It would be interesting to see whether the reactive data
can be augmented by non-reactive data, for example, by considering
network data (i.e., by incorporating data concerning the connections
between users) and/or other data available on users' Facebook pages
(e.g., posted messages/links/pictures, etc.). Augmenting the data in
such a fashion may yield even richer and more challenging data sets.
Finally, it should be noted that other social network related
applications can also benefit from our data collection framework. For example,
recently, [11] considered targeting strategies directed towards
individuals in a social network using data obtained directly from a large social
network site. Their analysis could be extended by using Facebook data
obtained after application of our methodology.
Appendix A. Identifying Facebook page fans
Appendix B. MCAK-means
In MCAKmeans, the objective is to minimize a weighted average of
the MCA objective and a K-means objective. The resulting objective
function of MCAKmeans can be expressed as:
$$
\begin{aligned}
\min_{Y,B,C,G}\ \phi(Y,B,C,G) &= \alpha_1\,\mathrm{MCA} + (1-\alpha_1)\,\mathrm{Kmeans} \\
&= \alpha_1 \sum_{j=1}^{q} \lVert Y - Z_j B_j \rVert^2 + (1-\alpha_1)\,\lVert Y - CG \rVert^2 \\
&\text{s.t. } Y^T Y = I_k,
\end{aligned}
$$
where $Y$ denotes the $n \times k$ group configuration, $Z_j$ is the $n \times p_j$
(observed) indicator matrix for the $j$th variable, $B_j$ is a matrix of category
quantifications (attribute weights), $C$ denotes the $n \times K$ cluster
membership matrix and $G$ gives the $K \times k$ matrix of cluster means. The
number of clusters ($K$) and the dimensionality ($k$) need to be selected by the
user. The $\alpha_1$ coefficient, which lies between zero and one, allows us to
control for the importance of the dimension reduction part versus the
clustering part. In our application we fix $\alpha_1$ to 0.5 so that both parts are
equally important. An alternating least-squares algorithm can be used
to solve the minimization problem. For fixed $Y$, the category
quantifications become:
$$B_j = (Z_j^T Z_j)^{-1} Z_j^T Y,$$
and, similarly, for fixed $Y$ and $C$, the cluster group means can be
calculated as:
$$G = (C^T C)^{-1} C^T Y.$$

Table 4
Top 10 countries per cluster (with cluster sizes) and clusterwise relative frequencies per country.
Cluster 1 (3549)    Cluster 2 (1734)    Cluster 3 (4048)    Cluster 4 (2381)    All (11,712)
Unknown    21.67    Indonesia  62.11    Unknown    16.67    Unknown    21.50    Unknown    18.53
Mexico      9.30    Unknown    12.34    India      16.30    Indonesia  11.68    Indonesia  15.45
USA         6.03    Malaysia    5.94    Indonesia   8.37    Malaysia    6.34    India       7.03
Colombia    5.04    UK          2.48    Malaysia    5.09    India       5.12    Malaysia    4.46
Brazil      3.38    USA         2.13    Nigeria     4.47    UK          4.49    USA         3.94
UK          3.35    Thailand    1.04    UK          4.47    Egypt       3.02    UK          3.84
Indonesia   3.27    Turkey      0.98    USA         3.53    USA         2.86    Mexico      3.63
Argentina   3.07    Spain       0.92    Egypt       2.67    Mexico      2.48    Nigeria     2.23
Chile       2.70    Brazil      0.75    Brazil      2.03    Nigeria     2.10    Brazil      2.09
France      2.51    Mexico      0.58    Iran        1.90    Algeria     1.68    Egypt       2.00
Total      60.33    Total      89.27    Total      65.51    Total      61.28    Total      63.20

Furthermore, [20] shows that, for fixed $C$, the configuration matrix $Y$
can be obtained using the eigenequation
$$\left( \alpha_1 \sum_{j=1}^{q} Z_j (Z_j^T Z_j)^{-1} Z_j^T + \alpha_2\, C (C^T C)^{-1} C^T \right) Y = Y \Lambda, \qquad (1)$$
where $\alpha_2 = 1 - \alpha_1$.
By considering the eigenvectors (i.e., the columns of $Y$) corresponding
to the $k$ largest eigenvalues, the optimal group configuration, for
fixed $C$, is obtained. After updating $Y$ in this fashion, the cluster
membership matrix $C$ is obtained by considering distances of the $k$-dimensional
points in $Y$ to cluster means in $G$ and by subsequently assigning
observations to the closest cluster.
Starting with some initial values for $C$ and $Y$ (e.g., random cluster
memberships and, for $Y$, the configuration obtained after applying MCA),
the approximations are sequentially updated, leading the objective to
decrease monotonically. If the decrease is below a certain threshold,
the algorithm terminates and a solution is obtained. To reduce the
chances of obtaining a local minimum, several random starts should
be applied.
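The alternating scheme can be sketched compactly in NumPy. This is a minimal illustration rather than the implementation used for the analysis, and it makes several choices of its own: pseudo-inverses replace the inverses so that centered indicator blocks and empty clusters do not break the computation, the constant vector is projected out of the clustering part as one way of discarding the trivial solution, and the full n x n eigenproblem is solved directly, which is only feasible for small n. The function name, arguments and defaults are hypothetical.

```python
import numpy as np

def mca_kmeans(Z_blocks, n_clusters, n_dims, alpha=0.5,
               n_starts=10, max_iter=100, tol=1e-9, seed=0):
    """Alternating least-squares sketch; Z_blocks holds one indicator matrix per variable."""
    rng = np.random.default_rng(seed)
    n, q = Z_blocks[0].shape[0], len(Z_blocks)
    constant = np.full((n, n), 1.0 / n)      # projector onto the constant vector
    # Sum over variables of the projectors onto the centered indicator blocks.
    P_mca = np.zeros((n, n))
    for Z in Z_blocks:
        Zc = Z - Z.mean(axis=0)
        P_mca += Zc @ np.linalg.pinv(Zc)
    best = None
    for _ in range(n_starts):                # several random starts
        labels = rng.integers(n_clusters, size=n)
        previous = np.inf
        for _ in range(max_iter):
            C = np.eye(n_clusters)[labels]   # n x K membership matrix
            pinv_C = np.linalg.pinv(C)
            P_c = C @ pinv_C - constant
            # Update Y: eigenvectors belonging to the largest eigenvalues.
            _, vectors = np.linalg.eigh(alpha * P_mca + (1 - alpha) * P_c)
            Y = vectors[:, ::-1][:, :n_dims]
            # Update G: cluster means of the rows of Y.
            G = pinv_C @ Y
            loss = (alpha * (q * n_dims - np.sum(Y * (P_mca @ Y)))
                    + (1 - alpha) * np.sum((Y - C @ G) ** 2))
            if previous - loss < tol:        # decrease below the threshold
                break
            previous = loss
            # Update C: assign every observation to the closest cluster mean.
            d2 = ((Y[:, None, :] - G[None, :, :]) ** 2).sum(axis=2)
            labels = d2.argmin(axis=1)
        if best is None or loss < best[0]:
            best = (loss, labels.copy(), Y, G)
    return best
```

The function returns the objective value, the cluster labels, the configuration and the cluster means of the best start. The category quantifications are not stored because, for fixed Y, they follow directly from the update formula for B_j.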
Note that the eigenequation (1) is of crucial importance in the proposed
algorithm. For large $n$, the matrix that needs to be considered becomes
large. It is therefore useful to reformulate the method in a more efficient
way. This can easily be achieved by defining
$$X = \left[ \sqrt{\alpha_1}\, Z D_z^{-1/2} \quad \sqrt{1-\alpha_1}\, C D_c^{-1/2} \right],$$
where $Z = [Z_1, Z_2, \ldots, Z_j, \ldots, Z_q]$, $D_z = \mathrm{diag}(Z^T Z)$ and $D_c = C^T C$.
If we consider the singular value decomposition
$$X = Y \Lambda^{1/2} V^T,$$
where $Y^T Y = V^T V = I$, we get, in accordance with Eq. (1),
$$X X^T Y = Y \Lambda$$
and
$$X^T X V = V \Lambda. \qquad (2)$$
The group configuration $Y$ can thus be obtained as
$$Y = X V \Lambda^{-1/2}. \qquad (3)$$
Finally, although not specifically mentioned in [20], it is important to
consider all $Z$ matrices in deviation from the mean vector to avoid a
so-called trivial solution. Alternatively, the trivial solution, i.e., the
eigenvector corresponding to the largest eigenvalue of $X^T X$, should be ignored.
An important advantage of using Eq. (2) over Eq. (1) is that we only need
to find the $k + 1$ largest eigenvalues and corresponding eigenvectors
for the $Q \times Q$ matrix $X^T X$ rather than for the $n \times n$ matrix $X X^T$.
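The reformulation can be checked numerically on a small example. The sketch below builds X from an uncentered indicator matrix and a membership matrix, solves the small eigenproblem of Eq. (2), discards the trivial solution, and recovers Y through Eq. (3). It assumes that every category and every cluster is observed at least once; the function name is hypothetical.

```python
import numpy as np

def configuration_from_small_eigenproblem(Z, C, n_dims, alpha=0.5):
    """Return Y (n x n_dims) and the matching eigenvalues via Eqs. (2) and (3)."""
    d_z = Z.sum(axis=0)   # diagonal of Z^T Z for a 0/1 indicator matrix
    d_c = C.sum(axis=0)   # diagonal of C^T C, i.e., the cluster sizes
    X = np.hstack([np.sqrt(alpha) * Z / np.sqrt(d_z),
                   np.sqrt(1 - alpha) * C / np.sqrt(d_c)])
    # Small eigenproblem: one row and column per category and per cluster.
    eigenvalues, V = np.linalg.eigh(X.T @ X)
    # Sort in decreasing order and skip the trivial (largest) solution.
    order = np.argsort(eigenvalues)[::-1][1:n_dims + 1]
    Y = (X @ V[:, order]) / np.sqrt(eigenvalues[order])
    return Y, eigenvalues[order]
```

Only the matrix X^T X is decomposed here, so the cost of the eigenproblem depends on the number of categories and clusters and not on the number of individuals n.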
References
[1] H. Arabie, L. Hubert, Cluster analysis in marketing research. Factorial k-means analysis for two-way data, in: R. Bagozzi (Ed.), Advanced Methods of Marketing Research (160-189), Blackwell, Oxford, 1994.
[2] E.D. Argaez, http://www.internetworldstats.com/facebook.htm, 2011 (Accessed on 2 June 2011).
[3] S. Ba, H.R. Rao, DSS special issue on the theory and applications of social networks, Decision Support Systems 55 (4) (2013) 939-940 (1. Social Media Research and Applications 2. Theory and Applications of Social Networks).
[4] C.H. Baird, G. Parasnis, From social media to social customer relationship management, Strategy and Leadership 39 (2011) 30-37.
[5] R. Chakraborty, C. Vishik, H.R. Rao, Privacy preserving actions of older adults on social media: exploring the behavior of opting out of information sharing, Decision Support Systems 55 (4) (2013) 948-956 (1. Social Media Research and Applications 2. Theory and Applications of Social Networks).
[6] C.M. Cheung, M.K. Lee, A theoretical model of intentional social action in online social networks, Decision Support Systems 49 (1) (2010) 24-30.
[7] J. Claussen, T. Kretschmer, P. Mayrhofer, The effect of rewarding user engagement: the case of Facebook apps, Information Systems Research 24 (2013).
[8] W. Duan, Special issue on social media: an editorial introduction, Decision Support Systems 55 (4) (2013) 861-862 (1. Social Media Research and Applications 2. Theory and Applications of Social Networks).
[9] Facebook.com, http://developers.facebook.com/, 2011 (Accessed on 16 March 2011).
[10] C. Forman, A. Ghose, B. Wiesenfeld, Examining the relationship between reviews and sales: the role of reviewer identity disclosure in electronic markets, Information Systems Research 19 (2008) 291-313.
[11] S. Gelper, R. van der Lans, G. van Bruggen, Competition for attention in online social networks: implications for viral marketing, 2014 (unpublished manuscript).
[12] A. Gifi, Nonlinear Multivariate Analysis, Wiley, Chichester, 1990.
[13] Google, http://code.google.com/apis/maps/index.html, 2004 (Accessed on 3 May 2011).
[14] M.J. Greenacre, Correspondence Analysis in Practice, Academic Press, London, 1993.
[15] R. Gross, A. Acquisti, Information revelation and privacy in online social networks (the Facebook case), Proceedings of the 2005 ACM Workshop on Privacy in the Electronic Society, 2005, pp. 71-80.
[16] L. Harris, C. Dennis, Engaging customers on Facebook: challenges for e-tailers, Journal of Consumer Behaviour 10 (2011) 338-346.
[17] T. Hennig-Thurau, K.P. Gwinner, G. Walsh, D.D. Gremler, Electronic word-of-mouth via consumer-opinion platforms: what motivates consumers to articulate themselves on the Internet, Journal of Interactive Marketing 18 (2004) 38-52.
[18] T. Hennig-Thurau, E.C. Malthouse, C. Friege, S. Gensler, L. Lobschat, A. Rangaswamy, B. Skiera, The impact of new media on customer relationships, Journal of Service Research 13 (2010) 311-330.
[19] S. Ho, D. Bodoff, K. Tam, Timing of adaptive web personalization and its effects on online consumer behavior, Information Systems Research 22 (2011) 660-679.
[20] H. Hwang, W.R. Dillon, Y. Takane, An extension of multiple correspondence analysis for identifying heterogeneous subgroups of respondents, Psychometrika 71 (2006) 161-171.
[21] A. Iodice D'Enza, F. Palumbo, Iterative factor clustering of binary data, Computational Statistics 1-19 (2012), http://dx.doi.org/10.1007/s00180-012-0329-x.
[22] A.H. Kracklauer, D.Q. Mills, D. Seifert, Customer management as the origin of collaborative customer relationship management, Collaborative Customer Relationship Management, Springer, 2004, pp. 3-6.
[23] C. Lampe, N. Ellison, C. Steinfield, A familiar face(book): profile elements as signals in an online social network, Proceedings of Conference on Human Factors in Computing Systems, ACM Press, 2007, pp. 435-444.
[24] K. Lewis, J. Kaufman, M. Gonzalez, A. Wimmer, N. Christakis, Tastes, ties, and time: a new social network dataset using Facebook.com, Social Networks 30 (4) (2008) 330-342.
[25] Y.-M. Li, Y.-L. Shiu, A diffusion mechanism for social advertising over microblogs, Decision Support Systems 54 (1) (2012) 9-22.
[26] J. MacQueen, Some methods for classification and analysis of multivariate observations, in: L. Le Cam, J. Neyman (Eds.), Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability (1, 281-297), University of California Press, California, 1967.
[27] E.W. Ngai, L. Xiu, D.C. Chau, Application of data mining techniques in customer relationship management: a literature review and classification, Expert Systems with Applications 36 (2) (2009) 2592-2602.
[28] S. Okazaki, What do we know about mobile internet adopters? A cluster analysis, Information & Management 43 (2006) 127-141.
[29] N. Park, S. Lee, J.H. Kim, Individuals' personal network characteristics and patterns of Facebook use: a social network approach, Computers in Human Behavior 28 (2012) 1700-1707.
[30] A. Parvatiyar, J.N. Sheth, Customer relationship management: emerging practice, process, and discipline, Journal of Economic and Social Research 3 (2) (2001) 1-34.
[31] R. Rishika, A. Kumar, R. Janakiraman, R. Bezawada, The effect of customers' social media participation on customer visit frequency and profitability: an empirical investigation, Information Systems Research 24 (2013).
[32] M.J. Shaw, C. Subramaniam, G.W. Tan, M. Welge, Knowledge management and data mining for marketing, Decision Support Systems 31 (2001) 127-137.
[33] R.S. Swift, Accelerating Customer Relationships: Using CRM and Relationship Technologies, Prentice Hall Professional, 2001.
[34] S. Van Buuren, W. Heiser, Clustering n objects into k groups under optimal scaling of variables, Psychometrika 54 (1989) 699-706.
[35] M. Vichi, H. Kiers, Factorial k-means analysis for two-way data, Computational Statistics and Data Analysis 37 (2001) 49-64.
[36] M. Wick, http://download.geonames.org/export/dump/readme.txt, 2005 (Accessed on 28 April 2011).
[37] K.P. Wiedmann, H. Buxel, G. Walsh, Customer profiling in e-commerce: methodological aspects and challenges, Journal of Database Marketing 9 (2) (2002) 170-184.
[38] M. Xu, J. Walton, Gaining customer knowledge through analytical CRM, Industrial Management & Data Systems 105 (7) (2005) 955-971.
[39] F. Zhu, X. Zhang, Impact of online consumer reviews on sales: the moderating role of product and consumer characteristics, Journal of Marketing 74 (2010) 133-148.
Jan-Willem van Dam obtained his Master's degree cum laude in Economics & Informatics
from Erasmus University Rotterdam, the Netherlands, in 2011. The focus of his Master's
thesis is on employing data mining techniques for enhancing sport marketing applications.
His research interests cover areas such as data mining, Web 2.0, the Semantic
Web foundations and applications, and Web information systems.
Michel van de Velden is an assistant professor at the Econometric Institute of the Erasmus
University Rotterdam. His research interests concern the development and application of
visualization methods for multivariate data. His work covers a wide range of research
disciplines ranging from linear algebra to transportation science, and has been published
in an equally wide range of high standing academic journals including Linear Algebra
and its Applications, Psychometrika, Journal of Computational and Graphical Statistics,
Journal of Statistical Software and Marketing Letters. For a full CV and list of publications,
please visit http://people.few.eur.nl/vandevelden/
J.-W. van Dam, M. van de Velden / Decision Support Systems 70 (2015) 60–72