This work is licensed under a Creative Commons Attribution 3.0 License. For more information, see http://creativecommons.org/licenses/by/3.0/.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI
10.1109/ACCESS.2019.2925819, IEEE Access
VOLUME XX, 2017 1
Association Rules Mining among Interests and
Applications for Users on Social Networks
Huayou Si 1,2, Jiayong Zhou1,2, Zhihui Chen1,2, Jian Wan1,2,*, Neal N. Xiong3,*, Wei Zhang1,2,
Athanasios V. Vasilakos4
1School of Computer Science and Technology, Hangzhou Dianzi University, Zhejiang, China
2Key Laboratory of Complex Systems Modeling and Simulation, Ministry of Education, China
3Department of Mathematics and Computer Science, Northeastern State University, Tahlequah, OK, USA
4College of Mathematics and Computer Science, Fuzhou University, Fuzhou 350116, China, Email: th.vasilakos@gmail.com
Corresponding author: Neal N. Xiong(xiongnaixue@gmail.com), Jian Wan(wanjian@zust.edu.cn)
This work is partly supported by the National Natural Science Foundation of China under Grants No. 61472112 and No. 61502129, and the Key R&D Program
of Zhejiang Science and Technology Department under Grant 2017C03047.
ABSTRACT Interest is an important concept in psychology and pedagogy and is widely studied in many
fields. Especially in recent years, the widespread use of many interest-based recommendation systems has
greatly promoted research on interest modeling and mining on social networks. However, the existing studies
have rarely tried to explore the relationships among interests and their application value, and most similar
studies analyze user behavior data. In this paper, we propose and verify two hypotheses about the interests of
social network users. We then use association rules to mine users' interests from LinkedIn users' profiles.
Finally, based on interest association rules and user interest distribution on Twitter, we design an approach
to mine interests for Twitter users and conduct two experiments to systematically demonstrate the approach’s
effectiveness. According to our research, we found that there are a large number of association rules between
human interests. These rules play a considerable role in our method of interest mining. Our research work
not only provides new ideas for interest mining but also reveals the internal relationship between interest and
its application value. The research work has certain theoretical and practical value.
INDEX TERMS Interests, Correlation Analysis, Association Rules, Interest Mining
I. INTRODUCTION
Interests and hobbies refer to an individual's psychological
tendency to want to know and master something and to
participate frequently in related activities, or to a cognitive
tendency to actively explore something. In the contemporary
psychology of interest [1], the term is used as a general
concept that may encompass other, more specific
psychological terms, such as curiosity and, to a much lesser
degree, surprise [2]. In fact, interests have an important
influence on personality formation, mental health, education,
and career development. They are very important concepts in
psychology and pedagogy.
Since the 1980s, scholars have carried out abundant
research on interests. In pedagogy, the relationship between
interest and teaching is a crucial issue in teaching research and
a perennial topic of exploration.
For example, Renninger A et al. [3] systematically discussed
the role of interest in learning and personal development. Hidi
S et al. [4] illustrated the process of interest cultivation. J. M.
et al. [5] believe that interest is constructive to academics and
that raising interest helps students gain a more proactive
learning experience. In psychology, many studies have shown
that interest plays a significant role in personality formation
and career development, as well as in individual mental health.
For example, Philip M. Sadler et al. [6] studied the changes in
students' interests in different periods.
In recent years, with the continuous growth of Internet users
and social network applications, interest-based
recommendation systems have been widely used in practice.
As a matter of fact, recommending personalized products and
information based on user interests and preferences has
become a very effective method for product sales and
information services. Thus, interest modeling and mining for
Internet users and other related research have been gradually
carried out. For example, Elmongui et al. [7], Qian et al. [8],
Eirinaki et al. [9], and Jiang et al. [10] each proposed a
recommendation service method based on user interests.
Huang et al. [11], Bhattacharya et al. [12], Zarrinkalam et al.
[13][14], and Li et al. [15] focused on interest modeling for
Internet users for different goals and tasks. Moreover,
Kapanipathi et al. [16], Patel et al. [17], and Piao et al. [18]
focused on interest mining for Internet users based on access
logs, microblog/blog accessing, and content and behavior of
browsing, respectively. These studies further extend the areas
of interest in research, development, and application.
Although research on interests is very extensive, the
existing research rarely tries to explore the relationships
among interests and their application value based on big data.
To address this issue, and in view of the requirements of
interest mining for Internet users, we preprocessed data from
LinkedIn and Twitter, proposed hypotheses about the
distribution of user interests, and verified them. We then
designed a series of methods to mine users' real interests,
including obtaining interest relevance and calculating users'
sensitivity to interest. Our research shows that many
association rules exist among human interests and that these
rules contribute substantially to interest mining in our
approach. Our contributions in this paper are as follows.
- Based on tens of thousands of LinkedIn profiles with
interests, we analyze the distribution of human interests
and identify 210 high-frequency interests.
- We analyze the correlation of interests and then study
the association rules among the interests based on our
empirical data.
- We analyze the distribution of users’ interests on Twitter
and verify two hypotheses about the distribution
based on empirical data from Twitter and LinkedIn.
- We design an approach to mine interests for Twitter
users based on interest association rules and demonstrate
the approach’s effectiveness.
To facilitate the description of our research, we draw a simple
flow chart for mining user interest in social networks, as
shown in Figure 1.
Figure 1. The flow chart for mining the user’s interests
The rest of this paper proceeds as follows. Section II
discusses the distribution and recognition of interests. Section
III studies the association rules among the interests. Section
IV analyzes the distribution of users’ interests on Twitter.
Section V presents our approach for interest mining for
Twitter users and then discusses its effectiveness. Section VI
presents related work. The conclusions are drawn in Section
VII.
II. EMPIRICAL DATA COLLECTION AND INTEREST
RECOGNITION
A. Interest Data Collection
LinkedIn is a very popular business and employment-
oriented social networking service. As of September 2016,
LinkedIn had more than 467 million accounts. The basic
functionality of LinkedIn allows users to create profiles,
which typically consist of a curriculum vitae describing their
work experience, education and training, interests and
hobbies, and a photo [19]. LinkedIn members usually aim to
create a professional image, access business insights,
develop professional contacts, and find more career
opportunities. Compared with users of other social networks,
LinkedIn members tend to provide more authentic and
reliable personal profiles.
LinkedIn members usually list their interests in their
profiles. Some interests often appear together on the same
profiles, which suggests that these interests have an intrinsic
connection. For example, the interests “read” and “travel”
often appear at the same time, which suggests a close
relationship between them. Thus, LinkedIn career profiles
with interests can be collected to analyze the correlation
characteristics of interests. In our research, we first design a
LinkedIn crawler and then randomly collect 44,623 LinkedIn
profiles, of which 10,028 list interests.
B. Interest Recognition
LinkedIn does not provide a predefined set of interests for
members to choose from when they create their profiles;
members enter and edit their interests as free text.
Therefore, the interests filled in by
LinkedIn members are not standardized. In an interest list of
a LinkedIn profile, there is no fixed separator between
different interests. Some users use the word "and", some use
a comma ",", some use a semicolon ";", and some directly
use a new line to divide different interests. For example,
some user’s interests are "Movies and walking", while some
are "Yoga; hiking; singing; reading; poetry; art; music;
Kids!", which all contain several different interests divided
by different separators. Moreover, LinkedIn users can
express the same interest in different words. In natural
language, the same interest tends to have a variety of
different expressions. Therefore, in this paper we process the
interest data collected as follows:
We first design an algorithm that can intelligently split
LinkedIn members’ interest list to recognize the
interest words as a collection for each user. From the
10,028 profiles with interests, we find 25,913 interest
words, which represent respective interests. There is no
question that some interest words are synonymous; for
example, “ski” and “skiing”, “book” and “books”,
which represent the same interests, are just expressed
in different words.
We then recognize the synonyms and aggregate them
into the same interest items. After manual
proofreading, 19,430 synonym sets are obtained for all the
interest words. Each synonym set of interest words
corresponds to one interest item. To
facilitate the description of our work, in this paper, the
most frequent interest word in a synonym set is used to
name the interest item.
TABLE I
FREQUENCY OF PARTS OF INTEREST ITEMS IN LINKEDIN

| Interest Item | Frequency | Percentage |
|---|---|---|
| travel | 1689 | 3.12% |
| music | 1266 | 2.34% |
| read | 1140 | 2.11% |
| technology | 1073 | 1.98% |
| photography | 811 | 1.50% |
| movie | 772 | 1.43% |
| ski | 742 | 1.37% |
| golf | 674 | 1.25% |
| cycling | 582 | 1.08% |
| running | 551 | 1.02% |
| business | 542 | 1.00% |
| sport | 524 | 0.97% |
| cooking | 491 | 0.91% |
| art | 473 | 0.87% |
| design | 445 | 0.82% |
| family | 427 | 0.79% |
According to the synonym sets of interest, we replace
the interest words in each profile with the names of
their own interest items. For each interest item, we then
calculate the frequency of its occurrences in the 10,028
profiles and the percentage of its occurrences relative to the
total occurrences of all the interest items, which shows
the universality of the interest. Parts of the results are
shown in Table I.
When the interest items are sorted in descending order of
frequency, as shown in Figure 2, the cumulative
percentage of the top 10 interests is 17.02%, that of the top 50
is 37.69%, and that of the top 100 is 46.63%. Therefore, we
can see that the frequency distribution of the 19,430 interests
is very uneven: very few interests have very high
frequencies, and the frequencies of most of the interests are
very low.
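The cumulative percentages quoted above can be computed with a few lines of code. The following is a minimal sketch, assuming the interest frequencies are available as a list of counts; the function name `cumulative_share` and the toy counts are hypothetical and are not the paper's data.

```python
from itertools import accumulate

def cumulative_share(frequencies):
    """Cumulative percentage of all occurrences covered by the top-n items.

    Position n-1 of the result is the share covered by the n most
    frequent items.
    """
    ordered = sorted(frequencies, reverse=True)
    total = sum(ordered)
    return [100.0 * running / total for running in accumulate(ordered)]

# Toy counts (hypothetical): one dominant item and a short tail.
toy_counts = [10, 50, 5, 30, 5]
shares = cumulative_share(toy_counts)
```

On real data with a long tail, the curve rises quickly for the first few items and then flattens, which is the uneven shape the text describes.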
Figure 2. The Cumulative Percentage of the Top n Interests
The higher the frequency of an interest is, the more
popular the interest is, and the greater its analytical
value is. Therefore, we remove
the low-frequency interest items and retain 210 high
frequency interest items as subjects of study. In the
experimental data, 8,675 out of the 10,028 profiles
contain at least one of the 210 interest items.
TABLE II
EXAMPLES OF NORMALIZED REPRESENTATION OF INTERESTS IN LINKEDIN
PROFILES

| User ID | Interest Strings Collected | Normalized Representation of Interests |
|---|---|---|
| 168915697 | New technology; Sciences; Languages | technology; language; science |
| 27428582 | Wine, food and good music! | music; food; wine |
| 113724463 | Rugby; Golf; Travel and adventures | travel; golf; rugby; adventure |
| 7735645 | Theater & Improvisation; Tango dancing. | dance; movie |
| 145915690 | Travelling; Football and Fishing | travel; fishing; football |
| 134641445 | Reading; writing; music; eating; cooking; traveling; camping; hiking | read; music; cooking; hike; travel; writing; camping; eating |
| 13715016 | music: piano and guitar; photography (b&w); skiing; badminton | music; photography; ski; guitar; badminton; piano |
Thus, for each LinkedIn profile, by keeping only the interests
among the 210 interest items under their standard names, we
obtain a normalized representation of the interests. Some
examples are shown in Table II.
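The splitting and normalization steps of this section can be illustrated as follows. This is only a sketch under stated assumptions, not the authors' algorithm: the separators are the four named in the text ("and", comma, semicolon, new line), and the small `SYNONYMS` map is a hypothetical stand-in for the manually proofread synonym sets.

```python
import re

# Hypothetical synonym map: surface form -> name of the interest item.
SYNONYMS = {
    "ski": "ski", "skiing": "ski",
    "book": "book", "books": "book",
    "reading": "read", "read": "read",
    "travel": "travel", "travelling": "travel", "traveling": "travel",
    "movies": "movie", "music": "music",
}

# Separators named in the text: ";", ",", new line, and the word "and".
_SEPARATORS = re.compile(r"\s*(?:;|,|\n|\band\b)\s*", flags=re.IGNORECASE)

def split_interests(raw):
    """Split a free-text interest list into lower-cased interest words."""
    parts = _SEPARATORS.split(raw)
    cleaned = [part.strip(" .!?").lower() for part in parts]
    return [part for part in cleaned if part]

def normalize(raw, synonyms=SYNONYMS):
    """Map interest words to interest items, dropping unknown words."""
    items = []
    for word in split_interests(raw):
        item = synonyms.get(word)
        if item is not None and item not in items:
            items.append(item)
    return items

words = split_interests("Yoga; hiking; singing; reading; poetry; art; music;\nKids!")
items = normalize("Travelling; Skiing and books")
```

Words that are not in the synonym map are discarded, which mirrors keeping only the retained high-frequency interest items; a real pipeline would need a far larger map and handling of phrases such as "music: piano and guitar".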
III. CORRELATION ANALYSIS FOR INTERESTS
A. Correlation Analysis Approach for Interests
When one event occurs, other events often follow; this
relationship is called association. The knowledge that
reflects dependencies or associations between events is
known as relational knowledge. For example, according to
shopping basket analysis, some retail rules can be
determined, such as "70% of customers who buy a basketball
also buy basketball sportswear at the same time" and "40%
of all customers buy a basketball and basketball sportswear
at the same time". These rules are called association rules.
Correlation analysis is also known as association mining, the
purpose of which is to find the association rules between
data items in a given data set and to describe the degree of
closeness between data items. The data set for association
rules mining is usually recorded as D.
D = {T1, T2, ..., Tk, ..., Tn}, where Tk (k= 1, 2, ..., n) is
called a record.
Each record Tk consists of a list of items.
Tk = {i1, i2, ..., im}
In this paper, the data record set D refers to the 8,675
profiles with the 210 high frequency interest items. Tk is one
of the profiles. The item list of Tk is the collection of interest
items in a given profile. In this way, we can establish a
correlation analysis approach for interest items.
In correlation analysis, the measurement methods of
importance and value for association rules are confidence,
support, expectation and lift.
Confidence: the measurement of the accuracy and
intensity of association rules. The Confidence of the
rule X→Y in data record set D represents the frequency
of appearance of Y in all the records where X appears,
also representing the inevitability of rule X→Y,
denoted as:
$$\mathrm{Confidence}(X \rightarrow Y) = \frac{|\{T \in D : X \cup Y \subseteq T\}|}{|\{T \in D : X \subseteq T\}|}$$
Support: the measurement of the importance of
association rules, which reflects the universality of
association rules and indicates the representation of
association rules in all record sets. The Support of rule
X→Y in data record set D represents the frequency of
appearance of both X and Y simultaneously in all
records, denoted as:
$$\mathrm{Support}(X \rightarrow Y) = \frac{|\{T \in D : X \cup Y \subseteq T\}|}{|D|},$$
where |D| refers to the number of all records in data record
set D.
Expectation: for a rule X→Y, it refers to the frequency
of the occurrences of Y in all data record sets. In rule
X→Y, it describes the frequency of Y in all records sets
without any influential factors, denoted as:
$$\mathrm{Expectation}(X \rightarrow Y) = \frac{|\{T \in D : Y \subseteq T\}|}{|D|}$$
Lift: for a rule X→Y, it describes how the occurrence of
X affects the appearance of Y, which is the ratio of
confidence to expectation of the rule, denoted as:
$$\mathrm{Lift}(X \rightarrow Y) = \frac{\mathrm{Confidence}(X \rightarrow Y)}{\mathrm{Expectation}(X \rightarrow Y)}$$
Thus, based on the analysis of interest items, we can mine
the association rules among interest items and quantify their
characteristics, such as confidence, support, expectation and
lift.
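As an illustration, the four measures can be computed directly from their verbal definitions. The following sketch assumes each record is a Python set of interest items; the function name `rule_metrics` and the toy records are hypothetical and do not come from the paper's data.

```python
def rule_metrics(records, antecedent, consequent):
    """Confidence, support, expectation and lift of the rule X -> Y."""
    x, y = frozenset(antecedent), frozenset(consequent)
    n = len(records)
    n_x = sum(1 for t in records if x <= t)          # records containing X
    n_y = sum(1 for t in records if y <= t)          # records containing Y
    n_xy = sum(1 for t in records if (x | y) <= t)   # records containing X and Y
    confidence = n_xy / n_x if n_x else 0.0
    support = n_xy / n
    expectation = n_y / n
    lift = confidence / expectation if expectation else 0.0
    return {"confidence": confidence, "support": support,
            "expectation": expectation, "lift": lift}

# Toy record set (hypothetical): eight profiles with normalized interests.
toy_records = [
    {"culture", "travel"}, {"culture", "travel", "food"},
    {"culture", "music"}, {"travel"}, {"music"},
    {"read", "music"}, {"food", "travel"}, {"read"},
]
metrics = rule_metrics(toy_records, {"culture"}, {"travel"})
```

In this toy set, two of the three profiles that contain "culture" also contain "travel", while half of all profiles contain "travel", so the lift is greater than 1: knowing that a user likes "culture" raises the chance of "travel" above its baseline.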
B. Correlation Analysis of Interests
In the mining process of association rules, it is necessary to
set the minimum confidence threshold and the minimum
support threshold. An association rule that satisfies the
thresholds is a strong and meaningful association rule.
Apriori [20] is one of the best-known algorithms for mining
strong association rules. Based on the empirical data
collected in this paper, we apply the Apriori algorithm to
mine strong association rules. The numbers of strong
association rules mined under different minimum thresholds
are shown in Table III.
As seen from Table III, the number of interest
association rules mined depends on the minimum
confidence threshold and the minimum support
threshold. Therefore, for the specific requirements of the
expected application, a suitable set of strong association rules
can be obtained by setting different minimum thresholds.
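The threshold-based mining described above can be sketched with a compact Apriori-style procedure. This is an illustrative sketch rather than the authors' implementation: it assumes records are sets of interest items, restricts rules to a single-item consequent (as in the examples of Table IV), and uses hypothetical function names and toy data.

```python
from itertools import combinations

def frequent_itemsets(records, min_support):
    """Level-wise (Apriori-style) search; returns {itemset: support}."""
    n = len(records)
    items = {item for record in records for item in record}
    level = [frozenset([item]) for item in sorted(items)]
    frequent = {}
    while level:
        size = len(level[0]) + 1
        counts = {c: sum(1 for t in records if c <= t) for c in level}
        kept = {c: cnt / n for c, cnt in counts.items() if cnt / n >= min_support}
        frequent.update(kept)
        # Join step: merge frequent itemsets that differ in exactly one item.
        candidates = {a | b for a, b in combinations(kept, 2) if len(a | b) == size}
        # Prune step: every subset one item smaller must itself be frequent.
        level = [c for c in candidates
                 if all(frozenset(s) in kept for s in combinations(c, size - 1))]
    return frequent

def strong_rules(records, min_support, min_confidence):
    """Rules X -> y (single-item consequent) that meet both thresholds."""
    freq = frequent_itemsets(records, min_support)
    rules = []
    for itemset, support in freq.items():
        if len(itemset) < 2:
            continue
        for y in itemset:
            x = itemset - {y}
            confidence = support / freq[x]
            if confidence >= min_confidence:
                rules.append((tuple(sorted(x)), y, confidence, support))
    return rules

# Toy record set (hypothetical).
toy_records = [
    {"culture", "travel"}, {"culture", "travel", "food"},
    {"culture", "music"}, {"travel"}, {"music"},
    {"read", "music"}, {"food", "travel"}, {"read"},
]
rules = strong_rules(toy_records, min_support=0.25, min_confidence=0.6)
```

Raising either threshold can only shrink the rule set, which is the monotone pattern visible along the rows and columns of Table III.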
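TABLE III.
THE NUMBERS OF ASSOCIATION RULES DUG OUT BASED ON DIFFERENT
MINIMUM THRESHOLDS

| Minimum Confidence Threshold (rows) / Minimum Support Threshold (columns) | 0.2% | 0.6% | 1% | 1.4% | 1.8% | 2.0% |
|---|---|---|---|---|---|---|
| 10% | 1751 | 309 | 127 | 67 | 38 | 30 |
| 20% | 857 | 133 | 66 | 37 | 19 | 15 |
| 30% | 421 | 58 | 21 | 13 | 8 | 5 |
| 40% | 180 | 17 | 4 | 2 | 1 | 1 |
| 50% | 86 | 4 | 1 | 1 | 0 | 0 |
| 60% | 28 | 1 | 0 | 0 | 0 | 0 |
| 70% | 12 | 1 | 0 | 0 | 0 | 0 |
| 80% | 3 | 0 | 0 | 0 | 0 | 0 |
| 90% | 1 | 0 | 0 | 0 | 0 | 0 |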
In addition, some of the association rules mined are
shown in Table IV. It can be seen from Table IV that there
are some very strong correlations among human interests. An
example is the association rule "culture→travel", for which
the confidence is 48.33%, the support is 1.16%, and the
lift is 232.77%. This shows that, among human interests,
"culture" and "travel" are highly relevant. Another example
is the association rule "read; photography→travel", for which
the confidence is 53.24% and the lift is 256.43%.
Thus, through the empirical correlation analysis, we find that
there are many association relationships among
human interests and that some association rules have high
confidence, lift, and support. This shows that there are
intrinsic links among human interests. Therefore,
they can be applied to interest mining for users on social
networks.
TABLE IV
EXAMPLES OF INTEREST ASSOCIATION RULES DUG OUT

| No. | Antecedent | Consequent | Confidence | Lift | Support | Expectation |
|---|---|---|---|---|---|---|
| 1. | friends | family | 59.63% | 983.49% | 1.50% | 6.06% |
| 2. | culture | travel | 48.33% | 232.77% | 1.16% | 20.76% |
| 3. | food | travel | 46.67% | 224.78% | 1.13% | 20.76% |
| 4. | marketing | media | 32.55% | 592.02% | 1.60% | 5.50% |
| 5. | read; music | movie | 31.05% | 343.53% | 0.99% | 9.04% |
| 6. | read; photography | travel | 53.24% | 256.43% | 0.85% | 20.76% |
| 7. | read; cooking | travel | 48.18% | 232.05% | 0.76% | 20.76% |
| 8. | read; movie | music | 39.09% | 252.88% | 0.99% | 15.46% |
| 9. | sport; music | travel | 38.01% | 183.09% | 0.75% | 20.76% |
| 10. | read; movie | travel | 35.00% | 168.59% | 0.89% | 20.76% |
IV. CHARACTERISTICS OF USER INTERESTS ON
TWITTER
A. Our Hypotheses
Twitter is an online social networking service. Users can
create accounts on Twitter to post and read short 140-
character messages called "tweets". A user's tweets can be
spread to that person’s followers. At present, Twitter is a
very popular user information publishing platform and has
more than 500 million users. Users are likely to post
tweets about topics that interest them.
Therefore, we can make the following hypothesis:
Hypothesis 1: The words that can express a Twitter
user’s interests usually appear in his tweets.
In other words, the interests that are mentioned in the
tweets of a Twitter user probably are the interests of the
Twitter user. We can even make another hypothesis as
follows:
Hypothesis 2: The more frequently an interest
appears in the tweets of a Twitter user, the more likely it is
to be the user’s real interest.
B. Verification of Our Hypotheses
To verify our hypotheses, we first collect from Twitter the
tweets of 930 Twitter users who are also members of LinkedIn
and have provided their real interests on LinkedIn. Then,
for each of these users, we determine which interests are
mentioned in all of the user’s tweets and how often, where the
candidate interests are the 210 interest items obtained in the
subsection Interest Recognition. The interests mined from all
the tweets of a user are not necessarily his real
interests, but his real interests are likely to be among them. For
example, from all the tweets of user Melgallant on Twitter,
we mined 128 interests with their corresponding
frequencies. These interests are sorted in descending order
of frequency and shown in Table V. In
addition, we collected his real interests from LinkedIn, which
are also shown in Table V.
TABLE V.
EXAMPLE OF INTEREST MINING FOR USER MELGALLANT ON TWITTER
Screen_name: Melgallant
Interests from LinkedIn: media; editing; international relations; writing
Ordered Interest List from Twitter:
<media,311>; <art,261>; <read,104>; <leader,103>;
<performance,71>; <Canada,58>; <coffee,49>;
<video,43>; <culture,43>; <marketing,43>;
<ski,37>; <business,37>; <health,30>; <rock,27>;
<internet,27>; <food,25>; <building,25>;
<movie,24>; <kids,24>; <design,23>; <UK,20>;
<communication,17>; <sport,15>; <surf,14>;
<eating,14>; <family,14>; <planning,13>; <talent
management,13>; <music,13>; <wine,13>;
<dog,13>; <dance,12>; <technology,12>;
<writing,11>; <drink,11>; <hockey,10>; <law,9>;
<bridge,8>; <editing,8>; <skate,8>; <analytics,7>;
<shopping,7>; <mentor,7>; <nature,7>;
<recruitment,6>; <science,6>; <rowing,6>;
<sales,6>; <gas,6>; <blogging,6>; <research,6>;
<friends,6>; <travel,6>; <innovation,5>; <yoga,5>;
<speaking,5>; <shooting,5>; <painting,4>;
<security,4>; <startup,4>; <acting,4>; <fashion,4>;
<running,4>; <bigdata,3>; <android,3>;
<motivation,3>; <fitness,3>; <philanthropy,3>;
<camping,3>; <china,3>; <marathon,2>; <risk,2>;
<golf,2>; <fishing,2>; <international relations,2>;
<environment,2>; <opera,2>; <driving,2>;
<drums,2>; <cooking,2>; <Asia,2>; <singing,2>;
<museum,2>; <India,2>; <oil,2>; <bass,2>;
etc.
Recall Rate: 100.0%
Accuracy Rate: 3.91%
F1 Rate: 51.95%
In this paper, Recall Rate refers to the percentage of a user's
real interests that are mined, Accuracy Rate refers to the
proportion of real interests among the mined interests, and
F1 Rate refers to the average (arithmetic mean) of Recall Rate
and Accuracy Rate. Ordered Interest List from Twitter refers
to the list of interests that are mined for a Twitter user and
sorted in descending order according to their frequencies.
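The three rates can be written down directly from these definitions. The sketch below is illustrative only: the function name `evaluate` and the toy interest lists are hypothetical, and the F1 Rate follows the definition given in the text (an arithmetic mean, rather than the harmonic mean that is more commonly called F1).

```python
def evaluate(mined, real):
    """Recall Rate, Accuracy Rate and F1 Rate as defined in the text."""
    mined, real = set(mined), set(real)
    hits = mined & real
    recall = len(hits) / len(real) if real else 0.0
    accuracy = len(hits) / len(mined) if mined else 0.0
    f1_rate = (recall + accuracy) / 2  # arithmetic mean, per the text
    return recall, accuracy, f1_rate

# Toy example (hypothetical): 8 mined interests, 4 real interests, 3 in common.
mined_interests = ["media", "art", "read", "writing",
                   "coffee", "ski", "editing", "food"]
real_interests = ["media", "editing", "writing", "law"]
recall, accuracy, f1_rate = evaluate(mined_interests, real_interests)
```

Because the mined list is usually much longer than the list of real interests, recall can be high while accuracy stays low, which is the pattern reported for the empirical samples.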
In Table V, we can see that the real interests of user
Melgallant all appear in his tweets on Twitter, so the recall
rate of his interests is 100%. However, the accuracy rate
is only 3.91%, which shows that many of the interests
mentioned in his tweets are not his real interests.
Moreover, taking the 930 Twitter users with known
LinkedIn accounts as empirical samples, we find
similar results. The specific data are recorded in Table VI.
TABLE VI.
THE STATISTICAL RESULTS OF THE EMPIRICAL SAMPLES FOR RECALL,
ACCURACY, AND F1 RATE

| Intervals | Recall Rate: Numbers of Users | Recall Rate: Percentage | Accuracy Rate: Numbers of Users | Accuracy Rate: Percentage | F1 Rate: Percentage |
|---|---|---|---|---|---|
| [100%-90%] | 318 | 34.20% | 1 | 0.11% | 0.00% |
| (90%-80%] | 74 | 7.95% | 0 | 0.00% | 0.00% |
| (80%-70%] | 68 | 7.31% | 0 | 0.00% | 0.00% |
| (70%-60%] | 104 | 11.18% | 1 | 0.11% | 0.00% |
| (60%-50%] | 108 | 11.61% | 4 | 0.43% | 0.11% |
| (50%-40%] | 21 | 2.26% | 2 | 0.22% | 0.54% |
| (40%-30%] | 54 | 5.81% | 0 | 0.00% | 0.54% |
| (30%-20%] | 60 | 6.45% | 12 | 1.29% | 6.23% |
| (20%-10%] | 25 | 2.69% | 99 | 10.65% | 29.57% |
| (10%-0%] | 98 | 10.54% | 811 | 87.21% | 63.01% |
From Table VI, we can see that most users have high recall
rates: about 72% of the users fall in the intervals from 50%
upward, and 34.20% fall in the top interval. This means that,
for most users, the majority of their real interests appear in
their own tweets. Therefore, we can believe that Hypothesis 1 is true,
that is to say, the words that can express a Twitter user’s
interests usually appear in his tweets.
Analyzing these data further, we find that, for
the empirical samples, the highest-frequency interest is a real
interest for 17.42% of the users, the second-highest for 15.32%
of the users, the third-highest for 14.00% of the users,
and so on. Parts of the data are shown in Table VII. The
Number of Users column in Table VII gives the number
of empirical users for whom at least the corresponding number
of interests can be mined from their own tweets. For
example, in the tenth row of Table VII, the Number of Users
value 871 means that there are 871 users in the empirical
samples for whom at least 10 interests are mined from their
own tweets, and the Number of Hits value 60 means that, for 60
of these 871 Twitter users, the tenth interest is a real
interest.
From Table VII, we can find that the probability that an
interest is a user’s real interest usually increases with the
increased frequency of that interest in the Twitter user’s
tweets. Figure 3 depicts the trend of proportions of hits along
with the Nth highest frequencies of interests.
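The per-rank statistics described here can be computed as in the following sketch. It assumes, hypothetically, that each user's ordered interest list and set of real interests are held in dictionaries keyed by user name; the function name and the toy data are illustrative only.

```python
def hit_proportions(ordered_lists, real_interests, max_rank):
    """For each rank n, count users with at least n mined interests and
    how many of them have a real interest at rank n."""
    rows = []
    for rank in range(max_rank):
        eligible = [user for user, interests in ordered_lists.items()
                    if len(interests) > rank]
        hits = sum(1 for user in eligible
                   if ordered_lists[user][rank] in real_interests[user])
        proportion = hits / len(eligible) if eligible else 0.0
        rows.append((rank + 1, len(eligible), hits, proportion))
    return rows

# Toy data (hypothetical): interests already sorted by descending frequency.
toy_ordered = {
    "u1": ["media", "art", "read"],
    "u2": ["travel", "music"],
    "u3": ["golf"],
}
toy_real = {
    "u1": {"media", "read"},
    "u2": {"music"},
    "u3": {"golf", "ski"},
}
rows = hit_proportions(toy_ordered, toy_real, max_rank=3)
```

Each tuple has the same layout as a row of the hit-proportion table: order of frequency, number of users, number of hits, and proportion of hits.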
TABLE VII.
PROPORTION OF HITS OF THE INTERESTS WITH THE NTH HIGHEST
FREQUENCIES

| Order of Frequency | Number of Users | Number of Hits | Proportion of Hits |
|---|---|---|---|
| 1 | 930 | 162 | 17.42% |
| 2 | 927 | 142 | 15.32% |
| 3 | 921 | 129 | 14.00% |
| 4 | 915 | 120 | 13.11% |
| 5 | 911 | 103 | 11.30% |
| 6 | 905 | 85 | 9.40% |
| 7 | 902 | 83 | 9.20% |
| 8 | 895 | 76 | 8.50% |
| 9 | 884 | 79 | 8.90% |
| 10 | 871 | 60 | 6.90% |
| 11 | 867 | 67 | 7.70% |
| 12 | 857 | 74 | 8.60% |
| 13 | 850 | 54 | 6.40% |
| 14 | 842 | 51 | 6.00% |
| 15 | 836 | 64 | 7.70% |
| 16 | 829 | 44 | 5.30% |
| 17 | 815 | 57 | 7.00% |
| 18 | 802 | 43 | 5.30% |
| 19 | 794 | 44 | 5.50% |
| 20 | 786 | 40 | 5.10% |
Figure 3. Trend of proportion of hits of high frequency of interest
V. INTEREST MINING FOR TWITTER USERS
A. Our Approach to Interest Mining
We have confirmed our hypotheses that the words
that can express a Twitter user’s interests probably appear in
his tweets and that the higher the frequency of an interest in a
user’s tweets, the more likely it is to be his real interest.
However, we cannot identify a user's real interests directly
from his tweets, since numerous interests can usually be mined
from a Twitter user’s tweets. In addition, from Table VI, we can
also see that the accuracy rate is very low.
According to the nature of interest association
rules, we can assume that if a user has an interest, he may also
have the interests associated with that interest. Therefore, we
apply the interest association rules to reorder the interests
mined from each user's tweets so that the real
interests appear as near the front of the ordered
interest list from Twitter as possible. We can then extract the
first few interests as the user's real interests, because they are
the most likely to be real.
Without loss of generality, in this paper, we regard the
frequencies of the interests as their weights. For a user’s
ordered interest list from Twitter, for example the one in
Table V, we change the weights based on interest association
rules and then re-sort the interests in descending order of
weight. On this basis, we designed the following approach to
apply interest association rules to interest mining for a
Twitter user, the steps of which are listed below.
1. Given a Twitter user, collect all his tweets from Twitter.
2. According to the 210 high frequency interest items dug
out in subsection Interest Recognition, mine the interests
and their frequencies mentioned in all his tweets.
3. Sort the interests by descending order according to their
frequencies as an ordered Interest List from Twitter,
denoted as List oittsList.
4. Take out all the elements in List oittsList as a collection,
denoted as Set ittsSet.
5. Select a set of interest association rules as Set ruleSet
dug out in subsection Correlation Analysis of Interests.
6. One by one, take out each interest irt in List oittsList.
7. If there is a rule in Set ruleSet whose antecedent
is interest irt, then add its consequents to Set ittsSet as
interests with the weight W.
8. Once every interest in List oittsList has been processed,
sort the interests in Set ittsSet in descending order of
weight to form an interest list, denoted as List
rsltList.
9. According to actual needs, take the first several
interests from List rsltList as the result of interest mining
for the user.
In this process, the weight W is set to w+k×r. Parameter w
is the existing weight of the interest to be added if that
interest is already in Set ittsSet; otherwise, w is 0.
Parameter k is a constant that sets the influence of the
association rules on interest mining in the approach: the
greater the value of k, the greater their influence.
Parameter r is the probability that interest irt is the user’s
true interest, taken as the proportion of hits for the
position of interest irt in List oittsList, as given in
Table I. Parameter r ensures that if interest irt is likely
to be a real interest, the interests introduced by interest irt
receive correspondingly large weights.
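The steps above can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation. The function name, the dictionary layout of the rule set, and the use of tweet frequencies as the initial weights are assumptions made for the example; the paper only states that an interest already in Set ittsSet has an existing weight w.

```python
def mine_interests(tweet_counts, rule_set, hit_rate_by_rank, k, top_n=10):
    """Re-rank a user's interests with association rules (steps 3 to 9).

    tweet_counts:     dict, interest -> frequency in the user's tweets (step 2)
    rule_set:         dict, antecedent -> list of consequents (hypothetical layout)
    hit_rate_by_rank: list, entry i is r for the interest at position i
    k:                constant controlling the influence of the rules
    """
    # Step 3: ordered interest list (oittsList), most frequent first.
    oitts_list = sorted(tweet_counts, key=tweet_counts.get, reverse=True)
    # Step 4: working collection (ittsSet). Assumption: the initial
    # weight of each interest is its frequency in the tweets.
    itts_set = {interest: float(freq) for interest, freq in tweet_counts.items()}
    # Steps 6 and 7: for each interest irt, add the consequents of its rules.
    for rank, irt in enumerate(oitts_list):
        r = hit_rate_by_rank[rank] if rank < len(hit_rate_by_rank) else 0.0
        for consequent in rule_set.get(irt, ()):
            w = itts_set.get(consequent, 0.0)  # w is 0 for a new interest
            itts_set[consequent] = w + k * r   # W = w + k * r
    # Step 8: sort by weight in descending order (rsltList).
    rslt_list = sorted(itts_set, key=itts_set.get, reverse=True)
    # Step 9: keep the first several interests.
    return rslt_list[:top_n]
```

With k set to 0, the consequents receive no additional weight, so the interests found in the tweets keep their frequency-based order.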
B. Experimental Setup for Evaluation
To verify the value of association rules for interest mining,
we mine two sets of interest association rules based on
different thresholds and then set up two sets of experiments
according to the two respective sets of interest association
rules for interest mining. Finally, by comparing and
analyzing the Proportion of Hits, Recall Rate, Accuracy Rate,
and F1 Rate of the results, we determine the value of the
association rules for interest mining. The experiments are set
up as follows:
In Experiment 1, we use the association rules dug out
through larger thresholds. Therefore, in this
experiment, there are fewer association rules, but their
association is strong. As shown in Table VI, if the
minimum support and confidence thresholds are set to
0.4% and 20% respectively, 286 association rules can
be dug out based on our empirical data discussed in
Subsection 3.2. In Experiment 1, we take these interest
association rules as a set of interest association rules
for our approach to interest mining.
In Experiment 2, we use the association rules dug out
through smaller thresholds. Therefore, in this
experiment, there are more association rules, but many
of them are weak. In the case that the minimum support
and confidence thresholds are set to 0.2% and 10%,
respectively, we can obtain the 1628 association rules
dug out, the lifts of which are all greater than 100%, as
a set of interest association rules for interest mining.
In addition, for the two experiments, we apply our
approach to process our empirical samples, i.e., the 930
Twitter users discussed in subsection 4.2. Without loss of
generality, in each experiment, we set our approach’s
parameter k to 0, 7, 14, 21, 28, 35, and 42 and conduct 7 tests.
When parameter k is set to 0, the association rules have
no effect, and our approach simply returns the original
oittsList from Twitter as shown in Table VI, where the
order of interests is based only on the frequencies of their
appearance in users’ tweets.
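To make the role of the thresholds concrete, the sketch below filters candidate rules by minimum support, minimum confidence, and lift. It is a simplified illustration and not the mining procedure used in the paper: it only considers rules with a single antecedent and a single consequent, and the function name, the input layout (one set of interests per profile), and the `min_lift` parameter are assumptions made for the example.

```python
from itertools import permutations


def mine_pair_rules(profiles, min_support, min_confidence, min_lift=1.0):
    """Mine rules A -> B between single interests from a list of interest sets.

    Returns a dict mapping (A, B) to (support, confidence, lift) for every
    rule that meets both thresholds and whose lift exceeds min_lift.
    """
    n = len(profiles)
    item_count = {}
    pair_count = {}
    for interests in profiles:
        items = set(interests)
        for a in items:
            item_count[a] = item_count.get(a, 0) + 1
        # Count every ordered pair (A, B) that co-occurs in one profile.
        for a, b in permutations(items, 2):
            pair_count[(a, b)] = pair_count.get((a, b), 0) + 1

    rules = {}
    for (a, b), count in pair_count.items():
        support = count / n                      # P(A and B)
        confidence = count / item_count[a]       # P(B | A)
        lift = confidence / (item_count[b] / n)  # P(B | A) / P(B)
        if support >= min_support and confidence >= min_confidence and lift > min_lift:
            rules[(a, b)] = (support, confidence, lift)
    return rules
```

Lowering `min_support` and `min_confidence`, as in Experiment 2, admits more rules, but the additional rules have weaker support and confidence.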
C. Experimental Results
1) EXPERIMENT 1
When the 7 tests are completed in this experiment, for each
test, we calculate the proportion of users whose N-th
interest in their own List rsltList is one of their
real interests, that is, the hit rate of the N-th
interest. For example, for each user's first interest in his
List rsltList, when parameter k is set to 0, 7, 14, 21, 28, 35,
and 42, the corresponding proportions are 17.42%, 19.25%,
21.61%, 21.72%, 23.23%, 23.44%, and 23.76%, respectively.
For brevity, the other figures are not explained in
detail here. Figure 4 compares the proportions for
the first 10 interests across the 7 tests.
Figure 4. Comparison of the hit rates in the 7 tests in experiment 1
As seen from Figure 4, once the association rules take effect, that
is, parameter k is not set to 0, the hit rates of the first 10
interests increase to some extent. In some cases, the effect of the
association rules is obvious. For example, for the first
interest, the hit rate reaches 23.76% when parameter
k is set to 42, which is clearly higher than the 17.42%
obtained when parameter k is set to 0. Other interest items show
similar behavior. This means that applying interest
association rules clearly improves the probability that the
first several interests in a user’s List rsltList dug out by our
approach are his real interests. This shows that the interest
association rules can play a good role in our approach.
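The hit rate of the N-th interest described above can be computed as in the following sketch. The function name and the data layout are hypothetical. The sketch also assumes that a user whose result list has fewer than N interests counts as a miss, which the paper does not specify.

```python
def hit_rate_at_rank(ranked_lists, true_interests, n):
    """Share of users whose n-th ranked interest (1-based) is a real interest.

    ranked_lists:   dict, user -> ordered result list (rsltList)
    true_interests: dict, user -> set of the user's known real interests
    """
    hits = 0
    for user, ranked in ranked_lists.items():
        # Assumption: a list shorter than n counts as a miss for rank n.
        if len(ranked) >= n and ranked[n - 1] in true_interests[user]:
            hits += 1
    return hits / len(ranked_lists)
```

Calling the function for n = 1, ..., 10 and for each value of k yields the kind of series compared in Figure 4.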
Moreover, for each test’s results, given a user, we first can
get the first 10 interests in his List rsltList dug out and
calculate the recall rate for him. We then count the
proportion of users whose recall rates are greater than a given
value. For example, when parameter k is 0, 7, 14, 21, 28, 35,
and 42, the corresponding proportions of users whose recall
rates are greater than 70% are 6.77%, 7.96%, 8.17%, 7.85%,
7.96%, 8.28%, and 8.49%, respectively. In another example,
the corresponding proportions of users whose recall rates are
greater than 30% are 40.22%, 45.59%, 47.42%, 47.42%,
47.96%, 48.49%, and 48.49%. Figure 5 intuitively compares
the proportions according to the values of parameter k.
Figure 5. Comparison of the proportions of users according to different
recall rates and parameter k in experiment 1
From Figure 5, we can see that when parameter k
is set to 0, the corresponding curve is the worst and is
clearly inferior to the other curves. This means that its recall
rate is the lowest under the same conditions. In our tests, when
parameter k is set to 42, the corresponding curve
is among the best. The association
rules thus improve the recall rate under all the weights
tested, so they are valuable for interest mining.
In this experiment, in general, the greater their weight,
the better their effect.
Furthermore, for each test’s results and for a user, we first
also obtain the first 10 interests in his List rsltList that are
dug out and calculate the accuracy rate and F1 rate for the
user. The proportions of users whose accuracy rates (or F1
rates) are greater than a given value are then counted. For
example, when parameter k is set to 0, 7, 14, 21, 28, 35, and
42, the corresponding proportions of users whose accuracy
rates are greater than 70% are 0.32%, 0.32%, 0.43%, 0.54%,
0.65%, 0.65%, and 0.65%, respectively. Figure 6 and Figure
7 depict the proportions of users with accuracy rates and F1
rates, respectively, that are greater than a given value.
Figure 6. Comparison of the proportions of users according to different
accuracy rates and parameter k in experiment 1
Figure 7. Comparison of the proportions of users according to different
F1 rates and parameter k in experiment 1
From Figure 6 and Figure 7, we can also find that the curves
corresponding to parameter k with value 0 are inferior to
other curves, while the curve corresponding to parameter k
with the value of 42 is quite good. This also means that the
association rules are valuable for interest mining.
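The per-user measures and the proportions plotted in Figures 5 to 7 can be sketched as follows. The paper does not define the accuracy rate formally in this section; the sketch assumes it is the share of returned interests that are real interests (commonly called precision), and that the F1 rate is the harmonic mean of the accuracy rate and the recall rate. The function names are hypothetical.

```python
def user_metrics(ranked, truth, top_n=10):
    """Return (recall, accuracy, f1) for one user's first top_n interests."""
    top = ranked[:top_n]
    hits = sum(1 for interest in top if interest in truth)
    recall = hits / len(truth) if truth else 0.0
    # Assumption: "accuracy rate" means hits divided by interests returned.
    accuracy = hits / len(top) if top else 0.0
    f1 = 2 * accuracy * recall / (accuracy + recall) if hits else 0.0
    return recall, accuracy, f1


def share_above(values, threshold):
    """Proportion of users whose measure is strictly greater than threshold."""
    return sum(1 for v in values if v > threshold) / len(values)
```

For example, applying `share_above` with a threshold of 0.7 to the recall rates of all users in one test gives one point on a curve such as those in Figure 5.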
2) EXPERIMENT 2
In this experiment, we also conduct the 7 tests, this time based on
the second set of association rules, which contains the 1628
association rules that are dug out, many of which are weak.
For each test’s result, we also calculate the hit rates of the N-th
interests. For example, for each user's first interest, when
parameter k is set to 0, 7, 14, 21, 28, 35, and 42, the
corresponding hit rates are 17.42%, 21.94%, 24.19%,
24.19%, 23.44%, 23.76%, and 23.87%, respectively. For
each user’s second interest, the corresponding hit rate is also
over 15%. Figure 8 shows the hit rate of the top 10 interests
at different k values.
Figure 8. Comparison of the hit rates in the 7 tests in experiment 2
From Figure 8, we can find that compared to parameter k
with value 0, in other cases, the hit rates increase
significantly. For example, for their first interest, the hit rates
are as high as 24.19% when parameter k is set to 14, which
is significantly higher than the 17.42% when parameter k is
set to 0. This means that the application of interest
association rules greatly improves the hit rates of the first
several interests in a user’s ordered interest list that are dug
out. In this experiment, however, the effect of the rules does not
keep improving as their weight increases.
Figure 9. Comparison of the proportions of users according to different
recall rates and parameter k in experiment 2
Figure 10. Comparison of the proportions of users according to
different accuracy rates and parameter k in experiment 2
Figure 11. Comparison of the proportions of users according to
different F1 rates and parameter k in experiment 2
Furthermore, for each test’s results and for each user, we first
obtain his first 10 interests that are dug out and calculate the
recall rate, accuracy rate, and F1 rate for him. The
proportions of users whose recall rates (or accuracy rates, or
F1 rates) are greater than a given value are then counted. To
reflect the difference, we set the increment of parameter k to 7. For
example, when parameter k is set to 0, 7, 14, 21, 28, 35, and
42, the corresponding proportions of users whose recall rates
are greater than 70% are 6.77%, 9.25%, 8.92%, 9.03%,
9.14%, 8.92%, and 8.92%, respectively. Figure 9, Figure 10,
and Figure 11 depict the proportions of users
whose recall rate, accuracy rate, and F1 rate, respectively, are
greater than a given value.
From Figure 9, Figure 10, and Figure 11, we also find that
the curves corresponding to parameter k with value 0 are
clearly inferior to the other curves. This means that this set of
association rules is quite valuable for interest mining. When
k is set to different nonzero values, the difference between the
corresponding curves is not significant. Combined with the
first experiment, we believe that not all association rules are
necessarily beneficial to mining the real interests of users. In
particular, weak association rules may introduce some bias.
D. Experimental Analysis
Through the two experiments and their results, we find that the
interest association rules have a clearly positive effect
on interest mining in our approach. This
conclusion is reasonable. Association rules
reflect the relationships between things; in terms of interest,
if someone has an interest, then to a certain extent he is
likely to have the other interests related to that one.
The results of these experiments also indicate that the
interest association rules that we mine from our empirical
big data are reliable, because they are valuable for interest
mining in our approach.
When parameter k is set to different values, the
corresponding values of the recall rate, accuracy rate, and F1
rate are different. Moreover, in experiment 1, the greater the
value of parameter k, the slightly better the effect of the
association rules. When parameter k is set to 28, 35, or
42, the corresponding results are close to one another. In
experiment 2, the value of parameter k has little effect on the
experimental results once the rules are applied. This means that
the set of association rules and their
weight in application have subtle effects on interest mining.
This is worth exploring further.
In general, the results of the two experiments are in good
agreement with each other. However, they differ slightly,
that is, the different sets of rules behave slightly
differently in the application. We can further argue that the
minimum confidence and minimum support
thresholds used for association rule mining influence the
expected application. This is also worth exploring further.
VI. RELATED WORKS
Interests are very important concepts in psychology and
pedagogy. Since the 1980s, scholars have carried out
considerable amounts of research on interests in different
areas of research. Michelson et al. [21] use a knowledge base
to disambiguate and categorize the entities in tweets.
They then develop a "topic profile", which characterizes
users' topics of interest, by discerning which categories
appear frequently and cover the entities. In pedagogy,
Renninger A et al. [3] illustrate the role of interest in learning
and personal development. They agree that interest is an
important force to promote learning. Therefore, it is very
meaningful to cultivate and improve students' interest in
teaching. Hidi S et al. [4] systematically study the cultivation
of interests. They elaborate on the four-stage interest
cultivation process. J. M. Harackiewicz [5] proposed a four-
stage model of interest development, which can help to
establish some measures to effectively enhance interest. In
recommendation, O. Phelan et al. [22] describe a new
approach to news recommendations that uses real-time
microblogging activity from services such as Twitter as the
basis for promoting news stories from users' favorite RSS
feeds. Bharath Sriram et al. [23] provide a short text
classification method. They propose using a small set of
domain-specific features extracted from the author’s profile
and text. The proposed approach effectively classifies the
text to a predefined set of generic classes such as News,
Events, Opinions, Deals, and Private Messages. In addition,
P. M. Sadler [6] observed changes in the interests of more
than 6,000 students in common occupations at different
times.
In recent years, along with the development of the Internet,
the interest-based recommendation systems have been
widely used in e-commerce and social networking. Thus,
interest modeling and mining for Internet users have been
gradually carried out. For example, H. G. Elmongui et al. [7]
proposed a personalized recommendation system for the
user's timeline that combines his user characteristics, social
behavioral characteristics and tweet content to capture his
interests. Qian X et al. [8] design a unified personalized
recommendation model based on personal interest,
interpersonal interest similarity, and interpersonal influence.
The factor of personal interest can make the recommended
items meet users' individualities, especially for experienced
users. For the cold start users, the interpersonal interest
similarity and interpersonal influence can enhance the
intrinsic link among features in the latent space. Their
experimental results show the proposed approach
outperforms the main existing approaches. Eirinaki et al. [9]
proposed a user interest community detection model
that analyzes the text stream from the Weibo website to detect
users' interest communities. Their user interest model addresses
the problem that existing community detection methods
ignore the structural and semantic information of posts. In
addition, an allocation model is proposed, based on
improved hypertext-induced topic search, which can reduce
the negative impact of unrelated users and their interests and thus
improve the accuracy of extracting interests and high-impact
users. The experimental results indicate that this model can
effectively alleviate the sparsity of post data in user interest
community detection. In addition,
Vijayakumar et al. [24], Yin H et al. [25], S. Zhao et al.
[26], K. Xu [27] and other scholars have also put forward
their own methods in this research area. Moreover,
Vijayaraghavan et al. [28] and Yee et al. [29] have applied
for U.S. Patents for their interest-based recommendation
systems.
For interest modeling and its application, Zarrinkalam et
al. [13] integrate the temporal evolution of semantic
information and user interests from the Wikipedia category
structure into their predictive models to address the
limitations of existing methods of interest space operations.
Specifically, in order to capture the temporal behavior of the
topic and the user's interests, they consider discrete intervals
and construct the user's topic profile in each time interval.
Then, the interests observed by the user over several time
intervals are summarized by transferring them over the
Wikipedia category structure. The experimental results show
that they not only enable us to summarize the interests of
users but also enable us to transfer users' interests at different
time intervals that do not necessarily have the same set of
topics. Bhattacharya et al. [12] propose KAURI, a graph-
based framework to collectively link all the named entities in
all tweets posted by a user via modeling the user's topics of
interest. They argue that each user has a potential distribution
of thematic interests across the various named entities, and
then combine the interest information associated with the
user and the information associated with the tweets into a unified
graph-based framework. Their experimental results show
that KAURI significantly outperforms the baseline methods
in terms of accuracy. Zarrinkalam et al. [14] argue that
existing methods of identifying user interests rely heavily on
explicit contributions (posts) from users, ignoring implicit
user interests, that is, topics that a user has not explicitly
mentioned but may be interested in. They therefore propose a
prediction model based on graph join, which runs on a
representation model composed of three types of information:
the explicit contributions of users to topics, the
relationships between users, and the relatedness of topics. A
comparison on a real-world public Twitter dataset
shows that this model is very effective in building interest
profiles for cold-start users. In addition, in order to solve the
problem that the SATM model is too strict and consumes a
large-scale corpus, X. Li et al. [15] propose a generalized
topic model (LTM) for short text, provided that the
observable short text is generated from the original
document. The membership of the original document is
unknown. Experimental results show that the model is more
competitive than commonly used models. M. Huang et al.
[11] built a user model of heterogeneous networks with
undirected and directed edges and applied the model to
propose a new approach to overlapping community detection
in heterogeneous social networks (OCD-HSN). Compared
with the existing state-of-the-art algorithms, this method
shows higher accuracy and lower time consumption on
real social networks.
In terms of interest mining, P. Kapanipathi et al. [16]
establish a hierarchy-based semantics system that infers user
interests expressed as hierarchical interest graphs by
leveraging the hierarchical relationships existing in the
knowledge base and then uses different levels of conceptual
abstraction to personalize or recommend items. The
results show that this method is effective for the users
studied. J. Xu et al. [17] proposed a new unsupervised learning
model, the latent interest and topic mining model (LITM), which
is used to automatically mine latent user interests and item
topics from the user-item bipartite network. Experiments
show that this work can effectively alleviate the limitations
of a latent factor model (LFM), and the experimental results
verify the effectiveness of LITM model training and its
ability to provide better service recommendation
performance based on a user-item bipartite network. In
addition, L. He et al. [30], L. Deng et al. [31] and other
scholars have also proposed their own methods for interest
mining. Based on user preferences, Zhou J et al. [32] design
a two-stage mining algorithm (GAUP) to mine the most
influential nodes in a network on a given topic. Given a set
of users’ documents labeled with topics, GAUP first
computes user preferences with a latent feature model based
on SVD or a model based on vector space and then finds
Top-K nodes in the second stage. Overall, these approaches
for interest mining for Internet users are based on access logs,
microblog/blog accessing, and content and behavior of
browsing.
In the larger context, in recent years, social network data
mining has been extensively studied. However, extracting
intelligence from such data has become a quickly widening
multidisciplinary area that demands the synergy of scientific
tools and expertise. Sapountzi A and Psannis K E [33]
illustrate the entire spectrum of social data networking
analysis and their associated frameworks and provide a
sophisticated classification of state-of-the-art frameworks
considering the diversity of practices, methods and
techniques. They demonstrate challenges and future
directions with a focus on text mining and the promising
avenue of computational intelligence. Zhou X et al. [34]
concentrate on user role identification based on their social
connections and influential behaviors in order to facilitate
information sharing and propagation in social networking
environments. Chen C et al. [35] present a study of deceptive
information of great benefit to the detection of Twitter spam.
Guo R et al. [36] propose a novel method for crawling to
extract fresh information from online social networks in an
efficient and effective manner. Moreover, the interest mining
for users has a wide range of application prospects, such as
travel recommendation [37], user personality analysis [38],
organizational behavior analysis [39], and so on [40].
However, with respect to interest mining itself, the existing research
we have consulted rarely addresses the inner relationships among
interests and their application.
VII. CONCLUSIONS
Based on a large amount of empirical data from social
networks, in this paper we have performed the following four
research tasks.
Collecting tens of thousands of profiles with personal
interests from LinkedIn as our empirical data, we
analyze the distribution of human interests and then
mine 210 high frequency interests as the objects of
study.
We analyze the correlation of interests and study the
association rules among the 210 interests based on our
empirical data.
Based on hundreds of Twitter users with known
interests, we analyze the distribution characteristics of
users’ interests on Twitter.
Based on interest association rules and users’ interest
distribution on Twitter, we design an approach to
interest mining for Twitter users and demonstrate the
approach’s effectiveness.
According to our studies in this paper, we found that
there are a large number of correlations between human
interests, and some association rules have very high degrees
of confidence, lift and support. These findings show that
there are some inherent, stable relationships among human
interests. In addition, we find that when the interest
association rules are applied to interest mining, they
markedly improve the results of our approach.
Our research work not only provides a new idea for
interest mining but also reveals the intrinsic relationships of
association and dependency among interests and their
application value. In fact, the research work has considerable
theoretical and practical value.
In this research work, we also found some topics that are
worth exploring further. Soon, we will carry out the
following research work.
a) Study the optimal solution in which association rules
apply to interest mining, such as the choice of rule sets
and the setting of their weights.
b) Empirically analyze the clustering relationships among
interests based on big data and study their application
value in interest mining.
In addition, we will apply related theories and methods from
other areas of research, such as the theories in [38][39], to study
relationships among users on the social networking platform
Twitter. Moreover, we will improve the data processing
capabilities of our approach to make it practical for large-
scale data sets.
REFERENCES
[1] P. J. Silvia, Exploring the Psychology of Interest. Oxford University
Press, USA, 2006.
[2] “Interest (emotion),” Wikipedia. 03-Mar-2019.
[3] K. A. Renninger, S. Hidi, A. Krapp, and A. Renninger, The Role of
interest in Learning and Development. Psychology Press, 2014.
[4] S. Hidi and K. A. Renninger, “The Four-Phase Model of Interest
Development,” Educational Psychologist, vol. 41, no. 2, pp. 111–127,
Jun. 2006.
[5] J. M. Harackiewicz, J. L. Smith, and S. J. Priniski, “Interest Matters:
The Importance of Promoting Interest in Education,” Policy Insights
from the Behavioral and Brain Sciences, vol. 3, no. 2, pp. 220–227,
Oct. 2016.
[6] P. M. Sadler, G. Sonnert, Z. Hazari, and R. Tai, “Stability and volatility
of STEM career interest in high school: A gender study,” Science
Education, vol. 96, no. 3, pp. 411–427, 2012.
[7] H. G. Elmongui, R. Mansour, H. Morsy, S. Khater, A. El-Sharkasy,
and R. Ibrahim, “TRUPI: Twitter recommendation based on users’
personal interests,” in International Conference on Intelligent Text
Processing and Computational Linguistics, 2015, pp. 272–284.
[8] X. Qian, H. Feng, G. Zhao, and T. Mei, “Personalized
Recommendation Combining User Interest and Social Circle,” IEEE
Transactions on Knowledge and Data Engineering, vol. 26, no. 7, pp.
1763–1777, Jul. 2014.
[9] M. Eirinaki, J. Gao, I. Varlamis, and K. Tserpes, “Recommender
Systems for Large-Scale Social Networks: A review of challenges and
solutions,” Future Generation Computer Systems, vol. 78, pp. 413–
418, Jan. 2018.
[10] L. Jiang, L. Shi, L. Liu, J. Yao, and M. A. Yousuf, “User interest
community detection on social media using collaborative filtering,”
Wireless Netw, Feb. 2019.
[11] M. Huang, G. Zou, B. Zhang, Y. Liu, Y. Gu, and K. Jiang,
“Overlapping community detection in heterogeneous social networks
via the user model,” Information Sciences, vol. 432, pp. 164–184,
Mar. 2018.
[12] P. Bhattacharya, M. B. Zafar, N. Ganguly, S. Ghosh, and K. P.
Gummadi, “Inferring user interests in the twitter social network,” in
Proceedings of the 8th ACM Conference on Recommender systems,
2014, pp. 357–360.
[13] F. Zarrinkalam, H. Fani, E. Bagheri, and M. Kahani, “Predicting users’
future interests on Twitter,” in European Conference on Information
Retrieval, 2017, pp. 464–476.
[14] F. Zarrinkalam, M. Kahani, and E. Bagheri, “Mining user interests
over active topics on social networks,” Information Processing &
Management, vol. 54, no. 2, pp. 339–357, Mar. 2018.
[15] X. Li, C. Li, J. Chi, and J. Ouyang, “Short text topic modeling by
exploring original documents,” Knowl Inf Syst, vol. 56, no. 2, pp.
443–462, Aug. 2018.
[16] P. Kapanipathi, P. Jain, C. Venkataramani, and A. Sheth, “User
Interests Identification on Twitter Using a Hierarchical Knowledge
Base,” in The Semantic Web: Trends and Challenges, 2014, pp. 99–
113.
[17] J. Xu, S. Wang, S. Su, S. A. P. Kumar, and C. Wu, “Latent Interest and
Topic Mining on User-Item Bipartite Networks,” in 2016 IEEE
International Conference on Services Computing (SCC), 2016, pp.
778–781.
[18] G. Piao and J. G. Breslin, “Inferring user interests for passive users on
twitter by leveraging followee biographies,” in European Conference
on Information Retrieval, 2017, pp. 122–133.
[19] “LinkedIn - Wikipedia.” [Online]. Available:
https://en.wikipedia.org/wiki/LinkedIn. [Accessed: 30-Mar-2019].
[20] Asthana, A. Singh, and D. Singh, "A survey on association rule mining
using apriori based algorithm and hash based methods," International
Journal of Advanced Research in Computer Science Software
Engineering, vol. 3, no. 7, 2013.
[21] M. Michelson and S. A. Macskassy, “Discovering users’ topics of
interest on twitter: a first look,” in Proceedings of the fourth workshop
on Analytics for noisy unstructured text data, 2010, pp. 73–80.
[22] O. Phelan, K. McCarthy, and B. Smyth, “Using twitter to recommend
real-time topical news,” in Proceedings of the third ACM conference
on Recommender systems, 2009, pp. 385–388.
[23] B. Sriram, D. Fuhry, E. Demir, H. Ferhatosmanoglu, and M. Demirbas,
“Short text classification in twitter to improve information filtering,”
in Proceedings of the 33rd international ACM SIGIR conference on
Research and development in information retrieval, 2010, pp. 841–
842.
[24] V. Vijayakumar, S. Vairavasundaram, R. Logesh, and A. Sivapathi,
“Effective Knowledge Based Recommender System for Tailored
Multiple Point of Interest Recommendation,” International Journal of
Web Portals (IJWP), vol. 11, no. 1, pp. 1–18, 2019.
[25] H. Yin, B. Cui, Z. Huang, W. Wang, X. Wu, and X. Zhou, "Joint
modeling of users' interests and mobility patterns for point-of-interest
recommendation," in Proceedings of the 23rd ACM international
conference on Multimedia, 2015, pp. 819-822: ACM.
[26] S. Zhao, I. King, and M. R. Lyu, “Aggregated Temporal Tensor
Factorization Model for Point-of-Interest Recommendation,” Neural
Process Lett, vol. 47, no. 3, pp. 975–992, Jun. 2018.
[27] K. Xu et al., “Improving user recommendation by extracting social
topics and interest topics of users in uni-directional social networks,”
Knowledge-Based Systems, vol. 140, pp. 120–133, Jan. 2018.
[28] R. Vijayaraghavan, S. R. KULKARNI, and K. M. ADUSUMILLI,
“Intent prediction based recommendation system using data combined
from multiple channels,” US20170213274A1, 27-Jul-2017.
[29] Y. H. Yee, J. V. McFadden, J. Kraemer, and D. Sampath, “Methods,
systems, and media for recommending content items based on topics,”
US20170103343A1, 13-Apr-2017.
[30] L. He, Y. Jia, W. Han, and Z. Ding, “Mining user interest in microblogs
with a user-topic model,” China Communications, vol. 11, no. 8, pp.
131–144, Aug. 2014.
[31] L. Deng, Y. Jia, B. Zhou, J. Huang, and Y. Han, “User interest mining
via tags and bidirectional interactions on Sina Weibo,” World Wide
Web, vol. 21, no. 2, pp. 515–536, Mar. 2018.
[32] J. Zhou, Y. Zhang, and J. Cheng, "Preference-based mining of top-K
influential nodes in social networks," Future Generation Computer
Systems, vol. 31, pp. 40-47, 2014.
[33] A. Sapountzi and K. E. Psannis, "Social networking data analysis tools
& challenges," Future Generation Computer Systems, 2016.
[34] X. Zhou, B. Wu, and Q. Jin, "User role identification based on social
behavior and networking analysis for information dissemination,"
Future Generation Computer Systems, 2017.
[35] C. Chen et al., "Investigating the deceptive information in Twitter
spam," Future Generation Computer Systems, vol. 72, pp. 319-326,
2017.
[36] R. Guo, H. Wang, M. Chen, J. Li, and H. Gao, "Parallelizing the
extraction of fresh information from online social networks," Future
Generation Computer Systems, vol. 59, pp. 33-46, 2016.
[37] Z. Yu, H. Xu, Z. Yang, and B. Guo, “Personalized Travel Package
With Multi-Point-of-Interest Recommendation Based on
Crowdsourced User Footprints,” IEEE Transactions on Human-
Machine Systems, vol. 46, no. 1, pp. 151-158, Feb. 2016.
[38] S. Laumer, C. Maier, A. Eckhardt, and T. Weitzel, "User personality
and resistance to mandatory information systems in organizations: a
theoretical model and empirical test of dispositional resistance to
change," Journal of Information Technology, vol. 31, no. 1, pp. 67-82,
2016.
[39] M. J. Gelfand, Z. Aycan, M. Erez, and K. Leung, "Cross-cultural
industrial organizational psychology and organizational behavior: A
hundred-year journey," Journal of Applied Psychology, vol. 102, no.
3, p. 514, 2017.
[40] W. Gao, J. L. Guirao, B. Basavanagoud, and J. Wu, “Partial multi-
dividing ontology learning algorithm,” Information Sciences, vol. 467,
pp. 35-58, 2018.
Dr. Huayou Si is a lecturer in the School of
Computer Science and Technology, Hangzhou
Dianzi University. He received M.S. and Ph.D. degrees in
Computer Science from Peking University in 2004
and 2012, respectively. His research interests
include P2P networks,
service-oriented computing, and the Semantic Web. He
has published more
than 20 academic papers in these fields. In addition, he has
served on the Technical Program Committees of
several international conferences.
Jiayong Zhou was born in Hangzhou, China. He
is now a student at Hangzhou Dianzi University.
He obtained a bachelor's degree in computer
science and technology from Zhejiang Agriculture
and Forestry University and Yangyang College in
2017. He is currently pursuing a master's degree in
computer technology at Hangzhou Dianzi
University. His current interests are data mining
and the application of big data in the social field.
Zhihui Chen is a master's student at the School of
Computer Science and Technology, Hangzhou
Dianzi University, China. He received a B.Eng. at
the School of Computer Science and Technology,
Hangzhou Dianzi University in China in 2015. His
research interests include data mining and web
service. In addition, he is a student member of the
China Computer Federation (CCF).
Dr. Wan Jian received the PhD degree in
Computer Technology from Zhejiang University.
He serves as a professor with the School of
Computer Science and Technology, Hangzhou
Dianzi University, China. His research interests
include virtual computing, grid computing, service
computing, and embedded systems. He is a
member of the Association for Computing
Machinery (ACM) and the China Computer
Federation (CCF).
Neal N. Xiong is currently an Associate
Professor (3rd year) in the Department of
Mathematics and Computer Science, Northeastern
State University, OK, USA. He received his PhD
degrees at Wuhan University (in sensor system
engineering) and at the Japan Advanced Institute
of Science and Technology (on dependable sensor
networks). Before joining Northeastern State
University, he worked for approximately 10 years at
Georgia State University, the Wentworth Technology
Institution, and Colorado Technical University
(as a full professor for approximately 5 years).
His research interests include Cloud Computing,
Security and Dependability, Parallel and Distributed Computing, Networks,
and Optimization Theory.
Dr. Xiong has published over 280 international journal papers and over
120 international conference papers. Some of his works were published in
IEEE JSAC, IEEE or ACM transactions, ACM Sigcomm workshop, IEEE
INFOCOM, ICDCS, and IPDPS. He has been a General Chair, Program
Chair, Publicity Chair, PC member and OC member of over 100
international conferences and a reviewer for approximately 100
international journals, including IEEE JSAC, IEEE SMC (Part: A/B/C),
IEEE Transactions on Communications, IEEE Transactions on Mobile
Computing, and IEEE Trans.
Wei Zhang received the BE degree at the School
of Information Science and Engineering of Wuhan
University of Science and Technology in China in
2000, and he received MEc and PhD degrees at the
Computer School of Wuhan University in China
in 2004 and 2008, respectively. He is currently an
associate professor with the School of Computer
Science and Technology, Hangzhou Dianzi
University, China. His research interests include
wireless sensor networks and Intelligent
Computing. He is a member of the Association for
Computing Machinery (ACM) and the China Computer Federation (CCF).
ATHANASIOS V. VASILAKOS is currently a
professor in the Dept. of Computer and
Telecommunications Engineering, University of
Western Macedonia, Greece, and visiting
professor at the Graduate Programme of the Dept. of
Electrical and Computer Engineering, National
Technical University of Athens (NTUA). He is a
coauthor (with W. Pedrycz) of the books
Computational Intelligence in
Telecommunications Networks (CRC Press, USA,
2001), Ambient Intelligence, Wireless Networking, Ubiquitous Computing
(Artech House, USA, 2006); coauthor (with M. Parashar, S. Karnouskos, W.
Pedrycz) of Autonomic Communications (Springer), and Arts and
Technologies (MIT Press); and coauthor (with M. Anastasopoulos) of
Game Theory in Communication Systems (IGI Inc., USA). He has
published more than 150 articles in top international journals and
conferences. He is the editor-in-chief of the Inderscience Publishers
journals: International Journal of Adaptive and Autonomous
Communications Systems (IJAACS, http://www.inderscience.com/ijaacs),
International Journal of Arts and Technology (IJART,
http://www.inderscience.com/ijart). He has been on the editorial board of
more than 20 international journals, including IEEE Communications
Magazine (1999-2002 & 2008-), IEEE Transactions on Systems, Man and
Cybernetics (SMC, Part B, 2007-), IEEE Transactions on Wireless
Communications (invited), and ACM Transactions on Autonomous and
Adaptive Systems (invited). He is chairman of the Telecommunications
Task Force of the Intelligent Systems Applications Technical Committee
(ISATC) of the IEEE Computational Intelligence Society (CIS). He is the
senior deputy secretary-general and fellow member of the ISIBM
(International Society of Intelligent Biological Medicine). He is a member
of the IEEE and ACM. Email: vasilako@ath.forthnet.gr.