QANet: Tensor Decomposition Approach for
Query-based Anomaly Detection in
Heterogeneous Information Networks
Vahid Ranjbar, Mostafa Salehi, Pegah Jandaghi, and Mahdi Jalili, Senior Member, IEEE
Abstract— Complex networks have become integral parts of modern information infrastructures. This paper proposes a user-centric method for detecting anomalies in heterogeneous information networks, in which nodes and/or edges may be of different types. In the proposed anomaly detection method, users interact directly with the system and anomalous entities are detected through queries. Our approach is based on tensor decomposition and clustering methods. We also propose a network generation model that constructs synthetic heterogeneous information networks to test the performance of the proposed method. The proposed anomaly detection method is compared with state-of-the-art methods on both synthetic and real-world networks. Experimental results show that the proposed tensor-based method considerably outperforms the existing anomaly detection methods.
Index Terms— Anomaly Detection, Heterogeneous Information Networks, Query-Based Anomaly Detection, Tensor Decomposition
1 INTRODUCTION
Many real systems consist of interactions between various components and entities and can be modeled as networked structures [1]. Examples include social activities of humans, ecological systems, communication and computer networks, and biological systems. Information networks are everywhere and have become a vital component of modern information infrastructure. In recent years, the analysis of information networks has attracted scholars across disciplines including computer science, social sciences, mathematics and physics [2]. Modeling real-world data as information networks can often provide richer information than traditional modeling techniques such as multidimensional modeling [3]. Representing data as information networks makes it possible to model relationships between entities. In some systems the nodes and/or edges are not of the same type and various types coexist in the system; such systems are often modeled by heterogeneous (or multilayer) networks [4], [5], [6], [7].
One of the major challenges in the analysis of information networks is to discover anomalous components. Anomaly detection is a branch of data mining concerned with the discovery of abnormal occurrences in the data. It has many applications in areas such as security, finance, biology, healthcare, and law enforcement [8]. For example, online social media platforms detect spam reviews by finding unusual patterns [9]. Banks often find fraudulent activities by examining unusual transaction patterns [10]. Network intrusion detection methods detect potential network attacks by comparing traffic signatures with incoming traffic and finding unusual patterns in incoming traffic [11]. An abnormality can be a node (entity), an edge (connection) or a subnet (a group of entities) that exists in the network but should not. The attributes and features associated with nodes and edges that are used to detect abnormalities can be of any kind related to the entities or the relationships between them.
So far, anomaly detection methods have mainly focused on homogeneous information networks and unstructured multidimensional data [12], [13]. Anomaly detection in heterogeneous information networks is a challenging task, mainly due to specific characteristics of such networks [14]. Most of the methods originally developed for homogeneous networks do not work for heterogeneous ones. Often, one would like to find anomalies in a certain type of nodes/edges in heterogeneous networks. For example, in bibliographic networks in which nodes can be authors, papers or venues, one might want to find the author(s) that differ most from others (i.e., are abnormal) in the way they publish papers (i.e., the topics of their papers and/or the venues in which they publish). In this example, the difference in behavior is measured against a particular author chosen as the reference. Such anomaly detection is referred to as query-based anomaly detection [14].
This paper introduces a novel anomaly detection method based on tensor decomposition, named QANet. In the proposed method, the meta-paths that each node has with other nodes are considered as properties of that node. Then, features are extracted using tensor decomposition, and clustering techniques are used to detect anomalies. We use synthetic and real datasets to evaluate the performance of QANet. The specific contributions of our manuscript are as follows:
• We provide a user-centric anomaly detection approach that uses tensors to store meta-paths in heterogeneous information networks and uses tensor decomposition techniques to extract nodal features from a tensor.
• We introduce a network generation model and an anomaly injection method to construct synthetic heterogeneous information networks to test the performance of the proposed anomaly detection method.
• We create queries from two real-world datasets, IMDB and DBLP, as well as from the constructed synthetic dataset, and compare the proposed method with state-of-the-art methods.

————————————————
• V. Ranjbar and M. Salehi are with the Faculty of New Sciences and Technologies, University of Tehran, Tehran, Iran and also with the School of Computer Science, Institute for Research in Fundamental Science (IPM), P.O. Box 19395-5746, Tehran, Iran. E-mail: vranjbar@ut.ac.ir; mostafa_salehi@ut.ac.ir.
• P. Jandaghi is with the Department of Computer Science, University of Southern California, California, USA. E-mail: jandaghi@usc.edu.
• M. Jalili is with the School of Engineering, RMIT University, Melbourne, Australia. E-mail: mahdi.jalili@rmit.edu.au.
————————————————

The rest of the paper is organized as follows. Section 2 presents the preliminaries, including a number of definitions and the problem statement. Section 3 discusses related work. Section 4 discusses the tensor decomposition model and describes our approach for ranking the candidate set of a user query. Section 5 reports a comprehensive set of experiments that evaluate the effectiveness of QANet. Section 6 draws the conclusions.
2 PRELIMINARIES
This section provides some preliminaries, including a number of definitions required to formally state the problem of query-based anomaly detection in heterogeneous information networks.
DEFINITION 1. (HETEROGENEOUS INFORMATION NETWORK) [2]. A heterogeneous information network consists of multi-type entities that can have different types of relationships between them, and is defined by a directed graph. Without loss of generality, the information network can be defined as $G = (V, E)$, where $V$ is the set of nodes and $E$ is the set of edges. Function $\phi: V \to \mathcal{A}$ maps each node to its type, where $\mathcal{A} = \{A_1, \dots, A_n\}$ is the set of node types and $n$ is the number of node types. Each node $v \in V$ maps to a particular type $\phi(v) \in \mathcal{A}$ in the entity type set. Similarly, function $\psi: E \to \mathcal{R}$ maps each edge to its type from the set $\mathcal{R} = \{R_1, \dots, R_m\}$, where $m$ is the number of edge types. Each edge $e \in E$ is mapped to a particular type $\psi(e) \in \mathcal{R}$ in the edge type set. Fig. 1(a) shows a heterogeneous information network on bibliographic data. This network includes three types of nodes: Papers (P), Authors (A) and Venues (V). Each paper has links to a set of authors and to a venue, where these links belong to a set of link types.
To better understand the types of entities and the relationships between them in a heterogeneous information network, it is useful to have a meta-level (i.e., schema-level) description of the network.
DEFINITION 2. (NETWORK SCHEMA) [2]. The network schema, denoted as $T_G = (\mathcal{A}, \mathcal{R})$, is a meta template for an information network $G = (V, E)$ with the node type mapping $\phi: V \to \mathcal{A}$ and the edge type mapping $\psi: E \to \mathcal{R}$, which is a directed graph with node types in $\mathcal{A}$ and multi-type relationships from $\mathcal{R}$. Fig. 1(b) shows the network schema for the bibliographic network.
DEFINITION 3. (META-PATH) [2], [39]. A meta-path $\mathcal{P}$ is a path defined on a schema $T_G = (\mathcal{A}, \mathcal{R})$, and is denoted in the form of
$$A_1 \xrightarrow{R_1} A_2 \xrightarrow{R_2} \cdots \xrightarrow{R_l} A_{l+1},$$
with length $l$. For simplicity, we can also use node types to denote the meta-path if there are no multiple relation types between the same pair of node types: $\mathcal{P} = (A_1 A_2 \cdots A_{l+1})$. APV and APA are meta-paths for the heterogeneous information network in Fig. 1.
A meta-path $\mathcal{P} = (A_1 A_2 \cdots A_{l+1})$ can be reversed, where the reversed path is denoted as $\mathcal{P}^{-1} = (A_{l+1} \cdots A_2 A_1)$. If $\mathcal{P}$ is equal to $\mathcal{P}^{-1}$, then $\mathcal{P}$ is a symmetric meta-path. For example, APA and APVPA are symmetric meta-paths.
DEFINITION 4. (META-PATH INSTANTIATION) [39]. If for each node $v_i$ we have $\phi(v_i) = A_i$, and each edge $e_i = (v_i, v_{i+1})$ belongs to relation $R_i$ in meta-path $\mathcal{P} = (A_1 A_2 \cdots A_{l+1})$, the path $p = (v_1 v_2 \cdots v_{l+1})$ is called a meta-path instance of $\mathcal{P}$. It is denoted as $p \in \mathcal{P}$. Mike-paper1-VLDB is a meta-path instance of the APV meta-path in the network of Fig. 1.
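To make Definitions 3 and 4 concrete, the following minimal sketch counts meta-path instances by multiplying adjacency matrices. The tiny author-paper-venue network and every name in it (`author_paper`, `paper_venue`, `matmul`, `transpose`) are hypothetical and are not taken from Fig. 1. The sketch assumes binary links, in which case entry (i, j) of a matrix product equals the number of instances of the composed meta-path between nodes i and j.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*m)]


# Hypothetical toy data: 2 authors, 3 papers, 2 venues (1 = link exists).
author_paper = [[1, 1, 0],   # author 0 wrote papers 0 and 1
                [0, 1, 1]]   # author 1 wrote papers 1 and 2
paper_venue = [[1, 0],       # paper 0 appeared in venue 0
               [1, 0],       # paper 1 appeared in venue 0
               [0, 1]]       # paper 2 appeared in venue 1

# APV: number of author-paper-venue instances for each (author, venue) pair.
apv = matmul(author_paper, paper_venue)

# APVPA (a symmetric meta-path): an APV path followed by its reverse.
apvpa = matmul(apv, transpose(apv))

print(apv)
print(apvpa)
```

The diagonal of `apvpa` counts the APVPA instances that start and end at the same author, a quantity that path-count-based similarity measures typically use for normalization.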
DEFINITION 5. (ANOMALY QUERY). An anomaly query $Q$ is denoted by $Q = (S_c, S_r)$, where $S_c$ is a query on network entities whose output, known as the candidate set, contains the nodes to be examined as outliers. $S_r$ is also a query on network entities, whose output serves as the reference of normal nodes. The types of the reference and candidate sets are the same. The candidate entities can be a separate set or a subset of the reference set.
DEFINITION 6. (ANOMALY SCORE). The degree of structural difference between a node and those in the reference set is the anomaly score ($AS$) of that node relative to the reference set. Structural difference means the difference in how the node forms relationships with others.
3 RELATED WORKS
Various methods have been proposed for anomaly detection, each of which can be used in certain applications. Classification methods [10], [15], [16], [17], [18] require labeled data and usually provide a label for test data, which makes them inappropriate for applications that require a rating of abnormality. Another group of methods is based on clustering [19], [20], [21], for which efficiency is highly dependent on the clustering algorithm used. The computational complexity of these methods is challenging, especially for large-scale feature sets. In some other works, nearest neighbor methods are used to detect abnormalities [22], [23], [24], [25], [26]. These methods are not suitable for datasets in which anomalies are very close to normal points, or for those in which normal data are far apart. Also, the effectiveness of these methods is highly dependent on the specific distance measure used. Another category of anomaly detection methods is based on statistical approaches [27], [28], [29], [30], [31], [32]. These approaches assume that the data is generated from a certain distribution. However, such an assumption might not be valid in many cases, and even when the assumption is correct, it is often difficult to find the right distribution. Some other works have used information theory methods to detect abnormalities, which are more suitable for sequential and temporal data [33], [34], [35]. In [36], Noble and Cook introduced two techniques for graph-based anomaly detection based on Minimum Description Length.
Fig. 1. (a) An instance of heterogeneous information network for bibliographic network, and (b) bibliographic network schema.
Regarding the input data, anomaly detection can be divided into two categories: structured (or graph-based) data and unstructured multidimensional data. In the first category, one tries to model the dependencies within the data using graphs, while in the second category the data is considered in a multidimensional space, regardless of the dependencies within it. Graph-based methods can operate on either homogeneous or heterogeneous graphs. Most previous work in anomaly detection has focused on homogeneous networks or unstructured multidimensional data. In homogeneous networks, all nodes and the relationships between them are of the same type. However, not all real-world networks are homogeneous. Some real systems are composed of heterogeneous types of nodes and/or edges.
There are in general two anomaly detection approaches in heterogeneous networks: approaches based on community distribution and those based on queries. In the community distribution approach, instead of considering the entire heterogeneous network to find possible abnormalities, the distribution of nodes in communities is used. In query-based approaches, users create different queries that determine the type of anomaly and its range. For example, one might choose a particular node of a certain type and identify the nodes that differ most from the chosen node as abnormal nodes in the network. Query-based approaches allow the users to interact directly with the system. The first work in the field of query-based anomaly detection was proposed by Gupta et al. in 2014 [37]. Their method considers the malformation of each edge, detects anomalous groups of nodes based on a user query and returns subnets of the original network. Zhang et al. [38] provided a method for detecting anomalies based on a user query to find abnormal subnets. In their proposed method, the users receive a list of abnormal subnets by defining a query. However, they did not consider the attributes of each entity, and framed the method by considering only the structural features of the network. An efficient method was proposed by Kuck et al. [14], where a formal language for queries was presented and an anomaly measure was proposed based on the network structure and the existing meta-paths between the nodes. They defined an outlierness measure named NetOut. In a heterogeneous network $G$, for a given query $Q$ and for any candidate node $v_i \in S_c$, the outlierness with respect to the reference set $S_r$ is measured by:
$$\Omega(v_i) = \sum_{v_j \in S_r} \frac{\kappa_{\mathcal{P}}(v_i, v_j)}{\kappa_{\mathcal{P}}(v_i, v_i)} \tag{1}$$
where smaller values correspond to a greater likelihood of being an outlier and $\kappa_{\mathcal{P}}(v_i, v_j)$ is the number of path instantiations of $\mathcal{P}$ (a symmetric meta-path) between two nodes $v_i$ and $v_j$. They also compared their work with PathSim and cosine similarity, which were introduced earlier in the literature [39]. The PathSim measure between two nodes $v_i$ and $v_j$ follows a meta-path $\mathcal{P}$ in a heterogeneous information network and is defined as
$$s(v_i, v_j) = \frac{2\,\kappa_{\mathcal{P}}(v_i, v_j)}{\kappa_{\mathcal{P}}(v_i, v_i) + \kappa_{\mathcal{P}}(v_j, v_j)} \tag{2}$$
For comparison, [14] defined:
$$\Omega_{PS}(v_i) = \sum_{v_j \in S_r} s(v_i, v_j) \tag{3}$$
[14] also defined a comparable version of NetOut using cosine similarity, as:
$$\Omega_{CS}(v_i) = \sum_{v_j \in S_r} \frac{\Phi(v_i) \cdot \Phi(v_j)}{\lVert \Phi(v_i) \rVert \, \lVert \Phi(v_j) \rVert} \tag{4}$$
where $\Phi(\cdot)$ is the neighbor vector function.
These methods are state-of-the-art in this field and we
compare the performance of the proposed method with
them.
In the proposed method, we use tensor decomposition for the anomaly detection task. The use of tensors for high-dimensional data has been of great interest in recent years [40], [41]. [42] proposed a network analysis system that uses tensor decomposition to detect malicious patterns over time. Akoglu and Faloutsos [43] proposed a tensor-based algorithm that operates on a time-varying homogeneous network and identifies anomalous points in time at which many nodes change their behavior in a way that deviates from the norm. [44] introduced a tool to automatically detect and visualize novel subgraph patterns within a local community of nodes in a heterogeneous network. Papalexakis et al. [45] proposed a method based on tensor decomposition for spotting anomalies in the check-in behavior of users. Koutra et al. [46] proposed a method for the detection of anomalies, rare events and changes in behavior using tensors. Although there are a number of anomaly detection methods based on tensors, query-based anomaly detection using tensors has not yet been introduced in the literature.
4 PROPOSED QUERY-BASED ANOMALY
DETECTION APPROACH (QANET)
An outlier detection algorithm should return outliers as a subset of the candidate set $S_c$ that are considerably different from the nodes in the reference set $S_r$. Let us first formally define the query-based anomaly detection problem in heterogeneous networks.
DEFINITION 7. (QUERY-BASED ANOMALY DETECTION PROBLEM). Let us consider the heterogeneous information network $G = (V, E)$ with node type mapping function $\phi: V \to \mathcal{A}$ and edge type mapping function $\psi: E \to \mathcal{R}$. Given $S_c$ as a set of candidate nodes and $S_r$ as a set of reference nodes, the problem is to return a list of the candidate nodes sorted by anomaly score relative to the nodes in the reference set. It is worth mentioning that the nodes in the candidate and reference sets must be of the same type.
In this paper, we first use a tensor decomposition technique to reduce the feature set, as there are often many nodes and a large number of features for each node. Then, we use clustering methods to determine the anomaly score of each node by calculating its distance from the cluster centers. Heterogeneous networks have many dimensions, and representing them in the form of matrices, tables, or vectors often leads to information loss. Tensors allow us to store data in more than two dimensions, thus making it possible to store more information from the network than traditional network representations do. Furthermore, tensor decomposition methods can be effectively used to reduce the dimensions of the feature set. One of the main disadvantages of the previous methods in this field is their high computational complexity. Tensor-based methods, on the other hand, are computationally efficient, as there are various infrastructures for implementing them on distributed systems. For example, in [47] several distributed tensor decomposition methods have been implemented on the Hadoop framework.
DEFINITION 8. (TENSOR). An n-way (or n-mode) tensor is essentially a structure that is indexed by n variables. More formally, a tensor is represented by an array $X \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_n}$.
There are a number of methods for decomposing tensors. A simple, interpretable and basic method is the PARAFAC decomposition [48]. A PARAFAC model decomposes a 3-way tensor $X \in \mathbb{R}^{I \times J \times K}$ into trilinear components. The result is given by three loading matrices $A$, $B$, and $C$ with elements $a_{if}$, $b_{jf}$ and $c_{kf}$, where $F$ is the number of components. The model is found by minimizing the sum of squares of the residuals $e_{ijk}$, where $E$ is a three-way array of residuals:
$$X = \sum_{f=1}^{F} a_f \otimes b_f \otimes c_f + E \tag{5}$$
The above relation is graphically shown in Fig. 2 for two components (F = 2). $a_f$, $b_f$ and $c_f$ are columns of the matrices $A \in \mathbb{R}^{I \times F}$, $B \in \mathbb{R}^{J \times F}$, and $C \in \mathbb{R}^{K \times F}$, respectively. The multiplication of $a_f$, $b_f$, and $c_f$ is defined as follows:
$$(a_f \otimes b_f \otimes c_f)_{ijk} = a_{if}\, b_{jf}\, c_{kf} \tag{6}$$
The factors are estimated using the alternating least squares (ALS) method [49], which at each step holds two of the matrices fixed and estimates the third.
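The alternating scheme can be sketched in a few lines of NumPy. This is an illustrative implementation under stated assumptions, not the authors' code: the helper names (`khatri_rao`, `unfold`, `cp_als`), the random initialization, the fixed iteration count and the demo tensor are all hypothetical choices, and each update uses the standard least-squares solution with a pseudo-inverse.

```python
import numpy as np


def khatri_rao(P, Q):
    """Column-wise Kronecker product; row index is p * Q.shape[0] + q."""
    return np.einsum("pf,qf->pqf", P, Q).reshape(-1, P.shape[1])


def unfold(X, mode):
    """Mode-n matricization whose columns line up with khatri_rao of the other factors."""
    order = [mode] + [m for m in (2, 1, 0) if m != mode]
    return np.transpose(X, order).reshape(X.shape[mode], -1)


def cp_als(X, rank, n_iter=50, seed=0):
    """Fit a rank-`rank` PARAFAC model to a 3-way array by alternating least squares."""
    rng = np.random.default_rng(seed)
    A, B, C = (rng.standard_normal((dim, rank)) for dim in X.shape)
    errors = []
    for _ in range(n_iter):
        # Each line fixes two factor matrices and solves for the third.
        A = unfold(X, 0) @ khatri_rao(C, B) @ np.linalg.pinv((C.T @ C) * (B.T @ B))
        B = unfold(X, 1) @ khatri_rao(C, A) @ np.linalg.pinv((C.T @ C) * (A.T @ A))
        C = unfold(X, 2) @ khatri_rao(B, A) @ np.linalg.pinv((B.T @ B) * (A.T @ A))
        residual = X - np.einsum("if,jf,kf->ijk", A, B, C)
        errors.append(float(np.linalg.norm(residual)))
    return A, B, C, errors


# Demo on a synthetic, exactly rank-2 tensor (illustrative data only).
rng = np.random.default_rng(1)
X_demo = np.einsum("if,jf,kf->ijk",
                   rng.standard_normal((4, 2)),
                   rng.standard_normal((5, 2)),
                   rng.standard_normal((3, 2)))
A, B, C, errors = cp_als(X_demo, rank=2)
print(A.shape, B.shape, C.shape, errors[-1])
```

Because every update is an exact least-squares step, the residual norm recorded after each sweep does not increase, which is a convenient sanity check for an implementation. A production setting would normally rely on an established tensor library and a convergence criterion instead of a fixed number of sweeps.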
To represent a heterogeneous network by a tensor, we use the concept of meta-path. Meta-paths can indicate similarity/proximity between nodes in the network, and thus can be an important feature for anomaly detection. We define a 3-way tensor $X$ with dimensions $N \times N \times K$ for $G$, where $N = |V|$ and $K$ is equal to the number of meta-paths with length less than 2. $x_{ijk}$ is equal to the number of instances of the kth meta-path, from the list of all meta-paths with length less than 2 extracted from the network schema, between nodes $v_i$ and $v_j$. $x_{ijk}$ indicates the relationship between nodes $v_i$ and $v_j$ relative to the kth meta-path, which is considered as a feature for node $v_i$.
The QANet method constructs N×K features for each node based on meta-paths. This is often a huge number, and traditional clustering algorithms such as k-means cannot be applied to it directly. Here we use a tensor decomposition method to reduce the dimension of the feature space, thus making it possible to apply traditional clustering algorithms.
We need to compute the abnormality score of the candidate nodes relative to the reference set given by the query. Therefore, a feature reduction model is first created for the reference set using tensor decomposition. To this end, we separate the part of the tensor $X$ that contains the features of the reference set. Let us call it $X_r$, whose dimensions are $N_r \times N \times K$, where $N_r$ is the number of nodes in the reference set. Using the PARAFAC decomposition method, we obtain three matrices A, B, and C for $X_r$. Fig. 3 shows this tensor decomposition. Matrix $A$ contains the main factors of the reference set. These factors can be considered as new features for each of the nodes in the reference set. Matrices $B$ and $C$ are used in the next step to obtain the features of the candidate nodes. To obtain the features of the nodes in the candidate set, it is sufficient to use (7) to calculate matrix $A_c$ from the matrices $B$ and $C$ obtained in the previous step and $X_c$, which is defined analogously to $X_r$ for the nodes of the candidate set, as
$$A_c = X_{c(1)} \, (C \odot B) \, \left(C^{\top} C \ast B^{\top} B\right)^{\dagger} \tag{7}$$
where $\odot$ and $\ast$ are the Khatri-Rao and Hadamard products, respectively, $\dagger$ denotes the pseudo-inverse, and $X_{c(1)}$ is the mode-1 matricization of $X_c$. After computing $A_c$, the properties of the candidate nodes are expressed in the model of the reference nodes. Finally, $A$ and $A_c$ are fed into the clustering algorithm as inputs (Fig. 4).
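The projection of candidate nodes onto the reference model is a single least-squares step. The NumPy sketch below illustrates it under the assumption that it coincides with the standard ALS update for the first factor matrix with the other two held fixed; the function name `project_candidates`, the unfolding convention and the synthetic demo data are hypothetical.

```python
import numpy as np


def project_candidates(X_c, B, C):
    """Candidate features from fixed B and C: X_c(1) (C kr B) pinv(C^T C * B^T B)."""
    n_c, J, K = X_c.shape
    F = B.shape[1]
    # Mode-1 unfolding with column index k * J + j.
    X_c1 = np.transpose(X_c, (0, 2, 1)).reshape(n_c, K * J)
    # Khatri-Rao product of C and B with the matching row index k * J + j.
    kr = np.einsum("kf,jf->kjf", C, B).reshape(K * J, F)
    # Hadamard product of the Gram matrices, inverted with a pseudo-inverse.
    return X_c1 @ kr @ np.linalg.pinv((C.T @ C) * (B.T @ B))


# Demo: build a candidate tensor from known factors and recover its features.
rng = np.random.default_rng(0)
A_true = rng.standard_normal((3, 2))   # 3 hypothetical candidate nodes, F = 2
B = rng.standard_normal((6, 2))        # stands in for the reference-model B
C = rng.standard_normal((4, 2))        # stands in for the reference-model C
X_c = np.einsum("if,jf,kf->ijk", A_true, B, C)
A_c = project_candidates(X_c, B, C)
print(np.allclose(A_c, A_true))
```

When the candidate tensor follows the reference model exactly, as in this demo, the projection returns the generating factors; for real data it returns the least-squares fit, so candidates that the reference model explains poorly end up with atypical feature vectors.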
We use the K-means clustering method for the clustering phase. The K-means method takes a set of n observations and partitions them into k (≤ n) sets so as to minimize the within-cluster sum of squares. In the clustering phase, the reference nodes are first clustered into k clusters according to the features in matrix A. Then, based on the obtained cluster centers ($\mu_1, \dots, \mu_k$), the anomaly score of each candidate node is obtained as the Euclidean distance from the center of the nearest cluster. This allows us to sort the candidate nodes by their anomaly score and determine their final ranking:
$$AS(v_i) = \min_{1 \le j \le k} d\left(\mathbf{a}^{c}_{i}, \mu_j\right) \tag{8}$$
where $\mathbf{a}^{c}_{i}$ is the feature vector of candidate node $v_i$ (the ith row of $A_c$). For two points $p$ and $q$ in Euclidean F-space, $d(p, q)$ is the distance between them, given by:
$$d(p, q) = \sqrt{\sum_{f=1}^{F} (p_f - q_f)^2} \tag{9}$$
Fig. 2. A graphical representation of a PARAFAC model of tensor X [47].
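The clustering and scoring phase described above can be sketched with the standard library alone. The code is a simplified illustration, not the authors' implementation: `kmeans` is a plain Lloyd iteration that, as an assumption made for determinism, takes the first k reference points as initial centers, and the two-dimensional feature vectors are made up for the demo.

```python
import math


def kmeans(points, k, n_iter=100):
    """Plain Lloyd iterations; the first k points serve as initial centers."""
    centers = [list(p) for p in points[:k]]
    for _ in range(n_iter):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centers[c]))
            clusters[nearest].append(p)
        # Recompute each center as the mean of its members (keep it if empty).
        new_centers = [
            [sum(col) / len(members) for col in zip(*members)] if members else centers[c]
            for c, members in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return centers


def anomaly_scores(candidates, centers):
    """Euclidean distance from each candidate to its nearest cluster center."""
    return [min(math.dist(x, c) for c in centers) for x in candidates]


# Hypothetical reduced features of reference and candidate nodes (F = 2).
reference_features = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]
candidate_features = [[1.0, 1.0], [4.0, 5.0]]

centers = kmeans(reference_features, k=1)
scores = anomaly_scores(candidate_features, centers)
ranking = sorted(range(len(scores)), key=lambda i: -scores[i])
print(centers, scores, ranking)
```

With a single cluster the center is the mean of the reference features, so the first candidate, which sits exactly on that mean, receives a score of zero while the second one is ranked as the most anomalous.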
Note that QANet is based on calculating similarities using meta-paths. Homogeneous networks are a special type of heterogeneous networks that have only one type of nodes and edges. Meta-paths cannot be defined in these networks, and one can only define simple paths of different lengths. In the case of homogeneous networks, QANet considers the similarity of nodes (e.g., common neighborhood) and identifies the anomalies based on that.
Let us consider a simple example to better understand the mechanism of QANet. TABLE 1 shows papers published by several authors, where the columns of the table represent the number of articles published by each author at various conferences. The task is to take one author as the reference author and rank the others against the reference author on the basis of their anomaly score. As can be seen in TABLE 1, the reference author has authored 22 papers: 10 papers published in VLDB, 10 papers in KDD, and one paper each in STOC and SIGGRAPH.
TABLE 2 shows the results obtained from QANet and three other methods, CosSim, PathSim and NetOut, which are state-of-the-art methods in query-based anomaly detection. To compare the performance of QANet with the others, we obtained their distance measures as one minus the similarity measure. As shown in TABLE 2, all methods rate Sarah as identical to the reference author. Compared with Lucy, Rob is more abnormal because Rob has published most of his papers at the conferences where the reference author has the least activity. As the results show, in the QANet method the top anomaly is Joe, as Joe differs from the reference author both in the number of papers and in the conferences attended. PathSim and CosSim also classify Joe as an anomalous author, while the NetOut measure returns Joe, like Sarah, as a normal author. Mikel is similar to Joe, but Mikel has a paper at KDD, which is one of the major conferences for the reference author, and QANet responds well to this difference. Between Mikel and Emma, Emma deviates less than Mikel according to QANet, as Emma has more papers than Mikel in a conference where the reference author also publishes. This is correctly captured only by QANet and not by the others.
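The NetOut, PathSim and CosSim columns of TABLE 2 can be reproduced from TABLE 1 with a few lines of Python. The sketch rests on two assumptions that are not stated in the text: the number of path instances between two authors under the author-paper-venue-paper-author meta-path is taken to be the dot product of their venue count vectors, and the NetOut distance is clipped at zero, which is what matches the value reported for Mikel. The QANet column depends on the tensor model and is not reproduced here. All function names are hypothetical.

```python
import math

# Publication counts from TABLE 1, ordered as (SIGGRAPH, STOC, KDD, VLDB).
REFERENCE = [1, 1, 10, 10]
AUTHORS = {
    "Sarah": [1, 1, 10, 10],
    "Rob": [20, 20, 1, 0],
    "Lucy": [10, 10, 5, 0],
    "Joe": [1, 0, 0, 0],
    "Mikel": [0, 0, 1, 0],
    "Emma": [30, 0, 0, 0],
}


def dot(u, v):
    """Path count between two authors (assumed dot product of venue counts)."""
    return sum(a * b for a, b in zip(u, v))


def netout_distance(x, ref):
    # One minus the normalized connectivity; clipping at 0 is an assumption.
    return max(0.0, 1.0 - dot(x, ref) / dot(x, x))


def pathsim_distance(x, ref):
    return 1.0 - 2.0 * dot(x, ref) / (dot(x, x) + dot(ref, ref))


def cossim_distance(x, ref):
    return 1.0 - dot(x, ref) / math.sqrt(dot(x, x) * dot(ref, ref))


scores = {
    name: (netout_distance(x, REFERENCE),
           pathsim_distance(x, REFERENCE),
           cossim_distance(x, REFERENCE))
    for name, x in AUTHORS.items()
}
for name, (n, p, c) in scores.items():
    print(f"{name:6s} NetOut={n:.4f} PathSim={p:.4f} CosSim={c:.4f}")
```

With a single reference author each sum over the reference set reduces to one term, which is why the scores are plain ratios. The example also shows why NetOut treats Joe as normal: his only paper is at a venue where the reference author has also published, so his normalized connectivity is 1.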
5 RESULTS
In this section we first describe the synthetic and real datasets used in this work, and then introduce the evaluation metric. We compare the performance of QANet and state-of-the-art methods in detecting anomalies. We also discuss the time complexity of QANet.
5.1. Data Sets
We used both synthetic and real datasets to evaluate the performance of the proposed method.
Synthetic data
We use a method similar to the one presented in [32] to generate synthetic heterogeneous networks (Algorithm 1). We first consider $N$ nodes in $V$, then assign a color to each node using a function $c: V \to \{1, \dots, C\}$, where $C$ is the number of colors, which is equal to the number of communities in the network. Also, for each node $v \in V$, a type $\phi(v) \in \mathcal{A}$ is considered. To create the graph edges, if two nodes $v_i$ and $v_j$ have the same color, an edge is placed between them with an intra-community probability; otherwise, the edge is placed with a lower inter-community probability. The network is thus created as a heterogeneous network with different types of nodes, in which nodes of the same color are more likely to be connected than those of different colors. This makes it possible to create groups (nodes with the same color) in the network whose members are structurally similar to one another. These groups represent the communities in the graph. In a bibliographic network, for example, such groups can represent research areas in which most authors publish within that area and in conferences/journals of the same domain.
Anomaly injection
We add some abnormal nodes to the generated network. These abnormal nodes need to be structurally different from the rest of the nodes. Thus, we randomly select a certain portion of the nodes of each color, and create links between them and other nodes with an anomaly probability that differs from both the intra-community and the inter-community probabilities.
TABLE 1
A TOY EXAMPLE OF BIBLIOGRAPHY NETWORK.
NAME             | VLDB | KDD | STOC | SIGGRAPH
-----------------+------+-----+------+---------
Reference Author |   10 |  10 |    1 |        1
Sarah            |   10 |  10 |    1 |        1
Rob              |    0 |   1 |   20 |       20
Lucy             |    0 |   5 |   10 |       10
Joe              |    0 |   0 |    0 |        1
Mikel            |    0 |   1 |    0 |        0
Emma             |    0 |   0 |    0 |       30
Taken from an example in [14] with the last row added to it.
Fig. 4. Clustering the nodes in reference set with K-means clustering
method and sorting the candidate nodes based on distance from the
center of the nearest cluster.
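TABLE 2
RESULT OF ANOMALY DETECTION METHODS FOR TOY EXAMPLE PROVIDED IN TABLE 1.
      |    NetOut     |    PathSim    |    CosSim     |     QANet
NAME  | Score  | Rank | Score  | Rank | Score  | Rank | Score | Rank
------+--------+------+--------+------+--------+------+-------+-----
Sarah | 0      | 6    | 0      | 6    | 0      | 6    | 0     | 6
Rob   | 0.9376 | 2    | 0.9    | 4    | 0.8757 | 3    | 0.81  | 4
Lucy  | 0.6889 | 3    | 0.6721 | 5    | 0.6717 | 4    | 0.73  | 5
Joe   | 0      | 4    | 0.9901 | 1    | 0.9296 | 2    | 1.02  | 1
Mikel | 0      | 5    | 0.9014 | 3    | 0.2964 | 5    | 0.95  | 2
Emma  | 0.9667 | 1    | 0.9455 | 2    | 0.9296 | 1    | 0.88  | 3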
Query generation
To generate queries, two sets of reference and candidate nodes should be selected, which need to be of the same type. Suppose that i and j are two colors from the set of colors $\{1, \dots, C\}$, where $i \neq j$. The following query types are considered in this work.
1. We consider a number of random nodes in color i
as the reference set and a number of random
nodes as the candidate set, half of which are in
color i and the other half in other colors.
2. A random number of nodes is considered as the
reference set, half of which are in color i and the
other half in color j. We also consider a number of
random nodes as the candidate set, half of which
are in color i or j and the other half in other colors.
3. A number of random nodes in color i is consid-
ered as the reference set, and a number of random
nodes as the candidate set, half of which are
anomalous nodes ( ) and the rest in color i.
4. We consider a random number of nodes as the
reference set, half of which are in color i and the
other half in color j, and a number of random
nodes as candidate set, half of which are anoma-
lous nodes ( ) and the rest in color i or j.
5. We consider a number of random nodes in color i
as the reference set and some random nodes as
candidate set, half of which are in color i and the
rest are anomalous ( ) nodes or have color
other than i.
6. A random number of nodes is considered as the
reference set, half of which are in color i and the
other half in color j. We consider some random
nodes as the candidate set, half of which are in
color i or j and the rest are anomalous ( )
nodes or have color other than i or j.
To evaluate the performance of the methods, we labeled
nodes that have the same color as the reference nodes, as
normal nodes and the rest of nodes as abnormal.
Real data
DBLP: We employ a bibliographic dataset from ArnetMiner3 [50] to construct a heterogeneous information network. The dataset consists of 2,092,356 publications and 1,712,433 authors in the field of computer science. The heterogeneous network contains 3 types of vertices: paper, venue, and author. The possible edge types are paper-author (written-by), paper-venue (published-in) and paper-paper (cited-by).
IMDb: We use the movie details dataset from the Internet Movie Database (IMDb)¹. This dataset consists of 4,566,466 movies and 8,183,156 individuals in the role of actor, director or writer. The heterogeneous information network for this dataset contains 4 types of nodes: movie, actor, director and writer. The edge types are actor-movie (Acting), director-movie (Directing) and writer-movie (Writing).
5.2. Evaluation metric
In this paper, we use the lift index [51] to evaluate the QANet method and compare it with the NetOut, PathSim, and CosSim methods. The lift index measures the accuracy of a ranking method with respect to the ground truth labels. The procedure for calculating the lift index is as follows. A predictive model is first built on the training data and is then applied to the test data to give each test case a score showing the likelihood of the test case belonging to the positive class. The test cases are then ranked according to the scores in descending order. After that, the ranked list is divided into n equal segments, with the cases that are most likely to be positive in the top segment and those that are least likely to be positive in the bottom segment [51]. To this end, the nodes in the candidate set are ranked according to the anomaly score in descending order, and the lift index LI is calculated as
$$LI = \frac{\sum_{i=1}^{n} \left(1 - \frac{i-1}{n}\right) I(v_i)}{\sum_{i=1}^{n} I(v_i)} \tag{10}$$
where $n = |S_c|$ is the size of the candidate set, node $v_i$ is the ith node of the candidate set ranked according to the anomaly score, and $I(v_i)$ is equal to 1 if the node is an anomalous node, and 0 otherwise. The higher the value of LI for a method, the better its performance. In the experiments the size of the candidate set is 10, half of which are anomalous, so that according to equation (10), LI takes a value between 80% (the best case) and 30% (the worst case).
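A short sketch makes the stated bounds easy to check. It assumes the linearly decreasing weights 1.0, 0.9, ..., 0.1 of the lift index for a ranked list of ten candidates, generalized to (n - i + 1)/n for position i in a list of length n; the function name `lift_index` and the label lists are hypothetical.

```python
def lift_index(ranked_labels):
    """Lift index of a ranking; labels are 1 for anomalous nodes, 0 otherwise.

    The list must be ordered by anomaly score, most anomalous first.
    Position i (starting at 1) receives the weight (n - i + 1) / n.
    """
    n = len(ranked_labels)
    positives = sum(ranked_labels)
    if positives == 0:
        raise ValueError("the ranking contains no anomalous node")
    weighted = sum((n - i) / n * label for i, label in enumerate(ranked_labels))
    return weighted / positives


# Ten candidates, five of them anomalous.
best = lift_index([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])    # anomalies ranked first
worst = lift_index([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])   # anomalies ranked last
mixed = lift_index([1, 0, 1, 0, 1, 0, 1, 0, 1, 0])
print(best, worst, mixed)
```

With five anomalous nodes among ten candidates, the best possible ranking gives (1.0 + 0.9 + 0.8 + 0.7 + 0.6)/5 = 0.8 and the worst gives (0.5 + 0.4 + 0.3 + 0.2 + 0.1)/5 = 0.3, which matches the 80% and 30% bounds quoted in the text.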
5.3. Experimental results
In this section, we compare QANet with the methods presented by Kuck et al. [14] on the 6 types of queries described above. To ensure the reliability of the results, each experiment was performed 50 times, and the average results are reported. We assess the performance of the methods by varying different network parameters, including the network size, the number of node types (degree of heterogeneity) and the number of communities in the network. We also examine two parameters related to the query: the size of the reference set and the query type. In all these settings, the two main parameters of the QANet method, namely the rank of the tensor decomposition and the number of clusters in the clustering method, are studied. When comparing the methods for different network parameters, we use query type 5, which has both anomalous nodes ( ) and nodes with colors other than that of the reference set.
1
https://www.imdb.com/interfaces/
Input: number of nodes (N), number of communities (C), set of
node types (A), and the edge probabilities for same-color and
different-color node pairs.
Output: heterogeneous information network G
1: Consider G = (V, E), where V = E = {}
2: Insert the N nodes into V
3: Using a random function, assign a type from A to each node
4: Using a random function, assign a color (one of the C communities) to each node
5: For each pair of nodes
6:   If the two nodes have the same color
7:     Insert the edge between them into E with the same-color probability
8:   Else
9:     Insert the edge between them into E with the different-color probability
10: Define the edge mapping function, which assigns an edge type to each edge in E
11: return G
Algorithm 1. Pseudo-code for generation of synthetic heterogeneous
networks.
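The generation procedure of Algorithm 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `generate_hin`, `p_same` and `p_diff` are hypothetical, types and colors are drawn uniformly at random, the edge type is taken to be the pair of endpoint node types, and any restriction on which node types may be connected is not modeled.

```python
import random
from itertools import combinations


def generate_hin(num_nodes, num_communities, node_types, p_same, p_diff, seed=None):
    """Generate a toy heterogeneous network (assumed reading of Algorithm 1).

    Returns (node_type, color, edges), where edges maps an undirected
    node pair (u, v) with u < v to an assumed edge type.
    """
    rng = random.Random(seed)
    # Steps 3 and 4: assign a random type and a random color to each node.
    node_type = {v: rng.choice(node_types) for v in range(num_nodes)}
    color = {v: rng.randrange(num_communities) for v in range(num_nodes)}
    edges = {}
    # Steps 5 to 9: connect each pair with a probability that depends on color.
    for u, v in combinations(range(num_nodes), 2):
        p = p_same if color[u] == color[v] else p_diff
        if rng.random() < p:
            # Step 10 (assumption): the edge type is the pair of endpoint types.
            edges[(u, v)] = tuple(sorted((node_type[u], node_type[v])))
    return node_type, color, edges


# Hypothetical parameters, chosen only for illustration.
node_type, color, edges = generate_hin(200, 4, ["author", "paper"], 0.1, 0.01, seed=0)
print(len(edges))
```

The pairwise loop is quadratic in the number of nodes, which is acceptable for a sketch but would need a sparser sampling scheme for large networks.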
1) Network size
Fig. 5 shows the lift index of QANet as a function of the
network size. For this simulation, we consider decomposition
ranks 4 and 8, and 1, 4, and 20 clusters. As can be seen, the
decomposition rank does not have much influence on the accuracy
of the proposed method; however, as the number of clusters
increases, its accuracy declines, i.e., the lift index decreases.
Because the nodes in the reference set all have the same color
(query type 5), setting the number of clusters to 1 places all
nodes of the reference set in one cluster, which improves the
accuracy of the proposed method in finding the abnormal
nodes.
Fig. 6 compares the accuracy of NetOut, PathSim, CosSim
and QANet for varying network sizes. As can be seen,
the proposed method (QANet) has considerably better
accuracy than the others, and its accuracy increases with the
network size. This is because QANet is based on network
structure: as the network becomes larger, more features can be
extracted for the nodes, resulting in improved detection of
abnormalities. The accuracy of the other methods either
decreases or remains unchanged as the network size increases.
These results indicate that QANet is more suitable for
large-scale networks than the other state-of-the-art methods.
2) Node types
Another parameter that affects anomaly detection is the number
of node types, which is an indicator of the heterogeneity level
of the network. Increasing the number of node types while the
number of nodes is kept unchanged has almost the same effect
as decreasing the number of nodes. Thus, one would expect
accuracy to decline as the number of node types increases. In
addition, under the generation model, increasing the number of
node types decreases the number of edges in the network, which
makes it more difficult to distinguish the nodes from one
another. However, as the results show, QANet is almost
insensitive to the number of node types, while the other methods
change more noticeably (Fig. 7).
3) Number of communities
Fig. 8 shows the impact of the number of communities in
the network on the accuracy of the methods. In many
heterogeneous networks the nodes can be divided into different
groups. For example, in bibliographic networks
there are different scientific fields, and each node (author,
paper or conference) belongs to one or more of them. Our
results show that the accuracy of QANet decreases as the number
of communities increases. Indeed, increasing the number of
communities in the network reduces the distinction between the
nodes, which reduces the accuracy of QANet. The other methods,
however, are not considerably affected by the number of
communities.
4) Size of reference set
Fig. 9 shows the accuracy of the algorithms with respect
to the size of the reference set. As the size of the reference
set increases, the accuracy of QANet improves systematically,
whereas the other methods do not follow any specific
pattern as a function of the size of the reference set.
Fig. 5. Accuracy of QANet as a function of the network size in
synthetic networks. In this experiment, we consider decomposition
ranks 4 and 8, and 1, 4, and 20 clusters. The number of node types
is 2, the number of communities is 4, and the query type is 5.
Fig. 6. Accuracy of NetOut, PathSim, CosSim and QANet as a
function of the network size in synthetic networks. In this experiment,
the number of node types is 2, the number of communities is 4, and
the query type is 5.
Fig. 7. Accuracy of the methods as a function of the number of node
types in synthetic networks. In this experiment, the number of nodes is
4000, the size of the reference set is 20, and the query type is 5. We set
the number of clusters to 1 and the decomposition rank to 4 for QANet.
The improved performance of QANet as this parameter grows
is due to the richer feature set, which allows the method
to reconstruct a more precise model. In the NetOut, PathSim,
and CosSim methods, by contrast, only a simple averaging over
the nodes in the reference set is used, so increasing the number
of nodes in the reference set might worsen the performance,
which is clearly seen in the results.
5) Query type
Fig. 10 illustrates the accuracy of the methods for the six
types of queries. As shown in Fig. 10, QANet is the
top performer in all query types. While PathSim and
CosSim perform similarly to QANet in query
types 3 and 4, their accuracy is not comparable with that of
QANet in the other query types. NetOut has higher accuracy
than PathSim and CosSim in query types 1 and 2,
but lower accuracy in the other types. This is because PathSim
and CosSim detect anomalous nodes correctly but cannot detect
nodes from other communities, whereas NetOut cannot detect
anomalous nodes. Since there are no anomalous nodes in queries
of types 1 and 2, NetOut performs better there, while PathSim
and CosSim perform better in the other query types.
As shown in Fig. 10, the proposed method detects both types
of abnormality well.
5.4. Case studies
In this section, we examine the effectiveness of QANet
and NetOut on two real datasets.
Case 1: Bibliographic Dataset
In the DBLP dataset, we take all authors who have collaborated
with Christos Faloutsos as the candidate set and Christos
Faloutsos as the reference set. In this query, the reference set
has only one member and the candidate set contains 426
authors. According to the definition of anomalies in
Section 2, the abnormalities indicate structural differences
in the communication (i.e., different meta-paths) between
the nodes in the candidate set and the node(s) of the
reference set. The 10 authors with the highest degree of
abnormality according to the QANet and NetOut methods are
listed in Tables 3 and 4, respectively.
As seen in Table 3, each of the top-4 abnormal authors
provided by the QANet algorithm has only one paper
co-authored with the reference author. Each of these authors
has published only this one paper, at a conference in which
the reference author has also published only one paper.
Furthermore, this conference is not one of the major
conferences of the reference author. These authors are also
ordered by citation count, which indicates the importance
of their paper and of their collaboration with the reference
author. The next four authors in Table 3 have also published
only one paper; however, they have a common co-author with
the reference author, which strengthens their relationship with
the reference author. These four authors are also ordered by
citation count, which indicates the importance of their
collaboration with the reference author and also reflects
the overall view of QANet. Lei Li is ranked 9th.
The publications of this author are quite similar to those of
the top-ranked author, but the venue in which this author has
published his paper is more important to the reference author, as
Fig. 8. Accuracy of the methods as a function of the number of
communities in synthetic networks. In this experiment, the number of
nodes is 4000, there are two types of nodes, the size of the reference
set is 20, and the query type is 5. We set the number of clusters to 1
and the decomposition rank to 4 for QANet.
Fig. 9. Accuracy of the methods as a function of the size of the
reference set in the query for synthetic networks. In this experiment,
the number of nodes is 6000, the number of node types is 2, the number
of communities is 4, and the query type is 5. For QANet, the number of
clusters is 1 and the decomposition rank is 4.
Fig. 10. Accuracy of the methods as a function of the query types
described in Section 5.1 for synthetic networks. In this experiment, the
number of nodes is 6000, the number of node types is 2, the number of
communities is 4 and the size of the reference set is 20. We set the
number of clusters to 1 and the decomposition rank to 4 for QANet.
he has published another paper at this venue. This indicates
that, in contrast to Philip Russell (Flip) Korn, Lei Li is closer
to Christos Faloutsos, as there are more paths between them.
Finally, the last author in Table 3 has also co-authored
one paper with the reference author; however, this author
has two collaborators who have co-authored with the
reference author, making him closer to the reference author
than those at the top of the ranking table. These results
indicate that the proposed ranking algorithm leads to
reasonable outcomes.
To better understand the performance of the competing
algorithms, Table 4 shows the top-10 abnormal authors with
respect to the reference author according to NetOut. The query
underlying Tables 3 and 4 is exactly the same. NetOut requires
the users to specify the meta-path in addition to the reference
and candidate sets. To make a fair comparison with QANet, we
chose all meta-paths of length 2 used in QANet. As seen from
the ranking obtained by NetOut, this method works only on the
basis of counting meta-paths. It ranks the authors based on
their publication pattern, i.e., the number of articles and
citations, as compared with the reference author. The method
also takes into account joint publications with the reference
author. Table 5 shows the top-10 dissimilar authors to the
reference author according to the two algorithms. Clearly,
the two ranking algorithms produce considerably different
outcomes. However, as seen in Tables 3 and 4, QANet produces
a more reasonable ranking than NetOut, as the authors
ranked by QANet are more dissimilar to the reference
author than those ranked by NetOut.
Case 2: Internet Movie Database
In this case study, according to the definition of anomalies
in Section 2, we want to identify abnormal actors among the
actors of the movie "The Godfather (1972)". According
to the Internet Movie Database (IMDb) site, 34 actors are
listed as the main cast of this movie, and we use them as both
the candidate set and the reference set. Table 6 lists the
actors and their details, as well as their rankings by the
QANet and NetOut methods. We set the number of
clusters to 1 and the decomposition rank to 3 for QANet.
To confirm the results of the anomaly detection methods,
we show the matrices of the AMA, AMAMA,
AMDMA and AMWMA meta-paths for the actors as
heatmaps in Fig. 11. With regard to the various meta-paths,
James Caan, Marlon Brando, Al Pacino, Robert Duvall and
Diane Keaton are more abnormal than the rest of the actors.
These actors are exactly the five top-ranked actors in the
QANet ranking, while the NetOut method ranks them 3, 13,
15, 9 and 7, respectively.
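Meta-path count matrices such as AMA, AMAMA and AMDMA can be obtained by multiplying the relevant adjacency matrices, which is the usual way of counting meta-path instances. The sketch below uses small hypothetical actor-movie and director-movie matrices, not IMDb data, and plain Python lists so that it has no dependencies; the helper names are illustrative.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


# Hypothetical toy data. Rows: actors, columns: movies.
actor_movie = [
    [1, 1, 0],  # actor 0 plays in movies 0 and 1
    [1, 0, 0],  # actor 1 plays in movie 0
    [0, 1, 1],  # actor 2 plays in movies 1 and 2
]
# Rows: directors, columns: movies.
director_movie = [
    [1, 0, 1],  # director 0 directs movies 0 and 2
    [0, 1, 0],  # director 1 directs movie 1
]

# AMA: number of movies shared by each pair of actors.
ama = matmul(actor_movie, transpose(actor_movie))
# AMAMA: two AMA hops chained together.
amama = matmul(ama, ama)
# AMDMA: actor -> movie -> director -> movie -> actor.
movie_director_movie = matmul(transpose(director_movie), director_movie)
amdma = matmul(matmul(actor_movie, movie_director_movie), transpose(actor_movie))

print(ama)    # [[2, 1, 1], [1, 1, 0], [1, 0, 2]]
print(amama)  # [[6, 3, 4], [3, 2, 1], [4, 1, 5]]
print(amdma)  # [[2, 1, 2], [1, 1, 1], [2, 1, 2]]
```

Each matrix is symmetric because the meta-paths are symmetric; rows that differ strongly from the others correspond to actors whose connectivity pattern is unusual for that meta-path.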
To better illustrate our method, Fig. 12 shows the
results of the tensor decomposition with rank 3 and of the
clustering in a three-dimensional space. As shown in
Fig. 12, the abnormal actors are much farther away from the
cluster centroid than the rest of the actors.
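The geometric picture in Fig. 12 suggests a simple scoring rule, sketched here under explicit assumptions: each node is described by a low-dimensional feature vector (for example, a row of a factor matrix from the decomposition), the number of clusters is 1 so the centroid is the mean of the reference vectors, and the anomaly score is the Euclidean distance to that centroid. The function names and the feature values are hypothetical.

```python
import math


def centroid(vectors):
    """Component-wise mean of a collection of equal-length vectors."""
    vectors = list(vectors)
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def rank_candidates(reference, candidates):
    """Rank candidates by distance to the reference centroid, farthest first."""
    center = centroid(reference.values())
    scores = {name: math.dist(vec, center) for name, vec in candidates.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical rank-3 features; the same set serves as reference and candidates,
# as in the movie case study.
features = {
    "actor_a": (0.0, 0.0, 0.0),
    "actor_b": (0.1, 0.0, 0.0),
    "actor_c": (0.0, 0.1, 0.0),
    "actor_d": (5.0, 5.0, 5.0),
}
ranking = rank_candidates(features, features)
print(ranking[0][0])  # actor_d is farthest from the centroid
```

With more than one cluster, a k-means step would replace the single mean and the score would be the distance to the nearest centroid; that variant is not shown here.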
5.5. Time complexity
This section provides some analysis on the computational
TABLE 3
TOP-10 RANKED AUTHORS BY QANET
Ranking  Author’s Name               Papers  APA  APAPA  Citations  APVPA
1        Philip Russell (Flip) Korn  1       1    0      0          1
2        George Panagopoulos         1       1    0      3          1
3        Ibrakim Kamel               1       1    0      11         1
4        Yi Rong                     1       1    0      21         1
5        Caetano Traina Júnior       1       1    1      0          1
6        Robson L. F. Cordeiro       1       1    1      0          1
7        Yi Zhou                     1       1    1      6          1
8        Bin Zhang                   1       1    1      6          1
9        Lei Li                      1       1    0      0          2
10       Wenyao Ho                   1       1    2      5          1
Papers is the number of papers and Citations is the number of citations of
the author's paper(s). The APA, APAPA and APVPA columns give the number of
meta-paths of that type with Faloutsos. APA indicates the author-paper-author
meta-path, and APAPA and APVPA stand for the
author-paper-author-paper-author and author-paper-venue-paper-author
meta-paths, respectively.
TABLE 4
TOP-10 RANKED AUTHORS BY NETOUT
Ranking  Author’s Name         Papers  APA  APAPA  Citations  APVPA
1        Sebastian B. Thrun    139     1    75     799        1
2        Venkatesan Guruswami  124     1    518    1962       5
3        Arthur Toga           84      1    470    478        1
4        K. P. Sycara          252     1    640    3757       11
5        Asim Smailagic        73      1    514    667        3
6        N. Sadeh              88      1    558    799        11
7        Dan Siewiorek         192     1    662    1962       65
8        David A. Bader        94      1    536    478        2
9        Douglas W. Oard       112     1    497    724        8
10       M Hebert              131     1    534    1943       7
Papers is the number of papers and Citations is the number of citations of
the author's paper(s). The APA, APAPA and APVPA columns give the number of
meta-paths of that type with Faloutsos.
TABLE 5
TOP-10 RANKED AUTHORS BASED ON QANET AND NETOUT
WITH CHRISTOS FALOUTSOS AS THE REFERENCE SET
TOP-10 RANKED AUTHORS BASED ON QANET
QANet Ranking  Author’s Name               NetOut Ranking
1              Philip Russell (Flip) Korn  421
2              George Panagopoulos         420
3              Ibrakim Kamel               417
4              Yi Rong                     418
5              Caetano Traina Júnior       408
6              Robson L. F. Cordeiro       409
7              Yi Zhou                     406
8              Bin Zhang                   407
9              Lei Li                      422
10             Wenyao Ho                   395
TOP-10 RANKED AUTHORS BASED ON NETOUT
QANet Ranking  Author’s Name         NetOut Ranking
196            Sebastian B. Thrun    1
197            Venkatesan Guruswami  2
151            Arthur Toga           3
252            K. P. Sycara          4
170            Asim Smailagic        5
209            N. Sadeh              6
245            Dan Siewiorek         7
216            David A. Bader        8
194            Douglas W. Oard       9
224            M Hebert              10
complexity of QANet. QANet consists of three
steps. The first step is to evaluate the query and prepare its
reference and candidate sets. The second step
applies the tensor decomposition method
to obtain the candidate and reference matrices.
The final step applies the clustering method to the
features obtained in the previous steps for the
candidate and reference sets. In the first step, the query
language provided by Kuck et al. [14] is used. The network
tensor and the meta-paths of length 2 can be pre-calculated
offline. A sparse tensor is used in the implementation owing
to its rather low time complexity. If the number of nodes
in the reference and candidate sets are and , respec-
tively, the complexity of obtaining two tensors and
is equal to . In the second step for decompos-
ing tensors and obtaining the properties of each node,
according to alternating least squares (ALS) algorithm,
the worst case time complexity is ,
where is the decomposition rank (usually less than 10)
and is the number of iterations of the tensor decomposi-
tion method (50 in this manuscript), and and are the
number of nodes in the network and the number of meta-
paths with length less than 2, respectively. is at most
, where is the node types in the network. In the final
step, according to the k-means clustering method and
calculation of the distance between each candidate node
and the cluster centers, time complexity is equal to
, where is the number of clustering itera-
tion and is the number of clusters. Finally, the time
complexity of QANet is .
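To make the decomposition step more concrete, the following is a minimal dense sketch of a rank-R CP (PARAFAC) decomposition of a three-way array by alternating least squares, written with NumPy. It is not the authors' implementation: the node-by-node-by-meta-path layout of the tensor, the function names, the random initialization and the toy dimensions are all assumptions, and a practical version would work on a sparse tensor as the text describes.

```python
import numpy as np


def cp_als(tensor, rank, n_iter=50, seed=0):
    """Rank-`rank` CP factors of a 3-way array via plain ALS (dense sketch)."""
    rng = np.random.default_rng(seed)
    factors = [rng.standard_normal((dim, rank)) for dim in tensor.shape]
    subscripts = ["ijk,jr,kr->ir", "ijk,ir,kr->jr", "ijk,ir,jr->kr"]
    for _ in range(n_iter):
        for mode in range(3):
            others = [factors[m] for m in range(3) if m != mode]
            # Contract the tensor with the two factor matrices held fixed.
            mttkrp = np.einsum(subscripts[mode], tensor, *others)
            # Hadamard product of the Gram matrices of the fixed factors.
            gram = (others[0].T @ others[0]) * (others[1].T @ others[1])
            # Least-squares update of the factor for this mode.
            factors[mode] = mttkrp @ np.linalg.pinv(gram)
    return factors


def reconstruct(factors):
    a, b, c = factors
    return np.einsum("ir,jr,kr->ijk", a, b, c)


def relative_error(tensor, factors):
    return np.linalg.norm(tensor - reconstruct(factors)) / np.linalg.norm(tensor)


# Hypothetical toy tensor built from known rank-2 factors.
rng = np.random.default_rng(1)
true_factors = [rng.standard_normal((dim, 2)) for dim in (6, 5, 4)]
tensor = reconstruct(true_factors)

factors = cp_als(tensor, rank=2)
print(relative_error(tensor, factors))
```

Each sweep updates one factor matrix per mode, so the cost of this dense version grows with the number of iterations, the rank and the number of tensor entries; a sparse implementation would instead scale with the number of nonzero entries.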
The NetOut, PathSim and CosSim methods have time complexity
exponential in the meta-path length. Materializing
a neighbor vector requires traversing the heterogeneous
network, which can be time-consuming when the
specified meta-path is long or when the degree of the
node of interest is high. The time complexity of these
methods is when meta-paths of length 2 are
considered. It is more complicated for meta-paths of
higher lengths.
Fig. 11. Adjacency matrices of the Actor-Movie-Actor,
Actor-Movie-Actor-Movie-Actor, Actor-Movie-Director-Movie-Actor and
Actor-Movie-Writer-Movie-Actor meta-paths for the actors of "The Godfather (1972)".
Fig. 12. Illustration of the QANet method. We set the number of clusters
to 1 and the decomposition rank to 3 for QANet. R1, R2 and R3
are columns of respectively. Cluster centroid indicates center of
cluster after clustering phase.
TABLE 6
THE CAST OF "THE GODFATHER (1972)" AND THEIR DETAILS, AS
WELL AS THE RANKING OF QANET AND NETOUT METHODS.
Row  Actor’s Name           Actor  Director  Writer  QANet Ranking  NetOut Ranking
1    Richard Conte          105    0         1       13             4
2    Corrado Gaipa          34     0         0       23             12
3    Morgana King           46     0         0       17             27
4    Lenny Montana          6      1         0       16             31
5    James Caan             330    0         1       2              3
6    John Cazale            14     0         0       28             29
7    Sterling Hayden        84     1         0       32             11
8    Talia Shire            48     0         1       30             19
9    Abe Vigoda             150    0         0       10             1
10   Marlon Brando          174    0         1       4              13
11   Rudy Bond              38     0         0       22             17
12   Richard Bright         30     0         0       19             21
13   Richard S. Castellano  51     0         0       27             22
14   Franco Citti           55     1         1       34             10
15   Salvatore Corsitto     3      0         0       21             34
16   Al Pacino              130    2         4       5              15
17   Tony Giorgio           6      0         0       14             32
18   Julie Gregg            41     0         0       7              14
19   Angelo Infanti         72     0         0       15             5
20   Robert Duvall          268    4         5       1              9
21   Al Lettieri            14     1         0       31             24
22   Jeannie Linero         23     0         0       12             18
23   Tere Livrano           4      0         0       20             33
24   John Marley            108    0         0       29             8
25   Al Martino             62     0         0       33             23
26   John Martino           13     0         0       24             28
27   Diane Keaton           166    1         14      3              7
28   Victor Rendina         7      0         0       9              20
29   Alex Rocco             143    0         0       8              2
30   Gianni Russo           18     2         0       6              26
31   Vito Scotti            75     0         0       25             6
32   Ardell Sheridan        13     2         0       18             30
33   Simonetta Stefanelli   13     0         0       26             25
34   Saro Urzì              24     0         0       11             16
The Actor column gives the number of movies in which the person acted, the
Director column the number of movies they directed, and the Writer column
the number of movies they wrote.
6 CONCLUSION
In this paper, we proposed a tensor decomposition
method called QANet for detecting query-based
anomalies in heterogeneous information networks.
QANet considers all the different aspects of the
communication structure and uses the PARAFAC tensor
decomposition method to create a model, which is
then used to rank the candidate nodes based on their
abnormality (dissimilarity) with respect to those in the
reference set. Owing to the lack of labeled data, we
introduced a model to create synthetic heterogeneous
information networks and tested the effectiveness of the
proposed anomaly detection algorithm for various query
types. We also compared the performance of QANet with
state-of-the-art algorithms including NetOut, PathSim and
CosSim. The experiments showed that QANet performs
better, ranking abnormal nodes more accurately than the
other algorithms. We also compared QANet and
NetOut on two real datasets: a bibliographic network
and the Internet Movie Database (IMDb). The results
revealed that the ranking provided by QANet is more
reasonable than the one provided by NetOut. This is
mainly because QANet considers all meta-paths of the
candidate and reference nodes with other network nodes,
whereas NetOut considers only the symmetric
meta-paths between the candidate and reference nodes
and ignores the relationships of these nodes
with other network nodes. QANet thus takes more
global information into account when building the
similarity metrics for anomaly detection. Future
directions for this work include evaluating our approach on
temporal networks and using it for event detection
in time-evolving networks.
ACKNOWLEDGMENT
This work was supported in part by a grant from IPM
(No. CS1396-4-49) and in part by the Iran National
Science Foundation (INSF) (Grant No. 96001338).
Mahdi Jalili is supported by the Australian Research Council
through project No. DP170102303. Mostafa Salehi is the
corresponding author of this paper.
REFERENCES
[1] A.-L. Barabási, Network science: Cambridge University Press,
2016.
[2] C. Shi, Y. Li, J. Zhang, Y. Sun, and P. S. Yu, "A survey of
heterogeneous information network analysis," IEEE Transactions
on Knowledge and Data Engineering, vol. 29, 2017, pp. 17-37.
[3] Y. Sun and J. Han, "Mining heterogeneous information net-
works: a structural analysis approach," ACM SIGKDD Explora-
tions Newsletter, vol. 14, 2013, pp. 20-28.
[4] M. Jalili, Y. Orouskhani, M. Asgari, N. Alipourfard, and M.
Perc, "Link prediction in multilayer online social networks,"
Royal Society Open Science, vol. 4, 2017, p. 160863.
[5] M. Kivelä, A. Arenas, M. Barthelemy, J. P. Gleeson, Y. Moreno,
and M. A. Porter, "Multilayer networks," Journal of Complex
Networks, vol. 2, 2014, pp. 203-271.
[6] S. Molaei, S. Babei, M. Salehi, and M. Jalili, "Information Spread
and Topic Diffusion in Heterogeneous Information Networks,"
Scientific Reports, vol. 8, 2018, p. 9549.
[7] M. Salehi, R. Sharma, M. Marzolla, M. Magnani, P. Siyari, and
D. Montesi, "Spreading processes in Multilayer Networks,"
IEEE Transactions on Network Science and Engineering (TNSE),
vol. 2, 2015, pp. 65-83.
[8] V. Chandola, A. Banerjee, and V. Kumar, "Anomaly detection
for discrete sequences: A survey," IEEE Transactions on
Knowledge and Data Engineering, vol. 24, 2012, pp. 823-839.
[9] S. Shehnepoor, M. Salehi, R. Farahbakhsh, and N. Crespi.
"NetSpam: A network-based spam detection framework for re-
views in online social media." IEEE Transactions on Information
Forensics and Security, vol. 12, 2017, pp. 1585-1595.
[10] O. Salem, A. Guerassimov, A. Mehaoua, A. Marcus, et al.,
"Anomaly Detection in Medical Wireless Sensor Networks us-
ing SVM and Linear Regression Models," International Journal of
E-Health and Medical Communications, vol. 5, 2016, pp. 20-45.
[11] I. Brahmi, S. B. Yahia, and P. Poncelet, "MAD-IDS: novel intru-
sion detection system using mobile agents and data mining ap-
proaches," in Pacific-Asia Workshop on Intelligence and Security In-
formatics, 2010, pp. 73-76.
[12] V. Chandola, A. Banerjee, and V. Kumar, "Anomaly detection:
A survey," ACM computing surveys (CSUR), vol. 41, 2009, p. 15.
[13] L. Akoglu, H. Tong, and D. Koutra, "Graph based anomaly
detection and description: a survey," Data Mining and Knowledge
Discovery, vol. 29, 2015, pp. 626-688.
[14] J. Kuck, H. Zhuang, X. Yan, H. Cam, and J. Han, "Query-based
outlier detection in heterogeneous information networks," in
Advances in database technology: proceedings. International Confer-
ence on Extending Database Technology, 2015, p. 325.
[15] E. Eskin, "Anomaly detection over noisy data using learned
probability distributions," in Proceedings of the International Con-
ference on Machine Learning, 2000.
[16] A. M. Kosek, "Contextual anomaly detection for cyber-physical
security in Smart Grids based on an artificial neural network
model," in 2016 Joint Workshop on Cyber-Physical Security and Re-
silience in Smart Grids, 2016.
[17] B. Shah and B. H. Trivedi, "Reducing features of KDD CUP
1999 dataset for anomaly detection using back propagation
neural network," in Fifth International Conference on Advanced
Computing & Communication Technologies, 2015, pp. 247-251.
[18] W. Hu, Y. Liao, and V. R. Vemuri, "Robust anomaly detection
using support vector machines," in Proceedings of the internation-
al conference on machine learning, 2003, pp. 282-289.
[19] J. Liu, S. Chen, Z. Zhou, and T. Wu, "An Anomaly Detection
Algorithm of Cloud Platform Based on Self-Organizing Maps,"
Mathematical Problems in Engineering, vol. 1, 2016, pp. 1-9.
[20] E. De la Hoz, E. De La Hoz, A. Ortiz, J. Ortega, et al., "PCA fil-
tering and probabilistic SOM for network intrusion detection,"
Neurocomputing, vol. 164, 2015, pp. 71-81.
[21] D. J. Miller and J. Browning, "A mixture model and EM-based
algorithm for class discovery, robust classification, and outlier
rejection in mixed labeled/unlabeled data sets," IEEE Transac-
tions on Pattern Analysis and Machine Intelligence, vol. 25, 2003,
pp. 1468-1483.
[22] W. Jin, A. K. Tung, and J. Han, "Mining top-n local outliers in
large databases," in Proceedings of the seventh ACM SIGKDD in-
ternational conference on Knowledge discovery and data mining,
2001, pp. 293-298.
[23] A. Ghoting, S. Parthasarathy, and M. E. Otey, "Fast mining of
distance-based outliers in high-dimensional datasets," Data
Mining and Knowledge Discovery, vol. 16, 2008, pp. 349-364.
[24] V. Hautamäki, I. Kärkkäinen, and P. Fränti, "Outlier Detection
Using k-Nearest Neighbour Graph," ICPR (3), 2004, pp. 430-433.
[25] P. Sun, S. Chawla, and B. Arunasalam, "Mining for Outliers in
Sequential Databases," in SDM, 2006, pp. 94-105.
[26] M. Radovanović, A. Nanopoulos, and M. Ivanović, "Reverse
nearest neighbors in unsupervised distance-based outlier
detection," IEEE Transactions on Knowledge and Data Engineering, vol. 27,
2015, pp. 1369-1382.
[27] J. Laurikkala, M. Juhola, E. Kentala, N. Lavrac, et al., "Informal
identification of outliers in medical data," in Fifth International
Workshop on Intelligent Data Analysis in Medicine and Pharmacolo-
gy, 2000, pp. 20-24.
[28] B. Rosner, "Percentage points for a generalized ESD many-
outlier procedure," Technometrics, vol. 25, 1983, pp. 165-172.
[29] R. D. Gibbons, D. K. Bhaumik, and S. Aryal, Statistical methods
for groundwater monitoring vol. 59: John Wiley & Sons, 2009.
[30] B. Abraham and G. E. Box, "Bayesian analysis of some outlier
problems in time series," Biometrika, vol. 66, 1979, pp. 229-236.
[31] P. J. Rousseeuw and A. M. Leroy, Robust regression and outlier
detection vol. 589: John Wiley & Sons, 2005.
[32] X. Song, M. Wu, C. Jermaine, and S. Ranka, "Conditional anom-
aly detection," IEEE Transactions on Knowledge and Data Engi-
neering, vol. 19, 2007, pp. 631-645.
[33] Y. Du, R. Zhang, and Y. Guo, "A Useful Anomaly Intrusion
Detection Method Using Variable-length Patterns and Average
Hamming Distance," in JCP, vol. 8, 2010, pp. 1219-26.
[34] M. Li and P. Vitányi, An introduction to Kolmogorov complexity
and its applications: Springer Science & Business Media, 2009.
[35] S. Wu and S. Wang, "Information-theoretic outlier detection for
large-scale categorical data," IEEE transactions on knowledge and
data engineering, vol. 25, 2013, pp. 589-602.
[36] C. C. Noble and D. J. Cook, "Graph-based anomaly
detection," in Proceedings of the ninth ACM SIGKDD international
conference on Knowledge discovery and data mining, 2003.
[37] M. Gupta, J. Gao, X. Yan, H. Cam, et al., "Top-k interesting sub-
graph discovery in information networks," in IEEE 30th Interna-
tional Conference on Data Engineering, 2014, pp. 820-831.
[38] H. Zhuang, J. Zhang, G. Brova, J. Tang, et al., "Mining query-
based subnetwork outliers in heterogeneous information net-
works," in 2014 IEEE International Conference on Data Mining,
2014, pp. 1127-1132.
[39] Y. Sun, J. Han, X. Yan, P. S. Yu, et al., "Pathsim: Meta path-based
top-k similarity search in heterogeneous information net-
works," Proceedings of the VLDB Endowment, vol. 4, 2011, pp.
992-1003.
[40] S. Ranshous, S. Shen, D. Koutra, S. Harenberg, et al., "Anomaly
detection in dynamic networks: a survey," Wiley Interdiscipli-
nary Reviews: Computational Statistics, vol. 7, 2015, pp. 223-247.
[41] H. Fanaee-T, and J. Gama. "Tensor-based anomaly detection:
An interdisciplinary survey." Knowledge-Based Systems, vol. 98,
2016, pp. 130-147.
[42] H.-H. Mao, C.-J. Wu, E. E. Papalexakis, C. Faloutsos, et al.,
"Malspot: multi2 malicious network behavior patterns analy-
sis," Advances in Knowledge Discovery and Data Mining, Springer,
2014, pp. 114.
[43] L. Akoglu, and C. Faloutsos. "Event detection in time series of
mobile communication graphs." Army science conference. 2010.
[44] K. Maruhashi, F. Guo, and C. Faloutsos. "Multiaspectforensics:
Pattern mining on large-scale heterogeneous networks with
tensor analysis." Advances in Social Networks Analysis and Mining
(ASONAM), IEEE, 2011.
[45] E. E. Papalexakis, K. Pelechrinis, and C. Faloutsos. "Spotting
misbehaviors in location-based social networks using ten-
sors." Proceedings of the 23rd International Conference on World
Wide Web. ACM, 2014.
[46] D. Koutra, E. E. Papalexakis, and C. Faloutsos, "Tensorsplat:
Spotting latent anomalies in time," in Informatics (PCI), 16th
Panhellenic Conference, IEEE, 2012.
[47] I. Jeon, E. E. Papalexakis, U. Kang, and C. Faloutsos, "Haten2:
Billion-scale tensor decompositions," in 2015 IEEE 31st Interna-
tional Conference on Data Engineering, 2015, pp. 1047-1058.
[48] R. Bro, "PARAFAC. Tutorial and applications," Chemometrics
and intelligent laboratory systems, vol. 38, 1997, pp. 149-171.
[49] P. M. Kroonenberg and J. De Leeuw, "Principal component
analysis of three-mode data by means of alternating least
squares algorithms," Psychometrika, vol. 45, 1980, pp. 69-97.
[50] J. Tang, J. Zhang, L. Yao, J. Li, et al., "Arnetminer: extraction and
mining of academic social networks," in Proceedings of the 14th
ACM SIGKDD, 2008, pp. 990-998.
[51] T. Y. Lin, Y. Y. Yao, and L. A. Zadeh, Data Mining, Rough Sets
and Granular Computing: Physica-Verlag HD, 2013.
Vahid Ranjbar received the B.S. and M.S. degrees
in information technology in 2011 and 2013, respectively.
He is currently working toward the Ph.D. degree at the
University of Tehran, Iran. His research interests include
network science and data mining.
Mostafa Salehi completed his Ph.D. studies in computer
engineering at Sharif University of Technology, Iran, in
2012. In 2013, he joined the University of Tehran as an
assistant professor. His research interests include
network science and multimedia networks.
Pegah Jandaghi received the B.S. degree in computer
engineering and in mathematics and applications in
2017. She is currently working toward the M.S. degree at
the University of Southern California, USA. Her research
interests include information networks and data mining.
Mahdi Jalili (Member 2009, SM 2016) received the
Ph.D. degree from EPFL, Switzerland, in 2008. He is now a
senior lecturer with the School of Engineering, RMIT
University, Melbourne, Australia. His research interests
are in network science, dynamical systems, and data mining.
... Auxiliary data Alignment data [244], auxiliary descriptions [215], [216], [276] Reweighting and resampling Reweighting [60], [74], [119], [124], [146], [178], [239], [241], [243], [288], resampling [36], [68], [82], [130], [133], [135], [142]- [144], [144], [147], [164], [171], [240], [242] SMOTE SMOTE [25], Mixup [64]- [66] GAN [37], [101], [109], [163], [179] Other methods Predictive data generation [73], [134], label generation [76], [166], [169] Additional constraints Condition relax constraints [138], [212]- [214], imbalance constraints [61], [111], class separation constraints [35], [71], [77]- [100], [102]- [106], [108], [110], [112]- [118], [120]- [123], [125]- [129], [131], [132], [136], [137], [139]- [143], [145]- [162], [165], [167], [168], [170], [172]- [178], [180]- [184], [289] Improving the lowresource part ...
Preprint
Full-text available
Graphs represent interconnected structures prevalent in a myriad of real-world scenarios. Effective graph analytics, such as graph learning methods, enables users to gain profound insights from graph data, underpinning various tasks including node classification and link prediction. However, these methods often suffer from data imbalance, a common issue in graph data where certain segments possess abundant data while others are scarce, thereby leading to biased learning outcomes. This necessitates the emerging field of imbalanced learning on graphs, which aims to correct these data distribution skews for more accurate and representative learning outcomes. In this survey, we embark on a comprehensive review of the literature on imbalanced learning on graphs. We begin by providing a definitive understanding of the concept and related terminologies, establishing a strong foundational understanding for readers. Following this, we propose two comprehensive taxonomies: (1) the problem taxonomy, which describes the forms of imbalance we consider, the associated tasks, and potential solutions; (2) the technique taxonomy, which details key strategies for addressing these imbalances, and aids readers in their method selection process. Finally, we suggest prospective future directions for both problems and techniques within the sphere of imbalanced learning on graphs, fostering further innovation in this critical area.
... Most real-world systems are modeled as single-layer networks. But, there are some systems that can be best modeled in multiple layers [9,10]. For instance, online social networks used by the same people are represented as a multiplex network. ...
Article
In a multiplex network, different types of relationships exist between the same set of nodes, such as people who have accounts on several online social networks. Previous research has shown that the structural features of different layers in a multiplex network are interrelated. Therefore, effective use of information from other layers can improve link prediction accuracy in a specific layer. In this paper, we propose a new inter-layer similarity metric, DSMN, for predicting missing links in multiplex networks. We then combine this metric with a strong intra-layer similarity metric to enhance link prediction performance. The proposed method has been evaluated on both real-world and synthetic networks, and the experimental results indicate that it outperforms similar methods in terms of prediction accuracy.
Article
To exploit the advantages of massive MIMO systems, accurate channel estimation is vital. The huge number of antennas at the base station (BS) of a massive MIMO system produces a large set of channel paths that must be estimated, which makes channel estimation in such systems more difficult. In this paper, we propose to leverage the temporal joint sparsity of massive MIMO channels to obtain a more accurate channel estimate. To this end, we model the problem so as to exploit the spatial correlation among different antennas of the BS as well as the inter-user similarity of the channel supports. In addition, under the assumption of a slowly time-varying channel, the supports of the channel matrices of different snapshots are equal, which enables us to impose temporal joint sparsity on the channel submatrices. The simulation results validate the efficiency of the suggested scheme and its superiority over its rivals.
... Profile features, also known as metadata, include language, age, gender, geographic location and account creation time. Behavioral features are based on the period or time of posting content (Velayutham and Tiwari 2017), and network-based features (Hurtado et al. 2019; Shehnepoor et al. 2017) include account popularity, clustering coefficient, community properties (Abu-El-Rub and Mueen 2019; Ranjbar et al. 2018) and homophily properties (Dorri and Dadfarnia 2018). ...
Article
Nowadays, a massive number of people are involved in various social media. This enables organizations and institutions to reach their audiences across the globe more easily. Some of them use social bots as automated entities to gain intangible access to and influence over their users through faster content propagation. As a result, malicious social bots are increasingly used to fool humans with their unrealistic behavior and content. Hence, it is necessary to distinguish these fake social accounts from real ones. Multiple approaches to this problem have been investigated in the literature. Statistical machine learning methods are one of them; they focus on handcrafted features that represent the characteristics of social bots. Although they have achieved successful results in some cases, they rely on the bot's behavior and fail when bots change their behavioral patterns. On the other hand, more advanced deep neural network-based methods aim to overcome this limitation. The generative adversarial network (GAN), a newer technology from this domain, is a semi-supervised method that has been shown to extract the behavioral patterns of the data. In this work, we use a GAN to leak more information about bot samples to a state-of-the-art textual bot detection method (Contextual LSTM). Although a GAN can augment scarce labeled data, the original textual GAN, Sequence Generative Adversarial Net (SeqGAN), has a known convergence limitation. In this paper, we investigate this limitation and customize the GAN idea in a new framework called GANBOT, in which the generator and classifier are connected by an LSTM layer that acts as a shared channel between them. Our experimental results on a benchmark dataset of Twitter social bots show that the proposed framework outperforms the existing contextual LSTM method by increasing bot detection probabilities.
Article
Special conditions of wireless sensor networks, such as energy limitation, make it essential to accelerate the convergence of algorithms in this field, especially in distributed compressive sensing (DCS) scenarios, which have a complex reconstruction phase. This paper presents a DCS reconstruction algorithm that provides a higher convergence rate. The proposed algorithm is a distributed primal-dual algorithm in a bidirectional incremental cooperation mode where the parameters change with time. The parameters are changed systematically in the convex optimization problems in which the constraint and cooperation functions are strongly convex. The proposed method is supported by simulations, which show the higher performance of the proposed algorithm in terms of convergence rate, even under stricter conditions such as a small number of measurements or a lower degree of sparsity.
Article
For decades, traditional outlier detectors have ignored the group-level factor when calculating outlier scores, evaluating only the object-level factor and thereby failing to capture collective outliers. To mitigate this issue, we present a framework called neighborhood representative (NR), which empowers existing outlier detectors to efficiently detect outliers, including collective outliers, while maintaining their computational integrity. It achieves this by selecting representative objects, scoring these objects, and then applying the score of each representative object to its collective objects. NR is compatible with existing detectors without altering them, and it improves performance on eleven real-world datasets by +8% on average (0.72 to 0.78 AUC) relative to twelve state-of-the-art outlier detectors. The implementation of NR can be found via www.OutlierNet.com for reproducibility. Index Terms: Outlier detection, Preprocessing, Neighborhood representative, K nearest neighbors.
Article
Topic-based communities have gradually become an important medium for netizens to disseminate and acquire knowledge. These communities consist of entities (actual objects, e.g., a real answer or an actual question) of different types (users, questions and answers) and are usually hidden and overlapping. Nowadays, prevalent community question answering (CQA) platforms have formed mature communities through manually marked topics and extensive accumulated user behavior. However, the ever-growing variety of entities and the complex overlapping topic communities make it inefficient to label entity tags manually (e.g., question labels that supplement domain features, or potential user tags that indicate a user's specialty). Therefore, there is an urgent need for a mechanism that automatically finds hidden semantic communities from user social behavior and lays a foundation for community construction and intelligent recommendation on QA platforms. In this paper, we propose a Heterogeneous Community Detection Approach Based on Graph Neural Network, called HCDBG, to detect heterogeneous communities in CQA. First, we define entity relationships based on user interaction behavior and employ a heterogeneous information network to represent all connections uniformly. Afterward, we use the heterogeneous graph neural network to fuse content and topological features of nodes for graph embedding. Finally, we convert the community detection problem in CQA into an entity clustering task in the heterogeneous information network and improve the k-means method to achieve heterogeneous community detection. To the best of our knowledge of the existing literature, using the heterogeneous graph neural network to facilitate QA community detection is an innovative research direction. Extensive experiments on authentic question-answering datasets illustrate that HCDBG outperforms baseline methods in heterogeneous community detection.
Article
Tensor decomposition is widely used to exploit the internal correlation in multi-way data analysis and processing for communications and radar systems. As one of the main tensor decomposition methods, CANDECOMP/PARAFAC decomposition has the advantages of uniqueness and interpretability, which are significant in practical applications. However, the traditional decomposition method is sensitive to both the predefined rank and noise, which results in inaccurate tensor decomposition. In this paper, we propose an improved algorithm, called the Element-wise Average Alternating Direction Method of Multipliers, that minimizes the sum of the trace norms of all factors and the noise variance. Our algorithm overcomes the dependence on a predefined rank in traditional decomposition algorithms and alleviates the impact of noise. Moreover, the algorithm can be conveniently adapted to solve the tensor completion problem. The simulation results show that the proposed algorithm can decompose a noisy tensor into factors with above 90% similarity at various SNRs, and can also interpolate an incomplete tensor with a higher similarity coefficient and lower relative reconstruction error when the missing rate is less than 0.5.
Article
Graph neural networks (GNNs) have shown prominent performance in representation learning on graphs, but they have not been fully explored for heterogeneous graphs, which contain more complex structures and rich semantics. The rich semantic information of a heterogeneous graph can usually be revealed by meta-paths. Therefore, most existing GNN models designed for heterogeneous graphs use a meta-path based neighborhood sampler to divide a heterogeneous graph into multiple homogeneous subgraphs according to various meta-paths, so that a homogeneous GNN can be applied to heterogeneous graphs. Nevertheless, this way of embedding the semantic information of meta-paths into multiple homogeneous graphs is implicit and ineffective, and it cannot accurately capture the semantics of heterogeneous graphs. In this paper, we propose a novel semi-supervised GNN model named Explicit Message-Passing Heterogeneous Graph Neural Network (EMP), which performs explicit message-passing along the meta-paths. In addition, we propose a split method for meta-paths and consider the mutual effect between various meta-paths in advance in the proposed model, so that the semantic information of the whole set of meta-paths can be captured accurately. Extensive experiments conducted on three real-world datasets demonstrate the superiority of the proposed model.
Article
Diffusion of information in complex networks largely depends on the network structure. Recent studies have mainly addressed information diffusion in homogeneous networks, where there is only a single type of node and edge. However, some real-world networks consist of heterogeneous types of nodes and edges. In this manuscript, we model information diffusion in heterogeneous information networks and use interactions of different meta-paths to predict the diffusion process. A meta-path is a path between nodes across different layers of a heterogeneous network. The most important feature of the proposed method is that it can determine the influence of all meta-paths on the diffusion process. A conditional probability, which assumes interdependent relations between the nodes, is used to calculate the activation probability of each node. As diffusion models, we consider the linear threshold and independent cascade models. Applying the proposed method to two real heterogeneous networks reveals its effectiveness and superior performance over state-of-the-art methods.
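For orientation, the independent cascade model named in the abstract above can be sketched as follows. This is only the plain, homogeneous textbook version under simplifying assumptions: a single activation probability `prob` is applied to every edge, and the meta-path interactions and conditional probabilities of the cited method are not modeled. The function name, the toy graph and the parameter names are illustrative, not taken from the paper.

```python
import random
from collections import deque


def independent_cascade(graph, seeds, prob, rng=None):
    """Run one simulation of the basic independent cascade model.

    graph: dict mapping each node to a list of its out-neighbours
    seeds: iterable of initially active nodes
    prob:  activation probability used for every edge (simplifying assumption)
    """
    rng = rng or random.Random()
    active = set(seeds)
    frontier = deque(active)
    while frontier:
        node = frontier.popleft()
        # A newly activated node gets one chance to activate each inactive neighbour.
        for neighbour in graph.get(node, []):
            if neighbour not in active and rng.random() < prob:
                active.add(neighbour)
                frontier.append(neighbour)
    return active


# Hypothetical toy graph used only for illustration.
toy_graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
spread = independent_cascade(toy_graph, ["a"], prob=0.5, rng=random.Random(0))
print(sorted(spread))
```

With `prob=1.0` every node reachable from the seeds becomes active, and with `prob=0.0` only the seeds stay active, which gives two easy sanity checks for the sketch.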
Article
Nowadays, a large share of people rely on content available in social media when making decisions (e.g. reviews and feedback on a topic or product). The fact that anybody can leave a review provides a golden opportunity for spammers to write spam reviews about products and services for different interests. Identifying these spammers and the spam content is a hot research topic. Although a considerable number of studies have recently been carried out toward this end, the methodologies put forth so far still barely detect spam reviews, and none of them shows the importance of each extracted feature type. In this study, we propose a novel framework, named NetSpam, which uses spam features to model review datasets as heterogeneous information networks and maps the spam detection procedure into a classification problem in such networks. Using the importance of spam features helps us to obtain better results in terms of different metrics on real-world review datasets from the Yelp and Amazon websites. The results show that NetSpam outperforms the existing methods and that, among four categories of features (review-behavioral, user-behavioral, review-linguistic and user-linguistic), the first category performs better than the others.
Article
Online social networks play a major role in modern societies, and they have shaped the way social relationships evolve. Link prediction in social networks has many potential applications such as recommending new items to users, friendship suggestion and discovering spurious connections. Many real social networks evolve the connections in multiple layers (e.g. multiple social networking platforms). In this article, we study the link prediction problem in multiplex networks. As an example, we consider a multiplex network of Twitter (as a microblogging service) and Foursquare (as a location-based social network). We consider social networks of the same users in these two platforms and develop a meta-path-based algorithm for predicting the links. The connectivity information of the two layers is used to predict the links in Foursquare network. Three classical classifiers (naive Bayes, support vector machines (SVM) and K-nearest neighbour) are used for the classification task. Although the networks are not highly correlated in the layers, our experiments show that including the cross-layer information significantly improves the prediction performance. The SVM classifier results in the best performance with an average accuracy of 89%.
Article
Virtual machines (VMs) on a Cloud platform can be influenced by a variety of factors that lead to decreased performance and downtime, affecting the reliability of the Cloud platform. Traditional anomaly detection algorithms and strategies for Cloud platforms have some flaws in their detection accuracy, detection speed, and adaptability. In this paper, a dynamic and adaptive anomaly detection algorithm based on Self-Organizing Maps (SOM) for virtual machines is proposed. A unified modeling method based on SOM to detect the machine performance within the detection region is presented, which avoids the cost of modeling a single virtual machine and enhances the detection speed and reliability of large-scale virtual machines in a Cloud platform. The important parameters that affect the modeling speed are optimized in the SOM process to significantly improve the accuracy of the SOM modeling and therefore the anomaly detection accuracy of the virtual machine.
Article
Outlier or anomaly detection in large data sets is a fundamental task in data science, with broad applications. However, in real data sets with high-dimensional space, most outliers are hidden in certain dimensional combinations and are relative to a user's search space and interest. It is often more effective to give power to users and allow them to specify outlier queries flexibly, and the system will then process such mining queries efficiently. In this study, we introduce the concept of query-based outlier in heterogeneous information networks, design a query language to facilitate users to specify such queries flexibly, define a good outlier measure in heterogeneous networks, and study how to process outlier queries efficiently in large data sets. Our experiments on real data sets show that following such a methodology, interesting outliers can be defined and uncovered flexibly and effectively in large heterogeneous networks.
Chapter
This paper details the architecture and describes the preliminary experimentation with the proposed framework for anomaly detection in medical wireless body area networks for ubiquitous patient and healthcare monitoring. The architecture integrates novel data mining and machine learning algorithms with modern sensor fusion techniques. Knowing wireless sensor networks are prone to failures resulting from their limitations (i.e. limited energy resources and computational power), using this framework, the authors can distinguish between irregular variations in the physiological parameters of the monitored patient and faulty sensor data, to ensure reliable operations and real time global monitoring from smart devices. Sensor nodes are used to measure characteristics of the patient and the sensed data is stored on the local processing unit. Authorized users may access this patient data remotely as long as they maintain connectivity with their application enabled smart device. Anomalous or faulty measurement data resulting from damaged sensor nodes or caused by malicious external parties may lead to misdiagnosis or even death for patients. The authors' application uses a Support Vector Machine to classify abnormal instances in the incoming sensor data. If found, the authors apply a periodically rebuilt, regressive prediction model to the abnormal instance and determine if the patient is entering a critical state or if a sensor is reporting faulty readings. Using real patient data in our experiments, the results validate the robustness of our proposed framework. The authors further discuss the experimental analysis with the proposed approach which shows that it is quickly able to identify sensor anomalies and compared with several other algorithms, it maintains a higher true positive and lower false negative rate.
Article
Two models, the aberrant innovation model and the aberrant observation model, are considered to characterize outliers in time series. The approach adopted here allows for a small probability α that any given observation is 'bad' and in this set-up the inference about the parameters of an autoregressive model is considered.
Conference Paper
This paper presents a contextual anomaly detection method and its use in the discovery of malicious voltage control actions in the low voltage distribution grid. The model-based anomaly detection uses an artificial neural network model to identify a distributed energy resource's behaviour under control. An intrusion detection system observes the distributed energy resource's behaviour, control actions and the power system impact, and is tested together with an ongoing voltage control attack in a co-simulation set-up. The simulation results obtained with data from a real photovoltaic rooftop power plant show that contextual anomaly detection performs on average 55% better in control detection and over 56% better in malicious control detection than point anomaly detection.
Conference Paper
The KDD CUP 1999 dataset is used extensively to detect and classify anomalies in computer networks. This dataset was generated by domain experts at MIT Lincoln Lab. Various feature reduction techniques have already been used to reduce the number of features of the KDD CUP dataset; these techniques reduce the features from 41 to a range of 10 to 22. Using such a reduced dataset in a machine learning algorithm leads to lower complexity, less processing time and high accuracy. One of the available feature reduction techniques is Information Gain (IG), which has already been applied to the random forests classifier by Tesfahun et al. Tesfahun's approach reduces the time and complexity of the model and considerably improves the detection rate for the minority classes. This work investigates the effectiveness and feasibility of Tesfahun et al.'s feature reduction technique on a Back Propagation Neural Network (BPNN) classifier. We performed various experiments on the KDD CUP 1999 dataset and recorded Accuracy, Precision, Recall and F-score values. We carried out Basic, N-Fold Validation and Testing comparisons of the reduced dataset with the full-feature dataset. The Basic comparison clearly shows that the reduced dataset outperforms the full dataset on the size, time and complexity parameters. The N-Fold Validation experiments show that a classifier that uses the reduced dataset has better generalization capacity. In the Testing comparison, we found that both datasets are comparable. All three comparisons clearly show that the reduced dataset is better than or comparable to the full dataset and has no drawback relative to it. Our experiments show that using such a reduced dataset in a BPNN can lead to a better model in terms of dataset size, complexity, processing time and generalization ability.
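As a rough illustration of the Information Gain criterion mentioned in the abstract above, the sketch below computes IG for a discrete feature as the label entropy minus the weighted entropy of the labels within each feature value. It is a generic textbook formulation, not Tesfahun et al.'s pipeline; the function names and the toy values (`protocol`, `noise`, the labels) are hypothetical and do not come from the KDD CUP 1999 data.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(feature_values, labels):
    """IG = H(labels) - sum over feature values v of P(v) * H(labels | v)."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(group) / total * entropy(group) for group in groups.values())
    return entropy(labels) - remainder


# Hypothetical toy data: one feature that separates the classes, one that does not.
labels = ["attack", "attack", "normal", "normal"]
protocol = ["udp", "udp", "tcp", "tcp"]   # perfectly aligned with the label
noise = ["x", "y", "x", "y"]              # carries no information about the label

scores = {"protocol": information_gain(protocol, labels),
          "noise": information_gain(noise, labels)}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked, scores)
```

In a feature reduction setting of the kind described, one would rank all features by such a score and keep only the top-scoring ones; how many to keep is a separate design choice that this sketch does not address.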