
QANet: Tensor Decomposition Approach for Query-Based Anomaly Detection in Heterogeneous Information Networks

October 2019
IEEE Transactions on Knowledge and Data Engineering 31(11):2178-2189


DOI:10.1109/TKDE.2018.2873391

Authors:

Vahid Ranjbar

Yazd University

Mostafa Salehi

University of Tehran

Pegah Jandaghi

University of Southern California

Mahdi Jalili

RMIT University



QANet: Tensor Decomposition Approach for Query-based Anomaly Detection in Heterogeneous Information Networks

Vahid Ranjbar, Mostafa Salehi, Pegah Jandaghi, and Mahdi Jalili, Senior Member, IEEE

Abstract— Complex networks have become integral parts of modern information infrastructures. This paper proposes a user-centric method for detecting anomalies in heterogeneous information networks, in which nodes and/or edges may be of different types. In the proposed method, users interact directly with the system, and anomalous entities are detected through queries. Our approach is based on tensor decomposition and clustering methods. We also propose a network generation model that constructs synthetic heterogeneous information networks for testing the performance of the proposed method. The proposed anomaly detection method is compared with state-of-the-art methods on both synthetic and real-world networks. Experimental results show that the proposed tensor-based method considerably outperforms the existing anomaly detection methods.

Index Terms— Anomaly Detection, Heterogeneous Information Networks, Query Based Anomaly Detection, Tensor Decomposition


1 INTRODUCTION

MANY real systems consist of interactions between various components and entities and can be modeled as networked structures [1]. Examples include social activities of humans, ecological systems, communication and computer networks, and biological systems. Information networks are everywhere and have become a vital component of modern information infrastructure. In recent years, the analysis of information networks has attracted scholars across disciplines including computer science, social sciences, mathematics and physics [2]. Modeling real-world data as information networks can often provide richer information than traditional modeling techniques such as multidimensional modeling [3], because it makes it possible to model relationships between entities. In some systems the nodes and/or edges are not of the same type and various types coexist; such systems are often modeled by heterogeneous (or multilayer) networks [4], [5], [6], [7].

One of the major challenges in the analysis of information networks is to discover anomalies and abnormal components. Anomaly detection is a branch of data mining concerned with the discovery of abnormal occurrences in data. It has many applications in areas such as security, finance, biology, healthcare, and law enforcement [8]. For example, online social media detect spam reviews by finding unusual patterns [9]. Banks often find fraudulent activities by examining unusual transaction patterns [10]. Network intrusion detection methods detect potential network attacks by comparing traffic signatures with incoming traffic and finding unusual patterns in it [11]. An abnormality can be a node (entity), an edge (connection) or a subnet (a group of entities) that exists in the network but should not. The attributes and features associated with nodes and edges that are used to detect abnormalities can be of any kind that relates to the entities or to the relationships between them.

So far, anomaly detection methods have mainly focused on homogeneous information networks and unstructured multidimensional data [12], [13]. Anomaly detection in heterogeneous information networks is a challenging task, mainly due to specific characteristics of such networks [14]. Most of the methods developed originally for homogeneous networks do not work in the heterogeneous case. Often, one would like to find anomalies among a certain type of nodes/edges in a heterogeneous network. For example, in bibliographic networks in which nodes can be authors, papers or venues, one might want to find the author(s) who differ most from others in the way they publish papers (i.e. the topics of their papers and/or the venues in which they publish). In this example, the difference in behavior should be measured against a particular author chosen as reference. Such anomaly detection is referred to as query-based anomaly detection [14].

This paper introduces a novel anomaly detection method based on tensor decomposition, named QANet. In the proposed method, the various meta-paths each node has with others are considered as properties of that node. Then, features are extracted using tensor decomposition, and clustering techniques are used to detect anomalies.

————————————————

• V. Ranjbar and M. Salehi are with the Faculty of New Sciences and Technologies, University of Tehran, Tehran, Iran and also with the School of Computer Science, Institute for Research in Fundamental Science (IPM), P.O. Box 19395-5746, Tehran, Iran. E-mail: vranjbar@ut.ac.ir; mostafa_salehi@ut.ac.ir.

• P. Jandaghi is with the Department of Computer Science, University of Southern California, California, USA. E-mail: jandaghi@usc.edu.

• M. Jalili is with the School of Engineering, RMIT University, Melbourne, Australia. E-mail: mahdi.jalili@rmit.edu.au.

We use synthetic and real datasets to evaluate the performance of QANet. The specific contributions of this manuscript are as follows:

• We provide a user-centric anomaly detection approach that uses tensors to store meta-paths in heterogeneous information networks and uses tensor decomposition techniques to extract nodal features from a tensor.

• We introduce a network generation model and an anomaly injection method to construct synthetic heterogeneous information networks for testing the performance of the proposed anomaly detection method.

• We create queries from two real-world datasets, IMDB and DBLP, as well as from the constructed synthetic dataset, and compare the proposed method with state-of-the-art methods.

The rest of the paper is organized as follows. Section 2 presents the preliminaries, including a number of definitions and the problem statement. We discuss the related work in Section 3. We discuss the tensor decomposition model and describe our approach for ranking the candidate set of a user query in Section 4. A comprehensive set of experiments is performed to evaluate the effectiveness of QANet in Section 5. Section 6 draws the conclusions.

2 PRELIMINARIES

This section provides preliminaries, including a number of definitions required to formally state the problem of query-based anomaly detection in heterogeneous information networks.

DEFINITION 1. (HETEROGENEOUS INFORMATION NETWORK) [2]. A heterogeneous information network consists of multi-type entities that can have different types of relationships between them, and is defined by a directed graph. Without loss of generality, the information network $G$ can be defined as $G = (V, E)$, where $V$ is the set of nodes and $E$ is the set of edges. Function $\phi: V \rightarrow \mathcal{A}$ maps each node to its type, where $\mathcal{A}$ is the set of node types and $|\mathcal{A}|$ is the number of node types. Each node $v \in V$ maps to a particular type in the entity type set $\mathcal{A}$. Function $\psi: E \rightarrow \mathcal{R}$ maps each edge to its type from the set $\mathcal{R}$, where $|\mathcal{R}|$ is the number of edge types. Each edge $e \in E$ is mapped to a particular type in the edge type set $\mathcal{R}$. Fig. 1(a) shows a heterogeneous information network on bibliographic data. This network includes three types of nodes: Papers (P), Authors (A) and Venues (V). Each paper has links to a set of authors and to a venue, and these links belong to a set of link types.

To better understand the types of entities and the relationships between them in a heterogeneous information network, it is useful to have a meta-level (i.e. schema-level) description of the network.

DEFINITION 2. (NETWORK SCHEMA) [2]. The network schema, denoted as $T_G = (\mathcal{A}, \mathcal{R})$, is a meta template for an information network $G = (V, E)$ with the node type mapping $\phi: V \rightarrow \mathcal{A}$ and the edge type mapping $\psi: E \rightarrow \mathcal{R}$; it is a directed graph with node types in $\mathcal{A}$ and multi-type relationships from $\mathcal{R}$. Fig. 1(b) shows the network schema for the bibliographic network.

DEFINITION 3. (META-PATH) [2], [39]. A meta-path $P$ is a path defined on a schema $T_G = (\mathcal{A}, \mathcal{R})$, and is denoted in the form $A_1 \xrightarrow{R_1} A_2 \xrightarrow{R_2} \cdots \xrightarrow{R_l} A_{l+1}$, with length $l$. For simplicity, we can also use node types to denote the meta-path if there are no multiple relation types between the same pair of node types: $P = (A_1 A_2 \cdots A_{l+1})$. APV and APA are meta-paths for the heterogeneous information network in Fig. 1.

A meta-path $P = (A_1 A_2 \cdots A_{l+1})$ can be reversed, and the reversed path is denoted as $P^{-1} = (A_{l+1} A_l \cdots A_1)$. If $P$ is equal to $P^{-1}$, then $P$ is a symmetric meta-path. For example, APA and APVPA are symmetric meta-paths.

DEFINITION 4. (META-PATH INSTANTIATION) [39]. If for each $i$ we have $\phi(a_i) = A_i$ and each edge $e_i = \langle a_i, a_{i+1} \rangle$ belongs to relation $R_i$ in meta-path $P = (A_1 A_2 \cdots A_{l+1})$, the path $p = (a_1 a_2 \cdots a_{l+1})$ is called a meta-path instance of $P$. It can be denoted as $p \in P$. Mike-paper1-VLDB is a meta-path instance of the APV meta-path in the network of Fig. 1.
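As a minimal sketch of how meta-path instances can be enumerated and counted, the snippet below uses a tiny bibliographic network. Apart from the Mike-paper1-VLDB instance mentioned above, all node names and edges (`paper2`, `Jim`, `KDD`) are hypothetical and only serve the illustration.

```python
from collections import Counter

# Hypothetical toy network: author -> papers (A-P edges), paper -> venue (P-V edges).
writes = {
    "Mike": ["paper1", "paper2"],
    "Jim": ["paper2"],
}
published_in = {"paper1": "VLDB", "paper2": "KDD"}


def apv_instances(author):
    """List every Author-Paper-Venue path instance that starts at `author`."""
    return [(author, p, published_in[p]) for p in writes.get(author, [])]


def apa_counts(author):
    """Count Author-Paper-Author path instances from `author` to every author."""
    counts = Counter()
    for p in writes.get(author, []):
        for other, papers in writes.items():
            if p in papers:
                counts[other] += 1
    return counts


print(apv_instances("Mike"))
print(dict(apa_counts("Mike")))
```

Because APA is symmetric, the count from Mike to Jim equals the count from Jim to Mike, and the count from an author to itself is the number of that author's papers.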

DEFINITION 5. (ANOMALY QUERY). An anomaly query Q is denoted by $Q = (Q_c, Q_r)$, where $Q_c$ is a query on network entities. $S_c$ is its output, the set in which outliers are sought, known as the candidate set. $Q_r$ is also a query on network entities, whose output $S_r$ serves as the reference of normal nodes. The types of the reference and candidate sets are the same. The candidate set can be separate from the reference set or a subset of it.

DEFINITION 6. (ANOMALY SCORE). The degree of structural difference between a node and the nodes in the reference set is the anomaly score of that node relative to the reference set. Structural difference means the difference in the way a node forms relationships with others.

3 RELATED WORKS

Various methods have been proposed for anomaly detection, each suited to certain applications. Classification methods [10], [15], [16], [17], [18] require labeled data and usually provide a label for test data, so they are not appropriate for applications that require an abnormality rating. Another group of methods is based on clustering [19], [20], [21]; their efficiency is highly dependent on the clustering algorithm used, and their computational complexity is challenging, especially for large-scale feature sets. In some other works, nearest neighbor methods are used to detect abnormalities [22], [23], [24], [25], [26]. These methods are not suitable for datasets in which anomalies are very close to normal points, or for those in which normal data are far apart. Also, the effectiveness of these methods is highly dependent on the specific distance measure used. Another category of anomaly detection methods is based on statistical approaches [27], [28], [29], [30], [31], [32]. These approaches assume that the data is generated from a certain distribution. However, such an assumption might not be valid in many cases, and even when it is, it is often difficult to find the right distribution. Some other works have used information theory methods to detect abnormalities, which are more suitable for sequenced and timed data [33], [34], [35]. In [36], Noble and Cook introduced two techniques for graph-based anomaly detection based on Minimum Description Length.

Fig. 1. (a) An instance of heterogeneous information network for bibliographic network, and (b) bibliographic network schema.

Regarding the input data, anomaly detection can be divided into two categories: structured (or graph-based) data and unstructured multidimensional data. In the first category, one tries to model the dependencies within the data using graphs, while in the other category the data is considered in a multidimensional space, regardless of the dependencies within it. Graph-based methods can operate on either homogeneous or heterogeneous graphs. Most of the previous work in anomaly detection has been on homogeneous networks or unstructured multidimensional data. In homogeneous networks, all nodes and the relationships between them are of the same type. However, not all real-world networks are homogeneous. Some real systems are composed of heterogeneous types of nodes and/or edges.

There are in general two anomaly detection approaches in heterogeneous networks: approaches based on community distribution and those based on queries. In the community distribution approach, instead of considering the entire heterogeneous network to find possible abnormalities, the distribution of nodes in communities is used. In query-based approaches, users create different queries that determine the type of anomaly and its range. For example, one might choose a particular node of a certain type and identify the nodes that differ most from the chosen node as abnormal nodes in the network. Query-based approaches allow the users to interact directly with the system. The first work in the field of query-based anomaly detection was proposed by Gupta et al. in 2014 [37]. Their method considers the malformation of each edge, detects anomalous groups of nodes based on a user query, and returns subnets of the original network. Zhang et al. [38] provided a method for detecting anomalies based on a user query to find abnormal subnets. In their method, users receive a list of abnormal subnets by defining a query. However, they did not consider the attributes of each entity, and framed the method by considering only the structural features of the network. An efficient method was proposed by Kuck et al. [14], where a formal language for queries was presented and an anomaly measure was proposed based on the network structure and the existing meta-paths between the nodes. They

defined an outlierness measure named NetOut. In a heterogeneous network G, for a given query Q and for any $v_i \in S_c$, the outlierness can be measured by:

$$\Omega(v_i) = \sum_{v_j \in S_r} \frac{|\pi_P(v_i, v_j)|}{|\pi_P(v_i, v_i)|} \qquad (1)$$

where smaller $\Omega$ values correspond to greater likelihood of being an outlier and $|\pi_P(v_i, v_j)|$ is the number of path instantiations of $P$ (a symmetric meta-path) between two nodes $v_i$ and $v_j$. They also used PathSim and cosine similarity, which were introduced in the literature, to compare with their work [39]. The PathSim measure between two nodes $v_i$ and $v_j$ follows a meta-path $P$ in a heterogeneous information network and is defined as

$$\mathrm{PathSim}(v_i, v_j) = \frac{2\,|\pi_P(v_i, v_j)|}{|\pi_P(v_i, v_i)| + |\pi_P(v_j, v_j)|} \qquad (2)$$

For comparison, [14] defined:

$$\Omega_{\mathrm{PathSim}}(v_i) = \sum_{v_j \in S_r} \mathrm{PathSim}(v_i, v_j) \qquad (3)$$

[14] also defined a comparable version of NetOut using cosine similarity, as:

$$\Omega_{\mathrm{CosSim}}(v_i) = \sum_{v_j \in S_r} \frac{\Phi_P(v_i) \cdot \Phi_P(v_j)}{\|\Phi_P(v_i)\|\,\|\Phi_P(v_j)\|} \qquad (4)$$

where $\Phi_P(\cdot)$ is the neighbor vector function.
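The three baseline measures can be illustrated with a small sketch. It assumes the commonly used forms of these measures: NetOut as the path count to each reference node normalized by the node's self path count, PathSim as defined in [39], and cosine similarity between neighbor vectors. It further assumes the symmetric meta-path APVPA, for which the number of path instances between two authors is the dot product of their per-venue paper counts, and it takes the per-venue count vector as the neighbor vector (a simplification). The reference author's counts (10, 10, 1, 1) come from the toy example in this paper; the vector `other` is hypothetical.

```python
import math


def path_count(x, y):
    """Number of A-P-V-P-A path instances between two authors,
    given each author's per-venue paper counts."""
    return sum(a * b for a, b in zip(x, y))


def pathsim(x, y):
    """PathSim between two authors under the symmetric meta-path."""
    return 2 * path_count(x, y) / (path_count(x, x) + path_count(y, y))


def cossim(x, y):
    """Cosine similarity between the two per-venue count vectors."""
    return path_count(x, y) / math.sqrt(path_count(x, x) * path_count(y, y))


def netout(x, reference):
    """Assumed NetOut form: summed path counts to the reference nodes,
    normalized by the node's own self path count (smaller = more outlying)."""
    return sum(path_count(x, r) for r in reference) / path_count(x, x)


ref = [10, 10, 1, 1]    # VLDB, KDD, STOC, SIGGRAPH counts of the reference author
other = [0, 0, 5, 5]    # hypothetical author active only in STOC and SIGGRAPH

print(pathsim(other, ref), cossim(other, ref), netout(other, [ref]))
```

An author whose venue profile matches the reference gets values of 1 under all three functions, while the hypothetical `other` gets much smaller values.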

These methods are the state of the art in this field, and we compare the performance of the proposed method with them.

In the proposed method, we use tensor decomposition for the anomaly detection task. The use of tensors for high-dimensional data has been of great interest in recent years [40], [41]. The authors of [42] proposed a network analysis system that uses tensor decomposition to detect malicious patterns over time. Akoglu and Faloutsos [43] proposed a tensor-based algorithm that operates on a time-varying homogeneous network and identifies anomalous points in time at which many nodes change their behavior in a way that deviates from the norm. The work in [44] introduced a tool to automatically detect and visualize novel subgraph patterns within a local community of nodes in a heterogeneous network. Papalexakis et al. [45] proposed a method based on tensor decomposition for spotting anomalies in the check-in behavior of users. Koutra et al. [46] proposed a method for the detection of anomalies, rare events and changes in behavior using tensors. Although there are a number of anomaly detection methods based on tensors, query-based anomaly detection using tensors has not yet been introduced in the literature.

4 PROPOSED QUERY-BASED ANOMALY DETECTION APPROACH (QANET)

An outlier detection algorithm should return outliers as a subset of the candidate set $S_c$ that are considerably different from the nodes in $S_r$. Let us first formally define the query-based anomaly detection problem in heterogeneous networks.

DEFINITION 7. (QUERY-BASED ANOMALY DETECTION PROBLEM). Let us consider the heterogeneous information network $G = (V, E)$ with node type mapping function $\phi: V \rightarrow \mathcal{A}$ and edge type mapping function $\psi: E \rightarrow \mathcal{R}$. Given $S_c$ as a set of candidate nodes and $S_r$ as a set of reference nodes, the problem is to return a list of the candidate nodes sorted by anomaly score relative to the nodes in the reference set. Note that the nodes in the candidate and reference sets must be of the same type.

In this paper, we first use a tensor decomposition technique to reduce the feature set, as there are often many nodes and a large number of features for each node. Then, we use clustering methods to determine the anomaly score of each node by calculating its distance from the cluster centers. Heterogeneous networks have many dimensions, and representing them as matrices, tables, or vectors often leads to information loss. Tensors allow us to store data in more than two dimensions, thus making it possible to store more information from the network than traditional network representations do. Furthermore, tensor decomposition methods can be used effectively to reduce the dimensions of the feature set. One of the main disadvantages of the previous methods in this field is their high computational complexity. Tensor-based methods, on the other hand, are computationally efficient, as there are various infrastructures for implementing them on distributed systems. For example, in [47] several distributed tensor decomposition methods have been implemented on the Hadoop framework.

DEFINITION 8. (TENSOR). An n-way (or n-mode) tensor is essentially a structure that is indexed by n variables. More formally, a tensor is represented by an array of size $I_1 \times I_2 \times \cdots \times I_n$.

There are a number of methods for decomposing tensors. A simple, interpretable and basic method is PARAFAC decomposition [48]. A PARAFAC model decomposes a 3-way tensor $X$ ($I \times J \times K$) into trilinear components. The result is given by three loading matrices $A$, $B$, and $C$ with elements $a_{if}$, $b_{jf}$ and $c_{kf}$, where $F$ is the number of components. The model is fitted to minimize the sum of squares of the residuals $e_{ijk}$, where $E$ is a three-way array of residuals:

$$X = \sum_{f=1}^{F} a_f \circ b_f \circ c_f + E \qquad (5)$$

The above relation is shown graphically in Fig. 2 for two components (F = 2). $a_f$, $b_f$ and $c_f$ are columns of the matrices $A$ ($I \times F$), $B$ ($J \times F$), and $C$ ($K \times F$), respectively. The product of $a_f$, $b_f$, and $c_f$ is defined as follows:

$$(a_f \circ b_f \circ c_f)_{ijk} = a_{if}\, b_{jf}\, c_{kf} \qquad (6)$$

The factors are estimated using the alternating least squares (ALS) method [49], which in each step holds two of the matrices constant and estimates the third.
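The alternating scheme can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the implementation used in the paper: the function names `khatri_rao` and `parafac_als`, the random initialization, and the fixed iteration count are all choices made here. The unfoldings use NumPy's row-major ordering, so the Khatri-Rao factors appear in the order (B, C) rather than the (C, B) order found in column-major formulations.

```python
import numpy as np


def khatri_rao(U, V):
    """Column-wise Kronecker product: row u * V.shape[0] + v holds U[u, :] * V[v, :]."""
    return (U[:, None, :] * V[None, :, :]).reshape(-1, U.shape[1])


def parafac_als(X, F, n_iter=200, seed=0):
    """Fit X[i, j, k] ~ sum_f A[i, f] * B[j, f] * C[k, f] by alternating least squares."""
    rng = np.random.default_rng(seed)
    I, J, K = X.shape
    A, B, C = (rng.standard_normal((n, F)) for n in (I, J, K))
    X1 = X.reshape(I, J * K)                      # mode-1 unfolding
    X2 = np.moveaxis(X, 1, 0).reshape(J, I * K)   # mode-2 unfolding
    X3 = np.moveaxis(X, 2, 0).reshape(K, I * J)   # mode-3 unfolding
    for _ in range(n_iter):
        # Each update solves a linear least-squares problem with the other two factors fixed.
        A = X1 @ khatri_rao(B, C) @ np.linalg.pinv((B.T @ B) * (C.T @ C))
        B = X2 @ khatri_rao(A, C) @ np.linalg.pinv((A.T @ A) * (C.T @ C))
        C = X3 @ khatri_rao(A, B) @ np.linalg.pinv((A.T @ A) * (B.T @ B))
    return A, B, C


# Usage: recover an exactly rank-1 tensor built from three known vectors.
a = np.array([1.0, 2.0, 3.0])
b = np.array([1.0, -1.0])
c = np.array([2.0, 0.5, 1.0, 4.0])
X = np.einsum("i,j,k->ijk", a, b, c)
A, B, C = parafac_als(X, F=1, n_iter=20)
X_hat = np.einsum("if,jf,kf->ijk", A, B, C)
```

The recovered factors are only determined up to scaling and column permutation, so it is the reconstruction `X_hat`, not the individual matrices, that should be compared with `X`.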

To represent a heterogeneous network by a tensor, we use the concept of meta-path. Meta-paths can indicate similarity/proximity between nodes in the network, and thus can be an important feature for anomaly detection. We define a 3-way tensor X with dimensions $N \times N \times K$ for $G$, where $N = |V|$ and $K$ is equal to the number of meta-paths with length less than 2. $x_{ijk}$ is equal to the number of instances of the kth meta-path, from the list of all meta-paths with length less than 2 extracted from the network schema, between nodes $v_i$ and $v_j$. $x_{ijk}$ indicates the relationship between nodes $v_i$ and $v_j$ relative to the kth meta-path, which is considered as a feature of node $v_i$.

The QANet method constructs N×K features for each node based on meta-paths. This is often a huge number, and traditional clustering algorithms such as k-means cannot be applied to it directly. Here we use tensor decomposition to reduce the dimension of the feature space, thus making it possible to apply traditional clustering algorithms.
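A meta-path count tensor of this kind can be assembled from incidence matrices, since the number of instances of a meta-path between two nodes is an entry of the corresponding matrix product. The sketch below is an illustration under assumptions: the incidence matrices are hypothetical, and the two meta-paths used (APA and APVPA, both named in Section 2) are chosen here only to give the tensor two slices.

```python
import numpy as np

# Hypothetical toy data: 3 authors, 4 papers, 2 venues.
author_paper = np.array([[1, 1, 0, 0],
                         [0, 1, 1, 0],
                         [0, 0, 0, 1]])
paper_venue = np.array([[1, 0],
                        [1, 0],
                        [0, 1],
                        [0, 1]])

apa = author_paper @ author_paper.T          # A-P-A instance counts between authors
author_venue = author_paper @ paper_venue    # A-P-V instance counts
apvpa = author_venue @ author_venue.T        # A-P-V-P-A instance counts between authors

# Slice k of the tensor holds the counts for the kth meta-path.
X = np.stack([apa, apvpa], axis=2)           # shape (N, N, K) = (3, 3, 2)
features = X.reshape(X.shape[0], -1)         # N*K raw features per node

print(X.shape, features.shape)
```

Row `i` of `features` is the flattened slice `X[i, :, :]`, which is the N×K feature set per node that the decomposition is meant to compress.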

We need to compute the abnormality score for the candidate nodes relative to the reference set given by the query. Therefore, a feature reduction model is first created for the reference set using tensor decomposition. To this end, we separate the part of the tensor X that contains the features of the reference set. Let us call it $X_r$, whose dimensions are $N_r \times N \times K$, where $N_r$ is the number of nodes in the reference set. Using the PARAFAC decomposition method, we can obtain three matrices A, B, and C for $X_r$. Fig. 3 shows this tensor decomposition. Matrix $A$ contains the main factors of the reference set. These factors can be considered as new features for each of the nodes in the reference set. Matrices $B$ and $C$ can be used in the next step to obtain the features of the candidate nodes. To obtain the features for the nodes in the candidate set, it is sufficient to use (7) to calculate the matrix $A_c$ using the matrices $B$ and $C$ obtained in the previous step and $X_c$, which is defined as the part of $X$ for the candidate set nodes, as

$$A_c = X_{c(1)} (C \odot B) (C^{T} C * B^{T} B)^{\dagger} \qquad (7)$$

where $\odot$ and $*$ are the Khatri-Rao and Hadamard products, respectively, $\dagger$ denotes the pseudo-inverse, and $X_{c(1)}$ is the mode-1 matricization of $X_c$. After computing $A_c$, the properties of the candidate nodes are obtained using the model of the reference nodes. Finally, $A$ and $A_c$ are fed into the clustering algorithm as inputs (Fig. 4).
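The projection of candidate nodes onto the reference model can be sketched as a single least-squares step. The sketch assumes that (7) is the usual ALS update for the first factor matrix with B and C held fixed; the function name `project_candidates` and the example matrices are hypothetical. Because NumPy unfolds in row-major order, the Khatri-Rao product is taken in the order (B, C), which is equivalent to the (C, B) order of a column-major unfolding.

```python
import numpy as np


def khatri_rao(U, V):
    """Column-wise Kronecker product: row u * V.shape[0] + v holds U[u, :] * V[v, :]."""
    return (U[:, None, :] * V[None, :, :]).reshape(-1, U.shape[1])


def project_candidates(X_c, B, C):
    """Least-squares features A_c for candidate nodes, with B and C fixed
    from the decomposition of the reference tensor."""
    X_c1 = X_c.reshape(X_c.shape[0], -1)                 # mode-1 unfolding (row-major)
    gram = (B.T @ B) * (C.T @ C)                         # Hadamard product of Gram matrices
    return X_c1 @ khatri_rao(B, C) @ np.linalg.pinv(gram)


# Usage with hypothetical factors: build X_c from known features and recover them.
B = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
C = np.array([[1.0, 2.0], [0.0, 1.0]])
A_true = np.array([[1.0, 2.0], [3.0, -1.0]])
X_c = np.einsum("if,jf,kf->ijk", A_true, B, C)
A_c = project_candidates(X_c, B, C)
```

When `X_c` is exactly representable with the given `B` and `C`, the projection returns the generating features; for real data it returns the best fit in the least-squares sense.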

We use the K-means clustering method for the clustering phase. The K-means method takes a set of n observations and partitions them into k (≤ n) sets so as to minimize the within-cluster sum of squares. In the clustering phase, the reference nodes are first clustered into k clusters according to the features in matrix A. Then, based on the obtained cluster centers ($c_1, \dots, c_k$), the anomaly score of each candidate node is obtained as the Euclidean distance from the center of the nearest cluster. This allows us to sort the candidate nodes by anomaly score and determine their final ranking:

$$\mathrm{AS}(v_i) = \min_{1 \le j \le k} d(a_i, c_j) \qquad (8)$$

where $a_i$ (the feature vector of candidate node $v_i$, i.e. its row in $A_c$) and $c_j$ are two points in Euclidean F-space and $d(a_i, c_j)$ is the distance between them, given by:

$$d(a_i, c_j) = \sqrt{\sum_{f=1}^{F} (a_{if} - c_{jf})^2} \qquad (9)$$

Fig. 2. A graphical representation of a PARAFAC model of tensor X [47].

Fig. 3. Tensor decomposition for reference tensor $X_r$. We can obtain three matrices A, B, and C using PARAFAC decomposition method.
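The clustering and scoring phase described above can be sketched as follows. This is a minimal illustration, not the authors' code: the plain Lloyd-style `kmeans`, its seed and iteration count, and the example feature values are assumptions made here. Reference features are clustered, and each candidate is scored by its Euclidean distance to the nearest cluster center.

```python
import numpy as np


def kmeans(points, k, n_iter=50, seed=0):
    """Plain Lloyd iterations; returns the k cluster centers."""
    points = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        dist = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):                  # keep the old center if a cluster empties
                centers[j] = points[labels == j].mean(axis=0)
    return centers


def anomaly_scores(candidate_features, centers):
    """Distance from each candidate to the center of its nearest cluster."""
    cand = np.asarray(candidate_features, dtype=float)
    dist = np.linalg.norm(cand[:, None, :] - centers[None, :, :], axis=2)
    return dist.min(axis=1)


# Hypothetical reduced features for reference nodes (rows of A) and candidates (rows of A_c).
A_ref = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
A_cand = [[0.0, 0.5], [5.0, 5.5], [10.0, 11.5]]

centers = kmeans(A_ref, k=2)
scores = anomaly_scores(A_cand, centers)
ranking = np.argsort(-scores)                        # most anomalous candidate first
```

In this toy setting the middle candidate lies far from both reference clusters and is therefore ranked first.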

Note that QANet is based on calculating similarities using meta-paths. Homogeneous networks are a special type of heterogeneous network that has only one type of node and edge. Meta-paths cannot be defined in these networks, and one can only define simple paths of different lengths. In the case of homogeneous networks, QANet considers the similarity of nodes (e.g. common neighborhood) and identifies the anomalies based on that.

Let us consider a simple example to better understand the mechanism of QANet. TABLE 1 shows papers published by several authors, where the columns represent the number of papers published by each author at various conferences. The task is to take one author as the reference author and rank the others against the reference author on the basis of their anomaly score. As can be seen in TABLE 1, the reference author has authored 22 papers: 10 papers published in VLDB, 10 papers in KDD, and one paper each in STOC and SIGGRAPH.

TABLE 2 shows the results obtained from QANet and three other methods, CosSim, PathSim and NetOut, as state-of-the-art methods in query-based anomaly detection. To compare the performance of QANet with the others, we obtained distance measures as one minus the similarity measure. As shown in TABLE 2, all methods rate Sarah as identical to the reference author. Compared with Lucy, Rob is more abnormal because Rob has published most of his papers at the conferences where the reference author has the least activity. As the results show, the top anomaly according to QANet is Joe, as Joe differs from the reference author both in the number of papers and in the conferences in which he participated. PathSim and CosSim also classify Joe as an anomalous author, while the NetOut measure returns Joe, like Sarah, as a normal author. Mikel is similar to Joe, but Mikel has a paper at KDD, which is one of the major conferences for the reference author, and QANet responds well to this difference. Between Mikel and Emma, Emma deviates less than Mikel, as the number of Emma's papers in the same conferences as the reference author is higher than Mikel's. This is captured correctly only by QANet and not by the others.

5 RESULT

In this section we first describe the synthetic and real datasets used in this work, and then introduce the evaluation metric. We compare the performance of QANet and state-of-the-art methods in detecting anomalies. We also discuss the time complexity of QANet.

5.1. Data Sets

We used both synthetic and real datasets to evaluate the performance of the proposed method.

Synthetic data

We use an approach similar to the method presented in [32] to generate synthetic heterogeneous networks (Algorithm 1). We first consider $N$ nodes in $V$, then assign a color to each node using the function $c: V \rightarrow \{1, \dots, C\}$, where $C$ is the number of colors, which is equal to the number of communities in the network. Also, for each node $v \in V$, a type $\phi(v)$ is considered. To create the graph edges, if two nodes $v_i$ and $v_j$ have the same color, an edge is placed between them with probability $p$; otherwise the link is placed with probability $q$, with $q$ smaller than $p$. The network is thus created as a heterogeneous network with different types of nodes, in which nodes of the same color are more likely to be connected than those of different colors. This makes it possible to create groups (nodes with the same color) in the network whose members are structurally similar to one another. These groups represent the communities in the graph. In a bibliographic network, for example, such a group can represent a research area in which most of the authors publish in that area and in conferences/journals of the same domain.
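The generation procedure described above can be sketched as a colored random graph. This is not the paper's Algorithm 1, which is not reproduced here; the function name, the parameter names `p_same` and `p_diff`, and the uniform random assignment of colors and types are assumptions made for illustration.

```python
import random


def generate_network(n, n_colors, n_types, p_same, p_diff, seed=0):
    """Return (color, node_type, edges) for a random colored, typed graph.

    Nodes with the same color are linked with probability p_same,
    nodes with different colors with probability p_diff.
    """
    rng = random.Random(seed)
    color = [rng.randrange(n_colors) for _ in range(n)]     # community of each node
    node_type = [rng.randrange(n_types) for _ in range(n)]  # type of each node
    edges = set()
    for u in range(n):
        for v in range(u + 1, n):
            p = p_same if color[u] == color[v] else p_diff
            if rng.random() < p:
                edges.add((u, v))
    return color, node_type, edges


color, node_type, edges = generate_network(
    n=20, n_colors=3, n_types=2, p_same=1.0, p_diff=0.0, seed=1
)
print(len(edges))
```

With `p_same=1.0` and `p_diff=0.0` the graph degenerates into one clique per color, which makes the community structure easy to inspect; realistic settings would use intermediate probabilities.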

Anomaly injection

We add some abnormal nodes to the generated network. These abnormal nodes need to be structurally different from the rest of the nodes. Thus, we randomly select a certain portion of the nodes of each color, and create links between them and other nodes with a probability $r$ that is different from the probabilities $p$ and $q$.
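A minimal sketch of this injection step is given below. It assumes that the selected nodes are rewired, meaning their existing edges are dropped and new ones are drawn with the anomaly probability; the text does not say whether existing edges are kept, so this is an assumption, as are the function and parameter names.

```python
import random


def inject_anomalies(color, edges, fraction, p_anom, seed=0):
    """Pick a fraction of nodes from each color and redraw their links
    to all other nodes with probability p_anom."""
    rng = random.Random(seed)
    n = len(color)
    anomalous = set()
    for c in sorted(set(color)):
        members = [v for v in range(n) if color[v] == c]
        k = max(1, int(fraction * len(members)))     # at least one anomaly per color
        anomalous.update(rng.sample(members, k))
    # Assumption: existing edges of the selected nodes are discarded before rewiring.
    new_edges = {e for e in edges if e[0] not in anomalous and e[1] not in anomalous}
    for a in sorted(anomalous):
        for v in range(n):
            if v != a and rng.random() < p_anom:
                new_edges.add((min(a, v), max(a, v)))
    return anomalous, new_edges


toy_color = [0, 0, 0, 1, 1, 1]
toy_edges = {(0, 1), (3, 4)}
anomalous, new_edges = inject_anomalies(toy_color, toy_edges, fraction=0.34, p_anom=1.0)
print(sorted(anomalous))
```

Setting `p_anom=1.0` connects every injected node to all other nodes, which is an extreme case used here only to make the effect visible.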

TABLE 1
A TOY EXAMPLE OF BIBLIOGRAPHY NETWORK.

Columns: NAME, VLDB, KDD, STOC, SIGGRAPH
Rows: Reference Author, Sarah, Rob, Lucy, Joe, Mikel, Emma

Taken from an example in [14] with the last row added to it.

Fig. 4. Clustering the nodes in reference set with K-means clustering method and sorting the candidate nodes based on distance from the center of the nearest cluster.

TABLE 2
RESULT OF ANOMALY DETECTION METHODS FOR TOY EXAMPLE PROVIDED IN TABLE 1.

Methods: QANet, CosSim, PathSim, NetOut (a rank and a score are reported for each method per author).

Sarah

0.81

0.8757

0.9

0.9376

Rob

0.73

0.6717

0.6721

0.6889

Lucy

1.02

0.9296

0.9901

Joe

0.95

0.2964

0.9014

Mikel

0.88

0.9296

0.9455

0.9667

Emma

Query generation

To generate queries, two sets of reference and candidate nodes should be selected, which need to be of the same type. Suppose that i and j are two colors from the set of colors $\{1, \dots, C\}$, where $i \neq j$. The following query types are considered in this work.

1. We consider a number of random nodes in color i as the reference set and a number of random nodes as the candidate set, half of which are in color i and the other half in other colors.

2. A random number of nodes is considered as the reference set, half of which are in color i and the other half in color j. We also consider a number of random nodes as the candidate set, half of which are in color i or j and the other half in other colors.

3. A number of random nodes in color i is considered as the reference set, and a number of random nodes as the candidate set, half of which are anomalous nodes and the rest in color i.

4. We consider a random number of nodes as the reference set, half of which are in color i and the other half in color j, and a number of random nodes as the candidate set, half of which are anomalous nodes and the rest in color i or j.

5. We consider a number of random nodes in color i as the reference set and some random nodes as the candidate set, half of which are in color i and the rest are anomalous nodes or have a color other than i.

6. A random number of nodes is considered as the reference set, half of which are in color i and the other half in color j. We consider some random nodes as the candidate set, half of which are in color i or j and the rest are anomalous nodes or have a color other than i or j.

To evaluate the performance of the methods, we labeled candidate nodes that have the same color as the reference nodes as normal, and the rest of the nodes as abnormal.
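As an illustration, the first query type and the labeling rule can be sketched as follows. The function name, the set sizes, and the choice to keep candidates disjoint from the reference set are assumptions made here; Definition 5 also allows candidates to be a subset of the reference set.

```python
import random


def make_query_type1(color, i, n_ref, n_cand, seed=0):
    """Reference set: random nodes of color i.
    Candidate set: half of color i, half of other colors.
    Labels: 1 = abnormal (color differs from the reference color), 0 = normal."""
    rng = random.Random(seed)
    same = [v for v, c in enumerate(color) if c == i]
    other = [v for v, c in enumerate(color) if c != i]
    reference = rng.sample(same, n_ref)
    # Assumption: candidates are drawn from nodes that are not in the reference set.
    rest = [v for v in same if v not in reference]
    half = n_cand // 2
    candidates = rng.sample(rest, half) + rng.sample(other, n_cand - half)
    labels = {v: int(color[v] != i) for v in candidates}
    return reference, candidates, labels


toy_colors = [0] * 10 + [1] * 5 + [2] * 5
reference, candidates, labels = make_query_type1(toy_colors, i=0, n_ref=4, n_cand=6)
print(reference, candidates, labels)
```

The other query types differ only in which pools the reference and candidate nodes are sampled from, so they can be built by the same pattern.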

Real data

DBLP: We employ a bibliographic dataset from ArnetMiner [50] to construct a heterogeneous information network. The dataset consists of 2,092,356 publications and 1,712,433 authors in the field of computer science. The heterogeneous network contains 3 types of vertices: paper, venue, and author. The possible types of edges are paper-author (written-by), paper-venue (published-in) and paper-paper (cited-by).

IMDb: We use the movie details dataset from the Internet Movie Database (IMDb). This dataset consists of 4,566,466 movies and 8,183,156 individuals in the role of actor, director or writer. The heterogeneous information network for this dataset contains 4 types of nodes: movie, actor, director and writer. The types of edges are actor-movie (Acting), director-movie (Directing) and writer-movie (Writing).

5.2. Evaluation metric

In this paper, we use the lift index [51] to evaluate QANet and compare it with the NetOut, PathSim, and CosSim methods. The lift index measures the accuracy of a ranking method with respect to the ground-truth labels. It is calculated as follows. A predictive model is first built on the training data and then applied to the test data to give each test case a score showing the likelihood that it belongs to the positive class. The test cases are then ranked by score in descending order. After that, the ranked list is divided into n equal segments, with the cases most likely to be positive in the top segment and those least likely to be positive in the bottom segment [51]. To this end, the nodes in the candidate set are ranked according to the anomaly score in descending

order, and the lift index LI is calculated as

$$LI = \frac{\sum_{i=1}^{|C|} \frac{|C| - i + 1}{|C|}\,\delta(node_i)}{\sum_{i=1}^{|C|} \delta(node_i)} \qquad (10)$$

where $|C|$ is the size of the candidate set, $node_i$ is the ith node of the candidate set ranked according to the anomaly score, and $\delta(node_i)$ is equal to 1 if $node_i$ is an anomalous node and 0 otherwise. The higher the value of LI for a method, the better its performance. In the experiments the size of the candidate set is 10 and half of the candidates are anomalous, so according to equation (10) LI takes a value between 80% (the best case) and 30% (the worst case).
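The stated bounds can be checked with a short sketch. It assumes a decile-style weighting in which the node at rank i (counting from 1) in a candidate set of size n receives weight (n - i + 1)/n and the weighted sum is normalized by the number of anomalous nodes. This weighting is an assumption chosen because it reproduces the 80% and 30% bounds quoted above for ten candidates of which five are anomalous; the function name is hypothetical.

```python
def lift_index(ranked_labels):
    """Lift index of a ranking.

    ranked_labels holds 1 for an anomalous node and 0 for a normal node,
    ordered by descending anomaly score.
    """
    n = len(ranked_labels)
    n_anomalous = sum(ranked_labels)
    if n_anomalous == 0:
        raise ValueError("the candidate set contains no anomalous node")
    # Position 0 gets weight 1.0 and the last position gets weight 1/n.
    weighted = sum((n - i) / n * label for i, label in enumerate(ranked_labels))
    return weighted / n_anomalous


best_case = lift_index([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])   # all anomalies ranked first
worst_case = lift_index([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])  # all anomalies ranked last
```

Under this weighting, a ranking that places every anomalous node ahead of every normal node attains the upper bound, and the reverse ordering attains the lower bound.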

5.3. Experimental results

In this section, we compare QANet with the methods presented by Kuck et al. [14] on the six query types described above. To ensure the reliability of the results, each experiment was performed 50 times and the average results are reported. We assess the performance of the methods by varying different network parameters, including the network size, the number of node types (degree of heterogeneity), and the number of communities in the network. We also examine two parameters related to the query: the size of the reference set and the query type. In all these settings, the two main parameters of QANet, namely the rank of the tensor decomposition and the number of clusters in the clustering method, are studied. When comparing the methods for different network parameters, we use query type 5, which has both anomalous nodes and nodes with colors other than that of the reference set.

https://www.imdb.com/interfaces/

Input: Number of nodes (N), number of communities (C), set of node types (A), and the edge probabilities for same-color and different-color node pairs.

Output: Heterogeneous information network G

1: Consider G = (V, E), where V = E = {}
2: Insert v_1 to v_N into V
3: Using a random function, assign a type from A to each node
4: Using a random function, assign one of the C colors to each node
5: For each node pair v_i and v_j
6:     If v_i and v_j have the same color
7:         Insert edge (v_i, v_j) into E with the same-color probability
8:     Else
9:         Insert edge (v_i, v_j) into E with the different-color probability
10: Define the edge mapping function, which assigns each edge its type
11: return G

Algorithm 1. Pseudo-code for generation of synthetic heterogeneous networks.
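The generation procedure of Algorithm 1 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `generate_hin`, `p_same`, and `p_diff` are hypothetical, types and colors are drawn uniformly at random, and the edge type is taken to be the pair of endpoint node types.

```python
import random
from itertools import combinations


def generate_hin(n_nodes, n_communities, node_types, p_same, p_diff, seed=None):
    """Generate a synthetic heterogeneous network following the steps of Algorithm 1."""
    rng = random.Random(seed)
    nodes = list(range(n_nodes))
    # Steps 3-4: assign a random type and a random color (community) to each node.
    node_type = {v: rng.choice(node_types) for v in nodes}
    color = {v: rng.randrange(n_communities) for v in nodes}
    # Steps 5-9: connect each node pair with a probability that depends on
    # whether the two nodes share a color.
    edges = []
    for u, v in combinations(nodes, 2):
        p = p_same if color[u] == color[v] else p_diff
        if rng.random() < p:
            edges.append((u, v))
    # Step 10 (assumption): the edge type is the unordered pair of endpoint types.
    edge_type = {(u, v): tuple(sorted((node_type[u], node_type[v]))) for u, v in edges}
    return nodes, edges, node_type, color, edge_type


nodes, edges, node_type, color, edge_type = generate_hin(
    n_nodes=30, n_communities=3, node_types=["author", "paper"],
    p_same=1.0, p_diff=0.0, seed=1,
)
```

With `p_same=1.0` and `p_diff=0.0` every edge joins two nodes of the same color, which makes the community structure easy to verify. Realistic experiments would use a same-color probability that is larger than the different-color probability but below 1.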

1) Network size

Fig. 5 shows the lift index of QANet as a function of the network size. For this simulation, we consider decomposition ranks of 4 and 8 and cluster counts of 1, 4, and 20. As can be seen, the decomposition rank does not have much influence on the accuracy of the proposed method; however, as the number of clusters increases, its accuracy declines, i.e., the lift index decreases. Because the nodes in the reference set are of the same color (query type 5), setting the number of clusters to 1 places all reference nodes in one cluster, which improves the accuracy of the proposed method in finding the abnormal nodes.

Fig. 6 compares the accuracy of NetOut, PathSim, CosSim, and QANet for varying network sizes. As can be seen, the proposed method (QANet) is considerably more accurate than the others, and its accuracy increases with the network size. This is because QANet is based on network structure: as the network becomes larger, more features can be extracted for the nodes, resulting in improved detection of abnormalities. The accuracy of the other methods either decreases or remains unchanged as the network size increases. These results indicate that QANet is more suitable for large-scale networks than the other state-of-the-art methods.

2) Node types

Another parameter that affects anomaly detection is the number of node types, which is an indicator of the level of heterogeneity in the network. Increasing the number of node types while keeping the number of nodes unchanged has almost the same effect as decreasing the number of nodes. Thus, one would expect accuracy to decline as the number of node types increases. In addition, under the generation model, increasing the number of node types decreases the number of edges in the network, which makes it more difficult to distinguish the nodes from one another. However, as the results show, QANet is almost insensitive to the number of node types, whereas the other methods vary more (Fig. 7).

3) Number of communities

Fig. 8 shows the impact of the number of communities in the network on the accuracy of the methods. In many heterogeneous networks the nodes can be divided into different groups. For example, in bibliographic networks there are different scientific fields, and each node (author, paper, or conference) belongs to one or more of them. Our results show that the accuracy of QANet decreases as the number of communities increases (Fig. 8). Indeed, increasing the number of communities in the network reduces the distinction between the nodes, which reduces the accuracy of QANet. The other methods, however, are not considerably affected by the number of communities.

4) Size of reference set

Fig. 9 shows the accuracy of the algorithms with respect to the size of the reference set. As the size of the reference set increases, the accuracy of QANet improves systematically, whereas the other methods do not follow any specific pattern as a function of the size of the reference set.

Fig. 5. The accuracy of QANet as a function of the network size in synthetic networks. In this experiment, we consider decomposition ranks of 4 and 8 and cluster counts of 1, 4, and 20. The number of node types is 2, the number of communities is 4, and the query type is 5.

Fig. 6. Accuracy of NetOut, PathSim, CosSim, and QANet as a function of network size in synthetic networks. In this experiment, the number of node types is 2, the number of communities is 4, and the query type is 5.

Fig. 7. Accuracy of the methods as a function of the number of node types in synthetic networks. In this experiment, the number of nodes is 4000, the size of the reference set is 20, and the query type is 5. We set the number of clusters to 1 and the decomposition rank to 4 for QANet.

The improved performance of QANet as a function of this parameter is due to the richer feature set, which allows the method to reconstruct a more precise model. Because the NetOut, PathSim, and CosSim methods consider only a simple average over the nodes in the reference set, increasing the number of nodes in the reference set might worsen their performance, which is clearly seen in the results.

5) Query type

Fig. 10 illustrates the accuracy of the methods for the six types of queries. As shown in Fig. 10, QANet is the top performer in all query types. While PathSim and CosSim perform similarly to QANet in query types 3 and 4, their accuracy is not comparable with that of QANet in the other query types. NetOut has higher accuracy than PathSim and CosSim in query types 1 and 2, but lower accuracy in the other types. This is because PathSim and CosSim can detect anomalous nodes correctly but cannot detect nodes from other communities, whereas NetOut can detect nodes from other communities but cannot detect anomalous nodes. Therefore, since there are no anomalous nodes in type 1 and 2 queries, NetOut performs better there, while PathSim and CosSim perform better in the other queries. As Fig. 10 shows, the proposed method detects both types of abnormalities well.

5.4. Case studies

In this section, we examine the effectiveness of QANet

and NetOut on two real datasets.

Case 1: Bibliographic Dataset

In the DBLP dataset, we consider Christos Faloutsos as the reference set and all authors who have collaborated with him as the candidate set. In this query, the reference set has only one member and the candidate set contains 426 authors. According to the definition of anomalies in Section 2, the abnormalities indicate a structural difference in the connections (i.e., different meta-paths) between any of the nodes in the candidate set and the node(s) of the reference set. The 10 authors with the highest degree of abnormality according to QANet and NetOut are listed in Tables 3 and 4, respectively.

As seen in Table 3, the top-4 abnormal authors returned by QANet each have only one paper co-authored with the reference author. Each of these authors has published only this one paper at the conference concerned, where the reference author has likewise published only one paper. Furthermore, this conference is not one of the major conferences of the reference author. These authors are ordered by citation count, which indicates the importance of their paper and of their collaboration with the reference author. The next four authors in Table 3 have also published only one paper; however, they have a common co-author with the reference author, which strengthens their relationship with the reference author. These four authors are also ordered by citation count, which indicates the importance of their collaboration with the reference author and also reflects the overall view of QANet. Lei Li is ranked 9. Although the publications of this author are quite similar to those of the top-ranked author, the venue in which this author has published his paper is more important to the reference author, as

Fig. 8. Accuracy of the methods as a function of the number of communities in synthetic networks. In this experiment, the number of nodes is 4000, there are two types of nodes, the size of the reference set is 20, and the query type is 5. We set the number of clusters to 1 and the decomposition rank to 4 for QANet.

Fig. 9. Accuracy of the methods as a function of the reference set size in the query for synthetic networks. In this experiment, the number of nodes is 6000, the number of node types is 2, the number of communities is 4, and the query type is 5. For QANet, the number of clusters is 1 and the decomposition rank is 4.

Fig. 10. Accuracy of the methods as a function of the query types described in Section 5.1 for synthetic networks. In this experiment, the number of nodes is 6000, the number of node types is 2, the number of communities is 4, and the size of the reference set is 20. We set the number of clusters to 1 and the decomposition rank to 4 for QANet.

he has published another paper at this venue. This indicates that, in contrast to Philip Korn, Lei Li is closer to Christos Faloutsos, as there are more paths between them. Finally, the last author in Table 3 has also co-authored one paper with the reference author; however, this author has two collaborators who have co-authored with the reference author, making him closer to the reference author than those at the top of the ranking. These results indicate that the proposed ranking algorithm leads to reasonable outcomes.

To better understand the performance of the competing algorithm, Table 4 shows the top-10 abnormal authors with respect to the reference author according to NetOut. The query from which Tables 3 and 4 are extracted is exactly the same. NetOut requires users to specify the meta-path in addition to the reference and candidate sets. For a fair comparison with QANet, we chose all meta-paths of length 2 used in QANet. As the ranking obtained by NetOut shows, this method works only on the basis of counting meta-paths. It ranks the authors based on their publication pattern, i.e., the number of articles and citations, as compared with the reference author. The method also takes into account joint publications with the reference author. Table 5 shows the top-10 dissimilar authors to the reference author according to the two algorithms. Clearly, the two ranking algorithms produce considerably different outcomes. However, as seen in Tables 3 and 4, QANet produces a more reasonable ranking than NetOut, as the authors ranked by QANet are more dissimilar to the reference author than those ranked by NetOut.

Case 2: Internet Movie Database

In this case study, according to the definition of anomalies in Section 2, we want to identify abnormal actors among the actors of the movie "The Godfather (1972)". According to the Internet Movie Database (IMDb) site, 34 actors are listed as the main cast of this movie, and we consider them as both the candidate set and the reference set. Table 6 lists the actors and their details, as well as their rankings by QANet and NetOut. We set the number of clusters to 1 and the decomposition rank to 3 for QANet.

To confirm the results of the anomaly detection methods, we show the matrices of the AMA, AMAMA, AMDMA, and AMWMA meta-paths for the actors as heatmaps in Fig. 11. With regard to the various meta-paths, James Caan, Marlon Brando, Al Pacino, Robert Duvall, and Diane Keaton are more abnormal than the rest of the actors. These are exactly the five top-ranked actors in the QANet ranking, whereas NetOut ranks them 3, 13, 15, 9, and 7, respectively.
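Meta-path matrices of this kind can be obtained by chaining incidence matrices, which is the usual way of counting meta-path instances (as in PathSim [39]). The sketch below uses small made-up actor-movie and director-movie matrices; the data and the helper name `metapath_counts` are assumptions for illustration, and the paper does not state that Fig. 11 was produced exactly this way.

```python
import numpy as np

# Toy incidence matrices (hypothetical data).
# Rows are actors, columns are movies.
actor_movie = np.array([
    [1, 1, 0],
    [1, 0, 0],
    [0, 1, 1],
])
# Rows are directors, columns are movies.
director_movie = np.array([
    [1, 1, 0],
    [0, 0, 1],
])


def metapath_counts(*steps):
    """Multiply the relation matrices along a meta-path to count its instances."""
    result = steps[0]
    for step in steps[1:]:
        result = result @ step
    return result


# AMA: number of movies shared by each pair of actors.
ama = metapath_counts(actor_movie, actor_movie.T)
# AMAMA: two AMA hops chained together.
amama = metapath_counts(ama, ama)
# AMDMA: actors linked through movies that share a director.
amdma = metapath_counts(actor_movie, director_movie.T, director_movie, actor_movie.T)
```

Each entry (a, b) of the resulting matrix is the number of meta-path instances that start at actor a and end at actor b, so rows that look very different from the rest stand out in a heatmap.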

To better illustrate our method, Fig. 12 shows the results of the rank-3 tensor decomposition and the clustering in a three-dimensional space. As shown in Fig. 12, the abnormal actors are much farther away from the cluster centroid than the rest of the actors.
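The geometric intuition behind Fig. 12 can be sketched as follows. With a single cluster, the k-means centroid is simply the mean of the reference feature vectors, and candidates can be ranked by their distance from it. The Euclidean distance, the function name, and the toy feature values are assumptions for this example; the actual feature vectors come from the tensor decomposition.

```python
import numpy as np


def rank_by_centroid_distance(reference_features, candidate_features):
    """Rank candidates by distance from the single-cluster centroid of the reference set.

    Each row is the rank-R feature vector of one node (R = 3 in Fig. 12).
    Returns candidate indices, most abnormal first, and the distances.
    """
    centroid = reference_features.mean(axis=0)
    distances = np.linalg.norm(candidate_features - centroid, axis=1)
    order = np.argsort(-distances)
    return order, distances


# Toy three-dimensional features (hypothetical values).
reference = np.array([[0.0, 0.0, 0.0], [2.0, 0.0, 0.0]])
candidates = np.array([[1.0, 0.0, 0.0], [1.0, 3.0, 4.0], [1.0, 1.0, 0.0]])
order, distances = rank_by_centroid_distance(reference, candidates)
```

With more than one cluster, a natural extension would be to use the distance to the nearest centroid, although the text above only covers the single-cluster setting.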

5.5. Time complexity

This section provides some analysis on the computational

TABLE 3
TOP-10 RANKED AUTHORS BY QANET

Columns: Ranking; Author's Name; Number of papers; Number of APA Meta-Paths with Faloutsos; Number of APAPA Meta-Paths with Faloutsos; Number of paper's Citation; Number of APVPA Meta-Paths with Faloutsos

Ranking | Author's Name
1 | Philip Russell (Flip) Korn
2 | George Panagopoulos
3 | Ibrakim Kamel
4 | Yi Rong
5 | Caetano Traina Júnior
6 | Robson L. F. Cordeiro
7 | Yi Zhou
8 | Bin Zhang
9 | Lei Li
10 | Wenyao Ho

APA indicates the author-paper-author meta-path, and APAPA and APVPA stand for the author-paper-author-paper-author and author-paper-venue-paper-author meta-paths, respectively.

TABLE 4

TOP-10 RANKED AUTHORS BY NETOUT

Number of

APVPA Meta-

Paths with Fa-

loutsos

Number of pa-

per's Citation

Number of APA-

PA Meta-Paths

with Faloutsos

Number of APA

Meta-Paths with

Faloutsos

Number of papers

Author’s Name

Ranking

799

139

Sebastian B. Thrun

1962

518

124

Venkatesan Guruswami

478

470

Arthur Toga

3757

640

252

K. P. Sycara

667

514

Asim Smailagic

799

558

N. Sadeh

1962

662

192

Dan Siewiorek

478

536

David A. Bader

724

497

112

Douglas W. Oard

1943

534

131

M Hebert

TABLE 5
TOP-10 RANKED AUTHORS BASED ON QANET AND NETOUT WITH CHRISTOS FALOUTSOS AS THE REFERENCE SET

Left three columns: top-10 ranked authors based on QANet. Right three columns: top-10 ranked authors based on NetOut.

QANet Ranking | Author's Name | NetOut Ranking | QANet Ranking | Author's Name | NetOut Ranking
1 | Philip Russell (Flip) Korn | 421 | 196 | Sebastian B. Thrun | 1
2 | George Panagopoulos | 420 | 197 | Venkatesan Guruswami | 2
3 | Ibrakim Kamel | 417 | 151 | Arthur Toga | 3
4 | Yi Rong | 418 | 252 | K. P. Sycara | 4
5 | Caetano Traina Júnior | 408 | 170 | Asim Smailagic | 5
6 | Robson L. F. Cordeiro | 409 | 209 | N. Sadeh | 6
7 | Yi Zhou | 406 | 245 | Dan Siewiorek | 7
8 | Bin Zhang | 407 | 216 | David A. Bader | 8
9 | Lei Li | 422 | 194 | Douglas W. Oard | 9
10 | Wenyao Ho | 395 | 224 | M Hebert | 10

complexity of QANet. QANet consists of several steps. The first step is to extract queries and prepare the reference and candidate sets for the queries. The second step involves running the tensor decomposition method and obtaining the candidate and reference matrices. The final step is to apply the clustering method to the features obtained in the previous steps for the candidate and reference sets. In the first step, the query language provided by Kuck et al. [14] is used. Network

tensor and meta-paths of length 2 can be pre-calculated

offline. Sparse tensor is used in the implementation due

to its rather low time complexity. If the number of nodes

in the reference and candidate sets are  and , respec-

tively, the complexity of obtaining two tensors  and 

is equal to . In the second step for decompos-

ing tensors and obtaining the properties of each node,

according to alternating least squares (ALS) algorithm,

the worst case time complexity is ,

where  is the decomposition rank (usually less than 10)

and  is the number of iterations of the tensor decomposi-

tion method (50 in this manuscript), and  and  are the

number of nodes in the network and the number of meta-

paths with length less than 2, respectively.  is at most

, where  is the node types in the network. In the final

step, according to the k-means clustering method and

calculation of the distance between each candidate node

and the cluster centers, time complexity is equal to

, where  is the number of clustering itera-

tion and  is the number of clusters. Finally, the time

complexity of QANet is .

NetOut, PathSim and CosSim methods have time com-

plexity exponential to the meta-path length. Materializing

neighbor vector requires traversal of the heterogeneous

network, which can be time-consuming when the

speciﬁed meta-path is long or when the degree of the

node of interest is high. The time complexity of these

methods is  when meta-paths of length 2 are

considered. It is more complicated for meta-paths of

higher lengths.

Fig. 11. Adjacency matrix of Actor-Movie-Actor, Actor-Movie-Actor-

Movie-Actor, Actor-Movie-Director-Movie-Actor and Actor-Movie-

Writer-Movie-Actor metapaths for actors of "The Godfather (1972)".

Fig. 12. Illustration of the QANet method. We set the number of clus-

ter to 1 and the decomposition rank to 3 for QANet. R1, R2 and R3

are columns of  respectively. Cluster centroid indicates center of

cluster after clustering phase.

TABLE 6

THE CAST OF "THE GODFATHER (1972)" AND THEIR DETAILS, AS

WELL AS THE RANKING OF QANET AND NETOUT METHODS.

NetOut Ranking

QANet Ranking

Writer

Director

Actor

Actor’s Name

Row

105

Richard Conte

Corrado Gaipa

Morgana King

Lenny Montana

330

James Caan

John Cazale

Sterling Hayden

Talia Shire

150

Abe Vigoda

174

Marlon Brando

Rudy Bond

Richard Bright

Richard S. Castellano

Franco Citti

Salvatore Corsitto

130

Al Pacino

Tony Giorgio

Julie Gregg

Angelo Infanti

268

Robert Duvall

Al Lettieri

Jeannie Linero

Tere Livrano

108

John Marley

Al Martino

John Martino

166

Diane Keaton

Victor Rendina

143

Alex Rocco

Gianni Russo

Vito Scotti

Ardell Sheridan

Simonetta Stefanelli

Saro Urzì

The Actor column indicates the number of movies in which each person acted, the Director column the number of movies they directed, and the Writer column the number of movies they wrote.

6 CONCLUSION

In this paper, we proposed a tensor decomposition method called QANet for detecting query-based anomalies in heterogeneous information networks. QANet considers all the different aspects of the communication structure and uses the PARAFAC tensor decomposition method to create a model, which is then used to rank the candidate nodes based on their abnormality (dissimilarity) with respect to those in the reference set. Because labeled data are lacking, we introduced a model to create synthetic heterogeneous information networks and tested the effectiveness of the proposed anomaly detection algorithm for various query types. We also compared the performance of QANet with state-of-the-art algorithms including NetOut, PathSim, and CosSim. The experiments showed that QANet predicts abnormal nodes more accurately than the other algorithms. We also compared QANet and NetOut on two real datasets: a bibliographic network and the Internet Movie Database (IMDb). The results revealed that the ranking provided by QANet is more reasonable than the one provided by NetOut. This is mainly because QANet considers all meta-paths of the candidate and reference nodes with other network nodes, whereas NetOut considers only the symmetric meta-paths between the candidate and reference nodes and ignores the relationships of these nodes with other network nodes. QANet thus takes more global information into account when building the similarity metrics for anomaly detection. Future directions for the research introduced here include evaluating our approach on temporal networks and using it for event detection in time-evolving networks.

ACKNOWLEDGMENT

This work was supported in part by a grant from IPM (No. CS1396-4-49) and in part by the Iran National Science Foundation (INSF) (Grant No. 96001338). Mahdi Jalili is supported by the Australian Research Council through project No. DP170102303. Mostafa Salehi is the corresponding author of this paper.

REFERENCES

[1] A.-L. Barabási, Network science: Cambridge University Press,

2016.

[2] C. Shi, Y. Li, J. Zhang, Y. Sun, and P. S. Yu, "A survey of heterogeneous information network analysis," IEEE Transactions on Knowledge and Data Engineering, vol. 29, 2017, pp. 17-37.

[3] Y. Sun and J. Han, "Mining heterogeneous information net-

works: a structural analysis approach," ACM SIGKDD Explora-

tions Newsletter, vol. 14, 2013, pp. 20-28.

[4] M. Jalili, Y. Orouskhani, M. Asgari, N. Alipourfard, and M.

Perc, "Link prediction in multilayer online social networks,"

Royal Society Open Science, vol. 4, 2017, p. 160863.

[5] M. Kivelä, A. Arenas, M. Barthelemy, J. P. Gleeson, Y. Moreno,

and M. A. Porter, "Multilayer networks," Journal of Complex

Networks, vol. 2, 2014, pp. 203-271.

[6] S. Molaei, S. Babei, M. Salehi, and M. Jalili, "Information Spread and Topic Diffusion in Heterogeneous Information Networks," Scientific Reports, vol. 8, 2018, p. 9549.

[7] M. Salehi, R. Sharma, M. Marzolla, M. Magnani, P. Siyari, and D. Montesi, "Spreading processes in Multilayer Networks," IEEE Transactions on Network Science and Engineering (TNSE), vol. 2, 2015, pp. 65-83.

[8] V. Chandola, A. Banerjee, and V. Kumar, "Anomaly detection

for discrete sequences: A survey," IEEE Transactions on

Knowledge and Data Engineering, vol. 24, 2012, pp. 823-839.

[9] S. Shehnepoor, M. Salehi, R. Farahbakhsh, and N. Crespi, "NetSpam: A network-based spam detection framework for reviews in online social media," IEEE Transactions on Information Forensics and Security, vol. 12, 2017, pp. 1585-1595.

[10] O. Salem, A. Guerassimov, A. Mehaoua, A. Marcus, et al.,

"Anomaly Detection in Medical Wireless Sensor Networks us-

ing SVM and Linear Regression Models," International Journal of

E-Health and Medical Communications, vol. 5, 2016, pp. 20-45.

[11] I. Brahmi, S. B. Yahia, and P. Poncelet, "MAD-IDS: novel intru-

sion detection system using mobile agents and data mining ap-

proaches," in Pacific-Asia Workshop on Intelligence and Security In-

formatics, 2010, pp. 73-76.

[12] V. Chandola, A. Banerjee, and V. Kumar, "Anomaly detection:

A survey," ACM computing surveys (CSUR), vol. 41, 2009, p. 15.

[13] L. Akoglu, H. Tong, and D. Koutra, "Graph based anomaly

detection and description: a survey," Data Mining and Knowledge

Discovery, vol. 29, 2015, pp. 626-688.

[14] J. Kuck, H. Zhuang, X. Yan, H. Cam, and J. Han, "Query-based

outlier detection in heterogeneous information networks," in

Advances in database technology: proceedings. International Confer-

ence on Extending Database Technology, 2015, p. 325.

[15] E. Eskin, "Anomaly detection over noisy data using learned

probability distributions," in Proceedings of the International Con-

ference on Machine Learning, 2000.

[16] A. M. Kosek, "Contextual anomaly detection for cyber-physical

security in Smart Grids based on an artificial neural network

model," in 2016 Joint Workshop on Cyber-Physical Security and Re-

silience in Smart Grids, 2016.

[17] B. Shah and B. H. Trivedi, "Reducing features of KDD CUP

1999 dataset for anomaly detection using back propagation

neural network," in Fifth International Conference on Advanced

Computing & Communication Technologies, 2015, pp. 247-251.

[18] W. Hu, Y. Liao, and V. R. Vemuri, "Robust anomaly detection

using support vector machines," in Proceedings of the internation-

al conference on machine learning, 2003, pp. 282-289.

[19] J. Liu, S. Chen, Z. Zhou, and T. Wu, "An Anomaly Detection

Algorithm of Cloud Platform Based on Self-Organizing Maps,"

Mathematical Problems in Engineering, vol. 1, 2016, pp. 1-9.

[20] E. De la Hoz, E. De La Hoz, A. Ortiz, J. Ortega, et al., "PCA fil-

tering and probabilistic SOM for network intrusion detection,"

Neurocomputing, vol. 164, 2015, pp. 71-81.

[21] D. J. Miller and J. Browning, "A mixture model and EM-based

algorithm for class discovery, robust classification, and outlier

rejection in mixed labeled/unlabeled data sets," IEEE Transac-

tions on Pattern Analysis and Machine Intelligence, vol. 25, 2003,

pp. 1468-1483.

[22] W. Jin, A. K. Tung, and J. Han, "Mining top-n local outliers in

large databases," in Proceedings of the seventh ACM SIGKDD in-

ternational conference on Knowledge discovery and data mining,

2001, pp. 293-298.

[23] A. Ghoting, S. Parthasarathy, and M. E. Otey, "Fast mining of

distance-based outliers in high-dimensional datasets," Data

Mining and Knowledge Discovery, vol. 16, 2008, pp. 349-364.

[24] V. Hautamäki, I. Kärkkäinen, and P. Fränti, "Outlier Detection

Using k-Nearest Neighbour Graph," ICPR (3), 2004, pp. 430-433.

[25] P. Sun, S. Chawla, and B. Arunasalam, "Mining for Outliers in

Sequential Databases," in SDM, 2006, pp. 94-105.

[26] M. Radovanović, A. Nanopoulos, and M. Ivanović, "Reverse nearest neighbors in unsupervised distance-based outlier detection," IEEE Transactions on Knowledge and Data Engineering, vol. 27, 2015, pp. 1369-1382.

[27] J. Laurikkala, M. Juhola, E. Kentala, N. Lavrac, et al., "Informal

identification of outliers in medical data," in Fifth International

Workshop on Intelligent Data Analysis in Medicine and Pharmacolo-

gy, 2000, pp. 20-24.

[28] B. Rosner, "Percentage points for a generalized ESD many-

outlier procedure," Technometrics, vol. 25, 1983, pp. 165-172.

[29] R. D. Gibbons, D. K. Bhaumik, and S. Aryal, Statistical methods

for groundwater monitoring vol. 59: John Wiley & Sons, 2009.

[30] B. Abraham and G. E. Box, "Bayesian analysis of some outlier

problems in time series," Biometrika, vol. 66, 1979, pp. 229-236.

[31] P. J. Rousseeuw and A. M. Leroy, Robust regression and outlier

detection vol. 589: John Wiley & Sons, 2005.

[32] X. Song, M. Wu, C. Jermaine, and S. Ranka, "Conditional anom-

aly detection," IEEE Transactions on Knowledge and Data Engi-

neering, vol. 19, 2007, pp. 631-645.

[33] Y. Du, R. Zhang, and Y. Guo, "A Useful Anomaly Intrusion

Detection Method Using Variablelength Patterns and Average

Hamming Distance," in JCP, vol. 8, 2010, pp. 1219-26.

[34] M. Li and P. Vitányi, An introduction to Kolmogorov complexity

and its applications: Springer Science & Business Media, 2009.

[35] S. Wu and S. Wang, "Information-theoretic outlier detection for

large-scale categorical data," IEEE transactions on knowledge and

data engineering, vol. 25, 2013, pp. 589-602.

[36] C. C. Noble and D. J. Cook, "Graph-based anomaly detection," in Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, ACM, 2003.

[37] M. Gupta, J. Gao, X. Yan, H. Cam, et al., "Top-k interesting sub-

graph discovery in information networks," in IEEE 30th Interna-

tional Conference on Data Engineering, 2014, pp. 820-831.

[38] H. Zhuang, J. Zhang, G. Brova, J. Tang, et al., "Mining query-

based subnetwork outliers in heterogeneous information net-

works," in 2014 IEEE International Conference on Data Mining,

2014, pp. 1127-1132.

[39] Y. Sun, J. Han, X. Yan, P. S. Yu, et al., "Pathsim: Meta path-based

top-k similarity search in heterogeneous information net-

works," Proceedings of the VLDB Endowment, vol. 4, 2011, pp.

992-1003.

[40] S. Ranshous, S. Shen, D. Koutra, S. Harenberg, et al., "Anomaly

detection in dynamic networks: a survey," Wiley Interdiscipli-

nary Reviews: Computational Statistics, vol. 7, 2015, pp. 223-247.

[41] H. Fanaee-T and J. Gama, "Tensor-based anomaly detection: An interdisciplinary survey," Knowledge-Based Systems, vol. 98, 2016, pp. 130-147.

[42] H.-H. Mao, C.-J. Wu, E. E. Papalexakis, C. Faloutsos, et al.,

"Malspot: multi2 malicious network behavior patterns analy-

sis," Advances in Knowledge Discovery and Data Mining, Springer,

2014, pp. 1–14.

[43] L. Akoglu and C. Faloutsos, "Event detection in time series of mobile communication graphs," in Army Science Conference, 2010.

[44] K. Maruhashi, F. Guo, and C. Faloutsos, "Multiaspectforensics: Pattern mining on large-scale heterogeneous networks with tensor analysis," in Advances in Social Networks Analysis and Mining (ASONAM), IEEE, 2011.

[45] E. E. Papalexakis, K. Pelechrinis, and C. Faloutsos, "Spotting misbehaviors in location-based social networks using tensors," in Proceedings of the 23rd International Conference on World Wide Web, ACM, 2014.

[46] D. Koutra, E. E. Papalexakis, and C. Faloutsos, "Tensorsplat: Spotting latent anomalies in time," in Informatics (PCI), 16th Panhellenic Conference, IEEE, 2012.

[47] I. Jeon, E. E. Papalexakis, U. Kang, and C. Faloutsos, "Haten2:

Billion-scale tensor decompositions," in 2015 IEEE 31st Interna-

tional Conference on Data Engineering, 2015, pp. 1047-1058.

[48] R. Bro, "PARAFAC. Tutorial and applications," Chemometrics

and intelligent laboratory systems, vol. 38, 1997, pp. 149-171.

[49] P. M. Kroonenberg and J. De Leeuw, "Principal component

analysis of three-mode data by means of alternating least

squares algorithms," Psychometrika, vol. 45, 1980, pp. 69-97.

[50] J. Tang, J. Zhang, L. Yao, J. Li, et al., "Arnetminer: extraction and

mining of academic social networks," in Proceedings of the 14th

ACM SIGKDD, 2008, pp. 990-998.

[51] T. Y. Lin, Y. Y. Yao, and L. A. Zadeh, Data Mining, Rough Sets

and Granular Computing: Physica-Verlag HD, 2013.

Vahid Ranjbar. Received the B.S. and the M.S. degree

in information technology in 2011 and 2013 respectively.

He is currently working toward the Ph.D. degree in the

University of Tehran, Iran. His research interests include

network science and data mining.

Mostafa Salehi. completed his PhD studies in Computer

Engineering at Sharif University of Technology, Iran in

2012. On 2013, he joined University of Tehran as an

assistant professor. His research interests include net-

work science and multimedia networks.

Pegah Jandaghi received the B.S. degree in Computer Engineering and Mathematics and Applications in 2017. She is currently working toward the M.S. degree at the University of Southern California, USA. Her research interests include information networks and data mining.

Mahdi Jalili (Member 2009, Senior Member 2016) received the Ph.D. degree from EPFL, Switzerland, in 2008. He is now a senior lecturer with the School of Engineering, RMIT University, Melbourne, Australia. His research interests are in network science, dynamical systems, and data mining.

