On Filter Size in Graph Convolutional Networks
Dinh V. Tran, Nicolò Navarin, Alessandro Sperduti
Department of Mathematics, University of Padova, Italy
{dinh, nnavarin, sperduti}@math.unipd.it
Bioinformatics Group, Department of Computer Science, University of Freiburg, Germany
dinh@informatik.uni-freiburg.de
School of Computer Science, University of Nottingham, United Kingdom
nicolo.navarin@nottingham.ac.uk
Abstract—Recently, many researchers have been focusing on the definition of neural networks for graphs. The basic component for many of these approaches remains the graph convolution idea proposed almost a decade ago. In this paper, we extend this basic component, following an intuition derived from the well-known convolutional filters over multi-dimensional tensors. In particular, we derive a simple, efficient and effective way to introduce a hyper-parameter on graph convolutions that influences the filter size, i.e. its receptive field over the considered graph. We show with experimental results on real-world graph datasets that the proposed graph convolutional filter improves the predictive performance of Deep Graph Convolutional Networks.
Index Terms—graphs, deep learning for graphs, graph convolution, convolutional neural networks for graphs.
I. INTRODUCTION
Graphs are a common and natural way to represent many real-world data: in chemistry, for example, a compound can be represented by its molecular graph, while in social networks the relationships between users are represented as edges in a graph whose nodes are the users. Many computational tasks involving such graph representations require machine learning, such as the classification of active/non-active drugs or the prediction of future links between users in a social network.
State-of-the-art machine learning techniques for classification
and regression on graphs are at the moment kernel machines
equipped with specifically designed kernels for graphs (e.g., [1]–[3]). Although there are examples of kernels for structures that can be designed on the basis of a training set [4]–[6], most of the more efficient and effective graph kernels are based on predefined structural features, i.e., feature definition is not part of the learning process.
There has been a recent shift of trend from kernels to neural networks for graphs. Unlike kernels, in neural networks the features are defined by a learning process that is supervised by the graphs' labels (targets). Many approaches have addressed the problem of defining neural networks for graphs [7]. However, one of the core components, the graph convolution, has not changed much with respect to the earlier works [8], [9].
In this paper, we work on the re-design of this basic
component. We propose a new formulation for the graph
convolution operator that is strictly more general than the
existing one. Our proposal can in principle be applied to virtually all the techniques based on graph convolutions.
The paper is organized as follows. We start in Section II
with some basic definitions and notation. In Section III, we
provide an overview over the various proposals of graph
convolution available in literature. In Section IV we detail our
proposed parametric graph convolutional filter. In Section V
we discuss other related works that are not based on graph
convolution, including some alternative graph neural network
architectures and graph kernels. In Section VI we report our
experimental results. Finally, Section VII concludes the paper.
II. NOTATION AND DEFINITIONS
We denote matrices with bold uppercase letters, vectors with uppercase letters, and variables with lowercase letters. Given a matrix M, M_i denotes the i-th row of the matrix, and m_{ij} is the element in the i-th row and j-th column. Given the vector V, v_i refers to its i-th element.
Let us consider a graph G = (V_G, E_G, X_G), where V_G = {v_1, ..., v_n} is the set of vertices (or nodes), E_G ⊆ V_G × V_G is the set of edges, and X_G ∈ R^{n×d} is a node label matrix, where each row is the label (a vector of size d) associated to each vertex v_i ∈ V_G, i.e. X^G_i = (x_{i,0}, ..., x_{i,d}). Note that, in this paper, we will not consider edge labels. When the reference to the graph G is clear from the context, for the sake of notation we discard the superscript referring to the specific graph. We define the adjacency matrix A ∈ R^{n×n} as a_{ij} = 1 if (i, j) ∈ E, and 0 otherwise. We also define the neighborhood of a vertex v as the set of vertices connected to v by an edge, i.e. N(v) = {u | (v, u) ∈ E}. Note that N(v) is also the set of nodes at shortest-path distance exactly one from v, i.e. N(v) = {u | sp(v, u) = 1}, where sp is a function computing the shortest-path distance between two nodes in a graph.
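To ground this notation, the following short sketch (ours, in Python/NumPy; the toy edge set is purely illustrative) builds the adjacency matrix A of a small undirected graph and the neighborhood function N(v):

```python
import numpy as np

# Toy undirected graph with n = 4 vertices and an illustrative edge set E.
n = 4
E = [(0, 1), (1, 2), (2, 3), (0, 2)]

# Adjacency matrix: a_ij = 1 if (i, j) is an edge, 0 otherwise.
A = np.zeros((n, n), dtype=int)
for i, j in E:
    A[i, j] = 1
    A[j, i] = 1  # undirected graph

def neighborhood(A, v):
    """N(v): the vertices at shortest-path distance exactly 1 from v."""
    return {int(u) for u in np.flatnonzero(A[v])}

print(neighborhood(A, 2))  # {0, 1, 3}
```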
In this paper, we consider the problem of graph classification. Given a dataset composed of N pairs {(G_i, y_i) | 1 ≤ i ≤ N}, the task is then, given an unseen graph G, to predict its correct target y.
III. GRAPH CONVOLUTIONS
The first definition of neural network for graphs has been
proposed in [10]. More recent models have been proposed in
[8], [9]. Both works are based on an idea that has been re-
branded later as graph convolution.
The idea is to define the neural architecture following
the topology of the graph. Then a transformation is per-
formed from the neurons corresponding to a vertex and its
neighborhood to a hidden representation, that is associated to
the same vertex (possibly in another layer of the network).
This transformation depends on some parameters, that are
shared among all the nodes. In the following, for the sake
of simplicity we ignore the bias terms.
In [9], when considering non-positional graphs, i.e. the most common definition, and the one we are considering in this paper, a transition function on a graph node v at time t ≥ 0 is defined as:

H^{t+1}_v = Σ_{u ∈ N(v)} f_Θ(H^t_u, X_v, X_u),    (1)
where f_Θ is a parametric function whose parameters Θ have to be learned (e.g. a neural network) and are shared among all the vertices. Note that, if edge labels are available, they can be included in eq. (1). In fact, in the original formulation, f_Θ also depends on the label of the edge between v and u. This
transition function is part of a recurrent system. It is defined as a contraction mapping, thus the system is guaranteed to converge to a fixed point, i.e. a representation, that does not depend on the particular initialization of the state matrix H^0. The output is computed from the last representation and the original node labels as follows:

O^t_v = g_{Θ'}(H^t_v, X_v),    (2)

where g_{Θ'} is another neural network. [11] extends the work
in [9] by removing the constraint for the recurrent system to
be a contraction mapping, and replacing the recurrent units
with GRUs. However, recently it has been shown in [12] that
stacked graph convolutions are superior to graph recurrent
architectures in terms of both accuracy and computational cost.
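As an illustration of the recurrent scheme of eq. (1), the following sketch (ours, not the implementation of [9] or [11]) iterates the transition function until approximate convergence; the linear f_Θ used here is only a placeholder and must be chosen so that the map is a contraction:

```python
import numpy as np

def gnn_fixed_point(A, X, W_h, W_x, n_iter=50, tol=1e-6):
    """Iterate H_v <- sum_{u in N(v)} f(H_u, X_v, X_u) until approximate convergence.

    f is a placeholder linear map f(H_u, X_v, X_u) = H_u @ W_h + X_u @ W_x;
    W_h must make the update a contraction (e.g. small spectral norm).
    """
    n = A.shape[0]
    H = np.zeros((n, W_h.shape[1]))
    for _ in range(n_iter):
        H_new = A @ (H @ W_h + X @ W_x)   # sum over the neighbours of each node
        if np.max(np.abs(H_new - H)) < tol:
            return H_new
        H = H_new
    return H
```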
In [8], a model referred to as Neural Network for Graphs
(NN4G) is proposed. In the first layer, a transformation over
node labels is computed:
ĥ^1_v = f( Σ_{j=1}^{d} w̄_{1,j} x_{v,j} ),    (3)

where W̄_1 are the weights connecting the original labels X to the current neuron, and 1 ≤ v ≤ n is the vertex index. The graph convolution is then defined for the (i+1)-th layer (for i > 0) as:
ĥ^{i+1}_v = f( Σ_{j=1}^{d} w̄_{i+1,j} x_{v,j} + Σ_{k=1}^{i} ŵ_{i+1,k} Σ_{u ∈ N(v)} ĥ^k_u ),    (4)

where Ŵ_{i+1} are the (shared) weights connecting the previous hidden layers to the current neuron. Note that in this formulation, skip connections are present, to the (i+1)-th layer, from layer
1 to layer i. There is an interesting recent work about the parallel between skip-connections (residual networks in that case) and recurrent networks [13]. However, since in the formulation in eq. (4) every layer is connected to all the subsequent layers, it is not possible to reduce it to a (vanilla) recurrent model.

Fig. 1: Graph convolution as described in [8], and adopted with some variations by many state-of-the-art Graph Convolutional neural networks.
Let us consider the (i+1)-th graph convolutional layer, which comprises c_{i+1} graph convolutional filters. We can rewrite eq. (4) for the whole layer as:

H_{i+1} = f( X W̄_{i+1} + Σ_{k=1}^{i} A H_k Ŵ_{i+1,k} ),    (5)

where i = 0, ..., l−1 (and l is the number of layers), W̄_{i+1} ∈ R^{d×c_{i+1}}, Ŵ_{i+1,k} ∈ R^{c_k×c_{i+1}}, H_k ∈ R^{n×c_k}, c_i is the size of the hidden representation at the i-th layer, and f is applied element-wise.
An abstract representation of eq. (4) is depicted in Figure 1.
The convolution in eq. (4) is part of a multi-layer architecture,
where each layer’s connectivity resembles the topology of the
graph, and the training is layer-wise. Finally, for each graph,
NN4G computes the average graph node representation for
each hidden layer, and concatenates them. This is the graph
representation computed by NN4G, and it can be used for
the final prediction of graph properties with a standard output
layer.
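To make eqs. (4)–(5) concrete, here is a NumPy sketch (ours, with hypothetical variable names) of one NN4G layer, including the skip connections from the input labels and from all previously computed hidden layers:

```python
import numpy as np

def nn4g_layer(A, X, H_prev, W_bar, W_hat_list, f=np.tanh):
    """One NN4G graph convolution as in eq. (5).

    A          : n x n adjacency matrix
    X          : n x d node-label matrix
    H_prev     : list of hidden matrices H_1, ..., H_i from the previous layers
    W_bar      : d x c_{i+1} weights on the original labels
    W_hat_list : list of c_k x c_{i+1} weights, one per previous layer (skip connections)
    """
    out = X @ W_bar
    for H_k, W_hat in zip(H_prev, W_hat_list):
        out = out + A @ H_k @ W_hat   # aggregate neighbours' hidden states of layer k
    return f(out)
```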
In [14], a hierarchical approach has been proposed. This
method is similar to NN4G and is inspired by circular fin-
gerprints in chemical structures. While [8] adopts Cascade-
Correlation for training, [14] uses an end-to-end back-
propagation. ECC [15] proposes an improvement of [14],
weighting the sum over the neighbors of a node by weights
conditioned by the edge labels. We consider this last version
as a baseline in our experiments.
Recently, [16] derives a graph convolution that closely resembles eq. (4). Let us, from now on, consider H_0 = X. Motivated by a first-order approximation of localized spectral filters on graphs, the proposed graph convolutional filter looks like:

H_{i+1} = f( D̃^{-1/2} Ã D̃^{-1/2} H_i W_i ),    (6)

where Ã = A + I, d̃_{ii} = Σ_j ã_{i,j}, and f is any activation function applied element-wise.
If we ignore the terms D̃^{-1/2} (that in practice act as a normalization), it is easy to see that eq. (6) is very similar to eq. (5), the difference being that there are no skip connections in this case, i.e. the (i+1)-th layer is connected just to the i-th layer. Consequently, we just have to learn one weight matrix per layer.
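For reference, a minimal NumPy sketch of the propagation rule of eq. (6), under the stated definitions Ã = A + I and D̃ the corresponding degree matrix (ours; the comment shows how the random-walk normalization of eq. (7) below would be obtained instead):

```python
import numpy as np

def gcn_layer(A, H, W, f=np.tanh):
    """Graph convolution of eq. (6): f(D^{-1/2} (A + I) D^{-1/2} H W)."""
    n = A.shape[0]
    A_tilde = A + np.eye(n)
    d_tilde = A_tilde.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d_tilde))
    P = D_inv_sqrt @ A_tilde @ D_inv_sqrt
    # Random-walk normalization of eq. (7) instead:
    # P = np.diag(1.0 / d_tilde) @ A_tilde
    return f(P @ H @ W)
```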
In [17], a slightly more complex model compared to [16]
is proposed. This model shows the highest predictive perfor-
mance with respect to the other methods presented in this
section. The first layers of the network are again stacked graph
convolutional layers, defined as follows:
H_{i+1} = f( D̃^{-1} Ã H_i W_i ),    (7)

where H_0 = X and Ã = A + I. Note that in the previous equation, we compute the representation of all the nodes in the graph at once. The difference between eq. (7) and eq. (6) is the propagation scheme used for the nodes' representations: eq. (6) is based on the normalized graph Laplacian, while eq. (7) is based on the random-walk graph Laplacian. In [17], the authors state that the choice of normalization does not significantly affect the results. In fact, both equations can be seen as first-order approximations of the polynomially parameterized spectral graph convolution.
[17], three graph convolutional layers are stacked. The graph
convolutions are followed by a concatenation layer that merges
the representations computed by each graph convolutional
layer. Then, differently from previous approaches, the paper introduces a SortPooling layer, which selects a fixed number of node representations and computes the output from them by stacking 1D convolutional layers and dense layers. This is the same network architecture that we consider in this paper.
A. SortPooling layer
After stacking some graph convolution layers, we need a mechanism to predict the target for the graph, starting from its node encodings. Ideally, this mechanism should be applicable to
graphs with variable number of vertices. Instead of averaging
the node representations, [17] proposes to solve this issue with
the SortPooling layer.
Let us assume that, for each node, the encoding produced by the i-th graph convolution layer has size c. Let us consider the output of the last graph convolution (or concatenation) layer to be H_l ∈ R^{n×c}, where each row is a vertex's feature descriptor and each column is a feature channel. The output of the SortPooling layer is a k×c tensor, where k is a user-defined integer.
In the SortPooling layer, the rows of H_l are sorted lexicographically (possibly starting from the last column). We can
see the output of the graph convolutional layer as continuous
WL colors, and thus we are sorting all the vertices according
to these colors. This way, a consistent ordering is imposed for
graph vertices, making it possible to train traditional neural
networks on the sorted graph representations.
In addition to sorting vertex features in a consistent order, the other function of SortPooling is to unify the sizes of the output tensors. After sorting, we truncate/extend the output tensor in the first dimension from n to k, so that graphs with different numbers of vertices all obtain representations of size k. This is done by deleting the last n−k rows if n > k, or adding k−n zero rows if n < k.
Note that if two vertices have the same hidden representa-
tion, it doesn’t matter which node we pick because the output
of the SortPooling layer would be exactly the same.
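The behaviour described above can be sketched as follows (ours, not the DGCNN implementation): rows are sorted lexicographically starting from the last feature channel, then truncated or zero-padded to exactly k rows:

```python
import numpy as np

def sort_pooling(H, k):
    """SortPooling: sort node rows lexicographically (last channel as primary key),
    then truncate or zero-pad to k rows, yielding a k x c output."""
    n, c = H.shape
    # np.lexsort uses the last key as the primary sort key, so pass the columns
    # in their natural order; reverse for a descending sort.
    order = np.lexsort(tuple(H[:, j] for j in range(c)))[::-1]
    H_sorted = H[order]
    if n >= k:
        return H_sorted[:k]
    return np.vstack([H_sorted, np.zeros((k - n, c))])
```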
IV. PARAMETRIC GRAPH CONVOLUTIONS
A straightforward generalization of eq. (7) would be defined on the powers of the adjacency matrix, i.e. on random walks [18]. This would introduce tottering in the learned representation, which is not considered to be beneficial in general. We decided to follow another approach, based on shortest paths. As mentioned before, the adjacency matrix A of a graph can be seen as the matrix of the shortest paths of length 1, i.e.

a_{i,j} = sp^1_{i,j} = 1 if sp(i, j) = 1, and 0 otherwise.    (8)

Moreover, the identity matrix I is the matrix of the shortest paths of length 0 (assuming that each node is at distance zero from itself), i.e. I = SP^0. Note also that Ã = SP^0 + SP^1.
By means of this new notation, we can rewrite eq. (7) as:

H_{l+1} = f( D̃^{-1} (SP^0 + SP^1) H_l W_l ).    (9)
Let us now define d̂^r_{ii} = Σ_j sp^r_{i,j}. We can now extend our reasoning and define our graph convolution layer, parameterized by r. In our contribution, we decided to process information in a slightly different way with respect to eq. (9). Instead of summing the contributions of the SP matrices, we decided to keep the contributions of the nodes at different shortest-path distances separated. This is equivalent to defining multiple graph convolutional filters, one for each shortest-path distance. We define the Parametric Graph Convolution as:

H^{r,l+1} = ∥_{j=0}^{r} f( (D̂^j)^{-1} SP^j H^l W^{j,l} ),    (10)

where ∥ is the concatenation operator. Note that with our formulation, we have a different W^{j,l} matrix for each layer l and for each shortest-path distance j. Moreover, as mentioned before, we are concatenating the information and not summing it, explicitly keeping the contributions of the different distances separated. This approach follows the network-in-network idea [19]. In our case, at each layer, we are effectively applying, at the same time, r+1 convolutions (one for each shortest-path distance) and concatenating their outputs. Let us fix a parameter controlling the number of filters for the l-th layer, say c_l, and a value for the hyper-parameter r; then we have H^{r,l+1} ∈ R^{n×r·c_l}.
Fig. 2: The proposed Parametric Graph Convolution. The parameter r controls the maximum distance of the considered neighborhood, and the dimensionality of the output (the panels illustrate the cases r = 0, r = 1, and r = 2).
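The following NumPy sketch (ours, with hypothetical names) applies one parametric graph convolution as in eq. (10), given the precomputed shortest-path matrices SP^0, ..., SP^r; a possible way to build those matrices is sketched in Section IV-B below. Each distance j has its own weight matrix and normalization, and the r+1 outputs are concatenated along the feature dimension:

```python
import numpy as np

def parametric_graph_conv(SP, H, W_list, f=np.tanh):
    """Parametric graph convolution, eq. (10).

    SP     : list of n x n 0/1 matrices, SP[j][i, u] = 1 iff sp(i, u) == j, j = 0..r
    H      : n x c_l input node representations
    W_list : list of r + 1 weight matrices, one per shortest-path distance
    """
    outputs = []
    for SP_j, W_j in zip(SP, W_list):
        d_j = SP_j.sum(axis=1).astype(float)
        d_j[d_j == 0] = 1.0                       # avoid division by zero
        D_j_inv = np.diag(1.0 / d_j)
        outputs.append(f(D_j_inv @ SP_j @ H @ W_j))
    return np.concatenate(outputs, axis=1)        # concatenation, not sum
```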
A. Receptive field
It has been shown in [16], [17] that with the standard definition of graph convolution, e.g. the ones in eq. (6) and eq. (7), the receptive field of a graph convolutional filter at layer l corresponding to the vertex v is N^l(v). This draws an interesting parallel with the Weisfeiler-Lehman graph kernel (see Section V-A), where intuitively the number of WL iterations is equivalent to the number of stacked graph convolution layers in the architecture.
In our proposed parametric graph convolution in eq. (10), the parameter r directly influences the considered neighborhood in the graph convolutional filter (and the number of output channels, since we concatenate the outputs of the convolutions for all j ≤ r). It is easy to see that, by definition, the receptive field of a graph convolutional filter parameterized by r and applied to the vertex v includes all the nodes at shortest-path distance at most r from v. When we stack multiple layers of our parametric graph convolution, the receptive field grows in the same way. The receptive field of a parametric graph convolutional filter of size r at layer l applied to the vertex v then includes all the vertices at shortest-path distance at most l·r from v.
B. Computational complexity
Equation (10) requires computing the all-pairs shortest paths up to a fixed length r. While computing the unbounded all-pairs shortest paths for a graph with n nodes requires O(n^3) time, if the maximum length is small enough it is possible to implement this with one depth-limited breadth-first visit starting from each node, with an overall complexity of O(mr), where m is the number of edges in the graph.
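A possible construction of the SP^j matrices along the lines of this argument (our sketch): one breadth-first visit per source node, truncated at depth r:

```python
from collections import deque
import numpy as np

def shortest_path_matrices(A, r):
    """Return SP[0], ..., SP[r], where SP[j][v, u] = 1 iff sp(v, u) == j.

    One depth-limited BFS per source node; overall O(m * r) for a graph
    with m edges when r is a small constant.
    """
    n = A.shape[0]
    neighbours = [np.flatnonzero(A[v]) for v in range(n)]
    SP = [np.zeros((n, n), dtype=int) for _ in range(r + 1)]
    for v in range(n):
        dist = {v: 0}
        queue = deque([v])
        while queue:
            u = queue.popleft()
            if dist[u] == r:
                continue                  # depth-limited: do not expand further
            for w in neighbours[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        for u, d in dist.items():
            SP[d][v, u] = 1
    return SP
```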
V. RELATED WORKS
Besides the approaches based on graph convolutions pre-
sented in Section III, there are some other methods in literature
to process graphs with neural networks.
For instance, [20] defined an attention mechanism to prop-
agate information between the nodes in a graph. The basic
idea is the definition of an external network that, given two
neighboring nodes, outputs an attention weight for that specific
edge. A shared attentive mechanism a : R^d × R^d → R computes the attention coefficients

e_{v,u} = a_Θ(W X_v, W X_u),    (11)

that indicate the importance of node u's features to node v. Here, a_Θ is a parametric function, which in the original paper is a single-layer feed-forward network parameterized by the vector Θ ∈ R^{2d}. The information about the graph structure is injected into the mechanism by performing masked attention, i.e. e_{v,u} is only computed for nodes u ∈ N(v). To make the coefficients easily comparable across different nodes, a softmax function is used:

b_{v,u} = softmax_u(e_{v,u}) = exp(e_{v,u}) / Σ_{k ∈ N(v)} exp(e_{v,k}).    (12)

Once obtained, the normalized attention coefficients are used to compute a linear combination of the corresponding features, which serves as the final output feature of every node (after potentially applying a point-wise nonlinearity f):

H_v = f( Σ_{u ∈ N(v)} b_{v,u} W X_u ).    (13)
To stabilize the learning process of self-attention, the authors propose extending the mechanism to employ multi-head attention (K different attention weights per edge). For the last layer, the authors employ averaging, and delay applying the final nonlinearity (usually a softmax or logistic sigmoid for classification problems) until then.
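A compact single-head sketch of the masked attention of eqs. (11)–(13) (ours; the concatenation-based scorer below is a simple stand-in for the single-layer network a_Θ of the original paper):

```python
import numpy as np

def gat_layer(A, X, W, theta, f=np.tanh):
    """Single-head graph attention in the spirit of eqs. (11)-(13).

    A     : n x n adjacency matrix (defines the attention mask)
    X     : n x d node features
    W     : d x d' linear transformation
    theta : vector of size 2*d' parameterizing the scoring function
    """
    Z = X @ W                                            # W X_v for every node
    H = np.zeros_like(Z)
    for v in range(A.shape[0]):
        neigh = np.flatnonzero(A[v])
        if neigh.size == 0:
            continue
        # e_{v,u}: score on the concatenation of the two transformed features
        e = np.array([theta @ np.concatenate([Z[v], Z[u]]) for u in neigh])
        b = np.exp(e - e.max())
        b = b / b.sum()                                  # softmax over the neighbourhood
        H[v] = f((b[:, None] * Z[neigh]).sum(axis=0))    # weighted combination, eq. (13)
    return H
```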
This technique has been applied to node classification only,
and its complexity (due to implementation issues) is high. In
principle, the same approach as in NN4G can be adopted to
generate graph-level representations and predictions for this
model.
[21] (PSCN) proposes another interpretation of graph
convolution. Given a graph, it first selects the nodes on which the convolutional filter has to be centered. Then, it selects a fixed number of vertices from each node's neighborhood, and infers an
order on them. This ordering constraint limits the flexibility of
the approach because learning a consistent order is difficult,
and the number of nodes in the convolutional filter has to be
fixed a-priori.
Diffusion CNN (DCNN) [22] is based on the principle of
heat diffusion (on graphs). The idea is to map each node and its labels to the result of a diffusion process that begins at that node.
TABLE I: Summary of employed graph datasets

Dataset      | MUTAG | PTC   | NCI1  | PROTEINS | D&D    | COLLAB | IMDB-B | IMDB-M
#Nodes (Max) | 28    | 109   | 111   | 620      | 5748   | 492    | 136    | 89
#Nodes (Avg) | 17.93 | 25.56 | 29.87 | 39.06    | 284.32 | 74.49  | 19.77  | 13.00
#Graphs      | 188   | 344   | 4110  | 1113     | 1178   | 5000   | 1000   | 1500
A. Graph Kernels
Kernel methods define the model as a linear classifier in a Reproducing Kernel Hilbert Space, that is, the space implicitly defined by a kernel function K(x_1, x_2) = ⟨φ(x_1), φ(x_2)⟩. SVM is the most popular kernelized learning algorithm; it defines the solution as the maximum-margin hyper-plane.
Kernel functions can be defined for many objects, and in
particular for graphs. Many graph kernels have been defined
in literature. For instance, Random Walk kernels are based
on the number of common random walks in two graphs [2],
[23] and can be computed efficiently in closed form. More recent proposals focus on more complex structures, and allow the φ function to be represented explicitly, with computational benefits. Among others, kernels have been defined considering graphlets [24], shortest-paths [25], subtrees [26], [27] and subtree-walks [28], [29]. For instance, the Weisfeiler-Lehman subtree kernel (WL) defines its features as rooted subtree-walks, i.e., subtrees whose nodes can appear multiple times, up to a user-defined maximum height h (the maximum number of iterations).
Propagation kernels (PK) [30] follow a different idea, in-
spired by the diffusion process in graph node kernels (i.e.
kernels between nodes in a single graph), of propagating the
node label information through the edges in a graph. Then,
for each node, a distribution over the propagated labels is
computed. Finally, the kernel between two graphs compares
such distributions over all the nodes in the two graphs.
While graph kernels exhibit state-of-the-art performance on many graph datasets, their main problem is that they define a fixed representation, which is not task-dependent and can in principle limit the predictive performance of the method. Deep graph kernels (DGK) [31] propose an approach
to alleviate this problem. Let us fix a base kernel and its
explicit representation φ(·). Then a deep graph kernel can be
defined as:

DGK(x_1, x_2) = φ(x_1)^T M φ(x_2),

where M is a matrix of parameters that has to be learned, possibly including target information.
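In matrix form, the deep graph kernel is just a bilinear form between explicit feature vectors; a minimal sketch (ours):

```python
import numpy as np

def dgk_matrix(Phi, M):
    """Deep graph kernel matrix: K[i, j] = phi(x_i)^T M phi(x_j).

    Phi : N x F matrix whose rows are the explicit feature vectors phi(x_i)
    M   : F x F learned parameter matrix
    """
    return Phi @ M @ Phi.T
```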
VI. EXPERIMENTS
In this section, we aim at evaluating the performance of
the proposed method and comparing it with many existing
graph kernels and deep learning approaches for graphs. We
pay special attention to the performance of our method and DGCNN, to see whether the proposed generalization helps to improve the predictive performance. To this end, various experiments are conducted in two settings, following the experimental procedure used in [17], on eight graph datasets (see Table I for a summary). The code for our experiments is available online at
https://github.com/dinhinfotech/PGC-DGCNN.
In the first setting, we compare the performance of our
method with DGCNN and state-of-the-art graph kernels: the
graphlet kernel (GK) [1], the random walk kernel (RW) [2],
the propagation kernel (PK) [30], and the Weisfeiler-Lehman
subtree kernel (WL) [32]. We do not include other state-of-
the-art graph kernels such as NSPDK [33] and ODD [26]
because their performance is not much different from that of the considered ones, and it is beyond the scope of this paper to extensively compare the graph kernels in the literature. In this
setting, five datasets containing biological node-labeled graphs
are employed, namely MUTAG [34], PTC [35], NCI1 [36],
PROTEINS, and D&D [37]. In the first three datasets, each
graph represents a chemical compound, where nodes are
labeled with the atom type, and edges represent bonds between
them. MUTAG is a dataset of aromatic and hetero-aromatic
nitro compounds, where the task is to predict their mutagenic
effect on a bacterium. In PTC, the task is to predict the carcinogenicity of chemical compounds for male and female rats. NCI1 contains anti-cancer screens for cell lung cancer. In PRO-
TEINS and D&D, each graph represents a protein. The nodes
are labeled according to the amino-acid type. The proteins are
classified into two classes: enzymes and non-enzymes.
In the second setting, we evaluate the performance
of the proposed method and DGCNN along with other deep
learning approaches for graphs: PATCHY-SAN (PSCN) [21],
Diffusion CNN (DCNN) [22], ECC [15] and Deep Graphlet
Kernel (DGK) [31]. In this setting, three biological datasets
(NCI1, PROTEINS and D&D) and three social network
datasets from [31] (COLLAB, IMDB-B and IMDB-M) are
used. COLLAB is a dataset of scientific collaborations, where ego-networks are generated for researchers and classified into three research fields. IMDB-B (binary) is a movie collaboration dataset where ego-networks of actors/actresses are classified into the action or romance genre. IMDB-M is a multi-class version of IMDB-B, containing the genres comedy, romance, and sci-fi.
In this setting, we eliminate MUTAG and PTC since they
have a small number of examples which easily causes over-
fitting problems for deep learning approaches.
Evaluation method and model selection: to evaluate the dif-
ferent methods, a nested 10-fold cross-validation is employed,
i.e., one fold is used for testing and 9 folds for training, of which one is used as a validation set for model selection.
TABLE II: Comparison with graph kernels. ∗: our proposed approach. DGCNN is similar to our approach with r = 1.

Dataset            | MUTAG      | PTC        | NCI1       | PROTEINS   | D&D
GK                 | 81.39±1.74 | 55.65±0.46 | 62.49±0.27 | 71.39±0.31 | 74.38±0.69
RW                 | 79.17±2.07 | 55.91±0.32 | >3 days    | 59.57±0.09 | >3 days
PK                 | 76.00±2.69 | 59.50±2.44 | 82.54±0.47 | 73.68±0.68 | 78.25±0.51
WL                 | 84.11±1.91 | 57.97±2.49 | 84.46±0.45 | 74.68±0.49 | 78.34±0.62
DGCNN              | 85.83±1.66 | 58.59±2.47 | 74.44±0.47 | 75.54±0.94 | 79.37±0.94
PGC-DGCNN (r = 2)∗ | 87.22±1.43 | 61.06±1.83 | 76.13±0.73 | 76.45±1.02 | 78.93±0.91
TABLE III: Comparison with other deep learning approaches. ∗: our proposed approach. DGCNN is similar to our approach with r = 1.

Dataset     | NCI1       | PROTEINS   | D&D        | COLLAB     | IMDB-B     | IMDB-M
PSCN        | 76.34±1.68 | 75.00±2.51 | 76.27±2.64 | 72.60±2.15 | 71.00±2.29 | 45.23±2.84
DCNN        | 56.61±1.04 | 61.29±1.60 | 58.09±0.53 | 52.11±0.71 | 49.06±1.37 | 33.49±1.42
ECC         | 76.82      | –          | 72.54      | –          | –          | –
DGK         | 62.48±0.25 | 71.68±0.50 | –          | 73.09±0.25 | 66.96±0.56 | 44.55±0.52
DGCNN       | 74.44±0.47 | 75.54±0.94 | 79.37±0.94 | 73.76±0.49 | 70.03±0.86 | 47.83±0.85
PGC-DGCNN∗  | 76.13±0.73 | 76.45±1.02 | 78.93±0.91 | 75.00±0.58 | 71.62±1.22 | 47.25±1.44
For each dataset,
we repeated each experiment 10 times and report the average
accuracy over the 100 resulting folds. To select the best model,
the hyper-parameters’ values of different kernels are set as
follows: the height of WL and PK in {0,1,2,3,4,5}, the bin
width of PK to 0.001, the size of the graphlets in GK to 3 and
the decay of RW to the largest power of 10 that is smaller than
the reciprocal of the squared maximum node degree. Note that some of the results we report are taken from [17].
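The evaluation protocol can be summarized by the following sketch (ours, using scikit-learn's StratifiedKFold; fit and accuracy are placeholder functions, and the candidate grid stands for the kernel hyper-parameters listed above):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def evaluate(graphs, y, candidates, fit, accuracy, repeats=10, folds=10):
    """Protocol sketch: 10-fold CV with one training fold held out for model
    selection, repeated `repeats` times; accuracy is averaged over all test folds.
    `fit(graphs, y, params)` and `accuracy(model, graphs, y)` are placeholders."""
    y = np.asarray(y)
    scores = []
    for rep in range(repeats):
        outer = StratifiedKFold(n_splits=folds, shuffle=True, random_state=rep)
        for train_idx, test_idx in outer.split(np.zeros(len(y)), y):
            n_val = len(train_idx) // (folds - 1)        # one of the 9 folds for validation
            val_idx, fit_idx = train_idx[:n_val], train_idx[n_val:]
            best = max(candidates, key=lambda p: accuracy(
                fit([graphs[i] for i in fit_idx], y[fit_idx], p),
                [graphs[i] for i in val_idx], y[val_idx]))
            model = fit([graphs[i] for i in train_idx], y[train_idx], best)
            scores.append(accuracy(model, [graphs[i] for i in test_idx], y[test_idx]))
    return np.mean(scores), np.std(scores)
```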
Network architecture: we employ the network architecture
used in [17] to have a fair comparison with DGCNN. The
network consists of three graph convolution layers, a con-
catenation layer, a SortPooling layer, followed by two 1-
D convolutional layers and one dense layer. The activation
function for the graph convolutions is the hyperbolic tangent,
while the 1D convolutions and the dense layer use rectified
linear units. Note that our proposal, as presented in Section
IV, is a generalization of DGCNN. In other words, DGCNN is very similar to a special case of our method where just the neighbors at shortest-path distance 1 are considered. On the contrary, our proposal treats the distance as a hyper-parameter, r, allowing it to flexibly capture local structures associated with graph nodes. In this section, we set r equal to 2 as a first attempt, and plan to explore neighborhoods with nodes at higher distances in future work.
A. Experimental Results
Table II and III show the performance of various methods
in the first and second settings, respectively. Overall, DGCNN
and our proposed method outperform the compared kernels
and deep learning methods in most datasets.
As can be seen from Table II, DGCNN and the proposed method (PGC-DGCNN) achieve higher performance on four out of five datasets, with an improvement ranging from 1.03% to 3.11% with respect to the best performing kernel. Compared to the RW kernel, our proposed method achieves the largest improvements on MUTAG and PROTEINS, of about 8% and 17%, respectively. Concerning PK and WL, which are similar in spirit to DGCNN and our method as shown in [17], DGCNN and PGC-DGCNN show higher performance in most cases, with a larger gap with respect to PK. It is worth noticing, when comparing with PK and WL, that their optimal models in each experiment are selected by tuning the height parameter h over a range of pre-defined values. Instead, DGCNN and our method are evaluated with a fixed number of layers only. This suggests that the performance of DGCNN and the proposed method could be even higher if we validated the number of stacked graph convolutional layers.
Regarding the performance of the various deep learning methods reported in Table III, our method and DGCNN obtain the highest results in five out of six cases, the exception being NCI1, where they show marginally lower results. Compared with DCNN, DGCNN and our method obtain dramatically higher accuracies, with improvements ranging from around 14% up to 21%.
We now turn to the difference between the performance of DGCNN and that of our proposal. It can be seen from Tables II and III that our method performs better than DGCNN on the majority of the datasets. In particular, PGC-DGCNN outperforms DGCNN in six out of eight cases, with a consistent improvement of about 1% to 2%. On D&D and IMDB-M, the accuracy of our method is slightly lower than that of DGCNN; however, these declines are only marginal. The generally improved performance of our method compared to DGCNN can be explained by the fact that our method parameterizes the graph convolutions, making it a generalization of DGCNN (we recall that we fix the neighborhood distance r = 2). In this case, our method captures more information about the local graph structure associated with each node compared to DGCNN, which considers just the direct neighbors, i.e. r = 1. It is worth noticing that (1) we use a single value of r to build our model, although, in general, an optimal model can be chosen by tuning r over a range of values; and (2) we use the architecture proposed in [17], meaning that we have not tried to optimize the network architecture. Therefore, the performance of our method could improve if we optimized the distance parameter r and the number of graph convolutional layers, together with the rest of the architecture.
VII. CONCLUSIONS AND FUTURE WOR KS
In this paper, we presented a new definition of graph convolutional filter. It generalizes the most commonly adopted filter, adding a hyper-parameter that controls the distance of the considered neighborhood. Experimental results show that our proposed filter improves the predictive performance of Deep Graph Convolutional Neural Networks on many real-world datasets.
In the future, we plan to analyze more in depth the impact of filter size in graph convolutional networks. We will define 1D convolutions as special cases of graph convolutions, and we will explore fully graph-convolutional neural architectures, which avoid fully-connected layers and possibly stack more graph convolution layers. Moreover, we will explore the impact of different activation functions for the graph convolutions in such a setting. Finally, we plan to enhance the input graph representation by associating with each node the explicit features extracted by graph kernels.
ACKNOWLEDGMENT
This project was funded, in part, by the Department of
Mathematics, University of Padova, under the DEEP project
and DFG project, BA 2168/3-3.
REFERENCES
[1] N. Shervashidze, S. Vishwanathan, T. Petri, K. Mehlhorn, and K. Borg-
wardt, “Efficient graphlet kernels for large graph comparison, in Arti-
ficial Intelligence and Statistics, 2009, pp. 488–495.
[2] S. V. N. Vishwanathan, N. N. Schraudolph, R. Kondor, and K. M.
Borgwardt, “Graph kernels, Journal of Machine Learning Research,
vol. 11, no. Apr, pp. 1201–1242, 2010.
[3] G. Da San Martino, N. Navarin, and A. Sperduti, “A tree-
based kernel for graphs,” in Proceedings of the Twelfth SIAM
International Conference on Data Mining, Anaheim, California,
USA, April 26-28, 2012., 2012, pp. 975–986. [Online]. Available:
https://doi.org/10.1137/1.9781611972825.84
[4] L. van der Maaten, “Learning discriminative fisher kernels,” in Proceed-
ings of the 28th International Conference on Machine Learning, ICML
2011, Bellevue, Washington, USA, June 28 - July 2, 2011, 2011, pp.
217–224.
[5] F. Aiolli, G. Da San Martino, M. Hagenbuchner, and A. Sperduti,
“Learning nonsparse kernels by self-organizing maps for structured
data,” IEEE Trans. Neural Networks, vol. 20, no. 12, pp. 1938–1949,
2009. [Online]. Available: https://doi.org/10.1109/TNN.2009.2033473
[6] D. Bacciu, A. Micheli, and A. Sperduti, “Generative kernels for
tree-structured data,” IEEE Trans. Neural Netw. Learning Syst., vol.
early access, 2018. [Online]. Available: https://ieeexplore.ieee.org/
document/8259316/
[7] W. L. Hamilton, R. Ying, and J. Leskovec, “Representation learning on
graphs: Methods and applications,” CoRR, vol. abs/1709.05584, 2017.
[Online]. Available: http://arxiv.org/abs/1709.05584
[8] A. Micheli, “Neural network for graphs: A contextual constructive
approach,” IEEE Transactions on Neural Networks, vol. 20, no. 3, pp.
498–511, 2009.
[9] F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini, “The Graph Neural Network Model,” IEEE Transactions on Neural Networks, vol. 20, no. 1, pp. 61–80, 2009. [Online]. Available: http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=4700287
[10] A. Sperduti and A. Starita, “Supervised neural networks for
the classification of structures,” IEEE Trans. Neural Networks,
vol. 8, no. 3, pp. 714–735, 1997. [Online]. Available: https:
//doi.org/10.1109/72.572108
[11] Y. Li, D. Tarlow, M. Brockschmidt, and R. Zemel, “Gated Graph
Sequence Neural Networks,” in ICLR, 2016. [Online]. Available:
http://arxiv.org/abs/1511.05493
[12] X. Bresson and T. Laurent, “An Experimental Study of Neural Networks
for Variable Graphs,” in ICLR 2018 Workshop, 2018.
[13] Q. Liao and T. Poggio, “Bridging the Gaps Between Residual Learning,
Recurrent Neural Networks and Visual Cortex, arXiv preprint, 2016.
[Online]. Available: http://arxiv.org/abs/1604.03640
[14] D. Duvenaud, D. Maclaurin, J. Aguilera-Iparraguirre, R. Gómez-Bombarelli, T. Hirzel, A. Aspuru-Guzik, and R. P. Adams, “Convolutional networks on graphs for learning molecular fingerprints,” in Advances in Neural Information Processing Systems, Montreal, Canada, 2015, pp. 2215–2223.
[15] M. Simonovsky and N. Komodakis, “Dynamic edge-conditioned filters
in convolutional neural networks on graphs, in CVPR, 2017.
[16] T. N. Kipf and M. Welling, “Semi-Supervised Classification with
Graph Convolutional Networks, in ICLR, 2017, pp. 1–14. [Online].
Available: http://arxiv.org/abs/1609.02907
[17] M. Zhang, Z. Cui, M. Neumann, and Y. Chen, “An End-to-End Deep
Learning Architecture for Graph Classification,” in AAAI Conference on
Artificial Intelligence, 2018.
[18] S. Abu-El-Haija, A. Kapoor, B. Perozzi, and J. Lee, “N-GCN: Multi-
scale Graph Convolution for Semi-supervised Node Classification,
in Proceedings of the 14th International Workshop on Mining
and Learning with Graphs (MLG), 2018. [Online]. Available:
http://arxiv.org/abs/1802.08888
[19] C. Szegedy, Wei Liu, Yangqing Jia, P. Sermanet, S. Reed, D. Anguelov,
D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with
convolutions, in 2015 IEEE Conference on Computer Vision and
Pattern Recognition (CVPR). IEEE, jun 2015, pp. 1–9. [Online].
Available: http://ieeexplore.ieee.org/document/7298594/
[20] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio, “Graph Attention Networks,” in ICLR, 2018. [Online]. Available: http://arxiv.org/abs/1710.10903
[21] M. Niepert, M. Ahmed, and K. Kutzkov, “Learning convolutional neural
networks for graphs,” in International conference on machine learning,
2016, pp. 2014–2023.
[22] J. Atwood and D. Towsley, “Diffusion-convolutional neural networks,”
in Advances in Neural Information Processing Systems, 2016, pp. 1993–
2001.
[23] T. Gärtner, P. Flach, and S. Wrobel, “On Graph Kernels: Hardness Results and Efficient Alternatives,” in Proceedings of the 16th Annual Conference on Computational Learning Theory and 7th Kernel Workshop, ser. Lecture Notes in Computer Science, B. Schölkopf and M. K. Warmuth, Eds., vol. 2777. Berlin, Heidelberg: Springer Berlin Heidelberg, 2003, pp. 129–143. [Online]. Available: http://link.springer.com/10.1007/b12006
[24] N. Shervashidze, S. V. N. Vishwanathan, T. H. Petri, K. Mehlhorn, and K. M. Borgwardt, “Efficient graphlet kernels for large graph comparison,” in AISTATS, vol. 5. Clearwater Beach, Florida, USA: CSAIL, 2009, pp. 488–495.
[25] K. Borgwardt and H.-P. Kriegel, “Shortest-Path Kernels on Graphs,”
in ICDM. Los Alamitos, CA, USA: IEEE, 2005, pp. 74–
81. [Online]. Available: http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.
htm?arnumber=1565664
[26] G. Da San Martino, N. Navarin, and A. Sperduti, “Ordered Decompo-
sitional DAG Kernels Enhancements,” Neurocomputing, vol. 192, pp.
92–103, 2016.
[27] ——, “A Tree-Based Kernel for Graphs,” in Proceedings of the Twelfth
SIAM International Conference on Data Mining, 2012, pp. 975–986.
[28] ——, “Graph Kernels Exploiting Weisfeiler-Lehman Graph Isomorphism Test Extensions,” in Neural Information Processing, vol. 8835, 2014, pp. 93–100. [Online]. Available: http://link.springer.com/10.1007/978-3-319-12640-1_12
[29] N. Shervashidze, P. Schweitzer, E. J. van Leeuwen, K. Mehlhorn, and
K. M. Borgwardt, “Weisfeiler-Lehman Graph Kernels, JMLR, vol. 12,
pp. 2539–2561, 2011.
[30] M. Neumann, N. Patricia, R. Garnett, and K. Kersting, “Efficient graph
kernels by randomization,” in Joint European Conference on Machine
Learning and Knowledge Discovery in Databases. Springer, 2012, pp.
378–393.
[31] P. Yanardag and S. Vishwanathan, “Deep graph kernels,” in Proceedings
of the 21th ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining. ACM, 2015, pp. 1365–1374.
[32] N. Shervashidze, P. Schweitzer, E. J. v. Leeuwen, K. Mehlhorn, and
K. M. Borgwardt, “Weisfeiler-lehman graph kernels,” Journal of Ma-
chine Learning Research, vol. 12, no. Sep, pp. 2539–2561, 2011.
[33] F. Costa and K. De Grave, “Fast neighborhood subgraph pairwise
distance kernel,” in ICML. Omnipress, 2010, pp. 255–262.
[34] A. K. Debnath, R. L. Lopez de Compadre, G. Debnath, A. J.
Shusterman, and C. Hansch, “Structure-activity relationship of
mutagenic aromatic and heteroaromatic nitro compounds. Correlation
with molecular orbital energies and hydrophobicity,” Journal of
Medicinal Chemistry, vol. 34, no. 2, pp. 786–797, feb 1991. [Online].
Available: http://pubs.acs.org/doi/abs/10.1021/jm00106a046
[35] H. Toivonen, A. Srinivasan, R. D. King, S. Kramer, and C. Helma,
“Statistical evaluation of the predictive toxicology challenge 2000-2001,
Bioinformatics, 2003.
[36] N. Wale, I. Watson, and G. Karypis, “Comparison of descriptor spaces
for chemical compound retrieval and classification, Knowledge and
Information Systems, vol. 14, no. 3, pp. 347–375, 2008.
[37] P. D. Dobson and A. J. Doig, “Distinguishing Enzyme Structures from
Non-enzymes Without Alignments, Journal of Molecular Biology, vol.
330, no. 4, pp. 771–783, 2003.