On Filter Size in Graph Convolutional Networks
Dinh V. Tran, Nicolò Navarin, Alessandro Sperduti
Department of Mathematics, University of Padova, Italy
{dinh, nnavarin, sperduti}@math.unipd.it
Bioinformatics Group, Department of Computer Science, University of Freiburg, Germany
dinh@informatik.uni-freiburg.de
School of Computer Science, University of Nottingham, United Kingdom
nicolo.navarin@nottingham.ac.uk
Abstract—Recently, many researchers have been focusing on the definition of neural networks for graphs. The basic component for many of these approaches remains the graph convolution idea proposed almost a decade ago. In this paper, we extend this basic component, following an intuition derived from the well-known convolutional filters over multi-dimensional tensors. In particular, we derive a simple, efficient and effective way to introduce a hyper-parameter on graph convolutions that influences the filter size, i.e. its receptive field over the considered graph. We show with experimental results on real-world graph datasets that the proposed graph convolutional filter improves the predictive performance of Deep Graph Convolutional Networks.
Index Terms—graphs, deep learning for graphs, graph convolution, convolutional neural networks for graphs.
I. INTRODUCTION
Graphs are a common and natural way to represent many real-world data: in chemistry, for example, a compound can be represented by its molecular graph, while in social networks the relationships between users are represented as edges in a graph whose nodes are the users. Many computational tasks involving such graph representations require machine learning, such as the classification of active/non-active drugs or the prediction of future links between users in a social network.
State-of-the-art machine learning techniques for classification
and regression on graphs are at the moment kernel machines
equipped with specifically designed kernels for graphs (e.g., [1]–[3]). Although there are examples of kernels for structures that can be designed on the basis of a training set [4]–[6], most of the more efficient and effective graph kernels are based on predefined structural features, i.e., feature definition is not part of the learning process.
There has been a recent shift of trend from kernels to neural networks for graphs. Unlike kernels, in neural networks the features are defined by a learning process that is supervised by the graphs' labels (targets). Many approaches have addressed the problem of defining neural networks for graphs [7]. However, one of the core components, the graph convolution, has not changed much with respect to the earlier works [8], [9].
In this paper, we work on the re-design of this basic
component. We propose a new formulation for the graph
convolution operator that is strictly more general than the
existing one. Our proposal can in principle be applied to virtually all the techniques based on graph convolutions.
The paper is organized as follows. We start in Section II
with some basic definitions and notation. In Section III, we
provide an overview over the various proposals of graph
convolution available in literature. In Section IV we detail our
proposed parametric graph convolutional filter. In Section V
we discuss other related works that are not based on graph
convolution, including some alternative graph neural network
architectures and graph kernels. In Section VI we report our
experimental results. Finally, Section VII concludes the paper.
II. NOTATION AND DEFINITIONS
We denote matrices with bold uppercase letters, vectors with uppercase letters, and variables with lowercase letters. Given a matrix M, M_i denotes the i-th row of the matrix, and m_{ij} is the element in the i-th row and j-th column. Given the vector V, v_i refers to its i-th element.
Let us consider a graph G = (V_G, E_G, X_G), where V_G = {v_1, ..., v_n} is the set of vertices (or nodes), E_G ⊆ V_G × V_G is the set of edges, and X_G ∈ R^{n×d} is a node label matrix, where each row is the label (a vector of size d) associated to each vertex v_i ∈ V_G, i.e. X^G_i = (x_{i,0}, ..., x_{i,d}). Note that, in this paper, we will not consider edge labels. When the reference to the graph G is clear from the context, for the sake of notation we discard the superscript referring to the specific graph. We define the adjacency matrix A ∈ R^{n×n} as a_{ij} = 1 if (i, j) ∈ E, and 0 otherwise. We also define the neighborhood of a vertex v as the set of vertices connected to v by an edge, i.e. N(v) = {u | (v, u) ∈ E}. Note that N(v) is also the set of nodes at shortest-path distance exactly one from v, i.e. N(v) = {u | sp(v, u) = 1}, where sp is a function computing the shortest-path distance between two nodes in a graph.
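To ground this notation, the following short sketch (ours, in Python/NumPy; the toy edge set is purely illustrative) builds the adjacency matrix A of a small undirected graph and the neighborhood function N(v):

```python
import numpy as np

# Toy undirected graph with n = 4 vertices and an illustrative edge set E.
n = 4
E = [(0, 1), (1, 2), (2, 3), (0, 2)]

# Adjacency matrix: a_ij = 1 if (i, j) is an edge, 0 otherwise.
A = np.zeros((n, n), dtype=int)
for i, j in E:
    A[i, j] = 1
    A[j, i] = 1  # undirected graph

def neighborhood(A, v):
    """N(v): the vertices at shortest-path distance exactly 1 from v."""
    return {int(u) for u in np.flatnonzero(A[v])}

print(neighborhood(A, 2))  # {0, 1, 3}
```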
In this paper, we consider the problem of graph classification. Given a dataset composed of N pairs {(G_i, y_i) | 1 ≤ i ≤ N}, the task is then, given an unseen graph G, to predict its correct target y.
III. GRAPH CONVOLUTIONS
The first definition of neural network for graphs has been
proposed in [10]. More recent models have been proposed in
[8], [9]. Both works are based on an idea that has been re-
branded later as graph convolution.
The idea is to define the neural architecture following
the topology of the graph. Then a transformation is per-
formed from the neurons corresponding to a vertex and its
neighborhood to a hidden representation, that is associated to
the same vertex (possibly in another layer of the network).
This transformation depends on some parameters, that are
shared among all the nodes. In the following, for the sake
of simplicity we ignore the bias terms.
In [9], when considering non-positional graphs, i.e. the most common definition, and the one we are considering in this paper, a transition function on a graph node v at time t ≥ 0 is defined as:

H^{t+1}_v = Σ_{u ∈ N(v)} f_Θ(H^t_u, X_v, X_u),    (1)
where f_Θ is a parametric function whose parameters Θ have to be learned (e.g. a neural network) and are shared among all the vertices. Note that, if edge labels are available, they can be included in eq. (1). In fact, in the original formulation, f_Θ also depends on the label of the edge between v and u. This
transition function is part of a recurrent system. It is defined as a contraction mapping, thus the system is guaranteed to converge to a fixed point, i.e. a representation, that does not depend on the particular initialization of the state matrix H^0. The output is computed from the last representation and the original node labels as follows:

O^t_v = g_{Θ'}(H^t_v, X_v),    (2)

where g_{Θ'} is another neural network. [11] extends the work
in [9] by removing the constraint for the recurrent system to
be a contraction mapping, and replacing the recurrent units
with GRUs. However, recently it has been shown in [12] that
stacked graph convolutions are superior to graph recurrent
architectures in terms of both accuracy and computational cost.
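As an illustration of the recurrent scheme of eq. (1), the following sketch (ours, not the implementation of [9] or [11]) iterates the transition function until approximate convergence; the linear f_Θ used here is only a placeholder and must be chosen so that the map is a contraction:

```python
import numpy as np

def gnn_fixed_point(A, X, W_h, W_x, n_iter=50, tol=1e-6):
    """Iterate H_v <- sum_{u in N(v)} f(H_u, X_v, X_u) until approximate convergence.

    f is a placeholder linear map f(H_u, X_v, X_u) = H_u @ W_h + X_u @ W_x;
    W_h must make the update a contraction (e.g. small spectral norm).
    """
    n = A.shape[0]
    H = np.zeros((n, W_h.shape[1]))
    for _ in range(n_iter):
        H_new = A @ (H @ W_h + X @ W_x)   # sum over the neighbours of each node
        if np.max(np.abs(H_new - H)) < tol:
            return H_new
        H = H_new
    return H
```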
In [8], a model referred to as Neural Network for Graphs
(NN4G) is proposed. In the first layer, a transformation over
node labels is computed:
ĥ^1_v = f( Σ_{j=1}^{d} w̄_{1,j} x_{v,j} ),    (3)

where W̄_1 are the weights connecting the original labels X to the current neuron, and 1 ≤ v ≤ n is the vertex index. The graph convolution is then defined for the (i+1)-th layer (for i > 0) as:
ĥ^{i+1}_v = f( Σ_{j=1}^{d} w̄_{i+1,j} x_{v,j} + Σ_{k=1}^{i} ŵ_{i+1,k} Σ_{u ∈ N(v)} ĥ^k_u ),    (4)

where Ŵ_{i+1} are the (shared) weights connecting the previous hidden layers to the current neuron. Note that in this formulation, skip connections are present, to the (i+1)-th layer, from layer
1 to layer i. There is an interesting recent work about the parallel between skip-connections (residual networks in that case) and recurrent networks [13]. However, since in the formulation in eq. (4) every layer is connected to all the subsequent layers, it is not possible to reduce it to a (vanilla) recurrent model.

Fig. 1: Graph convolution as described in [8], and adopted with some variations by many state-of-the-art Graph Convolutional neural networks.
Let us consider the (i+1)-th graph convolutional layer, which comprises c_{i+1} graph convolutional filters. We can rewrite eq. (4) for the whole layer as:

H_{i+1} = f( X W̄_{i+1} + Σ_{k=1}^{i} A H_k Ŵ_{i+1,k} ),    (5)

where i = 0, ..., l−1 (and l is the number of layers), W̄_{i+1} ∈ R^{d×c_{i+1}}, Ŵ_{i+1,k} ∈ R^{c_k×c_{i+1}}, H_k ∈ R^{n×c_k}, c_i is the size of the hidden representation at the i-th layer, and f is applied element-wise.
An abstract representation of eq. (4) is depicted in Figure 1.
The convolution in eq. (4) is part of a multi-layer architecture,
where each layer’s connectivity resembles the topology of the
graph, and the training is layer-wise. Finally, for each graph,
NN4G computes the average graph node representation for
each hidden layer, and concatenates them. This is the graph
representation computed by NN4G, and it can be used for
the final prediction of graph properties with a standard output
layer.
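To make eqs. (4)–(5) concrete, here is a NumPy sketch (ours, with hypothetical variable names) of one NN4G layer, including the skip connections from the input labels and from all previously computed hidden layers:

```python
import numpy as np

def nn4g_layer(A, X, H_prev, W_bar, W_hat_list, f=np.tanh):
    """One NN4G graph convolution as in eq. (5).

    A          : n x n adjacency matrix
    X          : n x d node-label matrix
    H_prev     : list of hidden matrices H_1, ..., H_i from the previous layers
    W_bar      : d x c_{i+1} weights on the original labels
    W_hat_list : list of c_k x c_{i+1} weights, one per previous layer (skip connections)
    """
    out = X @ W_bar
    for H_k, W_hat in zip(H_prev, W_hat_list):
        out = out + A @ H_k @ W_hat   # aggregate neighbours' hidden states of layer k
    return f(out)
```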
In [14], a hierarchical approach has been proposed. This
method is similar to NN4G and is inspired by circular fin-
gerprints in chemical structures. While [8] adopts Cascade-
Correlation for training, [14] uses an end-to-end back-
propagation. ECC [15] proposes an improvement of [14],
weighting the sum over the neighbors of a node by weights
conditioned by the edge labels. We consider this last version
as a baseline in our experiments.
Recently, [16] derives a graph convolution that closely resembles eq. (4). Let us, from now on, consider H_0 = X. Motivated by a first-order approximation of localized spectral filters on graphs, the proposed graph convolutional filter looks like:

H_{i+1} = f( D̃^{-1/2} Ã D̃^{-1/2} H_i W_i ),    (6)

where Ã = A + I, d̃_{ii} = Σ_j ã_{i,j}, and f is any activation function applied element-wise.
If we ignore the terms D̃^{-1/2} (that in practice act as a normalization), it is easy to see that eq. (6) is very similar to eq. (5), the difference being that there are no skip connections in this case, i.e. the (i+1)-th layer is connected just to the i-th layer. Consequently, we just have to learn one weight matrix per layer.
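For reference, a minimal NumPy sketch of the propagation rule of eq. (6), under the stated definitions Ã = A + I and D̃ the corresponding degree matrix (ours; the comment shows how the random-walk normalization of eq. (7) below would be obtained instead):

```python
import numpy as np

def gcn_layer(A, H, W, f=np.tanh):
    """Graph convolution of eq. (6): f(D^{-1/2} (A + I) D^{-1/2} H W)."""
    n = A.shape[0]
    A_tilde = A + np.eye(n)
    d_tilde = A_tilde.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d_tilde))
    P = D_inv_sqrt @ A_tilde @ D_inv_sqrt
    # Random-walk normalization of eq. (7) instead:
    # P = np.diag(1.0 / d_tilde) @ A_tilde
    return f(P @ H @ W)
```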
In [17], a slightly more complex model compared to [16]
is proposed. This model shows the highest predictive perfor-
mance with respect to the other methods presented in this
section. The first layers of the network are again stacked graph
convolutional layers, defined as follows:
H_{i+1} = f( D̃^{-1} Ã H_i W_i ),    (7)

where H_0 = X and Ã = A + I. Note that in the previous equation, we compute the representation of all the nodes in the graph at once. The difference between eq. (7) and eq. (6) is the propagation scheme used for the nodes' representations: eq. (6) is based on the normalized graph Laplacian, while eq. (7) is based on the random-walk graph Laplacian. In [17], the authors state that the choice of normalization does not significantly affect the results. In fact, both equations can be seen as first-order approximations of the polynomially parameterized spectral graph convolution.
[17], three graph convolutional layers are stacked. The graph
convolutions are followed by a concatenation layer that merges
the representations computed by each graph convolutional
layer. Then, differently from previous approaches, the paper introduces a SortPooling layer, which selects a fixed number of node representations and computes the output from them by stacking 1D convolutional layers and dense layers. This is the same network architecture that we consider in this paper.
A. SortPooling layer
After stacking some graph convolution layers, we need a mechanism to predict the target for the graph, starting from its node encodings. Ideally, this mechanism should be applicable to
graphs with variable number of vertices. Instead of averaging
the node representations, [17] proposes to solve this issue with
the SortPooling layer.
Let us assume that, for each node, the encoding produced by the i-th graph convolution layer has size c. Let us consider the output of the last graph convolution (or concatenation) layer to be H_l ∈ R^{n×c}, where each row is a vertex's feature descriptor and each column is a feature channel. The output of the SortPooling layer is a k×c tensor, where k is a user-defined integer.
In the SortPooling layer, the rows of H_l are sorted lexicographically (possibly starting from the last column). We can
see the output of the graph convolutional layer as continuous
WL colors, and thus we are sorting all the vertices according
to these colors. This way, a consistent ordering is imposed for
graph vertices, making it possible to train traditional neural
networks on the sorted graph representations.
In addition to sorting vertex features in a consistent order, the other function of SortPooling is to unify the sizes of the output tensors. After sorting, we truncate/extend the output tensor in the first dimension from n to k, so that graphs with different numbers of vertices all obtain representations of size k. This is done by deleting the last n−k rows if n > k, or adding k−n zero rows if n < k.
Note that if two vertices have the same hidden representa-
tion, it doesn’t matter which node we pick because the output
of the SortPooling layer would be exactly the same.
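The behaviour described above can be sketched as follows (ours, not the DGCNN implementation): rows are sorted lexicographically starting from the last feature channel, then truncated or zero-padded to exactly k rows:

```python
import numpy as np

def sort_pooling(H, k):
    """SortPooling: sort node rows lexicographically (last channel as primary key),
    then truncate or zero-pad to k rows, yielding a k x c output."""
    n, c = H.shape
    # np.lexsort uses the last key as the primary sort key, so pass the columns
    # in their natural order; reverse for a descending sort.
    order = np.lexsort(tuple(H[:, j] for j in range(c)))[::-1]
    H_sorted = H[order]
    if n >= k:
        return H_sorted[:k]
    return np.vstack([H_sorted, np.zeros((k - n, c))])
```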
IV. PARAMETRIC GRAPH CONVOLUTIONS
A straightforward generalization of eq. (7) would be defined on the powers of the adjacency matrix, i.e. on random walks [18]. This would introduce tottering in the learned representation, which is not considered to be beneficial in general. We decided to follow another approach, based on shortest paths. As mentioned before, the adjacency matrix A of a graph can be seen as the matrix of the shortest paths of length 1, i.e.

a_{i,j} = sp^1_{i,j} = 1 if sp(i, j) = 1, and 0 otherwise.    (8)

Moreover, the identity matrix I is the matrix of the shortest paths of length 0 (assuming that each node is at distance zero from itself), i.e. I = SP^0. Note also that Ã = SP^0 + SP^1.
By means of this new notation, we can rewrite eq. (7) as:

H_{l+1} = f( D̃^{-1} (SP^0 + SP^1) H_l W_l ).    (9)
Let us now define d̂^r_{ii} = Σ_j sp^r_{i,j}. We can now extend our reasoning and define our graph convolution layer, parameterized by r. In our contribution, we decided to process information in a slightly different way with respect to eq. (9). Instead of summing the contributions of the SP matrices, we decided to keep the contributions of the nodes at different shortest-path distances separated. This is equivalent to defining multiple graph convolutional filters, one for each shortest-path distance. We define the Parametric Graph Convolution as:

H^{r,l+1} = ∥_{j=0}^{r} f( (D̂^j)^{-1} SP^j H^l W^{j,l} ),    (10)

where ∥ is the concatenation operator. Note that with our formulation, we have a different W^{j,l} matrix for each layer l and for each shortest-path distance j. Moreover, as mentioned before, we are concatenating the information and not summing it, explicitly keeping the contributions of the different distances separated. This approach follows the network-in-network idea [19]. In our case, at each layer, we are effectively applying, at the same time, r+1 convolutions (one for each shortest-path distance) and concatenating their outputs. Let us fix a parameter controlling the number of filters for the l-th layer, say c_l, and a value for the hyper-parameter r; then we have H^{r,l+1} ∈ R^{n×r·c_l}.
Fig. 2: The proposed Parametric Graph Convolution. The parameter r controls the maximum distance of the considered neighborhood, and the dimensionality of the output (the panels illustrate the cases r = 0, r = 1, and r = 2).
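The following NumPy sketch (ours, with hypothetical names) applies one parametric graph convolution as in eq. (10), given the precomputed shortest-path matrices SP^0, ..., SP^r; a possible way to build those matrices is sketched in Section IV-B below. Each distance j has its own weight matrix and normalization, and the r+1 outputs are concatenated along the feature dimension:

```python
import numpy as np

def parametric_graph_conv(SP, H, W_list, f=np.tanh):
    """Parametric graph convolution, eq. (10).

    SP     : list of n x n 0/1 matrices, SP[j][i, u] = 1 iff sp(i, u) == j, j = 0..r
    H      : n x c_l input node representations
    W_list : list of r + 1 weight matrices, one per shortest-path distance
    """
    outputs = []
    for SP_j, W_j in zip(SP, W_list):
        d_j = SP_j.sum(axis=1).astype(float)
        d_j[d_j == 0] = 1.0                       # avoid division by zero
        D_j_inv = np.diag(1.0 / d_j)
        outputs.append(f(D_j_inv @ SP_j @ H @ W_j))
    return np.concatenate(outputs, axis=1)        # concatenation, not sum
```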
A. Receptive field
It has been shown in [16], [17] that with the standard definition of graph convolution, e.g. the ones in eq. (6) and eq. (7), the receptive field of a graph convolutional filter at layer l corresponding to the vertex v is N^l(v). This draws an interesting parallel with the Weisfeiler-Lehman graph kernel (see Section V-A), where intuitively the number of WL iterations is equivalent to the number of stacked graph convolution layers in the architecture.
In our proposed parametric graph convolution in eq. (10), the parameter r directly influences the considered neighborhood in the graph convolutional filter (and the number of output channels, since we concatenate the outputs of the convolutions for all j ≤ r). It is easy to see that, by definition, the receptive field of a graph convolutional filter parameterized by r and applied to the vertex v includes all the nodes at shortest-path distance at most r from v. When we stack multiple layers of our parametric graph convolution, the receptive field grows in the same way. The receptive field of a parametric graph convolutional filter of size r at layer l applied to the vertex v then includes all the vertices at shortest-path distance at most l·r from v.
B. Computational complexity
Equation (10) requires computing the all-pairs shortest paths up to a fixed length r. While computing the unbounded all-pairs shortest paths for a graph with n nodes requires O(n^3) time, if the maximum length is small enough it is possible to implement this with one depth-limited breadth-first visit starting from each node, with an overall complexity of O(mr), where m is the number of edges in the graph.
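A possible construction of the SP^j matrices along the lines of this argument (our sketch): one breadth-first visit per source node, truncated at depth r:

```python
from collections import deque
import numpy as np

def shortest_path_matrices(A, r):
    """Return SP[0], ..., SP[r], where SP[j][v, u] = 1 iff sp(v, u) == j.

    One depth-limited BFS per source node; overall O(m * r) for a graph
    with m edges when r is a small constant.
    """
    n = A.shape[0]
    neighbours = [np.flatnonzero(A[v]) for v in range(n)]
    SP = [np.zeros((n, n), dtype=int) for _ in range(r + 1)]
    for v in range(n):
        dist = {v: 0}
        queue = deque([v])
        while queue:
            u = queue.popleft()
            if dist[u] == r:
                continue                  # depth-limited: do not expand further
            for w in neighbours[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        for u, d in dist.items():
            SP[d][v, u] = 1
    return SP
```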
V. RELATED WORKS
Besides the approaches based on graph convolutions pre-
sented in Section III, there are some other methods in literature
to process graphs with neural networks.
For instance, [20] defined an attention mechanism to prop-
agate information between the nodes in a graph. The basic
idea is the definition of an external network that, given two
neighboring nodes, outputs an attention weight for that specific
edge. A shared attentive mechanism a : R^d × R^d → R computes the attention coefficients

e_{v,u} = a_Θ(W X_v, W X_u),    (11)

that indicate the importance of node u's features to node v. Here, a_Θ is a parametric function, which in the original paper is a single-layer feed-forward network parameterized by the vector Θ ∈ R^{2d}. The information about the graph structure is injected into the mechanism by performing masked attention, i.e. e_{v,u} is only computed for nodes u ∈ N(v). To make the coefficients easily comparable across different nodes, a softmax function is used:

b_{v,u} = softmax_u(e_{v,u}) = exp(e_{v,u}) / Σ_{k ∈ N(v)} exp(e_{v,k}).    (12)

Once obtained, the normalized attention coefficients are used to compute a linear combination of the corresponding features, which serves as the final output feature of every node (after potentially applying a point-wise nonlinearity f):

H_v = f( Σ_{u ∈ N(v)} b_{v,u} W X_u ).    (13)
To stabilize the learning process of self-attention, the authors propose extending the mechanism to employ multi-head attention (K different attention weights per edge). For the last layer, the authors employ averaging, and delay applying the final nonlinearity (usually a softmax or logistic sigmoid for classification problems) until then.
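A compact single-head sketch of the masked attention of eqs. (11)–(13) (ours; the concatenation-based scorer below is a simple stand-in for the single-layer network a_Θ of the original paper):

```python
import numpy as np

def gat_layer(A, X, W, theta, f=np.tanh):
    """Single-head graph attention in the spirit of eqs. (11)-(13).

    A     : n x n adjacency matrix (defines the attention mask)
    X     : n x d node features
    W     : d x d' linear transformation
    theta : vector of size 2*d' parameterizing the scoring function
    """
    Z = X @ W                                            # W X_v for every node
    H = np.zeros_like(Z)
    for v in range(A.shape[0]):
        neigh = np.flatnonzero(A[v])
        if neigh.size == 0:
            continue
        # e_{v,u}: score on the concatenation of the two transformed features
        e = np.array([theta @ np.concatenate([Z[v], Z[u]]) for u in neigh])
        b = np.exp(e - e.max())
        b = b / b.sum()                                  # softmax over the neighbourhood
        H[v] = f((b[:, None] * Z[neigh]).sum(axis=0))    # weighted combination, eq. (13)
    return H
```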
This technique has been applied to node classification only,
and its complexity (due to implementation issues) is high. In
principle, the same approach as in NN4G can be adopted to
generate graph-level representations and predictions for this
model.
[21] (PSCN) proposes another interpretation of graph
convolution. Given a graph, it first selects the nodes on which the convolutional filter has to be centered. Then, it selects a fixed number of vertices from each node's neighborhood, and infers an
order on them. This ordering constraint limits the flexibility of
the approach because learning a consistent order is difficult,
and the number of nodes in the convolutional filter has to be
fixed a-priori.
Diffusion CNN (DCNN) [22] is based on the principle of
heat diffusion (on graphs). The idea is to map each node and its labels to the result of a diffusion process that begins at that node.
TABLE I: Summary of employed graph datasets

Dataset      | MUTAG | PTC   | NCI1  | PROTEINS | D&D    | COLLAB | IMDB-B | IMDB-M
#Nodes (Max) | 28    | 109   | 111   | 620      | 5748   | 492    | 136    | 89
#Nodes (Avg) | 17.93 | 25.56 | 29.87 | 39.06    | 284.32 | 74.49  | 19.77  | 13.00
#Graphs      | 188   | 344   | 4110  | 1113     | 1178   | 5000   | 1000   | 1500
A. Graph Kernels
Kernel methods define the model as a linear classifier in a Reproducing Kernel Hilbert Space, that is, the space implicitly defined by a kernel function K(x_1, x_2) = ⟨φ(x_1), φ(x_2)⟩. SVM is the most popular kernelized learning algorithm; it defines the solution as the maximum-margin hyper-plane.
Kernel functions can be defined for many objects, and in
particular for graphs. Many graph kernels have been defined
in literature. For instance, Random Walk kernels are based
on the number of common random walks in two graphs [2],
[23] and can be computed efficiently in closed form. More recent proposals focus on more complex structures, and allow the φ function to be represented explicitly, with computational benefits. Among others, kernels have been defined considering graphlets [24], shortest-paths [25], subtrees [26], [27] and subtree-walks [28], [29]. For instance, the Weisfeiler-Lehman subtree kernel (WL) defines its features as rooted subtree-walks, i.e., subtrees whose nodes can appear multiple times, up to a user-defined maximum height h (the maximum number of iterations).
Propagation kernels (PK) [30] follow a different idea, in-
spired by the diffusion process in graph node kernels (i.e.
kernels between nodes in a single graph), of propagating the
node label information through the edges in a graph. Then,
for each node, a distribution over the propagated labels is
computed. Finally, the kernel between two graphs compares
such distributions over all the nodes in the two graphs.
While graph kernels exhibit state-of-the-art performance on many graph datasets, their main problem is that they define a fixed representation, which is not task-dependent and can in principle limit the predictive performance of the method. Deep graph kernels (DGK) [31] propose an approach
to alleviate this problem. Let us fix a base kernel and its
explicit representation φ(·). Then a deep graph kernel can be
defined as:

DGK(x_1, x_2) = φ(x_1)^T M φ(x_2),

where M is a matrix of parameters that has to be learned, possibly including target information.
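In matrix form, the deep graph kernel is just a bilinear form between explicit feature vectors; a minimal sketch (ours):

```python
import numpy as np

def dgk_matrix(Phi, M):
    """Deep graph kernel matrix: K[i, j] = phi(x_i)^T M phi(x_j).

    Phi : N x F matrix whose rows are the explicit feature vectors phi(x_i)
    M   : F x F learned parameter matrix
    """
    return Phi @ M @ Phi.T
```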
VI. EXPERIMENTS
In this section, we aim at evaluating the performance of
the proposed method and comparing it with many existing
graph kernels and deep learning approaches for graphs. We
pay special attention to the performance of our method and DGCNN, to see whether the proposed generalization helps to improve the predictive performance. To this end, various experiments are conducted in two settings, following the experimental procedure used in [17], on eight graph datasets (see Table I for a summary). The code for our experiments is available online at
https://github.com/dinhinfotech/PGC-DGCNN.
In the first setting, we compare the performance of our
method with DGCNN and state-of-the-art graph kernels: the
graphlet kernel (GK) [1], the random walk kernel (RW) [2],
the propagation kernel (PK) [30], and the Weisfeiler-Lehman
subtree kernel (WL) [32]. We do not include other state-of-
the-art graph kernels such as NSPDK [33] and ODD [26]
because their performance is not much different from that of the considered ones, and it is beyond the scope of this paper to extensively compare the graph kernels in the literature. In this
setting, five datasets containing biological node-labeled graphs
are employed, namely MUTAG [34], PTC [35], NCI1 [36],
PROTEINS, and D&D [37]. In the first three datasets, each
graph represents a chemical compound, where nodes are
labeled with the atom type, and edges represent bonds between
them. MUTAG is a dataset of aromatic and hetero-aromatic
nitro compounds, where the task is to predict their mutagenic
effect on a bacterium. In PTC, the task is to predict the carcinogenicity of chemical compounds for male and female rats. NCI1 contains anti-cancer screens for cell lung cancer. In PRO-
TEINS and D&D, each graph represents a protein. The nodes
are labeled according to the amino-acid type. The proteins are
classified into two classes: enzymes and non-enzymes.
In the second setting, we evaluate the performance
of the proposed method and DGCNN along with other deep
learning approaches for graphs: PATCHY-SAN (PSCN) [21],
Diffusion CNN (DCNN) [22], ECC [15] and Deep Graphlet
Kernel (DGK) [31]. In this setting, three biological datasets
(NCI1, PROTEINS and D&D) and three social network
datasets from [31] (COLLAB, IMDB-B and IMDB-M) are
used. COLLAB is a dataset of scientific collaborations, where ego-networks are generated for researchers and classified into three research fields. IMDB-B (binary) is a movie collaboration dataset where ego-networks of actors/actresses are classified into the action or romance genre. IMDB-M is a multi-class version of IMDB-B, containing the genres comedy, romance, and sci-fi.
In this setting, we eliminate MUTAG and PTC since they
have a small number of examples which easily causes over-
fitting problems for deep learning approaches.
Evaluation method and model selection: to evaluate the dif-
ferent methods, a nested 10-fold cross-validation is employed,
i.e., one fold is used for testing and 9 folds for training, of which one is used as a validation set for model selection.
TABLE II: Comparison with graph kernels. ∗: our proposed approach. DGCNN is similar to our approach with r = 1.

Dataset            | MUTAG      | PTC        | NCI1       | PROTEINS   | D&D
GK                 | 81.39±1.74 | 55.65±0.46 | 62.49±0.27 | 71.39±0.31 | 74.38±0.69
RW                 | 79.17±2.07 | 55.91±0.32 | >3 days    | 59.57±0.09 | >3 days
PK                 | 76.00±2.69 | 59.50±2.44 | 82.54±0.47 | 73.68±0.68 | 78.25±0.51
WL                 | 84.11±1.91 | 57.97±2.49 | 84.46±0.45 | 74.68±0.49 | 78.34±0.62
DGCNN              | 85.83±1.66 | 58.59±2.47 | 74.44±0.47 | 75.54±0.94 | 79.37±0.94
PGC-DGCNN (r = 2)∗ | 87.22±1.43 | 61.06±1.83 | 76.13±0.73 | 76.45±1.02 | 78.93±0.91
TABLE III: Comparison with other deep learning approaches. ∗: our proposed approach. DGCNN is similar to our approach with r = 1.

Dataset     | NCI1       | PROTEINS   | D&D        | COLLAB     | IMDB-B     | IMDB-M
PSCN        | 76.34±1.68 | 75.00±2.51 | 76.27±2.64 | 72.60±2.15 | 71.00±2.29 | 45.23±2.84
DCNN        | 56.61±1.04 | 61.29±1.60 | 58.09±0.53 | 52.11±0.71 | 49.06±1.37 | 33.49±1.42
ECC         | 76.82      | –          | 72.54      | –          | –          | –
DGK         | 62.48±0.25 | 71.68±0.50 | –          | 73.09±0.25 | 66.96±0.56 | 44.55±0.52
DGCNN       | 74.44±0.47 | 75.54±0.94 | 79.37±0.94 | 73.76±0.49 | 70.03±0.86 | 47.83±0.85
PGC-DGCNN∗  | 76.13±0.73 | 76.45±1.02 | 78.93±0.91 | 75.00±0.58 | 71.62±1.22 | 47.25±1.44
For each dataset,
we repeated each experiment 10 times and report the average
accuracy over the 100 resulting folds. To select the best model,
the hyper-parameters’ values of different kernels are set as
follows: the height of WL and PK in {0,1,2,3,4,5}, the bin
width of PK to 0.001, the size of the graphlets in GK to 3 and
the decay of RW to the largest power of 10 that is smaller than
the reciprocal of the squared maximum node degree. Note that some of the results we report are taken from [17].
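The evaluation protocol can be summarized by the following sketch (ours, using scikit-learn's StratifiedKFold; fit and accuracy are placeholder functions, and the candidate grid stands for the kernel hyper-parameters listed above):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def evaluate(graphs, y, candidates, fit, accuracy, repeats=10, folds=10):
    """Protocol sketch: 10-fold CV with one training fold held out for model
    selection, repeated `repeats` times; accuracy is averaged over all test folds.
    `fit(graphs, y, params)` and `accuracy(model, graphs, y)` are placeholders."""
    y = np.asarray(y)
    scores = []
    for rep in range(repeats):
        outer = StratifiedKFold(n_splits=folds, shuffle=True, random_state=rep)
        for train_idx, test_idx in outer.split(np.zeros(len(y)), y):
            n_val = len(train_idx) // (folds - 1)        # one of the 9 folds for validation
            val_idx, fit_idx = train_idx[:n_val], train_idx[n_val:]
            best = max(candidates, key=lambda p: accuracy(
                fit([graphs[i] for i in fit_idx], y[fit_idx], p),
                [graphs[i] for i in val_idx], y[val_idx]))
            model = fit([graphs[i] for i in train_idx], y[train_idx], best)
            scores.append(accuracy(model, [graphs[i] for i in test_idx], y[test_idx]))
    return np.mean(scores), np.std(scores)
```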
Network architecture: we employ the network architecture
used in [17] to have a fair comparison with DGCNN. The
network consists of three graph convolution layers, a con-
catenation layer, a SortPooling layer, followed by two 1-
D convolutional layers and one dense layer. The activation
function for the graph convolutions is the hyperbolic tangent,
while the 1D convolutions and the dense layer use rectified
linear units. Note that our proposal, as presented in Section
IV, is a generalization of DGCNN. In other words, DGCNN is very similar to a special case of our method where just the neighbors at shortest-path distance 1 are considered. On the contrary, our proposal treats the distance as a hyper-parameter, r, allowing it to flexibly capture local structures associated with graph nodes. In this section, we set r equal to 2 as a first attempt, and plan to explore neighborhoods with nodes at higher distances in future work.
A. Experimental Results
Table II and III show the performance of various methods
in the first and second settings, respectively. Overall, DGCNN
and our proposed method outperform the compared kernels
and deep learning methods in most datasets.
As can be seen from Table II, DGCNN and the proposed method (PGC-DGCNN) achieve higher performance on four out of five datasets, with an improvement ranging from 1.03% to 3.11% with respect to the best performing kernel. Compared to the RW kernel, our proposed method achieves the largest improvements on MUTAG and PROTEINS, of about 8% and 17%, respectively. Concerning PK and WL, which are similar in spirit to DGCNN and our method as shown in [17], DGCNN and PGC-DGCNN show higher performance in most cases, with a larger gap with respect to PK. It is worth noticing, when comparing with PK and WL, that their optimal models in each experiment are selected by tuning the height parameter h over a range of pre-defined values. Instead, DGCNN and our method are evaluated with a fixed number of layers only. This suggests that the performance of DGCNN and the proposed method could be even higher if we validated the number of stacked graph convolutional layers.
Regarding the performance of the various deep learning methods reported in Table III, our method and DGCNN obtain the highest results in five out of six cases, the exception being NCI1, where they show marginally lower results. Compared with DCNN, DGCNN and our method obtain dramatically higher accuracies, with improvements ranging from around 14% up to 21%.
We now turn to the difference between the performance of DGCNN and that of our proposal. It can be seen from Tables II and III that our method performs better than DGCNN on the majority of the datasets. In particular, PGC-DGCNN outperforms DGCNN in six out of eight cases, with a consistent improvement of about 1% to 2%. On D&D and IMDB-M, the accuracy of our method is slightly lower than that of DGCNN; however, these declines are only marginal. The generally improved performance of our method compared to DGCNN can be explained by the fact that our method parameterizes the graph convolutions, making it a generalization of DGCNN (we recall that we fix the neighborhood distance r = 2). In this case, our method captures more information about the local graph structure associated with each node compared to DGCNN, which considers just the direct neighbors, i.e. r = 1. It is worth noticing that (1) we use a single value of r to build our model, although, in general, an optimal model can be chosen by tuning r over a range of values; and (2) we use the architecture proposed in [17], meaning that we have not tried to optimize the network architecture. Therefore, the performance of our method could improve if we optimized the distance parameter r and the number of graph convolutional layers, together with the rest of the architecture.
VII. CONCLUSIONS AND FUTURE WOR KS
In this paper, we presented a new definition of graph convolutional filter. It generalizes the most commonly adopted filter, adding a hyper-parameter that controls the distance of the considered neighborhood. Experimental results show that our proposed filter improves the predictive performance of Deep Graph Convolutional Neural Networks on many real-world datasets.
In the future, we plan to analyze more in depth the impact of filter size in graph convolutional networks. We will define 1D convolutions as special cases of graph convolutions, and we will explore fully graph-convolutional neural architectures, which avoid fully-connected layers and possibly stack more graph convolution layers. Moreover, we will explore the impact of different activation functions for the graph convolutions in such a setting. Finally, we plan to enhance the input graph representation by associating with each node the explicit features extracted by graph kernels.
ACKNOWLEDGMENT
This project was funded, in part, by the Department of
Mathematics, University of Padova, under the DEEP project
and DFG project, BA 2168/3-3.
REFERENCES
[1] N. Shervashidze, S. Vishwanathan, T. Petri, K. Mehlhorn, and K. Borg-
wardt, “Efficient graphlet kernels for large graph comparison, in Arti-
ficial Intelligence and Statistics, 2009, pp. 488–495.
[2] S. V. N. Vishwanathan, N. N. Schraudolph, R. Kondor, and K. M.
Borgwardt, “Graph kernels, Journal of Machine Learning Research,
vol. 11, no. Apr, pp. 1201–1242, 2010.
[3] G. Da San Martino, N. Navarin, and A. Sperduti, “A tree-
based kernel for graphs,” in Proceedings of the Twelfth SIAM
International Conference on Data Mining, Anaheim, California,
USA, April 26-28, 2012., 2012, pp. 975–986. [Online]. Available:
https://doi.org/10.1137/1.9781611972825.84
[4] L. van der Maaten, “Learning discriminative fisher kernels,” in Proceed-
ings of the 28th International Conference on Machine Learning, ICML
2011, Bellevue, Washington, USA, June 28 - July 2, 2011, 2011, pp.
217–224.
[5] F. Aiolli, G. Da San Martino, M. Hagenbuchner, and A. Sperduti,
“Learning nonsparse kernels by self-organizing maps for structured
data,” IEEE Trans. Neural Networks, vol. 20, no. 12, pp. 1938–1949,
2009. [Online]. Available: https://doi.org/10.1109/TNN.2009.2033473
[6] D. Bacciu, A. Micheli, and A. Sperduti, “Generative kernels for
tree-structured data,” IEEE Trans. Neural Netw. Learning Syst., vol.
early access, 2018. [Online]. Available: https://ieeexplore.ieee.org/
document/8259316/
[7] W. L. Hamilton, R. Ying, and J. Leskovec, “Representation learning on
graphs: Methods and applications,” CoRR, vol. abs/1709.05584, 2017.
[Online]. Available: http://arxiv.org/abs/1709.05584
[8] A. Micheli, “Neural network for graphs: A contextual constructive
approach,” IEEE Transactions on Neural Networks, vol. 20, no. 3, pp.
498–511, 2009.
[9] F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini, “The Graph Neural Network Model,” IEEE Transactions on Neural Networks, vol. 20, no. 1, pp. 61–80, 2009. [Online]. Available: http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=4700287
[10] A. Sperduti and A. Starita, “Supervised neural networks for
the classification of structures,” IEEE Trans. Neural Networks,
vol. 8, no. 3, pp. 714–735, 1997. [Online]. Available: https:
//doi.org/10.1109/72.572108
[11] Y. Li, D. Tarlow, M. Brockschmidt, and R. Zemel, “Gated Graph
Sequence Neural Networks,” in ICLR, 2016. [Online]. Available:
http://arxiv.org/abs/1511.05493
[12] X. Bresson and T. Laurent, “An Experimental Study of Neural Networks
for Variable Graphs,” in ICLR 2018 Workshop, 2018.
[13] Q. Liao and T. Poggio, “Bridging the Gaps Between Residual Learning,
Recurrent Neural Networks and Visual Cortex, arXiv preprint, 2016.
[Online]. Available: http://arxiv.org/abs/1604.03640
[14] D. Duvenaud, D. Maclaurin, J. Aguilera-Iparraguirre, R. Gómez-Bombarelli, T. Hirzel, A. Aspuru-Guzik, and R. P. Adams, “Convolutional networks on graphs for learning molecular fingerprints,” in Advances in Neural Information Processing Systems, Montreal, Canada, 2015, pp. 2215–2223.
[15] M. Simonovsky and N. Komodakis, “Dynamic edge-conditioned filters
in convolutional neural networks on graphs, in CVPR, 2017.
[16] T. N. Kipf and M. Welling, “Semi-Supervised Classification with
Graph Convolutional Networks, in ICLR, 2017, pp. 1–14. [Online].
Available: http://arxiv.org/abs/1609.02907
[17] M. Zhang, Z. Cui, M. Neumann, and Y. Chen, “An End-to-End Deep
Learning Architecture for Graph Classification,” in AAAI Conference on
Artificial Intelligence, 2018.
[18] S. Abu-El-Haija, A. Kapoor, B. Perozzi, and J. Lee, “N-GCN: Multi-
scale Graph Convolution for Semi-supervised Node Classification,
in Proceedings of the 14th International Workshop on Mining
and Learning with Graphs (MLG), 2018. [Online]. Available:
http://arxiv.org/abs/1802.08888
[19] C. Szegedy, Wei Liu, Yangqing Jia, P. Sermanet, S. Reed, D. Anguelov,
D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with
convolutions, in 2015 IEEE Conference on Computer Vision and
Pattern Recognition (CVPR). IEEE, jun 2015, pp. 1–9. [Online].
Available: http://ieeexplore.ieee.org/document/7298594/
[20] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio, “Graph Attention Networks,” in ICLR, 2018. [Online]. Available: http://arxiv.org/abs/1710.10903
[21] M. Niepert, M. Ahmed, and K. Kutzkov, “Learning convolutional neural
networks for graphs,” in International conference on machine learning,
2016, pp. 2014–2023.
[22] J. Atwood and D. Towsley, “Diffusion-convolutional neural networks,”
in Advances in Neural Information Processing Systems, 2016, pp. 1993–
2001.
[23] T. Gärtner, P. Flach, and S. Wrobel, “On Graph Kernels: Hardness Results and Efficient Alternatives,” in Proceedings of the 16th Annual Conference on Computational Learning Theory and 7th Kernel Workshop, ser. Lecture Notes in Computer Science, B. Schölkopf and M. K. Warmuth, Eds., vol. 2777. Berlin, Heidelberg: Springer Berlin Heidelberg, 2003, pp. 129–143. [Online]. Available: http://link.springer.com/10.1007/b12006
[24] N. Shervashidze, S. V. N. Vishwanathan, T. H. Petri, K. Mehlhorn, and K. M. Borgwardt, “Efficient graphlet kernels for large graph comparison,” in AISTATS, vol. 5. Clearwater Beach, Florida, USA: CSAIL, 2009, pp. 488–495.
[25] K. Borgwardt and H.-P. Kriegel, “Shortest-Path Kernels on Graphs,”
in ICDM. Los Alamitos, CA, USA: IEEE, 2005, pp. 74–
81. [Online]. Available: http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.
htm?arnumber=1565664
[26] G. Da San Martino, N. Navarin, and A. Sperduti, “Ordered Decompo-
sitional DAG Kernels Enhancements,” Neurocomputing, vol. 192, pp.
92–103, 2016.
[27] ——, “A Tree-Based Kernel for Graphs,” in Proceedings of the Twelfth
SIAM International Conference on Data Mining, 2012, pp. 975–986.
[28] ——, “Graph Kernels Exploiting Weisfeiler-Lehman Graph Isomorphism Test Extensions,” in Neural Information Processing, vol. 8835, 2014, pp. 93–100. [Online]. Available: http://link.springer.com/10.1007/978-3-319-12640-1_12
[29] N. Shervashidze, P. Schweitzer, E. J. van Leeuwen, K. Mehlhorn, and
K. M. Borgwardt, “Weisfeiler-Lehman Graph Kernels, JMLR, vol. 12,
pp. 2539–2561, 2011.
[30] M. Neumann, N. Patricia, R. Garnett, and K. Kersting, “Efficient graph
kernels by randomization,” in Joint European Conference on Machine
Learning and Knowledge Discovery in Databases. Springer, 2012, pp.
378–393.
[31] P. Yanardag and S. Vishwanathan, “Deep graph kernels,” in Proceedings
of the 21th ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining. ACM, 2015, pp. 1365–1374.
[32] N. Shervashidze, P. Schweitzer, E. J. v. Leeuwen, K. Mehlhorn, and
K. M. Borgwardt, “Weisfeiler-lehman graph kernels,” Journal of Ma-
chine Learning Research, vol. 12, no. Sep, pp. 2539–2561, 2011.
[33] F. Costa and K. De Grave, “Fast neighborhood subgraph pairwise
distance kernel,” in ICML. Omnipress, 2010, pp. 255–262.
[34] A. K. Debnath, R. L. Lopez de Compadre, G. Debnath, A. J.
Shusterman, and C. Hansch, “Structure-activity relationship of
mutagenic aromatic and heteroaromatic nitro compounds. Correlation
with molecular orbital energies and hydrophobicity,” Journal of
Medicinal Chemistry, vol. 34, no. 2, pp. 786–797, feb 1991. [Online].
Available: http://pubs.acs.org/doi/abs/10.1021/jm00106a046
[35] H. Toivonen, A. Srinivasan, R. D. King, S. Kramer, and C. Helma,
“Statistical evaluation of the predictive toxicology challenge 2000-2001,
Bioinformatics, 2003.
[36] N. Wale, I. Watson, and G. Karypis, “Comparison of descriptor spaces
for chemical compound retrieval and classification, Knowledge and
Information Systems, vol. 14, no. 3, pp. 347–375, 2008.
[37] P. D. Dobson and A. J. Doig, “Distinguishing Enzyme Structures from
Non-enzymes Without Alignments, Journal of Molecular Biology, vol.
330, no. 4, pp. 771–783, 2003.