Hcore-Init: Neural Network Initialization based on
Graph Degeneracy
Stratis Limnios
École Polytechnique
Palaiseau, France

George Dasoulas
École Polytechnique &
Noah’s Ark Lab, Huawei Technologies
Paris, France

Dimitrios M. Thilikos
LIRMM, Univ Montpellier, CNRS
Montpellier, France

Michalis Vazirgiannis
École Polytechnique
Palaiseau, France
Abstract—Neural networks are the pinnacle of artificial intelligence, as in recent years we witnessed many novel architectures, learning and optimization techniques for deep learning. As neural networks inherently constitute multipartite graphs among neuron layers, we aim to analyze directly their structure to extract meaningful information from the learning processes. To our knowledge, graph mining techniques for enhancing learning in neural networks have not been thoroughly investigated. In this paper we propose an adapted version of the k-core structure for the complete weighted multipartite graph extracted from a deep learning architecture. As a multipartite graph is a combination of bipartite graphs, and since bipartite graphs are the incidence graphs of hypergraphs, we design the k-hypercore decomposition, the hypergraph analogue of the k-core degeneracy. We applied hypercore to several neural network architectures, more specifically to convolutional neural networks and multilayer perceptrons, for image recognition after a very short pretraining. Then we used the information provided by the core numbers of the neurons to re-initialize the weights of the neural network to give a more specific direction for the gradient optimization scheme. Extensive experiments proved that hypercore outperforms state-of-the-art initialization methods.
I. INTRODUCTION
During the last decade, deep learning has been a central focus of the research community. Its applications in a wide variety of scientific and industrial fields have highlighted the need for new approaches at the level of neural network design. Researchers have studied different aspects of Neural Network (NN) architectures and how these can be made optimal for various tasks, e.g., the optimization method used for error backpropagation, the contribution of the activation functions between the NN layers, or normalization techniques that encourage loss convergence, such as batch normalization, dropout layers, etc.
Weight initialization is one of the aspects of NN model
design that contribute the most to the gradient flow of the
hidden layer weights and by extension to the ability of the
neural network to learn. The main focus on the matter of
weight initialization ([4], [5]) is the observation that weights
among different layers can have a high variance, making the
gradients more likely to explode or vanish. Neural Networks capitalize on graph structure by design. Surprisingly, there has been very little work analyzing them as graphs with edge and/or node attributes. Recent work [16] introduces graph metrics to produce latent representation sets, capitalizing on bipartite matching directly implemented in the neural network architecture, which proved to be a very powerful method. Also, the work of Morris et al. [11] analyzes the expressivity of Graph Neural Networks using the Weisfeiler-Leman isomorphism test. Our interest lies in refining the optimization scheme by capitalizing on graph metrics and decompositions. One natural candidate is the k-core decomposition [15]. Indeed, this decomposition method, being very efficient ($O(n \log n)$ in the best cases [2]), performs very well in state-of-the-art frameworks for enhancing supervised learning methods [12], providing key subgraphs and extracting informative features.

Fig. 1. Hypergraph and the corresponding incidence graph
Unfortunately, in the case of a graph representing a neural network, the k-core might lack some features. As a matter of fact, graphs extracted from NNs constitute complete weighted multipartite graphs in the case of a Multilayer Perceptron, and almost complete ones for Convolutional Neural Networks. Different k-core variants for different types of graphs, such as the k-truss [14], which counts triangles, and the D-core [3] for directed graphs, have been designed over the past decade. A natural thought was then to design our own version of the k-core for our precise graph structure.
Hence our contributions are the following:
• We provide a unified method for constructing the graph representation of a neural network as a block composition of the given architecture (see Fig. 2). This is achieved by transforming each part of the network (i.e., linear or convolutional layers, normalization/dropout layers and pooling operators) into a subgraph. Having this graph representation, it is possible to apply different types of combinatorial algorithms to extract information from the graph structure of the network.
• We design a new degeneracy framework, namely the k-hypercore, extending the concept of the k-core to bipartite graphs by considering that each pair of layers of the neural network, constituting a bipartite graph, is the incidence graph of a hypergraph (cf. Fig. 1).
• We propose a novel weight initialization scheme, Hcore-Init, which uses the information provided by the weighted version of the k-hypercore of the graph extracted from the NN to re-initialize the weights of the given neural network, in our case a Convolutional Neural Network and a Multilayer Perceptron. Our proposal clearly outperforms traditional initialization methods on classical deep learning tasks.
The rest of this paper is organized as follows. First, we give some preliminary definitions and an overview of state-of-the-art neural network initialization methods. Then we provide the methodology that allows us to transform neural networks into edge-weighted graphs. Further on, we proceed to the main contribution of the paper, namely the definition of the hypercore degeneracy and the procedure that produces our initialization method. Finally, we test our method on several image classification datasets, comparing it to the main initialization method used in neural networks.
II. PRELIMINARIES
In deep neural networks, weight initialization is a vital factor in the performance of different architectures [10]. The reason
is that an appropriate initialization of the weights of the neural
network can avert the explosion or vanishing of the layer
activation output values.
A. Initialization methods
1) Xavier Initialization: One of the most popular initialization methods is Xavier initialization [4]. According to it, the weights of the network are initialized by drawing them from a normal distribution with $\mathbb{E}[W] = 0$ and $\mathrm{Var}(w_i) = \frac{1}{\mathrm{fan}_{in}}$, where $\mathrm{fan}_{in}$ is the number of incoming neurons. More generally, we can also define the variance with respect to the number of outgoing neurons as $\mathrm{Var}(w_i) = \frac{1}{\mathrm{fan}_{in} + \mathrm{fan}_{out}}$, where $\mathrm{fan}_{out}$ is the number of neurons that the output is directed to.
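The fan-in variant above amounts to drawing each weight from $\mathcal{N}(0, 1/\mathrm{fan}_{in})$; a minimal PyTorch sketch of this rule (our own illustration, the use of PyTorch being an assumption, not prescribed by the paper):

```python
import torch
import torch.nn as nn

def xavier_fan_in_(weight: torch.Tensor) -> torch.Tensor:
    """Draw weights in place from N(0, 1/fan_in), the fan-in variant described above."""
    fan_in = weight.shape[1]              # nn.Linear stores weights as (fan_out, fan_in)
    std = (1.0 / fan_in) ** 0.5
    with torch.no_grad():
        return weight.normal_(mean=0.0, std=std)

layer = nn.Linear(400, 120)
xavier_fan_in_(layer.weight)
```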
2) Kaiming Initialization: Although the Xavier initialization method manages to keep the variance of all layers equal, it assumes that the activation function is linear. In most of the cases where Xavier initialization is used with a non-linear activation function, the hyperbolic tangent activation is employed. The need to take the activation function into account for the weight initialization led to the Kaiming initialization [5]. According to this method, in the case where we employ ReLU activation functions, we initialize the network weights by drawing samples from a normal distribution with zero mean, $\mathbb{E}[W] = 0$, and a variance that depends on the layer: $\mathrm{Var}[W] = \frac{2}{n_l}$, where $n_l$ is the number of incoming connections (fan-in) of the $l$-th layer.
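PyTorch exposes this scheme as `torch.nn.init.kaiming_normal_`; a short usage sketch (again our illustration rather than the paper's code):

```python
import torch.nn as nn

layer = nn.Linear(400, 120)
# He/Kaiming normal: W ~ N(0, 2 / fan_in), suited to ReLU activations.
nn.init.kaiming_normal_(layer.weight, nonlinearity='relu')
nn.init.zeros_(layer.bias)
```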
One main assumption for weight initialization is that the mean of the random distribution used for initialization needs to be 0. Otherwise, the calculation of the variances presented above could not be carried out and we would not have a fixed way to set the variance. Since in our work we want to capitalize on the k-hypercore decomposition to bias those distributions, we have to face the fact that we might not be able to control the variance of the weights we initialize. Thankfully, the fact that the initial distribution has zero mean ensures that our method respects this condition on every layer of the neural network as well.
Moreover, since the k-hypercore decomposition is defined over hypergraphs, let us recall some relations between hypergraphs and bipartite graphs.
B. Hypergraphs and Bipartite graphs
A hypergraph is a generalization of a graph in which an edge can join any number of vertices. It can be represented (and we keep this notation for the rest of the paper) as $H = (V, E_H)$, where $V$ is the set of nodes and $E_H$ is the set of hyperedges, i.e., a set of subsets of $V$; therefore $E_H$ is a subset of $\mathcal{P}(V)$. Moreover, a bipartite graph is the incidence graph of a hypergraph [13]. Indeed, a hypergraph $H$ may be represented by a bipartite graph $G$ as follows: the sets $V$ and $E_H$ are the two parts of $G$, and a vertex $x_1$ and a hyperedge $e_1$ are connected by an edge if and only if $x_1$ is contained in $e_1$ in $H$. Conversely, any bipartite graph with fixed parts and no unconnected nodes in the second part represents some hypergraph in the manner described above. Hence, every pair of layers in the neural network can be viewed as a hypergraph, where the left layer represents the hyperedges and the right layer the nodes (see Fig. 1).
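To make the correspondence concrete, the following sketch builds the incidence bipartite graph of a small hypergraph with networkx (the library choice and node naming are our own, for illustration only):

```python
import networkx as nx

def incidence_graph(hyperedges):
    """Bipartite incidence graph of a hypergraph given as a list of node sets.
    Hyperedges become left-part nodes, hypergraph vertices become right-part nodes."""
    G = nx.Graph()
    for e_idx, members in enumerate(hyperedges):
        G.add_node(("e", e_idx), bipartite=0)   # hyperedge side
        for v in members:
            G.add_node(("v", v), bipartite=1)   # vertex side
            G.add_edge(("e", e_idx), ("v", v))
    return G

# H with hyperedges {1, 2, 3} and {2, 4} -> a bipartite graph with 2 + 4 nodes and 5 edges
G = incidence_graph([{1, 2, 3}, {2, 4}])
```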
III. GRAPH CHARACTERIZATION OF NEURAL NETWORKS
We will now describe how we transpose the two classic neural network architectures we investigate to graphs, and more specifically to bipartite ones. Also, from now on, we are going to refer to a fully-connected neural network as an FCNN and to a convolutional neural network as a CNN [8].
A. Fully-Connected Neural Networks
Let $F$ be an FCNN with $L$ hidden layers, $n_i$, $i = 1, \dots, L$, hidden units per layer and $W_i \in \mathbb{R}^{n_i \times n_{i+1}}$ the weight matrix of the links between the units of layers $i$ and $i+1$.
We define the graph $G_F = (V, E, W)$ as the graph representation of the FCNN $F$, where the set of nodes $V$ corresponds to the $\sum_{i=1}^{L} n_i$ hidden units of $F$, the set of edges $E$ contains all the links of unit pairs across the layers of $F$, and the edge weight matrix $W$ corresponds to the link weight matrices $W_i$, $i = 1, \dots, L-1$. We note that the graph representation $G_F$ does not take into account any activation functions $\sigma$ used in $F$.
Fig. 2. Illustration of the transformation of a CNN to a graph.
Remark. It is easy to see that $G_F$ is a $k$-partite graph (i.e., a graph whose vertices can be partitioned into $k$ independent sets) and, more specifically, a union of $L-1$ complete bipartite graphs.
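For illustration, a sketch of how $G_F$ could be assembled from the linear layers of a PyTorch model with networkx; the module traversal and node naming are our own assumptions:

```python
import networkx as nx
import torch.nn as nn

def fcnn_to_graph(model: nn.Sequential) -> nx.Graph:
    """Union of complete bipartite graphs, one per consecutive pair of layers,
    with the (pre)trained weights as edge weights; activations are ignored."""
    G = nx.Graph()
    layer_idx = 0
    for module in model:
        if isinstance(module, nn.Linear):
            W = module.weight.detach()            # shape: (n_{i+1}, n_i)
            n_out, n_in = W.shape
            for i in range(n_in):
                for j in range(n_out):
                    G.add_edge((layer_idx, i), (layer_idx + 1, j),
                               weight=W[j, i].item())
            layer_idx += 1
    return G

G_F = fcnn_to_graph(nn.Sequential(nn.Linear(400, 120), nn.ReLU(), nn.Linear(120, 84)))
```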
B. Convolutional Neural Networks
After showing the correspondence between an FCNN $F$ and its graph representation $G_F$, we are ready to define the graph representation of a CNN layer. Let $C$ be a CNN layer. The convolutional layer is characterized by the input information that has $I$ input channels, where each channel provides $n \times n$ features (e.g., a $24 \times 24$ image characterized by the 3 RGB channels), the output information that has $O$ output channels, where each channel has $m \times m$ features, and the matrix of the convolutional kernel $F \in \mathbb{R}^{w \times h \times I \times O}$, where $w, h$ are the width and height of the kernel.
In order to define the graph $G_C = (V, E, W)$ as the graph representation of the CNN layer $C$, we have to flatten the 3- and 4-dimensional input, output, and filter matrices correspondingly. Specifically, $G_C$ is a bipartite graph, where the first partition of nodes $P_1$ is the flattened input information of the CNN layer ($|P_1| = I \times n \times n$) and the second partition of nodes $P_2$ is the flattened output information ($|P_2| = O \times m \times m$).
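One possible way to materialize $G_C$ is sketched below: every output position is linked to the input positions in its receptive field, carrying the corresponding kernel entry as edge weight. The kernel layout (O, I, h, w) follows `torch.nn.Conv2d.weight`, and the square-kernel, square-input, no-padding assumptions are ours; the paper does not fix these details.

```python
import networkx as nx
import torch

def conv_layer_to_bipartite(kernel: torch.Tensor, n: int, stride: int = 1) -> nx.Graph:
    """Bipartite graph of a single convolutional layer (valid convolution, no padding).
    Left part P1: the I*n*n input positions; right part P2: the O*m*m output positions."""
    O, I, h, w = kernel.shape
    m = (n - h) // stride + 1                       # output spatial size
    G = nx.Graph()
    for oc in range(O):
        for oi in range(m):
            for oj in range(m):
                out_node = ("out", oc, oi, oj)
                for ic in range(I):
                    for di in range(h):
                        for dj in range(w):
                            in_node = ("in", ic, oi * stride + di, oj * stride + dj)
                            G.add_edge(in_node, out_node,
                                       weight=kernel[oc, ic, di, dj].item())
    return G
```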
IV. WEIGHT INITIALIZATION BASED ON HCORE
As degeneracy frameworks have proven to be very efficient at extracting influential individuals in a network [1], why not use the structural information provided by the hcore decomposition of the network to identify “influential” neurons? Having a neural network graph, we provide a definition of degeneracy specifically for bipartite graphs, where the standard k-core does not provide much relevant information.
Definition 1 (Hypercore). Given a hypergraph $H = (V, E_H)$, we define the $(k, l)$-hypercore as a maximal connected subgraph of $H$ in which all vertices have hyperdegree at least $k$ and all hyperedges have at least $l$ incident nodes.

Fig. 3. Example of a k-hcore decomposition of a hypergraph

From now on, we will refer to the $(k, 2)$-hypercore as the $k$-hcore and, similarly, the hcore number of a node will be the largest value of $k$ for which the given node belongs to the $k$-hcore.
This hence provides a hypergraph decomposition and, in our case, a decomposition of the right part of the studied bipartite graph, as we do not care about the hcore of the hyperedges (cf. Fig. 3).
Since we deal with edge-weighted bipartite graphs, we will use the weighted degree to define the hcore ranking of the nodes, given the following weighted-hypercore definition:
Definition 2 (Weighted-hypercore). Given an edge-weighted hypergraph $H = (V, E_H)$, we define the $(k, l)$-weighted-hypercore as a maximal connected subgraph of $H$ in which all vertices have weighted hyperdegree at least $k$ and all hyperedges have at least $l$ incident nodes.
Again, we will refer to the $(k, 2)$-weighted-hypercore as the $k$-WHcore. Now that we have this weighted version, we need to define a way to initialize the weights of the neural network, since the WHcore is a value assigned to the nodes of the network and not to the edges, which are the weights we aim to initialize. The WHcore shows us which neurons gather the most information, positive on the one hand and negative on the other. After a quick pretraining, we learn the weights just enough to reveal which neurons have a higher impact on the learning. This information is then grouped by the WHcore into influential neurons and less influential ones.
Moreover, since the weights of a neural network are sampled from a centered normal distribution, we have both positive and negative weights. Since the hypercore framework operates on positive weighted degrees, we provide two graph representations of the neural network, namely $G^+$ and $G^-$. The $G^+$ graph is built upon the positive weights of the neural network, and the edge weights of the $G^-$ graph are the absolute values of the negative weights of the neural network. Indeed, if between neuron $x_i$ and neuron $y_j$ we have $w_{ij} > 0$, then we add an edge with weight $w_{ij}$ between nodes $x_i$ and $y_j$ in graph $G^+$; otherwise we add an edge with weight $|w_{ij}|$ between nodes $x_i$ and $y_j$ in graph $G^-$.
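A small sketch of this split for one linear layer (the node naming and use of networkx are our own):

```python
import networkx as nx
import torch

def signed_bipartite_graphs(W: torch.Tensor):
    """Split a layer's weight matrix W, shaped (fan_out, fan_in) as in
    torch.nn.Linear.weight, into the positive graph G+ and the negative graph G-,
    whose edge weights are |w_ij| for the negative entries."""
    g_pos, g_neg = nx.Graph(), nx.Graph()
    n_out, n_in = W.shape
    for i in range(n_in):                       # left-part nodes x_i
        for j in range(n_out):                  # right-part nodes y_j
            w = W[j, i].item()
            if w > 0:
                g_pos.add_edge(("x", i), ("y", j), weight=w)
            elif w < 0:
                g_neg.add_edge(("x", i), ("y", j), weight=abs(w))
    return g_pos, g_neg
```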
Algorithm 1 Hcore decomposition algorithm
1: procedure HCORE(G, rnodes)
2:   Input G: bipartite graph, rnodes: right-layer nodes
3:   Output hcore: dictionary of hcore values
4:   hcore ← dict((node, 0) for node in rnodes)
5:   tokeep ← rnodes
6:   k ← 1
7:   while tokeep ≠ ∅ do
8:     state ← True
9:     while state == True do
10:      state ← False
11:      current ← tokeep
12:      tokeep ← []
13:      for node ∈ current do
14:        if G.degree(node) > k then
15:          tokeep.append(node)
16:        else
17:          hcore[node] ← k
18:          G.remove(node)
19:          state ← True
20:        end if
21:      end for
22:      for node ∈ G.nodes \ rnodes do
23:        if G.degree(node) ≤ 1 then
24:          G.remove(node)
25:        end if
26:      end for
27:    end while
28:    k ← k + 1
29:  end while
30:  return hcore
31: end procedure
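For reference, a runnable Python transcription of Algorithm 1 using networkx (the library, the optional weighted-degree switch for the k-WHcore, and the extra re-check after dropping hyperedge nodes are our additions):

```python
import networkx as nx

def hcore(G: nx.Graph, rnodes, weight=None):
    """Peel the right-layer nodes of a bipartite graph as in Algorithm 1 and
    return a dict of hcore numbers (l = 2 is implicit: left-part nodes with at
    most one remaining neighbour are dropped). Pass weight="weight" for the
    weighted-degree (k-WHcore) variant."""
    G = G.copy()
    rnodes = set(rnodes)
    hcore_num = {node: 0 for node in rnodes}
    tokeep = {n for n in rnodes if n in G}
    k = 1
    while tokeep:
        changed = True
        while changed:
            changed = False
            survivors = set()
            for node in tokeep:
                if G.degree(node, weight=weight) > k:
                    survivors.add(node)
                else:
                    hcore_num[node] = k
                    G.remove_node(node)
                    changed = True
            tokeep = survivors
            # drop hyperedge (left-part) nodes with fewer than 2 incident nodes
            for node in [n for n in G.nodes if n not in rnodes]:
                if G.degree(node) <= 1:
                    G.remove_node(node)
                    changed = True   # re-check right nodes after hyperedges disappear
        k += 1
    return hcore_num
```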
Remark. It is important to note that the WHcore number of a node is the largest $k$ for which the node is contained in the $k$-WHcore. Also, the WHcore number of a node is a function of the degree of the node. As the degree depends on the weights, there exist two functions $g$ and $h$ such that $g(W, x)$ outputs the weighted degree of a node $x$, thus being a linear combination of the weights $W$, and $c(W, x) = h(g(W, x))$ is the WHcore number of the node $x$. For convenience, we now write $c(W^+, x_k) = c^+_k$, where $W^+$ are the positive weights of the weight matrix $W$.
Moreover, the following initialization schemes are applied after a small amount of pretraining of the neural network, in order to have preliminary information about the importance of the neurons for the task.
A. Initialization of the FCNN
The initialization is then dependent on the architecture we are looking at. For an FCNN, as the graph construction is fairly straightforward, we proceed as follows.
For every pair of layers, for both the positive and the negative graph, we have nodes $x_i$, with $i \in \{1, \dots, \mathrm{fan}_{in}\}$, on the left side of the bipartite graph and nodes $y_j$, with $j \in \{1, \dots, \mathrm{fan}_{out}\}$, on the right side. For every node $y_j$ we compute its WHcore from the graph $G^-$, $c^-_j$, and from the graph $G^+$, $c^+_j$. Then the weights $w_{ij}$ of the given layer are initialized, depending on their sign, from a normal distribution with mean
$$ m = \frac{c^+_j}{\sum_{1 \le k \le \mathrm{fan}_{out}} c^+_k} \text{ if } w_{ij} \ge 0, \qquad m = -\frac{c^-_j}{\sum_{1 \le k \le \mathrm{fan}_{out}} c^-_k} \text{ otherwise,} $$
and with the same variance as used in Kaiming initialization. We prove later that the overall mean value of the new random variable obtained in this fashion is 0 as well, justifying the use of the Kaiming variance as optimal.
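A vectorized sketch of this re-initialization rule for one linear layer; the helper name is ours, `c_plus` and `c_minus` are float tensors holding the WHcore numbers of the output neurons, and the minus sign on the negative branch follows the zero-mean argument given below:

```python
import math
import torch

def hcore_init_linear_(W: torch.Tensor, c_plus: torch.Tensor, c_minus: torch.Tensor):
    """Redraw each weight of W (shape (fan_out, fan_in), as in nn.Linear.weight)
    from N(m, 2/fan_in), where m depends on the sign of the pretrained weight."""
    fan_out, fan_in = W.shape
    std = math.sqrt(2.0 / fan_in)                   # Kaiming variance
    mean_pos = c_plus / c_plus.sum()                #  c+_j / sum_k c+_k
    mean_neg = -c_minus / c_minus.sum()             # -c-_j / sum_k c-_k
    with torch.no_grad():
        means = torch.where(W >= 0,
                            mean_pos.unsqueeze(1).expand_as(W),
                            mean_neg.unsqueeze(1).expand_as(W))
        W.copy_(torch.normal(means, std))
    return W
```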
B. Initialization of the CNN
For the CNN, since the induced graph is more intricate and the filter weights must follow the same distribution, the initialization framework has to be adapted. We still compute the WHcore on a pair of layers, but keeping the filters in mind: the left-layer nodes are $x^{(k)}_i$, with $i \in \{1, \dots, n \times n\}$ the input size and $k \in \{1, \dots, I\}$ the input channels. Similarly, the right-layer nodes are $y^{(k')}_j$, where $j \in \{1, \dots, m \times m\}$ is the output size and $k' \in \{1, \dots, O\}$ the output channels. We recall as well that we have two WHcores, one for the positive graph, $c^+$, and one for the negative graph, $c^-$. Then, for a given filter $w^{(k,k')}$, its values are initialized with the following method: we define, for a given filter $W$, $m(W^+) = \frac{1}{H^2} \sum_j c^+_j$ and $m(W^-) = \frac{1}{H^2} \sum_j c^-_j$; if $m(W^+) - m(W^-) > 0$ then $M = m(W^+)$, else $M = -m(W^-)$.
Using the notation given in the previous remark, we can write $m$ in the following general form:
$$ m = \mathrm{sign}\big(\arg\max(m(W^+), m(W^-))\big) \cdot \max\big(m(W^+), m(W^-)\big), $$
where $\mathrm{sign}(W^+) = 1$ and $\mathrm{sign}(W^-) = -1$.
This initialization is done for every filter, with the variance given by the Kaiming initialization method. We will now prove that, for the CNN, the overall expectation of the mean value produced in this way is indeed 0.
Proposition 1. Given a measurable function $f$ and two positive i.i.d. random variables $X^+$ and $X^-$, the random variable
$$ Z = \mathrm{sign}\big(\arg\max(f(X^+), f(X^-))\big) \cdot \max\big(f(X^+), f(X^-)\big) $$
has mean 0.
Proof. We recall that $\mathbb{I}\{x \in X\}$ is the indicator function:
$$ \mathbb{I}\{x \in X\} = \begin{cases} 1 & \text{if } x \in X \\ 0 & \text{otherwise.} \end{cases} $$
Let us evaluate the expectation of $Z$, provided that $X^+$ and $X^-$ are i.i.d.:
$$ \mathbb{E}[Z] = \mathbb{E}[Z\,\mathbb{I}\{f(X^+) > f(X^-)\}] + \mathbb{E}[Z\,\mathbb{I}\{f(X^+) \le f(X^-)\}] = \mathbb{E}[f(X^+)\,\mathbb{I}\{f(X^+) > f(X^-)\}] - \mathbb{E}[f(X^-)\,\mathbb{I}\{f(X^+) \le f(X^-)\}] = \mathbb{E}[(f(X^+) + f(X^-))\,\mathbb{I}\{f(X^+) > f(X^-)\}] - \mathbb{E}[f(X^-)]. $$
Given the initial assumptions, we can expand the first term $\mathbb{E}[f(X^+)\,\mathbb{I}\{f(X^+) > f(X^-)\}]$ as follows:
$$ \mathbb{E}[f(X^+)\,\mathbb{I}\{f(X^+) > f(X^-)\}] = \iint f(X^+)\,\mathbb{I}\{f(X^+) > f(X^-)\}\, dP(X^+)\, dP(X^-). $$
As $X^+$ and $X^-$ follow the same distribution, and $f$ is a measurable function, we use Fubini's theorem to swap the integrals as follows:
$$ \mathbb{E}[f(X^+)\,\mathbb{I}\{f(X^+) > f(X^-)\}] = \iint f(X^-)\,\mathbb{I}\{f(X^-) \ge f(X^+)\}\, dP(X^-)\, dP(X^+). $$
Now replacing this in the original equation gives us:
$$ \mathbb{E}[Z] = \iint f(X^-)\,\mathbb{I}\{f(X^-) \ge f(X^+)\}\, dP(X^-)\, dP(X^+) + \iint f(X^-)\,\mathbb{I}\{f(X^+) > f(X^-)\}\, dP(X^-)\, dP(X^+) - \mathbb{E}[f(X^-)] = \mathbb{E}[f(X^-)\,\mathbb{I}\{f(X^-) \ge f(X^+)\}] + \mathbb{E}[f(X^-)\,\mathbb{I}\{f(X^+) > f(X^-)\}] - \mathbb{E}[f(X^-)] = 0. $$
This completes our proof that $Z$ is a centered random variable.
Notice that, setting the function $m = l \circ g \circ h$, we can write $l \circ g = f$ and $X = h(W)$. As we defined previously, $h$ is the weighted-degree function of a node:
$$ h(W^+_j) = \sum_i W_{ij}\,\mathbb{I}\{W_{ij} > 0\}, \qquad h(W^-_j) = \sum_i |W_{ij}|\,\mathbb{I}\{W_{ij} \le 0\}, $$
which ensures that $h(W^+)$ and $h(W^-)$ follow the same distribution, both being linear combinations of absolute values of the same normal distribution. Replacing these functions in the previous proposition, i.e., $f = l \circ g$, $X^+ = h(W^+)$ and $X^- = h(W^-)$, proves that our initialization method has mean 0. This allows us to justify the use of the Kaiming variance in our initialization method, as it was proven to be the optimal one.
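As a quick numerical sanity check of Proposition 1 (not part of the original paper), one can estimate $\mathbb{E}[Z]$ by Monte Carlo; the half-normal variables and the particular measurable function $f$ below are arbitrary choices of ours:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

x_pos = np.abs(rng.normal(size=n))      # two i.i.d. positive random variables
x_neg = np.abs(rng.normal(size=n))
f = np.log1p                            # any measurable f

fp, fn = f(x_pos), f(x_neg)
# Z = sign(argmax(f(X+), f(X-))) * max(f(X+), f(X-)): +max when X+ wins, -max otherwise
z = np.where(fp > fn, fp, -fn)
print(z.mean())                         # close to 0, as the proposition states
```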
V. EXPERIMENTS
We will now evaluate our proposed weight initialization method on three standard image classification datasets, CIFAR-10, CIFAR-100, and MNIST, and compare it to Kaiming initialization. It is important to note that we do not experiment on state-of-the-art architectures for each dataset. Since our method can be used separately on different architecture blocks, i.e., only the convolutional layers, only the FCNN part, or both, we want to show that it outperforms standard initialization methods regardless of the refinement of the architecture. Hence, in this section, we evaluate the classification accuracy on the aforementioned datasets with the simple CNN architectures presented below.
A. Dataset specifications.
The CIFAR-10 and CIFAR-100 datasets are labeled subsets of the 80 Million Tiny Images dataset. They were collected by Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton [7].
• The CIFAR-10 dataset consists of 60000 32 × 32 colour images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images.
• The CIFAR-100 dataset is just like CIFAR-10, except that it has 100 classes containing 600 images each. There are 500 training images and 100 testing images per class.
We also test our model on the MNIST database of handwritten digits, which has a training set of 60000 examples and a test set of 10000 examples. The digits have been size-normalized and centered in a fixed-size image [9].
The CIFAR-10 dataset is divided into five training batches and one test batch, each with 10000 images. The test batch contains exactly 1000 randomly selected images from each class. The training batches contain the remaining images in random order, but some training batches may contain more images from one class than another. Between them, the training batches contain exactly 5000 images from each class.
B. Model setup and baseline.
Next, we present the models that were trained and evaluated for the image classification task. We note that for every case, we compare two scenarios:
1) Initialization of the model with Kaiming initialization [5], training on the train set for 150 epochs, and evaluation on the test set.
2) Pretraining of the model (using Kaiming initialization) for N epochs, re-initialization of the model with Hcore-Init, training on the train set for the remaining 150 − N epochs, and evaluation on the test set (see the sketch after this list). N is treated as a hyperparameter.
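A schematic of scenario 2; the callables are placeholders for a standard training loop and for the re-initialization routine of Section IV, not functions defined by the paper:

```python
import torch.nn as nn

def run_scenario_2(model: nn.Module, pretrain_fn, hcore_reinit_fn, train_fn,
                   n_pretrain: int = 10, total_epochs: int = 150) -> nn.Module:
    """Short Kaiming-initialized pretraining, Hcore-Init re-initialization,
    then training for the remaining epochs."""
    pretrain_fn(model, epochs=n_pretrain)                # phase 1: N pretraining epochs
    hcore_reinit_fn(model)                               # re-draw weights as in Section IV
    train_fn(model, epochs=total_epochs - n_pretrain)    # phase 2: remaining 150 - N epochs
    return model
```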
For the CIFAR-10 and CIFAR-100 datasets, we applied 2 convolutional layers with sizes 3 × 6 × 5 and 6 × 16 × 5 respectively, where 5 is the kernel size, and the stride was set to 1. Moreover, we applied a 2 × 2 max-pooling operator after each convolutional layer and, finally, three fully connected layers with corresponding sizes 400 × 120, 120 × 84 and 84 × #classes, where #classes = 10 and 100 respectively for the two datasets. Furthermore, we used ReLU as the activation function among the linear layers and tanh for the convolutional layers.
For the MNIST dataset, we applied again 2 convolutional layers, of sizes 1 × 10 × 5 and 10 × 20 × 5, where again the kernel size was set to 5 and the stride to 1. As in the other datasets, we employed two 2 × 2 max-pooling operators, and we performed dropout [17] on the output of the 2nd convolutional layer with probability p = 0.5. Finally, we applied 2 fully connected layers of sizes 320 × 50 and 50 × 10, with ReLU as the activation function throughout the layers.
In all cases, we employed stochastic gradient descent [6] with momentum set to 0.9 and learning rate set to 0.001. As we mentioned before, we chose 2 rather simple models, as we intend to highlight the contribution of Hcore-Init in comparison to its competitor and not to achieve state-of-the-art results for the given datasets, which have been exhaustively studied.
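For concreteness, the CIFAR architecture described above can be sketched in PyTorch as follows (the class name and this particular module layout are ours; layer sizes follow the text):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SmallCifarCNN(nn.Module):
    """Two tanh convolution blocks with 2x2 max-pooling, then three fully
    connected layers with ReLU in between, as described in the text."""
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 6, kernel_size=5, stride=1)
        self.conv2 = nn.Conv2d(6, 16, kernel_size=5, stride=1)
        self.pool = nn.MaxPool2d(2, 2)
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, num_classes)

    def forward(self, x):
        x = self.pool(torch.tanh(self.conv1(x)))   # 32x32 -> 28x28 -> 14x14
        x = self.pool(torch.tanh(self.conv2(x)))   # 14x14 -> 10x10 -> 5x5
        x = torch.flatten(x, 1)                    # 16 * 5 * 5 = 400 features
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        return self.fc3(x)

model = SmallCifarCNN(num_classes=10)
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
```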
C. Settings of the weight initialization.
Next, we present the contribution of Hcore-Init to the performance of the neural network architecture with respect to its application on different types of layers. Specifically, we applied the initialization configurations (a) exclusively on the set of linear layers, (b) exclusively on the set of convolutional layers, and (c) on the combined set of linear and convolutional layers of the model.
Fig. 4. Test accuracy (left) and train loss (right) on CIFAR-10 for the combined application of the initialization on the linear and the convolutional layers. For the curves Hcore-Init-x, x stands for the number of pretraining epochs.
In Figure 4, we can observe that for 15 pretraining epochs, the model initialized with Hcore-Init outperforms the model initialized with Kaiming initialization. It is also noteworthy that the loss convergence is faster when applying Hcore-Init. This empirically supports our initial motivation of encouraging the “important” weights by using the graph information from the model architecture.
Fig. 5. Test accuracy and train loss on CIFAR-10 for the initialization applied
only on the linear layers.
In Figures 5 and 6, we can again notice the contribution of Hcore-Init to the performance of the network, when the former is applied on the fully connected and convolutional layers respectively. We can see that in both cases, Hcore-Init with different numbers of pretraining epochs (10 and 20 respectively) achieves better accuracy results in comparison to Kaiming.
Fig. 6. Test accuracy and train loss on CIFAR-10 for the initialization applied
only on the convolutional layers.
TABLE I
Top accuracy results over initializing the full model, only the CNN and only
the FCNN for CIFAR-10, CIFAR-100, and MNIST. Hcore-Init* represents the top
performance over all the pretraining-epoch configurations up to 25.

                 CIFAR-10   CIFAR-100   MNIST
Kaiming           64.62      32.56      98.71
Hcore-Init*       65.22      33.48      98.91
Hcore-Init-1      64.91      32.87      98.59
Hcore-Init-5      64.41      32.96      98.70
Hcore-Init-10     65.22      33.41      98.81
Hcore-Init-15     64.94      33.45      98.64
Hcore-Init-20     65.05      33.39      98.87
Hcore-Init-25     64.72      33.48      98.91
Finally, we report in Table I the results of the experiments conducted on the 3 datasets. Those results correspond to an ablation study over the different numbers of pretraining epochs as well as the different initialization scenarios, i.e., initializing only the linear layers, only the convolutional layers, or the whole architecture. We kept, for each mentioned scenario, the best performance, and Hcore-Init* corresponds to the best overall accuracy. It is important to notice that we do not necessarily need a long pretraining phase to achieve the best results; in fact, only 10 epochs are usually more than enough to outperform the Kaiming initialization in a significant way. We remind the reader that this pretraining corresponds to less than 10% of the total training, which is proportional, in terms of computation time, to 10% of the time needed to train the model. Furthermore, it is interesting to notice that in the early stages of pretraining we are more likely to lose some accuracy, as the gradient direction at this stage of the training might be wrong. This also speaks to the consistency of our method.
VI. CONCLUSION
In this paper, we propose Hcore-Init, an initialization method applicable to the most common blocks of neural network architectures, i.e., convolutional and linear layers. This method capitalizes on a graph representation of the neural network and, more importantly, on the definition of hypergraph degeneracy, which provides a neuron ranking for the bipartite architecture of the neural network layers. Our method, learning from a small pretraining of the neural network, outperforms the standard Kaiming initialization, under the condition that the initialization distribution has zero expectation. This work is intended to be used as a framework to initialize specific blocks in more complex architectures that might bear more information and are more valuable for the task at hand.
REFERENCES
[1] Mohammed Ali Al-garadi, Kasturi Dewi Varathan, and Sri Devi Ravana.
Identification of influential spreaders in online social networks using in-
teraction weighted k-core decomposition method. Physica A: Statistical
Mechanics and its Applications, 468:278–288, 2017.
[2] Vladimir Batagelj and Matjaz Zaversnik. An O(m) algorithm for cores decomposition of networks. arXiv preprint cs/0310049, 2003.
[3] Christos Giatsidis, Dimitrios M Thilikos, and Michalis Vazirgiannis. D-
cores: measuring collaboration of directed graphs based on degeneracy.
Knowledge and information systems, 35(2):311–343, 2013.
[4] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS 2010). Society for Artificial Intelligence and Statistics, 2010.
[5] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving
deep into rectifiers: Surpassing human-level performance on imagenet
classification, 2015.
[6] J. Kiefer and J. Wolfowitz. Stochastic estimation of the maximum of a
regression function. Ann. Math. Statist., 23(3):462–466, 09 1952.
[7] Alex Krizhevsky et al. Learning multiple layers of features from tiny
images. 2009.
[8] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet
classification with deep convolutional neural networks. In Advances
in neural information processing systems, pages 1097–1105, 2012.
[9] Yann LeCun and Corinna Cortes. MNIST handwritten digit database.
2010.
[10] Dmytro Mishkin and Jiri Matas. All you need is a good init. arXiv
preprint arXiv:1511.06422, 2015.
[11] Christopher Morris, Martin Ritzert, Matthias Fey, William L Hamilton,
Jan Eric Lenssen, Gaurav Rattan, and Martin Grohe. Weisfeiler and
leman go neural: Higher-order graph neural networks. In Proceedings
of the AAAI Conference on Artificial Intelligence, volume 33, pages
4602–4609, 2019.
[12] Giannis Nikolentzos, Polykarpos Meladianos, Stratis Limnios, and
Michalis Vazirgiannis. A degeneracy framework for graph similarity.
In Proceedings of the Twenty-Seventh International Joint Conference on
Artificial Intelligence, IJCAI-18, pages 2595–2601. International Joint
Conferences on Artificial Intelligence Organization, 7 2018.
[13] Tomaz Pisanski and Milan Randic. Bridges between geometry and graph
theory. MAA NOTES, pages 174–194, 2000.
[14] Maria-Evgenia G Rossi, Fragkiskos D Malliaros, and Michalis Vazir-
giannis. Spread it good, spread it fast: Identification of influential nodes
in social networks. In Proceedings of the 24th International Conference
on World Wide Web, pages 101–102, 2015.
[15] Stephen B Seidman. Network structure and minimum degree. Social
networks, 5(3):269–287, 1983.
[16] Konstantinos Skianis, Giannis Nikolentzos, Stratis Limnios, and
Michalis Vazirgiannis. Rep the set: Neural networks for learning set
representations. arXiv preprint arXiv:1904.01962, 2019.
[17] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. J. Mach. Learn. Res., 15(1):1929–1958, January 2014.