Hcore-Init: Neural Network Initialization based on
Graph Degeneracy
Stratis Limnios
École Polytechnique
Palaiseau, France

George Dasoulas
École Polytechnique &
Noah’s Ark Lab, Huawei Technologies
Paris, France

Dimitrios M. Thilikos
LIRMM, Univ Montpellier, CNRS
Montpellier, France

Michalis Vazirgiannis
École Polytechnique
Palaiseau, France
Abstract—Neural networks are the pinnacle of artificial intelligence, as in recent years we witnessed many novel architectures, learning and optimization techniques for deep learning. As neural networks inherently constitute multipartite graphs among neuron layers, we aim to analyze directly their structure to extract meaningful information from the learning processes. To our knowledge, graph mining techniques for enhancing learning in neural networks have not been thoroughly investigated. In this paper we propose an adapted version of the k-core structure for the complete weighted multipartite graph extracted from a deep learning architecture. As a multipartite graph is a combination of bipartite graphs, and since bipartite graphs are the incidence graphs of hypergraphs, we design the k-hypercore decomposition, the hypergraph analogue of the k-core degeneracy. We applied hypercore to several neural network architectures, more specifically to convolutional neural networks and multilayer perceptrons, for image recognition after a very short pretraining. Then we used the information provided by the core numbers of the neurons to re-initialize the weights of the neural network to give a more specific direction for the gradient optimization scheme. Extensive experiments proved that hypercore outperforms state-of-the-art initialization methods.
I. INTRODUCTION
During the last decade, deep learning has been a central focus of the research community. Its applications in a wide variety of scientific and industrial fields have highlighted the need for new approaches at the level of neural network design. Researchers have studied different aspects of Neural Network (NN) architectures and how these can be made optimal for various tasks, e.g., the optimization method used for error backpropagation, the contribution of the activation functions between the NN layers, or normalization techniques that encourage loss convergence, such as batch normalization, dropout layers, etc.
Weight initialization is one of the aspects of NN model
design that contribute the most to the gradient flow of the
hidden layer weights and by extension to the ability of the
neural network to learn. The main focus on the matter of
weight initialization ([4], [5]) is the observation that weights
among different layers can have a high variance, making the
gradients more likely to explode or vanish. Neural Networks capitalize on graph structure by design. Surprisingly, there has been very little work analyzing them as graphs with edge and/or node attributes. Recent work [16] introduces graph metrics to produce latent representation sets, capitalizing on bipartite matching directly implemented in the neural network architecture, which proved to be a very powerful method. Also, the work of Morris et al. [11] analyzes the expressivity of Graph Neural Networks using the Weisfeiler-Leman isomorphism test. Our interest lies in refining the optimization scheme by capitalizing on graph metrics and decompositions. One natural candidate is the k-core decomposition [15]. Indeed, this decomposition method, being very efficient ($O(n \log n)$ in the best cases [2]), performs very well in state-of-the-art frameworks for enhancing supervised learning methods [12], providing key subgraphs and extracting informative features.

Fig. 1. Hypergraph and the corresponding incidence graph
Unfortunately, in the case of a graph representing a neural network, the k-core might lack some features. As a matter of fact, graphs extracted from NNs constitute complete weighted multipartite graphs in the case of a Multilayer Perceptron, and almost complete ones for Convolutional Neural Networks. Different k-core variants for different types of graphs, such as the k-truss [14], which counts triangles, and the D-core [3] for directed graphs, have been designed over the past decade. A natural thought was then to design our own version of the k-core for our precise graph structure.
Hence our contributions are the following:
• We provide a unified method for constructing the graph representation of a neural network as a block composition of the given architecture (see Fig. 2). This is achieved by transforming each part of the network (i.e., linear or convolutional layers, normalization/dropout layers and pooling operators) into a subgraph. Having this graph representation, it is possible to apply different types of combinatorial algorithms to extract information from the graph structure of the network.
• We design a new degeneracy framework, namely the k-hypercore, extending the concept of the k-core to bipartite graphs by considering that each pair of layers of the neural network, constituting a bipartite graph, is the incidence graph of a hypergraph (cf. Fig. 1).
• We propose a novel weight initialization scheme, Hcore-Init, which uses the information provided by the weighted version of the k-hypercore of the graph extracted from the NN to re-initialize the weights of the given neural network, in our case a Convolutional Neural Network and a Multilayer Perceptron. Our proposal clearly outperforms traditional initialization methods on classical deep learning tasks.
The rest of this paper is organized as follows. First, we give some preliminary definitions and an overview of state-of-the-art neural network initialization methods. Then we provide the methodology that allows us to transform neural networks into edge-weighted graphs. Further on, we proceed to the main contribution of the paper, namely the definition of the hypercore degeneracy and the procedure that produces our initialization method. Finally, we test our method on several image classification datasets, comparing it to the main initialization method used in neural networks.
II. PRELIMINARIES
In deep neural networks, weight initialization is a vital factor in the performance of different architectures [10]. The reason
is that an appropriate initialization of the weights of the neural
network can avert the explosion or vanishing of the layer
activation output values.
A. Initialization methods
1) Xavier Initialization: One of the most popular initialization methods is Xavier initialization [4]. According to it, the weights of the network are initialized by drawing them from a normal distribution with $\mathbb{E}[W] = 0$ and $\mathrm{Var}(w_i) = \frac{1}{\mathrm{fan}_{in}}$, where $\mathrm{fan}_{in}$ is the number of incoming neurons. More generally, we can also define the variance with respect to the number of outgoing neurons as $\mathrm{Var}(w_i) = \frac{1}{\mathrm{fan}_{in} + \mathrm{fan}_{out}}$, where $\mathrm{fan}_{out}$ is the number of neurons that the output is directed to.
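The fan-in variant above amounts to drawing each weight from $\mathcal{N}(0, 1/\mathrm{fan}_{in})$; a minimal PyTorch sketch of this rule (our own illustration, the use of PyTorch being an assumption, not prescribed by the paper):

```python
import torch
import torch.nn as nn

def xavier_fan_in_(weight: torch.Tensor) -> torch.Tensor:
    """Draw weights in place from N(0, 1/fan_in), the fan-in variant described above."""
    fan_in = weight.shape[1]              # nn.Linear stores weights as (fan_out, fan_in)
    std = (1.0 / fan_in) ** 0.5
    with torch.no_grad():
        return weight.normal_(mean=0.0, std=std)

layer = nn.Linear(400, 120)
xavier_fan_in_(layer.weight)
```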
2) Kaiming Initialization: Although the Xavier initialization method manages to keep the variance of all layers equal, it assumes that the activation function is linear. In most of the cases where Xavier initialization is used with a non-linear activation function, the hyperbolic tangent activation is employed. The need to take the activation function into account for the weight initialization led to the Kaiming initialization [5]. According to this method, in the case where we employ ReLU activation functions, we initialize the network weights by drawing samples from a normal distribution with zero mean, $\mathbb{E}[W] = 0$, and a variance that depends on the layer: $\mathrm{Var}[W] = \frac{2}{n_l}$, where $n_l$ is the number of incoming connections (fan-in) of the $l$-th layer.
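PyTorch exposes this scheme as `torch.nn.init.kaiming_normal_`; a short usage sketch (again our illustration rather than the paper's code):

```python
import torch.nn as nn

layer = nn.Linear(400, 120)
# He/Kaiming normal: W ~ N(0, 2 / fan_in), suited to ReLU activations.
nn.init.kaiming_normal_(layer.weight, nonlinearity='relu')
nn.init.zeros_(layer.bias)
```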
One main assumption for weight initialization is that the mean of the random distribution used for initialization needs to be 0. Otherwise, the calculation of the variances presented above could not be carried out and we would not have a fixed way to set the variance. Since in our work we want to capitalize on the k-hypercore decomposition to bias those distributions, we have to face the fact that we might not be able to control the variance of the weights we initialize. Thankfully, the fact that the initial distribution has zero mean ensures that our method respects this condition on every layer of the neural network as well.
Moreover, since the k-hypercore decomposition is defined over hypergraphs, let us recall some relations between hypergraphs and bipartite graphs.
B. Hypergraphs and Bipartite graphs
A hypergraph is a generalization of a graph in which an edge can join any number of vertices. It can be represented (and we keep this notation for the rest of the paper) as $H = (V, E_H)$, where $V$ is the set of nodes and $E_H$ is the set of hyperedges, i.e., a set of subsets of $V$; therefore $E_H$ is a subset of $\mathcal{P}(V)$. Moreover, a bipartite graph is the incidence graph of a hypergraph [13]. Indeed, a hypergraph $H$ may be represented by a bipartite graph $G$ as follows: the sets $V$ and $E_H$ are the two parts of $G$, and a vertex $x_1$ and a hyperedge $e_1$ are connected by an edge if and only if $x_1$ is contained in $e_1$ in $H$. Conversely, any bipartite graph with fixed parts and no unconnected nodes in the second part represents some hypergraph in the manner described above. Hence, every pair of layers in the neural network can be viewed as a hypergraph, where the left layer represents the hyperedges and the right layer the nodes (see Fig. 1).
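To make the correspondence concrete, the following sketch builds the incidence bipartite graph of a small hypergraph with networkx (the library choice and node naming are our own, for illustration only):

```python
import networkx as nx

def incidence_graph(hyperedges):
    """Bipartite incidence graph of a hypergraph given as a list of node sets.
    Hyperedges become left-part nodes, hypergraph vertices become right-part nodes."""
    G = nx.Graph()
    for e_idx, members in enumerate(hyperedges):
        G.add_node(("e", e_idx), bipartite=0)   # hyperedge side
        for v in members:
            G.add_node(("v", v), bipartite=1)   # vertex side
            G.add_edge(("e", e_idx), ("v", v))
    return G

# H with hyperedges {1, 2, 3} and {2, 4} -> a bipartite graph with 2 + 4 nodes and 5 edges
G = incidence_graph([{1, 2, 3}, {2, 4}])
```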
III. GRAPH CHARACTERIZATION OF NEURAL NETWORKS
We will now describe how we transpose the two classic neural network architectures we investigate to graphs, and more specifically to bipartite ones. Also, from now on, we are going to refer to a fully-connected neural network as an FCNN and to a convolutional neural network as a CNN [8].
A. Fully-Connected Neural Networks
Let $F$ be an FCNN with $L$ hidden layers, $n_i$, $i = 1, \dots, L$, hidden units per layer and $W_i \in \mathbb{R}^{n_i \times n_{i+1}}$ the weight matrix of the links between the units of layers $i$ and $i+1$.
We define the graph $G_F = (V, E, W)$ as the graph representation of the FCNN $F$, where the set of nodes $V$ corresponds to the $\sum_{i=1}^{L} n_i$ hidden units of $F$, the set of edges $E$ contains all the links of unit pairs across the layers of $F$, and the edge weight matrix $W$ corresponds to the link weight matrices $W_i$, $i = 1, \dots, L-1$. We note that the graph representation $G_F$ does not take into account any activation functions $\sigma$ used in $F$.
Fig. 2. Illustration of the transformation of a CNN to a graph.
Remark. It is easy to see that $G_F$ is a $k$-partite graph (i.e., a graph whose vertices can be partitioned into $k$ independent sets) and, more specifically, a union of $L-1$ complete bipartite graphs.
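For illustration, a sketch of how $G_F$ could be assembled from the linear layers of a PyTorch model with networkx; the module traversal and node naming are our own assumptions:

```python
import networkx as nx
import torch.nn as nn

def fcnn_to_graph(model: nn.Sequential) -> nx.Graph:
    """Union of complete bipartite graphs, one per consecutive pair of layers,
    with the (pre)trained weights as edge weights; activations are ignored."""
    G = nx.Graph()
    layer_idx = 0
    for module in model:
        if isinstance(module, nn.Linear):
            W = module.weight.detach()            # shape: (n_{i+1}, n_i)
            n_out, n_in = W.shape
            for i in range(n_in):
                for j in range(n_out):
                    G.add_edge((layer_idx, i), (layer_idx + 1, j),
                               weight=W[j, i].item())
            layer_idx += 1
    return G

G_F = fcnn_to_graph(nn.Sequential(nn.Linear(400, 120), nn.ReLU(), nn.Linear(120, 84)))
```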
B. Convolutional Neural Networks
After showing the correspondence between an FCNN $F$ and its graph representation $G_F$, we are ready to define the graph representation of a CNN layer. Let $C$ be a CNN layer. The convolutional layer is characterized by the input information that has $I$ input channels, where each channel provides $n \times n$ features (e.g., a $24 \times 24$ image characterized by the 3 RGB channels), the output information that has $O$ output channels, where each channel has $m \times m$ features, and the matrix of the convolutional kernel $F \in \mathbb{R}^{w \times h \times I \times O}$, where $w, h$ are the width and height of the kernel.
In order to define the graph $G_C = (V, E, W)$ as the graph representation of the CNN layer $C$, we have to flatten the 3- and 4-dimensional input, output, and filter matrices correspondingly. Specifically, $G_C$ is a bipartite graph, where the first partition of nodes $P_1$ is the flattened input information of the CNN layer ($|P_1| = I \times n \times n$) and the second partition of nodes $P_2$ is the flattened output information ($|P_2| = O \times m \times m$).
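One possible way to materialize $G_C$ is sketched below: every output position is linked to the input positions in its receptive field, carrying the corresponding kernel entry as edge weight. The kernel layout (O, I, h, w) follows `torch.nn.Conv2d.weight`, and the square-kernel, square-input, no-padding assumptions are ours; the paper does not fix these details.

```python
import networkx as nx
import torch

def conv_layer_to_bipartite(kernel: torch.Tensor, n: int, stride: int = 1) -> nx.Graph:
    """Bipartite graph of a single convolutional layer (valid convolution, no padding).
    Left part P1: the I*n*n input positions; right part P2: the O*m*m output positions."""
    O, I, h, w = kernel.shape
    m = (n - h) // stride + 1                       # output spatial size
    G = nx.Graph()
    for oc in range(O):
        for oi in range(m):
            for oj in range(m):
                out_node = ("out", oc, oi, oj)
                for ic in range(I):
                    for di in range(h):
                        for dj in range(w):
                            in_node = ("in", ic, oi * stride + di, oj * stride + dj)
                            G.add_edge(in_node, out_node,
                                       weight=kernel[oc, ic, di, dj].item())
    return G
```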
IV. WEIGHT INITIALIZATION BASED ON HCORE
As degeneracy frameworks have proven to be very efficient at extracting influential individuals in a network [1], why not use the structural information provided by the hcore decomposition of the network to identify “influential” neurons? Having a neural network graph, we provide a definition of degeneracy specifically for bipartite graphs, where the standard k-core does not provide much relevant information.
Definition 1 (Hypercore). Given a hypergraph $H = (V, E_H)$, we define the $(k, l)$-hypercore as a maximal connected subgraph of $H$ in which all vertices have hyperdegree at least $k$ and all hyperedges have at least $l$ incident nodes.

Fig. 3. Example of a k-hcore decomposition of a hypergraph

From now on, we will refer to the $(k, 2)$-hypercore as the $k$-hcore and, similarly, the hcore number of a node will be the largest value of $k$ for which the given node belongs to the $k$-hcore.
This hence provides a hypergraph decomposition and, in our case, a decomposition of the right part of the studied bipartite graph, as we do not care about the hcore of the hyperedges (cf. Fig. 3).
Since we deal with edge-weighted bipartite graphs, we will use the weighted degree to define the hcore ranking of the nodes, given the following weighted-hypercore definition:
Definition 2 (Weighted-hypercore). Given an edge-weighted hypergraph $H = (V, E_H)$, we define the $(k, l)$-weighted-hypercore as a maximal connected subgraph of $H$ in which all vertices have weighted hyperdegree at least $k$ and all hyperedges have at least $l$ incident nodes.
Again, we will refer to the $(k, 2)$-weighted-hypercore as the $k$-WHcore. Now that we have this weighted version, we need to define a way to initialize the weights of the neural network, since the WHcore is a value assigned to the nodes of the network and not to the edges, which are the weights we aim to initialize. The WHcore shows us which neurons gather the most information, positive on the one hand and negative on the other. After a quick pretraining, we learn the weights just enough to reveal which neurons have a higher impact on the learning. This information is then grouped by the WHcore into influential neurons and less influential ones.
Moreover, since the weights of a neural network are sampled from a centered normal distribution, we have both positive and negative weights. Since the hypercore framework operates on positive weighted degrees, we provide two graph representations of the neural network, namely $G^+$ and $G^-$. The $G^+$ graph is built upon the positive weights of the neural network, and the edge weights of the $G^-$ graph are the absolute values of the negative weights of the neural network. Indeed, if between neuron $x_i$ and neuron $y_j$ we have $w_{ij} > 0$, then we add an edge with weight $w_{ij}$ between nodes $x_i$ and $y_j$ in graph $G^+$; otherwise we add an edge with weight $|w_{ij}|$ between nodes $x_i$ and $y_j$ in graph $G^-$.
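A small sketch of this split for one linear layer (the node naming and use of networkx are our own):

```python
import networkx as nx
import torch

def signed_bipartite_graphs(W: torch.Tensor):
    """Split a layer's weight matrix W, shaped (fan_out, fan_in) as in
    torch.nn.Linear.weight, into the positive graph G+ and the negative graph G-,
    whose edge weights are |w_ij| for the negative entries."""
    g_pos, g_neg = nx.Graph(), nx.Graph()
    n_out, n_in = W.shape
    for i in range(n_in):                       # left-part nodes x_i
        for j in range(n_out):                  # right-part nodes y_j
            w = W[j, i].item()
            if w > 0:
                g_pos.add_edge(("x", i), ("y", j), weight=w)
            elif w < 0:
                g_neg.add_edge(("x", i), ("y", j), weight=abs(w))
    return g_pos, g_neg
```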
Algorithm 1 Hcore decomposition algorithm
1: procedure HCORE(G, rnodes)
2:   Input G: bipartite graph, rnodes: right-layer nodes
3:   Output hcore: dictionary of hcore values
4:   hcore ← dict((node, 0) for node in rnodes)
5:   tokeep ← rnodes
6:   k ← 1
7:   while tokeep ≠ ∅ do
8:     state ← True
9:     while state == True do
10:      state ← False
11:      current ← tokeep
12:      tokeep ← []
13:      for node ∈ current do
14:        if G.degree(node) > k then
15:          tokeep.append(node)
16:        else
17:          hcore[node] ← k
18:          G.remove(node)
19:          state ← True
20:        end if
21:      end for
22:      for node ∈ G.nodes \ rnodes do
23:        if G.degree(node) ≤ 1 then
24:          G.remove(node)
25:        end if
26:      end for
27:    end while
28:    k ← k + 1
29:  end while
30:  return hcore
31: end procedure
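For reference, a runnable Python transcription of Algorithm 1 using networkx (the library, the optional weighted-degree switch for the k-WHcore, and the extra re-check after dropping hyperedge nodes are our additions):

```python
import networkx as nx

def hcore(G: nx.Graph, rnodes, weight=None):
    """Peel the right-layer nodes of a bipartite graph as in Algorithm 1 and
    return a dict of hcore numbers (l = 2 is implicit: left-part nodes with at
    most one remaining neighbour are dropped). Pass weight="weight" for the
    weighted-degree (k-WHcore) variant."""
    G = G.copy()
    rnodes = set(rnodes)
    hcore_num = {node: 0 for node in rnodes}
    tokeep = {n for n in rnodes if n in G}
    k = 1
    while tokeep:
        changed = True
        while changed:
            changed = False
            survivors = set()
            for node in tokeep:
                if G.degree(node, weight=weight) > k:
                    survivors.add(node)
                else:
                    hcore_num[node] = k
                    G.remove_node(node)
                    changed = True
            tokeep = survivors
            # drop hyperedge (left-part) nodes with fewer than 2 incident nodes
            for node in [n for n in G.nodes if n not in rnodes]:
                if G.degree(node) <= 1:
                    G.remove_node(node)
                    changed = True   # re-check right nodes after hyperedges disappear
        k += 1
    return hcore_num
```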
Remark. It is important to note that the WHcore number of a node is the largest $k$ for which the node is contained in the $k$-WHcore. Also, the WHcore number of a node is a function of the degree of the node. As the degree depends on the weights, there exist two functions $g$ and $h$ such that $g(W, x)$ outputs the weighted degree of a node $x$, thus being a linear combination of the weights $W$, and $c(W, x) = h(g(W, x))$ is the WHcore number of the node $x$. For convenience, we now write $c(W^+, x_k) = c^+_k$, where $W^+$ are the positive weights of the weight matrix $W$.
Moreover, the following initialization schemes are applied after a small amount of pretraining of the neural network, in order to have preliminary information about the importance of the neurons for the task.
A. Initialization of the FCNN
The initialization is then dependent on the architecture we are looking at. For an FCNN, as the graph construction is fairly straightforward, we proceed as follows.
For every pair of layers, for both the positive and the negative graph, we have nodes $x_i$, with $i \in \{1, \dots, \mathrm{fan}_{in}\}$, on the left side of the bipartite graph and nodes $y_j$, with $j \in \{1, \dots, \mathrm{fan}_{out}\}$, on the right side. For every node $y_j$ we compute its WHcore from the graph $G^-$, $c^-_j$, and from the graph $G^+$, $c^+_j$. Then the weights $w_{ij}$ of the given layer are initialized, depending on their sign, from a normal distribution with mean
$$ m = \frac{c^+_j}{\sum_{1 \le k \le \mathrm{fan}_{out}} c^+_k} \text{ if } w_{ij} \ge 0, \qquad m = -\frac{c^-_j}{\sum_{1 \le k \le \mathrm{fan}_{out}} c^-_k} \text{ otherwise,} $$
and with the same variance as used in Kaiming initialization. We prove later that the overall mean value of the new random variable obtained in this fashion is 0 as well, justifying the use of the Kaiming variance as optimal.
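A vectorized sketch of this re-initialization rule for one linear layer; the helper name is ours, `c_plus` and `c_minus` are float tensors holding the WHcore numbers of the output neurons, and the minus sign on the negative branch follows the zero-mean argument given below:

```python
import math
import torch

def hcore_init_linear_(W: torch.Tensor, c_plus: torch.Tensor, c_minus: torch.Tensor):
    """Redraw each weight of W (shape (fan_out, fan_in), as in nn.Linear.weight)
    from N(m, 2/fan_in), where m depends on the sign of the pretrained weight."""
    fan_out, fan_in = W.shape
    std = math.sqrt(2.0 / fan_in)                   # Kaiming variance
    mean_pos = c_plus / c_plus.sum()                #  c+_j / sum_k c+_k
    mean_neg = -c_minus / c_minus.sum()             # -c-_j / sum_k c-_k
    with torch.no_grad():
        means = torch.where(W >= 0,
                            mean_pos.unsqueeze(1).expand_as(W),
                            mean_neg.unsqueeze(1).expand_as(W))
        W.copy_(torch.normal(means, std))
    return W
```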
B. Initialization of the CNN
For the CNN, since the induced graph is more intricate and the filter weights must follow the same distribution, the initialization framework has to be adapted. We still compute the WHcore on a pair of layers, but keeping the filters in mind: the left-layer nodes are $x^{(k)}_i$, with $i \in \{1, \dots, n \times n\}$ the input size and $k \in \{1, \dots, I\}$ the input channels. Similarly, the right-layer nodes are $y^{(k')}_j$, where $j \in \{1, \dots, m \times m\}$ is the output size and $k' \in \{1, \dots, O\}$ the output channels. We recall as well that we have two WHcores, one for the positive graph, $c^+$, and one for the negative graph, $c^-$. Then, for a given filter $w^{(k,k')}$, its values are initialized with the following method: we define, for a given filter $W$, $m(W^+) = \frac{1}{H^2} \sum_j c^+_j$ and $m(W^-) = \frac{1}{H^2} \sum_j c^-_j$; if $m(W^+) - m(W^-) > 0$ then $M = m(W^+)$, else $M = -m(W^-)$.
Using the notation given in the previous remark, we can write $m$ in the following general form:
$$ m = \mathrm{sign}\big(\arg\max(m(W^+), m(W^-))\big) \cdot \max\big(m(W^+), m(W^-)\big), $$
where $\mathrm{sign}(W^+) = 1$ and $\mathrm{sign}(W^-) = -1$.
This initialization is done for every filter, with the variance given by the Kaiming initialization method. We will now prove that, for the CNN, the overall expectation of the mean value produced in this way is indeed 0.
Proposition 1. Given a measurable function $f$ and two positive i.i.d. random variables $X^+$ and $X^-$, the random variable
$$ Z = \mathrm{sign}\big(\arg\max(f(X^+), f(X^-))\big) \cdot \max\big(f(X^+), f(X^-)\big) $$
has mean 0.
Proof. We recall that $\mathbb{I}\{x \in X\}$ is the indicator function:
$$ \mathbb{I}\{x \in X\} = \begin{cases} 1 & \text{if } x \in X \\ 0 & \text{otherwise.} \end{cases} $$
Let us evaluate the expectation of $Z$, provided that $X^+$ and $X^-$ are i.i.d.:
$$ \mathbb{E}[Z] = \mathbb{E}[Z\,\mathbb{I}\{f(X^+) > f(X^-)\}] + \mathbb{E}[Z\,\mathbb{I}\{f(X^+) \le f(X^-)\}] = \mathbb{E}[f(X^+)\,\mathbb{I}\{f(X^+) > f(X^-)\}] - \mathbb{E}[f(X^-)\,\mathbb{I}\{f(X^+) \le f(X^-)\}] = \mathbb{E}[(f(X^+) + f(X^-))\,\mathbb{I}\{f(X^+) > f(X^-)\}] - \mathbb{E}[f(X^-)]. $$
Given the initial assumptions, we can expand the first term $\mathbb{E}[f(X^+)\,\mathbb{I}\{f(X^+) > f(X^-)\}]$ as follows:
$$ \mathbb{E}[f(X^+)\,\mathbb{I}\{f(X^+) > f(X^-)\}] = \iint f(X^+)\,\mathbb{I}\{f(X^+) > f(X^-)\}\, dP(X^+)\, dP(X^-). $$
As $X^+$ and $X^-$ follow the same distribution, and $f$ is a measurable function, we use Fubini's theorem to swap the integrals as follows:
$$ \mathbb{E}[f(X^+)\,\mathbb{I}\{f(X^+) > f(X^-)\}] = \iint f(X^-)\,\mathbb{I}\{f(X^-) \ge f(X^+)\}\, dP(X^-)\, dP(X^+). $$
Now replacing this in the original equation gives us:
$$ \mathbb{E}[Z] = \iint f(X^-)\,\mathbb{I}\{f(X^-) \ge f(X^+)\}\, dP(X^-)\, dP(X^+) + \iint f(X^-)\,\mathbb{I}\{f(X^+) > f(X^-)\}\, dP(X^-)\, dP(X^+) - \mathbb{E}[f(X^-)] = \mathbb{E}[f(X^-)\,\mathbb{I}\{f(X^-) \ge f(X^+)\}] + \mathbb{E}[f(X^-)\,\mathbb{I}\{f(X^+) > f(X^-)\}] - \mathbb{E}[f(X^-)] = 0. $$
This completes our proof that $Z$ is a centered random variable.
Notice that, setting the function $m = l \circ g \circ h$, we can write $l \circ g = f$ and $X = h(W)$. As we defined previously, $h$ is the weighted-degree function of a node:
$$ h(W^+_j) = \sum_i W_{ij}\,\mathbb{I}\{W_{ij} > 0\}, \qquad h(W^-_j) = \sum_i |W_{ij}|\,\mathbb{I}\{W_{ij} \le 0\}, $$
which ensures that $h(W^+)$ and $h(W^-)$ follow the same distribution, both being linear combinations of absolute values of the same normal distribution. Replacing these functions in the previous proposition, i.e., $f = l \circ g$, $X^+ = h(W^+)$ and $X^- = h(W^-)$, proves that our initialization method has mean 0. This allows us to justify the use of the Kaiming variance in our initialization method, as it was proven to be the optimal one.
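As a quick numerical sanity check of Proposition 1 (not part of the original paper), one can estimate $\mathbb{E}[Z]$ by Monte Carlo; the half-normal variables and the particular measurable function $f$ below are arbitrary choices of ours:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

x_pos = np.abs(rng.normal(size=n))      # two i.i.d. positive random variables
x_neg = np.abs(rng.normal(size=n))
f = np.log1p                            # any measurable f

fp, fn = f(x_pos), f(x_neg)
# Z = sign(argmax(f(X+), f(X-))) * max(f(X+), f(X-)): +max when X+ wins, -max otherwise
z = np.where(fp > fn, fp, -fn)
print(z.mean())                         # close to 0, as the proposition states
```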
V. EXPERIMENTS
We will now evaluate our proposed weight initialization method on three standard image classification datasets, CIFAR-10, CIFAR-100, and MNIST, and compare it to Kaiming initialization. It is important to note that we do not experiment on state-of-the-art architectures for each dataset. Since our method can be used separately on different architecture blocks, i.e., only the convolutional layers, only the FCNN part, or both, we want to show that it outperforms standard initialization methods regardless of the refinement of the architecture. Hence, in this section, we evaluate the classification accuracy on the aforementioned datasets with the simple CNN architectures presented below.
A. Dataset specifications.
The CIFAR-10 and CIFAR-100 datasets are labeled subsets of the 80 Million Tiny Images dataset. They were collected by Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton [7].
• The CIFAR-10 dataset consists of 60000 32 × 32 colour images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images.
• The CIFAR-100 dataset is just like CIFAR-10, except that it has 100 classes containing 600 images each. There are 500 training images and 100 testing images per class.
We also test our model on the MNIST database of handwritten digits, which has a training set of 60000 examples and a test set of 10000 examples. The digits have been size-normalized and centered in a fixed-size image [9].
The CIFAR-10 dataset is divided into five training batches and one test batch, each with 10000 images. The test batch contains exactly 1000 randomly selected images from each class. The training batches contain the remaining images in random order, but some training batches may contain more images from one class than another. Between them, the training batches contain exactly 5000 images from each class.
B. Model setup and baseline.
Next, we present the models that were trained and evaluated for the image classification task. We note that for every case, we compare two scenarios:
1) Initialization of the model with Kaiming initialization [5], training on the train set for 150 epochs, and evaluation on the test set.
2) Pretraining of the model (using Kaiming initialization) for N epochs, re-initialization of the model with Hcore-Init, training on the train set for the remaining 150 − N epochs, and evaluation on the test set (see the sketch after this list). N is treated as a hyperparameter.
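A schematic of scenario 2; the callables are placeholders for a standard training loop and for the re-initialization routine of Section IV, not functions defined by the paper:

```python
import torch.nn as nn

def run_scenario_2(model: nn.Module, pretrain_fn, hcore_reinit_fn, train_fn,
                   n_pretrain: int = 10, total_epochs: int = 150) -> nn.Module:
    """Short Kaiming-initialized pretraining, Hcore-Init re-initialization,
    then training for the remaining epochs."""
    pretrain_fn(model, epochs=n_pretrain)                # phase 1: N pretraining epochs
    hcore_reinit_fn(model)                               # re-draw weights as in Section IV
    train_fn(model, epochs=total_epochs - n_pretrain)    # phase 2: remaining 150 - N epochs
    return model
```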
For the CIFAR-10 and CIFAR-100 datasets, we applied 2 convolutional layers with sizes 3 × 6 × 5 and 6 × 16 × 5 respectively, where 5 is the kernel size, and the stride was set to 1. Moreover, we applied a 2 × 2 max-pooling operator after each convolutional layer and, finally, three fully connected layers with corresponding sizes 400 × 120, 120 × 84 and 84 × #classes, where #classes = 10 and 100 respectively for the two datasets. Furthermore, we used ReLU as the activation function among the linear layers and tanh for the convolutional layers.
For the MNIST dataset, we applied again 2 convolutional layers, of sizes 1 × 10 × 5 and 10 × 20 × 5, where again the kernel size was set to 5 and the stride to 1. As in the other datasets, we employed two 2 × 2 max-pooling operators, and we performed dropout [17] on the output of the 2nd convolutional layer with probability p = 0.5. Finally, we applied 2 fully connected layers of sizes 320 × 50 and 50 × 10, with ReLU as the activation function throughout the layers.
In all cases, we employed stochastic gradient descent [6] with momentum set to 0.9 and learning rate set to 0.001. As we mentioned before, we chose 2 rather simple models, as we intend to highlight the contribution of Hcore-Init in comparison to its competitor and not to achieve state-of-the-art results for the given datasets, which have been exhaustively studied.
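For concreteness, the CIFAR architecture described above can be sketched in PyTorch as follows (the class name and this particular module layout are ours; layer sizes follow the text):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SmallCifarCNN(nn.Module):
    """Two tanh convolution blocks with 2x2 max-pooling, then three fully
    connected layers with ReLU in between, as described in the text."""
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 6, kernel_size=5, stride=1)
        self.conv2 = nn.Conv2d(6, 16, kernel_size=5, stride=1)
        self.pool = nn.MaxPool2d(2, 2)
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, num_classes)

    def forward(self, x):
        x = self.pool(torch.tanh(self.conv1(x)))   # 32x32 -> 28x28 -> 14x14
        x = self.pool(torch.tanh(self.conv2(x)))   # 14x14 -> 10x10 -> 5x5
        x = torch.flatten(x, 1)                    # 16 * 5 * 5 = 400 features
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        return self.fc3(x)

model = SmallCifarCNN(num_classes=10)
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
```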
C. Settings of the weight initialization.
Next, we present the contribution of Hcore-Init to the performance of the neural network architecture with respect to its application on different types of layers. Specifically, we applied the initialization configurations (a) exclusively on the set of linear layers, (b) exclusively on the set of convolutional layers, and (c) on the combined set of linear and convolutional layers of the model.
Fig. 4. Test accuracy (left) and train loss (right) on CIFAR-10 for the combined application of the initialization on the linear and the convolutional layers. For the curves Hcore-Init-x, x stands for the number of pretraining epochs.
In Figure 4, we can observe that for 15 pretraining epochs, the model initialized with Hcore-Init outperforms the model initialized with Kaiming initialization. It is also noteworthy that the loss convergence is faster when applying Hcore-Init. This empirically supports our initial motivation of encouraging the “important” weights by using the graph information from the model architecture.
Fig. 5. Test accuracy and train loss on CIFAR-10 for the initialization applied
only on the linear layers.
In Figures 5 and 6, we can again notice the contribution of Hcore-Init to the performance of the network, when the former is applied on the fully connected and convolutional layers respectively. We can see that in both cases, Hcore-Init with different numbers of pretraining epochs (10 and 20 respectively) achieves better accuracy results in comparison to Kaiming.
Fig. 6. Test accuracy and train loss on CIFAR-10 for the initialization applied
only on the convolutional layers.
TABLE I
Top accuracy results over initializing the full model, only the CNN and only
the FCNN for CIFAR-10, CIFAR-100, and MNIST. Hcore-Init* represents the top
performance over all the pretraining-epoch configurations up to 25.

                 CIFAR-10   CIFAR-100   MNIST
Kaiming           64.62      32.56      98.71
Hcore-Init*       65.22      33.48      98.91
Hcore-Init-1      64.91      32.87      98.59
Hcore-Init-5      64.41      32.96      98.70
Hcore-Init-10     65.22      33.41      98.81
Hcore-Init-15     64.94      33.45      98.64
Hcore-Init-20     65.05      33.39      98.87
Hcore-Init-25     64.72      33.48      98.91
Finally, we report in Table I the results of the experiments conducted on the 3 datasets. Those results correspond to an ablation study over the different numbers of pretraining epochs as well as the different initialization scenarios, i.e., initializing only the linear layers, only the convolutional layers, or the whole architecture. We kept, for each mentioned scenario, the best performance, and Hcore-Init* corresponds to the best overall accuracy. It is important to notice that we do not necessarily need a long pretraining phase to achieve the best results; in fact, only 10 epochs are usually more than enough to outperform the Kaiming initialization in a significant way. We remind the reader that this pretraining corresponds to less than 10% of the total training, which is proportional, in terms of computation time, to 10% of the time needed to train the model. Furthermore, it is interesting to notice that in the early stages of pretraining we are more likely to lose some accuracy, as the gradient direction at this stage of the training might be wrong. This also speaks to the consistency of our method.
VI. CONCLUSION
In this paper, we propose Hcore-Init, an initialization method applicable to the most common blocks of neural network architectures, i.e., convolutional and linear layers. This method capitalizes on a graph representation of the neural network and, more importantly, on the definition of hypergraph degeneracy, which provides a neuron ranking for the bipartite architecture of the neural network layers. Our method, learning from a small pretraining of the neural network, outperforms the standard Kaiming initialization, under the condition that the initialization distribution has zero expectation. This work is intended to be used as a framework to initialize specific blocks in more complex architectures that might bear more information and are more valuable for the task at hand.
REFERENCES
[1] Mohammed Ali Al-garadi, Kasturi Dewi Varathan, and Sri Devi Ravana.
Identification of influential spreaders in online social networks using in-
teraction weighted k-core decomposition method. Physica A: Statistical
Mechanics and its Applications, 468:278–288, 2017.
[2] Vladimir Batagelj and Matjaz Zaversnik. An O(m) algorithm for cores decomposition of networks. arXiv preprint cs/0310049, 2003.
[3] Christos Giatsidis, Dimitrios M Thilikos, and Michalis Vazirgiannis. D-
cores: measuring collaboration of directed graphs based on degeneracy.
Knowledge and information systems, 35(2):311–343, 2013.
[4] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS 2010). Society for Artificial Intelligence and Statistics, 2010.
[5] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving
deep into rectifiers: Surpassing human-level performance on imagenet
classification, 2015.
[6] J. Kiefer and J. Wolfowitz. Stochastic estimation of the maximum of a
regression function. Ann. Math. Statist., 23(3):462–466, 09 1952.
[7] Alex Krizhevsky et al. Learning multiple layers of features from tiny
images. 2009.
[8] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet
classification with deep convolutional neural networks. In Advances
in neural information processing systems, pages 1097–1105, 2012.
[9] Yann LeCun and Corinna Cortes. MNIST handwritten digit database.
2010.
[10] Dmytro Mishkin and Jiri Matas. All you need is a good init. arXiv
preprint arXiv:1511.06422, 2015.
[11] Christopher Morris, Martin Ritzert, Matthias Fey, William L Hamilton,
Jan Eric Lenssen, Gaurav Rattan, and Martin Grohe. Weisfeiler and
leman go neural: Higher-order graph neural networks. In Proceedings
of the AAAI Conference on Artificial Intelligence, volume 33, pages
4602–4609, 2019.
[12] Giannis Nikolentzos, Polykarpos Meladianos, Stratis Limnios, and
Michalis Vazirgiannis. A degeneracy framework for graph similarity.
In Proceedings of the Twenty-Seventh International Joint Conference on
Artificial Intelligence, IJCAI-18, pages 2595–2601. International Joint
Conferences on Artificial Intelligence Organization, 7 2018.
[13] Tomaz Pisanski and Milan Randic. Bridges between geometry and graph
theory. MAA NOTES, pages 174–194, 2000.
[14] Maria-Evgenia G Rossi, Fragkiskos D Malliaros, and Michalis Vazir-
giannis. Spread it good, spread it fast: Identification of influential nodes
in social networks. In Proceedings of the 24th International Conference
on World Wide Web, pages 101–102, 2015.
[15] Stephen B Seidman. Network structure and minimum degree. Social
networks, 5(3):269–287, 1983.
[16] Konstantinos Skianis, Giannis Nikolentzos, Stratis Limnios, and
Michalis Vazirgiannis. Rep the set: Neural networks for learning set
representations. arXiv preprint arXiv:1904.01962, 2019.
[17] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. J. Mach. Learn. Res., 15(1):1929–1958, January 2014.