Citation: Schwarz Schuler, J.P.; Romani, S.; Puig, D.; Rashwan, H.; Abdel-Nasser, M. An Enhanced Scheme for Reducing the Complexity of Pointwise Convolutions in CNNs for Image Classification Based on Interleaved Grouped Filters without Divisibility Constraints. Entropy 2022, 24, 1264. https://doi.org/10.3390/e24091264
Academic Editors: Bin Fan and Wenqi Ren
Received: 24 July 2022; Accepted: 5 September 2022; Published: 8 September 2022
An Enhanced Scheme for Reducing the Complexity of Pointwise
Convolutions in CNNs for Image Classification Based on
Interleaved Grouped Filters without Divisibility Constraints
Joao Paulo Schwarz Schuler 1,*, Santiago Romani Also 1, Domenec Puig 1, Hatem Rashwan 1 and Mohamed Abdel-Nasser 1,2
1 Departament d'Enginyeria Informatica i Matemátiques, Universitat Rovira i Virgili, 43007 Tarragona, Spain; santiago.romani@urv.cat (S.R.A.); domenec.puig@urv.cat (D.P.); hatem.abdellatif@urv.cat (H.R.); m.nasser@ieee.org (M.A.-N.)
2 Electronics and Communication Engineering Section, Electrical Engineering Department, Aswan University, 81528 Aswan, Egypt
* Correspondence: joaopaulo.schwarz@estudiants.urv.cat
Abstract:
In image classification with Deep Convolutional Neural Networks (DCNNs), the number of
parameters in pointwise convolutions rapidly grows due to the multiplication of the number of filters
by the number of input channels that come from the previous layer. Existing studies demonstrated
that a subnetwork can replace pointwise convolutional layers with significantly fewer parameters and
fewer floating-point computations, while maintaining the learning capacity. In this paper, we propose
an improved scheme for reducing the complexity of pointwise convolutions in DCNNs for image
classification based on interleaved grouped filters without divisibility constraints. The proposed
scheme utilizes grouped pointwise convolutions, in which each group processes a fraction of the
input channels. It requires a number of channels per group as a hyperparameter Ch. The subnetwork
of the proposed scheme contains two consecutive convolutional layers K and L, connected by an
interleaving layer in the middle, and summed at the end. The number of groups of filters and filters
per group for layers K and L is determined by exact divisions of the original number of input channels
and filters by Ch. If the divisions were not exact, the original layer could not be substituted. In this
paper, we refine the previous algorithm so that input channels are replicated and groups can have
different numbers of filters to cope with non-exact divisibility situations. Thus, the proposed scheme
further reduces the number of floating-point computations (11%) and trainable parameters (10%)
achieved by the previous method. We tested our optimization on an EfficientNet-B0 as a baseline
architecture and made classification tests on the CIFAR-10, Colorectal Cancer Histology, and Malaria
datasets. For each dataset, our optimization achieves a saving of 76%, 89%, and 91% of the number of
trainable parameters of EfficientNet-B0, while keeping its test classification accuracy.
Keywords:
EfficientNet; deep learning; computer vision; image classification; convolutional neural
network; DCNN; grouped convolution; pointwise convolution; data analysis; network optimization;
parameter reduction; parallel branches; channel interleaving
1. Introduction
In 2012, Krizhevsky et al. [1] reported a breakthrough in the ImageNet Large Scale Visual Recognition Challenge [2] using their AlexNet architecture, which contains 5 convolutional layers and 3 dense layers. Since 2012, many other architectures have been introduced, like ZFNet [3], VGG [4], GoogLeNet [5], ResNet [6] and DenseNet [7]. Since the number of layers of proposed convolutional neural networks has increased from 5 to more than 200, those models are usually referred to as "Deep Learning" or DCNNs.
In 2013, Min Lin et al. introduced the Network in Network architecture (NiN) [8]. It has 3 spatial convolutional layers with 192 filters, separated by pairs of pointwise convolutional
layers. These pointwise convolutions enable the architecture to learn patterns without
the computational cost of a spatial convolution. In 2016, ResNet [6] was introduced. Following VGG [4], all ResNet spatial filters have 3×3 pixels. Their paper conjectures that deeper CNNs have exponentially low convergence rates. To deal with this problem, they introduced skip connections every 2 convolutional layers. In 2017, Ioannou et al. [9] adapted the NiN architecture to use 2 to 16 convolutional groups per layer for classifying the CIFAR-10 dataset.
A grouped convolution separates input channels and filters into groups. Each filter
processes only input channels entering its group. Each group of filters can be understood
as an independent (parallel) path for information flow. This aspect drastically reduces
the number of weights in each filter and, therefore, reduces the number of floating-point
computations. Grouping 3×3 and 5×5 spatial convolutions, Ioannou et al. were able to decrease the number of parameters by more than 50% while keeping the NiN classification accuracy. Ioannou et al. also adapted the ResNet-50, ResNet-200, and GoogLeNet architectures, applying 2 to 64 groups per layer when classifying the ImageNet dataset, obtaining a parameter reduction while maintaining or improving the classification accuracy. Also in 2017, an improvement of the ResNet architecture called ResNeXt [10] was introduced, replacing the spatial convolutions with parallel paths (groups), reducing the number of parameters.
Several studies have also reported the creation of parameter-efficient architectures with grouped convolutions [11–15]. In 2019, Mingxing Tan et al. [16] developed the EfficientNet architecture. At that time, their EfficientNet-B7 variant was 8.4 times more parameter-efficient and 6.1 times faster than the best existing architecture, achieving 84.3% top-1 accuracy on ImageNet. More than 90% of the parameters of EfficientNets come from standard pointwise convolutions. This aspect opens an opportunity for a huge reduction in the number of parameters and floating-point operations, which we have exploited in the present paper.
Most parameters in DCNNs are redundant [17–21]. Pruning methods remove connections and neurons found to be irrelevant by different techniques. After training the original network with the full set of connections, the removal is carried out [22–27]. Our method differs from pruning in that we reduce the number of connections before the training starts, while pruning does so after training. Therefore, our method can save computing resources during training time.
In previous works [28,29], we proposed replacing standard pointwise convolutions with a sub-architecture that contains two grouped pointwise convolutional layers (K and L), an interleaving layer that mixes channels from layer K before feeding the layer L, and a summation at the end that sums the results from both convolutional layers. Our original method accepts a hyperparameter Ch, which denotes the number of input channels fed to each group of filters. Then, our method computes the number of groups of filters and filters per group according to the division of the original input channels and filters by Ch. Our original method avoided substituting the layers where the divisions were not exact.
In this paper, we propose an enhanced scheme to allow computing the number of
groups in a flexible manner, in the sense that the divisibility constraints do not have to be
considered anymore. By applying our method to all pointwise convolutional layers of an
EfficientNet-B0 architecture, we are able to reduce a huge amount of resources (trainable
parameters, floating-point computations) while maintaining the learning capacity.
This paper is structured as follows: Section 2 details our improved solution for grouping pointwise convolutions while skipping the constraints of divisibility found in our previous method. Section 3 details the experiments carried out for testing our solution. Section 4 summarizes the conclusions and limitations of our proposal.
2. Methodology
2.1. Mathematical Ground for Regular Pointwise Convolutions
Let $X^i = \{x^i_1, x^i_2, \ldots, x^i_{Ic_i}\}$ be a set of input feature maps (2D lattices) for a convolutional layer $i$ in a DCNN, where $Ic_i$ denotes the number of input channels for this layer. Let
$W^i = \{w^i_1, w^i_2, \ldots, w^i_{F_i}\}$ be a set of filters containing the weights for convolutions, where $F_i$ denotes the number of filters at layer $i$, which is also the number of output channels of this layer. Following the notation proposed in [30], a regular DCNN convolution can be mathematically expressed as in Equation (1):
$$X^{i+1} = W^i \otimes X^i = \{w^i_1 \odot X^i,\; w^i_2 \odot X^i,\; \ldots,\; w^i_{F_i} \odot X^i\} \qquad (1)$$
where the $\otimes$ operator indicates that the filters in $W^i$ are convolved with the feature maps in $X^i$, using the $\odot$ operator to indicate a 3D tensor multiplication and shifting of a filter $w^i_j$ across all patches of the size of the filter in all feature maps. For simplicity, we are ignoring the bias terms. Consequently, $X^{i+1}$ will contain $F_i$ feature maps that will feed the next layer $i+1$. The tensor shapes of the involved elements are the following:
$$X^i \in \mathbb{R}^{H \times W \times Ic_i}, \quad W^i \in \mathbb{R}^{F_i \times S \times S \times Ic_i}, \quad w^i_j \in \mathbb{R}^{S \times S \times Ic_i}, \quad X^{i+1} \in \mathbb{R}^{H \times W \times F_i} \qquad (2)$$
where $H \times W$ is the size (height, width) of the feature maps, and $S \times S$ is the size of a filter (usually square). In this paper we work with $S = 1$ because we are focused on pointwise convolutions. In this case, each filter $w^i_j$ carries $Ic_i$ weights. The total number of weights $P_i$ in layer $i$ is obtained with a simple multiplication:
$$P_i = Ic_i \cdot F_i \qquad (3)$$
2.2. Definition of Grouped Pointwise Convolutions
For expressing a grouped pointwise convolution, let us split the input feature maps and the set of filters into $G_i$ groups, as $X^i = \{X^i_1, X^i_2, \ldots, X^i_{G_i}\}$ and $W^i = \{W^i_1, W^i_2, \ldots, W^i_{G_i}\}$. Assuming that both $Ic_i$ and $F_i$ are divisible by $G_i$, the elements in $X^i$ and $W^i$ can be evenly distributed through all their subsets $X^i_j$ and $W^i_j$. Then, Equation (1) can be reformulated as Equation (4):
$$X^{i+1} = \{W^i_1 \otimes X^i_1,\; W^i_2 \otimes X^i_2,\; \ldots,\; W^i_{G_i} \otimes X^i_{G_i}\} \qquad (4)$$
The shapes of the subsets are the following:
$$X^i_m \in \mathbb{R}^{H \times W \times Ic_i/G_i}, \quad W^i_m \in \mathbb{R}^{Fg_i \times 1 \times 1 \times Ic_i/G_i}, \quad w^{i,m}_j \in \mathbb{R}^{1 \times 1 \times Ic_i/G_i} \qquad (5)$$
where $Fg_i$ is the number of filters per group, namely, $Fg_i = F_i/G_i$. Since each filter $w^{i,m}_j$ only convolves on a fraction of the input channels ($Ic_i/G_i$), the total number of weights per subset $W^i_m$ is $(F_i/G_i) \cdot (Ic_i/G_i)$. Multiplying the last expression by the number of groups provides the total number of weights $P_i$ in a grouped pointwise convolutional layer $i$:
$$P_i = (Ic_i \cdot F_i)/G_i \qquad (6)$$
Equation (6) shows that the number of trainable parameters is inversely proportional to the number of groups. However, grouping has the evident drawback that it prevents the filters from being connected to all input channels, which reduces the possible combinations of input channels for learning new patterns. As this may lead to a lower learning capacity of the DCNN, one must be cautious when using such a grouping technique.
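As a minimal illustration of the trade-off between Equations (3) and (6), the following sketch counts the weights of a standard pointwise convolution and of a grouped one with $G_i = 8$. It assumes TensorFlow/Keras (version 2.3 or later, where Conv2D accepts a groups argument); the layer sizes are toy figures taken from Table 1.

```python
import tensorflow as tf

# toy figures taken from Table 1: 160 input channels, 3840 filters
ic, f, groups = 160, 3840, 8
inputs = tf.keras.Input(shape=(32, 32, ic))

standard = tf.keras.layers.Conv2D(f, kernel_size=1, use_bias=False)                 # Equation (3)
grouped = tf.keras.layers.Conv2D(f, kernel_size=1, groups=groups, use_bias=False)   # Equation (6)

standard(inputs)
grouped(inputs)                     # calling the layers builds their weights
print(standard.count_params())      # 160 * 3840   = 614,400
print(grouped.count_params())       # 614,400 / 8  =  76,800
```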
2.3. Improved Scheme for Reducing the Complexity of Pointwise Convolutions
Two major limitations of our previous method were inherited from constraints found
in most deep learning APIs:
• The number of input channels $Ic_i$ must be a multiple of the number of groups $G_i$.
• The number of filters $F_i$ must be a multiple of the number of groups $G_i$.
The present work circumvents the first limitation by replicating channels from the
input. The second limitation is circumvented by adding a second parallel path with another
pointwise grouped convolution when required. Figure 1 shows an example of our updated
architecture.
Details of this process are described below; it is applied to substitute each pointwise convolutional layer $i$ found in the original architecture. To explain the method, we start by detailing the construction of the layer K shown in Figure 1. For simplicity, we drop the index $i$ and use the index $K$ to refer to the original hyperparameters, i.e., we use $Ic_K$ instead of $Ic_i$ and $F_K$ instead of $F_i$. Also, we will use the indexes $K1$ and $K2$ to refer to the parameters of the two parallel paths that may exist in layer K.
First of all, we must manually specify the value of the hyperparameter $Ch$. In the graphical example shown in Figure 1, we set $Ch = 4$. The rest of the hyperparameters, such as the number of groups in layers K and L, are determined automatically by the rules of our algorithm, according to the chosen value of $Ch$, the number of input channels $Ic_K$ and the number of filters $F_K$. We do not have a procedure to find the optimal value of $Ch$; hence we must apply ablation studies on a range of $Ch$ values, as shown in the results section. For the example in Figure 1, we have chosen the value of $Ch$ to obtain a full variety of situations that must be tackled by our algorithm, i.e., non-divisibility conditions.
Figure 1. A schematic diagram of our pointwise convolution replacement. This example replaces a
pointwise convolution with 14 input channels and 10 filters. It contains two convolutional layers, K
and L, one interleaving, and one summation layer. Channels surrounded by a red border represent
replicated channels.
2.4. Definition of Layer K
The first step of the algorithm is to compute the number of groups in branch K1, as in Equation (7):
$$G_{K1} = \lceil Ic_K / Ch \rceil \qquad (7)$$
Since the number of input channels $Ic_K$ may not be divisible by $Ch$, we use the ceiling operator on the division to obtain an integer number of groups. In the example, $G_{K1} = \lceil 14/4 \rceil = 4$. Thus, the output of filters in branch K1 can be defined as in (8):
$$K1 = \{W^{K1}_1 \otimes X^K_1,\; W^{K1}_2 \otimes X^K_2,\; \ldots,\; W^{K1}_{G_{K1}} \otimes X^K_{G_{K1}}\} \qquad (8)$$
The subsets $X^K_m$ are composed of input feature maps $x_j$, collected in a sorted manner, i.e., $X^K_1 = \{x_1, x_2, \ldots, x_{Ch}\}$, $X^K_2 = \{x_{Ch+1}, x_{Ch+2}, \ldots, x_{2Ch}\}$, etc. Equation (9) provides a general definition of which feature maps $x_j$ are included in any feature subset $X^K_m$:
$$X^K_m = \{x_{a+1}, x_{a+2}, \ldots, x_{a+Ch}\}, \quad a = (m-1) \cdot Ch \qquad (9)$$
However, if $Ic_K$ is not divisible by $Ch$, the last group $m = G_{K1}$ would not have $Ch$ channels. In this case, the method will complete this last group by replicating $Ch - b$ initial input channels, where $b$ is computed as stated in Equation (10):
$$X^K_{G_{K1}} = \{x_{a+1}, x_{a+2}, \ldots, x_{a+b},\; x_1, x_2, \ldots, x_{Ch-b}\}, \quad a = (G_{K1}-1) \cdot Ch, \quad b = G_{K1} \cdot Ch - Ic_K \qquad (10)$$
It can be proved that $b$ will always be less than or equal to $Ch$, since $b$ is the excess of the integer division $Ic_K/Ch$, i.e., $G_{K1} \cdot Ch$ will always be above or equal to $Ic_K$, but less than $Ic_K + Ch$, because otherwise $G_{K1}$ would increase its value (as a quotient of $Ic_K/Ch$). In the example, $b = 2$, hence $X^K_4 = \{x_{13}, x_{14}, x_1, x_2\}$.
Then, the method calculates the number of filters per group $Fg_{K1}$ as in (11):
$$Fg_{K1} = \lfloor F_K / G_{K1} \rfloor \qquad (11)$$
To avoid divisibility conflicts, this time we have chosen the floor integer division. For the first path K1, each of the filter subsets shown in (8) will contain the following filters:
$$W^{K1}_m = \{w^{K1,m}_1, w^{K1,m}_2, \ldots, w^{K1,m}_{Fg_{K1}}\}, \quad w^{K1,m}_j \in \mathbb{R}^{1 \times 1 \times Ch} \qquad (12)$$
For the first path of the example, the number of filters per group is $Fg_{K1} = \lfloor 10/4 \rfloor = 2$. So, the first path has 4 groups ($G_{K1}$) of 2 filters ($Fg_{K1}$), each filter being connected to 4 input channels ($Ch$).
If $F_K$ is not divisible by $Ch$, a second path K2 will provide as many groups as filters not provided in K1, with one filter per group, to complete the total number of filters $F_K$:
$$G_{K2} = F_K - Fg_{K1} \cdot G_{K1}, \quad Fg_{K2} = 1 \qquad (13)$$
In the example, $G_{K2} = 2$. The required number of input channels for the second path is $Ch \cdot G_{K2}$. The method obtains those channels by reusing the same subsets of input feature maps $X^K_m$ shown in (9). Hence, the output of filters in path K2 can be defined as in (14):
$$K2 = \{w^{K2}_1 \otimes X^K_1,\; w^{K2}_2 \otimes X^K_2,\; \ldots,\; w^{K2}_{G_{K2}} \otimes X^K_{G_{K2}}\} \qquad (14)$$
where $w^{K2}_j \in \mathbb{R}^{1 \times 1 \times Ch}$. Therefore, each filter in K2 operates on exactly the same subset of input channels as the corresponding subset of filters in K1. Hence, each filter in the second path can be considered as belonging to one of the groups of the first path.
It must be noticed that $G_{K2}$ will always be less than $G_{K1}$. This is true because $G_{K2}$ is the remainder of the integer division $F_K/G_{K1}$, as can be deduced from (11) and (13). This property guarantees that there will be enough subsets $X^K_m$ for this second path.
After defining paths K1 and K2 in layer K, the output of this layer is the concatenation
of both paths:
$$K = \{K1, K2\} \qquad (15)$$
The total number of channels after the concatenation is equal to $F_K = G_{K1} \cdot Fg_{K1} + G_{K2}$.
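The computation of the layer-K hyperparameters can be sketched in a few lines of plain Python. This is an illustrative reconstruction of Equations (7), (11) and (13), not the released implementation; the figures correspond to the example of Figure 1.

```python
import math

def layer_k_setup(ic_k, f_k, ch):
    """Hyperparameters of layer K, following Equations (7), (11) and (13)."""
    g_k1 = math.ceil(ic_k / ch)      # Equation (7): number of groups in path K1
    fg_k1 = f_k // g_k1              # Equation (11): filters per group in path K1
    g_k2 = f_k - fg_k1 * g_k1        # Equation (13): single-filter groups in path K2
    replicated = g_k1 * ch - ic_k    # channels replicated from the input to fill the last group
    return g_k1, fg_k1, g_k2, replicated

# Example of Figure 1: IcK = 14 input channels, FK = 10 filters, Ch = 4
print(layer_k_setup(14, 10, 4))      # -> (4, 2, 2, 2)
```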
2.5. Interleaving Stage
As mentioned above, grouped convolutions inherently face a limitation: each parallel group of filters computes its output from its own subset of input channels, preventing combinations of channels connected to different groups. To alleviate this limitation, we propose to interleave the output channels from the convolutional layer K.
The interleaving process simply consists in arranging the odd channels first and the even channels last, as noted in Equation (16):
$$I^K = \{k_1, k_3, k_5, \ldots, k_{2c-1},\; k_2, k_4, k_6, \ldots, k_{2c}\}, \quad c = \lfloor F_K/2 \rfloor \qquad (16)$$
Here we are assuming that $F_K$ is even. Otherwise, the list of odd channels will include an extra channel $k_{2c+1}$.
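The interleaving of Equation (16) is just a fixed permutation of channel indexes; a minimal sketch follows (plain Python, illustrative only):

```python
def interleave(channels):
    """Odd-positioned channels first, even-positioned channels last (Equation (16))."""
    return channels[0::2] + channels[1::2]

# Layer K of the example outputs F_K = 10 channels k1..k10
k = [f"k{j}" for j in range(1, 11)]
print(interleave(k))   # ['k1', 'k3', 'k5', 'k7', 'k9', 'k2', 'k4', 'k6', 'k8', 'k10']
```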
2.6. Definition of Layer L
The interleaved output feeds the grouped convolutions in layer L to process data
coming from more than one group from the preceding layer K.
To create layer L, we apply the same algorithm as for layer K, but now the number of
input channels is equal to $F_K$ instead of $Ic_K$.
The number of groups in path L1 is computed as:
$$G_{L1} = \lceil F_K / Ch \rceil \qquad (17)$$
Note that $G_{L1}$ may not be equal to $G_{K1}$. In the example, $G_{L1} = \lceil 10/4 \rceil = 3$.
Then, the output of L1 is computed as in (18), where the input channel groups $I^K_m$ come from the interleaving stage. Each group is composed of $Ch$ channels, whose indexes are generically defined in (19):
$$L1 = \{W^{L1}_1 \otimes I^K_1,\; W^{L1}_2 \otimes I^K_2,\; \ldots,\; W^{L1}_{G_{L1}} \otimes I^K_{G_{L1}}\} \qquad (18)$$
$$I^K_m = \{i^K_{a+1}, i^K_{a+2}, \ldots, i^K_{a+Ch}\}, \quad a = (m-1) \cdot Ch \qquad (19)$$
Again, the last group of indexes may not contain $Ch$ channels due to a non-exact division condition in (17). Similar to path K1, for path L1 the missing channels in the last group will be supplied by replicating $Ch - b$ initial interleaved channels, where $b$ is computed as stated in Equation (20):
$$I^K_{G_{L1}} = \{i^K_{a+1}, i^K_{a+2}, \ldots, i^K_{a+b},\; i^K_1, i^K_2, \ldots, i^K_{Ch-b}\}, \quad a = (G_{L1}-1) \cdot Ch, \quad b = G_{L1} \cdot Ch - F_K \qquad (20)$$
The number of filters per group $Fg_{L1}$ is computed as in (21):
$$Fg_{L1} = \lfloor F_K / G_{L1} \rfloor \qquad (21)$$
In the example, $Fg_{L1} = \lfloor 10/3 \rfloor = 3$. Each group of filters $W^{L1}_m$ shown in (18) can be defined as in (22), each one containing $Fg_{L1}$ convolutional filters of $Ch$ inputs:
$$W^{L1}_m = \{w^{L1,m}_1, w^{L1,m}_2, \ldots, w^{L1,m}_{Fg_{L1}}\}, \quad w^{L1,m}_j \in \mathbb{R}^{1 \times 1 \times Ch} \qquad (22)$$
It should be noted that if the division in (21) is not exact, the number of output channels from layer L may not reach the required $F_K$ outputs. In this case, a second path L2 will be added, with the following parameters:
$$G_{L2} = F_K - Fg_{L1} \cdot G_{L1}, \quad Fg_{L2} = 1 \qquad (23)$$
In the example, $G_{L2} = 1$. The output of path L2 is computed as in (24), defining one extra convolutional filter for some initial groups of interleaved channels declared in (18) and (19), taking into account that $G_{L2}$ will always be less than $G_{L1}$ according to the same reasoning done for $G_{K2}$ and $G_{K1}$:
$$L2 = \{w^{L2}_1 \otimes I^K_1,\; w^{L2}_2 \otimes I^K_2,\; \ldots,\; w^{L2}_{G_{L2}} \otimes I^K_{G_{L2}}\} \qquad (24)$$
The last step in defining the output of layer L is to join the outputs of paths L1 and L2:
$$L = \{L1, L2\} \qquad (25)$$
2.7. Joining of Layers
Finally, the outputs of both convolutional layers K and L are summed to create the output of the original layer:
$$X^{i+1} = K + L \qquad (26)$$
Compared to concatenation, summation has the advantage of allowing residual learning in the filters of layer L, because the gradient can be backpropagated through the L filters or directly to the K filters. In other words, residual layers provide more learning capacity with a low degree of downsides due to increasing the number of layers (i.e., overfitting, longer training time, etc.). In the results section, we present an ablation study that contains experiments done without the interleaving and the L layers (rows labeled with "no L"). These experiments empirically prove that the interleaving mechanism and the secondary L layer help in improving the sub-architecture accuracy, with low impact.
It is worth mentioning that we only add the layer L and the interleaving when the number of input channels is greater than or equal to the number of filters in layer K.
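As a sketch of this joining step (TensorFlow/Keras, purely illustrative: the two Conv2D layers below are plain stand-ins for the grouped paths of layers K and L, not our actual sub-architecture), summation preserves the $F_K$ channels, whereas concatenation would double them:

```python
import tensorflow as tf
from tensorflow.keras import layers

inputs = tf.keras.Input(shape=(32, 32, 14))                 # 14 input channels, as in Figure 1
k_out = layers.Conv2D(10, 1, name="layer_K_stub")(inputs)   # stand-in for layer K (10 filters)
l_out = layers.Conv2D(10, 1, name="layer_L_stub")(k_out)    # stand-in for interleaving + layer L

out_sum = layers.Add(name="residual_sum")([k_out, l_out])   # Equation (26): keeps F_K = 10 channels
out_cat = layers.Concatenate(name="concat")([k_out, l_out]) # concatenation would yield 20 channels
print(out_sum.shape, out_cat.shape)                         # (None, 32, 32, 10) (None, 32, 32, 20)
```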
2.8. Computing the Number of Parameters
We can compute the total number of parameters of our sub-architecture. First, Equation (27) shows that the number of filters in layer K is equal to the number of filters in layer L, which in turn is equal to the total number of filters in the original convolutional layer $F_i$:
$$Fg_{K1} \cdot G_{K1} + G_{K2} = Fg_{L1} \cdot G_{L1} + G_{L2} = F_i \qquad (27)$$
Then, the total number of parameters $P_i$ is twice the number of original filters multiplied by the number of input channels per filter:
$$P_i = 2 \, (F_i \cdot Ch) \qquad (28)$$
Therefore, comparing Equation (28) with (3), it is clear that $Ch$ must be significantly less than $Ic_i/2$ to reduce the number of parameters of a regular pointwise convolutional layer. Also, comparing Equation (28) with (6), our sub-architecture provides a parameter
reduction similar to a plain grouped convolutional layer when $Ch$ is around $Ic_i/(2\,G_i)$, although we cannot specify a general $G_i$ term because of the complexity of our pair of layers with possibly two paths per layer.
The requirement for a low value of $Ch$ is also necessary to ensure that the divisions in Equations (7) and (17) provide quotients above one; otherwise our method will not create grouping. Hence, $Ch$ must be less than or equal to $Ic_i/2$ and $F_i/2$. These are the only two constraints that our method is restricted by.
As shown in Table 1, pointwise convolutional layers found in real networks such as EfficientNet-B0 have significant figures for $Ic_i$ and $F_i$, either hundreds or thousands. Therefore, values of $Ch$ less than or equal to 32 will ensure a good ratio of parameter reduction for most of these pointwise convolutional layers.
EfficientNet is one of the most complex (but efficient) architectures that can be found in the literature. For our method, the degree of complexity of a DCNN is mainly related to the maximum number of input channels and output features in any pointwise convolutional layer. Our method does not care about the number of layers, neither in depth nor in parallel, because it works on each layer independently. Therefore, the degree of complexity of EfficientNet-B0 can be considered significantly high, taking into account the values shown in the last row of Table 1. Arguably, other versions of EfficientNet (B1, B2, etc.) and other types of DCNN can exceed those values. In such cases, higher values of $Ch$ may be necessary, but we cannot provide any rule to forecast its optimum value for the configuration of any pointwise convolutional layer.
Table 1. For a standard pointwise convolution with Ic input channels, F filters, P parameters and a given number of channels per group Ch, this table shows the calculated parameters for layers K and L: the number of groups G<layer><path> and the number of filters per group Fg<layer><path>. The last 2 columns show the total number of parameters and its percentage with respect to the original layer.

Original Settings          Layer K                  Layer L             K+L Params
Ic     F     P             Ch   GK1   FgK1   GK2    GL1   FgL1   GL2    Total      %
14     10    140            4     4      2     2      3      3     1       80      57.14%
160    3840  614,400       16    10    384     0      0      0     0   61,440      10.00%
                           32     5    768     0      0      0     0  122,880      20.00%
192    1152  221,184       16    12     96     0      0      0     0   18,432       8.33%
                           32     6    192     0      0      0     0   36,864      16.67%
1152   320   368,640       16    72      4    32     20     16     0   10,240       2.78%
                           32    36      8    32     10     32     0   20,480       5.56%
3840   640   2,457,600     16   240      2   160     40     16     0   20,480       0.83%
                           32   120      5    40     20     32     0   40,960       1.67%
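The K+L figures in Table 1 can be reproduced with a short script. This is an illustrative re-derivation of Equations (7)-(28) together with the rule of Section 2.7 (layer L only when the number of input channels is at least the number of filters), not the released code:

```python
import math

def kl_params(ic, f, ch):
    """Group/filter counts and parameters of the K+L replacement of a pointwise layer."""
    g_k1 = math.ceil(ic / ch); fg_k1 = f // g_k1; g_k2 = f - fg_k1 * g_k1      # layer K, Eqs. (7)-(13)
    if ic >= f:                                      # layer L and interleaving only when inputs >= filters
        g_l1 = math.ceil(f / ch); fg_l1 = f // g_l1; g_l2 = f - fg_l1 * g_l1   # layer L, Eqs. (17)-(23)
        total = 2 * f * ch                           # Equation (28)
    else:
        g_l1 = fg_l1 = g_l2 = 0
        total = f * ch                               # layer K alone
    return (g_k1, fg_k1, g_k2, g_l1, fg_l1, g_l2, total, round(100 * total / (ic * f), 2))

print(kl_params(14, 10, 4))        # (4, 2, 2, 3, 3, 1, 80, 57.14)       -> first row of Table 1
print(kl_params(160, 3840, 16))    # (10, 384, 0, 0, 0, 0, 61440, 10.0)
print(kl_params(1152, 320, 16))    # (72, 4, 32, 20, 16, 0, 10240, 2.78)
```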
2.9. Activation Function
In 2018, Ramachandran et al. [31] tested a number of activation functions. In their experimentation, they found that the best performing one was the so-called "swish", shown in Equation (29):
$$f(x) = x \cdot \mathrm{sigmoid}(\beta x) \qquad (29)$$
In previous works [28,29], we used the ReLU activation function. In this work, we use the swish activation function. This change gives us better results in our ablation experiments, shown in Table 5.
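For reference, swish with $\beta = 1$ is available out of the box in Keras; a tiny illustrative check (assuming TensorFlow is installed):

```python
import tensorflow as tf

x = tf.constant([-2.0, 0.0, 2.0])
print(tf.keras.activations.swish(x))   # x * sigmoid(x), i.e., Equation (29) with beta = 1
print(x * tf.sigmoid(1.0 * x))         # the same values computed explicitly
```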
2.10. Implementation Details
We tested our optimization by replacing the original pointwise convolutions in EfficientNet-B0 and named the resulting model "kEffNet-B0 V2". With CIFAR-10, we tested an additional modification that skips the first 4 convolutional strides, allowing input images with 32×32 pixels instead of the original resolution of 224×224 pixels.
In all our experiments, we saved the trained network from the epoch that achieved the lowest validation loss for testing with the test dataset. Convolutional layers are initialized with Glorot's method [32]. All experiments were trained with the RMSProp optimizer, data augmentation [33] and a cyclical learning rate schedule [34]. We worked with various configurations of hardware with NVIDIA video cards. Regarding software, we did our experiments with K-CAI [35] and Keras [36] on top of TensorFlow [37].
Our source code is publicly available at: https://github.com/joaopauloschuler/kEffNetV2/, accessed on 1 September 2022.
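A sketch of this training setup in Keras follows. It is illustrative only: the stand-in model, the random data and the triangular cyclical schedule parameters are assumptions, not the released configuration (which uses the K-CAI API and the full kEffNet-B0 V2 model).

```python
import numpy as np
import tensorflow as tf

# stand-in model and data, only so that the snippet runs end to end
model = tf.keras.Sequential([tf.keras.Input(shape=(32, 32, 3)),
                             tf.keras.layers.Flatten(),
                             tf.keras.layers.Dense(10, activation="softmax")])
x = np.random.rand(256, 32, 32, 3).astype("float32")
y = tf.keras.utils.to_categorical(np.random.randint(0, 10, 256), 10)

def triangular_lr(epoch, lr, base=1e-4, peak=1e-2, period=20):
    # simple triangular cycle base -> peak -> base over `period` epochs; the current `lr` is ignored
    phase = abs((epoch % period) / (period / 2) - 1.0)
    return peak - (peak - base) * phase

callbacks = [
    tf.keras.callbacks.LearningRateScheduler(triangular_lr),
    # keep the weights from the epoch with the lowest validation loss
    tf.keras.callbacks.ModelCheckpoint("best.weights.h5", monitor="val_loss",
                                       save_best_only=True, save_weights_only=True),
]

model.compile(optimizer=tf.keras.optimizers.RMSprop(),
              loss="categorical_crossentropy", metrics=["accuracy"])
model.fit(x, y, validation_split=0.1, epochs=5, callbacks=callbacks)
```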
2.11. Horizontal Flip
In some experiments, we run the model twice with the input image and its horizontally
flipped version. The output from the softmax from both runs is summed before class
prediction. In these experiments, the number of floating-point computations doubles,
although the number of trainable parameters remains the same.
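A minimal sketch of this test-time procedure (NumPy/Keras, illustrative; the model below is a random stand-in for any trained classifier with a softmax output):

```python
import numpy as np
import tensorflow as tf

# stand-in classifier with a softmax output, only so that the snippet runs
model = tf.keras.Sequential([tf.keras.Input(shape=(32, 32, 3)),
                             tf.keras.layers.GlobalAveragePooling2D(),
                             tf.keras.layers.Dense(10, activation="softmax")])

images = np.random.rand(8, 32, 32, 3).astype("float32")
probs = model.predict(images) + model.predict(images[:, :, ::-1, :])  # original + horizontally flipped run
predicted = probs.argmax(axis=1)   # class prediction from the summed softmax outputs
print(predicted)
```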
3. Results and Discussion
In this section, we present and discuss the results of the proposed scheme with three image classification datasets: the CIFAR-10 dataset [38], the Malaria dataset [40], and the colorectal cancer histology dataset [39].
3.1. Results on the CIFAR-10 Dataset
The CIFAR-10 dataset [38] is a subset of [41] and consists of 60k 32×32 images belonging to 10 different classes: airplane, automobile, bird, cat, deer, dog, frog, horse, ship and truck. These images are taken from natural and uncontrolled lighting environments. They contain only one prominent instance of the object to which the class refers. The object may be partially occluded or seen from an unusual viewpoint. This dataset has 50k images for training and 10k images for testing. We picked 5k images for validation and left the training set with 45k images. We ran experiments with 50 and 180 epochs.
In Table 2, we compare kEffNet-B0 V1 (our previous method) and V2 (our current method) for two values of Ch. We can see that our V2 models have a slightly larger reduction in both the number of parameters and floating-point computations than the V1 counterpart models, while achieving slightly higher accuracy. Specifically, V2 models save 10% of the parameters (from 1,059,202 to 950,650) and 11% of the floating-point computations (from 138,410,206 to 123,209,110) of V1 models. All of our variants obtain similar accuracy to the baseline with a remarkable reduction of resources (our variants require at most 26.3% of the trainable parameters and 35.5% of the computations of the baseline).
Table 2.
Comparing EfficientNet-B0, kEffNet-B0 V1 and kEffNet-B0 V2 with CIFAR-10 dataset after
50 epochs.
Model Parameters % Computations % Test acc.
EfficientNet-B0 baseline 4,020,358 100.0% 389,969,098 100.0% 93.33%
kEffNet-B0 V1 16ch 639,702 15.9% 84,833,890 21.8% 92.46%
kEffNet-B0 V2 16ch 623,226 15.5% 82,804,374 21.2% 92.61%
kEffNet-B0 V1 32ch 1,059,202 26.3% 138,410,206 35.5% 93.61%
kEffNet-B0 V2 32ch 950,650 23.6% 123,209,110 31.6% 93.67%
As the scope of this work is limited to small datasets and small architectures, we only
experimented with the smallest EfficientNet variant (EfficientNet-B0) and our modified
variant (kEffNet-B0). Nevertheless, Table 3 provides the number of trainable parameters of the other EfficientNet variants (original and parameter-reduced). Equation (3) indicates that the number of parameters grows with the number of filters and the number of input channels. Equation (6) indicates that the number of parameters decreases with the number of groups. As we create more groups when the number of input channels grows, we expect to find bigger parameter savings on larger models. This saving can be seen in Table 3.
Table 3.
Number of trainable parameters for EfficientNet, kEffNet V2 16ch and kEffNet V2 32ch with
a 10 classes dataset.
Variant EfficientNet kEffNet V2 16ch % kEffNet V2 32ch %
B0 4,020,358 623,226 15.50% 950,650 23.65%
B1 6,525,994 968,710 14.84% 1,389,062 21.29%
B2 7,715,084 983,198 12.74% 1,524,590 19.76%
B3 10,711,602 1,280,612 11.96% 2,001,430 18.68%
B4 17,566,546 1,858,440 10.58% 2,911,052 16.57%
B5 28,361,274 2,538,870 8.95% 4,011,626 14.14%
B6 40,758,754 3,324,654 8.16% 5,245,140 12.87%
B7 63,812,570 4,585,154 7.19% 7,254,626 11.37%
We also tested our kEffNet-B0 with 2, 4, 8, 16 and 32 channels per group for 50 epochs, as shown in Table 4. As expected, the test classification accuracy increases when allocating more channels per group: from 84.36% for Ch = 2 to 93.67% for Ch = 32. Also, the resource saving decreases as the number of channels per group increases: from 7.8% of the parameters and 11.4% of the computations for Ch = 2 to 23.6% of the parameters and 31.6% of the computations for Ch = 32 (compared to the baseline). For CIFAR-10, if we aim to achieve an accuracy comparable to the baseline, we must choose at least 16 channels per group. If we add an extra run per image sample with horizontal flipping when training kEffNet-B0 V2 32ch, the classification accuracy increases from 93.67% to 94.01%.
Table 4.
Ablation study done with the CIFAR-10 dataset for 50 epochs, comparing the effect of
varying the number of channels per group. It also includes the improvement achieved by double
training kEffNet-B0 V2 32ch with original images and horizontally flipped images.
Model Parameters % Computations % Test acc.
EfficientNet-B0 baseline 4,020,358 100.0% 389,969,098 100.0% 93.33%
kEffNet-B0 V2 2ch 311,994 7.8% 44,523,286 11.4% 84.36%
kEffNet-B0 V2 4ch 354,818 8.8% 49,487,886 12.7% 87.66%
kEffNet-B0 V2 8ch 444,346 11.1% 60,313,526 15.5% 90.53%
kEffNet-B0 V2 16ch 623,226 15.5% 82,804,374 21.2% 92.61%
kEffNet-B0 V2 32ch 950,650 23.6% 123,209,110 31.6% 93.67%
kEffNet-B0 V2 32ch + H Flip 950,650 23.6% 246,418,220 63.3% 94.01%
Table 5 replicates most of the results shown in Table 4, but compares the effect of not including layer L and the interleaving, and also of substituting the swish activation function with the typical ReLU. As can be observed, disabling layer L causes a noticeable degradation in test accuracy when the values of Ch are smaller. For example, when Ch = 4, the performance drops by more than 5%. On the other hand, when Ch = 32 the drop is less than 0.5%. This is logical taking into account that the more channels are included per group, the more chances there are to combine input features in the filters. Therefore, a second layer and the corresponding interleaving are not as crucial as when the filters of layer K are fed with fewer channels.
In the comparison of activation functions, the same effect can be appreciated: the
swish function works better than the ReLU function, but provides less improvement for
larger number of channels per group. Nevertheless, the gain in the least difference case
(32 ch) is still profitable, with more than 1.5% of extra test accuracy when using the swish
activation function.
Table 5.
Extra experiments made for kEffNet-B0 V2 4ch, 8ch, 16ch and 32ch variants. Rows labeled
with “no L” indicate experiments done using only layer K, i.e., disabling layer L and the interleaving.
Rows labeled with “ReLU” replace the swish activation function by ReLU.
Model Parameters % Computations % Test acc.
EfficientNet-B0 baseline 4,020,358 100.0% 389,969,098 100.0% 93.33%
kEffNet-B0 V2 4ch 354,818 8.8% 49,487,886 12.7% 87.66%
kEffNet-B0 V2 4ch no L 342,070 8.5% 48,064,098 12.3% 82.44%
kEffNet-B0 V2 4ch ReLU 354,818 8.8% 47,595,914 12.2% 85.34%
kEffNet-B0 V2 8ch 444,346 11.1% 60,313,526 15.5% 90.53%
kEffNet-B0 V2 8ch no L 422,886 10.5% 57,466,370 14.7% 89.27%
kEffNet-B0 V2 8ch ReLU 444,346 11.1% 58,421,554 15.0% 88.82%
kEffNet-B0 V2 16ch 623,226 15.5% 82,804,374 21.2% 92.61%
kEffNet-B0 V2 16ch no L 584,934 14.6% 77,356,802 19.8% 91.52%
kEffNet-B0 V2 16ch ReLU 623,226 15.5% 80,912,406 20.8% 91.16%
kEffNet-B0 V2 32ch 950,650 23.6% 123,209,110 31.6% 93.67%
kEffNet-B0 V2 32ch no L 879,750 21.9% 112,684,706 28.9% 93.21%
kEffNet-B0 V2 32ch ReLU 950,650 23.7% 121,317,142 31.1% 92.00%
Table 6 shows the effect on accuracy when classifying the CIFAR-10 dataset with EfficientNet-B0 and our kEffNet-B0 V2 32ch variant for 180 epochs instead of 50 epochs. The additional training epochs give a slightly higher test accuracy to the baseline than to our core variant. When adding horizontal flipping, our variant slightly surpasses the baseline results. Nevertheless, all three results can be considered similar to each other, but our variant offers a significant saving in parameters and computations. Although the H flipping doubles the computational cost of our core variant, it still remains only a fraction (63.3%) of the baseline computational cost.
Table 6. Results obtained with the CIFAR-10 dataset after 180 epochs.
Model Parameters % Computations % Test acc.
EfficientNet-B0 baseline 4,020,358 100.0% 389,969,098 100.0% 94.86%
kEffNet-B0 V2 32ch 950,650 23.6% 123,209,110 31.6% 94.45%
kEffNet-B0 V2 32ch + H Flip 950,650 23.6% 246,418,220 63.3% 94.95%
3.2. Results on the Malaria Dataset
The Malaria dataset [40] has 27,558 cell images from infected and healthy cells separated into 2 classes. There is the same number of images for healthy and infected cells. From the original set of 27,558 images, we separated 10% of the images (2756 images) for validation and another 10% for testing. In the training, validation, and test subsets, 50% of the images are of healthy cells. We quadruplicated the number of validation images by flipping these images horizontally and vertically, resulting in 11,024 images for validation.
On this dataset, we tested our kEffNet-B0 with 2, 4, 8, 12, 16, and 32 channels per group, as well as the baseline architecture, as shown in Table 7. Our variants have from 7.5% to 23.5% of the trainable parameters and from 15.7% to 42.2% of the computations allocated by the baseline architecture. Although the worst classification accuracy was found with the smallest variant (2ch), its classification accuracy is less than 1% below the best performing variant (16ch) and only 0.69% below the baseline performance. With only 8 channels per group, our method equals the baseline accuracy with a small portion of
the parameters (10.8%) and computations (22.5%) required by the baseline architecture. Curiously, our 32ch variant is slightly worse than the 16ch variant, but still better than the baseline. It is an example that a rather low complexity of the input images may require fewer channels per filter (and more parallel groups of filters) to optimally capture the relevant features of the images.
Table 7. Results obtained with the Malaria dataset after 75 epochs.
Model Parameters % Computations % Test acc.
EfficientNet-B0 baseline 4,010,110 100.0% 389,958,834 100.0% 97.39%
kEffNet-B0 V2 2ch 301,746 7.5% 61,196,070 15.7% 96.70%
kEffNet-B0 V2 4ch 344,570 8.6% 69,691,358 17.9% 96.95%
kEffNet-B0 V2 8ch 434,098 10.8% 87,725,254 22.5% 97.39%
kEffNet-B0 V2 12ch 524,026 13.1% 106,199,566 27.2% 97.31%
kEffNet-B0 V2 16ch 612,978 15.3% 124,672,934 32.0% 97.61%
kEffNet-B0 V2 32ch 940,402 23.5% 164,422,950 42.2% 97.57%
3.3. Results on the Colorectal Cancer Histology Dataset
The collection of samples in the colorectal cancer histology dataset [39] contains 5000 150×150 images separated into 8 classes: adipose, complex, debris, empty, lympho, mucosa, stroma, and tumor. Similar to what we did with the Malaria dataset, we separated 10% of the images for validation and another 10% for testing. We also quadruplicated the number of validation images by flipping these images horizontally and vertically.
On this dataset, we tested our kEffNet-B0 with 2, 4, 8, 12, and 16 channels per group, as well as the baseline architecture, as shown in Table 8. Similar to the Malaria dataset, higher values of channels per group do not lead to better performance. In this case, the variants with the highest classification accuracy are 4ch and 8ch, achieving 98.02% of classification accuracy and outperforming the baseline accuracy by 0.41%. The 16ch variant obtained the same accuracy as the 2ch variant, but doubles the required resources. Again, this indicates that the complexity of the images plays a role in the selection of the optimal number of channels per group. In other words, simpler images may require fewer channels per group. Unfortunately, the only method we know to find this optimal value is performing these scanning experiments.
Table 8. Results obtained with the colorectal cancer dataset after 1000 epochs.
Model Parameters % Computations % Test acc.
EfficientNet-B0 baseline 4,017,796 100.0% 389,966,532 100.0% 97.61%
kEffNet-B0 V2 2ch 355,064 8.8% 61,203,768 15.7% 97.62%
kEffNet-B0 V2 4ch 397,888 9.9% 69,699,056 17.9% 98.02%
kEffNet-B0 V2 8ch 487,416 12.1% 87,732,952 22.5% 98.02%
kEffNet-B0 V2 12ch 531,712 13.2% 106,207,264 27.2% 97.22%
kEffNet-B0 V2 16ch 620,664 15.4% 124,680,632 32.0% 97.62%
4. Conclusions and Future Work
This paper presented an efficient scheme for decreasing the complexity of pointwise convolutions in DCNNs for image classification based on interleaved grouped filters with no divisibility constraints. From our experiments, we can conclude that connecting all input channels from the previous layer to all filters is unnecessary: grouped convolutional filters can achieve the same learning power with a small fraction of the resources (1/3 of the floating-point computations, 1/4 of the parameters). Our enhanced scheme avoids the divisibility constraints, further reducing the required resources (up to 10% less) while maintaining or slightly surpassing the accuracy of our previous method.
We have made ablation studies to obtain the optimal number of channels per group for each dataset. For the colorectal cancer dataset, this number is surprisingly low (4 channels per group). On the other hand, for CIFAR-10 the best results require at least 16 channels per group. This fact indicates that the complexity of the input images affects the optimal configuration of our sub-architecture.
The main limitation of our method is that it cannot determine the optimal number of channels per group automatically, according to the complexity of each pointwise convolutional layer to be substituted and the complexity of the input images. A second limitation is that the same number of channels per group is applied to all pointwise convolutional layers of the target architecture, regardless of the specific complexity of each layer. This limitation could be easily tackled by setting Ch as a fraction of the total number of parameters of each layer. This is a straightforward task for future research. Besides, we will apply our method to different problems, such as instance and semantic image segmentation, developing an efficient deep learning-based seismic acoustic impedance inversion method [42], object detection, and forecasting.
Author Contributions:
Conceptualization, J.P.S.S. and S.R.A.; methodology, J.P.S.S. and S.R.A.;
software, J.P.S.S.; validation, S.R.A.; formal analysis, S.R.A.; investigation, J.P.S.S.; resources, D.P.;
data curation, J.P.S.S.; writing—original draft preparation, J.P.S.S. and S.R.A.; writing—review and
editing, J.P.S.S., S.R.A. and M.A.-N.; visualization, J.P.S.S.; supervision, S.R.A., M.A.-N., H.R. and
D.P.; project administration, D.P.; funding acquisition, D.P. All authors have read and agreed to the
published version of the manuscript.
Funding:
The Spanish Government partly supported this research through Project PID2019-105789RB-I00.
Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.
Data Availability Statement: Datasets used in this study are publicly available: CIFAR-10 [38], Colorectal cancer histology [39] and Malaria [40]. Software APIs are also publicly available: K-CAI [35] and Keras [36]. Our source code and raw experiment results are publicly available: https://github.com/joaopauloschuler/kEffNetV2, accessed on 1 September 2022.
Conflicts of Interest: The authors declare no conflict of interest.
Abbreviations
The following abbreviations are used in this manuscript:
API Application Programming Interface
DCNN Deep Convolutional Neural Network
NiN Network in Network
References
1. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet Classification with Deep Convolutional Neural Networks. In Advances in Neural Information Processing Systems 25; Pereira, F., Burges, C.J.C., Bottou, L., Weinberger, K.Q., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2012; pp. 1097–1105.
2. Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.; et al. ImageNet Large Scale Visual Recognition Challenge. Int. J. Comput. Vis. (IJCV) 2015, 115, 211–252, doi:10.1007/s11263-015-0816-y.
3. Zeiler, M.D.; Fergus, R. Visualizing and Understanding Convolutional Networks. In Computer Vision—ECCV 2014; Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T., Eds.; Springer International Publishing: Cham, Switzerland, 2014; pp. 818–833.
4. Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. In Proceedings of the 3rd International Conference on Learning Representations, ICLR, San Diego, CA, USA, 7–9 May 2015; Conference Track Proceedings; Bengio, Y., LeCun, Y., Eds.; 2015.
5. Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deeper with convolutions. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; IEEE Computer Society: Los Alamitos, CA, USA, 2015; pp. 1–9, doi:10.1109/CVPR.2015.7298594.
6. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; IEEE Computer Society: Los Alamitos, CA, USA, 2016; pp. 770–778, doi:10.1109/CVPR.2016.90.
7. Huang, G.; Liu, Z.; Weinberger, K.Q. Densely Connected Convolutional Networks. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 2261–2269, doi:10.1109/CVPR.2017.243.
8. Lin, M.; Chen, Q.; Yan, S. Network In Network. arXiv 2014, arXiv:cs.NE/1312.4400.
9. Ioannou, Y.; Robertson, D.P.; Cipolla, R.; Criminisi, A. Deep Roots: Improving CNN Efficiency with Hierarchical Filter Groups. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 5977–5986, doi:10.1109/CVPR.2017.633.
10. Xie, S.; Girshick, R.B.; Dollár, P.; Tu, Z.; He, K. Aggregated Residual Transformations for Deep Neural Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 5987–5995.
11. Zhang, T.; Qi, G.; Xiao, B.; Wang, J. Interleaved Group Convolutions for Deep Neural Networks. arXiv 2017, arXiv:1707.02725.
12. Zhang, X.; Zhou, X.; Lin, M.; Sun, J. ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 6848–6856, doi:10.1109/CVPR.2018.00716.
13. Sun, K.; Li, M.; Liu, D.; Wang, J. IGCV3: Interleaved Low-Rank Group Convolutions for Efficient Deep Neural Networks. In Proceedings of the BMVC, Newcastle, UK, 3–6 September 2018.
14. Huang, G.; Liu, S.; Maaten, L.v.d.; Weinberger, K.Q. CondenseNet: An Efficient DenseNet Using Learned Group Convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 2752–2761, doi:10.1109/CVPR.2018.00291.
15. Yu, C.; Xiao, B.; Gao, C.; Yuan, L.; Zhang, L.; Sang, N.; Wang, J. Lite-HRNet: A Lightweight High-Resolution Network. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 10435–10445, doi:10.1109/CVPR46437.2021.01030.
16. Tan, M.; Le, Q. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. Int. Conf. Mach. Learn. 2019, 97, 6105–6114.
17. Denil, M.; Shakibi, B.; Dinh, L.; Ranzato, M.; de Freitas, N. Predicting Parameters in Deep Learning. In NIPS'13, Proceedings of the 26th International Conference on Neural Information Processing Systems, Lake Tahoe, CA, USA, 5–10 December 2013; Curran Associates Inc.: Red Hook, NY, USA, 2013; Volume 2; pp. 2148–2156.
18. Cheng, Y.; Yu, F.X.; Feris, R.S.; Kumar, S.; Choudhary, A.N.; Chang, S. An Exploration of Parameter Redundancy in Deep Networks with Circulant Projections. In Proceedings of the 2015 IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; IEEE Computer Society: Washington, DC, USA, 2015; pp. 2857–2865, doi:10.1109/ICCV.2015.327.
19. Yang, W.; Jin, L.; Sile, W.; Cui, Z.; Chen, X.; Chen, L. Thinning of Convolutional Neural Network with Mixed Pruning. IET Image Process. 2019, 13, 779–784, doi:10.1049/iet-ipr.2018.6191.
20. Kahatapitiya, K.; Rodrigo, R. Exploiting the Redundancy in Convolutional Filters for Parameter Reduction. In Proceedings of the 2021 IEEE Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 3–8 January 2021; pp. 1409–1419, doi:10.1109/WACV48630.2021.00145.
21. Liebenwein, L.; Baykal, C.; Carter, B.; Gifford, D.; Rus, D. Lost in Pruning: The Effects of Pruning Neural Networks beyond Test Accuracy. Proc. Mach. Learn. Syst. 2021, 3, 93–138.
22. LeCun, Y.; Denker, J.; Solla, S. Optimal Brain Damage. In Advances in Neural Information Processing Systems; Touretzky, D., Ed.; Morgan-Kaufmann: Burlington, MA, USA, 1989; Volume 2.
23. Reed, R. Pruning algorithms—a survey. IEEE Trans. Neural Netw. 1993, 4, 740–747, doi:10.1109/72.248452.
24. Zhuang, Z.; Tan, M.; Zhuang, B.; Liu, J.; Guo, Y.; Wu, Q.; Huang, J.; Zhu, J. Discrimination-aware Channel Pruning for Deep Neural Networks. In Advances in Neural Information Processing Systems 31; Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2018; pp. 881–892.
25. Han, S.; Mao, H.; Dally, W.J. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding. In Proceedings of the International Conference on Learning Representations (ICLR), San Juan, Puerto Rico, 2–4 May 2016.
26. Baykal, C.; Liebenwein, L.; Gilitschenski, I.; Feldman, D.; Rus, D. Data-Dependent Coresets for Compressing Neural Networks with Applications to Generalization Bounds. In Proceedings of the International Conference on Learning Representations, New Orleans, LA, USA, 6–9 May 2019.
27. Liebenwein, L.; Baykal, C.; Lang, H.; Feldman, D.; Rus, D. Provable Filter Pruning for Efficient Neural Networks. In Proceedings of the International Conference on Learning Representations, Addis Ababa, Ethiopia, 26–30 April 2020.
28. Schuler, J.; Romaní, S.; Abdel-nasser, M.; Rashwan, H.; Puig, D. Grouped Pointwise Convolutions Significantly Reduces Parameters in EfficientNet; IOS Press: Amsterdam, The Netherlands, 2021; pp. 383–391, doi:10.3233/FAIA210158.
29. Schwarz Schuler, J.P.; Romani, S.; Abdel-Nasser, M.; Rashwan, H.; Puig, D. Grouped Pointwise Convolutions Reduce Parameters in Convolutional Neural Networks. MENDEL 2022, 28, 23–31.
30. Wang, X.; Kan, M.; Shan, S.; Chen, X. Fully Learnable Group Convolution for Acceleration of Deep Neural Networks. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; IEEE Computer Society: Los Alamitos, CA, USA, 2019; pp. 9041–9050, doi:10.1109/CVPR.2019.00926.
31. Ramachandran, P.; Zoph, B.; Le, Q.V. Searching for Activation Functions. arXiv 2017, arXiv:1710.05941.
32. Glorot, X.; Bengio, Y. Understanding the difficulty of training deep feedforward neural networks. JMLR Workshop Conf. Proc. 2010, 9, 249–256.
33. Shorten, C.; Khoshgoftaar, T. A survey on Image Data Augmentation for Deep Learning. J. Big Data 2019, 6, 1–48, doi:10.1186/s40537-019-0197-0.
34. Smith, L.N. Cyclical Learning Rates for Training Neural Networks. In Proceedings of the 2017 IEEE Winter Conference on Applications of Computer Vision (WACV), Santa Rosa, CA, USA, 24–31 March 2017; pp. 464–472, doi:10.1109/WACV.2017.58.
35. Schuler, J.P.S. K-CAI NEURAL API; 2021. Available online: https://zenodo.org/record/5810093#.YxnEvbRBxPY (accessed on 4 September 2022).
36. Chollet, F. Keras. 2015. Available online: https://keras.io (accessed on 1 January 2022).
37. Abadi, M.; Agarwal, A.; Barham, P.; Brevdo, E.; Chen, Z.; Citro, C.; Corrado, G.S.; Davis, A.; Dean, J.; Devin, M.; et al. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. Software. 2015. Available online: tensorflow.org (accessed on 1 January 2022).
38. Krizhevsky, A. Learning Multiple Layers of Features from Tiny Images; Technical Report; University of Toronto: Toronto, ON, Canada, 2009.
39. Kather, J.N.; Zöllner, F.G.; Bianconi, F.; Melchers, S.M.; Schad, L.R.; Gaiser, T.; Marx, A.; Weis, C.A. Collection of Textures in Colorectal Cancer Histology; 2016. Available online: https://zenodo.org/record/53169#.YxnFTLRBxPY (accessed on 1 January 2022).
40. Rajaraman, S.; Antani, S.; Poostchi, M.; Silamut, K.; Hossain, M.; Maude, R.; Jaeger, S.; Thoma, G. Pre-trained convolutional neural networks as feature extractors toward improved malaria parasite detection in thin blood smear images. PeerJ 2018, 6, e4568, doi:10.7717/peerj.4568.
41. Torralba, A.; Fergus, R.; Freeman, W.T. 80 Million Tiny Images: A Large Data Set for Nonparametric Object and Scene Recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2008, 30, 1958–1970, doi:10.1109/TPAMI.2008.128.
42. Shahbazi, A.; Monfared, M.S.; Thiruchelvam, V.; Fei, T.K.; Babasafari, A.A. Integration of knowledge-based seismic inversion and sedimentological investigations for heterogeneous reservoir. J. Asian Earth Sci. 2020, 202, 104541, doi:10.1016/j.jseaes.2020.104541.
... However, this method of convolution can lead to a large number of 1 × 1 convolutions (pointwise convolutions), which can consume significant computational resources. [14] has developed an optimized version of the Mobi-grouped pointwise convolution in depthwise separable convolution layer. He has significantly reduced MobileNet network parameters and complexity; however, there are still some drawbacks that need to be addressed. ...
... However, depthwise separable convolution uses pointwise convolution, which generates more than 80% of the parameters in the most recent deep convolutional neural architectures, according to [18]. In order to reduce the parameter size and computational complexity, [14] has proposed using grouped Although dilated convolution reduces the disparity between receptive field size and feature map resolution, it still has significant drawbacks. When dilated convolution is used with a single dilation rate, all of the neurons in the feature map have the same receptive field, and the network only uses features on a single scale. ...
... This way, the output feature map includes semantic information from multiple scales, which can improve classification performance. by [14]. This substitution could offer the dual advantage of reducing computational overhead while preserving accuracy. ...
... Classification Based on Interleaved Grouped Filters without Divisibility Constraints, Entropy. (Schwarz Schuler et al. (2022a)). ...
... Interleaved Grouped Filters without Divisibility Constraints (Schwarz Schuler et al. (2022a)) to be published in open access format. ...
Thesis
Full-text available
Recent architectures in Deep Convolutional Neural Networks (DCNNs) have a very high number of trainable parameters and, consequently, require plenty of hardware and time to run. It’s also commonly found in the literature that most parameters in a DCNN are redundant. This thesis presents two methods for reducing the number of parameters and floating-point computations in existing DCNN architectures applied for image classification. The first method reduces parameters in the first layers of a neural network, while the second method reduces parameters in deeper layers. The first method is a modification of the first layers of a DCNN that splits the channels of an image encoded with CIE Lab color space in two separate branches, one for the achromatic channel and another for the remaining chromatic channels. We modified an Inception V3 architecture to include one branch specific for achromatic data (L channel) and another branch specific for chromatic data (AB channels). This modification takes advantage of the decoupling of chromatic and achromatic information. Besides, splitting branches reduces the number of trainable parameters and computation load by up to 50% of the original figures in the modified layers. We achieved a state-of-the-art classification accuracy of 99.48% on the PlantVillage dataset. This thesis also shows that this two-branch method improves image classification reliability when the input images contain noise. Besides the first layers in a DCNN, in deeper layers of some recent DCNN architectures, more than 80% of the parameters come from standard pointwise convolutions. The parameter count in pointwise convolutions quickly grows due to the multiplication of the filters and input channels from the preceding layer. The second optimization method introduced in this thesis is making pointwise convolutions parameter-efficient via parallel branching to handle this growth. Each branch contains a group of filters and processes a fraction of the input channels. To avoid degrading the learning capability of DCNNs, we propose interleaving the filters’ output from separate branches at intermediate layers of successive pointwise convolutions. We tested our optimization on an EfficientNet-B0 as a baseline architecture and made classification tests on the CIFAR-10, Colorectal Cancer Histology, and Malaria datasets. For each dataset, our optimization saves 76%, 89%, and 91% of the number of trainable parameters of EfficientNet-B0, while keeping its test classification accuracy.
... Thus, Joao et al. [10] reduced the number of parameters of the pointwise convolutions in EfficientNet [11] by replacing them with grouped convolutions, in which each branch of the grouped convolution processes part of the input channels and the feature maps of the intermediate layer are mixed. Schwarz et al. [12] proposed a grouped pointwise convolution scheme to reduce the complexity of deep convolutional neural networks. ...
Article
Full-text available
In the lightweight convolutional neural network model, the pointwise convolutional structure accounts for most of the parameters and computation of the model. Therefore, improving the pointwise convolution structure is the best choice to optimize the lightweight model. Aiming at the problem that the pointwise convolutions in MobileNetV1 and MobileNetV2 consume too many computational resources, we designed the novel Ghost-PE and Ghost-PC blocks. First, in order to optimize the channel-expanding pointwise convolution, in which the number of input channels is smaller than the number of output channels, Ghost-PE makes full use of the feature maps generated by the main convolution of the Ghost module, and adds global average pooling and depthwise convolution operations to enhance the information of the feature maps generated through the cheap convolution. Second, in order to optimize the channel-compressing pointwise convolution, in which the number of input channels is larger than the number of output channels, Ghost-PC adjusts the Ghost-PE block to make full use of the features generated by the cheap convolution to enhance the feature channel information. Finally, we optimized the MobileNetV1 and MobileNetV2 models with the Ghost-PC and Ghost-PE blocks, and then tested them on the Food-101, CIFAR and Mini-ImageNet datasets. Compared with other methods, the experimental results show that Ghost-PE and Ghost-PC still maintain relatively high accuracy while using a small number of parameters.
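For background, the Ghost-PE and Ghost-PC blocks above build on the standard Ghost module, in which a primary convolution produces a few intrinsic feature maps and a cheap depthwise convolution generates the remaining "ghost" maps. The sketch below shows only that baseline pattern; the pooling and depthwise refinements specific to Ghost-PE and Ghost-PC are not reproduced here:

import torch
import torch.nn as nn

class GhostModule(nn.Module):
    """Baseline Ghost module: primary 1x1 convolution plus a cheap depthwise 3x3."""
    def __init__(self, in_channels, out_channels, ratio=2):
        super().__init__()
        primary = out_channels // ratio
        cheap = out_channels - primary
        self.primary_conv = nn.Sequential(
            nn.Conv2d(in_channels, primary, kernel_size=1, bias=False),
            nn.BatchNorm2d(primary), nn.ReLU(inplace=True))
        # Depthwise 3x3 convolution: one cheap filter group per intrinsic feature map.
        self.cheap_conv = nn.Sequential(
            nn.Conv2d(primary, cheap, kernel_size=3, padding=1,
                      groups=primary, bias=False),
            nn.BatchNorm2d(cheap), nn.ReLU(inplace=True))

    def forward(self, x):
        intrinsic = self.primary_conv(x)
        ghost = self.cheap_conv(intrinsic)
        return torch.cat([intrinsic, ghost], dim=1)

print(GhostModule(16, 32)(torch.randn(1, 16, 28, 28)).shape)  # (1, 32, 28, 28)

Roughly half of the output channels are produced by the cheap depthwise operation, which is the source of the parameter savings that the Ghost-PE and Ghost-PC variants further refine.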
Article
Full-text available
Recently, deep neural networks have achieved remarkable results in computer vision tasks with the widely used visual attention mechanism. However, the introduction of the visual attention mechanism increases the parameters and computational complexity, which limits its application in resource-constrained environments. To solve this problem, we propose a novel convolutional block, the ParaLk block (PLB), a large-kernel parallel convolutional block. Additionally, we apply the PLB to PreActResNet by replacing the first 2D convolution to capture feature maps at different scales, and call this new network ParaLkResNet. In practice, the effective receptive field of a convolutional network is smaller than its theoretical receptive field. Therefore, the PLB is used to increase the receptive field of the network. Besides extracting multi-scale, highly fused features beyond those of a normal 2D convolution, it has low latency in typical downstream tasks and good scalability to different data. It is worth noting that, as a plug-in block, the PLB can be applied to various computer vision tasks, not limited to image classification. The proposed method outperforms most current classification networks in image classification. The accuracy on the CIFAR-10 dataset is improved by 2.42% and 0.66% compared to OTTT and IM-Loss, respectively. Our source code is available at: https://doi.org/10.5281/zenodo.11204902.
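The following is a rough, generic sketch of a large-kernel parallel convolution block of the kind described above; the specific kernel sizes, the two-branch layout, and the summation of branch outputs are assumptions for illustration, and the actual PLB design may differ:

import torch
import torch.nn as nn

class ParallelLargeKernelBlock(nn.Module):
    """A large-kernel branch and a small-kernel branch run in parallel and are summed."""
    def __init__(self, channels, large_kernel=7, small_kernel=3):
        super().__init__()
        self.large = nn.Conv2d(channels, channels, large_kernel,
                               padding=large_kernel // 2, bias=False)
        self.small = nn.Conv2d(channels, channels, small_kernel,
                               padding=small_kernel // 2, bias=False)
        self.bn = nn.BatchNorm2d(channels)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        # Summing the parallel branches keeps the channel count unchanged,
        # so the block can replace a single convolution in an existing network.
        return self.act(self.bn(self.large(x) + self.small(x)))

print(ParallelLargeKernelBlock(64)(torch.randn(1, 64, 32, 32)).shape)  # (1, 64, 32, 32)

The large-kernel branch enlarges the effective receptive field while the small-kernel branch preserves fine local detail.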
Chapter
Medical imaging is an integral part of disease diagnosis and treatment. However, interpreting medical images can be time-consuming and subjective, making it challenging for healthcare professionals. Recent advances in deep learning show promising results in automating medical image classification and diagnosis. In this paper, we explore the application of texture-features, Central Difference Convolution (CDC) enhanced Convolutional Neural Networks (CNNs) and Compact Convolutional Transformers (CCT) to medical image classification and diagnosis. We compared the performance of existing architectures and our proposed Texture weighted Transformer (TwT) architecture. We evaluate each model’s performance and develop a robust architecture. Our results show that TwT outperforms other existing models in terms of Accuracy (ACC), Area Under the Receiver Operating Characteristic Curve (AUC) and other metrics. Our model combines texture-features with the advantages and performance of CDC-enhanced CNNs and the CCT architecture. Our proposed architecture gave an AUC of 0.9941 and an ACC of 96.84% on the Malaria dataset and an AUC of 0.9933 and an ACC of 92.75% on the BloodMNIST dataset while being compact (only about 6.3M parameters) and without any pre-training, and at the same time beating the AUC, ACC and other scores of other existing models proving that transfer learning is not always necessary. Our proposed architecture required less training time than most existing architectures, making it more practical for real-world applications. Our findings suggest that TwT can revolutionise medical image analysis by providing accurate and efficient diagnoses of diseases. The proposed architecture can be extended to other medical imaging tasks, including cancer detection, diabetic retinopathy and COVID-19 diagnosis. Thus, it can help healthcare professionals make accurate and timely diagnoses, improving patient outcomes.
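One ingredient named in the abstract above is Central Difference Convolution (CDC). The sketch below follows the CDC formulation commonly used in the literature, where a central-difference term is subtracted from the vanilla convolution; the value of theta and its placement inside TwT are assumptions, not details taken from the paper:

import torch
import torch.nn as nn
import torch.nn.functional as F

class CentralDifferenceConv2d(nn.Module):
    """Vanilla convolution minus theta times a central-difference term."""
    def __init__(self, in_channels, out_channels, kernel_size=3, theta=0.7):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size,
                              padding=kernel_size // 2, bias=False)
        self.theta = theta

    def forward(self, x):
        out = self.conv(x)                       # vanilla convolution term
        if self.theta == 0:
            return out
        # The central-difference term is equivalent to convolving x with the
        # kernel weights summed over their spatial extent (a 1x1 convolution).
        kernel_sum = self.conv.weight.sum(dim=(2, 3), keepdim=True)
        central = F.conv2d(x, kernel_sum, bias=None, stride=1, padding=0)
        return out - self.theta * central

print(CentralDifferenceConv2d(3, 16)(torch.randn(1, 3, 64, 64)).shape)  # (1, 16, 64, 64)

The subtraction emphasizes local intensity differences, which is why CDC layers are often used to strengthen texture-sensitive features.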
Preprint
Full-text available
In this paper, we introduce the spatial bias to learn global knowledge without self-attention in convolutional neural networks. Owing to the limited receptive field, conventional convolutional neural networks struggle to learn long-range dependencies. Non-local neural networks have been used to learn global knowledge, but unavoidably have too heavy a network design due to the self-attention operation. Therefore, we propose a fast and lightweight spatial bias that efficiently encodes global knowledge without self-attention on convolutional neural networks. The spatial bias is stacked on the feature map and convolved together with it to adjust the spatial structure of the convolutional features. Therefore, we learn the global knowledge on the convolution layer directly with very few additional resources. Our method is very fast and lightweight due to the attention-free non-local method while improving the performance of neural networks considerably. Compared to non-local neural networks, the spatial bias uses about 10 times fewer parameters while achieving comparable performance with 1.6~3.3 times more throughput at a very small budget. Furthermore, the spatial bias can be used with conventional non-local neural networks to further improve the performance of the backbone model. We show that the spatial bias achieves competitive performance, improving the classification accuracy by +0.79% and +1.5% on the ImageNet-1K and CIFAR-100 datasets. Additionally, we validate our method on the MS-COCO and ADE20K datasets for downstream tasks involving object detection and semantic segmentation.
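As a rough sketch only (the paper's exact design is not reproduced here, and the bias resolution, channel count, and bilinear resizing are assumptions), the idea of stacking a learnable spatial bias onto the feature map and convolving them together can be illustrated as follows:

import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialBiasConv(nn.Module):
    """Concatenate a learnable low-resolution spatial bias to the features, then convolve."""
    def __init__(self, in_channels, out_channels, bias_channels=2, bias_size=8):
        super().__init__()
        # Learnable low-resolution bias, shared across the batch (an assumption).
        self.spatial_bias = nn.Parameter(torch.zeros(1, bias_channels, bias_size, bias_size))
        self.conv = nn.Conv2d(in_channels + bias_channels, out_channels,
                              kernel_size=3, padding=1, bias=False)

    def forward(self, x):
        n, _, h, w = x.shape
        bias = F.interpolate(self.spatial_bias, size=(h, w),
                             mode='bilinear', align_corners=False)
        bias = bias.expand(n, -1, -1, -1)        # broadcast the bias over the batch
        return self.conv(torch.cat([x, bias], dim=1))

print(SpatialBiasConv(32, 64)(torch.randn(2, 32, 28, 28)).shape)  # (2, 64, 28, 28)

The extra bias channels let the convolution condition on absolute spatial position, which is one attention-free way to inject global context.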
Article
Full-text available
In Deep Convolutional Neural Networks (DCNNs), the parameter count in pointwise convolutions quickly grows due to the multiplication of the filters and input channels from the preceding layer. To handle this growth, we propose a new technique that makes pointwise convolutions parameter-efficient via employing parallel branching, where each branch contains a group of filters and processes a fraction of the input channels. To avoid degrading the learning capability of DCNNs, we propose interleaving the filters' output from separate branches at intermediate layers of successive pointwise convolutions. To demonstrate the efficacy of the proposed technique, we apply it to various state-of-the-art DCNNs, namely EfficientNet, DenseNet-BC L100, MobileNet and MobileNet V3 Large. The performance of these DCNNs with and without the proposed method is compared on CIFAR-10, CIFAR-100, Cropped-PlantDoc and Oxford-IIIT Pet datasets. The experimental results demonstrated that DCNNs with the proposed technique, when trained from scratch, obtained similar test accuracies to the original EfficientNet and MobileNet V3 Large architectures while saving up to 90% of the parameters and 63% of the floating-point computations.
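A minimal sketch of the general idea in the abstract above, two consecutive grouped 1x1 convolutions with an interleaving step in between so that each group in the second layer sees channels produced by every group in the first layer, is shown below. The group count and the final summation of the two layers' outputs are illustrative choices, not the authors' exact subnetwork:

import torch
import torch.nn as nn

def interleave_channels(x, groups):
    # Reorder channels group-wise (a channel shuffle) so information crosses groups.
    n, c, h, w = x.shape
    return x.view(n, groups, c // groups, h, w).transpose(1, 2).reshape(n, c, h, w)

class GroupedPointwiseBlock(nn.Module):
    """Two grouped pointwise convolutions separated by a channel interleaving step."""
    def __init__(self, in_channels, out_channels, groups=4):
        super().__init__()
        self.groups = groups
        self.conv_k = nn.Conv2d(in_channels, out_channels, 1, groups=groups, bias=False)
        self.conv_l = nn.Conv2d(out_channels, out_channels, 1, groups=groups, bias=False)

    def forward(self, x):
        k = self.conv_k(x)
        l = self.conv_l(interleave_channels(k, self.groups))
        return k + l   # one possible way to combine the two layers' outputs

print(GroupedPointwiseBlock(64, 128, groups=4)(torch.randn(1, 64, 14, 14)).shape)  # (1, 128, 14, 14)

Each grouped 1x1 convolution uses roughly 1/groups of the parameters of a standard pointwise convolution, while the interleaving step prevents the groups from becoming isolated subnetworks.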
Chapter
Full-text available
EfficientNet is a recent Deep Convolutional Neural Network (DCNN) architecture intended to be proportionally extendible in depth, width and resolution. Through its variants, it can achieve state-of-the-art accuracy on the ImageNet classification task as well as on other classical challenges. Although its name refers to its efficiency with respect to the ratio between outcome (accuracy) and needed resources (number of parameters, FLOPs), we are studying a method to reduce the original number of trainable parameters by more than 84% while keeping a very similar degree of accuracy. Our proposal is to improve the pointwise (1x1) convolutions, whose number of parameters rapidly grows due to the multiplication of the number of filters by the number of input channels that come from the previous layer. Basically, our tweak consists of grouping filters into parallel branches, where each branch processes a fraction of the input channels. However, by doing so, the learning capability of the DCNN is degraded. To avoid this effect, we suggest interleaving the output of filters from different branches at intermediate layers of consecutive pointwise convolutions. Our experiments with the CIFAR-10 dataset show that our optimized EfficientNet has similar learning capacity to the original layout when training from scratch.
Article
Full-text available
Deep convolutional neural networks have performed remarkably well on many Computer Vision tasks. However, these networks are heavily reliant on big data to avoid overfitting. Overfitting refers to the phenomenon when a network learns a function with very high variance such as to perfectly model the training data. Unfortunately, many application domains do not have access to big data, such as medical image analysis. This survey focuses on Data Augmentation, a data-space solution to the problem of limited data. Data Augmentation encompasses a suite of techniques that enhance the size and quality of training datasets such that better Deep Learning models can be built using them. The image augmentation algorithms discussed in this survey include geometric transformations, color space augmentations, kernel filters, mixing images, random erasing, feature space augmentation, adversarial training, generative adversarial networks, neural style transfer, and meta-learning. The application of augmentation methods based on GANs is heavily covered in this survey. In addition to augmentation techniques, this paper will briefly discuss other characteristics of Data Augmentation such as test-time augmentation, resolution impact, final dataset size, and curriculum learning. This survey will present existing methods for Data Augmentation, promising developments, and meta-level decisions for implementing Data Augmentation. Readers will understand how Data Augmentation can improve the performance of their models and expand limited datasets to take advantage of the capabilities of big data.
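As a simple illustration (not taken from the survey itself), a few of the augmentation families it lists, geometric transformations, color-space augmentation, and random erasing, can be combined in a standard torchvision pipeline; the specific parameters below are arbitrary examples:

import torch
from torchvision import transforms
from PIL import Image

train_transform = transforms.Compose([
    transforms.RandomHorizontalFlip(),                      # geometric transformation
    transforms.RandomCrop(32, padding=4),                   # geometric transformation
    transforms.ColorJitter(brightness=0.2, contrast=0.2),   # color-space augmentation
    transforms.ToTensor(),
    transforms.RandomErasing(p=0.5),                        # random erasing on the tensor
])

img = Image.new("RGB", (32, 32))        # placeholder image for demonstration
augmented = train_transform(img)        # tensor of shape (3, 32, 32)
print(augmented.shape)

Each transform is applied with some randomness at every epoch, effectively enlarging the training set without collecting new images.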
Article
Full-text available
Deep learning has achieved state-of-the-art accuracy on many computer vision tasks. However, convolutional neural networks are difficult to deploy on resource-constrained devices due to their limited computation power and memory space. Thus, it is necessary to prune redundant weights and filters rationally and effectively. Considering that redundancy still exists in the pruned model after weight pruning or filter pruning alone, a method combining weight pruning and filter pruning is proposed. First, filter pruning is performed, removing the least important filters and using fine-tuning to recover the model's accuracy. Then, all connection weights below a threshold are set to zero. Finally, the pruned model obtained by the first two steps is fine-tuned to recover its predictive accuracy. Experiments on the MNIST and CIFAR-10 datasets demonstrate that the proposed approach is effective and feasible. Compared with weight pruning or filter pruning alone, the mixed pruning achieves a higher compression ratio of the model parameters. For LeNet-5, the proposed approach achieves a compression rate of 13.01× with a 1% drop in accuracy. For VGG-16, it achieves a compression rate of 19.20×, incurring a 1.56% accuracy loss.
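A simplified sketch of the two-stage idea described above is given below. It uses masking instead of physically removing filters, the L1-norm importance criterion and the magnitude threshold are illustrative choices, and the fine-tuning between and after the stages is omitted:

import torch
import torch.nn as nn

def prune_filters_l1(conv: nn.Conv2d, keep_ratio: float = 0.7):
    """Zero out the filters with the smallest L1 norms (mask-based filter pruning)."""
    with torch.no_grad():
        importance = conv.weight.abs().sum(dim=(1, 2, 3))      # L1 norm per output filter
        n_keep = max(1, int(keep_ratio * importance.numel()))
        keep = importance.topk(n_keep).indices
        mask = torch.zeros_like(importance)
        mask[keep] = 1.0
        conv.weight.mul_(mask.view(-1, 1, 1, 1))               # zero the pruned filters

def prune_weights_by_threshold(model: nn.Module, threshold: float = 1e-2):
    """Zero out all remaining weights whose magnitude falls below the threshold."""
    with torch.no_grad():
        for p in model.parameters():
            p.mul_((p.abs() >= threshold).float())

conv = nn.Conv2d(16, 32, 3)
prune_filters_l1(conv, keep_ratio=0.5)
prune_weights_by_threshold(conv, threshold=1e-2)
print((conv.weight == 0).float().mean())   # fraction of zeroed weights

In practice the filter-pruning step physically removes channels to shrink the network, and each stage is followed by fine-tuning to recover accuracy, as the abstract describes.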
Article
Conventional geological modelling methods cannot provide a precise and comprehensive model of subsurface structures when dealing with insufficient data. Knowledge-based methods employing rule-based techniques have found wide application in geoscience studies. These methods are applicable to petroleum reservoir geological modelling and characterization, specifically for geologically complex structures. In this study, we present a knowledge-based seismic acoustic impedance inversion method that employs a rule-based method for porosity estimation. The backpropagation algorithm and a fuzzy neural network are also used in the methodology for parameter optimization and for defining the nonlinear relationship between seismic attributes and the porosity of the reservoir rock. The methodology starts with seismic acoustic impedance inversion, followed by conventional porosity estimation. Subsequently, a knowledge base was designed by investigating more than 24 published case studies. This knowledge base was used to define the rules, optimize the number of rules, and improve the efficiency of the inference engine. The porosity model obtained by the conventional method in the previous step was used for the primary evaluation of the rules. The extracted rules and the optimized rule set were then used for rule-based porosity estimation. The methodology was applied to a petroleum field containing two heterogeneous reservoir formations. The results of the proposed approach were evaluated against core analysis, thin sections and drilling data. The consistency of the results obtained by the proposed method with the geological data demonstrates its capability to resolve the problem of insufficient data in reservoir geological modelling.