AutoFCL: Automatically Tuning Fully Connected Layers for Transfer Learning

Authors:
  • RV University Bangalore
  • IIIT Sri City Chittoor

Abstract

Deep Convolutional Neural Networks (CNN) have evolved as popular machine learning models for image classification during the past few years, due to their ability to learn the problem-specific features directly from the input images. The success of deep learning models solicits architecture engineering rather than hand-engineering the features. However, designing state-of-the-art CNN for a given task remains a non-trivial and challenging task. While transferring the learned knowledge from one task to another, fine-tuning with the target-dependent fully connected layers produces better results over the target task. In this paper, the proposed AutoFCL model attempts to learn the structure of Fully Connected (FC) layers of a CNN automatically using Bayesian optimization. To evaluate the performance of the proposed AutoFCL, we utilize five popular CNN models such as VGG-16, ResNet, DenseNet, MobileNet, and NASNetMobile. The experiments are conducted on three benchmark datasets, namely CalTech-101, Oxford-102 Flowers, and UC Merced Land Use datasets. Fine-tuning the newly learned (target-dependent) FC layers leads to state-of-the-art performance, according to the experiments carried out in this research. The proposed AutoFCL method outperforms the existing methods over CalTech-101 and Oxford-102 Flowers datasets by achieving the accuracy of 94.38% and 98.89%, respectively. However, our method achieves comparable performance on the UC Merced Land Use dataset with 96.83% accuracy.
S. H. Shabbeer Basha, Sravan Kumar Vinakota, Shiv Ram Dubey, Viswanath Pulabaigari, Snehasis Mukherjee
Indian Institute of Information Technology Sri City, India
I. INTRODUCTION
Deep Convolutional Neural Networks (CNN) based fea-
tures have outperformed the hand-designed features in most
of the computer vision problems such as object recognition
[1], [2], speech recognition [3], medical applications [4], and
many more. Although several complicated research problems
have been solved by deep learning models, generally, the per-
formance of these models relies on hard-to-tune hyperparam-
eters. Finding the best configuration for the hyperparameters
such as the number of layers, convolution filter dimensions,
number of filters in a convolution layer, and many more to
build a CNN architecture suitable for a given task, is the most
demanding research topic in the area of Automated Machine
Learning (AutoML) [5], [6]. Based on the previous studies
reported in the literature, learning a suitable architecture for
a given task is termed as Neural Architecture Search (NAS)
[7]. Reinforcement Learning (RL) methods have been widely
employed to find a suitable CNN architecture for a given task
[8]. However, these methods focus on finding the structure
of a CNN from scratch, which requires hundreds of GPU hours.
S.H.S. Basha, S.K. Vinakota, S.R. Dubey, Viswanath P., S.
Mukherjee are with Computer Vision and Machine Learning
Groups, Indian Institute of Information Technology, Sri City,
Andhra Pradesh - 517646, India. email: {shabbeer.sh,
sravankumar.v17, srdubey, viswanath.p,
snehasis.mukherjee}@iiits.in
Fig. 1. While transferring the learned knowledge from source task to target
task, learning the optimal structure of FC layers with the knowledge of target
dataset and fine-tuning the learned FC layers leads to better performance.
We propose a method called AutoFCL to automatically tune
the structure of the Fully Connected (FC) layers with respect
to the target dataset while transferring the knowledge from
source task to the target task.
Typically, every CNN contains one or more FC layers
based on the depth of the architecture [9]. For instance,
the popular CNN models proposed to train over large-scale
ImageNet dataset [10] have the following number of FC
layers.
  • AlexNet [1], ZFNet [11], and VGGNet [12] have 3
    dense (FC) layers. Note that these models contain 5,
    5, and 13 convolution layers, respectively.
  • GoogLeNet [2], ResNet [13], DenseNet [14], NASNet
    [5], and other modern deep neural networks have a
    single FC layer which is responsible for generating the
    class scores.
The CNN models introduced in the initial years (from 2012
to 2014) have a huge number of trainable parameters in FC
layers, whereas the recent models are generally deeper and
have a single FC layer, which is responsible for generating
the class scores. The state-of-the-art CNN architectures
proposed for the ImageNet dataset are shown in Table I. This
table summarizes the total number of trainable parameters and
the trainable parameters corresponding to FC layers. It is
evident from Table I that as the depth of the CNN increases,
both the number of dense layers and the parameters in dense
layers gradually decrease.
arXiv:2001.11951v2 [cs.CV] 5 Feb 2020
A large number of parameters involved in the FC layers
of a CNN increases the possibility of overfitting. Xu et al.
[15] showed that removing the connections among FC layers
with small weight magnitudes (SparseConnect) leads to better
performance. Basha et al. [9] performed a study to observe
the necessity of FC layers given the depth and width of
both datasets and CNN architectures. To find the best set
of hyperparameters of an Artificial Neural Network (ANN),
Mendoza et al. [16] proposed an automated mechanism
to tune the ANN using Sequential Model-based Algorithm
Configuration (SMAC).
CNNs are used in a wide range of applications in recent
years. However, their performance is poor if the amount of
training data is very limited. Transfer Learning is a way to
reduce the need for more training data and huge computa-
tional resources by reusing the knowledge learned from the
source task. A common approach for classifying such limited
images is re-using the pre-trained models to fine-tune over
other datasets [19]. However, while transferring the learned
knowledge from one task to another, fine-tuning the original
FC layers’ structure may not perform well over the target
dataset because the FC layers are designed for the source
task.
Fig. 1 illustrates the motivation behind learning the target-
dependent fully connected layer’s structure to obtain better
performance over the target task. While transferring the
learned knowledge from source task to target task, the
efficacy (capacity) of the CNN increases for the target task,
which may result in overfitting. The extracted features from
convolutional layers (shown in the left side) are mapped
into more linearly separable feature space (shown on the
right side) by FC layers. Moreover, we believe that learning
the FC layers’ structure with the knowledge from the target
dataset may lead to better linearly separable feature space
which results in better performance over the target dataset. In
this work, we propose a novel framework for automatically
learning the target-dependent fully connected layers structure
in the context of transfer learning. We use Bayesian opti-
mization [20] for optimizing the hyperparameters involved
in forming the FC layers while transferring the knowledge
from one task to another.
II. RELATED WORKS
Due to the dense connectivity among the FC layers,
the deep CNNs contain an enormous amount of trainable
parameters. For example, the first ImageNet Large Scale
Visual Recognition Competition (ILSVRC)-2012 [21] win-
ning CNN model called AlexNet [1] contains a total of
60 million trainable parameters, among which 58 million
parameters belong to the FC layers. Likewise, VGG-16
[12], a 16 layer deep CNN comprises 138 million trainable
parameters, among which 123 million parameters correspond
to FC layers. In practice, the over-parameterization leads to
overfitting the CNN. Xu et al. [15] proposed SparseConnect
model to reduce the overfitting by removing the connections
with smaller weight values.
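The FC parameter count quoted above for VGG-16 can be checked with a few lines of arithmetic. The sketch below assumes the standard VGG-16 head for 224 × 224 inputs (a 7 × 7 × 512 feature map followed by FC-4096, FC-4096, and FC-1000); the helper name `dense_params` is illustrative and not part of the paper.

```python
def dense_params(n_in: int, n_out: int) -> int:
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out


# Assumed VGG-16 head for 224x224 inputs: the last pooling layer yields a
# 7 x 7 x 512 feature map, followed by FC-4096, FC-4096 and FC-1000.
vgg16_fc_sizes = [7 * 7 * 512, 4096, 4096, 1000]

# Sum the parameters of each consecutive (input, output) pair of sizes.
fc_total = sum(
    dense_params(n_in, n_out)
    for n_in, n_out in zip(vgg16_fc_sizes, vgg16_fc_sizes[1:])
)
print(f"{fc_total:,} parameters (~{fc_total / 1e6:.1f} M)")
```

Under these assumptions the count comes to about 123.6 million, in line with the figure reported for VGG-16 in Table I.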
Transfer learning is a widely adopted technique to obtain a
reasonable performance with limited data and less computa-
tional resources. Li et al. [22] analyzed various approaches
for transferring the knowledge learned in different scenar-
ios. Fine-tuning the deep CNNs with limited training data
often leads to overfitting the CNN model [23]. Han et al.
[24] introduced a two-phase strategy by combining transfer
learning with web data augmentation to reduce the amount
of over-fitting. They also tuned the hyperparameters such as
learning rate, type of optimizer (Adagrad [25], Adam [26],
etc.) and many more using Bayesian Optimization.
Mendoza et al. [16] proposed Auto-Net, which automat-
ically tunes an artificial neural network without any human
intervention. To learn a distinct set of hyperparameters auto-
matically, they used the Sequential Model-based Algorithm
Configuration (SMAC). The hyperparameters such as the
number of FC layers, number of neurons in each FC layer,
batch size, learning rate, and so on are tuned automatically.
Motivated by this work, we propose a framework to automat-
ically learn the structure of FC layers concerning the target
dataset for better transfer learning.
Many researchers have employed Bayesian Optimization
[20] to learn the entire CNN architecture automatically.
Wistuba et al. [27] combined Bayesian Optimization with
Incremental Evaluation to find the optimal neural network
architecture. However, they limited the depth of the CNN to
5 layers due to the limited computational resources. Jin et
al. [28] proposed a network morphism mechanism for neural
architecture search using Bayesian Optimization. Liu et al.
[6] proposed a method to build the CNN architecture pro-
gressively using the Sequential Model-Based Optimization
(SMBO) based algorithm. However, these methods require a
considerable amount of computational resources and search
time. Recently, Gupta et al. [29] employed Bayesian Opti-
mization to conduct a study for efficient transfer optimiza-
tion.
Transfer Learning allows pre-trained networks to adapt
to new tasks [30]. Many researchers have utilized the
advantage of transfer learning for various applications [19],
[31]. Ji et al. [28] proposed a framework called Double
Reweighting Multi-source Transfer Learning (DRMTL) to
utilize the decision knowledge from multiple sources to
perform well over the target domain. Generally, after adap-
tation, the efficacy (capacity) of the CNN increases for the
target task. Molchanov et al. [32] proposed a framework for
iteratively pruning the parameters of a CNN to reduce its
capacity for the target task. To the best of our knowledge,
no effort has been made in the literature to learn the structure
of FC layers automatically for better transfer learning. Neural
Architecture Search algorithms consume thousands of GPU
hours [5] to find better performing architectures. So, we made
this attempt in the context of transfer learning to reduce the
architecture search time.
Basha et al. [9] analyzed the necessity of FC layers based
on the depth of a CNN. However, to conduct this study they
performed experiments by adding new FC layers manually
before the output FC layer. Moreover, the hyperparameters
TABLE I
THE STATE-OF-THE-ART DEEP NEURAL NETWORKS PROPOSED FOR THE IMAGENET DATASET. THE TOTAL NUMBER OF TRAINABLE PARAMETERS AND
THE NUMBER OF PARAMETERS BELONGING TO FC LAYERS ARE SHOWN.

S.No.  CNN Model          Total #trainable parameters  #parameters in FC layers
                          (in Millions)                (in Millions)
1      AlexNet [1]        60 M                         58 M
2      ZFNet [11]         62.3 M                       58.6 M
3      VGG16 [12]         138.3 M                      123.6 M
4      VGG19 [12]         143.6 M                      123.6 M
5      InceptionV3 [17]   23.8 M                       2 M
6      ResNet50 [13]      25.5 M                       2 M
7      MobileNet [18]     4 M                          1 M
8      DenseNet201 [14]   20 M                         1.9 M
9      NASNetLarge [5]    88 M                         4 M
10     NASNetMobile [5]   5 M                          1 M

involved in FC layers like the number of neurons in every
FC layer, the dropout factor, type of activation, and so on
are chosen manually. In this paper, we attempt to learn the
target-dependent FC layers’ structure automatically for better
transfer learning.
In brief, our contributions in this work are as follows:
  • We propose a novel method to automatically learn the
    target-dependent FC layers structure using Bayesian
    Optimization.
  • By conducting experiments on three benchmark
    datasets, we discover the suitable (target-dependent) FC
    layers structure specific to the datasets.
  • The performance of the proposed method is also
    compared with non-transfer learning and traditional transfer
    learning-based methods.
  • To compare the results obtained using Bayesian
    Optimization, we employed the random search to find the
    best set of hyperparameters involved in FC layers.
III. PROPOSED AUTOFCL MODEL
We formulate the task of learning the structure of fully
connected layers as a black-box optimization problem. Let
$f$ be an objective function; the goal is to find $x^*$,
which is represented as

$$x^* = \operatorname*{argmax}_{x \in H} f(x) \tag{1}$$

where $x \in \mathbb{R}^d$ is the input, usually $d \leq 20$ [20], $H$ is
the hyperparameter space as depicted in Table II, and $f$
is a continuous function. Finding the value of function $f$
at $x$ requires training (fine-tuning) the learned FC layers
(explored during the architecture search) of CNN ($B$) on
training data (TrainData) and evaluating its performance on
the held-out (validation) data (ValData).
$x^*$ is a CNN with an optimal FC layer structure
learned using Bayesian Optimization for efficient
transfer learning. Therefore, the CNN architecture $x^*$ is
responsible for maximizing the performance on ValData.
The proposed AutoFCL method is outlined in Algorithm
1. Given the base CNN model ($B$), the hyperparameter search
space (Param_space), TrainData, ValData, and the number
of epochs ($E$) to train each proxy CNN as input, the
proposed method learns the most suitable structure of FC
(dense) layers using Bayesian Optimization [20].
Bayesian Optimization (Bayes Opt) is the most popular
method used for finding the best set of hyperparameters
involved in deep neural networks [33]. Bayes Opt builds a
surrogate model to approximate the objective function using
Gaussian Process (GP) regression [34]. Algorithm 1 observes
the value of $f$ without noise for the initial $n_0$ points, which
are chosen uniformly at random ($n_0$ is 20 in our experimental
settings). After observing the objective at the initial $n_0$ points,
we can infer the objective value at a new point $x_{new}$ using
Bayes rule [35] as follows,

$$f(x_{new}) \mid f(x_{1:n_0}) \sim \mathrm{Normal}\big(\mu_{n_0}(x_{new}),\, \sigma^2_{n_0}(x_{new})\big) \tag{2}$$

$\mu_{n_0}(x_{new})$ and $\sigma^2_{n_0}(x_{new})$ are computed as follows,

$$\mu_{n_0}(x_{new}) = P_0(x_{new}, x_{1:n_0})\, P_0(x_{1:n_0}, x_{1:n_0})^{-1} \big(f(x_{1:n_0}) - \mu_0(x_{1:n_0})\big) + \mu_0(x_{new}) \tag{3}$$

$$\sigma^2_{n_0}(x_{new}) = P_0(x_{new}, x_{new}) - P_0(x_{new}, x_{1:n_0})\, P_0(x_{1:n_0}, x_{1:n_0})^{-1} P_0(x_{1:n_0}, x_{new}) \tag{4}$$

The probability distribution given in Eq. 2 is called the
posterior probability distribution. In the above equations, $\mu_0$ and $P_0$
are the mean function and the covariance function, respectively.
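To make Eqs. 3 and 4 concrete, here is a minimal pure-Python sketch of the noise-free GP posterior at a single new point. It is an illustration under stated assumptions, not the authors' implementation: inputs are scalars rather than hyperparameter vectors, the covariance $P_0$ is assumed to be a squared-exponential kernel, the prior mean $\mu_0$ is a constant, and a small jitter is added to the diagonal for numerical stability. The names `rbf`, `solve`, and `gp_posterior` are hypothetical.

```python
import math


def rbf(a, b, length_scale=1.0):
    """Squared-exponential covariance P0(a, b) for scalars (an assumed kernel)."""
    return math.exp(-0.5 * ((a - b) / length_scale) ** 2)


def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    z = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        tail = sum(M[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (M[i][n] - tail) / M[i][i]
    return z


def gp_posterior(x_obs, f_obs, x_new, mu0=0.0, jitter=1e-9):
    """Posterior mean and variance at x_new, following Eqs. 3 and 4."""
    K = [[rbf(a, b) + (jitter if i == j else 0.0)
          for j, b in enumerate(x_obs)]
         for i, a in enumerate(x_obs)]
    k_new = [rbf(x_new, a) for a in x_obs]
    # Eq. 3: P0(x_new, X) P0(X, X)^{-1} (f(X) - mu0(X)) + mu0(x_new)
    alpha = solve(K, [f - mu0 for f in f_obs])
    mean = sum(k * a for k, a in zip(k_new, alpha)) + mu0
    # Eq. 4: P0(x_new, x_new) - P0(x_new, X) P0(X, X)^{-1} P0(X, x_new)
    v = solve(K, k_new)
    var = rbf(x_new, x_new) - sum(k * w for k, w in zip(k_new, v))
    return mean, max(var, 0.0)  # clamp tiny negative round-off


# Toy observations (hypothetical validation accuracies at three points).
x_obs, f_obs = [0.0, 1.0, 2.5], [0.80, 0.90, 0.85]
print(gp_posterior(x_obs, f_obs, 1.0))   # at an already observed point
print(gp_posterior(x_obs, f_obs, 1.7))   # between observations
```

At an observed point the posterior mean reproduces the observation and the variance collapses toward zero, while far from all observations the posterior reverts to the prior mean and variance.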
The optimal configuration of FC layers is one among
the previously evaluated points (initial $n_0$ points) with the
maximum $f$ value ($f(x^+)$). Now suppose we evaluate
$f$ at a new point $x_{new}$ and observe $f(x_{new})$.
After evaluating the value of $f$ at iteration $n_0 + 1$, the optimal
$f$ value will be either $f(x_{new})$ (if $f(x_{new}) \geq f(x^+)$) or
$f(x^+)$ (if $f(x^+) \geq f(x_{new})$). The improvement or gain in
the objective $f$ is $f(x_{new}) - f(x^+)$ if its value is positive,
or 0 otherwise.
However, the value of $f(x_{new})$ is unknown until it is observed
at $x_{new}$, which is typically expensive. Instead of
evaluating $f$ at $x_{new}$, we can compute the Expected
Improvement (EI) and choose the $x_{new}$ that maximizes the value
of EI. Expected Improvement [36] is the most commonly
used acquisition function for guiding the search process by
proposing the next point to sample.
For a specified input $x_{new}$, EI can be represented as,

$$EI(x_{new}) = \mathbb{E}\big[\max\big(f(x_{new}) - f(x^+),\, 0\big)\big] \tag{5}$$

Algorithm 1 AutoFCL: A Bayesian search method for automatically learning the structure of FC layers
Inputs: B (base model), Param_space, TrainData, ValData, E (num_epochs).
Output: A CNN with target-dependent FC layers structure.
procedure AUTOFCL
    Place a Gaussian Process (GP) prior on the objective f
    while t ← 1, 2, ..., n0 do                    ▷ Observe the value of f at the initial n0 points
        M_t ← build_CNN(B, Param_space)           ▷ sample the initial CNN randomly
        T_t ← Train_CNN(M_t, TrainData, E)
        V_t ← Validate_CNN(T_t, ValData)
    n = n0
    while t ← n + 1, ..., N do
        Update the posterior distribution on f using the prior    ▷ Using Eq. 2
        Choose the next sample x_t that maximizes the acquisition function value
        Observe y_t = f(x_t)
    return x_t                                    ▷ return a point with the best FC layer structure

where $f(x^+)$ is the maximum validation accuracy obtained
so far and $x^+$ is the FC layer structure for which the best
validation accuracy is obtained. Formally, $x^+$ can be represented
as,

$$x^+ = \operatorname*{argmax}_{x_i \in x_{1:n_0}} f(x_i) \tag{6}$$

The Expected Improvement utilizes the information about the models
that were already explored and finds the next point that maximizes the
expected improvement. After observing the objective at each
point, we update the posterior distribution using Eq. 2.
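When the posterior at a candidate point is Gaussian, as in Eq. 2, the expectation in Eq. 5 has a well-known closed form: with $z = (\mu - f(x^+)) / \sigma$, EI equals $(\mu - f(x^+))\,\Phi(z) + \sigma\,\phi(z)$. The following sketch evaluates it and picks the candidate with the largest EI. The candidate names and their posterior summaries are hypothetical values chosen for illustration only.

```python
import math


def expected_improvement(mean, var, f_best):
    """Closed-form E[max(f(x_new) - f(x+), 0)] for f(x_new) ~ Normal(mean, var)."""
    sigma = math.sqrt(var)
    if sigma == 0.0:
        # No uncertainty left: the improvement is deterministic.
        return max(mean - f_best, 0.0)
    z = (mean - f_best) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return (mean - f_best) * cdf + sigma * pdf


# Hypothetical posterior (mean, variance) for three candidate FC structures.
candidates = {"A": (0.91, 0.0004), "B": (0.89, 0.0100), "C": (0.92, 0.0)}
f_best = 0.92  # best validation accuracy observed so far, f(x+)

scores = {
    name: expected_improvement(mean, var, f_best)
    for name, (mean, var) in candidates.items()
}
next_point = max(scores, key=scores.get)
print(scores, next_point)
```

In this toy setting the candidate with the lowest posterior mean but the largest variance obtains the highest EI, which illustrates how the acquisition function trades off exploitation against exploration.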
IV. HYPERPARAMETER SEARCH SPACE
This section provides a detailed discussion about the
search space used for finding the target-dependent FC layer’s
structure for efficient transfer learning. A single fully
connected layer of a CNN involves various hyperparameters,
such as the number of neurons and the dropout rate.
The proposed AutoFCL aims to learn the
suitable structure for the FC layers, which includes the
number of FC layers, dropout rate, type of activation, and
the number of neurons in each FC layer to obtain the better
performance over the target dataset. Table II shows the
hyperparameter search space considered in our experimental
settings.
Most of the CNN architectures available in the literature
have a maximum of 3 FC layers [1], [12], including
the output layer. Therefore, we consider the search space
for the number of FC layers in the range [1, 3] (i.e., 1, 2,
and 3). The other important hyperparameter is the number
of neurons required in each FC layer, for which the proposed
method finds the best set of configuration within the range
[64,1024] in powers of 2 ({64, 128, 256, 512, 1024}).
Besides these hyperparameters, we consider activation func-
tion as another hyperparameter. Three popular non-linear
activations Sigmoid, Tanh, and ReLu are utilized for the
same. To reduce the over-fitting caused due to a large number
of trainable parameters in FC layers, dropout [1] is widely
adopted in deep learning. We consider dropout as another
hyperparameter to learn, the value of which is learned in the
range [0, 0.5] in steps of 0.1, i.e., the proposed AutoFCL
finds the suitable dropout factor within the values {0.0, 0.1,
0.2, 0.3, 0.4, 0.5}.
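The search space described in this section can be written down as a small data structure. The sketch below encodes the values listed in Table II, draws one candidate FC structure, and counts the number of distinct structures. The count assumes that every FC layer picks its activation, dropout rate, and width independently of the other layers, which the text does not state explicitly; the dictionary keys and function names are illustrative.

```python
import random

# Per-layer hyperparameter values from Table II; key names are illustrative.
SEARCH_SPACE = {
    "activation": ["ReLu", "Tanh", "Sigmoid"],
    "dropout": [0.0, 0.1, 0.2, 0.3, 0.4, 0.5],
    "neurons": [64, 128, 256, 512, 1024],
}
NUM_FC_LAYERS = [1, 2, 3]


def sample_fc_structure(rng):
    """Draw a layer count, then independent settings for each layer."""
    n_layers = rng.choice(NUM_FC_LAYERS)
    return [
        {name: rng.choice(values) for name, values in SEARCH_SPACE.items()}
        for _ in range(n_layers)
    ]


def search_space_size():
    """Count distinct structures, assuming layers are configured independently."""
    per_layer = 1
    for values in SEARCH_SPACE.values():
        per_layer *= len(values)
    return sum(per_layer ** k for k in NUM_FC_LAYERS)


rng = random.Random(0)
print(sample_fc_structure(rng))
print(search_space_size())  # 90 + 90**2 + 90**3 = 737190
```

Even this modest space is far too large to evaluate exhaustively when every evaluation requires fine-tuning a proxy CNN, which motivates a sample-efficient search strategy.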
V. EXPERIMENTAL SETTINGS
In this section, we describe the training details, the CNN
architectures utilized to learn the structure of FC layers, and the
datasets used to evaluate the performance of the developed
image classification models in the context of transfer
learning.
A. Training Details
Training Proxy CNNs: The CNN architectures generated
in the search process of Bayesian Optimization (also called
proxy CNNs) are trained using the AdaGrad optimizer [25]. The
initial value of the learning rate is set to 0.01 and its value
is reduced by a factor of 0.1 every 5 epochs if there is
no reduction in the validation loss. Since training the CNNs
is a time-consuming task, we train each proxy CNN for 20
epochs as in [6]. The parameters (weights) corresponding to
the FC layers are initialized using He normal initialization
[37].
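The learning-rate rule above can be read as a reduce-on-plateau schedule. The sketch below is one possible interpretation, assuming the rate is multiplied by 0.1 once 5 consecutive epochs pass without a new best validation loss; the function name and the loss values are hypothetical, and the authors' exact rule may differ.

```python
def plateau_schedule(val_losses, initial_lr=0.01, factor=0.1, patience=5):
    """Return the learning rate used in each epoch.

    The rate is multiplied by `factor` once `patience` consecutive epochs
    pass without a new best validation loss (one reading of the text).
    """
    lr, best, stale, rates = initial_lr, float("inf"), 0, []
    for loss in val_losses:
        rates.append(lr)          # rate in effect during this epoch
        if loss < best:
            best, stale = loss, 0
        else:
            stale += 1
            if stale >= patience:
                lr, stale = lr * factor, 0
    return rates


# Hypothetical validation losses: improvement stalls after the third epoch.
losses = [1.0, 0.8, 0.7] + [0.7] * 7
print(plateau_schedule(losses))
```

With these made-up losses, the rate stays at its initial value while the loss improves and through the five stalled epochs that follow, and drops by the chosen factor afterwards.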
B. CNN Architectures used for Fine-Tuning
To learn the target-dependent FC layers structure auto-
matically, we use two kinds of CNN architectures which
include i) chain structured (plain) CNNs like VGG-16 [12]
and ii) CNNs involving skip connections like ResNet [13],
DenseNet [14], and many more.
We conduct the experiments using the popular CNN
models that are trained over ImageNet dataset such as VGG-
16 [12], ResNet [13], DenseNet [14], MobileNet [18], and
NASNet-Mobile [5]. In this article, we are interested in
finding the optimal structure of fully connected layers for
efficient transfer learning. To achieve this objective, the
parameters (weights) involved in convolution layers of the
above CNNs trained over ImageNet dataset [10] are frozen.
In other words, the convolution layers of the above CNNs use
the pre-trained weights of ImageNet dataset. The parameters
involved in newly added FC layers are learned using the
TABLE II
HYPERPARAMETER SEARCH SPACE CONSIDERED IN THIS PAPER, WHICH INCLUDES BOTH NETWORK HYPERPARAMETERS SUCH AS THE NUMBER OF
FULLY CONNECTED LAYERS AND PER-LAYER HYPERPARAMETERS LIKE ACTIVATION FUNCTION, DROPOUT FACTOR, AND THE NUMBER OF NEURONS
ARE PRESENTED IN THIS TABLE.

Scope                                Name                  Values                           Type
Network hyperparameters              number of FC layers   {1, 2, 3}                        integer
Hyperparameters per single FC layer  activation function   {ReLu, Tanh, Sigmoid}            categorical
                                     dropout rate          {0.0, 0.1, 0.2, 0.3, 0.4, 0.5}   float
                                     number of neurons     {64, 128, 256, 512, 1024}        integer

back-propagation algorithm [38]. The structure of the FC
layers is tuned automatically using Algorithm 1.
1) Chain Structured CNNs (Plain CNNs): The CNN architectures
proposed in the initial years of deep learning, such as
LeNet [54], AlexNet [1], ZFNet [11], and VGG-16 [12],
have a varying number of trainable layers (convolution,
Batch Normalization, and fully connected layers) and
involve different sets of hyperparameters. However, the
connectivity among the different layers in these architectures
remains the same, such that layer $L_{i+1}$ receives the input
feature map from layer $L_i$. Similarly, layer $L_{i+2}$ receives the
input from layer $L_{i+1}$, and so on. We consider VGG-16, a
16 layer chain structured deep CNN, to learn the structure of
FC layers for efficient transfer learning.
2) CNNs involving Skip Connections: Szegedy et al. [2]
introduced a deep CNN named GoogLeNet with a careful
handcrafted design which allows increasing the depth of
the model. GoogLeNet has a basic building block called
Inception block that uses multi-scale filters. Later on, the
concept of skip connections became very popular after the
emergence of ResNet in 2016 [13]. The skip connections
are also used by recent models such as DenseNet [14], etc.
Moreover, it also became popular among the CNNs learned
using NAS methods such as NASNet [5], PNAS [6], etc.
A layer in the CNNs involving skip connections receives
multiple input feature maps from its previous layers. For
example, layer $L_{i+1}$ receives the input from both layers $L_i$
and $L_{i-1}$ as in ResNet [13]; layer $L_n$ receives the input
feature map from all of its previous layers $\{L_1, L_2, \ldots, L_{n-1}\}$
as in DenseNet [14]. We utilized ResNet-50, MobileNet,
DenseNet-121, and NASNet-Mobile CNNs involving skip
connections to learn the structure of FC layers.
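The connectivity patterns described in this subsection can be summarized by listing which earlier layers feed a given layer. The sketch below is a simplified illustration that follows the wording of the text rather than the exact block structure of the real architectures; the function names are hypothetical and layers are numbered from 1.

```python
def chain_inputs(i):
    """Plain CNN: layer L_{i+1} reads only from L_i."""
    return [i]


def residual_inputs(i):
    """ResNet-style skip, as described above: L_{i+1} reads from L_i and L_{i-1}."""
    return [i, i - 1] if i > 1 else [i]


def dense_inputs(n):
    """DenseNet-style: L_n reads from every earlier layer L_1 ... L_{n-1}."""
    return list(range(1, n))


# Inputs of layer L_5 under each connectivity pattern.
print(chain_inputs(4))     # [4]
print(residual_inputs(4))  # [4, 3]
print(dense_inputs(5))     # [1, 2, 3, 4]
```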
C. Datasets
To validate the performance of the proposed method,
experiments are conducted on three different kinds of bench-
mark datasets such as CalTech-101, Oxford-102 Flowers, and
UC Merced Land Use.
1) CalTech-101 Dataset: The CalTech-101 [39] dataset
consists of images belonging to 101 object categories. Each class
has between 40 and 800 images. The most
common image categories such as human faces tend to have
more images compared to others. The total number of images
is 9144 and each image has a varying spatial dimension.
To conduct the experiments, we utilize 80% of the data for
training (i.e., 7315 images) and the remaining 20% images
to validate the performance of the deep neural networks. To
fit these images as input to the CNN models, we re-size the
image dimension to 224 ×224 ×3. A few samples from
CalTech-101 dataset are presented in Fig. 2(a).
2) Oxford-102 Flowers Dataset: The Oxford-102 [40] dataset
comprises images belonging to 102 flower categories that are
commonly found in the United Kingdom. This dataset
contains 8189 images such that each class has a varying
number of flower images ranging from 40 to 258. We utilize
80% of the dataset (6551 images) for training the CNNs and
remaining 1638 images for validating the performance of the
CNNs. To input the images to the CNN models, the image
dimension is re-sized to 224×224×3. Some example images
from Oxford-102 Flowers dataset are shown in Fig. 2(b).
3) UC Merced Land Use Dataset: UC Merced Land Use
dataset [41] contains images belonging to 21 categories of
lands. This dataset has a total of 2100 images with 100
images in each class. The developed CNN models are
trained on 80 images in each class, and the remaining 20
images are used to validate the performance of the models.
The image dimensions are resized from 256 × 256 × 3 to
224 × 224 × 3. A few images from the UC Merced Land
Use dataset are shown in Fig. 2(c).
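The 80/20 split sizes quoted for the three datasets follow from simple integer arithmetic. The short check below assumes the training share is rounded down, which reproduces the training-set sizes stated above; the function name is illustrative.

```python
def split_sizes(total, train_percent=80):
    """Training/validation sizes, rounding the training share down."""
    train = total * train_percent // 100
    return train, total - train


# Total image counts as stated in the dataset descriptions.
datasets = {
    "CalTech-101": 9144,
    "Oxford-102 Flowers": 8189,
    "UC Merced Land Use": 2100,
}
for name, total in datasets.items():
    train, val = split_sizes(total)
    print(f"{name}: {train} training images, {val} validation images")
```

For UC Merced Land Use, the result agrees with the per-class split of 80 training and 20 validation images over 21 classes.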
VI. EXPERIMENTAL RESULTS AND DISCUSSIONS
This section presents the experimental results (validation
accuracy) over three benchmark datasets.
A. CalTech-101 Image Classification Results
To learn the best set of hyperparameters involved in the FC
layers of a CNN, we employ two popularly adopted search
methods in the literature of Neural Architecture Search
(NAS). Those two search methods include i) Bayesian Opti-
mization and ii) Random Search. Random search chooses the
hyperparameters to explore randomly. In our experimental
settings, the number of iterations for random sampling is
set to 100. Table III presents the comparison among the
performance of proxy CNN models (fine-tuning the best
FC layer structure learned during the search process) found
using Bayesian search and Random search over CalTech-
101 dataset. Table III also lists the best possible set of
hyperparameter values like the number of FC layers, type
of activation, number of neurons in each FC layer, and the
dropout factor for each FC layer that are learned during the
search process. For example, the best structure of FC layers
learned using Bayesian optimization for VGG-16 results in
92.72% validation accuracy. After finding the best set of FC
layers’ hyperparameters using Algorithm 1, we fine-tune the
Fig. 2. (a) A few images belonging to CalTech-101 [39]. (b) A few random images from Oxford-102 Flowers [40]. (c) Some example images from
the UC Merced Land Use dataset [41].
TABLE III
THE BEST SET OF FC LAYERS HYPERPARAMETERS LEARNED FOR CALTECH-101 DATASET USING THE BAYESIAN SEARCH AND RANDOM SEARCH
TECHNIQUES. THE OPTIMAL STRUCTURE OF FULLY CONNECTED LAYERS (EXCLUDING THE OUTPUT FC LAYER) FOR POPULAR CNNS SUCH AS
VGG-16, RESNET50, MOBILENET, DENSENET121, AND NASNET-MOBILE IS PRESENTED.

S.No  Model          Search Method  #FC layers  Activation  #neurons  dropout rate  validation accuracy (%)
1     VGG-16         Bayesian       1           ReLu        256       0             92.72
                     random         1           ReLu        512       0.3           92.34
2     ResNet         Bayesian       0           -           -         -             90.15
                     random         1           Sigmoid     256       0.2           89.83
3     MobileNet      Bayesian       1           ReLu        1024      0.3           92.50
                     random         1           ReLu        256       0             88.73
4     DenseNet       Bayesian       1           Sigmoid     1024      0.3           90.21
                     random         1           ReLu        1024      0             88.79
5     NASNet-Mobile  Bayesian       1           ReLu        1024      0.1           88.51
                     random         1           Sigmoid     256       0             86.65

TABLE IV
RESULTS COMPARISON BETWEEN THE PROPOSED AUTOFCL AND THE STATE-OF-THE-ART METHODS OVER CALTECH-101, OXFORD-102 FLOWERS,
AND UC MERCED LAND USE DATASETS. THE STATE-OF-THE-ART INCLUDING BOTH TRANSFER LEARNING-BASED AND NON-TRANSFER
LEARNING-BASED METHODS ARE LISTED IN THIS TABLE. THE ROWS CORRESPONDING TO THE BEST AND SECOND-BEST PERFORMANCE OVER EACH
DATASET ARE HIGHLIGHTED IN BOLD AND bold-italic, RESPECTIVELY.

Dataset             Method                           Accuracy        Transfer Learning/Non Transfer Learning
CalTech-101         Lee et al. [42]                  65.4            Non Transfer Learning
                    Cubuk et al. [43]                86.9            Transfer Learning
                    Sawada et al. [44]               91.8            Transfer Learning
                    Ours (VGG-16 + AutoFCL)          94.38 ± 0.005   Transfer Learning
                    Ours (ResNet-50 + AutoFCL)       91.13 ± 0.004   Transfer Learning
                    Ours (MobileNet + AutoFCL)       92.07 ± 0.004   Transfer Learning
                    Ours (DenseNet-121 + AutoFCL)    89.5 ± 0.005    Transfer Learning
                    Ours (NASNetMobile + AutoFCL)    87.77 ± 0.005   Transfer Learning
Oxford-102 Flowers  Huang et al. [45]                85.66           Non Transfer Learning
                    Lv et al. [46]                   92.00           Non Transfer Learning
                    Murabito et al. [47]             79.4            Non Transfer Learning
                    Simon et al. [48]                97.1            Transfer Learning
                    Karlinsky et al. [49]            89              Transfer Learning
                    Ours (VGG-16 + AutoFCL)          98.83 ± 0.001   Transfer Learning
                    Ours (ResNet-50 + AutoFCL)       97.21 ± 0.05    Transfer Learning
                    Ours (MobileNet + AutoFCL)       58.6 ± 0.04     Transfer Learning
                    Ours (DenseNet-121 + AutoFCL)    60.91 ± 0.03    Transfer Learning
                    Ours (NASNetMobile + AutoFCL)    41.3 ± 0.006    Transfer Learning
UC Merced Land Use  Shao et al. [50]                 92.38           Non Transfer Learning
                    Yang et al. [51]                 93.67           Non Transfer Learning
                    Akram et al. [52]                97.6            Transfer Learning
                    Wang et al. [53]                 94.81           Transfer Learning
                    Ours (VGG-16 + AutoFCL)          96.83 ± 0.006   Transfer Learning
                    Ours (ResNet-50 + AutoFCL)       78 ± 0.03       Transfer Learning
                    Ours (MobileNet + AutoFCL)       88 ± 0.004      Transfer Learning
                    Ours (DenseNet-121 + AutoFCL)    80.8 ± 0.015    Transfer Learning
                    Ours (NASNetMobile + AutoFCL)    72.28 ± 0.016   Transfer Learning

FC layers of the developed CNN models over the CalTech-101 dataset. The CNN models are trained for 200 epochs
TABLE V
THE OPTIMAL STRUCTURE OF FC LAYERS LEARNED FOR OXFORD-102 FLOWERS DATASET USING THE BAYESIAN SEARCH AND RANDOM SEARCH.
THE VALUES OF VARIOUS HYPERPARAMETERS FOR VGG-16, RESNET, MOBILENET, DENSENET, NASNET-MOBILE MODELS ARE SHOWN IN THIS
TABLE.

S.No  Model          Search Method  #FC layers  Activation  #neurons  dropout rate  validation accuracy (%)
1     VGG-16         Bayesian       1           ReLu        256       0             96.64
                     random         1           ReLu        64        0.1           94.33
2     ResNet         Bayesian       1           Sigmoid     512       0.3           96.31
                     random         0           -           -         -             91.73
3     MobileNet      Bayesian       1           Sigmoid     512       0.5           61.29
                     random         1           Sigmoid     512       0.1           55.67
4     DenseNet       Bayesian       1           Sigmoid     1024      0.3           68.06
                     random         1           ReLu        256       0             55.18
5     NASNet-Mobile  Bayesian       1           ReLu        256       0.2           40.37
                     random         1           ReLu        512       0.1           38.37

using AdaGrad optimizer [25]. We consider the values of
other hyperparameters such as the learning rate similar to
the setting of training the proxy CNNs explored during the
search process. Fine-tuning the FC layers (learned using
the proposed AutoFCL) results in state-of-the-art accuracy
94.38% on CalTech-101 dataset.
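The random-search baseline used for comparison in this subsection can be sketched in a few lines. This is a minimal illustration, not the authors' code: `toy_validation_accuracy` is a purely synthetic stand-in for fine-tuning a proxy CNN and measuring its validation accuracy, and all function names are hypothetical. The 100 iterations match the setting stated above.

```python
import random

ACTIVATIONS = ["ReLu", "Tanh", "Sigmoid"]
DROPOUTS = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]
NEURONS = [64, 128, 256, 512, 1024]


def sample_structure(rng):
    """Sample 1 to 3 FC layers, each as (activation, dropout, neurons)."""
    n_layers = rng.choice([1, 2, 3])
    return tuple(
        (rng.choice(ACTIVATIONS), rng.choice(DROPOUTS), rng.choice(NEURONS))
        for _ in range(n_layers)
    )


def toy_validation_accuracy(structure):
    """Placeholder for fine-tuning a proxy CNN and scoring it on ValData.

    Purely synthetic: unrelated to the accuracies reported in the paper.
    """
    score = 0.9 - 0.02 * (len(structure) - 1)
    for activation, dropout, _neurons in structure:
        score -= 0.01 * abs(dropout - 0.3)
        score += 0.005 if activation == "ReLu" else 0.0
    return score


def random_search(objective, n_iter=100, seed=0):
    """Evaluate n_iter random structures and keep the best one."""
    rng = random.Random(seed)
    best_x, best_f = None, float("-inf")
    for _ in range(n_iter):
        x = sample_structure(rng)
        f = objective(x)
        if f > best_f:
            best_x, best_f = x, f
    return best_x, best_f


best_structure, best_score = random_search(toy_validation_accuracy)
print(best_structure, round(best_score, 4))
```

Unlike Bayesian Optimization, this baseline ignores the outcomes of earlier evaluations when choosing the next structure, so each of its samples is drawn independently of what has already been observed.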
B. Oxford-102 Flowers Image Classification Results
The optimal FC layer hyperparameters learned for the
Oxford-102 Flowers dataset using Bayesian optimization
and random search are shown in Table V. As with the
CalTech-101 dataset, once the search is complete, the
FC layers of the CNN (the best FC layer structure found
during the search) are fine-tuned on the Oxford-102 Flowers
dataset for 200 epochs using the AdaGrad optimizer [25].
The proposed AutoFCL achieves a state-of-the-art accuracy
of 98.83% on the Oxford-102 Flowers dataset. Table IV
summarizes the performance obtained using the various
CNN models with the target-dependent FC layer structure.
VGG-16 and ResNet-50 achieve the best and second-best
accuracy, respectively, on the Oxford-102 Flowers dataset.
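The gap between the two search methods in Table V can be read off with a few lines of arithmetic. The following is a minimal sketch that only re-uses the validation accuracies transcribed from Table V; the names `table_v` and `accuracy_gaps` are illustrative and do not come from the authors' code.

```python
# Validation accuracies (%) transcribed from Table V (Oxford-102 Flowers).
table_v = {
    "VGG-16": {"bayesian": 96.64, "random": 94.33},
    "ResNet": {"bayesian": 96.31, "random": 91.73},
    "MobileNet": {"bayesian": 61.29, "random": 55.67},
    "DenseNet": {"bayesian": 68.06, "random": 55.18},
    "NASNet-Mobile": {"bayesian": 40.37, "random": 38.37},
}


def accuracy_gaps(table):
    """Return the Bayesian-minus-random gap per model, in percentage points."""
    return {
        model: round(row["bayesian"] - row["random"], 2)
        for model, row in table.items()
    }


gaps = accuracy_gaps(table_v)
mean_gap = round(sum(gaps.values()) / len(gaps), 2)

for model, gap in gaps.items():
    print(f"{model:14s} {gap:+.2f} points")
print(f"mean gap: {mean_gap:.2f} points")
```

On these values, Bayesian search is ahead of random search for every model; the gap is largest for DenseNet (12.88 points) and averages 5.48 points over the five models.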
C. UC Merced Land Use Image Classification Results
We consider UC Merced Land Use as another image
dataset on which to learn the best structure of FC layers
for efficient transfer learning. The proposed method produces
comparable results on the UC Merced Land Use dataset, as
presented in Table IV. Table IV shows that fine-tuning the
FC layers learned using the proposed AutoFCL for VGG-16
produces 96.83% validation accuracy, the second-best
accuracy among the compared methods. Table VI lists the best
configuration of FC layer hyperparameters found using both
Bayesian search and random search. We also compare
the proposed method with fine-tuning the original CNN
architectures on the target dataset. Fine-tuning the
target-dependent FC layer structure of a CNN on the target
dataset results in better performance than fine-tuning the
target-independent FC layer structure. Fig. 3 demonstrates
that the proposed AutoFCL outperforms traditional fine-tuning
of the original FC layers of the CNN architectures.
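The "second-best" position on UC Merced Land Use can be checked directly against the Table IV rows. This is a small sketch using only the accuracies listed there (mean values, ignoring the ± terms); the dictionary name `uc_merced` is illustrative.

```python
# Accuracies (%) on UC Merced Land Use, transcribed from Table IV.
uc_merced = {
    "Shao et al. [50]": 92.38,
    "Yang et al. [51]": 93.67,
    "Akram et al. [52]": 97.6,
    "Wang et al. [53]": 94.81,
    "VGG-16 + AutoFCL": 96.83,
    "ResNet-50 + AutoFCL": 78.0,
    "MobileNet + AutoFCL": 88.0,
    "DenseNet-121 + AutoFCL": 80.8,
    "NASNetMobile + AutoFCL": 72.28,
}

# Sort method names from highest to lowest accuracy.
ranking = sorted(uc_merced, key=uc_merced.get, reverse=True)

# 1-based rank of the best AutoFCL model and its distance to the top entry.
rank_of_vgg = ranking.index("VGG-16 + AutoFCL") + 1
gap_to_best = round(uc_merced[ranking[0]] - uc_merced["VGG-16 + AutoFCL"], 2)

print(ranking[0], rank_of_vgg, gap_to_best)
```

VGG-16 + AutoFCL ranks second, 0.77 percentage points behind Akram et al. [52].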
VII. CONCLUSION AND FUTURE SCOPE
We propose AutoFCL, a method to learn the best possible
set of hyperparameters of the Fully Connected (dense)
layers of a CNN for better transfer learning. The Bayesian
Optimization algorithm is used to explore the search space
over the number of FC layers, the number of neurons in
each FC layer, the activation function, and the dropout factor.
To learn the structure of FC layers, experiments are conducted
on the CalTech-101, Oxford-102 Flowers, and UC Merced Land
Use datasets. Finding the best set of hyperparameters
involved in the FC layers of CNNs leads to better performance
while transferring the knowledge. The proposed AutoFCL
method outperforms the state-of-the-art on both the CalTech-101
and Oxford-102 Flowers datasets and achieves comparable
performance on the UC Merced Land Use dataset. In the
future, the proposed idea of tuning the hyperparameters related
to the FC layers may be extended to tuning the number of
convolution layers of a CNN based on the similarity between
the source and target datasets.
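To make the search space concrete, the sketch below samples FC-layer configurations and runs the random-search baseline over them. It rests on several assumptions: the candidate values in `SEARCH_SPACE` are only those that appear in Tables V and VI, not necessarily the full ranges searched; `toy_evaluate` is a hypothetical placeholder for training a proxy CNN and returning its validation accuracy; and all function names are illustrative. Bayesian optimization would replace the uniform sampling step with proposals guided by a surrogate model, which is not shown here.

```python
import random

# Assumed candidate values, taken from the entries of Tables V and VI.
SEARCH_SPACE = {
    "num_fc_layers": [0, 1],
    "neurons": [64, 128, 256, 512, 1024],
    "activation": ["relu", "sigmoid", "tanh"],
    "dropout": [0.0, 0.1, 0.2, 0.3, 0.4, 0.5],
}


def sample_config(rng):
    """Draw one FC-layer configuration uniformly at random."""
    num_layers = rng.choice(SEARCH_SPACE["num_fc_layers"])
    layers = [
        {
            "neurons": rng.choice(SEARCH_SPACE["neurons"]),
            "activation": rng.choice(SEARCH_SPACE["activation"]),
            "dropout": rng.choice(SEARCH_SPACE["dropout"]),
        }
        for _ in range(num_layers)
    ]
    return {"layers": layers}


def random_search(evaluate, num_trials, seed=0):
    """Keep the configuration with the highest score over num_trials samples."""
    rng = random.Random(seed)
    best_config, best_score = None, float("-inf")
    for _ in range(num_trials):
        config = sample_config(rng)
        score = evaluate(config)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score


def toy_evaluate(config):
    """Hypothetical stand-in objective, NOT a real training run.

    Scores are highest (0.0) for a single 512-neuron layer with dropout 0.3.
    """
    if not config["layers"]:
        return -1.0
    layer = config["layers"][0]
    return -abs(layer["neurons"] - 512) / 1024 - abs(layer["dropout"] - 0.3)


best_config, best_score = random_search(toy_evaluate, num_trials=50, seed=0)
print(best_config, best_score)
```

In practice, `toy_evaluate` would be swapped for a function that builds the FC head on top of the pre-trained convolutional base, trains it briefly on the target dataset, and returns the validation accuracy.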
ACKNOWLEDGMENT
We appreciate NVIDIA Corporation’s support through the
donation of the GeForce Titan XP GPU (Grant number: GPU-
900-1G611-2500-000T) used for this research.
REFERENCES
[1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification
with deep convolutional neural networks,” in Advances in neural
information processing systems, 2012, pp. 1097–1105.
[2] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov,
D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with
convolutions,” in Proceedings of the IEEE conference on computer
vision and pattern recognition, 2015, pp. 1–9.
[3] G. Hinton, L. Deng, D. Yu, G. Dahl, A.-r. Mohamed, N. Jaitly,
A. Senior, V. Vanhoucke, P. Nguyen, B. Kingsbury et al., “Deep neural
networks for acoustic modeling in speech recognition,” IEEE Signal
processing magazine, vol. 29, 2012.
[4] M. Wang, S. Abdelfattah, N. Moustafa, and J. Hu, “Deep gaussian
mixture-hidden markov model for classification of eeg signals,”
IEEE Transactions on Emerging Topics in Computational Intelligence,
vol. 2, no. 4, pp. 278–287, 2018.
[5] B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le, “Learning transferable
architectures for scalable image recognition,” in Proceedings of the
IEEE conference on computer vision and pattern recognition, 2018,
pp. 8697–8710.
TABLE VI
THE FC LAYER HYPERPARAMETERS TUNED AUTOMATICALLY FOR THE UC MERCED LAND USE DATASET USING BAYESIAN SEARCH AND
RANDOM SEARCH.
S.No  Model          Search Method  #FC layers  Activation  #neurons  Dropout rate  Validation accuracy (%)
1     VGG-16         Bayesian       1           ReLU        512       0.3           96.42
1     VGG-16         Random         1           ReLU        64        0.1           95.23
2     ResNet         Bayesian       1           Tanh        1024      0.2           83.8
2     ResNet         Random         1           Tanh        1024      0.4           82.14
3     MobileNet      Bayesian       1           Sigmoid     1024      0.5           89.52
3     MobileNet      Random         1           ReLU        1024      0.1           87.38
4     DenseNet       Bayesian       1           Sigmoid     1024      0.0           82.38
4     DenseNet       Random         1           ReLU        128       0.2           81.42
5     NASNet-Mobile  Bayesian       1           ReLU        128       0             74.76
5     NASNet-Mobile  Random         1           Sigmoid     512       0.4           73.33
Fig. 3. The performance comparison between the proposed AutoFCL and traditional Fine-tuning methods over UC Merced Land Use dataset. Learning
the optimal structure of FC layers with the knowledge of the target dataset and fine-tuning the learned FC layers leads to better performance.
[6] C. Liu, B. Zoph, M. Neumann, J. Shlens, W. Hua, L.-J. Li, L. Fei-Fei,
A. Yuille, J. Huang, and K. Murphy, “Progressive neural architecture
search,” in Proceedings of the European Conference on Computer
Vision (ECCV), 2018, pp. 19–34.
[7] T. Elsken, J. H. Metzen, and F. Hutter, “Neural architecture search: A
survey,” arXiv preprint arXiv:1808.05377, 2018.
[8] Y. Jaafra, J. L. Laurent, A. Deruyver, and M. S. Naceur, “Reinforce-
ment learning for neural architecture search: A review,” Image and
Vision Computing, vol. 89, pp. 57–66, 2019.
[9] S. S. Basha, S. R. Dubey, V. Pulabaigari, and S. Mukherjee, “Impact
of fully connected layers on performance of convolutional neural
networks for image classification,” Neurocomputing, 2019.
[10] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei,
“Imagenet: A large-scale hierarchical image database,” in Computer
Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference
on. Ieee, 2009, pp. 248–255.
[11] M. D. Zeiler and R. Fergus, “Visualizing and understanding con-
volutional networks,” in European conference on computer vision.
Springer, 2014, pp. 818–833.
[12] K. Simonyan and A. Zisserman, “Very deep convolutional networks
for large-scale image recognition,” arXiv preprint arXiv:1409.1556,
2014.
[13] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image
recognition,” in Proceedings of the IEEE conference on computer
vision and pattern recognition, 2016, pp. 770–778.
[14] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger,
“Densely connected convolutional networks,” in Proceedings of the
IEEE conference on computer vision and pattern recognition, 2017,
pp. 4700–4708.
[15] Q. Xu, M. Zhang, Z. Gu, and G. Pan, “Overfitting remedy by sparsify-
ing regularization on fully-connected layers of cnns,” Neurocomputing,
vol. 328, pp. 69–74, 2019.
[16] H. Mendoza, A. Klein, M. Feurer, J. T. Springenberg, and F. Hut-
ter, “Towards automatically-tuned neural networks,” in Workshop on
Automatic Machine Learning, 2016, pp. 58–65.
[17] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethink-
ing the inception architecture for computer vision,” in Proceedings
of the IEEE conference on computer vision and pattern recognition,
2016, pp. 2818–2826.
[18] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang,
T. Weyand, M. Andreetto, and H. Adam, “Mobilenets: Efficient
convolutional neural networks for mobile vision applications,” arXiv
preprint arXiv:1704.04861, 2017.
[19] H.-W. Ng, V. D. Nguyen, V. Vonikakis, and S. Winkler, “Deep learning
for emotion recognition on small datasets using transfer learning,”
in Proceedings of the 2015 ACM on international conference on
multimodal interaction. ACM, 2015, pp. 443–449.
[20] P. I. Frazier, “A tutorial on bayesian optimization,” arXiv preprint
arXiv:1807.02811, 2018.
[21] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma,
Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and
L. Fei-Fei, “ImageNet Large Scale Visual Recognition Challenge,”
International Journal of Computer Vision (IJCV), vol. 115, no. 3, pp.
211–252, 2015.
[22] X. Li, Y. Grandvalet, F. Davoine, J. Cheng, Y. Cui, H. Zhang, S. Be-
longie, Y.-H. Tsai, and M.-H. Yang, “Transfer learning in computer
vision tasks: Remember where you come from,” Image and Vision
Computing, vol. 93, p. 103853, 2020.
[23] J. Hu, “Discriminative transfer learning with sparsity regularization for
single-sample face recognition,” Image and vision computing, vol. 60,
pp. 48–57, 2017.
[24] D. Han, Q. Liu, and W. Fan, “A new image classification method using
cnn transfer learning and web data augmentation,” Expert Systems with
Applications, vol. 95, pp. 43–56, 2018.
[25] J. Duchi, E. Hazan, and Y. Singer, “Adaptive subgradient methods
for online learning and stochastic optimization,” Journal of Machine
Learning Research, vol. 12, no. Jul, pp. 2121–2159, 2011.
[26] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimiza-
tion,” arXiv preprint arXiv:1412.6980, 2014.
[27] M. Wistuba, “Bayesian optimization combined with successive halv-
ing for neural network architecture optimization.” in AutoML@
PKDD/ECML, 2017, pp. 2–11.
[28] D. Ji, Y. Jiang, P. Qian, and S. Wang, “A novel doubly reweighting
multisource transfer learning framework,” IEEE Transactions on
Emerging Topics in Computational Intelligence, vol. 3, no. 5,
pp. 380–391, 2019.
[29] A. Gupta, Y.-S. Ong, and L. Feng, “Insights on transfer optimization:
Because experience is the best teacher,” IEEE Transactions on
Emerging Topics in Computational Intelligence, vol. 2, no. 1,
pp. 51–64, 2017.
[30] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson, “How transferable are
features in deep neural networks?” in Advances in neural information
processing systems, 2014, pp. 3320–3328.
[31] M. Xie, N. Jean, M. Burke, D. Lobell, and S. Ermon, “Transfer
learning from deep features for remote sensing and poverty mapping,”
in Thirtieth AAAI Conference on Artificial Intelligence, 2016.
[32] P. Molchanov, S. Tyree, T. Karras, T. Aila, and J. Kautz, “Pruning
convolutional neural networks for resource efficient transfer learning,”
arXiv preprint arXiv:1611.06440, vol. 3, 2016.
[33] J. Snoek, H. Larochelle, and R. P. Adams, “Practical bayesian op-
timization of machine learning algorithms,” in Advances in neural
information processing systems, 2012, pp. 2951–2959.
[34] C. K. Williams and C. E. Rasmussen, Gaussian processes for machine
learning. MIT press Cambridge, MA, 2006, vol. 2, no. 3.
[35] C. E. Rasmussen, “Gaussian processes in machine learning,” in
Summer School on Machine Learning. Springer, 2003, pp. 63–71.
[36] D. R. Jones, M. Schonlau, and W. J. Welch, “Efficient global
optimization of expensive black-box functions,” Journal of Global
optimization, vol. 13, no. 4, pp. 455–492, 1998.
[37] K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers:
Surpassing human-level performance on imagenet classification,” in
Proceedings of the IEEE international conference on computer vision,
2015, pp. 1026–1034.
[38] H. J. Kelley, “Gradient theory of optimal flight paths,” Ars Journal,
vol. 30, no. 10, pp. 947–954, 1960.
[39] L. Fei-Fei, R. Fergus, and P. Perona, “One-shot learning of object
categories,” IEEE transactions on pattern analysis and machine intel-
ligence, vol. 28, no. 4, pp. 594–611, 2006.
[40] M.-E. Nilsback and A. Zisserman, “Automated flower classification
over a large number of classes,” in Proceedings of the Indian Confer-
ence on Computer Vision, Graphics and Image Processing, Dec 2008.
[41] Y. Yang and S. Newsam, “Bag-of-visual-words and spatial extensions
for land-use classification,” in Proceedings of the 18th SIGSPATIAL
international conference on advances in geographic information sys-
tems. ACM, 2010, pp. 270–279.
[42] H. Lee, R. Grosse, R. Ranganath, and A. Y. Ng, “Convolutional
deep belief networks for scalable unsupervised learning of hierarchical
representations,” in Proceedings of the 26th annual international
conference on machine learning. ACM, 2009, pp. 609–616.
[43] E. D. Cubuk, B. Zoph, D. Mane, V. Vasudevan, and Q. V. Le,
“Autoaugment: Learning augmentation strategies from data,” in Pro-
ceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, 2019, pp. 113–123.
[44] Y. Sawada, Y. Sato, T. Nakada, S. Yamaguchi, K. Ujimoto, and
N. Hayashi, “Improvement in classification performance based on
target vector modification for all-transfer deep learning,” Applied
Sciences, vol. 9, no. 1, p. 128, 2019.
[45] B. Huang, Y. Hu, Y. Sun, X. Hao, and C. Yan, “A flower classification
framework based on ensemble of cnns,” in Pacific Rim Conference on
Multimedia. Springer, 2018, pp. 235–244.
[46] X. Lv and F. Duan, “Metric learning via feature weighting for scalable
image retrieval,” Pattern Recognition Letters, vol. 109, pp. 97–102,
2018.
[47] F. Murabito, C. Spampinato, S. Palazzo, D. Giordano, K. Pogorelov,
and M. Riegler, “Top-down saliency detection driven by visual clas-
sification,” Computer Vision and Image Understanding, vol. 172, pp.
67–76, 2018.
[48] M. Simon, E. Rodner, T. Darrell, and J. Denzler, “The whole is more
than its parts? from explicit to implicit pose normalization,” IEEE
transactions on pattern analysis and machine intelligence, 2018.
[49] L. Karlinsky, J. Shtok, S. Harary, E. Schwartz, A. Aides, R. Feris,
R. Giryes, and A. M. Bronstein, “Repmet: Representative-based
metric learning for classification and few-shot object detection,” in
Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, 2019, pp. 5197–5206.
[50] W. Shao, W. Yang, G.-S. Xia, and G. Liu, “A hierarchical scheme
of multiple feature fusion for high-resolution satellite scene catego-
rization,” in International Conference on Computer Vision Systems.
Springer, 2013, pp. 324–333.
[51] M. Y. Yang, S. Al-Shaikhli, T. Jiang, Y. Cao, and B. Rosenhahn, “Bi-
layer dictionary learning for remote sensing image classification,” in
2016 IEEE International Geoscience and Remote Sensing Symposium
(IGARSS). IEEE, 2016, pp. 3059–3062.
[52] T. Akram, B. Laurent, S. R. Naqvi, M. M. Alex, N. Muhammad et al.,
“A deep heterogeneous feature fusion approach for automatic land-use
classification,” Information Sciences, vol. 467, pp. 199–218, 2018.
[53] E. K. Wang, Y. Li, Z. Nie, J. Yu, Z. Liang, X. Zhang, and S. M. Yiu,
“Deep fusion feature based object detection method for high resolution
optical remote sensing images,” Applied Sciences, vol. 9, no. 6, p.
1130, 2019.
[54] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner et al., “Gradient-based
learning applied to document recognition,” Proceedings of the IEEE,
vol. 86, no. 11, pp. 2278–2324, 1998.