All content in this area was uploaded by Xiaofei Yang on Sep 18, 2018
SUBMISSION TO IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING 1
Hyperspectral Image Classification With Deep Learning Models
Xiaofei Yang1,2, Yunming Ye1,2, Xutao Li1,2, Raymond Y. K. Lau3, Xiaofeng Zhang1,2, and Xiaohui Huang4
1Shenzhen Graduate School, Harbin Institute of Technology, Shenzhen, 518055, China
2Shenzhen Key Laboratory of Internet Information Collaboration, Shenzhen, 518055, China
3City University of Hong Kong, Hong Kong
4School of Information Engineering, East China Jiaotong University, China
Deep learning has achieved great successes in conventional computer vision tasks. In this paper, we exploit deep learning techniques
to address the hyperspectral image classification problem. In contrast to conventional computer vision tasks that only examine the
spatial context, our proposed method can exploit both spatial context and spectral correlation to enhance hyperspectral image
classification. In particular, we advocate four new deep learning models, namely 2D Convolutional neural network (2D-CNN), 3D
Convolutional neural network (3D-CNN), recurrent 2D Convolutional neural network (R-2D-CNN), and recurrent 3D Convolutional
neural network (R-3D-CNN) for hyperspectral image classification. We conducted rigorous experiments based on six publicly available
data sets. Through a comparative evaluation with other state-of-the-art methods, our experimental results confirm the superiority
of the proposed deep learning models, especially the R-3D-CNN and the R-2D-CNN deep learning models.
Index Terms—Deep Learning, Hyperspectral Image, Convolutional Neural Network.
I. INTRODUCTION
RECENTLY, the rapid development of optics and photon-
ics has significantly advanced the field of hyperspectral
imaging techniques. As a result, hyperspectral sensors are
installed in many satellites which can produce images with
rich spectral information. The rich information captured in
hyperspectral images enables us to distinguish very similar
materials and objects by using satellites. Accordingly, hyper-
spectral imaging techniques have been widely used in a variety
of fields such as agriculture, monitoring, astronomy, and
mineral exploration. For example, Brown et al. [1] analyzed
the CRISM hyperspectral data set and used linear mixing of
absorption band techniques to determine the mineralogy of
the surface of Mars. In [2], Brown et al. used a VNIR
imaging spectrometer, a hyperspectral scanning pushbroom
device sensitive to wavelengths from 400 to 1000 nm,
for mineral exploration.
The existing methods for hyperspectral image classifica-
tion are mostly based on conventional pattern recognition
approaches such as support vector machine (SVM) [3] and
K-nearest neighbor (KNN) classifiers. To address the curse of
dimensionality, namely the Hughes phenomenon [4], Krishna-
puram et al. [5] performed dimensionality reduction against a
data set first and then applied multinomial logistic regression
(MLR) to improve image classification performance. Wang et
al. proposed a novel dimensionality reduction method, namely
the Locality Adaptive Discriminant Analysis (LADA) method
for hyperspectral image analysis [6]. Another way to cope
with the Hughes phenomenon is via the salient band selection
method. For example, Wang et al. [7] proposed a manifold
ranking based salient band selection method. In addition, Yuan
et al. proposed a new dual clustering framework, which was
applied to tackle the inherent drawbacks of the clustering-based
band selection method [8]. It has been shown that a
composite kernel approach that requires multiple kernels can
enhance the accuracy of classification by fusing spatial and
spectral information. For example, the Generalized Composite
Kernel (GCK) framework is one of the promising methods
for hyperspectral image classification. Though kernel-based
methods like GCK can exploit both the spectral and the spatial
information, they involve solving a computationally expensive
optimization problem.
Corresponding authors: Yunming Ye (email: yeyunming@hit.edu.cn) and
Xutao Li (email: lixutao@hit.edu.cn).
As a state-of-the-art machine learning technique, deep learn-
ing [9] [10] has recently attracted a lot of attention for its
application to conventional computer vision tasks. One main
reason is that deep learning can automatically discover an
effective feature representation for a problem domain, thus
avoiding the complicated and hand-crafted feature engineering
process. With their specially designed architecture,
convolutional neural networks (CNNs) are widely applied to
image recognition and image segmentation, tasks that depend on
the spatial correlation among pixels. Successful examples of
CNNs include AlexNet [11], VGG [12], GoogLeNet [13], and
ResNet [14]. However, existing CNNs are applied to con-
ventional image classification tasks rather than hyperspectral
image classification tasks where both the spatial and spectral
correlations need to be effectively exploited.
In this paper, we address the hyperspectral image classifica-
tion problem by using a new deep learning technique. As noted
above, both the spectral factor and the spatial factor influence
the class label prediction of a pixel. On the one hand, the label
of a pixel is reflected in its spectral values measured across
different spectral bands. On the other hand, as geographically
close pixels tend to belong to the same class, predicting the
class label of a pixel should take into account the class labels
of the surrounding pixels. Hence, a good hyperspectral image
classification method should consider both the spectral factor
and the spatial factor. In this paper, we first advocate a 2D-
CNN model and a 3D-CNN model for classifying hyper-
spectral images. The intuition is that a 2D-CNN can exploit
the spatial context, whereas a 3D-CNN can exploit both the
spatial and the spectral context. Though the aforementioned
models can take into account rich contextual information, the
way that they process the spatial information may introduce
unwanted noise. Accordingly, we further design the recurrent
2D-CNN and the recurrent 3D-CNN to address the noisy
spatial information problem. The main contributions of our
research work are summarized as follows.
1) First, we treat spectral data as the channels of conven-
tional images. To classify each pixel in a hyperspec-
tral image, we extract a small patch centered at the
pixel. The patch is treated as an image with multiple
channels. Then, we design a 2D-CNN model with three
2D convolution layers, followed by a fully connected
layer, to classify the patch. The label of the patch is
taken as the label of its central pixel. Though
pooling layers (such as max pooling and average
pooling layers) reduce the dimensions of feature
maps and simplify the computations, they may degrade the
classification accuracy of the network. To preserve as
much contextual information as possible, pooling layers
are excluded from our 2D-CNN model. The convolution
layer, pooling layer, and fully connected layer of a CNN
are explained in Section II.B.
2) Though the 2D-CNN model can utilize the spatial context,
it fails to consider the spectral correlations. To address
this problem, we further design a 3D-CNN model
which is composed of seven convolution layers and
one fully connected layer. Unlike the 2D-CNN,
the convolution operator of this model is 3D:
the first two dimensions capture the
spatial context, and the third dimension captures the
spectral context. Though the 3D-CNN model contains
more network parameters than its 2D counterpart, it
should be more effective than its 2D counterpart because
of its ability to evaluate the spectral correlations of a
hyperspectral image.
3) The 2D-CNN model may be noisy because the classifi-
cation of a pixel only relies on a small patch centered
at the pixel. To effectively utilize the spatial context,
we further design a recurrent 2D-CNN model (R-2D-
CNN). The R-2D-CNN can extract features by gradually
shrinking the patch to concentrate on the central pixel.
Experimental results show that the R-2D-CNN model
indeed performs better than the 2D-CNN model.
4) Finally, we design the recurrent 3D-CNN model (R-3D-
CNN) to take into account both spatial and spectral con-
texts, while alleviating the problem of a noisy patch. The
R-3D-CNN extends the 3D-CNN model by shrinking
the patch gradually. As a result, the final classification
of each pixel mainly depends on the information of the
pixel rather than a patch. Experimental results show the
superiority of the R-3D-CNN model. In particular, it
converges faster than other methods, and achieves the
best classification performance.
The rest of the paper is organized as follows. Section II
discusses the related research work. In Section III, we illustrate
various CNN-based deep learning models and the corresponding
algorithms for hyperspectral image classification. Section
IV reports the experimental results of a comparative evaluation
of the proposed methods and other baseline methods. Finally,
we give concluding remarks and highlight the directions
of future research work.
II. RELATED WORK
A. Classical Classification Methods
Hyperspectral remote sensing classification has been ex-
tensively studied recently. For example, Bandos et al. [15]
utilize a linear discriminant method to solve the problem.
However, when the spectral resolution is low, it is necessary to
handle the band mixing problem for better differentiating the
pixels or performing feature selection. To this end, Brown [16]
develops a robust method to automatically separate overlapping
absorption bands; the advantage of this method is that
it is relatively noise-insensitive. To address the nonlinearity
of data, quadratic discriminant analysis and logarithmic
discriminant analysis are also explored. However, these methods
suffer from the Hughes phenomenon, i.e., the classification
performance degrades considerably when the dimensionality
of the problem space becomes high. Wang et al. [6] propose
a novel dimensionality reduction method, namely LADA for
hyperspectral image classification. Following the idea of LDA,
LADA learns a projection matrix G to pull the points of the
same class close to each other while pushing those of
different classes far away from each other. To further exploit the
local data manifold, LADA adds an adaptive manifold term
parameterized by a matrix S to the computation of the
within-class scatter term, and solves for the matrices G and S alternately.
In 2016, Wang et al. [7] propose a manifold ranking based
salient band selection method for hyperspectral image classification.
The method first employs an evolution algorithm to group
the bands into several subsets, and finds some representative
bands. Then, it uses the representatives to select salient bands
by a manifold ranking strategy. The performance of the method
significantly relies on the qualities of chosen representatives,
and the constructed manifold.
To improve classification performance, many researchers
resort to kernel-based methods. The main idea of kernel-based
methods is to project samples into a high dimensional space
in which the samples of different classes become linearly
separable. The trick of kernel-based methods is that one does
not need to specify the details of the transformation function.
Instead, one only needs to define the inner products among
samples in the high dimensional space. For example, Camps et
al. [17] employ the kernel trick with SVM, so that the separation
of classes in a high dimensional space is achieved via a
nonlinear transformation.
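As a small illustration of the kernel idea described above, the following sketch evaluates a Gaussian (RBF) kernel on pairs of spectral vectors and collects the values in a Gram matrix. The choice of an RBF kernel, the value of `gamma`, and the toy three-band pixels are assumptions made for illustration; they are not taken from [17].

```python
import math


def rbf_kernel(x, y, gamma=0.5):
    """Gaussian (RBF) kernel: an inner product in an implicit feature space."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def gram_matrix(samples, kernel):
    """Pairwise kernel values, which are all a kernel classifier needs."""
    return [[kernel(a, b) for b in samples] for a in samples]


# Toy "pixels" with three spectral bands each (hypothetical values).
pixels = [[0.1, 0.2, 0.3], [0.1, 0.2, 0.3], [0.9, 0.8, 0.7]]
K = gram_matrix(pixels, rbf_kernel)
```

The explicit high dimensional mapping never appears in the code; only the pairwise kernel values are computed, which is the point of the trick.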
Apart from employing a single kernel, some researchers
employ multiple kernels for hyperspectral image
classification. For example, Rakotomamonjy et al. [18]
advocate the multiple kernel learning (MKL) method, which
learns a kernel and a classification predictor at the
same time. With the preliminary success of MKL, the same
technique is applied to remote sensing in 2010 [19]. In 2012,
a representative MKL algorithm is developed which could
establish the weights of kernels according to their statistical
significance [20].
The aforementioned kernel-based methods do not explicitly
exploit a spatial context. To address such a problem, the
composite kernel (CK) method is proposed [21]. In [22], the
CK method is generalized by using extended multi-attribute
profiles (EMAP). Apart from considering the spatial context,
the CK method could exploit the spectral context as well. For
example, a generalized composite kernel (GCK) is developed
to exploit both extended multi-attribute profiles and raw fea-
tures [23]. The GCK method often achieves better performance
than conventional methods such as SVM-CK [23].
Despite achieving promising classification performance, all
the kernel-based methods suffer from two drawbacks: (1) they
often involve solving complicated convex problems which are
in general difficult for a classifier to learn; (2) kernels must
be carefully chosen so as to achieve good performance.
Recently, some more advanced classification methods have been
developed for hyperspectral image classification [24] [25].
For example, the logistic regression via variable splitting and
augmented Lagrangian (LORSAL) algorithm [26] is developed
to tackle larger data sets efficiently. In [27], Sun et al. propose
a hyperspectral image classification model, named SMLR-
SpATV, which includes a spectral data fidelity term and a
spatially adaptive Markov random field (MRF) prior in the
hidden field. Li et al. [28] propose a new multiple feature
learning framework (MFL), which pursues the combination of
multiple features for the hyperspectral scenes categorization.
The method can handle both linear and nonlinear classification.
In [29], a novel SVM based classification method is proposed
by applying the 3-dimensional discrete wavelet transform
(SVM-3DG).
B. Deep Learning Models
Recently, CNN models have achieved a breakthrough in the
performance of image classification. A CNN model (see Fig. 1)
is a multi-layer neural network composed of convolution
layers, pooling layers, and a fully connected layer. The convolution
layer contains N filters (C1 in Fig. 1), each of which is a
small weighted matrix. By convolving the N filters with an
input image and transforming the output with a non-linear
activation function, N feature maps are produced. The feature
maps often contain redundant information. To reduce the
redundancy, a pooling layer is appended (S2 in Fig. 1), which
summarizes feature maps into small matrices by calculating
the average (average pooling) or maximum value (maximum
pooling) locally. The convolution layer and pooling layer can
be repeated multiple times (C3 and S4 in Fig. 1) until the
generated feature maps are of size 1-by-1. Finally, a fully
connected layer will be appended for categorization. The
neurons of the fully connected layer take all the 1-by-1 feature
maps as their inputs.
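The pooling step described above can be sketched in a few lines of plain Python. This is a minimal illustration, assuming non-overlapping 2-by-2 windows and a toy 4-by-4 feature map; neither the window size nor the values come from the paper.

```python
def pool2d(fmap, size=2, mode="max"):
    """Non-overlapping pooling over size-by-size windows of a 2D feature map."""
    rows, cols = len(fmap) // size, len(fmap[0]) // size
    pooled = []
    for r in range(rows):
        row = []
        for c in range(cols):
            window = [fmap[r * size + i][c * size + j]
                      for i in range(size) for j in range(size)]
            if mode == "max":
                row.append(max(window))
            else:
                row.append(sum(window) / len(window))
        pooled.append(row)
    return pooled


# Hypothetical 4-by-4 feature map.
fmap = [[1, 2, 5, 6],
        [3, 4, 7, 8],
        [0, 0, 1, 1],
        [0, 4, 1, 1]]
max_pooled = pool2d(fmap, mode="max")      # [[4, 8], [4, 1]]
avg_pooled = pool2d(fmap, mode="average")  # [[2.5, 6.5], [1.0, 1.0]]
```

Each output entry summarizes four input entries, which is the redundancy reduction the text refers to, and also the loss of per-pixel detail that motivates dropping pooling layers in the proposed models.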
Fig. 1. The CNN model consisting of convolution layers (C), pooling layers (S), and
a fully connected layer.
The first CNN model is developed by LeCun in 1996 [30]
[31]. Combined with the back-propagation algorithm, the CNN
model achieves very good performance in handwritten digit
recognition. With the advancement of Graphics Processing
Units (GPUs), deep learning has attracted a lot of attention
from researchers. Meanwhile, the CNN model has been
improved by recent deep learning techniques. For example,
Glorot et al. [32] introduce the Rectified Linear Units (ReLU)
as the activation function for CNNs in 2011. By doing so,
the vanishing gradient problem and the ineffective explo-
ration problem of the BP method can be alleviated. In 2012,
Krizhevsky et al. [11] designed the AlexNet network which
was a deep CNN model with the ReLU activation function.
The AlexNet network won the annual ImageNet competition
in 2012. To avoid overfitting, Srivastava et al. [33] proposed
the dropout technique for deep CNN. In addition, Szegedy
et al. [13] designed the GoogLeNet model which is a deep
CNN model with each layer comprising multi-scale CNN.
He et al. [14] proposed a deep residual CNN model which
won the ImageNet competition in 2015. In [34], an end-to-end
band-adaptive spectral-spatial feature learning neural network
was proposed. In [35], Cao et al. proposed a hyperspectral
image segmentation method using Markov random fields
and a convolutional network. To tackle the street scene labeling
problem, Wang et al. [36] proposed a hybrid method that
utilized priori convolutional neural networks at the superpixel
level and soft restricted context transfer. The former technique
aims to learn prior location information and produce coarse
label predictions, whereas the latter technique aims to improve
the coarse prediction by reducing over-smoothness. However,
the algorithm works for conventional images only. It does not
take into account the characteristics of rich band information
in hyperspectral images.
All the above models are 2D-CNN models within which the
convolution operators only deal with two dimensional spatial
features. In [37], a 3D convolution network is designed to
handle video categorization tasks effectively. Following the
framework of 3D-CNN models, we employ such an architec-
ture for hyperspectral image classification, in which the third
dimension refers to the spectral axis.
Apart from CNN models, another important deep learning
framework is the recurrent neural network (RNN) which is of-
ten applied to process sequence data arising from applications
such as speech recognition [38], machine translation [39], bot
chat [40], and so on. In [41], Mou et al. proposed a novel RNN
model for hyperspectral image classification, which
effectively analyzes hyperspectral pixels as sequential data and
then determines information categories via network reasoning.
The basic intuition of RNN is that it applies the same neural
network block recurrently for sequence prediction. To preserve
the information of observed historical sequences, an RNN is
fed with the current observation and the hidden layers trained
on the previously observed sequences. By doing so, the RNN
can take into account both the features of the current sequence
and those of the historical observations to improve the current
prediction. In contrast to the aforementioned approaches, we
apply the RNN model to deal with the spatial contexts
recurrently.
Fig. 2. The 2D-CNN model, consisting of 2D convolutional operations with kernel
size (k) and number of feature maps (m) at each convolutional layer, for
hyperspectral image classification.
III. THE PROPOSED METHODS
In this section, we illustrate the design of new 2D-CNN, R-
2D-CNN, 3D-CNN, and R-3D-CNN models for hyperspectral
image classification. For these methods, we extract a small
patch centered at each pixel to build the classification models.
Among the proposed models, the 2D-CNN and the R-2D-CNN
models exploit the spatial contexts only, whereas the 3D-CNN
and the R-3D-CNN models exploit both the spatial features
and the spectral correlations of pixels.
A. 2D-CNN model
As illustrated in Fig.2, our 2D-CNN model is composed
of three main phases, which are patch extraction, feature ex-
traction, and label identification. Given a hyperspectral image,
we first extract a small patch centered at each pixel as the
raw feature. Then, a deep learning model is constructed to
acquire the feature maps of these patches. Finally, the label
of each pixel is classified based on the feature map of the
corresponding patch. For all four models, we exclude the
pooling layers so as to preserve as much information of a
pixel as possible. The three-phase processing of the 2D-CNN
model is illustrated below.
Assume that we are given a hyperspectral image of size N ×
M × D, where N and M are the width and the height of the
image, and D denotes the number of spectral bands. We aim at
predicting the label of each pixel of the image. As spatially
adjacent pixels often have the same labels, it is desirable for
the proposed model to consider the “spatial coherence”. To
this end, the first processing phase of our model is to extract
a K × K × D patch for each pixel. In particular, each patch
(i.e., the spatial context) is constructed around a pixel, which is the
center point of the patch. For the pixels that reside near the
edge of the image, there may not be sufficient information to
build a patch of the expected size. Accordingly, we construct
the spatial context for these pixels by performing a mirror
padding operation.
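The patch extraction phase can be sketched as follows. This is a minimal sketch under stated assumptions: the image is stored as nested lists `image[row][col][band]`, K is odd, and "mirror padding" is taken to mean reflection about the border pixel without repeating it (the text does not say which mirror variant is used). The helper names `reflect_index` and `extract_patch` are hypothetical.

```python
def reflect_index(idx, length):
    """Map an out-of-range index into [0, length) by mirroring at the border."""
    if length == 1:
        return 0
    period = 2 * (length - 1)
    idx = abs(idx) % period
    return idx if idx < length else period - idx


def extract_patch(image, row, col, k):
    """Return the k-by-k-by-D patch centred at (row, col); k is assumed odd."""
    half = k // 2
    height, width = len(image), len(image[0])
    return [[image[reflect_index(row + dr, height)][reflect_index(col + dc, width)]
             for dc in range(-half, half + 1)]
            for dr in range(-half, half + 1)]


# Hypothetical 3-by-3 image with D = 2 bands per pixel.
image = [[[10 * r + c, -(10 * r + c)] for c in range(3)] for r in range(3)]
corner_patch = extract_patch(image, 0, 0, 3)   # needs mirrored rows and columns
center_patch = extract_patch(image, 1, 1, 3)   # lies fully inside the image
```

For the corner pixel, the missing row above and the missing column to the left are filled with mirrored copies of row 1 and column 1, so every pixel receives a patch of the same size.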
For the second processing phase, each extracted patch
is treated as an image with multiple channels on its own.
Thereby, we can apply a deep CNN model with 2D
convolution layers to extract the feature maps for the patch.
More specifically, the 2D-CNN operator at each layer is
formulated as follows:
$$v_{ij}^{xy} = F\left(b_{ij} + \sum_{m} \sum_{p=0}^{N_i-1} \sum_{q=0}^{M_i-1} w_{ijm}^{pq}\, v_{(i-1)m}^{(x+p)(y+q)}\right) \tag{1}$$
where $i$ indicates the layer under consideration, and
$j$ indexes the feature maps of layer $i$; $v_{ij}^{xy}$ stands
for the output at position $(x, y)$ of the $j$th feature map at
the $i$th layer; $b_{ij}$ refers to the bias term, and $F(\cdot)$ denotes
the activation function of the layer; $m$ indexes over the set
of feature maps of the $(i-1)$th layer, which are the inputs
to the $i$th layer. $w_{ijm}^{pq}$ is the value at position $(p, q)$ of the
convolution kernel connecting the $m$th feature map of the
$(i-1)$th layer to the $j$th feature map of the $i$th layer, and
$N_i$ and $M_i$ are the height and width of this
kernel. For the proposed model, we adopt the ReLU function
as the activation function $F$, which is defined as follows:
$$F(x) = \max(0, x) \tag{2}$$
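Eqs. (1) and (2) translate almost literally into nested loops. The sketch below is written in plain Python for readability rather than speed, and it assumes stride 1 and no padding, which Eq. (1) implies but the text does not state. The toy inputs, kernels, and biases are hypothetical.

```python
def relu(x):
    """Eq. (2): F(x) = max(0, x)."""
    return max(0.0, x)


def conv2d_layer(inputs, kernels, biases):
    """Eq. (1) with stride 1 and no padding (both are assumptions).

    inputs[m][x][y]     : m-th feature map of layer i-1
    kernels[j][m][p][q] : weight w_{ijm}^{pq}
    biases[j]           : bias b_{ij}
    Returns outputs[j][x][y], the j-th feature map of layer i.
    """
    n_i = len(kernels[0][0])       # kernel height N_i
    m_i = len(kernels[0][0][0])    # kernel width M_i
    out_h = len(inputs[0]) - n_i + 1
    out_w = len(inputs[0][0]) - m_i + 1
    outputs = []
    for j, kernel in enumerate(kernels):
        fmap = [[relu(biases[j] + sum(kernel[m][p][q] * inputs[m][x + p][y + q]
                                      for m in range(len(inputs))
                                      for p in range(n_i)
                                      for q in range(m_i)))
                 for y in range(out_w)]
                for x in range(out_h)]
        outputs.append(fmap)
    return outputs


# Hypothetical toy input: two 3-by-3 feature maps and two 2-by-2 kernels.
inputs = [[[1, 1, 1], [1, 1, 1], [1, 1, 1]],
          [[1, 2, 3], [4, 5, 6], [7, 8, 9]]]
kernels = [[[[1, 1], [1, 1]], [[1, 0], [0, -1]]],
           [[[1, 1], [1, 1]], [[0, 0], [0, 0]]]]
biases = [0.5, -10.0]
outputs = conv2d_layer(inputs, kernels, biases)
```

With these values the first output map is 0.5 everywhere, and the second is clipped to 0 by the ReLU because its pre-activation is negative. Each 3-by-3 input shrinks to a 2-by-2 output, since no padding is applied.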
In our 2D-CNN model, three convolutional layers are utilized.
To preserve the vital information of each pixel, we
exclude the pooling layers from our 2D-CNN model. Finally,
a fully connected layer, which takes the feature maps of the
last 2D convolutional layer as inputs, is constructed to make
the prediction. Here, we leverage the softmax function to
compute the probability for each class. The softmax function
is an extension of the sigmoid function to multi-class
classification; it maps the class scores to probabilities, and the
class with the maximum probability is taken as the prediction. Moreover, the
cross entropy function is adopted as the objective function to
drive the back-propagation based training process.
Let W and b denote all the parameters of our 2D-CNN
model. We train the 2D-CNN model by maximizing the likelihood,
and transform the scores $f_c(I_{i,j,k}; (W, b))$ of each class
of interest $c \in \{1, \ldots, N\}$ into conditional probabilities by
using the following softmax function [42]:
$$p(c \mid I_{i,j,k}; (W, b)) = \frac{e^{f_c(I_{i,j,k}; (W, b))}}{\sum_{d \in \{1, \ldots, N\}} e^{f_d(I_{i,j,k}; (W, b))}} \tag{3}$$
The parameters (W, b) are learned by minimizing the negative
log-likelihood over the training set:
$$L(W, b) = -\sum_{I_{i,j,k}} \ln p(l_{i,j,k} \mid I_{i,j,k}; (W, b)) \tag{4}$$
where $l_{i,j,k}$ is the correct class label of the pixel at position
(i, j) of the image $I_k$. To optimize the objective function,
stochastic gradient descent (SGD) with back-propagation is
applied.
Fig. 3. The 3D-CNN model comprising three 3D convolution operations with
the corresponding kernel size (K) and the number of feature maps (m) for
each convolutional layer. Figure labels: input data with D spectral bands,
N (width), and M (height); 3D convolution operation.
At test time, the output layer of the proposed model
predicts the label of the pixel located at (i, j) of the image I
by using the argmax function:
$$\hat{l}_{i,j} = \arg\max_{c \in \{1, \ldots, N\}} p(c \mid I_{i,j}; (W, b)) \tag{5}$$
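The softmax of Eq. (3), the loss of Eq. (4), and the prediction rule of Eq. (5) can be sketched as below. The class scores are hypothetical stand-ins for the network outputs $f_c$, and classes are indexed from 0 in the code rather than from 1 as in the text. Subtracting the maximum score before exponentiation is a standard numerical safeguard added here; it does not change the probabilities.

```python
import math


def softmax(scores):
    """Eq. (3): turn class scores f_c into conditional probabilities."""
    shift = max(scores)  # avoids overflow; the result is unchanged
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def negative_log_likelihood(batch_scores, labels):
    """Eq. (4): sum of -ln p(correct label) over the training patches."""
    return -sum(math.log(softmax(scores)[label])
                for scores, label in zip(batch_scores, labels))


def predict(scores):
    """Eq. (5): index of the most probable class."""
    probs = softmax(scores)
    return max(range(len(probs)), key=probs.__getitem__)


# Hypothetical scores for two patches and three classes.
batch_scores = [[2.0, 1.0, 0.1], [0.0, 0.0, 0.0]]
labels = [0, 2]
loss = negative_log_likelihood(batch_scores, labels)
predicted = predict(batch_scores[0])
```

For the second patch all scores are equal, so each class receives probability 1/3 and that patch contributes ln 3 to the loss.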
B. 3D-CNN model
One main difference between a hyperspectral image and a
conventional image is that the former is captured by scanning
the same region in many different spectral bands, while the latter
is not. As the images formed by different spectral bands may be
correlated, e.g., adjacent spectral bands may result
in similar images, it is desirable to take spectral
correlations into account. Though the 2D-CNN model can utilize
the spatial context, it ignores the spectral correlations.
Hence, we develop a 3D-CNN model to address this issue.
As shown in Fig.3, the operational details of our 3D-CNN
model are quite similar to those of the 2D-CNN model. The
main difference is that the 3D-CNN model has one extra phase
of reordering. In this phase, we rearrange the D hyperspectral
bands in ascending order. By doing so, images
of similar spectral bands are sequentially ordered, which can
preserve their correlations under a spectral context. The patch
extraction phase and the label identification phase of the two
models are quite similar. For the feature extraction phase, a
3D convolution operator instead of 2D convolution operator is
applied to the 3D-CNN model.
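The reordering phase can be sketched as a simple permutation of the band axis. The text does not state which quantity defines the ascending order; the sketch assumes it is the centre wavelength of each band, and both the `wavelengths` list and the helper name `reorder_bands` are hypothetical.

```python
def reorder_bands(patch, wavelengths):
    """Permute the band axis of patch[row][col][band] into ascending wavelength."""
    order = sorted(range(len(wavelengths)), key=wavelengths.__getitem__)
    return [[[pixel[b] for b in order] for pixel in row] for row in patch]


# Hypothetical 1-by-2 patch with three bands stored out of order.
wavelengths = [900.0, 450.0, 650.0]          # assumed centre wavelengths in nm
patch = [[[30, 10, 20], [33, 11, 22]]]
ordered = reorder_bands(patch, wavelengths)  # [[[10, 20, 30], [11, 22, 33]]]
```

After reordering, neighbouring positions along the third axis correspond to neighbouring bands, which is what lets a 3D kernel pick up spectral correlations.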
More specifically, the 3D convolution operation is formu-
lated as follows:
$$v_{ij}^{xyz} = F\left(b_{ij} + \sum_{m} \sum_{p=0}^{N_i-1} \sum_{q=0}^{M_i-1} \sum_{r=0}^{D_i-1} w_{ijm}^{pqr}\, v_{(i-1)m}^{(x+p)(y+q)(z+r)}\right) \tag{6}$$
where $D_i$ is the size of the 3D kernel along the spectral
dimension, and $j$ indexes the kernels of the $i$th layer; $w_{ijm}^{pqr}$
is the value at the $(p, q, r)$th position of the kernel connected
to the $m$th feature map (a cube) of the preceding layer. Again,
the ReLU function is adopted as the activation function $F$.
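A minimal sketch of Eq. (6) for a single input cube and a single kernel is given below, so the sum over $m$ reduces to one term. Stride 1 and no padding are assumed, and the all-ones cube and kernel are hypothetical values chosen so that the result is easy to check by hand.

```python
def conv3d_single(cube, kernel, bias):
    """Eq. (6) for one input cube and one kernel (stride 1, no padding, ReLU).

    cube[x][y][z]   : input feature cube; z is the spectral axis
    kernel[p][q][r] : 3D kernel of size N_i by M_i by D_i
    """
    n_i, m_i, d_i = len(kernel), len(kernel[0]), len(kernel[0][0])
    out_x = len(cube) - n_i + 1
    out_y = len(cube[0]) - m_i + 1
    out_z = len(cube[0][0]) - d_i + 1
    return [[[max(0.0, bias + sum(kernel[p][q][r] * cube[x + p][y + q][z + r]
                                  for p in range(n_i)
                                  for q in range(m_i)
                                  for r in range(d_i)))
              for z in range(out_z)]
             for y in range(out_y)]
            for x in range(out_x)]


# Hypothetical 3-by-3-by-4 cube of ones and a 2-by-2-by-2 kernel of ones.
cube = [[[1.0] * 4 for _ in range(3)] for _ in range(3)]
kernel = [[[1.0] * 2 for _ in range(2)] for _ in range(2)]
fmap = conv3d_single(cube, kernel, 0.0)
```

The output is a 2-by-2-by-3 cube whose entries all equal 8, the number of kernel weights. Unlike the 2D case, the output keeps a spectral axis, although a shorter one.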
The 3D convolution operation is illustrated in Fig. 3. We
can see that the 3D convolution operation is applied to a 3D
patch step by step, e.g., from top to bottom, from left to right,
and from inner to outer. In each step, a convolution scalar
is produced and placed at the corresponding position of the
feature map (shown as red lines in Fig. 3). This operation
produces a smaller 3D cube as a feature map. Training a 3D-CNN
model is similar to training a 2D-CNN model:
we utilize the softmax function to compute the probability of
each class, and we formulate the training process as an
optimization problem that maximizes the log-likelihood of the
training data. In addition, stochastic gradient descent (SGD)
with back-propagation is applied to network training.
C. R-2D-CNN model
As noted above, though the 2D-CNN model can exploit the
spatial context, it may introduce unwanted noise because the
classification of a pixel relies on the features of a small patch
surrounding the pixel rather than the features directly attached
to the pixel. To better exploit the spatial context, we design a
recurrent 2D-CNN model (R-2D-CNN). In particular, the R-
2D-CNN model constructs multiple shrunk patches as multi-
level instances (see Fig.4), and leverages a multi-scale deep
neural network to fuse the multi-level instances for prediction.
For clarity, we denote the instances as 1-st level, 2-nd level,
· · ·, and the P-th level, corresponding from the bigger patches
to the smaller patches, where the P-th level often corresponds
to the pixel for classification, i.e., a 1-by-1 patch. The R-
2D-CNN deep neural network comprises a recurrent CNN
structure, where a basic 2D-CNN block is reused multiple
times. More specifically, it uses the basic 2D-CNN block to
extract the feature maps for the 1-st level instances at the
beginning. These feature maps are then concatenated with the
2-nd level instances, which are fed to the same 2D-CNN block
for extracting the next level feature maps. This procedure is
repeated until the P-th level instances are fused. Finally, a
softmax layer is applied to compute the probability of
each class. By utilizing the multiple shrunk patches, we can
consider the spatial context information while focusing
more on the information closer to the pixel being classified.
Hence, the unwanted noise can be reduced.
The main architecture of the R-2D-CNN model is illustrated
in Fig. 5. At the $p$-th level ($1 \le p \le P$), the network is fed with an input
“feature image” $F^p$ of $H + D$ 2D images ($H$ is the number of
feature maps produced by the 2D-CNN block),
which comprises $H$ feature maps of the $(p-1)$-th instances and $D$
hyperspectral images of the $p$-th instances.
Formally, the procedure is defined as follows:
$$F^p = [F(F^{p-1}, I^p_{i,j,k})], \quad F^1 = [0, I_{i,j,k}] \tag{7}$$
where $I_{i,j,k}$ stands for the input patch surrounding the pixel
at location (i, j) of the training image k. At the first level, the
network only takes the original image as the input because
there is no instance from a previous level to produce the
feature maps. Though the R-2D-CNN model is multi-level,
the model complexity does not increase with
the number of levels. The reason is that the parameters
of the different levels are shared (as shown in Fig. 5).
Fig. 4. (a) Context input patch of the “plain” network and (b) recurrent context
input patches. The size of the context input patch in (b) increases as the number of instances in
the recurrent 2D convolutional network increases.
Fig. 5. The R-2D-CNN model comprising two basic 2D-CNN blocks with
parameters shared across levels.
Model training of the R-2D-CNN model is the same as
that of the 2D-CNN model, except that gradients are computed by
using the backpropagation through time (BPTT) algorithm [43] during the back-propagation
process. More specifically, we first unfold the network as
shown in Fig. 5, and then train the model with the BPTT
algorithm. However, in contrast to the 2D-CNN model, we
have to learn the network parameters (W, b) with a new loss
function due to the multi-level recurrent architecture. The loss
function is defined according to Eq. (7):
$$L(F) + L(F \circ F) + \ldots + L(F \circ^{p} F), \tag{8}$$
where $L(F)$ is a shorthand for the negative log-likelihood defined in
Eq. (4) of the 2D-CNN model, and $\circ^{p}$ denotes the composition
operation performed $p$ times. Thus, each network instance is
trained to produce the correct label at the location (i, j). In this
manner, the R-2D-CNN model is able to learn to correct the
mistakes produced by the earlier iterations. As a by-product,
the R-2D-CNN model can also learn label dependencies, that
is, predicting the label of an instance based on the label of the
previous instance around location (i, j).
It is worth noting that the sizes of multi-level instances in
a R-2D-CNN model must be carefully designed so that the
instances can be concatenated with the feature maps of the
previous instances. To this end, we first need to establish how
the size of a feature map changes when a 2D
convolution layer is applied to it. Let $sz_{m-1}$ denote the size of the feature
map of the $(m-1)$-th convolution layer. Then, the size of
the feature map produced by the $m$-th convolution layer is
computed as follows:
$$sz_m = \frac{sz_{m-1} - kW_m}{dW_m} + 1 \tag{9}$$
where $kW_m$ is the size of the convolution kernel of the $m$th
layer, and $dW_m$ is the stride size. By Eq. (9), we can compute
the size of a feature map produced by our 2D-CNN block.
Hence, we can determine the appropriate sizes of the instances
at the different levels.
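Eq. (9) can be applied layer by layer to work out which instance sizes are compatible. The sketch below assumes a hypothetical block of three 3-by-3 convolutions with stride 1 and a 13-by-13 first-level patch; the actual kernel sizes and patch sizes are not given in this section. Rejecting sizes that do not divide evenly is a choice made here, since Eq. (9) does not say how such cases are handled.

```python
def feature_map_size(input_size, kernel_sizes, strides=None):
    """Apply Eq. (9) once per convolution layer and return the final size."""
    if strides is None:
        strides = [1] * len(kernel_sizes)
    size = input_size
    for kernel, stride in zip(kernel_sizes, strides):
        if size < kernel or (size - kernel) % stride != 0:
            raise ValueError("kernel and stride do not fit a map of size %d" % size)
        size = (size - kernel) // stride + 1
    return size


# Hypothetical block: three 3-by-3 convolutions with stride 1.
block = [3, 3, 3]
level_sizes = [13]                      # assumed size of the 1st-level patch
while level_sizes[-1] > 1:
    level_sizes.append(feature_map_size(level_sizes[-1], block))
```

Under these assumed settings, a 13-by-13 first-level patch yields 7-by-7 feature maps, so the second-level instance would have to be 7-by-7, and the third-level instance would be the 1-by-1 central pixel.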
D. R-3D-CNN model
To better utilize the spatial and the spectral contexts of
hyperspectral images, we design the recurrent 3D-CNN model
(R-3D-CNN). Like the R-2D-CNN model, the R-3D-CNN
model is also underpinned by multi-level recurrent neural
networks which shrink a patch gradually to form multi-level
instances. There are two main differences between the R-3D-
CNN model and the R-2D-CNN model. The first difference is
that the former utilizes 3D convolution operators whereas the
latter uses its 2D counterparts. Hence, the R-3D-CNN model
can be regarded as an extension of the 3D-CNN model in a
recurrent manner. The second difference is that the instances of
the next level need to be preprocessed before they are concatenated with
the feature maps generated from the current level. The reason
is that 3D convolution layers change the
length of the spectral dimension. Hence, we have to preprocess the
instances of the next level with some 3D convolution operations
along the spectral channels to match the changed sizes.
Fig.6 depicts an example of the proposed R-3D-CNN
model. The model consists of a multi-level recurrent neural
network with Pmulti-level instances. As for the R-2D-CNN
model, a ”plain” 3D convolution network is applied to extract
the corresponding feature maps, which are then concatenated
with the next level instance to form new feature maps at each
level. This procedure is repeated until all multi-level instances
(patches) are incorporated. To ensure the consistency of the
sizes of feature maps of the current level and the sizes of the
instances at the next level, a preprocessing step is introduced to
the spectral channels. Finally, a softmax layer is applied, and a
cross entropy objective function is adopted. The optimization
process is again performed by using the BPTT algorithm. As
for the R-2D-CNN model, the complexity of the R-3D-CNN
model remains moderate because the recurrent structure shares
the same network parameters across multiple levels. As for the
3D-CNN model, we need to reorder the hyperspectral images
according to the ordering of spectral bands. Also, the size of
multi-level instances must be carefully determined as for the
R-2D-CNN model.
IV. EXPERIMENTAL RESULTS
We chose to use six publicly available hyperspectral image
data sets for evaluating the performance of the proposed
models. For a comparative evaluation, we also adopted SVM-
CK, GCK, LORSAL, SMLR-SpATV, MFL, SVM-3D, SVM-
3DG, and CNN-MRF as the baselines. For the performance
metrics, we used the overall accuracy over all test samples, denoted
as OA, and the average of the per-class accuracies, denoted as
AA. We ran all the models on a desktop PC equipped with an
Intel Core 7 Duo CPU (at 3.40 GHz) with 12 GB of RAM,
and two GTX 1080Ti GPUs (16 GB of ROM) were also used.
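The two metrics can be made concrete with a small sketch. It assumes the usual definitions, namely OA as the fraction of correctly classified test samples and AA as the unweighted mean of the per-class accuracies; the function name and the toy labels are illustrative only.

```python
import numpy as np

def overall_and_average_accuracy(y_true, y_pred, num_classes):
    """Return (OA, AA) computed from a confusion matrix."""
    confusion = np.zeros((num_classes, num_classes), dtype=np.int64)
    np.add.at(confusion, (y_true, y_pred), 1)   # rows: truth, columns: prediction
    correct = np.diag(confusion)
    per_class_total = confusion.sum(axis=1)
    oa = correct.sum() / confusion.sum()
    present = per_class_total > 0               # skip classes absent from the test set
    aa = np.mean(correct[present] / per_class_total[present])
    return float(oa), float(aa)

# Toy example with one large class and two small classes.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 0, 0, 0, 0, 0, 1, 0, 2, 0])
oa, aa = overall_and_average_accuracy(y_true, y_pred, num_classes=3)
print(oa, aa)
```

In the toy example the large class is classified perfectly, so OA is higher than AA; this is why both metrics are reported when class sizes are unbalanced.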
A. Data Sets
1) Indian Pines Scene
The data set was collected in 1992 by the AVIRIS sensor
over the Indian Pines test site
in north-western Indiana, USA. The hyperspectral image
contains 145 × 145 pixels in spatial dimensions, and 224
Fig. 6. The R-3D-CNN model with network parameters shared across multiple levels. The plain network is built with a small instance based on the basic
3D-CNN model while the recurrent network is built with two instances of the basic 3D-CNN model; the complexity of the model remains moderate because
of the shared parameters across multiple levels.
Fig. 7. Labeled images of different data sets: a) Indian Pines Scene.
b) Botswana Scene. c) Salinas scene. d) Pavia Centre scene. e) Pavia University
scene. f) Kennedy Space Center.
hyperspectral bands. Due to the presence of noisy bands, we
only used 200 hyperspectral bands. Specifically, the bands
covering the regions of water absorption, i.e., [104-108], [150-
163], 220, were removed. The ground truth available includes
16 classes which are not all mutually exclusive. As shown in
Fig.7 a), we randomly divided the labeled data into the training
(70%) and the testing (30%) sets for our experiment.
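A random 70%/30% division of the labeled pixels could look like the sketch below. It assumes a ground-truth map in which 0 marks unlabeled pixels and that the split is a plain random permutation without stratification; the function name, the seed, and the synthetic label map are hypothetical.

```python
import numpy as np

def split_labeled_pixels(labels, train_fraction=0.7, seed=0):
    """Randomly split the coordinates of labeled pixels into train and test sets.

    labels: 2D integer array, with 0 assumed to mean "unlabeled".
    Returns two arrays of (row, col) coordinates.
    """
    rows, cols = np.nonzero(labels)
    coords = np.stack([rows, cols], axis=1)
    order = np.random.default_rng(seed).permutation(len(coords))
    n_train = int(round(train_fraction * len(coords)))
    shuffled = coords[order]
    return shuffled[:n_train], shuffled[n_train:]

# Synthetic 145x145 label map with 700 labeled pixels in two classes.
labels = np.zeros((145, 145), dtype=np.int64)
labels[10:30, 10:30] = 1
labels[50:60, 40:70] = 2
train_xy, test_xy = split_labeled_pixels(labels)
print(train_xy.shape, test_xy.shape)
```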
2) Botswana Scene
Botswana Scene was acquired by the Hyperion sensor on
the NASA EO-1 satellite on May 31, 2001. This data set was
collected over the Okavango Delta. The hyperspectral image
contains 1476 × 1476 pixels with 224 bands, from 400 nm
to 2500 nm with an incremental step of 10 nm. As with the
Indian Pines Scene data set, we removed the noisy bands to
produce an experimental data set containing 145 bands only.
The image data set contains 14 categories. As shown in Fig.7
b), we randomly split the data set into the training (70%) and
the testing (30%) sets.
3) Salinas scene
The Salinas Scene is a hyperspectral image data set
recorded in 1992 by the AVIRIS sensor over
the Salinas Valley, California. The original images were
composed of 224 bands. We discarded 20 noisy bands,
namely bands [108-112], bands [154-167] and band 224, to
generate a hyperspectral image data set of 204 bands. For the
spatial dimensions, the scene includes 512×217 pixels. There
are 16 labeled classes in the original data set as shown in Fig.7
c).
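The band removal for the Salinas scene can be sketched as follows. The sketch assumes that the band ranges are 1-based and inclusive and that the spectral axis is the last axis of the data cube; the function name and the small all-zero cube are placeholders.

```python
import numpy as np

def remove_bands(cube, ranges):
    """Drop the listed spectral bands from a (rows, cols, bands) cube.

    ranges: iterable of (first, last) band indices, 1-based and inclusive.
    """
    keep = np.ones(cube.shape[-1], dtype=bool)
    for first, last in ranges:
        keep[first - 1:last] = False
    return cube[..., keep]

# Small stand-in for the 512x217x224 Salinas cube.
cube = np.zeros((8, 8, 224), dtype=np.float32)
salinas_noisy = [(108, 112), (154, 167), (224, 224)]
clean = remove_bands(cube, salinas_noisy)
print(clean.shape)
```

The three ranges cover 5 + 14 + 1 = 20 bands, which leaves the 204 bands stated above.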
4) Pavia Centre scene
This hyperspectral image data set was acquired
over Pavia in northern Italy. It was produced in 2001 by using the
Reflective Optics System Imaging Spectrometer (ROSIS) sensor.
The Pavia Centre scene comprises 1096 × 1096 pixels with
114 hyperspectral bands. We preprocessed these images by
removing 12 noisy bands. There are nine labeled classes in
the data set as shown in Fig.7 d).
5) Pavia University scene
This hyperspectral image data set was acquired over the University
of Pavia in Italy by using the ROSIS sensor. There are 103
hyperspectral bands in the image data set, with 610 × 340
pixels for the spatial dimensions. The image contains nine
labeled classes as shown in Fig.7 e).
6) Kennedy Space Center
The last data set, namely Kennedy Space Center (KSC),
was acquired over the KSC area in Florida by using the AVIRIS sensor
on March 23, 1996. The hyperspectral image consists of
512 × 614 pixels in the spatial dimensions, with 224 spectral
bands. After removing 48 noisy bands, we obtained 172
spectral bands. There are 13 labeled classes as shown in Fig.7
f).
B. Experimental Results
1) Results for the Indian Pines Scene
Before reporting the details of our experimental results,
we first elaborate on the various settings of the deep learning
techniques employed in our experiments. The structure of the
2D-CNN model is depicted in Fig.8. For the classification
of each pixel, a 7×7×200 patch surrounding it is first
constructed. Following this, three 2D convolution layers of size
3×3 are utilized. Moreover, the 200 spectral bands are treated as
channels. The number of filters is set to 400 for the respective
layers, and the stride is set to 1. As a result, the feature maps
produced by the first, second, and last convolution layers are
5×5×400, 3×3×400 and 1×1×400, respectively. Finally, a
softmax layer of 16 classes is deployed to classify the images.
The proposed network structure does not include the pooling
layers so as to keep as much information of each pixel as
possible. In addition, we apply SGD for network training and
set the mini-batch size to 10.
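Constructing the 7×7×200 patch around a pixel might be sketched as below. How pixels near the image border are handled is not stated here, so the reflect padding is an assumption, as are the function name and the random stand-in cube.

```python
import numpy as np

def extract_patch(cube, row, col, size=7):
    """Return the size x size x bands patch centred on pixel (row, col).

    The cube is padded by reflection so that border pixels also get a
    full patch (the padding mode is an assumption of this sketch).
    """
    half = size // 2
    padded = np.pad(cube, ((half, half), (half, half), (0, 0)), mode="reflect")
    # After padding, pixel (row, col) sits at (row + half, col + half).
    return padded[row:row + size, col:col + size, :]

# Small random stand-in for a 145x145x200 cube.
cube = np.random.default_rng(0).standard_normal((20, 20, 200))
corner = extract_patch(cube, 0, 0)
interior = extract_patch(cube, 10, 10)
print(corner.shape, interior.shape)
```

Calling the same helper with `size=13` would give the larger first-level instance used by the recurrent models.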
The structure of the 3D-CNN model is depicted in Fig.9.
Similar to the 2D-CNN model, a 7×7×200 patch is first
extracted. Next, we build eight 3D convolution layers. The
size, number, stride and feature map sizes of the 3D filters in
each layer are shown in Fig.9. Again, we exclude the pooling
layers and adopt a mini-batch size of 10. Before applying the
3D-CNN model, the hyperspectral bands are first reordered.
Fig.10 depicts the structure of the R-2D-CNN model. For
this model, the first level instance is a 13 ×13 ×200 patch.
Then, we apply a three-layer 2D-CNN block to the instance,
which results in a 7×7×400 feature map. After concatenating
this feature map with our second level instance, that is, a 7×7×200
patch, and reusing the 2D-CNN block, we obtain an
800-dimensional vector, which is connected to a softmax layer
to classify images. Again, we adopt a mini-batch size of 10
and do not utilize any pooling layers.
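The shape bookkeeping of the two-level R-2D-CNN can be traced with a small NumPy sketch. It is not the trained model: the weights are random and not shared, the ReLU activation is an assumption, the second-level instance is assumed to be the centre crop of the first, and the channel counts are scaled down by a factor of ten (20 bands and 40 filters instead of 200 and 400) to keep the example light.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def valid_conv2d(x, w):
    """Unpadded, stride-1 2D convolution followed by ReLU (ReLU is assumed).

    x: (H, W, C_in) feature map; w: (k, k, C_in, C_out) kernels.
    """
    k = w.shape[0]
    # windows has shape (H-k+1, W-k+1, C_in, k, k)
    windows = sliding_window_view(x, (k, k), axis=(0, 1))
    return np.maximum(np.einsum("hwcij,ijco->hwo", windows, w), 0.0)

def conv_block(x, out_channels, rng, layers=3, k=3):
    """Three 3x3 layers with random weights; only the shapes matter here."""
    for _ in range(layers):
        w = rng.standard_normal((k, k, x.shape[-1], out_channels)) * 0.01
        x = valid_conv2d(x, w)
    return x

rng = np.random.default_rng(0)
bands, filters = 20, 40                      # scaled down from 200 and 400
level1 = rng.standard_normal((13, 13, bands))
level2 = level1[3:10, 3:10, :]               # 7x7 centre crop of the same patch

features1 = conv_block(level1, filters, rng)            # 13 -> 11 -> 9 -> 7
merged = np.concatenate([features1, level2], axis=-1)   # channels: 40 + 20
features2 = conv_block(merged, 2 * filters, rng)        # 7 -> 5 -> 3 -> 1
vector = features2.reshape(-1)
print(features1.shape, merged.shape, vector.shape)
```

At full scale the same steps give a 7×7×400 map, a 7×7×600 concatenation, and an 800-dimensional vector for the softmax layer.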
Similarly, we construct a 13×13×200 patch and a 7×7×200
patch at the first two levels of the R-3D-CNN model. As shown
in Fig.11, we build a seven-layer 3D-CNN block, and apply it
to the first level instance. This produces a 7×7×187 feature
map. Since the spectral band dimension is changed from 200
to 187, we first apply a three-layer 3D convolution operation
to the second level instance. By doing so, the third dimension
is reduced to 187, and it can be concatenated with the feature
map of the first level instance. Then, we reuse the seven-layer
3D convolution block which produces a 1×10 ×35 feature
map. Finally, a softmax layer is applied to the resulting feature
map to determine the class label.
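The need for the spectral preprocessing step can be checked with the same size rule applied along the band axis. The kernel depths below are purely hypothetical, chosen only so that both paths reduce 200 bands to 187; the actual depths are those given in the network figure.

```python
def spectral_length(bands, kernel_depths, stride=1):
    """Number of bands left after a stack of unpadded 3D convolutions."""
    for depth in kernel_depths:
        bands = (bands - depth) // stride + 1
    return bands

# Hypothetical spectral kernel depths (not taken from the paper).
block_depths = [3, 3, 3, 3, 3, 3, 2]   # seven-layer 3D-CNN block
preprocess_depths = [5, 5, 6]          # three-layer preprocessing step

after_block = spectral_length(200, block_depths)
after_preprocess = spectral_length(200, preprocess_depths)
print(after_block, after_preprocess)
```

Concatenation is only possible when the two lengths agree, which is the constraint the preprocessing layers are designed to satisfy.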
Next, we report the experimental results based on various
data sets. Table V presents the performance of all the methods.
We observe that the R-3D-CNN model achieves the best
performance, with an OA of 99.50%. Although the
OA of the SMLR-SpATV method is 99.11%, the R-3D-CNN model
outperforms it by about 44% in terms of the reduction
of error rates (the error rate drops from 0.89% to 0.50%).
The main reason is that the R-3D-CNN model
considers both the spectral and the spatial contexts, where the
former is inferred via the 3D convolution operation and the
latter is inferred by using the multi-level recurrent structure.
In terms of AA and OA, the R-2D-CNN model is ranked
as the second best, which is followed by the SMLR-SpATV,
2D-CNN model and the 3D-CNN model. Though R-2D-CNN
ignores the spectral correlations, its recurrent structures can
effectively capture the spatial context for subsequent image
classification. Our experimental results also imply that the
spatial context is more important than the spectral correlations
for hyperspectral image classification. As shown in Table V,
the results of SVM-CK are better than those of SVM-3D, SVM-3DG
and CNN-MRF. However, its performance is much worse than
that of the various deep learning techniques. The first reason is that
the SVM-CK classifier ignores the spatial and the spectral
contexts. The second reason is that the SVM-CK classifier
cannot effectively capture the nonlinear relationships between
the features and the class labels of hyperspectral images. As
a promising classification method, the GCK achieves
performance comparable to that of the 2D-CNN and the 3D-CNN
methods because it can extract EMAP information pertaining
to the spectral and the spatial contexts. Fig.12 provides a visual
comparison of the performance of all methods.
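The "reduction of error rates" quoted for these comparisons can be reproduced with a few lines. The sketch assumes the measure is the relative drop in error rate, that is, (baseline error − model error) / baseline error with error = 100% − OA; the function name is illustrative.

```python
def relative_error_reduction(baseline_oa, model_oa):
    """Relative reduction in error rate, with OAs given in percent."""
    baseline_error = 100.0 - baseline_oa
    model_error = 100.0 - model_oa
    return (baseline_error - model_error) / baseline_error

# Indian Pines: SMLR-SpATV (99.11%) versus R-3D-CNN (99.50%).
reduction = relative_error_reduction(99.11, 99.50)
print(f"{100 * reduction:.1f}%")
```

Under this definition, a gain of only 0.39 percentage points in OA corresponds to removing a little under half of the remaining errors, which is why the measure is informative when all methods are close to 100%.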
2) Results for the Pavia University Scene
In this experiment, the structures of the deep learning models
were similar to those applied in the Indian Pines
Scene experiment. The only difference was that the numbers of
parameters were adjusted to match the 102 hyperspectral bands
of our refined data set. Recall that the Indian Pines data set
had 200 bands. Table VI presents the experimental results of
all the methods based on the Pavia University Scene data set.
Again, we can see that the proposed R-3D-CNN performs the
best, followed by the R-2D-CNN model, the SMLR-SpATV
method, the MFL method and the GCK method. The OA of the
R-3D-CNN model is 99.97%, which is 0.49 percentage points higher than that
of the GCK (99.48%). The R-3D-CNN model outperforms
the GCK method by more than 94% when we consider the
reduction of error rates. The SVM-3DG method, the 3D-CNN
and the 2D-CNN models achieve comparable results. The
LORSAL classifier produces the worst performance among
all the methods. Fig.13 visualizes the classification results of
all the methods.
3) Results for the Botswana Scene
As with the earlier scenes, we only modified the number of
parameters for our deep learning models. Table VII presents
the experimental results of all the methods based on the
Botswana Scene data set. We can see that the proposed
R-3D-CNN model and the MFL method achieve the highest
performance, followed by the SVM-3DG, the GCK method
and the R-2D-CNN model. The OA of the R-3D-CNN model
is 99.38%, which is 0.31 percentage points higher than that of the MFL
(99.07%). The R-3D-CNN model outperforms the MFL
method by about 33% in terms of the reduction of
error rates. Again, the other models such as the 3D-CNN,
the SMLR-SpATV and the 2D-CNN models perform better
than the LORSAL classifier which produces the worst result.
Fig.14 visualizes the classification results of all the methods.
4) Results for the Salinas Scene
Table VIII presents the experimental results of all the
methods based on the Salinas scene data set. The R-3D-CNN
model achieves the best performance, followed by the
GCK method, the MFL method and the R-2D-CNN model.
The SMLR-SpATV, the 3D-CNN and the 2D-CNN models
also achieve promising results. The OA of the R-3D-CNN
model is 99.80%, which is 0.46 percentage points higher than that of the
GCK (99.34%). The R-3D-CNN model outperforms the GCK
method by about 70% when we consider the reduction of
error rates. Again, the LORSAL classifier produces the worst
result among all the methods. Due to memory limitations of
our computer, we could not run the CNN-MRF classifier on
TABLE I
CLASSIFICATION RESULTS OF INDIAN PINES SCENE.
Class SVM-CK[17] GCK[23] LORSAL[26] SMLR-SpATV[27] MFL[28] SVM-3D[29] SVM-3DG[29] CNN-MRF[35] 2D-CNN 3D-CNN R-2D-CNN R-3D-CNN
1 85.71±0.4 92.86±0.5 85.71±0.1 92.86±0.2 85.71±0.3 57.14±0.4 64.29±0.4 84.62±0.2 71.72±1.0 85.71±0.4 78.57±0.1 100
2 86.82±0.3 98.12±0.4 89.88±0.3 98.59±0.2 96.24±0.2 79.29±0.3 80.00±0.2 65.65±0.3 95.85±0.6 96.46±0.3 99.29±0.1 100
3 86.12±0.2 94.29±0.3 82.04±0.1 98.37±0.2 92.65±0.3 71.02±0.3 73.47±0.4 96.36±0.4 95.90±0.3 97.13±0.1 98.77±0.3 100
4 88.40±0.5 94.20±0.3 82.61±0.3 100 97.10±0.4 97.10±0.4 97.10±0.5 88.73±0.3 73.91±0.1 98.55±0.3 100 100
5 95.10±0.4 96.50±0.4 91.61±0.3 98.60±0.2 97.20±0.3 95.80±0.1 91.61±0.3 93.06±0.2 97.20±0.2 97.90±0.2 97.90±0.2 100
6 98.61±0.7 99.08±0.1 99.08±0.2 99.54±0.4 99.54±0.2 98.16±0.3 97.70±0.4 99.09±0.4 96.31±0.3 97.68±0.5 99.53±0.3 100
7 75.00±0.1 100 75.00±0.1 100 100 75.00±0.3 62.50±0.3 50.00±0.1 100 100 87.50±0.2 100
8 98.60±0.1 100 97.90±0.2 100 100 99.30±0.2 100 95.10±0.4 100 99.30±0.07 100 100
9 100 83.33±0.3 83.33±0.1 83.33±0.3 100 100 100 100 100 100 100 100
10 87.19±0.7 93.43±0.4 85.81±0.5 98.27±0.3 92.04±0.2 75.09±0.4 75.78±0.4 76.63±0.2 97.20±0.6 98.26±0.3 98.95±0.3 99.65±0.3
11 91.01±0.8 98.09±0.8 88.83±0.1 99.59±0.2 98.50±0.3 88.01±0.2 95.37±0.4 97.55±0.2 99.04±0.4 98.77±0.4 99.45±0.2 99.31±0.2
12 94.84±0.6 94.89±0.4 88.64±0.2 99.43±0.3 96.02±0.3 85.80±0.4 86.36±0.3 76.27±0.3 95.45±0.6 97.15±0.4 99.43±0.2 98.85±0.2
13 100 100 100 100 98.36±0.3 98.36±0.3 98.36±0.2 100 100 96.72±0.1 98.36±0.1 100
14 96.81±0.7 99.20±0.2 96.02±0.2 100 99.47±0.2 97.88±0.4 97.08±0.4 99.47±0.4 98.94±0.3 99.46±0.6 100 99.73±0.2
15 81.57±0.8 95.61±0.6 83.33±0.3 97.37±0.2 97.37±0.3 86.84±0.2 100 95.65±0.2 94.73±0.6 93.80±0.5 98.24±0.2 96.46±0.3
16 100 100 85.71±0.1 100 100 82.14±0.2 82.14±0.3 100 100 100 96.42±0.8 96.42±0.5
AA 91.62±0.3 96.22±0.5 88.47±0.2 97.87±0.2 96.89±0.2 85.23±0.3 87.61±0.1 88.26±0.3 96.37±0.3 97.31±0.2 97.03±0.3 99.42±0.3
OA 91.51±0.2 97.44±0.4 90.10±0.1 99.11±0.2 97.05±0.3 86.55±0.2 89.44±0.2 88.95±0.2 97.08±0.2 98.92±0.4 99.19±0.3 99.50±0.3
TABLE II
CLASSIFICATION RESULTS OF THE PAVIA UNIVERSITY SCENE.
Class SVM-CK[17] GCK[23] LORSAL[26] SMLR-SpATV[27] MFL[28] SVM-3D[29] SVM-3DG[29] CNN-MRF[35] 2D-CNN 3D-CNN R-2D-CNN R-3D-CNN
1 99.53±0.3 99.64±0.4 91.20±0.2 99.85±0.2 100 98.14±0.2 99.45±0.2 99.60±0.4 91.65±0.2 98.70±0.3 99.34±0.3 100
2 98.59±0.5 99.85±0.3 96.92±0.1 100 99.93±0.3 99.30±0.2 99.86±0.3 98.09±0.4 99.45±0.1 99.77±0.4 99.96±0.4 100
3 85.21±0.3 97.61±0.5 64.07±0.3 100 93.64±0.2 88.08±0.4 87.12±0.1 76.31±0.2 92.09±0.3 97.94±0.4 98.88±0.2 100
4 96.52±0.2 98.62±0.2 88.57±0.4 93.58±0.4 98.59±0.5 99.02±0.2 99.67±0.3 96.08±0.4 87.26±0.4 94.55±0.2 93.91±0.3 99.89±0.1
5 99.75±0.5 99.34±0.7 99.75±0.2 100 99.50±0.3 100 100 99.75±0.2 91.05±0.2 97.77±0.1 98.27±0.2 100
6 92.24±0.4 99.78±0.5 57.76±0.3 100 99.67±0.3 96.82±0.1 99.07±0.3 88.66±0.3 98.61±0.4 99.60±0.2 99.94±0.4 100
7 91.71±0.1 99.41±0.5 59.05±0.2 100 99.75±0.2 95.48±0.3 95.73±0.2 83.71±0.1 90.72±0.1 98.00±0.2 98.49±0.2 100
8 93.30±0.2 98.52±0.4 80.45±0.3 99.82±0.2 99.10±0.4 95.20±0.4 96.38±0.4 92.75±0.2 93.57±0.4 98.55±0.3 99.81±0.2 100
9 99.65±0.5 99.65±0.3 97.89±0.4 95.77±0.3 100 98.94±0.2 98.94±0.2 98.59±0.4 86.62±0.2 81.34±0.2 94.72±0.3 98.94±0.3
AA 94.52±0.3 99.21±0.3 81.74±0.2 98.78±0.4 98.91±0.2 96.78±0.4 97.36±0.2 92.62±0.2 98.78±0.4 96.25±0.3 98.15±0.2 99.87±0.3
OA 94.72±0.2 99.48±0.2 86.74±0.2 99.41±0.2 99.42±0.2 97.80±0.4 98.62±0.1 95.16±0.3 95.46±0.2 98.49±0.2 99.19±0.2 99.97±0.2
TABLE III
CLASSIFICATION RESULTS OF THE BOTSWANA SCENE.
Class SVM-CK[17] GCK[23] LORSAL[26] SMLR-SpATV[27] MFL[28] SVM-3D[29] SVM-3DG[29] CNN-MRF[35] 2D-CNN 3D-CNN R-2D-CNN R-3D-CNN
1 100 100 100 100 100 100 100 100 98.77±0.4 100 100 100
2 100 100 90.00±0.1 100 100 100 100 100 93.33±0.3 93.33±0.2 100 100
3 98.66±0.8 100 93.33±0.3 100 97.33±0.3 98.67±0.3 100 98.67±0.3 97.33±0.3 94.67±0.3 98.66±0.2 100
4 96.87±0.5 99.48±0.6 89.06±0.3 100 100 98.44±0.2 100 79.69±0.4 98.44±0.5 96.88±0.3 96.87±0.3 100
5 92.59±0.2 93.17±0.2 76.54±0.2 97.53±0.1 97.53±0.1 97.53±0.3 98.77±0.3 90.00±0.1 91.36±0.3 96.30±0.3 96.29±0.2 97.53±0.3
6 90.12±0.4 93.97±0.5 58.02±0.1 98.77±0.3 100 92.59±0.4 88.89±0.4 90.00±0.2 100 98.63±0.1 100 100
7 100 100 98.68±0.4 100 100 100 100 92.21±0.1 100 98.69±0.3 100 100
8 100 100 86.67±0.3 98.33±0.2 100 100 100 90.00±0.3 100 95.00±0.1 96.66±0.2 98.33±0.3
9 96.80±0.6 98.33±0.7 78.72±0.1 100 96.81±0.2 97.87±0.4 100 96.81±0.1 77.66±0.8 95.75±0.2 100 98.94±0.4
10 94.59±0.3 100 74.32±0.1 100 100 97.30±0.4 100 87.84±0.2 96.81±0.4 100 98.64±0.2 100
11 92.30±0.4 97.54±0.4 90.11±0.2 100 98.90±0.3 96.70±0.4 98.90±0.4 100 98.65±0.3 100 100 100
12 94.93±0.5 100 90.74±0.2 98.15±0.3 100 100 100 64.81±0.1 100 100 96.29±0.4 98.14±0.4
13 91.53±0.3 99.59±0.7 93.67±0.3 100 100 100 100 95.00±0.3 98.73±0.5 100 100 100
14 100 96.00±0.1 32.14±0.2 42.86±0.2 96.43±0.2 96.43±0.3 96.43±0.2 53.57±0.3 92.86±0.3 92.86±0.4 85.71±0.4 96.43±0.3
AA 96.79±0.3 98.33±0.6 82.29±0.4 95.40±0.4 99.07±0.3 98.25±0.3 98.78±0.4 88.47±0.3 97.21±0.3 97.30±0.3 98.89±0.3 99.24±0.2
OA 96.28±0.1 98.21±0.2 84.09±0.3 97.83±0.2 99.07±0.4 98.14±0.2 98.76±0.3 90.70±0.4 97.60±0.5 97.21±0.1 98.54±0.2 99.38±0.2
Fig. 8. The 2D-CNN network for hyperspectral remote sensing classification (the stride of each layer is 1).
Fig. 9. The 3D-CNN network for remote sensing hyperspectral image classification.
Fig. 10. The R-2D-CNN network for remote sensing hyperspectral image classification (the stride of each layer is 1).
this data set. Fig.15 visualizes the classification results of all
the methods.
5) Results for the Pavia Centre Scene
Table IX presents the experimental results of all the methods
based on the Pavia Centre Scene data set. The results are quite
different from those obtained based on the previous four data
sets. We observe that R-2D-CNN performs the best, followed
by the SVM-3DG and the MFL classifiers. The R-2D-CNN
model outperforms the SVM-3DG method by about
77% in terms of the reduction of error rates. The R-3D-CNN
model, which achieves the best performance based on
the previous data sets, produces unsatisfactory results when
compared to those of the R-2D-CNN model and the SVM-CK
classifier. The reason may be that the R-3D-CNN model fuses
the spectral and the spatial information by using a 3D operator,
whereas the Pavia Centre scene contains only 102
spectral bands, fewer than the other data sets. On the
other hand, the 2D-CNN and the 3D-CNN models perform the
worst among all the methods because it is difficult for these
models to classify the 3rd and the 9th classes due to the
limited number of instances and channels. Since methods
such as the GCK, the SMLR-SpATV, and CNN-MRF require
more RAM than our computer provides, we could not
obtain their performance on this data set. Fig.16 shows the
classification results of all the methods.
6) Results for the Kennedy Space Center Scene
Table X presents the experimental results of all the methods
based on the Kennedy Space Center Scene data set. Again,
we observe that the proposed R-3D-CNN performs the best,
followed by the R-2D-CNN model and the GCK model. The
R-3D-CNN model outperforms the GCK method by about
85% in terms of the reduction of error rates. The 3D-CNN model, the MFL
method, and the SVM-3DG model achieve comparable results,
followed by the CNN-MRF method, the 2D-CNN model,
and the SVM-CK method. The SVM-3D classifier produces
the worst result, and the LORSAL method outperforms the
Fig. 11. The R-3D-CNN network for remote sensing hyperspectral image classification.
[Fig. 12 panels: Ground truth, SVM-CK, GCK, LORSAL, SMLR-SpATV, MFL, SVM-3D, SVM-3DG, CNN-MRF, 2D-CNN, 3D-CNN, R-2D-CNN, R-3D-CNN]
Fig. 12. Classification maps and overall classification accuracies obtained for the AVIRIS Indian Pines data set (overall accuracies are reported in parentheses).
SVM-3D method by 2% in terms of OA. Fig.17 shows the
classification results of all the methods.
C. Convergence Speed Comparison
Fig.18 plots the accuracies of different deep learning models
against the number of iterations based on the six data sets. We
observe that the R-3D-CNN model converges in fewer
iterations than the other models, the only
exception being the Salinas data set. The faster
convergence of the R-3D-CNN model is attributed
to the recurrent structure and the 3D convolutional operation.
Specifically, the feature maps that are extracted by the
R-3D-CNN model contain richer contextual information of the
images, which leads to a quicker convergence of model
training.
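One simple way to compare convergence speed from accuracy curves is to record the first iteration at which each curve reaches a target accuracy. The sketch below uses this criterion as an assumption; the threshold, the helper name, and the two synthetic curves are hypothetical and are not measurements from the experiments.

```python
import numpy as np

def iterations_to_reach(iterations, accuracies, threshold):
    """First iteration at which the accuracy reaches the threshold, or None."""
    hits = np.nonzero(np.asarray(accuracies) >= threshold)[0]
    return int(iterations[hits[0]]) if hits.size else None

# Synthetic accuracy curves sampled every 50 iterations (illustrative only).
iterations = np.arange(0, 1001, 50)
fast_curve = 1.0 - 0.8 * np.exp(-iterations / 100.0)
slow_curve = 1.0 - 0.8 * np.exp(-iterations / 300.0)

fast_it = iterations_to_reach(iterations, fast_curve, threshold=0.95)
slow_it = iterations_to_reach(iterations, slow_curve, threshold=0.95)
print(fast_it, slow_it)
```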
D. The Impact of the Size of Training Samples
In this experiment, we examined how the performance of the
proposed deep learning models changed with the size
of the training set. To this end, we varied the proportion
of training samples from 10% to 70%, and reported the OA
achieved by all methods. Fig.19 shows the results based on the six
data sets. From Fig. 19, we can make two important observations.
First, for the conventional classifiers, i.e., GCK, MFL,
SVM-CK, SVM-3D, SVM-3DG, SMLR-SpATV, and LORSAL,
we find that their classification performance is insensitive to
the number of training samples, especially on the Botswana
Scene, Salinas Scene, Pavia Centre Scene, Pavia University
Scene, and Kennedy Space Center data sets. Promising results
are achieved when 10% of the training samples are utilized for
these methods, and feeding more training samples to these
methods only leads to marginal performance improvement.
Among the conventional methods, the GCK and the MFL
TABLE IV
CLASSIFICATION RESULTS OF THE SALINAS SCENE.
Class SVM-CK[17] GCK[23] LORSAL[26] SMLR-SpATV[27] MFL[28] SVM-3D[29] SVM-3DG[29] CNN-MRF[35] 2D-CNN 3D-CNN R-2D-CNN R-3D-CNN
1 99.83±0.3 100 96.85±0.2 99.00±0.3 100 100 100 - 99.50±0.2 97.68±0.3 100 99.83±0.2
2 100 99.82±0.2 98.93±0.2 99.82±0.1 100 99.82±0.2 99.91±0.3 - 99.82±0.5 99.46±0.2 99.46±0.3 99.91±0.4
3 100 100 80.24±0.2 95.61±0.1 100 99.49±0.4 99.49±0.4 - 98.31±0.3 97.80±0.5 99.49±0.1 99.66±0.2
4 99.04±0.4 99.76±0.3 99.28±0.4 99.76±0.3 99.52±0.1 99.04±0.2 99.28±0.4 - 97.61±0.4 97.13±0.3 97.13±0.2 97.37±0.4
5 99.75±0.5 99.50±0.1 98.13±0.2 99.75±0.2 98.63±0.3 99.63±0.3 99.75±0.2 - 98.50±0.5 98.80±0.4 99.13±0.3 99.50±0.1
6 100 100 99.92±0.1 100 99.83±0.3 100 100 - 99.75±0.3 98.91±0.3 99.07±0.3 100
7 99.91±0.1 99.81±0.2 99.16±0.3 99.72±0.1 99.44±0.2 100 100 - 98.42±0.4 96.55±0.4 99.81±0.4 99.53±0.1
8 92.69±0.5 96.54±0.5 88.10±0.4 96.98±0.4 97.93±0.3 92.28±0.3 94.11±0.2 - 99.47±0.4 97.28±0.3 99.91±0.2 99.97±0.4
9 100 100 99.25±0.3 100 100 99.62±0.1 99.62±0.1 - 99.89±0.2 99.84±0.2 99.84±0.2 100
10 99.08±0.4 99.80±0.3 84.33±0.2 96.64±0.2 99.69±0.4 97.55±0.3 98.88±0.4 - 99.70±0.1 98.78±0.2 99.18±0.2 99.90±0.2
11 100 99.69±0.4 85.89±0.4 94.36±0.3 99.69±0.3 98.75±0.3 98.75±0.2 - 99.37±0.5 99.06±0.3 99.06±0.3 100
12 100 100 100 100 100 100 100 - 98.09±0.6 99.14±0.4 98.27±0.5 99.65±0.4
13 99.64±0.4 99.27±0.3 98.91±0.2 99.27±0.4 98.55±0.3 98.55±0.3 98.55±0.3 - 96.00±0.1 90.55±0.5 98.55±0.4 100
14 98.75±0.5 99.07±0.4 94.08±0.4 99.38±0.3 95.64±0.2 96.57±0.2 98.75±0.2 - 96.89±0.3 93.46±0.6 97.20±0.3 98.44±0.2
15 82.61±0.2 96.14±0.4 52.98±0.3 97.84±0.2 99.17±0.4 77.37±0.3 81.46±0.3 - 99.22±0.4 97.47±0.4 99.73±0.4 100
16 99.82±0.3 100 96.67±0.4 98.89±0.4 98.89±0.4 99.26±0.3 99.45±0.3 - 99.41±0.1 99.06±0.3 99.81±0.2 100
AA 96.06±0.5 98.66±0.4 92.05±0.3 98.56±0.3 99.19±0.4 97.37±0.3 98.00±0.4 - 98.90±0.3 98.65±0.2 99.12±0.4 99.61±0.2
OA 96.00±0.3 99.34±0.3 88.57±0.3 98.46±0.3 99.16±0.3 94.95±0.2 96.03±0.3 - 98.96±0.4 99.08±0.3 99.47±0.3 99.80±0.2
TABLE V
CLASSIFICATION RESULTS OF THE PAVIA CENTRE SCENE.
Class SVM-CK[17] GCK[23] LORSAL[26] SMLR-SpATV[27] MFL[28] SVM-3D[29] SVM-3DG[29] CNN-MRF[35] 2D-CNN 3D-CNN R-2D-CNN R-3D-CNN
1 100 - - - 99.99±0.4 100 100 - 99.36±0.2 99.85±0.2 100 99.32±0.2
2 98.24±0.4 - - - 97.41±0.2 96.14±0.2 98.03±0.2 - 87.51±0.3 95.73±0.2 99.65±0.4 86.52±0.3
3 96.54±0.5 - - - 91.15±0.3 93.96±0.3 92.34±0.2 - 88.35±0.5 95.73±0.1 99.78±0.3 85.11±0.3
4 94.30±0.2 - - - 97.89±0.2 90.20±0.2 94.42±0.1 - 88.59±0.4 95.04±0.1 99.88±0.4 91.80±0.1
5 98.18±0.3 - - - 99.70±0.4 99.09±0.4 99.44±0.2 - 91.89±0.4 97.52±0.2 99.85±0.4 92.95±0.3
6 99.31±0.5 - - - 98.59±0.3 98.34±0.2 99.24±0.2 - 90.48±0.3 97.76±0.3 99.75±0.3 90.87±0.4
7 95.38±0.4 - - - 93.87±0.3 96.89±0.5 98.35±0.3 - 96.30±0.4 99.36±0.2 99.95±0.2 96.97±0.2
8 99.80±0.5 - - - 99.87±0.2 99.93±0.2 99.95±0.2 - 96.98±0.5 98.84±0.3 99.90±0.3 96.49±0.1
9 100 - - - 94.76±0.3 99.65±0.3 99.88±0.4 - 73.69±0.1 91.51±0.4 98.25±0.3 83.80±0.2
AA 97.97±0.4 - - - 97.03±0.2 97.13±0.2 97.96±0.3 - 90.32±0.4 96.82±0.3 99.67±0.2 91.54±0.2
OA 97.32±0.3 - - - 99.10±0.2 99.17±0.2 99.47±0.3 - 96.02±0.4 98.75±0.1 99.88±0.3 96.79±0.2
TABLE VI
CLASSIFICATION RESULTS OF THE KENNEDY SPACE CENTER.
Class SVM-CK[17] GCK[23] LORSAL[26] SMLR-SpATV[27] MFL[28] SVM-3D[29] SVM-3DG[29] CNN-MRF[35] 2D-CNN 3D-CNN R-2D-CNN R-3D-CNN
1 96.49±0.2 100 94.74±0.2 97.37±0.4 99.56±0.3 98.25±0.2 99.56±0.3 96.50±0.2 98.68±0.4 98.68±0.4 100 100
2 95.90±0.2 100 90.41±0.1 98.63±0.2 100 86.30±0.3 97.26±0.2 97.22±0.2 100 91.78±0.1 100 97.26±0.2
3 97.40±0.3 98.70±0.3 93.51±0.2 98.70±0.4 100 97.40±0.3 97.40±0.2 98.68±0.4 100 98.78±0.1 100 100
4 86.84±0.2 90.79±0.4 75.00±0.4 84.21±0.1 100 77.63±0.2 96.05±0.2 77.33±0.2 94.73±0.6 97.37±0.4 94.74±0.3 98.68±0.4
5 77.08±0.3 97.92±0.3 72.92±0.2 81.25±0.2 100 83.33±0.4 83.33±0.3 83.33±0.3 93.75±0.2 91.67±0.6 97.92±0.3 100
6 78.26±0.1 100 75.82±0.3 90.11±0.2 99.56±0.3 98.55±0.3 98.55±0.4 100 100 97.10±0.1 98.55±0.1 98.55±0.1
7 87.09±0.6 100 96.77±0.3 100 100 100 100 100 100 100 100 100
8 96.90±0.9 99.23±0.4 96.90±0.4 98.45±0.4 98.45±0.3 93.20±0.4 94.57±0.3 99.22±0.2 93.80±0.8 100 96.90±0.2 100
9 100 100 98.72±0.1 100 99.36±0.3 100 100 100 100 99.36±0.6 100 99.35±0.1
10 97.50±0.1 98.33±0.3 97.50±0.4 98.45±0.3 98.33±0.3 68.33±0.2 98.33±0.2 100 98.32±0.8 94.12±0.6 100 100
11 99.20±0.1 100 100 100 87.20±0.4 99.20±0.2 99.20±0.4 100 100 100 96.80±0.1 99.20±0.1
12 98.67±0.5 97.35±0.1 92.72±0.1 95.36±0.3 96.69±0.4 96.03±0.2 100 100 98.01±0.3 96.69±0.5 99.34±0.3 98.68±0.2
13 100 100 97.84±0.4 100 100 87.77±0.4 100 100 97.48±0.2 97.12±0.2 100 98.56±0.1
AA 95.96±0.2 98.64±0.1 90.95±0.3 95.33±0.2 98.43±0.2 91.22±0.2 97.25±0.3 96.30±0.2 97.81±0.3 98.20±0.1 98.79±0.4 99.23±0.1
OA 96.95±0.6 98.98±0.3 93.59±0.4 96.86±0.3 98.27±0.3 91.67±0.3 98.27±0.3 97.56±0.3 97.76±0.5 98.46±0.3 99.22±0.4 99.85±0.3
Ground-truth SVM-CK(94.72%) GCK(99.48%) LORSAL(86.74%) SMLR-SpATV(99.41%) MFL(99.42%) SVM-3D(97.80%)
SVM-3DG(98.62%) CNN-MRF(95.16%) 2D-CNN(95.46%) 3D-CNN(98.49%) R-2D-CNN(99.19%) R-3D-CNN(99.97%)
Fig. 13. Classification maps and overall classification accuracies obtained for the Pavia University scene data set (overall accuracies are reported in parentheses).
Ground-truth SVM-CK(96.28%) GCK(98.21%) LORSAL(84.09%) SMLR-SpATV(97.83%) NMF(99.07%) SVM-3D(98.14%)
SVM-3DG(98.76%) CNN-MRF(90.70%) 2D-CNN(97.60%) 3D-CNN(97.21%) R-2D-CNN(98.54%) R-3D-CNN(99.38%)
Fig. 14. Classification maps and overall classification accuracies obtained for the Botswana Scene data set (overall accuracies are reported in parentheses).
methods often perform the best. Second, we can observe
an obvious performance improvement when the proportion of
training samples is increased (from 10% to 50%) for the
proposed deep learning models, i.e., 2D-CNN, 3D-CNN,
R-2D-CNN, and R-3D-CNN. When more than 60% of the training
samples are employed, the R-2D-CNN and the R-3D-CNN
models often achieve results comparable to those of the
best conventional methods such as GCK and MFL. Moreover,
we observe that the deep learning-based model CNN-MRF
produces unstable classification performance, that is, results
can be better or worse when more training samples are used.
In fact, the proposed R-2D-CNN and R-3D-CNN
models outperform CNN-MRF when sufficient training
samples are provided. All these observations indicate that the
proposed deep learning models (R-2D-CNN and R-3D-CNN)
are more effective than the baselines when sufficient training
samples are provided.
E. Discussions
In this subsection, we briefly discuss the experimental
results presented above. First, we find that the R-3D-CNN
model often performs better than other models across all the
six data sets. There are two possible reasons for such a
performance improvement: (i) the R-3D-CNN model effectively
Ground-truth SVM-CK(96.00%) GCK(99.34) LORSAL(88.57) SMLR(98.46%) MFL(99.16%)
SVM-3D(94.95%) SVM-3DG(98.00) 2D-CNN(98.96%) 3D-CNN(99.08%) R-2D-CNN(99.47%) R-3D-CNN(99.80%)
Fig. 15. Classification maps and overall classification accuracies obtained for the Salinas scene data set (overall accuracies are reported in parentheses).
Ground-truth SVM-CK(94.72%) GCK(99.48%) LORSAL(86.74%) SMLR-SpATV(99.41%) MFL(99.42%) SVM-3D(97.80%)
SVM-3DG(98.62%) CNN-MRF(95.16%) 2D-CNN(95.46%) 3D-CNN(98.49%) R-2D-CNN(99.19%) R-3D-CNN(99.97%)
Fig. 16. Classification maps and overall classification accuracies obtained for the Pavia Centre scene data set (overall accuracies are reported in parentheses).
Ground-truth SVM-CK(96.95%) GCK(98.98%) LORSAL(93.59%) SMLR-SpATV(96.86%) NMF(98.27%) SVM-3D(91.67%)
SVM-3DG(98.27%) CNN-MRF(97.56%) 2D-CNN(97.76%) 3D-CNN(98.46%) R-2D-CNN(99.22%) R-3D-CNN(99.85%)
Fig. 17. Classification maps and overall classification accuracies obtained for the Kennedy Space Center data set (overall accuracies are reported in parentheses).
[Fig. 18 panels a) to f): accuracy plotted against the number of iterations for the 2D-CNN, 3D-CNN, R-2D-CNN, and R-3D-CNN models.]
Fig. 18. The accuracy varies against the number of iterations on different
data sets. (a) Indian Pines Scene; (b) Botswana Scene; (c) Salinas Scene; (d)
Pavia Centre Scene; (e) Pavia University Scene; (f) Kennedy Space Center.
[Figure: six line plots, panels (a)-(f); x-axis: proportion of training samples (10% to 70%); y-axis: accuracy (%); curves: 2D-CNN, 3D-CNN, R-2D-CNN, R-3D-CNN, SVM-CK, GCK, CNN-MRF, LORSAL, MFL, SMLR-SpATV, SVM-3D, SVM-3DG. The legend of panel (d) omits CNN-MRF; the legend of panel (e) omits SMLR-SpATV, SVM-3D, and SVM-3DG.]
Fig. 19. The influence of the proportion of training samples. (a) Indian Pines scene; (b) Botswana scene; (c) Salinas scene; (d) Pavia Centre scene; (e) Pavia University scene; (f) Kennedy Space Center.
fuses the spatial and the spectral correlations; (ii) the multi-
level recurrent structure can exploit spatial contexts better than
a flat non-recurrent structure. For the same reason, we also
observe that the R-2D-CNN model often outperforms the other
models (such as LORSAL, GCK, SMLR-SpATV, CNN-MRF,
SVM-CK, SVM-3D, SVM-3DG, and MFL). The MFL and the
GCK models perform better than the 2D-CNN and the 3D-
CNN models because of their well-designed EMAP attributes
which can effectively represent the spatial contexts. On the
other hand, the 2D-CNN and the 3D-CNN models perform
better than the SVM-CK, the SVM-3D, and the LORSAL
classifiers in general. Overall, these experimental results support the
effectiveness and the advantages of the deep learning-based
methods.
Second, the 3D-CNN model often performs better than the
2D-CNN model. The main reason is that the 3D convolu-
tion operation can exploit both spatial features and spectral
correlations while the 2D convolution operation can only
exploit spatial features. On the other hand, the R-2D-CNN
model often performs better than the 3D-CNN model and
the 2D-CNN model because its recurrent structure can more
effectively exploit the spatial contexts than the latter two
models. Among the four models, the R-3D-CNN model not
only performs best on most data sets but also converges
faster than the others.
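The difference between the two convolution operations can be illustrated with a minimal pure-Python sketch. It is not the layer configuration used in our models: the function names `conv2d_all_bands` and `conv3d`, the toy 4-band cube, and the kernels are hypothetical, and, as is customary in deep learning, the operation is implemented as a cross-correlation without padding. The 2D kernel spans all bands at once, so the spectral axis is collapsed in a single step; the 3D kernel has a small spectral depth and also slides along the band axis, so the output keeps a spectral axis on which local spectral correlations remain visible.

```python
def conv2d_all_bands(cube, kernel):
    """2D convolution: kernel[b][i][j] covers every band of cube[b][r][c].

    The output is a single 2D map, i.e. the spectral axis disappears.
    """
    bands = len(cube)
    rows, cols = len(cube[0]), len(cube[0][0])
    kh, kw = len(kernel[0]), len(kernel[0][0])
    out = []
    for r in range(rows - kh + 1):
        out_row = []
        for c in range(cols - kw + 1):
            total = 0.0
            for b in range(bands):
                for i in range(kh):
                    for j in range(kw):
                        total += cube[b][r + i][c + j] * kernel[b][i][j]
            out_row.append(total)
        out.append(out_row)
    return out


def conv3d(cube, kernel):
    """3D convolution: kernel[d][i][j] slides over bands as well as pixels.

    The output keeps a (shorter) spectral axis: out[b][r][c].
    """
    bands = len(cube)
    rows, cols = len(cube[0]), len(cube[0][0])
    kd, kh, kw = len(kernel), len(kernel[0]), len(kernel[0][0])
    out = []
    for b in range(bands - kd + 1):
        plane = []
        for r in range(rows - kh + 1):
            out_row = []
            for c in range(cols - kw + 1):
                total = 0.0
                for d in range(kd):
                    for i in range(kh):
                        for j in range(kw):
                            total += cube[b + d][r + i][c + j] * kernel[d][i][j]
                out_row.append(total)
            plane.append(out_row)
        out.append(plane)
    return out


# Toy cube: 4 bands of 3x3 pixels; every pixel in band b has the value b.
cube = [[[float(b)] * 3 for _ in range(3)] for b in range(4)]

# 2D kernel of ones over all 4 bands: sums the whole 3x3x4 neighbourhood.
kernel2d = [[[1.0] * 3 for _ in range(3)] for _ in range(4)]

# 3D kernel of spectral depth 2: difference between two adjacent bands.
kernel3d = [[[-1.0] * 3 for _ in range(3)],
            [[1.0] * 3 for _ in range(3)]]

out2d = conv2d_all_bands(cube, kernel2d)
out3d = conv3d(cube, kernel3d)
print("2D output:", len(out2d), "x", len(out2d[0]))
print("3D output:", len(out3d), "x", len(out3d[0]), "x", len(out3d[0][0]))
```

With these toy inputs the 2D output is a single value per spatial position, whereas the 3D output has three spectral positions, one for each pair of adjacent bands.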
Finally, we find that the proposed deep learning models
(e.g., R-3D-CNN and R-2D-CNN) may be slightly inferior
to conventional machine learning techniques if the training
samples are limited. However, when a reasonable number of
training samples are available, their performance is consid-
erably better than that of the conventional machine learning
techniques such as LORSAL, GCK, MFL, SMLR-SpATV,
SVM-3D, SVM-3DG, and SVM-CK. The main reason is that
deep learning models usually contain more model parameters,
and hence more training samples are required to estimate the
values of these parameters.
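The gap in the number of model parameters can be made concrete with a short counting sketch. All sizes below (200 spectral bands, 16 classes, two 3x3 convolutional layers with 32 and 64 filters, and a final fully connected layer on 64 pooled features) are hypothetical choices for illustration and are not the architectures evaluated in this paper; the helper names `conv_params` and `dense_params` are likewise assumptions.

```python
def conv_params(in_channels, out_channels, kernel_shape):
    """Trainable parameters of a convolutional layer (2D or 3D).

    Each filter has in_channels * prod(kernel_shape) weights plus one bias.
    """
    weights_per_filter = in_channels
    for k in kernel_shape:
        weights_per_filter *= k
    return out_channels * (weights_per_filter + 1)


def dense_params(in_features, out_features):
    """Trainable parameters of a fully connected layer (weights plus biases)."""
    return out_features * (in_features + 1)


bands, classes = 200, 16  # hypothetical data set dimensions

# A linear classifier on the raw spectrum of a single pixel.
linear_total = dense_params(bands, classes)

# A small hypothetical 2D-CNN that treats the bands as input channels.
cnn_total = (
    conv_params(bands, 32, (3, 3))   # first convolutional layer
    + conv_params(32, 64, (3, 3))    # second convolutional layer
    + dense_params(64, classes)      # classifier on 64 pooled features
)

print("linear classifier:", linear_total)
print("small 2D-CNN:     ", cnn_total)
print(f"ratio: {cnn_total / linear_total:.1f}x")
```

Even this deliberately small network has roughly 24 times as many parameters as the linear classifier, which is consistent with the observation that more training samples are needed to estimate them reliably.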
V. CONCLUSIONS
In this paper, we have explored deep learning techniques
for solving the hyperspectral image classification problem.
In particular, four deep learning models, namely 2D-CNN,
3D-CNN, R-2D-CNN, and R-3D-CNN, have been designed
and developed. Rigorous experiments were conducted based
on six publicly available hyperspectral image data sets, and
our experimental results confirm the superiority of these
deep learning methods when compared to traditional machine
learning methods such as LORSAL, MFL, GCK, SVM-3D,
and SVM-CK. In addition, the proposed R-3D-CNN and R-
2D-CNN models outperform the CNN-MRF, SVM-3DG, and
SMLR-SpATV. Overall, the proposed R-3D-CNN model
outperforms the other models on most of the data sets, and
it also converges faster because its 3D convolutional
operators and recurrent network structure effectively
exploit both the spectral and the spatial contexts. Measured
in terms of error rate, the proposed methods (R-2D-CNN and
R-3D-CNN) improve on the baselines by more than 30%.
Despite the superiority of
the proposed models, we find that our deep learning models
often require more training samples than traditional machine
learning methods. Accordingly, incorporating prior domain
knowledge into the proposed deep learning models is an
interesting topic for future research. Alternatively, we will
explore transfer learning approaches to alleviate this
shortcoming of our current deep learning models.
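The error-rate comparison can be illustrated with a short worked example. It assumes that the improvement is measured as the relative reduction in error rate, where the error rate is 100% minus the overall accuracy; whether the 30% figure quoted above was computed in exactly this way is an assumption. The accuracies are those reported in Figs. 15-17 for GCK, the strongest baseline in each of those figures, and for R-3D-CNN; the function name `relative_error_reduction` is hypothetical.

```python
def relative_error_reduction(baseline_acc, proposed_acc):
    """Relative drop in error rate, in percent.

    Both arguments are overall accuracies in percent; the error rate is
    taken to be 100 minus the accuracy (assumption).
    """
    baseline_err = 100.0 - baseline_acc
    proposed_err = 100.0 - proposed_acc
    return 100.0 * (baseline_err - proposed_err) / baseline_err


# (GCK accuracy, R-3D-CNN accuracy) in percent, as reported in Figs. 15-17.
reported = {
    "Salinas": (99.34, 99.80),
    "Pavia Centre": (99.48, 99.97),
    "Kennedy Space Center": (98.98, 99.85),
}

reductions = {
    name: relative_error_reduction(gck, r3d)
    for name, (gck, r3d) in reported.items()
}

for name, reduction in reductions.items():
    print(f"{name}: error rate reduced by {reduction:.1f}% relative to GCK")
```

For the Salinas scene, for example, the error rate falls from 0.66% to 0.20%, a relative reduction of about 70%; under this definition all three scenes exceed the 30% threshold.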
ACKNOWLEDGMENT
This research was supported in part by the Shenzhen
Science and Technology Program under Grants No.
JCYJ20160330163900579 and No. JCYJ20170413105929681.
Huang’s work is supported by the National Natural Science
Foundation of China (NSFC) under Grant No. 61562027
and by the Education Department of Jiangxi Province under Grant
No. GJJ170413. Lau’s work is supported by grants from the
RGC of the Hong Kong SAR (Projects: CityU 11502115 and
CityU 11525716), the NSFC Basic Research Program (Project:
71671155), the Shenzhen Municipal Science and Technology
Innovation Fund (Project: JCYJ20160229165300897), and
the CityU Shenzhen Research Institute.
REFERENCES
[1] J. M. Bioucas-Dias, A. Plaza, G. Camps-Valls, P. Scheunders,
N. Nasrabadi, and J. Chanussot, “Hyperspectral remote sensing data
analysis and future challenges,” IEEE Geoscience and remote sensing
magazine, vol. 1, no. 2, pp. 6–36, 2013.
[2] A. J. Brown, B. Sutter, and S. Dunagan, “The marte vnir imaging
spectrometer experiment: design and analysis,” Astrobiology, vol. 8,
no. 5, pp. 1001–1011, 2008.
[3] B. Scholkopf and A. J. Smola, Learning with kernels: support vector
machines, regularization, optimization, and beyond. MIT press, 2001.
[4] G. Hughes, “On the mean accuracy of statistical pattern recognizers,”
IEEE Transactions on Information Theory, vol. 14, no. 1, pp. 55–63, 1968.
[5] B. Krishnapuram, L. Carin, M. A. Figueiredo, and A. J. Hartemink,
“Sparse multinomial logistic regression: Fast algorithms and general-
ization bounds,” IEEE transactions on pattern analysis and machine
intelligence, vol. 27, no. 6, pp. 957–968, 2005.
[6] Q. Wang, Z. Meng, and X. Li, “Locality adaptive discriminant anal-
ysis for spectral–spatial classification of hyperspectral images,” IEEE
Geoscience and Remote Sensing Letters, vol. 14, no. 11, pp. 2077–2081,
2017.
[7] Q. Wang, J. Lin, and Y. Yuan, “Salient band selection for hyperspectral
image classification via manifold ranking,” IEEE transactions on neural
networks and learning systems, vol. 27, no. 6, pp. 1279–1289, 2016.
[8] Y. Yuan, J. Lin, and Q. Wang, “Dual-clustering-based hyperspectral band
selection by contextual analysis,” IEEE Transactions on Geoscience and
Remote Sensing, vol. 54, no. 3, pp. 1431–1445, 2016.
[9] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521,
no. 7553, pp. 436–444, 2015.
[10] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi, “Inception-v4,
inception-resnet and the impact of residual connections on learning.” in
AAAI, 2017, pp. 4278–4284.
[11] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classifica-
tion with deep convolutional neural networks,” in Advances in neural
information processing systems, 2012, pp. 1097–1105.
[12] K. Simonyan and A. Zisserman, “Very deep convolutional networks for
large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
[13] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan,
V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,”
in Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, 2015, pp. 1–9.
[14] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for
image recognition,” in Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, 2016, pp. 770–778.
[15] T. V. Bandos, L. Bruzzone, and G. Camps-Valls, “Classification of
hyperspectral images with regularized linear discriminant analysis,”
IEEE Transactions on Geoscience and Remote Sensing, vol. 47, no. 3,
pp. 862–873, 2009.
[16] A. J. Brown, “Spectral curve fitting for automatic hyperspectral data
analysis,” IEEE Transactions on Geoscience and Remote Sensing,
vol. 44, no. 6, pp. 1601–1608, 2006.
[17] G. Camps-Valls and L. Bruzzone, “Kernel-based methods for hyperspec-
tral image classification,” IEEE Transactions on Geoscience and Remote
Sensing, vol. 43, no. 6, pp. 1351–1362, 2005.
[18] A. Rakotomamonjy, F. R. Bach, S. Canu, and Y. Grandvalet, “Sim-
plemkl,” Journal of Machine Learning Research, vol. 9, no. 3, pp. 2491–
2521, 2008.
[19] D. Tuia, G. Camps-Valls, G. Matasci, and M. Kanevski, “Learn-
ing relevant image features with multiple-kernel classification,” IEEE
Transactions on Geoscience and Remote Sensing, vol. 48, no. 10, pp.
3780–3791, 2010.
[20] Y. Gu, C. Wang, D. You, Y. Zhang, S. Wang, and Y. Zhang, “Representa-
tive multiple kernel learning for classification in hyperspectral imagery,”
IEEE Transactions on Geoscience and Remote Sensing, vol. 50, no. 7,
pp. 2852–2865, 2012.
[21] M. Dalla Mura, J. A. Benediktsson, B. Waske, and L. Bruzzone,
“Morphological attribute profiles for the analysis of very high resolution
images,” IEEE Transactions on Geoscience and Remote Sensing, vol. 48,
no. 10, pp. 3747–3762, 2010.
[22] M. Dalla Mura, J. Atli Benediktsson, B. Waske, and L. Bruzzone,
“Extended profiles with morphological attribute filters for the analysis
of hyperspectral data,” International Journal of Remote Sensing, vol. 31,
no. 22, pp. 5975–5991, 2010.
[23] J. Li, P. R. Marpu, A. Plaza, J. M. Bioucas-Dias, and J. A. Benedikts-
son, “Generalized composite kernel framework for hyperspectral image
classification,” IEEE Transactions on Geoscience and Remote Sensing,
vol. 51, no. 9, pp. 4816–4829, 2013.
[24] Q. Huang, C. K. Jia, X. Zhang, and Y. Ye, “Learning discriminative sub-
space models for weakly supervised face detection,” IEEE Transactions
on Industrial Informatics, vol. 13, no. 6, pp. 2956–2964, 2017.
[25] X. Ma, Q. Liu, Z. He, X. Zhang, and W.-S. Chen, “Visual tracking via
exemplar regression model,” Knowledge-Based Systems, vol. 106, pp.
26–37, 2016.
[26] Y. Chen, N. M. Nasrabadi, and T. D. Tran, “Hyperspectral image
classification via kernel sparse representation,” IEEE Transactions on
Geoscience and Remote Sensing, vol. 51, no. 1, pp. 217–231, 2013.
[27] L. Sun, Z. Wu, J. Liu, L. Xiao, and Z. Wei, “Supervised spectral–spatial
hyperspectral image classification with weighted markov random fields,”
IEEE Transactions on Geoscience and Remote Sensing, vol. 53, no. 3,
pp. 1490–1503, 2015.
[28] J. Li, X. Huang, P. Gamba, J. M. Bioucas-Dias, L. Zhang, J. A.
Benediktsson, and A. Plaza, “Multiple feature learning for hyperspectral
image classification,” IEEE Transactions on Geoscience and Remote
sensing, vol. 53, no. 3, pp. 1592–1606, 2015.
[29] X. Cao, L. Xu, D. Meng, Q. Zhao, and Z. Xu, “Integration of 3-
dimensional discrete wavelet transform and markov random field for
hyperspectral image classification,” Neurocomputing, vol. 226, pp. 90–
100, 2017.
[30] Y. LeCun, B. E. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. E.
Hubbard, and L. D. Jackel, “Handwritten digit recognition with a back-
propagation network,” in Advances in neural information processing
systems, 1990, pp. 396–404.
[31] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard,
W. Hubbard, and L. D. Jackel, “Backpropagation applied to handwritten
zip code recognition,” Neural computation, vol. 1, no. 4, pp. 541–551,
1989.
[32] X. Glorot, A. Bordes, and Y. Bengio, “Deep sparse rectifier neural
networks.” in Aistats, vol. 15, no. 106, 2011, p. 275.
[33] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhut-
dinov, “Dropout: A simple way to prevent neural networks from over-
fitting,” The Journal of Machine Learning Research, vol. 15, no. 1, pp.
1929–1958, 2014.
[34] A. Santara, K. Mani, P. Hatwar, A. Singh, A. Garg, K. Padia, and
P. Mitra, “Bass net: Band-adaptive spectral-spatial feature learning neu-
ral network for hyperspectral image classification,” IEEE Transactions
on Geoscience and Remote Sensing, 2017.
[35] X. Cao, F. Zhou, L. Xu, D. Meng, Z. Xu, and J. Paisley, “Hyperspectral
image segmentation with markov random fields and a convolutional
neural network,” arXiv preprint arXiv:1705.00727, 2017.
[36] Q. Wang, J. Gao, and Y. Yuan, “A joint convolutional neural networks
and context transfer for street scenes labeling,” IEEE Transactions on
Intelligent Transportation Systems, 2017.
[37] S. Ji, W. Xu, M. Yang, and K. Yu, “3d convolutional neural networks
for human action recognition,” IEEE transactions on pattern analysis and
machine intelligence, vol. 35, no. 1, pp. 221–231, 2013.
[38] A. Graves, A.-r. Mohamed, and G. Hinton, “Speech recognition with
deep recurrent neural networks,” in Acoustics, speech and signal
processing (icassp), 2013 ieee international conference on. IEEE, 2013,
pp. 6645–6649.
[39] I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence learning
with neural networks,” in Advances in neural information processing
systems, 2014, pp. 3104–3112.
[40] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares,
H. Schwenk, and Y. Bengio, “Learning phrase representations using
rnn encoder-decoder for statistical machine translation,” arXiv preprint
arXiv:1406.1078, 2014.
[41] L. Mou, P. Ghamisi, and X. X. Zhu, “Deep recurrent neural networks for
hyperspectral image classification,” IEEE Transactions on Geoscience
and Remote Sensing, 2017.
[42] P. Pinheiro and R. Collobert, “Recurrent convolutional neural
networks for scene labeling,” in Proceedings of the 31st International
Conference on Machine Learning, ser. Proceedings of Machine
Learning Research, E. P. Xing and T. Jebara, Eds., vol. 32, no. 1.
Beijing, China: PMLR, 22–24 Jun 2014, pp. 82–90. [Online]. Available:
http://proceedings.mlr.press/v32/pinheiro14.html
[43] P. J. Werbos, “Backpropagation through time: what it does and how to
do it,” Proceedings of the IEEE, vol. 78, no. 10, pp. 1550–1560, 1990.
Xiaofei Yang received the B.Sc. from
Suihua University in 2007 and 2011, and received
M.Sc. degrees from Harbin Institute of Technology
in 2011 and 2013, respectively. Currently, he is
a Ph.D. candidate in Shenzhen Graduate School,
Harbin Institute of Technology. His research inter-
ests are in the areas of semi-supervised learning,
deep learning, remote sensing, transfer learning and
graph mining.
Yunming Ye received the Ph.D. degree in
Computer Science from Shanghai Jiao Tong Univer-
sity. He is now a professor in the Shenzhen Graduate
School, Harbin Institute of Technology. His research
interests include data mining, text mining, and en-
semble learning algorithms.
Xutao Li is now an Associate Professor
in the Shenzhen Graduate School, Harbin Institute
of Technology. He received the Ph.D. and Master
degrees in Computer Science from Harbin Institute
of Technology in 2013 and 2009, and the Bachelor
from Lanzhou University of Technology in 2007.
His research interests include data mining, machine
learning, graph mining and social network analysis,
especially tensor based learning and mining algo-
rithms.
Raymond Y. K. Lau is an
Associate Professor in the Department of Information
Systems at City University of Hong Kong. He
is the author of two hundred refereed international
journal and conference papers. His research work
has been published in renowned journals such as
MIS Quarterly, INFORMS Journal on Computing,
ACM Transactions on Information Systems, IEEE
Transactions on Knowledge and Data Engineering,
IEEE Internet Computing, Journal of MIS, Decision
Support Systems, etc. His research interests include
Big Data Analytics, Social Media Analytics, FinTech, and AI for Business.
He is a senior member of both the IEEE and the ACM.
Xiaohui Huang received the B.Eng.
and master's degrees from Jiangxi Normal University,
Nanchang, China, in 2005 and 2008, respectively,
and the Ph.D. degree from the Shenzhen Graduate
School, Harbin Institute of Technology, Shenzhen,
China, in 2014. Since 2015, he has been with the
School of Information Engineering, East
China Jiaotong University, Nanchang, China, where
he is currently a lecturer in computer science. His
current research interests include clustering, social
media analysis, and deep learning.
Xiaofeng Zhang received the M.Sc.
degree from Harbin Institute of Technology in 1999,
and the Ph.D. degree from Hong Kong Baptist
University in 2008. He has worked in the R&D center
of Peking University Founder Group and the E-business
Technology Institute of Hong Kong University. He
is now an associate professor in the Department of
Computer Science, Harbin Institute of Technology
Shenzhen Graduate School. His research interests
include data mining, machine learning and graph
mining.