Hyperspectral Image Classification With Deep Learning Models

Xiaofei Yang1,2, Yunming Ye1,2, Xutao Li1,2, Raymond Y. K. Lau3, Xiaofeng Zhang1,2, and Xiaohui Huang4

1Shenzhen Graduate School, Harbin Institute of Technology, Shenzhen, 518055, China
2Shenzhen Key Laboratory of Internet Information Collaboration, Shenzhen, 518055, China
3City University of Hong Kong, Hong Kong
4School of Information Engineering, East China Jiaotong University, China

Corresponding authors: Yunming Ye (email: yeyunming@hit.edu.cn) and Xutao Li (email: lixutao@hit.edu.cn).
Deep learning has achieved great successes in conventional computer vision tasks. In this paper, we exploit deep learning techniques
to address the hyperspectral image classification problem. In contrast to conventional computer vision tasks that only examine the
spatial context, our proposed method can exploit both spatial context and spectral correlation to enhance hyperspectral image
classification. In particular, we advocate four new deep learning models, namely 2D Convolutional neural network (2D-CNN), 3D
Convolutional neural network (3D-CNN), recurrent 2D Convolutional neural network (R-2D-CNN), and recurrent 3D Convolutional
neural network (R-3D-CNN) for hyperspectral image classification. We conducted rigorous experiments based on six publicly available
data sets. Through a comparative evaluation with other state-of-the-art methods, our experimental results confirm the superiority
of the proposed deep learning models, especially the R-3D-CNN and the R-2D-CNN deep learning models.
Index Terms—Deep Learning, Hyperspectral Image, Convolutional Neural Network.
I. INTRODUCTION
RECENTLY, the rapid development of optics and photon-
ics has significantly advanced the field of hyperspectral
imaging techniques. As a result, hyperspectral sensors are
installed in many satellites which can produce images with
rich spectral information. The rich information captured in
hyperspectral images enables us to distinguish very similar
materials and objects by using satellites. Accordingly, hyper-
spectral imaging techniques have been widely used in a variety
of fields such as agriculture, monitoring, astronomy, and
mineral exploration. For example, Brown et al. [1] analyzed
the CRISM hyperspectral data set and used linear mixing of
absorption band techniques to determine the mineralogy of
the surface on Mars. In [2], Brown et al. utilized the VNIR
imaging spectrometer instrument, a hyperspectral scanning pushbroom device sensitive to VNIR wavelengths from 400 to 1000 nm, for mineral exploration.
The existing methods for hyperspectral image classifica-
tion are mostly based on conventional pattern recognition
approaches such as support vector machine (SVM) [3] and
K-nearest neighbor (KNN) classifiers. To address the curse of
dimensionality, namely the Hughes phenomenon [4], Krishna-
puram et al. [5] performed dimensionality reduction against a
data set first and then applied multinomial logistic regression
(MLR) to improve image classification performance. Wang et
al. proposed a novel dimensionality reduction method, namely
the Locality Adaptive Discriminant Analysis (LADA) method
for hyperspectral image analysis [6]. Another way to cope
with the Hughes phenomenon is via the salient band selection
method. For example, Wang et al. [7] proposed a manifold
ranking based salient band selection method. In addition, Yuan
et al. proposed a new dual clustering framework, which was
applied to tackle the inherent drawbacks of the clustering-
based band selection method [8]. It has been shown that a
composite kernel approach that requires multiple kernels can
enhance the accuracy of classification by fusing spatial and
spectral information. For example, the Generalized Composite
Kernel (GCK) framework is one of the promising methods
for hyperspectral image classification. Though kernel-based
methods like GCK can exploit both the spectral and the spatial
information, they involve solving computationally very costly
optimization problems.
As a state-of-the-art machine learning technique, deep learn-
ing [9] [10] has recently attracted a lot of attention for its
application to conventional computer vision tasks. One main
reason is that deep learning can automatically discover an
effective feature representation for a problem domain, thus
avoiding the complicated and hand-crafted feature engineering
process. With a specially-designed deep learning architecture,
convolutional neural networks (CNNs) are widely applied to
image recognition and image segmentation, which consider
the spatial correlation among pixels. Successful examples of
CNNs include AlexNet [11], VGG [12], GoogLeNet [13], and
ResNet [14]. However, existing CNNs are applied to con-
ventional image classification tasks rather than hyperspectral
image classification tasks where both the spatial and spectral
correlations need to be effectively exploited.
In this paper, we address the hyperspectral image classifica-
tion problem by using a new deep learning technique. As noted
above, both the spectral factor and the spatial factor influence
the class label prediction of a pixel. On one hand, the label
of a pixel is reflected by its spectral values scanned by using
different spectrums. On the other hand, as the geographically
close pixels tend to belong to the same class, predicting the
class label of a pixel should take into account the class labels
of the surrounding pixels. Hence, a good hyperspectral image
classification method should consider both the spectral factor
and the spatial factor. In this paper, we first advocate a 2D-
CNN model and a 3D-CNN model for classifying hyper-
spectral images. The intuition is that a 2D-CNN can exploit
the spatial context, whereas a 3D-CNN can exploit both the
spatial and the spectral context. Though the aforementioned
models can take into account rich contextual information, the
way that they process the spatial information may introduce
unwanted noise. Accordingly, we further design the recurrent
2D-CNN and the recurrent 3D-CNN to address the noisy
spatial information problem. The main contributions of our
research work are summarized as follows.
1) First, we treat spectral data as the channels of conven-
tional images. To classify each pixel in a hyperspec-
tral image, we extract a small patch centered at the
pixel. The patch is treated as an image with multiple
channels. Then, we design a 2D-CNN model with three
2D convolution layers, followed by a full connection
layer, to classify the patch. The label of the patch is
considered as the label of its central pixel. Though the
pooling layers (such as max pooling layers and average
pooling layers) could reduce the dimensions of feature
maps and simplify the computations, they may affect the
classification accuracy of the network. To preserve as
much contextual information as possible, pooling layers
are excluded from our 2D-CNN model. The convolution
layer, pooling layer, and fully-connected layer of a CNN
will be explained in Section II.B.
2) Though the 2D-CNN model can utilize the spatial context,
it fails to consider the spectral correlations. To address
such a problem, we further design a 3D-CNN model
which is composed of seven convolution layers and
one full connection layer. Different from the 2D-CNN,
the convolution operator of this model is 3D, where
the first two dimensions are applied to capture the
spatial context and the third dimension captures the
spectral context. Though the 3D-CNN model contains
more network parameters than its 2D counterpart, it
should be more effective than its 2D counterpart because
of its ability to evaluate the spectral correlations of a
hyperspectral image.
3) The 2D-CNN model may be noisy because the classifi-
cation of a pixel only relies on a small patch centered
at the pixel. To effectively utilize the spatial context,
we further design a recurrent 2D-CNN model (R-2D-
CNN). The R-2D-CNN can extract features by gradually
shrinking the patch to concentrate on the central pixel.
Experimental results show that the R-2D-CNN model
indeed performs better than the 2D-CNN model.
4) Finally, we design the recurrent 3D-CNN model (R-3D-
CNN) to take into account both spatial and spectral con-
texts, while alleviating the problem of a noisy patch. The
R-3D-CNN extends the 3D-CNN model by shrinking
the patch gradually. As a result, the final classification
of each pixel mainly depends on the information of the
pixel rather than a patch. Experimental results show the
superiority of the R-3D-CNN model. In particular, it
converges faster than other methods, and achieves the
best classification performance.
The rest of the paper is organized as follows. Section II
discusses the related research work. In section III, we illustrate
various CNN-based deep learning models and the correspond-
ing algorithms for hyperspectral image classification. Section
IV reports the experimental results of a comparative evaluation
of the proposed methods and other baseline methods. Fi-
nally, we give concluding remarks and highlight the directions
of future research work.
II. RELATED WORK
A. Classical Classification Methods
Hyperspectral remote sensing classification has been ex-
tensively studied recently. For example, Bandos et al. [15]
utilize a linear discriminant method to solve the problem.
However, when the spectral resolution is low, it is necessary to
handle the band mixing problem for better differentiating the
pixels or performing feature selection. To this end, Brown [16]
develops a robust method to automatically separate overlapping
absorption bands, and the advantage of such a method is that
it is relatively noise-insensitive. To address the nonlinearity
of data, quadratic discriminant analysis and logarithmic dis-
criminant analysis are also explored. However, these methods
suffer from the Hughes phenomenon, i.e., the classification
performance considerably degrades when the dimensionality
of the problem space becomes high. Wang et al. [6] propose
a novel dimensionality reduction method, namely LADA for
hyperspectral image classification. Following the idea of LDA,
LADA learns a projection matrix G to pull the points of the
same class close to each other while pushing the points of
different classes far away from each other. To further exploit
the local data manifold, LADA adds an adaptive manifold term,
parameterized by a matrix S, into the computation of the within-
class scatter term, and solves the matrices G and S alternately.
In 2016, Wang et al. [7] propose a manifold-ranking-based
salient band selection method for hyperspectral image classification.
The method first employs an evolution algorithm to group
the bands into several subsets, and finds some representative
bands. Then, it uses the representatives to select salient bands
by a manifold ranking strategy. The performance of the method
relies significantly on the quality of the chosen representatives
and of the constructed manifold.
To improve classification performance, many researchers
resort to kernel-based methods. The main idea of kernel-based
methods is to project samples into a high dimensional space
in which the samples of different classes become linearly
separable. The trick of kernel-based methods is that one does
not need to specify the details of the transformation function.
Instead, we only need to define the inner products among
samples in the high-dimensional space. For example, Camps et
al. [17] employ the kernel trick of the SVM, in which the
separation of classes in a high-dimensional space is achieved
via a nonlinear transformation.
Apart from employing simple kernel tricks, some re-
searchers employed multiple kernels for hyperspectral im-
age classification. For example, Rakotomamonjy et al. [18]
advocate the multiple kernel learning (MKL) method which
could learn a kernel and a classification predictor at the
same time. With the preliminary success of MKL, the same
technique is applied to remote sensing in 2010 [19]. In 2012,
a representative MKL algorithm is developed which could
establish the weights of kernels according to their statistical
significance [20].
The aforementioned kernel-based methods do not explicitly
exploit a spatial context. To address such a problem, the
composite kernel (CK) method is proposed [21]. In [22], the
CK method is generalized by using extended multi-attribute
profiles (EMAP). Apart from considering the spatial context,
the CK method could exploit the spectral context as well. For
example, a generalized composite kernel (GCK) is developed
to exploit both extended multi-attribute profiles and raw fea-
tures [23]. The GCK method often achieves better performance
than conventional methods such as SVM-CK [23].
Despite achieving promising classification performance, all
the kernel-based methods suffer from two drawbacks: (1) they
often involve solving complicated convex problems which are
in general difficult for a classifier to learn; (2) kernels must
be carefully chosen so as to achieve good performance.
Recently, some more advanced classification methods are
developed for hyperspectral image classification [24] [25].
For example, the logistic regression via variable splitting and
augmented Lagrangian (LORSAL) algorithm [26] is developed
to tackle larger data sets efficiently. In [27], Sun et al. propose
a hyperspectral image classification model, named SMLR-
SpATV, which includes a spectral data fidelity term and a
spatially adaptive Markov random field (MRF) prior in the
hidden field. Li et al. [28] propose a new multiple feature
learning framework (MFL), which pursues the combination of
multiple features for hyperspectral scene categorization.
The method can handle both linear and nonlinear classification.
In [29], a novel SVM-based classification method is proposed
by applying the three-dimensional discrete wavelet transform
(SVM-3DG).
B. Deep Learning Models
Recently, CNN models have achieved a breakthrough in the
performance of image classification. A CNN model (see Fig.1)
is a multi-layer neural network composed of convolution
layers, pooling layers, and a full connection layer. The convolution
layer contains N filters (C1 in Fig. 1), each of which is a
small weighted matrix. By convolving the N filters with an
input image and transforming the output with a non-linear
activation function, N feature maps are produced. The feature
maps often contain redundant information. To reduce the
redundancy, a pooling layer is appended (S2 in Fig. 1), which
summarizes feature maps into small matrices by calculating
the average (average pooling) or maximum value (maximum
pooling) locally. The convolution layer and pooling layer can
be repeated multiple times (C3 and S4 in Fig. 1) until the
generated feature maps are of size 1-by-1. Finally, a fully
connected layer will be appended for categorization. The
neurons of the fully connected layer take all the 1-by-1 feature
maps as their inputs.
The first CNN model was developed by LeCun [30], [31].
Combined with the back-propagation algorithm, the CNN
model achieved very good performance in handwritten digit
recognition. With the advancement of Graphics Processing
Units (GPUs), deep learning has attracted a lot of attention
by researchers.

Fig. 1. The CNN model, consisting of convolution layers (C), pooling layers (S), and a full connection layer.

On the other hand, the CNN model has been
improved by the recent deep learning techniques. For example,
Glorot et al. [32] introduce the Rectified Linear Units (ReLU)
as the activation function for CNNs in 2011. By doing so,
the vanishing gradient problem and the ineffective explo-
ration problem of the BP method can be alleviated. In 2012,
Krizhevsky et al. [11] designed the AlexNet network which
was a deep CNN model with the ReLU activation function.
The AlexNet network won the annual ImageNet competition
in 2012. To avoid overfitting, Srivastava et al. [33] proposed
the dropout technique for deep CNN. In addition, Szegedy
et al. [13] designed the GoogLeNet model which is a deep
CNN model with each layer comprising multi-scale CNN.
He et al. [14] proposed a deep residual CNN model which
won the ImageNet competition in 2015. In [34], an end-to-end
band-adaptive spectral-spatial feature learning neural network
was proposed. In [35], Cao et al. proposed a hyperspectral
image segmentation method by using Markov random fields
and a convolutional network. To tackle the street scene labeling
problem, Wang et al. [36] proposed a hybrid method that
utilized priori convolutional neural networks at superpixel
level and soft restricted context transfer. The former technique
aims to learn prior location information and produces coarse
label predication, whereas the latter technique aims to improve
the coarse prediction by reducing over-smoothness. However,
the algorithm works for conventional images only. It does not
take into account the characteristics of rich band information
in hyperspectral images.
All the above models are 2D-CNN models within which the
convolution operators only deal with two-dimensional spatial
features. In [37], a 3D convolution network is designed to
handle video categorization tasks effectively. Following the
framework of 3D-CNN models, we employ such an architec-
ture for hyperspectral image classification, in which the third
dimension refers to the spectral axis.
Apart from CNN models, another important deep learning
framework is the recurrent neural network (RNN) which is of-
ten applied to process sequence data arising from applications
such as speech recognition [38], machine translation [39], and
chatbots [40]. In [41], Mou et al. proposed a novel RNN
model for hyperspectral image classification, which could
effectively analyze hyperspectral pixels as sequential data and
then determine information categories via network reasoning.
The basic intuition of RNN is that it applies the same neural
network block recurrently for sequence prediction. To preserve
the information of observed historical sequences, an RNN is
fed with the current observation together with hidden states
learned from the previously observed sequence. By doing so,
the RNN can take into account both the features of the current
observation and those of the historical observations to improve
the current prediction. In contrast to the aforementioned
approaches, we apply the RNN idea to deal with the spatial
contexts recurrently.

Fig. 2. The 2D-CNN model, consisting of 2D convolution operations with kernel size (k) and number of feature maps (m) at each convolutional layer, for hyperspectral image classification.
III. THE PROPOSED METHODS
In this section, we illustrate the design of new 2D-CNN, R-
2D-CNN, 3D-CNN, and R-3D-CNN models for hyperspectral
image classification. For these methods, we extract a small
patch centered at each pixel to build the classification models.
Among the proposed models, the 2D-CNN and the R-2D-CNN
models exploit the spatial contexts only, whereas the 3D-CNN
and the R-3D-CNN models exploit both the spatial features
and the spectral correlations of pixels.
A. 2D-CNN model
As illustrated in Fig.2, our 2D-CNN model is composed
of three main phases, which are patch extraction, feature ex-
traction, and label identification. Given a hyperspectral image,
we first extract a small patch centered at each pixel as the
raw feature. Then, a deep learning model is constructed to
acquire the feature maps of these patches. Finally, the label
of each pixel is classified based on the feature map of the
corresponding patch. For all four models, we exclude the
pooling layers so as to preserve as much information of a
pixel as possible. The three-phase processing of the 2D-CNN
model is illustrated below.
Assume that we are given a hyperspectral image of size N ×
M × D, where N and M are the width and the height of the
image, and D denotes the number of spectral bands. We aim at
predicting the label of each pixel of the image. As spatially
adjacent pixels often have the same labels, it is desirable for
the proposed model to consider the "spatial coherence". To
this end, the first processing phase of our model is to extract
a K × K × D patch for each pixel. In particular, each patch
(i.e., the spatial context) is constructed surrounding a pixel, the
center point of the patch. For the pixels that reside near the
edge of the image, there may not be sufficient information to
build a patch of the expected size. Accordingly, we construct
the spatial context by performing a mirror padding operation
for these pixels.
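To make the patch-extraction phase concrete, the following minimal NumPy sketch (the helper name extract_patch and the toy sizes are illustrative assumptions, not part of our released code) mirror-pads the image border and slices out the K × K × D neighborhood of a pixel:

import numpy as np

def extract_patch(image, i, j, K):
    """Extract the K x K x D patch centered at pixel (i, j).
    Border pixels are handled by mirror padding, as described above."""
    r = K // 2
    # Mirror-pad the two spatial dimensions only; the spectral axis is untouched.
    padded = np.pad(image, ((r, r), (r, r), (0, 0)), mode="reflect")
    # After padding, pixel (i, j) sits at (i + r, j + r), so this slice is centered on it.
    return padded[i:i + K, j:j + K, :]

# Example: a toy 145 x 145 image with 200 bands and a 7 x 7 patch for a corner pixel.
image = np.random.rand(145, 145, 200).astype(np.float32)
print(extract_patch(image, 0, 0, K=7).shape)  # (7, 7, 200)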
For the second processing phase, each extracted patch
is treated as an image with multiple channels on its own.
Thereby, we can apply a deep CNN model with 2D
convolution layers to extract the feature maps for the patch.
More specifically, the 2D-CNN operator at each layer is
formulated as follows:
v_{ij}^{xy} = F\Big( b_{ij} + \sum_{m} \sum_{p=0}^{N_i - 1} \sum_{q=0}^{M_i - 1} w_{ijm}^{pq} \, v_{(i-1)m}^{(x+p)(y+q)} \Big)    (1)
where i indicates the particular layer under consideration, and
j indexes the feature maps of layer i; v_{ij}^{xy} stands
for the output at position (x, y) of the j-th feature map at
the i-th layer; b_{ij} refers to the bias term, and F(\cdot) denotes
the activation function of the layer; m indexes over the set
of feature maps of the (i-1)-th layer, which are the inputs
to the i-th layer; w_{ijm}^{pq} is the value at position (p, q) of the
convolution kernel connecting the m-th feature map of the
(i-1)-th layer to the j-th feature map, and N_i and M_i are the
height and width of this kernel. For the proposed model, we adopt the ReLU function
as the activation function F, which is defined as follows:
F(x) = \max(0, x)    (2)
In our 2D-CNN model, three convolutional layers are uti-
lized. To preserve the vital information of each pixel, we
exclude the pooling layers from our 2D-CNN model. Finally,
a fully-connected layer, which takes the feature maps of the
last 2D convolutional layer as inputs, is constructed to make
the prediction. Here, we leverage the softmax function to
compute the probability of each class. The softmax function
is an extension of the sigmoid function to multi-class
classification: it normalizes the output scores of the network
into a probability distribution over the class labels. Moreover, the
cross-entropy function is adopted as the objective function to
drive the back-propagation-based training process.
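As an illustration of this architecture, the following is a minimal PyTorch sketch (an assumption-laden illustration rather than our exact implementation): three 3 × 3 convolution layers without pooling and a fully-connected classifier, with the patch size (7 × 7 × 200) and filter count (400) borrowed from the Indian Pines configuration in Section IV:

import torch
import torch.nn as nn

class CNN2D(nn.Module):
    """Three 2D convolution layers (no pooling) followed by a fully-connected
    classifier, with the D spectral bands treated as input channels.
    Layer widths here are illustrative assumptions."""

    def __init__(self, bands=200, n_classes=16, filters=400):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(bands, filters, kernel_size=3), nn.ReLU(),    # 7x7 -> 5x5
            nn.Conv2d(filters, filters, kernel_size=3), nn.ReLU(),  # 5x5 -> 3x3
            nn.Conv2d(filters, filters, kernel_size=3), nn.ReLU(),  # 3x3 -> 1x1
        )
        self.classifier = nn.Linear(filters, n_classes)  # softmax is applied in the loss

    def forward(self, x):               # x: (batch, D, K, K)
        f = self.features(x)            # (batch, filters, 1, 1)
        return self.classifier(f.flatten(1))

logits = CNN2D()(torch.randn(10, 200, 7, 7))  # mini-batch of ten 7x7x200 patches
print(logits.shape)  # torch.Size([10, 16])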
Let W and b denote all the parameters of our 2D-CNN
model. We train the 2D-CNN model by maximizing the likeli-
hood, and transform the scores f_c(I_{i,j,k}; (W, b)) of each class
of interest c \in \{1, \ldots, N\} into conditional probabilities by
using the following softmax function [42]:

p(c \mid I_{i,j,k}; (W, b)) = \frac{ e^{f_c(I_{i,j,k}; (W, b))} }{ \sum_{d \in \{1, \ldots, N\}} e^{f_d(I_{i,j,k}; (W, b))} }    (3)
The parameters (W, b) are learned by minimizing the neg-
ative log-likelihood based on the training set:

L(W, b) = - \sum_{I_{i,j,k}} \ln p(l_{i,j,k} \mid I_{i,j,k}; (W, b))    (4)
where l_{i,j,k} is the correct class label of the pixel at position
(i, j) of the image I_k. To optimize the objective function,
stochastic gradient descent (SGD) with back-propagation is
applied.
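Under the same assumptions, one SGD step can be sketched as follows (reusing the CNN2D class sketched above); note that PyTorch's CrossEntropyLoss combines the softmax of Eq. (3) with the negative log-likelihood of Eq. (4), and argmax implements the prediction rule given below in Eq. (5):

import torch

model = CNN2D()  # the 2D-CNN class from the sketch above
criterion = torch.nn.CrossEntropyLoss()          # softmax + negative log-likelihood
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# One SGD step on a toy mini-batch of 10 patches with random labels.
patches = torch.randn(10, 200, 7, 7)
labels = torch.randint(0, 16, (10,))

optimizer.zero_grad()
loss = criterion(model(patches), labels)  # Eq. (4) on this mini-batch
loss.backward()                           # back-propagation
optimizer.step()                          # SGD update of (W, b)

# At test time, Eq. (5): pick the class with the largest probability.
predicted = model(patches).argmax(dim=1)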
Fig. 3. The 3D-CNN model, comprising 3D convolution operations with the corresponding kernel size (K) and the number of feature maps (m) for each convolutional layer; the input data cube has width N, height M, and D spectral bands.
At testing time, the output layer of the proposed model
predicts the label of the pixel located at (i, j) of the image I
by using the argmax function:

\hat{l}_{i,j} = \arg\max_{c \in \{1, \ldots, N\}} p(c \mid I_{i,j}; (W, b))    (5)
B. 3D-CNN model
One main difference between a hyperspectral image and a
conventional image is that the former is captured by scanning
the same region with different spectral bands, while the latter
is not. As the image formed by hyperspectral bands may have
some correlations e.g., close hyperspectral bands may result
in similar images, it is desirable to take into account hyper-
spectral correlations. Though the 2D-CNN model can utilize
the spatial context, it ignores the hyperspectral correlations.
Hence, we develop a 3D-CNN model to address this issue.
As shown in Fig.3, the operational details of our 3D-CNN
model are quite similar to those of the 2D-CNN model. The
main difference is that the 3D-CNN model has one extra phase
of reordering. In this phase, we rearrange the D hyperspectral
bands in ascending order of wavelength. By doing so, images
of similar spectral bands are sequentially ordered, which can
preserve their correlations under a spectral context. The patch
extraction phase and the label identification phase of the two
models are quite similar. For the feature extraction phase, a
3D convolution operator instead of a 2D convolution operator
is applied in the 3D-CNN model.
More specifically, the 3D convolution operation is formu-
lated as follows:
v_{ij}^{xyz} = F\Big( b_{ij} + \sum_{m} \sum_{p=0}^{N_i - 1} \sum_{q=0}^{M_i - 1} \sum_{r=0}^{D_i - 1} w_{ijm}^{pqr} \, v_{(i-1)m}^{(x+p)(y+q)(z+r)} \Big)    (6)
where D_i is the size of the 3D kernel along the spectral
dimension, and j indexes the kernels of the i-th layer; w_{ijm}^{pqr}
is the value at the (p, q, r)-th position of the kernel connected
to the m-th feature map (a cube) of the preceding layer. Again,
the ReLU function is adopted as the activation function F.
The 3D convolution operation is illustrated in Fig. 3. We
can see that the 3D convolution operation is applied to a 3D
patch step by step, e.g., from top to bottom, from left to right,
and from inner to outer. In each step, a convolution scalar
is produced and placed at the corresponding position of the
feature map (shown as red lines in Fig. 3). This operation
produces a smaller 3D cube as a feature map. Training a 3D-
CNN model is similar to training a 2D-CNN model in which
we utilize the softmax function to compute the probability of
each class. Moreover, we formulate the training process as an
optimization problem by maximizing the log-likelihood of the
training data. In addition, stochastic gradient descent (SGD)
with back-propagation is applied to network training.
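To illustrate how a 3D convolution consumes the spectral axis as well, consider the following minimal PyTorch sketch (the kernel size and filter count are illustrative assumptions, not our exact configuration); the patch is treated as a one-channel volume whose depth is the reordered spectral dimension:

import torch
import torch.nn as nn

# A single 3D convolution layer applied to a 7 x 7 x 200 patch. The patch is
# treated as a one-channel volume of depth 200 (the reordered spectral bands),
# so the kernel slides along the spectral axis as well as the two spatial axes.
conv3d = nn.Conv3d(in_channels=1, out_channels=8, kernel_size=(3, 3, 3))

patch = torch.randn(10, 1, 200, 7, 7)  # (batch, channel, spectral, height, width)
out = conv3d(patch)
print(out.shape)  # torch.Size([10, 8, 198, 5, 5]) -- all three dimensions shrink

Note that all three output dimensions shrink, which is why the spectral lengths must later be aligned in the recurrent variant (see the R-3D-CNN model below).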
C. R-2D-CNN model
As noted above, though the 2D-CNN model can exploit the
spatial context, it may introduce unwanted noise because the
classification of a pixel relies on the features of a small patch
surrounding the pixel rather than the features directly attached
to the pixel. To better exploit the spatial context, we design a
recurrent 2D-CNN model (R-2D-CNN). In particular, the R-
2D-CNN model constructs multiple shrunk patches as multi-
level instances (see Fig.4), and leverages a multi-scale deep
neural network to fuse the multi-level instances for prediction.
For clarity, we denote the instances as the 1-st level, 2-nd level,
..., and the P-th level, corresponding to patches from the
biggest to the smallest, where the P-th level often corresponds
to the pixel for classification, i.e., a 1-by-1 patch. The R-
2D-CNN deep neural network comprises a recurrent CNN
structure, where a basic 2D-CNN block is reused multiple
times. More specifically, it uses the basic 2D-CNN block to
extract the feature maps for the 1-st level instances at the
beginning. These feature maps are then concatenated with the
2-nd level instances, which are fed to the same 2D-CNN block
for extracting the next level feature maps. This procedure is
repeated until the P-th level instances are fused. Finally, a
softmax layer is then applied to compute the probability of
each class. By utilizing the multiple shrunk patches, we can
consider the spatial context information, and also can focus
more on the information closer to the pixel for classification.
Hence, the unwanted noise can be reduced.
The main architecture of the R-2D-CNN model is illustrated
in Fig. 5. At the p-th level, the network is fed with an input
"feature image" F^p of H + D two-dimensional images (H
represents the number of feature maps produced by the 2D-CNN
block), which comprises the H feature maps of the (p-1)-th
instance and the D hyperspectral bands of the p-th instance,
for 1 \leq p \leq P. Formally, the procedure is defined as follows:

F^p = [F(F^{p-1}), I^p_{i,j,k}], \qquad F^1 = [0, I_{i,j,k}]    (7)

where I^p_{i,j,k} stands for the input patch surrounding the pixel
at location (i, j) of the training image k at the p-th level. At the
first level, the network only takes the original patch as the input
because there is no instance from a previous level to produce
the feature maps. Though the R-2D-CNN model is multi-level,
the model complexity does not increase with respect to the
number of levels. The reason is that the parameters pertaining
to different levels are shared (as shown in Fig. 5).
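The recurrence of Eq. (7) with shared weights can be sketched as follows (a minimal PyTorch illustration; the two instance sizes, 13 × 13 and 7 × 7, follow the Indian Pines configuration in Section IV, while the channel counts are assumptions):

import torch
import torch.nn as nn

# A shared three-layer 2D-CNN block, reused at every level, so the number
# of parameters does not grow with the number of levels.
block = nn.Sequential(
    nn.Conv2d(600, 400, 3), nn.ReLU(),   # input: D = 200 bands + H = 400 maps
    nn.Conv2d(400, 400, 3), nn.ReLU(),
    nn.Conv2d(400, 400, 3), nn.ReLU(),
)

level1 = torch.randn(10, 200, 13, 13)   # first-level instances (larger patches)
level2 = torch.randn(10, 200, 7, 7)     # second-level instances (smaller patches)

# F^1 = [0, I]: zero feature maps concatenated with the first-level instance.
f1 = block(torch.cat([torch.zeros(10, 400, 13, 13), level1], dim=1))  # -> (10, 400, 7, 7)
# F^2 = [F(F^1), I^2]: concatenate and reuse the SAME block.
f2 = block(torch.cat([f1, level2], dim=1))                            # -> (10, 400, 1, 1)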
Model training of the R-2D-CNN model is the same as
that of the 2D-CNN model, where gradients are computed by
using the back-propagation through time (BPTT) algorithm [43]
during the back-propagation process. More specifically, we
first unfold the network as shown in Fig. 5, and then train the
model with the BPTT algorithm. However, in contrast to the
2D-CNN model, we
have to learn the network parameters (W, b) with a new loss
function, owing to the multi-level recurrent architecture. The
loss function is defined according to Eq. (7):

L(F) + L(F \circ F) + \ldots + L(F^{\circ p})    (8)

where L(F) is a shorthand for the negative log-likelihood
defined in Eq. (4) of the 2D-CNN model, and \circ p denotes the
composition operation performed p times. Thus, each network
instance is trained to produce the correct label at the location
(i, j). In this manner, the R-2D-CNN model is able to learn
from and correct the mistakes produced by the earlier iterations.
As a by-product, the R-2D-CNN model can also capture label
dependencies, that is, predicting the label of an instance based
on the label of the previous instance around location (i, j).

Fig. 4. Context input patch of the "plain" network: a), and recurrent context input patches: b). The size of the context input patch in b) increases as the number of instances in the recurrent 2D convolutional network increases.

Fig. 5. The R-2D-CNN model, comprising two basic 2D-CNN blocks with parameters shared across levels.
It is worth noting that the sizes of multi-level instances in
a R-2D-CNN model must be carefully designed so that the
instances can be concatenated with the feature maps of the
previous instances. To this end, we first need to establish how
the size of a feature map changes when it passes through a 2D
convolution layer. Let sz_{m-1} denote the size of the feature
map of the (m-1)-th convolution layer. Then, the size of
the feature map produced by the m-th convolution layer is
computed as follows:

sz_m = \frac{sz_{m-1} - kW_m}{dW_m} + 1    (9)
where kW_m is the size of the convolution kernel of the m-th
layer, and dW_m is the stride size. By Eq. (9), we can compute
the size of a feature map produced by our 2D-CNN block.
Hence, we can estimate the appropriate sizes of the instances
with respect to different levels.
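For instance, chaining Eq. (9) over the layers of a block makes the required instance sizes explicit; a small helper (the function name is hypothetical) is sketched below:

def feature_map_size(sz_in, kernel_sizes, strides):
    """Chain Eq. (9) over a stack of convolution layers (no padding)."""
    sz = sz_in
    for kW, dW in zip(kernel_sizes, strides):
        sz = (sz - kW) // dW + 1
    return sz

# Three 3x3 convolutions with stride 1 shrink a 13x13 patch to 7x7,
# so the second-level instance must be a 7x7 patch.
print(feature_map_size(13, [3, 3, 3], [1, 1, 1]))  # 7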
D. R-3D-CNN model
To better utilize the spatial and the spectral contexts of
hyperspectral images, we design the recurrent 3D-CNN model
(R-3D-CNN). As for the R-2D-CNN model, the R-3D-CNN
model is also underpinned by multi-level recurrent neural
networks which shrink a patch gradually to form multi-level
instances. There are two main differences between the R-3D-
CNN model and the R-2D-CNN model. The first difference is
that the former utilizes 3D convolution operators whereas the
latter uses its 2D counterparts. Hence, the R-3D-CNN model
can be regarded as an extension of the 3D-CNN model in a
recurrent manner. The second difference is that the instances of
the next level need to be preprocessed and concatenated with
the feature maps generated from the current level. The reason
is that the 3D convolution layers change the length of the
spectral dimension. Hence, we have to preprocess the instances
of the next level with some 3D convolution operations along
the spectral channels to adapt to the changing sizes.
Fig.6 depicts an example of the proposed R-3D-CNN
model. The model consists of a multi-level recurrent neural
network with Pmulti-level instances. As for the R-2D-CNN
model, a ”plain” 3D convolution network is applied to extract
the corresponding feature maps, which are then concatenated
with the next level instance to form new feature maps at each
level. This procedure is repeated until all multi-level instances
(patches) are incorporated. To ensure the consistency of the
sizes of feature maps of the current level and the sizes of the
instances at the next level, a preprocessing step is introduced to
the spectral channels. Finally, a softmax layer is applied, and a
cross entropy objective function is adopted. The optimization
process is again performed by using the BPTT algorithm. As
for the R-2D-CNN model, the complexity of the R-3D-CNN
model remains moderate because the recurrent structure shares
the same network parameters across multiple levels. As for the
3D-CNN model, we need to reorder the hyperspectral images
according to the ordering of spectral bands. Also, the size of
multi-level instances must be carefully determined as for the
R-2D-CNN model.
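The spectral-alignment preprocessing can be sketched as follows (a minimal PyTorch illustration; the three spectral kernel sizes are assumptions chosen so that 200 input bands shrink to the 187 bands produced for the first-level feature maps in the Indian Pines configuration of Section IV):

import torch
import torch.nn as nn

# The recurrent 3D block shrinks the spectral axis, so the next-level instance
# must be preprocessed before concatenation. Here, three spectral-only 3D
# convolutions reduce 200 bands to 187, matching the feature maps produced
# from the first-level instance; the spatial dimensions are left untouched.
preprocess = nn.Sequential(
    nn.Conv3d(1, 1, kernel_size=(5, 1, 1)),  # 200 -> 196 bands
    nn.Conv3d(1, 1, kernel_size=(5, 1, 1)),  # 196 -> 192 bands
    nn.Conv3d(1, 1, kernel_size=(6, 1, 1)),  # 192 -> 187 bands
)

level2 = torch.randn(10, 1, 200, 7, 7)             # second-level 7x7x200 instances
aligned = preprocess(level2)                       # (10, 1, 187, 7, 7)
maps_from_level1 = torch.randn(10, 1, 187, 7, 7)   # feature maps of the first level
fused = torch.cat([aligned, maps_from_level1], dim=1)  # concatenate as channels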
IV. EXPERIMENTAL RESULTS
We chose to use six publicly available hyperspectral image
data sets for evaluating the performance of the proposed
models. For a comparative evaluation, we also adopted SVM-
CK, GCK, LORSAL, SMLR-SpATV, MFL, SVM-3D, SVM-
3DG, and CNN-MRF as the baselines. For the performance
metrics, we used the overall accuracy of all classes, denoted
as OA, and the average accuracy of each class, denoted as
AA. We ran all the models on a desktop PC equipped with an
Intel Core i7 CPU (at 3.40 GHz) with 12 GB of RAM;
two GTX 1080Ti GPUs were also used.
A. Data Sets
1) Indian Pines Scene
The data set was collected in 1992 by the AVIRIS sensor,
which recorded remote sensing images over the Indian Pines
test site in north-western Indiana, USA. The hyperspectral image
contains 145 × 145 pixels in spatial dimensions and 224
hyperspectral bands. Due to the presence of noisy bands, we
only used 200 hyperspectral bands. Specifically, the bands
covering the regions of water absorption, i.e., [104-108],
[150-163], and 220, were removed. The available ground truth
includes 16 classes, which are not all mutually exclusive. As
shown in Fig. 7 a), we randomly divided the labeled data into
training (70%) and testing (30%) sets for our experiment.

Fig. 6. The R-3D-CNN model with network parameters shared across multiple levels. The plain network is built with a small instance based on the basic 3D-CNN model, while the recurrent network is built with two instances of the basic 3D-CNN model; the complexity of the model remains moderate because of the parameters shared across multiple levels.

Fig. 7. Labeled images of the different data sets: a) Indian Pines Scene; b) Botswana Scene; c) Salinas Scene; d) Pavia Centre Scene; e) Pavia University Scene; f) Kennedy Space Center.
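As a concrete illustration of this preprocessing, the following NumPy sketch drops the noisy bands and performs the random 70/30 split (it assumes the common convention that the 224 raw AVIRIS bands are first reduced to 220 calibrated bands, so that removing the 20 water-absorption bands leaves 200; the data cube and ground truth here are random stand-ins):

import numpy as np

# Start from the 220 calibrated bands; removing the 20 water-absorption
# bands [104-108], [150-163], and 220 (1-based indices) leaves 200 bands.
cube = np.random.rand(145, 145, 220)  # stand-in for the real data cube
noisy = [b - 1 for b in list(range(104, 109)) + list(range(150, 164)) + [220]]
cube = np.delete(cube, noisy, axis=2)
print(cube.shape)  # (145, 145, 200)

# Randomly divide the labeled pixels into training (70%) and testing (30%) sets.
labels = np.random.randint(0, 17, size=(145, 145))  # stand-in ground truth; 0 = unlabeled
rows, cols = np.nonzero(labels)
perm = np.random.permutation(len(rows))
n_train = int(0.7 * len(rows))
train_pixels = list(zip(rows[perm[:n_train]], cols[perm[:n_train]]))
test_pixels = list(zip(rows[perm[n_train:]], cols[perm[n_train:]]))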
2) Botswana Scene
The Botswana Scene was acquired by the Hyperion sensor on
the NASA EO-1 satellite on May 31, 2001. This data set was
collected over the Okavango Delta. The hyperspectral image
contains 1476 × 256 pixels taken over 224 bands, from 400 nm
to 2500 nm with an incremental step of 10 nm. As for the
Indian Pines Scene data set, we removed the noisy bands to
produce an experimental data set containing 145 bands only.
The image data set contains 14 categories. As shown in Fig. 7
b), we randomly split the data set into the training (70%) and
the testing (30%) sets, respectively.
3) Salinas scene
The Salinas Scene is a hyperspectral image data set
recorded in 1992 by the AVIRIS sensor, which captured images
of the Salinas Valley, California. The original images were
composed of 224 bands. We discarded 20 noisy bands, namely
bands [108-112], bands [154-167], and band 224, to
generate a hyperspectral image data set of 204 bands. For the
spatial dimensions, the scene includes 512 × 217 pixels. There
are 16 labeled classes in the original data set as shown in Fig.7
c).
4) Pavia Centre scene
This hyperspectral image data set was acquired over Pavia
in northern Italy. It was produced in 2001 by using the Re-
flective Optics System Imaging Spectrometer (ROSIS) sensor.
The Pavia Centre scene comprises 1096 × 1096 pixels with
114 hyperspectral bands. We preprocessed these images by
removing 12 noisy bands. There are nine labeled classes in
the data set as shown in Fig.7 d).
5) Pavia University scene
This hyperspectral image data set captured the University of
Pavia in Italy by using the ROSIS sensor. There are 103
hyperspectral bands in the image data set, with 610 × 340
pixels for the spatial dimensions. The image contains nine
labeled classes as shown in Fig.7 e).
6) Kennedy Space Center
The last data set, namely the Kennedy Space Center (KSC)
scene, captured the KSC area in Florida by using the AVIRIS
sensor on March 23, 1996. The hyperspectral image consists of
512 × 614 pixels of spatial dimensions, with 224 spectral
bands. After removing 48 noisy bands, we obtained 172
spectral bands. There are 13 labeled classes as shown in Fig.7
f).
B. Experimental Results
1) Results for the Indian Pines Scene
Before reporting the details of our experimental results,
we first elaborate on the various settings of the deep learning
techniques employed in our experiments. The structure of the
2D-CNN model is depicted in Fig.8. For the classification
of each pixel, a 7 × 7 × 200 patch surrounding it is first
constructed. Following this, three 2D convolution layers of size
3 × 3 are utilized. Moreover, the 200 spectral bands are treated
as channels. The number of filters is set to 400 for the respective
layers, and the stride is set to 1. As a result, the feature maps
produced by the first, second, and last convolution layers are
of size 5 × 5 × 400, 3 × 3 × 400, and 1 × 1 × 400, respectively. Finally, a
softmax layer of 16 classes is deployed to classify the images.
The proposed network structure does not include the pooling
layers so as to keep as much information of each pixel as
possible. In addition, we apply SGD for network training and
set the mini-batch size to 10.
The structure of the 3D-CNN model is depicted in Fig.9.
Similar to the 2D-CNN model, a 7 × 7 × 200 patch is first
extracted. Next, we build eight 3D convolution layers. The
size, number, stride, and feature map sizes of the 3D filters in
each layer are shown in Fig. 9. Again, we exclude the pooling
layers and adopt a mini-batch size of 10. Before applying the
3D-CNN model, the hyperspectral bands are first reordered.
Fig.10 depicts the structure of the R-2D-CNN model. For
this model, the first level instance is a 13 × 13 × 200 patch.
Then, we apply a three-layer 2D-CNN block to the instance,
which results in a 7 × 7 × 400 feature map. After concatenating
this feature map with our second level instance, that is, a
7 × 7 × 200 patch, and reusing the 2D-CNN block, we obtain
an 800-dimensional vector, which is connected to a softmax layer
to classify images. Again, we adopt a mini-batch size of 10
and do not utilize any pooling layers.
Similarly, we construct a 13 × 13 × 200 patch and a 7 × 7 × 200
patch at the first two levels of the R-3D-CNN model. As shown
in Fig. 11, we build a seven-layer 3D-CNN block and apply it
to the first level instance. This produces a 7 × 7 × 187 feature
map. Since the spectral band dimension is changed from 200
to 187, we first apply a three-layer 3D convolution operation
to the second level instance. By doing so, the third dimension
is reduced to 187, and it can be concatenated with the feature
map of the first level instance. Then, we reuse the seven-layer
3D convolution block, which produces a 1 × 10 × 35 feature
map. Finally, a softmax layer is applied to the resulting feature
map to determine the class label.
Next, we report the experimental results based on various
data sets. Table I presents the performance of all the methods.
We observe that the R-3D-CNN model achieves the best
performance, of which the OA is 99.50%. Although the
OA of the SMLR-SpATV is 99.11%, the R-3D-CNN model
outperforms it by more than 44%, if we consider the reduction
of error rates. The main reason is that R-3D-CNN model
considers both the spectral and the spatial contexts, where the
former is inferred via the 3D convolution operation and the
latter is inferred by using the multi-level recurrent structure.
In terms of AA and OA, the R-2D-CNN model is ranked
as the second best, which is followed by the SMLR-SpATV,
2D-CNN model and the 3D-CNN model. Though R-2D-CNN
ignores the spectral correlations, its recurrent structures can
effectively capture the spatial context for subsequent image
classification. Our experimental results also imply that the
spatial context is more important than the spectral correlations
for hyperspectral image classification. As shown in Table I,
the results of SVM-CK are better than those of SVM-3D, SVM-3DG,
and CNN-MRF. However, its performance is much worse than
various deep learning techniques. The first reason is that
the SVM-CK classifier ignores the spatial and the spectral
contexts. The second reason is that the SVM-CK classifier
cannot effectively capture the nonlinear relationships between
the features and the class labels of hyperspectral images. As
a promising classification method, the GCK achieves perfor-
mance comparable to that of the 2D-CNN and the 3D-CNN
methods because it can extract EMAP information pertaining
to the spectral and the spatial contexts. Fig.12 provides a visual
comparison of the performance of all methods.
2) Results for the Pavia University Scene
In this experiment, the structures of the deep learning mod-
els were quite similar to those applied to the Indian Pines
Scene experiment. The only difference was that the numbers of
parameters were adjusted to match the 103 hyperspectral bands
of this data set, compared with the 200 bands of the Indian
Pines data set. Table II presents the experimental results of
all the methods based on the Pavia University Scene data set.
Again, we can see that the proposed R-3D-CNN performs the
best, followed by the R-2D-CNN model, the SMLR-SpATV
method, the MFL method and the GCK method. The OA of the
R-3D-CNN model is 99.97%, which is 0.49% higher than that
of the GCK (99.48%). The R-3D-CNN model outperforms
the GCK method by more than 94% when we consider the
reduction of error rates. The SVM-3DG method, the 3D-CNN,
and the 2D-CNN models achieve comparable results. The
LORSAL classifier produces the worst performance among
all the methods. Fig.13 visualizes the classification results of
all the methods.
3) Results for the Botswana Scene
As for the earlier scenes, we only modified the number of
parameters of our deep learning models. Table III presents
the experimental results of all the methods based on the
Botswana Scene data set. We can see that the proposed R-
3D-CNN model and the MFL method achieve the highest
performance, followed by the SVM-3DG, the GCK method
and the R-2D-CNN model. The OA of the R-3D-CNN model
is 99.38%, which is 0.31% higher than that of the MFL
(99.07%). The R-3D-CNN model outperforms the MFL
method by more than 31%, in terms of the reduction of
error rates. Again, the other models such as the 3D-CNN,
the SMLR-SpATV and the 2D-CNN models perform better
than the LORSAL classifier which produces the worst result.
Fig.14 visualizes the classification results of all the methods.
4) Results for the Salinas Scene
Table IV presents the experimental results of all the
methods based on the Salinas Scene data set. The R-3D-
CNN model achieves the best performance, followed by the
GCK method, the MFL method and the R-2D-CNN model.
The SMLR-SpATV, the 3D-CNN and the 2D-CNN models
also achieve promising results. The OA of the R-3D-CNN
model is 99.80%, which is 0.46% higher than that of the
GCK (99.34%). The R-3D-CNN model improves on the GCK
method by nearly 70% when we consider the reduction of
error rates. Again, the LORSAL classifier produces the worst
result among all the methods. Due to memory limitations of
our computer, we could not run the CNN-MRF classifier on
TABLE I
CLASSIFICATION RESULTS OF THE INDIAN PINES SCENE.

Class  SVM-CK [17]  GCK [23]  LORSAL [26]  SMLR-SpATV [27]  MFL [28]  SVM-3D [29]  SVM-3DG [29]  CNN-MRF [35]  2D-CNN  3D-CNN  R-2D-CNN  R-3D-CNN
1 85.71±0.4 92.86±0.5 85.71±0.1 92.86±0.2 85.71±0.3 57.14±0.4 64.29±0.4 84.62±0.2 71.72±1.0 85.71±0.4 78.57±0.1 100
2 86.82±0.3 98.12±0.4 89.88±0.3 98.59±0.2 96.24±0.2 79.29±0.3 80.00±0.2 65.65±0.3 95.85±0.6 96.46±0.3 99.29±0.1 100
3 86.12±0.2 94.29±0.3 82.04±0.1 98.37±0.2 92.65±0.3 71.02±0.3 73.47±0.4 96.36±0.4 95.90±0.3 97.13±0.1 98.77±0.3 100
4 88.40±0.5 94.20±0.3 82.61±0.3 100 97.10±0.4 97.10±0.4 97.10±0.5 88.73±0.3 73.91±0.1 98.55±0.3 100 100
5 95.10±0.4 96.50±0.4 91.61±0.3 98.60±0.2 97.20±0.3 95.80±0.1 91.61±0.3 93.06±0.2 97.20±0.2 97.90±0.2 97.90±0.2 100
6 98.61±0.7 99.08±0.1 99.08±0.2 99.54±0.4 99.54±0.2 98.16±0.3 97.70±0.4 99.09±0.4 96.31±0.3 97.68±0.5 99.53±0.3 100
7 75.00±0.1 100 75.00±0.1 100 100 75.00±0.3 62.50±0.3 50.00±0.1 100 100 87.50±0.2 100
8 98.60±0.1 100 97.90±0.2 100 100 99.30±0.2 100 95.10±0.4 100 99.30±0.07 100 100
9 100 83.33±0.3 83.33±0.1 83.33±0.3 100 100 100 100 100 100 100 100
10 87.19±0.7 93.43±0.4 85.81±0.5 98.27±0.3 92.04±0.2 75.09±0.4 75.78±0.4 76.63±0.2 97.20±0.6 98.26±0.3 98.95±0.3 99.65±0.3
11 91.01±0.8 98.09±0.8 88.83±0.1 99.59±0.2 98.50±0.3 88.01±0.2 95.37±0.4 97.55±0.2 99.04±0.4 98.77±0.4 99.45±0.2 99.31±0.2
12 94.84±0.6 94.89±0.4 88.64±0.2 99.43±0.3 96.02±0.3 85.80±0.4 86.36±0.3 76.27±0.3 95.45±0.6 97.15±0.4 99.43±0.2 98.85±0.2
13 100 100 100 100 98.36±0.3 98.36±0.3 98.36±0.2 100 100 96.72±0.1 98.36±0.1 100
14 96.81±0.7 99.20±0.2 96.02±0.2 100 99.47±0.2 97.88±0.4 97.08±0.4 99.47±0.4 98.94±0.3 99.46±0.6 100 99.73±0.2
15 81.57±0.8 95.61±0.6 83.33±0.3 97.37±0.2 97.37±0.3 86.84±0.2 100 95.65±0.2 94.73±0.6 93.80±0.5 98.24±0.2 96.46±0.3
16 100 100 85.71±0.1 100 100 82.14±0.2 82.14±0.3 100 100 100 96.42±0.8 96.42±0.5
AA 91.62±0.3 96.22±0.5 88.47±0.2 97.87±0.2 96.89±0.2 85.23±0.3 87.61±0.1 88.26±0.3 96.37±0.3 97.31±0.2 97.03±0.3 99.42±0.3
OA 91.51±0.2 97.44±0.4 90.10±0.1 99.11±0.2 97.05±0.3 86.55±0.2 89.44±0.2 88.95±0.2 97.08±0.2 98.92±0.4 99.19±0.3 99.50±0.3
TABLE II
CLASSIFICATION RESULTS OF THE PAVIA UNIVERSITY SCENE.

Class  SVM-CK [17]  GCK [23]  LORSAL [26]  SMLR-SpATV [27]  MFL [28]  SVM-3D [29]  SVM-3DG [29]  CNN-MRF [35]  2D-CNN  3D-CNN  R-2D-CNN  R-3D-CNN
1 99.53±0.3 99.64±0.4 91.20±0.2 99.85±0.2 100 98.14±0.2 99.45±0.2 99.60±0.4 91.65±0.2 98.70±0.3 99.34±0.3 100
2 98.59±0.5 99.85±0.3 96.92±0.1 100 99.93±0.3 99.30±0.2 99.86±0.3 98.09±0.4 99.45±0.1 99.77±0.4 99.96±0.4 100
3 85.21±0.3 97.61±0.5 64.07±0.3 100 93.64±0.2 88.08±0.4 87.12±0.1 76.31±0.2 92.09±0.3 97.94±0.4 98.88±0.2 100
4 96.52±0.2 98.62±0.2 88.57±0.4 93.58±0.4 98.59±0.5 99.02±0.2 99.67±0.3 96.08±0.4 87.26±0.4 94.55±0.2 93.91±0.3 99.89±0.1
5 99.75±0.5 99.34±0.7 99.75±0.2 100 99.50±0.3 100 100 99.75±0.2 91.05±0.2 97.77±0.1 98.27±0.2 100
6 92.24±0.4 99.78±0.5 57.76±0.3 100 99.67±0.3 96.82±0.1 99.07±0.3 88.66±0.3 98.61±0.4 99.60±0.2 99.94±0.4 100
7 91.71±0.1 99.41±0.5 59.05±0.2 100 99.75±0.2 95.48±0.3 95.73±0.2 83.71±0.1 90.72±0.1 98.00±0.2 98.49±0.2 100
8 93.30±0.2 98.52±0.4 80.45±0.3 99.82±0.2 99.10±0.4 95.20±0.4 96.38±0.4 92.75±0.2 93.57±0.4 98.55±0.3 99.81±0.2 100
9 99.65±0.5 99.65±0.3 97.89±0.4 95.77±0.3 100 98.94±0.2 98.94±0.2 98.59±0.4 86.62±0.2 81.34±0.2 94.72±0.3 98.94±0.3
AA 94.52±0.3 99.21±0.3 81.74±0.2 98.78±0.4 98.91±0.2 96.78±0.4 97.36±0.2 92.62±0.2 98.78±0.4 96.25±0.3 98.15±0.2 99.87±0.3
OA 94.72±0.2 99.48±0.2 86.74±0.2 99.41±0.2 99.42±0.2 97.80±0.4 98.62±0.1 95.16±0.3 95.46±0.2 98.49±0.2 99.19±0.2 99.97±0.2
TABLE III
CLASSIFICATION RESULTS OF THE BOTSWANA SCENE.

Class  SVM-CK [17]  GCK [23]  LORSAL [26]  SMLR-SpATV [27]  MFL [28]  SVM-3D [29]  SVM-3DG [29]  CNN-MRF [35]  2D-CNN  3D-CNN  R-2D-CNN  R-3D-CNN
1 100 100 100 100 100 100 100 100 98.77±0.4 100 100 100
2 100 100 90.00±0.1 100 100 100 100 100 93.33±0.3 93.33±0.2 100 100
3 98.66±0.8 100 93.33±0.3 100 97.33±0.3 98.67±0.3 100 98.67±0.3 97.33±0.3 94.67±0.3 98.66±0.2 100
4 96.87±0.5 99.48±0.6 89.06±0.3 100 100 98.44±0.2 100 79.69±0.4 98.44±0.5 96.88±0.3 96.87±0.3 100
5 92.59±0.2 93.17±0.2 76.54±0.2 97.53±0.1 97.53±0.1 97.53±0.3 98.77±0.3 90.00±0.1 91.36±0.3 96.30±0.3 96.29±0.2 97.53±0.3
6 90.12±0.4 93.97±0.5 58.02±0.1 98.77±0.3 100 92.59±0.4 88.89±0.4 90.00±0.2 100 98.63±0.1 100 100
7 100 100 98.68±0.4 100 100 100 100 92.21±0.1 100 98.69±0.3 100 100
8 100 100 86.67±0.3 98.33±0.2 100 100 100 90.00±0.3 100 95.00±0.1 96.66±0.2 98.33±0.3
9 96.80±0.6 98.33±0.7 78.72±0.1 100 96.81±0.2 97.87±0.4 100 96.81±0.1 77.66±0.8 95.75±0.2 100 98.94±0.4
10 94.59±0.3 100 74.32±0.1 100 100 97.30±0.4 100 87.84±0.2 96.81±0.4 100 98.64±0.2 100
11 92.30±0.4 97.54±0.4 90.11±0.2 100 98.90±0.3 96.70±0.4 98.90±0.4 100 98.65±0.3 100 100 100
12 94.93±0.5 100 90.74±0.2 98.15±0.3 100 100 100 64.81±0.1 100 100 96.29±0.4 98.14±0.4
13 91.53±0.3 99.59±0.7 93.67±0.3 100 100 100 100 95.00±0.3 98.73±0.5 100 100 100
14 100 96.00±0.1 32.14±0.2 42.86±0.2 96.43±0.2 96.43±0.3 96.43±0.2 53.57±0.3 92.86±0.3 92.86±0.4 85.71±0.4 96.43±0.3
AA 96.79±0.3 98.33±0.6 82.29±0.4 95.40±0.4 99.07±0.3 98.25±0.3 98.78±0.4 88.47±0.3 97.21±0.3 97.30±0.3 98.89±0.3 99.24±0.2
OA 96.28±0.1 98.21±0.2 84.09±0.3 97.83±0.2 99.07±0.4 98.14±0.2 98.76±0.3 90.70±0.4 97.60±0.5 97.21±0.1 98.54±0.2 99.38±0.2
Fig. 8. The 2D-CNN network for hyperspectral remote sensing classification (the stride of each layer is 1).
Fig. 9. The 3D-CNN network for remote sensing hyperspectral image classification.
Fig. 10. The R-2D-CNN network for remote sensing hyperspectral image classification (the stride of each layer is 1).
this data set. Fig.15 visualizes the classification results of all
the methods.
5) Results for the Pavia Centre Scene
Table V presents the experimental results of all the methods
based on the Pavia Centre Scene data set. The results are quite
different from those obtained based on the previous four data
sets. We observe that the R-2D-CNN model performs the best,
followed by the SVM-3DG and the MFL classifiers. The R-2D-
CNN model outperforms the SVM-3DG method by more
than 88% in terms of the reduction of error rates. The R-3D-
CNN model, which achieves the best performance based on
the previous data sets, produces unsatisfactory results when
compared to those of the R-2D-CNN model and the SVM-CK
classifier. The reason may be that the R-3D-CNN model fuses
the spectral and the spatial information by using a 3D operator.
However, the Pavia Centre scene contains only 102 spectral
bands, fewer than the other data sets. On the
other hand, the 2D-CNN and the 3D-CNN models perform the
worst among all the methods because it is difficult for these
models to classify the 3-rd class and the 9-th class due to the
limited number of instances and channels. Since methods
such as the GCK, the SMLR-SpATV, and the CNN-MRF require
more RAM than our computer is equipped with, we could not
obtain their performance on this data set. Fig.16 shows the
classification results of all the methods.
6) Results for the Kennedy Space Center Scene
Table VI presents the experimental results of all the methods
based on the Kennedy Space Center Scene data set. Again,
we observe that the proposed R-3D-CNN performs the best,
followed by the R-2D-CNN model and the GCK model. The
R-3D-CNN model outperforms the GCK method by more than
95% in terms of error rate. The 3D-CNN model, the MFL
method, and the SVM-3DG models achieve comparable re-
sults, followed by the CNN-MRF method, the 2D-CNN model,
and the SVM-CK methods. The SVM-3D classifier produces
the worst result, and the LORSAL method outperforms the
Fig. 11. The R-3D-CNN network for remote sensing hyperspectral image classification.
*URXQGWUXWK 690&. *&. /256$/ 60/56S$79 0)L 690'
690'* &1105) '&11 '&11 5'&11 5'&11
Fig. 12. Classification maps and overall classification accuracies obtained for the AVIRIS Indian Pines data set (overall accuracies are reported in parentheses).
SVM-3D method by 2% in terms of OA. Fig.17 shows the
classification results of all the methods.
C. Convergence Speed Comparison
Fig.18 plots the accuracies of the different deep learning models
against the number of iterations on the six data sets. We
observe that the R-3D-CNN model converges in fewer
iterations than the other models, with the sole exception of
the Salinas data set. The efficiency
improvement brought by the R-3D-CNN model is attributed
to the recurrent structure and the 3D convolutional operation.
Specifically, the feature maps that are extracted by the R-
3D-CNN model contain richer contextual information of the
images, which leads to a quicker convergence of model
training.
D. The Impact of the Size of Training Samples
In this experiment, we examined how the performance of the
proposed deep learning models changed against varying sizes
of the training samples. To this end, we varied the proportion
of training samples from 10% to 70%, and report the OA
achieved by all the methods. Fig. 19 shows the results on the
six data sets. From Fig. 19, we can make two important observa-
tions. First, for the conventional classifiers, i.e., GCK, MFL,
SVM-CK, SVM-3D, SVM-3DG, SMLR-SpATV, LORSAL,
we find that their classification performances are insensitive to
the number of training samples, especially on the Bostwana
Scene, Salinas Scene, Pavia Centre Scene, Pavia University
Scene, and Kennedy Space Center data sets. Promising results
are achieved when only 10% of the samples are used for
training, and feeding more training samples to these
methods only leads to marginal performance improvement.
Among the conventional methods, the GCK and the MFL
TABLE IV
CLASSIFICATION RESULTS OF THE SALINAS SCENE.

Class  SVM-CK [17]  GCK [23]  LORSAL [26]  SMLR-SpATV [27]  MFL [28]  SVM-3D [29]  SVM-3DG [29]  CNN-MRF [35]  2D-CNN  3D-CNN  R-2D-CNN  R-3D-CNN
1 99.83±0.3 100 96.85±0.2 99.00±0.3 100 100 100 - 99.50±0.2 97.68±0.3 100 99.83±0.2
2 100 99.82±0.2 98.93±0.2 99.82±0.1 100 99.82±0.2 99.91±0.3 - 99.82±0.5 99.46±0.2 99.46±0.3 99.91±0.4
3 100 100 80.24±0.2 95.61±0.1 100 99.49±0.4 99.49±0.4 - 98.31±0.3 97.80±0.5 99.49±0.1 99.66±0.2
4 99.04±0.4 99.76±0.3 99.28±0.4 99.76±0.3 99.52±0.1 99.04±0.2 99.28±0.4 - 97.61±0.4 97.13±0.3 97.13±0.2 97.37±0.4
5 99.75±0.5 99.50±0.1 98.13±0.2 99.75±0.2 98.63±0.3 99.63±0.3 99.75±0.2 - 98.50±0.5 98.80±0.4 99.13±0.3 99.50±0.1
6 100 100 99.92±0.1 100 99.83±0.3 100 100 - 99.75±0.3 98.91±0.3 99.07±0.3 100
7 99.91±0.1 99.81±0.2 99.16±0.3 99.72±0.1 99.44±0.2 100 100 - 98.42±0.4 96.55±0.4 99.81±0.4 99.53±0.1
8 92.69±0.5 96.54±0.5 88.10±0.4 96.98±0.4 97.93±0.3 92.28±0.3 94.11±0.2 - 99.47±0.4 97.28±0.3 99.91±0.2 99.97±0.4
9 100 100 99.25±0.3 100 100 99.62±0.1 99.62±0.1 - 99.89±0.2 99.84±0.2 99.84±0.2 100
10 99.08±0.4 99.80±0.3 84.33±0.2 96.64±0.2 99.69±0.4 97.55±0.3 98.88±0.4 - 99.70±0.1 98.78±0.2 99.18±0.2 99.90±0.2
11 100 99.69±0.4 85.89±0.4 94.36±0.3 99.69±0.3 98.75±0.3 98.75±0.2 - 99.37±0.5 99.06±0.3 99.06±0.3 100
12 100 100 100 100 100 100 100 - 98.09±0.6 99.14±0.4 98.27±0.5 99.65±0.4
13 99.64±0.4 99.27±0.3 98.91±0.2 99.27±0.4 98.55±0.3 98.55±0.3 98.55±0.3 - 96.00±0.1 90.55±0.5 98.55±0.4 100
14 98.75±0.5 99.07±0.4 94.08±0.4 99.38±0.3 95.64±0.2 96.57±0.2 98.75±0.2 - 96.89±0.3 93.46±0.6 97.20±0.3 98.44±0.2
15 82.61±0.2 96.14±0.4 52.98±0.3 97.84±0.2 99.17±0.4 77.37±0.3 81.46±0.3 - 99.22±0.4 97.47±0.4 99.73±0.4 100
16 99.82±0.3 100 96.67±0.4 98.89±0.4 98.89±0.4 99.26±0.3 99.45±0.3 - 99.41±0.1 99.06±0.3 99.81±0.2 100
AA 96.06±0.5 98.66±0.4 92.05±0.3 98.56±0.3 99.19±0.4 97.37±0.3 98.00±0.4 - 98.90±0.3 98.65±0.2 99.12±0.4 99.61±0.2
OA 96.00±0.3 99.34±0.3 88.57±0.3 98.46±0.3 99.16±0.3 94.95±0.2 96.03±0.3 - 98.96±0.4 99.08±0.3 99.47±0.3 99.80±0.2
TABLE V
CLASSIFICATION RESULTS OF THE PAVIA CENTRE SCENE.
Class  SVM-CK[17]  GCK[23]  LORSAL[26]  SMLR-SpATV[27]  MFL[28]  SVM-3D[29]  SVM-3DG[29]  CNN-MRF[35]  2D-CNN  3D-CNN  R-2D-CNN  R-3D-CNN
1 100 - - - 99.99±0.4 100 100 - 99.36±0.2 99.85±0.2 100 99.32±0.2
2 98.24±0.4 - - - 97.41±0.2 96.14±0.2 98.03±0.2 - 87.51±0.3 95.73±0.2 99.65±0.4 86.52±0.3
3 96.54±0.5 - - - 91.15±0.3 93.96±0.3 92.34±0.2 - 88.35±0.5 95.73±0.1 99.78±0.3 85.11±0.3
4 94.30±0.2 - - - 97.89±0.2 90.20±0.2 94.42±0.1 - 88.59±0.4 95.04±0.1 99.88±0.4 91.80±0.1
5 98.18±0.3 - - - 99.70±0.4 99.09±0.4 99.44±0.2 - 91.89±0.4 97.52±0.2 99.85±0.4 92.95±0.3
6 99.31±0.5 - - - 98.59±0.3 98.34±0.2 99.24±0.2 - 90.48±0.3 97.76±0.3 99.75±0.3 90.87±0.4
7 95.38±0.4 - - - 93.87±0.3 96.89±0.5 98.35±0.3 - 96.30±0.4 99.36±0.2 99.95±0.2 96.97±0.2
8 99.80±0.5 - - - 99.87±0.2 99.93±0.2 99.95±0.2 - 96.98±0.5 98.84±0.3 99.90±0.3 96.49±0.1
9 100 - - - 94.76±0.3 99.65±0.3 99.88±0.4 - 73.69±0.1 91.51±0.4 98.25±0.3 83.80±0.2
AA 97.97±0.4 - - - 97.03±0.2 97.13±0.2 97.96±0.3 - 90.32±0.4 96.82±0.3 99.67±0.2 91.54±0.2
OA 97.32±0.3 - - - 99.10±0.2 99.17±0.2 99.47±0.3 - 96.02±0.4 98.75±0.1 99.88±0.3 96.79±0.2
TABLE VI
CLASSIFICATION RESULTS OF THE KENNEDY SPACE CENTER.
Class  SVM-CK[17]  GCK[23]  LORSAL[26]  SMLR-SpATV[27]  MFL[28]  SVM-3D[29]  SVM-3DG[29]  CNN-MRF[35]  2D-CNN  3D-CNN  R-2D-CNN  R-3D-CNN
1 96.49±0.2 100 94.74±0.2 97.37±0.4 99.56±0.3 98.25±0.2 99.56±0.3 96.50±0.2 98.68±0.4 98.68±0.4 100 100
2 95.90±0.2 100 90.41±0.1 98.63±0.2 100 86.30±0.3 97.26±0.2 97.22±0.2 100 91.78±0.1 100 97.26±0.2
3 97.40±0.3 98.70±0.3 93.51±0.2 98.70±0.4 100 97.40±0.3 97.40±0.2 98.68±0.4 100 98.78±0.1 100 100
4 86.84±0.2 90.79±0.4 75.00±0.4 84.21±0.1 100 77.63±0.2 96.05±0.2 77.33±0.2 94.73±0.6 97.37±0.4 94.74±0.3 98.68±0.4
5 77.08±0.3 97.92±0.3 72.92±0.2 81.25±0.2 100 83.33±0.4 83.33±0.3 83.33±0.3 93.75±0.2 91.67±0.6 97.92±0.3 100
6 78.26±0.1 100 75.82±0.3 90.11±0.2 99.56±0.3 98.55±0.3 98.55±0.4 100 100 97.10±0.1 98.55±0.1 98.55±0.1
7 87.09±0.6 100 96.77±0.3 100 100 100 100 100 100 100 100 100
8 96.90±0.9 99.23±0.4 96.90±0.4 98.45±0.4 98.45±0.3 93.20±0.4 94.57±0.3 99.22±0.2 93.80±0.8 100 96.90±0.2 100
9 100 100 98.72±0.1 100 99.36±0.3 100 100 100 100 99.36±0.6 100 99.35±0.1
10 97.50±0.1 98.33±0.3 97.50±0.4 98.45±0.3 98.33±0.3 68.33±0.2 98.33±0.2 100 98.32±0.8 94.12±0.6 100 100
11 99.20±0.1 100 100 100 87.20±0.4 99.20±0.2 99.20±0.4 100 100 100 96.80±0.1 99.20±0.1
12 98.67±0.5 97.35±0.1 92.72±0.1 95.36±0.3 96.69±0.4 96.03±0.2 100 100 98.01±0.3 96.69±0.5 99.34±0.3 98.68±0.2
13 100 100 97.84±0.4 100 100 87.77±0.4 100 100 97.48±0.2 97.12±0.2 100 98.56±0.1
AA 95.96±0.2 98.64±0.1 90.95±0.3 95.33±0.2 98.43±0.2 91.22±0.2 97.25±0.3 96.30±0.2 97.81±0.3 98.20±0.1 98.79±0.4 99.23±0.1
OA 96.95±0.6 98.98±0.3 93.59±0.4 96.86±0.3 98.27±0.3 91.67±0.3 98.27±0.3 97.56±0.3 97.76±0.5 98.46±0.3 99.22±0.4 99.85±0.3
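For reference, the per-class accuracies, the average accuracy (AA), and the overall accuracy (OA) reported in Tables IV-VI follow the standard definitions; a minimal sketch of the computation (a hypothetical helper of ours, not the authors' evaluation code):

```python
import numpy as np

def classification_scores(y_true, y_pred, n_classes):
    """Per-class accuracy, average accuracy (AA), and overall accuracy (OA)."""
    cm = np.zeros((n_classes, n_classes), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1                             # rows: true class labels
    class_totals = np.maximum(cm.sum(axis=1), 1)  # guard against empty classes
    per_class = np.diag(cm) / class_totals        # accuracy of each class
    aa = per_class.mean()                         # AA: unweighted mean over classes
    oa = np.diag(cm).sum() / cm.sum()             # OA: fraction of correct pixels
    return per_class, aa, oa
```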
Ground-truth SVM-CK(94.72%) GCK(99.48%) LORSAL(86.74%) SMLR-SpATV(99.41%) MFL(99.42%) SVM-3D(97.80%)
SVM-3DG(98.62%) CNN-MRF(95.16%) 2D-CNN(95.46%) 3D-CNN(98.49%) R-2D-CNN(99.19%) R-3D-CNN(99.97%)
Fig. 13. Classification maps and overall classification accuracies obtained for the Pavia University scene data set (overall accuracies are reported in parentheses).
Ground-truth SVM-CK(96.28%) GCK(98.21%) LORSAL(84.09%) SMLR-SpATV(97.83%) MFL(99.07%) SVM-3D(98.14%)
SVM-3DG(98.76%) CNN-MRF(90.70%) 2D-CNN(97.60%) 3D-CNN(97.21%) R-2D-CNN(98.54%) R-3D-CNN(99.38%)
Fig. 14. Classification maps and overall classification accuracies obtained for the Botswana Scene data set (overall accuracies are reported in parentheses).
methods often perform the best. Second, for the proposed
deep learning models, i.e., 2D-CNN, 3D-CNN, R-2D-CNN,
and R-3D-CNN, we observe an obvious performance improvement
as the proportion of training samples increases from 10%
to 50%. When more than 60% of the samples are used for
training, the R-2D-CNN and R-3D-CNN models often achieve
results comparable to the best conventional methods such as
GCK and MFL. Moreover, we observe that the deep learning-based
CNN-MRF model produces unstable classification performance;
that is, its results may become better or worse as
more training samples are used. In fact, the proposed R-2D-CNN
and R-3D-CNN models outperform CNN-MRF when
sufficient training samples are provided. All these observations
indicate that the proposed deep learning models (R-2D-CNN
and R-3D-CNN) are more effective than the baselines when
sufficient training samples are available.
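As a concrete illustration of the sampling protocol used above, the training proportion can be varied with a per-class (stratified) split. The sketch below reflects our own assumptions (a flat vector of per-pixel class labels); it is not the authors' exact sampling code:

```python
import numpy as np

def stratified_split(labels, train_ratio, seed=0):
    """Sample train_ratio of each class's labeled pixels for training."""
    rng = np.random.default_rng(seed)
    train_idx, test_idx = [], []
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)   # pixels belonging to class c
        rng.shuffle(idx)
        k = max(1, int(round(train_ratio * idx.size)))
        train_idx.extend(idx[:k])
        test_idx.extend(idx[k:])
    return np.asarray(train_idx), np.asarray(test_idx)

# Toy stand-in for a flattened ground-truth map with 16 classes.
labels = np.random.default_rng(1).integers(0, 16, size=10_000)
for ratio in (0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7):  # proportions used in Fig. 19
    train_idx, test_idx = stratified_split(labels, ratio)
```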
E. Discussions
In this subsection, we briefly discuss the experimental
results presented above. First, we find that the R-3D-CNN
model often performs better than the other models across all
six data sets. There are two possible reasons for this performance
improvement: (i) the R-3D-CNN model effectively
Ground-truth SVM-CK(96.00%) GCK(99.34%) LORSAL(88.57%) SMLR-SpATV(98.46%) MFL(99.16%)
SVM-3D(94.95%) SVM-3DG(98.00%) 2D-CNN(98.96%) 3D-CNN(99.08%) R-2D-CNN(99.47%) R-3D-CNN(99.80%)
Fig. 15. Classification maps and overall classification accuracies obtained for the Salinas scene data set (overall accuracies are reported in parentheses).
Ground-truth SVM-CK(94.72%) GCK(99.48%) LORSAL(86.74%) SMLR-SpATV(99.41%) MFL(99.42%) SVM-3D(97.80%)
SVM-3DG(98.62%) CNN-MRF(95.16%) 2D-CNN(95.46%) 3D-CNN(98.49%) R-2D-CNN(99.19%) R-3D-CNN(99.97%)
Fig. 16. Classification maps and overall classification accuracies obtained for the Pavia Centre scene data set (overall accuracies are reported in parentheses).
Ground-truth SVM-CK(96.95%) GCK(98.98%) LORSAL(93.59%) SMLR-SpATV(96.86%) MFL(98.27%) SVM-3D(91.67%)
SVM-3DG(98.27%) CNN-MRF(97.56%) 2D-CNN(97.76%) 3D-CNN(98.46%) R-2D-CNN(99.22%) R-3D-CNN(99.85%)
Fig. 17. Classification maps and overall classification accuracies obtained for the Kennedy Space Center data set (overall accuracies are reported in parentheses).
[Six line plots, panels (a)-(f): classification accuracy (y-axis) versus training iterations (x-axis) for the 2D-CNN, 3D-CNN, R-2D-CNN, and R-3D-CNN models.]
Fig. 18. Classification accuracy versus the number of training iterations on the
different data sets. (a) Indian Pines Scene; (b) Botswana Scene; (c) Salinas Scene;
(d) Pavia Centre Scene; (e) Pavia University Scene; (f) Kennedy Space Center.
[Six line plots, panels (a)-(f): overall accuracy (%) versus the proportion of training samples (10%-70%) for the 2D-CNN, 3D-CNN, R-2D-CNN, R-3D-CNN, SVM-CK, GCK, CNN-MRF, LORSAL, MFL, SMLR-SpATV, SVM-3D, and SVM-3DG methods.]
Fig. 19. The influence of the training sample proportion. (a) Indian Pines Scene;
(b) Botswana Scene; (c) Salinas Scene; (d) Pavia Centre Scene; (e) Pavia University
Scene; (f) Kennedy Space Center.
fuses the spatial and the spectral correlations; (ii) the multi-
level recurrent structure can exploit spatial contexts better than
a flat non-recurrent structure. For the same reason, we also
observe that the R-2D-CNN model often outperforms the other
models (such as LORSAL, GCK, SMLR-SpATV, CNN-MRF,
SVM-CK, SVM-3D, SVM-3DG, and MFL). The MFL and the
GCK models perform better than the 2D-CNN and the 3D-CNN
models because of their well-designed EMAP (extended
multi-attribute profile) attributes, which can effectively
represent the spatial contexts. On the other
other hand, the 2D-CNN and the 3D-CNN models perform
better than the SVM-CK, the SVM-3D, and the LORSAL
classifiers in general. All our experimental results verify the
effectiveness and the advantages of the deep learning-based
methods.
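One plausible shape for such a multi-level recurrent structure, in the spirit of the recurrent CNN of [42], is to apply a shared convolutional block several times, each pass re-reading the input together with the previous feature map so that the effective spatial context widens. The sketch below is an illustrative pattern under our own assumptions, not the authors' exact R-2D-CNN architecture:

```python
import torch
import torch.nn as nn

class RecurrentConvBlock(nn.Module):
    """Shared-weight convolutional block applied recurrently over an image patch."""
    def __init__(self, in_ch, feat_ch, n_steps=3):
        super().__init__()
        self.n_steps = n_steps
        self.feat_ch = feat_ch
        self.conv = nn.Conv2d(in_ch + feat_ch, feat_ch, kernel_size=3, padding=1)

    def forward(self, x):
        # Start from an empty feature map; refine it over n_steps passes.
        h = torch.zeros(x.size(0), self.feat_ch, x.size(2), x.size(3),
                        device=x.device)
        for _ in range(self.n_steps):        # the same weights at every step
            h = torch.relu(self.conv(torch.cat([x, h], dim=1)))
        return h
```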
Second, the 3D-CNN model often performs better than the
2D-CNN model. The main reason is that the 3D convolu-
tion operation can exploit both spatial features and spectral
correlations while the 2D convolution operation can only
exploit spatial features. On the other hand, the R-2D-CNN
model often performs better than the 3D-CNN model and
the 2D-CNN model because its recurrent structure can more
effectively exploit the spatial contexts than the latter two
models. Among all the four models, the R-3D-CNN model not
only performs the best for most data sets but it also converges
faster.
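The distinction between the two convolution operations can be made concrete: a 2D convolution treats the spectral bands as input channels and fuses them completely in its first layer, whereas a 3D convolution also slides its kernel along the spectral axis, so local spectral correlations survive into the feature maps. A minimal PyTorch sketch; the band count, patch size, and kernel sizes are illustrative, not the paper's exact settings:

```python
import torch
import torch.nn as nn

B, S = 103, 27                      # spectral bands, spatial patch size (illustrative)
patch = torch.randn(1, B, S, S)     # one HSI patch, bands stacked as channels

# 2D convolution: all spectral bands are fused in the very first layer.
conv2d = nn.Conv2d(in_channels=B, out_channels=32, kernel_size=3)
out2d = conv2d(patch)               # -> (1, 32, 25, 25): no spectral axis remains

# 3D convolution: the kernel also slides along the spectral dimension,
# so local spectral correlations are preserved in the feature maps.
conv3d = nn.Conv3d(in_channels=1, out_channels=8, kernel_size=(7, 3, 3))
out3d = conv3d(patch.unsqueeze(1))  # (1, 1, B, S, S) -> (1, 8, 97, 25, 25)

print(out2d.shape, out3d.shape)
```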
Finally, we find that the proposed deep learning models
(e.g., R-3D-CNN and R-2D-CNN) may be slightly inferior
to conventional machine learning techniques if the training
samples are limited. However, when a reasonable number of
training samples are available, their performance is consid-
erably better than that of the conventional machine learning
techniques such as LORSAL, GCK, MFL, SMLR-SpATV,
SVM-3D, SVM-3DG, and SVM-CK. The main reason is that
deep learning models usually contain more model parameters,
and hence more training samples are required to estimate the
values of these parameters.
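The parameter-count argument can be verified directly for any of the models with a generic helper (our own sketch, assuming a PyTorch nn.Module):

```python
import torch.nn as nn

def count_trainable_parameters(model: nn.Module) -> int:
    """Total number of trainable parameters in a model."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)
```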
V. CONCLUSIONS
In this paper, we have explored deep learning techniques
for solving the hyperspectral image classification problem.
In particular, four deep learning models, namely, 2D-CNN,
3D-CNN, R-2D-CNN, and R-3D-CNN, have been designed
and developed. Rigorous experiments were conducted on
six publicly available hyperspectral image data sets, and
the results confirm the superiority of these deep learning
models over traditional machine learning methods such as
LORSAL, MFL, GCK, SVM-3D, and SVM-CK. In addition,
the proposed R-3D-CNN and R-2D-CNN models outperform
the CNN-MRF, SVM-3DG, and SMLR-SpATV methods. On
the whole, the proposed R-3D-CNN model often outperforms
the other models on most of the data sets, and it also converges
faster because its 3D convolutional operators and recurrent
network structure can effectively exploit both the spectral and
the spatial contexts. Measured in terms of error rate, the
proposed R-2D-CNN and R-3D-CNN methods reduce the
classification error of the baselines by more than 30%. Despite
the superiority of the proposed models, we find that our deep
learning models often require more training samples than
traditional machine learning methods. Accordingly, incorporating
prior domain knowledge into the proposed deep learning
models will be an interesting future research topic. Alternatively,
we will explore transfer learning approaches to alleviate
this shortcoming of our current deep learning models.
ACKNOWLEDGMENT
This research was supported in part by the Shenzhen
Science and Technology Program under Grants No.
JCYJ20160330163900579 and No. JCYJ20170413105929681.
Huang's work is supported by the National Natural Science
Foundation of China (NSFC) under Grant No. 61562027
and the Education Department of Jiangxi Province under
Grant No. GJJ170413. Lau's work is supported by grants from
the RGC of the Hong Kong SAR (Projects: CityU 11502115 and
CityU 11525716), the NSFC Basic Research Program (Project:
71671155), the Shenzhen Municipal Science and Technology
Innovation Fund (Project: JCYJ20160229165300897), and
the CityU Shenzhen Research Institute.
REFERENCES
[1] J. M. Bioucas-Dias, A. Plaza, G. Camps-Valls, P. Scheunders,
N. Nasrabadi, and J. Chanussot, “Hyperspectral remote sensing data
analysis and future challenges,” IEEE Geoscience and remote sensing
magazine, vol. 1, no. 2, pp. 6–36, 2013.
[2] A. J. Brown, B. Sutter, and S. Dunagan, “The marte vnir imaging
spectrometer experiment: design and analysis,” Astrobiology, vol. 8,
no. 5, pp. 1001–1011, 2008.
[3] B. Scholkopf and A. J. Smola, Learning with kernels: support vector
machines, regularization, optimization, and beyond. MIT press, 2001.
[4] G. Hughes, “On the mean accuracy of statistical pattern recognizers,”
IEEE Transactions on Information Theory, vol. 14, no. 1, pp. 55–63, 1968.
[5] B. Krishnapuram, L. Carin, M. A. Figueiredo, and A. J. Hartemink,
“Sparse multinomial logistic regression: Fast algorithms and general-
ization bounds,” IEEE transactions on pattern analysis and machine
intelligence, vol. 27, no. 6, pp. 957–968, 2005.
[6] Q. Wang, Z. Meng, and X. Li, “Locality adaptive discriminant anal-
ysis for spectral–spatial classification of hyperspectral images,” IEEE
Geoscience and Remote Sensing Letters, vol. 14, no. 11, pp. 2077–2081,
2017.
[7] Q. Wang, J. Lin, and Y. Yuan, “Salient band selection for hyperspectral
image classification via manifold ranking,” IEEE transactions on neural
networks and learning systems, vol. 27, no. 6, pp. 1279–1289, 2016.
[8] Y. Yuan, J. Lin, and Q. Wang, “Dual-clustering-based hyperspectral band
selection by contextual analysis,” IEEE Transactions on Geoscience and
Remote Sensing, vol. 54, no. 3, pp. 1431–1445, 2016.
[9] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521,
no. 7553, pp. 436–444, 2015.
[10] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi, “Inception-v4,
inception-resnet and the impact of residual connections on learning.” in
AAAI, 2017, pp. 4278–4284.
[11] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classifica-
tion with deep convolutional neural networks,” in Advances in neural
information processing systems, 2012, pp. 1097–1105.
[12] K. Simonyan and A. Zisserman, “Very deep convolutional networks for
large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
[13] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan,
V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,”
in Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, 2015, pp. 1–9.
[14] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for
image recognition,” in Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, 2016, pp. 770–778.
[15] T. V. Bandos, L. Bruzzone, and G. Camps-Valls, “Classification of
hyperspectral images with regularized linear discriminant analysis,”
IEEE Transactions on Geoscience and Remote Sensing, vol. 47, no. 3,
pp. 862–873, 2009.
[16] A. J. Brown, “Spectral curve fitting for automatic hyperspectral data
analysis,” IEEE Transactions on Geoscience and Remote Sensing,
vol. 44, no. 6, pp. 1601–1608, 2006.
[17] G. Camps-Valls and L. Bruzzone, “Kernel-based methods for hyperspec-
tral image classification,” IEEE Transactions on Geoscience and Remote
Sensing, vol. 43, no. 6, pp. 1351–1362, 2005.
[18] A. Rakotomamonjy, F. R. Bach, S. Canu, and Y. Grandvalet, “Sim-
plemkl,” Journal of Machine Learning Research, vol. 9, no. 3, pp. 2491–
2521, 2008.
[19] D. Tuia, G. Camps-Valls, G. Matasci, and M. Kanevski, “Learn-
ing relevant image features with multiple-kernel classification,” IEEE
Transactions on Geoscience and Remote Sensing, vol. 48, no. 10, pp.
3780–3791, 2010.
[20] Y. Gu, C. Wang, D. You, Y. Zhang, S. Wang, and Y. Zhang, “Representa-
tive multiple kernel learning for classification in hyperspectral imagery,”
IEEE Transactions on Geoscience and Remote Sensing, vol. 50, no. 7,
pp. 2852–2865, 2012.
[21] M. Dalla Mura, J. A. Benediktsson, B. Waske, and L. Bruzzone,
“Morphological attribute profiles for the analysis of very high resolution
images,” IEEE Transactions on Geoscience and Remote Sensing, vol. 48,
no. 10, pp. 3747–3762, 2010.
[22] M. Dalla Mura, J. Atli Benediktsson, B. Waske, and L. Bruzzone,
“Extended profiles with morphological attribute filters for the analysis
of hyperspectral data,” International Journal of Remote Sensing, vol. 31,
no. 22, pp. 5975–5991, 2010.
[23] J. Li, P. R. Marpu, A. Plaza, J. M. Bioucas-Dias, and J. A. Benedikts-
son, “Generalized composite kernel framework for hyperspectral image
classification,” IEEE Transactions on Geoscience and Remote Sensing,
vol. 51, no. 9, pp. 4816–4829, 2013.
[24] Q. Huang, C. K. Jia, X. Zhang, and Y. Ye, “Learning discriminative sub-
space models for weakly supervised face detection,” IEEE Transactions
on Industrial Informatics, vol. 13, no. 6, pp. 2956–2964, 2017.
[25] X. Ma, Q. Liu, Z. He, X. Zhang, and W.-S. Chen, “Visual tracking via
exemplar regression model,” Knowledge-Based Systems, vol. 106, pp.
26–37, 2016.
[26] Y. Chen, N. M. Nasrabadi, and T. D. Tran, “Hyperspectral image
classification via kernel sparse representation,” IEEE Transactions on
Geoscience and Remote Sensing, vol. 51, no. 1, pp. 217–231, 2013.
[27] L. Sun, Z. Wu, J. Liu, L. Xiao, and Z. Wei, “Supervised spectral–spatial
hyperspectral image classification with weighted Markov random fields,”
IEEE Transactions on Geoscience and Remote Sensing, vol. 53, no. 3,
pp. 1490–1503, 2015.
[28] J. Li, X. Huang, P. Gamba, J. M. Bioucas-Dias, L. Zhang, J. A.
Benediktsson, and A. Plaza, “Multiple feature learning for hyperspectral
image classification,” IEEE Transactions on Geoscience and Remote
sensing, vol. 53, no. 3, pp. 1592–1606, 2015.
[29] X. Cao, L. Xu, D. Meng, Q. Zhao, and Z. Xu, “Integration of 3-
dimensional discrete wavelet transform and markov random field for
hyperspectral image classification,” Neurocomputing, vol. 226, pp. 90–
100, 2017.
[30] Y. LeCun, B. E. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. E.
Hubbard, and L. D. Jackel, “Handwritten digit recognition with a back-
propagation network,” in Advances in neural information processing
systems, 1990, pp. 396–404.
[31] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard,
W. Hubbard, and L. D. Jackel, “Backpropagation applied to handwritten
zip code recognition,” Neural computation, vol. 1, no. 4, pp. 541–551,
1989.
[32] X. Glorot, A. Bordes, and Y. Bengio, “Deep sparse rectifier neural
networks.” in Aistats, vol. 15, no. 106, 2011, p. 275.
[33] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhut-
dinov, “Dropout: A simple way to prevent neural networks from over-
fitting,” The Journal of Machine Learning Research, vol. 15, no. 1, pp.
1929–1958, 2014.
[34] A. Santara, K. Mani, P. Hatwar, A. Singh, A. Garg, K. Padia, and
P. Mitra, “Bass net: Band-adaptive spectral-spatial feature learning neu-
ral network for hyperspectral image classification,” IEEE Transactions
on Geoscience and Remote Sensing, 2017.
[35] X. Cao, F. Zhou, L. Xu, D. Meng, Z. Xu, and J. Paisley, “Hyperspectral
image segmentation with markov random fields and a convolutional
neural network,” arXiv preprint arXiv:1705.00727, 2017.
[36] Q. Wang, J. Gao, and Y. Yuan, “A joint convolutional neural networks
and context transfer for street scenes labeling,” IEEE Transactions on
Intelligent Transportation Systems, 2017.
[37] S. Ji, W. Xu, M. Yang, and K. Yu, “3d convolutional neural networks
for human action recognition,” IEEE transactions on pattern analysis and
machine intelligence, vol. 35, no. 1, pp. 221–231, 2013.
[38] A. Graves, A.-r. Mohamed, and G. Hinton, “Speech recognition with
deep recurrent neural networks,” in Acoustics, speech and signal
processing (icassp), 2013 ieee international conference on. IEEE, 2013,
pp. 6645–6649.
[39] I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence learning
with neural networks,” in Advances in neural information processing
systems, 2014, pp. 3104–3112.
[40] K. Cho, B. van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares,
H. Schwenk, and Y. Bengio, “Learning phrase representations using
rnn encoder-decoder for statistical machine translation,” arXiv preprint
arXiv:1406.1078, 2014.
[41] L. Mou, P. Ghamisi, and X. X. Zhu, “Deep recurrent neural networks for
hyperspectral image classification,” IEEE Transactions on Geoscience
and Remote Sensing, 2017.
[42] P. Pinheiro and R. Collobert, “Recurrent convolutional neural
networks for scene labeling,” in Proceedings of the 31st International
Conference on Machine Learning, ser. Proceedings of Machine
Learning Research, E. P. Xing and T. Jebara, Eds., vol. 32, no. 1.
Bejing, China: PMLR, 22–24 Jun 2014, pp. 82–90. [Online]. Available:
http://proceedings.mlr.press/v32/pinheiro14.html
[43] P. J. Werbos, “Backpropagation through time: what it does and how to
do it,” Proceedings of the IEEE, vol. 78, no. 10, pp. 1550–1560, 1990.
Xiaofei Yang received B.Sc. degrees from Suihua
University in 2007 and 2011, and M.Sc. degrees
from Harbin Institute of Technology in 2011 and
2013, respectively. He is currently a Ph.D. candidate
at the Shenzhen Graduate School, Harbin Institute
of Technology. His research interests are in the
areas of semi-supervised learning, deep learning,
remote sensing, transfer learning, and graph mining.
Yunming Ye received the Ph.D. degree in Computer
Science from Shanghai Jiao Tong University. He is
now a professor at the Shenzhen Graduate School,
Harbin Institute of Technology. His research interests
include data mining, text mining, and ensemble
learning algorithms.
Xutao Li is now an Associate Professor at the
Shenzhen Graduate School, Harbin Institute of
Technology. He received the Ph.D. and Master's
degrees in Computer Science from Harbin Institute
of Technology in 2013 and 2009, respectively, and
the Bachelor's degree from Lanzhou University of
Technology in 2007. His research interests include
data mining, machine learning, graph mining, and
social network analysis, especially tensor-based
learning and mining algorithms.
Raymond Y. K. Lau is an Associate Professor in
the Department of Information Systems at City
University of Hong Kong. He is the author of two
hundred refereed international journal and conference
papers. His research work has been published in
renowned journals such as MIS Quarterly, INFORMS
Journal on Computing, ACM Transactions on
Information Systems, IEEE Transactions on Knowledge
and Data Engineering, IEEE Internet Computing,
Journal of MIS, and Decision Support Systems. His
research interests include Big Data Analytics, Social
Media Analytics, FinTech, and AI for Business. He is
a senior member of both the IEEE and the ACM.
Xiaohui Huang received the B.Eng. and master's
degrees from Jiangxi Normal University, Nanchang,
China, in 2005 and 2008, respectively, and the Ph.D.
degree from the Shenzhen Graduate School, Harbin
Institute of Technology, Shenzhen, China, in 2014.
Since 2015, he has been with the School of Information
Engineering, East China Jiaotong University,
Nanchang, China, where he is currently a Lecturer
in Computer Science. His current research interests
include clustering, social media analysis, and deep
learning.
Xiaofeng Zhang received the M.Sc. degree from
Harbin Institute of Technology in 1999 and the
Ph.D. degree from Hong Kong Baptist University
in 2008. He has worked at the R&D Center of
Peking University Founder Group and the E-Business
Technology Institute of the University of Hong Kong.
He is now an Associate Professor in the Department
of Computer Science, Harbin Institute of Technology
Shenzhen Graduate School. His research interests
include data mining, machine learning, and graph
mining.