Hyperspectral Image Classification With Deep Learning Models

April 2018
IEEE Transactions on Geoscience and Remote Sensing PP(99):1-16

DOI:10.1109/TGRS.2018.2815613

Authors:

Xiaofei Yang

Harbin Institute of Technology

Xutao Li

Nanyang Technological University

Raymond Y. K. Lau

City University of Hong Kong

Deep learning has achieved great successes in conventional computer vision tasks. In this paper, we exploit deep learning techniques to address the hyperspectral image classification problem. In contrast to conventional computer vision tasks that only examine the spatial context, our proposed method can exploit both spatial context and spectral correlation to enhance hyperspectral image classification. In particular, we advocate four new deep learning models, namely, 2-D convolutional neural network (2-D-CNN), 3-D-CNN, recurrent 2-D CNN (R-2-D-CNN), and recurrent 3-D-CNN (R-3-D-CNN) for hyperspectral image classification. We conducted rigorous experiments based on six publicly available data sets. Through a comparative evaluation with other state-of-the-art methods, our experimental results confirm the superiority of the proposed deep learning models, especially the R-3-D-CNN and the R-2-D-CNN deep learning models.

SUBMISSION TO IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING 1

Hyperspectral Image Classiﬁcation With Deep Learning Models

Xiaofei Yang1,2, Yunming Ye1,2, Xutao Li1,2, Raymond Y. K. Lau3, Xiaofeng Zhang1,2, and Xiaohui Huang4

1Shenzhen Graduate School, Harbin Institute of Technology, Shenzhen, 518055, China

2Shenzhen Key Laboratory of Internet Information Collaboration, Shenzhen, 518055, China

3City University of Hong Kong, Hong Kong

4School of Information Engineering, East China Jiaotong University, China

Index Terms—Deep Learning, Hyperspectral Image, Convolutional Neural Network.

I. INTRODUCTION

RECENTLY, the rapid development of optics and photon-

ics has signiﬁcantly advanced the ﬁeld of hyperspectral

imaging techniques. As a result, hyperspectral sensors are

installed in many satellites which can produce images with

rich spectral information. The rich information captured in

hyperspectral images enables us to distinguish very similar

materials and objects by using satellites. Accordingly, hyper-

spectral imaging techniques have been widely used in a variety

of ﬁelds such as agriculture, monitoring, astronomy, and

mineral exploration. For example, Brown et al. [1] analyzed

the CRISM hyperspectral data set and used linear mixing of

absorption band techniques to determine the mineralogy of

the surface on Mars. In [2], Brown et al. utilized the VNIR

imaging spectrometer instrument that was a hyperspectral

scanning pushbroom device sensitive to VNIR wavelengths

from 400 to 1000 nm for mineral exploration.

The existing methods for hyperspectral image classiﬁca-

tion are mostly based on conventional pattern recognition

approaches such as support vector machine (SVM) [3] and

K-nearest neighbor (KNN) classiﬁers. To address the curse of

dimensionality, namely the Hughes phenomenon [4], Krishna-

puram et al. [5] performed dimensionality reduction against a

data set ﬁrst and then applied multinomial logistic regression

(MLR) to improve image classiﬁcation performance. Wang et

al. proposed a novel dimensionality reduction method, namely

the Locality Adaptive Discriminant Analysis (LADA) method

for hyperspectral image analysis [6]. Another way to cope

with the Hughes phenomenon is via the salient band selection

method. For example, Wang et al. [7] proposed a manifold

ranking based salient band selection method. In addition, Yuan

et al. proposed a new dual clustering framework, which was

applied to tackle the inherent drawbacks of the clustering-based band selection method [8]. It has been shown that a composite kernel approach that requires multiple kernels can enhance the accuracy of classification by fusing spatial and spectral information. For example, the Generalized Composite Kernel (GCK) framework is one of the promising methods for hyperspectral image classification. Though kernel-based methods like GCK can exploit both the spectral and the spatial information, they involve solving a computationally very costly optimization problem.

Corresponding authors: Yunming Ye (email: yeyunming@hit.edu.cn) and Xutao Li (email: lixutao@hit.edu.cn).

As a state-of-the-art machine learning technique, deep learn-

ing [9] [10] has recently attracted a lot of attention for its

application to conventional computer vision tasks. One main

reason is that deep learning can automatically discover an

effective feature representation for a problem domain, thus

avoiding the complicated and hand-crafted feature engineering

process. With a specially-designed deep learning architecture,

convolutional neural networks (CNNs) are widely applied to

image recognition and image segmentation, tasks that consider

the spatial correlation among pixels. Successful examples of

CNNs include AlexNet [11], VGG [12], GoogLeNet [13], and

ResNet [14]. However, existing CNNs are applied to con-

ventional image classiﬁcation tasks rather than hyperspectral

image classiﬁcation tasks where both the spatial and spectral

correlations need to be effectively exploited.

In this paper, we address the hyperspectral image classiﬁca-

tion problem by using a new deep learning technique. As noted

above, both the spectral factor and the spatial factor inﬂuence

the class label prediction of a pixel. On one hand, the label

of a pixel is reflected by its spectral values measured across

different spectral bands. On the other hand, as the geographically

close pixels tend to belong to the same class, predicting the

class label of a pixel should take into account the class labels

of the surrounding pixels. Hence, a good hyperspectral image

classiﬁcation method should consider both the spectral factor

and the spatial factor. In this paper, we ﬁrst advocate a 2D-

CNN model and a 3D-CNN model for classifying hyper-

spectral images. The intuition is that a 2D-CNN can exploit

the spatial context, whereas a 3D-CNN can exploit both the

spatial and the spectral context. Though the aforementioned

models can take into account rich contextual information, the

way that they process the spatial information may introduce

unwanted noise. Accordingly, we further design the recurrent

2D-CNN and the recurrent 3D-CNN to address the noisy

spatial information problem. The main contributions of our

research work are summarized as follows.

1) First, we treat spectral data as the channels of conven-

tional images. To classify each pixel in a hyperspec-

tral image, we extract a small patch centered at the

pixel. The patch is treated as an image with multiple

channels. Then, we design a 2D-CNN model with three

2D convolution layers, followed by a fully connected

layer, to classify the patch. The label of the patch is

considered as the label of its central pixel. Though the

pooling layers (such as max pooling layers and average

pooling layers) could reduce the dimensions of feature

maps and simplify the computations, they may affect the

classiﬁcation accuracy of the network. To preserve as

much contextual information as possible, pooling layers

are excluded from our 2D-CNN model. The convolution

layer, pooling layer, and fully connected layer of a CNN

will be explained in Section II.B.

2) Though the 2D-CNN model can utilize the spatial context,

it fails to consider the spectral correlations. To address

such a problem, we further design a 3D-CNN model

which is composed of seven convolution layers and

one full connection layer. Different from the 2D-CNN,

the convolution operator of this model is 3D, where

the ﬁrst two dimensions are applied to capture the

spatial context, and the third dimension captures the

spectral context. Though the 3D-CNN model contains

more network parameters than its 2D counterpart, it

should be more effective than its 2D counterpart because

of its ability to evaluate the spectral correlations of a

hyperspectral image.

3) The 2D-CNN model may be noisy because the classiﬁ-

cation of a pixel only relies on a small patch centered

at the pixel. To effectively utilize the spatial context,

we further design a recurrent 2D-CNN model (R-2D-

CNN). The R-2D-CNN can extract features by gradually

shrinking the patch to concentrate on the central pixel.

Experimental results show that the R-2D-CNN model

indeed performs better than the 2D-CNN model.

4) Finally, we design the recurrent 3D-CNN model (R-3D-

CNN) to take into account both spatial and spectral con-

texts, while alleviating the problem of a noisy patch. The

R-3D-CNN extends the 3D-CNN model by shrinking

the patch gradually. As a result, the ﬁnal classiﬁcation

of each pixel mainly depends on the information of the

pixel rather than a patch. Experimental results show the

superiority of the R-3D-CNN model. In particular, it

converges faster than other methods, and achieves the

best classiﬁcation performance.

The rest of the paper is organized as follows. Section II

discusses the related research work. In section III, we illustrate

various CNN-based deep learning models and the correspond-

ing algorithms for hyperspectral image classiﬁcation. Section

IV reports the experimental results of a comparative evaluation

of the experimental methods and other baseline methods. Fi-

nally, we give concluding remarks and highlight the directions

of future research work.

II. RELATED WORK

A. Classical Classiﬁcation Methods

Hyperspectral remote sensing classification has been extensively studied recently. For example, Bandos et al. [15] utilize a linear discriminant method to solve the problem. However, when the spectral resolution is low, it is necessary to handle the band mixing problem to better differentiate the pixels or to perform feature selection. To this end, Brown [16] develops a robust method to automatically separate overlapping absorption bands; the advantage of such a method is that it is relatively noise-insensitive. To address the nonlinearity of data, quadratic discriminant analysis and logarithmic discriminant analysis are also explored. However, these methods suffer from the Hughes phenomenon, i.e., the classification performance degrades considerably when the dimensionality of the problem space becomes high. Wang et al. [6] propose a novel dimensionality reduction method, namely LADA, for hyperspectral image classification. Following the idea of LDA, LADA learns a projection matrix $G$ to pull the points of the same class close to each other while pushing those of different classes far away from each other. To further exploit the local data manifold, LADA adds one adaptive manifold term parameterized by a matrix $S$ into the computation of the within-class scatter term, and solves for the matrices $G$ and $S$ alternately. In 2016, Wang et al. [7] propose a manifold ranking based salient band selection method for hyperspectral image classification. The method first employs an evolutionary algorithm to group the bands into several subsets, and finds some representative bands. Then, it uses the representatives to select salient bands by a manifold ranking strategy. The performance of the method relies significantly on the quality of the chosen representatives and of the constructed manifold.

To improve classiﬁcation performance, many researchers

resort to kernel-based methods. The main idea of kernel-based

methods is to project samples into a high dimensional space

in which the samples of different classes become linearly

separable. The trick of kernel-based methods is that one does

not need to specify the details of the transformation function.

Instead, we only need to define the inner products among samples in the high dimensional space. For example, Camps et al. [17] employ the kernel trick of SVM, in which the separation of classes in a high dimensional space is achieved via a nonlinear transformation.

Apart from employing simple kernel tricks, some re-

searchers employed multiple kernels for hyperspectral im-

age classiﬁcation. For example, Rakotomamonjy et al. [18]

advocate the multiple kernel learning (MKL) method which

could learn a kernel and a classiﬁcation predictor at the

same time. With the preliminary success of MKL, the same

technique is applied to remote sensing in 2010 [19]. In 2012,

a representative MKL algorithm is developed which could

establish the weights of kernels according to their statistical

signiﬁcance [20].

The aforementioned kernel-based methods do not explicitly

exploit a spatial context. To address such a problem, the

composite kernel (CK) method is proposed [21]. In [22], the

CK method is generalized by using extended multi-attribute

proﬁles (EMAP). Apart from considering the spatial context,

the CK method could exploit the spectral context as well. For

example, a generalized composite kernel (GCK) is developed

to exploit both extended multi-attribute proﬁles and raw fea-

tures [23]. The GCK method often achieves better performance

than conventional methods such as SVM-CK [23].

Despite achieving promising classiﬁcation performance, all

the kernel-based methods suffer from two drawbacks: (1) they

often involve solving complicated convex problems which are

in general difﬁcult for a classiﬁer to learn; (2) kernels must

be carefully chosen so as to achieve good performance.

Recently, some more advanced classiﬁcation methods are

developed for hyperspectral image classiﬁcation [24] [25].

For example, the logistic regression via variable splitting and

augmented Lagrangian (LORSAL) algorithm [26] is developed

to tackle larger data sets efﬁciently. In [27], Sun et al. propose

a hyperspectral image classiﬁcation model, named SMLR-

SpATV, which includes a spectral data ﬁdelity term and a

spatially adaptive Markov random ﬁeld (MRF) prior in the

hidden ﬁeld. Li et al. [28] propose a new multiple feature

learning framework (MFL), which pursues the combination of

multiple features for the hyperspectral scenes categorization.

The method can handle both linear and nonlinear classiﬁcation.

In [29], a novel SVM based classiﬁcation method is proposed

by applying the 3-dimensional discrete wavelet transform

(SVM-3DG).

B. Deep Learning Models

Recently, CNN models have achieved a breakthrough in the

performance of image classiﬁcation. A CNN model (see Fig.1)

is a multi-layer neural network, composed of convolution

layers, pooling layers, and a fully connected layer. The convolution

layer contains N filters (C1 in Fig. 1), each of which is a

small weight matrix. By convolving the N filters with an

input image and transforming the output with a non-linear

activation function, N feature maps are produced. The feature

maps often contain redundant information. To reduce the

redundancy, a pooling layer is appended (S2 in Fig. 1), which

summarizes feature maps into small matrices by calculating

the average (average pooling) or maximum value (maximum

pooling) locally. The convolution layer and pooling layer can

be repeated multiple times (C3 and S4 in Fig. 1) until the

generated feature maps are of size 1-by-1. Finally, a fully

connected layer will be appended for categorization. The

neurons of the fully connected layer take all the 1-by-1 feature

maps as their inputs.
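As a small illustration of the pooling step described above, the following sketch applies non-overlapping 2 x 2 maximum and average pooling to a 4 x 4 feature map. The function name `pool2d`, the window size, and the toy values are assumptions made for this example only; they are not taken from the paper.

```python
def pool2d(feature_map, size=2, mode="max"):
    """Non-overlapping size x size pooling over a 2D list of numbers."""
    rows, cols = len(feature_map), len(feature_map[0])
    pooled = []
    for r in range(0, rows - size + 1, size):
        row = []
        for c in range(0, cols - size + 1, size):
            # Collect the values of the current local window.
            window = [feature_map[r + dr][c + dc]
                      for dr in range(size) for dc in range(size)]
            if mode == "max":
                row.append(max(window))                 # maximum pooling
            else:
                row.append(sum(window) / len(window))   # average pooling
        pooled.append(row)
    return pooled


# Toy 4 x 4 feature map (hypothetical values).
fmap = [[1, 3, 2, 0],
        [4, 2, 1, 1],
        [0, 1, 5, 6],
        [2, 3, 7, 8]]
max_pooled = pool2d(fmap, mode="max")
avg_pooled = pool2d(fmap, mode="avg")
```

Each 2 x 2 window is summarized by a single number, so the 4 x 4 map shrinks to 2 x 2; repeating convolution and pooling in this way is what eventually reduces the maps to size 1-by-1.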

Fig. 1. The CNN model consisting of convolution layers, pooling layers, and a fully connected layer.

The first CNN model is developed by LeCun in 1996 [30] [31]. Combined with the back-propagation algorithm, the CNN model achieves very good performance in handwritten digit recognition. With the advancement of Graphics Processing Units (GPUs), deep learning has attracted a lot of attention from researchers. On the other hand, the CNN model has been

improved by the recent deep learning techniques. For example,

Glorot et al. [32] introduce the Rectiﬁed Linear Units (ReLU)

as the activation function for CNNs in 2011. By doing so,

the vanishing gradient problem and the ineffective explo-

ration problem of the BP method can be alleviated. In 2012,

Krizhevsky et al. [11] designed the AlexNet network which

was a deep CNN model with the ReLU activation function.

The AlexNet network won the annual ImageNet competition

in 2012. To avoid overﬁtting, Srivastava et al. [33] proposed

the dropout technique for deep CNN. In addition, Szegedy

et al. [13] designed the GoogLeNet model which is a deep

CNN model with each layer comprising multi-scale CNN.

He et al. [14] proposed a deep residual CNN model which

won the ImageNet competition in 2015. In [34], an end-to-end

band-adaptive spectral-spatial feature learning neural network

was proposed. In [35], Cao et al. proposed a hyperspectral

image segmentation method by using Markov random fields

and a convolutional network. To tackle the street scene labeling

problem, Wang et al. [36] proposed a hybrid method that

utilized priori convolutional neural networks at superpixel

level and soft restricted context transfer. The former technique

aims to learn prior location information and produce coarse

label predictions, whereas the latter technique aims to improve

the coarse prediction by reducing over-smoothness. However,

the algorithm works for conventional images only. It does not

take into account the characteristics of rich band information

in hyperspectral images.

All the above models are 2D-CNN models within which the

convolution operators only deal with two dimensional spatial

features. In [37], a 3D convolution network is designed to

handle video categorization tasks effectively. Following the

framework of 3D-CNN models, we employ such an architec-

ture for hyperspectral image classiﬁcation, in which the third

dimension refers to the spectral axis.

Apart from CNN models, another important deep learning

framework is the recurrent neural network (RNN) which is of-

ten applied to process sequence data arising from applications

such as speech recognition [38], machine translation [39], bot

chat [40], and so on. In [41], Mou et al. proposed a novel RNN

model for hyperspectral image classiﬁcation, which could

effectively analyze hyperspectral pixels as sequential data and then determine information categories via network reasoning. The basic intuition of an RNN is that it applies the same neural network block recurrently for sequence prediction. To preserve the information of observed historical sequences, an RNN is fed with the current observation and the hidden layers trained by the previously observed sequences. By doing so, the RNN can take into account both the features of the current sequence and those of the historical observations to improve the current prediction. In contrast to the aforementioned approaches, we apply the RNN model to deal with the spatial contexts recurrently.

Fig. 2. The 2D-CNN model consisting of 2D convolutional operations with kernel size (k) and number of feature maps (m) at each convolutional layer for hyperspectral image classification.

III. THE PROPOSED METHODS

In this section, we illustrate the design of new 2D-CNN, R-

2D-CNN, 3D-CNN, and R-3D-CNN models for hyperspectral

image classiﬁcation. For these methods, we extract a small

patch centered at each pixel to build the classiﬁcation models.

Among the proposed models, the 2D-CNN and the R-2D-CNN

models exploit the spatial contexts only, whereas the 3D-CNN

and the R-3D-CNN models exploit both the spatial features

and the spectral correlations of pixels.

A. 2D-CNN model

As illustrated in Fig.2, our 2D-CNN model is composed

of three main phases, which are patch extraction, feature ex-

traction, and label identiﬁcation. Given a hyperspectral image,

we ﬁrst extract a small patch centered at each pixel as the

raw feature. Then, a deep learning model is constructed to

acquire the feature maps of these patches. Finally, the label

of each pixel is classiﬁed based on the feature map of the

corresponding patch. For all four models, we exclude the

pooling layers so as to preserve as much information of a

pixel as possible. The three-phase processing of the 2D-CNN

model is illustrated below.

Assume that we are given a hyperspectral image of size $N \times M \times D$, where $N$ and $M$ are the width and the height of the image, and $D$ denotes the number of spectral bands. We aim at predicting the label of each pixel of the image. As spatially adjacent pixels often have the same labels, it is desirable for the proposed model to consider the "spatial coherence". To this end, the first processing phase of our model is to extract a $K \times K \times D$ patch for each pixel. In particular, each patch (i.e., the spatial context) is constructed around a pixel, which is the center point of the patch. For the pixels that reside near the edge of the image, there may not be sufficient information to build a patch of the expected size. Accordingly, we construct the spatial context by performing a mirror padding operation for these pixels.
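The patch extraction phase described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the names `reflect` and `extract_patch`, the `image[row][col][band]` layout, the odd patch size, and the choice of mirroring about the border pixel without repeating it are all assumptions, since the text does not specify the exact padding convention.

```python
def reflect(index, length):
    """Map an out-of-range index back into [0, length) by mirroring at the border.

    Assumed convention: the border pixel itself is not repeated.
    """
    if length == 1:
        return 0
    period = 2 * (length - 1)
    index = abs(index) % period
    return index if index < length else period - index


def extract_patch(image, row, col, k):
    """Return the k x k x D patch centered at (row, col); k is assumed to be odd.

    `image` is a nested list indexed as image[row][col][band].
    """
    half = k // 2
    n_rows, n_cols = len(image), len(image[0])
    return [[image[reflect(row + dr, n_rows)][reflect(col + dc, n_cols)]
             for dc in range(-half, half + 1)]
            for dr in range(-half, half + 1)]


# Toy 3 x 3 image with a single band; pixel (r, c) holds the value 3 * r + c.
image = [[[3 * r + c] for c in range(3)] for r in range(3)]
corner = extract_patch(image, 0, 0, 3)   # patch around the top-left corner pixel
center = extract_patch(image, 1, 1, 3)   # patch fully inside the image
corner_values = [[pixel[0] for pixel in row] for row in corner]
center_values = [[pixel[0] for pixel in row] for row in center]
```

For the corner pixel, the rows and columns that fall outside the image are filled with mirrored copies of the interior pixels, so every pixel obtains a patch of the same size.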

For the second processing phase, each extracted patch

is treated as an image with multiple channels on its own.

Thereby, we can apply a deep CNN model with 2D

convolution layers to extract the feature maps for the patch.

More speciﬁcally, the 2D-CNN operator at each layer is

formulated as follows:

$$ v_{ij}^{xy} = F\Big( b_{ij} + \sum_{m} \sum_{p=0}^{N_i-1} \sum_{q=0}^{M_i-1} w_{ijm}^{pq} \, v_{(i-1)m}^{(x+p)(y+q)} \Big) \tag{1} $$

where $i$ indicates the particular layer under consideration, and $j$ indexes the feature maps of layer $i$; $v_{ij}^{xy}$ stands for the output at position $(x, y)$ of the $j$th feature map at the $i$th layer; $b_{ij}$ refers to the bias term, and $F(\cdot)$ denotes the activation function of the layer; $m$ indexes over the set of feature maps of the $(i-1)$th layer, which are the inputs to the $i$th layer. $w_{ijm}^{pq}$ is the value at position $(p, q)$ of the convolution kernel connecting the $m$th feature map of the $(i-1)$th layer to the $j$th feature map of the $i$th layer, and $N_i$ and $M_i$ are the height and width of this kernel. For the proposed model, we adopt the ReLU function as the activation function $F$, which is defined as follows:

$$ F(x) = \max(0, x) \tag{2} $$
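To make Eq. (1) and Eq. (2) concrete, the following sketch evaluates one 2D convolution layer with plain loops. It assumes stride 1 and no padding; the function names, the nested-list data layout, and the toy weights are hypothetical and chosen only for illustration.

```python
def relu(x):
    """Eq. (2): F(x) = max(0, x)."""
    return max(0.0, x)


def conv2d_layer(inputs, kernels, biases):
    """Eq. (1) with stride 1 and no padding (both are assumptions).

    inputs:  list of input feature maps, each a 2D list (height x width)
    kernels: kernels[j][m] is the 2D kernel linking input map m to output map j
    biases:  biases[j] is the bias of output map j
    """
    height, width = len(inputs[0]), len(inputs[0][0])
    outputs = []
    for j, kernel_set in enumerate(kernels):
        kh, kw = len(kernel_set[0]), len(kernel_set[0][0])
        out = []
        for x in range(height - kh + 1):
            row = []
            for y in range(width - kw + 1):
                total = biases[j]
                for m, kernel in enumerate(kernel_set):   # sum over input maps m
                    for p in range(kh):                   # sum over kernel rows p
                        for q in range(kw):               # sum over kernel columns q
                            total += kernel[p][q] * inputs[m][x + p][y + q]
                row.append(relu(total))
            out.append(row)
        outputs.append(out)
    return outputs


# Two 3 x 3 input maps, two output maps with 2 x 2 kernels (toy values).
inputs = [[[1, 2, 3], [4, 5, 6], [7, 8, 9]],
          [[1, 0, 1], [0, 1, 0], [1, 0, 1]]]
kernels = [[[[1, 0], [0, 1]], [[-1, -1], [-1, -1]]],
           [[[-1, 0], [0, 0]], [[0, 0], [0, 0]]]]
biases = [0.5, 0.0]
feature_maps = conv2d_layer(inputs, kernels, biases)
```

The second output map receives only negative pre-activations, so the ReLU sets all of its entries to zero, while the first map keeps its positive responses.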

In our 2D-CNN model, three convolutional layers are uti-

lized. To preserve the vital information of each pixel, we

exclude the pooling layers from our 2D-CNN model. Finally,

a fully-connected layer, which takes the feature maps of the

last 2D convolutional layer as inputs, is constructed to make

the prediction. Here, we leverage the softmax function to

compute the probability for each class. The softmax function is an extension of the sigmoid function to multi-class classification. It converts the class scores into probabilities, so that the class with the maximum score receives the highest probability. Moreover, the cross entropy function is adopted as the objective function to drive the back-propagation based training process.

Let $W$ and $b$ denote all the parameters of our 2D-CNN model. We train the 2D-CNN model by maximizing the likelihood, and transform the scores $f_c(I_{i,j,k}; (W, b))$ of each class of interest $c \in \{1, \ldots, N\}$ into the conditional probabilities by using the following softmax function [42]:

$$ p(c \mid I_{i,j,k}; (W, b)) = \frac{e^{f_c(I_{i,j,k}; (W, b))}}{\sum_{d \in \{1, \ldots, N\}} e^{f_d(I_{i,j,k}; (W, b))}} \tag{3} $$

The parameters $(W, b)$ are learned by minimizing the negative log-likelihood based on the training set:

$$ L(W, b) = -\sum_{I_{i,j,k}} \ln p(l_{i,j,k} \mid I_{i,j,k}; (W, b)) \tag{4} $$

where $l_{i,j,k}$ is the correct class label of the pixel at position $(i, j)$ of the image $I_k$. To optimize the objective function, stochastic gradient descent (SGD) with back-propagation is applied.
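The softmax of Eq. (3) and the negative log-likelihood of Eq. (4) can be sketched in a few lines. The function names and the toy scores are assumptions for illustration; subtracting the maximum score before exponentiation is a common numerical safeguard that is not mentioned in the text and does not change the result.

```python
import math


def softmax(scores):
    """Eq. (3): turn class scores into probabilities."""
    shift = max(scores)                       # numerical safeguard (assumption)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def negative_log_likelihood(score_batch, labels):
    """Eq. (4): sum of -ln p(correct class) over the training samples."""
    return -sum(math.log(softmax(scores)[label])
                for scores, label in zip(score_batch, labels))


# Toy scores for two patches and three classes (hypothetical values).
probs = softmax([2.0, 1.0, 0.1])
loss = negative_log_likelihood([[2.0, 1.0, 0.1], [0.0, 0.0, 0.0]], [0, 2])
```

The loss is small when the correct class receives a high probability, which is why minimizing it drives the scores of the correct classes upward during training.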

Fig. 3. The 3D-CNN model comprising three 3D convolution operations with the corresponding kernel size (K) and the number of feature maps (m) for each convolutional layer.

At the testing time, the output layer of the proposed model predicts the label of the pixel located at $(i, j)$ of the image $I$ by using the argmax function:

$$ l_{i,j} = \arg\max_{c \in \{1, \ldots, N\}} p(c \mid I_{i,j}; (W, b)) \tag{5} $$

B. 3D-CNN model

One main difference between a hyperspectral image and a

conventional image is that the former is captured by scanning

the same region with different spectral bands, while the latter

is not. As the images formed by different spectral bands may be

correlated, e.g., close hyperspectral bands may result

in similar images, it is desirable to take into account hyper-

spectral correlations. Though the 2D-CNN model can utilize

the spatial context, it ignores the hyperspectral correlations.

Hence, we develop a 3D-CNN model to address this issue.

As shown in Fig.3, the operational details of our 3D-CNN

model are quite similar to those of the 2D-CNN model. The

main difference is that the 3D-CNN model has one extra phase

of reordering. In this phase, we rearrange the $D$ hyperspectral

bands in ascending order. By doing so, images

of similar spectral bands are sequentially ordered, which can

preserve their correlations under a spectral context. The patch

extraction phase and the label identiﬁcation phase of the two

models are quite similar. For the feature extraction phase, a

3D convolution operator instead of 2D convolution operator is

applied to the 3D-CNN model.

More speciﬁcally, the 3D convolution operation is formu-

lated as follows:

$$ v_{ij}^{xyz} = F\Big( b_{ij} + \sum_{m} \sum_{p=0}^{N_i-1} \sum_{q=0}^{M_i-1} \sum_{r=0}^{D_i-1} w_{ijm}^{pqr} \, v_{(i-1)m}^{(x+p)(y+q)(z+r)} \Big) \tag{6} $$

where $D_i$ is the size of the 3D kernel along the spectral dimension, and $j$ indexes the kernels of the $i$th layer; $w_{ijm}^{pqr}$ is the value at the $(p, q, r)$th position of the kernel connected to the $m$th feature map (a cube) of the preceding layer. Again, the ReLU function is adopted as the activation function $F$.
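A minimal sketch of Eq. (6) is given below. For brevity it handles a single input cube and a single kernel, so the sum over $m$ is omitted; stride 1, no padding, the function name `conv3d_single`, and the toy values are assumptions made for this example.

```python
def conv3d_single(cube, kernel, bias=0.0):
    """Eq. (6) for one input cube and one kernel, followed by ReLU.

    cube and kernel are nested lists indexed as [x][y][z], where z is the
    spectral axis. Stride 1 and no padding are assumed.
    """
    X, Y, Z = len(cube), len(cube[0]), len(cube[0][0])
    N, M, D = len(kernel), len(kernel[0]), len(kernel[0][0])
    out = []
    for x in range(X - N + 1):
        plane = []
        for y in range(Y - M + 1):
            line = []
            for z in range(Z - D + 1):
                total = bias
                for p in range(N):          # spatial offset p
                    for q in range(M):      # spatial offset q
                        for r in range(D):  # spectral offset r
                            total += kernel[p][q][r] * cube[x + p][y + q][z + r]
                line.append(max(0.0, total))
            plane.append(line)
        out.append(plane)
    return out


# Toy 3 x 3 patch with 4 bands and a 2 x 2 x 2 kernel (hypothetical values).
cube = [[[1.0] * 4 for _ in range(3)] for _ in range(3)]
kernel = [[[0.5] * 2 for _ in range(2)] for _ in range(2)]
feature_cube = conv3d_single(cube, kernel)
shape = (len(feature_cube), len(feature_cube[0]), len(feature_cube[0][0]))
```

Unlike the 2D case, the kernel also slides along the spectral axis, so the output is itself a cube (here 2 x 2 x 3) rather than a flat map.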

The 3D convolution operation is illustrated in Fig. 3. We can see that the 3D convolution operation is applied to a 3D patch step by step, e.g., from top to bottom, from left to right, and from inner to outer. In each step, a convolution scalar is produced and placed at the corresponding position of the feature map (shown as red lines in Fig. 3). This operation

produces a smaller 3D cube as a feature map. Training a 3D-

CNN model is similar to training a 2D-CNN model in which

we utilize the softmax function to compute the probability of

each class. Moreover, we formulate the training process as an

optimization problem by maximizing the log-likelihood of the

training data. In addition, stochastic gradient descent (SGD)

with back-propagation is applied to network training.

C. R-2D-CNN model

As noted above, though the 2D-CNN model can exploit the

spatial context, it may introduce unwanted noise because the

classiﬁcation of a pixel relies on the features of a small patch

surrounding the pixel rather than the features directly attached

to the pixel. To better exploit the spatial context, we design a

recurrent 2D-CNN model (R-2D-CNN). In particular, the R-

2D-CNN model constructs multiple shrunk patches as multi-

level instances (see Fig.4), and leverages a multi-scale deep

neural network to fuse the multi-level instances for prediction.

For clarity, we denote the instances as the 1-st level, the 2-nd level,

..., and the P-th level, ordered from the bigger patches

to the smaller patches, where the P-th level often corresponds

to the pixel for classiﬁcation, i.e., a 1-by-1 patch. The R-

2D-CNN deep neural network comprises a recurrent CNN

structure, where a basic 2D-CNN block is reused multiple

times. More speciﬁcally, it uses the basic 2D-CNN block to

extract the feature maps for the 1-st level instances at the

beginning. These feature maps are then concatenated with the

2-nd level instances, which are fed to the same 2D-CNN block

for extracting the next level feature maps. This procedure is

repeated until the P-th level instances are fused. Finally, a

softmax layer is applied to compute the probability of

each class. By utilizing the multiple shrunk patches, we can

consider the spatial context information, and also can focus

more on the information closer to the pixel for classiﬁcation.

Hence, the unwanted noises can be reduced.

The main architecture of the R-2D-CNN model is illustrated in Fig. 5. At the $p$-th level ($1 \le p \le P$), the network is fed with an input "feature image" $F_p$ of $H + D$ 2D images, where $H$ represents the number of feature maps produced by the 2D-CNN block. The feature image comprises the $H$ feature maps of the $(p-1)$-th instances and the $D$ hyperspectral images of the $p$-th instances. Formally, the procedure is defined as follows:

$$ F_p = [F(F_{p-1}, I^{p}_{i,j,k})], \quad F_1 = [0, I_{i,j,k}]. \tag{7} $$

where $I_{i,j,k}$ stands for the input patch surrounding the pixel at location $(i, j)$ of the training image $k$. At the first level, the network only takes the original image as the input because there is no instance from a previous level to produce the feature maps. Though the R-2D-CNN model is multi-level, the model complexity does not increase with respect to the number of levels. The reason is that the parameters pertaining to different levels are shared (as shown in Fig. 5).
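The multi-level fusion described above can be sketched with a toy stand-in for the shared block. Everything in this example is an assumption made for illustration: `shared_block` is a single 3 x 3 convolution with all weights equal to 1 (so $H = 1$), the instance sizes are 5, 3, and 1, and the function names are hypothetical. The paper's actual block has several layers and learned weights.

```python
def center_crop(channel, size):
    """Keep the central size x size window of a square 2D list."""
    offset = (len(channel) - size) // 2
    return [row[offset:offset + size] for row in channel[offset:offset + size]]


def shared_block(channels):
    """Toy stand-in for the shared 2D-CNN block (not the paper's block).

    One 3 x 3 convolution without padding, all weights equal to 1, summed over
    every input channel and followed by ReLU. Returns H = 1 feature map whose
    side is 2 pixels shorter than the input.
    """
    side = len(channels[0])
    feature_map = [[max(0.0, float(sum(ch[x + p][y + q]
                                       for ch in channels
                                       for p in range(3)
                                       for q in range(3))))
                    for y in range(side - 2)]
                   for x in range(side - 2)]
    return [feature_map]


def fuse_levels(bands, sizes):
    """Unroll the recurrence; `bands` holds the D band images of the largest patch."""
    zeros = [[0.0] * sizes[0] for _ in range(sizes[0])]
    level_input = [zeros] + [center_crop(b, sizes[0]) for b in bands]  # first level: [0, I]
    for size in sizes[1:]:
        features = shared_block(level_input)        # the same block is reused at every level
        instance = [center_crop(b, size) for b in bands]
        level_input = features + instance           # H feature maps + D band images
    return level_input


bands = [[[1.0] * 5 for _ in range(5)] for _ in range(2)]   # D = 2 toy bands, 5 x 5 patch
fused = fuse_levels(bands, sizes=[5, 3, 1])
```

Because the toy block shortens each side by 2, the instance sizes 5, 3, and 1 line up exactly with the feature maps of the previous level, and the last fused input is a stack of $H + D = 3$ channels of size 1-by-1 centered on the pixel to be classified.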

Fig. 4. Context input patch of "plain" (a) and recurrent context input patch (b). The size of the context input patch (b) increases as the number of instances in the recurrent 2D convolutional network increases.

Fig. 5. The R-2D-CNN model comprising two basic 2D-CNN blocks with parameters shared across levels.

Model training of the R-2D-CNN model is similar to that of the 2D-CNN model, where gradients are computed by using the back-propagation through time (BPTT) algorithm [43] during the back-propagation process. More specifically, we first unfold the network as shown in Fig. 5, and then train the model with the BPTT algorithm. However, in contrast to the 2D-CNN model, we have to learn the network parameters $(W, b)$ by a new loss function due to the multi-level recurrent architecture. The loss function is defined according to Eq. (7):

$$ L(F) + L(F \circ F) + \ldots + L(F \circ^{p} F), \tag{8} $$

where $L(F)$ is a shorthand for the log-likelihood defined in Eq. (3) of the 2D-CNN model, and $\circ^{p}$ denotes the composition operation performed $p$ times. Thus, each network instance is trained to produce the correct label at the location $(i, j)$. In this manner, the R-2D-CNN model is able to learn to correct the mistakes produced by the earlier iterations. As a by-product, the R-2D-CNN model can also learn the label dependencies, that is, predict the label of an instance based on the label of the

previous instance around location (i, j).
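The structure of Eq. (8) can be illustrated with a minimal sketch in which every level contributes one loss term for the same true label. The sketch assumes that the negative log-likelihood is the quantity being minimized and that each level outputs a softmax distribution; the function names and the toy probabilities are hypothetical and not taken from the paper.

```python
import math

def nll(probs, label):
    """Negative log-likelihood of the true label under one softmax output."""
    return -math.log(probs[label])

def multilevel_loss(level_probs, label):
    """Sum one loss term per network instance, mirroring the sum in Eq. (8)."""
    return sum(nll(p, label) for p in level_probs)

# Hypothetical softmax outputs of a 2-level model for a 3-class problem.
level_probs = [
    [0.5, 0.3, 0.2],  # level 1: output of F
    [0.8, 0.1, 0.1],  # level 2: output of F composed with F
]
total = multilevel_loss(level_probs, label=0)
```

Because every level is penalized against the same label, a later level that improves on an earlier prediction lowers its own term in the sum.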

It is worth noting that the sizes of multi-level instances in

an R-2D-CNN model must be carefully designed so that the

instances can be concatenated with the feature maps of the

previous instances. To this end, we ﬁrst need to establish how

the size of a feature map changes when it is applied to a 2D

convolution layer. Let $sz_{m-1}$ denote the size of the feature map of the $(m-1)$-th convolution layer. Then, the size of the feature map produced by the $m$-th convolution layer is computed as follows:

$$sz_m = \frac{sz_{m-1} - kW_m}{dW_m} + 1 \qquad (9)$$

where $kW_m$ is the size of the convolution kernel of the $m$-th layer, and $dW_m$ is the stride size. By Eq. (9), we can compute

the size of a feature map produced by our 2D-CNN block.

Hence, we can estimate the appropriate sizes of the instances

with respect to different levels.
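As a minimal sketch of Eq. (9), the helper below chains the size formula over the layers of a block. It assumes unpadded convolutions and, as an extra safeguard that the paper does not state, rejects kernel and stride combinations that do not divide evenly. The function names are hypothetical; the 3×3 kernels, stride 1, and the 13 and 7 patch sizes are the values used in the experiments of Section IV.

```python
def conv_output_size(size, kernel, stride=1):
    """Eq. (9): feature-map size after one unpadded convolution."""
    if (size - kernel) % stride != 0:
        raise ValueError("kernel and stride do not tile the input exactly")
    return (size - kernel) // stride + 1

def block_output_size(size, kernels, strides=None):
    """Apply Eq. (9) once per convolution layer of a block."""
    strides = strides or [1] * len(kernels)
    for kernel, stride in zip(kernels, strides):
        size = conv_output_size(size, kernel, stride)
    return size

# Three 3x3 layers with stride 1.
first_level = block_output_size(13, [3, 3, 3])   # 13 -> 11 -> 9 -> 7
second_level = block_output_size(7, [3, 3, 3])   # 7 -> 5 -> 3 -> 1
```

The first-level output size (7) equals the size of the second-level instance, which is the condition needed for the concatenation described above.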

D. R-3D-CNN model

To better utilize the spatial and the spectral contexts of

hyperspectral images, we design the recurrent 3D-CNN model

(R-3D-CNN). Like the R-2D-CNN model, the R-3D-CNN

model is also underpinned by multi-level recurrent neural

networks which shrink a patch gradually to form multi-level

instances. There are two main differences between the R-3D-

CNN model and the R-2D-CNN model. The ﬁrst difference is

that the former utilizes 3D convolution operators whereas the

latter uses its 2D counterparts. Hence, the R-3D-CNN model

can be regarded as an extension of the 3D-CNN model in a

recurrent manner. The second difference is that the instances of

the next level need to be preprocessed and concatenated with

the feature maps generated from the current level. The reason

is that we adopt 3D convolution layers, which change the length of the spectral dimension. Hence, we have to preprocess the instances of the next level with 3D convolution operations along the spectral channels so that their sizes match.

Fig.6 depicts an example of the proposed R-3D-CNN

model. The model consists of a multi-level recurrent neural

network with $P$ multi-level instances. As in the R-2D-CNN model, a "plain" 3D convolution network is applied to extract

the corresponding feature maps, which are then concatenated

with the next level instance to form new feature maps at each

level. This procedure is repeated until all multi-level instances

(patches) are incorporated. To ensure the consistency of the

sizes of feature maps of the current level and the sizes of the

instances at the next level, a preprocessing step is introduced to

the spectral channels. Finally, a softmax layer is applied, and a

cross entropy objective function is adopted. The optimization

process is again performed by using the BPTT algorithm. As with the R-2D-CNN model, the complexity of the R-3D-CNN model remains moderate because the recurrent structure shares the same network parameters across multiple levels. As with the 3D-CNN model, we need to reorder the hyperspectral images according to the ordering of spectral bands. Also, the sizes of the multi-level instances must be carefully determined, as for the R-2D-CNN model.

IV. EXPERIMENTAL RESULTS

We chose to use six publicly available hyperspectral image

data sets for evaluating the performance of the proposed

models. For a comparative evaluation, we also adopted SVM-

CK, GCK, LORSAL, SMLR-SpATV, MFL, SVM-3D, SVM-

3DG, and CNN-MRF as the baselines. For the performance

metrics, we used the overall accuracy of all classes, denoted

as OA, and the average accuracy of each class, denoted as

AA. We ran all the models on a desktop PC equipped with an

Intel Core 7 Duo CPU (at 3.40 GHz) with 12 GB of RAM,

and two GTX 1080Ti GPUs (16 GB of ROM) were also used.
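The two metrics can be sketched from a confusion matrix as follows. This is an illustration under the usual definitions (OA as the fraction of correctly classified test pixels, AA as the mean of the per-class accuracies); it assumes rows index the true class and every class has at least one test sample. The function names and the toy matrix are hypothetical.

```python
def overall_accuracy(confusion):
    """OA: correctly classified samples divided by all samples."""
    correct = sum(confusion[c][c] for c in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total

def average_accuracy(confusion):
    """AA: mean of the per-class accuracies (rows are true classes)."""
    per_class = [row[c] / sum(row) for c, row in enumerate(confusion)]
    return sum(per_class) / len(per_class)

# Toy 2-class confusion matrix with one large and one small class.
confusion = [
    [90, 10],  # true class 0
    [5, 5],    # true class 1
]
oa = overall_accuracy(confusion)  # 95 / 110
aa = average_accuracy(confusion)  # (0.9 + 0.5) / 2
```

In this toy case the OA is dominated by the large class, while the AA exposes the weak performance on the small class, which is why both metrics are reported.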

A. Data Sets

1) Indian Pines Scene

The data set was collected in 1992 by the AVIRIS sensor, which recorded remote sensing images of the Indian Pines test site in north-western Indiana. The hyperspectral image

contains 145 ×145 pixels in spatial dimensions, and 224

Fig. 6. The R-3D-CNN model with network parameters shared across multiple levels (diagram labels: plain network, recurrent network, spectral preprocessing, concat). The plain network is built with a small instance based on the basic 3D-CNN model, while the recurrent network is built with two instances of the basic 3D-CNN model; the complexity of the model remains moderate because of the shared parameters across multiple levels.

Fig. 7. Labeled images of different data sets: a) Indian Pines Scene. b) Botswana Scene. c) Salinas scene. d) Pavia Centre scene. e) Pavia University scene. f) Kennedy Space Center.

hyperspectral bands. Due to the presence of noisy bands, we

only used 200 hyperspectral bands. Speciﬁcally, the bands

covering the regions of water absorption, i.e., [104-108], [150-

163], 220, were removed. The ground truth available includes

16 classes which are not all mutually exclusive. As shown in

Fig.7 a), we randomly divided the labeled data into the training

(70%) and the testing (30%) sets for our experiment.
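A random 70%/30% split of the labeled pixels can be sketched as below. The paper does not state whether the split is drawn per class or over all labeled pixels, nor which random seed is used; this sketch assumes a single shuffle over all labeled pixels with a fixed seed, and the function name and the toy pixel grid are hypothetical.

```python
import random

def split_labeled_pixels(pixels, train_fraction=0.7, seed=0):
    """Shuffle the labeled pixels once and cut them into train/test parts."""
    rng = random.Random(seed)
    shuffled = list(pixels)
    rng.shuffle(shuffled)
    cut = round(train_fraction * len(shuffled))
    return shuffled[:cut], shuffled[cut:]

# Toy example: 100 labeled pixel coordinates on a 10x10 grid.
labeled = [(i, j) for i in range(10) for j in range(10)]
train, test = split_labeled_pixels(labeled)
```

Fixing the seed makes the split reproducible across runs, which matters when several classifiers are compared on the same partition.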

2) Botswana Scene

Botswana Scene was acquired by the Hyperion sensor on the NASA EO-1 satellite on May 31, 2001. This data set was collected over the Okavango Delta. The hyperspectral image contains 1476 × 1476 pixels in 224 bands, covering 400 nm to 2500 nm in increments of 10 nm. As with the

Indian Pines Scene data set, we removed the noisy bands to

produce an experimental data set containing 145 bands only.

The image data set contains 14 categories. As shown in Fig.7

b), we randomly split the data set into the training (70%) and the testing (30%) sets.

3) Salinas scene

The Salinas Scene is a hyperspectral image data set recorded in 1992 by the AVIRIS sensor over the Salinas Valley, California. The original images comprise 224 bands. We discarded 20 noisy bands, namely bands [108-112], bands [154-167], and band 224, to generate a hyperspectral image data set of 204 bands. For the

spatial dimensions, the scene includes 512×217 pixels. There

are 16 labeled classes in the original data set as shown in Fig.7

c).

4) Pavia Centre scene

This hyperspectral image data set was acquired over Pavia in northern Italy. It was produced in 2001 by using the Reflective Optics System Imaging Spectrometer (ROSIS) sensor. The Pavia Centre scene comprises 1096 × 1096 pixels with 114 hyperspectral bands. We preprocessed these images by removing 12 noisy bands, which leaves 102 bands. There are nine labeled classes in

the data set as shown in Fig.7 d).

5) Pavia University scene

This hyperspectral image data set covers the University of Pavia in Italy and was acquired with the ROSIS sensor. There are 103

hyperspectral bands in the image data set, with 610 ×340

pixels for the spatial dimensions. The image contains nine

labeled classes as shown in Fig.7 e).

6) Kennedy Space Center

The last data set, Kennedy Space Center (KSC), was acquired over the KSC area in Florida by the AVIRIS sensor on March 23, 1996. The hyperspectral image consists of

512 ×614 pixels of spatial dimensions, with 224 spectral

bands. After removing 48 noisy bands, we obtained 172

spectral bands. There are 13 labeled classes as shown in Fig.7

f).

B. Experimental Results

1) Results for the Indian Pines Scene

Before reporting the details of our experimental results,

we ﬁrst elaborate the various settings of the deep learning

techniques employed in our experiments. The structure of the

2D-CNN model is depicted in Fig.8. For the classiﬁcation

of each pixel, a 7×7×200 patch surrounding it is ﬁrst

constructed. Following this, three 2D convolution layers of size


3×3 are utilized. Moreover, 200 spectral bands are treated as

channels. The number of ﬁlters is set to 400 for the respective

layers, and the stride is set to 1. As a result, the feature maps

produced by the ﬁrst, second, and last convolution layers are

5×5×400, 3×3×400, and 1×1×400, respectively. Finally, a

softmax layer of 16 classes is deployed to classify the images.

The proposed network structure does not include the pooling

layers so as to keep as much information of each pixel as

possible. In addition, we apply SGD for network training and

set the mini-batch size to 10.

The structure of the 3D-CNN model is depicted in Fig.9.

Similar to the 2D-CNN model, a 7×7×200 patch is ﬁrst

extracted. Next, we build eight 3D convolution layers. The

size, number, stride and feature map sizes of the 3D ﬁlters in

each layer are shown in Fig.14. Again, we exclude the pooling

layers and adopt a mini-batch size of 10. Before applying the

3D-CNN model, the hyperspectral bands are ﬁrst reordered.

Fig.10 depicts the structure of the R-2D-CNN model. For

this model, the ﬁrst level instance is a 13 ×13 ×200 patch.

Then, we apply a three-layer 2D-CNN block to the instance,

which results in a 7×7×400 feature map. After concatenating

this feature map with our second level instance, that is, a 7×

7×200 patch, and reusing the 2D-CNN block, we obtain an
800-dimensional vector, which is connected to a softmax layer

to classify images. Again, we adopt a mini-batch size of 10

and do not utilize any pooling layers.
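The shapes in this two-level configuration can be traced with a small sketch. It assumes unpadded 3×3 convolutions with stride 1 and takes the 400 and 800 output widths from the description above; the helper names are hypothetical, and only shapes are tracked, not actual feature values.

```python
def conv_block(shape, out_channels, kernel=3, layers=3):
    """Shape after `layers` unpadded stride-1 convolutions on (h, w, c)."""
    h, w, _ = shape
    for _ in range(layers):
        h, w = h - kernel + 1, w - kernel + 1
    return (h, w, out_channels)

def concat_channels(a, b):
    """Channel-wise concatenation; the spatial sizes must agree."""
    if a[:2] != b[:2]:
        raise ValueError("spatial sizes differ, cannot concatenate")
    return (a[0], a[1], a[2] + b[2])

level1 = conv_block((13, 13, 200), out_channels=400)  # (7, 7, 400)
merged = concat_channels(level1, (7, 7, 200))          # (7, 7, 600)
level2 = conv_block(merged, out_channels=800)          # (1, 1, 800)
```

The concatenation only works because the first-level block shrinks the 13×13 instance to exactly the 7×7 size of the second-level instance.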

Similarly, we construct a 13×13×200 patch and a 7×7×200

patch at the ﬁrst two levels of the R-3D-CNN model. As shown

in Fig.11, we build a seven-layer 3D-CNN block, and apply it

to the ﬁrst level instance. This produces a 7×7×187 feature

map. Since the spectral band dimension is changed from 200

to 187, we ﬁrst apply a three-layer 3D convolution operation

to the second level instance. By doing so, the third dimension

is reduced to 187, and it can be concatenated with the feature

map of the ﬁrst level instance. Then, we reuse the seven-layer

3D convolution block which produces a 1×10 ×35 feature

map. Finally, a softmax layer is applied to the resulting feature

map to determine the class label.

Next, we report the experimental results based on various

data sets. Table I presents the performance of all the methods on the Indian Pines Scene. We observe that the R-3D-CNN model achieves the best performance, with an OA of 99.50%. Although the OA of the SMLR-SpATV is 99.11%, the R-3D-CNN model reduces its error rate by about 44% (from 0.89% to 0.50%). The main reason is that the R-3D-CNN model

considers both the spectral and the spatial contexts, where the

former is inferred via the 3D convolution operation and the

latter is inferred by using the multi-level recurrent structure.

In terms of AA and OA, the R-2D-CNN model is ranked

as the second best, which is followed by the SMLR-SpATV,

2D-CNN model and the 3D-CNN model. Though R-2D-CNN

ignores the spectral correlations, its recurrent structures can

effectively capture the spatial context for subsequent image

classiﬁcation. Our experimental results also imply that the

spatial context is more important than the spectral correlations

for hyperspectral image classification. As shown in Table I, the results of SVM-CK are better than those of SVM-3D, SVM-3DG and CNN-MRF. However, its performance is much worse than that of the proposed deep learning models. The first reason is that the SVM-CK classifier ignores the spatial and the spectral contexts. The second reason is that the SVM-CK classifier cannot effectively capture the nonlinear relationships between the features and the class labels of hyperspectral images. As a promising classification method, the GCK achieves performance comparable to that of the 2D-CNN and the 3D-CNN methods because it can extract EMAP information pertaining

to the spectral and the spatial contexts. Fig.12 provides a visual

comparison of the performance of all methods.
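The error-rate reduction quoted in this comparison can be reproduced from the two OA values. The sketch below assumes that "reduction of error rates" means the relative decrease of the error (100% minus OA) with respect to the baseline; the function name is hypothetical.

```python
def error_rate_reduction(baseline_oa, model_oa):
    """Relative reduction of the error rate (100 - OA), OA given in percent."""
    baseline_err = 100.0 - baseline_oa
    model_err = 100.0 - model_oa
    return (baseline_err - model_err) / baseline_err

# SMLR-SpATV (OA 99.11%) versus R-3D-CNN (OA 99.50%) on Indian Pines.
reduction = error_rate_reduction(99.11, 99.50)
```

For these two values the error drops from 0.89% to 0.50%, a relative reduction of roughly 0.44, even though the OA itself differs by only 0.39 percentage points.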

2) Results for the Pavia University Scene

In this experiment, the structures of the deep learning models were similar to those applied in the Indian Pines Scene experiment. The only difference was that the numbers of parameters were adjusted to match the 103 hyperspectral bands of this data set, as opposed to the 200 bands of the Indian Pines data set. Table II presents the experimental results of all the methods based on the Pavia University Scene data set. Again, we can see that the proposed R-3D-CNN performs the best in terms of OA, followed by the GCK method, the MFL method, the SMLR-SpATV method, and the R-2D-CNN model. The OA of the R-3D-CNN model is 99.97%, which is 0.49 percentage points higher than that of the GCK (99.48%). The R-3D-CNN model outperforms the GCK method by more than 94% when we consider the reduction of error rates. The SVM-3DG method, the 3D-CNN

and the 2D-CNN models achieve comparable results. The

LORSAL classiﬁer produces the worst performance among

all the methods. Fig.13 visualizes the classiﬁcation results of

all the methods.

3) Results for the Botswana Scene

As for the earlier scenes, we only modified the number of parameters for our deep learning models. Table III presents the experimental results of all the methods based on the Botswana Scene data set. We can see that the proposed R-3D-CNN model and the MFL method achieve the highest performance, followed by the SVM-3DG, the GCK method and the R-2D-CNN model. The OA of the R-3D-CNN model is 99.38%, which is 0.31 percentage points higher than that of the MFL (99.07%). The R-3D-CNN model outperforms the MFL method by about 33% in terms of the reduction of error rates. Again, the other models such as the 3D-CNN,

the SMLR-SpATV and the 2D-CNN models perform better

than the LORSAL classiﬁer which produces the worst result.

Fig.14 visualizes the classiﬁcation results of all the methods.

4) Results for the Salinas Scene

Table IV presents the experimental results of all the

methods based on the Salinas scene data set. The R-3D-

CNN model achieves the best performance, followed by the

GCK method, the MFL method and the R-2D-CNN model.

The SMLR-SpATV, the 3D-CNN and the 2D-CNN models

also achieve promising results. The OA of the R-3D-CNN

model is 99.80%, which is 0.46 percentage points higher than that of the GCK (99.34%). The R-3D-CNN model improves on the GCK method by about 70% in terms of the reduction of error rates. Again, the LORSAL classifier produces the worst result among all the methods. Due to memory limitations of our computer, we could not run the CNN-MRF classifier on


TABLE I

CLASSIFICATION RESULTS OF THE INDIAN PINES SCENE.

Class SVM-CK[17] GCK[23] LORSAL[26] SMLR-SpATV[27] MFL[28] SVM-3D[29] SVM-3DG[29] CNN-MRF[35] 2D-CNN 3D-CNN R-2D-CNN R-3D-CNN

1 85.71±0.4 92.86±0.5 85.71±0.1 92.86±0.2 85.71±0.3 57.14±0.4 64.29±0.4 84.62±0.2 71.72±1.0 85.71±0.4 78.57±0.1 100

2 86.82±0.3 98.12±0.4 89.88±0.3 98.59±0.2 96.24±0.2 79.29±0.3 80.00±0.2 65.65±0.3 95.85±0.6 96.46±0.3 99.29±0.1 100

3 86.12±0.2 94.29±0.3 82.04±0.1 98.37±0.2 92.65±0.3 71.02±0.3 73.47±0.4 96.36±0.4 95.90±0.3 97.13±0.1 98.77±0.3 100

4 88.40±0.5 94.20±0.3 82.61±0.3 100 97.10±0.4 97.10±0.4 97.10±0.5 88.73±0.3 73.91±0.1 98.55±0.3 100 100

5 95.10±0.4 96.50±0.4 91.61±0.3 98.60±0.2 97.20±0.3 95.80±0.1 91.61±0.3 93.06±0.2 97.20±0.2 97.90±0.2 97.90±0.2 100

6 98.61±0.7 99.08±0.1 99.08±0.2 99.54±0.4 99.54±0.2 98.16±0.3 97.70±0.4 99.09±0.4 96.31±0.3 97.68±0.5 99.53±0.3 100

7 75.00±0.1 100 75.00±0.1 100 100 75.00±0.3 62.50±0.3 50.00±0.1 100 100 87.50±0.2 100

8 98.60±0.1 100 97.90±0.2 100 100 99.30±0.2 100 95.10±0.4 100 99.30±0.07 100 100

9 100 83.33±0.3 83.33±0.1 83.33±0.3 100 100 100 100 100 100 100 100

10 87.19±0.7 93.43±0.4 85.81±0.5 98.27±0.3 92.04±0.2 75.09±0.4 75.78±0.4 76.63±0.2 97.20±0.6 98.26±0.3 98.95±0.3 99.65±0.3

11 91.01±0.8 98.09±0.8 88.83±0.1 99.59±0.2 98.50±0.3 88.01±0.2 95.37±0.4 97.55±0.2 99.04±0.4 98.77±0.4 99.45±0.2 99.31±0.2

12 94.84±0.6 94.89±0.4 88.64±0.2 99.43±0.3 96.02±0.3 85.80±0.4 86.36±0.3 76.27±0.3 95.45±0.6 97.15±0.4 99.43±0.2 98.85±0.2

13 100 100 100 100 98.36±0.3 98.36±0.3 98.36±0.2 100 100 96.72±0.1 98.36±0.1 100

14 96.81±0.7 99.20±0.2 96.02±0.2 100 99.47±0.2 97.88±0.4 97.08±0.4 99.47±0.4 98.94±0.3 99.46±0.6 100 99.73±0.2

15 81.57±0.8 95.61±0.6 83.33±0.3 97.37±0.2 97.37±0.3 86.84±0.2 100 95.65±0.2 94.73±0.6 93.80±0.5 98.24±0.2 96.46±0.3

16 100 100 85.71±0.1 100 100 82.14±0.2 82.14±0.3 100 100 100 96.42±0.8 96.42±0.5

AA 91.62±0.3 96.22±0.5 88.47±0.2 97.87±0.2 96.89±0.2 85.23±0.3 87.61±0.1 88.26±0.3 96.37±0.3 97.31±0.2 97.03±0.3 99.42±0.3

OA 91.51±0.2 97.44±0.4 90.10±0.1 99.11±0.2 97.05±0.3 86.55±0.2 89.44±0.2 88.95±0.2 97.08±0.2 98.92±0.4 99.19±0.3 99.50±0.3

TABLE II

CLASSIFICATION RESULTS OF THE PAVIA UNIVERSITY SCENE.

Class SVM-CK[17] GCK[23] LORSAL[26] SMLR-SpATV[27] MFL[28] SVM-3D[29] SVM-3DG[29] CNN-MRF[35] 2D-CNN 3D-CNN R-2D-CNN R-3D-CNN

1 99.53±0.3 99.64±0.4 91.20±0.2 99.85±0.2 100 98.14±0.2 99.45±0.2 99.60±0.4 91.65±0.2 98.70±0.3 99.34±0.3 100

2 98.59±0.5 99.85±0.3 96.92±0.1 100 99.93±0.3 99.30±0.2 99.86±0.3 98.09±0.4 99.45±0.1 99.77±0.4 99.96±0.4 100

3 85.21±0.3 97.61±0.5 64.07±0.3 100 93.64±0.2 88.08±0.4 87.12±0.1 76.31±0.2 92.09±0.3 97.94±0.4 98.88±0.2 100

4 96.52±0.2 98.62±0.2 88.57±0.4 93.58±0.4 98.59±0.5 99.02±0.2 99.67±0.3 96.08±0.4 87.26±0.4 94.55±0.2 93.91±0.3 99.89±0.1

5 99.75±0.5 99.34±0.7 99.75±0.2 100 99.50±0.3 100 100 99.75±0.2 91.05±0.2 97.77±0.1 98.27±0.2 100

6 92.24±0.4 99.78±0.5 57.76±0.3 100 99.67±0.3 96.82±0.1 99.07±0.3 88.66±0.3 98.61±0.4 99.60±0.2 99.94±0.4 100

7 91.71±0.1 99.41±0.5 59.05±0.2 100 99.75±0.2 95.48±0.3 95.73±0.2 83.71±0.1 90.72±0.1 98.00±0.2 98.49±0.2 100

8 93.30±0.2 98.52±0.4 80.45±0.3 99.82±0.2 99.10±0.4 95.20±0.4 96.38±0.4 92.75±0.2 93.57±0.4 98.55±0.3 99.81±0.2 100

9 99.65±0.5 99.65±0.3 97.89±0.4 95.77±0.3 100 98.94±0.2 98.94±0.2 98.59±0.4 86.62±0.2 81.34±0.2 94.72±0.3 98.94±0.3

AA 94.52±0.3 99.21±0.3 81.74±0.2 98.78±0.4 98.91±0.2 96.78±0.4 97.36±0.2 92.62±0.2 98.78±0.4 96.25±0.3 98.15±0.2 99.87±0.3

OA 94.72±0.2 99.48±0.2 86.74±0.2 99.41±0.2 99.42±0.2 97.80±0.4 98.62±0.1 95.16±0.3 95.46±0.2 98.49±0.2 99.19±0.2 99.97±0.2

TABLE III

CLASSIFICATION RESULTS OF THE BOTSWANA SCENE.

Class SVM-CK[17] GCK[23] LORSAL[26] SMLR-SpATV[27] MFL[28] SVM-3D[29] SVM-3DG[29] CNN-MRF[35] 2D-CNN 3D-CNN R-2D-CNN R-3D-CNN

1 100 100 100 100 100 100 100 100 98.77±0.4 100 100 100

2 100 100 90.00±0.1 100 100 100 100 100 93.33±0.3 93.33±0.2 100 100

3 98.66±0.8 100 93.33±0.3 100 97.33±0.3 98.67±0.3 100 98.67±0.3 97.33±0.3 94.67±0.3 98.66±0.2 100

4 96.87±0.5 99.48±0.6 89.06±0.3 100 100 98.44±0.2 100 79.69±0.4 98.44±0.5 96.88±0.3 96.87±0.3 100

5 92.59±0.2 93.17±0.2 76.54±0.2 97.53±0.1 97.53±0.1 97.53±0.3 98.77±0.3 90.00±0.1 91.36±0.3 96.30±0.3 96.29±0.2 97.53±0.3

6 90.12±0.4 93.97±0.5 58.02±0.1 98.77±0.3 100 92.59±0.4 88.89±0.4 90.00±0.2 100 98.63±0.1 100 100

7 100 100 98.68±0.4 100 100 100 100 92.21±0.1 100 98.69±0.3 100 100

8 100 100 86.67±0.3 98.33±0.2 100 100 100 90.00±0.3 100 95.00±0.1 96.66±0.2 98.33±0.3

9 96.80±0.6 98.33±0.7 78.72±0.1 100 96.81±0.2 97.87±0.4 100 96.81±0.1 77.66±0.8 95.75±0.2 100 98.94±0.4

10 94.59±0.3 100 74.32±0.1 100 100 97.30±0.4 100 87.84±0.2 96.81±0.4 100 98.64±0.2 100

11 92.30±0.4 97.54±0.4 90.11±0.2 100 98.90±0.3 96.70±0.4 98.90±0.4 100 98.65±0.3 100 100 100

12 94.93±0.5 100 90.74±0.2 98.15±0.3 100 100 100 64.81±0.1 100 100 96.29±0.4 98.14±0.4

13 91.53±0.3 99.59±0.7 93.67±0.3 100 100 100 100 95.00±0.3 98.73±0.5 100 100 100

14 100 96.00±0.1 32.14±0.2 42.86±0.2 96.43±0.2 96.43±0.3 96.43±0.2 53.57±0.3 92.86±0.3 92.86±0.4 85.71±0.4 96.43±0.3

AA 96.79±0.3 98.33±0.6 82.29±0.4 95.40±0.4 99.07±0.3 98.25±0.3 98.78±0.4 88.47±0.3 97.21±0.3 97.30±0.3 98.89±0.3 99.24±0.2

OA 96.28±0.1 98.21±0.2 84.09±0.3 97.83±0.2 99.07±0.4 98.14±0.2 98.76±0.3 90.70±0.4 97.60±0.5 97.21±0.1 98.54±0.2 99.38±0.2

Fig. 8. The 2D-CNN network for hyperspectral remote sensing classification (the stride of each layer is 1). Stages shown: patch extraction, Conv1, Conv2, Conv3, fully connected layer, softmax classifier.

Fig. 9. The 3D-CNN network for remote sensing hyperspectral image classiﬁcation.

Fig. 10. The R-2D-CNN network for remote sensing hyperspectral image classification (the stride of each layer is 1). Stages shown: patch extraction, Conv1 to Conv3, Concat, Conv4 to Conv6, Softmax.

this data set. Fig.15 visualizes the classiﬁcation results of all

the methods.

5) Results for the Pavia Centre Scene

Table V presents the experimental results of all the methods based on the Pavia Centre Scene data set. The results are quite different from those obtained on the previous four data sets. We observe that the R-2D-CNN model performs the best, followed by the SVM-3DG and the MFL classifiers. The R-2D-CNN model outperforms the SVM-3DG method by about 77% in terms of the reduction of error rates (from 0.53% to 0.12%). The R-3D-CNN model, which achieves the best performance on the previous data sets, produces unsatisfactory results when compared to those of the R-2D-CNN model and the SVM-CK classifier. The reason may be that the R-3D-CNN model fuses the spectral and the spatial information by using a 3D operator, whereas the Pavia Centre scene contains only 102 bands, fewer than the other data sets. On the other hand, the 2D-CNN and the 3D-CNN models perform the worst among all the methods because it is difficult for these models to classify the third and the ninth classes due to the limited number of instances and channels. Since methods such as the GCK, the SMLR-SpATV, and the CNN-MRF require more RAM than our computer has, we could not obtain their performance on the data set. Fig. 16 shows the

classiﬁcation results of all the methods.

6) Results for the Kennedy Space Center Scene

Table VI presents the experimental results of all the methods based on the Kennedy Space Center Scene data set. Again, we observe that the proposed R-3D-CNN performs the best, followed by the R-2D-CNN model and the GCK model. The R-3D-CNN model outperforms the GCK method by about 85% in terms of the reduction of error rates (from 1.02% to 0.15%). The 3D-CNN model, the MFL method, and the SVM-3DG model achieve comparable results, followed by the CNN-MRF method, the 2D-CNN model, and the SVM-CK method. The SVM-3D classifier produces the worst result, and the LORSAL method outperforms the


Fig. 11. The R-3D-CNN network for remote sensing hyperspectral image classiﬁcation.

*URXQGWUXWK 690&. *&. /256$/ 60/56S$79 0)L 690'

690'* &1105) '&11 '&11 5'&11 5'&11

Fig. 12. Classiﬁcation maps and overall classiﬁcation accuracies obtained for the AVIRIS Indian Pines data set (overall accuracies are reported in parentheses).

SVM-3D method by 2% in terms of OA. Fig.17 shows the

classiﬁcation results of all the methods.

C. Convergence Speed Comparison

Fig.18 plots the accuracies of different deep learning models

against the number of iterations based on the six data sets. We

observe that the R-3D-CNN model converges in fewer iterations than the other models, the only exception being the Salinas data set. The efficiency

improvement brought by the R-3D-CNN model is attributed

to the recurrent structure and the 3D convolutional operation.

Speciﬁcally, the feature maps that are extracted by the R-

3D-CNN model contain richer contextual information of the

images, which leads to a quicker convergence of model

training.

D. The Impact of the Size of Training Samples

In this experiment, we examined how the performance of the

proposed deep learning models changed against varying sizes

of the training samples. To this end, we varied the proportion of training samples from 10% to 70%, and reported the OA achieved by all methods. Fig. 19 shows the results based on the six data sets. From Fig. 19, we can make two important observations. First, for the conventional classifiers, i.e., GCK, MFL, SVM-CK, SVM-3D, SVM-3DG, SMLR-SpATV, and LORSAL, we find that their classification performance is insensitive to the number of training samples, especially on the Botswana Scene, Salinas Scene, Pavia Centre Scene, Pavia University Scene, and Kennedy Space Center data sets. Promising results are achieved when only 10% of the samples are used to train

these methods, and feeding more training samples to these

methods only leads to marginal performance improvement.

Among the conventional methods, the GCK and the MFL


TABLE IV

CLASSIFICATION RESULTS OF THE SALINAS SCENE.

Class SVM-CK[17] GCK[23] LORSAL[26] SMLR-SpATV[27] MFL[28] SVM-3D[29] SVM-3DG[29] CNN-MRF[35] 2D-CNN 3D-CNN R-2D-CNN R-3D-CNN

1 99.83±0.3 100 96.85±0.2 99.00±0.3 100 100 100 - 99.50±0.2 97.68±0.3 100 99.83±0.2

2 100 99.82±0.2 98.93±0.2 99.82±0.1 100 99.82±0.2 99.91±0.3 - 99.82±0.5 99.46±0.2 99.46±0.3 99.91±0.4

3 100 100 80.24±0.2 95.61±0.1 100 99.49±0.4 99.49±0.4 - 98.31±0.3 97.80±0.5 99.49±0.1 99.66±0.2

4 99.04±0.4 99.76±0.3 99.28±0.4 99.76±0.3 99.52±0.1 99.04±0.2 99.28±0.4 - 97.61±0.4 97.13±0.3 97.13±0.2 97.37±0.4

5 99.75±0.5 99.50±0.1 98.13±0.2 99.75±0.2 98.63±0.3 99.63±0.3 99.75±0.2 - 98.50±0.5 98.80±0.4 99.13±0.3 99.50±0.1

6 100 100 99.92±0.1 100 99.83±0.3 100 100 - 99.75±0.3 98.91±0.3 99.07±0.3 100

7 99.91±0.1 99.81±0.2 99.16±0.3 99.72±0.1 99.44±0.2 100 100 - 98.42±0.4 96.55±0.4 99.81±0.4 99.53±0.1

8 92.69±0.5 96.54±0.5 88.10±0.4 96.98±0.4 97.93±0.3 92.28±0.3 94.11±0.2 - 99.47±0.4 97.28±0.3 99.91±0.2 99.97±0.4

9 100 100 99.25±0.3 100 100 99.62±0.1 99.62±0.1 - 99.89±0.2 99.84±0.2 99.84±0.2 100

10 99.08±0.4 99.80±0.3 84.33±0.2 96.64±0.2 99.69±0.4 97.55±0.3 98.88±0.4 - 99.70±0.1 98.78±0.2 99.18±0.2 99.90±0.2

11 100 99.69±0.4 85.89±0.4 94.36±0.3 99.69±0.3 98.75±0.3 98.75±0.2 - 99.37±0.5 99.06±0.3 99.06±0.3 100

12 100 100 100 100 100 100 100 - 98.09±0.6 99.14±0.4 98.27±0.5 99.65±0.4

13 99.64±0.4 99.27±0.3 98.91±0.2 99.27±0.4 98.55±0.3 98.55±0.3 98.55±0.3 - 96.00±0.1 90.55±0.5 98.55±0.4 100

14 98.75±0.5 99.07±0.4 94.08±0.4 99.38±0.3 95.64±0.2 96.57±0.2 98.75±0.2 - 96.89±0.3 93.46±0.6 97.20±0.3 98.44±0.2

15 82.61±0.2 96.14±0.4 52.98±0.3 97.84±0.2 99.17±0.4 77.37±0.3 81.46±0.3 - 99.22±0.4 97.47±0.4 99.73±0.4 100

16 99.82±0.3 100 96.67±0.4 98.89±0.4 98.89±0.4 99.26±0.3 99.45±0.3 - 99.41±0.1 99.06±0.3 99.81±0.2 100

AA 96.06±0.5 98.66±0.4 92.05±0.3 98.56±0.3 99.19±0.4 97.37±0.3 98.00±0.4 - 98.90±0.3 98.65±0.2 99.12±0.4 99.61±0.2

OA 96.00±0.3 99.34±0.3 88.57±0.3 98.46±0.3 99.16±0.3 94.95±0.2 96.03±0.3 - 98.96±0.4 99.08±0.3 99.47±0.3 99.80±0.2

TABLE V

CLASSIFICATION RESULTS OF THE PAVIA CENTRE SCENE.

Class SVM-CK[17] GCK[23] LORSAL[26] SMLR-SpATV[27] MFL[28] SVM-3D[29] SVM-3DG[29] CNN-MRF[35] 2D-CNN 3D-CNN R-2D-CNN R-3D-CNN

1 100 - - - 99.99±0.4 100 100 - 99.36±0.2 99.85±0.2 100 99.32±0.2

2 98.24±0.4 - - - 97.41±0.2 96.14±0.2 98.03±0.2 - 87.51±0.3 95.73±0.2 99.65±0.4 86.52±0.3

3 96.54±0.5 - - - 91.15±0.3 93.96±0.3 92.34±0.2 - 88.35±0.5 95.73±0.1 99.78±0.3 85.11±0.3

4 94.30±0.2 - - - 97.89±0.2 90.20±0.2 94.42±0.1 - 88.59±0.4 95.04±0.1 99.88±0.4 91.80±0.1

5 98.18±0.3 - - - 99.70±0.4 99.09±0.4 99.44±0.2 - 91.89±0.4 97.52±0.2 99.85±0.4 92.95±0.3

6 99.31±0.5 - - - 98.59±0.3 98.34±0.2 99.24±0.2 - 90.48±0.3 97.76±0.3 99.75±0.3 90.87±0.4

7 95.38±0.4 - - - 93.87±0.3 96.89±0.5 98.35±0.3 - 96.30±0.4 99.36±0.2 99.95±0.2 96.97±0.2

8 99.80±0.5 - - - 99.87±0.2 99.93±0.2 99.95±0.2 - 96.98±0.5 98.84±0.3 99.90±0.3 96.49±0.1

9 100 - - - 94.76±0.3 99.65±0.3 99.88±0.4 - 73.69±0.1 91.51±0.4 98.25±0.3 83.80±0.2

AA 97.97±0.4 - - - 97.03±0.2 97.13±0.2 97.96±0.3 - 90.32±0.4 96.82±0.3 99.67±0.2 91.54±0.2

OA 97.32±0.3 - - - 99.10±0.2 99.17±0.2 99.47±0.3 - 96.02±0.4 98.75±0.1 99.88±0.3 96.79±0.2

TABLE VI

CLASSIFICATION RESULTS OF THE KENNEDY SPACE CENTER.

Class SVM-CK[17] GCK[23] LORSAL[26] SMLR-SpATV[27] MFL[28] SVM-3D[29] SVM-3DG[29] CNN-MRF[35] 2D-CNN 3D-CNN R-2D-CNN R-3D-CNN

1 96.49±0.2 100 94.74±0.2 97.37±0.4 99.56±0.3 98.25±0.2 99.56±0.3 96.50±0.2 98.68±0.4 98.68±0.4 100 100

2 95.90±0.2 100 90.41±0.1 98.63±0.2 100 86.30±0.3 97.26±0.2 97.22±0.2 100 91.78±0.1 100 97.26±0.2

3 97.40±0.3 98.70±0.3 93.51±0.2 98.70±0.4 100 97.40±0.3 97.40±0.2 98.68±0.4 100 98.78±0.1 100 100

4 86.84±0.2 90.79±0.4 75.00±0.4 84.21±0.1 100 77.63±0.2 96.05±0.2 77.33±0.2 94.73±0.6 97.37±0.4 94.74±0.3 98.68±0.4

5 77.08±0.3 97.92±0.3 72.92±0.2 81.25±0.2 100 83.33±0.4 83.33±0.3 83.33±0.3 93.75±0.2 91.67±0.6 97.92±0.3 100

6 78.26±0.1 100 75.82±0.3 90.11±0.2 99.56±0.3 98.55±0.3 98.55±0.4 100 100 97.10±0.1 98.55±0.1 98.55±0.1

7 87.09±0.6 100 96.77±0.3 100 100 100 100 100 100 100 100 100

8 96.90±0.9 99.23±0.4 96.90±0.4 98.45±0.4 98.45±0.3 93.20±0.4 94.57±0.3 99.22±0.2 93.80±0.8 100 96.90±0.2 100

9 100 100 98.72±0.1 100 99.36±0.3 100 100 100 100 99.36±0.6 100 99.35±0.1

10 97.50±0.1 98.33±0.3 97.50±0.4 98.45±0.3 98.33±0.3 68.33±0.2 98.33±0.2 100 98.32±0.8 94.12±0.6 100 100

11 99.20±0.1 100 100 100 87.20±0.4 99.20±0.2 99.20±0.4 100 100 100 96.80±0.1 99.20±0.1

12 98.67±0.5 97.35±0.1 92.72±0.1 95.36±0.3 96.69±0.4 96.03±0.2 100 100 98.01±0.3 96.69±0.5 99.34±0.3 98.68±0.2

13 100 100 97.84±0.4 100 100 87.77±0.4 100 100 97.48±0.2 97.12±0.2 100 98.56±0.1

AA 95.96±0.2 98.64±0.1 90.95±0.3 95.33±0.2 98.43±0.2 91.22±0.2 97.25±0.3 96.30±0.2 97.81±0.3 98.20±0.1 98.79±0.4 99.23±0.1

OA 96.95±0.6 98.98±0.3 93.59±0.4 96.86±0.3 98.27±0.3 91.67±0.3 98.27±0.3 97.56±0.3 97.76±0.5 98.46±0.3 99.22±0.4 99.85±0.3


Ground-truth SVM-CK(94.72%) GCK(99.48%) LORSAL(86.74%) SMLR-SpATV(99.41%) MFL(99.42%) SVM-3D(97.80%)

SVM-3DG(98.62%) CNN-MRF(95.16%) 2D-CNN(95.46%) 3D-CNN(98.49%) R-2D-CNN(99.19%) R-3D-CNN(99.97%)

Fig. 13. Classiﬁcation maps and overall classiﬁcation accuracies obtained for the Pavia University scene data set (overall accuracies are reported in parentheses).

Ground-truth SVM-CK(96.28%) GCK(98.21%) LORSAL(84.09%) SMLR-SpATV(97.83%) MFL(99.07%) SVM-3D(98.14%)

SVM-3DG(98.76%) CNN-MRF(90.70%) 2D-CNN(97.60%) 3D-CNN(97.21%) R-2D-CNN(98.54%) R-3D-CNN(99.38%)

Fig. 14. Classiﬁcation maps and overall classiﬁcation accuracies obtained for the Botswana Scene data set (overall accuracies are reported in parentheses).

methods often perform the best. Second, we can observe

an obvious performance improvement when the number of

training samples is increased (from 10% to 50%) for the

proposed deep learning models, i.e., 2D-CNN, 3D-CNN, R-2D-CNN, and R-3D-CNN. When more than 60% of the samples are used for training, the R-2D-CNN and the R-3D-CNN models often achieve results comparable to those of the best conventional methods such as GCK and MFL. Moreover,

we observe that the deep learning-based model CNN-MRF

produces unstable classiﬁcation performance, that is, results

can be better or worse when more training samples are used.

As a matter of fact, the proposed R-2D-CNN and R-3D-

CNN models outperform CNN-MRF when sufﬁcient training

samples are provided. All these observations indicate that the

proposed deep learning models (R-2D-CNN and R-3D-CNN)

are more effective than the baselines when sufﬁcient training

samples are provided.
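The experiment above varies the share of labeled pixels used for training. A minimal sketch of one way to draw such a split is given below, assuming per-class (stratified) random sampling. The exact sampling protocol is not stated in this section, so this is an assumption, and `stratified_split` is a hypothetical helper.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_fraction, seed=0):
    """Return (train_idx, test_idx), sampling train_fraction of each class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label][:]
        rng.shuffle(indices)
        # Keep at least one training sample per class.
        n_train = max(1, round(train_fraction * len(indices)))
        train_idx.extend(indices[:n_train])
        test_idx.extend(indices[n_train:])
    return sorted(train_idx), sorted(test_idx)


# Toy label vector: 40 pixels of class 0, 20 of class 1, 10 of class 2.
labels = [0] * 40 + [1] * 20 + [2] * 10

for fraction in (0.1, 0.3, 0.5):
    train_idx, test_idx = stratified_split(labels, fraction)
    print(f"{fraction:.0%} -> {len(train_idx)} train / {len(test_idx)} test")
```

Sampling per class rather than over the whole image keeps rare classes represented even at the 10% setting, which matters for data sets with strongly unbalanced class sizes.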

E. Discussions

In this subsection, we briefly discuss the experimental results presented above. First, we find that the R-3D-CNN model often performs better than the other models across all six data sets. There are two possible reasons for this performance improvement: (i) the R-3D-CNN model effectively


Ground-truth SVM-CK(96.00%) GCK(99.34%) LORSAL(88.57%) SMLR(98.46%) MFL(99.16%)

SVM-3D(94.95%) SVM-3DG(98.00%) 2D-CNN(98.96%) 3D-CNN(99.08%) R-2D-CNN(99.47%) R-3D-CNN(99.80%)

Fig. 15. Classiﬁcation maps and overall classiﬁcation accuracies obtained for the Salinas scene data set (overall accuracies are reported in parentheses).

Ground-truth SVM-CK(94.72%) GCK(99.48%) LORSAL(86.74%) SMLR-SpATV(99.41%) MFL(99.42%) SVM-3D(97.80%)

SVM-3DG(98.62%) CNN-MRF(95.16%) 2D-CNN(95.46%) 3D-CNN(98.49%) R-2D-CNN(99.19%) R-3D-CNN(99.97%)

Fig. 16. Classiﬁcation maps and overall classiﬁcation accuracies obtained for the Pavia Centre scene data set (overall accuracies are reported in parentheses).

Ground-truth SVM-CK(96.95%) GCK(98.98%) LORSAL(93.59%) SMLR-SpATV(96.86%) NMF(98.27%) SVM-3D(91.67%)

SVM-3DG(98.27%) CNN-MRF(97.56%) 2D-CNN(97.76%) 3D-CNN(98.46%) R-2D-CNN(99.22%) R-3D-CNN(99.85%)

Fig. 17. Classiﬁcation maps and overall classiﬁcation accuracies obtained for the Kennedy Space Center data set (overall accuracies are reported in parentheses).


Fig. 18. Accuracy versus the number of training iterations on the different data sets, with one curve each for 2D-CNN, 3D-CNN, R-2D-CNN, and R-3D-CNN (x-axis: iterations; y-axis: accuracy). (a) Indian Pines scene; (b) Botswana scene; (c) Salinas scene; (d) Pavia Centre scene; (e) Pavia University scene; (f) Kennedy Space Center.

Fig. 19. Influence of the proportion of training samples (x-axis: 10% to 70%) on the accuracy (y-axis, in %). Legend entries: 2D-CNN, 3D-CNN, R-2D-CNN, R-3D-CNN, SVM-CK, GCK, CNN-MRF, LORSAL, MFL, SMLR-SpATV, SVM-3D, and SVM-3DG (not every method appears in every panel). (a) Indian Pines scene; (b) Botswana scene; (c) Salinas scene; (d) Pavia Centre scene; (e) Pavia University scene; (f) Kennedy Space Center.

fuses the spatial and the spectral correlations; (ii) the multi-level recurrent structure can exploit spatial contexts better than a flat, non-recurrent structure. For the same reason, we also observe that the R-2D-CNN model often outperforms the baseline models (LORSAL, GCK, SMLR-SpATV, CNN-MRF, SVM-CK, SVM-3D, SVM-3DG, and MFL). The MFL and the GCK models perform better than the 2D-CNN and the 3D-CNN models because their well-designed EMAP attributes can effectively represent the spatial contexts. On the other hand, the 2D-CNN and the 3D-CNN models generally perform better than the SVM-CK, the SVM-3D, and the LORSAL classifiers. Overall, our experimental results verify the effectiveness and the advantages of the deep learning-based methods.

Second, the 3D-CNN model often performs better than the 2D-CNN model. The main reason is that the 3D convolution operation can exploit both spatial features and spectral correlations, whereas the 2D convolution operation can only exploit spatial features. On the other hand, the R-2D-CNN model often performs better than both the 3D-CNN and the 2D-CNN models because its recurrent structure exploits the spatial contexts more effectively. Among the four models, the R-3D-CNN model not only performs the best on most data sets but also converges faster.
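As a rough illustration of the difference between the two operations, the sketch below applies a 2D and a 3D convolution (implemented as cross-correlation, as is usual in deep learning) to a toy hyperspectral patch. It uses made-up data and hypothetical function names and is not the paper's implementation. The 2D kernel spans all bands at once, so the spectral axis is summed away, while the 3D kernel covers only a few adjacent bands and slides along the spectral axis, so local spectral structure survives in the output.

```python
def conv2d_multiband(cube, kernel):
    """2D convolution over a multi-band input: kernel[b][i][j] spans ALL bands,
    so the spectral axis is summed away and one 2D feature map comes out."""
    bands, height, width = len(cube), len(cube[0]), len(cube[0][0])
    k = len(kernel[0])
    out = []
    for r in range(height - k + 1):
        row = []
        for c in range(width - k + 1):
            row.append(sum(
                cube[b][r + i][c + j] * kernel[b][i][j]
                for b in range(bands) for i in range(k) for j in range(k)
            ))
        out.append(row)
    return out


def conv3d(cube, kernel):
    """3D convolution: kernel[d][i][j] covers only `depth` adjacent bands and
    also slides along the spectral axis, so the output keeps a band dimension."""
    bands, height, width = len(cube), len(cube[0]), len(cube[0][0])
    depth, k = len(kernel), len(kernel[0])
    out = []
    for s in range(bands - depth + 1):
        plane = []
        for r in range(height - k + 1):
            row = []
            for c in range(width - k + 1):
                row.append(sum(
                    cube[s + d][r + i][c + j] * kernel[d][i][j]
                    for d in range(depth) for i in range(k) for j in range(k)
                ))
            plane.append(row)
        out.append(plane)
    return out


# Toy patch: 6 bands, 5 x 5 pixels; the value equals the band index.
cube = [[[float(b) for _ in range(5)] for _ in range(5)] for b in range(6)]

ones_2d = [[[1.0] * 3 for _ in range(3)] for _ in range(6)]   # 6 x 3 x 3
ones_3d = [[[1.0] * 3 for _ in range(3)] for _ in range(2)]   # 2 x 3 x 3

map_2d = conv2d_multiband(cube, ones_2d)
vol_3d = conv3d(cube, ones_3d)

print(len(map_2d), len(map_2d[0]))                      # spatial size only
print(len(vol_3d), len(vol_3d[0]), len(vol_3d[0][0]))   # spectral x spatial
```

In this toy case the 2D output is a single 3 x 3 map, whereas the 3D output is a 5 x 3 x 3 volume whose values still change from band to band, so later layers can continue to model spectral correlations.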

Finally, we find that the proposed deep learning models (e.g., R-3D-CNN and R-2D-CNN) may be slightly inferior to conventional machine learning techniques when training samples are limited. However, when a reasonable number of training samples is available, their performance is considerably better than that of conventional machine learning techniques such as LORSAL, GCK, MFL, SMLR-SpATV, SVM-3D, SVM-3DG, and SVM-CK. The main reason is that deep learning models usually contain more model parameters, and hence more training samples are required to estimate the values of these parameters.
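To give a feel for how quickly parameters accumulate, the sketch below counts the weights of a small, purely hypothetical stack of layers and compares it with a linear classifier on the raw spectrum. The layer sizes, band count, and class count are assumptions chosen for illustration, not the paper's configuration, and `conv_params` and `dense_params` are illustrative helpers.

```python
def conv_params(kernel_shape, in_channels, out_channels):
    """Weights plus one bias per output feature map for a convolution layer."""
    kernel_size = 1
    for extent in kernel_shape:
        kernel_size *= extent
    return (kernel_size * in_channels + 1) * out_channels


def dense_params(in_features, out_features):
    """Weights plus biases for a fully connected layer."""
    return (in_features + 1) * out_features


# Hypothetical settings: 200 spectral bands, 16 classes.
bands, classes = 200, 16

# Linear classifier on the raw spectrum of a single pixel.
linear = dense_params(bands, classes)

# Small 2D-CNN: two 3 x 3 convolution layers (32 and 64 feature maps)
# followed by a dense layer fed by 64 maps of size 3 x 3.
cnn_2d = (
    conv_params((3, 3), bands, 32)
    + conv_params((3, 3), 32, 64)
    + dense_params(64 * 3 * 3, classes)
)

print(f"linear classifier: {linear:,} parameters")
print(f"small 2D-CNN:      {cnn_2d:,} parameters")
```

Even this modest network has more than twenty times as many parameters as the linear model, which is consistent with the observation that the deep models need more labeled samples before they overtake the conventional methods.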

V. CONCLUSIONS

In this paper, we have explored deep learning techniques for solving the hyperspectral image classification problem. In particular, four deep learning models, namely 2D-CNN, 3D-CNN, R-2D-CNN, and R-3D-CNN, have been designed and developed. Rigorous experiments were conducted on six publicly available hyperspectral image data sets, and our experimental results confirm the superiority of these deep learning methods over traditional machine learning methods such as LORSAL, MFL, GCK, SVM-3D, and SVM-CK. In addition, the proposed R-3D-CNN and R-2D-CNN models outperform CNN-MRF, SVM-3DG, and SMLR-SpATV. On the whole, the proposed R-3D-CNN model outperforms the other models on most of the data sets, and it also converges faster, because its 3D convolutional operators and recurrent network structure can effectively exploit both the spectral and the spatial contexts. Measured in terms of error rate, the proposed methods (R-2D-CNN and R-3D-CNN) outperform the baselines by more than 30%. Despite the superiority of the proposed models, we find that our deep learning models often require more training samples than traditional machine learning methods. Accordingly, incorporating prior domain knowledge into the proposed deep learning models is an interesting topic for future research. Alternatively, we will explore transfer learning approaches to alleviate this shortcoming of our current deep learning models.
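The error-rate comparison above is a relative reduction in error rather than a gain in accuracy points. As a worked illustration of that arithmetic, the sketch below assumes that the error rate is 100% minus the overall accuracy and uses the Kennedy Space Center figures reported in Fig. 17. Choosing GCK as the baseline is our assumption; the text does not state which method pairs the "more than 30%" figure was computed from.

```python
def relative_error_reduction(baseline_oa, proposed_oa):
    """Relative drop in error rate (in %), with error rate = 100 - OA."""
    baseline_err = 100.0 - baseline_oa
    proposed_err = 100.0 - proposed_oa
    return 100.0 * (baseline_err - proposed_err) / baseline_err


# Overall accuracies (%) on the Kennedy Space Center data set, from Fig. 17.
gck_oa, r3d_oa = 98.98, 99.85

reduction = relative_error_reduction(gck_oa, r3d_oa)
print(f"GCK error: {100 - gck_oa:.2f}%, R-3D-CNN error: {100 - r3d_oa:.2f}%")
print(f"relative error reduction: {reduction:.1f}%")
```

A gap of less than one accuracy point can therefore correspond to a large relative reduction in errors when both methods are already close to 100%.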

ACKNOWLEDGMENT

This research was supported in part by the Shenzhen Science and Technology Program under Grant Nos. JCYJ20160330163900579 and JCYJ20170413105929681. Huang's work is supported by the National Natural Science Foundation of China (NSFC) under Grant No. 61562027 and the Education Department of Jiangxi Province under Grant No. GJJ170413. Lau's work is supported by grants from the RGC of the Hong Kong SAR (Projects: CityU 11502115 and CityU 11525716), the NSFC Basic Research Program (Project: 71671155), the Shenzhen Municipal Science and Technology Innovation Fund (Project: JCYJ20160229165300897), and the CityU Shenzhen Research Institute.




Xiaofei Yang received the B.Sc. from Suihua University in 2007 and 2011, and received M.Sc. degrees from Harbin Institute of Technology in 2011 and 2013, respectively. Currently, he is a Ph.D. candidate in the Shenzhen Graduate School, Harbin Institute of Technology. His research interests are in the areas of semi-supervised learning, deep learning, remote sensing, transfer learning, and graph mining.

Yunming Ye received the Ph.D. degree in Computer Science from Shanghai Jiao Tong University. He is now a professor in the Shenzhen Graduate School, Harbin Institute of Technology. His research interests include data mining, text mining, and ensemble learning algorithms.

Xutao Li is now an Associate Professor in the Shenzhen Graduate School, Harbin Institute of Technology. He received the Ph.D. and Master's degrees in Computer Science from Harbin Institute of Technology in 2013 and 2009, respectively, and the Bachelor's degree from Lanzhou University of Technology in 2007. His research interests include data mining, machine learning, graph mining, and social network analysis, especially tensor-based learning and mining algorithms.

Raymond Y. K. Lau is an Associate Professor in the Department of Information Systems at City University of Hong Kong. He is the author of two hundred refereed international journal and conference papers. His research work has been published in renowned journals such as MIS Quarterly, INFORMS Journal on Computing, ACM Transactions on Information Systems, IEEE Transactions on Knowledge and Data Engineering, IEEE Internet Computing, Journal of MIS, and Decision Support Systems. His research interests include Big Data Analytics, Social Media Analytics, FinTech, and AI for Business. He is a senior member of both the IEEE and the ACM.

Xiaohui Huang received the B.Eng. and master's degrees from Jiangxi Normal University, Nanchang, China, in 2005 and 2008, respectively, and the Ph.D. degree from the Shenzhen Graduate School, Harbin Institute of Technology, Shenzhen, China, in 2014. Since 2015, he has been with the School of Information Engineering Department, East China Jiaotong University, Nanchang, China, where he is currently a lecturer in computer science. His current research interests include clustering, social media analysis, and deep learning.

Xiaofeng Zhang received the M.Sc. degree from Harbin Institute of Technology in 1999 and the Ph.D. degree from Hong Kong Baptist University in 2008. He has worked in the R&D center of Peking University Founder Group and the E-business Technology Institute of Hong Kong University. He is now an associate professor in the Department of Computer Science, Harbin Institute of Technology Shenzhen Graduate School. His research interests include data mining, machine learning, and graph mining.

Hierarchical Receptive-Field Selection with Attention ResNet for Hyperspectral Image Classification

Conference Paper

May 2024

Mohamed A. Elshafey

How to Learn More? Exploring Kolmogorov-Arnold Networks for Hyperspectral Image Classification

Preprint

Full-text available

Jun 2024

Convolutional Neural Networks (CNNs) and vision transformers (ViTs) have shown excellent capability in complex hyperspectral image (HSI) classification. However, these models require a significant number of training data and are computational resources. On the other hand, modern Multi-Layer Perceptrons (MLPs) have demonstrated great classification capability. These modern MLP-based models require significantly less training data compared to CNNs and ViTs, achieving the state-of-the-art classification accuracy. Recently, Kolmogorov-Arnold Networks (KANs) were proposed as viable alternatives for MLPs. Because of their internal similarity to splines and their external similarity to MLPs, KANs are able to optimize learned features with remarkable accuracy in addition to being able to learn new features. Thus, in this study, we assess the effectiveness of KANs for complex HSI data classification. Moreover, to enhance the HSI classification accuracy obtained by the KANs, we develop and propose a Hybrid architecture utilizing 1D, 2D, and 3D KANs. To demonstrate the effectiveness of the proposed KAN architecture, we conducted extensive experiments on three newly created HSI benchmark datasets: QUH-Pingan, QUH-Tangdaowan, and QUH-Qingyun. The results underscored the competitive or better capability of the developed hybrid KAN-based model across these benchmark datasets over several other CNN- and ViT-based algorithms, including 1D-CNN, 2DCNN, 3D CNN, VGG-16, ResNet-50, EfficientNet, RNN, and ViT. The code are publicly available at (https://github.com/aj1365/HSIConvKAN)

Hyperspectral Image Classification Based on Adaptive Global–Local Feature Fusion

Article

Full-text available

May 2024

Labeled hyperspectral image (HSI) information is commonly difficult to acquire, so the lack of valid labeled data becomes a major puzzle for HSI classification. Semi-supervised methods can efficiently exploit unlabeled and labeled data for classification, which is highly valuable. Graph-based semi-supervised methods only focus on HSI local or global data and cannot fully utilize spatial–spectral information; this significantly limits the performance of classification models. To solve this problem, we propose an adaptive global–local feature fusion (AGLFF) method. First, the global high-order and local graphs are adaptively fused, and their weight parameters are automatically learned in an adaptive manner to extract the consistency features. The class probability structure is then used to express the relationship between the fused feature and the categories and to calculate their corresponding pseudo-labels. Finally, the fused features are imported into the broad learning system as weights, and the broad expansion of the fused features is performed with the weighted broad network to calculate the model output weights. Experimental results from three datasets demonstrate that AGLFF outperforms other methods.

DMAF-NET: Deep Multi-Scale Attention Fusion Network for Hyperspectral Image Classification with Limited Samples

Article

Full-text available

May 2024
SENSORS-BASEL

In recent years, deep learning methods have achieved remarkable success in hyperspectral image classification (HSIC), and the utilization of convolutional neural networks (CNNs) has proven to be highly effective. However, there are still several critical issues that need to be addressed in the HSIC task, such as the lack of labeled training samples, which constrains the classification accuracy and generalization ability of CNNs. To address this problem, a deep multi-scale attention fusion network (DMAF-NET) is proposed in this paper. This network is based on multi-scale features and fully exploits the deep features of samples from multiple levels and different perspectives with an aim to enhance HSIC results using limited samples. The innovation of this article is mainly reflected in three aspects: Firstly, a novel baseline network for multi-scale feature extraction is designed with a pyramid structure and densely connected 3D octave convolutional network enabling the extraction of deep-level information from features at different granularities. Secondly, a multi-scale spatial–spectral attention module and a pyramidal multi-scale channel attention module are designed, respectively. This allows modeling of the comprehensive dependencies of coordinates and directions, local and global, in four dimensions. Finally, a multi-attention fusion module is designed to effectively combine feature mappings extracted from multiple branches. Extensive experiments on four popular datasets demonstrate that the proposed method can achieve high classification accuracy even with fewer labeled samples.

SC-HybridSN: A deep learning network method for rapid discriminant analysis of industrial paraffin contamination levels in rice

Article

Jun 2024
J FOOD COMPOS ANAL

Transformer-enhanced two-stream complementary convolutional neural network for hyperspectral image classification

Article

Jun 2024
J FRANKLIN I

Exploring Multi-Timestep Multi-Stage Diffusion Features for Hyperspectral Image Classification

Article

Jan 2024

The effectiveness of spectral-spatial feature learning is crucial for the hyperspectral image (HSI) classification task. Diffusion models, as a new class of groundbreaking generative models, have the ability to learn both contextual semantics and textual details from the distinct timestep dimension, enabling the modeling of complex spectral-spatial relations in HSIs. However, existing diffusion-based HSI classification methods only utilize manually selected single-timestep single-stage features, limiting the full exploration and exploitation of rich contextual semantics and textual information hidden in the diffusion model. To address this issue, we propose a novel diffusion-based feature learning framework that explores Multi-Timestep Multi-Stage Diffusion features for HSI classification for the first time, called MTMSD. Specifically, the diffusion model is first pretrained with unlabeled HSI patches to mine the connotation of unlabeled data, and then is used to extract the multi-timestep multi-stage diffusion features. To effectively and efficiently leverage multi-timestep multi-stage features, two strategies are further developed. One strategy is class & timestep-oriented multi-stage feature purification module with the inter-class and inter-timestep prior for reducing the redundancy of multi-stage features and alleviating memory constraints. The other one is selective timestep feature fusion module with the guidance of global features to adaptively select different timestep features for integrating texture and semantics. Both strategies facilitate the generality and adaptability of the MTMSD framework for diverse patterns of different HSI data. Extensive experiments are conducted on four public HSI datasets, and the results demonstrate that our method outperforms state-of-the-art methods for HSI classification, especially on the challenging Houston 2018 dataset. The codes are available at https://github.com/zjyaccount/MTMSD.

A multi-scale multi-channel CNN introducing a channel-spatial attention mechanism hyperspectral remote sensing image classification method

Article

May 2024

Dynamic Evolution Graph Attention Network for Semi-Supervised Hyperspectral Image Classification

Article

Jan 2024

Graph Attention Network (GAT) has a wide range of applications in HSI classification. The GAT-based semi-supervised learning approach enables the integration of valuable information from both labeled and unlabeled samples, effectively reducing the model’s reliance on labeled data. However, the node-wise training approach of GAT often overlooks the inherent global feature of graph data and the long-range dependencies among nodes, thereby limiting the model’s generalization ability on unlabeled data. Therefore, we propose a semi-supervised HSI classification model based on the dynamic evolution graph attention network (DEGAT). The main contributions: 1)We design a dynamic graph evolution mechanism (DGEM) that enables the model to capture the interactive information between local graph attention coefficients and the global graph structure, thus obtaining more discriminative graph representations. 2)DEGAT utilizes the multi-scale mechanism and message-passing mechanism to capture the information of nodes with long-range dependencies, extracting richer spatial-spectral features. State-of-the-art results are achieved with very few labeled training samples on two typical benchmark HSI datasets, where the overall accuracy reaches 95.12% and 98.76% respectively.

Models for Exploring the Benefits of using Discrete Wavelet Transformation in HSI

Conference Paper

Mar 2024

Locality Adaptive Discriminant Analysis for Spectral-Spatial Classification of Hyperspectral Images

Article

Sep 2017

Linear discriminant analysis (LDA) is a popular technique for supervised dimensionality reduction, but with less concern about a local data structure. This makes LDA inapplicable to many real-world situations, such as hyperspectral image (HSI) classification. In this letter, we propose a novel dimensionality reduction algorithm, locality adaptive discriminant analysis (LADA) for HSI classification. The proposed algorithm aims to learn a representative subspace of data, and focuses on the data points with close relationship in spectral and spatial domains. An intuitive motivation is that data points of the same class have similar spectral feature and the data points among spatial neighborhood are usually associated with the same class. Compared with traditional LDA and its variants, LADA is able to adaptively exploit the local manifold structure of data. Experiments carried out on several real hyperspectral data sets demonstrate the effectiveness of the proposed method.

Learning Discriminative Subspace Models for Weakly Supervised Face Detection

Article

Sep 2017

Learning object detection models from weakly labeled data is an important topic in computer vision, and this task can be naturally cast as a Multiple Instance Learning (MIL) problem. Existing MIL approaches for object detection suffer from high false positive rates due to the lack of advanced instance selection techniques. In this study, a subspace based generative model is proposed to choose positive instances by minimizing rank of the coefficient matrix associated with the subspace models. An incoherence term between subspace models and some “hard” negative instances in also introduced, which is realized by an $\epsilon$ -insensitive loss function. To further improve the discriminative ability, a multiple subspace models approach is proposed by employing certain ensemble learning strategies. Rigorous experiments are performed on several data sets, and the promising experimental results have demonstrated that the proposed approach is superior to the state-of-the-art weakly supervised learning algorithms in terms of precision, recall and F-score.

Deep Sparse Rectifier Neural Networks

Article

Jan 2011
J MACH LEARN RES

While logistic sigmoid neurons are more biologically plausable that hyperbolic tangent neurons, the latter work better for training multi-layer neural networks. This paper shows that rectifying neurons are an even better model of biological neurons and yield equal or better performance than hyperbolic tangent networks in spite of the hard non-linearity and non-differentiability at zero, creating sparse representations with true zeros, which seem remarkably suitable for naturally sparse data. Even though they can take advantage of semi-supervised setups with extra-unlabelled data, deep rectifier networks can reach their best performance without requiring any unsupervised pre-training on purely supervised tasks with large labelled data sets. Hence, these results can be seen as a new milestone in the attempts at understanding the difficulty in training deep but purely supervised nueral networks, and closing the performance gap between neural networks learnt with and without unsupervised pre-training

Speech Recognition With Deep Recurrent Neural Networks

Conference Paper

Jan 2013

Recurrent neural networks (RNNs) are a powerful model for sequential data. End-to-end training methods such as Connectionist Temporal Classification make it possible to train RNNs for sequence labelling problems where the input-output alignment is unknown. The combination of these methods with the Long Short-term Memory RNN architecture has proved particularly fruitful, delivering state-of-the-art results in cursive handwriting recognition. However RNN performance in speech recognition has so far been disappointing, with better results returned by deep feedforward networks. This paper investigates deep recurrent neural networks, which combine the multiple levels of representation that have proved so effective in deep networks with the flexible use of long range context that empowers RNNs. When trained end-to-end with suitable regularisation, we find that deep Long Short-term Memory RNNs achieve a test set error of 17.7% on the TIMIT phoneme recognition benchmark, which to our knowledge is the best recorded score.

Imagenet classification with deep convolutional neural networks

Conference Paper

Jan 2012

We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0% which is considerably better than the previous state-of-the-art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overfitting in the fully-connected layers we employed a recently-developed regularization method called dropout that proved to be very effective. We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry.

A Joint Convolutional Neural Networks and Context Transfer for Street Scenes Labeling

Article

Aug 2017

Street scene understanding is an essential task for autonomous driving. One important step in this direction is scene labeling, which annotates each pixel in the images with a correct class label. Although many approaches have been developed, some weak points remain. First, many methods are based on hand-crafted features whose image representation ability is limited. Second, they cannot label foreground objects accurately due to data set bias. Third, in the refinement stage, traditional Markov random field (MRF) inference is prone to oversmoothing. To address these problems, this paper proposes a joint method of priori convolutional neural networks at superpixel level (called "priori s-CNNs") and soft restricted context transfer. Our contributions are threefold: 1) a priori s-CNNs model that learns priori location information at superpixel level is proposed to describe various objects discriminatively; 2) a hierarchical data augmentation method is presented to alleviate data set bias in the priori s-CNNs training stage, which improves foreground object labeling significantly; and 3) a soft restricted MRF energy function is defined to improve the priori s-CNNs model's labeling performance and reduce oversmoothing at the same time. The proposed approach is verified on the CamVid data set (11 classes) and the SIFT Flow Street data set (16 classes) and achieves competitive performance.

Deep Recurrent Neural Networks for Hyperspectral Image Classification

Article

Apr 2017

In recent years, vector-based machine learning algorithms, such as random forests, support vector machines, and 1-D convolutional neural networks, have shown promising results in hyperspectral image classification. Such methodologies, nevertheless, can lead to information loss in representing hyperspectral pixels, which intrinsically have a sequence-based data structure. A recurrent neural network (RNN), an important branch of the deep learning family, is mainly designed to handle sequential data. Can sequence-based RNN be an effective method of hyperspectral image classification? In this paper, we propose a novel RNN model that can effectively analyze hyperspectral pixels as sequential data and then determine information categories via network reasoning. As far as we know, this is the first time that an RNN framework has been proposed for hyperspectral image classification. Specifically, our RNN makes use of a newly proposed activation function, parametric rectified tanh (PRetanh), for hyperspectral sequential data analysis instead of the popular tanh or rectified linear unit. The proposed activation function makes it possible to use fairly high learning rates without the risk of divergence during the training procedure. Moreover, a modified gated recurrent unit, which uses PRetanh for hidden representation, is adopted to construct the recurrent layer in our network to efficiently process hyperspectral data and reduce the total number of parameters. Experimental results on three airborne hyperspectral images suggest competitive performance of the proposed model. In addition, the proposed network architecture opens a new window for future research, showcasing the huge potential of deep recurrent networks for hyperspectral data analysis.

Deep Residual Learning for Image Recognition

Conference Paper

Jun 2016

BASS Net: Band-Adaptive Spectral-Spatial Feature Learning Neural Network for Hyperspectral Image Classification

Article

Dec 2016

Deep learning based landcover classification algorithms have recently been proposed in the literature. In hyperspectral images (HSI) they face the challenges of large dimensionality, spatial variability of spectral signatures and scarcity of labeled data. In this article we propose an end-to-end deep learning architecture that extracts band-specific spectral-spatial features and performs landcover classification. The architecture has fewer independent connection weights and thus requires fewer training samples. The method is found to outperform the highest reported accuracies on popular hyperspectral image data sets.

Integration of 3-Dimensional Discrete Wavelet Transform and Markov Random Field for Hyperspectral Image Classification

Article

Nov 2016
NEUROCOMPUTING

Hyperspectral image (HSI) classification is one of the fundamental tasks in HSI analysis. Recently, many approaches have been extensively studied to improve the classification performance, among which integrating the spatial information underlying HSIs is a simple yet effective way. However, most of the current approaches have not fully exploited the spatial information prior. They usually consider this prior either when extracting spatial features before classification or when post-processing the label map after classification, but do not employ it jointly in both steps, which leaves room for further enhancing their performance. In this paper, we propose a novel spectral-spatial HSI classification method, which fully utilizes the spatial information in both steps. Firstly, the spatial feature is extracted by applying the 3-dimensional discrete wavelet transform (3D-DWT). Secondly, the local spatial correlation of neighboring pixels is modeled using a Markov random field (MRF) based on the probabilistic classification map obtained by applying a probabilistic support vector machine (SVM) to the extracted 3D-DWT feature in the first step, and then a maximum a posteriori (MAP) classification problem can be formulated from a Bayesian perspective. Finally, an α-Expansion min-cut-based optimization algorithm is adopted to solve this MAP problem efficiently. Experimental results on two benchmark HSIs show that the proposed method achieves a significant performance gain beyond state-of-the-art methods.
