Multilevel Building Detection Framework in Remote Sensing Images Based on Convolutional Neural Networks

Yibo Liu 1,2, Zhenxin Zhang 1,2,3, Ruofei Zhong 1,2, Dong Chen 4, Yinghai Ke 1,2, Jiju Peethambaran 5, Chuqun Chen 6, Lan Sun 1,2

1 Beijing Advanced Innovation Center for Imaging Technology, Capital Normal University, Beijing 100048, China
2 College of Resources Environment and Tourism, Capital Normal University, Beijing 100048, China
3 Chinese Academy of Surveying and Mapping, Beijing 100830, China
4 College of Civil Engineering, Nanjing Forestry University, Nanjing 210037, China
5 Department of Computer Science, University of Victoria, Victoria, BC V8P 5C2, Canada
6 Guangdong Key Laboratory of Ocean Remote Sensing, South China Sea Institute of Oceanology, Chinese Academy of Sciences, Guangzhou 510301, China

This work was supported by the Open Fund of Twenty First Century Aerospace Technology Co., Ltd. under Grant 21AT-2016-04, the National Natural Science Foundation of China under Grants 41701533, 41371434 and 41301521, the Open Fund of the Guangdong Key Laboratory of Ocean Remote Sensing (South China Sea Institute of Oceanology, Chinese Academy of Sciences) under Grant 2017B030301005-LORS1804, and the Open Fund of the Key Laboratory for National Geography State Monitoring (National Administration of Surveying, Mapping and Geoinformation) under Grant 2017NGCM06. (Corresponding authors: Zhenxin Zhang and Ruofei Zhong.)
Abstract—In this paper, we propose a hierarchical building detection framework based on a deep learning model, which focuses on accurately detecting buildings in remote sensing images. To this end, we first construct a generation model for multi-level training samples using the Gaussian pyramid technique, so as to learn the features of building objects at different scales and spatial resolutions. Then, building region proposal networks are put forward to quickly extract candidate building regions, thereby increasing the efficiency of building object detection. Based on the candidate building regions, we establish the multi-level building detection model using convolutional neural networks (CNNs), from which the generic image features of each building region proposal are calculated. Finally, the obtained features are provided as inputs for training the CNN model, and the learned model is applied to test images to detect unknown buildings. Various experiments using Datasets I-III (Section V-A) show that the proposed framework increases the mean average precision (mAP) of building detection by 3.63%, 3.85% and 3.77%, respectively, compared with the best of the state-of-the-art methods (Method IV). Besides, the proposed method is robust to buildings with different spatial textures and types.
Index Terms—Building detection, multi-level framework, CNNs, candidate building regions, remote sensing images.
I. INTRODUCTION
In the past few decades, due to the developments in
advanced aerospace remote sensing techniques and sensor
manufacture techniques, the quality of acquired remote
sensing images has improved tremendously. These images contain many complex and significant spatial objects. Typically, buildings constitute the most important landscape in remote sensing images and have been intensively used in various practical applications, including digital urban model construction [1], urban planning [2], and environment control and mapping [3], among others. In particular, the position, geometric shape and orientation of buildings often form the basis for high-level building-oriented applications, such as building contouring, building model reconstruction and cartographic generalization. Hence, efficient and accurate detection and recognition of real buildings in large-scale remote sensing images is a relevant, yet challenging task. In urban scenarios, this challenge is further intensified by the diverse geometric shapes of buildings and the inhomogeneity of their spectral information.
In this paper, we focus on the general problem of how to automatically and accurately extract buildings from large-scale remote sensing imagery. Many previously published works in
this domain show impressive success on building or other
specific object detection. More specifically, the machine
learning techniques, such as AdaBoost [4-7], support vector
machine (SVM) [8-12], sparse coding-based classifiers
[13-17], etc., are generally adopted in previous works. Although these methods can reasonably detect objects, the saliency and hierarchy of their object feature representations leave considerable room for improvement.
With the development of deep learning theory, and in particular its ability to learn more powerful feature representations, deep learning offers the possibility of efficient building detection in remote sensing images. In this
paper, we aim to design an efficient and hierarchical building
detection framework based on deep learning model by using
remote sensing images. We first construct the multi-level
training samples using the Gaussian pyramid principle, to
learn the features of building objects at different scales and
spatial resolutions. Then, the building region proposal
networks are designed to determine the candidate building
object locations. This process increases the efficiency of the
building object detection. Next, we establish the multi-level
building detection model using CNNs, and the features from
hierarchical images corresponding to each building region
proposal are extracted. Finally, the extracted features are used
to learn the CNNs model, which is utilized to detect the
unknown buildings in the test images. Apart from introducing
this generic methodology, our paper makes the following
specific novel contributions:
i) We propose a multi-level learning framework for building
detection from remote sensing images based on CNNs, which
can extract the features of buildings at different scales and
spatial resolutions to train the deep learning model and detect
buildings.
ii) We establish the building region proposal networks
(BRPN) to generate candidate building regions, thereby
improving the efficiency of building region searching and
enhancing the accuracy of building positions.
The rest of the paper is organized as follows. Section II reviews related work on building and generic object detection. Section III describes the proposed learning process for building detection. The test step of building detection using the learned model is described in Section IV. Section V analyzes the performance of the proposed framework. Finally, Section VI concludes the paper along with a few suggestions for future research topics.
II. RELATED WORK
Building and generic object detection in remote sensing images has been an active area of research for the last few years and is still very much open. Two important considerations underlying remote sensing image based object detection are the construction of a hierarchical detection framework and the object (building or other object) feature representation; both are briefly reviewed in this section.
A. The construction of hierarchical detection framework
Hierarchical structure can fully represent the spatial
hierarchy and diversity [17]. A series of publications along
this line demonstrated the effectiveness of the hierarchical
structure in improving the effect of object detection and/or
recognition. For example, Farabet et al. [18] constructed the
pyramid of training data by using the Laplacian method to
enhance the ability of feature representation. Yu et al. [19]
proposed a hierarchical framework, called ScSPM (sparse
coding based spatial pyramid matching) to extract hierarchical
features. Then, considering the advantage of spatial
discriminative feature, He et al. [20] equipped the networks
with the strategy of spatial pyramid pooling (SPP), which can
generate a fixed-length representation regardless of image
size/scale. Taking the effect of object size in real space into
account, Zhang et al. [21, 22] built a hierarchical structure
based on exponential curve, which works well when large-
and small-sized objects coexist. Regarding texture-oriented hierarchical construction, Gaetano et al. [23] proposed hierarchical texture-based segmentation of multi-resolution remote sensing images, and similarly Trias-Sanz et al. [24] used color and texture to achieve hierarchical segmentation of high-resolution remote sensing images. Kurtz et al. [25, 26] designed a hierarchical top-down methodology that can extract extremely complex patterns from multi-resolution remote sensing images; for example, landslides were hierarchically extracted from multi-resolution remotely sensed optical images. Lin et al. [27] exploited the multi-scale pyramidal hierarchy of deep features to construct hierarchical features with marginal extra cost.
B. Object detection and location
Object detection and location in remote sensing images
have been widely researched in recent years. Sirmacek and
Unsalan [28] designed a building detection method in urban
areas using scale-invariant feature transform (SIFT) keypoints
and graph theory. Xu et al. [11] put forward an object
classification of aerial images through BoW (bag of words)
method. Huang et al. [29] designed an effective and automatic
building index to detect buildings from the high-resolution
imagery. After that, multi-features [30], multi-angular features [31] and a post-processing framework [32] for remote sensing images were also explored for urban and building classification. Hu et al. [33] developed an
unsupervised feature learning method via spectral clustering of
patches for remotely sensed scene classification. Xia et al. [34]
proposed a method of accurate annotation of remote sensing
images by active spectral clustering with little expert
knowledge.
Recent advances in deep learning provide unprecedented
opportunities to address the problems such as object detection
and location in a different way. For example, in [35], Cheng et
al. proposed a rotation-invariant CNNs for object detection in
optical remote sensing images. In another method, Long et al.
[36] designed an object detection framework in remote
sensing images based on CNNs. Girshick et al. [37] constructed region-based convolutional networks for accurate object detection and segmentation. Hu et al. [38]
transferred deep CNNs for the scene classification of
high-resolution remote sensing imagery. Vakalopoulou et al.
[39] proposed an automated building extraction framework
using deep CNNs. Han et al. [8] designed an object detection
method in optical remote sensing images based on a weakly
supervised learning and high-level feature learning using deep
learning theory. The readers can refer to review paper in [40]
to learn the current progress on object detection and location
from optical remote sensing images. Notably, the above methods could perform better if the hierarchical information of the spatial image were considered. Although the recognition of buildings and other prominent objects from imagery has been researched for many years, accurate recognition remains unsolved due to the complexity of spatial structures and the diversity of surface textures, e.g., building occlusions and inhomogeneous building sizes in experimental scenarios.
III. THE LEARNING FRAMEWORK
The overview of the proposed method is shown in Fig. 1.
Firstly, we construct the multi-level training samples by using
the Gaussian pyramid principle, to learn the features of
building objects at different scales and spatial resolutions.
Then, the building region proposal networks are designed to
determine the candidate building object locations. Based on
the building region proposals, the multi-level building
detection model is established using the CNNs, and the image
features from hierarchical images corresponding to each
building region proposal are extracted. Finally, the extracted
features are used to learn the CNNs model in the training step,
and the learned model is used to detect the unknown buildings
in the test images. It is to be noted that the proposed framework not only identifies buildings in remote sensing images, but also provides an accurate location for each identified building.
Fig. 1. The overview of our method: remote sensing images are converted into multi-level training sample sets via the Gaussian pyramid principle; the building region proposal networks generate building region proposals; convolutional neural networks extract hierarchical image features, which are used to train the building detection model and to produce building detection results on the test images.
In this section, we mainly discuss the learning framework of the hierarchical building detection in remote sensing images based on the deep learning model. More precisely, the hierarchical training data sets are first constructed. Next, the candidate areas of buildings at different levels are determined using the building region proposal networks. Finally, the features of the building region proposals are computed using the deep CNN model, to train the hierarchical building detection framework.
A. The construction of hierarchical training data
Due to the various spatial shapes, sizes and textures of buildings, and the occlusions between objects in remote sensing images, it is difficult to construct efficient building object features. To fully learn the characteristics of buildings,
and enhance the generalization ability of the proposed model,
we design a method for automatically generating multi-level
training data sets. Inspired by the Gaussian pyramid method in
[41], we construct a multi-level structure of training data by
resampling remote sensing images in each level. Firstly, the
original remote sensing images are automatically and
uniformly divided into patches of 300×300 pixels and
450×500 pixels respectively, to cover more kinds of complete
building samples. These patches are then used as basic images and down-sampled with a Gaussian kernel convolution to gradually generate the multi-level training data sets. We take the generation of the resampled image at the (l+1)-th level (l = 0, 1, 2, …, N-2, where N denotes the number of levels in the hierarchical training dataset; e.g., N = 6 in Fig. 2) as an example, to illustrate the Gaussian kernel convolution operations. After the image at the l-th level is smoothed by a low-pass filter, the image is sampled according
to Eq. (1):

G_{l+1}(i, j) = \sum_{m=-2}^{2} \sum_{n=-2}^{2} d(m, n) G_l(2i + m, 2j + n),   (1)

where G_l(·) represents the image in the l-th level of the training dataset, and G_{l+1}(·) represents the image resampled from G_l(·). The kernel d(m, n) = g(m)·g(n) is a 5 × 5 pixels window function with low-pass filtering characteristics, used as a Gaussian convolutional kernel, where g(·) is the Gaussian density distribution function, so that:

d(m, n) = \frac{1}{2\pi\sigma^2} e^{-(m^2 + n^2)/(2\sigma^2)}.   (2)

According to the above principles, a series of images G_0, G_1, …, G_{N-1} spanning the levels can be naturally created. They constitute a set of multi-level training data.
Fig. 2. The multi-level image training data sets.
B. Generation of candidate building areas
We design building region proposal networks to generate
candidate building areas. To this end, we first describe the structure of the building region proposal networks, and then give our strategy for generating the building region proposals.
B1) Building region proposal networks
Many researchers have used traditional methods to
determine the candidate areas, such as sliding window [42].
However, sliding-window based searching needs to traverse the entire image, which leads to high time complexity and, consequently, affects the efficiency of building detection. Moreover, this method requires manually setting the size and aspect ratio of the sliding window, so it is difficult to effectively extract building areas in remote sensing images. We instead use a CNN model to extract candidate building areas, which can efficiently generate a small number of high-quality candidate building areas.
The proposed generation process of building region proposals is shown in Fig. 3. Firstly, the convolutional features of the hierarchical training data set are extracted by a set of shared convolutional layers [43] from models such as AlexNet [44], ZF [45], VGG-16 [46], GoogLeNet [47] and ResNet [48]. The last layer of the CNN model maps the input remote sensing image to a 512-dimensional convolutional feature (shared feature map). Then, the extracted shared feature maps are fed into the region proposal networks, which consist of two branches, as shown in Fig. 3: one calculates the regression values of the building region box positions to obtain the location parameters of the predicted building region proposals; the other predicts the probability that a building region box belongs to the building or non-building class, by calculating the intersection over union (IoU) ratio between the initial candidate building region and the labeled sample region.
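To make the branch dimensions in Fig. 3 concrete, the toy sketch below (our illustration with random weights and an assumed feature-map size, not the trained network) shows how two 1 × 1 convolutions over a 512-channel shared feature map produce the 18 class scores and 36 regression values per spatial position:

```python
import numpy as np

rng = np.random.default_rng(0)
H, W = 38, 50                      # assumed spatial size of the shared feature map
shared_map = rng.normal(size=(512, H, W))

# A 1x1 convolution is a channel-wise linear map, so a matrix product suffices.
w_cls = rng.normal(size=(2 * 9, 512)) * 0.01   # 18 outputs: 2 classes x 9 anchors
w_reg = rng.normal(size=(4 * 9, 512)) * 0.01   # 36 outputs: 4 offsets x 9 anchors

x = shared_map.reshape(512, -1)
cls_scores = (w_cls @ x).reshape(18, H, W)     # building/non-building logits per anchor
box_deltas = (w_reg @ x).reshape(36, H, W)     # (tx, ty, tw, th) per anchor, cf. Eq. (3)

# Softmax over the two classes gives each anchor's building probability
# (the class-major channel ordering here is an arbitrary choice of this toy).
logits = cls_scores.reshape(2, 9, H, W)
prob_building = np.exp(logits[1]) / np.exp(logits).sum(axis=0)
print(prob_building.shape)                     # (9, H, W)
```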
Fig. 3. The process of generating candidate building areas. The shared convolutional layers of a CNN model produce a 512-channel shared feature map from the hierarchical remote sensing images; a 1 × 1 convolutional branch predicts the bounding boxes (36 outputs, trained with the smooth L1 loss) and a second branch predicts the class scores (18 outputs, reshaped into class probabilities and trained with the softmax building classification loss), yielding the building region proposals.
Considering the variations in the spatial sizes, structures and shapes of buildings, we set the multi-scale region boxes of building detection as nine rectangular sliding windows with three pixel sizes (128, 256 and 512 pixels) and three aspect ratios (1:1, 1:2 and 2:1) per size (Fig. 4). In the process of generating building region proposals, we use these nine rectangular sliding windows as the initial building detection boxes. The nine sizes of rectangular sliding windows produce many overlapping areas in the images, which covers all candidate building regions and allows a building detection box to enclose a complete building. Then the location and size of each building detection box are modified according to the bounding-box regression values calculated by a regression mapping. The mapping is defined by the position translation and size scaling of the building detection box, and the mapping parameters are defined as follows [49]:
t_x = (x - x_a) / w_a,          t_y = (y - y_a) / h_a,
t_w = \log(w / w_a),            t_h = \log(h / h_a),
t_x^* = (x^* - x_a) / w_a,      t_y^* = (y^* - y_a) / h_a,
t_w^* = \log(w^* / w_a),        t_h^* = \log(h^* / h_a),   (3)
where x and y denote the coordinates of the bounding-box centroid of the detected building, and w and h are respectively the width and the height of the building detection box. The variables x, x_a and x^* (likewise for y) respectively represent the center coordinate of the predicted building detection box, the initial building detection box and the labeled detection box. The parameters w, w_a and w^* (likewise for h) are the widths of the predicted, initial and labeled detection boxes. The vector t_i = [t_x, t_y, t_w, t_h] represents the four parameterized coordinates of the predicted building detection box: (t_x, t_y) are the translation values between the predicted and the initial building detection box, and (t_w, t_h) are the scaling parameters between them. Analogously, (t_x^*, t_y^*) are the translation parameters and (t_w^*, t_h^*) the scaling parameters between the labeled detection box and the initial building detection box.
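The sketch below (ours; the numeric boxes are invented for illustration) applies Eq. (3) to map a labeled box to its regression targets relative to an initial box, and then inverts the mapping:

```python
import numpy as np

def encode(box, anchor):
    """Eq. (3): parameterize a box (x, y, w, h) relative to an initial box."""
    x, y, w, h = box
    xa, ya, wa, ha = anchor
    return np.array([(x - xa) / wa, (y - ya) / ha,
                     np.log(w / wa), np.log(h / ha)])

def decode(t, anchor):
    """Invert Eq. (3): recover the predicted box from the offsets t."""
    tx, ty, tw, th = t
    xa, ya, wa, ha = anchor
    return np.array([tx * wa + xa, ty * ha + ya,
                     wa * np.exp(tw), ha * np.exp(th)])

anchor = (100.0, 100.0, 128.0, 256.0)   # one of the nine initial boxes (1:2 ratio)
gt_box = (112.0, 90.0, 150.0, 240.0)    # an invented labeled box (x*, y*, w*, h*)
t_star = encode(gt_box, anchor)         # regression targets (tx*, ty*, tw*, th*)
assert np.allclose(decode(t_star, anchor), gt_box)
```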
During the training of the building region proposal networks, the hierarchical training dataset is fed to the model to generate multi-scale building region proposals, which can be used to detect building objects in remote sensing images. After the translation and scaling parameters are obtained, the final building region proposals are determined by their IoU values; see, e.g., the output building region proposals denoted by the red boxes in the right-most column of Fig. 5.
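For reference, the IoU itself can be computed as follows (a minimal sketch of the standard definition; the 0.7 threshold matches the note under Fig. 5):

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# A proposal is kept as a building region proposal when its IoU with a
# labeled building box exceeds 0.7 (cf. the red boxes in Fig. 5).
print(iou((0, 0, 100, 100), (5, 5, 105, 105)) > 0.7)  # True (IoU ~ 0.82)
```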
Fig. 4. Different sizes of the initial sliding windows. The red box in the upper left represents a sliding window with 128² pixels, the square blue box represents a sliding window with 256² pixels, and the square green box represents a sliding window with 512² pixels. The region boxes of each color have three aspect ratios: 1:1, 1:2 and 2:1.
Fig. 5. The schematic diagram of generating candidate building areas. Note that the IoU values of the red rectangles in the right-most subfigures are higher than 0.7.
B2) Training of building region proposal networks
The training process of the building region proposal networks is end-to-end. The loss function is defined as follows:

L({p_i}, {t_i}) = \sum_i L_{cls}(p_i, p_i^*) + \sum_i p_i^* L_{loc}(t_i, t_i^*),   (4)
where the subscript i is the index of an initial building detection box, and p_i is the predicted probability that the i-th building detection box belongs to the building class. The label p_i^* represents the ground truth, whose value is 1 if the initial detection box belongs to the building class and 0 otherwise. The vector t_i = (t_x, t_y, t_w, t_h) represents the coordinates of the predicted bounding box, and t_i^* = (t_x^*, t_y^*, t_w^*, t_h^*) the coordinates of the ground-truth
box representing the building region. The function L_{cls}(p_i, p_i^*) is the classification loss term, which is calculated as:

L_{cls}(p_i, p_i^*) = -\log[ p_i p_i^* + (1 - p_i)(1 - p_i^*) ],   (5)

and represents the price paid for an inaccurate building classification.
The regression loss (L_{loc}) is calculated by a classical loss function (Smooth_{L1}) [42], which is defined as follows:

L_{loc}(t_i, t_i^*) = Smooth_{L1}(t_i - t_i^*),   (6)

and the Smooth_{L1} function is a nonlinear regression function defined below:

Smooth_{L1}(a) = 0.5 a^2       if |a| < 1,
                 |a| - 0.5     otherwise.   (7)
In Eq. (7), the argument a represents the regression residual (t_i - t_i^*). In the optimization process, the loss function gradually approaches its minimum value and the parameters of the training network are obtained.
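A compact numerical sketch of Eqs. (4)-(7) (our illustration with made-up predictions, not the training code) is given below:

```python
import numpy as np

def smooth_l1(a: np.ndarray) -> np.ndarray:
    """Eq. (7): 0.5*a^2 if |a| < 1, |a| - 0.5 otherwise."""
    return np.where(np.abs(a) < 1, 0.5 * a ** 2, np.abs(a) - 0.5)

def brpn_loss(p, p_star, t, t_star):
    """Eqs. (4)-(6): summed classification + regression loss over all initial
    detection boxes (no normalization factors, as stated in the text below).
    p: (N,) predicted building probabilities; p_star: (N,) 0/1 labels;
    t, t_star: (N, 4) predicted and ground-truth box parameterizations."""
    eps = 1e-12                                                  # guard against log(0)
    l_cls = -np.log(p * p_star + (1 - p) * (1 - p_star) + eps)   # Eq. (5)
    l_loc = smooth_l1(t - t_star).sum(axis=1)                    # Eq. (6)
    return l_cls.sum() + (p_star * l_loc).sum()                  # Eq. (4)

p = np.array([0.9, 0.2]); p_star = np.array([1.0, 0.0])          # two toy boxes
t = np.random.randn(2, 4); t_star = np.random.randn(2, 4)
print(brpn_loss(p, p_star, t, t_star))
```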
Similar to the Faster-RCNN (faster region convolutional
neural networks) [43] method, our method employs the
generation of region proposals to perform the task of object
detection, because the generation of candidate regions can
effectively improve the accuracy and efficiency of the object
detection. Unlike the RPN (region proposal networks) embedded in Faster-RCNN, the BRPN constructs the network by combining the spatial hierarchies of the multi-level image training data sets with the deep learning model, whereas the RPN does not consider hierarchical spatial information and only exploits a single image scale. On the other hand, Faster-RCNN uses a multi-task loss function to train multi-class object detection, which is not suitable for single-object detection tasks (e.g., buildings) in remote sensing images; we therefore redesigned the loss function by removing some normalization factors, yielding Eq. (4). In addition, our BRPN model uses features extracted at different scales, which is another difference from Faster-RCNN.
C. Feature extraction
One of the most remarkable characteristics of deep learning
is to automatically learn the multi-level discriminative feature
representation (feature maps) by using the multi-level
convolutional layers. These feature maps can be used to
distinguish building from non-building regions in our scenario. Deep learning models can extract robust deep features from the image at different levels of abstraction. For example, the shallow convolutional layers capture the edge contours and color-related information of building objects, while the deeper convolutional layers capture their texture and shape structures. These features are well suited to describing buildings with different structures and textures in remote sensing images, thereby contributing to the accuracy of building detection.
In the process of feature extraction (Fig. 6), the shared feature map extracted by the CNN models is used to generate building region proposals in the BRPN, and is also used to extract further features in the detection network. We place a ROI pooling layer [42] before the fully connected layer in the feature extraction network. The ROI pooling layer takes the candidate ROI list generated by the building region proposal networks and converts the variable-size feature maps extracted by the CNN layers into fixed-size feature vectors. Thus, a feature vector of 7 × 7 × 512 dimensions is extracted from each candidate ROI region and input into the fully connected (FC) layer.
D. Training process
In the training step, we adopt the back-propagation optimization method and the SGD (stochastic gradient descent) learning strategy. We generate the hierarchical training data with the Gaussian pyramid principle, and the generated hierarchical training data with the corresponding labels are used to train the model. The whole training process is multi-step: first, we employ a model pre-trained on ImageNet [44] to initialize the BRPN network; then, the multilevel training remote sensing images are used to train the BRPN model, and the learned BRPN model can generate the building region proposals. After that, we use the same pre-trained model to initialize the detection network, and train the detection network based on the building region proposals obtained from the BRPN model. When the classification model and the position parameters of the building detection boxes are optimized by minimizing Eq. (4), the parameters of the shared convolutional layers are obtained and used in the building detection.
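The outline below summarizes this schedule in code form (a structural sketch only; the Net class and its methods are stand-ins, since the paper trains real networks in Caffe):

```python
from dataclasses import dataclass, field

@dataclass
class Net:
    """Stand-in for a CNN; only the staged schedule is illustrated here."""
    weights: dict = field(default_factory=dict)
    def train_on(self, data, loss_name):
        print(f"training on {data} with {loss_name}")
    def generate_proposals(self, data):
        return f"proposals({data})"

def staged_training(pretrained: dict) -> Net:
    brpn = Net(dict(pretrained))              # step 1: init BRPN from ImageNet model
    brpn.train_on("multi-level images", "Eq. (4)")   # step 2: learn the BRPN
    proposals = brpn.generate_proposals("multi-level images")
    detector = Net(dict(pretrained))          # step 3: init detection net likewise
    detector.train_on(proposals, "Eq. (4)")   # step 4: train on BRPN proposals;
    return detector                           # shared conv layers are reused at test time

staged_training({"conv1": "pretrained weights"})
```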
Fig. 6. The process of feature extraction. The training images consist of hierarchical remote sensing images. The shared convolutional layers extract the features of the training images to obtain shared feature maps. The feature extraction layers extract deeper features from the shared feature maps. The ROI pooling layer converts the extracted features into a list of fixed-size feature vectors and outputs the extracted features.
IV. THE BUILDING DETECTION
As shown in Fig. 7, the buildings in remote sensing images are detected using the parameters of the learned deep learning model. Firstly, the test images are converted into candidate building region proposals by the BRPN; these candidate building region proposals correspond to the shared feature maps generated by the CNN models. Then, the feature of each building region proposal is extracted through the convolutional layers, and the extracted features are mapped to the ROI pooling layer to obtain fixed-size feature vectors and the building-region bounding-box list. The feature vectors are passed to the fully connected layer and the softmax classification layer [50] to label the building objects. Finally, we use the building bounding-box regression layer [49] to refine the position of each predicted building region proposal box from the bounding-box list, to obtain accurate building detection results.
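The post-processing at test time can be sketched as follows (our toy example with invented scores, boxes and offsets; the softmax labels each proposal and the inverse of Eq. (3) refines its box):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# Invented outputs for three proposals: 2-class scores and box offsets.
scores = np.array([[0.2, 2.5], [1.8, 0.3], [0.1, 1.4]])  # [non-building, building]
boxes = np.array([[50, 60, 128, 128], [200, 90, 256, 128], [30, 220, 128, 64.0]])
deltas = np.array([[0.10, -0.05, 0.08, 0.02], [0, 0, 0, 0], [-0.2, 0.1, 0.05, -0.1]])

prob = softmax(scores)[:, 1]                       # building probability per proposal
keep = prob > 0.5                                  # label proposals as buildings
x = deltas[:, 0] * boxes[:, 2] + boxes[:, 0]       # invert Eq. (3) to refine each box
y = deltas[:, 1] * boxes[:, 3] + boxes[:, 1]
w = boxes[:, 2] * np.exp(deltas[:, 2])
h = boxes[:, 3] * np.exp(deltas[:, 3])
for p, bx, by, bw, bh in zip(prob[keep], x[keep], y[keep], w[keep], h[keep]):
    print(f"building probability={p:.2f}, box=({bx:.0f}, {by:.0f}, {bw:.0f}, {bh:.0f})")
```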
Fig. 7. The process of the building detection: the test images pass through the shared convolutional layers to obtain shared feature maps; the building region proposal networks produce candidate building region proposals; after ROI pooling, the fully connected layers feed a softmax classifier (e.g., building probability = 0.92) and the building bounding-box regression, which refines the bounding-box list to output the building detection results.
V. EXPERIMENTS AND RESULTS
In this section, we test the sensitivities of the model parameters and compare our method with five other methods, to verify the efficiency and stability of the proposed model. The test environment is as follows: Intel Xeon E5-2640 CPU and an Nvidia Quadro M4000 GPU with 8 GB of RAM. The training process was performed with the Caffe [51] framework on the Ubuntu 14.04 operating system. This section has four parts: the data overview, the evaluation of the test results, the parameter sensitivity analysis, and the comparison with the other five methods.
A. Datasets
We used three datasets, namely Dataset I, Dataset II and
Dataset III to test the performance of the proposed method via
various experiments. Dataset I is taken from the SIRI-WHU [52] dataset, which comes from the USGS (United States Geological Survey) public test datasets. The data were collected in Montgomery, Ohio, USA, with a spatial resolution of 0.6 m; the scene type is primarily residential, and the area is 5400 × 6000 m². Dataset II is taken from the public SpaceNet dataset¹, which was collected by DigitalGlobe's WorldView-2 satellite. The spatial resolution of Dataset II is 0.5 m and its area is 8000 × 6000 m². Dataset III comes from the Inria aerial image labeling dataset [53], which consists of aerial orthorectified color imagery with a spatial resolution of 0.3 m. The dataset covers an area of 4700 × 5000 m² captured in Chicago, USA.
We select the training data and test data from Datasets I, II and III according to the type and distribution of buildings, as shown in Figs. 8-10. In Dataset I,
the training data has a total of 263 buildings, and the test data
contains 817 buildings. In Dataset II, the training data contains
849 buildings while the test data contains 3048 buildings. In
Dataset III, the training data contains 140 buildings and the
test data contains 578 buildings. Figs. 8-10 respectively
represent the training and test areas of Datasets I-III.
1. https://github.com/SpaceNetChallenge/SpaceNetChallenge.github.io
Fig. 8. The training and the test areas in Dataset I.
Fig. 9. The training and the test areas in Dataset II.
Fig. 10. The training and the test areas in Dataset III.
B. Building detection
To test the efficacy of the proposed method, we integrate the VGG-16 [46] model into our method and perform experiments on the above Datasets I, II and III. The test results of building detection are shown in Figs. 11-13, respectively. It can be seen that our method detects building objects with different textures and shapes well. Despite the buildings with different structures and textures in the scene of Fig. 11(b), some of which are even sheltered by trees (e.g., the buildings in the dashed box of Fig. 11(b), in dashed box 1 of Fig. 12 and in dashed box 1 of Fig. 13), our model still detects the buildings reliably. Meanwhile, buildings of different sizes can be accurately detected, as evident in the dashed boxes of Fig. 11(c), dashed box 2 of Fig. 12 and dashed box 2 of Fig. 13.
Fig. 11. The results of building detection on Dataset I. (a) Detection results of some buildings. (b) Detection results of some sheltered buildings. (c) Detection results for buildings with different sizes.
Fig. 12. The results of building detection on Dataset II.
Fig. 13. The results of building detection on Dataset III.
C. The effect of the training iteration number
In order to analyze the effect of the number of training iterations, we set the learning rate to 0.001 and adopt the Adam learning strategy [54] to optimize our model. The parameters of the Adam learning strategy are as follows: exponential decay rates β1 = 0.9 and β2 = 0.999, and ε = 1e-08. We use the mean average precision (mAP) to evaluate the detection results, and exploit the IoU value to assess the accuracy of the building detection box positions. Here, the VGG-16 model [46] is integrated into our method to extract features under different numbers of training iterations. Tables 1-3 show the obtained mAP values and IoU values of the tested images in Datasets I-III. It can be seen from Tables 1-3 that, as the number of training iterations grows from 4000 to 10000, the mAP values of building detection on the three datasets generally increase and the IoU values of the test images increase synchronously, which indicates that the accuracies of building detection and positioning improve with the number of training iterations. However, the accuracy of building detection decreases when the number of iterations exceeds 8000 in Tables 1 and 3, and 9000 in Table 2, which indicates that overfitting may have occurred. Our model thus has the best detection performance when the number of training iterations is set in the range 8000 to 9000.
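For reference, one Adam update with the hyper-parameters just quoted can be sketched as follows (a toy quadratic objective stands in for the actual network loss):

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update with lr = 0.001, beta1 = 0.9, beta2 = 0.999, eps = 1e-8."""
    m = b1 * m + (1 - b1) * grad        # first-moment (mean of gradients) estimate
    v = b2 * v + (1 - b2) * grad ** 2   # second-moment estimate
    m_hat = m / (1 - b1 ** t)           # bias correction for early steps
    v_hat = v / (1 - b2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

theta = np.zeros(3); m = np.zeros(3); v = np.zeros(3)
for t in range(1, 5001):
    grad = 2 * (theta - np.array([1.0, -2.0, 0.5]))  # toy quadratic objective
    theta, m, v = adam_step(theta, grad, m, v, t)
print(np.round(theta, 3))  # converges toward [1, -2, 0.5]
```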
Table 1. The mAP and IoU values of building detection in the test image of Dataset I under different iteration numbers.

Iterations | 4000  | 5000  | 6000  | 7000  | 8000  | 9000  | 10000
mAP        | 0.44  | 0.46  | 0.49  | 0.51  | 0.54  | 0.50  | 0.46
IoU        | 76.7% | 79.2% | 81.5% | 83.8% | 86.9% | 83.4% | 80.2%
Table 2. The mAP and IoU values of building detection in the test image of Dataset II under different iteration numbers.

Iterations | 4000  | 5000  | 6000  | 7000  | 8000  | 9000  | 10000
mAP        | 0.38  | 0.40  | 0.43  | 0.47  | 0.49  | 0.52  | 0.46
IoU        | 66.7% | 68.2% | 74.5% | 78.8% | 81.9% | 83.2% | 79.6%
Table 3. The mAP and IoU values of building detection in the test image of Dataset III under different iteration numbers.

Iterations | 4000  | 5000  | 6000  | 7000  | 8000  | 9000  | 10000
mAP        | 0.40  | 0.42  | 0.45  | 0.46  | 0.53  | 0.51  | 0.49
IoU        | 68.3% | 69.5% | 75.5% | 78.2% | 86.4% | 82.8% | 81.6%
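As a reference for how the mAP entries above can be obtained, the sketch below computes average precision from ranked detections (one common variant; the paper does not spell out its matching protocol, so the true-positive flags are assumed to be given, e.g., from an IoU test against the labeled boxes):

```python
import numpy as np

def average_precision(scores, is_true_positive, num_gt):
    """AP as the area under the precision-recall curve over ranked detections."""
    order = np.argsort(-np.asarray(scores))        # sort by descending confidence
    tp = np.asarray(is_true_positive, dtype=float)[order]
    cum_tp = np.cumsum(tp)
    precision = cum_tp / (np.arange(len(tp)) + 1)  # precision after each detection
    recall = cum_tp / num_gt                       # recall after each detection
    ap, prev_r = 0.0, 0.0
    for p, r in zip(precision, recall):            # integrate precision over recall
        ap += p * (r - prev_r)
        prev_r = r
    return ap

# Five detections over three ground-truth buildings.
print(average_precision([0.9, 0.8, 0.7, 0.6, 0.5], [1, 0, 1, 1, 0], num_gt=3))
```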
D. The effects of different deep learning network models
In this section, we use several representative deep learning models to test the performance of our method. The employed models, namely AlexNet [44], ZF [45], VGG-16 [46], GoogLeNet [47] and ResNet [48], have achieved excellent results in object detection and recognition competitions in recent years; each is integrated into our method to test the performance on our test datasets (Datasets I-III) under the same parameters (e.g., the number of training iterations is 8000, and the number of levels N in the multilevel training data set is 7). The final values of mAP and IoU are shown in Tables 4-6. From Tables 4-6 and the precision-recall (P-R) curves of the test results in Figs. 14-16, we can observe that the five models obtained good performance in terms of mAP and IoU when used in our method, which shows the advantage of our hierarchical building detection model.
Table 4. The values of mAP and IoU using different models in Dataset I.

Model | AlexNet | ZF    | VGG-16 | GoogLeNet | ResNet
mAP   | 0.49    | 0.51  | 0.54   | 0.53      | 0.56
IoU   | 71.4%   | 71.9% | 86.9%  | 85.1%     | 87.1%
Table 5. The values of mAP and IoU using different models in Dataset II.

Model | AlexNet | ZF    | VGG-16 | GoogLeNet | ResNet
mAP   | 0.43    | 0.47  | 0.52   | 0.52      | 0.54
IoU   | 67.3%   | 68.9% | 80.9%  | 79.1%     | 82.1%
Table 6. The values of mAP and IoU using different models in Dataset III.

Model | AlexNet | ZF    | VGG-16 | GoogLeNet | ResNet
mAP   | 0.45    | 0.49  | 0.56   | 0.55      | 0.57
IoU   | 69.2%   | 70.3% | 82.7%  | 81.1%     | 83.1%
Fig. 14. The precision-recall curves of several deep learning models in Dataset
I.
󰕭󱎵 󱡗󱱚󰉯 100%
󰕭󱎵 󱡗󱱚󰉯 100%, 󲼀 0.1 󱗶
󰕭󱎵 󱡗󱱚󰉯 100%
󰕭󱎵 󱡗󱱚󰉯 100%, 󲼀 0.1 󱗶
Fig. 15. The precision-recall curves of several deep learning models in Dataset
II.
Fig. 16. The precision-recall curves of several deep learning models in Dataset
III.
E. The impact of layer number N
In order to test the influence of the level number N (see Section III-A), we set the level number N from 1 to 10 and use the VGG-16 model [46] as the basic network to train the detection model. In the experiments, the number of training iterations is set to 8000, and the trained model is used to detect buildings in the test images. The results of building detection are evaluated using the average IoU value. As shown in Fig. 17, we used Datasets I-III to conduct the experiments. It is evident from Fig. 17 that the average IoU values of the building detection on the three datasets gradually increase as the level number N grows, and tend to be stable when the level number reaches 7 to 8 on Datasets I-III. This indicates the advantage of the multi-level training structure in improving the accuracy of building detection.
Fig. 17. The curves of average IoU values under different numbers of levels N.
F. The comparison with other methods
We compare our method with five other building detection methods, i.e., the DPM (deformable parts model) method [55] (Method I), the fast-RCNN (fast region convolutional neural networks) method [42] (Method II), the faster-RCNN [43] (Method III), the RICNN (rotation-invariant convolutional neural networks) [35] (Method IV), and the YOLO (you only look once) model [56] (Method V). Table 7 shows the comparison between our method and the other five methods. Our method is characterized by three aspects: i) the multi-level structure of the training data, ii) the deep learning model, and iii) the building region proposal networks. Among the compared methods, Method I uses neither a multilevel structure, a deep learning model, nor building region proposal networks; it detects buildings using spatial structure components, namely several high-resolution component templates extracted from the image samples, and adopts a sliding window to search for buildings in the remote sensing image. Method II uses selective search to generate building region proposals, without a multilevel structure or building region proposal networks. Method III extracts candidate areas by constructing region proposal networks, and the object features are built by the fast-RCNN model [42]; in contrast with our method, Method III targets multi-object detection and does not use multilevel training data. Method IV introduces a rotation-invariant layer on top of existing CNN models and learns a RICNN model to improve object detection performance. Method V proposes a neural network model that casts the object detection task as a regression problem and obtains the location and probability of each object. Neither Method IV nor Method V uses building region proposal networks or a multilevel structure of training data.
Table 7. The comparisons among our method and the other 5 methods.

Methods         | Multilevel structure | Deep learning model | Building region proposal networks
Our method      | √ | √ | √
Method I [55]   | × | × | ×
Method II [42]  | × | √ | ×
Method III [43] | × | √ | √
Method IV [35]  | × | √ | ×
Method V [56]   | × | √ | ×
In the experiments, the same datasets (Datasets I-III) and hardware environment were used to test the building detection performance, and the VGG-16 model [46] was set as the basic network structure unit of our method. The mAP values are shown in Tables 8-10. It can be seen from these results that the mAP value of our method reaches 0.57 in Table 8, 0.54 in Table 9 and 0.55 in Table 10, higher than all the other methods, improving the mAP values by 3.63%, 3.85% and 3.77% on Datasets I-III over the best competing approach (Method IV). These results illustrate the advantages of the multilevel structure, deep learning and the BRPN in our method. As far as the time complexity of building detection (Table 11) is concerned, our method incurs little time overhead, only slightly more than that of the fastest method (Method V), as Method V does not need to generate building region proposals. However, the object detection precision of Method V is lower than that of our method (see the mAP values in Tables 8-10), which again shows the advantages of the hierarchical training dataset and the BRPN. The P-R curves in Figs. 18-20 indicate that our method performs best, which illustrates the advantages of the multi-level training framework and the extraction of building region proposals using the BRPN.
Table 8. The mAP values of our method and the other 5 methods on Dataset I.

Methods         | mAP
Method I [55]   | 0.32
Method II [42]  | 0.41
Method III [43] | 0.52
Method IV [35]  | 0.55
Method V [56]   | 0.51
Our method      | 0.57
Table 9. The mAP values of our method and the other 5 methods on Dataset II.

Methods         | mAP
Method I [55]   | 0.34
Method II [42]  | 0.42
Method III [43] | 0.48
Method IV [35]  | 0.52
Method V [56]   | 0.50
Our method      | 0.54
Table 10. The mAP values of our method and the other 5 methods on Dataset III.

Methods         | mAP
Method I [55]   | 0.33
Method II [42]  | 0.42
Method III [43] | 0.50
Method IV [35]  | 0.53
Method V [56]   | 0.51
Our method      | 0.55
Table 11. The computational cost of building detection on Dataset I.

Methods         | Time cost (seconds)
Method I [55]   | 3.78
Method II [42]  | 2.14
Method III [43] | 0.26
Method IV [35]  | 1.97
Method V [56]   | 0.15
Our method      | 0.17
Fig. 18. The P-R curves of several building detection methods in Dataset I.
Fig. 19. The P-R curves of several building detection methods in Dataset II.
Fig. 20. The P-R curves of several building detection methods in Dataset III.
VI. CONCLUSION
In this paper, we propose a deep learning based hierarchical framework to automatically detect buildings in remote sensing images. In the proposed framework, we design the hierarchical training model using the Gaussian pyramid principle, to extract discriminative features at different scales and spatial extents.
Then, the deep learning model of hierarchical building
detection is constructed. Our method has been validated on
different scenes of remote sensing images. We have also
compared our method with five related methods (i.e. Methods
I-V in the Section V-F) qualitatively and quantitatively.
Experiments and comparisons with the state-of-the-art
methods clearly demonstrate the superiority of our method in
accurately and efficiently detecting the buildings in remote
sensing images. This underlines the advantages of hierarchical
training dataset, deep learning based building detection model,
and building region proposals using the BRPN.
In the future, we will consider the adaptive construction of hierarchical training datasets according to the content of on-ground objects [57], to more adequately extract the features of buildings.
REFERENCES
[1]. L. Guan, Y. Ding, X. Feng, and H. Zhang, "Digital Beijing construction and application based on the urban three-dimensional modelling and remote sensing monitoring technology," IEEE Int. Geosci. Remote Sens. Symp. (IGARSS), pp. 7299-7302, Jul. 2016.
[2]. M. Mazher Rathore, A. Ahmad, A. Paul, and S. Rho, "Urban planning and building smart cities based on the Internet of Things using Big Data analytics," Computer Networks, vol. 101, no. 4, pp. 63-80, Jul. 2016.
[3]. J. F. Pekel, A. Cottam, N. Gorelick, and A. S. Belward, "High-resolution mapping of global surface water and its long-term changes," Nature, vol. 540, no. 7633, pp. 418-432, Dec. 2016.
[4]. J. Leitloff, S. Hinz, and U. Stilla, "Vehicle detection in very high resolution satellite images of city areas," IEEE Trans. Geosci. Remote Sens., vol. 48, no. 7, pp. 2795-2806, Jul. 2010.
[5]. S. Tuermer, F. Kurz, P. Reinartz, and U. Stilla, "Airborne vehicle detection in dense urban areas using HoG features and disparity maps," IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 6, no. 6, pp. 2327-2337, Dec. 2013.
[6]. H. Grabner, T. T. Nguyen, B. Gruber, and H. Bischof, "On-line boosting based car detection from aerial images," ISPRS J. Photogramm. Remote Sens., vol. 63, no. 3, pp. 382-396, May 2008.
[7]. Ö. Aytekin, U. Zöngür, and U. Halici, "Texture-based airport runway detection," IEEE Geosci. Remote Sens. Lett., vol. 10, no. 3, pp. 471-475, May 2013.
[8]. J. Han, D. Zhang, G. Cheng, L. Guo, and J. Ren, "Object detection in optical remote sensing images based on weakly supervised learning and high-level feature learning," IEEE Trans. Geosci. Remote Sens., vol. 53, no. 6, pp. 3325-3337, Jun. 2015.
[9]. G. Cheng, J. Han, L. Guo, X. Qian, P. Zhou, X. Yao, and X. Hu, "Object detection in remote sensing imagery using a discriminatively trained mixture model," ISPRS J. Photogramm. Remote Sens., vol. 85, pp. 32-43, Nov. 2013.
[10]. D. Zhang, J. Han, G. Cheng, Z. Liu, S. Bu, and L. Guo, "Weakly supervised learning for target detection in remote sensing images," IEEE Geosci. Remote Sens. Lett., vol. 12, no. 4, pp. 701-705, Apr. 2015.
[11]. S. Xu, T. Fang, D. Li, and S. Wang, "Object classification of aerial images with bag-of-visual words," IEEE Geosci. Remote Sens. Lett., vol. 7, no. 2, pp. 366-370, Apr. 2010.
[12]. H. Sun, X. Sun, H. Wang, Y. Li, and X. Li, "Automatic target detection in high-resolution remote sensing images using spatial sparse coding bag-of-words model," IEEE Geosci. Remote Sens. Lett., vol. 9, no. 1, pp. 109-113, Jan. 2012.
[13]. Y. Zhang, L. Zhang, B. Du, and S. Wang, "A nonlinear sparse representation-based binary hypothesis model for hyperspectral target detection," IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 8, no. 6, pp. 2513-2522, Jun. 2015.
[14]. Y. Zhang, B. Du, and L. Zhang, "A sparse representation-based binary hypothesis model for target detection in hyperspectral images," IEEE Trans. Geosci. Remote Sens., vol. 53, no. 3, pp. 1346-1354, Mar. 2015.
[15]. N. Yokoya and A. Iwasaki, "Object detection based on sparse representation and Hough voting for optical remote sensing imagery," IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 8, no. 5, pp. 2053-2062, May 2015.
[16]. Y. Zhong, R. Feng, and L. Zhang, "Non-local sparse unmixing for hyperspectral remote sensing imagery," IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 7, no. 6, pp. 1889-1909, Jun. 2014.
[17]. Y. Xu, E. Carlinet, T. Geraud, and L. Najman, "Hierarchical segmentation using tree-based shape spaces," IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 3, pp. 457-469, Mar. 2017.
[18]. C. Farabet, C. Couprie, L. Najman, and Y. LeCun, "Learning hierarchical features for scene labeling," IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 8, pp. 1915-1929, Oct. 2013.
[19]. K. Yu, Y. Lin, and J. Lafferty, "Learning image representations from the pixel level via hierarchical sparse coding," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2011, pp. 1713-1720.
[20]. K. He, X. Zhang, S. Ren, and J. Sun, "Spatial pyramid pooling in deep convolutional networks for visual recognition," IEEE Trans. Pattern Anal. Mach. Intell., vol. 37, no. 9, pp. 1904-1916, Jan. 2015.
[21]. Z. Zhang, L. Zhang, X. Tong, T. Mathiopoulos, B. Guo, X. Huang, Z. Wang, and Y. Wang, "A multilevel point-cluster-based discriminative feature for ALS point cloud classification," IEEE Trans. Geosci. Remote Sens., vol. 54, no. 6, pp. 3309-3321, Jun. 2016.
[22]. Z. Zhang, L. Zhang, X. Tong, B. Guo, L. Zhang, and X. Xing, "Discriminative dictionary learning-based multi-level point-cluster features for ALS point cloud classification," IEEE Trans. Geosci. Remote Sens., vol. 54, no. 12, pp. 7309-7322, Dec. 2016.
[23]. R. Gaetano, G. Scarpa, and G. Poggi, "Hierarchical texture-based segmentation of multiresolution remote-sensing images," IEEE Trans. Geosci. Remote Sens., vol. 47, no. 7, pp. 2129-2141, Jan. 2009.
[24]. R. Trias-Sanz, G. Stamon, and J. Louchet, "Using colour, texture, and hierarchical segmentation for high-resolution remote sensing," ISPRS J. Photogramm. Remote Sens., vol. 63, no. 2, pp. 156-168, Mar. 2008.
[25]. C. Kurtz, N. Passat, P. Gancarski, and A. Puissant, "Extraction of complex patterns from multiresolution remote sensing images: A hierarchical top-down methodology," Pattern Recognition, vol. 45, no. 2, pp. 685-706, Feb. 2012.
[26]. C. Kurtz, A. Stumpf, J. P. Malet, P. Gançarski, A. Puissant, and N. Passat, "Hierarchical extraction of landslides from multiresolution remotely sensed optical images," ISPRS J. Photogramm. Remote Sens., vol. 87, no. 1, pp. 122-136, Jan. 2014.
[27]. T.-Y. Lin, P. Dollar, R. Girshick, K. He, B. Hariharan, and S. Belongie, "Feature pyramid networks for object detection," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2017.
[28]. B. Sirmacek and C. Unsalan, "Urban-area and building detection using SIFT keypoints and graph theory," IEEE Trans. Geosci. Remote Sens., vol. 47, no. 4, pp. 1156-1167, May 2009.
[29]. X. Huang and L. Zhang, "Morphological building/shadow index for building extraction from high-resolution imagery over urban areas," IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 5, no. 1, pp. 161-172, 2012.
[30]. X. Huang and L. Zhang, "An SVM ensemble approach combining spectral, structural, and semantic features for the classification of high-resolution remotely sensed imagery," IEEE Trans. Geosci. Remote Sens., vol. 51, no. 1, pp. 257-272, Jan. 2013.
[31]. X. Huang, H. Chen, and J. Gong, "Angular difference feature extraction for urban scene classification using ZY-3 multi-angle high-resolution satellite imagery," ISPRS J. Photogramm. Remote Sens., vol. 135, no. 1, pp. 127-141, 2018.
[32]. X. Huang, W. Yuan, J. Li, and L. Zhang, "A new building extraction post-processing framework for high spatial resolution remote sensing imagery," IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 10, no. 2, pp. 654-668, Feb. 2017.
[33]. F. Hu, G.-S. Xia, Z. Wang, X. Huang, and L. Zhang, "Unsupervised feature learning via spectral clustering of patches for remotely sensed scene classification," IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 8, no. 5, pp. 2015-2030, May 2015.
[34]. G. Xia, Z. Wang, C. Xiong, and L. Zhang, "Accurate annotation of remote sensing images via active spectral clustering with little expert knowledge," Remote Sens., vol. 7, no. 11, pp. 15014-15045, Nov. 2015.
[35]. G. Cheng, P. Zhou, and J. Han, "Learning rotation-invariant convolutional neural networks for object detection in VHR optical remote sensing images," IEEE Trans. Geosci. Remote Sens., vol. 54, no. 12, pp. 7405-7415, Dec. 2016.
[36]. Y. Long, Y. Gong, Z. Xiao, and Q. Liu, "Accurate object localization in remote sensing images based on convolutional neural networks," IEEE Trans. Geosci. Remote Sens., vol. 55, no. 5, pp. 2486-2498, Jan. 2017.
[37]. R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Region-based convolutional networks for accurate object detection and segmentation," IEEE Trans. Pattern Anal. Mach. Intell., vol. 38, no. 1, pp. 142-158, Jan. 2016.
[38]. F. Hu, G.-S. Xia, J. Hu, and L. Zhang, "Transferring deep convolutional neural networks for the scene classification of high-resolution remote sensing imagery," Remote Sens., vol. 7, no. 11, pp. 14680-14707, Nov. 2015.
[39]. M. Vakalopoulou, K. Karantzalos, N. Komodakis, and N. Paragios, "Building detection in very high resolution multispectral data with deep learning features," IEEE Int. Geosci. Remote Sens. Symp. (IGARSS), pp. 1873-1876, Jul. 2015.
[40]. G. Cheng and J. Han, "A survey on object detection in optical remote sensing images," ISPRS J. Photogramm. Remote Sens., vol. 117, pp. 11-28, Jul. 2016.
[41]. M. Suárez, V. M. Brea, J. Fernández-Berni, R. Carmona-Galán, and D. Cabello, "Low-power CMOS vision sensor for Gaussian pyramid extraction," IEEE Journal of Solid-State Circuits, vol. 52, no. 2, pp. 483-495, Feb. 2017.
[42]. R. Girshick, "Fast R-CNN," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Jun. 2015, pp. 1440-1448.
[43]. S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," in Proc. Neural Inf. Process. Syst. (NIPS), 2015, pp. 1137-1149.
[44]. A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Proc. Neural Inf. Process. Syst. (NIPS), 2012, pp. 1097-1105.
[45]. M. D. Zeiler and R. Fergus, "Visualizing and understanding convolutional networks," in Proc. Euro. Conf. Comput. Vis. (ECCV), 2014.
[46]. K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Sep. 2015, pp. 1-14.
[47]. C. Szegedy, W. Liu, Y. Q. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2015, pp. 1-9.
[48]. K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 770-778.
[49]. R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2014.
[50]. G. E. Hinton and R. R. Salakhutdinov, "Replicated softmax: An undirected topic model," in Proc. Neural Inf. Process. Syst. (NIPS), Nov. 2009, pp. 1607-1614.
[51]. Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, "Caffe: Convolutional architecture for fast feature embedding," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2014, pp. 1401-1410.
[52]. Y. Zhong, Q. Zhu, and L. Zhang, "Scene classification based on the multifeature fusion probabilistic topic model for high spatial resolution remote sensing imagery," IEEE Trans. Geosci. Remote Sens., vol. 53, no. 11, pp. 6207-6222, Nov. 2015.
[53]. E. Maggiori, Y. Tarabalka, G. Charpiat, and P. Alliez, "Can semantic labeling methods generalize to any city? The Inria aerial image labeling benchmark," IEEE Int. Geosci. Remote Sens. Symp. (IGARSS), 2017.
[54]. D. Kingma and J. Ba, "Adam: A method for stochastic optimization," in Proc. Int. Conf. Learn. Repr. (ICLR), May 2015.
[55]. P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan, "Object detection with discriminatively trained part-based models," IEEE Trans. Pattern Anal. Mach. Intell., vol. 32, no. 9, pp. 1627-1645, Sep. 2009.
[56]. J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You only look once: Unified, real-time object detection," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2015.
[57]. Y. Liu, M. Yu, M. Yu, and Y. He, "Manifold SLIC: A fast method to compute content-sensitive superpixels," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), pp. 651-659, 2016.
Yibo L iu received the Bachelor’s degree in Geographical Information
Engineering from Henan University of Technology, Zh engzhou, China,
in 2016. He is currently pursuin g the Master’s degree with Beijing
Advanced Innovation Center for Imaging Technology, Capital Normal
University, Beijing, China. His research interests include deep learning,
remote sensing images analysis and LiDAR-based urban modeling.
Zhenxin Zhang received the Ph.D. degree in
geoinformatics in the Sch ool of Geography, Beijing
Normal University, Beijing, Ch ina, in 2016. He is
currently an assistant professor at Beijing Advanced
Innovation Center for Imaging Technology and Ke y
Lab of 3D Information Acquisition and Application,
College of R esource Environment and Tourism,
Capital Normal University, Beijing, China, and he also
works as a c ooperator with Beijing Key Laboratory of
Urban Spatial Information Engineering, Beijing
Institute of Surveying and Mapping, Beijing, China. His research interests
include light detection and ranging data processing, qualit y analysis o f
geographic information systems, and algorithm development.
Ruofei Zhong received the Ph.D. degree in Geo-informatics from the Institute of Remote Sensing Applications, Chinese Academy of Sciences, Beijing, China, in 2005. He is currently a Professor with the Beijing Advanced Innovation Center for Imaging Technology and the Key Lab of 3D Information Acquisition and Application, College of Resource Environment and Tourism, Capital Normal University, Beijing, China. His research interests include light detection and ranging data processing and data collection systems with laser scanning.
Dong Chen received the Bachelor’s degree in Computer Science from Qingdao University of Science and Technology, Qingdao, China, the Master’s degree in Cartography and Geographical Information Engineering from Xi’an University of Science and Technology, Xi’an, China, and the Ph.D. degree in Geographical Information Sciences from Beijing Normal University, Beijing, China. He is currently an Assistant Professor with Nanjing Forestry University, Nanjing, China. He is also a Post-Doctoral Fellow with the Department of Geomatics Engineering, University of Calgary, Calgary, AB, Canada. His research interests include image- and LiDAR-based segmentation and reconstruction, full-waveform LiDAR data processing, and related remote sensing applications in the field of forest ecosystems.
Yinghai Ke received the Bachelor’s degree in Environmental Science from Wuhan University, Wuhan, China, the Master’s degree in Geographical Information Science from Peking University, Beijing, China, and the Ph.D. degree in Geospatial Sciences and Technology from the State University of New York College of Environmental Science and Forestry, Syracuse, New York, USA. She is currently an Associate Professor with Capital Normal University, Beijing, China. Her research interests include remote sensing image classification and its applications in urban environment and ecology.
Jiju Peethambaran received the Bachelor’s degree in information technology from the University of Calicut, Malappuram, India, the Master’s degree in computer science from the National Institute of Technology Karnataka, Mangalore, India, and the Ph.D. degree in computational geometry from IIT Madras, Chennai, India. He is currently a Post-Doctoral Researcher with the Department of Computer Science, University of Victoria, Victoria, BC, Canada. His research interests include computational geometry, geometric learning, real-time geometry processing, and related applications including motion capture for VR/AR and LiDAR-based urban modeling.
Chuqun Chen received the B.S. degree in engineering of geology and exploration from the Chengdu University of Technology, Chengdu, China, in 1982, the M.Sc. degree in cartography and remote sensing from the Institute of Remote Sensing Applications, Chinese Academy of Sciences, Beijing, China, in 1992, the Ph.D. degree in the science of physical oceanography (ocean color remote sensing) from the Graduate University of Chinese Academy of Sciences, Guangzhou, China, in 2006, and the Ph.D. degree in water resources engineering from Lund University, Sweden, in 2008. Since 1987, he has been working on remote sensing applications and has been a Principal Investigator for more than ten projects sponsored by the National 863 Program, the National 973 Program (subproject), the National Natural Science Foundation of China, the Scientific Foundation of Guangdong Province, the Chinese Academy of Sciences, and the Ministry of Science and Technology. His research interests include marine optics theory and optical data analyses, the atmospheric correction of optical satellite data in coastal areas, remotely sensed assessment of water quality, thermal infrared remote sensing, skin temperature measurement, and validation of satellite-retrieved SST with his own developed instrument, the Buoyant Equipment for Skin Temperature (BEST).
Lan Sun received the Bachelor’s degree from Beijing University of Civil Engineering and Architecture, Beijing, China, in 2012. He is currently pursuing the Master’s degree with the Beijing Advanced Innovation Center for Imaging Technology, Capital Normal University, Beijing, China. His research interests include spatial data mining, deep learning, and LiDAR-based classification and reconstruction.