Multilevel Building Detection Framework in Remote Sensing Images Based on Convolutional Neural Networks

Yibo Liu 1,2, Zhenxin Zhang 1,2,3, Ruofei Zhong 1,2, Dong Chen 4, Yinghai Ke 1,2, Jiju Peethambaran 5, Chuqun Chen 6, Lan Sun 1,2

1 Beijing Advanced Innovation Center for Imaging Technology, Capital Normal University, Beijing 100048, China
2 College of Resources Environment and Tourism, Capital Normal University, Beijing 100048, China
3 Chinese Academy of Surveying and Mapping, Beijing 100830, China
4 College of Civil Engineering, Nanjing Forestry University, Nanjing 210037, China
5 Department of Computer Science, University of Victoria, Victoria, BC V8P 5C2, Canada
6 Guangdong Key Laboratory of Ocean Remote Sensing, South China Sea Institute of Oceanology, Chinese Academy of Sciences, Guangzhou 510301, China

This work was supported by the Open Fund of Twenty First Century Aerospace Technology Co., Ltd. under Grant 21AT-2016-04, the National Natural Science Foundation of China under Grants 41701533, 41371434 and 41301521, the Open Fund of the Guangdong Key Laboratory of Ocean Remote Sensing (South China Sea Institute of Oceanology, Chinese Academy of Sciences) under Grant 2017B030301005-LORS1804, and the Open Fund of the Key Laboratory for National Geography State Monitoring (National Administration of Surveying, Mapping and Geoinformation) under Grant 2017NGCM06. (Corresponding authors: Zhenxin Zhang and Ruofei Zhong.)
Abstract—In this paper, we propose a hierarchical building detection framework based on a deep learning model, which focuses on accurately detecting buildings in remote sensing images. To this end, we first construct a generation model for multi-level training samples using the Gaussian pyramid technique, so as to learn the features of building objects at different scales and spatial resolutions. Then, building region proposal networks are put forward to quickly extract candidate building regions, thereby increasing the efficiency of building object detection. Based on the candidate building regions, we establish the multi-level building detection model using convolutional neural networks (CNNs), from which the generic image features of each building region proposal are calculated. Finally, the obtained features are provided as inputs for training the CNN model, and the learned model is applied to test images to detect unknown buildings. Various experiments using Datasets I-III (Section V-A) show that the proposed framework increases the mean average precision (mAP) of building detection by 3.63%, 3.85% and 3.77%, respectively, compared with the best of the state-of-the-art methods (Method IV). Besides, the proposed method is robust to buildings with different spatial textures and types.
Index Terms—Building detection, multi-level framework, CNNs, candidate building regions, remote sensing images.
I. INTRODUCTION
In the past few decades, due to the developments in
advanced aerospace remote sensing techniques and sensor
manufacture techniques, the quality of acquired remote
sensing images has improved tremendously. These images contain many complex and significant spatial objects. Typically, buildings constitute the most important landscape in remote sensing images and have been intensively used in various practical applications, including digital urban model construction [1], urban planning [2], and environment control and mapping [3], among others. In particular, the position, geometric shape and orientation of buildings often form the basis for high-level building-oriented applications, such as building contouring, building model reconstruction and cartographic generalization. Hence, efficient and accurate detection and recognition of real buildings in large-scale remote sensing images is a relevant, yet challenging task. In urban scenarios, this challenge is further intensified by the diverse geometric shapes of buildings and the inhomogeneity of their spectral information.
In this paper, we focus on the general problem of how to automatically and accurately extract buildings from large-scale remote sensing imagery. Many previously published works in
this domain show impressive success on building or other
specific object detection. More specifically, the machine
learning techniques, such as AdaBoost [4-7], support vector
machine (SVM) [8-12], sparse coding-based classifiers
[13-17], etc., are generally adopted in previous works. Although these methods can reasonably detect objects, the saliency and hierarchy of their object feature representations leave considerable room for improvement.
With the development of deep learning theory, and in particular its ability to learn more powerful feature representations, deep learning offers the possibility of efficient building detection in remote sensing images. In this
paper, we aim to design an efficient and hierarchical building
detection framework based on deep learning model by using
remote sensing images. We first construct the multi-level
training samples using the Gaussian pyramid principle, to
learn the features of building objects at different scales and
spatial resolutions. Then, the building region proposal
networks are designed to determine the candidate building
object locations. This process increases the efficiency of the
building object detection. Next, we establish the multi-level
building detection model using CNNs, and the features from
hierarchical images corresponding to each building region
proposal are extracted. Finally, the extracted features are used
to learn the CNNs model, which is utilized to detect the
unknown buildings in the test images. Apart from introducing
this generic methodology, our paper makes the following
specific novel contributions:
i) We propose a multi-level learning framework for building
detection from remote sensing images based on CNNs, which
can extract the features of buildings at different scales and
spatial resolutions to train the deep learning model and detect
buildings.
ii) We establish the building region proposal networks
(BRPN) to generate candidate building regions, thereby
improving the efficiency of building region searching and
enhancing the accuracy of building positions.
The rest of the paper is organized as follows. Section II reviews related work on building and generic object detection. Section III describes the proposed learning process for building detection. The test step of building detection using the learned model is described in Section IV. Section V analyzes the performance of the proposed framework. Finally, Section VI concludes the paper along with a few suggestions for future research topics.
II. RELATED WORK
Building and generic object detection in remote sensing images has been an active area of research for the last few years and is still very much open. Two important considerations underlying remote sensing image based object detection are the construction of a hierarchical detection framework and the object (building or other object) feature representation; both are briefly reviewed in this section.
A. The construction of hierarchical detection framework
Hierarchical structure can fully represent the spatial
hierarchy and diversity [17]. A series of publications along
this line demonstrated the effectiveness of the hierarchical
structure in improving the effect of object detection and/or
recognition. For example, Farabet et al. [18] constructed the
pyramid of training data by using the Laplacian method to
enhance the ability of feature representation. Yu et al. [19]
proposed a hierarchical framework, called ScSPM (sparse
coding based spatial pyramid matching) to extract hierarchical
features. Then, considering the advantage of spatial
discriminative feature, He et al. [20] equipped the networks
with the strategy of spatial pyramid pooling (SPP), which can
generate a fixed-length representation regardless of image
size/scale. Taking the effect of object size in real space into
account, Zhang et al. [21, 22] built a hierarchical structure
based on exponential curve, which works well when large-
and small-sized objects coexist. Regarding texture-oriented hierarchical construction, Gaetano et al. [23] proposed hierarchical texture-based segmentation of multi-resolution remote sensing images, and similarly Trias-Sanz et al. [24] used color and texture to achieve hierarchical segmentation of high-resolution remote sensing images. Kurtz et al. [25, 26] designed a hierarchical top-down methodology that can extract extremely complex patterns from multi-resolution remote sensing images; for example, landslides were hierarchically extracted from multi-resolution remotely sensed optical images. Lin et al. [27] exploited the multi-scale pyramidal hierarchy of deep features to construct hierarchical features with marginal extra cost.
B. Object detection and location
Object detection and location in remote sensing images
have been widely researched in recent years. Sirmacek and
Unsalan [28] designed a building detection method in urban
areas using scale-invariant feature transform (SIFT) keypoints
and graph theory. Xu et al. [11] put forward an object
classification of aerial images through BoW (bag of words)
method. Huang et al. [29] designed an effective and automatic
building index to detect buildings from the high-resolution
imagery. After that, multi-features [30], multi-angular features [31] and a post-processing framework [32] for remote sensing images were also explored for urban and building classification. Hu et al. [33] developed an
unsupervised feature learning method via spectral clustering of
patches for remotely sensed scene classification. Xia et al. [34]
proposed a method of accurate annotation of remote sensing
images by active spectral clustering with little expert
knowledge.
Recent advances in deep learning provide unprecedented
opportunities to address the problems such as object detection
and location in a different way. For example, in [35], Cheng et
al. proposed a rotation-invariant CNNs for object detection in
optical remote sensing images. In another method, Long et al.
[36] designed an object detection framework in remote
sensing images based on CNNs. Girshick et al. [37] constructed region-based convolutional networks for accurate object detection and segmentation. Hu et al. [38]
transferred deep CNNs for the scene classification of
high-resolution remote sensing imagery. Vakalopoulou et al.
[39] proposed an automated building extraction framework
using deep CNNs. Han et al. [8] designed an object detection
method in optical remote sensing images based on a weakly
supervised learning and high-level feature learning using deep
learning theory. The readers can refer to review paper in [40]
to learn the current progress on object detection and location
from optical remote sensing images. Notably, the above methods could perform better if the hierarchical information of the spatial image were considered. Although the recognition of buildings and other prominent objects from imagery has been researched for many years, accurate recognition remains unsolved due to the complexity of spatial structures and the diversity of surface textures, e.g., building occlusions and inhomogeneous building sizes in experimental scenarios.
III. THE LEARNING FRAMEWORK
The overview of the proposed method is shown in Fig. 1.
Firstly, we construct the multi-level training samples by using
the Gaussian pyramid principle, to learn the features of
building objects at different scales and spatial resolutions.
Then, the building region proposal networks are designed to
determine the candidate building object locations. Based on
the building region proposals, the multi-level building
detection model is established using the CNNs, and the image
features from hierarchical images corresponding to each
building region proposal are extracted. Finally, the extracted
features are used to learn the CNNs model in the training step,
and the learned model is used to detect the unknown buildings
in the test images. It is to be noted that the proposed framework not only identifies buildings in remote sensing images, but also provides an accurate location for each identified building.
Fig. 1. The overview of our method: remote sensing images are converted into multi-level training sample sets via the Gaussian pyramid principle; the building region proposal networks generate building region proposals; convolutional neural networks extract hierarchical image features, which are used to train the building detection model and to produce building detection results on the test images.
In this section, we mainly discuss the learning framework of the hierarchical building detection in remote sensing images based on the deep learning model. More precisely, the hierarchical training data sets are first constructed. Next, the candidate areas of buildings at different levels are determined using the building region proposal networks. Finally, the features of the building region proposals are computed using the deep CNN model, to train the hierarchical building detection framework.
A. The construction of hierarchical training data
Due to the various spatial shapes, sizes and textures of buildings, and the occlusions between objects in remote sensing images, it is difficult to construct efficient building object features. To fully learn the characteristics of buildings,
and enhance the generalization ability of the proposed model,
we design a method for automatically generating multi-level
training data sets. Inspired by the Gaussian pyramid method in
[41], we construct a multi-level structure of training data by
resampling remote sensing images in each level. Firstly, the
original remote sensing images are automatically and
uniformly divided into patches of 300×300 pixels and
450×500 pixels respectively, to cover more kinds of complete
building samples. These patches are then used as basic images and down-sampled with a Gaussian kernel convolution to gradually generate the multi-level training data sets. We take the generation of the resampled image at the (l+1)-th level (l = 0, 1, 2, …, N-2, where N denotes the number of levels in the hierarchical training dataset; e.g., N = 6 in Fig. 2) as an example, to illustrate the Gaussian kernel convolution operations. After the image at the l-th level is smoothed by a low-pass filter, the image is sampled according
to Eq. (1):

G_{l+1}(i, j) = \sum_{m=-2}^{2} \sum_{n=-2}^{2} d(m, n) G_l(2i + m, 2j + n),   (1)

where G_l(·) represents the image in the l-th level of the training dataset, and G_{l+1}(·) represents the image resampled from G_l(·). The kernel d(m, n) = g(m)·g(n) is a 5 × 5 pixels window function with low-pass filtering characteristics, used as a Gaussian convolutional kernel, where g(·) is the Gaussian density distribution function, so that:

d(m, n) = \frac{1}{2\pi\sigma^2} e^{-(m^2 + n^2)/(2\sigma^2)}.   (2)

According to the above principles, a series of images G_0, G_1, …, G_{N-1} spanning the levels can be naturally created. They constitute a set of multi-level training data.
Fig. 2. The multi-level image training data sets.
B. Generation of candidate building areas
We design building region proposal networks to generate
candidate building areas. To this end, we first describe the structure of the building region proposal networks, and then give our strategy for generating the building region proposals.
B1) Building region proposal networks
Many researchers have used traditional methods to
determine the candidate areas, such as sliding window [42].
However, sliding-window based searching needs to traverse the entire image, which leads to high time complexity and, consequently, affects the efficiency of building detection. Moreover, this method requires manually setting the size and aspect ratio of the sliding window, so it is difficult to effectively extract building areas in remote sensing images. We instead use a CNN model to extract candidate building areas, which can efficiently generate a small number of high-quality candidate building areas.
The proposed generation process of building region proposals is shown in Fig. 3. Firstly, the convolutional features of the hierarchical training data set are extracted by a set of shared convolutional layers [43] from models such as AlexNet [44], ZF [45], VGG-16 [46], GoogLeNet [47] and ResNet [48]. The last layer of the CNN model maps the input remote sensing image to a 512-dimensional convolutional feature (shared feature map). Then, the extracted shared feature maps are fed into the region proposal networks, which consist of two branches, as shown in Fig. 3: one calculates the regression values of the building region box positions to obtain the location parameters of the predicted building region proposals; the other predicts the probability that a building region box belongs to the building or non-building class, by calculating the intersection over union (IoU) ratio between the initial candidate building region and the labeled sample region.
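To make the branch dimensions in Fig. 3 concrete, the toy sketch below (our illustration with random weights and an assumed feature-map size, not the trained network) shows how two 1 × 1 convolutions over a 512-channel shared feature map produce the 18 class scores and 36 regression values per spatial position:

```python
import numpy as np

rng = np.random.default_rng(0)
H, W = 38, 50                      # assumed spatial size of the shared feature map
shared_map = rng.normal(size=(512, H, W))

# A 1x1 convolution is a channel-wise linear map, so a matrix product suffices.
w_cls = rng.normal(size=(2 * 9, 512)) * 0.01   # 18 outputs: 2 classes x 9 anchors
w_reg = rng.normal(size=(4 * 9, 512)) * 0.01   # 36 outputs: 4 offsets x 9 anchors

x = shared_map.reshape(512, -1)
cls_scores = (w_cls @ x).reshape(18, H, W)     # building/non-building logits per anchor
box_deltas = (w_reg @ x).reshape(36, H, W)     # (tx, ty, tw, th) per anchor, cf. Eq. (3)

# Softmax over the two classes gives each anchor's building probability
# (the class-major channel ordering here is an arbitrary choice of this toy).
logits = cls_scores.reshape(2, 9, H, W)
prob_building = np.exp(logits[1]) / np.exp(logits).sum(axis=0)
print(prob_building.shape)                     # (9, H, W)
```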
Fig. 3. The process of generating candidate building areas. The shared convolutional layers of a CNN model produce a 512-channel shared feature map from the hierarchical remote sensing images; a 1 × 1 convolutional branch predicts the bounding boxes (36 outputs, trained with the smooth L1 loss) and a second branch predicts the class scores (18 outputs, reshaped into class probabilities and trained with the softmax building classification loss), yielding the building region proposals.
Considering the variations in the spatial sizes, structures and shapes of buildings, we set the multi-scale region boxes of building detection as nine rectangular sliding windows with three pixel sizes (128, 256 and 512 pixels) and three aspect ratios (1:1, 1:2 and 2:1) per size (Fig. 4). In the process of generating building region proposals, we use these nine rectangular sliding windows as the initial building detection boxes. The nine sizes of rectangular sliding windows produce many overlapping areas in the images, which covers all candidate building regions and allows a building detection box to enclose a complete building. Then the location and size of each building detection box are modified according to the bounding-box regression values calculated by a regression mapping. The mapping is defined by the position translation and size scaling of the building detection box, and the mapping parameters are defined as follows [49]:
t_x = (x - x_a) / w_a,          t_y = (y - y_a) / h_a,
t_w = \log(w / w_a),            t_h = \log(h / h_a),
t_x^* = (x^* - x_a) / w_a,      t_y^* = (y^* - y_a) / h_a,
t_w^* = \log(w^* / w_a),        t_h^* = \log(h^* / h_a),   (3)
where x and y denote the coordinates of the bounding-box centroid of the detected building, and w and h are respectively the width and the height of the building detection box. The variables x, x_a and x^* (likewise for y) respectively represent the center coordinate of the predicted building detection box, the initial building detection box and the labeled detection box. The parameters w, w_a and w^* (likewise for h) are the widths of the predicted, initial and labeled detection boxes. The vector t_i = [t_x, t_y, t_w, t_h] represents the four parameterized coordinates of the predicted building detection box: (t_x, t_y) are the translation values between the predicted and the initial building detection box, and (t_w, t_h) are the scaling parameters between them. Analogously, (t_x^*, t_y^*) are the translation parameters and (t_w^*, t_h^*) the scaling parameters between the labeled detection box and the initial building detection box.
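The sketch below (ours; the numeric boxes are invented for illustration) applies Eq. (3) to map a labeled box to its regression targets relative to an initial box, and then inverts the mapping:

```python
import numpy as np

def encode(box, anchor):
    """Eq. (3): parameterize a box (x, y, w, h) relative to an initial box."""
    x, y, w, h = box
    xa, ya, wa, ha = anchor
    return np.array([(x - xa) / wa, (y - ya) / ha,
                     np.log(w / wa), np.log(h / ha)])

def decode(t, anchor):
    """Invert Eq. (3): recover the predicted box from the offsets t."""
    tx, ty, tw, th = t
    xa, ya, wa, ha = anchor
    return np.array([tx * wa + xa, ty * ha + ya,
                     wa * np.exp(tw), ha * np.exp(th)])

anchor = (100.0, 100.0, 128.0, 256.0)   # one of the nine initial boxes (1:2 ratio)
gt_box = (112.0, 90.0, 150.0, 240.0)    # an invented labeled box (x*, y*, w*, h*)
t_star = encode(gt_box, anchor)         # regression targets (tx*, ty*, tw*, th*)
assert np.allclose(decode(t_star, anchor), gt_box)
```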
During the training of the building region proposal networks, the hierarchical training dataset is fed to the model to generate multi-scale building region proposals, which can be used to detect building objects in remote sensing images. After the translation and scaling parameters are obtained, the final building region proposals are determined by their IoU values; see, e.g., the output building region proposals denoted by the red boxes in the right-most column of Fig. 5.
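For reference, the IoU itself can be computed as follows (a minimal sketch of the standard definition; the 0.7 threshold matches the note under Fig. 5):

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# A proposal is kept as a building region proposal when its IoU with a
# labeled building box exceeds 0.7 (cf. the red boxes in Fig. 5).
print(iou((0, 0, 100, 100), (5, 5, 105, 105)) > 0.7)  # True (IoU ~ 0.82)
```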
Fig. 4. Different sizes of the initial sliding windows. The red box in the upper left represents a sliding window with 128² pixels, the square blue box represents a sliding window with 256² pixels, and the square green box represents a sliding window with 512² pixels. The region boxes of each color have three aspect ratios: 1:1, 1:2 and 2:1.
Fig. 5. The schematic diagram of generating candidate building areas. Note that the IoU values of the red rectangles in the right-most subfigures are higher than 0.7.
B2) Training of building region proposal networks
The training process of the building region proposal networks is end-to-end. The loss function is defined as follows:

L({p_i}, {t_i}) = \sum_i L_{cls}(p_i, p_i^*) + \sum_i p_i^* L_{loc}(t_i, t_i^*),   (4)
where the subscript i is the index of an initial building detection box, and p_i is the predicted probability that the i-th building detection box belongs to the building class. The label p_i^* represents the ground truth, whose value is 1 if the initial detection box belongs to the building class and 0 otherwise. The vector t_i = (t_x, t_y, t_w, t_h) represents the coordinates of the predicted bounding box, and t_i^* = (t_x^*, t_y^*, t_w^*, t_h^*) the coordinates of the ground-truth
box representing the building region. The function L_{cls}(p_i, p_i^*) is the classification loss term, which is calculated as:

L_{cls}(p_i, p_i^*) = -\log[ p_i p_i^* + (1 - p_i)(1 - p_i^*) ],   (5)

and represents the price paid for an inaccurate building classification.
The regression loss (L_{loc}) is calculated by a classical loss function (Smooth_{L1}) [42], which is defined as follows:

L_{loc}(t_i, t_i^*) = Smooth_{L1}(t_i - t_i^*),   (6)

and the Smooth_{L1} function is a nonlinear regression function defined below:

Smooth_{L1}(a) = 0.5 a^2       if |a| < 1,
                 |a| - 0.5     otherwise.   (7)
In Eq. (7), the argument a represents the regression residual (t_i - t_i^*). In the optimization process, the loss function gradually approaches its minimum value and the parameters of the training network are obtained.
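A compact numerical sketch of Eqs. (4)-(7) (our illustration with made-up predictions, not the training code) is given below:

```python
import numpy as np

def smooth_l1(a: np.ndarray) -> np.ndarray:
    """Eq. (7): 0.5*a^2 if |a| < 1, |a| - 0.5 otherwise."""
    return np.where(np.abs(a) < 1, 0.5 * a ** 2, np.abs(a) - 0.5)

def brpn_loss(p, p_star, t, t_star):
    """Eqs. (4)-(6): summed classification + regression loss over all initial
    detection boxes (no normalization factors, as stated in the text below).
    p: (N,) predicted building probabilities; p_star: (N,) 0/1 labels;
    t, t_star: (N, 4) predicted and ground-truth box parameterizations."""
    eps = 1e-12                                                  # guard against log(0)
    l_cls = -np.log(p * p_star + (1 - p) * (1 - p_star) + eps)   # Eq. (5)
    l_loc = smooth_l1(t - t_star).sum(axis=1)                    # Eq. (6)
    return l_cls.sum() + (p_star * l_loc).sum()                  # Eq. (4)

p = np.array([0.9, 0.2]); p_star = np.array([1.0, 0.0])          # two toy boxes
t = np.random.randn(2, 4); t_star = np.random.randn(2, 4)
print(brpn_loss(p, p_star, t, t_star))
```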
Similar to the Faster-RCNN (faster region convolutional
neural networks) [43] method, our method employs the
generation of region proposals to perform the task of object
detection, because the generation of candidate regions can
effectively improve the accuracy and efficiency of the object
detection. Unlike the RPN (region proposal networks) embedded in Faster-RCNN, the BRPN constructs the network by combining the spatial hierarchies of the multi-level image training data sets with the deep learning model, whereas the RPN does not consider hierarchical spatial information and only exploits a single image scale. On the other hand, Faster-RCNN uses a multi-task loss function to train multi-class object detection, which is not suitable for single-object detection tasks (e.g., buildings) in remote sensing images; we therefore redesigned the loss function by removing some normalization factors, yielding Eq. (4). In addition, our BRPN model uses features extracted at different scales, which is another difference from Faster-RCNN.
C. Feature extraction
One of the most remarkable characteristics of deep learning
is to automatically learn the multi-level discriminative feature
representation (feature maps) by using the multi-level
convolutional layers. These feature maps can be used to
distinguish building from non-building regions in our scenario. Deep learning models can extract robust deep features from the image at different levels of abstraction. For example, the shallow convolutional layers capture the edge contours and color-related information of building objects, while the deeper convolutional layers capture their texture and shape structures. These features are well suited to describing buildings with different structures and textures in remote sensing images, thereby contributing to the accuracy of building detection.
In the process of feature extraction (Fig. 6), the shared feature map extracted by the CNN models is used to generate building region proposals in the BRPN, and is also used to extract further features in the detection network. We place a ROI pooling layer [42] before the fully connected layer in the feature extraction network. The ROI pooling layer takes the candidate ROI list generated by the building region proposal networks and converts the variable-size feature maps extracted by the CNN layers into fixed-size feature vectors. Thus, a feature vector of 7 × 7 × 512 dimensions is extracted from each candidate ROI region and input into the fully connected (FC) layer.
D. Training process
In the training step, we adopt the back-propagation optimization method and the SGD (stochastic gradient descent) learning strategy. We generate the hierarchical training data with the Gaussian pyramid principle, and the generated hierarchical training data with the corresponding labels are used to train the model. The whole training process is multi-step: first, we employ a model pre-trained on ImageNet [44] to initialize the BRPN network; then, the multilevel training remote sensing images are used to train the BRPN model, and the learned BRPN model can generate the building region proposals. After that, we use the same pre-trained model to initialize the detection network, and train the detection network based on the building region proposals obtained from the BRPN model. When the classification model and the position parameters of the building detection boxes are optimized by minimizing Eq. (4), the parameters of the shared convolutional layers are obtained and used in the building detection.
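The outline below summarizes this schedule in code form (a structural sketch only; the Net class and its methods are stand-ins, since the paper trains real networks in Caffe):

```python
from dataclasses import dataclass, field

@dataclass
class Net:
    """Stand-in for a CNN; only the staged schedule is illustrated here."""
    weights: dict = field(default_factory=dict)
    def train_on(self, data, loss_name):
        print(f"training on {data} with {loss_name}")
    def generate_proposals(self, data):
        return f"proposals({data})"

def staged_training(pretrained: dict) -> Net:
    brpn = Net(dict(pretrained))              # step 1: init BRPN from ImageNet model
    brpn.train_on("multi-level images", "Eq. (4)")   # step 2: learn the BRPN
    proposals = brpn.generate_proposals("multi-level images")
    detector = Net(dict(pretrained))          # step 3: init detection net likewise
    detector.train_on(proposals, "Eq. (4)")   # step 4: train on BRPN proposals;
    return detector                           # shared conv layers are reused at test time

staged_training({"conv1": "pretrained weights"})
```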
Fig. 6. The process of feature extraction. The training images consist of hierarchical remote sensing images. The shared convolutional layers extract the features of the training images to obtain shared feature maps. The feature extraction layers extract deeper features from the shared feature maps. The ROI pooling layer converts the extracted features into a list of fixed-size feature vectors and outputs the extracted features.
IV. THE BUILDING DETECTION
As shown in Fig. 7, the buildings in remote sensing images are detected using the parameters of the learned deep learning model. Firstly, the test images are converted into candidate building region proposals by the BRPN; these candidate building region proposals correspond to the shared feature maps generated by the CNN models. Then, the feature of each building region proposal is extracted through the convolutional layers, and the extracted features are mapped to the ROI pooling layer to obtain fixed-size feature vectors and the building-region bounding-box list. The feature vectors are passed to the fully connected layer and the softmax classification layer [50] to label the building objects. Finally, we use the building bounding-box regression layer [49] to refine the position of each predicted building region proposal box from the bounding-box list, to obtain accurate building detection results.
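The post-processing at test time can be sketched as follows (our toy example with invented scores, boxes and offsets; the softmax labels each proposal and the inverse of Eq. (3) refines its box):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# Invented outputs for three proposals: 2-class scores and box offsets.
scores = np.array([[0.2, 2.5], [1.8, 0.3], [0.1, 1.4]])  # [non-building, building]
boxes = np.array([[50, 60, 128, 128], [200, 90, 256, 128], [30, 220, 128, 64.0]])
deltas = np.array([[0.10, -0.05, 0.08, 0.02], [0, 0, 0, 0], [-0.2, 0.1, 0.05, -0.1]])

prob = softmax(scores)[:, 1]                       # building probability per proposal
keep = prob > 0.5                                  # label proposals as buildings
x = deltas[:, 0] * boxes[:, 2] + boxes[:, 0]       # invert Eq. (3) to refine each box
y = deltas[:, 1] * boxes[:, 3] + boxes[:, 1]
w = boxes[:, 2] * np.exp(deltas[:, 2])
h = boxes[:, 3] * np.exp(deltas[:, 3])
for p, bx, by, bw, bh in zip(prob[keep], x[keep], y[keep], w[keep], h[keep]):
    print(f"building probability={p:.2f}, box=({bx:.0f}, {by:.0f}, {bw:.0f}, {bh:.0f})")
```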
Fig. 7. The process of the building detection: the test images pass through the shared convolutional layers to obtain shared feature maps; the building region proposal networks produce candidate building region proposals; after ROI pooling, the fully connected layers feed a softmax classifier (e.g., building probability = 0.92) and the building bounding-box regression, which refines the bounding-box list to output the building detection results.
V. EXPERIMENTS AND RESULTS
In this section, we test the sensitivities of the model parameters and compare our method with five other methods, to verify the efficiency and stability of the proposed model. The test environment is as follows: Intel Xeon E5-2640 CPU and an Nvidia Quadro M4000 GPU with 8 GB of RAM. The training process was performed with the Caffe [51] framework on the Ubuntu 14.04 operating system. This section has four parts: the data overview, the evaluation of the test results, the parameter sensitivity analysis, and the comparison with the other five methods.
A. Datasets
We used three datasets, namely Dataset I, Dataset II and
Dataset III to test the performance of the proposed method via
various experiments. Dataset I is taken from the SIRI-WHU [52] dataset, which comes from the USGS (United States Geological Survey) public test datasets. The data were collected in Montgomery, Ohio, USA, with a spatial resolution of 0.6 m; the scene type is primarily residential, and the area is 5400 × 6000 m². Dataset II is taken from the public SpaceNet dataset¹, which was collected by DigitalGlobe's WorldView-2 satellite. The spatial resolution of Dataset II is 0.5 m and its area is 8000 × 6000 m². Dataset III comes from the Inria aerial image labeling dataset [53], which consists of aerial orthorectified color imagery with a spatial resolution of 0.3 m. The dataset covers an area of 4700 × 5000 m² captured in Chicago, USA.
We select the training data and test data from Datasets I, II and III according to the type and distribution of buildings, as shown in Figs. 8-10. In Dataset I,
the training data has a total of 263 buildings, and the test data
contains 817 buildings. In Dataset II, the training data contains
849 buildings while the test data contains 3048 buildings. In
Dataset III, the training data contains 140 buildings and the
test data contains 578 buildings. Figs. 8-10 respectively
represent the training and test areas of Datasets I-III.
1. https://github.com/SpaceNetChallenge/SpaceNetChallenge.github.io
Fig. 8. The training and the test areas in Dataset I.
Fig. 9. The training and the test areas in Dataset II.
Fig. 10. The training and the test areas in Dataset III.
B. Building detection
To test the efficacy of the proposed method, we integrate the VGG-16 [46] model into our method and perform experiments on the above Datasets I, II and III. The test results of building detection are shown in Figs. 11-13, respectively. It can be seen that our method detects building objects with different textures and shapes well. Despite the buildings with different structures and textures in the scene of Fig. 11(b), some of which are even sheltered by trees (e.g., the buildings in the dashed box of Fig. 11(b), in dashed box 1 of Fig. 12 and in dashed box 1 of Fig. 13), our model still detects the buildings reliably. Meanwhile, buildings of different sizes can be accurately detected, as evident in the dashed boxes of Fig. 11(c), dashed box 2 of Fig. 12 and dashed box 2 of Fig. 13.
Fig. 11. The results of building detection on Dataset I. (a) Detection results of some buildings. (b) Detection results of some sheltered buildings. (c) Detection results for buildings with different sizes.
Fig. 12. The results of building detection on Dataset II.
Fig. 13. The results of building detection on Dataset III.
C. The effect of the training iteration number
In order to analyze the effect of the number of training iterations, we set the learning rate to 0.001 and adopt the Adam learning strategy [54] to optimize our model. The parameters of the Adam learning strategy are as follows: exponential decay rates β1 = 0.9 and β2 = 0.999, and ε = 1e-08. We use the mean average precision (mAP) to evaluate the detection results, and exploit the IoU value to assess the accuracy of the building detection box positions. Here, the VGG-16 model [46] is integrated into our method to extract features under different numbers of training iterations. Tables 1-3 show the obtained mAP values and IoU values of the tested images in Datasets I-III. It can be seen from Tables 1-3 that, as the number of training iterations grows from 4000 to 10000, the mAP values of building detection on the three datasets generally increase and the IoU values of the test images increase synchronously, which indicates that the accuracies of building detection and positioning improve with the number of training iterations. However, the accuracy of building detection decreases when the number of iterations exceeds 8000 in Tables 1 and 3, and 9000 in Table 2, which indicates that overfitting may have occurred. Our model thus has the best detection performance when the number of training iterations is set in the range 8000 to 9000.
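For reference, one Adam update with the hyper-parameters just quoted can be sketched as follows (a toy quadratic objective stands in for the actual network loss):

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update with lr = 0.001, beta1 = 0.9, beta2 = 0.999, eps = 1e-8."""
    m = b1 * m + (1 - b1) * grad        # first-moment (mean of gradients) estimate
    v = b2 * v + (1 - b2) * grad ** 2   # second-moment estimate
    m_hat = m / (1 - b1 ** t)           # bias correction for early steps
    v_hat = v / (1 - b2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

theta = np.zeros(3); m = np.zeros(3); v = np.zeros(3)
for t in range(1, 5001):
    grad = 2 * (theta - np.array([1.0, -2.0, 0.5]))  # toy quadratic objective
    theta, m, v = adam_step(theta, grad, m, v, t)
print(np.round(theta, 3))  # converges toward [1, -2, 0.5]
```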
Table 1. The mAP and IoU values of building detection in the test image of Dataset I under different iteration numbers.

Iterations | 4000  | 5000  | 6000  | 7000  | 8000  | 9000  | 10000
mAP        | 0.44  | 0.46  | 0.49  | 0.51  | 0.54  | 0.50  | 0.46
IoU        | 76.7% | 79.2% | 81.5% | 83.8% | 86.9% | 83.4% | 80.2%
Table 2. The mAP and IoU values of building detection in the test image of Dataset II under different iteration numbers.

Iterations | 4000  | 5000  | 6000  | 7000  | 8000  | 9000  | 10000
mAP        | 0.38  | 0.40  | 0.43  | 0.47  | 0.49  | 0.52  | 0.46
IoU        | 66.7% | 68.2% | 74.5% | 78.8% | 81.9% | 83.2% | 79.6%
Table 3. The mAP and IoU values of building detection in the test image of Dataset III under different iteration numbers.

Iterations | 4000  | 5000  | 6000  | 7000  | 8000  | 9000  | 10000
mAP        | 0.40  | 0.42  | 0.45  | 0.46  | 0.53  | 0.51  | 0.49
IoU        | 68.3% | 69.5% | 75.5% | 78.2% | 86.4% | 82.8% | 81.6%
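As a reference for how the mAP entries above can be obtained, the sketch below computes average precision from ranked detections (one common variant; the paper does not spell out its matching protocol, so the true-positive flags are assumed to be given, e.g., from an IoU test against the labeled boxes):

```python
import numpy as np

def average_precision(scores, is_true_positive, num_gt):
    """AP as the area under the precision-recall curve over ranked detections."""
    order = np.argsort(-np.asarray(scores))        # sort by descending confidence
    tp = np.asarray(is_true_positive, dtype=float)[order]
    cum_tp = np.cumsum(tp)
    precision = cum_tp / (np.arange(len(tp)) + 1)  # precision after each detection
    recall = cum_tp / num_gt                       # recall after each detection
    ap, prev_r = 0.0, 0.0
    for p, r in zip(precision, recall):            # integrate precision over recall
        ap += p * (r - prev_r)
        prev_r = r
    return ap

# Five detections over three ground-truth buildings.
print(average_precision([0.9, 0.8, 0.7, 0.6, 0.5], [1, 0, 1, 1, 0], num_gt=3))
```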
D. The effects of different deep learning network models
In this section, we use several representative deep learning models to test the performance of our method. The employed models, namely AlexNet [44], ZF [45], VGG-16 [46], GoogLeNet [47] and ResNet [48], have achieved excellent results in object detection and recognition competitions in recent years; each is integrated into our method to test the performance on our test datasets (Datasets I-III) under the same parameters (e.g., the number of training iterations is 8000, and the number of levels N in the multilevel training data set is 7). The final values of mAP and IoU are shown in Tables 4-6. From Tables 4-6 and the precision-recall (P-R) curves of the test results in Figs. 14-16, we can observe that the five models obtained good performance in terms of mAP and IoU when used in our method, which shows the advantage of our hierarchical building detection model.
Table 4. The values of mAP and IoU using different models in Dataset I.

Model | AlexNet | ZF    | VGG-16 | GoogLeNet | ResNet
mAP   | 0.49    | 0.51  | 0.54   | 0.53      | 0.56
IoU   | 71.4%   | 71.9% | 86.9%  | 85.1%     | 87.1%
Table 5. The values of mAP and IoU using different models in Dataset II.

Model | AlexNet | ZF    | VGG-16 | GoogLeNet | ResNet
mAP   | 0.43    | 0.47  | 0.52   | 0.52      | 0.54
IoU   | 67.3%   | 68.9% | 80.9%  | 79.1%     | 82.1%
Table 6. The values of mAP and IoU using different models in Dataset III.

Model | AlexNet | ZF    | VGG-16 | GoogLeNet | ResNet
mAP   | 0.45    | 0.49  | 0.56   | 0.55      | 0.57
IoU   | 69.2%   | 70.3% | 82.7%  | 81.1%     | 83.1%
Fig. 14. The precision-recall curves of several deep learning models in Dataset
I.
󰕭󱎵 󱡗󱱚󰉯 100%
󰕭󱎵 󱡗󱱚󰉯 100%, 󲼀 0.1 󱗶
󰕭󱎵 󱡗󱱚󰉯 100%
󰕭󱎵 󱡗󱱚󰉯 100%, 󲼀 0.1 󱗶
Fig. 15. The precision-recall curves of several deep learning models in Dataset
II.
Fig. 16. The precision-recall curves of several deep learning models in Dataset
III.
E. The impact of layer number N
In order to test the influence of the level number N (see Section III-A), we set the level number N from 1 to 10 and use the VGG-16 model [46] as the basic network to train the detection model. In the experiments, the number of training iterations is set to 8000, and the trained model is used to detect buildings in the test images. The results of building detection are evaluated using the average IoU value. As shown in Fig. 17, we used Datasets I-III to conduct the experiments. It is evident from Fig. 17 that the average IoU values of the building detection on the three datasets gradually increase as the level number N grows, and tend to be stable when the level number reaches 7 to 8 on Datasets I-III. This indicates the advantage of the multi-level training structure in improving the accuracy of building detection.
Fig. 17. The curves of average IoU values under different numbers of levels N.
F. The comparison with other methods
We compare our method with five other building detection methods, i.e., the DPM (deformable parts model) method [55] (Method I), the fast-RCNN (fast region convolutional neural networks) method [42] (Method II), the faster-RCNN [43] (Method III), the RICNN (rotation-invariant convolutional neural networks) [35] (Method IV), and the YOLO (you only look once) model [56] (Method V). Table 7 shows the comparison between our method and the other five methods. Our method is characterized by three aspects: i) the multi-level structure of the training data, ii) the deep learning model, and iii) the building region proposal networks. Among the compared methods, Method I uses neither a multilevel structure, a deep learning model, nor building region proposal networks; it detects buildings using spatial structure components, namely several high-resolution component templates extracted from the image samples, and adopts a sliding window to search for buildings in the remote sensing image. Method II uses selective search to generate building region proposals, without a multilevel structure or building region proposal networks. Method III extracts candidate areas by constructing region proposal networks, and the object features are built by the fast-RCNN model [42]; in contrast with our method, Method III targets multi-object detection and does not use multilevel training data. Method IV introduces a rotation-invariant layer on top of existing CNN models and learns a RICNN model to improve object detection performance. Method V proposes a neural network model that casts the object detection task as a regression problem and obtains the location and probability of each object. Neither Method IV nor Method V uses building region proposal networks or a multilevel structure of training data.
Table 7. The comparisons among our method and the other 5 methods.

Methods         | Multilevel structure | Deep learning model | Building region proposal networks
Our method      | √ | √ | √
Method I [55]   | × | × | ×
Method II [42]  | × | √ | ×
Method III [43] | × | √ | √
Method IV [35]  | × | √ | ×
Method V [56]   | × | √ | ×
In the experiments, the same datasets (Datasets I-III) and hardware environment were used to test the building detection performance, and the VGG-16 model [46] was set as the basic network structure unit of our method. The mAP values are shown in Tables 8-10. It can be seen from these results that the mAP value of our method reaches 0.57 in Table 8, 0.54 in Table 9 and 0.55 in Table 10, higher than all the other methods, improving the mAP values by 3.63%, 3.85% and 3.77% on Datasets I-III over the best competing approach (Method IV). These results illustrate the advantages of the multilevel structure, deep learning and the BRPN in our method. As far as the time complexity of building detection (Table 11) is concerned, our method incurs little time overhead, only slightly more than that of the fastest method (Method V), as Method V does not need to generate building region proposals. However, the object detection precision of Method V is lower than that of our method (see the mAP values in Tables 8-10), which again shows the advantages of the hierarchical training dataset and the BRPN. The P-R curves in Figs. 18-20 indicate that our method performs best, which illustrates the advantages of the multi-level training framework and the extraction of building region proposals using the BRPN.
Table 8. The mAP values of our method and the other 5 methods on Dataset I.

Methods         | mAP
Method I [55]   | 0.32
Method II [42]  | 0.41
Method III [43] | 0.52
Method IV [35]  | 0.55
Method V [56]   | 0.51
Our method      | 0.57
Table 9. The mAP values of our method and the other 5 methods on Dataset II.

Methods         | mAP
Method I [55]   | 0.34
Method II [42]  | 0.42
Method III [43] | 0.48
Method IV [35]  | 0.52
Method V [56]   | 0.50
Our method      | 0.54
Table 10. The mAP values of our method and the other 5 methods on Dataset III.

Methods         | mAP
Method I [55]   | 0.33
Method II [42]  | 0.42
Method III [43] | 0.50
Method IV [35]  | 0.53
Method V [56]   | 0.51
Our method      | 0.55
Table 11. The computational cost of building detection on Dataset I.

Methods         | Time cost (seconds)
Method I [55]   | 3.78
Method II [42]  | 2.14
Method III [43] | 0.26
Method IV [35]  | 1.97
Method V [56]   | 0.15
Our method      | 0.17
Fig. 18. The P-R curves of several building detection methods in Dataset I.
Fig. 19. The P-R curves of several building detection methods in Dataset II.
Fig. 20. The P-R curves of several building detection methods in Dataset III.
VI. CONCLUSION
In this paper, we propose a deep learning based hierarchical framework to automatically detect buildings in remote sensing images. In the proposed framework, we design the hierarchical training model using the Gaussian pyramid principle, to extract discriminative features at different scales and spatial extents.
Then, the deep learning model of hierarchical building
detection is constructed. Our method has been validated on
different scenes of remote sensing images. We have also
compared our method with five related methods (i.e. Methods
I-V in the Section V-F) qualitatively and quantitatively.
Experiments and comparisons with the state-of-the-art
methods clearly demonstrate the superiority of our method in
accurately and efficiently detecting the buildings in remote
sensing images. This underlines the advantages of hierarchical
training dataset, deep learning based building detection model,
and building region proposals using the BRPN.
In the future, we will consider the adaptive construction of hierarchical training datasets according to the content of on-ground objects [57], to more adequately extract the features of buildings.
REFERENCES
[1]. L. Guan, Y. Ding, X. Feng, and H. Zhang, "Digital Beijing construction and application based on the urban three-dimensional modelling and remote sensing monitoring technology," IEEE Int. Geosci. Remote Sens. Symp. (IGARSS), pp. 7299-7302, Jul. 2016.
[2]. M. Mazher Rathore, A. Ahmad, A. Paul, and S. Rho, "Urban planning and building smart cities based on the Internet of Things using Big Data analytics," Computer Networks, vol. 101, no. 4, pp. 63-80, Jul. 2016.
[3]. J. F. Pekel, A. Cottam, N. Gorelick, and A. S. Belward, "High-resolution mapping of global surface water and its long-term changes," Nature, vol. 540, no. 7633, pp. 418-432, Dec. 2016.
[4]. J. Leitloff, S. Hinz, and U. Stilla, "Vehicle detection in very high resolution satellite images of city areas," IEEE Trans. Geosci. Remote Sens., vol. 48, no. 7, pp. 2795-2806, Jul. 2010.
[5]. S. Tuermer, F. Kurz, P. Reinartz, and U. Stilla, "Airborne vehicle detection in dense urban areas using HoG features and disparity maps," IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 6, no. 6, pp. 2327-2337, Dec. 2013.
[6]. H. Grabner, T. T. Nguyen, B. Gruber, and H. Bischof, "On-line boosting based car detection from aerial images," ISPRS J. Photogramm. Remote Sens., vol. 63, no. 3, pp. 382-396, May 2008.
[7]. Ö. Aytekin, U. Zöngür, and U. Halici, "Texture-based airport runway detection," IEEE Geosci. Remote Sens. Lett., vol. 10, no. 3, pp. 471-475, May 2013.
[8]. J. Han, D. Zhang, G. Cheng, L. Guo, and J. Ren, "Object detection in optical remote sensing images based on weakly supervised learning and high-level feature learning," IEEE Trans. Geosci. Remote Sens., vol. 53, no. 6, pp. 3325-3337, Jun. 2015.
[9]. G. Cheng, J. Han, L. Guo, X. Qian, P. Zhou, X. Yao, and X. Hu, "Object detection in remote sensing imagery using a discriminatively trained mixture model," ISPRS J. Photogramm. Remote Sens., vol. 85, pp. 32-43, Nov. 2013.
[10]. D. Zhang, J. Han, G. Cheng, Z. Liu, S. Bu, and L. Guo, "Weakly supervised learning for target detection in remote sensing images," IEEE Geosci. Remote Sens. Lett., vol. 12, no. 4, pp. 701-705, Apr. 2015.
[11]. S. Xu, T. Fang, D. Li, and S. Wang, "Object classification of aerial images with bag-of-visual words," IEEE Geosci. Remote Sens. Lett., vol. 7, no. 2, pp. 366-370, Apr. 2010.
[12]. H. Sun, X. Sun, H. Wang, Y. Li, and X. Li, "Automatic target detection in high-resolution remote sensing images using spatial sparse coding bag-of-words model," IEEE Geosci. Remote Sens. Lett., vol. 9, no. 1, pp. 109-113, Jan. 2012.
[13]. Y. Zhang, L. Zhang, B. Du, and S. Wang, "A nonlinear sparse representation-based binary hypothesis model for hyperspectral target detection," IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 8, no. 6, pp. 2513-2522, Jun. 2015.
[14]. Y. Zhang, B. Du, and L. Zhang, "A sparse representation-based binary hypothesis model for target detection in hyperspectral images," IEEE Trans. Geosci. Remote Sens., vol. 53, no. 3, pp. 1346-1354, Mar. 2015.
[15]. N. Yokoya and A. Iwasaki, "Object detection based on sparse representation and Hough voting for optical remote sensing imagery," IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 8, no. 5, pp. 2053-2062, May 2015.
[16]. Y. Zhong, R. Feng, and L. Zhang, "Non-local sparse unmixing for hyperspectral remote sensing imagery," IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 7, no. 6, pp. 1889-1909, Jun. 2014.
[17]. Y. Xu, E. Carlinet, T. Geraud, and L. Najman, "Hierarchical segmentation using tree-based shape spaces," IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 3, pp. 457-469, Mar. 2017.
[18]. C. Farabet, C. Couprie, L. Najman, and Y. LeCun, "Learning hierarchical features for scene labeling," IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 8, pp. 1915-1929, Oct. 2013.
[19]. K. Yu, Y. Lin, and J. Lafferty, "Learning image representations from the pixel level via hierarchical sparse coding," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2011, pp. 1713-1720.
[20]. K. He, X. Zhang, S. Ren, and J. Sun, "Spatial pyramid pooling in deep convolutional networks for visual recognition," IEEE Trans. Pattern Anal. Mach. Intell., vol. 37, no. 9, pp. 1904-1916, Jan. 2015.
[21]. Z. Zhang, L. Zhang, X. Tong, T. Mathiopoulos, B. Guo, X. Huang, Z. Wang, and Y. Wang, "A multilevel point-cluster-based discriminative feature for ALS point cloud classification," IEEE Trans. Geosci. Remote Sens., vol. 54, no. 6, pp. 3309-3321, Jun. 2016.
[22]. Z. Zhang, L. Zhang, X. Tong, B. Guo, L. Zhang, and X. Xing, "Discriminative dictionary learning-based multi-level point-cluster features for ALS point cloud classification," IEEE Trans. Geosci. Remote Sens., vol. 54, no. 12, pp. 7309-7322, Dec. 2016.
[23]. R. Gaetano, G. Scarpa, and G. Poggi, "Hierarchical texture-based segmentation of multiresolution remote-sensing images," IEEE Trans. Geosci. Remote Sens., vol. 47, no. 7, pp. 2129-2141, Jan. 2009.
[24]. R. Trias-Sanz, G. Stamon, and J. Louchet, "Using colour, texture, and hierarchical segmentation for high-resolution remote sensing," ISPRS J. Photogramm. Remote Sens., vol. 63, no. 2, pp. 156-168, Mar. 2008.
[25]. C. Kurtz, N. Passat, P. Gancarski, and A. Puissant, "Extraction of complex patterns from multiresolution remote sensing images: A hierarchical top-down methodology," Pattern Recognition, vol. 45, no. 2, pp. 685-706, Feb. 2012.
[26]. C. Kurtz, A. Stumpf, J. P. Malet, P. Gançarski, A. Puissant, and N. Passat, "Hierarchical extraction of landslides from multiresolution remotely sensed optical images," ISPRS J. Photogramm. Remote Sens., vol. 87, no. 1, pp. 122-136, Jan. 2014.
[27]. T.-Y. Lin, P. Dollar, R. Girshick, K. He, B. Hariharan, and S. Belongie, "Feature pyramid networks for object detection," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2017.
[28]. B. Sirmacek and C. Unsalan, "Urban-area and building detection using SIFT keypoints and graph theory," IEEE Trans. Geosci. Remote Sens., vol. 47, no. 4, pp. 1156-1167, May 2009.
[29]. X. Huang and L. Zhang, "Morphological building/shadow index for building extraction from high-resolution imagery over urban areas," IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 5, no. 1, pp. 161-172, 2012.
[30]. X. Huang and L. Zhang, "An SVM ensemble approach combining spectral, structural, and semantic features for the classification of high-resolution remotely sensed imagery," IEEE Trans. Geosci. Remote Sens., vol. 51, no. 1, pp. 257-272, Jan. 2013.
[31]. X. Huang, H. Chen, and J. Gong, "Angular difference feature extraction for urban scene classification using ZY-3 multi-angle high-resolution satellite imagery," ISPRS J. Photogramm. Remote Sens., vol. 135, no. 1, pp. 127-141, 2018.
[32]. X. Huang, W. Yuan, J. Li, and L. Zhang, "A new building extraction post-processing framework for high spatial resolution remote sensing imagery," IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 10, no. 2, pp. 654-668, Feb. 2017.
[33]. F. Hu, G.-S. Xia, Z. Wang, X. Huang, and L. Zhang, "Unsupervised feature learning via spectral clustering of patches for remotely sensed scene classification," IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 8, no. 5, pp. 2015-2030, May 2015.
[34]. G. Xia, Z. Wang, C. Xiong, and L. Zhang, "Accurate annotation of remote sensing images via active spectral clustering with little expert knowledge," Remote Sens., vol. 7, no. 11, pp. 15014-15045, Nov. 2015.
[35]. G. Cheng, P. Zhou, and J. Han, "Learning rotation-invariant convolutional neural networks for object detection in VHR optical remote sensing images," IEEE Trans. Geosci. Remote Sens., vol. 54, no. 12, pp. 7405-7415, Dec. 2016.
[36]. Y. Long, Y. Gong, Z. Xiao, and Q. Liu, "Accurate object localization in remote sensing images based on convolutional neural networks," IEEE Trans. Geosci. Remote Sens., vol. 55, no. 5, pp. 2486-2498, Jan. 2017.
[37]. R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Region-based convolutional networks for accurate object detection and segmentation," IEEE Trans. Pattern Anal. Mach. Intell., vol. 38, no. 1, pp. 142-158, Jan. 2016.
[38]. F. Hu, G.-S. Xia, J. Hu, and L. Zhang, "Transferring deep convolutional neural networks for the scene classification of high-resolution remote sensing imagery," Remote Sens., vol. 7, no. 11, pp. 14680-14707, Nov. 2015.
[39]. M. Vakalopoulou, K. Karantzalos, N. Komodakis, and N. Paragios, "Building detection in very high resolution multispectral data with deep learning features," IEEE Int. Geosci. Remote Sens. Symp. (IGARSS), pp. 1873-1876, Jul. 2015.
[40]. G. Cheng and J. Han, "A survey on object detection in optical remote sensing images," ISPRS J. Photogramm. Remote Sens., vol. 117, pp. 11-28, Jul. 2016.
[41]. M. Suárez, V. M. Brea, J. Fernández-Berni, R. Carmona-Galán, and D. Cabello, "Low-power CMOS vision sensor for Gaussian pyramid extraction," IEEE Journal of Solid-State Circuits, vol. 52, no. 2, pp. 483-495, Feb. 2017.
[42]. R. Girshick, "Fast R-CNN," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Jun. 2015, pp. 1440-1448.
[43]. S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," in Proc. Neural Inf. Process. Syst. (NIPS), 2015, pp. 1137-1149.
[44]. A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Proc. Neural Inf. Process. Syst. (NIPS), 2012, pp. 1097-1105.
[45]. M. D. Zeiler and R. Fergus, "Visualizing and understanding convolutional networks," in Proc. Euro. Conf. Comput. Vis. (ECCV), 2014.
[46]. K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Sep. 2015, pp. 1-14.
[47]. C. Szegedy, W. Liu, Y. Q. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2015, pp. 1-9.
[48]. K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 770-778.
[49]. R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2014.
[50]. G. E. Hinton and R. R. Salakhutdinov, "Replicated softmax: An undirected topic model," in Proc. Neural Inf. Process. Syst. (NIPS), Nov. 2009, pp. 1607-1614.
[51]. Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, "Caffe: Convolutional architecture for fast feature embedding," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2014, pp. 1401-1410.
[52]. Y. Zhong, Q. Zhu, and L. Zhang, "Scene classification based on the multifeature fusion probabilistic topic model for high spatial resolution remote sensing imagery," IEEE Trans. Geosci. Remote Sens., vol. 53, no. 11, pp. 6207-6222, Nov. 2015.
[53]. E. Maggiori, Y. Tarabalka, G. Charpiat, and P. Alliez, "Can semantic labeling methods generalize to any city? The Inria aerial image labeling benchmark," IEEE Int. Geosci. Remote Sens. Symp. (IGARSS), 2017.
[54]. D. Kingma and J. Ba, "Adam: A method for stochastic optimization," in Proc. Int. Conf. Learn. Repr. (ICLR), May 2015.
[55]. P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan, "Object detection with discriminatively trained part-based models," IEEE Trans. Pattern Anal. Mach. Intell., vol. 32, no. 9, pp. 1627-1645, Sep. 2009.
[56]. J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You only look once: Unified, real-time object detection," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2015.
[57]. Y. Liu, M. Yu, M. Yu, and Y. He, "Manifold SLIC: A fast method to compute content-sensitive superpixels," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), pp. 651-659, 2016.
Yibo L iu received the Bachelor’s degree in Geographical Information
Engineering from Henan University of Technology, Zh engzhou, China,
in 2016. He is currently pursuin g the Master’s degree with Beijing
Advanced Innovation Center for Imaging Technology, Capital Normal
University, Beijing, China. His research interests include deep learning,
remote sensing images analysis and LiDAR-based urban modeling.
Zhenxin Zhang received the Ph.D. degree in
geoinformatics in the Sch ool of Geography, Beijing
Normal University, Beijing, Ch ina, in 2016. He is
currently an assistant professor at Beijing Advanced
Innovation Center for Imaging Technology and Ke y
Lab of 3D Information Acquisition and Application,
College of R esource Environment and Tourism,
Capital Normal University, Beijing, China, and he also
works as a c ooperator with Beijing Key Laboratory of
Urban Spatial Information Engineering, Beijing
Institute of Surveying and Mapping, Beijing, China. His research interests
include light detection and ranging data processing, qualit y analysis o f
geographic information systems, and algorithm development.
Ruofei Zhong received the Ph.D. degree in Geo-informatics from the Institute of Remote Sensing Applications, Chinese Academy of Sciences, Beijing, China, in 2005. He is currently a Professor with the Beijing Advanced Innovation Center for Imaging Technology and the Key Lab of 3D Information Acquisition and Application, College of Resource Environment and Tourism, Capital Normal University, Beijing, China. His research interests include light detection and ranging data processing and data collection systems with laser scanning.
Dong Chen received the Bachelor’s degree in Computer Science from Qingdao University of Science and Technology, Qingdao, China, the Master’s degree in Cartography and Geographical Information Engineering from Xi’an University of Science and Technology, Xi’an, China, and the Ph.D. degree in Geographical Information Sciences from Beijing Normal University, Beijing, China. He is currently an Assistant Professor with Nanjing Forestry University, Nanjing, China. He is also a Post-Doctoral Fellow with the Department of Geomatics Engineering, University of Calgary, Calgary, AB, Canada. His research interests include image- and LiDAR-based segmentation and reconstruction, full-waveform LiDAR data processing, and related remote sensing applications in the field of forest ecosystems.
Yinghai Ke received the Bachelor’s degree in Environmental Science from Wuhan University, Wuhan, China, the Master’s degree in Geographical Information Science from Peking University, Beijing, China, and the Ph.D. degree in Geospatial Sciences and Technology from the State University of New York College of Environmental Science and Forestry, Syracuse, New York, USA. She is currently an Associate Professor with Capital Normal University, Beijing, China. Her research interests include remote sensing image classification and its applications in urban environment and ecology.
Jiju Peethambaran received the Bachelor’s degree in information technology from the University of Calicut, Malappuram, India, the Master’s degree in computer science from the National Institute of Technology Karnataka, Mangalore, India, and the Ph.D. degree in computational geometry from IIT Madras, Chennai, India. He is currently a Post-Doctoral Researcher with the Department of Computer Science, University of Victoria, Victoria, BC, Canada. His research interests include computational geometry, geometric learning, real-time geometry processing, and related applications including motion capture for VR/AR and LiDAR-based urban modeling.
Chuqun Chen received the B.S. degree in engineering of geology and exploration from the Chengdu University of Technology, Chengdu, China, in 1982, the M.Sc. degree in cartography and remote sensing from the Institute of Remote Sensing Applications, Chinese Academy of Sciences, Beijing, China, in 1992, the Ph.D. degree in the science of physical oceanography (ocean color remote sensing) from the Graduate University of Chinese Academy of Sciences, Guangzhou, China, in 2006, and the Ph.D. degree in water resources engineering from Lund University, Sweden, in 2008. Since 1987, he has been working on remote sensing applications and has been a Principal Investigator for more than ten projects sponsored by the National 863 Program, the National 973 Program (subproject), the National Natural Science Foundation of China, the Scientific Foundation of Guangdong Province, the Chinese Academy of Sciences, and the Ministry of Science and Technology. His research interests include marine optics theory and optical data analyses, the atmospheric correction of optical satellite data in coastal areas, remotely sensed assessment of water quality, thermal infrared remote sensing, skin temperature measurement, and validation of satellite-retrieved SST with his own developed instrument, the Buoyant Equipment for Skin Temperature (BEST).
Lan Sun received the Bachelor’s degree from Beijing University of Civil Engineering and Architecture, Beijing, China, in 2012. He is currently pursuing the Master’s degree with the Beijing Advanced Innovation Center for Imaging Technology, Capital Normal University, Beijing, China. His research interests include spatial data mining, deep learning, and LiDAR-based classification and reconstruction.