Article

Building Outline Delineation From VHR Remote Sensing Images Using the Convolutional Recurrent Neural Network Embedded With Line Segment Information

Abstract

Recently, several recurrent neural network (RNN)-based models have been proposed to delineate the outlines of buildings from very high resolution (VHR) remote sensing images. These models first use convolutional neural networks (CNNs) to recognize boundary fragments by learning probability maps of both edges and corners and then feed them into an RNN that finds and links a set of sequential corners into the external boundaries of buildings. However, owing to the category imbalance of edges and corners, edge detection suffers from severe local ambiguity, which significantly affects the accuracy of the predicted outline corners. To tackle this challenge, this article introduces a convolutional RNN embedded with line segment information (LSI-RNN), a novel network that aims to directly detect line segments instead of edges. To achieve this, LSI-RNN utilizes an additional cotraining branch to generate an attraction field map (AFM) through a neural discriminative dimensionality reduction (NDDR) layer. Consequently, the conventional classification problem of edges is converted into a regression problem of line segments, thus resolving the aforementioned issue. Experimental results on three remote sensing datasets with different spatial resolutions show that the proposed method consistently outperforms other state-of-the-art methods.
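For readers who want a concrete picture of the attraction-field idea, the following is a minimal sketch (not the authors' implementation) of how a dense regression target could be built from ground-truth building line segments, assuming the AFM stores, for each pixel, the 2-D offset to its closest point on any segment; the function name and segment encoding are hypothetical.

```python
import numpy as np

def attraction_field_map(segments, height, width):
    """Dense regression target: for every pixel, the (dy, dx) offset to the
    closest point on any ground-truth line segment.
    `segments` holds rows of (y1, x1, y2, x2)."""
    ys, xs = np.mgrid[0:height, 0:width]
    pixels = np.stack([ys, xs], axis=-1).reshape(-1, 2).astype(np.float64)  # (H*W, 2)

    best_dist = np.full(pixels.shape[0], np.inf)
    best_offset = np.zeros((pixels.shape[0], 2))

    for y1, x1, y2, x2 in segments:
        p1 = np.array([y1, x1], dtype=np.float64)
        d = np.array([y2 - y1, x2 - x1], dtype=np.float64)
        # Project every pixel onto the segment and clamp to its endpoints.
        t = np.clip(((pixels - p1) @ d) / max(d @ d, 1e-12), 0.0, 1.0)
        offset = (p1 + t[:, None] * d) - pixels
        dist = np.linalg.norm(offset, axis=1)
        closer = dist < best_dist
        best_dist[closer] = dist[closer]
        best_offset[closer] = offset[closer]

    return best_offset.reshape(height, width, 2)
```

Under such a formulation, the cotraining branch regresses a smooth two-channel field rather than classifying sparse edge pixels, which is how the category-imbalance problem described in the abstract is sidestepped.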

... Huang et al. [24] and Liu et al. [25] concatenated a boundary map and extracted features to facilitate the propagation of boundary information in an end-to-end manner. Some studies have also applied adversarial loss to refine predictions. ...
... For example, Zorzi et al. [26] and Ding et al. [27] applied an additional discriminator network to ameliorate the boundary. Additionally, some studies treat a building as a set of lines and apply specific rules [11]–[14], [25] to compose building structure fragments (e.g., Nauata and Furukawa [11] and Liu et al. [14]), which first detected edge and corner primitives, and then composited them using the extracted topology. ...
Article
Full-text available
High-resolution remote-sensing imagery has proven useful for building extraction. Unfortunately, due to the high acquisition costs and infrequent availability of high-resolution imagery, low-resolution images are more practical for large-scale mapping or change tracking of buildings. However, extracting buildings from low-resolution images is a challenging task. Compared with high-resolution images, low-resolution images pose two critical challenges in terms of building segmentation: the effects of fuzzy boundary details on buildings and the lack of local textures. In this study, we propose a sparse geometric feature attention network (SGFANet) based on multi-level feature fusion to address the aforementioned issues. From the perspective of the fuzzy effect, SGFANet enhances the representative boundary features by calculating the point-wise affinity of the selected feature points in a top-down manner. From the perspective of lacking local textures, we convert the top-down propagation from local to non-local by introducing the grounding transformer harvesting the global attention of the input image. SGFANet outperforms competing baselines on remote-sensing images collected worldwide from multiple sensors at 4 and 10 m resolution, thereby improving the IoU by at least 0.66%. Notably, our method is robust and generalizable, which makes it useful for extending the accessibility and scalability of building dynamic tracking across developing areas (e.g., the Xiong’an New Area in China) by using low-resolution images.
... In more recent studies, researchers have realized contour extraction with a unified deep neural network framework. For example, a number of researchers Zhao et al., 2021;Liu et al., 2022;Li et al., 2019) have used a recurrent neural network (RNN) (Yu et al., 2019) to predict the corner vertices of buildings in the clockwise direction. However, this type of method often suffers from corner vertex loss and incorrect vertices. ...
... The HTC-DP (Zhao et al., 2020) and frame field learning (FFL) (Girard et al., 2021) methods are both mask-based methods, and the latter achieves the second-best result. The building outline delineation (BOD) (Liu et al., 2022) and PolyMapper (Li et al., 2019) methods use an RNN to iteratively predict the building corner vertices, which is prone to missing vertices and leads to a relatively poor accuracy. ...
Preprint
Deep learning based methods have significantly boosted the study of automatic building extraction from remote sensing images. However, delineating vectorized and regular building contours like a human does remains very challenging, due to the difficulty of the methodology, the diversity of building structures, and the imperfect imaging conditions. In this paper, we propose the first end-to-end learnable building contour extraction framework, named BuildMapper, which can directly and efficiently delineate building polygons just as a human does. BuildMapper consists of two main components: 1) a contour initialization module that generates initial building contours; and 2) a contour evolution module that performs both contour vertex deformation and reduction, which removes the need for complex empirical post-processing used in existing methods. In both components, we provide new ideas, including a learnable contour initialization method to replace the empirical methods, dynamic predicted and ground truth vertex pairing for the static vertex correspondence problem, and a lightweight encoder for vertex information extraction and aggregation, which benefit a general contour-based method; and a well-designed vertex classification head for building corner vertices detection, which casts light on direct structured building contour extraction. We also built a suitable large-scale building dataset, the WHU-Mix (vector) building dataset, to benefit the study of contour-based building extraction methods. The extensive experiments conducted on the WHU-Mix (vector) dataset, the WHU dataset, and the CrowdAI dataset verified that BuildMapper can achieve a state-of-the-art performance, with a higher mask average precision (AP) and boundary AP than both segmentation-based and contour-based methods.
... Some studies have attempted to use learning-based methods to adjust building outlines (Kong et al. 2022;Mai et al. 2023). For example, Liu et al. (2022) proposed a convolutional recurrent neural network (RNN) embedded with line segment information that can predict every possible corner of an individual building. Cheng et al. (2019) proposed a network based on polar coordinates that can prevent self-intersection and make the extracted outlines closer to the ground truth. ...
Article
Full-text available
Extracting buildings from remote sensing images is crucial, but the extracted outlines still face issues such as point redundancy and a lack of right-angle features, and therefore rely on further regularization. This study proposed an integrated regularization method to combine the strengths of multiple algorithms. Twenty-two metrics were calculated to describe the geometric characteristics of outlines, followed by principal component analysis for selecting key metrics. Further, a supervised learning model was constructed to analyze these key metrics and determine the most suitable algorithm among three candidates (rectangle transformation, recursive regression, and feature edge reconstruction) for regularizing each building. Experimental results demonstrated that our method can adaptively select the appropriate algorithm based on the metrics, achieving regularization results superior to those obtained by using algorithms independently. Compared with two existing geometric correction-based methods, our method excels in preserving the orientation, area, and shape. Our method also has advantages over learning-based methods in maintaining the orthogonality.
... They initiate from an initial contour vertex and predict vertices iteratively in a predefined direction until polygon closure is achieved. This approach is well-suited for buildings with regular shapes, leading to further investigations Zhao et al., 2021;Liu et al., 2022;Li et al., 2019) that apply and refine these techniques for polygonal building extraction. PolyWorld , on the other hand, achieves building polygon extraction by predicting building corner points and their adjacent relationships. ...
... Buildings, as typical man-made structures, exhibit sharp edges. However, the hierarchical enlarged receptive field of CNNs, despite their impressive power in semantic modeling, inevitably smooths features over boundaries Liu et al., 2022). Many dedicated building segmentation studies focus on enhancing representativeness with respect to boundaries. ...
Preprint
Full-text available
Most urban applications necessitate building footprints in the form of concise vector graphics with sharp boundaries rather than pixel-wise raster images. This need contrasts with the majority of existing methods, which typically generate over-smoothed footprint polygons. Editing these automatically produced polygons can be inefficient, if not more time-consuming than manual digitization. This paper introduces a semi-automatic approach for building footprint extraction through semantically-sensitive superpixels and neural graph networks. Drawing inspiration from object-based classification techniques, we first learn to generate superpixels that are not only boundary-preserving but also semantically-sensitive. The superpixels respond exclusively to building boundaries rather than other natural objects, while simultaneously producing semantic segmentation of the buildings. These intermediate superpixel representations can be naturally considered as nodes within a graph. Consequently, graph neural networks are employed to model the global interactions among all superpixels and enhance the representativeness of node features for building segmentation. Classical approaches are utilized to extract and regularize boundaries for the vectorized building footprints. Utilizing minimal clicks and straightforward strokes, we efficiently accomplish accurate segmentation outcomes, eliminating the necessity for editing polygon vertices. Our proposed approach demonstrates superior precision and efficacy, as validated by experimental assessments on various public benchmark datasets. We observe a 10% enhancement in the metric for superpixel clustering and an 8% increment in vector graphics evaluation, when compared with established techniques. Additionally, we have devised an optimized and sophisticated pipeline for interactive editing, poised to further augment the overall quality of the results.
... The blue, green, red, and near infrared (NIR) bands (B02, B03, B04, and B08, respectively) are of 10 m GSD, the vegetation red edge bands (B05, B06, and B07), narrow NIR (B08a), and the short-wave infrared (SWIR) bands (B11 and B12) are of 20 m GSD, while the remaining coastal aerosol (B01), water vapour (B09), and cirrus clouds estimation (B10) bands are of 60 m GSD. While such resolution is sufficient in many cases, it also constitutes a serious limitation for the tasks that require higher accuracy in the spatial domain, like precision farming [7] or object delineation [8], [9]. ...
Preprint
Multispectral Sentinel-2 images are a valuable source of Earth observation data, however spatial resolution of their spectral bands limited to 10 m, 20 m, and 60 m ground sampling distance remains insufficient in many cases. This problem can be addressed with super-resolution, aimed at reconstructing a high-resolution image from a low-resolution observation. For Sentinel-2, spectral information fusion allows for enhancing the 20 m and 60 m bands to the 10 m resolution. Also, there were attempts to combine multitemporal stacks of individual Sentinel-2 bands, however these two approaches have not been combined so far. In this paper, we introduce DeepSent, a new deep network for super-resolving multitemporal series of multispectral Sentinel-2 images. It is underpinned with information fusion performed simultaneously in the spectral and temporal dimensions to generate an enlarged multispectral image. In our extensive experimental study, we demonstrate that our solution outperforms other state-of-the-art techniques that realize either multitemporal or multispectral data fusion. Furthermore, we show that the advantage of DeepSent results from how these two fusion types are combined in a single architecture, which is superior to performing such fusion in a sequential manner. Importantly, we have applied our method to super-resolve real-world Sentinel-2 images, enhancing the spatial resolution of all the spectral bands to 3.3 m nominal ground sampling distance, and we compare the outcome with very high-resolution WorldView-2 images. We will publish our implementation upon paper acceptance, and we expect it will increase the possibilities of exploiting super-resolved Sentinel-2 images in real-life applications.
Preprint
Extracting building contours from remote sensing imagery is a significant challenge due to buildings' complex and diverse shapes, occlusions, and noise. Existing methods often struggle with irregular contours, rounded corners, and redundancy points, necessitating extensive post-processing to produce regular polygonal building contours. To address these challenges, we introduce a novel, streamlined pipeline that generates regular building contours without post-processing. Our approach begins with the segmentation of generic geometric primitives (which can include vertices, lines, and corners), followed by the prediction of their sequence. This allows for the direct construction of regular building contours by sequentially connecting the segmented primitives. Building on this pipeline, we developed P2PFormer, which utilizes a transformer-based architecture to segment geometric primitives and predict their order. To enhance the segmentation of primitives, we introduce a unique representation called group queries. This representation comprises a set of queries and a singular query position, which improve the focus on multiple midpoints of primitives and their efficient linkage. Furthermore, we propose an innovative implicit update strategy for the query position embedding aimed at sharpening the focus of queries on the correct positions and, consequently, enhancing the quality of primitive segmentation. Our experiments demonstrate that P2PFormer achieves new state-of-the-art performance on the WHU, CrowdAI, and WHU-Mix datasets, surpassing the previous SOTA PolyWorld by a margin of 2.7 AP and 6.5 AP75 on the largest CrowdAI dataset. We intend to make the code and trained weights publicly available to promote their use and facilitate further research.
Article
Three-dimensional (3D) building models play a vital role in numerous applications including urban planning and smart cities. Recent 3D building modeling methods either rely heavily on available manually collected footprint references or hardly reach real automation on par with manual editing. To approach the automated extraction of instance-level 3D buildings at Level of Detail (LoD) 1, we introduce an innovative end-to-end 3D building instance segmentation model. This model predicts accurate contours and heights of individual buildings simultaneously using ortho-rectified high-resolution remote sensing images and Digital Surface Models (DSMs), getting rid of additional reference data and empirical parameter settings. Firstly, we propose an Anchor-Free Multi-head building extraction network (AFM) tailored for extracting 2D building contours. AFM incorporates a full-resolution, long-range correlation boosted global mask prediction branch along with anchor-free bounding box generation, as well as a newly developed online hard sample mining (OHSM) training procedure based on uncertainty analysis to emphasize error-prone positions in locating building contours. Subsequently, we incorporate a height prediction component into AFM in order to derive accurate building height information, thus creating the comprehensive 3D building extraction model referred to as AFM-3D. The two-stage AFM-3D operates by initially predicting 3D cube proposals, followed by generating refined 3D prismatic models (LoD1 models) for each proposal. Thorough experimentation across different datasets demonstrates the superior performance of AFM and AFM-3D. A significant enhancement of 6.4% quality score is observed on the urban 3D dataset in comparison to recent methods. In addition to the proposed novel methodology, we compare anchor-based and anchor-free bounding box generation mechanisms for remote sensing data, explore pixel-based and contour-based segmentation strategies, evaluate learning-based and empirical height estimation methods, and discuss the indispensability of DSM data in 3D building instance extraction. These analyses yield valuable insights that contribute to the progression of 3D building extraction research.
Article
Optical remote-sensing image target detection is of significant research value in various domains, including disaster relief, ecological environment protection, and military surveillance. However, since remote-sensing images have multiscale targets, complex backgrounds, and many small targets, the performance of existing network models in remote-sensing image target detection falls short of expectations. In addition, we note that current networks use complex computational mechanisms that make the models time-consuming, which hinders their practicability in remote-sensing target detection scenarios. In response to this challenge, we propose an anchor-free and efficient one-stage target detection method for optical remote-sensing images. First, we propose the lightweight context-aware module GSelf-Attention, injected into the feature fusion network from top-to-bottom and bottom-to-top to enhance the feature information interaction. Second, we propose ELAN-RSN, which uses an optimized residual shrinkage network (RSN) to eliminate background noise and conflicting information in the multiscale feature fusion. Finally, we introduce the decoupled head fused with SPDConv to further enhance the detection accuracy of small target objects. The performance of the proposed algorithm is compared with that of other advanced methods on the DIOR and RSOD datasets. The experimental results show that the proposed algorithm significantly improves object detection accuracy while ensuring detection efficiency and has high robustness. The code is available at https://github.com/FF-codeHouse/Object-Detection/tree/remote-sensing.
Article
Multispectral Sentinel-2 images are a valuable source of Earth observation data, however spatial resolution of their spectral bands limited to 10 m, 20 m, and 60 m ground sampling distance remains insufficient in many cases. This problem can be addressed with super-resolution, aimed at reconstructing a high-resolution image from a low-resolution observation. For Sentinel-2, spectral information fusion allows for enhancing the 20 m and 60 m bands to the 10 m resolution. Also, there were attempts to combine multitemporal stacks of individual Sentinel-2 bands, however these two approaches have not been combined so far. In this paper, we introduce DeepSent—a new deep network for super-resolving multitemporal series of multispectral Sentinel-2 images. It is underpinned with information fusion performed simultaneously in the spectral and temporal dimensions to generate an enlarged multispectral image. In our extensive experimental study, we demonstrate that our solution outperforms other state-of-the-art techniques that realize either multitemporal or multispectral data fusion. Furthermore, we show that the advantage of DeepSent results from how these two fusion types are combined in a single architecture, which is superior to performing such fusion in a sequential manner. Importantly, we have applied our method to super-resolve real-world Sentinel-2 images, enhancing the spatial resolution of all the spectral bands to 3.3 m nominal ground sampling distance, and we compare the outcome with very high-resolution WorldView-2 images. We will publish our implementation upon paper acceptance, and we expect it will increase the possibilities of exploiting super-resolved Sentinel-2 images in real-life applications.
Article
Full-text available
Large-scale and multi-annual maps of building rooftop area (BRA) are crucial for addressing policy decisions and sustainable development. In addition, as a fine-grained indicator of human activities, BRA could contribute to urban planning and energy modeling to provide benefits to human well-being. However, it is still challenging to produce a large-scale BRA due to the rather tiny sizes of individual buildings. From the viewpoint of classification methods, conventional approaches utilize high-resolution aerial images (metric or submetric resolution) to map BRA; unfortunately, high-resolution imagery is both infrequently captured and expensive to purchase, making the BRA mapping costly and inadequate over a consistent spatiotemporal scale. From the viewpoint of learning strategies, there is a nontrivial gap that persists between the limited training references and the applications over geospatial variations. Despite the difficulties, existing large-scale BRA datasets, such as those from Microsoft or Google, do not include China, and hence there are no full-coverage maps of BRA in China yet. In this paper, we first propose a deep-learning method, named the Spatio-Temporal aware Super-Resolution Segmentation framework (STSR-Seg), to achieve robust super-resolution BRA extraction from relatively low-resolution imagery over a large geographic space. Then, we produce the multi-annual China Building Rooftop Area (CBRA) dataset with 2.5 m resolution from 2016–2021 Sentinel-2 images. CBRA is the first full-coverage and multi-annual BRA dataset in China. With the designed training-sample-generation algorithms and the spatiotemporally aware learning strategies, CBRA achieves good performance with an F1 score of 62.55 % (+10.61 % compared with the previous BRA data in China) based on 250 000 testing samples in urban areas and a recall of 78.94 % based on 30 000 testing samples in rural areas. Temporal analysis shows good performance consistency over years and good agreement with other multi-annual impervious surface area datasets. STSR-Seg will enable low-cost, dynamic, and large-scale BRA mapping (https://github.com/zpl99/STSR-Seg, last access: 12 July 2023). CBRA will foster the development of BRA mapping and therefore provide basic data for sustainable research (Liu et al., 2023; https://doi.org/10.5281/zenodo.7500612).
Article
A representation that quantifies the geometric shape and topology of a building is a necessary element of many urban planning applications. A sharp line framework is a high-level structural cue providing a compact building representation. However, accurate and efficient structural line extraction remains a challenging task given the variety and complexity of buildings. This study proposes a general 3-D structural line extraction method from point clouds. The building points are extracted and further divided into various single-building units. In the proposed 3-D structural line extraction method, an individual building point cloud is the input. First, the corners are detected by an associative learning module. Next, the curve connection is implemented by a link prediction block based on the graph neural network (GNN) embedded with corner information. After that, the obtained curves are subsequently converted into a topological graph. Finally, the corner points are optimized to achieve precise fitting of the structural lines. The experiments and comparisons on two airborne laser scanning (ALS) point cloud datasets demonstrate the effectiveness of the proposed method and its ability to retrieve ideal structural line results for building point clouds. Furthermore, without reprocessing, the proposed method yielded better results for various dataset types (outdoor building, indoor scene, and furniture point clouds) than the prevalent published methods (i.e., EC-Net, PIE-Net, and PC2WF), verifying its strength and efficacy. To further verify the accuracy of the obtained structural lines, we also introduce a line-based model reconstruction method that employs these lines for building reconstruction.
Preprint
Full-text available
Large-scale and multi-annual maps of building rooftop area (BRA) are crucial for addressing policy decisions and sustainable development. In addition, as a fine-grained indicator of human activities, BRA could contribute to urban planning and energy modelling to provide benefits to human well-being. However, it is still challenging to produce large-scale BRA due to the rather tiny size of individual buildings. From the viewpoint of classification methods, conventional approaches utilize high-resolution aerial images (metric or sub-metric resolution) to map BRA; unfortunately, high-resolution imagery is both infrequently captured and expensive to purchase, making the BRA mapping costly and inadequate over a consistent spatio-temporal scale. From the viewpoint of learning strategies, there is a non-trivial gap that persists between the limited training references and the applications over geospatial variations. Despite the difficulties, existing large-scale BRA datasets, such as those from Microsoft or Google, do not include China, hence there are no full-coverage maps of BRA in China yet. In this paper, we first propose a deep-learning method, named Spatio-Temporal aware Super-Resolution Segmentation framework (STSR-Seg) to achieve robust super-resolution BRA extraction from relatively low-resolution imagery over a large geographic space. Then, we produce the multi-annual China building rooftop area dataset (CBRA) with 2.5 m resolution from 2016–2021 Sentinel-2 images. The CBRA is the first full-coverage and multi-annual BRA data in China. With the designed training sample generation algorithms and the spatio-temporal aware learning strategies, the CBRA achieves good performance with an F1 score of 62.55 % (+10.61 % compared with the previous BRA data in China) based on 250,000 testing samples in urban areas, and a recall of 78.94 % based on 30,000 testing samples in rural areas. Temporal analysis shows good performance consistency over years and good agreement with other multi-annual impervious surface area datasets. The STSR-Seg will enable low-cost, dynamic and large-scale BRA mapping (https://github.com/zpl99/STSR-Seg). The CBRA will foster the development of BRA mapping and therefore provide basic data for sustainable research (Liu et al., 2023; https://doi.org/10.5281/zenodo.7500612).
Article
Full-text available
Semantic and instance segmentation methods are commonly used for building extraction from high-resolution images. The semantic segmentation method involves assigning a class label to each pixel in the image, thus ignoring the geometry of the building rooftop, which results in irregular shapes of the rooftop edges. As for instance segmentation, there is a strong assumption within this method that there exists only one outline polygon along the rooftop boundary. In this paper, we present a novel method to sequentially delineate exterior and interior contours of rooftops with holes from VHR aerial images, where most of the buildings have holes, by integrating semantic segmentation and polygon delineation. Specifically, semantic segmentation from the Mask R-CNN is used as a prior for hole detection. Then, the holes are used as objects for generating the internal contours of the rooftop. The external and internal contours of the rooftop are inferred separately using a convolutional recurrent neural network. Experimental results showed that the proposed method can effectively delineate the rooftops with both one and multiple polygons and outperform state-of-the-art methods in terms of the visual results and six statistical indicators, including IoU, OA, F1, BoundF, RE and Hd.
Article
Full-text available
Zhang, T.; Tang, H.; Ding, Y.; Li, P.; Ji, C.; Xu, P. FSRSS-Net: High-Resolution Mapping of Buildings from Middle-Resolution Satellite Images Using a Super-Resolution Semantic Segmentation Network. Remote Sensing.
Article
Full-text available
It is an important task to automatically and accurately map rooftops from very high resolution remote sensing images since buildings are very closely related to human activity. Two typical technologies are often utilized to accomplish the task, i.e., semantic segmentation and instance segmentation. Semantic segmentation independently allocates a label (e.g., "building" or not) to each pixel, resulting in blob-like segments. On the contrary, one might model the boundary of a rooftop as a polygon to improve the shape of the rooftop by encouraging the polygon's vertices to adhere to the rooftop's boundary. Following this line of work, we present a multitask learning approach to predict rooftop corners sequentially using the attention learned from where the boundaries are in a given image region. The approach simulates the process of manual delineation of rooftop outlines in a given image, which can produce accurate boundaries of rooftops with sharp corners and straight lines between them. Specifically, the proposed method consists of three components, i.e., object detection, pixel-by-pixel classification of both edges and corners, and delineation of rooftops in a sequential manner using a convolutional recurrent neural network (RNN). It is called the object-oriented edges and corners (OEC)-RNN in this article. Three image datasets of buildings are employed to validate the performance of the OEC-RNN, which is compared with state-of-the-art methods for instance segmentation. The experimental results show that the OEC-RNN achieves the best performance in terms of overlay, boundary adherence, and vertex location between ground-truth and predicted polygons.
Article
Full-text available
Stereo photogrammetric surveys have traditionally been used to extract building heights, which are then converted into numbers of stories through certain rules, in order to estimate the number of stories of buildings by means of satellite remote sensing. In contrast, in this paper we propose a new method that uses deep learning to estimate the number of stories of buildings from monocular optical satellite images end to end. To the best of our knowledge, this is the first attempt to directly estimate the number of stories of buildings from monocular satellite images. Specifically, in the proposed method, we extend a classic object detection network, i.e., Mask R-CNN, by adding a new head to predict the number of stories of detected buildings from satellite images. GF-2 images from nine cities in China are used to validate the effectiveness of the proposed method. The experimental results show that the mean absolute errors of prediction on buildings with 1–7, 8–20, and above 20 stories are 1.329, 3.546, and 8.317, respectively, which indicates that our method has potential for application to low-rise buildings, but the accuracy on middle-rise and high-rise buildings needs to be further improved.
Article
Full-text available
Modern convolutional neural networks (CNNs) are often trained on pre-set data sets of a fixed size. For large-scale applications of satellite images, such as global or regional mapping, images are generally collected incrementally in multiple stages. In other words, the size of the training dataset might grow over the course of a mapping task rather than being fixed beforehand. In this paper, we present a novel algorithm, called GeoBoost, for the incremental-learning tasks of semantic segmentation via convolutional neural networks. Specifically, the GeoBoost algorithm is trained in an end-to-end manner on newly available data, and it does not decrease the performance of previously trained models. The effectiveness of the GeoBoost algorithm is verified on the large-scale DREAM-B data set. This method avoids the need for training on the enlarged data set from scratch and would become more effective as more data become available.
Article
Full-text available
Automatic building extraction from optical imagery remains a challenge due to, for example, the complexity of building shapes. Semantic segmentation is an efficient approach for this task. The latest development in deep convolutional neural networks (DCNNs) has made accurate pixel-level classification tasks possible. Yet one central issue remains: the precise delineation of boundaries. Deep architectures generally fail to produce fine-grained segmentation with accurate boundaries due to their progressive down-sampling. Hence, we introduce a generic framework to overcome the issue, integrating the graph convolutional network (GCN) and deep structured feature embedding (DSFE) into an end-to-end workflow. Furthermore, instead of using a classic graph convolutional neural network, we propose a gated graph convolutional network, which enables the refinement of weak and coarse semantic predictions to generate sharp borders and fine-grained pixel-level classification. Taking the semantic segmentation of building footprints as a practical example, we compared different feature embedding architectures and graph neural networks. Our proposed framework with the new GCN architecture outperforms state-of-the-art approaches. Although our main task in this work is building footprint extraction, the proposed method can be generally applied to other binary or multi-label segmentation tasks.
Article
Full-text available
We propose an approach for semi-automatic annotation of object instances. While most current methods treat object segmentation as a pixel-labeling problem, we here cast it as a polygon prediction task, mimicking how most current datasets have been annotated. In particular, our approach takes as input an image crop and sequentially produces vertices of the polygon outlining the object. This allows a human annotator to interfere at any time and correct a vertex if needed, producing as accurate segmentation as desired by the annotator. We show that our approach speeds up the annotation process by a factor of 4.7 across all classes in Cityscapes, while achieving 78.4% agreement in IoU with original ground-truth, matching the typical agreement between human annotators. For cars, our speed-up factor is 7.3 for an agreement of 82.2%. We further show generalization capabilities of our approach to unseen datasets.
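The sequential vertex prediction that this line of work (and the RNN-based building delineation methods cited above) relies on can be summarized by a greedy decoding loop. The sketch below is illustrative only; the `decoder` callable and its signature are hypothetical placeholders for the recurrent part of such a model.

```python
import torch

def decode_polygon(decoder, image_feats, first_vertex, max_vertices=60):
    """Greedy decoding loop: each step predicts a spatial distribution over the
    next vertex plus an end-of-sequence class, conditioned on image features and
    the previously emitted vertex. `decoder` is a hypothetical recurrent cell
    returning (logits, state)."""
    vertices = [first_vertex]
    state = None
    grid_w = image_feats.shape[-1]
    for _ in range(max_vertices):
        logits, state = decoder(image_feats, vertices[-1], state)
        idx = int(torch.argmax(logits))
        if idx == logits.numel() - 1:                    # end-of-polygon token
            break
        vertices.append((idx // grid_w, idx % grid_w))   # (row, col) on the grid
    return vertices
```

Because the annotator can overwrite any emitted vertex before the next step, the same loop supports the semi-automatic, human-in-the-loop workflow described in the abstract.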
Conference Paper
Full-text available
State-of-the-art models for semantic segmentation are based on adaptations of convolutional networks that had originally been designed for image classification. However, dense prediction problems such as semantic segmentation are structurally different from image classification. In this work, we develop a new convolutional network module that is specifically designed for dense prediction. The presented module uses dilated convolutions to systematically aggregate multi-scale contextual information without losing resolution. The architecture is based on the fact that dilated convolutions support exponential expansion of the receptive field without loss of resolution or coverage. We show that the presented context module increases the accuracy of state-of-the-art semantic segmentation systems. In addition, we examine the adaptation of image classification networks to dense prediction and show that simplifying the adapted network can increase accuracy.
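As a rough illustration of the dilated context module described above, the following PyTorch sketch stacks 3x3 convolutions with exponentially increasing dilation; with four layers (dilations 1, 2, 4, 8) the receptive field grows to roughly 31x31 pixels while the feature-map resolution is preserved. The channel width and layer count are assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class ContextModule(nn.Module):
    """Stacked 3x3 convolutions with dilations 1, 2, 4, 8: padding equal to the
    dilation keeps the spatial size, while the receptive field grows to about
    2^(L+1) - 1 pixels for L layers."""
    def __init__(self, channels=64):
        super().__init__()
        layers = []
        for d in (1, 2, 4, 8):
            layers += [nn.Conv2d(channels, channels, 3, padding=d, dilation=d),
                       nn.ReLU(inplace=True)]
        self.body = nn.Sequential(*layers)

    def forward(self, x):
        return self.body(x)

feats = torch.randn(1, 64, 128, 128)
print(ContextModule()(feats).shape)   # torch.Size([1, 64, 128, 128])
```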
Article
Full-text available
The goal of precipitation nowcasting is to predict the future rainfall intensity in a local region over a relatively short period of time. Very few previous studies have examined this crucial and challenging weather forecasting problem from the machine learning perspective. In this paper, we formulate precipitation nowcasting as a spatiotemporal sequence forecasting problem in which both the input and the prediction target are spatiotemporal sequences. By extending the fully connected LSTM (FC-LSTM) to have convolutional structures in both the input-to-state and state-to-state transitions, we propose the convolutional LSTM (ConvLSTM) and use it to build an end-to-end trainable model for the precipitation nowcasting problem. Experiments show that our ConvLSTM network captures spatiotemporal correlations better and consistently outperforms FC-LSTM and the state-of-the-art operational ROVER algorithm for precipitation nowcasting.
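A ConvLSTM cell replaces the fully connected gate transforms of a standard LSTM with convolutions, so the hidden and cell states remain spatial maps. The sketch below is a commonly used simplified formulation (it omits the Hadamard peephole terms used in the original paper), not the authors' exact code.

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """Minimal ConvLSTM cell: one convolution over [x, h] produces the four
    gates; hidden state h and cell state c keep their HxW spatial layout."""
    def __init__(self, in_ch, hid_ch, kernel_size=3):
        super().__init__()
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, kernel_size,
                               padding=kernel_size // 2)

    def forward(self, x, h, c):
        i, f, o, g = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
        c_next = f * c + i * torch.tanh(g)     # update the cell state
        h_next = o * torch.tanh(c_next)        # emit the new hidden state
        return h_next, c_next
```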
Article
Full-text available
The standardization of evaluation techniques for building extraction is an unresolved issue in the fields of remote sensing, photogrammetry, and computer vision. In this letter, we propose a metric with a working title “PoLiS metric” to compare two polygons. The PoLiS metric is a positive-definite and symmetric function that satisfies a triangle inequality. It accounts for shape and accuracy differences between the polygons, is straightforward to apply, and requires no thresholds. We show through an example that the PoLiS metric between two polygons changes approximately linearly with respect to small translation, rotation, and scale changes. Furthermore, we compare building polygons extracted from a digital surface model to the reference building polygons by computing PoLiS, Hausdorff, and Chamfer distances. The results show that quantification by the PoLiS distance of the dissimilarity between polygons is consistent with visual perception. Furthermore, Hausdorff and Chamfer distances overrate the dissimilarity when one polygon has more vertices than the other. We propose an approach toward standardizing building extraction evaluation, which may also have broader applications in the field of shape similarity.
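The PoLiS distance can be read as a symmetric, vertex-averaged point-to-boundary distance between the two polygons. The NumPy sketch below follows that reading; treat it as an approximation of the published definition rather than a reference implementation.

```python
import numpy as np

def point_to_segment(p, a, b):
    """Distance from point p to segment ab (all 2-D numpy arrays)."""
    d = b - a
    t = np.clip(np.dot(p - a, d) / max(np.dot(d, d), 1e-12), 0.0, 1.0)
    return np.linalg.norm(p - (a + t * d))

def point_to_boundary(p, poly):
    """Distance from p to the closed boundary of polygon `poly` (N x 2 vertices)."""
    return min(point_to_segment(p, poly[i], poly[(i + 1) % len(poly)])
               for i in range(len(poly)))

def polis(poly_a, poly_b):
    """Symmetric average vertex-to-boundary distance between two polygons."""
    a_to_b = np.mean([point_to_boundary(p, poly_b) for p in poly_a])
    b_to_a = np.mean([point_to_boundary(p, poly_a) for p in poly_b])
    return 0.5 * (a_to_b + b_to_a)
```

Because every vertex is compared against the other polygon's boundary (not just its vertices), the measure is less sensitive than Hausdorff or Chamfer distances to one polygon having many more vertices than the other, which matches the behavior reported in the abstract.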
Article
Buildings serve as the main places of human activities, and it is essential to automatically extract each building instance for a wide range of applications. Recently, automatic building segmentation approaches have made great progress in both detection and segmentation accuracy due to the rapid development of deep learning. However, these approaches struggle to delineate regular and accurate building boundaries due to the limitations in inferring overall structure of the building instance; this might lead to inconsistency in building geometry and difficulty in being applied directly to practical engineering. To tackle this challenge, this article presents an adaptive polygon generation algorithm (APGA), a novel method that aims at directly generating a polygonal output, parameterized as a sequence of building vertices, to outline each building instance. To achieve this, APGA predicts the candidate locations of building vertices and determines the arrangement of these vertices with the help of the position and orientation of the building boundary. Moreover, to introduce local context features and achieve improved performance of the predicted building polygon, APGA integrates finer structures around the candidate vertices to refine their positions. Experiments on several challenging building extraction datasets demonstrated that APGA outperformed state-of-the-art methods in terms of building coverage and geometric similarity.
Article
Deep learning methods based upon convolutional neural networks (CNNs) have demonstrated impressive performance in the task of building outline delineation from very high resolution (VHR) remote sensing (RS) imagery. In this paper, we introduce an improved method that is able to predict regularized building outlines in a vector format within an end-to-end deep learning framework. The main idea of our framework is to learn to predict the location of key vertices of the buildings and connect them in sequence. The proposed method is based on PolyMapper. We upgrade the feature extraction by introducing global context and boundary refinement blocks and add channel and spatial attention modules to improve the effectiveness of the detection module. In addition, we introduce stacked conv-GRU to further preserve the geometric relationship between vertices and accelerate inference. We tested our method on two large-scale VHR-RS building extraction datasets. The results on both COCO and PoLiS metrics demonstrate better performance compared with Mask R-CNN and PolyMapper. Specifically, we achieve 4.2 mask mean average precision (mAP) and 3.7 mean average recall (mAR) absolute improvements compared to PolyMapper. Also, the qualitative comparison shows that our method significantly improves the instance segmentation of buildings of various shapes.
Article
A primary challenge in cloud detection is associated with highly mixed scenes that are filled with broken and thin clouds over inhomogeneous land. To tackle this challenge, we developed a new algorithm called the Random-Forest-based cloud mask (RFmask), which can improve the accuracy of cloud identification from Landsat Thematic Mapper (TM), Enhanced Thematic Mapper Plus (ETM+), and Operational Land Imager and Thermal Infrared Sensor (OLI/TIRS) images. For the development and validation of the algorithm, we first chose the stratified sampling method to pre-select cloudy and clear-sky pixels to form a prior-pixel database according to the land use cover around the world. Next, we select typical spectral channels and calculate spectral indices based on the spectral reflection characteristics of different land cover types using the top-of-atmosphere reflectance and brightness temperature. These are then used as inputs to the RF model for training and establishing a preliminary cloud detection model. Finally, the Super-pixels Extracted via Energy-Driven Sampling (SEEDS) segmentation approach is applied to re-process the preliminary classification results in order to obtain the final cloud detection results. The RFmask detection results are evaluated against the globally distributed United States Geological Survey (USGS) cloud-cover assessment validation products. The average overall accuracy for RFmask cloud detection reaches 93.8% (Kappa coefficient = 0.77) with an omission error of 12.0% and a commission error of 7.4%. The RFmask algorithm is able to identify broken and thin clouds over both dark and bright surfaces. The new model generally outperforms other methods that are compared here, especially over these challenging scenes. The RFmask algorithm is not only accurate but also computationally efficient. It is potentially useful for a variety of applications in using Landsat data, especially for monitoring land cover and land-use changes.
Chapter
Active Contour (AC)-based segmentation has been widely used to solve many image processing problems, especially image segmentation. While these AC-based methods offer object shape constraints, they typically look for strong edges or rely on statistical modeling for successful segmentation. Clearly, AC-based approaches lack a way to work with labeled images in a supervised machine learning framework. Furthermore, they are unsupervised approaches and strongly depend on many parameters that are chosen empirically. Recently, Deep Learning (DL) has become the go-to method for solving many problems in various areas. Over the past decade, DL has achieved remarkable success in various artificial intelligence research areas. DL comprises supervised methods and requires large volumes of ground truth. This paper first provides the fundamentals of both Active Contour techniques and the Deep Learning framework. We then present state-of-the-art approaches that incorporate Active Contour techniques into the Deep Learning framework.
Article
This study proposes an automatic building footprint extraction framework that consists of a convolutional neural network (CNN)-based segmentation and an empirical polygon regularization that transforms segmentation maps into structured individual building polygons. The framework attempts to replace part of the manual delineation of building footprints that are involved in surveying and mapping field with algorithms. First, we develop a scale robust fully convolutional network (FCN) by introducing multiple scale aggregation of feature pyramids from convolutional layers. Two postprocessing strategies are introduced to refine the segmentation maps from the FCN. The refined segmentation maps are vectorized and polygonized. Then, we propose a polygon regularization algorithm consisting of a coarse and fine adjustment, to translate the initial polygons into structured footprints. Experiments on a large open building data set including 181,000 buildings showed that our algorithm reached a high automation level where at least 50% of individual buildings in the test area could be delineated to replace manual work. Experiments on different data sets demonstrated that our FCN-based segmentation method outperformed several most recent segmentation methods, and our polygon regularization algorithm is robust in challenging situations with different building styles, image resolutions, and even low-quality segmentation.
Article
The application of the convolutional neural network has shown to greatly improve the accuracy of building extraction from remote sensing imagery. In this paper, we created and made open a high-quality multisource data set for building detection, evaluated the accuracy obtained in most recent studies on the data set, demonstrated the use of our data set, and proposed a Siamese fully convolutional network model that obtained better segmentation accuracy. The building data set that we created contains not only aerial images but also satellite images covering 1000 km² with both raster labels and vector maps. The accuracy of applying the same methodology to our aerial data set outperformed several other open building data sets. On the aerial data set, we gave a thorough evaluation and comparison of most recent deep learning-based methods, and proposed a Siamese U-Net with shared weights in two branches, and original images and their down-sampled counterparts as inputs, which significantly improves the segmentation accuracy, especially for large buildings. For multisource building extraction, the generalization ability is further evaluated and extended by applying a radiometric augmentation strategy to transfer pretrained models on the aerial data set to the satellite data set. The designed experiments indicate our data set is accurate and can serve multiple purposes including building instance segmentation and change detection; our result shows the Siamese U-Net outperforms current building extraction methods and could provide valuable reference.
Article
Central to the looming paradigm shift toward data-intensive science, machine-learning techniques are becoming increasingly important. In particular, deep learning has proven to be both a major breakthrough and an extremely powerful tool in many fields. Shall we embrace deep learning as the key to everything? Or should we resist a black-box solution? These are controversial issues within the remote-sensing community. In this article, we analyze the challenges of using deep learning for remote-sensing data analysis, review recent advances, and provide resources we hope will make deep learning in remote sensing seem ridiculously simple. More importantly, we encourage remote-sensing scientists to bring their expertise into deep learning and use it as an implicit general model to tackle unprecedented, large-scale, influential challenges, such as climate change and urbanization.
Technical Report
In this work we investigate the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting. Our main contribution is a thorough evaluation of networks of increasing depth, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers. These findings were the basis of our ImageNet Challenge 2014 submission, where our team secured the first and the second places in the localisation and classification tracks respectively.
Conference Paper
We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0% which is considerably better than the previous state-of-the-art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overfitting in the fully-connected layers we employed a recently-developed regularization method called dropout that proved to be very effective. We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry.
Article
The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.0 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.
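The core operation the Transformer is built from is scaled dot-product attention, Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V; a minimal PyTorch version is sketched below (multi-head projections and masking conventions are simplified).

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, mask=None):
    """Weights each value by the softmax-normalized similarity between its key
    and the query, scaled by sqrt(d_k) to keep the logits well conditioned."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float('-inf'))
    return F.softmax(scores, dim=-1) @ v
```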
Article
Detecting incidental scene text is a challenging task because of multi-orientation, perspective distortion, and variation of text size, color and scale. Retrospective research has only focused on using rectangular bounding box or horizontal sliding window to localize text, which may result in redundant background noise, unnecessary overlap or even information loss. To address these issues, we propose a new Convolutional Neural Networks (CNNs) based method, named Deep Matching Prior Network (DMPNet), to detect text with tighter quadrangle. First, we use quadrilateral sliding windows in several specific intermediate convolutional layers to roughly recall the text with higher overlapping area and then a shared Monte-Carlo method is proposed for fast and accurate computing of the polygonal areas. After that, we designed a sequential protocol for relative regression which can exactly predict text with compact quadrangle. Moreover, an auxiliary smooth Ln loss is also proposed for further regressing the position of text, which has better overall performance than L2 loss and smooth L1 loss in terms of robustness and stability. The effectiveness of our approach is evaluated on a public word-level, multi-oriented scene text database, ICDAR 2015 Robust Reading Competition Challenge 4 "Incidental scene text localization". The performance of our method is evaluated by using F-measure and found to be 70.64%, outperforming the existing state-of-the-art method with F-measure 63.76%.
Article
Convolutional networks are powerful visual models that yield hierarchies of features. We show that convolutional networks by themselves, trained end-to-end, pixels-to-pixels, improve on the previous best result in semantic segmentation. Our key insight is to build "fully convolutional" networks that take input of arbitrary size and produce correspondingly-sized output with efficient inference and learning. We define and detail the space of fully convolutional networks, explain their application to spatially dense prediction tasks, and draw connections to prior models. We adapt contemporary classification networks (AlexNet, the VGG net, and GoogLeNet) into fully convolutional networks and transfer their learned representations by fine-tuning to the segmentation task. We then define a skip architecture that combines semantic information from a deep, coarse layer with appearance information from a shallow, fine layer to produce accurate and detailed segmentations. Our fully convolutional network achieves improved segmentation of PASCAL VOC (30% relative improvement to 67.2% mean IU on 2012), NYUDv2, SIFT Flow, and PASCAL-Context, while inference takes one tenth of a second for a typical image.
Article
While deep convolutional neural networks (CNNs) have shown a great success in single-label image classification, it is important to note that real world images generally contain multiple labels, which could correspond to different objects, scenes, actions and attributes in an image. Traditional approaches to multi-label image classification learn independent classifiers for each category and employ ranking or thresholding on the classification results. These techniques, although working well, fail to explicitly exploit the label dependencies in an image. In this paper, we utilize recurrent neural networks (RNNs) to address this problem. Combined with CNNs, the proposed CNN-RNN framework learns a joint image-label embedding to characterize the semantic label dependency as well as the image-label relevance, and it can be trained end-to-end from scratch to integrate both types of information in a unified framework. Experimental results on public benchmark datasets demonstrate that the proposed architecture achieves better performance than the state-of-the-art multi-label classification model.
Article
Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth. On the ImageNet dataset we evaluate residual nets with a depth of up to 152 layers, 8x deeper than VGG nets but still having lower complexity. An ensemble of these residual nets achieves 3.57% error on the ImageNet test set. This result won the 1st place on the ILSVRC 2015 classification task. We also present analysis on CIFAR-10 with 100 and 1000 layers. The depth of representations is of central importance for many visual recognition tasks. Solely due to our extremely deep representations, we obtain a 28% relative improvement on the COCO object detection dataset. Deep residual nets are foundations of our submissions to ILSVRC & COCO 2015 competitions, where we also won the 1st places on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.
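A basic residual block makes the reformulation concrete: the stacked layers learn a residual F(x) and the block outputs F(x) + x. The PyTorch sketch below shows the two-convolution variant; exact channel counts, striding, and projection shortcuts are omitted.

```python
import torch
import torch.nn as nn

class BasicResidualBlock(nn.Module):
    """Two 3x3 convolutions learning a residual F(x); the skip connection adds
    the input back, so an identity mapping only requires driving the
    convolution weights toward zero."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = torch.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return torch.relu(out + x)   # F(x) + x
```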
Article
Training Deep Neural Networks is complicated by the fact that the distribution of each layer's inputs changes during training, as the parameters of the previous layers change. This slows down the training by requiring lower learning rates and careful parameter initialization, and makes it notoriously hard to train models with saturating nonlinearities. We refer to this phenomenon as internal covariate shift, and address the problem by normalizing layer inputs. Our method draws its strength from making normalization a part of the model architecture and performing the normalization for each training mini-batch. Batch Normalization allows us to use much higher learning rates and be less careful about initialization. It also acts as a regularizer, in some cases eliminating the need for Dropout. Applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin. Using an ensemble of batch-normalized networks, we improve upon the best published result on ImageNet classification: reaching 4.9% top-5 validation error (and 4.8% test error), exceeding the accuracy of human raters.
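In the training pass, batch normalization standardizes each channel with the mini-batch statistics and then applies a learned scale and shift; at inference, running averages replace the batch statistics. A minimal NumPy sketch of the training-time forward pass:

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Training-time batch normalization for a (N, C, H, W) activation tensor:
    normalize each channel with the mini-batch mean and variance, then apply
    the learned scale (gamma) and shift (beta), both of length C."""
    mean = x.mean(axis=(0, 2, 3), keepdims=True)
    var = x.var(axis=(0, 2, 3), keepdims=True)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma.reshape(1, -1, 1, 1) * x_hat + beta.reshape(1, -1, 1, 1)
```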
Article
Recent high-resolution satellite images provide a valuable new data source for geospatial information acquisition. This paper addresses building extraction from Ikonos images in urban areas. The proposed approach uses the classification results of Ikonos multispectral images to provide approximate location and shape for candidate building objects. Their fine extraction is then carried out in the corresponding panchromatic image through segmentation and squaring. The ECHO classifier is used for supervised classification while the ISODATA algorithm is used for unsupervised classification and subsequent image segmentation. The classification performance is evaluated using the classification confusion matrix, while the final building extraction results are assessed based on the manually delineated results. A building squaring approach based on the Hough transformation is developed that detects and forms the rectilinear building boundaries. A number of sample results are presented to illustrate the approach and demonstrate its efficiency. It is shown that about 64.4 percent of the buildings can be detected, extracted, and accurately formed through this process. Remaining difficulties are a high percentage of false alarm errors caused by the misclassification of road and building classes as well as occlusion and shadows that may mislead the extraction process.
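The Hough-based squaring step can be illustrated with scikit-image's line Hough transform, which recovers the dominant boundary orientations from a binary edge mask; the sketch below shows only orientation detection, not the full squaring procedure described in the paper.

```python
import numpy as np
from skimage.transform import hough_line, hough_line_peaks

def dominant_edge_angles(edge_mask, num_peaks=4):
    """Detect the strongest straight-line orientations in a binary building
    edge mask; rectilinear boundaries show up as peaks roughly 90 deg apart."""
    tested_angles = np.linspace(-np.pi / 2, np.pi / 2, 180, endpoint=False)
    h, theta, d = hough_line(edge_mask, theta=tested_angles)
    _, angles, dists = hough_line_peaks(h, theta, d, num_peaks=num_peaks)
    return np.degrees(angles), dists
```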
Conference Paper
We present a new dataset with the goal of advancing the state-of-the-art in object recognition by placing the question of object recognition in the context of the broader question of scene understanding. This is achieved by gathering images of complex everyday scenes containing common objects in their natural context. Objects are labeled using per-instance segmentations to aid in understanding an object's precise 2D location. Our dataset contains photos of 91 objects types that would be easily recognizable by a 4 year old along with per-instance segmentation masks. With a total of 2.5 million labeled instances in 328k images, the creation of our dataset drew upon extensive crowd worker involvement via novel user interfaces for category detection, instance spotting and instance segmentation. We present a detailed statistical analysis of the dataset in comparison to PASCAL, ImageNet, and SUN. Finally, we provide baseline performance analysis for bounding box and segmentation detection results using a Deformable Parts Model.
Conference Paper
Restricted Boltzmann machines were developed using binary stochastic hidden units. These can be generalized by replacing each binary unit by an infinite number of copies that all have the same weights but have progressively more negative biases. The learning and inference rules for these “Stepped Sigmoid Units” are unchanged. They can be approximated efficiently by noisy, rectified linear units. Compared with binary units, these units learn features that are better for object recognition on the NORB dataset and face verification on the Labeled Faces in the Wild dataset. Unlike binary units, rectified linear units preserve information about relative intensities as information travels through multiple layers of feature detectors.
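The noisy rectified linear unit described here is usually written as y = max(0, x + N(0, sigmoid(x))); a small NumPy sketch of that sampling rule (an illustration, not the exact training code) is:

```python
import numpy as np

def nrelu(x, rng=None):
    """Noisy rectified linear unit: Gaussian noise with variance sigmoid(x) is
    added before rectification, approximating an infinite stack of tied binary
    units with progressively more negative biases."""
    rng = np.random.default_rng(0) if rng is None else rng
    variance = 1.0 / (1.0 + np.exp(-x))            # sigmoid(x) as the noise variance
    return np.maximum(0.0, x + rng.normal(0.0, np.sqrt(variance)))
```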