IEEE GEOSCIENCE AND REMOTE SENSING LETTERS, VOL. , NO. , SEPTEMBER 2019
Roof Classification from 3D LiDAR Point Clouds
using Multi-view CNN with Self-Attention
Dimple A Shajahan, Vaibhav Nayel and Ramanathan Muthuganapathy
Abstract—Classification of LiDAR point clouds of building
roofs plays a vital role in various urban management applications
and is significant in GIS and remote sensing. In this letter, a
novel deep learning-based method is proposed for classifying roof
point clouds, which outperforms the state-of-the-art methods.
We use a view-based method called multi-view convolutional
neural network with self-attention (MVCNN-SA), which takes
the multiple views of a roof point cloud as input and outputs
the category of the roof. Current view-based approaches treat
all views equally and simply combine the view features into
a single compact 3D descriptor. Our adaptive weight-learning
algorithm, which uses the self-attention (SA) block, discovers
the relative importance of each view, thus assigning relative
weights to the views. This enhances the shape descriptor, resulting in better classification performance. The effectiveness of the proposed method is then verified on the publicly available dataset, RoofN3D, by comparing it with the current state-of-the-art methods.
Index Terms—MVCNN-SA, Self-Attention, Roof Classification,
LiDAR, Point Clouds, ALS
I. INTRODUCTION
Three-dimensional (3D) digital representations of building models have become significant in geographic information systems (GIS) and remote sensing. This is due to recent developments in 3D sensing devices that deliver high-quality airborne laser scanning (ALS) raw data in real time. Such data are used for a variety of urban management applications, such as 3D city infrastructure planning and modeling, 3D city mapping, environmental simulations, change detection, and disaster estimation. Classification based on various roof styles is therefore very useful for the reconstruction and determination of building shapes from segmented LiDAR point clouds.
Classifying 3D LiDAR point clouds of building roofs with
high accuracy is a challenging task because of the irregular
point distributions, sparsity, noise, and outliers present in them.
This, in turn, demands new solutions for processing point
cloud information, because current point cloud processing
techniques have limited capabilities to automatically extract
semantic information from raw data [1], [2]. Learning-based techniques, however, have the ability to automatically learn robust and discriminative feature representations from 3D LiDAR point clouds [1], [2].
Dimple A Shajahan and Ramanathan Muthuganapathy are with the Department of Engineering Design, IIT Madras, Chennai-36, India. Vaibhav Nayel is with the Department of Electrical Engineering, IIT Madras. Manuscript received April 30, 2019; revised August 21, 2019.
Fig. 1: Roof types¹ and point clouds in RoofN3D: (a) Saddleback, (b) Two-sided Hip, (c) Pyramid.
¹http://wherecamp2017.geoit.org/wp-content/uploads/2017/11/8-3D-City-Models-for-Digital-Maps-and-Navigation-Andreas-Wichmann-TU-Berlin.pdf
Traditional methods of roof classification are not widely applicable because they consider specific features of the various roof styles, are restricted to a small domain, and hence do not provide a general solution [1]. Learning-based methods are not dependent on geometry-based fitting and roof topology and therefore perform well in roof classification [1]. Deep learning
has led to a series of breakthroughs for image classification and has solved more complex tasks with high performance [3]. It is still not widely explored in the field of remote sensing and could perform better when large-scale data are available [2], [4]. Therefore, our approach is to efficiently classify 3D urban LiDAR point clouds of roofs using deep learning. The various kinds of roofs present in the RoofN3D dataset [5] used for this work, and the point clouds that represent those shapes, are shown in Figure 1.
It has been shown in [6], [7] that classifiers built on 2D rendered images of a 3D shape outperform those built directly on 3D representations. This may be because 2D data have lower dimensionality, sensitivity to noise, and computational complexity than complex, high-dimensional 3D data [6]. Since a single viewpoint can only represent a 3D shape partially, and a limited number of viewpoints might lose information, we consider several views of the point cloud.
The multi-view convolutional neural network (MVCNN) [6] adopts the strategy of treating all views equally and simply combines the view descriptors into a single shape descriptor using max pooling [7]. This method of view-pooling cannot determine the relevance of each view. To discover the relative importance, or weight, of each view, we develop a deep learning strategy that learns the weights adaptively, which enhances the shape descriptor, resulting in better classification
performance. Similar approaches have been used for the identification of salient features from multiple-instance images for object detection, as in [8], [9], [10], [11].
In this letter, we propose a novel deep learning technique for view-pooling by adding a self-attention network (SAN) to MVCNN. MVCNN-SA (multi-view CNN with self-attention) can learn discriminative features of multiple views efficiently through a technique called feature-wise pooling, which leads to better classification of the roofs.
II. RELATED WORKS
The approaches used for roof modelling from LiDAR point clouds are mainly divided into two categories: model-driven (parametric) and data-driven (non-parametric) [12]. Model-driven methods require prior knowledge of the shape and are robust to noise and missing points. They use geometric and topological constraints that cannot be generalized to different datasets and are characterized by low classification accuracy [1]. In data-driven approaches, the 3D model is reconstructed by fitting together the individual roof planes extracted via RANSAC segmentation or the directions of normal vectors. Here, the performance is affected by the average point density of the input point clouds and by the presence of noise and missing points [12], [13]. Advanced data-driven methods exist for 3D roof modelling that can segment more complex roofs, but they are also affected by the sparsity of points [13], [14].
In contrast, learning-based methods, which are also data-driven, can be used to classify roofs with high accuracy [1], [4]. Zhang et al. [1] proposed a roof-classification technique using a random forest classifier trained on bag-of-words features extracted from a point cloud and a synthetically generated codebook. Castagno et al. [4] proposed a method based on the fusion of two modalities, LiDAR point clouds and satellite images, from publicly available GIS data.
Though these works also focus on classifying roofs from
LiDAR point clouds using supervised learning, they operate
on smaller datasets [4]. Our work stands out from them by
using a very large dataset, RoofN3D [5], and a deep learning-
based approach. The remainder of the letter is structured as follows. Section III describes the methodology used for the classification of LiDAR roof point clouds. Section IV presents the performance measures used to evaluate the model, along with the results and discussion, and Section V concludes and suggests directions for future work.
III. METHODOLOGY
Fig. 2: Pipeline of 3D urban roof classification.
The pipeline of the classification procedure, shown in Figure 2, has two main phases: rendering of views and classification of views. Rendering of views, which involves the conversion of 3D point clouds of urban roofs to images, is explained in Section III-B. The steps involved in the classification of views are detailed below and in Section III-C.
1) Rendered images are categorized into train, validation
and test images, and are fed to the feature extraction
part of MVCNN-SA.
2) The corresponding features extracted are fed to the
self-attention network, which performs view-pooling by
producing adaptive weights.
3) The obtained shape descriptor is then passed to the
classifier to produce the label.
A. Dataset
The main challenge in the direct application of deep learning
for 3D roof classification was the absence of a publicly
available dataset that is well annotated and of reasonable size
[5]. A new, real dataset, RoofN3D [5], was recently released, which is both well-annotated and large, thus making it suitable for deep learning-based classification. RoofN3D consists of 3D point clouds corresponding to the roofs of buildings in New York City and has 118,073 instances. Each instance in
the dataset contains the point cloud of a single roof. A large
number of point clouds contain outliers such as points of trees,
walls or projections. Some point clouds also have missing
points, making this a representative real-world dataset. One
limitation of this dataset is that it contains only three categories
of roofs. RoofN3D is still a work in progress and is expected
to be improved, updated, and extended in the future with more
roof categories.
Based on our experiments, we partitioned the dataset as 60% for training, 25% for validation, and the remaining 15% for testing. Since this is an imbalanced dataset, the proportion between the classes was maintained through stratified sampling. The exact details of the split are given in Table I.
Class Instances Train Validation Test
Saddleback 89057 53434 22264 13359
Two-sided Hip 26830 16098 6707 4025
Pyramid 2186 1311 547 328
Total 118073 70843 29518 17712
TABLE I: RoofN3D dataset with data split details for the
proposed work.
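To make the partitioning procedure concrete, the following is a minimal sketch of the 60/25/15 stratified split using scikit-learn; the `instances` and `labels` arrays (one class label per RoofN3D roof) and the random seed are illustrative assumptions, not the authors' exact code.

```python
# Hedged sketch of the 60/25/15 stratified split described above (scikit-learn).
from sklearn.model_selection import train_test_split

def stratified_split(instances, labels, seed=0):
    # Carve off 60% for training while preserving the class proportions.
    x_train, x_rest, y_train, y_rest = train_test_split(
        instances, labels, train_size=0.60, stratify=labels, random_state=seed)
    # Split the remaining 40% into validation (25% overall) and test (15% overall):
    # 0.25 / 0.40 = 0.625 of the remainder goes to validation.
    x_val, x_test, y_val, y_test = train_test_split(
        x_rest, y_rest, train_size=0.625, stratify=y_rest, random_state=seed)
    return (x_train, y_train), (x_val, y_val), (x_test, y_test)
```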
B. Rendering of Views
The technique used for capturing 2D views is similar to the first camera setup described in Section 3.1 of [6]. The view-generator takes 2D projections of the point clouds from different angles to capture the geometry of the object. In this case, the point clouds are in an upright orientation along a specific axis (e.g., the z-axis). The virtual cameras are placed around the point clouds such that the camera viewpoints are separated by intervals of angle θ in the same plane. The elevation of each camera above the horizontal plane is 30°, pointing towards the centroid of the point cloud. Phong shading [15] in Blender [16] was used to generate the rendered views of the point clouds, and the images were captured with depth information. The whole setup, with lighting, is similar to the approach used in [6]. We set θ = 36°, which yields 10 views for an object.
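As a sketch of the camera geometry just described (the rendering itself is done in Blender), the snippet below computes the ten camera positions at 36° azimuth intervals and 30° elevation around the point-cloud centroid; the camera distance factor is an assumption made only for illustration.

```python
# Hedged sketch of the virtual-camera placement: 10 azimuths spaced 36 degrees
# apart, elevated 30 degrees above the horizontal plane, looking at the centroid.
import numpy as np

def camera_positions(points, theta_deg=36.0, elevation_deg=30.0, distance=2.0):
    """Return an (N, 3) array of camera positions around a roof point cloud."""
    centroid = points.mean(axis=0)
    radius = distance * np.max(np.linalg.norm(points - centroid, axis=1))
    azimuths = np.deg2rad(np.arange(0.0, 360.0, theta_deg))  # 10 views for 36 deg
    elev = np.deg2rad(elevation_deg)
    offsets = np.stack([np.cos(elev) * np.cos(azimuths),
                        np.cos(elev) * np.sin(azimuths),
                        np.full_like(azimuths, np.sin(elev))], axis=1)
    return centroid + radius * offsets  # each camera points toward the centroid

cams = camera_positions(np.random.rand(100, 3))  # toy point cloud
print(cams.shape)  # (10, 3)
```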
C. Classification of Views
In multi-view methods, a recent work [7] has shown that
learning the relative importance of view descriptors results
in a better representation of the shape descriptor, leading to
enhanced classification performance. The aggregation of views
is enabled by an attention network which generates relative
weights assigned to each view. The features of each view
are scaled by these attention weights and are added to give
a pooled representation of the object.
The architecture of the proposed MVCNN-SA is shown in Figure 3. It consists of three main parts. The first part is used for feature extraction and consists of CNN blocks, where all branches of the network share the same parameters in CNN1. The second part performs view-pooling and has the SAN blocks, in which all SANs share the same weights. This is followed by the third part, a fully connected layer FC1, for classification.
Fig. 3: Architecture of MVCNN-SA comprising three parts: Feature Extraction, View-Pooling, and Classification.
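To make this three-part structure concrete, here is a hedged PyTorch sketch of the forward pass: a shared ResNet-18 trunk as CNN1, a view-pooling module (a simple mean is used here as a placeholder for the SAN detailed below), and a fully connected classifier FC1. The input resolution, feature-map sizes, and the placeholder pooling are assumptions, not the authors' exact configuration.

```python
# Hedged sketch of the MVCNN-SA forward pass (PyTorch/torchvision).
import torch
import torch.nn as nn
from torchvision import models

class MeanViewPool(nn.Module):
    """Placeholder aggregator: average over views and spatial positions.
    The SAN sketched in Section III-C would replace this module."""
    def forward(self, feats):                     # feats: (B, V, C, H, W)
        return feats.mean(dim=(1, 3, 4))          # -> (B, C) shape descriptor

class MVCNNSA(nn.Module):
    def __init__(self, pool=None, num_classes=3, descriptor_dim=512):
        super().__init__()
        backbone = models.resnet18()
        # CNN1: all view branches share these convolutional weights.
        self.cnn1 = nn.Sequential(*list(backbone.children())[:-2])  # (512, 7, 7) maps
        self.pool = pool or MeanViewPool()        # view-pooling (the SAN in the letter)
        self.fc1 = nn.Linear(descriptor_dim, num_classes)           # classifier FC1

    def forward(self, views):                     # views: (B, V, 3, 224, 224)
        B, V = views.shape[:2]
        feats = self.cnn1(views.flatten(0, 1))    # per-view feature maps f_i
        feats = feats.view(B, V, *feats.shape[1:])
        return self.fc1(self.pool(feats))         # logits over the roof classes

logits = MVCNNSA()(torch.randn(2, 10, 3, 224, 224))  # 10 rendered views per roof
```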
For each 3D model, let $V = \{v_1, v_2, \ldots, v_{10}\}$ be the set of rendered images, where each $v_i$ is passed to CNN1. The feature set $F_M = \{f_1, f_2, \ldots, f_{10}\}$ consists of the feature maps $f_i$ extracted from these views. In our experiments, CNN1 has a ResNet-18 architecture. The feature maps $f_i$ are passed to SAN1, which generates a key-value pair $(k_i, z_i)$ for a particular view $i$, as shown in equations (1) and (2):
$z_i = Z(f_i) \in \mathbb{R}^L$  (1)
where $z_i$ is the output of the value network and $L$ is the size of the value vector, and
$k_i = K(f_i) \in \mathbb{R}^L$  (2)
where $k_i$ is the output of the key network.
Fig. 4: Architecture of the Self-Attention Network (SAN).
The architecture of the SAN is shown in Figure 4 and consists of two parts: a value network $Z$ containing a fully connected layer FC2, and a key network $K$ containing CNN2. CNN2 consists of three convolutional layers followed by a fully connected layer whose outputs represent un-normalized self-attention weights. These are called keys and are passed through a softmax function for normalization, producing $w_i$ as shown in equation (3):
as shown in equation (3).
wij =ekij
Σiekij (3)
where wij is the attention matrix given by the softmax
function. The purpose of the softmax function is to ensure
that the weights for a particular feature of the value vector
across views sum to one. These weights are multiplied with
the outputs from the value network as shown in equation (4).
Pw= Σiwizi RL(4)
where Pwis the aggregated view-pool.
This view-pool is fed to a fully connected layer with softmax activation for classification. We term this view-pooling method feature-wise attention. The entire network is trained in an end-to-end fashion using the cross-entropy loss function.
In feature-wise attention, the weights $w_i \in \mathbb{R}^L$ are vectors such that, for each feature index $j$, the $j$th elements across views sum to one. The scaling of each $z_i$ is performed using a Hadamard product, as shown in equation (4). This assigns a different importance to each feature in a view vector, allowing the network to pay attention to the salient features in every view. An auxiliary entropy term is included in the loss function to study the effect of changing attention-map distributions. The entropy is calculated as in equation (5):
$S = -\sum_i w_i \cdot \log w_i$  (5)
A positive entropy penalty encourages the attention map to
be concentrated while a negative penalty causes the attention
map to become more diffuse.
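Below is a hedged PyTorch sketch of the SAN with feature-wise attention and the entropy term of equation (5). The value-vector size $L$, the key-network widths, and the (512, 7, 7) feature-map shape from CNN1 are illustrative assumptions rather than the exact configuration used in the letter.

```python
# Hedged sketch of the SAN view-pooling with feature-wise attention (PyTorch).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttentionPool(nn.Module):
    """Aggregates per-view feature maps f_i into one shape descriptor P_w."""
    def __init__(self, in_channels=512, feat_hw=7, L=256):
        super().__init__()
        flat = in_channels * feat_hw * feat_hw
        self.value = nn.Linear(flat, L)            # value network Z (FC2): z_i, eq. (1)
        self.key_conv = nn.Sequential(             # key network K (CNN2): three conv layers
            nn.Conv2d(in_channels, 256, 3, padding=1), nn.ReLU(),
            nn.Conv2d(256, 128, 3, padding=1), nn.ReLU(),
            nn.Conv2d(128, 64, 3, padding=1), nn.ReLU(),
        )
        self.key_fc = nn.Linear(64 * feat_hw * feat_hw, L)  # un-normalized keys k_i, eq. (2)

    def forward(self, feats):
        # feats: (B, V, C, H, W) feature maps from the shared CNN1 trunk.
        B, V = feats.shape[:2]
        x = feats.flatten(0, 1)                              # (B*V, C, H, W)
        z = self.value(x.flatten(1)).view(B, V, -1)          # z_i: (B, V, L)
        k = self.key_fc(self.key_conv(x).flatten(1)).view(B, V, -1)  # k_i: (B, V, L)
        w = F.softmax(k, dim=1)    # eq. (3): each feature's weights sum to one over views
        # Entropy of the attention map, eq. (5), averaged over features and the batch;
        # it would be scaled by the coefficient studied in Table V and added to the loss.
        self.attention_entropy = -(w * w.clamp_min(1e-12).log()).sum(dim=1).mean()
        return (w * z).sum(dim=1)  # eq. (4): Hadamard product, summed over views -> P_w
```

With this module, the model sketched earlier could be instantiated as `MVCNNSA(pool=SelfAttentionPool(), descriptor_dim=256)`, and the training loss would be the cross-entropy plus the entropy coefficient times `pool.attention_entropy`.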
Fig. 5: Three types of Attention Maps: (a) Diffused, (b) Intermediate, (c) Concentrated.
We analyzed the attention matrix and recognized three
possible configurations. A diffused attention map is one in
which all views contribute almost equally to the prediction
and is shown in Figure 5(a). The intermediate attention map
is one in which some views contribute more to the prediction
and is plotted as shown in Figure 5(b). The last one is
the concentrated attention map in which one or two views
dominate the other views in the output and is shown in Figure
5(c).
We also experimented with another method for view-pooling, which we term view-wise attention. In view-wise attention, the attention weights $w_i \in \mathbb{R}$ are real numbers summing to one. For each view, the value vector $z_i$ is scaled by the attention weight $w_i$, and the scaled vectors are summed to generate the view-pool. This can be interpreted as giving the same importance to every feature of a view. Our experiments have shown that the higher degree of freedom in feature-wise attention allows the network to achieve better classification performance than view-wise attention. Figure 3 can represent both methods, feature-wise and view-wise attention, with the only difference being the dimensionality of the keys $k_i$ and weights $w_i$.
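For contrast, a minimal sketch of the view-wise variant, reusing the SelfAttentionPool class sketched above: the key network emits a single scalar per view, so the softmax yields one weight per view that is broadcast over all $L$ features.

```python
# View-wise attention as a one-line variant of the sketch above: each key k_i is a
# scalar, so a single normalized weight per view scales the whole value vector z_i.
import torch.nn as nn

class ViewWisePool(SelfAttentionPool):
    def __init__(self, in_channels=512, feat_hw=7, L=256):
        super().__init__(in_channels, feat_hw, L)
        self.key_fc = nn.Linear(64 * feat_hw * feat_hw, 1)  # k_i in R instead of R^L
```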
D. Parameter Setting
We use the Adam optimizer with an initial learning rate of 0.0001, a momentum term of 0.9, and a learning-rate decay of 0.1 applied every 30 epochs. Cross-entropy loss with softmax activation is used for classification in the output layer.
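A short sketch of this optimization setup in PyTorch; the momentum of 0.9 is interpreted here as Adam's first-moment coefficient (beta1), and `model` is a stand-in for the full MVCNN-SA network.

```python
# Hedged sketch of the training setup described above (PyTorch).
import torch

model = torch.nn.Linear(10, 3)  # stand-in for the MVCNN-SA network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, betas=(0.9, 0.999))
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)  # decay 0.1 every 30 epochs
criterion = torch.nn.CrossEntropyLoss()  # cross-entropy with softmax applied internally
```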
Since a high class imbalance is present in our dataset, for training our network with N instances per batch, we sample each instance with a probability proportional to the inverse frequency of its class. Thus, instances from larger classes have a lower probability of being selected, and despite the population differences between the classes, each sampled batch is approximately class-balanced, which also helps improve prediction accuracy. Suitable transformations for dataset augmentation are applied to the images during sampling, such as random resized crop, random horizontal flip, and normalization with the mean and standard deviation.
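The class-balanced sampling and the augmentations can be expressed with standard PyTorch/torchvision utilities, as in the sketch below; the per-class counts follow the training column of Table I, while the crop size and normalization statistics are assumptions.

```python
# Hedged sketch of inverse-frequency sampling and data augmentation (PyTorch).
import torch
from torch.utils.data import WeightedRandomSampler
from torchvision import transforms

train_counts = {"Saddleback": 53434, "Two-sided Hip": 16098, "Pyramid": 1311}
inv_freq = {c: 1.0 / n for c, n in train_counts.items()}

# One sampling weight per training instance: the inverse frequency of its class.
labels = (["Saddleback"] * 53434 + ["Two-sided Hip"] * 16098 + ["Pyramid"] * 1311)
weights = torch.tensor([inv_freq[c] for c in labels], dtype=torch.double)
sampler = WeightedRandomSampler(weights, num_samples=len(weights), replacement=True)

train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
# The sampler and transform would be passed to the training Dataset/DataLoader.
```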
IV. RESULTS AND DISCUSSION
The metrics used to evaluate performance are the following:
$\mathrm{Accuracy},\ A_c = \dfrac{TP_c + TN_c}{TP_c + FP_c + FN_c + TN_c}$  (6)
$\mathrm{Precision},\ P_c = \dfrac{TP_c}{TP_c + FP_c}$  (7)
$\mathrm{Recall},\ R_c = \dfrac{TP_c}{TP_c + FN_c}$  (8)
$\mathrm{F1\text{-}score} = \dfrac{2 \cdot P_c \cdot R_c}{P_c + R_c}$  (9)
where TP = true positives, TN = true negatives, FP = false positives, and FN = false negatives, with respect to a class $c$.
After experimenting with the two types of pooling mechanisms described in Section III, we found that MVCNN-SA with feature-wise pooling outperforms MVCNN with max pooling or average pooling. It also surpasses the current state-of-the-art score on the RoofN3D dataset, which is the result of a classification approach that directly uses the 3D point clouds as input and is based on the PointNet architecture [17]. The comparison of MVCNN-SA with the PointNet-based method [18] is shown in Table II and Figure 6. Our experiments with ResNet-50, which processes all views of the point clouds separately, also resulted in lower accuracy compared to the proposed method. The computation times measured for these methods are listed in Table III.
Class MVCNN-SA PointNet-based [18]
Precision Recall F1-score Precision Recall F1-score
Saddleback 0.98 0.99 0.99 0.99 0.97 0.98
Two-sided Hip 0.97 0.94 0.96 0.92 0.95 0.93
Pyramid 0.87 0.79 0.83 0.69 0.83 0.75
TABLE II: Comparison of MVCNN-SA results with PointNet-based
results.
Fig. 6: Accuracies for various models (FP-Feature-wise Pooling,
VP- View-wise Pooling, AP- Average Pooling, MP- Max Pooling).
Method          # Epochs   Training Time (hrs)   Testing Time (hrs)
MVCNN-SA        75         32                    0.3
PointNet-based  75         2.4                   0.008
ResNet-50       33         25                    0.15
TABLE III: Comparison of methods based on computation time.
The confusion matrix for the feature-wise pooling approach
is shown in Table IV. The results show the proposed approach
performs well in classifying all the roof types. Errors mostly come from classifying the pyramid class, which is under-represented in the dataset.
Saddleback Two-sided Hip Pyramid
Saddleback 13128 78 7
Two-sided Hip 199 3902 35
Pyramid 32 45 286
TABLE IV: Confusion Matrix of MVCNN-SA/FP.
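As a worked check of equations (6)-(9), the sketch below recomputes the per-class metrics from the confusion matrix in Table IV, assuming rows index the true class and columns the predicted class; under that assumption the resulting precision, recall, and F1-scores match the MVCNN-SA column of Table II.

```python
# Recomputing equations (6)-(9) from the Table IV confusion matrix
# (assumption: rows = true class, columns = predicted class).
import numpy as np

classes = ["Saddleback", "Two-sided Hip", "Pyramid"]
cm = np.array([[13128, 78, 7],
               [199, 3902, 35],
               [32, 45, 286]], dtype=float)
total = cm.sum()

for i, name in enumerate(classes):
    tp = cm[i, i]
    fp = cm[:, i].sum() - tp   # predicted as class i but actually another class
    fn = cm[i, :].sum() - tp   # class i instances predicted as something else
    tn = total - tp - fp - fn
    acc = (tp + tn) / total                      # eq. (6)
    prec = tp / (tp + fp)                        # eq. (7)
    rec = tp / (tp + fn)                         # eq. (8)
    f1 = 2 * prec * rec / (prec + rec)           # eq. (9)
    print(f"{name}: P={prec:.2f} R={rec:.2f} F1={f1:.2f} Acc={acc:.3f}")
```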
Entropy coefficient                   1      10^-3  10^-5  0      -10^-5  -10^-3  -10^-1
Accuracy [%]                          97.26  97.74  97.76  97.65  97.54   97.22   97.14
Accuracy after angular rotation [%]   96.5   97.28  97.26  97.4   97.24   97.15   97.18
TABLE V: Comparison of Accuracy based on entropy coefficients.
Further, we observed that the concentrated and diffused attention maps showed poor accuracy. A diffused attention map might give importance to views with irrelevant or redundant information, leading to poor performance. A similar argument holds true for concentrated attention maps, as they might ignore views with important information. We found that the intermediate attention map, which corresponds to an entropy coefficient close to zero, strikes a balance between these two behaviours and also gives the highest accuracy, as shown in Table V.
We also experimented with changing the elevation of the cameras, varying it randomly between -30° and 30°, and captured the views for training, validation, and testing. The performance obtained was not drastically lower than the results on the original dataset, as shown in Table V. This may be because, even if some of the cameras shifted to positions where they could not capture much information, the other views compensated for this.
In the final experiment, we captured 20 views with θ = 18° to verify that increasing the number of views does not cause large gains in performance, which was also found to be true in [6].
A. Experimental Setup
We ran our experiments on a computing node running Ubuntu 16.04 with an Intel(R) Xeon(R) E5-2630 v3 CPU @ 2.40 GHz, 80 GB of RAM, and an NVIDIA GTX 1080 Ti GPU. PyTorch 1.0 was used for deep learning, and Blender 2.79b (Linux) was used for generating the rendered views from the LiDAR point clouds.
V. CONCLUSION
We proposed an adaptive importance-learning algorithm to
discover efficient image descriptors from multiple views of
LiDAR roof point clouds. Using our novel approach, we were
able to outperform the current state-of-the-art method [18]
based on PointNet on this dataset. However, the performance of our model could still be improved beyond the current results through hyperparameter tuning and changes to the architecture of the SAN.
REFERENCES
[1] X. Zhang, A. Zang, G. Agam, and X. Chen, “Learning from synthetic
models for roof style classification in point clouds,” in Proceedings
of the 22nd ACM SIGSPATIAL International Conference on Advances
in Geographic Information Systems, ser. SIGSPATIAL ’14. New
York, NY, USA: ACM, 2014, pp. 263–270. [Online]. Available:
http://doi.acm.org/10.1145/2666310.2666407
[2] L. Zhang and L. Zhang, “Deep learning-based classification and
reconstruction of residential scenes from large-scale point clouds,” IEEE
Trans. Geoscience and Remote Sensing, vol. 56, no. 4, pp. 1887–1897,
2018. [Online]. Available: https://doi.org/10.1109/TGRS.2017.2769120
[3] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image
recognition,” 2016 IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), pp. 770–778, 2016.
[4] J. Castagno and E. Atkins, “Roof shape classification from lidar
and satellite image data fusion using supervised learning,” Sensors,
vol. 18, p. 3960, Nov. 2018. [Online]. Available: https://www.mdpi.com/1424-8220/18/11/3960
[5] A. Wichmann, A. Agoub, and M. Kada, “Roofn3d: Deep learning
training data for 3d building reconstruction,” ISPRS - International
Archives of the Photogrammetry, Remote Sensing and Spatial Infor-
mation Sciences, vol. XLII-2, pp. 1191–1198, 2018. [Online]. Avail-
able: https://www.int-arch-photogramm-remote-sens-spatial-inf-sci.net/XLII-2/1191/2018/
[6] H. Su, S. Maji, E. Kalogerakis, and E. G. Learned-Miller, “Multi-view
convolutional neural networks for 3d shape recognition,” in Proc. ICCV,
2015.
[7] Y. Feng, Z. Zhang, X. Zhao, R. Ji, and Y. Gao, “Gvcnn: Group-view
convolutional neural networks for 3d shape recognition,” in The IEEE
Conference on Computer Vision and Pattern Recognition (CVPR), June
2018.
[8] D. Zhang, D. Meng, and J. Han, “Co-saliency detection via a
self-paced multiple-instance learning framework,” IEEE Trans. Pattern
Anal. Mach. Intell., vol. 39, no. 5, pp. 865–878, May 2017. [Online].
Available: https://doi.org/10.1109/TPAMI.2016.2567393
[9] D. Zhang, J. Han, C. Li, J. Wang, and X. Li, “Detection of
co-salient objects by looking deep and wide,” Int. J. Comput.
Vision, vol. 120, no. 2, pp. 215–232, Nov. 2016. [Online]. Available:
http://dx.doi.org/10.1007/s11263-016-0907-4
[10] G. Cheng, P. Zhou, and J. Han, “Learning rotation-invariant
convolutional neural networks for object detection in VHR optical
remote sensing images,” IEEE Trans. Geoscience and Remote
Sensing, vol. 54, no. 12, pp. 7405–7415, 2016. [Online]. Available:
https://doi.org/10.1109/TGRS.2016.2601622
[11] J. Han, D. Zhang, X. Hu, L. Guo, J. Ren, and F. Wu, “Background prior-
based salient object detection via deep reconstruction residual,” IEEE
Trans. Circuits Syst. Video Techn., vol. 25, no. 8, pp. 1309–1321, 2015.
[Online]. Available: https://doi.org/10.1109/TCSVT.2014.2381471
[12] M. Gkeli and C. Ioannidis, “Automatic 3d reconstruction of buildings
roof tops in densely urbanized areas,” ISPRS - International Archives
of the Photogrammetry, Remote Sensing and Spatial Information
Sciences, vol. XLII-4/W10, pp. 47–54, 2018. [Online]. Avail-
able: https://www.int-arch-photogramm-remote-sens-spatial-inf-sci.net/XLII-4-W10/47/2018/
[13] K. Kim and J. Shan, “Building roof modeling from airborne laser
scanning data based on level set approach,” ISPRS Journal of
Photogrammetry and Remote Sensing, vol. 66, no. 4, pp. 484 – 497,
2011. [Online]. Available: http://www.sciencedirect.com/science/article/
pii/S0924271611000396
[14] A. Wichmann, J. Jung, G. Sohn, M. Kada, and M. Ehlers, “Integra-
tion of Building Knowledge Into Binary Space Partitioning for the
Reconstruction of Regularized Building Models,” ISPRS Annals of
Photogrammetry, Remote Sensing and Spatial Information Sciences,
no. 5, pp. 541–548, Sep. 2015.
[15] B. T. Phong, “Illumination for computer generated pictures,” Commun.
ACM, vol. 18, no. 6, pp. 311–317, Jun. 1975. [Online]. Available:
http://doi.acm.org/10.1145/360825.360839
[16] Blender Online Community, Blender - a 3D modelling and rendering
package, Blender Foundation, Blender Institute, Amsterdam. [Online].
Available: http://www.blender.org
[17] C. R. Qi, H. Su, K. Mo, and L. J. Guibas, “Pointnet: Deep learning
on point sets for 3d classification and segmentation,” CoRR, vol.
abs/1612.00593, 2016. [Online]. Available: http://arxiv.org/abs/1612.00593
[18] S. Guptha and R. Bohare, “Project title,” https://github.com/sarthakTUM/roofn3d, 2019.