Roof Classification from 3D LiDAR Point Clouds
using Multi-view CNN with Self-Attention
Dimple A Shajahan, Vaibhav Nayel and Ramanathan Muthuganapathy
Abstract—Classification of LiDAR point clouds of building
roofs plays a vital role in various urban management applications
and is significant in GIS and remote sensing. In this letter, a
novel deep learning-based method is proposed for classifying roof
point clouds, which outperforms the state-of-the-art methods.
We use a view-based method called multi-view convolutional
neural network with self-attention (MVCNN-SA), which takes
the multiple views of a roof point cloud as input and outputs
the category of the roof. Current view-based approaches treat
all views equally and simply combine the view features into
a single compact 3D descriptor. Our adaptive weight-learning algorithm, which uses a self-attention (SA) block, discovers the relative importance of each view and assigns weights to the views accordingly. This enhances the shape descriptor, resulting in better classification performance. The effectiveness of the proposed method is verified on the publicly available RoofN3D dataset by comparing it with current state-of-the-art methods.
Index Terms—MVCNN-SA, Self-Attention, Roof Classification,
LiDAR, Point Clouds, ALS
I. INTRODUCTION
Three dimensional (3D) digital representations of building
models have become significant in geographic information
systems (GIS) and remote sensing. This is due to recent
developments in 3D sensing devices that deliver high-quality
airborne laser scanning (ALS) raw data in real time. This data is used for a variety of urban management applications such as 3D city infrastructure planning and modeling, 3D city mapping, environmental simulations, change detection, and disaster estimation. Classification based on various roof styles is therefore very useful for the reconstruction and determination of building shapes from segmented LiDAR point clouds.
Classifying 3D LiDAR point clouds of building roofs with
high accuracy is a challenging task because of the irregular
point distributions, sparsity, noise, and outliers present in them.
This, in turn, demands new solutions for processing point
cloud information, because current point cloud processing
techniques have limited capabilities to automatically extract
semantic information from raw data [1],[2]. Learning-based techniques, in contrast, can automatically learn robust and discriminative feature representations from 3D LiDAR point clouds [1],[2].
Traditional methods of roof classification are not widely ap-
plicable because they consider specific features of the various
roof styles and are restricted to a small domain and hence do
Dimple A Shajahan and Ramanathan Muthuganapathy are with the Depart-
ment of Engineering Design, IIT Madras, Chennai-36, India. Vaibhav Nayel
is with the Department of Electrical Engineering, IIT Madras.
Manuscript received April 30th, 2019; revised August 21st, 2019.
Fig. 1: Roof types¹ and point clouds in RoofN3D: (a) Saddleback, (b) Two-sided Hip, (c) Pyramid.
not provide a general solution [1]. Learning-based methods are
not dependent on geometry-based fitting and roof topology and
therefore perform well in roof classification [1]. Deep learning
has led to a series of breakthroughs for image classification
and has solved more complex tasks with high performance [3]. It is still not widely explored in the field of remote sensing and could perform better when large-scale data is available [2],[4].
[4]. Therefore, our approach is to efficiently classify 3D urban
LiDAR point clouds of roofs, using deep learning. The various
kinds of roofs present in the RoofN3D dataset [5] used for this
work and the point clouds which represent those shapes are
shown in Figure 1.
It has been shown in [6],[7], that the classifiers involving 2D
rendered images for a 3D shape outperform those built directly
on 3D representations. This may be because 2D data has lower dimensionality, noise sensitivity, and computational complexity than complex, high-dimensional 3D data [6]. Since there might be a loss of information due to
the limited number of viewpoints and also because a viewpoint
can only represent a 3D shape partially, we have considered
several views of the point cloud.
Multi-view convolutional neural network (MVCNN) [6]
adopts the strategy of treating all views as equal and simply
combines the view descriptors into a single shape descriptor
using max pooling [7]. This method of view-pooling is not
able to determine the relevance of each view. To discover the
relative importance or weight of each of the views, we develop
a deep learning strategy to learn weights adaptively, which
enhances the shape descriptor resulting in better classification
performance. Similar approaches have been used for the iden-
tification of salient features from multiple instance images for
¹http://wherecamp2017.geoit.org/wp-content/uploads/2017/11/8-3D-City-Models-for-Digital-Maps-and-Navigation-Andreas-Wichmann-TU-Berlin.pdf
object detection as in [8], [9], [10], [11].
In this letter, we propose a novel technique in deep learning
for view-pooling by adding a self-attention network (SAN) to
MVCNN. MVCNN-SA (multi-view CNN with self-attention) can learn discriminative features of multiple views efficiently using a technique called feature-wise pooling, which leads to better classification of the roofs.
II. RELATED WORKS
The approaches used for roof modelling from LiDAR point
clouds are mainly divided into two categories: model-driven -
which is parametric, and data-driven - which is non-parametric
[12]. Model-driven methods require prior knowledge of the shape and are robust to noise and missing points. They use geometric and topological constraints that cannot be generalized to different datasets and are characterized by low classification accuracy [1]. In data-driven approaches, the 3D model is
reconstructed by fitting together the individual roof planes
extracted by RANSAC segmentation or directions of normal
vectors. Here, the performance is affected by the average
point density of the input point clouds, presence of noise and
missing points [12],[13]. Advanced data-driven methods exist
for 3D roof modelling which can segment more complex roofs
but are also affected by the sparsity of points [13],[14].
In contrast, learning-based methods, which are also data-
driven, can be used to classify roofs with high accuracy [1],[4].
Zhang et al.[1] proposed a roof-classification technique using
a random forest classifier, that is trained on a bag of words
features extracted from a point cloud and a synthetically
generated codebook. Castagno et al. [4] proposed a method based on the fusion of two modalities, LiDAR point clouds and satellite images, from publicly available GIS data.
Though these works also focus on classifying roofs from
LiDAR point clouds using supervised learning, they operate
on smaller datasets [4]. Our work stands out from them by
using a very large dataset, RoofN3D [5], and a deep learning-
based approach. The remainder of the letter is structured as
follows: Section III provides the methodology used for the
classification of LiDAR roof point clouds, followed by results
and discussion. Section IV includes performance measures
to evaluate the model and section V concludes and suggests
directions for future work.
III. METHODOLOGY
Fig. 2: Pipeline of 3D urban roof classification.
The pipeline of the classification procedure, shown in Figure 2, has two main phases: rendering of views and classification
of views. Rendering of views, which involves the conversion
of 3D point clouds of urban roofs to images, is explained in
section III-B. The steps involved in the classification of views
are detailed below and in section III-C.
1) Rendered images are categorized into train, validation
and test images, and are fed to the feature extraction
part of MVCNN-SA.
2) The corresponding features extracted are fed to the
self-attention network, which performs view-pooling by
producing adaptive weights.
3) The obtained shape descriptor is then passed to the
classifier to produce the label.
A. Dataset
The main challenge in the direct application of deep learning
for 3D roof classification was the absence of a publicly
available dataset that is well annotated and of reasonable size
[5]. A new, real dataset, RoofN3D [5], was recently released
which is both well-annotated and large, thus making it suitable
for deep learning-based classification. RoofN3D consists of
3D point clouds corresponding to the roofs of buildings of
New York City and has 118,073 instances. Each instance in
the dataset contains the point cloud of a single roof. A large
number of point clouds contain outliers such as points of trees,
walls or projections. Some point clouds also have missing
points, making this a representative real-world dataset. One
limitation of this dataset is that it contains only three categories
of roofs. RoofN3D is still a work in progress and is expected
to be improved, updated, and extended in the future with more
roof categories.
Based on our experiments, we partitioned the dataset into 60% for training, 25% for validation, and the remaining 15% for testing. Since this is an imbalanced dataset, the proportion between the classes was maintained through stratified sampling. The exact details of the split are given in Table I, and a sketch of the split procedure follows the table.
Class Instances Train Validation Test
Saddleback 89057 53434 22264 13359
Two-sided Hip 26830 16098 6707 4025
Pyramid 2186 1311 547 328
Total 118073 70843 29518 17712
TABLE I: RoofN3D dataset with data split details for the
proposed work.
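As an illustration, the split described above can be reproduced with a two-stage stratified split. The sketch below is a minimal example, assuming the per-instance class labels are available as an integer array; the function name and the `seed` argument are ours, not part of the RoofN3D tooling.

```python
from sklearn.model_selection import train_test_split

def stratified_split(indices, labels, seed=0):
    """Split instances 60/25/15 while preserving class proportions."""
    # Carve out the 60% training set, stratified by class label.
    train_idx, rest_idx, _, rest_lbl = train_test_split(
        indices, labels, train_size=0.60, stratify=labels, random_state=seed)
    # Split the remaining 40% into validation (25%) and test (15%),
    # i.e. 0.625 / 0.375 of the remainder, again stratified.
    val_idx, test_idx = train_test_split(
        rest_idx, train_size=0.625, stratify=rest_lbl, random_state=seed)
    return train_idx, val_idx, test_idx
```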
B. Rendering of Views
The technique used for capturing 2D views is similar to the first camera setup described in section 3.1 of [6]. The view generator takes 2D projections of the point cloud from different angles to capture the geometry of the object. In this case, the point clouds are in an upright orientation along a specific axis (e.g., the z-axis). The virtual cameras are placed around the point cloud such that the camera viewpoints are separated by intervals of angle $\theta$ in the same plane. Each camera is elevated $30°$ above the horizontal plane and points towards the centroid of the point cloud. Phong shading [15] in Blender [16] was used to generate the rendered views of the point clouds, and the images are captured with depth information. The whole setup, with lighting, is similar to the approach used in [6]. We set $\theta = 36°$, which yields 10 views per object.
C. Classification of views
In multi-view methods, a recent work [7] has shown that
learning the relative importance of view descriptors results
in a better representation of the shape descriptor, leading to
enhanced classification performance. The aggregation of views
is enabled by an attention network which generates relative
weights assigned to each view. The features of each view
are scaled by these attention weights and are added to give
a pooled representation of the object.
The architecture of the proposed MVCNN-SA is shown in Figure 3. It consists of three main parts. The first part is used for feature extraction and consists of CNN blocks, where all branches of the network share the same parameters in CNN$_1$. The second part is for view-pooling and has the SAN blocks, in which all SANs share the same weights. This is followed by the third part, a fully connected layer FC$_1$ for classification.
For each 3D model, let $V = \{v_1, v_2, \ldots, v_{10}\}$ be the set of rendered images, where each $v_i$ is passed to CNN$_1$. The feature set $F_M = \{f_1, f_2, \ldots, f_{10}\}$ consists of the feature maps $f_i$ extracted from these views. In our experiments, CNN$_1$ has a ResNet-18 architecture. The feature maps $f_i$ are passed to SAN$_1$, which generates a key-value pair $(k_i, z_i)$ for a particular view $i$, as shown in equations (1) and (2):

$$z_i = Z(f_i) \in \mathbb{R}^L \quad (1)$$

where $z_i$ is the output of the value network and $L$ is the size of the value vector;

$$k_i = K(f_i) \in \mathbb{R}^L \quad (2)$$

where $k_i$ is the output of the key network.
Fig. 4: Architecture of the Self-Attention Network (SAN).

The architecture of the SAN is shown in Figure 4 and consists of two parts: a value network $Z$ containing a fully connected layer FC$_2$, and a key network $K$ containing CNN$_2$. CNN$_2$ consists of three convolutional layers followed by a fully connected layer whose outputs represent un-normalized self-attention weights. These are called keys and are passed through a softmax function for normalization, producing $w_i$ as shown in equation (3):
$$w_{ij} = \frac{e^{k_{ij}}}{\sum_i e^{k_{ij}}} \quad (3)$$

where $w_{ij}$ is the attention matrix given by the softmax function. The purpose of the softmax function is to ensure that the weights for a particular feature $j$ of the value vector sum to one across views. These weights are multiplied with the outputs from the value network as shown in equation (4):

$$P_w = \sum_i w_i \odot z_i \in \mathbb{R}^L \quad (4)$$

where $P_w$ is the aggregated view-pool and $\odot$ is the Hadamard (element-wise) product.
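To make the pooling concrete, below is a minimal PyTorch sketch of feature-wise self-attention pooling following equations (1)-(4). The layer widths, the 512×7×7 feature-map shape (what a ResNet-18 trunk produces for 224×224 inputs), and the value size $L$ are illustrative assumptions; the letter specifies CNN$_2$ only as three convolutional layers followed by a fully connected layer.

```python
import torch
import torch.nn as nn

class SelfAttentionPool(nn.Module):
    """Feature-wise attention pooling over V view feature maps (eqs. 1-4)."""
    def __init__(self, in_ch=512, spatial=7, L=256):
        super().__init__()
        # Value network Z: a fully connected layer (FC2 in the text).
        self.value = nn.Linear(in_ch * spatial * spatial, L)
        # Key network K: three conv layers plus an FC layer (CNN2);
        # the channel widths here are assumptions.
        self.key_conv = nn.Sequential(
            nn.Conv2d(in_ch, 256, 3, padding=1), nn.ReLU(),
            nn.Conv2d(256, 128, 3, padding=1), nn.ReLU(),
            nn.Conv2d(128, 64, 3, padding=1), nn.ReLU())
        self.key_fc = nn.Linear(64 * spatial * spatial, L)

    def forward(self, f):              # f: (B, V, C, H, W) view feature maps
        B, V = f.shape[:2]
        x = f.flatten(0, 1)            # merge batch and view axes
        z = self.value(x.flatten(1)).view(B, V, -1)                  # eq. (1)
        k = self.key_fc(self.key_conv(x).flatten(1)).view(B, V, -1)  # eq. (2)
        w = torch.softmax(k, dim=1)    # eq. (3): per-feature softmax over views
        return (w * z).sum(dim=1)      # eq. (4): Hadamard product, sum of views
```

In the full network, each rendered view would first pass through the shared CNN$_1$ (ResNet-18 with its classification head removed) to produce f, and the pooled vector $P_w$ would go to the fully connected classifier FC$_1$.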
This view-pool is fed to a fully connected layer with softmax activation for classification. We term this view-pooling method feature-wise attention. The entire network is trained in an end-to-end fashion using the cross-entropy loss function.

In feature-wise attention, the weights $w_i \in \mathbb{R}^L$ are vectors such that the $j$th elements across all views sum to one. The scaling of each $z_i$ is performed using a Hadamard product, as shown in equation (4). This assigns a different importance to each feature in a view vector, allowing the network to pay attention to the salient features in every view. An auxiliary entropy term is included in the loss function to study the effect of changing the attention map distribution. The entropy is calculated as in equation (5):

$$S = -\sum_i w_i \cdot \log w_i \quad (5)$$
A positive entropy penalty encourages the attention map to
be concentrated while a negative penalty causes the attention
map to become more diffuse.
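In training code, this amounts to adding the entropy term to the objective. A sketch, assuming `w` is the (B, V, L) attention map from the pooling sketch above and `lam` is the entropy coefficient swept in Table V:

```python
import torch.nn.functional as F

def loss_with_entropy(logits, targets, w, lam, eps=1e-8):
    """Cross-entropy plus the entropy term of eq. (5); a positive lam
    drives the attention map to concentrate, a negative lam to diffuse."""
    ce = F.cross_entropy(logits, targets)
    entropy = -(w * (w + eps).log()).sum(dim=1).mean()  # sum over views
    return ce + lam * entropy
```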
Fig. 5: Three types of attention maps: (a) Diffused, (b) Intermediate, (c) Concentrated.
We analyzed the attention matrix and recognized three
possible configurations. A diffused attention map is one in
which all views contribute almost equally to the prediction
and is shown in Figure 5(a). The intermediate attention map
is one in which some views contribute more to the prediction
and is plotted as shown in Figure 5(b). The last one is
the concentrated attention map in which one or two views
dominate the other views in the output and is shown in Figure
5(c).
Another view-pooling method we experimented with is one we term view-wise attention. In view-wise attention, the attention weights $w_i \in \mathbb{R}$ are scalars summing to one. For each view, the value vector $z_i$ is scaled by the attention weight $w_i$, and the scaled vectors are summed to generate the view-pool. This can be interpreted as giving the same importance to every feature of a view. Our experiments have shown that the higher degree of freedom in feature-wise attention allows the network to achieve better classification performance than view-wise attention. Figure 3 can represent both methods, feature-wise and view-wise attention, the only difference being the dimensionality of the keys $k_i$ and weights $w_i$, as illustrated in the sketch below.
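Concretely, only the key head changes in the view-wise variant: it emits one scalar per view (e.g. an `nn.Linear(64 * 7 * 7, 1)` in place of the $L$-unit key FC in the earlier sketch), so the softmax yields a single weight per view that is broadcast over all $L$ features. A minimal sketch under those assumptions:

```python
import torch

def view_wise_pool(z, k_scalar):
    """View-wise pooling: z is (B, V, L) value vectors; k_scalar is
    (B, V, 1) keys from a key head with a single output unit."""
    w = torch.softmax(k_scalar, dim=1)   # one weight per view, summing to 1
    return (w * z).sum(dim=1)            # same weight for every feature of z
```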
D. Parameter Setting
We use the Adam optimizer with an initial learning rate of 0.0001, along with a momentum of 0.9; the learning rate is decayed by a factor of 0.1 every 30 epochs. Cross-entropy loss with softmax activation is used for classification in the output layer.
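These settings map onto standard PyTorch components. A sketch, where the stated momentum of 0.9 is interpreted as Adam's first-moment coefficient $\beta_1$ (Adam has no separate momentum argument) and `model` is the MVCNN-SA network:

```python
import torch

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, betas=(0.9, 0.999))
# Decay the learning rate by a factor of 0.1 every 30 epochs.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)
```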
Since a high class imbalance is present in our dataset, when training our network with N instances per batch, we sample each instance with probability proportional to the inverse frequency of its class. Thus, instances from larger classes have a lower probability of being selected, and despite the population differences between the classes, the sampled batch is approximately balanced, which can also help improve prediction accuracy. Suitable transformations for dataset augmentation are applied to the images during sampling, such as random resized crop, random horizontal flip, and normalization with mean and standard deviation.
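In PyTorch, the inverse-frequency sampling and the listed augmentations can be written as below; `labels`, `dataset`, and `N` are placeholders, and the normalization statistics shown are the common ImageNet values, which the letter does not specify.

```python
import torch
from torch.utils.data import DataLoader, WeightedRandomSampler
from torchvision import transforms

# Weight each instance by the inverse frequency of its class so that
# sampled batches are approximately balanced despite the imbalance.
counts = torch.bincount(labels)          # labels: LongTensor of class ids
weights = (1.0 / counts.float())[labels]
sampler = WeightedRandomSampler(weights, num_samples=len(labels))

train_tf = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],  # assumed ImageNet stats
                         std=[0.229, 0.224, 0.225]),
])
loader = DataLoader(dataset, batch_size=N, sampler=sampler)
```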
IV. RESULTS AND DISCUSSION
The metrics used to evaluate performance are the following:

$$\text{Accuracy}, \; A_c = \frac{TP_c + TN_c}{TP_c + FP_c + FN_c + TN_c} \quad (6)$$

$$\text{Precision}, \; P_c = \frac{TP_c}{TP_c + FP_c} \quad (7)$$

$$\text{Recall}, \; R_c = \frac{TP_c}{TP_c + FN_c} \quad (8)$$

$$\text{and, F1 score} = \frac{2 \cdot P_c \cdot R_c}{P_c + R_c} \quad (9)$$
where TP = True Positives, TN = True Negatives, FP = False
Positives, and FN = False Negatives, with respect to a class c.
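For reference, equations (6)-(9) translate directly into code; the sketch below computes the per-class scores from arrays of true and predicted labels (`y_true` and `y_pred` are placeholders).

```python
import numpy as np

def per_class_metrics(y_true, y_pred, n_classes):
    """Per-class accuracy, precision, recall, and F1 score (eqs. 6-9)."""
    scores = {}
    n = len(y_true)
    for c in range(n_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        tn = n - tp - fp - fn
        p, r = tp / (tp + fp), tp / (tp + fn)
        scores[c] = {"accuracy": (tp + tn) / n, "precision": p,
                     "recall": r, "f1": 2 * p * r / (p + r)}
    return scores
```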
After experimenting with the two types of pooling mechanisms described in section III, we found that MVCNN-SA with feature-wise pooling outperforms MVCNN with max pooling or average pooling. It also surpasses the current state-of-the-art scores on the RoofN3D dataset, which were obtained by a classification approach that directly takes the 3D point clouds as input and is based on the PointNet architecture [17]. The comparison of results of MVCNN-SA with the PointNet-based method [18] is shown in Table II and Figure 6. Our experiment with ResNet-50, which uses all views of the point clouds separately, also resulted in lower accuracy compared to the proposed method. The computational times measured for these methods are listed in Table III.
Class          | MVCNN-SA                     | PointNet-based [18]
               | Precision  Recall  F1-score  | Precision  Recall  F1-score
Saddleback     | 0.98       0.99    0.99      | 0.99       0.97    0.98
Two-sided Hip  | 0.97       0.94    0.96      | 0.92       0.95    0.93
Pyramid        | 0.87       0.79    0.83      | 0.69       0.83    0.75

TABLE II: Comparison of MVCNN-SA results with PointNet-based results.
Fig. 6: Accuracies for various models (FP-Feature-wise Pooling,
VP- View-wise Pooling, AP- Average Pooling, MP- Max Pooling).
Method         | # Epochs | Training Time (hrs) | Testing Time (hrs)
MVCNN-SA       | 75       | 32                  | 0.3
PointNet-based | 75       | 2.4                 | 0.008
ResNet-50      | 33       | 25                  | 0.15

TABLE III: Comparison of methods based on computation time.
The confusion matrix for the feature-wise pooling approach is shown in Table IV. The results show that the proposed approach performs well in classifying all the roof types. Errors mostly come from the Pyramid class, which is under-represented in the dataset.
Saddleback Two-sided Hip Pyramid
Saddleback 13128 78 7
Two-sided Hip 199 3902 35
Pyramid 32 45 286
TABLE IV: Confusion Matrix of MVCNN-SA/FP.
Entropy coefficient                 | 1     | 10^-3 | 10^-5 | 0     | -10^-5 | -10^-3 | -10^-1
Accuracy [%]                        | 97.26 | 97.74 | 97.76 | 97.65 | 97.54  | 97.22  | 97.14
Accuracy after angular rotation [%] | 96.5  | 97.28 | 97.26 | 97.4  | 97.24  | 97.15  | 97.18

TABLE V: Comparison of accuracy based on entropy coefficients.
Further, we observed that the concentrated and diffused
attention maps showed poor accuracy. A diffused attention
map might give importance to views with irrelevant or re-
dundant information, leading to poor performance. A similar
argument holds true for concentrated attention maps, as they
might ignore views with important information. We found that the intermediate attention map, which corresponds to an entropy coefficient close to zero, produces a balance between these two behaviours and also gives the highest accuracy, as shown in Table V.
We also experimented with a change in the elevation of the camera, varying it randomly within (−30°, 30°), and captured the views for training, validation, and testing. The
performance obtained was not drastically lower compared to
results on the original dataset, as shown in Table V. This may
be due to the fact that even if some of the cameras shifted to
places where they could not capture much information, other
views would compensate for this.
In the final experiment, we captured 20 views with $\theta = 18°$ to verify that increasing the number of views does not cause large gains in performance, which was also found to be true in [6].
A. Experimental Setup
We ran our experiments on a computing node running Ubuntu 16.04 with an Intel(R) Xeon(R) E5-2630 v3 CPU @ 2.40 GHz, 80 GB RAM, and an NVIDIA GTX 1080 Ti GPU. PyTorch 1.0 was used for deep learning, and Blender 2.79b (Linux) was used to generate the rendered views from the LiDAR point clouds.
V. CONCLUSION
We proposed an adaptive importance-learning algorithm to
discover efficient image descriptors from multiple views of
LiDAR roof point clouds. Using our novel approach, we were
able to outperform the current state-of-the-art method [18]
based on PointNet on this dataset. However, the performance of our model could still be improved through hyperparameter tuning and changes to the architecture of the SAN.
REFERENCES
[1] X. Zhang, A. Zang, G. Agam, and X. Chen, “Learning from synthetic
models for roof style classification in point clouds,” in Proceedings
of the 22nd ACM SIGSPATIAL International Conference on Advances
in Geographic Information Systems, ser. SIGSPATIAL ’14. New
York, NY, USA: ACM, 2014, pp. 263–270. [Online]. Available:
http://doi.acm.org/10.1145/2666310.2666407
[2] L. Zhang and L. Zhang, “Deep learning-based classification and
reconstruction of residential scenes from large-scale point clouds,” IEEE
Trans. Geoscience and Remote Sensing, vol. 56, no. 4, pp. 1887–1897,
2018. [Online]. Available: https://doi.org/10.1109/TGRS.2017.2769120
[3] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image
recognition,” 2016 IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), pp. 770–778, 2016.
[4] J. Castagno and E. Atkins, “Roof shape classification from lidar
and satellite image data fusion using supervised learning,” Sensors, vol. 18, no. 11, p. 3960, 2018. [Online]. Available: https://www.mdpi.com/1424-8220/18/11/3960
[5] A. Wichmann, A. Agoub, and M. Kada, “Roofn3d: Deep learning
training data for 3d building reconstruction,” ISPRS - International
Archives of the Photogrammetry, Remote Sensing and Spatial Infor-
mation Sciences, vol. XLII-2, pp. 1191–1198, 2018. [Online]. Avail-
able: https://www.int-arch-photogramm-remote-sens-spatial-inf-sci.net/XLII-2/1191/2018/
[6] H. Su, S. Maji, E. Kalogerakis, and E. G. Learned-Miller, “Multi-view
convolutional neural networks for 3d shape recognition,” in Proc. ICCV,
2015.
[7] Y. Feng, Z. Zhang, X. Zhao, R. Ji, and Y. Gao, “Gvcnn: Group-view
convolutional neural networks for 3d shape recognition,” in The IEEE
Conference on Computer Vision and Pattern Recognition (CVPR), June
2018.
[8] D. Zhang, D. Meng, and J. Han, “Co-saliency detection via a
self-paced multiple-instance learning framework,” IEEE Trans. Pattern
Anal. Mach. Intell., vol. 39, no. 5, pp. 865–878, May 2017. [Online].
Available: https://doi.org/10.1109/TPAMI.2016.2567393
[9] D. Zhang, J. Han, C. Li, J. Wang, and X. Li, “Detection of
co-salient objects by looking deep and wide,” Int. J. Comput.
Vision, vol. 120, no. 2, pp. 215–232, Nov. 2016. [Online]. Available:
http://dx.doi.org/10.1007/s11263-016-0907-4
[10] G. Cheng, P. Zhou, and J. Han, “Learning rotation-invariant
convolutional neural networks for object detection in VHR optical
remote sensing images,” IEEE Trans. Geoscience and Remote
Sensing, vol. 54, no. 12, pp. 7405–7415, 2016. [Online]. Available:
https://doi.org/10.1109/TGRS.2016.2601622
[11] J. Han, D. Zhang, X. Hu, L. Guo, J. Ren, and F. Wu, “Background prior-
based salient object detection via deep reconstruction residual,” IEEE
Trans. Circuits Syst. Video Techn., vol. 25, no. 8, pp. 1309–1321, 2015.
[Online]. Available: https://doi.org/10.1109/TCSVT.2014.2381471
[12] M. Gkeli and C. Ioannidis, “Automatic 3d reconstruction of buildings
roof tops in densely urbanized areas,” ISPRS - International Archives
of the Photogrammetry, Remote Sensing and Spatial Information
Sciences, vol. XLII-4/W10, pp. 47–54, 2018. [Online]. Available: https://www.int-arch-photogramm-remote-sens-spatial-inf-sci.net/XLII-4-W10/47/2018/
[13] K. Kim and J. Shan, “Building roof modeling from airborne laser
scanning data based on level set approach,” ISPRS Journal of
Photogrammetry and Remote Sensing, vol. 66, no. 4, pp. 484 – 497,
2011. [Online]. Available: http://www.sciencedirect.com/science/article/
pii/S0924271611000396
[14] A. Wichmann, J. Jung, G. Sohn, M. Kada, and M. Ehlers, “Integra-
tion of Building Knowledge Into Binary Space Partitioning for the
Reconstruction of Regularized Building Models,” ISPRS Annals of
Photogrammetry, Remote Sensing and Spatial Information Sciences,
no. 5, pp. 541–548, Sep. 2015.
[15] B. T. Phong, “Illumination for computer generated pictures,” Commun.
ACM, vol. 18, no. 6, pp. 311–317, Jun. 1975. [Online]. Available:
http://doi.acm.org/10.1145/360825.360839
[16] Blender Online Community, Blender - a 3D modelling and rendering
package, Blender Foundation, Blender Institute, Amsterdam. [Online].
Available: http://www.blender.org
[17] C. R. Qi, H. Su, K. Mo, and L. J. Guibas, “Pointnet: Deep learning
on point sets for 3d classification and segmentation,” CoRR, vol.
abs/1612.00593, 2016. [Online]. Available: http://arxiv.org/abs/1612.00593
[18] S. Guptha and R. Bohare, “Project title,” https://github.com/
sarthakTUM/roofn3d, 2019.