Roof Classification from 3D LiDAR Point Clouds
using Multi-view CNN with Self-Attention
Dimple A Shajahan, Vaibhav Nayel and Ramanathan Muthuganapathy
Abstract—Classification of LiDAR point clouds of building
roofs plays a vital role in various urban management applications
and is significant in GIS and remote sensing. In this letter, a
novel deep learning-based method is proposed for classifying roof
point clouds, which outperforms the state-of-the-art methods.
We use a view-based method called multi-view convolutional
neural network with self-attention (MVCNN-SA), which takes
the multiple views of a roof point cloud as input and outputs
the category of the roof. Current view-based approaches treat
all views equally and simply combine the view features into
a single compact 3D descriptor. Our adaptive weight-learning algorithm, which uses a self-attention (SA) block, discovers the relative importance of each view and assigns weights to the views accordingly. This enhances the shape descriptor, resulting in better classification performance. The effectiveness of the proposed method is verified on the publicly available RoofN3D dataset by comparing it with current state-of-the-art methods.
Index Terms—MVCNN-SA, Self-Attention, Roof Classification,
LiDAR, Point Clouds, ALS
I. INTRODUCTION
Three dimensional (3D) digital representations of building
models have become significant in geographic information
systems (GIS) and remote sensing. This is due to recent
developments in 3D sensing devices that deliver high-quality
airborne laser scanning (ALS) raw data in real time. This data is used for a variety of urban management applications such as 3D city infrastructure planning and modeling, 3D city mapping, environmental simulations, change detection, and disaster estimation. Classification based on various roof styles is therefore very useful for the reconstruction and determination of building shapes from segmented LiDAR point clouds.
Classifying 3D LiDAR point clouds of building roofs with
high accuracy is a challenging task because of the irregular
point distributions, sparsity, noise, and outliers present in them.
This, in turn, demands new solutions for processing point
cloud information, because current point cloud processing
techniques have limited capabilities to automatically extract
semantic information from raw data [1],[2]. Learning-based techniques, in contrast, can automatically learn robust and discriminative feature representations from 3D LiDAR point clouds [1],[2].
Traditional methods of roof classification are not widely ap-
plicable because they consider specific features of the various
roof styles and are restricted to a small domain and hence do
Dimple A Shajahan and Ramanathan Muthuganapathy are with the Depart-
ment of Engineering Design, IIT Madras, Chennai-36, India. Vaibhav Nayel
is with the Department of Electrical Engineering, IIT Madras.
Manuscript received April 30th, 2019; revised August 21st, 2019.
Fig. 1: Roof types¹ and point clouds in RoofN3D: (a) Saddleback, (b) Two-sided Hip, (c) Pyramid.
not provide a general solution [1]. Learning-based methods are
not dependent on geometry-based fitting and roof topology and
therefore perform well in roof classification [1]. Deep learning
has led to a series of breakthroughs for image classification
and has solved more complex tasks with high performance [3]. It is still not widely explored in the field of remote sensing and could perform better when large-scale data is available [2],[4].
[4]. Therefore, our approach is to efficiently classify 3D urban
LiDAR point clouds of roofs, using deep learning. The various
kinds of roofs present in the RoofN3D dataset [5] used for this
work and the point clouds which represent those shapes are
shown in Figure 1.
It has been shown in [6],[7], that the classifiers involving 2D
rendered images for a 3D shape outperform those built directly
on 3D representations. This may be because 2D data has lower dimensionality, noise sensitivity, and computational complexity than complex, high-dimensional 3D data [6]. Since there might be a loss of information due to
the limited number of viewpoints and also because a viewpoint
can only represent a 3D shape partially, we have considered
several views of the point cloud.
Multi-view convolutional neural network (MVCNN) [6]
adopts the strategy of treating all views as equal and simply
combines the view descriptors into a single shape descriptor
using max pooling [7]. This method of view-pooling is not
able to determine the relevance of each view. To discover the
relative importance or weight of each of the views, we develop
a deep learning strategy to learn weights adaptively, which
enhances the shape descriptor resulting in better classification
performance. Similar approaches have been used for the iden-
tification of salient features from multiple instance images for
¹http://wherecamp2017.geoit.org/wp-content/uploads/2017/11/8-3D-City-Models-for-Digital-Maps-and-Navigation-Andreas-Wichmann-TU-Berlin.pdf
object detection as in [8], [9], [10], [11].
In this letter, we propose a novel technique in deep learning
for view-pooling by adding a self-attention network (SAN) to
MVCNN. MVCNN-SA (multi-view CNN with self-attention) can learn discriminative features of multiple views efficiently using a technique called feature-wise pooling, which leads to better classification of the roofs.
II. RELATED WORKS
The approaches used for roof modelling from LiDAR point
clouds are mainly divided into two categories: model-driven -
which is parametric, and data-driven - which is non-parametric
[12]. Model-driven methods require prior knowledge of the shape and are robust to noise and missing points. They use geometric and topological constraints that cannot be generalized to different datasets and are characterized by low classification accuracy [1]. In data-driven approaches, the 3D model is
reconstructed by fitting together the individual roof planes
extracted by RANSAC segmentation or directions of normal
vectors. Here, the performance is affected by the average
point density of the input point clouds, presence of noise and
missing points [12],[13]. Advanced data-driven methods exist
for 3D roof modelling which can segment more complex roofs
but are also affected by the sparsity of points [13],[14].
In contrast, learning-based methods, which are also data-
driven, can be used to classify roofs with high accuracy [1],[4].
Zhang et al.[1] proposed a roof-classification technique using
a random forest classifier, that is trained on a bag of words
features extracted from a point cloud and a synthetically
generated codebook. Castagno et al. [4] proposed a method based on the fusion of two modalities, LiDAR point clouds and satellite images, from publicly available GIS data.
Though these works also focus on classifying roofs from
LiDAR point clouds using supervised learning, they operate
on smaller datasets [4]. Our work stands out from them by
using a very large dataset, RoofN3D [5], and a deep learning-
based approach. The remainder of the letter is structured as
follows: Section III provides the methodology used for the
classification of LiDAR roof point clouds, followed by results
and discussion. Section IV includes performance measures
to evaluate the model and section V concludes and suggests
directions for future work.
III. METHODOLOGY
Fig. 2: Pipeline of 3D urban roof classification.
The pipeline of the classification procedure, shown in Figure 2, has two main phases: rendering of views and classification
of views. Rendering of views, which involves the conversion
of 3D point clouds of urban roofs to images, is explained in
section III-B. The steps involved in the classification of views
are detailed below and in section III-C.
1) Rendered images are categorized into train, validation
and test images, and are fed to the feature extraction
part of MVCNN-SA.
2) The corresponding features extracted are fed to the
self-attention network, which performs view-pooling by
producing adaptive weights.
3) The obtained shape descriptor is then passed to the
classifier to produce the label.
A. Dataset
The main challenge in the direct application of deep learning
for 3D roof classification was the absence of a publicly
available dataset that is well annotated and of reasonable size
[5]. A new, real dataset, RoofN3D [5], was recently released
which is both well-annotated and large, thus making it suitable
for deep learning-based classification. RoofN3D consists of
3D point clouds corresponding to the roofs of buildings of
New York City and has 118,073 instances. Each instance in
the dataset contains the point cloud of a single roof. A large
number of point clouds contain outliers such as points of trees,
walls or projections. Some point clouds also have missing
points, making this a representative real-world dataset. One
limitation of this dataset is that it contains only three categories
of roofs. RoofN3D is still a work in progress and is expected
to be improved, updated, and extended in the future with more
roof categories.
Based on our experiments, we partitioned the dataset into 60% for training, 25% for validation, and the remaining 15% for testing. Since this is an imbalanced dataset, the proportion between the classes was maintained through stratified sampling. The exact details of the split are given in Table I, and a sketch of the split procedure follows the table.
Class Instances Train Validation Test
Saddleback 89057 53434 22264 13359
Two-sided Hip 26830 16098 6707 4025
Pyramid 2186 1311 547 328
Total 118073 70843 29518 17712
TABLE I: RoofN3D dataset with data split details for the
proposed work.
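As an illustration, the split described above can be reproduced with a two-stage stratified split. The sketch below is a minimal example, assuming the per-instance class labels are available as an integer array; the function name and the `seed` argument are ours, not part of the RoofN3D tooling.

```python
from sklearn.model_selection import train_test_split

def stratified_split(indices, labels, seed=0):
    """Split instances 60/25/15 while preserving class proportions."""
    # Carve out the 60% training set, stratified by class label.
    train_idx, rest_idx, _, rest_lbl = train_test_split(
        indices, labels, train_size=0.60, stratify=labels, random_state=seed)
    # Split the remaining 40% into validation (25%) and test (15%),
    # i.e. 0.625 / 0.375 of the remainder, again stratified.
    val_idx, test_idx = train_test_split(
        rest_idx, train_size=0.625, stratify=rest_lbl, random_state=seed)
    return train_idx, val_idx, test_idx
```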
B. Rendering of Views
The technique used for capturing 2D views is similar to the first camera setup described in section 3.1 of [6]. The view generator takes 2D projections of the point cloud from different angles to capture the geometry of the object. In this case, the point clouds are in an upright orientation along a specific axis (e.g., the z-axis). The virtual cameras are placed around the point cloud such that the camera viewpoints are separated by intervals of angle $\theta$ in the same plane. Each camera is elevated $30°$ above the horizontal plane and points towards the centroid of the point cloud. Phong shading [15] in Blender [16] was used to generate the rendered views of the point clouds, and the images are captured with depth information. The whole setup, with lighting, is similar to the approach used in [6]. We set $\theta = 36°$, which yields 10 views per object.
C. Classification of views
In multi-view methods, a recent work [7] has shown that
learning the relative importance of view descriptors results
in a better representation of the shape descriptor, leading to
enhanced classification performance. The aggregation of views
is enabled by an attention network which generates relative
weights assigned to each view. The features of each view
are scaled by these attention weights and are added to give
a pooled representation of the object.
The architecture of the proposed MVCNN-SA is shown in Figure 3. It consists of three main parts. The first part is used for feature extraction and consists of CNN blocks, where all branches of the network share the same parameters in CNN$_1$. The second part is for view-pooling and has the SAN blocks, in which all SANs share the same weights. This is followed by the third part, a fully connected layer FC$_1$ for classification.
For each 3D model, let $V = \{v_1, v_2, \ldots, v_{10}\}$ be the set of rendered images, where each $v_i$ is passed to CNN$_1$. The feature set $F_M = \{f_1, f_2, \ldots, f_{10}\}$ consists of the feature maps $f_i$ extracted from these views. In our experiments, CNN$_1$ has a ResNet-18 architecture. The feature maps $f_i$ are passed to SAN$_1$, which generates a key-value pair $(k_i, z_i)$ for a particular view $i$, as shown in equations (1) and (2):

$$z_i = Z(f_i) \in \mathbb{R}^L \quad (1)$$

where $z_i$ is the output of the value network and $L$ is the size of the value vector;

$$k_i = K(f_i) \in \mathbb{R}^L \quad (2)$$

where $k_i$ is the output of the key network.
Fig. 4: Architecture of the Self-Attention Network (SAN).

The architecture of the SAN is shown in Figure 4 and consists of two parts: a value network $Z$ containing a fully connected layer FC$_2$, and a key network $K$ containing CNN$_2$. CNN$_2$ consists of three convolutional layers followed by a fully connected layer whose outputs represent un-normalized self-attention weights. These are called keys and are passed through a softmax function for normalization, producing $w_i$ as shown in equation (3):
$$w_{ij} = \frac{e^{k_{ij}}}{\sum_i e^{k_{ij}}} \quad (3)$$

where $w_{ij}$ is the attention matrix given by the softmax function. The purpose of the softmax function is to ensure that the weights for a particular feature $j$ of the value vector sum to one across views. These weights are multiplied with the outputs from the value network as shown in equation (4):

$$P_w = \sum_i w_i \odot z_i \in \mathbb{R}^L \quad (4)$$

where $P_w$ is the aggregated view-pool and $\odot$ is the Hadamard (element-wise) product.
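To make the pooling concrete, below is a minimal PyTorch sketch of feature-wise self-attention pooling following equations (1)-(4). The layer widths, the 512×7×7 feature-map shape (what a ResNet-18 trunk produces for 224×224 inputs), and the value size $L$ are illustrative assumptions; the letter specifies CNN$_2$ only as three convolutional layers followed by a fully connected layer.

```python
import torch
import torch.nn as nn

class SelfAttentionPool(nn.Module):
    """Feature-wise attention pooling over V view feature maps (eqs. 1-4)."""
    def __init__(self, in_ch=512, spatial=7, L=256):
        super().__init__()
        # Value network Z: a fully connected layer (FC2 in the text).
        self.value = nn.Linear(in_ch * spatial * spatial, L)
        # Key network K: three conv layers plus an FC layer (CNN2);
        # the channel widths here are assumptions.
        self.key_conv = nn.Sequential(
            nn.Conv2d(in_ch, 256, 3, padding=1), nn.ReLU(),
            nn.Conv2d(256, 128, 3, padding=1), nn.ReLU(),
            nn.Conv2d(128, 64, 3, padding=1), nn.ReLU())
        self.key_fc = nn.Linear(64 * spatial * spatial, L)

    def forward(self, f):              # f: (B, V, C, H, W) view feature maps
        B, V = f.shape[:2]
        x = f.flatten(0, 1)            # merge batch and view axes
        z = self.value(x.flatten(1)).view(B, V, -1)                  # eq. (1)
        k = self.key_fc(self.key_conv(x).flatten(1)).view(B, V, -1)  # eq. (2)
        w = torch.softmax(k, dim=1)    # eq. (3): per-feature softmax over views
        return (w * z).sum(dim=1)      # eq. (4): Hadamard product, sum of views
```

In the full network, each rendered view would first pass through the shared CNN$_1$ (ResNet-18 with its classification head removed) to produce f, and the pooled vector $P_w$ would go to the fully connected classifier FC$_1$.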
This view-pool is fed to a fully connected layer with softmax activation for classification. We term this view-pooling method feature-wise attention. The entire network is trained in an end-to-end fashion using the cross-entropy loss function.

In feature-wise attention, the weights $w_i \in \mathbb{R}^L$ are vectors such that the $j$th elements across all views sum to one. The scaling of each $z_i$ is performed using a Hadamard product, as shown in equation (4). This assigns a different importance to each feature in a view vector, allowing the network to pay attention to the salient features in every view. An auxiliary entropy term is included in the loss function to study the effect of changing the attention map distribution. The entropy is calculated as in equation (5):

$$S = -\sum_i w_i \cdot \log w_i \quad (5)$$
A positive entropy penalty encourages the attention map to
be concentrated while a negative penalty causes the attention
map to become more diffuse.
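In training code, this amounts to adding the entropy term to the objective. A sketch, assuming `w` is the (B, V, L) attention map from the pooling sketch above and `lam` is the entropy coefficient swept in Table V:

```python
import torch.nn.functional as F

def loss_with_entropy(logits, targets, w, lam, eps=1e-8):
    """Cross-entropy plus the entropy term of eq. (5); a positive lam
    drives the attention map to concentrate, a negative lam to diffuse."""
    ce = F.cross_entropy(logits, targets)
    entropy = -(w * (w + eps).log()).sum(dim=1).mean()  # sum over views
    return ce + lam * entropy
```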
Fig. 5: Three types of attention maps: (a) Diffused, (b) Intermediate, (c) Concentrated.
We analyzed the attention matrix and recognized three
possible configurations. A diffused attention map is one in
which all views contribute almost equally to the prediction
and is shown in Figure 5(a). The intermediate attention map
is one in which some views contribute more to the prediction
and is plotted as shown in Figure 5(b). The last one is
the concentrated attention map in which one or two views
dominate the other views in the output and is shown in Figure
5(c).
Another view-pooling method we experimented with is one we term view-wise attention. In view-wise attention, the attention weights $w_i \in \mathbb{R}$ are scalars summing to one. For each view, the value vector $z_i$ is scaled by the attention weight $w_i$, and the scaled vectors are summed to generate the view-pool. This can be interpreted as giving the same importance to every feature of a view. Our experiments have shown that the higher degree of freedom in feature-wise attention allows the network to achieve better classification performance than view-wise attention. Figure 3 can represent both methods, feature-wise and view-wise attention, the only difference being the dimensionality of the keys $k_i$ and weights $w_i$, as illustrated in the sketch below.
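Concretely, only the key head changes in the view-wise variant: it emits one scalar per view (e.g. an `nn.Linear(64 * 7 * 7, 1)` in place of the $L$-unit key FC in the earlier sketch), so the softmax yields a single weight per view that is broadcast over all $L$ features. A minimal sketch under those assumptions:

```python
import torch

def view_wise_pool(z, k_scalar):
    """View-wise pooling: z is (B, V, L) value vectors; k_scalar is
    (B, V, 1) keys from a key head with a single output unit."""
    w = torch.softmax(k_scalar, dim=1)   # one weight per view, summing to 1
    return (w * z).sum(dim=1)            # same weight for every feature of z
```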
D. Parameter Setting
We use the Adam optimizer with an initial learning rate of 0.0001, along with a momentum of 0.9; the learning rate is decayed by a factor of 0.1 every 30 epochs. Cross-entropy loss with softmax activation is used for classification in the output layer.
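These settings map onto standard PyTorch components. A sketch, where the stated momentum of 0.9 is interpreted as Adam's first-moment coefficient $\beta_1$ (Adam has no separate momentum argument) and `model` is the MVCNN-SA network:

```python
import torch

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, betas=(0.9, 0.999))
# Decay the learning rate by a factor of 0.1 every 30 epochs.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)
```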
Since a high class imbalance is present in our dataset, when training our network with N instances per batch, we sample each instance with probability proportional to the inverse frequency of its class. Thus, instances from larger classes have a lower probability of being selected, and despite the population differences between the classes, the sampled batch is approximately balanced, which can also help improve prediction accuracy. Suitable transformations for dataset augmentation are applied to the images during sampling, such as random resized crop, random horizontal flip, and normalization with mean and standard deviation.
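In PyTorch, the inverse-frequency sampling and the listed augmentations can be written as below; `labels`, `dataset`, and `N` are placeholders, and the normalization statistics shown are the common ImageNet values, which the letter does not specify.

```python
import torch
from torch.utils.data import DataLoader, WeightedRandomSampler
from torchvision import transforms

# Weight each instance by the inverse frequency of its class so that
# sampled batches are approximately balanced despite the imbalance.
counts = torch.bincount(labels)          # labels: LongTensor of class ids
weights = (1.0 / counts.float())[labels]
sampler = WeightedRandomSampler(weights, num_samples=len(labels))

train_tf = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],  # assumed ImageNet stats
                         std=[0.229, 0.224, 0.225]),
])
loader = DataLoader(dataset, batch_size=N, sampler=sampler)
```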
IV. RESULTS AND DISCUSSION
The metrics used to evaluate performance are the following:

$$\text{Accuracy}, \; A_c = \frac{TP_c + TN_c}{TP_c + FP_c + FN_c + TN_c} \quad (6)$$

$$\text{Precision}, \; P_c = \frac{TP_c}{TP_c + FP_c} \quad (7)$$

$$\text{Recall}, \; R_c = \frac{TP_c}{TP_c + FN_c} \quad (8)$$

$$\text{and, F1 score} = \frac{2 \cdot P_c \cdot R_c}{P_c + R_c} \quad (9)$$
where TP = True Positives, TN = True Negatives, FP = False
Positives, and FN = False Negatives, with respect to a class c.
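For reference, equations (6)-(9) translate directly into code; the sketch below computes the per-class scores from arrays of true and predicted labels (`y_true` and `y_pred` are placeholders).

```python
import numpy as np

def per_class_metrics(y_true, y_pred, n_classes):
    """Per-class accuracy, precision, recall, and F1 score (eqs. 6-9)."""
    scores = {}
    n = len(y_true)
    for c in range(n_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        tn = n - tp - fp - fn
        p, r = tp / (tp + fp), tp / (tp + fn)
        scores[c] = {"accuracy": (tp + tn) / n, "precision": p,
                     "recall": r, "f1": 2 * p * r / (p + r)}
    return scores
```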
After experimenting with the two types of pooling mechanisms described in section III, we found that MVCNN-SA with feature-wise pooling outperforms MVCNN with max pooling or average pooling. It also surpasses the current state-of-the-art scores on the RoofN3D dataset, which were obtained by a classification approach that directly takes the 3D point clouds as input and is based on the PointNet architecture [17]. The comparison of results of MVCNN-SA with the PointNet-based method [18] is shown in Table II and Figure 6. Our experiment with ResNet-50, which uses all views of the point clouds separately, also resulted in lower accuracy compared to the proposed method. The computational times measured for these methods are listed in Table III.
Class          | MVCNN-SA                     | PointNet-based [18]
               | Precision  Recall  F1-score  | Precision  Recall  F1-score
Saddleback     | 0.98       0.99    0.99      | 0.99       0.97    0.98
Two-sided Hip  | 0.97       0.94    0.96      | 0.92       0.95    0.93
Pyramid        | 0.87       0.79    0.83      | 0.69       0.83    0.75

TABLE II: Comparison of MVCNN-SA results with PointNet-based results.
Fig. 6: Accuracies for various models (FP-Feature-wise Pooling,
VP- View-wise Pooling, AP- Average Pooling, MP- Max Pooling).
Method         | # Epochs | Training Time (hrs) | Testing Time (hrs)
MVCNN-SA       | 75       | 32                  | 0.3
PointNet-based | 75       | 2.4                 | 0.008
ResNet-50      | 33       | 25                  | 0.15

TABLE III: Comparison of methods based on computation time.
The confusion matrix for the feature-wise pooling approach is shown in Table IV. The results show that the proposed approach performs well in classifying all the roof types. Errors mostly come from the Pyramid class, which is under-represented in the dataset.
Saddleback Two-sided Hip Pyramid
Saddleback 13128 78 7
Two-sided Hip 199 3902 35
Pyramid 32 45 286
TABLE IV: Confusion Matrix of MVCNN-SA/FP.
Entropy coefficient                 | 1     | 10^-3 | 10^-5 | 0     | -10^-5 | -10^-3 | -10^-1
Accuracy [%]                        | 97.26 | 97.74 | 97.76 | 97.65 | 97.54  | 97.22  | 97.14
Accuracy after angular rotation [%] | 96.5  | 97.28 | 97.26 | 97.4  | 97.24  | 97.15  | 97.18

TABLE V: Comparison of accuracy based on entropy coefficients.
Further, we observed that the concentrated and diffused
attention maps showed poor accuracy. A diffused attention
map might give importance to views with irrelevant or re-
dundant information, leading to poor performance. A similar
argument holds true for concentrated attention maps, as they
might ignore views with important information. We found that the intermediate attention map, which corresponds to an entropy coefficient close to zero, produces a balance between these two behaviours and also gives the highest accuracy, as shown in Table V.
We also experimented with a change in the elevation of the camera, varying it randomly within (−30°, 30°), and captured the views for training, validation, and testing. The
performance obtained was not drastically lower compared to
results on the original dataset, as shown in Table V. This may
be due to the fact that even if some of the cameras shifted to
places where they could not capture much information, other
views would compensate for this.
In the final experiment, we captured 20 views with $\theta = 18°$ to verify that increasing the number of views does not cause large gains in performance, which was also found to be true in [6].
A. Experimental Setup
We ran our experiments on a computing node running Ubuntu 16.04 with an Intel(R) Xeon(R) E5-2630 v3 CPU @ 2.40 GHz, 80 GB RAM, and an NVIDIA GTX 1080 Ti GPU. PyTorch 1.0 was used for deep learning, and Blender 2.79b (Linux) was used to generate the rendered views from the LiDAR point clouds.
V. CONCLUSION
We proposed an adaptive importance-learning algorithm to
discover efficient image descriptors from multiple views of
LiDAR roof point clouds. Using our novel approach, we were
able to outperform the current state-of-the-art method [18]
based on PointNet on this dataset. However, the performance of our model could still be improved through hyperparameter tuning and changes to the architecture of the SAN.
REFERENCES
[1] X. Zhang, A. Zang, G. Agam, and X. Chen, “Learning from synthetic
models for roof style classification in point clouds,” in Proceedings
of the 22nd ACM SIGSPATIAL International Conference on Advances
in Geographic Information Systems, ser. SIGSPATIAL ’14. New
York, NY, USA: ACM, 2014, pp. 263–270. [Online]. Available:
http://doi.acm.org/10.1145/2666310.2666407
[2] L. Zhang and L. Zhang, “Deep learning-based classification and
reconstruction of residential scenes from large-scale point clouds,” IEEE
Trans. Geoscience and Remote Sensing, vol. 56, no. 4, pp. 1887–1897,
2018. [Online]. Available: https://doi.org/10.1109/TGRS.2017.2769120
[3] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image
recognition,” 2016 IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), pp. 770–778, 2016.
[4] J. Castagno and E. Atkins, “Roof shape classification from lidar
and satellite image data fusion using supervised learning,” Sensors, vol. 18, no. 11, p. 3960, 2018. [Online]. Available: https://www.mdpi.com/1424-8220/18/11/3960
[5] A. Wichmann, A. Agoub, and M. Kada, “Roofn3d: Deep learning
training data for 3d building reconstruction,” ISPRS - International
Archives of the Photogrammetry, Remote Sensing and Spatial Infor-
mation Sciences, vol. XLII-2, pp. 1191–1198, 2018. [Online]. Avail-
able: https://www.int-arch-photogramm-remote-sens-spatial-inf-sci.net/XLII-2/1191/2018/
[6] H. Su, S. Maji, E. Kalogerakis, and E. G. Learned-Miller, “Multi-view
convolutional neural networks for 3d shape recognition,” in Proc. ICCV,
2015.
[7] Y. Feng, Z. Zhang, X. Zhao, R. Ji, and Y. Gao, “Gvcnn: Group-view
convolutional neural networks for 3d shape recognition,” in The IEEE
Conference on Computer Vision and Pattern Recognition (CVPR), June
2018.
[8] D. Zhang, D. Meng, and J. Han, “Co-saliency detection via a
self-paced multiple-instance learning framework,” IEEE Trans. Pattern
Anal. Mach. Intell., vol. 39, no. 5, pp. 865–878, May 2017. [Online].
Available: https://doi.org/10.1109/TPAMI.2016.2567393
[9] D. Zhang, J. Han, C. Li, J. Wang, and X. Li, “Detection of
co-salient objects by looking deep and wide,” Int. J. Comput.
Vision, vol. 120, no. 2, pp. 215–232, Nov. 2016. [Online]. Available:
http://dx.doi.org/10.1007/s11263-016-0907-4
[10] G. Cheng, P. Zhou, and J. Han, “Learning rotation-invariant
convolutional neural networks for object detection in VHR optical
remote sensing images,” IEEE Trans. Geoscience and Remote
Sensing, vol. 54, no. 12, pp. 7405–7415, 2016. [Online]. Available:
https://doi.org/10.1109/TGRS.2016.2601622
[11] J. Han, D. Zhang, X. Hu, L. Guo, J. Ren, and F. Wu, “Background prior-
based salient object detection via deep reconstruction residual,” IEEE
Trans. Circuits Syst. Video Techn., vol. 25, no. 8, pp. 1309–1321, 2015.
[Online]. Available: https://doi.org/10.1109/TCSVT.2014.2381471
[12] M. Gkeli and C. Ioannidis, “Automatic 3d reconstruction of buildings
roof tops in densely urbanized areas,” ISPRS - International Archives
of the Photogrammetry, Remote Sensing and Spatial Information
Sciences, vol. XLII-4/W10, pp. 47–54, 2018. [Online]. Available: https://www.int-arch-photogramm-remote-sens-spatial-inf-sci.net/XLII-4-W10/47/2018/
[13] K. Kim and J. Shan, “Building roof modeling from airborne laser
scanning data based on level set approach,” ISPRS Journal of
Photogrammetry and Remote Sensing, vol. 66, no. 4, pp. 484 – 497,
2011. [Online]. Available: http://www.sciencedirect.com/science/article/
pii/S0924271611000396
[14] A. Wichmann, J. Jung, G. Sohn, M. Kada, and M. Ehlers, “Integra-
tion of Building Knowledge Into Binary Space Partitioning for the
Reconstruction of Regularized Building Models,” ISPRS Annals of
Photogrammetry, Remote Sensing and Spatial Information Sciences,
no. 5, pp. 541–548, Sep. 2015.
[15] B. T. Phong, “Illumination for computer generated pictures,” Commun.
ACM, vol. 18, no. 6, pp. 311–317, Jun. 1975. [Online]. Available:
http://doi.acm.org/10.1145/360825.360839
[16] Blender Online Community, Blender - a 3D modelling and rendering
package, Blender Foundation, Blender Institute, Amsterdam. [Online].
Available: http://www.blender.org
[17] C. R. Qi, H. Su, K. Mo, and L. J. Guibas, “Pointnet: Deep learning
on point sets for 3d classification and segmentation,” CoRR, vol.
abs/1612.00593, 2016. [Online]. Available: http://arxiv.org/abs/1612.00593
[18] S. Guptha and R. Bohare, “Project title,” https://github.com/
sarthakTUM/roofn3d, 2019.