Multi-Modal and Multi-Scale Fusion 3D Object
Detection of 4D Radar and LiDAR for Autonomous
Driving
Li Wang, Xinyu Zhang, Jun Li, Baowei Xv, Rong Fu, Haifeng Chen, Lei Yang, Dafeng Jin, Lijun Zhao
Abstract—Multi-modal fusion plays a critical role in 3D object detection, overcoming the inherent limitations of single-sensor perception in autonomous driving. Most fusion methods require data from high-resolution cameras and LiDAR sensors, which are less robust in adverse conditions, and whose detection accuracy drops drastically at long range as the point cloud density decreases. Alternatively, the fusion of Radar and LiDAR alleviates these issues but remains a developing field, especially for 4D Radar, which offers a more robust and broader detection range. Nevertheless, the different data characteristics and noise distributions of the two sensors hinder performance improvement when directly integrating them. Therefore, we are the first to propose a novel fusion method, termed M2-Fusion, for 4D Radar and LiDAR, based on Multi-modal and Multi-scale fusion. To better integrate the two sensors, we propose an Interaction-based Multi-Modal Fusion (IMMF) method utilizing a self-attention mechanism to learn features from each modality and exchange intermediate layer information. To address the precision-efficiency trade-off of single-resolution voxel division, we also put forward a Center-based Multi-Scale Fusion (CMSF) method that first regresses the center points of objects and then extracts features at multiple resolutions. Furthermore, we present a data preprocessing method based on the Gaussian distribution that effectively decreases data noise, reducing errors caused by the point cloud divergence of 4D Radar data in the x-z plane. To evaluate the proposed fusion method, a series of experiments were conducted using the Astyx HiRes 2019 dataset, which includes calibrated 4D Radar and 16-line LiDAR data. The results demonstrate that our fusion method compares favorably with state-of-the-art algorithms. Compared to PointPillars, our method achieves mAP (mean average precision) increases of 5.64% and 13.57% for 3D and BEV (bird's eye view) detection of the car class at the moderate level, respectively.
This work was supported by the National High Technology Research and
Development Program of China under Grant No. 2018YFE0204300, the
National Natural Science Foundation of China under Grant No. 62273198,
U1964203, the China Postdoctoral Science Foundation (No.2021M691780),
and State Key Laboratory of Robotics and Systems (HIT) (SKLRS-2022-KF-
12). (Corresponding author: Xinyu Zhang.)
L. Wang is with the State Key Laboratory of Automotive Safety and Energy,
and the School of Vehicle and Mobility, Tsinghua University, Beijing 100084,
and also with the State Key Laboratory of Robotics and Systems (HIT), Harbin
150001, China (e-mail: wangli thu@mail.tsinghua.edu.cn).
X. Zhang is with the State Key Laboratory of Automotive Safety and
Energy, and the School of Vehicle and Mobility, Tsinghua University, Bei-
jing 100084, China, and also with the Department of Mechanical Engineering, University of Michigan, Ann Arbor, MI 48105, USA (e-mail:
xyzhang@tsinghua.edu.cn).
J. Li, B. Xv, R. Fu, H. Chen, L. Yang, and D. Jin are with
the State Key Laboratory of Automotive Safety and Energy, and
the School of Vehicle and Mobility, Tsinghua University, Beijing
100084, China (e-mail: lj19580324@126.com; xvbaowei718500@163.com;
fu-r16@mails.tsinghua.edu.cn; chenhaifeng0523@foxmail.com; yanglei20@
mails.tsinghua.edu.cn; jindf@tsinghua.edu.cn).
L. Zhao is with the State Key Laboratory of Robotics and System at Harbin
Institute of Technology, Harbin 150001, China (e-mail: zhaolj@hit.edu.cn).
Fig. 1. The overall concept of M2-Fusion processing both sparse 4D Radar
and LiDAR point clouds simultaneously. We propose two novel modules,
IMMF and CMSF, remarkably improving detection accuracy. The average
precision bar charts above compare M2-Fusion and seven mainstream object
detection networks (PointRCNN [1], SECOND [2], PV-RCNN [3], Point-
Pillars [4], Part-A2[5], Voxel R-CNN [6], and MVX-Net [7]). As shown,
our method outperforms the others by a large margin in both 3D and BEV
object detection. “C+R” and “C+L” mean “Camera + Radar” and “Camera +
LiDAR”, respectively.
Index Terms—Object detection, 4D Radar, multi-modal fusion,
autonomous driving
I. INTRODUCTION
The increasing safety requirements for autonomous driving have motivated greater environmental adaptability in onboard sensing systems [8]–[11]. Previous works have investigated the application of multiple sensors for object detection, including cameras, LiDAR, and Radar [12]–[15]. However, practical applications are primarily limited by two challenges: sensor performance itself and algorithm implementation [12].
Fusing data from multiple modalities can significantly improve
accuracy and overcome critical issues in autonomous driving.
High-resolution cameras are typically used to determine the
shape and texture of complex objects, such as road signs and
traffic signals [16]. Although computer vision algorithms have
made some progress in visual 3D detection, the task remains
challenging due to a lack of accurate depth information [17].
LiDAR is another commonly used sensor that can accurately
calculate object distance relative to the sensor. However, the LiDAR point cloud becomes sparse as range increases, and the object detection capability decreases significantly.
Camera and LiDAR systems can also be adversely affected by
weather conditions such as heavy rain, snow, or fog [18], [19].
These factors increase the risks associated with autonomous
driving and limit the development of related technologies to a
large extent.
Radar has been receiving increased attention from industrial
and academic communities as a potential alternative. These
systems generally transmit millimeter waves in the frequency
range from 30 GHz to 300 GHz and record the returned beams.
As millimeter wave signals are weakly attenuated in the atmosphere, Radar can detect objects at long range by measuring the round-trip time. In addition, millimeter wave Radar can adapt to bad weather conditions due to its strong penetrability, allowing long-range observations (usually over 200 m) in inclement conditions such as heavy rain, fog, and night driving [20]. 3D Radar systems, typically employed as an auxiliary measurement for other sensors, collect horizontal position and velocity information (x, y, v) using antennas positioned only in the x and y directions. Usually, the azimuth angle resolution is over 10°, and there is no vertical angle resolution. In contrast, 4D Radar sensors use antennas positioned in three directions to obtain 3D position and velocity information (x, y, z, v), thus providing denser point clouds. The azimuth angle resolution is about 1°, and the vertical angle resolution is about 2°. For instance, the Astyx HiRes 2019 [21] dataset used in this work includes more than 1,000 4D Radar points per frame. Compared to 3D Radar, 4D Radar significantly improves angle resolution and vertical ranging capability, enabling denser point cloud acquisition.
High-level autonomous driving requires multi-modal fu-
sion [7], [22]–[24] to achieve sufficient and robust detection
performance. Fused data are often collected from various
sensors, including 3D Radar, cameras, LiDAR, and 4D Radar.
Specifically, 4D Radar offers substantial long-range (detection
range over 200 m) and robustness advantages, while LiDAR
offers higher accuracy. A more practical and cost-effective 3D
object detection method could thus be developed by fusing
these modalities effectively. However, research on the fusion of 4D Radar and LiDAR remains scarce. Moreover, 4D Radar data is sparser than LiDAR data, making it difficult to estimate object shapes and sizes. Furthermore, the range and height of 4D Radar point clouds are fuzzy due to aliasing and false detections caused by clutter and multi-path echoes.
Accordingly, some objects are highly visible in LiDAR (e.g.,
cars and pedestrians) but appear blurry on the Radar. Thus,
effectively combining these two types of point cloud features
is a critical problem. One intuitive solution would be to
directly splice the features extracted from the two modalities to
achieve fusion. Nevertheless, this strategy suffers from a low
correlation between LiDAR and Radar data since the spatial
distributions vary significantly across point clouds for the
two modalities. To date, inter-modality relationships between
4D Radar and LiDAR have rarely been investigated in a
framework used for object detection.
Motivated by the above discussion, as far as the authors
know, we are the first to propose a 3D object detection
fusion method for 4D Radar and LiDAR, termed Multi-
modal and Multi-scale Fusion (M2-Fusion). This process
consists of two core modules: Interaction-based Multi-Modal
Fusion (IMMF) and Center-based Multi-Scale Fusion (CMSF).
Through a large number of comparative experiments with
existing mainstream methods, state-of-the-art 3D detection
performance is achieved by significantly improving feature
expression capabilities using a self-attention mechanism and
multi-scale fusion (see Fig. 1).
One of the primary challenges in multi-modal fusion is to
maintain a strong relationship between different features of the
same object from each modality. A self-attention mechanism
is a viable approach for accomplishing feature interaction,
allowing the parameters from different modalities to learn
from each other. Therefore, we propose an interaction-based
multi-modal fusion method termed IMMF. The module fuses
and updates each 4D Radar and LiDAR feature using the
attention-weighted information from the other modality. Then
the information flow is transmitted to each modality to capture
complex intra-modal relationships. As a result, the attention weights conditioned within each modality allow the 4D Radar and LiDAR features to reinforce each other.
Secondly, the high sparsity of 4D Radar and LiDAR data
often necessitates the enhancement of feature expressions in
each modality. For instance, voxel-based frameworks offer
excellent performance and are considered the mainstream
technology for 3D object detection [2], [4], [25]. In this
process, a 3D space is divided into regular 3D voxels, and
a convolutional neural network (CNN) is utilized to extract
features for compression into a bird’s eye view feature map
for 3D object detection. However, due to the uneven density
of point clouds in 3D spaces, setting a single scale for default
voxel grids is insufficient to represent all information in a given
scenario. Smaller grids produce more refined localization and
features, while larger grids are faster due to fewer non-empty
grids and smaller feature tensors. Multi-scale methods, which
employ the benefits of both small and large grids, have shown
great potential in various 3D object detection tasks [26]. One
intuitive solution would be directly merging voxel features
from high-resolution and low-resolution data as input to the
next backbone, thus forming an aggregated feature across
various voxel scales. However, this method inevitably causes
tremendous memory and computational pressure on the limited
onboard computing resources. Moreover, the superposition
of feature maps at different scales does not highlight object
characteristics, and the fuzziness of the 4D Radar may produce
additional noise.
Thus, to better integrate multi-scale point cloud features,
we propose a center-based multi-scale fusion method termed
CMSF, which includes a two-step strategy. In the first step,
key points are extracted from objects with higher scores. In
the second step, voxels are selected around key points in a
low-resolution grid and then converted to a high-resolution
grid. Pseudo-images of all scales are then standardized using
an adaptive max-pooling operation.
In addition, we propose a data preprocessing method based
on Gaussian normal distribution to reduce the data noise
of the 4D Radar. Generally, 4D Radar sensors have good
ranging accuracy, but the insufficient resolution in the vertical
direction leads to poor accuracy in the z-axis direction and
introduces considerable data noise. We utilize the Gaussian
normal distribution to obtain the statistic rule of 4D Radar
point cloud distribution in the recent Astyx HiRes 2019 [21]
dataset, a 3D object detection dataset that includes 4D Radar
data. Then, noise in the 4D Radar point cloud is corrected to
improve data quality.
The primary contributions of our work are as follows:
• We are the first to propose a novel fusion method, termed M2-Fusion, that utilizes 4D Radar and LiDAR data for 3D object detection in autonomous driving, offering higher precision and, in particular, better long-range detection accuracy.
• We propose an interaction-based multi-modal fusion method termed IMMF. This module learns correlations between the two modalities by utilizing attention mechanisms, emphasizing critical features and reducing noise to improve accuracy and robustness for 3D object detection.
• We develop a center-based multi-scale fusion method termed CMSF, which combines the advantages of small-scale and large-scale features. We demonstrate that selecting voxels in a particular range around key points results in better efficiency and accuracy than methods using only one quantization scale.
• We conduct extensive experiments on the Astyx HiRes 2019 dataset to evaluate M2-Fusion. Experimental results show that our method achieves state-of-the-art performance.
II. RELATED WORK
A. Radar-based Methods
Several recent studies have applied Radar to the field of au-
tonomous driving for tasks such as free space estimation [27],
object detection [14], [28], object classification [15], [29]–
[31], and instance segmentation [32]. For example, Li et al.
[33] investigated Radar characteristics and proposed a deep
learning model that applied a CNN to object and feature
detection. However, these experiments were conducted using
a static device rather than a dynamic scene. Chen et al.
[14] proposed a Radar-based 2D object detection method
based on PointNet [34] and 2D spatial coordinates acquired
from Radar data, ego-motion compensation Doppler velocities,
and Radar cross-section measurements. Unsupervised coupling
networks were also used for change detection with Radar
and cameras [35], [36]. These methods utilized 3D Radar for
2D object detection, but the detection accuracy was not ideal
because 3D Radar lacked vertical information. Xv et al. [37]
proposed a pillar feature attention network for 4D Radar that
could identify dependencies for every pillar. Although some
improvements were reported, the algorithm was applied solely
to Radar and did not fully use data from other modalities. In
this study, we focus on the fusion of Radar and LiDAR for
3D object detection in autonomous driving.
B. LiDAR-based Methods
LiDAR-based 3D object detection is typically performed
using point-based, voxel-based, or BEV-based methods. Point-
based methods extract features from the original point cloud.
PointRCNN [1] was a two-stage point-based approach that
generated 3D region proposals from point clouds by first
dividing them into the top-down front and background points.
In the second stage, the proposals were converted into standard
coordinates and local features were learned in detail to obtain
results. Based on PointRCNN [1], Part-A2[5] developed a
method that could simultaneously predict high-quality 3D
proposals and accurate part location information in objects.
PointRGCN [38] was also a point-based method. It utilized
graph convolution to classify and regress region proposal
boxes. Voxel-based methods convert point clouds into voxels
and extract features from each voxel. VoxelNet [25] was
the first method to encode point clouds into voxels. How-
ever, it suffered from low efficiency, particularly for large
voxel quantities. SECOND [2] applied sparse convolutions to
non-empty voxels, significantly reducing computational costs.
SASSD [39], a one-stage method, outperformed comparable
two-stage methods. PointPillars [4] further improved these
results by dividing point clouds into individual pillars and
mapping them onto a 2D pseudo-image. A 2D convolution
was then used to regress bounding boxes, achieving the highest
detection efficiency to date. In contrast, BEV-based methods
process point clouds as pseudo-images. PIXOR [40] cut point
clouds into multi-channel pseudo-images in the vertical di-
rection. However, vertical information was often lost when
projecting to BEV, requiring supplemented height information
from other angles. Based on PIXOR [40], HDNet [41] took
advantage of geometric and semantic prior information pro-
vided by high-definition (HD) maps to enhance the robustness
and accuracy.
Although these LiDAR-based methods have achieved good
results, they still face a trade-off between accuracy and effi-
ciency. Point cloud density is typically distributed unevenly,
and most data is from the background. As the range increases,
the point cloud density decreases rapidly, which leads to
low detection accuracy. However, few methods consider these
problems.
C. Camera-Radar Fusion
In order to improve the robustness of 3D object detection,
multiple researchers have proposed methods to fuse camera
and radar data for object detection in bad weather. In camera-
Radar fusion algorithms, the camera images are typically
used to extract area proposals or 2D detection bounding
boxes, while 3D Radar provides depth information for the
final detection results. For example, Ramin et al. [42] built on the center-point detection network CenterNet [43] and employed a frustum-based method to associate objects detected by 3D Radar with detected image center points.
This approach utilized 3D Radar data to supplement image
features, thereby regressing object attributes such as size,
depth, rotation, and speed. CRF-Net [44] performed a multi-
level fusion of the camera and radar data for object detection
based on the implementation of RetinaNet [45], which learned
to fuse each modality effectively for the detection result at
each level. RODNet [46] employed a camera-radar fusion
strategy to achieve robust 3D object detection in various
driving scenarios. Fu et al. [47] proposed a fusion method for
roadside cameras and millimeter wave radars as an application
of edge computing. However, 3D Radar only plays an auxiliary
role in this case, and a significant gap remains in estimating
object depth compared to LiDAR.
D. Camera-LiDAR Fusion
Camera-LiDAR fusion has been attracting increased atten-
tion in recent years. MV3D [22], a pioneering multi-view
fusion network based on region proposals, integrated RGB
images, front views from point clouds, and BEV views to
project point cloud features of BEV view into front views for
image fusion. However, MV3D did not perform well in small
object detection. Multi-modal fusion detection networks have
been adopted in prior studies to generate region proposals [22],
[23], [48], thereby improving detection accuracy for small
objects. As discussed earlier, a primary reason for the low
accuracy of existing methods is the fusion of camera and BEV
images collected from two different perspectives. Converting
images or LiDAR data to other views significantly reduces the
data's resolution and accuracy. To solve this problem, PointPainting [24] performed semantic segmentation on the image and appended the segmentation results to the point cloud, which was then fed to a LiDAR detection network to enhance precision.
However, PointPainting required more computing resources
because the original point cloud data was not preprocessed.
Secondly, the semantic segmentation model and the 3D point
cloud detection model required a high degree of coupling,
which limited the scope of the application. For example, [17]
divided object detection into several related tasks in an end-
to-end training architecture to fuse LiDAR and camera data.
SCANet [49] employed a spatial channel attention module to
select discriminative features. Meyer et al. presented a fusion
algorithm based on LaserNet [50] that could accurately detect
objects at long ranges [51]. But LaserNet performed poorly on
the KITTI dataset, which was too small to fully train their large
network; hence, this method relied on larger datasets. In [52],
Wang et al. provided a different perspective for transferring
images into a point cloud, achieving the highest performance
for the KITTI 3D object detection benchmark. Shi et al. [5]
presented part-aware modules and part-aggregation modules
to enhance the expression of point clouds. RoarNet [44]
used images to generate proposal regions prior to using point
clouds. The method could deal with the asynchronous situation
between LiDAR and the camera. However, the detection accuracy depended on the recall of the proposals, and undetected object proposals could not be recovered in the second step. Such fusion was limited when LiDAR points were sparse, and inclement weather, such as fog, rain, and snow, also decreased the quality of LiDAR and camera data significantly.
E. Radar-LiDAR Fusion
Few works have studied the fusion of Radar and LiDAR. RadarNet [18] applied a voxel network
to feature extraction and conversion in 3D Radar and LiDAR.
However, this approach ignores the low resolution and sparsity
of 3D Radar data, which lacks vertical position information.
As a result, features extracted from 3D Radar have provided
limited contributions to improving fusion networks, particu-
larly when applying standard LiDAR-based methods. Unlike
3D Radar, which only provides horizontal position informa-
tion, 4D Radar offers 3D coordinates and generates point
clouds similar to LiDAR. As a result, 4D Radar point clouds
can be merged with LiDAR without specific transformations.
Prior research on 4D Radar and LiDAR data fusion is
lacking, primarily because 4D Radar sensors are still de-
veloping. In addition, currently available feature extraction
methods of the point cloud are designed for LiDAR, which
may not be suitable for 4D Radar point clouds. Another reason
is that low-density point clouds for LiDAR cannot produce
high-grade detection results. Regardless, we believe LiDAR
point cloud features could help extract 4D Radar point cloud
features. The key to this process lies in the interaction between
point cloud modalities during feature extraction, especially in
the case of 4D Radar data. To this end, we propose a 3D
object detection method based on the fusion of 4D Radar and
LiDAR point cloud features. Self-attention is used to introduce
information between Radar and LiDAR modalities to establish
these interactions.
The critical task in multi-scale fusion is enriching features
with strong semantics and strong detail expression capabil-
ities. One representative work was the FPN [53] algorithm,
which extracted features from images at varying scales. As
a result, feature maps at all levels exhibited reliable semantic
information and high-resolution features. A primary limitation
of FPN was the considerable memory costs, which may
result in a longer inference time. While FPN is generally
used in 2D object detection, it also has some applications
for 3D object detection in LiDAR point clouds. PV-RCNN
[20] introduced a 3D sparse Conv layer at different scales in
the feature extraction stage, which were then combined with
point cloud and BEV features. However, previous studies have
only combined features at a single resolution. Voxel-FPN [26]
achieved multi-scale fusion at the data processing stage, as
point clouds were divided into voxels at multiple resolutions.
Features were then extracted along the same dimension to
achieve feature fusion. Qian et al. [19] fused LiDAR and
Radar data across both time dimensions and sensing modalities
using deep late fusion. These studies illustrate the necessity of
multi-scale fusion, which still suffers from several problems,
including large memory consumption and long inference time.
The following section describes improvements made to multi-
scale fusion mechanisms in this study.
Fig. 2. The overall framework of our proposed M2-Fusion method consisting of four sub-networks. (1) IMMF: the module in the green box demonstrates
multi-modal fusion, which learns correlations between two modalities by utilizing an attention mechanism. (2) CMSF: the module in the blue box performs
multi-scale feature extraction by dividing selected pixels for key points into small-scale pixels. (3) 2D CNN backbone. (4) RPN (Region Proposal Network)
head.
III. M2-FUSION FRAMEWORK
In this section, we introduce the proposed M2-Fusion
method in detail, the framework of which is shown in Fig. 2. In
this process, point cloud features from 4D Radar and LiDAR
data are effectively fused using an IMMF module based on
multi-modal fusion and a CMSF module based on multi-scale
fusion.
A. Interaction-based Multi-Modal Fusion (IMMF)
A comparison of 4D Radar and LiDAR data suggests
their forms are similar, though Radar exhibits weak reflection
echoes from vehicles, pedestrians, and cyclists. However, the
Radar point cloud for the background environment is also
much more extensive than that of the recognition objects.
Existing network structures are better at identifying particular
objects in uniform backgrounds. Therefore, it is essential to fo-
cus our limited resources on the core portion when integrating
the two modalities, for which we propose an interaction-based
multi-modal fusion approach (IMMF). Precisely, the tensor de-
composed from a feature in the attention mechanism (between
modalities) constitutes a weight matrix that exerts network
learning attention and exchanges inter-modality tensors. In
other words, LiDAR representations guide the attention of 4D
Radar, and 4D Radar representations direct the attention of
LiDAR.
The network structure in the green dotted box in Fig. 2
shows the detailed IMMF module, a symmetrical structure
that consists of two self-attention models. IMMF facilitates
interaction between modalities by exchanging tensors in sym-
metric structures and guiding the network to learn more
valuable features. Following PointPillars [4], we divide the input point cloud into S×S pillars. The points in each pillar are expanded from 4 dimensions (x, y, z, r) to 10 dimensions (x, y, z, r, x_c, y_c, z_c, x_p, y_p, z_p), where r represents the reflectivity. The added dimensions are calculated as follows:

$(x_c, y_c, z_c) = (x, y, z) - (x_m, y_m, z_m)$
$(x_p, y_p, z_p) = (x, y, z) - (x_g, y_g, z_g)$,  (1)

where (x_c, y_c, z_c) is the deviation of each point relative to the pillar center, (x_m, y_m, z_m) is the center coordinate of each pillar, (x_p, y_p, z_p) is the deviation of each point from the grid center, and (x_g, y_g, z_g) is the center coordinate of each grid.
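For clarity, the sketch below illustrates the 10-dimensional point decoration of Eq. (1) in NumPy. The function name and the convention that the pillar center is the arithmetic mean of its points (as in PointPillars) are our own assumptions, not code released by the authors.

```python
import numpy as np

def expand_pillar_points(points, pillar_center, grid_center):
    """Decorate the points of one pillar as in Eq. (1) (illustrative sketch).

    points:        (K, 4) array of (x, y, z, r) inside one pillar
    pillar_center: (3,)  coordinates (x_m, y_m, z_m) of the pillar center
    grid_center:   (3,)  coordinates (x_g, y_g, z_g) of the pillar's grid cell
    Returns a (K, 10) array (x, y, z, r, x_c, y_c, z_c, x_p, y_p, z_p).
    """
    xyz = points[:, :3]
    offset_c = xyz - pillar_center      # deviation from the pillar center
    offset_p = xyz - grid_center        # deviation from the grid-cell center
    return np.concatenate([points, offset_c, offset_p], axis=1)
```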
We extract the features of each pillar from the LiDAR and 4D Radar point clouds and set F_L and F_R, the initial features for LiDAR and 4D Radar, as the input to the IMMF module. To improve computational efficiency, we first reduce the features to 64 dimensions, denoted F_{L64} and F_{R64}:

$F_{L64} = \mathrm{Maxpool}(\mathrm{Linear}(F_L))$
$F_{R64} = \mathrm{Maxpool}(\mathrm{Linear}(F_R))$,  (2)

where Maxpool(·) represents a maximum pooling layer and Linear(·) indicates a fully connected layer. The features are then further reduced to 16 dimensions, denoted F_{L16} and F_{R16}:

$F_{L16} = \mathrm{Conv}(\mathrm{Maxpool}(\mathrm{Linear}(F_L)))$
$F_{R16} = \mathrm{Conv}(\mathrm{Maxpool}(\mathrm{Linear}(F_R)))$,  (3)

where Conv(·) denotes a convolutional layer. The corresponding weight matrices F_{Lw} and F_{Rw} can then be calculated as follows:

$F_{Lw} = \mathrm{Softmax}((F_{L16})^{T} F_{R16})$
$F_{Rw} = \mathrm{Softmax}((F_{R16})^{T} F_{L16})$,  (4)
Fig. 3. The network framework for the CMSF module. The input can be LiDAR or 4D Radar point cloud. The backbone network uses large-scale pseudo-
images to regress and select key points with high scores. Pixels around these key points are then divided into small scales, and the pseudo-images are
regenerated. Finally, all pseudo-images are superimposed to generate features with excellent representation capabilities. Specifically, S in the figure represents
the scale of pillars. The red square represents key points, and its surrounding orange squares represent the chosen pillars.
where Softmax(·) is the softmax function.
The sizes of F_{Lw} and F_{Rw} are M×N and N×M, respectively, where M is the number of LiDAR features and N is the number of Radar features. These two weight matrices contain modal information about each other. We then multiply F_{Lw} on the right by F_{R64}. The resulting feature size is M×d, since F_{Lw} has a size of M×N and F_{R64} has a size of N×d (d is the feature dimension). This feature is consistent with the size of F_{L64}; F_{L64} is subtracted from it, and the result is passed through a linear layer, a normalization layer, and an activation function before being added back to F_{L64} to acquire the interactive LiDAR features F_{Lt}. The same operation is conducted to obtain the interactive Radar features F_{Rt}. This process can be represented as:

$F_{Lm} = \mathrm{ReLU}(\mathrm{BN}(\mathrm{LN}(F_{Lw} F_{R64} - F_{L64})))$
$F_{Rm} = \mathrm{ReLU}(\mathrm{BN}(\mathrm{LN}(F_{Rw} F_{L64} - F_{R64})))$,  (5)

$F_{Lt} = F_{Lm} + F_{L64}$
$F_{Rt} = F_{Rm} + F_{R64}$,  (6)

where ReLU(·) is a rectified linear unit (ReLU) activation function, BN(·) denotes a batch normalization layer, LN(·) represents a linear layer, and F_{Lm} and F_{Rm} are intermediate variables.
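To make the data flow of Eqs. (2)-(6) concrete, the following PyTorch-style sketch implements one IMMF block. The class name, tensor shapes, and layer sizes are our own assumptions; this is an illustration of the mechanism rather than the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class IMMFBlock(nn.Module):
    """Interaction-based multi-modal fusion, Eqs. (2)-(6) (illustrative sketch)."""

    def __init__(self, d_in=10, d_feat=64, d_attn=16):
        super().__init__()
        self.lin_l = nn.Linear(d_in, d_feat)        # Linear(.) in Eq. (2)
        self.lin_r = nn.Linear(d_in, d_feat)
        self.conv_l = nn.Conv1d(d_feat, d_attn, 1)  # Conv(.) in Eq. (3)
        self.conv_r = nn.Conv1d(d_feat, d_attn, 1)
        self.ln_l = nn.Linear(d_feat, d_feat)       # LN(.) in Eq. (5) is a linear layer
        self.ln_r = nn.Linear(d_feat, d_feat)
        self.bn_l = nn.BatchNorm1d(d_feat)
        self.bn_r = nn.BatchNorm1d(d_feat)

    def forward(self, pts_l, pts_r):
        # pts_l: (M, P, 10) decorated points of M LiDAR pillars (P points each)
        # pts_r: (N, P, 10) decorated points of N Radar pillars
        # Eq. (2): Linear + max pooling over the points of each pillar -> 64-D features.
        f_l64 = self.lin_l(pts_l).max(dim=1).values          # (M, 64)
        f_r64 = self.lin_r(pts_r).max(dim=1).values          # (N, 64)

        # Eq. (3): 1x1 convolution compresses the pooled features to 16-D.
        f_l16 = self.conv_l(f_l64.t().unsqueeze(0)).squeeze(0).t()   # (M, 16)
        f_r16 = self.conv_r(f_r64.t().unsqueeze(0)).squeeze(0).t()   # (N, 16)

        # Eq. (4): cross-modal attention weights, shapes (M, N) and (N, M).
        w_l = F.softmax(f_l16 @ f_r16.t(), dim=-1)
        w_r = F.softmax(f_r16 @ f_l16.t(), dim=-1)

        # Eqs. (5)-(6): exchange information across modalities, then add residuals.
        f_lm = F.relu(self.bn_l(self.ln_l(w_l @ f_r64 - f_l64)))
        f_rm = F.relu(self.bn_r(self.ln_r(w_r @ f_l64 - f_r64)))
        return f_lm + f_l64, f_rm + f_r64
```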
B. Center-based Multi-Scale Fusion (CMSF)
The division method used in PointPillars [4] can divide
point clouds into high-resolution or low-resolution pillars,
depending on the voxel size. High-resolution voxel divisions
can learn more features and improve the resulting detection ac-
curacy at the cost of increased training difficulty and inference
time. In contrast, low-resolution voxel divisions exhibit shorter
training time and higher efficiency but lack local details,
which leads to lower detection accuracy. Moreover, since most point cloud voxels are divided at a fixed scale, density differences between specific areas in the 3D space of a single point cloud frame can be pronounced. It is also common to use farthest-point sampling in voxel-based networks, which unifies the number of points in each pillar to one value, while pillars with fewer points are padded with zeros. However, this sampling discards abundant information. Thus, a single fixed scale is insufficient to express features with all required information, especially for distant objects with sparse point clouds or small objects such as pedestrians. To resolve this issue, we propose a two-stage method that fuses multi-scale information around key points: object key points are regressed in the first stage and then used to select point clouds for multi-scale fusion in the second stage. The
employed to produce center points in the pseudo-images, with
chosen pillars corresponding to these center points. Pillars are
processed by the IMMF module, allowing information from
the same scales and different modalities to interact. Afterward,
pseudo-images are generated from these concatenated features
and stacked to enhance feature expression.
Specifically, the original point cloud is divided into two scales (S and S/2). Initially, we encode pillars at scale S and generate a pseudo-image $I \in \mathbb{R}^{H \times W \times 64}$ with a size of H×W. The coordinates (x, y) are then transferred to the corresponding ground-truth heat map. Key points in the ground truth are calculated by:

$C_x = \frac{x - x_{min}}{x_{max} - x_{min}} \times h_w$
$C_y = \frac{y - y_{min}}{y_{max} - y_{min}} \times h_l$,  (7)

where C_x and C_y are the coordinates of a key point in the ground truth, x and y are the center coordinates of a 3D bounding box, x_{min}, x_{max}, y_{min}, and y_{max} are the corresponding minimum and maximum values, and h_w and h_l are the length and width of
the heat map. All ground-truth key points are then splatted onto the heat map using a Gaussian kernel function.
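As an illustration of Eq. (7) and the Gaussian splatting step, the sketch below builds a ground-truth heat map in NumPy. The kernel width sigma and the function signature are our assumptions, since the paper does not specify them.

```python
import numpy as np

def draw_center_heatmap(boxes_xy, pc_range, heatmap_hw, sigma=2.0):
    """Build a key-point heat map from box centers (illustrative sketch).

    boxes_xy:   (K, 2) ground-truth box centers (x, y) in sensor coordinates
    pc_range:   (x_min, x_max, y_min, y_max) of the point-cloud range
    heatmap_hw: (h_w, h_l) integer heat-map width and length
    """
    x_min, x_max, y_min, y_max = pc_range
    h_w, h_l = heatmap_hw
    heatmap = np.zeros((h_l, h_w), dtype=np.float32)
    ys, xs = np.mgrid[0:h_l, 0:h_w]
    for x, y in boxes_xy:
        cx = (x - x_min) / (x_max - x_min) * h_w   # Eq. (7)
        cy = (y - y_min) / (y_max - y_min) * h_l
        g = np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))
        heatmap = np.maximum(heatmap, g)           # keep the peak value per pixel
    return heatmap
```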
Key point coordinates are applied over a large-scale space,
as shown in Fig. 3. The expression of these key points can
be represented by a heat map, with a ratio four times smaller
than the input pseudo-image. In addition, the width and height
of pillars with scale S/2 are decreased by a factor of two
compared with pillars of scale S, scaling all corresponding
coordinates by a factor of eight. Pillars are then selected near
adjacent key points, and a fixed area of square sides (exhibiting
the most vehicles) is identified for use in generating pseudo-
images. Images on a larger scale are then transformed to the
proper dimensions by adaptive pooling. This allows pseudo-
images on both large and small scales to be concatenated
to generate high-dimensional features. Finally, the backbone
network is applied to the high-dimensional features to regress
detected objects.
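A minimal sketch of the scale-alignment step just described follows, assuming the two pseudo-images share the channel dimension: adaptive max pooling brings the finer (S/2) map to the coarser resolution before channel-wise concatenation. The shapes and function name are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def fuse_multiscale_pseudo_images(img_large, img_small):
    """Align and concatenate multi-scale pseudo-images (illustrative sketch).

    img_large: (C, H, W)   pseudo-image built from pillars of scale S
    img_small: (C, 2H, 2W) pseudo-image rebuilt around key points with scale S/2
    """
    pooled = F.adaptive_max_pool2d(img_small.unsqueeze(0),
                                   img_large.shape[-2:]).squeeze(0)
    # (2C, H, W) high-dimensional feature map fed to the 2D backbone
    return torch.cat([img_large, pooled], dim=0)
```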
C. Pre-processing of 4D Radar Point Clouds
4D Radar is equipped with azimuthal and vertical angle
antennas to measure horizontal and vertical positions. Gen-
erally, a high angular resolution of 4D Radar needs a very
large effective aperture from a number of antennas. However,
the number of vertical angle antennas of 4D Radar is often
small, which will cause aperture resolution issues along the
azimuthal direction. For example, due to aperture limitations,
various objects with differing angles for the same speed and
distance conditions may be challenging to distinguish in the
azimuthal direction. Compared with LiDAR, the point cloud
output by 4D Radar has more miscellaneous points or noise
points, which are determined by the electromagnetic scattering
characteristics including multiple reflections, refractions, and
diffractions to a certain extent. Specifically, the theoretical vertical angle of the Radar in Astyx HiRes 2019 is θ_i degrees, but the actual one is higher than 9×θ_i degrees, as shown in Fig. 4 (a). Although the observed targets are virtually all above the ground, a considerable number of 4D Radar points fall below the ground due to this aperture resolution issue, which affects the detection accuracy. Therefore, we propose a novel method
for addressing noise in the data.
We use the Gaussian normal distribution to determine whether the vertical angle θ_t is in the normal range, based on the Shapiro-Wilk (S-W) test. Here we focus on two descriptive statistics, i.e., skewness and kurtosis, which help determine whether the point cloud conforms to a Gaussian normal distribution. Specifically, the symmetry of the distribution, i.e., the inequality between the left and right distribution tails, can be verified using the skewness value, while the peakedness and heaviness of the distribution tails can be confirmed using the kurtosis value. The skewness (g_1) and kurtosis (g_2) are shape parameters of the point cloud, which are calculated frame by frame as follows:

$g_1(t) = E\left[\left(\frac{\theta_t - \mu}{\sigma}\right)^3\right]$
$g_2(t) = E\left[\left(\frac{\theta_t - \mu}{\sigma}\right)^4\right]$,  (8)
Fig. 4. Y-axis views of 4D Radar (blue) and LiDAR (red) point clouds,
including (a) the original data and (b) the processed data.
where θ_t denotes the divergence angle of the 4D Radar point cloud, µ is the mean value and σ is the standard deviation of the divergence angle, and E refers to the mean operation.
As the maximum divergence angle for the 4D Radar is
far beyond the sensor setting range, we reduce the point
cloud elevation in the x-z plane so that all vertical slopes are
limited to within θ_m. When the absolute values of the kurtosis and skewness are both less than 1, we assume the data frame conforms to the Gaussian normal distribution and set θ_m to the mean value of the vertical angle. Otherwise, if the frame does not conform to the Gaussian normal distribution, we set θ_m to the median value. Finally, we update the coordinates by the following equation:

$x'_t = \cos\left(\frac{2\theta_t}{\theta_m}\arctan\frac{z_t}{x_t}\right)\sqrt{x_t^2 + z_t^2}$
$z'_t = \sin\left(\frac{2\theta_t}{\theta_m}\arctan\frac{z_t}{x_t}\right)\sqrt{x_t^2 + z_t^2}$,  (9)
where x_t and z_t are the original coordinates of the point cloud, x'_t and z'_t are the coordinates after pre-processing, and θ_m is the angle to which the pre-processed data are compressed.
After the above pre-processing of the 4D Radar point cloud, the noisy data are corrected for the subsequent detection task. The intuitive effect is shown in Fig. 4 (b), which illustrates that the point cloud data is transformed into a reasonable range.
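The following NumPy/SciPy sketch mirrors the per-frame normality check and the choice of θ_m described above. The final clipping step is a simplified stand-in for the exact remapping of Eq. (9), so the whole function should be read as an assumption-laden illustration rather than the authors' preprocessing code.

```python
import numpy as np
from scipy.stats import skew, kurtosis

def limit_radar_elevation(points):
    """Per-frame 4D Radar elevation correction (illustrative sketch).

    points: (K, >=3) array with columns (x, y, z, ...) for one Radar frame.
    The skew/kurtosis thresholds follow the text; clipping the elevation angle
    to [-theta_m, theta_m] is a simplified substitute for Eq. (9).
    """
    x, z = points[:, 0], points[:, 2]
    rng = np.sqrt(x ** 2 + z ** 2)            # range in the x-z plane
    theta = np.degrees(np.arctan2(z, x))      # per-point divergence angle

    g1 = skew(theta)                          # Eq. (8), frame-level skewness
    g2 = kurtosis(theta)                      # excess kurtosis (Gaussian ~ 0)
    theta_m = np.mean(theta) if (abs(g1) < 1 and abs(g2) < 1) else np.median(theta)

    # Simplified correction: keep each point's range, restrict its vertical slope.
    new_theta = np.radians(np.clip(theta, -abs(theta_m), abs(theta_m)))
    out = points.copy()
    out[:, 0] = np.cos(new_theta) * rng
    out[:, 2] = np.sin(new_theta) * rng
    return out
```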
D. Overall Loss
We design a loss function that combines the regression requirements for 3D bounding boxes and key points. The overall loss draws on both PointPillars and CenterNet [43]: the key-point training loss is identical to that of CenterNet [43], while the 3D bounding box regression loss is equivalent to that of PointPillars. Combining these two losses gives:

$L_{total} = \alpha L_c + \beta L_p$,  (10)
where L_{total} is the total loss, L_c is the key-point regression loss, L_p is the 3D bounding box regression loss, and α and β are balance coefficients.
The CenterNet [43] loss function is given by:

$L_c = L_k + \lambda_{size} L_{size} + \lambda_{off} L_{off}$,  (11)
where L_k is a penalty-reduced pixel-wise logistic regression (focal) loss, λ_{size} = 0.1, λ_{off} = 1, L_{size} is the object size loss, and L_{off} is the offset loss.
A prediction $\hat{Y}_{xyc} = 1$ corresponds to a detected key point, while $\hat{Y}_{xyc} = 0$ denotes the background. These conditions can be represented as follows:

$L_k = -\frac{1}{N}\sum_{xyc}\begin{cases}(1 - \hat{Y}_{xyc})^{\alpha}\log(\hat{Y}_{xyc}), & \text{if } Y_{xyc} = 1\\ L_{otherwise}, & \text{otherwise}\end{cases}$  (12)

$L_{otherwise} = (1 - Y_{xyc})^{\beta}(\hat{Y}_{xyc})^{\alpha}\log(1 - \hat{Y}_{xyc})$,  (13)

$Y_{xyc} = \exp\left(-\frac{(x - \tilde{p}_x)^2 + (y - \tilde{p}_y)^2}{2\sigma_p^2}\right)$,  (14)

where N is the number of key points in the image, α = 2, and β = 4.
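For reference, a compact PyTorch sketch of the penalty-reduced focal loss in Eqs. (12)-(13) is given below; the clamping epsilon and the tensor shapes are our assumptions.

```python
import torch

def centernet_focal_loss(pred, gt, alpha=2, beta=4, eps=1e-6):
    """Penalty-reduced focal loss over heat maps (illustrative sketch).

    pred, gt: heat maps of shape (C, H, W); gt holds the Gaussian-splatted
    key points of Eq. (14), with exact 1s at object centers.
    """
    pos = gt.eq(1).float()                 # pixels where Y_xyc = 1
    neg = 1.0 - pos
    pred = pred.clamp(eps, 1 - eps)

    pos_loss = ((1 - pred) ** alpha) * torch.log(pred) * pos          # Eq. (12)
    neg_loss = ((1 - gt) ** beta) * (pred ** alpha) * torch.log(1 - pred) * neg  # Eq. (13)

    num_pos = pos.sum().clamp(min=1.0)     # N, the number of key points
    return -(pos_loss.sum() + neg_loss.sum()) / num_pos
```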
The loss function for the object size regression is then given by:

$L_{size} = \frac{1}{N}\sum_{k=1}^{N}\left|\hat{S}_{p_k} - s_k\right|$,  (15)

where $p_k = \left(\frac{x_1^{(k)} + x_2^{(k)}}{2}, \frac{y_1^{(k)} + y_2^{(k)}}{2}\right)$ is the center point of object k, $s_k = (x_2^{(k)} - x_1^{(k)}, y_2^{(k)} - y_1^{(k)})$ is the object size, $\hat{S}$ is the corresponding size prediction, and $(x_1^{(k)}, y_1^{(k)}, x_2^{(k)}, y_2^{(k)})$ is the bounding box for object k.
The offset loss function is given by:

$L_{off} = \frac{1}{N}\sum_{p}\left|\hat{O}_{\hat{p}} - \left(\frac{p}{R} - \hat{p}\right)\right|$,  (16)

where $\hat{O}$ is the local offset, p is the ground truth key point, and $\hat{p} = \lfloor p/R \rfloor$ is its low-resolution equivalent.
The loss function for 3D bounding box regression can be expressed as:

$L_p = \frac{1}{N_{pos}}(\beta_{cls} L_{cls} + \beta_{dir} L_{dir} + \beta_{loc} L_{loc})$,  (17)

where N_{pos} is the number of positive anchors, L_{cls} is the object classification loss, L_{dir} is the direction loss computed with a softmax function, L_{loc} is the localization loss, β_{cls} = 1, β_{dir} = 0.2, and β_{loc} = 2.
The classification focal loss can be represented as:

$L_{cls} = -\alpha(1 - p_a)^{\gamma}\log p_a$,  (18)

where p_a is the class probability for an anchor, α = 0.25, and γ = 2.
We parameterize the 3D regression results as (x, y, z, l, w, h, θ), where x, y, z represent the center location, l, w, h are the length, width, and height of the object box, and θ is the yaw rotation around the z-axis. The localization loss function is then given by:

$L_{loc} = \sum_{u \in (x, y, z, w, h, l, \theta)} \mathrm{SmoothL1}(\Delta u)$,  (19)

The 7-dimensional residual vector $(\Delta x, \Delta y, \Delta z, \Delta l, \Delta w, \Delta h, \Delta\theta)$ between the ground truth and the regressed result is calculated by:

$\Delta x = \frac{x^{gt} - x^{a}}{d^{a}},\quad \Delta y = \frac{y^{gt} - y^{a}}{d^{a}},\quad \Delta z = \frac{z^{gt} - z^{a}}{d^{a}}$
$\Delta w = \log\frac{w^{gt}}{w^{a}},\quad \Delta l = \log\frac{l^{gt}}{l^{a}},\quad \Delta h = \log\frac{h^{gt}}{h^{a}}$
$\Delta\theta = \sin(\theta^{gt} - \theta^{a}),\quad d^{a} = \sqrt{(w^{a})^2 + (l^{a})^2}$,  (20)

where SmoothL1(·) is a smooth L1 loss function, and the superscripts gt and a represent the ground truth and the anchor box values, respectively.
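The box-residual targets of Eq. (20) can be computed directly; the sketch below restates them in NumPy with illustrative names. The encoding mirrors the standard SECOND/PointPillars scheme that the paper adopts.

```python
import numpy as np

def encode_box_residuals(gt_box, anchor_box):
    """Residual targets of Eq. (20) (illustrative sketch).

    gt_box, anchor_box: length-7 arrays (x, y, z, l, w, h, theta).
    """
    xg, yg, zg, lg, wg, hg, tg = gt_box
    xa, ya, za, la, wa, ha, ta = anchor_box
    da = np.sqrt(wa ** 2 + la ** 2)            # anchor diagonal d^a
    return np.array([
        (xg - xa) / da,                        # delta x
        (yg - ya) / da,                        # delta y
        (zg - za) / da,                        # delta z (normalized by d^a, as in Eq. (20))
        np.log(wg / wa),                       # delta w
        np.log(lg / la),                       # delta l
        np.log(hg / ha),                       # delta h
        np.sin(tg - ta),                       # delta theta
    ])
```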
IV. EXPERIMENTS
This section presents the results of several experiments
applying the proposed M2-Fusion framework to the Astyx
HiRes 2019 dataset, verifying its advantages for 3D object
detection. The proposed method is evaluated using compar-
isons with seven mainstream detection methods, including
PointRCNN [1], SECOND [2], PV-RCNN [3], PointPillars [4],
Part-A2[5], Voxel R-CNN [6], and multi-modal fusion method
MVX-Net [7]. Adequate ablation studies are also conducted
to verify the effectiveness of each proposed module, including
data preprocessing (DP), IMMF, and CMSF. In addition,
hyperparameter tuning is used to assess performance trends.
A. Dataset
Astyx HiRes 2019 [21] is an open-access database that
includes 4D Radar data for use in 3D object detection. Its
purpose is to provide high-resolution 4D Radar data to the
research community. The set consists of 4D Radar frames,
16-line LiDAR data, and camera images with temporal and
spatial calibration. The data were split using a ratio of 3:1 to
ensure the training and test distributions were consistent. The
4D Radar and LiDAR point clouds typically included 1,000
to 10,000 points and 10,000 to 25,000 points, respectively,
with images exhibiting a resolution of 2,048×618 pixels. The
maximum range of point cloud data reached 200 meters for 4D
Radar and 100 meters for 16-line LiDAR. The LiDAR point
clouds were then transferred to the Radar coordinate system
since the 3D bounding boxes were labeled in the Radar point
cloud. The training set mainly included cars and very few
pedestrians and cyclists. Therefore, experimental evaluation
was conducted only for the car category, as the number of
objects outside this was too small. Official KITTI evaluation
protocols were followed, in which an IoU threshold of 0.5
was used for cars; the same IoU threshold was applied to both the bird's eye view (BEV) evaluation and the full 3D evaluation set. These
methods were compared using the mean average precision
(mAP) as an evaluation metric.
B. Implementation Details
The ranges of x, y, and z were set to (0, 69.12 m), (-39.68 m, 39.68 m), and (-3 m, 1 m), respectively, following the point cloud configuration of the KITTI dataset. The fusion
network consisted of a pillar extraction module (used to extract
TABLE I
COMPARATIVE RESULTS FOR MAINSTREAM ALGORITHMS APPLIED TO THE ASTYX HIRES 2019 DATASET.

Modality                   Methods            Reference       3D mAP(%)                BEV mAP(%)
                                                              Easy   Mod.   Hard       Easy   Mod.   Hard
4D Radar                   PointRCNN [1]      CVPR 2019       14.79  11.40  11.32      26.71  18.74  18.60
                           SECOND [2]         SENSORS 2018    23.26  18.02  17.06      37.92  31.01  28.83
                           PV-RCNN [3]        CVPR 2020       27.61  22.08  20.51      49.17  39.88  36.50
                           PointPillars [4]   CVPR 2019       26.03  20.49  20.40      47.38  38.21  36.74
                           Part-A2 [5]        TPAMI 2021      14.96  13.76  13.17      26.46  21.47  20.98
                           Voxel R-CNN [6]    AAAI 2021       23.65  18.71  18.47      37.77  31.26  27.83
16-Line LiDAR              PointRCNN [1]      CVPR 2019       39.03  29.97  29.66      41.34  34.22  32.95
                           SECOND [2]         SENSORS 2018    51.75  43.54  40.72      55.16  45.63  43.57
                           PV-RCNN [3]        CVPR 2020       54.63  44.71  41.26      56.08  46.68  44.86
                           PointPillars [4]   CVPR 2019       54.37  44.21  41.81      58.64  47.67  45.26
                           Part-A2 [5]        TPAMI 2021      45.41  38.45  36.74      49.85  41.85  38.93
                           Voxel R-CNN [6]    AAAI 2021       52.26  44.08  40.06      53.94  44.54  40.43
Camera + 4D Radar          MVX-Net [7]        ICRA 2019       13.20  11.69  11.43      23.57  20.36  19.04
Camera + 16-Line LiDAR     MVX-Net [7]        ICRA 2019       39.16  31.43  30.40      47.04  38.15  35.60
4D Radar + 16-Line LiDAR   Direct fusion      --              54.25  44.33  43.24      66.05  55.75  54.67
4D Radar + 16-Line LiDAR   M2-Fusion (Ours)   --              61.33  49.85  49.12      71.27  61.24  57.03
pillars from 4D Radar and LiDAR point clouds), a feature
fusion module, and a backbone network. Two different scales
were utilized in the CMSF module, with pillar volumes of
[0.16, 0.16, 4] (m) and [0.08, 0.08, 4] (m) for S and S/2,
respectively. The input channel size for the IMMF module was
10 and the output channel size was 64, while the backbone
network consisted of three convolutional blocks and three
deconvolutional blocks. The number of convolution layers, the
step size, and the number of output channels were [3, 5, 5], [2,
2, 2], and [128, 256, 512] in the three convolutional blocks,
respectively, and [3, 5, 5], [1, 2, 4], and [128, 128, 128] in the
three deconvolutional blocks, respectively.
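For convenience, the hyper-parameters listed in this subsection can be summarized as a single configuration dictionary in the OpenPCDet style; the key names below are our own and only the numeric values come from the text.

```python
# Illustrative restatement of the implementation details above.
M2_FUSION_CFG = {
    # [x_min, y_min, z_min, x_max, y_max, z_max] in meters
    "point_cloud_range": [0.0, -39.68, -3.0, 69.12, 39.68, 1.0],
    "pillar_size": {"S": [0.16, 0.16, 4.0], "S/2": [0.08, 0.08, 4.0]},  # meters
    "immf": {"in_channels": 10, "out_channels": 64},
    "backbone_2d": {
        "num_conv_layers": [3, 5, 5],
        "conv_strides": [2, 2, 2],
        "conv_filters": [128, 256, 512],
        "num_deconv_layers": [3, 5, 5],
        "deconv_strides": [1, 2, 4],
        "deconv_filters": [128, 128, 128],
    },
}
```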
C. Training
The open-source OpenPCDet [54] code framework was
utilized to construct the training set, and a single NVIDIA
RTX 3090 was employed to train the model, using 160 epochs.
During training, we adopted the Adam optimizer with an initial learning rate of 3e-3. To reduce false positives, we applied an NMS threshold of 0.01 to remove redundant boxes. Different from PointPillars, we set the maximum number of pillars (P) to 16000 and the maximum number of points per pillar (N) to 32. We modified the 2D backbone of PointPillars, changing the upsample filters from [64, 128, 256] to [128, 256, 516], because the IMMF network outputs 128 features. During training, we utilized the widely adopted data augmentation strategies for 3D object detection, including random flipping along the X axis, global scaling with a random factor sampled from [0.95, 1.05], and global rotation around the Z axis with a random angle sampled from [-π/4, π/4]. We also conducted ground-truth sampling augmentation to randomly "paste" ground-truth objects from other scenes into the current training scenes, simulating objects in various environments. Other training parameters were consistent with PointPillars [4].
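Similarly, the training settings above can be collected in one place; the key names are assumptions, and the values are those reported in the paragraph (the upsample filter value 516 is quoted verbatim from the text).

```python
import math

# Illustrative restatement of the training settings described above.
TRAIN_CFG = {
    "epochs": 160,
    "optimizer": "adam",
    "initial_lr": 3e-3,
    "nms_threshold": 0.01,
    "max_pillars": 16000,          # P
    "max_points_per_pillar": 32,   # N
    "upsample_filters": [128, 256, 516],
    "augmentation": {
        "random_flip_axis": "x",
        "global_scaling_range": [0.95, 1.05],
        "global_rotation_range": [-math.pi / 4, math.pi / 4],
        "gt_sampling": True,       # paste ground-truth objects from other scenes
    },
}
```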
D. Experimental Results
1) 3D Object Detection on Astyx HiRes 2019 Dataset:
Existing fusion networks mainly focus on modalities with
different data formats. However, 4D Radar and LiDAR have
similar formats, so features from the two sensors can com-
plement each other. This is the underlying principle for our
proposed multi-modal fusion network.
To establish baselines, seven mainstream algorithms applied
to the KITTI 3D object detection benchmark were selected. As
the Astyx HiRes dataset differs from KITTI, and the models
were all processed on the KITTI dataset, we converted the
Astyx HiRes dataset into the same format as KITTI based
on the framework OpenPCDet. The training parameters of
these algorithms were kept the same as on KITTI. We first compared the performance of PointRCNN [1], SECOND [2], PV-RCNN [3], PointPillars [4], Part-A2 [5], Voxel R-CNN [6], and the multi-modal fusion method MVX-Net [7] on the 4D Radar, 16-line LiDAR, and camera data from Astyx. The results are shown in Table I. The point-based methods, including PointRCNN and Part-A2, exhibited significantly lower performance than the voxel-based methods. The main reason was that the point clouds of both the 4D Radar and the 16-line LiDAR were very sparse, which was not conducive to extracting point features. Apart from PV-RCNN, PointPillars achieved the best accuracy on both the 4D Radar and the 16-line LiDAR point clouds, while holding significant advantages over PV-RCNN in network complexity and inference time. Thus, we chose the easily extended PointPillars network as our baseline to verify the proposed method.
These algorithms were trained and compared with our model
to evaluate the effectiveness of the proposed method for 4D
Radar and LiDAR fusion. These results demonstrated that our
proposed M2-Fusion method achieved the best results, with
significant improvements over other methods. Compared with
the baseline PointPillars using only 4D Radar, our proposed M2-Fusion method achieved increases of 29.36% (3D mAP) and 23.03% (BEV mAP) at the moderate level, and increases of 5.64% (3D mAP) and 13.57% (BEV mAP) over the baseline PointPillars model trained with 16-line LiDAR. When compared to another multi-modal fusion method, MVX-Net, our method also showed better detection performance. Compared with MVX-Net using the camera and
TABLE II
THE RESULTS OF ABLATION APPLYING M2-FUSION TO THE ASTYX HIRES 2019 DATASET (THE BASELINE IS POINTPILLARS).

4D Radar  16-Line LiDAR  DP  IMMF  CMSF    3D mAP(%)                BEV mAP(%)
                                           Easy   Mod.   Hard       Easy   Mod.   Hard
   ✓                                       26.03  20.49  20.40      47.38  38.21  36.74
   ✓                     ✓                 28.61  21.99  21.35      50.66  41.83  38.70
             ✓                             54.37  44.21  41.81      58.64  47.67  45.26
   ✓         ✓                             54.25  44.33  43.24      66.05  55.75  54.67
   ✓         ✓           ✓                 54.55  45.16  44.40      66.05  57.18  55.59
   ✓         ✓           ✓    ✓            57.15  48.24  47.01      69.64  58.12  56.43
   ✓         ✓           ✓          ✓      56.61  47.67  46.57      67.01  57.35  56.24
   ✓         ✓           ✓    ✓     ✓      61.33  49.85  49.12      71.27  61.24  57.03
4D Radar, our proposed M2-Fusion method achieved large increases of 38.16% (3D mAP) and 40.88% (BEV mAP) at the moderate level. When using the camera and 16-line LiDAR as the inputs of MVX-Net, M2-Fusion also showed increases of 18.42% (3D mAP) and 23.09% (BEV mAP). We also compared against the direct fusion of 4D Radar and 16-line LiDAR to illustrate the effectiveness of our fusion strategy: our method still achieved increases of 5.52% (3D mAP) and 5.49% (BEV mAP) with PointPillars as the baseline. Moreover, the inference speed of M2-Fusion is about 10 fps on a single RTX 3090 GPU.
2) Ablation Studies with M2-Fusion: Ablation studies were
also conducted to verify the effectiveness of each proposed module
compared to the baseline (PointPillars), the results of which
are shown in Table II. The leftmost columns show the experimental
configuration, including the sensor modality, data preprocessing (DP),
and the two proposed modules IMMF and CMSF; the remaining columns
report 3D and BEV mAP at three difficulty levels: easy, moderate, and
hard. The following comparisons were all evaluated at the moderate level.
The baseline (PointPillars) using 4D Radar alone produced the lowest
accuracy. After data preprocessing, the 3D mAP for Radar increased by
1.50% and the BEV mAP increased by 3.62%, which suggests that data
preprocessing offers a clear improvement in Radar detection accuracy
and demonstrates its effectiveness. The third line shows the results
for 16-line LiDAR, and the fourth line shows the results for 4D Radar
fused with LiDAR using direct feature concatenation. This fusion yielded
an improvement of 0.12% in 3D mAP and 8.08% in BEV mAP, illustrating
that 4D Radar can improve BEV accuracy remarkably. The fifth line
provides the fusion results with data preprocessing added, again
illustrating its beneficial effect. The sixth line shows the results
with the IMMF module added, which produced increases of 3.08% (3D mAP)
and 0.94% (BEV mAP) over the fifth-line configuration (4D Radar +
16-line LiDAR + DP). The seventh line shows the results with the CMSF
module added; the corresponding 3D and BEV mAP values increased by
2.51% and 0.17% over the same configuration, respectively. These
comparisons confirm the effectiveness of each proposed component
(DP, IMMF, and CMSF). Finally, we combined the three modules to form
M2-Fusion, which achieved the best results, with improvements of 5.64%
in 3D mAP and 13.57% in BEV mAP over the traditional single-LiDAR
method (the third line, 16-line LiDAR). Moreover, combining multiple
modules yielded a larger improvement than any single module. These
results, visualized in Fig. 5, confirm the overall effectiveness of our
proposed M2-Fusion algorithm for 3D object detection.
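The DP rows of Table II can be read alongside the following hedged sketch of Gaussian-based radar preprocessing. It is only one plausible reading of the idea of suppressing point-cloud divergence in the x-z plane; the column layout of the radar array and the threshold k are assumptions, not values taken from the paper.

```python
# Hypothetical Gaussian outlier filter for 4D Radar points (not the authors' exact
# procedure): fit a Gaussian to the elevation values and drop points beyond k sigma.
import numpy as np

def gaussian_filter_radar(points: np.ndarray, k: float = 2.0) -> np.ndarray:
    """points: (N, C) array whose first three columns are assumed to be x, y, z."""
    z = points[:, 2]
    mu, sigma = z.mean(), z.std()
    if sigma < 1e-6:                  # degenerate frame, nothing to filter
        return points
    return points[np.abs(z - mu) <= k * sigma]
```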
From the above ablation studies, we make the following observations:
(1) The fusion of 4D Radar and 16-line LiDAR achieves better results
than either single-modality method. (2) The data preprocessing of
4D Radar corrects the data and suppresses the influence of noise.
(3) The attention-based interaction between 4D Radar and 16-line LiDAR
exploits the advantages of each modality and enhances the perception
ability. (4) Multi-scale feature extraction captures richer feature
information and improves detection accuracy. Our proposed M2-Fusion
method aggregates these advantages to improve detection accuracy
significantly.
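Observation (3) refers to the IMMF module. A minimal sketch of one way such an attention-based exchange could be wired is given below; it uses standard multi-head cross-attention between radar and LiDAR pillar features and is an assumption-laden illustration rather than the paper's implementation (tensor shapes of (B, N, C) and the channel width are invented for the example).

```python
# Illustrative cross-modal interaction (assumed shapes, not the released IMMF code):
# each modality queries the other, and the returned message is added back to its branch.
import torch
import torch.nn as nn

class CrossModalInteraction(nn.Module):
    def __init__(self, channels: int = 64, heads: int = 4):
        super().__init__()
        self.radar_from_lidar = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.lidar_from_radar = nn.MultiheadAttention(channels, heads, batch_first=True)

    def forward(self, radar_feat, lidar_feat):
        r_msg, _ = self.radar_from_lidar(radar_feat, lidar_feat, lidar_feat)
        l_msg, _ = self.lidar_from_radar(lidar_feat, radar_feat, radar_feat)
        return radar_feat + r_msg, lidar_feat + l_msg

# Example: r, l = CrossModalInteraction(64)(torch.rand(2, 100, 64), torch.rand(2, 100, 64))
```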
Fig. 5. Ablation results for M2-Fusion applied to the Astyx HiRes 2019
dataset, with PointPillars serving as the baseline. The combination of CMSF
and IMMF outperformed comparable models.
3) Accuracy Comparison Experiments at Different Ranges:
4D Radar usually has a longer detection range than LiDAR.
To verify object detection performance over distance, we
conducted accuracy comparison experiments at different
ranges for 4D Radar, 16-line LiDAR, direct fusion, and our
proposed M2-Fusion. The results are shown in Table III and
plotted as a line chart in Fig. 6. As in the experiments of
Table II, PointPillars was used as the base detector.
TABLE III
COMPARATIVE RESULTS OF 3D OBJECT DETECTION USING SINGLE MODALITY AND MULTI-FUSION METHODS APPLIED TO DIFFERENT RANGES. "INF" MEANS INFINITY.
Modality                   | Methods           | 3D mAP(%)                       | BEV mAP(%)
                           |                   | Overall  0-30m  30-50m  50m-Inf | Overall  0-30m  30-50m  50m-Inf
4D Radar                   | PointPillars [4]  | 20.49   34.06   14.76    6.98   | 38.21   52.08   28.81   19.54
16-Line LiDAR              | PointPillars [4]  | 44.21   71.90   21.25    9.09   | 47.67   76.07   24.86    9.09
4D Radar + 16-Line LiDAR   | Direct fusion     | 44.33   67.67   21.50   12.50   | 55.75   77.92   35.64   24.22
4D Radar + 16-Line LiDAR   | M2-Fusion (Ours)  | 49.85   77.26   27.36   15.56   | 61.24   83.73   42.08   27.68
4D Radar + 16-Line LiDAR   | Improvement       | +5.52   +9.59   +5.86   +3.06   | +5.49   +5.81   +6.44   +3.46
In terms of overall accuracy, the detection accuracy of
4D Radar was lower than that of LiDAR, but the gap between
the two narrowed as the range increased. In addition, beyond
30 m the BEV detection accuracy of 4D Radar surpassed that
of LiDAR, and beyond 50 m it was 10.45% higher, a significant
margin. This shows that LiDAR is more accurate for short-range
detection, while 4D Radar has advantages at long range. After
directly fusing the two sensors with the feature concatenation
method of Table II (line 4), accuracy increased overall and in
most ranges, but the accuracy within 50 m was not significantly
better than that of LiDAR alone. In contrast, the detection
accuracy beyond 50 m improved by 3.41% and 15.13% in 3D and
BEV mAP, respectively, illustrating that fusion mainly benefits
long-range detection. Our proposed M2-Fusion method significantly
improved the detection accuracy in every range, and the overall
3D and BEV accuracy improved by 5.52% and 5.49%, respectively,
which demonstrates the effectiveness of the proposed method.
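For reference, the range bins of Table III can be produced by filtering boxes on their planar distance to the sensor. The snippet below only illustrates the binning step (the actual AP computation follows the KITTI protocol), and the box layout is an assumption.

```python
# Hedged sketch of range binning for evaluation (assumed box format [x, y, z, dx, dy, dz, yaw]).
import numpy as np

RANGE_BINS = {"0-30m": (0.0, 30.0), "30-50m": (30.0, 50.0), "50m-Inf": (50.0, np.inf)}

def split_by_range(boxes: np.ndarray) -> dict:
    dist = np.linalg.norm(boxes[:, :2], axis=1)      # planar distance to the ego sensor
    return {name: boxes[(dist >= lo) & (dist < hi)]
            for name, (lo, hi) in RANGE_BINS.items()}
```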
Fig. 6. Comparative results of 3D object detection at different ranges.
Our proposed M2-Fusion achieves the best accuracy in all ranges, especially
for long-range detection.
4) Parameter Comparison Experiment: To fuse point clouds of
different resolutions, the high-resolution pseudo-images were
reshaped to a unified scale. Several reconstruction methods can
perform this reshaping while preserving as much information as
possible. Table IV compares different methods of transforming
pseudo-images from different scales to a standard size. The
results demonstrate the effectiveness of adaptive max pooling,
which outperformed the CNN and max pooling variants and is
highly compatible with our network since it can output features
of arbitrary target dimensions.
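As a concrete illustration of the AMP variant in Table IV, the following sketch uses torch.nn.AdaptiveMaxPool2d to resize a fine-scale pseudo-image to the coarse BEV grid before concatenation. The grid sizes and channel count are assumptions introduced for the example, not values taken from the paper.

```python
# Sketch of unifying pseudo-image scales with adaptive max pooling (assumed sizes).
import torch
import torch.nn as nn

coarse_hw = (496, 432)                            # hypothetical BEV grid of the coarse pillars
amp = nn.AdaptiveMaxPool2d(coarse_hw)

fine = torch.rand(1, 64, 992, 864)                # hypothetical fine-scale pseudo-image
coarse = torch.rand(1, 64, *coarse_hw)
aligned = amp(fine)                               # -> (1, 64, 496, 432), regardless of input size
fused = torch.cat([coarse, aligned], dim=1)       # channel-wise concatenation of the two scales
```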
TABLE IV
A COMPARISON OF DIFFERENT METHODS USED TO TRANSFORM PSEUDO-IMAGES FROM DIFFERENT SCALES. MP AND AMP MEAN THE MAX POOLING AND ADAPTIVE MAX POOLING, RESPECTIVELY.
Methods | 3D mAP(%)              | BEV mAP(%)
        | Easy   Moderate  Hard  | Easy   Moderate  Hard
CNN     | 58.01   47.93   46.57  | 69.19   57.10   55.58
MP      | 59.67   48.74   46.89  | 70.34   58.02   56.60
AMP     | 61.33   49.85   49.12  | 71.27   61.24   57.03
TABLE V
A COMPARISON OF DIFFERENT SCORES FOR KEY POINT SELECTION
METHODS.
Score | 3D mAP(%)              | BEV mAP(%)
      | Easy   Moderate  Hard  | Easy   Moderate  Hard
0.2   | 57.65   47.71   45.10  | 70.05   56.80   55.61
0.4   | 56.72   46.90   45.65  | 70.56   57.68   57.02
0.6   | 61.33   49.85   49.12  | 71.27   61.24   57.03
0.8   | 57.07   48.20   46.75  | 67.59   57.79   56.39
Since the network in the CMSF module predicts key points
with probabilistic scores, key points with low scores could lead
to false detections. As such, we set a score threshold to remove
these points. The influence of this threshold was tested with
different values, the results of which are shown in Table V.
As shown, the model achieved the best results for a score
threshold of 0.6, suggesting that smaller thresholds introduce
errors while larger thresholds filter out too much data.
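The thresholding itself is a one-line filter. The sketch below assumes the key-point head returns center coordinates and scores as tensors (hypothetical names `centers` and `scores`) and simply keeps those above 0.6, the value selected in Table V.

```python
# Hedged sketch of key-point score filtering (assumed tensor outputs of the CMSF head).
import torch

def filter_keypoints(centers: torch.Tensor, scores: torch.Tensor, thr: float = 0.6):
    keep = scores > thr
    return centers[keep], scores[keep]

# Example: kept_centers, kept_scores = filter_keypoints(torch.rand(200, 3), torch.rand(200))
```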
Table VI provides the results of comparison experiments
involving concatenation at different scales in the CMSF module.
Based on prior experience, we selected two pillar scales,
[0.16, 0.16, 4] and [0.08, 0.08, 4], with [0.16, 0.16, 4] as
the baseline pillar scale. The pillar scale has a direct impact
on the results, as accuracy was higher for smaller scales within
a specific range. As seen in Table VI, the multi-scale fusion of
[0.16, 0.16, 4] and [0.08, 0.08, 4] achieved better results,
producing a 1.61% increase in 3D mAP and a 3.12% increase in
BEV mAP over the baseline at the moderate level.
TABLE VI
THE RESULTS OF SCALE PARAMETER VERIFICATION EXPERIMENTS USING THE CMSF MODULE IN OUR PROPOSED M2-FUSION METHOD.
Scale(m)    | 3D mAP(%)            | BEV mAP(%)
            | Easy   Mod.   Hard   | Easy   Mod.   Hard
0.16        | 57.15  48.24  47.00  | 69.64  58.12  56.43
0.16+0.08   | 61.33  49.85  49.12  | 71.27  61.24  57.03
Fig. 7. Qualitative results from the Astyx HiRes 2019 dataset. The LiDAR point cloud is grey and the 4D Radar point cloud is pink. The first row shows
RGB images; the second row provides the ground truth, and the green bounding boxes surrounded by dotted orange circles indicate missed detections. The
third row shows detection results for 4D Radar; the fourth row shows results for 16-line LiDAR, and the blue bounding box surrounded by dotted brown
circles denotes false detections. The last row shows results produced by the proposed M2-Fusion method.
This experiment demonstrates that feature
representations can be enhanced by fusing multiple scales.
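To make the two pillar scales of Table VI concrete, the schematic below computes pillar indices for the same point cloud at 0.16 m and 0.08 m resolutions. The point-cloud range and the downstream feature encoding are assumptions; only the gridding step is shown, not the released CMSF code.

```python
# Schematic multi-scale pillar division (assumed range; illustration only).
import numpy as np

def pillar_indices(points: np.ndarray, pillar_xy: float,
                   pc_range=(0.0, -39.68, 69.12, 39.68)):
    """Return the (ix, iy) pillar index of every in-range point for one grid scale."""
    x_min, y_min, x_max, y_max = pc_range
    mask = ((points[:, 0] >= x_min) & (points[:, 0] < x_max) &
            (points[:, 1] >= y_min) & (points[:, 1] < y_max))
    pts = points[mask]
    ix = ((pts[:, 0] - x_min) / pillar_xy).astype(np.int64)
    iy = ((pts[:, 1] - y_min) / pillar_xy).astype(np.int64)
    return np.stack([ix, iy], axis=1)

# coarse_idx = pillar_indices(cloud, 0.16); fine_idx = pillar_indices(cloud, 0.08)
```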
5) Visualization Experiments: To observe the effects of
modal fusion more intuitively, we provide three visualization
experiments: a Precision-Recall (PR) curve comparison, feature
map visualization, and qualitative detection results. Fig. 8
shows the Precision-Recall (PR) curves of our proposed M2-Fusion
and the direct fusion of 4D Radar and LiDAR based on PointPillars.
It reveals that our method significantly improved accuracy at the
easy, moderate, and hard levels. The visualization of feature maps
for 4D Radar and LiDAR data using the baseline (PointPillars) and
our proposed M2-Fusion is shown in Fig. 9. It is evident that the
feature maps produced by our method contain more information. These
results illustrate that multi-modal interactive fusion enhances the
network's feature extraction capability and makes extracting
long-range features easier. We also provide a qualitative
visualization of the final detection results for single Radar,
LiDAR, and M2-Fusion in Fig. 7, where green boxes denote the ground
truth. We observed that the baseline (PointPillars) method using
only 4D Radar produced many false detections, as several trees and
signal lights were mistaken for cars. The baseline results using
only LiDAR were somewhat better; however, cars at large distances
were missed because the LiDAR data contained too few points at
those ranges. In contrast, our proposed M2-Fusion method achieved
the best results and reduced false detections significantly.
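The PR curves of Fig. 8 follow the KITTI 3D IoU matching criterion. As a simplified reference, the snippet below merely sweeps the detection score ranking given a precomputed true-positive flag per detection; that flag is an assumption that hides the IoU matching step.

```python
# Toy precision-recall sweep (matching to ground truth is assumed to be done already).
import numpy as np

def pr_curve(scores: np.ndarray, is_tp: np.ndarray, num_gt: int):
    order = np.argsort(-scores)                   # rank detections by descending score
    tp = np.cumsum(is_tp[order].astype(np.int64))
    fp = np.cumsum((~is_tp[order]).astype(np.int64))
    precision = tp / np.maximum(tp + fp, 1)
    recall = tp / max(num_gt, 1)
    return precision, recall
```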
Fig. 8. PR curve comparison of our proposed M2-Fusion and the direct
fusion method based on PointPillars. "R+L" means Radar and LiDAR fusion.
Our method clearly surpasses the direct fusion method at the easy, moderate,
and hard levels.
V. CONCLUSION
A novel method based on multi-modal and multi-scale
fusion, termed M2-Fusion, was proposed for 4D Radar and
LiDAR. We verified the role of 4D Radar in 3D object de-
tection for autonomous driving, fusing these data with LiDAR
Fig. 9. Feature maps for (a) 4D Radar data processed using PointPillars,
(b) 4D Radar processed by M2-Fusion, (c) 16-line LiDAR processed by
PointPillars, and (d) 16-line LiDAR processed by M2-Fusion.
for the first time and effectively improving detection accuracy.
A center-based multi-scale fusion method was proposed to
solve the problem of information loss in feature extraction for
sparse point clouds. A multi-modal fusion method based on
a self-attention interaction was also proposed, which achieved
an effective fusion of 4D Radar and LiDAR. The proposed
method, evaluated using the Astyx HiRes 2019 dataset, out-
performed mainstream LiDAR-based object detection methods
significantly. We will consider fusion with camera images in
future research to further improve detection accuracy.
VI. ACKNOWLEDGMENTS
We thank LetPub (www.letpub.com) for linguistic assistance
and pre-submission expert review.
REFERENCES
[1] S. Shi, X. Wang, and H. Li, “Pointrcnn: 3d object proposal generation
and detection from point cloud,” in 2019 IEEE/CVF Conference on
Computer Vision and Pattern Recognition (CVPR), 2019, pp. 770–779.
[2] Y. Yan, Y. Mao, and B. Li, “Second: Sparsely embedded convolutional
detection,” Sensors, vol. 18, no. 10, p. 3337, 2018.
[3] S. Shi, C. Guo, L. Jiang, Z. Wang, J. Shi, X. Wang, and H. Li, "Pv-rcnn: Point-voxel feature set abstraction for 3d object detection," in 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 10526–10535.
[4] A. H. Lang, S. Vora, H. Caesar, L. Zhou, J. Yang, and O. Beijbom,
“Pointpillars: Fast encoders for object detection from point clouds,” in
2019 IEEE/CVF Conference on Computer Vision and Pattern Recogni-
tion (CVPR), 2019, pp. 12 689–12 697.
[5] S. Shi, Z. Wang, J. Shi, X. Wang, and H. Li, “From points to parts: 3d
object detection from point cloud with part-aware and part-aggregation
network,” IEEE transactions on pattern analysis and machine intelli-
gence, vol. 43, no. 8, pp. 2647–2664, 2020.
[6] J. Deng, S. Shi, P. Li, W. Zhou, Y. Zhang, and H. Li, "Voxel r-cnn: Towards high performance voxel-based 3d object detection," in AAAI Conference on Artificial Intelligence (AAAI), vol. 35, no. 02, 2021, pp. 1201–1209.
[7] V. A. Sindagi, Y. Zhou, and O. Tuzel, “Mvx-net: Multimodal voxelnet
for 3d object detection,” in 2019 International Conference on Robotics
and Automation (ICRA), 2019, pp. 7276–7282.
[8] H. Wang, Y. Huang, A. Khajepour, D. Cao, and C. Lv, "Ethical decision-making platform in autonomous vehicles with lexicographic optimization based model predictive controller," IEEE Transactions on Vehicular Technology, vol. 69, no. 8, pp. 8164–8175, 2020.
[9] H. Wang, Y. Huang, A. Soltani, A. Khajepour, and D. Cao, "Cyber-physical predictive energy management for through-the-road hybrid vehicles," IEEE Transactions on Vehicular Technology, vol. 68, no. 4, pp. 3246–3256, 2019.
[10] Y. Huang, H. Wang, A. Khajepour, H. Ding, K. Yuan, and Y. Qin, "A novel local motion planning framework for autonomous vehicles based on resistance network and model predictive control," IEEE Transactions on Vehicular Technology, vol. 69, no. 1, pp. 55–66, 2020.
[11] S. Wen, J. Chen, F. R. Yu, F. Sun, Z. Wang, and S. Fan, "Edge computing-based collaborative vehicles 3d mapping in real time," IEEE Transactions on Vehicular Technology, vol. 69, no. 11, pp. 12470–12481, 2020.
[12] E. Arnold, O. Y. Al-Jarrah, M. Dianati, S. Fallah, D. Oxtoby, and A. Mouzakitis, "A survey on 3d object detection methods for autonomous driving applications," IEEE Transactions on Intelligent Transportation Systems, vol. 20, no. 10, pp. 3782–3795, 2019.
[13] M. Herzog and K. Dietmayer, “Training a fast object detector for lidar
range images using labeled data from sensors with higher resolution,” in
2019 IEEE Intelligent Transportation Systems Conference (ITSC), 2019,
pp. 2707–2713.
[14] A. Danzer, T. Griebel, M. Bach, and K. Dietmayer, “2d car detection
in radar data with pointnets,” in 2019 IEEE Intelligent Transportation
Systems Conference (ITSC), 2019, pp. 61–66.
[15] K. Patel, K. Rambach, T. Visentin, D. Rusev, M. Pfeiffer, and B. Yang, "Deep learning-based object classification on automotive radar spectra," in 2019 IEEE Radar Conference (RadarConf), 2019, pp. 1–6.
[16] K. Xie, Z. Zhang, B. Li, J. Kang, T. D. Niyato, S. Xie, and Y. Wu,
“Efficient federated learning with spike neural networks for traffic sign
recognition,” IEEE Transactions on Vehicular Technology, pp. 1–1,
2022.
[17] M. Liang, B. Yang, Y. Chen, R. Hu, and R. Urtasun, “Multi-task multi-
sensor fusion for 3d object detection,” in 2019 IEEE/CVF Conference
on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 7337–
7345.
[18] B. Yang, R. Guo, M. Liang, S. Casas, and R. Urtasun, “Radarnet:
Exploiting radar for robust perception of dynamic objects,” in European
Conference on Computer Vision. Springer, 2020, pp. 496–512.
[19] K. Qian, S. Zhu, X. Zhang, and L. E. Li, “Robust multimodal vehicle
detection in foggy weather using complementary lidar and radar sig-
nals,” in 2021 IEEE/CVF Conference on Computer Vision and Pattern
Recognition (CVPR), 2021, pp. 444–453.
[20] M. Russell, A. Crain, A. Curran, R. Campbell, C. Drubin, and W. Mic-
cioli, “Millimeter-wave radar sensor for automotive intelligent cruise
control (icc),” IEEE Transactions on Microwave Theory and Techniques,
vol. 45, no. 12, pp. 2444–2453, 1997.
[21] M. Meyer and G. Kuschk, "Automotive radar dataset for deep learning based 3d object detection," in 2019 16th European Radar Conference (EuRAD), 2019, pp. 129–132.
[22] X. Chen, H. Ma, J. Wan, B. Li, and T. Xia, "Multi-view 3d object detection network for autonomous driving," in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 6526–6534.
[23] J. Ku, M. Mozifian, J. Lee, A. Harakeh, and S. L. Waslander, "Joint 3d proposal generation and object detection from view aggregation," in 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2018, pp. 1–8.
[24] S. Vora, A. H. Lang, B. Helou, and O. Beijbom, “Pointpainting: Se-
quential fusion for 3d object detection,” in 2020 IEEE/CVF Conference
on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 4603–
4611.
[25] Y. Zhou and O. Tuzel, “Voxelnet: End-to-end learning for point cloud
based 3d object detection,” in 2018 IEEE/CVF Conference on Computer
Vision and Pattern Recognition, 2018, pp. 4490–4499.
[26] H. Kuang, B. Wang, J. An, M. Zhang, and Z. Zhang, “Voxel-fpn: Multi-
scale voxel feature aggregation for 3d object detection from lidar point
clouds,” Sensors, vol. 20, no. 3, p. 704, 2020.
[27] L. Sless, B. E. Shlomo, G. Cohen, and S. Oron, “Road scene under-
standing by occupancy grid learning from sparse radar clusters using
semantic segmentation,” in 2019 IEEE/CVF International Conference
on Computer Vision Workshop (ICCVW), 2019, pp. 867–875.
[28] R. Nabati and H. Qi, “Rrpn: Radar region proposal network for object
detection in autonomous vehicles,” in 2019 IEEE International Confer-
ence on Image Processing (ICIP), 2019, pp. 3093–3097.
[29] O. Schumann, C. Wöhler, M. Hahn, and J. Dickmann, "Comparison of random forest and long short-term memory network performances in classification tasks using radar," in 2017 Sensor Data Fusion: Trends, Solutions, Applications (SDF), 2017, pp. 1–6.
[30] S. Kim, S. Lee, S. Doo, and B. Shim, “Moving target classification in au-
tomotive radar systems using convolutional recurrent neural networks,”
in 2018 26th European Signal Processing Conference (EUSIPCO), 2018,
pp. 1482–1486.
[31] J. Lombacher, K. Laudt, M. Hahn, J. Dickmann, and C. Wöhler, "Semantic radar grids," in 2017 IEEE Intelligent Vehicles Symposium (IV), 2017, pp. 1170–1175.
[32] O. Schumann, M. Hahn, J. Dickmann, and C. Wöhler, "Semantic segmentation on radar point clouds," in 2018 21st International Conference on Information Fusion (FUSION), 2018, pp. 2179–2186.
[33] D. Brodeski, I. Bilik, and R. Giryes, "Deep radar detector," in 2019 IEEE Radar Conference (RadarConf), 2019, pp. 1–6.
[34] R. Q. Charles, H. Su, M. Kaichun, and L. J. Guibas, “Pointnet: Deep
learning on point sets for 3d classification and segmentation,” in 2017
IEEE Conference on Computer Vision and Pattern Recognition (CVPR),
2017, pp. 77–85.
[35] J. Liu, M. Gong, K. Qin, and P. Zhang, “A deep convolutional coupling
network for change detection based on heterogeneous optical and radar
images,” IEEE Transactions on Neural Networks and Learning Systems,
vol. 29, no. 3, pp. 545–559, 2018.
[36] Y. Zhu, F. Zhuang, J. Wang, G. Ke, J. Chen, J. Bian, H. Xiong, and
Q. He, “Deep subdomain adaptation network for image classification,”
IEEE Transactions on Neural Networks and Learning Systems, vol. 32,
no. 4, pp. 1713–1722, 2021.
[37] B. Xu, X. Zhang, L. Wang, X. Hu, Z. Li, S. Pan, J. Li, and Y. Deng,
“Rpfa-net: a 4d radar pillar feature attention network for 3d object de-
tection,” in 2021 IEEE International Intelligent Transportation Systems
Conference (ITSC), 2021, pp. 3061–3066.
[38] J. Zarzar, S. Giancola, and B. Ghanem, “Pointrgcn: Graph convolu-
tion networks for 3d vehicles detection refinement,” arXiv preprint
arXiv:1911.12236, 2019.
[39] C. He, H. Zeng, J. Huang, X.-S. Hua, and L. Zhang, “Structure aware
single-stage 3d object detection from point cloud,” in Proceedings of
the IEEE/CVF Conference on Computer Vision and Pattern Recognition,
2020, pp. 11 873–11 882.
[40] B. Yang, W. Luo, and R. Urtasun, “Pixor: Real-time 3d object detection
from point clouds,” in 2018 IEEE/CVF Conference on Computer Vision
and Pattern Recognition, 2018, pp. 7652–7660.
[41] B. Yang, M. Liang, and R. Urtasun, “Hdnet: Exploiting hd maps for 3d
object detection,” in Conference on Robot Learning. PMLR, 2018, pp.
146–155.
[42] R. Nabati and H. Qi, “Centerfusion: Center-based radar and camera
fusion for 3d object detection,” in 2021 IEEE Winter Conference on
Applications of Computer Vision (WACV), 2021, pp. 1526–1535.
[43] K. Duan, S. Bai, L. Xie, H. Qi, Q. Huang, and Q. Tian, "Centernet: Keypoint triplets for object detection," in 2019 IEEE/CVF International Conference on Computer Vision (ICCV), 2019, pp. 6568–6577.
[44] K. Shin, Y. P. Kwon, and M. Tomizuka, “Roarnet: A robust 3d object
detection based on region approximation refinement,” in 2019 IEEE
Intelligent Vehicles Symposium (IV), 2019, pp. 2510–2515.
[45] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, "Focal loss for dense object detection," in 2017 IEEE International Conference on Computer Vision (ICCV), 2017, pp. 2999–3007.
[46] Y. Wang, Z. Jiang, Y. Li, J.-N. Hwang, G. Xing, and H. Liu, “Rodnet:
A real-time radar object detection network cross-supervised by camera-
radar fused object 3d localization,” IEEE Journal of Selected Topics in
Signal Processing, vol. 15, no. 4, pp. 954–967, 2021.
[47] Y. Fu, D. Tian, X. Duan, J. Zhou, P. Lang, C. Lin, and X. You, “A
camera–radar fusion method based on edge computing,” in 2020 IEEE
International Conference on Edge Computing (EDGE), 2020, pp. 9–14.
[48] M. Liang, B. Yang, S. Wang, and R. Urtasun, “Deep continuous fusion
for multi-sensor 3d object detection,” in Proceedings of the European
conference on computer vision (ECCV), 2018, pp. 641–656.
[49] H. Lu, X. Chen, G. Zhang, Q. Zhou, Y. Ma, and Y. Zhao, “Scanet:
Spatial-channel attention network for 3d object detection,” in ICASSP
2019 - 2019 IEEE International Conference on Acoustics, Speech and
Signal Processing (ICASSP), 2019, pp. 1992–1996.
[50] G. P. Meyer, A. Laddha, E. Kee, C. Vallespi-Gonzalez, and C. K.
Wellington, “Lasernet: An efficient probabilistic 3d object detector for
autonomous driving,” in 2019 IEEE/CVF Conference on Computer
Vision and Pattern Recognition (CVPR), 2019, pp. 12 669–12 678.
[51] G. P. Meyer, J. Charland, D. Hegde, A. Laddha, and C. Vallespi-
Gonzalez, “Sensor fusion for joint 3d object detection and semantic
segmentation,” in 2019 IEEE/CVF Conference on Computer Vision and
Pattern Recognition Workshops (CVPRW), 2019, pp. 1230–1237.
[52] Y. Wang, W.-L. Chao, D. Garg, B. Hariharan, M. Campbell, and K. Q. Weinberger, "Pseudo-lidar from visual depth estimation: Bridging the gap in 3d object detection for autonomous driving," in 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 8437–8445.
[53] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie, "Feature pyramid networks for object detection," in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 2117–2125.
[54] O. D. Team, “OpenPCDet: An open-source toolbox for 3d object de-
tection from point clouds,” https://github.com/open-mmlab/OpenPCDet,
2020.
Li Wang was born in Shangqiu, Henan Province,
China in 1990. He received his Ph.D. degree in
mechatronic engineering at State Key Laboratory of
Robotics and System, Harbin Institute of Technol-
ogy, in 2020.
He was a visiting scholar at Nanyang Technological
University for two years. Currently, he is a
postdoctoral fellow in the State Key Laboratory of
Automotive Safety and Energy, and the School of
Vehicle and Mobility, Tsinghua University.
Dr. Wang is the author of more than 20 SCI/EI
articles. His research interests include autonomous driving perception, 3D
robot vision, and multi-modal fusion.
Xinyu Zhang was born in Huining, Gansu Province,
and he received a B.E. degree from the School
of Vehicle and Mobility at Tsinghua University, in
2001.
He was a visiting scholar at the University of
Cambridge. He is currently a researcher with the
School of Vehicle and Mobility, and the head of
the Mengshi Intelligent Vehicle Team at Tsinghua
University.
Dr. Zhang is the author of more than 30 SCI/EI
articles. His research interests include intelligent
driving and multimodal information fusion.
Jun Li was born in Jilin Province, China in 1958.
He received a Ph.D. degree in internal-combustion
engineering at Jilin University of Technology, in
1989.
He joined China FAW Group Corporation
in 1989 and currently works as a professor with
the School of Vehicle and Mobility at Tsinghua
University. Now he also serves as the chairman of
the China Society of Automotive Engineers (SAE).
Over the years, Dr. Li has presided over product
development and technological innovation at large-
scale automobile companies in China. Dr. Li has many scientific research
achievements in the fields of automotive powertrain, new energy vehicles,
and intelligent connected vehicles.
Dr. Li is the author of more than 98 papers. In 2013, he was elected
an academician of the Chinese Academy of Engineering (CAE) for his
contributions to vehicle engineering.
Baowei Xv was born in Yulin, Shanxi Province,
China in 1995. He has been a graduate student
majoring in forestry engineering at Northeast
Forestry University since 2019, with a research focus on simul-
taneous localization and mapping.
Since Aug. 2020, he has been interning at State
Key Laboratory of Automotive Safety and Energy,
and the School of Vehicle and Mobility, Tsinghua
University, responsible for the development of 3D
object detection algorithm based on multi-sensor
fusion.
Rong Fu was born in Wuhan, Hubei Province, China
in 1994. She received a bachelor's degree
in Electronic Information Engineering from Beihang
University, Beijing, China, in 2016, and a
Ph.D. degree from the Intelligence Sensing Labo-
ratory (ISL), Department of Electronic Engineering,
Tsinghua University, Beijing, China, in 2022. Her
current research interests include statistical signal
processing, compressive sensing, optimization meth-
ods, model-based deep learning, and their applica-
tions in signal detection and parameter estimation.
Haifeng Chen was born in Chuzhou, Anhui
Province, China in 1996. He is currently a Master's
student in the Institute of Information Engineering
at China University of Mining and Technology,
Beijing. His research interests include computer vi-
sion, deep learning, and 3D object detection.
He is now interning at State Key Laboratory of
Automotive Safety and Energy, and the School of
Vehicle and Mobility, Tsinghua University, respon-
sible for the development of 3D object detection
algorithm based on multi-sensor fusion.
Lei Yang was born in Datong, Shanxi Province,
China in 1993. He received his master’s degree in
robotics at Beihang University, in 2018. Then he
joined the Autonomous Driving R&D Department
of JD.COM as an algorithm researcher from 2018
to 2020.
He is now a PhD student in School of Vehicle
and Mobility at Tsinghua University since 2020. His
research interests are computer vision, autonomous
driving and environmental perception.
Dafeng Jin was born in China in 1965. He is an
associate professor of the School of Vehicle and
Mobility of Tsinghua University. His research field
is intelligent driving and integrated technology of
new energy vehicle system.
Lijun Zhao was born in Harbin, Heilongjiang
Province, and he received a Ph.D. degree from
the Robotics Institute at Harbin Insti-
tute of Technology, China, in 2009. He currently
works as a professor with the Robotics Institute, State
Key Laboratory of Robotics and System at Harbin
Institute of Technology. Dr. Zhao is the author of
more than 70 SCI/EI articles. His research interests
include SLAM, environments perception and navi-
gation of mobile robots.