Multi-Modal and Multi-Scale Fusion 3D Object
Detection of 4D Radar and LiDAR for Autonomous
Driving
Li Wang, Xinyu Zhang, Jun Li, Baowei Xv, Rong Fu, Haifeng Chen, Lei Yang, Dafeng Jin, Lijun Zhao
Abstract—Multi-modal fusion plays a critical role in 3D object detection, overcoming the inherent limitations of single-sensor perception in autonomous driving. Most fusion methods require data from high-resolution cameras and LiDAR sensors, which are less robust in adverse conditions, and whose detection accuracy drops drastically at long range as the point cloud density decreases. Alternatively, the fusion of Radar and LiDAR alleviates these issues but remains a developing field, especially for 4D Radar, which offers a more robust and broader detection range. Nevertheless, the different data characteristics and noise distributions of the two sensors hinder performance improvement when directly integrating them. Therefore, we are the first to propose a novel fusion method, termed M2-Fusion, for 4D Radar and LiDAR, based on Multi-modal and Multi-scale fusion. To better integrate the two sensors, we propose an Interaction-based Multi-Modal Fusion (IMMF) method utilizing a self-attention mechanism to learn features from each modality and exchange intermediate layer information. To address the precision-efficiency trade-off of single-resolution voxel division, we also put forward a Center-based Multi-Scale Fusion (CMSF) method that first regresses the center points of objects and then extracts features at multiple resolutions. Furthermore, we present a data preprocessing method based on the Gaussian distribution that effectively decreases data noise, reducing errors caused by the point cloud divergence of 4D Radar data in the x-z plane. To evaluate the proposed fusion method, a series of experiments were conducted using the Astyx HiRes 2019 dataset, which includes calibrated 4D Radar and 16-line LiDAR data. The results demonstrate that our fusion method compares favorably with state-of-the-art algorithms. Compared to PointPillars, our method achieves mAP (mean average precision) increases of 5.64% and 13.57% for 3D and BEV (bird's eye view) detection of the car class at the moderate level, respectively.
This work was supported by the National High Technology Research and
Development Program of China under Grant No. 2018YFE0204300, the
National Natural Science Foundation of China under Grant No. 62273198,
U1964203, the China Postdoctoral Science Foundation (No.2021M691780),
and State Key Laboratory of Robotics and Systems (HIT) (SKLRS-2022-KF-
12). (Corresponding author: Xinyu Zhang.)
L. Wang is with the State Key Laboratory of Automotive Safety and Energy,
and the School of Vehicle and Mobility, Tsinghua University, Beijing 100084,
and also with the State Key Laboratory of Robotics and Systems (HIT), Harbin
150001, China (e-mail: wangli thu@mail.tsinghua.edu.cn).
X. Zhang is with the State Key Laboratory of Automotive Safety and
Energy, and the School of Vehicle and Mobility, Tsinghua University, Bei-
jing 100084, China, and also with the Department of Mechanical Engineering, University of Michigan, Ann Arbor, MI 48105, USA (e-mail:
xyzhang@tsinghua.edu.cn).
J. Li, B. Xv, R. Fu, H. Chen, L. Yang, and D. Jin are with
the State Key Laboratory of Automotive Safety and Energy, and
the School of Vehicle and Mobility, Tsinghua University, Beijing
100084, China (e-mail: lj19580324@126.com; xvbaowei718500@163.com;
fu-r16@mails.tsinghua.edu.cn; chenhaifeng0523@foxmail.com; yanglei20@
mails.tsinghua.edu.cn; jindf@tsinghua.edu.cn).
L. Zhao is with the State Key Laboratory of Robotics and System at Harbin
Institute of Technology, Harbin 150001, China (e-mail: zhaolj@hit.edu.cn).
Fig. 1. The overall concept of M2-Fusion processing both sparse 4D Radar
and LiDAR point clouds simultaneously. We propose two novel modules,
IMMF and CMSF, remarkably improving detection accuracy. The average
precision bar charts above compare M2-Fusion and seven mainstream object
detection networks (PointRCNN [1], SECOND [2], PV-RCNN [3], Point-
Pillars [4], Part-A2[5], Voxel R-CNN [6], and MVX-Net [7]). As shown,
our method outperforms the others by a large margin in both 3D and BEV
object detection. “C+R” and “C+L” mean “Camera + Radar” and “Camera +
LiDAR”, respectively.
Index Terms—Object detection, 4D Radar, multi-modal fusion,
autonomous driving
I. INTRODUCTION
The increasing safety requirements for autonomous driving have motivated greater environmental adaptability in onboard sensing systems [8]–[11]. Previous works have investigated the application of multiple sensors for object detection, including cameras, LiDAR, and Radar [12]–[15]. However, practical applications are primarily limited by two challenges: sensor performance itself and algorithm implementation [12].
Fusing data from multiple modalities can significantly improve
accuracy and overcome critical issues in autonomous driving.
High-resolution cameras are typically used to determine the
shape and texture of complex objects, such as road signs and
traffic signals [16]. Although computer vision algorithms have
made some progress in visual 3D detection, the task remains
challenging due to a lack of accurate depth information [17].
LiDAR is another commonly used sensor that can accurately
calculate object distance relative to the sensor. However, the LiDAR point cloud becomes sparse as range increases, and the object detection capability decreases significantly.
Camera and LiDAR systems can also be adversely affected by
weather conditions such as heavy rain, snow, or fog [18], [19].
These factors increase the risks associated with autonomous
driving and limit the development of related technologies to a
large extent.
Radar has been receiving increased attention from industrial
and academic communities as a potential alternative. These
systems generally transmit millimeter waves in the frequency
range from 30 GHz to 300 GHz and record the returned beams.
As millimeter wave signals are weakly attenuated in the atmosphere, Radar can detect objects at long range by measuring the round-trip time. In addition, millimeter wave Radar can adapt to bad weather conditions due to its strong penetrability, allowing long-range observations (usually over 200 m) in inclement conditions such as heavy rain, fog, and night driving [20]. 3D Radar systems, typically employed as an auxiliary measurement for other sensors, collect horizontal position and velocity information (x, y, v) using antennas positioned only in the x and y directions. Usually, the azimuth angle resolution is over 10°, and there is no vertical angle resolution. In contrast, 4D Radar sensors use antennas positioned in three directions to obtain 3D position and velocity information (x, y, z, v), thus providing denser point clouds. The azimuth angle resolution is about 1°, and the vertical angle resolution is about 2°. For instance, the Astyx HiRes 2019 [21] dataset used in this work includes more than 1,000 4D Radar points per frame. Compared to 3D Radar, 4D Radar significantly improves angle resolution and vertical ranging capability, enabling denser point cloud acquisition.
High-level autonomous driving requires multi-modal fu-
sion [7], [22]–[24] to achieve sufficient and robust detection
performance. Fused data are often collected from various
sensors, including 3D Radar, cameras, LiDAR, and 4D Radar.
Specifically, 4D Radar offers substantial long-range (detection
range over 200 m) and robustness advantages, while LiDAR
offers higher accuracy. A more practical and cost-effective 3D
object detection method could thus be developed by fusing
these modalities effectively. However, research on the fusion of 4D Radar and LiDAR remains scarce. Moreover, 4D Radar data is sparser than LiDAR data, making it difficult to estimate object shapes and sizes. Furthermore, the range and height of 4D Radar point clouds are fuzzy due to aliasing and false detections caused by clutter and multi-path echoes.
Accordingly, some objects are highly visible in LiDAR (e.g.,
cars and pedestrians) but appear blurry on the Radar. Thus,
effectively combining these two types of point cloud features
is a critical problem. One intuitive solution would be to
directly splice the features extracted from the two modalities to
achieve fusion. Nevertheless, this strategy suffers from a low
correlation between LiDAR and Radar data since the spatial
distributions vary significantly across point clouds for the
two modalities. To date, inter-modality relationships between
4D Radar and LiDAR have rarely been investigated in a
framework used for object detection.
Motivated by the above discussion, as far as the authors
know, we are the first to propose a 3D object detection
fusion method for 4D Radar and LiDAR, termed Multi-
modal and Multi-scale Fusion (M2-Fusion). This process
consists of two core modules: Interaction-based Multi-Modal
Fusion (IMMF) and Center-based Multi-Scale Fusion (CMSF).
Through a large number of comparative experiments with
existing mainstream methods, state-of-the-art 3D detection
performance is achieved by significantly improving feature
expression capabilities using a self-attention mechanism and
multi-scale fusion (see Fig. 1).
One of the primary challenges in multi-modal fusion is to
maintain a strong relationship between different features of the
same object from each modality. A self-attention mechanism
is a viable approach for accomplishing feature interaction,
allowing the parameters from different modalities to learn
from each other. Therefore, we propose an interaction-based
multi-modal fusion method termed IMMF. The module fuses
and updates each 4D Radar and LiDAR feature using the
attention-weighted information from the other modality. Then
the information flow is transmitted to each modality to capture
complex intra-modal relationships. As a result, the attention weights conditioned within each modality allow the 4D Radar and LiDAR features to reinforce each other.
Secondly, the high sparsity of 4D Radar and LiDAR data
often necessitates the enhancement of feature expressions in
each modality. For instance, voxel-based frameworks offer
excellent performance and are considered the mainstream
technology for 3D object detection [2], [4], [25]. In this
process, a 3D space is divided into regular 3D voxels, and
a convolutional neural network (CNN) is utilized to extract
features for compression into a bird’s eye view feature map
for 3D object detection. However, due to the uneven density
of point clouds in 3D spaces, setting a single scale for default
voxel grids is insufficient to represent all information in a given
scenario. Smaller grids produce more refined localization and
features, while larger grids are faster due to fewer non-empty
grids and smaller feature tensors. Multi-scale methods, which
employ the benefits of both small and large grids, have shown
great potential in various 3D object detection tasks [26]. One
intuitive solution would be directly merging voxel features
from high-resolution and low-resolution data as input to the
next backbone, thus forming an aggregated feature across
various voxel scales. However, this method inevitably causes
tremendous memory and computational pressure on the limited
onboard computing resources. Moreover, the superposition
of feature maps at different scales does not highlight object
characteristics, and the fuzziness of the 4D Radar may produce
additional noise.
Thus, to better integrate multi-scale point cloud features,
we propose a center-based multi-scale fusion method termed
CMSF, which includes a two-step strategy. In the first step,
key points are extracted from objects with higher scores. In
the second step, voxels are selected around key points in a
low-resolution grid and then converted to a high-resolution
grid. Pseudo-images of all scales are then standardized using
an adaptive max-pooling operation.
In addition, we propose a data preprocessing method based
on Gaussian normal distribution to reduce the data noise
of the 4D Radar. Generally, 4D Radar sensors have good
ranging accuracy, but the insufficient resolution in the vertical
direction leads to poor accuracy in the z-axis direction and
introduces considerable data noise. We utilize the Gaussian
normal distribution to obtain the statistic rule of 4D Radar
point cloud distribution in the recent Astyx HiRes 2019 [21]
dataset, a 3D object detection dataset that includes 4D Radar
data. Then, noise in the 4D Radar point cloud is corrected to
improve data quality.
The primary contributions of our work are as follows:
• We are the first to propose a novel fusion method, termed M2-Fusion, that utilizes 4D Radar and LiDAR data for 3D object detection in autonomous driving, offering higher precision and, in particular, better long-range detection accuracy.
• We propose an interaction-based multi-modal fusion method termed IMMF. This module learns correlations between the two modalities by utilizing attention mechanisms, emphasizing critical features and reducing noise to improve accuracy and robustness for 3D object detection.
• We develop a center-based multi-scale fusion method termed CMSF, which combines the advantages of small-scale and large-scale features. We demonstrate that selecting voxels in a particular range around key points results in better efficiency and accuracy than methods using only one quantization scale.
• We conduct extensive experiments on the Astyx HiRes 2019 dataset to evaluate M2-Fusion. Experimental results show that our method achieves state-of-the-art performance.
II. RELATED WORK
A. Radar-based Methods
Several recent studies have applied Radar to the field of au-
tonomous driving for tasks such as free space estimation [27],
object detection [14], [28], object classification [15], [29]–
[31], and instance segmentation [32]. For example, Li et al.
[33] investigated Radar characteristics and proposed a deep
learning model that applied a CNN to object and feature
detection. However, these experiments were conducted using
a static device rather than a dynamic scene. Chen et al.
[14] proposed a Radar-based 2D object detection method
based on PointNet [34] and 2D spatial coordinates acquired
from Radar data, ego-motion compensation Doppler velocities,
and Radar cross-section measurements. Unsupervised coupling
networks were also used for change detection with Radar
and cameras [35], [36]. These methods utilized 3D Radar for
2D object detection, but the detection accuracy was not ideal
because 3D Radar lacked vertical information. Xv et al. [37]
proposed a pillar feature attention network for 4D Radar that
could identify dependencies for every pillar. Although some
improvements were reported, the algorithm was applied solely
to Radar and did not fully use data from other modalities. In
this study, we focus on the fusion of Radar and LiDAR for
3D object detection in autonomous driving.
B. LiDAR-based Methods
LiDAR-based 3D object detection is typically performed
using point-based, voxel-based, or BEV-based methods. Point-
based methods extract features from the original point cloud.
PointRCNN [1] was a two-stage point-based approach that
generated 3D region proposals from point clouds by first
dividing them into the top-down front and background points.
In the second stage, the proposals were converted into standard
coordinates and local features were learned in detail to obtain
results. Based on PointRCNN [1], Part-A2[5] developed a
method that could simultaneously predict high-quality 3D
proposals and accurate part location information in objects.
PointRGCN [38] was also a point-based method. It utilized
graph convolution to classify and regress region proposal
boxes. Voxel-based methods convert point clouds into voxels
and extract features from each voxel. VoxelNet [25] was
the first method to encode point clouds into voxels. How-
ever, it suffered from low efficiency, particularly for large
voxel quantities. SECOND [2] applied sparse convolutions to
non-empty voxels, significantly reducing computational costs.
SASSD [39], a one-stage method, outperformed comparable
two-stage methods. PointPillars [4] further improved these
results by dividing point clouds into individual pillars and
mapping them onto a 2D pseudo-image. A 2D convolution
was then used to regress bounding boxes, achieving the highest
detection efficiency to date. In contrast, BEV-based methods
process point clouds as pseudo-images. PIXOR [40] cut point
clouds into multi-channel pseudo-images in the vertical di-
rection. However, vertical information was often lost when
projecting to BEV, requiring supplemented height information
from other angles. Based on PIXOR [40], HDNet [41] took
advantage of geometric and semantic prior information pro-
vided by high-definition (HD) maps to enhance the robustness
and accuracy.
Although these LiDAR-based methods have achieved good
results, they still face a trade-off between accuracy and effi-
ciency. Point cloud density is typically distributed unevenly,
and most data is from the background. As the range increases,
the point cloud density decreases rapidly, which leads to
low detection accuracy. However, few methods consider these
problems.
C. Camera-Radar Fusion
In order to improve the robustness of 3D object detection,
multiple researchers have proposed methods to fuse camera
and radar data for object detection in bad weather. In camera-
Radar fusion algorithms, the camera images are typically
used to extract area proposals or 2D detection bounding
boxes, while 3D Radar provides depth information for the
final detection results. For example, Ramin et al. [42] built on the center-point detection network CenterNet [43] and employed a frustum-based method to associate objects detected by 3D Radar with detected image center points.
This approach utilized 3D Radar data to supplement image
features, thereby regressing object attributes such as size,
depth, rotation, and speed. CRF-Net [44] performed a multi-
level fusion of the camera and radar data for object detection
based on the implementation of RetinaNet [45], which learned
to fuse each modality effectively for the detection result at
each level. RODNet [46] employed a camera-radar fusion
strategy to achieve robust 3D object detection in various
driving scenarios. Fu et al. [47] proposed a fusion method for
roadside cameras and millimeter wave radars as an application
of edge computing. However, 3D Radar only plays an auxiliary
role in this case, and a significant gap remains in estimating
object depth compared to LiDAR.
D. Camera-LiDAR Fusion
Camera-LiDAR fusion has been attracting increased atten-
tion in recent years. MV3D [22], a pioneering multi-view
fusion network based on region proposals, integrated RGB
images, front views from point clouds, and BEV views to
project point cloud features of BEV view into front views for
image fusion. However, MV3D did not perform well in small
object detection. Multi-modal fusion detection networks have
been adopted in prior studies to generate region proposals [22],
[23], [48], thereby improving detection accuracy for small
objects. As discussed earlier, a primary reason for the low
accuracy of existing methods is the fusion of camera and BEV
images collected from two different perspectives. Converting
images or LiDAR data to other views significantly reduces the
data's resolution and accuracy. To solve this problem, PointPainting [24] performed semantic segmentation on the image and appended the segmentation results to the point cloud, which was then fed to a LiDAR detection network to enhance precision.
However, PointPainting required more computing resources
because the original point cloud data was not preprocessed.
Secondly, the semantic segmentation model and the 3D point
cloud detection model required a high degree of coupling,
which limited the scope of the application. For example, [17]
divided object detection into several related tasks in an end-
to-end training architecture to fuse LiDAR and camera data.
SCANet [49] employed a spatial channel attention module to
select discriminative features. Meyer et al. presented a fusion
algorithm based on LaserNet [50] that could accurately detect
objects at long ranges [51]. But LaserNet performed poorly on
the KITTI dataset, which was too small to fully train their large
network; hence, this method relied on larger datasets. In [52],
Wang et al. provided a different perspective for transferring
images into a point cloud, achieving the highest performance
for the KITTI 3D object detection benchmark. Shi et al. [5]
presented part-aware modules and part-aggregation modules
to enhance the expression of point clouds. RoarNet [44]
used images to generate proposal regions prior to using point
clouds. The method could deal with the asynchronous situation
between LiDAR and the camera. However, the detection accuracy depended on the recall of the proposals, and undetected object proposals could not be recovered in the second step. Such fusion was limited when LiDAR points were sparse, and inclement weather, such as fog, rain, and snow, also decreased the quality of LiDAR and camera data significantly.
E. Radar-LiDAR Fusion
Few works have studied the fusion of Radar and LiDAR. RadarNet [18] applied a voxel network
to feature extraction and conversion in 3D Radar and LiDAR.
However, this approach ignores the low resolution and sparsity
of 3D Radar data, which lacks vertical position information.
As a result, features extracted from 3D Radar have provided
limited contributions to improving fusion networks, particu-
larly when applying standard LiDAR-based methods. Unlike
3D Radar, which only provides horizontal position informa-
tion, 4D Radar offers 3D coordinates and generates point
clouds similar to LiDAR. As a result, 4D Radar point clouds
can be merged with LiDAR without specific transformations.
Prior research on 4D Radar and LiDAR data fusion is
lacking, primarily because 4D Radar sensors are still de-
veloping. In addition, currently available feature extraction
methods of the point cloud are designed for LiDAR, which
may not be suitable for 4D Radar point clouds. Another reason
is that low-density point clouds for LiDAR cannot produce
high-grade detection results. Regardless, we believe LiDAR
point cloud features could help extract 4D Radar point cloud
features. The key to this process lies in the interaction between
point cloud modalities during feature extraction, especially in
the case of 4D Radar data. To this end, we propose a 3D
object detection method based on the fusion of 4D Radar and
LiDAR point cloud features. Self-attention is used to introduce
information between Radar and LiDAR modalities to establish
these interactions.
The critical task in multi-scale fusion is enriching features
with strong semantics and strong detail expression capabil-
ities. One representative work was the FPN [53] algorithm,
which extracted features from images at varying scales. As
a result, feature maps at all levels exhibited reliable semantic
information and high-resolution features. A primary limitation
of FPN was the considerable memory costs, which may
result in a longer inference time. While FPN is generally
used in 2D object detection, it also has some applications
for 3D object detection in LiDAR point clouds. PV-RCNN
[20] introduced a 3D sparse Conv layer at different scales in
the feature extraction stage, which were then combined with
point cloud and BEV features. However, previous studies have
only combined features at a single resolution. Voxel-FPN [26]
achieved multi-scale fusion at the data processing stage, as
point clouds were divided into voxels at multiple resolutions.
Features were then extracted along the same dimension to
achieve feature fusion. Qian et al. [19] fused LiDAR and
Radar data across both time dimensions and sensing modalities
using deep late fusion. These studies illustrate the necessity of
multi-scale fusion, which still suffers from several problems,
including large memory consumption and long inference time.
The following section describes improvements made to multi-
scale fusion mechanisms in this study.
Fig. 2. The overall framework of our proposed M2-Fusion method consisting of four sub-networks. (1) IMMF: the module in the green box demonstrates
multi-modal fusion, which learns correlations between two modalities by utilizing an attention mechanism. (2) CMSF: the module in the blue box performs
multi-scale feature extraction by dividing selected pixels for key points into small-scale pixels. (3) 2D CNN backbone. (4) RPN (Region Proposal Network)
head.
III. M2-FUSION FRAMEWORK
In this section, we introduce the proposed M2-Fusion
method in detail, the framework of which is shown in Fig. 2. In
this process, point cloud features from 4D Radar and LiDAR
data are effectively fused using an IMMF module based on
multi-modal fusion and a CMSF module based on multi-scale
fusion.
A. Interaction-based Multi-Modal Fusion (IMMF)
A comparison of 4D Radar and LiDAR data suggests
their forms are similar, though Radar exhibits weak reflection
echoes from vehicles, pedestrians, and cyclists. However, the
Radar point cloud for the background environment is also
much more extensive than that of the recognition objects.
Existing network structures are better at identifying particular
objects in uniform backgrounds. Therefore, it is essential to fo-
cus our limited resources on the core portion when integrating
the two modalities, for which we propose an interaction-based
multi-modal fusion approach (IMMF). Precisely, the tensor de-
composed from a feature in the attention mechanism (between
modalities) constitutes a weight matrix that exerts network
learning attention and exchanges inter-modality tensors. In
other words, LiDAR representations guide the attention of 4D
Radar, and 4D Radar representations direct the attention of
LiDAR.
The network structure in the green dotted box in Fig. 2
shows the detailed IMMF module, a symmetrical structure
that consists of two self-attention models. IMMF facilitates
interaction between modalities by exchanging tensors in sym-
metric structures and guiding the network to learn more
valuable features. Following PointPillars [4], we divide the input point cloud into S×S pillars. The points in each pillar are expanded from 4 dimensions (x, y, z, r) to 10 dimensions (x, y, z, r, x_c, y_c, z_c, x_p, y_p, z_p), where r represents the reflectivity. The added dimensions are calculated as follows:

$(x_c, y_c, z_c) = (x, y, z) - (x_m, y_m, z_m)$
$(x_p, y_p, z_p) = (x, y, z) - (x_g, y_g, z_g)$,  (1)

where (x_c, y_c, z_c) is the deviation of each point relative to the pillar center, (x_m, y_m, z_m) is the center coordinate of each pillar, (x_p, y_p, z_p) is the deviation of each point from the grid center, and (x_g, y_g, z_g) is the center coordinate of each grid.
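For clarity, the sketch below illustrates the 10-dimensional point decoration of Eq. (1) in NumPy. The function name and the convention that the pillar center is the arithmetic mean of its points (as in PointPillars) are our own assumptions, not code released by the authors.

```python
import numpy as np

def expand_pillar_points(points, pillar_center, grid_center):
    """Decorate the points of one pillar as in Eq. (1) (illustrative sketch).

    points:        (K, 4) array of (x, y, z, r) inside one pillar
    pillar_center: (3,)  coordinates (x_m, y_m, z_m) of the pillar center
    grid_center:   (3,)  coordinates (x_g, y_g, z_g) of the pillar's grid cell
    Returns a (K, 10) array (x, y, z, r, x_c, y_c, z_c, x_p, y_p, z_p).
    """
    xyz = points[:, :3]
    offset_c = xyz - pillar_center      # deviation from the pillar center
    offset_p = xyz - grid_center        # deviation from the grid-cell center
    return np.concatenate([points, offset_c, offset_p], axis=1)
```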
We extract the features of each pillar from the LiDAR and 4D Radar point clouds and set F_L and F_R, the initial features for LiDAR and 4D Radar, as the input to the IMMF module. To improve computational efficiency, we first reduce the features to 64 dimensions, denoted F_{L64} and F_{R64}:

$F_{L64} = \mathrm{Maxpool}(\mathrm{Linear}(F_L))$
$F_{R64} = \mathrm{Maxpool}(\mathrm{Linear}(F_R))$,  (2)

where Maxpool(·) represents a maximum pooling layer and Linear(·) indicates a fully connected layer. The features are then further reduced to 16 dimensions, denoted F_{L16} and F_{R16}:

$F_{L16} = \mathrm{Conv}(\mathrm{Maxpool}(\mathrm{Linear}(F_L)))$
$F_{R16} = \mathrm{Conv}(\mathrm{Maxpool}(\mathrm{Linear}(F_R)))$,  (3)

where Conv(·) denotes a convolutional layer. The corresponding weight matrices F_{Lw} and F_{Rw} can then be calculated as follows:

$F_{Lw} = \mathrm{Softmax}((F_{L16})^{T} F_{R16})$
$F_{Rw} = \mathrm{Softmax}((F_{R16})^{T} F_{L16})$,  (4)
Fig. 3. The network framework for the CMSF module. The input can be LiDAR or 4D Radar point cloud. The backbone network uses large-scale pseudo-
images to regress and select key points with high scores. Pixels around these key points are then divided into small scales, and the pseudo-images are
regenerated. Finally, all pseudo-images are superimposed to generate features with excellent representation capabilities. Specifically, S in the figure represents
the scale of pillars. The red square represents key points, and its surrounding orange squares represent the chosen pillars.
where Softmax(·) is the softmax function.
The sizes of F_{Lw} and F_{Rw} are M×N and N×M, respectively, where M is the number of LiDAR features and N is the number of Radar features. These two weight matrices contain modal information about each other. We then multiply F_{Lw} on the right by F_{R64}. The resulting feature size is M×d, since F_{Lw} has a size of M×N and F_{R64} has a size of N×d (d is the feature dimension). This feature is consistent with the size of F_{L64}; F_{L64} is subtracted from it, and the result is passed through a linear layer, a normalization layer, and an activation function before being added back to F_{L64} to acquire the interactive LiDAR features F_{Lt}. The same operation is conducted to obtain the interactive Radar features F_{Rt}. This process can be represented as:

$F_{Lm} = \mathrm{ReLU}(\mathrm{BN}(\mathrm{LN}(F_{Lw} F_{R64} - F_{L64})))$
$F_{Rm} = \mathrm{ReLU}(\mathrm{BN}(\mathrm{LN}(F_{Rw} F_{L64} - F_{R64})))$,  (5)

$F_{Lt} = F_{Lm} + F_{L64}$
$F_{Rt} = F_{Rm} + F_{R64}$,  (6)

where ReLU(·) is a rectified linear unit (ReLU) activation function, BN(·) denotes a batch normalization layer, LN(·) represents a linear layer, and F_{Lm} and F_{Rm} are intermediate variables.
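To make the data flow of Eqs. (2)-(6) concrete, the following PyTorch-style sketch implements one IMMF block. The class name, tensor shapes, and layer sizes are our own assumptions; this is an illustration of the mechanism rather than the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class IMMFBlock(nn.Module):
    """Interaction-based multi-modal fusion, Eqs. (2)-(6) (illustrative sketch)."""

    def __init__(self, d_in=10, d_feat=64, d_attn=16):
        super().__init__()
        self.lin_l = nn.Linear(d_in, d_feat)        # Linear(.) in Eq. (2)
        self.lin_r = nn.Linear(d_in, d_feat)
        self.conv_l = nn.Conv1d(d_feat, d_attn, 1)  # Conv(.) in Eq. (3)
        self.conv_r = nn.Conv1d(d_feat, d_attn, 1)
        self.ln_l = nn.Linear(d_feat, d_feat)       # LN(.) in Eq. (5) is a linear layer
        self.ln_r = nn.Linear(d_feat, d_feat)
        self.bn_l = nn.BatchNorm1d(d_feat)
        self.bn_r = nn.BatchNorm1d(d_feat)

    def forward(self, pts_l, pts_r):
        # pts_l: (M, P, 10) decorated points of M LiDAR pillars (P points each)
        # pts_r: (N, P, 10) decorated points of N Radar pillars
        # Eq. (2): Linear + max pooling over the points of each pillar -> 64-D features.
        f_l64 = self.lin_l(pts_l).max(dim=1).values          # (M, 64)
        f_r64 = self.lin_r(pts_r).max(dim=1).values          # (N, 64)

        # Eq. (3): 1x1 convolution compresses the pooled features to 16-D.
        f_l16 = self.conv_l(f_l64.t().unsqueeze(0)).squeeze(0).t()   # (M, 16)
        f_r16 = self.conv_r(f_r64.t().unsqueeze(0)).squeeze(0).t()   # (N, 16)

        # Eq. (4): cross-modal attention weights, shapes (M, N) and (N, M).
        w_l = F.softmax(f_l16 @ f_r16.t(), dim=-1)
        w_r = F.softmax(f_r16 @ f_l16.t(), dim=-1)

        # Eqs. (5)-(6): exchange information across modalities, then add residuals.
        f_lm = F.relu(self.bn_l(self.ln_l(w_l @ f_r64 - f_l64)))
        f_rm = F.relu(self.bn_r(self.ln_r(w_r @ f_l64 - f_r64)))
        return f_lm + f_l64, f_rm + f_r64
```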
B. Center-based Multi-Scale Fusion (CMSF)
The division method used in PointPillars [4] can divide
point clouds into high-resolution or low-resolution pillars,
depending on the voxel size. High-resolution voxel divisions
can learn more features and improve the resulting detection ac-
curacy at the cost of increased training difficulty and inference
time. In contrast, low-resolution voxel divisions exhibit shorter
training time and higher efficiency but lack local details,
which leads to lower detection accuracy. Moreover, since most point cloud voxels are divided at a fixed scale, density differences between specific areas in the 3D space of a single point cloud frame can be pronounced. It is also common to use farthest-point sampling in voxel-based networks, which unifies the number of points in each pillar to one value, while pillars with fewer points are padded with zeros. However, this sampling discards abundant information. Thus, a single fixed scale is insufficient to express features with all required information, especially for distant objects with sparse point clouds or small objects such as pedestrians. To resolve this issue, we propose a two-stage method that fuses multi-scale information around key points: object key points are regressed in the first stage and then used to select point clouds for multi-scale fusion in the second stage. The
employed to produce center points in the pseudo-images, with
chosen pillars corresponding to these center points. Pillars are
processed by the IMMF module, allowing information from
the same scales and different modalities to interact. Afterward,
pseudo-images are generated from these concatenated features
and stacked to enhance feature expression.
Specifically, the original point cloud is divided into two scales (S and S/2). Initially, we encode pillars at scale S and generate a pseudo-image $I \in \mathbb{R}^{H \times W \times 64}$ with a size of H×W. The coordinates (x, y) are then transferred to the corresponding ground-truth heat map. Key points in the ground truth are calculated by:

$C_x = \frac{x - x_{min}}{x_{max} - x_{min}} \times h_w$
$C_y = \frac{y - y_{min}}{y_{max} - y_{min}} \times h_l$,  (7)

where C_x and C_y are the coordinates of a key point in the ground truth, x and y are the center coordinates of a 3D bounding box, x_{min}, x_{max}, y_{min}, and y_{max} are the corresponding minimum and maximum values, and h_w and h_l are the length and width of
the heat map. All ground-truth key points are then splatted onto the heat map using a Gaussian kernel function.
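As an illustration of Eq. (7) and the Gaussian splatting step, the sketch below builds a ground-truth heat map in NumPy. The kernel width sigma and the function signature are our assumptions, since the paper does not specify them.

```python
import numpy as np

def draw_center_heatmap(boxes_xy, pc_range, heatmap_hw, sigma=2.0):
    """Build a key-point heat map from box centers (illustrative sketch).

    boxes_xy:   (K, 2) ground-truth box centers (x, y) in sensor coordinates
    pc_range:   (x_min, x_max, y_min, y_max) of the point-cloud range
    heatmap_hw: (h_w, h_l) integer heat-map width and length
    """
    x_min, x_max, y_min, y_max = pc_range
    h_w, h_l = heatmap_hw
    heatmap = np.zeros((h_l, h_w), dtype=np.float32)
    ys, xs = np.mgrid[0:h_l, 0:h_w]
    for x, y in boxes_xy:
        cx = (x - x_min) / (x_max - x_min) * h_w   # Eq. (7)
        cy = (y - y_min) / (y_max - y_min) * h_l
        g = np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))
        heatmap = np.maximum(heatmap, g)           # keep the peak value per pixel
    return heatmap
```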
Key point coordinates are applied over a large-scale space,
as shown in Fig. 3. The expression of these key points can
be represented by a heat map, with a ratio four times smaller
than the input pseudo-image. In addition, the width and height
of pillars with scale S/2 are decreased by a factor of two
compared with pillars of scale S, scaling all corresponding
coordinates by a factor of eight. Pillars are then selected near
adjacent key points, and a fixed area of square sides (exhibiting
the most vehicles) is identified for use in generating pseudo-
images. Images on a larger scale are then transformed to the
proper dimensions by adaptive pooling. This allows pseudo-
images on both large and small scales to be concatenated
to generate high-dimensional features. Finally, the backbone
network is applied to the high-dimensional features to regress
detected objects.
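A minimal sketch of the scale-alignment step just described follows, assuming the two pseudo-images share the channel dimension: adaptive max pooling brings the finer (S/2) map to the coarser resolution before channel-wise concatenation. The shapes and function name are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def fuse_multiscale_pseudo_images(img_large, img_small):
    """Align and concatenate multi-scale pseudo-images (illustrative sketch).

    img_large: (C, H, W)   pseudo-image built from pillars of scale S
    img_small: (C, 2H, 2W) pseudo-image rebuilt around key points with scale S/2
    """
    pooled = F.adaptive_max_pool2d(img_small.unsqueeze(0),
                                   img_large.shape[-2:]).squeeze(0)
    # (2C, H, W) high-dimensional feature map fed to the 2D backbone
    return torch.cat([img_large, pooled], dim=0)
```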
C. Pre-processing of 4D Radar Point Clouds
4D Radar is equipped with azimuthal and vertical angle
antennas to measure horizontal and vertical positions. Gen-
erally, a high angular resolution of 4D Radar needs a very
large effective aperture from a number of antennas. However,
the number of vertical angle antennas of 4D Radar is often
small, which will cause aperture resolution issues along the
azimuthal direction. For example, due to aperture limitations,
various objects with differing angles for the same speed and
distance conditions may be challenging to distinguish in the
azimuthal direction. Compared with LiDAR, the point cloud
output by 4D Radar has more miscellaneous points or noise
points, which are determined by the electromagnetic scattering
characteristics including multiple reflections, refractions, and
diffractions to a certain extent. Specifically, the theoretical vertical angle of the Radar in Astyx HiRes 2019 is θ_i degrees, but the actual one is higher than 9×θ_i degrees, as shown in Fig. 4 (a). Although the observed targets are virtually all above the ground, a considerable number of 4D Radar points fall below the ground due to this aperture resolution issue, which affects the detection accuracy. Therefore, we propose a novel method
for addressing noise in the data.
We use the Gaussian normal distribution to determine whether the vertical angle θ_t is in the normal range, based on the Shapiro-Wilk (S-W) test. Here we focus on two descriptive statistics, i.e., skewness and kurtosis, which help determine whether the point cloud conforms to a Gaussian normal distribution. Specifically, the symmetry of the distribution, i.e., the inequality between the left and right distribution tails, can be verified using the skewness value, while the peakedness and heaviness of the distribution tails can be confirmed using the kurtosis value. The skewness (g_1) and kurtosis (g_2) are shape parameters of the point cloud, which are calculated frame by frame as follows:

$g_1(t) = E\left[\left(\frac{\theta_t - \mu}{\sigma}\right)^3\right]$
$g_2(t) = E\left[\left(\frac{\theta_t - \mu}{\sigma}\right)^4\right]$,  (8)
Fig. 4. Y-axis views of 4D Radar (blue) and LiDAR (red) point clouds,
including (a) the original data and (b) the processed data.
where θ_t denotes the divergence angle of the 4D Radar point cloud, µ is the mean value and σ is the standard deviation of the divergence angle, and E refers to the mean operation.
As the maximum divergence angle for the 4D Radar is
far beyond the sensor setting range, we reduce the point
cloud elevation in the x-z plane so that all vertical slopes are
limited to within θ_m. When the absolute values of the kurtosis and skewness are both less than 1, we assume the data frame conforms to the Gaussian normal distribution and set θ_m to the mean value of the vertical angle. Otherwise, if the frame does not conform to the Gaussian normal distribution, we set θ_m to the median value. Finally, we update the coordinates by the following equation:

$x'_t = \cos\left(\frac{2\theta_t}{\theta_m}\arctan\frac{z_t}{x_t}\right)\sqrt{x_t^2 + z_t^2}$
$z'_t = \sin\left(\frac{2\theta_t}{\theta_m}\arctan\frac{z_t}{x_t}\right)\sqrt{x_t^2 + z_t^2}$,  (9)
where x_t and z_t are the original coordinates of the point cloud, x'_t and z'_t are the coordinates after pre-processing, and θ_m is the angle to which the pre-processed data are compressed.
After the above pre-processing of the 4D Radar point cloud, the noisy data are corrected for the subsequent detection task. The intuitive effect is shown in Fig. 4 (b), which illustrates that the point cloud data is transformed into a reasonable range.
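The following NumPy/SciPy sketch mirrors the per-frame normality check and the choice of θ_m described above. The final clipping step is a simplified stand-in for the exact remapping of Eq. (9), so the whole function should be read as an assumption-laden illustration rather than the authors' preprocessing code.

```python
import numpy as np
from scipy.stats import skew, kurtosis

def limit_radar_elevation(points):
    """Per-frame 4D Radar elevation correction (illustrative sketch).

    points: (K, >=3) array with columns (x, y, z, ...) for one Radar frame.
    The skew/kurtosis thresholds follow the text; clipping the elevation angle
    to [-theta_m, theta_m] is a simplified substitute for Eq. (9).
    """
    x, z = points[:, 0], points[:, 2]
    rng = np.sqrt(x ** 2 + z ** 2)            # range in the x-z plane
    theta = np.degrees(np.arctan2(z, x))      # per-point divergence angle

    g1 = skew(theta)                          # Eq. (8), frame-level skewness
    g2 = kurtosis(theta)                      # excess kurtosis (Gaussian ~ 0)
    theta_m = np.mean(theta) if (abs(g1) < 1 and abs(g2) < 1) else np.median(theta)

    # Simplified correction: keep each point's range, restrict its vertical slope.
    new_theta = np.radians(np.clip(theta, -abs(theta_m), abs(theta_m)))
    out = points.copy()
    out[:, 0] = np.cos(new_theta) * rng
    out[:, 2] = np.sin(new_theta) * rng
    return out
```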
D. Overall Loss
We design a loss function that combines the regression requirements for 3D bounding boxes and key points. The overall loss draws on both PointPillars and CenterNet [43]: the key-point training loss is identical to that of CenterNet [43], while the 3D bounding box regression loss is equivalent to that of PointPillars. Combining these two losses gives:

$L_{total} = \alpha L_c + \beta L_p$,  (10)
where L_{total} is the total loss, L_c is the key-point regression loss, L_p is the 3D bounding box regression loss, and α and β are balance coefficients.
The CenterNet [43] loss function is given by:

$L_c = L_k + \lambda_{size} L_{size} + \lambda_{off} L_{off}$,  (11)
where L_k is a penalty-reduced pixel-wise logistic regression (focal) loss, λ_{size} = 0.1, λ_{off} = 1, L_{size} is the object size loss, and L_{off} is the offset loss.
A prediction $\hat{Y}_{xyc} = 1$ corresponds to a detected key point, while $\hat{Y}_{xyc} = 0$ denotes the background. These conditions can be represented as follows:

$L_k = -\frac{1}{N}\sum_{xyc}\begin{cases}(1 - \hat{Y}_{xyc})^{\alpha}\log(\hat{Y}_{xyc}), & \text{if } Y_{xyc} = 1\\ L_{otherwise}, & \text{otherwise}\end{cases}$  (12)

$L_{otherwise} = (1 - Y_{xyc})^{\beta}(\hat{Y}_{xyc})^{\alpha}\log(1 - \hat{Y}_{xyc})$,  (13)

$Y_{xyc} = \exp\left(-\frac{(x - \tilde{p}_x)^2 + (y - \tilde{p}_y)^2}{2\sigma_p^2}\right)$,  (14)

where N is the number of key points in the image, α = 2, and β = 4.
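For reference, a compact PyTorch sketch of the penalty-reduced focal loss in Eqs. (12)-(13) is given below; the clamping epsilon and the tensor shapes are our assumptions.

```python
import torch

def centernet_focal_loss(pred, gt, alpha=2, beta=4, eps=1e-6):
    """Penalty-reduced focal loss over heat maps (illustrative sketch).

    pred, gt: heat maps of shape (C, H, W); gt holds the Gaussian-splatted
    key points of Eq. (14), with exact 1s at object centers.
    """
    pos = gt.eq(1).float()                 # pixels where Y_xyc = 1
    neg = 1.0 - pos
    pred = pred.clamp(eps, 1 - eps)

    pos_loss = ((1 - pred) ** alpha) * torch.log(pred) * pos          # Eq. (12)
    neg_loss = ((1 - gt) ** beta) * (pred ** alpha) * torch.log(1 - pred) * neg  # Eq. (13)

    num_pos = pos.sum().clamp(min=1.0)     # N, the number of key points
    return -(pos_loss.sum() + neg_loss.sum()) / num_pos
```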
The loss function for the object size regression is then given by:

$L_{size} = \frac{1}{N}\sum_{k=1}^{N}\left|\hat{S}_{p_k} - s_k\right|$,  (15)

where $p_k = \left(\frac{x_1^{(k)} + x_2^{(k)}}{2}, \frac{y_1^{(k)} + y_2^{(k)}}{2}\right)$ is the center point of object k, $s_k = (x_2^{(k)} - x_1^{(k)}, y_2^{(k)} - y_1^{(k)})$ is the object size, $\hat{S}$ is the corresponding size prediction, and $(x_1^{(k)}, y_1^{(k)}, x_2^{(k)}, y_2^{(k)})$ is the bounding box for object k.
The offset loss function is given by:

$L_{off} = \frac{1}{N}\sum_{p}\left|\hat{O}_{\hat{p}} - \left(\frac{p}{R} - \hat{p}\right)\right|$,  (16)

where $\hat{O}$ is the local offset, p is the ground truth key point, and $\hat{p} = \lfloor p/R \rfloor$ is its low-resolution equivalent.
The loss function for 3D bounding box regression can be expressed as:

$L_p = \frac{1}{N_{pos}}(\beta_{cls} L_{cls} + \beta_{dir} L_{dir} + \beta_{loc} L_{loc})$,  (17)

where N_{pos} is the number of positive anchors, L_{cls} is the object classification loss, L_{dir} is the direction loss computed with a softmax function, L_{loc} is the localization loss, β_{cls} = 1, β_{dir} = 0.2, and β_{loc} = 2.
The classification focal loss can be represented as:

$L_{cls} = -\alpha(1 - p_a)^{\gamma}\log p_a$,  (18)

where p_a is the class probability for an anchor, α = 0.25, and γ = 2.
We parameterize the 3D regression results as (x, y, z, l, w, h, θ), where x, y, z represent the center location, l, w, h are the length, width, and height of the object box, and θ is the yaw rotation around the z-axis. The localization loss function is then given by:

$L_{loc} = \sum_{u \in (x, y, z, w, h, l, \theta)} \mathrm{SmoothL1}(\Delta u)$,  (19)

The 7-dimensional residual vector $(\Delta x, \Delta y, \Delta z, \Delta l, \Delta w, \Delta h, \Delta\theta)$ between the ground truth and the regressed result is calculated by:

$\Delta x = \frac{x^{gt} - x^{a}}{d^{a}},\quad \Delta y = \frac{y^{gt} - y^{a}}{d^{a}},\quad \Delta z = \frac{z^{gt} - z^{a}}{d^{a}}$
$\Delta w = \log\frac{w^{gt}}{w^{a}},\quad \Delta l = \log\frac{l^{gt}}{l^{a}},\quad \Delta h = \log\frac{h^{gt}}{h^{a}}$
$\Delta\theta = \sin(\theta^{gt} - \theta^{a}),\quad d^{a} = \sqrt{(w^{a})^2 + (l^{a})^2}$,  (20)

where SmoothL1(·) is a smooth L1 loss function, and the superscripts gt and a represent the ground truth and the anchor box values, respectively.
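The box-residual targets of Eq. (20) can be computed directly; the sketch below restates them in NumPy with illustrative names. The encoding mirrors the standard SECOND/PointPillars scheme that the paper adopts.

```python
import numpy as np

def encode_box_residuals(gt_box, anchor_box):
    """Residual targets of Eq. (20) (illustrative sketch).

    gt_box, anchor_box: length-7 arrays (x, y, z, l, w, h, theta).
    """
    xg, yg, zg, lg, wg, hg, tg = gt_box
    xa, ya, za, la, wa, ha, ta = anchor_box
    da = np.sqrt(wa ** 2 + la ** 2)            # anchor diagonal d^a
    return np.array([
        (xg - xa) / da,                        # delta x
        (yg - ya) / da,                        # delta y
        (zg - za) / da,                        # delta z (normalized by d^a, as in Eq. (20))
        np.log(wg / wa),                       # delta w
        np.log(lg / la),                       # delta l
        np.log(hg / ha),                       # delta h
        np.sin(tg - ta),                       # delta theta
    ])
```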
IV. EXPERIMENTS
This section presents the results of several experiments
applying the proposed M2-Fusion framework to the Astyx
HiRes 2019 dataset, verifying its advantages for 3D object
detection. The proposed method is evaluated using compar-
isons with seven mainstream detection methods, including
PointRCNN [1], SECOND [2], PV-RCNN [3], PointPillars [4],
Part-A2[5], Voxel R-CNN [6], and multi-modal fusion method
MVX-Net [7]. Adequate ablation studies are also conducted
to verify the effectiveness of each proposed module, including
data preprocessing (DP), IMMF, and CMSF. In addition,
hyperparameter tuning is used to assess performance trends.
A. Dataset
Astyx HiRes 2019 [21] is an open-access database that
includes 4D Radar data for use in 3D object detection. Its
purpose is to provide high-resolution 4D Radar data to the
research community. The set consists of 4D Radar frames,
16-line LiDAR data, and camera images with temporal and
spatial calibration. The data were split using a ratio of 3:1 to
ensure the training and test distributions were consistent. The
4D Radar and LiDAR point clouds typically included 1,000
to 10,000 points and 10,000 to 25,000 points, respectively,
with images exhibiting a resolution of 2,048×618 pixels. The
maximum range of point cloud data reached 200 meters for 4D
Radar and 100 meters for 16-line LiDAR. The LiDAR point
clouds were then transferred to the Radar coordinate system
since the 3D bounding boxes were labeled in the Radar point
cloud. The training set mainly included cars and very few
pedestrians and cyclists. Therefore, experimental evaluation
was conducted only for the car category, as the number of
objects outside this was too small. Official KITTI evaluation
protocols were followed, in which an IoU threshold of 0.5
was used for cars; the same IoU threshold was applied to both the bird's eye view (BEV) evaluation and the full 3D evaluation set. These
methods were compared using the mean average precision
(mAP) as an evaluation metric.
B. Implementation Details
The ranges of x, y, and z were set to (0, 69.12 m), (-39.68 m, 39.68 m), and (-3 m, 1 m), respectively, following the point cloud configuration of the KITTI dataset. The fusion
network consisted of a pillar extraction module (used to extract
TABLE I
COMPARATIVE RESULTS FOR MAINSTREAM ALGORITHMS APPLIED TO THE ASTYX HIRES 2019 DATASET.

Modality                   Methods            Reference       3D mAP(%)                BEV mAP(%)
                                                              Easy   Mod.   Hard       Easy   Mod.   Hard
4D Radar                   PointRCNN [1]      CVPR 2019       14.79  11.40  11.32      26.71  18.74  18.60
                           SECOND [2]         SENSORS 2018    23.26  18.02  17.06      37.92  31.01  28.83
                           PV-RCNN [3]        CVPR 2020       27.61  22.08  20.51      49.17  39.88  36.50
                           PointPillars [4]   CVPR 2019       26.03  20.49  20.40      47.38  38.21  36.74
                           Part-A2 [5]        TPAMI 2021      14.96  13.76  13.17      26.46  21.47  20.98
                           Voxel R-CNN [6]    AAAI 2021       23.65  18.71  18.47      37.77  31.26  27.83
16-Line LiDAR              PointRCNN [1]      CVPR 2019       39.03  29.97  29.66      41.34  34.22  32.95
                           SECOND [2]         SENSORS 2018    51.75  43.54  40.72      55.16  45.63  43.57
                           PV-RCNN [3]        CVPR 2020       54.63  44.71  41.26      56.08  46.68  44.86
                           PointPillars [4]   CVPR 2019       54.37  44.21  41.81      58.64  47.67  45.26
                           Part-A2 [5]        TPAMI 2021      45.41  38.45  36.74      49.85  41.85  38.93
                           Voxel R-CNN [6]    AAAI 2021       52.26  44.08  40.06      53.94  44.54  40.43
Camera + 4D Radar          MVX-Net [7]        ICRA 2019       13.20  11.69  11.43      23.57  20.36  19.04
Camera + 16-Line LiDAR     MVX-Net [7]        ICRA 2019       39.16  31.43  30.40      47.04  38.15  35.60
4D Radar + 16-Line LiDAR   Direct fusion      --              54.25  44.33  43.24      66.05  55.75  54.67
4D Radar + 16-Line LiDAR   M2-Fusion (Ours)   --              61.33  49.85  49.12      71.27  61.24  57.03
pillars from 4D Radar and LiDAR point clouds), a feature
fusion module, and a backbone network. Two different scales
were utilized in the CMSF module, with pillar volumes of
[0.16, 0.16, 4] (m) and [0.08, 0.08, 4] (m) for S and S/2,
respectively. The input channel size for the IMMF module was
10 and the output channel size was 64, while the backbone
network consisted of three convolutional blocks and three
deconvolutional blocks. The number of convolution layers, the
step size, and the number of output channels were [3, 5, 5], [2,
2, 2], and [128, 256, 512] in the three convolutional blocks,
respectively, and [3, 5, 5], [1, 2, 4], and [128, 128, 128] in the
three deconvolutional blocks, respectively.
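For convenience, the hyper-parameters listed in this subsection can be summarized as a single configuration dictionary in the OpenPCDet style; the key names below are our own and only the numeric values come from the text.

```python
# Illustrative restatement of the implementation details above.
M2_FUSION_CFG = {
    # [x_min, y_min, z_min, x_max, y_max, z_max] in meters
    "point_cloud_range": [0.0, -39.68, -3.0, 69.12, 39.68, 1.0],
    "pillar_size": {"S": [0.16, 0.16, 4.0], "S/2": [0.08, 0.08, 4.0]},  # meters
    "immf": {"in_channels": 10, "out_channels": 64},
    "backbone_2d": {
        "num_conv_layers": [3, 5, 5],
        "conv_strides": [2, 2, 2],
        "conv_filters": [128, 256, 512],
        "num_deconv_layers": [3, 5, 5],
        "deconv_strides": [1, 2, 4],
        "deconv_filters": [128, 128, 128],
    },
}
```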
C. Training
The open-source OpenPCDet [54] code framework was
utilized to construct the training set, and a single NVIDIA
RTX 3090 was employed to train the model, using 160 epochs.
During training, we adopted the Adam optimizer with an initial learning rate of 3e-3. To reduce false positives, we applied an NMS threshold of 0.01 to remove redundant boxes. Different from PointPillars, we set the maximum number of pillars (P) to 16000 and the maximum number of points per pillar (N) to 32. We modified the 2D backbone of PointPillars, changing the upsample filters from [64, 128, 256] to [128, 256, 516], because the IMMF network outputs 128 features. During training, we utilized the widely adopted data augmentation strategies for 3D object detection, including random flipping along the X axis, global scaling with a random factor sampled from [0.95, 1.05], and global rotation around the Z axis with a random angle sampled from [-π/4, π/4]. We also conducted ground-truth sampling augmentation to randomly "paste" ground-truth objects from other scenes into the current training scenes, simulating objects in various environments. Other training parameters were consistent with PointPillars [4].
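Similarly, the training settings above can be collected in one place; the key names are assumptions, and the values are those reported in the paragraph (the upsample filter value 516 is quoted verbatim from the text).

```python
import math

# Illustrative restatement of the training settings described above.
TRAIN_CFG = {
    "epochs": 160,
    "optimizer": "adam",
    "initial_lr": 3e-3,
    "nms_threshold": 0.01,
    "max_pillars": 16000,          # P
    "max_points_per_pillar": 32,   # N
    "upsample_filters": [128, 256, 516],
    "augmentation": {
        "random_flip_axis": "x",
        "global_scaling_range": [0.95, 1.05],
        "global_rotation_range": [-math.pi / 4, math.pi / 4],
        "gt_sampling": True,       # paste ground-truth objects from other scenes
    },
}
```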
D. Experimental Results
1) 3D Object Detection on Astyx HiRes 2019 Dataset:
Existing fusion networks mainly focus on modalities with
different data formats. However, 4D Radar and LiDAR have
similar formats, so features from the two sensors can com-
plement each other. This is the underlying principle for our
proposed multi-modal fusion network.
To establish baselines, seven mainstream algorithms applied
to the KITTI 3D object detection benchmark were selected. As
the Astyx HiRes dataset differs from KITTI, and the models
were all processed on the KITTI dataset, we converted the
Astyx HiRes dataset into the same format as KITTI based
on the framework OpenPCDet. The training parameters of
these algorithms were kept the same as on KITTI. We first compared the performance of PointRCNN [1], SECOND [2], PV-RCNN [3], PointPillars [4], Part-A2 [5], Voxel R-CNN [6], and the multi-modal fusion method MVX-Net [7] on the 4D Radar, 16-line LiDAR, and camera data from Astyx. The results are shown in Table I. The point-based methods, including PointRCNN and Part-A2, exhibited significantly lower performance than the voxel-based methods. The main reason was that the point clouds of both the 4D Radar and the 16-line LiDAR were very sparse, which was not conducive to extracting point features. Apart from PV-RCNN, PointPillars achieved the best accuracy on both the 4D Radar and the 16-line LiDAR point clouds, while holding significant advantages over PV-RCNN in network complexity and inference time. Thus, we chose the easily extended PointPillars network as our baseline to verify the proposed method.
These algorithms were trained and compared with our model
to evaluate the effectiveness of the proposed method for 4D
Radar and LiDAR fusion. These results demonstrated that our
proposed M2-Fusion method achieved the best results, with
significant improvements over other methods. Compared with
the baseline PointPillars using only 4D Radar, our proposed M2-Fusion method achieved increases of 29.36% (3D mAP) and 23.03% (BEV mAP) at the moderate level, and increases of 5.64% (3D mAP) and 13.57% (BEV mAP) over the baseline PointPillars model trained with 16-line LiDAR. When compared to another multi-modal fusion method, MVX-Net, our method also showed better detection performance. Compared with MVX-Net using the camera and
TABLE II
THE RESULTS OF ABLATION APPLYING M2-FUSION TO THE ASTYX HIRES 2019 DATASET (THE BASELINE IS POINTPILLARS).

4D Radar  16-Line LiDAR  DP  IMMF  CMSF    3D mAP(%)                BEV mAP(%)
                                           Easy   Mod.   Hard       Easy   Mod.   Hard
   ✓                                       26.03  20.49  20.40      47.38  38.21  36.74
   ✓                     ✓                 28.61  21.99  21.35      50.66  41.83  38.70
             ✓                             54.37  44.21  41.81      58.64  47.67  45.26
   ✓         ✓                             54.25  44.33  43.24      66.05  55.75  54.67
   ✓         ✓           ✓                 54.55  45.16  44.40      66.05  57.18  55.59
   ✓         ✓           ✓    ✓            57.15  48.24  47.01      69.64  58.12  56.43
   ✓         ✓           ✓          ✓      56.61  47.67  46.57      67.01  57.35  56.24
   ✓         ✓           ✓    ✓     ✓      61.33  49.85  49.12      71.27  61.24  57.03
4D Radar, our proposed M2-Fusion method achieved large increases of 38.16% (3D mAP) and 40.88% (BEV mAP) at the moderate level. When using the camera and 16-line LiDAR as the inputs of MVX-Net, M2-Fusion also showed increases of 18.42% (3D mAP) and 23.09% (BEV mAP). We also compared against the direct fusion of 4D Radar and 16-line LiDAR to illustrate the effectiveness of our fusion strategy: our method still achieved increases of 5.52% (3D mAP) and 5.49% (BEV mAP) with PointPillars as the baseline. Moreover, the inference speed of M2-Fusion is about 10 fps on a single RTX 3090 GPU.
2) Ablation Studies with M2-Fusion: Ablation studies were
also conducted to verify the effectiveness of each proposed module
compared to the baseline (PointPillars), the results of which
are shown in Table II. The leftmost columns show the experimental
configuration, including the sensor modality, data preprocessing (DP),
and the two proposed modules IMMF and CMSF; the remaining columns
report 3D and BEV mAP at three difficulty levels: easy, moderate, and
hard. The following comparisons were all evaluated at the moderate level.
The baseline (PointPillars) using 4D Radar alone produced the lowest
accuracy. After data preprocessing, the 3D mAP for Radar increased by
1.50% and the BEV mAP increased by 3.62%, which suggests that data
preprocessing offers a clear improvement in Radar detection accuracy
and demonstrates its effectiveness. The third line shows the results
for 16-line LiDAR, and the fourth line shows the results for 4D Radar
fused with LiDAR using direct feature concatenation. This fusion yielded
an improvement of 0.12% in 3D mAP and 8.08% in BEV mAP, illustrating
that 4D Radar can improve BEV accuracy remarkably. The fifth line
provides the fusion results with data preprocessing added, again
illustrating its beneficial effect. The sixth line shows the results
with the IMMF module added, which produced increases of 3.08% (3D mAP)
and 0.94% (BEV mAP) over the fifth-line configuration (4D Radar +
16-line LiDAR + DP). The seventh line shows the results with the CMSF
module added; the corresponding 3D and BEV mAP values increased by
2.51% and 0.17% over the same configuration, respectively. These
comparisons confirm the effectiveness of each proposed component
(DP, IMMF, and CMSF). Finally, we combined the three modules to form
M2-Fusion, which achieved the best results, with improvements of 5.64%
in 3D mAP and 13.57% in BEV mAP over the traditional single-LiDAR
method (the third line, 16-line LiDAR). Moreover, combining multiple
modules yielded a larger improvement than any single module. These
results, visualized in Fig. 5, confirm the overall effectiveness of our
proposed M2-Fusion algorithm for 3D object detection.
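The DP rows of Table II can be read alongside the following hedged sketch of Gaussian-based radar preprocessing. It is only one plausible reading of the idea of suppressing point-cloud divergence in the x-z plane; the column layout of the radar array and the threshold k are assumptions, not values taken from the paper.

```python
# Hypothetical Gaussian outlier filter for 4D Radar points (not the authors' exact
# procedure): fit a Gaussian to the elevation values and drop points beyond k sigma.
import numpy as np

def gaussian_filter_radar(points: np.ndarray, k: float = 2.0) -> np.ndarray:
    """points: (N, C) array whose first three columns are assumed to be x, y, z."""
    z = points[:, 2]
    mu, sigma = z.mean(), z.std()
    if sigma < 1e-6:                  # degenerate frame, nothing to filter
        return points
    return points[np.abs(z - mu) <= k * sigma]
```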
From the above ablation studies, we make the following observations:
(1) The fusion of 4D Radar and 16-line LiDAR achieves better results
than either single-modality method. (2) The data preprocessing of
4D Radar corrects the data and suppresses the influence of noise.
(3) The attention-based interaction between 4D Radar and 16-line LiDAR
exploits the advantages of each modality and enhances the perception
ability. (4) Multi-scale feature extraction captures richer feature
information and improves detection accuracy. Our proposed M2-Fusion
method aggregates these advantages to improve detection accuracy
significantly.
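Observation (3) refers to the IMMF module. A minimal sketch of one way such an attention-based exchange could be wired is given below; it uses standard multi-head cross-attention between radar and LiDAR pillar features and is an assumption-laden illustration rather than the paper's implementation (tensor shapes of (B, N, C) and the channel width are invented for the example).

```python
# Illustrative cross-modal interaction (assumed shapes, not the released IMMF code):
# each modality queries the other, and the returned message is added back to its branch.
import torch
import torch.nn as nn

class CrossModalInteraction(nn.Module):
    def __init__(self, channels: int = 64, heads: int = 4):
        super().__init__()
        self.radar_from_lidar = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.lidar_from_radar = nn.MultiheadAttention(channels, heads, batch_first=True)

    def forward(self, radar_feat, lidar_feat):
        r_msg, _ = self.radar_from_lidar(radar_feat, lidar_feat, lidar_feat)
        l_msg, _ = self.lidar_from_radar(lidar_feat, radar_feat, radar_feat)
        return radar_feat + r_msg, lidar_feat + l_msg

# Example: r, l = CrossModalInteraction(64)(torch.rand(2, 100, 64), torch.rand(2, 100, 64))
```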
Fig. 5. Ablation results for M2-Fusion applied to the Astyx HiRes 2019
dataset, with PointPillars serving as the baseline. The combination of CMSF
and IMMF outperformed comparable models.
3) Accuracy Comparison Experiments at Different Ranges:
4D Radar usually has a longer detection range than LiDAR.
To verify object detection performance over distance, we
conducted accuracy comparison experiments at different
ranges for 4D Radar, 16-line LiDAR, direct fusion, and our
proposed M2-Fusion. The results are shown in Table III and
plotted as a line chart in Fig. 6. As in the experiments of
Table II, PointPillars was used as the base detector.
TABLE III
COMPARATIVE RESULTS OF 3D OBJECT DETECTION USING SINGLE MODALITY AND MULTI-FUSION METHODS APPLIED TO DIFFERENT RANGES. "INF" MEANS INFINITY.
Modality                   | Methods           | 3D mAP(%)                       | BEV mAP(%)
                           |                   | Overall  0-30m  30-50m  50m-Inf | Overall  0-30m  30-50m  50m-Inf
4D Radar                   | PointPillars [4]  | 20.49   34.06   14.76    6.98   | 38.21   52.08   28.81   19.54
16-Line LiDAR              | PointPillars [4]  | 44.21   71.90   21.25    9.09   | 47.67   76.07   24.86    9.09
4D Radar + 16-Line LiDAR   | Direct fusion     | 44.33   67.67   21.50   12.50   | 55.75   77.92   35.64   24.22
4D Radar + 16-Line LiDAR   | M2-Fusion (Ours)  | 49.85   77.26   27.36   15.56   | 61.24   83.73   42.08   27.68
4D Radar + 16-Line LiDAR   | Improvement       | +5.52   +9.59   +5.86   +3.06   | +5.49   +5.81   +6.44   +3.46
In terms of overall accuracy, the detection accuracy of
4D Radar was lower than that of LiDAR, but the gap between
the two narrowed as the range increased. In addition, beyond
30 m the BEV detection accuracy of 4D Radar surpassed that
of LiDAR, and beyond 50 m it was 10.45% higher, a significant
margin. This shows that LiDAR is more accurate for short-range
detection, while 4D Radar has advantages at long range. After
directly fusing the two sensors with the feature concatenation
method of Table II (line 4), accuracy increased overall and in
most ranges, but the accuracy within 50 m was not significantly
better than that of LiDAR alone. In contrast, the detection
accuracy beyond 50 m improved by 3.41% and 15.13% in 3D and
BEV mAP, respectively, illustrating that fusion mainly benefits
long-range detection. Our proposed M2-Fusion method significantly
improved the detection accuracy in every range, and the overall
3D and BEV accuracy improved by 5.52% and 5.49%, respectively,
which demonstrates the effectiveness of the proposed method.
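For reference, the range bins of Table III can be produced by filtering boxes on their planar distance to the sensor. The snippet below only illustrates the binning step (the actual AP computation follows the KITTI protocol), and the box layout is an assumption.

```python
# Hedged sketch of range binning for evaluation (assumed box format [x, y, z, dx, dy, dz, yaw]).
import numpy as np

RANGE_BINS = {"0-30m": (0.0, 30.0), "30-50m": (30.0, 50.0), "50m-Inf": (50.0, np.inf)}

def split_by_range(boxes: np.ndarray) -> dict:
    dist = np.linalg.norm(boxes[:, :2], axis=1)      # planar distance to the ego sensor
    return {name: boxes[(dist >= lo) & (dist < hi)]
            for name, (lo, hi) in RANGE_BINS.items()}
```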
Fig. 6. Comparative results of 3D object detection at different ranges.
Our proposed M2-Fusion achieves the best accuracy in all ranges, especially
for long-range detection.
4) Parameter Comparison Experiment: To fuse point clouds of
different resolutions, the high-resolution pseudo-images were
reshaped to a unified scale. Several reconstruction methods can
perform this reshaping while preserving as much information as
possible. Table IV compares different methods of transforming
pseudo-images from different scales to a standard size. The
results demonstrate the effectiveness of adaptive max pooling,
which outperformed the CNN and max pooling variants and is
highly compatible with our network since it can output features
of arbitrary target dimensions.
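As a concrete illustration of the AMP variant in Table IV, the following sketch uses torch.nn.AdaptiveMaxPool2d to resize a fine-scale pseudo-image to the coarse BEV grid before concatenation. The grid sizes and channel count are assumptions introduced for the example, not values taken from the paper.

```python
# Sketch of unifying pseudo-image scales with adaptive max pooling (assumed sizes).
import torch
import torch.nn as nn

coarse_hw = (496, 432)                            # hypothetical BEV grid of the coarse pillars
amp = nn.AdaptiveMaxPool2d(coarse_hw)

fine = torch.rand(1, 64, 992, 864)                # hypothetical fine-scale pseudo-image
coarse = torch.rand(1, 64, *coarse_hw)
aligned = amp(fine)                               # -> (1, 64, 496, 432), regardless of input size
fused = torch.cat([coarse, aligned], dim=1)       # channel-wise concatenation of the two scales
```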
TABLE IV
A COMPARISON OF DIFFERENT METHODS USED TO TRANSFORM PSEUDO-IMAGES FROM DIFFERENT SCALES. MP AND AMP MEAN THE MAX POOLING AND ADAPTIVE MAX POOLING, RESPECTIVELY.
Methods | 3D mAP(%)              | BEV mAP(%)
        | Easy   Moderate  Hard  | Easy   Moderate  Hard
CNN     | 58.01   47.93   46.57  | 69.19   57.10   55.58
MP      | 59.67   48.74   46.89  | 70.34   58.02   56.60
AMP     | 61.33   49.85   49.12  | 71.27   61.24   57.03
TABLE V
A COMPARISON OF DIFFERENT SCORES FOR KEY POINT SELECTION
METHODS.
Score | 3D mAP(%)              | BEV mAP(%)
      | Easy   Moderate  Hard  | Easy   Moderate  Hard
0.2   | 57.65   47.71   45.10  | 70.05   56.80   55.61
0.4   | 56.72   46.90   45.65  | 70.56   57.68   57.02
0.6   | 61.33   49.85   49.12  | 71.27   61.24   57.03
0.8   | 57.07   48.20   46.75  | 67.59   57.79   56.39
Since the network in the CMSF module predicts key points
with probabilistic scores, key points with low scores could lead
to false detections. As such, we set a score threshold to remove
these points. The influence of this threshold was tested with
different values, the results of which are shown in Table V.
As shown, the model achieved the best results for a score
threshold of 0.6, suggesting that smaller thresholds introduce
errors while larger thresholds filter out too much data.
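The thresholding itself is a one-line filter. The sketch below assumes the key-point head returns center coordinates and scores as tensors (hypothetical names `centers` and `scores`) and simply keeps those above 0.6, the value selected in Table V.

```python
# Hedged sketch of key-point score filtering (assumed tensor outputs of the CMSF head).
import torch

def filter_keypoints(centers: torch.Tensor, scores: torch.Tensor, thr: float = 0.6):
    keep = scores > thr
    return centers[keep], scores[keep]

# Example: kept_centers, kept_scores = filter_keypoints(torch.rand(200, 3), torch.rand(200))
```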
Table VI provides the results of comparison experiments
involving concatenation at different scales in the CMSF module.
Based on prior experience, we selected two pillar scales,
[0.16, 0.16, 4] and [0.08, 0.08, 4], with [0.16, 0.16, 4] as
the baseline pillar scale. The pillar scale has a direct impact
on the results, as accuracy was higher for smaller scales within
a specific range. As seen in Table VI, the multi-scale fusion of
[0.16, 0.16, 4] and [0.08, 0.08, 4] achieved better results,
producing a 1.61% increase in 3D mAP and a 3.12% increase in
BEV mAP over the baseline at the moderate level.
TABLE VI
THE RESULTS OF SCALE PARAMETER VERIFICATION EXPERIMENTS USING THE CMSF MODULE IN OUR PROPOSED M2-FUSION METHOD.
Scale(m)    | 3D mAP(%)            | BEV mAP(%)
            | Easy   Mod.   Hard   | Easy   Mod.   Hard
0.16        | 57.15  48.24  47.00  | 69.64  58.12  56.43
0.16+0.08   | 61.33  49.85  49.12  | 71.27  61.24  57.03
Fig. 7. Qualitative results from the Astyx HiRes 2019 dataset. The LiDAR point cloud is grey and the 4D Radar point cloud is pink. The first row shows
RGB images; the second row provides the ground truth, and the green bounding boxes surrounded by dotted orange circles indicate missed detections. The
third row shows detection results for 4D Radar; the fourth row shows results for 16-line LiDAR, and the blue bounding box surrounded by dotted brown
circles denotes false detections. The last row shows results produced by the proposed M2-Fusion method.
This experiment demonstrates that feature
representations can be enhanced by fusing multiple scales.
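To make the two pillar scales of Table VI concrete, the schematic below computes pillar indices for the same point cloud at 0.16 m and 0.08 m resolutions. The point-cloud range and the downstream feature encoding are assumptions; only the gridding step is shown, not the released CMSF code.

```python
# Schematic multi-scale pillar division (assumed range; illustration only).
import numpy as np

def pillar_indices(points: np.ndarray, pillar_xy: float,
                   pc_range=(0.0, -39.68, 69.12, 39.68)):
    """Return the (ix, iy) pillar index of every in-range point for one grid scale."""
    x_min, y_min, x_max, y_max = pc_range
    mask = ((points[:, 0] >= x_min) & (points[:, 0] < x_max) &
            (points[:, 1] >= y_min) & (points[:, 1] < y_max))
    pts = points[mask]
    ix = ((pts[:, 0] - x_min) / pillar_xy).astype(np.int64)
    iy = ((pts[:, 1] - y_min) / pillar_xy).astype(np.int64)
    return np.stack([ix, iy], axis=1)

# coarse_idx = pillar_indices(cloud, 0.16); fine_idx = pillar_indices(cloud, 0.08)
```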
5) Visualization Experiments: To observe the effects of
modal fusion more intuitively, we provide three visualization
experiments: a Precision-Recall (PR) curve comparison, feature
map visualization, and qualitative detection results. Fig. 8
shows the Precision-Recall (PR) curves of our proposed M2-Fusion
and the direct fusion of 4D Radar and LiDAR based on PointPillars.
It reveals that our method significantly improved accuracy at the
easy, moderate, and hard levels. The visualization of feature maps
for 4D Radar and LiDAR data using the baseline (PointPillars) and
our proposed M2-Fusion is shown in Fig. 9. It is evident that the
feature maps produced by our method contain more information. These
results illustrate that multi-modal interactive fusion enhances the
network's feature extraction capability and makes extracting
long-range features easier. We also provide a qualitative
visualization of the final detection results for single Radar,
LiDAR, and M2-Fusion in Fig. 7, where green boxes denote the ground
truth. We observed that the baseline (PointPillars) method using
only 4D Radar produced many false detections, as several trees and
signal lights were mistaken for cars. The baseline results using
only LiDAR were somewhat better; however, cars at large distances
were missed because the LiDAR data contained too few points at
those ranges. In contrast, our proposed M2-Fusion method achieved
the best results and reduced false detections significantly.
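The PR curves of Fig. 8 follow the KITTI 3D IoU matching criterion. As a simplified reference, the snippet below merely sweeps the detection score ranking given a precomputed true-positive flag per detection; that flag is an assumption that hides the IoU matching step.

```python
# Toy precision-recall sweep (matching to ground truth is assumed to be done already).
import numpy as np

def pr_curve(scores: np.ndarray, is_tp: np.ndarray, num_gt: int):
    order = np.argsort(-scores)                   # rank detections by descending score
    tp = np.cumsum(is_tp[order].astype(np.int64))
    fp = np.cumsum((~is_tp[order]).astype(np.int64))
    precision = tp / np.maximum(tp + fp, 1)
    recall = tp / max(num_gt, 1)
    return precision, recall
```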
Fig. 8. PR curve comparison of our proposed M2-Fusion and the direct
fusion method based on PointPillars. "R+L" means Radar and LiDAR fusion.
Our method clearly surpasses the direct fusion method at the easy, moderate,
and hard levels.
V. CONCLUSION
A novel method based on multi-modal and multi-scale
fusion, termed M2-Fusion, was proposed for 4D Radar and
LiDAR. We verified the role of 4D Radar in 3D object de-
tection for autonomous driving, fusing these data with LiDAR
Fig. 9. Feature maps for (a) 4D Radar data processed using PointPillars,
(b) 4D Radar processed by M2-Fusion, (c) 16-line LiDAR processed by
PointPillars, and (d) 16-line LiDAR processed by M2-Fusion.
for the first time and effectively improving detection accuracy.
A center-based multi-scale fusion method was proposed to
solve the problem of information loss in feature extraction for
sparse point clouds. A multi-modal fusion method based on
a self-attention interaction was also proposed, which achieved
an effective fusion of 4D Radar and LiDAR. The proposed
method, evaluated using the Astyx HiRes 2019 dataset, out-
performed mainstream LiDAR-based object detection methods
significantly. We will consider fusion with camera images in
future research to further improve detection accuracy.
VI. ACKNOWLEDGMENTS
We thank LetPub (www.letpub.com) for linguistic assistance
and pre-submission expert review.
REFERENCES
[1] S. Shi, X. Wang, and H. Li, “Pointrcnn: 3d object proposal generation
and detection from point cloud,” in 2019 IEEE/CVF Conference on
Computer Vision and Pattern Recognition (CVPR), 2019, pp. 770–779.
[2] Y. Yan, Y. Mao, and B. Li, “Second: Sparsely embedded convolutional
detection,” Sensors, vol. 18, no. 10, p. 3337, 2018.
[3] S. Shi, C. Guo, L. Jiang, Z. Wang, J. Shi, X. Wang, and H. Li, "Pv-rcnn: Point-voxel feature set abstraction for 3d object detection," in 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 10526–10535.
[4] A. H. Lang, S. Vora, H. Caesar, L. Zhou, J. Yang, and O. Beijbom,
“Pointpillars: Fast encoders for object detection from point clouds,” in
2019 IEEE/CVF Conference on Computer Vision and Pattern Recogni-
tion (CVPR), 2019, pp. 12 689–12 697.
[5] S. Shi, Z. Wang, J. Shi, X. Wang, and H. Li, “From points to parts: 3d
object detection from point cloud with part-aware and part-aggregation
network,” IEEE transactions on pattern analysis and machine intelli-
gence, vol. 43, no. 8, pp. 2647–2664, 2020.
[6] J. Deng, S. Shi, P. Li, W. Zhou, Y. Zhang, and H. Li, "Voxel r-cnn: Towards high performance voxel-based 3d object detection," in AAAI Conference on Artificial Intelligence (AAAI), vol. 35, no. 02, 2021, pp. 1201–1209.
[7] V. A. Sindagi, Y. Zhou, and O. Tuzel, “Mvx-net: Multimodal voxelnet
for 3d object detection,” in 2019 International Conference on Robotics
and Automation (ICRA), 2019, pp. 7276–7282.
[8] H. Wang, Y. Huang, A. Khajepour, D. Cao, and C. Lv, "Ethical decision-making platform in autonomous vehicles with lexicographic optimization based model predictive controller," IEEE Transactions on Vehicular Technology, vol. 69, no. 8, pp. 8164–8175, 2020.
[9] H. Wang, Y. Huang, A. Soltani, A. Khajepour, and D. Cao, "Cyber-physical predictive energy management for through-the-road hybrid vehicles," IEEE Transactions on Vehicular Technology, vol. 68, no. 4, pp. 3246–3256, 2019.
[10] Y. Huang, H. Wang, A. Khajepour, H. Ding, K. Yuan, and Y. Qin, "A novel local motion planning framework for autonomous vehicles based on resistance network and model predictive control," IEEE Transactions on Vehicular Technology, vol. 69, no. 1, pp. 55–66, 2020.
[11] S. Wen, J. Chen, F. R. Yu, F. Sun, Z. Wang, and S. Fan, "Edge computing-based collaborative vehicles 3d mapping in real time," IEEE Transactions on Vehicular Technology, vol. 69, no. 11, pp. 12470–12481, 2020.
[12] E. Arnold, O. Y. Al-Jarrah, M. Dianati, S. Fallah, D. Oxtoby, and A. Mouzakitis, "A survey on 3d object detection methods for autonomous driving applications," IEEE Transactions on Intelligent Transportation Systems, vol. 20, no. 10, pp. 3782–3795, 2019.
[13] M. Herzog and K. Dietmayer, “Training a fast object detector for lidar
range images using labeled data from sensors with higher resolution,” in
2019 IEEE Intelligent Transportation Systems Conference (ITSC), 2019,
pp. 2707–2713.
[14] A. Danzer, T. Griebel, M. Bach, and K. Dietmayer, “2d car detection
in radar data with pointnets,” in 2019 IEEE Intelligent Transportation
Systems Conference (ITSC), 2019, pp. 61–66.
[15] K. Patel, K. Rambach, T. Visentin, D. Rusev, M. Pfeiffer, and B. Yang, "Deep learning-based object classification on automotive radar spectra," in 2019 IEEE Radar Conference (RadarConf), 2019, pp. 1–6.
[16] K. Xie, Z. Zhang, B. Li, J. Kang, T. D. Niyato, S. Xie, and Y. Wu,
“Efficient federated learning with spike neural networks for traffic sign
recognition,” IEEE Transactions on Vehicular Technology, pp. 1–1,
2022.
[17] M. Liang, B. Yang, Y. Chen, R. Hu, and R. Urtasun, “Multi-task multi-
sensor fusion for 3d object detection,” in 2019 IEEE/CVF Conference
on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 7337–
7345.
[18] B. Yang, R. Guo, M. Liang, S. Casas, and R. Urtasun, “Radarnet:
Exploiting radar for robust perception of dynamic objects,” in European
Conference on Computer Vision. Springer, 2020, pp. 496–512.
[19] K. Qian, S. Zhu, X. Zhang, and L. E. Li, “Robust multimodal vehicle
detection in foggy weather using complementary lidar and radar sig-
nals,” in 2021 IEEE/CVF Conference on Computer Vision and Pattern
Recognition (CVPR), 2021, pp. 444–453.
[20] M. Russell, A. Crain, A. Curran, R. Campbell, C. Drubin, and W. Mic-
cioli, “Millimeter-wave radar sensor for automotive intelligent cruise
control (icc),” IEEE Transactions on Microwave Theory and Techniques,
vol. 45, no. 12, pp. 2444–2453, 1997.
[21] M. Meyer and G. Kuschk, "Automotive radar dataset for deep learning based 3d object detection," in 2019 16th European Radar Conference (EuRAD), 2019, pp. 129–132.
[22] X. Chen, H. Ma, J. Wan, B. Li, and T. Xia, "Multi-view 3d object detection network for autonomous driving," in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 6526–6534.
[23] J. Ku, M. Mozifian, J. Lee, A. Harakeh, and S. L. Waslander, "Joint 3d proposal generation and object detection from view aggregation," in 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2018, pp. 1–8.
[24] S. Vora, A. H. Lang, B. Helou, and O. Beijbom, “Pointpainting: Se-
quential fusion for 3d object detection,” in 2020 IEEE/CVF Conference
on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 4603–
4611.
[25] Y. Zhou and O. Tuzel, “Voxelnet: End-to-end learning for point cloud
based 3d object detection,” in 2018 IEEE/CVF Conference on Computer
Vision and Pattern Recognition, 2018, pp. 4490–4499.
[26] H. Kuang, B. Wang, J. An, M. Zhang, and Z. Zhang, “Voxel-fpn: Multi-
scale voxel feature aggregation for 3d object detection from lidar point
clouds,” Sensors, vol. 20, no. 3, p. 704, 2020.
[27] L. Sless, B. E. Shlomo, G. Cohen, and S. Oron, “Road scene under-
standing by occupancy grid learning from sparse radar clusters using
semantic segmentation,” in 2019 IEEE/CVF International Conference
on Computer Vision Workshop (ICCVW), 2019, pp. 867–875.
[28] R. Nabati and H. Qi, “Rrpn: Radar region proposal network for object
detection in autonomous vehicles,” in 2019 IEEE International Confer-
ence on Image Processing (ICIP), 2019, pp. 3093–3097.
[29] O. Schumann, C. Wöhler, M. Hahn, and J. Dickmann, "Comparison of random forest and long short-term memory network performances in classification tasks using radar," in 2017 Sensor Data Fusion: Trends, Solutions, Applications (SDF), 2017, pp. 1–6.
[30] S. Kim, S. Lee, S. Doo, and B. Shim, “Moving target classification in au-
tomotive radar systems using convolutional recurrent neural networks,”
in 2018 26th European Signal Processing Conference (EUSIPCO), 2018,
pp. 1482–1486.
[31] J. Lombacher, K. Laudt, M. Hahn, J. Dickmann, and C. Wöhler, "Semantic radar grids," in 2017 IEEE Intelligent Vehicles Symposium (IV), 2017, pp. 1170–1175.
[32] O. Schumann, M. Hahn, J. Dickmann, and C. Wöhler, "Semantic segmentation on radar point clouds," in 2018 21st International Conference on Information Fusion (FUSION), 2018, pp. 2179–2186.
[33] D. Brodeski, I. Bilik, and R. Giryes, "Deep radar detector," in 2019 IEEE Radar Conference (RadarConf), 2019, pp. 1–6.
[34] R. Q. Charles, H. Su, M. Kaichun, and L. J. Guibas, “Pointnet: Deep
learning on point sets for 3d classification and segmentation,” in 2017
IEEE Conference on Computer Vision and Pattern Recognition (CVPR),
2017, pp. 77–85.
[35] J. Liu, M. Gong, K. Qin, and P. Zhang, “A deep convolutional coupling
network for change detection based on heterogeneous optical and radar
images,” IEEE Transactions on Neural Networks and Learning Systems,
vol. 29, no. 3, pp. 545–559, 2018.
[36] Y. Zhu, F. Zhuang, J. Wang, G. Ke, J. Chen, J. Bian, H. Xiong, and
Q. He, “Deep subdomain adaptation network for image classification,”
IEEE Transactions on Neural Networks and Learning Systems, vol. 32,
no. 4, pp. 1713–1722, 2021.
[37] B. Xu, X. Zhang, L. Wang, X. Hu, Z. Li, S. Pan, J. Li, and Y. Deng,
“Rpfa-net: a 4d radar pillar feature attention network for 3d object de-
tection,” in 2021 IEEE International Intelligent Transportation Systems
Conference (ITSC), 2021, pp. 3061–3066.
[38] J. Zarzar, S. Giancola, and B. Ghanem, “Pointrgcn: Graph convolu-
tion networks for 3d vehicles detection refinement,” arXiv preprint
arXiv:1911.12236, 2019.
[39] C. He, H. Zeng, J. Huang, X.-S. Hua, and L. Zhang, “Structure aware
single-stage 3d object detection from point cloud,” in Proceedings of
the IEEE/CVF Conference on Computer Vision and Pattern Recognition,
2020, pp. 11 873–11 882.
[40] B. Yang, W. Luo, and R. Urtasun, “Pixor: Real-time 3d object detection
from point clouds,” in 2018 IEEE/CVF Conference on Computer Vision
and Pattern Recognition, 2018, pp. 7652–7660.
[41] B. Yang, M. Liang, and R. Urtasun, “Hdnet: Exploiting hd maps for 3d
object detection,” in Conference on Robot Learning. PMLR, 2018, pp.
146–155.
[42] R. Nabati and H. Qi, “Centerfusion: Center-based radar and camera
fusion for 3d object detection,” in 2021 IEEE Winter Conference on
Applications of Computer Vision (WACV), 2021, pp. 1526–1535.
[43] K. Duan, S. Bai, L. Xie, H. Qi, Q. Huang, and Q. Tian, "Centernet: Keypoint triplets for object detection," in 2019 IEEE/CVF International Conference on Computer Vision (ICCV), 2019, pp. 6568–6577.
[44] K. Shin, Y. P. Kwon, and M. Tomizuka, “Roarnet: A robust 3d object
detection based on region approximation refinement,” in 2019 IEEE
Intelligent Vehicles Symposium (IV), 2019, pp. 2510–2515.
[45] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, "Focal loss for dense object detection," in 2017 IEEE International Conference on Computer Vision (ICCV), 2017, pp. 2999–3007.
[46] Y. Wang, Z. Jiang, Y. Li, J.-N. Hwang, G. Xing, and H. Liu, “Rodnet:
A real-time radar object detection network cross-supervised by camera-
radar fused object 3d localization,” IEEE Journal of Selected Topics in
Signal Processing, vol. 15, no. 4, pp. 954–967, 2021.
[47] Y. Fu, D. Tian, X. Duan, J. Zhou, P. Lang, C. Lin, and X. You, “A
camera–radar fusion method based on edge computing,” in 2020 IEEE
International Conference on Edge Computing (EDGE), 2020, pp. 9–14.
[48] M. Liang, B. Yang, S. Wang, and R. Urtasun, “Deep continuous fusion
for multi-sensor 3d object detection,” in Proceedings of the European
conference on computer vision (ECCV), 2018, pp. 641–656.
[49] H. Lu, X. Chen, G. Zhang, Q. Zhou, Y. Ma, and Y. Zhao, “Scanet:
Spatial-channel attention network for 3d object detection,” in ICASSP
2019 - 2019 IEEE International Conference on Acoustics, Speech and
Signal Processing (ICASSP), 2019, pp. 1992–1996.
[50] G. P. Meyer, A. Laddha, E. Kee, C. Vallespi-Gonzalez, and C. K.
Wellington, “Lasernet: An efficient probabilistic 3d object detector for
autonomous driving,” in 2019 IEEE/CVF Conference on Computer
Vision and Pattern Recognition (CVPR), 2019, pp. 12 669–12 678.
[51] G. P. Meyer, J. Charland, D. Hegde, A. Laddha, and C. Vallespi-
Gonzalez, “Sensor fusion for joint 3d object detection and semantic
segmentation,” in 2019 IEEE/CVF Conference on Computer Vision and
Pattern Recognition Workshops (CVPRW), 2019, pp. 1230–1237.
[52] Y. Wang, W.-L. Chao, D. Garg, B. Hariharan, M. Campbell, and K. Q. Weinberger, "Pseudo-lidar from visual depth estimation: Bridging the gap in 3d object detection for autonomous driving," in 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 8437–8445.
[53] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie, "Feature pyramid networks for object detection," in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 2117–2125.
[54] O. D. Team, “OpenPCDet: An open-source toolbox for 3d object de-
tection from point clouds,” https://github.com/open-mmlab/OpenPCDet,
2020.
Li Wang was born in Shangqiu, Henan Province,
China in 1990. He received his Ph.D. degree in
mechatronic engineering at State Key Laboratory of
Robotics and System, Harbin Institute of Technol-
ogy, in 2020.
He was a visiting scholar at Nanyang Technological
University for two years. Currently, he is a
postdoctoral fellow in the State Key Laboratory of
Automotive Safety and Energy, and the School of
Vehicle and Mobility, Tsinghua University.
Dr. Wang is the author of more than 20 SCI/EI
articles. His research interests include autonomous driving perception, 3D
robot vision, and multi-modal fusion.
Xinyu Zhang was born in Huining, Gansu Province,
and he received a B.E. degree from the School
of Vehicle and Mobility at Tsinghua University, in
2001.
He was a visiting scholar at the University of
Cambridge. He is currently a researcher with the
School of Vehicle and Mobility, and the head of
the Mengshi Intelligent Vehicle Team at Tsinghua
University.
Dr. Zhang is the author of more than 30 SCI/EI
articles. His research interests include intelligent
driving and multimodal information fusion.
Jun Li was born in Jilin Province, China in 1958.
He received a Ph.D. degree in internal-combustion
engineering at Jilin University of Technology, in
1989.
He joined China FAW Group Corporation
in 1989 and currently works as a professor with
the School of Vehicle and Mobility at Tsinghua
University. Now he also serves as the chairman of
the China Society of Automotive Engineers (SAE).
Over the years, Dr. Li has presided over product
development and technological innovation at large-
scale automobile companies in China. Dr. Li has many scientific research
achievements in the fields of automotive powertrain, new energy vehicles,
and intelligent connected vehicles.
Dr. Li is the author of more than 98 papers. In 2013, he was elected
an academician of the Chinese Academy of Engineering (CAE) for his
contributions to vehicle engineering.
Baowei Xv was born in Yulin, Shanxi Province,
China in 1995. He has been a graduate student
majoring in forestry engineering at Northeast
Forestry University since 2019, with a research focus on simul-
taneous localization and mapping.
Since Aug. 2020, he has been interning at State
Key Laboratory of Automotive Safety and Energy,
and the School of Vehicle and Mobility, Tsinghua
University, responsible for the development of 3D
object detection algorithm based on multi-sensor
fusion.
Rong Fu was born in Wuhan, Hubei Province, China
in 1994. She received a bachelor's degree
in Electronic Information Engineering from Beihang
University, Beijing, China, in 2016, and a
Ph.D. degree from the Intelligence Sensing Labo-
ratory (ISL), Department of Electronic Engineering,
Tsinghua University, Beijing, China, in 2022. Her
current research interests include statistical signal
processing, compressive sensing, optimization meth-
ods, model-based deep learning, and their applica-
tions in signal detection and parameter estimation.
Haifeng Chen was born in Chuzhou, Anhui
Province, China in 1996. He is currently a Master's
student in the Institute of Information Engineering
at China University of Mining and Technology,
Beijing. His research interests include computer vi-
sion, deep learning, and 3D object detection.
He is now interning at State Key Laboratory of
Automotive Safety and Energy, and the School of
Vehicle and Mobility, Tsinghua University, respon-
sible for the development of 3D object detection
algorithm based on multi-sensor fusion.
Lei Yang was born in Datong, Shanxi Province,
China in 1993. He received his master’s degree in
robotics at Beihang University, in 2018. Then he
joined the Autonomous Driving R&D Department
of JD.COM as an algorithm researcher from 2018
to 2020.
He is now a PhD student in School of Vehicle
and Mobility at Tsinghua University since 2020. His
research interests are computer vision, autonomous
driving and environmental perception.
Dafeng Jin was born in China in 1965. He is an
associate professor of the School of Vehicle and
Mobility of Tsinghua University. His research field
is intelligent driving and integrated technology of
new energy vehicle system.
Lijun Zhao was born in Harbin, Heilongjiang
Province, and he received a Ph.D. degree from
the Robotics Institute at Harbin Insti-
tute of Technology, China, in 2009. He currently
works as a professor with the Robotics Institute, State
Key Laboratory of Robotics and System at Harbin
Institute of Technology. Dr. Zhao is the author of
more than 70 SCI/EI articles. His research interests
include SLAM, environments perception and navi-
gation of mobile robots.