RGB-L: Enhancing Indirect Visual SLAM Using LiDAR-Based Dense Depth Maps
Florian Sauerbeck
Institute of Automotive Technology
Technical University of Munich
85748 Garching, Germany
florian.sauerbeck@tum.de
Benjamin Obermeier
Institute of Automotive Technology
Technical University of Munich
85748 Garching, Germany
benjamin.obermeier@tum.de
Johannes Betz
Autonomous Vehicle Systems Lab
Technical University of Munich
85748 Garching, Germany
johannes.betz@tum.de
Martin Rudolph
Institute of Automotive Technology
Technical University of Munich
85748 Garching, Germany
martin.rudolph@tum.de
Abstract—In this paper, we present a novel method for integrating 3D LiDAR depth measurements into the existing ORB-SLAM3 by building upon its RGB-D mode. We propose and compare two methods of depth map generation: a conventional computer vision method, namely an inverse dilation operation, and a supervised deep learning-based approach. We integrate the former directly into the ORB-SLAM3 framework by adding a so-called RGB-L (LiDAR) mode that directly reads LiDAR point clouds. The proposed methods are evaluated on the KITTI Odometry dataset and compared to each other and to the standard ORB-SLAM3 stereo mode. We demonstrate that, depending on the environment, advantages in trajectory accuracy and robustness can be achieved. Furthermore, we demonstrate that the runtime of the ORB-SLAM3 algorithm can be reduced by more than 40 % compared to the stereo mode. The related code for the ORB-SLAM3 RGB-L mode is available as open-source software at https://github.com/TUMFTM/ORB_SLAM3_RGBL.
Keywords—autonomous vehicles, simultaneous localization and mapping (SLAM), sensor fusion
I. INTRODUCTION
Robust and precise localization is needed for all modules of autonomous vehicle software, such as path planning, trajectory following, or object prediction. To enable full self-driving capabilities, relying solely on satellite-based global localization systems such as the Global Positioning System (GPS) is not feasible: such sensors are expensive, and sensor dropouts can lead to subsequent system failures. Therefore, simultaneous localization and mapping (SLAM) algorithms are important in developing robots and autonomous vehicles [1], [2]. These algorithms can provide a precise robot pose using only sensor measurements of the environment and therefore constitute an important localization technique. To achieve robust SLAM, it is necessary to include environmental information from various sensors such as LiDARs and cameras. Since every sensor has its own individual advantages and disadvantages, sensor fusion can greatly benefit the localization of autonomous vehicles [3].
This paper presents an enhancement to the well-known ORB-SLAM3 [6] algorithm. Currently, this algorithm relies solely on camera data to calculate the robot's pose. The goal is to enhance this algorithm with LiDAR depth measurements to achieve higher localization accuracy and better robustness, e.g., in urban environments. To summarize, this paper comprises four main contributions:
- We present a method to integrate LiDAR depth measurements into the existing ORB-SLAM3 algorithm.
- We propose and compare two methods of dense depth map generation from LiDAR point clouds.
- We present a variety of experiments for the localization of an autonomous vehicle that demonstrate the improved accuracy and robustness of our method.
- We compare the runtimes and show a decrease of more than 40 % compared to stereo mode.
II. RELATED WORK
Several approaches for fusing multimodal sensor data in SLAM algorithms have been presented in the past; they are summarized below.
A. Visual SLAM
In general, visual SLAM methods can be divided into two subcategories: feature-based (indirect) SLAM and direct SLAM. Indirect SLAM detects and matches features across multiple image frames and minimizes the geometric error. The most widely used indirect SLAM systems belong to the ORB-SLAM family [4]-[6].
In contrast, direct approaches minimize the photometric error directly. The best-known direct SLAM algorithms are LSD-SLAM (Large-Scale Direct Monocular SLAM) [7], SVO (Semi-direct Visual Odometry) [8], and DSO (Direct Sparse Odometry) [9], [10].
B. RGB-D SLAM
RGB-D (depth) cameras provide an infrared-based depth measurement, so the SLAM algorithm does not have to reconstruct 3D information purely from 2D images. An enhancement of DSO with RGB-D measurements was presented in [11]. The main disadvantages of RGB-D cameras are their limited range of only a few meters and their strong dependency on lighting conditions [12], which makes them unusable for automotive applications.
C. Visual-LiDAR SLAM
Debeunne et al. [13] gave a detailed overview of the state of the art in visual-LiDAR SLAM. Generally, the authors distinguished between concurrent approaches, LiDAR-assisted visual SLAM, camera-assisted LiDAR SLAM, and approaches with tightly coupled sensor fusion.
1) LiDAR-Assisted Visual SLAM: LIMO (LiDAR-monocular visual odometry) [14] uses depth information from LiDAR to fit local planes and thereby obtain depth for features detected in the camera images; moreover, semantic information is included. Shin et al. [15] combined a direct visual SLAM approach with sparse depth data from LiDARs. They used a sliding window-based optimization and included loop closure functionality, obtaining robust SLAM results on different large-scale datasets.
VLOAM (Visual-LiDAR Odometry And Mapping) [16] uses LiDAR data to obtain the 3D information of the visual features. Pose estimates from visual feature matching and LiDAR scan matching are combined in a graph optimization. VIL-SLAM, presented by Shao et al. in 2019 [17], combines a visual-inertial odometry (VIO) frontend with LiDAR mapping and LiDAR-enhanced loop closure. Both algorithms demonstrated state-of-the-art precision under normal conditions with improved robustness against adverse conditions. The idea of combining ORB-SLAM with LiDAR data had previously been presented in [18]; however, no alternative upsampling methods were presented, and a scientific analysis and discussion of the results were missing.
2) Tightly Coupled Visual-LiDAR SLAM: Recently, research has focused on SLAM algorithms that tightly fuse multimodal sensor data. The LIC-Fusion method (LiDAR-inertial-camera) [19] was presented in 2019. It fuses LiDAR features, camera features, and high-frequency IMU data for improved localization accuracy and robustness; to this end, it incorporates online spatiotemporal calibration. Its successor, LIC-Fusion 2.0 [20], added sliding-window plane-feature tracking to further improve performance.
Chou et al. [21] presented another tightly coupled visual-LiDAR SLAM: TVL-SLAM. It has separate frontends for LiDAR and camera (ORB-SLAM2-based), which are fused in a shared graph optimization backend with LiDAR loop closure. TVL-SLAM showed improvements over purely visual and purely LiDAR SLAM implementations.
R2LIVE [22] uses an error-state Kalman filter in combination with factor-graph optimization to fuse LiDAR, IMU, and visual measurements. R3LIVE [23] added a colored 3D map by combining VIO (visual-inertial odometry) and LIO (LiDAR-inertial odometry) subsystems; moreover, online photometric calibration was added to the system. Both algorithms are robust enough to handle various camera failures or LiDAR-degenerated scenarios and can run in real time on small onboard platforms.
Zhao et al. proposed Super Odometry [24], an IMU-centric robust state estimation pipeline. They showed convincing results, especially for perceptually degraded environments.
D. Generation of Depth Maps
Our proposed method uses dense depth maps to obtain pixel-wise depth information for the corresponding monocular camera images. Different approaches exist for the generation of such maps. Currently, a major area of research is monocular depth estimation without depth measurements; the current state of the art is presented in [25]-[27].
In this study, we focus on depth completion of sparse depth measurements, namely LiDAR point clouds. For completing sparse depth data with camera images, a distinction is made between two basic approaches: conventional computer vision (CV) and deep learning-based approaches. The approach of Ku et al. [28] belongs to the former. Their work showed that well-designed classical CV algorithms can outperform deep learning methods under certain circumstances. A significant advantage is that their algorithms require very few computational resources and can easily run on a CPU. Moreover, no training is needed, so overfitting is not possible.
By contrast, deep learning-based algorithms dominate established depth completion benchmarks such as the KITTI depth completion benchmark [29]. Some of the best-performing and most widely used networks for LiDAR depth completion are PENet and ENet [30], ACMNet [31], and RigNet [32]. Further detailed information on existing approaches and the current state of the art can be found in dedicated literature reviews [33], [34].
III. APPROACH
The overall approach of this work is presented in Figure 1. Each camera image is handed over to the SLAM interface, while the LiDAR depth measurements (point cloud data, pcd) are transformed into the camera coordinate frame and upsampled.
A. ORB-SLAM3 Integration
As the baseline indirect visual SLAM algorithm, we use ORB-SLAM3 [6]. An RGB-D mode is already integrated and provides the option of directly feeding a depth map for each corresponding camera image. This allows us to use the RGB-D mode with previously generated and saved depth maps, as depicted in Figure 1a. Moreover, we introduce the RGB-L mode shown in Figure 1b: its depth module transforms the LiDAR data into the camera frame, upsamples it online using conventional CV methods, and feeds the result to the SLAM module.
Fig. 1. Setup of ORB-SLAM3 for working with LiDAR-based dense depth maps: (a) standard RGB-D mode with external depth map generation; (b) novel RGB-L mode with integrated depth module.
B. Generation of Depth Maps
Compared to cameras, LiDARs output sparse data. Thus, it is difficult to obtain the pixel-level depth information needed to transform visual features into 3D space. Figure 2 shows a LiDAR point cloud projected into the corresponding camera image.
Fig. 2. Projection of sparse LiDAR point cloud into camera image.
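The projection in Fig. 2 boils down to a rigid-body transform followed by a pinhole projection. The following Python sketch illustrates the idea; it assumes KITTI-style calibration matrices (a 4x4 LiDAR-to-camera transform and a 3x4 rectified projection matrix), and all function and variable names are illustrative rather than taken from the released C++ implementation.

```python
import numpy as np

def lidar_to_depth_map(points_velo, T_cam_velo, P_rect, image_shape):
    """Project a LiDAR point cloud into the camera image and build a sparse
    depth map (simplified sketch; T_cam_velo maps LiDAR to camera coordinates,
    P_rect is the 3x4 projection matrix of the rectified camera)."""
    h, w = image_shape

    # homogeneous LiDAR points, transformed into the camera frame
    pts_h = np.hstack([points_velo[:, :3], np.ones((points_velo.shape[0], 1))])
    pts_cam = (T_cam_velo @ pts_h.T).T                     # N x 4

    # keep only points in front of the camera
    pts_cam = pts_cam[pts_cam[:, 2] > 0.1]

    # perspective projection into pixel coordinates
    uvw = (P_rect @ pts_cam.T).T                           # N x 3
    u = np.round(uvw[:, 0] / uvw[:, 2]).astype(int)
    v = np.round(uvw[:, 1] / uvw[:, 2]).astype(int)
    z = pts_cam[:, 2]                                      # depth along the optical axis

    # discard points that fall outside the image
    valid = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    u, v, z = u[valid], v[valid], z[valid]

    # sparse depth map; if several points hit one pixel, keep the closest
    depth = np.zeros((h, w), dtype=np.float32)
    order = np.argsort(-z)                                 # far points first, near points overwrite
    depth[v[order], u[order]] = z[order]
    return depth
```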
We propose two different methods of depth map generation, which are evaluated and compared below: a conventional CV approach and a deep learning-based method. Multiple experiments with different conventional upsampling methods revealed that an inverse dilation with a 5x5 diamond kernel yields the best results among conventional CV methods. Usually, image dilation is used to increase the thickness of foreground objects; here, we use it to increase the density of the LiDAR point cloud projected into the image frame. The chosen kernel is 5x5 pixels in size and diamond-shaped.
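A minimal OpenCV sketch of this inverse dilation is shown below. The kernel layout matches the 5x5 diamond described above; the function name, the validity threshold, and the inversion constant are illustrative assumptions, not taken from the released code. Inverting the depth before dilating ensures that, where several measurements compete for a neighborhood, the closer (foreground) depth wins.

```python
import cv2
import numpy as np

# 5x5 diamond-shaped structuring element
DIAMOND_KERNEL_5 = np.array(
    [[0, 0, 1, 0, 0],
     [0, 1, 1, 1, 0],
     [1, 1, 1, 1, 1],
     [0, 1, 1, 1, 0],
     [0, 0, 1, 0, 0]], dtype=np.uint8)

def inverse_dilation(sparse_depth, max_depth=100.0):
    """Densify a sparse depth map by dilating inverted depth values.

    Dilation takes the neighborhood maximum; on inverted depth this means the
    smallest (closest) measurement fills the gap. max_depth only needs to
    exceed the sensor range (~80 m for KITTI). Pixels without any nearby
    measurement stay at 0 and remain marked invalid.
    """
    depth = sparse_depth.astype(np.float32).copy()
    valid = depth > 0.1

    # invert: close points get large values, so they dominate the dilation
    depth[valid] = max_depth - depth[valid]
    depth = cv2.dilate(depth, DIAMOND_KERNEL_5)

    # invert back; untouched pixels are still 0 and stay invalid
    filled = depth > 0.1
    depth[filled] = max_depth - depth[filled]
    depth[~filled] = 0.0
    return depth
```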
The unprocessed depth maps from the KITTI Odometry dataset have an average sparsity of around 96 % after projection into the camera frame. By applying the inverse dilation, we can decrease the sparsity to around 65 %. The generated depth maps have an MAE of 1.03 m and an RMSE of 4.41 m. We found that methods with denser outputs tend to yield worse tracking results because the depth errors increase between the individual depth measurements.
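The sparsity and error figures above follow the usual conventions for KITTI depth data; a small sketch of how they can be computed is given below (illustrative names, metrics evaluated only on pixels with valid ground truth).

```python
import numpy as np

def depth_map_stats(pred, gt):
    """Sparsity of a depth map and MAE/RMSE against ground truth.

    Errors are evaluated only on pixels where both prediction and ground
    truth are valid (> 0), the usual convention for KITTI depth data.
    """
    sparsity = 1.0 - np.count_nonzero(pred) / pred.size
    mask = (pred > 0) & (gt > 0)
    err = pred[mask] - gt[mask]
    mae = np.abs(err).mean()
    rmse = np.sqrt((err ** 2).mean())
    return sparsity, mae, rmse
```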
With deep learning methods, we can achieve a density of 100 %, i.e., each pixel has a corresponding depth value. However, those values are not reliable in areas without LiDAR returns, because the KITTI depth dataset does not contain ground truth for those pixels. For the deep learning-based LiDAR depth completion, we use a CNN based on ENet [30], which we have previously optimized for runtime. In our configuration, the network has an average runtime of 16.32 ms, of which 9.66 ms is the actual computation time on the GPU. We achieve an MAE per pixel of 0.39 m and an RMSE of 1.17 m. These values were obtained on the KITTI Depth dataset; values for the Odometry data might differ. However, since the data comes from the same source, it is reasonable to assume that the accuracy on the Odometry data is comparable. A typical high-resolution depth map from the neural network and the corresponding camera image are shown in Figure 3.
Fig. 3. Typical camera image and corresponding depth map generated by the ENet-based neural network.
Depth values in the image regions for which we had no ground truth data when training the network were not used. However, it is hard to find a general rule for which regions to use and which to ignore.
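A simple heuristic, sketched below, is to invalidate such regions before handing the map to the tracker. The 30 % upper-image cut-off corresponds to the value reported in Sec. IV; the additional far-range cut near the 80 m KITTI ground-truth limit and all names are illustrative assumptions rather than part of the released pipeline.

```python
import numpy as np

def mask_unreliable_regions(depth, sky_fraction=0.30, max_valid_depth=79.0):
    """Invalidate depth-map regions where the network has no reliable supervision.

    The upper image region (mostly sky, no LiDAR ground truth) is dropped, as
    reported in Sec. IV. Dropping predictions saturating near the 80 m KITTI
    ground-truth limit is an additional, assumed heuristic.
    """
    depth = depth.copy()
    cut = int(sky_fraction * depth.shape[0])
    depth[:cut, :] = 0.0                      # drop the upper part of the image
    depth[depth > max_valid_depth] = 0.0      # drop saturated far-range predictions (assumption)
    return depth
```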
IV. RESULTS
This section summarizes the results achieved with the presented methods. We compare our results to the ORB-SLAM3 stereo mode in terms of accuracy and, in particular, runtime. The experiments were conducted on the KITTI Odometry dataset [35].
A. Comparison of the Trajectories
Tables I and II show the translational and rotational errors of the mentioned SLAM modes on the eleven KITTI Odometry sequences with ground truth (00-10).
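The reported numbers follow the usual KITTI relative-error convention: translational drift as a percentage of the traveled distance and rotational drift in °/100 m, averaged over trajectory segments. A simplified Python sketch is given below; the official evaluation averages over segment lengths from 100 m to 800 m, whereas this version uses a single segment length, and all names are illustrative.

```python
import numpy as np

def relative_errors(gt_poses, est_poses, seg_length=100.0):
    """Average relative translational (%) and rotational (deg/100 m) error.

    gt_poses and est_poses are lists of 4x4 homogeneous camera-to-world poses.
    Simplified sketch of the KITTI odometry metric for one segment length.
    """
    # cumulative distance traveled along the ground-truth trajectory
    dists = np.zeros(len(gt_poses))
    for i in range(1, len(gt_poses)):
        dists[i] = dists[i - 1] + np.linalg.norm(gt_poses[i][:3, 3] - gt_poses[i - 1][:3, 3])

    t_errs, r_errs = [], []
    for i in range(len(gt_poses)):
        # first frame at least seg_length meters further along the track
        j = int(np.searchsorted(dists, dists[i] + seg_length))
        if j >= len(gt_poses):
            break
        seg = dists[j] - dists[i]

        # relative motion according to ground truth and estimate, and their discrepancy
        rel_gt = np.linalg.inv(gt_poses[i]) @ gt_poses[j]
        rel_est = np.linalg.inv(est_poses[i]) @ est_poses[j]
        err = np.linalg.inv(rel_gt) @ rel_est

        t_errs.append(np.linalg.norm(err[:3, 3]) / seg)                # fraction of distance
        angle = np.arccos(np.clip((np.trace(err[:3, :3]) - 1.0) / 2.0, -1.0, 1.0))
        r_errs.append(np.degrees(angle) / seg * 100.0)                 # deg per 100 m

    return 100.0 * float(np.mean(t_errs)), float(np.mean(r_errs))
```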
Stereo and RGB-L exhibit comparable performance, with a slight advantage for the stereo mode. It is directly apparent that the accuracy of the deep learning method is significantly worse for both translational and rotational errors. One reason might be that it is hard to define, in general, the areas in which the network performs well; to minimize this negative effect, the upper 30 % of the depth maps were ignored. Another downside of the neural network is that the KITTI ground truth only contains values of up to 80 m, so even areas that are significantly further away will show depth values around 80 m. The RGB-L mode only uses depth values that are close to the actual LiDAR measurements and thus generalizes better. Upon further investigation of the accuracy across the different sequences, it becomes apparent that the RGB-L mode performs especially well in low-feature environments, as shown in Figure 4.
B. Runtime Evaluation
Since SLAM is a highly time-critical process that has to run in real time on autonomous vehicles, we put a special focus on the runtime analysis.
Figure 5 compares the total runtimes of the three modes. Both modes that use depth maps from LiDAR show significantly smaller variances. The mean RGB-L runtime is 22.8 ms and is thus more than 40 % lower than the 41.2 ms mean of the stereo mode. The composition of the runtimes is summarized in Table III.
Fig. 4. Typical images of low-feature environments in which RGB-L outperforms the ORB-SLAM3 stereo mode: (a) KITTI sequence 01, frame 248; (b) KITTI sequence 04, frame 152.
TABLE I. AVERAGE TRANSLATIONAL ERROR IN %. OPTIMUM RESULTS ARE HIGHLIGHTED IN GREEN AND THE NEXT BEST IN BLUE

Seq.  Scenery       Stereo  RGB-L  Deep Learning
00    Urban          0.704  0.695  0.721
01    Highway        1.628  1.098  1.698
02    Urban          0.817  0.836  0.852
03    Urban          0.950  1.033  0.726
04    Rural          0.562  0.453  1.121
05    Urban          0.413  0.470  0.840
06    Urban          0.544  0.878  1.182
07    Urban          0.483  0.581  0.862
08    Urban          0.991  1.085  1.284
09    Rural          0.988  0.866  1.092
10    Urban          0.710  0.840  1.388
Avg.                 0.800  0.801  1.070
Since the RGB-D mode requires previously generated depth maps, it should be noted that its partial computation times from offline and online calculations were added together. The conventional CV algorithm has a very low computation time and can therefore significantly accelerate the ORB-SLAM3 algorithm compared to the stereo mode.
The smaller mean computation time, in combination with the smaller variance, could make the presented approach particularly interesting for embedded applications with limited hardware resources. The whole algorithm runs on a CPU without the need for a power-intensive GPU.
V. DISCUSSION AND FUTURE WORK
The presented algorithms were evaluated on the KITTI Odometry dataset. To generate further results and conclusions, an evaluation on other datasets could well be worthwhile. Since the RGB-L mode demonstrated high accuracy in low-feature environments, bad-weather scenarios or partially soiled camera images could also be of special interest. Because the RGB-L mode does not use deep learning methods, good scalability to other data can be assumed. The chosen algorithms do not depend on the exact camera or LiDAR model, the environment, or the dataset, and no expensive ground-truth depth data is needed. Only the camera's intrinsic calibration and the extrinsic calibration between camera and LiDAR need to be adjusted.
TABLE II. AVERAGE ROTATIONAL ERROR IN °/100 M. OPTIMUM RESULTS ARE HIGHLIGHTED IN GREEN AND THE NEXT BEST IN BLUE

Seq.  Scenery       Stereo  RGB-L  Deep Learning
00    Urban          0.272  0.257  0.315
01    Highway        0.201  0.364  0.681
02    (Sub-)Urban    0.275  0.259  0.344
03    Urban          0.235  0.258  0.334
04    Rural          0.251  0.221  0.985
05    Urban          0.158  0.214  0.419
06    Urban          0.224  0.466  0.425
07    Urban          0.271  0.293  0.497
08    Urban          0.303  0.338  0.556
09    Rural          0.299  0.342  0.438
10    Urban          0.343  0.421  0.484
Avg.                 0.244  0.312  0.498
Fig. 5. Runtime comparison of the presented methods.
Further research could also be conducted in the field of deep learning-based depth map generation. For example, including a confidence map to decide which depth values to use in the SLAM algorithm could be highly advantageous. Another problem of the deep learning-based approach is scalability: the depth network has to be retrained for every new dataset to obtain the best results. This is currently one of the most significant issues deep learning approaches are facing and will undoubtedly be one of the major research topics in the coming years.
Integrating the deep learning pipeline, including the GPU, directly into the RGB-L mode could reduce computation time and thus make online use possible.
The proposed method builds upon the existing open-source ORB-SLAM3. We present a simple open-source method that can easily be used by other researchers. However, the sensor fusion is not ideal: ORB-SLAM3 uses the depth values from the depth maps to generate quasi-stereo measurements for handling the depth information. As already pointed out in Section II, there are other ways of combining LiDAR and camera data for localization and mapping; future research on the optimal way to fuse this information remains to be conducted.
VI. CONCLUSION
In this paper, we have proposed a method for fusing LiDAR depth measurements into an existing indirect visual SLAM algorithm, namely ORB-SLAM3 [6]. We compare two methods to generate pixel-wise depth maps for each camera image: for upsampling the LiDAR point cloud, we used an inverse dilation with a 5x5 diamond kernel and a supervised deep learning-based method based on ENet [30]. While the deep learning-based depth maps had to be created offline before running the SLAM, the dilation was integrated directly into ORB-SLAM3 as the RGB-L mode. The code is available as open-source software.
Depending on the sequence of the KITTI Odometry dataset, the RGB-L mode could match the performance of the stereo mode and even outperform it in low-feature environments. The use of deep learning-based depth maps did not prove beneficial in this work and will be the subject of future research. With RGB-L, we were able to reduce the runtime by more than 40 % compared to the stereo mode, while at the same time the runtime variance decreased significantly. This makes our approach particularly interesting for low-power and low-latency applications.
ACKNOWLEDGMENT
As the primary author, Florian Sauerbeck proposed the initial idea for this paper and is responsible for the presented concept and its implementation. Benjamin Obermeier contributed to the generation of the results and the development of the workflow. Martin Rudolph contributed to the development of the conventional LiDAR data upsampling and the integration of the RGB-L mode into ORB-SLAM3. Johannes Betz contributed to the conception of the research project, revised the paper critically for important intellectual content, gave final approval of the version to be published, and agreed to all aspects of the work. We thank the ADAC Stiftung and TUM for funding the underlying research project. We also thank Marius Spitzar, whose master's thesis contributed to the implementation of the depth completion neural network.
TABLE III. RESPONSE TIMES IN MS FOR STEREO MODE, THE DEEP LEARNING-BASED PIPELINE, AND THE PROPOSED RGB-L MODE

Operation          RGB-L (5x5 Dilation)      Deep Learning-based       Stereo
                   min    median  max        min    median  max        min    median  max
Neural Net GPU       -       -      -         9.25    9.66  10.21        -       -      -
Neural Net total     -       -      -        15.38   16.32  18.63        -       -      -
Projection          1.52    2.24   3.61        -       -      -          -       -      -
Upsampling          0.43    0.57   0.83        -       -      -          -       -      -
Tracking           17.23   22.79  30.44      19.14   23.29  29.93      28.21   41.17  66.18
Overall            19.18   25.60  34.88      34.52   39.61  48.56      28.21   41.17  66.17
REFERENCES
[1] T. Bailey and H. Durrant-Whyte, "Simultaneous Localization and Mapping (SLAM): Part II," IEEE Robotics & Automation Magazine, vol. 13, no. 3, 2006.
[2] M. G. Dissanayake, P. Newman, S. Clark, H. F. Durrant-Whyte, and M. Csorba, "A Solution to the Simultaneous Localization and Map Building (SLAM) Problem," IEEE Transactions on Robotics and Automation, vol. 17, no. 3, 2001.
[3] E. Marti, M. A. De Miguel, F. Garcia, and J. Perez, "A Review of Sensor Technologies for Perception in Automated Driving," IEEE Intelligent Transportation Systems Magazine, vol. 11, no. 4, 2019.
[4] R. Mur-Artal, J. M. M. Montiel, and J. D. Tardos, "ORB-SLAM: A Versatile and Accurate Monocular SLAM System," IEEE Transactions on Robotics, vol. 31, no. 5, 2015.
[5] R. Mur-Artal and J. D. Tardos, "ORB-SLAM2: An Open-Source SLAM System for Monocular, Stereo, and RGB-D Cameras," IEEE Transactions on Robotics, vol. 33, no. 5, 2017.
[6] C. Campos, R. Elvira, J. J. G. Rodriguez, J. M. Montiel, and J. D. Tardos, "ORB-SLAM3: An Accurate Open-Source Library for Visual, Visual-Inertial, and Multimap SLAM," IEEE Transactions on Robotics, vol. 37, no. 6, 2021.
[7] J. Engel, T. Schöps, and D. Cremers, "LSD-SLAM: Large-Scale Direct Monocular SLAM," in European Conference on Computer Vision, Springer, 2014.
[8] C. Forster, Z. Zhang, M. Gassner, M. Werlberger, and D. Scaramuzza, "SVO: Semidirect Visual Odometry for Monocular and Multicamera Systems," IEEE Transactions on Robotics, vol. 33, no. 2, 2016.
[9] J. Engel, V. Koltun, and D. Cremers, "Direct Sparse Odometry," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 3, 2017.
[10] R. Wang, M. Schworer, and D. Cremers, "Stereo DSO: Large-Scale Direct Sparse Visual Odometry with Stereo Cameras," in Proceedings of the IEEE International Conference on Computer Vision, 2017.
[11] N. Yin and Y. Ding, "Semi-direct Sparse Odometry with Depth Prior," in 2022 2nd International Conference on Computer, Control and Robotics (ICCCR), IEEE, 2022.
[12] P. Henry, M. Krainin, E. Herbst, X. Ren, and D. Fox, "RGB-D Mapping: Using Kinect-Style Depth Cameras for Dense 3D Modeling of Indoor Environments," The International Journal of Robotics Research, vol. 31, no. 5, 2012.
[13] C. Debeunne and D. Vivet, "A Review of Visual-LiDAR Fusion Based Simultaneous Localization and Mapping," Sensors, vol. 20, no. 7, 2020.
[14] J. Graeter, A. Wilczynski, and M. Lauer, "LIMO: Lidar-Monocular Visual Odometry," in 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), IEEE, 2018.
[15] Y.-S. Shin, Y. S. Park, and A. Kim, "Direct Visual SLAM Using Sparse Depth for Camera-LiDAR System," in 2018 IEEE International Conference on Robotics and Automation (ICRA), IEEE, 2018.
[16] J. Zhang and S. Singh, "Laser-Visual-Inertial Odometry and Mapping with High Robustness and Low Drift," Journal of Field Robotics, vol. 35, no. 8, 2018.
[17] W. Shao, S. Vijayarangan, C. Li, and G. Kantor, "Stereo Visual Inertial LiDAR Simultaneous Localization and Mapping," in 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), IEEE, 2019.
[18] J. Zijlmans, "RGB-D ORB-SLAM2 with a depth map based on lidar data," https://medium.com/@j.zijlmans/rgb-d-orb-slam-with-a-depth-map-based-on-lidar-data-ecac280614e7, 2017. [Online; accessed 10-October-2022].
[19] X. Zuo, P. Geneva, W. Lee, Y. Liu, and G. Huang, "LIC-Fusion: LiDAR-Inertial-Camera Odometry," in 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), IEEE, 2019.
[20] X. Zuo, Y. Yang, P. Geneva, J. Lv, Y. Liu, G. Huang, and M. Pollefeys, "LIC-Fusion 2.0: LiDAR-Inertial-Camera Odometry with Sliding-Window Plane-Feature Tracking," in 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), IEEE, 2020.
[21] C.-C. Chou and C.-F. Chou, "Efficient and Accurate Tightly-Coupled Visual-LiDAR SLAM," IEEE Transactions on Intelligent Transportation Systems, 2021.
[22] J. Lin, C. Zheng, W. Xu, and F. Zhang, "R2LIVE: A Robust, Real-Time, LiDAR-Inertial-Visual Tightly-Coupled State Estimator and Mapping," IEEE Robotics and Automation Letters, vol. 6, no. 4, 2021.
[23] J. Lin and F. Zhang, "R3LIVE++: A Robust, Real-Time, Radiance Reconstruction Package with a Tightly-Coupled LiDAR-Inertial-Visual State Estimator," arXiv preprint arXiv:2209.03666, 2022.
[24] S. Zhao, H. Zhang, P. Wang, L. Nogueira, and S. Scherer, "Super Odometry: IMU-Centric LiDAR-Visual-Inertial Estimator for Challenging Environments," in 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), IEEE, 2021.
[25] C. Zhao, Q. Sun, C. Zhang, Y. Tang, and F. Qian, "Monocular Depth Estimation Based on Deep Learning: An Overview," Science China Technological Sciences, vol. 63, no. 9.
[26] A. Bhoi, "Monocular Depth Estimation: A Survey," arXiv preprint arXiv:1901.09402, 2019.
[27] Y. Ming, X. Meng, C. Fan, and H. Yu, "Deep Learning for Monocular Depth Estimation: A Review," Neurocomputing, vol. 438, 2021.
[28] J. Ku, A. Harakeh, and S. L. Waslander, "In Defense of Classical Image Processing: Fast Depth Completion on the CPU," in 2018 15th Conference on Computer and Robot Vision (CRV), IEEE, 2018.
[29] J. Uhrig, N. Schneider, L. Schneider, U. Franke, T. Brox, and A. Geiger, "Sparsity Invariant CNNs," in International Conference on 3D Vision (3DV), 2017.
[30] M. Hu, S. Wang, B. Li, S. Ning, L. Fan, and X. Gong, "PENet: Towards Precise and Efficient Image Guided Depth Completion," in 2021 IEEE International Conference on Robotics and Automation (ICRA), IEEE, 2021.
[31] S. Zhao, M. Gong, H. Fu, and D. Tao, "Adaptive Context-Aware Multi-Modal Network for Depth Completion," IEEE Transactions on Image Processing, vol. 30, 2021.
[32] Z. Yan, K. Wang, X. Li, Z. Zhang, B. Xu, J. Li, and J. Yang, "RigNet: Repetitive Image Guided Network for Depth Completion," arXiv preprint arXiv:2107.13802, 2021.
[33] A. Masoumian, H. A. Rashwan, J. Cristiano, M. S. Asif, and D. Puig, "Monocular Depth Estimation Using Deep Learning: A Review," Sensors, vol. 22, no. 14, 2022.
[34] J. Hu, C. Bao, M. Ozay, C. Fan, Q. Gao, H. Liu, and T. L. Lam, "Deep Depth Completion: A Survey," arXiv preprint arXiv:2205.05335, 2022.
[35] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, "Vision Meets Robotics: The KITTI Dataset," The International Journal of Robotics Research, vol. 32, no. 11, 2013.