RGB-L: Enhancing Indirect Visual SLAM Using LiDAR-Based Dense Depth Maps
Florian Sauerbeck
Institute of Automotive Technology
Technical University of Munich
85748 Garching, Germany
florian.sauerbeck@tum.de
Benjamin Obermeier
Institute of Automotive Technology
Technical University of Munich
85748 Garching, Germany
benjamin.obermeier@tum.de
Johannes Betz
Autonomous Vehicle Systems Lab
Technical University of Munich
85748 Garching, Germany
johannes.betz@tum.de
Martin Rudolph
Institute of Automotive Technology
Technical University of Munich
85748 Garching, Germany
martin.rudolph@tum.de
Abstract—In this paper, we present a novel method for integrating 3D LiDAR depth measurements into the existing ORB-SLAM3 by building upon its RGB-D mode. We propose and compare two methods of depth map generation: a conventional computer vision method, namely an inverse dilation operation, and a supervised deep learning-based approach. We integrate the former directly into the ORB-SLAM3 framework by adding a so-called RGB-L (LiDAR) mode that directly reads LiDAR point clouds. The proposed methods are evaluated on the KITTI Odometry dataset and compared to each other and to the standard ORB-SLAM3 stereo mode. We demonstrate that, depending on the environment, advantages in trajectory accuracy and robustness can be achieved. Furthermore, we demonstrate that the runtime of the ORB-SLAM3 algorithm can be reduced by more than 40 % compared to the stereo mode. The related code for the ORB-SLAM3 RGB-L mode is available as open-source software at https://github.com/TUMFTM/ORB_SLAM3_RGBL.
Keywords—autonomous vehicles, simultaneous localization and mapping (SLAM), sensor fusion
I. INTRODUCTION
Robust and precise localization is needed for all modules of autonomous vehicle software, such as path planning, trajectory following, or object prediction. To enable full self-driving capabilities, relying solely on satellite-based global localization systems such as the Global Positioning System (GPS) is not feasible: such sensors are expensive, and sensor dropouts can lead to subsequent system failures. Therefore, simultaneous localization and mapping (SLAM) algorithms are important in developing robots and autonomous vehicles [1], [2]. These algorithms can provide a precise robot pose using only sensor measurements of the environment and therefore constitute an important localization technique. To achieve robust SLAM, it is necessary to include environmental information from various sensors such as LiDARs and cameras. Since every sensor has its own individual advantages and disadvantages, sensor fusion can greatly benefit the localization of autonomous vehicles [3].
This paper presents an enhancement to the well-known ORB-SLAM3 [6] algorithm. Currently, this algorithm relies solely on camera data to calculate the robot's pose. The goal is to enhance this algorithm with LiDAR depth measurements to achieve higher localization accuracy and better robustness, e.g., in urban environments. To summarize, this paper comprises four main contributions:
- We present a method to integrate LiDAR depth measurements into the existing ORB-SLAM3 algorithm.
- We propose and compare two methods of dense depth map generation from LiDAR point clouds.
- We present a variety of experiments for the localization of an autonomous vehicle that demonstrate the improved accuracy and robustness of our method.
- We compare the runtimes and show a decrease of more than 40 % compared to stereo mode.
II. RELATED WORK
Several approaches for fusing multimodal sensor data in SLAM algorithms have been presented in the past; they are summarized below.
A. Visual SLAM
In general, visual SLAM methods can be divided into two subcategories: feature-based (indirect) SLAM and direct SLAM. Indirect SLAM detects and matches features across multiple image frames and minimizes the geometric error. The most widely used indirect SLAM systems belong to the ORB-SLAM family [4]-[6].
In contrast, direct approaches minimize the photometric error directly. The best-known direct SLAM algorithms are LSD-SLAM (Large-Scale Direct Monocular SLAM) [7], SVO (Semi-direct Visual Odometry) [8], and DSO (Direct Sparse Odometry) [9], [10].
B. RGB-D SLAM
RGB-D (depth) cameras provide an infrared-based depth measurement, so the SLAM algorithm does not have to reconstruct 3D information purely from 2D images. An enhancement of DSO with RGB-D measurements was presented in [11]. The main disadvantages of RGB-D cameras are their limited range of only a few meters and their strong dependency on lighting conditions [12], which makes them unusable for automotive applications.
C. Visual-LiDAR SLAM
Debeunne et al. [13] gave a detailed overview of the state of the art in visual-LiDAR SLAM. Generally, the authors distinguished between concurrent approaches, LiDAR-assisted visual SLAM, camera-assisted LiDAR SLAM, and approaches with tightly coupled sensor fusion.
1) LiDAR-Assisted Visual SLAM: LIMO (LiDAR-monocular visual odometry) [14] uses depth information from LiDAR to fit local planes and thereby obtain depth for features detected in the camera images; moreover, semantic information is included. Shin et al. [15] combined a direct visual SLAM approach with sparse depth data from LiDARs. They used a sliding window-based optimization and included loop closure functionality, obtaining robust SLAM results on different large-scale datasets.
VLOAM (Visual-LiDAR Odometry And Mapping) [16] uses LiDAR data to obtain the 3D information of the visual features. Pose estimates from visual feature matching and LiDAR scan matching are combined in a graph optimization. VIL-SLAM, presented by Shao et al. in 2019 [17], combines a visual-inertial odometry (VIO) frontend with LiDAR mapping and LiDAR-enhanced loop closure. Both algorithms demonstrated state-of-the-art precision under normal conditions with improved robustness against adverse conditions. The idea of combining ORB-SLAM with LiDAR data had previously been presented in [18]; however, no alternative upsampling methods were presented, and a scientific analysis and discussion of the results were missing.
2) Tightly Coupled Visual-LiDAR SLAM: Recently, research has focused on SLAM algorithms that tightly fuse multimodal sensor data. The LIC-Fusion method (LiDAR-inertial-camera) [19] was presented in 2019. It fuses LiDAR features, camera features, and high-frequency IMU data for improved localization accuracy and robustness; to this end, it incorporates online spatiotemporal calibration. Its successor, LIC-Fusion 2.0 [20], added sliding-window plane-feature tracking to further improve performance.
Chou et al. [21] presented another tightly coupled visual-LiDAR SLAM: TVL-SLAM. It has separate frontends for LiDAR and camera (ORB-SLAM2-based), which are fused in a shared graph optimization backend with LiDAR loop closure. TVL-SLAM showed improvements over purely visual and purely LiDAR SLAM implementations.
R2LIVE [22] uses an error-state Kalman filter in combination with factor-graph optimization to fuse LiDAR, IMU, and visual measurements. R3LIVE [23] added a colored 3D map by combining VIO (visual-inertial odometry) and LIO (LiDAR-inertial odometry) subsystems; moreover, online photometric calibration was added to the system. Both algorithms are robust enough to handle various camera failures or LiDAR-degenerated scenarios and can run in real time on small onboard platforms.
Zhao et al. proposed Super Odometry [24], an IMU-centric robust state estimation pipeline. They showed convincing results, especially for perceptually degraded environments.
D. Generation of Depth Maps
Our proposed method uses dense depth maps to obtain pixel-wise depth information for the corresponding monocular camera images. Different approaches exist for the generation of such maps. Currently, a major area of research is monocular depth estimation without depth measurements; the current state of the art is presented in [25]-[27].
In this study, we focus on depth completion of sparse depth measurements, namely LiDAR point clouds. For completing sparse depth data with camera images, a distinction is made between two basic approaches: conventional computer vision (CV) and deep learning-based approaches. The approach of Ku et al. [28] belongs to the former. Their work showed that well-designed classical CV algorithms can outperform deep learning methods under certain circumstances. A significant advantage is that their algorithms require very few computational resources and can easily run on a CPU. Moreover, no training is needed, so overfitting is not possible.
By contrast, deep learning-based algorithms dominate established depth completion benchmarks such as the KITTI depth completion benchmark [29]. Some of the best-performing and most widely used networks for LiDAR depth completion are PENet and ENet [30], ACMNet [31], and RigNet [32]. Further detailed information on existing approaches and the current state of the art can be found in dedicated literature reviews [33], [34].
III. APPROACH
The overall approach of this work is presented in Figure 1. Each camera image is handed over to the SLAM interface, while the LiDAR depth measurements (point cloud data, pcd) are transformed into the camera coordinate frame and upsampled.
A. ORB-SLAM3 Integration
As the baseline indirect visual SLAM algorithm, we use ORB-SLAM3 [6]. An RGB-D mode is already integrated and provides the option of directly feeding a depth map for each corresponding camera image. This allows us to use the RGB-D mode with previously generated and saved depth maps, as depicted in Figure 1a. Moreover, we introduce the RGB-L mode shown in Figure 1b: its depth module transforms the LiDAR data into the camera frame, upsamples it online using conventional CV methods, and feeds the result to the SLAM module.
Fig. 1. Setup of ORB-SLAM3 for working with LiDAR-based dense depth maps: (a) standard RGB-D mode with external depth map generation; (b) novel RGB-L mode with integrated depth module.
B. Generation of Depth Maps
Compared to cameras, LiDARs output sparse data. Thus, it is difficult to obtain the pixel-level depth information needed to transform visual features into 3D space. Figure 2 shows a LiDAR point cloud projected into the corresponding camera image.
Fig. 2. Projection of sparse LiDAR point cloud into camera image.
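The projection in Fig. 2 boils down to a rigid-body transform followed by a pinhole projection. The following Python sketch illustrates the idea; it assumes KITTI-style calibration matrices (a 4x4 LiDAR-to-camera transform and a 3x4 rectified projection matrix), and all function and variable names are illustrative rather than taken from the released C++ implementation.

```python
import numpy as np

def lidar_to_depth_map(points_velo, T_cam_velo, P_rect, image_shape):
    """Project a LiDAR point cloud into the camera image and build a sparse
    depth map (simplified sketch; T_cam_velo maps LiDAR to camera coordinates,
    P_rect is the 3x4 projection matrix of the rectified camera)."""
    h, w = image_shape

    # homogeneous LiDAR points, transformed into the camera frame
    pts_h = np.hstack([points_velo[:, :3], np.ones((points_velo.shape[0], 1))])
    pts_cam = (T_cam_velo @ pts_h.T).T                     # N x 4

    # keep only points in front of the camera
    pts_cam = pts_cam[pts_cam[:, 2] > 0.1]

    # perspective projection into pixel coordinates
    uvw = (P_rect @ pts_cam.T).T                           # N x 3
    u = np.round(uvw[:, 0] / uvw[:, 2]).astype(int)
    v = np.round(uvw[:, 1] / uvw[:, 2]).astype(int)
    z = pts_cam[:, 2]                                      # depth along the optical axis

    # discard points that fall outside the image
    valid = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    u, v, z = u[valid], v[valid], z[valid]

    # sparse depth map; if several points hit one pixel, keep the closest
    depth = np.zeros((h, w), dtype=np.float32)
    order = np.argsort(-z)                                 # far points first, near points overwrite
    depth[v[order], u[order]] = z[order]
    return depth
```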
We propose two different methods of depth map generation, which are evaluated and compared below: a conventional CV approach and a deep learning-based method. Multiple experiments with different conventional upsampling methods revealed that an inverse dilation with a 5x5 diamond kernel yields the best results among conventional CV methods. Usually, image dilation is used to increase the thickness of foreground objects; here, we use it to increase the density of the LiDAR point cloud projected into the image frame. The chosen kernel is 5x5 pixels in size and diamond-shaped.
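A minimal OpenCV sketch of this inverse dilation is shown below. The kernel layout matches the 5x5 diamond described above; the function name, the validity threshold, and the inversion constant are illustrative assumptions, not taken from the released code. Inverting the depth before dilating ensures that, where several measurements compete for a neighborhood, the closer (foreground) depth wins.

```python
import cv2
import numpy as np

# 5x5 diamond-shaped structuring element
DIAMOND_KERNEL_5 = np.array(
    [[0, 0, 1, 0, 0],
     [0, 1, 1, 1, 0],
     [1, 1, 1, 1, 1],
     [0, 1, 1, 1, 0],
     [0, 0, 1, 0, 0]], dtype=np.uint8)

def inverse_dilation(sparse_depth, max_depth=100.0):
    """Densify a sparse depth map by dilating inverted depth values.

    Dilation takes the neighborhood maximum; on inverted depth this means the
    smallest (closest) measurement fills the gap. max_depth only needs to
    exceed the sensor range (~80 m for KITTI). Pixels without any nearby
    measurement stay at 0 and remain marked invalid.
    """
    depth = sparse_depth.astype(np.float32).copy()
    valid = depth > 0.1

    # invert: close points get large values, so they dominate the dilation
    depth[valid] = max_depth - depth[valid]
    depth = cv2.dilate(depth, DIAMOND_KERNEL_5)

    # invert back; untouched pixels are still 0 and stay invalid
    filled = depth > 0.1
    depth[filled] = max_depth - depth[filled]
    depth[~filled] = 0.0
    return depth
```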
The unprocessed depth maps from the KITTI Odometry dataset have an average sparsity of around 96 % after projection into the camera frame. By applying the inverse dilation, we can decrease the sparsity to around 65 %. The generated depth maps have an MAE of 1.03 m and an RMSE of 4.41 m. We found that methods with denser outputs tend to yield worse tracking results because the depth errors increase between the individual depth measurements.
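The sparsity and error figures above follow the usual conventions for KITTI depth data; a small sketch of how they can be computed is given below (illustrative names, metrics evaluated only on pixels with valid ground truth).

```python
import numpy as np

def depth_map_stats(pred, gt):
    """Sparsity of a depth map and MAE/RMSE against ground truth.

    Errors are evaluated only on pixels where both prediction and ground
    truth are valid (> 0), the usual convention for KITTI depth data.
    """
    sparsity = 1.0 - np.count_nonzero(pred) / pred.size
    mask = (pred > 0) & (gt > 0)
    err = pred[mask] - gt[mask]
    mae = np.abs(err).mean()
    rmse = np.sqrt((err ** 2).mean())
    return sparsity, mae, rmse
```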
With deep learning methods, we can achieve a density of 100 %, i.e., each pixel has a corresponding depth value. However, those values are not reliable in areas without LiDAR returns, because the KITTI depth dataset does not contain ground truth for those pixels. For the deep learning-based LiDAR depth completion, we use a CNN based on ENet [30], which we have previously optimized for runtime. In our configuration, the network has an average runtime of 16.32 ms, of which 9.66 ms is the actual computation time on the GPU. We achieve an MAE per pixel of 0.39 m and an RMSE of 1.17 m. These values were obtained on the KITTI Depth dataset; values for the Odometry data might differ. However, since the data comes from the same source, it is reasonable to assume that the accuracy on the Odometry data is comparable. A typical high-resolution depth map from the neural network and the corresponding camera image are shown in Figure 3.
Fig. 3. Typical camera image and corresponding depth map generated by the ENet-based neural network.
Depth values in the image regions for which we had no ground truth data when training the network were not used. However, it is hard to find a general rule for which regions to use and which to ignore.
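A simple heuristic, sketched below, is to invalidate such regions before handing the map to the tracker. The 30 % upper-image cut-off corresponds to the value reported in Sec. IV; the additional far-range cut near the 80 m KITTI ground-truth limit and all names are illustrative assumptions rather than part of the released pipeline.

```python
import numpy as np

def mask_unreliable_regions(depth, sky_fraction=0.30, max_valid_depth=79.0):
    """Invalidate depth-map regions where the network has no reliable supervision.

    The upper image region (mostly sky, no LiDAR ground truth) is dropped, as
    reported in Sec. IV. Dropping predictions saturating near the 80 m KITTI
    ground-truth limit is an additional, assumed heuristic.
    """
    depth = depth.copy()
    cut = int(sky_fraction * depth.shape[0])
    depth[:cut, :] = 0.0                      # drop the upper part of the image
    depth[depth > max_valid_depth] = 0.0      # drop saturated far-range predictions (assumption)
    return depth
```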
IV. RESULTS
This section summarizes the results achieved with the presented methods. We compare our results to the ORB-SLAM3 stereo mode in terms of accuracy and, in particular, runtime. The experiments were conducted on the KITTI Odometry dataset [35].
A. Comparison of the Trajectories
Tables I and II show the translational and rotational errors of the mentioned SLAM modes on the eleven KITTI Odometry sequences with ground truth (00-10).
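The reported numbers follow the usual KITTI relative-error convention: translational drift as a percentage of the traveled distance and rotational drift in °/100 m, averaged over trajectory segments. A simplified Python sketch is given below; the official evaluation averages over segment lengths from 100 m to 800 m, whereas this version uses a single segment length, and all names are illustrative.

```python
import numpy as np

def relative_errors(gt_poses, est_poses, seg_length=100.0):
    """Average relative translational (%) and rotational (deg/100 m) error.

    gt_poses and est_poses are lists of 4x4 homogeneous camera-to-world poses.
    Simplified sketch of the KITTI odometry metric for one segment length.
    """
    # cumulative distance traveled along the ground-truth trajectory
    dists = np.zeros(len(gt_poses))
    for i in range(1, len(gt_poses)):
        dists[i] = dists[i - 1] + np.linalg.norm(gt_poses[i][:3, 3] - gt_poses[i - 1][:3, 3])

    t_errs, r_errs = [], []
    for i in range(len(gt_poses)):
        # first frame at least seg_length meters further along the track
        j = int(np.searchsorted(dists, dists[i] + seg_length))
        if j >= len(gt_poses):
            break
        seg = dists[j] - dists[i]

        # relative motion according to ground truth and estimate, and their discrepancy
        rel_gt = np.linalg.inv(gt_poses[i]) @ gt_poses[j]
        rel_est = np.linalg.inv(est_poses[i]) @ est_poses[j]
        err = np.linalg.inv(rel_gt) @ rel_est

        t_errs.append(np.linalg.norm(err[:3, 3]) / seg)                # fraction of distance
        angle = np.arccos(np.clip((np.trace(err[:3, :3]) - 1.0) / 2.0, -1.0, 1.0))
        r_errs.append(np.degrees(angle) / seg * 100.0)                 # deg per 100 m

    return 100.0 * float(np.mean(t_errs)), float(np.mean(r_errs))
```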
Stereo and RGB-L exhibit comparable performance, with a slight advantage for the stereo mode. It is directly apparent that the accuracy of the deep learning method is significantly worse for both translational and rotational errors. One reason might be that it is hard to define, in general, the areas in which the network performs well; to minimize this negative effect, the upper 30 % of the depth maps were ignored. Another downside of the neural network is that the KITTI ground truth only contains values of up to 80 m, so even areas that are significantly further away will show depth values around 80 m. The RGB-L mode only uses depth values that are close to the actual LiDAR measurements and thus generalizes better. Upon further investigation of the accuracy across the different sequences, it becomes apparent that the RGB-L mode performs especially well in low-feature environments, as shown in Figure 4.
B. Runtime Evaluation
Since SLAM is a highly time-critical process that has to run in real time on autonomous vehicles, we put a special focus on the runtime analysis.
Figure 5 compares the total runtimes of the three modes. Both modes that use depth maps from LiDAR show significantly smaller variances. The mean RGB-L runtime is 22.8 ms and is thus more than 40 % lower than the 41.2 ms mean of the stereo mode. The composition of the runtimes is summarized in Table III.
Fig. 4. Typical images of low-feature environments in which RGB-L outperforms the ORB-SLAM3 stereo mode: (a) KITTI sequence 01, frame 248; (b) KITTI sequence 04, frame 152.
TABLE I. AVERAGE TRANSLATIONAL ERROR IN %. OPTIMUM RESULTS ARE HIGHLIGHTED IN GREEN AND THE NEXT BEST IN BLUE

Seq.  Scenery       Stereo  RGB-L  Deep Learning
00    Urban          0.704  0.695  0.721
01    Highway        1.628  1.098  1.698
02    Urban          0.817  0.836  0.852
03    Urban          0.950  1.033  0.726
04    Rural          0.562  0.453  1.121
05    Urban          0.413  0.470  0.840
06    Urban          0.544  0.878  1.182
07    Urban          0.483  0.581  0.862
08    Urban          0.991  1.085  1.284
09    Rural          0.988  0.866  1.092
10    Urban          0.710  0.840  1.388
Avg.                 0.800  0.801  1.070
Since the RGB-D mode requires previously generated depth maps, it should be noted that its partial computation times from offline and online calculations were added together. The conventional CV algorithm has a very low computation time and can therefore significantly accelerate the ORB-SLAM3 algorithm compared to the stereo mode.
The smaller mean computation time, in combination with the smaller variance, could make the presented approach particularly interesting for embedded applications with limited hardware resources. The whole algorithm runs on a CPU without the need for a power-intensive GPU.
V. DISCUSSION AND FUTURE WORK
The presented algorithms were evaluated on the KITTI Odometry dataset. To generate further results and conclusions, an evaluation on other datasets could well be worthwhile. Since the RGB-L mode demonstrated high accuracy in low-feature environments, bad-weather scenarios or partially soiled camera images could also be of special interest. Because the RGB-L mode does not use deep learning methods, good scalability to other data can be assumed. The chosen algorithms do not depend on the exact camera or LiDAR model, the environment, or the dataset, and no expensive ground-truth depth data is needed. Only the camera's intrinsic calibration and the extrinsic calibration between camera and LiDAR need to be adjusted.
TABLE II. AVERAGE ROTATIONAL ERROR IN °/100 M. OPTIMUM RESULTS ARE HIGHLIGHTED IN GREEN AND THE NEXT BEST IN BLUE

Seq.  Scenery       Stereo  RGB-L  Deep Learning
00    Urban          0.272  0.257  0.315
01    Highway        0.201  0.364  0.681
02    (Sub-)Urban    0.275  0.259  0.344
03    Urban          0.235  0.258  0.334
04    Rural          0.251  0.221  0.985
05    Urban          0.158  0.214  0.419
06    Urban          0.224  0.466  0.425
07    Urban          0.271  0.293  0.497
08    Urban          0.303  0.338  0.556
09    Rural          0.299  0.342  0.438
10    Urban          0.343  0.421  0.484
Avg.                 0.244  0.312  0.498
Fig. 5. Runtime comparison of the presented methods.
Further research could also be conducted in the field of deep learning-based depth map generation. For example, including a confidence map to decide which depth values to use in the SLAM algorithm could be highly advantageous. Another problem of the deep learning-based approach is scalability: the depth network has to be retrained for every new dataset to obtain the best results. This is currently one of the most significant issues deep learning approaches are facing and will undoubtedly be one of the major research topics in the coming years.
Integrating the deep learning pipeline, including the GPU, directly into the RGB-L mode could reduce computation time and thus make online use possible.
The proposed method builds upon the existing open-source ORB-SLAM3. We present a simple open-source method that can easily be used by other researchers. However, the sensor fusion is not ideal: ORB-SLAM3 uses the depth values from the depth maps to generate quasi-stereo measurements for handling the depth information. As already pointed out in Section II, there are other ways of combining LiDAR and camera data for localization and mapping; future research on the optimal way to fuse this information remains to be conducted.
VI. CONCLUSION
In this paper, we have proposed a method for fusing LiDAR depth measurements into an existing indirect visual SLAM algorithm, namely ORB-SLAM3 [6]. We compare two methods to generate pixel-wise depth maps for each camera image: for upsampling the LiDAR point cloud, we used an inverse dilation with a 5x5 diamond kernel and a supervised deep learning-based method based on ENet [30]. While the deep learning-based depth maps had to be created offline before running the SLAM, the dilation was integrated directly into ORB-SLAM3 as the RGB-L mode. The code is available as open-source software.
Depending on the sequence of the KITTI Odometry dataset, the RGB-L mode could match the performance of the stereo mode and even outperform it in low-feature environments. The use of deep learning-based depth maps did not prove beneficial in this work and will be the subject of future research. With RGB-L, we were able to reduce the runtime by more than 40 % compared to the stereo mode, while at the same time the runtime variance decreased significantly. This makes our approach particularly interesting for low-power and low-latency applications.
ACKNOWLEDGMENT
As the primary author, Florian Sauerbeck proposed the initial idea for this paper and is responsible for the presented concept and its implementation. Benjamin Obermeier contributed to the generation of the results and the development of the workflow. Martin Rudolph contributed to the development of the conventional LiDAR data upsampling and the integration of the RGB-L mode into ORB-SLAM3. Johannes Betz contributed to the conception of the research project, revised the paper critically for important intellectual content, gave final approval of the version to be published, and agreed to all aspects of the work. We thank the ADAC Stiftung and TUM for funding the underlying research project. We also thank Marius Spitzar, whose master's thesis contributed to the implementation of the depth completion neural network.
TABLE III. RESPONSE TIMES IN MS FOR STEREO MODE, THE DEEP LEARNING-BASED PIPELINE, AND THE PROPOSED RGB-L MODE

Operation          RGB-L (5x5 Dilation)      Deep Learning-based       Stereo
                   min    median  max        min    median  max        min    median  max
Neural Net GPU       -       -      -         9.25    9.66  10.21        -       -      -
Neural Net total     -       -      -        15.38   16.32  18.63        -       -      -
Projection          1.52    2.24   3.61        -       -      -          -       -      -
Upsampling          0.43    0.57   0.83        -       -      -          -       -      -
Tracking           17.23   22.79  30.44      19.14   23.29  29.93      28.21   41.17  66.18
Overall            19.18   25.60  34.88      34.52   39.61  48.56      28.21   41.17  66.17
REFERENCES
[1] T. Bailey and H. Durrant-Whyte, "Simultaneous Localization and Mapping (SLAM): Part II," IEEE Robotics & Automation Magazine, vol. 13, no. 3, 2006.
[2] M. G. Dissanayake, P. Newman, S. Clark, H. F. Durrant-Whyte, and M. Csorba, "A Solution to the Simultaneous Localization and Map Building (SLAM) Problem," IEEE Transactions on Robotics and Automation, vol. 17, no. 3, 2001.
[3] E. Marti, M. A. De Miguel, F. Garcia, and J. Perez, "A Review of Sensor Technologies for Perception in Automated Driving," IEEE Intelligent Transportation Systems Magazine, vol. 11, no. 4, 2019.
[4] R. Mur-Artal, J. M. M. Montiel, and J. D. Tardos, "ORB-SLAM: A Versatile and Accurate Monocular SLAM System," IEEE Transactions on Robotics, vol. 31, no. 5, 2015.
[5] R. Mur-Artal and J. D. Tardos, "ORB-SLAM2: An Open-Source SLAM System for Monocular, Stereo, and RGB-D Cameras," IEEE Transactions on Robotics, vol. 33, no. 5, 2017.
[6] C. Campos, R. Elvira, J. J. G. Rodriguez, J. M. Montiel, and J. D. Tardos, "ORB-SLAM3: An Accurate Open-Source Library for Visual, Visual-Inertial, and Multimap SLAM," IEEE Transactions on Robotics, vol. 37, no. 6, 2021.
[7] J. Engel, T. Schöps, and D. Cremers, "LSD-SLAM: Large-Scale Direct Monocular SLAM," in European Conference on Computer Vision, Springer, 2014.
[8] C. Forster, Z. Zhang, M. Gassner, M. Werlberger, and D. Scaramuzza, "SVO: Semidirect Visual Odometry for Monocular and Multicamera Systems," IEEE Transactions on Robotics, vol. 33, no. 2, 2016.
[9] J. Engel, V. Koltun, and D. Cremers, "Direct Sparse Odometry," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 3, 2017.
[10] R. Wang, M. Schworer, and D. Cremers, "Stereo DSO: Large-Scale Direct Sparse Visual Odometry with Stereo Cameras," in Proceedings of the IEEE International Conference on Computer Vision, 2017.
[11] N. Yin and Y. Ding, "Semi-direct Sparse Odometry with Depth Prior," in 2022 2nd International Conference on Computer, Control and Robotics (ICCCR), IEEE, 2022.
[12] P. Henry, M. Krainin, E. Herbst, X. Ren, and D. Fox, "RGB-D Mapping: Using Kinect-Style Depth Cameras for Dense 3D Modeling of Indoor Environments," The International Journal of Robotics Research, vol. 31, no. 5, 2012.
[13] C. Debeunne and D. Vivet, "A Review of Visual-LiDAR Fusion Based Simultaneous Localization and Mapping," Sensors, vol. 20, no. 7, 2020.
[14] J. Graeter, A. Wilczynski, and M. Lauer, "LIMO: Lidar-Monocular Visual Odometry," in 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), IEEE, 2018.
[15] Y.-S. Shin, Y. S. Park, and A. Kim, "Direct Visual SLAM Using Sparse Depth for Camera-LiDAR System," in 2018 IEEE International Conference on Robotics and Automation (ICRA), IEEE, 2018.
[16] J. Zhang and S. Singh, "Laser-Visual-Inertial Odometry and Mapping with High Robustness and Low Drift," Journal of Field Robotics, vol. 35, no. 8, 2018.
[17] W. Shao, S. Vijayarangan, C. Li, and G. Kantor, "Stereo Visual Inertial LiDAR Simultaneous Localization and Mapping," in 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), IEEE, 2019.
[18] J. Zijlmans, "RGB-D ORB-SLAM2 with a depth map based on lidar data," https://medium.com/@j.zijlmans/rgb-d-orb-slam-with-a-depth-map-based-on-lidar-data-ecac280614e7, 2017. [Online; accessed 10-October-2022].
[19] X. Zuo, P. Geneva, W. Lee, Y. Liu, and G. Huang, "LIC-Fusion: LiDAR-Inertial-Camera Odometry," in 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), IEEE, 2019.
[20] X. Zuo, Y. Yang, P. Geneva, J. Lv, Y. Liu, G. Huang, and M. Pollefeys, "LIC-Fusion 2.0: LiDAR-Inertial-Camera Odometry with Sliding-Window Plane-Feature Tracking," in 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), IEEE, 2020.
[21] C.-C. Chou and C.-F. Chou, "Efficient and Accurate Tightly-Coupled Visual-LiDAR SLAM," IEEE Transactions on Intelligent Transportation Systems, 2021.
[22] J. Lin, C. Zheng, W. Xu, and F. Zhang, "R2LIVE: A Robust, Real-Time, LiDAR-Inertial-Visual Tightly-Coupled State Estimator and Mapping," IEEE Robotics and Automation Letters, vol. 6, no. 4, 2021.
[23] J. Lin and F. Zhang, "R3LIVE++: A Robust, Real-Time, Radiance Reconstruction Package with a Tightly-Coupled LiDAR-Inertial-Visual State Estimator," arXiv preprint arXiv:2209.03666, 2022.
[24] S. Zhao, H. Zhang, P. Wang, L. Nogueira, and S. Scherer, "Super Odometry: IMU-Centric LiDAR-Visual-Inertial Estimator for Challenging Environments," in 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), IEEE, 2021.
[25] C. Zhao, Q. Sun, C. Zhang, Y. Tang, and F. Qian, "Monocular Depth Estimation Based on Deep Learning: An Overview," Science China Technological Sciences, vol. 63, no. 9.
[26] A. Bhoi, "Monocular Depth Estimation: A Survey," arXiv preprint arXiv:1901.09402, 2019.
[27] Y. Ming, X. Meng, C. Fan, and H. Yu, "Deep Learning for Monocular Depth Estimation: A Review," Neurocomputing, vol. 438, 2021.
[28] J. Ku, A. Harakeh, and S. L. Waslander, "In Defense of Classical Image Processing: Fast Depth Completion on the CPU," in 2018 15th Conference on Computer and Robot Vision (CRV), IEEE, 2018.
[29] J. Uhrig, N. Schneider, L. Schneider, U. Franke, T. Brox, and A. Geiger, "Sparsity Invariant CNNs," in International Conference on 3D Vision (3DV), 2017.
[30] M. Hu, S. Wang, B. Li, S. Ning, L. Fan, and X. Gong, "PENet: Towards Precise and Efficient Image Guided Depth Completion," in 2021 IEEE International Conference on Robotics and Automation (ICRA), IEEE, 2021.
[31] S. Zhao, M. Gong, H. Fu, and D. Tao, "Adaptive Context-Aware Multi-Modal Network for Depth Completion," IEEE Transactions on Image Processing, vol. 30, 2021.
[32] Z. Yan, K. Wang, X. Li, Z. Zhang, B. Xu, J. Li, and J. Yang, "RigNet: Repetitive Image Guided Network for Depth Completion," arXiv preprint arXiv:2107.13802, 2021.
[33] A. Masoumian, H. A. Rashwan, J. Cristiano, M. S. Asif, and D. Puig, "Monocular Depth Estimation Using Deep Learning: A Review," Sensors, vol. 22, no. 14, 2022.
[34] J. Hu, C. Bao, M. Ozay, C. Fan, Q. Gao, H. Liu, and T. L. Lam, "Deep Depth Completion: A Survey," arXiv preprint arXiv:2205.05335, 2022.
[35] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, "Vision Meets Robotics: The KITTI Dataset," The International Journal of Robotics Research, vol. 32, no. 11, 2013.