Donghwan Lee's research while affiliated with Naver and other places

What is this page?


This page lists the scientific contributions of an author who either does not have a ResearchGate profile or has not yet added these contributions to their profile.

It was automatically generated by ResearchGate to provide a record of this author's body of work. We create such pages to advance our goal of creating and maintaining the most comprehensive scientific repository possible. In doing so, we process publicly available (personal) data relating to the author as a member of the scientific community.


Publications (17)


TMO: Textured Mesh Acquisition of Objects with a Mobile Device by using Differentiable Rendering
  • Conference Paper

June 2023 · 10 Reads
Dongki Jung · Taejae Lee · [...] · Donghwan Lee

TMO: Textured Mesh Acquisition of Objects with a Mobile Device by using Differentiable Rendering

March 2023 · 23 Reads

We present a new pipeline for acquiring a textured mesh in the wild with a single smartphone, which offers access to images, depth maps, and valid poses. Our method first introduces RGBD-aided structure from motion, which yields filtered depth maps and refines camera poses guided by the corresponding depth. We then adopt a neural implicit surface reconstruction method, which produces high-quality meshes, and develop a new training process that applies regularization provided by classical multi-view stereo methods. Moreover, we apply differentiable rendering to fine-tune incomplete texture maps and generate textures that are perceptually closer to the original scene. Our pipeline can be applied to common objects in the real world without the need for either in-the-lab environments or accurate mask images. We demonstrate results on captured objects with complex shapes and validate our method numerically against existing 3D reconstruction and texture mapping methods.
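The texture fine-tuning stage lends itself to a short illustration. Below is a minimal sketch, in PyTorch, of optimizing a texture map through a differentiable rendering step: the "renderer" here is a stand-in (a fixed per-pixel UV lookup via grid_sample, standing in for rasterizing the mesh under a known camera pose), and all tensor shapes, names, and data are illustrative assumptions rather than the paper's actual pipeline.

```python
# Minimal sketch of texture fine-tuning via differentiable rendering.
# The renderer is a stand-in: it samples the texture with fixed per-pixel
# UV coordinates, as a real rasterizer would after projecting the mesh.
# Shapes, names, and data are illustrative, not the paper's code.
import torch
import torch.nn.functional as F

H, W, TEX = 128, 128, 256            # render size, texture resolution
texture = torch.rand(1, 3, TEX, TEX, requires_grad=True)

# Fixed UV lookup per output pixel (in a real pipeline this comes from
# rasterizing the mesh under the camera pose); values in [-1, 1].
uv = torch.rand(1, H, W, 2) * 2 - 1

target = torch.rand(1, 3, H, W)      # captured photo (synthetic here)

opt = torch.optim.Adam([texture], lr=1e-2)
for step in range(200):
    opt.zero_grad()
    rendered = F.grid_sample(texture, uv, align_corners=True)
    loss = F.l1_loss(rendered, target)   # photometric loss vs. the photo
    loss.backward()                      # gradients flow into the texture
    opt.step()
```

Because the sampling operation is differentiable, the photometric loss can update the texture directly; the same pattern extends to jointly refining poses or geometry.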


Figure 2: InLoc 3D map generated by assigning a 3D point to each local feature in the training images (viewed in COLMAP).
Figure 3: Top view of the 3D reconstruction of GangnamStation B2.
Figure 4: Paradigm 1 (pose approximation). Results obtained with pose interpolation methods where the weights are obtained using EWB, BDI, and CSI (rows) for different datasets (columns). For datasets with available retrieval GT (see Sec. 3.6.3), we show results obtained using the GT rankings (dashed lines) with the EWB weighting scheme. These results can be understood as upper bounds on the localization performance. The best upper bound can be obtained with the distance-based ranking. The best pose approximation results are obtained with CSI, and simply using the top-retrieved pose works best in many cases (except for NetVLAD and for all on Aachen). There is no clear winning global representation for all weighting schemes.
Figure 9: Landmark retrieval per-image linear correlation. Pearson coefficients computed for each query image individually (directly using the pose error as the localization metric) and visualized as violin plots. The columns show the localization paradigms and the rows show different datasets. For pose approximation, the Pearson coefficients are densely sampled in the upper part of the violins, meaning good linear correlation, whereas for pose estimation, they are sampled on both sides and in the middle (RobotCar), meaning high, inverse, and low correlation. We can observe similar behaviour for all feature types.

Investigating the Role of Image Retrieval for Visual Localization -- An exhaustive benchmark
  • Preprint
  • File available

May 2022 · 91 Reads

Visual localization, i.e., camera pose estimation in a known scene, is a core component of technologies such as autonomous driving and augmented reality. State-of-the-art localization approaches often rely on image retrieval techniques for one of two purposes: (1) provide an approximate pose estimate or (2) determine which parts of the scene are potentially visible in a given query image. It is common practice to use state-of-the-art image retrieval algorithms for both. These algorithms are often trained to retrieve the same landmark under a large range of viewpoint changes, a goal which often differs from the requirements of visual localization. In order to investigate the consequences for visual localization, this paper focuses on understanding the role of image retrieval for multiple visual localization paradigms. First, we introduce a novel benchmark setup and compare state-of-the-art retrieval representations on multiple datasets using localization performance as the metric. Second, we investigate several definitions of "ground truth" for image retrieval. Using these definitions as upper bounds for the visual localization paradigms, we show that there is still significant room for improvement. Third, using these tools and in-depth analysis, we show that retrieval performance on classical landmark retrieval or place recognition tasks correlates with localization performance only for some but not all paradigms. Finally, we analyze the effects of blur and dynamic scenes in the images. We conclude that there is a need for retrieval approaches specifically designed for localization paradigms. Our benchmark and evaluation protocols are available at https://github.com/naver/kapture-localization. Preprint: https://arxiv.org/abs/2205.15761


Investigating the Role of Image Retrieval for Visual Localization: An Exhaustive Benchmark

May 2022 · 137 Reads · 22 Citations
International Journal of Computer Vision

Visual localization, i.e., camera pose estimation in a known scene, is a core component of technologies such as autonomous driving and augmented reality. State-of-the-art localization approaches often rely on image retrieval techniques for one of two purposes: (1) provide an approximate pose estimate or (2) determine which parts of the scene are potentially visible in a given query image. It is common practice to use state-of-the-art image retrieval algorithms for both. These algorithms are often trained to retrieve the same landmark under a large range of viewpoint changes, a goal which often differs from the requirements of visual localization. In order to investigate the consequences for visual localization, this paper focuses on understanding the role of image retrieval for multiple visual localization paradigms. First, we introduce a novel benchmark setup and compare state-of-the-art retrieval representations on multiple datasets using localization performance as the metric. Second, we investigate several definitions of "ground truth" for image retrieval. Using these definitions as upper bounds for the visual localization paradigms, we show that there is still significant room for improvement. Third, using these tools and in-depth analysis, we show that retrieval performance on classical landmark retrieval or place recognition tasks correlates with localization performance only for some but not all paradigms. Finally, we analyze the effects of blur and dynamic scenes in the images. We conclude that there is a need for retrieval approaches specifically designed for localization paradigms. Our benchmark and evaluation protocols are available at https://github.com/naver/kapture-localization.
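As a concrete illustration of the "pose approximation" paradigm studied in this benchmark, the sketch below blends the poses of the top-k retrieved database images into a query pose estimate, using the simplest equal-weight (EWB-style) scheme. The function name, descriptor and pose arrays, and the naive quaternion blend are all illustrative assumptions, not the benchmark's code.

```python
# Hedged sketch of pose approximation via image retrieval: the query pose
# is a similarity-weighted blend of the top-k retrieved database poses.
import numpy as np

def approximate_pose(q_desc, db_descs, db_t, db_q, k=3):
    """q_desc: (D,) query global descriptor; db_descs: (N, D) database
    descriptors; db_t: (N, 3) translations; db_q: (N, 4) unit quaternions."""
    # Cosine similarity between the query and every database image.
    sims = db_descs @ q_desc / (
        np.linalg.norm(db_descs, axis=1) * np.linalg.norm(q_desc) + 1e-12)
    top = np.argsort(-sims)[:k]        # indices of the top-k neighbours
    w = np.ones(k) / k                 # EWB-style equal weights
    t = (w[:, None] * db_t[top]).sum(axis=0)
    # Naive quaternion blend: flip signs into the first neighbour's
    # hemisphere, average, renormalize. Adequate for nearby poses only;
    # proper rotation averaging would be more principled.
    d = db_q[top] @ db_q[top[0]]
    qs = db_q[top] * np.where(d < 0.0, -1.0, 1.0)[:, None]
    q = (w[:, None] * qs).sum(axis=0)
    return t, q / np.linalg.norm(q)
```

With k=1 this degenerates to simply adopting the top-retrieved pose, which, per Figure 4 above, already works well in many cases.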




Fig. 1. (T-B, L-R): Before and after the application of our proposed method, Quatro, on the KITTI dataset [7] when two distant and partially overlapped point clouds, i.e. source (cyan) and target (yellow), are given. As the distance between the two viewpoints of source and target grows, the ratio of outliers within the putative correspondences increases while the number of inliers simultaneously decreases, which in general degrades the performance of correspondence-based global registration methods [23], [26], [28]. Under these circumstances, our proposed method shows robust performance, overcoming both the effect of outliers and the degeneracy issue. The red and green lines denote outlier and inlier correspondences, respectively (best viewed in color).
Fig. 3. Illustration of Quatro in a degeneracy case when two distant and partially overlapped source (cyan) and target (yellow) clouds are given. (a) Spurious correspondences. (b) The output of the MCIS heuristic: most outliers are initially filtered. (c)-(e) An example of quasi-SO(3) estimation via GNC. (c) First, all weights w_k^(0) of the TIMs are set to one. (d) During the optimization, GNC sometimes unexpectedly leaves fewer than three pairs by assigning near-zero values to some w_k^(t) (red dashed rectangle). (e) Even in this degeneracy case, quasi-SO(3) estimation succeeds because the DoF of R+ is one, so it can be estimated even when a single pair of TIMs is left. (f) Before and after the application of COTE. In (a), (b), and (f), the definite outliers, inliers, and quasi-inliers are represented by the red, green, and blue lines, respectively (best viewed in color).
A Single Correspondence Is Enough: Robust Global Registration to Avoid Degeneracy in Urban Environments

March 2022 · 122 Reads

Global registration using 3D point clouds is a crucial technology for mobile platforms to achieve localization or manage loop-closing situations. In recent years, numerous researchers have proposed global registration methods to address large numbers of outlier correspondences. Unfortunately, the degeneracy problem, i.e., the phenomenon in which the number of estimated inliers falls below three, is still potentially inevitable. To tackle this problem, a degeneracy-robust, decoupling-based global registration method called Quatro is proposed. In particular, our method employs quasi-SO(3) estimation by leveraging the Atlanta world assumption in urban environments to avoid degeneracy in rotation estimation. Thus, the minimum degree of freedom (DoF) of our method is reduced from three to one. As verified on indoor and outdoor 3D LiDAR datasets, our proposed method yields robust global registration performance compared with other global registration methods, even for distant point cloud pairs. Furthermore, the experimental results confirm the applicability of our method as a coarse alignment. Our code is available at https://github.com/url-kaist/quatro.
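The core 1-DoF idea can be shown compactly: under a gravity-aligned Atlanta-world assumption, rotation estimation reduces to a single yaw angle, which is why even one surviving TIM (translation-invariant measurement) pair constrains it. The sketch below is a hedged illustration of that reduction, using plain circular averaging in place of Quatro's GNC machinery; function and variable names are illustrative.

```python
# Minimal sketch of the "quasi-SO(3)" (1-DoF) idea: with gravity-aligned
# clouds, the rotation is yaw-only, so one matched TIM pair already
# determines it. Circular averaging stands in for GNC-based robustness.
import numpy as np

def estimate_yaw_rotation(src_tims, dst_tims):
    """src_tims, dst_tims: (K, 3) matched TIM vectors; only the
    horizontal (x, y) components constrain the yaw angle."""
    # Per-pair yaw: the angle rotating each source TIM onto its target.
    ang = (np.arctan2(dst_tims[:, 1], dst_tims[:, 0])
           - np.arctan2(src_tims[:, 1], src_tims[:, 0]))
    # Circular mean handles wrap-around; with K = 1 this is just ang[0].
    yaw = np.arctan2(np.sin(ang).mean(), np.cos(ang).mean())
    c, s = np.cos(yaw), np.sin(yaw)
    return np.array([[c, -s, 0.0],
                     [s,  c, 0.0],
                     [0.0, 0.0, 1.0]])
```

This is why the degeneracy case in Fig. 3(e) is recoverable: a single remaining pair still determines a 1-DoF rotation, whereas a full SO(3) estimate would need at least three.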


SelfTune: Metrically Scaled Monocular Depth Estimation through Self-Supervised Learning

March 2022 · 36 Reads

Monocular depth estimation in the wild inherently predicts depth only up to an unknown scale. To resolve this scale ambiguity, we present a learning algorithm that leverages monocular simultaneous localization and mapping (SLAM) with proprioceptive sensors. Such monocular SLAM systems can provide metrically scaled camera poses. Given these metric poses and monocular sequences, we propose a self-supervised learning method for pre-trained supervised monocular depth networks that enables metrically scaled depth estimation. Our approach is based on a teacher-student formulation which guides our network to predict high-quality depths. We demonstrate that our approach is useful for various applications, such as mobile robot navigation, and is applicable to diverse environments. Our full system shows improvements over recent self-supervised depth estimation and completion methods on the EuRoC, OpenLORIS, and ScanNet datasets.
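A common baseline for the scale-recovery step described here is to align an up-to-scale depth map against the sparse, metrically scaled landmarks a SLAM system provides. The median-ratio sketch below illustrates that idea under stated assumptions; it is a standard technique, not necessarily the paper's exact formulation.

```python
# Hedged sketch: recover metric scale for a relative depth map using
# sparse metric depths from SLAM (median-ratio alignment). Names and
# shapes are illustrative assumptions.
import numpy as np

def metric_align(pred_depth, slam_depth):
    """pred_depth: (H, W) up-to-scale depth; slam_depth: (H, W) sparse
    metric depths, 0 where the SLAM system has no landmark."""
    mask = slam_depth > 0
    ratio = slam_depth[mask] / np.maximum(pred_depth[mask], 1e-6)
    scale = np.median(ratio)    # median is robust to SLAM outliers
    return scale * pred_depth   # depth map now in metric units
```

The median keeps a handful of bad SLAM landmarks from corrupting the global scale, which matters because monocular SLAM points are often noisy near occlusions.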



DnD: Dense Depth Estimation in Crowded Dynamic Indoor Scenes

August 2021 · 15 Reads

We present a novel approach for estimating depth from a monocular camera as it moves through complex and crowded indoor environments, e.g., a department store or a metro station. Our approach predicts absolute scale depth maps over the entire scene consisting of a static background and multiple moving people, by training on dynamic scenes. Since it is difficult to collect dense depth maps from crowded indoor environments, we design our training framework without requiring depths produced from depth sensing devices. Our network leverages RGB images and sparse depth maps generated from traditional 3D reconstruction methods to estimate dense depth maps. We use two constraints to handle depth for non-rigidly moving people without tracking their motion explicitly. We demonstrate that our approach offers consistent improvements over recent depth estimation methods on the NAVERLABS dataset, which includes complex and crowded scenes.
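The training signal described here, dense predictions supervised by sparse depth from classical 3D reconstruction, implies a masked regression loss over only the observed pixels; a minimal sketch follows (names and shapes are assumptions, not the paper's code).

```python
# Hedged sketch of a masked sparse-depth loss: the dense prediction is
# penalized only where the classical reconstruction produced a depth
# value. Assumes at least one valid pixel per batch.
import torch

def sparse_depth_loss(pred, sparse_gt):
    """pred, sparse_gt: (B, 1, H, W); sparse_gt is 0 at pixels with no
    reconstructed depth point."""
    mask = sparse_gt > 0                 # supervise only observed pixels
    return torch.abs(pred[mask] - sparse_gt[mask]).mean()
```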


Citations (8)


... Subsequently, a fine-tuning method has been established to estimate depth in a metrically accurate manner with the self-supervised learning scheme. To resolve the issue of scale ambiguity in single-image depth estimation in the wild or any rough or diverse environment, an algorithm termed SelfTune has been introduced in "SelfTune: Metrically Scaled Monocular Depth Estimation through Self-Supervised Learning" [126], which makes use of SLAM (Simultaneous Localization And Mapping) using proprioceptive sensors. These SLAM techniques can provide poses of cameras which are metrically scaled. ...

Reference: Deep Learning-Based Stereopsis and Monocular Depth Estimation Techniques: A Review
SelfTune: Metrically Scaled Monocular Depth Estimation through Self-Supervised Learning
  • Citing Conference Paper
  • May 2022

... A coarse initial estimate can be obtained via image retrieval [3,80] against a database of reference images. The pose(s) of the top-retrieved image(s) then provide an approximation of the pose of the query image [32,79]. A more efficient alternative to image retrieval is to directly regress the camera pose using a neural network [5,13,17,18,35,36,73,74,84]. ...

Investigating the Role of Image Retrieval for Visual Localization: An Exhaustive Benchmark

International Journal of Computer Vision

... However, these depth completion methods are vulnerable to noisy depth values from SLAM and varying sparse point distributions [33], [35]. Thus, we aim to leverage the learning-based depth estimation from a single image to predict depth with a metric scale [14]- [16]. ...

DnD: Dense Depth Estimation in Crowded Dynamic Indoor Scenes
  • Citing Conference Paper
  • October 2021

... Over the last decade, many visual localisation methods have been proposed, including feature matching-based approaches [11,21,27,30,41], scene coordinate regression [2][3][4] and absolute pose regressors (APRs) [17,18,37]. Much of this progress has been driven by the availability of diverse datasets and benchmarks [6,8,10,18,19,29,31,36,38,40,41,43,44]. However, most of these datasets present limitations that affect their application to XR. ...

Large-scale Localization Datasets in Crowded Indoor Spaces
  • Citing Conference Paper
  • June 2021

... This algorithm was extended to depth completion, introducing improvements like the use of aligned color as a guiding factor for the weight function and to define the order of computations [33], and the use of a pixel-wise confidence factor [30]. Sparse depth maps, generally captured with LiDAR sensors, suffer especially from large patches of missing depth data and have particular time limitations, as they are commonly linked with autonomous driving. Thus, more advanced techniques have been developed, relying on both supervised [29] and self-supervised [36,11,16] deep convolutional neural networks assisted by color information to fill large depth gaps. ...

SelfDeco: Self-Supervised Monocular Depth Completion in Challenging Indoor Environments
  • Citing Conference Paper
  • May 2021

... However, a sequence of images increases the accuracy of the method. Consequently, some papers have begun to leverage image sequences to estimate VBL (Lee et al. (2021); Brahmbhatt et al. (2018); Valada et al. (2018); Xue et al. (2019); Li et al. (2019)). In this article, we assume that sequential data is available, and we combine deep learning methods with traditional tracking methods for localization. ...

Local to Global: Efficient Visual Localization for a Monocular Camera
  • Citing Conference Paper
  • January 2021

... Indoor place recognition represents an important yet relatively less explored area. The SpoxelNet (Chang et al. 2020) neural network architecture was proposed as a 3D-PCPR method tailored for crowded indoor spaces. SpoxelNet effectively encodes input voxels into global descriptor vectors. ...

SpoxelNet: Spherical Voxel-based Deep Place Recognition for 3D Point Clouds of Crowded Indoor Spaces
  • Citing Conference Paper
  • October 2020