Fig. 4
Normalized influence area of the points. Given a set of points in Ω, notice how the influence expands around areas of similar local structure. First column: RGB image with the points of Ω labeled in different colors. Second column: influence areas computed by our method. Notice how this influence expands across areas that share the same local structure, but can be misled where points are lacking or where the neural network's estimate is not accurate enough. Figure best viewed in color.
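The caption describes normalized influence weights that spread out from the sparse points of Ω according to local image structure. The sketch below is a minimal, illustrative formulation, assuming the influence of each point is a softmax over feature-space distances between a per-pixel descriptor map (e.g. a CNN feature map) and the descriptor at that point; the feature extractor and the temperature `tau` are assumptions, not the authors' exact definition.

```python
import numpy as np

def influence_maps(features, omega, tau=0.1):
    """Normalized influence of each sparse point over every pixel.

    features : (H, W, C) per-pixel descriptors (e.g. from a CNN) -- assumed input
    omega    : list of (row, col) sparse points with known depth
    tau      : softness of the normalization (assumed hyper-parameter)

    Returns (len(omega), H, W) weights that sum to 1 over the points axis.
    """
    dists = np.stack([
        np.linalg.norm(features - features[r, c], axis=-1)  # feature distance to point k
        for (r, c) in omega
    ])                                                       # (K, H, W)
    logits = -dists / tau                                    # closer in feature space -> larger weight
    logits -= logits.max(axis=0, keepdims=True)              # numerical stability
    w = np.exp(logits)
    return w / w.sum(axis=0, keepdims=True)                  # normalize over the K points

# toy usage: random features, three seed points
feats = np.random.rand(48, 64, 8).astype(np.float32)
weights = influence_maps(feats, omega=[(10, 10), (24, 40), (40, 20)])
print(weights.shape, weights.sum(axis=0).round(3).min())     # (3, 48, 64) 1.0
```

The softmax normalization guarantees that, at every pixel, the influences of the points in Ω sum to one, which is what allows them to be used directly as fusion weights.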

Source publication
Article
Full-text available
Dense 3D mapping from a monocular sequence is a key technology for several applications and still an open research problem. This paper leverages recent results on single-view CNN-based depth estimation and fuses them with multi-view depth estimation. Both approaches present complementary strengths. Multi-view depth estimation is highly accurate but only...

Context in source publication

Context 1
... normalized weights expand the local influence to the whole image (see Fig. 4 and Fig. 5 for a more detailed view). Notice how the influence expands along planes even if the points in Ω do not reach the end of the plane, and how it is sharply reduced when the local structure changes. Once these influence weights have been calculated and normalized, the fusion depth estimation, f, for each point (i, j) is a combination ...
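The excerpt stops before the combination itself, so the sketch below only illustrates one plausible way the normalized weights could drive the fusion: scale corrections between the accurate multi-view depths and the single-view prediction are measured at the points of Ω and spread across the image with the weights. The ratio-based correction and all variable names are assumptions for illustration, not the paper's exact equation.

```python
import numpy as np

def fuse_depth(d_single, d_multi_at_omega, omega, weights):
    """Hypothetical fusion: propagate multi-view/single-view scale corrections
    measured at the sparse points of Omega through the normalized influence weights.

    d_single         : (H, W) dense single-view CNN depth
    d_multi_at_omega : (K,) accurate multi-view depths at the Omega points
    omega            : list of K (row, col) points
    weights          : (K, H, W) normalized influence maps (sum to 1 over K)
    """
    # per-point correction factor between the two estimates (assumed form)
    ratios = np.array([d_multi_at_omega[k] / d_single[r, c]
                       for k, (r, c) in enumerate(omega)])          # (K,)
    # spread the corrections with the influence weights and rescale the dense map
    correction = np.tensordot(ratios, weights, axes=(0, 0))          # (H, W)
    return correction * d_single

d_single = np.full((48, 64), 2.0)                                    # toy single-view depth
omega = [(10, 10), (24, 40), (40, 20)]
weights = np.full((3, 48, 64), 1 / 3)                                # uniform toy weights
fused = fuse_depth(d_single, np.array([1.8, 2.2, 2.0]), omega, weights)
print(fused.shape, fused.mean().round(3))
```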

Similar publications

Thesis
Full-text available
This work presents the development of a project whose objective is the MATLAB implementation of a SLAM algorithm for a commercial mobile robot operating in structured, static environments, using the robot's odometry and its model for localization and a laser range sensor for mapping; in addition, a...
Article
Full-text available
To solve the SLAM (simultaneous localization and map building) problem for mobile robots, a particle-filter-based multi-robot SLAM algorithm with communication in unknown environments is proposed. In the standard particle filter, an incremental map construction method based on point-line consistency is introduced to preserve the hypothesis of the line seg...
Article
Full-text available
Visual odometry in the field of computer vision and robotics is a well-known approach with which the position and orientation of an agent can be obtained using only images from one or more cameras. In most traditional point-feature-based visual odometry, one important assumption, and also an ideal condition, is that the scene remains static....
Chapter
Full-text available
Multi-robot systems have recently been in the spotlight for their efficiency in performing tasks. However, if there is no map of the working environment, each robot must perform SLAM, which simultaneously localizes the robot and maps its surroundings. To operate multi-robot systems efficiently, the individual maps should be ac...
Conference Paper
Full-text available
Abstract: To reduce the feature matching time in vision-based multi-robot simultaneous localization and mapping (SLAM), a feature matching algorithm based on the map environment is proposed in this paper. This algorithm differs from previously proposed methods in that it establishes feature libraries by classifying features collected in the mobile...

Citations

... Despite its complexity, mastering single-view depth estimation is critical for scenarios where multi-view or motion-based methods are impractical or impossible, e.g., in endoscopy. As depth estimation algorithms from single views mature in precision, they also offer potential for integration into multi-view 3D reconstruction pipelines [Facil et al., 2017]. ...
Preprint
Full-text available
Single-view depth estimation refers to the ability to derive three-dimensional information per pixel from a single two-dimensional image. Single-view depth estimation is an ill-posed problem because there are multiple depth solutions that explain the 3D geometry from a single view. While deep neural networks have been shown to be effective at capturing depth from a single view, the majority of current methodologies are deterministic in nature. Accounting for uncertainty in the predictions can avoid disastrous consequences when applied to fields such as autonomous driving or medical robotics. We have addressed this problem by quantifying the uncertainty of supervised single-view depth with Bayesian deep neural networks. There are scenarios, especially in medicine with endoscopic images, where such annotated data is not available. To alleviate the lack of data, we present a method that improves the transition from the synthetic to the real domain. We introduce an uncertainty-aware teacher-student architecture that is trained in a self-supervised manner, taking into account the teacher's uncertainty. Given the vast amount of unannotated data and the challenges associated with capturing annotated depth in medical minimally invasive procedures, we advocate a fully self-supervised approach that only requires RGB images and the geometric and photometric calibration of the endoscope. In endoscopic imaging, the camera and light sources are co-located at a small distance from the target surfaces. This setup implies that brighter areas of the image are nearer to the camera, while darker areas are further away. Building on this observation, we exploit the fact that, for any given albedo and surface orientation, pixel brightness is inversely proportional to the square of the distance. We propose the use of illumination as a strong single-view self-supervisory signal for deep neural networks.
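The last sentences of the abstract state the inverse-square relation between pixel brightness and distance for a co-located camera and light source. The snippet below only illustrates that relation as a relative depth cue; in the paper it is used as a self-supervisory signal for training a network rather than as a direct depth formula, and the normalization shown here is an assumption.

```python
import numpy as np

def depth_from_brightness(image_gray, eps=1e-6):
    """Relative depth cue from the inverse-square light fall-off described above:
    with a co-located camera/light and fixed albedo and orientation, brightness b
    is proportional to 1/d**2, so d is proportional to 1/sqrt(b) (up to scale)."""
    b = np.clip(image_gray.astype(np.float64), eps, None)
    d_rel = 1.0 / np.sqrt(b)
    return d_rel / d_rel.max()          # normalize: 1.0 = farthest observed point

img = np.random.rand(32, 32) + 0.1      # toy brightness image
print(depth_from_brightness(img).min(), depth_from_brightness(img).max())
```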
... Moreover, it is found in [7] that although the overall performance of monocular models is poorer than that of binocular ones, monocular models still perform better on some special local regions, e.g., the occluded regions around objects that can only be seen from a single view. Inspired by this finding, some monocular (or binocular) models employed a separate binocular (or monocular) model to boost their performance on their own task [1,7,9,15,40,42,49]. All the above issues naturally raise the following question: is it feasible to explore a general model that can not only handle the two tasks compatibly, but also improve the prediction accuracy? ...
Preprint
Full-text available
Monocular and binocular self-supervised depth estimation are two important and related tasks in computer vision, which aim to predict scene depth from single images and stereo image pairs respectively. In the literature, the two tasks are usually tackled separately by two different kinds of models: binocular models generally fail to predict depth from single images, while the prediction accuracy of monocular models is generally inferior to that of binocular models. In this paper, we propose a Two-in-One self-supervised depth estimation network, called TiO-Depth, which can not only handle the two tasks compatibly but also improve the prediction accuracy. TiO-Depth employs a Siamese architecture, and each of its sub-networks can be used as a monocular depth estimation model. For binocular depth estimation, a Monocular Feature Matching module is proposed for incorporating the stereo knowledge between the two images, and the full TiO-Depth is used to predict depths. We also design a multi-stage joint-training strategy for improving the performance of TiO-Depth in both tasks by combining their relative advantages. Experimental results on the KITTI, Cityscapes, and DDAD datasets demonstrate that TiO-Depth outperforms both the monocular and binocular state-of-the-art methods in most cases, and further verify the feasibility of a two-in-one network for monocular and binocular depth estimation. The code is available at https://github.com/ZM-Zhou/TiO-Depth_pytorch.
... Most promising, however, are those systems that combine deep learning with standard geometric constraints ([16]-[22]). It was shown in [23] that learning-based and geometry-based approaches have a complementary nature, as learning-based systems tend to perform better on the interior points of objects but blur edges, whereas geometry-based systems typically do well in areas with a high image gradient but perform poorly on interior points that may lack texture. ...
Preprint
The best way to combine the results of deep learning with standard 3D reconstruction pipelines remains an open problem. While systems that pass the output of traditional multi-view stereo approaches to a network for regularisation or refinement currently seem to get the best results, it may be preferable to treat deep neural networks as separate components whose results can be probabilistically fused into geometry-based systems. Unfortunately, the error models required to do this type of fusion are not well understood, with many different approaches being put forward. Recently, a few systems have achieved good results by having their networks predict probability distributions rather than single values. We propose using this approach to fuse a learned single-view depth prior into a standard 3D reconstruction system. Our system is capable of incrementally producing dense depth maps for a set of keyframes. We train a deep neural network to predict discrete, nonparametric probability distributions for the depth of each pixel from a single image. We then fuse this "probability volume" with another probability volume based on the photometric consistency between subsequent frames and the keyframe image. We argue that combining the probability volumes from these two sources will result in a volume that is better conditioned. To extract depth maps from the volume, we minimise a cost function that includes a regularisation term based on network predicted surface normals and occlusion boundaries. Through a series of experiments, we demonstrate that each of these components improves the overall performance of the system.
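The abstract describes fusing a network-predicted probability volume with a photometric-consistency volume over discrete depth hypotheses. A minimal sketch of that idea, assuming the two sources can be treated as independent per-pixel distributions that are multiplied and renormalized (the paper's regularisation with predicted surface normals and occlusion boundaries is omitted):

```python
import numpy as np

def fuse_probability_volumes(p_single, p_photo, depth_bins):
    """Illustrative fusion: treat the two sources as independent per-pixel
    distributions over discrete depth bins, multiply them (add log-probabilities),
    renormalize, and take the most likely bin."""
    log_p = np.log(p_single + 1e-12) + np.log(p_photo + 1e-12)       # (H, W, D)
    log_p -= log_p.max(axis=-1, keepdims=True)                        # numerical stability
    fused = np.exp(log_p)
    fused /= fused.sum(axis=-1, keepdims=True)
    return depth_bins[np.argmax(fused, axis=-1)]                      # (H, W) depth map

H, W, D = 24, 32, 64
bins = np.linspace(0.5, 10.0, D)
p1 = np.random.dirichlet(np.ones(D), size=(H, W))                     # toy network volume
p2 = np.random.dirichlet(np.ones(D), size=(H, W))                     # toy photometric volume
print(fuse_probability_volumes(p1, p2, bins).shape)                   # (24, 32)
```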
... While many of these approaches advocate a completely end-to-end framework (for example, [14]-[19]), there has been some work demonstrating the benefit of combining both geometric constraints and learned priors. As shown in [20], geometry-based systems perform best on areas of high image gradient (usually on the edges of objects) but struggle with interior areas of low texture, whereas learning-based systems typically do reasonably well on interior points but blur the edges of objects. Despite the evidence of their complementary nature, however, the best approach to combining learning and geometry remains an open problem. ...
... For this reason, a number of approaches that take network depth predictions and refine them with geometric constraints have been proposed. In [20], the authors compute a network depth prediction for each keyframe and update a semi-dense multiview stereo depth map with each new frame. The two depth estimates are then interpolated based on a set of tunable weights related to the image structure. ...
Preprint
While the keypoint-based maps created by sparse monocular simultaneous localisation and mapping (SLAM) systems are useful for camera tracking, dense 3D reconstructions may be desired for many robotic tasks. Solutions involving depth cameras are limited in range and to indoor spaces, and dense reconstruction systems based on minimising the photometric error between frames are typically poorly constrained and suffer from scale ambiguity. To address these issues, we propose a 3D reconstruction system that leverages the output of a convolutional neural network (CNN) to produce fully dense depth maps for keyframes that include metric scale. Our system, DeepFusion, is capable of producing real-time dense reconstructions on a GPU. It fuses the output of a semi-dense multiview stereo algorithm with the depth and gradient predictions of a CNN in a probabilistic fashion, using learned uncertainties produced by the network. While the network only needs to be run once per keyframe, we are able to optimise for the depth map with each new frame so as to constantly make use of new geometric constraints. Based on its performance on synthetic and real-world datasets, we demonstrate that DeepFusion is capable of performing at least as well as other comparable systems.
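DeepFusion fuses a semi-dense multi-view stereo estimate with CNN depth and gradient predictions using learned uncertainties. The sketch below shows only the textbook inverse-variance (Gaussian product) fusion of two depth estimates per pixel, as an assumption-level stand-in for the system's full probabilistic optimisation:

```python
import numpy as np

def inverse_variance_fusion(d_mvs, var_mvs, d_cnn, var_cnn):
    """Per-pixel probabilistic fusion of two depth estimates with uncertainties,
    following the standard Gaussian product; the actual system also uses predicted
    gradients and optimises over whole keyframes."""
    w_mvs = 1.0 / var_mvs
    w_cnn = 1.0 / var_cnn
    d_fused = (w_mvs * d_mvs + w_cnn * d_cnn) / (w_mvs + w_cnn)
    var_fused = 1.0 / (w_mvs + w_cnn)
    return d_fused, var_fused

d, v = inverse_variance_fusion(np.array([2.0]), np.array([0.04]),
                               np.array([2.5]), np.array([0.25]))
print(d.round(3), v.round(4))   # estimate pulled toward the more certain source
```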
... I. INTRODUCTION Monocular depth estimation is an important problem in robotics and computer vision. Depth maps can be used to understand the 3D structure and relative positions of objects in a scene for applications including autonomous driving [1], visual odometry [2], [3], augmented reality [4], sensor fusion [5], and many others. Estimating depth from a monocular image is an inherently ill-posed problem, since 3D information is irretrievably lost when the camera projects to a 2D image. ...
Preprint
Full-text available
Estimating depth from a monocular image is an ill-posed problem: when the camera projects a 3D scene onto a 2D plane, depth information is inherently and permanently lost. Nevertheless, recent work has shown impressive results in estimating 3D structure from 2D images using deep learning. In this paper, we put on an introspective hat and analyze state-of-the-art monocular depth estimation models in indoor scenes to understand these models' limitations and error patterns. To address errors in depth estimation, we introduce a novel Depth Error Detection Network (DEDN) that spatially identifies erroneous depth predictions in the monocular depth estimation models. By experimenting with multiple state-of-the-art monocular indoor depth estimation models on multiple datasets, we show that our proposed depth error detection network can identify a significant number of errors in the predicted depth maps. Our module is flexible and can be readily plugged into any monocular depth prediction network to help diagnose its results. Additionally, we propose a simple yet effective Depth Error Correction Network (DECN) that iteratively corrects errors based on our initial error diagnosis.
... It is also possible to train stereo networks without ground truth supervision [98,82,1,36], but these models are typically outperformed by supervised variants. Some works fuse conventional matching-based stereo estimation with monocular depth cues [71,60,17]. In contrast, we do not require stereo pairs during training or testing. ...
Conference Paper
Self-supervised monocular depth estimation networks are trained to predict scene depth using nearby frames as a supervision signal during training. However, for many applications, sequence information in the form of video frames is also available at test time. The vast majority of monocular networks do not make use of this extra signal, thus ignoring valuable information that could be used to improve the predicted depth. Those that do, either use computationally expensive test-time refinement techniques or off-the-shelf recurrent networks, which only indirectly make use of the geometric information that is inherently available. We propose ManyDepth, an adaptive approach to dense depth estimation that can make use of sequence information at test time, when it is available. Taking inspiration from multi-view stereo, we propose a deep end-to-end cost volume based approach that is trained using self-supervision only. We present a novel consistency loss that encourages the network to ignore the cost volume when it is deemed unreliable, e.g. in the case of moving objects, and an augmentation scheme to cope with static cameras. Our detailed experiments on both KITTI and Cityscapes show that we outperform all published self-supervised baselines, including those that use single or multiple frames at test time.
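ManyDepth builds a cost volume from deep features warped with camera poses and depth hypotheses. The simplified sketch below replaces pose-based warping with plain horizontal shifts over candidate disparities (and uses SciPy's uniform_filter for patch aggregation), so it only conveys the general cost-volume idea, not the method itself:

```python
import numpy as np
from scipy.ndimage import uniform_filter

def sad_cost_volume(ref, src, max_disp=16, patch=3):
    """Simplified cost volume over horizontal disparities using patch-averaged
    absolute differences; plain shifts stand in for pose-based feature warping."""
    volume = np.stack([
        uniform_filter(np.abs(ref - np.roll(src, d, axis=1)), size=patch)
        for d in range(max_disp)
    ])                                               # (max_disp, H, W)
    return volume, np.argmin(volume, axis=0)         # winner-take-all disparity

ref = np.random.rand(32, 48)
src = np.roll(ref, -4, axis=1)                       # source offset by 4 pixels
_, disp = sad_cost_volume(ref, src)
print(np.bincount(disp.ravel()).argmax())            # most pixels recover disparity 4
```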
... Maybe the only exception is end-to-end pose regression (e.g., [28]), for which [29] shows a worse performance than geometric methods. [30], [31] are early works combining single-view depth with multiview approaches for mapping and tracking respectively. [32] proposed a convolutional model for camera tracking and incremental mapping, in this case tightly integrating multiview optimization within the network model. ...
Preprint
Full-text available
Estimating a scene reconstruction and the camera motion from in-body videos is challenging due to several factors, e.g. the deformation of in-body cavities or the lack of texture. In this paper we present Endo-Depth-and-Motion, a pipeline that estimates the 6-degrees-of-freedom camera pose and dense 3D scene models from monocular endoscopic videos. Our approach leverages recent advances in self-supervised depth networks to generate pseudo-RGBD frames, then tracks the camera pose using photometric residuals and fuses the registered depth maps in a volumetric representation. We present an extensive experimental evaluation in the public dataset Hamlyn, showing high-quality results and comparisons against relevant baselines. We also release all models and code for future comparisons.
... The proposed depth refinement algorithm can restore missing depth information by combining learning-based monocular depth estimation and MVS methods. Similarly, Fácil et al. [58] fused CNN-based single-view and multi-view depth to improve the depth of low-parallax image sequences. Martins et al. [59] demonstrated that fusing monocular depth estimates with stereo depth leads to higher performance. ...
Article
Full-text available
Image-based rendering (IBR) attempts to synthesize novel views using a set of observed images. Some IBR approaches (such as light fields) have yielded impressive high-quality results on small-scale scenes with dense photo capture. However, available wide-baseline IBR methods are still restricted by the low geometric accuracy and completeness of multi-view stereo (MVS) reconstruction on low-textured and non-Lambertian surfaces. The issues become more significant in large-scale outdoor scenes due to challenging scene content, e.g., buildings, trees, and sky. To address these problems, we present a novel IBR algorithm that consists of two key components. First, we propose a novel depth refinement method that combines MVS depth maps with monocular depth maps predicted via deep learning. A lookup table remap is proposed for converting the scale of the monocular depths to be consistent with the scale of the MVS depths. Then, the rescaled monocular depth is used as the constraint in the minimum spanning tree (MST)-based nonlocal filter to refine the per-view MVS depth. Second, we present an efficient shape-preserving warping algorithm that uses superpixels to generate the warped images and blend expected novel views of scenes. The proposed method has been evaluated on public MVS and view synthesis datasets, as well as newly captured large-scale outdoor datasets. In comparison with state-of-the-art methods, the experimental results demonstrated that the proposed method can obtain more complete and reliable depth maps for the challenging large-scale outdoor scenes, thereby resulting in more promising novel view synthesis.
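The abstract mentions a lookup-table remap that brings the scale of the monocular depths into agreement with the MVS depths. As a hedged illustration of that kind of scale alignment (not the paper's actual table construction), the sketch below matches the quantiles of the two depth maps over pixels where the MVS depth is valid and interpolates:

```python
import numpy as np

def remap_mono_to_mvs_scale(d_mono, d_mvs, valid, n_bins=64):
    """Illustrative scale alignment: build a lookup table from the quantiles of
    the monocular depths to the quantiles of the MVS depths over valid pixels,
    then apply it to the dense monocular prediction by interpolation."""
    q = np.linspace(0.0, 1.0, n_bins)
    mono_q = np.quantile(d_mono[valid], q)           # lookup-table keys
    mvs_q = np.quantile(d_mvs[valid], q)             # lookup-table values
    return np.interp(d_mono, mono_q, mvs_q)          # remapped dense monocular depth

d_mono = np.random.rand(32, 32) + 0.5                # relative-scale prediction
d_mvs = 3.0 * d_mono + 0.2                           # toy "metric" MVS depth
valid = np.random.rand(32, 32) > 0.5                 # sparse validity mask
d_aligned = remap_mono_to_mvs_scale(d_mono, d_mvs, valid)
print(np.abs(d_aligned - d_mvs).mean().round(3))     # small residual after remapping
```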
... In that case, we could use a depth sensor, such as an RGB-D camera, or a stereo camera to estimate the depth. Alternatively, some works estimate depth purely from monocular information [64]. As a proof of concept, we found that the probability score of the detection network was correlated with the level of occlusion of each object. ...
Article
Full-text available
Prosthetic vision is being applied to partially recover the retinal stimulation of visually impaired people. However, the phosphenic images produced by the implants have very limited information bandwidth due to the poor resolution and lack of color or contrast. The ability of object recognition and scene understanding in real environments is severely restricted for prosthetic users. Computer vision can play a key role to overcome the limitations and to optimize the visual information in the prosthetic vision, improving the amount of information that is presented. We present a new approach to build a schematic representation of indoor environments for simulated phosphene images. The proposed method combines a variety of convolutional neural networks for extracting and conveying relevant information about the scene such as structural informative edges of the environment and silhouettes of segmented objects. Experiments were conducted with normal sighted subjects with a Simulated Prosthetic Vision system. The results show good accuracy for object recognition and room identification tasks for indoor scenes using the proposed approach, compared to other image processing methods.