Conference Paper

Towards Linear-Time Incremental Structure from Motion

Authors: Changchang Wu

Abstract

The time complexity of incremental structure from motion (SfM) is often quoted as O(n^4) with respect to the number of cameras. With bundle adjustment (BA) significantly accelerated in recent years by preconditioned conjugate gradient (PCG) methods, it is worth revisiting how fast incremental SfM can be. We introduce a novel BA strategy that provides a good balance between speed and accuracy. Through algorithm analysis and extensive experiments, we show that incremental SfM requires only O(n) time on many of its major steps, including BA. Our method maintains high accuracy by regularly re-triangulating the feature matches that initially fail to triangulate. We test our algorithm on large photo collections and long video sequences with various settings, and show that our method offers state-of-the-art performance for large-scale reconstructions. The presented algorithm is available as part of VisualSFM at http://homes.cs.washington.edu/~ccwu/vsfm/.
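The BA speed-up that the abstract attributes to preconditioned conjugate gradient (PCG) can be illustrated with a minimal, generic sketch: a Jacobi-preconditioned CG solver applied to a small symmetric positive-definite system standing in for the reduced camera system. This is textbook PCG, not the paper's implementation; the matrix, sizes, and names are illustrative only.

```python
import numpy as np

def pcg(A, b, M_inv_diag, tol=1e-10, max_iter=100):
    """Jacobi-preconditioned conjugate gradient for an SPD system A x = b.

    M_inv_diag holds the inverse of the diagonal (Jacobi) preconditioner,
    so applying the preconditioner is an elementwise product.
    """
    x = np.zeros_like(b)
    r = b - A @ x
    z = M_inv_diag * r
    p = z.copy()
    rz = r @ z
    for _ in range(max_iter):
        Ap = A @ p
        alpha = rz / (p @ Ap)
        x = x + alpha * p
        r = r - alpha * Ap
        if np.linalg.norm(r) < tol:
            break
        z = M_inv_diag * r
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x

# A tiny Gauss-Newton-style system J^T J x = b as a stand-in.
rng = np.random.default_rng(0)
J = rng.standard_normal((20, 6))
A = J.T @ J + 1e-3 * np.eye(6)   # SPD, as a damped normal matrix would be
b = rng.standard_normal(6)
x = pcg(A, b, 1.0 / np.diag(A))
print(np.allclose(A @ x, b, atol=1e-6))
```

In BA the matrix would be the Schur-reduced camera system and the preconditioner is typically block-diagonal rather than scalar, but the structure of the iteration is the same.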


... 3D reconstruction is a hot topic in computer vision that aims to recover 3D geometry from RGB images. However, traditional methods involve many complex procedures, such as feature extraction and matching (Lowe 2004; Yi et al. 2016), sparse reconstruction (Agarwal et al. 2011; Wu 2013; Schonberger and Frahm 2016; Moulon et al. 2016), and dense reconstruction (Yao et al. 2018; Mi, Di, and Xu 2022; Yan et al. 2023). Consequently, traditional methods do not form a differentiable end-to-end reconstruction pipeline and require high-quality results from each sub-module to achieve accurate results. ...
... Traditional SfM (Wu 2013; Moulon, Monasse, and Marlet 2013; Schonberger and Frahm 2016; Moulon et al. 2016) and SLAM (Mur-Artal, Montiel, and Tardos 2015; Engel, Koltun, and Cremers 2017) can estimate camera parameters for given images. However, these methods divide the reconstruction pipeline into several non-differentiable modules that need hand-crafted features (Lowe 2004) or learning-based methods (Yi et al. 2016; Teed and Deng 2020) to establish image correspondences, and then reconstruct a sparse scene and camera parameters through multi-view geometry. ...
... Incremental SfM Given a set of images, incremental SfM can recover the camera parameters δ one by one in linear time (Wu 2013) and consists of four steps (Schonberger and Frahm 2016): Initialization The selection of an initial two-view is essential, because a suitable initial two-view improves the robustness and quality of the reconstruction. With a given two-view and its matched features, incremental SfM computes the relative pose by multi-view geometry (MVG) and triangulates 3D points to initialize the scene. ...
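The initialization described in this snippet, relative pose for a two-view pair followed by triangulation, can be sketched for a single correspondence with standard linear (DLT) triangulation. The cameras, intrinsics, and point below are synthetic stand-ins, not values from any of the cited systems.

```python
import numpy as np

def triangulate_dlt(P1, P2, x1, x2):
    """Linear (DLT) triangulation of one correspondence from two views.

    P1, P2 are 3x4 projection matrices; x1, x2 are 2D pixel observations.
    Builds the homogeneous system A X = 0 and takes the SVD null vector.
    """
    A = np.vstack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]

def proj(P, X):
    """Project a 3D point with projection matrix P."""
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]

# Synthetic check: two hypothetical cameras separated along x.
K = np.diag([800.0, 800.0, 1.0])
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])
X_true = np.array([0.5, -0.2, 4.0])
X_est = triangulate_dlt(P1, P2, proj(P1, X_true), proj(P2, X_true))
print(np.allclose(X_est, X_true, atol=1e-6))
```

In a real pipeline the relative pose would first be recovered from the essential matrix of the matched features, and the triangulated points would then seed bundle adjustment.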
Article
Full-text available
Neural Radiance Fields have demonstrated impressive performance in novel view synthesis. However, NeRF and most of its variants still rely on traditional complex pipelines to provide extrinsic and intrinsic camera parameters, such as COLMAP. Recent works, like NeRFmm, BARF, and L2G-NeRF, directly treat camera parameters as learnable and estimate them through differential volume rendering. However, these methods work for forward-looking scenes with slight motions and fail to tackle the rotation scenario in practice. To overcome this limitation, we propose a novel camera parameter free neural radiance field (CF-NeRF), which incrementally reconstructs 3D representations and recovers the camera parameters inspired by incremental structure from motion. Given a sequence of images, CF-NeRF estimates camera parameters of images one by one and reconstructs the scene through initialization, implicit localization, and implicit optimization. To evaluate our method, we use a challenging real-world dataset, NeRFBuster, which provides 12 scenes under complex trajectories. Results demonstrate that CF-NeRF is robust to rotation and achieves state-of-the-art results without providing prior information and constraints.
... Interestingly, two off-the-shelf image registration methods (ORB based OpenSFM (Adorjan, 2016) and SIFT based VSFM (Wu, 2013)) manage to accurately estimate relative pose for the majority of the scenarios missed by ORB-SLAM. This significant performance gap between the two available methods suggests that ORB-SLAM's pipeline can be further matured by focusing on the relative pose estimators. ...
... We manually inspected ORB-SLAM's loop closure (for more than 10,000 keyframes) to establish how frequently a valid candidate was provided by the native vPR. SIFT based SfM (Wu, 2013) and ORB based SfM (Adorjan, 2016) for relative pose estimation achieve performance comparable to the lowest thresholds. ORB-SLAM3 (Campos et al., 2021), at standard thresholds, does not outperform the original ORB-SLAM at lower thresholds, despite an iterative approach with an increased time budget. ...
... In this section, we evaluate SIFT and ORB based registration methods for relative pose estimation. We selected a SIFT based SfM pipeline (Wu, 2013) and an ORB based SfM pipeline (OpenSFM (Adorjan, 2016)) and tried to register the current keyframe, for which vPR provided a valid loop closing candidate but ORB-SLAM was unable to estimate SIM(3). Instead of registering the corresponding keyframes directly, we extracted their connected keyframes (keyframes having common features in the ORB-SLAM map) for both counterparts. ...
Article
Full-text available
We analyse, for the first time, the popular loop closing module of a well known and widely used open-source visual SLAM (ORB-SLAM) pipeline. Investigating failures in the loop closure module of visual SLAM is challenging since it consists of multiple building blocks. Our meticulous investigations have revealed a few interesting findings. Contrary to reported results, ORB-SLAM frequently misses a large fraction of loop closures on public (KITTI, TUM RGB-D) datasets. One common assumption is that, in such scenarios, the visual place recognition (vPR) block of the loop closure module is unable to find a suitable match due to extreme conditions (dynamic scene, viewpoint/scale changes). We report that native vPR of ORB-SLAM is not the sole reason for these failures. Although recent deep vPR alternatives achieve impressive matching performance, replacing native vPR with these deep alternatives will only partially improve loop closure performance of visual SLAM. Our findings suggest that the problem lies with the subsequent relative pose estimation module between the matching pair. ORB-SLAM3 has improved the recall of the original loop closing module. However, even in ORB-SLAM3, the loop closing module is the major reason behind loop closing failures. Surprisingly, using off-the-shelf ORB and SIFT based relative pose estimators (non real-time) manages to close most of the loops missed by ORB-SLAM. This significant performance gap between the two available methods suggests that ORB-SLAM's pipeline can be further matured by focusing on the relative pose estimators, to improve loop closure performance, rather than investing more resources on improving vPR. We also evaluate deep alternatives for relative pose estimation in the context of loop closures. Interestingly, the performance of deep relocalization methods (e.g. MapNet) is worse than that of classic methods even in loop closure scenarios. This finding further supports the fundamental limitation of deep relocalization methods recently diagnosed. Finally, we expose a bias in a well-known public dataset (KITTI) due to which these commonly occurring failures have eluded the community. We augment the KITTI dataset with detailed loop closing labels. In order to compensate for the bias in the public datasets, we provide a challenging loop closure dataset which contains challenging yet commonly occurring indoor navigation scenarios with loop closures. We hope our findings and the accompanying dataset will help the community in further improving the popular ORB-SLAM's pipeline.
... Thanks to the portability and low cost of cameras and smartphones, image-based methods [6], [7] have become a popular way to reconstruct real-world trees, because it is convenient to acquire photographs of a real-world tree. The common image-based method utilized the SfM and CMVS algorithms [8], [9] to reconstruct camera parameters and a tree point cloud, and then reconstructed the tree models from the point cloud through interactive editing. This process is tedious, especially when the reconstructed point cloud is very incomplete. ...
... Furthermore, we compared our method with the state-of-the-art volumetric reconstruction method Instant-NGP and showed the advantage of our method in realistic reconstruction. First, we compared the results of our volumetric point cloud reconstruction method with point clouds generated from VisualSFM [8] and COLMAP [9]. Tab. ...
Article
Full-text available
The realistic reconstruction of real-world trees is a challenging task in the computer graphics community because natural trees have complex structures of branches and leaves. Existing terrestrial laser scanning (TLS) systems are able to capture dense and precise tree point clouds, yet a TLS system is expensive and not easy to carry around. An alternative low-cost and portable way is the reconstruction of a tree point cloud from multiple view images. However, it is usually difficult to reconstruct a complete tree point cloud because of the texture similarity of branches and leaves as well as the lack of a sufficient number of images. Thus, we propose a new approach for reconstructing tree point clouds and geometries from sparse images. We first infer the camera parameters of each image, and then calculate the bounding volume of a tree from the camera parameters. Next, we set the mask of each image and the resolution of the voxels, and then project each voxel in 3D space to all the mask images to determine the validity of the voxel. To alleviate the missed deletion of valid voxels, we utilize a boundary threshold and adjust the mask resolution for robust point cloud reconstruction. Finally, an efficient tree reconstruction method is proposed to generate plausible tree geometries. We tested 6 different tree species that contain deciduous and evergreen trees, and the results showed that our approach is able to generate a complete tree point cloud and realistic tree models even from a small number of images. The paper can be accessed from (https://authors.elsevier.com/a/1jDiFMFvICjPu).
... Modern unmanned aerial vehicles (UAVs) equipped with cameras have become crucial in several fields, such as surveying and mapping, geographic information systems (GIS), and digital city modeling. To achieve accurate localization and create 3D representations of real-world scenes, techniques like image- or video-based structure from motion (SfM) and visual simultaneous localization and mapping (VSLAM) are utilized [1][2][3][4][5][6][7][8][9][10]. However, it is important to note that there is a relatively limited amount of research on large-size video-based SfM specifically designed for outdoor UAVs. ...
... These methods can be classified into incremental SfM, global SfM, and hybrid SfM, based on the manner in which camera poses are estimated. Currently available open-source incremental SfM algorithms, such as Bundler [1], VisualSfM [2], and COLMAP [3,24,25], provide a solid foundation for SfM research. Mainstream global SfM methods [4,5,26,27] estimate all camera poses and perform a global BA to refine the camera poses and reconstruction scene, resulting in better scalability and efficiency. ...
Article
Full-text available
Modern UAVs (unmanned aerial vehicles) equipped with video cameras can provide large-scale high-resolution video data. This poses significant challenges for structure from motion (SfM) and simultaneous localization and mapping (SLAM) algorithms, as most of them are developed for relatively small-scale and low-resolution scenes. In this paper, we present a video-based SfM method specifically designed for high-resolution large-size UAV videos. Despite the wide range of applications for SfM, performing mainstream SfM methods on such videos poses challenges due to their high computational cost. Our method consists of three main steps. Firstly, we employ a visual SLAM (VSLAM) system to efficiently extract keyframes, keypoints, initial camera poses, and sparse structures from downsampled videos. Next, we propose a novel two-step keypoint adjustment method. Instead of matching new points in the original videos, our method effectively and efficiently adjusts the existing keypoints at the original scale. Finally, we refine the poses and structures using a rotation-averaging constrained global bundle adjustment (BA) technique, incorporating the adjusted keypoints. To enrich the resources available for SLAM or SfM studies, we provide a large-size (3840 × 2160) outdoor video dataset with millimeter-level-accuracy ground control points, which supplements the current relatively low-resolution video datasets. Experiments demonstrate that, compared with other SLAM or SfM methods, our method achieves an average efficiency improvement of 100% on our collected dataset and 45% on the EuRoc dataset. Our method also demonstrates superior localization accuracy when compared with state-of-the-art SLAM or SfM methods.
... Structure from Motion (SfM) has been a pivotal topic in the fields of computer vision, robotics, and photogrammetry, and is widely applied in augmented reality (Liu et al., 2019), autonomous driving (Sarlin et al., 2021; Sarlin et al., 2019; Brachmann et al., 2021), and 3D reconstruction (Schönberger et al., 2016). Heretofore, many impressive SfM approaches have been extensively studied, mainly including Incremental SfM (Schönberger et al., 2016; Wu, 2013; Agarwal et al., 2009; Frahm et al., 2010; Wang et al., 2018), Hierarchical SfM (Gherardi et al., 2010; Toldo et al., 2015; Farenzena et al., 2009; Havlena et al., 2009) and Global SfM (Jiang et al., 2013; Cui et al., 2015; Wilson et al., 2014; Kasten et al., 2019; Zhuang et al., 2018; Arrigoni et al., 2016; Arie-Nachimson et al., 2012), depending on the procedure by which images are registered. However, these SfM methods predominantly operate in an offline manner, i.e., images are first captured, feature extraction/matching and epipolar geometry validation are then performed using all images, and one specific SfM method is selected to estimate the poses of all images and the corresponding sparse point cloud. ...
... Based on the local block consisting of ℎ and weighting , we establish a new efficient and robust local BA with hierarchical weights. Equation (5) denotes the original reduced normal equation with only camera parameters (Wu, 2013). ...
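For context, the "reduced normal equation with only camera parameters" referred to here is the standard Schur-complement reduction used in bundle adjustment. In generic textbook notation (not necessarily the exact symbols of the cited work's Equation (5)), partitioning the normal equations into camera and point blocks gives:

```latex
\begin{pmatrix} U & W \\ W^{\top} & V \end{pmatrix}
\begin{pmatrix} \delta_{c} \\ \delta_{p} \end{pmatrix}
=
\begin{pmatrix} \varepsilon_{c} \\ \varepsilon_{p} \end{pmatrix},
\qquad
\underbrace{\left( U - W V^{-1} W^{\top} \right)}_{\text{reduced camera system}} \delta_{c}
= \varepsilon_{c} - W V^{-1} \varepsilon_{p}.
```

Because V is block-diagonal (one small block per 3D point), V^{-1} is cheap to apply, and the reduced system involves only the camera updates δ_c; this is the system that PCG-based BA solvers iterate on.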
Article
Full-text available
Over the last decades, ample achievements have been made in Structure from Motion (SfM). However, the vast majority of them basically work in an offline manner, i.e., images are first captured and then fed together into a SfM pipeline for obtaining poses and a sparse point cloud. In this work, on the contrary, we present an on-the-fly SfM: running online SfM while images are being captured, the newly taken On-the-Fly image is estimated online with the corresponding pose and points, i.e., what you capture is what you get. Specifically, our approach first employs a vocabulary tree that is trained in an unsupervised manner using learning-based global features for fast image retrieval of the newly fly-in image. Then, a robust feature matching mechanism with least squares (LSM) is presented to improve image registration performance. Finally, by investigating the influence of the newly fly-in image's connected neighboring images, an efficient hierarchical weighted local bundle adjustment (BA) is used for optimization. Extensive experimental results demonstrate that on-the-fly SfM can meet the goal of robustly registering images while capturing them in an online way.
... To generate the 3D point cloud, the open-source and free software VisualSFM was used (Wu et al. 2011; Wu 2013). This applies the SiftGPU algorithm, which uses the graphics processing unit (GPU) for image processing (Wu 2007). ...
... We used 97 pictures to capture each fracture surface. VisualSFM reconstructs the camera position and orientation for each picture to produce a 3D model of the imaged object (Marsch et al. 2020; Wu 2013). The resulting point cloud is sparse and consists ...
Article
Full-text available
Knowledge of fracture properties and associated flow processes is important for geoscience applications such as nuclear waste disposal, geothermal energy and hydrocarbons. An important tool established in recent years is hydro-mechanical modeling, which provides a useful alternative to experimental methods for determining single-fracture parameters such as hydraulic aperture. A crucial issue for meaningful numerical modeling is precise imaging of the fracture surfaces to capture geometrical information. Hence, we apply and compare three distinct fracture surface imaging methods: (1) handheld laser scanner (HLS), (2) mounted laser scanner (MLS) and (3) Structure from Motion (SfM) to a bedding plane fracture of sandstone. The imaging reveals that the resolution of the fracture surface obtained from the handheld laser scanner (HLS) is insufficient for numerical simulation; this method was therefore rejected. The remaining surfaces are subsequently matched and the resulting fracture dataset is used for detailed fracture flow simulations. The resulting hydraulic aperture is calibrated with laboratory measurements using a handheld air permeameter. The air permeameter data provide a hydraulic aperture of 81 ± 1 µm. For calibration, mechanical aperture fields are calculated using stepwise increasing contact areas up to 15%. At 5% contact area, the average hydraulic aperture obtained by MLS (85 µm) is close to the measurement. For SfM, the measurements are fitted at 7% contact area (83 µm). The flow simulations reveal preferential flow through major channels that are structurally and geometrically predefined. Thus, this study illustrates that the resolution and accuracy of the imaging device strongly affect the quality of fluid flow simulations and that SfM provides a promising low-cost method for fracture imaging on cores or even outcrops.
... Accurate camera poses are essential for NeRF models to converge and obtain consistent color and occupancy. Classical Structure-from-Motion (SfM) [1,5,22] was an effective offline way to guarantee the accuracy of camera poses, as well as the sparse scene structure. As proposed in [4,23]- [26], jointly optimizing camera poses and NeRF models during training can further improve the model consistency, meanwhile reducing the requirements for very accurate camera poses. ...
Preprint
Recently, Neural Radiance Fields (NeRF) achieved impressive results in novel view synthesis. Block-NeRF showed the capability of leveraging NeRF to build large city-scale models. For large-scale modeling, a mass of image data is necessary. Collecting images from specially designed data-collection vehicles cannot support large-scale applications. How to acquire massive high-quality data remains an open problem. Noting that the automotive industry has a huge amount of image data, crowd-sourcing is a convenient way for large-scale data collection. In this paper, we present a crowd-sourced framework, which utilizes substantial data captured by production vehicles to reconstruct the scene with the NeRF model. This approach solves the key problem of large-scale reconstruction, that is, where the data comes from and how to use it. Firstly, the crowd-sourced massive data is filtered to remove redundancy and keep a balanced distribution in terms of time and space. Then a structure-from-motion module is performed to refine camera poses. Finally, images, as well as poses, are used to train the NeRF model in a certain block. We highlight that we present a comprehensive framework that integrates multiple modules, including data selection, sparse 3D reconstruction, sequence appearance embedding, depth supervision of the ground surface, and occlusion completion. The complete system is capable of effectively processing and reconstructing high-quality 3D scenes from crowd-sourced data. Extensive quantitative and qualitative experiments were conducted to validate the performance of our system. Moreover, we propose an application, named first-view navigation, which leverages the NeRF model to generate 3D street views and guide the driver with a synthesized video.
... In the next step, the VisualSFM program was used with the Clustering Views for Multi-view Stereo / Patch-based Multi-view Stereo (CMVS/PMVS) algorithm 9 . Changchang Wu developed VisualSFM, a fast-running application (with multicore parallelism) for feature detection, feature matching, and bundle adjustment 60 . VisualSFM was, for example, used to monitor the position of a cliff in Ault in Northern France 61 . ...
Article
Full-text available
In studies of the relief evolution of smaller landforms, up to several dozen meters in width/diameter, digital elevation models (DEMs) freely accessible in different repositories may be insufficient in terms of resolution. Existing geophysical or photogrammetric equipment is not always available due to costs, conditions and regulations, especially for students or young researchers. An alternative may be the handheld ground-based Structure from Motion technique. It allows us to obtain free high-resolution DEMs (~0.05 m) using open-source software. The method was tested on kettle holes of glacial flood origin on Skeiðarársandur (S Iceland). The material was collected in 2022 at two outwash levels of different ages and vegetation cover. The dataset is available in the Zenodo repository; the first part is data processed into point clouds and DEMs, and the second includes the original videos in MOV format. The data can be used as a reference to assess changes in the kettle hole relief in subsequent research seasons, as a methodological study for other projects, or for didactic purposes.
... This dataset was used to count the number of cattle in a pasture area by using multiple images (Shao et al., 2020), in which the detection results from multiple images were merged by reconstructing a 3-D model using the Structure from Motion (SfM) technique (Wu, 2013) to avoid duplicate counting. Shao et al. (2020) collected aerial images at 4,000 × 3,000 pixels resolution and curated two sub-datasets. ...
Preprint
Full-text available
Technology-driven precision livestock farming (PLF) empowers practitioners to monitor and analyze animal growth and health conditions for improved productivity and welfare. Computer vision (CV) is indispensable in PLF, using cameras and computer algorithms to supplement or supersede manual efforts for livestock data acquisition. Data availability is crucial for developing innovative monitoring and analysis systems through artificial intelligence-based techniques. However, data curation processes are tedious, time-consuming, and resource-intensive. This study presents the first systematic survey of publicly available livestock CV datasets (https://github.com/Anil-Bhujel/Public-Computer-Vision-Dataset-A-Systematic-Survey). Among the 58 public datasets identified and analyzed, encompassing different species of livestock, almost half are for cattle, followed by swine, poultry, and other animals. Individual animal detection and color imaging are the dominant application and imaging modality for livestock. The characteristics and baseline applications of the datasets are discussed, emphasizing the implications for animal welfare advocates. Challenges and opportunities are also discussed to inspire further efforts in developing livestock CV datasets. This study highlights that the limited quantity of high-quality annotated datasets collected from diverse environments, animals, and applications, together with the absence of contextual metadata, is a real bottleneck in PLF.
... Obtaining high-fidelity 3D models from real-world environments is pivotal for enabling immersive experiences in augmented reality (AR) and virtual reality (VR). This paper focuses exclusively on surface reconstruction under given poses, which can be readily computed using SLAM [5], [7], [8] or SFM [43], [51], [57] methods. ...
Preprint
Full-text available
Recently, 3D Gaussian Splatting (3DGS) has attracted widespread attention due to its high-quality rendering, and ultra-fast training and rendering speed. However, due to the unstructured and irregular nature of Gaussian point clouds, it is difficult to guarantee geometric reconstruction accuracy and multi-view consistency simply by relying on image reconstruction loss. Although many studies on surface reconstruction based on 3DGS have emerged recently, the quality of their meshes is generally unsatisfactory. To address this problem, we propose a fast planar-based Gaussian splatting reconstruction representation (PGSR) to achieve high-fidelity surface reconstruction while ensuring high-quality rendering. Specifically, we first introduce an unbiased depth rendering method, which directly renders the distance from the camera origin to the Gaussian plane and the corresponding normal map based on the Gaussian distribution of the point cloud, and divides the two to obtain the unbiased depth. We then introduce single-view geometric, multi-view photometric, and geometric regularization to preserve global geometric accuracy. We also propose a camera exposure compensation model to cope with scenes with large illumination variations. Experiments on indoor and outdoor scenes show that our method achieves fast training and rendering while maintaining high-fidelity rendering and geometric reconstruction, outperforming 3DGS-based and NeRF-based methods.
... Building on the success of removing temporary construction machinery from UAV images, the study advances into the phase of 3D reconstruction of the current construction sites. Structure from Motion (SfM) [28] and Multi-View Stereo (MVS) [29] techniques are employed to create detailed point clouds. This integration of advanced image processing with 3D reconstruction technologies ensures that the final digital models accurately represent the actual state of the construction sites, significantly enhancing project management and planning capabilities. ...
... Traditional 3D reconstruction methods represent scenes as point clouds or meshes, with numerous notable achievements in this area [5]. COLMAP [14] is a representative of incremental Structure-from-Motion (SfM) methods [15,16,17,18]. COLMAP [14] reconstructs 3D scenes by extracting feature points from images and performing feature matching, triangulation, and bundle adjustment. ...
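The triangulation and bundle adjustment steps mentioned here minimize total reprojection error over all cameras and points; in generic notation (standard symbols, not taken from the cited papers):

```latex
\min_{\{P_i\},\,\{X_j\}} \; \sum_{(i,j) \in \mathcal{V}}
\rho\!\left( \left\lVert \pi\!\left(P_i, X_j\right) - x_{ij} \right\rVert^{2} \right)
```

where \pi projects point X_j into camera i, x_{ij} is the observed feature location, \mathcal{V} is the set of visible camera-point pairs, and \rho is a robust loss that limits the influence of outlier matches.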
Preprint
Road surface reconstruction plays a crucial role in autonomous driving and can be used for road lane perception and auto-labeling tasks. Recently, mesh-based road surface reconstruction algorithms have shown promising reconstruction results. However, these mesh-based methods suffer from slow speed and poor rendering quality. In contrast, 3D Gaussian Splatting (3DGS) shows superior rendering speed and quality. Although 3DGS employs explicit Gaussian spheres to represent the scene, it lacks the ability to directly represent the geometric information of the scene. To address this limitation, we propose a novel large-scale road surface reconstruction approach based on 2D Gaussian Splatting (2DGS), named RoGS. The geometric shape of the road is explicitly represented using 2D Gaussian surfels, where each surfel stores color, semantics, and geometric information. Compared to Gaussian spheres, Gaussian surfels align more closely with the physical reality of the road. Distinct from previous initialization methods that rely on point clouds for Gaussian spheres, we introduce a trajectory-based initialization for Gaussian surfels. Thanks to the explicit representation of the Gaussian surfels and a good initialization, our method achieves a significant acceleration while improving reconstruction quality. We achieve excellent results in reconstructing road surfaces in a variety of challenging real-world scenes.
... Bao et al. [23] introduced a semantic structure-from-motion algorithm that enhances robustness through the recognition and estimation of high-level semantic information, such as regions and objects, in the 3D scene. Wu [24] presented the VisualSFM algorithm, which employs preconditioned conjugate gradients to improve computational efficiency while maintaining accuracy. Schönberger et al. [25] introduced the COLMAP algorithm, enhancing key steps such as geometric verification, viewpoint selection, and triangulation. ...
Article
Full-text available
Three-dimensional reconstruction plays a crucial role in capturing plant phenotypes and expediting the process of agricultural informatization. However, the reconstruction of small objects such as plant specimens and grains often faces challenges like low two-dimensional image resolution and sparse textures. To enhance the three-dimensional reconstruction of plant specimens like wheat grains for comprehensive phenotypic characterization, this study proposes a novel super-resolution reconstruction network called T-transformer net. The network leverages the self-attention mechanism of Transformers to extract extensive global information from spatial sequences. By employing an hourglass block structure to construct spatial attention units and combining channel attention with window-based self-attention schemes, it effectively harnesses their complementary advantages. This encompasses utilizing global statistical data while capitalizing on potent local fitting capabilities. Evaluation of the model on the publicly available datasets Set5, Set14, and Manga109 demonstrates superior overall performance of T-transformer net compared to mainstream super-resolution algorithms at upscaling factors of 2x, 3x, and 4x. In super-resolution tasks involving wheat grain datasets, the peak signal-to-noise ratio reaches 42.89 dB, and the structural similarity index attains 0.9643. Subsequently, we subject the super-resolved wheat grain images to three-dimensional reconstruction. Through comprehensive extraction of high-level semantic information by neural networks, the reconstruction accuracy is improved by 38.96% compared with the unprocessed images, effectively mitigating challenges arising from sparse textures and repetitive patterns in wheat grain structures. This study contributes valuable methodology and insights to the realm of three-dimensional reconstruction in botany, holding significant implications for advancing agricultural informatization.
... Motion, which involves restoring the 3D structure of a scene through image sequences, is a crucial milestone in vision-based 3D reconstruction. SFM is mainly divided into four groups: incremental SFM [93], global SFM [94], hybrid SFM [95], and hierarchical SFM [96]. ...
Article
Full-text available
With the rapid development of 3D reconstruction, especially the emergence of algorithms such as NeRF and 3DGS, 3D reconstruction has become a popular research topic in recent years. 3D reconstruction technology provides crucial support for training extensive computer vision models and advancing the development of general artificial intelligence. With the development of deep learning and GPU technology, the demand for high-precision and high-efficiency 3D reconstruction information is increasing, especially in the fields of unmanned systems, human-computer interaction, virtual reality, and medicine. The rapid development of 3D reconstruction is becoming inevitable. This survey categorizes the various methods and technologies used in 3D reconstruction. It explores and classifies them based on three aspects: traditional static methods, dynamic methods, and machine learning methods, and then compares and discusses them. The survey closes with a detailed analysis of the trends and challenges in 3D reconstruction development, aiming to provide a comprehensive introduction for individuals who are currently engaged in or planning to conduct research on 3D reconstruction, and to help them gain a comprehensive understanding of the relevant knowledge.
... Current mainstream methods for large-scale multi-view 3D reconstruction, such as structure from motion (SfM) [5][6][7][8], Multi-view stereo (MVS) [9][10][11], and simultaneous localization and mapping (SLAM) [12,13], have been successfully integrated into various commercial software, producing satisfactory high-quality scene models. However, challenges arise when applying these methods to UAV video data, particularly in complex large-scale geographic environments, such as areas with diverse terrain like plateaus, plains, hills, and mountains. ...
Article
Full-text available
In unmanned aerial vehicle (UAV) large-scale scene modeling, challenges such as missed shots, low overlap, and data gaps due to flight paths and environmental factors, such as variations in lighting, occlusion, and weak textures, often lead to incomplete 3D models with blurred geometric structures and textures. To address these challenges, an implicit–explicit coupling enhancement for a UAV large-scale scene modeling framework is proposed. Benefiting from the mutual promotion of implicit and explicit models, we initially address the issue of missing co-visibility clusters caused by environmental noise through large-scale implicit modeling with UAVs. This enhances the inter-frame photometric and geometric consistency. Subsequently, we enhance the multi-view point cloud reconstruction density via synthetic co-visibility clusters, effectively recovering missing spatial information and constructing a more complete dense point cloud. Finally, during the mesh modeling phase, high-quality 3D modeling of large-scale UAV scenes is achieved by inversely radiating and mapping additional texture details into 3D voxels. The experimental results demonstrate that our method achieves state-of-the-art modeling accuracy across various scenarios, outperforming existing commercial UAV aerial photography software (COLMAP 3.9, Context Capture 2023, PhotoScan 2023, Pix4D 4.5.6) and related algorithms.
... Due to the presence of noise and drift in pose and 3D point estimation, it is necessary to optimize the camera poses using the bundle adjustment (BA) algorithm after incorporating a certain number of new image pairs. In 2013, Wu introduced VisualSFM [23], which improved matching speed through a preemptive feature-matching strategy and accelerated sparse reconstruction using a local-global bundle-adjustment technique. As cameras are added, optimization is performed only on a local subset of images; whenever the overall model grows past a certain scale, optimization is applied to all images, thus improving the reconstruction speed. ...
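The local/global scheduling idea described in this passage can be sketched in a few lines. This is a minimal illustration; the helper name and the growth ratio are assumptions for the demo, not VisualSFM's exact parameters:

```python
def ba_schedule(num_cameras_history, growth_ratio=1.05):
    """For each reconstruction step, decide between a full (global) bundle
    adjustment and a cheap local one: run global BA only when the model has
    grown past `growth_ratio` times its size at the last full BA."""
    last_full = 1
    decisions = []
    for n in num_cameras_history:
        if n >= growth_ratio * last_full:
            decisions.append("global")  # optimize all cameras and points
            last_full = n
        else:
            decisions.append("local")   # optimize only recently added cameras
    return decisions
```

Because full runs are triggered geometrically, global BA fires often while the model is small and rarely once it is large, which is the speed/accuracy balance the cited strategy aims for.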
Article
Full-text available
Three-dimensional reconstruction is a key technology employed to represent virtual reality in the real world, which is valuable in computer vision. Large-scale 3D models have broad application prospects in the fields of smart cities, navigation, virtual tourism, disaster warning, and search-and-rescue missions. Unfortunately, most image-based studies currently prioritize the speed and accuracy of 3D reconstruction in indoor scenes. While there are some studies that address large-scale scenes, there has been a lack of systematic comprehensive efforts to bring together the advancements made in the field of 3D reconstruction in large-scale scenes. Hence, this paper presents a comprehensive overview of a 3D reconstruction technique that utilizes multi-view imagery from large-scale scenes. In this article, a comprehensive summary and analysis of vision-based 3D reconstruction technology for large-scale scenes are presented. The 3D reconstruction algorithms are extensively categorized into traditional and learning-based methods. Furthermore, these methods can be categorized based on whether the sensor actively illuminates objects with light sources, resulting in two categories: active and passive methods. Two active methods, namely, structured light and laser scanning, are briefly introduced. The focus then shifts to structure from motion (SfM), stereo matching, and multi-view stereo (MVS), encompassing both traditional and learning-based approaches. Additionally, a novel approach of neural-radiance-field-based 3D reconstruction is introduced. The workflow and improvements in large-scale scenes are elaborated upon. Subsequently, some well-known datasets and evaluation metrics for various 3D reconstruction tasks are introduced. Lastly, a summary of the challenges encountered in the application of 3D reconstruction technology in large-scale outdoor scenes is provided, along with predictions for future trends in development.
... parallel lines, surfaces), to achieve the estimation of the change of the target object's pose. Monocular RGB cameras are widely used in AR [4,5], visual SLAM [6,7,8], 3D reconstruction [9,10], and other fields. The way the camera estimates the scene depth relies heavily on the feature matching between frames. ...
Article
With the development of social networks and hardware devices, many young people have posted high-definition v-logs containing selfie images and videos to commemorate and share their daily lives. We found that the reflected image at the corneal position in a high-definition selfie can reveal the position and posture of the selfie taker. Classic localization methods struggle to estimate position and posture from a selfie because they lack knowledge of the environment. Corneal reflection images inherently carry information about the surrounding environment, which can reveal the location, posture, and even height of the selfie taker. We analyze the corneal reflection imaging process in the selfie scenario and design a validation experiment based on this process to estimate the selfie taker's pose in several scenarios, further evaluating the leakage of the selfie taker's pose information.
... 3D reconstruction based on photogrammetry was performed after each test to quantify the scour depth and scour hole volume. For each trial, approximately two hundred images from different locations and orientations were taken to reconstruct the scoured scenario ( Fig. 5a) using the software VisualSFM, in which the pixel points of these images were extracted and re-organized to establish a 3D structure that includes abundant cloud points [39,41]. These mesh points were fed into the software CloudCompare to be normalized from the virtual scale to the real scale (Fig. 5b). ...
Article
Full-text available
The intricate prop root system of mangrove forests forms a natural barrier that traps the sediments and reduces coastline erosion, which provides a design inspiration to mitigate local scour around a monopile foundation. In this study, a ring of mini skirt piles with various spacings was proposed to mimic the mangrove prop root system. It is hypothesized that, in addition to hydraulic benefits, installation of the skirt piles also densifies and strengthens the sediments around the monopile and thus enhances the shear strength against erosion. The hypothesis was first tested with laboratory flume experiments considering four different installation sequences. The discrete element method was then used to model the pile installation process and investigate the evolution of sediment density and stresses. The flume tests validated that installation of the skirt piles reduces scour potential. The simulation results revealed that the installation of skirt piles causes densification of the sediments and strengthens the contact forces. Such effects were more pronounced when skirt pile spacing was smaller. Both numerical and experimental results indicate that the installation of skirt piles provides geotechnical benefits as part of the scour countermeasure.
... A calibration is needed if the relative position of the markers is not precisely controlled. The calibration of a locator can be seen as a Structure-from-Motion (SfM) problem [12,27] which can be solved offline to obtain an accurate model of the relative position of the fiducial markers within the locator [28,29]. ...
Article
Full-text available
In the field of medical applications, precise localization of medical instruments and bone structures is crucial to ensure computer-assisted surgical interventions. In orthopedic surgery, existing devices typically rely on stereoscopic vision. Their purpose is to aid the surgeon in screw fixation of prostheses or bone removal. This article addresses the challenge of localizing a rigid object consisting of randomly arranged planar markers using a single camera. This approach is especially vital in medical situations where accurate object alignment relative to a camera is necessary at distances ranging from 80 cm to 120 cm. In addition, the size limitation of a few tens of centimeters ensures that the resulting locator does not obstruct the work area. This rigid locator consists of a solid at the surface of which a set of plane markers (ArUco) are glued. These plane markers are randomly distributed over the surface in order to systematically have a minimum of two visible markers whatever the orientation of the locator. The calibration of the locator involves finding the relative positions between the individual planar elements and is based on a bundle adjustment approach. One of the main and known difficulties associated with planar markers is the problem of pose ambiguity. To solve this problem, our method lies in the formulation of an efficient initial solution for the optimization step. After the calibration step, the reached positioning uncertainties of the locator are better than two-tenth of a cubic millimeter and one-tenth of a degree, regardless of the orientation of the locator in space. To assess the proposed method, the locator is rigidly attached to a stylus of about twenty centimeters length. Thanks to this approach, the tip of this stylus seen by a 16.1 megapixel camera at a distance of about 1 m is localized in real time in a cube lower than 1 mm side. A surface registration application is proposed by using the stylus on an artificial scapula.
... Jancosek and Pajdla (2011), Moulon et al. (2012), Wu et al. (2011), Wu (2013), OpenDroneMap Authors (2014), WebODM Authors (2017), Fuhrmann et al. (2014), Hiestand (2015), Moulon et al. (2016), Schönberger and Frahm (2016) ...
Technical Report
When performing yield prediction and weed identification by analyzing a three-dimensional (3D) model of a farm constructed from aerial images, it is useful to exclude in advance the forests, houses, warehouses, and work roads included in the model. In this study, we introduce a process for selecting only the grassland part from the 3D model of a farm. Assuming the use of the point cloud classification function and digital surface model provided by the model construction software, the selection process is implemented in the programming language Python. The process consists of the following three steps: 1) construction of a 3D model after performing point cloud classification to obtain a digital surface model and an orthomosaic image excluding most of the forests and buildings, 2) complete removal of the noise remaining in the forest part using the slope attribute of the digital surface model, and 3) removal of the fields not subject to analysis using the color information of work roads. We use the Agisoft Metashape Professional software to perform 1) and a Python script to perform 2) and 3), including a user interface to visually execute the process. We apply these image-processing steps to the 3D models constructed from aerial images taken with three types of small multicopters at three different times in two different grasslands. By applying this method, we can easily select the image of distorted grassland fields.
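Steps 2) and 3) above amount to simple raster masking. A minimal sketch, assuming a per-pixel slope array from the DSM and an RGB orthomosaic; the thresholds and the idea that work roads are roughly gray are illustrative assumptions, not the report's exact values:

```python
import numpy as np

def grassland_mask(slope_deg, rgb, max_slope=15.0, gray_tol=20.0):
    """Return a boolean mask of likely-grassland pixels: keep gently sloped
    cells (removing residual forest noise via the DSM slope) and drop
    grayish pixels whose color matches work roads."""
    gentle = slope_deg < max_slope                      # step 2: slope filter
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    # step 3: flag low-saturation (grayish) pixels as probable work roads
    grayish = (np.abs(r - g) < gray_tol) & (np.abs(g - b) < gray_tol)
    return gentle & ~grayish
```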
... We first employ incremental Structure-from-Motion (SfM) [Wu13,Ull79] to estimate the mapping from the camera image to world coordinates. We optimize for both extrinsic and intrinsic camera parameters by minimizing the mean squared projection error of AR markers. ...
Article
Full-text available
Although digital painting has advanced much in recent years, there is still a significant divide between physically drawn paintings and purely digitally drawn paintings. These differences arise due to the physical interactions between the brush, ink, and paper, which are hard to emulate in the digital domain. Most ink painting approaches have focused on either using heuristics or physical simulation to attempt to bridge the gap between digital and analog, however, these approaches are still unable to capture the diversity of painting effects, such as ink fading or blotting, found in the real world. In this work, we propose a data‐driven approach to generate ink paintings based on a semi‐automatically collected high‐quality real‐world ink painting dataset. We use a multi‐camera robot‐based setup to automatically create a diversity of ink paintings, which allows for capturing the entire process in high resolution, including capturing detailed brush motions and drawing results. To ensure high‐quality capture of the painting process, we calibrate the setup and perform occlusion‐aware blending to capture all the strokes in high resolution in a robust and efficient way. Using our new dataset, we propose a recursive deep learning‐based model to reproduce the ink paintings stroke by stroke while capturing complex ink painting effects such as bleeding and mixing. Our results corroborate the fidelity of the proposed approach to real hand‐drawn ink paintings in comparison with existing approaches. We hope the availability of our dataset will encourage new research on digital realistic ink painting techniques.
... The software is capable of processing a collection of images and estimating camera poses, a 3D point cloud, and camera calibration parameters. VisualSFM is a GUI application for 3D reconstruction using structure from motion, built on the linear-time incremental approach of (Wu 2013). VisualSFM uses feature-based methods to identify keypoints (distinctive points) in the images. ...
Article
Exploiting photogrammetric computer vision techniques to generate point cloud data for 3D scene understanding has seen many research improvements in the last decade. Open-source research and algorithm development have provided benefits and intellectual capacity to researchers and developers for understanding and solving problems from multiple perspectives. This study focuses on the open-source domain for photogrammetry and provides a walkthrough of recent developments in extracting 3D information from 2D images in the context of point clouds. Four free and open-source software packages (VisualSFM, WebODM, Colmap, Meshroom) were studied from the perspective of their point cloud generation capability and photogrammetric workflow to provide a comparative assessment. Each package is also assessed for its usability and workflow functions. UAV-based photographs were acquired for the study area, and using the same datasets and default parameters in each package, dense photogrammetric point clouds were generated with each package's own photogrammetric workflow. For each of these dense point clouds, quality and enriched information are assessed based on several robust parameters.
... Cambridge Landmarks consists of five scenes with varying numbers of images, ranging from 231 to 3015. The reference pose of each image was reconstructed using the VisualSFM structure-from-motion software [70,69], and the images are divided into training and testing sets. To obtain more accurate reference data, we ran COLMAP [63], obtaining a dense point cloud. ...
Conference Paper
Full-text available
We present a new solver for estimating a surface normal from a single affine correspondence in two calibrated views. The proposed approach provides a new globally optimal solution for this over-determined problem and proves that it reduces to a linear system that can be solved extremely efficiently. This allows for performing significantly faster than other recent methods, solving the same problem and obtaining the same globally optimal solution. We demonstrate on 15k image pairs from standard benchmarks that the proposed approach leads to the same results as other optimal algorithms while being, on average, five times faster than the fastest alternative. Besides its theoretical value, we demonstrate that such an approach has clear benefits, e.g., in image-based visual localization, due to not requiring a dense point cloud to recover the surface normal. We show on the Cambridge Landmarks dataset that leveraging the proposed surface normal estimation further improves localization accuracy. Matlab and C++ implementations are also published in the supplementary material.
... These methods can be grouped into two categories: the first is based on the number of matched correspondences, while the second uses the similarity score computed from image descriptors. For the former, two images are labeled as a valid match pair when the number of matches surpasses a threshold, such as the multi-scale strategy [24] and the preemptive matching strategy [25]. For the latter, images are quantified as descriptors, and the similarity score between two images is calculated as the distance between two descriptors. ...
Article
Full-text available
SfM (Structure from Motion) has been extensively used for UAV (Unmanned Aerial Vehicle) image orientation. Its efficiency is directly influenced by feature matching. Although image retrieval has been extensively used for match pair selection, high computational costs are consumed due to a large number of local features and the large size of the used codebook. Thus, this paper proposes an efficient match pair retrieval method and implements an integrated workflow for parallel SfM reconstruction. First, an individual codebook is trained online by considering the redundancy of UAV images and local features, which avoids the ambiguity of training codebooks from other datasets. Second, local features of each image are aggregated into a single high-dimension global descriptor through the VLAD (Vector of Locally Aggregated Descriptors) aggregation by using the trained codebook, which remarkably reduces the number of features and the burden of nearest neighbor searching in image indexing. Third, the global descriptors are indexed via the HNSW (Hierarchical Navigable Small World) based graph structure for the nearest neighbor searching. Match pairs are then retrieved by using an adaptive threshold selection strategy and utilized to create a view graph for divide-and-conquer based parallel SfM reconstruction. Finally, the performance of the proposed solution has been verified using three large-scale UAV datasets. The test results demonstrate that the proposed solution accelerates match pair retrieval with a speedup ratio ranging from 36 to 108 and improves the efficiency of SfM reconstruction with competitive accuracy in both relative and absolute orientation.
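The VLAD aggregation step described above can be sketched as follows; the codebook and descriptor dimensions here are toy values, not the paper's settings:

```python
import numpy as np

def vlad(descriptors, codebook):
    """Aggregate local descriptors (N, D) into one global VLAD vector (K*D,):
    assign each descriptor to its nearest codebook centroid, sum the
    residuals per centroid, and L2-normalize the concatenation."""
    # squared distance from every descriptor to every centroid
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    assign = d2.argmin(axis=1)          # nearest centroid per descriptor
    K, D = codebook.shape
    v = np.zeros((K, D))
    for k in range(K):
        members = descriptors[assign == k]
        if len(members):
            v[k] = (members - codebook[k]).sum(axis=0)  # residual sum
    v = v.ravel()
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v
```

Replacing thousands of local features per image with one such fixed-length vector is what makes the subsequent nearest-neighbor indexing (e.g., HNSW) tractable.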
Article
High-resolution landslide images are required for detailed geomorphological analysis in complex topographic environments with steep and vertical landslide distribution. This study proposed a vertical route planning method for unmanned aerial vehicles (UAVs), which could achieve rapid image collection based on strictly calculated route parameters. The effectiveness of this method was verified using a DJI Mavic 2 Pro, obtaining high-resolution landslide images within the Dongchuan debris flow gully, in the Xiaojiang River Basin, Dongchuan District, Yunnan, China. A three-dimensional (3D) model was constructed by structure-from-motion and multi-view stereo (SfM-MVS). Micro-geomorphic features were analyzed through visual interpretation, geographic information system (GIS) spatial analysis, and mathematical statistics methods. The results demonstrated that the proposed method could obtain comprehensive vertical information on landslides while improving measurement accuracy. The 3D model was constructed using the vertically oriented flight route to achieve centimeter-level accuracy (horizontal accuracy better than 6 cm, elevation accuracy better than 3 cm, and relative accuracy better than 3.5 cm). The UAV technology could further help understand the micro internal spatial and structural characteristics of landslides, facilitating intuitive acquisition of surface details. The slope of landslide clusters ranged from 36° to 72°, with the majority of slopes facing east and southeast. Upper elevation levels were relatively consistent, while middle to lower elevation levels gradually decreased from left to right, with significant variations in lower elevation levels. During the rainy season, surface runoff was abundant, and steep topography exacerbated changes in surface features. This route method is suitable for UAV landslide surveys in complex mountainous environments. The geomorphological analysis methods used will provide references for identifying and describing topographic features.
Article
The removal of outliers is crucial for establishing correspondence between two images. However, when the proportion of outliers reaches nearly 90%, the task becomes highly challenging. Existing methods face limitations in effectively utilizing geometric transformation consistency (GTC) information and incorporating geometric semantic neighboring information. To address these challenges, we propose a Multi-Stage Geometric Semantic Attention (MSGSA) network. The MSGSA network consists of three key modules: the multi-branch (MB) module, the GTC module, and the geometric semantic attention (GSA) module. The MB module, structured with a multi-branch design, facilitates diverse and robust spatial transformations. The GTC module captures transformation consistency information from the preceding stage. The GSA module categorizes input based on the prior stage’s output, enabling efficient extraction of geometric semantic information through a graph-based representation and inter-category information interaction using Transformer. Extensive experiments on the YFCC100M and SUN3D datasets demonstrate that MSGSA outperforms current state-of-the-art methods in outlier removal and camera pose estimation, particularly in scenarios with a high prevalence of outliers. Source code is available at https://github.com/shuyuanlin .
Article
Self-supervised Object Segmentation (SOS) aims to segment objects without any annotations. Under conditions of multi-camera inputs, the structural, textural and geometrical consistency among each view can be leveraged to achieve fine-grained object segmentation. To make better use of the above information, we propose Surface representation based Self-supervised Object Segmentation (Surface-SOS), a new framework to segment objects for each view by 3D surface representation from multi-view images of a scene. To model high-quality geometry surfaces for complex scenes, we design a novel scene representation scheme, which decomposes the scene into two complementary neural representation modules respectively with a Signed Distance Function (SDF). Moreover, Surface-SOS is able to refine single-view segmentation with multi-view unlabeled images, by introducing coarse segmentation masks as additional input. To the best of our knowledge, Surface-SOS is the first self-supervised approach that leverages neural surface representation to break the dependence on large amounts of annotated data and strong constraints. These constraints typically involve observing target objects against a static background or relying on temporal supervision in videos. Extensive experiments on standard benchmarks including LLFF, CO3D, BlendedMVS, TUM and several real-world scenes show that Surface-SOS always yields finer object masks than its NeRF-based counterparts and surpasses supervised single-view baselines remarkably. Code is available at: https://github.com/zhengxyun/Surface-SOS .
Article
Full-text available
Efficient incremental Structure from Motion (ISfM) has become the core technique for UAV (Unmanned Aerial Vehicle) image orientation. However, the characteristics of large volume, high overlap, and high resolution cause deficiencies in match pair retrieval, as well as accumulated error and low efficiency in BA (Bundle Adjustment) optimization, which degrade its performance for large-scale scenes. This study proposes a parallel SfM for UAV images via global descriptors and graph-based indexing. On the one hand, to cope with the deficiency caused by a large number of local descriptors and the large size of a codebook, an efficient match pair retrieval is designed via global descriptors and graph-based indexing, which dramatically accelerates feature matching; on the other hand, to address the deficiency of correspondence searching and the low accuracy of transformation estimation in parallel SfM, this study designs an efficient cluster merging algorithm based on an on-demand correspondence graph and bi-directional reprojection error, which achieves efficient and accurate parallel SfM. The proposed algorithm is verified using three UAV datasets, and the experimental results demonstrate that the proposed method accelerates match pair retrieval with speedup ratios ranging from 36 to 108 and dramatically improves SfM efficiency with a speedup ratio better than 30 at comparable accuracy.
Chapter
The planimetric drawing of excavations and cultural heritage has always been part of any architectural project or archaeological campaign. The use of laser scanners and SfM photogrammetry has spread and is currently established in all modern studies. In this article we analyze whether it is worthwhile or cost-effective to document all the digging layers digitally, in comparison with manual drawing on graph paper. For this comparison, two projects have been used: the excavation of nave 17 of the Mosque-Cathedral of Cordoba in 2017, and the excavation of Montilla's Castle, Córdoba, between 1999 and 2000.
Article
Feature matching is a critical problem in the field of computer vision, which serves as the foundation for many high-level computer vision applications. This article aims to improve the accuracy of feature matching by eliminating mismatches in putative matches. To achieve this goal, we propose an extremely fast approach to boosting image matching precision (FAPP). The key idea behind FAPP is that correct feature matches have similar Euclidean distances, and the sine values of the angles between correct feature matches and the horizontal axis are also similar. Consequently, putative matches can be represented as 2-D coordinate points (sine value, Euclidean distance), which makes correct and incorrect feature matches exhibit different degrees of clustering. The coordinate points are then divided into grid cells so that they are distributed across different grid areas. Through adaptive parameter estimation, we determine a threshold for the number of correct feature matches within each grid cell, thereby eliminating false feature matches. In addition, to validate the effectiveness of FAPP, we conducted experiments on two public datasets and compared the results with several existing classical methods. The experimental outcomes demonstrate the superior performance of FAPP over the existing classical methods. Furthermore, the method has been applied to 3-D reconstruction with good results. Source code: https://github.com/caomw/fapp .
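The (sine, Euclidean distance) encoding that FAPP builds on can be sketched as follows; the grid parameters are illustrative stand-ins for the paper's adaptive estimates:

```python
import math

def match_to_point(p1, p2):
    """Encode a putative match p1 -> p2 as a 2-D point: the sine of the
    displacement angle and the Euclidean displacement length."""
    dx, dy = p2[0] - p1[0], p2[1] - p1[1]
    dist = math.hypot(dx, dy)
    sine = dy / dist if dist > 0 else 0.0
    return sine, dist

def grid_counts(points, sine_bins=10, dist_bin=5.0):
    """Bucket encoded matches into grid cells; consistent (correct) matches
    pile up in a few dense cells, while mismatches scatter thinly."""
    counts = {}
    for s, d in points:
        cell = (int((s + 1.0) / 2.0 * sine_bins), int(d // dist_bin))
        counts[cell] = counts.get(cell, 0) + 1
    return counts
```

Thresholding the per-cell counts then separates the dense clusters of inliers from sparsely scattered outliers.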
Article
Particle size distribution (PSD) is an essential parameter in assessing the overall efficiency of blasting operations in mines and the subsequent mine-to-mill process in the mining industry. Despite some drawbacks of 2D image analysis techniques in accurately estimating particle sizes for the PSD, the mining industry has relied on them for the last three decades. This study proposes the 3D rock fragmentation measurement (3DFM) technique for deducing accurate dimensions of 3D rock particles for the PSD. 3DFM utilizes several processing algorithms. Images of different views of non-touching rock particles of varying sizes are acquired as the data acquisition step of structure-from-motion technology for generating a sparse point cloud. Dense point cloud reconstruction is used to recover finer details of the point cloud using clustering views for the multi-view stereo algorithm. The random sample consensus (RANSAC) algorithm, coupled with unsupervised classification using the density-based spatial clustering of applications with noise (DBSCAN) classifier, is employed to extract the rock clusters from the 3D point cloud. Finally, accurate rock sizes are derived using the hybrid bounding box rotation identification (HYBBRID) algorithm with a root mean square error (RMSE) of 0.10 cm for length, 0.10 cm for breadth, and 0.32 cm for depth. The PSD of rock fragments obtained from the proposed 3DFM technique is found to match the results of mechanical sieving and manual gauging, with R² values of 0.98 and 0.99, respectively. The 3DFM method can be considered cheaper, more accurate, and computationally faster in determining rock dimensions for PSD determination, enhancing the productivity of the mining industry.
Article
Fully perceiving the surrounding world is a vital capability for autonomous robots. To achieve this goal, a multi-camera system is usually equipped on the data collecting platform and the structure from motion (SfM) technology is used for scene reconstruction. However, although incremental SfM achieves high-precision modeling, it is inefficient and prone to scene drift in large-scale reconstruction tasks. In this paper, we propose a tailored incremental SfM framework for multi-camera systems, where the internal relative poses between cameras can not only be calibrated automatically but also serve as an additional constraint to improve the system robustness. Previous multi-camera based modeling work has mainly focused on stereo setups or multi-camera systems with known calibration information, but we allow arbitrary configurations and only require images as input. First, one camera is selected as the reference camera, and the other cameras in the multi-camera system are denoted as non-reference cameras. Based on the pose relationship between the reference and non-reference camera, the non-reference camera pose can be derived from the reference camera pose and internal relative poses. Then, a two-stage multi-camera based camera registration module is proposed, where the internal relative poses are computed first by local motion averaging, and then the rigid units are registered incrementally. Finally, a multi-camera based bundle adjustment is put forth to iteratively refine the reference camera and the internal relative poses. Experiments demonstrate that our system achieves higher accuracy and robustness on benchmark data compared to the state-of-the-art SfM and SLAM (simultaneous localization and mapping) methods.
Article
Efficient match pair selection and image feature matching directly affect the efficiency of SfM-based 3D reconstruction for UAV images. This study combines the inverted and direct index structure of the vocabulary tree to achieve the speedup of match pair selection and feature matching for UAV images. First, for match pair selection, vocabulary tree-based image retrieval has been the commonly used technique. However, it depends on the fixed number or fixed ratio threshold for match pair selection, which may cause many redundant match pairs. An adaptive vocabulary tree-based retrieval algorithm is designed for match pair selection by using the "word-image" index structure and the spatial distribution of similarity scores, and it can avoid the drawback of depending on fixed thresholds. Second, for feature matching, the nearest neighbor searching (NNS) method attempts to compute the Euclidean distance exhaustively between two sets of feature descriptors, which causes high computational costs and generates high outlier ratios. Thus, a guided feature matching (GFM) algorithm is presented. It casts the explicit closest descriptor searching as the direct assignment by using the "image-word" index structure of the vocabulary tree. Combining the match pair selection and guided feature matching algorithm, an integrated workflow is finally presented to achieve feature matching of both ordered and unordered UAV images with high precision and efficiency. The proposed workflow is verified using four UAV datasets and compared comprehensively with classical NNS algorithms and commercial software packages. 
The experimental results verify that the proposed method achieves efficient match pair selection and avoids the problem of retrieving too many or too few match pairs that is commonly caused by traditional methods using fixed-threshold or fixed-number strategies; without sacrificing matching precision, the speedup ratio of direct assignment-based feature matching ranges from 156 to 228, and competitive accuracy is also obtained from 3D reconstruction compared with the nearest neighbor searching (NNS) method.
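The guided-matching idea above — replacing exhaustive descriptor search with direct assignment through a visual-word index — can be sketched as follows. This is a simplified, flat-vocabulary stand-in (the paper uses a vocabulary tree's "image-word" index); the toy 2-D descriptors and two-word vocabulary are hypothetical.

```python
from collections import defaultdict

def assign_word(desc, vocabulary):
    # Nearest vocabulary "word" by squared Euclidean distance
    # (a real system would descend a vocabulary tree instead).
    return min(range(len(vocabulary)),
               key=lambda w: sum((a - b) ** 2 for a, b in zip(desc, vocabulary[w])))

def guided_match(descs_a, descs_b, vocabulary):
    """Match only descriptors that quantize to the same visual word,
    instead of exhaustively comparing all pairs (the NNS approach)."""
    words_b = defaultdict(list)
    for j, d in enumerate(descs_b):
        words_b[assign_word(d, vocabulary)].append(j)
    matches = []
    for i, d in enumerate(descs_a):
        cands = words_b.get(assign_word(d, vocabulary), [])
        if cands:
            j = min(cands,
                    key=lambda j: sum((a - b) ** 2 for a, b in zip(d, descs_b[j])))
            matches.append((i, j))
    return matches

# Hypothetical toy data: two images, two descriptors each, two visual words.
vocabulary = [(0.0, 0.0), (10.0, 10.0)]
descs_a = [(0.5, 0.2), (9.5, 10.1)]
descs_b = [(10.2, 9.9), (0.1, 0.4)]
matches = guided_match(descs_a, descs_b, vocabulary)
```

The design point is that each descriptor is compared only against the (typically small) bucket of descriptors sharing its word, which is what yields the large speedups over exhaustive NNS reported in the abstract.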
Article
Accurate and up-to-date 3D maps, often represented as point clouds, are crucial for autonomous vehicles. Crowd-sourcing has emerged as a low-cost and scalable approach for collecting mapping data utilizing widely available dashcams and other sensing devices. However, it is still a non-trivial task to utilize crowdsourced data, such as dashcam images and video, to efficiently create or update high-quality point clouds using technologies like Structure from Motion (SfM). This study assesses and compares different image matching options available in open-source SfM software, analyzing their applicability and limitations for mapping urban scenes in different practical scenarios. Furthermore, the study analyzes the impact of various camera setups (i.e., the number of cameras and their placement) and weather conditions on the quality of the generated 3D point clouds in terms of completeness and accuracy. Based on these analyses, our study provides guidelines for creating more accurate point clouds.
Conference Paper
Full-text available
This paper introduces an approach for dense 3D reconstruction from unregistered Internet-scale photo collections with about 3 million images within the span of a day on a single PC (“cloudless”). Our method advances image clustering, stereo, stereo fusion and structure from motion to achieve high computational performance. We leverage geometric and appearance constraints to obtain a highly parallel implementation on modern graphics processors and multi-core architectures. This leads to two orders of magnitude higher performance on an order of magnitude larger dataset than competing state-of-the-art approaches.
Conference Paper
Full-text available
We present a completely automated Structure and Motion pipeline capable of working with uncalibrated images with varying internal parameters and no ancillary information. The system is based on a novel hierarchical scheme which reduces the total complexity by one order of magnitude. We assess the quality of our approach analytically by comparing the recovered point clouds with laser scans, which serves as ground truth data.
Conference Paper
Full-text available
Bundle adjustment for multi-view reconstruction is traditionally done using the Levenberg-Marquardt algorithm with a direct linear solver, which is computationally very expensive. An alternative to this approach is to apply the conjugate gradients algorithm in the inner loop. This is appealing since the main computational step of the CG algorithm involves only a simple matrix-vector multiplication with the Jacobian. In this work we improve on the latest published approaches to bundle adjustment with conjugate gradients by making full use of the least squares nature of the problem. We employ an easy-to-compute QR factorization based block preconditioner and show how a certain property of the preconditioned system allows us to reduce the work per iteration to roughly half of the standard CG algorithm.
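The preconditioned conjugate gradients loop at the heart of this line of work can be sketched generically. This is not the paper's QR-based block preconditioner; as a simplification it uses a plain Jacobi (diagonal) preconditioner on a small dense stand-in for the bundle adjustment normal equations, with hypothetical data.

```python
import numpy as np

def pcg(A, b, M_inv, tol=1e-10, max_iter=100):
    """Preconditioned conjugate gradients for A x = b, with A symmetric
    positive definite and M_inv an (approximate) inverse preconditioner.
    The dominant cost per iteration is one matrix-vector product A @ p."""
    x = np.zeros_like(b)
    r = b - A @ x
    z = M_inv @ r
    p = z.copy()
    for _ in range(max_iter):
        Ap = A @ p
        alpha = (r @ z) / (p @ Ap)
        x = x + alpha * p
        r_new = r - alpha * Ap
        if np.linalg.norm(r_new) < tol:
            break
        z_new = M_inv @ r_new
        beta = (r_new @ z_new) / (r @ z)
        p = z_new + beta * p
        r, z = r_new, z_new
    return x

# In bundle adjustment A = J^T J; here a tiny SPD stand-in (hypothetical).
J = np.array([[2.0, 0.1, 0.0],
              [0.0, 3.0, 0.2],
              [0.1, 0.0, 1.5],
              [0.5, 0.5, 0.5]])
A = J.T @ J
b = np.array([1.0, 2.0, 3.0])
M_inv = np.diag(1.0 / np.diag(A))  # simple Jacobi preconditioner
x = pcg(A, b, M_inv)
```

In a real least-squares setting one would avoid forming J^T J explicitly and instead apply J and J^T in sequence, which is exactly the structure the paper's QR-based block preconditioner exploits.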
Article
Full-text available
This article presents an approach for modeling landmarks based on large-scale, heavily contaminated image collections gathered from the Internet. Our system efficiently combines 2D appearance and 3D geometric constraints to extract scene summaries and construct 3D models. In the first stage of processing, images are clustered based on low-dimensional global appearance descriptors, and the clusters are refined using 3D geometric constraints. Each valid cluster is represented by a single iconic view, and the geometric relationships between iconic views are captured by an iconic scene graph. Using structure from motion techniques, the system then registers the iconic images to efficiently produce 3D models of the different aspects of the landmark. To improve coverage of the scene, these 3D models are subsequently extended using additional, non-iconic views. We also demonstrate the use of iconic images for recognition and browsing. Our experimental results demonstrate the ability to process datasets containing up to 46,000 images in less than 20 hours, using a single commodity PC equipped with a graphics card. This is a significant advance towards Internet-scale operation.
Article
Full-text available
Bundle adjustment constitutes a large, nonlinear least-squares problem that is often solved as the last step of feature-based structure and motion estimation computer vision algorithms to obtain optimal estimates. Due to the very large number of parameters involved, a general purpose least-squares algorithm incurs high computational and memory storage costs when applied to bundle adjustment. Fortunately, the lack of interaction among certain subgroups of parameters results in the corresponding Jacobian being sparse, a fact that can be exploited to achieve considerable computational savings. This article presents sba, a publicly available C/C++ software package for realizing generic bundle adjustment with high efficiency and flexibility regarding parameterization.
Article
This paper presents a method for extracting distinctive invariant features from images that can be used to perform reliable matching between different views of an object or scene. The features are invariant to image scale and rotation, and are shown to provide robust matching across a substantial range of affine distortion, change in 3D viewpoint, addition of noise, and change in illumination. The features are highly distinctive, in the sense that a single feature can be correctly matched with high probability against a large database of features from many images. This paper also describes an approach to using these features for object recognition. The recognition proceeds by matching individual features to a database of features from known objects using a fast nearest-neighbor algorithm, followed by a Hough transform to identify clusters belonging to a single object, and finally performing verification through least-squares solution for consistent pose parameters. This approach to recognition can robustly identify objects among clutter and occlusion while achieving near real-time performance.
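The distinctive-matching step described here is usually paired with Lowe's distance-ratio test: a nearest neighbor is accepted only when it is clearly closer than the second-nearest. A minimal sketch with hypothetical 2-D "descriptors" (real SIFT descriptors are 128-D and matched with approximate nearest-neighbor search):

```python
def ratio_test_match(query, database, ratio=0.8):
    """Accept a nearest neighbor only if its distance is below `ratio`
    times the distance to the second-nearest neighbor (Lowe's ratio test)."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    matches = []
    for qi, q in enumerate(query):
        order = sorted(range(len(database)), key=lambda i: dist2(q, database[i]))
        if len(order) >= 2 and \
           dist2(q, database[order[0]]) < (ratio ** 2) * dist2(q, database[order[1]]):
            matches.append((qi, order[0]))
    return matches

# Hypothetical toy data: the first query has one unambiguous neighbor,
# the second sits between two near-duplicates and is rejected as ambiguous.
db = [(0.0, 0.0), (5.0, 5.0), (5.1, 5.0)]
matches = ratio_test_match([(0.1, 0.0), (5.05, 5.0)], db)
```

Rejecting ambiguous matches in this way is what makes individual features "highly distinctive" in practice: a feature that could plausibly match several database entries contributes no match at all.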
Conference Paper
We present a system for interactively browsing and exploring large unstructured collections of photographs of a scene using a novel 3D interface. Our system consists of an image-based modeling front end that automatically computes the viewpoint of each photograph as well as a sparse 3D model of the scene and image to model correspondences. Our photo explorer uses image-based rendering techniques to smoothly transition between photographs, while also enabling full 3D navigation and exploration of the set of images and world geometry, along with auxiliary information such as overhead maps. Our system also makes it easy to construct photo tours of scenic or historic locations, and to annotate image details, which are automatically transferred to other relevant images. We demonstrate our system on several large personal photo collections as well as images gathered from Internet photo sharing sites.
Conference Paper
This paper describes an effort to automatically create "tours" of thousands of the world's landmarks from geo-tagged user-contributed photos on the Internet. These photo tours take you through each site's most popular viewpoints on a tour that maximizes visual quality and traversal efficiency. This planning problem is framed as a form of the Traveling Salesman Problem on a graph with photos as nodes and transition costs on edges and pairs of edges, permitting efficient solution even for large graphs containing thousands of photos. Our approach is highly scalable and is the basis for the Photo Tours feature in Google Maps, which can be viewed at http://maps.google.com/phototours.
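The tour-planning formulation above can be illustrated with the simplest TSP heuristic: repeatedly move to the cheapest unvisited photo. The paper solves a richer problem (with costs on pairs of edges and quality terms); this nearest-neighbor sketch over a hypothetical cost matrix only conveys the graph set-up.

```python
def greedy_tour(start, cost):
    """Nearest-neighbor heuristic for a tour over photo nodes: photos are
    graph nodes, cost[i][j] is the transition cost between viewpoints."""
    n = len(cost)
    unvisited = set(range(n)) - {start}
    tour = [start]
    while unvisited:
        nxt = min(unvisited, key=lambda j: cost[tour[-1]][j])
        tour.append(nxt)
        unvisited.remove(nxt)
    return tour

# Four photos with symmetric, hypothetical transition costs.
cost = [[0, 1, 4, 3],
        [1, 0, 2, 5],
        [4, 2, 0, 1],
        [3, 5, 1, 0]]
tour = greedy_tour(0, cost)
```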
Conference Paper
We present a new structure from motion (SfM) technique based on point and vanishing point (VP) matches in images. First, all global camera rotations are computed from VP matches as well as relative rotation estimates obtained from pairwise image matches. A new multi-staged linear technique is then used to estimate all camera translations and 3D points simultaneously. The proposed method involves first performing pairwise reconstructions, then robustly aligning these in pairs, and finally aligning all of them globally by simultaneously estimating their unknown relative scales and translations. In doing so, measurements inconsistent in three views are efficiently removed. Unlike sequential SfM, the proposed method treats all images equally, is easy to parallelize and does not require intermediate bundle adjustments. There is also a reduction of drift, and significant speedups of up to two orders of magnitude over sequential SfM. We compare our method with a standard SfM pipeline [1] and demonstrate that our linear estimates are accurate on a variety of datasets, and can serve as good initializations for final bundle adjustment. Because we exploit VPs when available, our approach is particularly well-suited to the reconstruction of man-made scenes.
Conference Paper
Recent work in structure from motion (SfM) has successfully built 3D models from large unstructured collections of images downloaded from the Internet. Most approaches use incremental algorithms that solve progressively larger bundle adjustment problems. These incremental techniques scale poorly as the number of images grows, and can drift or fall into bad local minima. We present an alternative formulation for SfM based on finding a coarse initial solution using a hybrid discrete-continuous optimization, and then improving that solution using bundle adjustment. The initial optimization step uses a discrete Markov random field (MRF) formulation, coupled with a continuous Levenberg-Marquardt refinement. The formulation naturally incorporates various sources of information about both the cameras and the points, including noisy geotags and vanishing point estimates. We test our method on several large-scale photo collections, including one with measured camera positions, and show that it can produce models that are similar to or better than those produced with incremental bundle adjustment, but more robustly and in a fraction of the time.
Conference Paper
We address the problem of efficient structure from motion for large, unordered, highly redundant, and irregularly sampled photo collections, such as those found on Internet photo-sharing sites. Our approach computes a small skeletal subset of images, reconstructs the skeletal set, and adds the remaining images using pose estimation. Our technique drastically reduces the number of parameters that are considered, resulting in dramatic speedups, while provably approximating the covariance of the full set of parameters. To compute a skeletal image set, we first estimate the accuracy of two-frame reconstructions between pairs of overlapping images, then use a graph algorithm to select a subset of images that, when reconstructed, approximates the accuracy of the full set. A final bundle adjustment can then optionally be used to restore any loss of accuracy.
Conference Paper
We present the design and implementation of new inexact Newton type Bundle Adjustment algorithms that exploit hardware parallelism for efficiently solving large scale 3D scene reconstruction problems. We explore the use of multicore CPU as well as multicore GPUs for this purpose. We show that overcoming the severe memory and bandwidth limitations of current generation GPUs not only leads to more space efficient algorithms, but also to surprising savings in runtime. Our CPU based system is up to ten times and our GPU based system is up to thirty times faster than the current state of the art methods, while maintaining comparable convergence behavior. The code and additional results are available at http://grail.cs.washington.edu/projects/mcba.
Conference Paper
We present the design and implementation of a new inexact Newton type algorithm for solving large-scale bundle adjustment problems with tens of thousands of images. We explore the use of Conjugate Gradients for calculating the Newton step and its performance as a function of some simple and computationally efficient preconditioners. We show that the common Schur complement trick is not limited to factorization-based methods and that it can be interpreted as a form of preconditioning. Using photos from a street-side dataset and several community photo collections, we generate a variety of bundle adjustment problems and use them to evaluate the performance of six different bundle adjustment algorithms. Our experiments show that truncated Newton methods, when paired with relatively simple preconditioners, offer state of the art performance for large-scale bundle adjustment. The code, test problems and detailed performance data are available at http://grail.cs.washington.edu/projects/bal .
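The Schur complement trick mentioned here eliminates the (block-diagonal, cheap-to-invert) point parameters from the normal equations, leaving a much smaller reduced camera system. A dense sketch on a small hypothetical system (real bundle adjustment exploits the block-diagonal structure of the point block rather than calling a general inverse):

```python
import numpy as np

def schur_reduce(B, E, C, v, w):
    """Solve [[B, E], [E^T, C]] [x_c; x_p] = [v; w] by eliminating the
    point block C, solving the reduced camera system, then back-substituting."""
    C_inv = np.linalg.inv(C)            # block-diagonal in real BA
    S = B - E @ C_inv @ E.T             # Schur complement (reduced camera system)
    x_c = np.linalg.solve(S, v - E @ C_inv @ w)
    x_p = C_inv @ (w - E.T @ x_c)
    return x_c, x_p

# Hypothetical SPD system standing in for a BA Hessian approximation:
# 2 "camera" parameters and 3 "point" parameters.
rng = np.random.default_rng(0)
J = rng.standard_normal((8, 5))
H = J.T @ J + 0.1 * np.eye(5)
B, E, C = H[:2, :2], H[:2, 2:], H[2:, 2:]
g = rng.standard_normal(5)
x_c, x_p = schur_reduce(B, E, C, g[:2], g[2:])
x_full = np.linalg.solve(H, g)
```

The paper's observation is that one need not factorize S directly: applying S inside conjugate gradients, with a suitable preconditioner, gives the same reduction while keeping memory and per-iteration cost low.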
Conference Paper
We present a system that can match and reconstruct 3D scenes from extremely large collections of photographs such as those found by searching for a given city (e.g., Rome) on Internet photo sharing sites. Our system uses a collection of novel parallel distributed matching and reconstruction algorithms, designed to maximize parallelism at each stage in the pipeline and minimize serialization bottlenecks. It is designed to scale gracefully with both the size of the problem and the amount of available computation. We have experimented with a variety of alternative algorithms at each stage of the pipeline and report on which ones work best in a parallel computing environment. Our experimental results demonstrate that it is now possible to reconstruct cities consisting of 150 K images in less than a day on a cluster with 500 compute cores.
Conference Paper
A recognition scheme that scales efficiently to a large number of objects is presented. The efficiency and quality is exhibited in a live demonstration that recognizes CD-covers from a database of 40000 images of popular music CD’s. The scheme builds upon popular techniques of indexing descriptors extracted from local regions, and is robust to background clutter and occlusion. The local region descriptors are hierarchically quantized in a vocabulary tree. The vocabulary tree allows a larger and more discriminatory vocabulary to be used efficiently, which we show experimentally leads to a dramatic improvement in retrieval quality. The most significant property of the scheme is that the tree directly defines the quantization. The quantization and the indexing are therefore fully integrated, essentially being one and the same. The recognition quality is evaluated through retrieval on a database with ground truth, showing the power of the vocabulary tree approach, going as high as 1 million images.
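The hierarchical quantization the vocabulary tree performs — descending from the root and choosing the nearest child centroid at each level until a leaf word is reached — can be sketched directly. The tiny two-level tree over 2-D descriptors below is hand-built and hypothetical; a real tree is trained by hierarchical k-means over millions of descriptors.

```python
def quantize(desc, tree):
    """Descend a vocabulary tree: at each internal node pick the child with
    the nearest centroid, until a leaf (visual word id) is reached."""
    node = tree
    while "children" in node:
        node = min(node["children"],
                   key=lambda c: sum((a - b) ** 2
                                     for a, b in zip(desc, c["center"])))
    return node["word"]

# Hypothetical 2-level tree with branching factor 2 over 2-D descriptors.
tree = {"children": [
    {"center": (0.0, 0.0), "children": [
        {"center": (0.0, 0.0), "word": 0},
        {"center": (0.0, 2.0), "word": 1}]},
    {"center": (10.0, 10.0), "children": [
        {"center": (10.0, 8.0), "word": 2},
        {"center": (10.0, 12.0), "word": 3}]},
]}
w = quantize((9.5, 11.0), tree)
```

With branching factor k and depth L this costs only k·L distance computations per descriptor, which is how a vocabulary of k^L words can be used efficiently — the property the abstract highlights.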
Modeling and recognition of landmark image collections using iconic scene graphs
X. Li, C. Wu, C. Zach, S. Lazebnik, J. Frahm