Table 1 - uploaded by Aron Monszpart
Quality evaluation for structure graphs. 

Source publication
Conference Paper
Full-text available
Missing data due to occlusion is a key challenge in 3D acquisition, particularly in cluttered man-made scenes. Such partial information about the scenes limits our ability to analyze and understand them. In this work we abstract such environments as collections of cuboids and hallucinate geometry in the occluded regions by globally analyzing the...

Contexts in source publication

Context 1
... we calculated initial and optimized precision-recall (PR) ratios, and converted them to f-measure (F1 = 200PR / (P + R)). A summary of our results can be found in Table 1. ...
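For readers cross-checking this score, here is a minimal sketch in Python, assuming (as the context states) that P and R are ratios in [0, 1], so the factor of 200 reports F1 as a percentage. This is an illustration of the formula only, not code from the paper.

```python
def f_measure(precision: float, recall: float) -> float:
    """F1 as a percentage, with precision and recall given as ratios in [0, 1].

    F1 = 200 * P * R / (P + R), i.e. 100 times the harmonic mean of P and R.
    """
    if precision + recall == 0:
        return 0.0
    return 200.0 * precision * recall / (precision + recall)

# Example: P = 0.8, R = 0.6  ->  F1 ~= 68.6
print(f_measure(0.8, 0.6))
```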
Context 2
... of the edges are missed because of the minimal extension principle in areas of occlusion.

  # proxies         18   17   56
  # contacts        41   29   64
  max valence        6   11   11
  median valence     5    2    2
  average ...
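The rows above are per-scene statistics of the structure (contact) graph. A small illustrative sketch of how such statistics can be computed from an adjacency-list contact graph; the graph below is made up for the example, not one of the paper's scenes.

```python
from statistics import median

# Hypothetical contact graph for a tiny scene: each proxy (cuboid) id maps
# to the set of proxies it touches. Values are illustrative only.
contacts = {
    0: {1, 2},
    1: {0, 2, 3},
    2: {0, 1},
    3: {1},
}

num_proxies = len(contacts)
num_contacts = sum(len(nbrs) for nbrs in contacts.values()) // 2  # undirected edges
valences = [len(nbrs) for nbrs in contacts.values()]

print("# proxies:", num_proxies)            # 4
print("# contacts:", num_contacts)          # 4
print("max valence:", max(valences))        # 3
print("median valence:", median(valences))  # 2.0
```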
Context 3
... evaluation. The ground truth graphs automatically retrievable with the method above contained 40%-85% of the edges a manual annotator would insert, influencing the scores reported in Table 1. Our algorithm was designed to resolve proxy intersections, which assumes a constant approximation scale and no "inclusive" relationships. ...
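Under this evaluation, predicted contact edges are scored against a (partial) ground-truth edge set. A hedged sketch of edge-level precision/recall with hypothetical edge lists; the scoring used in the paper may differ in detail.

```python
def edge_precision_recall(predicted, ground_truth):
    """Precision/recall of predicted contact edges against a ground-truth set.

    Edges are unordered proxy-id pairs; both inputs are iterables of 2-tuples.
    """
    pred = {frozenset(e) for e in predicted}
    gt = {frozenset(e) for e in ground_truth}
    true_pos = len(pred & gt)
    precision = true_pos / len(pred) if pred else 0.0
    recall = true_pos / len(gt) if gt else 0.0
    return precision, recall

# Toy example with made-up edges:
p, r = edge_precision_recall([(0, 1), (1, 2), (2, 3)], [(0, 1), (1, 2), (1, 3)])
print(p, r)  # 0.666..., 0.666...
```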
Context 4
... scenes had 2-23 (average 8) objects and 2-29 (average 10) touching and fixed contacts. In addition to estimating the validity of the structure graphs (Table 1) we evaluated the correctness of the geometric ...
[Figure 14: Initial and optimized cuboids with extracted structure for two multi-view recordings of cluttered scenes (see supplementary material for details); panels show initial cuboids and optimized cuboids + structure for scene #1 and scene #2.] ...

Citations

... There are many works (Yuille & Kersten, 2006;Zhu & Mumford, 2007;Bai et al., 2012;Kulkarni et al., 2015;Jimenez Rezende et al., 2016;Eslami et al., 2016;Wu et al., 2017b;Yao et al., 2018;Azinovic et al., 2019;Munkberg et al., 2022) that focus on inverse graphics. Recently, some research in this field (Gupta et al., 2010;Battaglia et al., 2013;Jia et al., 2014;Shao et al., 2014;Zheng et al., 2015;Lerer et al., 2016;Mottaghi et al., 2016;Fragkiadaki et al., 2016;Battaglia et al., 2016;Agrawal et al., 2016;Pinto et al., 2016;Finn et al., 2016;Ehrhardt et al., 2017;Chang et al., 2017;Raissi et al., 2019;Li et al., 2020;Chu et al., 2022) has paid more attention to reasoning about physical properties of the scene from images/video. In contrast to these methods that infer static properties, we infer a generalizable physical dynamic model via inverse rendering. ...
Preprint
Humans have a strong intuitive understanding of physical processes such as fluid falling from just a glimpse of such a scene picture, quickly derived from our immersive visual experiences in memory. This work achieves such a photo-to-fluid-dynamics reconstruction functionality learned from unannotated videos, without any supervision of ground-truth fluid dynamics. In a nutshell, a differentiable Euler simulator, modeled with a ConvNet-based pressure projection solver, is integrated with a volumetric renderer, supporting end-to-end, coherent differentiable dynamic simulation and rendering. By endowing each sampled point with a fluid volume value, we derive a NeRF-like differentiable renderer dedicated to fluid data; and thanks to this volume-augmented representation, fluid dynamics can be inversely inferred from the error signal between the rendered result and the ground-truth video frame (i.e., inverse rendering). Experiments on our generated Fluid Fall datasets and the DPI Dam Break dataset are conducted to demonstrate both the effectiveness and the generalization ability of our method.
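For context on the "volume-augmented" NeRF-like renderer mentioned above, here is the generic emission-absorption ray integrator that such renderers build on. This is a sketch of the standard formulation only, not the paper's implementation; the per-sample densities are simply assumed to come from the fluid volume values.

```python
import numpy as np

def render_ray(sigmas, colors, deltas):
    """Standard emission-absorption volume rendering along one ray.

    sigmas: per-sample densities (imagined here as derived from a fluid
            volume value), colors: per-sample RGB, deltas: sample spacings.
    """
    alphas = 1.0 - np.exp(-sigmas * deltas)
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alphas[:-1]]))  # transmittance
    weights = alphas * trans
    return (weights[:, None] * colors).sum(axis=0)

rgb = render_ray(np.array([0.5, 1.0, 2.0]),
                 np.random.rand(3, 3),
                 np.array([0.1, 0.1, 0.1]))
print(rgb)
```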
... Snapping in 3D, on the other hand, is more complicated since depth information is difficult to obtain in the augmented 3D space. Therefore, it is quite natural to investigate snapping algorithms based on RGB-D images, which provide additional depth information [36,37]. An efficient RGB-D image-based method to snap virtual objects into real scenes in real time is presented by Li et al. [27]. ...
Article
Full-text available
High-performance spatial target snapping is an essential function in 3D scene modeling and mapping that is widely used in mobile augmented reality (MAR). Spatial data snapping in a MAR system must be quick and accurate, while real-time human–computer interaction and drawing smoothness must also be ensured. In this paper, we analyze the advantages and disadvantages of several spatial data snapping algorithms, such as the 2D computational geometry method and the absolute distance calculation method. To address the issue that existing algorithms do not adequately support 3D data snapping and real-time snapping of high data volumes, we present a new adaptive dynamic snapping algorithm based on the spatial and graphical characteristics of augmented reality (AR) data snapping. Finally, the algorithm is evaluated in an AR modeling system, including its snapping efficiency and snapping accuracy. Through experimental comparison, we found that the proposed algorithm substantially shortens snapping time, enhances snapping stability, and improves the snapping accuracy of vector points, lines, faces, bodies, etc. The snapping efficiency of the proposed algorithm is 1.6 times higher than that of the traditional algorithm on average, while the data acquisition accuracy based on our algorithm is more than 6 times higher than that of the traditional algorithm on average under the same conditions, and its data accuracy is improved from the decimeter level to the centimeter level.
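As a point of reference for the baselines discussed above (e.g., the absolute distance calculation method), here is a minimal sketch of distance-threshold snapping in 3D. The function name, threshold, and candidate set are illustrative; the paper's adaptive dynamic algorithm goes well beyond this simple rule.

```python
import numpy as np

def snap_to_nearest(point, candidates, threshold):
    """Snap a 3D point to the nearest candidate vertex within `threshold`.

    A deliberately simple absolute-distance rule, i.e. the kind of baseline
    the abstract contrasts against; returns the original point if nothing
    is close enough.
    """
    point = np.asarray(point, dtype=float)
    candidates = np.asarray(candidates, dtype=float)
    dists = np.linalg.norm(candidates - point, axis=1)
    i = int(dists.argmin())
    return candidates[i] if dists[i] <= threshold else point

snapped = snap_to_nearest([0.10, 0.02, 0.0],
                          [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]],
                          threshold=0.15)
print(snapped)  # [0. 0. 0.]
```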
... However, these datasets mainly target pure video understanding and do not contain natural language question answering. Much research has studied dynamic modeling for physical scenes (Lerer et al., 2016;Battaglia et al., 2013;Mottaghi et al., 2016;Finn et al., 2016;Shao et al., 2014;Fire & Zhu, 2015;Ye et al., 2018;Li et al., 2019b). We adopt PropNet (Li et al., 2019b) for dynamics prediction and feed the predicted scenes to the video feature extractor and the neuro-symbolic executor for event prediction and question answering. ...
Conference Paper
Full-text available
We study the problem of dynamic visual reasoning on raw videos. This is a challenging problem; currently, state-of-the-art models often require dense supervision on physical object properties and events from simulation, which are impractical to obtain in real life. In this paper, we present the Dynamic Concept Learner (DCL), a unified framework that grounds physical objects and events from dynamic scenes and language. DCL first adopts a trajectory extractor to track each object over time and to represent it as a latent, object-centric feature vector. Building upon this object-centric representation, DCL learns to approximate the dynamic interaction among objects using graph networks. DCL further incorporates a semantic parser to parse questions into semantic programs and, finally, a program executor to run the program to answer the question, leveraging the learned dynamics model. After training, DCL can detect and associate objects across frames, ground visual properties and physical events, understand the causal relationships between events, make future and counterfactual predictions, and leverage these extracted representations for answering queries. DCL achieves state-of-the-art performance on CLEVRER, a challenging causal video reasoning dataset, even without using ground-truth attributes and collision labels from simulations for training. We further test DCL on a newly proposed video-retrieval and event localization dataset derived from CLEVRER, showing its strong generalization capacity.
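To make the "graph networks over object-centric features" idea concrete, a toy message-passing step is sketched below. The shapes, weights, and update rule are illustrative only and are not DCL's learned dynamics model.

```python
import numpy as np

def interaction_step(obj_feats, edges, w_msg, w_update):
    """One toy message-passing step over object-centric feature vectors.

    obj_feats: (N, D) array of per-object features.
    edges: list of (src, dst) index pairs along which messages flow.
    w_msg, w_update: (2D, D) weight matrices (random here, learned in practice).
    """
    messages = np.zeros_like(obj_feats)
    for src, dst in edges:
        pair = np.concatenate([obj_feats[src], obj_feats[dst]])
        messages[dst] += np.tanh(pair @ w_msg)
    updated = np.concatenate([obj_feats, messages], axis=1) @ w_update
    return np.tanh(updated)

N, D = 3, 4
feats = np.random.randn(N, D)
new_feats = interaction_step(feats, [(0, 1), (1, 2)],
                             np.random.randn(2 * D, D), np.random.randn(2 * D, D))
print(new_feats.shape)  # (3, 4)
```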
... Although not dedicated to image editing, Shao et al. [40] proposed a method to reveal hidden parts during 3D acquisition. This method recovers missing and occluded data by abstracting the 3D scene using cuboids. ...
Article
Full-text available
We present Calipso, an interactive method for editing images and videos in a physically-coherent manner. Our main idea is to realize physics-based manipulations by running a full physics simulation on proxy geometries given by non-rigidly aligned CAD models. Running these simulations allows us to apply new, unseen forces to move or deform selected objects, change physical parameters such as mass or elasticity, or even add entire new objects that interact with the rest of the underlying scene. In Calipso, the user makes edits directly in 3D; these edits are processed by the simulation and then transferred to the target 2D content using shape-to-image correspondences in a photo-realistic rendering process. To align the CAD models, we introduce an efficient CAD-to-image alignment procedure that jointly minimizes for rigid and non-rigid alignment while preserving the high-level structure of the input shape. Moreover, the user can choose to exploit image flow to estimate scene motion, producing coherent physical behavior with ambient dynamics. We demonstrate Calipso's physics-based editing on a wide range of examples producing myriad physical behavior while preserving geometric and visual consistency.
... One is to incorporate the acquired room structure and information about the non-planar geometry to help intelligent agents navigate and focus on detecting unknown components [54], [55], [56]. Also, we could incorporate a physical understanding of the scene to complete unseen parts of the environment [57]. Currently, some partially planar structure can remain floating in the air after our pipeline. ...
Article
Full-text available
Scanning and acquiring a 3D indoor environment suffers from complex occlusions and misalignment errors. The reconstruction obtained from an RGB-D scanner contains holes in geometry and ghosting in texture. These are easily noticeable and cannot be considered as visually compelling VR content without further processing. On the other hand, the well-known Manhattan World priors successfully recreate relatively simple structures. In this paper, we would like to push the limit of planar representation in indoor environments. Given an initial 3D reconstruction captured by an RGB-D sensor, we use planes not only to represent the environment geometrically but also to solve an inverse rendering problem considering texture and light. The complex process of shape inference and intrinsic imaging is greatly simplified with the help of detected planes and yet produces a realistic 3D indoor environment. The generated content can adequately represent the spatial arrangements for various AR/VR applications and can be readily composited with virtual objects possessing plausible lighting and texture.
... In general, one can formulate the 3D object detection problem as follows: fit a 3D bounding box to objects in an RGB-D image [48, 49, 34, 46], detect 3D keypoints in an RGB image [51, 53], or perform 3D-model-to-2D-image alignment [50, 24, 2]. In this paper, we follow the keypoint-based formulation. ...
Article
Full-text available
We present a Deep Cuboid Detector which takes a consumer-quality RGB image of a cluttered scene and localizes all 3D cuboids (box-like objects). Contrary to classical approaches which fit a 3D model from low-level cues like corners, edges, and vanishing points, we propose an end-to-end deep learning system to detect cuboids across many semantic categories (e.g., ovens, shipping boxes, and furniture). We localize cuboids with a 2D bounding box, and simultaneously localize the cuboid's corners, effectively producing a 3D interpretation of box-like objects. We refine keypoints by pooling convolutional features iteratively, improving the baseline method significantly. Our deep learning cuboid detector is trained in an end-to-end fashion and is suitable for real-time applications in augmented reality (AR) and robotics.
... Detecting, encoding, and synthesizing relationships between objects is critical for many shape analysis and scene synthesis tasks. Handling even simple relations like 'on top of,' 'is next to,' or 'is touching' has been shown to be very useful for scene understanding [Liu et al. 2014], structuring raw RGBD images [Shao et al. 2014], realistic scene synthesis [Fisher et al. 2012; Chen et al. 2014], object retrieval [Fisher et al. 2011], etc. Recently, more advanced relationship descriptors like IBS [Zhao et al. 2014] and ICON [Hu et al. 2015] have demonstrated the value of capturing ...
Article
As humans, we regularly interpret scenes based on how objects are related, rather than based on the objects themselves. For example, we see a person riding an object X or a plank bridging two objects. Current methods provide limited support to search for content based on such relations. We present RAID, a relation-augmented image descriptor that supports queries based on inter-region relations. The key idea of our descriptor is to encode region-to-region relations as the spatial distribution of point-to-region relationships between two image regions. RAID allows sketch-based retrieval and requires minimal training data, thus making it suited even for querying uncommon relations. We evaluate the proposed descriptor by querying into large image databases and successfully extract nontrivial images demonstrating complex inter-region relations, which are easily missed or erroneously classified by existing methods. We assess the robustness of RAID on multiple datasets even when the region segmentation is computed automatically or very noisy.
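As a much-simplified illustration of encoding a region-to-region relation as a distribution of point-to-region relationships, the toy histogram below bins region A's sample points by where they fall relative to region B's bounding box. The actual RAID descriptor is defined differently and more richly; this is only a sketch of the underlying idea.

```python
def point_region_histogram(points_a, bbox_b):
    """Toy spatial-relation histogram: where do region A's points lie
    relative to region B's axis-aligned bounding box?

    points_a: iterable of (x, y) samples from region A.
    bbox_b: (x_min, y_min, x_max, y_max) of region B.
    """
    x0, y0, x1, y1 = bbox_b
    bins = {"left": 0, "right": 0, "above": 0, "below": 0, "inside": 0}
    for x, y in points_a:
        if x < x0:
            bins["left"] += 1
        elif x > x1:
            bins["right"] += 1
        elif y < y0:
            bins["above"] += 1
        elif y > y1:
            bins["below"] += 1
        else:
            bins["inside"] += 1
    total = max(1, len(list(points_a)))
    return {k: v / total for k, v in bins.items()}

print(point_region_histogram([(0.1, 0.5), (0.9, 0.5), (0.5, 0.5)],
                             bbox_b=(0.4, 0.4, 0.6, 0.6)))
```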
... Approaches include using classifiers on scene objects [Schlecht and Barnard 2009; Xiong and Huber 2010; Anand et al. 2011; Koppula et al. 2011; Silberman et al. 2012], interactive 3D modeling from raw RGBD scans [Shao et al. 2012], interleaving segmentation and classification [Nan et al. 2012], unsupervised algorithms to identify and consolidate scans [Mattausch et al. 2014], proxy-geometry-based scene understanding [Lafarge and Alliez 2013], or studying the spatial layout of scenes [Lee et al. 2010; Hartley et al. 2012]. Physical constraints have also been used for scene understanding: for example, [Jia et al. 2013; Jiang and Xiao 2013; Zheng et al. 2013; Shao et al. 2014] consider local and global physical stability to predict occluded geometry in scenes. These methods, however, primarily focus on static scenes. ...
Article
Collision sequences are commonly used in games and entertainment to add drama and excitement. Authoring even two-body collisions in the real world can be difficult, as one has to get the timing and object trajectories correctly synchronized. After trial-and-error iterations, when objects can actually be made to collide, they are difficult to acquire in 3D. In contrast, synthetically generating plausible collisions is difficult as it requires adjusting different collision parameters (e.g., object mass ratio, coefficient of restitution, etc.) and appropriate initial parameters. We present SMASH to directly 'read off' appropriate collision parameters simply based on input video recordings. Specifically, we describe how to use laws of rigid body collision to regularize the problem of lifting 2D annotated poses to 3D reconstruction of collision sequences. The reconstructed sequences can then be modified and combined to easily author novel and plausible collision sequences. We demonstrate the system on various complex collision sequences.
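The collision parameters mentioned above (mass ratio, coefficient of restitution) obey standard rigid-body collision relations. Below is a textbook 1-D worked example of those relations, not SMASH's 3D formulation.

```python
def collide_1d(m1, v1, m2, v2, e):
    """Post-collision velocities for a 1-D two-body collision.

    Standard momentum-conservation and restitution relations:
        m1*v1 + m2*v2 = m1*v1' + m2*v2'
        v2' - v1' = -e * (v2 - v1)
    """
    v1_new = (m1 * v1 + m2 * v2 + m2 * e * (v2 - v1)) / (m1 + m2)
    v2_new = (m1 * v1 + m2 * v2 + m1 * e * (v1 - v2)) / (m1 + m2)
    return v1_new, v2_new

# Equal masses, e = 1 (perfectly elastic): velocities are exchanged.
print(collide_1d(1.0, 2.0, 1.0, -1.0, e=1.0))  # (-1.0, 2.0)
```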
... Most of these approaches, starting either from a single image or multiple images, are dedicated to faithfully recovering the 3D geometry of image objects, while their semantic relations, underlying physical settings, or even functionality are overlooked. More recently, research has explored the use of high-level structural information to facilitate 3D reconstruction [3][4][5]. For example, Shao et al. [3] leverage physical stability to suggest possible interactions between image objects and obtain a physically plausible reconstruction of objects in RGBD images. ...
... More recently, research has explored the use of high-level structural information to facilitate 3D reconstruction [3][4][5]. For example, Shao et al. [3] leverage physical stability to suggest possible interactions between image objects and obtain a physically plausible reconstruction of objects in RGBD images. Such high-level semantic information plays an important role in constraining the underlying geometric structure. ...
... To recover structural information, Shen et al. [4] extract suitable model parts from a database and compose them to form high-quality models from a single RGBD image. Shao et al. [3] use physical stability to recover unseen structures from a single RGBD image using cuboids. However, their techniques focus on creating static 3D geometry and structure, whereas our goal is to produce models with correctly moving parts. ...
Article
Inferring the functionality of an object from a single RGBD image is difficult for two reasons: lack of semantic information about the object, and missing data due to occlusion. In this paper, we present an interactive framework to recover a 3D functional prototype from a single RGBD image. Instead of precisely reconstructing the object geometry for the prototype, we mainly focus on recovering the object’s functionality along with its geometry. Our system allows users to scribble on the image to create initial rough proxies for the parts. After user annotation of high-level relations between parts, our system automatically jointly optimizes detailed joint parameters (axis and position) and part geometry parameters (size, orientation, and position). Such prototype recovery enables a better understanding of the underlying image geometry and allows for further physically plausible manipulation. We demonstrate our framework on various indoor objects with simple or hybrid functions.