Conference Paper

Fast and Accurate Single-Image Depth Estimation on Mobile Devices, Mobile AI 2021 Challenge: Report

Authors:
  • Raspberry Pi

... With the development of IoT and the widespread adoption of mobile devices, artificial intelligence (AI) technology is widely used in many aspects of daily life [1,2]. Considering security issues, federated learning (FL) comes into being, which enables collaborative training by exchanging local model parameters instead of raw data among devices [3,4]. ...
... If there is no transaction within the network that client $P_{i+1}$ sends deposits to $P_i$ ($i = 1, \cdots, N-1$) as described above, all $P_j$ ($j \le i$) do not perform the ladder deposits and wait for $\mathrm{Ack}_{i,N}$'s reimbursement according to Equation (2). All $P_j$ ($j > i$) need to wait until the end of the RingFFL to judge whether their ladder deposits are acknowledged. ...
... The raw data are stored locally from start to finish, thus ensuring the security of the clients' data. (2) Guaranteeing the security of the clients' deposits: In the process of deposit payment and deposit refund, we use blockchain to guarantee the security of the transactions. The security mechanism based on consensus can avoid security attacks such as double spending, thus ensuring the security of clients' deposits during the training process. ...
Article
Full-text available
In the ring-architecture-based federated learning framework, security and fairness are severely compromised when dishonest clients abort the training process after obtaining useful information. To solve the problem, we propose a Ring-architecture-based Fair Federated Learning framework called RingFFL, in which we design a penalty mechanism for FL. Before the training starts in each round, all clients that will participate in the training pay deposits in a set order and record the transactions on the blockchain to ensure that they are not tampered with. Subsequently, the clients perform the FL training process, and the correctness of the models transmitted by the clients is guaranteed by the HASH algorithm during the training process. When all clients perform honestly, each client can obtain the final model, and the number of digital currencies in each client’s wallet is kept constant; otherwise, the deposits of clients who leave halfway will be compensated to the clients who perform honestly during the training process. In this way, through the penalty mechanism, all clients either obtain the final model or are compensated, thus ensuring the fairness of federated learning. The security analysis and experimental results show that RingFFL not only guarantees the accuracy and security of the federated learning model but also guarantees the fairness.
... Therefore, research along the line of accelerating depth estimation while reducing quality sacrifice on mobile devices has drawn increasing attention [15,38]. ...
... While engaging results have been presented, most of these state-of-the-art (SoTA) models are only optimized for high fidelity results while not taking into account computational efficiency and mobile-related constraints. The requirements of powerful high-end GPUs and consuming gigabytes of RAM lead to a dilemma when developing these models on resource-constrained mobile hardware [15,1,2]. ...
... In this paper, we aim to address the more practical application problem of monocular depth estimation on mobile devices, where the solution should consider not only the precision but also the inference time [15]. We first investigate a suitable network design. ...
Preprint
Full-text available
Monocular depth estimation is an essential task in the computer vision community. While many successful methods have obtained excellent results, most of them are computationally expensive and not applicable for real-time on-device inference. In this paper, we aim to address more practical applications of monocular depth estimation, where the solution should consider not only the precision but also the inference time on mobile devices. To this end, we first develop an end-to-end learning-based model with a tiny weight size (1.4 MB) and a short inference time (27 FPS on Raspberry Pi 4). Then, we propose a simple yet effective data augmentation strategy, called R2 crop, to boost the model performance. Moreover, we observe that the simple lightweight model trained with only one single loss term will suffer from a performance bottleneck. To alleviate this issue, we adopt multiple loss terms to provide sufficient constraints during the training stage. Furthermore, with a simple dynamic re-weight strategy, we can avoid the time-consuming hyper-parameter choice of loss terms. Finally, we adopt the structure-aware distillation to further improve the model performance. Notably, our solution named LiteDepth ranks 2nd in the MAI&AIM2022 Monocular Depth Estimation Challenge, with a si-RMSE of 0.311, an RMSE of 3.79, and an inference time of 37 ms tested on the Raspberry Pi 4. Notably, we provide the fastest solution to the challenge. Codes and models will be released at https://github.com/zhyever/LiteDepth.
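The si-RMSE and RMSE figures quoted in the abstract above can be made concrete with a small sketch. The abstract does not define either metric, so the version below assumes a commonly used scale-invariant formulation computed on log-depth differences. The function names and the toy inputs are hypothetical, and the challenge's own evaluation script may differ in details such as masking of invalid pixels.

```python
import math


def rmse(pred, target):
    """Root mean squared error between two equal-length depth lists."""
    n = len(pred)
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, target)) / n)


def si_rmse(pred, target):
    """Scale-invariant RMSE in log space (assumed definition).

    With d_i = log(pred_i) - log(target_i), this returns
    sqrt(mean(d^2) - mean(d)^2), which is unchanged when every
    prediction is multiplied by the same positive constant.
    """
    d = [math.log(p) - math.log(t) for p, t in zip(pred, target)]
    n = len(d)
    mean_d = sum(d) / n
    var = sum(x * x for x in d) / n - mean_d ** 2
    # Clamp tiny negative values caused by floating-point rounding.
    return math.sqrt(max(var, 0.0))
```

Under this assumed definition, a prediction that is wrong only by a constant factor (for example, every depth doubled) has a nonzero RMSE but a si-RMSE of approximately zero, which is why the two numbers are reported side by side.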
... Certain limitations prevent the deployment of the neural network to mobile devices, such as a restricted amount of RAM and a limited and not always efficient support for many common deep learning layers and operators. Recent research for super-resolution of single images and videos on mobile devices includes [4,38-43]. A pure CNN proposed model performs per-frame upscaling without considering any inter-frame dependencies, which can significantly speed up the inference. ...
... A pure CNN proposed model performs per-frame upscaling without considering any inter-frame dependencies, which can significantly speed up the inference. Team Noah TerminalVision presented a TinyVSRNet [38] architecture that contains three residual blocks followed by a depth-to-space upsampling layer and one global skip connection performing bilinear image upscaling. ...
Conference Paper
Video super-resolution (VSR) aims to generate high-resolution (HR) frames from corresponding low-resolution (LR) frames. It stands in stark contrast to single image super-resolution (SISR) because of its high temporal dependency on misaligned supporting frames. The existing methods involve using RNNs to learn the temporal dependency while using other networks (CNNs, GANs) for predicting neighboring pixels. Due to the memory and processing constraints and the inference time required for up-scaling LR frames, a wide variety of VSR techniques cannot be applied to mid-range and budget mobile devices. This paper presents VIhanceD, a real-time sliding window-based network that can operate on budget smartphones and laptops while producing cutting-edge results on various video datasets. Our approaches include both spatial and temporal dependencies to make the up-scaled HR frames coherent and free of motion distortions. We concentrate on enhancing the user experience in internet-restricted places owing to social, political, and geographical limitations. The mobile app (and PC client) provides a continuous stream of HR frames without buffering. Our experiments on various public and internal datasets demonstrate that the suggested method is generalizable and works on natural video frames and textual data, making it suitable for infotainment multimedia.
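The depth-to-space upsampling layer mentioned in the TinyVSRNet excerpt above rearranges channels into spatial positions. The sketch below illustrates the general operation for a single output channel on nested Python lists. It is not TinyVSRNet's code, and both the function name and the channel ordering (row-major within each r x r block, as in the usual TensorFlow convention) are assumptions.

```python
def depth_to_space(x, r):
    """Rearrange an H x W x (r*r) nested list into an (H*r) x (W*r) grid.

    Channel k of input pixel (h, w) lands at output position
    (h*r + k // r, w*r + k % r).
    """
    height, width = len(x), len(x[0])
    out = [[0] * (width * r) for _ in range(height * r)]
    for h in range(height):
        for w in range(width):
            if len(x[h][w]) != r * r:
                raise ValueError("each pixel needs r*r channels")
            for k, value in enumerate(x[h][w]):
                out[h * r + k // r][w * r + k % r] = value
    return out
```

For example, a 1 x 1 input pixel with four channels and r = 2 becomes a 2 x 2 block, so a network can do all of its convolutions at low resolution and only expand to the target resolution in this final, parameter-free step.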
... This is the second installment of this challenge. The previous edition was in conjunction with Mobile AI 2021 CVPR workshop [24]. ...
... This solution is able to run at more than 27 FPS on the Raspberry Pi 4, thus demonstrating a nearly real-time performance, which is critical for many depth estimation applications. Overall, we can see a noticeable improvement in the efficiency of the proposed solutions compared to the models produced in the previous Mobile AI depth estimation challenge [24], which allows for faster and more accurate depth estimation models on mobile devices. ...
Preprint
Various depth estimation models are now widely used on many mobile and IoT devices for image segmentation, bokeh effect rendering, object tracking and many other mobile tasks. Thus, it is crucial to have efficient and accurate depth estimation models that can run fast on low-power mobile chipsets. In this Mobile AI challenge, the target was to develop deep learning-based single image depth estimation solutions that can show a real-time performance on IoT platforms and smartphones. For this, the participants used a large-scale RGB-to-depth dataset that was collected with the ZED stereo camera capable of generating depth maps for objects located at up to 50 meters. The runtime of all models was evaluated on the Raspberry Pi 4 platform, where the developed solutions were able to generate VGA resolution depth maps at up to 27 FPS while achieving high fidelity results. All models developed in the challenge are also compatible with any Android or Linux-based mobile devices; their detailed description is provided in this paper.
... State-of-the-art methods tend to employ large encoders like ResNet [15,33,45,47], ResNext-101 [66], SeNet-154 [3,21], Transformer [48,65], with sophisticated decoder strategies [3,21,33], and train with huge datasets such as PBRS [68], MIX 6 [48] to achieve high accuracy. On the contrary, fast solutions [26,58] suffer from low precision, manifesting an apparent compromise between accuracy and network size. ...
... Gonzalez and Kim [17] proposed to synthesize the right view from the left view for training from stereo images. Yang et al. [65] and Ranftl et al. [48] utilize transformer modules to estimate high-quality depth maps, while in contrast [26,58] proposed fast depth estimation methods. However, there is a clear trade-off between accuracy and model size. ...
Preprint
Full-text available
Deep neural networks have recently thrived on single image depth estimation. That being said, current developments on this topic highlight an apparent compromise between accuracy and network size. This work proposes an accurate and lightweight framework for monocular depth estimation based on a self-attention mechanism stemming from salient point detection. Specifically, we utilize a sparse set of keypoints to train a FuSaNet model that consists of two major components: Fusion-Net and Saliency-Net. In addition, we introduce a normalized Hessian loss term invariant to scaling and shear along the depth direction, which is shown to substantially improve the accuracy. The proposed method achieves state-of-the-art results on NYU-Depth-v2 and KITTI while using 3.1-38.4 times smaller model in terms of the number of parameters than baseline approaches. Experiments on the SUN-RGBD further demonstrate the generalizability of the proposed method.
... The challenging process of deriving such information is key for 3D scene reconstruction and augmented reality generation, robotics and autonomous driving, essential for perception, navigation, and planning. Herewith, monocular depth estimation based on deep neural networks has demonstrated a high ability for depth prediction from RGB images [1,9,24,31] and exhibits numerous practical applications, including the highly demanded area of mobile devices [11,12]. ...
Conference Paper
Full-text available
We present ANYU, a new virtually augmented version of the NYU depth v2 dataset, designed for monocular depth estimation. In contrast to the well-known approach where full 3D scenes of a virtual world are utilized to generate artificial datasets, ANYU was created by incorporating RGB-D representations of virtual reality objects into the original NYU depth v2 images. We specifically did not match each generated virtual object with an appropriate texture and a suitable location within the real-world image. Instead, an assignment of texture, location, lighting, and other rendering parameters was randomized to maximize a diversity of the training data, and to show that it is randomness that can improve the generalizing ability of a dataset. By conducting extensive experiments with our virtually modified dataset and validating on the original NYU depth v2 and iBims-1 benchmarks, we show that ANYU improves the monocular depth estimation performance and generalization of deep neural networks with considerably different architectures, especially for the current state-of-the-art VPD model. To the best of our knowledge, this is the first work that augments a real-world dataset with randomly generated virtual 3D objects for monocular depth estimation. We make our ANYU dataset publicly available in two training configurations with 10% and 100% additional synthetically enriched RGB-D pairs of training images, respectively, for efficient training and empirical exploration of virtual augmentation at https://github.com/ABrain-One/ANYU .
... More recently, Ignatov et al. introduced the Mobile AI Challenge [40], investigating efficient MDE on mobile devices in urban settings. Finally, the NTIRE2023 [102] challenge, concurrent to ours, targeted high-resolution images of specular and non-lambertian surfaces. ...
... Supervised Depth Estimation. In recent years, supervised depth models [4,27,36,49,53] have significantly improved the depth accuracy. DPT [35] utilizes the vision transformer [9] for depth prediction and semantic segmentation. ...
Preprint
Depth estimation aims to predict dense depth maps. In autonomous driving scenes, sparsity of annotations makes the task challenging. Supervised models produce concave objects due to insufficient structural information. They overfit to valid pixels and fail to restore spatial structures. Self-supervised methods are proposed for the problem. Their robustness is limited by pose estimation, leading to erroneous results in natural scenes. In this paper, we propose a supervised framework termed Diffusion-Augmented Depth Prediction (DADP). We leverage the structural characteristics of diffusion model to enforce depth structures of depth models in a plug-and-play manner. An object-guided integrality loss is also proposed to further enhance regional structure integrality by fetching objective information. We evaluate DADP on three driving benchmarks and achieve significant improvements in depth structures and robustness. Our work provides a new perspective on depth estimation with sparse annotations in autonomous driving scenes.
... The Dense Depth for Autonomous Driving (DDAD) Challenge [25] targeted long-range and dense depth estimation from diverse urban conditions. The Mobile AI Challenge [36] focused on real-time depth estimation on smartphones and IoT platforms. The SeasonDepth Depth Prediction Challenge [34] was specialized for estimating accurate depth information of scenes under different illumination and season conditions. ...
Preprint
Full-text available
This paper discusses the results for the second edition of the Monocular Depth Estimation Challenge (MDEC). This edition was open to methods using any form of supervision, including fully-supervised, self-supervised, multi-task or proxy depth. The challenge was based around the SYNS-Patches dataset, which features a wide diversity of environments with high-quality dense ground-truth. This includes complex natural environments, e.g. forests or fields, which are greatly underrepresented in current benchmarks. The challenge received eight unique submissions that outperformed the provided SotA baseline on any of the pointcloud- or image-based metrics. The top supervised submission improved relative F-Score by 27.62%, while the top self-supervised improved it by 16.61%. Supervised submissions generally leveraged large collections of datasets to improve data diversity. Self-supervised submissions instead updated the network architecture and pretrained backbones. These results represent significant progress in the field, while highlighting avenues for future research, such as reducing interpolation artifacts at depth boundaries, improving self-supervised indoor performance and overall natural image accuracy.
... This challenge is one of the AIM 2022 associated challenges: reversed ISP [14], efficient learned ISP [25], super-resolution of compressed image and video [46], efficient image super-resolution [20], efficient video super-resolution [21], efficient Bokeh effect rendering [23], efficient monocular depth estimation [24], Instagram filter removal [32] (Fig. 1). ...
Chapter
Full-text available
Cameras capture sensor RAW images and transform them into pleasant RGB images, suitable for the human eyes, using their integrated Image Signal Processor (ISP). Numerous low-level vision tasks operate in the RAW domain (e.g. image denoising, white balance) due to its linear relationship with the scene irradiance, wide range of information at 12 bits, and sensor designs. Despite this, RAW image datasets are scarce and more expensive to collect than the already large and public RGB datasets. This paper introduces the AIM 2022 Challenge on Reversed Image Signal Processing and RAW Reconstruction. We aim to recover raw sensor images from the corresponding RGBs without metadata and, by doing this, “reverse” the ISP transformation. The proposed methods and benchmark establish the state-of-the-art for this low-level vision inverse problem, and generating realistic raw sensor readings can potentially benefit other tasks such as denoising and super-resolution.
... The AIM 2022 Challenge on Super-Resolution of Compressed Image and Video is one of the AIM 2022 associated challenges: reversed ISP [18], efficient learned ISP [36], super-resolution of compressed image and video [73], efficient image super-resolution [32], efficient video super-resolution [33], efficient Bokeh effect rendering [34], efficient monocular depth estimation [35], Instagram filter removal [41]. ...
Chapter
Full-text available
This paper reviews the Challenge on Super-Resolution of Compressed Image and Video at AIM 2022. This challenge includes two tracks. Track 1 aims at the super-resolution of compressed image, and Track 2 targets the super-resolution of compressed video. In Track 1, we use the popular dataset DIV2K as the training, validation and test sets. In Track 2, we propose the LDV 3.0 dataset, which contains 365 videos, including the LDV 2.0 dataset (335 videos) and 30 additional videos. In this challenge, there are 12 teams and 2 teams that submitted the final results to Track 1 and Track 2, respectively. The proposed methods and solutions gauge the state-of-the-art of super-resolution on compressed image and video. The proposed LDV 3.0 dataset is available at https://github.com/RenYang-home/LDV_dataset. The homepage of this challenge is at https://github.com/RenYang-home/AIM22_CompressSR.
... This challenge is a part of the AIM 2022 Challenges: Real-Time Image Super-Resolution [12], Real-Time Video Super-Resolution [13], Single-Image Depth Estimation [15], Learned Smartphone ISP [16], Real-Time Rendering Realistic Bokeh [14], Compressed Input Super-Resolution [35] and Reversed ISP [10]. The results obtained in the other competitions and the description of the proposed solutions can be found in the corresponding challenge reports. ...
Chapter
Full-text available
This paper introduces the methods and the results of AIM 2022 challenge on Instagram Filter Removal. Social media filters transform the images by consecutive non-linear operations, and the feature maps of the original content may be interpolated into a different domain. This reduces the overall performance of the recent deep learning strategies. The main goal of this challenge is to produce realistic and visually plausible images where the impact of the filters applied is mitigated while preserving the content. The proposed solutions are ranked in terms of the PSNR value with respect to the original images. There are two prior studies on this task as the baseline, and a total of 9 teams have competed in the final phase of the challenge. The comparison of qualitative results of the proposed solutions and the benchmark for the challenge are presented in this report.
... Depth estimation, for example, in smart cleaning robots produced in the robotic field, is programmed to determine where the robot is and where the robot cannot enter, and to decide where the robot should clean according to the distance of the objects [2]. In mobile devices, depth perception is used to detect the close object and add a blur effect on other objects so that the Bokeh Effect, called portrait mode, can be applied on the image [3,4,5]. In this section, depth perception, various usage methods and the developed application will be discussed. ...
Article
Full-text available
An image obtained from a camera is 2D, so it does not tell us how far away an object is. To detect only objects within a certain distance in a camera system, the 2D image must be given depth information. Depth estimation is used to estimate distances to objects; it is the perception of a 2D image as 3D. Although different methods can be used to implement this, the method applied in this experiment is depth estimation with a single camera. After the depth map is obtained, the image is filtered to keep objects at near distance, the distant parts of the image are masked out, and the resulting image is passed to the object detection model. The goal of this experiment is to ensure that, for low-budget projects, a robot can detect obstacles in front of it with only one camera instead of using dual-camera or LIDAR methods. As a result, 8 FPS was obtained by running two models on the embedded device, and a loss value of 0.342 was obtained in the inference test performed on the new image, in which only close objects were kept after the depth estimation.
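The filtering step described above, keeping only nearby objects before running the detector, can be sketched as a simple threshold on the depth map. This is a minimal illustration, not the authors' implementation: it assumes the depth map stores larger values for farther pixels, and the function name, the `max_depth` threshold and the `fill` value are hypothetical.

```python
def mask_far_pixels(image, depth, max_depth, fill=0):
    """Return a copy of image with pixels farther than max_depth replaced.

    image and depth are equal-sized 2D nested lists; depth is assumed
    to grow with distance from the camera.
    """
    if len(image) != len(depth) or any(
        len(img_row) != len(d_row) for img_row, d_row in zip(image, depth)
    ):
        raise ValueError("image and depth must have the same shape")
    return [
        [pix if d <= max_depth else fill for pix, d in zip(img_row, d_row)]
        for img_row, d_row in zip(image, depth)
    ]
```

The masked image would then be handed to the object detection model, so that only obstacles within the chosen distance can trigger a detection. If the depth network outputs inverse or relative depth instead, the comparison would need to be flipped or rescaled accordingly.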
... This challenge is a part of the AIM 2022 Challenges: Real-Time Image Super-Resolution [12], Real-Time Video Super-Resolution [13], Single-Image Depth Estimation [15], Learned Smartphone ISP [16], Real-Time Rendering Realistic Bokeh [14], Compressed Input Super-Resolution [35] and Reversed ISP [10]. The results obtained in the other competitions and the description of the proposed solutions can be found in the corresponding challenge reports. ...
Preprint
Full-text available
This paper introduces the methods and the results of AIM 2022 challenge on Instagram Filter Removal. Social media filters transform the images by consecutive non-linear operations, and the feature maps of the original content may be interpolated into a different domain. This reduces the overall performance of the recent deep learning strategies. The main goal of this challenge is to produce realistic and visually plausible images where the impact of the filters applied is mitigated while preserving the content. The proposed solutions are ranked in terms of the PSNR value with respect to the original images. There are two prior studies on this task as the baseline, and a total of 9 teams have competed in the final phase of the challenge. The comparison of qualitative results of the proposed solutions and the benchmark for the challenge are presented in this report.
... The AIM 2022 Challenge on Super-Resolution of Compressed Image and Video is one of the AIM 2022 associated challenges: reversed ISP [18], efficient learned ISP [36], super-resolution of compressed image and video [73], efficient image super-resolution [32], efficient video super-resolution [33], efficient Bokeh effect rendering [34], efficient monocular depth estimation [35], Instagram filter removal [42]. ...
Preprint
Full-text available
This paper reviews the Challenge on Super-Resolution of Compressed Image and Video at AIM 2022. This challenge includes two tracks. Track 1 aims at the super-resolution of compressed image, and Track 2 targets the super-resolution of compressed video. In Track 1, we use the popular dataset DIV2K as the training, validation and test sets. In Track 2, we propose the LDV 3.0 dataset, which contains 365 videos, including the LDV 2.0 dataset (335 videos) and 30 additional videos. In this challenge, there are 12 teams and 2 teams that submitted the final results to Track 1 and Track 2, respectively. The proposed methods and solutions gauge the state-of-the-art of super-resolution on compressed image and video. The proposed LDV 3.0 dataset is available at https://github.com/RenYang-home/LDV_dataset. The homepage of this challenge is at https://github.com/RenYang-home/AIM22_CompressSR.
... Further, it has been gaining traction for tasks other than semantic segmentation such as image reconstruction, with examples being inpainting [30], view-synthesis [2,46] and relighting [62], as well as depth estimation [12][13][14]65]. Adding to the latter, in the recent Mobile AI 2021 Challenge [22] on single image depth estimation, 7 out of 10 submissions used UNet architectures. Also, the detail preserving nature of early encoder features allowed its application as the discriminator architecture in high quality synthesis tasks [49]. ...
Preprint
Full-text available
In this work we introduce a biologically inspired long-range skip connection for the UNet architecture that relies on the perceptual illusion of hybrid images, being images that simultaneously encode two images. The fusion of early encoder features with deeper decoder ones allows UNet models to produce finer-grained dense predictions. While proven in segmentation tasks, the network's benefits are down-weighted for dense regression tasks as these long-range skip connections additionally result in texture transfer artifacts. Specifically for depth estimation, this hurts smoothness and introduces false positive edges which are detrimental to the task due to the depth maps' piece-wise smooth nature. The proposed HybridSkip connections show improved performance in balancing the trade-off between edge preservation, and the minimization of texture transfer artifacts that hurt smoothness. This is achieved by the proper and balanced exchange of information that HybridSkip connections offer between the high and low frequency, encoder and decoder features, respectively.
... linear relationship with scene irradiance, raw and untampered signal and noise samples) are often better suited for the ill-posed, inverse problems that commonly arise in low-level vision tasks such as denoising, demosaicing, HDR, super-resolution (Qian et al. 2019; Abdelhamed, Lin, and Brown 2018; Wronski et al. 2019; Gharbi et al. 2016; Liu et al. 2020). For tasks within the ISP, this does not come as a choice but rather a must, as the input domain is necessarily in the RAW domain due to the camera hardware design (Buckler, Jayasuriya, and Sampson 2017; Ignatov et al. 2021). ...
Conference Paper
Full-text available
Digital cameras transform sensor RAW readings into RGB images by means of their Image Signal Processor (ISP). Computational photography tasks such as image denoising and colour constancy are commonly performed in the RAW domain, in part due to the inherent hardware design, but also due to the appealing simplicity of noise statistics that result from the direct sensor readings. Despite this, the availability of RAW images is limited in comparison with the abundance and diversity of available RGB data. Recent approaches have attempted to bridge this gap by estimating the RGB to RAW mapping: handcrafted model-based methods that are interpretable and controllable usually require manual parameter fine-tuning, while end-to-end learnable neural networks require large amounts of training data, at times with complex training procedures, and generally lack interpretability and parametric control. Towards addressing these existing limitations, we present a novel hybrid model-based and data-driven ISP that builds on canonical ISP operations and is both learnable and interpretable. Our proposed invertible model, capable of bidirectional mapping between RAW and RGB domains, employs end-to-end learning of rich parameter representations, i.e. dictionaries, that are free from direct parametric supervision and additionally enable simple and plausible data augmentation. We evidence the value of our data generation process by extensive experiments under both RAW image reconstruction and RAW image denoising tasks, obtaining state-of-the-art performance in both. Additionally, we show that our ISP can learn meaningful mappings from few data samples, and that denoising models trained with our dictionary-based data augmentation are competitive despite having only few or zero ground-truth labels.
... Further, it has been gaining traction for tasks other than semantic segmentation such as image reconstruction, with examples being inpainting [30], view-synthesis [2], [46] and relighting [62], as well as depth estimation [12]- [14], [65]. Adding to the latter, in the recent Mobile AI 2021 Challenge [22] on single image depth estimation, 7 out of 10 submissions used UNet architectures. Also, the detail preserving nature of early encoder features allowed its application as the discriminator architecture in high quality synthesis tasks [49]. ...
Article
Full-text available
In this work we introduce a biologically inspired long-range skip connection for the UNet architecture that relies on the perceptual illusion of hybrid images, being images that simultaneously encode two images. The fusion of early encoder features with deeper decoder ones allows UNet models to produce finer-grained dense predictions. While proven in segmentation tasks, the network’s benefits are down-weighted for dense regression tasks as these long-range skip connections additionally result in texture transfer artifacts. Specifically for depth estimation, this hurts smoothness and introduces false positive edges which are detrimental to the task due to the depth maps’ piece-wise smooth nature. The proposed HybridSkip connections show improved performance in balancing the trade-off between edge preservation, and the minimization of texture transfer artifacts that hurt smoothness. This is achieved by the proper and balanced exchange of information that HybridSkip connections offer between the high and low frequency, encoder and decoder features, respectively. The code and models will be made available in the project page.
... Up to now, few works have been proposed to address this problem in the indoor scenario. Most of them [6], [24] are designed to achieve real-time frequencies on an NVIDIA Jetson TX2 GPU while being also widely employed in mobile applications [23]. Differently, we are interested in testing those methods on less investigated devices; for this reason we chose as benchmark hardware the TPU-v2 and an Intel CPU provided by Google Cloud Platform. ...
Article
Full-text available
Monocular depth estimation (MDE) is the task of estimating depth from a single frame. This information is essential knowledge in many computer vision tasks such as scene understanding and visual odometry, which are key components in autonomous and robotic systems. Approaches based on state-of-the-art vision transformer architectures are extremely deep and complex, and thus not suitable for real-time inference on edge and autonomous systems equipped with low resources (e.g. robot indoor navigation and surveillance). This paper presents SPEED, a Separable Pyramidal pooling EncodEr-Decoder architecture designed to achieve real-time frequency performances on multiple hardware platforms. The proposed model is a fast-throughput deep architecture for MDE able to obtain depth estimations with high accuracy from low resolution images using minimum hardware resources (i.e. edge devices). Our encoder-decoder model exploits two depthwise separable pyramidal pooling layers, which increase the inference frequency while reducing the overall computational complexity. The proposed method performs better than other fast-throughput architectures in terms of both accuracy and frame rates, achieving real-time performances over cloud CPU, TPU and the NVIDIA Jetson TX1 on two indoor benchmarks: the NYU Depth v2 and the DIML Kinect v2 datasets.
... We also compare the runtime of our models with state-of-the-art lightweight methods on an Android device using the app from the Mobile AI benchmark developed by Ignatov et al. [53]. To this end, we utilize the pre-trained models provided by the authors (Tensorflow [80], PyTorch [96]), convert them to tflite and measure their runtime on mobile CPUs. ...
Preprint
Full-text available
Dense prediction is a class of computer vision problems aiming at mapping every pixel of the input image to some predicted value. Depending on the problem, the output values can be either continuous or discrete. For instance, monocular depth estimation and image super-resolution are often formulated as regression, while semantic segmentation is a dense classification, i.e. discrete, problem. More specifically, the monocular depth estimation problem produces a dense depth map from a single image to be used in various applications including robotics, scene understanding, and augmented reality. Single image super-resolution (SISR) is a low-level vision task that generates a high-resolution image from its low-resolution counterpart. SISR is widely utilized in medical and surveillance imaging, where images with more precise details can provide invaluable information. On the other hand, semantic segmentation predicts a dense annotated map of different semantic categories from a given image that is crucial for image understanding tasks.
... linear relationship with scene irradiance, raw and untampered signal and noise samples) are often better suited for the ill-posed, inverse problems that commonly arise in low-level vision tasks such as denoising, demosaicing, HDR, super-resolution (Qian et al. 2019; Abdelhamed, Lin, and Brown 2018; Wronski et al. 2019; Gharbi et al. 2016; Liu et al. 2020). For tasks within the ISP, this does not come as a choice but rather a must, as the input domain is necessarily in the RAW domain due to the camera hardware design (Buckler, Jayasuriya, and Sampson 2017; Ignatov et al. 2021). ...
Preprint
Full-text available
Digital cameras transform sensor RAW readings into RGB images by means of their Image Signal Processor (ISP). Computational photography tasks such as image denoising and colour constancy are commonly performed in the RAW domain, in part due to the inherent hardware design, but also due to the appealing simplicity of noise statistics that result from the direct sensor readings. Despite this, the availability of RAW images is limited in comparison with the abundance and diversity of available RGB data. Recent approaches have attempted to bridge this gap by estimating the RGB to RAW mapping: handcrafted model-based methods that are interpretable and controllable usually require manual parameter fine-tuning, while end-to-end learnable neural networks require large amounts of training data, at times with complex training procedures, and generally lack interpretability and parametric control. Towards addressing these existing limitations, we present a novel hybrid model-based and data-driven ISP that builds on canonical ISP operations and is both learnable and interpretable. Our proposed invertible model, capable of bidirectional mapping between RAW and RGB domains, employs end-to-end learning of rich parameter representations, i.e. dictionaries, that are free from direct parametric supervision and additionally enable simple and plausible data augmentation. We evidence the value of our data generation process by extensive experiments under both RAW image reconstruction and RAW image denoising tasks, obtaining state-of-the-art performance in both. Additionally, we show that our ISP can learn meaningful mappings from few data samples, and that denoising models trained with our dictionary-based data augmentation are competitive despite having only few or zero ground-truth labels.
... Runtime Measurement: We also compare the runtime of our models with state-of-the-art lightweight methods on an Android device using the app from the Mobile AI benchmark developed by Ignatov et al. [29]. To this end, we utilize the pre-trained models provided by the authors (Tensorflow [42], PyTorch [55]) and convert them to tflite. ...
Preprint
Full-text available
This paper presents a novel neural architecture search method, called LiDNAS, for generating lightweight monocular depth estimation models. Unlike previous neural architecture search (NAS) approaches, where finding optimized networks is computationally highly demanding, the introduced novel Assisted Tabu Search leads to efficient architecture exploration. Moreover, we construct the search space on a pre-defined backbone network to balance layer diversity and search space size. The LiDNAS method outperforms the state-of-the-art NAS approach, proposed for disparity and depth estimation, in terms of search efficiency and output model performance. The LiDNAS optimized models achieve results superior to the compact depth estimation state of the art on NYU-Depth-v2, KITTI, and ScanNet, while being 7%-500% more compact in size, i.e. the number of model parameters.
Article
Full-text available
Monocular depth estimation (MDE) is critical in enabling intelligent autonomous systems and has received considerable attention in recent years. Achieving both low latency and high accuracy in MDE is desirable but challenging to optimize, especially on edge devices. In this paper, we present a novel approach to balancing speed and accuracy in MDE on edge devices. We introduce FasterMDE, an efficient and fast encoder-decoder network architecture that leverages a multiobjective neural architecture search method to find the optimal encoder structure for the target edge. Moreover, we incorporate a neural window fully connected CRF module into the network as the decoder, enhancing fine-grained depth prediction based on coarse depth and image features. To address the issue of bad “local minimums” in the multiobjective neural architecture search, we propose a new approach for automatically learning the weights of subobjective loss functions based on uncertainty. We also accelerate the FasterMDE model using TensorRT and implement it on a target edge device. The experimental results demonstrate that FasterMDE achieves a better balance of speed and accuracy on the KITTI and NYUv2 datasets compared to previous methods. We validate the effectiveness of the proposed method through an ablation study and verify the real-time monocular depth estimation performance of FasterMDE in realistic scenarios. On the KITTI dataset, the FasterMDE model achieves a high frame rate of 555.55 FPS with 9.1% Abs Rel on a single NVIDIA Titan RTX GPU and 14.46 FPS on the NVIDIA Jetson Xavier NX.
Chapter
Full-text available
The role of mobile cameras increased dramatically over the past few years, leading to more and more research in automatic image quality enhancement and RAW photo processing. In this Mobile AI challenge, the target was to develop an efficient end-to-end AI-based image signal processing (ISP) pipeline replacing the standard mobile ISPs that can run on modern smartphone GPUs using TensorFlow Lite. The participants were provided with a large-scale Fujifilm UltraISP dataset consisting of thousands of paired photos captured with a normal mobile camera sensor and a professional 102MP medium-format FujiFilm GFX100 camera. The runtime of the resulting models was evaluated on the Snapdragon’s 8 Gen 1 GPU that provides excellent acceleration results for the majority of common deep learning ops. The proposed solutions are compatible with all recent mobile GPUs, being able to process Full HD photos in less than 20–50 ms while achieving high fidelity results. A detailed description of all models developed in this challenge is provided in this paper.
Chapter
Video super-resolution is one of the most popular tasks on mobile devices, being widely used for an automatic improvement of low-bitrate and low-resolution video streams. While numerous solutions have been proposed for this problem, they are usually quite computationally demanding, demonstrating low FPS rates and power efficiency on mobile devices. In this Mobile AI challenge, we address this problem and ask the participants to design an end-to-end real-time video super-resolution solution for mobile NPUs optimized for low energy consumption. The participants were provided with the REDS training dataset containing video sequences for a 4X video upscaling task. The runtime and power efficiency of all models were evaluated on the powerful MediaTek Dimensity 9000 platform with a dedicated AI processing unit capable of accelerating floating-point and quantized neural networks. All proposed solutions are fully compatible with the above NPU, demonstrating an up to 500 FPS rate and 0.2 [Watt/30 FPS] power consumption. A detailed description of all models developed in the challenge is provided in this paper.
Chapter
Various depth estimation models are now widely used on many mobile and IoT devices for image segmentation, bokeh effect rendering, object tracking and many other mobile tasks. Thus, it is crucial to have efficient and accurate depth estimation models that can run fast on low-power mobile chipsets. In this Mobile AI challenge, the target was to develop deep learning-based single image depth estimation solutions that can show real-time performance on IoT platforms and smartphones. For this, the participants used a large-scale RGB-to-depth dataset that was collected with the ZED stereo camera, capable of generating depth maps for objects located at up to 50 m. The runtime of all models was evaluated on the Raspberry Pi 4 platform, where the developed solutions were able to generate VGA resolution depth maps at up to 27 FPS while achieving high fidelity results. All models developed in the challenge are also compatible with any Android or Linux-based mobile devices; their detailed description is provided in this paper.
Chapter
Full-text available
Image super-resolution is a common task on mobile and IoT devices, where one often needs to upscale and enhance low-resolution images and video frames. While numerous solutions have been proposed for this problem in the past, they are usually not compatible with low-power mobile NPUs having many computational and memory constraints. In this Mobile AI challenge, we address this problem and ask the participants to design an efficient quantized image super-resolution solution that can demonstrate real-time performance on mobile NPUs. The participants were provided with the DIV2K dataset and trained INT8 models to do a high-quality 3X image upscaling. The runtime of all models was evaluated on the Synaptics VS680 Smart Home board with a dedicated edge NPU capable of accelerating quantized neural networks. All proposed solutions are fully compatible with the above NPU, demonstrating an up to 60 FPS rate when reconstructing Full HD resolution images. A detailed description of all models developed in the challenge is provided in this paper.
Chapter
As mobile cameras with compact optics are unable to produce a strong bokeh effect, lots of interest is now devoted to deep learning-based solutions for this task. In this Mobile AI challenge, the target was to develop an efficient end-to-end AI-based bokeh effect rendering approach that can run on modern smartphone GPUs using TensorFlow Lite. The participants were provided with a large-scale EBB! bokeh dataset consisting of 5K shallow/wide depth-of-field image pairs captured using the Canon 7D DSLR camera. The runtime of the resulting models was evaluated on the Kirin 9000’s Mali GPU that provides excellent acceleration results for the majority of common deep learning ops. A detailed description of all models developed in this challenge is provided in this paper.
Chapter
Monocular depth estimation is an essential task in the computer vision community. While many successful methods have obtained excellent results, most of them are computationally expensive and not applicable for real-time on-device inference. In this paper, we aim to address more practical applications of monocular depth estimation, where the solution should consider not only the precision but also the inference time on mobile devices. To this end, we first develop an end-to-end learning-based model with a tiny weight size (1.4 MB) and a short inference time (27 FPS on Raspberry Pi 4). Then, we propose a simple yet effective data augmentation strategy, called R2crop, to boost the model performance. Moreover, we observe that the simple lightweight model trained with only a single loss term suffers from a performance bottleneck. To alleviate this issue, we adopt multiple loss terms to provide sufficient constraints during the training stage. Furthermore, with a simple dynamic re-weight strategy, we can avoid the time-consuming hyper-parameter choice of loss terms. Finally, we adopt structure-aware distillation to further improve the model performance. Notably, our solution named LiteDepth ranks 2nd in the MAI & AIM 2022 Monocular Depth Estimation Challenge, with a si-RMSE of 0.311, an RMSE of 3.79, and an inference time of 37 ms tested on the Raspberry Pi 4. We also provide the fastest solution to the challenge. Codes and models will be released at https://github.com/zhyever/LiteDepth.
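The two kinds of figures quoted above (si-RMSE and runtime) can be related to code with a small sketch. It assumes one common definition of scale-invariant RMSE, computed on log-depth differences in the style of Eigen et al.; the exact formula used by the challenge is not given in this abstract, so `si_rmse` is an illustrative assumption, and both function names are hypothetical.

```python
import math


def si_rmse(pred, target):
    """Scale-invariant RMSE on log depths (assumed definition, not the challenge's official one).

    Both sequences must contain strictly positive depths of equal length.
    """
    diffs = [math.log(p) - math.log(t) for p, t in zip(pred, target)]
    n = len(diffs)
    mean_sq = sum(d * d for d in diffs) / n
    sq_mean = (sum(diffs) / n) ** 2
    # Clamp at zero to guard against tiny negative values from rounding.
    return math.sqrt(max(mean_sq - sq_mean, 0.0))


def fps_from_latency_ms(latency_ms: float) -> float:
    """Convert a per-frame latency in milliseconds to frames per second."""
    return 1000.0 / latency_ms


if __name__ == "__main__":
    # A prediction that is off by a constant factor has (near) zero si-RMSE.
    print(si_rmse([1.0, 2.0, 4.0], [2.0, 4.0, 8.0]))
    # 37 ms per frame corresponds to about 27 frames per second.
    print(fps_from_latency_ms(37.0))
```

The latency conversion shows that the reported 37 ms inference time and the 27 FPS figure describe the same speed.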
Article
Photoplethysmography (PPG), one of the most widely used physiological signals on wearable devices thanks to its portability and accessibility, is an ideal carrier of biometric recognition for guaranteeing the security of sensitive information. However, the existing state-of-the-art methods are hard to deploy in practice because wearable devices are power-constrained and have limited compute. 1D convolutional neural networks (1D-CNNs) have succeeded in numerous applications on sequential signals. Still, they fall short in modeling long-range dependencies (LRD), which are strongly needed in high-security PPG-based biometric recognition. In view of these limitations, this paper conducts a comparative study of scalable end-to-end 1D-CNNs for capturing LRD and parameterizing authorized templates by enlarging the receptive fields via stacking convolution operations, non-local blocks, and attention mechanisms. Compared to a robust baseline model, seven scalable models have different impacts (−0.2%–9.9%) on the accuracy of recognition over three datasets. Experimental cases demonstrate clear-cut improvements. Scalable models achieve state-of-the-art performance with an accuracy of over 97% on VitalDB and with the best accuracy on the BIDMC and PRRB datasets reaching 99.5% and 99.3%, respectively. We also discuss the effects of capturing LRD in generated templates by visualizations with Gramian Angular Summation Field and Class Activation Map. This study concludes that scalable 1D-CNNs offer a high-performance and complexity-feasible approach for biometric recognition using PPG.
Article
Knowledge distillation has become a key technique for making smart and light-weight networks through model compression and transfer learning. Unlike previous methods that applied knowledge distillation to the classification task, we propose to exploit a decomposition-and-replacement based distillation scheme for depth estimation from a single RGB color image. To do this, Laplacian pyramid-based knowledge distillation is first presented in this paper. The key idea of the proposed method is to transfer the rich knowledge of the scene depth, which is well encoded through the teacher network, to the student network in a structured way by decomposing it into the global context and local details. This is fairly desirable for the student network to restore the depth layout more accurately with limited resources. Moreover, we also propose a new guidance concept for knowledge distillation, so-called ReplaceBlock, which replaces blocks randomly selected in the decoded feature of the student network with those of the teacher network. Our ReplaceBlock gives a smoothing effect in learning the feature distribution of the teacher network by considering the spatial contiguity in the feature space. This process also helps to clearly restore the depth layout without significant computational cost. Based on various experimental results on benchmark datasets, the effectiveness of our distillation scheme for monocular depth estimation is demonstrated in detail. The code and model are publicly available at https://github.com/tjqansthd/Lap_Rep_KD_Depth.
Article
Nowadays, smartphones can produce a synchronized (synced) stream of high-quality data, including RGB images, inertial measurements, and other data. Therefore, smartphones are becoming appealing sensor systems in the robotics community. Unfortunately, there is still the need for external supporting sensing hardware, such as a depth camera precisely synced with the smartphone sensors. In this paper, we propose a hardware-software recording system that presents a heterogeneous structure and contains a smartphone and an external depth camera for recording visual, depth, and inertial data that are mutually synchronized. The system is synced at the time and the frame levels: every RGB image frame from the smartphone camera is exposed at the same moment of time with a depth camera frame with sub-millisecond precision. We provide a method and a tool for sync performance evaluation that can be applied to any pair of depth and RGB cameras. Our system could be replicated, modified, or extended by employing our open-sourced materials.
Article
Full-text available
On-device inference of machine learning models for mobile phones is desirable due to its lower latency and increased privacy. Running such a compute-intensive task solely on the mobile CPU, however, can be difficult due to limited computing power, thermal constraints, and energy consumption. App developers and researchers have begun exploiting hardware accelerators to overcome these challenges. Recently, device manufacturers are adding neural processing units into high-end phones for on-device inference, but these account for only a small fraction of hand-held devices. In this paper, we present how we leverage the mobile GPU, a ubiquitous hardware accelerator on virtually every phone, to run inference of deep neural networks in real-time for both Android and iOS devices. By describing our architecture, we also discuss how to design networks that are mobile GPU-friendly. Our state-of-the-art mobile GPU inference engine is integrated into the open-source project TensorFlow Lite and publicly available at https://tensorflow.org/lite.
Chapter
Full-text available
This paper reviews the second AIM learned ISP challenge and provides the description of the proposed solutions and results. The participating teams were solving a real-world RAW-to-RGB mapping problem, where the goal was to map the original low-quality RAW images captured by the Huawei P20 device to the same photos obtained with the Canon 5D DSLR camera. The considered task embraced a number of complex computer vision subtasks, such as image demosaicing, denoising, white balancing, color and contrast correction, demoireing, etc. The target metric used in this challenge combined fidelity scores (PSNR and SSIM) with solutions’ perceptual results measured in a user study. The proposed solutions significantly improved the baseline results, defining the state-of-the-art for practical image signal processing pipeline modeling.
Article
Full-text available
Depth perception is paramount for tackling real-world problems, ranging from autonomous driving to consumer applications. For the latter, depth estimation from a single image would represent the most versatile solution since a standard camera is available on almost any handheld device. Nonetheless, two main issues limit the practical deployment of monocular depth estimation methods on such devices: (i) the low reliability when deployed in the wild and (ii) the resources needed to achieve real-time performance, often not compatible with low-power embedded systems. Therefore, in this paper, we deeply investigate all these issues, showing how they are both addressable by adopting appropriate network design and training strategies. Moreover, we also outline how to map the resulting networks on handheld devices to achieve real-time performance. Our thorough evaluation highlights the ability of such fast networks to generalize well to new environments, a crucial feature required to tackle the extremely varied contexts faced in real applications. Indeed, to further support this evidence, we report experimental results concerning real-time, depth-aware augmented reality and image blurring with smartphones in the wild.
Article
Full-text available
The success of monocular depth estimation relies on large and diverse training sets. Due to the challenges associated with acquiring dense ground-truth depth across different environments at scale, a number of datasets with distinct characteristics and biases have emerged. We develop tools that enable mixing multiple datasets during training, even if their annotations are incompatible. In particular, we propose a robust training objective that is invariant to changes in depth range and scale, advocate the use of principled multi-objective learning to combine data from different sources, and highlight the importance of pretraining encoders on auxiliary tasks. Armed with these tools, we experiment with six diverse training datasets, including a new, massive data source: 3D films. To demonstrate the generalization power of our approach we use zero-shot cross-dataset transfer, i.e. we evaluate on datasets that were not seen during training. The experiments confirm that mixing data from complementary sources greatly improves monocular depth estimation. Our approach clearly outperforms competing methods across diverse datasets, setting a new state of the art for monocular depth estimation.
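One way to make an error measure invariant to depth range and scale, as described above, is to align the prediction to the ground truth with a least-squares scale and shift before measuring the error. The sketch below illustrates that idea only; it is an assumption and not the paper's exact training objective, which is not spelled out in this abstract. The function names are hypothetical.

```python
def align_scale_shift(pred, target):
    """Return (scale, shift) minimising sum((scale * p + shift - t) ** 2)."""
    n = len(pred)
    mean_p = sum(pred) / n
    mean_t = sum(target) / n
    var_p = sum((p - mean_p) ** 2 for p in pred)
    if var_p == 0.0:
        # A constant prediction can only be matched through the shift term.
        return 0.0, mean_t
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, target))
    scale = cov / var_p
    shift = mean_t - scale * mean_p
    return scale, shift


def aligned_mae(pred, target):
    """Mean absolute error after the least-squares scale and shift alignment."""
    scale, shift = align_scale_shift(pred, target)
    return sum(abs(scale * p + shift - t) for p, t in zip(pred, target)) / len(pred)


if __name__ == "__main__":
    pred = [1.0, 2.0, 3.0, 4.0]
    # Ground truth that differs from the prediction only by scale 3 and shift 5.
    target = [8.0, 11.0, 14.0, 17.0]
    print(align_scale_shift(pred, target))
    print(aligned_mae(pred, target))
```

Because the alignment absorbs any global scale and offset, annotations from datasets with different depth conventions produce comparable error values under this measure.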
Article
Full-text available
Despite the growing popularity of deep learning technologies, high memory requirements and power consumption are essentially limiting their application in mobile and IoT areas. While binary convolutional networks can alleviate these problems, the limited bitwidth of weights is often leading to significant degradation of prediction accuracy. In this paper, we present a method for training binary networks that maintains a stable predefined level of their information capacity throughout the training process by applying Shannon entropy based penalty to convolutional filters. The results of experiments conducted on the SVHN, CIFAR and ImageNet datasets demonstrate that the proposed approach can statistically significantly improve the accuracy of binary networks.
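As a rough illustration of an entropy-based penalty on binary filters, the sketch below computes the Shannon entropy of the sign distribution of a filter's weights and penalises its squared distance from a target level. The squared-distance form, the target value, and the function names are assumptions made for illustration; the abstract does not specify the actual penalty.

```python
import math


def binary_entropy(weights):
    """Shannon entropy (in bits) of the +1/-1 distribution of a binarised filter."""
    p = sum(1 for w in weights if w > 0) / len(weights)
    if p in (0.0, 1.0):
        return 0.0  # all weights share one sign, so the filter carries no information
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def entropy_penalty(weights, target_entropy):
    """Squared deviation of the filter entropy from a predefined level (assumed form)."""
    return (binary_entropy(weights) - target_entropy) ** 2


if __name__ == "__main__":
    balanced = [1, -1, 1, -1]
    collapsed = [1, 1, 1, 1]
    print(binary_entropy(balanced), binary_entropy(collapsed))
    print(entropy_penalty(collapsed, target_entropy=1.0))
```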
Conference Paper
Full-text available
Efficient deep neural network (DNN) inference on mobile or embedded devices typically involves quantization of the network parameters and activations. In particular, mixed precision networks achieve better performance than networks with homogeneous bitwidth for the same size constraint. Since choosing the optimal bitwidths is not straightforward, training methods that can learn them are desirable. Differentiable quantization with straight-through gradients allows the quantizer's parameters to be learned using gradient methods. We show that a suitable parametrization of the quantizer is the key to achieving stable training and a good final performance. Specifically, we propose to parametrize the quantizer with the step size and dynamic range. The bitwidth can then be inferred from them. Other parametrizations, which explicitly use the bitwidth, consistently perform worse. We confirm our findings with experiments on CIFAR-10 and ImageNet and we obtain mixed precision DNNs with learned quantization parameters, achieving state-of-the-art performance.
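The relationship between step size, dynamic range and bitwidth can be sketched for a symmetric uniform quantizer as follows. This is a minimal illustration under assumed conventions (symmetric range, bitwidth taken as the number of bits needed to index all levels); the paper's exact quantizer and formula may differ, and the function names are hypothetical.

```python
import math


def quantize(x: float, step: float, q_max: float) -> float:
    """Clip x to [-q_max, q_max] and round it to the nearest multiple of step."""
    clipped = max(-q_max, min(q_max, x))
    return round(clipped / step) * step


def infer_bitwidth(step: float, q_max: float) -> int:
    """Bits needed to index every level of the symmetric quantizer (assumed convention)."""
    levels = 2 * round(q_max / step) + 1  # levels on both sides of zero, plus zero itself
    return math.ceil(math.log2(levels))


if __name__ == "__main__":
    print(quantize(0.26, step=0.1, q_max=1.0))   # rounded to the nearest 0.1
    print(quantize(5.0, step=0.1, q_max=1.0))    # clipped to the dynamic range
    print(infer_bitwidth(step=1.0, q_max=127.0))
```

In this view the bitwidth never has to be a trainable quantity: it falls out of the two continuous parameters once they are fixed.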
Conference Paper
Full-text available
This paper reviews the first AIM challenge on mapping camera RAW to RGB images with the focus on proposed solutions and results. The participating teams were solving a real-world photo enhancement problem, where the goal was to map the original low-quality RAW images from the Huawei P20 device to the same photos captured with the Canon 5D DSLR camera. The considered problem embraced a number of computer vision subtasks, such as image demosaicing, denoising, gamma correction, image resolution and sharpness enhancement, etc. The target metric used in this challenge combined fidelity scores (PSNR and SSIM) with solutions' perceptual results measured in a user study. The proposed solutions significantly improved baseline results, defining the state-of-the-art for RAW to RGB image restoration.
Conference Paper
Full-text available
This paper reviewed the 3rd NTIRE challenge on single-image super-resolution (restoration of rich details in a low-resolution image) with a focus on proposed solutions and results. The challenge had 1 track, which was aimed at the real-world single image super-resolution problem with an unknown scaling factor. Participants were mapping low-resolution images captured by a DSLR camera with a shorter focal length to their high-resolution images captured at a longer focal length. With this challenge, we introduced a novel real-world super-resolution dataset (RealSR). The track had 403 registered participants, and 36 teams competed in the final testing phase. They gauge the state-of-the-art in real-world single image super-resolution.
Conference Paper
Full-text available
This paper reviews the first NTIRE challenge on perceptual image enhancement with the focus on proposed solutions and results. The participating teams were solving a real-world photo enhancement problem, where the goal was to map low-quality photos from the iPhone 3GS device to the same photos captured with the Canon 70D DSLR camera. The considered problem embraced a number of computer vision subtasks, such as image denoising, image resolution and sharpness enhancement, image color/contrast/exposure adjustment, etc. The target metric used in this challenge combined fidelity scores (PSNR and SSIM) with solutions' perceptual results measured in a user study. Of more than 200 registered participants, 13 teams submitted solutions for the final test phase of the challenge. The proposed solutions significantly improved baseline results, defining the state-of-the-art for practical image enhancement.
Conference Paper
Full-text available
In this paper we study the problem of monocular relative depth perception in the wild. We introduce a simple yet effective method to automatically generate dense relative depth annotations from web stereo images, and propose a new dataset that consists of diverse images as well as corresponding dense relative depth maps. Further, an improved ranking loss is introduced to deal with imbalanced ordinal relations, enforcing the network to focus on a set of hard pairs. Experimental results demonstrate that our proposed approach not only achieves state-of-the-art accuracy of relative depth perception in the wild, but also benefits other dense per-pixel prediction tasks, e.g., metric depth estimation and semantic segmentation.
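A ranking loss over ordinal depth pairs that concentrates on hard pairs can be sketched as below. The pairwise form (a logistic loss for ordered pairs, a squared difference for pairs labelled equal) is a commonly used ranking loss, and keeping only the largest losses is one simple way to focus on hard pairs. Both choices, the label convention, and the `keep_fraction` parameter are assumptions for illustration rather than the paper's exact formulation.

```python
import math


def pair_ranking_loss(z_a: float, z_b: float, label: int) -> float:
    """Loss for one pair of predicted depths.

    label is +1 if point a should rank above point b, -1 for the reverse,
    and 0 if the two depths are considered equal (hypothetical convention).
    """
    diff = z_a - z_b
    if label == 0:
        return diff * diff
    return math.log(1.0 + math.exp(-label * diff))


def hard_pair_loss(pairs, keep_fraction: float = 0.5) -> float:
    """Average the loss over the hardest pairs only; pairs must be non-empty."""
    losses = sorted((pair_ranking_loss(a, b, l) for a, b, l in pairs), reverse=True)
    keep = max(1, int(len(losses) * keep_fraction))
    hardest = losses[:keep]
    return sum(hardest) / len(hardest)


if __name__ == "__main__":
    pairs = [(2.0, 1.0, 1), (1.0, 2.0, 1), (1.0, 1.0, 0)]
    # The second pair violates its label, so it dominates the hard-pair average.
    print(hard_pair_loss(pairs, keep_fraction=0.5))
```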
Article
Full-text available
The ZED camera is a binocular vision system that can be used to provide a 3D perception of the world. It can be applied in autonomous robot navigation, virtual reality, tracking, motion analysis and so on. This paper proposes a mathematical error model for depth data estimated by the ZED camera at its several resolutions of operation. To do that, the ZED is attached to an Nvidia Jetson TK1 board, providing an embedded system that is used for processing raw data acquired by the ZED from a 3D checkerboard. Corners are extracted from the checkerboard using RGB data, and a 3D reconstruction is done for these points using disparity data calculated from the ZED camera, yielding a partially ordered and regularly distributed (in 3D space) point cloud of corners with given coordinates (x_e, y_e, z_e), which are computed by the device software. These corners also have their ideal world (3D) positions (x_i, y_i, z_i) known with respect to the coordinate frame origin that is empirically set in the pattern. Both given (computed) coordinates from the camera's data and known (ideal) coordinates of a corner can, thus, be compared for estimating the error between the given and ideal point locations of the detected corner cloud. Subsequently, using a curve fitting technique, we obtain the equations that model the RMS (Root Mean Square) error. This procedure is repeated for several resolutions of the ZED sensor, and at several distances. Results showed its best effectiveness with a maximum distance of approximately sixteen meters, in real time, which allows its use in robotic or other online applications.
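The comparison between computed and ideal corner positions can be illustrated with a short RMS error computation. This sketch assumes the error of each corner is the Euclidean distance between its computed and ideal 3D positions; the paper's exact error definition and its curve-fitting step are not reproduced here, and the function name is hypothetical.

```python
import math


def rms_error(computed, ideal):
    """RMS of the Euclidean distances between matching 3D points.

    computed holds the (x_e, y_e, z_e) tuples reported by the device and
    ideal holds the known (x_i, y_i, z_i) tuples, in the same order.
    """
    squared = [
        (xe - xi) ** 2 + (ye - yi) ** 2 + (ze - zi) ** 2
        for (xe, ye, ze), (xi, yi, zi) in zip(computed, ideal)
    ]
    return math.sqrt(sum(squared) / len(squared))


if __name__ == "__main__":
    ideal = [(0.0, 0.0, 1.0), (0.1, 0.0, 1.0)]
    computed = [(0.0, 0.0, 1.02), (0.1, 0.0, 0.98)]
    print(rms_error(computed, ideal))
```

Repeating such a measurement at several distances would give the sample points to which an error curve can then be fitted.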
Article
We propose a novel attention gate (AG) model for medical imaging that automatically learns to focus on target structures of varying shapes and sizes. Models trained with AGs implicitly learn to suppress irrelevant regions in an input image while highlighting salient features useful for a specific task. This enables us to eliminate the necessity of using explicit external tissue/organ localisation modules of cascaded convolutional neural networks (CNNs). AGs can be easily integrated into standard CNN architectures such as the U-Net model with minimal computational overhead while increasing the model sensitivity and prediction accuracy. The proposed Attention U-Net architecture is evaluated on two large CT abdominal datasets for multi-class image segmentation. Experimental results show that AGs consistently improve the prediction performance of U-Net across different datasets and training sizes while preserving computational efficiency. The code for the proposed architecture is publicly available.
Chapter
This paper reviews the second AIM realistic bokeh effect rendering challenge and describes the proposed solutions and results. The participating teams were solving a real-world bokeh simulation problem, where the goal was to learn a realistic shallow focus technique using a large-scale EBB! bokeh dataset consisting of 5K shallow/wide depth-of-field image pairs captured using the Canon 7D DSLR camera. The participants had to render the bokeh effect from a single frame, without any additional data from other cameras or sensors. The target metric used in this challenge combined the runtime and the perceptual quality of the solutions measured in a user study. To ensure the efficiency of the submitted models, we measured their runtime on standard desktop CPUs and also ran the models on smartphone GPUs. The proposed solutions significantly improved the baseline results, defining the state of the art for the practical bokeh effect rendering problem.
Article
Depth sensing is a critical function for robotic tasks such as localization, mapping and obstacle detection. There has been a significant and growing interest in depth estimation from a single RGB image, due to the relatively low cost and size of monocular cameras. However, state-of-the-art single-view depth estimation algorithms are based on fairly complex deep neural networks that are too slow for real-time inference on an embedded platform, for instance, mounted on a micro aerial vehicle. In this paper, we address the problem of fast depth estimation on embedded systems. We propose an efficient and lightweight encoder-decoder network architecture and apply network pruning to further reduce computational complexity and latency. In particular, we focus on the design of a low-latency decoder. Our methodology demonstrates that it is possible to achieve similar accuracy as prior work on depth estimation, but at inference speeds that are an order of magnitude faster. Our proposed network, FastDepth, runs at 178 fps on an NVIDIA Jetson TX2 GPU and at 27 fps when using only the TX2 CPU, with active power consumption under 10 W. FastDepth achieves close to state-of-the-art accuracy on the NYU Depth v2 dataset. To the best of the authors' knowledge, this paper demonstrates real-time monocular depth estimation using a deep neural network with the lowest latency and highest throughput on an embedded platform that can be carried by a micro aerial vehicle.
Conference Paper
This paper reviews the 2nd NTIRE challenge on single image super-resolution (restoration of rich details in a low resolution image) with a focus on the proposed solutions and results. The challenge had 4 tracks. Track 1 employed the standard bicubic downscaling setup, while Tracks 2, 3 and 4 had realistic unknown downgrading operators simulating a camera image acquisition pipeline. The operators were learnable through provided pairs of low and high resolution training images. The tracks had 145, 114, 101, and 113 registered participants, respectively, and 31 teams competed in the final testing phase. They gauge the state of the art in single image super-resolution.
Conference Paper
Per-pixel ground-truth depth data is challenging to acquire at scale. To overcome this limitation, self-supervised learning has emerged as a promising alternative for training models to perform monocular depth estimation. In this paper, we propose a set of improvements, which together result in both quantitatively and qualitatively improved depth maps compared to competing self-supervised methods. Research on self-supervised monocular training usually explores increasingly complex architectures, loss functions, and image formation models, all of which have recently helped to close the gap with fully-supervised methods. We show that a surprisingly simple model, and associated design choices, lead to superior predictions. In particular, we propose (i) a minimum reprojection loss, designed to robustly handle occlusions, (ii) a full-resolution multi-scale sampling method that reduces visual artifacts, and (iii) an auto-masking loss to ignore training pixels that violate camera motion assumptions. We demonstrate the effectiveness of each component in isolation, and show high quality, state-of-the-art results on the KITTI benchmark.
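Components (i) and (iii) of the abstract above can be illustrated together in a small sketch: the per-pixel minimum over source frames replaces an average, and a pixel is kept only when warping improves on the unwarped source. The flat list layout, the function name, and the use of precomputed photometric errors are assumptions made for illustration; this is not the authors' implementation.

```python
def min_reprojection_loss(warped_errors, identity_errors):
    """Mean of per-pixel minimum reprojection errors over unmasked pixels.

    warped_errors: one list per source frame, holding the photometric error of
        that frame warped into the target view, per pixel.
    identity_errors: same layout, but for the unwarped source frames.
    """
    n_pixels = len(warped_errors[0])
    total, kept = 0.0, 0
    for p in range(n_pixels):
        # Minimum over source frames: a pixel occluded in one frame can
        # still be explained by another frame.
        best_warped = min(frame[p] for frame in warped_errors)
        best_identity = min(frame[p] for frame in identity_errors)
        # Auto-masking: skip pixels that look no better after warping,
        # e.g. static scenes or objects moving with the camera.
        if best_warped < best_identity:
            total += best_warped
            kept += 1
    return total / kept if kept else 0.0
```

Taking the minimum rather than the mean avoids penalising a pixel merely because it is hidden in one of the source frames.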