Conference Paper

Fast and Accurate Single-Image Depth Estimation on Mobile Devices, Mobile AI 2021 Challenge: Report

June 2021

June 2021

DOI:10.1109/CVPRW53098.2021.00288

Conference: 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)

Authors:

David Plowman

Raspberry Pi

Samarth Shukla

ETH Zurich

Show all 38 authorsHide

RingFFL: A Ring-Architecture-Based Fair Federated Learning Framework

Article

Full-text available

Feb 2023

In the ring-architecture-based federated learning framework, security and fairness are severely compromised when dishonest clients abort the training process after obtaining useful information. To solve the problem, we propose a Ring- architecture-based Fair Federated Learning framework called RingFFL, in which we design a penalty mechanism for FL. Before the training starts in each round, all clients that will participate in the training pay deposits in a set order and record the transactions on the blockchain to ensure that they are not tampered with. Subsequently, the clients perform the FL training process, and the correctness of the models transmitted by the clients is guaranteed by the HASH algorithm during the training process. When all clients perform honestly, each client can obtain the final model, and the number of digital currencies in each client’s wallet is kept constant; otherwise, the deposits of clients who leave halfway will be compensated to the clients who perform honestly during the training process. In this way, through the penalty mechanism, all clients either obtain the final model or are compensated, thus ensuring the fairness of federated learning. The security analysis and experimental results show that RingFFL not only guarantees the accuracy and security of the federated learning model but also guarantees the fairness.

LiteDepth: Digging into Fast and Accurate Depth Estimation on Mobile Devices

Preprint

Full-text available

Sep 2022

Monocular depth estimation is an essential task in the computer vision community. While tremendous successful methods have obtained excellent results, most of them are computationally expensive and not applicable for real-time on-device inference. In this paper, we aim to address more practical applications of monocular depth estimation, where the solution should consider not only the precision but also the inference time on mobile devices. To this end, we first develop an end-to-end learning-based model with a tiny weight size (1.4MB) and a short inference time (27FPS on Raspberry Pi 4). Then, we propose a simple yet effective data augmentation strategy, called R2 crop, to boost the model performance. Moreover, we observe that the simple lightweight model trained with only one single loss term will suffer from performance bottleneck. To alleviate this issue, we adopt multiple loss terms to provide sufficient constraints during the training stage. Furthermore, with a simple dynamic re-weight strategy, we can avoid the time-consuming hyper-parameter choice of loss terms. Finally, we adopt the structure-aware distillation to further improve the model performance. Notably, our solution named LiteDepth ranks 2nd in the MAI&AIM2022 Monocular Depth Estimation Challenge}, with a si-RMSE of 0.311, an RMSE of 3.79, and the inference time is 37$ms$ tested on the Raspberry Pi 4. Notably, we provide the fastest solution to the challenge. Codes and models will be released at \url{https://github.com/zhyever/LiteDepth}.

VIhanceD: Efficient Video Super Resolution for Edge Devices

Conference Paper

Feb 2023

Video super-resolution (VSR) aims to generate high-resolution (HR) frmes from corresponding low-resolution (LR) frames. It draws stark contrasat from single image super-resolution (SISR) because of its high temporal dependency on misaligned supporting frames. The existing methods involve using RNNs to learn the temporal dependency while using other networks (CNNs, GANs) for predicting neighboring pixels. Due to the memory and processing constraints and the inference time required for up-scaling LR frames, a wide variety of VSR techniques cannot be applied to mid-range and budget mobile devices. This paper presents VIhanceD, a real-time sliding window-based network that can operate on budget smartphones and laptops while producing cutting-edge results on various video datasets. Our approaches include both spatial and temporal dependencies to make the up-scaled HR frames coherent and free of motion distortions. We concentrate on enhancing the user experience in internet-restricted places owing to social, political, and geographical limitations. The mobile app (and PC client) provides a continuous stream of HR frames without buffering. Our experiments on various public and internal datasets demonstrate that the suggested method is generalizable and works on natural video frames and textual data, making it suitable for infotainment multimedia.

Efficient Single-Image Depth Estimation on Mobile Devices, Mobile AI & AIM 2022 Challenge: Report

Preprint

Nov 2022

Various depth estimation models are now widely used on many mobile and IoT devices for image segmentation, bokeh effect rendering, object tracking and many other mobile tasks. Thus, it is very crucial to have efficient and accurate depth estimation models that can run fast on low-power mobile chipsets. In this Mobile AI challenge, the target was to develop deep learning-based single image depth estimation solutions that can show a real-time performance on IoT platforms and smartphones. For this, the participants used a large-scale RGB-to-depth dataset that was collected with the ZED stereo camera capable to generated depth maps for objects located at up to 50 meters. The runtime of all models was evaluated on the Raspberry Pi 4 platform, where the developed solutions were able to generate VGA resolution depth maps at up to 27 FPS while achieving high fidelity results. All models developed in the challenge are also compatible with any Android or Linux-based mobile devices, their detailed description is provided in this paper.

Monocular Depth Estimation Primed by Salient Point Detection and Normalized Hessian Loss

Preprint

Full-text available

Aug 2021

Deep neural networks have recently thrived on single image depth estimation. That being said, current developments on this topic highlight an apparent compromise between accuracy and network size. This work proposes an accurate and lightweight framework for monocular depth estimation based on a self-attention mechanism stemming from salient point detection. Specifically, we utilize a sparse set of keypoints to train a FuSaNet model that consists of two major components: Fusion-Net and Saliency-Net. In addition, we introduce a normalized Hessian loss term invariant to scaling and shear along the depth direction, which is shown to substantially improve the accuracy. The proposed method achieves state-of-the-art results on NYU-Depth-v2 and KITTI while using 3.1-38.4 times smaller model in terms of the number of parameters than baseline approaches. Experiments on the SUN-RGBD further demonstrate the generalizability of the proposed method.

Virtually Enriched NYU Depth V2 Dataset for Monocular Depth Estimation: Do We Need Artificial Augmentation?

Conference Paper

Full-text available

Jun 2024

We present ANYU, a new virtually augmented version of the NYU depth v2 dataset, designed for monocular depth estimation. In contrast to the well-known approach where full 3D scenes of a virtual world are utilized to generate artificial datasets, ANYU was created by incorporating RGB-D representations of virtual reality objects into the original NYU depth v2 images. We specifically did not match each generated virtual object with an appropriate texture and a suitable location within the real-world image. Instead, an assignment of texture, location, lighting, and other rendering parameters was randomized to maximize a diversity of the training data, and to show that it is randomness that can improve the generalizing ability of a dataset. By conducting extensive experiments with our virtually modified dataset and validating on the original NYU depth v2 and iBims-1 benchmarks, we show that ANYU improves the monocular depth estimation performance and generalization of deep neural networks with considerably different architectures, especially for the current state-of-the-art VPD model. To the best of our knowledge, this is the first work that augments a real-world dataset with randomly generated virtual 3D objects for monocular depth estimation. We make our ANYU dataset publicly available in two training configurations with 10% and 100% additional synthetically enriched RGB-D pairs of training images, respectively, for efficient training and empirical exploration of virtual augmentation at https://github.com/ABrain-One/ANYU .

The Second Monocular Depth Estimation Challenge

Conference Paper

Full-text available

Jun 2023

Diffusion-Augmented Depth Prediction with Sparse Annotations

Preprint

Aug 2023

Depth estimation aims to predict dense depth maps. In autonomous driving scenes, sparsity of annotations makes the task challenging. Supervised models produce concave objects due to insufficient structural information. They overfit to valid pixels and fail to restore spatial structures. Self-supervised methods are proposed for the problem. Their robustness is limited by pose estimation, leading to erroneous results in natural scenes. In this paper, we propose a supervised framework termed Diffusion-Augmented Depth Prediction (DADP). We leverage the structural characteristics of diffusion model to enforce depth structures of depth models in a plug-and-play manner. An object-guided integrality loss is also proposed to further enhance regional structure integrality by fetching objective information. We evaluate DADP on three driving benchmarks and achieve significant improvements in depth structures and robustness. Our work provides a new perspective on depth estimation with sparse annotations in autonomous driving scenes.

The RoboDepth Challenge: Methods and Advancements Towards Robust Depth Estimation

Preprint

Full-text available

Jul 2023

The Second Monocular Depth Estimation Challenge

Preprint

Full-text available

Apr 2023

This paper discusses the results for the second edition of the Monocular Depth Estimation Challenge (MDEC). This edition was open to methods using any form of supervision, including fully-supervised, self-supervised, multi-task or proxy depth. The challenge was based around the SYNS-Patches dataset, which features a wide diversity of environments with high-quality dense ground-truth. This includes complex natural environments, e.g. forests or fields, which are greatly underrepresented in current benchmarks. The challenge received eight unique submissions that outperformed the provided SotA baseline on any of the pointcloud- or image-based metrics. The top supervised submission improved relative F-Score by 27.62%, while the top self-supervised improved it by 16.61%. Supervised submissions generally leveraged large collections of datasets to improve data diversity. Self-supervised submissions instead updated the network architecture and pretrained backbones. These results represent a significant progress in the field, while highlighting avenues for future research, such as reducing interpolation artifacts at depth boundaries, improving self-supervised indoor performance and overall natural image accuracy.

Reversed Image Signal Processing and RAW Reconstruction. AIM 2022 Challenge Report

Chapter

Full-text available

Feb 2023

Cameras capture sensor RAW images and transform them into pleasant RGB images, suitable for the human eyes, using their integrated Image Signal Processor (ISP). Numerous low-level vision tasks operate in the RAW domain (e.g. image denoising, white balance) due to its linear relationship with the scene irradiance, wide-range of information at 12bits, and sensor designs. Despite this, RAW image datasets are scarce and more expensive to collect than the already large and public RGB datasets. This paper introduces the AIM 2022 Challenge on Reversed Image Signal Processing and RAW Reconstruction. We aim to recover raw sensor images from the corresponding RGBs without metadata and, by doing this, “reverse” the ISP transformation. The proposed methods and benchmark establish the state-of-the-art for this low-level vision inverse problem, and generating realistic raw sensor readings can potentially benefit other tasks such as denoising and super-resolution.

AIM 2022 Challenge on Super-Resolution of Compressed Image and Video: Dataset, Methods and Results

Chapter

Full-text available

Feb 2023

This paper reviews the Challenge on Super-Resolution of Compressed Image and Video at AIM 2022. This challenge includes two tracks. Track 1 aims at the super-resolution of compressed image, and Track 2 targets the super-resolution of compressed video. In Track 1, we use the popular dataset DIV2K as the training, validation and test sets. In Track 2, we propose the LDV 3.0 dataset, which contains 365 videos, including the LDV 2.0 dataset (335 videos) and 30 additional videos. In this challenge, there are 12 teams and 2 teams that submitted the final results to Track 1 and Track 2, respectively. The proposed methods and solutions gauge the state-of-the-art of super-resolution on compressed image and video. The proposed LDV 3.0 dataset is available at https://github.com/RenYang-home/LDV_dataset. The homepage of this challenge is at https://github.com/RenYang-home/AIM22_CompressSR.

AIM 2022 Challenge on Instagram Filter Removal: Methods and Results

Chapter

Full-text available

Feb 2023

This paper introduces the methods and the results of AIM 2022 challenge on Instagram Filter Removal. Social media filters transform the images by consecutive non-linear operations, and the feature maps of the original content may be interpolated into a different domain. This reduces the overall performance of the recent deep learning strategies. The main goal of this challenge is to produce realistic and visually plausible images where the impact of the filters applied is mitigated while preserving the content. The proposed solutions are ranked in terms of the PSNR value with respect to the original images. There are two prior studies on this task as the baseline, and a total of 9 teams have competed in the final phase of the challenge. The comparison of qualitative results of the proposed solutions and the benchmark for the challenge are presented in this report.

Monocular Depth Estimation and Detection of Near Objects

Article

Full-text available

Jan 2022

The image obtained from the cameras is 2D, so we cannot know how far the object is on the image. In order to detect objects only at a certain distance in a camera system, we need to convert the 2D image into 3D. Depth estimation is used to estimate distances to objects. It is the perception of the 2D image as 3D. Although different methods are used to implement this, the method to be applied in this experiment is to detect depth perception with a single camera. After obtaining the depth map, the obtained image will be filtered by objects in the near distance, the distant image will be closed, a new image will be run with the object detection model and object detection will be performed. The desired result in this experiment is, for projects with a low budget, instead of using dual camera or LIDAR methods, it is to ensure that a robot can detect obstacles that will come in front of it with only one camera. As a result, 8 FPS was obtained by running two models on the embedded device, and the loss value was obtained as 0.342 in the inference test performed on the new image, where only close objects were taken after the depth estimation.

AIM 2022 Challenge on Instagram Filter Removal: Methods and Results

Preprint

Full-text available

Oct 2022

AIM 2022 Challenge on Super-Resolution of Compressed Image and Video: Dataset, Methods and Results

Preprint

Full-text available

Aug 2022

This paper reviews the Challenge on Super-Resolution of Compressed Image and Video at AIM 2022. This challenge includes two tracks. Track 1 aims at the super-resolution of compressed image, and Track~2 targets the super-resolution of compressed video. In Track 1, we use the popular dataset DIV2K as the training, validation and test sets. In Track 2, we propose the LDV 3.0 dataset, which contains 365 videos, including the LDV 2.0 dataset (335 videos) and 30 additional videos. In this challenge, there are 12 teams and 2 teams that submitted the final results to Track 1 and Track 2, respectively. The proposed methods and solutions gauge the state-of-the-art of super-resolution on compressed image and video. The proposed LDV 3.0 dataset is available at https://github.com/RenYang-home/LDV_dataset. The homepage of this challenge is at https://github.com/RenYang-home/AIM22_CompressSR.

Hybrid Skip: A Biologically Inspired Skip Connection for the UNet Architecture

Preprint

Full-text available

Jul 2022

In this work we introduce a biologically inspired long-range skip connection for the UNet architecture that relies on the perceptual illusion of hybrid images, being images that simultaneously encode two images. The fusion of early encoder features with deeper decoder ones allows UNet models to produce finer-grained dense predictions. While proven in segmentation tasks, the network's benefits are down-weighted for dense regression tasks as these long-range skip connections additionally result in texture transfer artifacts. Specifically for depth estimation, this hurts smoothness and introduces false positive edges which are detrimental to the task due to the depth maps' piece-wise smooth nature. The proposed HybridSkip connections show improved performance in balancing the trade-off between edge preservation, and the minimization of texture transfer artifacts that hurt smoothness. This is achieved by the proper and balanced exchange of information that Hybrid-Skip connections offer between the high and low frequency, encoder and decoder features, respectively.

Model-Based Image Signal Processors via Learnable Dictionaries

Conference Paper

Full-text available

Jun 2022

Digital cameras transform sensor RAW readings into RGB images by means of their Image Signal Processor (ISP). Computational photography tasks such as image denoising and colour constancy are commonly performed in the RAW domain, in part due to the inherent hardware design, but also due to the appealing simplicity of noise statistics that result from the direct sensor readings. Despite this, the availability of RAW images is limited in comparison with the abundance and diversity of available RGB data. Recent approaches have attempted to bridge this gap by estimating the RGB to RAW mapping: handcrafted model-based methods that are interpretable and controllable usually require manual parameter fine-tuning, while end-to-end learnable neural networks require large amounts of training data, at times with complex training procedures, and generally lack interpretability and parametric control. Towards addressing these existing limitations, we present a novel hybrid model-based and data-driven ISP that builds on canonical ISP operations and is both learnable and interpretable. Our proposed invertible model, capable of bidirectional mapping between RAW and RGB domains, employs end-to-end learning of rich parameter representations, i.e. dictionaries, that are free from direct parametric supervision and additionally enable simple and plausible data augmentation. We evidence the value of our data generation process by extensive experiments under both RAW image reconstruction and RAW image denoising tasks, obtaining state-of-the-art performance in both. Additionally, we show that our ISP can learn meaningful mappings from few data samples, and that denoising models trained with our dictionary-based data augmentation are competitive despite having only few or zero ground-truth labels.

Hybrid Skip: A Biologically Inspired Skip Connection for the UNet Architecture

Article

Full-text available

Jan 2022

In this work we introduce a biologically inspired long-range skip connection for the UNet architecture that relies on the perceptual illusion of hybrid images, being images that simultaneously encode two images. The fusion of early encoder features with deeper decoder ones allows UNet models to produce finer-grained dense predictions. While proven in segmentation tasks, the network’s benefits are down-weighted for dense regression tasks as these long-range skip connections additionally result in texture transfer artifacts. Specifically for depth estimation, this hurts smoothness and introduces false positive edges which are detrimental to the task due to the depth maps’ piece-wise smooth nature. The proposed HybridSkip connections show improved performance in balancing the trade-off between edge preservation, and the minimization of texture transfer artifacts that hurt smoothness. This is achieved by the proper and balanced exchange of information that HybridSkip connections offer between the high and low frequency, encoder and decoder features, respectively. The code and models will be made available in the project page.

SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings

Article

Full-text available

Jan 2022

The monocular depth estimation (MDE) is the task of estimating depth from a single frame. This information is an essential knowledge in many computer vision tasks such as scene understanding and visual odometry, which are key components in autonomous and robotic systems. Approaches based on the state of the art vision transformer architectures are extremely deep and complex not suitable for real-time inference operations on edge and autonomous systems equipped with low resources (i.e. robot indoor navigation and surveillance). This paper presents SPEED, a Separable Pyramidal pooling EncodEr-Decoder architecture designed to achieve real-time frequency performances on multiple hardware platforms. The proposed model is a fast-throughput deep architecture for MDE able to obtain depth estimations with high accuracy from low resolution images using minimum hardware resources (i.e. edge devices). Our encoder-decoder model exploits two depthwise separable pyramidal pooling layers, which allow to increase the inference frequency while reducing the overall computational complexity. The proposed method performs better than other fast-throughput architectures in terms of both accuracy and frame rates, achieving real-time performances over cloud CPU, TPU and the NVIDIA Jetson TX1 on two indoor benchmarks: the NYU Depth v2 and the DIML Kinect v2 datasets.

Fast Neural Architecture Search for Lightweight Dense Prediction Networks

Preprint

Full-text available

Mar 2022

Dense prediction is a class of computer vision problems aiming at mapping every pixel of the input image with some predicted values. Depending on the problem, the output values can be either continous or discrete. For instance, monocular depth estimation and image super-resolution are often formulated as regression, while semantic segmentation is a dense classification, i.e. discrete, problem. More specifically, the monocular depth estimation problem produces a dense depth map from a single image to be used in various applications including robotics, scene understanding, and augmented reality. Single image super-resolution (SISR) is a low-level vision task that generates a high-resolution image from its low-resolution counterpart. SISR is widely utilized in medical and surveillance imaging, where images with more precise details can provide invaluable information. On the other hand, semantic segmentation predicts a dense annotated map of different semantic categories from a given image that is crucial for image understanding tasks.

Model-Based Image Signal Processors via Learnable Dictionaries

Preprint

Full-text available

Jan 2022

Lightweight Monocular Depth with a Novel Neural Architecture Search Method

Preprint

Full-text available

Aug 2021

This paper presents a novel neural architecture search method, called LiDNAS, for generating lightweight monocular depth estimation models. Unlike previous neural architecture search (NAS) approaches, where finding optimized networks are computationally highly demanding, the introduced novel Assisted Tabu Search leads to efficient architecture exploration. Moreover, we construct the search space on a pre-defined backbone network to balance layer diversity and search space size. The LiDNAS method outperforms the state-of-the-art NAS approach, proposed for disparity and depth estimation, in terms of search efficiency and output model performance. The LiDNAS optimized models achieve results superior to compact depth estimation state-of-the-art on NYU-Depth-v2, KITTI, and ScanNet, while being 7%-500% more compact in size, i.e the number of model parameters.

Diffusion-Augmented Depth Prediction with Sparse Annotations

Conference Paper

Oct 2023

NTIRE 2023 Challenge on HR Depth from Images of Specular and Transparent Surfaces

Conference Paper

Jun 2023

FasterMDE: A real-time monocular depth estimation search method that balances accuracy and speed on the edge

Article

Full-text available

Jul 2023
APPL INTELL

Monocular depth estimation (MDE) is critical in enabling intelligent autonomous systems and has received considerable attention in recent years. Achieving both low latency and high accuracy in MDE is desirable but challenging to optimize, especially on edge devices. In this paper, we present a novel approach to balancing speed and accuracy in MDE on edge devices. We introduce FasterMDE, an efficient and fast encoder-decoder network architecture that leverages a multiobjective neural architecture search method to find the optimal encoder structure for the target edge. Moreover, we incorporate a neural window fully connected CRF module into the network as the decoder, enhancing fine-grained depth prediction based on coarse depth and image features. To address the issue of bad “local minimums” in the multiobjective neural architecture search, we propose a new approach for automatically learning the weights of subobjective loss functions based on uncertainty. We also accelerate the FasterMDE model using TensorRT and implement it on a target edge device. The experimental results demonstrate that FasterMDE achieves a better balance of speed and accuracy on the KITTI and NYUv2 datasets compared to previous methods. We validate the effectiveness of the proposed method through an ablation study and verify the real-time monocular depth estimation performance of FasterMDE in realistic scenarios. On the KITTI dataset, the FasterMDE model achieves a high frame rate of 555.55 FPS with 9.1% Abs Rel on a single NVIDIA Titan RTX GPU and 14.46 FPS on the NVIDIA Jetson Xavier NX.

Learned Smartphone ISP on Mobile GPUs with Deep Learning, Mobile AI & AIM 2022 Challenge: Report

Chapter

Full-text available

Feb 2023

The role of mobile cameras increased dramatically over the past few years, leading to more and more research in automatic image quality enhancement and RAW photo processing. In this Mobile AI challenge, the target was to develop an efficient end-to-end AI-based image signal processing (ISP) pipeline replacing the standard mobile ISPs that can run on modern smartphone GPUs using TensorFlow Lite. The participants were provided with a large-scale Fujifilm UltraISP dataset consisting of thousands of paired photos captured with a normal mobile camera sensor and a professional 102MP medium-format FujiFilm GFX100 camera. The runtime of the resulting models was evaluated on the Snapdragon’s 8 Gen 1 GPU that provides excellent acceleration results for the majority of common deep learning ops. The proposed solutions are compatible with all recent mobile GPUs, being able to process Full HD photos in less than 20–50 ms while achieving high fidelity results. A detailed description of all models developed in this challenge is provided in this paper.

Power Efficient Video Super-Resolution on Mobile NPUs with Deep Learning, Mobile AI & AIM 2022 Challenge: Report

Chapter

Feb 2023

Video super-resolution is one of the most popular tasks on mobile devices, being widely used for an automatic improvement of low-bitrate and low-resolution video streams. While numerous solutions have been proposed for this problem, they are usually quite computationally demanding, demonstrating low FPS rates and power efficiency on mobile devices. In this Mobile AI challenge, we address this problem and propose the participants to design an end-to-end real-time video super-resolution solution for mobile NPUs optimized for low energy consumption. The participants were provided with the REDS training dataset containing video sequences for a 4X video upscaling task. The runtime and power efficiency of all models was evaluated on the powerful MediaTek Dimensity 9000 platform with a dedicated AI processing unit capable of accelerating floating-point and quantized neural networks. All proposed solutions are fully compatible with the above NPU, demonstrating an up to 500 FPS rate and 0.2 [Watt/30 FPS] power consumption. A detailed description of all models developed in the challenge is provided in this paper.

Efficient Single-Image Depth Estimation on Mobile Devices, Mobile AI & AIM 2022 Challenge: Report

Chapter

Feb 2023

Various depth estimation models are now widely used on many mobile and IoT devices for image segmentation, bokeh effect rendering, object tracking and many other mobile tasks. Thus, it is very crucial to have efficient and accurate depth estimation models that can run fast on low-power mobile chipsets. In this Mobile AI challenge, the target was to develop deep learning-based single image depth estimation solutions that can show a real-time performance on IoT platforms and smartphones. For this, the participants used a large-scale RGB-to-depth dataset that was collected with the ZED stereo camera capable to generated depth maps for objects located at up to 50 m. The runtime of all models was evaluated on the Raspberry Pi 4 platform, where the developed solutions were able to generate VGA resolution depth maps at up to 27 FPS while achieving high fidelity results. All models developed in the challenge are also compatible with any Android or Linux-based mobile devices, their detailed description is provided in this paper.

Efficient and Accurate Quantized Image Super-Resolution on Mobile NPUs, Mobile AI & AIM 2022 Challenge: Report

Chapter

Full-text available

Feb 2023

Image super-resolution is a common task on mobile and IoT devices, where one often needs to upscale and enhance low-resolution images and video frames. While numerous solutions have been proposed for this problem in the past, they are usually not compatible with low-power mobile NPUs having many computational and memory constraints. In this Mobile AI challenge, we address this problem and propose the participants to design an efficient quantized image super-resolution solution that can demonstrate a real-time performance on mobile NPUs. The participants were provided with the DIV2K dataset and trained INT8 models to do a high-quality 3X image upscaling. The runtime of all models was evaluated on the Synaptics VS680 Smart Home board with a dedicated edge NPU capable of accelerating quantized neural networks. All proposed solutions are fully compatible with the above NPU, demonstrating an up to 60 FPS rate when reconstructing Full HD resolution images. A detailed description of all models developed in the challenge is provided in this paper.

Realistic Bokeh Effect Rendering on Mobile GPUs, Mobile AI & AIM 2022 Challenge: Report

Chapter

Feb 2023

As mobile cameras with compact optics are unable to produce a strong bokeh effect, lots of interest is now devoted to deep learning-based solutions for this task. In this Mobile AI challenge, the target was to develop an efficient end-to-end AI-based bokeh effect rendering approach that can run on modern smartphone GPUs using TensorFlow Lite. The participants were provided with a large-scale EBB! bokeh dataset consisting of 5K shallow/wide depth-of-field image pairs captured using the Canon 7D DSLR camera. The runtime of the resulting models was evaluated on the Kirin 9000’s Mali GPU that provides excellent acceleration results for the majority of common deep learning ops. A detailed description of all models developed in this challenge is provided in this paper.

LiteDepth: Digging into Fast and Accurate Depth Estimation on Mobile Devices

Chapter

Feb 2023

Monocular depth estimation is an essential task in the computer vision community. While tremendous successful methods have obtained excellent results, most of them are computationally expensive and not applicable for real-time on-device inference. In this paper, we aim to address more practical applications of monocular depth estimation, where the solution should consider not only the precision but also the inference time on mobile devices. To this end, we first develop an end-to-end learning-based model with a tiny weight size (1.4MB) and a short inference time (27FPS on Raspberry Pi 4). Then, we propose a simple yet effective data augmentation strategy, called R2crop, to boost the model performance. Moreover, we observe that the simple lightweight model trained with only one single loss term will suffer from performance bottleneck. To alleviate this issue, we adopt multiple loss terms to provide sufficient constraints during the training stage. Furthermore, with a simple dynamic re-weight strategy, we can avoid the time-consuming hyper-parameter choice of loss terms. Finally, we adopt the structure-aware distillation to further improve the model performance. Notably, our solution named LiteDepth ranks 2ndin the MAI &AIM2022 Monocular Depth Estimation Challenge, with a si-RMSE of 0.311, an RMSE of 3.79, and the inference time is 37ms tested on the Raspberry Pi 4. Notably, we provide the fastest solution to the challenge. Codes and models will be released at https://github.com/zhyever/LiteDepth.

Realistic Bokeh Effect Rendering on Mobile GPUs, Mobile AI & AIM 2022 challenge: Report

Preprint

Nov 2022

As mobile cameras with compact optics are unable to produce a strong bokeh effect, lots of interest is now devoted to deep learning-based solutions for this task. In this Mobile AI challenge, the target was to develop an efficient end-to-end AI-based bokeh effect rendering approach that can run on modern smartphone GPUs using TensorFlow Lite. The participants were provided with a large-scale EBB! bokeh dataset consisting of 5K shallow / wide depth-of-field image pairs captured using the Canon 7D DSLR camera. The runtime of the resulting models was evaluated on the Kirin 9000's Mali GPU that provides excellent acceleration results for the majority of common deep learning ops. A detailed description of all models developed in this challenge is provided in this paper.

Efficient and Accurate Quantized Image Super-Resolution on Mobile NPUs, Mobile AI & AIM 2022 challenge: Report

Preprint

Full-text available

Nov 2022

Power Efficient Video Super-Resolution on Mobile NPUs with Deep Learning, Mobile AI & AIM 2022 challenge: Report

Preprint

Full-text available

Nov 2022

Video super-resolution is one of the most popular tasks on mobile devices, being widely used for an automatic improvement of low-bitrate and low-resolution video streams. While numerous solutions have been proposed for this problem, they are usually quite computationally demanding, demonstrating low FPS rates and power efficiency on mobile devices. In this Mobile AI challenge, we address this problem and propose the participants to design an end-to-end real-time video super-resolution solution for mobile NPUs optimized for low energy consumption. The participants were provided with the REDS training dataset containing video sequences for a 4X video upscaling task. The runtime and power efficiency of all models was evaluated on the powerful MediaTek Dimensity 9000 platform with a dedicated AI processing unit capable of accelerating floating-point and quantized neural networks. All proposed solutions are fully compatible with the above NPU, demonstrating an up to 500 FPS rate and 0.2 [Watt / 30 FPS] power consumption. A detailed description of all models developed in the challenge is provided in this paper.

Learned Smartphone ISP on Mobile GPUs with Deep Learning, Mobile AI & AIM 2022 Challenge: Report

Preprint

Full-text available

Nov 2022

The role of mobile cameras increased dramatically over the past few years, leading to more and more research in automatic image quality enhancement and RAW photo processing. In this Mobile AI challenge, the target was to develop an efficient end-to-end AI-based image signal processing (ISP) pipeline replacing the standard mobile ISPs that can run on modern smartphone GPUs using TensorFlow Lite. The participants were provided with a large-scale Fujifilm UltraISP dataset consisting of thousands of paired photos captured with a normal mobile camera sensor and a professional 102MP medium-format FujiFilm GFX100 camera. The runtime of the resulting models was evaluated on the Snapdragon's 8 Gen 1 GPU that provides excellent acceleration results for the majority of common deep learning ops. The proposed solutions are compatible with all recent mobile GPUs, being able to process Full HD photos in less than 20-50 milliseconds while achieving high fidelity results. A detailed description of all models developed in this challenge is provided in this paper.

PhoneDepth: A Dataset for Monocular Depth Estimation on Mobile Devices

Conference Paper

Jun 2022

Biometric recognition based on scalable end-to-end convolutional neural network using photoplethysmography: A comparative study

Article

May 2022
COMPUT BIOL MED

Photoplethysmography (PPG), as one of the most widely used physiological signals on wearable devices, with dominance for portability and accessibility, is an ideal carrier of biometric recognition for guaranteeing the security of sensitive information. However, the existing state-of-the-art methods are restricted to practical deployment since power-constrained and compute-insufficient for wearable devices. 1D convolutional neural networks (1D-CNNs) have succeeded in numerous applications on sequential signals. Still, they fall short in modeling long-range dependencies (LRD), which are extremely needed in high-security PPG-based biometric recognition. In view of these limitations, this paper conducts a comparative study of scalable end-to-end 1D-CNNs for capturing LRD and parameterizing authorized templates by enlarging the receptive fields via stacking convolution operations, non-local blocks, and attention mechanisms. Compared to a robust baseline model, seven scalable models have different impacts (−0.2%–9.9%) on the accuracy of recognition over three datasets. Experimental cases demonstrate clear-cut improvements. Scalable models achieve state-of-the-art performance with an accuracy of over 97% on VitalDB and with the best accuracy on BIDMC and PRRB datasets performing 99.5% and 99.3%, respectively. We also discuss the effects of capturing LRD in generated templates by visualizations with Gramian Angular Summation Field and Class Activation Map. This study conducts that the scalable 1D-CNNs offer a performance-excellent and complexity-feasible approach for biometric recognition using PPG.

Decomposition and replacement: Spatial knowledge distillation for monocular depth estimation

Article

Apr 2022
J VIS COMMUN IMAGE R

Knowledge distillation has become a key technique for making smart and light-weight networks through model compression and transfer learning. Unlike previous methods that applied knowledge distillation to the classification task, we propose to exploit the decomposition-and-replacement based distillation scheme for depth estimation from a single RGB color image. To do this, Laplacian pyramid-based knowledge distillation is firstly presented in this paper. The key idea of the proposed method is to transfer the rich knowledge of the scene depth, which is well encoded through the teacher network, to the student network in a structured way by decomposing it into the global context and local details. This is fairly desirable for the student network to restore the depth layout more accurately with limited resources. Moreover, we also propose a new guidance concept for knowledge distillation, so-called ReplaceBlock, which replaces blocks randomly selected in the decoded feature of the student network with those of the teacher network. Our ReplaceBlock gives a smoothing effect in learning the feature distribution of the teacher network by considering the spatial contiguity in the feature space. This process is also helpful to clearly restore the depth layout without the significant computational cost. Based on various experimental results on benchmark datasets, the effectiveness of our distillation scheme for monocular depth estimation is demonstrated in details. The code and model are publicly available at : https://github.com/tjqansthd/Lap_Rep_KD_Depth.

Lightweight Monocular Depth with a Novel Neural Architecture Search Method

Conference Paper

Jan 2022

SmartDepthSync: Open Source Synchronized Video Recording System of Smartphone RGB and Depth Camera Range Image Frames With Sub-Millisecond Precision

Article

Apr 2022

Nowadays, smartphones can produce a synchronized (synced) stream of high-quality data, including RGB images, inertial measurements, and other data. Therefore, smartphones are becoming appealing sensor systems in the robotics community. Unfortunately, there is still the need for external supporting sensing hardware, such as a depth camera precisely synced with the smartphone sensors. In this paper, we propose a hardware-software recording system that presents a heterogeneous structure and contains a smartphone and an external depth camera for recording visual, depth, and inertial data that are mutually synchronized. The system is synced at the time and the frame levels: every RGB image frame from the smartphone camera is exposed at the same moment of time with a depth camera frame with sub-millisecond precision. We provide a method and a tool for sync performance evaluation that can be applied to any pair of depth and RGB cameras. Our system could be replicated, modified, or extended by employing our open-sourced materials.

Monocular Depth Estimation Primed by Salient Point Detection and Normalized Hessian Loss

Conference Paper

Dec 2021

Real-Time Video Super-Resolution on Smartphones with Deep Learning, Mobile AI 2021 Challenge: Report

Conference Paper

Jun 2021

Fast and Accurate Camera Scene Detection on Smartphones

Conference Paper

Jun 2021

Learned Smartphone ISP on Mobile NPUs with Deep Learning, Mobile AI 2021 Challenge: Report

Conference Paper

Jun 2021

Fast and Accurate Quantized Camera Scene Detection on Smartphones, Mobile AI 2021 Challenge: Report

Conference Paper

Jun 2021

A Simple Baseline for Fast and Accurate Depth Estimation on Mobile Devices

Conference Paper

Jun 2021

Real-Time Quantized Image Super-Resolution on Mobile NPUs, Mobile AI 2021 Challenge: Report

Conference Paper

Jun 2021

Fast Camera Image Denoising on Mobile GPUs with Deep Learning, Mobile AI 2021 Challenge: Report

Conference Paper

Jun 2021

Knowledge Distillation for Fast and Accurate Monocular Depth Estimation on Mobile Devices

Conference Paper

Jun 2021

On-Device Neural Net Inference with Mobile GPUs

Article

Full-text available

Jul 2019

On-device inference of machine learning models for mobile phones is desirable due to its lower latency and increased privacy. Running such a compute-intensive task solely on the mobile CPU, however, can be difficult due to limited computing power, thermal constraints, and energy consumption. App developers and researchers have begun exploiting hardware accelerators to overcome these challenges. Recently, device manufacturers are adding neural processing units into high-end phones for on-device inference , but these account for only a small fraction of hand-held devices. In this paper, we present how we leverage the mobile GPU, a ubiquitous hardware accelerator on virtually every phone, to run inference of deep neural networks in real-time for both Android and iOS devices. By describing our architecture, we also discuss how to design networks that are mobile GPU-friendly. Our state-of-the-art mobile GPU inference engine is integrated into the open-source project TensorFlow Lite and publicly available at https://tensorflow.org/lite.

AIM 2020 Challenge on Learned Image Signal Processing Pipeline

Chapter

Full-text available

Jan 2020

This paper reviews the second AIM learned ISP challenge and provides the description of the proposed solutions and results. The participating teams were solving a real-world RAW-to-RGB mapping problem, where to goal was to map the original low-quality RAW images captured by the Huawei P20 device to the same photos obtained with the Canon 5D DSLR camera. The considered task embraced a number of complex computer vision subtasks, such as image demosaicing, denoising, white balancing, color and contrast correction, demoireing, etc. The target metric used in this challenge combined fidelity scores (PSNR and SSIM) with solutions’ perceptual results measured in a user study. The proposed solutions significantly improved the baseline results, defining the state-of-the-art for practical image signal processing pipeline modeling.

Real-Time Single Image Depth Perception in the Wild with Handheld Devices

Article

Full-text available

Dec 2020
SENSORS-BASEL

Depth perception is paramount for tackling real-world problems, ranging from autonomous driving to consumer applications. For the latter, depth estimation from a single image would represent the most versatile solution since a standard camera is available on almost any handheld device. Nonetheless, two main issues limit the practical deployment of monocular depth estimation methods on such devices: (i) the low reliability when deployed in the wild and (ii) the resources needed to achieve real-time performance, often not compatible with low-power embedded systems. Therefore, in this paper, we deeply investigate all these issues, showing how they are both addressable by adopting appropriate network design and training strategies. Moreover, we also outline how to map the resulting networks on handheld devices to achieve real-time performance. Our thorough evaluation highlights the ability of such fast networks to generalize well to new environments, a crucial feature required to tackle the extremely varied contexts faced in real applications. Indeed, to further support this evidence, we report experimental results concerning real-time, depth-aware augmented reality and image blurring with smartphones in the wild.

Towards Robust Monocular Depth Estimation: Mixing Datasets for Zero-Shot Cross-Dataset Transfer

Article

Full-text available

Aug 2020

The success of monocular depth estimation relies on large and diverse training sets. Due to the challenges associated with acquiring dense ground-truth depth across different environments at scale, a number of datasets with distinct characteristics and biases have emerged. We develop tools that enable mixing multiple datasets during training, even if their annotations are incompatible. In particular, we propose a robust training objective that is invariant to changes in depth range and scale, advocate the use of principled multi-objective learning to combine data from different sources, and highlight the importance of pretraining encoders on auxiliary tasks. Armed with these tools, we experiment with six diverse training datasets, including a new, massive data source: 3D films. To demonstrate the generalization power of our approach we use zero-shot cross-dataset transfer, i.e. we evaluate on datasets that were not seen during training. The experiments confirm that mixing data from complementary sources greatly improves monocular depth estimation. Our approach clearly outperforms competing methods across diverse datasets, setting a new state of the art for monocular depth estimation.

Controlling Information Capacity of Binary Neural Network

Article

Full-text available

Jul 2020
PATTERN RECOGN LETT

Despite the growing popularity of deep learning technologies, high memory requirements and power consumption are essentially limiting their application in mobile and IoT areas. While binary convolutional networks can alleviate these problems, the limited bitwidth of weights is often leading to significant degradation of prediction accuracy. In this paper, we present a method for training binary networks that maintains a stable predefined level of their information capacity throughout the training process by applying Shannon entropy based penalty to convolutional filters. The results of experiments conducted on the SVHN, CIFAR and ImageNet datasets demonstrate that the proposed approach can statistically significantly improve the accuracy of binary networks.

Learning Filter Basis for Convolutional Neural Network Compression

Conference Paper

Full-text available

Oct 2019

Mixed Precision DNNs: All you need is a good parametrization

Conference Paper

Full-text available

Jan 2020

Efficient deep neural network (DNN) inference on mobile or embedded devices typically involves quantization of the network parameters and activations. In particular, mixed precision networks achieve better performance than networks with homogeneous bitwidth for the same size constraint. Since choosing the optimal bitwidths is not straight forward, training methods, which can learn them, are desirable. Differentiable quantization with straight-through gradients allows to learn the quantizer's parameters using gradient methods. We show that a suited parametrization of the quantizer is the key to achieve a stable training and a good final performance. Specifically, we propose to parametrize the quantizer with the step size and dynamic range. The bitwidth can then be inferred from them. Other parametrizations, which explicitly use the bitwidth, consistently perform worse. We confirm our findings with experiments on CIFAR-10 and ImageNet and we obtain mixed precision DNNs with learned quantization parameters, achieving state-of-the-art performance.

AIM 2019 Challenge on RAW to RGB Mapping: Methods and Results

Conference Paper

Full-text available

Nov 2019

This paper reviews the first AIM challenge on mapping camera RAW to RGB images with the focus on proposed solutions and results. The participating teams were solving a real-world photo enhancement problem, where the goal was to map the original low-quality RAW images from the Huawei P20 device to the same photos captured with the Canon 5D DSLR camera. The considered problem embraced a number of computer vision subtasks, such as image demosaicing, denoising, gamma correction, image resolution and sharpness enhancement, etc. The target metric used in this challenge combined fidelity scores (PSNR and SSIM) with solutions' perceptual results measured in a user study. The proposed solutions significantly improved base-line results, defining the state-of-the-art for RAW to RGB image restoration.

NTIRE 2019 Challenge on Real Image Super-Resolution: Methods and Results

Conference Paper

Full-text available

Nov 2019

This paper reviewed the 3rd NTIRE challenge on single-image super-resolution (restoration of rich details in a low-resolution image) with a focus on proposed solutions and results. The challenge had 1 track, which was aimed at the real-world single image super-resolution problem with an unknown scaling factor. Participants were mapping low-resolution images captured by a DSLR camera with a shorter focal length to their high-resolution images captured at a longer focal length. With this challenge, we introduced a novel real-world super-resolution dataset (Re-alSR). The track had 403 registered participants, and 36 teams competed in the final testing phase. They gauge the state-of-the-art in real-world single image super-resolution.

NTIRE 2019 Challenge on Image Enhancement: Methods and Results

Conference Paper

Full-text available

Oct 2019

This paper reviews the first NTIRE challenge on perceptual image enhancement with the focus on proposed solutions and results. The participating teams were solving a real-world photo enhancement problem, where the goal was to map low-quality photos from the iPhone 3GS device to the same photos captured with Canon 70D DSLR camera. The considered problem embraced a number of computer vision subtasks, such as image denoising, image resolution and sharpness enhancement, image color/contrast/exposure adjustment, etc. The target metric used in this challenge combined fidelity scores (PSNR and SSIM) with solutions' perceptual results measured in a user study. From above 200 registered participants, 13 teams submitted solutions for the final test phase of the challenge. The proposed solutions significantly improved baseline results, defining the state-of-the-art for practical image enhancement.

Towards Real-Time Unsupervised Monocular Depth Estimation on CPU

Conference Paper

Full-text available

Oct 2018

Monocular Relative Depth Perception with Web Stereo Data Supervision

Conference Paper

Full-text available

Sep 2018

In this paper we study the problem of monocular relative depth perception in the wild. We introduce a simple yet effective method to automatically generate dense relative depth annotations from web stereo images, and propose a new dataset that consists of diverse images as well as corresponding dense relative depth maps. Further, an improved ranking loss is introduced to deal with imbalanced ordinal relations, enforcing the network to focus on a set of hard pairs. Experimental results demonstrate that our proposed approach not only achieves state-of-the-art accuracy of relative depth perception in the wild, but also benefits other dense per-pixel prediction tasks, e.g., metric depth estimation and semantic segmentation.

Depth Data Error Modeling of the ZED 3D Vision Sensor from Stereolabs

Article

Full-text available

Apr 2018

The ZED camera is binocular vision system that can be used to provide a 3D perception of the world. It can be applied in autonomous robot navigation, virtual reality, tracking, motion analysis and so on. This paper proposes a mathematical error model for depth data estimated by the ZED camera with its several resolutions of operation. For doing that, the ZED is attached to a Nvidia Jetson TK1 board providing an embedded system that is used for processing raw data acquired by ZED from a 3D checkerboard. Corners are extracted from the checkerboard using RGB data, and a 3D reconstruction is done for these points using disparity data calculated from the ZED camera, coming up with a partially ordered, and regularly distributed (in 3D space) point cloud of corners with given coordinates (x e , y e , z e), which are computed by the device software. These corners also have their ideal world (3D) positions (x i , y i , z i) known with respect to the coordinate frame origin that is empirically set in the pattern. Both given (computed) coordinates from the camera's data and known (ideal) coordinates of a corner can, thus, be compared for estimating the error between the given and ideal point locations of the detected corner cloud. Subsequently, using a curve fitting technique, we obtain the equations that model the RMS (Root Mean Square) error. This procedure is repeated for several resolutions of the ZED sensor, and at several distances. Results showed its best effectiveness with a maximum distance of approximately sixteen meters, in real time, which allows its use in robotic or other online applications.

Attention U-Net: Learning Where to Look for the Pancreas

Article

Full-text available

Apr 2018

We propose a novel attention gate (AG) model for medical imaging that automatically learns to focus on target structures of varying shapes and sizes. Models trained with AGs implicitly learn to suppress irrelevant regions in an input image while highlighting salient features useful for a specific task. This enables us to eliminate the necessity of using explicit external tissue/organ localisation modules of cascaded convolutional neural networks (CNNs). AGs can be easily integrated into standard CNN architectures such as the U-Net model with minimal computational overhead while increasing the model sensitivity and prediction accuracy. The proposed Attention U-Net architecture is evaluated on two large CT abdominal datasets for multi-class image segmentation. Experimental results show that AGs consistently improve the prediction performance of U-Net across different datasets and training sizes while preserving computational efficiency. The code for the proposed architecture is publicly available.

U-Net: Convolutional Networks for Biomedical Image Segmentation

Book

Jan 2015

AdaBins: Depth Estimation Using Adaptive Bins

Conference Paper

Jun 2021

Real-Time Video Super-Resolution on Smartphones with Deep Learning, Mobile AI 2021 Challenge: Report

Conference Paper

Jun 2021

Learned Smartphone ISP on Mobile NPUs with Deep Learning, Mobile AI 2021 Challenge: Report

Conference Paper

Jun 2021

Fast and Accurate Quantized Camera Scene Detection on Smartphones, Mobile AI 2021 Challenge: Report

Conference Paper

Jun 2021

A Simple Baseline for Fast and Accurate Depth Estimation on Mobile Devices

Conference Paper

Jun 2021

Real-Time Quantized Image Super-Resolution on Mobile NPUs, Mobile AI 2021 Challenge: Report

Conference Paper

Jun 2021

Fast Camera Image Denoising on Mobile GPUs with Deep Learning, Mobile AI 2021 Challenge: Report

Conference Paper

Jun 2021

Knowledge Distillation for Fast and Accurate Monocular Depth Estimation on Mobile Devices

Conference Paper

Jun 2021

AIM 2020 Challenge on Rendering Realistic Bokeh

Chapter

Jan 2020

This paper reviews the second AIM realistic bokeh effect rendering challenge and provides the description of the proposed solutions and results. The participating teams were solving a real-world bokeh simulation problem, where the goal was to learn a realistic shallow focus technique using a large-scale EBB! bokeh dataset consisting of 5K shallow/wide depth-of-field image pairs captured using the Canon 7D DSLR camera. The participants had to render bokeh effect based on only one single frame without any additional data from other cameras or sensors. The target metric used in this challenge combined the runtime and the perceptual quality of the solutions measured in the user study. To ensure the efficiency of the submitted models, we measured their runtime on standard desktop CPUs as well as were running the models on smartphone GPUs. The proposed solutions significantly improved the baseline results, defining the state-of-the-art for practical bokeh effect rendering problem.

FastDepth: Fast monocular depth estimation on embedded systems

Article

May 2019

Depth sensing is a critical function for robotic tasks such as localization, mapping and obstacle detection. There has been a significant and growing interest in depth estimation from a single RGB image, due to the relatively low cost and size of monocular cameras. However, state-of-the-art single-view depth estimation algorithms are based on fairly complex deep neural networks that are too slow for real-time inference on an embedded platform, for instance, mounted on a micro aerial vehicle. In this paper, we address the problem of fast depth estimation on embedded systems. We propose an efficient and lightweight encoder-decoder network architecture and apply network pruning to further reduce computational complexity and latency. In particular, we focus on the design of a low-latency decoder. Our methodology demonstrates that it is possible to achieve similar accuracy as prior work on depth estimation, but at inference speeds that are an order of magnitude faster. Our proposed network, FastDepth, runs at 178 fps on an NVIDIA Jetson TX2 GPU and at 27 fps when using only the TX2 CPU, with active power consumption under 10 W. FastDepth achieves close to state-of-the-art accuracy on the NYU Depth v2 dataset. To the best of the authors' knowledge, this paper demonstrates real-time monocular depth estimation using a deep neural network with the lowest latency and highest throughput on an embedded platform that can be carried by a micro aerial vehicle.

ZeroQ: A Novel Zero Shot Quantization Framework

Conference Paper

Jun 2020

FBNetV2: Differentiable Neural Architecture Search for Spatial and Channel Dimensions

Conference Paper

Jun 2020

NTIRE 2020 Challenge on Real-World Image Super-Resolution: Methods and Results

Conference Paper

Jun 2020

Deploying Image Deblurring across Mobile Devices: A Perspective of Quality and Latency

Conference Paper

Jun 2020

Replacing Mobile Camera ISP with a Single Deep Learning Model

Conference Paper

Jun 2020

NTIRE 2020 Challenge on Real Image Denoising: Dataset, Methods and Results

Conference Paper

Jun 2020

Rendering Natural Camera Bokeh Effect with Deep Learning

Conference Paper

Jun 2020

NTIRE 2018 Challenge on Single Image Super-Resolution: Methods and Results

Conference Paper

Jun 2018

This paper reviews the 2nd NTIRE challenge on single image super-resolution (restoration of rich details in a low resolution image) with focus on proposed solutions and results. The challenge had 4 tracks. Track 1 employed the standard bicubic downscaling setup, while Tracks 2, 3 and 4 had realistic unknown downgrading operators simulating camera image acquisition pipeline. The operators were learnable through provided pairs of low and high resolution train images. The tracks had 145, 114, 101, and 113 registered participants, resp., and 31 teams competed in the final testing phase. They gauge the state-of-the-art in single image super-resolution.

NTIRE 2019 Challenge on Real Image Super-Resolution: Methods and Results

Conference Paper

Jun 2019

Digging Into Self-Supervised Monocular Depth Estimation

Conference Paper

Nov 2019

Per-pixel ground-truth depth data is challenging to acquire at scale. To overcome this limitation, self-supervised learning has emerged as a promising alternative for training models to perform monocular depth estimation. In this paper, we propose a set of improvements, which together result in both quantitatively and qualitatively improved depth maps compared to competing self-supervised methods. Research on self-supervised monocular training usually explores increasingly complex architectures, loss functions, and image formation models, all of which have recently helped to close the gap with fully-supervised methods. We show that a surprisingly simple model, and associated design choices, lead to superior predictions. In particular, we propose (i) a minimum reprojection loss, designed to robustly handle occlusions, (ii) a full-resolution multi-scale sampling method that reduces visual artifacts, and (iii) an auto-masking loss to ignore training pixels that violate camera motion assumptions. We demonstrate the effectiveness of each component in isolation, and show high quality, state-of-the-art results on the KITTI benchmark.

AIM 2019 Challenge on Real-World Image Super-Resolution: Methods and Results

Conference Paper