Pose-controllable character video generation is in high demand with extensive applications for fields such as automatic advertising and content creation on social media platforms. While existing character image animation methods using pose sequences and reference images have shown promising performance, they tend to struggle with incoherent animati...

Follow-Your-Emoji: Fine-Controllable and Expressive Freestyle Portrait Animation

Preprint

Jun 2024

We present Follow-Your-Emoji, a diffusion-based framework for portrait animation, which animates a reference portrait with target landmark sequences. The main challenge of portrait animation is to preserve the identity of the reference portrait and transfer the target expression to this portrait while maintaining temporal consistency and fidelity....

BadCLIP: Trigger-Aware Prompt Learning for Backdoor Attacks on CLIP

Conference Paper

Full-text available

Jun 2024

Contrastive Vision-Language Pre-training, known as CLIP, has shown promising effectiveness in addressing downstream image recognition tasks. However, recent works revealed that the CLIP model can be implanted with a downstream-oriented backdoor. On downstream tasks, one victim model performs well on clean samples but predicts a specific target clas...

Inducing High Energy-Latency of Large Vision-Language Models with Verbose Images

Conference Paper

Full-text available

Jun 2024

Large vision-language models (VLMs) such as GPT-4 have achieved exceptional performance across various multi-modal tasks. However, the deployment of VLMs necessitates substantial energy consumption and computational resources. Once attackers maliciously induce high energy consumption and latency time (energy-latency cost) during inference of VLMs,...

Comparison results for image restoration in adverse weather...

GridFormer architecture. It consists of a grid head, a grid fusion...

Grid unit structure and information flow. (a) The structure of a...

The structure of the proposed Residual Dense Transformer Block (RDTB)....

Right: the schematic illustration of the proposed Compact-enhanced...

GridFormer: Residual Dense Transformer with Grid Structure for Image Restoration in Adverse Weather Conditions

Article

Full-text available

May 2024

Image restoration in adverse weather conditions is a difficult task in computer vision. In this paper, we propose a novel transformer-based framework called GridFormer which serves as a backbone for image restoration under adverse weather conditions. GridFormer is designed in a grid structure using a residual dense transformer block, and it introdu...

Causality-Invariant Interactive Mining for Cross-Modal Similarity Learning

Article

Mar 2024

In the real world, how to effectively learn consistent similarity measurement across different modalities is essential. Most of the existing similarity learning methods cannot deal well with cross-modal data due to the modality gap and have obvious performance degeneration when applied to cross-modal data. To tackle this problem, we propose a novel...

Face Recognition with Synthetic Data

Chapter

Dec 2023

In the last few years, face recognition has achieved extraordinary progress in a wide range of challenging problems including pose-robust face recognition [5, 24, 63], matching faces across ages [15, 17, 56, 60], across modalities [13, 14, 16, 30, 31], and occlusions [40, 49, 71]. Among these progresses, not only the very deep neural networks [22,...

CrossFormer++: A Versatile Vision Transformer Hinging on Cross-Scale Attention

Article

Dec 2023

While features of different scales are perceptually important to visual inputs, existing vision transformers do not yet take advantage of them explicitly. To this end, we first propose a cross-scale vision transformer, CrossFormer. It introduces a cross-scale embedding layer (CEL) and a long-short distance attention (LSDA). On the one hand, CEL ble...

Learnable Central Similarity Quantization for Efficient Image and Video Retrieval

Article

Dec 2023

Data-dependent hashing methods aim to learn hash functions from the pairwise or triplet relationships among the data, which often lead to low efficiency and low collision rate by only capturing the local distribution of the data. To solve the limitation, we propose central similarity, in which the hash codes of similar data pairs are encouraged to...

Unsupervised Cross-Modal Hashing With Modality-Interaction

Article

Sep 2023

Recently, numerous unsupervised cross-modal hashing methods have been proposed to deal with the image-text retrieval tasks for the unlabeled cross-modal data. However, when these methods learn to generate hash codes, almost all of them lack modality-interaction in the following two aspects: (1) The instance similarity matrix used to guide the hashing ne...

LLDiffusion: Learning Degradation Representations in Diffusion Models for Low-Light Image Enhancement

Preprint

Jul 2023

Current deep learning methods for low-light image enhancement (LLIE) typically rely on pixel-wise mapping learned from paired data. However, these methods often overlook the importance of considering degradation representations, which can lead to sub-optimal outcomes. In this paper, we address this limitation by proposing a degradation-aware learni...

Symbolic Replay: Scene Graph as Prompt for Continual Learning on VQA Task

Article

Jun 2023

VQA is an ambitious task aiming to answer any image-related question. However, in reality, it is hard to build such a system once and for all since the needs of users are continuously updated, and the system has to implement new functions. Thus, Continual Learning (CL) ability is a must in developing advanced VQA systems. Recently, a pioneering work split...

Towards In-Distribution Compatible Out-of-Distribution Detection

Article

Jun 2023

Deep neural network, despite its remarkable capability of discriminating targeted in-distribution samples, shows poor performance on detecting anomalous out-of-distribution data. To address this defect, state-of-the-art solutions choose to train deep networks on an auxiliary dataset of outliers. Various training criteria for these auxiliary outlier...

Bilateral Relation Distillation for Weakly Supervised Temporal Action Localization

Article

Jun 2023

Weakly supervised temporal action localization (WSTAL), which aims to locate the time interval of actions in an untrimmed video with only video-level action labels, has attracted increasing research interest in the past few years. However, a model trained with such labels will tend to focus on segments that contribute most to the video-level cla...

Global and Local Semantic Completion Learning for Vision-Language Pre-training

Preprint

Jun 2023

Cross-modal alignment plays a crucial role in vision-language pre-training (VLP) models, enabling them to capture meaningful associations across different modalities. For this purpose, inspired by the success of masked language modeling (MLM) tasks in the NLP pre-training area, numerous masked modeling tasks have been proposed for VLP to further pr...

LARNeXt: End-to-End Lie Algebra Residual Network for Face Recognition

Article

Jun 2023

Face recognition has always been courted in computer vision and is especially amenable to situations with significant variations between frontal and profile faces. Traditional techniques make great strides either by synthesizing frontal faces from sizable datasets or by empirical pose invariant learning. In this paper, we propose a completely integ...

Seeing What You Miss: Vision-Language Pre-training with Semantic Completion Learning

Conference Paper

Jun 2023

GridFormer: Residual Dense Transformer with Grid Structure for Image Restoration in Adverse Weather Conditions

Preprint

May 2023

Img2Vec: A Teacher of High Token-Diversity Helps Masked AutoEncoders

Preprint

Apr 2023

We present a pipeline of Image to Vector (Img2Vec) for masked image modeling (MIM) with deep features. To study which type of deep features is appropriate for MIM as a learning target, we propose a simple MIM framework with a series of well-trained self-supervised models to convert an Image to a feature Vector as the learning target of MIM, where th...

Efficient-Adam: Communication-Efficient Distributed Adam

Article

Jan 2023

Distributed adaptive stochastic gradient methods have been widely used for large-scale nonconvex optimization, such as training deep learning models. However, their communication complexity on finding $\varepsilon$-stationary points has rarely been analyzed in the nonconvex setting. In this work, we present a novel communication-efficient distri...

MC-Blur: A Comprehensive Benchmark for Image Deblurring

Article

Jan 2023

Blur artifacts can seriously degrade the visual quality of images, and numerous deblurring methods have been proposed for specific scenarios. However, in most real-world images, blur is caused by different factors, e.g., motion and defocus. In this paper, we address how other deblurring methods perform in the case of multiple types of blur. For...

Seeing What You Miss: Vision-Language Pre-training with Semantic Completion Learning

Preprint

Nov 2022

Cross-modal alignment is essential for vision-language pre-training (VLP) models to learn the correct corresponding information across different modalities. For this purpose, inspired by the success of masked language modeling (MLM) tasks in the NLP pre-training area, numerous masked modeling tasks have been proposed for VLP to further promote cros...

Improving Vision Transformers by Revisiting High-Frequency Components

Chapter

Nov 2022

The transformer models have shown promising effectiveness in dealing with various vision tasks. However, compared with training Convolutional Neural Network (CNN) models, training Vision Transformer (ViT) models is more difficult and relies on the large-scale training set. To explain this observation we make a hypothesis that ViT models are less ef...

A Survey of Deep Face Restoration: Denoise, Super-Resolution, Deblur, Artifact Removal

Preprint

Nov 2022

Face Restoration (FR) aims to restore High-Quality (HQ) faces from Low-Quality (LQ) input images, which is a domain-specific image restoration problem in the low-level computer vision area. The early face restoration methods mainly use statistic priors and degradation models, which are difficult to meet the requirements of real-world applications i...

Triangle Attack: A Query-Efficient Decision-Based Adversarial Attack

Chapter

Nov 2022

Decision-based attack poses a severe threat to real-world applications since it regards the target model as a black box and only accesses the hard prediction label. Great efforts have been made recently to decrease the number of queries; however, existing decision-based attacks still require thousands of queries in order to generate good quality ad...

Hardly Perceptible Trojan Attack Against Neural Networks with Bit Flips

Chapter

Nov 2022

The security of deep neural networks (DNNs) has attracted increasing attention due to their widespread use in various applications. Recently, the deployed DNNs have been demonstrated to be vulnerable to Trojan attacks, which manipulate model parameters with bit flips to inject a hidden behavior and activate it by a specific trigger pattern. However...

Symbolic Replay: Scene Graph as Prompt for Continual Learning on VQA Task

Preprint

Aug 2022

SWEM: Towards Real-Time Video Object Segmentation with Sequential Weighted Expectation-Maximization

Preprint

Aug 2022

Matching-based methods, especially those based on space-time memory, are significantly ahead of other solutions in semi-supervised video object segmentation (VOS). However, continuously growing and redundant template features lead to an inefficient inference. To alleviate this, we propose a novel Sequential Weighted Expectation-Maximization (SWEM)...

Hardly Perceptible Trojan Attack against Neural Networks with Bit Flips

Preprint

Full-text available

Jul 2022

Egocentric Video-Language Pretraining @ EPIC-KITCHENS-100 Multi-Instance Retrieval Challenge 2022

Preprint

Jul 2022

In this report, we propose a video-language pretraining (VLP) based solution for the EPIC-KITCHENS-100 Multi-Instance Retrieval (MIR) challenge. Especially, we exploit the recently released Ego4D dataset to pioneer Egocentric VLP from pretraining dataset, pretraining objective, and development set. Bas...

Egocentric Video-Language Pretraining @ Ego4D Challenge 2022

Preprint

Jul 2022

In this report, we propose a video-language pretraining (VLP) based solution for four Ego4D challenge tasks, including Natural Language Query (NLQ), Moment Query (MQ), Object State Change Classification (OSCC), and PNR Localization (PNR). Especially, we exploit the recently released Ego4D dataset to pi...

The illustration of stereo cameras. One pair of images...

The architecture of the proposed semantic-aware deraining module. Rainy...

The Semantic-rethinking Loop. During training, rainy images are fed...

The architecture of SFNet. The coarse deraining images and semantic...

The architecture of VFNet. Features volumes from stereo images are...

Beyond Monocular Deraining: Parallel Stereo Deraining Network Via Semantic Prior

Article

Full-text available

Jul 2022

Rain is a common natural phenomenon. Taking images in the rain, however, often results in degraded image quality and thus compromises the performance of many computer vision systems. Most existing de-rain algorithms use only one single input image and aim to recover a clean image. Few works have exploited stereo images. Moreover, even for single image...

EDFace-Celeb-1M: Benchmarking Face Hallucination With a Million-Scale Dataset

Article

Jun 2022

Recent deep face hallucination methods show stunning performance in super-resolving severely degraded facial images, even surpassing human ability. However, these algorithms are mainly evaluated on non-public synthetic datasets. It is thus unclear how these algorithms perform on public face hallucination datasets. Meanwhile, most of the existing da...

Figure 8: Visualization of EgoClip clip-text pairs. We sample five...

Figure 10: Institution distributions of EgoMCQ

Recall and mAP metrics for several IoU on the Moment Query task's val....

Egocentric Video-Language Pretraining

Preprint

Full-text available

Jun 2022

Video-Language Pretraining (VLP), aiming to learn transferable representation to advance a wide range of video-text downstream tasks, has recently received increasing attention. Dominant works that achieve strong performance rely on large-scale, 3rd-person video-text datasets, such as HowTo100M. In this work, we exploit the recently released Ego4D...

SWEM: Towards Real-Time Video Object Segmentation with Sequential Weighted Expectation-Maximization

Conference Paper

Jun 2022

Fig. 2 Experimental results for stochastic convex optimization (2). The...

Fig. 3 Ablation study with/without error-feedback on the norm of...

Fig. 4 Experimental results for image classification task on the...

Fig. 5 Experimental results for image classification on the CIFAR100...

Fig. 6 Ablation study with/without error-feedback on the training loss...

Efficient-Adam: Communication-Efficient Distributed Adam with Complexity Analysis

Preprint

Full-text available

May 2022

Distributed adaptive stochastic gradient methods have been widely used for large-scale nonconvex optimization, such as training deep learning models. However, their communication complexity on finding $\varepsilon$-stationary points has rarely been analyzed in the nonconvex setting. In this work, we present a novel communication-efficient distribut...

RFNet: Unsupervised Network for Mutually Reinforcing Multi-modal Image Registration and Fusion

Conference Paper

Full-text available

Apr 2022

In this paper, we propose a novel method to realize multi-modal image registration and fusion in a mutually reinforcing framework, termed as RFNet. We handle the registration in a coarse-to-fine fashion. For the first time, we exploit the feedback of image fusion to promote the registration accuracy rather than treating them as two separate issues....

The ablation study results on MSR-VTT 1k.

HunYuan_tvr for Text-Video Retrivial

Preprint

Full-text available

Apr 2022

Text-Video Retrieval plays an important role in multi-modal understanding and has attracted increasing attention in recent years. Most existing methods focus on constructing contrastive pairs between whole videos and complete caption sentences, while ignoring fine-grained cross-modal relationships, e.g., short clips and phrases or single frame and...

Enhanced Spatio-Temporal Interaction Learning for Video Deraining: A Faster and Better Framework

Article

Feb 2022

Video deraining is an important task in computer vision as the unwanted rain hampers the visibility of videos and deteriorates the robustness of most outdoor vision systems. Despite the significant success which has been achieved for video deraining recently, two major challenges remain: 1) how to exploit the vast information among successive frame...

Figure 1. ImageNet accuracy v.s. model capacity. All models are trained...

Figure 3. The procedure of our proposed DynaMixer operation for one...

Image classification results of our DynaMixer and other

DynaMixer: A Vision MLP Architecture with Dynamic Mixing

Preprint

Full-text available

Jan 2022

Recently, MLP-like vision models have achieved promising performances on mainstream visual recognition tasks. In contrast with vision transformers and CNNs, the success of MLP-like models shows that simple information fusion operations among tokens and channels can yield a good representation power for deep recognition models. However, existing MLP...

Retrieval results on DiDeMo. * indicates that the method uses...

Retrieval results on ActivityNet. * indicates that the method uses...

Tencent Text-Video Retrieval: Hierarchical Cross-Modal Interactions with Multi-Level Representations

Article

Full-text available

Jan 2022

Text-Video Retrieval plays an important role in multi-modal understanding and has attracted increasing attention in recent years. Most existing methods focus on constructing contrastive pairs between whole videos and complete caption sentences, while overlooking fine-grained cross-modal relationships, e.g., clip-phrase or frame-word. In this paper,...

Fig. 4: The curves of output scores of the student discriminator D with...

Fig. 11: The edited images synthesized via DGL-GAN with channel...

Fig. 12: Human faces generated by original StyleGAN2 and DGL-GAN with...

Fig. 15: The generated images of BigGAN and DGL-GAN with channel...

Fig. 16: The curves of loss values of Adv(G, D) and Adv(G, D)

DGL-GAN: Discriminator Guided Learning for GAN Compression

Preprint

Full-text available

Dec 2021

Generative Adversarial Networks (GANs) with high computation costs, e.g., BigGAN and StyleGAN2, have achieved remarkable results in synthesizing high resolution and diverse images with high fidelity from random noises. Reducing the computation cost of GANs while still generating photo-realistic images is an urgent and challenging field for their...

Triangle Attack: A Query-efficient Decision-based Adversarial Attack

Preprint

Full-text available

Dec 2021

Benchmarking Deep Deblurring Algorithms: A Large-Scale Multi-Cause Dataset and A New Baseline Model

Preprint

Full-text available

Nov 2021

Blur artifacts can seriously degrade the visual quality of images, and numerous deblurring methods have been proposed for specific scenarios. However, in most real-world images, blur is caused by different factors, e.g., motion and defocus. In this paper, we address how different deblurring methods perform on general types of blur. For in-depth per...

Quantized Adam with Error Feedback

Article

Oct 2021

In this article, we present a distributed variant of an adaptive stochastic gradient method for training deep neural networks in the parameter-server model. To reduce the communication cost among the workers and server, we incorporate two types of quantization schemes, i.e., gradient quantization and weight quantization, into the proposed distribut...

EDFace-Celeb-1M: Benchmarking Face Hallucination with a Million-scale Dataset

Preprint

Full-text available

Oct 2021

Benchmarking Ultra-High-Definition Image Super-resolution

Conference Paper

Oct 2021

SynFace: Face Recognition with Synthetic Data

Conference Paper

Full-text available

Oct 2021

Pyramid Architecture Search for Real-Time Image Deblurring

Conference Paper

Oct 2021

Multi-scale and multi-patch deep models have been shown effective in removing blurs of dynamic scenes. However, these methods still suffer from one major obstacle: manually designing a lightweight and high-efficiency network is challenging and time-consuming. To tackle this obstacle, we propose a novel deblurring method, dubbed PyNAS (pyramid neura...

End2End Occluded Face Recognition by Masking Corrupted Features

Preprint

Full-text available

Aug 2021

With the recent advancement of deep convolutional neural networks, significant progress has been made in general face recognition. However, the state-of-the-art general face recognition models do not generalize well to occluded face images, which are exactly the common cases in real-world scenarios. The potential reasons are the absences of large-s...

SynFace: Face Recognition with Synthetic Data

Preprint

Aug 2021

With the recent success of deep neural networks, remarkable progress has been achieved on face recognition. However, collecting large-scale real-world training data for face recognition has turned out to be challenging, especially due to the label noise and privacy issues. Meanwhile, existing face recognition datasets are usually collected from web...

LARNet: Lie Algebra Residual Network for Face Recognition

Conference Paper

Full-text available

Aug 2021

Face recognition is an important yet challenging problem in computer vision. A major challenge in practical face recognition applications lies in significant variations between profile and frontal faces. Traditional techniques address this challenge either by synthesizing frontal faces or by pose invariant learning. In this paper, we propose a nove...

Learning Semantic Priors for Texture-Realistic Sketch-to-Image Synthesis

Article

Aug 2021

Sketch-to-image synthesis is a challenging task in the field of computer vision that generates photo-realistic images from given sketches. Existing methods of this kind are unable to discover the inherent semantic information contained in an image and use it to guide the synthesis process, which substantially reduces their capacity to generate photo-reali...

Figure 1. Meteorological records (for 2015) of the eight weather...

Figure 2. Valid daily NPP-VIIRS albedo values corresponding to the...

Figure 3. Calibration ((a); 236 data) and validation ((b); 102 data)...

Figure 5. Invalid NPP-VIIRS albedo data were retrieved from...

Figure 6. Comparison between the original and retrieved NPP-VIIRS and...

Novel approach for retrieving land-surface albedo: case study at the Nanling National Nature Reserve, Guangdong Province

Article

Full-text available

Jul 2021

We developed a novel method for retrieving land-surface albedo (LSA) based on 338 groups of meteorological and NPP-VIIRS albedo data. Results showed that the LSA retrieval model calibrated using the random forest (RF) machine-learning regression algorithm performed well. The RF-based LSA retrieval model explained approximately 84% of the NPP-VIIRS a...

End2End Occluded Face Recognition by Masking Corrupted Features

Article

Full-text available

Jul 2021

AlphaGAN: Fully Differentiable Architecture Search for Generative Adversarial Networks

Article

Jul 2021

Generative Adversarial Networks (GANs) are formulated as minimax game problems, where generators attempt to approach real data distributions by adversarial learning against discriminators which learn to distinguish generated samples from real ones. In this work, we aim to boost model learning from the perspective of network architectures, by incorp...

Parser-Free Virtual Try-on via Distilling Appearance Flows

Conference Paper

Jun 2021

DeFLOCNet: Deep Image Editing via Flexible Low-level Controls

Conference Paper

Jun 2021

ArtFlow: Unbiased Image Style Transfer via Reversible Neural Flows

Conference Paper

Full-text available

Jun 2021

VideoMoCo: Contrastive Video Representation Learning with Temporally Adversarial Examples

Conference Paper

Jun 2021

Disentangled Cycle Consistency for Highly-realistic Virtual Try-On

Conference Paper

Jun 2021

Beyond Monocular Deraining: Parallel Stereo Deraining Network Via Semantic Prior

Preprint

May 2021

Figure 1. Content leak visualization. Existing style transfer methods...

Figure 3. Loss curves of AdaIN [20] training: Using both content and...

Figure 5. Causes of the Content Leak phenomenon. (a) Reconstruction...

Figure 11. Visualization of content features of AdaIN, WCT, and the...

ArtFlow: Unbiased Image Style Transfer via Reversible Neural Flows

Preprint

Full-text available

Mar 2021

Universal style transfer retains styles from reference images in content images. While existing methods have achieved state-of-the-art style transfer performance, they are not aware of the content leak phenomenon that the image content may corrupt after several rounds of stylization process. In this paper, we propose ArtFlow to prevent content leak...

Enhanced Spatio-Temporal Interaction Learning for Video Deraining: A Faster and Better Framework

Preprint

Full-text available

Mar 2021

DeFLOCNet: Deep Image Editing via Flexible Low-level Controls

Preprint

Mar 2021

User-intended visual content fills the hole regions of an input image in the image editing scenario. The coarse low-level inputs, which typically consist of sparse sketch lines and color dots, convey user intentions for content creation (i.e., free-form editing). While existing methods combine an input image and these low-level controls for CNN inpu...

Figure 2. Virtual try-on comparisons. Inpainting based methods (ACGPN...

Figure 5. Visual evaluations on the VITON-HD dataset. Our DCTON is...

Disentangled Cycle Consistency for Highly-realistic Virtual Try-On

Preprint

Full-text available

Mar 2021

Image virtual try-on replaces the clothes on a person image with a desired in-shop clothes image. It is challenging because the person and the in-shop clothes are unpaired. Existing methods formulate virtual try-on as either in-painting or cycle consistency. Both of these two formulations encourage the generation networks to reconstruct the input i...

VideoMoCo: Contrastive Video Representation Learning with Temporally Adversarial Examples

Preprint

Mar 2021

MoCo is effective for unsupervised image representation learning. In this paper, we propose VideoMoCo for unsupervised video representation learning. Given a video sequence as an input sample, we improve the temporal feature representations of MoCo from two perspectives. First, we introduce a generator to drop out several frames from this sample te...

Parser-Free Virtual Try-on via Distilling Appearance Flows

Preprint

Full-text available

Mar 2021

Image virtual try-on aims to fit a garment image (target clothes) to a person image. Prior methods are heavily based on human parsing. However, slightly-wrong segmentation results would lead to unrealistic try-on images with large artifacts. Inaccurate parsing misleads parser-based methods to produce visually unrealistic results where artifacts usu...

Self-Supervised Video Representation Learning by Uncovering Spatio-Temporal Statistics

Article

Feb 2021

This paper proposes a novel pretext task to address the self-supervised video representation learning problem. Specifically, given an unlabeled video clip, we compute a series of spatio-temporal statistical summaries, such as the spatial location and dominant direction of the largest motion, the spatial location and dominant color of the largest co...

Visual tracking via supervised and unsupervised learnings. Supervised...

Precision and success plots on the OTB-2015 dataset Wu et al. (2015)...

Precision and success plots on the Temple-Color dataset Liang et al....

Top: Accuracy-Robustness (AR) ranking plots generated by sequence mean...

Success plots on the LaSOT testing set Fan et al. (2019). The legend in...

Unsupervised Deep Representation Learning for Real-Time Tracking

Article

Full-text available

Feb 2021

The advancement of visual tracking has continuously been brought by deep learning models. Typically, supervised learning is employed to train these models with expensive labeled data. In order to reduce the workload of manual annotation and learn to track arbitrary objects, we propose an unsupervised learning method for visual tracking. The motivat...

Figure 1: The above figures are for function value with different r and...

Figure 4: Performance profiles of mini-batch Adam, RMSProp and AMSGrad...

Figure 5: Performance profiles of mini-batch Adam, RMSProp and AMSGrad...

Towards Practical Adam: Non-Convexity, Convergence Theory, and Mini-Batch Acceleration

Preprint

Full-text available

Jan 2021

Adam is one of the most influential adaptive stochastic algorithms for training deep neural networks, which has been pointed out to be divergent even in the simple convex setting via a few simple counterexamples. Many attempts, such as decreasing an adaptive learning rate, adopting a big batch size, incorporating a temporal decorrelation technique,...

Spatiotemporal Co-Attention Recurrent Neural Networks for Human-Skeleton Motion Prediction

Article

Jan 2021

Human motion prediction aims to generate future motions based on the observed human motions. Witnessing the success of Recurrent Neural Networks in modeling the sequential data, recent works utilize RNN to model human-skeleton motion on the observed motion sequence and predict future human motions. However, these methods disregarded the existence o...

PointPWC-Net: Cost Volume on Point Clouds for (Self-)Supervised Scene Flow Estimation

Chapter

Full-text available

Oct 2020

We propose a novel end-to-end deep scene flow model, called PointPWC-Net, that directly processes 3D point cloud scenes with large motions in a coarse-to-fine fashion. Flow computed at the coarse level is upsampled and warped to a finer level, enabling the algorithm to accommodate large motion without a prohibitive search space. We introduce no...

Learning Modality Interaction for Temporal Sentence Localization and Event Captioning in Videos

Chapter

Oct 2020

Automatically generating sentences to describe events and temporally localizing sentences in a video are two important tasks that bridge language and videos. Recent techniques leverage the multimodal nature of videos by using off-the-shelf features to represent videos, but interactions between modalities are rarely explored. Inspired by the fact th...

Fewer is More: A Deep Graph Metric Learning Perspective Using Fewer Proxies

Preprint

Full-text available

Oct 2020

Deep metric learning plays a key role in various machine learning tasks. Most of the previous works have been confined to sampling from a mini-batch, which cannot precisely characterize the global geometry of the embedding space. Although researchers have developed proxy- and classification-based methods to tackle the sampling issue, those methods...

Ref-NMS: Breaking Proposal Bottlenecks in Two-Stage Referring Expression Grounding

Preprint

Sep 2020

The prevailing framework for solving referring expression grounding is based on a two-stage process: 1) detecting proposals with an object detector and 2) grounding the referent to one of the proposals. Existing two-stage solutions mostly focus on the grounding step, which aims to align the expressions with the proposals. In this paper, we argue th...

Self-supervised Video Representation Learning by Uncovering Spatio-temporal Statistics

Preprint

Full-text available

Aug 2020

Vulnerability vs. Reliability: Disentangled Adversarial Examples for Cross-Modal Learning

Conference Paper

Aug 2020

Learning Modality Interaction for Temporal Sentence Localization and Event Captioning in Videos

Preprint

Jul 2020

Unsupervised Deep Representation Learning for Real-Time Tracking

Preprint

Jul 2020

The advancement of visual tracking has continuously been brought by deep learning models. Typically, supervised learning is employed to train these models with expensive labeled data. In order to reduce the workload of manual annotations and learn to track arbitrary objects, we propose an unsupervised learning method for visual tracking. The motiva...

MTL-NAS: Task-Agnostic Neural Architecture Search Towards General-Purpose Multi-Task Learning

Conference Paper

Jun 2020

Deblurring by Realistic Blurring

Conference Paper

Full-text available

Jun 2020

Deblurring by Realistic Blurring

Preprint

Full-text available

Apr 2020

Existing deep learning methods for image deblurring typically train models using pairs of sharp images and their blurred counterparts. However, synthetically blurring images does not necessarily model the genuine blurring process in real-world scenarios with sufficient accuracy. To address this problem, we propose a new method which combines two GAN...

Pixel2Mesh: 3D Mesh Model Generation via Image Guided Deformation

Article

Apr 2020

In this paper, we propose an end-to-end deep learning architecture that generates 3D triangular meshes from single color images. Limited by the nature of the prevalent deep learning techniques, the majority of previous works usually represent 3D shapes in 3D volumes or point clouds. However, it is non-trivial to convert them to compact and ready-to...

MTL-NAS: Task-Agnostic Neural Architecture Search towards General-Purpose Multi-Task Learning

Preprint

Mar 2020

We propose to incorporate neural architecture search (NAS) into general-purpose multi-task learning (GP-MTL). Existing NAS methods typically define different search spaces according to different tasks. In order to adapt to different task combinations (i.e., task sets), we disentangle the GP-MTL networks into single-task backbones (optionally encode...

Networks with intermediate feature visualization. Yellow lines denote...

a The elemental block structure of conjoint layers of Sign, 1-bit...

The proposed network structure. a The shallow Bi-Real net for 18-layer...

The representational capability (R)...

a The 1-layer-per-block structure that will not work (He et al. 2016a)....

Bi-Real Net: Binarizing Deep Network Towards Real-Network Performance

Article

Full-text available

Jan 2020

In this paper, we study 1-bit convolutional neural networks (CNNs), in which both the weights and activations are binary. While efficient, their lack of representational capability and their training difficulty prevent 1-bit CNNs from performing as well as real-valued networks. To this end, we propose Bi-Real net with a novel training algorit...

Anytime Recognition with Routing Convolutional Networks

Article

Dec 2019

Achieving an automatic trade-off between accuracy and efficiency for a single deep neural network is highly desired in time-sensitive computer vision applications. To achieve anytime prediction, existing methods only embed fixed exits to neural networks and make the predictions with the fixed exits for all the samples (refer to the “latest-all” str...

PointPWC-Net: A Coarse-to-Fine Network for Supervised and Self-Supervised Scene Flow Estimation on 3D Point Clouds

Preprint

Full-text available

Nov 2019

We propose a novel end-to-end deep scene flow model, called PointPWC-Net, that operates on 3D point clouds in a coarse-to-fine fashion. Flow computed at the coarse level is upsampled and warped to a finer level, enabling the algorithm to accommodate large motion without a prohibitive search space. We introduce novel cost volume, upsampling, and warping layer...

Occlusion Robust Face Recognition Based on Mask Learning With Pairwise Differential Siamese Network

Conference Paper

Oct 2019

Controllable Video Captioning With POS Sequence Guidance Based on Gated Fusion Network

Conference Paper

Oct 2019

Spatiotemporal Co-attention Recurrent Neural Networks for Human-Skeleton Motion Prediction

Preprint

Full-text available

Sep 2019

Human motion prediction aims to generate future motions based on observed human motions. Following the success of Recurrent Neural Networks (RNNs) in modeling sequential data, recent works use RNNs to model human-skeleton motion over the observed motion sequence and predict future human motions. However, these methods did not consider the...

Hierarchical Long Short-Term Concurrent Memory for Human Interaction Recognition

Article

Full-text available

Sep 2019

In this work, we aim to address the problem of human interaction recognition in videos by exploring the long-term inter-related dynamics among multiple persons. Recently, Long Short-Term Memory (LSTM) has become a popular choice for modeling individual dynamics in single-person action recognition due to its ability to capture the temporal motion inform...

Controllable Video Captioning with POS Sequence Guidance Based on Gated Fusion Network

Preprint

Aug 2019

In this paper, we propose to guide the video caption generation with Part-of-Speech (POS) information, based on a gated fusion of multiple representations of input videos. We construct a novel gated fusion network, with one particularly designed cross-gating (CG) block, to effectively encode and fuse different types of representations, e.g., the mo...

Occlusion Robust Face Recognition Based on Mask Learning with Pairwise Differential Siamese Network

Preprint

Aug 2019

Deep Convolutional Neural Networks (CNNs) have been pushing the frontier of the face recognition research in the past years. However, existing general CNN face models generalize poorly to the scenario of occlusions on variable facial areas. Inspired by the fact that a human visual system explicitly ignores occlusions and only focuses on non-occlude...

Reconstruct and Represent Video Contents for Captioning via Reinforcement Learning

Article

Jun 2019

In this paper, the problem of describing visual contents of a video sequence with natural language is addressed. Unlike previous video captioning work mainly exploiting the cues of video contents to make a description, we propose a reconstruction network (RecNet) in a novel encoder-decoder-reconstructor architecture, which leverages both forward (v...

Reconstruct and Represent Video Contents for Captioning via Reinforcement Learning

Preprint

Jun 2019

In this paper, the problem of describing visual contents of a video sequence with natural language is addressed. Unlike previous video captioning work mainly exploiting the cues of video contents to make a language description, we propose a reconstruction network (RecNet) in a novel encoder-decoder-reconstructor architecture, which leverages both f...

Exact Adversarial Attack to Image Captioning via Structured Output Learning With Latent Variables

Conference Paper