Zhenfang Chen
The University of Hong Kong | HKU · Department of Computer Science

About

Publications

3,435 Reads

523 Citations

Skills and Expertise

Machine Learning

Feature Extraction

Publications

STAR: A Benchmark for Situated Reasoning in Real-World Videos

Preprint

May 2024

Reasoning in the real world is not divorced from situations. How to capture the present knowledge from surrounding situations and perform reasoning accordingly is crucial and challenging for machine intelligence. This paper introduces a new benchmark that evaluates the situated reasoning ability via situation abstraction and logic-grounded question...

SOK-Bench: A Situated Video Reasoning Benchmark with Aligned Open-World Knowledge

Preprint

May 2024

Learning commonsense reasoning from visual contexts and scenes in the real world is a crucial step toward advanced artificial intelligence. However, existing video reasoning benchmarks are still inadequate since they were mainly designed for factual or situated reasoning and rarely involve broader knowledge in the real world. Our work aims to delve dee...

Visual Chain-of-Thought Prompting for Knowledge-Based Visual Reasoning

Article

Mar 2024

Knowledge-based visual reasoning remains a daunting task since it not only requires machines to interpret the concepts and relationships from visual scenes but also associate them with external world knowledge to conduct a chain of reasoning on open-world questions. Previous works, however, treat visual perception and language-based reasoning as tw...

TextPSG: Panoptic Scene Graph Generation from Textual Descriptions

Conference Paper

Oct 2023

Evaluating physical scene understanding with objects consisting of different physical attributes in humans and machines

Article

Aug 2023

3D-LLM: Injecting the 3D World into Large Language Models

Preprint

Jul 2023

Large language models (LLMs) and Vision-Language Models (VLMs) have been proven to excel at multiple tasks, such as commonsense reasoning. Powerful as these models can be, they are not grounded in the 3D physical world, which involves richer concepts such as spatial relationships, affordances, physics, layout, and so on. In this work, we propose to...

Physion++: Evaluating Physical Scene Understanding that Requires Online Inference of Different Physical Properties

Preprint

Jun 2023

General physical scene understanding requires more than simply localizing and recognizing objects -- it requires knowledge that objects can have different latent properties (e.g., mass or elasticity), and that those properties affect the outcome of physical events. While there has been great progress in physical and video prediction models in recen...

Figure 1: The architecture of ModuleFormer. The sparse activation...

Continual Joint Pre-Training Result (accuracy↑).

ModuleFormer: Learning Modular Large Language Models From Uncurated Data

Preprint

Full-text available

Jun 2023

Large Language Models (LLMs) have achieved remarkable results. But existing models are expensive to train and deploy, and it is also difficult to expand their knowledge beyond pre-training data without forgetting previous knowledge. This paper proposes a new neural network architecture, ModuleFormer, that leverages modularity to improve the efficie...

Visual Dependency Transformers: Dependency Tree Emerges from Reversed Attention

Conference Paper

Jun 2023

3D Concept Learning and Reasoning from Multi-View Images

Conference Paper

Jun 2023

Mod-Squad: Designing Mixtures of Experts As Modular Multi-Task Learners

Conference Paper

Jun 2023

Figure 6: Response comparison on Vicuna benchmark questions: assessed...

Figure 7: Relative response quality on Vicuna benchmark questions:...

Figure 8: Principle usage statistics in our Self-Instruct dataset.

Figure 9: Principle usage statistics in our TGRT Self-Instruct dataset.

TruthfulQA generation task. We report the fraction of truthful and...

Principle-Driven Self-Alignment of Language Models from Scratch with Minimal Human Supervision

Preprint

Full-text available

May 2023

Recent AI-assistant agents, such as ChatGPT, predominantly rely on supervised fine-tuning (SFT) with human annotations and reinforcement learning from human feedback (RLHF) to align the output of large language models (LLMs) with human intentions, ensuring they are helpful, ethical, and reliable. However, this dependence can significantly constrain...

Embodied Concept Learner: Self-supervised Learning of Concepts and Mapping through Instruction Following

Preprint

Apr 2023

Humans, even at a very early age, can learn visual concepts and understand geometry and layout through active interaction with the environment, and generalize their compositions to complete tasks described by natural languages in novel scenes. To mimic such capability, we propose Embodied Concept Learner (ECL) in an interactive 3D environment. Spec...

Visual Dependency Transformers: Dependency Tree Emerges from Reversed Attention

Preprint

Apr 2023

Humans possess a versatile mechanism for extracting structured representations of our visual world. When looking at an image, we can decompose the scene into entities and their parts as well as obtain the dependencies between them. To mimic such capability, we propose Visual Dependency Transformers (DependencyViT) that can induce visual dependencie...

Figure 1. An exemplar scene with multi-view images and question-answer...

Figure 2. An overview of our 3D-CLR framework. First, we learn a 3D...

Figure 3. Qualitative examples of generalizing to Replica dataset.

Figure 4. More Qualitative Examples on 3DMV-VQA.

Figure 7. Qualitative examples of generalizing to Replica dataset.

3D Concept Learning and Reasoning from Multi-View Images

Preprint

Full-text available

Mar 2023

Humans are able to accurately reason in 3D by gathering multi-view observations of the surrounding world. Inspired by this insight, we introduce a new large-scale benchmark for 3D multi-view visual question answering (3DMV-VQA). This dataset is collected by an embodied agent actively moving and capturing RGB images in an environment using the Habit...

Planning with Large Language Models for Code Generation

Preprint

Mar 2023

Existing large language model-based code generation pipelines typically use beam search or sampling algorithms during the decoding process. Although the programs they generate achieve high token-matching-based scores, they often fail to compile or generate incorrect outputs. The main reason is that conventional Transformer decoding algorithms may n...

Deep Face Video Inpainting via UV Mapping

Article

Full-text available

Feb 2023

This paper addresses the problem of face video inpainting. Existing video inpainting methods primarily target natural scenes with repetitive patterns. They do not make use of any prior knowledge of the face to help retrieve correspondences for the corrupted face. They therefore only achieve sub-optimal results, particularly for faces under large...

Figure 1. The human process to handle knowledge-based visual reasoning....

Figure 7. Exemplar prompting templates of the PICa [64] baseline....

Figure 8. Exemplar prompting templates of the CoT [60] baseline....

Rationale performance comparison of our model and CoT baseline on...

See, Think, Confirm: Interactive Prompting Between Vision and Language Models for Knowledge-based Visual Reasoning

Preprint

Full-text available

Jan 2023

Large pre-trained vision and language models have demonstrated remarkable capacities for various tasks. However, solving knowledge-based visual reasoning tasks remains challenging, as it requires a model to comprehensively understand image content, connect external world knowledge, and perform step-by-step reasoning to answer the questions...

Sparse Universal Transformer

Conference Paper

Jan 2023

Quantitative Results on the PASCAL-Context. Mod-Squad constantly...

Mod-Squad: Designing Mixture of Experts As Modular Multi-Task Learners

Preprint

Full-text available

Dec 2022

Optimization in multi-task learning (MTL) is more challenging than single-task learning (STL), as the gradient from different tasks can be contradictory. When tasks are related, it can be beneficial to share some parameters among them (cooperation). However, some tasks require additional parameters with expertise in a specific type of data or discr...

S^3-NeRF: Neural Reflectance Field from Shading and Shadow under a Single Viewpoint

Conference Paper

Full-text available

Nov 2022

In this paper, we address the "dual problem" of multi-view scene reconstruction in which we utilize single-view images captured under different point lights to learn a neural scene representation. Different from existing single-view methods which can only recover a 2.5D scene representation (i.e., a normal / depth map for the visible surface), our...

PS-NeRF: Neural Inverse Rendering for Multi-view Photometric Stereo

Conference Paper

Full-text available

Oct 2022

Traditional multi-view photometric stereo (MVPS) methods are often composed of multiple disjoint stages, resulting in noticeable accumulated errors. In this paper, we present a neural inverse rendering method for MVPS based on implicit representation. Given multi-view images of a non-Lambertian object illuminated by multiple unknown directional lig...

S$^3$-NeRF: Neural Reflectance Field from Shading and Shadow under a Single Viewpoint

Preprint

Full-text available

Oct 2022

A Unified Framework for Masked and Mask-Free Face Recognition Via Feature Rectification

Conference Paper

Full-text available

Oct 2022

PS-NeRF: Neural Inverse Rendering for Multi-view Photometric Stereo

Preprint

Full-text available

Jul 2022

Figure 3: The perception module detects objects' location and visual...

Figure 4: Generalization of physical reasoning.

ComPhy: Compositional Physical Reasoning of Objects and Events from Videos

Preprint

Full-text available

May 2022

Objects' motions in nature are governed by complex interactions and their properties. While some properties, such as shape and material, can be identified via the object's visual appearances, others like mass and electric charge are not directly visible. The compositionality between the visible and hidden properties poses unique challenges for AI m...

A Unified Framework for Masked and Mask-Free Face Recognition via Feature Rectification

Preprint

Full-text available

Feb 2022

Face recognition under ideal conditions is now considered a well-solved problem with advances in deep learning. Recognizing faces under occlusion, however, remains a challenge. Existing techniques often fail to recognize faces with both the mouth and nose covered by a mask, which is now very common during the COVID-19 pandemic. Common approach...

STAR: A Benchmark for Situated Reasoning in Real-World Videos

Conference Paper

Full-text available

Dec 2021

Dynamic Visual Reasoning by Learning Differentiable Physics Models from Video and Language

Conference Paper

Dec 2021

In this work, we propose a unified framework, called Visual Reasoning with Differentiable Physics (VRDP), that can jointly learn visual concepts and infer physics models of objects and their interactions from videos and language. This is achieved by three seamlessly integrated parts, including a visual perception module, a concept learner, and a di...

Figure 2: Comparisons of the data efficiency evaluation on four types...

Figure 3: VRDP learns new concepts and accurately reasons about...

Figure 6: An illustration of the reasoning process of the program...

Dynamic Visual Reasoning by Learning Differentiable Physics Models from Video and Language

Preprint

Full-text available

Oct 2021

In this work, we propose a unified framework, called Visual Reasoning with Differentiable Physics (VRDP), that can jointly learn visual concepts and infer physics models of objects and their interactions from videos and language. This is achieved by seamlessly integrating three components: a visual perception module, a concept learner, and a diffe...

Deep Face Video Inpainting via UV Mapping

Preprint

Full-text available

Sep 2021

The Blessings of Unlabeled Background in Untrimmed Videos

Conference Paper

Jun 2021

Grounding Physical Concepts of Objects and Events Through Dynamic Visual Reasoning

Conference Paper

Full-text available

May 2021

We study the problem of dynamic visual reasoning on raw videos. This is a challenging problem; currently, state-of-the-art models often require dense supervision on physical object properties and events from simulation, which are impractical to obtain in real life. In this paper, we present the Dynamic Concept Learner (DCL), a unified framework tha...

Figure 1: The process to handle visual reasoning in dynamic scenes. The...

Figure 5: Typical examples of CLEVRER-Grounding datasets. The target...

Figure 6: An exemplar query expression and 4 of its associated positive...

Evaluation of video grounding. For spatial grounding, we consider it to...

Evaluation of concept learning on the block tower dataset. Our DCL can...

Grounding Physical Concepts of Objects and Events Through Dynamic Visual Reasoning

Preprint

Full-text available

Mar 2021

The Blessings of Unlabeled Background in Untrimmed Videos

Preprint

Full-text available

Mar 2021

Weakly-supervised Temporal Action Localization (WTAL) aims to detect the intervals of action instances with only video-level action labels available during training. The key challenge is how to distinguish the segments of interest from the background segments, which are unlabelled even at the video level. While previous works treat the background a...

Cops-Ref: A New Dataset and Task on Compositional Referring Expression Comprehension

Conference Paper

Full-text available

Jun 2020

Referring expression comprehension (REF) aims at identifying a particular object in a scene by a natural language expression. It requires joint reasoning over the textual and visual domains to solve the problem. Some popular referring expression datasets, however, fail to provide an ideal test bed for evaluating the reasoning ability of the models,...

Cops-Ref: A new Dataset and Task on Compositional Referring Expression Comprehension

Preprint

Full-text available

Feb 2020

Look Closer to Ground Better: Weakly-Supervised Temporal Grounding of Sentence in Video

Preprint

Full-text available

Jan 2020

In this paper, we study the problem of weakly-supervised temporal grounding of sentence in video. Specifically, given an untrimmed video and a query sentence, our goal is to localize a temporal segment in the video that semantically corresponds to the query sentence, with no reliance on any temporal annotation during training. We propose a two-stag...

Learning Local Similarity with Spatial Relations for Object Retrieval

Conference Paper

Full-text available

Oct 2019

Many state-of-the-art object retrieval algorithms aggregate activations of convolutional neural networks into a holistic compact feature, and utilize global similarity for an efficient nearest neighbor search. However, holistic features are often insufficient for representing small objects of interest in gallery images, and global similarity drops...

Weakly-Supervised Spatio-Temporally Grounding Natural Sentence in Video

Preprint

Full-text available

Jun 2019

In this paper, we address a novel task, namely weakly-supervised spatio-temporally grounding natural sentence in video. Specifically, given a natural sentence and a video, we localize a spatio-temporal tube in the video that semantically corresponds to the given sentence, with no reliance on any spatio-temporal annotations during training. First, a...

Weakly-Supervised Spatio-Temporally Grounding Natural Sentence in Video

Conference Paper

Full-text available

Jan 2019

Boosting up Scene Text Detectors with Guided CNN

Preprint

Full-text available

May 2018

Deep CNNs have achieved great success in text detection. Most existing methods attempt to improve accuracy with sophisticated network design, while paying less attention to speed. In this paper, we propose a general framework for text detection called Guided CNN to achieve the two goals simultaneously. The proposed model consists of one guidance...

Aggregated Deep Feature from Activation Clusters for Particular Object Retrieval

Conference Paper

Full-text available

Oct 2017

This paper introduces a clustering based deep feature for particular object retrieval. Many object retrieval algorithms focus on aggregating local features into compact image representations. Recently proposed algorithms, such as R-MAC and its variants, aggregate maximum activations of convolutions from rectangular regions of multiple scales and ha...