Hang Dai's research while affiliated with University of Glasgow and other places

What is this page?


This page lists the scientific contributions of an author who either does not have a ResearchGate profile or has not yet added these contributions to their profile.

It was automatically generated by ResearchGate to provide a record of this author's body of work. We create such pages to advance our goal of creating and maintaining the most comprehensive scientific repository possible. In doing so, we process publicly available (personal) data relating to the author as a member of the scientific community.


Publications (54)


High-Resolution Iterative Feedback Network for Camouflaged Object Detection
  • Article · June 2023 · 15 Reads · 30 Citations
Proceedings of the AAAI Conference on Artificial Intelligence
Xiaobin Hu · Shuo Wang · [...] · Ling Shao

Spotting camouflaged objects that are visually assimilated into the background is difficult for both object detection algorithms and humans, who are often confused or deceived by the strong intrinsic similarities between the foreground objects and the background surroundings. To tackle this challenge, we aim to extract high-resolution texture details to avoid the detail degradation that causes blurred vision at edges and boundaries. We introduce a novel HitNet to refine low-resolution representations with high-resolution features in an iterative feedback manner, essentially a global loop-based connection among the multi-scale resolutions. To design a better feedback feature flow and avoid the feature corruption caused by the recurrent path, an iterative feedback strategy is proposed to impose more constraints on each feedback connection. Extensive experiments on four challenging datasets demonstrate that our HitNet breaks the performance bottleneck and achieves significant improvements over 29 state-of-the-art methods. In addition, to address the data scarcity in camouflaged scenarios, we provide an application example that converts salient objects to camouflaged objects, thereby generating more camouflaged training samples from the diverse salient object datasets. Code will be made publicly available.
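The iterative feedback pattern described in this abstract can be made concrete with a short sketch. The following is a minimal, hypothetical PyTorch illustration of the general idea, not the authors' HitNet code; the module name and the simple convolutional fusion are assumptions made for illustration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class IterativeFeedbackRefiner(nn.Module):
    """Hypothetical module: high-res features repeatedly refine low-res ones."""

    def __init__(self, channels: int, steps: int = 3):
        super().__init__()
        self.steps = steps
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1)

    def forward(self, low_res: torch.Tensor, high_res: torch.Tensor):
        feats, outputs = low_res, []
        for _ in range(self.steps):
            # upsample the current estimate to the high-res grid and fuse
            up = F.interpolate(feats, size=high_res.shape[-2:],
                               mode="bilinear", align_corners=False)
            fused = self.fuse(torch.cat([up, high_res], dim=1))
            # feed the refined features back as the next iteration's input
            feats = F.interpolate(fused, size=low_res.shape[-2:],
                                  mode="bilinear", align_corners=False)
            # keep per-step outputs so a loss can constrain each connection
            outputs.append(fused)
        return outputs
```

Retaining the per-iteration outputs mirrors the abstract's point about imposing a constraint (e.g., a loss term) on each feedback connection to avoid feature corruption along the recurrent path.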




Feature Shrinkage Pyramid for Camouflaged Object Detection with Transformers

March 2023 · 24 Reads

Vision transformers have recently shown strong global context modeling capabilities in camouflaged object detection. However, they suffer from two major limitations: less effective locality modeling and insufficient feature aggregation in decoders, which are not conducive to camouflaged object detection that must explore subtle cues from indistinguishable backgrounds. To address these issues, in this paper, we propose a novel transformer-based Feature Shrinkage Pyramid Network (FSPNet), which aims to hierarchically decode locality-enhanced neighboring transformer features through progressive shrinking for camouflaged object detection. Specifically, we propose a non-local token enhancement module (NL-TEM) that employs the non-local mechanism to let neighboring tokens interact and explores graph-based high-order relations within tokens to enhance the local representations of transformers. Moreover, we design a feature shrinkage decoder (FSD) with adjacent interaction modules (AIM), which progressively aggregates adjacent transformer features through a layer-by-layer shrinkage pyramid to accumulate as many imperceptible but effective cues as possible for object information decoding. Extensive quantitative and qualitative experiments demonstrate that the proposed model significantly outperforms 24 existing competitors on three challenging COD benchmark datasets under six widely-used evaluation metrics. Our code is publicly available at https://github.com/ZhouHuang23/FSPNet.
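As a rough illustration of the "shrinkage pyramid" decoding pattern described above, here is a minimal sketch, not the authors' FSPNet code; AdjacentMerge is a hypothetical stand-in for the paper's adjacent interaction module (AIM):

```python
import torch
import torch.nn as nn

class AdjacentMerge(nn.Module):
    """Hypothetical stand-in for the paper's adjacent interaction module."""

    def __init__(self, channels: int):
        super().__init__()
        self.merge = nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1)

    def forward(self, a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
        return self.merge(torch.cat([a, b], dim=1))

def shrinkage_decode(features, merge: AdjacentMerge) -> torch.Tensor:
    """Merge adjacent same-shape feature maps level by level until one remains."""
    while len(features) > 1:
        # N maps -> N-1 maps per level: each output mixes a pair of neighbours
        features = [merge(features[i], features[i + 1])
                    for i in range(len(features) - 1)]
    return features[0]
```

Each level turns N neighboring feature maps into N-1 merged maps, so cues from adjacent transformer layers are progressively accumulated into a single decoded representation.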


MSeg3D: Multi-modal 3D Semantic Segmentation for Autonomous Driving

March 2023 · 40 Reads

LiDAR and camera are two modalities available for 3D semantic segmentation in autonomous driving. Popular LiDAR-only methods severely suffer from inferior segmentation on small and distant objects due to insufficient laser points, while robust multi-modal solutions remain under-explored; we investigate three crucial inherent difficulties: modality heterogeneity, limited intersection of sensor fields of view, and multi-modal data augmentation. We propose a multi-modal 3D semantic segmentation model (MSeg3D) with joint intra-modal feature extraction and inter-modal feature fusion to mitigate the modality heterogeneity. The multi-modal fusion in MSeg3D consists of geometry-based feature fusion (GF-Phase), cross-modal feature completion, and semantic-based feature fusion (SF-Phase) on all visible points. The multi-modal data augmentation is reinvigorated by applying asymmetric transformations to the LiDAR point cloud and the multi-camera images individually, which benefits model training with diversified augmentation transformations. MSeg3D achieves state-of-the-art results on the nuScenes, Waymo, and SemanticKITTI datasets. Under malfunctioning multi-camera input and multi-frame point cloud input, MSeg3D still shows robustness and improves over the LiDAR-only baseline. Our code is publicly available at https://github.com/jialeli1/lidarseg3d.
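A minimal sketch of the asymmetric multi-modal augmentation idea, assuming simple rotation/scale and flip transforms (illustrative only; a real pipeline must also record the applied transforms so point-to-pixel correspondences can be recovered for fusion):

```python
import numpy as np

def augment_lidar(points: np.ndarray, angle: float, scale: float) -> np.ndarray:
    """Rotate the point cloud about z and scale it; points is (N, 3+)."""
    c, s = np.cos(angle), np.sin(angle)
    rot = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    out = points.copy()
    out[:, :3] = (out[:, :3] @ rot.T) * scale
    return out

def augment_image(image: np.ndarray, flip: bool) -> np.ndarray:
    """Independently flip an (H, W, C) camera image; no transform is shared
    with the LiDAR branch, which is what makes the augmentation asymmetric."""
    return image[:, ::-1].copy() if flip else image
```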


Fig. 1: Target scan (left) and template (right) morphed with Laplacian ICP. Correspondence sets are: (i) landmarks (red); (ii) right ear landmarks (cyan); (iii) left ear landmarks (cyan); (iv) symmetry contour (blue); and (v) all remaining vertices on mesh (grey surface).
Laplacian ICP for Progressive Registration of 3D Human Head Meshes
  • Preprint
  • File available
February 2023 · 47 Reads

We present a progressive 3D registration framework that is a highly-efficient variant of classical non-rigid Iterative Closest Points (N-ICP). Since it uses the Laplace-Beltrami operator for deformation regularisation, we view the overall process as Laplacian ICP (L-ICP). This exploits a 'small deformation per iteration' assumption and is progressively coarse-to-fine, employing an increasingly flexible deformation model, an increasing number of correspondence sets, and increasingly sophisticated correspondence estimation. Correspondence matching is only permitted within predefined vertex subsets derived from domain-specific feature extractors. Additionally, we present a new benchmark and a pair of evaluation metrics for 3D non-rigid registration, based on annotation transfer. We use this to evaluate our framework on a publicly-available dataset of 3D human head scans (Headspace). The method is robust and only requires a small fraction of the computation time compared to the most popular classical approach, yet has comparable registration performance.
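One Laplacian-regularised deformation step of the kind described above can be written compactly. The following is a minimal sketch, assuming a per-vertex displacement model and a precomputed mesh Laplacian L (illustrative; not the authors' L-ICP implementation):

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

def laplacian_step(L: sp.spmatrix, targets: np.ndarray, lam: float) -> np.ndarray:
    """Solve argmin_d ||d - targets||^2 + lam * ||L d||^2 per coordinate.

    L: (n, n) mesh Laplacian (e.g., uniform or cotangent weights).
    targets: (n, 3) closest-point displacements from the current
    correspondence estimate; returns (n, 3) displacements to apply.
    """
    n = targets.shape[0]
    A = (sp.eye(n) + lam * (L.T @ L)).tocsc()  # normal equations of the fit
    return np.column_stack([spla.spsolve(A, targets[:, k]) for k in range(3)])
```

If the regularisation weight is kept fixed across iterations, the system matrix stays constant while only the targets change, so its factorisation can be cached; this kind of reuse is one plausible source of efficiency for methods in this family.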


CTVSR: Collaborative Spatial-Temporal Transformer for Video Super-Resolution

January 2023 · 9 Reads
IEEE Transactions on Circuits and Systems for Video Technology

Video super-resolution (VSR) is important in video processing for reconstructing high-definition image sequences from corresponding continuous and highly-related video frames. However, existing VSR methods have limitations in fusing spatial-temporal information. Some methods only fuse spatial-temporal information on a limited range of total input sequences, while others adopt a recurrent strategy that gradually attenuates the spatial information. While recent advances in VSR utilize Transformer-based methods to improve the quality of the upscaled videos, these methods require significant computational resources to model the long-range dependencies, which dramatically increases the model complexity. To address these issues, we propose a Collaborative Transformer for Video Super-Resolution (CTVSR). The proposed method integrates the strengths of Transformer-based and recurrent-based models by concurrently assimilating the spatial information derived from multi-scale receptive fields and the temporal information acquired from temporal trajectories. In particular, we propose a Spatial Enhanced Network (SEN) with two key components: Token Dropout Attention (TDA) and Deformable Multi-head Cross Attention (DMCA). TDA focuses on the key regions to extract more informative features, and DMCA employs deformable cross attention to gather information from adjacent frames. Moreover, we introduce a Temporal-trajectory Enhanced Network (TEN) that computes the similarity of a given token with temporally-related tokens in the temporal trajectory, which is different from previous methods that evaluate all tokens within the temporal dimension. With comprehensive quantitative and qualitative experiments on four widely-used VSR benchmarks, the proposed CTVSR achieves competitive performance with relatively low computational consumption and high forward speed.
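The trajectory-restricted attention in TEN contrasts with attending over all tokens in the temporal dimension. A minimal sketch of the general idea (illustrative only; tokens are assumed to be pre-gathered along the trajectory, which is itself a non-trivial step):

```python
import torch
import torch.nn.functional as F

def trajectory_attention(query: torch.Tensor, traj: torch.Tensor) -> torch.Tensor:
    """Attend only over tokens gathered along a temporal trajectory.

    query: (B, C) token from the current frame.
    traj:  (B, T, C) temporally-related tokens along the trajectory.
    """
    scores = torch.einsum("bc,btc->bt", query, traj) / query.shape[-1] ** 0.5
    weights = F.softmax(scores, dim=-1)
    return torch.einsum("bt,btc->bc", weights, traj)
```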


Cross-Modality Knowledge Distillation Network for Monocular 3D Object Detection

November 2022 · 552 Reads

Leveraging LiDAR-based detectors or real LiDAR point data to guide monocular 3D detection has brought significant improvement, e.g., Pseudo-LiDAR methods. However, the existing methods usually apply non-end-to-end training strategies and insufficiently leverage the LiDAR information, so the rich potential of the LiDAR data has not been well exploited. In this paper, we propose the Cross-Modality Knowledge Distillation (CMKD) network for monocular 3D detection to efficiently and directly transfer knowledge from the LiDAR modality to the image modality on both features and responses. Moreover, we further extend CMKD as a semi-supervised training framework by distilling knowledge from large-scale unlabeled data, which significantly boosts the performance. As of submission, CMKD ranks 1st among the monocular 3D detectors with publications on both the KITTI test set and the Waymo val set, with significant performance gains compared to previous state-of-the-art methods.
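Distillation on "both features and responses" typically combines a feature-matching term with a soft-prediction term. A minimal sketch under that assumption (illustrative; not the authors' CMKD losses):

```python
import torch.nn.functional as F

def distillation_losses(student_feat, teacher_feat, student_logits, teacher_logits):
    """Feature-level and response-level distillation terms."""
    # feature-level: match intermediate features across the two modalities
    feat_loss = F.mse_loss(student_feat, teacher_feat)
    # response-level: match the teacher's soft predictions
    resp_loss = F.kl_div(F.log_softmax(student_logits, dim=1),
                         F.softmax(teacher_logits, dim=1),
                         reduction="batchmean")
    return feat_loss, resp_loss
```

Here the LiDAR-based teacher supplies the targets and the image-based student is trained against them; weighting and where the features are matched (e.g., in a shared BEV space) are design choices the sketch leaves open.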


Cross-Modality Knowledge Distillation Network for Monocular 3D Object Detection

November 2022 · 13 Reads · 39 Citations

Leveraging LiDAR-based detectors or real LiDAR point data to guide monocular 3D detection has brought significant improvement, e.g., Pseudo-LiDAR methods. However, the existing methods usually apply non-end-to-end training strategies and insufficiently leverage the LiDAR information, so the rich potential of the LiDAR data has not been well exploited. In this paper, we propose the Cross-Modality Knowledge Distillation (CMKD) network for monocular 3D detection to efficiently and directly transfer knowledge from the LiDAR modality to the image modality on both features and responses. Moreover, we further extend CMKD as a semi-supervised training framework by distilling knowledge from large-scale unlabeled data, which significantly boosts the performance. As of submission, CMKD ranks 1st among the monocular 3D detectors with publications on both the KITTI test set and the Waymo val set, with significant performance gains compared to previous state-of-the-art methods. Our code will be released at https://github.com/Cc-Hy/CMKD.


Highly Accurate Dichotomous Image Segmentation

November 2022 · 63 Reads · 44 Citations

We present a systematic study on a new task called dichotomous image segmentation (DIS), which aims to segment highly accurate objects from natural images. To this end, we collected the first large-scale DIS dataset, called DIS5K, which contains 5,470 high-resolution (e.g., 2K, 4K or larger) images covering camouflaged, salient, or meticulous objects in various backgrounds, annotated with extremely fine-grained labels. Besides, we introduce a simple intermediate supervision baseline (IS-Net) using both feature-level and mask-level guidance for DIS model training. IS-Net outperforms various cutting-edge baselines on the proposed DIS5K, making it a general self-learned supervision network that can facilitate future research in DIS. Further, we design a new metric called human correction efforts (HCE), which approximates the number of mouse-clicking operations required to correct the false positives and false negatives. HCE is utilized to measure the gap between models and real-world applications and thus can complement existing metrics. Finally, we conduct the largest-scale benchmark, evaluating 16 representative segmentation models, providing a more insightful discussion regarding object complexities, and showing several potential applications (e.g., background removal, art design, 3D reconstruction). We hope these efforts can open up promising directions for both academia and industry. Project page: https://xuebinqin.github.io/dis/index.html.
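As a very rough proxy for what HCE measures, one can count the connected false-positive and false-negative regions an annotator would have to correct; the paper's actual metric is considerably more elaborate. A minimal sketch, assuming boolean masks:

```python
import numpy as np
from scipy import ndimage

def hce_proxy(pred: np.ndarray, gt: np.ndarray) -> int:
    """Crude stand-in for HCE: count connected error regions in binary masks."""
    fp = pred & ~gt   # predicted foreground that should be background
    fn = gt & ~pred   # foreground the prediction missed
    _, n_fp = ndimage.label(fp)
    _, n_fn = ndimage.label(fn)
    return n_fp + n_fn  # each region is roughly one correction a human makes
```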


Citations (31)


... One major challenge in LiDAR segmentation is the integration of different sparse convolution backends [20,23,77,79,97], which are essential for efficiently processing the sparse and irregular nature of LiDAR point clouds [5,49,50,53]. Sparse convolutional networks have demonstrated significant improvements in performance and computational efficiency for 3D point cloud processing [78]. However, exploring and comparing these backends within a unified framework has been challenging due to the lack of standardized tools and benchmarks [21,58]. ...

Reference:

An Empirical Study of Training State-of-the-Art LiDAR Segmentation Models
MSeg3D: Multi-Modal 3D Semantic Segmentation for Autonomous Driving
  • Citing Conference Paper
  • June 2023

... MGL [65] incorporated edge details into the segmentation stream via two graph-based modules. More recently, vision transformer-based models like SINet-v2 [18], ZoomNet [41], and FSPNet [21] have shown strong global and local context modeling capabilities in camouflaged object detection. ...

Feature Shrinkage Pyramid for Camouflaged Object Detection with Transformers
  • Citing Conference Paper
  • June 2023

... With feedback connections, high-level features are rerouted to the low layer to refine low-level feature representations. The feedback mechanism has been widely employed in various 2D image vision tasks: some works [28][29][30] use the feedback mechanism in image super-resolution, Sam [31] and Feng [32] use it to enrich network features, and Chen [33] uses it in image deraining problems. In the 3D field, Su [34] and Yan [35] use it to complete the point cloud. ...

High-Resolution Iterative Feedback Network for Camouflaged Object Detection
  • Citing Article
  • June 2023

Proceedings of the AAAI Conference on Artificial Intelligence

... In addition to the annotation time cost, we also assess the reduction in labor achievable with SAM. Specifically, in line with common practice (see [17], [52]), we employ the Human Correction Efforts (HCE) criterion to estimate the annotation cost in terms of human effort required to annotate the given images from scratch or to refine the generated masks to reach the quality of ground-truths. While standard evaluation metrics attempt to characterize the semantic gap between ground-truth and predicted masks, Qin et al. [52] propose the HCE metric as a measure that reflects the actual human effort (for example, in terms of the number of mouse clicks) required to refine the masks to match ground-truth samples. ...

Highly Accurate Dichotomous Image Segmentation
  • Citing Chapter
  • November 2022

... The data can be accessed on the KITTI 3D official website. Inspired by other research works [32], [33], we incorporate the KITTI raw data in the training process and evaluate the trained model using the evaluation metrics. We evaluated the object detection capability of the proposed network on three classes: 'Car', 'Pedestrian', and 'Cyclist', using two evaluation metrics, namely AP_3D and AP_BEV. ...

Cross-Modality Knowledge Distillation Network for Monocular 3D Object Detection
  • Citing Chapter
  • November 2022

... However, there is no real-world or simulated adverse weather dataset for LiDAR semantic segmentation at present. Moreover, there exist preliminary attempts to investigate the robustness of fusion methods for 3D object detection (Bai et al., 2022; Li et al., 2022). Concretely, TransFusion (Bai et al., 2022) evaluates the robustness of different fusion strategies under several scenarios, e.g., daytime and nighttime; DeepFusion tests model robustness by adding noise to LiDAR reflections and camera pixels and proposes a robust benchmark for LiDAR-camera fusion, which analyzes seven cases of robustness scenarios. ...

Self-Distillation for Robust LiDAR Semantic Segmentation in Autonomous Driving
  • Citing Chapter
  • October 2022

... Researchers have exerted considerable effort to utilize monocular depth cues. This includes transforming inputs into pseudo-lidar point clouds [35,36,56] and explicitly incorporating depth into models [8,9,16]. Another significant research direction involves the explicit use of geometric priors, encompassing approaches like key-point constraints [6,20,33], shape projection relationships [22,34,40,64], and temporal depth estimation [54]. ...

Pseudo-Stereo for Monocular 3D Object Detection in Autonomous Driving
  • Citing Conference Paper
  • June 2022

... Key performance metrics for evaluation (Chen et al., 2023a;Wang et al., 2023a;Zhao et al., 2023) included Overall Accuracy (OA), User's Accuracy (UA), Producer's Accuracy (PA), and the F1-score (Table 1). While metrics such as the Intersection over Union (IoU) and pixel accuracy are indeed widely used in deep learning-based image segmentation methods (Chen et al., 2023c;Huang et al., 2022;Zhao et al., 2022), their direct relevance in the proposed framework is somewhat diminished, given our emphasis on pseudo-labels. Furthermore, choosing to avoid validation within the used dataset and instead opting for a wider evaluation using an extensive set of sample points presents a more genuine appraisal of the advocated methodology. ...

Scribble-based boundary-aware network for weakly supervised salient object detection in remote sensing images
  • Citing Article
  • September 2022

ISPRS Journal of Photogrammetry and Remote Sensing

... One of the challenges in using 3D photogrammetry is finding reliable and consistent landmarks across different images over time [74]. Landmarks with well-defined borders or edges showed higher degrees of reproducibility than those placed on gently curving slopes [75]. The difference between hard and soft tissue landmarks can influence the reliability of 3D photogrammetry. ...

Applications of 3D Photography in Craniofacial Surgery

Journal of Pediatric Neurosciences

... Then the Perspective-n-Point (PnP) algorithm is used to estimate the pose of the object (a minimal PnP call is sketched after this entry). To address the feature mismatch problem of current anchor-based monocular 3D detection methods, Luo et al. [73] proposed M3DSSD. First, the image is fed into the backbone network to generate the class and confidence of each anchor. ...

M3DSSD: Monocular 3D Single Stage Object Detector
  • Citing Conference Paper
  • June 2021
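The PnP step mentioned in the context above maps 3D-2D correspondences to an object pose; OpenCV's solvePnP is a standard implementation of it. A minimal sketch with placeholder correspondences and intrinsics (all values are illustrative only, not from any of the papers above):

```python
import numpy as np
import cv2

# 3D points in the object frame and their 2D detections in pixels;
# these correspondences and the intrinsics K are placeholders.
object_pts = np.array([[0, 0, 0], [1, 0, 0], [0, 1, 0], [1, 1, 0],
                       [0, 0, 1], [1, 0, 1]], dtype=np.float64)
image_pts = np.array([[320, 240], [410, 235], [325, 150], [415, 145],
                      [300, 230], [390, 225]], dtype=np.float64)
K = np.array([[700.0, 0.0, 320.0],
              [0.0, 700.0, 240.0],
              [0.0, 0.0, 1.0]])

ok, rvec, tvec = cv2.solvePnP(object_pts, image_pts, K, distCoeffs=None)
if ok:
    R, _ = cv2.Rodrigues(rvec)  # rotation matrix of the estimated pose
    print("rotation:\n", R, "\ntranslation:\n", tvec.ravel())
```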