Conference Paper

The MIR Flickr retrieval evaluation

Authors: Mark J. Huiskes, Michael S. Lew

Abstract

In most well known image retrieval test sets, the imagery typically cannot be freely distributed or is not representative of a large community of users. In this paper we present a collection for the MIR community comprising 25000 images from the Flickr website which are redistributable for research purposes and represent a real community of users both in the image content and image tags. We have extracted the tags and EXIF image metadata, and also make all of these publicly available. In addition we discuss several challenges for benchmarking retrieval and classification methods.
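A minimal sketch of how the collection might be iterated in code is given below; the directory layout (im<N>.jpg plus per-image tag and EXIF text files) is an assumption about the downloadable archive rather than a documented API, so adjust the paths to your copy.

```python
from pathlib import Path

# Hypothetical layout of an extracted MIR Flickr 25000 archive; the exact
# folder and file names depend on the distribution you download.
ROOT = Path("mirflickr25k")

def load_item(i: int):
    """Return (image path, tag list, raw EXIF text) for image number i (1..25000)."""
    image = ROOT / f"im{i}.jpg"
    tags_file = ROOT / "tags" / f"tags{i}.txt"
    exif_file = ROOT / "exif" / f"exif{i}.txt"
    tags = tags_file.read_text(encoding="utf-8").split() if tags_file.exists() else []
    exif = exif_file.read_text(encoding="utf-8", errors="ignore") if exif_file.exists() else ""
    return image, tags, exif

if __name__ == "__main__":
    image_path, tags, exif = load_item(1)
    print(image_path, tags[:5], len(exif))
```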


... Three widely used multi-modal datasets are employed in these experiments: MIR-Flickr 25K [13], NUS-WIDE [7], and IAPR TC-12 [9]. MIRFlickr25K consists of 25,000 image-text pairs annotated with 24 labels. ...
... The datasets analyzed during the current study are available from references [7,9,13], respectively. ...
Article
Full-text available
The need for cross-modal retrieval increases significantly with the rapid growth of multimedia information on the Internet. However, most existing cross-modal retrieval methods neglect the correlation between label similarity and intra-modality similarity in common semantic subspace training, which makes the trained common semantic subspace unable to preserve the semantic similarity of the original data effectively. Therefore, a novel cross-modal hashing method is proposed in this paper, namely, Deep Supervised Fused Similarity Hashing (DSFSH). The DSFSH mainly consists of two parts. Firstly, a fused similarity method is proposed to exploit the intrinsic inter-modality correlation of data while preserving the intra-modality relationship of data at the same time. Secondly, a novel quantization max-margin loss is proposed. The gap between cosine similarity and Hamming similarity is closed by minimizing this loss. Extensive experimental results on three benchmark datasets show that the proposed method yields better retrieval performance compared to state-of-the-art methods.
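The "gap between cosine similarity and Hamming similarity" that DSFSH minimizes rests on a simple identity for ±1 hash codes: for K-bit codes, d_H(b_i, b_j) = (K - b_i·b_j)/2 and cos(b_i, b_j) = b_i·b_j/K, so the two notions rank items identically once the codes are exactly binary. A small numerical check (not the authors' code) is sketched below.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 64
b1, b2 = rng.choice([-1, 1], size=(2, K))   # two K-bit hash codes in {-1, +1}

inner = int(b1 @ b2)
hamming = int(np.sum(b1 != b2))             # Hamming distance
cosine = inner / K                          # cosine similarity of +-1 vectors (norm = sqrt(K))

# Identity for +-1 codes: d_H = (K - <b1, b2>) / 2, hence cos = 1 - 2 * d_H / K
assert hamming == (K - inner) // 2
assert np.isclose(cosine, 1 - 2 * hamming / K)
print(hamming, round(cosine, 3))
```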
... This paper conducts comparative experiments on three general image and text datasets, including Wikipedia [38], MIR-Flickr-25K [39] and NUS-WIDE [40], to verify the effectiveness of UMSP in cross-modal retrieval. The Wikipedia dataset [38] consists of 2866 image-text pairs selected from Wikipedia and annotated into ten classes. ...
... In the experiments, this dataset is divided into 693 test pairs and 2173 training pairs. The MIRFlickr-25K dataset [39] consists of 25,000 image-text pairs collected from the Flickr website. In this paper, data pairs with at least 20 text labels are selected for experiments and each text is represented as a 1386-dimensional bag-of-words vector. ...
Article
Full-text available
Deep hashing cross-modal image-text retrieval has the advantage of low storage cost and high retrieval efficiency by mapping different modal data into a Hamming space. However, existing unsupervised deep hashing methods generally rely on the intrinsic similarity information of each modality for structural matching, failing to fully consider the heterogeneous characteristics and semantic gaps of different modalities, which results in the loss of latent semantic correlation and co-occurrence information between the different modalities. To address this problem, this paper proposes an unsupervised deep hashing with multiple similarity preservation (UMSP) method for cross-modal image-text retrieval. First, to enhance the representation ability of the deep features of each modality, a modality-specific image-text feature extraction module is designed. Specifically, the image network with parallel structure and the text network are constructed with the vision-language pre-training image encoder and multi-layer perceptron to capture the deep semantic information of each modality and learn a common hash code representation space. Then, to bridge the heterogeneous gap and improve the discriminability of hash codes, a multiple similarity preservation module is built based on three perspectives: joint modal space, cross-modal hash space and image modal space, which aids the network to preserve the semantic similarity of modalities. Experimental results on three benchmark datasets (Wikipedia, MIRFlickr-25K and NUS-WIDE) show that UMSP outperforms other unsupervised methods for cross-modal image-text retrieval.
... We assess the effectiveness of the proposed DMMH method in multimedia retrieval tasks. We use three public datasets: MIR-Flickr25K [1], NUS-WIDE [2], and MS COCO [3]. We use mean Average Precision (mAP) as the analysis metric. ...
Preprint
Inspired by the excellent performance of Mamba networks, we propose a novel Deep Mamba Multi-modal Learning (DMML). It can be used to achieve the fusion of multi-modal features. We apply DMML to the field of multimedia retrieval and propose an innovative Deep Mamba Multi-modal Hashing (DMMH) method. It combines the advantages of algorithm accuracy and inference speed. We validated the effectiveness of DMMH on three public datasets and achieved state-of-the-art results.
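Mean Average Precision (mAP), the metric quoted in the citing context above and in most of the hashing papers on this page, can be computed along the following lines; this is a generic label-based formulation over a full ranking, not code from DMMH.

```python
import numpy as np

def mean_average_precision(scores: np.ndarray, relevant: np.ndarray) -> float:
    """scores: (num_queries, num_database) similarity scores;
    relevant: boolean array of the same shape, True where the database item
    shares at least one label with the query."""
    aps = []
    for s, rel in zip(scores, relevant):
        order = np.argsort(-s)                # rank database items by descending score
        rel = rel[order]
        if not rel.any():
            continue                          # queries with no relevant items are skipped
        hits = np.cumsum(rel)
        ranks = np.arange(1, len(rel) + 1)
        aps.append(float((hits[rel] / ranks[rel]).mean()))
    return float(np.mean(aps))

# Toy example: two queries over four database items.
scores = np.array([[0.9, 0.1, 0.8, 0.3],
                   [0.2, 0.7, 0.1, 0.6]])
relevant = np.array([[True, False, True, False],
                     [False, True, False, False]])
print(mean_average_precision(scores, relevant))   # 1.0: relevant items are ranked first
```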
... In this section we perform extensive experiments. Specifically, we choose three benchmark datasets in Section 4.0.1 and adopt the most popular evaluation metrics. MIRFlickr-25K [34] is collected from a social photography site and contains 25,000 images and corresponding text descriptions. There are a total of 24 categories in this dataset, and each image or text will be assigned one or more of them. ...
Preprint
Full-text available
Hashing cross-modal retrieval methods aim to retrieve different modalities and learn common semantics with low storage and time cost. Although many excellent hashing methods have been proposed in the past decades, there are still some issues. For example, most methods focus on the Euclidean domain, ignoring the graph-structure information contained in data points, so outliers and noise in the Euclidean domain will cause a drop in accuracy. Some methods only learn a latent subspace, which may be unreasonable because the dimensionality of the modalities is not the same as the distribution. To address these issues, we propose a hashing technique called Two Stage Graph Hashing (TSGH). In the first stage, we first learn a specific latent subspace for each modality using Collective Matrix Decomposition and the proposed Graph Convolutional Network (GCN). Therefore, the learned subspace contains the features of Euclidean and non-Euclidean domains, which can eliminate the influence of noise and outliers in the dataset. And then, Global Approximation is used to align the subspaces of the different modalities, so that high-level shared semantics can be explored. Finally, discrete hash codes are learned from the latent subspace and their semantic similarity. In the second stage, we design a linear classifier as the hash function and propose Local Similarity Preservation to capture the local relationship between hash codes and Hamming spaces. To verify the effectiveness of TSGH, we conduct extensive experiments on three public datasets. We achieve the best results compared to previous SOTA methods, illustrating the superiority of TSGH.
... In our experimental evaluation, we employed two widely used benchmark datasets. The MIRFlickr [63] dataset comprises 25,000 instances distributed across 24 categories. Following the preprocessing steps in [64], which remove instances with tags appearing fewer than 20 times, 20,015 image-text pairs are left. ...
Preprint
In the real world, multi-modal data often appears in a streaming fashion, and there is a growing demand for similarity retrieval from such non-stationary data, especially at a large scale. In response to this need, online multi-modal hashing has gained significant attention. However, existing online multi-modal hashing methods face challenges related to the inconsistency of hash codes during long-term learning and inefficient fusion of different modalities. In this paper, we present a novel approach to supervised online multi-modal hashing, called High-level Codes, Fine-grained Weights (HCFW). To address these problems, HCFW is designed by its non-trivial contributions from two primary dimensions: 1) Online Hashing Perspective. To ensure the long-term consistency of hash codes, especially in incremental learning scenarios, HCFW learns high-level codes derived from category-level semantics. Besides, these codes are adept at handling the category-incremental challenge. 2) Multi-modal Hashing Aspect. HCFW introduces the concept of fine-grained weights designed to facilitate the seamless fusion of complementary multi-modal data, thereby generating multi-modal weights at the instance level and enhancing the overall hashing performance. A comprehensive battery of experiments conducted on two benchmark datasets convincingly underscores the effectiveness and efficiency of HCFW.
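The 25,000 to 20,015 preprocessing mentioned in the citing context above (discarding tags that occur fewer than 20 times, then dropping pairs left without text or labels) can be reproduced roughly as follows; the field names and threshold are assumptions about that protocol, not code from the cited papers.

```python
from collections import Counter

def filter_pairs(pairs, min_tag_freq=20):
    """pairs: list of dicts like {"image": ..., "tags": [...], "labels": [...]} (assumed schema).
    Keep only tags occurring at least `min_tag_freq` times in the whole collection,
    then drop pairs that end up with no tags or have no labels."""
    counts = Counter(tag for p in pairs for tag in p["tags"])
    kept = []
    for p in pairs:
        tags = [t for t in p["tags"] if counts[t] >= min_tag_freq]
        if tags and p["labels"]:
            kept.append({**p, "tags": tags})
    return kept
```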
... Data preprocessing as described in [15,19,29] involves removing data pairs without labels or text, reducing the impact of data imbalance on the model and helping to improve the model's generalization ability. MIRFlickr-25K [37] consists of 25,000 image-text pairs collected from the Flickr website. After removing samples with empty labels or tags, we select 20,015 image-text pairs, where each instance belongs to at least one of 24 categories. ...
Article
Full-text available
Cross-modal hashing has attracted widespread attention due to its ability to reduce the complexity of storage and retrieval. However, many existing methods use a symbolic function to map hash codes, which leads to a loss of semantic information when mapping the original features to a low-dimensional space and consequently decreases retrieval accuracy. To address these challenges, we propose a cross-modal hashing method called Multi-Label Semantic Sharing based on Graph Convolutional Network for Image-to-Text Retrieval (MLSS). Specifically, we employ dual transformers to encode multimodal data and utilize CNN to assist in extracting local information from images, thereby enhancing the matching capability between images and text. Additionally, we design a multi-label semantic sharing module based on a graph convolutional network, which learns a unified multi-label classifier and establishes a semantic bridge between the feature representation space and the hashing space for images and text. By leveraging multi-label semantic information to guide feature and hash learning, MLSS generates hash codes that preserve semantic similarity information, leading to a significant improvement in the performance of image-to-text retrieval. Our experiments on three benchmark datasets demonstrate that MLSS outperforms several state-of-the-art cross-modal retrieval methods. Our code can be found at https://github.com/My1new/MLSS.
... For the experiments, we used logits deriving from the following deep learning networks: GoogleNet [7] trained on Places365 classifications [8]; SqueezeNet [9] and AlexNet [10] trained on ImageNet classifications [11]; and DinoV2 [12] outputs. In all cases, we derived logits from the first 10,000 images of the MirFlickr one million image set [13]. These data are summarised in Table 2. ...
Article
Full-text available
Cross-entropy loss is crucial in training many deep neural networks. In this context, we show a number of novel and strong correlations among various related divergence functions. In particular, we demonstrate that, in some circumstances, (a) cross-entropy is almost perfectly correlated with the little-known triangular divergence, and (b) cross-entropy is strongly correlated with the Euclidean distance over the logits from which the softmax is derived. The consequences of these observations are as follows. First, triangular divergence may be used as a cheaper alternative to cross-entropy. Second, logits can be used as features in a Euclidean space which is strongly synergistic with the classification process. This justifies the use of Euclidean distance over logits as a measure of similarity, in cases where the network is trained using softmax and cross-entropy. We establish these correlations via empirical observation, supported by a mathematical explanation encompassing a number of strongly related divergence functions.
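The quantities compared in this article are easy to reproduce on toy data: the sketch below computes cross-entropy, the triangular divergence Σ(p-q)²/(p+q), and the Euclidean distance over logits for random softmax pairs, so the reported correlations can be probed empirically. It illustrates the definitions only, not the authors' experimental setup.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Random pairs of logit vectors standing in for two networks' outputs.
a, b = rng.normal(size=(2, 5000, 10))
p, q = softmax(a), softmax(b)

cross_entropy = -(p * np.log(q)).sum(axis=-1)
triangular = ((p - q) ** 2 / (p + q)).sum(axis=-1)   # triangular divergence
euclid_logits = np.linalg.norm(a - b, axis=-1)       # Euclidean distance over logits

print(np.corrcoef(cross_entropy, triangular)[0, 1])
print(np.corrcoef(cross_entropy, euclid_logits)[0, 1])
```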
... dataset's different categories are shown in Fig. 7a-c. Furthermore, we separately train our network using the dataset named MIRFLICKR [39], which consists of 25,000 images, in order to assess the performance of the proposed data hiding network in comparison with the existing data hiding schemes. We create 100 QR code images for the similarity measurement network. ...
Article
Full-text available
Traceability via quick response (QR) codes is regarded as a clever way to learn specifics about a product’s history, from its creation to its transit and preservation before reaching consumers. The QR code can, however, be easily copied and faked. Therefore, we suggest a novel strategy to prevent tampering with this code. The method is divided into two primary phases: concealing a security element in the QR code and determining how similar the QR code on the goods is to the real ones. For the first problem, error-correcting coding is used to encode and decode the secret feature in order to manage faults in noisy communication channels. A deep neural network is used to both conceal and extract the information encoded in a QR code, and the suggested network creates watermarked QR code images with good quality and noise tolerance. The network has the ability to be resilient to actual distortions brought on by the printing and photographing processes. In order to measure the similarity of QR codes, we create neural networks based on the Siamese network design. To assess whether a QR code is real or fraudulent, the hidden characteristic extracted from the acquired QR code and the outcome of QR code similarity estimation are merged. With an average accuracy of 98%, the proposed technique performs competitively and has been used in practice for QR code authentication.
... Two widely adopted cross-modal retrieval datasets are utilized for model evaluation: 1) The MIRFlickr dataset [5], which encompasses a compilation of 25,000 correspondences between images and texts, all procured from the Flickr service. Every correspondence is linked to a collection of tags, originating from a pool of 24 unique categories. ...
Preprint
Known for efficient computation and easy storage, hashing has been extensively explored in cross-modal retrieval. The majority of current hashing models are predicated on the premise of a direct one-to-one mapping between data points. However, in real practice, data correspondence across modalities may be partially provided. In this research, we introduce an innovative unsupervised hashing technique designed for semi-paired cross-modal retrieval tasks, named Reconstruction Relations Embedded Hashing (RREH). RREH assumes that multi-modal data share a common subspace. For paired data, RREH explores the latent consistent information of heterogeneous modalities by seeking a shared representation. For unpaired data, to effectively capture the latent discriminative features, the high-order relationships between unpaired data and anchors are embedded into the latent subspace, which are computed by efficient linear reconstruction. The anchors are sampled from paired data, which improves the efficiency of hash learning. RREH trains the underlying features and the binary encodings in a unified framework with high-order reconstruction relations preserved. With the well-devised objective function and discrete optimization algorithm, RREH is designed to be scalable, making it suitable for large-scale datasets and facilitating efficient cross-modal retrieval. In the evaluation process, the proposed method is tested with partially paired data to establish its superiority over several existing methods.
... MIRFlickr [18] consists of 25K images collected from Flickr, each of which has multiple class labels. These labels are utilized by the VLP teacher as well as for evaluation. ...
Preprint
Full-text available
"Learning to hash" is a practical solution for efficient retrieval, offering fast search speed and low storage cost. It is widely applied in various applications, such as image-text cross-modal search. In this paper, we explore the potential of enhancing the performance of learning to hash with the proliferation of powerful large pre-trained models, such as Vision-Language Pre-training (VLP) models. We introduce a novel method named Distillation for Cross-Modal Quantization (DCMQ), which leverages the rich semantic knowledge of VLP models to improve hash representation learning. Specifically, we use the VLP as a "teacher" to distill knowledge into a "student" hashing model equipped with codebooks. This process involves the replacement of supervised labels, which are composed of multi-hot vectors and lack semantics, with the rich semantics of VLP. In the end, we apply a transformation termed Normalization with Paired Consistency (NPC) to achieve a discriminative target for distillation. Further, we introduce a new quantization method, Product Quantization with Gumbel (PQG), that promotes balanced codebook learning, thereby improving the retrieval performance. Extensive benchmark testing demonstrates that DCMQ consistently outperforms existing supervised cross-modal hashing approaches, showcasing its significant potential.
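DCMQ builds on product quantization; for readers unfamiliar with it, a plain (non-Gumbel) PQ encoder is sketched below with randomly initialised codebooks standing in for learned ones. This is background for the technique, not the paper's PQG method.

```python
import numpy as np

def pq_encode(x: np.ndarray, codebooks: np.ndarray) -> np.ndarray:
    """Plain product quantization (background only, not the paper's PQG).
    x: (n, d) vectors; codebooks: (m, k, d // m) -- m sub-spaces, k codewords each.
    Returns (n, m) integer codes."""
    n, d = x.shape
    m, k, sub = codebooks.shape
    assert d == m * sub
    parts = x.reshape(n, m, sub)                       # split each vector into m chunks
    codes = np.empty((n, m), dtype=np.int64)
    for j in range(m):
        # nearest codeword in sub-space j
        dists = np.linalg.norm(parts[:, j, None, :] - codebooks[j][None], axis=-1)
        codes[:, j] = dists.argmin(axis=1)
    return codes

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 16))
codebooks = rng.normal(size=(4, 32, 4))                # 4 sub-spaces, 32 codewords of dim 4
print(pq_encode(x, codebooks).shape)                   # (8, 4)
```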
... To evaluate XBT at a larger scale, we also include the Flickr [14] and COCO [22] datasets, which consist of images that are each paired with 5 relevant textual captions. For the Flickr dataset, we use the entire dataset, encompassing 31,783 images and 158,915 captions. ...
Preprint
Full-text available
Modern retrieval systems often struggle with upgrading to new and more powerful models due to the incompatibility of embeddings between the old and new models. This necessitates a costly process known as backfilling, which involves re-computing the embeddings for a large number of data samples. In vision, Backward-compatible Training (BT) has been proposed to ensure that the new model aligns with the old model's embeddings. This paper extends the concept of vision-only BT to the field of cross-modal retrieval, marking the first attempt to address Cross-modal BT (XBT). Our goal is to achieve backward-compatibility between Vision-Language Pretraining (VLP) models, such as CLIP, for the cross-modal retrieval task. To address XBT challenges, we propose an efficient solution: a projection module that maps the new model's embeddings to those of the old model. This module, pretrained solely with text data, significantly reduces the number of image-text pairs required for XBT learning, and, once it is pretrained, it avoids using the old model during training. Furthermore, we utilize parameter-efficient training strategies that improve efficiency and preserve the off-the-shelf new model's knowledge by avoiding any modifications. Experimental results on cross-modal retrieval datasets demonstrate the effectiveness of XBT and its potential to enable backfill-free upgrades when a new VLP model emerges.
... MIRFlickr-25K [37]: It comprises 25,000 image-text pairs, encompassing 24 distinct categories. For the experiments conducted in this paper [29,38], we selected 2,000 pairs as the query set, while the remaining pairs were designated as the retrieval set, from which 5,000 pairs were sampled for training. ...
Article
Full-text available
As multimedia technologies advance, untagged image-text data processing has become central in cross-modal retrieval. However, current methods often neglect three critical issues when learning hash codes: 1. Incomplete feature representation limits capturing diverse latent semantics. 2. Binary codes from quantisation loss lack overall constraints and global interaction. 3. Prioritizing retrieval performance overlooks modality robustness, leading to significant multi-modal retrieval disparities. To address these challenges, we introduce HMIB, an unsupervised cross-modal hashing algorithm. We leverage deep feature encoders with pre-trained models like CLIP and VGG, capturing latent semantic associations across natural language and image classification. A hierarchical interactive modal similarity generator introduces comprehensive process constraints and corrects ambiguous edge semantic data, enhancing robustness and generating high-quality hash codes. We conducted extensive experiments on three widely used datasets, maintaining high-level performance while minimizing cross-modal retrieval disparities.
... To validate the effectiveness of the proposed method and its competitors, a series of experiments were conducted on three widely-used datasets, including MIRFlickr [77], NUS-WIDE [78], and IAPR TC-12 [79]. ...
Article
Full-text available
Cross-modal hashing (CMH) has attracted considerable attention in recent years. Almost all existing CMH methods primarily focus on reducing the modality gap and semantic gap, i.e., aligning multi-modal features and their semantics in Hamming space, without taking into account the space gap, i.e., difference between the real number space and the Hamming space. In fact, the space gap can affect the performance of CMH methods. In this paper, we analyze and demonstrate how the space gap affects the existing CMH methods, which therefore raises two problems: solution space compression and loss function oscillation. These two problems eventually cause the retrieval performance deteriorating. Based on these findings, we propose a novel algorithm, namely Semantic Channel Hashing (SCH). Firstly, we classify sample pairs into fully semantic-similar, partially semantic-similar, and semantic-negative ones based on their similarity and impose different constraints on them, respectively, to ensure that the entire Hamming space is utilized. Then, we introduce a semantic channel to alleviate the issue of loss function oscillation. Experimental results on three public datasets demonstrate that SCH outperforms the state-of-the-art methods. Furthermore, experimental validations are provided to substantiate the conjectures regarding solution space compression and loss function oscillation, offering visual evidence of their impact on the CMH methods. Codes are available at https://github.com/hutt94/SCH .
... In addition, we randomly selected two classes as unseen, and the remaining classes were considered seen in the dataset. MIRFlickr [48]: This dataset consists of 25,000 image-text pairs sourced from Flickr. Each pair is associated with one of 24 categories. ...
Article
Full-text available
Recently, zero-shot hashing methods have been successfully applied to cross-modal retrieval. However, these methods typically assume that the training data labels are accurate and noise-free, which is unrealistic in real-world scenarios due to the noise introduced by manual or automatic annotation. To address this problem, we propose a robust zero-shot discrete hashing with noisy labels (RZSDH) method, which fully considers the impact of noisy labels in real scenes. Our RZSDH method incorporates sparse and low-rank constraints on the noise matrix and the recovered label matrix, respectively, to effectively reduce the negative impact of noisy labels. This significantly enhances the robustness of our proposed method in practical cross-modal retrieval tasks. Additionally, the proposed RZSDH method learns a representation vector for each category attribute, which effectively captures the relationship between seen classes and unseen classes. Furthermore, our approach learns the common latent representation with drift from multimodal data features, which is more conducive to obtaining stable hash codes and hash functions. Finally, we employ a fine-grained similarity preserving strategy to generate more discriminative hash codes. Experiments on several benchmark datasets verify the effectiveness and robustness of the proposed RZSDH method.
... We evaluate our proposed methods on five widely used cross-modal retrieval benchmark datasets: the Wikipedia dataset [3], the NUS-WIDE-10k dataset, the PKU XMedia dataset [52,53], NUS-WIDE [51], and Flickr [54]. We followed the processing of the standard long-tail dataset CIFAR100-LT [55] to make the data unbalanced. ...
Article
Full-text available
Cross-modal retrieval aims to project high-dimensional cross-modal data into a common low-dimensional space. Previous work relies on balanced datasets for training. But with the growth of massive real datasets, the long-tail phenomenon has been found in more and more datasets, and how to train with those imbalanced datasets is becoming an emerging challenge. In this paper, we propose complementary expert balanced learning for long-tail cross-modal retrieval to alleviate the impact of long-tail data. In the solution, we design multiple complementary experts to balance the difference between image and text modalities. Separately for each expert, to find the common feature space of images and texts, we design an individual pairs loss. Moreover, a balancing process is proposed to mitigate the impact of the long tail on the retrieval accuracy of each expert network. In addition, we propose complementary online distillation to enable collaborative operation between individual experts and improve image and text matching. Each expert allows mutual learning between individual modalities, and multiple experts can complement each other to learn the feature embedding between two modalities. Finally, to address the reduction in the number of data after long-tail processing, we propose high-score retraining, which also helps the network capture global and robust features with meticulous discrimination. Experimental results on widely used benchmark datasets show that the proposed method is effective in long-tail cross-modal learning.
... • MIRFLICKR-25K [10]: This dataset is obtained from the Flickr website. It gathers a total of 25,000 photos. ...
Article
Full-text available
Unsupervised hashing for cross-modal retrieval has received much attention in the data mining area. Recent methods rely on image-text paired data to conduct unsupervised cross-modal hashing in batch samples. There are two main limitations for existing models: (1) learning of cross-modal representations is restricted to batches; (2) semantically similar samples may be wrongly treated as negative. In this paper, we propose a novel category-level contrastive learning for unsupervised cross-modal hashing, which alleviates the above problems and improves cross-modal query accuracy. To break the limitation of learning in small batches, a selected memory module is first proposed to take global relations into account. Then, we obtain pseudo labels through clustering and combine the labels with the Hadamard Matrix for category-centered learning. To reduce wrong negatives, we further propose a memory bank to store clusters of samples and construct negatives by selecting samples from different categories for contrastive learning. Extensive experiments show the significant superiority of our approach over the state-of-the-art models on MIRFLICKR-25K and NUS-WIDE datasets.
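The "Hadamard Matrix" mentioned here is a common source of hash targets: rows of a Hadamard matrix are mutually orthogonal ±1 vectors, so assigning one row per (pseudo-)category yields maximally separated binary centers. A minimal sketch of that standard construction (not necessarily the paper's exact recipe):

```python
import numpy as np
from scipy.linalg import hadamard

def category_centers(num_classes: int, code_length: int) -> np.ndarray:
    """Assign each class an orthogonal +-1 center of length code_length.
    code_length must be a power of two and >= num_classes for this construction."""
    H = hadamard(code_length)            # (code_length, code_length) matrix of +-1 entries
    assert num_classes <= code_length
    return H[:num_classes]               # one row per class

centers = category_centers(num_classes=24, code_length=64)
print(centers.shape)                     # (24, 64)
# Pairwise Hamming distance between distinct centers is exactly code_length / 2.
```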
... Through principal component analysis, it provides 1000-dimensional BOW features for every text. MIRFlickr [49]: The MIRFlickr collection consists of 25,000 images downloaded from the social photography site Flickr via a public application program interface, complete with manual annotations and bag-of-words-based similarity features. Specifically, it contains 25,000 pictures and related descriptive tags across 24 classes. ...
Article
Full-text available
Multimodal hash technology maps high-dimensional multimodal data into hash codes, which greatly reduces the cost of data storage and improves query speed through Hamming similarity calculation. However, existing unsupervised methods still face two key obstacles: (1) with the evolution of large multimodal models, how can the multimodal matching relationships of large models be distilled efficiently to train a powerful student model? (2) Existing methods do not consider other adjacencies between multimodal instances, resulting in limited similarity representation. To address these obstacles, a method called Unsupervised Graph Reasoning Distillation Hashing (UGRDH) is proposed. The UGRDH approach uses CLIP as the teacher model, thus extracting fine-grained multimodal features and relations for teacher–student distillation. Specifically, the multimodal features of the teacher are used to construct a similarity–complementary relation graph matrix, and the proposed graph convolution auxiliary network performs feature aggregation guided by the relation graph matrix to generate more discriminative hash codes. In addition, a cross-attention module is designed to reason about potential instance relations to enable effective teacher–student distilled learning. Finally, UGRDH greatly improves search precision while remaining lightweight. Experimental results show that our method achieves about 1.5%, 3%, and 2.8% performance improvements on MS COCO, NUS-WIDE, and MIRFlickr, respectively.
... In the experimental section of this study, we provide a broad and in-depth evaluation of the performance of cross-modal retrieval. Experiments were performed on three distinct datasets: Wikipedia [41], NUS-WIDE [42], and MIRFlickr-25k [43]. These datasets cover image and text data from different domains and content types, providing a diverse set of challenges and evaluation environments. ...
Article
Full-text available
With the development of intelligent collection technology and popularization of intelligent terminals, multi-source heterogeneous data are growing rapidly. The effective utilization of rich semantic information contained in massive amounts of multi-source heterogeneous data to provide users with high-quality cross-modal information retrieval services has become an urgent problem to be solved in the current field of information retrieval. In this paper, we propose a novel cross-modal retrieval method, named MGSGH, which deeply explores the internal correlation between data of different granularities by integrating coarse-grained global semantic information and fine-grained scene graph information to model global semantic concepts and local semantic relationship graphs within a modality respectively. By enforcing cross-modal consistency constraints and intra-modal similarity preservation, we effectively integrate the visual features of image data and semantic information of text data to overcome the heterogeneity between the two types of data. Furthermore, we propose a new method for learning hash codes directly, thereby reducing the impact of quantization loss. Our comprehensive experimental evaluation demonstrated the effectiveness and superiority of the proposed model in achieving accurate and efficient cross-modal retrieval.
... Following the dataset selection of (Tan et al. 2018; Li and Chen 2021; Wen et al. 2023), we conduct experiments on five commonly used datasets, namely Corel5k (Duygulu et al. 2002), Pascal07 (Everingham et al. 2010), ESPGame (Von Ahn and Dabbish 2004), IAPRTC12 (Grubinger et al. 2006), and Mirflickr (Huiskes and Lew 2008). ...
Article
Recently, multi-view multi-label classification (MvMLC) has received a significant amount of research interest and many methods have been proposed based on the assumptions of view completion and label completion. However, in real-world scenarios, multi-view multi-label data tends to be incomplete due to various uncertainties involved in data collection and manual annotation. As a result, the conventional MvMLC methods fail. In this paper, we propose a new two-stage MvMLC network to solve this incomplete MvMLC issue with partial missing views and missing labels. Different from the existing works, our method attempts to leverage the diverse information from the partially missing data based on the information theory. Specifically, our method aims to minimize task-irrelevant information while maximizing task-relevant information through the principles of information bottleneck theory and mutual information extraction. The first stage of our network involves training view-specific classifiers to concentrate the task-relevant information. Subsequently, in the second stage, the hidden states of these classifiers serve as input for an alignment model, an autoencoder-based mutual information extraction framework, and a weighted fusion classifier to make the final prediction. Extensive experiments performed on five datasets validate that our method outperforms other state-of-the-art methods. Code is available at https://github.com/KevinTan10/TSIEN.
... We conduct experiments on six real-world partial label data sets collected from several domains and tasks, including FG-NET (Panis et al. 2016) for facial age estimation, Lost (Cour et al. 2009), Soccer Player (Zeng et al. 2013) and Yahoo!News (Guillaumin, Verbeek, and Schmid 2010) for automatic face naming, MSRCv2 (Liu and Dietterich 2012) for object classification and Mirflickr (Huiskes and Lew 2008) for web image classification. The details of the data sets are summarized in the Supplementary. ...
Article
In partial label learning (PLL), each instance is associated with a set of candidate labels among which only one is ground-truth. The majority of the existing works focus on constructing robust classifiers to estimate the labeling confidence of candidate labels in order to identify the correct one. However, these methods usually struggle to rectify mislabeled samples. To help existing PLL methods identify and rectify mislabeled samples, in this paper, we introduce a novel partner classifier and propose a novel "mutual supervision" paradigm. Specifically, we instantiate the partner classifier predicated on the implicit fact that non-candidate labels of a sample should not be assigned to it, which is inherently accurate and has not been fully investigated in PLL. Furthermore, a novel collaborative term is formulated to link the base classifier and the partner one. During each stage of mutual supervision, both classifiers will blur each other's predictions through a blurring mechanism to prevent overconfidence in a specific label. Extensive experiments demonstrate that the performance and disambiguation ability of several well-established stand-alone and deep-learning based PLL approaches can be significantly improved by coupling with this learning paradigm.
Article
Full-text available
Deep Cross-Modal Hashing (DCMH) has garnered significant attention in the field of cross-modal retrieval due to its advantages such as high computational efficiency and small storage space. However, existing DCMH methods still face certain limitations: (1) they neglect the correlation between labels, while label features exhibit high sparsity; (2) they lack fine-grained semantic alignment; (3) they fail to effectively address data imbalance. In order to tackle these issues, this paper introduces a framework named Semantic-Alignment Transformer and Adversary Hashing for Cross-modal Retrieval (SATAH). To the best of our knowledge, this is the first attempt at the Semantic-Alignment Transformer algorithm. Specifically, this paper first designs a label learning network that utilizes a crafted transformer module to extract label information, guiding adversarial learning and hash function learning accordingly. Subsequently, a Balanced Conditional Generative Adversarial Network (BCGAN) is constructed, marking the first instance of adversarial training guided by label information. Furthermore, a Weighted Semi-Hard Cosine Triplet Constraint is proposed to better ensure high-ranking similarity relationships among all items. Lastly, considering the correlation between labels, a semantic-alignment constraint is introduced to handle label correlation from a fine-grained perspective, capturing similarity on a global scale more effectively. Extensive experiments are conducted on multiple representative cross-modal datasets. In experiments with 64-bit hash code length, SATAH achieves average mAP values of 84.75%, 68.87%, and 68.73% on MIR Flickr, NUS-WIDE, and MS COCO datasets, respectively, outperforming state-of-the-art methods. The code is available at https://github.com/Daydaylight/SATAH.
Chapter
Recent investigations into adversarial deep hashing networks have underscored the security threat posed by adversarial examples, known as adversarial vulnerability. Effectively distilling reliable semantic representatives for deep hashing to guide adversarial learning proves challenging, impeding the improvement of adversarial robustness in deep hashing-based retrieval models. Additionally, existing research on adversarial training for deep hashing lacks a unified minimax structure. This chapter introduces Semantic-Aware Adversarial Training (SAAT) (Yuan et al (2023) IEEE Trans Inf Forensics Secur) to enhance the adversarial robustness of deep hashing models. A discriminative mainstay features learning (DMFL) scheme is conceived to construct semantic representatives for guiding adversarial learning in deep hashing. DMFL, with a strict theoretical guarantee, is adaptively optimized in a discriminative learning manner, considering both discriminative and semantic properties jointly. Adversarial examples are generated by maximizing the Hamming distance between hash codes of adversarial samples and mainstay features, validated for efficacy in adversarial attack trials. Notably, this chapter formulates the formalized adversarial training of deep hashing into a unified minimax optimization for the first time, guided by generated mainstay codes. Extensive experiments on benchmark datasets demonstrate superb attack performance against state-of-the-art algorithms, while the proposed adversarial training effectively eliminates adversarial perturbations, ensuring trustworthy deep hashing-based retrieval.
Article
Full-text available
Lifelong machine learning concerns the development of systems that continuously learn from diverse tasks, incorporating new knowledge without forgetting the knowledge they have previously acquired. Multi-label classification is a supervised learning process in which each instance is assigned multiple non-exclusive labels, with each label denoted as a binary value. One of the main challenges within the lifelong learning paradigm is the stability-plasticity dilemma, which entails balancing a model’s adaptability in terms of incorporating new knowledge with its stability in terms of retaining previously acquired knowledge. When faced with multi-label data, the lifelong learning challenge becomes even more pronounced, as it becomes essential to preserve relations between multiple labels across sequential tasks. This scoping review explores the intersection of lifelong learning and multi-label classification, an emerging domain that integrates continual adaptation with intricate multi-label datasets. By analyzing the existing literature, we establish connections, identify gaps in the existing research, and propose new directions for research to improve the efficacy of multi-label lifelong learning algorithms. Our review unearths a growing number of algorithms and underscores the need for specialized evaluation metrics and methodologies for the accurate assessment of their performance. We also highlight the need for strategies that incorporate real-world data from varying contexts into the learning process to fully capture the nuances of real-world environments.
Article
Online cross-modal hashing has received increasing attention due to its efficiency and effectiveness in handling cross-modal streaming data retrieval. Despite the promising performance, these methods mainly focus on the supervised learning paradigm, demanding expensive and laborious work to obtain clean annotated data. Existing unsupervised online hashing methods mostly struggle to construct instructive semantic correlations among data chunks, resulting in the forgetting of accumulated data distribution. To this end, we propose a Dynamic Prototype-based Online Cross-modal Hashing method, called DPOCH. Based on the pre-learned reliable common representations, DPOCH generates prototypes incrementally as sketches of accumulated data and updates them dynamically for adapting streaming data. Thereafter, the prototype-based semantic embedding and similarity graphs are designed to promote stability and generalization of the hashing process, thereby obtaining globally adaptive hash codes and hash functions. Experimental results on benchmarked datasets demonstrate that the proposed DPOCH outperforms state-of-the-art unsupervised online cross-modal hashing methods.
Article
The large and growing amount of digital data creates a pressing need for approaches capable of indexing and retrieving multimedia content. A traditional and fundamental challenge consists of effectively and efficiently performing nearest-neighbor searches. After decades of research, several different methods are available, including trees, hashing, and graph-based approaches. Most of the current methods exploit learning to hash approaches based on deep learning. In spite of effective results and compact codes obtained, such methods often require a significant amount of labeled data for training. Unsupervised approaches also rely on expensive training procedures usually based on a huge amount of data. In this work, we propose an unsupervised data-independent approach for nearest neighbor searches, which can be used with different features, including deep features trained by transfer learning. The method uses a rank-based formulation and exploits a hashing approach for efficient ranked list computation at query time. A comprehensive experimental evaluation was conducted on 7 public datasets, considering deep features based on CNNs and Transformers. Both effectiveness and efficiency aspects were evaluated. The proposed approach achieves remarkable results in comparison to traditional and state-of-the-art methods. Hence, it is an attractive and innovative solution, especially when costly training procedures need to be avoided.
Article
Deep cross-modal hashing retrieval has recently made significant progress. However, existing methods generally learn hash functions with pairwise or triplet supervisions, which involves learning the relevant information by splicing partial similarity between data pairs; notably, this approach only captures the data similarity locally and incompletely, resulting in sub-optimal retrieval performance. In this paper, we propose a novel Multi-Relational Deep Hashing (MRDH) approach, which can fully bridge the modality gap by comprehensively modeling the similarity relationship between data in different modalities. In more detail, to investigate the inter-modal relationships, we constrain the consistency of cross-modal pairwise similarities to maintain the semantic similarity across modalities. Moreover, to further capture complete similarity information, we design a new similarity metric, which we term cross-modal global similarity, by encouraging hash codes of similar data pairs from different modalities to approach a common center and hash codes for dissimilar pairs to converge to different centers. Adopting this approach enables our model to generate more discriminative hash codes. Extensive experiments on three benchmark datasets demonstrate the superiority of our method on cross-modal hashing retrieval.
Article
Hashing techniques have been extensively studied in cross-modal retrieval due to their advantages in high computational efficiency and low storage cost. However, existing methods unconsciously ignore the complementary information of multimodal data, thus failing to consider learning discriminative hash codes from the perspective of information complementarity, while often involving time-consuming training overhead. To tackle the above issues, we propose an efficient discriminative hashing (EDH) method with information complementarity consideration. Specifically, we reckon that multimodal features and their corresponding semantic labels describe heterogeneous data viewed from low- and high-level structures, which are complementary. To this end, a low-level latent representation and a high-level semantics representation are simply derived. Then, a joint learning strategy is formulated to simultaneously exploit the above two representations for generating discriminative hash codes, which is quite computationally efficient. Besides, EDH decomposes hash learning into two steps. To obtain powerful hash functions which are conducive to retrieval, a regularization term considering pairwise semantic similarity is introduced into hash function learning. In addition, an efficient optimization algorithm is designed to solve the optimization problem in EDH. Extensive experiments conducted on benchmark datasets demonstrate the superiority of our EDH in terms of retrieval performance and training efficiency. The source code is available at https://github.com/hjf-hjf/EDH .
Article
Existing unsupervised deep product quantization methods primarily aim for the increased similarity between different views of the identical image, whereas the delicate multi-level semantic similarities preserved between images are overlooked. Moreover, these methods predominantly focus on the Euclidean space for computational convenience, compromising their ability to map the multi-level semantic relationships between images effectively. To mitigate these shortcomings, we propose a novel unsupervised product quantization method dubbed Hierarchical Hyperbolic Product Quantization (HiHPQ), which learns quantized representations by incorporating hierarchical semantic similarity within hyperbolic geometry. Specifically, we propose a hyperbolic product quantizer, where the hyperbolic codebook attention mechanism and the quantized contrastive learning on the hyperbolic product manifold are introduced to expedite quantization. Furthermore, we propose a hierarchical semantics learning module, designed to enhance the distinction between similar and non-matching images for a query by utilizing the extracted hierarchical semantics as an additional training supervision. Experiments on benchmark image datasets show that our proposed method outperforms state-of-the-art baselines.
Article
Multi-label classification is an arduous problem given the complication in label correlation. Whilst sharing a common goal with contrastive learning in utilizing correlations for representation learning, how to better leverage label information remains challenging. Previous endeavors include extracting label-level representations or mapping labels to an embedding space, overlooking the correlation between multiple labels. This leaves great ambiguity in determining positive samples when samples share different extents of label overlap, and in integrating such relations into loss functions. In our work, we propose Multi-Label Supervised Contrastive learning (MulSupCon) with a novel contrastive loss function that adjusts weights based on how much overlap one sample shares with the anchor. By analyzing gradients, we explain why our method performs better under multi-label circumstances. To evaluate, we conduct direct classification and transfer learning on several multi-label datasets, including widely used image datasets such as MS-COCO and NUS-WIDE. Validation indicates that our method outperforms the traditional multi-label classification method and shows competitive performance when compared to other existing approaches.
Article
As social media faces large amounts of data with multimodal properties, cross-modal hashing (CMH) retrieval gains extensive applications with its high efficiency and low storage consumption. However, there are two issues that hinder the performance of the existing semantics-learning-based CMH methods: 1) there exist some nonlinear relationships, noises, and outliers in the data, which may degrade the learning effectiveness of a model; and 2) the complementary relationships between the label semantics and sample semantics may be inadequately explored. To address the above two problems, a method called robust asymmetric cross-modal hashing retrieval with dual semantic enhancement (RADSE) is proposed. RADSE consists of three parts: 1) cross-modal data alignment (CDA), which applies kernel mapping and establishes a unified linear representation in the neighborhood to capture the nonlinear relationships between cross-modal data; 2) relaxed label semantic learning for robustness (RLSLR), which uses a relaxation strategy to expand label distinctiveness and leverages the ℓ2,1 norm to enhance the robustness of the model against noise and outliers; and 3) dual semantic enhancement learning (DSEL), which learns more interrelationships between samples under label semantic guidance to ensure the mutual enhancement of semantic information. Extensive experiments and analyses on three popular datasets demonstrate that RADSE outperforms most existing methods in terms of mean average precision (MAP), precision–recall (P–R) curves, and top-N precision curves. In the comparisons of MAP, RADSE improves by an average of 2%–3% in two retrieval tasks.
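The ℓ2,1 norm that RADSE uses for robustness is simply the sum of the ℓ2 norms of the rows (or columns, depending on the paper's convention) of a matrix; it penalizes whole samples rather than individual entries, which is why it dampens outliers. A one-line reference implementation:

```python
import numpy as np

def l21_norm(X: np.ndarray) -> float:
    """||X||_{2,1} = sum of the l2 norms of the rows of X (row convention assumed)."""
    return float(np.linalg.norm(X, axis=1).sum())

X = np.array([[3.0, 4.0], [0.0, 1.0]])
print(l21_norm(X))   # 5.0 + 1.0 = 6.0
```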
Article
In recent years, to address the issue of networked data sparsity in node classification tasks, cross-network node classification (CNNC) leverages the richer information from a source network to enhance the performance of node classification in the target network, which typically has sparser information. However, in real-world applications, labeled nodes may be collected from multiple sources with multiple modalities (e.g., text, vision, and video). Naive application of single-source and single-modal CNNC methods may result in sub-optimal solutions. To this end, in this paper, we propose a model called M²CDNE (Multi-source and Multi-modal Cross-network Deep Network Embedding) for cross-network node classification. In M²CDNE, we propose a deep multi-modal network embedding approach that combines the extracted deep multi-modal features to make the node vector representations network-invariant. In addition, we apply dynamic adversarial adaptation to assess the significance of marginal and conditional probability distributions between each source and target network to make node vector representations label-discriminative. Furthermore, we classify nodes in the target network through the related source classifiers and aggregate their predictions using respective network weights, corresponding to the discrepancy between each source and target network. Extensive experiments performed on real-world datasets demonstrate that the proposed M²CDNE significantly outperforms the state-of-the-art approaches.
Conference Paper
Full-text available
In this paper we describe ImageCLEF, the cross language image retrieval track of the Cross Language Evaluation Forum (CLEF). We instigated and ran a pilot experiment in 2003 where participants submitted entries for an ad hoc bilingual image retrieval task on a collection of historic photographs from St. Andrews University Library. This was designed to simulate the situation in which users would express their search request in natural language but require visual documents in return. For 2004 we have extended the tasks to include a medical image retrieval task and a user-centred evaluation.
Conference Paper
Full-text available
Beyond the pixel information, a digital photo today carries a host of other information regarding the photo-shooting event. This information is captured by the different sensors present on the camera and is stored as metadata. In this paper we exploit this meta-information and derive useful semantics about the digital photo. We also compare our results with classical relevance models used for automatic photo annotation. We create a dataset of digital photos containing all of this information and report results on it. We also make the dataset available to the community for further experiments.
Conference Paper
Full-text available
There is not yet any standard, widely used data set for demonstrating the performance of content-based image retrieval systems (CBIRSs). The only dataset used by a large number of research groups is the Corel Photo CD collection. There are more than 800 of those CDs, each containing 100 pictures roughly similar in theme. Unfortunately, essentially every evaluation is done on a different subset of the image sets, making comparisons impossible. In this article, we compare different ways of evaluating performance using a subset of the Corel images with the same CBIRS and the same set of evaluation measures. The aim is to show how easy it is to obtain differing results, even when using the same image collection, the same CBIRS and the same performance measures. This pinpoints the fact that we need a standard database of images with a query set and corresponding relevance judgments (RJs) to really compare systems. The techniques used in this article to "enhance" the apparent performance of a CBIRS are commonly used, sometimes described, sometimes not. They all have a justification and seem to change the performance of a CBIRS, but they actually do not. With a larger subset of images it is of course much easier to generate even bigger differences in performance. The goal of this article is not to be a guide on how to make the "apparent" performance of systems look good, but rather to make readers aware of CBIRS evaluations and the importance of standardized image databases, queries and RJs.
Conference Paper
Full-text available
Multimedia Information Retrieval (IR) techniques and associated systems are now numerous and justify the development of strategies and actions to objectively evaluate their capabilities. A number of initiatives following this line exist, each with its own peculiarities. In this paper, we take a bird's eye view on benchmarking multimedia IR systems (with the particular case of image and video retrieval) and summarize contributions made to a dedicated special session at the ACM Multimedia Information Retrieval Workshop (ACM MIR 2006). From the analysis of a classical IR system, we identify locations of interest for evaluation of performance. We review proposals made in the context of existing benchmarks, each specialized in its own aspect and media.
Article
Full-text available
Extending beyond the boundaries of science, art, and culture, content-based multimedia information retrieval provides new paradigms and methods for searching through the myriad variety of media over the world. This survey reviews 100+ recent articles on content-based multimedia information retrieval and discusses their role in current research directions which include browsing and search paradigms, user studies, affective computing, learning, semantic queries, new features and media types, high performance indexing, and evaluation techniques. Based on the current state of the art, we discuss the major challenges for the future.
Article
Full-text available
Successful and effective content-based access to digital video requires fast, accurate and scalable methods to determine the video content automatically. A variety of contemporary approaches to this rely on text taken from speech within the video, or on matching one video frame against others using low-level characteristics like colour, texture, or shapes, or on determining and matching objects appearing within the video. Possibly the most important technique, however, is one which determines the presence or absence of a high-level or semantic feature, within a video clip or shot. By utilizing dozens, hundreds or even thousands of such semantic features we can support many kinds of content-based video navigation. Critically however, this depends on being able to determine whether each feature is or is not present in a video clip. The last 5 years have seen much progress in the development of techniques to determine the presence of semantic features within video. This progress can be tracked in the annual TRECVid benchmarking activity where dozens of research groups measure the effectiveness of their techniques on common data and using an open, metrics-based approach. In this chapter we summarise the work done on the TRECVid high-level feature task, showing the progress made year-on-year. This provides a fairly comprehensive statement on where the state-of-the-art is regarding this important task, not just for one research group or for one approach, but across the spectrum. We then use this past and on-going work as a basis for highlighting the trends that are emerging in this area, and the questions which remain to be addressed before we can achieve large-scale, fast and reliable high-level feature detection on video.
Article
Full-text available
The TREC Video Retrieval Evaluation (TRECVid) is an international benchmarking activity to encourage research in video information retrieval by providing a large test collection, uniform scoring procedures, and a forum for organizations interested in comparing their results. TRECVid completed its fifth annual cycle at the end of 2005 and in 2006 TRECVid will involve almost 70 research organizations, universities and other consortia. Throughout its existence, TRECVid has benchmarked both interactive and automatic/manual searching for shots from within a video corpus, automatic detection of a variety of semantic and low-level video features, shot boundary detection and the detection of story boundaries in broadcast TV news. This paper will give an introduction to information retrieval (IR) evaluation from both a user and a system perspective, highlighting that system evaluation is by far the most prevalent type of evaluation carried out. We also include a summary of TRECVid as an example of a system evaluation benchmarking campaign and this allows us to discuss whether such campaigns are a good thing or a bad thing. There are arguments for and against these campaigns and we present some of them in the paper concluding that on balance they have had a very positive impact on research progress.
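Campaigns such as TRECVid rely on uniform, metrics-based scoring of submitted ranked lists. As a small illustration of what such scoring looks like, here is non-interpolated average precision for a single query; the ranked list and relevance judgments below are invented.

def average_precision(ranked_ids, relevant_ids):
    """Non-interpolated average precision for a single query."""
    relevant_ids = set(relevant_ids)
    hits, precision_sum = 0, 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant_ids) if relevant_ids else 0.0

# Invented system output and ground truth for one query.
ranked = ["d3", "d7", "d1", "d9", "d4"]
relevant = {"d3", "d9", "d5"}
print(round(average_precision(ranked, relevant), 3))  # 0.5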
Article
A photograph captured by a digital camera usually includes camera metadata in which sensor readings, camera settings and other capture pipeline information are recorded. The camera metadata, typically stored in an EXIF header, contains a rich set of information reflecting the conditions under which the photograph was captured. This rich information is potentially useful for improving digital photography, but its multi-dimensionality and heterogeneous data structure make it difficult to exploit. Knowledge discovery, on the other hand, is usually associated with data mining to extract potentially useful information from complex data sets. In this paper we use a knowledge discovery framework based on data mining to automatically associate combinations of high-dimensional, heterogeneous metadata with scene types. In this way, we can perform very simple and efficient scene classification for certain types of photographs. We have also provided an interactive user interface in which a user can type in a query on metadata and the system retrieves and displays the images in our database that satisfy the query. We have used this approach to associate EXIF metadata with specific scene types like back-lit scenes, night scenes and snow scenes. To improve the classification results, we have combined an initial classification based only on the metadata with a simple, histogram-based analysis for quick verification of the discovered knowledge. The classification results, in turn, can be used to better manage, assess, or enhance the photographs.
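As a loose illustration of the kind of metadata-driven rule such a framework might surface, the sketch below flags likely night scenes from a few EXIF fields. The field names follow common EXIF tags, but the thresholds and the single rule are invented for illustration and are not the classifier described in the paper.

# Rule-of-thumb sketch: flag a likely night scene from already-extracted EXIF
# values (e.g. pre-parsed metadata fields). Thresholds are illustrative only.

def looks_like_night_scene(exif):
    """exif: dict of EXIF field name -> raw value, already extracted from the image."""
    exposure = float(exif.get("ExposureTime", 0.0))      # exposure time in seconds
    iso = int(exif.get("ISOSpeedRatings", 0))            # sensor sensitivity
    flash_fired = bool(int(exif.get("Flash", 0)) & 1)    # low bit of the EXIF Flash value
    return exposure >= 0.5 or iso >= 800 or flash_fired

print(looks_like_night_scene({"ExposureTime": "2.0", "ISOSpeedRatings": "100", "Flash": "0"}))  # True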
Conference Paper
The importance of the visual information search problem has given rise to a large number of systems and prototypes being built to perform such search. While different systems clearly have their particular strengths, they tend to use different collections to highlight the advantages of their algorithms. Consequently, a degree of bias may exist, and it also makes it difficult to make comparisons concerning the relative superiority of different algorithms. In order for the field of visual information search to make further progress, a need therefore exists for a standardised benchmark suite to be developed. By having a uniform measure of search performance, research progress can be more easily recognised and charted, and the resultant synergy will be essential to further development of the field. This paper presents concrete proposals concerning the development of such a benchmark, and by adopting an extensible framework, it is able to cater for a wide variety of applications paradigms and to lend itself to incremental refinement.
Conference Paper
In this paper we review the evaluation of relevance feedback methods for content-based image retrieval systems. We start out by presenting an overview of current common practice, and argue that the evaluation of relevance feedback methods differs from evaluating CBIR systems as a whole. Specifically, we identify the challenging issues that are particular to the evaluation of retrieval employing relevance feedback. Next, we propose three guidelines to move toward more effective evaluation benchmarks. We focus particularly on assessing feedback methods more directly in terms of their goal of identifying the relevant target class with a small number of samples, and show how to compensate for query targets of varying difficulty by measuring efficiency at generalization.
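"Efficiency at generalization" could be operationalized in several ways; one plausible version, assumed here rather than taken from the authors' exact definition, is to count how many feedback samples a method needs before the target class is retrieved and normalize by a no-feedback baseline, so that easy and hard targets contribute comparably.

# Assumed (not the paper's) operationalization of feedback efficiency: ratio of
# samples needed by a no-feedback baseline to samples needed with feedback,
# averaged over query targets. All numbers below are invented.

def feedback_efficiency(samples_with_feedback, samples_baseline):
    ratios = [baseline / feedback
              for feedback, baseline in zip(samples_with_feedback, samples_baseline)
              if feedback > 0]
    return sum(ratios) / len(ratios)

with_fb = [4, 6, 10]     # feedback samples needed per query target
baseline = [12, 6, 40]   # samples needed without feedback (difficulty proxy)
print(round(feedback_efficiency(with_fb, baseline), 2))  # 2.67 (>1 means feedback helps)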
Conference Paper
Shared evaluation tasks have become popular over the last decades as ways of making communities of researchers advance together. This paper presents the organization of five new shared task evaluation campaigns for image indexing and retrieval. We have designed these campaigns based on our previous experience of participating in or organizing various text retrieval campaigns such as TREC, AMARYLLIS and CLEF. Our purpose behind these campaigns is to minimize the gap between technology evaluation and user-oriented evaluation in the field of information retrieval.
Article
We analyze the nature of the relevance feedback problem in a continuous representation space in the context of multimedia information retrieval. Emphasis is put on exploring the uniqueness of the problem and comparing the assumptions, implementations, and merits of various solutions in the literature. An attempt is made to compile a list of critical issues to consider when designing a relevance feedback algorithm. With a comprehensive review as the main portion, this paper also offers some novel solutions and perspectives throughout the discussion.
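One classical family of solutions in this literature is query point movement in the continuous feature space, for example a Rocchio-style update. The sketch below uses conventional textbook weights and invented feature vectors, and is not presented as the approach this particular paper advocates.

import numpy as np

def rocchio_update(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Move the query point toward relevant examples and away from non-relevant ones."""
    q = alpha * query
    if len(relevant):
        q = q + beta * np.mean(relevant, axis=0)
    if len(nonrelevant):
        q = q - gamma * np.mean(nonrelevant, axis=0)
    return q

# Invented 3-D feature vectors standing in for image descriptors.
query = np.array([0.2, 0.5, 0.1])
relevant = np.array([[0.3, 0.6, 0.2], [0.4, 0.7, 0.1]])
nonrelevant = np.array([[0.9, 0.1, 0.8]])
print(rocchio_update(query, relevant, nonrelevant))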
The Benchathlon Network
  • S. Marchand-Maillet
S. MARCHAND-MAILLET. 2005. The Benchathlon Network. http://www.benchathlon.net.
Benchmarking image and video retrieval: an overview
  • S. Marchand-Maillet
  • M. Worring
S. MARCHAND-MAILLET AND M. WORRING. 2006. Benchmarking image and video retrieval: an overview. In MIR '06: Proceedings of the 8th ACM international workshop on Multimedia information retrieval, pages 297–300, New York, NY, USA. ACM Press.
Report on the need for and provision of an ideal information retrieval test collection
  • K. Sparck Jones
  • C. van Rijsbergen
K. SPARCK JONES AND C. VAN RIJSBERGEN. 1975. Report on the need for and provision of an ideal information retrieval test collection. British Library Research and Development Report.