Figure 3
Image retrieval examples. The input (query) image is on the left and its closest match is on the right. The query images are from [75] and the closest match is from [90]. The observers' fixation density map is overlaid.

Source publication
Article
Full-text available
This paper presents a novel fixation prediction and saliency modeling framework based on inter-image similarities and an ensemble of Extreme Learning Machines (ELM). The proposed framework is inspired by two observations: 1) the contextual information of a scene along with low-level visual cues modulates attention, and 2) the influence of scene memorabili...

Context in source publication

Context 1
... then fetches the neural fixation predictor units corresponding to the n images with the smallest dist_i in order to form the ensemble of neural fixation predictors, to be discussed in Section 4.3. Figure 3 demonstrates the results of the retrieval system. It visualizes a query image and its corresponding most similar retrieved image between two different databases with the observer gaze information overlaid. ...
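The retrieval step described above (and illustrated in Figure 3) amounts to a nearest-neighbour lookup over image descriptors, followed by pooling the fixation predictors attached to the retrieved images. A minimal sketch of that logic is given below; the feature space, distance measure, and averaging rule are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def ensemble_saliency(query_feat, bank_feats, bank_predictors, image, n=10):
    """Hedged sketch of the retrieval-ensemble step (not the authors' code).

    query_feat      : feature vector describing the query image
    bank_feats      : (N, D) array of features for the images in the training bank
    bank_predictors : list of N per-image fixation predictors (callables standing
                      in for the trained ELM units attached to each bank image)
    """
    # dist_i: distance between the query and every bank image in feature space
    dists = np.linalg.norm(bank_feats - query_feat, axis=1)

    # keep the n most similar images (smallest dist_i)
    nearest = np.argsort(dists)[:n]

    # each retrieved image contributes its own fixation predictor;
    # here the ensemble output is simply the average of their saliency maps
    maps = [bank_predictors[i](image) for i in nearest]
    return np.mean(maps, axis=0)
```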

Similar publications

Preprint
Full-text available
When we bring to mind something we have seen before, our eyes spontaneously reproduce a pattern strikingly similar to that made during the original encounter. Eye-movements can then serve the opposite purpose to acquiring new visual information; they can serve as self-generated cues, pointing to memories already stored. By isolating separable prope...

Citations

... Following the idea that a self-attention mechanism has the ability to indicate the discriminative regions in an image [46], a novel network architecture [86] that incorporated saliency information as input was designed. Local deep feature representations from training samples and their corresponding saliency maps obtained from [87] were combined for improving the classification performance on FSFGIC. Following the idea of object localization strategy [88], a meta-reweighting strategy [19] was designed to extract and exploit local deep feature representations of support samples. ...
Article
Full-text available
Few-shot fine-grained image classification (FSFGIC) methods refer to the classification of images (e.g., birds, flowers, and airplanes) belonging to different subclasses of the same species by a small number of labeled samples. Through feature representation learning, FSFGIC methods can make better use of limited sample information, learn more discriminative feature representations, greatly improve the classification accuracy and generalization ability, and thus achieve better results in FSFGIC tasks. In this paper, starting from the definition of FSFGIC, a taxonomy of feature representation learning for FSFGIC is proposed. According to this taxonomy, we discuss key issues on FSFGIC (including data augmentation, local and/or global deep feature representation learning, class representation learning, and task-specific feature representation learning). In addition, the existing popular datasets, current challenges and future development trends of feature representation learning on FSFGIC are also described.
... Following the idea that a self-attention mechanism has the ability to indicate the discriminative regions in an image [50], an image saliency regions incorporation strategy [51] was designed. Local deep feature representations from training samples and their corresponding saliency maps obtained from [52] are combined for improving the classification performance on FSFGIC. Following the idea of object localization strategy [53], a meta-reweighting strategy [54] was designed to extract and exploit local deep feature representations of support samples. ...
Preprint
Full-text available
Few-shot fine-grained image classification (FSFGIC) methods refer to machine learning methods which aim to classify images (e.g., bird species, flowers, and airplanes) belonging to subordinate object categories of the same entry-level category with only a few samples. It is worth noting that feature representation learning is used not only to represent training samples, but also to construct classifiers for performing various FSFGIC tasks. In this paper, starting from the definition of FSFGIC, a taxonomy of feature representation learning for FSFGIC is proposed. According to this taxonomy, we discuss key issues on FSFGIC (including data augmentation, local and/or global deep feature representation learning, class representation learning, and task-specific feature representation learning). The existing popular datasets and evaluation standards are introduced. Furthermore, a novel classification performance evaluation mechanism is designed with a 0.95 confidence interval for judging whether the classification accuracy obtained by a certain specified method is good or bad. Moreover, current challenges and future trends of feature representation learning on FSFGIC are elaborated.
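The 0.95 confidence interval mentioned above is not spelled out in this excerpt; a common convention in few-shot papers is a normal-approximation interval over per-episode accuracies. A hedged sketch, assuming that convention:

```python
import numpy as np

def mean_accuracy_ci(episode_accuracies, z=1.96):
    """95% confidence interval for mean episode accuracy (normal approximation).

    episode_accuracies: accuracies from many independently sampled few-shot
    episodes (an assumption; the paper's exact protocol may differ).
    """
    acc = np.asarray(episode_accuracies, dtype=float)
    mean = acc.mean()
    # standard error of the mean; ddof=1 uses the sample standard deviation
    sem = acc.std(ddof=1) / np.sqrt(len(acc))
    return mean, z * sem  # report as mean ± margin

# e.g. mean, margin = mean_accuracy_ci(accs); print(f"{mean:.4f} ± {margin:.4f}")
```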
... This penalty regression ELM overcomes the defect that the least-squares method is not suited to nonlinear fitting. Tavakoli et al. attempted to develop a novel fixation prediction framework based on inter-image similarities [17]. The ELM model estimates the saliency of a given image, and its effectiveness has also been demonstrated for image classification. ...
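An ELM, as used in the works cited here, is a single-hidden-layer network whose hidden weights are random and fixed; only the output weights are solved in closed form. A minimal regression sketch, illustrative only and not the cited models:

```python
import numpy as np

class ELMRegressor:
    """Basic extreme learning machine: random hidden layer + least-squares output."""

    def __init__(self, n_hidden=500, seed=0):
        self.n_hidden = n_hidden
        self.rng = np.random.default_rng(seed)

    def fit(self, X, y, ridge=1e-3):
        # random, untrained input-to-hidden projection
        self.W = self.rng.standard_normal((X.shape[1], self.n_hidden))
        self.b = self.rng.standard_normal(self.n_hidden)
        H = np.tanh(X @ self.W + self.b)          # hidden activations
        # output weights by ridge-regularised least squares -- the only "training"
        A = H.T @ H + ridge * np.eye(self.n_hidden)
        self.beta = np.linalg.solve(A, H.T @ y)
        return self

    def predict(self, X):
        return np.tanh(X @ self.W + self.b) @ self.beta
```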
Article
Full-text available
For orbital angular momentum (OAM) recognition in atmosphere turbulence, how to design a self-adapted model is a challenging problem. To address this issue, an efficient deep learning framework that uses a derived extreme learning machine (ELM) has been put forward. Different from typical neural network methods, the provided analytical machine learning model can match the different OAM modes automatically. In the model selection phase, a multilayer ELM is adopted to quantify the laser spot characteristics. In the parameter optimization phase, a fast iterative shrinkage-thresholding algorithm makes the model present the analytic expression. After the feature extraction of the received intensity distributions, the proposed method develops a relationship between laser spot and OAM mode, thus building the steady neural network architecture for the new received vortex beam. The whole recognition process avoids the trial and error caused by user intervention, which makes the model suitable for a time-varying atmospheric environment. Numerical simulations are conducted on different experimental datasets. The results demonstrate that the proposed method has a better capacity for OAM recognition.
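The fast iterative shrinkage-thresholding algorithm (FISTA) mentioned in this abstract solves a sparsity-regularised least-squares problem; how it is wired into the ELM is not detailed in the excerpt. A generic FISTA sketch for min_b 0.5*||Hb - y||^2 + lam*||b||_1, with the objective assumed for illustration:

```python
import numpy as np

def soft_threshold(x, t):
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def fista(H, y, lam=1e-2, n_iter=200):
    """Generic FISTA for an L1-regularised least-squares problem (illustrative)."""
    L = np.linalg.norm(H, 2) ** 2        # Lipschitz constant of the smooth gradient
    b = z = np.zeros(H.shape[1])
    t = 1.0
    for _ in range(n_iter):
        grad = H.T @ (H @ z - y)
        b_new = soft_threshold(z - grad / L, lam / L)   # proximal (shrinkage) step
        t_new = (1 + np.sqrt(1 + 4 * t * t)) / 2
        z = b_new + ((t - 1) / t_new) * (b_new - b)     # momentum step
        b, t = b_new, t_new
    return b
```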
... Tavakoli et al. (2017b) and Dodge and Karam (2018) inferred saliency using a collection of deep neural networks. iSEEL's (Tavakoli et al., 2017b) deep neural networks are aided by an inter-image similarity retrieval unit to predict fixations. Inter-image similarity originates from the fact that people have similar fixation patterns towards similar images. ...
Article
Full-text available
Visual saliency models mimic the human visual system to gaze towards fixed pixel positions and capture the most conspicuous regions in the scene. They have proved their efficacy in several computer vision applications. This paper provides a comprehensive review of the recent advances in eye fixation prediction and salient object detection, harnessing deep learning. It also provides an overview on multi-modal saliency prediction that considers audio in dynamic scenes. The underlying network structure and loss function for each model are explored to realise how saliency models work. The survey also investigates the inclusion of specific low-level priors in deep learning-based saliency models. The public datasets and evaluation metrics are succinctly introduced. The paper also makes a discussion on the key issues in saliency modeling along with some open problems and growing research directions in the field.
... There have been attempts to combine DNNs with eye data to perform various tasks. Some basic tasks include predicting how an eye will move across presented stimuli, whether text-based (Sood et al., 2020b) or images in general (Ghariba et al., 2020;Li and Yu, 2016;Harel et al., 2006;Huang et al., 2015;Tavakoli et al., 2017). These predictions can be used to create saliency maps that show what areas of a visual display are attractive to the eye. ...
Conference Paper
Attention describes cognitive processes that are important to many human phenomena including reading. The term is also used to describe the way in which transformer neural networks perform natural language processing. While attention appears to be very different under these two contexts, this paper presents an analysis of the correlations between transformer attention and overt human attention during reading tasks. An extensive analysis of human eye tracking datasets showed that the dwell times of human eye movements were strongly correlated with the attention patterns occurring in the early layers of pre-trained transformers such as BERT. Additionally, the strength of a correlation was not related to the number of parameters within a transformer. This suggests that something about the transformers’ architecture determined how closely the two measures were correlated.
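The comparison described in this paper reduces to correlating per-token attention mass from an early transformer layer with per-token human dwell times. A hedged sketch of that computation, assuming the attention tensor and dwell times are already aligned token-for-token (the paper's exact aggregation may differ):

```python
import numpy as np
from scipy.stats import spearmanr

def attention_dwell_correlation(attn_layer, dwell_times):
    """Correlate early-layer transformer attention with human dwell times.

    attn_layer  : (heads, seq_len, seq_len) attention weights from one layer
    dwell_times : (seq_len,) total fixation duration per token
    Both are assumed to share the same tokenisation (an assumption).
    """
    # how much attention each token *receives*, averaged over heads and queries
    received = attn_layer.mean(axis=0).mean(axis=0)      # (seq_len,)
    rho, p = spearmanr(received, dwell_times)
    return rho, p
```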
... A large body of work has been directed at producing computational models that generate human saliency maps given an input of a particular image (for example Li & Yu [2016]; Ghariba et al. [2020]; Huang et al. [2015]; Harel et al. [2007]; Tavakoli et al. [2017]). An approach for generating a visual saliency model based on Markov Chains [Harel et al., 2007] represents the image as a fully connected graph. ...
... This achieves success in predicting human fixation points without requiring annotated data. Recent approaches [Li & Yu, 2016; Ghariba et al., 2020; Huang et al., 2015; Tavakoli et al., 2017] use convolutional neural networks in various configurations with training inputs being representative images and training labels being actual human saliency maps for these images that have been calculated by recording human eye fixation. Predicted saliency maps are typically represented as a matrix representing the field of vision, with elements being weights whose magnitude indicates the relative degree of attention a human might apply to that part of the image, or alternatively, the probability that a human might attend to that part. ...
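The two readings of a saliency map mentioned above (relative attention weights versus a probability of fixating each location) differ only by normalisation; a small illustrative sketch:

```python
import numpy as np

def to_probability_map(saliency, eps=1e-12):
    """Turn a raw saliency map (arbitrary non-negative weights) into a
    probability distribution over image locations (sums to 1)."""
    s = np.clip(np.asarray(saliency, dtype=float), 0.0, None)
    return s / (s.sum() + eps)
```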
... Flores et al. [2019] utilise such models (mainly Tavakoli et al. [2017] and Huang et al. [2015]) to demonstrate improvement in object classification tasks. They use the models as pre-learned saliency map generators to create saliency maps for the images in their training set and then pair the resulting maps and images as dual input into a convolutional model that fuses the two inputs before running through further convolutional layers prior to classification. ...
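The dual-input fusion described in this excerpt can be pictured as two convolutional stems, one over the RGB image and one over its precomputed saliency map, concatenated before further convolutions and a classifier head. A schematic PyTorch sketch follows; the layer sizes are assumptions, not Flores et al.'s architecture:

```python
import torch
import torch.nn as nn

class SaliencyFusionNet(nn.Module):
    """Image + saliency-map dual-input classifier (schematic, not the cited model)."""

    def __init__(self, n_classes=10):
        super().__init__()
        self.img_stem = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
                                      nn.MaxPool2d(2))
        self.sal_stem = nn.Sequential(nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(),
                                      nn.MaxPool2d(2))
        # fuse the two streams, then continue with shared convolutions
        self.fused = nn.Sequential(nn.Conv2d(40, 64, 3, padding=1), nn.ReLU(),
                                   nn.AdaptiveAvgPool2d(1))
        self.head = nn.Linear(64, n_classes)

    def forward(self, image, saliency):
        # image: (B, 3, H, W), saliency: (B, 1, H, W)
        x = torch.cat([self.img_stem(image), self.sal_stem(saliency)], dim=1)
        return self.head(self.fused(x).flatten(1))
```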
Preprint
Processes occurring in brains, a.k.a. biological neural networks, can and have been modeled within artificial neural network architectures. Due to this, we have conducted a review of research on the phenomenon of blindsight in an attempt to generate ideas for artificial intelligence models. Blindsight can be considered as a diminished form of visual experience. If we assume that artificial networks have no form of visual experience, then deficits caused by blindsight give us insights into the processes occurring within visual experience that we can incorporate into artificial neural networks. This article has been structured into three parts. Section 2 is a review of blindsight research, looking specifically at the errors occurring during this condition compared to normal vision. Section 3 identifies overall patterns from Section 2 to generate insights for computational models of vision. Section 4 demonstrates the utility of examining biological research to inform artificial intelligence research by examining computation models of visual attention relevant to one of the insights generated in Section 3. The research covered in Section 4 shows that incorporating one of our insights into computational vision does benefit those models. Future research will be required to determine whether our other insights are as valuable.
... A large body of work has been directed at producing computational models that generate human saliency maps given an input of a particular image (for example [Li and Yu, 2016;Ghariba et al., 2020;Huang et al., 2015;Harel et al., 2007;Tavakoli et al., 2017]). An approach for generating a visual saliency model based on Markov Chains [Harel et al., 2007] represents the image as a fully connected graph. ...
... This achieves success in predicting human fixation points without requiring annotated data. Recent approaches [Li and Yu, 2016; Ghariba et al., 2020; Huang et al., 2015; Tavakoli et al., 2017] use CNNs in various configurations with training inputs being representative images and training labels being actual human saliency maps for these images that have been calculated by recording human eye fixation. Predicted saliency maps are typically represented as a matrix representing the field of vision, with elements being weights whose magnitude indicates the relative degree of attention a human might apply to that part of the image, or alternatively, the probability that a human might attend to that part. ...
... Predicted saliency maps are typically represented as a matrix representing the field of vision, with elements being weights whose magnitude indicates the relative degree of attention a human might apply to that part of the image, or alternatively, the probability that a human might attend to that part. Flores et al. [2019] utilise such models (mainly Tavakoli et al. [2017] and Huang et al. [2015]) to demonstrate improvement in object classification tasks. They use the models as pre-learned saliency map generators to create saliency maps for the images in their training set and then pair the resulting maps and images as dual input into a convolutional model that fuses the two inputs before running through further convolutional layers prior to classification. ...
Article
Processes occurring in brains, a.k.a. biological neural networks, can and have been modeled within artificial neural network architectures. Due to this, we have conducted a review of research on the phenomenon of blindsight in an attempt to generate ideas for artificial intelligence models. Blindsight can be considered as a diminished form of visual experience. If we assume that artificial networks have no form of visual experience, then deficits caused by blindsight give us insights into the processes occurring within visual experience that we can incorporate into artificial neural networks. This paper has been structured into three parts. Section 2 is a review of blindsight research, looking specifically at the errors occurring during this condition compared to normal vision. Section 3 identifies overall patterns from Sec. 2 to generate insights for computational models of vision. Section 4 demonstrates the utility of examining biological research to inform artificial intelligence research by examining computational models of visual attention relevant to one of the insights generated in Sec. 3. The research covered in Sec. 4 shows that incorporating one of our insights into computational vision does benefit those models. Future research will be required to determine whether our other insights are as valuable.
... (4) Graph-based visual saliency (GBVS) model (Harel, Koch, & Perona, 2006) is a graphical model that uses low-level features to highlight conspicuities, which forms saliency maps using a Markovian model. (5) Inter-image similarity and ensemble of extreme learners (iSEEL) model (Tavakoli, Borji, Laaksonen, & Rahtu, 2017) defines saliency via inter-image similarities and extreme learning machines. (6) Learning discriminative subspaces (LDS) model (Fang, Li, Tian, Huang, & Chen, 2016) estimates saliency by learning a set of discriminative subspaces that highlight targets. ...
Article
As collaborative research between engineering and fashion, the purpose of this study was to investigate if saliency models can be applied for predicting consumers’ visual attention to fashion images such as fashion advertisements. A human subject study was conducted to record human visual fixations on 10 colour fashion advertisement images, which were randomly selected from fashion magazines. The participants include 67 college students (26 males and 41 females). All mouse-tracking locations on images were recorded and saved using Psychtoolbox-3 with MATLAB. The locations represent the human fixation points on the images and are used to generate fixation maps. This collaborative research is an innovative and pioneering approach to predict consumers’ visual attention toward fashion images using saliency models. From the results of this study, the engineering area’s saliency models were proven as effective measurements in predicting fashion consumers’ visual attention when looking at fashion images such as advertisements.
... They compared CNN architectures of different standards, such as AlexNet [41], VGGNet [42], and GoogLeNet [46], and demonstrated the effectiveness of their architecture, particularly the one based on the VGGNet. Thereafter, several VGGNet-based saliency prediction models have been proposed [47][48][49][50][51][52][53][54][55][56][57]. The aforementioned deep-learning-based saliency prediction models have achieved promising results. ...
Article
Full-text available
In recent years, the prediction of salient regions in RGB-D images has become a focus of research. Compared to its RGB counterpart, the saliency prediction of RGB-D images is more challenging. In this study, we propose a novel deep multimodal fusion autoencoder for the saliency prediction of RGB-D images. The core trainable autoencoder of the RGB-D saliency prediction model employs two raw modalities (RGB and depth/disparity information) as inputs and their corresponding eye-fixation attributes as labels. The autoencoder comprises four main networks: color channel network, disparity channel network, feature concatenated network, and feature learning network. The autoencoder can mine the complex relationship and make the utmost of the complementary characteristics between both color and disparity cues. Finally, the saliency map is predicted via a feature combination subnetwork, which combines the deep features extracted from a prior learning and convolutional feature learning subnetworks. We compare the proposed autoencoder with other saliency prediction models on two publicly available benchmark datasets. The results demonstrate that the proposed autoencoder outperforms these models by a significant margin.
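The four-network layout described here (colour branch, disparity branch, feature concatenation, feature learning) can be sketched as two encoders whose features are concatenated and decoded into a fixation map. A schematic PyTorch sketch with assumed layer sizes, not the authors' architecture:

```python
import torch
import torch.nn as nn

class RGBDSaliencyAE(nn.Module):
    """Schematic two-branch RGB-D saliency predictor (illustrative only)."""

    def __init__(self):
        super().__init__()
        def branch(in_ch):  # shared template for the colour / disparity encoders
            return nn.Sequential(nn.Conv2d(in_ch, 32, 3, padding=1), nn.ReLU(),
                                 nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU())
        self.rgb_enc = branch(3)
        self.disp_enc = branch(1)
        # "feature concatenated" + "feature learning" stages, then upsample to a map
        self.fuse = nn.Sequential(nn.Conv2d(128, 64, 3, padding=1), nn.ReLU(),
                                  nn.Conv2d(64, 1, 1),
                                  nn.Upsample(scale_factor=2, mode="bilinear",
                                              align_corners=False))

    def forward(self, rgb, disparity):
        # rgb: (B, 3, H, W), disparity: (B, 1, H, W)
        feats = torch.cat([self.rgb_enc(rgb), self.disp_enc(disparity)], dim=1)
        return torch.sigmoid(self.fuse(feats))   # per-pixel fixation probability
```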
... This transfer learning strategy boosted considerable progress on the performance of saliency prediction. In the following years, different powerful deep neural networks have been used as backbones of saliency prediction models, such as VGGNet [18][19][20][21][22][23][24], ResNet [25,26], DenseNet [27,28] and so on. In this work, we set VGG-16 [17], ResNet-50 [29], and DenseNet-161 [30] as the backbone, respectively, for performance and efficiency evaluation, and finally submitted the results of three versions to the SALICON benchmark for comparison with the state-of-the-art models. ...
Article
This paper proposes a deep convolutional neural network with a concise and effective encoder-decoder architecture for saliency prediction. Local and global contextual features make a considerable contribution to saliency prediction. In order to integrate and exploit these features more thoroughly, in the proposed pithy architecture, we deploy a dense and global context connection structure between the encoder and decoder, after that, a multi-scale readout module is designed to process various information from the previous portion of the decoder with different parallel mapping relationships for full-scale accurate results. Our model ranks first in light of multiple metrics on two famous saliency benchmarks and performs good generalization on other datasets. Besides, we evaluate the precision and the speed of our model with different backbones. The saliency prediction performance of VGGNet-Based, ResNet-based, and DenseNet-based model gradually increases while the speed also drops off. And the experiments illustrate that our model performs better than other models even if replacing the backbone of our model with the same backbone of the compared model. Therefore, we can provide optional versions of our model for different requirements of performance and efficiency.