Figure 5
Human versus random search. The first four classification images (from left to right) show the classification images for observer L.K.C. across the four targets. To simulate a searcher who looks about randomly, hoping to find the target by chance, the noise stimuli were sampled at random fixation points; the result of averaging the noise pixels at these points is shown at the far right. The lack of significant structure in the random searcher's classification image contrasts sharply with the obvious target-like structures generated by the observer. All classification images are displayed using the same range of gray scales to highlight the relative pixel magnitudes across classification images.

Source publication
Article
Full-text available
Visual search experiments have usually involved the detection of a salient target in the presence of distracters against a blank background. In such high signal-to-noise scenarios, observers have been shown to use visual cues such as color, size, and shape of the target to program their saccades during visual search. The degree to which these featu...

Context in source publication

Context 1
... mean equal to the mean (Beard & Ahumada, 1998). The statistically thresholded classification image corresponding to the classification image in Figure 3A is shown in Figure 3B. The statistically thresholded results for all observers and targets are shown in Figure 4. The first row in Figure 4 illustrates the four targets (circle, dipole, triangle, and bow tie) that observers were instructed to find in the stimulus. Each of the other rows shows the classification images for the three observers (L.K.C., U.R., and E.M.) for these targets. Each subject made an average of 3,000 fixations per target. Thus, each classification image is the result of averaging around 3,000 noise patches.

To quantify the uncertainty associated with the shapes of these classification images, we bootstrapped (Efron, 1994) the averaging procedure and computed the boundaries of each of the resulting thresholded classification images as follows (a minimal code sketch appears after this excerpt). To detect the boundaries of the statistically filtered classification images, the two largest regions in the statistically filtered image were detected using a connected-components algorithm. The outlines of each of these regions were then used to represent the boundary of the classification image. Bootstrapping was then used to verify the robustness of these boundaries. First, the ensemble of noise patches at the observer's point of gaze was resampled (with replacement) 200 times. The image patches in each bootstrap sample (3,000 patches, on average) were then averaged together, statistically filtered, and processed to detect the boundaries. The boundaries of the resulting 200 classification images were then added together and superimposed on the classification images to reflect the stability of the boundary. The aggregate of all of the bootstrapped boundaries is shown superimposed in red in Figure 4.

The most striking result of this analysis is the emergence of classification images (in Figure 4) that resemble spatial features of the target. Notice that the boundaries of most classification images from the bootstrap procedure are well defined, indicating that the shape information for these classification images is indeed reliable. (In fact, the full width of a bootstrapped contour at a given location gives an upper bound on the width of the 99.5% confidence interval about that location.) This indicates that although gaze is being rapidly shifted about the image in an effort to find the target as quickly as possible, saccadic programming is clearly influenced by spatial features in the noise, with sufficient precision to generate these robust classification images.

Also worthy of note is the fact that these classification images vary in a target-dependent manner within an observer (to varying degrees), and they also vary across observers for a given target. The data of L.K.C. and E.M., for example, show fairly dramatic changes across the targets, indicating that the gaze of these observers was attracted by features unique to a particular target. This effect is most pronounced for observer L.K.C. who, in the case of the circle, seemed to fixate regions of high luminance but modified the search template for the dipole by fixating regions whose luminance profile matches the horizontal edge of the dipole. This adaptation of the classification image to match some feature of the target is evident, albeit to a lesser extent, for observer E.M., who changes his strategy for the dipole, the triangle, and the bow tie.
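To make the bootstrap-and-boundary procedure above concrete, here is a minimal Python sketch (NumPy/SciPy). It assumes `patches` is an (n_fixations, h, w) array of noise patches extracted at the recorded fixations; the exact statistical thresholding of Beard and Ahumada (1998) is simplified to a pixelwise z-test, so this illustrates the pipeline rather than reproducing the authors' implementation.

```python
import numpy as np
from scipy import ndimage

def threshold_ci(patches, z_crit=3.0):
    """Average the noise patches and keep pixels whose deviation from
    the overall noise mean is statistically significant (a simplified
    stand-in for the Beard & Ahumada thresholding rule)."""
    ci = patches.mean(axis=0)
    se = patches.std(axis=0, ddof=1) / np.sqrt(len(patches))
    z = (ci - patches.mean()) / se
    return np.abs(z) > z_crit

def two_largest_regions(mask):
    """Keep the two largest connected components of a binary mask."""
    labels, n = ndimage.label(mask)
    if n == 0:
        return mask
    sizes = ndimage.sum(mask, labels, index=np.arange(1, n + 1))
    keep = np.argsort(sizes)[-2:] + 1
    return np.isin(labels, keep)

def boundary(mask):
    """Outline of a binary region: pixels removed by one erosion step."""
    return mask & ~ndimage.binary_erosion(mask)

def bootstrap_boundaries(patches, n_boot=200, rng=None):
    """Resample the fixation-patch ensemble with replacement and
    accumulate the resulting classification-image boundaries; the
    accumulated map can be superimposed (as in Figure 4) to show
    boundary stability."""
    rng = np.random.default_rng(rng)
    acc = np.zeros(patches.shape[1:], dtype=int)
    for _ in range(n_boot):
        sample = patches[rng.integers(0, len(patches), len(patches))]
        acc += boundary(two_largest_regions(threshold_ci(sample)))
    return acc
```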
Observer U.R., in contrast, seemed to use a simpler heuristic, consisting of fixating bright regions of roughly the appropriate size. The interobserver variability is evident for most of the shapes. Even in the case of the triangle, in which the classification images look fairly similar, closer inspection reveals some subtle but reliable differences. Observer L.K.C.'s gaze tends to land on the right side of bright regions that have a dark region further to the right, whereas observer U.R.'s gaze lands on the left side of bright regions that have dark regions further to the left. Both of these observers, however, are attracted to roughly circular bright areas on average, whereas observer E.M. clearly favors a more angular, elongated structure.

We also simulated an observer who randomly looks about the stimulus hoping to find the target by chance (this behavior was largely consistent with our own introspection in this very difficult task; observers reported often being surprised when their gaze landed on the target). We did this by simply selecting random spatial coordinates from a uniform distribution, with the constraint that the entire ROI surrounding the fixation point had to be within the image (sketched in code below). The number of random fixations used was approximately equal to the number of samples from the human observers (around 3,000). The result of averaging the noise pixels at the random fixation points is shown at the far right of Figure 5, where all the classification images are displayed using the same range of gray scales to highlight the relative pixel magnitudes across classification images. The lack of any image structure in the random-sampling case, together with the ability to generate many distinct classification images across subjects from the same set of 1/f noise stimuli, indicates that observers are not random searchers but are actually directing their fixations to regions that resemble some feature of the target.

To verify that observers were indeed trying to follow our instructions to find the target quickly, we analyzed the histograms of fixation durations (time per fixation) and saccadic magnitudes for observers' eye movements in this experiment. A relatively short fixation dwell time (mean = 0.25 s) and a wide range of saccadic magnitudes (mean = 3 deg, SD = 1.8 deg) confirmed that our observers were indeed trying to survey the stimulus area quickly and were not fixating on particular regions for an inordinately long time. These statistics reflect previously reported measures of fixation durations and saccade magnitudes in natural viewing tasks (Duchowski, 2002; Rayner, 1998; Yarbus, 1967).

Traditionally, experiments using classification image analysis have used additive white Gaussian noise as the masking stimulus. White noise images, being spatially uncorrelated by design, do not contribute any artificial structure to the classification images. In our experiments, we used 1/f noise as the stimulus because the presence of many large-scale, target-like salient features inherent in the noise structure made it an effective masker. Due to the correlated nature of 1/f noise, the resulting classification images are not unbiased linear templates. To obtain the true unbiased linear estimate, we can apply a prewhitening filter (Abbey & Eckstein, 2000) to the classification images obtained using 1/f noise. An example of the dipole classification image for observer L.K.C. before and after prewhitening is shown in Figure 6.
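The random-searcher simulation and the prewhitening step can both be sketched as follows. The ROI size, the assumption that the 1/f masking noise has a power spectrum proportional to 1/f^2, and the regularizer `eps` are illustrative choices, not values taken from the paper.

```python
import numpy as np

def random_fixation_ci(noise_images, roi=64, n_fix=3000, rng=None):
    """Classification image for a simulated random searcher: average
    ROI-sized noise patches at uniformly random fixation points,
    constrained so the whole ROI lies inside the image."""
    rng = np.random.default_rng(rng)
    h, w = noise_images[0].shape
    half = roi // 2
    acc = np.zeros((roi, roi))
    for _ in range(n_fix):
        img = noise_images[rng.integers(len(noise_images))]
        y = rng.integers(half, h - half)
        x = rng.integers(half, w - half)
        acc += img[y - half:y + half, x - half:x + half]
    return acc / n_fix

def prewhiten(ci, eps=1e-6):
    """Undo the correlation of the 1/f masking noise in the Fourier
    domain. Dividing by the noise power spectrum (assumed here to be
    proportional to 1/f^2) amounts to multiplying by f^2."""
    F = np.fft.fft2(ci)
    fy = np.fft.fftfreq(ci.shape[0])[:, None]
    fx = np.fft.fftfreq(ci.shape[1])[None, :]
    f = np.hypot(fy, fx)
    return np.real(np.fft.ifft2(F * (f ** 2 + eps)))
```

As a check on the design, the random-searcher image should show no structure: since the patches are drawn independently of image content, their average converges to the (flat) noise mean.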
However, even the classification images obtained using the prewhitening filter do not reflect the true unbiased template in our experiments. This is because, unlike the typical experimental situation in psychophysics, the contribution of each pixel to the classification image is shift-variant across trials (fixation points, in this case) due to several factors. First, oculomotor precision and measurement errors inherent in eye movements and their recording result in spatial uncertainty about the exact location of an observer's fixation. Second, even if we ignore errors in recording fixations and simply assume that an observer was using only a single visual feature to succeed in the search task, there is no guarantee that the observer would precisely fixate the same location on this feature every time. For example, assume that the observer always looked for black triangles in the case of the "bow-tie" search. During search, the observer could decide to fixate the left triangle on some trials and the right triangle on others. Thus, the noise samples extracted around these fixations are not necessarily perfectly aligned across trials, resulting in spatially blurred classification images.

However, this does not imply that observers are unable to use precise shape information. In a related study (Beutter, Eckstein, & Stone, 2004), subjects performed an 8-AFC contrast discrimination task, and classification images were generated using both a saccade-contingent analysis and an 8-AFC perceptual decision framework. The resulting classification images for these two cases were not found to be significantly different from each other, indicating that saccade mechanisms can indeed use precise shape information to guide search.

Analysis of stimuli at the observer's point of gaze can provide an understanding of the strategies used by observers in visual tasks. Although existing evidence for the guided saccade-targeting hypothesis has come from measures of task-performance efficiency or reaction times, in this paper we demonstrated that classification image analysis and accurate eye tracking can be used in conjunction to reveal the shape cues that guided saccades in a difficult visual search task. Our results indicate that even in very noisy stimuli, human observers are not random in their search strategy; instead, they make directed eye movements to regions in the stimuli that resemble some structural feature of the target. Given the difficulty of the task and the rapidity with which fixations are made, we find it remarkable that the visual system seems to analyze spatial structure in the low-resolution periphery and direct fixations to regions that resemble some spatial feature of the target. The goal of this study was to investigate the influence of structural cues in visual search. However, reverse correlation analysis of noise stimuli at the point of gaze of observers has also been used to provide valuable ...

Similar publications

Article
Full-text available
The development of dynamic vision was investigated in 400 healthy subjects (200 females and 200 males) aged between 4 and 24 years. The test consisted of a computer-generated random-dot kinematogram in which a Landolt ring was briefly presented as a form-from-motion stimulus. Motion contrast between the ring and background was varied in terms of th...
Article
Full-text available
People are able to perceive the 3D shape of illuminated surfaces using image shading cues. Theories about how we accomplish this often assume that the human visual system estimates a single lighting direction and interprets shading cues in accord with that estimate. In natural scenes, however, lighting can be much more complex than this, with multi...
Article
Full-text available
Four experiments addressed the role of cast shadows of the body in orienting tactile spatial attention to the body itself. We used a modified spatial-cueing paradigm to examine whether viewing of the cast shadow of a hand can elicit spatial shifts of tactile attention to that very same hand. Participants performed a speeded tactile-discrimination t...
Article
Full-text available
Motion-defined form can seem to persist briefly after motion ceases, before seeming to gradually disappear into the background. Here we investigate if this subjective persistence reflects a signal capable of improving objective measures of sensitivity to static form. We presented a sinusoidal modulation of luminance, masked by a background noise pa...
Article
Full-text available
Human observers explore scenes by shifting their gaze from object to object. Before each eye movement, a peripheral glimpse of the next object to be fixated has however already been caught. Here we investigate whether the perceptual organization extracted from such a preview could guide the perceptual analysis of the same object during the next fix...

Citations

... Image resolution plays an important role in decision-making, as higher-resolution visualizations enhance performance because key details are more apparent (Yeshurun & Carrasco, 1998). Coarsening the resolution of a visualization decreases the signal-to-noise ratio, which makes signals harder to differentiate (Rajashekar et al., 2006). In many cases, quantities are viewed in maps and other visualizations by "mapping" values to colors. ...
... the linear correlation coefficient (CC) value [48], the normalized scanpath saliency (NSS) value [49], and F values. ROC curves of the proposed method and the traditional salient-target detection methods on the four popular datasets are shown in Fig. 8. ...
Article
A maritime target saliency detection method, inspired by the stimulation competition and selection mechanism of raptor vision, is presented for the airborne vision system of an unmanned aerial vehicle (UAV) operating in an unknown maritime environment. The stimulation competition and selection mechanism in the visual pathway of raptor vision, based on the phenomenon of raptors capturing prey in complex scenes, is studied. A mathematical model of this mechanism is then established and employed for salient object detection. Popular image datasets and practical scene datasets are used to verify the effectiveness of the presented method. Results show that the detection performance of the proposed method is better than that of the comparison methods. The proposed algorithm offers an approach to maritime target saliency detection and to cross-domain joint missions for UAVs and other unmanned equipment.
... In a typical experiment using the CI method, the observer's responses to a visual target embedded in white noise are collected, and the information in the stimulus that affected the observer's response is mapped out by analyzing the correlation between the noise and the response on each trial. The CI method has been widely used to reveal the spatiotemporal distribution of the critical information (or the perceptive field) that determines observers' judgments in various visual tasks with static and dynamic stimuli [35-41]. One of these studies applied the CI method to Posner's cueing paradigm [13] and showed that the weight of information in the CI is greater at the spatial location where attention was directed [36]. However, observers in these studies made judgments after the visual stimuli had been shown, as in many psychophysical reverse-correlation studies [28-30,32,33]. ...
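For a yes/no task, the correlation-based recipe in this excerpt reduces to a difference of noise averages; here is a minimal sketch (the array shapes are assumptions):

```python
import numpy as np

def classification_image(noise, responses):
    """Minimal yes/no classification image: the average noise on
    'yes' trials minus the average noise on 'no' trials, which is
    proportional to the pixelwise noise-response correlation.

    noise:     (n_trials, h, w) array of the noise shown on each trial
    responses: (n_trials,) array of 0/1 observer responses
               (both response classes must occur at least once)
    """
    responses = np.asarray(responses, dtype=bool)
    return noise[responses].mean(0) - noise[~responses].mean(0)
```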
Article
Full-text available
In many situations, humans serially sample information from many locations in an image to make an appropriate decision about a visual target. Spatial attention and eye movements play a crucial role in this serial vision process. To investigate the effect of spatial attention in such dynamic decision making, we applied a classification image (CI) analysis locked to the observer’s reaction time (RT). We asked human observers to detect as rapidly as possible a target whose contrast gradually increased on the left or right side of dynamic noise, with the presentation of a spatial cue. The analysis revealed a spatiotemporally biphasic profile of the CI which peaked at ~ 350 ms before the observer’s response. We found that a valid cue presented at the target location shortened the RT and increased the overall amplitude of the CI, especially when the cue appeared 500–1250 ms before the observer's response. The results were quantitatively accounted for by a simple perceptual decision mechanism that accumulates the outputs of the spatiotemporal contrast detector, whose gain is increased by sustained attention to the cued location.
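The decision mechanism described in this abstract can be caricatured as a single bounded accumulator with an attentional gain; the sketch below is a generic drift-to-bound model, with all parameter values illustrative rather than fitted values from the study.

```python
import numpy as np

def simulate_rt(detector_output, gain=1.0, bound=50.0, dt=0.01,
                noise_sd=1.0, rng=None):
    """Accumulate noisy detector output, scaled by an attentional
    gain, until it crosses a bound; the crossing time is the
    predicted reaction time (None if no crossing occurs)."""
    rng = np.random.default_rng(rng)
    evidence = 0.0
    for t, x in enumerate(detector_output):
        evidence += gain * x * dt + rng.normal(0.0, noise_sd * np.sqrt(dt))
        if evidence >= bound:
            return t * dt  # RT in seconds
    return None
```

On this account, a valid cue raises `gain`, so the bound is reached sooner, which is one way to reproduce the shorter RTs reported above.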
... Similarly, the Pearson correlation coefficient is applied; it quantifies how one variable changes as the other changes. The implementation details are discussed in Sections C and D. Using these techniques, the explainability of these two features is ensured, an approach inspired by [12] on visual search in noise. ...
Article
Full-text available
Classification of ECG noise (unwanted disturbance) plays a crucial role in the development of automated analysis systems for accurate diagnosis and detection of cardiac abnormalities. This paper mainly deals with feature engineering of ECG signals to build robust systems with better detection rates. We use the human visual perception paradigm as the image-analysis method for extracting new features from the signals. AI, in general, is the development of computer systems that mimic human intelligence, such as visual perception, speech recognition, decision-making, and translation between languages. This paper is principally focused on reproducing human perception of visual cues, using ECG signal points and additional features extracted through image processing. For a human, noisy signals are easy to differentiate because the visual system uses cues such as the color, size, and shape of the target to program saccades during visual search; the naked eye distinguishes noise by the shape and density of the signal. Mimicking these strategies of the human eye, we built a model with two new features, extracted in addition to the ECG signals, that improve detection accuracy. Adding these two features, namely (1) the number of peaks and (2) the compactness, to the ECG signals yields better noise detection and classification rates than more complex state-of-the-art modeling methods. The framework achieves an average sensitivity of 99%, specificity of 100%, and overall accuracy of 100%.
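The two features named in this abstract could be computed along the following lines; note that the paper does not spell out its exact definition of compactness, so the version below is a labeled guess.

```python
import numpy as np
from scipy.signal import find_peaks

def peak_count(ecg, prominence=0.5):
    """Feature 1: number of prominent peaks in an ECG segment.
    The prominence threshold is illustrative and signal-dependent."""
    peaks, _ = find_peaks(ecg, prominence=prominence)
    return len(peaks)

def compactness(ecg):
    """Feature 2: one plausible 'compactness' measure, the ratio of
    the signal's spread around its mean to its total excursion; the
    paper's exact definition is not given, so this is a guess."""
    excursion = np.ptp(ecg)
    return float(np.std(ecg) / excursion) if excursion else 0.0
```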
... Identification of visual search targets can be affected by many factors including the variability of distractor content, similarity of target to distractors, or the prevalence of a target relative to the number of distractors (Duncan & Humphreys, 1989;Hout, Walenchok, Goldinger, & Wolfe, 2015). Some theories of visual search assert that eye movements during visual search tasks are typically guided by low-level visual features such as color and shape (Rajashekar, Bovik, & Cormack, 2006). Performance decrements can occur when these features are less perceivable due to insufficient spatial or temporal resolution. ...
... However, if there is insufficient resolution, these signal features will not be distinguished from the surrounding noise, regardless of top-down processes guiding search (Foulsham & Underwood, 2008;Swensson & Judy, 1981). Resolution degradation decreases the signal-to-noise ratio of image and video stimuli (Rajashekar et al., 2006), so relevant signals become more difficult to differentiate. Spatial resolution can decrease classification accuracy in searching static and dynamic imagery (see Figure 1 for examples of varying spatial resolution). ...
... There was greater response conflict when observers had to classify events involving humans compared to vehicles. This was understandable since events involving human activity are potentially less salient due to their smaller visual size (Rajashekar et al., 2006). How the event was classified (present or absent) also had a significant impact on response curvature. ...
Article
Security screeners identify targets from surveillance videos. Their performance can fluctuate as resolution or video content salience varies. Image spatial resolution may be degraded if a camera has low pixel density. Similarly, temporal resolution can be degraded by transmission interference or low bandwidth. Both types of deteriorated resolution can increase observer uncertainty and reduce target identification accuracy. Further, these outcomes may reflect an interaction between target type and resolution. To help quantify, study, and explain possible interactions, we utilize process-tracing methodology to understand the cognitive dynamics of surveillance decisions. These insights can improve the development of surveillance augmentation technologies and processes. Mouse tracking is a robust process-tracing method successfully leveraged to observe continuously unfolding cognitive processes. Lateral vacillating mouse movements reflect decision uncertainty, and trajectory curvature serves as a measure of the cognitive conflict exerted by the unchosen alternative. We utilized mouse tracking to measure decision conflict during a video surveillance task whose salience varied with resolution and target type. Greater cognitive conflict was found when observers classified person-related targets compared to vehicle-related targets. Trajectories indicated stronger certainty when classifying events as having occurred than when indicating events did not occur. Under degraded spatial resolution, greater cognitive conflict was seen when observers were moderately confident compared to strongly confident or reporting themselves as guessing. Under degraded temporal resolution, the greatest cognitive conflict occurred when observers reported they were guessing. Beyond these findings, we have demonstrated the viability of mouse tracking as an unobtrusive process-tracing method for measuring uncertainty during realistic surveillance.
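A common way to quantify the trajectory curvature this abstract relies on is the maximum deviation of the cursor path from the direct start-to-end line; the authors' specific index is not stated, so the sketch below shows one standard choice.

```python
import numpy as np

def max_deviation(xs, ys):
    """Maximum perpendicular deviation of a mouse trajectory from the
    straight line joining its start and end points, a standard
    mouse-tracking index of attraction toward the unchosen option."""
    p = np.column_stack([xs, ys]).astype(float)
    start, end = p[0], p[-1]
    d = end - start
    n = np.hypot(*d)
    if n == 0:
        return 0.0
    # signed perpendicular distance of every sample from the line
    dev = ((p - start) @ np.array([-d[1], d[0]])) / n
    return float(np.max(np.abs(dev)))
```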
... Human handling of different types of noise is a well-understood problem, especially in the image classification domain [47,5,23]. For example, Geirhos et al. investigated human performance on an object detection task and compared it to state-of-the-art Deep Neural Networks (DNNs) [23]. ...
Chapter
With crime migrating to the web, the detection of abusive robotic behaviour is becoming more important. In this paper, we propose a new audio CAPTCHA construction that builds upon the Cocktail Party problem (CPP) to detect robotic behaviour. We evaluate our proposed solution in terms of both performance and usability. Finally, we explain how to deploy such an acoustic CAPTCHA in the wild with strong security guarantees.
... In a visual search task, "error fixations" prior to locating the target (Fig. 1) are more likely to be on objects similar to the target [4,9]. Therefore, it is possible to decode target information from the eye movements [2,12,8]. Existing target decoding methods are limited in using elementary search statistics [8], or handcrafted features [2,12]. ...
... Therefore, it is possible to decode target information from the eye movements [2,12,8]. Existing target decoding methods are limited in using elementary search statistics [8], or handcrafted features [2,12]. Moreover, existing approaches have only been tested with pre-defined classes and limited object set sizes. ...
... However, alternative ways of combining feature similarity maps led to large differences in natural images. InferNet outperforms the alternative models, suggesting that error fixations are guided by sub-patterns of the search target [8]. ...
Preprint
Can we infer intentions from a person's actions? As an example problem, here we consider how to decipher what a person is searching for by decoding their eye movement behavior. We conducted two psychophysics experiments where we monitored eye movements while subjects searched for a target object. We defined the fixations falling on non-target objects as "error fixations". Using those error fixations, we developed a model (InferNet) to infer what the target was. InferNet uses a pre-trained convolutional neural network to extract features from the error fixations and computes a similarity map between the error fixations and all locations across the search image. The model consolidates the similarity maps across layers and integrates these maps across all error fixations. InferNet successfully identifies the subject's goal and outperforms competitive null models, even without any object-specific training on the inference task.
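The InferNet idea described here (CNN features of error-fixation patches compared against all locations of the search image) can be approximated with an off-the-shelf backbone; the ResNet-18 choice, the single feature layer, and the pooling below are assumptions, not the authors' architecture.

```python
import torch
import torch.nn.functional as F
from torchvision.models import resnet18, ResNet18_Weights

weights = ResNet18_Weights.DEFAULT
# Drop the average-pool and classifier heads to keep spatial feature maps.
backbone = torch.nn.Sequential(
    *list(resnet18(weights=weights).children())[:-2]).eval()
preprocess = weights.transforms()  # resize + normalize, expects PIL images

@torch.no_grad()
def target_map(image, fixation_patches):
    """Average cosine-similarity map between each error-fixation patch
    and all locations of the search image, in CNN feature space (the
    map is at the backbone's output resolution)."""
    feat = backbone(preprocess(image).unsqueeze(0))   # 1 x C x H x W
    feat = F.normalize(feat, dim=1)
    maps = []
    for patch in fixation_patches:
        v = backbone(preprocess(patch).unsqueeze(0)).mean((2, 3))  # 1 x C
        v = F.normalize(v, dim=1)
        maps.append((feat * v[:, :, None, None]).sum(1))  # 1 x H x W
    return torch.stack(maps).mean(0)  # consolidated across fixations
```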
... Here, we show that serial dependence can also be investigated through a three-alternative forced-choice classification task, which has the advantage of not presenting any visual stimulus during the response. Finally, our results highlight the importance of serial dependence in the domain of visual search [81][82][83] . They show that serial dependence can strongly bias subsequent search for items, but only if the items are similar and are presented within a limited temporal and spatial window. ...
Article
Full-text available
In everyday life, we continuously search for and classify objects in the environment around us. This kind of visual search is extremely important when performed by radiologists in cancer image interpretation and officers in airport security screening. During these tasks, observers often examine large numbers of uncorrelated images (tumor x-rays, checkpoint x-rays, etc.) one after another. An underlying assumption of such tasks is that search and recognition are independent of our past experience. Here, we simulated a visual search task reminiscent of medical image search and found that shape classification performance was strongly impaired by recent visual experience, biasing classification errors 7% more towards the previous image content. This perceptual attraction exhibited the three main tuning characteristics of Continuity Fields: serial dependence extended over 12 seconds back in time (temporal tuning), it occurred only between similar tumor-like shapes (feature tuning), and only within a limited spatial region (spatial tuning). Taken together, these results demonstrate that serial dependence influences shape perception and occurs in visual search tasks. They also raise the possibility of a detrimental impact of serial dependence in clinical and practically relevant settings, such as medical image perception.
... Here, we focused primarily on visual recognition. Rajashekar et al. (2006) used classification images to estimate the template that guides saccades during the search for simple visual targets, such as triangles or circles. Caspi et al. (2004) measured temporal classification images to study how the saccadic targeting system integrates information over time. ...
Preprint
A white noise analysis of modern deep neural networks is presented to unveil their biases at the whole-network level or the single-neuron level. Our analysis is based on two popular and related methods in psychophysics and neurophysiology, namely classification images and spike-triggered analysis. These methods have been widely used to understand the underlying mechanisms of sensory systems in humans and monkeys. We leverage them to investigate the inherent biases of deep neural networks and to obtain a first-order approximation of their functionality. We emphasize CNNs, since they are currently the state-of-the-art methods in computer vision and are a decent model of human visual processing. In addition, we study multi-layer perceptrons, logistic regression, and recurrent neural networks. Experiments on four classic datasets, MNIST, Fashion-MNIST, CIFAR-10, and ImageNet, show that the computed bias maps resemble the target classes and, when used for classification, lead to performance more than twice the chance level. Further, we show that classification images can be used to attack a black-box classifier and to detect adversarial patch attacks. Finally, we utilize spike-triggered averaging to derive the filters of CNNs and explore how the behavior of a network changes when neurons in different layers are modulated. Our effort illustrates a successful example of borrowing from neuroscience to study ANNs and highlights the importance of cross-fertilization and synergy across machine learning, deep learning, and computational neuroscience.
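At its core, this classification-image analysis of a network reduces to averaging white-noise inputs by predicted class; here is a self-contained sketch in which `model_predict` is a stand-in for any trained classifier.

```python
import numpy as np

def bias_maps(model_predict, n_classes, shape=(28, 28),
              n_samples=100_000, rng=None):
    """Feed white-noise images to a black-box classifier and average
    the noise per predicted class; the per-class averages are the
    'bias maps'. `model_predict` maps a batch of images (n, h, w) to
    an array of integer class labels."""
    rng = np.random.default_rng(rng)
    sums = np.zeros((n_classes, *shape))
    counts = np.zeros(n_classes)
    for _ in range(n_samples // 1000):
        batch = rng.standard_normal((1000, *shape))
        labels = np.asarray(model_predict(batch))
        for k in range(n_classes):
            sel = batch[labels == k]
            if len(sel):
                sums[k] += sel.sum(0)
                counts[k] += len(sel)
    return sums / np.maximum(counts, 1)[:, None, None]
```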
... The Peak value of the Dice Similarity Coefficient (PoDSC) [27] corresponds to the optimal threshold, and it is the best possible evaluation score. Linear Correlation Coefficient (CC) [22]: CC is also called Pearson's linear coefficient. It measures the strength of the linear correlation between two variables. ...
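The two saliency metrics named in these excerpts have standard definitions, sketched below (representing fixation coordinates as integer (y, x) pairs is an assumed convention):

```python
import numpy as np

def cc(saliency, fixmap):
    """Linear (Pearson) correlation coefficient between a predicted
    saliency map and a ground-truth fixation density map."""
    s = (saliency - saliency.mean()) / saliency.std()
    f = (fixmap - fixmap.mean()) / fixmap.std()
    return float((s * f).mean())

def nss(saliency, fix_points):
    """Normalized scanpath saliency: mean of the z-scored saliency
    map at the ground-truth fixation locations, given as an
    (n, 2) array of integer (y, x) coordinates."""
    z = (saliency - saliency.mean()) / saliency.std()
    ys, xs = np.asarray(fix_points).T
    return float(z[ys, xs].mean())
```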
Article
Full-text available
Visual saliency is the distinct perceptual quality that makes some subsets of an image stand out from their neighbours and immediately grab human attention in early vision. Visual saliency is useful for locating regions of interest, and quick saliency detection is desirable in applications that use them. The paper embeds a new heuristic module in the original hypercomplex Fourier transform-based model. It generates only the saliency maps falling on the search path, and hence reduces the number of intermediate saliency maps from N to an average of log2(N) + 1, significantly speeding up the original saliency model.