Figure 3
Higher degree of invariance (DoI) in ISL compared to C2. (A) View tolerance at the level of C2 units. Each tuning curve shows the degree of invariance in the responses of C2 units for a particular viewing angle (face view). Only a subset of tuning curves is presented (details for every view are shown in Supplementary Fig. S1). The vertical axis is the correlation between the feature vectors of a set of subjects at one reference view and the feature vectors of the same subjects across different views. The horizontal axis indicates different views in steps of 5°. The colored horizontal lines underneath each curve mark the range of significant DoI (p < 0.02, rank-sum test) for that view. Each row in the invariance matrix, below the tuning curves, corresponds to the tuning curve for one face viewpoint (viewing angles are separated by 5°, from −90° in the first row to +90° in the last row; head poses and camera positions are shown schematically along the horizontal axis). The color bar at the right indicates the range of correlation values. The gray horizontal lines overlaid on the invariance matrix show the degree of invariance for every view, as in the tuning curves (rank-sum test). (B) View tolerance at the level of ISLs. (C) Summary of view-tolerance responses for each face view in C2 units and ISLs. Each bar shows the DoI for a face view for C2 units (red bars) and ISLs (blue bars). The horizontal axis shows different face views. (D) Average DoI across all views for ISL and C2, calculated from the data shown in (C).
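
As a rough illustration of how the DoI tuning curves described in this caption could be computed, the sketch below correlates per-subject feature vectors at a reference view with the same subjects' vectors at every other view, and marks significance with a rank-sum test. This is a minimal sketch, not the authors' code: the `features` dictionary, the use of Pearson correlation, and the same-identity versus different-identity baseline assumed for the significance test are all illustrative assumptions.

    # Minimal sketch, assuming `features[view]` is an (n_subjects, n_features)
    # array of C2 or ISL responses for one face view; views span -90..+90 in 5-degree steps.
    import numpy as np
    from scipy.stats import ranksums

    views = list(range(-90, 95, 5))

    def tuning_curve(features, ref_view):
        """Across-view correlation curve for one reference view, averaged over subjects."""
        ref = features[ref_view]                                   # (n_subjects, n_features)
        curve = []
        for v in views:
            r = [np.corrcoef(ref[s], features[v][s])[0, 1]         # per-subject correlation
                 for s in range(ref.shape[0])]
            curve.append(np.mean(r))
        return np.asarray(curve)

    def is_significant(same_identity_r, different_identity_r, alpha=0.02):
        """Assumed criterion: same-identity correlations exceed a different-identity
        baseline at p < 0.02 (two-sided rank-sum test)."""
        return ranksums(same_identity_r, different_identity_r).pvalue < alpha
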


Source publication
Article
Full-text available
Converging reports indicate that face images are processed through specialized neural networks in the brain, i.e., face patches in monkeys and the fusiform face area (FFA) in humans. These studies were designed to find out how faces are processed in the visual system compared with other objects. Yet, the underlying mechanism of face processing is not comp...

Similar publications

Article
Full-text available
Perceptual accuracy is known to be influenced by stimulus location within the visual field. In particular, it seems to be enhanced in the lower visual hemifield for motion and space processing, and in the upper visual hemifield for object and face processing. The origins of such asymmetries are attributed to attentional biases across the visual fiel...

Citations

... What are the computational principles that give rise to the representational hierarchy evident in the face-patch [...] [8,9]. Diverse CNN models, trained on tasks such as face identification [10][11][12], object recognition [13], inverse graphics [14], sparse coding [15], and unsupervised generative modeling [16] have all been shown to replicate at least some aspects of face-patch system representations. Face-selective artificial neurons occur even in untrained CNNs [17], and functional specializa- [...] [4]. ...
Article
Full-text available
Primates can recognize objects despite 3D geometric variations such as in-depth rotations. The computational mechanisms that give rise to such invariances are yet to be fully understood. A curious case of partial invariance occurs in the macaque face-patch AL and in fully connected layers of deep convolutional networks in which neurons respond similarly to mirror-symmetric views (e.g., left and right profiles). Why does this tuning develop? Here, we propose a simple learning-driven explanation for mirror-symmetric viewpoint tuning. We show that mirror-symmetric viewpoint tuning for faces emerges in the fully connected layers of convolutional deep neural networks trained on object recognition tasks, even when the training dataset does not include faces. First, using 3D objects rendered from multiple views as test stimuli, we demonstrate that mirror-symmetric viewpoint tuning in convolutional neural network models is not unique to faces: it emerges for multiple object categories with bilateral symmetry. Second, we show why this invariance emerges in the models. Learning to discriminate among bilaterally symmetric object categories induces reflection-equivariant intermediate representations. AL-like mirror-symmetric tuning is achieved when such equivariant responses are spatially pooled by downstream units with sufficiently large receptive fields. These results explain how mirror-symmetric viewpoint tuning can emerge in neural networks, providing a theory of how they might emerge in the primate brain. Our theory predicts that mirror-symmetric viewpoint tuning can emerge as a consequence of exposure to bilaterally symmetric objects beyond the category of faces, and that it can generalize beyond previously experienced object categories.
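
The mechanism proposed in this abstract, reflection-equivariant intermediate maps followed by spatial pooling over a large receptive field, can be illustrated with a small numerical sketch. The example below is a toy, not the authors' model: it forces equivariance with a horizontally symmetric filter, whereas in the paper the equivariance is learned from bilaterally symmetric object categories.

    import numpy as np

    rng = np.random.default_rng(0)
    img = rng.random((32, 32))
    mirrored = img[:, ::-1]                     # left-right reflection of the input

    # A horizontally symmetric filter makes the convolution reflection-equivariant:
    # conv(mirror(x)) == mirror(conv(x)).
    base = rng.random((5, 5))
    sym_filter = 0.5 * (base + base[:, ::-1])

    def conv2d_valid(x, k):
        kh, kw = k.shape
        out = np.zeros((x.shape[0] - kh + 1, x.shape[1] - kw + 1))
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
        return out

    fmap = conv2d_valid(img, sym_filter)
    fmap_mirrored = conv2d_valid(mirrored, sym_filter)

    # Equivariance: the mirrored image's feature map is the spatial mirror of the original's.
    assert np.allclose(fmap_mirrored, fmap[:, ::-1])

    # A downstream unit pooling over all positions (a large receptive field) therefore
    # responds identically to the two views: mirror-symmetric tuning.
    assert np.isclose(fmap.sum(), fmap_mirrored.sum())
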
... The "face inversion effect" leads to disproportionally reduced recognition performance and neural activity for inverted faces compared with inverted nonface objects (Kanwisher et al., 1998;Rossion & Gauthier, 2002;Yovel & Kanwisher, 2005). This effect was already reported in simpler computational models (Farzmahdi et al., 2016;Hosoya & Hyvärinen, 2017) and more recently in various DCNNs (Dobs et al., 2022b;Tian et al., 2022;Vinken et al., 2022;Xu et al., 2021;Yovel et al., 2022a;Zeman et al., 2022). While recent evidence suggests that the "face inversion effect" manifests only in face-id models (Dobs et al., 2022b), it has also been observed to a limited extent in face-de models (Tian et al., 2022;Xu et al., 2021), and in object-cat models that contained occasional faces (Vinken et al., 2022). ...
Preprint
Full-text available
Deep convolutional neural networks (DCNNs) have become the state-of-the-art computational models of biological object recognition. Their remarkable success has helped vision science break new ground and recent efforts have started to transfer this achievement to research on biological face recognition. In this regard, face detection can be investigated by comparing face-selective biological neurons and brain areas to artificial neurons and model layers. Similarly, face identification can be examined by comparing in vivo and in silico multidimensional "face spaces". In this review, we summarize the first studies that use DCNNs to model biological face recognition. On the basis of a broad spectrum of behavioral and computational evidence, we conclude that DCNNs are useful models that closely resemble the general hierarchical organization of face recognition in the ventral visual pathway and the core face network. In two exemplary spotlights, we emphasize the unique scientific contributions of these models. First, studies on face detection in DCNNs indicate that elementary face selectivity emerges automatically through feedforward processing even in the absence of visual experience. Second, studies on face identification in DCNNs suggest that identity-specific experience and generative mechanisms facilitate this particular challenge. Taken together, as this novel modeling approach enables close control of predisposition (i.e., architecture) and experience (i.e., training data), it may be suited to inform long-standing debates on the substrates of biological face recognition.
... Thus, achieving high face recognition accuracy in machines (and possibly also humans) requires not only extensive face experience, but extensive experience within each of multiple face types. This finding, along with our finding that the face inversion effect arises spontaneously in CNNs trained to discriminate face identities but not in CNNs trained on face detection and/or object classification, accords with other findings showing signatures of human face perception in face-identity-trained networks, such as face familiarity effects (23), the Thatcher illusion (47) and view-invariant identity representations (31,48). ...
Article
Human face recognition is highly accurate and exhibits a number of distinctive and well-documented behavioral "signatures" such as the use of a characteristic representational space, the disproportionate performance cost when stimuli are presented upside down, and the drop in accuracy for faces from races the participant is less familiar with. These and other phenomena have long been taken as evidence that face recognition is "special". But why does human face perception exhibit these properties in the first place? Here, we use deep convolutional neural networks (CNNs) to test the hypothesis that all of these signatures of human face perception result from optimization for the task of face recognition. Indeed, as predicted by this hypothesis, these phenomena are all found in CNNs trained on face recognition, but not in CNNs trained on object recognition, even when additionally trained to detect faces while matching the amount of face experience. To test whether these signatures are in principle specific to faces, we optimized a CNN on car discrimination and tested it on upright and inverted car images. As we found for face perception, the car-trained network showed a drop in performance for inverted vs. upright cars. Similarly, CNNs trained on inverted faces produced an inverted face inversion effect. These findings show that the behavioral signatures of human face perception reflect and are well explained as the result of optimization for the task of face recognition, and that the nature of the computations underlying this task may not be so special after all.
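
A simple way to quantify the inversion effect reported in this abstract is to compare a trained network's accuracy on upright versus vertically flipped test images. The sketch below is a generic measurement under assumed inputs (an identity classifier and a labeled image batch), not the authors' evaluation pipeline.

    import torch

    def accuracy(model, images, labels):
        """Top-1 accuracy of an identity classifier on a batch of (N, C, H, W) images."""
        with torch.no_grad():
            preds = model(images).argmax(dim=1)
        return (preds == labels).float().mean().item()

    def inversion_effect(model, images, labels):
        """Upright minus inverted accuracy; a larger gap means a stronger inversion effect."""
        upright = accuracy(model, images, labels)
        inverted = accuracy(model, torch.flip(images, dims=[-2]), labels)  # flip the height axis
        return upright - inverted
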
... The "face inversion effect" leads to disproportionally reduced recognition performance and neural activity for inverted faces compared with inverted nonface objects ( Yovel & Kanwisher, 2005;Rossion & Gauthier, 2002;Kanwisher, Tong, & Nakayama, 1998). This effect was already reported in simpler computational models (Hosoya & Hyvärinen, 2017;Farzmahdi, Rajaei, Ghodrati, Ebrahimpour, & Khaligh-Razavi, 2016) and more recently in various DCNNs (Dobs, Yuan, et al., 2022;Tian, Xie, Song, Hu, & Liu, 2022;Vinken et al., 2022;Yovel, Grosbard, & Abudarham, 2022a;Zeman, Leers, & de Beeck, 2022;Xu, Zhang, Zhen, & Liu, 2021). While recent evidence suggests that the "face inversion effect" manifests only in face-id models (Dobs, Yuan, et al., 2022), it has also been observed to a limited extent in face-de models (Tian et al., 2022;Xu et al., 2021), and in object-cat models that contained occasional faces (Vinken et al., 2022). ...
Article
Full-text available
Deep convolutional neural networks (DCNNs) have become the state-of-the-art computational models of biological object recognition. Their remarkable success has helped vision science break new ground, and recent efforts have started to transfer this achievement to research on biological face recognition. In this regard, face detection can be investigated by comparing face-selective biological neurons and brain areas to artificial neurons and model layers. Similarly, face identification can be examined by comparing in vivo and in silico multidimensional "face spaces." In this review, we summarize the first studies that use DCNNs to model biological face recognition. On the basis of a broad spectrum of behavioral and computational evidence, we conclude that DCNNs are useful models that closely resemble the general hierarchical organization of face recognition in the ventral visual pathway and the core face network. In two exemplary spotlights, we emphasize the unique scientific contributions of these models. First, studies on face detection in DCNNs indicate that elementary face selectivity emerges automatically through feedforward processing even in the absence of visual experience. Second, studies on face identification in DCNNs suggest that identity-specific experience and generative mechanisms facilitate this particular challenge. Taken together, as this novel modeling approach enables close control of predisposition (i.e., architecture) and experience (i.e., training data), it may be suited to inform long-standing debates on the substrates of biological face recognition.
... Thus, achieving high face recognition accuracy in machines (and possibly also humans) requires not only extensive face experience, but extensive experience within each of multiple face types. This finding, along with our finding that the face inversion effect arises spontaneously in CNNs trained to discriminate face identities but not in CNNs trained on face detection and/or object classification, accords with other findings showing signatures of human face perception in face-identity trained networks, such as face familiarity effects (22), the Thatcher illusion (44) and view-invariant identity representations (30,45,46). ...
Preprint
Full-text available
Human face recognition is highly accurate, and exhibits a number of distinctive and well documented behavioral "signatures" such as the use of a characteristic representational space, the disproportionate performance cost when stimuli are presented upside down, and the drop in accuracy for faces from races the participant is less familiar with. These and other phenomena have long been taken as evidence that face recognition is "special". But why does human face perception exhibit these properties in the first place? Here we use deep convolutional neural networks (CNNs) to test the hypothesis that all of these signatures of human face perception result from optimization for the task of face recognition. Indeed, as predicted by this hypothesis, these phenomena are all found in CNNs trained on face recognition, but not in CNNs trained on object recognition, even when additionally trained to detect faces while matching the amount of face experience. To test whether these signatures are in principle specific to faces, we optimized a CNN on car discrimination and tested it on upright and inverted car images. As for face perception, the car-trained network showed a drop in performance for inverted versus upright cars. Similarly, CNNs trained only on inverted faces produce an inverted inversion effect. These findings show that the behavioral signatures of human face perception reflect and are well explained as the result of optimization for the task of face recognition, and that the nature of the computations underlying this task may not be so "special" after all.

Significance Statement
For decades, cognitive scientists have collected and characterized behavioral signatures of face recognition. Here we move beyond the mere curation of behavioral phenomena to asking why the human face system works the way it does. We find that many classic signatures of human face perception emerge spontaneously in CNNs trained on face discrimination, but not in CNNs trained on object classification (or on both object classification and face detection), suggesting that these long-documented properties of the human face perception system reflect optimizations for face recognition per se, not by-products of a generic visual categorization system. This work further illustrates how CNN models can be synergistically linked to classic behavioral findings in vision research, thereby providing psychological insights into human perception.
... Feedforward CNNs remain among the best models for predicting mid-and high-level cortical representations of novel natural images within the first 100-200 ms after stimulus onset [7,8]. Diverse CNN models, trained on tasks such as face identification [9,10], object recognition [11], inverse graphics [12], and unsupervised generative modeling [13] have all been shown to replicate at least some aspects of face-patch system representations. Face-selective artificial neurons occur even in untrained CNNs [14], and functional specializa-tion between object and face representation emerges in CNNs trained on the dual task of recognizing objects and identifying faces [15]. ...
Preprint
Full-text available
Primates can recognize objects despite 3D geometric variations such as in-depth rotations. The computational mechanisms that give rise to such invariances are yet to be fully understood. A curious case of partial invariance occurs in the macaque face-patch AL and in fully connected layers of deep convolutional networks in which neurons respond similarly to mirror-symmetric views (e.g., left and right profiles). Why does this tuning develop? Here, we propose a simple learning-driven explanation for mirror-symmetric viewpoint tuning. We show that mirror-symmetric viewpoint tuning for faces emerges in the fully connected layers of convolutional deep neural networks trained on object recognition tasks, even when the training dataset does not include faces. First, using 3D objects rendered from multiple views as test stimuli, we demonstrate that mirror-symmetric viewpoint tuning in convolutional neural network models is not unique to faces: it emerges for multiple object categories with bilateral symmetry. Second, we show why this invariance emerges in the models. Learning to discriminate among bilaterally symmetric object categories induces reflection-equivariant intermediate representations. AL-like mirror-symmetric tuning is achieved when such equivariant responses are spatially pooled by downstream units with sufficiently large receptive fields. These results explain how mirror-symmetric viewpoint tuning can emerge in neural networks, providing a theory of how they might emerge in the primate brain. Our theory predicts that mirror-symmetric viewpoint tuning can emerge as a consequence of exposure to bilaterally symmetric objects beyond the category of faces, and that it can generalize beyond previously experienced object categories.
... In a hierarchical manner, these layers represent different features of the input stimuli, from very simple (e.g., lines at specific angles) to very complex (e.g., faces) ones, and increase the invariance of the representations through pooling mechanisms between layers (Serre et al., 2007a; Kheradpisheh et al., 2016b; Riesenhuber and Poggio, 2000). In object recognition tasks, these hierarchical models, such as the HMAX model (Riesenhuber and Poggio, 1999), different extensions of the HMAX model (Farzmahdi et al., 2016; Zabbah et al., 2014; Rajaei et al., 2012; Ghodrati et al., 2012) and, more recently, deep convolutional neural network (DCNN)-based approaches (Cichy et al., 2016; Kriegeskorte, 2015), can categorize input images with accuracy similar to that of humans or animals. Spiking versions of these models explain how neurons respond to specific features in the input stimulus via STDP learning rules (Kheradpisheh et al., 2018). ...
... The processes of object recognition and of deciding about it, in both the representation (Delorme et al., 2010) and decision-making (Deng et al., 2009) stages, take different times for different objects. For example, the representation of ambiguous objects in the IT cortex (Farzmahdi et al., 2016) and the decisions to recognize them are slower and less accurate than those for less ambiguous objects (Fukushima and Miyake, 1982; Ghodrati et al., 2014). Importantly, this speed and accuracy can be adjusted via the decision process in the brain. ...
Article
The underlying mechanism of object recognition, a fundamental brain ability, has been investigated in various studies. However, the balance between the speed and accuracy of recognition is less explored. Most computational models of object recognition cannot explain recognition time and thus focus only on recognition accuracy, for two reasons: the lack of a temporal representation mechanism for sensory processing, and the use of non-biological classifiers for decision-making. Here, we proposed a hierarchical temporal model of object recognition using a spiking deep neural network coupled to a biologically plausible decision-making model to explain both recognition time and accuracy. We showed that the response dynamics of the proposed model can resemble those of the brain. First, in an object recognition task, the model can mimic human and monkey recognition times as well as accuracy. Second, the model can replicate different speed-accuracy trade-off regimes as observed in the literature. More importantly, we demonstrated that the temporal representation of different abstraction levels (superordinate, midlevel, and subordinate) in the proposed model matched the brain representation dynamics observed in previous studies. We conclude that the accumulation of spikes generated by a hierarchical feedforward spiking structure to reach a bound can explain not only the dynamics of making a decision, but also the representation dynamics for different abstraction levels.
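
The decision stage described in this abstract (accumulating spikes from a feedforward hierarchy until a bound is reached) can be sketched as a race between two accumulators driven by Poisson spike counts. The rates, bound, and time step below are illustrative values, not the paper's parameters.

    import numpy as np

    def accumulate_to_bound(rate_a, rate_b, bound=50, dt=0.001, max_t=2.0, seed=0):
        """Race model: Poisson spike counts from two output populations are accumulated
        until one accumulator hits the bound, yielding both a choice and a reaction time."""
        rng = np.random.default_rng(seed)
        acc_a = acc_b = 0
        t = 0.0
        while t < max_t:
            acc_a += rng.poisson(rate_a * dt)
            acc_b += rng.poisson(rate_b * dt)
            t += dt
            if acc_a >= bound:
                return "A", t
            if acc_b >= bound:
                return "B", t
        return None, max_t                      # no decision within the time limit

    # Stronger evidence (higher input rate) gives faster, more accurate decisions;
    # ambiguous stimuli (similar rates) give slower, less accurate ones.
    print(accumulate_to_bound(rate_a=120, rate_b=80))
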
... In future instances when humans are tasked with deepfake detection, it is important to consider whether a video has been manipulated in such a way as to reduce specialized processing. Moreover, given the usefulness of specialized face processing for humans in detecting deepfakes, computer vision models for deepfake detection may benefit from incorporating (and/or learning) such specialized processing (89). ...
Article
Full-text available
Significance The recent emergence of deepfake videos raises theoretical and practical questions. Are humans or the leading machine learning model more capable of detecting algorithmic visual manipulations of videos? How should content moderation systems be designed to detect and flag video-based misinformation? We present data showing that ordinary humans perform in the range of the leading machine learning model on a large set of minimal context videos. While we find that a system integrating human and model predictions is more accurate than either humans or the model alone, we show inaccurate model predictions often lead humans to incorrectly update their responses. Finally, we demonstrate that specialized face processing and the ability to consider context may specially equip humans for deepfake detection.
... Our result of discrimination of upright faces and inverted faces coincides with those for other neural network models (Yildirim et al., 2015; Tan & Poggio, 2016; Farzmahdi et al., 2016; Hosoya & Hyvarinen, 2017). In the FC layers, the dissimilarity values among human individuals for upright faces were larger than those for inverted faces, although the dissimilarity values among monkey expressions for upright faces were not larger than those for inverted faces. ...
Article
Full-text available
Feed-forward deep neural networks have better performance in object categorization tasks than other models of computer vision. To understand the relationship between feed-forward deep networks and the primate brain, we investigated representations of upright and inverted faces in a convolutional deep neural network model and compared them with representations by neurons in the monkey anterior inferior-temporal cortex, area TE. We applied principal component analysis to feature vectors in each model layer to visualize the relationship between the vectors of the upright and inverted faces. The vectors of the upright and inverted monkey faces were more separated through the convolution layers. In the fully-connected layers, the separation among human individuals for upright faces was larger than for inverted faces. The Spearman correlation between each model layer and TE neurons reached a maximum at the fully-connected layers. These results indicate that the processing of faces in the fully-connected layers might resemble the asymmetric representation of upright and inverted faces by the TE neurons. The separation of upright and inverted faces might take place by feed-forward processing in the visual cortex, and separations among human individuals for upright faces, which were larger than those for inverted faces, might occur in area TE.
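
The layer-to-TE comparison summarized above follows the logic of representational similarity analysis: build a dissimilarity matrix over stimuli for each model layer and for the neural data, then correlate the two with Spearman's rho. The sketch below is a generic version of that analysis; the correlation-distance metric and variable names are assumptions, not the authors' exact settings.

    import numpy as np
    from scipy.spatial.distance import pdist
    from scipy.stats import spearmanr

    def rdm(features):
        """Condensed representational dissimilarity matrix: 1 - Pearson r between the
        feature vectors of every stimulus pair. `features` is (n_stimuli, n_units)."""
        return pdist(features, metric="correlation")

    def layer_to_neural_similarity(layer_features, neural_responses):
        """Spearman correlation between a model-layer RDM and the neural RDM."""
        rho, _ = spearmanr(rdm(layer_features), rdm(neural_responses))
        return rho
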
... The remarkable point is that the firing rate reduction in the inverted mode is mostly compensated by top-down influence in the feedback mode, and this creates a considerable increase in accuracy for some categories of inverted images. The level of classification of inverted images at the superordinate, basic, and subordinate levels is also similar to that of upright images, for which this need for top-down influences is less evident [29, 30]. The last two experiments were dedicated to occlusion and deletion modes, respectively, in which top-down influences had the greatest impact. ...
Article
Full-text available
Humans can categorize an object at different semantic levels. For example, a dog can be categorized as an animal (superordinate), a terrestrial animal (basic), or a dog (subordinate). Recent studies have shown that the duration of stimulus presentation can affect the mechanism of categorization in the brain. Rapid stimulus presentation does not allow top-down influences to be applied to the visual cortex, whereas in the non-rapid case, top-down influences can be established and the final result will differ. In this paper, a spiking recurrent temporal model based on the human visual system for semantic levels of categorization is introduced. We show that the categorization problem for upright and inverted images can be solved without taking advantage of feedback, but for the occlusion and deletion problems, top-down feedback is necessary. The proposed computational model has three feedback paths that express the effects of expectation and the perceptual task, and it is characterized by the type of problem that the model seeks to solve and the level of categorization. Depending on the semantic level of the question asked, the model changes its neuronal structure and connections. Another application of the recursive paths is solving the expectation-effect problem, that is, compensating for the reduction in firing rate via top-down influences, given the features available in the object. In addition, a psychophysical experiment is performed in this paper and top-down influences are investigated through it. In this experiment, top-down influences increased the speed and accuracy of the subjects' categorization at all three categorization levels. In both the presence and absence of top-down influences, the remarkable point is the superordinate advantage.