Fig. 2: Synthetic-based still-to-video face recognition system

Source publication
Conference Paper
Full-text available
In still-to-video face recognition (FR), faces captured with surveillance cameras are matched against reference stills of target individuals enrolled in the system. FR is a challenging problem in video surveillance due to uncontrolled capture conditions (variations in pose, expression, illumination, blur, scale, etc.) and the limited number of...

Context in source publication

Context 1
... overall block diagram of the proposed FR system is shown in Figure 2. During enrollment, a set of non-target videos is collected, their ROIs are extracted, and the head pose of each ROI in each frame is estimated. ROIs with face pose angles of less than 3° are selected. Then, GLQ_i(x, y) and GCQ_i(x, y) between each ROI isolated from a still and the video ROIs of various non-target individuals are measured. Next, clustering is performed on the normalized GLQ_i(x, y) and GCQ_i(x, y) in 2D space using K-means, and the representative image of each cluster is determined. The optimal number of clusters, obtained using the Dunn index, is typically around k = 4. This process is repeated 3 times for 10 different sets of non-target videos, so that 12 non-target images are selected for each watchlist individual. Each watchlist individual is then morphed with the decomposed large-scale layers. In total, 12 synthetic face images with diverse illumination and contrast are generated for each reference still and added to the watchlist gallery to create a new gallery.

During the design phase, each face of the gallery is segmented and its ROI is scaled to a common size of 48 × 48 to limit processing time. Each ROI representation is then divided into 3 × 3 = 9 uniform non-overlapping patches. Next, uniform-pattern local binary pattern features (59 per patch) are extracted from the single reference ROI and the corresponding synthetic ones to generate diverse face representations. The extracted features are normalized to the range [0, 1] and assembled into an ROI pattern of features for matching, which is then stored as a template in the gallery. The enrollment phase produces a template gallery with 13 templates per watchlist person (the original image plus 12 synthetic images).

During the operational phase, frames undergo the same processing steps as during enrollment; template matching is then applied, matching the facial models of probes against the models stored in the gallery. Each matcher provides a similarity score between every patch of the input vector and the corresponding patch template in the gallery via Euclidean distance. Output scores from the matchers are fed into the fusion module after score normalization. A face tracker also regroups the faces of each person and accumulates positive predictions over time for robust spatio-temporal recognition. A positive prediction is produced if a matching score surpasses an individual-specific threshold. Finally, the decision function combines the tracks and matching predictions to recognize the most likely individuals in the scene.

To assess the transaction-level performance of the proposed FR system, the partial area under the ROC curve (pAUC), the area under the precision-recall curve (AUPR), and the F1-measure at a desired false positive rate of 1% are considered. Prior to each replication, 5 persons are randomly selected as target watchlist individuals, and the remaining individuals are used in the operational phase as non-target subjects. This process is repeated 5 times. To validate the performance achieved by the developed FR system for watchlist screening applications, the ChokePoint video dataset has been employed. The dataset was recorded with an array of three cameras placed above two portals to capture subjects walking through them; it contains video sequences of 25 subjects in portal 1 and 29 subjects in portal 2.
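As a concrete reference, the patch-based template extraction and matching described above can be sketched as follows. This is a minimal sketch assuming scikit-image and NumPy; the function names, the sum-based histogram normalization, and the averaging score fusion are illustrative assumptions rather than details taken from the paper.

```python
import numpy as np
from skimage.feature import local_binary_pattern
from skimage.transform import resize

PATCH_GRID = 3    # 3 x 3 = 9 non-overlapping patches
ROI_SIZE = 48     # ROIs are rescaled to 48 x 48
N_BINS = 59       # uniform LBP with P=8 has 59 distinct patterns

def extract_template(roi_gray):
    """Build an ROI pattern: one 59-bin uniform-LBP histogram per patch."""
    roi = resize(roi_gray, (ROI_SIZE, ROI_SIZE), anti_aliasing=True)
    # 'nri_uniform' gives the 59 non-rotation-invariant uniform codes for P=8
    lbp = local_binary_pattern(roi, P=8, R=1, method="nri_uniform")
    step = ROI_SIZE // PATCH_GRID
    patches = []
    for i in range(PATCH_GRID):
        for j in range(PATCH_GRID):
            block = lbp[i * step:(i + 1) * step, j * step:(j + 1) * step]
            hist, _ = np.histogram(block, bins=N_BINS, range=(0, N_BINS))
            hist = hist.astype(float)
            hist /= hist.sum() + 1e-8     # features normalized to [0, 1]
            patches.append(hist)
    return np.stack(patches)              # shape: (9, 59)

def match_score(probe, template):
    """Per-patch Euclidean distances, fused here by simple averaging."""
    dists = np.linalg.norm(probe - template, axis=1)  # one distance per patch
    return -dists.mean()  # negated so that higher scores mean better matches
```

A probe would then be accepted for a given watchlist individual when its best score over that individual's 13 templates surpasses the individual-specific threshold.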
Captured face frames have variations in illumination conditions, pose, and sharpness, as well as misalignment due to automatic face localization/detection [25]. Figure 3 presents examples of the images generated with the assistance of the proposed synthesizing algorithm. Results in Table 1 present the average transaction-level performance (pAUC(20%), AUPR and F1-measure) of the baseline FR system (BFR), with only one sample per person, and of the FR system with extra images under various illumination conditions; Table 1 also compares the results obtained with and without patching. As shown in Table 1, the recognition system with extra samples under varying illumination outperforms the baseline system because of its robustness to illumination variations. Furthermore, the patch-based technique provides a higher level of performance than the baseline system, since extracting features from each patch exploits more discriminant information and consequently yields better matching performance.

Table 2 compares the average transaction-level performance of the synthetic FR system as a function of the number of synthetic samples. It can be concluded that the number of images added to the gallery has a direct impact on the recognition rate and time complexity: increasing the number of synthetic images enhances system performance but reduces time efficiency. There should therefore be a trade-off between performance and the computational cost associated with an increased number of samples. It can also be observed that the results vary across watchlist individuals; for instance, adding extra images improves the performance of individual ID#01 from pAUC=0.118 to pAUC=0.378, but that of individual ID#04 only from pAUC=0.319 to pAUC=0.339.

Given the challenges of still-to-video FR in video surveillance applications, a new approach is proposed in this paper to generate multiple synthetic face images per reference still based on camera-specific capture conditions. The approach exploits the abundance of diverse facial ROIs from non-target individuals that appear in a specific camera viewpoint. An extension of image morphing generates a set of diverse images with a smooth transition of illumination, accurately conveying a range of synthetic face images with diverse illumination and contrast. Experimental results on the ChokePoint dataset show that the proposed approach is effective at improving representativeness under the illumination and contrast conditions found in many video surveillance applications, for instance in watchlist screening, where only one reference face still, captured under controlled conditions, is available during enrollment. It is worth mentioning that this method can be generalized to transfer other appearance variations, such as shadow and blur, to any object for a wide range of applications. In order to design a more robust still-to-video FR system, future research should include methods to generate even more synthetic faces based on variations in pose and expression of a target ...
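For reference, the three transaction-level metrics can be approximated with scikit-learn as in the hedged sketch below; the paper does not specify its implementation, and note that roc_auc_score with max_fpr returns a standardized (McClish-corrected) partial AUC, which may differ in scale from the raw pAUC(20%) values reported here.

```python
import numpy as np
from sklearn.metrics import (average_precision_score, f1_score,
                             roc_auc_score, roc_curve)

def transaction_metrics(y_true, scores):
    """y_true: 1 for target individuals, 0 otherwise; scores: match scores."""
    pauc_20 = roc_auc_score(y_true, scores, max_fpr=0.20)  # pAUC(20%)
    aupr = average_precision_score(y_true, scores)         # PR-space area
    # operating point: the highest threshold whose FPR does not exceed 1%
    fpr, _, thresholds = roc_curve(y_true, scores)
    t = thresholds[np.searchsorted(fpr, 0.01, side="right") - 1]
    f1 = f1_score(y_true, scores >= t)
    return pauc_20, aupr, f1
```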

Citations

... Multiple techniques and algorithms focus on tasks like face detection [1][2], identity verification [3][4], face capturing [5], and face hallucination [6][7]. However, this type of image processing presents several challenges, including multiple noise sources, motion blur [8], environmental disturbances due to illumination changes and contrast [9], and insufficient sensor density [10]. All these contribute to image degradation, significantly affecting capture quality [11]. ...
Article
Full-text available
This document details the implementation of a sub-pixel convolutional neural network designed to enhance the resolution of face images. The model uses a series of filters to progressively increase the number of pixels, estimating the information needed for the new pixels from the original image, and is trained on 22,000 synthetic images produced by adversarial neural networks. Within the context of surveillance and related applications, the trained convolutional network exhibits beneficial characteristics. For instance, it can be deployed within a device to achieve higher-resolution images than the physical camera can produce. This research underscores the feasibility of such a device through the implementation and evaluation of the network on the NVIDIA Jetson TX2 embedded system. The findings demonstrate the model's practicality for real-time surveillance applications and its ability to produce superior-quality images compared to several interpolation methods, as determined by an exhaustive testing process measuring various attributes of the generated images.
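For context, a sub-pixel convolutional upscaler of the general kind this article describes can be sketched in a few lines of PyTorch; the layer widths and the 4x scale are illustrative assumptions, not the cited model.

```python
import torch.nn as nn

class SubPixelSR(nn.Module):
    """Upscale by producing scale**2 channels, then rearranging them into pixels."""
    def __init__(self, scale=4, channels=1):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, 64, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv2d(64, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, channels * scale ** 2, kernel_size=3, padding=1),
            nn.PixelShuffle(scale),  # (B, C*s^2, H, W) -> (B, C, H*s, W*s)
        )

    def forward(self, x):
        return self.body(x)
```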
... Among the main reasons to generate synthetic images are low cost, high efficiency, and privacy during testing. Researchers do not need to depend on real-world data; they can work on synthetic data [14]. A generator model is learned over training images to generate synthetic images. ...
Conference Paper
Full-text available
A new method for synthetic palm image generation is proposed in this paper based on StyleGAN2-ADA, a specialized GAN architecture. This method is based on the modification of the styles of the palm, such as principal lines, secondary lines, wrinkles, etc. The model was trained on 3500 palm images, combined from two public datasets. The quality of the synthetic images, generated by the proposed model, is evaluated by a Scale Invariant Feature Transform (SIFT)-based custom algorithm where the features of the synthetic images (for example, principal lines) are compared with reference palm images. The synthetic images having lower quality metrics, below the threshold, are discarded. This quality assessment algorithm shows that 95 percent of the generated synthetic images are acceptable and have enough diversity to be employed for further biometric research. This research is significant as it can address the scarcity of biometric data especially of the palm image which is a relatively new research domain with lots of potential to be a robust identification and verification system.
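A quality gate of the kind described, comparing SIFT features of a synthetic image against a reference and discarding low scorers, might look like the following sketch using OpenCV; the ratio test, the score definition, and the threshold are illustrative assumptions, not the authors' algorithm.

```python
import cv2

def sift_quality(synth_gray, ref_gray, ratio=0.75):
    """Fraction of synthetic-image keypoints with a good match in the reference."""
    sift = cv2.SIFT_create()
    kp1, des1 = sift.detectAndCompute(synth_gray, None)
    kp2, des2 = sift.detectAndCompute(ref_gray, None)
    if des1 is None or des2 is None:
        return 0.0
    matches = cv2.BFMatcher().knnMatch(des1, des2, k=2)
    good = [p for p in matches
            if len(p) == 2 and p[0].distance < ratio * p[1].distance]
    return len(good) / max(len(kp1), 1)

# Synthetic images scoring below a chosen threshold would be discarded.
```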
... These technologies aim to increase intra-class variations and the resilience of facial models. In some displays, different patches and facial descriptors are used [8], and artificial facial images are synthesised using 2D morphing or 3D reconstructions [10]. ...
... In some displays, different patches and facial descriptors are used [8], and artificial facial images are synthesised using 2D morphing or 3D reconstructions [10]. A generic auxiliary dataset comprising other people's faces can be used to modify domains [11] and to classify displays through dictionary training [12]. ...
Article
Full-text available
Smart surveillance systems are becoming a vital application in streets and houses. Many streets are prone to misbehavior such as ATM theft, robbery, and fights, and hence it is necessary to detect and analyse crime scenes to find suspects. However, most surveillance systems suffer from poor detection of objects due to poor camera resolution, absence of light, and other factors. In order to improve the detection of faces after detecting objects using ResNet, it is necessary to adopt advanced devices for image capturing and analysis. In this paper, an Internet of Things (IoT) based ESP32-CAM WiFi and Bluetooth module with a 2 MP OV2640 camera module is used for image acquisition to capture better images of the scenes. The study uses a dense convolutional network, namely DenseNet, to detect the faces present in crime scenes after object detection. The deep learning module is trained on selected crime scenes to train the classifier. A simulation is further conducted to validate the model against other variants of deep learning.
... Shao et al. [24] introduced an FR method that depends on Sparse Representation-based Classification (SRC), which augments the dictionary with a collection of artificial images created by measuring the difference between a pair of images. The authors in [25] augmented the gallery of references by producing a collection of artificial images under camera-specific illumination conditions to build an FR system that is reliable under surveillance conditions. Blanz and Vetter [26] introduced a 3D Morphable Model (3DMM) for reconstructing a 3D image from a 2D image and subsequently synthesizing new face photos. ...
Article
Full-text available
Face Recognition (FR) is one of the significant fields in computer vision. FR is used to identify faces that appear across cameras distributed over a network. The face recognition problem can be divided into two categories: the first is recognition with more than one sample per person, which can be called the traditional face recognition problem; the second is recognition of faces using only a Single Sample Per Person (SSPP). The efficiency of face recognition systems decreases because of limited references, especially with SSPP, and because faces captured in the Operational Domain (OD) differ from faces in the Enrollment Domain (ED) in illumination, pose, resolution, and blurriness. This paper proposes a method that deals with all problems related to face recognition with SSPP. 3D face reconstruction is used to augment the reference gallery set with different poses and to generate a design-domain dictionary to overcome the problem of limited references. Besides, the design-domain dictionary is used to feed different deep learning models. Face illumination transfer techniques are utilized to overcome the illumination problem. The Labeled Faces in the Wild (LFW) dataset is used to train a Super-Resolution Generative Adversarial Network (SRGAN) to overcome the low-resolution problem. A Deblur Generative Adversarial Network (DeblurGAN) is trained on the LFW dataset to overcome the problem of blurriness. The proposed method is evaluated using the ChokePoint and COX-S2V datasets. The final results confirm an overall enhancement in accuracy compared to techniques that use SSPP for face recognition (generic learning and face synthesizing approaches). The proposed method also outperforms the accuracy of the Traditional and Deep Learning (TDL) method, which uses SSPP for face recognition.
... These techniques were applied to mitigate the loss of classification performance due to changes in facial appearance. Mokhayeri et al. [74] proposed an approach that generates multiple synthetic face images per person on a camera to address the low-quality image problem caused by illumination variations. Weyrauch et al. [75] presented a face recognition approach invariant to illumination and pose by incorporating component-based recognition and 3D morphable models. ...
... However, the criteria for the selected videos were not stated. Mokhayeri et al. [13] proposed an approach that generates multiple synthetic face images per person on a camera to address low-quality image problems caused by variations in illumination. The ChokePoint dataset was used to evaluate the performance of the proposed method. ...
... In face recognition, the shape of the face modality is important, as it provides richer information to represent the features [35]. We deploy a preprocessing pipeline with feature transformations based on [10] and [50], resulting in textures extracted from f and p that are used to develop a frequency-domain procedure. ...
Article
Full-text available
Although there is an abundance of current research on facial recognition, it still faces significant challenges related to variations in factors such as aging, poses, occlusions, resolution, and appearances. In this paper, we propose a Multi-feature Deep Learning Network (MDLN) architecture that uses modalities from the facial and periocular regions, with the addition of texture descriptors, to improve recognition performance. Specifically, MDLN is designed as a feature-level fusion approach that correlates the multimodal biometric data with the texture descriptor, creating a new feature representation. The proposed MDLN model therefore provides more information via the feature representation to achieve better performance, while overcoming the limitations that persist in existing unimodal deep learning approaches. The proposed model has been evaluated on several public datasets, and through our experiments we show that MDLN improves biometric recognition performance under challenging conditions, including variations in illumination, appearances, and pose misalignments.
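Feature-level fusion of the kind MDLN performs, concatenating modality features with a texture descriptor before learning a joint representation, can be illustrated with the short PyTorch sketch below; the branch dimensions and the single fusion layer are illustrative assumptions, not the authors' architecture.

```python
import torch
import torch.nn as nn

class FeatureLevelFusion(nn.Module):
    """Concatenate per-modality features, then learn a fused representation."""
    def __init__(self, d_face=256, d_periocular=256, d_texture=59, d_out=128):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(d_face + d_periocular + d_texture, d_out),
            nn.ReLU(),
        )

    def forward(self, f_face, f_periocular, f_texture):
        return self.fuse(torch.cat([f_face, f_periocular, f_texture], dim=1))
```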
... This creates more capability for matching partial faces [6,7]. In addition, the rapid growth of camera use in social networks, surveillance, and smartphones arguably increases interest in periocular recognition [8,9]. For all these reasons, periocular recognition has become an area of intense study in the biometrics and computer vision communities. ...
Article
Full-text available
Periocular recognition remains challenging for deployments in unconstrained environments. Therefore, this paper proposes an RGB-OCLBCP dual-stream convolutional neural network, which accepts an RGB ocular image and a colour-based texture descriptor, namely the Orthogonal Combination-Local Binary Coded Pattern (OCLBCP), for periocular recognition in the wild. The proposed network aggregates the RGB image and the OCLBCP descriptor using two distinct late-fusion layers. We demonstrate that the proposed network benefits from both the RGB image and the OCLBCP descriptor to achieve better recognition performance. A new database, namely an Ethnic-ocular database of periocular images in the wild, is introduced and shared for benchmarking. In addition, three publicly accessible databases, namely AR, CASIA-iris distance and UBIPr, have been used to evaluate the proposed network. When compared against several competing networks on these databases, the proposed network achieved better performance in both recognition and verification tasks.
... These techniques seek to enhance the robustness of face models to intra-class variations. In multiple representations, different patches and face descriptors are employed [2,4], while 2D morphing or 3D reconstructions are used to synthesize artificial face images [16,22]. A generic auxiliary dataset containing faces of other persons can be exploited to perform domain adaptation [20] and sparse representation classification through dictionary learning [36]. ...
Chapter
Full-text available
Face recognition (FR) systems for video surveillance (VS) applications attempt to accurately detect the presence of target individuals over a distributed network of cameras. In video-based FR systems, facial models of target individuals are designed a priori during enrollment using a limited number of reference still images or video data. These facial models are not typically representative of faces being observed during operations due to large variations in illumination, pose, scale, occlusion, blur, and camera interoperability. Specifically, in still-to-video FR application, a single high-quality reference still image captured with still camera under controlled conditions is employed to generate a facial model to be matched later against lower-quality faces captured with video cameras under uncontrolled conditions. Current video-based FR systems can perform well on controlled scenarios, while their performance is not satisfactory in uncontrolled scenarios mainly because of the differences between the source (enrollment) and the target (operational) domains. Most of the efforts in this area have been toward the design of robust video-based FR systems in unconstrained surveillance environments. This chapter presents an overview of recent advances in still-to-video FR scenario through deep convolutional neural networks (CNNs). In particular, deep learning architectures proposed in the literature based on triplet-loss function (e.g., cross-correlation matching CNN, trunk-branch ensemble CNN and HaarNet) and supervised autoencoders (e.g., canonical face representation CNN) are reviewed and compared in terms of accuracy and computational complexity.
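As a reference point for the triplet-loss architectures reviewed in this chapter, the standard triplet loss they build on can be written in a few lines of PyTorch; the margin value is an illustrative assumption.

```python
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=0.2):
    """anchor/positive share an identity; negative is a different person."""
    # squared Euclidean distances between L2-normalized embeddings
    d_ap = (F.normalize(anchor) - F.normalize(positive)).pow(2).sum(dim=1)
    d_an = (F.normalize(anchor) - F.normalize(negative)).pow(2).sum(dim=1)
    # hinge: push negatives at least `margin` farther away than positives
    return F.relu(d_ap - d_an + margin).mean()
```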