FIGURE 3 - uploaded by Farhan Ullah
Schematic diagram of the rotation consistency between 2D and 3D. The X, Y, and Z axes of the 3D camera coordinate system correspond to pitch, yaw, and roll, respectively. The 2D x-y image plane is parallel to the X-Y plane of the camera coordinate system. An image obtained by rotating δ around the Z axis followed by a projection operation is equivalent to one obtained by rotating δ directly on the 2D image.
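The equivalence stated in the caption can be checked numerically. The sketch below assumes an orthographic projection for simplicity; with a perspective camera the same identity holds, since a rotation about the Z axis leaves depth unchanged.

```python
import numpy as np

# Numerical check of the rotation consistency in the figure: rotating 3D
# points by delta around the camera Z axis and then projecting onto the
# image plane gives the same 2D points as projecting first and rotating
# by delta in the image plane (orthographic projection assumed).

delta = np.deg2rad(25.0)
c, s = np.cos(delta), np.sin(delta)

Rz = np.array([[c, -s, 0.0],
               [s,  c, 0.0],
               [0.0, 0.0, 1.0]])   # 3D roll: rotation about the Z axis
R2 = np.array([[c, -s],
               [s,  c]])           # the same rotation applied in 2D

P = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])    # orthographic projection onto x-y

X = np.random.default_rng(0).normal(size=(3, 10))  # random 3D points

path_a = P @ (Rz @ X)   # rotate in 3D, then project
path_b = R2 @ (P @ X)   # project, then rotate in 2D

assert np.allclose(path_a, path_b)
```

The identity follows because the projection discards only the Z coordinate, which a rotation about Z never touches.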


Source publication
Article
Full-text available
Image-based Head Pose Estimation (HPE) from an arbitrary view is still challenging due to complex imaging conditions as well as the intrinsic and extrinsic properties of faces. Different from existing HPE methods that combine additional cues or tasks, this paper solves the HPE problem by relieving problem complexity. Our method integrates the de...

Context in source publication

Context 1
... bins is used to represent the final estimation. Regarding loss functions, we argue that existing methods usually formulate loss functions independently of images. In this paper, we observe that the Euler angles of an input image pair are strongly related. Details of the proposed pairwise pose loss are described in Sec. III-D. Fig. 3 shows the schematic diagram of the rotation consistency between the 2D image space and the 3D camera space, indicated by x-y and X-Y-Z respectively. Note that the image plane x-y is parallel to the X-Y plane in camera coordinates and perpendicular to the Z axis. The orange path shows the 2D rotation pipeline, where the 2D ...

Citations

... In recent years, Convolutional Neural Networks (CNNs) have become one of the most widely utilized techniques for computer vision problems [28][29][30][31]. A CNN mainly comprises two parts: feature extraction and classification [32,33]. ...
Article
Over the last several years, olive cultivation has grown throughout the Mediterranean countries. Among them, Spain is the world's leading producer of olives. Due to its high economic significance, it is in the best interest of these countries to maintain the crop spread and its yield. Manual enumeration of trees over such extensive fields is impractical and humanly infeasible. Several methods are presented in the existing literature; nonetheless, selecting the optimal method is of great significance. In this paper, we propose an automated method of olive tree detection as well as crop estimation. The proposed approach is a two-step procedure that includes a deep learning-based classification model followed by regression-based crop estimation. During the classification phase, the foreground tree information is extracted using an enhanced segmentation approach, specifically the K-means clustering technique, followed by the synthesis of a super-feature vector comprised of statistical and geometric features. Subsequently, these extracted features are utilized to estimate the expected crop yield. Furthermore, the suggested method is validated using satellite images of olive fields obtained from Google Maps. In comparison with existing methods, the proposed method contributes in terms of novelty and accuracy, outperforming the rest with an overall classification accuracy of 98.1% and a yield estimate with a root mean squared error of 0.185.
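The segmentation step of the classification phase can be illustrated with a toy example. The sketch below is not the paper's pipeline: it runs a plain K-means (deterministically initialized) on the pixel colors of a synthetic 4x4 image and keeps the darker cluster as foreground; all names and values are illustrative.

```python
import numpy as np

# Illustrative sketch: separating "tree" pixels from background with
# K-means on pixel colors, the clustering step the classification
# phase builds on.

def kmeans(points, k, iters=20):
    # Deterministic init from the first k unique rows; assumes the data
    # has at least k distinct points (fine for this toy example).
    centers = np.unique(points, axis=0)[:k]
    for _ in range(iters):
        labels = np.argmin(((points[:, None] - centers) ** 2).sum(-1), axis=1)
        centers = np.array([points[labels == j].mean(axis=0) for j in range(k)])
    return labels, centers

# Toy 4x4 "image": dark-green foreground vs light background colors.
img = np.array([[[30, 120, 40]] * 2 + [[200, 200, 190]] * 2] * 4, float)
pixels = img.reshape(-1, 3)
labels, centers = kmeans(pixels, k=2)

# Take the darker cluster as foreground (olive canopy).
fg = labels == np.argmin(centers.sum(axis=1))
print(fg.sum())  # 8 foreground pixels
```

Feature synthesis and yield regression would then operate on the foreground mask; those stages are omitted here.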
... Over the last three decades, methods to estimate the pose of the head have received increasing attention [3] because of their application in various image analysis tasks. A variety of implementations can be named, ranging from, at the coarsest level, the identification of a head pose from a finite set of orientations [4] (e.g., frontal vs. lateral view) to the estimation of continuous angular measurements [5,6]. ...
Article
The aim of this work is to present an automatic solution to control a surveillance camera merely by the movements of the operator's head. The method uses convolutional neural networks that work in a coarse-to-fine manner to estimate head orientation in image data. First, the image frame of the operator's head is acquired from the camera on the operator's side of the system. The exact position of the head, given by its bounding box, is estimated by a Multitask Cascaded Convolutional Network. Second, a network customized for the given scenario is used to classify the orientation of the head in image data. In particular, a dedicated image dataset was collected for training purposes, with a discrete set of possible orientations in the vertical and horizontal planes. The accuracy of the estimators is higher than 80%, at an average of 4.12 fps of validation time. Finally, the current head orientation data are converted into a control signal for a two-degree-of-freedom surveillance camera mount. The feedback response time is 1.5 s, which is sufficient for most real-life surveillance applications.
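The final conversion from a discrete orientation class to a mount command might look like the following sketch; the class names, step size, and function are hypothetical, not the paper's actual interface.

```python
# Hypothetical sketch: mapping a discrete head-orientation class to a
# pan/tilt command for a two-degree-of-freedom camera mount. The class
# names and step size are assumptions for illustration only.

PAN = {"left": -1, "center": 0, "right": +1}
TILT = {"down": -1, "center": 0, "up": +1}

def control_signal(h_class, v_class, step_deg=5.0):
    """Return (pan, tilt) increments in degrees for the mount."""
    return PAN[h_class] * step_deg, TILT[v_class] * step_deg

print(control_signal("right", "up"))  # (5.0, 5.0)
```

Issuing one such increment per classified frame, with the 1.5 s feedback loop described above, keeps the camera tracking the operator's head.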
... This could be an interesting area to be explored in the future.

[154] | DL | gender and expression | IMDB
Park et al. [155] | DL | age and gender | Mega Asian
Benkaddour et al. [156] | DL | age and gender | Adience
Micheala et al. [157] | DL | age and gender | FERET
Kale et al. [158] | DL | race, age, and gender | Face
2020:
Li et al. [159] | DL | age and gender | AFLW
Barra et al. [160] | geometric | age and gender | AFLW
Lim et al. [161] | DL | age and gender | IMDB
[162] | regression | age and expression | AFLW
HyperFace [109] | DL | appearance and DL | AFLW
Sergio et al. [50] | IBM | gender, age, and expression | FEI, FERET
Hsu et al. [117] | regression | HPE and expression | AFLW
Khan et al. [52] | IBM | HPE, age, and gender | FEI, FERET
Thomaz et al. [147] | IBM | gender and expression | FEI, FERET
Zhou et al. [134] | DL | gender and age | Adience, VGGFace2
2018:
Gupta et al. [163] | regression | expressions and gender | AFLW
Ruiz et al. [116] | DL | age and gender | AFLW
Smith et al. [22] | DL | gender and age | VGGFace
Acien et al. [144] | DL | gender and race | VGGFace
Baltrusaitis et al. [103] | DL | HPE and expression | ICT-3DHP
Das et al. [145] | DL | gender, age, race | UTKFace
Mane et al. [146] | appearance | gender and expression | IMDB
2017:
Derkach et al. [164] | regression | appearance | SASE
Duan et al. [115] | hybrid | gender and age | Aidence
Dehghan et al. [142] | DL | age, gender, expression | Face+
Ranjan et al. [109] | DL | age, race, gender, expression | AFW
Shin et al. [149] | DL | age and gender | AFW
2016:
Baltrusaitis et al. [109] | 3D morphable | HPE and gender | Multi-Pie+BU
Ranjan et al. [109] | DL | head pose and gender | AFLW and CelebA
Xia et al. [165] | geometric | race, gender, and age | FRGCv2
Lapuschkin et al. [136] | appearance | age and gender | Adience
2015:
Afifa et al. [166] | appearance | age and gender | FacePix, CMU PIE, BU
Tulyakov et al. [64] | tracking | HPE and expr. | Dali3DHP
Peng et al. [167] | manifold embedded | gender and expr. ...
Article
Human face image analysis using machine learning is an important element in computer vision. The human face image conveys information such as age, gender, identity, emotion, race, and attractiveness to both human and computer systems. Over the last ten years, face analysis methods using machine learning have received immense attention due to their diverse applications in various tasks. Although several methods have been reported in the last ten years, face image analysis still represents a complicated challenge, particularly for images obtained from 'in the wild' conditions. This survey paper presents a comprehensive review focusing on methods in both controlled and uncontrolled conditions. Our work illustrates both merits and demerits of each method previously proposed, starting from seminal works on face image analysis and ending with the latest ideas exploiting deep learning frameworks. We show a comparison of the performance of the previous methods on standard datasets and also present some promising future directions on the topic.
... Furthermore, during deep 3D face modeling, facial landmarks can be obtained. The head pose estimation problem is solved in [23] by reducing the problem's complexity. A deep task reduction-guided image regularization module is integrated with an anchor-guided pose estimation module, and the HPE problem is formulated as a unified end-to-end learning framework. ...
Article
Head pose estimation (HPE) is a key step in the computation and quantification of 3D facial features and has a significant impact on the precision and accuracy of measurements. High-precision HPE is the basis for standardized facial data collection and analysis. The Camper's plane is the standard (baseline) plane commonly used by anthropologists for head and face research, but there is no research on automatic positioning of the Camper's plane using color and depth cameras. This paper presents a high-accuracy method for Camper's plane localization and HPE based on multi-view RGBD depth sensors. The 3D facial point clouds acquired by the multi-view RGBD depth sensors are aligned to obtain a complete 3D face. Keypoint RCNN is used for facial keypoint detection to obtain facial landmarks. A method is proposed to build a general face datum model based on a self-built dataset. The head pose is estimated by applying a rigid body transformation between an individual 3D face and the general 3D face model. To verify the accuracy of Camper's plane localization and HPE, 102 cases of 3D facial data were collected and experiments conducted. The tragus and nasal alar points are localized to within 7 pixels (about 0.83 cm), and the average accuracies of the three dimensions of the identified Camper's plane are 0.87°, 0.64° and 0.47° respectively. The average accuracies of the three dimensions of HPE were 1.17°, 0.90° and 0.97°. The experimental results demonstrate the effectiveness of the method for Camper's plane localization and HPE.
... Head pose estimation is the task of determining a head's orientation in 3D space, relative to the camera, from an image or a video sequence [43]. The main goal of head pose estimation is to obtain the three-dimensional Euler angles, namely the pitch, roll, and yaw angles. ...
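As a sketch of the Euler-angle target mentioned above: the snippet below builds a rotation matrix from (yaw, pitch, roll) and recovers the angles back. Papers differ in axis conventions; the Z-Y-X factorization here is one common choice, not necessarily the one used in [43].

```python
import numpy as np

# Minimal round trip between Euler angles and a rotation matrix using the
# Z-Y-X convention (yaw about Z, pitch about Y, roll about X).

def euler_to_matrix(yaw, pitch, roll):
    cz, sz = np.cos(yaw), np.sin(yaw)
    cy, sy = np.cos(pitch), np.sin(pitch)
    cx, sx = np.cos(roll), np.sin(roll)
    Rz = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    Rx = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
    return Rz @ Ry @ Rx

def matrix_to_euler(R):
    # Valid away from the gimbal-lock case |pitch| = 90 degrees.
    pitch = -np.arcsin(R[2, 0])
    yaw = np.arctan2(R[1, 0], R[0, 0])
    roll = np.arctan2(R[2, 1], R[2, 2])
    return yaw, pitch, roll

R = euler_to_matrix(0.3, 0.2, 0.1)
assert np.allclose(matrix_to_euler(R), (0.3, 0.2, 0.1))
```

HPE methods that regress Euler angles directly implicitly fix one such convention; comparisons across papers must account for it.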
Article
Automated human pose estimation is evolving as an exciting research area in human activity detection. It includes sophisticated applications such as malpractice detection in examinations, distracted driving, gesture detection, etc., and requires robust and reliable pose estimation techniques. These applications help to map the attention of the user with head pose estimation (HPE) metrics supported by emotion and gaze analysis. This paper solves the problem of attention score estimation with HPE. The proposed method ensures ease of implementation while addressing head pose estimation using 68 facial features. Further, to attain reliability and precision, head pose estimation has been implemented as a regression task. The coordinate pair angle method (CPAM) with deep neural network (DNN) regression and elastic net regression is carried out. The use of a DNN ensures precision on low-lighting, distorted, or occluded images. The CPAM methodology leverages facial landmark detection and angular differences to estimate head pose. Experimental results showed that the proposed model could handle large datasets, real-time data processing, significant pose variations, partial occlusions, and diverse facial expressions with a mean absolute error (MAE) of 3° or less. The proposed system was evaluated on three standard databases: the 300W across large poses (300W-LP) dataset, the annotated facial landmarks in the wild (AFLW2000) dataset, and the national institute of mental health child emotional faces picture set (NIMH-ChEFS) dataset. The results achieved are on par with recent state-of-the-art methodologies such as anisotropic angle distribution learning (AADL), the joint head pose estimation and face alignment algorithm (JFA), and the rotation axis focused attention network (RAFA-Net), which report MAEs of up to 6°. The paper achieves remarkable results for attention span prediction using head pose estimation, with many possible future applications.
... In this section, we review some existing methods for estimating the 6-DoF pose from a single RGB image. From our observations, the basic framework is to build a neural network that outputs the pose parameters directly [1], [3], [11], [14]- [21] or uses the outputs to detect a set of control points (corners or keypoints) [7], [10], [22]- [27] for inferring the 6-DoF parameters later. Methods using pose parameters (rotation and translation matrices) as the end targets usually execute faster but with lower accuracy. ...
... Assuming the availability of each point's coordinates (x, y, z) and its normal (n_x, n_y, n_z) (provided by most synthetic point cloud datasets), we can solve the coefficients [a-g] via local fitting from (1) to (3). The principal/normal curvature for each object point i, denoted as C_i, can then be derived by computing the 2×2 Weingarten matrix W, which is composed of the elements a, b, and c above [39]. ...
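The curvature computation sketched above can be illustrated on a synthetic patch. The snippet below fits only the three quadratic coefficients (the cited method fits the full [a-g] set) and assumes the point lies at the origin of its tangent frame:

```python
import numpy as np

# Locally fit a quadratic patch z = a*x^2 + b*x*y + c*y^2 around a surface
# point, form the 2x2 Weingarten (shape) matrix from a, b, c, and take its
# eigenvalues as the principal curvatures.

k1, k2 = 2.0, 0.5                      # ground-truth principal curvatures
xs, ys = np.meshgrid(np.linspace(-0.1, 0.1, 9), np.linspace(-0.1, 0.1, 9))
x, y = xs.ravel(), ys.ravel()
z = 0.5 * (k1 * x**2 + k2 * y**2)      # synthetic local surface patch

A = np.column_stack([x**2, x * y, y**2])
a, b, c = np.linalg.lstsq(A, z, rcond=None)[0]

W = np.array([[2 * a, b], [b, 2 * c]])  # Weingarten matrix at the origin
curvatures = np.sort(np.linalg.eigvalsh(W))[::-1]
assert np.allclose(curvatures, [k1, k2])
```

On real point clouds, the neighborhood must first be rotated into the tangent frame defined by the point's normal before the fit; that step is omitted here.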
Article
Estimating the 6-DoF (Degree of Freedom) object pose from a single RGB image is one of the challenging tasks in the field of computer vision. Before the pose, which is defined by the translation and rotation parameters, can be derived by the traditional PnP algorithm, the 2D image projections of a set of 3D object keypoints must be accurately detected. In this paper, we present techniques for defining 3D object surface keypoints and predicting their corresponding 2D counterparts via deep-learning network architectures. The main technique to designate 3D object keypoints is to employ a quadratic fitting scheme to calculate the principal surface curvatures as weights, and then select from all surface points those with larger curvatures that are most widely distributed, so as to describe the object shape as well as possible. However, the 2D projected keypoints are not directly regressed from the network, but encoded as unit vector fields pointing to them, so that a voting scheme can be performed to recover those 2D keypoints. Moreover, an effective loss function with a regularization term is adopted in training ResNet to predict the image projections of object keypoints by focusing on small-scale errors. Experimental results show that our proposed technique outperforms state-of-the-art approaches in both "2D projection" and "3D transformation" metrics.
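The vector-field voting idea can be sketched as a least-squares intersection of rays: each pixel contributes a unit vector toward the keypoint, and the keypoint is the point minimizing the squared distances to all rays. This mirrors PVNet-style voting in spirit; the network-side encoding is not reproduced here.

```python
import numpy as np

# Recover a 2D keypoint from per-pixel unit vectors pointing at it, as the
# least-squares intersection of the implied rays.

def vote_keypoint(pixels, dirs):
    """pixels: (N,2) pixel positions; dirs: (N,2) unit vectors toward the keypoint."""
    A = np.zeros((2, 2))
    b = np.zeros(2)
    for p, d in zip(pixels, dirs):
        M = np.eye(2) - np.outer(d, d)   # projector onto the ray's normal
        A += M
        b += M @ p
    return np.linalg.solve(A, b)

rng = np.random.default_rng(1)
keypoint = np.array([37.0, 52.0])
pixels = rng.uniform(0, 100, size=(50, 2))
dirs = keypoint - pixels
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)

assert np.allclose(vote_keypoint(pixels, dirs), keypoint)
```

With noisy predicted vectors, a RANSAC-style variant of this vote is typically used instead of a single global least-squares solve.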
... In the above article [1], the affiliation for the corresponding author Jiang Wang needs to be corrected. ...
Article
In the above article [1], the affiliation for the corresponding author Jiang Wang needs to be corrected.
... The basic approaches for face pose estimation were comprehensively summarized in the literature [14], and many novel methods have been proposed recently. Wang et al. proposed a novel tree-based neural network architecture which embeds the relationship of continuity between pose intervals [15]; Li et al. combined a task-simplification mechanism and an anchor-guided estimation method into one unified learning framework to estimate face poses [16]; Lee et al. proposed a fast and accurate estimation algorithm based on the convolutional random projection forest [17], etc. Although these methods achieve great results, their performance degrades when the subject wears a face mask, because much facial information is lost. ...
Article
Aiming at the new requirement of masked face pose classification during the epidemic outbreak, this paper proposes an efficient transfer learning approach combined with a skip-connected structure to improve the accuracy of masked face pose classification in the absence of masked face pose data. We have worked on the following two aspects: 1) According to the feature transition of convolutional neural networks, we propose an efficient transfer learning approach and opt for a more appropriate source domain to solve the problem that the specificity of features in pre-trained deep networks damages performance when transferring to the target domain. First, a semi-synthetic masked face pose dataset is constructed to replace ImageNet as the source domain, which can reduce the span of the transfer and improve the pertinence of transfer learning. Second, the shallow layers, which contain general features, are frozen, while the deep layers, which contain specific features, are retrained, and the entire network is fine-tuned afterwards. This optimizes the specific features of the source domain during transfer and promotes transfer learning more effectively; 2) To further improve the overall accuracy by improving the accuracy on masked face pose classes with subtle differences, a skip-connected structure is proposed to fuse general features containing rich detailed information from the shallow layers into the classifier. Experiments on AlexNet and VGG16 show that the proposed method has certain advantages, with overall accuracies finally reaching 96.43% and 99.29%, respectively.
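The freeze-retrain-fine-tune schedule described in aspect 1) can be sketched schematically; the layer names and cutoff below are illustrative, not the actual AlexNet/VGG16 split used in the paper.

```python
# Schematic sketch of the transfer-learning schedule: freeze the shallow
# (general-feature) layers, retrain the deep (specific-feature) layers,
# then unfreeze everything for the final fine-tuning stage.

def set_trainable(layers, cutoff, fine_tune=False):
    """Return {layer: trainable?}: layers before `cutoff` stay frozen
    unless we are in the final fine-tuning stage."""
    return {name: fine_tune or i >= cutoff for i, name in enumerate(layers)}

layers = ["conv1", "conv2", "conv3", "conv4", "conv5", "fc6", "fc7", "fc8"]

stage1 = set_trainable(layers, cutoff=5)                  # retrain deep layers only
stage2 = set_trainable(layers, cutoff=5, fine_tune=True)  # fine-tune all layers

assert stage1["conv1"] is False and stage1["fc6"] is True
assert all(stage2.values())
```

In a deep-learning framework the same policy amounts to toggling gradient updates per layer (e.g. a parameter's trainable flag) according to this schedule.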
Article
The capacity to create "fake" videos has recently raised concerns about the reliability of multimedia content. Distinguishing between true and false information is a critical step toward resolving this problem. On this issue, several algorithms utilizing deep learning and facial landmarks have yielded intriguing results. Facial landmarks are traits that are solely tied to the subject's head pose. Based on this observation, we study how Head Pose Estimation (HPE) patterns may be utilized to detect deepfakes in this work. The HPE patterns studied are based on FSA-Net, SynergyNet, and WSM, which are among the most performant approaches in the state of the art. Finally, using a machine learning technique based on K-Nearest Neighbors and Dynamic Time Warping, their temporal patterns are categorized as authentic or false. We also offer a set of experiments examining the feasibility of using deep learning techniques on such patterns. The findings reveal that the ability to recognize a deepfake video utilizing an HPE pattern depends on the HPE methodology. On the contrary, performance is less dependent on the performance of the utilized HPE technique. Experiments are carried out on the FaceForensics++ dataset, which contains both identity-swap and expression-swap examples. The findings show that FSA-Net is an effective feature extraction method for determining whether a pattern belongs to a deepfake or not. The approach is also robust against deepfake videos created using various methods or for different goals. On average, the method obtains 86% accuracy on the identity-swap task and 86.5% on the expression-swap task. These findings open up various possibilities and future directions for solving the deepfake detection problem using specialized HPE approaches, which are also known to be fast and reliable.
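The classification step (K-Nearest Neighbors over Dynamic Time Warping distances) can be sketched on toy pose sequences; the sequences below merely stand in for HPE patterns extracted by a method such as FSA-Net.

```python
import numpy as np

# Compare temporal head-pose (yaw/pitch/roll) patterns with Dynamic Time
# Warping and label a query sequence with a nearest-neighbor vote.

def dtw(a, b):
    """DTW distance between two sequences of pose vectors."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def knn_label(query, references, k=1):
    """references: list of (sequence, label); returns the majority label of the k nearest."""
    dists = sorted((dtw(query, seq), label) for seq, label in references)
    votes = [label for _, label in dists[:k]]
    return max(set(votes), key=votes.count)

# Toy data: a smooth head motion vs a jittery (deepfake-like) version of it.
t = np.linspace(0, 2 * np.pi, 30)
real = np.stack([np.sin(t), np.cos(t), 0.1 * t], axis=1)
fake = real + np.random.default_rng(2).normal(0, 0.5, real.shape)

refs = [(real, "authentic"), (fake, "deepfake")]
query = np.stack([np.sin(t + 0.2), np.cos(t + 0.2), 0.1 * t], axis=1)
print(knn_label(query, refs))  # "authentic": DTW absorbs the small time shift
```

DTW is what makes this comparison robust to the varying speed of head motion across videos, which a frame-by-frame Euclidean distance would not tolerate.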