Figure 3: ‘Skeleton template’ of the Vicon motion capture system, required for labeling and calibration (top). Rendering with the Unity engine, used in experiments (bottom).

Source publication
Article
One of the open problems in creating believable characters in computer games and collaborative virtual environments is simulating adaptive human-like motion. Classical artificial intelligence (AI) research places an emphasis on verbal language. In response to the limitations of classical AI, many researchers have turned their attention to embodied...

Contexts in source publication

Context 1
... conversational gestures for animated embodied agents. Gorman [6] shows that imitation of the behaviors of a computer game player creates enhanced believability in artificial agents. Stone et al. [17] describe a technique for reproducing the structure of speech and gesture in new conversational contexts. Neff et al. [12] show how the gestural styles of individual speakers can be reconstructed, focusing on arm gestures. The emotional and narrative content that emerges through extended interaction between a virtual agent and a human can be used for simulating memory and emotional states, thus increasing believability, as indicated by work by Seif El-Nasr [16]. Modeling the affective dimensions of characters and their personalities, as demonstrated by Gebhard [5], provides more robust, consistent behavior in an agent over extended time. While we have not developed such components, we have developed a scheme by which believability over extended interaction time can be measured.

Studies using point-light displays have shown that humans are sensitive to the perception of human movement, such as detecting human gait [1][9], and there are findings of distinct patterns of neural activity associated with the perception of human-made movement [15][14]. This indicates that a small number of visual elements can be used not only for testing the perception of believable motion, but also as control points in an animated character, using inverse kinematics (IK). IK is commonly used in computer animation to determine the joint rotations of a character based on goal positions. In our research we have chosen to reduce the visual aspect to a minimum, as a way to work with first principles of motion behavior and also to establish an efficient and manageable set of controllers for animating a character. This paper demonstrates a scheme for testing believability in this highly reduced set of primary motion features, using an established, well-studied method: the Turing Test. We do not address the issue of coverbal gesture or ways to add a nonverbal layer to an existing verbal layer for conversational agents. This is basic research focusing on “silent copresence” and primitive communication through motion only.

The plastic human brain routinely adapts to new communication media and user interfaces. If a human communicator spends enough time “being” three dots, and communicating with three dots that behave in a similar way, then the body map quickly adapts to that schema. We have designed AI algorithms that “know” they exist as three dots and must use a three-dot interface to communicate. Consider that an AI used for the classic Turing Test does not require the simulation of a mouth, tongue, lungs, diaphragm, or any other apparatus used to generate verbal language, nor does it require the simulation of fingers tapping on computer keyboards. Similarly, the Gestural Turing Test AI does not need to simulate the entire muscular and skeletal apparatus required for moving these points around. Semiosis is confined to the points only – they are the locus of communication.

How many points are needed to detect communicative motion? We had originally considered two dots – or even one dot – as comprising the gestural alphabet. Our hypothesis was that, given enough time for a subject to interact with the dot(s), the intelligence behind it (or lack thereof) would eventually be revealed.
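The source describes IK only in general terms (joint rotations determined from goal positions such as the head and hand points). As a generic illustration of that idea, and not the authors' implementation, a minimal analytic IK sketch for a planar two-link limb might look like this; the function and parameter names are hypothetical:

```python
import math

def two_link_ik(target_x, target_y, upper_len, lower_len):
    """Generic planar two-link IK: return (shoulder, elbow) angles in radians
    that place the end effector (e.g. a hand control point) at the target,
    clamping to the nearest reachable pose if the target is out of range."""
    d_sq = target_x ** 2 + target_y ** 2
    # Law of cosines gives the elbow bend needed to reach the target distance.
    cos_elbow = (d_sq - upper_len ** 2 - lower_len ** 2) / (2 * upper_len * lower_len)
    cos_elbow = max(-1.0, min(1.0, cos_elbow))  # clamp when the target is unreachable
    elbow = math.acos(cos_elbow)
    # Shoulder angle: aim at the target, then correct for the bent elbow.
    shoulder = math.atan2(target_y, target_x) - math.atan2(
        lower_len * math.sin(elbow), upper_len + lower_len * math.cos(elbow))
    return shoulder, elbow
```

A full character rig would chain many such solves, but the principle of driving rotations from a small set of goal points is the same.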
With one dot, there would be very little indication of a physical human puppeteer; however, the existence of a human mind might become apparent over time, due to the spontaneous visual language that would naturally emerge, given enough time and interaction. Many points of light (dozens) would make it easier for the subject to discern between artificial and human (as illustrated in Figure 1), but this would require more sophisticated physical modeling of the human body, as well as a more sophisticated AI. For our experiment, we chose three points, because we believe the head and hands to be the most motion-expressive points of the body. The majority of gestural emblems originate in the head and hands.

Figure 2 shows a schematic of the studio setup. A human observer (the subject, shown at right) sits in a chair in front of a screen projected with two sets of three white dots. The subject wears motion-capture markers (attached to a hat and two gloves), which are used to move the three dots on the right side of the screen. The three dots on the left side of the screen are moved by a hidden agent obscured by a room divider. This hidden agent is either another human wearing similar motion-capture markers, or a software program that simulates human-made motion of the dots. No sounds or text can be exchanged between the subject and the hidden agent. Sign language is not possible, due to the limited number of dots.

To generate the points from the human subjects, we used the Vicon motion capture studio at Emily Carr University of Art and Design in Vancouver, BC. Figure 3 shows a screenshot of the Vicon interface (top). In order for the Vicon system to differentiate between the various objects in the scene, one of the hats used four markers, and each glove on the opposite side used two markers; everything else used only one. This explains the linear-connected figures in the screenshot. This is only for purposes of calibration and disambiguation within the Vicon system, and makes no difference to the subject’s view. Twenty cameras are deployed in the studio (six of them can be seen represented in wireframe at the top). Because of the room divider, some camera views of the markers are obscured, which accounts for occasional drop-out and flickering in the resulting points.

The stream of 3D positional data generated by the Vicon system while the subjects moved was distributed via a local area network to a laptop running the Unity game engine. We formatted the data using XML, which was also used for recording motions and archiving results from the experiments. A 3D scene consisting of six small white spheres – three on either side of a black divider – is animated by this data stream at 30 Hz. We used the Unity engine because we intend to extend this research to drive realistic avatars in a subsequent version of the project. An example display from Unity is shown at the bottom of Figure 3.

To generate the artificial gestures, we designed two algorithms. Both relied on detecting the energy of the human’s motions to trigger responsive gestures. We calculated energy by continually measuring the sum of the instantaneous speeds of the three points. If at any time the energy rose from below a specified threshold to above it, a response could be triggered; which response depended on what kinds of gestures were playing at the time. The first algorithm (AI1) employed a state machine that chose among a set of short, pre-recorded gestures made by a human.
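A minimal sketch of the energy measure and threshold trigger described above (sum of the instantaneous speeds of the three points, with a response fired on an upward threshold crossing). The class and parameter names are illustrative rather than the authors' code, and the threshold would need to be tuned to the capture data:

```python
import math

def motion_energy(prev_frame, curr_frame, dt):
    """Energy of one frame: sum of the instantaneous speeds of the three points
    (head and two hands), each given as an (x, y, z) position."""
    return sum(math.dist(p, c) / dt for p, c in zip(prev_frame, curr_frame))

class EnergyTrigger:
    """Fires once each time the energy rises from below the threshold to above it."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.prev_energy = 0.0

    def update(self, prev_frame, curr_frame, dt=1.0 / 30.0):
        energy = motion_energy(prev_frame, curr_frame, dt)
        fired = self.prev_energy < self.threshold <= energy
        self.prev_energy = energy
        return fired
```

A state machine like AI1 could consult such a trigger each frame to decide whether to stay with ambient gestures or switch to a dynamic one.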
These pre-recorded gestures included “ambient” motions (shifting in the chair, scratching, etc.) and a set of more dynamic gestures (emblems and communicative gestures such as waving, pointing, drawing shapes, “chair-dancing”, etc.). When triggered, it played gestures from the set of dynamic gestures, and when it detected smaller movements, it responded by playing smaller, ambient gestures. AI1 did not include a sophisticated blending scheme for smooth transitions between gestures, and so it was often apparent to the subjects that it was not human. This was intentional: we wanted to expose the subjects to less-believable behaviors so that they could establish a base level of non-believability against which to judge other motions.

The second algorithm (AI2) used a combination of procedurally generated motions and imitative motions created while the experiment was being run. The procedurally generated gestures were continuous (no explicit beginning, middle, or end), and so any one of them could be blended in or out at any time, or layered together. These were constructed through combinations of several sine and cosine oscillations, with carefully chosen phase offsets and frequencies. This included slight motions using a technique similar to Perlin noise [13]. The imitative gestures were created by recording the positions of the subject’s motions, translating them to the location where the AI was “sitting”, and playing them back after about a second, with some variation, when a critical increase in energy was detected. AI2 used a more sophisticated blending technique, achieved by allowing multiple gestures (each with varying weights) to play simultaneously, such that the sum of the weights always equals 1. A cosine function was used for blending transitions to create an ease-in/ease-out effect, which helped to smooth transitions. Our rationale for recording the subject’s gestures and playing them back in AI2 was that it would not only appear human, but would also be imitative. Imitation is one of the most primary and universal aspects of communication, especially when there is a desire for rapport and emotional connection. Gratch et al. [7] report that virtual agents that exhibit postural mirroring and imitation of head gestures enhance the sense of rapport in subjects.

There were 17 subjects. We ran 6 to 12 tests on each subject; a total of 168 tests were done. Figure 4 shows the results in chronological order from top to bottom. In this graph, the set of tests per subject is delineated by a gray horizontal line. The length of each line is proportional to the duration it took for the subject to make a response. The longest duration was just over 95 seconds. If the response was “false”, a black dot is shown at the right end of the line. Wrong guesses are indicated by black rectangles at the right side of the graph. This graph reveals some differences in subjects’ abilities to make correct guesses, and also differences in duration before subjects made a response. But we are more ...
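The continuous oscillatory gestures, the weight-normalized cosine blending, and the roughly one-second imitative playback attributed to AI2 can be sketched as follows. This is a reconstruction under assumptions (per-axis sine terms, a fixed 30 Hz delay buffer, a hypothetical seating offset), not the authors' implementation:

```python
import math
from collections import deque

def oscillatory_point(t, terms):
    """Continuous gesture for one point: each axis is a sum of sine terms
    (amplitude, frequency in Hz, phase), so the motion has no beginning or end."""
    return tuple(sum(a * math.sin(2.0 * math.pi * f * t + p) for a, f, p in axis)
                 for axis in terms)

def cosine_blend(pose_a, pose_b, u):
    """Blend two poses with weights that always sum to 1, using a cosine
    ease-in/ease-out curve on u in [0, 1]."""
    w = 0.5 - 0.5 * math.cos(math.pi * min(max(u, 0.0), 1.0))
    return tuple((1.0 - w) * a + w * b for a, b in zip(pose_a, pose_b))

class ImitativePlayback:
    """Replays the subject's point positions about one second later,
    translated to where the AI is 'sitting'."""

    def __init__(self, delay_frames=30, offset=(-1.0, 0.0, 0.0)):
        self.buffer = deque(maxlen=delay_frames)
        self.offset = offset

    def update(self, subject_points):
        self.buffer.append(subject_points)
        delayed = self.buffer[0]  # oldest frame, roughly one second old once full
        return [tuple(c + o for c, o in zip(point, self.offset)) for point in delayed]
```

With more than two simultaneous gestures, the same idea extends by normalizing a set of per-gesture weights so they sum to 1 before mixing.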
Context 2
... interested in how well the two AI algorithms performed against the human. This can be shown by separating out the tests according to which hidden agent was used (Human, AI1, or AI2). Figure 5 shows the percentages of wrong vs. right responses for each of the three agents. As expected, the human received the most guesses of “real”. Also as expected, AI2 scored better than AI1 in terms of fooling subjects into thinking it was real.

Believability in virtual agents can be measured in many ways; it does not have to be a binary choice, and in fact critics of the classic Turing Test have suggested that its all-or-nothing criterion may be a problem, and that a graded assessment might be more appropriate and practical [4]. One approach might be to measure how quickly a subject is convinced that a virtual agent is real. We calculated the average durations for each case of right and wrong responses, as shown in Table 1. The average durations before responding for the human agent are less than the average durations for the other agents, except when subjects guessed wrongly for AI2. It may be that the authenticity of the human agent is easily and quickly determined, on average, which accounts for the slightly shorter average durations. But this is not conclusive. We ran a t-test and did not find any ...

Citations

... Dynamic figures with little anthropomorphic appearance can be perceived as human movement when certain kinematic features (speed, stiffness, phase, etc.) approach those of natural human movements (Morewedge et al., 2007; Thompson et al., 2011; Blake and Shiffrar, 2007; Johansson, 1973; Heider and Simmel, 1944). Similarly, contingencies between a human's and a virtual agent's movement also lead to humanness attribution to the virtual agent following their interaction (Pfeiffer et al., 2011; Ventrella et al., 2010). In the current study, the VP's hand has anthropomorphic ...
Article
Variability is a property of biological systems, and in animals (including humans), behavioral variability is characterized by certain features, such as the range of variability and the shape of its distribution. Nevertheless, only a few studies have investigated whether and how variability features contribute to the ascription of humanness to robots in a human-robot interaction setting. Here, we tested whether two aspects of behavioral variability, namely, the standard deviation and the shape of distribution of reaction times, affect the ascription of humanness to robots during a joint action scenario. We designed an interactive task in which pairs of participants performed a joint Simon task with an iCub robot placed by their side. Either iCub could perform the task in a preprogrammed manner, or its button presses could be teleoperated by the other member of the pair, seated in the other room. Under the preprogrammed condition, the iCub pressed buttons with reaction times falling within the range of human variability. However, the distribution of the reaction times did not resemble a human-like shape. Participants were sensitive to humanness, because they correctly detected the human agent above chance level. When the iCub was controlled by the computer program, it passed our variation of a nonverbal Turing test. Together, our results suggest that hints of humanness, such as the range of behavioral variability, might be used by observers to ascribe humanness to a humanoid robot.
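The abstract contrasts reaction times that stay within the human range with a distribution whose shape is not human-like. A toy illustration of that distinction follows; the specific distributions and parameter values are assumptions chosen for illustration, not taken from the study:

```python
import random

def preprogrammed_rt(low=0.25, high=0.70):
    """Reaction time in seconds: within a plausible human range, but uniformly
    distributed, i.e. the range of human variability without a human-like shape."""
    return random.uniform(low, high)

def human_like_rt(mu=0.35, sigma=0.05, tau=0.10):
    """Ex-Gaussian reaction time, a common model of human RTs: a normal core
    plus an exponential tail gives the typical right-skewed shape."""
    return random.gauss(mu, sigma) + random.expovariate(1.0 / tau)
```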
Article
This FDG 2011 Doctoral Consortium paper gives an overview of my Computer Science research approach and initial results in applying principles and theories from the Performing Arts to problems related to embodied agent design and animation for games.