Conference Paper

A Lightweight Intelligent Virtual Cinematography System for Machinima Production

Authors: David K. Elson and Mark O. Riedl

Abstract

Machinima is a low-cost alternative to full production filmmaking. However, creating quality cinematic visualizations with existing machinima techniques still requires a high degree of talent and effort. We introduce a lightweight artificial intelligence system, Cambot, that can be used to assist in machinima production. Cambot takes a script as input and produces a cinematic visualization. Unlike other virtual cinematography systems, Cambot favors an offline algorithm coupled with an extensible library of specific modular and reusable facets of cinematic knowledge. One of the advantages of this approach to virtual cinematography is a tight coordination between the positions and movements of the camera and the actors.
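
As a rough illustration of the facet library described in the abstract, the sketch below pairs blockings with shots that share a stage for each beat of a script. All names and the scoring callback are hypothetical; Cambot's actual data model and offline search are richer than this.

```python
# Minimal sketch (hypothetical names) of Cambot-style modular facets: stages,
# blockings, and shots are reusable pieces of cinematic knowledge, and an
# offline search scores every compatible (blocking, shot) pair for each beat.
from dataclasses import dataclass
from itertools import product

@dataclass(frozen=True)
class Stage:            # occlusion-free area of the set
    name: str

@dataclass(frozen=True)
class Blocking:         # character placement/movement relative to a stage
    name: str
    stage: str
    actors: int

@dataclass(frozen=True)
class Shot:             # camera placement/motion relative to a stage
    name: str
    stage: str

def best_visualization(beats, blockings, shots, score):
    """For each beat, pick the highest-scoring blocking/shot pair whose facets
    share a stage, keeping camera and actors tightly coordinated."""
    plan = []
    for beat in beats:
        candidates = [(b, s) for b, s in product(blockings, shots)
                      if b.stage == s.stage and b.actors == len(beat["actors"])]
        plan.append(max(candidates, key=lambda bs: score(beat, *bs)))
    return plan
```
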


... However, evaluating the quality of film editing (whether generated by machines or by artists) is a notoriously difficult problem (Lino et al. 2014). Some contributions mention heuristics for choosing between multiple editing solutions without further details (Christianson et al. 1996), while others minimize a cost function that is insufficiently described to be reproduced (Elson and Riedl 2007). Furthermore, the precise timing of cuts has not been addressed, nor has the problem of controlling the rhythm of cutting (number of shots per minute) and its role in establishing film tempo (Adams, Dorai, and Venkatesh 2002). ...
... Darshak goes a long way toward motivating the shots, but the actual cinematography and editing are not evaluated. Cambot (Elson and Riedl 2007) is a movie-making system in which the choice of shots is found as the solution to an optimization problem using dynamic programming. The scene is expressed as a sequence of non-overlapping dramatic beats, and their approach evaluates different placements of characters (blockings) and camera choices for each beat. ...
... The scene is expressed as a sequence of non-overlapping dramatic beats, and their approach evaluates different placements of characters (blockings) and camera choices for each beat. Though we also make use of dynamic programming, our method is very different from that of (Elson and Riedl 2007). Firstly, we search a much larger set of possible solutions, by evaluating a higher number of shot transitions at a finer level of granularity (every frame, rather than every beat). ...
Article
We describe an optimization-based approach for automatically creating well-edited movies from a 3D animation. While previous work has mostly focused on the problem of placing cameras to produce nice-looking views of the action, the problem of cutting and pasting shots from all available cameras has never been addressed extensively. In this paper, we review the main causes of editing errors in literature and propose an editing model relying on a minimization of such errors. We make a plausible semi-Markov assumption, resulting in a dynamic programming solution which is computationally efficient. We also show that our method can generate movies with different editing rhythms and validate the results through a user study. Combined with state-of-the-art cinematography, our approach therefore promises to significantly extend the expressiveness and naturalness of virtual movie-making.
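
The semi-Markov assumption mentioned in the abstract makes the cost of an edit decompose into per-segment costs plus cut costs between adjacent segments, which is what makes dynamic programming tractable. Below is a minimal sketch of such a dynamic program; the seg_cost and cut_cost callables are hypothetical stand-ins for the paper's editing-error terms.

```python
def edit(n_frames, cameras, seg_cost, cut_cost, max_len):
    """Semi-Markov DP sketch: best[t][c] is the cheapest edit of frames 0..t
    whose last segment uses camera c and ends at frame t."""
    INF = float("inf")
    best = [{c: INF for c in cameras} for _ in range(n_frames)]
    back = [{c: None for c in cameras} for _ in range(n_frames)]
    for t in range(n_frames):
        for c in cameras:
            for d in range(1, min(max_len, t + 1) + 1):
                s = t - d + 1                      # segment covers frames s..t
                cost = seg_cost(c, s, t)
                p = None
                if s > 0:                          # pay for the cut p -> c
                    p, prev_cost = min(((q, best[s - 1][q] + cut_cost(q, c))
                                        for q in cameras if q != c),
                                       key=lambda x: x[1])
                    cost += prev_cost
                if cost < best[t][c]:
                    best[t][c], back[t][c] = cost, (s, p)
    # Backtrack the optimal segment list from the cheapest final camera.
    segments, t = [], n_frames - 1
    c = min(cameras, key=lambda q: best[t][q])
    while t >= 0:
        s, p = back[t][c]
        segments.append((s, t, c))
        t, c = s - 1, p
    return segments[::-1]
```
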
... Some approaches based on mass-spring physical systems have been designed to stage subjects in theater plays [Talbot 2015] or for interactive stories [Kapadia et al. 2016], without considering the constraints specific to the camera placement and camera angles. A cinematography system was proposed in [Elson and Riedl 2007], that encompassed staging, blocking and filming, but only relied on a limited collection of combinations. ...
... Finally, in the specific case of cinematographic systems, the machinima approach proposed in [Elson and Riedl 2007] provides a mechanism to express a film script (where there are camera and character specifications) into a placement of characters and cameras in a 3D environment. The approach relies on a library of camera placement and motions, coupled with a library of staging configurations. ...
... Section 2), we make the strong hypothesis of performing all computations in 2D abstractions of 3D scenes. In this we follow the hypotheses adopted by [Elson and Riedl 2007; Lino et al. 2010; Talbot 2015]. Typically, this relieves us from the problem of 3D visibility computation in arbitrary 3D environments, for which exact or approximate techniques remain computationally expensive [Durand et al. 2002]. ...
Conference Paper
Full-text available
While the topic of virtual cinematography has essentially focused on the problem of computing the best viewpoint in a virtual environment given a number of objects placed beforehand, the question of how to place the objects in the environment with relation to the camera (referred to as staging in the film industry) has received little attention. This paper first proposes a staging language for both characters and cameras that extends existing cinematography languages with multiple cameras and character staging. Second, the paper proposes techniques to operationalize and solve staging specifications given a 3D virtual environment. The novelty holds in the idea of exploring how to position the characters and the cameras simultaneously while maintaining a number of spatial relationships specific to cinematography. We demonstrate the relevance of our approach through a number of simple and complex examples.
... In light of this challenge, 360° video is a particularly appealing domain to invoke automatic videography techniques, which aim to convert unedited materials into an effective video presentation that conveys events [7,9,10,18,19,31,36,37,39]. While automatic videography in prior work has largely dealt with virtual environments and handcrafted heuristics [7,9,18,31], recent work shows the potential of learning how to extract informative portions of 360° video as a presentable NFOV video [37]. The authors propose the Pano2Vid problem, which takes a 360° video as input and as output generates NFOV videos that look like they could have been captured by a human observer equipped with a real NFOV camera. ...
... Virtual cinematography Most existing work on virtual cinematography studies virtual camera control in virtual (computer graphics) environments [7,9,18,31] or else a specialized domain such as lecture videos [10,36,39]. Aside from camera control, some prior works also study automatic editing of raw materials like videos or photos [4,12,13,19]. ...
Article
360° video requires human viewers to actively control "where" to look while watching the video. Although it provides a more immersive experience of the visual content, it also introduces an additional burden for viewers; awkward interfaces to navigate the video lead to suboptimal viewing experiences. Virtual cinematography is an appealing direction to remedy these problems, but conventional methods are limited to virtual environments or rely on hand-crafted heuristics. We propose a new algorithm for virtual cinematography that automatically controls a virtual camera within a 360° video. Compared to the state of the art, our algorithm allows more general camera control, avoids redundant outputs, and extracts its output videos substantially more efficiently. Experimental results on over 7 hours of real "in the wild" video show that our generalized camera control is crucial for viewing 360° video, while the proposed efficient algorithm is essential for making the generalized control computationally tractable.
... It then uses this model to identify candidate viewpoints and events of interest to capture in 360° video, before finally stitching them together through optimal camera motions using a dynamic programming formulation for presentation to human viewers. Unlike prior attempts at automatic cinematography, which focus on virtual 3D worlds and employ heuristics to encode popular idioms from cinematography [1][2][3][4][5][6], AutoCam is (a) the first to tackle real video from dynamic cameras and (b) the first to consider directly learning cinematographic tendencies from data. ...
... Virtual cinematography Ours is the first attempt to automate cinematography in complex real-world settings. Existing virtual cinematography work focuses on camera manipulation within much simpler virtual environments/video games [1][2][3][4], where the perception problem is bypassed (3-D positions and poses of all entities are knowable, sometimes even controllable), and there is full freedom to position and manipulate the camera. Some prior work [5,6] attempts virtual camera control within restricted static wide field-of-view video of classroom and video conference settings, by tracking the centroid of optical flow in the scene. ...
... Human-directed camera trajectories are content-based and often present scenes in idiomatic ways that are specific to the situations, and with specific intentions such as to tell a story [40]. Rather than hand-code such characteristics through cinematographic rules/heuristics [1][2][3][4], we propose to learn to capture NFOV videos, by observing HumanCam videos from the web. The following overviews our data collection procedure. ...
Article
Full-text available
We introduce the novel task of Pano2Vid: automatic cinematography in panoramic 360° videos. Given a 360° video, the goal is to direct an imaginary camera to virtually capture natural-looking normal field-of-view (NFOV) video. By selecting "where to look" within the panorama at each time step, Pano2Vid aims to free both the videographer and the end viewer from the task of determining what to watch. Towards this goal, we first compile a dataset of 360° videos downloaded from the web, together with human-edited NFOV camera trajectories to facilitate evaluation. Next, we propose AutoCam, a data-driven approach to solve the Pano2Vid task. AutoCam leverages NFOV web video to discriminatively identify space-time "glimpses" of interest at each time instant, and then uses dynamic programming to select optimal human-like camera trajectories. Through experimental evaluation on multiple newly defined Pano2Vid performance measures against several baselines, we show that our method successfully produces informative videos that could conceivably have been captured by human videographers.
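
A dynamic program of the kind AutoCam describes can be sketched as a Viterbi pass over discretized viewing angles: each time step scores candidate glimpses, and a motion penalty discourages jerky trajectories. The score matrix and smoothness weight below are hypothetical inputs, not the paper's learned quantities.

```python
import numpy as np

def autocam_track(scores, angles, smooth_w):
    """scores: (T, K) glimpse scores per time step and candidate angle;
    angles: (K,) candidate viewing directions in degrees.
    Returns the angle-index sequence maximizing score minus motion penalty."""
    T, K = scores.shape
    diff = np.abs(angles[None, :] - angles[:, None])
    penalty = smooth_w * np.minimum(diff, 360.0 - diff)  # wrap-around distance
    best = scores[0].copy()
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        cand = best[:, None] - penalty       # rows: previous angle, cols: next
        back[t] = cand.argmax(axis=0)
        best = cand.max(axis=0) + scores[t]
    path = [int(best.argmax())]
    for t in range(T - 1, 0, -1):            # backtrack the optimal trajectory
        path.append(int(back[t][path[-1]]))
    return path[::-1]
```
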
... However, a close look at these dedicated tools shows that a lot is still done manually, typically in selecting the appropriate moments, setting the cameras, and performing edits between multiple cameras. In parallel, for the last decade, researchers in computer graphics focusing on automated virtual camera control have been proposing a number of efficient techniques to automatically place and move cameras [Halper et al. 2001;Lino and Christie 2012] as well as editing algorithms to automatically or interactively edit the shots of a movie [Elson and Riedl 2007;Lino et al. 2011b]. ...
... These approaches are mostly founded on what could be referred to as "action-based" camera control, in the sense that a typical idiom is associated to each action occurring in the 3D environment (an idiom is a stereotypical way of shooting the action, either through a single shot or a sequence of shots). A film is then constructed by computing the best sequence of shots portraying a sequence of actions performed by the characters (as in [Elson and Riedl 2007;Lino et al. 2011b;Lino et al. 2011a;Markowitz et al. 2011]). ...
... Idiom-based techniques [Christianson et al. 1996] would typically fail due to the inability to handle complex situations and the necessity to design idioms for many different actions and situations. Finally, optimization-based approaches such as [Elson and Riedl 2007] require the manual specification of cinematographic patterns for each situation, while [Lino et al. 2011a] maps actions to shot preferences in a straightforward way. ...
Article
Full-text available
This paper presents a system that generates cinematic replays for dialogue-based 3D video games. The system exploits the narrative and geometric information present in these games and automatically computes camera framings and edits to build a coherent cinematic replay of the gaming session. We propose a novel importance-driven approach to cinematic replay. Rather than relying on actions performed by characters to drive the cinematography (as in idiom-based approaches), we rely on the importance of characters in the narrative. We first devise a mechanism to compute the varying importance of the characters. We then map importances of characters with different camera specifications, and propose a novel technique that (i) automatically computes camera positions satisfying given specifications, and (ii) provides smooth camera motions when transitioning between different specifications. We demonstrate the features of our system by implementing three camera behaviors (one for master shots, one for shots on the player character, and one for reverse shots). We present results obtained by interfacing our system with a full-fledged serious game (Nothing for Dinner) containing several hours of 3D animated content.
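
As a loose illustration of the importance-driven idea (not the paper's actual mechanism), the following sketch maps per-character importance values to one of the three camera behaviors the system implements.

```python
# Hypothetical sketch: choose among the three implemented camera behaviors
# (master shot, player shot, reverse shot) from per-character importances.
def pick_behavior(importances, player="player"):
    """importances: dict character -> value in [0, 1] for the current beat."""
    ranked = sorted(importances, key=importances.get, reverse=True)
    top = ranked[0]
    if len(ranked) > 1 and importances[ranked[1]] > 0.8 * importances[top]:
        return ("master_shot", ranked[:2])   # two near-equal leads: frame both
    if top == player:
        return ("player_shot", [top])        # focus the player character
    return ("reverse_shot", [top, player])   # speaker vs. player exchange
```
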
... This work offers interesting perspectives yet remains limited by its real-time constraints and the binary aspect of its cost functions. With the Cambot system, Elson and Riedl [ER07] also addressed the challenge of automatic editing using an optimization-based approach. Unlike standard idiom-based solutions, they encoded three distinct layers (or facets) of cinematic knowledge (see Figure 2.24): the stage (an area of space that the system assumes to be free of occlusions and obstructions), the blocking (the placement and movements of the characters within the stage), and the shot (the position, orientation and focal length of a virtual camera relative to the stage; it also handles camera motion). ...
... However, evaluating the quality of film editing (whether generated by machines or by artists) is a notoriously difficult problem [LRGG14]. Some contributions mention heuristics for choosing between multiple editing solutions without further details [CAH*96], while others minimize a cost function that is insufficiently described to be reproduced [ER07]. Furthermore, the precise timing of cuts has not been addressed, nor has the problem of controlling the rhythm of cutting (number of shots per minute) and its role in establishing film tempo [ADV02]. ...
... Idiom-based techniques [CAH*96] would typically fail due to the inability to handle complex situations and the necessity to design idioms for many different actions and situations. Finally, optimization-based approaches such as [ER07] require the manual specification of cinematographic patterns for each situation, while [LCCR11] maps actions to shot preferences in a straightforward way. ...
Thesis
Full-text available
The wide availability of high-resolution 3D models and the facility to create new geometrical and animated content, using low-cost input devices, opens to many the possibility of becoming digital 3D storytellers. To date there is however a clear lack of accessible tools to easily create the cinematography (positioning and moving the cameras to create shots) and perform the editing of such stories (selecting appropriate cuts between the shots created by the cameras). Creating a movie requires knowledge of a significant number of empirical rules and established conventions. In particular continuity editing -- the creation of a sequence of shots ensuring visual continuity -- is a complex endeavor. Most 3D animation packages lack continuity editing tools, calling for automatic approaches that would, at least partially, support users in their creative process. In this thesis we address both challenges of automating cinematography and editing in virtual environments. In a first contribution we propose a system that relies on Reynolds' model of steering behaviors to control and locally coordinate a collection of camera agents in dynamic 3D environments. The second contribution consists of a novel optimization-based approach for automatically creating well-edited movies from a 3D animation. We propose an efficient solution through dynamic programming, by relying on a plausible semi-Markov assumption. The next contribution uses and extends our previous work to propose a novel importance-driven approach to cinematic replay that exploits both the narrative and geometric information in games to automatically compute camera paths and edits. Finally, our last contribution addresses camera control issues by constraining motion on camera rails to ensure realistic shots.
... In attempts to replace this manual endeavor, different techniques have been proposed in the literature to automatically compute cinematographic sequences by relying on film-tree representations [CAH*96, ER07], or evolutions of idiom-based representations [HCS96]. Film-tree representations encode a cinematographic sequence as a set of scenes, further decomposed into shots and into frames. ...
... A complete cinematographic process has been proposed by Elson and Riedl [ER07] that deals simultaneously with the tasks of blocking (placing the characters and the scene), shooting and editing. The authors rely on a combination of search and dynamic programming to identify the best shots and best blockings, by processing a library of canonical shots associated with canonical blockings. ...
... More formally, let $X$ be the set of all possible shot descriptions in a cinematographic language and let $E$ be the set of all event types existing in a given story. The task of any automated editing process as proposed in [HCS96] or [ER07] is to associate the best sequence of shots $\dot{s} = s_1, s_2, \cdots, s_N$, with $s_i \in X$, among all possible shot sequences $\hat{s}$, with a given sequence of events $\hat{e} = e_1, e_2, \cdots, e_N$, with $e_i \in E$ (see equation 1): $\dot{s} = \arg\max_{\hat{s}} \, p(\hat{s} \mid \hat{e})$ ...
Article
Automatically computing a cinematographically consistent sequence of shots over a set of actions occurring in a 3D world is a complex task which requires not only the computation of appropriate shots (viewpoints) and appropriate transitions between shots (cuts), but also the ability to encode and reproduce elements of cinematographic style. Models proposed in the literature, generally based on finite state machines or idiom-based representations, provide limited functionalities to build sequences of shots. These approaches are not designed to easily learn elements of cinematographic style, nor do they allow significant variations in style over the same sequence of actions. In this paper, we propose a model for automated cinematography that can compute significant variations in terms of cinematographic style, with the ability to control the duration of shots and the possibility to add specific constraints to the desired sequence. The model is parameterized in a way that facilitates the application of learning techniques. By using a Hidden Markov Model representation of the editing process, we demonstrate the possibility of easily reproducing elements of style extracted from real movies. Results comparing our model with state-of-the-art first-order Markovian representations illustrate these features, and the robustness of the learning technique is demonstrated through cross-validation.
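
The learning side of such an HMM can be as simple as estimating shot-class transition probabilities from annotated films. A minimal sketch, assuming integer shot-class labels and Laplace smoothing (the paper's model additionally handles shot durations and sequence constraints):

```python
import numpy as np

def fit_shot_transitions(sequences, n_shot_classes, alpha=1.0):
    """Estimate a first-order shot-transition matrix from annotated films.
    sequences: lists of integer shot-class labels; alpha: Laplace smoothing."""
    counts = np.full((n_shot_classes, n_shot_classes), alpha)
    for seq in sequences:
        for a, b in zip(seq[:-1], seq[1:]):
            counts[a, b] += 1
    # Normalize each row into a probability distribution over next shot class.
    return counts / counts.sum(axis=1, keepdims=True)
```
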
... This overcomes the traditional limitations associated with a purely idiom-based system. Visibility is taken into account through the use of "stages", i.e. empty spaces with unlimited visibility, similar to [ER07]. Both systems use a simple algebra of "stages", i.e. intersections and unions of stages, allowing for very fast visibility computation against the static elements of the scene. ...
... From left to right: the methods covered in this survey differ in the required input, working from existing footage, existing animation, or an existing script. Top to bottom: we also distinguish between methods that work offline and methods that work in real-time. [Figure caption; the systems charted include ER07, RRE08, JY05, JY06, JY10, GCR*13, MCB15, GRLC15, and cinematic replay AWCO10.] ...
Article
Full-text available
Over the last forty years, researchers in computer graphics have proposed a large variety of theoretical models and computer implementations of a virtual film director, capable of creating movies from minimal input such as a screenplay or storyboard. The underlying film directing techniques are also in high demand to assist and automate the generation of movies in computer games and animation. The goal of this survey is to characterize the spectrum of applications that require film directing, to present a historical and up‐to‐date summary of research in algorithmic film directing, and to identify promising avenues and hot topics for future research.
... However, the framing and camera movement types they can handle are limited to only a small subset of the various types, meaning that their practical use is highly limited. Furthermore, previous studies mostly focus on designing a specific language model to generate the virtual camera layout [3,11,26,31,35], which hinders their use by novice users. ...
... One popular approach is to provide a shot specification through a high-level camera composition language that describes established filming styles and techniques. These languages, often delivered in the form of text, describe how the shot is composed and what the setup constraints are for virtual camera placement [11,23,26,31,35]. This method is direct and clear yet requires a certain degree of cinematic knowledge. ...
... Another series of papers pose video editing as a discrete optimization problem, solved using dynamic programming [12,21,13,22]. The key idea is to define the importance of each shot based on the narrative, and solve for the best set of shots that maximize viewer engagement. ...
... The key idea is to define the importance of each shot based on the narrative, and solve for the best set of shots that maximize viewer engagement. The work of Elson et al. [12] couples the twin problems of camera placement and camera selection; however, the details are insufficiently described to be reproduced. Meratbi et al. [22] employ a Hidden Markov Model for the editing process, where shot transition probabilities are learned from existing films. ...
Preprint
We present GAZED, eye GAZe-guided EDiting for videos captured by a solitary, static, wide-angle and high-resolution camera. Eye-gaze has been effectively employed in computational applications as a cue to capture interesting scene content; we employ gaze as a proxy to select shots for inclusion in the edited video. Given the original video, scene content and user eye-gaze tracks are combined to generate an edited video comprising cinematically valid actor shots and shot transitions, yielding an aesthetic and vivid representation of the original narrative. We model cinematic video editing as an energy minimization problem over shot selection, whose constraints capture cinematographic editing conventions. Gazed scene locations primarily determine the shots constituting the edited video. The effectiveness of GAZED against multiple competing methods is demonstrated via a psychophysical study involving 12 users and twelve performance videos.
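
To make the energy-minimization formulation concrete, the sketch below computes a hypothetical gaze-driven unary cost for one candidate shot in one frame; GAZED combines terms of this kind with penalties encoding editing conventions and optimizes over the whole sequence.

```python
import numpy as np

def gaze_shot_cost(gaze_xy, shot_rect, lam=1.0):
    """Unary cost of one candidate shot for one frame: the fraction of gaze
    points (N, 2 array) falling outside the shot's crop rectangle
    (x0, y0, x1, y1), so shots containing gazed locations are cheap.
    Hypothetical form, not the paper's exact energy term."""
    x0, y0, x1, y1 = shot_rect
    inside = ((gaze_xy[:, 0] >= x0) & (gaze_xy[:, 0] <= x1) &
              (gaze_xy[:, 1] >= y0) & (gaze_xy[:, 1] <= y1))
    return lam * (1.0 - inside.mean())
```
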
... However, while a video summarization algorithm makes binary decisions on whether to select a frame or not, an agent for 360 piloting needs to operate on a spatial space to steer the viewing angle to consider events of interest in a 360° video. On the other hand, in virtual cinematography, most camera manipulation tasks are performed within relatively simpler virtual environments [8,22,12,40] and there is no need to deal with viewers' perception difficulty because the 3-D positions and poses of all entities are known. However, a practical agent for 360 piloting needs to directly work with raw 360° videos. ...
... Finally, existing virtual cinematography works focused on camera manipulation in simple virtual environments/video games [8,22,12,40] and did not deal with the perception difficulty problem. [14,56,7,6] relaxed the assumption and controlled virtual cameras within restricted static wide field-of-view video of a classroom, video conference, or basketball court, where objects of interest could be easily extracted. ...
Article
Full-text available
Watching a 360° sports video requires a viewer to continuously select a viewing angle, either through a sequence of mouse clicks or head movements. To relieve the viewer from this "360 piloting" task, we propose "deep 360 pilot" -- a deep learning-based agent for piloting through 360° sports videos automatically. At each frame, the agent observes a panoramic image and has the knowledge of previously selected viewing angles. The task of the agent is to shift the current viewing angle (i.e. action) to the next preferred one (i.e., goal). We propose to directly learn an online policy of the agent from data. We use the policy gradient technique to jointly train our pipeline: by minimizing (1) a regression loss measuring the distance between the selected and ground truth viewing angles, (2) a smoothness loss encouraging smooth transitions in viewing angle, and (3) maximizing an expected reward of focusing on a foreground object. To evaluate our method, we build a new 360-Sports video dataset consisting of five sports domains. We train domain-specific agents and achieve the best performance on viewing angle selection accuracy and transition smoothness compared to [51] and other baselines.
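
A rough sketch of how the three training signals could be combined into a single objective (weights and tensor shapes are assumptions, not the paper's implementation):

```python
import numpy as np

def pilot_objective(pred, gt, reward, w_smooth=0.5, w_reward=1.0):
    """Combined training signal: regression loss to ground-truth viewing
    angles, smoothness loss on consecutive predictions, minus an expected
    reward for keeping the foreground object framed.
    pred/gt: (T, 2) pan-tilt angles; reward: (T,) per-frame focus reward."""
    regression = np.mean(np.sum((pred - gt) ** 2, axis=1))
    smoothness = np.mean(np.sum(np.diff(pred, axis=0) ** 2, axis=1))
    return regression + w_smooth * smoothness - w_reward * reward.mean()
```
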
... To get closer to how real movies are built, the additional control of the staging has also been explored. Elson and Riedl [ER07] proposed Cambot, a system to assist users in machinima production. They defined three different levels (or facets) of cinematic knowledge (see Figure 2.21): the stage (area of space that is assumed to be free of occlusion and obstruction), the blocking (geometric placement of subjects w.r.t. the center point of a stage), and the shot (position, rotation, and focal length of a virtual camera w.r.t. the center point of a stage). ...
... As displayed in the table, our approach satisfies most of the properties required in an interactive storytelling system and stands in stark contrast. The closest contribution to our work (Elson & Riedl [ER07]) is not a reactive approach (the system runs offline) and therefore cannot handle interactivity, does not handle composition on the screen (all shots are pre-computed path sequences) and does not offer path-planning capacities for complex camera motions. Table 3.1 - Comparing our cinematography system to the main contributions in the domain. ...
Article
Full-text available
Virtual camera control is nowadays an essential component in many computer graphics applications. Despite its importance, current approaches remain limited in their expressiveness, interactive nature and performances. Typically, elements of directorial style and genre cannot be easily modeled nor simulated due to the lack of simultaneous control in viewpoint computation, camera path planning and editing. Second, there is a lack in exploring the creative potential behind the coupling of a human with an intelligent system to assist users in the complex task of designing cinematographic sequences. Finally, most techniques are based on computationally expensive optimization techniques performed in a 6D search space, which prevents their application to real-time contexts. In this thesis, we first propose a unifying approach which handles four key aspects of cinematography (viewpoint computation, camera path planning, editing and visibility computation) in an expressive model which accounts for some elements of directorial style. We then propose a workflow allowing to combine automated intelligence with user interaction. We finally present a novel and efficient approach to virtual camera control which reduces the search space from 6D to 3D and has the potential to replace a number of existing formulations.
... Research on automatic film-making has been conducted for many years. More specifically, the problem of automatic film-editing has been addressed several times with different approaches [HCS96, ER07, GRLC15]. This paper presents a comparative analysis of human-made editing and automatically computed editing. ...
... Another approach consists of considering film-editing as an optimization problem. The Cambot system, presented in [ER07], optimizes editing using heuristics for shot selection and cuts. Though novel and efficient, this work does not account for pacing and does not provide any details on the heuristics used for the optimization. ...
Conference Paper
Full-text available
Through a precise 3D animated reconstruction of a key scene in the movie "Back to the Future", directed by Robert Zemeckis, we are able to make a detailed comparison of two very different versions of editing. The first version closely follows film editor Arthur Schmidt's original sequence of shots as cut in the movie. The second version is automatically generated using our recent algorithm [GRLC15] with the same choice of cameras. A shot-by-shot and cut-by-cut comparison demonstrates that our algorithm provides a remarkably pleasant and valid solution, even in such a rich narrative context, while differing from the original version more than 60% of the time. Our explanation is that our version avoids stylistic effects while the original version favors such effects and uses them effectively. As a result, we suggest that our algorithm can be thought of as a baseline ("film-editing zero degree") for future work on film-editing style.
... The term 'machinima' denotes a combination of the terms 'machine' and 'cinema' and hence mostly refers to cases where animated content is generated with a certain degree of automation. X. Zhang and Hu (2012) and Elson and Riedl (2007) use this term and present approaches for placing and animating virtual cameras. ...
Thesis
Full-text available
For recorded video content, researchers have proposed advanced concepts and approaches that enable the automatic composition and personalised presentation of coherent videos. This is typically achieved by selecting from a repository of individual video clips and concatenating a new sequence of clips based on some kind of model. However, there is a lack of generic concepts dedicatedly enabling such video mixing functionality for scenarios based on live video streams. This thesis aims to address this gap and explores how a live vision mixing process could be automated in the context of live television production, and, consequently, also extended to other application scenarios. This approach is coined the 'Virtual Director' concept. The name of the concept is inspired by the decision making processes which human broadcast TV directors are conducting when vision mixing live video streams stemming from multiple cameras. Understanding what is currently happening in the scene, they decide which camera view to show, at what point in time to switch to a different perspective, and how to adhere to cinematographic and cinematic paradigms while doing so. While the automation of vision mixing is the focus of this thesis, it is not the ultimate goal of the underlying vision. To automate for many viewers in parallel in a scalable manner allows taking decisions for each viewer or groups of viewers individually. To successfully do so allows moving away from a broadcast model where every viewer gets to see the same output. Particular content adaptation and personalisation features may provide added value for users. Preferences can be expressed dynamically, enabling interactive media experiences. In the course of this thesis, Virtual Director research prototypes are developed for three distinct application domains. Firstly, for distributed theatre performance, a script-based approach and a set of software tools are designed. A basic approach for the decision making process and a pattern how to decouple it into two core components are proposed. A trial validates the technology which does not implement full automation, yet successfully enables a theatre play. The second application scenario is live event 'narrowcast', a term used to denote the personalised equivalent to a 'broadcast'. In the context of this scenario, several computational approaches are considered for the implementation of an automatic Virtual Director with the conclusion to use and recommend a combination of (complex) event processing engines and event-condition-action (ECA) rules to model the decision making behaviour. Several content genres are subject to experimentation. Evaluation interviews provide detailed feedback on the specific research prototypes as well as the Virtual Director concept in general. In the third application scenario, group video communication, the most mature decision making behaviour is achieved. This behaviour needs to be defined in what can be a challenging process and is formalised in a model that is referred to as the 'production grammar'. The aforementioned pattern is realised such that a 'Semantic Lifting' process is processing low-level cue information in order to derive in more abstract, higher-level terms what is currently happening in the scene. The output of the Semantic Lifting process is informing and triggering the second process which is called the 'Director' decision making and eventually takes decisions on how to present the available content on screens. 
Overall, the exploratory research on the Virtual Director concept resulted in its successful application in the three domains, validated by stakeholder feedback and a range of informal and formal evaluation efforts. As a synthesis of the research in the three application scenarios, the thesis includes a detailed description of the Virtual Director concept. This description is contextualised by many detailed learnings that are considered relevant for both scholars and practitioners regarding the development of such technology.
... To this day, AI has been able to compose music (Cope, 2015; De Mantaras & Arcos, 2002), produce visual art (Cohn, 2018) and write literary works such as poems and novels (Liu, Fu, Kato, & Yoshikawa, 2018). In addition, creative decisions in filmmaking (Elson & Riedl, 2007; ScriptBook, 2018) or other organizational fields have been performed by AI-based systems (Aleksander, 2017; Anderson, Rainie, & Luchsinger, 2018; Schwartz, Hagel, Wooll, & Monahan, 2019). One of the leading researchers in computational creativity, Margaret Boden, argues that research on computational creativity has helped towards a better understanding of creativity and that combinatorial and transformational exploration can be performed by computers (Boden, 2009). ...
Article
Full-text available
This article shows how the creative performance of start-ups or established organizations can be improved through the use of AI-based systems for actively promoting creative processes. With insights from two studies conducted with entrepreneurs, innovation managers and workshop facilitators, we provide recommendations for companies and entrepreneurs on the ability of AI to support creative potential to remain innovative and marketable in the long term. Our studies cover aspects such as AI for entrepreneurial activities or creativity workshops and show how to make use of AI-based systems to enhance the creative potential of the person, the process or the press (environment). Our findings also provide theoretical insights into the perception of AI as an equal partner and call for further research on the design of AI for the future creative workplace.
... Machinima, on the other hand, aims to create non-interactive videos based on existing assets and systems, typically from games [12]. These videos are created using scripting and tools [13], and can be augmented by generative systems, including for cinematography [3] or (interactive) narrative [2]. ...
Chapter
The overall goal of VRN is to develop a novel technology solution at Children's Hospital Los Angeles (CHLA) to overcome barriers that prevent the recruitment of diverse patient populations to clinical trials by providing both caregivers and children with an interactive educational experience. This system consists of 1) an intelligent agent called Zippy that users interact with by keyboard or voice input, 2) a series of videos covering topics including Privacy, Consent and Benefits, and 3) a UI that guides users through all available content. Pre- and post-questionnaires assessed willingness to participate in clinical research and found that participants either increased or maintained their level of willingness to participate in research studies. Additionally, qualitative analysis of interview data revealed participants rated the overall interaction favorably and believed Zippy to be more fun, less judgmental and less threatening than interacting with a human. Future iterations are in progress based on the user feedback.
... Due to the very artistic and subjective nature of cinematography, automated cinematic camera systems are often designed with a huge degree of human input. Bridging this gap is Cambot, a lightweight system developed by Elson and Riedl to mimic real filmmaking processes in virtual environments [13]. Cambot is a scriptable cinematic camera system with built-in knowledge of a set of standard camera movements, such as wide shots and shot/reverse-shots for character dialog. ...
Article
E-sports is currently estimated to be a billion dollar industry which is only growing in size from year to year. However, the cinematography of spectated games leaves much to be desired. In most cases, the spectator either gets to control their own freely-moving camera or they get to see the view that a specific player sees. This thesis presents a system for the generation of cinematically pleasing views for spectating real-time graphics applications. A custom real-time engine has been built to demonstrate the effect of this system on several different game modes with varying visual cinematic constraints, such as the rule of thirds. To create the cinematic views, we encode cinematic rules as cost functions that are fed into a non-linear least squares solver. These cost functions rely on the geometry of the scene, minimizing residuals based on the 3D positions and 2D reprojections of the geometry. The final cinematic view is found by altering camera position and angle until a local minimum is met. The system was evaluated by comparing video output from a traditional rigidly constrained camera and the results of our algorithm's optimally solved views. User surveys are then used to qualitatively evaluate the system. The results of these surveys do not find a statistically significant preference between the cinematic views and the rigidly constrained views. In addition, we present performance and timing considerations for the system, reporting that the system can operate within modern expectations of latency when enough constraints are placed on the non-linear least squares solver.
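
The cost-function formulation can be illustrated with an off-the-shelf solver. The sketch below, using scipy rather than the thesis's custom engine, encodes a rule-of-thirds residual: each subject's projection is pulled toward the nearest thirds intersection. The camera parameterization and scene are hypothetical.

```python
import numpy as np
from scipy.optimize import least_squares

# Normalized image coordinates of the four rule-of-thirds intersections.
THIRDS = np.array([[1/3, 1/3], [1/3, 2/3], [2/3, 1/3], [2/3, 2/3]])

def project(cam, point, f=1.0):
    """Pinhole projection; cam = [x, y, z, yaw, pitch] (hypothetical
    parameterization). Returns normalized image coordinates."""
    pos, yaw, pitch = cam[:3], cam[3], cam[4]
    cy, sy, cp, sp = np.cos(yaw), np.sin(yaw), np.cos(pitch), np.sin(pitch)
    R = np.array([[cy, 0, -sy],
                  [sy * sp, cp, cy * sp],
                  [sy * cp, -sp, cy * cp]])
    p = R @ (np.asarray(point) - pos)
    return 0.5 + f * p[:2] / p[2]            # assumes the point stays in front

def thirds_residuals(cam, subjects):
    """Two residuals per subject: 2D offset to its nearest thirds point."""
    res = []
    for s in subjects:
        uv = project(cam, s)
        target = THIRDS[np.argmin(np.linalg.norm(THIRDS - uv, axis=1))]
        res.extend(uv - target)
    return res

# Example: solve for a camera pose placing two subjects on thirds points.
subjects = [np.array([0.5, 1.7, 5.0]), np.array([-0.5, 1.7, 5.0])]
fit = least_squares(thirds_residuals, x0=np.zeros(5), args=(subjects,))
```
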
... It is with this technology that a cinematic product is produced. Machinima is a lower-cost alternative for producing animation [2], but producing a good cinematic product naturally requires research into the control of virtual cameras. In the real world, a director often uses a storyboard to visualize their ideas so that they can be understood by the animators or camera operators [3]. ...
Article
Full-text available
Computer technology is widely used not only for research and education but also in entertainment. Computer-based entertainment includes computer games and animation, and one of their supporting components is machinima. Machinima is a technology that places cinematic components inside a virtual world. One controllable component is the placement of the camera. Directors can be distinguished by their directing styles, among other things by where they place the camera. Applying a particular directing style to a game or animation can produce a different atmosphere. This study attempts to profile a director's style based on habitual camera placement, using a fuzzy-logic approach. It uses 19 input variables extracted from simulation data and 5 output variables to profile two different directing styles. Area plots and histograms are produced to make it easier to distinguish the directors' styles, which were successfully differentiated based on the mode of the histogram analysis.
... A cinematic product can be produced using this technology. Moreover, machinima is a low-cost alternative to full-production filmmaking [2]. Nevertheless, producing a higher-quality cinematic product requires research on improving camera control languages, incorporating style into camera placement, and related topics. ...
Article
Full-text available
Machinima is a computer imaging technology typically used in games and animation. It prints all movie cast properties into a virtual environment by means of camera positioning. Since cinematography is complementary to machinima, it is possible to simulate a director's style via various camera placements in this environment. In a gaming application, the director's style is one of the most impressive cinematic factors, where a whole different gaming experience can be obtained using different styles applied to the same scene. This paper describes a system capable of automatically profiling a director's style using fuzzy logic. We employed 19 output variables and 15 other calculated variables from the animation extraction data to profile two different directors' styles across five scenes. Area plots and histograms were generated, and, by analyzing the histograms, the different directors' styles could be subsequently classified.
... Film idioms and their associated storytelling meanings are encoded, and a planner chooses the best camera position for each event, while optimizing transitions between shots to improve the fluency of the whole sequence. Elson and Riedl [8] adopted the idea of blockings from film practice, which involves the absolute placement of a number of characters and a camera in a non-occluded environment. The database of blockings, stages, and shots can be expanded. ...
Article
Full-text available
This article introduces Film Editing Patterns (FEP), a language to formalize film editing practices and stylistic choices found in movies. FEP constructs are constraints, expressed over one or more shots from a movie sequence, that characterize changes in cinematographic visual properties, such as shot sizes, camera angles, or the layout of actors on the screen. We present the vocabulary of the FEP language, introduce its usage in analyzing styles from annotated film data, and describe how it can support users in the creative design of film sequences in 3D. More specifically, (i) we define the FEP language, (ii) we present an application to craft filmic sequences from 3D animated scenes that uses FEPs as a high-level means to select cameras and perform cuts between cameras that follow best practices in cinema, and (iii) we evaluate the benefits of FEPs by performing user experiments in which professional filmmakers and amateurs had to create cinematographic sequences. The evaluation suggests that users generally appreciate the idea of FEPs, and that it can effectively help novice and medium-experienced users in crafting film sequences with little training.
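
An FEP-style constraint can be read as a predicate over a sequence of annotated shots. A minimal sketch with a hypothetical annotation scheme (the actual FEP vocabulary is far richer):

```python
# Hypothetical FEP-like constraint: a shot/reverse-shot exchange whose shot
# sizes never decrease, i.e. the framing intensifies as the scene progresses.
SIZES = ["long", "medium", "closeup"]

def matches_intensify(shots):
    """shots: list of dicts with 'size' and 'side' ('left'/'right') keys."""
    sides_alternate = all(a["side"] != b["side"]
                          for a, b in zip(shots, shots[1:]))
    sizes_grow = all(SIZES.index(a["size"]) <= SIZES.index(b["size"])
                     for a, b in zip(shots, shots[1:]))
    return sides_alternate and sizes_grow
```
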
... For camera control various interactive tools exist to control virtual cameras [1,14]. Others also tried to make previs more accessible by automatically compiling scripts to previs shots [5]. Previs tasks mainly require interaction with complex 3D content, which comprises special challenges. ...
Conference Paper
Full-text available
Previsualization (previs) is an essential phase in the design process of narrative media such as film, animation, and stage plays. Digital previs can involve complex technical tasks, e.g. 3D scene creation, animation, and camera work, which require trained skills that are not available to all personnel involved in creative decisions for the production. Interaction techniques such as virtual reality (VR) enable users to interact with 3D content in a natural way compared to classical 2D interfaces. As a first step, we developed VR based prototypes and performed an exploratory user study to evaluate how non-technical professionals from the film, animation, and theater domain assess the use of VR for previs. Our results show that users were able to interact with complex 3D scenes after a short phase of familiarization and rated VR for previs as useful for their professional work.
... As our techniques change the camera orientation in the scene, our work relates to prior work in virtual cinematography which studies virtual camera control in rendered scenes [22,16], and automatic real-world camera control for use in remote meetings or lectures [30,17]. Our problem differs in that we consider already edited 360 • videos. ...
Conference Paper
Virtual reality filmmakers creating 360-degree video currently rely on cinematography techniques that were developed for traditional narrow field of view film. They typically edit together a sequence of shots so that they appear at a fixed orientation irrespective of the viewer's field of view. But because viewers set their own camera orientation they may miss important story content while looking in the wrong direction. We present new interactive shot orientation techniques that are designed to help viewers see all of the important content in 360-degree video stories. Our viewpoint-oriented technique reorients the shot at each cut so that the most important content lies in the viewer's current field of view. Our active reorientation technique lets the viewer press a button to immediately reorient the shot so that important content lies in their field of view. We present a 360-degree video player which implements these techniques and conduct a user study which finds that users spend 5.2-9.5% more time viewing the important points (manually labelled) of the scene with our techniques compared to the traditional fixed-orientation cuts. In practice, 360-degree video creators may label important content, but we also provide an automatic method for determining important content in existing 360-degree videos.
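
The core of the viewpoint-oriented technique reduces to computing, at each cut, the yaw offset that brings the shot's important content into the viewer's current gaze direction. A one-function sketch (the angle convention is an assumption):

```python
def reorient_yaw(important_yaw, viewer_yaw):
    """Viewpoint-oriented cut (sketch): return the yaw offset, wrapped to
    [-180, 180), that rotates the incoming shot so its important content
    (important_yaw, degrees) lands at the viewer's gaze direction."""
    return (viewer_yaw - important_yaw + 180.0) % 360.0 - 180.0
```
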
... The machinima production tool Cambot, developed by [9], implements a library of "blockings", which are stage plans (the scene in the 3D environment) specifying where characters are placed and how they move around the stage. Cameras are then placed by looking for shots that are suitable for a given stage and blocking. ...
Conference Paper
Full-text available
In this work we design a tool for creators of interactive stories to explore the effect of applying camera patterns to achieve high level communicative goals in a 3D animated scenario. We design a pattern language to specify high level communicative goals that are translated into simple or complex camera techniques and transitions, and then flexibly applied over a sequence of character actions. These patterns are linked to a real-movie shot specification database through elements of context such as characters, objects, actions, and emotions. The use of real movies provides rich context information of the film, and allows the users of our tool to replicate the feel and emotion of existing film segments. The story context, pattern language and database are linked through a decision process that we call the virtual director, who reasons on a given story context and communicative goals, translates them into the camera patterns and techniques, and selects suitable shots from the database. Our real-time 3D environment gives the user the freedom to observe and explore the effect of applying communicative goals without worrying about the details of actor positions, scene, or camera placement.
... However, most of them only account for the framing of a single shot and neglect the semantics over multiple shots. Offline approaches can better account for complex story sequences, but are unable to react to dynamic gaming environments [ER07] [GRLC15]. Often the vocabulary of camera idioms defined by these systems is not easily user-extensible. ...
Article
Full-text available
The definitive version is available at http://diglib.eg.org/
... Virtual Cinematography: The majority of the work has focused on active and interactive cinematography, where one has control over the position of the cameras and other parameters such as lighting conditions [34,35]. We are mainly concerned with a passive case where multiple videos of a scene exist, and one needs to decide which camera to use for each frame. ...
Conference Paper
Full-text available
In this paper, we study the problem of recognizing human actions in the presence of a single egocentric camera and multiple static cameras. Some actions are better presented in static cameras, where the whole body of an actor and the context of actions are visible. Some other actions are better recognized in egocentric cameras, where subtle movements of hands and complex object interactions are visible. In this paper, we introduce a model that can benefit from the best of both worlds by learning to predict the importance of each camera in recognizing actions in each frame. By joint discriminative learning of latent camera importance variables and action classifiers, our model achieves successful results in the challenging CMU-MMAC dataset. Our experimental results show significant gain in learning to use the cameras according to their predicted importance. The learned latent variables provide a level of understanding of a scene that enables automatic cinematography by smoothly switching between cameras in order to maximize the amount of relevant information in each frame.
Article
Full-text available
As video games have developed and spread alongside technology, they have moved beyond playability and given rise to new methods of storytelling. One of these methods, machinima, emerged as a hybrid film genre through the application of animation and cinema techniques in 3D game environments. A review of the relevant literature, however, shows that a true definition of this method is difficult to pin down, since it is an intermediate form with blurred, constantly shifting boundaries. In this study, machinima, which has continued to develop since the 1990s, is treated as a transformative technical practice that, while making use of different media tools, causes those tools to evolve through video games and related concepts. Through this technique, the player/viewer, who by the interactive nature of games had been drawn into interactive designs and worlds and taken on an active role, is rendered passive and immobile once again. To understand the transformation that leads to this evolutionary change of role, the debates surrounding the emergence and definition of machinima are discussed first. The aim is thus to reveal the relationship between machinima and the shift of film production environments toward the virtual, and the reflections of this transformation process in game studies. In this context, the possibilities offered by video game and machinima engines such as MovieStorm, iClone, NVIDIA Omniverse Machinima and Unreal 4 are examined, and the effect of machinima on audiovisual storytelling is evaluated. In conclusion, the possibility of creating cinematic works using machinima engines is assessed as an approach that will change the future of filmmaking and traditional communication devices, as well as the boundaries of reality.
Article
Virtual cinematography refers to automatically selecting a natural-looking normal field-of-view (NFOV) from an entire 360° video. In fact, virtual cinematography can be modeled as a deep reinforcement learning (DRL) problem, in which an agent makes actions related to NFOV selection according to the environment of 360° video frames. More importantly, we find from our data analysis that the selected NFOVs attract significantly more attention than other regions, i.e., the NFOVs have high saliency. Therefore, in this paper, we propose an attention-based DRL (A-DRL) approach for virtual cinematography in 360° video. Specifically, we develop a new DRL framework for automatic NFOV selection with the input of both the content and the saliency map of each 360° frame. Then, we propose a new reward function for the DRL framework in our approach, which considers the saliency values, ground-truth, and smooth transition for NFOV selection. Subsequently, a simplified DenseNet (called Mini-DenseNet) is designed to learn the optimal policy via maximizing the reward. Based on the learned policy, the actions of NFOV can be made in our A-DRL approach for virtual cinematography of 360° video. Extensive experiments show that our A-DRL approach outperforms other state-of-the-art virtual cinematography methods over the datasets of Sports-360 video and Pano2Vid.
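
A reward of the kind described, combining saliency, ground-truth proximity, and smooth transitions, might look like the following sketch (weights and coordinate conventions are assumptions, not the paper's definition):

```python
import numpy as np

def adrl_reward(action_xy, gt_xy, saliency, prev_xy, w=(1.0, 1.0, 0.5)):
    """Sketch of a saliency-aware reward: saliency at the chosen NFOV center,
    proximity to the ground-truth center, and a smooth transition from the
    previous center. saliency: 2D map indexed (row, col); weights hypothetical."""
    sal = saliency[int(action_xy[1]), int(action_xy[0])]
    gt_term = -np.linalg.norm(np.asarray(action_xy) - np.asarray(gt_xy))
    smooth = -np.linalg.norm(np.asarray(action_xy) - np.asarray(prev_xy))
    return w[0] * sal + w[1] * gt_term + w[2] * smooth
```
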
Article
We present prominent structures in video, a representation of visually strong, spatially sparse and temporally stable structural units, for use in video analysis and editing. With a novel quality measurement of prominent structures in video, we develop a general framework for prominent structure computation, and an efficient hierarchical structure alignment algorithm between a pair of videos. The prominent structural unit map is proposed to encode both binary prominence guidance and numerical strength and geometry details for each video frame. Even though the detailed appearance of videos could be visually different, the proposed alignment algorithm can find matched prominent structure sub-volumes. Prominent structures in video support a wide range of video analysis and editing applications including graphic match-cut between successive videos, instant cut editing, finding transition portals from a video collection, structure-aware video re-ranking, visualizing human action differences, etc.
Chapter
Deliberation-driven reflective sequences, or DDRSs, are cinematic idioms used by filmmakers to convey the motivations behind a character's adoption of a particular course of action in a story. We report on an experiment in which the cinematic generation system Ember was used to create a cinematic sequence with variants making different choices for DDRS use around a single decision point for a single character.
Chapter
Creating machine-generated cinematics currently requires either a significant amount of human authoring time or manually coding domain operators so that a rendering system can realize them. We present FireBolt, an automated cinematic realization system based on a declarative knowledge representation that supports both human and machine authoring of cinematics with reduced authoring and engineering workloads.
Article
We present Write-A-Video, a tool for the creation of video montage using mostly text-editing. Given an input themed text and a related video repository, either from online websites or personal albums, the tool allows novice users to generate a video montage much more easily than current video editing tools. The resulting video illustrates the given narrative, provides diverse visual content, and follows cinematographic guidelines. The process involves three simple steps: (1) the user provides input, mostly in the form of editing the text, (2) the tool automatically searches for semantically matching candidate shots from the video repository, and (3) an optimization method assembles the video montage. Visual-semantic matching between segmented text and shots is performed by cascaded keyword matching and visual-semantic embedding, which achieve better accuracy than alternative solutions. The video assembly is formulated as a hybrid optimization problem over a graph of shots, considering temporal constraints, cinematography metrics such as camera movement and tone, and user-specified cinematography idioms. Using our system, users without video editing experience are able to generate appealing videos.
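The shot-graph assembly step can be pictured as a dynamic program that picks one candidate shot per text segment while trading off matching quality against transition cost. The sketch below is a minimal illustration under assumed inputs (per-segment match costs and a pairwise transition-cost function); it is not the paper's actual optimizer, which also handles temporal constraints and idioms.

def assemble_montage(match_cost, trans_cost):
    """Pick one candidate shot per text segment by dynamic programming.

    match_cost[t][i] : cost of using candidate shot i for text segment t
                       (e.g. 1 minus a visual-semantic matching score).
    trans_cost(j, i) : cost of cutting from candidate j (segment t-1) to
                       candidate i (segment t), e.g. camera-movement or
                       tone discontinuity. Both inputs are assumptions.
    Returns the index of the chosen shot for each segment.
    """
    T = len(match_cost)
    # best[t][i]: minimal total cost of a montage ending with shot i at segment t.
    best = [list(match_cost[0])]
    back = [[None] * len(match_cost[0])]
    for t in range(1, T):
        row, ptr = [], []
        for i, mc in enumerate(match_cost[t]):
            costs = [best[t - 1][j] + trans_cost(j, i)
                     for j in range(len(best[t - 1]))]
            j_star = min(range(len(costs)), key=costs.__getitem__)
            row.append(costs[j_star] + mc)
            ptr.append(j_star)
        best.append(row)
        back.append(ptr)
    # Backtrack the optimal sequence of shots.
    i = min(range(len(best[-1])), key=best[-1].__getitem__)
    path = [i]
    for t in range(T - 1, 0, -1):
        i = back[t][i]
        path.append(i)
    return path[::-1]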
Conference Paper
We propose an automatic virtual cinematography method that takes a continuous optimization approach. A suitable camera pose or path is determined automatically by computing the minima of an objective function encoding desired parameters, such as those common in live-action photography or cinematography. Multiple objective functions can be combined into a single optimizable function, which can be extended to model the smoothness of the optimal camera path using an active contour model. Our virtual cinematography technique can be used to find camera paths in either scripted or unscripted scenes, both with and without smoothing, at a relatively low computational cost.
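As a rough illustration of combining several photographic preferences into one optimizable function, the sketch below minimizes a weighted sum of distance, height, and look-at terms over a single camera pose with SciPy. The terms, weights, and Nelder-Mead solver are assumptions made for illustration, and the active-contour path smoothing mentioned above is omitted.

import numpy as np
from scipy.optimize import minimize

def camera_objective(pose, subject, desired_dist=3.0, desired_height=1.6,
                     w_dist=1.0, w_height=0.5, w_lookat=2.0):
    """Combined objective for one camera pose (x, y, z, pan, tilt).

    Each term mimics one photographic preference; the choice of terms
    and weights is illustrative, not the paper's exact functions.
    """
    x, y, z, pan, tilt = pose
    cam = np.array([x, y, z])
    to_subj = subject - cam
    dist = np.linalg.norm(to_subj)
    # Preferred shooting distance (e.g. roughly a medium shot).
    e_dist = (dist - desired_dist) ** 2
    # Preferred camera height (roughly eye level).
    e_height = (z - desired_height) ** 2
    # Keep the subject centered: compare viewing and look-at directions.
    view = np.array([np.cos(tilt) * np.cos(pan),
                     np.cos(tilt) * np.sin(pan),
                     np.sin(tilt)])
    e_lookat = 1.0 - view.dot(to_subj / max(dist, 1e-9))
    return w_dist * e_dist + w_height * e_height + w_lookat * e_lookat

subject = np.array([0.0, 0.0, 1.6])        # subject head position (assumed)
x0 = np.array([2.0, 2.0, 1.6, 0.0, 0.0])   # initial camera guess
res = minimize(camera_objective, x0, args=(subject,), method="Nelder-Mead")
# res.x holds the optimized (x, y, z, pan, tilt).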
Article
We present a system for efficiently editing video of dialogue-driven scenes. The input to our system is a standard film script and multiple video takes, each capturing a different camera framing or performance of the complete scene. Our system then automatically selects the most appropriate clip from one of the input takes, for each line of dialogue, based on a user-specified set of film-editing idioms. Our system starts by segmenting the input script into lines of dialogue and then splitting each input take into a sequence of clips time-aligned with each line. Next, it labels the script and the clips with high-level structural information (e.g., emotional sentiment of dialogue, camera framing of clip, etc.). After this pre-process, our interface offers a set of basic idioms that users can combine in a variety of ways to build custom editing styles. Our system encodes each basic idiom as a Hidden Markov Model that relates editing decisions to the labels extracted in the pre-process. For short scenes (<2 minutes, 8–16 takes, 6–27 lines of dialogue), applying the user-specified combination of idioms to the pre-processed inputs generates an edited sequence in 2–3 seconds. We show that this is significantly faster than the hours of user time skilled editors typically require to produce such edits, and that the quick feedback lets users iteratively explore the space of edit designs.
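To make the idiom-as-HMM idea concrete, the sketch below builds log-score tables for an invented "favor speaker close-ups, avoid jump cuts" idiom. The label names, scores, and structure are hypothetical stand-ins for the paper's learned HMM parameters.

def idiom_scores(lines, clips_per_line):
    """Build score tables for a toy 'speaker close-up, no jump cuts' idiom.

    lines          : one dict per dialogue line, e.g. {"speaker": "ANNA"}.
    clips_per_line : clips_per_line[t] is a list of candidate-clip dicts,
                     e.g. {"framing": "closeup", "subject": "ANNA"}.
    Returns (emission, transition) where
      emission[t][i]      scores clip i for line t, and
      transition[t][j][i] scores cutting from clip j (line t-1) to clip i.
    All labels and score values here are invented for illustration.
    """
    emission, transition = [], [None]  # no transition into the first line
    for t, line in enumerate(lines):
        em = []
        for clip in clips_per_line[t]:
            score = 0.0
            # Idiom part 1: favor a close-up on the current speaker.
            if clip["framing"] == "closeup" and clip["subject"] == line["speaker"]:
                score += 1.0
            em.append(score)
        emission.append(em)
        if t > 0:
            # Idiom part 2: discourage jump cuts, i.e. back-to-back clips
            # with the same framing.
            tr = [[-1.0 if prev["framing"] == cur["framing"] else 0.0
                   for cur in clips_per_line[t]]
                  for prev in clips_per_line[t - 1]]
            transition.append(tr)
    return emission, transition

A Viterbi pass over such tables (the same kind of dynamic program sketched for the shot graph above) then selects one clip per line of dialogue.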
Conference Paper
We introduce the novel task of Pano2Vid: automatic cinematography in panoramic 360° videos. Given a 360° video, the goal is to direct an imaginary camera to virtually capture natural-looking normal field-of-view (NFOV) video. By selecting “where to look” within the panorama at each time step, Pano2Vid aims to free both the videographer and the end viewer from the task of determining what to watch. Towards this goal, we first compile a dataset of 360° videos downloaded from the web, together with human-edited NFOV camera trajectories to facilitate evaluation. Next, we propose AutoCam, a data-driven approach to solve the Pano2Vid task. AutoCam leverages NFOV web video to discriminatively identify space-time “glimpses” of interest at each time instant, and then uses dynamic programming to select optimal human-like camera trajectories. Through experimental evaluation on multiple newly defined Pano2Vid performance measures against several baselines, we show that our method successfully produces informative videos that could conceivably have been captured by human videographers.
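The discriminative "glimpse" scoring step can be pictured as training a classifier to separate features of real NFOV web-video clips from randomly sampled glimpses, then scoring every space-time glimpse of a new 360° video. The sketch below uses logistic regression purely as an illustrative stand-in; the feature extraction and the paper's actual scoring model are not reproduced here.

import numpy as np
from sklearn.linear_model import LogisticRegression

def train_glimpse_scorer(nfov_feats, random_feats):
    """Fit a toy capture-worthiness classifier.

    nfov_feats   : (N, F) features of real NFOV web-video clips (positives).
    random_feats : (M, F) features of randomly sampled glimpses (negatives).
    """
    X = np.vstack([nfov_feats, random_feats])
    y = np.r_[np.ones(len(nfov_feats)), np.zeros(len(random_feats))]
    return LogisticRegression(max_iter=1000).fit(X, y)

def score_glimpses(model, glimpse_feats):
    """Score every glimpse of a new 360° video.

    glimpse_feats : (num_timesteps, num_directions, F) feature array.
    Returns a (num_timesteps, num_directions) capture-worthiness map.
    """
    T, D, F = glimpse_feats.shape
    probs = model.predict_proba(glimpse_feats.reshape(-1, F))[:, 1]
    return probs.reshape(T, D)

A dynamic program over score(t, d), with a penalty for changing viewing direction d between consecutive steps, then recovers a smooth human-like trajectory, analogous to the montage-assembly sketch above.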
Chapter
This chapter provides a survey of research on the analysis and generation of narrative discourse that deals with effective presentation of story content through the visual medium. It starts with a theoretical grounding in narratology and cognitive science where the distinction between story and discourse is established. Theories of visual discourse that expand the notions of textual discourse to fit the analysis of visual narratives will be described. Finally, a discussion of automatic generation of coherent visual discourse in terms of viewpoint selection in virtual environments will be carried out.
Conference Paper
The popularity of machinima movies has increased greatly in recent years. From a transmedia point of view, there has been little development of tools to assist machinima production; existing tools are still mainly aimed at the gaming community and 3D animators. The tool developed here aims to bring the typical workflow of a conventional movie set into a machinima creation environment, expanding the possibilities for transmedia productions. With Mappets, a plugin for the Unity3D game engine, we enable a translation from the typical movie dimension to a virtual one. This work evaluates the current state of the art of machinima development tools and presents a working solution better suited to transmedia productions and to non-expert users interested in producing machinima.
Article
This paper proposes a physics-based model to simulate a reactive camera that is capable of both high-quality tracking of moving target objects and producing plausible, interactive responses to a variety of game scenarios. The virtual physical rig consists of a motorized pan-tilt head that is controlled to meet desired target look-at directions, as well as an active suspension system that stabilizes the camera assembly against disturbances. To showcase its differences from other camera systems, we contrast our physically based technique with the direct (kinematic) methods that are standard in industry.
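A minimal sketch of one PD-controlled axis of such a pan-tilt head is given below, assuming unit inertia and illustrative gains and torque limits; the paper's full rig, including the active suspension stage, is not reproduced.

def pan_tilt_step(angle, velocity, target_angle, dt,
                  kp=40.0, kd=8.0, max_torque=5.0):
    """One integration step of a PD-controlled motorized axis (pan or tilt).

    The gains, torque limit, and unit inertia are illustrative assumptions.
    """
    error = target_angle - angle
    torque = kp * error - kd * velocity              # PD control law
    torque = max(-max_torque, min(max_torque, torque))  # actuator limit
    velocity += torque * dt                          # unit inertia assumed
    angle += velocity * dt                           # semi-implicit Euler
    return angle, velocity

Stepping this toward a moving look-at target produces the slight lag and overshoot that distinguish a physically simulated camera from a purely kinematic one.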
Article
I review my research activities in Video Indexing and Action Recognition and sketch a research agenda for bringing those two lines of research together to address the difficult problem of recognizing actions in movies. I first present a series of older projects in Video Indexing, starting with the DIVAN project at INA and the MPEG expert group (1998–2000), and continuing at INRIA under the VIBES project (2001–2004). This research falls under the general approach of "computational media aesthetics", where we attempt to recognize film events based on our knowledge of filming styles and conventions (cinematography and editing). This is illustrated with two applications: the automatic segmentation of TV news into topics, and the automatic indexing of movies with their scripts. I then present my more recent research in Action Recognition with the MOVI group at INRIA (2005–2008). Building upon the GRIMAGE infrastructure, I present experiments in (1) learning and recognizing a small repertoire of full-body gestures in 3D using "motion history volumes"; (2) segmenting a raw stream of 3D image sequences into recognizable "primitive actions"; and (3) using statistical models learned in 3D to recover primitive actions and relative camera positions from a single 2D video recording of similar actions.