Large Scale Video Analytics
On-demand, iterative inquiry for moving image research
Virginia Kuhn
USC
Los Angeles, CA, USA
Ritu Arora
TACC
Austin, TX, USA
Alan Craig, Kevin Franklin, Michael Simeone
ICHASS
Urbana, IL, USA
Dave Bock, Luigi Marini
NCSA
Urbana, IL, USA
Abstract- Video is exploding as a means of communication
and expression, and the resultant archives are massive,
disconnected datasets. Thus, scholars’ ability to research this
crucial aspect of contemporary culture is severely hamstrung
by limitations in semantic image retrieval, incomplete
metadata, and the lack of a precise understanding of the actual
content of any given archive. Our aim in the Large Scale Video Analytics (LSVA) project is to address obstacles in both image retrieval and research that uses extreme-scale archives of video data by employing a human-machine hybrid process for analyzing moving images. We propose an approach that 1) places more interpretive power in the hands of the human user through novel visualizations of video data, and 2) uses a customized on-demand configuration that enables iterative queries.
Index Terms- High Performance Computing, Image Edge
Detection, Image Retrieval, Multimedia Databases, Software,
Visualization.
I. INTRODUCTION
The process of understanding and utilizing the content of
large databases of video archives has remained both time-
consuming and laborious. Aside from the massive size of
contemporary archives and the challenges that have faced
semantically-sensitive image retrieval for the last 20 years,
other key challenges to effectively analyzing video archives
with existing methods include limited metadata and the lack of
a precise understanding of the actual content of the archive. A
final difficulty lies in the incompleteness of translation across semiotic registers: words can never fully represent sounds and images, leaving a gap in meaning when labels alone are employed to describe and search for content.
The real-time, interactive and iterative analysis of large
video archives can be both compute-intensive and memory-
intensive. High Performance Computing (HPC) platforms and
storage resources are therefore needed to handle the large
volume, velocity and variety associated with such video
archives. Given that about 72 hours of video are uploaded to YouTube alone every minute (volume and velocity), and that these videos come in diverse formats and codecs (variety), large-scale video analytics is a Big Data problem [1] in which the data are semi-structured or unstructured.
Though HPC is indispensable for analyzing such large databases of videos, one of the obstacles a humanities researcher faces when working in an open-science HPC environment is the long wait-time for job processing when a job is submitted to a regular queue. The nature of humanities research, especially video analysis, requires that the researcher be able to get results from one query quickly in order to formulate the next one. Therefore, a truly interactive system for video analysis that can function in an HPC environment is required to support researchers' goals.
The Large Scale Video Analytics (LSVA) research project
explores new possibilities offered by both an innovative use of
the Gordon supercomputer at the San Diego Supercomputer Center (SDSC), and the conjoined interests of HPC and the cultural and historical study of moving images. We aim to facilitate humanities research on moving images at a scale heretofore unthinkable, demonstrating the possibility for humanists to productively inform policies and infrastructure at supercomputing centers, even as the affordances of HPC enliven and extend humanities research.
Our aim in this project is to address obstacles in both
image retrieval and research that uses extreme-scale
archives of video data. The searching, tagging, and analysis enabled by image retrieval face the semantic-gap problem: low-level image features and actions cannot satisfactorily retrieve user-identified objects. This gap is only exacerbated as queries by historians and by cinema and media scholars demand a high degree of precision and nuance in the study of moving images. To address this problem we propose a two-pronged approach that 1) places more interpretive power in the hands of the human user through novel visualizations of video data, and 2) uses a customized on-demand configuration of Gordon that enables iterative queries over a short period of time.
The rest of the paper describes our efforts towards
enabling real-time, interactive and iterative video analysis
in an open-science HPC environment.
II. BACKGROUND: THE STUDY OF MOVING IMAGES BY THE
HUMANITIES
Traditionally, cinema scholars’ methods consist of
conducting close readings of individual films or genres of
films, much the way that those in the field of English studies
explicate literature. By and large this inquiry is confined to theatrical films, that is, the mainly fictive films produced by studios for entertainment. The challenge of analyzing the 115 years of cinema being digitized across the globe is already daunting. But theatrical film is only the tip of the moving-image dataset.
With the spread of affordable recording devices, from consumer-level video cameras to cell phone recorders, video has exploded as a common form of authoring, and this content is widely shared across multiple online platforms in various forms and lengths. In this environment, the notion of a discrete film, or even of a single demarcated archive, is quickly becoming obsolete. It is as though we are building an alphabet of images and sounds with no dictionary or grammar to help us understand the impact of this extra-linguistic communication. These datasets demand critical analysis of both form and content.
Like all photorealistic media, video combines two
contradictory features: it carries the presumed objectivity of
machine-recorded evidence that neutrally documents, and yet it
always has a point of view. To shoot footage is to frame, and to
frame is to exclude. No longer confined to theatres, moving images saturate contemporary culture. However, the role and impact of the ubiquitous images and sounds that form time-based media are difficult if not impossible to gauge without innovations in research methodologies that give a researcher access to vast archives that s/he could never view in a single lifetime.
III. SUPERCOMPUTING ON-DEMAND
The arrival of the XSEDE resource “Gordon”, a supercomputer with extensive flash memory, has opened the possibility for researchers to query large databases interactively, on demand, and in real time, including databases of digital videos. Additionally, Gordon's computational capability is sufficient for extensive real-time analysis of video assets to determine which videos to return in response to a query. This is a compute- and memory-intensive process involving queries that cannot be anticipated ahead of time.
This project will use the Gordon supercomputer not only to pre-process videos to automatically extract meaningful metadata, but also as an interactive engine that allows researchers to generate queries on the fly for which the metadata extracted a priori is not sufficient. In order to be useful to researchers, we are combining an interactive database, a robust web-based front-end (Medici [2]), and powerful visualization representations to aid researchers in understanding the contents of the video footage without requiring them to watch every frame of every movie.
Due to the need for a high-quality end-user experience (low latency and high throughput), the LSVA project has received dedicated and interactive access to Gordon's I/O nodes. The overall system architecture is shown in Figure 1. As the figure shows, besides the database of metadata extracted from the videos, the repository of the videos will also reside on a Gordon I/O node to minimize the time spent in I/O. It should be noted that this approach will be modified after the complete workflow has been prototyped and tested on Gordon, to address the scalability issues raised by the massive increase in dataset size during the production stage.
Fig. 1. Overview of the System Architecture
Fig. 2. Medici Interface
A. Robust Front-End
Medici is a scalable content management system that allows users to upload and run analytics on a variety of file types such as images, audio, video, and PDF documents. It supports both automatic metadata extraction and user-defined content tagging. Automatic metadata extraction services are driven by file MIME type and include an image extractor, Gamera extractor, document extractor, PDF extractor, and video extractor. For the
LSVA project, Medici will be extended to extract metadata of
interest to cinema researchers such as shot-length and color-
palette. But perhaps more profoundly, the custom
visualizations we will add to Medici will allow new knowledge
about the role and impact of video data to emerge. The
metadata will be stored in a relational database for running
various tools in the analytical pipeline in a batch-mode.
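To make this kind of derived metadata concrete, the following is a minimal sketch, not Medici's actual extractor interface, of shot-length and color-palette extraction; it assumes OpenCV (cv2) and NumPy are available, and its cut-detection threshold is purely illustrative.

import cv2
import numpy as np

def extract_shot_metadata(path, diff_threshold=40.0):
    """Detect hard cuts via mean absolute frame difference; report each
    shot's span in frames and its average BGR color."""
    cap = cv2.VideoCapture(path)
    shots = []                                  # (start_frame, end_frame, mean_bgr)
    prev_gray = None
    shot_start, frame_idx = 0, 0
    color_sum, color_count = np.zeros(3), 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        # A large jump in mean pixel difference marks a probable cut.
        if prev_gray is not None and \
           np.mean(cv2.absdiff(gray, prev_gray)) > diff_threshold:
            shots.append((shot_start, frame_idx - 1, color_sum / color_count))
            shot_start, color_sum, color_count = frame_idx, np.zeros(3), 0
        color_sum += frame.reshape(-1, 3).mean(axis=0)
        color_count += 1
        prev_gray = gray
        frame_idx += 1
    if color_count:
        shots.append((shot_start, frame_idx - 1, color_sum / color_count))
    cap.release()
    return shots

Dividing a shot's length in frames by the frame rate yields its duration in seconds, and the per-shot mean color is a crude stand-in for a fuller palette.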
B. Interactive Database at the Backend
As mentioned in [3], existing video analysis applications
generally fail to scale because the majority of platforms for
video processing treat databases merely as a storage engine
rather than a computation engine. In the LSVA project, the
rich metadata associated with the video repositories will be
stored in the database along with the additional information
related to the processing of videos (e.g., algorithms to be used
in the work-flow) such that the analytics can be performed
proactively in a batch-mode with minimal end-user
interaction. Such proactive processing along with optimization
schemes will result in a near real-time end-user experience. The metadata extraction service in Medici, by default, extracts standard metadata elements and writes them as RDF tuples. In this research, the metadata extraction service will be modified both to launch multiple concurrent processes for faster extraction and to extract additional metadata of interest to cinema scholars, which will be stored in a relational database schema for faster querying and access in the analytical pipeline.
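As a hedged sketch of this concurrency, the following fans per-video extraction out across worker processes while a single writer populates a simple relational table. The schema, database file name, and function names are illustrative rather than the LSVA schema, and extract_shot_metadata refers to the sketch in Section III.A.

import sqlite3
from multiprocessing import Pool

def shots_for_video(video_path):
    # Per-video extraction; reuses the illustrative extractor from III.A.
    shots = extract_shot_metadata(video_path)
    return [(video_path, i, int(start), int(end))
            for i, (start, end, _color) in enumerate(shots)]

def build_metadata_db(video_paths, db_path="lsva_metadata.db", workers=8):
    conn = sqlite3.connect(db_path)
    conn.execute("""CREATE TABLE IF NOT EXISTS shots (
                        video TEXT, shot_index INTEGER,
                        start_frame INTEGER, end_frame INTEGER)""")
    # Extraction is CPU-bound, so fan it out across processes; keeping a
    # single writer avoids SQLite write contention.
    with Pool(workers) as pool:
        for rows in pool.imap_unordered(shots_for_video, video_paths):
            conn.executemany("INSERT INTO shots VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    conn.close()

With metadata in this form, questions such as the average shot length per video become simple SQL aggregates rather than passes over the raw footage.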
Currently, the sample data being used to establish the complete workflow for this research is under 4 TB and can therefore be stored in the flash memory on the I/O node allocated to this project. However, as the amount of data grows, and as some steps in the batch mode demand compute-intensive processing (e.g., massive amounts of metadata extraction from hundreds of terabytes of video with short turn-around times), the Lustre filesystem on Gordon will be used to avoid fetching the files at the start of a batch job.
C. Visualization
Insight and understanding are greatly enhanced when information is explored from multiple perspectives. To provide such perspectives, information design must continue to evolve, experimenting with the latest tools and technologies to communicate ever-increasing and ever more complex information effectively [4, 5, 6]. Fundamental principles of spatial and temporal simultaneity, metamorphosis, time-modification, and juxtaposition are investigated using advanced information design and visualization tools in order to develop methods for effectively representing large collections of video. Our goal is to experiment with presenting video collections in novel representations, both as a means of visualizing video data and as image tags for searchable video databases.
Movie Cube
One visualization method involves the concept of a movie cube. In this study, we explore ways in which visualization tools can be applied to analyze a movie sequence. A movie sequence is first converted into a three-dimensional dataset by extracting each frame of the sequence and ordering the frames along the Z axis. Once in this form, we can use a variety of visualization techniques to examine the data. We use our custom visualization system to examine this dataset, as shown in the examples below (Figures 3, 4, 5, and 6), using a sequence from one of the Internet Archive movies in the Prelinger Collection (Safety Patrol, 1937 [7]). We begin by rendering slice planes at various locations along the Z axis, along with the bounds of the dataset. As expected, we see the individual frames of the movie, as shown in Figure 3. Note that time progresses from front to back in our movie cube.
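A minimal sketch of this construction follows, with OpenCV and NumPy standing in for our custom visualization system; the function name is ours for illustration.

import cv2
import numpy as np

def build_movie_cube(path, max_frames=None):
    """Stack decoded frames along axis 0 (Z/time) into a (T, H, W, 3) volume."""
    cap = cv2.VideoCapture(path)
    frames = []
    while max_frames is None or len(frames) < max_frames:
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    cap.release()
    return np.stack(frames)   # cube[z] is the frame at time step z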
Fig. 3. Rendering slice planes.
Fig. 4. Rendering slice planes across time (vertical)
Experimenting with different orientations of our slice plane, we begin to see some interesting patterns emerge. As shown in Figures 4 and 5, we render slice planes cutting across time, revealing only a single row (left) or column (right) from each frame at each instant of time. Note that these visualizations give us a clear representation of camera shots. Specifically, we can see when in time (along the Z axis) new camera shots occur, as well as the relative duration of each shot. We can also see patterns of movement in time.
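In terms of the earlier sketch, such a cross-time slice is simply a fixed row or column taken from every frame; the file name below is hypothetical, and matplotlib stands in for our renderer.

import matplotlib.pyplot as plt

cube = build_movie_cube("safety_patrol.mp4", max_frames=2000)   # hypothetical path
row_slice = cube[:, cube.shape[1] // 2, :, :]   # (T, W, 3): middle row over time
col_slice = cube[:, :, cube.shape[2] // 2, :]   # (T, H, 3): the column variant (Fig. 5)

# Hard cuts appear as abrupt seams across the time axis, so shot
# boundaries and relative shot durations are visible at a glance.
plt.imshow(row_slice)
plt.xlabel("frame width (pixels)")
plt.ylabel("time (frame index)")
plt.show()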
Fig. 5. Rendering slice planes across time (horizontal)
We can also sample our movie cube volume using any type
of shape. In Figure 6, we map our movie data to a cylinder
positioned within the volume.
Fig. 6. Mapping movie data to a cylinder
By employing these visualizations, which treat videos as signals over time rather than as cinematographic creations, we are able to create two-dimensional images that contain within them activity over time. Thus, while it is possible to use our data to search two-dimensional images extracted from films by segmenting individual shots and scenes (a search better oriented toward seeking out specific objects or persons, and one fraught with the obstacles that have impeded image retrieval since its beginning), our search will explore the possibilities offered by searching for activity instead of objects.
It is our hope that by searching for activity types (something more easily translated into machine-readable patterns) we may prototype an equitable model for the hybrid systems used to navigate large-scale archives of moving images. In this model the human user retains more interpretive power, helping to mitigate the distortion introduced into searches by the semantic gaps common in image retrieval.
IV. ANTICIPATED OUTCOMES
To date, data visualization tools have successfully rendered “snapshots” of large video datasets, but these produce “meta-images” that, while informative, hold little explanatory power on their own and, as such, are difficult to evaluate (see Figures 7 and 8 [8, 9]). Considered on their own, they become little more than visual indices on the front-end and graphs of code tolerances on the back-end, neither of which holds up as a generalizable knowledge object. Thus one of our main goals is to use interpretive frameworks to draw useful conclusions about these large datasets by versioning approaches (e.g., crowd-sourced verification of machine recognition [9]).
Fig. 7. Visualization using Image Plot algorithm
Fig. 8. Visualization using Cinemetrics algorithm
As detailed in [10], the conceptual issues that inhere in labeling images with words are another major theme interrogated here; as such, content tagging will be extremely important. We will endeavor to create a mix of standard tags and idiosyncratic labels in order to more fully represent the possibilities presented by a vocabulary of images.
In this way, we will leverage the power of HPC together with the expertise and interpretive strategies of humanities scholars in order to arrive at a robust system that makes possible sophisticated analysis of the vast video archives that characterize contemporary culture. We are also evaluating existing scene-completion techniques [11] for integration into our analytical workflow; such a tool will be useful for completing scenes from a repository of semantically related pictures or videos.
ACKNOWLEDGEMENT
This work uses the Extreme Science and Engineering
Discovery Environment (XSEDE), which is supported by
National Science Foundation grant number OCI-1053575.
We are grateful to XSEDE for providing us the resources
required for development and deployment of this project.
REFERENCES
[1] Paul Zikopoulos and Chris Eaton. 2011. Understanding Big Data: Analytics for Enterprise Class Hadoop and Streaming Data, First Edition, pp. 1-166. http://www-01.ibm.com/software/data/bigdata/
[2] Medici multi-media content management system:
http://medici.ncsa.illinois.edu/
[3] Qiming Chen, Meichun Hsu, Rui Liu, and Weihong Wang.
2009. Scaling-Up and Speeding-Up Video Analytics Inside
Database Engine. In Proceedings of the 20th International
Conference on Database and Expert Systems Applications
(DEXA '09), 244-254.
[4] Barry Salt. 2006. “The Numbers Speak,” Moving Into Pictures. Starword Press.
[5] James E. Cutting, Jordan E. DeLong, and Christine E. Nothelfer. 2010. “Attention and the Evolution of Hollywood Film.” Psychological Science, published online 5 February 2010. DOI: 10.1177/0956797610361679
[6] Yuri Tsivian and Gunars Civjans. Cinemetrics: Movie
Measurement and Study Tool Database.
http://www.cinemetrics.lv/.
[7] Safety Patrol film. 1937. Producer: Handy (Jam) Organization
Sponsor: Chevrolet Division, General Motors Corporation.
[8] Software Studies Initiative, Image Plot visualization software:
explore patterns in large image collections
http://lab.softwarestudies.com/p/imageplot.html
[9] Brodbeck, Frederic. Cinemetrics thesis project:
http://cinemetrics.fredericbrodbeck.de/
[10] Virginia Kuhn. 2010. “Filmic Texts and the Rise of the Fifth Estate,” International Journal of Learning and Media, MIT Press, Volume 2, Issue 2-3. DOI: 10.1162/IJLM_a_00057
[11] James Hays, Alexei A. Efros. Scene Completion Using Millions
of Photographs. ACM Transactions on Graphics (SIGGRAPH
2007). August 2007, vol. 26, No. 3.