Movies and Actors: Mapping the Internet Movie Database

August 2007

DOI:10.1109/IV.2007.78

Source
IEEE Xplore

Conference: Information Visualization, 2007. IV '07. 11th International Conference

Authors:

Bruce Herr

Indiana University Bloomington

Weimao Ke

Drexel University

Katy Börner

Indiana University Bloomington

Movies and Actors: Mapping the Internet Movie Database

Bruce W. Herr, Weimao Ke, Elisha Hardy & Katy Börner

School of Library and Information Science, Indiana University, Bloomington, IN 47405

{bherr@indiana.edu, wke@indiana.edu, efhardy@indiana.edu, katy@indiana.edu}

Abstract

This paper presents the results of an analysis and

visualization of 428,440 movies from the Internet Movie

Database (IMDb) provided for the Graph Drawing 2005

contest. Simple statistics are presented as well as a

tapestry of all movies with an overlay of the giant

component of the co-actor network. Academy award

winners are highlighted. Major insights are discussed.

Keywords: network analysis, domain visualization, movies

1. Introduction

Since 2002, the International Sunbelt Social

Network Conference has hosted a so-called Viszards

session [6] that aims to show the power of network

analysis and visualization. The work discussed in this

paper was done for Viszards 2006 at Sunbelt XXVI,

which took place in Vancouver, BC, Canada, in April

2006. Viszards 2006 asked network science

researchers to analyze data retrieved from the Internet

Movie Database (IMDb). IMDb (http://www.imdb.com)

is a popular site cataloging almost every movie ever

made.

The study of IMDb data is interesting for several

reasons. For one, most people know about and can relate

to movies and actors. Thus, when presented with a

visualization of movie data, they will try to find their

favorite movies and actors, identify movies of potential

interest or explore the complex co-actor relationships

among actors. Second, the dataset has rich information

on each movie and actor, allowing for a wide variety of

data analyses. Third, the dataset is sufficiently clean and

structured so that analysis can be done without using

semantic matching techniques.

From the beginning, our goal was to show all

movies as well as major co-actor relationships. We

wanted to give the world an overview of the movie and

actor space that almost everyone is familiar with. Doing

this on a large canvas (the final visualization is

36” high and 73” wide) and in a way that people can

reason about and understand the visualization was a

major challenge. The required data density, that is, the

volume of data per square inch, posed additional difficulties.

With this paper and the IMDb visualization we hope

to communicate the power of visually pleasing yet

informative visualizations to a general audience.

Visualizations can be more than eye candy. Paper

printouts are discussed as a viable alternative for the

presentation of high-density visualizations.

The remainder of the paper is organized as follows:

Section 2 introduces the dataset used. Section 3 explains

the data analyses and results. Section 4 discusses the

iterative design of the visualization and insights gained.

The paper concludes with a discussion and outlook.

2. Data preparation

The data for the IMDb visualization originates from

the Graph Drawing 2005 web site [3] at

http://www.ul.ie/gd2005/dataset.html. The dataset is a

bipartite graph in which each node either corresponds to

an actor or to a movie. Edges go from a movie to each

actor in the movie. The dataset also provides node

metadata such as movie/actor name, year of the movie, and

genre of the movie. We parsed this data and stored it

in a relational database to ease data manipulation.
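As a rough illustration of the storage step described above, the following Python sketch loads a bipartite movie-actor graph into SQLite. The table and column names (movie, actor, acts_in) and the function build_database are assumptions made for this example; the paper does not state which database system or schema was used.

```python
import sqlite3


def build_database(movies, actors, edges):
    """Load a bipartite movie-actor graph into an in-memory SQLite database.

    movies: iterable of (movie_id, title, year, genre); year may be None
    actors: iterable of (actor_id, name)
    edges:  iterable of (movie_id, actor_id), one per movie-to-actor edge
    """
    conn = sqlite3.connect(":memory:")
    # Hypothetical schema: one table per node type plus one edge table.
    conn.executescript(
        """
        CREATE TABLE movie (id INTEGER PRIMARY KEY, title TEXT, year INTEGER, genre TEXT);
        CREATE TABLE actor (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE acts_in (
            movie_id INTEGER REFERENCES movie(id),
            actor_id INTEGER REFERENCES actor(id),
            PRIMARY KEY (movie_id, actor_id)
        );
        """
    )
    conn.executemany("INSERT INTO movie VALUES (?, ?, ?, ?)", movies)
    conn.executemany("INSERT INTO actor VALUES (?, ?)", actors)
    conn.executemany("INSERT INTO acts_in VALUES (?, ?)", edges)
    conn.commit()
    return conn
```

With such a layout, questions like "which movies have no year?" or "how many actors does each movie have?" reduce to short SQL queries.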

As with all large datasets, there were diverse

anomalies. Out of the 428,440 movies in the set, 2,091

movies had no year data, six movies were listed as produced in 1

CE, two as produced in 2 CE, 24 more as produced

between the years 3 and 1888 CE, and the ‘Adult’ movie

entitled ‘Westside Boys’ is listed as to be produced in 9006 CE.

The biggest anomaly in the derived data is the fact that of

the 428,440 movies provided, 123,617 movies have no

actor data at all. This is particularly problematic for us

since we are showing the interplay between actors and

movies. We believe that this is most likely a problem

inherited from the derived data, since the official IMDb

statistics list only 365,328 movies in the database as of

March 2007, fewer than the 428,440 in the derived data

from early 2005. In the end, we excluded those movies that did

not have actor information.
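The exclusions described above can be sketched as a simple filter. This is a minimal illustration rather than the procedure actually used: the function name and the input dictionaries are hypothetical, and the default year bounds of 1890 and 2007 are taken from the range reported in Section 3.

```python
def select_usable_movies(movie_years, cast_sizes, first_year=1890, last_year=2007):
    """Return ids of movies with a plausible year and at least one actor.

    movie_years: dict movie_id -> year (None when the year is missing)
    cast_sizes:  dict movie_id -> number of actors linked to the movie
    """
    usable = []
    for movie_id, year in movie_years.items():
        # Drop movies with a missing or implausible year (e.g. 1 CE or 9006 CE).
        if year is None or not (first_year <= year <= last_year):
            continue
        # Drop movies without any actor data.
        if cast_sizes.get(movie_id, 0) == 0:
            continue
        usable.append(movie_id)
    return usable
```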

Herr II, Bruce W., Ke, Weimao, Hardy, Elisha, and Börner, Katy. (2007) Movies and Actors: Mapping the Internet Movie Database.

In Conference Proceedings of 11th Annual Information Visualization International Conference (IV 2007),

Zurich, Switzerland, July 4-6, pp. 465-469, IEEE Computer Society Conference Publishing Services.

3. Data analysis

After loading the data into a relational database,

we computed several statistics to get a feel for the data. We

excluded the anomalous data discussed in the last section,

resulting in 302,691 movies produced between 1890 and

2007. Note that this data is from early

2005, so all movies dated after that were at various stages

of production and not yet released. Figure 1 shows the

growth of movies over time. The red lines show the

boundaries of the movies that we considered.

Figure 1. Growth of Movies Over Time

Figure 2. Movie Out-Degree Distribution

There are 3,792,390 links connecting 302,691

movies and their 896,308 actors (see Figure 2). In total, 38,027

movies have exactly one actor. The movies with more than

1,000 actors are ‘The Eurovision Song Contest’ (1,338

actors), ‘Around the World in 80 days’ (1,287 actors),

and ‘General Hospital’ (1,123 actors).
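A degree distribution of the kind summarized above can be computed directly from the movie-to-actor edge list. The sketch below is an assumption about how such a count might be implemented; the function name and the edge representation as (movie_id, actor_id) pairs are hypothetical.

```python
from collections import Counter


def movie_degree_distribution(edges):
    """Map cast size -> number of movies with that cast size.

    edges: iterable of (movie_id, actor_id) pairs, one per movie-to-actor link
    """
    # First count the actors linked to each movie (the movie's out-degree).
    cast_size = Counter(movie_id for movie_id, _actor_id in edges)
    # Then count how many movies share each out-degree.
    return Counter(cast_size.values())
```

The entry for cast size 1 in the returned counter corresponds to the number of movies with exactly one actor.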

In order to get a feel for the actor space, we created

a co-actor network in which actors are connected based on

the movies they acted in together. Actors that appear in a

movie together are said to co-act. The network of co-

acting actors contains 896,308 actor nodes and

114,128,535 co-actor links (see Figure 3). Each link is

weighted by the number of movies the two actors were in

together.
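The construction of the weighted co-actor network described above amounts to a one-mode projection of the bipartite graph. The following Python sketch shows one straightforward way to do this; the function name and data layout are assumptions, and a dataset of the size reported here would likely need a more memory-efficient approach.

```python
from collections import Counter, defaultdict
from itertools import combinations


def co_actor_weights(edges):
    """Project movie-actor edges onto a weighted co-actor network.

    edges: iterable of (movie_id, actor_id) pairs
    Returns a Counter mapping an (actor_a, actor_b) pair, with
    actor_a < actor_b, to the number of movies the two actors share.
    """
    cast = defaultdict(set)
    for movie_id, actor_id in edges:
        cast[movie_id].add(actor_id)
    weights = Counter()
    for actors in cast.values():
        # Every unordered pair of actors in the same movie co-acts once.
        for pair in combinations(sorted(actors), 2):
            weights[pair] += 1
    return weights
```

Because a cast of n actors contributes n(n-1)/2 pairs, a single movie with 1,338 actors alone yields 894,453 co-actor pairs, which helps explain why the projected network has far more links than the bipartite graph.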

Figure 3. Co-Actor Out-Degree Distribution

4. Data visualization

Our main goal with this visualization was to give a

global overview of the entire movie and actor space.

During the initial design phase, we wanted to draw a co-

actor network and surround it with a list of all

movies. However, fitting this onto a reasonably sized

canvas proved difficult. The sheer number of movies to

plot was so large that we eventually decided to render

them in columns and overlay the co-actor network

directly on top of the movies. We also wanted to

constrain ourselves to 36” high and somewhere around

40” wide, but eventually went with 73” wide to

accommodate all of the movies. A description of the

layers of the final visualization follows.

At the bottom of the visualization is the movies

layer. The movies were grouped by year and plotted in

97 columns. Within each year, the movies were sorted

and their titles size-coded by the number of starring

actors. Movie titles were also sorted and color-coded

by genre. Each of the seven top genres (Short,

Drama, Comedy, Documentary, Adult, Romance, and

Thriller) was given a distinct color, while the rest were

given a light grey color. Overplotting was used to fit

the movies into the available area. A white outline is

drawn around each character to improve text legibility. A

close-up of the movies layer is given in Figure 4.

Figure 4. Zoomed View of the Movie Layer

The next layer up is the actor layer. We felt that the

best way of showing the actors was by laying out their

co-actor network using a force-directed layout algorithm.

Each edge between the actors was weighted by the

number of times the two actors had been in the same

movie together. To focus on the strongest co-actorship

linkages, we excluded all those links that had a

weight of less than three. Nodes left unconnected as a result

were excluded. The remaining core network with

105,758 nodes and 1,292,816 edges was fed into VxOrd

[2] to lay out actor nodes with a modified spring force

layout algorithm. This algorithm ensures that highly

interlinked nodes are close to each other and unlinked or

weakly linked nodes are further apart. The resulting list

of coordinates for each of the 105,758 actor nodes was

rendered using Pajek [4]. The color of each actor node

corresponds to the movie genre the actor contributed to most.

The results are shown in Figure 5. A zoomed-in

portion of the co-actor network can be seen in Figure 6.
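The thresholding step described above, keeping links with a weight of at least three and discarding nodes that end up unconnected, can be sketched as follows. The function name and the dictionary-based edge representation are assumptions for illustration; the layout itself (VxOrd) and the rendering (Pajek) are not reproduced here.

```python
def strongest_core(weights, min_weight=3):
    """Keep strong co-actor links and the actors they still connect.

    weights: dict mapping (actor_a, actor_b) -> number of shared movies
    Returns (core_nodes, core_edges).
    """
    # Exclude all links with a weight of less than min_weight.
    core_edges = {pair: w for pair, w in weights.items() if w >= min_weight}
    # Only actors that appear in a surviving link remain in the network.
    core_nodes = {actor for pair in core_edges for actor in pair}
    return core_nodes, core_edges
```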

Figure 5. Co-Actor Network

Figure 6. Zoomed View of the Co-Actor Network

Another layer was added to provide landmarks in

this complex co-actor network. The network was

spatially cut into a 10x10 grid and, in each cell, the actor

who had been in the most movies was labeled

by name using a light-colored, 15-point font.

These labels are useful in identifying clusters of actors.
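The landmark selection described above can be sketched in a few lines of Python. This is an illustrative assumption about the procedure: the function name, the way coordinates are mapped to cells, and the tie-breaking (the first actor encountered wins) are not taken from the paper.

```python
def grid_landmarks(positions, movie_counts, cells=10):
    """Pick one landmark actor per grid cell.

    positions:    dict actor -> (x, y) layout coordinates
    movie_counts: dict actor -> number of movies the actor appeared in
    Returns a dict (column, row) -> actor with the most movies in that cell.
    """
    xs = [x for x, _ in positions.values()]
    ys = [y for _, y in positions.values()]
    min_x, min_y = min(xs), min(ys)
    # Guard against a zero-width or zero-height layout.
    span_x = (max(xs) - min_x) or 1.0
    span_y = (max(ys) - min_y) or 1.0
    best = {}
    for actor, (x, y) in positions.items():
        # Clamp so that points on the far border fall into the last cell.
        col = min(int((x - min_x) / span_x * cells), cells - 1)
        row = min(int((y - min_y) / span_y * cells), cells - 1)
        cell = (col, row)
        if cell not in best or movie_counts.get(actor, 0) > movie_counts.get(best[cell], 0):
            best[cell] = actor
    return best
```

Empty cells simply produce no entry, so at most 100 labels result from a 10x10 grid.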

The discussed layers form a reference system that

can be used to overlay additional data. In the

visualization described in this paper, we added two more

data layers. The first shows the Academy Award

winners and nominees for best actor and best

actress from 2000-2004 [1].

They are represented as 41 darker and larger actor labels

on top of the co-actor network. The most interesting aspect

of this layer is that most of the actors are tightly

packed into one cluster. Though not fully explored, this

may mean that in order to increase one’s chances of an

Academy Award for best actor/actress, one should work

closely with actors in this cluster.

The second additional layer shows the winners

and nominees for the Academy Award for best picture.

The 25 movies nominated (including the winners)

have exactly 433 actors in the co-actor network. This

layer draws lines from the 25 nominated movies in the

underlying movie layer to the associated actors in the co-

actor network layer. The color of the lines corresponds to

the genre of the movie. The curves of the lines were

chosen so as to not cover up too much of the co-actor

network. This layer helps to highlight which areas of the

actor space are used by top movies in the field.

All of the layers except for the co-actor network

were created with custom code that reads in the provided

data and produces PostScript® files. The co-actor

network layer was exported to PostScript from

Pajek. To produce the final image, the assorted

layers were combined and rasterized at 400 dots per inch

(DPI) in Adobe Photoshop©. An additional layer

created in Photoshop added the informational

column on the right side of the visualization.

The movies layer proved to be very difficult to

rasterize in Photoshop due to its size and complexity. For

the version presented at Sunbelt, we had to reduce its

complexity by removing the outlines drawn around the text.

This worked, but we were never quite satisfied with the

loss in quality that resulted. After nine months of trying

larger machines and distributed rendering, a solution was

found. Using the GNU Image Manipulation Program

(GIMP) on a Sun server with 32 GB of RAM

and 4 processing cores, we finally got the layer to

render satisfactorily at 400 DPI. This new image has

replaced the older movies layer.

Figure 7 at the end of this paper shows the final

visualization. Unfortunately, it is reproduced at less than one-eighth

the size of the original visualization and many details

are lost at this size. To really appreciate the visualization,

one must either have a full resolution printed version or

go to http://scimaps.org/maps/movieactors to see a

zoomable Google Maps interface to the visualization.

The map is also available for sale from

http://scimaps.org/ordermaps in support of the Places &

Spaces: Mapping Science exhibit.

5. Discussion

The presented work demonstrates the utility of paper

printouts for presenting visualizations with high data density.

Paper as a medium is easy to access and transport, offers

high data density, and is comparatively cheap. Humans

have used paper and interacted with it for well over

2,000 years and have highly optimized it as a medium to

store, transmit, and preserve information. Paper naturally

supports exploration. Interactivity like zooming and

panning can be accomplished by physically moving

closer to and further away from the print. While

zooming and panning can be cumbersome in computational

environments, this sort of interaction with paper is

immediately obvious to viewers. Arbitrary annotations

are possible. Last but not least, there is something to be

said about a visualization that can be physically touched

and has a real texture to it.

The higher density of paper has allowed us to give

an overview of the entire movie and actor space in a

reasonable physical space. Our current visualization

renders at 400 DPI, but there are techniques to utilize up

to 4,000 DPI. The result is extremely crisp graphics that

allow for further zooming with a physical magnifying

glass.

In addition to bringing out paper’s natural strength,

this visualization work also made obvious the current

limitations of rendering on large display walls. Display

walls are limited by the rather low resolution of modern

monitors and projectors. To render the full-resolution

visualization, a display wall would have to be around 4

times as large as the equivalent print in each dimension. A 12’x24’ display

wall would be prohibitively expensive. Compare this to

an equivalent 3’x6’ print, which is much cheaper, denser,

and can be mass produced.
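The resolution argument above can be made concrete with a little arithmetic. The sketch below computes the pixel dimensions of the 36” by 73” print at 400 DPI and the linear scale factor a display would need to show every pixel. The display density of 100 pixels per inch is an assumption chosen to match the factor of four stated above; actual display walls vary.

```python
def pixel_dimensions(width_in, height_in, dpi):
    """Pixel width and height of a print of the given size at the given DPI."""
    return round(width_in * dpi), round(height_in * dpi)


def wall_scale(print_dpi, display_ppi):
    """Linear factor by which a display must exceed the print to show every pixel."""
    return print_dpi / display_ppi


# Final visualization: 73 inches wide, 36 inches high, rasterized at 400 DPI.
width_px, height_px = pixel_dimensions(73, 36, 400)
megapixels = width_px * height_px / 1e6

# Assumed display density of 100 pixels per inch (hypothetical value).
scale = wall_scale(400, 100)
print(f"{width_px} x {height_px} pixels ({megapixels:.2f} megapixels), scale factor {scale:g}")
```

Under these assumptions, the print corresponds to an image of 29,200 by 14,400 pixels, and a display wall would need four times the width and four times the height of the print, that is, sixteen times its area.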

Future work aims to update the data behind the

visualization and add a layer of interactivity. We will do

this using an invention by W. Bradford Paley

called an illuminated diagram [5]. This technique uses a

projector to interactively highlight interesting parts on

statically printed diagrams. We can then take advantage

of the interactivity of computers, yet still retain the

qualities of printed media.

Acknowledgements

We would like to thank all those involved in the

Internet Movie Database for creating an excellent

dataset, Vladimir Batagelj for organizing the Viszards

session, Bryan J. Hook for editing, and Sumeet Ambre

for creating the Google Maps interface for our

visualization.

This research is supported by the National Science

Foundation under IIS-0513650 and a CAREER grant

under IIS-0238261. Any opinions, findings, and

conclusions or recommendations expressed in this

material are those of the author(s) and do not necessarily

reflect the views of the NSF.

References

[1] Academy Award’s best actor and actress winners and

nominees from 2000-2004 downloaded from

http://www.imdb.com/Sections/Awards. Accessed

April 2006.

[2] Davidson, G.S., Wylie, B.N. and Boyack, K.W. Cluster

stability and the use of noise in interpretation of

clustering. Proc. IEEE Information Visualization 2001.

23-30.

[3] Internet Movie Database (IMDb) network provided for

GD’05 at http://www.ul.ie/gd2005.

[4] Nooy, W. de, Mrvar, A. and Batagelj, V. Exploratory

Social Network Analysis with Pajek. Cambridge

University Press, 2005.

[5] Paley, W. Bradford. Illuminated Diagrams: Using Light

and Print to Comparative Advantage. InfoVis Conference

2002.

[6] Viszards: Analysis and Visualization of IMDB Networks

session description from

http://www.insna.org/2006/special.sessions.html.

Accessed April 2006.

Figure 7. Complete Map
