Interactive Visualization for IMDb database

University Paris-Sud
Interactive Information Visualization
Walter FERREIRA
2/18/2014
Dataset description:
This version of the IMDb database consists of two separate files. The first is the movie database
itself, which holds the main characteristics of each movie, such as title, budget, genre, rating and
crew. The crew information in this first file is incomplete, however: it contains only the id codes of
each person, which refer to the second file. The second file holds specific information about actors,
directors, producers, writers and some other professional categories. To obtain complete information
about a movie, the two files must be merged programmatically; doing so yields richer information
about both movies and actors.
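As an illustration of the merge step, the sketch below resolves the person ids stored with a movie against a lookup table built from the second file. It is a minimal sketch and not the project's code: the class name `CrewJoinSketch`, the method `resolveCrew` and the in-memory representation (an `int[]` of ids and a `Map` from id to name) are assumptions, since the actual file formats are not described here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory join between the movie file and the crew file.
class CrewJoinSketch {
    /**
     * Returns the names of the crew members of one movie, in the order of crewIds.
     * Ids that have no entry in the crew table are skipped.
     */
    static List<String> resolveCrew(int[] crewIds, Map<Integer, String> namesById) {
        List<String> names = new ArrayList<String>();
        for (int id : crewIds) {
            String name = namesById.get(id);
            if (name != null) {
                names.add(name);
            }
        }
        return names;
    }
}
```

Building the id-to-name map once and reusing it for every movie keeps the merge linear in the total number of crew entries.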
The focus of this visualization is on the actors, directors, producers and writers and on how
they are related to one another. They are represented as a social network, and this type of
visualization has a few known issues. The first is scalability when displaying hundreds of
thousands of nodes. For instance, the crew database contains around 250,000 people, which
makes it impossible to represent everybody at once in the same visualization. I therefore
preprocessed the crew dataset before building the social network. Most of the 250,000 people
are not known to the general public, and since the visualization is about famous people and
stars, little information is lost if unknown people are left out. For the scope of this
visualization, 100 people among actors, directors, producers and writers were chosen according
to the IMDb popularity rank.
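The preprocessing can be pictured as a sort followed by a cut. The sketch below is illustrative only: the `RankedPerson` class, the `popularityRank` field and the convention that rank 1 is the most popular are assumptions, not details taken from the dataset.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Hypothetical record for one person of the crew file.
class RankedPerson {
    final String name;
    final int popularityRank; // assumption: 1 is the most popular

    RankedPerson(String name, int popularityRank) {
        this.name = name;
        this.popularityRank = popularityRank;
    }
}

class TopPeopleSketch {
    /** Returns the n best-ranked people without modifying the input list. */
    static List<RankedPerson> topN(List<RankedPerson> all, int n) {
        List<RankedPerson> sorted = new ArrayList<RankedPerson>(all);
        Collections.sort(sorted, new Comparator<RankedPerson>() {
            public int compare(RankedPerson a, RankedPerson b) {
                return Integer.compare(a.popularityRank, b.popularityRank);
            }
        });
        int end = Math.min(n, sorted.size());
        return new ArrayList<RankedPerson>(sorted.subList(0, end));
    }
}
```

Calling `TopPeopleSketch.topN(crew, 100)` on the full crew list would give the reduced set used to build the network.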
Data encoding:
The main encoding of this visualization is the node-link graph representation commonly used for
social networks. I built the layout on the LinLog force field algorithm with clustering, in order
to show which people cooperate most frequently with one another.
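For reference, the basic (node-repulsion) LinLog energy model described by Noack scores a layout as the sum of the distances between connected nodes minus the sum of the logarithms of the distances between all pairs of nodes; layouts with low energy pull densely connected groups together, which is what makes clusters visible. The sketch below only evaluates that energy for given positions. The class and method names are hypothetical, and the modified LinLogLayout used in this project may rely on a weighted or edge-repulsion variant instead.

```java
// Minimal sketch of the node-repulsion LinLog energy; not the LinLogLayout code itself.
class LinLogEnergySketch {
    /** Euclidean distance between two 2D points given as {x, y}. */
    static double distance(double[] p, double[] q) {
        double dx = p[0] - q[0];
        double dy = p[1] - q[1];
        return Math.sqrt(dx * dx + dy * dy);
    }

    /**
     * positions[i] = {x, y} of node i; edges[k] = {u, v}.
     * Distinct nodes are assumed to have distinct positions (log of 0 is undefined).
     */
    static double energy(double[][] positions, int[][] edges) {
        double attraction = 0.0;
        for (int[] edge : edges) {
            attraction += distance(positions[edge[0]], positions[edge[1]]);
        }
        double repulsion = 0.0;
        for (int i = 0; i < positions.length; i++) {
            for (int j = i + 1; j < positions.length; j++) {
                repulsion += Math.log(distance(positions[i], positions[j]));
            }
        }
        return attraction - repulsion;
    }
}
```

For two connected nodes the energy is d - ln(d), which is smallest at distance d = 1; a layout algorithm searches for positions that minimize this quantity over the whole graph.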
Each node represents one person. Nodes are drawn as circles, and the area of each circle
represents how many movies that person has made with the other people displayed in the
social network. By default all nodes have the same color, but the clusters can be
highlighted with a key command.
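Because the movie count is mapped to the area of the circle rather than to its radius, the diameter has to grow with the square root of the count. The sketch below shows that conversion; the class name and the `areaPerMovie` scale factor are assumptions chosen for illustration.

```java
// Minimal sketch: converts a movie count into a circle diameter so that
// the circle's AREA, not its radius, is proportional to the count.
class NodeSizeSketch {
    /** areaPerMovie is a hypothetical scale factor, in square pixels per movie. */
    static double diameterFor(int movieCount, double areaPerMovie) {
        double area = movieCount * areaPerMovie;
        return 2.0 * Math.sqrt(area / Math.PI);
    }
}
```

With this mapping, a person with four times as many movies gets a circle twice as wide, which avoids visually exaggerating large counts.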
The edges represent cooperation between people. All cooperation between two people is
merged into a single edge and is reflected in the weight of that edge and in the clusters.
The weights are not visible in the basic visualization; the user has to hover over a node to
see all of its connections highlighted together with their weights. However, it is always
possible to get a sense of which people often work together by looking at the cluster
formation.
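One plausible way to merge cooperation into weighted edges is to count, for every pair of people, the movies they share. The sketch below does this under that assumption; the report does not state exactly how the weight is computed, and the class name, the `"a-b"` key format and the `int[]` crew lists are hypothetical.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal sketch: one weighted edge per pair of people, weight = number of shared movies.
class CoCreditWeights {
    /** crewPerMovie holds, for each movie, the ids of the displayed people credited on it. */
    static Map<String, Integer> count(List<int[]> crewPerMovie) {
        Map<String, Integer> weights = new HashMap<String, Integer>();
        for (int[] crew : crewPerMovie) {
            for (int i = 0; i < crew.length; i++) {
                for (int j = i + 1; j < crew.length; j++) {
                    if (crew[i] == crew[j]) {
                        continue; // same person credited twice: no self-loop
                    }
                    // Order the ids so that (a, b) and (b, a) map to the same edge.
                    int a = Math.min(crew[i], crew[j]);
                    int b = Math.max(crew[i], crew[j]);
                    String key = a + "-" + b;
                    Integer old = weights.get(key);
                    weights.put(key, old == null ? 1 : old + 1);
                }
            }
        }
        return weights;
    }
}
```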
Colors and transparency are used to help identify nodes and highlight information. All
nodes are initially blue. When the user hovers over a node, it turns red, all the nodes
connected to it turn black and the nodes that are not related to it become slightly
transparent. After a click on a node, the view is locked on that node; if the user then hovers
over one of the related nodes, initially shown in black, it turns red to help identify the
connection between the two.
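The hover rules above amount to a small decision function per node. The sketch below expresses them in plain Java; the enum, the class name and the convention that `-1` means "no node hovered" are assumptions for illustration, and the drawing code would map each state to the color described in the text.

```java
import java.util.Set;

// Hypothetical visual states: DEFAULT = blue, HOVERED = red,
// CONNECTED = black, FADED = slightly transparent.
enum NodeState { DEFAULT, HOVERED, CONNECTED, FADED }

class HoverStateSketch {
    /**
     * node: id of the node being drawn; hovered: id of the hovered node, or -1 if none;
     * neighborsOfHovered: ids of the nodes linked to the hovered node.
     */
    static NodeState stateFor(int node, int hovered, Set<Integer> neighborsOfHovered) {
        if (hovered < 0) {
            return NodeState.DEFAULT;
        }
        if (node == hovered) {
            return NodeState.HOVERED;
        }
        if (neighborsOfHovered.contains(node)) {
            return NodeState.CONNECTED;
        }
        return NodeState.FADED;
    }
}
```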
Information about movies appears by default next to the mouse cursor, as if the mouse were
a magic lens, but there is also an option to show this information in a corner of the screen.
For more details on the available interaction techniques, please refer to the following
sections.
Technical aspects:
The project was developed in the Processing environment (v2.1), which provides tools that make
graphical representation easier while still allowing the use of Java libraries. To build the
node-link layout, I used a modified version of LinLogLayout for Java Swing, developed by A. Noack
and distributed as free software under the GNU free software license.
Interactions:
A few interaction techniques are available in this project to help the user explore the
data:
• Key commands to filter the data encoding;
• Pan;
• Zoom;
• Brushing by hovering and clicking;
• Augmentation on mouse hovering, similar to a magic lens.
These interactions are explained in the following section, with pictures and more details on
how to explore the dataset.
How to use:
The following picture shows the default visualization presented to the user once the
program has finished loading.
The user can hide the links by pressing the “L” key and can toggle the highlighting of the
clusters by pressing the “C” key. The next picture shows the visualization with clusters
highlighted and links hidden.
By pressing the “I” and “O” keys the user can zoom in and out, respectively, and by clicking
and dragging it is possible to pan the visualization. Below is a representative image of the zoomed-in
graph.
To see all the movies that a particular person worked on, the user can press and hold the
CTRL key and then hover over that person’s node. By default, the list of movies appears right next to
the mouse cursor. If the user prefers that it not be attached to the mouse, pressing the “F” key displays
the list at the top left of the screen instead. The list is sorted by rating in descending order. In the
images below, we are looking at Richard Gere’s movies.
To see the connections of a particular actor, it suffices to hover the mouse over the node
representing that actor. Once the connected nodes are highlighted, clicking on the node fixes the
selection, and you can navigate through the peer actors to see the connections between them. If you
hover over a peer, the movies are displayed next to the mouse by default, or at the top left corner if
the user prefers. Below, see all of Woody Allen’s connections and then his movies with Scarlett Johansson.
Challenges and limitations:
At first, I tried to create my own layout method using a circular layout and animations, but it
turned out to be too hard and scaled very poorly, so I decided to tackle the layout with the force
field approach.
The algorithm for computing the node-link layout has some limitations and scales only to a few
thousand nodes. I tried running the algorithm on the whole crew dataset, but it crashed after more
than 10 minutes. I was therefore obliged to reduce the dataset.
One current issue of the visualization is the zoom. The coordinate system in Processing behaves in
a way I found unclear: when the user zooms in or out, the original coordinates are lost, and I did not
find a way to convert between screen coordinates and scaled coordinates. It is therefore not possible
to hover over objects while the view is zoomed.
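A common way to approach this limitation, sketched here and not implemented in the project, is to keep the pan offset and zoom factor in explicit variables and invert them for the mouse position. The sketch assumes the view is drawn by first translating by the pan offset and then scaling by the zoom factor, so that screen = pan + zoom * world; the class and method names are hypothetical.

```java
// Minimal sketch of converting between screen and layout ("world") coordinates,
// assuming the drawing applies a translation by (panX, panY) followed by a uniform scale.
class ViewTransformSketch {
    final double panX;
    final double panY;
    final double zoom; // must be > 0

    ViewTransformSketch(double panX, double panY, double zoom) {
        this.panX = panX;
        this.panY = panY;
        this.zoom = zoom;
    }

    double toScreenX(double worldX) { return worldX * zoom + panX; }
    double toScreenY(double worldY) { return worldY * zoom + panY; }

    // Inverse mapping: undo the translation first, then the scale.
    double toWorldX(double screenX) { return (screenX - panX) / zoom; }
    double toWorldY(double screenY) { return (screenY - panY) / zoom; }

    /** True if the mouse position (in screen pixels) falls inside a node given in world units. */
    boolean hits(double mouseX, double mouseY, double nodeX, double nodeY, double nodeRadius) {
        double dx = toWorldX(mouseX) - nodeX;
        double dy = toWorldY(mouseY) - nodeY;
        return dx * dx + dy * dy <= nodeRadius * nodeRadius;
    }
}
```

Doing the hit test in world coordinates means the node positions and radii produced by the layout can be used unchanged, whatever the current zoom level.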
Improvements and future work:
A good improvement would be to let users decide which people are meaningful to them. A small
set of the top 100 people from IMDb may be meaningful, but not everybody has the same taste in
movies. It would be interesting to add a first screen before the visualization proper where users
choose who they want to see, or to add a search tool to the visualization itself where the user can
look for someone and add them dynamically.
References:
Processing references, examples, tutorial and forum:
http://processing.org/
LinLogLayout for Java Swing, by A. Noack:
https://code.google.com/p/linloglayout/
The Aviz class website, particularly the lessons about Graphs, Interactions and Visual Variables:
http://www.aviz.fr/Teaching2013/Schedule