deleeuw.pdf

Keywords:
Information Visualization, Multi Dimensional Scaling.
Abstract
Multi Dimensional Scaling is a structure preserving projection method that
allows for the visualization of multidimensional data. In this paper we discuss
our practical experience in using MDS as a projection method in three different
application scenarios. Various reasons are given why structure preserving projection methods are useful for the analysis of multidimensional data. We discuss
two visual forms (glyphs, heightfields) which can be used to represent the output
of the projection methods.
1. Introduction
In this paper we discuss our practical experience in using Multi Dimensional
Scaling (MDS) for the visualization of multidimensional data. We show how
MDS is used to gain insight into multidimensional spaces that are represented
in a table. A large class of data can be characterized by tables. Such tables
can be described by a matrix of attribute variables in one dimension and the
outcome of specific cases in the other. Discovery and understanding of the
structure in this type of data have many applications in science and business [1].
Here the word structure refers to geometric relationships among subsets of
the data variables in the table. Examples of structure include clusters, regular
patterns, outliers, distance relations, and proximity of data points.
There are many numerical and statistical techniques that can be used to analyze structural information from multidimensional data tables. These techniques can be used to automatically extract certain structural properties from
the data. Examples of such techniques are principal component analysis (PCA),
k-means and hierarchical clustering algorithms (see [2, 3]). The majority of
these techniques focus on specific aspects of the structure of the data such as
clusters.
A different class of techniques for the analysis of structural information is
based on the idea that the multidimensional data points can be projected in
a lower dimensional space such that the structural properties of the data are
preserved. We call this class of techniques structure preserving projection
methods.
In this paper we discuss how multidimensional data can be visualized using
structure preserving projection methods. We sketch three alternative methods
and point out some differences between them. The paper is structured as follows: in the following section we will give an overview of the visualization
process of data analysis using projection methods. In section 3 we sketch three
structure preserving projection methods. Section 4 describes the visualization
of the output of the described projection methods. Three applications illustrate
the methods in section 5.
The process of transforming data tables into a visual form can be considered
in the context of the well known visualization pipeline [4]. For projection based
methods, a pipeline of four stages can be specified as follows (see Figure 1):
data acquisition, projection, mapping, and rendering.
Figure 1. Transforming tabular data into images. (Pipeline stages: data acquisition, projection, mapping, rendering, with user interaction.)
Data acquisition is the process of acquiring and selecting the data to be
analyzed. This stage results in the data table.
In the projection stage, nonlinear projection techniques are used to transform data points in a high dimensional data space to a lower dimensional visualization space. The goal of these techniques is to compute a spatial representation which preserves structural properties of the data table.
In the mapping stage the output of the projection is translated into a set of
graphical primitives. The goal of this stage is to effectively present the data
in a visual form. During rendering the graphical primitives are rendered as an
image.
User interaction allows the user to investigate different aspects of the data.
In all but the smallest data sets it is impossible to present all information
contained in the data automatically in a single image. Therefore the user should
be able to interact with the parameters in the visualization pipeline in a meaningful and understandable way.
3. Projection methods
Projection methods for the analysis of structure have the following useful
properties:
The methods do not depend upon any control parameters that would require a priori knowledge about the data. For example, these methods do
not depend on control parameters that determine the number of clusters.
The methods are not limited to specific types of structures. In contrast
to many specific structure seeking methods, projection methods can be
used for the analysis of a wide range of complex structures.
The methods use human visual capacity to recognize and interpret structure. For example, problems concerning anomalies in the data are overcome since humans can easily eliminate troublesome data points (automatic clustering algorithms have difficulty doing this).
We briefly summarize some aspects of three projection based techniques. It
is beyond the scope of this paper to discuss each technique in detail:
Multi Dimensional Scaling (MDS) computes a configuration of points
in a low-dimensional Euclidean space so that the distances between two
points match the original dissimilarities between the corresponding variables in the data table [5].
To apply MDS, first a distance matrix (also called a similarity or adjacency matrix) must be generated from the data table. This is done by
defining a metric by which the similarity or dissimilarity between cases
in the table can be determined. Depending on the data type in the table,
numeric, boolean or textual, many different metrics exist to calculate this
difference [6].
Formally, if $d_{ij}$ is the distance between points $p_i$ and $p_j$ and $x_i$ is the
position of $p_i$ in visualization space, the minimum of the equation
$$S = \sum_{i} \sum_{j>i} \left( d_{ij} - \lVert x_i - x_j \rVert \right)^2 \qquad (1.1)$$
must be computed. Various numerical methods can be used for the minimization, ranging from iterative Newton-Raphson based methods to
genetic algorithms.
Self Organizing Maps (SOM) is a technique that uses a neural network
consisting of a two dimensional arrangement of nodes (neurons) [7].
The basic idea is that similar input points produce similar responses in
a trained network. During training, neuron responses are adjusted based
on a collection of representative input points. After training, the distribution of responses in the SOM is a representation of the structure of the
data set.
Generative Topographic Mapping (GTM) is a technique in which a topographic mapping function between the input data and the visualization
space is found [8]. The idea is to use a function, which maps a density
distribution in visualization space in combination with a Gaussian noise
model into the original data space. An EM (expectation-maximization)
algorithm is used to find a combination of the 2D distribution function
and mapping function which gives the optimal representation of the original data.
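To make the MDS step concrete, the following is a minimal sketch and not the implementation used in the system described in this paper. It assumes a Euclidean metric on a small hypothetical data table, a random starting layout, and plain gradient descent with a fixed step size of 1/(2n); the names `pairwise_distances`, `stress` and `mds` are illustrative. The function `stress` evaluates the sum of squared differences between the given distances and the layout distances, as in equation (1.1).

```python
import numpy as np


def pairwise_distances(points):
    """Euclidean distance matrix between the rows of `points`."""
    diff = points[:, None, :] - points[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def stress(d, x):
    """Equation (1.1): sum over pairs i < j of (d_ij - ||x_i - x_j||)^2."""
    e = pairwise_distances(x)
    upper = np.triu_indices(len(x), k=1)
    return float(((d[upper] - e[upper]) ** 2).sum())


def mds(d, x0, steps=300):
    """Minimize the stress by gradient descent from the start layout x0."""
    x = np.array(x0, dtype=float)
    n = len(x)
    for _ in range(steps):
        diff = x[:, None, :] - x[None, :, :]      # x_i - x_j for all pairs
        e = np.sqrt((diff ** 2).sum(axis=-1))
        np.fill_diagonal(e, 1.0)                  # placeholder; diff is 0 there
        # Gradient of the stress with respect to each position x_i.
        grad = 2.0 * (((e - d) / e)[:, :, None] * diff).sum(axis=1)
        x -= grad / (2.0 * n)                     # fixed step size (assumption)
    return x


# Hypothetical data table: 5 cases with 3 attributes each.
table = np.array([[0.0, 0.0, 0.0],
                  [1.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0],
                  [1.0, 1.0, 0.0],
                  [0.5, 0.5, 1.0]])
d = pairwise_distances(table)
start = np.random.default_rng(0).normal(size=(len(table), 2))
layout = mds(d, start)
print("stress before:", stress(d, start), "after:", stress(d, layout))
```

Any other minimizer could be substituted for the loop in `mds`; only the stress function is fixed by the formulation above.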
There are two major differences between MDS and SOM. First, given the
set of input points, MDS results in a set of points in visualization space while
SOM results in a response on a two dimensional field. Second, in the case of
SOM, a trained neural network will result in a mapping function which can
be applied to additional data points. In the case of MDS, each additional data
point will require a re-computation of the projected configuration. Hence, a
trained neural network describes a projection function, while such a projection
function does not exist in the MDS case.
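The second difference can be illustrated with a small self organizing map. This is a simplified sketch under stated assumptions: a 5 by 5 grid, a Gaussian neighbourhood, linearly decaying learning rate and neighbourhood width, and hypothetical data made of two clusters. The names `train_som` and `best_matching_unit` are illustrative. Once trained, the map assigns a grid position to a new point without any retraining.

```python
import numpy as np


def best_matching_unit(weights, point):
    """Grid index (row, col) of the node whose weight vector is closest."""
    d2 = ((weights - point) ** 2).sum(axis=-1)
    return np.unravel_index(np.argmin(d2), d2.shape)


def train_som(data, rows=5, cols=5, epochs=20, lr0=0.5, sigma0=2.0, seed=0):
    """Train a small SOM; the schedules used here are illustrative choices."""
    rng = np.random.default_rng(seed)
    weights = rng.normal(size=(rows, cols, data.shape[1]))
    gy, gx = np.mgrid[0:rows, 0:cols]
    grid = np.stack([gy, gx], axis=-1).astype(float)   # node grid coordinates
    n_steps = epochs * len(data)
    step = 0
    for _ in range(epochs):
        for p in rng.permutation(data):                # shuffled input points
            frac = step / n_steps
            lr = lr0 * (1.0 - frac)                    # decaying learning rate
            sigma = sigma0 * (1.0 - frac) + 0.5        # shrinking neighbourhood
            bmu = np.array(best_matching_unit(weights, p))
            dist2 = ((grid - bmu) ** 2).sum(axis=-1)
            h = np.exp(-dist2 / (2.0 * sigma ** 2))
            weights += lr * h[..., None] * (p - weights)
            step += 1
    return weights


# Hypothetical input: two well separated clusters in a 3D data space.
rng = np.random.default_rng(1)
data = np.vstack([rng.normal(0.0, 0.1, size=(20, 3)),
                  rng.normal(10.0, 0.1, size=(20, 3))])
weights = train_som(data)

# Additional points are mapped with the trained network, without retraining.
new_a = best_matching_unit(weights, np.zeros(3))
new_b = best_matching_unit(weights, np.full(3, 10.0))
print(new_a, new_b)
```

With MDS, by contrast, adding the two new points would mean extending the distance matrix and minimizing the stress again.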
There are also two major differences between SOM and GTM. First, SOM
results in a response on a two dimensional field whereas GTM results in a
density distribution in data space. Second, in the case of SOM, the trained
projection function is implicitly defined by the neuron responses. In contrast,
the GTM projection function is explicitly defined as a parametric non-linear
function.
The data resulting from the projection based methods can be of two types:
a discrete set of positions or a continuous distribution function. For the visual
forms, a distinction is made between discrete mapping and continuous mapping. Some possible mappings are shown in Figure 2.
If the output of the projection is a set of discrete points, each point can
be represented by a glyph. When the number of points is large, glyphs are
less suitable due to cluttering. Projection methods may result in a highly nonuniform distribution of points in visualization space, and cluttering is hard to avoid
in this case. To counter this problem, the continuous representation can be
useful to gain insight into the global structure of the data. For the mapping of a
distribution function, the underlying data points are not explicitly represented
but an aggregation is applied such that the overall properties of the set are
reflected in the visualization. It is clear that, if the output of the projection
is a distribution function, discrete mapping is impossible as the position of
individual points is lost in the mapping.
Figure 2. Visualization mappings of projection output data. (Diagram labels: projection output, visualization; discrete, continuous; glyphs, splat map, heightfield.)
Glyphs can be used to visualize both the point and its attributes. For example, the shape, color, transparency, and orientation of glyphs can be used to encode
information associated with the point [9].
Heightfields can be used to visualize continuous functions, such as the distribution function. In the case that the discrete set of points is large, a heightfield can be constructed through GraphSplatting [10]. In this method, a field is
constructed by accumulating individual Gaussian basis splats.
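The accumulation of Gaussian splats can be sketched as follows. This is a simplified illustration of the idea, not the GraphSplatting implementation of [10]: it assumes point positions normalized to the unit square, a regular grid, and one isotropic Gaussian of fixed width `sigma` per point. The name `splat_field` and all parameter values are hypothetical.

```python
import numpy as np


def splat_field(points, resolution=64, sigma=0.05):
    """Sum one Gaussian splat per point on a regular grid over [0, 1]^2."""
    axis = np.linspace(0.0, 1.0, resolution)
    gx, gy = np.meshgrid(axis, axis)      # gx varies along columns, gy along rows
    field = np.zeros((resolution, resolution))
    for px, py in points:
        field += np.exp(-((gx - px) ** 2 + (gy - py) ** 2) / (2.0 * sigma ** 2))
    return field


# Hypothetical layout: two nearby points and one isolated point.
points = np.array([[0.25, 0.25], [0.26, 0.24], [0.75, 0.75]])
field = splat_field(points)
row, col = np.unravel_index(np.argmax(field), field.shape)
print("highest density at grid cell", (row, col))
```

The resulting array can be rendered as a heightfield or color mapped image; the width `sigma` controls how strongly nearby points merge into a single peak.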
Glyphs and heightfields are complementary. Heightfields are
useful for visualizing the overall structure of the field. Glyphs are useful to
visualize the details of a small set of points. In addition, depending on the
representation, the heightfield and the glyphs can be combined in the same
visualization.
5. Applications
In this section the use of the previously described methods is illustrated
in the context of specific applications. The methods are implemented in a
system [10] which includes an MDS projection based method and support for
continuous as well as discrete visualization methods.
5.1. City distances
This application computes and visualizes the locations of 39 cities in the
Netherlands with respect to road distances. Instead of the Euclidean distance
metric, the length of the road connecting the two cities is used as the distance
metric. Using these data, cities are modeled as points and the distance matrix
is filled with the acquired road distances. MDS is used to project the points into
a visualization space. Note that in this case the distance metric is not derived
from attribute data of the data points, instead the distance metric is part of the
input data.
The left panel of Figure 3 shows the result of the MDS projection. Points are
labeled with city names. Grey lines connect cities with a road distance of less
than 25 kilometers. The right panel shows the result overlaid on a map of
the Netherlands. Red discs are drawn at the actual city locations on the map.
Green discs are drawn at the computed locations.
Figure 3. City distance visualization. Left: MDS of 39 cities based on the road distances.
Right: the solution overlaid on a map of the Netherlands.
The right panel shows various discrepancies between the actual city locations (red discs) and the computed city locations (green discs). The largest
discrepancies are in the cities in the south west of the country. An explanation
can be found in the fact that the “road distance” is larger than the “earth distance” between cities. A large detour is needed to reach cities in the south west
of the Netherlands due to the water around these cities.
This application requires user interaction to register the output with the overlaid map. Since MDS uses only the distance matrix as its input, the result will
be a point configuration that is rotation invariant; i.e. although the distances
between cities are correct, the complete point set may be rotated around a
center of rotation. Similarly, the result can be mirrored with respect to the visualization plane. To overcome these problems, the user can pin points on the
visualization space. In this application, the user must pin at least three cities
on the map in order to avoid mirrored and rotated solutions.
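The ambiguity described above can be checked numerically. The sketch below uses hypothetical 2D coordinates (not the actual city data) and shows that the distance matrix is unchanged by a rotation or a mirroring of the layout. It also shows why three pinned points suffice in the plane: the signed area of the triangle they span keeps its sign under rotation but changes sign under mirroring. All function names are illustrative.

```python
import numpy as np


def distance_matrix(x):
    """Euclidean distances between all pairs of rows of x."""
    diff = x[:, None, :] - x[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def rotate(x, angle):
    """Rotate every 2D point counter-clockwise about the origin."""
    c, s = np.cos(angle), np.sin(angle)
    return x @ np.array([[c, -s], [s, c]]).T


def mirror(x):
    """Mirror every 2D point in the vertical axis."""
    return x * np.array([-1.0, 1.0])


def signed_area(a, b, c):
    """Signed area of triangle (a, b, c); positive if counter-clockwise."""
    return 0.5 * ((b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0]))


# Hypothetical layout of four points.
layout = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 2.0], [4.0, 5.0]])
turned = rotate(layout, 0.7)
flipped = mirror(layout)

print(np.allclose(distance_matrix(layout), distance_matrix(turned)))   # True
print(np.allclose(distance_matrix(layout), distance_matrix(flipped)))  # True
print(signed_area(*layout[:3]), signed_area(*turned[:3]), signed_area(*flipped[:3]))
```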
5.2. Image feature spaces
The goal of image classification is to allow images to be retrieved from data
repositories subject to a user defined query. Images are classified based on a
collection of features, such as color, texture and object shape. The usefulness
of the features for the query system is an important question for the developers
of image retrieval systems. The goal of this application was the development
of a system in which feature developers can experiment with features on a wide
variety of image sets.
The input of our system is a set of images and each image is classified
by a feature vector [11]. In this way images can be represented by points
in a high dimensional feature space. The distance matrix is defined by the
Euclidean distance between the points. Two images with high similarity will
result in points that are close to each other in the feature space. The MDS
layout will provide a global overview of the structure of the feature space as
well as similarity relations among images.
In addition, a weighting factor is associated with each element of the feature
vector. A user can interactively scale each dimension of the feature space by
changing the weighting factor, resulting in a new distance matrix. In this way
the user can explore the relation between features.
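The effect of the weighting factors on the distance matrix can be sketched as follows, assuming that each feature dimension is simply multiplied by its weight before the Euclidean distance is taken. The function name `weighted_distance_matrix` and the tiny feature table are hypothetical.

```python
import numpy as np


def weighted_distance_matrix(features, weights):
    """Euclidean distances after scaling each feature dimension by its weight."""
    scaled = features * weights
    diff = scaled[:, None, :] - scaled[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


# Hypothetical feature vectors of three images with two features each.
features = np.array([[0.0, 0.0],
                     [1.0, 0.0],
                     [0.0, 1.0]])

equal = weighted_distance_matrix(features, np.array([1.0, 1.0]))
first_only = weighted_distance_matrix(features, np.array([2.0, 0.0]))
print(equal)
print(first_only)
```

Setting a weight to zero removes that feature from the comparison, so images that differ only in that feature collapse onto the same point in the new layout.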
We applied our method to a collection of images taken from the Corel Image Collection [12]. A set of 200 images was selected across different genres,
yet at the same time care was taken that a small fraction of the images per genre would be commonly regarded as “similar”: for example,
images of similar objects such as sailing boats, or images of objects which differ
only in lighting characteristics or camera position. For each image, 6 feature
vectors were computed: a four-dimensional Gabor feature vector for texture
analysis and 5 distinct color-based feature vectors. Texture-based features are
particularly successful when applied to genres of images where color information is of lesser importance, e.g. aerial photography [13]. The color-based feature
vectors include a hue histogram, a hue histogram of the center region of the
image, and 3 hue transition histograms. For the transition histograms, the hue is
first dithered to 16 bins; then the histogram of the 256 resulting combinations
is recorded. As a pre-processing step, the images were segmented into 32, 128,
and 256 tiles, and each tile was replaced by its dominant hue. The dimensionality of the feature space spanned by the 6 feature vectors is 804.
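A hue transition histogram of this kind could be computed along the following lines. The sketch makes several assumptions that the text does not state: hue values lie in [0, 1), plain quantization stands in for dithering, a "transition" is taken to be a pair of horizontally or vertically adjacent tiles, and both hue histograms have 16 bins. Under the last assumption the feature sizes add up to the stated dimensionality: 4 + 16 + 16 + 3 x 256 = 804. All names are illustrative.

```python
import numpy as np

N_BINS = 16


def quantize_hue(hue):
    """Map hue values in [0, 1) to bins 0..15 (quantization, not dithering)."""
    return np.minimum((np.asarray(hue) * N_BINS).astype(int), N_BINS - 1)


def transition_histogram(tile_hues):
    """Count pairs of hue bins over adjacent tiles (16 * 16 = 256 entries)."""
    bins = quantize_hue(tile_hues)
    hist = np.zeros(N_BINS * N_BINS, dtype=int)
    neighbours = [(bins[:, :-1], bins[:, 1:]),    # tile and its right neighbour
                  (bins[:-1, :], bins[1:, :])]    # tile and the tile below it
    for a, b in neighbours:
        np.add.at(hist, (a * N_BINS + b).ravel(), 1)
    return hist


def feature_dimension():
    """Gabor (4) + two hue histograms (assumed 16 bins) + three transition histograms."""
    return 4 + 2 * N_BINS + 3 * N_BINS ** 2


# Hypothetical 2 x 2 grid of dominant tile hues.
tile_hues = np.array([[0.0, 0.5],
                      [0.0, 0.5]])
hist = transition_histogram(tile_hues)
print(hist.sum(), feature_dimension())
```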
Figure 4 shows a snapshot of the user interface. The left panel shows the
discrete view: an arrangement of points in the visualization space. Small dots
are used to represent points. Grey lines represent edges between points with
distances below the user provided threshold distance. Some selected points are
annotated with a thumbnail image. The right panel shows the continuous field
of the same point configuration.
Figure 4. Two views of a layout for a subset of the Corel Image Collection. The left panel
shows a discrete representation. Points in visualization space are represented as small dots. The
right panel shows a continuous field. Some points are annotated with a thumbnail image.
The layout provides a view in which the images are displayed according to
their mutual dissimilarities. Similar images are clustered. A problem with the
discrete view is the potential cluttering of dots, making it difficult to estimate
density of points in dense regions. The continuous field provides a view of
a density field. Colors are used to show which areas have a high density of
points. In this way, the user can see at a glance which images are similar.
Users can interact with the system in three ways. First, points can be dragged and
pinned in the visualization space, in which case the MDS algorithm
computes a new solution. Second, by varying the mapping parameters of
the density field, the frequency of the density field can be controlled. Changing these parameters affects the mapping stage of the pipeline. Finally, the
weight factors can be changed, resulting in a scaling of the high dimensional feature space. All three of these interactions will result in a new distance matrix.
5.3. Citation index
The goal of this application is the analysis of the citation index of all IEEE
Vis’9X papers. We show that clustering of citations leads to specific topics in
visualization.
We have applied our method to the analysis of the IEEE Vis’9X citation index [14]. The input data set consists of the BibTeX entries of all papers in the proceedings
of the IEEE Vis’9X conferences and all references to papers in this set from
other papers in the set. The data set consists of 599 BibTeX entries and 881
references.
The goal of the visualization was to test the hypothesis that topics in visualization could be identified by analyzing only the density of the references.
The motivation for this hypothesis is that papers in one topic often refer to other
papers in the same topic.
The distance matrix was the reference matrix. This matrix has the
dimension of the total number of papers in both directions, and each element
contains ‘true’ if one paper references the other.
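The construction of such a reference matrix can be sketched as follows. How the boolean entries are turned into numeric distances is not specified in the text; the sketch assumes, purely for illustration, that the matrix is made symmetric and that linked papers get distance 1 and unlinked papers distance 2. The paper indices and the names `reference_matrix` and `to_distances` are hypothetical.

```python
import numpy as np


def reference_matrix(n_papers, references):
    """Boolean matrix with ref[i, j] = True if paper i references paper j."""
    ref = np.zeros((n_papers, n_papers), dtype=bool)
    for citing, cited in references:
        ref[citing, cited] = True
    return ref


def to_distances(ref, linked=1.0, unlinked=2.0):
    """Symmetric distances: small if either paper references the other."""
    connected = ref | ref.T
    d = np.where(connected, linked, unlinked)
    np.fill_diagonal(d, 0.0)
    return d


# Hypothetical data: 4 papers, paper 0 and paper 2 both reference paper 1.
ref = reference_matrix(4, [(0, 1), (2, 1)])
d = to_distances(ref)
print(d)
```

Paper 3 neither references nor is referenced, so it is equally far from all other papers.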
Figure 5. Left: All papers published in IEEE Vis’9X conferences, represented as discs, and
references between the papers, represented as lines. Right: Interacting with the citation index.
The influence of a group of papers is drawn with yellow (incoming) and blue (outgoing) references. Papers in the region selected by the highlighted contour on the right are shown as
discs.
The left panel of Figure 5 shows the output of the MDS layout. Papers are
represented as small circles. References between papers are represented as
lines. As can be seen, aside from the papers which are neither referenced nor
reference other papers, there is a single cluster of points. Due to the large
number of points in the cluster, it is difficult to gain insight into the structure
of the data.
The right panel shows a slightly zoomed-in 2D rendition of a continuous representation of the MDS layout. The density of the field clearly shows various
clusters of papers. For example, the papers in the large dark region in the middle of the image deal with flow visualization. The (smaller) region below contains
papers describing visualization systems. The region on the right contains volume
visualization papers. Discrete discs (red dots representing papers) and lines
(yellow lines for incoming references and blue lines for outgoing references)
are also drawn as annotation. The region at the top left contains information
visualization papers. The distance to the other peaks in the field illustrates the
distinction between information visualization and other data visualization topics.
Contour lines can be used to show cluster boundaries. Also, the influence of
a paper is shown by drawing the edges representing references to the selected
papers.
Figure 5 also illustrates some interaction techniques. Contour lines can be
used as a selection criterion. In this way all papers in a cluster can be selected.
The user can also pick individual papers and show all information related to
that individual paper.
This paper concerns the visualization of multidimensional data using structure preserving projection methods. Some possible visualization techniques
which can be used for the display of the projection methods were discussed.
Three applications were given as illustration.
An advantage of projection based methods is that they make use of human
pattern recognition abilities for the interpretation of the data. Also, projection
based methods do not require a priori knowledge about the multidimensional
data. For these reasons, these methods are well suited for inclusion in exploratory visualization toolkits.
References
[1] S.K. Card, J.D. Mackinlay, and B. Shneiderman, editors. Readings in
Information Visualization. Morgan Kaufmann Publishers, 1999.
[2] G. H. Ball. A comparison of some cluster-seeking techniques. Technical
Report RADC-TR-66-512, Rome Air Development Center, Rome, NY,
1966.
[3] V. Barnett. Interpreting Multivariate Data. John Wiley & Sons, Inc.,
New York, 1981.
[4] R.B. Haber and D.A. McNabb. Visualization idioms: A conceptual model
for scientific visualization systems. In G.M. Nielson, B.D. Shriver, and
L.J. Rosenblum, editors, Visualization in Scientific Computing, pages 74–
92. IEEE Computer Society Press, 1990.
[5] T.F. Cox and M.A.A. Cox. Multidimensional Scaling. Chapman & Hall,
London, 1994.
[6] L. Kaufman and P.J. Rousseeuw. Finding Groups in Data - An Introduction
to Cluster Analysis. A Wiley-Interscience Publication, John Wiley & Sons Inc.,
1990.
[7] T. Kohonen. Self-Organizing Maps. Springer-Verlag, Berlin Heidelberg
New York, 1995.
[8] C.M. Bishop, M. Svensén, and C.K.I. Williams. GTM: A principled alternative to the self-organizing map. Advances in Neural Information
Processing Systems, 9:354–363, 1997.
[9] D. Ebert, R. Rohrer, C. Shaw, P. Panda, J. Kulka, and D. Roberts. Procedural shape generation for multi-dimensional data visualization. In
E. Groller, H. Loffelmann, and W. Ribarsky, editors, Data Visualization
’99 (Proceedings EG-IEEE VisSym 1999), pages 3–13. Springer Verlag,
2000.
[10] R. van Liere and W.C. de Leeuw. Graphsplatting: Visualizing graphs
as continuous fields. Accepted for publication in IEEE Transactions on
Visualization and Computer Graphics, 2002.
[11] Robert van Liere, Wim de Leeuw, and Florian Waas. Interactive visualization of multidimensional feature spaces. In D.S. Ebert and C.D.
Shaw, editors, Proceedings on New Paradigms for Information Visualization (NPIVM’00). IEEE Computer Society Press, 2000.
[12] Corel, http://www.corel.ca/products/clipartandphotos/photos/index.htm.
Corel Stock Photos, 1999.
[13] B.S. Manjunath and W.Y. Ma. Texture features for browsing and retrieval
of large image data. IEEE Transactions on Pattern Analysis and Machine
Intelligence, 18(8):837–842, August 1996.
[14] References of the IEEE Visualization proceedings 1990-1999 can be downloaded at http://www.cwi.nl/~robertl/visbib, 2000.