Matrix algorithms for the seriation problem

Matrix algorithms for the seriation problem
Anna Concas
Department of Mathematics and Computer Science, University of Cagliari, Italy
In collaboration with Caterina Fenu and Giuseppe Rodriguez
Due giorni di Algebra Lineare Numerica
Como, 16-17 February 2017
The seriation problem
The seriation problem is an important ordering problem:
it seeks the best enumeration order of a set of objects
according to a given correlation function;
the sought order can be characteristic of the data, a
chronological order, a gradient or any sequential structure of
the data;
the concept has been formulated in many different ways and
appears in various fields (archaeology, anthropology, psychology,
biology, etc.).
In archaeology, seriation has been used to date ancient objects and
determine their chronology.
We will consider seriation from a mathematical point of view,
treating it as the arrangement of units in a sequence according to a
certain gradient.
One of the most important aims of archaeological investigation
is to determine a relative chronology, i.e., a dating which indicates
whether a given unit chronologically precedes or follows another.
Generally, relative chronologies have no direction, i.e., units
are placed in a sequence which can be read in both directions.
Data matrix
The problem data are generally represented by an m × n matrix A
(the data matrix), in which the rows are the archaeological units and
the columns are the types.
The data matrix can be of two types:
incidence matrix: binary representation, i.e., the matrix
indicates the presence (1) or the absence (0) of a certain type
in a unit;
abundance matrix: the matrix reports the absolute number or the
relative percentage of objects belonging to a certain type
(column) in a unit (row).
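The two kinds of data matrix can be illustrated with a minimal sketch. The counts below are made up for illustration (they are not archaeological data), and the variable names are assumptions.

```python
import numpy as np

# Hypothetical abundance matrix: 3 units (rows) x 4 types (columns), raw counts.
abundance = np.array([[5, 0, 2, 0],
                      [0, 3, 1, 0],
                      [0, 0, 4, 6]])

# Incidence matrix: 1 where a type is present in a unit, 0 where it is absent.
incidence = (abundance > 0).astype(int)

# Relative abundance: each row rescaled to percentages that sum to 100.
percent = 100.0 * abundance / abundance.sum(axis=1, keepdims=True)

print(incidence)
print(percent.round(1))
```

An incidence matrix can always be derived from an abundance matrix in this way, while the converse is not possible.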
                 G3  F27 S1  F26 N2  F24 P6  F25 P5  P4  N1  F23
Mollebakken 2    1   1   1   1   0   0   0   0   0   0   0   0
Kobbea 11        0   1   1   0   1   1   0   0   0   0   0   0
Mollebakken 1    1   1   0   1   1   0   1   1   0   0   0   0
Levka 2          0   1   1   0   1   0   0   1   1   0   0   0
Grodbygard 324   0   0   0   0   1   1   0   0   0   1   0   0
Melsted 8        0   0   1   1   0   0   1   1   0   1   0   0
Bokul 7          0   0   0   0   0   0   1   1   0   0   1   0
Heslergaard 11   0   0   0   0   0   0   0   1   0   1   0   0
Bokul 12         0   0   0   0   0   0   0   1   1   0   0   1
Slamrebjerg 142  0   0   0   0   0   0   0   0   0   1   0   1
Nexo 6           0   0   0   0   0   0   0   0   0   1   1   1
Table: Incidence matrix for archaeological data coming from female burials
at the Bornholm site (Denmark)
A data matrix may represent types of found objects as columns,
with the features (graves, pits, etc.) in which they are found as
rows.
Aims of seriation:
arrange the graves/features in chronological order, based upon
the assumption that the types were produced only for a limited
period of time;
reorganize the rows or columns of a dataset so as to enumerate
them in an appropriate order;
search for an enumeration of rows and columns that best
expresses the resemblance relationships between elements.
Consecutive ones problem
The consecutive ones problem (C1P) is a closely related ordering
problem.
A (0,1)-matrix has the consecutive ones property for columns if
there exists a permutation of its rows that places the 1's
consecutively in every column.
One can symmetrically define the equivalent property for rows.
If a matrix has this property, then the C1P is to find all such
permutations.
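For very small matrices the definition can be checked directly. The following is a brute-force sketch only (it tries every row permutation, so it is not a practical algorithm); the function names and the example matrix are assumptions made for illustration.

```python
from itertools import permutations

def has_consecutive_ones(matrix):
    """True if, in every column, the 1's occupy consecutive rows."""
    for col in zip(*matrix):
        ones = [i for i, v in enumerate(col) if v == 1]
        if ones and ones[-1] - ones[0] + 1 != len(ones):
            return False
    return True

def c1p_row_orders(matrix):
    """All row permutations that place the 1's consecutively in every column."""
    n = len(matrix)
    return [p for p in permutations(range(n))
            if has_consecutive_ones([matrix[i] for i in p])]

# Illustrative (0,1)-matrix: 3 rows, 3 columns.
A = [[1, 1, 0],
     [0, 1, 1],
     [1, 0, 0]]

print(has_consecutive_ones(A))  # the given row order fails in the first column
orders = c1p_row_orders(A)
print(orders)
```

The valid orders come in pairs, one being the reverse of the other, which mirrors the fact that a relative chronology has no direction.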
The similarity matrix
The first mathematically formalized seriation methods were based on
the construction of a symmetric similarity matrix S.
One possible way to construct it is
$S = AA^T$.
We can observe that:
if A is an incidence matrix, each element $s_{i,j}$ of the similarity
matrix is equal to the number of types shared by units i and j;
if, by permuting the rows and columns of S, the larger values
cluster around the main diagonal and the smaller ones move
far from it, then the corresponding row sequence in A places
units with similar types near each other.
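A minimal sketch of this construction, assuming a small made-up incidence matrix (3 units, 4 types):

```python
import numpy as np

# Illustrative incidence matrix: rows are units, columns are types.
A = np.array([[1, 1, 0, 0],
              [0, 1, 1, 0],
              [0, 0, 1, 1]])

# Similarity matrix: S[i, j] counts the types shared by units i and j.
S = A @ A.T
print(S)
```

Here the diagonal entry S[i, i] is the number of types present in unit i, and units 0 and 2 share no type, so S[0, 2] is zero.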
Robinson method
The Robinson method is a statistical technique based on a
particular similarity matrix.
In the original formulation, it is applicable only to abundance
archaeological data.
Steps of the method:
construct the m × n abundance matrix A;
construct the similarity matrix S with entries
$s_{i,j} = M - \sum_{k=1}^{n} |a_{i,k} - a_{j,k}|$, $(M = 200)$;
permute the rows and the columns of S in order to
concentrate the larger numerical values near the main diagonal;
the resulting ordering of S gives an order for the rows of the
data matrix A.
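The similarity step can be sketched as follows. The sketch assumes that each row of the abundance matrix holds percentages summing to 100, in which case the sum of absolute differences between two rows lies between 0 and 200 and M = 200 makes the similarity non-negative. The function name and the data are illustrative assumptions.

```python
import numpy as np

def robinson_similarity(P, M=200.0):
    """s[i, j] = M - sum_k |P[i, k] - P[j, k]| for a percentage matrix P."""
    P = np.asarray(P, dtype=float)
    # Pairwise sums of absolute differences between rows, via broadcasting.
    dist = np.abs(P[:, None, :] - P[None, :, :]).sum(axis=2)
    return M - dist

# Illustrative percentage data: 3 units, 3 types, each row sums to 100.
P = np.array([[50, 50, 0],
              [0, 50, 50],
              [0, 0, 100]])
S = robinson_similarity(P)
print(S)
```

Identical units get the maximal similarity 200, while units with no type in common get 0.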
R-matrices
Let us consider a symmetric m × m matrix S; we will say that S is
an R-matrix if and only if
$s_{i,j} \le s_{i,k}$, if $j \le k \le i$,
$s_{i,j} \ge s_{i,k}$, if $i \le j \le k$.
If S can be symmetrically permuted to become an R-matrix, then
we say that S is pre-R.
R:
6 4 2 2
4 8 5 3
2 5 9 4
2 3 4 7
not R:
6 4 9 2
4 8 5 3
9 5 9 4
2 3 4 7
pre-R:
9 2 5 4
2 6 4 2
5 4 8 3
4 2 3 7
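The definition can be tested directly on the three 4 × 4 examples of this slide. This is a brute-force sketch (the pre-R test tries every symmetric permutation, so it only makes sense for tiny matrices); the function names are assumptions.

```python
from itertools import permutations

def is_r_matrix(S):
    """Entries of each row must not increase when moving away from the diagonal."""
    m = len(S)
    for i in range(m):
        # To the right of the diagonal the row must be non-increasing.
        if any(S[i][j] < S[i][j + 1] for j in range(i, m - 1)):
            return False
        # To the left of the diagonal the row must be non-decreasing.
        if any(S[i][j] > S[i][j + 1] for j in range(0, i)):
            return False
    return True

def r_orderings(S):
    """All symmetric permutations p such that S[p, p] is an R-matrix."""
    m = len(S)
    return [p for p in permutations(range(m))
            if is_r_matrix([[S[i][j] for j in p] for i in p])]

R = [[6, 4, 2, 2], [4, 8, 5, 3], [2, 5, 9, 4], [2, 3, 4, 7]]
NOT_R = [[6, 4, 9, 2], [4, 8, 5, 3], [9, 5, 9, 4], [2, 3, 4, 7]]
PRE_R = [[9, 2, 5, 4], [2, 6, 4, 2], [5, 4, 8, 3], [4, 2, 3, 7]]

print(is_r_matrix(R), is_r_matrix(NOT_R), is_r_matrix(PRE_R))
print(r_orderings(PRE_R))
```

For the pre-R example, the permutation (1, 2, 0, 3) (zero-based indices) reproduces the R-matrix shown first, and so does its reverse.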
A spectral algorithm for the seriation problem
A spectral algorithm for the seriation problem was introduced in
[Atkins, Boman, Hendrickson, SICOMP 1998].
Given a symmetric correlation function f, reflecting the desire for
units i and j to be near each other in the sought sequence, the
goal is to find all permutations π such that
$\pi_i \preceq \pi_j \preceq \pi_k \iff f(i,j) \ge f(i,k) \text{ and } f(j,k) \ge f(i,k)$
The Fiedler vector
The Laplacian of a symmetric, irreducible matrix A is the
symmetric, positive semi-definite matrix
$L = D - A$,
where D is the diagonal matrix such that $d_{i,i} = \sum_{j=1}^{m} a_{i,j}$.
The minimum eigenvalue of L with an eigenvector orthogonal to e
(the vector of all ones) is called the Fiedler value (algebraic
connectivity) and the corresponding eigenvector is the Fiedler
vector.
Alternatively, the Fiedler value is given by
$\min_{x^T e = 0,\ x^T x = 1} x^T L x$
and a Fiedler vector is any vector x that achieves this minimum
under these constraints.
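A minimal numerical sketch of these definitions, assuming a small symmetric, irreducible matrix with non-negative entries (so that the smallest eigenvalue of L is 0 with eigenvector e and the second smallest is the Fiedler value). The function name is an assumption; `numpy.linalg.eigh` returns the eigenvalues of a symmetric matrix in ascending order.

```python
import numpy as np

def fiedler(A):
    """Return (Fiedler value, Fiedler vector) of the Laplacian L = D - A."""
    A = np.asarray(A, dtype=float)
    L = np.diag(A.sum(axis=1)) - A
    w, V = np.linalg.eigh(L)      # ascending eigenvalues, orthonormal eigenvectors
    return w[1], V[:, 1]

# Illustrative matrix: adjacency of a path on three nodes.
A = [[0, 1, 0],
     [1, 0, 1],
     [0, 1, 0]]
value, vector = fiedler(A)
print(value)
print(vector)
```

The returned vector has unit norm and is orthogonal to e; its sign is arbitrary, which again reflects the fact that the ordering can be read in both directions.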
Motivations
Let $F = \{f_{i,j}\}$ be a (similarity) matrix.
The authors' approach is to consider the minimization of a penalty
function g whose value is small when highly correlated elements are
close to each other:
$g(\pi) = \sum_{(i,j)} f_{i,j} (\pi_i - \pi_j)^2$.
Since minimizing g is NP-hard, the problem is relaxed by replacing
the permutation π with a real vector x and considering the function
$h(x) = \sum_{(i,j)} f_{i,j} (x_i - x_j)^2$, $x \in \mathbb{R}^n$.
In order to have a unique minimizer and to leave out the trivial
solution, it is necessary to add two constraints.
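To make the penalty function g concrete, the sketch below evaluates it for every permutation of a tiny made-up similarity matrix. It assumes that the sum runs over each unordered pair (i, j) once and that π_i is the position of unit i in the sequence; the names are illustrative.

```python
from itertools import permutations

def penalty(F, pi):
    """g(pi) = sum over pairs i < j of F[i][j] * (pi[i] - pi[j])**2."""
    n = len(F)
    return sum(F[i][j] * (pi[i] - pi[j]) ** 2
               for i in range(n) for j in range(i + 1, n))

# Illustrative similarities: units 0 and 2 are strongly correlated,
# units 0 and 1 weakly, units 1 and 2 not at all.
F = [[0, 1, 3],
     [1, 0, 0],
     [3, 0, 0]]

values = {pi: penalty(F, pi) for pi in permutations(range(3))}
best = min(values.values())
minimizers = sorted(pi for pi, g in values.items() if g == best)
print(best, minimizers)
```

The two minimizers place unit 0 between units 1 and 2 and are the reverse of one another. Enumerating all n! permutations is only feasible for very small n, which is the motivation for the continuous relaxation h.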
Motivations
The resulting minimization problem is:
minimize $h(x) = \sum_{(i,j)} f_{i,j} (x_i - x_j)^2$
subject to $\sum_i x_i = 0$ and $\sum_i x_i^2 = 1$.
It can be shown that h(x) can be rewritten as $x^T L x$, where L is the
Laplacian of the correlation matrix F.
The constraints require that x be a unit vector orthogonal to e, and
since L is symmetric, all the normalized eigenvectors, except the first
one, satisfy the constraints.
Consequently, a solution to the constrained minimization problem is
just a Fiedler vector.
Sorting the entries of the Fiedler vector generates an ordering that
tries to keep highly correlated elements near each other.
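The following sketch checks the identity h(x) = x^T L x numerically and then orders a scrambled similarity matrix by sorting its Fiedler vector. It assumes the sum in h runs over each unordered pair once; the test matrix, the scrambling permutation `q` and the function names are made up for illustration.

```python
import numpy as np

def laplacian(F):
    F = np.asarray(F, dtype=float)
    return np.diag(F.sum(axis=1)) - F

def h(F, x):
    """Sum over pairs i < j of F[i][j] * (x[i] - x[j])**2."""
    n = len(x)
    return sum(F[i][j] * (x[i] - x[j]) ** 2
               for i in range(n) for j in range(i + 1, n))

def spectral_order(F):
    """Indices of the units sorted by their Fiedler vector entries."""
    _, V = np.linalg.eigh(laplacian(F))   # eigenvalues in ascending order
    return np.argsort(V[:, 1])

# An R-matrix for units in their "true" order 0, 1, 2, 3 ...
R = np.array([[3, 2, 1, 0],
              [2, 3, 2, 1],
              [1, 2, 3, 2],
              [0, 1, 2, 3]], dtype=float)
# ... scrambled by a symmetric permutation: row a of F is true unit q[a].
q = np.array([2, 0, 3, 1])
F = R[np.ix_(q, q)]

order = spectral_order(F)
recovered = q[order]          # true labels, read in the computed order
x = np.random.default_rng(0).standard_normal(4)
print(recovered, h(F, x), x @ laplacian(F) @ x)
```

Since the sign of an eigenvector is arbitrary, the recovered sequence is the true order read either forwards or backwards.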
PQ-trees
As observed before, the seriation problem is related to finding all
the possible ways to sequence the elements of a certain set so that
the correlations are consistent.
Although there may be a very large number of such orderings, they
can all be described by a compact data structure known as a PQ-tree.
PQ-trees
A PQ-tree is a data structure introduced by Booth and Lueker to
encode a set of related permutations.
A PQ-tree T over a set U = {u1 , u2 , . . . , un } is a rooted tree whose
leaves are elements of U and whose internal nodes are distinguished
as either P-nodes or Q-nodes.
T is proper when the following conditions hold:
every ui ∈ U appears once as a leaf;
every P-node has at least two children;
every Q-node has at least three children.
PQ-trees
Two PQ-trees are said to be equivalent if one can be transformed
into the other by applying a sequence of the following two
equivalence transformations:
arbitrarily permute the children of a P-node;
reverse the children of a Q-node.
We will describe a variant of the spectral algorithm which, given a
pre-R matrix, uses Fiedler vectors to construct a PQ-tree for the set
of all permutations that produce an R-matrix.
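The way a PQ-tree encodes a set of permutations can be sketched in a few lines. The representation below is an assumption chosen for brevity (it is not the Booth-Lueker implementation): a leaf is any non-tuple value, and an internal node is a pair `(kind, children)` with `kind` equal to "P" or "Q".

```python
from itertools import permutations, product

def frontiers(tree):
    """Set of all leaf orderings obtainable through the two equivalence transformations."""
    if not isinstance(tree, tuple):
        return {(tree,)}                      # a leaf encodes a single ordering
    kind, children = tree
    if kind == "P":
        arrangements = permutations(children)  # children in any order
    else:
        arrangements = [tuple(children), tuple(reversed(children))]  # only reversal
    result = set()
    for arrangement in arrangements:
        # Combine every ordering of each child, keeping the children's arrangement.
        for parts in product(*(frontiers(child) for child in arrangement)):
            result.add(tuple(leaf for part in parts for leaf in part))
    return result

# Illustrative tree: a Q-node whose middle child is a P-node.
T = ("Q", [1, ("P", [2, 3]), 4])
print(sorted(frontiers(T)))
```

A P-node with k leaf children encodes k! orderings, while a Q-node with leaf children encodes only two, which is why the structure stays compact even when the number of admissible orderings is large.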
PQ-tree representation
[Figure: examples of PQ-trees over the leaves 1, ..., 7 (legend: P-node, Q-node, leaf). Captions: "set of all permutations of the indices of the rows/columns of A"; "PQ-tree relating a matrix A with two connected components".]
Spectral Sort
algorithm implementation;
introduction of a tolerance τ;
handling of the case of a multiple Fiedler value.
Spectral Sort Part I
function T = spectralsort(A, τ, U)
  α = min_{i≠j} a_{i,j},  A = A − α e e^T
  {A_1, ..., A_k}: connected components of A
  {U_1, ..., U_k}: corresponding index sets
  if k > 1
    for j = 1, ..., k
      T_u(j) = spectralsort(A_j, τ, U_j)
    end for
    T = pnode(T_u)
  else
    Part II
  end if
Spectral Sort Part II
if n = 1
  T = lnode(U)
else if n = 2
  T = pnode(U)
else
  L: Laplacian matrix of A
  d: eigenvalues of L
  d = distinct(d, τ)      (sort d and match its elements using τ)
  f: vector of the indices in d with d(2) value
  if length(f) > 1
    T = pnode(U)
  else
    Part III
  end if
end if
Spectral Sort Part III
x: Fiedler vector for L
y = distinct(x, τ)      (sort x and match its elements using τ)
t: number of distinct values in y
for j = 1, ..., t
  v_j: indices of elements in y with j-th value
  T_v(j) = spectralsort(A(v_j, v_j), τ, U(v_j))
end for
T = qnode(T_v)
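Parts I-III can be put together in a compact Python sketch. It is a simplified reading of the pseudocode, not the authors' implementation: the base cases n = 1 and n = 2 are tested first, a multiple Fiedler value is detected by comparing the second and third eigenvalues, and the PQ-tree is returned as nested pairs `(kind, children)`. The helper `components` and all names are assumptions.

```python
import numpy as np

def components(A):
    """Index lists of the connected components (edge wherever A[i, j] != 0, i != j)."""
    n, seen, comps = A.shape[0], set(), []
    for start in range(n):
        if start in seen:
            continue
        seen.add(start)
        stack, comp = [start], []
        while stack:
            i = stack.pop()
            comp.append(i)
            for j in range(n):
                if j != i and j not in seen and A[i, j] != 0:
                    seen.add(j)
                    stack.append(j)
        comps.append(sorted(comp))
    return comps

def spectralsort(A, tau, U):
    A, U = np.asarray(A, dtype=float), list(U)
    n = len(U)
    if n == 1:
        return U[0]                                  # leaf
    if n == 2:
        return ("P", U)
    A = A - A[~np.eye(n, dtype=bool)].min()          # smallest off-diagonal entry -> 0
    comps = components(A)
    if len(comps) > 1:                               # Part I: one subtree per component
        return ("P", [spectralsort(A[np.ix_(c, c)], tau, [U[i] for i in c])
                      for c in comps])
    w, V = np.linalg.eigh(np.diag(A.sum(axis=1)) - A)
    if w[2] - w[1] <= tau:                           # Part II: multiple Fiedler value
        return ("P", U)
    x = V[:, 1]                                      # Part III: Fiedler vector
    order = np.argsort(x)
    groups = [[order[0]]]
    for i in order[1:]:                              # entries closer than tau are tied
        if x[i] - x[groups[-1][-1]] <= tau:
            groups[-1].append(i)
        else:
            groups.append([i])
    if len(groups) == 1:                             # guard against a too-large tau
        return ("P", U)
    return ("Q", [spectralsort(A[np.ix_(g, g)], tau, [U[i] for i in g])
                  for g in groups])

PRE_R = [[9, 2, 5, 4], [2, 6, 4, 2], [5, 4, 8, 3], [4, 2, 3, 7]]
tree = spectralsort(PRE_R, 1e-8, range(4))
blocks = spectralsort([[0, 1, 0, 0], [1, 0, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0]], 1e-8, range(4))
print(tree)
print(blocks)
```

For the pre-R example of the R-matrices slide the result is a single Q-node whose leaves are in the order 1, 2, 0, 3 or its reverse (zero-based indices), while the block-diagonal example yields a P-node with one subtree per connected component.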
References
P. P. Agostinetti, Matteo Sommacal. "Il problema della seriazione in
archeologia". Rivista di Scienze Preistoriche, LV, 2005;
Jonathan E. Atkins, Erik G. Boman, and Bruce Hendrickson.
"A spectral algorithm for seriation and the consecutive ones
problem". SIAM Journal on Computing, 28(1):297-310, 1998;
K. S. Booth and G. S. Lueker. "Testing for the consecutive ones
property, interval graphs and graph planarity using PQ-tree
algorithms". J. Comput. System Sci., 13 (1976), pp. 333-379;
M. Fiedler. "Algebraic connectivity of graphs".
Czech. Math. Journal, 23 (1973), pp. 298-305.
Thanks for your attention!