From Topology to Formal Languages: a
multiscale analysis of large data set
Emanuela Merelli
University of Camerino, Italy
Multiscale Modelling and Computing Workshop at Lorentz Center
Leiden – 11th April 2013
Big data, Topology and Formal languages
I propose an approach that exploits
→ topology, for discovering global properties hidden in a large
data set; in particular the mapping class group (MCG), an algebraic
invariant, for classifying a topological object into classes of equivalence
→ formal languages, for describing the object as information
(pattern, perception, ...): at the syntactic level by
grammars and automata ..., at the semantic level by denotations ...
for studying the dynamics of complex systems
The IT fifth revolution
Big Data
is a collection of data so large, on the order of exabytes (10^18 bytes),
that it is impossible to manage with existing data management tools.
These data are characterized by high dimensionality, redundancy, and noise.
Thus, the analysis of such data requires handling high-dimensional
vectors in a way capable of weeding out the unimportant,
redundant coordinates
Topology and big data
Geometry and topology are very natural tools to handle large,
high-dimensional, complex spaces of data
Analogously, living matter must have the ability to handle data, in
situations where the system is barely able to keep pace with the
data produced.
Summaries are more valuable than parameter choices
The conventional method of handling data is to build a graph whose
vertex set is a set of points (cloud), where two points are connected
by an edge if their distance is ≤ δ, and then to try to determine the
optimal choice of δ.
It is however much more informative to maintain the whole
dendrogram, which gives at once a summary of the relevant
features of the clustering under all possible values of δ.
We need to develop mechanisms to know how global features vary
under changes of parameter δ.
The idea that emerges is that the methods to adopt should be
inspired by topology
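As a rough illustration of keeping a summary over all δ instead of picking one value, the sketch below counts the connected components of the proximity graph for several thresholds. The point cloud, the helper name `count_components` and the chosen values of δ are assumptions made for this example, not data from the talk.

```python
from itertools import combinations
from math import dist


def count_components(points, delta):
    """Number of connected components of the graph joining points at distance <= delta."""
    parent = list(range(len(points)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(points)), 2):
        if dist(points[i], points[j]) <= delta:
            parent[find(i)] = find(j)
    return len({find(i) for i in range(len(points))})


# Hypothetical cloud: two tight pairs of points, far apart from each other.
cloud = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0), (6.0, 0.0)]
summary = {delta: count_components(cloud, delta) for delta in (0.5, 1.0, 4.0)}
```

Reading `summary` as a whole shows how the number of components changes as δ grows, which is the kind of global information a single "optimal" δ hides.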
TOPOLOGY
Topology is the branch of mathematics dealing with qualitative
geometric information: it studies not only the connected
components of a space, but more generally the classification of its
loops and higher dimensional manifolds
Topology studies geometric properties in a way which is less
sensitive to metrics than geometric methods: it ignores the value
of distance functions and replaces it with the notion of connective
nearness: proximity
Topology studies only properties of geometric objects which do not
depend on coordinates, but rather on intrinsic geometric properties
of the objects: it is coordinate-free
Betti numbers – the invariants of topological surfaces
Topological information is contained in persistent homology, a
topological invariant that can be determined and presented as a
sort of parameterized version of Betti numbers.
β0: number of connected components (0-dimensional); β1: number of
“circular” holes (1-dimensional); β2: number of holes
(2-dimensional); β3: number of holes (3-dimensional); ...
For 2-dimensional surfaces, topological objects are
also categorized by their genus (number of holes) and by the
associated MCG
A glance into the idea
genus, surfaces and manifolds
FP7 – DyM-CS
Dynamics of Multilevel Complex Systems
Towards a new “mathematics” nine FET Proactive Projects have
been funded, among which
- TOPDRIM: Topology Driven methods for complex systems
- SOPHOCLES: Self-Organised information PrOcessing, CriticaLity
and Emergence in multilevel Systems
Three-step approach
→ from data to a topological global object
1. simplicial complex
2. persistent homology – determines the genus g
→ from topological object to classes of behavioural equivalence
1. basic group – is tied to the genus g
2. mapping class group – classes of equivalence of genus g
→ from classes of equivalence to formal description of global
object
1. finite state automata for objects in a space of genus g
2. regular language for space of data of genus g
Legend of symbols
En        collection of points
δ         weight parameter
H_i(X)    homology groups
X         topological space
b_i       i-th Betti number
G         combinatorial graph
S         simplicial complex and data space
K         group
Γ         closed curve in S
A         tubular neighborhood of Γ
GMC       mapping class group
Step One
Persistent homology
PERSISTENT HOMOLOGY
Persistent homology is a procedure for measuring the ‘lifetime’ of
the topological properties of a simplicial complex under filtration.
The emerging geometric/topological relationships involve
continuous maps between different objects, and therefore become
manifestations of functoriality, i.e., the notion that invariants
can be extended not just to the objects studied, but also to the
maps between such objects.
Functoriality is central in algebraic topology because the
functoriality of homological invariants is what permits one to
compute them from local information.
SIMPLICIAL COMPLEX
The conventional way to convert a collection of points in data space into a global
object is to use the point cloud as vertex set of a graph Γ, whose edges are determined
by proximity.
Γ captures data connectivity, but ignores a wealth of higher order features. Γ is the
scaffold of a higher-dimensional object: the simplicial complex.
A simplicial complex is a topological space constructed by “gluing
together” points, line segments, triangles and their higher-dimensional
analogues (n-simplices), identified combinatorially along their faces,
and obtained by completion of Γ.
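The slides do not name the completion they use. A common choice is the clique (Vietoris-Rips) completion, in which every set of pairwise-connected vertices of Γ becomes a simplex. The sketch below assumes that construction; the function name `clique_complex`, the `max_dim` cut-off and the sample square are illustrative only.

```python
from itertools import combinations
from math import dist


def clique_complex(points, delta, max_dim=2):
    """Simplices (as vertex tuples) of the clique completion of the delta-proximity graph."""
    n = len(points)
    close = {(i, j) for i, j in combinations(range(n), 2)
             if dist(points[i], points[j]) <= delta}
    simplices = [(i,) for i in range(n)]      # 0-simplices: the points themselves
    for k in range(2, max_dim + 2):           # k vertices span a (k-1)-simplex
        for verts in combinations(range(n), k):
            if all(pair in close for pair in combinations(verts, 2)):
                simplices.append(verts)
    return simplices


# Hypothetical data: the corners of a unit square (diagonals have length ~1.414).
square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
edges_only = clique_complex(square, 1.0)   # 4 vertices + 4 sides, a hollow loop
filled = clique_complex(square, 1.5)       # diagonals appear, so triangles fill the loop
```

This brute-force enumeration is only practical for small clouds; it is meant to show what "completing Γ" adds beyond the edges.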
PERSISTENT HOMOLOGY
Filtration is the progressive sequence of sub-complexes that
describes the evolution of the complex, as it is constructed starting
from the empty set and adding simplices.
Homology groups H_i(X) are the most important set of invariants
of a topological space X. Betti numbers are the basic ingredients
for computing such homology groups.
Once the (asymptotically) stable complex is identified, patterns in
data space are finally derived by obtaining Morse numbers (in the
discrete context) from Betti numbers
BETTI NUMBERS
Betti numbers are non-negative integers; the i-th Betti number,
b_i = b_i(X), is the rank of the homology group H_i(X). When homology
is computed with coefficients in a field, knowing the i-th Betti number
is the same as knowing H_i(X) up to isomorphism.
Betti 0, b_0: the number of connected components
Betti 1, b_1: the number of loops (1-dimensional holes)
...
Betti n, b_n: the number of n-dimensional holes
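To make "rank of the homology group" concrete, here is a minimal sketch that computes Betti numbers of a small simplicial complex from the ranks of its boundary matrices. It assumes coefficients in GF(2), which keeps the linear algebra to XOR operations; the function names and the three test complexes are chosen for this example.

```python
from itertools import combinations


def rank_gf2(rows):
    """Rank over GF(2) of a matrix whose rows are given as integer bitmasks."""
    pivots = {}  # highest set bit -> reduced row having that leading bit
    for row in rows:
        while row:
            high = row.bit_length() - 1
            if high not in pivots:
                pivots[high] = row
                break
            row ^= pivots[high]
    return len(pivots)


def betti_numbers(top_simplices):
    """Betti numbers (GF(2) coefficients) of the complex generated by the given simplices."""
    faces = {}  # dimension -> set of simplices, closed under taking faces
    for s in top_simplices:
        for k in range(1, len(s) + 1):
            for f in combinations(sorted(s), k):
                faces.setdefault(k - 1, set()).add(f)
    max_dim = max(faces)
    index = {d: {f: i for i, f in enumerate(sorted(faces[d]))} for d in faces}
    # rank[d] is the rank of the boundary map from d-simplices to (d-1)-simplices.
    rank = {0: 0, max_dim + 1: 0}
    for d in range(1, max_dim + 1):
        rows = []
        for f in index[d]:
            mask = 0
            for g in combinations(f, d):  # the (d-1)-dimensional faces of f
                mask |= 1 << index[d - 1][g]
            rows.append(mask)
        rank[d] = rank_gf2(rows)
    # b_d = dim(C_d) - rank(boundary_d) - rank(boundary_{d+1})
    return [len(faces[d]) - rank[d] - rank[d + 1] for d in range(max_dim + 1)]


hollow_triangle = betti_numbers([(0, 1), (1, 2), (0, 2)])
filled_triangle = betti_numbers([(0, 1, 2)])
hollow_tetrahedron = betti_numbers([(0, 1, 2), (0, 1, 3), (0, 2, 3), (1, 2, 3)])
```

The hollow triangle has one component and one loop, filling it removes the loop, and the hollow tetrahedron (a triangulated sphere) has one component, no loops and one 2-dimensional void.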
Step Two
Mapping Class Group
MCG – Mapping class Group
In the subfield of geometric topology, the mapping class group is
an important algebraic invariant of a topological space.
Briefly, the mapping class group is a discrete group of “symmetries”
of a given space.
The group is the mathematical tool for constructing languages.
From topological object to classes of behavioural
equivalence
The second step consists of analysing the behavior of the
topological object under all possible, topology preserving,
transformations.
→ The MCG leads to classifying the space of data into classes of
equivalence, each of which represents symmetries and
regularities hidden within the space of data itself.
→ They are determined by the cosets of the mapping class group
of the topological space which is our object of analysis.
MCG – Hatcher & Thurston
Presentation of a group
In group theory, one method of defining a group G is by its
presentation G ∼ ⟨S | R⟩, where S is the set of generators and R
the set of relations.
- In short, the mapping class group of a topological (simplicial
complex) space S is the group of isotopy classes of
automorphisms of S; that is, the group of all transformations
of the space into a topologically identical object.
- Performing these transformations implies imposing an order
on the space S and determining a set of equivalence classes that
represent a partition of S.
- The mapping class group is presented by a set of generators
and relations among generators that characterize the
equivalence classes. Such a presentation can be
straightforwardly expressed in the language generated by the
group that provides the basis to which S can be referred.
G168
As an example, take G168, a basis for surfaces of genus g = 3, and
its presentation G168 ∼ ⟨U, V | V^2, (UV)^3, U^7, (VU^4)^4⟩ (notice
that the action of each generator is assumed to be invertible;
therefore – even though this is never explicitly done – together with U
and V, the inverses U^-1 and V^-1 should in principle be listed in
the presentation).
- This means that G168 is the basic ingredient by which to
generate the languages suitable to describe any object
represented by a genus-three 2-manifold.
- Even if elements in different classes may represent the same
concept, the configuration in which they are spatially
connected must be different; that means the order in the
relations is changed. Going back to the example of
CH4, if we take the topological object of genus g = 3 and we
use G168, we can create with it the classes of equivalence of all
transformations of CH4, each class representing different
spatial configurations.
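The relations of the presentation can be checked on a concrete model. The sketch below assumes that G168 denotes the order-168 group PSL(2,7), as the name and the genus-3 context suggest, and that U and V can be represented by the 2×2 matrices shown, taken modulo 7 and up to sign. That choice of matrices is mine, not the talk's. The code verifies that the four relators evaluate to the identity and that the two matrices generate 168 elements; it does not prove that the abstract presentation defines exactly this group.

```python
P = 7  # work with 2x2 matrices modulo 7


def normalize(m):
    """Canonical representative of +m / -m modulo P; a matrix is a tuple (a, b, c, d)."""
    plus = tuple(x % P for x in m)
    minus = tuple((-x) % P for x in m)
    return min(plus, minus)


def mul(m, n):
    a, b, c, d = m
    e, f, g, h = n
    return normalize((a * e + b * g, a * f + b * h, c * e + d * g, c * f + d * h))


IDENTITY = normalize((1, 0, 0, 1))
U = normalize((1, 1, 0, 1))    # assumed image of the generator U
V = normalize((0, -1, 1, 0))   # assumed image of the generator V


def power(m, k):
    result = IDENTITY
    for _ in range(k):
        result = mul(result, m)
    return result


# The four relators of the presentation, evaluated on the matrices.
relators = {
    "V^2": power(V, 2),
    "(UV)^3": power(mul(U, V), 3),
    "U^7": power(U, 7),
    "(VU^4)^4": power(mul(V, power(U, 4)), 4),
}


def generated_group(generators):
    """All elements reachable from the identity by right-multiplying with the generators."""
    seen = {IDENTITY}
    frontier = [IDENTITY]
    while frontier:
        new = []
        for x in frontier:
            for g in generators:
                y = mul(x, g)
                if y not in seen:
                    seen.add(y)
                    new.append(y)
        frontier = new
    return seen


group = generated_group([U, V])
```

Because the group is finite, products of U and V alone already reach the inverses, which is consistent with the remark that U^-1 and V^-1 need not be listed explicitly.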
MCG for data space of genus g=3
Step Three
Formal languages
Formal language
If Σ is an alphabet, i.e., a finite non-empty set of symbols,
- a string over Σ is a finite sequence of symbols obtained by
juxtaposition; the length of a string is the number of its
symbols; the concatenation of two strings is the juxtaposition
of the two strings,
- Σ* is the set of all possible strings over Σ; if
L ⊆ Σ*, then L is a language,
- Σ* can be defined by Σ* = Σ^0 ∪ Σ^1 ∪ Σ^2 ∪ ..., where Σ^i is the
set of all strings whose length is i and Σ^0 = {ε}, the language
containing only the empty word (or string) ε.
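The decomposition of Σ* by string length can be sketched directly. The helper names below and the two-letter alphabet are assumptions for illustration; since Σ* is infinite, the second function stops at a chosen maximum length.

```python
from itertools import product


def strings_of_length(sigma, i):
    """Sigma^i: all strings of length i over the alphabet sigma, in sorted order."""
    return ["".join(letters) for letters in product(sorted(sigma), repeat=i)]


def strings_up_to(sigma, n):
    """Finite part of Sigma*: the union of Sigma^0, Sigma^1, ..., Sigma^n."""
    result = []
    for i in range(n + 1):
        result.extend(strings_of_length(sigma, i))
    return result


sigma = {"a", "b"}
level_0 = strings_of_length(sigma, 0)   # contains only the empty string
level_2 = strings_of_length(sigma, 2)
prefix_of_sigma_star = strings_up_to(sigma, 3)
```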
Formal language and Regular Languages
Given an alphabet Σ = {a, b, c, ...} ∪ {ε}, the set of regular
expressions is defined by
- E ::= a | E + E | E • E | E* | (E) with a ∈ Σ,
- the set of regular languages is defined by L[ε] = {ε};
L[a] = {a}; L[E + F] = L[E] ∪ L[F]; L[E • F] = L[E] • L[F];
L[E*] = (L[E])*.
As an example, given an alphabet Σ = {a, b} ∪ {ε} and a
regular expression E = ab*, the language
L[E] = L[a] • L[b*] = {a} • {b}* =
{a}({ε} ∪ {b} ∪ {bb} ∪ ...) = {a, ab, abb, abbb, ...} is the
regular language that describes all the strings that start with
‘a’ and end with a certain number of ‘b’, possibly zero.
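The example E = ab* can be tried with Python's `re` module. Note two differences from the notation above: Python writes union as `|` rather than `+`, and concatenation is implicit. The sample strings are arbitrary choices for this sketch.

```python
import re

# The regular expression E = ab* from the example, in Python syntax.
E = re.compile(r"ab*")


def in_language(s):
    """True if the whole string s belongs to L[E]."""
    return E.fullmatch(s) is not None


samples = ["a", "ab", "abbb", "", "b", "ba", "aab"]
accepted = [s for s in samples if in_language(s)]
```

`fullmatch` is used rather than `search` so that membership is tested for the entire string, matching the set-theoretic definition of L[E].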
Formal languages and the Myhill-Nerode theorem
Lemma
Let S be a nonempty set and let ∼ be an equivalence relation on
S. Then ∼ yields a natural partition of S into cells
ā = {x ∈ S | x ∼ a}, where ā is the subset to which a belongs.
Each cell ā is an equivalence class.
Theorem (Myhill-Nerode)
If L is any subset of Σ∗ , one defines an equivalence relation ∼
(called the syntactic relation) on Σ∗ as follows: u ∼ v is defined
to mean uw ∈ L if and only if vw ∈ L for all w ∈ Σ∗ . The
language L is regular if and only if the number of equivalence
classes of ∼ is finite. If a language is regular, then the number of
equivalence classes is equal to the number of states of the minimal
deterministic finite automaton A accepting L.
Nerode, Anil (1958), “Linear Automaton Transformations”,
Proceedings of the AMS 9
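The relation in the theorem can be explored experimentally. The sketch below groups prefixes u by their "signature", i.e. which suffixes w make uw fall in L. It only tests strings up to a bounded length, so in general it gives an approximation of the true classes; for the small language ab* used earlier, the bounds chosen here are enough. Function names and bounds are assumptions of this example.

```python
import re
from itertools import product


def words(sigma, max_len):
    """All strings over sigma of length 0..max_len, shortest first."""
    for n in range(max_len + 1):
        for letters in product(sigma, repeat=n):
            yield "".join(letters)


def nerode_classes(in_language, sigma, prefix_len, suffix_len):
    """Group prefixes by which bounded-length suffixes lead into the language."""
    suffixes = list(words(sigma, suffix_len))
    classes = {}
    for u in words(sigma, prefix_len):
        signature = tuple(in_language(u + w) for w in suffixes)
        classes.setdefault(signature, []).append(u)
    return list(classes.values())


def in_ab_star(s):
    return re.fullmatch(r"ab*", s) is not None


classes = nerode_classes(in_ab_star, "ab", prefix_len=4, suffix_len=4)
representatives = [cls[0] for cls in classes]
```

The three classes found correspond to the three states of the minimal complete DFA for ab*: the start state, the accepting state reached after `a`, and a rejecting sink.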
From classes of equivalence to formal description of global
object
The third step consists of interpreting the topological global object
with a suitable formal language able to describe the global
properties and symmetries. We recall that
→ for any regular language we can define a non-deterministic
finite state automaton, and for any non-deterministic finite
state automaton we can define a deterministic finite state
automaton.
→ a language is regular if the strings of the language can be
classified in a finite number of classes of equivalence, and the
number of equivalence classes is equal to the number of states
of the minimal deterministic finite automaton accepting the
language (by Myhill-Nerode)
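The passage from a non-deterministic to a deterministic automaton is usually done with the subset construction, sketched below for automata without ε-transitions. The example NFA, which accepts strings over {a, b} ending in "ab", and all names are hypothetical and not taken from the talk.

```python
def determinize(nfa, start, accepting, sigma):
    """Subset construction. Returns (dfa_transitions, dfa_start, dfa_accepting)."""
    dfa_start = frozenset([start])
    dfa = {}
    todo = [dfa_start]
    while todo:
        state = todo.pop()
        if state in dfa:
            continue
        dfa[state] = {}
        for symbol in sigma:
            # A DFA state is the set of NFA states reachable on this symbol.
            target = frozenset(t for q in state for t in nfa.get((q, symbol), ()))
            dfa[state][symbol] = target
            todo.append(target)
    dfa_accepting = {s for s in dfa if s & accepting}
    return dfa, dfa_start, dfa_accepting


def accepts(dfa, start, accepting, word):
    state = start
    for symbol in word:
        state = dfa[state][symbol]
    return state in accepting


# Hypothetical NFA: in q0, reading 'a' may either stay in q0 or guess that "ab" follows.
nfa = {("q0", "a"): {"q0", "q1"}, ("q0", "b"): {"q0"}, ("q1", "b"): {"q2"}}
dfa, d_start, d_accepting = determinize(nfa, "q0", {"q2"}, "ab")
results = {w: accepts(dfa, d_start, d_accepting, w) for w in ["ab", "aab", "abb", "", "bab"]}
```

Only the subsets actually reachable from the start state are built, which is why this example yields three DFA states rather than all eight subsets of {q0, q1, q2}.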
Formal language for describing the global object
Since both groups, G168 and MCG (respectively the basis and the
mapping class group), determine a finite set of classes of
equivalence over the space of data, we can define the corresponding
non-deterministic and deterministic automata that recognize the
languages defined by the groups.
Formal language for a group language
Given the finite presentation of the group G ∼ ⟨S | R⟩, with
S = {s_1, ..., s_m} and R = {r_1, r_2, ..., r_n}, we can associate to each
relation r_i ∈ R, for i = 1, ..., n, a language L_{r_i} that recognizes all the
elements subject to r_i. The language L associated to the
presentation of G is the union of all the languages that recognize the
relations in R, whose symbols are in Σ = S:
L = ⋃_{r_i ∈ R} L_{r_i}
Formal language for G168
Given the presentation G168 ∼ ⟨U, V | V^2, (UV)^3, U^7, (VU^4)^4⟩,
we can construct the non-deterministic finite state automaton that
recognizes the language L of all the strings generated by the
generators of the group G168.
The language L168 defined over the alphabet Σ = {U, V} for the
presentation of the group G168 is
L168 = L_{V^2} ∪ L_{(UV)^3} ∪ L_{U^7} ∪ L_{(VU^4)^4}.
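The slides do not spell out what each language L_r contains. Purely as an illustration of how a union of languages maps onto an automaton or regular expression, the sketch below assumes the simplest possible reading, L_r = {r}, where r is the relator written out as a word over {U, V}. That reading, and the helper `expand`, are assumptions of this example and may differ from the construction intended in the talk.

```python
import re

# Relators as (base word, exponent) pairs, e.g. (VU^4)^4 is ("VUUUU", 4).
RELATORS = [("V", 2), ("UV", 3), ("U", 7), ("VUUUU", 4)]


def expand(base, exponent):
    """Write a relator out as a plain word over the alphabet {U, V}."""
    return base * exponent


relator_words = [expand(base, k) for base, k in RELATORS]

# Union of languages corresponds to alternation in a regular expression.
L168 = re.compile("|".join(relator_words))


def in_L168(s):
    return L168.fullmatch(s) is not None
```

Under this reading each alternative of the expression corresponds to one branch of a non-deterministic automaton whose edges are labeled by the letters of the relator word.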
NDFA
L ∼ G168 = ⟨U, V | V^2, (UV)^3, U^7, (VU^4)^4⟩.
We label an edge of an automaton with a word as a shorthand for a sequence of states
and transitions such that the concatenation of the labels on the transitions equals the
word
DFA
L ∼ G168 = ⟨U, V | V^2, (UV)^3, U^7, (VU^4)^4⟩.
Honey-NDFA
L ∼ G168 = ⟨U, V | V^2, (UV)^3, U^7, (VU^4)^4⟩.
NFA, shortened representation and equivalent DFA
accepting L ∼ G168 = ⟨U, V | V^2, (UV)^3, U^7, (VU^4)^4⟩.
Multiscale
The topology-based approach is able to process the data in a
uniform way, through the filtration used by persistent homology, but
also to characterize the space of data by different invariants so as to
emphasize different features (e.g., scales).
Topology has been widely used for multiscale analysis in the context
of quantum gravity and the theory of turbulence. However, the use of
topology for modeling multilevel complex systems is still a
challenge.
The TOPDRIM research community is tackling it with different
approaches, among which is the two-level model we are
developing at the University of Camerino
PATTERN FORMATION in Complex Systems
- Pattern formation, i.e. the emergence of non-trivial
superstructures in complex systems. These patterns cannot be reconstructed
by applying a reductionist approach. The emergent features arise
out of the lower-level interactions. Often, the patterns react back
on those lower levels in a possibly infinite feedback loop.
- A complex system can be defined as a multi-level system composed
of many non-identical elements entangled in loops of non-linear
local interactions to produce patterns observable as global
properties of the system.
Future work
All the ideas and results are joint work with Mario Rasetti
from ISI Foundation in Turin
Thanks
Many thanks to the Lorentz Center and all the Scientific Organizers
Many thanks to the EC for funding TOPDRIM
Many thanks to all my collaborators at University of Camerino
Some References
E. Merelli, M. Rasetti (2013). Non-locality, topology, formal languages: new global
tools to handle large data set. Accepted for the Int. Conference on Computational
Science (ICCS).
E. Merelli, M. Rasetti (2012). The Immune System as a metaphor for topology driven
patterns formation in complex systems. In Proc. of 11th Int. Conf. ICARIS 2012.
Springer LNCS 7597.
E. Merelli, N. Paoletti, L. Tesei (2013). Adaptability Checking in Multi-Level Complex
Systems. Submitted.
E. Merelli, N. Paoletti, L. Tesei (2012). A multi-level model for self-adaptive systems.
In N. Kokash, A. Ravara (Eds.), Proceedings of FOCLASA: 11th International
Workshop on Foundations of Coordination Languages and Self Adaptation. A Satellite
Workshop of CONCUR 2012, Newcastle, 8th September 2012.