
KNOWLEDGE DISCOVERY TOOLS FOR
THE SPANISH VIRTUAL OBSERVATORY
L.M. Sarro, A. Delgado and A. Rodríguez
Dpt. Artificial Intelligence, ETSI Informática, UNED, Madrid, Spain
Spanish Virtual Observatory, LAEFF, Villafranca del Castillo, Spain
Abstract & Intro
In this poster we deal with the problem of knowledge discovery in scientific databases. The advent of virtual observatories is providing astronomers with a previously unseen wealth of data and is stimulating the development of the new generation of tools needed to handle such vast amounts of scientific data. This development is driven by domain-specific objectives, and the aim of this poster is to awaken interest in the astrophysical community in the new questions that can now be posed using the language of machine learning. The main areas of machine learning that can be of use to astronomers mining a virtual observatory are summarized in the following figure:
● PREDICTIVE MODELLING: automatic classification, time series prediction, regression
● LINK ANALYSIS TECHNIQUES: data-induced rule definition, functional correlation discovery
● DATABASE SEGMENTATION: dataset clustering techniques, taxonomical knowledge
● DEVIATION DETECTION: outlier detection of exotic objects, class anomalies
KNOWLEDGE DISCOVERY IN VIRTUAL OBSERVATORIES
We present here two lines of research currently being developed for the Spanish Virtual Observatory (hereafter SVO). The first line applies to the automatic classification of light curves by means of Multi-Agent Systems (MAS) and Artificial Neural Networks trained with Bayesian methods. The second aims to solve the problem of optimal model parameter estimation, given a set of observations, by means of Bayesian reasoning. Furthermore, the SVO has already implemented a similarity-based sample retrieval (SBSR) service that allows users to recover light curves from the archive based on morphological proximity in a 2D Self-Organizing Map. The same methodology can, in principle, be applied to recover spectral cutouts with similar features (e.g. line shapes, absorption/emission lines, etc.).
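By way of illustration, the following minimal sketch shows how such similarity-based retrieval might work, assuming a SOM already trained on resampled and normalized light curves; the names (som_weights, archive_index, archive_curves) are hypothetical and do not refer to the actual SVO service.

import numpy as np

def best_matching_unit(som_weights, curve):
    """Return the (row, col) of the SOM node closest to the query curve.

    som_weights : array (rows, cols, n_bins) of prototype light curves
    curve       : array (n_bins,), resampled and normalized query curve
    """
    dists = np.linalg.norm(som_weights - curve, axis=-1)
    return np.unravel_index(np.argmin(dists), dists.shape)

def retrieve_similar(som_weights, archive_index, archive_curves, query, k=5):
    """Recover the k archive light curves whose map position is closest
    to the best matching unit (BMU) of the query curve.

    archive_index : dict mapping an archive curve id to its BMU (row, col)
    """
    bmu = best_matching_unit(som_weights, query)
    # Rank archive curves by map distance between their BMU and the query's BMU
    candidates = sorted(
        archive_index.items(),
        key=lambda item: np.hypot(item[1][0] - bmu[0], item[1][1] - bmu[1]),
    )
    return [archive_curves[cid] for cid, _ in candidates[:k]]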
[Figure: light curve used to query the archive, Kohonen map constructed from the archive, and first light curves recovered from the archive.]
MULTIAGENT SYSTEM FOR THE IDENTIFICATION OF BRIGHTNESS VARIATIONS (I)
1.- Problem description:
Given a set of observations of the brightness of a stellar object, automatically...
● assess the degree of variability;
● classify it as periodic/non-periodic;
● classify the nature of the variability (pulsation, eclipses, rotational modulation, flares...);
● refine the variability class (e.g. if an eclipsing variable, what geometric configuration? if a pulsating star, RR Lyrae, Mira, Cepheid...?);
● provide a full diagnostic of the variability plus all associated derived measures used in the reasoning (presence/absence of spectral lines, line positions, widths, shapes...).
2.- Tasks involved (from highest to lowest level)
● Solution planning: initial design of the strategy used to characterize the variability, plus automatic redefinition of the strategy as new knowledge is discovered.
● Web search of theoretical/observational/bibliographic information.
● Probabilistic interpretation of the available data (classification).
● Extraction of features from raw data.
MAS scheme with some agents and communication links removed for clarity
3.- Implemented agents based on Bayesian neural networks (I). Theoretical framework.
Although we use neural networks both for the definition of the mean brightness as a function of time (regression) and for the morphological classification of light curves, we deviate significantly from classical backpropagation training. Under the Bayesian framework, instead of a class assignment or a brightness prediction we obtain a predictive probability distribution. The network class prediction C_{n+1} for a new light curve x_{n+1} given a training set S_train is computed as

P(C_{n+1} | x_{n+1}, S_train) = ∫ P(C_{n+1} | x_{n+1}, θ) p(θ | S_train) dθ,

where we have represented the parameter set (the connection weights of a neural network) as θ. The a posteriori probability p(θ | S_train) can be computed by applying Bayes' theorem for each possible parameter set θ. Once this probability distribution is obtained, single-valued predictions can be obtained by minimizing loss functions such as the squared error loss (equivalent to guessing the mean), the absolute error loss (equivalent to guessing the median) or, as in our case, a 0-1 loss more suitable for classification tasks (equivalent to guessing the mode). The extension to the regression problem is straightforward.
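In practice the integral above is approximated with a finite set of networks weighted by their a posteriori probabilities. The following sketch illustrates that approximation and the 0-1 loss (mode) prediction; the ensemble, its weights and the callable interface are assumptions made for illustration, not the SVO implementation.

import numpy as np

def predictive_class_distribution(networks, weights, x_new):
    """Approximate P(C | x_new, S_train) with a finite ensemble of networks.

    networks : list of callables, each mapping a light-curve feature vector
               to a vector of class probabilities P(C | x, theta_i)
    weights  : a posteriori weights p(theta_i | S_train) of the networks
    """
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()                     # normalize the posterior weights
    probs = np.array([net(x_new) for net in networks])    # shape (n_nets, n_classes)
    return weights @ probs                                # posterior-weighted average

def predict_class(networks, weights, x_new):
    """0-1 loss prediction: the mode of the predictive distribution."""
    return int(np.argmax(predictive_class_distribution(networks, weights, x_new)))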
MULTIAGENT SYSTEM FOR THE IDENTIFICATION OF BRIGHTNESS VARIATIONS (II)
3a.- Implemented agents (II): Bayesian NN agent for variability detection in data from the Optical
Monitoring Camera on board INTEGRAL
I. Separate the data into chunks.
II. Perform statistical regression with an ensemble of neural networks trained by Bayesian methods. The prediction for each JD is the weighted sum of all the neural networks, the weight being the a posteriori likelihood of each network given the data.
III. Search for systematic trends above 3σ (given by the regression) and for outliers.
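A minimal sketch of steps II and III, assuming each regression network is a callable returning a magnitude for a given JD, that the weights are the normalized a posteriori likelihoods, and that the ensemble spread is used as the uncertainty for the 3σ test (this last point is our own assumption).

import numpy as np

def ensemble_regression(networks, weights, jd):
    """Posterior-weighted mean and spread of the ensemble prediction at epoch jd.

    networks : list of callables mapping a Julian Date to a predicted magnitude
    weights  : normalized a posteriori weights of the networks
    """
    preds = np.array([net(jd) for net in networks])
    mean = np.dot(weights, preds)
    sigma = np.sqrt(np.dot(weights, (preds - mean) ** 2))
    return mean, sigma

def flag_outliers(networks, weights, jds, magnitudes):
    """Flag observations deviating by more than 3 sigma from the regression."""
    flags = []
    for jd, mag in zip(jds, magnitudes):
        mean, sigma = ensemble_regression(networks, weights, jd)
        flags.append(abs(mag - mean) > 3.0 * sigma)
    return np.array(flags)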
3b.- Implemented agents (III): Bayesian NN agent for periodicity classification
The agent is composed of the following modules:
● Preprocessing
● Pattern completion with a SOM
● Bayesian NN classification
● Bayesian NN regression
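The poster does not detail the pattern-completion step; a common way to complete a partially observed curve with a SOM is to locate the best matching unit using only the observed bins and to copy the missing bins from its prototype, as in this hypothetical sketch (the actual SVO module may differ).

import numpy as np

def complete_pattern(som_weights, partial_curve, observed_mask):
    """Fill missing bins of a light curve using a Self-Organizing Map.

    som_weights   : array (rows, cols, n_bins) of prototype curves
    partial_curve : array (n_bins,) with arbitrary values at the missing bins
    observed_mask : boolean array (n_bins,), True where the bin was observed
    """
    # Find the best matching unit using only the observed bins
    diffs = som_weights[..., observed_mask] - partial_curve[observed_mask]
    bmu = np.unravel_index(np.argmin(np.linalg.norm(diffs, axis=-1)), diffs.shape[:2])
    # Copy the missing bins from the BMU prototype
    completed = partial_curve.copy()
    completed[~observed_mask] = som_weights[bmu][~observed_mask]
    return completed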
Classification scheme:
● Eclipsing variables: Type 0, Type 1, Type 2, Type 3
● Pulsating variables: Type 4, Type 5, Type 6
OPTIMAL CIRCUMSTELLAR DISK PARAMETER ESTIMATION
1.- Problem description:
Find the parameter set that best explains the observational SED (Spectral Energy Distribution) of a given circumstellar disk +
star system.
Given a SED such as the one shown, compute the optimal degree of complexity (number of model components) and the parameter set (see below) that best reproduces the data. The simplest model includes only the central star, while more complex models include a circumstellar disk with several dust components and a heated inner wall emitting a black-body SED (right).
2.- Bayesian approach & (a few) formulas
The Bayesian approach to model parameter estimation tries to:
● incorporate as much knowledge as is available (by making use of prior probabilities on the model parameters);
● provide probability distributions instead of (or besides) single-valued predictions;
● assess the required model complexity (number of disk components);
● provide a quantitative measure of the goodness of the choice;
● allow for the combination of datasets from different origins, with variable weights determined from the statistical properties of the data.
Let T0 and T1 denote two different models or a given model with two different degrees of complexity in the description, and let
θ and Φ denote the parameter sets needed to fully specify a model in T0 and T1 respectively. Consider, for example, the
circumstellar disk models by D'Alessio et al. (1998 and subsequent papers) with one or two dust components. Then, θ consists
of the following parameters: (i) internal and external disk radii, (ii) inclination angle, (iii) mass accretion rate, (iv) viscosity, (v)
maximum dust grain size and (vi) slope of dust grain size distribution. A description of the grid of models that explores the
parameter space for different temperatures and ages of the central stars can be found in the poster
`New WWW database of self consistent physical disk models for SED analysis'
by B. Merín, P. D'Alessio, N. Calvet, L. Hartmann and B. Montesinos
Let L denote a set of observations of the SED of a star with a circumstellar disk. The evidence supporting T0 and T1 can be computed as

P(L | T0) = ∫ P(L | θ, T0) p(θ | T0) dθ,    P(L | T1) = ∫ P(L | Φ, T1) p(Φ | T1) dΦ,

and the model with the most compact support is chosen, thus naturally incorporating Occam's razor criterion. Once the model (or, as in our case, the model complexity) is chosen, the posterior probability distribution of parameter sets given the SED L can be used to bracket a hypercube in parameter space where the parameter estimation can be optimized.
At present, the agent computes the integrals by summing over all the models in the grid. In the future, if more computational power becomes available, Markov Chain Monte Carlo (MCMC) methods can be used to approximate the values of the evidences.
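A minimal sketch of this grid summation, assuming a Gaussian likelihood for the photometric points and a normalized prior over the grid (both assumptions made for illustration; the function and array names are hypothetical).

import numpy as np
from scipy.special import logsumexp

def log_evidence(grid_seds, grid_log_prior, obs_flux, obs_err):
    """Approximate log P(L | T) by summing over a discrete grid of models.

    grid_seds      : array (n_models, n_bands), model SED fluxes for family T
    grid_log_prior : array (n_models,), log prior probability of each grid point
    obs_flux, obs_err : observed SED fluxes and their Gaussian uncertainties
    """
    # Gaussian log-likelihood of the observed SED under each grid model
    resid = (grid_seds - obs_flux) / obs_err
    log_like = -0.5 * np.sum(resid ** 2 + np.log(2.0 * np.pi * obs_err ** 2), axis=1)
    # Evidence = sum_i P(L | theta_i, T) p(theta_i | T), computed in log space
    return logsumexp(log_like + grid_log_prior)

def choose_complexity(log_evidences):
    """Select the model family (e.g. number of disk components) with the largest evidence."""
    return int(np.argmax(log_evidences))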
The next step will be to combine several sets of data from different virtual observatories, covering different wavelength ranges, in the estimation process. This can be achieved consistently by introducing the weights assigned to each dataset as new parameters of the estimation process (hyperparameters).
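As a purely illustrative sketch of this idea (not the SVO implementation), per-dataset weight hyperparameters could scale each dataset's contribution to the joint log-likelihood as follows.

import numpy as np

def weighted_joint_log_likelihood(datasets, model_flux, weights):
    """Joint log-likelihood of several SED datasets with per-dataset weights.

    datasets   : list of (bands, obs_flux, obs_err) tuples, one per archive
    model_flux : callable mapping an array of bands/wavelengths to model fluxes
    weights    : per-dataset hyperparameters, to be estimated alongside the
                 physical parameters, scaling each dataset's contribution
    """
    total = 0.0
    for (bands, obs_flux, obs_err), w in zip(datasets, weights):
        resid = (model_flux(bands) - obs_flux) / obs_err
        total += w * (-0.5 * np.sum(resid ** 2 + np.log(2.0 * np.pi * obs_err ** 2)))
    return total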