Active Machine Learning for Embryogenesis

Emmanuel FAURE1, Benoit Lombardot1, Miguel Luengo Oroz2, Matteo Campana3, Nadine Peyriéras4, René Doursat1, and Paul Bourgine1
1 Centre de Recherche en Epistémologie Appliquée, CNRS - École Polytechnique, {emmanuel.faure,benoit.lombardot,paul.bourgine,rene.doursat}@shs.polytechnique.fr
2 ETSI, Universidad Politécnica de Madrid [email protected]
3 Informatica e Sistemistica, Università di Bologna [email protected]
4 DEPSN, CNRS - Gif-sur-Yvette [email protected]
Summary. The intrinsic complexity of biological systems creates huge amounts
of unlabeled experimental data. The exploitation of such data can be achieved by
performing active machine learning accompanied by a high-level symbolic expert
who defines categories and their best boundaries using as little data as possible. We
present a global strategy for designing active machine learning methods suited for
the observation and analysis of complex systems, such as embryonic development.
We developed a procedure that uses all available knowledge, whether gathered manually or automatically, and is able to readjust when new data is provided. We show
that it is a powerful method for the investigation of the morphogenetic features of
embryogenesis. It will make it possible to properly reconstruct in vivo cell morphodynamics, a main challenge of the post-genomic era.
1 Introduction
The European project Embryomics aims at fully reconstructing in both space
and time the cell lineage of four different species of deuterostome embryos
from the one-cell stage throughout their embryogenesis. This task is crucial
to the understanding of how organs are shaped during development and how
they are clonally related. It can be achieved by time-lapse laser scanning microscopy, which produces three-dimensional images through time (3D+t) from
live embryos that have been engineered to highlight their cellular structures
[1]. This type of data gives access to the reconstruction of the morphodynamics of cell division and pattern formation during embryonic development in
live animals. Toward this goal, we use several sophisticated image processing methods and tools (filtering, detection, segmentation, tracking, matching,
etc.) and link them together as modules in a chain, otherwise referred to as a
"workflow". At every step of the workflow, difficult perceptual and cognitive
comparison tasks, which would normally involve great human learning effort,
must be performed. The main goal of the Embryomics project is to automate
the workflow with the use of machine learning methods. In this paper we
present global strategies for designing active machine learning methods suited
for the observation and analysis of complex systems. We address in particular
pattern recognition applied to the morphological features of embryogenesis.
We give an example of segmentation and mitosis phase detection and conclude
with a short discussion.
2 Active machine learning
Active machine learning algorithms are used when large numbers of unlabeled
examples are available but obtaining labels for them is costly (e.g., it requires
consulting a human expert)[2]. Typically, this is the case of biological and
social systems, whose intrinsic complexity leads to huge amounts of unlabeled
data. Exploiting such data can be achieved by active machine learning, in
which a high-level symbolic naming expert defines categories and traces best-guess boundaries using as little data as possible. The overall procedure is then
to use all the currently available knowledge and readjust when new data is
added. It involves the following four algorithmic steps:
1. Data clustering → Unsupervised Learning
2. Pre-classification defined by categorization → Supervised Learning
3. First experiment
a) the active learning system asks the human supervisor to annotate the
data at the class boundaries → Semi-supervised Learning
b) the active learning system refines the boundaries using this new knowledge and loops back to step a) until it has no other question
4. Following experiments
a) the system classifies new data using all the knowledge it accumulated
so far → Self-supervised Learning
b) the system redefines the class boundaries through step 3.
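The query loop of step 3 can be illustrated with a deliberately small sketch. It assumes a one-dimensional feature and a simple threshold classifier standing in for a real class boundary; the function name `active_learning`, the `oracle` callback (playing the human supervisor) and the stopping rule are illustrative assumptions, not part of the procedure described above.

```python
from typing import Callable, Dict, List, Tuple


def active_learning(
    pool: List[float],
    seed: Dict[float, int],
    oracle: Callable[[float], int],
) -> Tuple[float, int]:
    """Refine a 1D class boundary by querying the expert near it.

    `seed` must contain at least one sample of each class (-1 and +1).
    Returns the final threshold and the number of labels requested.
    """
    labeled = dict(seed)  # steps 1-2: labels from the pre-classification
    queries = 0
    while True:
        # Current boundary: midpoint of the gap between the two classes.
        low = max(x for x, y in labeled.items() if y == -1)
        high = min(x for x, y in labeled.items() if y == +1)
        threshold = (low + high) / 2
        # Only unlabeled samples inside the gap can still move the boundary.
        candidates = [x for x in pool if low < x < high and x not in labeled]
        if not candidates:
            return threshold, queries  # the system has no other question
        # Step 3a: ask the supervisor about the sample closest to the boundary.
        query = min(candidates, key=lambda x: abs(x - threshold))
        # Step 3b: the new label refines the boundary on the next pass.
        labeled[query] = oracle(query)
        queries += 1
```

With a pool of 21 samples, this sketch typically needs only a handful of expert labels, which is the point of annotating at the class boundaries rather than labeling everything.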
3 Active machine learning in the Embryomics project
3.1 Structure of the automated Embryomics workflow
The image processing workflow starts with 4D image acquisition and leads
to the reconstruction of the cell lineage tree. Cell lineage can be represented
as a branching process and is achieved by tracking cells, cell divisions and
cell deaths. In this paper we report experiments about the zebrafish embryo
[3] , a fast developing vertebrate model easily accessible in its earliest stages.
However its relatively large number of cells (expanding from 1 to 50,000 in
24 hours) presents a new challenge. Acquired images are first passed through
denoising and structure enhancement filters [4], before the detection [5] and
segmentation [6] of cell nuclei and membranes. This leads to reconstruction of
the embryo in 3D and is followed by the tracking [7] of cell divisions, displacements and deaths to create the final lineage tree. Finally, various interesting
biological features at different scales, from single cell (e.g., mitosis) to sheets
and organs (e.g., deformations), can be classified and recognized. In the end,
the workflow makes it possible to categorize the cells into morphogenetic fields.
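As an illustration of the lineage tree as a branching process, the following minimal sketch stores each tracked cell as a node whose life ends in a division (two daughters) or a death. The `Cell` class, its field names and the `count_alive` helper are hypothetical and do not describe the data structure actually used in the Embryomics workflow.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Cell:
    """One node of the lineage tree (hypothetical layout)."""
    cell_id: int
    birth_t: int                      # time step at which the cell appears
    end_t: Optional[int] = None       # time step of division or death
    fate: str = "alive"               # "alive", "divided" or "dead"
    daughters: List["Cell"] = field(default_factory=list)

    def divide(self, t: int, id_a: int, id_b: int) -> Tuple["Cell", "Cell"]:
        """Record a division at time t and create the two daughter cells."""
        self.end_t, self.fate = t, "divided"
        self.daughters = [Cell(id_a, t), Cell(id_b, t)]
        return self.daughters[0], self.daughters[1]

    def die(self, t: int) -> None:
        """Record a cell death at time t (the branch stops here)."""
        self.end_t, self.fate = t, "dead"


def count_alive(cell: Cell) -> int:
    """Count the cells still alive in the subtree rooted at `cell`."""
    if cell.fate == "alive":
        return 1
    return sum(count_alive(d) for d in cell.daughters)


# Usage: a zygote divides, one daughter divides again, the other dies.
zygote = Cell(cell_id=0, birth_t=0)
a, b = zygote.divide(t=5, id_a=1, id_b=2)
a.divide(t=9, id_a=3, id_b=4)
b.die(t=10)
```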
3.2 Assisting the Embryomics workflow with machine learning design
The two crucial steps in the reconstruction of the cell lineage tree are (i)
proper segmentation of the cells (including nuclei and membranes) and (ii)
identification of mitotic cells. Classifier techniques are useful in both cases
to correct or support machine vision methods. First, errors in segmented cell
structures due to the limitations of image processing techniques can be systematically identified. Second, a priori knowledge about the relevant features
of cell division can be included via machine learning into the design of an artificial assistant. This would make it possible to better predict the fate of mitotic cells and
track their trajectories through successive division phases. Machine learning
integration constitutes a challenging task, partly because of the great variability of the segmented objects. However, a major benefit will be the automation
of hand tracking, visual inspection and cell counting that are extremely time
consuming for biologists. We expect that machine learning and computation
methods will soon rival the gold standard currently provided by the experts.
3.3 Biologists’ annotations during supervised training
For a high-level interpretation of the embryogenesis process (morphogenetic
fields, organs, macroscopic deformations, etc.), biologists need a general computer tool to visualize and annotate the segmented cells corresponding to a
subregion of the cell lineage tree. This tool has been implemented with open
source software (VTK, ITK, FLTK Kitware); its interface was customized to
the needs of the Embryomics project (Figure 1).
4 Active machine learning applied to cell classification
In our context, our data were represented by unlabeled cells that needed to be
classified. The goal of our classification was to automatically detect whether
a cell was correctly segmented and in which mitotic phase it was.
4.1 Parameters for cell categorization
Given a list of cells $\{x_i\}$, where each $x_i \in X \subset \mathbb{R}^d$ is a $d$-dimensional vector,
we denoted by $x_i = \{p_1, \ldots, p_d\}$ the characteristic list of $d$ parameters $p \in P$.
For each cell, parameters were calculated using both the segmented data and
the raw data, based on the membrane and nucleus. A typical parameter list
P was: volume, center of gravity, detected center, surface area, normalized
shape index, axis ratios of the enclosed ellipse, enclosing rectangular box, voxel
intensity statistics (max, min, sum, mean, standard deviation) within the box and
the cell boundaries. Parameters pertaining to image acquisition were also added,
such as cell age, microscope zoom, laser intensity, and the size and position of the imaged
volume resulting from the multishot strategy. We compared these parameters at different
scales, globally in the embryo and locally among neighboring cells. The complete characteristic space P that we used had dimension d = 42.
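A few of the parameters listed above can be computed directly from the voxels of one segmented cell, as in the following sketch. It assumes each voxel is given as an `(x, y, z, intensity)` tuple and takes the center of gravity as the unweighted mean of the voxel coordinates; the function name and dictionary keys are illustrative, and only a small subset of the 42 parameters is covered.

```python
from statistics import mean, pstdev
from typing import Dict, List, Tuple

Voxel = Tuple[int, int, int, float]  # (x, y, z, raw intensity), assumed layout


def cell_parameters(voxels: List[Voxel]) -> Dict[str, object]:
    """Compute a few shape and intensity parameters for one segmented cell."""
    xs, ys, zs, values = zip(*voxels)
    return {
        # Volume in voxels (multiply by the voxel size for physical units).
        "volume": len(voxels),
        # Unweighted centroid of the segmented region.
        "center_of_gravity": (mean(xs), mean(ys), mean(zs)),
        # Side lengths of the axis-aligned enclosing rectangular box.
        "bounding_box": tuple(max(c) - min(c) + 1 for c in (xs, ys, zs)),
        # Voxel intensity statistics within the cell boundaries.
        "intensity_max": max(values),
        "intensity_min": min(values),
        "intensity_sum": sum(values),
        "intensity_mean": mean(values),
        "intensity_std": pstdev(values),
    }


# Usage: a 2 x 2 x 2 block of voxels whose intensity depends on x only.
cube = [(x, y, z, 10.0 * (x + 1)) for x in (0, 1) for y in (0, 1) for z in (0, 1)]
params = cell_parameters(cube)
```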
4.2 Designing classifiers for cell segmentation and mitosis features
The task was to estimate a classification function $f : \mathbb{R}^d \to \{1, \ldots, N_c\}$, where
$N_c$ is the number of classes. For this, we developed two interrelated classifiers.
The first classifier ($f_S$) eliminated the cells that were incorrectly segmented,
assigning $f_S(x_i) = -1$ for an incorrect segmentation and $+1$ for a correct one.
The second classifier ($f_M$) classified the correctly segmented cells, and only
these, into 7 classes corresponding to the different phases of cell mitosis. The
following learning scenario was then applied:
1. Unsupervised Learning: A first ($f_S$) classification was performed with
a density clustering method. We used support vector machines (SVM)
[8], specifically One-Class SVM clustering [9], to put all the data of one
class into a hypersphere, which represented in this context the correctly
segmented data, $f_S(x_i) = +1$.
2. Supervised Learning: A human supervisor using prior biological and
physical knowledge readjusted the previous ($f_S$) classification manually to
fix possible errors. Then, she or he initialized ($f_M$) by pre-classifying some
of the correctly segmented cells in their corresponding mitotic phase. For
this, the supervisor specified a set of constraints $K = \{k_1, \ldots, k_{N_c}\}$, one
for each class, calculated on parameters P. Constraint $k_j$ represented the
boundaries of a specific partition in P space. A cell $x_i$ was thus put in
class $j$ (i.e., $f(x_i) = j$) if it satisfied constraint $k_j$.
Segmentation classification $f_S$: The supervisor looked at the volume
of the membrane and nucleus, their convexity and the intensity of the raw
data inside the element and at the boundaries.
Mitosis phase detection $f_M$: Biologists distinguish different mitotic
phases, which correspond in this context to the following observed characteristics:
• Prophase : Fragmentation of the nuclear envelope and condensation
of the chromosomes, along with changes in nucleus shape and increased
brightness. There is also a deformation of the cell membrane from
elliptical to spherical.
• Prometaphase : Changes in volume and convexity of nucleus shape.
• Metaphase : Planar arrangement of the chromosomes in the future cell
division plane.
• Anaphase : Separation of the two sets of chromosomes yielding two
nuclei within a single cell membrane.
• Telophase : Membrane constriction in the cell division plane.
• Cytokinesis : Separation of the daughter cells.
3. Semi-supervised Learning :
a) The quality of the reconstruction was increased by manually annotating specific cells near the class boundaries. This had the effect of moving
the boundaries toward their optimal position. These cells were represented by the support vectors of a Multi-Class SVM classifier [10].
In practice, they corresponded to the points closest to the hyperplanes
that separated the classes. After each reevaluation of a support vector
that was incorrectly classified, the new boundaries were computed.
b) The different classes were recomputed until the supervisor agreed with
the classification. The system eventually converged to the best margin
between classes.
4. Self-supervised Learning:
a) A new 3D image was added and categorized with the knowledge of
the previous Multi-Class SVM classification.
b) Classification was readjusted as in step 3.
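The cascade of the two classifiers and the constraint set K of step 2 can be sketched as follows. This is a minimal illustration that assumes each constraint is an axis-aligned box, that is, an interval per parameter; the parameter names, the numeric bounds and the helper names (`satisfies`, `classify`) are hypothetical and are not the supervisor's actual criteria. The SVM stages of steps 1 and 3 are not reproduced here.

```python
from typing import Dict, Optional, Tuple

Cell = Dict[str, float]                       # parameter name -> value
Constraint = Dict[str, Tuple[float, float]]   # parameter name -> (low, high)

# Hypothetical constraint for a correct segmentation (classifier f_S).
K_SEGMENTATION: Constraint = {"volume": (200.0, 5000.0)}

# Hypothetical constraints k_j for two mitotic phases (classifier f_M).
K_MITOSIS: Dict[str, Constraint] = {
    "prophase": {"nucleus_brightness": (0.7, 1.0), "membrane_sphericity": (0.8, 1.0)},
    "anaphase": {"nucleus_count": (2.0, 2.0)},
}


def satisfies(cell: Cell, constraint: Constraint) -> bool:
    """True if every constrained parameter lies inside its interval."""
    return all(low <= cell[name] <= high for name, (low, high) in constraint.items())


def f_S(cell: Cell) -> int:
    """+1 for a correctly segmented cell, -1 otherwise."""
    return +1 if satisfies(cell, K_SEGMENTATION) else -1


def f_M(cell: Cell) -> Optional[str]:
    """Class j whose constraint k_j the cell satisfies, or None if none matches."""
    for phase, constraint in K_MITOSIS.items():
        if satisfies(cell, constraint):
            return phase
    return None  # left for the supervisor to annotate in step 3


def classify(cell: Cell) -> Tuple[int, Optional[str]]:
    """Apply f_S first; only correctly segmented cells reach f_M."""
    if f_S(cell) == -1:
        return -1, None
    return +1, f_M(cell)
```

Cells that pass the segmentation check but match no constraint are returned unclassified, which corresponds to the cells the supervisor would be asked about in the semi-supervised step.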
5 Results and Discussion
The methodology described in this paper has been applied to an image data
set obtained from the early embryogenesis of the zebrafish. It included more
than 4,000 cells that were followed for 10 hours (corresponding to 150 time
steps). Our machine learning methods were able to classify segmented nuclei
and membranes and detect mitosis (Figure 1). Sorting out bad segmentations
allowed us to identify sources of errors and further improve the segmentation algorithms. Consequently, cell tracking and the reconstruction of the cell lineage
tree require human assistance for recognizing the specific features of dividing
cells.
Our machine learning strategy allowed us to complete the Embryomics
workflow and provide the first reliable lineage tree from a fully automated
procedure. This has never been achieved before for any living embryo. Our
methods will now be systematically used for clustering morphodynamical patterns at several scales and measuring differences between individuals. Our
ultimate goal is to design a European platform assisting biologists in the
characterization of emergent phenomena. Such a goal involves designing an
excellent artificial assistant that combines various types of learning methods:
• a bottom-up sensorimotor learning method capable of autonomously learning categories, prototypes and the morphodynamics of emergent phenomena through self-supervision
• top-down symbolic learning, including supervised learning of the biological
meaning of the emergent phenomena, and reinforcement learning for an
efficient exploration behavior.
Fig. 1. Computer interface for segmentation evaluation and mitosis annotation
References
1. S.G. Megason and S.E. Fraser, Digitizing life at the level of the cell: high-performance laser-scanning microscopy and image analysis for in toto imaging
of development, Mech. Dev. (2003), 1407-1420.
2. J.F. Bowring, J.M. Rehg, and M.J. Harrold, Active learning for automatic
classification of software behavior, International Symposium on Software Testing and Analysis, pages 195-205, 2004.
3. C.B. Kimmel, W.W. Ballard, S.R. Kimmel, B. Ullmann, and Th.F. Schilling,
Stages of embryonic development of the zebrafish, Dev. Dyn. (1995), 253-310.
4. P. Perona and J. Malik, Scale-space and edge detection using anisotropic diffusion, IEEE Trans. Pattern Anal. Mach. Intell., vol. 12, no. 7, 1990.
5. R.O. Duda and P.E. Hart, Use of the Hough transformation to detect lines and
curves in pictures, Commun. ACM, vol. 15, no. 1, pp. 11-15, 1972.
6. J. Sethian, Level set methods and fast marching methods: Evolving interfaces
in computational geometry, 1998.
7. B.K. Horn and B.G. Schunck, Determining optical flow, Cambridge, MA,
USA, Tech. Rep., 1980.
8. V.N. Vapnik, The nature of statistical learning theory, Springer-Verlag New York,
Inc., New York, NY, USA, 1995.
9. B. Schölkopf, J.C. Platt, J. Shawe-Taylor, A.J. Smola, and R.C. Williamson,
Estimating the support of a high-dimensional distribution, Neural Computation,
2001.
10. K. Bennett and A. Demiriz, Semi-supervised support vector machines, in Advances in Neural Information Processing Systems 11, 1998.