Learning Algorithms Applied to Cell Subpopulation Analysis in High Content Screening
Bohdan Soltys*, Yuriy Alexandrov, Denis Remezov, Marcin Swiatek, Louis Dagenais, Samantha Murphy and Ahmad Yekta
GE Healthcare, 500 Glenridge Ave, St. Catharines, Ontario, Canada L2S 3A1; Tel: 905.688.2040, E-mail: [email protected]
Abstract and Introduction
Application 1: Cell cycle analysis
The use of machine learning in image-based screening is
reported. This approach provides a novel means to deal
with complex image analysis problems. Cell heterogeneity
often confounds image analysis, even in established cell
lines. There is a need to increase data quality by classifying
cells into distinct subpopulations. Learning algorithms
classify each cell according to user-specified cell categories.
Once classified, measurements are reported for specific
subpopulations. In this report we demonstrate the use of
supervised learning algorithms. Here, the operator first
trains the program simply by providing typical images of
cells from each class. The algorithm represents these cells
in feature space using multiple intensity and morphology
descriptors. Next, in screening applications, the program
automatically classifies cells using their feature vector
outputs. Robust results are demonstrated by comparing
performance of the algorithm with human scoring of images
acquired using the IN Cell Analyzer 1000.
                 Human
Machine     G1      G2      M
G1          91.5%   13      3
G2          8       87%     3
M           0.5     0       94%
Table 2. Cell by cell comparison of human and machine
classification results. The confusion matrix shows that the
algorithm correctly identifies >85% of the cells in the training
data. Perfect agreement is not expected, since even two
human scorers will disagree on some cells.
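A cell by cell comparison of this kind can be tabulated as a confusion matrix. The sketch below is a minimal illustration and not the poster's software: the function name `confusion_percent` and the toy label lists are hypothetical, and it assumes that each human class (column) is normalized to 100%.

```python
from collections import Counter

CLASSES = ["G1", "G2", "M"]

def confusion_percent(human, machine, classes=CLASSES):
    """Return {machine_class: {human_class: percent}}.

    Each human class (column) is normalized to 100%, which is an
    assumption about how the percentages are defined.
    """
    pairs = Counter(zip(machine, human))   # keys are (machine, human)
    human_totals = Counter(human)
    table = {}
    for m in classes:
        table[m] = {}
        for h in classes:
            total = human_totals[h]
            table[m][h] = 100.0 * pairs[(m, h)] / total if total else 0.0
    return table

# Toy labels (hypothetical, not data from the poster).
human   = ["G1", "G1", "G1", "G1", "G2", "G2", "M", "M"]
machine = ["G1", "G1", "G1", "G2", "G2", "G1", "M", "M"]

table = confusion_percent(human, machine)
for m in CLASSES:
    print(m, [round(table[m][h], 1) for h in CLASSES])
```

The diagonal entries are the per-class agreement rates; off-diagonal entries show which classes are confused with each other.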
Experimental Methods
Cells were plated and grown in Greiner 96-well plastic
microplates (#655090, Greiner BioOne Inc.). The following
cell lines were used:
• U2OS cells expressing a cyclin B1–GFP reporter
construct (Amersham Biosciences 2580-10-50) were used
for cell cycle analysis. The cell line and experimental
procedures have been previously described (1-2).
Colchicine was used as a model drug to cause
accumulation of cells in mitosis.
• GFP-PLCδ-PH Domain Assay (Amersham Biosciences
25-8007-26; CHO-derived cell line).
• AKT1-EGFP Assay (Amersham Biosciences, 25-801017; CHO-derived cell line). AKT1 is also known as protein
kinase Bα (PKBα). IGF-1 was used as a reference agonist
to stimulate translocation to membrane ruffles.
Nuclei were labeled with Hoechst 33258, then cells were
washed with physiological saline and imaged live. Images
were acquired using the IN Cell Analyzer 1000 at 1392 x
1040 pixels, 12-bit precision at low magnification (10X
objective, field of view ~1 mm²).
Results
Fig 2. Step 2: building the classifier. The operator determines
which and how many descriptors (here, 6) are best for
identifying each population in the training data. The software
automatically selects the best descriptors from the candidates.
Three algorithmic approaches are available for identifying the
classes: nearest neighbor; neural network; quadratic
discriminant. We find that the best way of selecting one is by
experimentation with the data at hand.
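Selecting a method by experimentation can be sketched as a leave-one-out comparison on the annotated cells. The example below is illustrative only: the feature vectors are invented, the function names are hypothetical, and the quadratic discriminant is a simplified variant that assumes uncorrelated descriptors (diagonal covariance per class). A neural network is omitted for brevity.

```python
import math

def nearest_neighbor(train, x):
    """Label of the training cell closest to x (squared Euclidean distance)."""
    closest = min(train, key=lambda item: sum((a - b) ** 2
                                              for a, b in zip(item[0], x)))
    return closest[1]

def quadratic_discriminant(train, x):
    """Simplified quadratic discriminant: one Gaussian per class with a
    diagonal covariance (descriptors assumed uncorrelated)."""
    by_class = {}
    for features, label in train:
        by_class.setdefault(label, []).append(features)
    best_label, best_score = None, -math.inf
    for label, rows in by_class.items():
        n = len(rows)
        score = math.log(n / len(train))  # log class prior
        for j, value in enumerate(x):
            column = [row[j] for row in rows]
            mean = sum(column) / n
            # Small constant keeps the variance strictly positive.
            var = sum((v - mean) ** 2 for v in column) / n + 1e-6
            score -= 0.5 * (math.log(var) + (value - mean) ** 2 / var)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

def leave_one_out_error(classify, data):
    """Fraction of cells misclassified when each is held out in turn."""
    errors = 0
    for i, (features, label) in enumerate(data):
        rest = data[:i] + data[i + 1:]
        if classify(rest, features) != label:
            errors += 1
    return errors / len(data)

# Hypothetical two-descriptor feature vectors (not data from the poster).
cells = [
    ((0.9, 0.30), "G1"), ((1.0, 0.35), "G1"), ((1.1, 0.32), "G1"),
    ((2.0, 0.20), "G2"), ((2.2, 0.22), "G2"), ((2.1, 0.18), "G2"),
    ((3.0, 0.80), "M"),  ((3.2, 0.85), "M"),  ((3.1, 0.78), "M"),
]

for name, method in [("nearest neighbor", nearest_neighbor),
                     ("quadratic discriminant", quadratic_discriminant)]:
    print(name, leave_one_out_error(method, cells))
```

On real annotated data the error rates would differ between methods, and the method with the lowest held-out error would be the natural choice.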
Next, canonical variate analysis (CVA) is used to represent the
multi-dimensional data, as shown in the lower part of the figure.
The goal is to achieve maximum separation of the classes, as
exemplified here by the red and green data points (each cell is
represented by a data point).
The classifier training is now complete. The next step couples
the classifier to a desired analysis module for image
quantification.
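The idea behind the class-separating projection can be illustrated with a small sketch. In general, CVA finds several canonical variates through an eigen-decomposition. The code below covers only the two-class, two-descriptor special case, where the single canonical variate is Fisher's linear discriminant, w = Sw^-1 (mean_a - mean_b). All names and data points are hypothetical.

```python
def mean_vector(rows):
    """Mean of a list of 2-D points."""
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(2)]

def scatter(rows, mean):
    """2x2 scatter matrix: sum of (x - mean)(x - mean)^T over the points."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for r in rows:
        d = [r[0] - mean[0], r[1] - mean[1]]
        for i in range(2):
            for j in range(2):
                s[i][j] += d[i] * d[j]
    return s

def canonical_axis(class_a, class_b):
    """Direction that maximizes between-class separation relative to
    within-class spread for two classes: w = Sw^-1 (mean_a - mean_b)."""
    ma, mb = mean_vector(class_a), mean_vector(class_b)
    sa, sb = scatter(class_a, ma), scatter(class_b, mb)
    sw = [[sa[i][j] + sb[i][j] for j in range(2)] for i in range(2)]
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]
    inv = [[sw[1][1] / det, -sw[0][1] / det],
           [-sw[1][0] / det, sw[0][0] / det]]
    diff = [ma[0] - mb[0], ma[1] - mb[1]]
    return [inv[0][0] * diff[0] + inv[0][1] * diff[1],
            inv[1][0] * diff[0] + inv[1][1] * diff[1]]

def project(rows, w):
    """Score of each point along the canonical axis."""
    return [r[0] * w[0] + r[1] * w[1] for r in rows]

# Hypothetical descriptor pairs for two classes ("red" and "green").
red = [(1.0, 2.0), (1.2, 2.4), (0.8, 1.9), (1.1, 2.2)]
green = [(2.0, 2.1), (2.3, 2.5), (1.9, 1.8), (2.2, 2.3)]

w = canonical_axis(red, green)
red_scores, green_scores = project(red, w), project(green, w)
print("classes separated:", max(green_scores) < min(red_scores))
```

Each cell becomes a single score along the axis, so class separation can be inspected on a simple plot, as in the lower part of the figure.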
Fig 4. Cell classification. U2OS cells expressing GFP-cyclin
B1: cell body (A and C) and nuclei (B and D). Cells are identified by
the software as being in G1/S (green), G2 (yellow) or
mitosis (red). Control cells (A and B) are compared with
colchicine-treated cells (C and D).
About 1200 cells were human annotated and used to train the
system, as described in Figs. 1-2. Classification is possible
because GFP signal expression levels differ between cell cycle
phases, and because of changes in cell and nuclear DNA
morphology:
• G1/S: low nuclear and cytoplasmic expression of cyclin B1
• G2: low nuclear and high cytoplasmic expression
• Mitosis (M): high ‘nuclear’ expression, cell is rounded up,
DNA condensation
In this example, the number of descriptors used to build the
classifier is 6. Examples of descriptors are given in Table 1.
• Cell body intensity/background intensity
• Nucleus intensity/cell body intensity (both channels)
• Nucleus size/cell size
• Nuclear and cell morphology
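Ratio descriptors like these can be assembled into a per-cell feature vector. The sketch below is an assumption-laden illustration: the `CellMeasurement` field names and the numeric values are hypothetical, only one channel is shown, and morphology descriptors are left out.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class CellMeasurement:
    """Raw per-cell measurements (hypothetical field names)."""
    cell_intensity: float        # mean cell body intensity
    nucleus_intensity: float     # mean nuclear intensity
    background_intensity: float  # mean local background intensity
    cell_area: float             # cell area in pixels
    nucleus_area: float          # nucleus area in pixels

def descriptors(m: CellMeasurement) -> List[float]:
    """Build a feature vector of ratio descriptors for one cell."""
    return [
        m.cell_intensity / m.background_intensity,  # cell body / background
        m.nucleus_intensity / m.cell_intensity,     # nucleus / cell body
        m.nucleus_area / m.cell_area,               # nucleus size / cell size
    ]

# Hypothetical measurements for a single cell.
cell = CellMeasurement(cell_intensity=400.0, nucleus_intensity=200.0,
                       background_intensity=100.0, cell_area=900.0,
                       nucleus_area=300.0)
features = descriptors(cell)
print(features)
```

Ratios of this kind are less sensitive to overall brightness differences between images than raw intensities, which is one plausible reason to prefer them.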
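[Table 3 values as extracted; the layout of the three-class (G1, G2, M) matrices for the two datasets is not recoverable]
Nearest Neighbor: G1 G2 G1 80 20 G2 5 92 M 2 3 M 0 3 95; G1 92 24 3 G2 7 74 3 M 0 1 94
Neural Network: 74 19 21 2 5 76 85 22 9 14 69 5 1 10 86 24 76 3
Quadratic Discriminant: 1 6 93 1 9 90 4 68 28 5 70 25 94 2 4 94 4 2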
Table 3. Comparison of different classification methods in
two independent image datasets.
When applied to the current test data, the nearest neighbor
method produces lower error rates than the other two.
Experimentation allows for selection of the best approach.
The two datasets in Table 3 differ in signal to noise ratios. Best
results are obtained when image qualities of training and test
data are similar.
            PLC     AKT     annexin
PLC         87%     10      3
AKT         19      72%     9
annexin     4       6       90%
Table 2. Comparison of known and machine classification
of subcellular patterns.
The algorithm was trained and tested using pure cell
populations (cf. Fig. 6). The software correctly identified >70%
of the indicated patterns in each cell type.
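The ">70%" statement can be checked directly against the reported percentages. In the sketch below the numbers are the ones given in the table; the assignment of rows to known cell types and columns to machine-assigned patterns (in the order PLC, AKT, annexin) is an assumption.

```python
# Reported percentages; rows are assumed to be the known (pure) cell
# populations and columns the pattern assigned by the algorithm.
reported = {
    "PLC":     {"PLC": 87, "AKT": 10, "annexin": 3},
    "AKT":     {"PLC": 19, "AKT": 72, "annexin": 9},
    "annexin": {"PLC": 4,  "AKT": 6,  "annexin": 90},
}

# Fraction of each known population assigned to its own pattern.
correct = {name: row[name] for name, row in reported.items()}
lowest = min(correct, key=correct.get)

print(correct)
print("lowest agreement:", lowest, correct[lowest])
```

Under this reading, AKT is the pattern most often confused with another class, mainly with PLC.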
Outlook
The described learning algorithms are flexible and are expected
to be applicable to a wide variety of applications, enabling
independent identification and analysis for each subset of cells
in a population, thereby increasing both information content and
quality.
Suggested additional applications of learning algorithms:
• Mixed cell cultures containing more than one cell type or
multiple reporters
• Transfected cells where only a subset of cells have the
correct expression level
• Responder/nonresponder classification
• Intrinsic cell heterogeneity characterization
• Search for inherent non-evident links between cell classes
and response profiles
• Anomaly detection and genetic screening.
Conclusions
• Novel learning algorithms have been developed for
use in high content screening
• These algorithms identify cell states, and provide
the means to increase data quality where cells have
heterogeneous cell states
• Effective discrimination of G1, G2 and mitotic cells
is demonstrated
• Preliminary work shows that pattern recognition of
subcellular organelles or particles is feasible
Application 2: Organelle/object identification
Table 1. Morphological and intensity descriptors used by
the cell cycle analysis software.
Algorithmic Methods
Training the algorithm involves 3 steps
[Diagram labels: descriptor; feature vectors; classifier; untrained analysis module; trained analysis module; cell cycle analysis training: G1/S, G2, mitosis]
Fig 1. Step 1: annotation. The operator provides the
algorithm with example images of cells in each class. Point
and click. This human annotation operation creates the
training dataset.
Fig 3. Step 3: import the classifier into an assay analysis
protocol. By this approach the analysis module is converted
into a trained module. Multiple classifiers can be used
simultaneously in an analysis module. A classifier can in
principle be used in any module with learning capability.
GE and GE Monogram are trademarks of General Electric Company. Amersham Biosciences UK Limited, a General Electric company, going to market as GE Healthcare
© 2004 General Electric Company - All rights reserved. Amersham Biosciences UK Limited Amersham Place Little Chalfont Buckinghamshire England U.K. HP7 9NA. Amersham Biosciences
AB SE-751 84 Uppsala Sweden. Amersham Biosciences Corp 800 Centennial Avenue PO Box 1327 Piscataway NJ 08855 USA. Amersham Biosciences Europe GmbH Munzinger Strasse 9
D-79111 Freiburg
Figure 5. Software classification of cell subpopulations.
Colchicine treatment leads to an expected accumulation of cells
in mitosis (Fig. 5). Determining how accurate the software is,
however, requires cell by cell comparisons with human scoring,
as shown in Tables 2 and 3.
Fig 6. Classification of subcellular objects. The algorithm
was trained to identify 3 types of subcellular patterns: (A) PLC-GFP
cells showing cell periphery labeling, (B) AKT-GFP
showing membrane ruffle labeling and (C) annexin, showing
punctate/vesicular labeling. The pattern assignments made by
the algorithm are shown as colored outlines: red=PLC,
yellow=AKT, green=annexin. The algorithm misclassified only a
few cells.
All goods and services are sold subject to terms and conditions of sale of the company within the Amersham group which supplies them. A copy of these terms and conditions is available on
request. General Electric Company reserves the right, subject to any regulatory approval if required, to make changes in specifications and features shown herein, or discontinue the product
described at any time without notice or obligation. Contact your GE Representative for the most current information.
This poster was presented at the Conference of the Society of Biomolecular Screening, 11th to 15th September 2004, Orlando, Florida.
* To whom all correspondence should be addressed.
References
(1) Tinkler, H., Thomas, N., Goodyer, I., Zaltsman, A., Arini, N.,
Game, S. 2003. Multi-parameter analysis of cell cycle related events
using a GFP sensor (Amersham Scientific Poster 147). Conference of
the Society of Biomolecular Screening, Eugene, Oregon.
(2) Goodyer, I., Jones, A., Thomas, N., Tinkler, H., Zaltsman, A., Pines,
J., Almuina, N.M., Game, S. 2002. Cell Cycle Position Reporting
(Amersham Scientific Poster 121). Conference of the Society of
Biomolecular Screening, The Hague, Netherlands.