CYTO 2017 Image analysis challenge

Background information
Short description for the CYTO program:
In a time when vast amounts of bioimaging data are produced in labs around the globe every day,
effectively extracting salient information from this growing resource is paramount to understanding
complex biological questions. The CYTO 2017 Image analysis challenge proposes four tasks in which the
aim is to classify fluorescence microscopy images from the Human Protein Atlas database
(www.proteinatlas.org) based on subcellular protein localization and to present your findings during the
final platform session at CYTO 2017.
Description:
In a time when vast amounts of bioimaging data are produced in labs around the globe every day,
effectively extracting salient information from this growing resource is paramount to understanding
complex biological questions. In this challenge, you have the opportunity to attempt a series of automated
classification tasks for fluorescence microscopy data and present your findings during the final platform
session at CYTO 2017.
Prizes:
In addition to the satisfaction of besting the challenges, prizes for winners may include complimentary
registration to CYTO 2018 in Prague, complimentary ISAC membership and the possibility of participating in a
paper published on the challenge in Cytometry Part A.
The Dataset:
The image data provided in this challenge were generated by the Cell Atlas (Thul et al. “A subcellular map
of the human proteome”, Science, in press.), part of the Human Protein Atlas database
www.proteinatlas.org (Uhlen et al. 2010). The images visualize immunostaining of human proteins, and
the aim of this challenge is to recognize the patterns of protein subcellular distribution to major
organelles and fine substructures. All images were acquired in a standardized manner on Leica SP5
confocal microscopes with a 63x/1.2 NA oil objective, at the Nyquist sampling rate, in 4 fluorescence
channels. Each field of view comprises 4 images: 3 reference channels, namely DAPI for the
nucleus (“blue”) and antibody-based staining of microtubules (“red”) and endoplasmic reticulum (“yellow”),
plus the protein of interest (“green”). The reference channels can be used to aid you in predicting
localizations of the protein of interest.
Download the data from: http://www.proteinatlas.org/CYTO_challenge2017/
Inputs:
Images:
The input images for all sub-challenges will be in .tif format, with a separate image for each of the channels
from a given field of view. A brief description of the dataset contained in each sub-challenge:
Challenge 1: 1802 fields of view containing multilabel data for 2 protein localizations
Challenge 2: 20,000 fields of view containing multilabel data for 13 protein localizations
Challenge 3: 870 fields of view containing multilabel data for an additional 3 classes to be combined with
the dataset from Challenge 2
Challenge 4: There are no new images in Challenge 4. Solution keys for this challenge reveal patterns
that were merged in previous challenges and should replace the solution keys from those challenges.
Bonus challenge: There are no new images for this challenge. Solution keys for this challenge specify a
binary value indicating whether the field of view has been labeled as “variable”.
Solution keys:
Solution keys for the data have been manually generated by gamers in EVE Online via Project Discovery
and curated/augmented by the Human Protein Atlas for quality. Each solution key will contain a list of
image filenames and strings encoding the set of locations of the protein of interest using keywords
provided in the attached “keywords.txt” file. The solution key lists the types of localization present in
each image. It is important to note that a given localization may not be present in every cell of the
same image.
e.g.
1001_A1_1, Mitochondria,Nucleoli
1001_A2_1, Nucleoli
…
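A minimal sketch of reading a solution key in the layout shown in the example above, assuming one comma-separated record per line with the file ID first. The function names `parse_solution_key` and `to_indicator` are illustrative only and are not part of the challenge materials; the real key files may carry a header or other details not covered here.

```python
def parse_solution_key(lines):
    """Map each image ID to the set of localization keywords on its line."""
    labels = {}
    for line in lines:
        fields = [field.strip() for field in line.split(",")]
        if not fields[0]:
            continue  # skip blank lines
        image_id, keywords = fields[0], fields[1:]
        labels[image_id] = {k for k in keywords if k}
    return labels


def to_indicator(labels, classes):
    """Turn label sets into binary rows, one column per class in `classes`."""
    return {
        image_id: [1 if c in present else 0 for c in classes]
        for image_id, present in labels.items()
    }


# Two records in the same layout as the example in the text.
example = ["1001_A1_1, Mitochondria,Nucleoli", "1001_A2_1, Nucleoli"]
key = parse_solution_key(example)
rows = to_indicator(key, ["Mitochondria", "Nucleoli"])
```

The binary rows are one convenient target encoding for a multi-label learner, since an image can switch on any number of columns at once.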
Initial assessment:
Self-assessment can be performed for each sub-challenge using cross validation for the average per-class
F1-score.
$F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$
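For self-assessment, the average per-class F1-score can be sketched as below. This is an illustration under stated assumptions, not the organisers' scoring code: label sets are assumed to be keyed by image ID, and a class with no true positives is assumed to score 0.0, which the challenge text does not specify.

```python
def f1(tp, fp, fn):
    """F1 from counts; taken as 0.0 here when there are no true positives."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def mean_per_class_f1(truth, predicted, classes):
    """Average the F1-score over classes for multi-label predictions.

    `truth` and `predicted` map image IDs to sets of class names.
    """
    scores = []
    for c in classes:
        tp = sum(1 for i in truth
                 if c in truth[i] and c in predicted.get(i, set()))
        fp = sum(1 for i in predicted
                 if c in predicted[i] and c not in truth.get(i, set()))
        fn = sum(1 for i in truth
                 if c in truth[i] and c not in predicted.get(i, set()))
        scores.append(f1(tp, fp, fn))
    return sum(scores) / len(scores)


truth = {"a": {"Mitochondria", "Nucleoli"}, "b": {"Nucleoli"}}
predicted = {"a": {"Mitochondria"}, "b": {"Nucleoli", "Mitochondria"}}
score = mean_per_class_f1(truth, predicted, ["Mitochondria", "Nucleoli"])
```

In this toy case each class has one true positive and one error, so both per-class scores, and therefore the average, come to 2/3.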
Final assessment:
Held out datasets without solutions will be provided via the challenge website 2 weeks before the
Cytometry conference. Accuracy will be judged using the average per-class F1-score.
Solutions to these sets can be submitted online at [URL] for automated scoring in the following format:
<File_ID>,Class 1, Class 2, Class 3
Final assessment metrics are briefly outlined in each sub-challenge below.
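The submission layout above could be produced with a helper along these lines. This is only a sketch: `format_submission` is a hypothetical name, and whether the scorer expects a space after each comma or a particular row order is not stated, so this version omits the spaces and sorts by file ID.

```python
def format_submission(predictions):
    """Render one 'file_id,class,class,...' line per image, sorted by ID.

    `predictions` maps file IDs to sets of predicted class names.
    """
    lines = []
    for file_id in sorted(predictions):
        classes = sorted(predictions[file_id])
        lines.append(",".join([file_id] + classes))
    return "\n".join(lines) + "\n" if lines else ""


text = format_submission({
    "1001_A2_1": {"Nucleoli"},
    "1001_A1_1": {"Nucleoli", "Mitochondria"},
})
```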
Presentation of results at CYTO 2017 (June 14 15:30-16:00):
The leader board and presentation submissions will close 24 hours prior to the presentations (June 13 15:30
EST). Top teams present at the CYTO conference should prepare a 5-minute presentation of their
approach and email it to [email protected]. Teams not present that still wish to
present results should submit a 2-slide presentation to [email protected]. Teams will be
notified by 17:00 on June 13 if they are presenting.
CYTO2017 Image Analysis Challenge
This challenge is split into 5 sub-challenges. Participants may choose to complete any or all of these to the
best of their abilities. However, the sub-challenges are generally meant to build on each other and increase in
difficulty, so it is suggested that participants attempt them in order.
1. Getting started
Using the mito_nui.tar dataset, create a model capable of distinguishing the three classes within the
dataset (Figure 1). The solution key for this dataset is called mito_nui_solution_key.zip.
TIP: Creating a learner capable of recognizing multi-label data will be key for future sub-challenges.
Figure 1. Example of protein localizing to mitochondria (a), nucleoli (b) and a protein localizing to both
mitochondria and nucleoli (c).
Assessment - This sub-challenge will be assessed using the F1-score for a held-out set of data containing
the same classes present in this sub-challenge.
2. Adding more complexity
Using the major13.tar dataset, create a model capable of distinguishing each of the 13 “major” organelles
and their mixtures. This dataset contains 13 labels, where each image may have any number of labels
(1-13). The solution key for this dataset is called major13_solution_key.zip.
TIP: Some classes may be much less common than others, so class balancing may be necessary to avoid
over-training certain classes.
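One common way to balance classes is to weight each class inversely to how often it occurs. The sketch below is one option among several (resampling is another) and is not prescribed by the challenge; `inverse_frequency_weights` is a hypothetical helper, and the label sets are assumed to be keyed by image ID.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight each class by n_images / (n_classes * n_images_with_class).

    `labels` maps image IDs to sets of class names.
    """
    counts = Counter(c for present in labels.values() for c in present)
    n_images = len(labels)
    n_classes = len(counts)
    return {c: n_images / (n_classes * n) for c, n in counts.items()}


# Toy data: Mitochondria appears in all 4 images, Nucleoli in only 1.
labels = {
    "img1": {"Mitochondria"},
    "img2": {"Mitochondria"},
    "img3": {"Mitochondria", "Nucleoli"},
    "img4": {"Mitochondria"},
}
weights = inverse_frequency_weights(labels)
```

Here the less common class receives four times the weight of the common one; such weights can then scale the per-class loss or guide sampling.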
Figure 2. Cartoon representation of the 13 major organelles present in major13.tar. Each of these
localizations may be present individually or in combination with others.
Assessment - This sub-challenge will be assessed using the F1-score for a held-out set of data containing
the same classes present in this sub-challenge.
3. Rare events
Often what is most interesting is what’s unusual. As you discovered in the previous sub-challenge, some
classes are far rarer than others. Adding the rare_events.tar dataset to the major13.tar dataset,
design a classifier capable of accurately recovering rare phenotypes. The solution key for this dataset is
called rare_events_solution_key.zip.
TIP: How do you represent rare phenotypes? Is trusting one guess beneficial or is a minimum number of
instances required? This might not be the same for every class.
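One way to let the decision rule differ per class is to pick a separate confidence cut-off for each class on validation data. The sketch below assumes your classifier outputs a score per class and image; `best_threshold` and its candidate grid are illustrative choices, not part of the challenge.

```python
def best_threshold(scores, truth,
                   candidates=(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9)):
    """Return the candidate cut-off with the highest F1 for one class.

    `scores` are classifier confidences and `truth` the matching 0/1 labels.
    Ties keep the lowest candidate.
    """
    best, best_f1 = candidates[0], -1.0
    for t in candidates:
        tp = sum(1 for s, y in zip(scores, truth) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, truth) if s >= t and y == 0)
        fn = sum(1 for s, y in zip(scores, truth) if s < t and y == 1)
        # Equivalent to 2 * precision * recall / (precision + recall).
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        if f1 > best_f1:
            best, best_f1 = t, f1
    return best


# A class whose positives include a low-confidence case (0.35).
threshold = best_threshold([0.05, 0.15, 0.35, 0.9], [0, 0, 1, 1])
```

A rare class may end up with a lower cut-off than a common one. Thresholds tuned on very few positive examples can be unstable, which is the trade-off the tip above asks you to weigh.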
Figure 3. The rare classes contained in rare_events.tar are cytokinetic bridge, aggresomes, and focal
adhesions (a-c). Each class may be present in combination with other classes or individually. Note that
when a pattern is present, not every cell in the image must contain the pattern. Particularly, patterns
unique to a transient temporal phase such as cytokinetic bridge may be both rare in the population and
uncommon in the image.
Assessment - This sub-challenge will be assessed using the F1-score for rare events.
4. Class discovery
How many classes are there really? So far there has been a common class for all nucleoli localizations,
referred to as ‘nucleoli’, but in reality this localization could be subdivided into more detailed
localizations such as ‘nucleoli rim’ and ‘nucleoli fibrillar center’, increasing the number of classes. Using
the major13.tar dataset and the class_discovery_solution_key.zip, develop a model capable of “discovering” such
distinct sub-populations.
TIP: It may be possible to find even more sub-classes than presented in the
class_discovery_solution_key.zip.
Assessment - This sub-challenge will be assessed using the F1-score for a set of held-out “hidden” classes.
BONUS ROUND: Not all mixtures are created equal
Multilabel data is central to this challenge; however, sometimes only a fraction of the cells show a pattern
or set of patterns. In other words, not all mixtures, e.g. “Mitochondria,Nucleoli”, are the same. Some fields of view
may show a Mitochondria pattern together with a Nucleoli pattern in every cell, while others may show
only Mitochondria in some cells and only Nucleoli in others.
Identifying these cell-to-cell variations can be key in understanding dynamic protein behavior
such as cell-cycle, micro-environment or drug effects. Depending on your architecture, these cases may be
very difficult to distinguish from each other. In this challenge, we are interested in how you handle these
cases. Can your algorithm distinguish them and, if so, how? The ccv_data_solution_key.zip will help you to
tune your model by providing binary information about which fields of view in the major13.tar dataset
have cells with varying protein localizations. However, we do not currently have per-cell annotations
of these fields of view, so this sub-challenge will not have a leader board.
We will, however, consider solution descriptions sent to us at [email protected] and
choose interesting implementations to highlight during the final platform session.
Here are some things to consider when you attempt this challenge:
1. Is the classifier capable of finding single-cell variations and distinguishing them from cases
where mixtures are present within cells?
2. Can you estimate which cells are showing which patterns? What is the fraction of cells
showing each pattern, or maybe this is better described as fraction of fluorescence?
TIP: Rerun the solution for this challenge on previous challenges. How many variable cases do you find?
Does pruning these from your training set improve performance?
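The two questions above can be made concrete if per-cell predictions are available. The sketch below assumes you have already segmented the cells and predicted a label set for each one; the challenge provides no per-cell annotations, so both steps are your own. The names `pattern_fractions` and `is_variable` and the `tolerance` parameter are hypothetical.

```python
def pattern_fractions(cell_labels):
    """Fraction of cells in one field of view that show each pattern.

    `cell_labels` is a list with one set of predicted patterns per cell.
    """
    patterns = set().union(*cell_labels)
    n = len(cell_labels)
    return {p: sum(1 for cell in cell_labels if p in cell) / n
            for p in patterns}


def is_variable(cell_labels, tolerance=0.0):
    """True if any pattern appears in some cells but not (nearly) all."""
    fractions = pattern_fractions(cell_labels)
    return any(f < 1.0 - tolerance for f in fractions.values())


# Every cell shows both patterns: a within-cell mixture.
mixed = [{"Mitochondria", "Nucleoli"}] * 3
# Each cell shows only one of the patterns: cell-to-cell variation.
varying = [{"Mitochondria"}, {"Nucleoli"}, {"Mitochondria"}]
fractions = pattern_fractions(varying)
```

Both example fields would carry the same image-level label, “Mitochondria,Nucleoli”, yet only the second is flagged as variable. A non-zero `tolerance` allows for a few misclassified cells.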
Figure 4. Examples of variable protein expression. Here Nucleoli are present in one cell but not others. In
another case, PSMC6 is shown to translocate between the cytosol and the nucleus.
Assessment - This sub-challenge will not be formally assessed; however, we will accept submissions to
[email protected] for consideration in the final platform session presentations.