Inf Syst Front (2006) 8: 21–27
DOI 10.1007/s10796-005-6100-x
Interactive gene clustering—A case study of breast cancer
microarray data
Alicja Gruźdź · Aleksandra Ihnatowicz ·
Dominik Ślȩzak
© Springer Science + Business Media, Inc. 2006
Abstract We present a new approach to clustering and visualization of the DNA microarray gene expression data. We
utilize the self-organizing map (SOM) framework for handling (dis)similarities between genes in terms of their expression characteristics. We rely on appropriately defined
distances between ranked genes-attributes, also capable of
handling missing values. As a case study, we consider breast
cancer data and the gene ESR1, whose expression alterations, appearing for many of the tumor subtypes, have
been already observed to be correlated with some other
significant genes. Preliminary results support the applicability of our approach, although further development is needed. They suggest that it may be very effective when used by domain experts. The algorithmic toolkit is enriched with a GUI enabling users to interactively support the SOM optimization process. This is achieved by drag&drop techniques that allow clusters to be modified according to expert knowledge or intuition.
Keywords DNA microarrays · Breast cancer · Self-organizing maps · Missing values · Entropy
Introduction
The DNA microarray technology provides enormous quantities of biological information about genetically conditioned susceptibility to diseases. The acquired data sets
refer to genes via their expression levels. Comparison of the gene expressions in various conditions and for various organisms is very helpful while formulating biomedical hypotheses.
A. Gruźdź · A. Ihnatowicz · D. Ślȩzak, Department of Computer Science, University of Regina, Regina, SK, S4S 0A2 Canada
One of the challenges while analyzing DNA microarrays
is in defining functional (dis)similarities between genes. It is
complex from the genetic point of view, with no commonly
approved methodology existing. The appropriate choice of
the gene similarity/distance functions is particularly important for clustering algorithms, based on, e.g., hierarchical
(Anders, Botstein, and Brown, 1998) or Bayesian approaches
(Lawrence et al., 2003), as well as self-organizing maps
(SOM) (Altman et al., 2001).
Clustering genes is interesting with regard to revealing
their functional groups. It may lead to the discovery of specific
genetic alterations differing among the cancer types or even
their stages of progression. From this point of view, clusters
sharing somehow correlated expression characteristics may
be more convincing than single genes.
While grouping genes it is important to provide the means for interactivity. One can then handle dependencies not easily derivable from data. A number of data-based learning methods providing graphical models can be applied in that respect. We focus on the above-mentioned SOM method, as it expresses (dis)similarities in a very direct way (Altman et al., 2001). We develop the Gene-Organizing Map (GenOM) system for mapping relationships between genes-attributes derived from DNA microarray data sets. We implement a graphical interface enabling the expert to work interactively with the SOM learning process.
The similarity/distance functions to be mapped to the
SOM grid can be calculated in various ways, beginning
from the Euclidean-style comparison, through statistical
correlations (Eriksen, Hornquist, and Sneppen, 2004), up to
the information theoretic measures (Friedman et al., 2000).
Following our previous research (Gruźdź, Ihnatowicz, and Ślȩzak, 2005a, 2005b),
we focus on functions interpreting the gene expression
characteristics as ranked attributes. We adapt the Spearman
correlation and entropy-based distances to handle missing
values in a non-invasive way, which is extremely important
for this type of data.
As the case study, we analyze the breast cancer data gathered in the Stanford Microarray Database (Demeter, Deng,
and Geisler, 2003; Akslen et al., 2000). Susceptibility to cancer can be inherited but also caused by chemicals, radiation,
diet, or smoking. Each of these factors can cause gene mutations. Particularly important expression level changes are present in the ESR1 gene, which encodes the estrogen receptor (ER), specifically ER-alpha. Data suggest that ER status may have an interactive effect on breast cancer survival
(Boyapati, Shu, and Ruan, 2005). Not only ESR1 but also
other genes or proteins with correlated expression profiles
have been observed as significant for many tumors. Molecular markers related to ER status and grade have been used
to classify tumors for a long time.
The paper is organized as follows: First, we introduce
the interactive SOM-based framework for handling the gene
similarities. Then we consider various distance/correlation
functions for measuring the gene similarities, taking into account the problem of missing values occurring in microarrays. Finally, we characterize tumor-related data and discuss
experimental results.
Self-organizing maps
Thanks to dimension reduction capabilities, SOMs are helpful in understanding complex dependencies between genes
(Castrén et al., 2001; Kaski, 2001). When well designed, they become an intuitive and flexible research tool, giving wide possibilities for the experts’ interactions, which are necessary to make the microarray analysis really effective.
Every SOM forms a nonlinear mapping of a high-dimensional data manifold into a regular, low-dimensional (usually 2D) grid. Let us consider a gene expression data system $\mathbb{A} = (U, A)$, where $U$ denotes the set of considered objects (experiments) and $A$ denotes the set of genes, with expressions defined by functions $a : U \to \mathbb{R}$, for any $a \in A$. For any distance function $d_U : A \times A \to [0, +\infty)$ based on the expression levels of particular objects in $U$, we consider the following optimization problem: For every $g \in A$, find the 2D-grid coordinates $(x_g, y_g)$ such that the grid distances

$$d_{2D}(a, b) = \sqrt{(x_a - x_b)^2 + (y_a - y_b)^2}$$

reflect optimally the actual distances $d_U(a, b)$, for all possible pairs of genes $a, b \in A$.

Fig. 1 GenOM interface. The left window displays the grid map of clusters. The right part displays details concerning the highlighted clusters. While selecting a given map region, the Gene Cards database from the Weizmann Institute of Science is automatically queried to provide information about all genes inside the corresponding cluster. The user can drag&drop genes from cluster to cluster at the right part. Functions for tracking genes and recalculation of the map after the user-made changes are supported

To optimize the gene grid locations, a standard learning procedure proposed by Kohonen (1982) can be applied. Every grid position $(i, j)$ is first initiated with randomly generated artificial gene expressions $a_{ij} : U \to \mathbb{R}$, $i, j = 1, \ldots, M$, where $M$ reflects the grid size. Then the following heuristic steps are repeated:
• For a randomly chosen $a \in A$, find (using possibly a randomized method) the $d_U$-closest artificial gene $a_{ij}$, where $d_U(a, a_{ij})$ is calculated as if $a_{ij} \in A$. Remember the obtained $(i, j)$ as the current grid location of $a$.
• Change $a_{ij}$ to $a'_{ij}$ in such a way that $d_U(a, a'_{ij}) \le d_U(a, a_{ij})$. Do the same with the neighboring grid locations $(k, l)$. The decrease of $d_U(a, a'_{kl})$ with respect to $d_U(a, a_{kl})$ should be inversely related to the grid distance $\sqrt{(i - k)^2 + (j - l)^2}$.
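The two heuristic steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the GenOM implementation: the Euclidean distance stands in for the gene distance, the neighborhood influence is a Gaussian of the grid distance, and the names `train_som`, `locate`, `lr0`, and `sigma0` as well as the linear decay schedule are hypothetical choices.

```python
import numpy as np

def train_som(data, grid_size=4, n_iter=2000, lr0=0.5, sigma0=1.5, seed=0):
    """Fit a grid_size x grid_size map to the rows of `data` (genes x experiments)."""
    rng = np.random.default_rng(seed)
    n_genes, n_exp = data.shape
    # Randomly generated artificial gene expressions, one vector per grid cell.
    weights = rng.uniform(data.min(), data.max(), size=(grid_size, grid_size, n_exp))
    # Grid coordinates of every cell, used for neighborhood distances.
    ii, jj = np.meshgrid(np.arange(grid_size), np.arange(grid_size), indexing="ij")
    for t in range(n_iter):
        frac = t / n_iter
        lr = lr0 * (1.0 - frac)               # assumed: learning rate decays linearly
        sigma = sigma0 * (1.0 - frac) + 0.3   # assumed: neighborhood radius shrinks
        a = data[rng.integers(n_genes)]       # randomly chosen gene
        # Step 1: find the closest artificial gene (Euclidean stand-in distance).
        dist = np.linalg.norm(weights - a, axis=2)
        bi, bj = np.unravel_index(np.argmin(dist), dist.shape)
        # Step 2: move the winner and its neighbors toward the gene;
        # the pull gets weaker as the grid distance to the winner grows.
        grid_d2 = (ii - bi) ** 2 + (jj - bj) ** 2
        h = np.exp(-grid_d2 / (2.0 * sigma ** 2))
        weights += lr * h[:, :, None] * (a - weights)
    return weights

def locate(data, weights):
    """Return the (i, j) grid location of the closest artificial gene for each gene."""
    locations = []
    for a in data:
        dist = np.linalg.norm(weights - a, axis=2)
        i, j = np.unravel_index(np.argmin(dist), dist.shape)
        locations.append((int(i), int(j)))
    return locations
```

Because every update moves a cell along the straight line toward the chosen gene with a factor between 0 and 1, the distance between that gene and the updated cell never increases, which matches the requirement stated in the second step.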
We implement this strategy within our algorithmic framework called GenOM (Gene-Organizing Map), using various
ways to calculate distances and update the grid characteristics. Its display window is illustrated by Fig. 1. In particular,
it enables the user to interact with the learning process using
cluster-to-cluster gene drag&drops. In this way, the following limitations of standard SOM are waived:
• The space of solutions (the grid locations of genes) is enormous. There is no guarantee that heuristic procedures lead to an optimum. Correcting guidance is helpful.
• Microarrays do not provide complete information. They refer to genes via their expressions, which can be imprecise. Involvement of expert knowledge is required.
Drag&drop interactions are followed by fast grid recalculation. The system detects the changes in gene positions and modifies the grid settings by increasing the influence of the drag&dropped gene’s expression characteristics on its new neighborhood and decreasing it on the old one. These modifications, customized with respect to the applied distance function, are automatically taken into account during the further learning process.
Gene expression distances
Retrieving meaningful cluster-based information demands the usage of appropriate distances. This non-trivial problem becomes even more complicated for a subject as unexplored and complex as genes. Defining biologically natural and understandable similarities is very challenging, mostly because we still know little about gene functions. At the current level of genetic knowledge we can only compare the analytical results with the experts’ directions and commonly known medical facts.
We tested the Euclidean distance, Pearson and Spearman
correlations (Friedman et al., 2000), as well as specially
designed ranked entropy referring to information-theoretic
measures (Kapur and Kesavan, 1992). We focus on the Spearman and ranked entropy cases, reported as most relevant in
our preliminary studies (Gruźdź, Ihnatowicz, and Ślȩzak, 2005a, 2005b).
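As a hedged illustration of the two rank-based measures this section focuses on, the sketch below computes them for complete data (no missing values, no ties). It assumes the Spearman-style measure is the Euclidean distance between rankings, and that the ranked entropy distance averages, over all reference experiments, the entropies of binary "greater than the reference" indicators. The function names are hypothetical and the code is not taken from GenOM.

```python
import numpy as np

def ranks(x):
    """Ranks 0..N-1 of the values in x (assumes no ties)."""
    return np.argsort(np.argsort(np.asarray(x, dtype=float)))

def rank_distance(a, b):
    """Euclidean distance between the rankings of two expression profiles."""
    ra, rb = ranks(a), ranks(b)
    return float(np.sqrt(np.sum((ra - rb) ** 2)))

def _entropy(probs):
    """Shannon entropy (base 2) of a probability vector, ignoring zeros."""
    p = np.asarray(probs, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def ranked_entropy_distance(a, b):
    """H(a, b) - H(a)/2 - H(b)/2, with entropies averaged over reference experiments u."""
    ra, rb = ranks(a), ranks(b)
    n = len(ra)
    h_a = h_b = h_ab = 0.0
    for u in range(n):
        others = np.arange(n) != u
        # Binary u-discretized attributes: 1 if the value exceeds the one at u.
        au = (ra[others] > ra[u]).astype(int)
        bu = (rb[others] > rb[u]).astype(int)
        h_a += _entropy(np.bincount(au, minlength=2) / (n - 1))
        h_b += _entropy(np.bincount(bu, minlength=2) / (n - 1))
        h_ab += _entropy(np.bincount(2 * au + bu, minlength=4) / (n - 1))
    return (h_ab - h_a / 2 - h_b / 2) / n

profile = [0.1, 0.5, 0.3, 0.9]
reversed_profile = [-v for v in profile]
# Identical and exactly reversed rankings both give zero ranked entropy distance,
# while the Euclidean rank distance is largest for the reversed ranking.
d_same = ranked_entropy_distance(profile, profile)
d_rev = ranked_entropy_distance(profile, reversed_profile)
```

The example at the end shows why the entropy-based measure treats direct and inverse rank relationships alike, whereas the Euclidean rank distance separates them maximally.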
The Spearman correlation-based distance $d_S(a, b)$ between genes $a$ and $b$ is proportional to the Euclidean distance between their rankings $\sigma(a), \sigma(b) : U \to \{0, \ldots, N - 1\}$, where $N$ denotes the number of experiments in $U$. For simplicity we put

$$d_S(a, b) = \sqrt{\sum_{u \in U} [\sigma(a)(u) - \sigma(b)(u)]^2}$$

We have $d_S(a, b) \ge 0$, where equality holds if and only if $\sigma(a)$ and $\sigma(b)$ are identical. The value of $d_S(a, b)$ reaches its maximum if and only if $\sigma(a)$ and $\sigma(b)$ are opposite to each other.

In its simplest form, the entropy distance can be defined as follows:

$$d_H(a, b) = \frac{1}{2}\left(H(a|b) + H(b|a)\right) = H(a, b) - \frac{H(a)}{2} - \frac{H(b)}{2}$$

where $H(a)$ and $H(b)$ are entropies of genes $a, b \in A$, and $H(a, b)$ is the entropy of the product variable $(a, b)$. Normally, calculation of $H(a, b)$ requires discretization of quantitative expression data or parameterized probabilistic density estimates. Instead, we propose the following procedure: Given arbitrary $u \in U$, we discretize the data with respect to its values $a(u)$, $a \in A$. Every gene $a : U \to \mathbb{R}$ is then transformed to the binary $u$-discretized attribute $a_u : U \setminus \{u\} \to \mathbb{R}$ defined by

$$a_u(e) = \begin{cases} 1 & \text{if and only if } a(e) > a(u) \\ 0 & \text{if and only if } a(e) < a(u) \end{cases}$$

Given $B \subseteq A$ and $u \in U$, we denote by $B_u$ the set of $u$-discretized genes taken from $B$. Entropy of $B_u$, denoted by $H(B_u)$, is calculated from the product probabilities of binary variables $a_u \in B_u$. The so-called ranked entropies in $d_H(a, b)$ are calculated using the following formula:

$$H(B) = \frac{1}{N} \sum_{u \in U} H(B_u)$$

We have $d_H(a, b) \ge 0$, where equality holds if and only if the value ranks of $a$ and $b$ are directly or inversely proportional. Moreover, $d_H(a, b) \le d_H(a, c) + d_H(b, c)$ for any $c \in A$, where equality holds if and only if $a_u$ and $b_u$ determine $c_u$ and they are statistically independent given $c_u$ for every $u \in U$.

Ranked entropy can be equivalently rewritten in terms of the attributes’ rankings. Throughout the rest of the paper we assume that the gene expressions are provided in a rank-based form, obtained by the original data preprocessing. This is an advantage for both the Spearman and ranked entropy distances. The rank-related distances assume less information about the data and may better approximate global gene relationships. An additional advantage of ranked entropy is that it measures direct and inverse proportions equally. This is important for the genes with significant functional relationships.

Handling missing values
Before proceeding with calculations, we have to face the problem of missing values, which occur in microarray data due to the complicated, microscale process of their manufacturing. There are many estimation methods which deal with missing expression values. A naive approach is to fill the empty spaces with the row average or simply zero values. More advanced
methods refer to the k-NN (de Brevern, Hazout, and Malpertuy, 2004), singular value decomposition (Altman et al.,
2001), or Bayesian principal component analysis (Ishii, Matsubara, and Monden, 2003). We follow an approach adjusted
to the ranked attributes, with no advanced statistical or machine learning models required. The proposed technique is considered separately for the Spearman and ranked entropy distances, as they rely on different approaches to understanding data relationships.

Table 1 Example of microarray data: 9 experiments and 20 genes. The values are rounded. Missing values are denoted by “***”. For every non-missing value, we also provide its rank-based value (in parentheses)

A    u0          u1          u2          u3          u4          u5          u6          u7          u8
a0   2.17 (7)    −0.44 (0)   1.36 (6)    0.70 (5)    0.56 (3)    ***         −0.07 (1)   0.69 (4)    0.14 (2)
a1   −0.23 (1)   0.44 (4)    1.32 (6)    2.91 (8)    0.50 (5)    −0.20 (2)   2.39 (7)    −0.34 (0)   0.20 (3)
a2   0.50 (4)    0.08 (2)    0.95 (6)    2.39 (7)    0.57 (5)    −0.26 (1)   2.74 (8)    −0.37 (0)   0.13 (3)
a3   −1.45 (2)   0.66 (8)    −1.59 (1)   −1.01 (3)   −0.50 (5)   −0.45 (6)   −1.79 (0)   −0.42 (7)   −0.99 (4)
a4   −2.35 (3)   −1.07 (6)   ***         −2.34 (4)   −2.49 (2)   −2.87 (1)   −3.53 (0)   −0.96 (7)   −1.24 (5)
a5   −0.27 (6)   −1.55 (2)   −0.36 (5)   −1.20 (3)   −1.78 (1)   −2.68 (0)   −0.08 (7)   ***         −1.16 (4)
a6   ***         −0.35 (2)   ***         0.34 (4)    0.15 (3)    −0.96 (1)   −1.79 (0)   0.85 (5)    1.17 (6)
a7   1.06 (5)    1.31 (6)    0.82 (2)    1.32 (7)    0.99 (4)    2.21 (8)    −0.39 (1)   0.84 (3)    0.78 (0)
a8   −1.99 (0)   2.36 (8)    −0.28 (6)   −1.32 (1)   −0.67 (5)   0.13 (7)    −1.22 (2)   −0.68 (4)   −1.20 (3)
a9   −1.71 (0)   2.13 (7)    ***         −0.96 (3)   −0.85 (4)   0.09 (6)    −1.39 (1)   −0.68 (5)   −1.10 (2)
a10  −0.08 (4)   1.04 (8)    0.76 (6)    0.39 (5)    −1.32 (1)   −1.58 (0)   0.85 (7)    −1.22 (2)   −0.55 (3)
a11  0.78 (8)    −0.44 (2)   −0.89 (0)   0.19 (4)    0.64 (7)    0.47 (6)    −0.66 (1)   0.03 (3)    0.27 (5)
a12  −0.06 (3)   −0.38 (1)   −0.47 (0)   0.10 (4)    1.36 (8)    1.20 (7)    −0.34 (2)   0.60 (6)    0.42 (5)
a13  −0.70 (1)   −0.67 (2)   −0.34 (4)   −0.62 (3)   0.70 (7)    0.87 (8)    −1.08 (0)   0.06 (6)    −0.10 (5)
a14  0.90 (3)    0.64 (1)    0.64 (2)    1.04 (5)    1.39 (7)    0.11 (0)    1.02 (4)    1.24 (6)    1.54 (8)
a15  −0.16 (0)   0.47 (2)    1.28 (5)    2.39 (7)    0.97 (4)    0.53 (3)    1.83 (6)    2.40 (8)    0.11 (1)
a16  −0.19 (1)   0.65 (3)    1.08 (4)    2.10 (8)    1.73 (7)    1.54 (6)    1.12 (5)    −0.37 (0)   −0.07 (2)
a17  2.51 (4)    0.59 (2)    0.83 (3)    ***         3.46 (6)    3.72 (7)    3.39 (5)    −0.46 (1)   −0.83 (0)
a18  0.61 (4)    0.06 (2)    ***         1.81 (6)    1.72 (5)    2.10 (7)    0.52 (3)    −2.03 (0)   −0.59 (1)
a19  −0.05 (8)   −0.63 (5)   −1.35 (3)   −1.01 (4)   −1.73 (2)   −2.02 (0)   −1.91 (1)   −0.39 (7)   −0.59 (6)

Denote by M(a) the number of missing values for a ∈
A. We treat a ranked attribute as the function a : U →
{0, . . . , N − M(a) − 1} ∪ {∗}, where every value different
than ∗ occurs exactly once. Table 1 provides an example.
For every u ∈ U and a ∈ A, we calculate the expected rank
ã(u) ∈ [0, N − 1] of a(u). We assume a uniform random
distribution of the missing value ranks. Simple calculations
result in the following formula:
$$\tilde{a}(u) = \begin{cases} a(u) + \dfrac{M(a)\,(a(u) + 1)}{N - M(a) + 1} & \text{if } a(u) \ne \ast \\[1ex] \dfrac{N - 1}{2} & \text{if } a(u) = \ast \end{cases}$$

The sum of the expected ranks is the same for every gene, as illustrated by Table 2. It is equal to the sum of ranks of a gene with no missing values, that is $N(N - 1)/2$. Consequently, the expected expression ranks can be compared using the standard Euclidean distance, which is actually a generalization of the Spearman correlation onto the case of partially incomplete rankings.

To adapt the ranked entropy distance $d_H(a, b)$, we do not need to calculate the expected ranks. While discretizing $a \in A$ with respect to values $a(u) \in \{0, \ldots, N - M(a) - 1\} \cup \{\ast\}$, we estimate probabilities that the values of objects $e \in U$ are greater/lower than $a(u)$. Let us focus, e.g., on the probability that the rank $a(e)$ is lower than $a(u)$, denoted by $P(a(e) < a(u))$:
1. If $a(u) \ne \ast$ and $a(e) \ne \ast$, then we put $P(a(e) < a(u)) = 1$ if $a(e) < a(u)$ and $0$ otherwise.
2. If $a(u) \ne \ast$ and $a(e) = \ast$, then we put $P(a(e) < a(u)) = a(u)/(N - M(a))$ as the fraction of known ranks below $a(u) \ne \ast$, estimating a chance that $\ast < a(u)$.
3. If $a(u) = \ast$ and $a(e) \ne \ast$, then we put $P(a(e) < a(u)) = (N - M(a) - a(e))/(N - M(a) + 1)$ to scale a chance of $a(e) < \ast$, beginning from $1 - 1/(N - M(a) + 1)$ for $a(e) = 0$, down to $1/(N - M(a) + 1)$ for $a(e) = N - M(a) - 1$.
4. If $a(u) = \ast$ and $a(e) = \ast$, then we put simply $P(a(e) < a(u)) = 1/2$.
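The expected-rank formula and the four probability rules above can be written down directly. The following is a minimal sketch under stated assumptions: a ranked gene is a Python list with `None` marking a missing value, and the names `MISSING`, `expected_ranks`, and `prob_lower` are hypothetical. It follows the formulas as printed in the text and is not the authors' implementation.

```python
MISSING = None  # assumed marker for a missing value ("*" in the text)

def expected_ranks(col):
    """Expected ranks of a ranked gene given as a list of ranks 0..N-M-1 and MISSING."""
    n = len(col)                              # N: number of experiments
    m = sum(v is MISSING for v in col)        # M(a): number of missing values
    out = []
    for v in col:
        if v is MISSING:
            out.append((n - 1) / 2)           # missing value: middle of the rank range
        else:
            out.append(v + m * (v + 1) / (n - m + 1))
    return out

def prob_lower(a_e, a_u, n, m):
    """P(a(e) < a(u)) following the four cases for known and missing ranks."""
    if a_u is not MISSING and a_e is not MISSING:
        return 1.0 if a_e < a_u else 0.0      # case 1: both ranks known
    if a_u is not MISSING:
        return a_u / (n - m)                  # case 2: a(e) is missing
    if a_e is not MISSING:
        return (n - m - a_e) / (n - m + 1)    # case 3: a(u) is missing
    return 0.5                                # case 4: both missing

# Ranks of a gene with one missing value out of N = 9 experiments.
gene = [4, 2, 3, MISSING, 6, 7, 5, 1, 0]
total = sum(expected_ranks(gene))             # should be close to N(N-1)/2
```

For any number of missing values the expected ranks produced by this formula sum to N(N − 1)/2, which is the property that lets them be compared with the ordinary Euclidean distance.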
Finally, we can derive $P(a_u = v)$, $P(b_u = w)$, $P(a_u = v, b_u = w)$, for $v, w = 0, 1$, which can be used to calculate $H(a_u)$, $H(b_u)$, $H(a_u, b_u)$. For instance:

$$P(a_u = 0) = \frac{1}{N - 1} \sum_{e \in U \setminus \{u\}} P(a(e) < a(u))$$

$$P(a_u = 0, b_u = 0) = \frac{1}{N - 1} \sum_{e \in U \setminus \{u\}} P(a(e) < a(u))\, P(b(e) < b(u))$$

where $P(a_u = 0)$ turns out to be equal to $a(u)/(N - M(a) - 1)$ if $a(u) \ne \ast$, and $1/2$ otherwise. This completes the mathematical framework that we need to analyze incomplete microarray data.

Table 2 The expected ranks for the data set illustrated in Table 1. The ranks of genes with no missing values are the same as before. Each row sums up to 36, that is $N(N - 1)/2$ for $N = 9$.

A     u0      u1      u2      u3      u4      u5      u6      u7      u8
ã0    8.0     0.125   6.875   5.75    3.5     3.5     1.25    4.625   2.375
ã4    3.5     6.875   3.5     4.625   2.375   1.25    0.125   8.0     5.75
ã5    6.875   2.375   5.75    3.5     1.25    0.125   8.0     3.5     4.625
ã6    3.5     2.857   3.5     5.429   4.143   1.571   0.286   6.714   8.0
ã9    0.125   8.0     3.5     3.5     4.625   6.875   1.25    5.75    2.375
ã17   4.625   2.375   3.5     3.5     6.875   8.0     5.75    1.25    0.125
ã18   4.625   2.375   3.5     6.875   5.75    8.0     3.5     0.125   1.25

Analysis of breast cancer data

The analysis performed by GenOM is presented in the form of clusters of genes that are most significant to current oncology knowledge. The goal of our case study was to find groups of genes correlated with estrogen receptor 1 (ESR1), which is known to have a serious effect on breast cancer survival (Boyapati, Shu, and Ruan, 2005). Discovering the reasons for complex relationships between ESR1 and other genes may have a large impact on understanding the mechanisms underlying breast cancer development. In this preliminary study we focus on the consistency of GenOM’s results with already known facts rather than on new hypotheses; our approach has to be thoroughly verified before it can be applied as a supportive tool for biomedical experts.

The experimental samples that we dealt with come from various breast cancer subtypes; thus, we search for a common mutation ‘portrait’ rather than predict tumor subclasses. As a basic outcome, we found some of the already identified genes co-expressed with ESR1 (Lacroix and Leclercq, 2004; Akslen et al., 2000; Aas, Botstein, and Brown, 2001). Below we present interesting groups of genes occurring repeatedly in the same clusters or in the neighborhood of ESR1. By ‘repeatedly’ we mean that, although the performance of self-organizing maps depends on random settings of the initial grid, the considered genes are always close to ESR1.

Figure 2 presents the first group of genes potentially significant to breast cancer, being repeatedly located in the ESR1’s clusters. They occur in the literature as potentially influenced by ESR1 expression changes, which partially confirms a good performance of our technique. GATA3 and VAV3 are deregulated during tumor progression. LIV-1 is a breast cancer-associated protein related to cancer progression because it is responsible for zinc transportation. NAT1’s transcription is responsible for breast cancer risk and may be changed by environmental factors.

Fig. 2 Examples of genes significant for breast cancer. The top picture presents characteristics of genes found in the ESR1’s cluster produced using the Spearman distance. The bottom picture refers to the results obtained using the ranked entropy

The genes discussed in Fig. 2 are rather positively correlated with ESR1. Still, the Spearman and ranked entropy distances focus on their different subgroups. The common genes in those groups are therefore worth special attention from the analytical point of view. This is the case for RAB5EP (RAB GTPase binding effector protein 1), which regulates intracellular membrane traffic and controls the processes of apoptosis and endocytosis. The RAB GTPase genes were generally observed to play an important role in cancer and many other human diseases.

Fig. 3 Results of grouping selected significant genes negatively correlated with ESR1. The left grid shows the 2D SOM map obtained for the considered data using the ranked entropy distance. The right grid corresponds to the Spearman distance. Circles denote clusters including the genes listed above, with the number of genes indicated. ‘E’ denotes the position of ESR1. The grids illustrate one of many outcomes of our calculations, generally confirming that the negatively correlated genes are better grouped around ESR1 while using the ranked entropy distance

Figure 3 shows genes rather negatively correlated with
ESR1. They tend to be located closer to ESR1 if the SOM
grid is learnt using the ranked entropy. It is significant because negatively and positively correlated genes should be
treated as of the same importance. CDH3 is responsible for
the cell adhesion and loss of heterozygosity events in breast
and prostate cancer. GALNT3’s mutations cause familial tumoral calcinosis, which suggests that it may serve as a marker
of tumor differentiation. GSTP1 is frequently over-expressed
in many human cancers. It belongs to a family of enzymes
playing an important role in detoxification. Its expression increases with tumor progression and results in resistance to
therapy.
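The notion of genes being "repeatedly" located near ESR1 across SOM runs with different random initial grids can be made concrete with a simple count. The sketch below is only an assumption about how such a check might be coded: each run is a dictionary from gene name to grid cell, neighborhood is measured with the Chebyshev distance on the grid, and the function names and the grid positions in the example are invented for illustration, not taken from the reported experiments.

```python
from collections import Counter

def neighbours_of(target, locations, radius=1):
    """Genes whose grid cell lies within `radius` cells of the target's cell."""
    ti, tj = locations[target]
    return {gene for gene, (i, j) in locations.items()
            if gene != target and max(abs(i - ti), abs(j - tj)) <= radius}

def repeated_neighbours(target, runs, radius=1, min_fraction=1.0):
    """Genes found near the target in at least `min_fraction` of the runs."""
    counts = Counter()
    for locations in runs:
        counts.update(neighbours_of(target, locations, radius))
    return sorted(g for g, c in counts.items() if c / len(runs) >= min_fraction)

# Two made-up runs (illustrative positions only).
runs = [
    {"ESR1": (2, 2), "GATA3": (2, 3), "CDH3": (0, 0)},
    {"ESR1": (1, 1), "GATA3": (1, 1), "CDH3": (1, 2)},
]
always_close = repeated_neighbours("ESR1", runs)
often_close = repeated_neighbours("ESR1", runs, min_fraction=0.5)
```

Setting `min_fraction=1.0` corresponds to the strict reading used in the text, where a gene counts only if it stays close to ESR1 in every run.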
Conclusions
We developed a SOM-based framework for analyzing and visualizing the gene expression similarities. Genes are grouped based on their possibly incomplete expression characteristics, treated as ranking vectors with missing positions. We estimate the missing ranks or their probabilities. Our toolkit additionally enables the user to interactively tune the grouping process by drag&dropping genes between clusters. The obtained results are promising for further, more extensive research and active cooperation with domain experts. Thanks to the implemented interactivity, the system is particularly well prepared for such cooperation.
Calculations over the Breast Cancer data (Demeter, Deng,
and Geisler, 2003; Akslen et al., 2000) were presented for the
two most promising dissimilarity measures—the Spearman
and ranked entropy distances. We examined different tumor
subtype samples to characterize a common gene expression
mutation portrait. We kept track of the estrogen regulated
ESR1 gene, whose status and grade have been used by the
oncologists to define the breast cancer subclasses. We observed the groups of significant genes tending to be clustered
together with ESR1 or remaining in its close neighborhood.
Some of them are regulated by estrogens as well, while the
others are various breast cancer-related genes responsible for
crucial cell functions. This is consistent with the hypothesis that ESR1 is important for the mechanisms underlying many breast cancer subtypes.
Acknowledgments The research reported in this article was supported by a research grant from the Natural Sciences and Engineering Research Council of Canada awarded to the third author.
References
Aas T, Botstein D, Brown P. Gene expression patterns of breast carcinomas distinguish tumor subclasses with clinical implications.
PNAS 2001;98:10869–10874.
Akslen L, Botstein D, Eisen M, Fluge O, Jeffrey S, Lonning P. Molecular
portraits of human breast tumors. Nature 2000;406:747–752.
Altman R, Botstein D, Brown P, Cantor M, Hastie T, Tibshirani R. Missing value estimation methods for DNA microarrays. Bioinformatics 2001;17:520–525.
Anders K, Botstein D, Brown P. Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol Biol Cell 1998;9:3273–3297.
Boyapati S, Shu X, Ruan Z. Polymorphisms in ER-alpha gene interact with estrogen receptor status in breast cancer survival. Clin Cancer Res 2005;11:1093–1098.
Castrén E, Kaski S, Nikkilä J, Törönen P, Wong G. Analysis and visualization of gene expression data using self-organizing maps. In: IEEE-EURASIP Workshop on Nonlinear Signal and Image Processing (NSIP-01), Baltimore 2001.
de Brevern A, Hazout S, Malpertuy A. Influence of microarrays experiments missing values on the stability of gene groups by hierarchical
clustering. BMC Bioinformatics 2004;5:114.
Demeter J, Deng S, Geisler S. Repeated observation of breast tumor
subtypes in independent gene expression data sets. Proc Natl Acad
Sci USA 2003;100:8418–8423.
Eriksen K, Hornquist M, Sneppen K. Visualization of large-scale
correlations in gene expressions. Funct Integr Genomics
2004;4:241–245.
Friedman N, Linial M, Nachman I, Pe’er D. Using Bayesian networks
to analyze expression data. Journal of Computational Biology
2000;7:601–620.
Gruźdź A, Ihnatowicz A, Ślȩzak D. Gene expression clustering: Dealing with the missing values. In: Klopotek, M.A., Trojanowski,
K., and Wierzchoń, S., eds., Proc. of IIS 2005, LNAI, Springer
Verlag, 2005a; 521–530.
Gruźdź A, Ihnatowicz A, Ślȩzak D. Interactive SOM-based gene
grouping: An approach to gene expression data analysis. In Hacid,
M.-S., Murray, N.V., Raś, Z.W., and Tsumoto, S., eds. Proc. of
ISMIS 2005, LNAI, Springer Verlag 2005b; 514–523.
Ishii S, Matsubara K, Monden M. A Bayesian missing value estimation
method. Bioinformatics 2003;19:2088–2096.
Kapur J, Kesavan H. Entropy Optimization Principles with Applications. Academic Press, 1992.
Kaski S. SOM-based exploratory analysis of gene expression data.
In: Advances in Self-Organizing Maps, Springer Verlag 2001;
124–131.
Kohonen T. Self-organized formation of topologically correct feature
maps. Biological Cybernetics 1982;43:59–69.
Lacroix M, Leclercq G. About GATA3, HNF3A and XBP1, three genes
co-expressed with the oestrogen receptor-α gene (ESR1) in breast
cancer. Molecular and Cellular Endocrinology 2004;219:1–7.
Lawrence C, Liu J, Palumbo M, Zhang J. Bayesian clustering with
variable and transformation selections. In: Bayesian Statistics 7,
Oxford University Press. 2003; 249–275.
Alicja Gruźdź is a graduate student at the Department of
Computer Science, the University of Regina, Canada. She received her B.Sc. degree with honors from the Polish-Japanese
Institute of Information Technology, Poland, 2004. Her interests are in the field of bioinformatics, with emphasis on
visualization of gene expression data.
Aleksandra Ihnatowicz received her B.Sc. degree with
honors from the Polish-Japanese Institute of Information
Technology, Poland, 2004. Her primary research interest is
in applied computing, especially the gene expression data
mining.
Dominik Ślȩzak received his Ph.D. degree in Computer
Science from the Warsaw University, Poland, 2003. He
has been cooperating with the Group of Logic headed by
Dr. Andrzej Skowron at the Warsaw University, and the Department of Robotics and Multi-Agent Systems headed by
Dr. Lech Polkowski in Polish-Japanese Institute of Information Technology, Poland. In 2004 he moved to the University
of Regina, Canada, where he works in the Computer Science
Department. Dr. Ślȩzak’s interests are in the areas of approximate knowledge discovery in databases, with emphasis on the
theory of rough sets and probabilistic reasoning. He develops
applications to medicine, bioinformatics, multimedia, generally dealing with compound data, containing large amounts
of cases and features of varied types. Dr. Ślȩzak (co-)authors
over 50 refereed journal and conference papers, as well as
book chapters. He has been active as a chair, advisory board
member, and program committee member of a number of
international events. He has been an external reviewer for
a number of world-class journals. Currently, he is a member of the Executive Board of the International Rough Set
Society.