LU TP 16-26
May 2016
Training Artificial Neural Networks on the Marginal
Distribution Estimates of the C index
Frank Johansson
Department of Theoretical Physics, Lund University
Bachelor thesis supervised by Patrik Edén
Abstract
In this project, we compare two error functions for training artificial neural networks on heavily censored data (data where key information is missing).
J. Kalderstam et al. [2] have shown that it is possible to train artificial neural networks directly on Harrell's C index [5] using genetic algorithms. Kalderstam has also investigated the possibility of improving the performance of a neural network trained on what is referred to as the mean squared censored error, introduced by Van Belle et al. [8], by using the marginal distribution estimates of the uncensored data [3].
This project develops the method further, investigating the difference in performance
between a network trained on the C index and a network trained on a new performance
estimator, the Soft C, which introduces the marginal distribution estimates of the C index.
The Soft C trained network seems to outperform Cox Regression [4], the standard method against which new methods are compared. However, the validation results comparing the performance of the two error functions were inconclusive, and further studies are required to determine which error function performs best.
Popular Science Summary
In health care, patients are often divided into risk groups that determine which treatment suits each patient best. Preferably, this division should be based on information about the patients, such as age and tumour size, and it is possible to use known patient information to teach a computer to link a given patient to a certain risk group. This is done by minimizing some chosen error estimate between the risk group the computer selects and the known risk group. After this, patients without a known risk group can be linked to the correct treatment.
It is common to divide patients into risk groups by survival time. One can try to predict the probable survival time or (as in this project) order patients by their survival time in order to determine who is in critical need of treatment.
Deciding on risk groups is a hard problem in itself, and matters are made worse by the fact that many of the samples the computer is trained on are incomplete. That is to say, the patients dropped out of the study before their survival time became known.
The purpose of this project is to investigate a new way of describing the deviation between the computer's division and the correct one. This description will hopefully improve how incomplete samples are handled.
Contents

1 Introduction
2 Theory
 2.1 Previous Theory
  2.1.1 Kaplan-Meier Estimate
  2.1.2 Cox Regression
  2.1.3 C index
 2.2 Introduced Theory
  2.2.1 Soft C index
  2.2.2 Analysis of the difference between the Soft C and the C index
  2.2.3 Ties
 2.3 ANN
3 Method
 3.1 Datasets
 3.2 Network Structure
 3.3 Performance Measurements
4 Results
 4.1 Number of hidden nodes
 4.2 Soft C vs. C
 4.3 Soft C vs. Cox Regression
5 Discussion and Outlook
 5.1 Conclusion
6 Acknowledgements
1 Introduction
In cancer research, one area of study is the time to certain events. These events could for
example be relapse after surgical removal of the tumour or death. Regardless of event,
the time before these events is referred to as survival time. The general purpose of these
studies is to divide patients into risk groups with corresponding treatments, depending on
survival time.
The problem can be described using the survival probability S(t), the probability of being alive after time t. An obstacle encountered when trying to calculate the survival probability is not having access to the whole time span of a patient. Data can be left-censored, meaning that the early time span is unaccounted for.
Perhaps the cancer had already progressed when the first examination was conducted and
it is impossible to tell exactly when it progressed. Another possibility is that the data
is interval-censored. This could occur if a patient misses an examination in an ongoing
study. However, the most common type of censoring, and the type which this report will
be limited to, is right-censoring. Right-censoring suggests that one does not know what
happened to a patient after a certain time point. This occurs if the patient leaves the
study for any other reason than experiencing the event of interest, for example if the patient experiences another event which makes further studies impossible or would render
the results of those examinations inconclusive. Such an event is referred to as a censored
event. A universally recognized method for approximating the survival probability in the
presence of censored data is the Kaplan-Meier method [7] which will be discussed more in
depth later. The reason this method is popular is that the Kaplan-Meier estimator is the
maximum likelihood estimate of the survival function S(t).
An important medical application is to use several parameters, for example age and
tumour size, which are thought to affect the survival time and use them to assign to each
patient an index related to the expected survival time. The indices are then used, together
with some threshold values, to divide patients into risk groups with different assigned
treatments. The most common method to relate parameters and survival time is Cox
regression [4] and it is against this method that all other methods are measured. One of
these other methods is artificial neural networks (ANNs) which is the method that will be
considered in this report. The main reason for using ANNs instead of Cox regression is
that the latter is a linear classifier whereas ANNs can be constructed to recognize more
complex decision boundaries.
An evaluation of a proposed index should compare the sorting of the indices with that
of the event times, while in someway taking censored events into account. A commonly
used evaluation of this was created by F. Harrell [5] and is called the concordance index
or C index for short. It was shown by J. Kalderstam that it it possible to train an ANN
directly on the C index using genetic algorithms [2].
A possible improvement on the approach to censored data has been studied by
J. Kalderstam et al. [3] where one uses the probability distribution of the complete data
set to predict the behaviour of the censored patient data after the censor points.
The purpose of this project is to study the implications of this approach to the censored
data for the C index and to find out whether this Soft C leads to any improvement in performance. The method should theoretically be able to extract more information from the censored data and thereby make more accurate predictions. This would help ensure that patients receive a treatment potent enough to combat the disease, while also making certain that the treatment is not too powerful, which could be equally dangerous.
2 Theory

2.1 Previous Theory

2.1.1 Kaplan-Meier Estimate
The Kaplan-Meier estimate is a method of estimating the survival function, S(t) (where t
denotes the time from a patient’s first examination), for a risk group. It can be shown that
the KM estimate is the maximum likelihood approximation of the survival function [7].
Assume independent events, which may be censored or not, at times t1 < t2 < t3 < ... < tn .
The probability of surviving to a time tj can be iteratively calculated from the probabilities
to survive to each time tk , k < j. This iterative process is, according to the KM method:
S(t_j) = S(t_{j−1}) (1 − d_j / n_j)    (2.1)
where dj is the number of uncensored (real) events at time tj and nj is the number of
patients, known for certain to be alive, at time tj . This process can be traced back to
t0 = 0 and S(0) = 1. An example of the Kaplan-Meier estimate can be seen in figure 1.
The purpose of the dashed frame will be explained later.
[Figure 1 shows the Kaplan-Meier curve S(t_i) for the Mayo dataset, with a dashed inset frame showing the conditional survival S(t_i | t_j).]
Figure 1: Plot of the Kaplan-Meier estimate for the Mayo dataset [17]. The dashed axes show the conditional probability of surviving until a time t_i given that the patient was alive at t_j.
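The iterative product in equation 2.1 translates directly into code. Below is a minimal illustrative sketch (our own helper, not the thesis implementation; names and data are assumptions):

```python
def kaplan_meier(times, events):
    """Kaplan-Meier estimate S(t) from observation times and event
    flags (1 = real event, 0 = censored), following equation 2.1."""
    order = sorted(range(len(times)), key=lambda i: times[i])
    n_at_risk = len(times)          # n_j: patients known to be alive
    s = 1.0                         # S(0) = 1
    curve = [(0.0, 1.0)]
    i = 0
    while i < len(order):
        t = times[order[i]]
        d = 0                       # d_j: real events at time t
        n_j = n_at_risk
        # Group tied observation times.
        while i < len(order) and times[order[i]] == t:
            d += events[order[i]]
            n_at_risk -= 1
            i += 1
        if d > 0:                   # censoring-only times leave S unchanged
            s *= 1.0 - d / n_j
            curve.append((t, s))
    return curve

# Toy data: real events at t = 1, 2, 4; one censoring at t = 3.
print(kaplan_meier([1, 2, 3, 4], [1, 1, 0, 1]))
```

Note that a censored observation reduces the number at risk, n_j, without producing a step in the curve.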
2.1.2 Cox Regression
The hazard is the conditional probability per unit time that the event of interest occurs, given that it has not occurred earlier. The hazard is related to the survival function through

h = −(1/S) dS/dt    (2.2)

In his regression method [4], Cox makes the assumption that the hazard for a patient j is proportional to some base hazard and that the proportionality constant is a function of the input values x_j, which are the parameters of patient j, stemming from the data obtained during examination.
h_j(t) = h(t, x_j) = h_0(t) f(x_j),   x_j = (x_{j1}, x_{j2}, ...)    (2.3)
Furthermore, he assumes that this function is the exponential function
f(x_j) = e^{β·x_j}    (2.4)
where β is a set of parameters which are set using the maximum likelihood estimation.
Start with an example in which there is a risk group R of N members. In this example
there is an observation of an event for patient i at time t. Each patient j, in R, has a
hazard hj (t) for the time t and the probability to observe an event within ∆t after this
time is then
∑_{j∈R} h_j(t) Δt    (2.5)
Given a single event observed at this time, the probability that the observation was of
patient i is
P_i = h_i(t) Δt / ∑_{j∈R} h_j(t) Δt = h_i(t) / ∑_{j∈R} h_j(t)    (2.6)
Given a set of hazards {hj }, one can then calculate the likelihood (Li (β)) of patient i
having an event at time ti
L_i(β) ≡ h_i(t_i) / ∑_{j∈R_i} h_j(t_i) = e^{β·x_i} / ∑_{j∈R_i} e^{β·x_j}    (2.7)

where R_i is the group of people at risk at time t_i.
The problem that now remains is to numerically find β* which maximizes the likelihood

L = ∏_{events i} L_i(β)    (2.8)
The most likely proportionality constant is then

f(x_j) = e^{β*·x_j}    (2.9)

This in turn gives

h_j(t) = h_0(t) e^{β*·x_j}    (2.10)
The base hazard h_0 is still unknown, but the product y_j = β*·x_j can be used to sort patients, hopefully matching the order of their survival times.
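The partial likelihood of equation 2.8 can be evaluated directly, as the following sketch illustrates (in practice the thesis uses the Lifelines package [14]; the function and variable names here are ours). It computes the logarithm of equation 2.8 for a given β:

```python
import math

def cox_partial_log_likelihood(beta, x, times, events):
    """Log of equation 2.8: for each real event i, add
    beta.x_i - log(sum_{j in R_i} exp(beta.x_j)),
    where R_i is everyone still at risk at time t_i."""
    def dot(a, b):
        return sum(ai * bi for ai, bi in zip(a, b))
    ll = 0.0
    for i, (t_i, e_i) in enumerate(zip(times, events)):
        if not e_i:                      # censored: contributes no factor
            continue
        risk_set = [j for j, t_j in enumerate(times) if t_j >= t_i]
        denom = sum(math.exp(dot(beta, x[j])) for j in risk_set)
        ll += dot(beta, x[i]) - math.log(denom)
    return ll

# Toy example: one covariate, three patients, the last one censored.
x = [[1.0], [0.0], [2.0]]
times = [2.0, 5.0, 3.0]
events = [1, 1, 0]
print(cox_partial_log_likelihood([0.5], x, times, events))
```

Maximizing this expression numerically yields β*, after which y_j = β*·x_j sorts the patients as described above.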
2.1.3 C index
A possible method of evaluating the division of patients is to use M. Kendall's τ [6]. This takes into account whether the target times, t_i and t_j, for two patients are in the same order as their corresponding sorting values, or "outputs", y_i and y_j. Going over all possible pairs, τ is defined as
τ = (N_C − N_D) / (N_C + N_D)    (2.11)
where
• N_C is the number of pairs in concordance ((t_i − t_j)(y_i − y_j) > 0).
• N_D is the number of pairs in discordance ((t_i − t_j)(y_i − y_j) < 0).
With this we can evaluate a method trying to order patients correctly. However, this
method is not able to handle the censored cases. This was solved by F. Harrell who
constructed a concordance index, or C index [5]. Harrell differentiated between informative
pairs, where the shortest survival time is an uncensored event, and uninformative pairs,
where the shortest survival time is a censored event.
The C index is defined as
C = Ñ_C / (Ñ_C + Ñ_D)    (2.12)
where
• ÑC is the number of informative pairs in concordance.
• ÑD is the number of informative pairs in discordance.
For the purposes of this project it is convenient to rewrite the C index using Kendall’s
τ . To do this we must use an estimate, τ̃ , where all uninformative pairs are excluded.
τ̃ = (Ñ_C − Ñ_D) / (Ñ_C + Ñ_D) = 2 Ñ_C / (Ñ_C + Ñ_D) − (Ñ_C + Ñ_D) / (Ñ_C + Ñ_D) = 2C − 1    (2.13)
and then solve for the C index:

C = (τ̃ + 1) / 2    (2.14)

With P_ij, the probability that t_i < t_j, we can write τ̃ as

τ̃ = ∑_{i<j} (2P_ij − 1) / ∑_{i<j} |2P_ij − 1|    (2.15)
where indices are sorted so that yi < yi+1 . For the case of ties in y, see section 2.2.3.
In the case of the C index, P_ij can easily be calculated for a pair of targets t_i, t_j:

P_ij = P(t_i < t_j | Z_i < Z_j) = 1,    δ_i = 1
                                  1/2,  δ_i = 0, δ_j = 1    (2.16)
                                  1/2,  δ_i = 0, δ_j = 0

where
• Z_k is an observation time
• δ_k = 1, 0 denotes whether the observation was a real event or a censored event, respectively
Note that an uninformative pair corresponds to P_ij = 1/2.
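The C index of equations 2.12 and 2.16 amounts to a loop over pairs. A minimal illustrative sketch (not the thesis implementation; ties in Z and y are simply skipped here):

```python
def c_index(outputs, obs_times, events):
    """Harrell's C index (equation 2.12). A pair is informative if the
    shorter observation time belongs to an uncensored event; it is
    concordant if the outputs are ordered like the observation times."""
    concordant = discordant = 0
    n = len(outputs)
    for i in range(n):
        for j in range(i + 1, n):
            # Let a be the sample with the shorter observation time.
            a, b = (i, j) if obs_times[i] < obs_times[j] else (j, i)
            if obs_times[a] == obs_times[b] or not events[a]:
                continue                      # uninformative pair
            if outputs[a] < outputs[b]:
                concordant += 1
            elif outputs[a] > outputs[b]:
                discordant += 1
    return concordant / (concordant + discordant)

# Perfectly ordered outputs give C = 1.
print(c_index([0.1, 0.4, 0.9], [1.0, 2.0, 3.0], [1, 1, 0]))  # → 1.0
```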
2.2 Introduced Theory

2.2.1 Soft C index
In this project, another error function is introduced with the purpose of also taking the censored events into account, by using another probability for concordance. When calculating the Soft C, another probability P_ij is introduced, which uses the Kaplan-Meier survival estimate for the censored cases. Assume that we have a pair of samples i, j with observations at times Z_i and Z_j, Z_i < Z_j, respectively. Then let t_i and t_j denote the true event times of the samples. If Z_i is an uncensored event, it is certain that t_i < t_j, independently of whether Z_j is a censored event or not. If instead Z_i is a censored event, it is impossible to tell. Nevertheless, from the Kaplan-Meier estimate, we can calculate the probability of being alive at time Z_j given that the patient was alive at time Z_i. This probability is given by
S(Z_j | Z_i) = S(Z_j) / S(Z_i)    (2.17)
A detailed derivation of the equation is performed in [3]. For an intuitive understanding,
please see the dashed frame in figure 1.
The probability of dying before t_j, given that the patient was alive at time Z_i, is then

1 − S(Z_j) / S(Z_i)

If both observations are censored events, one can use the same probability for being alive at the second observation. However, the only information on t_j is that Z_j < t_j < ∞, and the probability of being alive at t_j, given that the patient is alive at Z_j, is taken to be 50%. The probability of dying before t_j, given that the patient was alive at time Z_i, is then

1 − S(Z_j)/S(Z_i) + (1/2) S(Z_j)/S(Z_i) = 1 − (1/2) S(Z_j)/S(Z_i)
P_ij can then be written as

P_ij = P(t_i < t_j | Z_i < Z_j) = 1,                        δ_i = 1
                                  1 − S(Z_j)/S(Z_i),        δ_i = 0, δ_j = 1    (2.18)
                                  1 − (1/2) S(Z_j)/S(Z_i),  δ_i = 0, δ_j = 0

with notation from equation 2.16.
The Soft C is given by

C_soft = (τ̃ + 1) / 2    (2.19)

where τ̃ is given by equation 2.15, but with P_ij defined by 2.18 rather than 2.16.

2.2.2 Analysis of the difference between the Soft C and the C index
Table 1: P_ij for different pairings of censored and uncensored samples.

Error function | Both uncensored | First uncensored, second censored | First censored, second uncensored | Both censored
C index        | 1               | 1                                 | 1/2                               | 1/2
Soft C         | 1               | 1                                 | 1 − S(Z_j)/S(Z_i)                 | 1 − (1/2) S(Z_j)/S(Z_i)
As one can see in table 1, the two error functions only differ in their evaluation of pairs where the first sample is censored. Starting with the case where both are censored, the Soft C lets this pair contribute with a factor 1 − (1/2) S(Z_j)/S(Z_i) to P_ij. Since S(Z_j)/S(Z_i) < 1, this factor lies between 1/2 and 1. The importance of a pair is determined by |2P_ij − 1|, so the Soft C approach will always give such a pair a higher importance than the normal C index does. This means that whether the outputs of the pair are in concordance with their target censoring times will affect the Soft C to a higher degree than the C index (which is not affected at all). One can think of this as the Soft C introducing an internal ranking among censored samples.
The other case, where the first is censored and the second uncensored, is harder to analyze. Looking again at the ratio S(Z_j)/S(Z_i) < 1, the Soft C approach means that the corresponding factor in P_ij will lie between 0 and 1. There is no way to know, before looking at the event times, the distribution of the censored samples, so the factor cannot be predicted. The C index approach considers all these pairs uninformative, and the hope is that the information included in the Soft C is better than none at all.
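Equations 2.15, 2.18 and 2.19 can be combined into a short Soft C routine. The sketch below is illustrative (names are ours, ties are not handled) and assumes a survival function S, for example from a Kaplan-Meier fit, is available:

```python
def p_ij(Z_i, Z_j, d_i, d_j, S):
    """Equation 2.18: P(t_i < t_j | Z_i < Z_j), with the survival
    function S used when the first observation is censored."""
    if d_i == 1:
        return 1.0
    ratio = S(Z_j) / S(Z_i)          # S(Z_j | Z_i), equation 2.17
    if d_j == 1:
        return 1.0 - ratio           # first censored, second uncensored
    return 1.0 - 0.5 * ratio         # both censored

def soft_c(outputs, obs_times, events, S):
    """Soft C (equation 2.19) via the tau-tilde of equation 2.15,
    summing over pairs with y_i < y_j."""
    num = den = 0.0
    n = len(outputs)
    for i in range(n):
        for j in range(n):
            if outputs[i] >= outputs[j]:
                continue
            # Order the pair by observation time before applying P_ij;
            # for the reversed order use P(t_i < t_j) = 1 - P(t_j < t_i).
            if obs_times[i] < obs_times[j]:
                p = p_ij(obs_times[i], obs_times[j], events[i], events[j], S)
            else:
                p = 1.0 - p_ij(obs_times[j], obs_times[i], events[j], events[i], S)
            num += 2.0 * p - 1.0
            den += abs(2.0 * p - 1.0)
    return (num / den + 1.0) / 2.0

# Toy example: both patients censored, survival halving per time unit.
S = lambda t: 0.5 ** t
print(soft_c([0.1, 0.9], [1, 2], [0, 0], S))  # concordant pair → 1.0
```

Note that with two censored samples the pair still contributes, which is exactly the internal ranking among censored samples discussed above.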
2.2.3 Ties
In the theory above, we neglected the possibility of ties. If two patients have the same observed time Z_i, but only one of them is uncensored, then the patient with the censored event will clearly have a later event. If both patients are censored, or both uncensored, the pair is uninformative. These cases can be summarized as

P(t_i < t_j | Z_i = Z_j) = (1/2)(1 + δ_i − δ_j)    (2.20)
Ties in output y may also occur. Initially during this project, we treated such pairs as uninformative, since concordance is based on the sign of (y_i − y_j)(2P_ij − 1). Our sums over i < j were then modified to sums over i, j : y_i < y_j. However, the use of tanh functions as the ANN activation functions allows the network to create ties in y (within machine precision) for very many pairs at once. Our initial runs highlighted that this was indeed a problem, as some ANNs got caught in local minima with essentially all outputs equal. We therefore made a late change, where the informativeness of a pair is always determined by |2P_ij − 1|, while concordance is only considered for pairs with y_i ≠ y_j. The final definition of τ̃ is therefore:

τ̃ = ∑_{i,j: y_i<y_j} (2P_ij − 1) / ∑_{i<j} |2P_ij − 1|    (2.21)

where indices are sorted so that y_i ≤ y_{i+1}.
2.3 ANN
Artificial neural networks can be constructed in multiple ways, but we will make use of the feed-forward neural network, which is designed to transform a number of input values into a single output value (or a few) that matches the target. A feed-forward neural network consists of an input layer, an output layer and a number of hidden layers in between. Each layer contains a number of nodes. The nodes in a layer are connected to all nodes in the preceding and the succeeding layer, with the connections going towards the output. The connections have weights determining the importance of the previous node to the following node. An example of a feed-forward network can be seen in figure 2.
Figure 2: Example of a Feed-forward Neural Network with two input values, two hidden
nodes and one output node.
Given a set of inputs and targets, the network can be trained to map the input to the
target by adjusting the weights to minimize an error function, often the squared difference
between the network output and the target.
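The forward pass of such a network, with the tanh activation used later in section 3.2, can be sketched as follows (the weights and sizes are illustrative):

```python
import math

def forward(x, w_hidden, b_hidden, w_out, b_out):
    """One pass through a feed-forward net with a single hidden layer.
    Each hidden node applies tanh to a weighted sum of the inputs;
    the output node applies tanh to a weighted sum of the hidden values."""
    hidden = [math.tanh(sum(w * xi for w, xi in zip(ws, x)) + b)
              for ws, b in zip(w_hidden, b_hidden)]
    return math.tanh(sum(w * h for w, h in zip(w_out, hidden)) + b_out)

# Two inputs, two hidden nodes, one output, as in figure 2.
y = forward([0.5, -1.0],
            w_hidden=[[0.1, 0.2], [-0.3, 0.4]],
            b_hidden=[0.0, 0.1],
            w_out=[0.7, -0.5],
            b_out=0.0)
print(y)
```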
The most commonly used technique to train a network is backpropagation [11] or a
variant of it [12]. However, this requires a differentiable error function. The C index and
our proposed alternative are rank based error functions, and not differentiable. We do
not try to minimize the difference between the target and the output but instead order
the output for different patients in the same order as their corresponding targets. As
in [2] we will in this project use genetic algorithms. Networks which use this type of
algorithms are referred to as genetic networks and can be trained on rank based error
functions. First, a number of randomly initiated networks are created. The networks are
represented by genomes, as shown in figure 3. The networks are evaluated with the aid of
the error function and then ordered by their performance. The algorithm then either chooses one network at random, which is adjusted slightly (mutated), or chooses two networks at random (with a higher probability of choosing the "better" networks) and creates two new networks which inherit some characteristics (weights) from one or both "parent networks". The latter method is meant to mimic reproduction. The methods are illustrated in
figure 4. The new networks are also evaluated and the number of networks is kept constant
by removing the worst networks. This process is repeated until a small enough error or a
maximum number of iterations has been reached.
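The evolution loop described above can be sketched roughly as follows; the population size, mutation scale and selection scheme are illustrative choices, not the settings used in the thesis:

```python
import random

def evolve(score, genome_len, pop_size=20, generations=50, seed=0):
    """Tiny genetic algorithm: genomes are flat weight vectors kept
    sorted by score; each step either mutates one parent or crosses
    two, then the worst genomes are dropped."""
    rng = random.Random(seed)
    pop = [[rng.uniform(-1, 1) for _ in range(genome_len)]
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=score, reverse=True)               # best first
        if rng.random() < 0.5:                          # mutation
            parent = pop[rng.randrange(pop_size // 2)]  # bias to better half
            children = [[w + rng.gauss(0, 0.1) for w in parent]]
        else:                                           # crossover
            a, b = rng.sample(pop[:pop_size // 2], 2)
            cut = rng.randrange(1, genome_len) if genome_len > 1 else 0
            children = [a[:cut] + b[cut:], b[:cut] + a[cut:]]
        pop.extend(children)
        pop.sort(key=score, reverse=True)
        pop = pop[:pop_size]                            # drop the worst
    return pop[0]

# Maximizing a toy score; in our setting, score(genome) would decode the
# genome into network weights and return the C index or Soft C.
best = evolve(lambda g: -sum((w - 1) ** 2 for w in g), genome_len=3)
```

Here a genome is simply a flat list of network weights, as in figure 3, and the score function can be any rank-based error, which is the point of using this training method.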
Figure 3: Example of a Feed-forward Neural Network and how it can be represented by a
set of numbers. The set is referred to as the genome representation of that network. The
genome can be subject to the mutation and cross-over methods as illustrated in figure 4.
Figure from [1].
Figure 4: Illustrative figure showing the evolution mechanisms of a genetic network. Figure
from [1].
3 Method
The ANN module (written in C++) created by J. Kalderstam [13] was modified to allow
for calculation of Soft C and a main program was written in Python. The Lifelines [14]
package by C. Davidson-Pilon was used for Cox Regression calculations.
3.1 Datasets
Table 2: Datasets used in the project. For the FLchain dataset, 900 samples were chosen at random from the initial dataset of 7871 samples for computing-time reasons.

Dataset | Size | Events    | Censored  | Input Features
Lung    | 228  | 165 (72%) | 63 (28%)  | 7
Mayo    | 312  | 125 (40%) | 187 (60%) | 17
FLchain | 900  | 247 (27%) | 653 (73%) | 7
All datasets are publicly available from [15]. Specific datasets are from [16], [17] and [18].
The datasets vary in size, number of input features and level of censoring. The data
turned out to vary in difficulty, both for Cox Regression and ANN. The datasets are
therefore expected to cover a wide variety of problems.
The datasets were normalized by setting continuous variables to have zero mean and a
variance of one. Discrete variables were converted to binary form with one binary variable
for each possible value. If a value was missing, the mean of the existing variable values
was inserted.
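The preprocessing described above can be sketched as follows (the helper names are ours, and computing the variance after imputation is our assumption, not a detail stated in the thesis):

```python
def normalize_continuous(column):
    """Zero mean, unit variance; missing values (None) are replaced
    by the mean of the existing values first."""
    present = [v for v in column if v is not None]
    mean = sum(present) / len(present)
    filled = [mean if v is None else v for v in column]
    var = sum((v - mean) ** 2 for v in filled) / len(filled)
    std = var ** 0.5 or 1.0          # guard against constant columns
    return [(v - mean) / std for v in filled]

def one_hot(column):
    """Convert a discrete variable to one binary column per value."""
    values = sorted(set(column))
    return {v: [1 if x == v else 0 for x in column] for v in values}

print(normalize_continuous([1.0, None, 3.0]))   # mean 2.0 is imputed
print(one_hot(["a", "b", "a"]))
```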
3.2 Network Structure
The feed-forward network used had a number of input nodes corresponding to the number
of input features, varying with each data set. The network had a single output node since
the only required information is an index which can be used for sorting. All networks had
a single hidden layer with a number of hidden nodes ranging from one to three to make it
possible to observe the effects of overtraining and too simple networks.
After initial testing (see section 4.1), the number of hidden nodes was set to 1.
The activation function of the hidden nodes and the output node was the hyperbolic tangent function, resulting in values ∈ [−1, 1]. Any monotonically increasing function for the output node would result in the same ordering of output indices.
[Flow scheme: the dataset enters K-fold cross-validation (K = 3, with one third removed for validation); for each fold, two networks are created and trained, one on the C index and one on the Soft C, alongside a Cox Proportional Hazards model; all are validated on part K, and the process is repeated.]
Figure 5: Process flow scheme of the model.
3.3 Performance Measurements
Networks were trained using K-fold cross-validation (K = 3). In this method, the data
used was split into three subsets, where two were used for training and one for validation.
The roles were swapped so that each third was used for validation once. For each number of
hidden nodes (1-3 during initial tests and a single hidden node in later tests), one network
was trained on the C index and one on the Soft C. For each network, a training performance and a validation performance were calculated during each fold, and the averages over the K folds were compared to one another. In the case of differing numbers of hidden nodes, all networks were compared against all others.
Since both error functions are rank-based and use network outputs as arbitrary indices
for sorting, it is impossible to evaluate them using any error function which aims to pinpoint
the exact survival time. Therefore, the performance of the networks was evaluated using
the same error functions that they were trained on. Both error functions were used during
validation to avoid any bias towards one or the other error function.
For each K-fold, a Cox Regression model was fitted to the training part and validated
using the C index on the validation data. An average over one iteration of K-folds was
calculated and compared to the average validation C index for the network trained on
the Soft C for the same iteration. The C index was used for validation to avoid any bias
towards Soft C since Cox handles censored events similarly to the C index.
This process was repeated 100 times and the number of times the network trained on
the Soft C gave a higher C index and Soft C than the network trained on the C index and
the number of times the network trained on the Soft C outperformed Cox Regression were
recorded.
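The cross-validation loop can be sketched as follows, with train and evaluate as placeholders for the actual network training and error-function evaluation:

```python
import random

def kfold_indices(n, k=3, seed=0):
    """Shuffle range(n) and split it into k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[f::k] for f in range(k)]

def compare(data, train, evaluate, k=3):
    """For each fold: train on the other k-1 parts, validate on the
    held-out part, then average the validation scores over the folds."""
    folds = kfold_indices(len(data), k)
    scores = []
    for f in range(k):
        val = [data[i] for i in folds[f]]
        trn = [data[i] for g in range(k) if g != f for i in folds[g]]
        scores.append(evaluate(train(trn), val))
    return sum(scores) / k
```

In the actual experiments, this loop would be run once per error function and once for Cox Regression, and the resulting averages compared; repeating it 100 times gives the counts reported below.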
4 Results
To investigate the different characteristics of the error functions, the ranks of the training
and validation data given by the Soft C were plotted against the ranks given by the C
index. However, as can be seen in figures 6 and 7, no distinguishable characteristics were
observed.
Figure 6: Comparison of the rankings of the training data made by a network trained
on the Soft C with that of one trained on the C index. There does not seem to be any
significant difference between how the two methods rank censored and uncensored samples.
Figure 7: Comparison of the rankings of the validation data made by a network trained
on the Soft C with that of one trained on the C index. There does not seem to be any
significant difference between how the two methods rank censored and uncensored samples.
4.1 Number of hidden nodes
After initial testing, the number of hidden nodes was set to one, the reason being that the networks with a single hidden node seemed to perform the best for both error functions. An example of this can be seen in table 3. This was not the case for the Lung dataset, where networks with a higher number of hidden nodes performed equally well or better. However, there seemed to be no correlation between the number of hidden nodes and the best performing error function. To eliminate as many variables as possible, other than the error functions, we therefore limited all further studies to networks with a single hidden node.
Table 3: Number of times a network with a certain number of hidden nodes, trained on a specific error function, performed the best when validated using the C index on the Mayo dataset. Average denotes the mean C index of all networks trained on the listed error function.

Error Function | 1 Hidden Node | 2 Hidden Nodes | 3 Hidden Nodes | Total | Average
C index        | 37            | 6              | 11             | 54    | 0.824
Soft C         | 32            | 7              | 7              | 46    | 0.823
4.2 Soft C vs. C
Table 4: Number of times a network with a single hidden node, trained on the Soft C, performed better than a similar network trained on the C index.

Dataset | Validated with C index | Validated with Soft C
Lung    | 52/100                 | 51/100
Mayo    | 56/100                 | 47/100
FLchain | 33/100                 | 38/100
With the number of hidden nodes set to one, the results in table 4 show that there is close to an equal chance for both the Soft C and the C index to perform better on the Lung and Mayo datasets, whereas for the FLchain dataset, the C index seems significantly better.
The method of evaluation was chosen in the hope that any bias towards either of the error functions would be negated. Indeed, there does not seem to be any noticeable correlation between a network performing better on validation with a certain error function and the error function used to train the network. This indicates that the evaluation method is reliable.
4.3 Soft C vs. Cox Regression
Table 5: Number of times a network with a single hidden node, trained on the Soft C index, performed better than Cox Regression when validated using the C index, and a comparison of the mean C index for the two methods.

Method                                           | Lung dataset | Mayo dataset | FLchain dataset
Times the Soft C ANN outperformed Cox Regression | 69/100       | 100/100      | 87/100
Average C index for the ANN                      | 0.628        | 0.831        | 0.809
Average C index for Cox Regression               | 0.624        | 0.670        | 0.743
When compared to Cox Regression, a network trained on the Soft C, with a single hidden node, seemed to perform better overall. Since a single hidden node was used, the decision boundary of the ANN is linear, like that of Cox Regression. Therefore, no advantage comes from the foremost reason for using an ANN instead of Cox Regression. This indicates that there is an advantage in using the Soft C over Cox Regression. However, as seen in table 5, the difference in performance seems to be dataset dependent.
5 Discussion and Outlook
From the theory, there should be a tendency for networks trained on the C index to rank
censored patients higher (later) than a network trained on the Soft C since the C index
does not punish a network for ordering all censored patients last, whereas the Soft C does.
Furthermore, the C index does not punish a network for placing the survival time of a
censored sample directly after its censor time whereas the Soft C aims to place it slightly
later (because of the marginal distribution). However, from figures 6 and 7 this does not seem to affect the rankings in any noticeable way, neither during validation nor during training.
For two of the datasets, the results showed a very similar performance of the Soft C and
the C index. It is not possible to draw any conclusions from this other than that the error
functions seem to be performing equally well on the datasets we have used. An exception
to this is the FLchain dataset where the C index is the clearly preferred method. There
was not enough time to investigate the reasons for this closely but the results indicate that
the preferred method may depend on the problem at hand.
In comparison with the Cox method there was a preference for the Soft C trained ANN.
This seems to suggest some advantage in using the Soft C method over Cox Regression.
Again, time was the limiting factor for investigating the reasons behind this advantage.
A way to test the real strengths of the methods against each other could be to create
a synthetic dataset and artificially censor it. Then one would have access to the real
event times for the censored samples and could compare how well both methods perform
evaluated on the "true" ranking [2]. When the results came back inconclusive, this was proposed as the next step in the project, together with an investigation into why they were inconclusive. Unfortunately, there was not enough time, and that research has to be performed after the completion of this project.
An interesting investigation for the future is the correlation with the level of censoring. Since the Soft C uses the real events to estimate the distribution of the censored events, it is possible that, in a dataset with too high a level of censoring, the real events are too few to represent the whole dataset accurately.
5.1 Conclusion
We have implemented the modifications to the handling of censored data introduced in [3] and shown that it is possible to train an ANN on a version of the C index where this modification is applied.
From our results, no error function seems to significantly outperform the other. The
results mostly indicate that the two error functions are equal in merit. There are still
questions about, for example, the level of censoring, which will have to be answered in the
future by constructing a synthetic dataset where the level of censoring can be varied.
Compared to Cox Regression, however, initial results seem to argue for Soft C to be
the preferred method. Nevertheless, the differences must be investigated further before
any conclusion can be made.
References
[1] J. Kalderstam, Neural Network Approaches To Survival Analysis (2015)
http://lup.lub.lu.se/record/5364868
[2] J. Kalderstam et al., Training artificial neural networks directly on the concordance
index for censored data using genetic algorithms. Artificial Intelligence in Medicine
2013;58(2):125-132.
[3] J. Kalderstam, P. Edén, J. Nilsson and M. Ohlsson, A regression model for survival data using neural networks. 2015 (submitted); LU TP 15-08
[4] D.R. Cox, Regression models and life-tables. Journal of the Royal Statistical Society.
Series B: Methodological 1972;34(2):187-220.
[5] F. Harrell, K. Lee and D. Mark, Multivariable prognostic models: issues in developing
models, evaluating assumptions and adequacy, and measuring and reducing errors.
Statistics in Medicine 1996;15(4):361-387.
[6] M. Kendall, A New Measure of Rank Correlation. Biometrika. 1938;30(1-2):81-93
[7] E. Kaplan and P. Meier, Nonparametic Estimation from Incomplete Observations.
Journal of the American Statistical Association 1958;53(282):457-481
[8] V. Van Belle, K. Pelckmans, J. Suykens and S. Van Huffel, Additive survival least-squares support vector machines. Statistics in Medicine 2010;29(2):296-308.
[9] M.J. Bradburn, T.G. Clark, S.B. Love and D.G. Altman, Survival Analysis Part I:
Basic concepts and first analyses. British Journal of Cancer. 2003;89:232-238
[10] M.J. Bradburn, T.G. Clark, S.B. Love and D.G. Altman, Survival Analysis Part II:
Multivariate data analysis – an introduction to concepts and methods. British Journal
of Cancer. 2003;89:431–436
[11] D.E. Rumelhart, G.E. Hinton and R.J. Williams, Learning representations by back-propagating errors. Nature. 1986;323:533-536
[12] C. Igel and M. Hüsken, Improving the Rprop Learning Algorithm. Proceedings of the
Second International Symposium on Neural Computation NC’2000:115-121
[13] J. Kalderstam, Artificial Neural Networks package for Python focused on survival
data. https://github.com/spacecowboy/pysurvival-ann
[14] C. Davidson-Pilon, Lifelines, (2016), Github repository, Survival analysis in Python.
https://github.com/CamDavidsonPilon/lifelines
[15] T.M. Therneau and T. Lumley (2014), Package 'survival': Survival analysis. Published on CRAN
[16] C.L. Loprinzi, J.A. Laurie, H.S Wieand, J.E. Krook, P.J. Novotny et al., Prospective
evaluation of prognostic variables from patient-completed questionnaires. Journal of
Clinical Oncology 1994;12:601–607.
[17] T.M. Therneau, Modeling survival data: extending the Cox model. Springer.(2000)
[18] A. Dispenzieri, J.A. Katzmann, R.A. Kyle, D.R. Larson, T.M. Therneau, et al. Use
of nonclonal serum immunoglobulin free light chains to predict overall survival in the
general population. Mayo Clinic Proceedings. Elsevier. 2012;87:517–523
6 Acknowledgements
I would like to thank Patrik Edén for supervising me through this project, for always being
there when I needed help, and for putting as much effort into this project as I have.
I would also like to thank Jonas Kalderstam for letting me use his code, and other
pieces of work, and for his extraordinary help with setting up the program.
Additionally, I would like to thank all my friends at the Theoretical Physics department
and my family for their support during this project.