NESUG 2012
Coders' Corner
Combining Contiguous Events and Calculating Duration
in Kaplan-Meier Analysis Using a Single Data Step
Hui Song, PRA International, Horsham, PA
George Laskaris, PRA International, Horsham, PA
ABSTRACT
Many studies require contiguous events (e.g., events one or two days apart) to be combined before survival analysis. Because individual events can overlap in many different ways, it is challenging to effectively (1) combine the contiguous events, (2) calculate their duration (and other time-to-event parameters), and (3) output only the combined events to a dataset. In this work, we present a clean, single-data-step approach that addresses all three challenges efficiently. A sample scenario is used to illustrate the process.
INTRODUCTION
Kaplan-Meier (KM) analysis, or survival analysis, represents a set of statistical methods that estimate lifetime or length of time between two events of a given event-of-interest (or EOI). Survival data are often
analyzed in terms of time-to-event parameters, such as time-to-first-event (or onset), time-to-resolution,
duration, etc.
If there is at least one event as specified in a given EOI, the days to the earliest on-study event will be the
time-to-first-event. Note that, if there are no events for the given EOI, time-to-first-event will be the time
to the censor date per statistical analysis plan (or SAP). Duration is the difference between the (imputed)
start date and the (imputed) end date. If an event does not have an imputed end (or start) date, it should
be censored per SAP. For time-to-resolution, only those events that cross the last dose date (i.e., start
before the last dose date and end after) are of concern.
The calculation of time-to-event parameters is straightforward until contiguous events must be combined for KM analysis. Contiguous events are events that are a short time period apart (e.g., one or two days), as defined per SAP. Combining the contiguous events before KM analysis is required in many studies. Given this requirement, the derivation of time-to-event parameters becomes extremely challenging for the following reasons.
First, a given EOI may have hundreds of adverse event (AE) records. In addition, the individual events may overlap in many different ways. One needs to differentiate contiguous events from those that are not so that the combination can be done correctly. Second, calculating the time-to-event parameters (such as duration) on the fly is difficult given all the different combinations of events. Third, efficiently outputting the combined events on the fly is a challenge of its own.
In this abstract, we present a clean, efficient, single-data-step approach that combines the contiguous events and calculates the time-to-event parameters on the fly. In particular, we show how to design and implement the proposed scheme with robustness and efficiency in mind.
DESCRIPTION OF METHOD
Several approaches can be used to address the issues mentioned above. The straightforward one is to go through all the events (for a given EOI) in multiple rounds. In each round, adjacent events that are contiguous are combined into one. The process continues until no more combinations can be made. Some preliminary tests showed that, in order to combine all the events correctly for our study, it might take up to 30 rounds for the algorithm to converge. In this sense, the algorithm's running time is 30*n (where n is the total number of events). This simple solution becomes inefficient when the number of AE events increases to thousands or more.
A second approach is to combine the events in groups of two in the first round. In the next round, we combine pairs of the combined events from the previous round, and so on until the log(n)-th round, in which the algorithm converges. A simple illustration can be seen below.
Round 1: (1 2) (3 4) (5 6) (7 8)… (n-3 n-2) (n-1 n)
Round 2: (1 2 3 4) (5 6 7 8)… (n-3 n-2 n-1 n)
…
The running time will be n. However, it is hard to implement because of how data are processed in the data step. (Due to space limitations, more detailed discussion and results are not included in this abstract.)
In this abstract, we present a method that has a linear running time of n, utilizing SAS features such as the RETAIN statement. In addition, it combines and outputs the combined contiguous events on the fly. The two critical techniques used here are the RETAIN statement and a look-ahead mechanism. The RETAIN statement is used to maintain some flags across observations (during a data step), telling SAS whether to check for combination or to selectively output an observation at the end of the data step iteration. By looking ahead (to get the AE start and end dates of the next event, etc.), the method can decide how to set the flags or update the parameters of the combined event (such as its start and end dates) when necessary to facilitate the on-the-fly processing.
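As a minimal illustration of how RETAIN carries a value from one observation to the next, the sketch below keeps a running maximum end date per subject. The data set names demo and running_max and the toy records are assumptions made for illustration only; the variable names aeendti and lstdt follow the notation used later in this abstract.

```sas
/* Toy data (hypothetical): three events for one subject */
data demo;
  input subjid $ aeendti :date9.;
  format aeendti date9.;
  datalines;
S1 12FEB2007
S1 02APR2007
S1 20MAR2007
;
run;

data running_max;
  set demo;
  by subjid;
  retain lstdt;                       /* value survives across iterations */
  if first.subjid then lstdt = aeendti;   /* reset for each new subject   */
  else lstdt = max(lstdt, aeendti);       /* latest end date seen so far  */
  format lstdt date9.;
run;
```

Without the RETAIN statement, lstdt would be reset to missing at the start of every iteration, so the comparison with the previous observations would be lost.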
Here is a brief description of the look-ahead mechanism. The one we adapted (SASCOMMUNITY.ORG,
2011) can be described as follows.
 1  data ae1;
 2    set ae;
 3    by subjid;
 4    retain …;
 5    set ae (firstobs=2 keep=aestdti aeendti
 6            rename=(aestdti=next_aestdti aeendti=next_aeendti) )
 7        ae (obs=1 drop=_all_);
 8    next_aestdti=ifn(last.subjid, (.), next_aestdti);
 9    next_aeendti=ifn(last.subjid, (.), next_aeendti);
    …
10  run;
In the DATA step above, we have two SET statements (Line 2 and Line 5), both reading from the same data set but at different observations. In each iteration of the DATA step, the first SET statement (Line 2) reads one observation. The second SET statement (Line 5) has two input data sets. The first one (Line 5) reads the observation that follows the one read on Line 2, by specifying the data set option firstobs=2. In addition, aestdti and aeendti are renamed to next_aestdti and next_aeendti, respectively.
The next time the first SET statement is executed, it reads the next observation, and so on. In the last iteration, while the very last observation is read on Line 2, no observation is left for the data set on Line 5. Note that the data step ends as soon as any SET statement attempts to read past the last record of its input. As a result, without further provision, SAS would quit the data step without outputting this very last observation (Line 2). The second data set, on Line 7, supplies one extra (empty) observation to make sure that the very last observation is output as well.
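The look-ahead step can be tried on a small data set. The following is a sketch under stated assumptions: the three toy records and the subject identifiers S1 and S2 are hypothetical, and the elided RETAIN statement is omitted because it is not needed to demonstrate the look-ahead itself.

```sas
/* Toy data (hypothetical), already sorted by subjid */
data ae;
  input subjid $ aestdti :date9. aeendti :date9.;
  format aestdti aeendti date9.;
  datalines;
S1 06FEB2007 12FEB2007
S1 12FEB2007 02APR2007
S2 01MAR2007 05MAR2007
;
run;

data ae1;
  set ae;
  by subjid;
  /* Second SET reads one observation ahead; the trailing one-row,
     zero-variable data set keeps the step alive for the last record */
  set ae (firstobs=2 keep=aestdti aeendti
          rename=(aestdti=next_aestdti aeendti=next_aeendti))
      ae (obs=1 drop=_all_);
  /* On a subject's last record the look-ahead values are not usable */
  next_aestdti = ifn(last.subjid, ., next_aestdti);
  next_aeendti = ifn(last.subjid, ., next_aeendti);
  format next_aestdti next_aeendti date9.;
run;
```

In this sketch ae1 keeps all three observations. Only the first S1 record carries non-missing look-ahead dates, because the other two records are each the last record of their subject.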
The look-ahead mechanism makes it possible to decide whether we need to combine with the next event
ahead of time. However, the variety of event overlapping makes it hard to decide when the combination is
done (so that we can output it). If one overlapping scenario is not considered, the combination may be
totally wrong. In addition, since we intend to do it on the fly in one single data step, it will be hard to debug
unless we design the algorithm robustly, together with tracking flags for correctness verification.
Due to all these considerations, we follow a rigid algorithm design process to implement our method. The
process includes four steps: problem statement, algorithm outline, algorithm design, and algorithm implementation. In the following, we describe each step in more detail, followed by a sample scenario.
1. PROBLEM STATEMENT
As mentioned above, the problem to solve is to effectively (1) combine the contiguous events, (2) calculate their duration (and other time-to-event parameters), and (3) output the combined events only to a dataset, all on the fly. The following two examples illustrate a subset of the event overlapping scenarios and
what the expected results are.
Table 1 shows two events after combination: Event 1 consists of A + B + C, where the start date of B is
less than two days from the end date of A (similar for C). Event 2 consists of D, which occurs 2 or more
days after the end date of C and is therefore a separate event.
Table 1. Example Scenario 1
Original Event    Combined Event
Event A           Event 1
Event B           Event 1
Event C           Event 1
Event D           Event 2
[Timeline: events A, B, C, and D plotted against Study Day 1 to 9]
In Table 2, A and B should be combined into one event, Event 1, since they overlap. In addition, Event 1 is an unresolved event whose duration is calculated from the minimum of (AE start date for event A, AE start date for event B) to the maximum of (censored AE end date for event A, AE end date for event B).
Table 2. Example Scenario 2
Original Event    Combined Event
Event A           Event 1
Event B           Event 1
[Timeline: events A and B plotted against Study Day 1 to 9]
2. ALGORITHM OUTLINE
Fig. 1 shows the notations that we use in this abstract. Given the problem statement above, we sketch the algorithm as follows; it consists of five steps.
The first three steps are data preparation, which should be done in a separate data step to adjust aestdti and aeendti per SAP. The sorted data set (referred to as ae_sorted below) is then fed to the algorithm described in the next subsection (algorithm design). The design, implementation, and testing of the last two steps are the focus of the rest of the abstract.
The Algorithm Outline
For all records flagged for an EOI category (e.g., ae.skdcdn=1) do the following:
1) If aestdti is missing then set aestdti to the first dose date.
2) If aeendti is missing or if the aeendti is after data cutoff then censor aeendti to the data cutoff date if the subject is still on treatment, otherwise set to 30 days after last study drug administration date.
3) Sort the records by subjid, aestdti.
4) Increment the event line number by 1, set the event start date to the aestdti of the record, set the event end
date to the aeendti, and the event duration (kmdy) to the event end date minus the event start date plus 1.
5) If the difference between the event end date and the next aestdti is less than 2 and the aeendti is greater than
the event end date then set the event end date to aeendti and the event duration (kmdy) to the difference between aeendti and the event start date plus 1; if the next aestdti is greater than the event end date by 2 or
more then return to Step 4.
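Steps 1 to 3 of the outline might be prepared along the following lines. This is only a sketch: the variable names cutoffdt (data cutoff date) and ontrtfl (1 if the subject is still on treatment) are hypothetical, as are the toy records and the intermediate data set name ae_prep. The actual imputation rules should follow the SAP.

```sas
/* Toy input; cutoffdt and ontrtfl are hypothetical variable names */
data ae;
  input subjid $ skdcdn fdosdt :date9. ldosdt :date9.
        aestdti :date9. aeendti :date9. cutoffdt :date9. ontrtfl;
  format fdosdt ldosdt aestdti aeendti cutoffdt date9.;
  datalines;
S1 1 30JAN2007 22MAY2007 12FEB2007 .         30JUN2007 0
S1 1 30JAN2007 22MAY2007 .         12FEB2007 30JUN2007 0
S1 0 30JAN2007 22MAY2007 01MAR2007 05MAR2007 30JUN2007 0
;
run;

data ae_prep;
  set ae;
  where skdcdn = 1;                          /* records flagged for the EOI */
  /* Step 1: missing start date is set to the first dose date */
  if missing(aestdti) then aestdti = fdosdt;
  /* Step 2: censor a missing or post-cutoff end date */
  if missing(aeendti) or aeendti > cutoffdt then do;
    if ontrtfl = 1 then aeendti = cutoffdt;
    else aeendti = ldosdt + 30;
  end;
run;

/* Step 3: sort by subject and start date */
proc sort data=ae_prep out=ae_sorted;
  by subjid aestdti;
run;
```

The sketch does not record which end dates were censored; a flag for that would be needed to derive the resolved or unresolved status of a combined event.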
3. ALGORITHM DESIGN
Given the outline of the algorithm above, we describe the last two steps in SAS pseudo code below. Fig. 1 presents the notations used in the discussion.
Fig. 1. Notations
ae_sorted = the sorted AE dataset
skdcdn    = skin disorder code (1: yes 0: no)
fdosdt    = first dose date
ldosdt    = last dose date
aestdti   = imputed ae start date
aeendti   = imputed ae end date
kmstdti   = start date of the combined ae
kmendti   = end date of the combined ae
kmdy      = event duration
In each data step iteration, we RETAIN four flags.
a) censfln: event status, i.e., whether the combined event is a resolved or unresolved event (0: resolved 1: unresolved). If it contains any unresolved events, the combined event is treated as unresolved. Otherwise, it is resolved.
b) fstdt: the start date of the first event among all events of a combined event. In other words, it is the minimum aestdti of all events within a combined one.
c) lstdt: the maximum end date (aeendti) of all the events of a combined event.
d) contfln: this is an important flag, which tells SAS whether to try to combine the current event with the previous one. By default, it is set to 1. It is set to 0 if the combination stops at the current observation. This flag is also used to decide which observations are output. In this abstract, all observations with contfln=0 are output (as a combined event). We will see more clearly how it is used in the algorithm pseudo code below.
We introduced two auxiliary variables, next_aestdti and next_aeendti, to make it easier to compare the start and end dates of two adjacent events in the data set.
The core algorithm is presented in three separate figures, Fig. 2, 3, and 4, due to the page size limitation. The algorithm consists of four major components: RETAIN statement, Case 1, Case 2, and combined event output. The algorithm is written in SAS pseudo code and is largely self-explanatory. Here we summarize each component and discuss potential issues that need careful consideration.
Fig. 2 shows the first two components, which are simple. The RETAIN statement retains the four flags described above as the data step goes through its iterations. CASE 1 handles the situation where a subject has only one event (or record). In such a case, contfln is set to zero since the next event should not be combined with the current one. The censfln is set to zero since this is a resolved event.
CASE 2, the most complex part, is presented in Fig. 3 and Fig. 4. It is divided into three conditions. Note that all observations (events) need to be checked for Condition 1. Conditions 2 and 3 are mutually exclusive. In other words, one observation will fit either Condition 2 or Condition 3, but not both.
Fig. 2. Algorithm Pseudo Code (part 1 of 3)
DATA STEP BEGIN;
  SET ae_sorted; *the sorted AE dataset;
  RETAIN censfln fstdt lstdt contfln;
  CASE 1: the subject has only one record
  if first.subjid and last.subjid then do;
    *no event combination is needed;
    contfln=0; censfln=0; * set flags;
  end; *end of Case 1;
Fig. 3. Algorithm Pseudo Code (part 2 of 3)
DATA STEP (continued);
  ……
  CASE 2: if the subject has more than one record
    Condition 1: first record of a subject, reset flags
    if first.subjid then
    do;
      contfln=1; *set continued flag to 1;
      fstdt=aestdti; lstdt=aeendti; *initialization;
      censfln=0; *set resolved flag to 0 (resolved);
    end;
    Condition 2: check whether should be combined
    if contfln=1 and not last.subjid then
    do;
      a. should combine with next observation
      if (next_aestdti-kmendti<2 or next_aestdti<=lstdt+1)
      then do;
        update fstdt/lstdt;
        update kmstdti/kmendti;
        if aeendti<=next_aeendti
          then kmendti=next_aeendti;
        if next_aeendti>lstdt then lstdt=next_aeendti;
      end;
      b. should not combine with next observation
      else do;
        kmstdti=fstdt;
        if kmendti<lstdt then kmendti=lstdt;
        kmdy=kmendti-kmstdti+1;
        contfln=0; censfln=0; * set flags;
        aestdti/aeendti --> fstdt/lstdt;
      end;
    end; *end of Condition 2;

Fig. 4. Algorithm Pseudo Code (part 3 of 3)
DATA STEP (continued);
  ……
  CASE 2: (continued)
    Condition 3: event combination may not needed
    if contfln=0 or last.subjid then
    do;
      a. if contfln=1 and last.subjid then do;
        fstdt/lstdt --> kmstdti/kmendti;
        kmdy=kmendti-kmstdti+1;
        contfln=0; censfln=0; *set flags;
        aestdti/aeendti --> fstdt/lstdt;
      end;
      b. if (not last.subjid and contfln=0) and
         (next_aestdti-kmendti<2 or next_aestdti <= lstdt+1)
      then do; *should be combined;
        contfln=1; *set continued flag to 1;
        fstdt/lstdt-->aestdti/aeendti;
        censfln=0; *set resolved flag to 0 (resolved);
        kmstdti/kmendti-->fstdt/lstdt
        if aeendti<=next_aeendti
          then kmendti=next_aeendti;
        if next_aeendti>lstdt then lstdt=next_aeendti;
      end;
      c. else do;
        kmstdti=fstdt;
        if kmendti<lstdt then kmendti=lstdt;
        kmdy=kmendti-kmstdti+1;
        contfln=0; censfln=0; * set flags;
        aestdti/aeendti --> fstdt/lstdt;
      end;
    end; *end of Condition 3;
  end; *end of Case 2;
  OUTPUT THE COMBINED EVENTS
  if contfln=0;
DATA STEP END;
The three conditions for CASE 2 are listed below:
Condition 1: first record of a subject, reset flags
Condition 2: check whether should be combined
Condition 3: event combination may not be needed
Condition 1 handles the situation where the observation is the first event for a new subject. Since we have a new subject now, all the flags need to be reset appropriately. contfln is set to 1; by default, we assume combination is needed. fstdt and lstdt keep the minimum aestdti and maximum aeendti seen so far. They are used for checking whether event combination is necessary in the iteration process, and they are initialized to the aestdti and aeendti of the first event of a new subject. Finally, we set censfln to 0, assuming the event is resolved. Note that the first observation of a new subject should still be checked for Condition 2 or Condition 3.
Condition 2 describes the situation where the current observation should be combined with the previous event. We also check whether it should be combined with the next event. This check results in two branches in the program (Branch a and Branch b, as seen in Fig. 3).
In Branch a, we need to update fstdt, lstdt, kmstdti, and kmendti. The latter two (kmstdti and kmendti) are what we keep in the final output as the start and end dates of the combined event. Thus, they need to hold the earliest start date and the latest end date of the contiguous events. The former two should be updated as well so that the combination can be done correctly if combining is still needed for the next event (the one we look ahead to from the current observation).
In Branch b, we prepare the current observation for final output, since this is the last event of a series of contiguous events. The kmstdti and kmendti are set accordingly (kmstdti=fstdt; if kmendti<lstdt then kmendti=lstdt). kmdy, the duration, is calculated from kmstdti and kmendti. Then we set contfln to 0 so that the observation will be output at the end of the data step iteration. Finally, we reset the rest of the flags, as in CASE 1, except contfln.
The reason that contfln is not reset to 1 is that it is used for two purposes. First, it signals whether we should combine with the previous event (note that contfln is retained). Second, it is used for output: if it is 0, the observation is output; otherwise, the observation is not output since more combination may be needed. Bear this in mind as we proceed to Condition 3, where the contfln flag must be checked and handled carefully.
Now let us look at Condition 3, which has three branches. Branch a covers the case where the current record is the last observation of a given subject. Thus, no further combination is possible. It is processed as in the second branch (Branch b) of Condition 2.
Branch b is for the case where the event is the first event of a possible new contiguous event series. The current value of contfln is zero because the previous event was the last observation of a contiguous series and contfln was set to zero for output. That is also why, in this branch, we need to reset contfln to one again. In some sense, this branch is equivalent to Condition 1 plus Branch a of Condition 2. Finally, Branch c handles the remaining situations, where the event should be set for output, as before.
The last component of the algorithm is for outputting. If the contfln flag is set to zero, the observation is output to the final combined event dataset. It is this flag (contfln) that makes it possible to combine and output combined events on the fly. It is also this flag that needs careful handling, as we have seen in CASE 2.
4. ALGORITHM IMPLEMENTATION
Our implementation is done in SAS 9.1.3. Nevertheless, the algorithm applies to any SAS version. Given the pseudo code above, the implementation of the algorithm in SAS is straightforward and will not be discussed further.
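For readers who want something to run, the following is a simplified sketch of the single-pass idea. It is not the authors' implementation: it collapses the cases and conditions of Fig. 2 to 4 into one initialization branch and one look-ahead test, it omits censfln, and it assumes that ae_sorted is already sorted by subjid and aestdti with no missing dates. The output data set name km_events is hypothetical; the input records are those of the sample scenario in the next section.

```sas
/* Sample records, sorted by subjid and aestdti */
data ae_sorted;
  input subjid :$11. aestdti :date9. aeendti :date9.;
  format aestdti aeendti date9.;
  datalines;
ABC-XYZ-001 06FEB2007 12FEB2007
ABC-XYZ-001 12FEB2007 02APR2007
ABC-XYZ-001 19MAR2007 16APR2007
ABC-XYZ-001 02APR2007 02APR2007
ABC-XYZ-001 02APR2007 16APR2007
ABC-XYZ-001 09APR2007 16APR2007
ABC-XYZ-001 29APR2007 07MAY2007
;
run;

data km_events (keep=subjid kmstdti kmendti kmdy);
  set ae_sorted;
  by subjid;
  retain fstdt lstdt contfln;
  /* Look-ahead: start date of the next record */
  set ae_sorted (firstobs=2 keep=aestdti rename=(aestdti=next_aestdti))
      ae_sorted (obs=1 drop=_all_);

  /* Start a new combined event for a new subject or after an output */
  if first.subjid or contfln = 0 then do;
    fstdt = aestdti;
    lstdt = aeendti;
  end;
  else lstdt = max(lstdt, aeendti);

  /* Contiguous if the next event starts less than 2 days after lstdt */
  if last.subjid then contfln = 0;
  else contfln = (next_aestdti - lstdt < 2);

  /* Output only when the current combined event is complete */
  if contfln = 0 then do;
    kmstdti = fstdt;
    kmendti = lstdt;
    kmdy = kmendti - kmstdti + 1;
    output;
  end;
  format kmstdti kmendti date9.;
run;
```

On these seven records the sketch writes two observations to km_events: 06FEB2007 to 16APR2007 with kmdy=70, and 29APR2007 to 07MAY2007 with kmdy=9. The 2-day threshold is hard-coded here and would normally come from the SAP.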
THE SAMPLE SCENARIO AND RESULTS
Table 3 shows the AE events for a given EOI (skin disorder) for a subject. We will use this sample scenario to illustrate the presentation of our algorithm.
Table 3. Sample AE Event Data for Skin Disorder
Event  SUBJID       FDOSDT     LDOSDT     AESTDTI    AEENDTI    SKDCDN
1      ABC-XYZ-001  30-Jan-07  22-May-07  6-Feb-07   12-Feb-07  1
2      ABC-XYZ-001  30-Jan-07  22-May-07  12-Feb-07  2-Apr-07   1
3      ABC-XYZ-001  30-Jan-07  22-May-07  19-Mar-07  16-Apr-07  1
4      ABC-XYZ-001  30-Jan-07  22-May-07  2-Apr-07   2-Apr-07   1
5      ABC-XYZ-001  30-Jan-07  22-May-07  2-Apr-07   16-Apr-07  1
6      ABC-XYZ-001  30-Jan-07  22-May-07  9-Apr-07   16-Apr-07  1
7      ABC-XYZ-001  30-Jan-07  22-May-07  29-Apr-07  7-May-07   1
Fig. 5 is an illustration of the events to be combined. As can be seen, the seven events should be combined into
two events.
Fig. 5. AE Events in Timeline
[Timeline figure: events 1 to 7 plotted against a date axis with ticks at 2/1, 2/6, 2/12, 3/1, 3/19, 4/2, 4/9, 4/16, 4/29, and 5/7]
Table 4 shows the results with all flags’ information kept for illustration (note, subjid is not displayed).
Table 4. Contiguous Event Combination Results
Event  KMSTDTI    KMENDTI    KMDY  CENSFLN  CONTFLN  NEXT_AESTDTI  NEXT_AEENDTI
1      2/6/2007   2/12/2007  7     0        1        12-Feb-07     2-Apr-07
2      2/12/2007  4/2/2007   50    0        1        19-Mar-07     16-Apr-07
3      3/19/2007  4/16/2007  29    0        1        2-Apr-07      2-Apr-07
4      4/2/2007   4/2/2007   1     0        1        2-Apr-07      16-Apr-07
5      4/2/2007   4/16/2007  15    0        1        9-Apr-07      16-Apr-07
6      2/6/2007   4/16/2007  70    0        0        29-Apr-07     7-May-07
7      4/29/2007  5/7/2007   9     0        0        .             .
According to our algorithm, only the last two rows (highlighted) will be output (where contfln=0).
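The durations of the two output rows can be checked with plain date arithmetic, using the definition kmdy = kmendti - kmstdti + 1 from the pseudo code. The variable names kmdy1 and kmdy2 below are introduced only for this check.

```sas
data _null_;
  /* Combined event 1: 6-Feb-07 through 16-Apr-07 */
  kmdy1 = '16APR2007'd - '06FEB2007'd + 1;
  /* Combined event 2: 29-Apr-07 through 7-May-07 */
  kmdy2 = '07MAY2007'd - '29APR2007'd + 1;
  put kmdy1= kmdy2=;   /* writes kmdy1=70 kmdy2=9 to the log */
run;
```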
CONCLUSIONS
In this abstract, we presented a one-data-step process that merges contiguous events, calculates duration on the fly, and outputs only the combined events. We showed a four-step approach to design and implement the algorithm in a robust way. In the algorithm, we use one time-to-event parameter, duration, to illustrate our idea. Other time-to-event parameters can also be included in the calculation when necessary (such as event-free days that should be subtracted from the duration). We also used a sample scenario to illustrate our algorithm. The algorithm has proved to be efficient and robust in our successfully completed study. Note that there are many ways to combine contiguous events; this abstract shows just one of them.
REFERENCES:
SASCOMMUNITY.ORG, “Look-Ahead and Look-Back,” http://www.sascommunity.org/wiki/LookAhead_and_Look-Back (accessed August 21, 2012)
ACKNOWLEDGMENTS
SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS
Institute Inc. in the USA and other countries. ® indicates USA registration.
Other brand and product names are registered trademarks or trademarks of their respective companies.
CONTACT INFORMATION
Your comments and questions are valued and encouraged. Contact the author at:
Hui Song
PRA International Inc.
630 Dresher Road
Horsham, PA 19044
Work Phone: 215-444-8583
Email: [email protected]
George Laskaris
PRA International Inc.
630 Dresher Road
Horsham, PA 19044
Work Phone: 215-444-8575
Email: [email protected]