
DIAGNOSTIC METHODS
ELECTROCARDIOGRAPHY
Assessment of the performance of
electrocardiographic computer programs with the
use of a reference data base
Downloaded from http://circ.ahajournals.org/ by guest on June 16, 2017
JOS L. WILLEMS, M.D., PIERRE ARNAUD, M.D., JAN H. VAN BEMMEL, PH.D.,
PETER J. BOURDILLON, M.D., CHRISTIAN BROHET, M.D., SERGIO DALLA VOLTA, M.D.,
JENS DAMGAARD ANDERSEN, M.D., ROSANNA DEGANI, PH.D., BERNARD DENIS, M.D.,
MICHEL DEMEESTER, M.D., JOACHIM DUDECK, M.D., FRITS M. A. HARMS, M.D.,
PETER W. MACFARLANE, PH.D., GIANFRANCO MAZZOCCA, M.D., JORGEN MEYER, M.D.,
JORG MICHAELIS, M.D., JOS PARDAENS, D.Sc., SIEGFRIED J. PÖPPL, PH.D.,
BERNARD C. REARDON, PH.D., HENK J. RITSEMA VAN ECK, M.D.,
ETIENNE O. ROBLES DE MEDINA, M.D., PAUL RUBEL, M.SC., JAN L. TALMON, PH.D.,
AND CHRISTOPH ZYWIETZ, M.SC.
ABSTRACT To allow an exchange of measurements and criteria between different electrocardiographic (ECG) computer programs, an international cooperative project has been initiated aimed at
standardization of computer-derived ECG measurements. To this end an ECG reference library of 250
ECGs with selected abnormalities was established and a comprehensive reviewing scheme was devised
for the visual determination of the onsets and offsets of P, QRS, and T waves. This task was performed
by a group of cardiologists on highly amplified, selected complexes from the library of ECGs. With use
of a modified Delphi approach, individual outlying point estimates were eliminated in four successive
rounds. In this way final referee estimates were obtained that proved to be highly reproducible and
precise. This reference data base was used to study measurement results obtained with nine vectorcardiographic and 10 standard 12-lead ECG analysis programs. The medians of program determinations of
P, QRS, and T wave onsets and offsets were close to the final referee estimates. However, an important
variability could be demonstrated between measurements from individual programs and mean differences from the referee estimates amounted to 10 msec for QRS for certain programs. In addition, the
variances of all programs with respect to the referee point estimates were variable. Some programs
proved to be more accurate and stable when the data from high- vs low-noise recordings were analyzed.
Average Q wave durations calculated from ECGs for which programs agreed on the presence of a Q or
QS wave differed by more than 8 msec in several program-to-program comparisons. Such differences
may have important consequences with respect to diagnostic performance. Various factors that might
explain these differences have been determined. The present study demonstrates that to allow an
exchange of results and diagnostic criteria between different ECG computer programs, definitions,
minimum wave requirements, and measurement procedures urgently need to be standardized.
Circulation 71, No. 3, 523-534, 1985.
DURING the last decade rapid growth has occurred in
computer electrocardiographic (ECG) processing.1-3
At present, however, no standards for quantitative
ECG analysis exist. There is a lack of agreement on
definitions of waves and common measurements, standardized criteria for classification, and common terminology for reporting.4,5
The authors' academic affiliations are listed in the Common Standards
for Quantitative Electrocardiography (CSE) organizational structure
that appears before the references.
Supported in part by the Commission of the European Communities,
within the frame of its Medical and Public Health Research program
under project No. 82/616/EEE 11.2.2, and by local and national research
funding to different institutes in nine member states of the European
Economic Community.
Address for correspondence: Jos L. Willems, M.D., CSE Project
Leader, University Hospital of Gasthuisberg, 49, Herestraat, 3000 Leuven, Belgium.
Received May 30, 1984; revision accepted Nov. 8, 1984.
Vol. 71, No. 3, March 1985
To overcome some of these problems, an international project entitled Common Standards for Quantitative Electrocardiography (CSE) was initiated in the
European community.6-9 The principal objectives of
CSE are to establish recommendations for the standardization of computer-derived ECG measurements
and to obtain agreement on definitions of waves and on
references for the on- and offsets of P, QRS, and T
waves. In other words, when the same data are given
as input to any three computer programs, the ultimate
goal is to obtain the same measurement results, e.g.,
for Q durations. Only then can diagnostic ECG criteria
for myocardial infarction and other conditions be exchanged and possibly standardized. Means and variances of measurement results obtained by various programs analyzing a common data base should fall
within acceptable ranges.
A measurement reference library was therefore established through a comprehensive, interactive review
process that was performed by a group of cardiologists
on highly amplified ECG tracings. Methods used were
those that would ensure maximum quality and reproducibility of the resulting reference data base. The
purpose of the present report is to describe results
obtained by nine vectorcardiographic (VCG) and 10
standard 12-lead ECG computer programs analyzing
this data base, and to highlight the need for recommendations on more precise standards, rules of measurement, and definitions. The study does not aim to criticize individual programs but rather to provide a firm
foundation for improvement of all programs in the
future.
Methods
Study protocol and description of the data base. The CSE
Working Party consists of active participants from 20 institutions of the European community. In addition, investigators
from six North American and one Japanese center also collaborated in the project by processing data or as consultants.
The protocol of the study (the method of collection and analysis of the data base by a group of referee-cardiologists) has been
described in detail elsewhere.6-10 Briefly, from a group of digitized ECGs submitted to the coordinating center by five participating institutes, a sample of 250 was chosen that represented a
wide variety of ECG morphologies. The data were collected at
500 Hz with a resolution of at least 10 bits and a minimum
quantization level of 5 µV. They were recorded with equipment
meeting AHA standards in groups of at least three simultaneous
leads; the group included tracings of the standard 12 as well as
the Frank XYZ leads.
Because different ECG measurement programs have various
philosophies with respect to analysis, e.g., some select 1 beat
for analysis while others base results on an average beat, so-called artificial ECGs were also created. This was done by
selecting 1 beat from each of the lead groups of each of the 250
original recordings and by creating strings of identical beats
with stable RR intervals over 10 sec for the XYZ leads and over
5 sec for each three-lead group of the conventional 12 leads. The
selected beats were chosen by eye in such a way as to be close to
the dominant beat with the least possible baseline shift, noise,
and artifact. A variable segment was interlaced between the
beats to correct for possible offset artifacts. Another group of 60
artificial ECGs was composed from additional beats selected for
a study of beat-to-beat variation, so the total artificial ECG
library was composed of 310 recordings. Seventy of the artificial ECGs were recorded with six simultaneous leads, i.e., the
six peripheral and the six precordial leads.
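The construction of an artificial recording from one selected beat can be sketched as follows. This is a minimal illustration, not the procedure used in the coordinating center: the function name, the fixed RR interval in samples, and the linear ramp used as the interlaced segment between beats are all assumptions made for the example.

```python
def build_artificial_lead(beat, rr_samples, duration_s, fs_hz=500):
    """Repeat one selected beat at a fixed RR interval (hypothetical helper).

    beat        -- samples of the selected beat for one lead
    rr_samples  -- RR interval in samples (must be >= len(beat))
    duration_s  -- total length of the artificial recording in seconds
    fs_hz       -- sampling rate; the data base was sampled at 500 Hz
    """
    if rr_samples < len(beat):
        raise ValueError("RR interval shorter than the beat")
    gap = rr_samples - len(beat)
    # Assumption: bridge the end of one beat to the start of the next with a
    # linear ramp, so that no offset step appears between beats.
    start, end = beat[-1], beat[0]
    bridge = [start + (end - start) * (k + 1) / (gap + 1) for k in range(gap)]
    cycle = list(beat) + bridge
    total = round(duration_s * fs_hz)
    n_cycles = -(-total // rr_samples)  # ceiling division
    return (cycle * n_cycles)[:total]
```

With a 10 sec duration at 500 Hz this yields 5000 samples per lead, in which every beat is identical and the RR interval is constant.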
*CSE participants are listed before the references.
The 250 original and 310 artificial ECGs were randomly
divided into two sets containing nearly equal samples of each
pathologic entity. It was agreed that detailed results would be
made generally available only from data set 1, the so-called
training set, whereas summary results only would be available
from data set 2, the test set. This was done to prevent the
processing centers from adapting their programs based on the
referee results of the test set.
Analysis by the referees. The beats selected for the artificial
library have been analyzed by a board of referee-cardiologists
from five different countries. The referees had experience in
computer-assisted ECG interpretation, but to avoid bias had not
been involved in program development. An overview of their
analysis is presented in figure 1.
In view of the well-known interobserver and intraobserver
variability in determining wave recognition points, an elaborate
reviewing scheme, consisting of four rounds, was devised. With
the use of a modified Delphi approach,11 individual referee
outliers were eliminated from the analysis in successive steps,
an outlier being a point estimate that differs considerably from
the median result.
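The first-round rule can be sketched in a few lines. This is an illustrative sketch only: the tolerance `d1_ms` stands in for the D1 limit, whose numerical value is not given in the text, and treating "within D1" as an inclusive comparison is an assumption.

```python
from statistics import median

def first_round_decision(estimates_ms, d1_ms, min_agree=4):
    """Return (median, accepted) for one fiducial point.

    estimates_ms -- one point estimate per referee, in msec
    d1_ms        -- hypothetical tolerance around the median (the D1 limit)
    min_agree    -- number of referees that must lie within the tolerance
    """
    med = median(estimates_ms)
    n_close = sum(1 for e in estimates_ms if abs(e - med) <= d1_ms)
    return med, n_close >= min_agree

def outlier_indices(estimates_ms, d_ms):
    """Indices of referees whose estimate lies farther than d_ms from the median."""
    med = median(estimates_ms)
    return [i for i, e in enumerate(estimates_ms) if abs(e - med) > d_ms]
```

If the point is accepted, the median serves as the referee estimate; otherwise the referees flagged by `outlier_indices` would be asked to review their measurement in the next round.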
The referees were asked to mark the group on- and offsets of
the P wave and the group end of the T wave, as well as the
individual on- and offsets of the QRS complexes in each lead
(figure 2) on highly amplified tracings written out at 500 mm/
sec and 100 mm/mV gain. The earliest onset and latest offset of
QRS in any lead was taken as the QRS group onset and offset,
respectively. These lead group onsets and offsets were used to
compute so-called isoelectric segments at the beginning and end
of QRS in each lead by measuring the distance to QRS onsets
and offsets determined in the respective single leads. In addition, the referees had to provide, per lead, a wave morphology
description (e.g., P + QRSR'T + or positive P and T wave and
an R' after a QRS complex). The referees completed their first-round analyses at home with Mingograph recordings. Reference
points were marked on the paper tracings and subsequently
81,450 points were transferred to the computer in the coordinating center. The subsequent rounds were performed on a subset
of ECGs in the coordinating center on a Tektronix 4010 graphics display terminal.
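The derivation of lead group limits and isoelectric segments from the per-lead QRS marks can be illustrated as follows. The function names and the dictionary layout are assumptions for the example; times are in msec relative to the start of the selected beat.

```python
def group_limits(lead_onsets, lead_offsets):
    """Earliest onset and latest offset over the leads of one lead group."""
    return min(lead_onsets.values()), max(lead_offsets.values())

def isoelectric_segments(lead_onsets, lead_offsets):
    """Per lead: (segment at QRS onset, segment at QRS offset), in msec.

    The onset segment is the delay of the single-lead onset after the group
    onset; the offset segment is the time by which the single-lead offset
    precedes the group offset.
    """
    group_on, group_off = group_limits(lead_onsets, lead_offsets)
    return {
        lead: (lead_onsets[lead] - group_on, group_off - lead_offsets[lead])
        for lead in lead_onsets
    }
```

For example, with onsets of 200, 210, and 204 msec in leads I, II, and III, the group onset is 200 msec and lead II shows a 10 msec isoelectric segment at the beginning of QRS.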
To test intraobserver variability the referees were given the
same beats of 26 ECGs on two other randomly selected occasions over a period of 1 year. Measurement precision could also
be assessed in the ECGs in which six leads were recorded
simultaneously but which were analyzed in sets of three. From a
theoretical point of view, wave onsets and offsets should occur
at the same time in simultaneously recorded unipolar and bipolar limb leads, since these leads are mathematically interrelated.
This is not necessarily the case for the precordial leads.
Processing by the computer programs. Ten centers in Europe, five in North America, and one in Japan, including those
of commercial groups, participated in the analysis of the ECGs.
Each of the cooperating centers had to present results of the
analysis on magnetic tape in an agreed format. Both the 250
original and the 310 artificial ECG recordings were processed
by a total of nine VCG and 10 standard 12-lead programs, which
are listed in table 1. Descriptions of these programs have been
published.12-15
The parameters measured were those of basic interval and
amplitude, i.e., P and QRS duration, PR and QT interval,
duration and amplitude of Q, R, S, R', S', and R'', and amplitude of the J point and of the positive and negative components
of the P and T waves. Time locations with respect to the beginning of the record or of the reference beat were requested, as
well as a copy of the raw data for the modal or averaged beat,
when applicable (see Discussion). Alignment of the respective
averaged beats with the beat analyzed by the referees was made
in the coordinating center by means of a cross-correlation method. As for the referee results, the earliest onset and latest offset
of QRS in any of the three corresponding leads were taken to
represent the computer QRS onset and offset for that lead group.
FIGURE 1. Summary of the different reviewing
rounds for the final determination of P, QRS, and T
onsets and offsets by the group of referees. The limits
D1 to D3 used for the deviations of the individual
from the median referee results are given in the inset
of the flow diagram.
[Flow diagram labels: 1st round; "At least 4 referees within D1 value from median" (yes/no); "Take median as referee estimate"; 2nd round, "Each referee reviews measurement with feedback"; 3rd round; 4th round.]
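Alignment of an averaged beat with the referee beat by cross-correlation can be sketched as a search for the lag that maximizes the normalized correlation of the overlapping samples. This is a generic illustration, not the coordinating center's implementation; the function name, the lag range, and the minimum-overlap safeguard are assumptions.

```python
from math import sqrt

def best_lag(reference, candidate, max_lag, min_overlap=None):
    """Return the lag (in samples) at which candidate[i + lag] best matches reference[i]."""
    if min_overlap is None:
        min_overlap = len(reference) // 2  # assumption: require half the beat to overlap
    best, best_score = None, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        pairs = [
            (reference[i], candidate[i + lag])
            for i in range(len(reference))
            if 0 <= i + lag < len(candidate)
        ]
        if len(pairs) < max(2, min_overlap):
            continue
        r_mean = sum(r for r, _ in pairs) / len(pairs)
        c_mean = sum(c for _, c in pairs) / len(pairs)
        num = sum((r - r_mean) * (c - c_mean) for r, c in pairs)
        den = sqrt(
            sum((r - r_mean) ** 2 for r, _ in pairs)
            * sum((c - c_mean) ** 2 for _, c in pairs)
        )
        if den == 0:
            continue  # flat segment, correlation undefined
        score = num / den
        if score > best_score:
            best, best_score = lag, score
    return best
```

At the 500 Hz sampling rate of the data base, a lag of one sample corresponds to 2 msec, so program time locations can be shifted by `2 * lag` msec before comparison with the referee estimates.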
Statistical analysis. Various listings and tables containing
results of referee-to-referee, program-to-program, and program-to-referee comparisons were returned as feedback to the
processing centers.
With respect to wave onsets and offsets, differences (algebraic and absolute) were calculated between final referee estimates
and the median, as well as between the estimates and individual
program results. This has been performed for both data sets
separately and combined, as well as for the ECGs divided into
those with the lowest and those with the highest noise content.
Details on the calculation of the noise content and the applied
ranking procedure have been reported elsewhere.16
With respect to the durations and amplitudes of the various
components of P, QRS, and T, differences were computed
between individual and median program results. This was done
for the artificial as well as for the original ECG recordings. The
referees were not asked to make such measurements on the
individual wave components. Median program results were
used to determine the minimum duration and amplitude of the
QRS waves that the referees could recognize confidently, i.e.,
at least four of the five referees had to agree on the presence or
absence of the specific wave component, and their wave onsets
and offsets of QRS had to fall within specific limits.
Parametric statistics were used to evaluate mean differences
and variances between program and referee results. Also, 99%
confidence intervals were calculated. Because one or two large
outliers might significantly distort variance figures,
2% of the cases with the highest differences for QRS onset and
offset and 3% for P and T wave results were deleted for each
program for this calculation. The agreement between programs
on the absence and presence of QRS waves was tested with
nonparametric analysis of variance (Friedman and Wilcoxon
tests).
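The trimming and confidence-interval calculation described under Statistical analysis can be sketched as follows. Two points are assumptions rather than details from the study: "highest differences" is read as largest absolute differences, and the interval uses a normal approximation (z = 2.576) rather than any specific method named by the authors.

```python
from math import sqrt
from statistics import mean, stdev

Z_99 = 2.576  # two-sided 99% quantile of the normal distribution (approximation)

def trimmed_differences(program_ms, referee_ms, trim_fraction):
    """Program minus referee differences after dropping the largest absolute ones.

    trim_fraction -- e.g. 0.02 for QRS onset and offset, 0.03 for P and T results
    """
    diffs = [p - r for p, r in zip(program_ms, referee_ms)]
    n_drop = round(len(diffs) * trim_fraction)
    return sorted(diffs, key=abs)[: len(diffs) - n_drop]

def mean_and_ci99(diffs):
    """Mean difference and its 99% confidence interval."""
    m = mean(diffs)
    half_width = Z_99 * stdev(diffs) / sqrt(len(diffs))
    return m, (m - half_width, m + half_width)
```

A mean far from zero indicates a systematic deviation from the referee standard, whereas a wide interval reflects measurement instability.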
Results
Number of measurements reviewed by the referees. The
percentage of measurements reviewed by each referee
during the second round amounted to 9.5% (1548 of
16,290) of the total. For P onset, P offset, T end, QRS
onset, and QRS offset this amounted to 8.0%, 12.6%,
14.6%, 7.8%, and 9.0%, respectively, of the total
number of measurements for each. The overall results
were not significantly different in the two data sets.
The number of measurements reviewed in the third
round averaged 3.0% (n = 486). Each referee reviewed between 1.6% and 3.5% of his measurements
during the so-called 4/1 review. For the five readers
combined, this amounted to 1975 measurements or
12.1% of the grand total. The fourth-round analysis
was performed on 340 and 363 measurements from
data sets 1 and 2, respectively. Modifications of the
third-round estimates were made on 66 measurements
in both sets combined.
FIGURE 2. Example of an enlarged beat (amplification x 10) given as feedback for the third-round discussion. In this case, P
end was discussed. Small vertical lines denote the five individual point estimates and long ones the median results. The values
close to the latter denote sample point locations relative to the onset of the selected beat. Note that individual referee estimates
may overlap and that QRS onset in lead II apparently starts 10 msec (five sample points) later than in lead I.
Interobserver variability. When individual referee results obtained after the second round were compared
with the final group estimates, minor but systematic
differences were observed (figure 3). Mean differences
were smallest for QRS and P onset, whereas they were
largest for T end. The SD of the differences was approximately 3 msec for QRS onset and equalled 5 to 6
msec for the end of QRS and for P on- and offset,
whereas for T end it varied between 12 and 20 msec.
Results obtained in data set 2 were concordant with
those in data set 1.
Reproducibility of referee results. Table 2 lists estimates of the reproducibility of the final group results
for the 26 ECGs that were analyzed three times during
the study period. Maximal differences between any
pair of the three repeat readings are listed. It can be
seen that for 89.0% of the measurements (347 out of
390), the final estimates of QRS onset were within 4
msec. For QRS offset, P onset, and P offset these
values were 76.4%, 72.4%, and 67.6%, respectively.
The repeat readings of T end were within 20 msec of
the originals 80.0% of the time.
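The reproducibility figures rest on the maximal difference between any pair of the three repeat readings. A minimal sketch of that calculation is given below; the function names are illustrative and the tolerance is passed in by the caller (4 msec for QRS and P, 20 msec for T end in the text above).

```python
from itertools import combinations

def max_pairwise_difference(readings_ms):
    """Largest absolute difference between any two repeat readings, in msec."""
    return max(abs(a - b) for a, b in combinations(readings_ms, 2))

def percent_within(repeat_sets, tolerance_ms):
    """Percentage of measurements whose repeat readings all agree within the tolerance."""
    hits = sum(
        1 for readings in repeat_sets
        if max_pairwise_difference(readings) <= tolerance_ms
    )
    return 100.0 * hits / len(repeat_sets)
```

Applied to the QRS onset readings, the same counting gives the reported 347 of 390 measurements, i.e., 89.0%.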
TABLE 1
Programs examined in the present study

CSE program No.   Program name    12 lead   XYZ   Version
2                 CIMHUB          yes       yes   June 81
3                 Louvain                   yes   1979
4                 Hannover                  yes   3
5                 HP              yes             3.4
6                 Giessen         yes       yes   1980
7                 IBM             yes             2-5890
8                 Nagoya          yes             BI
9                 Lyon                      yes   5.6
10                AVA                       yes   4.0
11                Glasgow                   yes   1976
12                Halifax         yes       yes   1980
13                Padova          yes             1980
14                Telemed         yes             6H
15                Modular         yes       yes   8101
16                Sicard-Riedel   yes             1980
FIGURE 3. Bar graph of differences (in msec) between individual referee estimates and final group results. Mean differences are
depicted by small vertical lines and 99% confidence intervals by horizontal bars. The long vertical lines denote zero difference.
Composite lead group results are presented for data set 1 and 2 combined (n = 310). For P measurements n = 261 due to
exclusion of ECGs showing atrial fibrillation, flutter, and atrioventricular junctional rhythm.
[Bar graph omitted: one row per measurement (onset P, end P, onset QRS, end QRS, end T), one column per referee, scale in msec from -10 to +10.]
With respect to the reproducibility of individual referee results, no significantly different results were obtained. Average deviations and corresponding SDs
were of the same order of magnitude for each referee.
Results from six-channel recordings. Reproducibility
and precision of referee results could also be derived
from the ECGs in which six channels were recorded
simultaneously. Determinations of QRS onset were
the most reliable. The onset of QRS in lead group I-III
differed by no more than 4 msec from the time location
obtained in the simultaneously recorded, but separately analyzed, lead group aVR-aVF in 91.4% of the
cases (64 out of 70). For QRS offset and P on- and
offset a difference of less than or equal to 4 msec was
observed in 87.1%, 88.2%, and 80.4%, respectively.
The difference for T end was less than 20 msec in 90%
of the cases. The differences between the point estimates derived from the bipolar and unipolar limb leads
varied symmetrically around zero and were significantly less (p < .01) than the differences noted between the precordial lead groups. P onset, P end, and
QRS onset were often determined earlier, whereas
QRS offset was often located later in V1-V3 than in the
simultaneously recorded lead group V4-V6.
TABLE 2
Reproducibility of median referee results

Max dif [A]   QRS onset   QRS offset   P onset [B]   P offset [B]   Max dif [A]   T end
(msec)        (%)         (%)          (%)           (%)            (msec)        (%)
0             36.2        31.0         22.9           9.5           0-2           17.7
2             42.8        35.4         31.4          42.9           4-6           24.6
4             10.0        10.0         18.1          15.2           8-10          12.3
6              3.3         9.7          6.7           7.6           12-14         13.1
8              3.8         3.1          5.7           5.7           16-18         12.3
≥10            3.8        10.8         15.2          19.0           ≥20           20.0

[A] Maximum differences (in msec) between medians of three repeat
readings in 26 cases. QRS results were derived from each of the 15 leads
(15 x 26 = 390), whereas P and T refer to lead group (5 x 26 = 130)
measurements.
[B] Five cases with atrial fibrillation excluded.
Isoelectric segments and small waves. From figure 4 it
is evident that so-called isoelectric segments of 10
msec and more were not uncommon at the beginning
or at the end of QRS, especially in leads I, aVR, aVL,
and X, where they occurred in 17% to 23% of the cases.
The smallest recognizable wave that could be detected in a reproducible manner on standard ECG recordings was studied by comparing the referees' wave
morphologic results and the measured values from the
programs. Scattergrams of duration and amplitude results for small Q and R waves reliably identified by the
referees (i.e., four of five reported a wave) demonstrated that the smallest detected QRS waves have an amplitude on the order of 20 µV and a duration of 6 msec.
Only a few programs detected waves of less than 30 µV.
Comparison of program with referee point estimates.
The median results of the programs were quite close to
the referee estimates. However, differences between
individual program results and the referee standard
were significantly larger. The bar graphs in figures 5
and 6 show the 99% confidence intervals for the mean
differences in P, QRS, and T onsets and offsets with
the referee results as reference. They demonstrate that
various program results deviate significantly from the
referee point estimates. Not only mean results, but also
the variances (indicating scatter around the referee
standard), differed from program to program. The
results obtained from data set 2, the test set, were
not significantly different from those from data set 1.
Variations for QRS on- and offset were larger on
ECGs indicating conduction defects (n = 47) and
slightly larger on ECGs indicating myocardial infarction (n = 91) than on tracings with normal QRS complexes (n = 67).
FIGURE 4. Bar chart of isoelectric segments found at QRS onset and offset in the 15 leads analyzed by the referees. Results
derived from the final referee estimates. n = 310 observations for each lead.
[Bar chart omitted: panels for QRS onset and QRS offset, one bar per lead, with segment duration classes in msec.]
FIGURE 5. Comparison of referee standard with individual and median lead group onsets and offsets determined by 10 standard
12-lead programs used to analyze data set 2 (n = 155). Means are depicted by small vertical lines and 99% confidence intervals,
after omitting outliers, by horizontal bars. The long vertical lines denote zero differences. Note that program performance is
characterized by measurement instability (width of bars) and systematic deviations (distance between short and long vertical
lines). Some results for programs 2 and 5 were missing.
[Bar graph omitted: rows for onset P, end P, onset QRS, end QRS, and end T in lead groups I-III, aVR-aVF, V1-V3, and V4-V6; one column per program and one for the median; scale in msec from -10 to +10.]
FIGURE 6. Comparison of program results with referee standard (mean differences and 99% confidence intervals). Individual
and median lead group results derived from the Frank XYZ leads by nine VCG computer programs for data set 2 (n = 155) are
shown.
[Bar graph omitted: rows for onset P, end P, end T, onset QRS, and end QRS in leads XYZ; one column per program and one for the median; scale in msec from -10 to +10.]
Comparison of program-to-program results. Agreement in reporting the presence or absence of Q waves
was attained 70% to 96% of the time, depending on the
leads analyzed. High agreement figures were obtained
for the right precordial leads, lower ones for the limb
leads. The agreement between different computer programs varied, on the average, between 80% and 90%.
The number of small Q and R waves (duration <12
msec and amplitude .50 µV) reported by the various
programs differed widely, as can be seen from table 3.
When average durations were calculated for those
ECGs for which programs agreed on the presence of a
Q or QS wave, then significant differences amounting
to more than 10 msec were found, as illustrated in
figure 7. These differences were apparent to the same
degree and in the same direction in the original and the
so-called artificial recordings.
Effect of noise on program results. A comparison of
point estimates from high- vs low-noise recordings
indicated that on the average computer-derived wave
onsets and offsets were shifted outward by noise (figure 8). However, this shift was significantly less for
some programs than for others.
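The high- vs low-noise comparison can be illustrated by ranking recordings on a noise measure, splitting them into halves, and comparing the mean algebraic differences in each half. The record layout and the `noise` field are hypothetical; the study's actual noise calculation and ranking procedure are described in its cited reference and are not reproduced here.

```python
from statistics import mean

def split_by_noise(records):
    """Split records into the 50% lowest and 50% highest noise ranks."""
    ranked = sorted(records, key=lambda rec: rec["noise"])
    half = len(ranked) // 2
    return ranked[:half], ranked[half:]

def mean_shift(records, key):
    """Mean algebraic difference (program minus referee, msec) for one fiducial point."""
    return mean(rec[key] for rec in records)
```

An outward shift shows up as a more negative mean difference for onsets and a more positive one for offsets in the high-noise half.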
Discussion
One objective of the CSE project is to reduce the variation of measurements made by computer
programs for interpreting the ECG. To this end a data
base with well-defined wave reference points was established. As might be expected, individual referee
results demonstrated a certain interobserver and intraobserver variability. However, this variability was
lower than in former studies17 because of the interactive reviewing process. Indeed, each of the four
successive rounds of the Delphi-type reviewing process led to smaller variances. When acting as a group,
the final results of the referees proved to be very stable
and can be supposed to be a valid standard reference.
Results for each recording of half the library (the so-called training set) have been published in a CSE Atlas18
and are available on magnetic tape. These results
can be used to test or refine wave recognition results of
ECG analysis programs in which three simultaneously
recorded leads are used.
A number of compromises were required for an effective implementation of the procedures used for
choosing the standard reference time points for ECG
wave onsets and ends.8 One such compromise was the
use of median values of the referees after the first and
second review rounds as the "correct" reference
points for evaluating interval measurements by the
programs. It is conceivable that occasionally the median value did not correspond to the most accurate reference point. However, the choice of the median values
was considered necessary to cope with the problems
caused by outliers, i.e., sporadic erratic measurements
in certain difficult or noisy records. On the other hand,
from statistical theory it is well known that means or
medians of multiple observations are more precise and
reproducible in reflecting the population truth than single estimates. In view of this it can be postulated that
another group of readers would produce similar results
within the statistical limits presented in this study.
Indeed, the median of the independent programs, each
with its built-in cardiologic experience, closely approached the final referee estimates (figures 5 and 6),
further supporting their validity. It can be assumed that
the interactive reviewing process resulted in a learned
response. However, such an effect on the readers at
most affected 12% of the measurements that were
made after the first-round analysis (figure 1). This
percentage only slightly decreased over the study period, indicating that the five readers remained independent. Although the referees were given low-pass (15
Hz) filtered recordings to assist them in localizing the
onset and offset of the P wave and the end of the T
wave, they had to indicate all the fiducial points on the
unfiltered high-gain recordings. Averaging techniques
and signal transformations such as spatial velocity or
magnitude curves were not provided since none of the
cardiologists used such signals in routine electrocardiography. In addition, this could have biased the analysis results toward certain algorithms. The standard
12-lead ECGs of the current data base were acquired
with the conventional three-channel sequencing,
which is known to be suboptimal as a result of lack of
orthogonality of the lead groups. However, this recording technique is used all over the world and all
routinely used 12-lead computer programs in existence
at the start of the project required such data. Of course,
lack of orthogonality cannot be offered as a criticism
for the analysis of XYZ data in the present study.
TABLE 3
Number of "small" Q and R waves identified by 12-lead ECG
programs in data set 1 and 2 combined

            Q wave <12 msec and .50 µV          R wave <12 msec and .50 µV
Program     Original ECGs    Artificial ECGs    Original ECGs    Artificial ECGs
No.         (n = 12 x 250)   (n = 12 x 310)     (n = 12 x 250)   (n = 12 x 310)
2            26               46                 11               16
5            14               76                  8               31
6            86               39                 29                7
7            45               80                 26               37
8            16               22                  7               11
12          148              240                 81              150
13           63               97                 32               63
14            0               13                 21               14
15          157              178                 88              132
16           10               11                 39               52

FIGURE 7. Histogram of differences (in msec) between individual Q duration and median program results derived by eight VCG
programs in lead X and by 10 standard 12-lead ECG programs in lead I. Q wave durations are not calculated by the Lyon
program. Results are from data set 1 and 2, original and artificial ECGs combined (n = 560).
[Histograms omitted: Q duration in lead I and in lead X; horizontal axis: difference (program minus median) in msec.]
FIGURE 8. Lead group onsets and offsets from XYZ leads obtained by nine VCG computer programs in comparison to the
reference standard in the 50% lowest vs 50% highest noise recordings. Results were obtained from data set 1 and 2 combined (n
= 310).
[Bar graph omitted: low-noise vs high-noise records; mean differences and 99% confidence intervals for onset P, end P, end T, onset QRS, and end QRS in leads XYZ; one column per program and one for the median; scale in msec from -10 to +10.]
There were several reasons why the establishment of
a data base with well-defined onsets and offsets of P,
QRS, and T waves was given high priority in the CSE
project. Experiences of investigators working in pattern recognition have demonstrated that several mathematic algorithms may lead to similar solutions in the
average case. Some methods, however, may perform
better than others under particular conditions, and vice
versa. The use of a data base for the development of
algorithms is standard practice in various fields, from
automated character reading to computer-assisted
chromosome and leukocyte typing. For this, a local
data bank and human wave recognition, usually by a
single reader, has mostly been used. Furthermore, discussions with cardiologists revealed an unwillingness
of the medical community to accept strict mathematic
definitions if they had not been tested against wave
recognition results derived by human reading.
From the intraobserver and interobserver reproducibility tests in the present study it is apparent that QRS
onset is the measurement that can be made most reliably. A precision of less than 6 msec (three sample
points) is attainable for QRS onset at high amplification in relatively noise-free records. Based on the results of the present study for P onset and offset, as well
as for QRS offset, a difference of 10 msec is tolerable,
whereas for T end, this may be increased to 25 msec.
These empirical findings are in accordance with electrophysiologic theory. The onset of ventricular depolarization is usually a well-defined entity. QRS offset, in contrast, is a rather arbitrary fiducial point at
which final echoes of depolarization merge imperceptibly with the early signs of repolarization. The same is
true for the end of P. The T wave recovery forces move
slowly and are of small magnitude. The end of T is
therefore inherently less well defined. Nonetheless, in
practical electrocardiology the end of QRS, as well as
of the P and T waves, needs to be determined as accurately as possible.
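The tolerances quoted above can be written down as a simple check of a program estimate against the referee estimate. The following is a minimal sketch only: the dictionary keys and function names are hypothetical, and the use of 6 msec as an acceptance limit for QRS onset is an assumption made here (the text states it as an attainable precision, not as a tolerance).

```python
# Tolerances in msec taken from the text; the keys are hypothetical labels.
# The 6 msec entry for QRS onset is an assumption (see lead-in).
TOLERANCE_MSEC = {
    "P_onset": 10,
    "P_offset": 10,
    "QRS_onset": 6,
    "QRS_offset": 10,
    "T_end": 25,
}

SAMPLING_RATE_HZ = 500  # sampling rate of the CSE library


def msec_to_samples(msec, rate_hz=SAMPLING_RATE_HZ):
    """Convert a duration in msec to a whole number of sample points."""
    return round(msec * rate_hz / 1000)


def within_tolerance(point, program_msec, referee_msec):
    """True if the program estimate differs from the referee estimate
    by no more than the tolerance for this fiducial point."""
    return abs(program_msec - referee_msec) <= TOLERANCE_MSEC[point]
```

At 500 Hz, `msec_to_samples(6)` gives the three sample points mentioned in the text.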
The construction of the current data base from simultaneously recorded three-lead ECGs is a primary
step in the process of standardization of computer ECG
measurement programs, as was recommended at the
first IFIP Conference12 and at the Tenth Bethesda Conference19 on computer-assisted electrocardiography.
While the data base cannot be guaranteed to be a representative sample of the ECG universe (the collection of
all conceivable ECGs) and the number of ECGs in the
data base has been constrained by practical considerations, it is highly probable that conclusions reached
by evaluating program performance with the present
data base may be generalized to cover program performance in daily routine practice.
In fact, the present study demonstrates a rather wide
variation in wave measurement results, and especially
time intervals, obtained by nine currently used VCG
and 10 standard 12-lead ECG computer programs analyzing a common reference data base. This variability
may be explained by several factors. Various programs
apply different algorithms and references for wave
recognition, beat selection, and parameter extraction.12-15, 19-21 Most wave recognition programs apply
threshold-level crossing methods to amplitude differences of filtered leads or use different matching techniques on templates in the filtered spatial velocity-time
function. These templates or threshold levels are at
best derived from a set of ECGs and are computed
around the points indicated by one or more human
observers. Since the onsets and offsets of waves to
which various programs were tailored have been determined by different referees using different sets of
ECGs, it follows that systematic differences in computer results similar to those observed in human interobserver variability studies may be expected.
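As an illustration of why the chosen threshold matters, a threshold-crossing detector on the spatial velocity of the XYZ leads might be sketched as follows. This is a simplified example and not the method of any program in the study: it omits filtering, and the function names, the `threshold` value, and the `min_run` parameter are assumptions introduced for the sketch.

```python
import math


def spatial_velocity(x, y, z):
    """Magnitude of the sample-to-sample change of the XYZ vector."""
    return [
        math.sqrt(
            (x[i + 1] - x[i]) ** 2
            + (y[i + 1] - y[i]) ** 2
            + (z[i + 1] - z[i]) ** 2
        )
        for i in range(len(x) - 1)
    ]


def find_onset(velocity, threshold, min_run=3):
    """Index of the first value that starts a run of `min_run`
    consecutive values above `threshold`, or None if there is none."""
    run = 0
    for i, v in enumerate(velocity):
        run = run + 1 if v > threshold else 0
        if run == min_run:
            return i - min_run + 1
    return None
```

Because the threshold is tuned to onsets marked by particular observers on particular ECGs, two programs with different thresholds will place the onset at different samples in the same record.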
Other factors also contribute to the variability in
measurement results. It has been demonstrated that
programs that apply strategies for location of fiducial
points on simultaneously recorded leads produce
greater measurement reliability and reproducibility
than programs in which single-lead ECG analysis is
used.22, 23 The sampling rate and record length effectively used by various programs are often different.
Some apply a sampling rate of 250 Hz, while others
use 300, 400, or 500 Hz. Significant differences between program measurements and the reference standard are less likely to occur with programs using a
sampling rate of 500 Hz. In programs based on a sampling rate of 250 Hz the odd samples were analyzed in
the present study, whereas for those using 300 or 400 Hz,
interpolation techniques were applied to the 500 Hz
CSE library. As demonstrated by Bailey et al.,22 small
shifts in sampling thereby introduced may have caused
different measurement results.
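The two conversions described above can be sketched as follows. This is an illustration under stated assumptions: "odd samples" is taken to mean the 1st, 3rd, 5th, ... samples, and linear interpolation is assumed because the text does not say which interpolation technique was used. The function names are hypothetical.

```python
def take_odd_samples(signal):
    """Keep the 1st, 3rd, 5th, ... samples (500 Hz -> 250 Hz)."""
    return signal[::2]


def resample_linear(signal, src_hz, dst_hz):
    """Resample by linear interpolation between neighbouring samples.

    `src_hz` and `dst_hz` are integer sampling rates; `signal` must
    contain at least two samples.
    """
    # Number of output samples that fit within the recorded duration.
    n_out = ((len(signal) - 1) * dst_hz) // src_hz + 1
    out = []
    for k in range(n_out):
        pos = k * src_hz / dst_hz            # position in source samples
        i = min(int(pos), len(signal) - 2)   # left neighbour
        frac = pos - i
        out.append(signal[i] + frac * (signal[i + 1] - signal[i]))
    return out
```

In both cases the resampled points no longer coincide with all of the original 500 Hz sample instants, so a fiducial point can shift by a fraction of the new sampling interval.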
All programs attempt to make use of the redundancy
of complexes available in the sampled ECG to optimize the accuracy of measurement extraction from the
tracing.19 Three techniques are currently in use. The
first locates the "best complex" for analysis. To find
the best complex, all of the complexes of a given lead
set are located. The program then chooses one complex for analysis, generally the one with the least noise
and baseline wander. In the second technique, some
time-coherent averaging is made of all complexes that
are considered to be morphologically of the same type.
This procedure reduces the random noise in the signal.
The third method extracts the measurements from every complex in the lead set and subsequently operates
on the measurements of similar dominant complexes.
From the above it is evident that, although identical
ECGs may be given as input to various programs, final
measurements may have been derived from different
beats. This will inevitably lead to different results in a
specific ECG record. However, if computer analysis
of ECGs is to become a standardized laboratory procedure, averages of measurements based on a sample of
sufficient size should be identical and variances should
be within "acceptable ranges."5 Results from the present study indicate that this goal has not yet been
achieved.
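The three techniques described above can be contrasted in a short sketch. It is illustrative only: beats are assumed to be already detected, time-aligned, and of the same morphology, the noise score and the measurement function are supplied by the caller, and the use of the median for the third technique is an assumption (the text says only that the program operates on the measurements of similar dominant complexes).

```python
from statistics import median


def best_complex(beats, noise_of):
    """Technique 1: pick the single beat with the lowest noise score."""
    return min(beats, key=noise_of)


def coherent_average(beats):
    """Technique 2: sample-by-sample mean of time-aligned beats."""
    n = len(beats)
    return [sum(samples) / n for samples in zip(*beats)]


def median_measurement(beats, measure):
    """Technique 3: measure every beat, then combine the measurements
    (here by taking their median, which is an assumption)."""
    return median(measure(b) for b in beats)
```

For example, with three beats whose peak amplitudes are 10, 12, and 8, the first technique measures whichever single beat is least noisy, while the other two combine information from all three, so the same record can yield different final measurements.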
In the CSE project, a so-called artificial ECG library
was created by selecting 1 beat from each of the lead
groups of the original ECG recordings and by making
strings of identical beats with a constant RR interval.10 Differences in measurement results due to the
different beat selection methods listed above have
thereby been circumvented. Nevertheless, an important variability between measurements obtained with
the VCG and standard 12-lead computer programs
could still be demonstrated.
This variability results not only from difficulties in
the determination of wave onsets and offsets, but also
from a lack of consistent and precise common
definitions, minimum wave requirements, and measurement rules. In the present study we have found that
isoelectric segments of 10 msec and more are not uncommon at the beginning and end of QRS in various
leads. There are at present no generally accepted
guidelines with respect to these segments. In a minority of programs, these isoelectric segments are enclosed
in the duration of the initial or terminal QRS
components, whereas in the majority they are excluded. Furthermore, various programs use different
limits for the detection and labeling of small QRS
waves. Results from the present investigation indicate
that the smallest recognizable waves, by visual inspection, have an amplitude on the order of 20 µV and a
duration of 6 msec. A few programs provide Q and R
wave measurement results below these limits. Other
programs, however, require a minimum amplitude of
30 or even 40 µV and a duration of 8 or 10 msec.
Others use noise-dependent thresholds based on a signal derivative or a combination of amplitude and duration results.
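The effect of different minimum wave requirements can be made concrete with a small sketch. The amplitude and duration limits are those quoted in the text, but their pairing into "program_a" and "program_b" is an assumption, and all names are hypothetical. Noise-dependent thresholds are not modeled.

```python
# (minimum amplitude in µV, minimum duration in msec).
# Labels and the pairing of limits are assumptions for illustration.
CRITERIA = {
    "visual_limit": (20, 6),
    "program_a": (30, 8),
    "program_b": (40, 10),
}


def is_reported(amplitude_uv, duration_msec, criteria):
    """True if the wave meets both the amplitude and duration minimum."""
    min_amp, min_dur = criteria
    return abs(amplitude_uv) >= min_amp and duration_msec >= min_dur


def wave_calls(amplitude_uv, duration_msec):
    """Decision of each set of criteria for the same candidate wave."""
    return {
        name: is_reported(amplitude_uv, duration_msec, c)
        for name, c in CRITERIA.items()
    }
```

A candidate Q wave of -32 µV lasting 8 msec would be reported under the first two sets of limits but suppressed under the third, so the programs would disagree on its presence before any diagnostic logic is applied.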
On the average, the programs tested agreed on the
presence and absence of Q and QS waves about 80% to
90% of the time. When durations were calculated for
those records for which programs agreed on the presence of a Q or QS wave, significant average differences amounting to more than 8 msec were found in
the limb leads and the differences were even greater
in the right precordial leads. Such differences have
important consequences for diagnostic performance,
given that these programs might use the same thresholds and logic for the diagnosis of myocardial
infarction.23, 24
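The two quantities discussed above, agreement on the presence of a Q or QS wave and the duration difference in records where both programs report one, might be computed as in the following sketch. The function names and data layout are assumptions; the study's own statistical procedure is not described in this section.

```python
def agreement_rate(calls_a, calls_b):
    """Fraction of records on which two programs make the same
    present/absent call for a Q or QS wave."""
    if len(calls_a) != len(calls_b):
        raise ValueError("both programs must be scored on the same records")
    same = sum(a == b for a, b in zip(calls_a, calls_b))
    return same / len(calls_a)


def mean_duration_difference(dur_a, dur_b, calls_a, calls_b):
    """Mean duration difference in msec (program A minus program B),
    restricted to records where both programs report the wave."""
    diffs = [
        da - db
        for da, db, a, b in zip(dur_a, dur_b, calls_a, calls_b)
        if a and b
    ]
    return sum(diffs) / len(diffs) if diffs else None
```

Restricting the second calculation to records with agreed presence separates disagreement about detection from disagreement about measurement.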
The comparison of referee with computer point estimates reported in the present study demonstrates that
some programs are more precise and show less variability than others. In general the measurement performance of XYZ programs was better than that of 12-lead programs. These results have been confirmed by
noise-tolerance tests. Results from low- and high-noise recordings indicate that P, QRS, and T onsets
and offsets are shifted outward by noise in various
computer programs. However, the extent of this shift
is variable from program to program, probably as a
result of different preprocessing methods. Further
studies in this area are still in progress.25 Preliminary
results indicate that programs that apply time-coherent
averaging perform better in noisy records.
To allow an exchange of diagnostic criteria, wave
measurement results need to be standardized. From the
above it is obvious that common standards for quantitative electrocardiography are still missing. Therefore,
parallel to the establishment of the CSE reference data
base, the CSE Working Party has attempted to establish definitions, wave requirements, and measurement
procedures. Recommendations in this direction are being developed. In addition, steps have been initiated to
evaluate diagnostic program performance and to test
the clinical impact of improved measurements.
The data presented in the current investigation were
derived from ECG analysis programs with the use of
three simultaneously recorded leads. At the start of the
CSE project, equipment that could be used to acquire
12 (eight independent) or 15 leads simultaneously was
not yet on the market. Some of the latest programs can
only operate on such multichannel leads.26-28 To this
end the CSE data base has recently been extended with
several hundred new ECGs. However, the basic problems encountered in the present analysis are also applicable to these newer programs.
We gratefully acknowledge the secretarial assistance of
Diane Wolput and Viviane Dillemans, as well as the technical
assistance of Ludo Van den dries and Danny De Schreye.
Organizational structure: CSE committees and
participants
CSE Steering Committee. P. Arnaud (France), R. Degani (Italy), P. W. Macfarlane (United Kingdom), J. H. van Bemmel (The Netherlands), J. L. Willems (Project Leader; Belgium), C. Zywietz (West Germany).
CSE Board of Referees. P. J. Bourdillon (United Kingdom), G. Mazzocca (Italy), B. Denis (France), J. Meyer (West Germany), E. O. Robles de Medina and F. M. A. Harms (acting as a team with one vote, The Netherlands), H. J. Ritsema van Eck (consultant, The Netherlands).
CSE European Working Party. Belgium: C. Brohet (University of Louvain), M. Demeester (University of Brussels), J. Pardaens and J. L. Willems (University of Leuven). West Germany: J. Dudeck (University of Giessen), J. Meyer and J. Michaelis (University of Mainz), S. J. Pöppl (Institute Medical Data Processing, München), C. Zywietz (University of Hannover). Denmark: J. Damgaard Andersen (University of Copenhagen). France: P. Arnaud (INSERM U121 Lyon), B. Denis (University of Grenoble), P. Rubel (INSA, Lyon). Greece: S. Moulopoulos (University of Athens), E. Skordalakis (NCR Democritos, Attiki). Italy: S. Dalla Volta (University of Padova), R. Degani (Ladseb CNR, Padova), G. Mazzocca (University of Pisa). Ireland: I. Graham and B. C. Reardon (University of Dublin). The Netherlands: J. H. van Bemmel and J. L. Talmon (Free University, Amsterdam), F. M. A. Harms and E. O. Robles de Medina (University of Utrecht), H. J. Ritsema van Eck (Rotterdam). United Kingdom: P. J. Bourdillon (University of London), P. W. Macfarlane (University of Glasgow).
Consultants. J. J. Bailey (N.I.H.) and H. V. Pipberger (George Washington University, Washington, D.C.), P. M. Rautaharju (University of Dalhousie, Halifax, Nova Scotia).
Non-European participants. U.S.A.: R. Bonner (IBM), J. Doue (Hewlett-Packard), K. Michler (Telemed). Canada: P. M. Rautaharju and P. Macinnis (University of Dalhousie, Halifax, Nova Scotia). Japan: M. Okajima, N. Okamoto, M. Yokoi (University of Nagoya), M. Ohsawa (Fukuda Denshi).
CSE Coordinating Center. Division of Medical Informatics, University of Leuven, Belgium.
References
1. Pipberger HV: Twenty years of ECG data processing. What has been accomplished? In Antaloczy Z, editor: Modern electrocardiology. Amsterdam, 1978, Excerpta Medica, p 159
2. Rautaharju PM: The current state of computer ECG analysis: a critique. In van Bemmel JH, Willems JL, editors: Trends in computer-processed electrocardiograms. Amsterdam, 1977, North Holland Publishing Co, p 117
3. Drazen E: Use of computer-assisted ECG interpretation in the United States. In Ripley KL, Ostrow HG, editors: Computers in cardiology. Long Beach, CA, 1979, IEEE Computer Society, p 83
4. Willems JL, Pardaens J: Differences in measurement results obtained by four different ECG computer programs. In Ostrow HG, Ripley KL, editors: Computers in cardiology. Long Beach, CA, 1977, IEEE Computer Society, p 115
5. Willems JL: A plea for common standards in computer aided ECG analysis. Comp Biomed Res 13: 120, 1980
6. The CSE European Working Party: Common standards for quantitative electrocardiography. The CSE pilot study. In Gremy F, et al, editors: Medical informatics Europe 81. Berlin, 1981, Springer Verlag, p 319
7. The CSE European Working Party: Common standards for quantitative electrocardiography. CSE project phase one. In Ripley KL, editor: Computers in cardiology. Long Beach, CA, 1982, IEEE Computer Society, p 69
8. Bourdillon PJ, Denis B, Harms FMA, Mazzocca G, Meyer J, Robles de Medina EO, Ritsema van Eck HJ, Willems JL: European experience in the standardization of measurements and of definitions of the electrocardiogram. In Laks M, editor: Computerized interpretation of electrocardiograms VII. New York, 1982, Engineering Foundation, p 9
9. Macfarlane PW, Willems JL on behalf of the CSE Working Party: The CSE Project: progress as viewed by the cooperating centers. In Selvester R, editor: Computer interpretation of electrocardiograms VIII. New York, 1983, Engineering Foundation (in press)
10. Willems JL, Arnaud P, Degani R, Macfarlane PW, van Bemmel JH, Zywietz C: Protocol for the concerted action project "Common Standards for Quantitative Electrocardiography," Second R&D programme in the field of Medical and Public Health Research of the EEC (80/344/EEC), CSE Ref. 80-06-00, Leuven, Belgium, 1980, ACCO Publ, p 152
11. Dalkey N: Analysis from a group opinion study. Rand Corporation. Futures, December 1969, p 541
12. Zywietz C, Schneider B, editors: Computer application in ECG and VCG analysis. Amsterdam, 1973, North Holland Publishing, p 271
13. van Bemmel JH, Willems JL, editors: Trends in computer processed electrocardiograms. Amsterdam, 1977, North Holland Publishing, p 437
14. Wolf HK, Macfarlane PW, editors: Optimization of computer ECG processing. Amsterdam, 1980, North Holland Publishing, p 346
15. Talmon JL: Pattern recognition of the ECG. A structured analysis, doctoral thesis. Free University, Amsterdam, 1983, p 366
16. Willems JL: Common standards for quantitative electrocardiography. Third progress report. Leuven, Belgium, 1983, ACCO Publ, p 275
17. Fischmann E, Cosma J, Pipberger HV: Beat to beat and observer variation of the electrocardiogram. Am Heart J 75: 465, 1968
18. Willems JL, editor: CSE atlas - referee results first phase library data set one, CSE Ref. 83-05-13, Leuven, Belgium, 1983, ACCO Publ, p 655
19. Rautaharju PM, Ariet M, Pryor TA, Arzbaecher RC, Bailey JJ, Bonner R, et al: Task Force III: computers in diagnostic electrocardiography. Am J Cardiol 41: 158, 1978
20. Stallman FW, Pipberger HV: Automatic recognition of electrocardiographic waves by digital computer. Circ Res 9: 1138, 1961
21. van Bemmel JH, Talmon JL, Duisterhout JP, Hengeveld SJ: Template wave form recognition applied to ECG/VCG analysis. Comp Biomed Res 6: 430, 1973
22. Bailey JJ, Horton M, Itscoitz SB: A method for evaluating computer programs for electrocardiographic interpretation. III Reproducibility testing and the sources of program errors. Circulation 50: 88, 1974
23. Helppi RR, Unite V, Wolf HK: Suggested initial performance requirements and methods of performance evaluation for computer ECG analysis programs. Can Med Assoc J 108: 1251, 1973
24. Rautaharju PM: Use and abuse of electrocardiographic classification systems in epidemiologic studies. Eur J Cardiol 8: 155, 1978
25. Zywietz C, Alraun W, Willems JL on behalf of the CSE Working Party: Results of ECG program noise tests within the CSE project. In Ripley KL, editor: Computers in cardiology. Long Beach, CA, 1984, IEEE Computer Society (in press)
26. MAC II Marquette Electronics Inc, Milwaukee, 1982
27. Macfarlane PW, Peden A, Podolski M, Lawrie TDV: A new 12 lead ECG diagnostic computer program. Jpn Heart J 23(suppl I): 667, 1982
28. Bortolan G, Cavaggion C, Degani RT: A comparison of ECG measurements derived from 3, 6 and 12 simultaneous leads. In Ripley KL, editor: Computers in cardiology. Long Beach, CA, 1983, IEEE Computer Society, p 269
Assessment of the performance of electrocardiographic computer programs with the use
of a reference data base.
J L Willems, P Arnaud, J H van Bemmel, P J Bourdillon, C Brohet, S Dalla Volta, J D
Andersen, R Degani, B Denis and M Demeester
Circulation. 1985;71:523-534
doi: 10.1161/01.CIR.71.3.523
Circulation is published by the American Heart Association, 7272 Greenville Avenue, Dallas, TX 75231
Copyright © 1985 American Heart Association, Inc. All rights reserved.
Print ISSN: 0009-7322. Online ISSN: 1524-4539
The online version of this article, along with updated information and services, is located on
the World Wide Web at:
http://circ.ahajournals.org/content/71/3/523