Human Reproduction vol.10 no.8 pp.2010-2016, 1995
Inter-observer agreement in analysis of basal body
temperature graphs from infertile women
D.Ayres-de-Campos1, J.L.Silva-Carvalho, C.Oliveira,
I.Martins-da-Silva, J.Silva-Carvalho and
L.Pereira-Leite
Department of Gynaecology and Obstetrics, Hospital de S. Joao,
4200 Porto, Portugal.
In this study, we evaluated the reproducibility of analysis
of basal body temperature graphs under optimized conditions for agreement. A total of 160 recordings were selected
from spontaneous cycles of infertile women and analysed
by three experienced clinicians using uniform criteria.
Agreement at the various stages of analysis was assessed
by the 'proportions of agreement' with 95% confidence
intervals. Agreement in identification of the thermal nadir
was clearly superior to that reported in previous publications. Reproducibility of 'ovulatory' graph features (i.e.
biphasic graphs and adequate thermal shifts) was excellent.
Agreement in classification of monophasic graphs and
inadequate thermal shifts, although lower, was still good.
Thus, with experienced observers and uniform criteria, a
good agreement can be achieved in analysis of the most
important parameters of the basal body temperature graph.
We believe that an effort should be made to generalize the
use of uniform analysis criteria, because only then can
results from different institutions be compared and the
remaining clinical evaluation of the method be performed.
Key-words: basal body temperature/infertility/ovulation/proportions of agreement/reproducibility
Introduction
Nowadays, particularly in the area of human reproduction, the
constant flow of information regarding new technologies leads
to an understandable lesser interest in the study of pre-existing
methods. However, in this and other areas, many widely used
diagnostic tests have not been submitted to what is now
considered adequate clinical evaluation (Corson et al., 1995;
Grant, 1984; Guzick et al., 1994; Steer et al., 1995). It is well
to remember that some of these tests, by affecting diagnosis
and therapeutic options, can have a significant influence in
overall treatment results.
The usually biphasic pattern of women's basal body temperature (BBT) throughout the menstrual cycle was first described
in 1868 (Squire, 1868). Today, daily recording of body temperature remains one of the simplest, most practical and inexpensive ways of detecting ovulation. In the last few years, however,
the value of the BBT graph has been questioned, particularly
with regard to its validity, or ability to evaluate correctly the
phenomena of the menstrual cycle (Moghissi, 1976; Lenton
et al., 1977; Hilgers and Bailey, 1980; Bauman, 1981; Newill
and Katz, 1982; Kambic and Gray, 1989; Martinez et al.,
1992). Less importance has been placed on the reproducibility
of the method, i.e. its capacity to reproduce the readings.
Nowadays, it is widely believed that the study of any method,
whether old or new, should first involve the study of reproducibility, and later of validity (Grant, 1984). If good reproducibility is not found, the method should be abandoned or the
technique re-evaluated until this is achieved (Grant, 1991).
Reproducibility of the BBT graph involves two aspects: the
measurement and recording of temperatures by women and
the visual interpretation of graphs by clinicians. In this study
we focused solely on the second aspect. The three most
frequently analysed parameters of the BBT graph were studied:
the classification of the graph as biphasic, monophasic or
uninterpretable; the identification of the thermal nadir; and the
classification of the thermal shift as adequate or inadequate.
Our aim was to study the inter-observer variation in analysis
of these parameters under the best possible conditions for
agreement, in order to evaluate how reproducible the method
could be. Thus, poor quality recordings were eliminated
and analysis was performed by experienced clinicians using
uniform criteria.
Materials and methods
Previously recorded BBT graphs were selected at random from the
out-patient infertility unit of the Hospital de S. Joao, a tertiary care
university state hospital in northern Portugal. Couples attending this
facility are all unable to conceive after at least 1 year of unprotected
intercourse, in spite of consciously trying to do so. All women had been
instructed to register a 3 month period of daily rectal temperatures, on
awakening and before any physical activity. From the 3 month graph,
a cycle, well delimited by two menstrual periods, was chosen at
random for analysis. Exclusion criteria for the selection of graphs
were: (i) non-spontaneous (i.e. ovulation induced) cycles, or (ii) more
than four missed temperature recordings in a cycle. A total of 160
graphs meeting these conditions were chosen for analysis, selected
from the records of 122 different women. Graphs were presented
separately in groups of 40, to observers with no time-limit for
analysis. They were identified solely by a reference number and the
patient's initials. No clinical information on the patients was supplied.
Analysis was performed by three experienced clinicians (J.L.S.-C.,
I.M.-S., J.S.-C.). All had >10 years' experience in reproductive medicine
and interpretation of BBT graphs. Observers were first asked to
classify the graphs as biphasic, monophasic or uninterpretable. In
graphs classified as biphasic they were subsequently requested to
identify the day of the thermal nadir and classify the thermal shift as
adequate or inadequate.
© Oxford University Press
Downloaded from http://humrep.oxfordjournals.org/ at Pennsylvania State University on September 13, 2016
1To whom correspondence should be addressed
Inter-observer agreement in BBT graph analysis
A previous meeting had been held between observers to define a
uniform and consensual set of criteria for graph analysis. Chosen
criteria were based on the findings of Hilgers and Bailey (1980),
Downs and Gibson (1983), and McCarthy and Rockette (1983),
adapted to the original World Health Organization (WHO) definition
(Vollman, 1977). Thus, graphs were classified as biphasic if a rise in
temperature of at least 0.2°C above the six previous temperatures
was identified close to the middle of the cycle, and this was sustained
for at least 3 days. If such a shift was not found the graph was
considered monophasic. Other patterns were to be classified as
uninterpretable. The thermal nadir was defined as the day that
preceded the previously described rise in temperature. Criteria for
classification of the thermal shift as inadequate were: (i) a slow rise
in temperature, lasting >48 h; (ii) a hypothermic shift, with the
majority of temperatures <0.2°C above the average of the follicular
phase; (iii) the appearance of deep and sustained falls in temperature
during the luteal phase; (iv) duration of the shift < 11 days.
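The criteria above can be illustrated with a minimal sketch. It is not the procedure used by the observers, who analysed the graphs visually. It assumes one reading of the biphasic rule (each of three consecutive temperatures lies at least 0.2°C above the highest of the six preceding temperatures), ignores the requirement that the rise occur close to mid-cycle, cannot detect 'uninterpretable' patterns, and checks only inadequacy criteria (ii) and (iv). It also assumes a complete series of daily temperatures covering one cycle. All function names and the example data are hypothetical.

```python
from statistics import mean

EPS = 1e-9  # tolerance for floating-point comparisons of temperatures


def find_thermal_shift(temps, rise=0.2, baseline_days=6, sustain_days=3):
    """Return the 0-based index of the first day of a sustained rise, or None.

    Assumed reading of the criterion: every temperature in a window of
    `sustain_days` days is at least `rise` degrees C above the highest of the
    `baseline_days` preceding temperatures.
    """
    for day in range(baseline_days, len(temps) - sustain_days + 1):
        baseline_max = max(temps[day - baseline_days:day])
        window = temps[day:day + sustain_days]
        if all(t - baseline_max >= rise - EPS for t in window):
            return day
    return None


def classify_graph(temps):
    """Classify a cycle as biphasic or monophasic and locate the nadir."""
    shift = find_thermal_shift(temps)
    if shift is None:
        return {"pattern": "monophasic", "nadir_cycle_day": None}
    # The nadir is the day before the rise; with 1-based cycle days this
    # equals the 0-based index of the first raised temperature.
    return {"pattern": "biphasic", "nadir_cycle_day": shift}


def inadequacy_reasons(temps, shift, rise=0.2, min_days=11):
    """Check only criteria (ii) and (iv); (i) and (iii) are not modelled."""
    follicular_mean = mean(temps[:shift])
    luteal = temps[shift:]
    reasons = []
    # (ii) hypothermic: most luteal temperatures < 0.2 above follicular mean
    low = sum(1 for t in luteal if t - follicular_mean < rise - EPS)
    if low > len(luteal) / 2:
        reasons.append("hypothermic")
    # (iv) short: fewer than 11 days from the rise to the end of the cycle
    if len(luteal) < min_days:
        reasons.append("short")
    return reasons


# Hypothetical 28-day cycle: 14 low days, then 14 clearly higher days.
example = [36.4, 36.5] * 6 + [36.4, 36.3] + [36.8] * 14
result = classify_graph(example)
```

Criteria (i) and (iii) depend on the shape of the rise and of the luteal phase, which is where visual judgement, and therefore observer disagreement, is most likely to enter.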
Results
The results are summarized in Table I. Figure 1 shows examples
of graphs where there was agreement among the three observers
in all steps of analysis.
In the classification of recordings as biphasic, monophasic
or uninterpretable, the three observers agreed on 130 of the
160 graphs (81%), two observers agreed on 27 (17%), and there
was total disagreement on three graphs (2%).
Figure 2 shows examples of disagreement in the analysis of
this parameter. The proportion of agreement for biphasic
graphs was 0.86 with 95% CI 0.83-0.90. For monophasic
recordings it was 0.62 (95% CI 0.53-0.70), and for uninterpretable recordings 0.14 (95% CI 0.01-0.27).
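The graph-level counts above can be related to the numbers of trials in Table I by simple arithmetic. The sketch below assumes that a 'trial' for a category is a comparison between two observers in which at least one of them chose that category. This reading is an inference from the Table I footnote rather than a definition stated in the text. Under that assumption, an agreeing pair contributes one trial (to the agreed category) and a disagreeing pair contributes two (one to each category chosen).

```python
GRAPHS = 160
ALL_AGREE, TWO_AGREE, NONE_AGREE = 130, 27, 3  # graph counts from the text

# Three observers give three pairwise comparisons per graph.
total_pairs = 3 * GRAPHS

# All three agree: 3 agreeing pairs. Two agree: 1 agreeing, 2 disagreeing.
# Total disagreement: 3 disagreeing pairs.
agreeing_pairs = 3 * ALL_AGREE + 1 * TWO_AGREE
disagreeing_pairs = 2 * TWO_AGREE + 3 * NONE_AGREE

# Trials summed over categories, under the assumed definition.
expected_trials = agreeing_pairs + 2 * disagreeing_pairs

# Trials reported in Table I for biphasic, monophasic and uninterpretable.
reported_trials = 385 + 130 + 28

print(total_pairs, agreeing_pairs, disagreeing_pairs)
print(expected_trials, reported_trials)
```

The two totals coincide, which supports, but does not prove, the assumed definition of a trial.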
Agreement in the remaining stages of analysis (nadir identification and thermal shift classification) was only studied for
those graphs considered biphasic.
In determining the day of the thermal nadir there was
agreement among all observers on 79 of 106 graphs (75%),
agreement between two on 34 of 121 graphs (28%), and
total disagreement on five of 106 graphs (5%). Examples of
disagreement are shown in Figure 3. The overall proportion
of agreement for this parameter was 0.81 (95% CI 0.77-0.86).
In classification of the thermal shift as adequate or inadequate
there was agreement among all observers on 86 of 106 graphs
(81%) and agreement between two observers on 32 of 121
graphs (26%). Figure 4 shows examples of disagreement in
this parameter. The proportion of agreement for adequate
thermal shifts was 0.82 (95% CI 0.77-0.87) and for inadequate
thermal shifts 0.70 (95% CI 0.62-0.77).
Discussion
The BBT graph is a relatively well tolerated (Martinez et al.,
1992), simple and inexpensive method for ovulation detection
(it needs only a thermometer, a pen and a piece of paper). The
Stage of analysis                  Number of trials(a)   Proportion of agreement   95% CI
Classification of graphs
  biphasic                         385                   0.86                      0.83-0.90
  monophasic                       130                   0.62                      0.53-0.70
  uninterpretable                   28                   0.14                      0.01-0.27
Identification of thermal nadir
  global                           333                   0.81                      0.77-0.86
Thermal shift classification
  adequate                         235                   0.82                      0.77-0.87
  inadequate                       141                   0.70                      0.62-0.77
(a) Number of agreement comparisons between observers, performed for each category.
cost and lower acceptability of alternative methods (hormonal
studies, serial ultrasonography, endometrial biopsy, etc.) make
it by far the easiest method to adopt on a routine basis. It also
provides other important information, namely an objective
duration of menses, cycle length, frequency and timing of
intercourse. It is certainly for all these reasons that it remains
a useful tool in infertility clinics. However, its clinical evaluation has never been thoroughly performed and we, as others
(Grant, 1984), believe that the study of reproducibility should
be the first step. The issue of reproducibility has been extensively studied and is considered of prime importance in the
evaluation of other methods that rely on visual interpretation
of data, such as cardiotocography and mammography (Donker,
1991; Elmore et al., 1994).
Nowadays, the BBT graph is chiefly employed in the study
of infertile patients, and so we believe that it is in this
population that the study of the method's reproducibility is
most useful. Lenton et al. (1977) reported that BBT graphs of
infertile patients are more difficult to interpret than those
of the general population. For the latter, a higher number of
unquestionably biphasic graphs are expected and probably a
better agreement in interpretation.
The reproducibility of the BBT graph is addressed in a
small number of papers, where it is frequently analysed
together with aspects of validity. To our knowledge it has
never been extensively studied. Previous reports indicate a
low reproducibility in assessment of the day of ovulation,
although chosen conditions for agreement are very diverse.
Kambic and Gray (1989) submitted 28 BBT recordings considered difficult to interpret to four experienced observers and
found a unanimous choice of the first day of the temperature
shift in only 38%. Bauman (1981) asked six gynaecologists
with different experiences of graph interpretation to estimate
the day of ovulation and reported a unanimous choice in only
22% of 88 recordings. Lenton et al. (1977) found a low intraobserver agreement (not quantified) in assessment of the
ovulation day in 60 graphs of a predominantly infertile
population.
In this study, we aimed for the best possible conditions for
agreement in order to evaluate how reproducible the method
could be. Other methods exist where, in spite of optimized
agreement conditions, good reproducibility is not found
Inter-observer agreement was assessed by the 'proportions of
agreement' and 95% confidence intervals (CI), as described by Grant
(1991). According to this author the proportion of agreement that
signifies good agreement is arbitrary, but if the 95% CI includes 0.5,
then agreement is almost certainly poor (provided the study population
is large enough).
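A minimal sketch of this kind of calculation follows. It is not Grant's (1991) exact procedure, which is not reproduced here. Two assumptions are made: that a category's proportion of agreement is the fraction of pairwise observer comparisons involving that category in which both observers chose it, and that the 95% CI is the usual normal-approximation (Wald) interval for a proportion. The function names and the toy ratings are hypothetical.

```python
import math
from itertools import combinations


def specific_agreement(ratings, category):
    """Return (agreements, trials) for one category.

    `ratings` is a list with one tuple per graph, holding one label per
    observer. A trial is a pair of observers in which at least one chose
    `category` (assumed definition); it is an agreement if both did.
    """
    agreements = trials = 0
    for labels in ratings:
        for a, b in combinations(labels, 2):
            if category in (a, b):
                trials += 1
                if a == b:
                    agreements += 1
    return agreements, trials


def wald_ci(p, n, z=1.96):
    """Normal-approximation 95% CI for a proportion p observed on n trials."""
    half_width = z * math.sqrt(p * (1 - p) / n)
    return max(0.0, p - half_width), min(1.0, p + half_width)


def ci_includes(ci, value=0.5):
    """True if the interval contains `value` (0.5 by default)."""
    lower, upper = ci
    return lower <= value <= upper


# Toy data: three graphs, three observers (B, M, U are category labels).
toy = [("B", "B", "B"), ("B", "M", "B"), ("M", "M", "U")]
agreements, trials = specific_agreement(toy, "B")
proportion = agreements / trials
```

Applied to the proportions and trial counts of Table I, `wald_ci` gives intervals within about 0.01 of the published ones, for example roughly 0.83-0.89 for biphasic graphs (0.86 on 385 trials). Because the comparisons are made on the same graphs they are not independent, so such an interval is only approximate.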
Table I. Proportions of agreement and 95% confidence intervals (CI) for the
various stages of basal body temperature (BBT) graph analysis
[Figure 1 (referred to in Results as examples of agreement among the three observers in all steps of analysis): panels include Graph 101, Graph 11* and Graph 57, plotted as basal temperature against cycle day; tracings and caption not recoverable.]
[Figure 2 tracings: four graphs (panel titles include Graph 62, Graph 53 and Graph 3), each plotted as basal temperature against cycle day (days 1-34, menstruation days marked X); the observer classifications beneath the tracings read 'Biphasic', 'Monophasic' or 'Uninterpretable'. Tracings not recoverable.]
Figure 2. Examples of inter-observer disagreement in the classification of tracings as biphasic, monophasic or uninterpretable. Observer
analysis is displayed at the base of the tracings, after the corresponding observer number.
[Figure 3 (referred to in Results as examples of disagreement in identification of the thermal nadir): tracings plotted as basal temperature against cycle day; panel content and caption not recoverable.]
[Figure 4 tracings: four graphs (panel titles include Graph 1), each plotted as basal temperature against cycle day (days 1-34, menstruation days marked X); legible labels beneath the tracings include 'Adequate', 'Inadequate-short', 'Inadequate-falls', 'Inadequate-slow rise' and 'Inadequate-hypothermic'. Tracings not recoverable.]
Figure 4. Examples of inter-observer disagreement in the classification of the thermal shift as adequate or inadequate. Observer choice of
nadir is represented by arrows and classification of thermal shifts is displayed at the base of the tracings after the corresponding observer
number. Classification of shifts as inadequate is followed by the justification for this choice.
We believe that an effort should be made to generalize the
use of uniform analysis criteria, as only then can results from
different institutions be compared and the remaining clinical
evaluation of the method be performed.
Incorrect measurement and registration of temperature by
patients is another important aspect that has to be taken into
account in graph interpretation. Various electronic devices are
now available that reduce the dependence on patients to
generate graphs, although none eliminate it altogether. These
devices will no doubt improve the general quality of tracings,
and probably the method's reproducibility. With their use,
however, two of the great advantages of the method are
reduced: inexpensiveness and easy application.
References
Bauman, J.E. (1981) Basal body temperature: unreliable method of ovulation
detection. Fertil. Steril., 36, 729-733.
Corson, S.L., Batzer, F.R., Gocial, B., et al. (1995) Intra-observer and interobserver variability in scoring laparoscopic diagnosis of pelvic adhesions.
Hum. Reprod., 10, 161-164.
Donker, D.K. (1991) In Interobserver Variation in the Assessment of Fetal
Heart Rate Recordings. VU University Press, Amsterdam, pp. 145-161.
Downs, K.A. and Gibson, M. (1983) Basal body temperature graph and the
luteal phase defect. Fertil. Steril., 40, 466-468.
Elmore, J.G., Wells, C.K., Lee, C.H., Howard, D.H. and Feinstein, A.R.
(1994) Variability in radiologists' interpretation of mammograms. N. Engl.
J. Med., 331, 1493-1499.
Grant, A. (1984) Principles for clinical evaluation of methods of perinatal
monitoring. J. Perinat. Med., 12, 227-231.
Grant, J.M. (1991) The fetal heart rate trace is normal, isn't it? Observer
agreement of categorical assessments. Lancet, 337, 215-218.
Guzick, D.S., Grefenstette, I., Baffone, K., et al. (1994) Infertility evaluation
in fertile women: a model for assessing the efficacy of infertility testing.
Hum. Reprod., 9, 2306.
Hilgers, T.W. and Bailey, A.J. (1980) Natural family planning: II. Basal body
temperature and estimated time of ovulation. Obstet. Gynecol., 55, 333.
Kambic, R. and Gray, R.H. (1989) Interobserver variation in estimation of
day of conception intercourse using selected natural family planning charts.
Fertil. Steril., 51, 430-434.
Lenton, E.A., Weston, G.A. and Cooke, I.D. (1977) Problems in using basal
body temperature recordings in an infertility clinic. Br. Med. J., 1, 803-805.
Martinez, A.R., van Hooff, M.H.A., Schoute, E., van der Meer, M., Broekmans,
F.J.M. and Hompes, P.G.A. (1992) The reliability, acceptability and
applications of basal body temperature (BBT) records in the diagnosis and
treatment of infertility. Eur. J. Obstet. Gynecol. Reprod. Biol., 47, 121-127.
McCarthy, J.J.Jr and Rockette, H.E. (1983) A comparison of methods to
interpret the basal body temperature graph. Fertil. Steril., 39, 640-646.
Moghissi, K.S. (1976) Accuracy of basal body temperature for ovulation
detection. Fertil. Steril., 27, 1415-1421.
Newill, R.G.D. and Katz, M. (1982) The basal body temperature chart
in artificial insemination by donor pregnancy cycles. Fertil. Steril., 38,
431-438.
Royston, J.P. and Abrams, R.M. (1980) An objective method for detecting
the shift in basal body temperature in women. Biometrics, 36, 217-224.
Squire, W.S. (1868) Puerperal temperatures. Trans. Obstet. Soc. (London), 9,
129-144.
Steer, C.V., Williams, J., Zaidi, J., Campbell, S. and Tan, S.L. (1995) Intraobserver, interobserver, interultrasound transducer and intercycle variation
in colour Doppler assessment of uterine artery impedance. Hum. Reprod.,
10, 479-481.
Vollman, R.F. (1977) In The menstrual cycle. Saunders, Philadelphia, p. 80.
Received on December 12, 1994; accepted on April 10, 1995
(Donker, 1991). For these, computer analysis has become a
way to overcome the problem. A small number of reports
exist on BBT graph analysis by computer-based mathematical
algorithms (Royston and Abrams, 1980; McCarthy and
Rockette, 1983), but in none has a clinical evaluation been
performed. Computerized analysis has the disadvantage of
making methods more expensive, complicated and less widely
applicable. If visual analysis is reproducible, a computerized
approach is probably unnecessary.
In this study, graph analysis was performed by clinicians
with experience in this area, in an attempt to eliminate the
frequent disagreement associated with less correct interpretation (Lenton et al., 1977; Downs and Gibson, 1983; Martinez
et al., 1992). Using uniform analysis criteria, we aimed to
eliminate yet another common cause of disagreement (Bauman,
1981; McCarthy and Rockette, 1983; Kambic and Gray, 1989).
Different criteria for graph analysis are commonly used,
particularly regarding the day of ovulation. Agreement in
identification of the thermal nadir was clearly superior to
that reported in previous publications. Reproducibility of
'ovulatory' BBT graph features (i.e. biphasic graphs and
adequate thermal shifts) was excellent. Agreement in classification of monophasic graphs and inadequate thermal shifts,
although lower, was still good. Classification of uninterpretable
graphs was clearly not reproducible. Thus, with experienced
observers and uniform criteria, a good reproducibility can be
achieved in visual analysis of the most important parameters
of the BBT graph. The analysis criteria chosen for this study
provided good results, but Figures 2-4 show that there is still
room for improvement.