American Journal of Epidemiology
Copyright © 1997 by The Johns Hopkins University School of Hygiene and Public Health
All rights reserved
Vol. 145, No. 4
Printed in U.S.A.
Constructing Reproductive Histories by Linking Vital Records
Melissa M. Adams,1 Hoyt G. Wilson,1 Dale L. Casto,1 Cynthia J. Berg,1 Jeanne M. McDermott,1
James A. Gaudino,1,2,3 and Brian J. McCarthy1
Certificates of 1,449,287 live births and fetal deaths filed in Georgia from 1980 through 1992 were linked to
create chronologies that, excluding induced abortions and ectopic pregnancies, constituted the reproductive
experience of individual women. The authors initially used a deterministic method (whereby linking rules were
not based on probability theory) to link as many records as possible, knowing that some of the linkages would
be incorrect. They subsequently used a probabilistic method (whereby evaluation of linkages was developed
from probability theory) to evaluate each linkage, and they broke those that were judged to be incorrect. Of the
1.4 million records, 38% did not link to another record. From the remaining records, 369,686 chains of two or
more events were constructed. The longest chain included 12 events. Of the chains, 69% included two events;
22% included three events. Longer chains tended to have lower scores for probable validity. The probability-based evaluation of chains affected 3.0% of the records that had been in chains at the end of the deterministic
linkage. A greater percentage of records in longer chains were affected by the evaluation. Unfortunately, the
small subset of records that were the most difficult to link tended to overrepresent groups with the greatest
risk of adverse pregnancy outcomes. Researchers contemplating a similar linkage can anticipate that, for the
majority of records, linkage can be accomplished with a relatively straightforward, deterministic approach.
Am J Epidemiol 1997; 145:339-48.
birth certificates; epidemiologic methods; fetal death; medical record linkage; reproductive history
Population-based data on women's reproductive experiences over successive pregnancies have not been
available in the United States until recently. Cross-sectional studies have provided most of our current
understanding of the relations of outcomes across
pregnancies. For example, analyses of cross-sectional
data have led researchers to conclude that perinatal
mortality increases with each subsequent birth (1).
When a woman's total number of births is taken into
account, however, a different pattern emerges: The
risk of perinatal mortality decreases with each subsequent birth (1). Thus, cross-sectional studies can lead
to false conclusions. In contrast, analyses of longitudinal data (i.e., data on an individual woman's successive pregnancies) offer many advantages: They can
better elucidate patterns across pregnancies, and they
are not biased by selective fertility (2, 3). Finally,
longitudinal data are important for evaluating the success of public health interventions, such as progress
toward reducing the rate of repeated cesarean section.
Compiling longitudinal data in the United States has
been difficult because of the absence of unique identifiers that facilitate linking records. This report describes the methods used to link birth and fetal death
certificates filed in Georgia from 1980 through 1992
and gives an overview of the results.
Received for publication December 29, 1995, and accepted for
publication September 27, 1996.
From the World Health Organization Collaborating Center Institutions: Emory University Regional Perinatal Center, Centers for
Disease Control and Prevention, and Division of Public Health, State
of Georgia
1 World Health Organization Collaborating Center in Perinatal
Care and Health Services Research in Maternal Child Health, Division of Reproductive Health, Atlanta, GA.
2 Office of Perinatal Epidemiology, Epidemiology and Prevention
Branch, Division of Public Health, Department of Human Resources,
State of Georgia, Atlanta, GA.
3 Epidemic Intelligence Service, Division of Field Epidemiology,
Epidemiology Program Office, Centers for Disease Control and
Prevention, Atlanta, GA.
Reprint requests to Dr. Melissa M. Adams, World Health Organization Collaborating Center in Perinatal Care and Health Services
Research in Maternal Child Health, Division of Reproductive Health,
MS K-23, National Center for Chronic Disease Prevention and
Health Promotion, Centers for Disease Control and Prevention,
Public Health Service, US Department of Health and Human Services, 4770 Buford Highway NE, Atlanta, GA 30341-3724.
MATERIALS AND METHODS
We linked live birth and fetal death certificates into
chronological chains of events that, excluding induced
abortions and ectopic pregnancies, constituted the reproductive experience of individual women. Each
baby or fetus corresponded to a reproductive event that
was represented by a record in the computer file. Thus
a twin birth, for example, corresponded to two events.
Induced abortions were excluded because the vital
records for these events lacked personally identifying
information (hereafter referred to as personal identifiers). Certificates for ectopic pregnancies were not reliably recorded. We use the term link to mean designation of two records as belonging to the same mother.
A chain is a set of two or more records that have been
linked. If a record is linked to any record in an existing
chain, that record becomes a member of the chain and
is considered to be linked to all other records in the
chain. If two records belonging to different chains are
linked, the two chains are combined into a single
chain. A record cannot be a member of more than one
chain.
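The chain rules above behave like a disjoint-set (union-find) structure: linking two records merges whatever chains they belong to, and no record can sit in two chains. The following Python sketch is illustrative only. The original processing used sorted, sequential SAS programs, and the class and method names here are hypothetical.

```python
class ChainIndex:
    """Disjoint-set view of records and the chains they form (illustrative)."""

    def __init__(self):
        self.parent = {}

    def find(self, record):
        # A record seen for the first time starts as its own unlinked set.
        self.parent.setdefault(record, record)
        root = record
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every visited record straight at the root.
        while self.parent[record] != root:
            self.parent[record], record = root, self.parent[record]
        return root

    def link(self, a, b):
        # Linking two records combines the chains they belong to, so each
        # record becomes linked to every other record in the merged chain.
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self.parent[root_b] = root_a

    def chains(self):
        # A chain is a set of two or more linked records.
        groups = {}
        for record in list(self.parent):
            groups.setdefault(self.find(record), set()).add(record)
        return [members for members in groups.values() if len(members) >= 2]
```

Because each record has exactly one root, membership in more than one chain is impossible by construction.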
Theoretically, each person in the United States has a
unique identifier: his or her Social Security number.
As described below, we found that Social Security
numbers alone were inadequate for complete linkage.
Instead, we used a combination of variables, including
maternal date of birth, first name, maiden name, and
Social Security number.
The linkage process entailed two multistep procedures. First, we used a deterministic method to link as
many records as possible, knowing that some of the
linkages would be incorrect. We then used a probabilistic method to evaluate each linkage, and we broke
linkages that were judged to be incorrect. The initial
record-linking method employed linking rules based
on our judgment and intuition. It was deterministic in
that the linking rules were not developed or justified
on the basis of probability theory. In contrast, the
method used for breaking linkages was probabilistic in
that the rules for delinking were based on probability
theory.
We apply the terms match and matching both to
pairs of records and to pairs of values of a variable
contained in two different records. If a pair of records
matches, it means that the two records (events) correspond to the same mother. When we say that the
values of a variable in two different records match, we
mean that the two values are enough alike that we
conclude that the correct values are identical. In other
words, the values would be identical if they had been
reported, recorded, and keyed correctly. For example,
one of the variables used in the deterministic linking
process was mother's first name; the two values "Sally" and "Sallie" were considered a match and were
awarded the same score as if the two values had been
identical. We use the term exact match to signify
identical pairs of values. All processing of records
consisted of sorting and sequential processing using
SAS programs (4).
Data available for linking
The database included vital records for 104,102 fetal
deaths and 1,345,185 live births from 1980 through
1992 that occurred in Georgia, regardless of maternal
state of residence, or that occurred in other states to
Georgia residents and for which certificates were sent
to Georgia. Georgia law requires the filing of a fetal
death certificate for all spontaneous terminations of
pregnancy not resulting in a live birth, regardless of
the length of gestation before the termination occurs.
We used 27 variables to create and evaluate the linkages (table 1).
Deterministic linkage
The deterministic linkage consisted of phase I,
which entailed six processing steps during which
TABLE 1. Percentage of records with data for personal
identifiers by record type, Georgia, 1980-1992
Variable
by
type
Maternal
Maiden name
first name
Initial of middle name
Date of birth
Social Security no.
Race
Zip code of residence
Years of education
State of birth
Year of most recent
Live birth
Fetal death
Month of most recent
Live birth
Fetal death
Previous live births, <2,500 g
Now living
Now dead
Previous live births, ≥2,500 g
Now living
Now dead
Previous live births, now living
Previous live births, now dead
Previous fetal deaths, <20 weeks
Previous fetal deaths, ≥20 weeks
Fetal
deaths
(%)
Live
births
(%)
87
100
67
99
0*
98
0*
100
100
89
99
94
99
100
99
96
57
0*
85
84
99
97
78
99
66
67
95t
95t
95t
95f
99f
94t
99t
94t
99*
99*
100*
100*
91t
90f
96t
95t
Paternal
Surname
First name
Initial of middle name
Date of birth
65
65
62
30*
84
84
75
Infant
Surname
Date of birth
0*
100
100
100
83
• Data not recorded for 1980-1992.
t Data not recorded for 1989-1992.
* Data not recorded for 1980-1988.
chains were formed and individual (previously unlinked) records were added to chains. Next followed
phase II, which entailed multiple passes through the
file to combine chains belonging to the same mother.
At each step in phase I, the file was first sorted on
one or more key variables; then all pairs of records
having the same value of the sort key were compared
and considered for linking. For example, in the first
step, all pairs of records having the same mother's date
of birth (month, day, and year) were compared and
considered for linking. This process required sufficient
computer memory (real or virtual) to allow all records
having the same sort key to be held in memory at one
time. If the data had been complete and error free, all
matching pairs of records could have been identified
based on a single key variable. Because some data
were missing or inaccurate, however, we repeated the
process for multiple key variables to identify as many
matching pairs of records as possible. Appendix 1 lists
the key variables used in the six processing steps.
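The sort-and-compare step described above is what record linkage work usually calls blocking. A minimal Python sketch follows, assuming each record is a dictionary and that records with a missing key are skipped for that pass. The field names and the treatment of missing keys are assumptions, not details taken from the SAS programs.

```python
from itertools import combinations, groupby


def candidate_pairs(records, key):
    """Yield every pair of records that share the same non-missing sort key."""
    usable = [r for r in records if r.get(key) is not None]
    usable.sort(key=lambda r: r[key])
    for _, block in groupby(usable, key=lambda r: r[key]):
        # All records with the same key value are held in memory together,
        # which is the memory requirement noted in the text.
        yield from combinations(list(block), 2)
```

Repeating the call with a different `key` for each processing step mirrors the idea of making several passes so that a missing or miskeyed value in one variable does not prevent a comparison.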
In each step, record matches were determined by
comparing pairs of records on 10 variables (table 2)
and computing a deterministic score for each pair. A
match on any of the variables added one point to the
deterministic score; no points were given when data
were missing. Points were subtracted for nonmatches
TABLE 2. Points assigned for comparisons between two records and linkage criteria of 1980-1992 Georgia live births and fetal death records

Type of variable compared                          Points assigned
between two records                          Match   Disagree   Data missing
Numeric
  Maternal date of birth*                      +1       -1           0
  Paternal date of birth                       +1        0           0
  Date of most recent event (month/year)*      +1        0           0
  Maternal Social Security no.*                +1       -1           0
Name
  Maiden name*                                 +1       -1           0
  Maternal first name*                         +1       -1           0
  Paternal surname†                            +1        0           0
  Paternal first name                          +1        0           0
  Child's surname†                             +1        0           0
  Child's surname and paternal surname
    on most recent event                       +1        0           0

* Linkage criteria include matching this variable as one of at
least two maternal variables (to avoid linkage on paternal
information only) and one of the following: 1) a deterministic score
of 4 or greater, or 2) a deterministic score of 3 and matching on two
name variables and one numeric variable.
† Only 1 point can be given for matches between child's and
father's surnames, regardless of whether the match derived from a
comparison of father's surname on record A and child's surname on
record B or vice versa.
on some variables (table 2). Criteria for declaring a
match are listed in table 2. Mother's last name was not
used in the scoring because it was not recorded on
birth certificates before 1989 and thus was not available for the bulk of our records.
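The scoring and linkage criteria of table 2 can be sketched as below. This is one reading of the table and its footnote, not the original SAS logic. The variable names are hypothetical, the set of variables that lose a point on disagreement is an assumption drawn from the table, and "one of at least two maternal variables" is interpreted as at least two matches among the asterisked variables.

```python
# Variables marked with an asterisk in table 2 (assumed reading).
STARRED = {"maternal_dob", "recent_event_date", "maternal_ssn",
           "maiden_name", "maternal_first_name"}
NUMERIC = {"maternal_dob", "paternal_dob", "recent_event_date", "maternal_ssn"}
NAME = {"maiden_name", "maternal_first_name", "paternal_surname",
        "paternal_first_name", "child_surname", "child_paternal_surname"}
# Assumption: only these variables lose a point when the values disagree.
PENALIZED = {"maternal_dob", "maternal_ssn", "maiden_name", "maternal_first_name"}


def deterministic_score(outcomes):
    """outcomes maps a variable name to 'match', 'disagree', or 'missing'."""
    score = 0
    for variable, outcome in outcomes.items():
        if outcome == "match":
            score += 1
        elif outcome == "disagree" and variable in PENALIZED:
            score -= 1
        # Missing data neither add nor subtract points.
    return score


def is_link(outcomes):
    """Apply the linkage criteria from the table 2 footnote (as read here)."""
    matched = {v for v, o in outcomes.items() if o == "match"}
    if len(matched & STARRED) < 2:
        return False  # guards against linking on paternal information only
    score = deterministic_score(outcomes)
    if score >= 4:
        return True
    return score == 3 and len(matched & NAME) >= 2 and len(matched & NUMERIC) >= 1
```

Under this reading, four matching paternal and child variables alone would reach a score of 4 but still fail the maternal requirement.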
To allow for spelling and keying errors, we did not
require exact matches on first names, surnames, or
Social Security numbers. For first names, we required
that only the first three letters match exactly. For
surnames, we applied an algorithm that took into account the length of the name and the number of matching letters. For Social Security numbers, we permitted
as many as two transpositions or incorrect digits.
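Two of these tolerant comparisons are simple enough to sketch in Python. The first-name rule follows the text directly. For Social Security numbers, the text does not say how transpositions and incorrect digits were counted, so the version below assumes that an adjacent swap counts as one error and a wrong digit as one error. The surname algorithm is not specified in this section and is not sketched.

```python
def first_names_match(a, b):
    """First names match when their first three letters agree exactly."""
    a, b = a.strip().upper(), b.strip().upper()
    return bool(a) and a[:3] == b[:3]


def ssn_errors(a, b):
    """Count discrepancies between two equal-length digit strings.

    Assumption: two adjacent digits that are swapped count as a single
    error, and any other differing digit counts as a single error.
    """
    errors, i = 0, 0
    while i < len(a):
        if a[i] == b[i]:
            i += 1
        elif i + 1 < len(a) and a[i] == b[i + 1] and a[i + 1] == b[i]:
            errors += 1  # adjacent transposition
            i += 2
        else:
            errors += 1  # incorrect digit
            i += 1
    return errors


def ssns_match(a, b, max_errors=2):
    """Treat two nine-digit numbers as matching with at most two errors."""
    if len(a) != 9 or len(b) != 9:
        return False
    return ssn_errors(a, b) <= max_errors
```

With these rules, "Sally" and "Sallie" match on first name, as in the example given earlier in the Methods.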
Appendix 1 includes a description of rules that constrained the linkage at each step in phase I. These rules
were designed to avoid the formation of separate
chains corresponding to the same mother (chain fragments). In spite of these measures, the method resulted
in some chain fragments, so that after the six steps
were completed, additional passes were made through
the file (phase II) to consolidate any fragments. Because of the sequential nature of our processing and
the potentially complex linkage patterns required for
combining multiple chains, the phase II process of
combining chains was a nontrivial task, requiring multiple passes through the file.
Probabilistic evaluation of linked records
To identify linkages that were likely to be incorrect,
we evaluated each potential chain using a probabilistic
approach, wherein linkage (or delinkage) decisions
were made according to rules based on probability
theory. The methods were based on the general approach widely ascribed to Fellegi and Sunter (5) and
refined by many other authors (6-8). Our process
consisted of the following eight elements: 1) establishment of the probability-based linkage scoring methods; 2) selection of variables and definition of outcome sets for record-record comparisons on all
variables; 3) estimates of required probabilities from
the database; 4) computation of scores for all record
pairs in each of the chains that had been formed by the
deterministic linkage; 5) from the record pair scores,
computation of a quality index (called the weakest
path score) for each chain; 6) manual review of randomly selected chains to establish the cutoff score for
breaking chains; 7) application of the cutoff rule to all
chains, breaking those that fall below the cutoff; and
8) manual review of the few remaining chains having
marginal scores, breaking those that appear to be incorrectly linked. Although these elements are listed in
logical sequence, there was considerable overlap and
iteration, especially in steps 2-4, as we worked
through the process of defining variables and assessing
their discriminating power. These three steps required
regular visual review of distributions of values of the
variables, outcomes of record-record comparisons on
the variables, and linked records. Step 6 also required
a visual review of many chains. The visual review of
chains in step 8 required, in comparison, only a moderate amount of labor.
Figure 1 lists the 18 maternal, paternal, and other
variables used in computing the probability-based
scores for pairs of linked records. For each of the
variables, $X_i$, $i = 1, 2, \ldots, 18$, the linkage score component, $r_i$, was based on the outcome, $x_i$, of comparing
$X_i$ values in two records. For some variables, such as
maternal middle initial, we observed only whether the
two fields matched exactly; thus the corresponding
$x_i$ could take only the values "match" and "nonmatch."
For other variables, the outcome set was more complex, as described below.
Fetal Death and Birth Certificate Variables
Maternal
1. Maiden name
2. First name
3. Middle initial
4. Date of birth
5. Years of education
6. State of birth
7. Race
8. Zip code of residence
9. Social Security number
Paternal
10. First name
11. Last name
12. Middle initial
13. Date of birth
Other
14. Date of most recent live birth (later record)
versus event date of earlier record
15. Date of most recent fetal death (later record)
versus event date of earlier record
16. Number of previous live births
17. Number of previous fetal deaths
18. Event date (elapsed time between the two events;
used as an indicator of biologic plausibility).
FIGURE 1. Variables used in probabilistic evaluation of vital record
linkages, Georgia, 1980-1992.

The 18 scores, $r_i$, computed from the results of the
comparisons of the 18 variables, were added to obtain
the composite, probability-based score r for each
record pair. This scheme is intuitively appealing in
that the score $r_i$ for a variable is positive if the corresponding fields match, and it is negative if they do not
match. Details of the computations used in probabilistic
evaluation of linkages are given in Appendix 2.
Adding the 18 variable scores $r_i$, which are logarithms
of probability ratios, to obtain an overall score r is
appropriate as long as the 18 outcome variables are
independent. This assumption of independence among
the $X_i$ appeared generally to be a reasonable working
assumption, with a few notable exceptions, discussed
below.
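In symbols, the composite score can be written as below. The ratio is written in the standard Fellegi-Sunter form, using the conditional probabilities $P(x_i \mid m)$ and $P(x_i \mid m')$ that are defined later in this section. That form and the unspecified base of the logarithm are assumptions here, and Appendix 2 remains the authoritative statement.

```latex
r = \sum_{i=1}^{18} r_i ,
\qquad
r_i = \log \frac{P(x_i \mid m)}{P(x_i \mid m')}
```

Under independence, the probability ratio for the joint outcome is the product of the 18 per-variable ratios, so its logarithm is the sum of the $r_i$. This is why independence is needed for the addition to be appropriate.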
The outcome sets (values of $x_i$) reflected multiple
degrees of matching for the following variables: maternal Social Security number and years of education;
dates of most recent live birth and fetal death; numbers
of previous live births and fetal deaths; and event dates
(to measure the biologic plausibility of interval between two events). The purpose was to allow for
comparisons that, although not perfect matches, suggested that the records belonged to the same mother.
For example, when "mother's years of education" was
compared for two records, the following outcomes
were considered: Years of education matched exactly,
the chronologically later record had 1 more year of
education than the earlier record, the later record had
2 more years of education than the earlier record, and
so forth. As expected, scores ($r_i$ values) were higher
for closer matches.
The outcome sets for the dates of the most recent
live birth and fetal death and the numbers of previous
live births and fetal deaths (figure 1, variables 14-17)
were determined as follows. We first checked for
whether the date (month and year) of the most recent
live birth recorded on the chronologically later record
was the same as the date of the earlier live birth. If the
dates did not match exactly, we checked whether the
date on the later record was after the earlier live birth,
suggesting a failure to link a live birth that occurred
between the two records. Thinking that data for the
most recent live birth could have been inaccurately
recorded in the field for most recent fetal death, we
also checked to see whether the date for the most
recent fetal death recorded on the chronologically later
record was the same as the date of the earlier live birth.
We repeated these steps for the date of the most recent
fetal death, checking the date of the most recent fetal
death for possible incorrect recording in the field for
the most recent live birth. Finally, we separately
checked fetal deaths and live births for consistency
between the number of previous events of each type
recorded on the certificate and the number of preceding records in the chain.
We used the interval between two events (figure 1,
variable 18) as an indicator of biological plausibility of
a record match, judging that very short intervals were
unlikely. The outcome set consisted of the following
three categories: 1) less than 140 days between an
event (live birth or fetal death) and a subsequent live
birth, 2) less than 55 days between an event and a
subsequent fetal death, and 3) any other interevent
interval. For outcomes 1 and 2, we required further
that the event pair under consideration not be twins.
Specific categories used for other variables are available from the authors.
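The three interval outcomes for variable 18 can be expressed as a small classification function. The Python sketch below assumes that event dates are available as `datetime.date` values and that twin status is supplied by the caller. The function and parameter names are hypothetical.

```python
from datetime import date


def interval_outcome(earlier_event, later_event, later_is_fetal_death, twins=False):
    """Classify the gap between two events into the three outcomes in the text."""
    days = (later_event - earlier_event).days
    if not twins:
        if not later_is_fetal_death and days < 140:
            return 1  # less than 140 days before a subsequent live birth
        if later_is_fetal_death and days < 55:
            return 2  # less than 55 days before a subsequent fetal death
    return 3  # any other interevent interval
```

Outcomes 1 and 2 flag biologically implausible pairs, so they would be expected to carry negative score components.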
The scoring method accounted for the specific values of the following variables: maternal first name and
maiden name, date of birth, zip code of residence,
race, and state of birth; and paternal date of birth, first
name, and last name. For each of these variables, the
set of outcomes, $x_i$, reflected not only the degree of
match but also the particular values for those situations
when fields matched exactly. Thus greater weight was
given to matches on less common values. For example, two records with the maiden name of "Adams,"
which is relatively common in Georgia, had a lower
score than two records with the name "Gaudino,"
which is much less common. Likewise, two records
with the mother's state of birth as Georgia had a lower
score than two records with the mother's state of birth
as Delaware.
In preparation for the scoring of record pairs, we had
to estimate from the database probabilities from which
matching scores, $r_i$, were computed. For each possible
comparison outcome value, $x_i$, for each variable $X_i$, we
estimated the following two conditional probabilities:
1) $P(x_i \mid m)$, the probability of observing outcome $x_i$
given that the record pair is a match (i.e., the records
correspond to the same mother); and 2) $P(x_i \mid m')$, the
probability of observing outcome $x_i$ given that the
record pair does not match. (See Appendix 2 for
additional notational definitions and development of
the probabilistic scoring method.) For those variables
for which the comparison outcome did not reflect the
specific value of the variable (such as Social Security
number), we estimated probabilities by taking advantage of the completed deterministic linkage. From
existing chains, we used pairs of records that met
stringent matching criteria as a set of "true" matches to
estimate values of $P(x_i \mid m)$. Similarly, we used a large
pool of record pairs that were clearly nonmatches to
estimate values of $P(x_i \mid m')$.
For those variables (such as names) for which outcome (and thus $P(x_i \mid m')$) varied according to the particular value of the variable, we used the frequency
distributions of values in the database to estimate
probabilities associated with each individual value
(see Appendix 2 for details). By this means, matches
on uncommon names received higher scores than
matches on common names, as described above.
When data were missing for a variable in either or
both of the records under consideration, the corresponding $r_i$ was set to zero, so that points were neither
added to nor subtracted from the score.
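The scoring scheme can be sketched in Python as follows. Several details are assumptions rather than the method of Appendix 2: the logarithm base (2 here), the way value-specific probabilities are derived from relative frequencies, and the `agreement_rate` parameter. The function names are hypothetical.

```python
import math
from collections import Counter


def score_component(p_given_match, p_given_nonmatch):
    """r_i for one observed outcome: the log of the probability ratio."""
    return math.log2(p_given_match / p_given_nonmatch)


def value_frequencies(values):
    """Relative frequency of each non-missing value in the file."""
    counts = Counter(v for v in values if v is not None)
    total = sum(counts.values())
    return {v: n / total for v, n in counts.items()}


def value_specific_component(value, freqs, agreement_rate=0.95):
    """Score for exact agreement on a particular value (assumed form).

    For a true match, both records show the value with probability about
    agreement_rate * f; for a nonmatch, about f * f by chance.
    """
    f = freqs[value]
    return score_component(agreement_rate * f, f * f)


def composite_score(outcomes, prob_tables):
    """Sum r_i over variables; a missing outcome (None) contributes zero."""
    total = 0.0
    for variable, outcome in outcomes.items():
        if outcome is None:
            continue
        p_match, p_nonmatch = prob_tables[variable][outcome]
        total += score_component(p_match, p_nonmatch)
    return total
```

With this form, agreement on a rare maiden name scores higher than agreement on a common one, which is the behavior the text describes.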
In applying these scoring methods to the delinking
of existing chains, we first computed the composite,
probability-based score r for each pair of records in a
chain (not just adjacent pairs). We then computed a
"weakest path" score (described below) for the chain,
and chains whose weakest paths were less than 16
were broken. This step yielded some shorter chains
and some unlinked records. Finally, we reviewed the
20 chains representing 133 events that met either of
the following two criteria: 1) the chain contained 10 or
more events, or 2) the chain contained five or more
events and had a weakest path score of 16-19. Those
that were judged to be incorrectly linked were manually split. This final step was undertaken because of
the low likelihood that a woman had 10 or more
infants during the 13 years of the study and because of
the observation that some of the longer chains with
low scores were incorrectly linked.
The weakest path score is the lowest probability-based score in the "best" path (not necessarily the
sequential, chronological connection) between any
two records in the chain. For example, consider a
three-record chain consisting of records A, B, and C,
with probability-based scores shown in figure 2. The
weakest path score is 25 because any two records
(events) in the chain can be connected by a path that is
never lower than 25. For example, the best path between records B and C is B-A-C, in which the probability-based scores are 30 and 25.
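One way to compute such a score is to add record pairs in order of decreasing score until the whole chain is connected. The score of the pair that completes the connection is then the weakest link that any best path must use. The Python sketch below takes this approach. It is an illustration that reproduces the three-record example, not the authors' SAS procedure, and the function name and input layout are hypothetical.

```python
def weakest_path_score(records, pair_scores):
    """Return the lowest score on the best paths joining all records.

    pair_scores maps frozenset({a, b}) to the probability-based score r.
    """
    parent = {rec: rec for rec in records}

    def find(rec):
        while parent[rec] != rec:
            parent[rec] = parent[parent[rec]]  # path halving
            rec = parent[rec]
        return rec

    components = len(parent)
    # Strongest pairs first: the pair that finally joins every record
    # carries the weakest score on any best path.
    ranked = sorted(pair_scores.items(), key=lambda kv: kv[1], reverse=True)
    for pair, score in ranked:
        a, b = tuple(pair)
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a
            components -= 1
            if components == 1:
                return score
    raise ValueError("records are not all connected by scored pairs")
```

For records A, B, and C with scores 30 (A-B), 25 (A-C), and 15 (B-C), the function returns 25, in agreement with the example above.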
We established the weakest path cutoff value of 16
for breaking chains by manually reviewing many
chains that had a wide range of record-record
probability-based scores, r. We judged most record
pairs having link scores of 16 as valid matches. Scores
higher than 16 indicated even greater likelihood of
match validity. To put the value of 16 in perspective,
the score component, $r_i$, for an exact match on the
Social Security number variable was 18. The range of
score components, $r_i$, for exact matches on other key
variables included the following: maternal maiden
name, 6-13; maternal first name, 4-12; and maternal
date of birth, 8-10. Recall that score components, $r_i$,
were negative for nonmatching fields, resulting in the
subtraction of points from the composite probability-based score r.

[Figure 2 diagram: records A, B, and C joined pairwise, with r = 30 for A-B, r = 25 for A-C, and r = 15 for B-C.]
FIGURE 2. Weakest path in a three-record chain used in the
construction of reproductive histories in Georgia, 1980-1992. r,
probability-based score.
RESULTS
Data used for linking
In Georgia, the reported annual number of live
births increased from 95,640 in 1980 to 114,235 in
1992; and the reported annual number of fetal deaths
increased from 6,556 in 1980 to 10,636 in 1992.
Assuming that approximately 15 percent of clinically
recognized pregnancies end in spontaneous loss (9)
and using the number of reported live births and fetal
deaths to approximate the number of clinically recognized pregnancies, we estimated that 217,393 spontaneous pregnancy losses (fetal deaths) occurred. Only
104,102 fetal death certificates were filed, suggesting
approximately 52 percent underreporting of fetal
deaths.
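The arithmetic behind this estimate can be reproduced as follows. The sketch assumes, as the figures in the text imply, that the 15 percent loss rate is applied to the sum of reported live births and fetal deaths. The variable names are illustrative.

```python
live_births = 1_345_185
fetal_death_certificates = 104_102
loss_fraction = 0.15  # share of recognized pregnancies ending in spontaneous loss

# Reported events stand in for clinically recognized pregnancies.
recognized_pregnancies = live_births + fetal_death_certificates
expected_losses = round(loss_fraction * recognized_pregnancies)

# Share of expected losses for which no certificate was filed.
underreporting = 1 - fetal_death_certificates / expected_losses
```

The filed certificates cover a little under half of the expected losses, which gives the roughly 52 percent underreporting cited above.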
Some variables were not collected in all years or
were collected in different formats on fetal death and
live birth certificates (table 1). In general, personal
identifiers were more completely recorded on certificates for live births than for fetal deaths. For live
births, completeness of reporting was consistent across
the 13 years of the study except for two variables.
First, for the mother's Social Security number, completeness increased from 87 percent in 1980 to 96
percent in 1988 and subsequent years. Second, for the
number of previous live births less than 2,500 g and
the number of previous spontaneous abortions beyond
20 weeks, completeness increased from approximately
88 percent in the early 1980s to 99 percent in 1984 and
later years.
For fetal deaths, completeness of recording for several variables changed over time. Completeness for
maiden name declined from 96 percent in the early
1980s to 77 percent in 1992. Completeness for maternal education declined from 64 percent in the early
1980s to 51 percent in the early 1990s. Completeness
for the dates of the most recent live birth and fetal
death increased from about 80 percent in the early
1980s to 98 percent in the early 1990s. Completeness
for the father's first and last names decreased from
about 72 percent in the early 1980s to 55 percent in
1992. Completeness for information on fetal death
certificates did not appear to vary with length of gestation at delivery.
When initially planning the linkage approach, we
considered but ultimately decided against basing it
exclusively on the mother's Social Security number.
The mother's Social Security number was not collected on fetal death certificates from 1980 through
1988 and tended to be missing in a nonrandom manner
on other records. It was less likely to be recorded on
certificates for babies bom to younger women, women
with lower levels of education, and women who were
not born in the United States (table 3). A third consideration was that the mother's Social Security number was not always accurately recorded. More than
7,100 pairs of records were observed that had the same
maternal Social Security number but had different
TABLE 3. Percentage of birth certificates with misting data
for maternal Social Security number by selected maternal and
infant characteristics, Georgia, 1980-1992
No. of
records
= 1,345,185)
Variable
Records wtth
missing Social
Security no.
Maternal age (years)
<18
18-19
20-24
25-29
30-^34
>34
Unknown
94,383
137,979
424,687
378,667
217,521
76,845
15,103
21.4
6.7
3.8
3.2
3.3
4.5
95.6
Maternal education (years)
<12
12
13-15
>15
Unknown
353,187
556,711
229,105
203,516
2,666
11.7
4.5
3.2
3.7
50.9
Marital status
Married
Other
Unknown
967,875
376,182
1,128
5.3
8.0
57.9
Maternal race
White
Black
Other
Missing
858,842
469,462
2,753
14,128
5.8
6.3
12.6
15.3
Maternal state of birth*
Georgia
Other US
Other country
Unknown
810,428
427,794
10,009
2,909
5.9
5.1
16.7
6.2
21,740
91,224
1,231,346
875
10.5
7.8
5.9
29.6
Infant birth weight (g)
<1,500
1,500-2,499
>2,499
Unknown
•Data available for live birth only.
Am J Epidemiol
Vol. 145, No. 4, 1997
Constructing Reproductive Histories
the deterministic step, 29.1 percent changed to shorter
chains after the probability-based evaluation. In contrast, among the 247,275 records that were in chains of
three events, 2.4 percent changed to shorter chains;
and among the 506,012 records in chains of two
events, 0.7 percent were split apart. The probabilitybased evaluation also affected proportionately more
records of women whose marital status was unknown,
who were of races other than white, who had 12 or
fewer years of education, and who were born before
1950. Infant outcome (fetal death or live birth) and
birth weight did not affect the likelihood that a record
was delinked.
maternal dates of birth, first names, and maiden
names. Examination of the personal identifiers for
these pairs suggested that the events corresponded to
different mothers and that the matching Social Security numbers were erroneous.
Results of linkage
Of the 1.4 million records in the database, 38 percent (551,391) did not link to another record. From the
remaining 897,896 events, 369,686 chains of two or
more events that occurred to the same woman were
constructed (table 4). The longest chain included 12
events. The preponderance of chains contained two
events. For most chains, the weakest path had a score
of 30 or more. Chains with greater numbers of events,
however, tended to have lower scores for their weakest
paths.
Impact of probability-based evaluation
The probability-based evaluation of chains resulted
in delinkages affecting approximately 27,000 records,
representing 3.0 percent of the records that had been in
chains at the end of the deterministic linkage. Proportionately greater numbers of records in longer chains
were affected by the assessment. For example, of the
5,768 records that were in chains of seven events after
TABLE 4. Reproductive history chains constructed by events in the chain and the score for the
weakest path between events, Georgia, 1980-1992

No. of                     Score for weakest path
events in      16-19            20-29             ≥30              Total
chain          No.      %       No.      %        No.      %       No.     %*
2            1,262    0.5    11,412    4.5    243,189   95.0   255,863   69.2
3              736    0.9     7,994    9.7     73,527   89.4    82,257   22.3
4              267    1.2     3,394   15.1     18,876   83.8    22,537    6.1
5               80    1.3     1,369   21.9      4,815   76.9     6,264    1.7
6               32    1.7       494   26.5      1,338   71.8     1,864    0.5
7               17    2.8       166   27.2        427   70.0       610    0.2
>7               8    2.7        87   29.9        196   67.4       291    0.1
Total        2,402    0.7    24,916    6.7    342,368   92.6   369,686    100

* Rounded.
DISCUSSION
Major strengths of our linkage approach were identifying as many potentially correct linkages as possible
and evaluating these linkages using a wide range of
ancillary data. When we attempted linkage initially,
we observed that women with less complete personal
identifiers tended to have characteristics associated
with increased risks of adverse pregnancy outcomes.
For example, women whose certificates lacked their
Social Security numbers tended to be younger or have
less education, factors previously associated with adverse pregnancy outcomes (10, 11). Similarly, women
whose certificates lacked paternal information or who
had different fathers for successive pregnancies were
often not married and thus at increased risk for adverse
pregnancy outcomes (10, 11). Because of the public
health importance of women with these characteristics, we avoided basing the linkages on only a few
personal identifying variables. When evaluating the
linkages, we attempted to compensate for potential
overlinkage by using a probabilistic approach based on
a wide range of personal attributes. In addition, every
linkage was assigned a score that corresponds to its
validity, permitting an analyst to select only linkages
with highly valid scores.
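As a rough illustration of how an analyst might restrict an analysis to well-supported chains, the Python sketch below filters chains by their lowest link score. It is only a sketch: it assumes that a chain can be summarized by the minimum score among its links, which may differ from the way the weakest-path score was derived for Table 4, and the chain identifiers, scores, and function names are hypothetical.

```python
def weakest_link(scores):
    """Lowest score among the links of one chain (an assumed summary)."""
    return min(scores)


def select_chains(chains, threshold):
    """Keep chains whose weakest link scores at or above the threshold."""
    return {cid: s for cid, s in chains.items() if weakest_link(s) >= threshold}


# Hypothetical chains: chain id -> scores of the links joining successive events.
chains = {"A": [42.0, 35.5], "B": [31.0, 18.2, 40.1], "C": [24.7]}

strict = select_chains(chains, threshold=30)   # comparable to the 30-or-more band
lenient = select_chains(chains, threshold=20)  # also admits the 20-29 band
```

The thresholds of 20 and 30 simply mirror the score bands reported in Table 4; a different cutoff could be chosen for a given analysis.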
Throughout the linkage process, we were concerned
that linkages would be driven by paternal information,
resulting in the linking of events that had the same
father but different mothers. Creation of links of this
type was avoided in the deterministic part by requiring
matches on two or more maternal variables. Additionally, when evaluating linkages, we considered only a limited number of
paternal variables, thereby restricting
the impact of paternal information on the scores. Thus,
linkages occurring solely on the basis of paternal information were judged to be very unlikely, and the
manual review of records supported this impression.
A related consideration was the possibility that we
failed to link events that had different fathers but were
experienced by the same mother. An evaluation of this
linkage, reported elsewhere (12), showed rates of accurate linkage only slightly lower when paternity differed among a mother's births or when paternal information was not stated for one or more events. Thus,
we believe that this linkage methodology yielded data
appropriate for assessing the impact of changes in
paternity on pregnancy outcome.
Potential weaknesses in the approach included a
lack of independence among some of the variables
used in the probabilistic evaluation of linkages and the
occurrence of incorrect linkages between family members, especially mothers who were twins. The probabilistic evaluation of the linkages was based on the
assumption that the variables used were statistically
independent of each other. Generally, this appeared
true. A few instances were observed in which this
assumption did not hold, such as within ethnic groups
for whom a small number of first names and surnames
were used, thus violating the independence between
first name and maiden name. This problem was exacerbated by the rareness of these ethnic names, which
received high point values for matches.
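A short worked equation may help show why such dependence inflates a score. It is only a sketch: it borrows the Appendix 2 approximation that a match on a specific value contributes a ratio of 1/P(value), and the symbols $a$ (a first name), $b$ (a maiden name), and $\Delta$ are introduced here purely for illustration.

```latex
\begin{align*}
\text{assumed independent:}\quad
  & R_{\text{first}} \cdot R_{\text{maiden}} = \frac{1}{P(a)\,P(b)}, \\
\text{treated jointly:}\quad
  & R_{\text{joint}} = \frac{1}{P(a, b)}, \\
\text{overstatement of the score:}\quad
  & \Delta = \log\frac{P(a, b)}{P(a)\,P(b)} .
\end{align*}
```

When $a$ and $b$ tend to occur together, $P(a, b) > P(a)\,P(b)$ and $\Delta > 0$. The overstatement is largest when both names are rare, because $P(a)\,P(b)$ is then very small.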
In reviewing the linkages, we observed a few that
appeared to occur between mothers who were twins.
These mothers had identical information for date of
birth, state of birth, maiden name, and race and often
had very similar information for their first name (e.g.,
Mary and Martha), years of education, Social Security
number, and zip code. Despite the use of a wide range
of identifying information, some incorrect linkage of
the offspring of mothers who were twins may be
unavoidable, remedied only by manual review of
many records. Because this type of error appeared
rare, we did not undertake this review.
One cost of the approach was the substantial programming and computer resources needed to accomplish
the linkage and the probability-based evaluation. The
substantial resources were necessitated by the large
number of records that were used and variables that
were considered. Many time-consuming computer
runs were required to build the tables of frequencies
needed to assign probabilistic scores associated with
individual variables. When the linkage was started, no
commercially distributed software was available that
met the needs of the project.
Beyond our methodological approach, the available
data also influenced the success of the linkages. By
limiting the database to certificates filed in Georgia,
we excluded from the linkage events of women who
had a delivery in another state and subsequently
moved to Georgia. Because a national data set of fetal
deaths and live births that contains personal identifiers
is not available, there was no good alternative to using
Georgia data. Limiting the database to events of
1980-1992 meant that there probably were not enough
data to create lifetime pregnancy histories for many
women. The likely underregistration of fetal deaths
probably caused the linkage of these events to be
incomplete. Finally, inaccuracies and omissions in
personal identifying data inevitably limited the ability
to link records.
Probabilities were used not to link records, but only
to evaluate chains that had already been constructed.
Developing the probabilities needed for linkage scoring required a set of records that were assumed to be
correctly linked; thus, a probabilistic linkage could not
have been done without first performing a deterministic linkage.
These linked data are being used to investigate a
number of relations, such as the accuracy of the vaginal birth after cesarean section delivery method on the
birth certificate. Analyses are in progress to evaluate
the patterns of maternal behaviors across pregnancies,
such as smoking and delayed entry to prenatal care.
The data are also being used to examine the association between length of interpregnancy interval and
pregnancy outcome, adequacy of prenatal care and
risk of intrauterine growth retardation, and the impact
of changes in paternity on adverse pregnancy outcome.
Researchers contemplating a similar linkage may be
encouraged to know that, for the majority of records,
linkage can be accomplished with a relatively straightforward, deterministic approach. Evaluation of our
initial linkages shows that nearly all of them are accurate and that failure to link births correctly was rare
(12). Unfortunately, the small subset of records that
are the most difficult to link tend to overrepresent
groups at highest risk of adverse pregnancy outcomes.
For these records, evaluation of a wide range of identifying information may be needed. Future research is
needed to evaluate alternate approaches for linking
these records.
ACKNOWLEDGMENTS
The authors thank the following individuals: Cynthia
Mervis, who identified differences in the reporting formats
of birth and fetal death certificates over the study period;
staff of the Information Resources Management Office,
Centers for Disease Control and Prevention, who provided
computational resources; Michael Lavoie, Director, Vital
Statistics Office, Center for Health Information Branch,
Division of Public Health, State of Georgia, who provided
vital records data and consulted on their interpretation;
Virginia Floyd, Director, Maternal and Child Health
Branch, Division of Public Health, State of Georgia, who
facilitated the process of conducting the linkage; and Hani
Atrash, Chief, Pregnancy and Infant Health Branch, Division of Reproductive Health, National Center for Chronic
Disease Prevention and Health Promotion, Centers for Disease Control and Prevention, who provided administrative
and technical support.
REFERENCES
1. Golding J. The epidemiology of perinatal death. In: Kiely M,
ed. Reproductive and perinatal epidemiology. Boca Raton,
FL: CRC Press, 1991:406-8.
2. Skjaerven R, Wilcox AJ, Lie RT, et al. Selective fertility and
the distortion of perinatal mortality. Am J Epidemiol 1988;
128:1352-63.
3. Bakketeig LS, Hoffman HJ. Perinatal mortality by birth order
within cohorts based on sibship size. Br Med J 1979;2:693-6.
4. SAS Institute, Inc. SAS language: reference, version 6, 1st ed.
Cary, NC: SAS Institute, Inc., 1990.
5. Fellegi IP, Sunter AB. A theory of record linkage. J Am Stat
Assoc 1969;64:1183-210.
6. Newcombe HB. Handbook of record linkage: methods for
health and statistical studies, administration, and business.
Oxford, United Kingdom: Oxford University Press, 1988.
7. Winkler WE. Using the EM algorithm for weight computation
in the Fellegi-Sunter model of record linkage. Proceedings of
Survey Research Methods Section. Alexandria, VA: American
Statistical Association, 1988:667-71.
8. Thibaudeau Y. The discrimination power of dependency
structures in record linkage. Surv Methodol 1993;19:31-8.
9. Hertz-Picciotto I, Samuels SJ. Incidence of early pregnancy
loss. N Engl J Med 1988;319:1483-4.
10. Committee to Study the Prevention of Low Birthweight, Institute of Medicine. Preventing low birthweight. Washington,
DC: National Academy Press, 1985:51.
11. Berkowitz GS, Papiernik E. Epidemiology of preterm birth.
Epidemiol Rev 1993;15:414-43.
12. Adams MM, Berg CJ, McDermott JC, et al. Evaluation of
reproductive histories constructed by linking vital records.
Paed Perinat Epidemiol (in press).
APPENDIX 1
Details of the Deterministic Linkage
Each step in phase I consisted of a sort followed by a
processing run that linked records. The sort keys of the six
steps are the following:
1. mother's date of birth,
2. mother's Social Security number,
3. mother's maiden name and the first initial of her first
name,
4. mother's Social Security number,
5. mother's maiden name and the first initial of her first
name,
6. mother's maiden name and the first initial of her first
name.
We denoted the linkage status of a record as "U" if it was
unlinked (i.e., not linked to any other record) and as "L" if
it had been linked to another record. At any point in the
linking process, then, we designated the linkage status of
any pair of records as U-U, L-L, or U-L, depending on
whether neither, both, or only one of the records had previously been linked to another record. For each of the six
steps, we specified which statuses were eligible for linking
and which, if any, were eligible to be "noted." Linking two
records was done by setting the variable for chain number to
the same value in both records. Noting a linkage was done
by setting the value of an auxiliary variable in one of the
records to the value of the chain number variable of the
other record; this information was later used for consolidating chains in phase II. The categories of linkages we permitted in the six steps were as follows:
1. link: U-U, U-L, L-L; note: none;
2. link: U-L; note: L-L;
3. link: U-L; note: L-L;
4. link: U-L, U-U; note: none;
5. link: U-L; note: none;
6. link: U-L, U-U; note: none.
Thus, in the first step, any pair of records that satisfied the
matching criteria was linked. Linkages of the L-L type were
consolidated into single chains within this step. It was
possible to consolidate chains in the first step because there
were no linkages to records outside the block of records
having the same sort key. Such consolidation of L-L linkages was not possible in steps 2-6 because the members of
a chain could be spread throughout the file. In steps 2, 3, and
5, only pairs of records wherein one of the records was
previously linked and the other was unlinked were eligible
for linking. The idea was to try to add records to existing
chains where possible, rather than starting new chains.
Because the rules disallowed U-U and L-L links in steps 2,
3, and 5, the process included later steps using the same sort
keys to identify any remaining valid links that had not been
permitted in earlier steps.
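The six steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' program: the record layout (`id`, `dob`, `ssn`, `maiden`, `first`), the `matches` criterion (agreement on at least two maternal fields), and the function names are all hypothetical, and the toy data use an empty string for a missing Social Security number.

```python
from itertools import combinations, groupby


def name_key(r):
    # Maiden name plus the first initial of the first name.
    return (r["maiden"], r["first"][:1])


# Rules transcribed from the six steps: (sort key, link statuses, note statuses).
STEPS = [
    (lambda r: r["dob"], {"U-U", "U-L", "L-L"}, set()),
    (lambda r: r["ssn"], {"U-L"}, {"L-L"}),
    (name_key, {"U-L"}, {"L-L"}),
    (lambda r: r["ssn"], {"U-L", "U-U"}, set()),
    (name_key, {"U-L"}, set()),
    (name_key, {"U-L", "U-U"}, set()),
]


def pair_status(a, b):
    """Classify a pair as U-U, U-L, or L-L from the chain numbers."""
    n_linked = (a["chain"] is not None) + (b["chain"] is not None)
    return ("U-U", "U-L", "L-L")[n_linked]


def matches(a, b):
    """Assumed criterion: at least two non-blank maternal fields agree."""
    fields = ("dob", "ssn", "maiden", "first")
    return sum(a[f] == b[f] and a[f] != "" for f in fields) >= 2


def run_step(records, key, link_ok, note_ok, next_chain):
    """One sort-and-link pass; returns the next unused chain number."""
    records.sort(key=key)
    for _, block in groupby(records, key=key):
        for a, b in combinations(list(block), 2):
            if not matches(a, b):
                continue
            status = pair_status(a, b)
            if status == "U-U" and status in link_ok:
                a["chain"] = b["chain"] = next_chain  # start a new chain
                next_chain += 1
            elif status == "U-L" and status in link_ok:
                chain = a["chain"] if a["chain"] is not None else b["chain"]
                a["chain"] = b["chain"] = chain       # extend an existing chain
            elif status == "L-L" and a["chain"] != b["chain"]:
                if status in link_ok:                 # step 1: merge the chains
                    old = b["chain"]
                    for r in records:
                        if r["chain"] == old:
                            r["chain"] = a["chain"]
                elif status in note_ok:               # note for later consolidation
                    a["noted"] = b["chain"]
    return next_chain


def link_phase_one(records):
    next_chain = 1
    for r in records:
        r.setdefault("chain", None)
        r.setdefault("noted", None)
    for key, link_ok, note_ok in STEPS:
        next_chain = run_step(records, key, link_ok, note_ok, next_chain)
    return records


records = [
    {"id": 1, "dob": "1960-01-01", "ssn": "111", "maiden": "SMITH", "first": "MARY"},
    {"id": 2, "dob": "1960-01-01", "ssn": "111", "maiden": "SMITH", "first": "MARY"},
    {"id": 3, "dob": "1960-01-02", "ssn": "111", "maiden": "SMITH", "first": "MARY"},
    {"id": 4, "dob": "1970-05-05", "ssn": "222", "maiden": "JONES", "first": "ANN"},
    {"id": 5, "dob": "1970-05-05", "ssn": "", "maiden": "JONES", "first": "ANN"},
    {"id": 6, "dob": "1975-03-03", "ssn": "333", "maiden": "LEE", "first": "SUE"},
]
link_phase_one(records)
chain_by_id = {r["id"]: r["chain"] for r in records}
```

In this toy run, record 3 (whose date of birth differs by one day) is missed in step 1 and is added to the existing chain in step 2 as a U-L link on Social Security number, which is the behavior the rules are meant to encourage.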
APPENDIX 2
Details of Probabilistic Evaluation of Linkages
For any pair of records, the event of having the same
mother was denoted as $m$, the event of having different
mothers as $m'$, and the vector of outcomes of the 18 variables as $x = (x_1, x_2, \ldots, x_{18})$. The probability, $P(m \mid x)$, that
two records belong to the same mother, given the observed
pattern of field-by-field outcomes, can be computed as
follows from Bayes' theorem:
$$P(m \mid x) = \frac{P(m)\,P(x \mid m)}{P(m)\,P(x \mid m) + [1 - P(m)]\,P(x \mid m')} = \frac{1}{1 + K/R},$$
where
$$K = [1 - P(m)]/P(m)$$
and
$$R = P(x \mid m)/P(x \mid m').$$
The quantity $P(m)$ is the unconditional probability that
two randomly selected records match (i.e., belong to the
same mother). $P(m)$ and $K$ are constants for our data set. As
the ratio $R$ increases, $P(m \mid x)$ increases. $P(m)$ and $K$ could
have been estimated, at least roughly, from our data, but
it was convenient to use
$$r = \log(R) = \log[P(x \mid m)] - \log[P(x \mid m')]$$
as the working index of the likelihood of a match between
two records, rather than computing $P(m \mid x)$ per se. Thus, it
was unnecessary to estimate values of $P(m)$ and $K$.
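To make explicit why $r$ can stand in for $P(m \mid x)$, the following short derivation rewrites the posterior probability in terms of $r$. It is a sketch that assumes logarithms to some base $b > 1$; the text does not state the base.

```latex
\begin{align*}
r = \log_b R \;&\Longrightarrow\; R = b^{\,r}, \\
P(m \mid x) &= \frac{1}{1 + K/R} = \frac{1}{1 + K\,b^{-r}} .
\end{align*}
```

Because $K > 0$ is fixed for the data set, $P(m \mid x)$ is a strictly increasing function of $r$, so ranking record pairs by $r$ gives the same order as ranking them by $P(m \mid x)$.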
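If the $x_i$ are mutually independent (conditional on true
match/nonmatch status), then
$$P(x \mid m) = P(x_1 \mid m) \cdot P(x_2 \mid m) \cdots P(x_{18} \mid m)$$
and
$$P(x \mid m') = P(x_1 \mid m') \cdot P(x_2 \mid m') \cdots P(x_{18} \mid m'),$$
from which
$$R = \frac{P(x_1 \mid m)}{P(x_1 \mid m')} \cdot \frac{P(x_2 \mid m)}{P(x_2 \mid m')} \cdots \frac{P(x_{18} \mid m)}{P(x_{18} \mid m')}$$
and
$$r = \sum_{i=1}^{18} r_i,$$
where
$$r_i = \log\left[\frac{P(x_i \mid m)}{P(x_i \mid m')}\right].$$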
Thus, computation of the score, $r_i$, requires estimation of
the conditional probabilities $P(x_i \mid m)$ and $P(x_i \mid m')$ or the ratio
$P(x_i \mid m)/P(x_i \mid m')$ corresponding to the outcome $x_i$. For variables for which the outcome does not reflect the specific
value of the variable, we can estimate $P(x_i \mid m)$ from a large
pool of correctly matched records. For this purpose, we used
those record pairs produced by the deterministic linkage for
which we were very confident of the correctness of the
linkage. To estimate $P(x_i \mid m')$, it was relatively easy to
extract a large set of record pairs that clearly did not match.
For variables whose outcomes reflected the specific value
of the variable, such as name variables, we estimated the
unconditional probability of each outcome, $P(x_i)$, from the
frequency distribution of the values of the variable and then
assumed that when $x_i$ reflects identical values of $X_i$ in the
two records, $P(x_i \mid m') = [P(x_i)]^2$. For example, the probability of two records both having the maiden name "Adams"
when the mothers are known to be different is approximately the same as when the two records are selected totally
at random. We assumed further that $P(x_i \mid m) = P(x_i)$, an
approximation that causes very little distortion in the score
component, $r_i$. This leads to
$$R_i = \frac{P(x_i \mid m)}{P(x_i \mid m')} = \frac{P(x_i)}{[P(x_i)]^2} = \frac{1}{P(x_i)}$$
when $x_i$ represents matching values of $X_i$.
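The scoring described in this appendix can be illustrated with a small Python sketch. It is not the authors' implementation: the name frequencies, the agreement probabilities for date of birth, the use of natural logarithms (the base is not stated in the text), and the function names are all assumptions made for illustration.

```python
import math
from collections import Counter


def value_specific_ratio(value, freq):
    """R_i = 1 / P(x_i) for a match on a specific value (e.g., a maiden name)."""
    total = sum(freq.values())
    return total / freq[value]


def outcome_ratio(p_given_match, p_given_nonmatch):
    """R_i = P(x_i | m) / P(x_i | m') for an agree/disagree outcome."""
    return p_given_match / p_given_nonmatch


def score(ratios):
    """r = sum of log(R_i), assuming conditionally independent outcomes."""
    return sum(math.log(r) for r in ratios)


# Toy frequency table of maiden names (assumed, not taken from the Georgia file).
maiden_freq = Counter({"SMITH": 900, "ADAMS": 90, "ZELENKA": 10})

rare = value_specific_ratio("ZELENKA", maiden_freq)   # 1000 / 10
common = value_specific_ratio("SMITH", maiden_freq)   # 1000 / 900

# Hypothetical probabilities that dates of birth agree for matched and
# unmatched record pairs.
dob_agree = outcome_ratio(0.98, 0.001)

r_rare = score([rare, dob_agree])
r_common = score([common, dob_agree])
```

A match on a rare name contributes far more to the score than a match on a common one, which is consistent with the observation in the Discussion that rare names received high point values for matches.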