DATA ANALYSIS PROCEDURES FOR A COORDINATING CENTER
OF A LARGE COLLABORATIVE STUDY
By
Ronald W. Helms
Department of Biostatistics
University of North Carolina at Chapel Hill
Institute of Statistics Mimeo Series No. 1003
May, 1975
1. INTRODUCTION: PURPOSE OF THE PAPER

Many scientists have learned to cope with the computer as a tool for handling data in small or moderate-sized research projects. Although he might not be able to program the computer or prepare data for input to a packaged program, the scientist frequently has a general idea of the steps involved, the difficulties which can arise, the amounts of money and time involved, and generally the ability to effectively supervise a programmer or data processor who performs the work.

When the amount of data becomes large, however, special computer-oriented data handling and data analysis procedures are required. Many scientists are not familiar with these specialized procedures, the difficulties which arise, the amounts of time and effort involved, etc. The development and operation of a system for handling and analyzing large data files can be considered in the following categories:

(1) Design of a data management system by "systems analysts." (This phase is analogous to an architect's work in designing a factory.)

(2) Implementing the data management system, i.e., writing, testing, documenting, and installing computer programs and training personnel in data management procedures. (Analogous to constructing a factory, installing the equipment, and training employees.)

(3) Operating the system, i.e., receiving data, checking for errors, correcting errors, matching data on particular subjects, storing data in easily retrievable form, producing reports on the status and progress of data collection and processing, etc. This phase is analogous to the operation of a factory: producing goods, reporting on production, etc.

(4) Performing statistical analyses, including requisite data manipulations, statistical computations, etc. (The factory analogy fails here....)

The objective of this paper is to describe the sequence of events and steps involved in developing a statistical analysis. The hope is that the reader will develop a sufficient understanding of the steps and costs, in dollars and in time, to be able to interact effectively with the specialists involved in performing the work. It is also hoped that understanding the procedures and costs will lead to more careful pre-planning, better communication between scientists and computer specialists, and speedier, less expensive analyses.

Section 2 of this paper contains an "overview" description of the steps involved in developing a statistical analysis from a large data file. Each of the succeeding sections contains a more detailed description of one or more steps, including information on the costs (computer costs, manpower costs, and calendar time) required to perform the step(s). The cost estimates arise from experience with several large projects, but due to differing complexity of problems, vacation and holiday schedules, etc., the cost and time requirements can vary over a wide range and the figures given can only be used as a rough guide. In a particular instance, computer specialists should be able to provide reasonable estimates of costs and time requirements. Rather than assuming the estimates given here are "close enough," in each particular case one should request specific estimates.

One of the sources of communications problems between scientists and computer specialists is the special jargon used by each. Since this paper contains a substantial amount of "computerese" an appendix has been added to explain some of the terms.
2. A SUMMARY OF THE STEPS INVOLVED IN DEVELOPING AN ANALYSIS

This section contains a listing of the steps required to develop a statistical analysis from a data file. In order to list such a sequence of steps it is necessary to make some assumptions:

(1) We assume the data file which forms the basis for the analysis is "in good condition;" that is, we assume the data have passed through quite good error detection procedures, including field tests and consistency checks. (See the Appendix for explanations of the terms field test and consistency checks.) We also assume that a substantial effort has been made to correct the errors thus detected and that, as a result, the data file is relatively "clean." If the data file contains both "dirty" (uncorrected) data and "clean" data we assume there is a simple technique for classifying each data field as "dirty" or "clean."

(2) We assume the statistical analysis has a well defined purpose and is not a "fishing expedition." Fishing expeditions tend to go where the good fishing is, which is an unpredictable path and which will tend to use all the available resources, whether extensive or quite limited. Only if an analysis has a reasonably well defined purpose can one write down a reasonable plan for achieving the purpose. We explicitly exclude analyses to answer "open-ended" questions.

Bearing these assumptions in mind, a general overview of the whole procedure seems justified, before listing the detailed steps. The first step involves planning the analysis and scheduling personnel and deadlines. A program is developed to copy from the master data file those variables and those cases (records) which are of interest in the analysis. The new file is called a "raw analysis file." Typically, further calculations (called "data transformations") must be made on this file before statistical analysis programs can be used: the result of the calculations is another file, which contains transformed data and is called the "analysis file." A wide variety of "preliminary" analyses is performed on the data in the analysis file, including histograms, descriptive statistics (mean, standard deviation, mode, etc.), cross-tabulations, scatter diagrams, correlations, etc. A major objective of the preliminary analyses is to reveal "outliers," which are carefully examined for errors, as well as to reveal other errors in the data which have escaped previous error detection processes. When (not "if") errors are discovered, the original data file is corrected, a new "raw analysis file" is produced, a transformed analysis file is produced, and the preliminary analysis calculations are repeated. When no more errors are detected, one writes up the results of the preliminary analysis and proceeds.

Sometimes computer programs which will perform the necessary statistical computations are not available and new programs must be developed. This development of statistical software can proceed concurrently with the preceding steps. When the software is ready, the statistical computations are performed and statistical analysis begins. Errors in previous stages are often detected at this point, causing a return to an earlier stage for corrections and re-processing of the intervening steps. Mistakes, or poor predictions, in the selection of variables or cases sometimes necessitate repeating the entire process. It is not unusual to repeat the whole process, up to the statistical analysis stage, several times. This procedure is also very expensive. The final steps involve completing the analysis and then writing up the subject matter for publication or other distribution.

The specific steps in the procedure are as follows:
1. Decide the questions to be answered and the general analyses to be performed; write down the scientific objectives of the analysis. Obtain approval to proceed.

2. Plan the sequence of the programming, data processing, and statistical computation steps required; draw up an "operational plan."

3. Schedule the performance of the steps, including personnel assignments and setting deadlines; draw up an "operational schedule."

4. Begin work on the problem.

5. Write, debug, test, and document an "inclusion" computer subprogram to evaluate the criteria for including or excluding a case from the analysis.

6. Develop specifications (control cards) which define the variables to be used in the analyses. These specifications will be input to the Master Update Program, which will copy the desired variables onto a "raw analysis file" while performing an update run.

7. Incorporate the "inclusion subprogram" (Step 5) into the Master Update Program. This subprogram "tells" the Master Update Program which cases should be copied onto the "raw analysis file" ("inclusions") and which should not ("exclusions").

8. Execute an update run of the Master Update Program to produce the "raw analysis file." (Steps 6 and 7 are preparatory; this step actually produces the file.)

9. Check the raw analysis file for correct format, correct variables, and correct cases (inclusions/exclusions). If not correct, determine the source of errors, correct the problem, and return to Step 6 or 7, as indicated.

10. Duplicate the raw analysis file and save the copy in a secure place as a backup.

NOTE: Skip to Step 14 if no data transformations are required prior to statistical computations.

11. Design, write, test, debug, and document all "transformation programs" required to perform data transformations and produce a "transformed analysis file." This step may include programs for linking data from two or more raw analysis files. NOTE: If personnel are available, this step may proceed concurrently with Steps 5-10.

12. Set up and execute the transformation programs (Step 11) and produce the transformed analysis file. Check the file: if errors are found, determine their origin, make the required corrections (this could involve any of Steps 5-11), and return to the appropriate step (one of Steps 5-11). If no errors are found, proceed to Step 13.

13. Make a backup copy of the transformed analysis file and save it in a secure place.

14. Perform computations for preliminary statistical analyses, using the "latest" analysis file. Typical calculations include statistics usually called "descriptive statistics": histograms, percentiles, means, medians, standard deviations, skewness and other moments, cross-tabulations, scatter diagrams, correlations, regressions, etc.

15. Examine the output from Step 14 for outliers and other indications of erroneous values. Trace such "outliers" to the original data and determine which are errors and which are correct.
--Data errors must be corrected on the data master file and the process must return to Step 8 for creation of a new, corrected, analysis file.
--Errors caused by incorrect specifications of inclusion criteria require that the specifications be corrected and that the process return to Step 5.
--Programming errors require that the program involved be corrected and that the process return to one of the earlier steps, depending upon which program contained the error.
--After errors have been corrected, the preceding steps have been re-executed, new errors found, etc., and eventually no more errors are detected at this step, one proceeds to Step 16.

16. Write a summary of the subject-matter results of the preliminary analysis.

17. Re-examine the scientific objectives document and the operational plan (Steps 1, 2). If changes are made, return to Step 1. Some steps may not need to be repeated; this will be indicated in the new operational plan.

18. Design, write, debug, test, and document statistical computation programs required for the statistical analyses. NOTE: This step may be a long, involved process, not just another step in the procedure. Whenever this step is required, other personnel are usually assigned to it and the work proceeds concurrently with Steps 4-16.

19. Perform the statistical computations required for the desired analyses.

20. Analyze the output created in Step 19 and write preliminary conclusions. NOTE: Typically, a number of different analyses will be required in addition to the preliminary analyses performed in Steps 14-16. One analysis or set of computations frequently generates ideas for performing other analyses, which is all a part of the art of statistical analysis. This process is impossible to "flowchart."

21. Determine whether additional calculations are needed. If so:
(a) Return to Step 19 if the necessary data are on the analysis file and no additional programming is required.
(b) Return to Step 18 if the necessary data are on the analysis file but additional programming is needed.
(c) Return to Step 2 if the necessary data are not on the analysis file. This decision usually involves personnel outside the coordinating center: project officer, participating physicians, etc.

22. When no further calculations are needed, write up the results for distribution or publication.

Additional description and details of these steps are contained in the following sections.

3. PLANNING THE ANALYSIS: STEPS 1-4.

The first four steps of an analysis involve planning and scheduling. Naturally, these steps are taken before the first data processing steps.

3.1. Step 1. Decide the questions to be answered and the general analyses to be performed; write down the scientific objectives of the analysis.
3.1.1. Objectives. The objectives of this step are (a) to require the statisticians, physicians, and other professionals requesting the analyses to consider, in detail, the objectives of the analyses, and (b) to produce a clear, detailed, written statement specifying the scientific questions being addressed, the specific objectives of the analyses, and, as specifically as possible, the actual statistical computations to be performed.

3.1.2. Importance. The importance of developing a statement of the scientific questions, objectives, and particular analyses cannot be overstated. This "scientific objectives" document is almost literally the cornerstone of all the steps which follow. All subsequent steps are built upon the scientific objectives paper, and to the extent that it is erroneous or inadequate all subsequent steps will also be erroneous and/or inadequate. Seemingly minor mistakes, particularly omissions, in the scientific objectives paper are typically not discovered until the final analysis stage, Step 20. When discovered, such errors in this fundamental document usually require restarting the whole procedure at Step 1 and effectively discarding virtually all the work performed up to the point of discovering the error. Thus, errors in the scientific objectives paper have very serious consequences. Even a seemingly minor error or omission can result in doubling the cost of an analysis. The extra time required may be much more than doubled because the staff will be scheduled to work on other projects at the end of the period originally scheduled; it may be difficult to delay other projects and re-assign the staff to an analysis project which is being re-done because of errors at the scientific objectives stage.

3.1.3. Personnel and Responsibilities. The scientific objectives document must be drawn up by the scientists, including statisticians, who are posing the questions and want the answers. This responsibility cannot be delegated. Accompanying the responsibility for preparing the document is the accountability for its errors. The costs caused by errors in this document are very real. The scientists preparing the document must be prepared to accept direct responsibility for money, time, and other resources wasted as a result of errors in the document.

What it means to be "held responsible" for errors in the scientific objectives document is purposefully left unspecified because the nature of accountability depends upon the situation. The least possible "penalty" is a substantial delay in obtaining results. A more serious problem arises if the staff is required for other tasks and it is not possible to repeat the analysis with corrected specifications.

After the scientific objectives document is completed by the panel of scientists requesting the analyses, the document must be approved by appropriate LRC Program authority. The meaning of "appropriate authority" depends upon the situation. The procedure is to submit the document to the Director of the CPR, with a request that the analyses be performed. The CPR Director, after consultation with the LRC Program Office, will submit the scientific objectives document to an appropriate person or committee for review. The review may result in approval or a request for additional detail. Upon approval by the "appropriate authority," the CPR Director will notify members of the originating panel of the approval and will initiate Step 2.

3.1.4. Drawing up the Scientific Objectives Document. It is not practical to provide an outline for the scientific objectives document, but it is possible to indicate some of the ways it will be used. The "authority" which reviews the document will hopefully expect the answers to several questions:

(a) Is the proposed analysis justified? That is, are the questions being posed scientifically important within the context of the LRC Program? Do the anticipated results justify the anticipated cost?

(b) Is the document complete? Are there any oversights or omissions? Should other, related objectives be added? Are there too many objectives, or are the specified objectives too broad?

(c) Are the objectives specified accurately and in sufficient detail? Will the persons doing the analysis have to request additional information; for example, to select variables, to program inclusions/exclusions, or to select appropriate statistical analyses?

(d) Is the document well written? That is, are the objectives specified clearly and unambiguously? Have specific analyses or computations been proposed where practical?

The planning which takes place in Steps 2-4 will also place requirements on the scientific objectives document. Principally, the planners must be able to determine, on the basis of the statements of the scientific objectives, what variables will be needed, what cases will be needed (inclusion/exclusion criteria), what to do about missing data, etc. The planners must be able to determine the approximate level of effort which will be required in each of the succeeding steps. They must be able to determine the types of statistical analyses which will be performed (regression, contingency tables, etc.), and whether available statistical software will be adequate.

This reinforces the point that the scientific objectives document is the cornerstone of the whole effort. The document should not include complete analysis plans, but it should contain all the information to form a logical basis for all that follows. The planning and performance of subsequent steps must logically follow, almost in the sense of deductive logic, from the statement of scientific objectives.

3.1.5. Costs of Step 1. Costs of producing the scientific objectives document vary widely, depending upon the composition of the panel proposing the analysis, their geographic separation (travel and long distance telephone tolls), and their efficiency in producing such a document. An approximation for a typical LRC committee might be: 12 members at an average of 4 days of work per member, or 48 man-days at the scientist level. Add travel costs for one meeting. Add 6-10 man-days of secretarial/editorial effort. Add 5 man-days of review effort. The total is 60-65 man-days, mostly at the senior scientist level, plus travel cost. Elapsed time could be as little as 2-3 weeks (very efficient committee work), or as much as 4-5 months, if the review "authority" requires additional information.
3.2. Step 2. Plan the sequence of programming, data processing, and statistical computation steps required; draw up an "operational plan."

After the scientific objectives document is approved, the CPR Director assigns CPR personnel to develop an "operational plan" for performing the required analyses to achieve the scientific objectives. The operational plan is a detailed, written, step-by-step "program" (or "set of instructions") specifying every task which must be performed to complete the work. The details in the operational plan go down to the level of naming particular data files on which operations are to be performed and particular programs which will perform operations.

Basically, the developers of the operational plan work backwards. The first task is to determine which statistical analyses will be needed to accomplish the scientific objectives, to answer the scientific questions. The second step is to determine, from the scientific objectives, what data will be required. Thirdly, the developers must determine whether there is statistical software available to perform the necessary calculations, taking into account special problems, such as missing data. If not, either the analysis specifications must be altered or new software must be produced (Step 18). Approval for such software development must be obtained from CPR management. And finally, the planners must develop the detailed plans for developing the necessary data files from the available ones, i.e., the specifics of Steps 5-15.

The operational plan must include details to the following levels. For data processing and statistical computation tasks, the task should be described, the input and output files should be identified explicitly (by name), and procedures for checking results should be listed. For computer programming tasks the input files must be explicitly identified, the desired computations or other program processing must be clearly described, and output files must be clearly described. [Whether output record formats are given depends upon the situation, but the structure of the file and the contents of records must be clearly specified.] In addition, procedures for checking the results of the computations must be specified. It may be assumed that good design, programming, debugging, testing, and documentation practices will be employed; the planners are not responsible for specifying the details of these facets of the work.

The operational plan should specify all preliminary analyses (Step 14) to be performed, to the level of specifying intervals of histograms for continuous variables. The plan should also specify, with similar detail, the computations to be performed for substantive statistical analyses (Step 19); e.g., if regressions are to be run, the dependent and independent variables for each regression should be specified at this stage. Undoubtedly the data will suggest additional computations and analyses; the analyses specified in the operational plan are those required to satisfy the scientific objectives.

The written operational plan is submitted to CPR management for approval before Step 3 (scheduling), although scheduling will undoubtedly be a consideration in developing the operational plan.

The cost of developing an operational plan basically depends on two factors: (1) the completeness of the scientific objectives document; and (2) the complexity of the operations required to do the work. If new statistical analysis techniques (theory) and corresponding software must be developed, the cost could be quite large, running into man-years. But if the analyses are straightforward, the software available, and the data processing simple (no programming required), the operational plan could involve as little as 10-15 man-days. In the former case the calendar time required could be months; in the latter case the calendar time could be as little as 7-10 days.

The personnel involved will typically be a minimum of one or more statisticians, a data processing manager, a computer programming manager, and secretarial staff. For complex plans others may be involved as well.

3.3. Step 3. Schedule the performance of the steps, including personnel assignments and setting deadlines; draw up an "operational schedule."

Once the operational plan is approved by CPR management, the CPR Director assigns appropriate CPR personnel the tasks of proposing schedules, personnel assignments, and deadlines, which are combined into an "operational schedule." Since CPR computing personnel are in short supply relative to the demand, this step involves assigning priorities to major on-going projects. The written operational schedule is submitted to CPR management for approval. After approval, the scientific objectives document, the operational plan, and the operational schedule are combined into one document, an "analysis plan." Copies of the analysis plan are distributed to the originating panel and are provided to virtually all CPR personnel involved in developing the analysis. (Keypunch operators or equivalent are excluded; data processors specifically working on part of the job receive copies.)

The cost of developing the operational schedule varies with the complexity of the problem and with the workload at the CPR. A long, complex problem may require 15-20 man-days of scheduling, priority adjustments, etc., especially during periods of heavy workloads; a small problem may require as little as 1.5 man-days. The schedules are developed by middle and higher CPR management: the data processing manager, the computer programming manager, and statisticians. Priority setting may require decisions at the CPR Director or Program Office levels.

3.4. Step 4. Begin work on the problem.

This is a "milestone" rather than an actual step. It is included in the overall plan to emphasize the fact that there will often be a delay between completion of the analysis plan (scientific objectives document, operational plan, and operational schedule) and the actual beginning of data processing or programming operations. The manpower cost of this "step" is zero, of course, but the elapsed time between completion of the operational schedule and commencement of operations will depend upon the CPR computing workload, the priority assigned to the particular analysis project, and the complexity of the project. Projects which require both programming and data processing may have different operational starting times; it may be desirable to complete programming before beginning data processing.
4. EXTRACTING AN ANALYSIS FILE FROM THE MASTER DATA FILE: STEPS 5-10.

The data from a study are stored on the "Master Data File" by the data management system which handles the data for that study. The Master Data File typically consists of eight different, complete data sets, usually stored on tape, which are used in rotation. During each update run the data from the most recent version of the Master Data File are copied, along with new data and revisions, onto a second file. The first file is called the "father" file and the file generated from the update is called the "son" file. At the next update run, the old "son" file becomes a "father" file, the old "father" file becomes a "grandfather" file, and a new son file is created by copying data from the new "father" file, and incorporating revisions, onto a third blank tape. This process is repeated until eight tapes are in use. After that, during each update run the oldest tape is used for a "son" file, and the data originally recorded on this tape are overwritten by the new data. In this way the tapes are used again and again in an eight-tape cycle.
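The rotation is easier to see as a small sketch. The following Python fragment is purely illustrative (the original system ran on tape under PL/I); the tape labels and the deque bookkeeping are assumptions made for the example, not part of the CPR system.

    from collections import deque

    # Illustrative sketch of the eight-tape "grandfather-father-son" rotation
    # described above.  Once all eight tapes are in use, each update run writes
    # the new "son" file onto the oldest tape, overwriting its old data.
    tapes = deque([f"TAPE-{i}" for i in range(1, 9)])   # newest generation at the right end

    def update_run(run_number):
        oldest = tapes.popleft()      # the oldest generation is reused
        tapes.append(oldest)          # it now holds the new "son" file
        son, father, grandfather = tapes[-1], tapes[-2], tapes[-3]
        print(f"run {run_number}: son={son}  father={father}  grandfather={grandfather}")

    for run in range(1, 4):
        update_run(run)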
Because the Master Data Tapes are being changed frequently in the update cycle, and because the data written on the Master Data Tapes are not written in a format suitable for input to statistical analysis programs, it is necessary to produce a copy of the data, called an "Analysis File," which is actually used for analysis. Steps 5-10 of our procedures involve the creation of an analysis file.

Typically, not all of the data on a Master Data File are required for a particular analysis. For example, the subject's name may be recorded in a number of different places in the Master Data File record, but the subject's name typically will not be used at all for data analysis. Also, there are usually dozens of variables, or perhaps even hundreds of variables, which will not be used in a particular statistical analysis. In addition, a particular statistical analysis may not include all of the subjects for which there are data on the Master Data File. For example, a particular analysis may exclude very young subjects, or it may exclude subjects who have erroneous data for particular items, or other such exclusions. Therefore, the analysis file typically contains a subset of the data on the Master Data File: only a subset of the variables is copied from the Master Data File, and only a subset of the records (subjects) is copied from the Master Data File. Exact specifications of the variables to be included on a particular analysis file, and criteria for selecting subjects to be included on a particular analysis file ("inclusion criteria"), are specified in detail, and in writing, at Step 2, or perhaps at Step 1, of these procedures.

4.1. Step 5. Write, debug, test, and document an "inclusion" computer subprogram to evaluate the criteria for including or excluding a case from the analysis file.

The analysis file is actually produced during an update run of the Master Data File. A subroutine is appended to the Master Update Program which, in effect, "looks at" each data record just before it is written onto the "son" or output data file. The inclusion criteria are evaluated for each data record. If the inclusion criteria are satisfied, the inclusion subprogram notifies the Master Update Program that data from this particular record are to be copied onto an output analysis file tape. If the data in the record do not satisfy the inclusion criteria, the inclusion subprogram notifies the Master Update Program that data from this particular record are not to be copied onto the analysis file tape. This is a basic summary of the way in which the inclusion subprogram and the Master Update Program interact.

The objective of this step is obviously to develop an inclusion subprogram which will correctly select data records for inclusion in the analysis file. An inclusion subprogram must be developed for every analysis file which is extracted from the Master Data File. The subprogram would be trivial, of course, if one wanted to include data from every subject's record. Typically, however, the inclusion subprogram is fairly simple but not trivial. For example, one usually considers certain variables "critical," and does not wish to extract data on those subjects for which the critical data variables' values are known to be in error, or have not yet arrived at the Coordinating Center.

The inclusion subprogram is written by an intermediate level or junior level programmer, based upon information provided by the operational plan developed in Step 2. Both the specifications of the inclusion criteria and the list of variables to be included in the analysis must be stated explicitly in the operational plan. Occasionally, the inclusion criteria are so complicated that a flowchart which defines the evaluation of the criteria must be provided in the operational plan. Usually, however, the inclusion criteria are much simpler and can be stated in English language sentences or simple mathematical relations, such as "serum cholesterol > 150." The planners must be careful in each case to consider exclusion criteria related to missing data, data not yet arrived, errors in particular data fields, etc.
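As a concrete illustration, the following Python sketch shows the kind of record-by-record decision an inclusion subprogram makes. It is a modern stand-in for the PL/I subroutine described above, and the field names, age cutoff, and missing-data convention are assumptions made for the example, not the CPR's actual criteria.

    # Illustrative sketch of an inclusion subprogram: the Master Update Program
    # would call include_case() once per record and copy the record onto the
    # raw analysis file only when it returns True.  Field names, the age cutoff,
    # and the missing-data code below are hypothetical.

    MISSING = None   # assumed code for "value not yet arrived or known to be in error"

    def include_case(record: dict) -> bool:
        """Evaluate the inclusion criteria for one data record."""
        # Exclude very young subjects (hypothetical age criterion).
        if record.get("age") is MISSING or record["age"] < 20:
            return False
        # Exclude cases whose critical variables are missing or flagged in error.
        for critical in ("serum_cholesterol", "triglycerides"):
            if record.get(critical) is MISSING:
                return False
        # A simple mathematical relation of the kind quoted above.
        return record["serum_cholesterol"] > 150

    # Example: filter a small batch of records during an update pass.
    records = [
        {"age": 45, "serum_cholesterol": 210, "triglycerides": 130},
        {"age": 17, "serum_cholesterol": 180, "triglycerides": 110},    # excluded: age
        {"age": 52, "serum_cholesterol": MISSING, "triglycerides": 95}, # excluded: missing
    ]
    raw_analysis_file = [r for r in records if include_case(r)]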
Except in the simplest cases the programmer will "design" (flowchart) the subprogram and get the design approved by his supervisor before proceeding. The programmer then writes the program in an appropriate language, usually PL/I at the CPR, and will begin debugging the subprogram. After the subprogram is suitably debugged, it will be tested with a special version of the system which is maintained for the purpose of testing subprograms. This special test version of the system has its own set of Master Data Tapes, which contain real data but not the active data on the real Master Data File. While the testing is proceeding, the programmer must document the inclusion program. The documentation is quite important in this case, because many expensive operations will be performed on the output analysis file. If questions later develop as to the accuracy of the inclusion criteria, the documentation must be available to help programmers determine the accuracy of the subprogram. Moreover, it is not unusual for an analysis to be repeated periodically, say, at six month intervals. In such a case, if the inclusion criteria are not changed from one analysis to the next, the same inclusion subprogram may be used each time.

The programmer's supervisor is responsible for determining that the tests applied to the subprogram are appropriate, and that the subprogram has successfully passed the tests.

The cost of an inclusion subprogram depends critically upon the complexity of the inclusion criteria, and upon the clarity of the definition of the inclusion criteria in the operational plan. A relatively minor, or trivial, inclusion subprogram may require as little as one to three man-days of effort to write, debug, test, and document. The elapsed calendar time for such a simple subprogram would probably be close to a week, because of the problems of slow turnaround when using tapes for testing, turnaround time with keypunch, delays in getting documentation typed, etc. An inclusion subprogram to evaluate a complex set of inclusion criteria might require as much as twenty or thirty man-days of effort to write, debug, test, and document. Although programmers find such subprograms challenging, and enjoy writing them, testing such subprograms is often difficult because extensive hand calculations are required for each test record to determine if the program is operating correctly. Also, documentation for more complex subprograms requires a correspondingly longer time.

The computer-associated cost of developing an inclusion subprogram can vary from $15-$20 for minor or simple inclusion subprograms to $500 or more for complex subprograms. Most of this cost will be incurred in the testing phase. At the time the operational plan is developed, the manpower, calendar time, and computer costs should be estimated and included in the plan.

4.2. Step 6. Develop specifications (control cards) which define the variables to be used in the analysis. The specifications will be input to the Master Update Program which will copy the desired variables onto a "Raw Analysis File" while performing an update run.

The objective of this step is to provide the Master Update Program with the information necessary to select those variables which are to be written onto the analysis file. A data processor or junior level programmer will be assigned the task of developing the control cards which provide this information to the Master Update Program. The basic information for preparing the control cards comes from two sources: the operational plan, in which the data variables to be included are listed explicitly, and the documentation of the Master Update Program, which contains information on the format of the control cards. The only difficulty in this step is that the variables to be included in the analysis are usually specified by name or description, while the Master Update Program requires that the variables be specified by "Item Number." Each data variable in the Master Data File has a unique identifying number called an "Item Number." The person preparing the control cards must look at a "Form Format Table" and determine the item numbers of all variables which are specified in the operational plan to be included in the analysis. These item numbers are then entered into control cards in the format specified in the documentation of the Master Update Program.

This translation of variable names and/or descriptions into specific item numbers is obviously a very critical step. An error at this stage means that the wrong variable would be copied onto the analysis file and, if the wrong variable had the same characteristics as the correct variable, the problem might not be discovered until very late in the analysis, if at all. For this reason it is extremely important that the translation process be double-checked by the data processor's supervisor. Obviously, accuracy is promoted by a very clear description in the operational plan, which in turn depends upon a very clear description of the variables in the scientific objectives document.
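The translation step can be pictured as a simple table lookup. The sketch below is illustrative only; the variable names, item numbers, and the idea of holding the Form Format Table as a dictionary are assumptions made for the example, not the CPR's actual card format.

    # Illustrative sketch of translating variable names from the operational plan
    # into Item Numbers via a "Form Format Table."  Table contents and the
    # control-card layout below are hypothetical.

    FORM_FORMAT_TABLE = {
        "serum cholesterol": 1041,
        "triglycerides":     1042,
        "date of visit":     2003,
    }

    def item_numbers(requested_variables):
        """Look up each requested variable; report any name that cannot be resolved."""
        unresolved = [v for v in requested_variables if v not in FORM_FORMAT_TABLE]
        if unresolved:
            raise KeyError(f"not in Form Format Table: {unresolved}")
        return [FORM_FORMAT_TABLE[v] for v in requested_variables]

    requested = ["serum cholesterol", "date of visit"]
    numbers = item_numbers(requested)
    # One "control card" per item number, in a hypothetical fixed-column format.
    control_cards = [f"ITEM {n:06d}" for n in numbers]
    print(control_cards)        # ['ITEM 001041', 'ITEM 002003']

The supervisor's double check amounts to repeating the same lookup independently and comparing the two resulting card decks.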
A second function at this step is to specify the format of the analysis file, that is, to write down which particular "columns" particular data variables will be written into. This information will be used in all subsequent analysis steps.

The costs of this step depend upon the number of variables to be included in the analysis and the clarity of the descriptions of those variables in the operational plan. For a short, simple list, the manpower required for this step may be as little as an hour, including keypunching, verifying, etc. A very long list, however, including say 300-400 variables, might well take eight to ten man-days to develop and double and triple check. (Longer lists should be checked even more carefully than shorter lists.) The elapsed time for this step may be as little as one calendar day for very short lists of variables or as great as several weeks. Fortunately, there are no problems at this stage with computer turnaround. The only real nonproductive delays are waits for keypunch turnaround in the case of long lists.

As noted above, the effect of a mistake at this step can be extremely costly, whether the mistake is made in the original scientific objectives document, the operational plan, or in the translation from a variable name to an item number. The more subtle the error the more expensive the mistake, because the mistake is typically not detected until the preliminary statistical analyses in Step 15 or even the detailed statistical analyses in Step 20. In principle, it is possible for an error at this stage to completely escape detection, and for the statistical analyses to be performed on the wrong variable. The only effective insurance against such an error is clarity in those sections of the scientific objectives document and the operational plan which describe the variables to be included in the analysis.

4.3. Step 7. Incorporate the "inclusion subprogram" (Step 5) into the Master Update Program. This subprogram "tells" the Master Update Program which cases should be copied onto the "raw analysis file" ("inclusions").

This is basically a technical step. The inclusion subprogram is written and tested in an environment remote from the Master Update Program and the real Master Data File. Before the update run which extracts the analysis file from the Master Data File is executed, the subprogram must be "linked" to the Master Update Program, which is the objective of this step.

After the inclusion program is determined to have passed its tests, the programmer in charge of maintaining the Data Management System will be assigned the task of linking the inclusion subprogram into the Master Update Program for one update run. The linking is a straightforward task, but such actions are restricted to the one programmer who is responsible for maintaining the system. This programmer can link the subroutine, check the output from the linking run, and notify his supervisor that the system is ready for an update run which will produce an analysis file.

The costs of this step are relatively small. If no substantial problems arise, the step will typically take one man-day of effort spread perhaps over three to five calendar days. However, if the inclusion program requires a particularly large amount of core storage or has other complications, this step could require up to six to eight man-days of effort spread over several weeks. The computer costs for this step are typically small, usually in the under $20 range.
Since this is a highly technical step, mistakes made at this step are usually caught before the programmer completes his work. It is highly unusual for a mistake made at this step to be propagated into following steps, and when that happens the error is usually detected in Step 8.

4.4. Step 8. Execute an update run of the Master Update Program to produce the "raw analysis file."

Steps 5, 6, and 7 are preparatory; this step actually produces the analysis file. When the inclusion subprogram has been successfully linked to the Master Update Program, and the control cards which define the variables to be copied onto the analysis file have been approved as being correct, the control cards are incorporated by the Data Processing Manager into the control card deck for the Master Update Program at the next regularly scheduled update run (or sooner if high priority is accorded to this analysis). The raw analysis file will be produced as a byproduct of the update run. The output from this step is the raw analysis file, stored as a data set on either magnetic tape or magnetic disk.

The costs of this step are relatively small unless a special update run is required to produce the analysis file. Typically, setting up the control cards for the Master Update Run and actually making the run will require less than one man-day of effort in addition to the usual effort associated with an update run. The elapsed time will depend upon the time at which Steps 6 and 7 are completed. Should those steps be completed the day before a scheduled update run, the calendar time required for this step would be one to two days. However, if both Steps 6 and 7 are completed shortly after a regularly scheduled update run, there could be a two to three week delay.

There is only a slight possibility of a mistake at this step, as the procedure is straightforward. However, if a mistake is made at this step it will be detected in the following step, Step 9. The effect of such a mistake would be a delay in the production of the analysis file. The delay might be as much as two or three weeks, until the next regularly scheduled update run, but in most cases another update run would be made as soon as the mistake was discovered and corrected. A mistake at this step, then, might cause a one week delay plus the cost of a second update run, which might be as high as $150.

4.5. Step 9. Check the raw analysis file for correct format, correct variables, and correct cases (inclusions/exclusions). If not correct, determine the source of errors, correct the problem, and return to Step 6 or 7, as appropriate.

After the update run which produces the raw analysis file, the data processing manager assigns one of his people the task of producing a printout of a number of records from the analysis file. This printout is checked for correct format, correct variables, and correct cases. The initial work is typically done by one of the data processors or a junior level programmer assigned to a statistical analysis team.

The analysis file is checked by comparing the contents of records on the analysis file with the contents of corresponding records on the Master Data Files and the documentation for the analysis file. The analysis file documentation specifies which variables are to be found in which positions of the analysis file record.
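A sketch of the kind of mechanical check involved: given the documented column positions, pull each variable out of a record and compare it with the value on the Master Data File. The record layout, field positions, and sample values below are hypothetical, not the CPR's actual formats.

    # Illustrative sketch of checking a raw analysis file record against its
    # documentation.  The layout (1-based column positions) and the sample
    # records are made up for the example.

    LAYOUT = {                      # variable -> (first column, last column)
        "subject_id":        (1, 6),
        "serum_cholesterol": (7, 10),
        "age":               (11, 12),
    }

    def unpack(record_line: str) -> dict:
        """Split one fixed-column record according to the documented layout."""
        return {name: record_line[first - 1:last].strip()
                for name, (first, last) in LAYOUT.items()}

    # One record as printed from the analysis file, and the corresponding
    # values looked up on the Master Data File.
    analysis_record = "000123 21045"
    master_values   = {"subject_id": "000123", "serum_cholesterol": "210", "age": "45"}

    unpacked = unpack(analysis_record)
    for name, expected in master_values.items():
        status = "ok" if unpacked[name] == expected else "MISMATCH"
        print(f"{name:18s} analysis={unpacked[name]!r:8s} master={expected!r:8s} {status}")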
If errors are detected in either the format, the list of variables, or the inclusion/exclusion rules used, the data processing manager notifies the supervisor of the analysis, who directs the problem to the appropriate person.

The cost of this step depends upon the number of variables included in the analysis file and the complexity of the inclusion criteria. Enough records must be checked manually to establish the correctness of the format, to establish that each variable has been placed in the proper position within each record, and to establish that the inclusion criteria used correspond to those specified in the operational plan. An analysis file with only a few variables and with very simple inclusion criteria might require as little as one man-day of effort to obtain a printout of a subset of the analysis file and to check the file. An analysis file with several hundred data variables and complicated inclusion criteria would require about one-half of a man-day to obtain the printout, but the checking phase could take several man-weeks of effort at the Data Processor II or Computer Programmer I level. The computer costs for this step are quite small, usually under $10 for small problems, and perhaps as much as $50-$75 for analysis files with hundreds of variables and complicated inclusion criteria. Complicated files require that more cases be printed out in order to verify the accuracy of the inclusion criteria, and files with many variables require more printout because each record requires more print.

The effect of a mistake at this step, that is, failure to detect a mistake at an earlier step, may be disastrous. If the mistake is subtle, such as a particular field containing data from a wrong, but similar, variable, the mistake may never be discovered. If a mistake is made at this stage, it will typically not be discovered until after several analysis runs have been completed (Step 14 or Step 19), which means that a great deal of effort and money will have been wasted. Thus, a mistake made at this stage may easily cost as much as 65%-85% of the total cost estimated for the analysis in the operational plan. For simple analysis problems this cost may be only a few hundred or a few thousand dollars, but for substantial analyses the cost of an error at this stage could easily run to ten thousand dollars or more.

4.6. Step 10. Duplicate the raw analysis file and save the copy in a secure place as a backup.

It should be clear from the description above that the analysis file at this stage represents a substantial investment in manpower, calendar time, and money. The objective of this step is to protect that investment by keeping a backup copy of the raw analysis file in a secure place. If an accident should occur in subsequent processing which renders the working copy of the analysis file unusable, the procedure is to create a new working copy of the analysis file by duplicating the backup file. After such duplication, which is performed with extreme care by the Data Processing Manager or a high level programmer, the backup copy of the analysis file is returned to a secure place. In this context "a secure place" means a fire-resistant cabinet with a lock in a building remote from the computing center.

The work of creating the backup copy can be assigned to a data processor or a junior level programmer, depending upon the availability of personnel. The work is performed under the supervision of the Data Processing Manager or the person supervising the statistical analysis. After the analysis file is duplicated, a computer program is executed which compares the working copy of the analysis file with the backup copy of the analysis file to determine whether the two are identical. If the two are identical, then the backup copy is removed to the secure place.
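A minimal sketch of such a comparison program, assuming the two copies are ordinary files on disk rather than tapes; the file names are hypothetical.

    # Illustrative sketch of the "compare" step: after duplicating the raw
    # analysis file, verify block by block that the working copy and the
    # backup copy are identical.  File names are hypothetical.

    def files_identical(path_a: str, path_b: str) -> bool:
        with open(path_a, "rb") as a, open(path_b, "rb") as b:
            block_number = 0
            while True:
                block_a, block_b = a.read(8192), b.read(8192)
                if block_a != block_b:
                    print(f"difference found near byte {block_number * 8192}")
                    return False
                if not block_a:          # both files exhausted at the same point
                    return True
                block_number += 1

    if files_identical("raw_analysis_file.dat", "raw_analysis_backup.dat"):
        print("copies identical; backup may be removed to the secure place")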
This step costs relatively little. The manpower requirement will typically be less than one man-day; the elapsed time will typically be two to three days but may be as much as five or six days, depending upon holiday schedules and weekends. The actual computer costs for duplicating the file will be directly proportional to the number of records on the file, but these costs will typically be less than $50.

There are basically two mistakes which can be made at this step. The first is a failure to execute the program which compares the two copies of the analysis file. Only rarely will there be a difference between the backup copy and the working copy of this file, but when there is a difference it must be detected at this stage, and the only way that can be done is to execute the comparison program. Of course, one must also properly interpret the output of the comparison program. The other type of mistake is that one could fail to put the backup copy of the analysis file in a secure place. A mistake at this stage could cost as much as 15-30 percent of the total estimated cost of the analysis.

5. DATA TRANSFORMATIONS: STEPS 11-13.

It is usually the case that the data extracted from the Master Data File require some sort of transformation before the statistical analysis. The transformation may be very simple, as for example, taking the logarithm of serum cholesterol because it is believed that cholesterol has a lognormal distribution. Other transformations may be more complicated and may require input from several different raw data fields; for example, one may be analyzing the average of several different determinations, or the data may require some logical manipulations, as may be the case with data from the Rose Questionnaire. If only very simple transformations are needed, such as the logarithm transformation above, the transformation can be handled by standard statistical analysis programs, and Steps 11-13 can be skipped. However, when the transformations become too complicated to be handled by packaged statistical analysis programs, or if the packaged programs which are to be used cannot perform data transformations, then it is necessary to produce an analysis file which contains transformed data.
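A small sketch of the kinds of transformations described above, written in modern Python rather than the PL/I of the original system; the field names, the averaging of three hypothetical determinations, and the log transform are assumptions made for the example.

    import math

    # Illustrative sketch of a transformation program: read records from a raw
    # analysis file, derive the variables needed for analysis, and write a
    # transformed analysis file.  Field names and the three "determination"
    # columns are hypothetical.

    def transform(record: dict) -> dict:
        determinations = [record["chol_1"], record["chol_2"], record["chol_3"]]
        mean_chol = sum(determinations) / len(determinations)
        return {
            "subject_id": record["subject_id"],
            "mean_chol":  mean_chol,
            "log_chol":   math.log(mean_chol),   # lognormal assumption noted above
        }

    raw_analysis_file = [
        {"subject_id": "000123", "chol_1": 205, "chol_2": 212, "chol_3": 208},
        {"subject_id": "000124", "chol_1": 181, "chol_2": 179, "chol_3": 186},
    ]
    transformed_analysis_file = [transform(r) for r in raw_analysis_file]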
5.1. Step 11. Design, write, test, debug, and document all "transformation programs" required to perform data transformations and produce a "transformed analysis file." This step may include programs for linking data from two or more raw analysis files.

The objective of this step is to develop the computer programs necessary to process data from the raw analysis file, or files, and produce an output file which can be used directly by the statistical analysis programs. The output file will, of course, contain all the transformed or combined data required for analysis.

The specifications for the program, or programs, to perform the data transformations are a part of the operational plan produced in Step 2. A computer programmer is given the specifications and assigned the task of designing a program which will perform the necessary computations. The level of programmer assigned to the task depends upon the complexity of the transformations required. Designing a program involves devising an algorithm, usually represented by a flowchart, and the specifications for input data formats and output data formats, and specifying the data processing operations such as data sorting, etc. When the program design is completed, the programmer's supervisor checks and approves the design. A programmer is then assigned the task of writing a program in an appropriate language, such as PL/I. The programmer debugs his computer program and, together with his supervisor, devises appropriate test procedures for the program, using data similar to the live data which will actually be transformed. In principle, the programmer develops documentation for the program as it progresses through the various stages: design, writing, debugging, and testing. In fact, the programmer usually delays documentation until after the computer program has been debugged and tested. The programmer's supervisor makes the determination of when the program has passed all appropriate tests.

The costs of developing a data transformation program, or programs, vary directly with the complexity of the problem. For very simple transformations the costs may be nil, or even zero if the transformations can be performed by the statistical analysis program. If the transformation procedure requires linking data from two or more data files and also requires complicated data transformations, such as computing averages within a particular subject's record, the cost of producing the data transformation program may be quite high. Because packaged statistical analysis programs have fairly good data transformation facilities, simple data transformation programs are usually not required. If a program is required, the problem is usually a complicated one. Thus, the costs of data transformations are usually either nearly zero or substantial. "Substantial" in this context would typically mean ten to twenty man-days of computer programmer time, two to three days of supervisor time, $50-$300 of computer facilities time, and three to four weeks of elapsed time. These are ballpark estimates, and in each particular situation estimates should be included in the operational plan.
The effect of an undetected mistake at this stage is quite serious. If the nature of the mistake is such that it may escape detection at the testing phase of the program development, then it is unlikely that it will be detected until a much later stage, if at all. Thus, a mistake at this stage will require correcting the error in the program produced in this step, as well as reperforming all subsequent steps which were done before detection of the error. The principal costs, of course, will be the cost of reperforming all those subsequent steps.

5.2. Step 12. Set up and execute the transformation programs (Step 11) and produce the transformed analysis file. Check the file; if errors are found, determine their origin, make the required corrections (this could involve any of Steps 5-11), and return to the appropriate step (one of Steps 5-11). If no errors are found, proceed to Step 13.

Once the data transformation program has been completely tested and accepted as a correct program, the next step is to execute the program and produce the transformed analysis file. This step may be performed by either the programmer who developed the data transformation program or a data processor from the data processing group. Most frequently, the execution of the program will be the responsibility of the programmer who designed, wrote, tested, and documented the program. (This situation frequently leads to poor documentation of data transformation programs.)

The input for this step consists of the raw analysis file or files, and perhaps other data files which will be "linked" to the raw analysis file. The principal output from this step is a transformed analysis file.

There are four basic work components in this step: (1) setting up the computer program to perform the computations; (2) executing the program on the computer; (3) checking the output from the program, including a partial listing of the data on the output transformed analysis file to determine whether the program performed correctly; and (4) making corrections if necessary. Typically, setting up and executing the computer program are very simple steps. Most of the work involved is in checking the output from the program to determine whether the results are correct. In some cases it may be necessary to perform extensive calculations by hand to verify that the transformations were computed correctly. In principle, checking the results of the computations is performed by someone other than the programmer who developed the computer program. In practice, this may not be practical.
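One way to picture the checking component is an independent recomputation of a sample of output records. The sketch below reuses the same hypothetical fields as the earlier transformation example and is only an illustration, not the CPR's checking procedure.

    import math

    # Illustrative sketch of checking a transformed analysis file: independently
    # recompute the derived values for a sample of subjects and compare them with
    # what the transformation program wrote.  Field names are hypothetical.

    def expected_values(raw: dict) -> dict:
        mean_chol = (raw["chol_1"] + raw["chol_2"] + raw["chol_3"]) / 3
        return {"mean_chol": mean_chol, "log_chol": math.log(mean_chol)}

    def check_sample(raw_records, transformed_records, tolerance=1e-9):
        """Return the subject ids whose transformed values disagree with a recomputation."""
        transformed_by_id = {t["subject_id"]: t for t in transformed_records}
        bad = []
        for raw in raw_records:
            want = expected_values(raw)
            got = transformed_by_id[raw["subject_id"]]
            if any(abs(got[k] - v) > tolerance for k, v in want.items()):
                bad.append(raw["subject_id"])
        return bad

    # e.g. check_sample(raw_analysis_file, transformed_analysis_file) returns []
    # when every sampled record agrees with the hand-style recomputation.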
However, as waS stated above, errors made at this stage
5.3. Step 13: Make a backup copy of the transformed analysis file and save it in a secure place.

The output file from the data transformation program is quite valuable; its cost is the complete cost of all steps up to this point. It is important to back up this data file just as it is important to back up the original Master Data Files and any interim files.

Backing up the transformed analysis file is a simple operation involving execution of a "copy program" and then the execution of a program to compare the original and the copy. This step can be performed by any of the data processors or programmers. The supervisor in charge of executing the data transformation program is also responsible for making a backup copy of the transformed analysis file.

The backup copy of the transformed analysis file is saved in a secure place, separate from the computing center at which the original copy of the transformed analysis file will be kept. "Secure" means properly protected from theft, fire, or other disaster.

The costs of this step are very small. A magnetic tape for backup costs less than $20; the execution of the copy program usually costs less than $40; and the manpower requirement is usually less than one-half man-day. Elapsed time usually runs about one to two calendar days.

Making a backup copy of the transformed analysis file is a very simple procedure and is not usually prone to mistakes. However, if a mistake is made at this stage it is not usually serious; one would not discover a mistake made at this stage unless one were attempting to use the backup copy to regenerate an analysis file which had been destroyed. In that case one can usually simply go back, repeat Step 12, and recover to this point. Thus the cost of a disastrous mistake at this step would usually be simply the cost of repeating Step 12. However, the cost in terms of calendar days might be the one or two weeks which would be required to discover that the backup file was not correct.
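For concreteness, the copy-and-compare operation described for Step 13 can be sketched as follows. This is only an illustration in a present-day language (Python), not the actual copy program; the file names are hypothetical, and on the systems of the period the same work would be done by utility programs that copy and compare magnetic tapes.

    import hashlib
    import os
    import shutil

    def checksum(path, chunk_size=1 << 20):
        """Compute a digest of a file by reading it in chunks."""
        digest = hashlib.sha256()
        with open(path, "rb") as f:
            while chunk := f.read(chunk_size):
                digest.update(chunk)
        return digest.hexdigest()

    original = "transformed_analysis_file.dat"        # hypothetical file names
    backup = "backup/transformed_analysis_file.dat"

    os.makedirs("backup", exist_ok=True)
    shutil.copyfile(original, backup)                  # the "copy program"
    if checksum(original) != checksum(backup):         # the comparison program
        raise SystemExit("backup does not match the original; do not rely on it")
    print("backup verified; store the copy at a separate, secure location")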
6. PRELIMINARY ANALYSES AND DEBUGGING THE DATA: STEPS 14-17.

One of the assumptions stated in the Introduction is that the data have "passed through quite good error detection procedures, including field tests and consistency checks" and that "a substantial effort has been made to correct the errors thus detected." In spite of such checks, which are performed at data collection time, a large data set will usually contain a number of important errors which have escaped earlier detection. The most basic objective of Steps 14-17 is to detect and correct as many data errors as possible. Since the descriptive statistics used for detecting data errors are usually also preliminary results in answering the basic scientific questions, the preliminary analyses also have the secondary objective of producing the first substantive results.

The statistics produced in these steps have still a third purpose. Some analyses aimed at substantive questions may depend critically upon assumptions about the distribution of one or more of the data variables. Descriptive statistics and graphs can be used to examine the validity of the assumptions. If the data depart seriously from the assumptions, one has the opportunity at Step 17 to alter the analysis plans.

6.1. Step 14. Perform computations for preliminary statistical analyses, using the "latest" analysis file. Typical calculations include statistics usually called "descriptive statistics": histograms, percentiles, means, medians, standard deviations, skewness and other moments, cross-tabulations, scatter diagrams, correlations, regressions, etc.

Persons not familiar with the technique of using statistics to debug data are frequently surprised by the types of statistics and graphs computed and plotted for these purposes. For example, both statisticians and non-statisticians are familiar with histograms, but the histograms usually seen divide the whole range for a variable into perhaps five to fifteen intervals. Histograms used to find erroneous data or data "pathologies" may have one interval for each possible value of the variable. A histogram for cholesterol could include one interval for each value from zero to 1800. This makes for a very long histogram, of course, but the low values, e.g., near zero, are needed to detect anomalous values. The high values, above 450, are also used to detect anomalous or highly unusual values. By using one class interval for each possible value, one can detect such phenomena as digit preference; these show up as particularly high frequencies for numbers ending in zero or five.

By computing the percentiles of a distribution and preparing a graph of the percentiles, usually called a "sample cumulative distribution function," one can examine the assumption that a variable follows a particular distribution, such as a normal distribution. There are also statistical tests which give an indication of whether data arise from a particular distribution, such as normal or lognormal. The "sample moments"--mean, standard deviation, skewness, kurtosis, and sometimes higher moments--can also be used to give an indication of whether the data follow an assumed distribution. The more common use of the moments, particularly the mean and standard deviation, is to compare the data with data previously reported in the literature. A mean or standard deviation substantially different from a reported value may indicate a data error.
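The one-interval-per-value histogram and the check for digit preference described above can be sketched as follows, using a present-day language (Python) purely for illustration; the file and variable names are hypothetical.

    import csv
    from collections import Counter

    # Hypothetical file and variable; cholesterol recorded as a whole number.
    with open("transformed_analysis_file.csv", newline="") as f:
        values = [int(row["cholesterol"]) for row in csv.DictReader(f)]

    counts = Counter(values)

    # "Histogram" with one class interval per possible value: print every
    # value that occurs, so anomalies near zero or above 450 stand out.
    for value in range(min(counts), max(counts) + 1):
        if counts[value]:
            print(f"{value:5d} {'*' * counts[value]}")

    # Digit preference: values ending in 0 or 5 occur more often than the
    # roughly 20% expected if terminal digits were used evenly.
    ending_0_or_5 = sum(n for v, n in counts.items() if v % 10 in (0, 5))
    print("proportion ending in 0 or 5:", ending_0_or_5 / len(values))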
When the data are known to follow a distribution such as the normal or lognormal, and the skewness and kurtosis are substantially different from the theoretical values, this may indicate the existence of outliers or other data anomalies.

The univariate statistics described above, such as percentiles and moments, are comparable to field tests (as discussed in the Appendix) in that only one variable at a time is examined. Multivariate test statistics and graphs are useful for examining combinations of variables simultaneously. The most common type of graph used for this purpose is a scatter diagram. Scatter diagrams used for the purpose of detecting data errors typically cover more combinations of variables than one would cover by simply looking at the scientific questions of interest. By plotting many different combinations of variables and examining the "outliers" to see if the data are correct, one can discover a number of data errors which might not show up in one-way statistics. One uses cross-tabulations in a similar manner.

The graphical procedures are typically more useful for discovering data errors than statistical procedures such as computing correlations and regressions, because the eye can scan a diagram for outliers much more quickly than one can peruse a long list of statistics for anomalies. Basic graphical procedures are typically limited to two variables at a time; the value of computing correlations and regressions is that these procedures allow one to examine more than two variables at a time, even graphically. After the multiple correlation coefficients, regression coefficients, etc., are computed, one typically has a difficult time interpreting the numbers alone; however, one can plot the deviations about the regression line versus the various variables entered in the regression and, perhaps, other variables as well. These plots of deviations, or "residuals," are bivariate in the sense that there are only two dimensions in the graph, but they are multivariate in the sense that all the "independent" variables are taken into account in the regression, and these graphs frequently reveal data anomalies that escape detection in univariate or bivariate procedures.
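The residual plots just described might be sketched as follows, with a present-day least-squares fit (Python) standing in for the regression programs of the period. The data here are simulated and the variable names are hypothetical; the point is only that residuals, plotted against each variable entered in the regression, make multivariate anomalies visible.

    import numpy as np
    import matplotlib.pyplot as plt

    # Hypothetical data: y regressed on several "independent" variables.
    rng = np.random.default_rng(0)
    x = rng.normal(size=(200, 3))
    y = 2.0 + x @ np.array([1.5, -0.7, 0.3]) + rng.normal(scale=0.5, size=200)
    y[17] += 8.0                       # one gross data error, for illustration

    design = np.column_stack([np.ones(len(y)), x])       # intercept + predictors
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)    # least-squares fit
    residuals = y - design @ coef

    # One residual plot per variable entered in the regression; the bad
    # record appears as an outlier in every panel.
    fig, axes = plt.subplots(1, x.shape[1], figsize=(9, 3), sharey=True)
    for j, ax in enumerate(axes):
        ax.scatter(x[:, j], residuals, s=10)
        ax.set_xlabel(f"variable {j + 1}")
    axes[0].set_ylabel("residual")
    plt.tight_layout()
    plt.show()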
The mechanics of Step 14 are fairly straightforward. The analysis team assigned to perform the calculations uses as input the transformed analysis file produced in the first twelve steps. All of the statistics required can be produced by standard, available packaged statistical analysis programs such as SPSS, SAS, BMD, etc. The members of the analysis team are familiar with these programs, and setting up the runs to produce the statistics is a fairly simple procedure, given the specifications in the "operational plan" produced in Step 2. The computations are produced by junior programmers and data processors under the direction of the analysis team leader.

If the transformed analysis file is not extremely long, e.g., if it contains fewer than 10,000 cases and fewer than 150-200 variables per case, the computations for preliminary statistical analysis are relatively inexpensive. However, if many data errors are detected, it may be necessary to repeat Steps 14-16 a number of times, and the cost each time through will be approximately the same.

In most situations the control cards for the packaged statistical analysis programs can be prepared in 0.5 to 5.0 man-days, depending on the number of variables and the number of graphs to be produced. Computer costs vary with the number of variables, the amount of calculation, and particularly with the amount of printed output. Typically, one pass through a set of computations can be performed for $25-$200.
The calendar time required to perform the calculations for preliminary statistical analysis also depends on the number of calculations, the amount of output, and so on. If huge amounts of output are being produced, the personnel will usually use a procedure which requires that the program be run during the early morning hours (thus reducing costs) and the output printed at the computation center (thus reducing the load on the terminal used by the programmers). Also, if the transformed analysis file is extremely long, it will be contained on magnetic tape (which requires longer turnaround time) rather than magnetic disks. One pass through the computation stage may require as little as three calendar days or as much as five weeks, depending upon difficulties encountered, vacations, etc. Note, however, that several passes through Steps 14-16 are typically required before proceeding to Step 17.

Errors committed at this stage are usually found and corrected almost immediately. The types of errors committed most frequently are errors in the control cards for the computer programs being used. Many of these errors are such that the program will not execute, or will produce error messages which are detected immediately. If an error escapes detection, such as specifying a wrong variable, the error is likely to be detected in Step 15. The cost of an error committed at this stage is typically small, because each computer run is usually rather inexpensive, on the order of $15-$150.

6.2. Step 15. Examine the output from Step 14 for outliers and other indications of erroneous values. Trace such "outliers" to the original data and determine which are errors and which are correct.

Step 15 is, of course, the most crucial in this series. After the descriptive statistics are computed, someone with a knowledge of the subject-matter area must examine the computer output for indications of data errors. Although data processors and programmers can perform some of this work, for the most part their efforts are limited to screening computer printouts and selecting suspicious printouts for examination by a statistician or subject-matter expert.

There is no simple algorithm for detecting data errors. This work is very much along the lines of detective work. When suspicious values are noted in graphs, or their presence is suggested by unusual values of descriptive statistics, it is necessary to trace the questionable data item by returning to a listing of the data on the transformed analysis file. Once the suspicious data value has been located, the original data form (or a copy on microfilm) is located, and the value recorded on the data files is compared with the original data value. (Of course, if the variable is the result of a transformation, one must locate the data values involved in computing the transformation.) If the data on the transformed analysis file are verified from the original data form, one must select one of the following alternatives: (1) accept the data as recorded; (2) attempt to have the clinic verify the data item, by locating the subject or otherwise; (3) mark the data item as erroneous and, in effect, delete the data item from further statistical computations. This decision is not to be made lightly: systematic censoring of the data file can easily change the conclusions of the analysis performed in subsequent steps, and including erroneous values in the statistical computations can also seriously affect the conclusions to be reached later. It is important that decisions to delete or not delete apparently erroneous data items be carefully documented.

Errors typically fall into one of three categories: data errors, errors caused by incorrect specification of inclusion criteria, or programming errors. As stated above, data errors must be corrected on the master data file and the process must return to Step 8 for production of
a new, corrected analysis file. Errors caused by incorrect specifications of inclusion criteria require that the specification be corrected and the process returned to Step 5. Errors in programming require that the program involved be identified and corrected, and the process returned to the step which involved the program.

It is clear that data errors (in contrast to errors caused by incorrect specifications of inclusion criteria or programming errors) must be handled by someone familiar with the subject matter area. Programmers and data processors are not qualified to determine whether a particular outlier should be deleted or retained in the data. Thus, Step 15 involves a higher level of personnel than the previous steps in this procedure.

The cost associated with Step 15 depends, most importantly, on the number of passes through Step 15 and previous steps which are required to debug the data. There are almost always at least two passes: a first pass to detect errors, then error corrections, and then a second pass to verify the data. Typically, there will be several iterations of computations (Step 14), discovery of data errors (Step 15), and data error corrections. The costs of detecting the errors, and of deciding what to do about them, are principally manpower costs. Costs for data processors, programmers, and the analysis team supervisor may range from half a man-day to five man-days per pass. Senior scientist requirements will typically be on the order of two to ten man-days per pass. Locating data errors (comparing data from the transformed analysis file with data forms, etc.) is a time-consuming process, but it is typically performed by data processors and junior programmers. This phase may require one to twenty man-days per pass, with later passes requiring less time.

The data correction phase also involves computer time and potentially the time of anyone involved in earlier steps. The costs of errors caused by incorrect specifications of inclusion criteria or programming errors have been discussed in the sections corresponding to previous steps. The costs of correcting data errors will typically be one to four man-days of data processor time to update the Master Data File, plus the computer time required to repeat the programs which were executed in the previous steps. The cost of the computer time is difficult to estimate, but only the production runs of the previous steps need to be repeated, not the debugging and testing runs. The computer costs associated with correcting data errors may be as little as $25 or as high as $500, depending on the size of files, number of variables, etc.

Errors committed at this stage fall into two categories: (1) relatively trivial errors which are typically discovered during Step 15, and (2) errors not detected at Step 15. Since Step 15 is an error detection stage, the errors which escape detection here are serious because they are either detected later, after considerable statistical analysis, or they are not detected at all, which is still more serious. The costs of errors detected during Step 15 processing are typically quite small--negligible compared with the overall costs of the analysis. The costs of errors not detected at this stage may be in the range of hundreds or thousands of dollars, since the error may not be discovered until after large amounts of manpower and computer time have been expended in the analysis in subsequent steps. We cannot attach a dollar cost to those errors which escape detection until after the results have been published. The importance of Step 15, of course, is to prevent such occurrences altogether.
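Although Step 15 is described above as detective work rather than an algorithm, the record keeping it requires can be illustrated with a small sketch: each suspicious value is written to a worklist, and the reviewer's disposition (accept, verify with the clinic, or delete) is later recorded so that the decisions are documented. This is only an assumed arrangement in a present-day language (Python); the file names, screening limits, and field names are hypothetical.

    import csv

    SCREENING_LIMITS = {"cholesterol": (80, 450), "weight_lb": (60, 400)}  # assumed

    with open("transformed_analysis_file.csv", newline="") as f:
        records = list(csv.DictReader(f))

    with open("step15_worklist.csv", "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["subject_id", "variable", "value", "disposition", "reviewer", "note"])
        for rec in records:
            for var, (low, high) in SCREENING_LIMITS.items():
                value = float(rec[var])
                if not low <= value <= high:
                    # The disposition column is filled in later by the statistician
                    # or subject-matter expert: accept / verify / delete.
                    writer.writerow([rec["subject_id"], var, value, "", "", ""])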
6.3. Step 16. Write a summary of the subject-matter results of the preliminary analysis.

There are two basic objectives for the written summary of the subject-matter results of the preliminary analysis. The first objective is to list, or summarize, what has been done to the data. That is, one must list or summarize the treatment of data outliers and other changes in the data which could have an effect on the final analytical results. The second objective is of greater interest to the subject-matter specialist: since the analysis specified in the scientific objectives document and operational plan will typically include some of the preliminary analysis, a second objective is to write up those analyses which are pertinent to the scientific questions.

Only a small part of the summary of the subject-matter results from the preliminary analysis will be written by the people who perform the work in Step 14; most of this writing will be performed by the statistician and other senior-level scientists.

Naturally, the costs of this stage will depend on the efficiency of the person writing the summary; typically, since notes will have been taken throughout Step 15 processing, the summary will require one to ten man-days of effort to write, and perhaps one to two man-days of secretarial effort.

6.4. Step 17. Re-examine the scientific objectives document and the operational plan (Steps 1, 2). If changes are made, return to Step 1. Some steps may not need to be repeated; this will be indicated in the new operational plan.

It would be unusual for one to be able to completely specify at Step 1 all the statistical computations which are to be performed and all the statistical analyses which will be of interest. The results of Steps 14-16 will usually raise questions about the assumed distribution of some of the variables, suggest analyses which had not been considered at the time the scientific objectives document was prepared, or suggest other changes in the analysis. If there were no competition for CPR resources, there would be little or no difficulty in changing the plans at this stage. However, because of the strong competition for limited resources, a change at this stage may have dire consequences on the progress of the statistical analysis. Thus, it is necessary for the senior-level scientists involved in the analysis to consider at this stage whether changes should be made in the scientific objectives document or the operational plan, as well as the probable effects of such changes on the progress of the analysis. If major changes are made, it may be necessary to return all the way to Step 1. The effect of such a drastic decision on the costs and time required to complete a statistical analysis should be obvious.

The personnel involved in this decision are primarily at the senior scientist level; however, operational personnel will be available to advise on the effects and costs of various alternatives.

The costs of this step depend upon the personnel involved and how carefully the original scientific objectives document was drawn up. If no changes in the scientific objectives document or the operational plan are indicated, this step may cost nothing. However, if changes are indicated, it may be necessary to call a meeting of a large committee or some other group, in which case the costs would be quite large, including travel costs, costs of manpower for attending the meeting, etc.
7. DEVELOPING STATISTICAL SOFTWARE: STEP 18.

7.1. Step 18. Design, write, debug, test, and document the statistical computation programs required for the statistical analysis.

Although development of statistical analysis software does not properly fall under the heading of data analysis procedures, this step has been included in these procedures as a reminder that there are many types of statistical analysis for which appropriate computer programs are not available. If the scientific objectives document requires the use of such analyses, it will be necessary for the CPR to develop the corresponding computer software.

A description of the procedures, costs, etc., of developing statistical software is well beyond the scope of this document. Since software for all the "easy" analyses has already been developed and incorporated into packages of statistical programs, one may infer that development of software for new analyses is an expensive process. This is generally true. "Expensive" in this context means manpower on the order of one or more man-years of moderate- or senior-level computer programmer time, and several thousand dollars of computer time. The calendar time involved is rarely less than three months, and frequently greater than one calendar year.

It is to be emphasized that there are many distinct stages in the development of statistical software: designing, writing, debugging, testing, and documenting; all are important stages. We are all too familiar with programs which have been written and distributed without appropriate emphasis on the design, debug, test, and documentation stages.

8. STATISTICAL ANALYSIS: STEPS 19-22.

After many preparatory steps, one finally comes to those steps usually associated with "statistical analysis."

8.1. Step 19. Perform the statistical computations required for the desired analyses.

The objective of this step is obvious; the calculations must be performed before the results can be analyzed. We have separated a discussion of the statistical computations from the discussion of the statistical analyses in order to emphasize that these are two separate steps.

The work for this step is performed by data processors or junior-level programmers. Working from the analysis specifications contained in the operational plan, these personnel select appropriate statistical analysis programs from available packages, or use software developed in Step 18, and then prepare the program control statements necessary to produce the required calculations. The program control statements are punched into cards and test runs are executed. If the transformed analysis file is small, the test runs will be executed against this file; however, if the transformed analysis file is large, a subset of this file will typically be used for testing purposes until the program control statements are debugged. After the debugging phase, the actual statistical computation runs are performed.
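The practice of debugging program control statements against a subset of a large transformed analysis file, described above, can be sketched as follows; the file names and the sampling interval are hypothetical, and the sketch uses a present-day language (Python) purely for illustration.

    # Write every 50th record of a large analysis file to a small test file,
    # so that control statements can be debugged cheaply before the
    # production computation runs.
    step = 50                                       # assumed sampling interval
    with open("transformed_analysis_file.csv") as full:
        with open("test_subset.csv", "w") as subset:
            header = full.readline()
            subset.write(header)                    # keep the variable names
            for i, line in enumerate(full):
                if i % step == 0:
                    subset.write(line)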
It is usually the case that several different types of statistical analyses are being attempted. In such a situation the analysis team usually performs the computations for one type of analysis, passes these to the statistician (Step 20) for analysis, and proceeds to perform the computations for the next type of analysis. Thus, one can have different phases of Steps 19 and 20 proceeding simultaneously. In addition, after seeing the results of the computations, the statistician will frequently require that additional computations be performed with slight changes in, for example, the model specifications. Thus one can proceed from Step 19 to Step 20, then back to Step 19, Step 20, etc.

The cost of Step 19 depends directly upon the number of computations to be performed, the complexity of the computations, the size of the data file, and the number of iterations through Step 20 and back to Step 19 that are required. Simple analyses may require as little as $20 of computer time, two man-days of data processor time, and one week of calendar time to complete. More complicated analyses may require up to $2,000 of computer time, three man-months of data processor or junior-level programmer time, and several months of calendar time.

Because of the repetitive nature of going back and forth between Steps 19 and 20, errors made at Step 19 are typically detected either immediately or at Step 20 and are then corrected. Errors are not unusual at Step 19 and usually arise from imprecise specification of the analysis to be performed. Since errors committed at Step 19 by data processors are confounded with requests from the statistician (Step 20) for revision of the model and other types of recomputation, it is very difficult to separate the effects.

8.2. Step 20. Analyze the output created in Step 19 and write preliminary conclusions.

After the statistical computations have been performed in Step 19, the results must be examined by a statistician and subject-matter experts in a process called "statistical analysis." Since statistical analysis is an artful application of scientific methods and principles, we shall not attempt to describe the process here. For the purposes of this paper, it is sufficient to say that the statistician and subject-matter experts must examine many different facets of the data as revealed by the statistical computations. Moreover, as discussed under Step 19, statistical analysis is an iterative process: after examining the results of the statistical computations, it is frequently necessary to perform additional computations, e.g., with a slightly different model formulation.

The costs associated with this step depend, of course, upon the complexity of the problem and many other factors. A simple analysis may require as little as a half man-day at this stage, but more typically the analysis stage will require one to four man-weeks of effort by statisticians and medical scientists. In the simple case, all the work may be accomplished within a day or two; in the more complicated cases, the work may require from one to six calendar months, especially in those cases in which new statistical computations are required again and again.

We cannot effectively assess the effect of mistakes at this stage.

8.3. Step 21. Determine whether additional calculations are needed.

After the "first cut" statistical analyses have been performed and documented, the results are typically distributed to all personnel directly interested or involved in the development of the work. The larger group traditionally decides that not quite enough has been done in certain areas and requests additional statistical analyses.

Since these determinations are principally made outside the CPR, we shall not attempt to approximate the costs or the time required. We only note that substantial alterations of the analysis plan at this stage are equivalent to defining a whole new analysis project, which may require returning to Step 1.
8.4. Step 22. When no further calculations are needed, write up the results for distribution or publication.

Writing results for publication is a process which is quite familiar to the audience of this paper and will not be described here. In the Lipids Research Project, the CPR will typically be involved in this step, both by the provision of statistical expertise and by the provision of some secretarial services. Although the costs attributed to the provision of these services cannot reasonably be attributed to procedures for data analysis, we have included this as the final step, and we remind the reader of the great amounts of manpower time and calendar time required to write up results and get them into the published literature.

APPENDIX

EXPLANATIONS OF CERTAIN COMPUTER PROCESSING TECHNIQUES

1. Error checking: Field Tests and Consistency Tests

1.1. Field Tests

A field test for detecting an erroneous value in a data field is a technique based solely on knowledge about the values which are allowable for the field. Three types of field test are commonly used: (1) valid values, (2) valid ranges, and (3) field type definition.

1.1.1. Valid Values Field Tests

One uses a valid values field test when one can specify all the values which are valid for a particular data field and when the number of valid values is reasonably small. Consider, for example, the item:

    17. Sex of subject:    1  male  . . . . . . .
                           2  female  . . . . . .

There are only two valid values, 1 or 2, for the data field containing the response to this item. The valid value field test would check whether the value in the data (the value punched in the card) was one of the valid values, 1 or 2. If the value in the data field does not match one of the valid values for the field, the test is failed and an error message is printed.
1.1.2. Valid Range Field Tests

One uses a valid range test, instead of a valid values test, when the value for a field must fall in a particular range. For example, consider the item:

    24. Date of birth:    [__]   [__]  [__]
                          month  day   year

Since one cannot know in advance which months will be represented in the data, a field test for the day (of month) must allow all values in the range 1-31. It is feasible to check "day" using a valid values test, listing all the integers 1, 2, 3, ..., 31 as valid values. But it is easier and quicker, for both human and computer, to use the valid range test in this instance. The valid range test compares a data value with the specified range, 1-31 in this case. If the value falls within the limits, or equals either endpoint (both 1 and 31 are valid here), the test is "passed"; if the value falls outside the specified range, the test is failed and an error message is issued.

In some cases there is a large number of valid values, but some values in the whole range are not valid. In such a case it may be necessary to have several non-overlapping valid ranges; the test is passed if the value falls in any one of the valid ranges. For example, if one were studying certain "war babies," the valid ranges for year of birth might be 41-45, 50-54, and 63-73.

1.1.3. Field Type Definition Field Tests

The field type definition test is quite different; rather than using information about possible values for a field, the field type definition test is based upon information about the types of characters a field may contain. For example, a numeric field must not contain alphabetic characters. For the purposes of this type of test we define four types of fields:

(1) integer (whole numbers),
(2) floating point (integers or decimal fraction numbers),
(3) alphabetic (the letters A-Z and the "blank" character), and
(4) character string (no restrictions).

There are explicit technical definitions of these field types, but those go into considerable technical detail; the indications above are sufficiently descriptive. Some examples:

(a) The three fields in a date are usually declared to be "integer" fields. The occurrence of a decimal point, a letter, or a punctuation character (semicolon, comma, etc.) would be invalid.

(b) A serum creatinine would be declared to be a "floating point" field. The decimal point is allowable but not required; a value of "2.0" could be entered simply as "2". (There are certain exceptions not important enough in this context to describe here.) Generally, letters and punctuation signs are illegal.

(c) Only the letters A-Z and the "blank" are allowable in an alphabetic field. Numbers and punctuation are not allowed. Thus the names O'REILLY and EL-ABU are invalid in an alphabetic field; for this reason name fields are usually declared to be character strings, not alphabetic fields. Certain multiple-choice questions and people's initials are examples of alphabetic fields.

(d) When a field does not qualify as one of the other three types, it is declared to be a "character string," which has essentially no restrictions.

The field type definition test is performed in different ways for the different types. No test is performed on character string fields. Each character of an alphabetic field is checked to determine if the character is one of the letters A-Z or "blank"; if any character in the field fails the test, the field fails and an error message is printed. Integers have a special hardware storage mode in the computer. When an integer field is tested, a special conversion subroutine is called to attempt to "convert" the characters punched in the card into the special internal representation. If the value in the field cannot be converted into the internal representation, because of an illegal character or a violation of one of the other technical rules, the conversion subroutine indicates the error to the field test routine and an error message is printed. The same considerations and procedures are used for floating point fields, except that a different conversion subroutine is used.
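The three kinds of field test just described can be illustrated with a short sketch in a present-day language (Python). It follows the logic given above (valid values, possibly non-overlapping valid ranges, and a type check by attempted conversion), but the specific items and limits are hypothetical.

    def valid_values_test(value, valid_values):
        """Valid values field test: the value must be one of a small list."""
        return value in valid_values

    def valid_range_test(value, ranges):
        """Valid range field test: passed if the value falls in any one of
        several (possibly non-overlapping) ranges, endpoints included."""
        return any(low <= value <= high for low, high in ranges)

    def integer_field_test(field):
        """Field type definition test for an integer field: attempt to
        "convert" the characters; failure to convert is an error."""
        try:
            int(field)
            return True
        except ValueError:
            return False

    # Examples corresponding to the items discussed above.
    print(valid_values_test("3", {"1", "2"}))                     # sex of subject: fails
    print(valid_range_test(33, [(1, 31)]))                        # day of month: fails
    print(valid_range_test(52, [(41, 45), (50, 54), (63, 73)]))   # "war babies": passes
    print(integer_field_test("2.0"))                              # decimal point in an integer field: fails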
1.2. Consistency Tests

Consider again the example used above to illustrate valid range field tests:

    24. Date of birth:    [__]   [__]  [__]
                          Month  Day   Year

Clearly, if the subject were born in February, then 31 is not an allowable value for "day of month." However, without knowing the month of birth (a data item) one cannot eliminate 31 as a valid value. This example illustrates the difference between a field test, which is based solely on knowledge about the one particular data field, and a consistency test, which is based upon the relationships between data in two or more fields. Thus, in general, a field test for the "day of month" must allow all values in the range 1-31, but a consistency test--using information from two or more fields--could use different ranges for day of month depending upon the value in the month field or other fields. For example, one could define a consistency test for the month, day, and year fields as follows:

(0) The day must lie in the range 1-31. (The day must pass the field test before bothering with the consistency test.)
(1) If month is one of these: 01, 03, 05, 07, 08, 10, 12, then no further test need be done.
(2) If month is one of these: 04, 06, 09, 11, then day must lie in the range 1-30.
(3) If month is 02 and year is not an even multiple of 4, then day must lie in the range 1-28.
(4) If month is 02 and year is an even multiple of 4 (leap year), then day must lie in the range 1-29.

These examples may not be realistic; the purpose is to illustrate the difference between a field test and a consistency test.
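A consistency test built from rules (0)-(4) above might look like the following sketch (Python, for illustration only). It deliberately reproduces the simplified leap-year rule stated in the text, since the rules are given only to illustrate the idea; it is not offered as a complete calendar check.

    def day_field_test(day):
        """Field test: rule (0) -- day must lie in the range 1-31."""
        return 1 <= day <= 31

    def date_consistency_test(month, day, year):
        """Consistency test for month, day, year, following rules (1)-(4).
        Assumes the individual fields have already passed their field tests."""
        if month in (1, 3, 5, 7, 8, 10, 12):
            return day_field_test(day)          # rule (1): no further test needed
        if month in (4, 6, 9, 11):
            return 1 <= day <= 30               # rule (2)
        if month == 2:
            if year % 4 == 0:                   # rule (4): "leap year" as defined above
                return 1 <= day <= 29
            return 1 <= day <= 28               # rule (3)
        return False                            # the month itself is not a valid month

    print(date_consistency_test(2, 29, 74))     # fails: '74 is not a multiple of 4
    print(date_consistency_test(2, 29, 72))     # passes under rule (4)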
When a data value fails a field test, only one field is involved. But when a set of fields fails a consistency test, this may be due to an error in just one field, or there may be concurrent errors in two or more fields. The consistency checking program usually cannot detect exactly which fields have erroneous values; all fields involved in the test must be considered erroneous until further (human) checking is performed. For example, in the date illustration given above, a date of 02/29/74 would fail the consistency check because '74 is not a leap year. The error may be in the month field (02 being erroneously recorded for 01, for example); it may be in the day field (29 being recorded for 28, for example); it may be in the year field (74 recorded for 72); or it could be in a combination of fields (recording 02/29/74 instead of 03/01/74, for example). The point is that when a consistency test is failed, all fields involved in the test are marked as erroneous even though all but one of the fields may have correct values.

While valid value and valid range tests can be easily specified in terms of lists (or tables) of valid values and valid ranges, consistency tests require more flexibility; they are usually made up of one or more "logical" statements, such as: "If month is one of these: 04, 06, 09, 11, then day must lie in the range 1-30." For this reason consistency tests are usually specified by writing computer subprograms in a convenient language, such as PL/I or Fortran. Field tests can be handled more conveniently in "tables" which are not included in a computer program, but are stored on disk and available to a computer program.

1.3. General Considerations

1.3.1. Field Tests Before Consistency Tests

The example above illustrates another feature of consistency tests. Consistency tests are usually programmed with sufficient care that if the value in a field would fail a field test, it would also fail the consistency test, but not necessarily vice versa: a value could fail the consistency test without failing the field test. Since a failed consistency test requires two or more items to be re-checked by humans, but a failed field test requires only one field to be re-checked, one performs the field tests first. For example, if a date is recorded as 14/29/71, there is no need to bother performing the consistency test until after the month is corrected. That is, a consistency test is not performed until each field involved in the test has passed all of its field tests. This implies that a consistency test may not be performed until some while after the data are available. It may take several weeks for the field-test-generated error message to be returned to the source of the data, for the correction to be looked up, entered on the form, returned to Chapel Hill, keypunched, and put into the system. The consistency test is not performed until after all these steps have been taken and all input fields for that test have passed their field tests.

1.3.2. Errors and Improbable Values

The purpose of field tests and consistency tests is to detect errors. In some cases, as in the case of valid values tests for multiple-choice items, one can write down an "airtight" test: if a data value fails the test, the
value is definitely in error.* But there are many items for which it is difficult to "draw the line" or define limits for valid values or ranges. Consider setting a "valid range" for an item which records the weight, in pounds, of adult subjects (18 and over) of both sexes. What lower limit can one specify? One could specify a lower limit of 70 pounds, but we know there are adults, especially midgets, who weigh under 50 pounds. If we set the lower limit too low, we fail to detect a large number of errors; if we set the lower limit too high, then we make the mistake of calling valid data erroneous. One can't win; no matter where the lower limit is set, one of these two types of mistakes will be made: either one will ignore data errors which should be detected, or one will declare a certain proportion of valid entries to be erroneous, or both. The best approach is to accept the fact that one is going to make both types of mistakes and to set the cutoff point so that the rate of each type of mistake is acceptable. Thus, if one sets the lower cutoff point at 60 pounds, technically a data weight of 50 pounds should be called "improbable" rather than "erroneous." Since people understand what an "error message" is, but may have difficulty with "improbability messages," we will continue referring to values outside limits as "errors," while recognizing that in a small percentage of cases the values are correct, but "improbable."

The discussion above explains the problem but not the solution. In practice one attempts to determine the distribution of a variable (data item) and set limits which include 96%-98% of the population and exclude 1%-2% of each "tail." One sets these limits in advance of data collection and processing. When a value is outside the limits, the value is more likely to be an error than a valid, but improbable, value. One then keeps records of the proportion of "erroneous" values which are actually erroneous and the proportion which are valid but improbable. After some experience, one can adjust the limits to change these two proportions to better match the needs of the study.
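The practice just described--setting screening limits that include roughly 96%-98% of the distribution and treating values outside them as "errors" which may in fact be improbable but correct--can be sketched as follows. The sketch is in a present-day language (Python); the pilot data, the variable, and the chosen tail percentages are hypothetical.

    import random

    # Hypothetical pilot data for adult weight in pounds, used to set the
    # limits in advance of the main data collection.
    random.seed(1)
    pilot_weights = [random.gauss(165, 30) for _ in range(2000)]

    def percentile(data, p):
        """Simple percentile: a value below which a proportion p of the data fall."""
        ordered = sorted(data)
        k = max(0, min(len(ordered) - 1, round(p * (len(ordered) - 1))))
        return ordered[k]

    lower = percentile(pilot_weights, 0.015)   # exclude about 1.5% in each tail,
    upper = percentile(pilot_weights, 0.985)   # so the limits cover roughly 97%
    print("screening limits:", round(lower, 1), "to", round(upper, 1), "pounds")

    # During processing, values outside the limits are reported as "errors,"
    # and a record is kept of how many turn out to be valid but improbable.
    new_values = [58, 142, 171, 410, 236]
    flagged = [w for w in new_values if not lower <= w <= upper]
    print("flagged for review:", flagged)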
*The converse, "if a data value passes the test the value is definitely correct," is almost never true. Our tests are not that good.