Quade, Dana; (1988). "Written Examinations for the Master of Science Degree in the Department of Biostatistics."

WRITTEN EXAMINATIONS FOR THE MASTER OF SCIENCE DEGREE
in the
Department of Biostatistics
School of Public Health
University of North Carolina at Chapel Hill
assembled and edited by
DANA QUADE
Institute of Statistics Mimeo Series No. 1465
September 1984
TABLE OF CONTENTS
                                                      Page
Introduction                                             1
12 April 1980                                            4
11-12 April 1981                                        12
3-4 April 1982                                          19
29 May 1982 ("Special offer", Part II only)             28
15-16 January 1983                                      29
9-10 April 1983 ("special")                             40
21-22 January 1984                                      49
15 April 1984 (Part II only)                            58
INTRODUCTION
This publication contains the written examinations which the Department of Biostatistics has set for candidates for the degree Master of Science (MS), beginning in 1980. Prior to that time, MS candidates took the "Basic Doctoral Written Examination" as their Master's Written Examination, although the standard for acceptable performance was set lower for them than for doctoral candidates. (Copies of the Basic Examination are available in the Institute of Statistics Mimeo Series as Issues #1343: Closed-book Parts and #1344: Take-home Parts.) The Departmental Examinations Committee prepares and conducts all Department-wide written examinations, and arranges for their grading.

During the years 1980 through 1982 the MS examination consisted of two Parts, Part I being closed-book, and Part II open-book. Each Part consisted of 5 questions, of which #4 and #5 were considered more difficult than the others; the MS candidate was expected to answer 3 of the 5 questions, within a period of 3 hours. (MPH and MSPH candidates took the same examination, but under more lenient rules; see Mimeo Series #1329.)

Beginning in 1983 the MS examination has been entirely separate. There are two Parts: Part I (Theory) is closed-book, and Part II (Applications) is open-book. In each Part there are 4 questions, of which the candidate is to answer 3 within a period of 3 hours. This examination is given each January. Special re-examinations for the MS have been combined with the MPH examination later in the year, however, and these have been conducted under different rules; see the copies of the examinations which follow.
-2-
A team of two graders is appointed for each question. Where possible, all graders are members of the Department of Biostatistics and of the Graduate Faculty, and no individual serves on more than one team for the same examination (the two Parts counting separately in this context). The members of each grading team prepare for their question an "official answer" covering at least the key points. They agree beforehand on the maximum score possible for each component, the total for any question being 25. The papers are coded so that the graders are unaware of the candidates' identities, and each candidate's answer is marked independently by each of the two graders. The score awarded reflects the effective proportion of the question correctly answered. The two graders then meet together and attempt to clear up any major discrepancies between their scores. Their joint report may include comments on serious shortcomings in any candidate's answer.

On the basis of a candidate's total score on a paper, the Examinations Committee recommends to the faculty whether the candidate is to be passed, failed, or passed conditionally. In the last case, the condition is specified, together with a time limit. All final decisions are by vote of the faculty.

Examination papers are not identified as to candidate until after the verdicts of PASS and FAIL have been rendered. Once the decisions have been made, advisors are free to tell their students unofficially; the official notification, however, is by letter from the Chairman of the Examinations Committee. Actual scores are never released, but the "official answers" are made public, and candidates who are not passed unconditionally are permitted to see the graders' comments on their papers. A candidate whose performance is not of the standard required may be reexamined at
-3-
the next regularly scheduled examination, or at an earlier date set by the Examinations Committee. One reexamination is permitted automatically.

Candidates whose native language is not English are not to be allowed extra time on Department-wide (not individual course) examinations. This condition may be waived for individual candidates at the discretion of the Department Chairman upon petition by the candidate at least one week prior to the examination.

NOTE. Most of what follows reproduces the examinations exactly as they were originally set; however, minor editorial changes and corrections have been made, particularly in order to save space.
-4-
BASIC MASTER LEVEL WRITTEN EXAMINATION IN BIOSTATISTICS
PART I
(April 12, 1980)
Question 1 The following questions apply to stratified sampling:
a) Briefly describe stratified (simple) random sampling.
b) What makes stratified random sampling a type of probability sampling?
c) Briefly discuss the set of guidelines which you would use in forming strata in this type of design.
d) Briefly describe the relationship between an epsem (or self-weighting) design and proportionate stratified sampling.
e) Noting that the estimated variance of $\bar{y} = (1/n)\sum_j y_j$ from a simple random sample of size $n$ from a population of size $N$ is [expression not legible in the source], derive an estimator for the variance of the estimated mean from a stratified random sample, where $W_h = N_h/N$, $N_h$ is the number of elements in the h-th stratum of the population, $N = \sum_{h=1}^{H} N_h$, $H$ is the total number of strata, and $n_h$ is the number of elements in the sample from the h-th stratum.
-5-
Question 2 Consider the regression model
$Y_i = \alpha + \beta_1 x_i + \beta_2 x_i^2 + e_i, \quad i = 1, \ldots, n.$
Suppose that you want to test [hypotheses not legible in the source].
a) State the usual assumptions you need to make for this purpose.
b) Derive the expression for the mean square due to error.
c) Write down the formula for the test statistic.
d) Suppose that you want to test also $H_0^*: \beta_1 = \beta_2 = 0$ vs. $H_1^*: (\beta_1, \beta_2) \neq 0$.
e) What are the critical regions for the testing problems in (c) and (d)?
Question 3 A large machine consists of 50 components. Past experience has shown that a particular component will fail during an 8-hour shift with probability .1. The equipment will work if no more than one component fails during an 8-hour shift.
a) Calculate the probability that the machine will work throughout an entire 8-hour shift, assuming the binomial distribution is applicable. Carefully define any notation and state the assumptions required for the valid application of the binomial procedure.
b) State the general situation and assumptions which are required for the Poisson model to be applicable to this situation. Give the Poisson density function.
c) The Poisson distribution may be used to approximate the binomial distribution when n is "large" and p is "small". Use the Poisson distribution to find an approximation to the probability computed in part (a).
d) Calculate an approximation to the probability obtained in part (a) by using the normal approximation to the binomial distribution. State the assumptions under which the approximation is reasonable.
e) Compare the accuracy and usefulness of these three calculations.
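The three calculations asked for in the machine-failure question can be compared numerically. The following is a minimal sketch, not an official answer: it assumes independent component failures, a Poisson mean of np = 5, and a continuity correction of 0.5 for the normal approximation; the helper names are illustrative only.

```python
import math

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def poisson_cdf(k, lam):
    """P(X <= k) for X ~ Poisson(lam)."""
    return sum(math.exp(-lam) * lam**i / math.factorial(i) for i in range(k + 1))

def normal_cdf(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

n, p = 50, 0.1                      # components and failure probability
exact = binom_cdf(1, n, p)          # machine works if at most one failure
poisson = poisson_cdf(1, n * p)     # Poisson approximation with mean np
# Normal approximation with continuity correction (an assumption of this sketch)
normal = normal_cdf((1.5 - n * p) / math.sqrt(n * p * (1 - p)))

print(f"binomial {exact:.4f}  Poisson {poisson:.4f}  normal {normal:.4f}")
```

With n = 50 and p = .1 the Poisson value lies closer to the binomial value than the normal value does, which is the kind of comparison part (e) asks about.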
-6-
Question 4 Suppose that X has a chi-square distribution
$f_X(x) = \frac{1}{2^{\nu/2}\,\Gamma(\nu/2)}\, x^{\nu/2 - 1} e^{-x/2}, \quad \nu > 0, \; x > 0,$
and Y has a beta distribution
$f_Y(y) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\, y^{\alpha-1}(1-y)^{\beta-1}, \quad \alpha > 0, \; \beta > 0, \; 0 < y < 1.$
a) Obtain formulae for $E(X^r)$ and $E(Y^r)$, where r is a positive integer.
b) If $r < [\nu/2]$, show that $E(X^{-r}) = [(\nu-2)(\nu-4)\cdots(\nu-2r)]^{-1}$.
c) Find $E(Y^{-r})$. What condition must r satisfy?
d) Assume that X and Y are independent, and let Z = X/Y. Find E(Z).
Question 5 Suppose that the lifetime (T) of a certain mechanical device has a Weibull distribution with PDF
$f_T(t) = \frac{c}{\theta^c}\, t^{c-1} \exp[-(t/\theta)^c], \quad c > 0, \; \theta > 0, \; t > 0.$
a) Obtain the formula for the cumulative distribution function, $F_T(t)$, and evaluate the expected proportion of failures
(i) before time $\tau$;
(ii) after time $\tau$.
b) Suppose that N items were put on test at t = 0, and n were observed to fail before time $\tau$. Suppose that exact failure times were recorded; let $t_i$ denote the failure time of the i-th item among those items which failed (i = 1, ..., n). Construct the likelihood function.
EDITORIAL NOTE: Two tables were appended to this examination:
a) Standard normal distribution function $\Phi(x)$ for x = -3(.01)3
b) Natural logarithms ln(x) for x = 1(.01)10
-7-
BASIC MASTER LEVEL WRITTEN EXAMINATION IN BIOSTATISTICS
PART II
(April 12, 1980)
M.P.H. and M.S.P.H. students are to answer any two questions during the two-hour period (1 pm - 3 pm). M.S. students are to answer three questions, of which not more than 2 should be from Section A - time period 1 pm - 4 pm.
You are required to answer only what is asked in the questions and not all you know about the topics.
Question 1
A survey of 320 families, each with 5 children, revealed the following distribution:
No. of girls        0     1     2     3     4     5   Total
No. of families    18    56   110    88    40     8     320
a) Is the result consistent with the hypothesis that male and female births are equally probable? Test this hypothesis at the significance levels $\alpha = 0.05$ and $\alpha = 0.01$.
b) What is the maximum likelihood estimate of the probability of a female birth?
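One way to approach the family-size question is a chi-squared goodness-of-fit test against the Binomial(5, 1/2) distribution. The sketch below assumes that approach (the examination does not prescribe a method) and takes the 5-degree-of-freedom critical values 11.070 and 15.086 from a standard chi-squared table.

```python
from math import comb

observed = [18, 56, 110, 88, 40, 8]      # families with 0..5 girls
n_children = 5
n_families = sum(observed)

# Expected counts if each birth is a girl with probability 1/2
expected = [n_families * comb(n_children, k) * 0.5**n_children
            for k in range(n_children + 1)]
chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Maximum likelihood estimate of P(girl): girls observed / children observed
total_girls = sum(k * o for k, o in enumerate(observed))
p_hat = total_girls / (n_children * n_families)

CRIT_05, CRIT_01 = 11.070, 15.086        # chi-squared table values, 5 d.f.
print(f"chi-squared = {chi_sq:.2f} on 5 d.f.")
print("reject at 0.05:", chi_sq > CRIT_05, " reject at 0.01:", chi_sq > CRIT_01)
print(f"MLE of P(girl) = {p_hat:.4f}")
```

No parameter is estimated under the null hypothesis, so the statistic is referred to 5 degrees of freedom.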
-8-
Question 2
A. Briefly describe or explain the following terms:
(a) OS
(b) TSO
(c) Track (on magnetic disk)
(d) Block
(e) Logical Record
(f) Byte
(g) JCL
B. Compare and contrast:
(a) OS dataset
(b) SAS database
(c) SAS dataset
(That is, demonstrate that you know what each of these terms means and the differences between them.)
C. The printout in Figure 1 was produced by the PROC PRINT statement in line 190 of the SAS program shown in Figure 2.
(a) Show what will be printed as a result of the PROC PRINT statement in line 230 in Figure 2. (Be careful to write out the entire output to be produced by SAS except titles and page headings.)
(b) Show what will be printed as a result of the PROC MEANS (Figure 2, line 250) and related statements.
(c) Show what will be printed as a result of the PROC PRINT statement in line 440 of Figure 2.
(d) Show what will be printed as a result of the PROC PRINT statement in line 530 of Figure 2.
Figure 1. Printout Produced by the PROC PRINT Statement in Line 190 of Figure 2.
OBS   NAME         SEX   AGE   HEIGHT   WEIGHT
  1   ALFRED        M     14     69      112
  2   ALICE         F     13     56       84
  3   BERNADETTE    F     13     65       98
  4   BARBARA       F     14     62      102
  5   JAMES         M     12     57       83
-9-
Figure 2. SAS Program for Problem 2
00050 //        EXEC SAS
00060 /*PW=EXAM
00070 //R?X     DD *
00080 M 14 69 112 ALFRED
00090 F 13 56 84 ALICE
00100 F 13 65 98 BERNADETTE
00110 F 14 62 102 BARBARA
00120 M 12 57 83 JAMES
00130 //SYSIN   DD *
00140 DATA STUDENT1;
00150   INFILE R?X;
00160   LENGTH NAME $ 20 SEX $ 1;
00170   INPUT SEX AGE HEIGHT WEIGHT NAME;
00180
00190 PROC PRINT;                       * THIS PRODUCES FIGURE 1;
00200
00210 PROC SORT; BY SEX HEIGHT;
00220
00230 PROC PRINT;                       * PROBLEM 3 (A);
00240
00250 PROC MEANS MEAN N;                * PROBLEM 3 (B);
00260   BY SEX;
00270   VAR AGE HEIGHT;
00280
00290 PROC SORT; BY NAME SEX AGE;
00300
00310 DATA CHANGES;
00320   LENGTH NAME $ 20 SEX $ 1;
00330   MISSING _;
00340   INPUT SEX AGE HEIGHT WEIGHT NAME;
00350   CARDS;
00360 F 13 50 . ALICE
00370 M 12 . 8? JAMES
00380 M 14 . . ALFRED
00390 F 14 _ . BARBARA
00400 F 16 62 102 BARBARA
00410 F 12 59 84 JANE
00420 ;
00430                                   * END OF CHANGE DATA;
00440 PROC PRINT;                       * PROBLEM 3 (C);
00450
00460 PROC SORT; BY NAME SEX AGE;
00470
00480 DATA STUDENT2;
00490   UPDATE STUDENT1 CHANGES;
00500   BY NAME SEX AGE;
00510   IF HEIGHT > 0;
00520
00530 PROC PRINT;                       * PROBLEM 3 (D);
-10-
Question 3 A sample survey of 800 adults is conducted in a large city in order to determine citizen attitudes toward national health insurance. The sampling frame for the survey is a list of adult taxpayers from which four strata are formed corresponding to each of four major sections of the city. The design calls for selecting a simple random sample within each stratum. General data for the survey sample as well as results for an opinion question on national health insurance are as follows:

Stratum (h)                         (Middle Class   (Inner    (Blue      (Affluent    TOTAL
                                     Suburbs)        City)     Collar)    Suburbs)
Total number of adults (N_h)          20,000         20,000    50,000     10,000     100,000
Number of adults interviewed
  in the sample (n_h)                    200            200       200        200         800
Number of interviewed sample
  adults who are in favor of
  national health insurance (x_h)        190            180        40         30         440

a) Calculate a stratified estimate of the proportion (P) of adults in the city who favor national health insurance.
b) Calculate an estimate of the variance of the estimate produced in Part (a).
c) A colleague argues that since a "random sample" of adults has been chosen and since simple random sampling was used, the estimate and its variance can be calculated as if a simple random sample of n = 800 adults had been selected. Calculate the estimate and its variance as the colleague suggests.
d) Comment briefly on the difference between your analysis and your colleague's analysis.
e) If you decide to use Neyman allocation for a similar survey on national health insurance in the future, how would you then allocate a sample of n = 800 adults to the same four strata?
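The stratified and simple-random-sample calculations in the survey question can be laid out as follows. This is a sketch under stated assumptions: the pairing of the four columns of counts with the stratum names follows one reading of the flattened table and may not match the original layout, the variance estimator uses p(1-p)/(n_h - 1) with a finite population correction, and Neyman allocation uses the sample proportions as planning values.

```python
import math

# (N_h, n_h, x_h) per stratum; the name-to-column pairing is an assumption
strata = {
    "Middle Class Suburbs": (20_000, 200, 190),
    "Inner City":           (20_000, 200, 180),
    "Blue Collar":          (50_000, 200, 40),
    "Affluent Suburbs":     (10_000, 200, 30),
}
N = sum(N_h for N_h, _, _ in strata.values())

# Stratified estimate and its estimated variance
p_st = sum(N_h / N * x_h / n_h for N_h, n_h, x_h in strata.values())
var_st = sum(
    (N_h / N) ** 2 * (1 - n_h / N_h) * (x_h / n_h) * (1 - x_h / n_h) / (n_h - 1)
    for N_h, n_h, x_h in strata.values()
)

# The colleague's analysis: treat all 800 adults as one simple random sample
n = sum(n_h for _, n_h, _ in strata.values())
x = sum(x_h for _, _, x_h in strata.values())
p_srs = x / n
var_srs = (1 - n / N) * p_srs * (1 - p_srs) / (n - 1)

# Neyman allocation: n_h proportional to N_h * S_h, with S_h = sqrt(p_h (1 - p_h))
size = {name: N_h * math.sqrt(x_h / n_h * (1 - x_h / n_h))
        for name, (N_h, n_h, x_h) in strata.items()}
alloc = {name: 800 * s / sum(size.values()) for name, s in size.items()}

print(f"stratified: {p_st:.3f} (variance {var_st:.6f})")
print(f"as SRS:     {p_srs:.3f} (variance {var_srs:.6f})")
print({name: round(a) for name, a in alloc.items()})
```

The two point estimates differ because the sample is allocated equally to strata of very unequal size, so the unweighted proportion over-represents the smaller strata.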
-11-
Question 4
Read the attached article by Topoff and Mirenda from a recent issue
of Science. Note that from the data of Table I the authors were led
to calculate a chi-squared statistic. Discuss the statistical
appropriateness and the numerical accuracy of their analysis.
EDITORIAL NOTE: The article referred to, including Table I, was reproduced
and appended to this examination. It originally appeared in
Science 207: 1099-1100 (7 March 1980).
Question 5 Ten observations on three independent variables ($X_1$, $X_2$, $X_3$) and one dependent variable were obtained as follows:
X1
X2
X3
Y
2.7
6.0
2.3
6.8
1.8
6.0
2.7
7.3
2.7
7.1
2.8
6.8
2.2
6.0
2.4
4.0
5.5
6.2
5.5
6.2
7.9
6.2
5.5
9.0
7.8
1.6
3.0
6.2
7.3
2.9
4.0
9.0
7.8
2.2
2.2
2.7
3.0
3.2
2.2
2.7
It is postulated that the model [model equation not legible in the source] is reasonable.
(i) Inspect the model and the data and comment on the appropriateness of the model.
(ii) Determine an estimate of the variance of the random error, $\sigma^2$, for these data.
(iii) By plotting the data or otherwise, select a model with few parameters that appears appropriate and fit that model. Comment on the fit.
(iv) Does $R^2$ tell you the same thing as a direct test of lack of fit based on an estimate of pure error?
-12-
BASIC MASTER LEVEL WRITTEN EXAMINATION IN BIOSTATISTICS
PART I
(April 11, 1981)
Question 1
a) Define probability sampling from finite populations.
b) Briefly discuss the advantages of probability samples over nonprobability samples.
c) The recreation department in a large city hires you to design a cluster sample of families for an interview survey. The most important survey objective is to produce a 95% confidence interval for the proportion (P) of families who are aware of city recreation programs. The expected "half-width" of the confidence interval for the estimated proportion (p) will be about*
$d = 1.96\sqrt{Var(p)}$
You decide to select a simple random sample of blocks. Your past experience tells you that the design effect will be about 4. You also know that P is about 0.3. Ignoring the finite population correction, how large a sample of families would you recommend, so that d = 0.05?
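The sample-size reasoning in part (c) can be sketched as below. The sketch assumes the usual approach of inflating the simple-random-sampling size by the design effect, with Var(p) taken as deff * P(1-P)/n; the function name is illustrative.

```python
import math

def cluster_sample_size(P, d, deff, z=1.96):
    """Families needed so that the CI half-width is d under a design effect deff."""
    n_srs = z**2 * P * (1 - P) / d**2   # size needed under simple random sampling
    return math.ceil(deff * n_srs)      # inflate by the design effect, round up

n_families = cluster_sample_size(P=0.3, d=0.05, deff=4)
print(n_families)
```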
Question 2 Let X and Y have the joint density function
$f_{X,Y}(x, y) = 2(1-\theta)^{-2}, \quad 0 < \theta < y < x < 1.$
a) Find the marginal distribution of X.
b) Find the conditional distribution of Y given that X = x.
c) The quantity E(Y | X = x) is often called the regression function of Y on x. Find E(Y | X = x).
d) For n pairs of data points $(x_i, y_i)$, i = 1, ..., n, derive the formula for the least squares estimator of $\theta$ using your answer to part (c).
*Recall that a 95% confidence interval for the estimated proportion (p) produced in your analysis will be of the form:
Lower Limit: $p - 1.96\sqrt{Var(p)}$
Upper Limit: $p + 1.96\sqrt{Var(p)}$
-13-
Question 3 Suppose that 60% of a particular breed of mice exhibit aggressive behavior when injected with a given dose of a stimulant. An experimenter will apply the stimulant to 3 mice one after another and will observe the presence or absence of aggressive behavior in each case.
a) List the sample space for the experiment. (Use A to denote aggressive and N to denote nonaggressive.)
b) Assuming that the behaviors of different mice are independent, determine the probability of each elementary outcome.
c) Find the probability that
(i) two or more mice will be aggressive,
(ii) exactly two mice will be aggressive, and
(iii) the first mouse will be nonaggressive while the other two will be aggressive.
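The sample space and event probabilities for the three mice can be enumerated directly. A minimal sketch, assuming independence between mice as stated in part (b):

```python
from itertools import product

P_AGGRESSIVE = 0.6
prob = {"A": P_AGGRESSIVE, "N": 1 - P_AGGRESSIVE}

# Enumerate the 8 elementary outcomes and their probabilities
outcomes = {}
for seq in product("AN", repeat=3):
    p = 1.0
    for behaviour in seq:
        p *= prob[behaviour]
    outcomes["".join(seq)] = p

p_two_or_more = sum(p for s, p in outcomes.items() if s.count("A") >= 2)
p_exactly_two = sum(p for s, p in outcomes.items() if s.count("A") == 2)
p_naa = outcomes["NAA"]

for s, p in outcomes.items():
    print(s, round(p, 3))
print(round(p_two_or_more, 3), round(p_exactly_two, 3), round(p_naa, 3))
```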
Question 4 Let $f_X(x) = \theta^{-1} e^{-x/\theta}$, x > 0, $\theta$ > 0, be the pdf of an exponential distribution.
a) Show that, in fact, X is distributed as $(\theta/2)\chi^2_2$, where $\chi^2_2$ has the chi square distribution with 2 degrees of freedom.
b) Based on a random sample $X_1, \ldots, X_n$ of size n, find the maximum likelihood estimator of $\theta$.
c) Construct the likelihood ratio test for testing the null hypothesis $H_0: \theta = \theta_0$ against the alternative $\theta < \theta_0$.
d) Derive the power function of the test.
Question 5 Let $X_1, \ldots, X_n$ be n independent and identically distributed random variables with an unknown (but continuous) distribution function F, and let $\theta$ be the median of F. Suppose that one wants to test for [hypotheses not legible in the source].
a) Write down the expression for the sign test statistic for this testing problem and its exact distribution under $H_0$. What approximation to this distribution would you recommend when n is large?
b) What additional assumption do you need to make to use the Wilcoxon signed rank statistic for this testing problem?
c) Compute the first two moments of the Wilcoxon signed-rank statistic (under $H_0$).
d) What can you say about the exact distribution of the signed rank statistic when $H_0$ holds and n is small? What approximation would you recommend when n is large?
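For small n the exact null distribution of the Wilcoxon signed-rank statistic can be obtained by enumeration, which also gives a check on its first two moments. The sketch below assumes the statistic is defined as the sum of the ranks attached to positive signs, with no ties, so that under the null hypothesis each of the 2^n sign patterns is equally likely.

```python
from fractions import Fraction
from itertools import product

def signed_rank_null(n):
    """Exact null distribution of T+ (sum of positively signed ranks)."""
    counts = {}
    for signs in product((0, 1), repeat=n):
        t = sum(rank for rank, s in zip(range(1, n + 1), signs) if s)
        counts[t] = counts.get(t, 0) + 1
    return {t: Fraction(c, 2**n) for t, c in sorted(counts.items())}

def moments(dist):
    """Mean and variance of a discrete distribution given as {value: prob}."""
    mean = sum(t * p for t, p in dist.items())
    var = sum((t - mean) ** 2 * p for t, p in dist.items())
    return mean, var

n = 6
mean, var = moments(signed_rank_null(n))
print(mean, var)   # compare with n(n+1)/4 and n(n+1)(2n+1)/24
```

The enumeration grows as 2^n, which is why a normal approximation with these two moments is the usual recommendation for large n.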
-14-
PART II
(April 12, 1981)
Question 1
The mean drying time of a brand of spray paint is known to be 90 seconds. The research division of the company that produces this paint contemplates that adding a new chemical ingredient to the paint will accelerate the drying process. To investigate this conjecture, the paint with the chemical additions is sprayed on 15 surfaces and the drying times are recorded. The mean and standard deviation computed from these measurements are 86 seconds and 5.6 seconds respectively.
5 points   (a) Do these data provide strong evidence that the mean drying time is reduced by the addition of the new chemical?
6 points   (b) Construct a 98% confidence interval for the mean drying time of the paint with the chemical additive.
9 points   (c) Suppose that the actual standard deviation for the drying time does not change with the addition of the new chemical and is known to be equal to 6 seconds. Given this additional information, what would be your conclusions in (a) and (b)?
5 points   (d) Suppose that it is also conjectured that the standard deviation of the drying time decreases with the addition of the new chemical. Do these data provide strong evidence for that?
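The test statistics and intervals for the drying-time question can be computed as follows. This is a sketch only: it assumes normally distributed drying times, a one-sided 1% level for "strong evidence" (the examination does not fix a level), and table values t(.99; 14) = 2.624, z(.99) = 2.326 and the lower 5% chi-squared point 6.571 on 14 d.f.

```python
import math

n, xbar, s, mu0 = 15, 86.0, 5.6, 90.0

# (a), (b): sigma unknown, Student's t with n - 1 = 14 degrees of freedom
se_t = s / math.sqrt(n)
t_stat = (xbar - mu0) / se_t
T_99_14 = 2.624                                   # table value
ci_t = (xbar - T_99_14 * se_t, xbar + T_99_14 * se_t)

# (c): sigma known to be 6 seconds, standard normal reference
sigma = 6.0
se_z = sigma / math.sqrt(n)
z_stat = (xbar - mu0) / se_z
Z_99 = 2.326                                      # table value
ci_z = (xbar - Z_99 * se_z, xbar + Z_99 * se_z)

# (d): H0: sigma = 6 against sigma < 6, chi-squared with 14 degrees of freedom
chi_sq = (n - 1) * s**2 / sigma**2
CHI2_05_14 = 6.571                                # lower 5% table value

print(f"t = {t_stat:.3f}, 98% CI = ({ci_t[0]:.2f}, {ci_t[1]:.2f})")
print(f"z = {z_stat:.3f}, 98% CI = ({ci_z[0]:.2f}, {ci_z[1]:.2f})")
print(f"chi-squared = {chi_sq:.3f}; evidence of reduced SD: {chi_sq < CHI2_05_14}")
```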
Question 2
The state welfare agent is in the process of sampling unemployment data in his state. The state is divided into 4 regions, each with approximately the same population. Each region is in turn divided into 750 equal-sized sampling units. From each region five sampling units are selected and sampled intensively. The percentage unemployment for one such test is given below.
Region A:   4.2   3.9   4.5   5.0   4.5
Region B:   4.4   5.0   4.1   5.1   3.7
Region C:   3.7   3.5   5.1   4.4   3.9
Region D:   4.8   3.1   3.6   5.2   5.2
(a) Is this a simple random sampling? Why or why not?
(b) Calculate the mean unemployment rate for each region and use them to estimate the mean rate for the entire sample of 20 observations.
(c) Compute the mean using the entire 20 observations. Does it differ from the answer in part (b)? Why or why not?
(d) What procedure would be necessary if the regions and sampling units were of different sizes in terms of population?
-15-
Question 3
A) Briefly describe the purpose(s) of:
(a) JCL
(b) DATA step of SAS
(c) PROC step of SAS
B) Define, and describe the relationships between:
OS file (dataset)
SAS database
SAS dataset
(Examples of corresponding JCL and SAS code may be useful.)
C) In less than 1 page, outline the basic steps of the process required to create a SAS dataset that is ready for statistical analysis. Suppose the input dataset is a raw, unchecked dataset stored on disk. The SAS dataset is also to be stored on disk.
D) Write out the job or jobs, including JCL and SAS code, needed to do the following on a dataset with the format given in Figure 1. Use UNC.B.E.99U as the account number and MEXAM as the password.
(a) Create a SAS dataset, stored on on-line disk, called VEHACC that includes all the variables, but just the reportable cases. Use the variable names given in capitals in the format. Label Height and Weight.
(b) Print 10 observations.
(c) Create a variable called AGE_GP with the following codes:
AGE values       AGE_GP values
0 thru 10        1
11 thru 24       2
25 thru 54       3
55 and over      4
Unknown
(d) Plot Height vs. Weight for males and females separately.
(e) Create cross-tabulation tables of Sex by Restraint, Sex by Injury, and Injury by Restraint.
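The recoding logic asked for in item D(c) is shown here in Python purely as an illustration of the logic; the examination itself asks for SAS code. The sketch assumes that AGE = 99 ("Not stated" in the file format) is the unknown category, that 98 ("Older than 97") belongs with "55 and over", and that unknown ages are mapped to a missing value, since no code for "Unknown" is legible in the table. The function name is hypothetical.

```python
def age_group(age):
    """Map an AGE code to AGE_GP; None stands in for a missing value."""
    if age is None or age == 99:   # 99 = not stated, treated as unknown (assumption)
        return None
    if age <= 10:
        return 1
    if age <= 24:
        return 2
    if age <= 54:
        return 3
    return 4                       # 55 and over, including code 98

for age in (5, 10, 11, 24, 25, 54, 55, 98, 99):
    print(age, age_group(age))
```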
-16-
Figure 1
Format for a vehicle-oriented accident file
DSN=UNC.B.E999U.MASTERS.VEH.RAW
There are approximately 1000 records

Column   Information
1        Accident Reporting Type - ATYPE
           1 Non-reportable
           2 On private property
           3 Reportable
2-7      Accident Case Number - ID
8        Injury Class - INJ
           1 Not injured
           2 Class C injury
           3 Class B injury
           4 Class A injury
           5 Killed
           6 Not stated
9        Restraint Used - BELT
           1 No belt
           2 Lap and shoulder belt
           3 Child restraint
           4 Not stated
10       Race - RACE
           1 White
           2 Black
           3 Indian
           4 Other
           5 Not stated
11       Sex - SEX
           1 Male
           2 Female
           3 Not stated
12-      Age - AGE
           01-97 Actual age
           98 Older than 97
           99 Not stated
13-14    Height - HT
           01-98 Actual height in inches
           99 Not stated
15-17    Weight - WT
           001-998 Actual weight in pounds
           999 Not stated
-17-
Question 4
Suppose we have conducted an experiment to estimate the weight gain for a sample of 14 dairy cows as a result of a one week exposure to a feed additive. One statistical question of interest is to test whether the weight gain is zero. Another major objective is to estimate the weight gain. The 14 dairy cows comprised three different breeds; there were 7 Holsteins, 5 Jerseys, and 2 Guernseys. The data collected were as follows:
             Weight Gain in lbs.
Holsteins    -7, 5, -1, 3, 1, 6, 0
Jerseys      2, -2, 1, 7, 2
Guernseys    9, 2
1. Suppose we were interested in testing $H_0$: mean weight gain = 0, and we had in mind a target population of dairy cows in which Holsteins, Jerseys, and Guernseys were in the ratio 7:5:2. Specify the test you would carry out. Compute the test and state the significance level. Provide a corresponding estimate of mean weight gain.
2. Suppose we were interested in testing $H_0$: mean weight gain = 0, and we were primarily interested in a target population of dairy cows in which Holsteins, Jerseys, and Guernseys were in equal proportions. Specify the test you would carry out. Compute the test. Provide a corresponding estimate of mean weight gain.
3. Are tests in questions 1 and 2 the same? Comment.
4. Suppose now that we were interested in testing $H_0$: mean weight gain = 0 and we had a target population in which Holsteins, Jerseys, and Guernseys were in the ratio $W_H : W_J : W_G$. Specify an appropriate test.
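The estimates of mean weight gain under the different target populations can be sketched as weighted means of the breed means. This illustrates the estimation step only, assuming the breed means are weighted by the target-population proportions; the choice of test statistic is left to the question. The function name is illustrative.

```python
gains = {
    "Holsteins": [-7, 5, -1, 3, 1, 6, 0],
    "Jerseys":   [2, -2, 1, 7, 2],
    "Guernseys": [9, 2],
}
breed_means = {breed: sum(v) / len(v) for breed, v in gains.items()}

def weighted_mean(weights):
    """Mean weight gain for a target population with the given breed ratio."""
    total = sum(weights.values())
    return sum(weights[b] / total * breed_means[b] for b in weights)

mean_752 = weighted_mean({"Holsteins": 7, "Jerseys": 5, "Guernseys": 2})
mean_equal = weighted_mean({"Holsteins": 1, "Jerseys": 1, "Guernseys": 1})

print(breed_means)
print(f"ratio 7:5:2 -> {mean_752:.3f}; equal proportions -> {mean_equal:.3f}")
```

With the ratio 7:5:2 the weights coincide with the sample sizes, so the weighted mean reduces to the ordinary mean of all 14 observations.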
-18-
Question 5
In measuring the various constituents of cow's milk, it is of interest to determine how protein (Y) is related to fat ($x_1$) and solid-nonfat ($x_2$). Samples of 10 cows were taken and the following data were obtained:

Observation   Protein (y)   Fat (x1)   Solid Non-Fat (x2)
     1           3.75         4.74          9.50
     2           3.19         3.66          8.56
     3           2.99         4.27          8.54
     4           3.46         4.03          8.62
     5           3.27         3.51          9.35
     6           3.27         3.97          8.39
     7           2.78         3.23          7.87
     8           3.59         3.79          9.33
     9           3.16         3.36          8.86
    10           3.65         3.64          9.21

Assume the linear regression of Y on $x_1$ and $x_2$, and
(i) Find the least squares estimates of the (partial) regression coefficients.
(ii) We are interested in testing the null hypothesis that both these partial regression coefficients are equal to 0. Test this hypothesis at the significance level $\alpha$ = 0.05. State clearly the assumptions you need to make in this context.
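The least squares fit and overall F statistic for the milk data can be obtained from the normal equations. The sketch below uses only the standard library; the `solve` helper is an illustrative Gaussian elimination written for this example, and the critical value F(.95; 2, 7) = 4.74 is taken from a standard table.

```python
y  = [3.75, 3.19, 2.99, 3.46, 3.27, 3.27, 2.78, 3.59, 3.16, 3.65]
x1 = [4.74, 3.66, 4.27, 4.03, 3.51, 3.97, 3.23, 3.79, 3.36, 3.64]
x2 = [9.50, 8.56, 8.54, 8.62, 9.35, 8.39, 7.87, 9.33, 8.86, 9.21]

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

# Normal equations (X'X) b = X'y with an intercept column
X = [[1.0, a, b] for a, b in zip(x1, x2)]
xtx = [[sum(row[i] * row[j] for row in X) for j in range(3)] for i in range(3)]
xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(3)]
b0, b1, b2 = solve(xtx, xty)

fitted = [b0 + b1 * a + b2 * b for a, b in zip(x1, x2)]
resid = [yi - fi for yi, fi in zip(y, fitted)]
n = len(y)
ybar = sum(y) / n
ss_reg = sum((f - ybar) ** 2 for f in fitted)
ss_err = sum(r ** 2 for r in resid)
f_stat = (ss_reg / 2) / (ss_err / (n - 3))   # F on (2, 7) degrees of freedom
F_CRIT = 4.74                                # table value, alpha = 0.05

print(f"b0 = {b0:.4f}, b1 = {b1:.4f}, b2 = {b2:.4f}")
print(f"F = {f_stat:.2f}; reject H0 at 0.05: {f_stat > F_CRIT}")
```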
-19-
BASIC MASTER LEVEL WRITTEN EXAMINATION IN BIOSTATISTICS
PART I
(April 3, 1982)
Question 1.
You are called upon to assist the health department in a large city with the design of a local household survey. The survey's principal objective will be to estimate the proportion of households in which the person usually responsible for preparing the meals is aware of the importance of a balanced dietary intake by members of the household. Between 20 and 40 percent of the local households are thought to be aware of this. The health department recognizes the importance of a high response rate, but has only a modest amount of money to do the survey. Moreover, the survey will probably have to be conducted by staff of the health department who collectively have little survey experience.
a. Briefly discuss the relative merits of the three methods of data collection being considered for the survey: self-administered questionnaire by mail, telephone interview, and personal interview.
b. If either telephone or personal interviewing is the selected method, two-stage cluster sampling will be used to select the sample of households. The design effect in either case is expected to be about 1.5. Briefly describe what a "design effect" is and what things contribute to its size.
c. In the event that the two-stage design is used, determine the number of completed household interviews which would be required to yield a coefficient of variation of 10 percent. You can ignore the finite population correction.
Hint: Recall that the coefficient of variation for the estimator (p) of the true population proportion (P) is
$CV(p) = \frac{[Var(p)]^{1/2}}{P}$
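The sample size implied by a target coefficient of variation can be sketched from the hint. The sketch assumes Var(p) = deff * P(1-P)/n, so that n = deff * (1-P) / (P * CV^2); exact rational arithmetic is used so that rounding up is not disturbed by floating-point error. The function name is illustrative.

```python
import math
from fractions import Fraction

def n_for_cv(P, cv, deff):
    """Completed interviews needed for a target CV under a design effect deff."""
    P, cv, deff = (Fraction(str(v)) for v in (P, cv, deff))
    return math.ceil(deff * (1 - P) / (P * cv**2))

# P is thought to lie between 0.2 and 0.4; the smaller P needs the larger sample
for P in (0.2, 0.3, 0.4):
    print(P, n_for_cv(P, cv=0.10, deff=1.5))
```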
Question 2.
Let U be a random variable with p.d.f.
$f(u) = 1, \quad 0 < u < 1.$
(a) Find the p.d.f. of $X = -\lambda \log(1-U)$.
(b) Find E(X).
(c) Let Z be a random variable (independent of U) with p.d.f.
$g(z) = (2\pi)^{-1/2} \exp(-\tfrac{1}{2} z^2), \quad -\infty < z < \infty.$
Let $Y = e^{-Z}$ and $W = -Y \log(1-U)$. Find E(W).
-20-
Question 3.
An executive is willing to hire a secretary who has applied for a position unless a significance test indicates that she averages more than one error per typed page. A random sample of five pages is selected from some typed material by this secretary and the errors per page are: 3, 4, 3, 1, 2.
(a) Assuming that the number of errors per page has a Poisson distribution with mean $\lambda$, say, what decision will be made? Use significance level $\alpha \le 0.05$.
(b) Calculate the power when the average number of errors per page, $\lambda$, is equal to 2.
(c) Suppose that the executive looks at a random sample of 225 pages of the secretary's work and finds 252 errors. What would be his decision, using $\alpha = 0.05$?

Individual terms, $e^{-m} m^i / i!$, of the Poisson distribution
                      m
  i        1         2         5        10
  0     .36788    .13534    .00674    .00005
  1     .36788    .27067    .03369    .00045
  2     .18394    .27067    .08422    .00227
  3     .06131    .18045    .14037    .00757
  4     .01533    .09022    .17547    .01892
  5     .00307    .03609    .17547    .03783
  6     .00051    .01203    .14623    .06306
  7       -       .00344    .10445    .09008
  8       -       .00086    .06528    .11260
  9       -         -       .03627    .12511
 10       -         -       .01813    .12511
 11       -         -       .00824    .11374
 12       -         -       .00343    .09478
 13       -         -       .00132    .07291
 14       -         -       .00047    .05208

Normal percentile points
T.90 = 1.282     T.95 = 1.645     T.975 = 1.960
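The decision rule, power and large-sample test in the typing-errors question can be sketched as follows. The sketch assumes the test is based on the total number of errors in the five pages, which is Poisson with mean 5 when the mean per page is 1, and that the rejection region is the largest upper tail with size not exceeding 0.05.

```python
import math

def pois_pmf(i, m):
    """Individual Poisson term exp(-m) m^i / i!."""
    return math.exp(-m) * m**i / math.factorial(i)

def upper_tail(c, m):
    """P(S >= c) for S ~ Poisson(m)."""
    return 1.0 - sum(pois_pmf(i, m) for i in range(c))

errors = [3, 4, 3, 1, 2]
n_pages, total = len(errors), sum(errors)
alpha = 0.05

# (a) smallest critical value c with P(S >= c | mean 5) <= alpha
c = 0
while upper_tail(c, n_pages * 1.0) > alpha:
    c += 1
reject = total >= c

# (b) power when the mean number of errors per page is 2 (total has mean 10)
power = upper_tail(c, n_pages * 2.0)

# (c) normal approximation: 252 errors on 225 pages, null mean and variance 225
z = (252 - 225) / math.sqrt(225)

print(f"reject if total >= {c}; observed total = {total}; reject: {reject}")
print(f"power at mean 2: {power:.4f}")
print(f"z = {z:.2f}; reject at 0.05 (one-sided): {z > 1.645}")
```

The same tail probabilities can be read off by summing the m = 5 and m = 10 columns of the table of individual Poisson terms.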
-21-
Question 4.
Suppose that a system has two components whose life times (X and Y, say) are independent and each has the same exponential distribution with mean $\theta$ (> 0). The system fails as soon as at least one of its components does so. Let Z be the life-time of the system.
(a) What is the probability density function of Z?
(b) For n ($\ge$ 1) systems of the same type, let $Z_1, \ldots, Z_n$ be the respective life times. Obtain the maximum likelihood estimator of $\theta$ (say, $\hat{\theta}_n$) based on $Z_1, \ldots, Z_n$.
(c) Obtain $E(\hat{\theta}_n)$ and $Var(\hat{\theta}_n)$.
Question 5.
Let $(X_1, Y_1), \ldots, (X_n, Y_n)$ be n independent bivariate observations from a continuous bivariate distribution F(x, y), $-\infty < x, y < \infty$. Let $H_0$ be the null hypothesis that X and Y are independent.
(a) Define
$\tau = 2\Pr\{(X_1 - X_2)(Y_1 - Y_2) > 0\} - 1$
and show that under $H_0$, $\tau = 0$.
(b) Obtain the symmetric and unbiased estimator ($t_n$) of $\tau$ based on the n observations and deduce the expressions for $E(t_n | H_0)$ and $V(t_n | H_0)$.
(c) What can you say about the large sample distribution of $n^{1/2} t_n$, when $H_0$ holds?
(d) What modifications to $t_n$ would you suggest to accommodate possible ties among the X's and/or the Y's?
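For small n the null moments of the estimator in part (b) can be checked by enumerating permutations. This sketch assumes the estimator is Kendall's sample tau, the average of sign((X_i - X_j)(Y_i - Y_j)) over all pairs i < j, and that under independence with no ties every ordering of the Y ranks is equally likely.

```python
from fractions import Fraction
from itertools import combinations, permutations

def sign(v):
    return (v > 0) - (v < 0)

def kendall_t(x, y):
    """Average of sign((x_i - x_j)(y_i - y_j)) over all pairs i < j."""
    n = len(x)
    s = sum(sign((x[i] - x[j]) * (y[i] - y[j]))
            for i, j in combinations(range(n), 2))
    return Fraction(2 * s, n * (n - 1))

def null_moments(n):
    """Exact mean and variance of t_n over all n! orderings of the Y ranks."""
    x = list(range(n))
    values = [kendall_t(x, perm) for perm in permutations(range(n))]
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, var

n = 5
mean, var = null_moments(n)
print(mean, var)   # compare with 0 and 2(2n + 5) / (9 n (n - 1))
```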
-22-
PART II
(April 4, 1982)
Q.1.
An investigator is planning a study to evaluate a new medication for the treatment of hypertension. She knows from past experience that for patients with hypertension the mean diastolic blood pressure is 105 mm., the standard deviation is 15 mm., and the correlation between two measurements is 0.7.
(a) Find the sample size needed if she uses each patient as his own control. Assume $\alpha$ = 0.05, $\beta$ = 0.1, a one-sided test, and that she wants to detect a change in blood pressure of 10 mm.
(b) Suppose she takes a group of patients and randomly divides them into two groups. She will give one group the new drug and the other group will get no treatment. If she will compare the change in the blood pressure in the treated group to that in the control group, what sample size does she need? Use the same assumptions as above.
(c) Discuss the relative merits of the two designs.
(d) Suppose the design in part (b) is chosen and that a total of 10 patients will be used. Use the attached table of random numbers to prepare a randomization schedule such that 5 patients will be assigned to treatment and 5 to control. Please give details about how the table is used so that the grader can reconstruct your schedule.
EDITORIAL NOTE: An attached table presented 3000 random digits in 60 rows of 50 digits each.
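The sample sizes in parts (a) and (b) can be sketched with the usual normal-approximation formula. The assumptions here are not stated in the examination: the variance of a within-patient change is taken as 2 * sigma^2 * (1 - rho), normal percentiles 1.645 and 1.282 are used for alpha = 0.05 (one-sided) and beta = 0.1, and no small-sample (t) correction is applied.

```python
import math

Z_ALPHA, Z_BETA = 1.645, 1.282        # one-sided alpha = 0.05, power = 0.90
sigma, rho, delta = 15.0, 0.7, 10.0   # SD, correlation, change to detect

# Variance of a within-patient change (difference of two correlated readings)
var_change = 2 * sigma**2 * (1 - rho)

# (a) each patient as own control: one sample of changes
n_paired = math.ceil((Z_ALPHA + Z_BETA) ** 2 * var_change / delta**2)

# (b) two groups, comparing mean change in treated vs. control
n_per_group = math.ceil(2 * (Z_ALPHA + Z_BETA) ** 2 * var_change / delta**2)

print(f"own-control design: {n_paired} patients")
print(f"two-group design:   {n_per_group} per group, {2 * n_per_group} in total")
```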
-23-
Q.2.
An experiment was conducted to determine whether selenium
supplementation is associated with reduced incidence of benign
ovarian tumors in pregnant cows. One treatment group and one control
group, of approximately equal sizes (N ~ 25 in each) were used. Each
cow in the treatment group received the same amount of selenium,
injected once~ a fixed number of weeks before the end of pregnancy.
To verify that the treatment raised the blood levels of two
important proteins throughout pregnancy, for each cow blood samples
were taken before injection and after the·end of pregnancy. The
concentrations of the proteins were determined at each of these
two times for each cow. The important questions of interest here
are whether blood levels were similar in the two treatment groups
before treatment, whether these blood levels changed between
treatment and the end of pregnancy within each treatment group,
and whether the change was greater in the treated than in the
control group if the latter also had a change. (If the treatment
is effective, blood levels should increase.)
For each protein, the investigators determined whether differences
existed by the use of two-way ANOVA (with factors treatment, and
time when blood drawn), followed by the use of Tukey's multiple
comparison method with P = .05.
(a)
Evaluate the method of analysis. Are the required assumptions
met? Does the analysis answer the questions of interest?
(b)
If you find the current analysis inappropriate, propose a better
one, showing how to answer the primary questions with level
P = .05. If you find the current analysis appropriate, discuss
the use of Tukey's multiple comparison test vs. some other
method to answer the primary questions.
Q.3.
A.
Briefly describe the purpose(s) of:
(1) JCL
(2) DATA step of SAS
(3) PROC step of SAS
B.
Define, and describe the relationships between:
OS file (dataset)
SAS database
SAS dataset
(Examples of corresponding JCL and SAS code may be useful.)
C.
List the major components of a large modern computer (CPU, etc.), and
briefly describe the functions of each component and describe the
relationship among them. A simple diagram may help you in organizing
your answer.
D.
List each type of JCL statement and briefly describe the function of each.
Write a valid job (or jobs) including at least one example of each type
of statement.
-24-
Q.4.
A dentist who was responsible for dental care of cerebral palsied
children in a state institution wanted to determine whether he should
recommend that electric toothbrushes be purchased for routine use by
the patients. He wanted to be as objective as possible in arriving
at a decision, and decided to consult with a statistician about
designing an experiment to determine whether short term improvement
in oral hygiene could be demonstrated. In answering the major
question, "Should the purchase of electric toothbrushes be recommended
for this institution?", there are other considerations over and above
any real improvement in oral hygiene of patients which should be
taken into account but these were ignored in designing the study.
Study Design and Conduct of Trial
It was decided that the study should be designed to determine whether
brushing with electric toothbrushes resulted in "cleaner teeth" than
brushing with regular toothbrushes during a two week period. First,
a search of the literature for measures of tooth cleanliness resulted
in a decision to use the debris index, which is an average of debris
scores for six teeth,* as the response variable, i.e., the variable
which is to be altered (hopefully) by "treatment". Next, factors
which could (potentially) influence results were listed:
1. Age
2. Race
3. Sex
4. Degree of ability to care for teeth (brushes own teeth or
   brushed by nurse)
5. Initial level of cleanliness (debris index)
6. Placebo effects:
   (a) Attitudes and actions of children and nurses
   (b) Attitude and actions of examining dentist
Because only 35 children were available for the initial examination
it was impractical to stratify (or control) on all of these variables.
However, randomization, a way of balancing the effect of variables
which cannot be controlled, was used.
The study was carried out as follows:
1.
Each child was examined by the dentist and a pre-trial debris
index was determined using the following 3x5 form for recording.
*Greene, J.C. and Vermillion, J.R. "Oral Hygiene Index - A Method
for Classifying Oral Hygiene Status", JADA, 61:172-179 (Aug. 1960).
-25-
Name.
-
Age.
Sex.
_ No. _ _
Race.
_
Comments:
Right
Max
Ant
Left
Total
(8)
_Hand (L)
Total
Debris Index
2.
The children were stratified by sex (ward) and degree of
disability, i.e., divided into four groups:
(a) Male - brushes own teeth
(b) Male - assisted by nurse
(c) Female - brushes own teeth
(d) Female - assisted by nurse
3.
Within each group children were randomly assigned to one of
two brushing groups:
(a)
Electric toothbrush
(b)
Regular tooth care
The assignments were not disclosed to the dentist, i.e., he was
"blind" as to type of care each child received.
4.
A list of children assigned to the two groups was posted in each
ward and the nurses supervised (and assisted where necessary) to
see that assignments were followed.
5.
At the end of the two week trial another examination was made by
the dentist, who followed the same procedure as in the pre-trial
examination to determine a debris index for each child. Results
were recorded on another 3x5 card without reference to results of
the original examination.
6.
The results were matched with those from the first examination.
Actual results are shown on the following page.
PROBLEM
(a)
Test the statistical significance of the decline in debris index
observed in each group, and with electric toothbrushing as
compared with regular.
(b)
Write a brief report, aimed at the dentist and the director of
the state institution, describing the results and their analysis.
Child #23 was originally assigned to regular care group but transferred by
nurses to electric toothbrush group.
Note:
"Disch." refers to children who were discharged before post-trial examination.
-27-
Q.5. Consider the following data (n = 5) for a dependent variable Y and a
carrier X (i.e., independent variable):

   i    X    Y
   1    3    5
   2    8   17
   3    2    5
   4    9   18
   5    7   13

\sum X = 29,  \sum Y = 58,  \sum XY = 414,  \sum X^2 = 207,  \sum Y^2 = 832

(i)
Fit the simple linear regression model
    E(Y) = \beta_0 + \beta_1 X
(ii)
Theory strongly suggests \beta_0 = 0. So fit the model
    E(Y) = \beta_1 X.
(iii)
Are there any differences between models (i) and (ii) as regards
numerical results? If so, please specify.
(iv)
Suppose two further (X,Y) points, (0,1) and (0,~), were collected.
What impact would these have on the estimate of \beta_1 for model (ii)?
(v)
Would the impact of these two data points be the same for model
(i)? Explain in one or two brief sentences.
(vi)
For model (ii), define
    h_i = X_i^2 / \sum_{k=1}^{n} X_k^2
(denominator is summing over the data points for X). Take h_i to
be the "leverage" that the observation Y_i has on the predicted value
\hat{Y}_i. This leverage is exerted through the spacing of the X values
(i.e., the design), not through the actual observed value of Y_i.
In general \hat{Y}_i is a linear combination of Y's where the coefficients
are h-like terms.
Write a few brief sentences interpreting the formula for h_i
and comment if the definition of h_i is compatible with your answer
to part (iv).
Note:
If h_i = 0, the observed value Y_i has no influence whatever
on the predicted value \hat{Y}_i. At the other extreme if h_i = 1, the
predicted value \hat{Y}_i will always be the observed value Y_i.
(vii)
For model (i), define
Again, interpret the formula for hi in one or two sentences
and comment whether the definition is compatible with your
answer to part (v).
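As a numerical companion to parts (i) through (vi), the sketch below fits both models from the summary sums. The five (X, Y) pairs are an assumption: they are the only pairing of the listed values that reproduces the stated sums (29, 58, 414, 207, 832), and the check at the top of the block confirms this. The helper name `slope_through_origin` is illustrative only.

```python
# Data pairs assumed from the summary sums given in the question.
x = [3, 8, 2, 9, 7]
y = [5, 17, 5, 18, 13]
n = len(x)

sum_x, sum_y = sum(x), sum(y)
sum_xx = sum(v * v for v in x)
sum_yy = sum(v * v for v in y)
sum_xy = sum(a * b for a, b in zip(x, y))
sums = (sum_x, sum_y, sum_xy, sum_xx, sum_yy)

# Model (i): E(Y) = b0 + b1 X, ordinary least squares.
b1 = (sum_xy - sum_x * sum_y / n) / (sum_xx - sum_x ** 2 / n)
b0 = sum_y / n - b1 * sum_x / n

# Model (ii): E(Y) = b1 X, least squares through the origin.
def slope_through_origin(xs, ys):
    return sum(a * b for a, b in zip(xs, ys)) / sum(a * a for a in xs)

b1_origin = slope_through_origin(x, y)

# Leverage for model (ii): h_i = X_i^2 / sum_k X_k^2.
leverage_origin = [v * v / sum_xx for v in x]

# A further point at X = 0 adds nothing to sum(XY) or sum(X^2) in model (ii).
b1_origin_extra = slope_through_origin(x + [0], y + [1])
```

With these data the fitted intercept in model (i) comes out at zero, so the two models share the same point estimate of the slope and differ only in quantities such as residual degrees of freedom.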
-28-
BASIC MASTER LEVEL WRITTEN EXAMINATION IN BIOSTATISTICS
PART II
Special Offer: May 29, 1982
INSTRUCTIONS:
a)
This is an open book examination.
b)
M.P.H. and M.S.P.H. students are to answer any two questions
during the two-hour period (1:30 pm - 3:30 pm). M.S. students
are to answer three questions of which not more than 2
should be from Group A - time period 1:30 - 4:30 pm.
c)
Put the answers to different questions on separate sets
of papers.
d)
Put your code letter, not your name, on each page.
e)
Return the examination with a signed statement of honor
pledge on a page separate from your answers.
f)
You are required to answer only what is asked in the questions,
and not all you know about the topics.
EDITORIAL NOTE:
This "special offer" was identical with the regular Part II
given on 12 April 1981 (see pages 14-18).
-29-
BASIC M.S. WRITTEN EXAMINATION IN BIOSTATISTICS
PART I
(January 15, 1983: 9:30 AM to 12:30 PM)
Q.1
Suppose that a simple random sample of m households is chosen
for a health survey in a city containing M households. Let y_a
denote the number of persons in the a-th household who have been ill
during the month prior to the survey, and let n_a denote the total
number of persons in the a-th household. The estimator
    r = y/n = \sum_{a=1}^{m} y_a / \sum_{a=1}^{m} n_a
is used to estimate the city's illness incidence rate,
    R = Y/N = \sum_{a=1}^{M} y_a / \sum_{a=1}^{M} n_a.
a.
What is the selection probability for each person in the a-th
household? Explain your answer.
b.
Is this sampling design epsem or self-weighting with respect to
individuals? Explain why or why not.
c.
Show that the bias of r is
    Bias(r) = -\rho_{rn} {Var(r)}^{1/2} CV(n)
where \rho_{rn} is the correlation between r and n, Var(r) is
the variance of r, and CV(n) is the coefficient of variation
of n.
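The bias expression in part c follows from the identity E(y) = E(r n) = E(r) E(n) + Cov(r, n). A small exact check is sketched below by enumerating every simple random sample from a toy population. The five households and the sample size m = 3 are hypothetical, and n here denotes the sample total of household sizes (using the sample mean instead leaves the coefficient of variation unchanged).

```python
from fractions import Fraction
from itertools import combinations
from math import sqrt

# Hypothetical city of M = 5 households: (ill persons y_a, household size n_a).
households = [(1, 2), (0, 3), (2, 4), (1, 1), (3, 5)]
m = 3  # households per sample


def exact_moments(pop, m):
    """Exact moments of r = y/n and n over all equally likely samples."""
    samples = list(combinations(pop, m))
    w = Fraction(1, len(samples))
    r_vals = [Fraction(sum(y for y, _ in s), sum(n for _, n in s)) for s in samples]
    n_vals = [Fraction(sum(n for _, n in s)) for s in samples]
    e_r = sum(r_vals) * w
    e_n = sum(n_vals) * w
    cov = sum((r - e_r) * (n - e_n) for r, n in zip(r_vals, n_vals)) * w
    var_r = sum((r - e_r) ** 2 for r in r_vals) * w
    var_n = sum((n - e_n) ** 2 for n in n_vals) * w
    return e_r, e_n, cov, var_r, var_n


e_r, e_n, cov_rn, var_r, var_n = exact_moments(households, m)
R = Fraction(sum(y for y, _ in households), sum(n for _, n in households))

bias = e_r - R                                   # exact bias of r
rho_rn = float(cov_rn) / sqrt(float(var_r) * float(var_n))
cv_n = sqrt(float(var_n)) / float(e_n)
bias_formula = -rho_rn * sqrt(float(var_r)) * cv_n
```

The exact-fraction comparison shows the bias equals -Cov(r, n)/E(n), which is the stated formula written in terms of the correlation and the coefficient of variation.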
-30-
Q.2
Researchers believe that the prevalence (Y) of byssinosis in
workers in textile manufacturing plants is linearly related to the
mean daily cotton dust level (X). Under the assumption that zero
cotton dust level implies zero prevalence, the regression equation
relating the mean of Y to X is
    E(Y | X = x) = \beta x.
a.
Given the n pairs of data points (x_i, y_i), i = 1, 2, ..., n,
chosen randomly from the conditional distribution
    f(y | X = x) = (1 / (\sqrt{2\pi} \sigma)) \exp{ -(y - \beta x)^2 / (2\sigma^2) },
show that the maximum likelihood estimator of \beta is
    \hat{\beta} = \sum_{i=1}^{n} x_i y_i / \sum_{i=1}^{n} x_i^2.
b.
Show that the maximum likelihood estimator \hat{\beta} of \beta found in
part (a) is also the least squares estimator of \beta; in other
words, show that \hat{\beta} minimizes the function
    \sum_{i=1}^{n} (y_i - \beta x_i)^2
with respect to \beta.
c.
Show that E(\hat{\beta}) = \beta given the conditional density in part (a).
d.
Show that V(\hat{\beta}) = \sigma^2 / \sum_{i=1}^{n} x_i^2 given the conditional density in
part (a).
e.
Show that \hat{\beta} is normally distributed given the conditional
density in part (a).
-31-
Q.3
A scientist hypothesizes the following esoteric theory for cellular
damage in humans due to radiation exposure. A particular cell in an
individual is "hit" by radiation X times during the course of the
individual's lifetime, with the number X of such hits having the
Poisson distribution
    p_X(x) = \lambda^x e^{-\lambda} / x!,   x = 0, 1, ..., \infty.
For any such hit, there is a fixed probability p that some basic
structural change will occur in the cell; also, the occurrences (or
not) of structural changes for different hits are assumed to be mutually
independent. The random variable of interest is Y, the total number of
structural changes that the particular cell undergoes during the
course of the individual's lifetime.
a.
What is pr(Y = y | X = x_0)? In other words, what is the conditional
distribution of Y given that a particular cell experiences
exactly x_0 hits?
b.
Using the result in part (a), show that the unconditional
distribution of Y, namely pr(Y = y), is Poisson with mean \lambda p.
c.
The scientist further hypothesizes that a particular cell will
become cancerous if it undergoes at least k structural changes.
Given the result in part (b), what is the probability that a
particular cell will become cancerous? (JUST SET UP YOUR ANSWER.)
[NOTE:
You do not need to be able to work parts (a) and (b) in
order to answer this question.]
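The claim in part (b) can be checked numerically before it is proved. The sketch below mixes the binomial conditional distribution of part (a) over the Poisson number of hits and compares the result with a Poisson probability with mean lambda times p. The values `lam = 3.0` and `p = 0.4`, and the truncation of the infinite sum at `x_max = 100`, are arbitrary assumptions for illustration.

```python
from math import comb, exp, factorial


def poisson_pmf(k, mean):
    """pr(K = k) for a Poisson variate with the given mean."""
    return mean ** k * exp(-mean) / factorial(k)


def binomial_pmf(y, x0, p):
    """pr(Y = y | X = x0): each of x0 hits causes a change with probability p."""
    return comb(x0, y) * p ** y * (1 - p) ** (x0 - y)


def marginal_y(y, lam, p, x_max=100):
    """pr(Y = y) by summing the conditional pmf over the Poisson hits X."""
    return sum(binomial_pmf(y, x, p) * poisson_pmf(x, lam)
               for x in range(y, x_max + 1))


def prob_cancerous(k, lam, p):
    """pr(Y >= k) when Y is Poisson with mean lam * p (part c set-up)."""
    return 1.0 - sum(poisson_pmf(y, lam * p) for y in range(k))


lam, p = 3.0, 0.4  # hypothetical parameter values
```

For example, `marginal_y(2, lam, p)` and `poisson_pmf(2, lam * p)` agree to within floating-point error, which is the thinning property the question asks to be shown algebraically.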
Q.4
Let X be distributed as a Poisson variate with mean p\lambda, and,
independently, let Y be distributed as a Poisson variate with mean \lambda.
The parameters p and \lambda are both positive.
a.
Show that the distribution of X given X + Y = m is the
binomial one:
b.
Find the maximum likelihood estimator (MLE) of p from (a).
c.
Find the asymptotic variance of the MLE, conditional on m.
d.
Find the unconditional information in part (c).
-32-
BASIC M.S. WRITTEN EXAMINATION IN BIOSTATISTICS
PART II
(January 16, 1983: 2:00 to 5:00 PM)
Q.1
It has recently been discovered that certain blood chemistry
measurements correlate highly with the clinical diagnosis of "depression".
One of these, the DST test, is estimated to have 70% sensitivity and 96%
specificity for depression; another, the TSH test, has low sensitivity,
only 25%, but it seems to have 100% specificity, since no false positive
has yet been documented.
1.
Assuming that the two tests operate independently (which
agrees with the limited evidence available), what sensitivity
and specificity could you achieve by applying both to the
same subject?
The above results were obtained in subjects who were not physically
ill.
There is interest in applying the tests to cervical cancer
patients, in whom the clinical diagnosis of depression is difficult
because depressive symptoms are easily confused with those of the
cancer itself.
You are called in to help design a study to find out
whether the two tests have different sensitivity and specificity in
such patients.
You are told that between 2 and 3 new patients per week
will be available, that about 25% to 35% of these patients will actually
have depression by clinical criteria, and that the study can continue
for about 6 to 8 months.
2.
Discuss generally the points that you would stress as a
statistician in talking with the principal investigator.
And more specifically, perform some calculations to indicate
what success the study is likely to have in meeting its
objectives.
Note:
sensitivity = pr(+ test result | diseased)
specificity = pr(- test result | not diseased)
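For part 1, the combined operating characteristics depend on the rule used to merge the two results. The sketch below works out the two obvious rules, assuming the tests are independent within the diseased group and within the non-diseased group. The function names are illustrative, not from the examination.

```python
def combine_either(sens_a, spec_a, sens_b, spec_b):
    """Declare the subject positive if EITHER test is positive."""
    sens = 1 - (1 - sens_a) * (1 - sens_b)  # missed only if both tests miss
    spec = spec_a * spec_b                  # negative only if both are negative
    return sens, spec


def combine_both(sens_a, spec_a, sens_b, spec_b):
    """Declare the subject positive only if BOTH tests are positive."""
    sens = sens_a * sens_b
    spec = 1 - (1 - spec_a) * (1 - spec_b)  # false positive needs both to err
    return sens, spec


DST = (0.70, 0.96)  # (sensitivity, specificity) as stated in the question
TSH = (0.25, 1.00)

either_rule = combine_either(*DST, *TSH)
both_rule = combine_both(*DST, *TSH)
```

Under the "either" rule the sensitivity rises to 0.775 with specificity 0.96; under the "both" rule the specificity is 1.00 but the sensitivity falls to 0.175.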
-33-
Q.2
Twenty-six subjects with essential hypertension were classified
as "low" or "normal" with respect to plasma renin activity (PRA) in
1974-75, using two different methods. The same subjects were re-examined
and reclassified in 1982.
The data for this study are given in the accompanying table. A
subject had "low" PRA by Method A (PRA-sodium index) in 1974-75 (in
1982) if A1 < .555 (A2 < .555); by Method B (PRA after furosemide)
according as B1 or B2 < 1.75. (A and B do not measure PRA in the
same units.)
1.
Do the two methods of classification agree with each other?
2.
Does the classification by any one method remain consistent
over a long period of time (i.e., from 1974-75 to 1982)?
3.
Do these data indicate any general decrease in PRA with
increasing age (as has been found in other studies)?
NOTE:
For purposes of this exam, satisfactory answers can
be given which require very little computation.
LISTING OF SELECTED VARIABLES

OBS  ID  RACE  SEX  DOB           A1      A2      B1     B2    A1-A2    B1-B2
  1   1    2    2   28 JUL 28   0.478   0.138    0.60   0.49    0.340     0.11
  2   2    2    1   07 MAR 33   0.581   0.721    1.00   2.24   -0.140    -1.24
  3   3    2    2   23 MAY 44   3.298   2.697    6.50   1.70    0.601     4.80
  4   4    2    2   22 FEB 43   0.230   1.059    0.73   1.36   -0.829    -0.63
  5   5    2    2   05 JUN 41   0.370   0.769    3.70   2.37   -0.399     1.33
  6   6    2    2   20 JUL 36   1.207   3.619    2.50   4.38   -2.412    -1.88
  7   7    2    2   12 JUL 37   1.145   0.152    1.40   1.60    0.993    -0.20
  8   8    2    2   24 OCT 24   1.233   0.339    0.42   0.43    0.894    -0.01
  9   9    2    2   24 DEC 27   0.058   0.065    0.40   0.05   -0.007     0.35
 10  10    2    2   19 SEP 42   0.692   0.648    4.12   0.90    0.044     3.22
 11  11    2    2   15 MAR 51   1.295   1.592    1.20   4.09   -0.297    -2.89
 12  12    2    1   28 AUG 39   1.361   0.208    0.50   0.95    1.153    -0.45
 13  13    2    2   02 SEP 22   0.744   0.144    0.53   1.16    0.600    -0.63
 14  14    2    2   23 MAR 38   0.986   1.053    4.40   1.04   -0.067     3.36
 15  15    2    2   09 OCT 47   4.939   0.922   14.60   1.87    4.017    12.73
 16  16    2    2   13 JUL 22   0.903   0.699    3.40   3.52    0.204    -0.12
 17  17    2    2   17 APR 34   1.926   2.723    2.00   2.36   -0.797    -0.36
 18  18    2    1   04 MAY 47   5.365   0.397    8.10   0.60    4.968     7.50
 19  21    2    2   31 MAY 32   0.191   0.903    0.70    --    -0.712      --
 20  22    2    2   26 DEC 33   2.284   1.231    3.80   3.43    1.053     0.37
 21  23    2    1   27 JUL 27   0.043   0.344    1.00   4.20   -0.301    -3.20
 22  25    2    1   25 OCT 31   1.323   0.253    3.14   0.53    1.070     2.61
 23  26    2    2   25 DEC 27   0.402   2.419    1.00   4.05   -2.017    -3.05
 24  27    1    1   15 AUG 27   0.927   2.341    4.00   5.50   -1.414    -1.50
 25  28    2    1   09 APR 34   1.229   2.119    6.00   2.40   -0.890     3.60
 26  29    1    1   27 SEP 38   2.188   2.051    7.60    --     0.137      --

Notes:
1) RACE 1 = white, 2 = black
2) SEX 1 = male, 2 = female
3) DOB = date of birth
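Parts 1 and 2 of this question need little more than two 2x2 tables of "low" versus "normal" classifications. The sketch below builds them from the A1, A2 and B1 columns of the listing, transcribed in observation order, using the cutoffs stated in the question. The helper name `cross_classify` is illustrative.

```python
A_CUTOFF = 0.555  # "low" PRA by Method A if the index is below this value
B_CUTOFF = 1.75   # "low" PRA by Method B if the value is below this value

a1 = [0.478, 0.581, 3.298, 0.230, 0.370, 1.207, 1.145, 1.233, 0.058,
      0.692, 1.295, 1.361, 0.744, 0.986, 4.939, 0.903, 1.926, 5.365,
      0.191, 2.284, 0.043, 1.323, 0.402, 0.927, 1.229, 2.188]
a2 = [0.138, 0.721, 2.697, 1.059, 0.769, 3.619, 0.152, 0.339, 0.065,
      0.648, 1.592, 0.208, 0.144, 1.053, 0.922, 0.699, 2.723, 0.397,
      0.903, 1.231, 0.344, 0.253, 2.419, 2.341, 2.119, 2.051]
b1 = [0.60, 1.00, 6.50, 0.73, 3.70, 2.50, 1.40, 0.42, 0.40, 4.12,
      1.20, 0.50, 0.53, 4.40, 14.60, 3.40, 2.00, 8.10, 0.70, 3.80,
      1.00, 3.14, 1.00, 4.00, 6.00, 7.60]


def cross_classify(first_low, second_low):
    """Return counts (both low, first only, second only, neither)."""
    both = sum(f and s for f, s in zip(first_low, second_low))
    first_only = sum(f and not s for f, s in zip(first_low, second_low))
    second_only = sum(s and not f for f, s in zip(first_low, second_low))
    neither = len(first_low) - both - first_only - second_only
    return both, first_only, second_only, neither


low_a1 = [v < A_CUTOFF for v in a1]
low_a2 = [v < A_CUTOFF for v in a2]
low_b1 = [v < B_CUTOFF for v in b1]

method_a_vs_b_1974 = cross_classify(low_a1, low_b1)   # agreement of methods
method_a_1974_vs_1982 = cross_classify(low_a1, low_a2)  # consistency over time
```

The off-diagonal counts (1 versus 6 for the two methods in 1974-75) are what a sign-type or McNemar-type comparison would use; B2 is omitted here because two of its values are missing.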
Q.3
In a paper by Chalmers et al. (NEJM, 1977) entitled "Evidence
Favoring the Use of Anticoagulants in the Hospital Phase of Acute
Myocardial Infarction", it was suggested that the negative findings
in five of six randomized control trials may have been a function of
small sample sizes. When the data from all six studies were "pooled",
the apparent reduction in mortality with anticoagulation was 4.2%,
which was found to be statistically significant. As provided in this
paper, the data for these six trials can be shown as follows:
TRIAL         ANTICOAGULATED             CONTROLS
NUMBER     TOTAL   DIED     %        TOTAL   DIED     %
   1          45     13    28.9         47     18    38.3
   2          77     12    15.6         70     15    21.4
   3         712    115    16.2        715    129    18.0
   4         745    111    14.9        391     83    21.2
   5          27      2     7.4         26      2     7.7
   6         500     48     9.6        499     56    11.2
TOTAL       2106    301               1748    303

Chalmers et al. pointed out that the arithmetic average of the mortality
percents over the six trials was 15.4 for anticoagulated patients and
19.6 for controls.
a.
Using the above table, compute the crude mortality percent
for each group separately and compare your results with the
values obtained by Chalmers.
b.
How do your answers to part (a) differ from those obtained
by Chalmers et al. in terms of the weights used for averaging
(i.e., what are the weights for each averaging process)?
c.
Assuming that your objective is to provide a summary estimate
of the difference in mortality percents, describe the weighting
scheme you would recommend. Explain the reason for your choice.
(No calculations are necessary.)
d.
Assuming that pooling the results of the six trials is
appropriate from a study design standpoint, describe (without
calculations) how you would carry out a test of hypothesis
to determine whether there is a significant "overall" effect
of anticoagulant therapy which pools the results of all six
trials.
e.
Describe (without calculations) how you would compute a 95%
interval estimate of the overall difference in mortality
percents.
f.
How would you criticize, if at all, the use of any kind of
pooling procedure over the different trials?
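Parts (a) and (b) contrast two averages of the same six trials: the crude percent weights each trial by its size, while the arithmetic average gives each trial weight 1/6. The sketch below computes both from the (total, died) counts in the table. Small differences from the published 15.4 and 19.6 can arise because those figures average the already rounded percents.

```python
# (total, died) per trial, in trial order 1..6, from the table in the question.
anticoagulated = [(45, 13), (77, 12), (712, 115), (745, 111), (27, 2), (500, 48)]
controls = [(47, 18), (70, 15), (715, 129), (391, 83), (26, 2), (499, 56)]


def crude_percent(trials):
    """Pooled deaths over pooled patients: weights proportional to trial size."""
    total = sum(n for n, _ in trials)
    died = sum(d for _, d in trials)
    return 100.0 * died / total


def mean_of_percents(trials):
    """Unweighted average of the per-trial percents: weight 1/6 per trial."""
    return sum(100.0 * d / n for n, d in trials) / len(trials)


crude_a, crude_c = crude_percent(anticoagulated), crude_percent(controls)
mean_a, mean_c = mean_of_percents(anticoagulated), mean_of_percents(controls)
```

The crude difference (about 3.0 percentage points) is smaller than the difference of unweighted averages (about 4.2) because the small trials, with their large apparent effects, count for much less under size weighting.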
-36-
Q.4
A.
Describe the processing of the DATA step listed below. Your answer should
indicate a description of both the compilation and the execution phase,
and a detailed description of the SAS data set created. Your answer should
include the following terms/concepts:
length
input buffer
compilation
type
execution
data matrix
program data vector
missing values
variable name
EBCDIC/floating point representation
history information

DATA WORK.ONE;
  LENGTH NAME $ 10 QUIZ1-QUIZ3 4;
  INPUT NAME $ 1-10
        QUIZ1 12-14
        QUIZ2 16-18
        QUIZ3 20-22
        ;
  AVERAGE=MEAN(QUIZ1, QUIZ2, QUIZ3);
  OUTPUT;
  RETURN;
CARDS;
B.
Two SAS datasets, A.B and A.C, are listed on the following pages. Each
part of this problem specifies a DATA step which inputs one or both of these
two datasets and creates an output SAS dataset. The purpose of these problems
is to evaluate your understanding of how statements such as SET, MERGE, UPDATE
work, with and without BY statements. We purposefully made the datasets large
to give you an opportunity to demonstrate YOUR knowledge -- we have
confidence that SAS knows how these DATA steps work!
Specify all answers to the number of digits shown in the printout(s).
Note that both A.B and A.C are sorted BY ID DATE.
-38-
DATASET A.B
DATE BORN
10 -
205~~78
2~APR75
27FEB78
19~A~75
8,9'
8-'J8
DATE
OBS
l'
2
___ ~_.J!8~!'~J~_,
-
~-
-
6'
7
.
8
9
-_. i u--11
12
1j
1q
e,
1~
..:!!!-
'~1'~0-_FDEE'~775
.._
'"
'00
03"AT78
18MAR1..
'02
O,Afo'''7805MAR71J
'03
904
05APH7e
O~MAR7~
906
15SE~7b
22NOV73
05A~M78
250CT73
".?Ql_
-·'29f.'AH78 --_. -"-'Cfq"tit'C73'08
909
OeOEC76
OlDEC73
911
2dOCT76
1~SEP73._
- 91215SEP76
OqSEP73
913
16SEP76
1~~U~73
91q
16SE~76
1q~U~73
21FE~78
It,
- .1&--- --ii+-~rE~77'-'--i8'"JUL1317
16
lSSLP76
160CT75
17~UN73
18~AN73
2U
08ULC76
050EC72
27NOv72
1~
16~~N76
·'16
917
919
920
921
DATASET A.C
lL}
DATE BORN
19MAR75
~
~
"~8
~99
~05
~O~
~U&
7
~Oo
22NOV73
22NOV73
22NOV73
22NOV73
250CT73
1
~
5
b
~
~
23FEB75
C5MAH7q
04~AR74
22NOV73
~U6--"'22Ndv73-'-
~Ob
~06
lU
~06
11
,'107
Ij
~08
1~
'108
..
".
DATE
. - _. ...-.
WGT
~.
O;Ap~1~
160CT1~
20SEP1~
06SE~1"
HT
FVC
93
95
0.50
0.63
RACE
SEX
PEF
1.29 ~
2.85..,
10Q
0.81
M
3.17 .
16
100
0.7&
B
~
2.30
U5JAN17
1599
0.52
8
F
2.40
'0'" Ap't( ""-"16-- "'-1'02-"'0.66'· -"'8---" F '-.--1-.·76·-07JUL17
16
10~
0.65
B
.F
2.05
llJAN78
18
107
0.7~
8
F
1.81
OQ.UG18
19
110 '-0.97
B
F
3~~Q'
02NOV1~
20
110
0.9~
.8
F
2.9Q
11AUG7.,
17
10"0.9&
8
F
2.73
lq
15
17
8
M
B
8
F
- -"1 ;02--'-8-------"F-'~--2-;4'4--
i~'-~"ri7-'-~'5dCi7'3-'--ri2NOV1-"---lf-~-04
n40EC73
04DEC73
1'5
~09 '~"'-o'10EC73
1&
~09
11
----i~·
1~
20
010EC73
010EC73
'1 Otj--O1 oEc'7'f3
~09
010EC73,
. '111
10SEP73
~O~
2i"'~"'-'1'il
lO·SEP.7J
2~
10SEP73
'111
17
18
104
07JU~77
120CT71
13
1..-
1 7 ApR78
05JAN71
07JULf7
2b
~1~'
~ll
~11
10SEP73
10SEP73
10SEP7J
F
F
0.73
0.86
B
99
100
0.5&
0.60
B
M
8
M
15
15
102
B
M
16
101"'
0.58
0.70
8
F
1&
103
0.77
B
1.Q,5 ._~ ~.~ 1_..._._.13
107
0.80
B
108
0.98
B
10~. 1.0~
B
18MAH77..· ..'· .. 13' -·· .. _·95
·--"O.·4Ef~
B
B-"-'~""'-M
2.65
2.Q8
·--i.~·.83~-
2.11
1.79
iT~rAN1~n--1O(f-o.S8--B--·
- jIli---17S'9--'
061\PR17
.... ~~_ _~_~ 1_._'.!.Q.~_E:.P.7?_ _ .!.?().~!!,~..._.3: ~.
2~
2~
lQ~
2aJUN78
060CT78
I1JAN18
17AP~18
24JUL78
17
19
19
, (The dataset A.C is continued on the next page.)
99
~--·O.73
B·" .__.. _. F'
F
F
F
I
2.67
1.34
F
.~
2.56
2.42
..
?_.-,-2.1_,.__ .
2.45
2.88
2.92
I
,.
....
.
e·
-39-
DATASET A.C continued
DOS
ID
DATE
ii
~12
-~~SEP~3
080(C76
2~
"112
0~SEP73
07.JUL/7
31
32
"112
"113
OQSEP73
IIf..JUL73
14.:JUL73··
06NOV7~
11~AY77
~~
~l~
1~~UL73
O~MOV76
37
3H
"116
"116
. ., 16
18.JUL73
18.JUL73
18~UL 73
18.JUL73
120CT17
13.JAN7A
2f.1~A1J 1M
.;-~--~}~
·-~·~~13·
3~
~lq
-~·6---914
·e
DJ\TEBORN
WGT
HT
13
13
~:~·~-~i-·-~-t~·~·i~t;
06FtB1'J
17
16
16
16
16
RACE
92
96
0.38
0.1f.6
B
B
103
100
101
97
100
0.66
0.65
0;8"0
0.75
0.60
B
8
SEX
pEF
Ml.11
M
1.~~
1~~----3':~~--'~'---~--};~~-
1~.JUL73
O~~AN77
14JUC1'3---11M'A"7'~-···1'6-.
19
19
19
FVC
----nr--0-;"1r
8'
B
B
B
·M
1.91f.
1.85
M2.00· ..·
M
1.33
M
1.50
M
M --'T;ls-
107
0.88
B
M
2.27
107
0.81
8
M
1.94
..,,<J
108
0.64
B
lVJ
1.83
4U
~16
08MAY1~
20
109
0.93
B
M
3.02
41
~17
17~UN73
lU~AH17
15
96· 0.79
a
F
2.49
4 2-··-"117--f7JUN 7 3'" '·'13 JuL ''''~ ··1·6----1-0 0-'-'-a'~'83"---'.-a
-F' "---2-; 1f.-8- ._..
4~
"111
17.JUN73
120CT17
17
102
0.8~B
F
2.53
44
"117
l'JUM73
I1JAN1R
17
103
0.85
B
f
2.17
4~
'111"
l'JUN73
1 7 ApR78
18·'·-104 .. ·0.92·· .. ·8··· F
3.22-·'
4&
"117
l'JUN73
24MAY7~
18
104
n.91
B
~
3.80
47
"111
17JUN73
13SEP7P
18
lOS
1.11
[3
F
3.19
- . 4t:r·--····'":fl-,---17JUN73.' 01NOV7~'" .- '18'-'''·--"105 --··1.1·e· ···_··-B----·-F- ·····3-;3A·...~"1
~19
18JAN73
lA~OV1~
15
98
0.56
B
F
1.50
~u
'119
IBJAN73
~OMAH1A
15
99
0.67
8
F
1.37
51
'119
18JAN73
OaJUL1~
17
104
0.70
n
F
1.88 .
52
'119
18JAN73
1 4 SEP16
17
104
0.78
B
~
2.06
5~
""19
la~AN73
31JAN11
19
lOB
0.79
B
F
2.02
... 54' ..- ""19-' ·--18JAN7"3· ~_.- 2BApR77--- 1CJ------rl·o"-·"0-;a 7-"---"8 - - - . "F"'-' '-'2~' 07'·
5~
~19
18~AN73
13JUL77
.19
112
O.~8
8
f
2.15
56
""19
18JAN73
01 MAH7A
21
114
1.03
8
F
1.93
57
~19
18~~M73
2 7 SCP/8
22
119
1.17
B
F
2.69
58
,- ~19"'18JMJ73"" OHIOV7""
22'- -119 - 1.15
8
F
2.60 ,.. ,...
59
~lu
050EC72
~DMAP7~
14
94
o.~e
B
M
1.46
60
~20
050EC72
06MAY1~
14
9~
O.~5
e
M
1.85
61
""2U
C50EC7~
07.JUL1~
14
96
0.46
8
M
1.79
62
~2U
050EC72
OlDEClb
l~
990.4'
B
M
1.46
6~
"12U
050EC72
31JAN17
15
99
0.52
B
M
1.60
64······_-OjZa-·--050EC72-···-28ApRl,-·--15 -, '·-"01"
0.7&,·'S··-· -- -M . · .. ·2'.24··
65
~2U
050EC72
120CT17
16
103
o.~o
B
M
1.91
66
"12U
05DEC72
13JAN1~
16
105
0.80
B
M
2.10
67
'':121
'27NOV72'
2~MAH17" 18
loa
0.73
8
-'M
1.72
6~
'121
27NOV72
1~~UL17
19
110.
O.BO
B
M
2.31
69
~21
·27NOV72
14SEP17
20
112
0.99
B
M
1.56
-7r;r-.~2'r--:-27NOV72·-·"12·OCT7"'.-20-··-Tr2---1"·;O·O----8·---·----J'of"---1 ~44-71
"121
27NOV72
26.JAN7A
23
11~
0.88
B
M
1.92
72
""21
27NOV7?
lQAPH18
24
114
1.04
8
M
2.48
'73"
. 'j21···-27NOV72"·" 24.JUL18
2~
. 11'" '1.17
B
M
2.63
-40-
DEPARTMENT OF BIOSTATISTICS
Special MS Examination, 1983
April 9, 1983: 1 - 4 PM
Part I
Answer any three of the following questions. This is a closed book
examination.
Q.1.
A city transportation department is conducting a survey to
determine the gasoline usage of its residents. Stratified
random sampling is used and the four city wards are treated as
the strata. The amount of gasoline purchased in the last week
is recorded for each household sampled. The strata sizes and
the summary information obtained from the sample are:
STRATA
3272
2475
50
45
30
30
12.6
14.5
18.6
13.8 '
2.8
2.9
4.,8
3.2
II
3750
Sample size
Sample mean
(in gallons)
Stratum size
Sample variance
a)
IV
III
1387
I
Estimate the mean weekly gasoline usage per household for
the city population and construct a 95% error bound for
your estimate.
b)
Estimate the width of a 95% error bound for the estimator
in simple random sampling (ignoring stratification when n = 155).
c)
Do you agree with the surveyor that stratification has led
to some reduction in the sampling error of the estimation
(over simple random sampling)?
Q.2.
For the exponential distribution having probability density function
    f(x) = (1/\mu) e^{-x/\mu},   x > 0,
show that the maximum likelihood estimator of \mu is the mean \bar{X}
of a random sample of size n from the population f(x). Is this
an unbiased estimator of \mu? When n is reasonably large, the
distribution of \bar{X} can be satisfactorily approximated by a normal
distribution. Using this approximation, derive a 100(1-\alpha)% confidence
interval for \mu. Apply the result to obtain a 95% confidence
interval when a sample of 49 yields \bar{X} = 11.52. Also, estimate the
probability that X is greater than 15.
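One possible route to the interval uses the fact that the sample mean has variance equal to the squared population mean divided by n, so that sqrt(n) (mean - mu) / mu is approximately standard normal. Solving the resulting inequality for mu gives the limits coded below. This is a sketch of that one approach, with z = 1.96 taken as the usual table value; plugging the sample mean into the standard error instead would give a different, symmetric interval.

```python
from math import exp, sqrt


def exponential_mean_ci(xbar, n, z):
    """Approximate CI for mu from the pivot sqrt(n) * (xbar - mu) / mu ~ N(0, 1)."""
    half = z / sqrt(n)
    if half >= 1:
        raise ValueError("sample too small for this approximation")
    return xbar / (1 + half), xbar / (1 - half)


def prob_exceeds(x, mu_hat):
    """Estimated pr(X > x) for an exponential variate with mean mu_hat."""
    return exp(-x / mu_hat)


ci_low, ci_high = exponential_mean_ci(xbar=11.52, n=49, z=1.96)
p_gt_15 = prob_exceeds(15, 11.52)
```

With n = 49 the limits are the sample mean divided by 1.28 and by 0.72, and the tail probability is the exponential survival function evaluated at the maximum likelihood estimate.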
-41-
Q.3.
In an experiment designed to determine the relationship between the
dose of a compost fertilizer x and the yield of a crop y, the
following summary statistics are recorded:
    n = 15,   \bar{x} = 10.8,   \bar{y} = 122.7,
    S_x^2 = 70.6,   S_y^2 = 98.5,   S_{xy} = 68.3
Assume a linear relationship.
a) Find the equation of the least squares regression line.
b) Compute the error sum of squares and estimate \sigma^2.
c) Do the data contradict the experimenter's conjecture that over
the range of x values covered in the study, the average increase
in yield per unit increase in the compost dose is at least 1.5?
d) Construct a 95% confidence interval for the expected yield corresponding to x = 12.
Q.4.
Consider the 2x2 contingency table with fixed row totals

              B_1        B_2
   A_1      n_{11}     n_{12}     n_{10}
   A_2      n_{21}     n_{22}     n_{20}
            n_{01}     n_{02}     n

Let p_1 and p_2 denote P(B_1|A_1) and P(B_1|A_2), respectively. To test
H_0: p_1 = p_2, we can employ the normal test with the test statistic
    Z = (\hat{p}_1 - \hat{p}_2) / \sqrt{ \hat{p}(1 - \hat{p})(1/n_{10} + 1/n_{20}) }
where \hat{p}_1 = n_{11}/n_{10}, \hat{p}_2 = n_{21}/n_{20}, and \hat{p} = n_{01}/n.
a) Prove that Z^2 is exactly the same as the \chi^2 statistic.
b) Prove that the formula for the \chi^2 statistic for a 2x2
contingency table can also be written
    \chi^2 = n(n_{11}n_{22} - n_{12}n_{21})^2 / (n_{10}n_{20}n_{01}n_{02})
[Note that \chi^2 is the Pearsonian goodness of fit test-statistic.]
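Before attempting the algebra, the two identities can be checked numerically. The sketch below computes the squared pooled two-sample Z statistic, Pearson's chi-square from observed and expected counts, and the shortcut cross-product formula for one 2x2 table. The cell counts used are hypothetical.

```python
from math import sqrt


def z_statistic(n11, n12, n21, n22):
    """Pooled normal test statistic for H0: p1 = p2 with fixed row totals."""
    n10, n20 = n11 + n12, n21 + n22
    n = n10 + n20
    p1, p2, p = n11 / n10, n21 / n20, (n11 + n21) / n
    return (p1 - p2) / sqrt(p * (1 - p) * (1 / n10 + 1 / n20))


def pearson_chi2(n11, n12, n21, n22):
    """Sum over cells of (observed - expected)^2 / expected."""
    rows = [n11 + n12, n21 + n22]
    cols = [n11 + n21, n12 + n22]
    n = sum(rows)
    observed = [[n11, n12], [n21, n22]]
    return sum((observed[i][j] - rows[i] * cols[j] / n) ** 2
               / (rows[i] * cols[j] / n)
               for i in range(2) for j in range(2))


def shortcut_chi2(n11, n12, n21, n22):
    """n (n11 n22 - n12 n21)^2 / (n10 n20 n01 n02)."""
    n10, n20 = n11 + n12, n21 + n22
    n01, n02 = n11 + n21, n12 + n22
    n = n10 + n20
    return n * (n11 * n22 - n12 * n21) ** 2 / (n10 * n20 * n01 * n02)


table = (12, 8, 5, 15)  # hypothetical cell counts n11, n12, n21, n22
z_squared = z_statistic(*table) ** 2
```

All three quantities agree for any table with positive margins, which is what parts (a) and (b) ask to be proved in general.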
-42-
SPECIAL MS WRITTEN EXAMINATION IN BIOSTATISTICS, PART II
April 10, 1983 (1 PM - 4 PM)
INSTRUCTIONS:
a) This is an open-book "in class" examination.
b) Answer from Part I any two of the 3 questions which follow. Also
answer Q.4 from Part II. (Thus 3 answers in all.)
c) Put the answers to different questions on separate sets of papers.
d) Put your code letter, not your name, on each page.
e) Return the examination with a signed statement of the honor pledge on
a page separate from your answer.
PART I (of PART II)
Q.1 Assume that n = 100 individuals participate in a study. A response
variable Y is measured, along with a continuous factor X1 and the presence
or absence of a factor X2. The individuals come from four groups (A, B, C, D),
coded as X3.
a) Suppose that a regression model is fit, predicting Y from the
three main effects, all two-way interactions, and the three-way interaction.
Fill in the summary ANOVA table and the degrees of freedom in the detailed
table. The second table gives the extra sums of squares as each factor
is added to the model (the SAS Type I sums of squares).

Source              df      SS      MS      F
Model                      1500
Error
Corrected total            2340

-43-
Source        df    Type I SS
X1                     800
X2                     200
X3                     330
X1X2                    50
X1X3                    60
X2X3                    30
X1X2X3                  30
b) Show that the three-way interaction term can be deleted from
the model.
c) Test whether all the two-way interaction terms could simultaneously
be deleted from the model not containing the three-way interaction.
d) Assume that X2 is coded as 0 (absent) or 1 (present), and that
a model is fit using only the data from group A. This model contains main
effects for X1 and X2, and the X1X2 interaction term, with parameter estimates
for X1, X2 and X1X2 of -2, +10, and +4, respectively. The estimate of the
overall mean is 50.
i. What is the predicted value of Y for a subject in Group A
with X1 = 10 if X2 is absent? If X1 = 10 and X2 is present?
ii. What is the predicted change in Y if X1 changes from 10
to 5 in an individual in Group A with X2 present?
e) Suppose that the mean values for Y for X2 present and absent,
in each group, are as shown in the following table. Assess
informally whether there is an interaction between group and
the presence or absence of X2. Do not do an hypothesis test.

Mean Values of Y
Group    X2 Absent    X2 Present
A           30            50
B           25            15
C           35            45
D           40            40
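The degrees of freedom and F ratios for parts (a) through (c) can be organized as in the sketch below. It assumes X1 enters as a single continuous regressor (1 df), X2 as a 0/1 indicator (1 df), and X3 as a four-level classification (3 df), with interaction degrees of freedom formed as products. That coding is the usual one but is an assumption here.

```python
n_subjects = 100
total_ss = 2340.0

# (term, degrees of freedom, Type I sum of squares) in the order fitted.
type1 = [("X1", 1, 800.0), ("X2", 1, 200.0), ("X3", 3, 330.0),
         ("X1X2", 1, 50.0), ("X1X3", 3, 60.0), ("X2X3", 3, 30.0),
         ("X1X2X3", 3, 30.0)]

# Summary ANOVA table for the full model.
model_df = sum(df for _, df, _ in type1)
model_ss = sum(ss for _, _, ss in type1)
error_df = (n_subjects - 1) - model_df
error_ss = total_ss - model_ss
mse_full = error_ss / error_df
f_model = (model_ss / model_df) / mse_full

# Part (b): three-way interaction, tested against the full-model error.
_, df3, ss3 = type1[-1]
f_three_way = (ss3 / df3) / mse_full

# Part (c): all two-way interactions, in the model without the three-way term.
reduced_error_ss = error_ss + ss3
reduced_error_df = error_df + df3
mse_reduced = reduced_error_ss / reduced_error_df
two_way = [t for t in type1 if t[0] in ("X1X2", "X1X3", "X2X3")]
two_way_df = sum(df for _, df, _ in two_way)
two_way_ss = sum(ss for _, _, ss in two_way)
f_two_way = (two_way_ss / two_way_df) / mse_reduced
```

Each F ratio is then referred to an F table with the corresponding numerator and denominator degrees of freedom, for example (3, 84) for the three-way term and (7, 87) for the block of two-way terms.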
-44-
Q.2
a) Some or all of the JCL statements below contain one or more errors.
Circle each error, the whole error, and nothing but the error, and
briefly explain what is wrong. Consider each statement separately.
Assume that blanks within the lines occur only where the symbol "_"
appears. None of the errors involves "l" versus "1" or "O" versus "0".
COLUMN RULER
000000000111111111122222222223333333333444444444455555555556666666666n7
123456789012345678901234567890123456789012345678901234567890123456789012
(l).
II_JOB_UNC.B.E1234,HOSKING
(2).
II~EXEC_PGM=COPY,WASTE=YES
(3). IIPHRED DO UNIT~DISK,DISP={NEW,DELETE),VOL=REF=UNCCC.OFFLINE,
II
- -SPACE= (TRK, (10 ,20)) ,DCB=CRECFM=FB,BLKSIZE=600 ,LRECL=250) ,
/1
DSN=UNC.B.E1234.JONES.X
(4). IICOPYDISKS_JOB_UNC.B.E1234,BROWN,T=(,45),M=0
(5) •
IIEXEC-?GM=COPY,PARM=LIST
(6). IIOUTPUT DO DSN=UNC. B. E9999.JONES.MYSTUFF ,UNIT=TAPE ,DISP= (NHJ ,CATLG),
"_RING=TN,LABEL=(3,SL)
.
.
(7).
IIJOE_DD_UNIT=DISK,DSN=UNC.B.E1234.0ATA.407,DISP=OLO,DCB=(RECFM=FB,
II
LRECL=80,BLKSIZE=6000).
.
(8).
.
I/SYSOUT DO DSN=UNC.B.E1234.HElMS.STUFF,
/I
-VO[.=REF=UNCCC.ONLINE,
II
OCB=(RECFM=VB,BLKSIZE=6000,LRECL=5996),
1/
SPACE= (TRK, (10,5 ,RlSE) ) ,UfHT=DI SK,
II
01 SP= (NE1~ ,CATLG ,DELETE)
(9).
II INPUT_OD_DSN=UNC . B.£5001. SMITH. INDATA .ONTAPE ,LABEL=(3 ,SL}
II
DISP=OLD
(10).
/ IJOB#l_JOB_UNC. B. E1234_SMITH ,REGION=999K,MSGLEVEL= (l,O),TIME=5
,
-45-
b) Describe the processing of the DATA step listed below. Your answer should
indicate a description of both the compilation and the execution phase,
and a detailed description of the SAS data set created. Your answer should
include the following terms/concepts:
input buffer
length
compilation
type
execution
data matrix
program data vector
missing values
variable name
EBCDIC/floating point representation
history information

DATA WORK.ONE;
  LENGTH NAME $ 10 QUIZ1-QUIZ3 4;
  INPUT NAME $ 1-10
        QUIZ1 12-14
        QUIZ2 16-18
        QUIZ3 20-22
        ;
  AVERAGE=MEAN(QUIZ1, QUIZ2, QUIZ3);
  OUTPUT;
  RETURN;
CARDS;
-46-
Q.3 For the period 1 July 1974 through 30 June 1978, North Carolina
experienced a sudden infant death (SID) rate of two per thousand live
births.
a) For a county having 3000 live births and 12 SIDS in the same
calendar period, calculate the P-value for the possibility that
the SID rate in this county is greater than that of North Carolina.
By analogy to a binomial situation, use the normal distribution to
approximate a presumed underlying Poisson model.
b) Let \lambda denote the SID rate for a county with 3000 live births
during the above specified calendar period and consider the null
hypothesis H_0: \lambda = 0.002, i.e., \lambda is two SIDS per thousand live
births. Determine an upper tail critical region with a level of
significance \alpha = 0.05. Again use a normal approximation.
c) Following part (b), calculate the power corresponding to the
alternative hypothesis H_a: \lambda = 0.006.
d) Comment briefly on the appropriateness of the normal approximation
to the Poisson distribution for the above calculations. Sketch
how you would proceed in parts (a), (b), and (c) if the normal
approximation were not appropriate.
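The calculations in parts (a) through (c) can be laid out as in the sketch below, which treats the county's SIDS count as Poisson with mean 3000 times the rate and variance equal to the mean. No continuity correction is applied; that choice, and the use of the Poisson rather than the binomial variance, are assumptions that change the numbers slightly.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()
births = 3000
rate_null, rate_alt = 0.002, 0.006
observed = 12

mean_null = births * rate_null          # expected SIDS count under H0
sd_null = sqrt(mean_null)               # Poisson: variance equals the mean

# (a) Upper-tail P-value for the observed count.
z_obs = (observed - mean_null) / sd_null
p_value = 1 - STD_NORMAL.cdf(z_obs)

# (b) Upper-tail critical value for the count at alpha = 0.05.
alpha = 0.05
critical_count = mean_null + STD_NORMAL.inv_cdf(1 - alpha) * sd_null

# (c) Power against the alternative rate, using the alternative's own variance.
mean_alt = births * rate_alt
sd_alt = sqrt(mean_alt)
power = 1 - STD_NORMAL.cdf((critical_count - mean_alt) / sd_alt)
```

With an expected count of only 6 under the null hypothesis, the normal approximation is rough, which is the point of part (d); exact Poisson tail sums would replace the normal probabilities in each step.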
APPENDIX
Table 1. Normal Probability Distribution Function (Probabilities That
Given Standard Normal Variables Will Not Be Exceeded - Lower Tail)
N_z(-z). Also N_z(z) = 1 - N_z(-z).

 -z     .00     .01     .02     .03     .04     .05     .06     .07     .08     .09
 .0   .50000  .49601  .49202  .48803  .48405  .48006  .47608  .47210  .46812  .46414
 .1   .46017  .45620  .45224  .44828  .44433  .44038  .43644  .43251  .42858  .42465
 .2   .42074  .41683  .41294  .40905  .40517  .40129  .39743  .39358  .38974  .38591
 .3   .38209  .37828  .37448  .37070  .36693  .36317  .35942  .35569  .35197  .34827
 .4   .34458  .34090  .33724  .33360  .32997  .32636  .32276  .31918  .31561  .31207

EDITORIAL NOTE:
The table above has been abridged from one extended to
z = 3.99 which was attached to the original examination.
-47-
__tt
PART II (of PART II)
Q.4
A clinical trial is conducted to assess the efficacy of a new drug to
alleviate symptoms of depression. Patients are randomized to Drug (D) or
Placebo (P) in equal proportions at three psychiatric clinics. A total of
120 patients are entered with 20 patients allocated to each treatment group
at each clinic. Patients are tested for depression at baseline and at weeks
1, 2, 3, and 4. About 30% dropout occurs by week 4 (this attrition is to be
expected in depression trials) and so the statistical analyst decides to
concentrate on "final rated value" as the main dependent variable. The scale
of measurement of major interest is Total Score of Hamilton Depression Scale.
This scale is from 0-62 with
0-13  = little or no depression (essentially "cured")
14-19 = minor depression (not sick enough to enter trial but not well
        enough to be "cured")
20-29 = moderate depression
> 30  = severe depression.
Some controversy ensues about the univariate statistical analysis of the
final rated values of the Hamilton Depression Scale. One aspect of the
controversy is that there is significant treatment x clinic interaction.
Clinic 1 shows a preference for placebo (non-significant); clinic 2 shows a
preference for drug (non-significant); clinic 3 shows a significant
preference for drug.
The other aspect of the controversy is whether to adjust for the pre HAMD
score via covariance analysis. In clinic 1, placebo patients are a little
more severe at baseline (not significant, p = .19); in clinic 2, there are no
treatment group differences at baseline (p = .56); in clinic 3, drug patients
are more severe at baseline (near significance, p = .08).
Six different methods of analysis to test treatment effects are proposed by
different statisticians. These are as follows:
1. Two-way ANOVA
Two-way ANOVA of final HAMD scores employing treatment and clinic as main
effects and treatment by clinic interaction.
2. Two-way ANOVA of differences
Two-way ANOVA of "final HAMD score minus pre HAMD score" employing treatment
and clinic as main effects and treatment x clinic interaction.
3. Two-way ANCOVA
ANCOVA of final HAMD scores employing pre HAMD score as a covariate. The
main effects treatment and clinic would be included in the model and so
would treatment x clinic interaction.
4. Separate ANOVA's
A t-test on final HAMD scores for each clinic separately.
5. Separate ANCOVA's--different slopes
ANCOVA on final HAMD scores for each clinic separately. The covariate is
pre HAMD score and the main effect is treatment (i.e., covariance adjusted
t-test).
6. Separate ANCOVA's--common slope
ANCOVA on final HAMD scores for each clinic separately. The (linear)
covariate is pre HAMD score but with a slope that is derived from a pooled
analysis of all three clinics. The main effect is treatment.
Question
(18 points) i. Comment on each of the proposed analyses, outlining the
advantages and disadvantages of each.
(7 points) ii. Which of the six analysis strategies would you choose?
Explain why.
NOTE: For the purposes of this question, assume that the usual parametric
assumptions of normally and independently distributed errors with zero mean and
constant variance are reasonably appropriate. In other words, do not deal with
the issues regarding whether a non-parametric or parametric analysis is
appropriate.
BASIC MASTER LEVEL WRITTEN EXAMINATION IN BIOSTATISTICS
PART I
(January 21, 1984)
9:30 AM to 12:30 PM
1.
Suppose that the main purpose of a statewide household survey
is to estimate the proportion (P) of North Carolina households
without any form of health insurance. A self-weighting household
sample is selected by using simple random sampling in each
of three stages, with clusters of unequal size in the first two
stages. The estimator of P is the simple proportion of uninsured
households in the sample. The estimate turns out to be 0.24,
with a standard error of 0.02 and a design effect of 1.2.
(a) State whether or not the estimator of P is unbiased.
Briefly explain your answer.
(b) How many sample households produced the estimate (i.e.,
0.24)?
(c) How many sample households, chosen by simple random sampling
(i.e., no cluster sampling), would have been needed to produce
the same amount of statistical precision (i.e., standard error
of 0.02)?
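A minimal sketch of the arithmetic for parts (b) and (c), in Python. It assumes the usual definition of the design effect (variance under the actual design divided by the simple-random-sampling variance for the same number of households), uses p(1 - p)/n for the SRS variance, and ignores any finite population correction. The function names are hypothetical.

```python
def srs_sample_size(p, se):
    """Households needed under simple random sampling: se^2 = p(1 - p)/n."""
    return p * (1 - p) / se ** 2


def design_sample_size(p, se, deff):
    """Households needed under the complex design: se^2 = deff * p(1 - p)/n."""
    return deff * srs_sample_size(p, se)


# Figures quoted in the question
estimate, std_error, design_effect = 0.24, 0.02, 1.2

n_srs = srs_sample_size(estimate, std_error)                           # part (c)
n_actual = design_sample_size(estimate, std_error, design_effect)      # part (b)
print(f"SRS equivalent: {n_srs:.0f} households; actual sample: {n_actual:.0f}")
```

The two answers differ only by the factor 1.2, which is exactly what the design effect measures.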
solve the following problems:
(a) Find E(S²) and V(S²).
(b) Find random variables A and B such that pr(A < σ² < B) = (1 - α),
0 < α < 1.
(c) Let t > 0 and 0 < p < 1. Use Tchebysheff's Theorem to find the
smallest sample size n such that
pr{|S² - σ²| < tσ²} ≥ p.
[HINT: Your lower bound for n will be a function of t and p.]
3.
Let Y be normally distributed with mean μ and variance σ². A
random sample of size n is drawn of values of Y. The sample
mean is used to estimate μ and to test H0: μ = 0. For parts
(a) and (b), assume that σ² is known.
Suppose that detection of a fixed alternative μ = m > 0 is of
interest, and that the hypothesis test is done with α = .05.
(a) Determine the chance of rejecting H0 (the power of the test)
if the sample size is chosen so that the half-width of a 95%
confidence interval for μ is m.
(b) Suppose that sample size is determined to obtain a given
power for the fixed alternative μ = m. Show that, with this
sample size, the half-width of a 95% confidence interval for
μ is a fixed percent of m, the percent depending on the power
but not on m and not on σ².
(c) Show how that analysis is modified if the variance is
unknown. Give a very brief proof that your result in (a) is
still valid.
4.
Suppose that the length of life of an electric tube, T, has an
exponential probability density function
f_T(t) = λe^(-λt), t > 0, λ > 0.
In a sample of n tubes observed for a period T, d tubes failed,
while the lifetimes of the remaining tubes were greater than T.
(a) Find the maximum likelihood estimator of λ (λ̂_n, say).
(b) Find the approximate expected value of λ̂_n.
(c) Show that the approximate variance (its lower bound) of
λ̂_n is
Var(λ̂_n)
BASIC MASTER LEVEL WRITTEN EXAMINATION IN BIOSTATISTICS
PART II
(January 22, 1984: 2-5 PM)
1.
Assume that the dataset UNC.B.E555V.BIOSTAT.LIPID.LAB contains data
records keyed from the attached Lipid Laboratory Data Form. Assuming this
is a catalogued dataset stored at TUCC, write the program needed to create
a permanent SAS dataset from this file, perform some basic edits, and
print some descriptive statistics. This program should consist of the
appropriate JCL and three SAS steps which perform the following tasks:
a)
Read the data records and create a SAS dataset named LABONE. This SAS
dataset should be stored in a SAS data library created by your job and
named
UNC.B.E555V.BIOSTAT.LIPID.SASDATA
Read each of the fields except those identified as "always blank."
Use the SAS variable names shown in the attached table. Provide
appropriate formats and labels for these variables. This portion of
the program should include JCL statements to read the data file and
create the SAS data library. Assume the following JCL statements are
given:
//EXAM JOB UNC.B.E123X,STUDENT
//*PW=LUCK
// EXEC SAS
b)
Write a SAS DATA step to read LABONE and create a temporary SAS
dataset named LABCLEAN containing only those data values that pass
the following edit tests:
variable     valid values
FORMNO       SLDA1
CHYLO        1, 2, 9
TRIG         0-2000
TRIGBLK      0-99
TRIGADJ      TRIG-TRIGBLK (i.e., the value of TRIG minus the
             value of TRIGBLK)
This step should print an error message on the SAS log, with
appropriate information, for each data value which fails an edit and
then set the value to missing.
c)
Write a SAS PROC step to print descriptive statistics (including the
number of missing values, mean, minimum value, maximum value, and
variance) for the variables TRIG and TRIGADJ. Provide appropriate
titling information.
Field Specification for the Lipid Lab Form
Variable     Columns    Format (1)
FORMNO       1-5        AAAAN
             6-16       always blank
VISDATE      17-22      MMDDYY
LASTNAME     23-34      AAA...
LASTINIT     23         A
INITIALS     35-36      AA
CHYLO        37         N
             38-54      always blank
TRIG         55-58      NNNN
TRIGBLK      59-60      NN
TRIGADJ      61-64      NNNN
TRIGDATE     65-70      MMDDYY
             71-80      always blank
(1) A indicates byte can contain letters, numerals, or special
characters
N indicates byte can contain only numerals
MMDDYY indicates dates where MM=month
DD=day
YY=year
Version 731.65
LRC PREVENTION TRIAL
LIPID LABORATORY DATA FORM
Form number (1-5)
Date of Visit: Month, Day, Year (17-22)
Last Name (23-34)
Initials (35-36)
Standing plasma test:
  Chylomicron layer: 1 Present, 2 Absent, 9 Not done (37)
Triglycerides (Record in mg%):
  a. Triglycerides: mg% (55-58)
  b. Triglyceride blank (To be done only if triglyceride value is
     greater than 300 mg%): mg% (59-60)
  c. Triglyceride less blank: Net triglycerides, mg% (61-64)
  d. Date of triglyceride determination: Month, Day, Year (65-70)
2
Femoral antiversion (toeing in) is customarily detected and
measured by X-ray technique. It is desired to minimize the use
of X-rays because it exposes the gonadal area to possible hazard.
Thus variation in X-ray technique to minimize X-ray exposure is
a common object of study. The following data resulted in one
such study in which various degrees of antiversion were produced
in a model and then read on X-ray film by three technicians.
Each technician read each film twice in random order.
1) Investigate the bias in the new method.
2) Do the three technicians differ more than expected by chance?
3) Do the technicians' biases vary with the angle?
4) Write a 1/2 page report summarizing your findings.
3
Data concerning the amount of heat evolved in calories per gram
of cement (Y) as a function of the amount of each of four
ingredients (X1, X2, X3, X4) in the mix are presented in Table 1.
Be brief and to the point in your response to each part.
(a) An all possible regression functions analysis is reported
in Table 2. Interpret the results of this analysis.
(b) A stepwise regression analysis is reported in Table 3.
Interpret the results of this analysis.
(c) How many variables would you include in a regression model?
Which are they?
(d) What have you learned from this problem about setting the
entry and exit criteria when performing stepwise regression on
a data set too large for an all possible regression functions
analysis?
(e) Describe what other analyses you would perform to assess
whether the assumptions underlying regression analysis are met
for the regression model chosen in (c).
(f) How would you use the regression model selected in (c) to
predict the amount of heat that is likely to evolve from yet
another mix of cement?
Table 1: Data
Observation, i    y_i     x_i1   x_i2   x_i3   x_i4
 1                78.5      7     26      6     60
 2                74.3      1     29     15     52
 3               104.3     11     56      8     20
 4                87.6     11     31      8     47
 5                95.9      7     52      6     33
 6               109.2     11     55      9     22
 7               102.7      3     71     17      6
 8                72.5      1     31     22     44
 9                93.1      2     54     18     22
10               115.9     21     47      4     26
11                83.8      1     40     23     34
12               113.3     11     66      9     12
13               109.4     10     68      8     12

Table 2: Summary of all possible regressions
Number of
Regressors        Regressors
in Model     p    in Model      SSE(p)      R_p^2     Adj. R_p^2   MSE(p)     C_p
None         1    None          2715.7635   0         0            226.3136   442.92
1            2    x1            1265.6861   0.53395   0.49158      115.0624   202.55
1            2    x2             906.3363   0.66627   0.63593       82.3942   142.49
1            2    x3            1939.4005   0.28587   0.22095      176.3092   315.16
1            2    x4             883.8669   0.67459   0.64495       80.3515   138.73
2            3    x1 x2           57.9045   0.97868   0.97441        5.7904     2.68
2            3    x1 x3         1227.0721   0.54817   0.45780      122.7073   198.10
2            3    x1 x4           74.7621   0.97247   0.96697        7.4762     5.50
2            3    x2 x3          415.4427   0.84703   0.81644       41.5443    62.44
2            3    x2 x4          868.8801   0.68006   0.61607       86.8880   138.23
2            3    x3 x4          175.7380   0.93529   0.92235       17.5738    22.37
3            4    x1 x2 x3        48.1106   0.98228   0.97638        5.3456     3.04
3            4    x1 x2 x4        47.9727   0.98234   0.97645        5.3303     3.02
3            4    x1 x3 x4        50.8361   0.98128   0.97504        5.6485     3.50
3            4    x2 x3 x4        73.8145   0.97282   0.96376        8.2017     7.34
4            5    x1 x2 x3 x4     47.8636   0.98238   0.97356        5.9829     5.00
Table 3
Stepwise regression analysis of the cement data (computer printout). For
each step the printout gives the variable entered, R-square, C(p), the
analysis-of-variance table, and the parameter estimates with their standard
errors, Type II sums of squares, F values, and Prob>F. The numerical entries
are not legible in this copy.
4
In a retrospective study of the possible effect of blood group on
the incidence of peptic ulcers, Woolf (1955) obtained data from
three cities. The table below gives for each city data for blood groups
O and A only. In each city, blood group is recorded for peptic
ulcer subjects and for a control series of individuals not having
peptic ulcer.
Blood groups for peptic ulcer and control subjects
                 Peptic ulcer          Control
              Group O   Group A    Group O   Group A
London           911       579       4578      4219
Manchester       361       246       4532      3775
Newcastle        396       219       6598      5261
Source: Woolf (1955). On estimating the relation between
blood group and disease. Annals of Human Genetics, 19:
251-253.
(a) Investigate the association between blood group and peptic
ulcer status at each locality. Use a confidence interval
method and interpret its implications to the hypothesis of no
relationship.
(b) Assess the homogeneity of any blood group by peptic ulcer
status association across localities. Use an appropriate
significance test. Also, if homogeneity is supported, provide
a confidence interval for the blood group by peptic ulcer status
association for the combined data from all three localities.
(c) Discuss briefly the interpretation of the results from (a)
and (b).
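One possible approach to parts (a) and (b), sketched in Python, is the log odds ratio method usually associated with Woolf. This is an assumption about the intended analysis, not the examiners' solution, and the function names are hypothetical. The standard error of each log odds ratio is the square root of the sum of reciprocal cell counts, the pooled estimate is the inverse-variance weighted mean, and the homogeneity statistic is referred to chi-square with 2 degrees of freedom (one fewer than the number of cities).

```python
from math import exp, log, sqrt
from statistics import NormalDist

# (ulcer group O, ulcer group A, control group O, control group A)
WOOLF_DATA = {
    "London": (911, 579, 4578, 4219),
    "Manchester": (361, 246, 4532, 3775),
    "Newcastle": (396, 219, 6598, 5261),
}


def log_odds_ratio(a, b, c, d):
    """Log odds ratio (group O vs A, ulcer vs control) and its variance."""
    return log(a * d / (b * c)), 1 / a + 1 / b + 1 / c + 1 / d


def woolf_analysis(data, level=0.95):
    z = NormalDist().inv_cdf(0.5 + level / 2)
    per_city, weights, estimates = {}, [], []
    for city, cells in data.items():
        y, var = log_odds_ratio(*cells)
        half = z * sqrt(var)
        per_city[city] = (exp(y), exp(y - half), exp(y + half))  # OR, CI
        weights.append(1 / var)
        estimates.append(y)
    # Inverse-variance weighted mean of the log odds ratios
    pooled = sum(w * y for w, y in zip(weights, estimates)) / sum(weights)
    # Homogeneity statistic; chi-square with (number of cities - 1) d.f.
    chi2 = sum(w * (y - pooled) ** 2 for w, y in zip(weights, estimates))
    half = z / sqrt(sum(weights))
    pooled_ci = (exp(pooled), exp(pooled - half), exp(pooled + half))
    return per_city, chi2, pooled_ci


per_city, chi2_homog, pooled = woolf_analysis(WOOLF_DATA)
for city, (or_hat, lo, hi) in per_city.items():
    print(f"{city}: OR {or_hat:.2f} ({lo:.2f}, {hi:.2f})")
# With exactly 2 d.f. the chi-square upper tail is exp(-x/2)
print(f"homogeneity chi-square {chi2_homog:.2f}, P {exp(-chi2_homog / 2):.2f}")
print(f"pooled OR {pooled[0]:.2f} ({pooled[1]:.2f}, {pooled[2]:.2f})")
```

A confidence interval that excludes 1 for a city corresponds to rejecting the hypothesis of no relationship there at the matching significance level.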
BASIC MASTER LEVEL WRITTEN EXAMINATION IN BIOSTATISTICS
PART II
(April 15, 1984)
INSTRUCTIONS:
a) This is an open-book examination.
b) M.P.H. students are to answer any two questions during the two-hour
period (1:30 pm - 3:30 pm). M.S. students are to answer three questions,
of which not more than 2 should be from Group A; time period 1:30 pm -
4:30 pm.
c) Put the answers to different questions on separate sets of papers.
d) Put your code letter, not your name, on each page.
e) Return the examination with a signed statement of honor pledge on a
page separate from your answers.
f) You are required to answer only what is asked in the questions and
not all you know about the topics.
Group A
1.
A randomized clinical trial for hypertension is to be designed to
compare four treatment modalities in a parallel group design. The physician
in charge of the trial wants to know the correct sample size to use. Review
of past trials shows that the standard deviation for diastolic blood pressure
phase V of entering patients is around 6 mm Hg. The physician feels that
differences between treatment groups of 3 mm Hg must be detectable by the
experiment.
Perform some power calculations using the attached charts to help
identify appropriate sample sizes for the following situations:
(I)
If any two treatments differ by 3 mm Hg, then the 4 treatment
group comparison with 3 d.f. for the numerator of the F-test
should be significant.
(II)
If any two treatments differ by 3 mm Hg, then the pairwise
difference based on the t-test should be significant.
(III)
If any two treatments differ by 3 mm, then a Bonferroni adjusted
pairwise difference (based on the fact that six pairwise
differences are being inspected) should be significant.
EDITORIAL NOTE.
The "attached charts" were those given on pages 115
and 116 in E.S. Pearson and H.O. Hartley, "Charts of
the power function for analysis of variance tests, derived
from the non-central F-distribution", Biometrika 38 (1951)
112-130.
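Without the charts, situations (II) and (III) can be roughed out with the usual two-sample normal approximation, n per group = 2σ²(z_{1-α/2} + z_{power})²/δ². The Python sketch below is illustrative only. The question does not state a target power, so the 80% used here is an assumption, as are the two-sided 5% level and the function name. Situation (I) needs the non-central F distribution and is not covered.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(sigma, delta, alpha, power):
    """Approximate patients per group for a two-sided two-sample comparison."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)   # two-sided critical value
    z_power = z.inv_cdf(power)           # quantile matching the target power
    return ceil(2 * (sigma * (z_alpha + z_power) / delta) ** 2)


SIGMA = 6.0    # standard deviation of diastolic pressure, mm Hg (from the question)
DELTA = 3.0    # difference to be detected, mm Hg (from the question)
POWER = 0.80   # assumed target power; not given in the question

n_pairwise = n_per_group(SIGMA, DELTA, 0.05, POWER)        # situation (II)
n_bonferroni = n_per_group(SIGMA, DELTA, 0.05 / 6, POWER)  # situation (III)
print(n_pairwise, n_bonferroni)
```

Because the normal quantiles stand in for t quantiles, these figures run slightly below what the charts or an exact t-based calculation would give.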
2. A teacher wishes to determine the value of providing a manual and/or
certain notes to his classes. He has 48 students, whom he distributes at
random among 4 different groups, placing 12 in each. The assignment of
teaching aid combinations to groups is also done at random. After the course
is over, all students who are still enrolled take the same exam, with
results as shown below.
Manual  Notes   N   Data                               Mean    S.D.
Yes     Yes     9   60,64,67,68,68,69,71,73,75         68.33    4.53
Yes     No     10   32,41,44,47,48,54,54,64,65,73      52.20   12.42
No      Yes    11   41,44,47,49,51,54,56,59,61,61,73   54.18    9.16
No      No      7   44,51,55,59,59,66,69               57.57    8.56
ANOVA:
Source     DF    SS      MS
Between     3    1456    485.3
Within     33    2831     85.8
Total      36    4287    119.1
(These data are from W.C. Guenther, Analysis of Variance, page 44.)
a) Assuming that differences in average grade may be attributed solely to the
different combinations of teaching aids, write out a linear model for this
experiment. Estimate all its parameters.
b) Consider two simplifications of the model:
1) Eliminate interaction, making it additive.
2) Combine all groups except Yes/Yes.
Show that the data will support the second simplification but not the first.
c) Present an argument for transforming these data. What would be a good
transformation to try? (Do not actually perform any analysis with
transformed data.)
d) Criticize the experimental design, and/or the use of standard analysis of
variance with it. Can you suggest improvements?
3.
Your answers to Question 3 must be submitted on these sheets. NO MATERIAL
ON ANY OTHER SHEET WILL BE CONSIDERED!
You may use the back of the sheet if necessary, but enough space has been
left under each part to provide the expected answer for that part.
Parts (a)-(e) of this question are worth three points each. In each case,
the DATA step shown is executed reading one or both of the SAS data sets
shown below (note that both are sorted BY ID). For each step, list the
data part of the data set created (include variable names as shown below),
and answer the questions presented.
data set TWO
data set ONE
ID
X
y
10
1
1
4
10
"8
1
2
3
0
4
2
4
5
3
3
X
0
6
7
'4
Z
-3
0
9
•
a)
DATA A;
SET TWO;
Y=SUM(X,Z);
IF (X LT Z) THEN OUTPUT;
RETURN;
Output Dataset
How many times is the DATA step executed?
b)
data set ONE
data set TWO
ID
X
Y
10
X
Z
1
1
4
10
•8
1
2
•6
-3
3
0
4
3
4
2
5
7
4
__01
DATA B;
SET TWO ONE(RENAME=(Y=Z));
IF (X GE 4);
3
0
9
•
Output Dataset
Q=X+Z;
DROP X;
OUTPUT;
RETURN;
What variables are in the PDV?
Output Dataset
c)
DATA D;
MERGE TWO ONE;
BY ID;
IF LAST.ID THEN OUTPUT;
RETURN;
How many observations would be in the data set if the OUTPUT statement
was replaced by a DELETE statement (following the IF/THEN)?
data set ONE
ID
X
1
1
4
10
3
u
4
42
5
Y
_J~
6)
8
,---.,..-'---J)
data set
-----.
--'
TWO
__Z_.
1
2
•6
3
3
7
4
-3
0
9
..
Output Dataset
DATA D;
SET TWO;
DROP X Z;
VAR='X';
VAL=X;
OUTPUT;
VAR='Y';
VAL=Y;
OUTPUT;
RETURN;
What is the length of VAL?
Why?
e)
DATA E;
SET ONE;
NUM=PUT(ID,WORDS5.);
AVG=(X+Y)/2;
MEAN=MEAN(X,Y);
DROP ID X Y;
OUTPUT;
RETURN;
Output Dataset
What is the length of NUM?
Why?
Parts (f) and (g) are worth five points each.
In parts (f) and (g) of this problem, you are only asked to compile the
descriptor part of the data set, not to execute it or create the data part.
For each of the DATA steps in parts (f) and (g) below, fill in the information
SAS would store in the descriptor part of the data sets being created. Sketch
the Program Data Vector that would be created, indicating the type and length
of each variable in it. When appropriate, indicate the existence and size of
input and/or output buffers.
f)
DATA F;
INPUT NAME $ FNAME $ SATM SATV;
SATTOT=SATM+SATV;
IF (SATTOT GT 1000) THEN STATUS='ADMIT';
ELSE STATUS='REJECT';
FILE PRINT;
PUT NAME $ 1-10 STATUS $ 12-20;
OUTPUT;
RETURN;
CARDS;
PDV
Name:
Type:
Length:
WORK.D
Name:
Type:
Length:
Informat:
Format:
Label:
g)
DATA G;
INFILE IN LRECL=50;
INPUT #1 @1  NAME   $20.
         @21 BDATE  MMDDYY8.
         @31 DDATE  MMDDYY8.
      #2 @1  CAUSE1 $6.
         @11 CAUSE2 $6.;
FORMAT BDATE DDATE DATE.;
LENGTH SURVTIME 4;
SURVTIME=DDATE-BDATE;
OUTPUT;
RETURN;
PDV
Name:
Type:
Length:
WORK.E
Name:
Type:
Length:
Informat:
Format:
Label:
Group B
4.
For the period 1 July 1974 through 30 June 1978, North Carolina experienced
a sudden infant death (SID) rate of two per thousand live births.
(a) For a county having 3000 live births and 12 SIDs in the same calendar
period, calculate the P-value for the possibility that the SID rate in
this county is greater than that of North Carolina. Use the normal
distribution to approximate a presumed underlying Poisson model.
(b) Let λ denote the SID rate for a county with 3000 live births during the
above specified calendar period and consider the null hypothesis
H0: λ = 0.002, i.e., λ is two SIDs per thousand live births. Determine an
upper tail critical region with a level of significance α = 0.05.
(c) Following part (b), calculate the power corresponding to the alternative
hypothesis Ha: λ = 0.006.
(d) Comment briefly on the appropriateness of the normal approximation to the
Poisson distribution for the above calculations. Sketch how you would
proceed in parts (a), (b), and (c) if the normal approximation were not
appropriate.
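For part (d), one way to proceed without the normal approximation is to work with exact Poisson tail probabilities. The Python sketch below assumes the county count is Poisson with mean equal to births times rate; the function names are hypothetical and this is an illustration, not the examiners' solution.

```python
from math import exp, factorial


def poisson_upper_tail(k, mu):
    """Exact P(X >= k) for X distributed Poisson with mean mu."""
    return 1 - sum(exp(-mu) * mu ** j / factorial(j) for j in range(k))


def exact_sid_analysis(births=3000, rate0=0.002, observed=12,
                       alpha=0.05, rate1=0.006):
    mu0, mu1 = births * rate0, births * rate1
    # (a) exact upper-tail P-value for the observed count
    p_value = poisson_upper_tail(observed, mu0)
    # (b) smallest count c whose upper tail under H0 does not exceed alpha;
    #     the critical region is then X >= c
    critical = 0
    while poisson_upper_tail(critical, mu0) > alpha:
        critical += 1
    # (c) power: probability of landing in the critical region under Ha
    power = poisson_upper_tail(critical, mu1)
    return p_value, critical, power


p_value, critical, power = exact_sid_analysis()
print(f"P-value {p_value:.4f}; reject H0 if SIDs >= {critical}; power {power:.3f}")
```

Because the count is discrete, the attained significance level of the exact region is below the nominal 0.05, which is worth noting when comparing it with the approximate answer.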