EFFECT OF DICHOTOMIZING CONTINUOUS VARIABLES
IN REGRESSION MODELS
by
Jose Francisco Cumsille
Department of Biostatistics
University of North Carolina
Institute of Statistics
Mimeo Series No. 2193T
May 1998
EFFECT OF DICHOTOMIZING CONTINUOUS
VARIABLES IN REGRESSION MODELS
by
Jose Francisco Cumsille
A dissertation submitted to the faculty of The University of North Carolina at Chapel Hill
in partial fulfillment of the requirements for the degree of Doctor of Public Health in the
Department of Biostatistics.
Chapel Hill
1997
Approved by:

Advisor

Reader

Reader

Reader

Reader
ABSTRACT
Jose Francisco Cumsille. Effect of Dichotomizing Continuous Variables in Regression
Models (Under the Direction of Dr. Shrikant I. Bangdiwala).
The dichotomization of continuous variables is a very common practice in data analysis. Some of
the consequences of such dichotomization are well known: loss of information, grouping of
dissimilar subjects into the same category, loss of power of the statistical methods, and
underestimation of the correlation coefficient, among others. Dichotomization is not only a
methodological issue; this practice can have a strong impact on the interpretation of
empirical results. One of the most popular tools for analyzing data is the regression model,
and it is very common to see results from a linear regression model in which one or
more continuous variables have been dichotomized.
The objectives of this dissertation are to study the effect on the model structure of
categorization of continuous variables in multiple linear regression models, and also to study
the effect on the exposure (main effect) of categorizing a continuous confounding variable in
multiple linear regression models and also in multiple logistic regression models. These
considerations imply that the outcome can be continuous or binary and the confounding
variable must be continuous. Regarding the main effect, we study the situation in which it is also
a continuous variable. In order to evaluate the consequence in terms of the model structure,
an analytic approach was used, whereas to evaluate the impact on the measure of association
between the outcome and the exposure, a simulation approach was used for both the linear
and logistic regression situations. The effect on the measure of association was evaluated by
using a variety of methodologies.
The main result in terms of the model structure concerns linearity. In fact, if the
original model is linear, then after dichotomization of the control variable the resulting model is
no longer linear. Therefore, someone who dichotomizes a continuous variable in a linear regression
model and then fits a linear regression model is essentially fitting a
wrong model. In terms of the measures of association, we found that after dichotomization
the amount of bias depends on the correlation structure among the variables in the model, and that
it increases as the correlations increase. This was true for both linear and logistic regression
models.
"
.
IV
.
Acknowledgements
I would like to thank my advisor, Dr. Shrikant Bangdiwala, for his patience and
guidance during the course of this dissertation research. I also wish to express my
sincere thanks to the other members of my committee, Drs. C. E. Davis, G. G. Koch, L.
L. Kupper and D. Loomis, for their suggestions and their careful examination of my work.
Many thanks to Dr. P. K. Sen for his help in part of this dissertation.
I also gratefully acknowledge the International Clinical Epidemiology Network
(INCLEN) and the Study Grant Program of the W. K. Kellogg Foundation for allowing
me to complete my doctoral studies.
To my wife Luz Maria, my children Paula,
Francisco and Claudio.
To my parents Miguel and Mery.
TABLE OF CONTENTS

Chapter 1: Introduction, Literature Review and Objectives ............................ 1
  1.1.- Introduction .................................................................. 1
  1.2.- Motivation .................................................................... 3
  1.3.- Literature Review ............................................................. 7
  1.4.- Objectives ................................................................... 12
Chapter 2: Dichotomization under the Multiple Linear Regression Model:
  An Analytic Approach .............................................................. 14
  2.1.- The Model .................................................................... 14
  2.2.- Evaluation of the Conditional Expected Value ................................. 16
Chapter 3: Practical Effect of Dichotomization under the Multiple Linear
  Regression Model .................................................................. 23
  3.1.- General Considerations ....................................................... 23
  3.2.- The Model .................................................................... 24
  3.3.- Dichotomizing the Control Variable ........................................... 25
  3.4.- Epidemiological Interpretation for X2 ........................................ 26
  3.5.- Simulation Approach .......................................................... 29
  3.6.- Simulation Process ........................................................... 34
  3.7.- Measure of the Effect ........................................................ 34
  3.8.- Results ...................................................................... 39
  Tables and Graphs for Scenarios 1 to 12 ............................................ 42
  Tables and Graphs for Scenarios 13 to 24 ........................................... 47
  Tables and Graphs for Scenarios 25 to 36 ........................................... 52
  Tables and Graphs for Scenarios 37 to 48 ........................................... 57
  Tables and Graphs for Scenarios 49 to 60 ........................................... 61
  Tables and Graphs for Scenarios 61 to 72 ........................................... 66
  Tables and Graphs for Scenarios 73 to 84 ........................................... 71
  Tables and Graphs for Scenarios 85 to 96 ........................................... 76
  Tables for the Relative Difference ................................................. 80
Chapter 4: Dichotomization under the Multiple Logistic Regression Model:
  A Simulation Approach ............................................................. 82
  4.1.- General Considerations ....................................................... 82
  4.2.- The Model .................................................................... 83
  4.3.- Simulation Approach .......................................................... 84
  4.4.- Simulation Process ........................................................... 92
  4.5.- Measure of the Effect ........................................................ 92
  4.6.- Results ...................................................................... 95
  Tables and Graphics for Scenarios 1 to 23, n=50 .................................... 97
  Tables and Graphics for Scenarios 1 to 23, n=200 .................................. 106
  Tables and Graphics for Scenarios 25 to 42, n=50 .................................. 115
  Tables and Graphics for Scenarios 25 to 42, n=200 ................................. 123
  Tables for the Relative Difference ................................................ 131
Chapter 5: Conclusions, Recommendations and Further Research ........................ 133
  5.1.- Conclusions and Recommendations ............................................. 133
  5.2.- Further Research ............................................................ 135
Bibliography ....................................................................... 137
CHAPTER 1
INTRODUCTION, LITERATURE REVIEW AND OBJECTIVES
1.1.- INTRODUCTION.
In many health studies, the objective is to establish the relationship
between one particular outcome of interest (dependent variable) and a set of covariates
(independent variables). For instance, suppose that we are interested in studying the
reduction in blood pressure (BP) in people under different treatments; in that case, BP is
the outcome and treatment is the covariate. In order to better understand the relationship,
we usually wish to control for the effects of other variables; for example, age, sex, race,
and cholesterol. Multivariately, it is also possible to study the relationship between the
outcome and the set of independent variables in order to know how all the covariates
together explain the behavior of the outcome.
The situation described above is very often encountered in clinical trials
and in other kinds of epidemiological or clinical studies. In the example above, the
outcome was BP, which is continuous. We can also have real situations in which the
outcome corresponds to a categorical variable, usually dichotomous; for example, good or
bad response to a treatment, presence or absence of one particular disease, and so forth.
In order to analyze the relationship, we focus on two important elements. The first is
related to the scale of measurement of the variables under study, in particular that of the
outcome; depending on the kind of outcome in the study, among other considerations, we
define the link function necessary to describe the relation. The second important element
is the choice of the most appropriate measure of association to describe the strength of
the association.
In addition, the kind of study plays an important role in the determination of the
relationship. For example, different approaches must be used depending on whether we
have a single measurement on each subject or each subject is measured more than once.
As was mentioned above, the scale of measurement is one of the essential
elements considered in the decision about the most appropriate model and also about the
measurement of association. Consider again the example about BP, and suppose that we
want to estimate the association between BP and age. Given that BP and age are both
continuous variables, if we assume a bivariate normal distribution, the natural measure of
association is the Pearson correlation coefficient, which is obtained after fitting a linear
model for the conditional expected values of BP given age. In other words, given the fact
that we know the scale of measurement behind the variables under study, we make
distributional assumptions and then define the model. Finally, given a random sample,
we fit the model and get the estimates for all the parameters involved in the model.
On the other hand, if the outcome is binary, the appropriate model is the
logistic regression model, and then the corresponding measure of association is the odds
ratio.
In the cases described above, there is a perfect correspondence between the
model and the measure of association. However, sometimes one or more continuous
variables in the model (dependent or independent variables) are categorized into two or
more levels, and then used on a qualitative scale in the statistical analysis. One
valid question is: are the conclusions the same whether the variable is treated as
continuous or used on a coarser scale of measurement? In other words, what are
the consequences of changing the scale of measurement of continuous variables in the
statistical analysis?
1.2.- MOTIVATION
In order to get a good sense of the effect of dichotomizing continuous variables in a
statistical model, we present two real-life examples.

Example 1. Chang et al. (1990) carried out a study in which they assessed the association between an
outcome (Coronary Heart Disease) and a set of covariates: Smoking, Systolic Blood
Pressure, Total Cholesterol, Relative Weight and Age. The authors wrote: "we
dichotomized the covariates by the assignation of zero to the following conditions:
- Smokes less than one pack of cigarettes per day;
- Systolic Blood Pressure less than 140 mm Hg;
- Relative Weight less than 120 per cent;
- Total Cholesterol level less than 250 mg per 100 ml;
- Age less than 55.
We assigned the alternatives in each case a value of one."
Using the five variables, it is possible to classify the patients into 32
different risk sets. The authors analyzed the data using, among others, the Cox
time-dependent model. Some of the 32 sets and their corresponding relative risks (R.R.)
obtained by fitting the time-dependent model are given in Table 1.1:
Table 1.1: Some of the 32 risk sets and their relative risks. Each risk set corresponds
to one combination of the five dichotomized covariates; set 1 is the baseline pattern
(0, 0, 0, 0, 0) and set 32 has all five factors present (1, 1, 1, 1, 1).

Risk set:   1     2     3     4     5     6     7     8    ...   31     32
R.R.:      1.00  1.56  1.55  1.64  2.06  2.76  2.42  2.56  ...  6.79  10.60
Suppose that we want to classify four people with the following patterns of the five
covariates, as in Table 1.2:
Table 1.2: Patterns for 4 hypothetical subjects.

Subject   Smoking (cig./day)   SBP (mm Hg)   Rel. weight (%)   Cholesterol (mg/100 ml)   Age (years)
   1              —                 —               —                    —                    —
   2             18                135             115                  240                   54
   3             21                145             125                  255                   56
   4             40                220             180                  350                   70
According to the classification rule and the results reported by the authors, the values
used in the statistical analysis, and the results in terms of the relative risk, for the four
subjects, are given in Table 1.3:
Table 1.3: New "values" for the 4 hypothetical subjects after applying the
classification rule.

Subject   Smoking   SBP   Rel. weight   Cholesterol   Age     R.R.
   1         0       0         0             0         0      1.00
   2         0       0         0             0         0      1.00
   3         1       1         1             1         1     10.60
   4         1       1         1             1         1     10.60
The first two subjects fall in the baseline category (R.R. = 1.00), and the other two subjects
are located in the worst situation (R.R. = 10.60).
Some concerns about the data presented above are:
- Is the risk of heart disease the same in subjects 1 and 2?
- Is the risk of heart disease the same in subjects 3 and 4?
- Is the R.R. of subject 3 with respect to subject 2 the same as the R.R. of subject 4
relative to subject 1?
- Are the baseline continuous values from subjects 1 and 2 equivalent?
If we observe the original values, probably our conclusion would be that the risks of
subjects 1 and 2 are different, and similarly for subjects 3 and 4.
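As a quick check of the rule, the following sketch (our own illustration; Python is not used in the original study, and we assume one pack = 20 cigarettes per day) reproduces the dichotomized patterns of Table 1.3 from the values in Table 1.2:

```python
# Hypothetical illustration of the Chang et al. (1990) classification rule;
# cut-offs are from the quoted text, subject values are from Table 1.2.
CUTOFFS = {"smoking": 20, "sbp": 140, "rel_weight": 120, "chol": 250, "age": 55}

SUBJECTS = {  # subject 1's original values are not reproduced here
    2: {"smoking": 18, "sbp": 135, "rel_weight": 115, "chol": 240, "age": 54},
    3: {"smoking": 21, "sbp": 145, "rel_weight": 125, "chol": 255, "age": 56},
    4: {"smoking": 40, "sbp": 220, "rel_weight": 180, "chol": 350, "age": 70},
}

for sid, values in SUBJECTS.items():
    # assign 0 below the cut-off and 1 otherwise, as in the quoted rule
    pattern = tuple(int(values[k] >= CUTOFFS[k]) for k in CUTOFFS)
    print(sid, pattern)  # subject 2 -> (0,0,0,0,0); subjects 3 and 4 -> (1,1,1,1,1)
```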
Example 2. Hosmer and Lemeshow (1989) use many different examples throughout their
book. In one of the studies, the goal was to identify risk factors associated with giving
birth to an infant with low weight. The potential risk factors are: age of the mother (AGE,
in years), weight at the last menstrual period (LWT, in pounds), race (RACE: White,
Black and Other), smoking status during pregnancy (SMOKE: yes or no), history of
premature labor (PTL: none, one, etc.), history of hypertension (HT: yes or no), presence
of uterine irritability (UI: yes or no), and number of physician visits during the first
trimester (FTV: none, one, two, etc.).
The outcome in the study was birth weight (BWT) in grams, which is
continuous; it is dichotomized, and the variable low birth weight (LOW) is created
according to the following criterion: LOW = 0 if BWT >= 2,500 g, and LOW = 1 if
BWT < 2,500 g.
The results shown in the book using LOW as the outcome and AGE,
LWT, RACE (two dummy variables with White as the reference category) and FTV as the
independent variables in a logistic regression model are reproduced in Table 1.4:
Table 1.4: Results from the logistic regression model.

Variable        Coefficient   Std. Error   Wald chi-square   p-value
Intercept          1.2954       1.0714         1.4617         0.2267
Age               -0.0238       0.0337         0.4989         0.4800
LWT               -0.0142       0.0065         4.7430         0.0298
Race (Black)       1.0039       0.4979         4.0660         0.0438
Race (Other)       0.4331       0.3622         1.4296         0.2318
FTV               -0.0493       0.1672         0.0869         0.7681
Using the logistic regression model, we see that the association between
LOW and race is not so clear. In fact, the estimate for the parameter comparing race
"Black" with race "White" is marginally significant (p = 0.0438), and the corresponding
estimate for the difference between race "Other" and race "White" is not significant
(p = 0.2318).
Now, suppose that we use BWT instead of LOW as the dependent
variable; given that BWT is a continuous variable, we use a multiple linear regression
model with AGE, LWT, RACE (two dummy variables with White as the reference category)
and FTV as the independent variables, i.e., we have the same set of independent
variables. The results under the new model are given in Table 1.5:
Table 1.5: Results from the multiple linear regression model.

Variable        Coefficient   Std. Error       t       p-value
Intercept        2470.240      316.280       7.810     0.0001
Age                 0.779       10.308       0.076     0.9396
LWT                 4.580        1.799       2.546     0.0117
Race (Black)     -448.634      161.706      -2.774     0.0061
Race (Other)     -240.593      115.574      -2.082     0.0388
FTV                10.989       50.152       0.219     0.8268
As we can see, if we use the variable in its original scale, the strength of the
association between the response and race is much clearer, and it is different from the one
observed when BWT is dichotomized. The logical question that arises is: are these
comparisons appropriate?
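For readers who want to reproduce this comparison, a minimal sketch follows, assuming the Hosmer-Lemeshow low-birth-weight data are available as a CSV file (the file name "lowbwt.csv" and the coding RACE = 1, 2, 3 for White, Black, Other are our assumptions):

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("lowbwt.csv")              # hypothetical file location
df["LOW"] = (df["BWT"] < 2500).astype(int)  # dichotomize birth weight at 2,500 g

# Logistic regression with the dichotomized outcome (compare with Table 1.4)
logit_fit = smf.logit("LOW ~ AGE + LWT + C(RACE, Treatment(1)) + FTV", data=df).fit()
print(logit_fit.summary())

# Linear regression with the original continuous outcome (compare with Table 1.5)
ols_fit = smf.ols("BWT ~ AGE + LWT + C(RACE, Treatment(1)) + FTV", data=df).fit()
print(ols_fit.summary())
```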
1.3.- LITERATURE REVIEW.
Through the above two examples, we have a preliminary suspicion about
the consequences of dichotomizing continuous variables. In the first case, the
consequence is related to the classification of heterogeneous subjects within the same risk
set; i.e., the subjects within a particular group "are treated as if they were identical" with
respect to the set of variables under study (Maxwell & Delaney, 1993). In the second
example, we see that the association depends on the scale of measurement used for one
particular continuous variable, and also on the type of model used as the link function
between the outcome and the independent variables.
One goal of the present dissertation is to evaluate the consequences of
categorizing continuous variables. There are several ways to study the effect of
categorization; as we will discuss later, we focus our study on two types of models.
Why do some people like to categorize continuous variables? Different
reasons can be found in the literature for the categorization of continuous variables. One
concern is error in measurements. Sometimes, it is very difficult to have an exact measure
for a particular continuous variable, or it is suspected that the measure may contain error.
Given that scenario, Flegal et al. (1991) studied the effect of measurement error on the
estimates of relative risk. Reade-Christopher and Kupper (1991) propose a model to
assess the effect of exposure misclassification in epidemiological follow-up studies. For
case-control studies, Fung and Howe (1984) studied the general effect of exposure and
covariate misclassification on estimation and power.
In the area of misclassification, there are many papers considering
different aspects. However, there are not many papers studying the effect of categorizing
continuous variables assuming that no measurement error is present.
In the applied papers that use continuous variables on a different scale of
measurement, most of the time no reasons are specified for the categorization. The
researchers simply determine
the cut points and then apply the statistical methods. For the Albany study, Chang et al.
(1990) used different statistical analyses in order to predict the risk of coronary heart
disease; they defined several continuous variables as risk factors, and applied logistic
regression and Cox regression models using the risk factors in the original scale as well as
after dichotomization. They reported only the results associated with the binary risk
factors, "because the results were indistinguishable, we report here only the dichotomized
analysis since it is somewhat easier to relate to clinical practice". They did not justify the
choice of the cut points, and it is not clear if the same risk ratios are obtained for other
choices of sets of cutpoints.
Another reason found in the literature for categorization is related to
statistical models applied to these cases. In general, the statistical analysis can be easier
and sometimes we do not need sophisticated statistical software. On the other hand, some
researchers prefer to dichotomize the outcome, the exposure or both, because in this way
the dichotomized variable better represents the disease or no-disease status, the
interpretation of the coefficient may be easier, or different ranges of a
continuous variable have distinct clinical significance (Ragland, 1992). Another important
reason is related to the model assumptions; as noted by Altman et al. (1994), the
categorization of continuous variables "enables researchers to avoid strong assumptions
about the relation between the marker and risk, but at the expense of throwing away
information". In logistic regression models for example, one assumes a linear relationship
between the independent variables and the log odds, whereas if the continuous variables
are categorized, that assumption is no longer required. Kahn and Sempos (1989), using
one particular example, compared the odds ratio associated with presence or absence of
cardiovascular heart disease and several continuous variables; they fitted the model using
the original continuous scale, and then using two different categorizations. For that
example, the estimated odds ratios were sometimes very close and at other times very
different. Because the linearity assumption is not a requirement when using categorized
variables, they prefer to use "methods requiring fewer assumptions".
The problems presented above are the kinds of problems that we want to
study in this dissertation, using a different strategy: assuming that the continuous
variables are correctly measured, how does their categorization affect the estimation
process, and subsequently the understanding of the true relationship between the outcome
and the exposure? The first consequence of categorization is the loss of information and
power in the statistical analysis (Zhao and Kolonel, 1992). For example, in the context of
inter-observer agreement, Donner and Eliasziw (1994) studied the consequences of
dichotomizing continuous variables, in terms of loss of statistical power; they concluded
that "the effect of dichotomization on the efficiency of reliability studies can be severe".
On the other hand, one of the most controversial elements of categorization is related to
the choice of cutpoints. In many situations, the selection of one particular value for
dichotomizing is arbitrary.
For example, Amenta et al. (1988) compared different
statistical methods for evaluating test effectiveness and diagnostic discrimination after
dichotomizing quantitative tests; they concluded that dichotomization with arbitrary
cutoff values was inadequate for comparing test effectiveness. For case-control studies,
Wartenberg and Northridge (1991) presented a new approach to define the cut point,
related with the exposure; basically, the idea is to compute the association between the
outcome and the exposure for each possible dichotomization of the exposure, and then to
choose the cut point corresponding to the maximum risk. Even though that strategy seems
attractive, Altman (1994) noted that the p-value obtained by such multiple testing is not valid, and
the type I error associated with that strategy is very high. Ragland (1992) considered a
very simple situation where the association between age and blood pressure is of
interest; using different cutpoints for blood pressure, the magnitude of the association
(prevalence ratio and odds ratio) as well as the power changes as the cut point for blood
pressure changes.
Lagakos (1988) studied, among other situations, the case of discretization
of a continuous explanatory variable. He studied three underlying distributions (uniform,
normal and exponential) and he presented the asymptotic relative efficiency (ARE) of a
statistical test after categorization, to the corresponding one before categorization. For
this particular case, when the continuous variable is dichotomized, the ARE goes from
0.5 to 0.75 for optimal intervals (based on Connor's paper, 1972), and from 0.48 to 0.75
for equiprobable intervals (median).
Maxwell and Delaney (1993) explored the consequences of dichotomizing
two independent variables in a linear regression model. They used one hypothetical set of
data, and they also used a particular relationship between the continuous outcome and the
two continuous independent variables. Then they compared the results from a 2x2
factorial ANOVA (using the medians as the cutpoints for the independent variables) with
the results from fitting a multiple regression model. They noted that the dichotomization
of a single continuous independent variable almost always produces a conservative bias
(underestimating the association), but the dichotomization of two or more independent
variables can have the opposite effect (overestimating the association).
Different models have been proposed to study the relationship between
certain nutrient intakes and a specific disease. Brown et al. (1994) studied the effect of
categorizing nutrient intakes in terms of the equivalence of the models (given that the
outcome is the disease status, and the link function used is the logistic regression model).
In fact, they presented four models in which three of them are equivalent if the nutrient
intakes are used as continuous, but if those variables are categorized, then that
equivalence no longer exists. For dose-response problems, Greenland (1995, 1995a)
discusses the effect of categorizing the continuous exposure and the consequences on the
trend analysis and also on the power.
For a case-control study, Qaqish (1994) studied the effect of categorizing a
single continuous independent variable. For this kind of study, the case-control ratio in
the sample is much higher than the population prevalence, thus the cut points used for
categorizing in the sample are not representative of the corresponding ones in the
population. This situation will influence the estimates. For three distributions of
the exposure, three population odds ratios and three case-control ratios, the author
evaluated the amount of bias produced by categorizing the independent variable. He
concluded that the bias can be severe. Also, for case-control studies, Becher (1992) uses
the residual confounding approach to evaluate the effect of categorization of confounder
continuous variables in the logistic regression model when the main effect is binary; he
concluded that the dichotomization is not recommended, but "if a categorization is
required for some reason, then four or five levels should be used...".
Finally, we have to mention other papers related to this topic. Neuman
(1982) evaluates the correlation coefficients after dichotomization of the two variables in
a bivariate model, under different underlying distributions. Reyner et al. (1986) and
Reyner et al. (1992) studied the effect of categorizing continuous variables under specific
models, basically using asymptotic relative efficiency as the measurement of the effect.
1.4.- OBJECTIVES.
The literature points to a general disadvantage of dichotomizing variables.
However, there are different positions with respect to categorizing. Given the discussion
above, the major objectives of this dissertation are:
a) to study the effect on the model structure of categorization of continuous
variables in multiple linear regression models, and
b) to study the effect on the exposure (main effects) of categorizing continuous
confounding variables in multiple linear regression models and also in multiple logistic
regression models. This implies that the outcome can be continuous or binary, while the
confounding variable must be continuous. The idea is to study the situations in which the
main effect is a continuous variable.

Given these objectives, the research question in this dissertation is: how does the
dichotomization of continuous variables affect the model structure and the estimates of
the parameters?
It is important to note that the second objective has been addressed before,
using different approaches and strategies; in fact, Maxwell and Delaney (1993) used a
multiple linear regression model and dichotomized both independent variables. The main
difference with respect to the present work is that we establish clearly the difference
between the two independent variables: one of them is the main effect and the other one
is the control variable, and only the latter is dichotomized. On the other hand, the
situation related to the logistic regression model is similar to the model studied by Becher
(1992), but he studied the effect of dichotomizing the confounder using the residual
confounding approach instead of the effect on the estimate directly. These two differences
with respect to papers already published justify the present dissertation. Regression
models are a very powerful tool to establish relationships between an outcome and an
exposure. However, the treatment of the variables in the models does not always agree
with the statistical requirements. The idea is to present the results in an easy, useful and
comprehensive way.
CHAPTER 2
DICHOTOMIZATION UNDER THE MULTIPLE LINEAR
REGRESSION MODEL: AN ANALYTIC APPROACH.
As mentioned in chapter 1, one of the objectives of this dissertation is to
study how the dichotomization of one continuous variable affects the structure of a
particular model. In this chapter, we present the result of the dichotomization of one
independent continuous variable in the multiple linear regression model.
2.1.- THE MODEL.
Suppose that we want to study the association between one continuous
outcome Y and one continuous exposure X1, controlling for one continuous potential
confounder (control) variable X2. Suppose that the random vector W = (Y, X1, X2)' = (Y, X')'
has a trivariate normal distribution with mean vector and variance-covariance matrix

$$\mu = \begin{pmatrix} \mu_Y \\ \mu_1 \\ \mu_2 \end{pmatrix}, \qquad \Sigma = \begin{pmatrix} \sigma_Y^2 & \sigma_{Y1} & \sigma_{Y2} \\ \sigma_{Y1} & \sigma_1^2 & \sigma_{12} \\ \sigma_{Y2} & \sigma_{12} & \sigma_2^2 \end{pmatrix}.$$

It is well known that the conditional distribution of Y given X1 and X2 is also normal, with
expected value given by

$$E[Y \mid X = x] = \mu_Y + \Sigma_{YX}\Sigma_{XX}^{-1}(x - \mu_X),$$

where x' = (x1, x2).
One can rewrite this in terms of parameters β0, β1 and β2 as

$$E[Y \mid X_1 = x_1, X_2 = x_2] = \beta_0 + \beta_1 x_1 + \beta_2 x_2, \qquad (2.1)$$

where

$$\beta_1 = \rho_{YX_1.X_2}\,\frac{\sigma_{Y.X_2}}{\sigma_{X_1.X_2}} \qquad \text{and} \qquad \beta_2 = \rho_{YX_2.X_1}\,\frac{\sigma_{Y.X_1}}{\sigma_{X_2.X_1}};$$

here σ²_{Y.X2}, σ²_{X1.X2} and σ²_{Y.X1} are the conditional variances, and ρ_{YX1.X2} and ρ_{YX2.X1} are the partial
correlation coefficients.
Suppose that for some reason the control variable X2 is dichotomized
using a cut point c, and thus the variable D is created according to the following rule:

$$D = \begin{cases} 0 & \text{if } X_2 \le c \\ 1 & \text{if } X_2 > c. \end{cases} \qquad (2.2)$$

Let P = P(X2 > c) = P(D = 1).
After the dichotomization of the control variable X2, one can fit the same
linear regression model but with D in the model instead of X2. In other words, the fitted
model takes the form

$$E[Y \mid X_1 = x_1, D = d] = \alpha_0 + \alpha_1 x_1 + \alpha_2 d, \qquad (2.3)$$

and then the association between the outcome and the exposure variable X1, controlling
for D, is evaluated through the estimate of α1, i.e., α̂1. Under the maximum likelihood or
the least-squares approaches, α̂1 is an unbiased estimate of α1. However, the "true"
relationship between Y and X1 is represented by β1 and not by α1.
2.2.- EVALUATION OF THE CONDITIONAL EXPECTED VALUE.
The basic idea is to evaluate E[Y | X1, D], which can be divided into two
parts:

$$E[Y \mid X_1 = x_1, D = 0] \quad \text{and} \quad E[Y \mid X_1 = x_1, D = 1]. \qquad (2.4)$$

First, we will solve

$$E[Y \mid X_1 = x_1, D = 0] = \int_{-\infty}^{\infty} y \, f(y \mid x_1, D = 0) \, dy. \qquad (2.5)$$

Given that this last expression represents a conditional expectation, we need to know the
conditional distribution of Y given X1 = x1 and D = 0 (or X2 ≤ c). This conditional
distribution has the following form:

$$f(y \mid X_1 = x_1, D = 0) = \frac{f(y, x_1 \mid D = 0)}{f(x_1 \mid D = 0)}.$$

On the other hand,

$$f(y, x_1 \mid D = 0) = \frac{1}{P(X_2 \le c)} \int_{-\infty}^{c} f(y, x_1, x_2) \, dx_2 \qquad \text{and} \qquad f(x_1 \mid D = 0) = \frac{1}{P(X_2 \le c)} \int_{-\infty}^{c} f(x_1, x_2) \, dx_2;$$

therefore,

$$E[Y \mid X_1 = x_1, D = 0] = \int_{-\infty}^{\infty} y \, \frac{\int_{-\infty}^{c} f(y, x_1, x_2) \, dx_2}{\int_{-\infty}^{c} f(x_1, x_2) \, dx_2} \, dy. \qquad (2.6)$$

Notice that

$$\int_{-\infty}^{c} f(y, x_1, x_2) \, dx_2 = \int_{-\infty}^{c} f(x_2 \mid y, x_1) f(y, x_1) \, dx_2 = f(y \mid x_1) f(x_1) \int_{-\infty}^{c} f(x_2 \mid y, x_1) \, dx_2. \qquad (2.7)$$

On the other hand,

$$\int_{-\infty}^{c} f(x_1, x_2) \, dx_2 = \int_{-\infty}^{c} f(x_2 \mid x_1) f(x_1) \, dx_2 = f(x_1)\,\Phi\!\left\{\frac{c - \mu_{2.1}}{\sigma_{2.1}}\right\}, \qquad (2.8)$$

where μ_{2.1} = E(X2 | X1) and σ²_{2.1} = V(X2 | X1), and Φ is the cumulative standard normal
distribution function.

Using (2.7) and (2.8) in (2.6), we have that

$$E[Y \mid X_1 = x_1, D = 0] = \left[\Phi\!\left\{\frac{c - \mu_{2.1}}{\sigma_{2.1}}\right\}\right]^{-1} \int_{-\infty}^{\infty} y \, f(y \mid x_1) \int_{-\infty}^{c} f(x_2 \mid y, x_1) \, dx_2 \, dy. \qquad (2.9)$$

Given that W = (Y, X1, X2) has a trivariate normal distribution, the
conditional distribution of X2 given Y and X1 is also normal, with expected value μ_{2.y1}
and variance equal to σ²_{2.y1}, and then the inner integral of (2.9) is a cumulative normal
probability, which can be written as Φ{(c − μ_{2.y1})/σ_{2.y1}}.

Thus, the final result is given by:

$$E[Y \mid X_1 = x_1, D = 0] = \left[\Phi\!\left\{\frac{c - \mu_{2.1}}{\sigma_{2.1}}\right\}\right]^{-1} \int_{-\infty}^{\infty} y \, f(y \mid x_1)\,\Phi\!\left\{\frac{c - \mu_{2.y1}}{\sigma_{2.y1}}\right\} dy. \qquad (2.10)$$

In (2.10), Φ{(c − μ_{2.1})/σ_{2.1}} is not a linear function of x1, and therefore its reciprocal
is also not linear in x1. In addition, the term Φ{(c − μ_{2.y1})/σ_{2.y1}} inside the integral depends on x1
and y, and it is also not a linear function. Thus E[Y | X1 = x1, D = 0] corresponds to the
quotient of two non-linear functions, and hence (2.10) is not linear in x1.
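To see this non-linearity concretely, the following Monte Carlo sketch (our own illustration, with assumed unit variances and correlation values, not part of the original derivation) estimates E[Y | X1 = x1, D = 0] within narrow bins of X1:

```python
import numpy as np

rng = np.random.default_rng(0)
# a trivariate normal (Y, X1, X2) with unit variances and assumed correlations
rho_y1, rho_y2, rho_12 = 0.4, 0.7, 0.4
cov = np.array([[1.0, rho_y1, rho_y2],
                [rho_y1, 1.0, rho_12],
                [rho_y2, rho_12, 1.0]])
y, x1, x2 = rng.multivariate_normal(np.zeros(3), cov, size=1_000_000).T

keep = x2 <= 0.0                    # the subsample with D = 0, cut point c = 0
bins = np.linspace(-2.0, 2.0, 9)    # narrow bins of X1
idx = np.digitize(x1[keep], bins)
for k in range(1, len(bins)):
    center = 0.5 * (bins[k - 1] + bins[k])
    print(f"x1 = {center:+.2f}   E[Y | x1, D=0] ~ {y[keep][idx == k].mean():+.4f}")
# the successive differences of the printed means are not constant, i.e.,
# E[Y | X1 = x1, D = 0] is not linear in x1, in agreement with (2.10)
```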
It is also important to note that if c goes to infinity (so that D = 0 with probability
1), then

$$\Phi\!\left\{\frac{c - \mu_{2.1}}{\sigma_{2.1}}\right\} \to 1 \qquad \text{and} \qquad \Phi\!\left\{\frac{c - \mu_{2.y1}}{\sigma_{2.y1}}\right\} \to 1,$$

and thus

$$E[Y \mid X_1 = x_1, D = 0] = \int_{-\infty}^{\infty} y \, f(y \mid x_1) \, dy = E[Y \mid X_1 = x_1], \qquad (2.11)$$

which represents a simple linear regression model of Y on X1.

On the other hand, suppose that X2 is totally independent of Y and X1. In
that case μ_{2.1} = μ_{2.y1} = μ2 and σ_{2.1} = σ_{2.y1} = σ2, and then

$$E[Y \mid X_1 = x_1, D = 0] = \left[\Phi\!\left\{\frac{c - \mu_2}{\sigma_2}\right\}\right]^{-1} \Phi\!\left\{\frac{c - \mu_2}{\sigma_2}\right\} \int_{-\infty}^{\infty} y \, f(y \mid x_1) \, dy = E[Y \mid X_1 = x_1]. \qquad (2.12)$$

In these two situations, X2 does not contribute to the model, and therefore the multiple
linear regression model is in fact a simple linear model involving only X1.
We now turn our attention to the second part of the regression model after
dichotomization, which is when D = 1.
Recall that Y and X1 are both continuous random variables, and D is a binary variable. We want

$$E[Y \mid X_1 = x_1, D = 1] = \int_{-\infty}^{\infty} y \, f(y \mid x_1, D = 1) \, dy. \qquad (2.13)$$

By using similar arguments as before, we have that:

$$E[Y \mid X_1 = x_1, D = 1] = \int_{-\infty}^{\infty} y \, \frac{\int_{c}^{\infty} f(y, x_1, x_2) \, dx_2}{\int_{c}^{\infty} f(x_1, x_2) \, dx_2} \, dy. \qquad (2.14)$$

Now

$$\int_{c}^{\infty} f(y, x_1, x_2) \, dx_2 = f(y \mid x_1) f(x_1) \int_{c}^{\infty} f(x_2 \mid y, x_1) \, dx_2. \qquad (2.15)$$

On the other hand,

$$\int_{c}^{\infty} f(x_1, x_2) \, dx_2 = \int_{c}^{\infty} f(x_2 \mid x_1) f(x_1) \, dx_2 = f(x_1)\left[1 - \Phi\!\left\{\frac{c - \mu_{2.1}}{\sigma_{2.1}}\right\}\right], \qquad (2.16)$$

where, as before, μ_{2.1} = E(X2 | X1) and σ²_{2.1} = V(X2 | X1), and Φ is the cumulative standard
normal distribution function.

Using (2.15) and (2.16) in (2.14), and noting that the remaining integral over x2 corresponds to the
complement of a cumulative normal probability, we have that

$$E[Y \mid X_1 = x_1, D = 1] = \left[1 - \Phi\!\left\{\frac{c - \mu_{2.1}}{\sigma_{2.1}}\right\}\right]^{-1} \int_{-\infty}^{\infty} y \, f(y \mid x_1) \left[1 - \Phi\!\left\{\frac{c - \mu_{2.y1}}{\sigma_{2.y1}}\right\}\right] dy \qquad (2.18)$$

$$= \left[1 - \Phi\!\left\{\frac{c - \mu_{2.1}}{\sigma_{2.1}}\right\}\right]^{-1} \left[\int_{-\infty}^{\infty} y \, f(y \mid x_1) \, dy - \int_{-\infty}^{\infty} y \, f(y \mid x_1)\,\Phi\!\left\{\frac{c - \mu_{2.y1}}{\sigma_{2.y1}}\right\} dy\right]. \qquad (2.19)$$
Once again, we obtain an expression that is not linear in x1.

In this case, given that we are looking at the result when D = 1
(X2 > c), this occurs with probability 1 if c goes to −∞; then both Φ{(c − μ_{2.y1})/σ_{2.y1}}
and Φ{(c − μ_{2.1})/σ_{2.1}} go to 0. Therefore, asymptotically

$$E[Y \mid X_1 = x_1, D = 1] = \int_{-\infty}^{\infty} y \, f(y \mid x_1) \, dy = \beta_0 + \beta_{Y1} x_1, \qquad (2.20)$$

where β0 = μ_Y − β_{Y1}μ1. This is the same result we obtained in (2.11).
Again, suppose that X2 is totally independent of Y and X1, so that
μ_{2.1} = μ_{2.y1} = μ2 and σ_{2.1} = σ_{2.y1} = σ2. In that case, (2.19) can be written as:

$$E[Y \mid X_1 = x_1, D = 1] = \left[1 - \Phi\!\left\{\frac{c - \mu_2}{\sigma_2}\right\}\right]^{-1} \left[1 - \Phi\!\left\{\frac{c - \mu_2}{\sigma_2}\right\}\right] \int_{-\infty}^{\infty} y \, f(y \mid x_1) \, dy = E[Y \mid X_1 = x_1], \qquad (2.21)$$

which is the same result given by (2.12).
The results obtained in (2.10) and (2.19) are very important. They
basically imply that when one wants to fit a multiple linear regression model, and one
continuous variable is dichotomized, the resulting model is not linear. Therefore, if
someone insists on fitting a multiple linear regression model with D in the model instead
of X2, what they are essentially doing is misspecifying the model.
CHAPTER 3
PRACTICAL EFFECT OF DICHOTOMIZATION UNDER
THE MULTIPLE LINEAR REGRESSION MODEL
3.1.- GENERAL CONSIDERATIONS.
In the previous chapter, we have shown that if one independent continuous
variable is dichotomized in a multiple linear regression model, the resulting model is a
non-linear one. This finding, while very important, is accessible mainly to statisticians and is
likely not to be readily understood by general users of statistics, because the language used is in
general complex. In fact, in order to understand the message, the reader needs some
technical background in statistics. Therefore, as an applied statistician wishing to give a
cautionary message to people who do not have training in statistics, one must use simple
and direct language. This is the reason we decided to use a simulation approach to
complement the results obtained in chapter two.
In a general regression model, most of the time one is interested in
studying either the overall influence of the covariates over the response or the association
of some covariates (exposure variables) with the outcome controlling for the other
covariates (confounder variables). In this dissertation, the second situation is our primary
focus.
We have to remember that the goal of this dissertation is to study how the
categorization of a continuous control (or confounder) variable affects the relationship
between the continuous outcome and one particular independent variable (exposure
variable), in multiple regression models. In this chapter, we will assume that the exposure
is also a continuous variable, and thus the model is a multiple linear regression model.
3.2.- THE MODEL.
Let Y be the outcome under study and let X1 and X2 be the independent
variables. In our case, we will use the same strategy that was described in chapter 2,
section 2.1.
If we have a random sample of size n from the target population, another
way to write the linear regression model (2.1) is the following:

$$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + e_i, \qquad i = 1, 2, \ldots, n. \qquad (3.1)$$

In that case, the usual assumptions about the error terms e_i are:

$$E(e_i) = 0 \;\; \forall i = 1, 2, \ldots, n; \qquad E(e_i e_j) = \mathrm{Cov}(e_i, e_j) = 0, \; i \ne j; \qquad V(e_i) = \sigma^2. \qquad (3.2)$$
In either (2.1) or (3.1), β1 is a measure of association between the outcome Y and the
exposure variable X1, controlling for X2. For a given sample size, the two most popular
methodologies to estimate β1 are the least squares (LS) method and the maximum
likelihood (ML) method. In terms of the estimation process, both methods are equivalent
and arrive at the same results; but in order to make inferences about the parameters, it is
necessary to know the distribution of the estimators. In other words, we must use the
ML approach. The LS approach can also be used if we have a large sample, and then it is
possible to use large-sample theory.
If we write the model in matrix notation, we have that

$$Y = X\beta + e,$$

and then the ML estimate of the vector β is given by

$$\hat{\beta} = (X'X)^{-1}X'Y. \qquad (3.3)$$

In particular, let β̂1 be the ML estimate of β1; it is well known that β̂1 is
the best linear unbiased estimate (BLUE) of β1, and therefore E(β̂1) = β1.
3.3.- DICHOTOMIZING THE CONTROL VARIABLE X2.

Suppose that the continuous control variable X2 is now dichotomized
according to the following rule:

$$D = \begin{cases} 1 & \text{if } X_2 > c, \text{ with probability } P \\ 0 & \text{if } X_2 \le c, \text{ with probability } Q = 1 - P. \end{cases} \qquad (3.4)$$

Then, if D is used in the model instead of X2, the multiple linear regression model takes
the form:

$$Y_i = \alpha_0 + \alpha_1 X_{1i} + \alpha_2 D_i + e_i, \qquad i = 1, 2, \ldots, n. \qquad (3.5)$$

In order to estimate the regression coefficient corresponding to the main
effect X1, we again can use the same approach as before. Let α̂1 be the ML estimate of
α1; in this case, E(α̂1) = α1. Even though α̂1 is BLUE for α1, it is not necessarily a good
estimate of the true parameter under study, namely β1. It is interesting to evaluate the
difference between β1 and α1; in general, we have that:

$$\beta_1 = \frac{\sigma_{YX_1}\sigma^2_{X_2} - \sigma_{YX_2}\sigma_{X_1X_2}}{\sigma^2_{X_1}\sigma^2_{X_2} - \sigma^2_{X_1X_2}}, \qquad \alpha_1 = \frac{\sigma_{YX_1}\sigma^2_{D} - \sigma_{YD}\sigma_{X_1D}}{\sigma^2_{X_1}\sigma^2_{D} - \sigma^2_{X_1D}}. \qquad (3.6)$$

The comparisons between α1 and β1 will be illustrated through
simulations, which are presented in Section 3.8.
3.4.- EPIDEMIOLOGICAL INTERPRETATION FOR X2.

We have been referring to X1 as the main effect or the exposure variable
of interest, while X2 is the control variable. In epidemiologic studies, the control variable
can be interpreted either as an effect modifier or as a confounder. One particular and
important situation is to analyze what happens when X2 is not a confounder: in this case,
can X2 become a confounder after being dichotomized?

Again, we have the original model:

$$E(Y \mid X_1 = x_1, X_2 = x_2) = \beta_0 + \beta_1 x_1 + \beta_2 x_2.$$

One approach to evaluate whether a control variable is a confounder is to fit the model without
the control variable, and then examine whether the coefficient of the exposure variable remains
the same. In other words, we will assume that X2 is not a confounder if we have the
model

$$E(Y \mid X_1 = x_1) = \beta_0' + \beta_1 x_1,$$

with the same β1, but where β0' is an intercept different from β0.

The next step is to dichotomize X2. If we do that, we get the model (2.3):

$$E(Y \mid X_1 = x_1, D = d) = \alpha_0 + \alpha_1 x_1 + \alpha_2 d;$$
but, given the fact that the original control variable is not a confounder, we need to
evaluate whether D is a confounder. That means having the following model:

$$E(Y \mid X_1 = x_1) = \gamma_0 + \gamma_1 x_1.$$

Finally, we compare α1 and γ1.
In the model including the control variable, we have that

$$\beta_1 = \frac{\sigma_{YX_1.X_2}}{\sigma^2_{X_1.X_2}},$$

and in the model that does not include X2, the coefficient corresponding to X1 is

$$\gamma_1 = \frac{\sigma_{YX_1}}{\sigma^2_{X_1}}.$$

The coefficient β1 corresponds to the adjusted measure of association between Y and X1
(controlling for X2), and γ1 is the corresponding crude measure of association between Y
and X1 (ignoring X2). In practice, we say that X2 is not a confounder when β1 = γ1
(Kleinbaum, Kupper and Muller, page 164), or

$$\frac{\sigma_{YX_1.X_2}}{\sigma^2_{X_1.X_2}} = \frac{\sigma_{YX_1}}{\sigma^2_{X_1}}.$$

We will try to write the conditional covariance and variance (σ_{YX1.X2} and σ²_{X1.X2}) in terms of the
unconditional covariances and variances. Suppose we have the random vector (Y, X1, X2)
with a joint multivariate distribution. Let (Y, X1, X2) = (Z, X2) with Z = (Y, X1)'. The
variance-covariance matrix is

$$\Sigma = \begin{pmatrix} \Sigma_{ZZ} & \Sigma_{ZX_2} \\ \Sigma_{X_2Z} & \sigma^2_{X_2} \end{pmatrix}.$$

So, the conditional variance-covariance matrix of the vector Z given X2 is given by

$$\Sigma_{Z \mid X_2} = \Sigma_{ZZ} - \Sigma_{ZX_2}\,\sigma^{-2}_{X_2}\,\Sigma_{X_2Z}.$$

Then,

$$\sigma_{YX_1.X_2} = \sigma_{YX_1} - \frac{\sigma_{YX_2}\,\sigma_{X_1X_2}}{\sigma^2_{X_2}} \qquad \text{and} \qquad \sigma^2_{X_1.X_2} = \sigma^2_{X_1} - \frac{\sigma^2_{X_1X_2}}{\sigma^2_{X_2}}.$$

Using these expressions, the condition β1 = γ1 becomes

$$\sigma_{X_1X_2}\left(\sigma_{YX_2}\,\sigma^2_{X_1} - \sigma_{YX_1}\,\sigma_{X_1X_2}\right) = 0. \qquad (3.7)$$

Therefore, the necessary and sufficient condition for X2 not to be a confounder is that
σ_{X1X2} = 0 or σ_{YX2}σ²_{X1} = σ_{YX1}σ_{X1X2}, the latter being equivalent to β2 = 0.

Note that if ρ_{X1,X2} = 0, then the covariance between X1 and X2 is zero, and β1 in (3.6)
simplifies to σ_{YX1}/σ²_{X1}, which is the regression coefficient for a simple linear regression model
involving only X1. This agrees with the notion that if X2 is not a confounder it should not
be included in the model of Y on X1. This situation will be specifically addressed in our
simulation process.
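The condition above can be checked numerically; the sketch below (our own illustration, with assumed unit variances so that covariances equal correlations) confirms that when ρ_{Y,X2} = ρ_{Y,X1}ρ_{X1,X2} the adjusted and crude coefficients coincide:

```python
def adjusted_beta1(s_yx1, s_yx2, v_x1, v_x2, s_x1x2):
    """beta1 = conditional covariance over conditional variance, as derived above."""
    return (s_yx1 - s_yx2 * s_x1x2 / v_x2) / (v_x1 - s_x1x2**2 / v_x2)

# unit variances and assumed example correlations
rho_yx1, rho_x1x2 = 0.4, 0.5
rho_yx2 = rho_yx1 * rho_x1x2      # exactly the non-confounding condition
beta1 = adjusted_beta1(rho_yx1, rho_yx2, 1.0, 1.0, rho_x1x2)
gamma1 = rho_yx1 / 1.0            # crude coefficient sigma_YX1 / var(X1)
print(beta1, gamma1)              # both equal 0.4
```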
3.5.- SIMULATION APPROACH.
In this section, we will study the effect of dichotomizing the control
variable on the relationship between the outcome and the exposure, using a simulation
process. The basic idea is to estimate the parameter β1 under the true model defined by
(3.1). After that, the control variable is dichotomized, the model (3.5) is fitted, and the
parameters in (3.5) are estimated; finally, we compare the estimates from the two models.
Recall the example used for illustration in chapter 1. Assume that we have
a study of cardiovascular disease with Y = systolic blood pressure (SBP), X1 = cholesterol
and X2 = age. To study the relationship of cholesterol as a predictor of SBP, one can fit a
simple linear regression model. If age is considered a potential confounder, then a
multiple linear regression model can be fitted. Often age is dichotomized, at say 60 years,
and then a multiple linear regression model is fitted with cholesterol and the
dichotomized age as predictors of SBP. We will conduct a simulation in this context.
First of all, we need to create a sample from the target model (3.1). Let n
be the sample size. Let Z1, Z2 and Z3 be three independent and identically distributed
standard normal random variables. Let μ_j, c_j (j = 1, 2) and d be constants. We define the
random variables X1 and X2 as:

$$X_j = \mu_j + c_j Z_j + d Z_3, \qquad j = 1, 2. \qquad (3.8)$$
Using (3.8), we have that:

$$V(X_j) = c_j^2 + d^2 \;\; (j = 1, 2), \qquad \mathrm{Cov}(X_1, X_2) = d^2. \qquad (3.9)$$
On the other hand, let e be a normal random variable (with mean 0 and
variance σ²), independent of Z1, Z2 and Z3, and then independent of X1 and X2 as well.
Finally, we define the dependent variable Y as:

$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + e. \qquad (3.10)$$

That definition implies:

$$V(Y) = \beta_1^2 V(X_1) + \beta_2^2 V(X_2) + 2\beta_1\beta_2\,\mathrm{Cov}(X_1, X_2) + \sigma^2,$$
$$\mathrm{Cov}(Y, X_1) = \beta_1 V(X_1) + \beta_2\,\mathrm{Cov}(X_1, X_2),$$
$$\mathrm{Cov}(Y, X_2) = \beta_2 V(X_2) + \beta_1\,\mathrm{Cov}(X_1, X_2). \qquad (3.11)$$
The last step of the process is to define the regression coefficients, in
particular β1 and β2. Thus, if we define the expected values and the variances of Y, X1
and X2, then for given correlation coefficients among the three variables in the
model (ρ_{Y,X1}, ρ_{Y,X2}, ρ_{X1,X2}) it is possible to obtain the regression coefficients needed in
(3.10). The constant β0 is obtained using the definition β0 = E(Y) − (β1μ1 + β2μ2). The
reason we define the moments of the variables in the model in advance is that we
intend to simulate a real situation, such as the cardiovascular study described
above.
The problem that we want to study in this dissertation, i.e., the effect of
dichotomization on measures of association, can be affected by different conditions, and
thus our objective needs to be analyzed under these conditions. The principal conditions
are:

i) The first important aspect that we are interested in controlling in advance is
the correlation structure among the variables used in the model; that is, we control
ρ_{Y,X1}, ρ_{Y,X2} and ρ_{X1,X2}. In the simulations, we will use:

ρ_{X1,X2} = 0, 0.1, 0.4 and 0.7;
ρ_{Y,X2} = 0, 0.1, 0.4 and 0.7; and
ρ_{Y,X1} = 0.1, 0.4 and 0.7,

to have no, relatively small, moderate and high correlations among the variables.

ii) The cut point used in the dichotomization process for the control variable X2. That
variable will be dichotomized at five different cut points, determined by the following
five deciles: 1, 3, 5, 7 and 9.

iii) The sample size. Two different sample sizes will be used: n = 100 (small sample) and
n = 1,000 (large sample).

If we combine the three conditions described above, we have 96 different
scenarios that we need to analyze under the five different cut points (4 correlation values
between Y and X2, times 4 correlation values between X1 and X2, times 3 correlation
values between Y and X1, times 2 sample sizes).
Returning to our example, the moments used in the simulation process are:
- Y (systolic blood pressure): E(Y) = μ_Y = 150 mmHg, V(Y) = 900 (mmHg)^2;
- X1 (cholesterol): E(X1) = μ1 = 200 mg/dl, V(X1) = 1600 (mg/dl)^2; and
- X2 (age): E(X2) = μ2 = 40 years, V(X2) = 100 (years)^2.

Given these values, and given that the correlation coefficients are also known, it is
possible to compute the variance-covariance matrix among the three variables, and it is
also possible to obtain the constants c1, c2 and d from (3.11) and use them in (3.8). In order
to obtain the regression coefficients, we use the fact that they can be expressed as
β = Σ_XX^{-1}Σ_XY, and the constant β0 is equal to μ_Y − β1μ1 − β2μ2.

The correlation coefficients, and the true regression coefficient for the
exposure variable, are presented in Table 3.1 for the 48 combinations. These 48
combinations give rise to 96 different scenarios: scenarios 1 to 48 correspond to
combinations 1 to 48 for n = 100, and scenarios 49 to 96 to the same combinations for n = 1,000.
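As an illustration, the entries of Table 3.1 can be reproduced from these moments; the following sketch (ours, not part of the original SAS code) computes the true coefficients and the implied error variance for one combination:

```python
import numpy as np

sd_y, sd_x1, sd_x2 = 30.0, 40.0, 10.0     # square roots of 900, 1600 and 100
mu_y, mu_x1, mu_x2 = 150.0, 200.0, 40.0

def true_coefficients(r_yx1, r_yx2, r_x1x2):
    """beta = Sigma_XX^{-1} Sigma_XY; beta0 = mu_Y - beta1*mu1 - beta2*mu2."""
    Sxx = np.array([[sd_x1**2, r_x1x2 * sd_x1 * sd_x2],
                    [r_x1x2 * sd_x1 * sd_x2, sd_x2**2]])
    Sxy = np.array([r_yx1 * sd_y * sd_x1, r_yx2 * sd_y * sd_x2])
    b1, b2 = np.linalg.solve(Sxx, Sxy)
    b0 = mu_y - b1 * mu_x1 - b2 * mu_x2
    var_e = sd_y**2 - Sxy @ np.linalg.solve(Sxx, Sxy)  # sigma^2 in (3.10)
    return b0, b1, b2, var_e

# combination 37 of Table 3.1: rho_x1x2 = 0.7, rho_yx2 = 0.0, rho_yx1 = 0.1
print(true_coefficients(0.1, 0.0, 0.7))   # b1 = 0.1471, matching the table
```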
Table 3.1: True values of β1 for specified ρ_{X1,X2}, ρ_{Y,X2} and ρ_{Y,X1}.

Comb.   ρ_{X1,X2}   ρ_{Y,X2}   ρ_{Y,X1}      β1
  1        0.0         0.0        0.1       0.0750
  2        0.0         0.0        0.4       0.3000
  3        0.0         0.0        0.7       0.5250
  4        0.0         0.1        0.1       0.0750
  5        0.0         0.1        0.4       0.3000
  6        0.0         0.1        0.7       0.5250
  7        0.0         0.4        0.1       0.0750
  8        0.0         0.4        0.4       0.3000
  9        0.0         0.4        0.7       0.5250
 10        0.0         0.7        0.1       0.0750
 11        0.0         0.7        0.4       0.3000
 12        0.0         0.7        0.7       0.5250
 13        0.1         0.0        0.1       0.0758
 14        0.1         0.0        0.4       0.3030
 15        0.1         0.0        0.7       0.5303
 16        0.1         0.1        0.1       0.0682
 17        0.1         0.1        0.4       0.2955
 18        0.1         0.1        0.7       0.5227
 19        0.1         0.4        0.1       0.0455
 20        0.1         0.4        0.4       0.2727
 21        0.1         0.4        0.7       0.5000
 22        0.1         0.7        0.1       0.0227
 23        0.1         0.7        0.4       0.2500
 24        0.1         0.7        0.7       0.4773
 25        0.4         0.0        0.1       0.0893
 26        0.4         0.0        0.4       0.3571
 27        0.4         0.0        0.7       0.6250
 28        0.4         0.1        0.1       0.0536
 29        0.4         0.1        0.4       0.3214
 30        0.4         0.1        0.7       0.5893
 31        0.4         0.4        0.1      -0.0536
 32        0.4         0.4        0.4       0.2143
 33        0.4         0.4        0.7       0.4821
 34        0.4         0.7        0.1      -0.1607
 35        0.4         0.7        0.4       0.1071
 36        0.4         0.7        0.7       0.3750
 37        0.7         0.0        0.1       0.1471
 38        0.7         0.0        0.4       0.5882
 39        0.7         0.0        0.7       1.0294
 40        0.7         0.1        0.1       0.0441
 41        0.7         0.1        0.4       0.4853
 42        0.7         0.1        0.7       0.9265
 43        0.7         0.4        0.1      -0.2647
 44        0.7         0.4        0.4       0.1765
 45        0.7         0.4        0.7       0.6177
 46        0.7         0.7        0.1      -0.5735
 47        0.7         0.7        0.4      -0.1324
 48        0.7         0.7        0.7       0.3088
3.6.- SIMULATION PROCESS.

Previously, in (3.1) we defined the multiple linear model under study, and
in (3.8) we described the way to create the continuous independent variables of the
model. The independent variables X1 and X2 are functions of the random variables Z_j
(j = 1, 2, 3). The random variables Z_j will be created using SAS® software, through the
function RANNOR. They are simulated only once, and then, given the parameter values
in the model described in Table 3.1, we get a single matrix with the observed values
of X_j (j = 1, 2); the number of rows is equal to the sample size, i.e., n. Now, if N is the
number of simulations, we simulate N times the vector of error terms (with n
components each) under the normal distribution assumption. Using this approach, the
matrix X = {X1 X2} is fixed, and for every simulation we use the same matrix X but a
different vector of errors, and then different values of Y according to (3.1). For each of
the N sets of values, six different models are fitted. The first uses the original values Y,
X1 and X2 (model 3.1), and the other five use the variables Y, X1 and D, for each of the
five binary variables created after the dichotomization of X2 (model 3.5), in place of X2
itself.
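A minimal sketch of one scenario of this scheme (written in Python rather than SAS, with the scenario constants taken from the previous sketch for combination 37) is:

```python
import numpy as np

rng = np.random.default_rng(2193)
n, N = 100, 500                            # sample size, number of replications

mu_x = [200.0, 40.0]                       # combination 37: rho_x1x2 = 0.7,
cov_x = [[1600.0, 280.0],                  # rho_yx2 = 0.0, rho_yx1 = 0.1
         [280.0, 100.0]]
b0, b1, b2, sigma = 137.06, 0.1471, -0.4118, np.sqrt(882.35)

x1, x2 = rng.multivariate_normal(mu_x, cov_x, size=n).T  # design simulated once
X_full = np.column_stack([np.ones(n), x1, x2])
D = (x2 > np.quantile(x2, 0.1)).astype(float)            # decile-1 cut point
X_dich = np.column_stack([np.ones(n), x1, D])

b1_full, a1_dich = [], []
for _ in range(N):                         # a fresh error vector per replication
    y = b0 + b1 * x1 + b2 * x2 + rng.normal(0.0, sigma, size=n)
    b1_full.append(np.linalg.lstsq(X_full, y, rcond=None)[0][1])
    a1_dich.append(np.linalg.lstsq(X_dich, y, rcond=None)[0][1])

print(np.mean(b1_full), np.mean(a1_dich))  # averages of the 500 estimates of b1
```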
3.7.- MEASURE OF THE EFFECT.

Once all the conditions are defined, the natural question is how the effect
that we want to study should be analyzed. There are many ways to evaluate the effect of
dichotomizing X2 on the estimation of the measure of association β1. We present five
procedures.

3.7.1.- Relative Difference (R.D.)

In order to look at the impact of dichotomization, we evaluate the relative
difference (R.D.) between the mean of the estimates (over 500 replications) and the true
parameter, which is given by:

$$\mathrm{R.D.} = \frac{\text{mean of estimates} - \text{parameter}}{\text{parameter}} \times 100. \qquad (3.12)$$

This is a descriptive method to examine the relative effect on the parameter estimate.
3.7.2.- Mean Squared Error (MSE).

We already mentioned that the estimates obtained by fitting (3.5) are
biased with respect to the parameter described by (3.1). This implies that a natural
measure of error is given by the mean squared error (MSE). For a general parameter v
and estimator v̂, the MSE is defined as:

$$\mathrm{MSE}(\hat{v}) = \{b(\hat{v})\}^2 + \mathrm{Var}(\hat{v}), \qquad (3.13)$$

where b(v̂) = E(v̂) − v is the bias.

In our case, when we use the original X2, and given that β̂1 is an unbiased estimate of β1
in (3.1), MSE(β̂1) = Var(β̂1). In the other cases, MSE(α̂1) = {E(α̂1) − β1}² + Var(α̂1).

Through the 500 replications, we can compute the mean and the variance
of the estimates, and use them in order to estimate the MSE. If h indexes the replication,
then let

$$\bar{\beta}_1 = \frac{1}{500}\sum_{h=1}^{500}\hat{\beta}_{1h} \qquad \text{and} \qquad \bar{\alpha}_1 = \frac{1}{500}\sum_{h=1}^{500}\hat{\alpha}_{1h}$$

be the means of the estimates, before and after the dichotomization, and let

$$s^2_{\beta} = \frac{\sum_{h=1}^{500}(\hat{\beta}_{1h} - \bar{\beta}_1)^2}{499} \qquad \text{and} \qquad s^2_{\alpha} = \frac{\sum_{h=1}^{500}(\hat{\alpha}_{1h} - \bar{\alpha}_1)^2}{499}$$

be the corresponding estimated variances. Then the estimates of the MSE are given by:

$$\widehat{\mathrm{MSE}}(\hat{\beta}_1) = (\bar{\beta}_1 - \beta_1)^2 + s^2_{\beta} \qquad \text{and} \qquad \widehat{\mathrm{MSE}}(\hat{\alpha}_1) = (\bar{\alpha}_1 - \beta_1)^2 + s^2_{\alpha}. \qquad (3.14)$$

In summary, we know in advance the true values of the parameters
corresponding to the main effect, β1; the control variable X2 is dichotomized and then
we fit the model (3.5) instead of the model (3.1). Using the estimates α̂1 obtained by fitting
(3.5), we compute the estimate of the mean squared error using (3.14). That is the natural
approach when we want to evaluate the impact on the estimates of the main
effect of dichotomizing X2.
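In code, (3.12) and (3.14) amount to the following (a sketch of our own; `estimates` would hold the 500 replication values of the coefficient of interest):

```python
import numpy as np

def relative_difference(estimates, true_b1):
    """R.D. of (3.12): percent difference between the replication mean and the truth."""
    return (np.mean(estimates) - true_b1) / true_b1 * 100.0

def mse_hat(estimates, true_b1):
    """Estimated MSE of (3.14): squared bias of the mean plus the sample variance."""
    est = np.asarray(estimates)
    return (est.mean() - true_b1) ** 2 + est.var(ddof=1)
```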
3.7.3.- Sensitivity (S), Specificity (Sp) and Total Misclassification Proportion (T).

We assume that using the original variable is the correct way to estimate
β1. Then β̂1 can be considered the gold-standard estimator. In order to test the null
hypothesis H0: β1 = 0, we use the t statistic given by:

$$t = \frac{\hat{\beta}_1}{se(\hat{\beta}_1)} \sim t(n - 3). \qquad (3.15)$$

From (3.15) it is possible to compute the p-value (p1), and it is also possible to decide
about H0: if p1 ≤ 0.05, then H0 is rejected; otherwise, it is not. On the other hand, after
the dichotomization of X2, let α̂1 be the estimate of α1 in (3.5), and let se(α̂1) be the
corresponding standard error. Then, to test H0: α1 = 0, we use:

$$t^* = \frac{\hat{\alpha}_1}{se(\hat{\alpha}_1)} \sim t(n - 3). \qquad (3.16)$$

Again, we use (3.16) to get the p-value, and then we decide about the null hypothesis
using the same criterion as described above. Therefore, for every type of dichotomization,
we can define the following table of decisions over the 500 replications:

                                  X2 in the model
                             Not Reject      Reject
D in the      Not Reject         a             b
model         Reject             c             d
              Total             a+c           b+d

Because we assume that the results from the model using X2 are the gold standard, we
define the two following probabilities:
- P(reject H0 with D in the model | reject H0 with X2 in the model) = sensitivity (S), and
- P(not reject H0 with D in the model | not reject H0 with X2 in the model) = specificity (Sp).

These two quantities can be estimated by:

$$\hat{S} = \frac{d}{b + d} \qquad \text{and} \qquad \hat{S}p = \frac{a}{a + c}. \qquad (3.17)$$

On the other hand, it is also possible to estimate the total misclassification
probability, which corresponds to T̂ = (b + c)/500. This last value tells us the percentage of wrong
decisions. We often see in the applied statistics literature problems similar to those presented
in this dissertation, where, based on a sample, investigators (researchers) arrive at some
conclusion. If the conclusion comes from a model where the control variable was
dichotomized, a valid question concerns the quality of that decision.
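The quantities in (3.17), together with the total misclassification proportion, can be computed from the two series of p-values as in this sketch (ours; the 0.05 level is as stated above):

```python
import numpy as np

def agreement(p_original, p_dichotomized, alpha=0.05):
    """Sensitivity, specificity and total misclassification, treating the
    decisions from the model with the original X2 as the gold standard."""
    rej_x2 = np.asarray(p_original) <= alpha
    rej_d = np.asarray(p_dichotomized) <= alpha
    a = np.sum(~rej_x2 & ~rej_d)    # neither model rejects H0
    b = np.sum(rej_x2 & ~rej_d)     # only the X2 model rejects
    c = np.sum(~rej_x2 & rej_d)     # only the dichotomized model rejects
    d = np.sum(rej_x2 & rej_d)      # both models reject
    sens = d / (b + d) if (b + d) else float("nan")  # not estimable if H0 is
    spec = a / (a + c) if (a + c) else float("nan")  # never (or always) rejected
    return sens, spec, (b + c) / len(rej_x2)
```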
3.7.4.- Standardized Mean (S.M.)

Another procedure is related to the direct comparison between the t values
from (3.15) and (3.16). In fact, if for replication h we rename (3.15) as t_h and (3.16) as
t*_h, we have the following differences:

$$D_h = t_h - t^*_h, \qquad h = 1, \ldots, 500. \qquad (3.18)$$

The idea is to use the random variables D_h in order to compare the two procedures. If H0
establishes that there is no difference between the two procedures, then we can use the
differences D_h in a non-parametric approach to test the null hypothesis. However, the
significance of the 500 differences (3.18) will not be assessed, since with 500
observations any small difference will be statistically significant. The absolute value of
the standardized mean is presented instead; the S.M. is the ratio between the mean and
the standard deviation of the 500 values of D_h.
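Computed directly, the statistic just described is (a short sketch of ours):

```python
import numpy as np

def standardized_mean(t_original, t_dichotomized):
    """|mean| / standard deviation of the 500 differences D_h = t_h - t*_h of (3.18)."""
    diffs = np.asarray(t_original) - np.asarray(t_dichotomized)
    return abs(diffs.mean()) / diffs.std(ddof=1)
```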
3.7.5.- Confidence Interval for β1.

We can also measure the effect by fitting the model (3.1) using the original
variables, fitting the model (3.5) with X2 dichotomized (at the five different deciles), and
then constructing 95% confidence intervals for β1 using these six models. Finally, we
evaluate whether the confidence interval includes the parameter being estimated. Even
though this last approach is not strictly necessary for this particular case, we will use it in
order to have one homogeneous methodology for measuring the effect of dichotomization,
given that, as we discuss in Chapter 4, for the logistic regression model the MSE approach
is not feasible.
3.8.- RESULTS.

3.8.1.- Description of results.

In this section, we present the results of our simulations. They are
presented in a series of tables and corresponding graphs for the 96 scenarios described in
Table 3.1. The scenarios are grouped into 8 subsets of 12 scenarios each, according to the
four values of ρ_{X1,X2} and the two sample sizes (n = 100 and n = 1,000). In each set of tables, we
present the average of the estimates, the relative difference of the estimate (R.D.) (Section
3.7.1), the mean squared error (Section 3.7.2), the sensitivity, the specificity and the
estimate of the total misclassification probability, assuming that the results from the model
with the original values (without dichotomization) are the gold standard (Section 3.7.3),
and the standardized mean (Section 3.7.4). Second, the corresponding graphs with the
confidence intervals are presented (Section 3.7.5). In each group, we present 95%
confidence intervals for β1 for the different cut points based on the 500 replications, as
well as a horizontal line for the true β1 value. Decile 0 denotes the situation when X2 is not
dichotomized. For 6 of the 96 possible scenarios it was not possible to obtain results (3
for each sample size), because the specified values of the correlation coefficients did not
agree with the other parameters used in the simulation process. This situation affected
scenarios 39, 42 and 46 for n = 100, and scenarios 87, 90 and 94 for n = 1,000.

In pages 42 to 46 we present the results for scenarios 1 to 12, which
correspond to n = 100, ρ_{X1,X2} = 0 and different combinations of ρ_{Y,X1} and ρ_{Y,X2}. The
scenarios 1 to 3 correspond to the situation when X2 is totally independent of X1 and Y.
In those cases the dichotomization of the control variable X2 does not affect the
relationship between the exposure and the outcome, for any decile. A similar situation is
observed when ρ_{Y,X2} = 0.10 (scenarios 4-6). However, when ρ_{Y,X2} = 0.40 or ρ_{Y,X2} = 0.70
(scenarios 7-12), the situation becomes worse. In fact, in the corresponding graphs we
observe that the confidence intervals are far away from the true parameter, and we also
observe very high relative differences (R.D.), in particular for decile 1.
Table 3.2.c contains the results for sensitivity, specificity and the total
misclassification probability. In many cases the sensitivity was not estimable, because for
the chosen constants the t statistic was always significant under the null as well as under
the alternative hypothesis.

In pages 47 to 51 the results for scenarios 13 to 24 are presented; they
correspond to the cases when n = 100 and ρ_{X1,X2} = 0.10. This situation is very similar to the
one described above: when ρ_{Y,X2} = 0.0 or ρ_{Y,X2} = 0.10 there is not much
effect of dichotomization, but when the correlation coefficient between the outcome and
the control variable increases, the situation becomes worse under any criterion of
evaluation.

The cases when ρ_{X1,X2} = 0.40 and n = 100 are presented in pages 52 to 56.
For the different scenarios under study, the situation is in general bad, even for small
values of ρ_{Y,X2}. We observe very high values of the R.D. as well as of the standardized
mean (S.M.). On the other hand, the graphs show important differences between the estimates
after dichotomization and the true parameter. This situation is even worse for scenarios
37-48, when ρ_{X1,X2} = 0.70; these are presented in pages 57 to 60. In terms of estimates of
the parameter, the situation is very bad because the R.D.'s are high. In addition, we
observe very high misclassification probabilities.

For the large sample size, n = 1,000, the results show some differences with
respect to the small sample size. In fact, when ρ_{X1,X2} = 0.0 (scenarios 49-60, pages 61 to
65), we observe that even though the values of the R.D. are small, the values of the S.M. are
very high, which is an indication of the differences between the statistics t and t* defined
by (3.15) and (3.16). These differences are also reflected in the graphs, particularly for
ρ_{Y,X2} = 0.40 and ρ_{Y,X2} = 0.70.

When ρ_{X1,X2} increases to 0.10 (pages 66 to 70), the situation is in general
similar to the one described above. However, when ρ_{X1,X2} = 0.40 (pages 71 to 75) and
ρ_{X1,X2} = 0.70 (pages 76 to 79) the situation is worse, whether judged by the R.D., the S.M.
or the graphs.
3.8.2.- Summary of results.
We used different criteria to evaluate the effect of dichotomizing the
control variable. From those, we think that the more relevant is the Relative Difference,
because it measures directly the difference between the estimate and the parameter. In
tables 3.10 to 3.13 (pages 80 and 81) descriptive summary statistics of the RD. are
presented. We exclude the R.D. for the case when X2 is used in the model, because we
are not interested in evaluating these cases, due to the fact that we know in advance that
in those cases the RD must be small, and our interest is to evaluate the dichotomization
of X2 • Therefore, the tables present the result for the 450 cases (90 scenarios times 5
decile cutpoints), in terms of the observed mean RD., standar deviation, minimum and
maximum, as well as a description of RD by sample size, deciles and correlation
coefficient.
As we observe in table 3.10, the mean of the R.D. is 28.8%, with a very high
standard deviation. The means by sample size are similar, though smaller for
n=1,000. In terms of deciles, we note that the worst situations are in the extremes,
particularly for the first decile. The best situation is when X2 is dichotomized at the
median, or decile 5. In table 3.11 we present the mean of the R.D. for different
combinations of ρX1,X2 and ρY,X2. As we observe, the discrepancies between the
estimates after dichotomization and the parameter increase as ρX1,X2 increases, as well
as when ρY,X2 increases.
The results from these tables allow us to conclude that the dichotomization of the control
variable can have a strong impact on the measure of association between the outcome and
the exposure, in particular when the correlation coefficients among them are high.
Finally, in tables 3.12 and 3.13 we present the frequency distribution of the
values of the R.D. in ten percent increments. As we observe, in 55.6% of the cases under
study the value of the R.D. was less than 10%, while in 16.7% of the cases the R.D. had
values greater than 50%. In terms of deciles (Table 3.13), the worst situations were the
extreme deciles, for both sample sizes.
Table N° 3.2.a: True value for β1, the mean and R.D. from 500 replications,
for different cutpoints, for scenarios 1 to 12 (ρX1,X2 = 0.00, n = 100).
[Table body not legible in this copy.]
Table N° 3.2.b: Estimation for Mean Square Error (MSE) and for the standardized mean (SM),
for different cutpoints, for scenarios 1 to 12 (ρX1,X2 = 0.00, n = 100).
[Table body not legible in this copy.]
Table N° 3.2.c: Estimation for Sensitivity (%), Specificity (%) and the percentage of the total
misclassification probability (T), for different cutpoints, for scenarios 1 to 12 (ρX1,X2 = 0.00, n = 100).
[Table body not legible in this copy.]
Scenario 1: ρX1,X2 = 0.00, ρY,X2 = 0.00, ρY,X1 = 0.10, n = 100. Figure 1a: 95% Confidence Interval for Beta=0.075. [figure]
Scenario 2: ρX1,X2 = 0.00, ρY,X2 = 0.00, ρY,X1 = 0.40, n = 100. Figure 1b: 95% Confidence Interval for Beta=0.30. [figure]
Scenario 3: ρX1,X2 = 0.00, ρY,X2 = 0.00, ρY,X1 = 0.70, n = 100. Figure 1c: 95% Confidence Interval for Beta=0.525. [figure]
Scenario 4: ρX1,X2 = 0.00, ρY,X2 = 0.10, ρY,X1 = 0.10, n = 100. Figure 1d: 95% Confidence Interval for Beta=0.075. [figure]
Scenario 5: ρX1,X2 = 0.00, ρY,X2 = 0.10, ρY,X1 = 0.40, n = 100. Figure 1e: 95% Confidence Interval for Beta=0.30. [figure]
Scenario 6: ρX1,X2 = 0.00, ρY,X2 = 0.10, ρY,X1 = 0.70, n = 100. Figure 1f: 95% Confidence Interval for Beta=0.525. [figure]
Scenario 7: ρX1,X2 = 0.00, ρY,X2 = 0.40, ρY,X1 = 0.10, n = 100. Figure 1g: 95% Confidence Interval for Beta=0.075. [figure]
Scenario 8: ρX1,X2 = 0.00, ρY,X2 = 0.40, ρY,X1 = 0.40, n = 100. Figure 1h: 95% Confidence Interval for Beta=0.30. [figure]
Scenario 9: ρX1,X2 = 0.00, ρY,X2 = 0.40, ρY,X1 = 0.70, n = 100. Figure 1i: 95% Confidence Interval for Beta=0.525. [figure]
Scenario 10: ρX1,X2 = 0.00, ρY,X2 = 0.70, ρY,X1 = 0.10, n = 100. Figure 1j: 95% Confidence Interval for Beta=0.075. [figure]
Scenario 11: ρX1,X2 = 0.00, ρY,X2 = 0.70, ρY,X1 = 0.40, n = 100. Figure 1k: 95% Confidence Interval for Beta=0.30. [figure]
Scenario 12: ρX1,X2 = 0.00, ρY,X2 = 0.70, ρY,X1 = 0.70, n = 100. Figure 1l: 95% Confidence Interval for Beta=0.525. [figure]
Table N° 3.3.a: True value for β1, the mean and R.D. from 500 replications,
for different cutpoints, for scenarios 13 to 24 (ρX1,X2 = 0.10, n = 100).
[Table body not legible in this copy.]
Table N° 3.3.b: Estimation for Mean Square Error (MSE) and for the standardized mean (SM), for different
cutpoints, for scenarios 13 to 24 (ρX1,X2 = 0.10, n = 100).
[Table body not legible in this copy.]
Table N° 3.3.c: Estimation for Sensitivity, Specificity and the percentage of the total misclassification
probability (T), for different cutpoints, for scenarios 13 to 24 (ρX1,X2 = 0.10, n = 100).
[Table body not legible in this copy.]
Scenario 13: ρX1,X2 = 0.10, ρY,X2 = 0.00, ρY,X1 = 0.10, n = 100. Figure 2a: 95% Confidence Interval for Beta=0.0758. [figure]
Scenario 14: ρX1,X2 = 0.10, ρY,X2 = 0.00, ρY,X1 = 0.40, n = 100. Figure 2b: 95% Confidence Interval for Beta=0.3030. [figure]
Scenario 15: ρX1,X2 = 0.10, ρY,X2 = 0.00, ρY,X1 = 0.70, n = 100. Figure 2c: 95% Confidence Interval for Beta=0.5303. [figure]
Scenario 16: ρX1,X2 = 0.10, ρY,X2 = 0.10, ρY,X1 = 0.10, n = 100. Figure 2d: 95% Confidence Interval for Beta=0.0682. [figure]
Scenario 17: ρX1,X2 = 0.10, ρY,X2 = 0.10, ρY,X1 = 0.40, n = 100. Figure 2e: 95% Confidence Interval for Beta=0.2955. [figure]
Scenario 18: ρX1,X2 = 0.10, ρY,X2 = 0.10, ρY,X1 = 0.70, n = 100. Figure 2f: 95% Confidence Interval for Beta=0.5227. [figure]
Scenario 22: ρX1,X2 = 0.10, ρY,X2 = 0.70, ρY,X1 = 0.10, n = 100. Figure 2j: 95% Confidence Interval for Beta=0.0227. [figure]
Scenario 23: ρX1,X2 = 0.10, ρY,X2 = 0.70, ρY,X1 = 0.40, n = 100. Figure 2k: 95% Confidence Interval for Beta=0.2500. [figure]
Scenario 24: ρX1,X2 = 0.10, ρY,X2 = 0.70, ρY,X1 = 0.70, n = 100. Figure 2l: 95% Confidence Interval for Beta=0.4773. [figure]
Table N° 3.4.a: True value for β1, the mean and R.D. from 500 replications,
for different cutpoints, for scenarios 25 to 36 (ρX1,X2 = 0.40, n = 100).
[Table body not legible in this copy.]
Table N° 3.4.b: Estimation for Mean Square Error (MSE) and for the standardized mean (SM), for different
cutpoints, for scenarios 25 to 36 (ρX1,X2 = 0.40, n = 100).
[Table body not legible in this copy.]
Table N° 3.4.c: Estimation for Sensitivity, Specificity and the percentage of the total misclassification
probability (T), for different cutpoints, for scenarios 25 to 36 (ρX1,X2 = 0.40, n = 100).
[Table body not legible in this copy.]
Scenario 25: ρX1,X2 = 0.40, ρY,X2 = 0.00, ρY,X1 = 0.10, n = 100. Figure 3a: 95% Confidence Interval for Beta=0.0893. [figure]
Scenario 26: ρX1,X2 = 0.40, ρY,X2 = 0.00, ρY,X1 = 0.40, n = 100. Figure 3b: 95% Confidence Interval for Beta=0.3571. [figure]
Scenario 27: ρX1,X2 = 0.40, ρY,X2 = 0.00, ρY,X1 = 0.70, n = 100. Figure 3c: 95% Confidence Interval for Beta=0.6250. [figure]
Scenario 28: ρX1,X2 = 0.40, ρY,X2 = 0.10, ρY,X1 = 0.10, n = 100. Figure 3d: 95% Confidence Interval for Beta=0.0536. [figure]
Scenario 29: ρX1,X2 = 0.40, ρY,X2 = 0.10, ρY,X1 = 0.40, n = 100. Figure 3e: 95% Confidence Interval for Beta=0.3214. [figure]
Scenario 30: ρX1,X2 = 0.40, ρY,X2 = 0.10, ρY,X1 = 0.70, n = 100. Figure 3f: 95% Confidence Interval for Beta=0.5893. [figure]
Scenario 31: ρX1,X2 = 0.40, ρY,X2 = 0.40, ρY,X1 = 0.10, n = 100. Figure 3g: 95% Confidence Interval for Beta=-0.0536. [figure]
Scenario 32: ρX1,X2 = 0.40, ρY,X2 = 0.40, ρY,X1 = 0.40, n = 100. Figure 3h: 95% Confidence Interval for Beta=0.2143. [figure]
Scenario 33: ρX1,X2 = 0.40, ρY,X2 = 0.40, ρY,X1 = 0.70, n = 100. Figure 3i: 95% Confidence Interval for Beta=0.4821. [figure]
Scenario 34: ρX1,X2 = 0.40, ρY,X2 = 0.70, ρY,X1 = 0.10, n = 100. Figure 3j: 95% Confidence Interval for Beta=-0.1607. [figure]
Scenario 35: ρX1,X2 = 0.40, ρY,X2 = 0.70, ρY,X1 = 0.40, n = 100. Figure 3k: 95% Confidence Interval for Beta=0.1071. [figure]
Scenario 36: ρX1,X2 = 0.40, ρY,X2 = 0.70, ρY,X1 = 0.70, n = 100. Figure 3l: 95% Confidence Interval for Beta=0.3750. [figure]
Table N° 3.5.a: True value for β1, the mean and R.D. from 500 replications,
for different cutpoints, for scenarios 37 to 48 (ρX1,X2 = 0.70, n = 100).
[Table body not legible in this copy.]
Table N° 3.5.b: Estimation for Mean Square Error (MSE) and for the standardized mean (SM), for different
cutpoints, for scenarios 37 to 48 (ρX1,X2 = 0.70, n = 100).
[Table body not legible in this copy.]
Table N° 3.5.c: Estimation for Sensitivity, Specificity and the percentage of the total misclassification
probability (T), for different cutpoints, for scenarios 37 to 48 (ρX1,X2 = 0.70, n = 100).
[Table body not legible in this copy.]
Scenario 37: ρX1,X2 = 0.70, ρY,X2 = 0.00, ρY,X1 = 0.10, n = 100. Figure 4a: 95% Confidence Interval for Beta=0.1471. [figure]
Scenario 38: ρX1,X2 = 0.70, ρY,X2 = 0.00, ρY,X1 = 0.40, n = 100. Figure 4b: 95% Confidence Interval for Beta=0.5882. [figure]
Scenario 40: ρX1,X2 = 0.70, ρY,X2 = 0.10, ρY,X1 = 0.10, n = 100. Figure 4d: 95% Confidence Interval for Beta=0.0441. [figure]
Scenario 41: ρX1,X2 = 0.70, ρY,X2 = 0.10, ρY,X1 = 0.40, n = 100. Figure 4e: 95% Confidence Interval for Beta=0.4853. [figure]
Scenario 43: ρX1,X2 = 0.70, ρY,X2 = 0.40, ρY,X1 = 0.10, n = 100. Figure 4g: 95% Confidence Interval for Beta=-0.2647. [figure]
Scenario 44: ρX1,X2 = 0.70, ρY,X2 = 0.40, ρY,X1 = 0.40, n = 100. Figure 4h: 95% Confidence Interval for Beta=0.1765. [figure]
Scenario 45: ρX1,X2 = 0.70, ρY,X2 = 0.40, ρY,X1 = 0.70, n = 100. Figure 4i: 95% Confidence Interval for Beta=0.6177. [figure]
Scenario 47: ρX1,X2 = 0.70, ρY,X2 = 0.70, ρY,X1 = 0.40, n = 100. Figure 4k: 95% Confidence Interval for Beta=-0.1324. [figure]
Scenario 48: ρX1,X2 = 0.70, ρY,X2 = 0.70, ρY,X1 = 0.70, n = 100. Figure 4l: 95% Confidence Interval for Beta=0.3088. [figure]
Table N° 3.6.a: True value for β1, the mean and R.D. from 500 replications,
for different cutpoints, for scenarios 49 to 60 (ρX1,X2 = 0.00, n = 1,000).
[Table body not legible in this copy.]
Table N° 3.6.b: Estimation for Mean Square Error (MSE) and for the standardized mean (SM), for different
cutpoints, for scenarios 49 to 60 (ρX1,X2 = 0.00, n = 1,000).
[Table body not legible in this copy.]
Table N° 3.6.c: Estimation for Sensitivity, Specificity and the percentage of the total misclassification
probability (T), for different cutpoints, for scenarios 49 to 60 (ρX1,X2 = 0.00, n = 1,000).
[Table body not legible in this copy.]
Scenario 49: ρX1,X2 = 0.00, ρY,X2 = 0.00, ρY,X1 = 0.10, n = 1,000. Figure 5a: 95% Confidence Interval for Beta=0.075. [figure]
Scenario 50: ρX1,X2 = 0.00, ρY,X2 = 0.00, ρY,X1 = 0.40, n = 1,000. Figure 5b: 95% Confidence Interval for Beta=0.30. [figure]
Scenario 51: ρX1,X2 = 0.00, ρY,X2 = 0.00, ρY,X1 = 0.70, n = 1,000. Figure 5c: 95% Confidence Interval for Beta=0.525. [figure]
Scenario 52: ρX1,X2 = 0.00, ρY,X2 = 0.10, ρY,X1 = 0.10, n = 1,000. Figure 5d: 95% Confidence Interval for Beta=0.075. [figure]
Scenario 53: ρX1,X2 = 0.00, ρY,X2 = 0.10, ρY,X1 = 0.40, n = 1,000. Figure 5e: 95% Confidence Interval for Beta=0.30. [figure]
Scenario 54: ρX1,X2 = 0.00, ρY,X2 = 0.10, ρY,X1 = 0.70, n = 1,000. Figure 5f: 95% Confidence Interval for Beta=0.525. [figure]
Scenario 55: ρX1,X2 = 0.00, ρY,X2 = 0.40, ρY,X1 = 0.10, n = 1,000. Figure 5g: 95% Confidence Interval for Beta=0.075. [figure]
Scenario 56: ρX1,X2 = 0.00, ρY,X2 = 0.40, ρY,X1 = 0.40, n = 1,000. Figure 5h: 95% Confidence Interval for Beta=0.30. [figure]
Scenario 57: ρX1,X2 = 0.00, ρY,X2 = 0.40, ρY,X1 = 0.70, n = 1,000. Figure 5i: 95% Confidence Interval for Beta=0.525. [figure]
Scenario 58: ρX1,X2 = 0.00, ρY,X2 = 0.70, ρY,X1 = 0.10, n = 1,000. Figure 5j: 95% Confidence Interval for Beta=0.075. [figure]
Scenario 59: ρX1,X2 = 0.00, ρY,X2 = 0.70, ρY,X1 = 0.40, n = 1,000. Figure 5k: 95% Confidence Interval for Beta=0.30. [figure]
Scenario 60: ρX1,X2 = 0.00, ρY,X2 = 0.70, ρY,X1 = 0.70, n = 1,000. Figure 5l: 95% Confidence Interval for Beta=0.525. [figure]
Table N° 3.7.a: True value for β1, the mean and R.D. from 500 replications,
for different cutpoints, for scenarios 61 to 72 (ρX1,X2 = 0.10, n = 1,000).
[Table body not legible in this copy.]
Table N° 3.7.b: Estimation for Mean Square Error (MSE) and for the standardized mean (SM), for different
cutpoints, for scenarios 61 to 72 (ρX1,X2 = 0.10, n = 1,000).
[Table body not legible in this copy.]
Table N° 3.7.c: Estimation for Sensitivity, Specificity and the percentage of the total misclassification
probability (T), for different cutpoints, for scenarios 61 to 72 (ρX1,X2 = 0.10, n = 1,000).
[Table body not legible in this copy.]
Scenario 61: ρX1,X2 = 0.10, ρY,X2 = 0.00, ρY,X1 = 0.10, n = 1,000. Figure 6a: 95% Confidence Interval for Beta=0.0758. [figure]
Scenario 62: ρX1,X2 = 0.10, ρY,X2 = 0.00, ρY,X1 = 0.40, n = 1,000. Figure 6b: 95% Confidence Interval for Beta=0.3030. [figure]
Scenario 63: ρX1,X2 = 0.10, ρY,X2 = 0.00, ρY,X1 = 0.70, n = 1,000. Figure 6c: 95% Confidence Interval for Beta=0.5303. [figure]
Scenario 64: ρX1,X2 = 0.10, ρY,X2 = 0.10, ρY,X1 = 0.10, n = 1,000. Figure 6d: 95% Confidence Interval for Beta=0.0682. [figure]
Scenario 65: ρX1,X2 = 0.10, ρY,X2 = 0.10, ρY,X1 = 0.40, n = 1,000. Figure 6e: 95% Confidence Interval for Beta=0.2955. [figure]
Scenario 66: ρX1,X2 = 0.10, ρY,X2 = 0.10, ρY,X1 = 0.70, n = 1,000. Figure 6f: 95% Confidence Interval for Beta=0.5227. [figure]
Scenario 67: ρX1,X2 = 0.10, ρY,X2 = 0.40, ρY,X1 = 0.10, n = 1,000. Figure 6g: 95% Confidence Interval for Beta=0.0455. [figure]
Scenario 68: ρX1,X2 = 0.10, ρY,X2 = 0.40, ρY,X1 = 0.40, n = 1,000. Figure 6h: 95% Confidence Interval for Beta=0.2727. [figure]
Scenario 69: ρX1,X2 = 0.10, ρY,X2 = 0.40, ρY,X1 = 0.70, n = 1,000. Figure 6i: 95% Confidence Interval for Beta=0.5000. [figure]
Scenario 70: ρX1,X2 = 0.10, ρY,X2 = 0.70, ρY,X1 = 0.10, n = 1,000. Figure 6j: 95% Confidence Interval for Beta=0.0227. [figure]
Scenario 71: ρX1,X2 = 0.10, ρY,X2 = 0.70, ρY,X1 = 0.40, n = 1,000. Figure 6k: 95% Confidence Interval for Beta=0.2500. [figure]
Scenario 72: ρX1,X2 = 0.10, ρY,X2 = 0.70, ρY,X1 = 0.70, n = 1,000. Figure 6l: 95% Confidence Interval for Beta=0.4773. [figure]
Table N° 3.8.a: True value for β1, the mean and R.D. from 500 replications,
for different cutpoints, for scenarios 73 to 84 (ρX1,X2 = 0.40, n = 1,000).
[Table body not legible in this copy.]
Table N° 3.8.b: Estimation for Mean Square Error (MSE) and for the standardized mean (SM), for different
cutpoints, for scenarios 73 to 84 (ρX1,X2 = 0.40, n = 1,000).
[Table body not legible in this copy.]
Table N° 3.8.c: Estimation for Sensitivity, Specificity and the percentage of the total misclassification
probability (T), for different cutpoints, for scenarios 73 to 84 (ρX1,X2 = 0.40, n = 1,000).
[Table body not legible in this copy.]
Scenario 73: ρX1,X2 = 0.40, ρY,X2 = 0.00, ρY,X1 = 0.10, n = 1,000. Figure 7a: 95% Confidence Interval for Beta=0.0893. [figure]
Scenario 74: ρX1,X2 = 0.40, ρY,X2 = 0.00, ρY,X1 = 0.40, n = 1,000. Figure 7b: 95% Confidence Interval for Beta=0.3571. [figure]
Scenario 75: ρX1,X2 = 0.40, ρY,X2 = 0.00, ρY,X1 = 0.70, n = 1,000. Figure 7c: 95% Confidence Interval for Beta=0.6250. [figure]
Scenario 76: ρX1,X2 = 0.40, ρY,X2 = 0.10, ρY,X1 = 0.10, n = 1,000. Figure 7d: 95% Confidence Interval for Beta=0.0536. [figure]
Scenario 77: ρX1,X2 = 0.40, ρY,X2 = 0.10, ρY,X1 = 0.40, n = 1,000. Figure 7e: 95% Confidence Interval for Beta=0.3214. [figure]
Scenario 78: ρX1,X2 = 0.40, ρY,X2 = 0.10, ρY,X1 = 0.70, n = 1,000. Figure 7f: 95% Confidence Interval for Beta=0.5893. [figure]
Scenario 79: ρX1,X2 = 0.40, ρY,X2 = 0.40, ρY,X1 = 0.10, n = 1,000. Figure 7g: 95% Confidence Interval for Beta=-0.0536. [figure]
Scenario 80: ρX1,X2 = 0.40, ρY,X2 = 0.40, ρY,X1 = 0.40, n = 1,000. Figure 7h: 95% Confidence Interval for Beta=0.2143. [figure]
Scenario 81: ρX1,X2 = 0.40, ρY,X2 = 0.40, ρY,X1 = 0.70, n = 1,000. Figure 7i: 95% Confidence Interval for Beta=0.4821. [figure]
Scenario 82: ρX1,X2 = 0.40, ρY,X2 = 0.70, ρY,X1 = 0.10, n = 1,000. Figure 7j: 95% Confidence Interval for Beta=-0.1607. [figure]
Scenario 83: ρX1,X2 = 0.40, ρY,X2 = 0.70, ρY,X1 = 0.40, n = 1,000. Figure 7k: 95% Confidence Interval for Beta=0.1071. [figure]
Scenario 84: ρX1,X2 = 0.40, ρY,X2 = 0.70, ρY,X1 = 0.70, n = 1,000. Figure 7l: 95% Confidence Interval for Beta=0.3750. [figure]
Table N° 3.9.a: True value for β1, the mean and R.D. from 500 replications,
for different cutpoints, for scenarios 85 to 96 (ρX1,X2 = 0.70, n = 1,000).
[Table body not legible in this copy.]
Table N° 3.9.b: Estimation for Mean Square Error (MSE) and for the standardized mean (SM), for different
cutpoints, for scenarios 85 to 96 (ρX1,X2 = 0.70, n = 1,000).
[Table body not legible in this copy.]
Table N° 3.9.c: Estimation for Sensitivity, Specificity and the percentage of the total misclassification
probability (T), for different cutpoints, for scenarios 85 to 96 (ρX1,X2 = 0.70, n = 1,000).
[Table body not legible in this copy.]
Scenario 85: ρX1,X2 = 0.70, ρY,X2 = 0.00, ρY,X1 = 0.10, n = 1,000. Figure 8a: 95% Confidence Interval for Beta=0.1471. [figure]
Scenario 86: ρX1,X2 = 0.70, ρY,X2 = 0.00, ρY,X1 = 0.40, n = 1,000. Figure 8b: 95% Confidence Interval for Beta=0.5882. [figure]
Scenario 88: ρX1,X2 = 0.70, ρY,X2 = 0.10, ρY,X1 = 0.10, n = 1,000. Figure 8d: 95% Confidence Interval for Beta=0.0441. [figure]
Scenario 89: ρX1,X2 = 0.70, ρY,X2 = 0.10, ρY,X1 = 0.40, n = 1,000. Figure 8e: 95% Confidence Interval for Beta=0.4853. [figure]
Scenario 91: ρX1,X2 = 0.70, ρY,X2 = 0.40, ρY,X1 = 0.10, n = 1,000. Figure 8g: 95% Confidence Interval for Beta=-0.2647. [figure]
Scenario 92: ρX1,X2 = 0.70, ρY,X2 = 0.40, ρY,X1 = 0.40, n = 1,000. Figure 8h: 95% Confidence Interval for Beta=0.1765. [figure]
Scenario 93: ρX1,X2 = 0.70, ρY,X2 = 0.40, ρY,X1 = 0.70, n = 1,000. Figure 8i: 95% Confidence Interval for Beta=0.6177. [figure]
Scenario 95: ρX1,X2 = 0.70, ρY,X2 = 0.70, ρY,X1 = 0.40, n = 1,000. Figure 8k: 95% Confidence Interval for Beta=-0.1324. [figure]
Scenario 96: ρX1,X2 = 0.70, ρY,X2 = 0.70, ρY,X1 = 0.70, n = 1,000. Figure 8l: 95% Confidence Interval for Beta=0.3088. [figure]
Table N° 3.10: Mean of R.D. by sample size and deciles.

                N     Mean    S.D.      Min      Max
   General     450   28.81   51.12    0.023    312.9
   n=100       225   30.84   55.53    0.023    312.9
   n=1,000     225   26.77   46.33    0.035    262.1
   decile 1     90   35.41   61.72    0.034    306.2
   decile 3     90   25.83   44.54    0.023    250.2
   decile 5     90   23.89   44.45    0.051    262.2
   decile 7     90   26.96   50.52    0.086    312.9
   decile 9     90   31.95   52.66    0.088    287.9
Table N° 3.11: Mean of R.D. by combinations of ρX1,X2 and ρY,X2.

                             ρY,X2
   ρX1,X2      0.00     0.10     0.40      0.70   General
   0.00        1.93     1.35     8.53     30.34      8.74
   0.10        1.09     2.13    10.71     37.62     10.64
   0.40        6.69    16.54    45.40     47.75     29.10
   0.70       14.43    72.93    62.92    141.66     66.74
   General     6.03    23.24    31.89     62.50     28.81
Table N° 3.12: Frequency distribution of the R.D. of the 450 cases, by R.D. class and sample size.

                      n=100                n=1,000                 Total
   R.D. (%)       f      %   Cum.%      f      %   Cum.%      f      %   Cum.%
   < 10         119   52.9   52.9     131   58.2   58.2     250   55.6   55.6
   10-19.99      30   13.3   66.2      19    8.4   66.6      49   10.9   66.5
   20-29.99      16    7.1   73.3      17    7.6   74.2      33    7.3   73.8
   30-39.99      13    5.8   79.1      11    4.9   79.1      24    5.3   79.1
   40-49.99       9    4.0   83.1      10    4.4   83.5      19    4.2   83.3
   50-59.99       6    2.7   85.8       8    3.6   87.1      14    3.1   86.4
   60-69.99       5    2.2   88.0       2    0.9   88.0       7    1.6   88.0
   70-79.99       4    1.8   89.8       3    1.3   89.3       7    1.6   89.6
   80-89.99       2    0.9   90.7       3    1.3   90.6       5    1.1   90.7
   90-99.99       3    1.3   92.0       2    0.9   91.5       5    1.1   91.8
   100 & over    18    8.0  100.0      19    8.4  100.0      37    8.2  100.0

R.D. classes overall: None (< 10%): 55.6%; Mild (10-29.99%): 18.2%; Moderate (30-49.99%): 9.6%; Severe (50% and over): 16.7%.
Table N° 3.13: Distribution (%) of the R.D. categories of the 450 cases, by decile and sample size.

                 None    Mild   Moderate   Severe
   n=100
     decile 1    44.4    26.7      8.9      20.0
     decile 3    57.8    15.6     13.3      13.3
     decile 5    57.8    17.8      8.9      15.6
     decile 7    57.8    17.8      8.9      15.6
     decile 9    46.7    24.4      8.9      20.0
     All         52.9    20.4      9.8      16.9
   n=1,000
     decile 1    53.3    15.6     11.1      20.0
     decile 3    60.0    17.8      8.9      13.3
     decile 5    62.2    15.6      8.9      13.3
     decile 7    62.2    17.8      6.7      13.3
     decile 9    53.3    13.3     11.1      22.2
     All         58.2    16.0      9.3      16.4
CHAPTER 4
DICHOTOMIZATION UNDER THE MULTIPLE LOGISTIC
REGRESSION MODEL: A SIMULATION APPROACH.
4.1.- GENERAL CONSIDERATIONS.
In chapters 2 and 3, we studied the situation when the outcome is a
continuous response. In this chapter we assume that the original outcome Y is binary. In
addition, suppose we have two independent variables, X1 and X2, and as before, we
assume that X1 is a continuous main effect or exposure variable, and X2 is a continuous
control variable, a potential confounder. As in the problem involving the multiple linear
regression model, our goal is to assess the association between the outcome and the main
effect X1, controlling for X2. This is a typical situation in epidemiology, and given the
fact that the outcome is binary, a common approach is to use a logistic regression model.
We can have different kinds of study designs with outcomes such as Y. In the
present dissertation, we will assume that our data come from a longitudinal study; one
example will be shown later. As before, the control variable X2 will be dichotomized, and
a binary variable will be used instead. Again, our goal in this case is to evaluate the
consequences of such a procedure, in terms of how much the estimate of the association
between the outcome Y and the exposure variable X1 is affected. Under the logistic
regression approach, we will only study the case through simulations. Basically, we
simulate a set of data where X2 is a continuous variable, and then, using a regular
estimation process, we estimate β1 [as in (4.1)], which is the measure of association
between the outcome and the exposure variable. Once again, the variable X2 will be
dichotomized at a particular cutpoint, and then we will estimate β1 again, but now using
the dichotomized variable instead of X2. Finally, we will compare the results from these
two processes, under various scenarios.
4.2.- THE MODELS.
Given that we assume that the design is a longitudinal study, let Y be the
dependent variable taking the values:
Yi = 1 if the i-th observation has the characteristic of interest, with probability P, and
Yi = 0 if the i-th observation does not have the characteristic of interest, with probability
Q = 1 - P,
where i = 1, 2, ..., n.
The general expression for the logistic regression model, with X1 and X2
as the independent variables, is given by:

   P(Y=1 | X1=x1, X2=x2) = e^(β0 + β1x1 + β2x2) / (1 + e^(β0 + β1x1 + β2x2)).    (4.1)
In order to estimate β1 in (4.1), the maximum likelihood (ML) method will
be used. Let β̂1 be the ML estimator, and let se(β̂1) be the corresponding standard error.
Now, suppose that the continuous control variable is dichotomized according to (3.3).
Thus, using D instead of X2 in (4.1), the following model is fitted:

   P(Y=1 | X1=x1, D=d) = e^(α0 + α1x1 + α2d) / (1 + e^(α0 + α1x1 + α2d)),    (4.2)

where d can take the values 0 or 1. Using again the ML method, let α̂1 be the estimate of
the measure of association between the outcome Y and the exposure variable X1, and let
se(α̂1) be the corresponding standard error. Under different scenarios, the idea is to
compare the two estimators. Such scenarios will be described later.
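As an illustration of the two fits being compared, the following is a minimal sketch in Python with statsmodels (the original simulations were run in SAS®, so this is illustrative only); y, x1, x2 are arrays for one replication, c is the chosen cutpoint, and the rule D = 1 if X2 > c, taken to restate (3.3), is an assumption about the direction of the dichotomization.

    import numpy as np
    import statsmodels.api as sm

    def fit_beta1(y, x1, x2):
        # model (4.1): logistic regression of Y on X1 and the continuous X2
        X = sm.add_constant(np.column_stack([x1, x2]))
        fit = sm.Logit(y, X).fit(disp=0)
        return fit.params[1], fit.bse[1]      # beta1-hat and se(beta1-hat)

    def fit_alpha1(y, x1, x2, c):
        # model (4.2): the same regression with X2 replaced by the binary D
        d = (x2 > c).astype(float)            # D = 1 if X2 > c, else 0 (assumed)
        X = sm.add_constant(np.column_stack([x1, d]))
        fit = sm.Logit(y, X).fit(disp=0)
        return fit.params[1], fit.bse[1]      # alpha1-hat and se(alpha1-hat)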
4.3.- SIMULATION APPROACH.
In this section, we present the strategy to create the variables for our
simulation process. As was described in the previous chapters, we want to simulate a
realistic situation, close to real life. As before, suppose that we have a problem from a
cardiovascular study, where we want to assess the association between in-hospital
mortality for myocardial infarction (MI) and total cholesterol level. Simplistically, that
relationship needs to be controlled for age, for example. Suppose that for a given study,
all the MIs that occurred in a country during a fixed period of time are included in the
study. In a specific calendar time, we want to assess the association between mortality
(incidence rate) and cholesterol level (at the moment of the MI). As we can see, this is
exactly the same problem described before, but now we use a binary outcome
(in-hospital fatal or non-fatal MI) instead of a continuous outcome (systolic blood
pressure). Therefore, the variables for our simulation are:
- Y: in-hospital fatal MI (1=yes, 0=no),
- X1: total cholesterol level (mg/dl),
- X2: age (years).
In order to simulate the data, we have to define the distribution of Y as
well as of the two independent variables X1 and X2. Before defining the variables of
interest, we have to define some other variables. Let Y, Z1, Z2 and Z3 be independent
random variables with the following distributions: Y has a Bernoulli distribution with
parameter P (E[Y]=P, V[Y]=PQ, with Q=1-P), and the variables Zi (i=1,2,3) each have a
standard normal distribution. Our interest is to control the correlation coefficients among
the variables involved in the process; we can do that in several ways. One of these is to
define X1 and X2 as follows:

   X1 = v1 + δ1Y + c1(Z1 + θZ3)   and   X2 = v2 + δ2Y + c2(Z2 + θZ3),    (4.3)

where δ1, δ2, c1, c2 and θ are constants that, when specified, dictate a correlation
structure among the variables, and v1 and v2 are constants such that the expected values
of X1 and X2 can be controlled.
This kind of definition for the continuous independent variables allows us
to control the variances and the covariances among the variables needed in the model; we
do that by controlling the constants δ1, δ2, c1, c2 and θ.
The definitions given for the two independent random variables in (4.3)
imply the following conditional moments:

   E(Xi | Y=y) = vi + δi y,   V(Xi | Y=y) = ci²(1 + θ²)   (i=1,2),

and the following unconditional moments:

   E(Xi) = vi + δi P,   V(Xi) = δi²PQ + ci²(1 + θ²)   (i=1,2).    (4.4)

On the other hand, the covariances among the variables are given by:

   Cov(Y, X1) = δ1V(Y) = δ1PQ,   Cov(Y, X2) = δ2V(Y) = δ2PQ,
   Cov(X1, X2) = δ1δ2PQ + c1c2θ².    (4.5)
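A minimal sketch of one draw from the construction in (4.3) follows, assuming the constants δ1, δ2, c1, c2, θ, v1, v2 have already been chosen; NumPy stands in for the SAS® RANNOR/RANBIN calls used in the original work, and the function name is illustrative.

    import numpy as np

    def draw(n, p, d1, d2, c1, c2, theta, v1, v2, rng):
        y = rng.binomial(1, p, n)                  # Y ~ Bernoulli(P), independent of the Z's
        z1 = rng.standard_normal(n)                # Z1, Z2, Z3 ~ N(0, 1), independent
        z2 = rng.standard_normal(n)
        z3 = rng.standard_normal(n)
        x1 = v1 + d1 * y + c1 * (z1 + theta * z3)  # per (4.3); shared Z3 induces Cov(X1, X2)
        x2 = v2 + d2 * y + c2 * (z2 + theta * z3)
        return y, x1, x2

Here rng = np.random.default_rng(seed) supplies the random number generator.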
Up to this point, we have constructed the variables involved in the process,
and we have obtained the variances and the covariances among them. These variances
and covariances will allow us to compute the correlation coefficients needed in our
simulation study.
As before, in our simulation study we want to control in advance different
aspects related to the correlation structure among the variables involved in the model,
the sample sizes, and the cutpoints.
i) Correlation structure.
The correlation coefficients among the variables involved in the process
can be defined using (4.5). In fact, given that V(Y)=PQ, the correlation coefficients of
the independent variables with Y are given by:

   ρY,X1 = δ1PQ / √[PQ{δ1²PQ + c1²(1+θ²)}] = δ1√(PQ) / √[δ1²PQ + c1²(1+θ²)],
   ρY,X2 = δ2PQ / √[PQ{δ2²PQ + c2²(1+θ²)}] = δ2√(PQ) / √[δ2²PQ + c2²(1+θ²)].    (4.6)

For specific values of the correlation coefficients, we must obtain the
necessary constants for (4.3). We know that Cov(Y,X1) = ρY,X1 √[V(Y)V(X1)], and from
(4.5) we have that Cov(Y,X1) = δ1PQ. Thus,

   δ1 = ρY,X1 √[V(X1)] / √(PQ).

For a given incidence rate P, and for a given variance of the exposure
variable X1, if we define in advance the correlation coefficient between the binary
outcome Y and X1, then using the above expression we obtain the constant δ1. Using the
same argument with X2 instead of X1, we obtain the constant δ2.
On the other hand, from (4.4) we know that:

   ci = √[(V(Xi) − δi²PQ) / (1+θ²)]   (i=1,2),    (4.7)

and given that the variances are known, as well as the constants δ1 and δ2, the constants
c1 and c2 depend only on θ. Finally, using the definition of the simple correlation
coefficient between X1 and X2, which is given by:

   ρX1,X2 = Cov(X1,X2) / √[V(X1)V(X2)] = (δ1δ2PQ + c1c2θ²) / √[V(X1)V(X2)],    (4.8)

and replacing the constants from (4.7) in (4.8), we obtain the formula for θ:

   θ = √[ A / (B − A) ],   where A = ρX1,X2 √[V(X1)V(X2)] − δ1δ2PQ
   and B = √[(V(X1) − δ1²PQ)(V(X2) − δ2²PQ)].    (4.9)
In summary, in order to simulate a realistic situation, we will fix the values
of the moments of the three variables in the model, as well as the correlation coefficients
among them. Using these values, we obtain the necessary constants for (4.3).
However, as we will describe later, there are some restrictions on the correlation
coefficients. In fact, the definition of θ in (4.9) implies that the expression inside the
bracket has to be greater than or equal to zero, so those terms cannot take just any
value. Fortunately, only for a few cases do we not get a solution. On the other hand, one
special situation that we want to evaluate is when ρX1,X2=0. In that case, due to the
definition of the simple correlation coefficient given by (4.8), θ must be zero and δ1 (or
δ2) must also be zero. But if δ1=0 (or δ2=0), then from (4.6) ρY,X1=0 (or ρY,X2=0). Thus, in
the cases where we want to assess the effect of dichotomizing X2 on the measure of
association between the outcome Y and the exposure X1 when X1 and X2 are
independent, we must restrict to the cases where Y and X1 (or Y and X2) are also
uncorrelated.
In terms of the specific values for the simulations, we will use 0.1, 0.4 and 0.7
as before for ρY,X1 and ρY,X2. For ρX1,X2 we use the values 0, 0.2, 0.4, 0.6 and 0.8. We have
to remember that in the case when ρX1,X2=0, then ρY,X1=0 or ρY,X2=0.
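A minimal sketch of solving (4.6)-(4.9) for the constants is given below; the closed form for θ restates the reconstruction of (4.9) above and should be read as an assumption, and the function simply fails (square root of a negative number) for the combinations that admit no solution.

    import math

    def constants(p, v_x1, v_x2, r_yx1, r_yx2, r_x1x2):
        q = 1.0 - p
        d1 = r_yx1 * math.sqrt(v_x1) / math.sqrt(p * q)   # delta1, from Cov(Y,X1) = delta1*PQ
        d2 = r_yx2 * math.sqrt(v_x2) / math.sqrt(p * q)   # delta2, by the same argument
        a = r_x1x2 * math.sqrt(v_x1 * v_x2) - d1 * d2 * p * q
        b = math.sqrt((v_x1 - d1**2 * p * q) * (v_x2 - d2**2 * p * q))
        theta = math.sqrt(a / (b - a))                    # (4.9); requires a/(b-a) >= 0
        c1 = math.sqrt((v_x1 - d1**2 * p * q) / (1 + theta**2))   # (4.7)
        c2 = math.sqrt((v_x2 - d2**2 * p * q) / (1 + theta**2))
        return d1, d2, c1, c2, theta

For the example below, one would call constants(0.15, 1600.0, 100.0, r_yx1, r_yx2, r_x1x2) for each target correlation combination.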
ii) Cutpoints.
As in the linear regression model simulations, we will use the same 5
cutpoints, defined by the deciles 1, 3, 5, 7 and 9.
iii) Sample size and number of replications.
As in the linear regression analyses, we will study 2 different sample sizes:
50 (small) and 200 (medium). These two values will give us a picture of the influence
of the sample size on the association between the outcome and the main effects. For all
the cases that we will study, the number of replications will be 1,000.
Returning to our example, we assume the following moments:
- P, the in-hospital mortality rate, 0.15,
- Cholesterol level, E(X1)=μ1=200 mg/dl, V(X1)=1600 (mg/dl)²,
- Age, E(X2)=μ2=40 years, V(X2)=100 (years)².
Given these values, and given the correlation coefficients, it is possible to
compute the constants needed for the simulations, which were described in (4.3). The
total number of combinations is 42: 4 values of ρX1,X2, times 3 values of ρY,X1, times 3
values of ρY,X2, plus 6 situations when ρX1,X2=0. Since the correlation coefficients are the
same for n=50 and n=200, we present the values once. The correlation coefficients and
the constants are presented in the following table:
Table N° 4.1: Values of the constants for the different combinations of correlation coefficients, n=50, 200.
[Table body not legible in this copy.]
4.4.- SIMULATION PROCESS.
As in chapter 3, we must create the variables needed in (4.3). The random
variables Z1, Z2 and Z3 are three independent random variables, each with a standard
normal distribution, created using the RANNOR function in SAS®. The random binary
variable Y is created in such a way that it is independent of the Zi (i=1,2,3), using the SAS®
function RANBIN, with parameter P=0.15. One important difference in the present
situation, as compared to the multiple linear regression model studied before, is that now
we do not include an error term when we create the variables. That issue has one
implication: in the situation involving the multiple linear regression problem, the
independent variables were created only once, and for each simulation we created the
corresponding error terms and then computed the outcome Y. This implied that for each
simulation the independent variables had the same values, and the only components that
changed were the error terms and the outcomes. The present case is different;
for each simulation we have to re-generate all variables, the outcome as well as the
independent variables, according to the above definitions. Again, assume that N is the
number of replications. As before, for each replication, six logistic regression models are
fitted. The first one uses the outcome Y, the continuous exposure X1, and the continuous
control variable X2, while the other five models use the dichotomized variable D at one
of the 5 deciles specified above.
Table 4.1 presents the values of the constants for the 42 different scenarios
explored, based on combinations of ρX1,X2, ρY,X2 and ρY,X1. For six scenarios the
combinations lead to situations where it is not possible to calculate the constants, leaving
a total of 36 scenarios examined for small (n=50) and moderate (n=200) sample sizes.
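Putting the pieces together, the following is a minimal sketch of one scenario, assuming the helper functions draw, fit_beta1 and fit_alpha1 sketched earlier and consts = (δ1, δ2, c1, c2, θ, v1, v2); it is an illustration, not the original SAS program.

    import numpy as np

    def run_scenario(N, n, p, consts, rng):
        deciles = (0.1, 0.3, 0.5, 0.7, 0.9)
        betas = []                                  # estimates from model (4.1)
        alphas = {q: [] for q in deciles}           # estimates from model (4.2), by cutpoint
        for _ in range(N):
            y, x1, x2 = draw(n, p, *consts, rng)    # re-generate all variables each replication
            betas.append(fit_beta1(y, x1, x2)[0])
            for q in deciles:
                c = np.quantile(x2, q)              # cutpoint at decile 1, 3, 5, 7 or 9
                alphas[q].append(fit_alpha1(y, x1, x2, c)[0])
        return np.asarray(betas), {q: np.asarray(a) for q, a in alphas.items()}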
4.5.- MEASURE OF THE EFFECT.
For the multiple logistic regression model, we use the same strategies to
evaluate the effect of dichotomizing X2 as were used for the multiple linear regression
model, except the one corresponding to the Mean Squared Error approach. We cannot use
it in this case, because we do not define the parameter being estimated in advance.
Instead, we use four alternative measures of the effect.
4.5.1.- Relative Difference (R.D.)
As in section 3.7.1 in the previous chapter, we compute the relative
difference (R.D.), but now it is computed with respect to the estimate from the model
using X2, i.e., without dichotomization, instead of the true parameter (which is unknown
in this situation). Thus, the R.D. is computed as:

   R.D. = [(Mean of α̂1 − Mean of β̂1) / (Mean of β̂1)] × 100.    (4.10)
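A one-line sketch of (4.10), assuming the arrays of replicate estimates from the previous section:

    import numpy as np

    def relative_difference(alpha1_hats, beta1_hats):
        # (4.10): percent difference of the mean estimate after dichotomization
        # relative to the mean estimate from the model using the continuous X2
        return (np.mean(alpha1_hats) - np.mean(beta1_hats)) / np.mean(beta1_hats) * 100.0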
4.5.2.- Sensitivity, Specificity, and Total Misclassification Probability.
As in section 3.7.3, we can assume that the estimate obtained by fitting
(4.1) is the gold standard estimator. Let β̂1 be the estimator of β1, and let se(β̂1) be the
standard error. To test the null hypothesis H0: β1=0, we use the Wald statistic given by:

   W = {β̂1 / se(β̂1)}² ~ χ²(1).    (4.11)

Using (4.11), it is possible to compute the p-value (p2): if p2 ≤ 0.05, then
the null hypothesis is rejected; otherwise, it is not. Now, model (4.2), after
dichotomizing X2, is fitted. Let α̂1 be the estimate of the measure of association between
the outcome and the exposure, controlling for the dichotomized variable X2, and let se(α̂1)
be the corresponding standard error. Then, in order to test the null hypothesis H0: α1=0 in
(4.2), we again use the Wald statistic, which in this case is given by:

   W* = {α̂1 / se(α̂1)}² ~ χ²(1).    (4.12)

This statistic has an associated p-value, which gives us the information to decide whether
we reject H0 or not.
Thus, as in section 3.7.3, the 1,000 replications for every cutpoint can be
classified in a 2×2 table like the following:

                                  Model (4.2)
                           Not Reject   Reject    Total
   Model (4.1)  Not Reject     a           b       a+b
                Reject         c           d       c+d
                Total         a+c         b+d     1,000

We use the same definitions described in section 3.7.3 to compute the sensitivity
(S), the specificity (Sp) and the total misclassification proportion (T).
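A minimal sketch of this classification follows, assuming (following the conventions of section 3.7.3, which is not reproduced here) that "reject" means p ≤ 0.05, that the fit with the continuous X2 is the gold standard, and that S, Sp and T are the usual agreement proportions; these restated definitions are assumptions.

    import numpy as np
    from scipy.stats import chi2

    def wald_p(est, se):
        # p-value from the Wald statistic (4.11)/(4.12), W ~ chi-square with 1 d.f.
        return chi2.sf((est / se) ** 2, df=1)

    def classify(p_gold, p_dichot, alpha=0.05):
        gold = np.asarray(p_gold) <= alpha      # reject under model (4.1)
        dich = np.asarray(p_dichot) <= alpha    # reject under model (4.2)
        sens = dich[gold].mean() if gold.any() else float("nan")      # not estimable otherwise
        spec = (~dich[~gold]).mean() if (~gold).any() else float("nan")
        t = (gold != dich).mean()               # total misclassification proportion
        return sens, spec, t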
4.5.3.- Standardized Mean (S.M.)
Another way to evaluate the impact of dichotomizing the control variable
is through the comparison of the two test statistics used to assess the hypothesis H0: β1=0,
defined by (4.11) and (4.12). If for replication h we rename (4.11) as Wh and (4.12) as
W*h, then we consider the difference

   dh = Wh − W*h,   h = 1, ..., 1,000.    (4.13)

It is possible to evaluate the difference directly through a non-parametric test. But, as in
section 3.7.4, given that the test would be based on 1,000 observations, any small
difference will be significant, and so instead of testing the difference, we compute the
standardized mean (S.M.) of the differences:

   S.M. = (Σh dh / 1,000) / s.d.(dh).
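A direct sketch of the S.M., assuming w and w_star hold the 1,000 Wald statistics from (4.11) and (4.12):

    import numpy as np

    def standardized_mean(w, w_star):
        d = np.asarray(w) - np.asarray(w_star)   # d_h = W_h - W*_h, per (4.13)
        return d.mean() / d.std(ddof=1)          # mean of the d_h divided by their s.d.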
4.5.4.- Confidence Interval.
We fit 6 different models for each of the 1,000 replications. If β̄1
represents the mean of the regression coefficient of X1 over the 1,000 replications for
model (4.1), and sd(β̂1) is the corresponding standard deviation, then using the
large-sample theory for maximum likelihood estimates, the 95% confidence interval for
the parameter β1 is: β̄1 ± 1.96 sd(β̂1)/√N. Now, using model (4.2), let ᾱ1 be the average
of the estimates of the measure of association between Y and X1 after dichotomizing X2,
and let sd(α̂1) be the corresponding standard deviation; the corresponding 95%
confidence interval is given by: ᾱ1 ± 1.96 sd(α̂1)/√N. Given the fact that the estimate of
β1 from the model which used the original control variable (model 4.1) is an
(asymptotically) unbiased estimate, it can be viewed as the reference. We visually
compare the resulting 95% confidence interval for α1 under different cutpoints to the
corresponding 95% confidence interval for β1. The results using the 95% confidence
interval approach are shown in graphs 7a to 7t (pages 99 to 105), graphs 8a to 8t (pages
108 to 114), graphs 9a to 9p (pages 117 to 122) and graphs 10a to 10p (pages 125 to 130).
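A minimal sketch of the interval actually plotted, assuming estimates holds the N replicate estimates of the coefficient:

    import numpy as np

    def ci95(estimates):
        e = np.asarray(estimates, dtype=float)
        half = 1.96 * e.std(ddof=1) / np.sqrt(e.size)   # 1.96 * sd / sqrt(N)
        return e.mean() - half, e.mean() + half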
4.6.- RESULTS.
As in Chapter 3, the results from the simulations are presented using tables
and graphs. In the tables, we present the means of the estimate of the exposure variable
Xl over the 1,000 replications. In addition, the relative differences (R.D.) are presented in
a similar table (section 4.5.1). In a different table, the result about the standardized mean
96
is presented (section 4.5.3). Finally, in a third table we present the result with respect to
the sensitivity (8), specificity (8p) and the total percentage of misclassification (T)
(section 4.5.2). The results of section 4.5.4 are presented using graphs.
We notice that comparable to the multiple linear regression situation, the
effect of dichotomization is also dramatic. They are in fact even more dramatic in some
situations, for low levels of correlation between Xl and X2, and Y and X2, the
discrepancies diminished when the sample size increased from 50 to 200. However, for
moderate to high levels of correlation between Xl and X2, and Y and X2, the large
discrepancies remain dramatic at both sample sizes. If we use the relative difference as a
criterion of comparison, we observe values higher that 100% in many cases, and we even
observe a value over 500%.
In tables 4.7 to 4.10, by way of summarizing the results, we present the distribution of the 360 values of the relative differences: 36 scenarios times 2 sample sizes times 5 deciles. The general mean is 70.43%, which implies that, on average, the difference between the estimate after dichotomization and the estimate using X2 is about 70%. There is not much difference between the two sample sizes, but among the deciles we observe that the worst situations are at the extremes, as for the linear regression model. In terms of the correlation coefficients, there is no regular pattern for ρX1,X2; however, in terms of ρY,X2, we observe that as the correlation increases, the R.D. also increases.
Table N° 4.2.a: Mean for the estimates of β1, and R.D. (%) from 1,000 replications, for different cutpoints (deciles of X2), for scenarios 1 to 23, n=50. The first column of estimates is from the model with X2 continuous (the reference); each decile column gives the mean estimate after dichotomization and its R.D.

Scen.  X2 cont.   decile 1         decile 3         decile 5         decile 7         decile 9
                  mean     R.D.    mean     R.D.    mean     R.D.    mean     R.D.    mean     R.D.
1      -.00732    -.00736  0.61    -.00725  0.93    -.00740  1.13    -.00738  0.80    -.00719  1.71
2      -.04157    -.04196  0.93    -.04172  0.35    -.04160  0.06    -.04252  2.3     -.04077  1.93
3      -.28299    -.23749  16.1    -.24985  11.7    -.25551  9.85    -.25458  10.0    -.22909  19.0
4       .00011    -.00012  205.9   -.00010  189.1   -.00026  332.1   -.00012  211.6   -.00076  785.1
5      -.00048    -.00008  83.9     .00013  126.7   -.00003  94.4    -.00001  98.8    -.00055  15.1
6      -.00279    -.00048  82.9    -.00043  84.6    -.00070  74.8    -.00038  86.5    -.00134  51.9
7      -.00709    -.00831  17.3    -.00786  10.9    -.00767  8.21    -.00769  8.54    -.00787  11.0
8      -.00069    -.00683  892.2   -.00568  725.1   -.00491  613.9   -.00364  428.9   -.00346  403.2
9       .01395    -.00738  152.9   -.00575  141.2   -.00391  128.0   -.00122  108.7   -.0066   147.3
10     -.04482    -.04176  6.84    -.04376  2.34    -.04352  2.90    -.04358  2.77    -.04511  0.65
11     -.05158    -.04147  19.6    -.04201  18.5    -.04455  13.6    -.04723  8.42    -.05077  1.56
13     -.30563    -.24192  20.8    -.26804  12.3    -.27298  10.7    -.27500  10.0    -.25264  17.3
16     -.00571    -.00780  36.8    -.00757  32.6    -.00702  22.9    -.00787  37.9    -.00817  43.2
17      .00959    -.00504  152.6   -.00215  122.5    .00035  96.4     .00181  81.1    -.00179  118.7
18      .09443    -.00532  105.6   -.00059  100.6    .00544  94.2     .01458  84.6     .01131  88.0
19     -.04940    -.04424  10.5    -.0478   3.25    -.04988  0.97    -.04668  5.52    -.04700  4.87
20     -.03777    -.04027  6.61    -.03925  3.92    -.03891  3.02    -.03895  3.11    -.04206  11.4
21     -.07183    -.04343  39.5    -.04220  41.2    -.04178  41.8    -.05109  28.9    -.05241  27.0
22     -.38395    -.24458  36.3    -.29023  24.4    -.30818  19.7    -.31101  19.0    -.26249  31.6
23     -.29350    -.23637  19.5    -.25488  13.2    -.26278  10.5    -.27318  6.93    -.25247  14.0
Table 4.2.b: Estimation for the Standardized Mean, Scenarios 1 to 23, n=50.
2
3
4
5
6
7
8
9
10
11
13
16
17
18
19
20
21
22
23
.02929
.36541
.08343
.04576
.20622
.16234
.23319
.37285
.31422
1.0053
.40909
.15489
.01490
.04808
.19909
1.3781
1.9596
.60954
.84034
.08609
.17241
.01309
.04517
.22225
.08463
.17116
.30672
.14123
.75749
.11617
.14296
.12121
.12212
.09857
1.1067
1.8294
.41576
.58393
.06652
.04377
.04547
.04650
.20783
.07004
.10290
.25884
.16283
.62350
.10145
.13855
.18533
.05783
.13014
.90167
1.6094
.33645
.35364
.01985
.09469
.08387
.05828
.23008
.11484
.09992
.24054
.15263
.51287
.10211
.19697
.19308
.01989
.05579
.78485
1.1382
.29761
.27225
.01553
.38823
.05789
.03546
.20116
.16741
.14952
.18364
.26424
.69807
.30489
.16489
.10627
.21204
.11673
1.0358
.94142
.55219
.46395
Table N° 4.2.c: Estimation for Sensitivity, Specificity and the percentage of the total misclassification probability (T), for different cutpoints, for scenarios 1 to 23, n=50.
I
2
3
4
5
6
7
8
9
10
11
13
16
17
18
19
20
21
22
23
98.9
87.9
80.0
99.7
98.3
97.1
96.9
94.8
92.5
80.4
51.1
77.5
96.3
98.8
93.4
71.1
40.2
26.0
62.6
69.1
85.5
96.7
96.5
76.7
65.2
0.00
88.2
76.2
0.00
97.5
97.9
2.2
5.3
10.9
1.0
2.5
3.1
3.7
6.0
7.7
6.7
25.0
97.4 12.0
75.9
4.8
44.2
4.0
2.7
10.0
94.5 11.2
99.7 38.0
100.0 73.2
95.3 23.5
99.1
17.5
99.4 83.1
91.3 96.5
87.6 94.7
99.8 80.0
97.9 56.5
97.6 0.00
97.7 82.3
96.7 61.9
93.8 50.0
86.1 97.4
58.3 96.1
83.9
96.5
99.2
95.2
79.8
53.3
35.2
75.1
78.5
91.3
85.2
48.1
10.8
94.6
98.4
100.0
92.2
97.3
2.0
4.7
8.5
0.8
3.0
2.6
3.3
4.8
6.3
5.4
22.4
99.2 87.9
91.3 97.4
86.9 92.4
99.6 83.3
98.5 69.6
97.0 0.00
98.5 89.7
97.3 78.6
96.3 0.00
84.9 98.3
68.3 95.3
12.2
4.1
3.5
7.9
9.0
30.2
64.1
17.6
13.1
87.1
98.0
96.9
96.8
81.4
65.0
52.7
80.9
83.0
93.2
90.7
40.4
21.6
94.7
97.3
100.0
91.5
93.7
1.7
4.0
10.1
0.9
2.2
3.2
2.1
3.5
3.9
5.0
17.9
9.7
2.4
6.0
6.0
8.5
23.2
46.8
14.6
12.2
98.6 85.5
89.2 96.9
86.4 94.0
99.7 80.0
98.7 78.3
98.8 0.00
99.1 88.2
98.0 76.2
98.5 50.0
87.4 97.4
78.7 93.7
87.3
94.2
99.2
97.4
81.4
75.6
86.8
80.0
85.4
93.4
32.7
48.1
21.6
94.2
95.9
81.8
90.1
91.7
2.5
4.9
9.4
0.9
1.8
1.4
1.6
2.9
1.6
5.1
13.6
9.5
9.0
3.5
5.4
8.9
17.0
13.3
15.7
1l.8
98.4 85.5
85.7 97.0
80.7 97.8
99.7 90.0
98.5 65.2
98.6 0.00
97.2 88.2
97.1 73.8
96.8 0.00
84.5 97.5
68.3 96.3
81.9
96.9
96.9
97.8
74.4
64.7
71.9
68.7
80.1
95.3
40.4
40.4
10.8
95.4
98.4
81.8
96.5
93.3
2.7
5.6
9.9
0.6
2.3
1.6
3.4
3.9
3.4
5.7
17.4
10.7
6.0
6.0
5.4
8.7
23.0
28.0
19.5
14.0
Figures 7a to 7t plot, for each scenario with n = 50, the 95% confidence interval for Beta1 at each decile cutpoint (deciles 1, 3, 5, 7 and 9 of X2). [Plots not reproduced; the captions and scenario settings were:]

Figure 7a: Scenario 1, ρX1,X2 = 0.00, ρY,X2 = 0.00, ρY,X1 = 0.10.
Figure 7b: Scenario 2, ρX1,X2 = 0.00, ρY,X2 = 0.00, ρY,X1 = 0.40.
Figure 7c: Scenario 3, ρX1,X2 = 0.00, ρY,X2 = 0.00, ρY,X1 = 0.70.
Figure 7d: Scenario 4, ρX1,X2 = 0.00, ρY,X2 = 0.10, ρY,X1 = 0.00.
Figure 7e: Scenario 5, ρX1,X2 = 0.00, ρY,X2 = 0.40, ρY,X1 = 0.00.
Figure 7f: Scenario 6, ρX1,X2 = 0.00, ρY,X2 = 0.70, ρY,X1 = 0.00.
Figure 7g: Scenario 7, ρX1,X2 = 0.20, ρY,X2 = 0.10, ρY,X1 = 0.10.
Figure 7h: Scenario 8, ρX1,X2 = 0.20, ρY,X2 = 0.40, ρY,X1 = 0.10.
Figure 7i: Scenario 9, ρX1,X2 = 0.20, ρY,X2 = 0.70, ρY,X1 = 0.10.
Figure 7j: Scenario 10, ρX1,X2 = 0.20, ρY,X2 = 0.10, ρY,X1 = 0.40.
Figure 7k: Scenario 11, ρX1,X2 = 0.20, ρY,X2 = 0.40, ρY,X1 = 0.40.
Figure 7l: Scenario 13, ρX1,X2 = 0.20, ρY,X2 = 0.10, ρY,X1 = 0.70.
Figure 7m: Scenario 16, ρX1,X2 = 0.40, ρY,X2 = 0.10, ρY,X1 = 0.10.
Figure 7n: Scenario 17, ρX1,X2 = 0.40, ρY,X2 = 0.40, ρY,X1 = 0.10.
Figure 7o: Scenario 18, ρX1,X2 = 0.40, ρY,X2 = 0.70, ρY,X1 = 0.10.
Figure 7p: Scenario 19, ρX1,X2 = 0.40, ρY,X2 = 0.10, ρY,X1 = 0.40.
Figure 7q: Scenario 20, ρX1,X2 = 0.40, ρY,X2 = 0.40, ρY,X1 = 0.40.
Figure 7r: Scenario 21, ρX1,X2 = 0.40, ρY,X2 = 0.70, ρY,X1 = 0.40.
Figure 7s: Scenario 22, ρX1,X2 = 0.40, ρY,X2 = 0.10, ρY,X1 = 0.70.
Figure 7t: Scenario 23, ρX1,X2 = 0.40, ρY,X2 = 0.40, ρY,X1 = 0.70.
Table N° 4.3.a: Mean for the estimates of β1, and R.D. (%) from 1,000 replications, for different cutpoints (deciles of X2), for scenarios 1 to 23, n=200. The first column of estimates is from the model with X2 continuous (the reference); each decile column gives the mean estimate after dichotomization and its R.D.

Scen.  X2 cont.   decile 1         decile 3         decile 5         decile 7         decile 9
                  mean     R.D.    mean     R.D.    mean     R.D.    mean     R.D.    mean     R.D.
1      -.00750    -.00749  0.12    -.00747  0.43    -.00748  0.21    -.00748  0.22    -.00749  0.13
2      -.03468    -.03466  0.07    -.03470  0.04    -.03469  0.03    -.03466  0.07    -.03465  0.10
3      -.10840    -.19751  82.2    -.10850  0.09    -.10799  0.37    -.10788  0.48    -.10734  0.97
4      -.00011    -.00009  13.4    -.00011  2.13    -.00011  1.03    -.00012  12.2    -.00010  5.14
5       .00009     .00006  30.6     .00006  27.2     .00010  9.22     .00003  71.4     .00007  22.6
6      -.00011     .00019  269.8    .00015  236.6    .00009  177.9   -.00003  69.6     .00003  130.5
7      -.00615    -.00694  12.7    -.00668  8.54    -.00655  6.50    -.00665  7.99    -.00687  11.6
8      -.00185    -.00646  249.2   -.00528  185.6   -.00435  135.4   -.00396  114.2   -.00483  161.1
9       .00582    -.00643  210.5   -.00490  184.2   -.00322  155.3   -.00087  114.9   -.00268  146.1
10     -.03401    -.03424  0.69    -.03416  0.45    -.03416  0.44    -.03416  0.45    -.03430  0.85
11     -.03440    -.03514  2.14    -.03492  1.51    -.03486  1.32    -.03486  1.31    -.03513  2.11
13     -.10946    -.10729  1.98    -.10973  0.25    -.10911  0.32    -.10877  0.64    -.10808  1.26
16     -.00503    -.00652  29.7    -.00603  19.9    -.00586  16.6    -.00593  18.0     .00639  226.9
17      .00572    -.00591  203.4   -.00315  155.2   -.00098  117.1    .00017  97.1    -.00204  135.7
18      .03673    -.00478  113.0   -.00043  101.2    .00446  87.8     .01090  70.3     .00620  83.1
19     -.03725    -.03528  5.29    -.03590  3.62    -.03614  2.99    -.03621  2.79    -.03557  4.50
20     -.02706    -.03341  23.5    -.03178  17.4    -.03052  12.8    -.02998  10.8    -.03132  15.7
21     -.02357    -.03345  41.9    -.03202  35.8    -.03039  28.9    -.02840  20.5    -.03008  27.6
22     -.13445    -.11104  17.4    -.11851  11.9    -.12144  9.68    -.12210  9.19    -.11643  13.4
23     -.10373    -.10392  0.18    -.10552  1.72    -.10551  1.71    -.10511  1.33    -.10641  2.58
Table 4.3.b: Estimation for the Standardized Mean, Scenarios 1 to 23, n=200.

Scen.  decile 1   decile 3   decile 5   decile 7   decile 9
1       .00235     .03865     .01053     .00572     .01099
2       .00846     .07357     .04680     .04214     .02696
3       .07708     .06498     .04153     .01096     .11060
4       .00979     .02567     .00172     .00074     .00608
5       .00593     .01322     .00007     .02951     .02781
6       .00547     .00519     .04451     .02893     .02735
7       .59255     .48650     .40493     .50093     .58494
8       .69251     .54713     .42839     .38096     .49619
9       .44964     .17443     .01573     .14904     .39302
10      .56457     .47630     .43159     .41041     .59056
11      1.9123     1.5273     1.22710    1.2253     1.6002
13      .20371     .02235     .02274     .06088     .14326
16      .64902     .53466     .48328     .51441     .63879
17      .11729     .14228     .32240     .44094     .27148
18      1.1017     1.6297     1.7935     1.3942     1.8498
19      .23482     .15769     .08247     .12516     .15822
20      3.0929     2.4855     2.0829     2.0013     2.3976
21      4.3114     3.9036     3.4006     2.3295     1.8510
22      .84370     .68388     .53939     .46516     .65108
23      1.5407     1.0829     .88778     .85519     1.0016
Table N° 4.3.c: Estimation for Sensitivity, Specificity and the percentage of the total misclassification probability (T), for different cutpoints, for scenarios 1 to 23, n=200.
1
2
3
4
5
6
7
8
9
10
11
13
16
17
18
19
20
21
22
23
98.8
99.7
97.3
96.4
92.1
79.1
76.9
88.1
77.8
66.5
0.0
0.0
97.5
100.0
100.0
88.9
64.4
15.9
97.1
77.1
9.3
100.0
100.0
100.0
93.9
3.0
8.9
100.0
100.0
100.0
100.0
100.0
1.6
0.0
0.0
0.9
4.2
7.1
6.9
21.0
28.2
0.0
0.0
0.0
11.2
32.1
79.4
0.0
2.5
52.1
0.0
0.0
99.3
99.5
96.9
96.9
95.58
87.1
85.3
91.7
91.7
88.7
0.0
0.0
97.5
100.0
100.0
92.6
60.0
22.7
97.1
73.8
5.3
100.0
100.0
1.3
0.0
0.0
0.9
4.7
6.3
4.1
13.7
20.7
0.0
0.0
100.0
93.9
3.8
4.0
100.0
100.0
100.0
100.0
100.0
0.0
8.0
20.0
78.9
0.0
2.5
52.1
0.0
0.0
99.4
99.3
97.6
96.7
95.7
93.3
92.0
95.1
96.8
98.0
8.0
0.6
97.8
100.0
100.0
88.9
68.9
25.0
95.7
75.4
4.0
100.0
100.0
1.1
99.1
0.0
0.0
1.3
3.7
6.5
4.3
7.8
14.6
0.0
0.0
99.6
98.3
96.9
95.1
94.7
96.3
100.0
93.1
15.0
12.1
100.0
100.0
100.0
100.0
100.0
0.0
5.2
14.1
70.5
0.0
2.3
51.8
0.0
0.0
94.4
98.3
94.1
4.0
6.9
97.1
100.0
100.0
92.6
73.3
29.5
96.6
73.8
10.7
100.0
100.0
1.5
0.0
0.0
0.8
2.8
6.1
4.6
6.6
10.1
0.0
0.0
100.0
94.6
21.8
37.1
100.0
100.0
99.8
100.0
100.0
0.0
5.6
11.9
5.13
0.0
2.4
48.6
0.0
0.0
98.5
99.5
98.3
97.5
93.7
92.1
93.8
89.0
94.9
98.5
8.00
14.4
98.7
100.0
100.0
90.7
75.6
52.3
96.6
78.7
17.3
100.0
100.0
1.4
0.0
0.0
1.0
2.7
4.5
5.7
8.7
11.7
0.0
0.0
100.0
92.3
9.0
13.9
100.0
100.0
100.0
100.0
100.0
0.0
10.6
16.5
68.9
0.0
2.3
44.6
0.0
0.0
Figures 8a to 8t plot, for each scenario with n = 200, the 95% confidence interval for Beta1 at each decile cutpoint (deciles 1, 3, 5, 7 and 9 of X2). [Plots not reproduced; the captions and scenario settings were:]

Figure 8a: Scenario 1, ρX1,X2 = 0.00, ρY,X2 = 0.00, ρY,X1 = 0.10.
Figure 8b: Scenario 2, ρX1,X2 = 0.00, ρY,X2 = 0.00, ρY,X1 = 0.40.
Figure 8c: Scenario 3, ρX1,X2 = 0.00, ρY,X2 = 0.00, ρY,X1 = 0.70.
Figure 8d: Scenario 4, ρX1,X2 = 0.00, ρY,X2 = 0.10, ρY,X1 = 0.00.
Figure 8e: Scenario 5, ρX1,X2 = 0.00, ρY,X2 = 0.40, ρY,X1 = 0.00.
Figure 8f: Scenario 6, ρX1,X2 = 0.00, ρY,X2 = 0.70, ρY,X1 = 0.00.
Figure 8g: Scenario 7, ρX1,X2 = 0.20, ρY,X2 = 0.10, ρY,X1 = 0.10.
Figure 8h: Scenario 8, ρX1,X2 = 0.20, ρY,X2 = 0.40, ρY,X1 = 0.10.
Figure 8i: Scenario 9, ρX1,X2 = 0.20, ρY,X2 = 0.70, ρY,X1 = 0.10.
Figure 8j: Scenario 10, ρX1,X2 = 0.20, ρY,X2 = 0.10, ρY,X1 = 0.40.
Figure 8k: Scenario 11, ρX1,X2 = 0.20, ρY,X2 = 0.40, ρY,X1 = 0.40.
Figure 8l: Scenario 13, ρX1,X2 = 0.20, ρY,X2 = 0.10, ρY,X1 = 0.70.
Figure 8m: Scenario 16, ρX1,X2 = 0.40, ρY,X2 = 0.10, ρY,X1 = 0.10.
Figure 8n: Scenario 17, ρX1,X2 = 0.40, ρY,X2 = 0.40, ρY,X1 = 0.10.
Figure 8o: Scenario 18, ρX1,X2 = 0.40, ρY,X2 = 0.70, ρY,X1 = 0.10.
Figure 8p: Scenario 19, ρX1,X2 = 0.40, ρY,X2 = 0.10, ρY,X1 = 0.40.
Figure 8q: Scenario 20, ρX1,X2 = 0.40, ρY,X2 = 0.40, ρY,X1 = 0.40.
Figure 8r: Scenario 21, ρX1,X2 = 0.40, ρY,X2 = 0.70, ρY,X1 = 0.40.
Figure 8s: Scenario 22, ρX1,X2 = 0.40, ρY,X2 = 0.10, ρY,X1 = 0.70.
Figure 8t: Scenario 23, ρX1,X2 = 0.40, ρY,X2 = 0.40, ρY,X1 = 0.70.
Table N° 4.4.a: Mean for the estimates of β1, and R.D. (%) from 1,000 replications, for different cutpoints (deciles of X2), for scenarios 25 to 42, n=50. The first column of estimates is from the model with X2 continuous (the reference); each decile column gives the mean estimate after dichotomization and its R.D.

Scen.  X2 cont.   decile 1         decile 3         decile 5         decile 7         decile 9
                  mean     R.D.    mean     R.D.    mean     R.D.    mean     R.D.    mean     R.D.
25     -.00478    -.00692  44.8    -.00630  31.8    -.00612  28.1    -.00667  39.6    -.00684  43.1
26      .02449    -.00485  119.8   -.00068  102.8    .00491  79.9     .00636  74.1     .00120  95.1
27      .27875    -.00357  101.3    .00650  97.7     .01810  93.5     .04925  82.3     .02684  90.4
28     -.06394    -.04426  30.8    -.04901  23.4    -.05281  17.4    -.05209  18.5    -.04890  23.5
29     -.02992    -.03943  31.8    -.03792  26.7    -.03644  21.8    -.03677  22.9    -.03683  23.1
30      .01308    -.03828  392.2   -.03455  364.2   -.03019  330.9   -.02815  315.3   -.03279  350.8
31     -.55879    -.27270  51.2    -.36585  34.5    -.42369  24.2    -.41738  25.3    -.32410  42.0
32     -.27495    -.21982  20.1    -.23542  14.4    -.24695  10.2    -.25601  6.89    -.24354  11.4
33     -.25018    -.22661  9.42    -.23547  5.88    -.25149  0.52    -.26133  4.46    -.24648  1.48
34     -.00423    -.00748  76.9    -.00689  62.9    -.00709  67.6    -.00815  92.7    -.00727  71.9
35      .11143    -.00377  103.4   -.00626  94.4     .01581  85.8    -.01904  82.9    -.00691  93.8
37     -.14799    -.04551  69.3    -.05935  59.9    -.07156  51.6    -.07179  51.5    -.05497  62.9
38     -.02223    -.04066  82.9    -.03882  74.7    -.03584  61.2    -.03552  59.8    -.03735  68.0
39      .23747    -.03772  115.9   -.03241  113.6   -.02323  109.8   -.00771  103.2   -.01216  105.1
41     -.50317    -.23982  52.3    -.28224  43.9    -.33065  34.3    -.37626  25.2    -.29372  41.6
42     -.20609    -.22211  7.78    -.22814  10.7    -.22753  10.4    -.23850  15.7    -.21659  5.09
Table 4.4.b: Estimation for the Standardized Mean, Scenarios 25 to 42, n=50.

Scen.  decile 1   decile 3   decile 5   decile 7   decile 9
25      .17374     .12286     .07537     .10893     .15804
26      .40071     .57026     .67448     .73681     .65618
27      .08465     .14731     .45477     .75848     .18034
28      .13174     .13776     .14918     .15655     .15396
29      1.5454     1.2635     1.0147     .93209     1.0825
30      1.7687     1.4187     1.0341     .55862     .64816
31      1.1542     .82981     .65278     .60831     .92367
32      .80508     .48904     .31526     .21549     .39342
33      1.4453     1.3713     1.1923     .92755     .88633
34      .15373     .08574     .11595     .10695     .06357
35      1.1295     1.3366     1.3697     1.3507     1.5535
37      .34211     .28954     .2358      .30758     .37129
38      1.6075     1.2951     .99557     .88698     1.1592
39      1.2143     .76680     .29662     .19018     .00806
41      .96437     .76329     .56565     .42035     .59229
42      1.4279     1.4304     1.3886     1.1199     1.0419
Table N° 4.4.c: Estimation for Sensitivity, Specificity and the percentage of the total misclassification probability (T), for different cutpoints, for scenarios 25 to 42, n=50.
25
26
27
28
29
30
31
32
33
34
35
37
38
39
41
42
95.6
95.8
95.6
75.9
35.1
29.1
46.7
70.2
34.9
93.5
86.5
41.9
32.4
29.8
53.3
33.2
57.1
6.7
0.00
88.3
99.1
33.3
97.9
98.1
100.0
39.1
2.6
82.8
96.1
32.0
97.0
100.0
6.6
18.9
7.8
14.3
50.3
70.9
48.3
14.9
62.7
8.8
65.6
22.0
61.0
70.1
32.2
64.1
96.7
98.3
94.4
74.9
50.8
46.1
63.7
75.6
38.9
94.3
94.5
47.9
49.3
52.0
59.7
35.0
62.5
12.7
12.0
89.8
97.4
0.00
91.8
95.5
97.3
36.9
8.8
86.5
93.2
10.0
91.9
100.0
5.2
15.8
7.7
13.3
38.6
54.0
33.5
13.8
58.9
8.3
58.7
18.0
46.2
50.1
29.6
62.3
97.5 66.1
99.0 17.6
88.10 28.0
78.7 90.9
64.3 95.6
69.8 0.00
74.7 87.8
83.9 94.9
50.9 94.6
94.6 52.2
96.3 17.4
50.4 88.2
66.6 91.3
80.6 0.00
69.8 88.9
41.0 92.7
4.3
14.4
13.4
12.6
28.6
30.4
2 .0
10.2
47.5
7.4
52.7
16.2
30.9
23.4
23.9
56.9
97.0 60.7 5.0
99.3 23.6 13.2
85.6 20.0 16.0
82.6 90.4 11.2
73.2 92.5 22.4
95.3 33.3 4.9
73.9 83.7 25.1
88.9 92.3 9.3
76.7 86.5 22.9
94.5 41.3 7.9
96.6 17.6 52.5
49.6 86.4 17.9
73.2 89.3 25.1
98.4 0.00 6.6
75.3 83.1 22.1
73.4 80.5 26.3
96.2
98.9
95.9
73.9
62.9
84.4
57.8
81.6
66.7
95.8
97.9
43.6
58.8
91.3
67.8
59.4
66.1
5.5
9.1
15.9
12.0
6.2
90.7 12.8
97.4 29.2
33.3 15.8
94.9 38.6
93.8 11.9
100.0 32.1
30.4
7.2
60.1
4.5
82.6 22.0
95.2 37.5
12.9
8.0
89.5 25.0
97.6 39.0
Figures 9a to 9p plot, for each scenario with n = 50, the 95% confidence interval for Beta1 at each decile cutpoint (deciles 1, 3, 5, 7 and 9 of X2). [Plots not reproduced; the captions and scenario settings were:]

Figure 9a: Scenario 25, ρX1,X2 = 0.60, ρY,X2 = 0.10, ρY,X1 = 0.10.
Figure 9b: Scenario 26, ρX1,X2 = 0.60, ρY,X2 = 0.40, ρY,X1 = 0.10.
Figure 9c: Scenario 27, ρX1,X2 = 0.60, ρY,X2 = 0.70, ρY,X1 = 0.10.
Figure 9d: Scenario 28, ρX1,X2 = 0.60, ρY,X2 = 0.10, ρY,X1 = 0.40.
Figure 9e: Scenario 29, ρX1,X2 = 0.60, ρY,X2 = 0.40, ρY,X1 = 0.40.
Figure 9f: Scenario 30, ρX1,X2 = 0.60, ρY,X2 = 0.70, ρY,X1 = 0.40.
Figure 9g: Scenario 31, ρX1,X2 = 0.60, ρY,X2 = 0.10, ρY,X1 = 0.70.
Figure 9h: Scenario 32, ρX1,X2 = 0.60, ρY,X2 = 0.40, ρY,X1 = 0.70.
Figure 9i: Scenario 33, ρX1,X2 = 0.60, ρY,X2 = 0.70, ρY,X1 = 0.70.
Figure 9j: Scenario 34, ρX1,X2 = 0.80, ρY,X2 = 0.10, ρY,X1 = 0.10.
Figure 9k: Scenario 35, ρX1,X2 = 0.80, ρY,X2 = 0.40, ρY,X1 = 0.10.
Figure 9l: Scenario 37, ρX1,X2 = 0.80, ρY,X2 = 0.10, ρY,X1 = 0.40.
Figure 9m: Scenario 38, ρX1,X2 = 0.80, ρY,X2 = 0.40, ρY,X1 = 0.40.
Figure 9n: Scenario 39, ρX1,X2 = 0.80, ρY,X2 = 0.70, ρY,X1 = 0.40.
Figure 9o: Scenario 41, ρX1,X2 = 0.80, ρY,X2 = 0.40, ρY,X1 = 0.70.
Figure 9p: Scenario 42, ρX1,X2 = 0.80, ρY,X2 = 0.70, ρY,X1 = 0.70.
Table N° 4.5.a: Mean for the estimates of β1, and R.D. (%) from 1,000 replications, for different cutpoints (deciles of X2), for scenarios 25 to 42, n=200. The first column of estimates is from the model with X2 continuous (the reference); each decile column gives the mean estimate after dichotomization and its R.D.

Scen.  X2 cont.   decile 1         decile 3         decile 5         decile 7         decile 9
                  mean     R.D.    mean     R.D.    mean     R.D.    mean     R.D.    mean     R.D.
25     -.00469    -.00682  45.4    -.00628  33.8    -.00601  28.1    -.00608  29.5    -.00653  39.1
26      .01956    -.00448  122.9    .00034  98.3     .00439  77.6     .00633  67.6     .00173  91.2
27      .16984    -.00382  102.3    .00445  97.4     .01443  91.5     .02757  83.8     .01520  91.0
28     -.04765    -.03641  23.6    -.03944  17.2    -.04123  13.5    -.04118  13.6    -.03838  19.5
29     -.02286    -.03331  45.7    -.03107  35.9    -.02904  27.0    -.02791  22.1    -.02987  30.6
30      .00504    -.03242  743.6   -.02872  670.2   -.02397  576.0   -.01707  439.0   -.01991  495.3
31     -.32203    -.12330  61.7    -.15457  52.0    -.17715  45.0    -.17890  44.4    -.14356  55.5
32     -.10902    -.10348  5.08    -.10564  3.10    -.10734  1.54    -.10759  1.31    -.10691  1.94
33     -.11342    -.10338  8.85    -.10231  9.79    -.10412  8.20    -.10621  6.36    -.10512  3.32
34     -.00398    -.00666  67.1    -.00611  53.5    -.00622  56.1    -.00606  52.1    -.00667  67.4
35      .06328    -.00346  105.5    .00472  92.5     .01249  80.3     .01539  75.7     .00609  90.4
37     -.09307    -.03951  57.6    -.04849  47.9    -.05532  40.6    -.05564  40.2    -.04632  50.2
38     -.01977    -.03400  71.9    -.03210  62.4    -.02988  51.1    -.02819  42.6    -.03019  52.7
39      .08515    -.03332  139.1   -.02825  133.2   -.01962  123.0   -.00501  105.9   -.01127  113.2
41     -.20961    -.10855  48.2    -.12191  41.8    -.13863  33.9    -.14904  28.9    -.13496  35.6
42     -.07000    -.10319  47.4    -.10064  43.8    -.09637  37.7    -.08921  27.4    -.09250  32.1
Table 4.5.b: Estimation for the Standardized Mean, Scenarios 25 to 42, n=200.

Scen.  decile 1   decile 3   decile 5   decile 7   decile 9
25      .59024     .47957     .39557     .40399     .51991
26      .92750     1.3751     1.6467     1.7171     1.5511
27      1.2827     1.3563     .40713     .31446     .66852
28      .53620     .51576     .40704     .42490     .51984
29      3.6008     2.9534     2.4698     2.2420     2.7306
30      3.4776     2.5391     1.9157     .96642     .98701
31      2.2041     1.6838     1.3586     1.3152     1.7986
32      1.3005     .89253     .67927     .65464     .87573
33      3.4274     3.1830     2.7747     1.9934     2.0117
34      .49475     .38074     .35510     .35436     .47278
35      3.3558     4.6598     4.8259     4.4124     4.6785
37      1.3328     1.1573     1.0210     1.0837     1.3462
38      3.8156     3.0871     2.5128     2.2377     2.5484
39      1.7074     .75061     .29757     1.8448     1.1756
41      1.7655     1.3056     1.0613     .90930     1.1372
42      3.1280     3.1591     3.0367     2.6529     2.3696
Table N° 4.5.c: Estimation for Sensitivity, Specificity and the percentage of the total misclassification probability (T), for different cutpoints, for scenarios 25 to 42, n=200.
25
26
27
28
29
30
31
32
33
34
35
37
38
39
41
42
84.6
56.3
87.7
0.0
0.0
1.7
2.8
82.3
0.0
0.0
0.0
0.0
86.8 15.2
5.4
81.6
10.2 72.6
100.0 0.0
100.0 15.4
96.6 94.3
100.0 17.4
100.0 0.0
100.0 6.9
61.8 19.3
11.2 88.8
100.0 0.0
100.0 49.9
100.0 5.5
100.0 1.5
100.0 8.5
89.6
89.5
85.5
1.3
0.3
5.1
2.8
88.6
0.4
0.0
0.0
0.0
83.0 11.1
4.4
73.8
12.9 70.6
100.0 0.0
100.0 15.2
94.9 94.1
10.0 16.8
100.0 0.0
100.0 6.9
63.2 13.3
13.7 86.3
100.0 0.0
100.0 49.7
97.1
8.2
100.0 1.5
100.0 8.5
92.4
98.4
46.7
6.5
4.3
6.8
5.6
88.9
3.8
1.8
6.7
0.0
84.9
8.4 91.7
14.4 64.1 99.6
55.5 46.5 8.8
100.0 0.0
100.0 14.4 7.8
67.8 92.0 40.6
99.5 16.9 9.6
100.0 0.0
99.9
6.8 11.3
69.7 12.5 90.8
39.0 61.0
100.0 0.0
100.0 48.0 10.8
70.4 33.4 50.9
100.0 1.4 20.0
1.2
100.0 8.5
83.0
9.2
21.8 58.3
90.4 28.1
100.0 0.0
100.0 14.2
15.3 60.9
99.4 16.5
100.0 0.0
99.5
6.8
72.4 10.6
57.7 42.3
100.0 0.0
99.8 44.6
7.4
90.2
100.0 1.2
100.0 8.4
86.8
98.8
50.7
3.9
37.9
2.3
11.3
87.5
4.4
49.1
0.0
4.7
84.0
5.5
43.5
100.0
100.0
25.4
99.9
100.0
99.7
73.7
13.4
100.0
100.0
26.8
100.0
100.0
13.5
70.6
54.9
0.0
14.8
62.8
17.4
0.0
6.6
13.6
86.6
0.0
47.7
72.0
1.5
8.1
Figures 10a to 10p plot, for each scenario with n = 200, the 95% confidence interval for Beta1 at each decile cutpoint (deciles 1, 3, 5, 7 and 9 of X2). [Plots not reproduced; the captions and scenario settings were:]

Figure 10a: Scenario 25, ρX1,X2 = 0.60, ρY,X2 = 0.10, ρY,X1 = 0.10.
Figure 10b: Scenario 26, ρX1,X2 = 0.60, ρY,X2 = 0.40, ρY,X1 = 0.10.
Figure 10c: Scenario 27, ρX1,X2 = 0.60, ρY,X2 = 0.70, ρY,X1 = 0.10.
Figure 10d: Scenario 28, ρX1,X2 = 0.60, ρY,X2 = 0.10, ρY,X1 = 0.40.
Figure 10e: Scenario 29, ρX1,X2 = 0.60, ρY,X2 = 0.40, ρY,X1 = 0.40.
Figure 10f: Scenario 30, ρX1,X2 = 0.60, ρY,X2 = 0.70, ρY,X1 = 0.40.
Figure 10g: Scenario 31, ρX1,X2 = 0.60, ρY,X2 = 0.10, ρY,X1 = 0.70.
Figure 10h: Scenario 32, ρX1,X2 = 0.60, ρY,X2 = 0.40, ρY,X1 = 0.70.
Figure 10i: Scenario 33, ρX1,X2 = 0.60, ρY,X2 = 0.70, ρY,X1 = 0.70.
Figure 10j: Scenario 34, ρX1,X2 = 0.80, ρY,X2 = 0.10, ρY,X1 = 0.10.
Figure 10k: Scenario 35, ρX1,X2 = 0.80, ρY,X2 = 0.40, ρY,X1 = 0.10.
Figure 10l: Scenario 37, ρX1,X2 = 0.80, ρY,X2 = 0.10, ρY,X1 = 0.40.
Figure 10m: Scenario 38, ρX1,X2 = 0.80, ρY,X2 = 0.40, ρY,X1 = 0.40.
Figure 10n: Scenario 39, ρX1,X2 = 0.80, ρY,X2 = 0.70, ρY,X1 = 0.40.
Figure 10o: Scenario 41, ρX1,X2 = 0.80, ρY,X2 = 0.40, ρY,X1 = 0.70.
Figure 10p: Scenario 42, ρX1,X2 = 0.80, ρY,X2 = 0.70, ρY,X1 = 0.70.
Table N° 4.7: Mean of R.D. by sample size and deciles.

            N      Mean     S.D.      Min      Max
General    360    70.43    118.9    0.031    892.2
n=50       180    76.89    130.4    0.055    892.2
n=200      180    63.98    106.4    0.031    743.6
decile 1    72    86.41    144.8    0.068    892.2
decile 3    72    73.80    125.0    0.040    725.1
decile 5    72    64.77    111.0    0.031    613.9
decile 7    72    55.12     82.9    0.072    439.0
decile 9    72    72.07    123.4    0.102    785.1
Table N° 4.8: Mean of R.D. by combinations of ρX1,X2 (columns) and ρY,X2 (rows).

ρY,X2       0.00     0.20     0.40     0.60     0.80   General
0.00        5.43                                         5.43
0.10      175.77     6.58    24.05    33.33    59.99    40.64
0.40       57.98   198.95    48.65    43.09    63.92    76.91
0.70      126.49   148.92    63.09   189.05    70.01   123.20
General    62.76    94.43    43.04    88.49    64.54    70.43
Table N° 4.9: Distribution of the 360 values of R.D., by sample size.

R.D. (%)       n=50      %    Cum.%    n=200     %    Cum.%    Total     %    Cum.%
< 10            37     20.6    20.6      58    32.2    32.2      95    26.4    26.4
10-19.99        31     17.2    37.8      18    10.0    42.2      49    13.6    40.0
20-29.99        16      8.9    46.7      14     7.8    50.0      30     8.3    48.3
30-39.99        12      6.7    53.4      10     5.6    55.6      22     6.1    54.4
40-49.99         8      4.4    57.8      13     7.2    62.8      21     5.8    60.2
50-59.99         7      3.9    61.7       9     5.0    67.8      16     4.4    64.6
60-69.99         6      3.3    65.0       6     3.3    71.1      12     3.3    67.9
70-79.99         6      3.3    68.3       5     2.8    73.9      11     3.1    71.0
80-89.99        11      6.1    74.4       5     2.8    76.7      16     4.4    75.4
90-99.99        11      6.1    80.5       8     4.4    81.1      19     5.3    80.8
100 & +         35     19.5   100.0      34    18.9   100.0      69    19.2   100.0

R.D. severity categories used in Table 4.10: None (R.D. < 10): 26.4%; Mild (10-29.99): 21.9%; Moderate (30-49.99): 11.9%; Severe (50 or more): 39.7%.
Table N° 4.10: Distribution (%) of the 36 scenarios by R.D. category, deciles and sample size.

                          n=50                                  n=200
            d1     d3     d5     d7     d9      d1     d3     d5     d7     d9
None       16.7   16.7   22.2   27.8   19.4    25.0   33.3   38.9   33.3   30.6
Mild       19.4   27.8   30.7   25.0   27.8    16.7   13.9   16.7   25.0   16.7
Moderate   16.7   13.9    5.6    5.6   13.9    16.7   16.7   11.1    8.3   11.1
Severe     47.2   41.7   41.7   41.7   38.9    41.67  36.1   33.3   33.3   41.7
CHAPTER 5
CONCLUSIONS, RECOMMENDATIONS AND FURTHER RESEARCH
5.1.- CONCLUSIONS AND RECOMMENDATIONS.
As was mentioned at the beginning of this dissertation, there are many concerns about the categorization of continuous variables for statistical analysis purposes. It is well known that this practice, among other consequences, implies loss of information and loss of power. As we described in the examples from Chang et al. and Hosmer et al., the conclusions when one or more continuous variables in the problem are dichotomized can be different from those of the analysis using the original variables. In addition, the choice of the cutpoint for dichotomization can also lead to different conclusions.

In the medical literature, it is very common to see statistical models used to study the relationship between a particular outcome (disease) and a particular exposure variable. Also, in order to better estimate that relationship, one or more control variables are included in the model. Many times, when the variables in the model are continuous, they are transformed to discrete variables and used in this form in the analysis. Given the results described in the two examples used for motivation, one can ask whether the results and conclusions published in the literature where dichotomization was used are correct or not.
The objective of this dissertation was to evaluate the consequences of dichotomizing continuous variables in regression models. We were interested in one particular problem, the one involving one outcome, one exposure and one control variable. In practice, these three variables can be continuous, discrete, or a mixture of both. In any scenario, any of the continuous variables can be dichotomized. However, we chose to study only the case when the control variable is dichotomized, and we are interested in evaluating how this practice affects the measure of association between the outcome and the exposure.
Basically, we studied two models; the first one is when the outcome is a continuous variable, which implied studying a multiple linear regression model. The second situation was when the outcome is binary, which implied studying the dichotomization under the multiple logistic regression model. For the linear regression model, using a tri-variate normal distribution for the variables involved in the model, two approaches were used. First, we studied the effect of dichotomization of the control variable through an analytic approach, and after that, we studied the effect using a simulation approach. In the logistic regression case, only the simulation approach was conducted.
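The following is a minimal sketch of the linear-regression arm of this design, not the author's code: it draws (Y, X1, X2) from a trivariate normal distribution under one assumed correlation scenario, fits the model with X2 continuous and again with X2 dichotomized at its median, and compares the mean coefficient of X1 across replications.

import numpy as np

rng = np.random.default_rng(42)
r12, ry2, ry1 = 0.60, 0.40, 0.40   # rho_X1X2, rho_YX2, rho_YX1 (one assumed scenario)
cov = np.array([[1.0, ry1, ry2],   # variable order: Y, X1, X2
                [ry1, 1.0, r12],
                [ry2, r12, 1.0]])

def beta1_hat(y, x1, z):
    # OLS coefficient of X1 in the regression of Y on (1, X1, Z).
    X = np.column_stack([np.ones_like(x1), x1, z])
    return np.linalg.lstsq(X, y, rcond=None)[0][1]

b_cont, b_dich = [], []
for _ in range(1000):                                   # 1,000 replications, n = 200
    y, x1, x2 = rng.multivariate_normal(np.zeros(3), cov, size=200).T
    b_cont.append(beta1_hat(y, x1, x2))                 # control kept continuous
    b_dich.append(beta1_hat(y, x1, (x2 > np.median(x2)).astype(float)))  # median split
print(np.mean(b_cont), np.mean(b_dich))   # the median split shifts the X1 coefficient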
In terms of results, under the linear regression model, we showed analytically that when the control variable is dichotomized, the resulting model is not linear. This implies that if someone fits a linear regression model after dichotomization, what they are essentially doing is misspecifying the model. Under the analytic approach, we focused our attention on the model structure rather than on the parameter estimation. In order to study the impact on the measure of association, we conducted a simulation process. In general, under most of the scenarios considered in the simulation, we obtained an overestimation of the magnitude of the measure of association between the outcome and the exposure variable. One important result concerns the case when the control variable is not a confounder before dichotomization. In that case, even if we use the correlation coefficient between the exposure and the control variable as the criterion (ρX1,X2 = 0), and depending on the correlation structure, the dichotomized variable can become a confounding variable. For example, we found cases where the difference between the crude measure of association and the adjusted one (adjusted for the dichotomized variable) was more than 50%. The average of the R.D. was 28.8% and it depended on the correlation coefficients. The R.D. increases as the correlation coefficient between X1 and X2 increases, as well as when the correlation coefficient between the outcome and the control variable increases.
Finally, we studied the effect of dichotomizing the control variable under the logistic regression model. Due to the fact that we do not know in advance the true regression coefficient, the coefficients obtained after dichotomization of the control variable were compared with the coefficient from the model without dichotomization. In this case, the results were similar to the ones obtained in the linear regression model. However, sometimes the differences between the regression coefficient before and after dichotomization were more dramatic. For example, for scenario 8 with n=50, the differences ranged between 403.2% and 892.2%. On the other hand, if we considered all the cases, the average for the R.D. was 70.43%, higher than the one observed for the linear regression model.

In general, given our results, whether based on the analytic approach or on the simulation approaches, we do not recommend dichotomizing the control variable. We recognize that this is a very common practice, because with the dichotomous variable the relationship between the disease and the exposure can be interpreted more easily, and also because sometimes the relationship between the outcome and the exposure is not necessarily linear. However, when we dichotomize the continuous control variable, we obtain a severe bias in the estimates, if we assume that the relationship with the continuous variable is the correct one. If the relationship is not linear and one dichotomizes, one must understand that the resulting model is not only non-linear, but addresses a different relationship or research question.
5.2.- FURTHER RESEARCH.

As was mentioned in the previous section, we focused our attention only on the dichotomization of one variable, the control one. It is also possible to study the effect on the measure of association between the outcome and the exposure:

- when the exposure variable is dichotomized,
- when both the control and the exposure are dichotomized,
- when the outcome is dichotomized.
The model studied in this dissertation includes only one exposure variable and one control variable. An interaction term could also be included, and therefore the effect of dichotomizing the control variable can be considered:

- under the situation of having an interaction term,
- when more than one control variable is included in the model.
Another important matter is related to the statistical distribution of the variables in the
model. For example, in the linear regression model, we assumed multivariate normality.
It is possible to study the situation in which the continuous variables are not normal.
Finally, one practical and important problem to study is when the control variable is
categorized instead of dichotomized. It would be important to have some
recommendations in terms of the number of groups for categorizing if someone insists on
this approach.
BIBLIOGRAPHY
1.- Altman, D., Lausen, B., Sauerbrei, W. and Schumacher, M.: Danger of Using "Optimal" Cutpoints in the Evaluation of Prognostic Factors. Commentary in Journal of the National Cancer Institute, Vol. 86, N° 11, June 1, 1994.
2.- Altman, D.: Problems in Dichotomizing Continuous Variables. Letters to the Editor, American Journal of Epidemiology, Vol. 139, N° 4, 1994.
3.- Amenta, S., Brocher, S. and Serenko-Aber, A.: Comparing Different Statistical Methods for Evaluating Diagnostic Effectiveness of Clinical Tests: Respiratory Distress Syndrome as a Model. Clin. Chem., Vol. 34, N° 2, 1988.
4.- Becher, H.: The Concept of Residual Confounding in Regression Models and Some Applications. Statistics in Medicine, Vol. 11, 1992.
5.- Brown, C., Kipnis, V., Freedman, L., Hartman, A., Schatzkin, A. and Wacholder, S.: Energy Adjustment Methods for Nutritional Epidemiology: The Effect of Categorization. American Journal of Epidemiology, Vol. 139, N° 3, 1994.
6.- Chang, H., Lininger, L., Doyle, J., Maccubbin, P. and Rotherberg, R.: Application of the Cox Model as a Predictor of Relative Risk of Coronary Heart Disease in the Albany Study. Statistics in Medicine, Vol. 9, 1990.
7.- Connors, R.: Grouping for Testing Trends in Categorical Data. Journal of the American Statistical Association, 67, 1972.
8.- Donner, A. and Eliasziw, M.: Statistical Implications of the Choice Between a Dichotomous or Continuous Trait in Studies of Interobserver Agreement. Biometrics, 50, June 1994.
9.- Flegal, C., Keyl, P. and Nieto, J.: Differential Misclassification Arising from Nondifferential Errors in Exposure Measurement. American Journal of Epidemiology, Vol. 134, N° 10, 1991.
10.- Fung, K. Y. and Howe, G. R.: Methodological Issues in Case-Control Studies. III: The Effect of Joint Misclassification of Risk Factors and Confounding Factors upon Estimation and Power. International Journal of Epidemiology, 13, 1984.
11.- Greenland, S.: Dose-Response and Trend Analysis in Epidemiology: Alternatives to Categorical Analysis. Epidemiology, Vol. 6, N° 4, 1995.
12.- Greenland, S.: Avoiding Power Loss Associated with Categorization and Ordinal Scores in Dose-Response and Trend Analysis. Epidemiology, Vol. 6, N° 4, 1995a.
13.- Hosmer, D. and Lemeshow, S.: Applied Logistic Regression. John Wiley, 1989.
14.- Kahn, H. and Sempos, C.: Statistical Methods in Epidemiology. Oxford University Press, 1989.
15.- Kleinbaum, D., Kupper, L. and Muller, K.: Applied Regression Analysis and Other Multivariable Methods. PWS-KENT Publishing Company, 2nd ed., 1988.
16.- Lagakos, S. W.: Effects of Mismodeling and Mismeasuring Explanatory Variables on Tests of their Association with a Response Variable. Statistics in Medicine, Vol. 7, 1988.
17.- Qaqish, B.: Categorizing Continuous Covariates in Epidemiologic Studies. Unpublished manuscript, 1994.
18.- Maxwell, S. and Delaney, H.: Bivariate Median Splits and Spurious Statistical Significance. Psychological Bulletin, Vol. 113, N° 1, 1993.
19.- Neumann, L.: Effect of Categorization on the Correlation Coefficient. Quality and Quantity, 16, 1982.
20.- Ragland, D.: Dichotomizing Continuous Outcome Variables: Dependence of the Magnitude of Association and Statistical Power on the Cutpoint. Epidemiology, Vol. 3, N° 3, 1992.
21.- Rayner, J., Dodds, K. and Best, D.: The Effect of Categorization on a Simple Analysis of Variance Model. Biom. J., 28, 2, 1986.
22.- Rayner, J., Liddell, G. and Seyb, A.: The Effect of Categorization on Balanced Incomplete Blocks and Latin Squares. Biom. J., 34, 1, 1992.
23.- Reade-Christopher, S. and Kupper, L.: On the Effects of Predictor Misclassification in Multiple Linear Regression Analysis. Biometrics, March 1992.
24.- Wartenberg, D. and Northridge, M.: Defining Exposure in Case-Control Studies: A New Approach. Am. J. Epidemiology, 133, 1991.
25.- Zhao, L. P. and Kolonel, L.: Efficiency Loss from Categorizing Quantitative Exposures into Qualitative Exposures in Case-Control Studies. American Journal of Epidemiology, Vol. 136, N° 4, 1992.