A Study of Determinants of Plasma Retinol and Beta-Carotene

APPLIED STATISTICS
FALL 2009
A STUDY OF DETERMINANTS OF PLASMA RETINOL AND BETA-CAROTENE
PROJECT REPORT
TUTOR: Dr. Kaibo Wang
TEAM MEMBERS:
(ACCORDING TO STUDENT ID)
WANG, JUN
2009210552
CUI, WEN
2009210554
LV, SHIKUN
2009210566
SUN, NINGNING
2009210571
Applied Statistics Report, Industrial Engineering, Tsinghua University
TABLE OF CONTENTS
1. INTRODUCTION
2. LITERATURE REVIEW
3. PURPOSE OF THE STUDY
4. OUTLINE OF THE ANALYSIS
5. ANALYSIS RESULTS
5.1 VARIABLES TYPES AND LEVELS
5.1.1 Quantitative variables
5.1.2 Categorical variables
5.2 DESCRIPTIVE ANALYSIS
5.2.1 SEX
5.2.2 VITUSE
5.2.3 SMOKSTAT
5.2.4 AGE
5.2.5 QUETELET
5.2.6 CALORIES
5.2.7 FAT
5.2.8 FIBER
5.2.9 ALCOHOL
5.2.10 CHOLESTEROL
5.2.11 BETADIET
5.2.12 RETDIET
5.3 DATA ANALYSIS VIA REGRESSION AND GLM
5.3.1 Plasma Beta-carotene
5.3.2 Retinol
6. DISCUSSION
7. APPENDIX
6.1 Plasma Beta-carotene
6.1.1 Regression method
6.1.2 GLM Method
6.2 Retinol
6.2.1 Regression Method
6.2.2 GLM Method
8. REFERENCES
1. INTRODUCTION
Observational studies have suggested that low dietary intake or low plasma
concentrations of retinol, beta-carotene, or other carotenoids might be associated with
increased risk of developing certain types of cancer. However, relatively few studies
have investigated the determinants of plasma concentrations of these micronutrients.
In this project, we use data from a cross-sectional study that investigated the
relationship between personal characteristics and dietary factors, and plasma
concentrations of retinol, beta-carotene and other carotenoids. Study subjects (N =
315) were patients who had an elective surgical procedure during a three-year period
to biopsy or remove a lesion of the lung, colon, breast, skin, ovary or uterus that was
found to be non-cancerous.
We want to check whether there is wide variability in plasma concentrations of these
micronutrients in humans, and whether much of this variability is associated with
dietary habits and personal characteristics.
2. LITERATURE REVIEW
Observational studies have suggested that low dietary intake or low plasma
concentrations of retinol, beta-carotene, or other carotenoids might be associated with
increased risk of developing certain types of cancer [1].
The relationships among plasma carotenoids, plasma cholesterol, cigarette
smoking, vitamin supplement use, and intakes of alcohol, vitamin A, and carotene
were investigated in 1981 for 187 Multiple Risk Factor Intervention Trial men in
Pittsburgh by Russell-Briefel R [2]. The total plasma carotenoid value
was positively correlated with the dietary carotene and vitamin A indices (estimated
by a food frequency questionnaire), vitamin A supplement usage, and plasma
cholesterol, and inversely related to cigarette smoking, alcohol intake, and serum
aspartate transaminase. The mean plasma carotenoid level was higher in nonsmokers
(x̄ = 186 µg/dl, 95% confidence interval (CI) 178–195) than in cigarette
smokers (x̄ = 164 µg/dl, 95% CI 151–178) and higher in vitamin A supplement users
(x̄ = 206 µg/dl, 95% CI 188–224) than in nonusers (x̄ = 172 µg/dl, 95% CI
164–179). Variables associated with the total plasma carotenoids in multiple
regression analyses included dietary vitamin A and carotene, calorie intake, weekly
alcohol intake, cigarette smoking, vitamin supplement usage, and plasma cholesterol,
and accounted for 27% of the variance. The total plasma carotenoid value was also
highly correlated with plasma beta-carotene (r = 0.67) and lycopene (r = 0.68). The
mean beta-carotene (30 µg/dl), however, accounted for only 16% of the total plasma
carotenoids.
The relationship of diet and nutritional supplements, cigarette use, alcohol
consumption, and blood lipids to plasma levels of beta-carotene was studied among
330 men and women aged 18–79 years by Stryker WS [3]. Dietary
carotene, preformed vitamin A, and vitamin E intake were estimated by a
self-administered semiquantitative food frequency questionnaire. The correlation of
dietary carotene with plasma beta-carotene was reduced in smokers compared with
non-smokers (r = 0.02 vs. 0.44 among men; r = 0.19 vs. 0.45 among women).
Smokers had much lower plasma levels of beta-carotene than did nonsmokers
(geometric mean 8.5 vs. 15.3 µg/dl for men; 17.3 vs. 26.3 µg/dl for women) despite
only slightly lower intakes of carotenoids. In multiple regression analyses, men who
smoked one pack per day had 72% (95% confidence interval (CI) 58–89) of the
plasma beta-carotene levels of nonsmokers after accounting for dietary carotene and
other variables; for women, the corresponding percentage was 79% (CI 64–99). In
similar models, men drinking 20 g of alcohol per day had 76% (CI 65–88) of the
beta-carotene levels of nondrinkers; women had 89% (CI 73–108) of the levels of
nondrinkers. An interaction term for carotene intake and smoking was statistically
significant in a model combining both sexes. These results suggest that plasma levels
of beta-carotene among smokers and, perhaps, heavy consumers of alcohol may be
reduced substantially below the levels expected from differences in diet. The
correlation of calorie-adjusted intake of vitamin E with lipid-adjusted plasma levels of
alpha-tocopherol was 0.53 for men (n = 137) and 0.51 for women (n = 193) and did
not differ by alcohol consumption or cigarette use; these correlations were largely
accounted for by use of vitamin supplements. In linear regression models, vitamin E
intake and plasma lipids were significant predictors of plasma alpha-tocopherol
levels.
Vitamin A and its analogues have an important role in cellular processes related
to carcinogenesis. The prospect that high intake of certain vitamins may confer
protection against cancer has drawn substantial attention during past decades. As a
consequence, some of the epidemiologic literature pertaining to these vitamins has
dealt with the relation of cancer to either their intake in the diet or their levels in the
serum. Willett WC showed that the risk of cancer may decrease with higher plasma
retinol levels. However, plasma retinol levels are under strict control, and
a high intake of preformed vitamin A does not seem to be relevant for cancer
prevention.
Many epidemiologic studies have therefore been conducted, primarily as dietary
studies of vitamin A and carotene or as blood studies of serum retinol. Researchers
have tried to identify the relationship between dietary or personal factors and plasma
retinol. Stähelin, H. B. suggested an inverse relationship between vitamin A and
cancer risk, although some studies have found no relationship. Later work found that
a lower retinol level is not the cause of invasive cancer; instead, it is the cancer that
brings about a lower retinol level in the human body.
In daily life, vitamin A may be derived from the diet or other daily consumption.
In addition, living habits and conditions such as social status, smoking, drinking and
age also have an important effect on the plasma retinol level. Personal factors such as
smoking status, diet, gender and height have a complex relationship with plasma
retinol, since these factors may influence each other or confound the result.
Among these personal factors, the most strongly related is gender: males tend to
have a higher retinol level. Higher levels of serum vitamin A
among males than among females, and higher levels of serum carotene in females than
in males, have been reported in studies in the United States, England, West Germany,
Denmark, and Switzerland. Because hepatic secretion of retinol binding protein
maintains fairly constant levels of retinol over a wide range of dietary retinol intake,
plasma retinol levels are independent of vitamin A intake as long as people are in a
normal state, with neither deficient nor excessive levels.
Besides the factors mentioned above, we also include age, height and weight, and
other factors such as fat and cholesterol, which may affect the plasma retinol level and
may cause cancer at the same time.
3. PURPOSE OF THE STUDY
Many dietary and biochemical epidemiologic studies have shown a strong
association between beta-carotene, retinol and the risk of cancer. These findings have
increased interest in factors influencing beta-carotene and retinol levels in
human plasma or serum, and have also inspired scientists to look for factors that
may affect, or be related to, the beta-carotene and retinol levels in people’s
plasma.
As mentioned above, previous studies have indicated that if a person’s
dietary history shows greater consumption of green or yellow leafy vegetables, which
contain more carotene, the intake level of beta-carotene is high. Smokers and
alcohol drinkers have lower beta-carotene levels. However, the relationship between
beta-carotene levels and a person’s age, sex, Quetelet index, and daily consumption of
calories, fat, fiber and cholesterol is less clear because the evidence is either scanty or
conflicting.
Similarly, higher levels of serum retinol may be associated with a decreased risk
of cancer, although the epidemiologic evidence is less consistent than that for
beta-carotene. It appears that lower retinol levels may be a consequence rather than a
cause of invasive cancer. The relation between personal factors and plasma retinol
levels is complex, since hepatic secretion of retinol binding protein maintains fairly
constant levels of retinol over a wide range of dietary retinol intake.
In this study, we use data from an observational study of 315 patients. The data
contain 12 independent variables (age, sex, smoking status,
Quetelet index, vitamin use, number of calories consumed per day, grams of fat consumed
per day, grams of fiber consumed per day, number of alcoholic drinks consumed per
week, milligrams of cholesterol consumed per day, micrograms of dietary beta-carotene
consumed per day and micrograms of dietary retinol consumed per day) and 2 dependent
variables, which are analyzed below to find out whether any of these personal
characteristics has an effect on a person’s plasma beta-carotene and plasma retinol.
4. OUTLINE OF THE ANALYSIS
This report gives a statistical analysis of the data.
The first part is the descriptive analysis, which shows the mean, median, min, max
and quartiles of all the independent variables, with graphs, to give a clear view of the
pattern of variability in the data.
The second part is the data analysis via regression.
• In this part, we first show scatter plots between independent variables and
dependent variables to examine the relationship between them. If some variables
are highly skewed, a log transformation is carried out or some data points are
deleted directly to make them nearly symmetric.
• Then best subset and stepwise regression methods are used to select model
variables.
• After the data have been treated in this way, residual plots are drawn to check
whether the residuals are normally distributed, ε ~ N(0, σ²), and independently
and identically distributed (i.i.d.).
• Based on all the information above, a final regression model is developed, and
conclusions and explanations are given.
5. ANALYSIS RESULTS
The statistical analysis software Minitab was used to analyze the data
collected in the study.
5.1 VARIABLES TYPES AND LEVELS
5.1.1 Quantitative variables
AGE: Age (years).
QUETELET: Quetelet index (weight/height^2).
CALORIES: Number of calories consumed per day.
FAT: Grams of fat consumed per day.
FIBER: Grams of fiber consumed per day.
ALCOHOL: Number of alcoholic drinks consumed per week.
CHOLESTEROL: Cholesterol consumed (mg per day).
BETADIET: Dietary beta-carotene consumed (mcg per day).
RETDIET: Dietary retinol consumed (mcg per day).
5.1.2 Categorical variables
SEX: Sex (1=Male, 2=Female).
SMOKSTAT: Smoking status (1=Never, 2=Former, 3=Current Smoker).
VITUSE: Vitamin Use (1=Yes, fairly often, 2=Yes, not often, 3=No).
5.2 DESCRIPTIVE ANALYSIS
In this case, there are 14 variables in total. As shown in the following table, the
first 12 variables are potential determinants of plasma retinol and beta-carotene
levels. Some basic descriptive statistics were calculated with Minitab and are listed below.
Variable       Mean      StDev     Min       Q1        Median    Q3        Max
AGE            50.146    14.575    19.000    39.000    48.000    63.000    83.000
SEX            1.8667    0.3405    1.0000    2.0000    2.0000    2.0000    2.0000
SMOKSTAT       1.6381    0.7110    1.0000    1.0000    2.0000    2.0000    3.0000
QUETELET       26.157    6.014     16.331    21.789    24.735    28.950    50.403
VITUSE         1.9651    0.8607    1.0000    1.0000    2.0000    3.0000    3.0000
CALORIES       1796.7    680.3     445.2     1333.8    1666.8    2106.4    6662.2
FAT            77.03     33.83     14.40     53.90     72.90     95.30     235.90
FIBER          12.789    5.330     3.100     9.100     12.100    15.600    36.800
ALCOHOL        3.279     12.323    0.000     0.000     0.300     3.200     203.000
CHOLESTEROL    242.46    131.99    37.70     154.90    206.30    308.90    900.70
BETADIET       2185.6    1473.9    214.0     1114.0    1802.0    2863.0    9642.0
RETDIET        832.7     589.3     30.0      479.0     707.0     1047.0    6901.0
BETAPLASMA     189.9     183.0     0.0       89.0      140.0     231.0     1415.0
RETPLASMA      602.8     208.9     179.0     466.0     566.0     719.0     1727.0
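The summary statistics in the table can be reproduced outside Minitab. The following is a minimal Python sketch, not the procedure used in this report: the function name `describe` is hypothetical, and the "exclusive" quartile method is assumed to correspond to the quartile convention behind the table.

```python
import statistics

def describe(values):
    """Return mean, sample standard deviation and five-number summary."""
    data = sorted(values)
    # Assumption: quartiles are located at position p * (n + 1), which is
    # what method="exclusive" computes.
    q1, median, q3 = statistics.quantiles(data, n=4, method="exclusive")
    return {
        "Mean": statistics.mean(data),
        "StDev": statistics.stdev(data),  # divides by n - 1
        "Min": data[0],
        "Q1": q1,
        "Median": median,
        "Q3": q3,
        "Max": data[-1],
    }

# Made-up values for illustration only, not taken from the study data.
summary = describe([1, 2, 3, 4, 5, 6, 7])
```

Applied to a column such as AGE, this should give figures comparable to the corresponding row of the table, provided the quartile convention really is the same.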
5.2.1 SEX
The variable SEX has two levels: 1 represents male and 2 represents
female. To see whether and how plasma retinol is affected by sex,
we draw two box plots for comparison, one for males and the other for
females.
We obtain the max, min, median, Q1 and Q3 of plasma retinol for males and
females respectively. All of these statistics are higher for males than for females.
Thus sex is an important variable in determining the plasma retinol level in the
human body, since this level is generally higher in males than in females.
The box plots of plasma beta-carotene for males and females are shown below, in
which 1 and 2 again represent male and female respectively.
From the above figure, we use Minitab to obtain the specific values of the min, max,
median, Q1 and Q3. The level is slightly higher in females than in males. There are
also many outliers among the females that are much higher than the other values.
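The outliers mentioned above are the points a box plot draws beyond its whiskers. The sketch below assumes the common 1.5 × IQR rule for the whisker limits; the helper names and the demonstration values are hypothetical and are not the study data.

```python
import statistics
from collections import defaultdict

def boxplot_outliers(values):
    """Return the values lying beyond the 1.5 * IQR fences."""
    q1, _, q3 = statistics.quantiles(sorted(values), n=4, method="exclusive")
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < lower or v > upper]

def outliers_by_group(records):
    """records: iterable of (group_code, value) pairs, e.g. (SEX, BETAPLASMA)."""
    groups = defaultdict(list)
    for code, value in records:
        groups[code].append(value)
    return {code: boxplot_outliers(vals) for code, vals in groups.items()}

# Made-up illustration: group 2 contains one unusually high value.
demo = [(1, v) for v in [1, 2, 3, 4, 5, 6, 7, 8]]
demo += [(2, v) for v in [1, 2, 3, 4, 5, 6, 7, 100]]
flagged = outliers_by_group(demo)
```

Listing the flagged values per group makes it possible to say how many high points each sex contributes, rather than judging this from the plot alone.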
5.2.2 VITUSE
We use the same method to examine the relationships between plasma retinol,
plasma beta-carotene and vitamin use level.
For vitamin use level, a 3-point scale is used (1=Yes, fairly often, 2=Yes,
not often, 3=No). For these three groups, we draw box plots of plasma retinol
against VITUSE.
The statistics for the three groups, such as the min, max and median, are at
almost the same level. There is no obvious difference among the three levels, so
vitamin use appears to have no marked influence on the plasma retinol level.
From the above figure, we see that people who use vitamins fairly often have a
relatively higher level of plasma beta-carotene than the others, and people who
use vitamins but not often have a higher level than non-users. Even when the
outliers are taken into account, frequent vitamin use is associated with a higher
level of plasma beta-carotene.
5.2.3 SMOKSTAT
As for the relationship between plasma retinol and smoking status, we use a
3-point scale to describe smoking status (1=Never, 2=Former, 3=Current Smoker).
The three groups are then used to examine the relationships between plasma retinol,
plasma beta-carotene and smoking status.
It is interesting to see that the min, max, median, Q1 and Q3 of type 2 are
all higher than those of the other groups, which means that former smokers usually
have a higher level of plasma retinol than the others. The statistics of the
never-smokers lie between those of the former smokers and the current smokers.
From the above figure, we see a different result from that for plasma retinol.
People who have never smoked have a higher plasma beta-carotene level than the
others, while current smokers tend to have a lower level than never-smokers and
former smokers.
5.2.4 AGE
Variable    Mean      StDev     Min       Q1        Median    Q3        Max
AGE         50.146    14.575    19.000    39.000    48.000    63.000    83.000
As shown in the descriptive statistics table and the figure above, the mean age of
the participants is around 50. Most of the data lie between
32 and 77, which indicates that most participants are middle-aged or elderly.
5.2.5 QUETELET
When analyzing the height and weight of the participants, we combine these two
physical measurements with the formula weight/(height^2), named Quetelet.
The Quetelet index is a statistical measurement that compares a person's weight and
height. Though it does not actually measure the percentage of body fat, it is used to
estimate a healthy body weight based on how tall a person is. The lower the value, the
thinner the person is; conversely, if the value is very high, the person is overweight
relative to his or her height. The table below gives the categories according to the
WHO.
Category                BMI range (kg/m²)
Severely underweight    less than 16.5
Underweight             from 16.5 to 18.4
Normal                  from 18.5 to 24.9
Overweight              from 25 to 30
Obese Class I           from 30.1 to 34.9
Obese Class II          from 35 to 40
Obese Class III         over 40
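The formula and the category table above can be combined in a few lines. This is a sketch under stated assumptions: weight is taken in kilograms and height in metres, and values that fall in the small gaps between the printed ranges (for example 18.45 or 30.05) are assigned to the neighbouring category in the way shown in the comments. The function names are hypothetical.

```python
def quetelet(weight_kg, height_m):
    """Quetelet index = weight / height^2 (kg/m^2)."""
    return weight_kg / height_m ** 2

def who_category(index):
    """Map a Quetelet index to the categories in the table above."""
    # Boundary handling between the printed ranges is an assumption.
    if index < 16.5:
        return "Severely underweight"
    if index < 18.5:
        return "Underweight"
    if index < 25:
        return "Normal"
    if index <= 30:
        return "Overweight"
    if index < 35:
        return "Obese Class I"
    if index <= 40:
        return "Obese Class II"
    return "Obese Class III"

# Sample mean, minimum and maximum of QUETELET from the summary table.
categories = [who_category(x) for x in (26.157, 16.331, 50.403)]
```

With these thresholds the sample mean of 26.157 falls in the Overweight band, while the minimum and maximum fall in the two extreme categories.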
The data for all the participants are summarized below.
Variable    Mean      StDev     Min       Q1        Median    Q3        Max
QUETELET    26.157    6.014     16.331    21.789    24.735    28.950    50.403
As shown in the descriptive statistics table and the figure above, the mean
Quetelet index of the participants is around 26. Most of the data lie
between 18.5 and 30, which indicates that most participants are of normal weight or
slightly overweight.
5.2.6 CALORIES
Variable    Mean      StDev     Min       Q1        Median    Q3        Max
CALORIES    1796.7    680.3     445.2     1333.8    1666.8    2106.4    6662.2
As shown in the descriptive statistics table and the figure above, the mean
number of calories consumed per day is around 1800. Most of the data lie
between 1000 and 2200.
5.2.7 FAT
Variable    Mean      StDev     Min       Q1        Median    Q3        Max
FAT         77.03     33.83     14.40     53.90     72.90     95.30     235.90
As shown in the descriptive statistics table and the histogram above, the mean
fat consumed per day is around 77 g. Most of the data
lie between 45 and 135.
5.2.8 FIBER
Variable    Mean      StDev     Min       Q1        Median    Q3        Max
FIBER       12.789    5.330     3.100     9.100     12.100    15.600    36.800
As shown in the descriptive statistics table and the histogram above, the mean
fiber consumed per day is around 12.8 g. Most of the data
lie between 6 and 18.
5.2.9 ALCOHOL
Variable    Mean      StDev     Min       Q1        Median    Q3        Max
ALCOHOL     3.279     12.323    0.000     0.000     0.300     3.200     203.000
In the histogram above, we can see that this sample contains an outlier whose
number of alcoholic drinks consumed per week is 203. Therefore, we delete this point
and draw the histogram again, as below.
As shown in the descriptive statistics table and the figure above, the mean
number of alcoholic drinks consumed per week is around 3.3 (computed with the
outlier included), and most of the data lie between 0 and 6.
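The influence of the deleted observation on the mean can be checked from the summary figures alone. The sketch below assumes that the tabulated ALCOHOL mean of 3.279 was computed over all N = 315 subjects, including the 203-drink observation; the function name is hypothetical.

```python
def mean_without(mean_all, n, removed_value):
    """Mean of the remaining n - 1 values after removing one observation."""
    return (mean_all * n - removed_value) / (n - 1)

# Assumption: mean 3.279 over N = 315 subjects, outlier of 203 drinks per week.
alcohol_mean_trimmed = mean_without(3.279, 315, 203.0)
```

Under that assumption the mean without the outlier is about 2.64 drinks per week, which shows how strongly a single extreme value pulls the average of this variable.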
5.2.10 CHOLESTEROL
Variable       Mean      StDev     Min      Q1        Median    Q3        Max
CHOLESTEROL    242.46    131.99    37.7     154.9     206.3     308.9     900.7
[Figure: Histogram of CHOLESTEROL (x-axis: CHOLESTEROL; y-axis: Frequency)]
The mean value is about 242 mg per day, and most of the data lie in the interval
between 50 and 250, which indicates the common level of cholesterol
consumption.
5.2.11 BETADIET
Variable    Mean      StDev     Min     Q1      Median    Q3      Max
BETADIET    2185.6    1473.9    214     1114    1802      2863    9642
[Figure: Histogram of BETADIET (x-axis: BETADIET; y-axis: Frequency)]
The mean value is about 2186, and most of the data lie in the interval between
1000 and 2500, which indicates that most people consume about 2000 mcg of
beta-carotene per day.
5.2.12 RETDIET
Variable    Mean     StDev    Min    Q1     Median    Q3      Max
RETDIET     832.7    589.3    30     479    707       1047    6901
[Figure: Histogram of RETDIET (x-axis: RETDIET; y-axis: Frequency)]
The mean value is about 833, and most of the data lie in the interval between
500 and 1500, which indicates that most people consume about 1000 mcg of retinol
per day.
5.3 DATA ANALYSIS VIA REGRESSION AND GLM
5.3.1 Plasma Beta-carotene
5.3.1.1 Regression method:
1. Check with Scatter Plots
First, scatter plots of plasma beta-carotene against all the continuous
predictors are drawn to check whether there are any outliers or
data aggregations. The scatter plots are shown in the appendix.
After deleting the outliers in each pair of plasma beta-carotene and predictor,
another three scatter plots are drawn, in which the points are spread more widely,
as shown in the appendix.
2. Best subset and Stepwise regression to select model variables
During the analysis, we use dummy variables in place of the
discrete variables SEX, SMOKSTAT and VITUSE, and name them sex_1;
smoking_status_1, smoking_status_2 and smoking_status_3; vitamin_status_1,
vitamin_status_2 and vitamin_status_3. Then we tried stepwise
regression, forward selection and backward elimination, and finally selected
QUETELET, BETADIET, vitamin_status_3, smoking_status_3 and FAT as the model
variables.
Variables           T-Value    P-value
QUETELET            -4.11      0.000
BETADIET            3.57       0.000
vitamin_status_3    -3.17      0.002
smoking_status_3    -2.04      0.042
FAT                 -1.88      0.061
R-Sq = 15.09%; R-Sq(adj) = 13.67%
3. Regression Analysis
(1) Residual check
From the residual plots (Figure 7; see the appendix), we can see in the
normal probability plot that the residuals do not follow a normal distribution at all.
In the Versus Fits plot, the red dots are highly skewed. So we apply a log
transformation and use log(plasma beta-carotene) in place of plasma
beta-carotene. Then, after another round of scatter plot checks and data
deletion, we draw new residual plots (Figure 8). This time the normal
probability plot shows that the residuals are approximately normally
distributed. In the Versus Fits plot, the range of the red dots in each group
is almost the same, which indicates that the assumption ε ~ N(0, σ²) is
acceptable. In addition, the red dots in the Versus Order plot show no pattern, so
the assumption that the errors are independently and identically
distributed (i.i.d.) is also acceptable.
The best subset and stepwise regression procedure is therefore run again to
select the variables to be used in the model. This time QUETELET, BETADIET,
vitamin_status_3, smoking_status_3, FAT, AGE, sex_2 and FIBER are selected as
the model variables.
Variables           T-Value    P-value
QUETELET            -5.05      0.000
BETADIET            1.93       0.054
vitamin_status_3    -3.48      0.001
smoking_status_3    -2.29      0.023
FAT                 -1.98      0.048
AGE                 2.00       0.046
sex_2               1.73       0.085
FIBER               1.51       0.132
R-Sq = 21.44%; R-Sq(adj) = 19.31%
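The effect of the log transformation used in the residual check can be illustrated with a skewness measure. The sketch below is only an illustration on made-up values, and it rests on two assumptions: that "log" means the base-10 logarithm, and that zero responses (the summary table lists a BETAPLASMA minimum of 0.0, for which the logarithm is undefined) are dropped before transforming. The report does not state how such values were handled.

```python
import math

def skewness(values):
    """Moment-based sample skewness; 0 for perfectly symmetric data."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

def log10_positive(values):
    """Base-10 log of the strictly positive values; zeros are dropped (assumption)."""
    return [math.log10(v) for v in values if v > 0]

# Made-up, strongly right-skewed values, not the study data.
raw = [0, 10, 100, 1000, 10000, 100000]
logged = log10_positive(raw)
skew_before = skewness(raw)
skew_after = skewness(logged)
```

A large positive skewness before the transformation and a value near zero afterwards is the numerical counterpart of the improvement seen in the normal probability plot.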
(2) Regression Model
From the analysis of variance, P-value = 0.000, which indicates that at least one
coefficient is not 0, i.e. the model is useful.
From the P-values of the predictors, we can see that those of QUETELET,
vitamin_status_3, smoking_status_3, FAT and AGE are smaller than 0.05,
which indicates that these predictors should be kept in the final model.
However, in order to make the model more accurate, we also keep in the final
model the variables whose P-values are larger than 0.05. Furthermore,
R-Sq = 21.4% and R-Sq(adj) = 19.3%, which indicates that the model explains only
about one fifth of the variability in log(plasma beta-carotene), so its forecasts
should be treated with caution.
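The two R-Sq figures are linked by the standard adjustment for the number of predictors, R-Sq(adj) = 1 - (1 - R-Sq)(n - 1)/(n - p - 1). The sketch below applies this formula; the sample size remaining after outlier deletion is not reported, so n = 304 is a hypothetical value chosen only because it is consistent with the reported pair of figures for p = 8 predictors.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R-squared for n observations and p predictors (plus intercept)."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Hypothetical n = 304; R-Sq = 21.44% and p = 8 are taken from the table above.
r2_adj = adjusted_r2(0.2144, 304, 8)
```

The small gap between the two values shows that the penalty for eight predictors is modest at this sample size; the low R-Sq itself is what limits the model.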
4. The final model
log(plasma beta-carotene) = 2.32 - 0.0140 QUETELET - 0.124 vitamin_status_3
- 0.116 smoking_status_3 + 0.000025 BETADIET - 0.00113 FAT + 0.00248 AGE
+ 0.0934 sex_2 + 0.00632 FIBER
The coefficients of QUETELET, vitamin_status_3, smoking_status_3 and FAT
are negative, which indicates that plasma beta-carotene decreases as these
variables increase; that is, if a person is
overweight, is a current smoker, and never takes vitamins, his or her plasma
beta-carotene tends to be low.
The coefficients of BETADIET, AGE, sex_2 and FIBER are positive, which
indicates that plasma beta-carotene increases as these variables increase;
that is, if a person is older, is female, and has a high daily intake of dietary
beta-carotene and fiber, her plasma beta-carotene tends to be higher.
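The fitted equation can be evaluated directly to see what the coefficients imply. The sketch below makes two assumptions that the report does not state explicitly: the FAT term is taken as negative (consistent with its t-value of -1.98 and with the interpretation above), and "log" is taken to be the base-10 logarithm. The function names are hypothetical.

```python
def log_betaplasma(quetelet, vitamin_status_3, smoking_status_3, betadiet,
                   fat, age, sex_2, fiber):
    """Fitted log(plasma beta-carotene) from the final regression model."""
    return (2.32 - 0.0140 * quetelet - 0.124 * vitamin_status_3
            - 0.116 * smoking_status_3 + 0.000025 * betadiet
            - 0.00113 * fat            # sign assumed negative
            + 0.00248 * age + 0.0934 * sex_2 + 0.00632 * fiber)

def betaplasma(**predictors):
    """Back-transform to the original scale, assuming a base-10 log."""
    return 10 ** log_betaplasma(**predictors)

# A hypothetical person, evaluated as a non-smoker and as a current smoker.
person = dict(quetelet=26.0, vitamin_status_3=0, betadiet=2000.0, fat=77.0,
              age=50, sex_2=1, fiber=12.8)
ratio = betaplasma(smoking_status_3=1, **person) / betaplasma(smoking_status_3=0, **person)
```

Under the base-10 assumption, the ratio does not depend on the other predictors: a current smoker is predicted to have roughly 77% of the plasma beta-carotene of an otherwise identical non-smoker.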
5.3.1.2 General Linear Model
1. Check with Scatter Plots
As in the regression procedure, after replacing plasma beta-carotene with
log(plasma beta-carotene), we draw scatter plots to examine the relationship
between the predictors and the dependent variable, and then delete the outliers.
2. Residual check
From the residual plots (Figure 12), we can see in the normal
probability plot that the residuals are approximately normally distributed. In the
Versus Fits plot, the range of the red dots in each group is almost the same, which
indicates that the assumption ε ~ N(0, σ²) is acceptable. In addition, the
red dots in the Versus Order plot show no pattern, so the assumption that the
errors are independently and identically distributed (i.i.d.) is also acceptable.
3. General Linear Model
The table below shows all the variables used in the model and their p-values.
A p-value lower than 0.05 indicates that the coefficient of the variable is
significantly different from 0. Because the p-values of CALORIES, FAT, ALCOHOL
and CHOLESTEROL are much larger than 0.05, we delete them from the final
model.
Variables           T-Value    P-value
AGE                 1.70       0.090
QUETELET            -4.92      0.000
CALORIES            -0.87      0.385
FAT                 0.31       0.758
FIBER               1.49       0.139
ALCOHOL             0.32       0.750
CHOLESTEROL         -0.52      0.603
BETADIET            1.60       0.111
RETDIET             0.96       0.337
Vitamin_1           2.22       0.027
Vitamin_2           0.18       0.858
BETADIET*Vitamin
R-Sq = 23.38%; R-Sq(adj) = 19.11%
4. The final model
log(plasma beta-carotene) = 2.3061 + 0.002224 AGE - 0.014010 QUETELET
+ 0.00818 FIBER + 0.000021 BETADIET + 0.000034 BETADIET*Vitamin_1
The coefficient of QUETELET is negative, which indicates that plasma
beta-carotene decreases as QUETELET increases;
that is, if a person is overweight, his or her plasma beta-carotene tends to
be low.
The coefficients of AGE, FIBER, BETADIET and BETADIET*Vitamin_1
are positive, which indicates that plasma beta-carotene increases as
these variables increase;
that is, if a person is older and has a high daily intake of dietary beta-carotene
and fiber, his or her plasma beta-carotene tends to be higher.
There is also an interaction between the beta-carotene consumed per day and
vitamin use.
5.3.2 Retinol
5.3.2.1 Regression method
1. Check with Scatter Plots
First we check the matrix plot of plasma retinol against all the continuous
predictors to see whether there are outliers or data aggregations. Each
scatter plot of plasma retinol against a predictor is then analyzed to further
examine the outliers and the related distribution. Outliers are deleted in order to
make the scatter symmetrically distributed. The corresponding plots are all appended
to the end of this report.
2. Best subset and Stepwise regression to select model variables
Among all the variables, SEX, SMOKSTAT and VITUSE are discrete
variables which cannot be used directly in the regression analysis, so we use
dummy variables to replace them. SEX_F takes the value 1
when SEX is female and 0 otherwise. Similarly, SMOK_1 and
SMOK_2 together describe the three levels of SMOKSTAT, while
VITUSE_1 and VITUSE_2 represent the three levels of VITUSE. Their
relations are shown in the following two tables.
SMOKSTAT    SMOK_1    SMOK_2
1           0         0
2           0         1
3           1         0

VITUSE      VITUSE_1  VITUSE_2
1           0         0
2           0         1
3           1         0
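The dummy coding in the two tables above can be sketched in a few lines of Python. The dictionary and function names are hypothetical, and sex is passed as a boolean because the numeric code used for "female" is not given in this section.

```python
# Dummy coding copied from the two tables above.
SMOK_DUMMIES = {1: (0, 0), 2: (0, 1), 3: (1, 0)}    # SMOKSTAT -> (SMOK_1, SMOK_2)
VITUSE_DUMMIES = {1: (0, 0), 2: (0, 1), 3: (1, 0)}  # VITUSE -> (VITUSE_1, VITUSE_2)


def encode_subject(is_female, smokstat, vituse):
    """Return the dummy variables for one subject as a dictionary."""
    smok_1, smok_2 = SMOK_DUMMIES[smokstat]
    vituse_1, vituse_2 = VITUSE_DUMMIES[vituse]
    return {
        "SEX_F": 1 if is_female else 0,
        "SMOK_1": smok_1,
        "SMOK_2": smok_2,
        "VITUSE_1": vituse_1,
        "VITUSE_2": vituse_2,
    }


# Example: a female subject with SMOKSTAT = 2 and VITUSE = 3.
example = encode_subject(True, 2, 3)
print(example)
```

Level 1 of each factor is the reference level: both of its dummy variables are 0, so the other coefficients are read as differences from level 1.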
Then we tried stepwise regression, forward selection and backward
elimination, and finally selected AGE, QUETELET, ALCOHOL, BETADIET,
SEX_F, SMOK_2, and VITUSE_1 as the model variables.
Variables    T-Value   P-value
AGE             3.32     0.001
QUETELET        1.72     0.087
ALCOHOL         3.24     0.001
BETADIET       -2.04     0.043
SEX_F          -1.97     0.049
SMOK_2          1.70     0.089
VITUSE_1       -1.95     0.052
R-Sq = 13.55%; R-Sq(adj) = 11.42%
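The two summary figures above are linked by the usual adjusted R-squared formula. The short Python sketch below recomputes the adjusted value from R-Sq = 13.55%, the 7 selected predictors, and the N = 292 observations reported in the appendix output; the function name is hypothetical.

```python
def adjusted_r_squared(r_squared, n_obs, n_predictors):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1.0 - (1.0 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)


# R-Sq = 13.55%, N = 292 observations, 7 predictors in the selected model.
adj = adjusted_r_squared(0.1355, 292, 7)
print(f"{100 * adj:.2f}%")
```

The result agrees with the reported R-Sq(adj) of 11.42%, which confirms that the table refers to the 7-predictor model fitted to 292 observations.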
3. Regression Analysis
(1) Residual check
The residual plots appended to the end of the report show some outliers on
the right side that affect conformity to the normal distribution. In the
Versus Fits plot, the red dots are skewed. We therefore delete these outliers
by inspection, run the regression again, and draw the new residual plots in
the appendix. This time the normal probability plot shows that the residuals
are approximately normally distributed. In the Versus Fits plot, the spread
of the red dots is almost the same in each group, which indicates that the
assumption ε ~ N(0, σ²) is acceptable. In addition, the red dots in the
Versus Order plot are scattered randomly, so the assumption that the data
are independent and identically distributed (i.i.d.) is accepted.
(2) Regression Model
From the P-values of the predictors, we can see that those of AGE,
BETADIET, SEX_F, SMOK_2, and VITUSE_1 are smaller than 0.05, so
these predictors are kept in the final model.
4. The final model
RETPLASMA = 517 + 2.09 AGE - 0.0149 BETADIET - 71.7 SEX_F + 41.9
SMOK_2 - 43.4 VITUSE_1
The coefficients of BETADIET, SEX_F, and VITUSE_1 are negative, which
indicates that plasma retinol decreases as these variables increase. The
coefficients of AGE and SMOK_2 are positive, which indicates that plasma
retinol increases with them: if the person is older or a former smoker, the
plasma retinol would be higher.
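To make the interpretation of the coefficients concrete, the following Python sketch evaluates the final regression equation for two hypothetical subjects who differ only in SMOK_2. The function name and the example values are assumptions for illustration only.

```python
def predict_retplasma(age, betadiet, sex_f, smok_2, vituse_1):
    """Evaluate the final regression equation for plasma retinol."""
    return (517
            + 2.09 * age
            - 0.0149 * betadiet
            - 71.7 * sex_f
            + 41.9 * smok_2
            - 43.4 * vituse_1)


# Hypothetical subjects: AGE 50, BETADIET 2000, female, VITUSE_1 = 0.
former_smoker = predict_retplasma(50, 2000, sex_f=1, smok_2=1, vituse_1=0)
reference = predict_retplasma(50, 2000, sex_f=1, smok_2=0, vituse_1=0)

# The gap between the two predictions is the SMOK_2 coefficient.
print(round(former_smoker, 1), round(reference, 1),
      round(former_smoker - reference, 1))
```

Because the model is linear with no interaction terms, the difference between the two predictions equals the SMOK_2 coefficient regardless of the other values chosen.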
5.3.2.2 General Linear Model
1. Check with Scatter Plots
Because we have already removed the outliers for the regression model, it is
not necessary to check the scatter plots again here. We simply use the same
data as in the regression part.
2. Residual check
At the end of the GLM fit we obtain the residual plots. The normal
probability plot shows that the residuals are approximately normally
distributed, which indicates that the assumption ε ~ N(0, σ²) is acceptable.
3. General Linear Model
Using the GLM function of MINITAB, we obtain the table below, which lists
all the continuous variables in the model and their P-values. We then look
for variables whose P-value is lower than 0.15, which indicates that their
coefficients are significantly different from 0. On this basis we choose AGE,
ALCOHOL, and BETADIET for the final model.
Variables      T-Value   P-value
Constant          6.57     0.000
AGE               2.50     0.013
QUETELET          0.90     0.370
CALORIES         -0.70     0.486
FAT              -1.43     0.153
FIBER            -0.79     0.428
ALCOHOL           1.77     0.079
CHOLESTEROL       0.92     0.360
BETADIET         -1.66     0.097
RETDIET           0.40     0.688
R-Sq = 14.75%; R-Sq(adj) = 9.52%
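The selection rule described above (keep a variable when its P-value is below 0.15) can be written as a simple filter. The sketch below uses the P-values from the table; the variable names `P_VALUES`, `ALPHA` and `selected` are hypothetical.

```python
# P-values of the continuous predictors, copied from the GLM table above.
P_VALUES = {
    "AGE": 0.013,
    "QUETELET": 0.370,
    "CALORIES": 0.486,
    "FAT": 0.153,
    "FIBER": 0.428,
    "ALCOHOL": 0.079,
    "CHOLESTEROL": 0.360,
    "BETADIET": 0.097,
    "RETDIET": 0.688,
}

ALPHA = 0.15  # significance threshold used in the report

# Keep only the predictors whose P-value is below the threshold.
selected = [name for name, p in P_VALUES.items() if p < ALPHA]
print(selected)
```

With this threshold, FAT (P = 0.153) falls just outside the cut-off, which is why it does not appear in the final model.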
4. The final model
RETPLASMA = 510.86 + 1.8777 AGE + 5.002 ALCOHOL - 0.013507 BETADIET
The coefficients of AGE and ALCOHOL are positive, which means that plasma
retinol rises as these two variables become larger.
The coefficient of BETADIET is negative, which indicates that plasma retinol
is higher when BETADIET is lower.
6. DISCUSSION
We conclude that there is wide variability in plasma concentrations of these
micronutrients in humans, and that much of this variability is associated with dietary
habits and personal characteristics.
A better understanding of the physiological
relationship between some personal characteristics and plasma concentrations of these
micronutrients will require further study.
7. APPENDIX
6.1 Plasma Beta-carotene
6.1.1 Regression method:
6.1.1.1 Scatter plots
1. Scatter plots before deleting the outliers:
Figure 1 Scatter plots of Plasma beta-carotene and AGE, QUETELET, CALORIES
Figure 2 Scatter plots of Plasma beta-carotene and FAT, FIBER, ALCOHOL
Figure 3 Scatter plots of Plasma beta-carotene and CHOLESTEROL, BETADIET, RETDIET
2. Scatter plots after deleting the outliers:
Figure 4 Scatter plots of Plasma beta-carotene and AGE, QUETELET, CALORIES after data deletion
Figure 5 Scatter plots of Plasma beta-carotene and FAT, FIBER, ALCOHOL after data deletion
Figure 6 Scatter plots of Plasma beta-carotene and CHOLESTEROL, BETADIET, RETDIET after data deletion
6.1.1.2 Residual check
Figure 7 Residual plots for plasma beta-carotene
Figure 8 Residual plots for log( plasma beta-carotene)
6.1.1.3 Regression
1. Best subset and Stepwise regression to select model variables
Stepwise Regression: Plasma beta-carotene(ng/ml) versus Age, sex_1, ...
Alpha-to-Enter: 0.15
Alpha-to-Remove: 0.15
Response is Plasma beta-carotene(ng/ml) on 17 predictors, with N = 304
Step                       1       2       3       4       5
Constant               353.5   300.9   321.6   341.0   375.5

weight/(height^2)       -6.4    -6.4    -6.2    -6.5    -6.4
T-Value                -3.93   -4.03   -3.92   -4.15   -4.11
P-Value                0.000   0.000   0.000   0.000   0.000

BETADIET                      0.0246  0.0228  0.0211  0.0232
T-Value                         3.77    3.55    3.28    3.57
P-Value                        0.000   0.000   0.001   0.000

vitamin_status_3                         -69     -64     -63
T-Value                                -3.47   -3.23   -3.17
P-Value                                0.001   0.001   0.002

smoking_status_3                                 -62     -57
T-Value                                        -2.22   -2.04
P-Value                                        0.027   0.042

FAT                                                    -0.56
T-Value                                                -1.88
P-Value                                                0.061

S                        171     168     165     164     163
R-Sq                    4.86    9.16   12.67   14.08   15.09
R-Sq(adj)               4.55    8.55   11.80   12.93   13.67
Stepwise Regression: log beta versus Age, sex_1, ...
Alpha-to-Enter: 0.15
Alpha-to-Remove: 0.15
Response is log beta on 17 predictors, with N = 304
Variable entered at each step, with its coefficient, T-value and P-value at the
step of entry and the summary statistics of the resulting model:
Step  Variable entered        Coef  T-Value  P-Value      S   R-Sq  R-Sq(adj)
1     weight/(height^2)    -0.0144    -4.91    0.000  0.309   7.40       7.10
2     vitamin_status_3      -0.153    -4.23    0.000  0.300  12.61      12.03
3     smoking_status_3      -0.168    -3.36    0.001  0.295  15.77      14.93
4     BETADIET             0.00004     2.70    0.007  0.292  17.78      16.68
5     FAT                 -0.00132    -2.51    0.013  0.290  19.47      18.12
6     Age                   0.0018     1.55    0.123  0.289  20.12      18.50
7     sex_2                  0.089     1.64    0.102  0.288  20.84      18.96
8     FIBER                 0.0063     1.51    0.132  0.288  21.44      19.31
Regression Analysis: log beta versus weight/(heig, vitamin_stat, ...
The regression equation is
log beta = 2.32 - 0.0140 weight/(height^2) - 0.124 vitamin_status_3
- 0.116 smoking_status_3 + 0.000025 BETADIET - 0.00113 FAT
+ 0.00248 Age + 0.0934 sex_2 + 0.00632 FIBER
Predictor                  Coef     SE Coef       T      P
Constant                 2.3210      0.1402   16.56  0.000
weight/(height^2)     -0.014003    0.002772   -5.05  0.000
vitamin_status_3       -0.12355     0.03554   -3.48  0.001
smoking_status_3       -0.11559     0.05048   -2.29  0.023
BETADIET             0.00002494  0.00001291    1.93  0.054
FAT                  -0.0011338   0.0005720   -1.98  0.048
Age                    0.002476    0.001238    2.00  0.046
sex_2                   0.09340     0.05406    1.73  0.085
FIBER                  0.006317    0.004186    1.51  0.132

S = 0.287677   R-Sq = 21.4%   R-Sq(adj) = 19.3%

Analysis of Variance
Source            DF        SS       MS      F      P
Regression         8   6.66382  0.83298  10.07  0.000
Residual Error   295  24.41362  0.08276
Total            303  31.07744

Source               DF   Seq SS
weight/(height^2)     1  2.30126
vitamin_status_3      1  1.61678
smoking_status_3      1  0.98308
BETADIET              1  0.62327
FAT                   1  0.52777
Age                   1  0.19982
sex_2                 1  0.22335
FIBER                 1  0.18849
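The entries of the Analysis of Variance table for log beta are tied together by simple arithmetic, which gives a quick consistency check on the output. The Python sketch below recomputes the mean squares, the F statistic and R-Sq from the sums of squares and degrees of freedom reported above; the variable names are hypothetical.

```python
# Sums of squares and degrees of freedom from the ANOVA table for log beta.
ss_regression, df_regression = 6.66382, 8
ss_error, df_error = 24.41362, 295
ss_total = 31.07744

# Mean squares are sums of squares divided by their degrees of freedom.
ms_regression = ss_regression / df_regression
ms_error = ss_error / df_error

# F is the ratio of the two mean squares; R-Sq is the explained share of SS.
f_statistic = ms_regression / ms_error
r_squared = ss_regression / ss_total

print(round(ms_regression, 5), round(ms_error, 5),
      round(f_statistic, 2), round(100 * r_squared, 1))
```

The recomputed values match the reported MS, F = 10.07 and R-Sq = 21.4%, and the regression and error sums of squares add up to the total.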
6.1.2 GLM Method
6.1.2.1 Scatter plots
Figure 9 Scatter plots of log (Plasma beta-carotene) and AGE, QUETELET, CALORIES
Figure 10 Scatter plots of log (Plasma beta-carotene) and FAT, FIBER, ALCOHOL
Figure 11 Scatter plots of log (Plasma beta-carotene) and CHOLESTEROL, BETADIET, RETDIET
6.1.2.2 Residual check
Figure 12 Residual plots for plasma beta-carotene
6.1.2.3 Model
General Linear Model: log beta versus Sex, Smoking status, Vitamin
Factor          Type   Levels  Values
Sex             fixed       2  1, 2
Smoking status  fixed       3  1, 2, 3
Vitamin         fixed       3  1, 2, 3

Analysis of Variance for log beta, using Adjusted SS for Tests
Source              DF    Seq SS    Adj SS   Adj MS      F      P
Age                  1   0.40150   0.23897  0.23897   2.90  0.090
Sex                  1   0.79890   0.17571  0.17571   2.13  0.145
Smoking status       2   0.66772   0.48616  0.24308   2.95  0.054
weight/(height^2)    1   2.63996   1.99036  1.99036  24.18  0.000
Vitamin              2   1.01660   0.06828  0.03414   0.41  0.661
CALORIES             1   0.00009   0.06235  0.06235   0.76  0.385
FAT                  1   0.15930   0.00784  0.00784   0.10  0.758
FIBER                1   0.63890   0.18163  0.18163   2.21  0.139
ALCOHOL              1   0.01623   0.00839  0.00839   0.10  0.750
CHOLESTEROL          1   0.00013   0.02228  0.02228   0.27  0.603
BETADIET             1   0.26242   0.21037  0.21037   2.56  0.111
RETDIET              1   0.06539   0.07630  0.07630   0.93  0.337
Vitamin*BETADIET     2   0.54436   0.54436  0.27218   3.31  0.038
Error              287  23.62834  23.62834  0.08233
Total              303  30.83984

S = 0.286930   R-Sq = 23.38%   R-Sq(adj) = 19.11%

Term                    Coef   SE Coef      T      P
Constant              2.3061    0.1302  17.71  0.000
Age                 0.002224  0.001306   1.70  0.090
weight/(height^2)  -0.014010  0.002849  -4.92  0.000
CALORIES           -0.000082  0.000094  -0.87  0.385
FAT                 0.000447  0.001450   0.31  0.758
FIBER               0.008180  0.005508   1.49  0.139
ALCOHOL             0.001381  0.004325   0.32  0.750
CHOLESTEROL        -0.000109  0.000210  -0.52  0.603
BETADIET            0.000021  0.000013   1.60  0.111
RETDIET             0.000033  0.000034   0.96  0.337
BETADIET*Vitamin
  1                 0.000034  0.000015   2.22  0.027
  2                 0.000003  0.000017   0.18  0.858
6.2 Retinol
6.2.1 Regression Method
6.2.1.1 Scatter plots
Fig 13 Matrix plot of all variables
Fig 14 Scatter plot of retinol plasma with AGE
Because the points are symmetrically distributed, we do not need to transform
the data or delete any points.
Fig 15 Scatter plot of retinol plasma with QUETELET
Because the plot is skewed, we delete the 5 outliers on the right side.
Fig 16 Scatter plot of retinol plasma with CALORIES
Two points on the right side are deleted.
Fig 17 Scatter plot of retinol plasma with FAT
Two points on the right side are deleted.
Fig 18 Scatter plot of retinol plasma with FIBER
Five points on the right side are deleted.
Fig 19 Scatter plot of retinol plasma with ALCOHOL
Two points on the right side are deleted.
Fig 20 Scatter plot of retinol plasma with CHOLESTEROL
Four points on the right side are deleted.
Fig 21 Scatter plot of retinol plasma with BETADIET
Two points on the right side are deleted.
Fig 22 Scatter plot of retinol plasma with RETDIET
6.2.1.2 Regression
1. Best subset and Stepwise regression to select model variables
Stepwise Regression: RETPLASMA versus AGE, QUETELET, ...
Forward selection.
Alpha-to-Enter: 0.25
Response is RETPLASMA on 14 predictors, with N = 292
Step                 1       2       3       4       5       6       7        8
Constant         450.1   428.1   517.7   607.2   597.4   615.2   519.4    523.8

AGE               3.08    3.03    2.58    2.15    2.11    2.16    2.12     2.31
T-Value           3.77    3.79    3.10    2.49    2.45    2.52    2.48     2.69
P-Value          0.000   0.000   0.002   0.013   0.015   0.012   0.014    0.008

ALCOHOL                    9.9     8.5     8.6     7.8     8.3     9.1      9.6
T-Value                   3.56    2.95    3.01    2.68    2.85    3.10     3.25
P-Value                  0.000   0.003   0.003   0.008   0.005   0.002    0.001

SEX_F                              -73     -92     -89     -97     -95      -90
T-Value                          -1.97   -2.39   -2.33   -2.52   -2.48    -2.35
P-Value                          0.050   0.018   0.020   0.012   0.014    0.020

FAT                                      -0.70   -0.75   -0.76   -0.79    -0.64
T-Value                                  -1.72   -1.85   -1.88   -1.96    -1.56
P-Value                                  0.086   0.066   0.062   0.051    0.119

SMOK_2                                              41      42      41       44
T-Value                                           1.69    1.71    1.67     1.79
P-Value                                          0.092   0.088   0.095    0.074

VITUSE_1                                                   -40     -44      -48
T-Value                                                  -1.63   -1.79    -1.95
P-Value                                                  0.103   0.075    0.052

QUETELET                                                           3.8      3.8
T-Value                                                           1.76     1.78
P-Value                                                          0.079    0.076

BETADIET                                                                -0.0145
T-Value                                                                   -1.66
P-Value                                                                   0.097

S                  205     201     200     199     198     198     197      196
R-Sq              4.67    8.67    9.88   10.80   11.68   12.50   13.45    14.29
R-Sq(adj)         4.34    8.03    8.94    9.56   10.14   10.66   11.32    11.87
Mallows Cp        22.3    11.3     9.4     8.4     7.5     6.8     5.7      5.0
8 predictors are chosen as below: AGE, ALCOHOL, SEX_F, FAT, SMOK_2,
VITUSE_1, QUETELET, and BETADIET.
Stepwise Regression: RETPLASMA versus AGE, QUETELET, ...
Backward elimination.
Alpha-to-Remove: 0.1
Response is RETPLASMA on 14 predictors, with N = 292
Variable removed at each step, with the summary statistics of the resulting model:
Step  Predictors  Variable removed    S   R-Sq  R-Sq(adj)  Mallows Cp
1             14  (full model)      198  14.91      10.61        15.0
2             13  CHOLESTEROL       197  14.91      10.93        13.0
3             12  SMOK_1            197  14.90      11.24        11.0
4             11  RETDIET           197  14.86      11.52         9.2
5             10  CALORIES          197  14.67      11.63         7.8
6              9  FIBER             196  14.61      11.88         6.0
7              8  VITUSE_2          196  14.29      11.87         5.0
8              7  FAT               197  13.55      11.42         5.4

Coefficients at the last two steps:
Step                 7        8
Constant         523.8    450.6

AGE               2.31     2.73
T-Value           2.69     3.32
P-Value          0.008    0.001

QUETELET           3.8      3.7
T-Value           1.78     1.72
P-Value          0.076    0.087

FAT              -0.64
T-Value          -1.56
P-Value          0.119

ALCOHOL            9.6      9.6
T-Value           3.25     3.24
P-Value          0.001    0.001

BETADIET       -0.0145  -0.0174
T-Value          -1.66    -2.04
P-Value          0.097    0.043

SEX_F              -90      -73
T-Value          -2.35    -1.97
P-Value          0.020    0.049

SMOK_2              44       42
T-Value           1.79     1.70
P-Value          0.074    0.089

VITUSE_1           -48      -49
T-Value          -1.95    -1.95
P-Value          0.052    0.052

S                  196      197
R-Sq             14.29    13.55
R-Sq(adj)        11.87    11.42
Mallows Cp         5.0      5.4
7 predictors are chosen as below: AGE, QUETELET, ALCOHOL, BETADIET,
SEX_F, SMOK_2, VITUSE_1.
Best Subsets Regression: RETPLASMA versus AGE, QUETELET, ...
Response is RETPLASMA
Candidate predictors: AGE, QUETELET, CALORIES, FAT, FIBER, ALCOHOL,
CHOLESTEROL, BETADIET, RETDIET, SEX_F, SMOK_1, SMOK_2, VITUSE_1, VITUSE_2

Vars  R-Sq  R-Sq(adj)  Mallows Cp       S
   1   4.7        4.3        22.3  204.63
   1   4.5        4.1        23.0  204.86
   2   8.7        8.0        11.3  200.64
   2   7.2        6.5        16.2  202.29
   3   9.9        8.9         9.4  199.65
   3   9.6        8.6        10.3  199.97
   4  10.8        9.6         8.4  198.97
   4  10.8        9.5         8.5  199.03
   5  11.7       10.2         7.4  198.29
   5  11.7       10.1         7.5  198.33
   6  12.7       10.8         6.3  197.57
   6  12.6       10.8         6.4  197.59
   7  13.5       11.4         5.4  196.92
   7  13.5       11.3         5.7  197.03
   8  14.3       11.9         5.0  196.42
   8  14.0       11.6         5.9  196.74
   9  14.6       11.9         6.0  196.40
   9  14.4       11.6         6.8  196.69
  10  14.7       11.6         7.8  196.67
  10  14.7       11.6         7.8  196.68
  11  14.9       11.5         9.2  196.81
  11  14.8       11.4         9.4  196.90
  12  14.9       11.2        11.0  197.11
  12  14.9       11.2        11.1  197.15
  13  14.9       10.9        13.0  197.46
  13  14.9       10.9        13.0  197.47
  14  14.9       10.6        15.0  197.81
6 variables are selected as below: AGE, QUETELET, ALCOHOL, BETADIET,
SEX_F, VITUSE_1.
3. Model
Regression Analysis: RETPLASMA versus AGE, QUETELET, ...
The regression equation is
RETPLASMA = 451 + 2.73 AGE + 3.69 QUETELET + 9.56 ALCOHOL - 0.0174
BETADIET - 72.8 SEX_F + 41.6 SMOK_2 - 48.6 VITUSE_1
Predictor       Coef   SE Coef      T      P
Constant      450.56     85.37   5.28  0.000
AGE           2.7254    0.8220   3.32  0.001
QUETELET       3.691     2.149   1.72  0.087
ALCOHOL        9.556     2.947   3.24  0.001
BETADIET   -0.017437  0.008559  -2.04  0.043
SEX_F         -72.77     36.85  -1.97  0.049
SMOK_2         41.60     24.40   1.70  0.089
VITUSE_1      -48.64     24.89  -1.95  0.052

S = 196.917   R-Sq = 13.5%   R-Sq(adj) = 11.4%

Analysis of Variance
Source           DF        SS      MS     F      P
Regression        7   1725637  246520  6.36  0.000
Residual Error  284  11012454   38776
Total           291  12738091

Source     DF  Seq SS
AGE         1  594668
QUETELET    1   56352
ALCOHOL     1  563668
BETADIET    1  119721
SEX_F       1  134397
SMOK_2      1  108714
VITUSE_1    1  148116

Unusual Observations
Obs   AGE  RETPLASMA    Fit  SE Fit  Residual  St Resid
 15  64.0     1249.0  857.3    56.7     391.7    2.08RX
 17  75.0     1262.0  810.6    40.3     451.4    2.34R
 19  57.0     1727.0  722.1    36.6    1004.9    5.19R
 45  69.0      321.0  757.7    41.4    -436.7   -2.27R
 69  33.0      227.0  693.3    46.6    -466.3   -2.44R
 74  50.0     1031.0  857.0    63.1     174.0    0.93 X
127  70.0     1139.0  650.2    26.2     488.8    2.50R
131  46.0      476.0  790.0    60.2    -314.0   -1.67 X
187  49.0     1443.0  708.3    32.7     734.7    3.78R
256  33.0      194.0  604.2    45.2    -410.2   -2.14R
269  36.0     1102.0  507.9    26.2     594.1    3.04R
272  54.0     1517.0  602.3    20.8     914.7    4.67R
275  41.0     1193.0  794.4    64.8     398.6    2.14RX
286  66.0      986.0  593.8    35.9     392.2    2.03R
R denotes an observation with a large standardized residual.
X denotes an observation whose X value gives it large leverage.
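The St Resid column of the Unusual Observations table can be recomputed from the Residual and SE Fit columns together with S = 196.917. The Python sketch below assumes the standard least-squares relations SE Fit = S * sqrt(h) and St Resid = Residual / (S * sqrt(1 - h)), where h is the leverage of the observation; the function and variable names are hypothetical, and the cut-off of 2 in absolute value is the usual convention behind the R flag.

```python
import math

S = 196.917  # residual standard deviation of the first regression fit


def standardized_residual(residual, se_fit, s=S):
    """Standardized residual from the raw residual and the SE of the fit."""
    leverage = (se_fit / s) ** 2          # h, since SE Fit = s * sqrt(h)
    return residual / (s * math.sqrt(1.0 - leverage))


# (Residual, SE Fit) for three observations from the table above.
observations = {15: (391.7, 56.7), 19: (1004.9, 36.6), 74: (174.0, 63.1)}

st_resid = {obs: round(standardized_residual(res, se), 2)
            for obs, (res, se) in observations.items()}

# Observations whose standardized residual exceeds 2 in absolute value.
large = [obs for obs, value in st_resid.items() if abs(value) > 2]
print(st_resid, large)
```

Observation 74 has a small standardized residual and is listed only because of its leverage (X), whereas observations 15 and 19 exceed the residual cut-off.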
After we delete the outliers, we do regression again and get the following result.
Regression Analysis: RETPLASMA versus AGE, QUETELET, ...
The regression equation is
RETPLASMA = 517 + 2.09 AGE + 1.85 QUETELET + 5.23 ALCOHOL - 0.0149
BETADIET - 71.7 SEX_F + 41.9 SMOK_2 - 43.4 VITUSE_1
Predictor       Coef   SE Coef      T      P
Constant      516.78     69.91   7.39  0.000
AGE           2.0923    0.6679   3.13  0.002
QUETELET       1.847     1.758   1.05  0.295
ALCOHOL        5.228     2.688   1.94  0.053
BETADIET   -0.014933  0.006885  -2.17  0.031
SEX_F         -71.71     32.27  -2.22  0.027
SMOK_2         41.85     19.75   2.12  0.035
VITUSE_1      -43.42     20.30  -2.14  0.033

S = 155.146   R-Sq = 11.9%   R-Sq(adj) = 9.6%

Analysis of Variance
Source           DF       SS      MS     F      P
Regression        7   875521  125074  5.20  0.000
Residual Error  270  6498945   24070
Total           277  7374466

Source     DF  Seq SS
AGE         1  345437
QUETELET    1    6545
ALCOHOL     1  115308
BETADIET    1   85087
SEX_F       1  102453
SMOK_2      1  110527
VITUSE_1    1  110163
4. Residual plots
Fig 22 Residual plots of retinol plasma
Fig 23 Residual plots of retinol plasma after deleting the outliers
6.2.2 GLM Method
General Linear Model: RETPLASMA versus SEX, SMOKSTAT, VITUSE
Factor    Type   Levels  Values
SEX       fixed       2  1, 2
SMOKSTAT  fixed       3  1, 2, 3
VITUSE    fixed       3  1, 2, 3

Analysis of Variance for RETPLASMA, using Adjusted SS for Tests
Source        DF   Seq SS   Adj SS  Adj MS     F      P
AGE            1   345437   137560  137560  5.71  0.018
SEX            1   132454   139672  139672  5.79  0.017
SMOKSTAT       2   169037   157668   78834  3.27  0.040
QUETELET       1     4344    14227   14227  0.59  0.443
VITUSE         2    61941   103465   51733  2.15  0.119
CALORIES       1    48196     9072    9072  0.38  0.540
FIBER          1    44141    17672   17672  0.73  0.393
FAT            1    84885    50081   50081  2.08  0.151
ALCOHOL        1    60373    81431   81431  3.38  0.067
CHOLESTEROL    1    14008    20773   20773  0.86  0.354
BETADIET       1    61604    59894   59894  2.48  0.116
RETDIET        1     7689     7689    7689  0.32  0.573
Error        263  6340355  6340355   24108
Total        277  7374466

S = 155.267   R-Sq = 14.02%   R-Sq(adj) = 9.45%

Term              Coef   SE Coef      T      P
Constant        527.61     76.50   6.90  0.000
AGE             1.7782    0.7444   2.39  0.018
QUETELET         1.400     1.823   0.77  0.443
CALORIES       0.03414   0.05565   0.61  0.540
FIBER           -2.679     3.129  -0.86  0.393
FAT            -1.2532    0.8695  -1.44  0.151
ALCOHOL          5.193     2.826   1.84  0.067
CHOLESTEROL     0.1183    0.1275   0.93  0.354
BETADIET     -0.012755  0.008092  -1.58  0.116
RETDIET        0.01527   0.02704   0.56  0.573

Unusual Observations for RETPLASMA
Obs  RETPLASMA     Fit  SE Fit  Residual  St Resid
 10     935.00  629.25   31.57    305.75    2.01 R
 29    1002.00  686.15   50.25    315.85    2.15 R
 53     247.00  553.66   32.91   -306.66   -2.02 R
 69     187.00  524.09   34.32   -337.09   -2.23 R
107     927.00  516.64   48.70    410.36    2.78 R
114     917.00  603.00   31.11    314.00    2.06 R
203     879.00  538.19   34.61    340.81    2.25 R
254     792.00  484.88   31.34    307.12    2.02 R
256     995.00  638.58   26.90    356.42    2.33 R
261     649.00  591.68   64.80     57.32    0.41 X
274     216.00  554.43   27.29   -338.43   -2.21 R
R denotes an observation with a large standardized residual.
X denotes an observation whose X value gives it large leverage.
So:
RETPLASMA = 527.61 + 1.7782 AGE + 5.193 ALCOHOL - 0.012755 BETADIET
To look for interactions between the predictors, we ran many trials; the model
below adds the interaction between VITUSE and RETDIET.
Another solution:
General Linear Model: RETPLASMA versus SEX, SMOKSTAT, VITUSE
Factor    Type   Levels  Values
SEX       fixed       2  1, 2
SMOKSTAT  fixed       3  1, 2, 3
VITUSE    fixed       3  1, 2, 3

Analysis of Variance for RETPLASMA, using Adjusted SS for Tests
Source           DF   Seq SS   Adj SS  Adj MS     F      P
AGE               1   345437   150885  150885  6.26  0.013
SEX               1   132454   117093  117093  4.86  0.028
SMOKSTAT          2   169037   149864   74932  3.11  0.046
QUETELET          1     4344    19396   19396  0.81  0.370
VITUSE            2    61941    62854   31427  1.30  0.273
CALORIES          1    48196    11715   11715  0.49  0.486
FIBER             1    44141    15166   15166  0.63  0.428
FAT               1    84885    49577   49577  2.06  0.153
ALCOHOL           1    60373    75105   75105  3.12  0.079
CHOLESTEROL       1    14008    20296   20296  0.84  0.360
BETADIET          1    61604    66667   66667  2.77  0.097
RETDIET           1     7689     3889    3889  0.16  0.688
VITUSE*RETDIET    2    53358    53358   26679  1.11  0.332
Error           261  6286997  6286997   24088
Total           277  7374466

S = 155.203   R-Sq = 14.75%   R-Sq(adj) = 9.52%

Term               Coef   SE Coef      T      P
Constant         510.86     77.70   6.57  0.000
AGE              1.8777    0.7502   2.50  0.013
QUETELET          1.646     1.834   0.90  0.370
CALORIES        0.03893   0.05582   0.70  0.486
FIBER            -2.490     3.138  -0.79  0.428
FAT             -1.2604    0.8786  -1.43  0.153
ALCOHOL           5.002     2.833   1.77  0.079
CHOLESTEROL      0.1174    0.1279   0.92  0.360
BETADIET      -0.013507  0.008119  -1.66  0.097
RETDIET         0.01103   0.02745   0.40  0.688
RETDIET*VITUSE
  1             0.02579   0.03142   0.82  0.413
  2            -0.04709   0.03182  -1.48  0.140

Unusual Observations for RETPLASMA
Obs  RETPLASMA     Fit  SE Fit  Residual  St Resid
  7     825.00  692.87   70.09    132.13    0.95 X
 10     935.00  625.32   32.15    309.68    2.04 R
 29    1002.00  696.28   51.39    305.72    2.09 R
 53     247.00  570.71   37.81   -323.71   -2.15 R
 69     187.00  528.78   34.98   -341.78   -2.26 R
107     927.00  504.35   50.90    422.65    2.88 R
114     917.00  596.38   32.25    320.62    2.11 R
203     879.00  544.27   34.98    334.73    2.21 R
256     995.00  627.73   30.58    367.27    2.41 R
261     649.00  551.43   70.25     97.57    0.70 X
265     946.00  635.52   25.89    310.48    2.03 R
274     216.00  566.66   32.85   -350.66   -2.31 R
R denotes an observation with a large standardized residual.
X denotes an observation whose X value gives it large leverage.
Residual Plots for RETPLASMA (panels: Normal Probability Plot, Versus Fits,
Histogram, Versus Order)
The VITUSE*RETDIET interaction is not significant (P = 0.332), so no
interaction term is retained:
RETPLASMA = 510.86 + 1.8777 AGE + 5.002 ALCOHOL - 0.013507 BETADIET
8. REFERENCES
[1] Peto R, Doll R, Buckley JD, et al. Can dietary beta-carotene materially reduce
human cancer rates? Nature 1981;290:201-8.
[2] Russell-Briefel R, Bates MW, Kuller LH. The relationship of plasma carotenoids
to health and biochemical factors in middle-aged men. Am J Epidemiol
1986;122:741-9.
[3] Stryker WS, Kaplan LA, Stein EA, et al. The relation of diet, cigarette smoking,
and alcohol consumption to plasma beta-carotene and alpha-tocopherol levels. Am
J Epidemiol 1988;127:283-96.
[4] Adams-Campbell, L. L., M. U. Nwankwo, et al. (1992). Serum retinol,
carotenoids, vitamin E, and cholesterol in Nigerian women. Nutritional
Biochemistry 3(2): 58-61.
[5] Comstock, G. W., M. S. Menkes, et al. (1988). Serum levels of retinol,
beta-carotene, and alpha-tocopherol in older adults. American Journal of
Epidemiology 127(1): 114-123.
[6] Russell-Briefel, R., M. W. Bates, et al. (1985). The relationship of plasma
carotenoids to health and biochemical factors in middle-aged men. American
Journal of Epidemiology 122(5): 741-749.
[7] Stähelin, H. B., E. Buess, et al. (1982). Vitamin A, cardiovascular risk factors, and
mortality. The Lancet 319(8268): 394-395.
[8] Van Poppel, G. and H. van den Berg (1997). Vitamins and cancer. Cancer Letters
114(1-2): 195-202.