APPLIED STATISTICS, FALL 2009

A STUDY OF DETERMINANTS OF PLASMA RETINOL AND BETA-CAROTENE

PROJECT REPORT

TUTOR: Dr. Kaibo Wang

TEAM MEMBERS (ORDERED BY STUDENT ID):
WANG, JUN 2009210552
CUI, WEN 2009210554
LV, SHIKUN 2009210566
SUN, NINGNING 2009210571

Applied Statistics Report, Industrial Engineering, Tsinghua University

TABLE OF CONTENTS

1. INTRODUCTION
2. LITERATURE REVIEW
3. PURPOSE OF THE STUDY
4. OUTLINE OF THE ANALYSIS
5. ANALYSIS RESULTS
5.1 VARIABLES TYPES AND LEVELS
5.1.1 Quantitative variables
5.1.2 Categorical variables
5.2 DESCRIPTIVE ANALYSIS
5.2.1 SEX
5.2.2 VITUSE
5.2.3 SMOKSTAT
5.2.4 AGE
5.2.5 QUETELET
5.2.6 CALORIES
5.2.7 FAT
5.2.8 FIBER
5.2.9 ALCOHOL
5.2.10 CHOLESTEROL
5.2.11 BETADIET
5.2.12 RETDIET
5.3 DATA ANALYSIS VIA REGRESSION AND GLM
5.3.1 Plasma Beta-carotene
5.3.2 Retinol
6. DISCUSSION
7. APPENDIX
6.1 Plasma Beta-carotene
6.1.1 Regression method
6.1.2 GLM Method
6.2 Retinol
6.2.1 Regression Method
6.2.2 GLM Method
8. REFERENCES

1. INTRODUCTION

Observational studies have suggested that low dietary intake or low plasma concentrations of retinol, beta-carotene, or other carotenoids might be associated with increased risk of developing certain types of cancer. However, relatively few studies have investigated the determinants of plasma concentrations of these micronutrients. In this project, we use data from a cross-sectional study designed to investigate the relationship between personal characteristics and dietary factors on the one hand, and plasma concentrations of retinol, beta-carotene and other carotenoids on the other. Study subjects (N = 315) were patients who had an elective surgical procedure during a three-year period to biopsy or remove a lesion of the lung, colon, breast, skin, ovary or uterus that was found to be non-cancerous. We want to check whether there is wide variability in plasma concentrations of these micronutrients in humans, and whether much of this variability is associated with dietary habits and personal characteristics.

2. LITERATURE REVIEW

Observational studies have suggested that low dietary intake or low plasma concentrations of retinol, beta-carotene, or other carotenoids might be associated with increased risk of developing certain types of cancer [1]. The relationships between plasma carotenoids, plasma cholesterol, cigarette smoking, vitamin supplement use, and intakes of alcohol, vitamin A, and carotene were investigated in 1981 for 187 men from the Multiple Risk Factor Intervention Trial in Pittsburgh by Russell-Briefel R [2]. The total plasma carotenoid value was positively correlated with the dietary carotene and vitamin A indices (estimated by a food frequency questionnaire), vitamin A supplement usage, and plasma cholesterol, and inversely related to cigarette smoking, alcohol intake, and serum aspartate transaminase.
The mean plasma carotenoid level was higher in nonsmokers (mean 186 µg/dl, 95% confidence interval (CI) 178–195) than in cigarette smokers (mean 164 µg/dl, 95% CI 151–178), and higher in vitamin A supplement users (mean 206 µg/dl, 95% CI 188–224) than in nonusers (mean 172 µg/dl, 95% CI 164–179). Variables associated with the total plasma carotenoids in multiple regression analyses included dietary vitamin A and carotene, calorie intake, weekly alcohol intake, cigarette smoking, vitamin supplement usage, and plasma cholesterol; together they accounted for 27% of the variance. The total plasma carotenoid value was also highly correlated with plasma beta-carotene (r = 0.67) and lycopene (r = 0.68). The mean beta-carotene (30 µg/dl), however, accounted for only 16% of the total plasma carotenoids.

The relationship of diet and nutritional supplements, cigarette use, alcohol consumption, and blood lipids to plasma levels of beta-carotene was studied among 330 men and women aged 18–79 years by Stryker WS [3]. Dietary carotene, preformed vitamin A, and vitamin E intake were estimated by a self-administered semiquantitative food frequency questionnaire. The correlation of dietary carotene with plasma beta-carotene was reduced in smokers compared with nonsmokers (r = 0.02 vs. 0.44 among men; r = 0.19 vs. 0.45 among women). Smokers had much lower plasma levels of beta-carotene than did nonsmokers (geometric mean 8.5 vs. 15.3 µg/dl for men; 17.3 vs. 26.3 µg/dl for women) despite only slightly lower intakes of carotenoids. In multiple regression analyses, men who smoked one pack per day had 72% (95% CI 58–89) of the plasma beta-carotene levels of nonsmokers after accounting for dietary carotene and other variables; for women, the corresponding percentage was 79% (CI 64–99).
In similar models, men drinking 20 g of alcohol per day had 76% (CI 65–88) of the beta-carotene levels of nondrinkers; women had 89% (CI 73–108) of the levels of nondrinkers. An interaction term for carotene intake and smoking was statistically significant in a model combining both sexes. These results suggest that plasma levels of beta-carotene among smokers and, perhaps, heavy consumers of alcohol may be reduced substantially below the levels that differences in diet alone would produce. The correlation of calorie-adjusted intake of vitamin E with lipid-adjusted plasma levels of alpha-tocopherol was 0.53 for men (n = 137) and 0.51 for women (n = 193) and did not differ by alcohol consumption or cigarette use; these correlations were largely accounted for by use of vitamin supplements. In linear regression models, vitamin E intake and plasma lipids were significant predictors of plasma alpha-tocopherol levels.

Vitamin A and its analogues have an important role in cellular processes related to carcinogenesis. The prospect that high intake of certain vitamins may confer protection against cancer has drawn substantial attention during past decades. As a consequence, some of the epidemiologic literature pertaining to these vitamins has dealt with the relation of cancer to either their intake in the diet or their levels in the serum. Willett WC showed that the risk of developing cancer may be lower at higher levels of plasma retinol. However, plasma retinol levels are under strict control, and a high intake of preformed vitamin A does not seem to be relevant for cancer prevention. Consequently, many epidemiologic studies have been conducted primarily as dietary studies of vitamin A and carotene, or as blood studies of serum retinol. Researchers have tried to identify the relationship between dietary or personal factors and plasma retinol. Stähelin, H. B.
suggested an inverse relationship between vitamin A and cancer risk, although some studies have found no relationship. Later work found that a lower retinol level is not the cause of invasive cancer; instead, it is the cancer that brings about a lower retinol level in the human body.

In daily life, vitamin A is obtained through the diet and daily consumption. In addition, living habits and conditions such as social status, smoking, drinking and age, among others, also have an important effect on the plasma retinol level. Personal factors such as smoking status, diet, gender and height have a complex relationship with plasma retinol, because these factors may influence one another or confound the result. Among these personal factors, the most strongly related is gender: males tend to have a higher retinol level. Higher levels of serum vitamin A among males than among females, and higher levels of serum carotene in females than in males, have been reported in studies in the United States, England, West Germany, Denmark, and Switzerland. Because hepatic secretion of retinol binding protein maintains fairly constant levels of retinol over a wide range of dietary retinol intake, plasma retinol levels are largely independent of vitamin A intake, provided that a person's vitamin A status is neither deficient nor excessive.

In addition to the factors mentioned above, we also include age, height and weight, and other factors such as fat and cholesterol intake, which may affect the plasma retinol level and may at the same time contribute to cancer.

3. PURPOSE OF THE STUDY

Many dietary and biochemical epidemiologic studies have shown a strong association between beta-carotene, retinol and the risk of cancer.
These findings have increased interest in factors influencing beta-carotene and retinol levels in human plasma or serum, and have also inspired scientists to look for internal factors that may affect, or be related to, the beta-carotene and retinol in people's plasma.

As mentioned above, previous studies have indicated that if a person's dietary history shows greater consumption of green or yellow leafy vegetables, which contain more carotene, that person's beta-carotene level is high. Smokers and alcohol drinkers have lower beta-carotene levels. However, the relationship between beta-carotene levels and a person's age, sex, Quetelet index, and the calories, fat, fiber and cholesterol consumed per day is less clear, because the evidence is either scanty or conflicting.

Similarly, higher levels of serum retinol may be associated with a decreased risk of cancer, although the epidemiologic evidence is less consistent than that for beta-carotene. It appears that lower retinol levels may be a consequence rather than a cause of invasive cancer. The relation between personal factors and plasma retinol levels is complex, since hepatic secretion of retinol binding protein maintains fairly constant levels of retinol over a wide range of dietary retinol intake.

In this study, we use data from an observational study with 315 patients.
In this data set there are 12 independent variables (age, sex, smoking status, Quetelet index, vitamin use, number of calories consumed per day, grams of fat consumed per day, grams of fiber consumed per day, number of alcoholic drinks consumed per week, milligrams of cholesterol consumed per day, micrograms of dietary beta-carotene consumed per day and micrograms of dietary retinol consumed per day) and 2 dependent variables (plasma beta-carotene and plasma retinol). They are analyzed below to find out whether any of these personal characteristics has an effect on a person's plasma beta-carotene or plasma retinol.

4. OUTLINE OF THE ANALYSIS

The report gives a statistical analysis of the data. The first part is the descriptive analysis, showing the mean, median, minimum, maximum and quartiles of all the independent variables, with graphs to give a clear picture of the pattern of variability in the data.

The second part is the data analysis via regression.

• In this part, we first show scatter plots of the dependent variables against the independent variables to examine the relationships between them. If some variables are highly skewed, a log transformation is applied or some data points are deleted to make the distributions nearly symmetric.
• Then best subset and stepwise regression methods are used to select model variables.
• After the data have been treated in this way, residual plots are drawn to check whether the residuals are normally distributed, ε ~ N(0, σ²), and independently and identically distributed (i.i.d.).
• Based on all the information above, a final regression model is developed, and conclusions and explanations are given.

5. ANALYSIS RESULTS

The statistical analysis software Minitab was used to analyze the data collected in the study.

5.1 VARIABLES TYPES AND LEVELS

5.1.1 Quantitative variables

AGE: Age (years).
QUETELET: Quetelet index (weight/height^2).
CALORIES: Number of calories consumed per day.
FAT: Grams of fat consumed per day.
FIBER: Grams of fiber consumed per day.
ALCOHOL: Number of alcoholic drinks consumed per week.
CHOLESTEROL: Cholesterol consumed (mg per day).
BETADIET: Dietary beta-carotene consumed (mcg per day).
RETDIET: Dietary retinol consumed (mcg per day).

5.1.2 Categorical variables

SEX: Sex (1=Male, 2=Female).
SMOKSTAT: Smoking status (1=Never, 2=Former, 3=Current Smoker).
VITUSE: Vitamin use (1=Yes, fairly often, 2=Yes, not often, 3=No).

5.2 DESCRIPTIVE ANALYSIS

There are 14 variables in total. As shown in the following table, the first 12 variables are potential determinants of plasma retinol and beta-carotene levels; the last two, BETAPLASMA and RETPLASMA, are the response variables. Basic descriptive statistics were calculated in Minitab and are listed below.

Variable      Mean     StDev    Min      Q1       Median   Q3       Max
AGE           50.146   14.575   19.000   39.000   48.000   63.000   83.000
SEX           1.8667   0.3405   1.0000   2.0000   2.0000   2.0000   2.0000
SMOKSTAT      1.6381   0.7110   1.0000   1.0000   2.0000   2.0000   3.0000
QUETELET      26.157   6.014    16.331   21.789   24.735   28.950   50.403
VITUSE        1.9651   0.8607   1.0000   1.0000   2.0000   3.0000   3.0000
CALORIES      1796.7   680.3    445.2    1333.8   1666.8   2106.4   6662.2
FAT           77.03    33.83    14.40    53.90    72.90    95.30    235.90
FIBER         12.789   5.330    3.100    9.100    12.100   15.600   36.800
ALCOHOL       3.279    12.323   0.000    0.000    0.300    3.200    203.000
CHOLESTEROL   242.46   131.99   37.70    154.90   206.30   308.90   900.70
BETADIET      2185.6   1473.9   214.0    1114.0   1802.0   2863.0   9642.0
RETDIET       832.7    589.3    30.0     479.0    707.0    1047.0   6901.0
BETAPLASMA    189.9    183.0    0.0      89.0     140.0    231.0    1415.0
RETPLASMA     602.8    208.9    179.0    466.0    566.0    719.0    1727.0

5.2.1 SEX

The variable SEX has two levels: 1 represents male and 2 represents female. To see whether and how plasma retinol is affected by sex, we draw two box plots for comparison, one for males and one for females.
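The quantities that a box plot displays can also be tabulated per group. The following is a minimal sketch in Python (the report itself used Minitab). The function names and the sample records are hypothetical and are not the study data, and quartile conventions differ between packages, so the values need not match Minitab's exactly.

```python
from statistics import median, quantiles


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max), the quantities a box plot displays."""
    q1, _, q3 = quantiles(values, n=4)  # default "exclusive" quartile method
    return min(values), q1, median(values), q3, max(values)


def summarize_by_group(records, group_key, value_key):
    """Group records by a categorical variable and summarize each group."""
    groups = {}
    for row in records:
        groups.setdefault(row[group_key], []).append(row[value_key])
    return {level: five_number_summary(vals) for level, vals in sorted(groups.items())}


# Hypothetical illustrative records, NOT the study data (SEX: 1=Male, 2=Female).
records = [
    {"SEX": 1, "RETPLASMA": 520}, {"SEX": 1, "RETPLASMA": 640},
    {"SEX": 1, "RETPLASMA": 700}, {"SEX": 1, "RETPLASMA": 810},
    {"SEX": 2, "RETPLASMA": 450}, {"SEX": 2, "RETPLASMA": 560},
    {"SEX": 2, "RETPLASMA": 590}, {"SEX": 2, "RETPLASMA": 730},
]
for level, summary in summarize_by_group(records, "SEX", "RETPLASMA").items():
    print(level, summary)
```

The same helper could be applied to VITUSE or SMOKSTAT by changing the grouping key.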
We can read the max, min, median, Q1 and Q3 of plasma retinol for males and females respectively. All of these statistics are higher for males than for females. Thus sex is an important variable in determining the plasma retinol level in the human body: the level in males is generally higher than in females.

The box plots of plasma beta-carotene for males and females are shown below, where 1 and 2 again represent male and female respectively.

From the figure above and the specific values of the min, max, median, Q1 and Q3 computed in Minitab, the level is slightly higher in females than in males. There are also many outliers among females that are much higher than the other values.

5.2.2 VITUSE

We use the same method to examine the relationships between plasma retinol, plasma beta-carotene and vitamin use. Vitamin use is recorded on a 3-point scale (1=Yes, fairly often, 2=Yes, not often, 3=No). We draw box plots of plasma retinol for the three VITUSE groups.

The statistics for the three groups, such as the min, max and median, are at almost the same level, and there is no obvious difference among the three levels. We may say that vitamin use has no significant influence on the plasma retinol level.

From the above figure, we see that people who use vitamins fairly often have a relatively higher level of plasma beta-carotene than the others, and people who use vitamins but not often have a higher level than non-users. Taking the outliers into account, the conclusion is the same: frequent vitamin use is associated with a higher level of plasma beta-carotene.

5.2.3 SMOKSTAT

To examine the relationship between plasma retinol and smoking status, we use a 3-point scale to describe smoking status
(1=Never, 2=Former, 3=Current Smoker). The three groups are then used to examine the relationships between plasma retinol, plasma beta-carotene and smoking status.

It is interesting to see that the min, max, median, Q1 and Q3 of group 2 are all higher than those of the other groups, which means that former smokers usually have a higher level of plasma retinol than the others. The statistics of never-smokers lie between those of former smokers and current smokers.

From the above figure, we see a different result from that for plasma retinol. People who have never smoked have a higher plasma beta-carotene level than the others, and current smokers tend to have a lower level than never-smokers and former smokers.

5.2.4 AGE

AGE
Mean     StDev    Min      Q1       Median   Q3       Max
50.146   14.575   19.000   39.000   48.000   63.000   83.000

As shown in the descriptive statistics table and the figure above, the mean age of the participants is around 50. Most of the data lie between 32 and 77, which indicates that most participants are middle-aged or elderly.

5.2.5 QUETELET

To analyze the height and weight of the participants, we combine these two physical measurements into a single value, weight/(height^2), called the Quetelet index. The Quetelet index is a statistical measure that compares a person's weight and height. Although it does not actually measure the percentage of body fat, it is used to estimate a healthy body weight based on how tall a person is. The lower the value, the thinner the person; conversely, a very high value means the person is overweight relative to their height. The table below gives the categories according to the WHO.
Category                BMI range (kg/m^2)
Severely underweight    less than 16.5
Underweight             from 16.5 to 18.4
Normal                  from 18.5 to 24.9
Overweight              from 25 to 30
Obese Class I           from 30.1 to 34.9
Obese Class II          from 35 to 40
Obese Class III         over 40

The data for all the participants are as follows.

QUETELET
Mean     StDev    Min      Q1       Median   Q3       Max
26.157   6.014    16.331   21.789   24.735   28.950   50.403

As shown in the descriptive statistics table and the figure above, the mean Quetelet index of the participants is around 26. Most of the data lie between 18.5 and 30, which indicates that some participants are of normal weight and some are slightly overweight.

5.2.6 CALORIES

CALORIES
Mean     StDev    Min      Q1       Median   Q3       Max
1796.7   680.3    445.2    1333.8   1666.8   2106.4   6662.2

As shown in the descriptive statistics table and the figure above, the mean number of calories consumed per day is around 1800. Most of the data lie between 1000 and 2200.

5.2.7 FAT

FAT
Mean     StDev    Min      Q1       Median   Q3       Max
77.03    33.83    14.40    53.90    72.90    95.30    235.90

As shown in the descriptive statistics table and the histogram above, the mean amount of fat consumed per day is around 77 g. Most of the data lie between 45 and 135.

5.2.8 FIBER

FIBER
Mean     StDev    Min      Q1       Median   Q3       Max
12.789   5.330    3.100    9.100    12.100   15.600   36.800

As shown in the descriptive statistics table and the histogram above, the mean amount of fiber consumed per day is about 12.8 g. Most of the data lie between 6 and 18.
5.2.9 ALCOHOL

ALCOHOL
Mean     StDev    Min      Q1       Median   Q3       Max
3.279    12.323   0.000    0.000    0.300    3.200    203.000

In the histogram above, we can see that there is an outlier in this sample, whose number of alcoholic drinks consumed per week is 203. Therefore, we delete this point and draw the histogram again as below.

As shown in the descriptive statistics table (computed with the outlier included) and the figure above, the mean number of alcoholic drinks consumed per week is around 3.3. Most of the data lie between 0 and 6.

5.2.10 CHOLESTEROL

CHOLESTEROL
Mean     StDev    Min      Q1       Median   Q3       Max
242.46   131.99   37.7     154.9    206.3    308.9    900.7

[Figure: Histogram of CHOLESTEROL (x-axis: CHOLESTEROL, y-axis: Frequency)]

The mean value is about 242, and most of the data lie in the interval between 50 and 250, which indicates the common level of cholesterol consumption.

5.2.11 BETADIET

BETADIET
Mean     StDev    Min      Q1       Median   Q3       Max
2185.6   1473.9   214      1114     1802     2863     9642

[Figure: Histogram of BETADIET (x-axis: BETADIET, y-axis: Frequency)]

The mean value is about 2186, and most of the data lie in the interval between 1000 and 2500, which indicates that most people consume about 2000 mcg of beta-carotene per day.

5.2.12 RETDIET

RETDIET
Mean     StDev    Min      Q1       Median   Q3       Max
832.7    589.3    30       479      707      1047     6901

[Figure: Histogram of RETDIET (x-axis: RETDIET, y-axis: Frequency)]

The mean value is about 833, and most of the data lie in the interval between 500 and 1500, which indicates that most people consume about 1000 mcg of retinol per day.
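Deleting an extreme observation, as was done for the ALCOHOL value of 203 in section 5.2.9, shifts the summary statistics. As a rough worked example using only numbers reported in the descriptive table (N = 315, mean 3.279, deleted value 203), the mean of the remaining 314 observations can be recovered without the raw data. The function name is hypothetical, and the result is only as precise as the rounded mean in the table.

```python
def mean_after_removing(mean, n, removed_value):
    """Mean of the remaining n - 1 observations after deleting one value."""
    return (mean * n - removed_value) / (n - 1)


# ALCOHOL row of the descriptive table: N = 315, mean = 3.279, max = 203.
new_mean = mean_after_removing(3.279, 315, 203.0)
print(round(new_mean, 2))  # 2.64
```

Under these numbers the mean drops from about 3.3 to about 2.6 drinks per week once the outlier is removed, which illustrates how strongly a single point can influence the mean of a skewed variable.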
5.3 DATA ANALYSIS VIA REGRESSION AND GLM

5.3.1 Plasma Beta-carotene

5.3.1.1 Regression method

1. Check with Scatter Plots

First, scatter plots of plasma beta-carotene against all the continuous predictors are drawn to check whether there are any outliers or data clusters; see the scatter plots in the appendix. After deleting the outliers from each pair of plasma beta-carotene and predictor, another three scatter plots are drawn, in which the points are more widely spread, as shown in the appendix.

2. Best subset and Stepwise regression to select model variables

During the analysis, we use dummy variables in place of the discrete variables SEX, SMOKSTAT and VITUSE, and name them sex_1; smoking_status_1, smoking_status_2 and smoking_status_3; vitamin_status_1, vitamin_status_2 and vitamin_status_3. We then tried stepwise regression, forward selection and backward elimination, and finally selected QUETELET, BETADIET, vitamin_status_3, smoking_status_3 and FAT as the model variables.

Variables           T-Value   P-value
QUETELET            -4.11     0.000
BETADIET            3.57      0.000
vitamin_status_3    -3.17     0.002
smoking_status_3    -2.04     0.042
FAT                 -1.88     0.061

R-Sq = 15.09%; R-Sq(adj) = 13.67%

3. Regression Analysis

(1) Residual check

From the residual plots (Figure 7 in the appendix), we can see that the points in the normal probability plot do not follow a normal distribution at all, and in the Versus Fits plot the red dots are highly skewed. We therefore apply a log transformation and use log(plasma beta-carotene) in place of plasma beta-carotene. After another round of scatter plot checks and data deletion, we draw new residual plots (Figure 8). This time the normal probability plot is close to a normal distribution. In the Versus Fits plot, the range of the red dots is almost the same for each group, which indicates that the assumption ε ~ N(0, σ²) is acceptable.
Also, the red dots in the Versus Order plot show no pattern, so the assumption that the data are independently and identically distributed (i.i.d.) is also acceptable.

Best subset and stepwise regression are therefore run again to select the variables used in the model. This time QUETELET, BETADIET, vitamin_status_3, smoking_status_3, FAT, AGE, sex_2 and FIBER are selected as the model variables.

Variables           T-Value   P-value
QUETELET            -5.05     0.000
BETADIET            1.93      0.054
vitamin_status_3    -3.48     0.001
smoking_status_3    -2.29     0.023
FAT                 -1.98     0.048
AGE                 2.00      0.046
sex_2               1.73      0.085
FIBER               1.51      0.132

R-Sq = 21.44%; R-Sq(adj) = 19.31%

(2) Regression Model

In the analysis of variance, the p-value is 0.000, which indicates that at least one coefficient is not 0, i.e. the model is useful. The p-values of QUETELET, vitamin_status_3, smoking_status_3, FAT and AGE are smaller than 0.05, which indicates that these predictors should be kept in the final model. However, to improve the fit, we also keep in the final model the variables whose p-values are larger than 0.05 (BETADIET, sex_2 and FIBER). R-Sq = 21.4% and R-Sq(adj) = 19.3%, so the model explains about one fifth of the variance in log(plasma beta-carotene); it identifies significant determinants, but its predictive accuracy is limited.

4. The final model

log(plasma beta-carotene) = 2.32 - 0.0140 QUETELET - 0.124 vitamin_status_3 - 0.116 smoking_status_3 + 0.000025 BETADIET - 0.00113 FAT + 0.00248 AGE + 0.0934 sex_2 + 0.00632 FIBER

The coefficients of QUETELET, vitamin_status_3, smoking_status_3 and FAT are negative, which indicates that plasma beta-carotene decreases as these variables increase. In other words, a person who is overweight, is a current smoker and never takes vitamins would have low plasma beta-carotene.
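Because the response is on a log scale, each coefficient corresponds to a multiplicative change in plasma beta-carotene itself. The sketch below converts the negative coefficients above into such factors. It assumes the logarithm is base 10, which the report does not state; the intercept of 2.32 is consistent with this, since 10^2.32 is about 209 and of the same order as the reported mean of 189.9, whereas e^2.32 is only about 10. The function name is hypothetical.

```python
def multiplicative_effect(coef, delta=1.0, base=10.0):
    """Factor by which the response is multiplied when a predictor rises by
    `delta`, given a coefficient fitted on the log (base `base`) scale.
    ASSUMPTION: base 10; the report does not state the base of its logarithm."""
    return base ** (coef * delta)


# Coefficients taken from the final regression model for log(plasma beta-carotene).
smoker_factor = multiplicative_effect(-0.116)                # smoking_status_3: 0 -> 1
no_vitamin_factor = multiplicative_effect(-0.124)            # vitamin_status_3: 0 -> 1
quetelet_factor = multiplicative_effect(-0.0140, delta=5.0)  # QUETELET: +5 units

print(round(smoker_factor, 2), round(no_vitamin_factor, 2), round(quetelet_factor, 2))
```

Under the base-10 assumption, a current smoker would have roughly 77% of the plasma beta-carotene of an otherwise comparable person, which is of the same order as the 72% to 79% reported in the literature review for one pack per day.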
The coefficients of BETADIET, AGE, sex_2 and FIBER are positive, which indicates that plasma beta-carotene increases as these variables increase. In other words, a person who is older, is female, and has a high daily intake of dietary beta-carotene and fiber would have higher plasma beta-carotene.

5.3.1.2 General Linear Model

1. Check with Scatter Plots

As in the regression procedure, after replacing plasma beta-carotene with log(plasma beta-carotene), we draw scatter plots to see the relationship between the predictors and the dependent variable, and then delete the outliers.

2. Residual check

From the residual plots (Figure 12), we can see that the normal probability plot is close to a normal distribution. In the Versus Fits plot, the range of the red dots is almost the same for each group, which indicates that the assumption ε ~ N(0, σ²) is acceptable. Also, the red dots in the Versus Order plot show no pattern, so the assumption that the data are independently and identically distributed (i.i.d.) is also acceptable.

3. General Linear Model

The table below lists all the variables considered for the final model and their p-values. A p-value lower than 0.05 indicates that the variable's coefficient is significantly different from 0. Because the p-values of CALORIES, FAT, ALCOHOL and CHOLESTEROL are much larger than 0.05, we remove them from the final model.

Variables           T-Value   P-value
AGE                 1.70      0.090
QUETELET            -4.92     0.000
CALORIES            -0.87     0.385
FAT                 0.31      0.758
FIBER               1.49      0.139
ALCOHOL             0.32      0.750
CHOLESTEROL         -0.52     0.603
BETADIET            1.60      0.111
RETDIET             0.96      0.337
Vitamin_1           2.22      0.027
Vitamin_2           0.18      0.858
BETADIET*Vitamin

R-Sq = 23.38%; R-Sq(adj) = 19.11%

4. The final model

log(plasma beta-carotene) = 2.3061 + 0.002224 AGE - 0.014010 QUETELET + 0.00818 FIBER + 0.000021 BETADIET + 0.000034 BETADIET*Vitamin_1

The coefficient of QUETELET is negative, which indicates that plasma beta-carotene decreases as QUETELET increases; in other words, a person who is overweight would have low plasma beta-carotene. The coefficients of AGE, FIBER, BETADIET and BETADIET*Vitamin_1 are positive, which indicates that plasma beta-carotene increases as these variables increase; in other words, a person who is older and has a high daily intake of dietary beta-carotene and fiber would have higher plasma beta-carotene. There is also an interaction between dietary beta-carotene consumed per day and vitamin use.

5.3.2 Retinol

5.3.2.1 Regression method

1. Check with Scatter Plots

First we check the matrix plot of plasma retinol against all the continuous predictors to see whether there are outliers or data clusters. Each scatter plot of plasma retinol against a single predictor is then analyzed to examine the outliers and the distribution more closely. Outliers are deleted in order to make the scatter symmetrically distributed. The corresponding plots are all appended to the end of this report.

2. Best subset and Stepwise regression to select model variables

Among all the variables, SEX, SMOKSTAT and VITUSE are discrete variables that cannot be used directly in the regression analysis, so we replace them with dummy variables. Let SEX_F take the value 1 when SEX is female and 0 otherwise. Similarly, SMOK_1 and SMOK_2 together describe the three levels of SMOKSTAT, while VITUSE_1 and VITUSE_2 represent the three levels of VITUSE. Their relations are shown in the following two tables.
SMOKSTAT   SMOK_1   SMOK_2
1          0        0
2          0        1
3          1        0

VITUSE     VITUSE_1   VITUSE_2
1          0          0
2          0          1
3          1          0

We then tried stepwise regression, forward selection and backward elimination, and finally selected AGE, QUETELET, ALCOHOL, BETADIET, SEX_F, SMOK_2 and VITUSE_1 as the model variables.

Variables   T-Value   P-value
AGE         3.32      0.001
QUETELET    1.72      0.087
ALCOHOL     3.24      0.001
BETADIET    -2.04     0.043
SEX_F       -1.97     0.049
SMOK_2      1.70      0.089
VITUSE_1    -2.95     0.052

R-Sq = 13.55%; R-Sq(adj) = 11.42%

3. Regression Analysis

(1) Residual check

The residual plots appended to the end of the report show that there are some outliers on the right side which affect the conformity to the normal distribution. In the Versus Fits plot, the red dots are skewed. After inspection, we therefore delete the outliers, run the regression again and draw the residual plots shown in the appendix. This time the normal probability plot is close to a normal distribution. In the Versus Fits plot, the range of the red dots is almost the same for each group, which indicates that the assumption ε ~ N(0, σ²) is acceptable. In addition, the red dots in the Versus Order plot are distributed randomly, so the assumption that the data are independently and identically distributed (i.i.d.) is acceptable.

(2) Regression Model

In the refitted model, the p-values of AGE, BETADIET, SEX_F, SMOK_2 and VITUSE_1 are smaller than 0.05, so these predictors are kept in the final model.

4. The final model

RETPLASMA = 517 + 2.09 AGE - 0.0149 BETADIET - 71.7 SEX_F + 41.9 SMOK_2 - 43.4 VITUSE_1

The coefficients of BETADIET, SEX_F and VITUSE_1 are negative, which indicates that plasma retinol decreases as these variables increase; that is, plasma retinol tends to be lower for people with a higher dietary beta-carotene intake, for females, and for people who do not use vitamins (VITUSE = 3).
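To make the reading of the fitted equation concrete, the sketch below applies the dummy coding from the two tables in this section and evaluates the final RETPLASMA model for two people. The function names and the two example profiles are hypothetical illustrations, not observations from the study.

```python
def dummy_code(sex, smokstat, vituse):
    """Dummy variables following the coding tables in this section."""
    return {
        "SEX_F": 1 if sex == 2 else 0,        # SEX: 1=Male, 2=Female
        "SMOK_1": 1 if smokstat == 3 else 0,  # current smoker
        "SMOK_2": 1 if smokstat == 2 else 0,  # former smoker
        "VITUSE_1": 1 if vituse == 3 else 0,  # no vitamin use
        "VITUSE_2": 1 if vituse == 2 else 0,  # vitamin use, not often
    }


def predict_retplasma(age, betadiet, sex, smokstat, vituse):
    """Evaluate the final regression model for plasma retinol."""
    d = dummy_code(sex, smokstat, vituse)
    return (517 + 2.09 * age - 0.0149 * betadiet
            - 71.7 * d["SEX_F"] + 41.9 * d["SMOK_2"] - 43.4 * d["VITUSE_1"])


# Hypothetical profiles: both aged 50 with 2000 mcg of dietary beta-carotene per day.
female_never_smoker_user = predict_retplasma(50, 2000, sex=2, smokstat=1, vituse=1)
male_former_smoker_nonuser = predict_retplasma(50, 2000, sex=1, smokstat=2, vituse=3)
print(round(female_never_smoker_user, 1), round(male_former_smoker_nonuser, 1))
```

With these inputs the model gives about 520.0 for the first profile and about 590.2 for the second, so the sex and smoking terms dominate the difference between the two.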
The coefficients of AGE and SMOK_2 are positive, which indicates that plasma retinol increases with these variables; that is, a person who is older or a former smoker tends to have higher plasma retinol.

5.3.2.2 General Linear Model
1. Check with Scatter Plots
Because the outliers were already removed for the regression model, it is not necessary to check the scatter plots again here. We simply use the same data as in the regression part.
2. Residual check
From the residual plots of the GLM fit, we can see that the normal probability plot is close to a normal distribution, which indicates that the assumption ε ~ N(0, σ²) is reasonable.
3. General Linear Model
Using the GLM function of MINITAB, we obtain the table below, which lists the variables in the model and their P-values. We then look for the variables whose P-value is lower than 0.15, which indicates that their coefficients are significantly different from 0. On this basis we put AGE, ALCOHOL, and BETADIET into the final model.

Variables     T-Value   P-value
Constant        6.57     0.000
AGE             2.50     0.013
QUETELET        0.90     0.370
CALORIES        0.70     0.486
FAT            -1.43     0.153
FIBER          -0.79     0.428
ALCOHOL         1.77     0.079
CHOLESTEROL     0.92     0.360
BETADIET       -1.66     0.097
RETDIET         0.40     0.688
R-Sq = 14.75%; R-Sq(adj) = 9.52%

4. The final model
RETPLASMA = 510.86 + 1.8777 AGE + 5.002 ALCOHOL - 0.013507 BETADIET
The coefficients of AGE and ALCOHOL are positive, which means that plasma retinol rises as these two variables increase. The coefficient of BETADIET is negative, which indicates that plasma retinol is higher when BETADIET is lower.

6. DISCUSSION
We conclude that there is wide variability in plasma concentrations of these micronutrients in humans, and that much of this variability is associated with dietary habits and personal characteristics. A better understanding of the physiological relationship between some personal characteristics and plasma concentrations of these micronutrients will require further study.

7. APPENDIX
6.1 Plasma Beta-carotene
6.1.1 Regression method:
6.1.1.1 Scatter plots
1. Scatter plots before deleting the outliers:
Figure 1 Scatter plots of Plasma beta-carotene and AGE, QUETELET, CALORIES
Figure 2 Scatter plots of Plasma beta-carotene and FAT, FIBER, ALCOHOL
Figure 3 Scatter plots of Plasma beta-carotene and CHOLESTEROL, BETADIET, RETDIET
2. Scatter plots after deleting the outliers:
Figure 4 Scatter plots of Plasma beta-carotene and AGE, QUETELET, CALORIES after data deletion
Figure 5 Scatter plots of Plasma beta-carotene and FAT, FIBER, ALCOHOL after data deletion
Figure 6 Scatter plots of Plasma beta-carotene and CHOLESTEROL, BETADIET, RETDIET after data deletion
6.1.1.2 Residual check
Figure 7 Residual plots for plasma beta-carotene
Figure 8 Residual plots for log(plasma beta-carotene)
6.1.1.3 Regression
1. Best subset and Stepwise regression to select model variables

Stepwise Regression: Plasma beta-carotene(ng/ml) versus Age, sex_1, ...

Alpha-to-Enter: 0.15  Alpha-to-Remove: 0.15
Response is Plasma beta-carotene(ng/ml) on 17 predictors, with N = 304

Step                      1        2        3        4        5
Constant              353.5    300.9    321.6    341.0    375.5

weight/(height^2)      -6.4     -6.4     -6.2     -6.5     -6.4
T-Value               -3.93    -4.03    -3.92    -4.15    -4.11
P-Value               0.000    0.000    0.000    0.000    0.000

BETADIET                      0.0246   0.0228   0.0211   0.0232
T-Value                         3.77     3.55     3.28     3.57
P-Value                        0.000    0.000    0.001    0.000

vitamin_status_3                          -69      -64      -63
T-Value                                 -3.47    -3.23    -3.17
P-Value                                 0.001    0.001    0.002

smoking_status_3                                   -62      -57
T-Value                                          -2.22    -2.04
P-Value                                          0.027    0.042

FAT                                                       -0.56
T-Value                                                   -1.88
P-Value                                                   0.061

S                       171      168      165      164      163
R-Sq                   4.86     9.16    12.67    14.08    15.09
R-Sq(adj)              4.55     8.55    11.80    12.93    13.67

Stepwise Regression: log beta versus Age, sex_1, ...
Alpha-to-Enter: 0.15  Alpha-to-Remove: 0.15
Response is log beta on 17 predictors, with N = 304

Step                       1         2         3         4         5         6
Constant               2.526     2.563     2.606     2.533     2.615     2.512

weight/(height^2)    -0.0144   -0.0138   -0.0148   -0.0147   -0.0146   -0.0144
T-Value                -4.91     -4.82     -5.21     -5.26     -5.24     -5.21
P-Value                0.000     0.000     0.000     0.000     0.000     0.000

vitamin_status_3                -0.153    -0.139    -0.132    -0.129    -0.135
T-Value                          -4.23     -3.88     -3.73     -3.67     -3.82
P-Value                          0.000     0.000     0.000     0.000     0.000

smoking_status_3                          -0.168    -0.152    -0.140    -0.129
T-Value                                    -3.36     -3.04     -2.82     -2.56
P-Value                                    0.001     0.003     0.005     0.011

BETADIET                                            0.00004   0.00003   0.00003
T-Value                                                2.70      3.11      2.99
P-Value                                               0.007     0.002     0.003

FAT                                                          -0.00132  -0.00116
T-Value                                                         -2.51     -2.16
P-Value                                                         0.013     0.032

Age                                                                      0.0018
T-Value                                                                    1.55
P-Value                                                                   0.123

S                      0.309     0.300     0.295     0.292     0.290     0.289
R-Sq                    7.40     12.61     15.77     17.78     19.47     20.12
R-Sq(adj)               7.10     12.03     14.93     16.68     18.12     18.50

Step                       7         8
Constant               2.382     2.321

weight/(height^2)    -0.0144   -0.0140
T-Value                -5.23     -5.05
P-Value                0.000     0.000

vitamin_status_3      -0.127    -0.124
T-Value                -3.56     -3.48
P-Value                0.000     0.001

smoking_status_3      -0.126    -0.116
T-Value                -2.51     -2.29
P-Value                0.013     0.023

BETADIET             0.00003   0.00002
T-Value                 2.92      1.93
P-Value                0.004     0.054

FAT                 -0.00091  -0.00113
T-Value                -1.65     -1.98
P-Value                0.100     0.048

Age                   0.0025    0.0025
T-Value                 2.00      2.00
P-Value                0.047     0.046

sex_2                  0.089     0.093
T-Value                 1.64      1.73
P-Value                0.102     0.085

FIBER                           0.0063
T-Value                           1.51
P-Value                          0.132

S                      0.288     0.288
R-Sq                   20.84     21.44
R-Sq(adj)              18.96     19.31

Regression Analysis: log beta versus weight/(heig, vitamin_stat, ...
The regression equation is
log beta = 2.32 - 0.0140 weight/(height^2) - 0.124 vitamin_status_3 - 0.116 smoking_status_3 + 0.000025 BETADIET - 0.00113 FAT + 0.00248 Age + 0.0934 sex_2 + 0.00632 FIBER

Predictor                  Coef      SE Coef       T       P
Constant                 2.3210       0.1402   16.56   0.000
weight/(height^2)     -0.014003     0.002772   -5.05   0.000
vitamin_status_3       -0.12355      0.03554   -3.48   0.001
smoking_status_3       -0.11559      0.05048   -2.29   0.023
BETADIET             0.00002494   0.00001291    1.93   0.054
FAT                  -0.0011338    0.0005720   -1.98   0.048
Age                    0.002476     0.001238    2.00   0.046
sex_2                   0.09340      0.05406    1.73   0.085
FIBER                  0.006317     0.004186    1.51   0.132

S = 0.287677   R-Sq = 21.4%   R-Sq(adj) = 19.3%

Analysis of Variance
Source            DF         SS        MS       F       P
Regression         8    6.66382   0.83298   10.07   0.000
Residual Error   295   24.41362   0.08276
Total            303   31.07744

Source              DF    Seq SS
weight/(height^2)    1   2.30126
vitamin_status_3     1   1.61678
smoking_status_3     1   0.98308
BETADIET             1   0.62327
FAT                  1   0.52777
Age                  1   0.19982
sex_2                1   0.22335
FIBER                1   0.18849

6.1.2 GLM Method
6.1.2.1 Scatter plots
Figure 9 Scatter plots of log (Plasma beta-carotene) and AGE, QUETELET, CALORIES
Figure 10 Scatter plots of log (Plasma beta-carotene) and FAT, FIBER, ALCOHOL
Figure 11 Scatter plots of log (Plasma beta-carotene) and CHOLESTEROL, BETADIET, RETDIET
6.1.2.2 Residual check
Figure 12 Residual plots for plasma beta-carotene
6.1.2.3 Model

General Linear Model: log beta versus Sex, Smoking status, Vitamin

Factor           Type    Levels   Values
Sex              fixed        2   1, 2
Smoking status   fixed        3   1, 2, 3
Vitamin          fixed        3   1, 2, 3

Analysis of Variance for log beta, using Adjusted SS for Tests
Source              DF     Seq SS     Adj SS    Adj MS       F       P
Age                  1    0.40150    0.23897   0.23897    2.90   0.090
Sex                  1    0.79890    0.17571   0.17571    2.13   0.145
Smoking status       2    0.66772    0.48616   0.24308    2.95   0.054
weight/(height^2)    1    2.63996    1.99036   1.99036   24.18   0.000
Vitamin              2    1.01660    0.06828   0.03414    0.41   0.661
CALORIES             1    0.00009    0.06235   0.06235    0.76   0.385
FAT                  1    0.15930    0.00784   0.00784    0.10   0.758
FIBER                1    0.63890    0.18163   0.18163    2.21   0.139
ALCOHOL              1    0.01623    0.00839   0.00839    0.10   0.750
CHOLESTEROL          1    0.00013    0.02228   0.02228    0.27   0.603
BETADIET             1    0.26242    0.21037   0.21037    2.56   0.111
RETDIET              1    0.06539    0.07630   0.07630    0.93   0.337
Vitamin*BETADIET     2    0.54436    0.54436   0.27218    3.31   0.038
Error              287   23.62834   23.62834   0.08233
Total              303   30.83984

S = 0.286930   R-Sq = 23.38%   R-Sq(adj) = 19.11%

Term                    Coef    SE Coef       T       P
Constant              2.3061     0.1302   17.71   0.000
Age                 0.002224   0.001306    1.70   0.090
weight/(heig       -0.014010   0.002849   -4.92   0.000
CALORIES           -0.000082   0.000094   -0.87   0.385
FAT                 0.000447   0.001450    0.31   0.758
FIBER               0.008180   0.005508    1.49   0.139
ALCOHOL             0.001381   0.004325    0.32   0.750
CHOLESTEROL        -0.000109   0.000210   -0.52   0.603
BETADIET            0.000021   0.000013    1.60   0.111
RETDIET             0.000033   0.000034    0.96   0.337
BETADIET*Vitamin
  1                 0.000034   0.000015    2.22   0.027
  2                 0.000003   0.000017    0.18   0.858

6.2 Retinol
6.2.1 Regression Method
6.2.1.1 Scatter plots
Fig 13 Matrix plot of all variables
Fig 14 Scatter plot of retinol plasma with AGE
Because the plot is symmetrically distributed, we do not need to transform the data or delete any points.
Fig 15 Scatter plot of retinol plasma with QUETELET
Because the plot is skewed, we delete the 5 outliers on the right side.
Fig 16 Scatter plot of retinol plasma with CALORIES
Two points on the right side are deleted.
Fig 17 Scatter plot of retinol plasma with FAT
Two points on the right side are deleted.
Fig 18 Scatter plot of retinol plasma with FIBER
Five points on the right side are deleted.
Fig 19 Scatter plot of retinol plasma with ALCOHOL
Two points on the right side are deleted.
Fig 20 Scatter plot of retinol plasma with CHOLESTEROL
Four points on the right side are deleted.
Fig 21 Scatter plot of retinol plasma with BETADIET
Two points on the right side are deleted.
Fig 22 Scatter plot of retinol plasma with RETDIET

6.2.1.2 Regression
1. Best subset and Stepwise regression to select model variables

Stepwise Regression: RETPLASMA versus AGE, QUETELET, ...

Forward selection.  Alpha-to-Enter: 0.25
Response is RETPLASMA on 14 predictors, with N = 292

Step              1       2       3       4       5       6
Constant      450.1   428.1   517.7   607.2   597.4   615.2

AGE            3.08    3.03    2.58    2.15    2.11    2.16
T-Value        3.77    3.79    3.10    2.49    2.45    2.52
P-Value       0.000   0.000   0.002   0.013   0.015   0.012

ALCOHOL                 9.9     8.5     8.6     7.8     8.3
T-Value                3.56    2.95    3.01    2.68    2.85
P-Value               0.000   0.003   0.003   0.008   0.005

SEX_F                           -73     -92     -89     -97
T-Value                       -1.97   -2.39   -2.33   -2.52
P-Value                       0.050   0.018   0.020   0.012

FAT                                   -0.70   -0.75   -0.76
T-Value                               -1.72   -1.85   -1.88
P-Value                               0.086   0.066   0.062

SMOK_2                                           41      42
T-Value                                        1.69    1.71
P-Value                                       0.092   0.088

VITUSE_1                                                -40
T-Value                                               -1.63
P-Value                                               0.103

S               205     201     200     199     198     198
R-Sq           4.67    8.67    9.88   10.80   11.68   12.50
R-Sq(adj)      4.34    8.03    8.94    9.56   10.14   10.66
Mallows Cp     22.3    11.3     9.4     8.4     7.5     6.8

Step                7         8
Constant        519.4     523.8

AGE              2.12      2.31
T-Value          2.48      2.69
P-Value         0.014     0.008

ALCOHOL           9.1       9.6
T-Value          3.10      3.25
P-Value         0.002     0.001

SEX_F             -95       -90
T-Value         -2.48     -2.35
P-Value         0.014     0.020

FAT             -0.79     -0.64
T-Value         -1.96     -1.56
P-Value         0.051     0.119

SMOK_2             41        44
T-Value          1.67      1.79
P-Value         0.095     0.074

VITUSE_1          -44       -48
T-Value         -1.79     -1.95
P-Value         0.075     0.052

QUETELET          3.8       3.8
T-Value          1.76      1.78
P-Value         0.079     0.076

BETADIET                -0.0145
T-Value                   -1.66
P-Value                   0.097

S                 197       196
R-Sq            13.45     14.29
R-Sq(adj)       11.32     11.87
Mallows Cp        5.7       5.0

8 predictors are chosen as below: AGE, ALCOHOL, SEX_F, FAT, SMOK_2, VITUSE_1, QUETELET, and BETADIET.

Stepwise Regression: RETPLASMA versus AGE, QUETELET, ...

Backward elimination.  Alpha-to-Remove: 0.1
Response is RETPLASMA on 14 predictors, with N = 292

Step                1         2         3         4         5         6
Constant        537.0     537.1     532.6     528.9     546.0     533.6

AGE              2.34      2.34      2.37      2.41      2.22      2.22
T-Value          2.54      2.54      2.60      2.68      2.57      2.57
P-Value         0.012     0.012     0.010     0.008     0.011     0.011

QUETELET          3.8       3.8       3.8       3.9       4.0       4.0
T-Value          1.71      1.73      1.76      1.77      1.82      1.88
P-Value         0.089     0.085     0.080     0.077     0.069     0.062

CALORIES        0.043     0.044     0.044     0.052
T-Value          0.62      0.63      0.64      0.79
P-Value         0.536     0.530     0.525     0.430

FAT             -1.30     -1.30     -1.31     -1.35     -0.58     -0.62
T-Value         -1.18     -1.21     -1.23     -1.27     -1.37     -1.51
P-Value         0.237     0.228     0.221     0.205     0.173     0.131

FIBER            -3.3      -3.3      -3.2      -3.3      -1.4
T-Value         -0.83     -0.85     -0.84     -0.86     -0.46
P-Value         0.406     0.396     0.403     0.393     0.646

ALCOHOL           9.3       9.3       9.3       9.1       9.7       9.8
T-Value          2.95      2.97      2.97      2.96      3.29      3.32
P-Value         0.003     0.003     0.003     0.003     0.001     0.001

CHOLESTEROL      0.00
T-Value          0.01
P-Value         0.990

BETADIET      -0.0128   -0.0127   -0.0127   -0.0128   -0.0134   -0.0155
T-Value         -1.26     -1.27     -1.27     -1.29     -1.35     -1.77
P-Value         0.209     0.204     0.204     0.198     0.178     0.078

RETDIET         0.012     0.012     0.012
T-Value          0.35      0.36      0.36
P-Value         0.723     0.721     0.721

SEX_F             -90       -90       -90       -89       -89       -88
T-Value         -2.27     -2.32     -2.32     -2.30     -2.32     -2.30
P-Value         0.024     0.021     0.021     0.022     0.021     0.022

SMOK_1             -7        -7
T-Value         -0.19     -0.19
P-Value         0.851     0.851

SMOK_2             43        43        45        45        45        44
T-Value          1.68      1.68      1.84      1.84      1.83      1.82
P-Value         0.095     0.094     0.067     0.067     0.068     0.070

VITUSE_1          -58       -58       -59       -58       -62       -61
T-Value         -2.04     -2.04     -2.08     -2.08     -2.22     -2.20
P-Value         0.043     0.042     0.038     0.039     0.028     0.028

VITUSE_2          -29       -29       -30       -29       -30       -30
T-Value         -0.97     -0.97     -0.99     -0.98     -1.02     -1.02
P-Value         0.333     0.331     0.325     0.329     0.310     0.306

S                 198       197       197       197       197       196
R-Sq            14.91     14.91     14.90     14.86     14.67     14.61
R-Sq(adj)       10.61     10.93     11.24     11.52     11.63     11.88
Mallows Cp       15.0      13.0      11.0       9.2       7.8       6.0

Step                7         8
Constant        523.8     450.6

AGE              2.31      2.73
T-Value          2.69      3.32
P-Value         0.008     0.001

QUETELET          3.8       3.7
T-Value          1.78      1.72
P-Value         0.076     0.087

FAT             -0.64
T-Value         -1.56
P-Value         0.119

ALCOHOL           9.6       9.6
T-Value          3.25      3.24
P-Value         0.001     0.001

BETADIET      -0.0145   -0.0174
T-Value         -1.66     -2.04
P-Value         0.097     0.043

SEX_F             -90       -73
T-Value         -2.35     -1.97
P-Value         0.020     0.049

SMOK_2             44        42
T-Value          1.79      1.70
P-Value         0.074     0.089

VITUSE_1          -48       -49
T-Value         -1.95     -1.95
P-Value         0.052     0.052

S                 196       197
R-Sq            14.29     13.55
R-Sq(adj)       11.87     11.42
Mallows Cp        5.0       5.4

7 predictors are chosen as below: AGE, QUETELET, ALCOHOL, BETADIET, SEX_F, SMOK_2, VITUSE_1.

Best Subsets Regression: RETPLASMA versus AGE, QUETELET, ...

Response is RETPLASMA
Candidate predictors: AGE, QUETELET, CALORIES, FAT, FIBER, ALCOHOL, CHOLESTEROL, BETADIET, RETDIET, SEX_F, SMOK_1, SMOK_2, VITUSE_1, VITUSE_2
[Predictor-inclusion columns (X marks) not reproduced]

Vars   R-Sq   R-Sq(adj)   Mallows Cp        S
   1    4.7         4.3         22.3   204.63
   1    4.5         4.1         23.0   204.86
   2    8.7         8.0         11.3   200.64
   2    7.2         6.5         16.2   202.29
   3    9.9         8.9          9.4   199.65
   3    9.6         8.6         10.3   199.97
   4   10.8         9.6          8.4   198.97
   4   10.8         9.5          8.5   199.03
   5   11.7        10.2          7.4   198.29
   5   11.7        10.1          7.5   198.33
   6   12.7        10.8          6.3   197.57
   6   12.6        10.8          6.4   197.59
   7   13.5        11.4          5.4   196.92
   7   13.5        11.3          5.7   197.03
   8   14.3        11.9          5.0   196.42
   8   14.0        11.6          5.9   196.74
   9   14.6        11.9          6.0   196.40
   9   14.4        11.6          6.8   196.69
  10   14.7        11.6          7.8   196.67
  10   14.7        11.6          7.8   196.68
  11   14.9        11.5          9.2   196.81
  11   14.8        11.4          9.4   196.90
  12   14.9        11.2         11.0   197.11
  12   14.9        11.2         11.1   197.15
  13   14.9        10.9         13.0   197.46
  13   14.9        10.9         13.0   197.47
  14   14.9        10.6         15.0   197.81

6 variables are selected as below: AGE, QUETELET, ALCOHOL, BETADIET, SEX_F, VITUSE_1.

3. Model

Regression Analysis: RETPLASMA versus AGE, QUETELET, ...
The regression equation is
RETPLASMA = 451 + 2.73 AGE + 3.69 QUETELET + 9.56 ALCOHOL - 0.0174 BETADIET - 72.8 SEX_F + 41.6 SMOK_2 - 48.6 VITUSE_1

Predictor        Coef    SE Coef       T       P
Constant       450.56      85.37    5.28   0.000
AGE            2.7254     0.8220    3.32   0.001
QUETELET        3.691      2.149    1.72   0.087
ALCOHOL         9.556      2.947    3.24   0.001
BETADIET    -0.017437   0.008559   -2.04   0.043
SEX_F          -72.77      36.85   -1.97   0.049
SMOK_2          41.60      24.40    1.70   0.089
VITUSE_1       -48.64      24.89   -1.95   0.052

S = 196.917   R-Sq = 13.5%   R-Sq(adj) = 11.4%

Analysis of Variance
Source            DF         SS       MS      F       P
Regression         7    1725637   246520   6.36   0.000
Residual Error   284   11012454    38776
Total            291   12738091

Source     DF   Seq SS
AGE         1   594668
QUETELET    1    56352
ALCOHOL     1   563668
BETADIET    1   119721
SEX_F       1   134397
SMOK_2      1   108714
VITUSE_1    1   148116

Unusual Observations
Obs    AGE   RETPLASMA     Fit   SE Fit   Residual   St Resid
 15   64.0      1249.0   857.3     56.7      391.7      2.08RX
 17   75.0      1262.0   810.6     40.3      451.4      2.34R
 19   57.0      1727.0   722.1     36.6     1004.9      5.19R
 45   69.0       321.0   757.7     41.4     -436.7     -2.27R
 69   33.0       227.0   693.3     46.6     -466.3     -2.44R
 74   50.0      1031.0   857.0     63.1      174.0      0.93 X
127   70.0      1139.0   650.2     26.2      488.8      2.50R
131   46.0       476.0   790.0     60.2     -314.0     -1.67 X
187   49.0      1443.0   708.3     32.7      734.7      3.78R
256   33.0       194.0   604.2     45.2     -410.2     -2.14R
269   36.0      1102.0   507.9     26.2      594.1      3.04R
272   54.0      1517.0   602.3     20.8      914.7      4.67R
275   41.0      1193.0   794.4     64.8      398.6      2.14RX
286   66.0       986.0   593.8     35.9      392.2      2.03R

R denotes an observation with a large standardized residual.
X denotes an observation whose X value gives it large leverage.

After we delete the outliers, we do regression again and get the following result.

Regression Analysis: RETPLASMA versus AGE, QUETELET, ...
The regression equation is
RETPLASMA = 517 + 2.09 AGE + 1.85 QUETELET + 5.23 ALCOHOL - 0.0149 BETADIET - 71.7 SEX_F + 41.9 SMOK_2 - 43.4 VITUSE_1

Predictor        Coef    SE Coef       T       P
Constant       516.78      69.91    7.39   0.000
AGE            2.0923     0.6679    3.13   0.002
QUETELET        1.847      1.758    1.05   0.295
ALCOHOL         5.228      2.688    1.94   0.053
BETADIET    -0.014933   0.006885   -2.17   0.031
SEX_F          -71.71      32.27   -2.22   0.027
SMOK_2          41.85      19.75    2.12   0.035
VITUSE_1       -43.42      20.30   -2.14   0.033

S = 155.146   R-Sq = 11.9%   R-Sq(adj) = 9.6%

Analysis of Variance
Source            DF        SS       MS      F       P
Regression         7    875521   125074   5.20   0.000
Residual Error   270   6498945    24070
Total            277   7374466

Source     DF   Seq SS
AGE         1   345437
QUETELET    1     6545
ALCOHOL     1   115308
BETADIET    1    85087
SEX_F       1   102453
SMOK_2      1   110527
VITUSE_1    1   110163

4. Residual plots
Fig 22 Residual plots of retinol plasma
Fig 23 Residual plots of retinol plasma after deleting the outliers

6.2.2 GLM Method

General Linear Model: RETPLASMA versus SEX, SMOKSTAT, VITUSE

Factor     Type    Levels   Values
SEX        fixed        2   1, 2
SMOKSTAT   fixed        3   1, 2, 3
VITUSE     fixed        3   1, 2, 3

Analysis of Variance for RETPLASMA, using Adjusted SS for Tests
Source         DF    Seq SS    Adj SS   Adj MS      F       P
AGE             1    345437    137560   137560   5.71   0.018
SEX             1    132454    139672   139672   5.79   0.017
SMOKSTAT        2    169037    157668    78834   3.27   0.040
QUETELET        1      4344     14227    14227   0.59   0.443
VITUSE          2     61941    103465    51733   2.15   0.119
CALORIES        1     48196      9072     9072   0.38   0.540
FIBER           1     44141     17672    17672   0.73   0.393
FAT             1     84885     50081    50081   2.08   0.151
ALCOHOL         1     60373     81431    81431   3.38   0.067
CHOLESTEROL     1     14008     20773    20773   0.86   0.354
BETADIET        1     61604     59894    59894   2.48   0.116
RETDIET         1      7689      7689     7689   0.32   0.573
Error         263   6340355   6340355    24108
Total         277   7374466

S = 155.267   R-Sq = 14.02%   R-Sq(adj) = 9.45%

Term               Coef    SE Coef       T       P
Constant         527.61      76.50    6.90   0.000
AGE              1.7782     0.7444    2.39   0.018
QUETELET          1.400      1.823    0.77   0.443
CALORIES        0.03414    0.05565    0.61   0.540
FIBER            -2.679      3.129   -0.86   0.393
FAT             -1.2532     0.8695   -1.44   0.151
ALCOHOL           5.193      2.826    1.84   0.067
CHOLESTEROL      0.1183     0.1275    0.93   0.354
BETADIET      -0.012755   0.008092   -1.58   0.116
RETDIET         0.01527    0.02704    0.56   0.573

Unusual Observations for RETPLASMA
Obs   RETPLASMA      Fit   SE Fit   Residual   St Resid
 10      935.00   629.25    31.57     305.75       2.01 R
 29     1002.00   686.15    50.25     315.85       2.15 R
 53      247.00   553.66    32.91    -306.66      -2.02 R
 69      187.00   524.09    34.32    -337.09      -2.23 R
107      927.00   516.64    48.70     410.36       2.78 R
114      917.00   603.00    31.11     314.00       2.06 R
203      879.00   538.19    34.61     340.81       2.25 R
254      792.00   484.88    31.34     307.12       2.02 R
256      995.00   638.58    26.90     356.42       2.33 R
261      649.00   591.68    64.80      57.32       0.41 X
274      216.00   554.43    27.29    -338.43      -2.21 R

R denotes an observation with a large standardized residual.
X denotes an observation whose X value gives it large leverage.

So:
RETPLASMA = 527.61 + 1.7782 AGE + 5.193 ALCOHOL - 0.012755 BETADIET

To look for interactions among the predictors, we ran many trials; the alternative solution tests one candidate, VITUSE*RETDIET.
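As a quick consistency check on the main-effects GLM output above, S, R-Sq and R-Sq(adj) can be recomputed from the Error and Total rows of the ANOVA table. The Python sketch below applies the standard formulas; the function and variable names are our own, and the sums of squares and degrees of freedom are copied from the table.

```python
import math

# Error and Total rows copied from the ANOVA table of the main-effects GLM.
SS_ERROR, DF_ERROR = 6340355, 263
SS_TOTAL, DF_TOTAL = 7374466, 277


def fit_summary(ss_error, df_error, ss_total, df_total):
    """Return (S, R-Sq in %, R-Sq(adj) in %) from error and total sums of squares."""
    ms_error = ss_error / df_error           # mean square error
    s = math.sqrt(ms_error)                  # residual standard deviation
    r_sq = 100 * (1 - ss_error / ss_total)
    r_sq_adj = 100 * (1 - ms_error / (ss_total / df_total))
    return s, r_sq, r_sq_adj


s, r_sq, r_sq_adj = fit_summary(SS_ERROR, DF_ERROR, SS_TOTAL, DF_TOTAL)
print(f"S = {s:.3f}, R-Sq = {r_sq:.2f}%, R-Sq(adj) = {r_sq_adj:.2f}%")
```

The recomputed values agree with the reported S = 155.267, R-Sq = 14.02% and R-Sq(adj) = 9.45%. The wide gap between R-Sq and R-Sq(adj) reflects the number of terms in the model relative to the variation they explain.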
Another solution:

General Linear Model: RETPLASMA versus SEX, SMOKSTAT, VITUSE

Factor     Type    Levels   Values
SEX        fixed        2   1, 2
SMOKSTAT   fixed        3   1, 2, 3
VITUSE     fixed        3   1, 2, 3

Analysis of Variance for RETPLASMA, using Adjusted SS for Tests
Source            DF    Seq SS    Adj SS   Adj MS      F       P
AGE                1    345437    150885   150885   6.26   0.013
SEX                1    132454    117093   117093   4.86   0.028
SMOKSTAT           2    169037    149864    74932   3.11   0.046
QUETELET           1      4344     19396    19396   0.81   0.370
VITUSE             2     61941     62854    31427   1.30   0.273
CALORIES           1     48196     11715    11715   0.49   0.486
FIBER              1     44141     15166    15166   0.63   0.428
FAT                1     84885     49577    49577   2.06   0.153
ALCOHOL            1     60373     75105    75105   3.12   0.079
CHOLESTEROL        1     14008     20296    20296   0.84   0.360
BETADIET           1     61604     66667    66667   2.77   0.097
RETDIET            1      7689      3889     3889   0.16   0.688
VITUSE*RETDIET     2     53358     53358    26679   1.11   0.332
Error            261   6286997   6286997    24088
Total            277   7374466

S = 155.203   R-Sq = 14.75%   R-Sq(adj) = 9.52%

Term               Coef    SE Coef       T       P
Constant         510.86      77.70    6.57   0.000
AGE              1.8777     0.7502    2.50   0.013
QUETELET          1.646      1.834    0.90   0.370
CALORIES        0.03893    0.05582    0.70   0.486
FIBER            -2.490      3.138   -0.79   0.428
FAT             -1.2604     0.8786   -1.43   0.153
ALCOHOL           5.002      2.833    1.77   0.079
CHOLESTEROL      0.1174     0.1279    0.92   0.360
BETADIET      -0.013507   0.008119   -1.66   0.097
RETDIET         0.01103    0.02745    0.40   0.688
RETDIET*VITUSE
  1             0.02579    0.03142    0.82   0.413
  2            -0.04709    0.03182   -1.48   0.140

Unusual Observations for RETPLASMA
Obs   RETPLASMA      Fit   SE Fit   Residual   St Resid
  7      825.00   692.87    70.09     132.13       0.95 X
 10      935.00   625.32    32.15     309.68       2.04 R
 29     1002.00   696.28    51.39     305.72       2.09 R
 53      247.00   570.71    37.81    -323.71      -2.15 R
 69      187.00   528.78    34.98    -341.78      -2.26 R
107      927.00   504.35    50.90     422.65       2.88 R
114      917.00   596.38    32.25     320.62       2.11 R
203      879.00   544.27    34.98     334.73       2.21 R
256      995.00   627.73    30.58     367.27       2.41 R
261      649.00   551.43    70.25      97.57       0.70 X
265      946.00   635.52    25.89     310.48       2.03 R
274      216.00   566.66    32.85    -350.66      -2.31 R

R denotes an observation with a large standardized residual.
X denotes an observation whose X value gives it large leverage.

[Residual Plots for RETPLASMA: Normal Probability Plot, Versus Fits, Histogram, Versus Order]

The VITUSE*RETDIET term is not significant (P = 0.332), so there are no interactions.
RETPLASMA = 510.86 + 1.8777 AGE + 5.002 ALCOHOL - 0.013507 BETADIET