MATH/STAT 352: CORRELATION AND REGRESSION

BIVARIATE DATA: CORRELATION AND REGRESSION

Two variables of interest: X, Y.

GOAL:
Quantify the association between X and Y: correlation.
Predict the value of Y from the value of X: regression.

EXAMPLES: (height, weight), (years of education, salary), (hours of studying, exam score), (SAT score, GPA), (chemical reaction time, temperature), (rainfall, runoff volume), (demand, price), etc.

BIVARIATE DATA: PAIRS (X, Y): $(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)$.
$(x_i, y_i)$ is the ith observation: the values of X and Y on the ith "subject".

Correlation studies: study the type and amount of association between X and Y.
Regression studies: aim to predict Y from X by constructing a simple equation relating Y to X.

CORRELATION

GRAPHICAL REPRESENTATION OF BIVARIATE DATA: SCATTER PLOT
Plot the observations $(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)$ as points on the plane.

Types of association (relationship) between variables: positive, negative, none.

Positive association: Two variables are positively associated if large values of one tend to occur with large values of the other, and small values of one tend to occur with small values of the other.
Example: Height and weight are usually positively associated.
[Scatter plot: positive association]

CORRELATION, CONTD.

Negative association: Two variables are negatively associated if large values of one tend to occur with small values of the other. The variables tend to "move in opposite directions".
Example: High demand often occurs with low price.
[Scatter plot: negative association]

No association: If there is no association, the points in the scatter plot show no pattern.
[Scatter plot: no association]

PEARSON CORRELATION COEFFICIENT

The strength of association is measured by the correlation coefficient.
Population: $\rho$ (rho). Sample: $r_{xy}$.

Data: $(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)$.

Sample statistics:
Sample means: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$, $\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$.
Sample standard deviations: $s_x$ and $s_y$ (computed with divisor $n - 1$).
Sample correlation coefficient:

$$r_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{(n-1)\, s_x s_y}$$

The correlation coefficient measures the strength of LINEAR association.

PROPERTIES OF THE SAMPLE CORRELATION COEFFICIENT $r_{xy}$

$r_{xy} > 0$ indicates positive association ($\rho_{xy} > 0$) between X and Y.
$r_{xy} < 0$ indicates negative association ($\rho_{xy} < 0$) between X and Y.
$r_{xy} \approx 0$ indicates no linear association ($\rho_{xy} \approx 0$) between X and Y.
$-1 \le r_{xy} \le 1$ (and likewise $-1 \le \rho_{xy} \le 1$). The closer $|r_{xy}|$ (or $|\rho_{xy}|$) is to 1, the stronger the linear relationship between X and Y.

Computational formula for r: given the sample statistics, we can compute r as

$$r_{xy} = \frac{\sum_{i=1}^{n} x_i y_i - n\,\bar{x}\,\bar{y}}{(n-1)\, s_x s_y}$$

CORRELATION COEFFICIENT AND ASSOCIATION

[Scatter plots. Panel labels: strong association, r = 0.9; moderate association, r = 0.75; r = 0.5; weak association; no association; r = 0.28; r = -0.3; almost perfect association, but not linear, r small.]

CORRELATION, CONTD.

Correlation does not imply CAUSATION! Watch out for hidden (lurking) variables.

Example: Study of fires. X = amount of damage, Y = number of firefighters, $r_{xy} \approx 0.85$. The more firefighters, the more damage? Hidden variable: the size of the fire.

EXAMPLE

In a study of income and savings, data were collected from 10 households. Both income and savings are reported in thousands of dollars in the following table. Find the correlation coefficient between income and savings.

Income:   25   28   35   39   44   48   52   65   55   72
Savings:  0.5  0.0  0.8  1.6  1.8  3.1  4.3  4.6  3.5  7.2

Solution: Let X = income, Y = savings. Summary statistics:
$\sum x_i = 463$, $\sum x_i^2 = 23533$, $\sum y_i = 27.4$, $\sum y_i^2 = 120.04$, $\sum x_i y_i = 1564.4$.

From these, $\bar{x} = 46.3$, $\bar{y} = 2.74$, $s_x \approx 15.261$, $s_y \approx 2.2352$, so

$$r_{xy} = \frac{1564.4 - 10(46.3)(2.74)}{9\,(15.261)(2.2352)} \approx \frac{295.78}{307.0} \approx 0.963$$

There is a strong positive association between family income and savings.

TESTING HYPOTHESES ABOUT SIGNIFICANT CORRELATION

Test at significance level $\alpha$.

STEP 1. $H_0: \rho = 0$ vs. $H_a: \rho \ne 0$, or $H_a: \rho > 0$, or $H_a: \rho < 0$.

STEP 2. Test statistic:

$$t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}$$

Under $H_0$, the test statistic has a t distribution with $n - 2$ df.

STEP 3. Critical value: $t_{\alpha}$ for a one-sided test, $t_{\alpha/2}$ for a two-sided test.

STEP 4. DECISION: the critical (rejection) region depends on $H_a$.
$H_a: \rho \ne 0$: reject $H_0$ if $|t| > t_{\alpha/2}$.
$H_a: \rho > 0$: reject $H_0$ if $t > t_{\alpha}$.
$H_a: \rho < 0$: reject $H_0$ if $t < -t_{\alpha}$.

EXAMPLE: INCOME AND SAVINGS
Test whether there is a significant positive linear relationship between income and savings. Use a significance level of 5%.

STEP 1. $H_0: \rho = 0$ vs. $H_a: \rho > 0$ (positive relationship).

STEP 2. Test statistic, with $r \approx 0.963$ computed from the data and $n = 10$:

$$t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}} = \frac{0.963\sqrt{8}}{\sqrt{1-0.963^2}} \approx 10.1$$

Under $H_0$, the test statistic has a t distribution with 8 df.

STEP 3. Critical value: one-sided test, $t_{0.05} = 1.86$.

STEP 4. DECISION: for $H_a: \rho > 0$, reject $H_0$ if $t > 1.86$. Here $t = 10.1 > 1.86$, so we reject $H_0$.

Answer: There is a significant positive correlation between income and savings.

EXAMPLE: ERUPTIONS OF THE OLD FAITHFUL GEYSER

When the Old Faithful geyser in Yellowstone National Park erupts, the duration of the eruption and the time interval until the next eruption are recorded. The data for 8 eruptions produced the sample correlation coefficient r = 0.926. Test whether the correlation is significantly different from zero. Use a significance level of 0.01.

STEP 1. $H_0: \rho = 0$ vs. $H_a: \rho \ne 0$.

STEP 2. Test statistic:

$$t = \frac{0.926\sqrt{6}}{\sqrt{1-0.926^2}} \approx 6.008$$

STEP 3. Critical value: two-sided test, df = 6, $t_{0.005} = 3.707$.

STEP 4. $|t| = 6.008 > 3.707$, so reject $H_0$.

Thus, there is a significant linear relationship between the duration of an eruption and the time until the next eruption.
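The two worked examples can be checked numerically. The following is a minimal Python sketch, not part of the original slides: the helper names `pearson_r` and `t_statistic` are hypothetical, the correlation is computed with the equivalent sums-of-products form of the formula, and the critical values 1.86 and 3.707 are simply copied from the examples rather than looked up by the code.

```python
import math


def pearson_r(xs, ys):
    """Sample correlation coefficient from sums of squares and cross-products."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    # Corrected sums: these equal (n - 1) * s_x^2, (n - 1) * s_y^2, and
    # the numerator sum((x - xbar) * (y - ybar)), respectively.
    sxx = sum(x * x for x in xs) - sum_x ** 2 / n
    syy = sum(y * y for y in ys) - sum_y ** 2 / n
    sxy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    return sxy / math.sqrt(sxx * syy)


def t_statistic(r, n):
    """t = r * sqrt(n - 2) / sqrt(1 - r^2); t distribution with n - 2 df under H0."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)


# Income and savings data (thousands of dollars) from the example.
income = [25, 28, 35, 39, 44, 48, 52, 65, 55, 72]
savings = [0.5, 0.0, 0.8, 1.6, 1.8, 3.1, 4.3, 4.6, 3.5, 7.2]

r_income = pearson_r(income, savings)
t_income = t_statistic(r_income, len(income))
# One-sided test, alpha = 0.05, 8 df: critical value taken from the example.
reject_income = t_income > 1.86

# Old Faithful: only r and n are given, so the test starts from r directly.
t_geyser = t_statistic(0.926, 8)
# Two-sided test, alpha = 0.01, 6 df: critical value taken from the example.
reject_geyser = abs(t_geyser) > 3.707

print(f"income/savings: r = {r_income:.3f}, t = {t_income:.2f}, reject H0: {reject_income}")
print(f"Old Faithful:   t = {t_geyser:.3f}, reject H0: {reject_geyser}")
```

Because the sketch keeps r unrounded, the income and savings statistic comes out slightly above the 10.1 obtained with r rounded to 0.963. The decision is the same in both cases.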