SPS 580 Lecture 2: Bivariate Chi Square

I. BIVARIATE ANALYSIS: CROSSTABULATION

ASSIGNMENT #1 ASKED YOU: For each dependent variable . . . Comment on the findings from the univariate percentage table and/or column chart and implications they might have for the test of your theory . . .

Here’s what I was thinking . . .

IDEA: People from Chicago are more likely to be in favor of a tax on Lakefront beach use for non-City residents
THEORY: Place of Residence --> Opinion on Beach Tax

    FREQUENCIES VARIABLES=bchtax.

[Column chart: Opinion on Beach Tax. Categories: Agree Strongly, Agree Somewhat, Disagree Somewhat, Disagree Strongly. Bar values: 37%, 21%, 21%, 20%.]

The majority oppose a beach tax for non-City residents
Overall, 41% of the general population favor a beach tax

IF THE THEORY IS RIGHT WE EXPECT SOMETHING LIKE THIS . . . (ABSTRACTION ALERT)

A BIVARIATE PERCENTAGE TABLE

    Place of residence     "Favor" Beach Use Tax
    Chicago                50%
    Suburban Cook Co       40%
    Collar Counties        30%
    Total                  41%     (“MARGINAL” TOTAL)

Frequencies = marginal distribution

II. BIVARIATE CROSSTABULATION: REAL DATA

A. RECODE X and Y variables so categories = the desired end product

    RECODE bchtax (1 thru 2=1) (3 thru 4=2) (ELSE=9) INTO bchtax2.
    VARIABLE LABELS bchtax2 'dichotomy'.
    VALUE LABELS bchtax2 1 'Favor tax' 2 'Oppose tax' 9 'other'.
    MISSING VALUES bchtax2 (9).

    RECODE region (1=1) (2=2) (3 thru 7=3) (ELSE=9) INTO region3.
    VARIABLE LABELS region3 'region recoded'.
    VALUE LABELS region3 1 'Chicago' 2 'Suburban Cook Co' 3 'Collar Counties'.
    MISSING VALUES region3 (9).

B. PERFORM A CROSSTABULATION OF X and Y VARIABLES

    ANALYZE / DESCRIPTIVE STATISTICS / CROSSTABS / ROWS region3 / COLUMNS bchtax2 / CELLS row percentage / CONTINUE / OK

region3 region recoded * bchtax2 dichotomy Crosstabulation
% within region3 region recoded

                              bchtax2 dichotomy
    region3 region recoded    1.00 Favor tax    2.00 Oppose tax    Total
    1.00 Chicago              54.0%             46.0%              100.0%
    2.00 Suburban Cook Co     33.1%             66.9%              100.0%
    3.00 Collar Counties      37.0%             63.0%              100.0%
    Total                     41.8%             58.2%              100.0%

Rows = categories of place of residence (X variable)
Columns = categories of opinion (Y variable)
Cells = %s add to 100% across each row

SPSS HINT: When you COMPUTE new variables they appear at the end of the list of SPSS variables, not in alphabetical order

A BIVARIATE PERCENTAGE TABLE (rows = categories of X, column = category of Y)

    Place of residence     "Favor" Beach Use Tax
    Chicago                54%     Conditional percent %Y|X
    Suburban Cook Co       33%     Conditional percent %Y|X
    Collar Counties        37%     Conditional percent %Y|X
    Total                  42%     Marginal total percent %Y

[PQ bivariate column chart: "Favor" Beach Use Tax by place of residence. Chicago 54%, Suburban Cook Co 33%, Collar Counties 37%. Y axis = %Y given the value of X (conditional distribution, %Y|X). X axis = independent variable, place of residence (categories of X).]

MAKING A PRESENTATION QUALITY BIVARIATE GRAPH . . .
    Highlight the bivariate percentage table
    INSERT / COLUMN / 2D-COLUMN
    Delete gridlines
    Delete legend
    Add data labels
    Re-set y-axis metric
    Delete x-axis tick marks
    Resize fonts

The theory is right . . . people from Chicago are more likely to support the beach tax (54%) than people from the suburbs (33% and 37%).

Is it also right to say that people from the Collar Counties are more likely (37%) to support the beach tax than people who live in Suburban Cook County (33%) ??? Well . . . It doesn’t make much sense, and it is probably not a statistically significant difference.

III. Statistical Significance I

A. When we talk about statistical significance, we are talking about significance of differences

B. When you say a difference is not statistically significant, that means you think . . .
    a. It’s too small (not just “small”)
    b. Smaller than what?

C. The pattern in the data could have arisen by chance, so don’t get all worked up about it

D. OPERATIONAL DEFINITION . . . “Chance” means the PERCENT in each place (each category of X) is the same as the TOTAL PERCENT plus or minus some sampling error

Observed Data

    Place of residence     "Favor" Beach Use Tax    # of people in the survey
    Chicago                54%                      1,147
    Suburban Cook Co       33%                      1,070
    Collar Counties        37%                        954
    Total                  42%                      3,171

Sampling Error Model for testing statistical significance

    Place of residence     "Favor" Beach Use Tax
    Chicago                42% +/- error
    Suburban Cook Co       42% +/- error
    Collar Counties        42% +/- error
    Total                  42%

Y% does not depend on the value of X, aka . . . Null Hypothesis . . . “no” causal relationship

E. Test of Statistical Significance Means . . . Compare the Observed Data to the Sampling Error Model (SEM, the null hypothesis) to see if the data are “statistically significant”

    Place of residence     OBSERVED                  EXPECTED IF NULL IS RIGHT
                           "Favor" Beach Use Tax     "Favor" Beach Use Tax
    Chicago                54%                       42%
    Suburban Cook Co       33%                       42%
    Collar Counties        37%                       42%
    Total                  42%                       42%

Null Hypothesis => conditional %s = marginal %

F. What you get from a significance test is the likelihood that differences as large as the observed ones would arise if the null were true. General rule for policy research: if that likelihood is 5% or less, reject the null (treat it as NOT TRUE)

G. So, what’s the likelihood that the % in favor everywhere is 42% and the differences are just sampling variation? = Likelihood the apparent differences could have arisen from sampling variation

IV. The Chi Square Test

OBSERVED COUNT (XTAB observed data)

                        Favor tax    Oppose    Total
    Chicago             619          528       1147      619 / 1147 = 54%
    Suburban Cook Co    354          716       1070      354 / 1070 = 33%
    Collar Counties     353          601        954      353 / 954 = 37%
    Total               1326         1845      3171      1326 / 3171 = 42%

Calculate expected data if null is right (42% is 41.82% before rounding)

                        Favor tax    Oppose    Total
    Chicago             480          667       1147      41.82% * 1147 = 480
    Suburban Cook Co    447          623       1070      41.82% * 1070 = 447
    Collar Counties     399          555        954      41.82% * 954 = 399
    Total               1326         1845      3171      41.82% * 3171 = 1326

Calculate Observed minus Expected (O-E)

                        Favor tax    Oppose
    Chicago             139          -139
    Suburban Cook Co    -93            93
    Collar Counties     -46            46

The positive differences show where there are “too many” cases if null is right

Calculate [ (O-E)^2 / E ]

                        Favor tax    Oppose
    Chicago             40           29
    Suburban Cook Co    20           14
    Collar Counties      5            4

Example (Chicago, Favor tax): (O-E)^2 = 19,422.65 using the unrounded O-E (139 ^ 2 alone is 19,321); 19,422.65 / 480 = 40

Add it all up: SUM [ (O-E)^2 / E ] = 112

Determine degrees of freedom (df) = (# rows - 1) * (# columns - 1) = 2 * 1 = 2

Look up the “critical value” of chi square, given the df . . .

    d.f.    p <.05 if chi sq >
     1       3.841
     2       5.991
     3       7.815
     4       9.488
     5      11.070
     6      12.592
     7      14.067
     8      15.507
     9      16.919
    10      18.307

. . . if chi square is greater than the critical value, then the likelihood of getting these data under the null hypothesis is less than 5%, which means the null is rejected (treated as NOT TRUE)

So in this case, chi square = 112, df = 2, critical value = 5.991. Null is NOT TRUE. The data are very unlikely to have arisen because of sampling variation alone.

THE DIFFERENCES AMONG LOCATIONS IN ATTITUDE TOWARD THE BEACH TAX ARE STATISTICALLY SIGNIFICANT

What about the difference between Suburban Cook and the Collar Counties?
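The hand calculation for the three regions above, and the same calculation for the two-suburb table that follows, can be checked with a few lines of code. This is a minimal sketch in Python rather than SPSS. The function name `chi_square` and the `CRITICAL_05` lookup are illustrative names, not part of the course materials. It works from unrounded expected counts, so individual cells can differ slightly from the rounded figures in the notes.

```python
def chi_square(observed):
    """Return (chi_square, df, expected) for a table of observed counts.

    `observed` is a list of rows; each row is a list of cell counts.
    """
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)

    # Expected count if the null is right: row total * column total / N
    expected = [[r * c / n for c in col_totals] for r in row_totals]

    # Add up (O - E)^2 / E over every cell
    chi_sq = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return chi_sq, df, expected


# Critical values for p < .05, copied from the lookup table in the notes
CRITICAL_05 = {1: 3.841, 2: 5.991, 3: 7.815}

# Observed counts from the notes; columns are Favor tax, Oppose
three_regions = [[619, 528],   # Chicago
                 [354, 716],   # Suburban Cook Co
                 [353, 601]]   # Collar Counties
two_suburbs = three_regions[1:]  # Suburban Cook Co and Collar Counties only

for label, table in [("three regions", three_regions),
                     ("two suburbs", two_suburbs)]:
    chi_sq, df, _ = chi_square(table)
    verdict = "significant" if chi_sq > CRITICAL_05[df] else "not significant"
    print(f"{label}: chi square = {chi_sq:.2f}, df = {df}, {verdict} at p < .05")
```

Run as written, this should report a chi square of about 112 with df = 2 for the three regions and about 3.41 with df = 1 for the two suburbs, in line with the hand calculations.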
OBSERVED COUNT (Suburban Cook Co vs. Collar Counties)

                        Favor tax    Oppose    Total
    Suburban Cook Co    354          716       1070
    Collar Counties     353          601        954
    Total               707          1317      2024      707 / 2024 = 35%

Expected Count if NULL is right

                        Favor tax    Oppose    Total
    Suburban Cook Co    374          696       1070
    Collar Counties     333          621        954
    Total               707          1317      2024

O-E

                        Favor tax    Oppose
    Suburban Cook Co    -20            20
    Collar Counties      20           -20

(O-E)^2 / E

                        Favor tax    Oppose
    Suburban Cook Co    1.04         0.56
    Collar Counties     1.17         0.63

SUM [ (O-E)^2 / E ] = 3.41

So in this case, chi square = 3.41, df = 1, critical value = 3.841. Null is NOT REJECTED. The data could have arisen because of sampling variation.

THE DIFFERENCE BETWEEN SUBURBAN COOK AND THE COLLAR COUNTIES ON ATTITUDE TOWARD THE BEACH TAX IS NOT STATISTICALLY SIGNIFICANT

V. CHI SQUARE: A BLANKET TEST

Idea: Poor people are more likely to use public transit
THEORY: Income causes mode of transportation

tranmt Resp's Means Most Used To Get To Work

     1 Car/truck              76.9%
     2 Van                     1.1%
     3 Bus                     6.7%
     4 Subway Or Elevated      4.1%
     5 Railroad                4.1%
     6 Taxi                    0.2%
     7 Motorcycle              0.2%
     8 Bicycle                 0.4%
     9 Walked                  3.2%
    10 Worked At Home          1.8%
    11 Airplane                0.4%
    12 Other                   1.0%
       Total                   100%

this is what’s in the data set (NOT PRESENTATION QUALITY)

Let’s say you coded it into these four groups . . .

    Mode
    Private           78.8%
    Public            14.9%
    Biked/Walked       3.6%
    Other              2.7%
    Total            100.0%

    RECODE tranmt (1=1) (2=1) (6=1) (7=1) (11=1) (10=4) (12=4) (3 thru 5=2) (8 thru 9=3) (ELSE=SYSMIS).
    VALUE LABELS tranmt 1 'private' 2 'public' 3 'bike or walked' 4 'other'.

INCOME . . . There’s a nice income variable already there; it just needs DK (don't know) assigned to missing

    MISSING VALUES inc4gp (7,8,9).
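The four-group recode of tranmt above can be sanity-checked by adding up the published percentages for the codes assigned to each group. This is a minimal sketch in Python rather than SPSS; the dictionary names are illustrative, and the inputs are the rounded percentages from the frequency table, so a group total may be off by a tenth of a point from the figures in the notes.

```python
# Univariate percentages for tranmt, copied from the frequency table in the notes
tranmt_pct = {1: 76.9, 2: 1.1, 3: 6.7, 4: 4.1, 5: 4.1, 6: 0.2,
              7: 0.2, 8: 0.4, 9: 3.2, 10: 1.8, 11: 0.4, 12: 1.0}

# Same mapping as the RECODE statement: old code -> new group
recode = {1: 1, 2: 1, 6: 1, 7: 1, 11: 1,   # private
          3: 2, 4: 2, 5: 2,                 # public
          8: 3, 9: 3,                       # bike or walked
          10: 4, 12: 4}                     # other
labels = {1: "private", 2: "public", 3: "bike or walked", 4: "other"}

# Add each old category's percentage into its new group
grouped = {}
for code, pct in tranmt_pct.items():
    group = labels[recode[code]]
    grouped[group] = grouped.get(group, 0.0) + pct

for group, pct in grouped.items():
    print(f"{group}: {pct:.1f}%")
```

The private, public, and bike or walked totals should match the notes (78.8%, 14.9%, 3.6%). The other group sums to 2.8% from the rounded inputs rather than the 2.7% shown, which is presumably a rounding difference.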
BIVARIATE PERCENTAGE TABLE: Mode of Transportation to Work

                   Private    Public    Bike / walk    Other
    <$20k          60%        28%       9%             4%
    $20k - $40k    77%        16%       4%             2%
    $40k - $70k    85%        11%       2%             2%
    $70k+          80%        14%       2%             3%
    TOTAL          79%        15%       4%             3%

PQ BIVARIATE CHARTS (Ugly, but PQ) . . .

[Two column charts of Mode of Transportation to Work, each on a 0% to 90% axis. One groups the bars by mode (Private, Public, Bike / walk, Other) with one bar per income group; the other groups the bars by income group (<$20k, $20k - $40k, $40k - $70k, $70k+) with one bar per mode.]

Chi Square = 188, df = 9

But this chi square tests the null hypothesis that ALL OF THE INCOME GROUPS MAKE ALL THE SAME TRANSPORTATION CHOICES, i.e., . . .

EXPECTED UNDER NULL HYPOTHESIS, 9 DF: Mode of Transportation to Work

                   Private    Public    Bike / walk    Other
    <$20k          79%        15%       4%             3%
    $20k - $40k    79%        15%       4%             3%
    $40k - $70k    79%        15%       4%             3%
    $70k+          79%        15%       4%             3%
    TOTAL          79%        15%       4%             3%

Chi square is a BLANKET TEST of all possible differences
Likelihood < 5%: Null is rejected (FALSE)
The income groups did not all make the same transportation choices

VI. FOCUSING CHI SQUARE TESTS ON SPECIFIC HYPOTHESES

It might be that the differences in PUBLIC transportation use are NOT SIGNIFICANT, even though the overall pattern of differences is significant. So you need to construct a chi square test that makes use of all the data available, but focuses on the set of differences you are most interested in . . .

Mode of Transportation to Work

                   Public    Non-Public
    <$20k          28%       72%
    $20k - $40k    16%       84%
    $40k - $70k    11%       89%
    $70k+          14%       86%
    TOTAL          15%       85%

[Column chart: Public Transportation Use by income group. <$20k 28%, $20k - $40k 16%, $40k - $70k 11%, $70k+ 14%.]

EXPECTED UNDER NULL: Mode of Transportation

                   Public    Non-Public
    <$20k          15%       85%
    $20k - $40k    15%       85%
    $40k - $70k    15%       85%
    $70k+          15%       85%
    TOTAL          15%       85%

Chi square = 99.6, df = 3
Likelihood < 5%: Null is rejected (FALSE)
Income groups are not equally likely to use public transportation

VII. PHI, SENSITIVITY OF CHI SQUARE TEST TO N

THEORY: Smoking is correlated with health status

COLLAPSE RARE CATEGORIES TO AVOID SMALL CELL COUNTS IN TABLES . . .

    helthr Respondent's Health          Respondent's Health (collapsed)
    1 Excellent     40%                 Excellent       40%
    2 Good          43%                 Good            43%
    3 Fair          14%                 Fair + Poor     17%
    4 Poor           3%
      Total        100%                 Total          100%

    RECODE helthr (1=1) (2=2) (3=3) (4=3) (7 thru 9=9) INTO helthr3.
    VARIABLE LABELS helthr3 '3 categories'.
    VALUE LABELS helthr3 1 'Excellent' 2 'Good' 3 'Fair or Poor'.
    MISSING VALUES helthr3 (7,8,9).

CHI SQUARE rule: no cell can have an Expected value < 5.0
Rule of thumb: recode so there is always 10% + per final category
Rule of thumb: rare categories OK for DEPENDENT VARIABLES only

Health Status

                    Smokers    Non-Smokers
    Excellent       32%        42%
    Good            48%        41%
    Fair/Poor       20%        16%

Chi square = 152, df = 2
152 > 5.99, so the null is rejected
Smoking is correlated with health status

But the difference is not very big: (42% - 32%) = 10%
Chi square is large because N is large (N = 18,381)
If there were 500 cases with all other distributions the same, chi sq = 4.13 < 5.99 and the null would not be rejected

CHI SQUARE IS SENSITIVE TO SAMPLE SIZE

PHI = SQUARE ROOT [ CHI SQ / N ]

    values of phi
    weak        .00 - .10
    moderate    .10 - .30
    strong      .30 +

Case 1 . . . n = 18,381 . . . phi = .09 (weak)
Case 2 . . . n = 500 . . . phi = .09 (weak)

VIII. PUTTING IT ALL TOGETHER

THEORY: Age causes likelihood of unemployment

    MISSING VALUES umemp2 (7,8,9).

NOTE: the variable name is UMEMP2. Typo in data entry ????

Likelihood of Unemployment in the Next 5 Years

                 Very likely    Fairly likely    Not too likely    Not at all likely    Total
    Under 30     7%             8%               32%               53%                  100%
    31-45        7%             8%               29%               56%                  100%
    46-64        11%            9%               24%               56%                  100%
    65+          33%            16%              14%               37%                  100%
    Total        9%             9%               28%               55%                  100%

big table, somewhat large N = 2,032
2 expected values = 6.0
Chi sq = 74, df = 9, Phi = .191
Null is not true: the likelihood of being unemployed is not the same in all age groups

Chart/graph . . . have to choose what to focus on because the whole table is unwieldy

[Column chart: "Very Likely" to be Unemployed in the Next 5 Years. Under 30 7%, 31-45 7%, 46-64 11%, 65+ 33%.]

[Column chart: "Very Likely + Fairly Likely" to be Unemployed in the Next 5 Years. Under 30 15%, 31-45 15%, 46-64 20%, 65+ 49%.]
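The phi values quoted in sections VII and VIII can be reproduced from the reported chi square and N. This is a minimal sketch in Python rather than SPSS; the function names `phi` and `strength` are illustrative. The notes give overlapping ranges for the strength labels (.10 and .30 each appear in two ranges), so treating exactly .10 as moderate and exactly .30 as strong is an assumption made here.

```python
import math


def phi(chi_sq, n):
    """Phi = square root of (chi square / N)."""
    return math.sqrt(chi_sq / n)


def strength(phi_value):
    """Label phi with the cutoffs from the notes (boundary handling is assumed)."""
    if phi_value < 0.10:
        return "weak"
    if phi_value < 0.30:
        return "moderate"
    return "strong"


# (label, chi square, N) as reported in the notes
cases = [("smoking by health, full sample", 152, 18381),
         ("smoking by health, 500 cases", 4.13, 500),
         ("age by unemployment", 74, 2032)]

for label, chi_sq, n in cases:
    value = phi(chi_sq, n)
    print(f"{label}: phi = {value:.3f} ({strength(value)})")

# With the same percentages, chi square shrinks in proportion to N
print(f"chi square of 152 rescaled to N = 500: {152 * 500 / 18381:.2f}")
```

Phi comes out the same for both smoking samples because it divides the sample size back out of chi square, which is the point of reporting it next to the significance test.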