How Not to Lie with Statistics: Avoiding Common Mistakes in Quantitative Political Science
Author(s): Gary King
Source: American Journal of Political Science, Vol. 30, No. 3 (Aug., 1986), pp. 666-687
Published by: Midwest Political Science Association
Stable URL: http://www.jstor.org/stable/2111095
Accessed: 31/01/2013 04:25

This content downloaded on Thu, 31 Jan 2013 04:25:06 AM. All use subject to JSTOR Terms and Conditions.

WORKSHOP

How Not to Lie with Statistics: Avoiding Common Mistakes in Quantitative Political Science*

Gary King, New York University

This article identifies a set of serious theoretical mistakes appearing with troublingly high frequency throughout the quantitative political science literature. These mistakes are all based on faulty statistical theory or on erroneous statistical analysis. Through algebraic and interpretive proofs, some of the most commonly made mistakes are explicated and illustrated. The theoretical problem underlying each is highlighted, and suggested solutions are provided throughout. It is argued that closer attention to these problems and solutions will result in more reliable quantitative analyses and more useful theoretical contributions.
One of the most glaring problems with much quantitative political science is its uneven sophistication and quality. Mistakes are often made but rarely noticed. In journal submissions, conference presentations, and student papers, problems occur with even more frequency. Having observed this situation for a few years, I noticed several patterns. First, the same mistakes are being made or "invented" over and over. Second, to refer a substantively orientated political scientist to an article in Econometrica, The Journal of the American Statistical Association, or even Political Methodology is to give advice that either is not helpful or is not followed. These problems are more than technical flaws; they often represent important theoretical and conceptual misunderstandings.[1] However, in most cases, there are relatively simple solutions that can reduce or eliminate bias and other statistical problems, improve conceptualization, make the analysis easier to interpret, and make the results more general.

* An earlier version of this article first appeared at the annual Political Science Methodology Society conference, Berkeley, California, July, 1985. I appreciate the comments from the participants at that meeting, particularly those of Christopher Achen and Nathaniel Beck. Thanks also to my colleagues at New York University, particularly Larry Mead, Bertell Ollman, and Paul Zarowin. Arthur Goldberger, Herbert M. Kritzer, Ann R. McCann, Charles M. Pearson, Lyn Ragsdale, the editors, and the anonymous reviewers were also very helpful.

[1] An example of a minor technical mistake is using ordinal level independent variables with statistics that assume interval level data. I refer to this as "minor" because it usually (although not always) has little substantive consequence and because it does not represent a conceptual misunderstanding.

In order to address these concerns, this paper presents proofs and illustrations of some of the most common statistical mistakes in the political science literature, along with theoretical arguments and suggested
corrections. It specifically omits problems with the newest and fanciest statistical techniques for two reasons. First, the problems considered below form the theoretical and statistical foundation to the more sophisticated methodologies; finding and filling cracks in the foundation should logically and chronologically precede the painting of shingles and shutters. Second, the great variety of newer techniques are being used by relatively few political scientists; thus, any criticism of the new techniques will apply only to a small audience. Although important, I will leave the newer techniques for a future paper.

For each quantitative problem, I describe (1) the mistake, (2) the proof, and (3) the interpretation. The proofs, appearing in footnotes or appendices when excessively technical, are formal versions of, as well as algebraic or numerical evidence for, the assertions made in my discussion of the mistake. Emphasis here is on the intuitive, so generality is often sacrificed in order to improve conceptual understanding. The final section includes a brief summary and gives implications of mistakes in the context of proposed solutions. Some sections are too brief to be divided into this triad and are therefore combined. This sort of methodological retrospective has been done in other disciplines, but although we can learn from some of these, most do not address problems specific enough to political science research. (See, for example, Leamer, 1983a; Smith, 1983; Friedman and Phillips, 1981; Hendry, 1980; and Gurel, 1968).[2]

Over three decades ago, Darrell Huff (1954) explained, in a book by the same name, How to Lie With Statistics. Because of the systematic precision required, we should realize by now that it is a lot harder (knowingly or not) to lie (and get away with it) with statistics than without them.

Regression on Residuals

The Mistake.
Suppose that $y$ were regressed on two sets of independent variables $X_1$ and $X_2$.[3] The coefficients to be estimated are in the parameter vectors $\beta_1$ and $\beta_2$ in model 1:

$$E(y \mid X_1, X_2) = X_1\beta_1 + X_2\beta_2 \qquad (1)$$

The standard and appropriate way to estimate $\beta_1$ and $\beta_2$ in model 1 is by running a multiple regression of $y$ on $X_1$ and $X_2$. The result

$$y = X_1 b_1 + X_2 b_2 + e \qquad (2)$$

is the least squares (LS) estimator. The sample estimates in equation 2 are used to infer to the population parameters in equation 1.

[2] I do not cite every methodologically flawed political science work in this paper because the purpose here is to improve future research and to facilitate critical reading of all research. There is little gained by berating those on whose research we are trying to build.

[3] The word "regressed" is sometimes misused. Reading a regression equation from left to right, we say, "the dependent variable is regressed on the independent variables." In the text, $y$ is the dependent variable; $X_1$ and $X_2$ each represent a set of several independent variables.

Now consider an (incorrect) alternative procedure, called here the regression on residuals (ROR) estimator. This is a method of estimating $\beta_1$ and $\beta_2$ often "invented" by first regressing $y$ on $X_1$, resulting in this equation:

$$y = X_1 b_1^* + e_1 \qquad (3)$$

where $b_1^*$ is the first ROR estimator. We then regress the residuals, $e_1$, on the second set of explanatory variables, $X_2$, yielding

$$e_1 = X_2 b_2^* + e_2 \qquad (4)$$

where $b_2^*$ is the second ROR estimator.

The mistaken belief is that $b_2^*$ from the second regression in equation 4 is equal to $b_2$ from equation 2; that is, since we have "controlled" for $X_1$ in equation 3, the result is the same as if we had originally computed equation 2. As is demonstrated in the proof appearing in appendix A, this is not true. The ROR estimator $b_1^*$ in equation 3 is a biased estimate of $\beta_1$, since the equation does not control for $X_2$.
This is the well-known omitted variables bias.[4] Since $e_1$, the residuals from equation 3 and the dependent variable in equation 4, is calculated from the biased ROR estimator $b_1^*$, it too is biased. Thus, it follows that $b_2^*$ is also biased, since it is calculated from the regression of the biased $e_1$ on the second set of explanatory variables $X_2$.[5]

[4] The bias does not occur when either $\beta_2 = 0$ or $X_1$ and $X_2$ are uncorrelated or both.

[5] Sometimes this process is continued: The second set of residuals $e_2$ is regressed on another set of explanatory variables, $X_3$, producing another ROR estimator and another set of residuals. This process has been extended to many stages, but I only consider the first two in the text. In the multi-stage ROR estimator, the bias is confounded even further.

The Interpretation. Except for two very special cases, the ROR estimator is not the same as the ordinary least squares estimator and by itself has no useful interpretation. Each ROR estimator is also a biased estimate of the corresponding parameter in model 1. In order to estimate $\beta_1$ and $\beta_2$ correctly in model 1, both sets of variables $X_1$ and $X_2$ should be put in the regression simultaneously. This gives an estimate ($b_1$) of the influence of $X_1$ on $y$ (controlling for $X_2$), and an estimate ($b_2$) of $X_2$ on $y$ (controlling for $X_1$).

An implication of this result is that one should not make too much of any interpretation of the residuals from a regression analysis. If it appears from an analysis of the residuals that some variable $X_3$ is missing, then $X_3$ may be missing, but it is not possible to draw fair conclusions about the influence of $X_3$ on $y$ unless $X_3$ were actually measured and the full equation were estimated.

An example is Achen's (1979) result that "Normal Vote" calculations are inconsistent: the Normal Vote was determined by a two-step process, roughly analogous to using the ROR estimator.

In the statistical literature, the ROR estimation procedure is called "stepwise least squares."
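The difference between the LS and ROR procedures can be illustrated with a small simulation. The sketch below is only a hedged illustration: it assumes one variable in each set, zero-mean normal data with no constant term, and made-up values for the true coefficients and the correlation between the regressors; the function names are hypothetical and none of these choices come from the article.

```python
import numpy as np


def least_squares(X, y):
    """Return least squares coefficients for y regressed on the columns of X."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


def compare_ls_and_ror(n=5000, beta1=1.0, beta2=2.0, rho=0.6, seed=0):
    """Simulate model 1 with one variable per set and estimate it both ways.

    All numeric defaults are illustrative assumptions, not values from the text.
    """
    rng = np.random.default_rng(seed)
    x1 = rng.normal(size=n)
    # x2 has unit variance and (population) correlation rho with x1.
    x2 = rho * x1 + np.sqrt(1.0 - rho**2) * rng.normal(size=n)
    y = beta1 * x1 + beta2 * x2 + rng.normal(size=n)

    # Equation 2: multiple regression of y on X1 and X2 together.
    b = least_squares(np.column_stack([x1, x2]), y)

    # Equation 3: regress y on X1 alone and keep the residuals e1.
    b1_star = least_squares(x1[:, None], y)
    e1 = y - x1 * b1_star[0]

    # Equation 4: regress the residuals e1 on X2.
    b2_star = least_squares(x2[:, None], e1)

    return {"ls": (b[0], b[1]), "ror": (b1_star[0], b2_star[0])}


result = compare_ls_and_ror()
print("LS  (b1, b2):  ", np.round(result["ls"], 2))
print("ROR (b1*, b2*):", np.round(result["ror"], 2))
```

Under these assumed settings the LS estimates should land close to the true values (1, 2), while the ROR estimates should not. Setting `rho=0.0` or `beta2=0.0` should bring the two procedures into approximate agreement, in line with the two special cases in which the bias does not occur.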
"Stepwise regression," however, is very different from the ROR procedure, although it is no less problematic.[6]

[6] Stepwise regression (which has been called "unwise regression" [Leamer, 1985] or might be called a "Minimum Logic Estimator") allows computer algorithms to replace logical decision processes in selecting variables for a regression analysis. There is nothing wrong with fitting many versions of the same model to analyze for sensitivity. After all, the goal of learning from data is as noble as the goal of using data to confirm a priori hypotheses. However, some a priori knowledge, or at least some logic, always exists to make selections better than an atheoretical computer algorithm. Edward Leamer (1983b, p. 320) has noted, "Economists have avoided stepwise methods because they do not think nature is pleasant enough to guarantee orthogonal explanatory variables, and they realize that, if the true model does not have such a favorable design, then omitting correlated variables can have an obvious and disastrous effect on the estimates of the parameters." At the very least, stepwise regression, even if occasionally useful for special purposes, need not be presented in published work (see Lewis-Beck, 1978). The use of stepwise regression has caused an additional curious mistake. It is often said that the order in which variables are entered into a regression equation influences the values of the coefficients. A cursory look at the equations used in the estimation (or at a sample computer run) will show that this is wrong. What does change, depending upon the order in which variables are entered, is the marginal increase in the $R^2$ statistic.

The Race of the Variables

In this section, the use of standardized coefficients ("beta weights"), correlation coefficients (Pearson's correlation), and $R^2$ ("the coefficient of determination") are challenged. In most practical political science situations, it makes little sense to use these statistics. They do not measure what they appear to; they substitute statistical jargon for political meaning; they can be highly misleading; and in nearly all situations, there are better ways to proceed.

The Race (1): Standardized Fruit

The Mistake. Apples, Oranges, and Perceptions. Imagine a situation where a researcher wanted to explain $y$, the number of visits to the doctor per year. The explanatory variables were $X_1$, the number of apples eaten per week, and $X_2$, the number of oranges eaten per week. The multiple regression equation was then estimated to be:

$$y = 10 - 1.5X_1 - 0.25X_2 \qquad (5)$$

For every additional apple one eats per week, the average number of visits to the doctor per year decreases by one and a half. For each additional orange one eats, they decrease by one quarter of a visit.

This hypothetical researcher now would like to make a statement about the comparative worth of apples and oranges in reducing doctor visits. He then asks the resident political science methodologist whether she can help him compare apples and oranges. The methodologist says that the answer depends upon the researcher stating his question more precisely. If the researcher means: "I have only enough money for one apple or one orange, and I want to know which will make me healthier," then the answer is probably the apple. But suppose an apple costs 50 cents, while an orange costs only five cents. In this case, the researcher might ask, "What is the best use of my last dollar?" Here the decision would have to be in favor of the orange: For one dollar spent on two apples, doctor visits would decrease by about three, whereas the same dollar spent on 20 oranges would decrease doctor visits by five on average.

Assuming the question is stated precisely enough, these comparisons make some sense. But they make sense only because there is a common unit of measurement: a piece of fruit or an amount of money.
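The two comparisons in the apples and oranges example amount to a few lines of arithmetic. The sketch below simply restates them, using the coefficients of equation 5 and the prices given in the example; the variable and function names are hypothetical.

```python
# Coefficients from equation 5: change in yearly doctor visits per fruit eaten weekly.
VISITS_PER_FRUIT = {"apple": -1.5, "orange": -0.25}
# Prices from the example, in cents, so the division stays in whole numbers.
PRICE_CENTS = {"apple": 50, "orange": 5}


def change_per_fruit(fruit):
    """Comparison on the common unit 'one piece of fruit'."""
    return VISITS_PER_FRUIT[fruit]


def change_for_budget(fruit, budget_cents=100):
    """Comparison on the common unit 'one dollar spent'."""
    pieces = budget_cents // PRICE_CENTS[fruit]
    return pieces * VISITS_PER_FRUIT[fruit]


for fruit in ("apple", "orange"):
    print(fruit, change_per_fruit(fruit), change_for_budget(fruit))
# apple -1.5 -3.0
# orange -0.25 -5.0
```

Which fruit "wins" depends entirely on the unit chosen: per piece the apple does more, per dollar the orange does more.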
Suppose then that the researcher told the methodologist that he had torn off the computer printout just prior to the last coefficient estimate. The real equation, he explained, includes $X_3$, the respondent's perception of doctors as beneficial, measured on a scale ranging from 1 (not beneficial) to 10 (very beneficial). The estimated equation should have appeared as this:

$$y = 10 - 1.5X_1 - 0.25X_2 + 2X_3 \qquad (6)$$

The researcher now asks whether this means that perceptions are "more important" than apples. After all, he says, 2 is greater than 1.5. Any methodologist worth her 8087 chip would object to this, she asserts. In fact, were one to take this comparison to its logical extreme, one would conclude that perceiving doctors as more detrimental is more health producing than eating an apple. Although both regressors seek to explain the same dependent variable, they are neither measured on, nor can they be converted to, meaningfully common units of measurement.

This is precisely the point: Only when explanatory variables are on meaningfully common units of measurement is there a chance of comparison. If there is no common unit of measurement, there is no chance of meaningful comparison.

However, there is another sense in which even "common-unit" comparisons are unfair. The apple coefficient, for example, represents the effect of apples (holding constant the influence of oranges and perceptions). The estimated coefficient for oranges has a different set of control variables (since it includes apples and not oranges). This may make a comparison between apples and oranges more difficult, if not logically impossible.

The Mistake Continued. Standardized Fruit. Convinced about not comparing unstandardized coefficients, our hypothetical researcher proposes using the standardized coefficients on his computer printout. The methodologist retorts that, if it is of little use to compare apples and perceptions, then it is of considerably less use to compare standardized apples and standardized perceptions. Standardization does not add information. If there were no basis for comparison prior to standardization, then there is no basis for comparison after standardization.

A relatively common rebuttal is that for explanatory variables with unclear or difficult-to-understand units of measurement, standardized coefficients should increase interpretability. The problem is that if the original data were meaningless, then the standardized regression coefficients are precisely as meaningless; if standardized coefficients do not add information, they certainly do not add meaning. "To replace the unmeasurable by the unmeaningful is not progress" (Achen, 1977, p. 806).

Using a superscript "s" to denote standardized variables, I present the results for our hypothetical case:[7]

$$y^s = -0.9X_1^s - 0.2X_2^s + 0.5X_3^s \qquad (7)$$

We now must interpret equation 7 to mean, for example, that as we eat one additional standard deviation of apples, the number of visits to the doctor decreases by nine tenths of a standard deviation, which is not a very appealing conceptualization.

Three observations: First, standardizing makes the coefficients substantially more difficult to interpret. Second, standardization still does not enable us to compare this first effect to the one-half standard deviation increase in doctor visits resulting from a one standard deviation increase in perceptions of doctors.
Third, and most serious, while the original coefficients are estimates of the relationships between the respective explanatory variables and the dependent variable (controlling for the other explanatory variables), the standardized variables are measures of this relationship as well as of the variance of the independent variable. Since researchers are typically interested in measuring only the relationship, or at least interested in the two separately, it makes little sense to use standardized variables. A simple numerical proof will demonstrate this point.[8]

[7] There are two methods that can produce the same standardized coefficients: (1) standardize each of the original variables (subtract the sample mean and divide by the sample standard deviation) and run a regression on these standardized variables; or (2) run a regression and multiply each unstandardized coefficient by the ratio of the standard deviation of the respective independent variable to the standard deviation of the dependent variable.

The Proof. Imagine a simple experiment where only three observations on one dependent and one independent variable are taken. The observations are $y' = \{5, 5, 6\}$ and $X' = \{2, 4, 4\}$. Calculated from these three observations with a constant term included, the regression is:

$$y = 4.5 + 0.25X \qquad (8)$$

and the standardized coefficients are:

$$y^s = 0.50X^s \qquad (9)$$

Suppose further that another year went by, and another data point was collected on $y$ $\{9.5\}$ and on $X$ $\{20\}$. Because this random draw worked out well, the unstandardized coefficients in equation 8 do not change at all with the introduction of this additional observation.
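The numerical example can be reproduced directly from the observations given in the text. The sketch below is a hedged illustration: it computes the bivariate regression by the usual sums-of-squares formulas and then standardizes the slope by multiplying it by the ratio of the sample standard deviations (the second method described in footnote 7); the function name is hypothetical.

```python
import numpy as np


def bivariate_fit(x, y):
    """Return (intercept, slope, standardized slope) for y regressed on x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx, dy = x - x.mean(), y - y.mean()
    slope = (dx @ dy) / (dx @ dx)
    intercept = y.mean() - slope * x.mean()
    # Standardize by the ratio of sample standard deviations (n - 1 divisor).
    standardized = slope * x.std(ddof=1) / y.std(ddof=1)
    return intercept, slope, standardized


three = bivariate_fit([2, 4, 4], [5, 5, 6])
four = bivariate_fit([2, 4, 4, 20], [5, 5, 6, 9.5])
print("three observations:", np.round(three, 2))
print("four observations: ", np.round(four, 2))
```

The intercept and slope should match equation 8 for both samples, since the added point lies exactly on the fitted line, while the standardized coefficient should be noticeably larger once the fourth observation is included.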
The new observation does, however, increase the sample standard deviation of $X$ from 1.15 to 8.39 (which is what one would generally expect as $n$ increases). Although this did not change the original coefficients in equation 8, the standardized coefficient nearly doubles in the four observation case (compare equations 9 and 10):

$$y^s = 0.98X^s \qquad (10)$$

Under situations with different variances of the independent variables but identical relationships, the standardized coefficient is constrained only to have the same sign as the unstandardized coefficient. Standardized coefficients may be either under- or over-estimates. This intuitive proof extends directly to situations with multiple independent variables.

[8] Kim and Mueller (1976) also show that changes in the covariances of the included variables and of the variances of the included and excluded variables in a system of equations also affect the standardized (but not the unstandardized) coefficients.

The Interpretation. In summary, standardized coefficients are in general (1) more difficult to interpret, (2) do not add any information that may help to compare effects from different explanatory variables, and (3) may add seriously misleading information. The original, unstandardized coefficients are meaningful and are not subject to these problems, although they generally cannot be compared for importance.

There are two important qualifications to these points. First, if one must include a variable that is difficult to interpret as a control, then perhaps standardizing just this variable would capitalize on the standardized coefficient's simpler descriptive properties (Blalock, 1967a). This partial standardization procedure is certainly better than standardizing all the variables.[9] Second, some argue that standardized measures seem to be the more natural scale for variables like test scores.[10] For example, Hargens (1976) argues that standardized coefficients can sometimes be structural parameters.
Although Kim and Ferree (1981) successfully refute most of this argument on theoretical grounds, there is one sense in which it may be correct for some studies. To make this point, it is useful to consider a very different type of standardization commonly used and generally accepted in economic studies of time series data.

[9] If the dependent variable is too difficult to understand, then I would give up on the regression, collect better data, or try to figure out a more meaningful interpretation.

[10] As an example of the problem this sometimes causes, consider the Educational Testing Service's (ETS's) standardized Graduate Record Examination (GRE). University admission offices across the country make important decisions based in part on small differences in scores on this exam, whereas ETS reports that the GRE can only correctly distinguish students who are more than one hundred points apart (on a scale from 200 to 800) two out of three times (i.e., a 66% confidence interval!). Perhaps if this score were not standardized or if there were a more meaningful substantive interpretation, we would be better prepared to use GREs for admission decisions.

The raw consumer price index ($CPI_t$) is not usually included in regression models for two reasons. First, the series is nonstationary and may, therefore, lead to spurious findings. Second, for example, an increase in the price of a typical market basket of food from $10.00 to $11.00 is likely to have more of an influence on any dependent variable than if the increase were from $100.00 to $101.00. For both of these reasons, the proportional change in $CPI_t$ is used; this "standardized" measure is commonly called the inflation rate.[11] In this case, the standardized variable is usually considered more natural and substantively meaningful than the "unstandardized" $CPI_t$.

[11] The most intuitive way to calculate the inflation rate is as $(CPI_t - CPI_{t-1})/CPI_{t-1}$, but a very nearly exact measure, which for technical reasons is actually better and is used most everywhere, is $\log(CPI_t) - \log(CPI_{t-1})$. See King and Benjamin (1985) for a political application.

In a similar manner, subtracting the sample mean from a variable under analysis and dividing it by the sample standard deviation may be the more natural measure for some concepts, particularly for some psychological scales and attitudinal measures. In part, it may even be a matter of personal taste and custom (Blalock, 1967b). However, decisions about whether each variable is to be standardized should be made and justified on an individual basis rather than "a habitual reliance on the standardized coefficients" (Kim and Ferree, 1981, p. 207). Just as we should not routinely calculate proportional changes for every variable in a time series analysis, variables in cross-sectional analyses should not be automatically standardized.

A more important and final point is that most times scholars are not interested in finding out which variable will win the race. Most often it is "good enough" to say that even after controlling for a set of theoretically possible confounding variables (i.e., plausible rival hypotheses and influences), the variable in which we are interested still seems to have an important influence on the dependent variable. This is precisely the empirical evidence for which we search to substantiate or refute our theoretical expectations. Usually, little political understanding is gained by hypothesizing a winner in a race of the variables.

The Race (2): The Correlation Problem

The Mistake. Many great things are attributed to the simple correlation coefficient. It purportedly needs to assume only covariation, while regression must assume causation. The specific statistical assumptions are thought to be less severe than for regression. It is said to be a better guide when one's theory argues only that "the variables generally go together" rather than there being a "one-to-one, cause and effect relationship." It supposedly also makes results easier to interpret.

Each of these statements is false. There are several approaches to describing why these common arguments are invalid (Tufte, 1974). Two are most useful for present purposes.

The Proof. Consider, first, the case of a standardized coefficient on one independent and one dependent variable. Through some simple algebraic manipulation, it can be shown that this standardized coefficient is equal to the correlation coefficient.[12] Thus, every argument that applies to the standardized regression coefficient applies also to the correlation coefficient.

[12] With no loss of generality, assume each variable has a mean of zero. This proof demonstrates that the standardized coefficient ($b^s$) for one independent variable and one dependent variable is equal to the correlation coefficient ($r$):

$$b^s = b\,\frac{s_x}{s_y} = (x'x)^{-1}x'y\,\sqrt{\frac{\sum x_i^2}{\sum y_i^2}} = \frac{\sum x_i y_i}{\sqrt{\sum x_i^2 \sum y_i^2}} = r$$

For the case of multiple independent variables, standardized coefficients are not the same as correlation coefficients or partial correlation coefficients. The results presented here therefore are not completely general, but the substantive conclusions and recommendations still apply.

Next, consider the population parameters to which the sample correlation attempts to infer. The most likely relevant probability distribution is the bivariate normal, which has five parameters: the marginal mean and variance for each variable and $\rho$, the population correlation coefficient. The problem is that if $r$ were considered an estimator of $\rho$ we would need to assume that $x$ and $y$ were drawn from a bivariate normal distribution. Since the marginal distributions of a bivariate normal distribution are normally distributed, we would need to make all the assumptions of regression and, in addition, the assumptions that $X$ is normally distributed and that $x$ and $y$ are jointly normally distributed. In many political science examples, this is unreasonable. For example, any use of a dichotomous independent variable (male/female, agree/disagree, etc.)
violates the assumption. Moreover, one can use regression, make fewer assumptions, and get more reasonable and interpretable results.

The Interpretation. All of the problems attributed to standardized coefficients apply to correlation coefficients.

Furthermore, there is nothing in statistical theory that attributes causal assumptions to regression coefficients; regression is simply a sample estimate of a (population) conditional expected value. The assumptions are about the conditional probability distribution, not about the influence of $x$ on $y$. Nothing can or should stop an applied researcher from stating that $x$ causes $y$, but it is crucial to understand that statistical analysis does not usually provide evidence with which to evaluate this assertion (see Granger (1969) and Sims (1980) for more direct attempts).

There is also nothing that attributes causal assumptions to the correlation coefficient. Correlations are sample estimates of the population parameter $\rho$ from the bivariate normal distribution. Thus, arguments about causality, association, and correlation are not required for either regression or correlation and do not form a basis for choosing between the two.

Furthermore, as a result of the distributional requirements, the assumptions for correlation coefficients are far more demanding than for regression analysis. Unstandardized regression coefficients are almost always the best option.

The Race (3): Coefficient of Determination?

$R^2$ is often called the "coefficient of determination." The result (or cause) of this unfortunate terminology is that the $R^2$ statistic is sometimes interpreted as a measure of the influence of $X$ on $y$. Others consider it to be a measure of the fit between the statistical model and the true model. A high $R^2$ is considered to be proof that the correct model has been specified or that the theory being tested is correct. A higher $R^2$ in one model is taken to mean that that model is better.
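One way to see why a higher $R^2$ need not signal a better model is a small simulation in which the true relationship and the scatter around the line are held fixed while only the variance of $X$ changes. The sketch below is illustrative only: the slope, noise level, sample size, and function name are assumptions chosen for the example, not values taken from the article.

```python
import numpy as np


def simulate_r2(x_sd, n=20000, slope=0.5, noise_sd=1.0, seed=1):
    """Fit y on x for simulated data; return (slope, residual sd, R^2).

    The same true model is used for every call; only the spread of x differs.
    """
    rng = np.random.default_rng(seed)
    x = rng.normal(0.0, x_sd, size=n)
    y = 1.0 + slope * x + rng.normal(0.0, noise_sd, size=n)

    dx, dy = x - x.mean(), y - y.mean()
    b = (dx @ dy) / (dx @ dx)
    resid = dy - b * dx
    r2 = 1.0 - (resid @ resid) / (dy @ dy)
    resid_sd = np.sqrt(resid @ resid / (n - 2))
    return b, resid_sd, r2


for x_sd in (1.0, 2.0):
    b, resid_sd, r2 = simulate_r2(x_sd)
    print(f"sd(x)={x_sd}: slope={b:.2f}, residual sd={resid_sd:.2f}, R^2={r2:.2f}")
```

In both runs the estimated slope and the residual standard deviation should be nearly identical, yet $R^2$ should be much larger when $X$ is more spread out. Under these assumptions, the statistic responds to the variance of $X$ even though neither the relationship nor the fit around the line has changed.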
All these interpretations are wrong. $R^2$ is a measure of the spread of points around a regression line, and it is a poor measure of even that (Achen, 1982). Taking all variables as deviations from their means, $R^2$ can be defined as the sum of all $\hat{y}^2$ (the sum of squares due to the regression) divided by the sum of all $y^2$ (the sum of squares total):

$$R^2 = \frac{\hat{y}'\hat{y}}{y'y} = \frac{b'X'Xb}{y'y} = \frac{b^2 \sum x_i^2}{\sum y_i^2}$$

where the last equation moves from general notation to that for one independent variable.

Note, however, that this is precisely the square of the correlation coefficient (or the square of the standardized regression coefficient given in footnote 12). Therefore all of the criticisms of the correlation and standardized regression coefficients apply equally to the $R^2$ statistic.

Worse, however, is that there is no statistical theory behind the $R^2$ statistic. Thus, $R^2$ is not an estimator because there exists no relevant population parameter. All calculated values of $R^2$ refer only to the particular sample from which they come. This is clear from the standardized coefficient example in preceding paragraphs, but it is more graphically demonstrated in two ($x$, $y$) plots by Achen (1977, 808). In the first plot $R^2 = 0.2$. In the second plot, the fit around the regression line is the same, but the variance of $X$ is larger; here $R^2 = 0.5$.

Ad hoc arguments for $R^2$ are often made in the form of the researcher's questions and the methodologist's answers:

Q: How can I tell how strongly my independent variables influence my dependent variable without $R^2$?

A: Interpret your unstandardized regression coefficients.

Q: But how can I tell how good these coefficients are?

A: The standard errors are estimates of the variability of your estimates across samples. If they are small relative to your coefficients, then you should be more confident that similar results would have emerged even if a sample of 1500 different people were interviewed.

Q: But how can I tell how good the regression is as a whole?
A: If you want to test the hypothesis that all your coefficients are zero, use the F-test. More complex hypotheses about different theoretically relevant linear combinations of coefficients (e.g., that the first three coefficients are jointly zero, or that the next two add to 1.0) can also be tested. $R^2$ is associated with, but is a poor substitute for, test statistics.

Q: O.K. I guess I really mean to ask: How can I assess the spread of the points around my regression line?

A: There is nothing intrinsically or politically interesting in the spread of points around a regression line. If you are interested in the precision with which you can confidently make inferences, then look at your standard errors. Alternatively, you might be interested in the precision of within-sample and out-of-sample forecasts. Forecasts correspond to the regression line (or to the extrapolated line for out-of-sample forecasts), given specified values of your explanatory variables. It is perfectly reasonable to estimate and then make probabilistic statements about the forecasts or even to calculate forecast confidence intervals. Surely if the observed point spread is large, the confidence interval will also be large. However, $R^2$ is also a poor substitute for going directly to confidence intervals.

Q: But do you really want me to stop using $R^2$? After all, my $R^2$ is higher than that of all my friends and higher than those in all the articles in the last issue of the APSR!

A: If your goal is to get a big $R^2$, then your goal is not the same as that for which regression analysis was designed. The purpose of regression analysis and all of parametric statistical analyses is to estimate interesting population parameters (regression coefficients in this case). The best regression model usually has an $R^2$ that is lower than could be obtained otherwise.
If the goal is just to geta big R2, theneven thoughthatis to be relevant unlikely to anypoliticalscienceresearchquestion, hereis some "advice": Includeindependent variablesthatare verysimilarto thedependentvariable.The "best"choiceis the dependent variable;yourR2 willbe 1.0.Laggedvaluesofy usually do quite well. In fact,the moreright-hand-side variables includedthebiggeryourR2 willget.13Anotherchoiceis to add variablesor selectively add or deleteobservations in orderto increasethevarianceoftheindependent variables. Thesestrategies willincreaseyourR 2, buttheywilladd nothingto youranalysis,nothingto yourunderstanding ofpolitical andnothing phenomena, usefulin explaining yourresultsto others.Thegeneralstrategy ofanalysiswilllikelydestroy mostofthe desirableproperties ofregression analysis. Is thereanything usefulaboutR2? Yes.Thereis atleastonedirectuseandseveralindirect usesofR2. Youcandirectly applyandevaluateR 2 whencomparing twoequationswithdifferent explanatory variablesandidenticaldependent variables. Themeasureis,inthiscase,a convenient goodness-of-fit a roughwayto assessmodelspecification statistic, providing and Foranyoneequation, sensitivity. R 2 can be considered a measure oftheproportional inerrorfrom reduction thenullmodel(withno tocurrent explanatory variables) model.As such,itis a measureof 1Itis possible, butunlikely, fortheR2 tostaythesame;inanycase,itwillneverdecrease as morevariables areadded.Moregenerally, as thenumber ofvariables approaches thenumber ofobservations, R 2 approaches 1.0. This content downloaded on Thu, 31 Jan 2013 04:25:06 AM All use subject to JSTOR Terms and Conditions 678 GaryKing thisinterthe"proportion ofvarianceexplained,"and,although pretation is commonly used,itis notclearhowthisinterpretation addsmeaning topoliticalanalyses. Therearealso a variety ofindirect "uses"forR 2. Itis oftentruethata and highR 2 is accompaniedby smallstandarderrors,largecoefficients, good news; narrowconfidenceintervals. 
Thus, a higher $R^2$ is generally good news; this is the reason why, ceteris paribus, $R^2$ does not always mislead. However, most of the useful information in $R^2$ is already available in other commonly reported statistics. Furthermore, these other statistics are more accurate measures: They can directly answer theoretically interesting questions. $R^2$ cannot. Of course, when one reads someone else's work, $R^2$ may be a useful interpretive substitute if some of the more accurate measures were not calculated. Consequently, although the odds of being misled are substantially higher with $R^2$ than with these other statistics, it is just as well that $R^2$ is routinely reported. It is the use of this information that should be changed.

Confusion with Dichotomous Variables

In this section, I discuss common misuses of dichotomous variables. First, I consider the relationship between analysis of variance and regression in handling dichotomous independent variables. Then, I present common mistakes in using dichotomous dependent variables. Finally, I attempt to alleviate confusion about using dichotomous variables and mistaking dependent variables for independent variables in factor analysis.

(1) Dichotomous Independent Variables

The Mistake. Consider a case where there are two populations with means $\mu_1$ and $\mu_2$, from which random samples of sizes $n_1$ and $n_2$ are taken. The populations could be male and female, agree and disagree, Republican and Democrat, or anything that could be represented by a meaningful dichotomous explanatory variable. A common problem is to test the hypothesis that the means are equal ($\mu_1 - \mu_2 = 0$). In this case, the first thing we do is calculate the means, $\bar{y}_1$ and $\bar{y}_2$, of the two samples.

There are three approaches to this problem: (1) a difference in means test, (2) an analysis of variance (ANOVA) model, and (3) a regression model. Justifications for choosing one of these models over the others are often given. The difference in means test is sometimes seen as a quick way to get a feel for the data. ANOVA and the difference in means have been credited with requiring less restrictive assumptions about the data.
Some think ANOVA can be safely used with dichotomous dependent variables; others say that ANOVA and not regression allows dichotomous independent variables.

These assertions are false. In fact, the three techniques are intimately related: conceptually, statistically, and even algebraically. The simplest but least general of the three is the difference in means test. Let y be a vector of observations from both populations and X be an indicator variable. Let the value for the first population be -1 and the value for the second be 1. (These values are arbitrary choices that make later computation easier.) Then the model is

$$
\begin{aligned}
E(y \mid X = -1) &= \mu_1 \\
E(y \mid X = 1) &= \mu_2
\end{aligned}
\tag{11}
$$

The obvious sample statistic is the difference in the sample means, which, after dividing by the standard error of this difference, follows a t-distribution.¹⁴

Analysis of variance (ANOVA) is a somewhat more general way to deal with this problem. The theoretical model is $E(y) = \mu + \delta_i$, where $\mu$ is the grand mean of both populations, $i = 1, \ldots, G$, where $G$ is the number of populations, and $\delta_i$ is the deviation from the grand mean for population $i$. We impose the restriction that $\sum_{i=1}^{G} \delta_i = 0$. In the special case of $G = 2$, $\delta_1 = -\delta_2$. The model can be restated for each population as

$$
\begin{aligned}
E(y \mid X = -1) &= \mu + \delta_1 \\
E(y \mid X = 1) &= \mu + \delta_2 = \mu - \delta_1
\end{aligned}
\tag{12}
$$

The sample estimate of $\mu$ is $\bar{y}$ and of $\delta_i$ is $d_i$. By definition, $\bar{y} + d_1 = \bar{y}_1$ and $\bar{y} + d_2 = \bar{y}_2$.
These means are, of course, identical to those that estimate model 11, but $d_1$ and $d_2$ represent deviations from the sample "grand" mean. The representation is slightly different, but the interpretation should be exactly the same. The test statistic for the hypothesis that $\delta_1 = \delta_2 = 0$ follows the F-distribution, which is a trivial generalization of the t-distribution used for the difference in means test.¹⁵

¹⁴ The choice for an estimate of the standard error depends upon whether the two samples are independent. I consider here only the case of independence, although there are straightforward generalizations to the case of "nonspherical" disturbances.

¹⁵ Squaring a variable with a t-distribution (df = k) yields a variable with an F-distribution (numerator df = 1, denominator df = k).

The final and most general approach to this problem is with regression analysis (the general linear model). The model of interest here is $E(y \mid X) = \beta_0 + \beta_1 X$, with X taking on the value -1 for the first population and 1 for the second. The model is defined, for each population, as:

$$
\begin{aligned}
E(y \mid X = -1) &= \beta_0 - \beta_1 \\
E(y \mid X = 1) &= \beta_0 + \beta_1
\end{aligned}
\tag{13}
$$

The sample estimates of $\beta_0$ and $\beta_1$ are $b_0$ and $b_1$, respectively. Appendix B proves that $b_0$ and $b_1$ are close algebraic relatives of the grand mean ($\bar{y}$) and the deviations from that mean ($d_i$) from the ANOVA model. Two points are demonstrated in Appendix B. First, $b_0$ is shown not to be the grand mean except when $n_1 = n_2$. Second, $b_1$ is proven not to be the deviation from the grand mean ($d_i$), except when $n_1 = n_2$.

This inequality between ANOVA and regression only denotes different ways of representing the same underlying relationships. There are no differences in assumptions or empirical interpretation. Note that in the regression model, deviations from the grand mean can be represented in terms of the parameter estimates. Using $\bar{y}_1 = b_0 - b_1$ and $\bar{y}_2 = b_0 + b_1$ from model 13, and $\bar{y} = b_0 + b_1 \bar{x}$ with $\bar{x} = (n_2 - n_1)/n$ and $n = n_1 + n_2$:

$$
\begin{aligned}
\bar{y} - \bar{y}_1 &= b_0 + b_1[(n_2 - n_1)/n] - (b_0 - b_1) = \left[\frac{n_2 - n_1 + n}{n}\right] b_1 \\
\bar{y} - \bar{y}_2 &= b_0 + b_1[(n_2 - n_1)/n] - (b_0 + b_1) = \left[\frac{n_2 - n_1 - n}{n}\right] b_1
\end{aligned}
$$

Thus, for the special case of $n_1 = n_2$, $\bar{y} - \bar{y}_1 = b_1$ and $\bar{y} - \bar{y}_2 = -b_1$, deviations of equal size and opposite sign just as in ANOVA. When $n_1 \neq n_2$, we interpret $2b_1$ to be the estimated difference between the two population means. In fact, $2b_1$ is exactly the parameter estimate for the difference in means test.

Note that in none of these models should dichotomous independent variables be standardized. The consequence of such a calculation is to make the standardized coefficient dependent not only upon the variance of the independent variable (as is always the case) but also upon its mean, since the variance of a dichotomy is a function of its mean.

The Interpretation. ANOVA, regression, and the difference of means test are all special cases of the general linear model. The assumptions required of one are required of the others as well. If there are dichotomous dependent variables, none of the techniques are appropriate. If there is a dichotomous independent variable, any one of the three will do. If, as is usually the case with political data, there are both discrete and continuous explanatory variables, then only regression will accommodate the research problem.

There are generalizations of ANOVA that accommodate mixed models, like "analysis of covariance" (which is not to be confused with "analysis of covariance structures"). Since, for experimental researchers, ANOVA often seems a more conceptually appropriate model, and since its data requirements and resulting information are essentially equivalent to those of regression analysis, the choice between the two is mostly a matter of personal taste. My view is that for most political science research, regression is a substantially more general model: It incorporates many types of ANOVA in one statistical model (and algebraic formula).
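The algebraic kinship of the three techniques can be checked numerically. The sketch below is only an illustration under assumed conditions: the simulated samples, their sizes, means and variances, and all variable names are hypothetical, and the difference in means test uses the pooled variance appropriate to independent samples. It computes the t statistic, the one-way ANOVA F statistic, and the regression estimates with the -1/+1 coding used in the text.

```python
import numpy as np

rng = np.random.default_rng(1)
n1, n2 = 12, 30                                    # deliberately unequal sample sizes
y1 = rng.normal(loc=5.0, scale=2.0, size=n1)       # hypothetical sample, population 1
y2 = rng.normal(loc=6.5, scale=2.0, size=n2)       # hypothetical sample, population 2
y = np.concatenate([y1, y2])
x = np.concatenate([-np.ones(n1), np.ones(n2)])    # X = -1 for population 1, +1 for population 2
n = n1 + n2

# (1) Difference in means test with pooled variance (independent samples).
ss_within = np.sum((y1 - y1.mean()) ** 2) + np.sum((y2 - y2.mean()) ** 2)
s2_pooled = ss_within / (n - 2)
t = (y2.mean() - y1.mean()) / np.sqrt(s2_pooled * (1 / n1 + 1 / n2))

# (2) One-way ANOVA: deviations from the grand mean and the F statistic.
grand = y.mean()
d1, d2 = y1.mean() - grand, y2.mean() - grand
F = (n1 * d1 ** 2 + n2 * d2 ** 2) / s2_pooled      # between MS (1 df) over within MS

# (3) Regression of y on a constant and x.
X = np.column_stack([np.ones(n), x])
b0, b1 = np.linalg.lstsq(X, y, rcond=None)[0]

print(f"difference in means = {y2.mean() - y1.mean():.4f},  2*b1 = {2 * b1:.4f}")
print(f"midpoint of means   = {(y1.mean() + y2.mean()) / 2:.4f},  b0 = {b0:.4f},  grand mean = {grand:.4f}")
print(f"t^2 = {t ** 2:.4f},  F = {F:.4f}")
```

With unequal sample sizes, $b_0$ is the midpoint of the two sample means rather than the grand mean, while $2b_1$ reproduces the difference in means and $t^2$ equals F: the same information in three representations.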
Although specification issues apply to all three methods, they are usually only considered in a regression context. In addition, regression is also substantially easier to generalize in order to correct for nonspherical disturbances and other common problems. By comparison, more general ANOVA models can get quite messy when they exist. For this reason, many ANOVA computer programs actually do regression analyses and then transform the results into the ANOVA parameters for presentation.

The point is that for the standard analysis, all three models come from the same general form. Each model provides a different representation of exactly the same information, and correct specification is required of all three. When the analysis is more complicated, the regression model may prove more tractable.

(2) Dichotomous Dependent Variables

The Mistake. The mistake here is using dichotomous dependent variables in regression, ANOVA, or any other linear model. Doing this can yield predicted probabilities greater than one or less than zero, heteroskedasticity, inefficient estimates, biased standard errors, and useless test statistics. Of more importance is that a linear model applied to these data is of the wrong functional form; in other words, it is conceptually incorrect.

Consider, for example, the influence of family income on the probability of a child attending college (measured as a dichotomous, college/no college realization). Hypothesizing a linear relationship in this situation implies that an additional thousand dollars of family income will increase the probability of going to college by the same amount regardless of the level of income. Surely this is not plausible. Imagine how little difference an additional $1000 would make for a family with $1,000,000, or for one with only $500, in annual family income. However, for a family at the threshold of having enough money to send a child to college, an additional thousand dollars would increase the probability of college attendance by a substantial amount. The relationship this implies is a steep regression line (representing a strong effect) at the middle range of income and a relatively flatter line (a weaker relationship) at the extremes. Extending this for all values of income produces the familiar logit or probit S-curve (for an application, see King, in press, 1986).

The solution is to model this relationship with a logit or probit (or some other appropriate non-linear) model. Scholarly footnotes to the contrary, it is not possible to do logit and regression analyses and have them "come out the same." What exactly is meant by "come out the same"? It would be meaningless to compare logit and LS coefficients, standard errors, or test statistics. There is no such thing as $R^2$ in logit analysis, and although there are analogous statistics, comparisons make little sense; in any case, logit analysis will always have a fit to the data as good as or better than that of LS estimation. The interpretation cannot be the same, since the underlying theoretical models are very different.

There is, however, one proper comparison between LS and logit estimation: between the fitted values of the two models expressed as proportions.¹⁶ A short-hand way to accomplish this for the logit model is by observing the first derivatives of the logit function, $bp(1 - p)$, where $b$ is the logit coefficient and $p$ is the initial probability. The problem is that unlike LS, the effect on y for an additional unit increase in X is not constant over the range of X values. This "variable effect" is represented as a nonlinear logit function.¹⁷

¹⁶ When the underlying probability for each observation remains within the 0.25 to 0.75 probability interval, the logit and LS models produce very similar predicted values. However, standard errors and test statistics have little meaning, although they have somewhat more meaning when probabilities are within the 0.25 to 0.75 interval. Of course, projections of the underlying theoretical model are always implausible with LS.

¹⁷ For the special case of only nominal level independent variables, Kritzer (1978a; see also 1978b) shows that a minimum chi-square estimation procedure becomes a very intuitive weighted least squares on tabular data. This article is also a good example of the point made in the previous section: that many different statistical models can be organized under the regression framework.

(3) Confusing Dichotomous Independent with Dichotomous Dependent Variables

In the factor analysis model, there are many observed variables from which the goal is to derive underlying (unobserved) factors. A common mistake is to view the observed variables as causing the factor. This is incorrect. The correct model has observable dependent variables as functions of the underlying and unobservable factors. For example, if a set of opinion questions asked of the political elite is factor analyzed, underlying ideological dimensions are likely to result. It is the fundamental ideologies that cause the observed opinions, and it is precisely because these ideologies are unobservable that we measure only the consequences of these ideologies.

This has two practical consequences for the researcher. First, variables like race, gender, and age should never be observed variables in a political scientist's factor analysis. It is doubtful that a researcher will ever find that party identification or ideology influences a person's gender or race. Second, since most factor analysis models are linear, they can no more handle dichotomous dependent (observed) variables than can regression analysis models. However, there are nonlinear factor analysis models, which are generalizations of the binary logit model, that may be appropriate in this situation (Christoffersson, 1975).

Reporting Replicable Results

I focus in this section on reporting results of statistical analyses. An erroneous reporting method, if not the most grievous offense, is certainly the most frustrating.
After all, if a mistake is made and reported, then it is sometimes possible to assess the damage. If minimum reporting standards are not followed, then the only conclusions that can be drawn are based on blind faith in or rejection of the author's interpretative conclusions and methodological skills. Tabular information conveys information that usually is not (and usually should not be) presented in the text. If the tables are not complete, then the report may be rendered useless.

I have concentrated in this paper primarily on regression analysis, the most frequently used explicit statistical model in political science research and the most frequently abused. As an example, therefore, consider reporting the results of a LS analysis. The required results should be (1) data descriptions (including the unit of measurement for each variable, the unit of analysis, and the number of observations and variables), (2) parameter estimates (regression coefficients and the estimated variance of the disturbances), and (3) the standard errors (measures of the precision of the coefficient estimates). For time-series analyses and certain types of cross-sectional analyses, tests of or searches for nonspherical disturbances (e.g., autocorrelation and heteroskedasticity) should also appear.¹⁸ If joint hypothesis tests are relevant, but not executed, relevant parts of the variance-covariance matrix of the regression coefficients (on the diagonal of which are the squares of the standard errors) should be included. Since they can be derived from the information presented, t-tests, F-tests, goodness-of-fit statistics, and marginal probability levels are optional.

¹⁸ Automatic use of the Durbin-Watson statistic in time-series data is better than nothing, but it is far from the best approach. A better procedure is to analyze the autocorrelation and partial autocorrelation functions of the residuals. Although full reporting of these would be excessive, a sentence or two summarizing any odd results would be very helpful.

One relatively common violation of these reporting rules is to replace any coefficient not meeting some significance level with "N.S." (Sometimes even the level at which these coefficients are not significant is not reported!) This procedure can be very misleading. In fact, I know of no political science research in which it makes sense to use a precise critical value. Any coefficient that is significant at the 0.05 level is as useful in this discipline as if it were 0.06 or 0.04. To delete and refuse to interpret a coefficient which is 0.01 or 0.001 above a significance level makes little sense. Even if the author has a reason for it, at least readers could be permitted to come to their own conclusions. My recommendation is to present the marginal probability level (the exact "level of significance") for each coefficient, regardless of what it is; the author can argue whatever he or she wants and readers would still be able to draw their own conclusions. Statistical significance and substantive importance have no necessary relationship.

There are many other examples of incomplete tables, misleading footnotes and useless appendices. The best general way to judge the adequacy of reporting is to determine if the analysis can be replicated. It, of course, need not be replicated, but in order to contribute theoretical and methodological information to its readers, a paper must report enough information so that the results it gives could be replicated if someone actually tried.

Remarks

This paper reviews some of the more common conceptual statistical mistakes in quantitative political science research. Although many mistakes are caught by perceptive colleagues, many more slip by. Those presented here are among the most systematically problematic. Too often, we learn each others' mistakes rather than learning from each others' mistakes. Fortunately, there are in each case plausible reasons for the initial mistaken "invention" or conceptual problem and a relatively painless solution to the problem.

In addition to the arguments given in the paper, there are two more general rules that should be applied to all political science data analyses.
First, we should concentrate on interpretable statistics. If the statistics are complicated, that is fine, as long as they can be translated into information that is meaningful to, and interpretable by, nonstatisticians.

Second, "getting a feel for data" is laudable, but presenting biased or incorrect results is not. Thus, we should try to use formal statistical models, about which much more is known. The problem with ad hoc solutions is that the same mistakes can occur in these as with formal statistics; however, we are much less likely to discover them. For example, political scientists are prone to doing a few cross-tabulations and arguing the point from there. Omitted variable bias, dichotomous dependent variables with linear models, and other specification issues are many times missed with this "method." What is often not realized is that these informal methods can usually be expressed in very simple formal statistical models. Their weaknesses then become immediately apparent.

If these two rules were followed, and adequate information were provided with which to assess the quantitative analyses, many future mistakes could be avoided.

Manuscript submitted 16 September 1985
Final manuscript received 18 November 1985

APPENDIX A
A PROOF THAT REGRESSION ON RESIDUAL (ROR) ESTIMATORS ARE BIASED

First, partition the coefficient vector as $b' = [b_1' \; b_2']$ and the vector of independent variables as $X = [X_1 \; X_2]$. Also let $Q = X'X$, $A = Q^{-1}X'$, and $e = My$ be the vector of residuals (where $M = I - XQ^{-1}X'$). Then $b$ in the full regression, equation 2, is the least squares (LS) estimator, where $b = Ay$.

Now consider the regression on residual (ROR) estimator. First, let $Q_{ij} = X_i'X_j$ for $i = 1, 2$, $j = 1, 2$, $A_i = Q_{ii}^{-1}X_i'$ for $i = 1, 2$, and $M_1 = I - X_1 Q_{11}^{-1} X_1'$. Then, calculate the coefficients and residuals with $b_1^* = A_1 y$ and $e_1 = M_1 y$ from equation 3. $b_1^*$ is the first ROR estimator. Then, from equation 4, regress $e_1$ on the second set of explanatory variables $X_2$ and get $b_2^* = A_2 e_1$, where $b_2^*$ is the second ROR estimator. Now let $b^{*\prime} = [b_1^{*\prime} \; b_2^{*\prime}]$. I will first prove that $b^* \neq b$.

$$
b = \begin{bmatrix} b_1 \\ b_2 \end{bmatrix}
= \begin{bmatrix} Q_{11} & Q_{12} \\ Q_{21} & Q_{22} \end{bmatrix}^{-1}
\begin{bmatrix} X_1' \\ X_2' \end{bmatrix} y
= \begin{bmatrix} b_1^* - Q_{11}^{-1} Q_{12} (X_2' M_1 X_2)^{-1} X_2' M_1 y \\ (X_2' M_1 X_2)^{-1} X_2' M_1 y \end{bmatrix}
$$

Substituting from equations 2 and 4 (so that $M_1 y = e_1 = X_2 b_2^* + e_2$, with $X_2' e_2 = 0$, and $Q_{11}^{-1} Q_{12} = A_1 X_2$):

$$
\begin{bmatrix} b_1 \\ b_2 \end{bmatrix}
= \begin{bmatrix} b_1^* - A_1 X_2 b_2 \\ (X_2' M_1 X_2)^{-1} (X_2' X_2) \, b_2^* \end{bmatrix}
$$

Then, rearranging terms and taking expected values, we have:

$$
E(b^*) = E \begin{bmatrix} b_1^* \\ b_2^* \end{bmatrix}
= \begin{bmatrix} \beta_1 + A_1 X_2 \beta_2 \\ (X_2' X_2)^{-1} (X_2' M_1 X_2) \, \beta_2 \end{bmatrix}
$$

Thus, both $b_1^*$ and $b_2^*$ are biased. The former represents standard omitted variable bias.¹⁹ There are also two special cases. If $X_1$ and $X_2$ are orthogonal (i.e., $X_1'X_2 = 0$), then $b^* = b$. Also, when $b_2 = 0$, then $b_1^* = b_1$. Finally, when $\beta_2 = 0$, $E(b^*) = \beta$. A similar proof can be found in Goldberger (1961). Furthermore, Goldberger and Jochems (1961) have shown for the bivariate case, and Achen (1978) for the multivariate case, that $b_2^*$ is an underestimate of $b_2$.

¹⁹ It is easy to see from this formulation that an omitted variable bias exists only when both (1) the sample coefficients ($A_1 X_2$) resulting from the regression of the omitted variable on the included variables are nonzero and (2) the parameter on the omitted variable ($\beta_2$) is nonzero (i.e., has some influence on y). There is no bias if either one, or their product, is zero.

APPENDIX B
THE RELATIONSHIP BETWEEN REGRESSION AND ANALYSIS OF VARIANCE

Note first that for the estimates of the model in equation 13, the following equalities hold (all sums run over $i = 1, \ldots, n$, with $n = n_1 + n_2$):

$$
\sum x = n_2 - n_1, \qquad
\sum x^2 = n_2 + n_1, \qquad
\sum y = n_2 \bar{y}_2 + n_1 \bar{y}_1, \qquad
\sum xy = n_2 \bar{y}_2 - n_1 \bar{y}_1
$$

Now, expressing $b_0$ and $b_1$ in terms of $\bar{y}_1$ and $\bar{y}_2$:

$$
b_1 = \frac{n \sum xy - \sum x \sum y}{n \sum x^2 - \left(\sum x\right)^2}
= \frac{(n_2 + n_1)(n_2 \bar{y}_2 - n_1 \bar{y}_1) - (n_2 - n_1)(n_2 \bar{y}_2 + n_1 \bar{y}_1)}{(n_2 + n_1)^2 - (n_2 - n_1)^2}
= \frac{\bar{y}_2 - \bar{y}_1}{2}
$$

Also,

$$
b_0 = \bar{y} - b_1 \bar{x}
= \frac{n_2 \bar{y}_2 + n_1 \bar{y}_1}{n_2 + n_1} - \frac{(\bar{y}_2 - \bar{y}_1)(n_2 - n_1)}{2(n_2 + n_1)}
= \frac{\bar{y}_2 + \bar{y}_1}{2}
$$

REFERENCES

Achen, Christopher H. 1977. Measuring representation: Perils of the correlation coefficient. American Journal of Political Science, 21 (November): 805-15.
Achen, Christopher H. 1978. On the bias in stepwise least squares. Unpublished manuscript.
Achen, Christopher H. 1979. The bias in normal vote estimates. Political Methodology, 6(3): 343-56.
Achen, Christopher H. 1982. Interpreting and using regression. Beverly Hills: Sage.
Blalock, Hubert M. 1967a. Causal inferences, closed populations, and measures of association. American Political Science Review, 61 (March): 130-36.
Blalock, Hubert M. 1967b. Path coefficients versus regression coefficients. American Journal of Sociology, 72 (May): 675-76.
Christoffersson, Anders. 1975. Factor analysis of dichotomized variables. Psychometrika, 40(1): 5-32.
Friedman, Stanford B., and Sheridan Phillips. 1981. What's the difference? Pediatric residents and their inaccurate concepts regarding statistics. Pediatrics, 68(5): 644-46.
Goldberger, Arthur S. 1961. Stepwise least squares: Residual analysis and specification error. Journal of the American Statistical Association, 56 (December): 998-1000.
Goldberger, Arthur S., and D. B. Jochems. 1961. Note on stepwise least squares. Journal of the American Statistical Association, 56 (March): 105-10.
Granger, C. W. J. 1969. Investigating causal relations by econometric models and cross-spectral methods. Econometrica, 37 (July): 424-38.
Gurel, Lee. 1968. Statistical sense and nonsense. International Journal of Psychiatry, 6(2): 127-31.
Hargens, Lowell L. 1976. A note on standardized coefficients. Sociological Methods and Research, 5 (November): 247-56.
Hendry, David. 1980. Econometrics-alchemy or science? Economica, 47: 387-406.
Huff, Darrell. 1954. How to lie with statistics. New York: Norton.
Kim, Jae-On, and G. Donald Ferree, Jr. 1981. Standardization in causal analysis. Sociological Methods and Research, 10 (November): 187-210.
Kim, Jae-On, and Charles W. Mueller. 1976. Standardized and unstandardized coefficients in causal analysis. Sociological Methods and Research, 4 (May): 423-38.
King, Gary. Forthcoming. 1986. Political parties and foreign policy: A structuralist approach. Political Psychology.
King, Gary, and Gerald Benjamin. 1985. The stability of party identification among U.S. senators and representatives. Paper presented at the annual meeting of the American Political Science Association, New Orleans.
Kritzer, Herbert M. 1978a. Analyzing contingency tables by weighted least squares: An alternative to the Goodman approach. Political Methodology, 5(4): 277-326.
Kritzer, Herbert M. 1978b. The workshop: An introduction to multivariate contingency table analysis. American Journal of Political Science, 22 (February): 187-226.
Leamer, Edward E. 1983a. Let's take the con out of econometrics. American Economic Review, 73 (March): 31-44.
Leamer, Edward E. 1983b. Model choice and specification analysis. In Z. Griliches and M. D. Intriligator, eds., Handbook of econometrics. Vol. I, New York: North-Holland.
Leamer, Edward E. 1985. Sensitivity analyses would help. American Economic Review, 75 (June): 308-13.
Lewis-Beck, Michael S. 1978. Stepwise regression: A caution. Political Methodology, 5(2): 213-40.
Sims, Christopher A. 1980. Macroeconomics and reality. Econometrica, 48 (January): 1-48.
Smith, Kim. 1983. Tests of significance: Some frequent misunderstandings. American Journal of Orthopsychiatry, 53(2): 315-21.
Tufte, Edward R. 1974. Data analysis for politics and policy. Englewood Cliffs: Prentice-Hall.