BusinessstatisticswithExcelandTableau Ahands-onguidewithscreencastsanddata StephenPeplow ©2014-2015StephenPeplow WiththankstothestudentswhohavehelpedmeatImperialCollegeLondon,the UniversityofBritishColumbiaandKwantlenPolytechnicUniversity. Contents 1.Howtousethisbook1 1.1Thisbookisalittledifferent 2 1.2Chapterdescriptions 2 2.VisualizationandTableau:telling(true)storieswith data9 3.Writingupyourfindings15 3.1Planofattack.Followthesesteps. 17 3.2Presentingyourwork 17 4.Dataandhowtogetit18 4.1Bigdata 21 4.2Someusefulsites 21 5.Testingwhetherquantitiesarethesame23 5.1ANOVASingleFactor 23 5.2ANOVA:withmorethanonefactor 29 6.Regression:Predictingwithcontinuousvariables... 31 6.1Layoutofthechapter 33 6.2Introducingregression 33 6.3Truckingexample 36 6.4Howgoodisthemodel?—r-squared 39 6.5Predictingwiththemodel 40 6.6Howitworks:Least-squaresmethod 41 6.7Addinganothervariable 42 6.8Dummyvariables 43 6.9Severaldummyvariables 49 6.10Curvilinearity 50 6.11Interactions 52 6.12Themulticollinearityproblem 58 6.13Howtopickthebestmodel 58 6.14Thekeypoints 59 6.15Workedexamples 60 7.Checkingyourregressionmodel61 7.1Statisticalsignificance 61 7.2Thestandarderrorofthemodel 64 7.3Testingtheleastsquaresassumptions 66 7.4Checkingtheresiduals 67 7.5Constructingastandardizedresidualsplot 68 7.6Correctingwhenanassumptionisviolated 71 7.7Lackoflinearity 71 7.8Whatelsecouldpossiblygowrong? 73 7.9Linearitycondition 73 7.10Correlationandcausation 74 7.11Omittedvariablebias 74 7.12Multicollinearity 75 7.13Don’textrapolate 75 8.TimeSeriesIntroductionandSmoothingMethods.. 76 8.1Layoutofthechapter 76 8.2TimeSeriesComponents 77 8.3Whichmethodtouse? 78 8.4Naiveforecastingandmeasuringerror 79 8.5Movingaverages 81 8.6Exponentiallyweightedmovingaverages 83 9.TimeSeriesRegressionMethods87 9.1Quantifyingalineartrendinatimeseries usingregression 87 9.2Measuringseasonality 89 9.3Curvilineardata 91 10. Optimization95 10.1Howlinearprogrammingworks 95 10.2Settingupanoptimizationproblem 96 10.3Exampleofmodeldevelopment. 97 10.4Writingtheconstraintequations 97 10.5Writingtheobjectivefunction 98 10.6OptimizationinExcel(withtheSolveradd-in)...98 10.7Sensitivityanalysis 100 10.8InfeasibilityandUnboundedness 102 10.9Workedexamples 102 11.More complexoptimization107 11.1Proportionality 107 11.2Supplychainproblems 109 11.3Blendingproblems 110 12. Predictingitemsyoucancountonebyone114 12.1Predictingwiththebinomialdistribution 115 12.2PredictingwiththePoissondistribution 118 13.Choice underuncertainty119 13.1Influencediagrams 119 13.2Expectedmonetaryvalue 121 13.3Valueofperfectinformation 123 13.4Risk-returnRatio 124 13.5Minimaxandmaximin 124 13.6Workedexamples 125 14. Accountingforrisk-preferences127 14.1Outlineofthechapter 128 14.2Wheredotheutilitynumberscomefrom? 130 14.3Convertinganexpectedutilitynumberintoacertaintyequivalent. 136 15. Glossary137 1. Howtousethisbook I began business life as an entrepreneur in business in Hong Kong. I ran trade exhibitions,importedcoffeefromKenya,and startedandoperatedtworestaurants, oneofwhich(unusuallyforHongKong)servedvegetarianfood.Thesebusinesses wereprofitable,butIwouldhavesavedmyselfagreatdealofstress,anddonebetter if I had used some business intelligence informed by statistics to improve my decision-making.Lookingback,IwishIhadbeenabletothinkthroughandwrite analysesontopicssuchastheseamongmanyothers: a.calculationofoptimalrestaurantstaffinglevels b.analysisofsalesovertimeandseasonalityintrends c.receiptspercustomerbyrestauranttypeandanalyzedstrengthandtypeof anydifferences d.predictedsalesdataforpotentialexhibitorswithvisualiza-tionsof various‘whatif’scenarios e.gaineddeeperinsightsfromvisualizingmydata f.wonovermorepartnersandinvestorswithbettervisualandwritten presentations I’msurethattherearemanymorepeoplelikeme,awarethattheyoughttobedoing morewiththedatatheycollectaspartofnormalbusinessoperations,butuncertain ofhowtogoaboutit. There is no shortage of textbooks and manuals, but these don’tseem to get to the hands-on applications quickly enough. This book is for peoplesuchasme,intwocomponents:theanalysis,findingouttheunderlyingstory fromthedata,andthenthepresentationofthestory. 1 1.1Thisbookisalittledifferent Mostbooksonstatsintroducestatisticaltechniquesonachapterbychapterbasis. Instead,I’vewrittenthisbooksothatitreflects the way many people learn. The chaptersarestructuredbythetypeofquestionyoumightwanttoask.Examplesof the typeofquestionsaresummarizedbelow,andatthebeginningof eachchapter. Thechaptersthemselvesdoesn’tincludemuchmathandtechnicaldetails.Instead theseareplacedtowardstheendofthebookinaglossary. ManyoftheExcelproceduresarelinkedtoscreencastspreparedbymetoillustrate that particular procedure. The data-sets used in the book are available from my Dropboxpublicfolder:justclickonthehyperlinkthatappearsnexttoeachworked example.Withthedata-setsopeninExcel,youcanfollowalongatyourownpace. (And then do the same with your own data). I have used Tableau for some visualizations,andyouwillfindlinkstomy workbooksandscreencasts. 1.2Chapterdescriptions Ifyou’rereadingthisbook,thenyouareverylikelyalreadyengagedinbusinessand wouldliketoknowhowtotakethatbusinesstonewheights.Takealookthroughthe chapterdescriptionsthat followandthengostraighttothatchapter.Ifyouthinkyou mightneedabitofastatisticsrefresher,lookthroughtheGlossaryinChapter16 first.AverygoodfreestatisticsOpenIntrotextbookisavailablehere¹ Chapter2introducesTableauPublic²whichisafreedatavisu-alizationtool.While Exceldoeshavegraphingtoolswhichareeasytouse,theresultscanlookalittle clunky.Tableauhelps ¹https://www.openintro.org/stat/²http://www.tableausoftware.com/public/ ustomergedatafromdifferentsourcesandcreateremarkablevisualizationswhich canthenbeeasilyshared.I’llbe illustratingresultsthroughoutthisbookwitheither Tableau,Excelorboth.One caution:ifyoupublishyourresultstothe web using TableauPublic,yourdataisalsopublished.Ifthisisaproblem,thereisanoptionto pay for an enhanced version of Tableau. The screenshotbelowshowsworkdone changesonlanduseintheDeltaregionofBritishColumbia.Byclickingonthelink, youcanopentheworkbookandalterthesettings.Youcanfiltertheyear(seetop right)andalsolandusetype.ThankstoMalcolmLittleforhisworkonthisproject. TableauworkbookforDeltaLandCoverChanges³ Tableaushowingchangesinlandcover Tableauhastraininganddemonstrationvideosavailableonits website, and there are plenty of examples out there. The screen casts which Tableau provides (available at their homepage) are probably enough. Where I have found some technique(suchasboxplots)particularlytrickyIhavecreatedscreencastsforthis book.Theimagebelowisdatawewilluseintheregressionchapter.Youruna ³https://public.tableau.com/views/DeltaCropCategories1996-2011/Sheet1?:embed=y&: showTabs=y&:display_count=yes truckingcompanyandfromyourlogbooksextractthedistanceanddurationand numberofdeliveriesforsomedeliveryjobs.Theimageshowsatrendline,with distanceonthehorizontalaxisandtimetakenontheverticalaxis.Thepointson thescatterplotare sizedbynumberofdeliveries.Youcanseethatmoredeliveries increasesthetime,asonewouldexpect.Usingregression,Chapter6,wewillwork outamodelwhichcanshowhowmuchextrayoushouldchargeforeachdelivery. HereisaYouTubefortruckingscatterplot⁴ Tableaushowingdistances,times,anddeliveries Chapter3.Writingupyourfindings.Inmostcases statistical analysis is done in ordertohelpadecision-makerdecidewhattodo.Iamassumingthatitisyourjobto thinkthroughtheproblemandassistthedecision-makerbyassemblingandanalyzing thedataonhis/herbehalf.Thischaptersuggestswaysinwhichyoumightwriteup yourreport,accompaniedbyvisualizationstogetacrossyourmessage.Thereare alsolinkstosomehelpfulwebsiteswhich ⁴https://youtu.be/oFACLZLJWZI discussthepreparationofslidesandhowtomakepresentations. Chapter4:Dataandhowtogetit.Ifyouareconsideringcollectingdatayourself, throughasurveyforexample,thenyou’llfindthischapteruseful.So-calledbigdata isahottopic,andsoIincludesomediscussion.Thechapteralsoincludeslinksto publiclyavail-abledatasetswhichmightbehelpful. Chapter 5 deals with tests for whether two or more quantities are the same or different.Forexample,youareafranchiseewiththree coffee-shops. Youwant to knowwhetherdailysalesarethesameordifferent,andperhapswhatfactorscause any difference that you find. Youwant to know whether there is a statistically significantcommonfactorornot,toeliminatethepossibilitythatthedifferenceyou seeoccurredpurelybychance.Perhapstheaverageageofthecustomersmakesa differenceinthesales?ANOVAusesverysimilartheorytoregression,thesubjectof thenextchapterandperhapsthemostimportantinthebook. Chapter 6: Regression is almost certainly the most important statis- tical tool coveredinthisbook.Inoneformoranother,regressionisbehindagreatdealof applied statistics. The example presented in this book imagines you running a trucking business and wanting to be able to provide quotations for jobs more quickly. It turns out thatifyouhavesomepastlog-bookdatatouse,such as the distance of various trips, the time that they took and how many deliveries were involved,anaccuratemodelcanbemade.Regressioncanfindtheaveragetimetaken foreveryextraunitofdistance,aswellas other variables such as the number of stops. This chapter shows how to create such a model and how to use it for prediction.Withsuchamodel,youcanmakequotationsreallyquickly.Thischapter alsocoversmorecomplexregressionandhowtogoaboutmodel-building. Otheruses:youwanttocalculatethebetaofastock,comparingthereturnsofone particularstockwiththeS&P500. Chapter7isabouttestingwhethertheregressionmodelswebuilt in Chapters 6 and 7 are in fact any good. Regression appears so easy to do that everybody does it, without checking the validity of the answers. Following the procedures in this chapter will help to ensure that your work is taken seriously. Thischapterismoreofaguidetothinkingcriticallyabouttheresultsother people might have. Is there missing variable bias? For example, you notice a strong correlationbetweensalesofice-creamsanddeathsbydrowning.Canyoutherefore saythatreducingice-creamsaleswillmakethewatersafer?Err–no.Themissingor lurking variable istemperature.Both drowning deaths and ice cream sales are a functionoftemperature,notofeachother. Chapter8isabouttime-seriesandshowshowwecandetecttrendsindataovertime, and make predictions. The smoothing methods, such as moving averages and exponentiallyweightedmovingaver-agescoveredinthischapterarefinefordata whichlacksseasonality andwhenonlyarelativelyshort-termforecast is needed. Whenyouhavelonger-rundataandformakingpredictions,thefollowing chapter hasthetechniquesyouneed. Chapter 9 describes the regression-based approach to time series analysis and forecasting.Thisapproachispowerfulwhenthereissomeseasonalitytoyourdata, forexamplesalesofTVsetsshowadistinctquarterlypattern(asdo umbrellas!). SalesofTVsetsshowingaquarterlyseasonality Usingregression we can detect peaks and troughs, connect them to seasons and calculatetheirstrength.Theresultisamodelwhichwecanusetopredictsalesinto thefuture. Chapter 10 is about optimization or making the best use of a limitednumberof resources.Thisishighlyusefulinmanybusinesssituations.Forexample,youneed to staff a factory with workers of different skills and pay-levels. There is a minimum amount of skillsrequiredforeachclassofworker,and perhaps also a minimumnumberofhoursforeachworker.Usingoptimization,youcancalculate thenumberofhourstoallocatetoeachworker. Chapter11Optimizationcanalsobeusedformorecomplex‘blend-ing’problems. Example:Yourunapaperrecyclingbusiness. Youtakeinpapersandotherfibers suchascardboardboxes.Howcanyoumixtogetherthevariousinputssothatyour outputmeetsminimumqualityrequirements,minimizeswastage,andgeneratesthe mostprofit? Another example: what mix or blend of investments would best suit your requirements? Chapter 12 concerns calculating the probability of items you can count one by one.Forexample,whatistheprobabilitythatmore thanfivepeoplewillcometotheservicedeskinthenexthalf-hour? What is the probabilitythatallofthenextthreecustomerswillmakeapurchase?Wecansolve theseproblemsusingthebinomialandthePoisson distributions. Chapter13concernschoiceunder uncertainty: if you have a choice of different actions,eachofwhichhasanuncertainoutcome,whichactionshouldyouchooseto maximize the expected monetary value?Afarmerknowsthepayoffshe/she will make from different crops provided the weather is in one state or another (sunny/wet) butatthetimewhenthecropdecisionhastobemade,heobviously doesn’t know what the actual state will be. Which crop should he plant? A manufacturerneedstodecidewhethertoinvestinconstructinganewfactoryata timeofeconomicuncertainty:whatshouldhedo? Chapter 14 adds the decision-maker’s risk profile to his orher decisionprocess. Mostpeopleareriskaverse,andarewillingtotradeoffsomeriskinexchangefor certainty. This chapter shows how to construct a utility curve which maps risk attitudes,andthenprioritizethedecisionsintermsofmaximumexpectedutility. Chapter15isaGlossaryandcontainssomebasicstatisticsinforma-tion,primarily definitions of terms that the book uses frequently and which have a particular meaninginstatistics(forexample‘population’).TheGlossaryalsodiscusseswhy theinferentialtech-niquesusedinstatisticsaresopowerful,allowingusto make inferencesaboutapopulationbasedonwhatappearstobeaverysmallsample. UnderEforExcelintheGlossary,you’llfindsomelinkstoscreencastsonsubjects notdirectlycoveredinthisbook,butwhichyoumightfindhelpful. 2. VisualizationandTableau: telling(true)storieswithdata In this book, we’ll work on gaining insights from data by visual- ization and quantitativeanalysis,andthenpresentingtheresultstoothers.Recentlyapowerful newtoolcalledTableauhasbecomeavailable.TableauPublic¹isafreeversion.But becarefulwithyourdatabecausewhenyoupublishtotheweb,asyoumustdowith TableauPublic,thenyourdataalsobecomesavailabletoanybody.Ifyourequirethat yourdatabeprotected,youcanpayfortheprivateversion.Tableau9hasjustbeen madeavailable. Many of the worked examples in this book are accompanied by a link to a completed Tableau workbook, showing how you could have used Tableau to present your findings. Tableau cannot perform easily the more technical hypothesis-testingaspectsof statisticssuchasregression,butitcanhelpyoutoget yourpointacrossclearly.TableauhasalsoprovidedausefulWhitePaper²onvisual bestpracticeswhichiswellworthreading In business intelligence we are generally interesting in detecting patterns and relationships.Thismightseemobvious,becauseashumanswearealwayslooking forthesephenomena.I’djustliketoaddthattheabsenceofapatternorrelationship mightbejustasinformativeasfindingone.Anexcellentfirststepistotakealookat yourdatawithanexpositorygraphbeforemovingontomoreformaldataanalysis. Belowareexamplesofthetypesofchartsorgraphsmostcommonlyusedineither examiningdatafirstoff, ¹http://www.tableausoftware.com/public/ ²http://www.tableausoftware.com/whitepapers/visual-analysis-everyone 9 justto‘seehowitlooks’.Checkingan expository graph shows up problems that mightthrowyou off later: missing data, outliers or, more excitingly, unexpected andintriguingrelationships. Histograms Histogramsbreakthedataintoclassesandshowthedistributionofthedata.Which classes(sometimesknownasbins)aremostcommon.Thehistogramshowsusthe ‘shape’ofthedata.Aretheremanysmallmeasurements,ordoesthedatalookas though it is normally distributed and consequently mound-shaped. See DistributionintheGlossaryfor more on this. The plot below is of deaths in car accidentsonaweeklybasisinthe United States over the period 1973-1978. The histogramshowsonlythedistributionandnotthesequence. From the histogram only, wecould saythatthemost common number of deaths inaweekis between8000 and 9000, be- cause the bars of the histogram are highest there. They are high- est because they count thenumberof times in the time-period that there have been (for example 8000 deaths occurredin15weeks). What the histogram doesn’tshowis that deathshaveactuallybeendecreasingsince Histogramofcardeaths,reportedweekly about1976.Thereisprobablyasimpleexplanationforthis:compul-soryseatbelts perhaps?Thetake-homefromthisisthatexpositorygraphsarereallyeasytodo, and that you should try different methods with the same data to tease out the insights. Cardeathsovertime Scatterplots Themostcommonandmostusefulgraphisprobably the scatter- plot. It is used whenwehavetwocontinuousvariables,suchastwoquantities(weightofcarand mileage)andwewanttoseetherelationshipbetweenthem.Thescatterplotisalso usefulforplottingtimeseries,wheretimeisoneofthevariables. YoucanuseExcelforthis,andalsoTableau.InExcel,arrangethecolumnsofyour datasothatthevariablewhichyouwantonthehorizontalxaxisistheoneonthe left.Thisistheindependentvariable.Puttheothervariable,thedependentvariable, ontheverticalyaxis.Itisusualpracticetoputthevariablewhichwethinkisdoing theexplainingonthexaxisandtheresponsevariableon theyaxis.HereisascatterplotofEuropeanUnionExpenditureper student as a percentageofGDP: EU%ofGDPspentonPrimaryEducation There is a gradual increase over the years 1996 to 2011 as one would expect; educational expenditures tend to remain a relatively fixed share of the budget. However, there was a blip in 2006 which is interesting. Because the data is percentageof GDP, the drop might have been caused by an increase in GDP or alternativelybyanactualeducationalspendingdecrease.We’dneedtolookatGDP figuresfortheperiod.IgotEuropeanGDPfigurespercapitaandusingTableauput themonthesameplot,withdifferentaxesofcourse.Theresultisbelow EuropeanGDPandPrimary Expenditures GDPdroppedaround2008/2009becauseoftheworld financialcrisis.Theblipin primaryexpendituresisstillunexplained. Maps Tableaumakesthedisplayofspatialdatarelativelyeasy,providedyoutellTableau whichofyourdimensionscontainthegeographicalvariable.Theplotbelowshows changesingreenhousegas emis-sionsfromagricultureintheEuropeanUnion.I linkedworldbankdatabycountry.Theworkbookishere³. ³https://public.tableausoftware.com/views/EuropeanGHGfromAgriculture20102011/Sheet1?:embed=y&:display_count=no Unfortunately, not all geographic entities are available for linkage in Tableau. States and counties in the United States are certainly available and provinces in Canada,aswellascountriesinEurope.TherearewaystoimportArcGISshapefiles intoTableau.Ashape-fileisalistofverticeswhichdelimitthepolygonsthatdefine thegeographicalentity. GHGemissionsfromagricultureintheEuropeanUnion 3. Writingupyourfindings Inchapter1,ImentionedsomeofthequestionsIwishedIhadbeenabletoanswer whenIwasoperatingabusiness.NowIwanttosuggestwaysinwhichyoumight structureyourresponseto suchquestions.Theactualanswerscomewhenyou’ve donethestatisticalwork(carriedoutinthefollowingchapters),butitmighthelpto haveatleastastructureonwhichyoucanbeginbuildingyourwork. Beforeyoudoanyworkontheproblem,getstraightinyourmindexactlywhatyou arebeingaskedtodo.Youneedtobeclearandsodoesthedecision-makertowhom youaregoingtosubmityourfindings.Hereisonewaytodothis: Copydownthekeyquestionthatyouarebeingaskedinexactlythewaythatithas beengiventoyou.Forexample,letussaythatyouhavebeenaskedtoanswerthis admittedly tough question: ‘What factorsareimportantfor the success of a new supermarket?’ Nowkeeprewritingitinyourownwordssothatitbecomes as clearaspossible. Makesurethateachandeverywordisclear.Whatdoes‘success’actuallymean?— inthebusinesscontextprobably‘mostprofitable’.Whatdoes‘new’mean?Isthisa completelynewsupermarketoranewoutletofanexistingbrand?Youmightend upwith:‘whenweareplanningalocationforanewoutletforourbrand,whatfactors contributemosttoprofitability?’Doingallthishelpsyoutoidentifythestatistical methodmostsuitableforthetask.Youalsocanincludeherethehypotheses(ifany) thatyouwillbetesting. Hereisasuggestedsectionorderforyourreport.However,thisprobablywon’tbe theorderofthework.TheorderinwhichyoudothestatisticalworkisinSection3, yourplanofattack.Sotheorder 15 ofdoingtheactualworkisthatyoucarryouttheplanofattack,andthenwriteup yourresultsfollowingthesectionbysectionsequenceIamgivingyounow. Section1.Statethequestiontobeansweredclearlyandsuccinctly,asresultofthe rewriting you did above. Doing this makes sure that you clearly understand the problemand what is being asked of you. Writing out the question and stating it clearlyrightatthebeginning of your report also ensures that the decision-maker (thepersonwhoposedthequestioninthefirstplace)andyouunderstandeachother. Youcanalsowriteinthehypothesiswhichyouaretesting. Section2.Provideanexecutivesummaryoftheresults.Thisshouldbeperhapssix lineslongandcontainthekeyfindingsfromyourwork.Don’tputinanytechnical words,jargonorcomplicatedresults.Justenoughsothatabusypersoncanskim throughitandgetthegeneralideaofyourfindingsbeforegoingmoredeeplyinto yourhardwork. Section3.Outlineyourplanofattack ,describinghowyou have approachedthe problem.Youalsocanincludeherethehypotheses(ifany)thatyouwillbetesting. Atypicalplanofattackfollowslateroninthischapter. Section4.Data.Provideabriefdescriptionofthedatathatyouhaveused.Thesource ofthedata, how you found it, whether you think it is reliable and whether it is sufficientforthetask.Itisagoodideatoincludesummarystatisticsatthispoint. This includes information such as the number of observations, and what the variablesare. Section5.Statisticalmethodology.Describethestatisticalmethod-ologywhichyou are going to use, for example ANOVAto test whetherornotthemeansare the same. Section6.Carryoutthetests,makingsuretoincludeadescriptionofwhetherornot thehypothesescanberejected.Thisis thecentralpartofthereport.Justmakesure tofocusonansweringthequestion. Section7.Writeaconclusionwhichdescribestheresults of the test and how the results answer the question that was asked. You can also include here any shortcomingsinthedataorinyourmethodologywhichmightaffectthevalidityof the results. The conclusion is very important because it ties together all the previoussections.HereyoucouldincludealinktoaTableau‘story’asanadditional oralternativewayofpresentingresults. Section8.References/end notes or further information, especially on sources of data, should end the document. If you want to make your document look really good, and you have lots of references, consider using the Zotero plug-in for Firefox.Itisanexcellentfreewaytomanageyourbibliography. 3.1Planofattack.Followthesesteps. a.Identificationofthestatisticalmethodthatyouwilluseandwhyyou chosethosemethods b.Whatdataisrequired,andwherecanitbefound? c.Conductthestatisticaltests d.Discusswhethertheresultsarereliable/answerthequestion.(Ifnot,start again!) 3.2Presentingyourwork While it is probably best to concentrate on written work because the decisionmakermightwanttoreadyourworkindetail, and discuss it with colleagues, a Tableau presentation is an excellent way of making your point over again. See Chapter2formoreonvisualizationandTableau. Here is a link to some excellent notes by Professor Andrew Gelman on giving researchpresentations¹ ¹http://andrewgelman.com/2014/12/01/quick-tips-giving-research-presentations/ 4. Dataandhowtogetit Inthischapter,we’lllookatthedatacollectionprocess,theprob-lemsthatmight accompanypoorlycollecteddata,discussso-called bigdataand provide links to someusefulpublicsourcesofdata. As you might expect, this book works through the analysis of numbers — quantitativedata—butitisimportanttonotethattheanalysisofqualitativedataisa rapidlygrowingarea.Unfortunately theanalysisofqualitativedataisbeyondthis bookandthesoftwareavailabletous. Quantitativedatacomesfromtwomainsources. Primarydataiscollectedbyyou orthecompanyyouareworkingfor.Forexample,themarketresearchyoudotofind outthepossiblemarketshareforyourproductprovidesprimarydata.Primarydata includes bothinternalcompanydataanddatafromautomatedequipmentsuchas website hits. Collecting data is expensive and highly proprietary. It is therefore unlikelytobepublishedandavailableoutsidetheenterprise. By contrast, Secondary data is plentiful and mostly free. It is collected by governmentsofallsizes,andalsobymany non-governmentorganizationsaswell. There is an increasing trend towards the liberation of data under various open governmentinitiatives.Governmentdataisusuallyreliable,butbesuretocheckthe accompanying notes which warn of any problems, suchaslimitedsamplesizeor changeinclassificationorcollectionmethods overtime.Thereare some links to secondarydataattheendofthischapter. 18 Experimentaldesign Experiments don’t necessarily have to be conducted in laboratories by people in whitecoats:asurveyinshoppingmallisalsoaformofexperiment,asisanalyzing the results of your favorite cricket team. There are two different design types: experimental and observational . The k e ydifference is the amount of randomizationthatisbuiltintotheexperiment.Ingeneral,morerandomizationis good,becauseithelpstoremovetheinfluenceoffixedeffects,suchasthequalityof thesoilinaparticularlocation. Here’sanexampletomakeclearthedifference.Imaginea re-searcherwantingto test the effect of a new drug on mice. In an experimental design, the mice are examined before the drug is administered, and then the drug is applied to a treatmentgroupofmiceandacontrolgroup.Thecontrolgroupreceivesnodrugs. Itsjobistoactasareferencegroup.Allthemicearekeptinidenticalconditionsapart fromtheapplicationofthedrug.Theallocationofmicetothetreatmentgroupand tothecontrolgroupisentirelyrandom. While we can (and do) carry out drug experiments on mice, itwould clearly be verywrongtotrytodothesamewithhumansassubjects.Insteadwecollectdataor observationsandthenanalyzethoseobservationslookingford i ff e r e n c e s . Tocompensateforthelackofrandomization,wecontrolforob-serveddifferences byincludingasmanyrelevantvariablesaspossi-ble.IfweknewthatpersonXhad receivedaparticulardrugandhaddevelopedaparticularcondition,wewouldwant tocomparepersonXwithsomebodyelsewhohadnotdevelopedthat condition. Relevantvariablesthatwewouldwanttoknowmightbe age,genderandpossibly pre-existinghealthconditions.Includingthese variables reduces fixed effects and allowsustoconcentrateontheeffectofthedrug. Problemswithdata It is obvious that to be credible, your analyses must be based on reliable data. Problemswithdataareusuallyconnectedtopoorsamplingandexperimentaldesign techniques,especially: Samplesizetoosmall .Therelativesizeofthesampleto thepopulationusually doesnotmatter.Itistheabsolutesizeofthesamplethatcounts.Youusuallywantat leastseveralhundredobservations. Population of interest not clearly defined. It is clear that we need to take a samplefromapopulation,butwhatexactlyisthepopulation?Here’sanexample. Youwanttosurveyshoppersinashoppingmallregardingyournewproduct.Butof thepeopleinsidethemall,whoexactlyareyourpopulation?Peoplejustentering, peoplejustleaving,peoplehavingtheirlunchinthefoodcourt? Singles, couples, elderlypeopleortheteenagershanging aroundoutsidethedoor?Youcanseethat pickinganyoneofthesegroupsonitsownwillleadtoabiasedsample. Non-response bias . Many people don’t answer those irritating telephone calls which come in the evening because they’re busy with dinner. As a result, only answers from those who do choose to answer the survey are counted. Those respondentsmostlikelyaren’trepresentativeof the population. Perhaps they live aloneordonothavetoomuchtodo.I’mnotsayingthattheyshouldnotbe in the sample,justthatincludingonlythosewhodorespondmaybiasyoursample. Voluntaryresponsebias .Ifyoufeelstronglyaboutanissue,then you are more likelytorespondthanifyouareindifferent.That’ssimplyhumannature.Asaresult, thesurveyresultswillbeskewedbytheviewsofthosewhofeelmostpassionately. Thisishardlyarepresentativesamplebecausethestrongly-heldviewsdrownoutthe moremoderate voices. 4.1Bigdata Primarydatafrequentlycomesfromautomatedcollectiondevices,suchasscanners, websites,socialmedia,andthelike.Thevolumeofsuchdataisenormous,andis aptlycalledbigdata.Bigdataisthetermusedtodescribelargedatasetsgeneratedby traditionalbusi-nessactivitiesandfromnewsourcessuchassocialmedia.Typical big data includes information from store point-of-sale terminals, bank ATMs, FacebookpostsandYouTubevideos. One of the apparently attractive features of big data is simply its size, which supposedlyenablesdeeperinsightsandrevealsconnectionswhichwouldnotappear insmallersamples.Thisargu-mentneglectsthepowerofstatistics,andinparticular inferential statistics. A small sample, properly collected, can yield superior insightstoaverylargepoorlycollectedsample.Thinkofitthisway:whichisbetter: averylargesampleinwhichalltherespondentsareinthesameage-groupandofthe samegender;orasmalleronewhichmoreaccuratelyreflectsthepopulation? 4.2Someusefulsites YoucanofcourseeasilyjustGooglefordata,orlookatthesemorefocusedsites: Gapminderdata:freetousebutbesuretoattribute¹Worker employmentandcompensation² Interestingandwide-ranginghistorical data³Google PublicData⁴ ¹http://www.gapminder.org/data/²http://www.bls.gov/fls/country/canada.htm ³http://www.historicalstatistics.org/ ⁴http://www.google.com/publicdata/directory InternationalMonetaryFund⁵ FoodandAgricultureOrganisation⁶UnitedNations Data⁷ TheWorldBank⁸ ⁵http://www.imf.org/external/data.htm ⁶http://www.fao.org/statistics/en/ ⁷http://data.un.org/ ⁸http://databank.worldbank.org/data/home.aspx 5. Testingwhetherquantitiesarethe same This chapter concerns testing whether the population means of two or more quantitiesarethesameornot:andiftheyareinfactdifferent,whetheranyvariable canbeidentifiedasbeingassociated with the difference. The test we will use is ANOVA,(AnalysisofVariance).ThetestwasdevelopedbytheBritishstatistician SirRonaldFisher,andtheF-testwhichANOVAusesisnamedinhis honor.Fisher alsodevelopedmuchofthetheoreticalworkbehindexperimentdesignduringhis timeattheRothamstedResearchStationinEngland. 5.1ANOVASingleFactor ThemoststraightforwardapplicationofANOVAiswhenwesimply want to test whetherornottwoormoremeansarethesame.Inthisworkedexample,wehave threedifferenttypesofwheatfertilizer(Wolfe,WhiteandKorosa)andwewouldlike toknowwhethertheirapplicationproducesequalordifferentyields. AsthesectiononexperimentaldesigninChapter3emphasized,theexperimentmust be designed to isolate the effect of the fertilizer, and this is achieved by randomization. To control for differences in site-specific growing qualities, we selectplotsoflandwhichareassimilaraspossible:exposuretosunlight,drainage, slopeandotherrelevantqualities.Werandomlyassignfertilizerstoplots,andthe yieldsaremeasured.Thefirstfewlinesofatypicaldatasetappearbelow. 23 Thefertilizerdataset There are two columns in the dataset: the factor , which is the name of the fertilizer,andtheyieldattributedtoeachplot.Inthiscase,eachfertilizerwastested onfourplots,providing3x4=12observations.Later,whenusingExceltorunthe ANOVA test, it will be necessary to change the format so that the yields are groupedundereachfactor. Agoodfirststepistovisualizethedata.Hereisaboxplotdrawnwith Tableau. Fertilizer boxplot Thedatasetishere:fertilizerdataset¹ andhereisaYoutubeofcreatingaboxplotinTableau². ¹https://dl.dropboxusercontent.com/u/23281950/fertilizer.xlsx ²http://youtu.be/3QohthWXp1M I’llmakeafrankadmissionrightnow:ittookmethebestpartofamorningtogetthe techniqueofdrawingaboxplotinTableaudown, so hopefully this YouTube will helpyoutoavoidspendingsomuchtimeonthis. The boxplot reveals these useful measurements: the smallest and largest observation,ortherange.Themedian,whichisthehori-zontallinewithinthebox; andthe25%quartile(upperlineofthebox;and75%quartile,lowerlineofthebox. Therefore50%(75-25)of thedataarecontained within the box. The ‘whiskers’ markoffdata which are outliers, meaning that any datapoint which is outside a whiskerisanoutlier.Thisdoesn’tmeantosaythatitissomehowwrong:perhaps someofthemostinterestingdiscoveriescome fromlookingatoutliers.However, an outlier might also be the result of carelessdataentryandshouldtherefore be checked.Itisclearfromtheplotsthattheyieldsarebynomeansthesame. Inthisexample,wearemeasuringonlyonefactor :theeffectofthefertilizeronthe meanyieldofeachplot,andsothetestwewanttoconductissinglefactorANOVA with a completely randomized design. It is completely randomized because the allocationoffertil-izertoplotwasrandom.Wewantanydifferencestobeduetothe fertilizerandthefertilizeralone. Becauseourtestiswhetherthemeansarethesameordifferent,thehypothesisis: Ho : µ 1= µ 2= µ 3= .. =µ n withthealternativehypothesisthatnotallthemeansarethesame. That isn’t the sameassayingthattheyarealldifferent;justthatatleastoneisdifferentfromthe others. Therejectionrulestatesthatifthepvaluewhichcomesoutofthetestissmaller than0.05,thenwerejectthenullhypothesis.If we reject the null, then we must acceptthealternativehypothesis. WetestthehypothesisusingtheANOVASingleFactortoolwithinDataAnalysis. First,thetwo-columnstructureofthedata hastobetransformedintocolumnsfor eachofthethreefertilizers.ThatiseasilyaccomplishedusingthePIVOTTABLE function.Thisyoutube³takesyouthroughtheprocessofchangingthestructureof thedataandrunningtheANOVAtest.Theresultarebelow. ANOVAoutputforfertilizertest Thekeystatisticisthep-value.Itismuchsmallerthan0.05andsowerejectthenull hypothesis.Themeansarenotallthesame.Theyaredifferent.Thesummaryoutput tellsusthatWhitehasthelargestmeanyieldandthisagreeswiththeboxplot.Inthis case,separationofresultsintoarankingofyieldisrelativelyeasybecausetheyare sodistinct.UnfortunatelyExcellacksawayofeasilytestingwhetheranyotherpair arethesameordifferent.Allwecansayforsureisthattheyarenotallthesame. Here is a slightly more complicated and realistic example.You are designing an advertisingcampaignandyouhavemodelswithdifferenteyecolors:blue,brown, andgreen,andalsooneshotin ³http://youtu.be/wevrSWYBl8U whichthemodelislookingdown.Youmeasuretheresponsetoeach arrangement. Doestheeyecoloraffecttheresponse?Thedatasetiscalledadcolor⁴.Thefirstfew linesarehere: Thefirstfewlinesoftheadcolordataset Thecoloristhefactor,andwe’llneedtousePIVOTTABLEtotabulatethedataso thattheeyecolorbecomesthecolumns.I’vedonethathere…. Theresultis Theeyecolorresults ⁴https://dl.dropboxusercontent.com/u/23281950/adcolour.xls Thepvalueis0.036184,smallerthan0.05.Asaresultwe canstatethatdifferent colors are associated with different responses. Looking at the summary output, greenhasthehighestaverageat3.85974,sowecouldprobablyselectgreen. 5.2ANOVA:withmorethanonefactor TheFertilizertestaboveshowedhowtotesttheeffectofasinglefactor.TheANOVA resultsshowedthatthemeanswere not the same. By inspecting the boxplot and alsotheExceloutput,itisclearthatthevarietyWHITEhasahigheryield.What might be interesting is finding whether another factor also has a statistically significanteffectonyields,andwhetherthetwofactorsinteractedtogether. ThecaseinquestioninvolvespreparationfortheGMAT,anexamrequiredbysome graduate schools. The GMAT is a test of logical thinking and is therefore not dependentonspecificpriorlearning.Weknowthetestscoresofsomeapplicants, andwhetherthosestudentscamefromBusiness,EngineeringorArtsand Sciences faculties.Thestudentshadalsotakenpreparationcourses,ranging from a 3-hour review to a 10-week course. The question is: did taking the preparation course matter;anddidfacultymatter?Here wehavetwofactors:facultyandpreparation course.Thedataisalreadyarrangedincolumnsandsowecangostraightinwitha two-waywithreplication.Lookatthedata⁵.Itisreplicatedbecause there are two setsofobservationsforeachpreparationtype.Theoutputishere: ⁵https://dl.dropboxusercontent.com/u/23281950/testscores.xlsx TestscoresANOVAoutput Wehavethepreparationtypeintherows,andtherelevantpvaluethereis0.358,so wefailtorejectthenull.Wecannotsaythatthereisanydifferenceinscoresasa result of preparation course type; in other words, there is no effect on scores resultingfrompreparationcoursetype.However,forfaculty(incolumns) thereis distinctdifference,withapvalueof0.004.Fromthesummary output, it looks as thoseintheEngineeringfacultyhadthehighestscore. 6. Regression:Predictingwith continuousvariables This chapter is about discovering the relationships that might or (equally well) mightnotexistbetweenthevariables in the dataset ofinterest.Here’satypicalif rathersimplisticexampleofregression in action: the operator of some delivery truckswantstopredictthe average time taken by a delivery truck to complete a givenroute.Theoperatorneedsthisinformationbecausehechargesbythehourand needstobeabletoprovidequotationsrapidly.Thecustomerprovidesthedistance tobedriven,andthenumberofstopsenroute.Thetaskistodevelopamodelsothat the truck operator can predict the time taken given that information. Regression provides a mathematical ‘model’ or equation from which we can very quickly predictjourneytimesgivenrelevantinformationsuchasthedistance,andnumber ofstops.Thisisextremelyusefulwhenquotingforjobsorforauditpurposes. Weknow the size of the input variables—the distance andthedeliveries, but we don’tknowtherateofchangebetweenthemandthedependentorresponsevariable: whatistheeffectontimeofincreasingdistancebyacertainamount?Oraddingone morestop?Usingthetechniqueof regression ,wemakeuseofa setofhistorical records, perhaps the truck’s log-book, to calculate the average time taken by the trucktocoveranygivendistance.Aswithanyattempttopredictthefuturebasedon thepast,thepredictionsfromregressiondependonunchangedconditions:nonew roadworks(whichmightspeedupordelaythejourney),nochangeintheskillsof thedriver. 31 In the language of regression, the time taken in the truck example is the response or dependent variable, and the distance to be driven is the explanatory or independent variable. While we will only ever have just one responsevariable,thenumberof explanatoryvariablesisunlimited.Thenumber ofstopsthetruckhastomakewillimpactjourneytime,aswillvariablessuchasthe ageofthetruck,weatherconditions,andwhetherthejourneyisurbanorrural.Ifwe know these variables, we also can include them in the model and gain deeper insightsintowhatdoes(andequallyimportantdoesnot)affectjourneytime. Regressionmodelsareusedprimarilyfortwotasks:toexplaineventsinthepast; and to make predictions. Regression providescoefficientswhichprovidethesize anddirectionofthechangeintheresponsevariableforaoneunitchangeinoneor moreoftheexplanatory variables. Regression makes a prediction for how long a truck will take to make a certain numberof deliveries and also states how accurate the prediction will be. With a regression model in hand, we can make quotations accurately and quickly. In addition,wecandetect anomaliesinrecordsandreportsbecausewecancalculate howlongajourneyshouldhavetakenundervariouswhat-ifconditions. Here I have used a straightforward example of calculating a timeproblemfora trucking firm, but regression is much more powerful than that. Some form of regressionunderliesagreatdealofappliedresearch.Whenyoureadaboutincreased riskduetosmoking(orwhateverelseisthelatestdanger)thatriskwascalculated with regression. In this chapter, we’ll calculate the differencebetween usedand newmarioKartsalesonebay,estimatestrokeriskfactors,andgenderinequalitiesin pay,allwithregression. Regressionisnotverydifficulttodo,buttheproblemis thateverybodydoesit. Ordinary Least Squares (OLS), linear regression’s proper name, rests on some assumptionswhichshouldbecheckedforvalidity,butoften aren’t. We cover the assumptions,andwhat todoifthey’renotmet,inthefollowingchapter.Thischapterisaboutthehands-on applicationsofregression. 6.1Layoutofthechapter The chapter begins with some background on how regression works. We’ll illustratethetheorywiththetruckingexamplemen-tionedabove,beforegoingon toaddingmoreexplanatoryvariablestoimprovetheaccuracyoftheprediction. Particularly useful explanatory variables are dummy variables, which take on a categoricalvalues,typicallyzeroor1.Forexample, wecouldcodeemployeesas maleandfemale,anddiscoverfromthiswhetherthereisagenderdifferenceinpay, andcalculatetheeffect.Inaworkedexample in the text, we analyze the sales of marioKartonebay,andusedummyvariablestofindtheaveragedifferenceincost betweenanewandausedversion. Sometimestwovariablesinteract:higherorlowerlevelsinonevariablechangethe effectofanothervariable.Inthetext,weshowthattheeffectofadvertisingchanges asthelistpriceoftheitemincreases. Finally,we’lldiscusscurvilinearity .Therelationshipbetweensalaryandexperienceis non-linear.Aspeoplegetolderandmoreexpe-rienced,theirsalariesfirstmove quicklyupwards,andthenflattenoutorplateau.Wecancapturethatnon-linearfunction tomakepredictionmore accurate. 6.2Introducingregression Atschoolyouprobablylearnedhowto calculate the slope of a line as‘riseover run’.Let’ssayyouwanttogotoParisforavacation.Youhaveuptoaweek.There aretwomainexpenses,theairfare and the cost of accommodation per night. The airfareisfixedand staysthesamenomatterwhetheryougoforonenightorseven.Thehotel(plusyour mealchargesetc)is$200pernight.Youcouldwriteupasmalldatasetandgraphit likethis: TotalParistripexpenses Theequationforis:Totalexpenses=1000+200*Nights Basically, you have just written your first regression model. Themodelcontains twocoefficientsofinterest: • theinterceptwhichisthevalueoftheresponsevariable(cost)whenthe explanatoryvariableiszero.Heretheinterceptis $1000.Thatistheexpensewithzeronights.Itisthepointontheverticaly axiswherethetrendlinecutsthroughit,wherex=0.Youstillhavetopaythe airfareregardlessofwhetheryoustayzeronightsormore. • theslopeof the line which is $200. For every one unit increase in the explanatoryvariable(nights)thedependent variable (total cost) increases bythisamount. In regression, we usually know the dependent variable and the independent variable, but we don’t know the coefficients. Weuse regressionto extract those coefficientsfromthedatasothatwecanmakepredictions. Howdoweknowwhethertheslopeoftheregressionlinereflectsthedata?Trythis thoughtexperiment:ifyouplottedthedataand thetrendlinewasflat,whatwouldthatmean?Norelationship.You couldstayin Parisforever,free!Aflatlinehaszeroslope,andsoanextranightwouldnotcost anymore. More formally, the hypothesis testing procedure tests thenullhypothesisthat the slopeiszero.Ifwecanrejectthenull,thenwearerequiredtoacceptthealternative hypothesis,whichisthattheslopeisnotzero.Ifitisn’tzero(thetrendlinecould slopeupordown)thenwe have something to work with. We test the hypothesis usingthedatafoundfromthesample,andExcelgivesusapvalue.Ifthepvalueis smallerthan0.05,werejectthenullhypothesis.Ifwerejectthenullwecansaythat thereisaslopeandtheanalysisisworthwhile. The Paris example had only one independent variable (number of nights) but regression can include many more, as the trucking examplebelowwillshow.If thereismorethanonevariable,wecannotshowtherelationshipwithonetrendline intwodimen-sions.Instead,therelationshipisahyperplaneorsurface.Theplot below shows the effect on taste ratings (the dependent variable) of increasing amountsoflacticacidandH2Sincheesesamples. 3Dhyperplaneforresponsestocheese Hereisanotherexample.Wehavethedataonthesizeof thepopulationofsome towns and also pizza sales. How can we predict sales given population. Pizza YouTubehere¹ 6.3Truckingexample The records of some journeys undertaken are available in an Excel file named ‘trucking’². Takealookatthedata.AsImentionedatthebeginningofthebook,itisalwaysa goodideatobeginwithatleastasimpleexploratoryplotofthevariablesofinterest inyourdata.WithExcelwecanquicklydrawaroughplotsothatwecanseewhatis goingon.Whatwewanttodoispredicttimewhenweknowthedistance.The ¹https://www.youtube.com/watch?v=ib1BfdVxcaQ ²https://dl.dropboxusercontent.com/u/23281950/trucking.xlsx scatter plot below shows time as a function of distance. YouTube: scatterplot of trucking data³ Trucking scatterplot Time is the dependent variable and distance is the independent variable. The independentvariablegoesonthehorizontalx axis,thedependentvariableonthe verticalyaxis. Fromtheplotitisclearthatthereisapositiverelationshipbetweenthetwovariables. Asthedistanceincreases,thensodoesthetime.Thisishardlyasurprise.Butwe wantmorethanthis:wewanttoquantifytherelationshipbetweenthetimetakenand the distance traveled.Ifwecanmodel this relationship that will be useful when customersaskfor quotations. Thedatasetcontainsonefurthervariable,whichisthenumberofdeliveriesonthe route.We’llusethatlateron.Fornowwe’llusejustthetimeandthedistance. Tobuildourmodel,wewanttobuildanequationwhichlookslikethisinsymbolic form ³https://www.youtube.com/watch?v=HSpY12mLXoU y ˆ= b o + b 1 x 1+ …b n x n +e yhat(spoken‘yhat’) is the dependent variable. It is called ‘hat’ because it is an estimate. b0 is the intercept, or the point where the trend line (also called the regression line) passes through the vertical y axis. This is the point where the independentvariableiszero.Youperhapsrememberasimilarequationfromschool: y=mx +b. Inthetruckingexample,theintercept(b0)mightbethe timetakenwarmingupthe truckandcheckingdocumentation. Timeis running, but the truck is not moving. b1 is the ‘coefficient’fortheindependentvariablebecauseitprovidesthechange inthedependentvariableforaone-unitchangeinthe independentvariable.x1is theindependentvariable,inthiscasedistance.Wewanttobeabletopluginsome valueoftheindependentvariableandgetbackapredictedtimeforthatdistance. Thebandthexbothhaveasubscriptof1becausetheyarethefirst(andsofaronly independentvariable).Ihaveputinmorevariables just to indicate that we could havemany. Thecoefficient,b1,providestherateofchangeofthedependentvariableforaone unitchangeintheindependentvariable.Youcanthinkofthisastheslopeoftheline: asteeperslopemeansagreaterincreaseinyforaone-unitincreaseinx. Wecannowfindtheestimatedregressionequationusingtheregressionapplication inExcel.First,weusetheregressfunctioninExcel’sdataanalysistooltoregress distanceontime.YouTube⁴ Theregressionoutputisasbelow: ⁴https://www.youtube.com/watch?v=xKsYfa7YGgE Theregressionoutput Ihavemarkedsomeofthekeyresultsinred,andIexplainthembelow. 6.4Howgoodisthemodel?—r-squared Thereisaredcirclearoundtheadjustedr-squaredvalue,here0.622.r-squaredisa measureofhowgoodthemodelis.r-squaredgoesfrom0to1,witha 1 meaning a perfect model, which is extremelyrareinappliedworksuchasthis.Zeromeans thereisnorelationshipatall.Theresulthereof0.62meansthat62%ofthevariance in the dependent variable is explained by themodel. This isn’t bad, but it’s not great either. Below,we’ll add more independent variables and show how the accuracy of the model improves. The adjusted r-squared takes into account the numberofvariablesintheregressionequationandalsothesamplesize,whichiswhy it is slightly smaller than the r-squared value. It doesn’t have quite the same interpretationasr-squared,butitisveryuseful when comparing models. Like rsquared,wewantashighavalueaspossible:higherisbetterbecauseitmeansthat themodelisdoingamoreaccuratejob. Alsocircledarethewordsinterceptanddistance.Thesegiveusthe coefficientsweneedtowritethemodel.ExtractingthecoefficientsfromtheExcel output,wecanwritetheestimatedregressionmodel. y ˆ=1 . 274+0 . 068∗Dist whereyhatisthe estimated or predicted response time. If you took a very large numberofjourneysofthesamedistance,thiswouldbetheaverageofthetimetaken. Theinterceptisaconstant,itdoesn’tchange.Itistheamountoftimetakenbeforea singlekilometerhasbeendriven.Thedistancecoefficientof0.068isthepieceof informationthatwe reallywant.Thisistherateofchangeoftimeforaoneunit change in the dependent variable, distance. An increase of one kilometer in distanceincreasesthepredictedtimeby0.0678hours,and vice-versaofcourse. Dothemathandyou’llseethattheaveragespeedis14.7mph. 6.5Predictingwiththemodel A model such as this makes estimating and quoting for jobs much easier. If a managerwantedtoknowhowlongitwouldtakeforatrucktomakeajourneyof2.5 kmsforexample,allheorshewouldneedtodoistoplug2.5intodistanceandget: predictedtime=1.27391+0.06783*2.5=1.4435hours. multiplythisresultbythecostperhourofthedriver,addthemiscellaneouscharges andyou’redone. Itisunwisetheextrapolate.Onlymakepredictionswithintherangeofthedatawith whichyoucalculatedyourmodel(seealsoChapter4onthis. 6.6Howitworks:Least-squaresmethod The method we just used to find the coefficients is called the methodof least squares .Itworks like this: the software tries to find a straight line through the pointsthatminimizestheverticaldistancebetweenthepredictedyvalueforeachx (thevalueprovidedbythetrendline)andtheactualorobservedvalueofeachyfor thatx.Ittriestogoasclose as it can to all of the points. The verticaldistanceis squaredsothattheamountsarealwayspositive. Theplotbelowisthesameastheoneabove,exceptthatIhaveaddedthetrendline. Theblacklineillustratestheerror Theslopeofthetrendlineisthecoefficient0.0678.Thatistherateofchangeofthe dependentvariableforaone-unitchangeinthe independent variable. Think “rise overrun”.Iextendedthetrendlinebackwardstoillustratethemeaningofintercept. Thevalueoftheinterceptis1.27391,whichisthevalueofywhenxiszero.Thisis actually a form of extrapolation, and because we have no observations of zero distance,thisisanunreliableestimate. Nowlookatthepointwherex=100.Thereareseveralyvaluesrepresentingtime takenforthisdistance.Ihavedrawnablackline betweenoneparticularyvalueandthetrendline.Thisverticaldistancerepresents anerrorinprediction:ifthemodelwasperfect, all the observed points would lie alongthetrendline.Themethodofleastsquaresworksbyminimizingthevertical distance. Itispossibletodothecalculationsbyhand,buttheyaretediousandmost peopleusesoftwareforpracticaluse.Thegapbetweenpredictedandobservedis knownasaresidualanditisthe‘e’terminthegeneralformequationabove. Theerrordiscussedjustaboveistheverticaldistancebetweenthepredictedandthe observed values for every x value. The amount of error is indicated by the rsquaredvalue .r-squaredrunsfrom0(acompletelyuselessmodel)to1(perfect fit). Intheglossary,underRegression,Ihavewrittenupthemaththatunderpinsthese results. 6.7Addinganothervariable Ther-squaredof0.66wefoundwithoneindependentvariableisreasonableinsuch circumstances, indicating that our model explains 66% of the variability in the responsevariable.Butwe might do better by adding another variable to explain moreofthevariability. Thetruckingdatasetalsoprovidesthenumberof deliveries that the driver has to make.Clearly,thesewillhaveaneffectonthetimetaken.Let’sadddeliveriestothe regressionmodel. Notethatyou’llneedtocutandpastesothattheexplanatoryvariablesareadjacent toeachother.Itdoesn’tmatterwheretheresponsevariableis,buttheexplanatory variablesmustbe adjacentinoneblock.Thenewresultisbelow: Nowwith deliveries Notethatther-squaredhasincreasedto0.903, so the new model explains about 25%moreofthevariationintime.Theimprovedmodelis: y ˆ=− 0 . 869+0 . 06∗Dist+0 . 92∗Del A few things to note here: the values of the coefficients have changed. This is becausetheinterpretationofthecoefficientsinamultiplemodellikethisisbased ononlyonevariablechanging .Forexample,thecoefficientofdistanceis0.92, almostonehourforeachdeliveryassumingthatthedistancedoesn’talsochange. Weshouldinterpretthecoefficientsundertheassumptionthatalltheothervariables are‘heldsteady’,apartfromthecoefficientofinterest. 6.8Dummyvariables Above, we saw how adding a further variable has dramatically improve the accuracyofamodel.Adummyvariableisanadditional variable but one that we construct ourselves as a result of dividing data into two classes, for example by gender.Dummyvariablesare powerful because they allow us to measure the effect of a binary variable, known as a dummy or sometimes indicator variable. A dummyvariabletakesonavalueofzeroorone,andthuspartitionsthe dataintotwoclasses.Oneclassiscodedwithazero,andiscalledthe reference group . The other classes are coded with a one and successive numbers. There may be more than two groups, but there will always be one reference group. W eare generating an extra variable, so the regressionequationlookslike this: y ˆ=b 0+b 1 x 1+b 2 x 2 In the case of observations which have been coded x1=0 (the reference group), then the b1 term will disappear because it is multipliedbyzero.Theb2termremains.Forthereferencegroup,the estimatedregressionequationthensimplifiesto y ˆ reference=b 0+b 2 x 2 whileforthenon-referencegroup,itis y ˆ nonreference=b 0+b 1 x 1+b 2 x 2 Thesizeofb1x1representsthedifferencebetweenthe averagesize ofthereferencelevelandwhatevergroupisthenon-referencegroup. Let’sworkthroughanexample.Creatingadummyvariable⁵ Thedataset[‘gender’](https://dl.dropboxusercontent.com/u/23281950/gender.xlsx) containsrecordsofsalariespaid,yearsofexperienceandgender. Wemightwantto know whether men and women receive the same salarygiventhesameyearsofexperience.Loadthedata,then run a linear regression of Salary on Years of Experience, using years of experienceasthesoleexplanatoryvariable.Theresultisbelow: ⁵https://www.youtube.com/watch?v=TBJsEb2UCPs Thegenderregression The interpretation is that the intercept of $30949 is the aver- age starting salary, withyearsofexperiencezero.Thecoefficient $4076meansthateveryadditionalyearofexperienceincreasestheworker’ssalary bythisamount. Ther-squaredis0.83,sothesimpleone-variablemodel explainsabout83%ofthe variation in the dependent variable, salary. We have one more variable in the dataset,gender.Addthisasadummyvariableandruntheregressionagain.Notethat youwillhaveto cut and paste the variables so that the explanatory variables are adjacent. Withthegenderdummyadded Ther-squaredhasincreasedtonearly1,soournewmodelwiththeinclusion ofgenderisveryaccurate.Theimportantnewvariableis gender,codedas female=0andmale=1.Nothingsexistaboutthis,wecouldequallywell havereversedthecoding.Ifyouarefemale,thenthemodelforyoursalary is:23607.5+4076*Years ifyouaremale,thenthemodelforyoursalaryis23607.5+ 114683.67+ 4076Years Eachextrayearofexperienceprovidesthesamesalaryincrease,butonaverage malesreceive$14683.67moreearnings. Hereisanotherexample,usingthemaintenancedataset.Dummyvariable⁶ Anotherdummyvariableexampleandacautionary tale Themariokartdatasetcamefrom[OpenIntroStatistics](www.openintro.org)a wonderfulfreetextbookforentrylevelstudents.Thedataset ⁶https://www.youtube.com/watch?v=Yv681upodDI contains information on the price of Mario Kart in ebay auctions.First let’s test whethercondition‘new’or‘used’makesdifference.Constructanewcolumncalled CONDUMMY,codednew=1andused=0.Nowrunaregressionwithyournew dummyvariableagainsttotalprice.Theoutputisbelow Theconditiondummyresults Thisresultistremendouslybad.Thepvalueis0.129,meaningthatconditionisnota statistically significant predictor of price. Surely the condition must have some significant effect? Wait—we forgot to do some visualization. To the right is a histogramofthetotalpricevariable. Lookslikewemighthaveaproblemwithoutliers…someof the observations are muchlargerthantheothers.Takeanotherlookwithabarchartatthehigherprices. marioKarttotal prices Wecanidentifytheseoutliersusingzscores(seetheGlossary).Orwecouldjustsort thembysizeandthenmake avaluejudgementbasedonthedescription.That’swhatIhavedoneinthisYouTube. Outlierremovaland regres-sion⁷ Thetotalpricesasabarchart It seems that two of the items listed were for grouped items which were quite differentfromtheothers.Thereisthereforealegitimatereasonforexcludingthem. Belowarethenewresults: ⁷https://www.youtube.com/watch?v=ivLGteHuu3Q CorrectedmarioKart output Theestimatedregressionequation is y ˆ=42 . 87+10 . 89∗CONDUMMY Theconditiondummywascodedasnew=1,used=0.Ifamariokartisused,thenits averagepriceis42.87,ifitnewthentheaveragepriceis42.87+10.89.Theaverage differenceinpricebetweenoldandnewisnearly$11.Makessense. Take-home:checkyourdatabeforedoingtheregression. 6.9Severaldummyvariables ThegenderandthemarioKartexamplesabovecontainedjustonedummyvariable. But it is possible to have more. For example, your sales territory contains four distinctregions.Ifyoumakeoneofthefourthereferencelevel,andthendivideup thedatawithdummies for the remaining three regions, you can compare performance in each of the regionsbothtoeachotherandtothereferencelevel. Thedatasetmaintenancecontainstwovariableswhichyoucanconverttodummies, followingthisYouTube. Maintenance Regression⁸ 6.10Curvilinearity Sofar,wehaveassumedthattherelationshipbetweenthedepen-dentvariableand theindependentvariablewaslinear. However,inmanysituationsthisassumption does not hold. The plot below showsethanolproductioninNorth America over time. NorthAmericanethanolproduction.Source:BP ThesourceofthedataisBP.Itisclearthatproductionofethanolisincreasingyearly butinanon-linearfashion.Justdrawingastraightlinethroughthedatawillmissthe increasingrateof production.Wecancapturetheincreasingratewithaquadratic term,which is simply the time element squared. I have created a new variable whichindexestheyears,whichIhavecalledt,andafurthervariable ⁸https://www.youtube.com/watch?v=xWdcT7u9YFE&feature=em-upload_owner whichistsquared. urvilinearregression YouTube⁹ Thedatasetwithanindexfortime.Source:BP Aregressionoftagainstoutputhasanr-squaredvalueof0.797.Inclusionofthet squaredtermincreasesther-squaredmarkedlyto 0.98.Theregressionoutputisbelow. Regressionoutputwiththequadraticterm Wewouldwritetheestimatedregressionequationas y ˆ=4504−1301 t+194 . 9 t 2 Noticethatforearlysmallervaluesoft,theeffectofthequadratic ⁹https://www.youtube.com/watch?v=jgjWSpyPBqg termisnegligible.Astgetslargerthenthequadraticswampsthelineartterm. Makingaprediction .Let’spredictethanolproductionafter5years.Then y ˆ=4505−1301∗5+194 . 9∗25=2872 . 5 Theplotbelowshowspredicted against observed values. While the fit is clearly imperfect,itiscertainlybetterthanastraightline. Predictedagainstobservedethanolproduction. 6.11Interactions Increasingthepriceofagoodusuallyreducessalesvolume(al- though of course profitmightnotchangeifthepriceincreasessufficientlytooffsetthelossofsales. Advertisingalsousuallyincreasessales,otherwisewhywouldwedoit? What about the joint effect of the two variables? How aboutreducingtheprice andincreasing the advertising? The joint effect is called an interaction and can easilybeincludedintheexplanatoryvariablesasan extra term. The output below istheresultof regressingsalesonPriceandAdsforaluxurytoiletriescompany.Theestimated regressionmodel is y ˆ=864−281 P+4 . 48 Ads Regressionoutputforjustpriceandads Sofarsogood.Thesignsofthecoefficientsareaswewouldexpectfromeconomic theory.Salesgodownaspricerises(negativesignonthecoefficient).Salesgoup withadvertising(positivesign onthecoefficient).Caution:donotpaytoomuch attention to the absolute size of the coefficients. The fact that price has a much largercoefficientthanadvertisingisirrelevant.Thecoefficientisalsorelatedtothe choiceofunitsused. Nowcreateanothertermwhichispricemultipliedbyads.CallthistermPAD.The firstfewlinesofthedatasetarebelow. Firstfewlineswiththenewinteractionvariable Nowdotheregressionagain,includingthenewterm. Inclusionoftheinteractionterm The new term is statistically significant and the r-squared has increased to 0.978109.Thenewmodeldoesabetterjobofexplainingthevariability.Howcome priceisnowpositiveandtheinteractionterm is negative? We have to look at the results as a wholerememberingthatthesignsworkwhenalltheothertermsare ‘heldconstant’.Theexplanation:asthepriceincreases,theeffectofadvertisingon salesisLESS.Youmightwanttolowertheprice andseeiftheincreasedvolume compensates. Anotherinteractionexample Here’sanotherexample,thisonerelatingtogenderandpay. Thedatasetiscalled ‘paygender’ and contains information on the gen- der of the employee, his/her review score (performance), years of experience and pay increase. We want to know: •Isthereagenderbiasinawardingsalaryincreases? •Inthereagenderbiasinawardingsalaryincreasesbasedontheinteraction betweengenderandreviewscore? First,let’sregresssalaryincreasesonthedummyvariableofgenderandalsoReview. Myresultsarebelow(IhavecreatedanewvariablewhichIhavecalledG.Itisjust thegendervariablecodedwithmale =0andfemale=1. GdummyandReview Nothingverysurprisinghere:womengetpaidonaverage233.286 lessthanmen. And—holdinggendersteady–eachpoint increase in Review gives an increase in salaryof2.24689. How do I know that women are paid less than men? The estimated regression equationis: y ˆ=204 . 5061 − 233 . 286 ∗ ( x=1 ifF )−233 . 286∗( x=0ifM )+2 . 24689∗Review Rememberhowwecodedmenandwomen?Thex=0ifMtermdisappears,sowe areleftwiththisequationformen: y ˆ=204 . 5061+2 . 24689∗Review andthisoneforwomen(I’vedonethesubtraction)togive y ˆ=− 28 . 779+2 . 24689∗Review Conclusion:thereisagenderbiasagainstwomen.ForthesameReview standard,onaveragewomenarepaid233.286lessthanmen. How about the second question — possibility that the gender bias increases with Review level? W ecan test this with an interaction variable.MultiplytogetheryourgenderdummyandtheReviewscore tocreateanewvariablecalledInteract.Thendothe regression again with salary regressed against G, Review and Interact. My output is below. Interactionoutput Formen,theequationisSalary=59.94472+4.848995(becausetheGandInteract termsgotozerobecausewecodedMale=0.SoforeveryextraincreaseinReview,a man’ssalaryincreasesby4.848995. Forwomen,theequationisSalary=59.94472-29.714-1*(4.848995-4.05468)so foreveryoneincreaseinReview,awomen’ssalary increases by only 4.8489954.05468=0.79. Notice that the adjusted r- squared for this model is higher (it is 0.927339comparedto0.810309) thanthemodelwithjustG and Review. So the interactionisbothstatisticallysignificant,pvalue0.000316)andhasthemeaning thatforwomen,improvingtheReviewscoredoesn’tincreasethesalaryasmuchas formen. ConclusionTheanalysisshowedthatwomenarebeingtreatedunfairly.Thereisa genderbiasagainsttheminaveragesalariesforthesameperformancereview;and each increase in review points earnthemconsiderablylessinsalarythanforthe equivalentmale.Caution: this conclusion is based solely on the limited evidence providedbythissmalldataset. 6.12Themulticollinearityproblem Multicollinearityreferstowayinwhichtwoormorevariables‘explain’thesame aspect of the dependent variable. Forexample,let’ssaythatwehad a regression modelwhichwasexplainingsomeone’ssalary.Employeestendtogetpaidmoreas theygetolderandalsoastheiryearsofexperienceincreases.Soifwehadbothage and years of experience in the list of independent variables we would almost certainlysufferfrommulticollinearity. Multicollinearitycanleadtosomefrustratingandperplexing re-sults.Youruna regression.Oneofthevariableshasapvaluelargerthan0.05,soyoudecidetotake itout.Youruntheregressionagain and—-the sign and/or significance of another variables changes. This happens because the two variables were explaining the sameaspectofthedependentvariablejointly. Howtosolvethisproblem:checkthecorrelationoftheindependentvariablesfirst, beforeputtingthemintoyourmodel.Chapter14hasasectiononcorrelation.Ifyou findthatthecorrelationofanytwovariablesishigherthan0.7,besuspicious.These twomaybringyoursomegrief! 6.13Howtopickthebestmodel Youwillbetryingoutdifferentformulations,addingandremovingvariablestotry tocaptureasmuchexplanatorypowerasyoucan.Howyoudodecideifonemodel isbetterthananother?Therearetwoapproaches: • Lookonlyattheadjustedr-squaredvalue,evenif your model contains variables with a p value larger than 0.05. If the adjustedr-squaredvalue goesup,leavesuchvariablesin. •Pruneyourmodelsothatitcontainsonlyvariableswithapvalue<=0.05. Youstillwantashighanadjustedr-squaredas youcanget,butyoualsowantallyourexplanatoryvariablestobe statisticallysignificant. Itakethelatterapproach.Youusuallyhavefewervariablesbutallof themhavea reason(statisticalsignificance)forbeinginyourmodelandyoucaninterprettheir meaningintelligently.Aparsimoniousmodelisbetter than a complex model that fitsyourdataverywell.Simpleandrobustisgood.Thedatasetthatyouusedtofit yourmodelisonlyasamplefromapopulation.Anoverlycomplexmodelmaynot workwellwhenpresentedwithadifferentsamplefromthepopulation. 6.14Thekeypoints • Thinkthroughyourmodelbeforeyoustartincluding vari-ables.What variablesdoyouthinkwillhaveaneffectonthedependentvariableandin whichdirection(plusorminus).Itistemptingtojustputineverythingand hope for the best but this rarely works. Some software is able to do stepwiseregression,pullingoutinsignificantvariablesforyou, butExcel isnotoneofthem. • Keepyourmodelassimpleaspossible.Complicatedmodelsrarelywork well. • Visualiseyourdatafirst,evenwithasimplescatterplotaswehavedone throughoutthischapter. •Checkwhetheryouhaveallthevariablesthatyoumightneed.Ifyouwere tryingtopredictwhetherashopselling expensive jewellery would make sufficientsales,youmightwanttoknowtheaverageincomeofresidents.If youdon’thaveit—getit.Thereisahugeamountofdatalyingaboutwhich youcanobtain either free or quite cheaply. I’veincludedaverybrieflist ofURLsinChapter12. 6.15Workedexamples 1.Theestimatedregressionequationforamodelwithtwoindependent variablesand10observationsisasfollows: y ˆ=29+0 . 59 x 1+4 . 9 x 2 Whataretheinterpretationsofb1andb2inthisestimatedregres-sion equation? Answer:thedependentvariablechangesbyonaverage0.59whenx1changebyone unit,holdingx2constant.Similarlyforb2. Predictthevalueoftheindependentvariablewhenx1is175andx2is290: y ˆ=29+0 . 59(175)+4 . 9(290)=1553 . 25 7. Checkingyour regressionmodel Itisn’tdifficulttobuildaregressionmodelasthepreviouschapterhasshown.But thatispartoftheproblem:everybodydoesit.Butnoteverybodytakesthetrouble tochecktheresults.Checkingtheresultsisanimportantstepfortworeasons:first,to makesureyourcalculationsandpredictionsarecorrect;andsecondlytoshow third parties that your work is solid. In this chapter we’ll work on doing just that. To decidewhetherthemodelswehavebeenworkingonareanygoodweneedtolook attwoareas: •Isyourmodelstatisticallysignificant? •Aretheassumptionsbehindtheleastsquaresmethodmet?Inparticular, linearityandresidualdistribution. Thefirstareaiseasiertoworkthroughthanthesecond,whichisprobablywhyone doesn’talwaysseeresidualanalysisdiscussedwhenresultsarepresented.Thisisa shamebecausethereisagreatdealtobelearnedfrompickingthroughresiduals.Your analysiswillbegreatlyimprovedbyacloseattentiontothisarea. Belowwe’llworkthroughstatisticalsignificanceandthentesttheassumptions. 7.1Statisticalsignificance Theregressiontrendlinedisplaysarateofchangebetweentwovariables.Theline slopesupwardswhenthedependent variableincreasesforaone-unitincreaseinthe independentvariable( 61 positive relationship) and vice-versa (negative relationship). What we want to know is this: did that slope upwards or downwards occur by chance; if we took anothersamplefromthepopulationwouldweachieveasimilarresult?Keypoint: weare usuallyworkingwithasampledrawnfromapopulation.Thesampleinthe truckingexamplewasthefirm’slogbook. Thenullhypothesisisthatthereisnoslope;withthealternativethatthereisinfacta slope.Thehypotheses(seetheGlossaryforhelponhypotheses)totestwhetherthere isastatisticallysignificantslopeare: Thenull: H o : β =0 andthe alternative: H a : β̸ = 0 where β isthesymbolforthecoefficientsofthepopulation parameter of the independent variableofinterest.Inwords,thehypothesisistestingwhetherbetaiszeroornot.If it is zero, then we are ‘flatlining’ and there is no point in continuing with the analysis.Ifhoweverwecanrejectthenull,thenweacceptthealternativewhichis thatbetaisnotzero.Itmightbenegativeorpositive,thatdoesn’tmatter:whatmatters is whether it is zero or not. Note that we use a Greek letter when discussing a populationparameterwhichwillprobablyneverbe accurately known. We use the Latin letter ‘b’ when we have estimated it using a sample drawn from the population. Exceldoesallthetestingofstatisticalsignificanceforusintwoways. First,itteststheindividualvariablesandprovidesapvalue.Belowistheresultfrom thefirsttruckingregressionwedid.Notethatthepvaluefordistanceis0.004.Itis customarytouseacut-offof0.05.Themeaningofthisresultisthat,followingthe rejectionrule ,werejectthenullifthepvalueissmallerthan0.05.Because0.004is smallerthan0.05,werejectthenullwe therefore conclude that there is in fact a statisticallysignificantslope.Insummary,wetestedthe hypothesis that the slope waszero,andrejectedit.Thereforeweacceptthealternativewhichisthatthereisa slopeofsomesort. Thetruckingregressionoutput Thistesttellsusthatthereisastatisticallysignificantslope.Itdoesn’ttellusthe signoftheslope(upordown)oritsmagnitude.Butthefactthatthereisasignificant slopeisimportantinformation.Itisworthcarryingonwiththeanalysisbecausethe variablesactuallymeansomething.Bytheway,ignorethepvaluefortheintercept. Itusuallyhasnosubstantivemeaning. ThesecondwaythatExcelteststhevalidityofthemodelistotestwhetherthemodel asawholeissignificant.ThetestiscalledtheFtestafterthestatisticianRAFisher. Forthetruckingexample,lookundersignificanceF.Theresultisapvalue,inthis casethesame pvalueforthevariableDISTANCE.Wehaveonlyoneexplanatoryvariableandso thepvaluewillbethesame.Wehopethatthepvalueissmallerthan0.05,which enablesustoconcludethatatleastoneoftheslopesisnon-zero. The section above examined the first of the tests, which was for the statistical significanceofthemodel.Nowwemoveontothesecondarea. 7.2Thestandarderrorofthemodel ThestandarderrorinExcelisfoundjustundertheadjustedr-squaredoutput.Itis theestimatedstandarddeviationoftheamountofthedependentvariablewhichisnot explainablebythemodel.Inotherwords,itisthestandarddeviationoftheresiduals. Undertheassumptionthatthemodeliscorrect,itisthelowerboundonthestandard deviationofanyofthemodel’sforecasterrors.Thefigurebelowshowstheoutputof regressionestimatingriskofastroke(multipliedby100)againstbloodpressureand age,withadummyvariableforsmokingornot. Exceloutputforstrokelikelihood Thestandarderroris6(roundedup).Foranormally-distributedvariable,95%of the observations will be within two standard deviations of the mean. For our purposesthatmeansthat2 x6=12shouldbeaddedorsubtractedtothepredicted valuetofindaconfidenceintervalforpredictions.Theestimatedregressionequation fromthestrokemodelaboveis: y ˆ=− 91+1 . 08 Age+0 . 25Pressure+8 . 74 Smokedummy Animaginarypersonwhosmokes,isaged68andhasa bloodpressureof175will haveariskof34(allfiguresrounded).Wecanbe95%confidentthatthisestimate willfallsomewherebetween34-12and34+12or22to46.Theseareratherwide confidenceintervalsandsowemightwanttoworkatimprovingthemodelbyfor exampleincreasingthesamplesizeoraddingmoreexplanatoryvariables. 7.3Testingtheleastsquares assumptions Therearetwokeyassumptionsthattheleastsquaresmethodreliesonwhichweneed tocheck.Aftercheckingwe’llworkthroughwaysofretrievingthesituationshould theassumptionsbeviolated.Theassumptionsare: • Therelationshipbetweenthedependentandtheindependentvariablesis linear.Fortunatelythisiseasytocheckandalsotofixifitisn’tsatisfied. • The residuals have a non-constant variance. (You’ll some-timesseein otherstatisticstextbooksotherstricterrequire-ments,suchastheresiduals beingnormallydistributedandwithameanofzero.Formostpurposesjust checking for non-constant variance is enough). What does non-constant variance mean? We’ll work through this by defining a resid- ual and identifyingwhetherornottheassumptionhasbeenmet.Andfinallywhatto doaboutit.Butfirst,checkingforlinearity. Checkingforlinearity Therelationshipbetweenthedependentvariableandtheindepen- dentvariableis assumedtobelinear.Thisisimportantbecausethemodelgivesusarateofchange: thecoefficientshowsthechangeinthedependentvariableforaone-unitchangein theindependentvariable.Iftherelationshipisnon-linear,thatcoefficientwill not bevalidinsomeportionsoftherangeofthevariables. Wewanttoseeastraightline(eitherupordown)onascatterplot.Putthedependent variable on the vertical y axis, and one of the independent variables on the horizontal x axis. If you include thetrendline,youcanobservehowcloselythe observedvaluesmatchthepredicted. Wecanalsocheckforlinearitybyplottingthepredictedvaluesand theobserved values.Theplotbelowdoesthisforriskofstroke: Predictedagainstobservedrisk Thisisareasonableresult,indicatingthatthelinearityconditionismet.Mostofthe pointsareinastraightline,althoughIwould be concerned about the lower risk levels,especiallyaroundrisk=20.Thereappearstobeconsiderablevarianceatthis point. 7.4Checkingtheresiduals Aresidualisthedifferencebetweentheobservedvalueofyforanygivenxvalue, andthepredictedvalueofyforthatsamexvalue.Itisthereforeapredictionerror, andisgiventhenotationefortheGreekletter‘epsilon’.Itistheverticaldistance betweentheactualandpredictedvaluesofthedependentvariableforthesamevalue oftheindependentvariable.Wediscussedthisinthepreviouschapterinconnection withtheleastsquaresmethod,wherewewrotetheestimationequation: y ˆ= b 0+b 1 x Thehatony,thedependentvariable,indicatesthatitisanestimateofthedependent variable.Weknowthattheestimatecannotbecorrectunlessallthepredictedvalues andtheobservedvaluesmatchupexactly.Iftheydon’t,thenthereareresiduals.Ifwe calltheerrorsε(epsilon)whichistheGreeklettermatchingourlettere,thenwecan rewritetheregressionequationas: y=β o+β 1 x+ε Theerrorshavenowbeenabsorbedintoepsilonandsowecanremovethehatfrom y.Itisthebehaviorofepsilonthatisofinterest, becausetheleast-squaresmethod restsontheassumptionthatthe errors absorbed into epsilon have a non-constant variance. Thismeansthattherenorelationshipbetweenthesizeoftheerrortermand thesizeoftheindependentvariable.Therefore,ifweplottedtheerrorsagainstthe independentvariables,weshouldseenoclearpattern.Howtodothisisdescribed below. 7.5Constructingastandardized residualsplot Excelprovideswhatiscallsstandardresidualsaspartofitsregres-sionoutput,and wewillusethese.Notehoweverthatthesearenot‘true’standardizedresiduals,but theyareprobablycloseenough.MakesureyouchecktheStandardizedResiduals boxwhensettingupyourregression. Checkthestandardizedresidualsbox Wewanttoplotthestandardresiduals against the predicted value yhat. We will createanewcolumn to the left of the column Standard Residuals, and copy the column of predicted times into that new column. Then create a scatter plot of predictedtimeandstandard residuals. Youshould end up with the image below. Standardizedresidualplot¹ ¹https://www.youtube.com/watch?v=1wuyZfB39p4 Standard residuals TheYaxisindicatesthenumberofstandarddeviationsthat eachresidualisaway fromthemean,plottedagainstthepredictedvaluesonthehorizontalaxis.Noneof theresidualsaremorethat2standarddeviationsawayfromthemeanofzero,sothe resultsaregenerallysatisfactory,althoughthereisoneobservationinexcessof1.5. We still have a worrying fan shape which would seem to indicate non-constant variance:wecanobserveapattern. Let’sruntheregressionagain,butnowincluding the second vari- able,whichis deliveries.Theresidualsplotisobtainedinthesameway,andhereistheresult. Standardizedresidualswithtwovariables Thefanproblemseemstohaveimprovedalittle.Ther-squaredofthemodelwith twovariableswashigher,meaningthattheresidualsweresmaller(becausethereis lessunexplained error). 7.6Correctingwhenanassumptionisviolated Above,weexaminedpossibleviolationsoftwooftheassumptionsunderlyinglinear regression:linearityandnon-constantvariance.Nowlet’slookatwhattodowhen these assumptions are found to havebeenviolated.Firstsomegood news: linear regressionisquiterobusttosuchviolations,andevensotheyarequiteeasytocorrect for. We’ll deal with the problems in the same order: linearity and non-constant variance. 7.7Lackoflinearity The first check is to look at a scatter plot of the dependent variable against the independentvariable.Theplotbelowshows volume of sales and length of time employed,togetherwithatrendline. Acurvilinearrelationship Thereareseveraldifferentwaysinwhichalinearrelationshipcanbeachievedso that we can use the least squares method. A common method is to include a quadraticterm.Thismeansadding thesquaredvalueoftheindependentvariableto thelistofindependent variables.Thisiseasilydonebycreatinganothercolumn, consistingofsquaredvaluesofthefirstvariable. Anewcolumnoftheindependentvariablesquared Todothis,createa new column and then label it. Click on the first value in the variablewhichyouwanttosquare,andaddˆ2.Enter. Then drag downwards. The plotbelowshowsthepredicted andobservedsales.Themodelisclearlysuperior. Predictedandobservedafterinclusionofquadraticterm Ifthedependentvariablehasalargenumberoflowvalues,andisheavilyskewedto theright,thenagoodsolutionistotransformthedependentvariableintoitsnatural logarithm. 7.8Whatelsecouldpossiblygowrong? Regressionisaverycommonlyusedanalyticaltoolandyouaremostlikelyeither goingtouseityourselforexaminetheworkofotherswhohaveusedregression. Below is a list of common mistakes to watch for. And if I have made them somewhereinthisbook,I’msureyou’llbethefirsttoletmeknow! 7.9Linearitycondition Regressionassumesthattherelationshipbetweenthevariablesislinear: the trend linethatthesoftwaretriestofindtominimizethesquareddifferencebetweenthe observedandpredictedvaluesisstraight.Soifyoutrytorunaregressiononnonlineardata,you’llgetaresultbutitwillbemeaningless. Action :alwaysvisualizethedatabeforedoinganyanalysis.Ifyouseethatthedata isnon-linear,youmaybeabletotransformitusingthetechniquesdescribedinthe previouschapter.Asanexample:thecurvilinearexamplewhichwastransformed byincludingaquadratictermasanexplanatoryvariable. 7.10Correlationandcausation Regressionisaspecialcaseofcorrelation,andasweall knowcorrelationdoesn’t meancausation(seetheGlossary).Inregression,nomatterhowgoodthemodel,all that we have been able to show is that a change in an explanatory variable is associatedbyachangeinthedependentvariable.Forexample,yourecordthehours youputintostudyingandyourgrades.Surprise!Morestudying=bettergrades?Or perhaps not….could have been a better instructor.Wecannotsaythatonecaused theother.Sowhenwritingupresultsorinterpretingthoseofothers,beverycareful nottoclaimmorethanyouareableto. There is a related problem which is ‘reverse causation’. You study more, your gradesgoup.Temptingtothinkthatonepossiblycausedtheother.Butperhapsitis theotherwayround?Yourgradeswerepoor,youstudiedharder?Oryouhadgood gradesandthenledyouintotakingthecourseseriously. Action : try not to include explanatory variables which are affected by the dependentvariable.Youcouldalsotryto‘lag’onevariable.Cutandpastethehours ofstudyingvariablesothatitisonetime-periodbehindthegradesandthenrunthe regressionagain. 7.11Omittedvariablebias Under Correlation in the Glossary I give some examples of lurking variables. Regression, like correlation, is susceptible to the same problem. Example: you noticethatinhotterweathertherearemore deathsbydrowning.Didthehotterweathercausethedrownings?Wellno,theextra swimmingcausedbytheheatpresentedmoreriskscenarios.Thelurkingvariableis hoursspentswimmingratherthantemperature.Trytogettotherealvariableifyou can. 7.12Multicollinearity When predicting the price of a house, square footage and number of rooms are likelytobehighlycorrelatedbecausetheyarebothex-plainingthesamething.This problemresultsinunusualbehaviorintheregressionmodel. Action : correlate your explanatory variables BEFORE doing any regression. Watchoutifyouhaveapairwhichhasanrvalueof 0.7or higher.This doesn’t mean that you shouldn’t use them, butit’s ared flag. Thisiswhatmighthappen.Youtakeoutononethevariablesina regression, the onethat’sleftreversesitssignorsuddenlybecomesstatisticallyinsignificant. Ifyoudoruntheregressionandyougetanunlikelyresult,choosethevariablewith thehighesttvalue(orsmallestpvalue)andditchtheother. 7.13Don’textrapolate Thecoefficientsfromtheregressionarecalculatedbasedonthedatayouprovided. Ifyoutrytopredictforavaluebeyondtherangeof that data, the results will be unreliableifnottotallywrong. In the years of experience and salary example in Chapter 5, the co- efficient for yearsofexperienceprovidedthechangeinsalarythatanextrayearofexperience wouldgive.Wouldyoufeelcomfortablepredictingthesalaryofsomeonewith95 yearsofexperience? Action :beforerunningthenumbers,checkthattheinputsarewithintherangefor whichyoucalculatedthemodel. 8. TimeSeries Introductionand SmoothingMethods Atimeseriesisasetofobservationsonthesamevariablemeasured atconsistent intervalsoftime.Thevariableofinterestmightbemonthlysalesvolume,website hitsoranyotherdataofrelevancetothebusiness.Usingtimeseriesanalysiswe can detect patterns and trends and–just possibly–make forecasts. Forecasting is tricky because (of course!) we only have historical data to goonandthereisno assurance that the same pattern will repeat itself. Despite all this, time series analysishasdevelopedintoahugetopic,fortunatelywithalargenumberoffreely availabledata-setstouse.Linkstosomeofthesedata-setsareprovidedattheendof thedatachapter(Chapter3). There are two basic approaches: the smoothing approach and the regression method .Bothmethodsattempt to eliminate the background noise. This chapter coverssmoothingmethods, thefollowingchaptertheregressionapproach. 8.1Layoutofthechapter Herewewill 1.identifythefourcomponentsthatmakeupatimeseries 2.introducethenaive,movingaveragemethodsandexponen-tially weightedsmoothingmethodstomakeaforecast 3.discusswaysinwhichtheaccuracyoftheforecastscanbecalculated 76 8.2TimeSeriesComponents There are four potential components of a time series. One or more of the components may co-exist. The components are: trend compo- nent; seasonal component;cyclicalcomponent;randomcomponent. •trendcomponent :theoverallpatternapparentinatimeseries.Theplot below shows French military expenditure as a percentage of GDP. The downwardtrendisapparent. FrenchmilitaryexpenditureasapercentageofGDP • seasonalcomponent :salesofskisorChristmasdecorationsareclearly higherinthewintermonths,whileicecreamandsuntancreammovemore quicklyinthesummer.Ifweknow the seasonal component, then we can use this information for prediction. The time between the peaks of a seasonalcomponentisknownastheperiod .Noteherethatseasondoesn’t necessarilymeantheseasonoftheyear. TheplotbelowshowssalesofTVsetsbyquarter.Herethereisaconstantupward trendbecausemorepeoplearebuyingTV sets,butthereisalsoaseasonaleffect. TVSalesbyquarter •cyclicalcomponent :sometimeserieshaveperiodslast-inglongerthan one year.Business cycles such as the50- year Kondratieff Wave are an example.Thesearealmostimpossibletomodel,but that doesn’t stop the hopefulfromtrying.KondratieffhimselfendedupavictimofJosephStalin becausehissuggestionthatcapitalisteconomieswentinwavesundermined Stalin’sviewthatcapitalismwasdoomed,andhewasexecutedin1938. •randomcomponent :therandomcomponentisjustthat:thenoiseorbits leftoverafterwehaveaccountedforeverythingelse.Thetell-talesignsofa random component are: a con- stant mean; no systematic pattern of observations;constantlevelofvariation. 8.3Whichmethodtouse? In the next two chapters, we’ll work through the two basic ap- proaches: the smoothingapproachandtheregression approach.Howtodecidewhichtouse? Firststepasusualistomakeasimplescatterplotofyourdata,with time on the horizontalxaxisandthevariableofinterestontheverticalyaxis.Ifthe result is somethingresemblingastraightline,forexampletheFrenchmilitaryexpenditure, thenuseasmoothingapproach.Ifthereisevidenceofsomeseasonality,suchaswith theTVsalesdata,thenregressionisthewaytogo.Thisisespeciallythecaseifyou wanttocalculatethesizeoftheseasonaleffect. 8.4Naiveforecastingandmeasuringerror Forecastingisjustthat:aneducatedguessaboutwhatmighthappenatsomepointin thefuture.Forecastsarebasedonhistoricalevents, and we are hoping that some patternofbehaviorwillrepeatitselfwhichwillmaketheforecast‘true’.Thebest thatwecandoistotryoutdifferentforecastingmethodsand work out how well theypredicteddatawhichwealreadyknewabout.Thenwetakealeapintothedark andhopethatthebestofthosemethodswilldoagoodjobondatathatwedon’t knowabout.Thissectionconcernsthemeasurementofforecastinga c c u r a c y . We’regoingtoconstructanaiveforecastandusethatforecastto demonstratetwo commonmethodsofaccuracychecking:theMean SquaredError(MSE)and the MeanAbsoluteDeviation(MAD)approaches. Anaiveforecastassumesthatthevalueofthevariableatt+1willbethesameasatt. Thevaluesarejustcarriedforwardbyonetimeperiod.Hereisanexample: ImageGaspriceswithnaiveforecast The naive approach is sometimes surprisingly effective, perhaps because of its simplicity.Ingeneralsimplemodelsperformwellperhapsbecauseoftheirlackof assumptions about the future. Now we will measure the accuracy of the Naive ForecastwithMADandMSE.Inbothmethodstheerroriscalculatedbysubtracting theforecastorpredictedvaluefromtheobservedvalue.Themethodsdifferinwhat isdonewiththoseerrors. MADmeasurestheabsolutesizeoftheerrors,sumsthemandthendividesbythe numberofforecasts.Theabsolutevalueoftheerrorisjustitssize,withoutthesign. In Excel, you can find an absolute value with =ABS(F1 - F3) where FI is the observedvalueandFEthepredictedvalue.Usethelittlecorneroftheformulaboxto drag it down. Then find the sum and divide by the n, which is the number of observations.Inmaththisis MAD= ∑ ( abserror ) n InMSE,theerrorissquaredbeforebeingsummedanddivided. Squaring the error removes the problem of the negative numbers, but creates anotherone:largeerrorsaregivenmoreweightingbecauseofthesquaring.Thiscan distorttheaccuracyoftheresults.ThemathsfortheMSEisbelow MSE= ∑ ( error )2 n Theplotbelowshowstheerrorsassociatedwiththenaiveforecast-ing,theabsolute valuesoftheerrorsandtheMAD.InExcel,use =abs()toconverttoanabsolutevalue. ImageErrorsandMeanAbsoluteDeviation 8.5Movingaverages Slightlymorecomplexthanthenaiveapproachisthemovingaverageapproach, whichreliesonthefactthatthemeanhasless variance than individual observations. By focusing on the mean, we can see the generaltrend,lessdistractedbynoise. Theanalystdecideshowfarbackintimetheaveragewillgo.Thelongerbackin time,thenthemorevaluesoftheobservation are averaged, resulting in a flatter more smoothed result. This is good for observing long-term trends, but less satisfactoryifyouareinterestedinmorerecent history. The number of observations which are to be averaged is known as k, and this number is chosen by the analyst based on experience. In Excel’s Data Analysis ToolpakthereisaMovingAveragetoolwhichcandoallthework.Positioningthe outputisslightlytricky.Therearektimeperiodsbeingusedforthecalculationof the first forecast;sothefirstforecastshouldbeatk+1becausektimeperiods are beingusedtofindthefirstvalue.Whereexactlytoputthefirstcelloftheoutputisa bitfiddly.Iftherearektimeperiodsbeingused, placethefirstcell at k-1. Excel providesachartoutput,examplebelow.ThereisaYouTubehere¹ ImageMovingaveragewithk=3forgasprices Themovingaveragetechniqueisusedbyinvestorstodetectthe‘goldencross’and the‘deathcross’.Theyexamine50dayand200daymovingaverages.Whenthe50 daymovingaverageis above ¹http://youtu.be/zrioQOWfxjY the200daymovingaveragethispointistheso-calledgoldencrossandasignalto buy.Bycontrast,whentheoppositehappensthisisthe‘deathcross’…timetoget out! 8.6Exponentiallyweightedmoving averages The exponential smoothing method (EWMA) is a further step forward from the naiveandthemovingaverageapproaches.Themovingaverageforgetsdataolder thanthe ktimeperiodsspecified, while the EWMA incorporates both the most recentandmorehistoricalobservationstoconstructaforecast.IntheGlossaryyou’ll findmoredetailedmathshowinghowtheEWMAbringsforwardallpasthistory. Ingeneralthefurtherbackintimeyouget,thelessinfluencetheobservationshave. UsingEWMAwechooseasmoothingconstant,alpha,whichsetstheweightgiven to the most recent observation againstprevious forecasts. If we thought that the forecast(t+1)wouldbesimilartothecurrentobservation(t=0)andwethoughtthat olderforecasts wereoflittlevalue,thenwewouldchooseahighvalueofalpha. Alpharunsfrom0to1.TheEWMAformulais F t +1=αY t+(1−α ) F t where F is the forecast, and Y is the actual value of the variable. alpha is the smoothingconstant,ranginginvaluebetween0and1,andchosenbytheanalyst. In words, this equation means that the forecast value (at t+1) is the smoothing constantalphamultiplyingtheactualvalueat t=0plus(1-alpha)timestheforecast att=0.Soifthepreviousforecastwastotallycorrect,theerrorwouldbezeroandthe forecastwouldbeexactlywhatwehavetoday.Itturnsoutthattheexponential smoothingforecastforanyperiodisconstructedfromaweightedaverageofallthe previousactualvaluesofthetimeseries. Theequationaboveshowsthatthesizeofalpha,thesmoothing constant,controls thebalancebetweenweightinggiventothemostpreviousobservationandprevious observations.Ifthealpha issmall,thentheamountofweightgiventoYatt=0is smallandtheweightgiventopreviousobservationsislargeandvice-versa.Ifalpha =1,thenpreviousobservationsaregivennoweightatall,and weassumethatthe futureisthesameasthepast.Thisachievesthe same result as naive forecasting discussedabove.Typically quitesmallvalueofalphaareused,suchas0.1. ##ApplicationofExcel’sexponentialsmoothingtool An exponential smoothing tool is available in the Analysis ToolPak. We’lluse Canadian Gross Domestic Product per capita as an example. Here is the data plottedonitsown.Youtube² CanadianGDPPerCapita Thereisasteadyupwardstrend,butwithadipin2009duetotheworldfinancial crisis.Excelasksforadampingfactor.Thisis1-alpha.Ihavedonetheforecasts twice,oncewithanalphaof0.1 ²http://youtu.be/w-GmxjrX1qg andagainwithanalphaof0.9.Thereforethedampingfactorswere 0.9and0.1.Theresultforalpha=0.9isshownbelow. Forecastwithalpha=0.9 This is pretty good forecast, with the forecast tracking the actual observations closely. TheplotbelowshowssheepnumbersascountsbyheadinCanada from1961to 2006.First,let’splotthisusing0.3asalpha(arbitrarilychosen).Theplotisbelow, withadampingfactorof1-0.3=0.7. Sheepnumberswithalpha=0.3 9. TimeSeriesRegressionMethods Thesmoothingmethodsweworkedinthepreviouschapterarefinewhenthereis noevidenceofseasonality,forexamplewiththe sheepdata.However,smoothing has only limited use for longer term prediction and analysis. Data that is more interestingforusas analysts may be non-linear in trend, and also have seasonal peaks and troughs. Using regression, we can measure the quantitative effect of seasonalityandtrend,eithertogetherorseparately.Theresultisamodelwhichcan beusedforprediction. Belowwewill: 1.useregressiontoquantifyatrendinatimeseries 2.introduceaquadratictermtoaccountfornon-linearityinthetimeseries 3.usedummyvariablestomeasureseasonality 9.1Quantifyingalineartrendinatimeseries usingregression TheplotbelowshowslifeexpectancyatbirthforCanadians. 87 Canadianlifeexpectancy This is pretty much a straight line as one might expect in adevelopedcountry withalargepercapitaexpenditureon healthcare.Notethatthisisforbothsexes: forwomenonlywemightexpecttoseeevenbetterfigures.Irantheregressionwith Yearastheindependentvariable,explaininglifeexpectancy.Theresultisbelow. Lifeexpectancyregressionresults NoticethattheadjustedR-squaredvalueisclosetounity,reflecting thestraightlinethatweseeonthegraph.Thecoefficients fortheinterceptandthe independentvariableprovidethisestimatedregressionequation ^y t=58 . 47+0 . 000565∗Year The meaning is that for every year after 1960 that a child was born, his/her life expectancyincreasedby0.000565years,orabout5hours. 9.2Measuringseasonality ThedatasetforTVsales(fromModernBusinessStatistics byAndersonSweeney andWilliams)containssalesbyquarterforfour years. There are therefore sixteen observations. IcreatedacolumnwhichIcalledindexsothatwecanseesalesina consecutive fashion.ThefirstquarterisSpringandthelastquarteriswinter.Itisapparentthat quarterswhicharedivisiblebyfourarehigherthanothers,andsoforth.Perhapssales arehigherinwinter? WecanusethedummyvariablemethodofChapter5todeterminewhetherthisis trueandalsotheextentofthedifference.WewillcreatedummiesforSummer,Fall andWinter,leavingSpringasthereferencelevel.Recallthatwehavefourpossible statesoftheseasonvariable,andsowewillneedk-1,or4-1=3dummies.Itusually doesn’tmatter which state you choose as your reference level: just don’t forget which one you picked. The dataset is below, with a column called Index, which we’llusetomeasurethetimetrend. TVSaleswiththequarterlydummies Iwanttofindouttwothings: whethersalesareincreasingovertime.Icandothisbyincludingtheindexasan explanatoryvariable.Becausetheregressiontoolrequiresthatalltheindependent variablebeinoneblock,IhavecopiedandpastedtheIndexcolumntotheright. Theregressionoutputis TVSalesregressionout Let’swalkthrough this output line by line. First notice that the p values for all independentvariablesaresmallerthan0.05.Thereforeallaresignificant. ThereferencelevelisSpring,andthereforethereisnocoefficientfor this quarter. Summer has a negative sign in front of its coefficient, meaning that sales for Summer are smaller than those forSpring.Fallispositive,somoreTV sales are sold in the Fall than in the Spring. Not unexpectedly, Winter has the largest coefficientofall, reflectingtheplot.Theindexhasapositivesign,meaningthataverageTVsalesare increasingwithtime. Therearethereforetwocomponentsinthisseries:atrendand aseasonal component.Youtube¹ **Makingpredictions**fromtheseresults.Let’spredictthesalesforFallinseven quarterstime.Thisis y ˆ=4 . 7+1 . 05875+7∗0 . 147=6 . 78 Theobservedvaluewas6.8,sothepredictionwasreasonableifslightly pessimistic. 9.3Curvilineardata Thedataabovefollowedalineartrend,whichisperfectforordinary least squares, which assumes that the relationship between the dependent variable and the independent variable is linear. But some interesting data does not follow a neat lineartrend.Theplotbelowshowscrudeoilandgaspricesovertheperiod1949to 2003. ¹http://youtu.be/wyKIHInMbY8 Crudeandgaspricesshowingcrudepricecurvilinearity However,ifweshowthetwoseriesonthesamegraph,asontheplotIdidinTableau, withseparateyaxesforeachtypeoffuel,wecanseethattheshapesareveryclose. Crudeandgasondifferentaxes Thecurvilinearityofcrudepricesisclear.Thepricescametoapeak inthe1970sasaresultofOPEC’sdecisiontorestrictsupply.Inany event,crude pricescannot be described as linear. The solution is to add an extra term to the regressionof price on time, and that is time squared. The first few lines of the datasetarebelow.Ihaveaddedacolumncalledtorepresenttheyear,andaddedtsq whichisjusttsquared. Datawithtimesquared Nowregresscrudeagainsttime and time squared, and the pleasing result below appears Thecurvilinearregressionforcrude The adjusted r-squared is high at 0.88 and the two time predictors are highly significantstatistically.Noticethatthetwopredictorsarequitedifferentinsizeand alsohaveoppositesigns.Whentissmall,thenthevariabletdominatesandthetrend is upward. However, as t gets larger, then t-squared gets even larger still. The negativesignont-squaredpullstheregressionlinedown. 10. Optimization Linear Programming (LP) is the tool we use to optimize a particular objective function .Forexample,amanufacturerofcarpetswantstogetthemostprofitout of his raw materials and labor force. The objective function is the relationship between the inputs and his costs and thus his profits. Optimization helps the manufacturerfindthemostefficientallocationofresources.Theobjectivefunction doesn’tnecessarilyhavetobeintermsofmoney;itcouldverywellbetoallocate workinghoursefficientlysothateveryonehasmoretimeoff.Frequentlywewant tomaximizetheobjective function(perhapsmakemoreprofit)butwemightalso wanttominimizeanobjectivefunction,suchascosts. Optimizationisnotparticularlydifficult,andwewillusetheSolveradd-intoExcel to do the mathematically tricky parts. What is important is the writing up of a correctmodelinthefirstplace. Layoutofthechapter Linear programming is not especially difficult, but the work has to be done in a logicalandorderlyfashion.Inthischapter,wewill *spendsometimeworkingthroughthebasicsteps in setting up an optimization problem * work through the meaning of the Solver output in terms of what the coefficientvaluesactuallymean 10.1Howlinearprogrammingworks Linearprogrammingworksbysolvingasetofsimultaneousequa-tions.The problemtobesolved—tomaximizeastockreturnfor 95 example—iswrittenasasetofsimultaneousequations.Theequa- tions may be quitesimple,buttheremaybemanyofthem.LinearProgrammingfindsthebest solutiontotheequations.Thejobistofindthecoefficientsforthevariablesinthe equationwhichmaximizeorminimizetheoutcome. We’llworkthroughanexamplefirstonpaper,andthenputitintoExcel’sSolver add-intogetthesolution.Thethreeimportantstepsare: •Writethesimultaneousequations,basicallyputtingintomaththewording oftheproblem.Thisisprobablythetoughestpart. •RuntheOptimizationinSolver,tofindtheoptimalsolution. • Sensitivity analysis to find out how much effect a unit change in a constraintwouldhave. 10.2Settingupanoptimizationproblem Optimizationproblemsareusuallypresentedaswrittenquestions.Thebestapproach istodeterminewhicharethe •decisionvariablesthosevariableswhichyoucancontrol.Forexample,in thecarpetfactoryexample,thenumberofhoursgiventoeachworkeris a decisionvariable. Itisusuallythesizeofthedecisionvariablethatwearetryingtooptimize.Forthe workers,wewouldwantthemtohaveexactlythenumberofhourswhichproduces the most profitable output. Too few hours and the factory is working below capacity. On the other hand,toomanyandmoneyisbeingwasted.The decision variables gointowhatSolvercallsthechangingcells.Theconventionisto colorcodethesecellsinExcelinredcolor. • constraints are constants or givens which we cannot change. The jurisdictionwherethefactoryislocatedmighthave legislation restricting the maximum number of hours that an employee can work per day.We wouldneedtoadd aconstrainingequationtotakeaccountofthisfact,even ifthesolutionwasfinanciallysub-optimal. 10.3Exampleofmodeldevelopment. Afarmerhas50acresoflandathis/herdisposal.Healsohas up to 150 hours of labor and up to 200 tonnes of fertilizer. He can plant either cotton or corn or a mixture.Cottonproducesaprofitof$400peracre,corn$200peracre.Eachacreof cottonrequires5laborhoursand6tonnesoffertilizer.Forcorntheequivalentis3 and2.Howmuchlandshouldthefarmerallocatetoeachcrop? Let’scallacresofcottonxandacresofcorny.FforfertilizerandLforlabor.These fourvariablesarehisdecisionvariablesbecausehecanallocatecropstolandandhe canalsodecidehowmuchlabor and fertilizer to deploy. He is searching for the combinationofthefourdecisionvariableswhichmaximizeshisprofit. Weknowhehas50acres,sothetotalacreagecannotexceedthisamount.Thisisa constraint,whichcanbewrittenasanequation,asshownbelow.Otherconstraints arethemaximumlaborandfertilizer.Obviouslyhecannotusemorethanhehas. 10.4Writingtheconstraintequations Thetotalacreagedevotedtoeachcropcannotexceed50: x+y≤50 Thelaborandfertilizerarealsoconstraints.Referbackto thequestiontogetthe amountsoflaborandfertilizerperacrepercrop. Forlabor,theconstraintequationis: 5 x+3 y≤150 This equation comes about because for each acre of cotton, the farmer needs 5 hours of labor,while for corn it is 3. Wedonotknowthevaluesofxandy,but whatevertheyare,weknowthat 5xplus3ymust be less than or equal to 150, becauseweonlyhave150hoursavailable. Forfertilizer,theconstraintequationis: 6 x+2 y≤200 10.5Writingtheobjectivefunction Whatdowewanttogetoutofthis:whatistheobjective? Clearly it is the most efficientmixtureofinputs,subjecttoconstraints,producingthelargestprofit.The objectivefunctionis: MaxProfit=400x+200y Inwords,findthecombinationofxandythatmaximizestheprofit,subjecttothe constraintswehavewritten.ThenexttaskistoputallthisintoSolver. 10.6OptimizationinExcel(withtheSolver add-in) WewilluseanExcelspreadsheetandSolvertoachieveallthreesteps. OptimizationYouTubeforthefarmerproblem¹ ¹https://www.youtube.com/watch?v=WeTgK6wmvSY OpenanExcelspreadsheetsothatyoucanfollowalong. Cellcoloringconventions Isuggestyoufollowtheseconventionsforcolor-codingthecells: •Inputcells.Thesecontainallthenumericdatagiveninthestatementofthe problem.ColorinputcellsinBLUE. •Changingcells.Thevaluesinthesecellschangetooptimizethe objective.CodechangingcellsinRED. •Objectivecell.Onecellcontainsthevalueoftheobjective.Colorthe objectivecellinGREY. NowI’llworkthroughthefarmingproblemabovestepbystep. MakesureyouhavetheSolveradd-inloaded.Tocheck,openExcel,thenclickon theDataTab.IfSolverisloaded,youwillseetheSolvernametotheright. TheSolvertab BelowistheExcelspreadsheetwiththeinformationthatweknowalreadytypedin. Ihaveputrandomvaluesinthe red-coloredchangingcells,justasplaceholders. Thesenumberswill changewhenExcelsolvesforthemostprofitableallocation. •TypetheaddressoftheobjectiveintotheSolverdialoguebox,andmake suretheMaxradiobuttonisselected. •Typeintherangeofthechangingcells. •Workthroughtheconstraints.Therearethree:labor;fertil-izer;andland. •SelectSimplexLPastheSolvingmethod. •Presssolve. Thefarmingsolution Solverwillchangethevaluesinthechangingcellstomaximizetheobjectivecell.I foundthat30acresofcottonandnoneofcornprovidedaprofitof$12000. 10.7Sensitivityanalysis Constraintscanbeeitherbindingorslack .Ifaconstraintisbinding,thatmeansthat allofthatparticularresourceisbeingused,andmorecouldbeemployedifitshould becomeavailable.Wecanfindoutwhetherconstraintsarebindingandalsotheir shadow pricebypressingSensitivityAnalysisafterrunningSolver again.TheSensitivity analysis will appear as a tab at the bottom of your worksheet. For the farming problem,itlookslikethis: Farmingsensitivityreport Let’sletattheConstraintssection.Laboruses150hoursandhasashadowpriceof $80.Theshadowpricesindicatestheper-unitvalueoftheconstrainedcommodityif theconstraintwasincreasedbyoneunit.Ifwecouldhaveonemorehouroflabor, thentheprofitwouldincreaseby$80.Landandfertilizerhaveashadow pricesof zerobecausetheconstraintisslackornon-binding.We are not completely using theseresourcesandsowedonotneedanyextrainputs.Therearesacksoffertilizer lyingaroundunused.Noneedtobuymore. The columns AllowableIncrease and Allowable Decrease are rele- vantbecause theytellushowmuchmoreofabindingconstraintwecouldusebeforerunningthe modelagain.Forlabor,theamountis16.67(rounded).Ifweeasedtheconstraintby morethanthisamountthenwewouldneedtorunthenewmodelinSolveragain. Thesamelogicfordecrease. 10.8InfeasibilityandUnboundedness Solverisquiterobust,buttwoproblemsmayoccur.Asolutionto anoptimization problemisfeasibleifitsatisfiesalltheconstraints.Butitispossiblefornofeasible solutionto exist. This occurs if you makeamistakeinwritingthemodel,orthe modelistootightlyconstrained. Unboundedness Unboundedness occurs when you have missed out a constraint. There is no maximum(orminimum).Trychangingallconstraintsto>=insteadof=<. 10.9Workedexamples Thedemandingmother Mostmothersarekeentokeepcontactwiththeirchildren.Oneparticularmotheris ratherdemanding.Sherequiresatleast 500minutesperweekofcontactwithyou. Thiscanbethroughtele-phone,visitingorletter-writing(yes!Somepeoplestilldo that!).Herweeklyminimumforphoneis200minutes,visiting40minutesandletters 200minutes.Youassignacosttotheseactivities.Forphone:$5perminute;visiting $10perminute;letter-writing$20perminute.Howdoyouallocate your time so thatyourcostistheminimum? Writetheequationsfirst.Whatistheobjective? Letxstandforminutesofphone;yminutesofvisiting;and zminutesofletters. Youwanttominimizethecostoftheseactivities, so the objective is to minimise 200x+40y+200z.Theconstraintsare:x+y+zmustbemorethanorequalto500. Spreadsheetfordemandingmother Noticethatwewanttominimizethetime,sochangetheradiobuttontominrather thanmax.Andwhenyou are typing in the constraints,checkthatthesignofthe inequalityistherightwayround. Thetheatermanager Youarethemanagerofatheaterwhichisinfinancialtrouble.Youhavetooptimize thecombinationofplaysthatyouwillputontomakethemostprofit.Youhavefive playsinyourrepertoire,A,B,C,D,E.Theyhavedifferentdrawpoints(appealtothe public)andthereforeticketpricestomatch.Adrawpointishowattractivetheplay istothegeneralpublic.Thedatalookslikethis: Play Draw Ticket A 2 20 B 3 20 C 1 20 D 5 35 Play Draw Ticket E 7 40 Youhavetheseconstraints:youhaveonly40possibleslots.Andnoplaycanbeput onlessthantwiceormorethantentimes(givestheactorsareasonableturn).Each performance costs $10 per ticket sold, regardless of the ticket price (covers the electricityetc).Thetotaldrawpointshastoexceed160tokeepthecriticshappy. Howdoyoudistributeyourperformances?Whichcombinationofplaysproduces thehighestprofitandsatisfiestheconstraints? Wearetryingtomaximizeprofit,solookforanobjective function thatdoesjust that:20A+20B+20C+35D+40E.Butwait—-wehavethe$10 cost. So better takethatoutfirst,leaving10A+10B+10C+25D+30E. Withthese constraints: Thedrawpoints(theattractivenessoftheplaytocritics):2A+3B+1C+5D+7E >=160 andnoplaycanbeputonmorethantwiceormorethantentimes.A<=2andA<= 10B<=2andB<=10C<=2andC<=10D<=2 andD<=10 andwehaveonly40slots,soA+B+C+D+E<=40 Spreadsheetfortheatreproblem Moreworked examples 19thcenturyfarmer Iama19thcenturyEnglishfarmer.I can grow wheat or barley. Wheatyields10 bushelsanacre,barley8bushelsanacre.Thepriceofwheatistenshillingsabushel, barley5shillingsabushel.Thelabourcostsforwheat are 3 shillings an acre, for barley2shillingsanacre.Thetransportationcosttomarketforwheatis1shillinga bushel,forbarley½shillingabushel.Ihave100acres.(abushelisameasureof volume;anacreisameasureofarea;ashillingisacurrency unit). Questions:HowdoIsplitupmyland? IfIcouldgetonemoreacreofland,howmuchwouldthatbeworthtome? Yetanotherfarmingquestion(Iusethesebecausemostpeoplecanimaginefieldsof land,cropsgrowingandthelike.Butofcoursethesametechniquesareapplicable tootherbusinesssituations). Afarmerhas100acresofland.Hecanplantcropsorraisesheep. Each hectare of cropsprovides$100butrequireslaborof$50peracre.Hecanspendamaximumof $500oncroplabor.Eachacreofsheepprovides$40but needs only $10 in labor charges.Thelaborbudgetisunlimited. Worked solution CallareaincropsXandin sheep Y. Then X + Y =< 100 and 50X <= 500 The objectiveis100X-50X+40Y-10Ysimplifiesto50X+30Y.MyExcelspreadsheet isbelow. Spreadsheetfortheaboveproblem 11. Morecomplex optimization Optimization using Solver is a powerful method of solving common business resource-allocationproblems.Inthepreviouschapter theproblemswererelatively simple concerning the allocation of land ortime.Butlinearprogrammingcan be usedtosolvemorecomplexandworthwhileproblemsaswe’llseebelow.Thekey requirementisthattheanalystisabletodefinetheproblemasasetofequations. Thereisnoonemethodexceptforthinkingcarefullyand writingouttheproblem asasetofequations. Todemonstrate,we’llworkthroughthreedifferenttypesofprob-lem: *problems concerning proportionality: you need to allocate money to different investments while minimizing risk and keeping returns above a certain amount. Whatproportiondoyouputineachinvestment? *supplychain problems *blendingproblemswhereyouneedtomixtogetherinputsfromdifferentsources 11.1Proportionality Investmentdecisions example Thisexampleshowshowwecan‘weight’theinputsaccordingtosomecriteria,in thiscasetheirrisk.Wewanttominimizetheriskbutensurethatthereturnisabove someminimumlevel.Whatisthemixtureorblendofinvestmentsthatcandothat? 107 The problem: you have an inheritance of $300,000 from an uncle but there are somerestrictions:youmustinvestallthemoneyinfourfunds;yourannualreturn hastobeatleast5%.Andyoumustminimizeyourrisk.Thefourfundsyouruncle hasspecifiedare: Fund Risk Return X1 10.7 4.2 X2 5 4 X3 6 5.6 X4 6.2 4 Let’sdealwiththe objective function first. That is to minimize the risk. We can weightthesize of the investment in each fund with its risk index. So, weighting eachinvestmentandthendividingbythetotalinvestmentgivestheamountofthe risk:whichisexactlywhatwewanttominimize.SowewanttominimizeR(for Risk)likethis: R=10 . 7 X 1+5 X 2+6 X 3+6 . 2 X 4 300 ,000 Ifyoudon’tgetthis,thinkaboutwhatwouldhappenifallthefundswereasrisky asX1:thetotalriskwouldincrease.Anotherwaytothinkofthis:ifwedecreasethe numberofsharesinX1andinsteadincreasethenumberofX4,whatwillhappen? Theriskwilldecrease. Theconstraints:thetotalinvestmenthastoaddupto$300,000,sothisconstraintis X 1+X 2+X 3+X 4 =300 ,000 Wealsohavetoachieveareturnofatleast5%(noeasymatter these days). This constraintis 4 . 2 X 1+4 X 2+5 . 6 X 3+4 X 4 ≥5 Belowismyspreadsheetfromthisproblem: Investmentblending 11.2Supplychainproblems Workingouthowmuchtoshipfromproductioncenterstodemand locations is a commonprobleminsupplychainoptimization.AsIhavebeenstressing,thekeyto solvingproblemsofthistypeistowriteouttheequationswhichdefinethemodel. Hereisanexample: You run a company which has bakeries in location A and B. The bakeries ship cartons of bread to your retail stores at locations X, Y, Z. The bakeries have different capacities and each retail store has different demands. The costs of deliverypercartonfromeachbakerytoeachstoreisbelow Bakery X Y A 12 13 11 150 17 17 200 100 90 B Demand 9 50 Z Capacity Notation:adeliveryfrombakeryAtolocationXisAxandsoforth. Theobjectivefunctionistominimizethedeliverycosts.Sowewant tominimize: 12Ax+13Ay+11Az+9Bx+17By+17Bz Eachbakeryhasafixedcapacity,soAx+Ay+Az<=150andBx+By+Bz<=200 Supplyhastoexactlymatchdemand Ax+Bx=50Ay+By=100Az+Bz=90 Noticethestrictequalitysign.Myresultsarebelow: Distributionproblem 11.3Blendingproblems Abovewediscussedlinearprogrammingmodelswhichwere simplebuteffective. There are other types of Optimization model which are helpful, especially BlendingModels,whichcanalsobesolvedbylinearprogramming. Blendingmodelsareusedinsituationswherewehavetwoormoreinputswhich havetobe mixed to some formula. Through Optimizationwe canfind the most profitablemixture.Wine,metals,oil,sausages,recycledpaper—-thisisapowerful technique. Youcouldprobablyuseitformarketingcampaigns.We’llworkthough anexamplewhichisforoil. Theoilblendingproblem Theproblem:anoilcompanyhas15000barrelsofCrudeoil1and20000barrelsof Crudeoil2onhand.Thecompanysellsgasolineandheatingoil.Theseproducts aremadebyblendingtogetherCrudeoil1andCrudeoil2.EachbarrelofCrudeoil 1hasaqualitylevelof10,andeachbarrelofCrudeoil2hasaqualitylevelof5.The gasolinethatweproducemusthaveaqualitylevelofatleast8.Theheatingoilmust haveaqualitylevelofatleast6.Gasolinesellsfor$75abarrel,heatingoilfor$60. Howcanweblendtheoils together in such a way that meets minimum quality requirementsandmaximizesprofit? Oilblendingsolution First,let’sthinkthroughwhatthedecisionvariables(whatgoesintothechanging cells) might be. You might very well think (as I did first off) that the decision variableswouldbetheamountsofthetwooilsusedandtheamountsproduced.But thisisn’tenough:wehavetoblendtogetherthetwotypesofoil.Theyhavetobe mixedbeforetheycanbesold,andthemixturehastoreachsomeminimumquality standard.Thecompanyneedsablendingplan. Theinputs : selling prices (here gasoline = $75, heating oil = $60) availability of oil from suppliersqualitylevelofcrudeoils:Crude1is10andCrude2is5 Theconstraints Gasolinequality>=8Heatingoilquality>=6QuantityofCrude1 =15000QuantityofCrude2=20000 Theblendingplan : Gasolinehastohaveaminimumqualityofatleast8,andheating oilmusthavea minimumqualityofatleast6.TheCrudeoil1we haveonhandhasaqualitylevelof10whileCrudeoil2hasaqualitylevelof5.We wanttoblendthesetwocrudeoilstobothachievethe minimumqualitystandards andmakethegreatestprofit. Let’sattacktheproblembycreatingtotal‘qualitypoints’(QPs)whichrepresentthe qualityofoilinabarrelmultipliedbythenumberofbarrelsofthatoil. Writeequationstocalculatethequalitypoints: TotalQPsinthegasoline=10*amountofOil1+5*amountofOil2 Ifforexample,wemixedtogether50barrelsofCrudeoil1and40barrelsofCrude oil2,thetotalQPswouldbe50x10+40x5=700.Theaverageperbarrelwouldbe 700/90=7.78.Thisistoolowforgasoline(needs8)butacceptableforheatingoil. Twopoints: we could sell the oil as heating oil, but it is exceeding the minimum quality requirement.Wecouldmakemoreprofitbyreducing thequalitytotheminimum, orchargeapremium.Butthisisprescrip-tivework,andwehavetoworkwithinthe inputsgiventous). TheonlywaytogettheoiluptogasolinestandardsistoincreasetheamountofCrude oil1inthemixture. Ifyoudon’tgetthis,tryathoughtexperiment:forgasoline,iftherewasnooilatall fromOil2,whatwouldbetheQP?Itwouldbe10*thequantityofoilfromOil1. Again,howaboutifweblendedtogether1000barrelsfromeachtype of oil: how manyQPswouldbeproduced?Itwouldbe10*1000+5*1000=15000.Read thisthroughagain…itisimportant. The blending plan provides us with the constraints we need to ensure that the minimumqualitylevelsareachieved.Intheexample justabove,weblended2000 barrelstoprovideaQPof15000.Thisisanaverageof7.5.Goodenoughforheating oil,notgoodenoughfor gasoline. Oilproblemsolution Discussionofthesolution:notethatgasolinesellsformoremoneythanheatingoil, buttheoptimalsolutionsuggeststhatweshouldsellmoreheatingoilthangasoline. Thisisbecauseoftheconstraintsonquality. 12. Predictingitemsyoucancountone byone Why you need to know this . Many business decisions involve counts: either binary(yes/no)orwithintimeorspace.Itwouldbegoodtoknowtheprobabilityofa certainnumberofcustomerscom-inguptoaservicedeskinacertainlengthoftime; ortheprobabilityofacertainnumberofcaraccidentsatagivenintersection.Weare looking for a discrete probability distribution; discrete because the number of occurrencesisaninteger. Chapter4onregressionshowedhowtopredictthesizeof anoutcomewhichwas continuous(moneyortimeperhaps).Thedependentvariable—whatweweretrying topredict—couldtakeonalmostanyvalue.Nowwewanttopredictprobabilities for the occurrence of an independent variable which is an integer. Below we’ll work through some hands-on applications, with the theory available in the Glossary. We’llbreakthisintotwoparts: Theprobabilityofa binaryoutcome .Theprobabilitiesoftwofaultyitemsoutof the next twenty on the production line. Or, the probability of at least five customersoutofthenext fiftyactuallybuyingsomething.Thisis estimated with thebinomialdistribution . The probability of a particular count of occurrences over time or area. The probabilityofthreeorfewerpeoplearrivingatyourCustomerServiceDeskwithin thenexthalfhour.Ormorethantwomistakesinthenexttenlinesofcode.Thisis estimatedwiththePoissondistribution . 114 12.1Predictingwiththebinomial distribution Afterthenormaldistribution(seetheGlossaryforadefinition),thebinomialisthe mostimportantinstatistics.Themathforthebinomialdistributionisalsodefinedin theGlossary. The binomial distribution provides the probability of a ‘success’ in a certain number of ‘trials’. For example, you can calculate theprobabilityof more than sevenoutofthenexttwentypeople throughthedoor actually buying something. Herea‘success’issomebodybuyingsomething;whilethenumberof‘trials’isthe numberofpeoplecomingthroughthedoor(hereitistwenty). Some definitions: Whatwearelookingforistheprobabilityofapre-definednumberof‘successes’. Notethatthedefinitionofsuccessisuptotheanalyst.Itcould be ‘spending more than$20’orwearingahat. Intheexamplewe’llworkthroughbelowyourunastore.Youknowthatthelong-run probability of somebody buying something is 0.6. You obtained this number by countingtotalnumbersofcustomersand,outofthose, the numbers who actually boughtsomething(successes). Workedexample:Youwanttoknowtheprobabilityofexactlythreepeoplethrough thedooroutofthenextfivebuyingsomething.Notethat‘success’justmeansthat thedefinedeventhappens.Whetherornotitisa‘goodthing’isn’trelevant. ForExcel,theargumentsrequiredare:therandomvariablewhose probability we want to predict; number of trials, the long-run probability, and a true/false statement.Intheexample, the number of trials is five. The random variable (X) whoseprobabilitywewanttopredictis3.Wealsoneedthelong-runprobabilityof asuccess.Intheexample,p=0.6.Wealsowanttheprobabilityofexactly3,souse thefalsestatement.I’llgointothisinsomedetailshortly. Open an Excel spreadsheet, and type in =BINOM.DIST(3,5,0.6,false) and you should get this: 0.3456. This is the probability of exactlythreepeopleoutofthe nextfivebuyingsomething(numberofsuc-cessesoutofthenextfivetrials).Notice theorderoftheargumentsintheExcelfunction.Andespeciallythelastone,which intheexampleaboveisfalse.(Thealternativeistrue).Thedifferenceisimportant becausetheresultsarequitedifferent. Trueandfalseargument .Definingtheargumentasfalseprovidestheprobability of exactly the random variable. Defining the argu- ment as true provides a cumulativeprobability. Excel adds up probabilities from the left, so changing the argu- ments to =BINOM.DIST(3,5,0.6,TRUE)=0.66304isthesumoftheprobabilities of X=0 + X=1andsoonuptoanincludingX=3.Theprobabilityof0.66304isthereforethe probabilityofthreeorfewercustomersbuyingsomething.Thetablebelowshows theprobabilitiesofvariousvaluesofX,both‘false’and‘cumulative’.As you can see, the cumulative is just the continued addition of each successive probability. Thereisafurtherexamplehere¹ Probabilitiescalculatedbothfalseandtrue Howaboutmorethanthreecustomersbuyingsomething?Inmath notation, we’re lookingforP(X>=3).Weknowthat probabilitiesmustsumupto1.Weknowthat threeorfeweris0.66304.Somorethanfivehastobe:1-0.66304=0.33696. ¹https://www.youtube.com/watch?v=oZ1DmNQ8wW4 Aslightlyharderexample:theprobabilitythatatleastfourcus-tomersoutofthe nexttenwillbuysomething?We’relookingfor P(X >=4). Look carefully at the notation.X>=4impliesthatthedistributionofthetencustomersissplitintotwo halves:lessthanfourandfourormore.Itisthelatterhalfwhoseprobabilitywe’re lookingfor. Ifwecanfindtheprobabilityof0+1+2+3,thatistheprobabilityoflessthanfour (weonlywant integers here). So find that probability and then subtract from 1, makinguseofthefactthatprobabilitiesmustsumupto1. Let’sdothisstepbystep. First,findP(X=<3),thatistheprobabilitythatXisthreeorless: =BINOM.DIST (3,10,0.6,true) = 0.054762. Notice the ‘true’ which givesusthe cumulativeprobability. Wewantfourormore,sosubtractfrom1likethis:1-0.054762=0.945238. Anotherexample.It’swinterandyouneedtoweara sweatereveryday.Youhave twoblueandthreeredsweaters.Calculatetheprobabilitythatduringtheweekyou willweararedsweater: Exactly twiceintheweek Answer:thelong-runprobabilityofpickingaredsweateris3/5 =0.6becauseyouhavefivesweaters,andthreeofthemarered.Thenumberoftrials is7becausethereare7daysinaweek.The wording of the question contains the word ‘exactly’ which means that we don’t want a cumulative answer, so we’ll include the FALSE argument. Therefore the answer is =binom.dist(2, 7, 0.6, FALSE)=0.077414 Morethanthreetimes.Whenyouseewordssuchas‘morethan’that’sacluethat you’relookingforacumulative probability.In math notation, we’re looking for P(X>3). So if we find the cumulative probability up to and including two, and thensub-tractfromone,we’redone.Thecumulativeprobabilityoftwo or fewer is P(X=0) + P(X=1) + P(X=2). So the answer is = 1 binom.dist(2,7,0.6,true)= 0.903744 Thekeys: Writeoutwhatyouaretryingtopredictinmathnotation.Thisforcesyoutobeclear. Drawalittlesketch(hopefullybetterthanmine!)ifyougetconfused. Ifthewordingofyourproblemcontains‘exactly’orrequirestheprobabilityofjust oneparticularoutcome(egP(x=3))thenyouwanttousefalse. 12.2PredictingwiththePoisson distribution Thebinomialdistributiongaveustheprobabilityofabinaryoutcome(yes/no)out of a certain number of trials. The Poisson gives us the probability of a certain numberwithinaspecifiedtime-frameorarea. Here’stheexample. You run the Customer Service Desk. You want to know the probabilityoffiveorfewercustomersarrivinginthenexthalfhour.Thatwouldbe usefulforstaffing,wouldn’tit?Youknowthattheonaverage,20customersarrive everyhalfhour.Youknowthatbecausehavecountedthem. Wewant:P(X>=5).Likethebinomialabove,thePoissonhasatrue/falseargument forwhetherwewantcumulativeprobabilitiesor‘exact’. The notation include a > sign, which implies cumulative, so we use TRUE. In Excel: =POISSON.DIST(5,20,true).Theansweris7.19088E-05. This looks a bit weird, butitisjustmathnotation.TheE-05meansthatyoushouldmovethedecimalpoint 5placestotheleft.Sotherearefourzeroesinfrontoftheleading7.Inotherwords, anextremelysmallprobability.Perhapssendsome staff out for lunch? It is very smallbecause5isalongwayfromyourlongrunaverageof20. 13. Choice under uncertainty Virtually all business decisions involve at least some degree of uncertainty. For example,youdon’tknow(andpresumably canneverknow)somefuture‘stateof nature’whichmightbeanexchangerateorbusinessrental.Afarmerdoesnotknow theprice of wheat at harvest time, but has to decide how much to plant many months ahead. His decision is relatively simple compared to a decision-maker confrontedwithanopponentwhowill reacttotakeadvantageofwhateverdecision youdomake.Thefirstsituationiscalled‘non-strategic’andiseasilyhandledby theexpectedmonetaryvaluemethodwhichwe’llworkthroughinthechapter.The second‘strategic’situationismorecomplicatedbutcanbesolvedbyanapplication ofgametheory.We’llcovertheEMVmethodinsomedetailbelow.Gametheoryis anenormoussubjectandwedon’tcoverithere,butI’llgiveanexampleattheendof thechapteralongwithsomerecommendedreading. 13.1Influencediagrams Aninfluencediagramisasketchoftheinfluencesandoutcomesofadecision.The influencediagramallowsustothinkthrough and depict the forces that influence ourchoiceswithouthavingtoassignprobabilities.Itiscustomarytouseovalshapes forvariableswhichareuncertain,aboxforadecisionnode;andlozengesforthe outcome. 119 VacationInfluencediagram Youareanoutdoors-typeandsothechoiceofactivityisaffectedbytheweather.The weatherforecastisaffectedbytheweathercondition (one hopes!). But you don’t knowtheactualweathercondition,justtheforecast(whichiswhyitisaforecast). Youchooseyouractivity(box)basedontheforecast.Theamountofsatisfactionyou get (lozenge) depends on the activity you chose AND the actual weather condition:observethetwolines leadingtothelozenge.Thereisastandardformat: 1.DecisionNodesarepresentedinrectangles.Intheexampleontheslides, thedecisioniswhatactivitytopursuewhenonvacation(climbamountain? readabook?).Whichchoicewemakemaytosomeextentbedeterminedby theweatherforecast. 2. Uncertainty Nodes are presented in ovals. The weather fore- cast is uncertainandsoistheactualweatheritself. 3.OutcomeNodesarepresentedinlozenges.Theoutcomeinthevacation exampleishowmuchsatisfactionyoutookfrom thechoiceyoumadefor vacationactivity.Thisisautility. 4. Arcs display the flow of the influence. The actual weather condition affectstheweatherforecast,whichaffects yourplanning.Noticethatyou are taking a decision on activity in advance of knowing what the actual weatherwillbe.Thatiswhythereisnoarcbetweenweatherconditionand vacationactivity. YoucandrawinfluencediagramsquiteeasilyinExcelusingtheshapesdrop-down tool. Therearedifferentoutcomes here, which we can be represented as payoffs. The highestpayoffcomesfrommatchingyourchoiceofactivitytotheweatherforecast andthenfindingthattheforecastmatched reality. 13.2Expectedmonetaryvalue theexpectedmonetaryvalue,orEMV,isjustthevalueofsomeout-comegivena particularstateofnature,multipliedbytheprobabilityofthatstateofnature.Here,I am using the customary expression ‘state of nature’ not necessarily to refer to anythinginthenaturalworld,buttoaparticularsetofconditions. Thestateofnaturehastobemutuallyexclusivefortheprobabilities towork.For example, if we have separate probabilities for rain and sunshine, then cannot calculateforrainANDsunshine.Thepayoffswhichresultfromeachstateofnature aredifferent,dependingonthestateofnature. Let’smotivatethiswithanexampleusingafriendlyfarmerasthedecision-maker. Hehasthechoiceofplantingwheat,raisingcattle,orsomemixtureofboth.Those threepotentialdecisionsrepresenthisrangeofchoices.Fromexperience,heknows howmuchhecanexpecttoreceivedependingontheweatheratthetimeofharvest. Thatamountiscalledthepayoff . Hefacesthreestatesofnature,representingdifferentweatherconditionsatharvest. These are rain,clear, andsunny. For each choice and for each state of nature he knowsthepayoffswhicharerepresentedinthecellsbelow. Thepayofftableforthefarmer Therearethreedecisionsandthreeoutcomes,makingatotalof9possiblepayoffs. Payoffscanbenegative,andarenotalwaysintermsofmoney.Theycouldbetime, oranyother appropriate metric. In the rows we put the choices available to the decision- maker.Inthecolumnsweputthe‘statesofnature’whicharethe future eventsnotunderthecontrolofthedecision-maker. The farmer calculates the probability of each of the three outcomes by working througholdweatherrecords.Hefinds: probabilityofrainingis0.4probabilityofclearis0.3probabilityofsunnyis0.3(of coursetheseprobabilitiesmustalladdupto1.Therecouldbemanymore‘statesof nature’butIhavekeptthemtothreeforclarity. TheExpectedValueisthesumofeachprobabilitymultipliedby the outcome. In mathnotationthisis: ∑ EMV = p x∗x which in words is: the expected monetary value is the sum of all the outcomes multipliedbytheirprobability.InExcel,usetheformula =sumproduct(array1, array2). This is a mean, or average, with each outcome weightedbyitsprobability.Indecisionsinvolvingmoney, weusuallycalltheExpectedValuetheExpectedMonetary Value(EMV).Forthe farmer,theEMVsareasbelow: TheEMVsforthefarmer Imultiplied(horizontallybydecision)thepayoffbytheprobability ofthatpayoff andthensummed.Usuallywechoosethe decision with the highest EMV, so the farmershouldchoosewheat.ExpectedvalueYoutube¹ Note that a ‘good’ decision is making the best decision at the time with the informationtohandatthattime.Theremaybeunluckyconsequencesbutprovided theanalysthasdoneathoroughjobin selecting the optimal outcome at the time, thensheshouldnotbeblamed! Itwillbehighlyunusualforanoutcomeof30toactuallyoccur.TheEMVisjusta weightedaverageandnotaprediction. 13.3Valueofperfectinformation The farmer might be tempted to ask for advice from a consultant. What is the maximumheorsheshouldpayfortheadvice?Itisthedifferencebetweenthebest case that the farmer calculates and how much he would receive with perfect information.WecalculatethevalueofperfectinformationbycomparingtheEMV ofthebestchoiceineachstateofnaturewiththeEMVwithoutinformation.Ifyou knewthattheweatherwouldbepoor,youwouldselectCattle; clearwheat;sunny wheatagain.ThenfindtheEMVwithperfectinformationbymultiplyingthesebest optionsbytheir probabilities. ¹https://www.youtube.com/watch?v=fR7cBts1C1w Theworkingis:100.4+300.3+70*0.3=4+9+21=34.Thebestwecouldcomeup withoutperfectinformationwas30,sothevalueofperfectinformationis34-30= 4.Soifsomeoneofferedyouperfectinformationforaprice,thehighestthatyou wouldpayfortheperfectinformationwouldbenotmorethan4. 13.4Risk-returnRatio There are other ways of choosing the optimal outcome other than the EMV, althoughtheEMVremainsthemostcommonlyused.Onecommonmeasureisthe ReturntoRiskRatio ,whichprovidesthedollarsreturnedperdollarputatrisk. Foreachdecision,wedivide theExpectedValueofthedecisionbytheStandard Deviationoftheoutcomeforthatdecision.Weusuallychoosethedecisionwiththe highestRRR,becausethenthedollarreturnforeachdollarputatriskishighest. Thefarmer’sworksheetisbelow. Mixedhasazerobecausethepayoffsareallthesame.Thereisnorisk.Cattlehasthe highestat 0.715. This is higher than the RRR for wheat although wheat has the highestEMV.Thefarmermightwanttothinkmoredeeplyabouttheprobabilityof sunnyweatheratharvesttime,becausethismakesahugedifference. 13.5Minimaxandmaximin A non-probabilistic approach to making choices is the use of maximin and maximax.You can see that your choice depends on whether you are pessimistic (maximin)oroptimistic(maximax).Ifwecansomehowobtainprobabilities,thena probabilistic approachworksbest. 13.6Workedexamples Yo uare planning to market a new coffee-flavored drink. You have a choice betweenpackingthedrinkinareturnableornon-returnablepackaging.Yourlocal governmentisdebatingwhether non-returnable bottles should be prohibited. The tablebelowshowsyourprofits.Ifthenon-returnablelawispassed,youwillstillget somesalesfrom exports. Decision Lawpassed Lawnotpassed Returnable 80 40 Nonreturnable 25 60 Example1 1. What would be the best decision based on maximin and maximax criteria? 2. Alobbyisttellsyouthattheprobabilityofthegovernmentbanningnonreturnablesis0.7.Assumingthelobbyistisright,whatisthebestdecision basedontheEMV? 3.Atwhatlevelofprobabilitywouldyourdecisionchange? 4.Howmuchwouldyoupayforperfectinformation? Answers: 1. Maximin:theworstare25and40.Thebestis40, meaningpackagewith returnables.Maximax:thebestare80and60.Againreturnables. 1. TheEMVforreturnableis800.7+400.3=68.Fornonreturn-ableitis 250.7+600.3=35.5.UnderEMV,youshouldusereturnablebottles. 2. Setthisupasequation,withptheunknownwhichhastobalanceboth sides80p+40(1-p)=25p+60(1-p).Solvefor pandget0.267.Soifyouhearrumoursthatsuggesttheprobabilityis comingdown,thinkagain? 3. Ifwehadperfectinformation,theEVPIis800.7+600.3=74.Soyou shouldnotpaymorethan74-68=6 Example2 Yourunabank.Acustomerwantstoborrow$15,000for1 year at 10% interest. Youbelievethatthereisa5%chancethatthecustomerwilldefaultontheloan,in whichcaseyouwillloseallthemoney.Ifyoudon’tlendthemoney,youwillinstead placethe$15,000inbondswhichreturn6%butarerisk-free. 1.WhataretheEMVsofloanandnotloan? 2. Youhave a credit investigation department which can help you with moreaccuracyintheprobabilityofdefault.Whatisthemostyoushouldbe willingtopayfortheiradvice? 3. Calculatethelevelofprobabilityofdefaultatwhichlendingthemoney andinvestinginbondshaveequalEMV. 14. Accountingforriskpreferences What this chapter is about: the previous chapter showed howwe could pick the mostattractivechoicegivenpayoffsand probabil-ities.Butitmighthavestruck youthattherewaslittlediscussionoftherisksinvolvedandhowdifferentpeople relate to risk. In this chapter we are going to work through picking optimal decisions,takingintoaccounttheriskpreferencesofthedecisionmakers. Hereisaquestionforyou:youaregivenalotteryticketwhichhasa0.5probability ofwinning$10,000anda0.5probabilityofzero.TheEMVistherefore10,0000.5 + 00.5 = $5000. Someone comes along and offers you $3000 for the ticket guaranteed. Would take the sure thing? Or would you hope that you win the $10,000?Most peopleare‘riskaverse’andwouldprefertogiveupsomeof the EMVinexchangeforcertainty.The$3000(orhowevermuchitis)is calledyour Certainty Equivalent (CE) in this particular gamble. The difference between the EMVandtheCEiscalledtheRiskPremium.So RP=EMV-CE Youcanthinkofthecertaintyequivalentasasellingprice.Itisusuallysmallwhen thesizeofthegambleissmall,butincreasesasthegamblegetslarger.HereIam using the term ‘gamble’ because there are probabilities involved. The certainty equivalentisausefulconceptwhich,amongstotherqualities,allowsustocategorize approachestorisk. •Risk-averse .IfyourCEislessthantheEMVofagamble,thenyouare risk-averse (and most people are, so nothing to worry about. You’re ‘normal’) 127 •Risk-neutral .YourCEmatchestheEMV.Youplaytheaver-ages. • Risk-seeking .YouenjoythegambleinherentintheEMVand need to haveaveryhighCEbeforeyou’llgiveitup.Mostpeopleinbusinesswho are risk-seeking usually don’t last verylong.Althoughwesometimessee risk-seekingdecisionswhenyourbusinessisintroubleandahugegambleis theonlypossiblesolution Businessdecisionsareallaboutrisk,andtheprobabilities them-selvesareoften unreliableandhardtofind.Thefarmerwhosecrop- ping decision we studied in Chapter2doesn’tknow tomorrow’sweather,letalonetheweatherinsixmonths. WegenerallyprefertogiveupsomeoftheEMVinreturnformorecertaintybecause weareriskaverse.That’swhywebuyinsurance.IknowthatIwillbecoveredifI smashmycar,andIcanevenputupwiththeknowledgethatpeopleinZurichare gettingricherbecauseofmyaversiontorisk. 14.1Outlineofthechapter Here’swhatwe’regoingtodo: •Describeutilityandcalculateutilities •Showhowtousetheexponentialutilityfunctiontocalculateutilitiesbased onrisktolerance • Convertthoseutilities to certainty equivalents which we can rank and comparewiththeEMVsofthesamedecision Torank choices in terms of their certainty equivalent ratherthan their EMV,we needtheconceptofutility,whichisa numerical measure of how well a good or service satisfies a need or want. Do you prefer to be reading this book, outside eatinganice-cream,or talkingwithafriend?Youcan rank these choices quickly in your head.Youcan subconsciouslyallocateutilitynumberstothechoicesandthenpickwhicheverhas thehighestutility. Utility numbers help us to rank our preferences, but there are no units and the rankingisindividual-specific.Inotherwords,givenasetofalternatives,different people may well have differentrankingsfor them. For the moment, let’s assume thatitisjustyouwhoseutilitiesweareinterestedin.Ifwecansomehowfigureout your utilitiesforthedifferentoutcomesofabusinessdecision,wecanuse those utilitiestohelpyoumakeamorepsychologicallysatisfyingchoice. Utility Theplotbelowshowsahypotheticalutilitycurve.Theutilityvalueisonthevertical yaxisandtheoutcomeofagambleison thehorizontalxaxis.Noticetwothings: Atypicalutilitycurve *Theslopeofthelineisupwards.Thisreflectsthefactthateveryone prefersmore moneytoless • The rate of increase of the line is decreasing or slowing down. It was steepearlyon,buttowardstheenditisalmostaplateau.Thisisbecausethe valueofeachextra dollarisslightlylessthanthatof the previous dollar. Thisisthemarginalutilityofmoney. 14.2Wheredotheutilitynumberscomefrom? Therearetwo approaches. Thefirstinvolvesaskingthedecision-makerhis/herchoicebetweentwo gambles. The second uses a utility function, a mathematical function which converts two inputvariables(risktoleranceandthesizeoftheoutcome)intooneutilitynumber. Withautilitynumberinhandwecanbegintheprocessofrankingthedecisionsin termsofutilityratherthanEMV. Theintuitivemodel We’llworkthroughtheintuitivemodelfirstandthenapplythesamethinkingtothe equation model. Once you get the hang of it,theequation model is quicker and probably more accurate. We’ll use the payoff table from the farmer example, repeatedbelow. Thepayofftableforthefarmer We’llassignautilitynumberof1tothehighestpayoffandthenzerotothelowest. Thebeginningsoftheutilitytablelookslikethis: PayoffUtility 701 40 30 20 10 -10 -300 Nowweneedtofillinthegaps.Whatwe’lldoistouseeachpayoff asacertainty equivalent,andaskthisquestion: Whatvalueofpwouldmakeyouindifferentbetweeneitherreceiv-ingaguaranteed payoffofthecertaintyequivalent(inthefirstcase 40)oracceptingagambleofreceivingthehighestpayoff(70)withprobabilitypor losing30(thelowestpayoff)withprobability1-p.Let’ssaythatyouanswerp=0.9. Theexpectedvalueofthegambleis70*0.9+0.1(-30)=60. This is higher than your certainty equivalent of 40, indicating that you are risk averse.Thedifferencebetweentheexpectedvalueandthecertaintyequivalentisthe riskpremium,andistheamounttheindividualiswillingtogiveuptoforgorisk. Continuedownwards,certaintyequivalentbycertaintyequivalentandcompletethe table.I’vemadeupthenumbersjustforillustra-tion.Aplotoftheutilityvaluesis alsoshown. Payoff Utility 70 1 40 0.9 30 0.8 20 0.65 10 0.4 -10 0.35 -30 0 Farmer’sutilitycurve Ifyouwererisk-neutral,youwouldprefertotakethegambleandso wouldbean EMVmaximizer.Thechoiceoftherisk-neutralpersonisindicatedbythestraight line. Now,let’sreplacethepayoffvaluesinthefarmertablewithutilities.Wecancalculate theexpectedutilitiesinthesamewayaswefoundtheEMVs.Thetableshowsboth forcomparison.Wemultiplytheutilityofeachoutcomewithitsprobability. Expectedutilities The famer’s EMV decision would have been wheat (EMV=14)but taking into account his risk preferences, mixed gives the highest expected utility. It makes sensestospreadtheriskbetween twocompletelydifferent crops. ##Theexponentialutilitycurvemethod It is rather tedious asking so many questions, plus people get tired andconfused ratherquickly.Anattractivealternativeistousea mathematicalfunction,theexponentialutilitycurve.Thisrequires onlyoneinput variable,R,thetoleranceforrisk.Theequationisbelow: U x=1−e− x / R Here x is the payoff; Ux is the utility of the payoff x; and R istheindividuals risktolerance.Apersonwitha large value of R ismorelikelytotakerisksthan someonewithasmallerRvalue.AstheRvalueincreases,thebehaviorapproaches thatoftheEMVmaximizer.TheplotforsomeonewitharisktoleranceofR=1000is below. Actuallyperforming thecalculationstofindutilityateachlevelofpayoffiseasyusingExcel,aswe’llsee below.ButfirstwehavetodetermineR,theindividual’stoleranceforrisk. Findingrisktolerance Therearetwoapproaches Byaskingthisquestion:consideragamblewhichhadequalchances of making a profitofXoralossofX/2.Whatisthe value of x for which you wouldn’t care whetheryouhadthegamble.Inother words,whatisthevalueofxforwhichthe certaintyequivalentis zero?Theexpectedvalueofthealternativeis0.5xX-0.5*(X/2)=0.25X.Aslong as X >0 then the decision-maker is displaying risk- averse behavior, because his/hercertaintyequivalentislessthantheexpectedvalueofthegamble.Thegreater theR,themoretolerantofrisk.Inhisbook,ThinkingFast,ThinkingSlow,Daniel Kahneman notes that the ‘2’ in the denominator can vary a tad. It isn’t aprecise formula,butapparentlyitcomesclose:Indiagramform: Anal-ternativeapproach,moreusedin businessandfinance,istoemploy guidelinenumberscalculatedbyProfessorRon Howard¹,apioneerofdecision-analysis.Basedonhisyearsofexperience,Howard suggeststhattheRvalueforacompanyis: •6.4%ofnewsales •124%ofnetincome •15.7%ofequity ##Exampledecisionusingutilitymaximisation What we’ll do here is to examine a decision from both the EMV and Expected Utility angles, using the exponential utility function. Thecalculationofexpected valuesisthesame:thedifferenceisthatinsteadofmultiplyingeachmonetarypayoff byitsprobabilityandthenaddingup,wemultiplytheutility. ¹https://profiles.stanford.edu/ronald-howard The decision set-up. All money figures in millions. I run a coffee importing business.Ineedaspecialimportlicensetobringinaparticulartypeofcoffee.The equityofmycompanyis12.74.So,myRvalueis15.74%of12.74=2. Ihaveachoicebetween‘rush’and‘wait’forthepermit. IfIrush,Ihavetopayaspecialfeeof5,andthereisa50/50chanceofthepermit beinggranted.Ifitisgranted,Iwillhavesalesof8,givingmeanetprofitof8-5=3. Ifitisn’tgranted,Istillhavetopaythe5,butwillhavesalesofonly6,givingmeanet profitof2.First,findtheEMVandEUofrushing.TheEMVis0.5x3+0.5(2)=2.5. Usingtheexponentialcurveformula,theutilityof3is=1-exp(- 3/2)=0.77687. Andtheutilityof2is=1-exp(-2/2)=0.632121.TheEUforrushingis0.5(0.77687) +0.5(0.632121)=0.704496.Allfiguresareinmillions. Nowforthealternative,whichistowait.Inthisscenario theprobabilitiesarethe sameat50/50,butthecostisonly3ifitisissued,nothingifitisn’t.Thesalesarethe sameat8and4.Thenetprofitsare8-3=5and4.TheEMVis0.55+0.54=4.5.The utilitiesareU(5)=1-exp(-5/2)=0.918andU(4)=1-exp(-4/2)=0.865.TheEUis 0.50.918+0.50.865=0.892. ThetablebelowshowstheEMVsandEUs. Spreadsheetfortherushdecision Butwecangofurtherusingcertaintyequivalents,subjectofthenext section. 14.3Convertinganexpectedutilitynumberintoa certaintyequivalent. Fortunately there is an easy way to convert expected utilities back to certainty equivalents. It is then straightforward to rank the decisions by certainty equivalents. Remember the concept of the certainty equivalent? The amount of moneywhichitwould takefor you to sell your gamble? It turns out that we can convertanexpectedutilitynumberintoacertaintyequivalentusinganequation: CE=− R∗ln (1−EU) wherelnisthenaturallogarithm.Weusethisequationwherewearetalkingprofits andwantmore.Inthecaseofcosts,takeoffthenegativesigninfrontofR. AbovewefoundEU=0.704496fortherushdecisionbranch.ThatisaCEof =-2*LN(1-0.704496)= 2.43814 using Excel’s ln function. For the wait decision, the EU was 0.892, giving a certaintyequivalentof=-2*LN(1-0.892)= 4.451248 LargerCEsarepreferredwhentheoutcomesareprofits,andsmallerCEswhenthe outcomesarecosts.SoI’dwait!ThisconfirmsthedecisionmadewiththeExpected Utilities. 15. Glossary Thisisahands-onguidetodoingsomebasic analytic work with data. However, statisticshasitsownsetofvocabularyandtech-niques.Whenyouare presenting your results to other people, you may wish to follow established techniques and terminology. Also, here you’ll find more of the underlying theory if you are interestedinhowExcelgetstheresultsitdoes. Binomialdistribution The binomial distribution is used to find the probability of the occurrence of a certainnumberofevents(calledsuccesses,althoughtheycouldbeanythingclearly defined)withinanumberoftrials.Thebinomialdistributionwillgive,forexample, theprobabilityofatleasttenpartsbeingdefectiveoutofthenexthundred,provided thelong-termdefectiverateisknown.Thebinomial requires some conditions to workproperly.These are: asequenceofnidenticaltrialstherearetwooutcomesforeachtrial,denotedsuccess andfailuretheprobabilityofsuccessdoesn’tchangethetrialsareindependent. ItisquiteeasytocalculatetheprobabilitieswithExcelbutforillustrationhereis theequationwhichExceluses: f( x )= ( ) n x p x (1 −p ) (n−x) wherexisthenumberofsuccesses;ptheprobabilityofsuccessononetrial;nthe numberoftrials,andf(x) the probability of x successesinntrials.Thisnotation pointsupthefactthattheresultisafunctionofx. 137 Centrallimittheorem(CLT) Thecentrallimittheoremisfundamentalinstatistics.Whatthe theorem states is this:let’ssayyouhavealargepopulationand you keep taking samples from the populationandmeasuring themeanofeachsample.Thenyoudrawahistogramof themeanssothateachmeanisavariableinthedistribution.Inotherwords,instead of plotting each observation independently, we plot them by the means of each sample.Thehistogramwhichappearswillbeapproximatelynormallydistributed,or intheshapeofthewell- known bell curve. This is wonderful news because the normaldistributioniswellbehaved,andsowecancalculatetheareaunderanypart ofit. Correlation Youhaveapairofvariables:forexample:websitehitsand actualsales.Youwant to know whether there is a relationship betweenthevariables.Andwhetheritis ‘statisticallysignificant’,orperhapswhatyouthoughtwasarelationshipoccurred justbychance. Correlation is the study of the linear relationship between two or more random variables.Correlationanalysisprovidesboththe direction and the strength of the relationshipbetweenthevariables.Forexample,thereisarelationshipbetweenthe heightandweightofindividuals. There is apparently also a relationship between consumptionofsugarandobesityratesatthenationallevel.Cor-relationanalysis cantestwhethertherelationshipsarespuriousorreal. It’strueandworthrepeating: correlationdoesnotimply cau-sation .Itdoesnot necessarily follow that one variable causes the other even though they might be highly correlated. There are many examples of such spurious correlations, for examplenational consumption of chocolate and number of Nobel prizes won. If you do find a correlation, first think about whether you have a lurking variable discussed below. The two examples given above for height/weight and sugar consumptionmightverywellinclude causaleffects,butthisisnotsomethingthatcorrelationanalysiscanestablishonits own.Indeed,somephilosophers(such as David Hume) claim that causation can neverbedecisivelyestablished. Nowwewill: •Visualizeacorrelationwithascatterplot •explainwhythecoefficientofcorrelationworks •interpretthevalueofthecoefficientofcorrelation •testforthestatisticalsignificanceofthecoefficientofcorre-lation Scatterplotsandcorrelation Acanningfactoryhasdataonnumberofworkersemployedandoutputofcans.The datasetis‘canning’.Thefactoryisinterestedintherelationshipbetweenthenumber ofworkersandoutput.LoadthedataintoExcel. Firstfewlineofthecanningdata The columns of data represents numbers of workers andcans manufactured. For thenotationwe’llcalltheseXandY.WhenImeanthenameofthevariable,I’lluse theuppercase.Forindividualobservations,suchasx=23,I’lluselowercase.We wanttoseetherelationshipbetweenXandY.Todothisweneedtohavethedatalaid outincolumnssothateachobservationofXandYareonthesameline.It’salwaysa goodthingtovisualisetherelationshipwith a scatter plot and the scatter plot is below. Thecansscatterplot I’vedrawnaverticalandhorizontallinewhichgoesthroughthemeansofthetwo variables.Wecandividetheobservationsinto quadrants,withquadrantonebeing toprightandsoforth.Iftherelationshipispositivethenwewouldexpecttoseemost ofthe observations in quadrant one and quadrant three. If the relationship was negativethenmostoftheobservationswouldbeinquadrantstwoandfour. TherelationshipbetweenXandYisnotperfectlylinear.Ifitwas all the pairs of points would line up along a straight line. Weneedsomewayofmeasuringhow far off being in a straight line these observations are. There are two measures, covarianceandcoefficientofcorrelation.Thereissomemathsinwhatfollowsbut we’llwalkthroughitslowly. Covariance:Trythisthoughtexperiment:ifalltheYpointswereonaverticalline, thenthesubtractionofthemeanfromeachYvaluewouldgiveuszero: ( y i− y ¯)=0 I’veputinalittleifortheyvaluestoshowthatitcouldbeany ofthevaluesintheYvariable.SimilarlyfortheXvalues,ifallthepointswereona verticallinethen ( x i−x ¯)=0 Wecanseethatwedon’thaveverticallinesandthereissomedif-ferencebetween eachobservedvalueandthemean.Ifwemultiplytogetherthedifferencesandthen findtheaveragebydividing byn-1wecanfindthecovariance.Wedividebyn-1 becauseweareinmostcasesdealingwithasample.Subtracting1adjustsforthis fact.ThesamplecovarianceofXandYistherefore: S XY = ( x−x ¯)( y−y ¯) n−1 Thebigproblemwithcovarianceisthattheresultisverydependentontheunits.We can get round this problem by standardizing the differences between each observation and its mean. We do that by dividing the difference by the standard deviation of the variable. You can think of this technique as providing the differenceinunitsofstandarddeviations,so ( x−x ¯ sx andthesamefortheYvariable.Nowwenolongerneedtoworryabouttheunits.The coefficient of correlation is the covariance now between the standardised differences, not the differences them- selves. The equation for the coefficient of correlationofasampleis ∑ r XY= ( x−x ¯)( y−y ¯)( n−1) S X S Y Performingthiscalculationisextremelytediousandweusuallyletthecomputerdo thework.Typein=correl(array1,array2)andhit enter.Forthecansdatatheresultis0.991083.ThisisthePearsonProductMoment or‘r’value. CorrelationYoutube¹ Valuesofr:valuesofrarerestrictedtotherange-1to+1.Anegativevaluemeans that the slope is negative: as one variable increases thentheotherdecreases.A positivevaluemeansthatbothvariables increasetogether.Thelargertheabsolute valueofr,thenthecloserthecorrelation.Acorrelationofzeromeansthat all the pointsarescatteredaboutwithnopatternwhatsoever.Thervalueof0.991083for cansmeansthatthepointsarelinedupalongastraightline,whichiswhatwewould expectfromlookingatthescatterplot. Testingr:theobservationsthatwehaveusedtocalculaterconsistofasamplefrom apopulation.Becauseitisonlyasample,wedon’tknowwhetherthecorrelation wouldholdupifweweresomehowabletogainaccesstothepopulation.risthe notationforthecoefficientofcorrelationforasample.Forapopulationweusethe Greekletterρ“rho”.Generally,Greeklettersareusedfor population parameters. ThisdistinguishesthemfromtheRomanlettersusedforestimates. Thehypothesiswe’retestingis: againstthenull H 0:ρ ≤0 H a :ρ̸ =0 Wefindtheteststatisticusingthisequation r √ n −2 t =√ −r 2 ¹https://www.youtube.com/watch?v=QDj8aYfvccQ We’llfirstworkoutthevalueoftheteststatisticusingthervaluewefoundabove. Thenwe’llfindthepvaluefromthis. Thervaluefoundfromthecorrelationwas0.99andtherewere 24 observations (n=24).Plugginginthenumbers,wefindthatt=32.86.Thisisaonetailedtest,souse =t.dist(yourvaluefort,n-1,andfalse)tofindapvalueof2.67343E-21. The -21 meansmovethedecimalplace21placestotheleft,sothisnumberis effectively zero.Thervaluewefoundisstatisticallysignificant.Itismuchsmaller than0.05, sowecanrejectthenullhypothesis.Thereisacorrelation. Descriptivestatistics usesgraphsandtablestopresent whatweknowaboutthe data. This is a powerful method of gaining insights into the data, and also for makingpresentations.However,descriptivestatisticslimitsitselftothedataavailable doesnotmakeanyinferencesbeyondwhatwehavefromthedatatohand. Distribution of the observations within a dataset describes the density of the observations:aretheyspreadaboutevenly,orpeakedinthemiddleandthenspread outoneithersideoftheirmean? One of the most common and useful distributions is the normal distribution, sometimesknownasthebellcurve.Manythings inthenaturalworldfollowthis distribution.Forexample,ifyouwereabletomeasuretheheightsofalargenumber ofmenorwomen,you’dfindthattheyfollowedanormaldistribution. Theplottotherightshowsthedistributionofrecordsofheightsofparentscollected byFrancisGaltoninthe19thcentury.Galtonwastryingtofindouttherelationship betweentheheightofparentsandtheirchildren.Alongtheway,hediscoveredthe phenomenonknownas‘regressiontothemean’. Ihaveplottedtheheightsasahistogramandaddedanormalcurve withthesame meanandstandarddeviationastheoriginaldata.Thisclearlyfollowsthebellcurve. Thenormaldistributionisveryimportantinstatisticsforseveralreasons -thenormaldistribu-tionis‘well-behaved’inthesensethatital-waysfollowsthesame mathematicalfunction.Theequationisalit-tlecomplex,butwecandrawanynormal distributionprovidedweknowitsmeanandvariance.Themeanistheaverageofthe ob Galton’sheightobservations servations,andinanormaldistributionoccursrightinthemiddle.Thevarianceisa measureoftheaveragedistanceofeachobserva-tionfromthemean.Thelargerthe variance,themore‘spreadout’ordispersedthedata. Thegraphshereshowtwodatasets,withidenticalmeans butdifferentvariances. In most of this book we won’t use the term variance, but instead standard deviation .Thestandarddeviationisjustthesquarerootofthevariance.Weusethe standarddeviationprimarilysothatwecanusethesameunitsasthemean.Thefirst plothasastandarddeviationof5,andthesecondoneastandarddeviationof2.The plotswerecreatedbygeneratingarandomsetof1000numbers,withameanoften andthestandarddeviationsjustdescribed. -the second reason is connected with the most important result in statistics, the CentralLimitTheorem ,orCLT.This statesthatifyoukeepdrawingsamples withreplacementfromalargepopulation, the means of the samples will follow a normaldistribution .Theimportantthingisthatthedistributionoftheoriginal population doesn’t matter as long as we draw at least 30samples.Infact,itgets betterthanthat:iftheoriginalpopulationisnotnormallydistributed,thenwedon’t needasmanyas30inoursample.Asfewas10willdothetrick. Standarddeviation=5 Becausethenormaldistributionfollowsanequationsowell,wecan find the area underanypartofitveryeasily. Standarddeviation=2 The mean splits the area under the curve into halves, meaning that 50 % of the observationsaresmallerthanthemean,and50%larger.Acommonpracticeisto markoffthehorizontalaxisinunits of standard deviation. The further from the mean,thegreaterthestandarddeviation.Thishelpsbothinvisualizingtheshapeof thedata,butalsoinusingtheempiricalrule . Empirical rule : a ‘rule of thumb’ which states that when data are normally distributed, -64%ofthedataarewithinonestandarddeviationofthemean-95%ofthedataare withintwo standard deviations of the mean -99.7% of the data are within three standarddeviationsofthemean Thismakesintuitivesense:only100-99.7=0.3%ofthedataaremorethanthree standard deviations from the mean. They would therefore be in either of the extremeendsortailsofthedistribution.Theprobabilityoftheiroccurrencewould beverysmall.Bycontrast,observationsclosertothemean, those located only a smallnumberofstandarddeviationsfromthemean,aremuchmorelikelytooccur. Thatiswhythehistogramishighestonatthemean.Thoughtexperiment:veryshort orverytallpeopleareunusual(whichiswhyyoustareatthem).Bycontrast,people aroundthemeanheightarenotatallunusualandyoujustpassthemby. Estimation : The statistical process is circular: we start off by choosing the characteristic of the population that interests us. That characteristic is called a parameter.Wedrawarandomsamplefromthatpopulation. Weuse thestatistic to infer information about the same quantifiable feature in the population. The statistic(for ex-ampleanaverageoraproportion)isanestimateofthepopulation parameter. The process of finding how close the estimate to the parameter, and buildingaconfidence interval around the estimate, is inference . We test claims abouttheparameterusinghypothesis-testing,whichiscloselyrelatedtoinference. Hypothesis-testingisthefinalpartofthischapter. Excel –morescreencasts Insertingthedataanalysistoolpakadd-in²Thenormal distribution³ ²https://www.youtube.com/watch?v=Ods1Z5BLD9g&list=UUVZFhqXDgJgIRq-ljotClgQ ³https://i1.ytimg.com/vi/1qn0EIm5-PQ/mqdefault.jpg ExponentialSmoothing Hypotheses and p values A hypothesis is a claim someone makes about a population. For example, acoffee importer (me!)is told by the supplier that the meanweightofthesacksofcoffeebeansis40kg.Thatistheclaim.Iwanttotest that claim because I am paying for the coffee. All the sacks of coffee in the shipmentarethe‘population’.Itakearandomsampleofsackstocheck.Ifindthat themeanweightofthesampleis38kgs.DoIrejectthewholeshipment,orsayI must have got a low sample, let’s take them? The hypotheses here are: the null hypothesis:thepopulationmeanweightis40kg.Meanwhilethealternativeothe populationmeanweightisnot40kgs.Hypothesistestingprovidesananswertothe testintheformofapvalue,definedbelow. ApvalueistheprobabilityoffindingasamplewiththemeasuredcharacteristicsIF the null hypothesis was true. In the case of the coffee above, let’s say that we calculatedthepvalue(SeeChapter 9)anditwas0.1.Thatmeansthatthereisa10%chanceoffindingasamplewitha meanweightof38kgsiftherealtruemeanweightoftheshipmentwas40kgs.The mostfrequentlyusedcut-offpointis0.05or5%.Ifthepvaluewassmallerthan0.05 thenwewouldsaythatthereisalessthan5%chancethatthissamplecamefroma populationwiththeclaimedweight. Therefore the shipment isn’t of the claimed weightandshouldberejected. Hypothesiswritingandtesting Ahypothesisisastatementorclaim.Forexample,‘themeanweight of the coffee packsis3kg’isahypothesisorclaim.Thehypothesisneedstobetestedsothatwe knowwhethertheclaimstandsorisshowntobefalse. Hypothesesarewrittensothattheclaim,andthecounter-claim,aredistinct.There aretwohypotheses,onefortheclaimandthe other for the counterclaim. Ho, the ‘nullhypothesis’,containstheclaim.Ha,the‘alternativehypothesis’,containsthe counterclaim.Wewritehypothesesbyassumingthatthenullhypothesisistrue. Thenullhypothesesforthecoffeeexampleabovewouldbe Ho :µ =3 kg Thisisreadas:Thenullhypothesisisthatthepopulationmeanweight(mu)ofthe coffeeisequalto3kg. The alternative hypothesis (Ha) is that the population mean weight is not 3 kg, written: Ha :µ̸ =3 kg Wetestthehypothesesusinginformation(statistics)fromthesam- ple (the mean weight,thesamplesize,andthestandarddeviation).Notethatthecounterclaimsays nothingaboutthemeanweightinthepopulationbeingmorethanorlessthan3kg, itsimplysaysthatitnotequalto3kg.Also,important,theequalitysignappearsin thenull.Thisisalwaysthecase,whethertheinequalityis‘strict’or‘weak’. Wecanalsotestwhetherthepopulationmeanissmallerthanor largerthansome constant.Theshipperclaimsthattheweightis‘atleast’3kg.Wewritethisas: H 0:µ ≥3 kg Notethattheequalityremainswiththenull.Thealternativeisthenthenegationof thenull,andthismustbethattheweightis‘lessthan3kg’. H a :µ<3 kg ConvinceyourselfthattheonlypossiblewaytonegateHoiswiththealternative hypothesis.Iftheclaimwasthatthattheaverageweightislessthanorequalto3kg, wesimplyswitchoverthe inequalitysigns.Thereareotherformsofhypothesiswritingwhich we will work throughinthisbook.However,theyallsharetherulethattheequalitysigngoesin thenull;andthealternativenegatesthenull. Inference isthekeystatistical process. It is the way in which information about populationscanbegleanedfromsurprisinglysmallsamples.Anexampleispolling beforeanelection.Providedthesampleisdrawnrandomlyandisrepresentativeof thepop-ulation,asampleofperhapsonly1,000peoplecanprovidequiteaccurate estimates of the proportion of the population predicted to vote for a particular politicalparty.Theestimatesareusedtomakeinferencesaboutthepopulation.We won’tknowuntilelectiondaywhetherornottheinferenceswerecorrectornot. Inferentialstatisticsisaverypowerfultechniquewhichallowsustomakethejump fromasampletoapopulation.Wecanuseasmallsampletomakeinferencesabout thecharacteristicsofapopulationaboutwhichweperhapsknowverylittle.Thekey istoobtainasamplewhichrepresentsthepopulationasaccuratelyaspossible.For this, both randomization and good survey design are essential. W ewill discuss thesetwoimportantissueslaterinthechapter.Recently,athirdclassofstatistical methodologieshasarisen.Thisismachinelearning ,whichtakesadvantageofnew analytictechniquesandgreatercomputingpower.Wewon’tbeabletogointothese techniquesinthisbook,unfortunately. Lurkingvariables : It happens that there is close monthly cor- relation between theconsumptionoficecreamanddeathsbydrowning.Isitthatpeopleeatanice cream,goswimmingandthendrown?Thecorrelationmissesthelurkingvariable whichistemperature.Inhotweather,peoplebothgoswimmingmoreoftenandbuy ice creams. The ice cream and the drowning variables are both correlated with temperatureandsoarecorrelatedwitheach other.Anotherexample:itseemsthat thereisacorrelationbetween kids needing reading glasses andsleepingwith the lighton.Soyou should switch the light off? Not so fast. The lurking variable is the shortsightednessoftheparents,notthekids.Theparentsleavethelightonsotheycansee thekid.Thekidsinheritbadeyesightfromtheirparents. Parameter : a quantifiable characteristic of interest in the popula- tion. The average weight of all of the fish in the Fraser Riveris a parameter. The same characteristicinthesampleisknownasastatistic.Wecaneasilyfindthisstatistic becausethesampleisonlyasubset,andthereforesmaller,thanthepopulation.We usuallyneverknowtheactualvalueofaparameter,butthatdoesn’tmatterbecause wecanestimatefromthesample. Pointestimate :astatistic,suchasthemeanoraverage.Itiscalledapointbecauseit isanexactnumbersuchas4.21.Itisnotarange,itisapoint.Itiscalledanestimate becauseitisanestimateofthepopulationparameter.Therefore4.21isanestimate ofwhatthesamequantifiablefeaturewouldbeinthepopulation.Youcanthinkofa pointestimateasbeingthe‘bestguess’forthepopulationparameter. If we selected a different sample from the population, then almost certainly the pointestimatewouldbedifferent,givinganewbestguess.Ifwecontinuetotake samples,thenwearegoingtogetas manybestguessesaswehavesamples.The rangeofbestguessesisthesamplingvariation . Poissondistribution ThePoissonprobabilitydistributionis,aswiththeBinomial,adis-creteprobability tool.Wecanuseitforcalculatingtheprobabilityofeventswhichoccurovertimeor space.Forexample,numberofmistakesinagivennumberof lines of code. The formulais: f( x )= µ x e− xx ! wheref(x)istheprobabilityofxoccurrencesinagiveninterval.mu istheexpectedvalueormeannumberofoccurrencesinthegiveninterval(orarea). eisaconstant,value2.71828. Population : the group of individuals about whom we want some information. Each individual in the population is called an element or sometimes a case . Examples of populations are all the fishinthe FraserRiver,Toyotacarsmadein 2012,orindeedthepopulationofacountry.Itisusuallynotpossibletodealwithdata atthepopulationlevelbecauseitiseitherexpensiveorimpossibletoobtain.Asa resultwedrawasample,orsubset,fromthepopulation. Randomisationisthetechniqueofselectingelementsfromthepopulationtobuild asamplesothateachelementhasthesame known probability of being selected. This technique greatly reduces bias, described in more detail in the following chapter. Regression Samplingframe :alistoftheitemstobesampled.Imaginethatyouwishtosurvey 100studentsfromaUniversity.TheUniversityprovidesyouwithalistofstudents andtheirIDnumbers.Thestudentsformthepopulation,andthelististhesampling frame.Youwouldusearandomsamplingmethod,coveredbelow,torandomlyselect 100studentsfromtheUniversity.The100studentsformthesample. Standarderror : Trythisthoughtexperiment.Let’ssaythat thereare1000jelly babiesinabowlandforsomereasonyou’dliketoknownthemeanweightofajelly babywithouthavingtoweigh all ofthem.Youhaveachoiceoftakingarandom sampleofeithern = 100 or n = 500. Which sample would produce the most accurate estimate: obviouslythesamplesizeof500becauseitis‘closer’tothepopulation size. Yetitstillwon’tbecompletelyaccuratebecauseitcouldhappenthatthesampleyou select is biased in some way: perhaps you selected all the heavy ones? We can measuretheamountoferrorwiththestandarderror,whichhelpsustodeterminehow ‘close’wearetothe unknownparameter.BelowIshowhowtocalculatethestandarderror(SE).Hereis thestandarderrorforastatisticsuchasamean: σSE = √n In words, the standard error is the population standard deviation divided by the squarerootofthesamplesize,denotedbyn.Notethatasthesamplesizeincreases, thenthestandarderrordecreases. Thisis in line with your earlier intuition that the larger the sample size then the moreaccuratetheestimate.Usuallywedon’t know‘sigma’,σ ,sowereplaceitwith ‘s’whichisthestandarddeviationofthesample. Transformations :thisisatechniquesometimesusedinregression. Theobjectis usually to transform the distribution of the dependent variables so that it more closely resembles a normal distribution. We do this because the theoretical underpinning of linear regression is basedonanassumption of normality in the dependentvariable.The dependent variable is typically transformed by taking its naturallogarithmandthenusingthetransformedvariableintheregression.Thisis often used when the dependent variable has only positive values and is highly skewed to the right. Independent variables can also be transformed such as by addingasquaredversionof the variable in the regression. Independent variables canalso betransformedby z scores : A z score is a measure of how far an observation contained within a variable is from the mean of that variable, in termsof standard deviations. It is quiteeasytocalculatewithinExcel,andthisYoutubeshowsyouhow zscoresYoutube⁴. Whydothis?Bydividingthedifferencebetweenthevalueofthe ⁴https://www.youtube.com/watch?v=FSZpynSBev8 observationandthemeanofthevariablebythestandarddeviationofthevariable, likethis: z i= x i−x ¯ s thedifferencesareinunitsofstandarddeviations.Wecanusethisfor: •Comparingthedistributionsoftwocompletelydifferentvariables • Detectingoutliers.Anoutlierisanobservationwhichisveryfarfromthe restofthedistribution.Usually,99.7%ofthe observations will be within threestandarddeviationsofthemean.Soanobservationwhichismorethan threestandarddeviationsfromthemeanislikelytobequestionable. This canbegoodorbad: •Badbecauseperhapssomeonemadeadataentryerrorwhichwecancatch. • Good because perhaps there is something anomalous and interesting aboutthat‘weird’observation.
© Copyright 2026 Paperzz