Straightforward statistics using Excel and Tableau

Business statistics with Excel and Tableau
A hands-on guide with screencasts and data
Stephen Peplow
© 2014-2015 Stephen Peplow
With thanks to the students who have helped me at Imperial College London, the
University of British Columbia and Kwantlen Polytechnic University.
Contents
1. How to use this book
1.1 This book is a little different
1.2 Chapter descriptions
2. Visualization and Tableau: telling (true) stories with data
3. Writing up your findings
3.1 Plan of attack. Follow these steps.
3.2 Presenting your work
4. Data and how to get it
4.1 Big data
4.2 Some useful sites
5. Testing whether quantities are the same
5.1 ANOVA Single Factor
5.2 ANOVA: with more than one factor
6. Regression: Predicting with continuous variables
6.1 Layout of the chapter
6.2 Introducing regression
6.3 Trucking example
6.4 How good is the model? - r-squared
6.5 Predicting with the model
6.6 How it works: Least-squares method
6.7 Adding another variable
6.8 Dummy variables
6.9 Several dummy variables
6.10 Curvilinearity
6.11 Interactions
6.12 The multicollinearity problem
6.13 How to pick the best model
6.14 The key points
6.15 Worked examples
7. Checking your regression model
7.1 Statistical significance
7.2 The standard error of the model
7.3 Testing the least squares assumptions
7.4 Checking the residuals
7.5 Constructing a standardized residuals plot
7.6 Correcting when an assumption is violated
7.7 Lack of linearity
7.8 What else could possibly go wrong?
7.9 Linearity condition
7.10 Correlation and causation
7.11 Omitted variable bias
7.12 Multicollinearity
7.13 Don’t extrapolate
8. Time Series Introduction and Smoothing Methods
8.1 Layout of the chapter
8.2 Time Series Components
8.3 Which method to use?
8.4 Naive forecasting and measuring error
8.5 Moving averages
8.6 Exponentially weighted moving averages
9. Time Series Regression Methods
9.1 Quantifying a linear trend in a time series using regression
9.2 Measuring seasonality
9.3 Curvilinear data
10. Optimization
10.1 How linear programming works
10.2 Setting up an optimization problem
10.3 Example of model development.
10.4 Writing the constraint equations
10.5 Writing the objective function
10.6 Optimization in Excel (with the Solver add-in)
10.7 Sensitivity analysis
10.8 Infeasibility and Unboundedness
10.9 Worked examples
11. More complex optimization
11.1 Proportionality
11.2 Supply chain problems
11.3 Blending problems
12. Predicting items you can count one by one
12.1 Predicting with the binomial distribution
12.2 Predicting with the Poisson distribution
13. Choice under uncertainty
13.1 Influence diagrams
13.2 Expected monetary value
13.3 Value of perfect information
13.4 Risk-return Ratio
13.5 Minimax and maximin
13.6 Worked examples
14. Accounting for risk-preferences
14.1 Outline of the chapter
14.2 Where do the utility numbers come from?
14.3 Converting an expected utility number into a certainty equivalent.
15. Glossary
1. How to use this book
I began business life as an entrepreneur in Hong Kong. I ran trade exhibitions,
imported coffee from Kenya, and started and operated two restaurants, one of which
(unusually for Hong Kong) served vegetarian food. These businesses were profitable,
but I would have saved myself a great deal of stress, and done better, if I had used
some business intelligence informed by statistics to improve my decision-making.
Looking back, I wish I had been able to think through and write analyses on topics
such as these, among many others:
a. calculation of optimal restaurant staffing levels
b. analysis of sales over time and seasonality in trends
c. receipts per customer by restaurant type, with analysis of the strength and type
of any differences
d. predicted sales data for potential exhibitors, with visualizations of various
‘what if’ scenarios
e. deeper insights gained from visualizing my data
f. better visual and written presentations to win over more partners and investors
I’m sure that there are many more people like me, aware that they ought to be doing
more with the data they collect as part of normal business operations, but uncertain
of how to go about it. There is no shortage of textbooks and manuals, but these
don’t seem to get to the hands-on applications quickly enough. This book is for
people such as me, and it has two components: the analysis, which finds the
underlying story in the data, and then the presentation of that story.
1.1 This book is a little different
Most books on stats introduce statistical techniques on a chapter by chapter basis.
Instead, I’ve written this book so that it reflects the way many people learn. The
chapters are structured by the type of question you might want to ask. Examples of
the types of question are summarized below, and at the beginning of each chapter.
The chapters themselves don’t include much math or technical detail. Instead,
these are placed towards the end of the book in a glossary.
Many of the Excel procedures are linked to screencasts prepared by me to illustrate
that particular procedure. The data-sets used in the book are available from my
Dropbox public folder: just click on the hyperlink that appears next to each worked
example. With the data-sets open in Excel, you can follow along at your own pace.
(And then do the same with your own data.) I have used Tableau for some
visualizations, and you will find links to my workbooks and screencasts.
1.2 Chapter descriptions
If you’re reading this book, then you are very likely already engaged in business and
would like to know how to take that business to new heights. Take a look through the
chapter descriptions that follow and then go straight to that chapter. If you think you
might need a bit of a statistics refresher, look through the Glossary in Chapter 15
first. A very good free statistics textbook from OpenIntro is available here¹
Chapter 2 introduces Tableau Public² which is a free data visualization tool. While
Excel does have graphing tools which are easy to use, the results can look a little
clunky. Tableau helps us to merge data from different sources and create remarkable
visualizations which can then be easily shared. I’ll be illustrating results throughout
this book with either Tableau, Excel or both. One caution: if you publish your results
to the web using Tableau Public, your data is also published. If this is a problem,
there is an option to pay for an enhanced version of Tableau. The screenshot below
shows work done on changes in land use in the Delta region of British Columbia. By
clicking on the link, you can open the workbook and alter the settings. You can filter
the year (see top right) and also land use type. Thanks to Malcolm Little for his work
on this project.
¹https://www.openintro.org/stat/
²http://www.tableausoftware.com/public/
Tableau workbook for Delta Land Cover Changes³
Tableau showing changes in land cover
³https://public.tableau.com/views/DeltaCropCategories1996-2011/Sheet1?:embed=y&:showTabs=y&:display_count=yes
Tableau has training and demonstration videos available on its website, and there
are plenty of examples out there. The screencasts which Tableau provides
(available at their homepage) are probably enough. Where I have found some
technique (such as boxplots) particularly tricky, I have created screencasts for this
book. The image below shows data we will use in the regression chapter. You run a
trucking company and from your logbooks extract the distance, duration and
number of deliveries for some delivery jobs. The image shows a trend line, with
distance on the horizontal axis and time taken on the vertical axis. The points on
the scatterplot are sized by number of deliveries. You can see that more deliveries
increase the time, as one would expect. Using regression (Chapter 6), we will work
out a model which can show how much extra you should charge for each delivery.
Here is a YouTube for trucking scatterplot⁴
Tableau showing distances, times, and deliveries
⁴https://youtu.be/oFACLZLJWZI
Chapter 3: Writing up your findings. In most cases statistical analysis is done in
order to help a decision-maker decide what to do. I am assuming that it is your job to
think through the problem and assist the decision-maker by assembling and analyzing
the data on his/her behalf. This chapter suggests ways in which you might write up
your report, accompanied by visualizations to get across your message. There are
also links to some helpful websites which discuss the preparation of slides and how
to make presentations.
Chapter 4: Data and how to get it. If you are considering collecting data yourself,
through a survey for example, then you’ll find this chapter useful. So-called big data
is a hot topic, and so I include some discussion. The chapter also includes links to
publicly available datasets which might be helpful.
Chapter 5 deals with tests for whether two or more quantities are the same or
different. For example, you are a franchisee with three coffee-shops. You want to
know whether daily sales are the same or different, and perhaps what factors cause
any difference that you find. You want to know whether any difference is
statistically significant, so as to rule out the possibility that the difference you
see occurred purely by chance. Perhaps the average age of the customers makes a
difference in the sales? ANOVA uses very similar theory to regression, the subject of
the next chapter and perhaps the most important in the book.
Chapter 6: Regression is almost certainly the most important statistical tool
covered in this book. In one form or another, regression is behind a great deal of
applied statistics. The example presented in this book imagines you running a
trucking business and wanting to be able to provide quotations for jobs more
quickly. It turns out that if you have some past log-book data to use, such as the
distance of various trips, the time that they took and how many deliveries were
involved, an accurate model can be made. Regression can find the average time taken
for every extra unit of distance, as well as for other variables such as the number
of stops. This chapter shows how to create such a model and how to use it for
prediction. With such a model, you can make quotations really quickly. This chapter
also covers more complex regression and how to go about model-building.
Other uses: you want to calculate the beta of a stock, comparing the returns of one
particular stock with the S&P 500.
Chapter 7 is about testing whether the regression models we built in Chapter 6
are in fact any good. Regression appears so easy to do that
everybody does it, without checking the validity of the answers. Following the
procedures in this chapter will help to ensure that your work is taken seriously.
This chapter is also a guide to thinking critically about the results other people
might present. Is there missing variable bias? For example, you notice a strong
correlation between sales of ice-creams and deaths by drowning. Can you therefore
say that reducing ice-cream sales will make the water safer? Err, no. The missing or
lurking variable is temperature. Both drowning deaths and ice-cream sales are a
function of temperature, not of each other.
Chapter 8 is about time series and shows how we can detect trends in data over time,
and make predictions. The smoothing methods covered in this chapter, such as moving
averages and exponentially weighted moving averages, are fine for data
which lacks seasonality and when only a relatively short-term forecast is needed.
When you have longer-run data and need to make predictions from it, the following
chapter has the techniques you need.
Chapter 9 describes the regression-based approach to time series analysis and
forecasting. This approach is powerful when there is some seasonality to your data.
For example, sales of TV sets show a distinct quarterly pattern (as do umbrellas!).
Sales of TV sets showing a quarterly seasonality
Using regression we can detect peaks and troughs, connect them to seasons and
calculate their strength. The result is a model which we can use to predict sales into
the future.
Chapter 10 is about optimization, or making the best use of a limited number of
resources. This is highly useful in many business situations. For example, you need
to staff a factory with workers of different skills and pay-levels. There is a
minimum level of skill required for each class of worker, and perhaps also a
minimum number of hours for each worker. Using optimization, you can calculate
the number of hours to allocate to each worker.
Chapter 11: Optimization can also be used for more complex ‘blending’ problems.
Example: You run a paper recycling business. You take in papers and other fibers
such as cardboard boxes. How can you mix together the various inputs so that your
output meets minimum quality requirements, minimizes wastage, and generates the
most profit?
Another example: what mix or blend of investments would best suit your
requirements?
Chapter 12 concerns calculating the probability of items you can count one by
one. For example, what is the probability that more
than five people will come to the service desk in the next half-hour? What is the
probability that all of the next three customers will make a purchase? We can solve
these problems using the binomial and the Poisson distributions.
Chapter 13 concerns choice under uncertainty: if you have a choice of different
actions, each of which has an uncertain outcome, which action should you choose to
maximize the expected monetary value? A farmer knows the payoffs he/she will
make from different crops provided the weather is in one state or another
(sunny/wet), but at the time when the crop decision has to be made, he/she obviously
doesn’t know what the actual state will be. Which crop should the farmer plant? A
manufacturer needs to decide whether to invest in constructing a new factory at a
time of economic uncertainty: what should he/she do?
Chapter 14 adds the decision-maker’s risk profile to his or her decision process.
Most people are risk averse, and are willing to give up some expected return in
exchange for certainty. This chapter shows how to construct a utility curve which maps
risk attitudes, and then prioritize the decisions in terms of maximum expected utility.
Chapter 15 is a Glossary and contains some basic statistics information, primarily
definitions of terms that the book uses frequently and which have a particular
meaning in statistics (for example ‘population’). The Glossary also discusses why
the inferential techniques used in statistics are so powerful, allowing us to make
inferences about a population based on what appears to be a very small sample.
Under E for Excel in the Glossary, you’ll find some links to screencasts on subjects
not directly covered in this book, but which you might find helpful.
2. Visualization and Tableau: telling (true) stories with data
In this book, we’ll work on gaining insights from data by visualization and
quantitative analysis, and then presenting the results to others. Recently a powerful
new tool called Tableau has become available. Tableau Public¹ is a free version. But
be careful with your data because when you publish to the web, as you must do with
Tableau Public, then your data also becomes available to anybody. If you require that
your data be protected, you can pay for the private version. Tableau 9 has just been
made available.
Many of the worked examples in this book are accompanied by a link to a
completed Tableau workbook, showing how you could have used Tableau to
present your findings. Tableau cannot easily perform the more technical
hypothesis-testing aspects of statistics such as regression, but it can help you to get
your point across clearly. Tableau has also provided a useful White Paper² on visual
best practices which is well worth reading.
¹http://www.tableausoftware.com/public/
²http://www.tableausoftware.com/whitepapers/visual-analysis-everyone
In business intelligence we are generally interested in detecting patterns and
relationships. This might seem obvious, because as humans we are always looking
for these phenomena. I’d just like to add that the absence of a pattern or relationship
might be just as informative as finding one. An excellent first step is to take a look at
your data with an expository graph before moving on to more formal data analysis.
Below are examples of the types of charts or graphs most commonly used when
examining data for the first time, just to ‘see how it looks’. Checking an expository
graph shows up problems that might throw you off later: missing data, outliers or,
more excitingly, unexpected and intriguing relationships.
Histograms
Histograms break the data into classes and show the distribution of the data: which
classes (sometimes known as bins) are most common. The histogram shows us the
‘shape’ of the data. Are there many small measurements, or does the data look as
though it is normally distributed and consequently mound-shaped? See
Distribution in the Glossary for more on this. The plot below is of deaths in car
accidents on a weekly basis in the United States over the period 1973-1978. The
histogram shows only the distribution and not the sequence.
Histogram of car deaths, reported weekly
From the histogram only, we could say that the most common number of deaths
in a week is between 8000 and 9000, because the bars of the histogram are
highest there. They are highest because each bar counts the number of weeks in the
time-period in which the death toll fell within that class (for example, 8000 deaths
occurred in 15 weeks). What the histogram doesn’t show is that deaths have actually
been decreasing since about 1976. There is probably a simple explanation for this:
compulsory seat belts perhaps? The take-home from this is that expository graphs
are really easy to do, and that you should try different methods with the same data
to tease out the insights.
Car deaths over time
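If you would like to see the binning idea outside Excel or Tableau, here is a minimal
Python sketch. The weekly figures are invented for illustration (they are not the data
behind the plots above), and the names weekly_deaths, BIN_WIDTH and histogram are
my own assumptions. It counts how many weeks fall into each class of width 1000, and
shows that reordering the weeks leaves the histogram unchanged, which is why the
histogram cannot reveal a trend over time.

```python
from collections import Counter

# Hypothetical weekly death counts, invented for illustration only.
weekly_deaths = [7200, 7750, 8100, 8300, 8450, 8600,
                 8900, 9100, 9400, 10200, 8050, 8700]
BIN_WIDTH = 1000  # assumed class width


def histogram(values, width):
    """Count how many values fall in each class [lower, lower + width)."""
    counts = Counter((v // width) * width for v in values)
    return dict(sorted(counts.items()))


bins = histogram(weekly_deaths, BIN_WIDTH)

# Text version of the histogram: one '#' per week in the class.
for lower, count in bins.items():
    print(f"{lower}-{lower + BIN_WIDTH - 1}: {'#' * count}")

# The class with the tallest bar.
modal_bin = max(bins, key=bins.get)

# Sorting the weeks changes the sequence but not the distribution.
same_shape = histogram(sorted(weekly_deaths, reverse=True), BIN_WIDTH) == bins
```

The last line is the point of the exercise: the same bars appear whatever order the
weeks are in, so a time plot is needed alongside the histogram.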
Scatterplots
The most common and most useful graph is probably the scatterplot. It is used
when we have two continuous variables, such as two quantities (weight of car and
mileage), and we want to see the relationship between them. The scatterplot is also
useful for plotting time series, where time is one of the variables.
You can use Excel for this, and also Tableau. In Excel, arrange the columns of your
data so that the variable which you want on the horizontal x axis is the one on the
left. This is the independent variable. Put the other variable, the dependent variable,
on the vertical y axis. It is usual practice to put the variable which we think is doing
the explaining on the x axis and the response variable on the y axis. Here is a
scatterplot of European Union expenditure per student as a percentage of GDP:
EU % of GDP spent on Primary Education
There is a gradual increase over the years 1996 to 2011 as one would expect;
educational expenditures tend to remain a relatively fixed share of the budget.
However, there was a blip in 2006 which is interesting. Because the data is a
percentage of GDP, the drop might have been caused by an increase in GDP or
alternatively by an actual educational spending decrease. We’d need to look at GDP
figures for the period. I got European GDP figures per capita and, using Tableau, put
them on the same plot, with different axes of course. The result is below.
European GDP and Primary Expenditures
GDP dropped around 2008/2009 because of the world financial crisis. The blip in
primary expenditures is still unexplained.
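The reason a share-of-GDP figure needs a second look can be shown with a little
arithmetic. The sketch below uses made-up numbers (not the EU data), and the function
name share_of_gdp is my own. The same fall in the percentage can come from GDP
rising while spending stays flat, or from spending falling while GDP stays flat.

```python
def share_of_gdp(spending, gdp):
    """Spending expressed as a percentage of GDP."""
    return 100 * spending / gdp


# Made-up figures in arbitrary currency units, for illustration only.
baseline = share_of_gdp(spending=50, gdp=1000)

# Case 1: spending unchanged, GDP grows.
gdp_rises = share_of_gdp(spending=50, gdp=1100)

# Case 2: GDP unchanged, spending is cut.
spending_falls = share_of_gdp(spending=45, gdp=1000)

print(baseline, gdp_rises, spending_falls)
```

Both cases show a drop from the baseline, which is why the GDP series itself has to be
plotted before drawing a conclusion.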
Maps
Tableau makes the display of spatial data relatively easy, provided you tell Tableau
which of your dimensions contains the geographical variable. The plot below shows
changes in greenhouse gas emissions from agriculture in the European Union. I
linked World Bank data by country. The workbook is here³.
³https://public.tableausoftware.com/views/EuropeanGHGfromAgriculture20102011/Sheet1?:embed=y&:display_count=no
Unfortunately, not all geographic entities are available for linkage in Tableau.
States and counties in the United States are certainly available, as are provinces in
Canada and countries in Europe. There are ways to import ArcGIS shapefiles
into Tableau. A shapefile is a list of vertices which delimit the polygons that define
the geographical entity.
GHG emissions from agriculture in the European Union
3. Writing up your findings
In Chapter 1, I mentioned some of the questions I wished I had been able to answer
when I was operating a business. Now I want to suggest ways in which you might
structure your response to such questions. The actual answers come when you’ve
done the statistical work (carried out in the following chapters), but it might help to
have at least a structure on which you can begin building your work.
Before you do any work on the problem, get straight in your mind exactly what you
are being asked to do. You need to be clear and so does the decision-maker to whom
you are going to submit your findings. Here is one way to do this:
Copy down the key question that you are being asked in exactly the way that it has
been given to you. For example, let us say that you have been asked to answer this
admittedly tough question: ‘What factors are important for the success of a new
supermarket?’
Now keep rewriting it in your own words so that it becomes as clear as possible.
Make sure that each and every word is clear. What does ‘success’ actually mean?
In the business context, probably ‘most profitable’. What does ‘new’ mean? Is this a
completely new supermarket or a new outlet of an existing brand? You might end
up with: ‘when we are planning a location for a new outlet for our brand, what factors
contribute most to profitability?’ Doing all this helps you to identify the statistical
method most suitable for the task. You also can include here the hypotheses (if any)
that you will be testing.
Here is a suggested section order for your report. However, this probably won’t be
the order of the work. The order in which you do the statistical work is in Section 3,
your plan of attack. So the order of doing the actual work is that you carry out the
plan of attack, and then write up your results following the section by section
sequence I am giving you now.
Section 1. State the question to be answered clearly and succinctly, as a result of the
rewriting you did above. Doing this makes sure that you clearly understand the
problem and what is being asked of you. Writing out the question and stating it
clearly right at the beginning of your report also ensures that the decision-maker
(the person who posed the question in the first place) and you understand each other.
You can also write in the hypothesis which you are testing.
Section 2. Provide an executive summary of the results. This should be perhaps six
lines long and contain the key findings from your work. Don’t put in any technical
words, jargon or complicated results. Just enough so that a busy person can skim
through it and get the general idea of your findings before going more deeply into
your hard work.
Section 3. Outline your plan of attack, describing how you have approached the
problem. You also can include here the hypotheses (if any) that you will be testing.
A typical plan of attack follows later on in this chapter.
Section 4. Data. Provide a brief description of the data that you have used: the source
of the data, how you found it, whether you think it is reliable and whether it is
sufficient for the task. It is a good idea to include summary statistics at this point.
This includes information such as the number of observations, and what the
variables are.
Section 5. Statistical methodology. Describe the statistical methodology which you
are going to use, for example ANOVA to test whether or not the means are the
same.
Section 6. Carry out the tests, making sure to include a description of whether or not
the hypotheses can be rejected. This is the central part of the report. Just make sure
to focus on answering the question.
Section 7. Write a conclusion which describes the results of the test and how the
results answer the question that was asked. You can also include here any
shortcomings in the data or in your methodology which might affect the validity of
the results. The conclusion is very important because it ties together all the
previous sections. Here you could include a link to a Tableau ‘story’ as an additional
or alternative way of presenting results.
Section 8. References/end notes or further information, especially on sources of
data, should end the document. If you want to make your document look really
good, and you have lots of references, consider using the Zotero plug-in for
Firefox. It is an excellent free way to manage your bibliography.
3.1 Plan of attack. Follow these steps.
a. Identify the statistical method that you will use and explain why you chose it
b. Determine what data is required and where it can be found
c. Conduct the statistical tests
d. Discuss whether the results are reliable and answer the question. (If not, start
again!)
3.2 Presenting your work
While it is probably best to concentrate on written work because the decision-maker
might want to read your work in detail, and discuss it with colleagues, a
Tableau presentation is an excellent way of making your point over again. See
Chapter 2 for more on visualization and Tableau.
Here is a link to some excellent notes by Professor Andrew Gelman on giving
research presentations¹
¹http://andrewgelman.com/2014/12/01/quick-tips-giving-research-presentations/
4. Data and how to get it
In this chapter, we’ll look at the data collection process, the problems that might
accompany poorly collected data, discuss so-called big data and provide links to
some useful public sources of data.
As you might expect, this book works through the analysis of numbers
(quantitative data), but it is important to note that the analysis of qualitative data
is a rapidly growing area. Unfortunately the analysis of qualitative data is beyond
this book and the software available to us.
Quantitative data comes from two main sources. Primary data is collected by you
or the company you are working for. For example, the market research you do to find
out the possible market share for your product provides primary data. Primary data
includes both internal company data and data from automated equipment such as
website hits. Collecting data is expensive, and the results are highly proprietary.
Primary data is therefore unlikely to be published and available outside the enterprise.
By contrast, secondary data is plentiful and mostly free. It is collected by
governments of all sizes, and also by many non-government organizations.
There is an increasing trend towards the liberation of data under various open
government initiatives. Government data is usually reliable, but be sure to check the
accompanying notes which warn of any problems, such as limited sample size or
changes in classification or collection methods over time. There are some links to
secondary data at the end of this chapter.
Experimental design
Experiments don’t necessarily have to be conducted in laboratories by people in
white coats: a survey in a shopping mall is also a form of experiment, as is analyzing
the results of your favorite cricket team. There are two different design
types: experimental and observational. The key difference is the amount of
randomization that is built into the experiment. In general, more randomization is
good, because it helps to remove the influence of fixed effects, such as the quality of
the soil in a particular location.
Here’s an example to make clear the difference. Imagine a researcher wanting to
test the effect of a new drug on mice. In an experimental design, the mice are
examined before the drug is administered, and then the drug is applied to a
treatment group of mice but not to a control group. The control group receives no
drugs. Its job is to act as a reference group. All the mice are kept in identical
conditions apart from the application of the drug. The allocation of mice to the
treatment group and to the control group is entirely random.
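The random allocation just described can be sketched in a few lines of Python. This is
only an illustration under my own assumptions: the mouse labels, the function name
allocate and the fixed seed are all hypothetical, and a real study would follow its own
randomization protocol. The idea is simply to shuffle the subjects and split the
shuffled list in half.

```python
import random


def allocate(subjects, seed=None):
    """Randomly split subjects into a treatment group and a control group."""
    rng = random.Random(seed)   # a seed makes the allocation reproducible
    shuffled = list(subjects)   # copy, so the caller's list is untouched
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


# Hypothetical subject labels.
mice = [f"mouse_{i:02d}" for i in range(1, 21)]
treatment, control = allocate(mice, seed=42)

print("Treatment:", sorted(treatment))
print("Control:  ", sorted(control))
```

Because chance alone decides which group a mouse joins, any fixed effect (a heavier
mouse, a warmer cage) is equally likely to land in either group.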
While we can (and do) carry out drug experiments on mice, it would clearly be
very wrong to try to do the same with humans as subjects. Instead we use an
observational design: we collect data or observations and then analyze those
observations looking for differences.
To compensate for the lack of randomization, we control for observed differences
by including as many relevant variables as possible. If we knew that person X had
received a particular drug and had developed a particular condition, we would want
to compare person X with somebody else who had not developed that condition.
Relevant variables that we would want to know might be age, gender and possibly
pre-existing health conditions. Including these variables reduces fixed effects and
allows us to concentrate on the effect of the drug.
Problems with data
It is obvious that to be credible, your analyses must be based on reliable data.
Problems with data are usually connected to poor sampling and experimental design
techniques, especially:
Sample size too small. The relative size of the sample to the population usually
does not matter. It is the absolute size of the sample that counts. You usually want at
least several hundred observations.
Population of interest not clearly defined. It is clear that we need to take a
sample from a population, but what exactly is the population? Here’s an example.
You want to survey shoppers in a shopping mall regarding your new product. But of
the people inside the mall, who exactly are your population? People just entering,
people just leaving, people having their lunch in the food court? Singles, couples,
elderly people or the teenagers hanging around outside the door? You can see that
picking any one of these groups on its own will lead to a biased sample.
Non-response bias. Many people don’t answer those irritating telephone calls
which come in the evening because they’re busy with dinner. As a result, only
answers from those who do choose to answer the survey are counted. Those
respondents most likely aren’t representative of the population. Perhaps they live
alone or do not have too much to do. I’m not saying that they should not be in the
sample, just that including only those who do respond may bias your sample.
Voluntary response bias. If you feel strongly about an issue, then you are more
likely to respond than if you are indifferent. That’s simply human nature. As a result,
the survey results will be skewed by the views of those who feel most passionately.
This is hardly a representative sample because the strongly-held views drown out the
more moderate voices.
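The claim above, that the absolute sample size matters and the relative size mostly
does not, can be checked with a short calculation. The sketch below is my own
illustration rather than anything from the book’s spreadsheets. It assumes a simple
random sample, a surveyed proportion of 0.5, a 95% confidence level (z = 1.96) and
the standard finite population correction; the function name margin_of_error is
hypothetical.

```python
import math


def margin_of_error(n, population, p=0.5, z=1.96):
    """Approximate margin of error for a proportion from a simple random sample.

    Includes the finite population correction, so the effect of the
    population size can be seen directly.
    """
    standard_error = math.sqrt(p * (1 - p) / n)
    fpc = math.sqrt((population - n) / (population - 1))
    return z * standard_error * fpc


# Same sample size, very different population sizes.
small_town = margin_of_error(400, 10_000)
big_country = margin_of_error(400, 10_000_000)

# Same population, smaller sample.
tiny_sample = margin_of_error(100, 10_000_000)

print(f"n=400, N=10,000:     {small_town:.4f}")
print(f"n=400, N=10,000,000: {big_country:.4f}")
print(f"n=100, N=10,000,000: {tiny_sample:.4f}")
```

The first two results are almost identical even though one population is a thousand
times larger, while cutting the sample from 400 to 100 roughly doubles the margin.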
4.1 Big data
Primary data frequently comes from automated collection devices, such as scanners,
websites, social media, and the like. The volume of such data is enormous, and is
aptly called big data. Big data is the term used to describe large datasets generated by
traditional business activities and from new sources such as social media. Typical
big data includes information from store point-of-sale terminals, bank ATMs,
Facebook posts and YouTube videos.
One of the apparently attractive features of big data is simply its size, which
supposedly enables deeper insights and reveals connections which would not appear
in smaller samples. This argument neglects the power of statistics, and in particular
inferential statistics. A small sample, properly collected, can yield superior
insights to a very large poorly collected sample. Think of it this way. Which is better:
a very large sample in which all the respondents are in the same age-group and of the
same gender; or a smaller one which more accurately reflects the population?
4.2 Some useful sites
You can of course easily just Google for data, or look at these more focused sites:
Gapminder data: free to use but be sure to attribute¹
Worker employment and compensation²
Interesting and wide-ranging historical data³
Google Public Data⁴
International Monetary Fund⁵
Food and Agriculture Organisation⁶
United Nations Data⁷
The World Bank⁸
¹http://www.gapminder.org/data/
²http://www.bls.gov/fls/country/canada.htm
³http://www.historicalstatistics.org/
⁴http://www.google.com/publicdata/directory
⁵http://www.imf.org/external/data.htm
⁶http://www.fao.org/statistics/en/
⁷http://data.un.org/
⁸http://databank.worldbank.org/data/home.aspx
5. Testing whether quantities are the same
This chapter concerns testing whether the population means of two or more
quantities are the same or not and, if they are in fact different, whether any variable
can be identified as being associated with the difference. The test we will use is
ANOVA (Analysis of Variance). The test was developed by the British statistician
Sir Ronald Fisher, and the F-test which ANOVA uses is named in his honor. Fisher
also developed much of the theoretical work behind experiment design during his
time at the Rothamsted Research Station in England.
5.1 ANOVA Single Factor
The most straightforward application of ANOVA is when we simply want to test
whether or not two or more means are the same. In this worked example, we have
three different types of wheat fertilizer (Wolfe, White and Korosa) and we would like
to know whether their application produces equal or different yields.
As the section on experimental design in Chapter 4 emphasized, the experiment must
be designed to isolate the effect of the fertilizer, and this is achieved by
randomization. To control for differences in site-specific growing qualities, we
select plots of land which are as similar as possible: exposure to sunlight, drainage,
slope and other relevant qualities. We randomly assign fertilizers to plots, and the
yields are measured. The first few lines of a typical dataset appear below.
The fertilizer dataset
There are two columns in the dataset: the factor, which is the name of the
fertilizer, and the yield attributed to each plot. In this case, each fertilizer was tested
on four plots, providing 3 x 4 = 12 observations. Later, when using Excel to run the
ANOVA test, it will be necessary to change the format so that the yields are
grouped under each factor.
A good first step is to visualize the data. Here is a boxplot drawn with Tableau.
Fertilizer boxplot
The dataset is here: fertilizer dataset¹
and here is a YouTube of creating a boxplot in Tableau².
¹https://dl.dropboxusercontent.com/u/23281950/fertilizer.xlsx
²http://youtu.be/3QohthWXp1M
I’ll make a frank admission right now: it took me the best part of a morning to get the
technique of drawing a boxplot in Tableau down, so hopefully this YouTube will
help you to avoid spending so much time on this.
The boxplot reveals these useful measurements: the smallest and largest
observations, which give the range; the median, which is the horizontal line within
the box; the 25% quartile (lower line of the box); and the 75% quartile (upper line of
the box). Therefore 50% (75 - 25) of the data are contained within the box. The
‘whiskers’ mark off data which are outliers, meaning that any data point which is
outside a whisker is an outlier. This doesn’t mean to say that it is somehow wrong:
perhaps some of the most interesting discoveries come from looking at outliers.
However, an outlier might also be the result of careless data entry and should
therefore be checked. It is clear from the plots that the yields are by no means the same.
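To make the boxplot measurements concrete, here is a minimal Python sketch that
computes them for a small set of invented yields (not the fertilizer dataset). Two
things are assumptions on my part: the quartiles use the ‘inclusive’ method of the
standard library’s statistics.quantiles, and the whiskers use the common rule of 1.5
times the interquartile range. Tableau and Excel may compute quartiles slightly
differently, so expect small discrepancies.

```python
import statistics

# Invented plot yields, for illustration only.
yields = [4.2, 4.5, 4.7, 4.9, 5.0, 5.1, 5.3, 9.8]


def box_summary(values):
    """Return the measurements a boxplot displays."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    # Assumed whisker rule: points beyond 1.5 * IQR from the box are outliers.
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    return {
        "min": min(values),
        "q1": q1,          # lower line of the box
        "median": median,  # line inside the box
        "q3": q3,          # upper line of the box
        "max": max(values),
        "outliers": [v for v in values if v < lower_fence or v > upper_fence],
    }


summary = box_summary(yields)
for name, value in summary.items():
    print(f"{name}: {value}")
```

The one value flagged as an outlier here is exactly the kind of point worth checking
against the original records before deciding whether it is an error or a discovery.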
In this example, we are measuring only one factor: the effect of the fertilizer on the mean yield of each plot, and so the test we want to conduct is single factor ANOVA with a completely randomized design. It is completely randomized because the allocation of fertilizer to plot was random. We want any differences to be due to the fertilizer and the fertilizer alone.
Because our test is whether the means are the same or different, the null hypothesis is:
$$H_0: \mu_1 = \mu_2 = \mu_3 = \dots = \mu_n$$
with the alternative hypothesis that not all the means are the same. That isn’t the same as saying that they are all different; just that at least one is different from the others.
The rejection rule states that if the p value which comes out of the test is smaller than 0.05, then we reject the null hypothesis. If we reject the null, then we must accept the alternative hypothesis.
We test the hypothesis using the ANOVA Single Factor tool within Data Analysis. First, the two-column structure of the data has to be transformed into columns for each of the three fertilizers. That is easily accomplished using the PIVOT TABLE function. This YouTube video³ takes you through the process of changing the structure of the data and running the ANOVA test. The results are below.
ANOVA output for fertilizer test
The key statistic is the p-value. It is much smaller than 0.05 and so we reject the null hypothesis. The means are not all the same: at least one differs from the others. The summary output tells us that White has the largest mean yield and this agrees with the boxplot. In this case, separation of results into a ranking of yield is relatively easy because they are so distinct. Unfortunately Excel lacks a way of easily testing whether any particular pair of means is the same or different. All we can say for sure is that they are not all the same.
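For readers who want to see what the ANOVA tool is doing, here is a minimal Python sketch of the two steps described above: regrouping two-column data by factor (the job done by the pivot table) and computing the F statistic as the between-group mean square divided by the within-group mean square. The rows and fertilizer labels are invented for illustration, the function names are hypothetical, and the p value itself is left to Excel.

```python
from collections import defaultdict
from statistics import mean


def pivot_by_factor(rows):
    """Group (factor, value) rows into {factor: [values]}, like a pivot table."""
    groups = defaultdict(list)
    for factor, value in rows:
        groups[factor].append(value)
    return dict(groups)


def one_way_f(groups):
    """F statistic for single factor ANOVA: MS between / MS within."""
    values = [v for g in groups.values() for v in g]
    grand_mean = mean(values)
    k, n = len(groups), len(values)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups.values())
    ss_within = sum((v - mean(g)) ** 2 for g in groups.values() for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Hypothetical two-column data: (fertilizer, yield).
rows = [("A", 1), ("A", 2), ("A", 3),
        ("B", 2), ("B", 3), ("B", 4),
        ("C", 6), ("C", 7), ("C", 8)]
groups = pivot_by_factor(rows)
print(groups)
print(one_way_f(groups))
```

A large F statistic corresponds to a small p value: the group means are far apart relative to the scatter inside each group.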
Here is a slightly more complicated and realistic example. You are designing an advertising campaign and you have models with different eye colors: blue, brown, and green, and also one shot in which the model is looking down. You measure the response to each arrangement. Does the eye color affect the response? The dataset is called adcolor⁴. The first few lines are here:
³http://youtu.be/wevrSWYBl8U
The first few lines of the adcolor dataset
The color is the factor, and we’ll need to use PIVOT TABLE to tabulate the data so that the eye colors become the columns. I’ve done that here.
The result is
The eye color results
⁴https://dl.dropboxusercontent.com/u/23281950/adcolour.xls
The p value is 0.036184, smaller than 0.05. As a result we can state that different colors are associated with different responses. Looking at the summary output, green has the highest average at 3.85974, so we could probably select green.
5.2 ANOVA: with more than one factor
The fertilizer test above showed how to test the effect of a single factor. The ANOVA results showed that the means were not the same. By inspecting the boxplot and also the Excel output, it is clear that the variety White has a higher yield. What might be interesting is finding whether another factor also has a statistically significant effect on yields, and whether the two factors interact.
The case in question involves preparation for the GMAT, an exam required by some graduate schools. The GMAT is a test of logical thinking and is therefore not dependent on specific prior learning. We know the test scores of some applicants, and whether those students came from Business, Engineering or Arts and Sciences faculties. The students had also taken preparation courses, ranging from a 3-hour review to a 10-week course. The question is: did taking the preparation course matter; and did faculty matter? Here we have two factors: faculty and preparation course. The data is already arranged in columns and so we can go straight in with a two-way ANOVA with replication. Look at the data⁵. It is replicated because there are two sets of observations for each preparation type. The output is here:
⁵https://dl.dropboxusercontent.com/u/23281950/testscores.xlsx
Test scores ANOVA output
We have the preparation type in the rows, and the relevant p value there is 0.358, so we fail to reject the null. We cannot say that there is any difference in scores as a result of preparation course type; in other words, the data show no evidence that preparation course type affects scores. However, for faculty (in columns) there is a distinct difference, with a p value of 0.004. From the summary output, it looks as though those in the Engineering faculty had the highest score.
6. Regression: Predicting with continuous variables
This chapter is about discovering the relationships that might or (equally well) might not exist between the variables in the dataset of interest. Here’s a typical if rather simplistic example of regression in action: the operator of some delivery trucks wants to predict the average time taken by a delivery truck to complete a given route. The operator needs this information because he charges by the hour and needs to be able to provide quotations rapidly. The customer provides the distance to be driven, and the number of stops en route. The task is to develop a model so that the truck operator can predict the time taken given that information. Regression provides a mathematical ‘model’ or equation from which we can very quickly predict journey times given relevant information such as the distance and number of stops. This is extremely useful when quoting for jobs or for audit purposes.
We know the size of the input variables (the distance and the deliveries), but we don’t know the rate of change between them and the dependent or response variable: what is the effect on time of increasing distance by a certain amount? Or adding one more stop? Using the technique of regression, we make use of a set of historical records, perhaps the truck’s log-book, to calculate the average time taken by the truck to cover any given distance. As with any attempt to predict the future based on the past, the predictions from regression depend on unchanged conditions: no new roadworks (which might speed up or delay the journey), no change in the skills of the driver.
In the language of regression, the time taken in the truck example is the response or dependent variable, and the distance to be driven is the explanatory or independent variable. While we will only ever have just one response variable, the number of explanatory variables is unlimited. The number of stops the truck has to make will impact journey time, as will variables such as the age of the truck, weather conditions, and whether the journey is urban or rural. If we know these variables, we can also include them in the model and gain deeper insights into what does (and equally important does not) affect journey time.
Regression models are used primarily for two tasks: to explain events in the past; and to make predictions. Regression provides coefficients which give the size and direction of the change in the response variable for a one-unit change in one or more of the explanatory variables.
Regression makes a prediction for how long a truck will take to make a certain number of deliveries and also states how accurate the prediction will be. With a regression model in hand, we can make quotations accurately and quickly. In addition, we can detect anomalies in records and reports because we can calculate how long a journey should have taken under various what-if conditions.
Here I have used a straightforward example of calculating a time problem for a trucking firm, but regression is much more powerful than that. Some form of regression underlies a great deal of applied research. When you read about increased risk due to smoking (or whatever else is the latest danger) that risk was calculated with regression. In this chapter, we’ll calculate the difference between used and new Mario Kart sales on eBay, estimate stroke risk factors, and gender inequalities in pay, all with regression.
Regression is not very difficult to do, but the problem is that everybody does it. Ordinary Least Squares (OLS), linear regression’s proper name, rests on some assumptions which should be checked for validity, but often aren’t. We cover the assumptions, and what to do if they’re not met, in the following chapter. This chapter is about the hands-on applications of regression.
6.1 Layout of the chapter
The chapter begins with some background on how regression works. We’ll illustrate the theory with the trucking example mentioned above, before going on to adding more explanatory variables to improve the accuracy of the prediction.
Particularly useful explanatory variables are dummy variables, which take on categorical values, typically zero or one. For example, we could code employees as male and female, and discover from this whether there is a gender difference in pay, and calculate the effect. In a worked example in the text, we analyze the sales of Mario Kart on eBay, and use dummy variables to find the average difference in cost between a new and a used version.
Sometimes two variables interact: higher or lower levels in one variable change the effect of another variable. In the text, we show that the effect of advertising changes as the list price of the item increases.
Finally, we’ll discuss curvilinearity. The relationship between salary and experience is non-linear. As people get older and more experienced, their salaries first move quickly upwards, and then flatten out or plateau. We can capture that non-linear function to make prediction more accurate.
6.2 Introducing regression
At school you probably learned how to calculate the slope of a line as ‘rise over run’. Let’s say you want to go to Paris for a vacation. You have up to a week. There are two main expenses, the airfare and the cost of accommodation per night. The airfare is fixed and stays the same no matter whether you go for one night or seven. The hotel (plus your meal charges etc.) is $200 per night. You could write up a small dataset and graph it like this:
Total Paris trip expenses
The equation is: Total expenses = 1000 + 200 * Nights
Basically, you have just written your first regression model. The model contains two coefficients of interest:
• the intercept, which is the value of the response variable (cost) when the explanatory variable is zero. Here the intercept is $1000. That is the expense with zero nights. It is the point on the vertical y axis where the trend line cuts through it, where x = 0. You still have to pay the airfare regardless of whether you stay zero nights or more.
• the slope of the line, which is $200. For every one-unit increase in the explanatory variable (nights) the dependent variable (total cost) increases by this amount.
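The Paris model is small enough to write as a one-line function. This is only a sketch in Python; the function and parameter names are hypothetical, while the intercept of 1000 and slope of 200 come from the equation above.

```python
def total_expenses(nights, intercept=1000, slope=200):
    """Total trip cost: fixed airfare (intercept) plus nightly cost (slope)."""
    return intercept + slope * nights


# Cost for each possible length of stay, from zero nights to a full week.
for nights in range(8):
    print(nights, total_expenses(nights))
```

With zero nights the function returns the intercept alone, and each extra night adds exactly the slope.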
In regression, we usually know the values of the dependent variable and the independent variable, but we don’t know the coefficients. We use regression to extract those coefficients from the data so that we can make predictions.
How do we know whether the slope of the regression line reflects the data? Try this thought experiment: if you plotted the data and the trend line was flat, what would that mean? No relationship. You could stay in Paris forever, free! A flat line has zero slope, and so an extra night would not cost any more.
More formally, the hypothesis testing procedure tests the null hypothesis that the slope is zero. If we can reject the null, then we are required to accept the alternative hypothesis, which is that the slope is not zero. If it isn’t zero (the trend line could slope up or down) then we have something to work with. We test the hypothesis using the data found from the sample, and Excel gives us a p value. If the p value is smaller than 0.05, we reject the null hypothesis. If we reject the null we can say that there is a slope and the analysis is worthwhile.
The Paris example had only one independent variable (number of nights) but regression can include many more, as the trucking example below will show. If there is more than one variable, we cannot show the relationship with one trend line in two dimensions. Instead, the relationship is a hyperplane or surface. The plot below shows the effect on taste ratings (the dependent variable) of increasing amounts of lactic acid and H2S in cheese samples.
3D hyperplane for responses to cheese
Here is another example. We have data on the size of the population of some towns and also pizza sales. How can we predict sales given population? Pizza YouTube here¹
6.3 Trucking example
The records of some journeys undertaken are available in an Excel file named ‘trucking’².
Take a look at the data. As I mentioned at the beginning of the book, it is always a good idea to begin with at least a simple exploratory plot of the variables of interest in your data. With Excel we can quickly draw a rough plot so that we can see what is going on. What we want to do is predict time when we know the distance. The scatter plot below shows time as a function of distance. YouTube: scatterplot of trucking data³
¹https://www.youtube.com/watch?v=ib1BfdVxcaQ
²https://dl.dropboxusercontent.com/u/23281950/trucking.xlsx
Trucking scatterplot
Time is the dependent variable and distance is the independent variable. The independent variable goes on the horizontal x axis, the dependent variable on the vertical y axis.
From the plot it is clear that there is a positive relationship between the two variables. As the distance increases, then so does the time. This is hardly a surprise. But we want more than this: we want to quantify the relationship between the time taken and the distance traveled. If we can model this relationship that will be useful when customers ask for quotations.
The dataset contains one further variable, which is the number of deliveries on the route. We’ll use that later on. For now we’ll use just the time and the distance.
³https://www.youtube.com/watch?v=HSpY12mLXoU
To build our model, we want to build an equation which looks like this in symbolic form
$$\hat{y} = b_0 + b_1 x_1 + \dots + b_n x_n + e$$
ŷ (spoken ‘y hat’) is the predicted value of the dependent variable. It carries a ‘hat’ because it is an estimate. b0 is the intercept, or the point where the trend line (also called the regression line) passes through the vertical y axis. This is the point where the independent variable is zero. You perhaps remember a similar equation from school: y = mx + b.
In the trucking example, the intercept (b0) might be the time taken warming up the truck and checking documentation. Time is running, but the truck is not moving. b1 is the ‘coefficient’ for the independent variable because it provides the change in the dependent variable for a one-unit change in the independent variable. x1 is the independent variable, in this case distance. We want to be able to plug in some value of the independent variable and get back a predicted time for that distance. The b and the x both have a subscript of 1 because they belong to the first (and so far only) independent variable. I have put in more variables just to indicate that we could have many.
The coefficient, b1, provides the rate of change of the dependent variable for a one-unit change in the independent variable. You can think of this as the slope of the line: a steeper slope means a greater increase in y for a one-unit increase in x.
We can now find the estimated regression equation using the regression application in Excel. First, we use the regression function in Excel’s Data Analysis tool to regress time on distance. YouTube⁴
The regression output is as below:
⁴https://www.youtube.com/watch?v=xKsYfa7YGgE
The regression output
I have marked some of the key results in red, and I explain them below.
6.4 How good is the model? r-squared
There is a red circle around the adjusted r-squared value, here 0.622. r-squared is a measure of how good the model is. r-squared goes from 0 to 1, with a 1 meaning a perfect model, which is extremely rare in applied work such as this. Zero means there is no relationship at all. The result here of 0.62 means that 62% of the variance in the dependent variable is explained by the model. This isn’t bad, but it’s not great either. Below, we’ll add more independent variables and show how the accuracy of the model improves. The adjusted r-squared takes into account the number of variables in the regression equation and also the sample size, which is why it is slightly smaller than the r-squared value. It doesn’t have quite the same interpretation as r-squared, but it is very useful when comparing models. As with r-squared, we want as high a value as possible: higher is better because it means that the model is doing a more accurate job.
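To make the two measures concrete, the sketch below computes r-squared as 1 minus the ratio of unexplained to total variation, and the adjusted value using the standard adjustment for sample size n and number of explanatory variables k. It is a minimal Python illustration with invented numbers, not the trucking data, and the function names are hypothetical.

```python
from statistics import mean


def r_squared(observed, predicted):
    """Share of the variation in the observed values explained by the model."""
    y_bar = mean(observed)
    ss_res = sum((y - p) ** 2 for y, p in zip(observed, predicted))
    ss_tot = sum((y - y_bar) ** 2 for y in observed)
    return 1 - ss_res / ss_tot


def adjusted_r_squared(r2, n, k):
    """Penalize r-squared for the number of explanatory variables k."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


# Hypothetical observed values and model predictions.
observed = [1, 2, 3, 4]
predicted = [1.5, 1.5, 3.5, 3.5]
r2 = r_squared(observed, predicted)
print(r2, adjusted_r_squared(r2, n=len(observed), k=1))
```

Because the adjustment multiplies the unexplained share by (n - 1) / (n - k - 1), the adjusted figure is lower than r-squared whenever the fit is imperfect, and the gap widens as more variables are added.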
Also circled are the words intercept and distance. These give us the coefficients we need to write the model. Extracting the coefficients from the Excel output, we can write the estimated regression model.
$$\hat{y} = 1.274 + 0.068 \cdot \text{Dist}$$
where ŷ is the estimated or predicted response, time. If you took a very large number of journeys of the same distance, this would be the average of the time taken. The intercept is a constant, it doesn’t change. It is the amount of time taken before a single kilometer has been driven. The distance coefficient of 0.068 is the piece of information that we really want. This is the rate of change of time for a one-unit change in the independent variable, distance. An increase of one kilometer in distance increases the predicted time by 0.0678 hours, and vice versa of course. Do the math and you’ll see that the implied average speed is 1/0.0678, or about 14.7 km per hour.
6.5 Predicting with the model
A model such as this makes estimating and quoting for jobs much easier. If a manager wanted to know how long it would take for a truck to make a journey of 2.5 km for example, all he or she would need to do is to plug 2.5 into distance and get:
predicted time = 1.27391 + 0.06783 * 2.5 = 1.4435 hours.
Multiply this result by the cost per hour of the driver, add the miscellaneous charges and you’re done.
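The quotation step can be sketched as two small Python functions. The coefficients are the ones from the regression output quoted above; the hourly rate and miscellaneous charge in the example are made-up figures, and the function names are hypothetical.

```python
INTERCEPT = 1.27391   # hours, from the regression output
SLOPE = 0.06783       # hours per unit of distance, from the regression output


def predicted_time(distance):
    """Predicted journey time in hours for a given distance."""
    return INTERCEPT + SLOPE * distance


def quote(distance, hourly_rate, misc_charges=0.0):
    """Price of a job: predicted hours times hourly rate, plus extras."""
    return predicted_time(distance) * hourly_rate + misc_charges


print(round(predicted_time(2.5), 4))
# Hypothetical rate of 40 per hour and 15 of miscellaneous charges.
print(round(quote(2.5, hourly_rate=40, misc_charges=15), 2))
```

Keeping the coefficients as named constants makes it easy to update the quotation tool when the regression is rerun on newer records.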
It is unwise to extrapolate. Only make predictions within the range of the data with which you calculated your model (see also Chapter 4 on this).
6.6 How it works: Least-squares method
The method we just used to find the coefficients is called the method of least squares. It works like this: the software finds the straight line through the points that minimizes the sum of the squared vertical distances between the predicted y value for each x (the value provided by the trend line) and the actual or observed value of each y for that x. It tries to go as close as it can to all of the points. The vertical distance is squared so that the amounts are always positive.
The plot below is the same as the one above, except that I have added the trend line.
The black line illustrates the error
The slope of the trend line is the coefficient 0.0678. That is the rate of change of the dependent variable for a one-unit change in the independent variable. Think “rise over run”. I extended the trend line backwards to illustrate the meaning of intercept. The value of the intercept is 1.27391, which is the value of y when x is zero. This is actually a form of extrapolation, and because we have no observations of zero distance, this is an unreliable estimate.
Now look at the point where x = 100. There are several y values representing time taken for this distance. I have drawn a black line between one particular y value and the trend line. This vertical distance represents an error in prediction: if the model was perfect, all the observed points would lie along the trend line. The method of least squares works by minimizing the sum of these squared vertical distances. It is possible to do the calculations by hand, but they are tedious and most people use software in practice. The gap between predicted and observed is known as a residual and it is the ‘e’ term in the general form equation above.
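For a single explanatory variable the least-squares coefficients have a closed form: the slope is the sum of cross-deviations divided by the sum of squared x deviations, and the intercept makes the line pass through the point of means. The Python sketch below applies those formulas to invented journeys (not the trucking file) and lists the residuals; the function names are hypothetical.

```python
from statistics import mean


def fit_line(xs, ys):
    """Least-squares intercept and slope for one explanatory variable."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return intercept, slope


def residuals(xs, ys, intercept, slope):
    """Observed minus predicted y for every point (the 'e' terms)."""
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]


# Hypothetical journeys: distance and time taken in hours.
distance = [50, 100, 100, 150, 200]
time = [4.5, 8.0, 9.0, 11.5, 15.0]
b0, b1 = fit_line(distance, time)
print(b0, b1)
print(residuals(distance, time, b0, b1))
```

The two journeys at distance 100 have different residuals, which is exactly the situation the black line in the plot illustrates.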
The error discussed just above is the vertical distance between the predicted and the observed values for every x value. The amount of error is indicated by the r-squared value. r-squared runs from 0 (a completely useless model) to 1 (perfect fit).
In the glossary, under Regression, I have written up the math that underpins these results.
6.7 Adding another variable
The r-squared of 0.66 we found with one independent variable is reasonable in such circumstances, indicating that our model explains 66% of the variability in the response variable. But we might do better by adding another variable to explain more of the variability.
The trucking dataset also provides the number of deliveries that the driver has to make. Clearly, these will have an effect on the time taken. Let’s add deliveries to the regression model.
Note that you’ll need to cut and paste so that the explanatory variables are adjacent to each other. It doesn’t matter where the response variable is, but the explanatory variables must be adjacent in one block. The new result is below:
Now with deliveries
Note that the r-squared has increased to 0.903, so the new model explains roughly a quarter more of the total variation in time. The improved model is:
$$\hat{y} = -0.869 + 0.06 \cdot \text{Dist} + 0.92 \cdot \text{Del}$$
A few things to note here: the values of the coefficients have changed. This is because the interpretation of the coefficients in a multiple model like this is based on only one variable changing. For example, the coefficient of deliveries is 0.92, almost one hour for each delivery assuming that the distance doesn’t also change. We should interpret the coefficients under the assumption that all the other variables are ‘held steady’, apart from the variable of interest.
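The ‘held steady’ idea is easy to check numerically. The Python sketch below plugs values into the two-variable model above and changes one input at a time; the journey of 100 distance units and 2 deliveries is an invented example and the function name is hypothetical.

```python
def predicted_time(distance, deliveries):
    """Two-variable trucking model using the coefficients quoted in the text."""
    return -0.869 + 0.06 * distance + 0.92 * deliveries


base = predicted_time(100, 2)
# One more delivery, distance held steady: time rises by the deliveries coefficient.
extra_delivery = predicted_time(100, 3) - base
# One more unit of distance, deliveries held steady: time rises by the distance coefficient.
extra_distance = predicted_time(101, 2) - base
print(base, extra_delivery, extra_distance)
```

Each difference recovers exactly one coefficient, which is what it means to interpret a coefficient with the other variables held constant.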
6.8 Dummy variables
Above, we saw how adding a further variable dramatically improved the accuracy of a model. A dummy variable is an additional variable, but one that we construct ourselves as a result of dividing data into two classes, for example by gender. Dummy variables are powerful because they allow us to measure the effect of a binary variable, known as a dummy or sometimes indicator variable. A dummy variable takes on a value of zero or one, and thus partitions the data into two classes. One class is coded with a zero, and is called the reference group. The other class is coded with a one.
There may be more than two groups, in which case each extra group gets its own zero/one dummy, but there will always be one reference group. We are generating an extra variable, so the regression equation looks like this:
$$\hat{y} = b_0 + b_1 x_1 + b_2 x_2$$
In the case of observations which have been coded x1 = 0 (the reference group), the b1 term will disappear because it is multiplied by zero. The b2 term remains. For the reference group, the estimated regression equation then simplifies to
$$\hat{y}_{\text{reference}} = b_0 + b_2 x_2$$
while for the non-reference group, where x1 = 1, it is
$$\hat{y}_{\text{nonreference}} = b_0 + b_1 x_1 + b_2 x_2$$
The size of b1 represents the difference between the average response of the reference group and that of the non-reference group.
Let’s work through an example. Creating a dummy variable⁵
The dataset [‘gender’](https://dl.dropboxusercontent.com/u/23281950/gender.xlsx) contains records of salaries paid, years of experience and gender. We might want to know whether men and women receive the same salary given the same years of experience. Load the data, then run a linear regression of Salary on Years of Experience, using years of experience as the sole explanatory variable. The result is below:
⁵https://www.youtube.com/watch?v=TBJsEb2UCPs
The gender regression
The interpretation is that the intercept of $30949 is the average starting salary, with years of experience zero. The coefficient $4076 means that every additional year of experience increases the worker’s salary by this amount.
The r-squared is 0.83, so the simple one-variable model explains about 83% of the variation in the dependent variable, salary. We have one more variable in the dataset, gender. Add this as a dummy variable and run the regression again. Note that you will have to cut and paste the variables so that the explanatory variables are adjacent.
With the gender dummy added
The r-squared has increased to nearly 1, so our new model with the inclusion of gender fits the data very closely. The important new variable is gender, coded as female = 0 and male = 1. Nothing sexist about this, we could equally well have reversed the coding. If you are female, then the model for your salary is: 23607.5 + 4076 * Years
If you are male, then the model for your salary is: 23607.5 + 14683.67 + 4076 * Years
Each extra year of experience provides the same salary increase, but on average males receive $14683.67 more in earnings.
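The two salary equations are really one equation with a switch. The Python sketch below writes it that way, using the intercept, the years coefficient and the $14683.67 male premium quoted above; the function name and the ten-year example are hypothetical.

```python
def predicted_salary(years, male):
    """Salary model with a gender dummy: male = 1, female = 0 (reference group)."""
    return 23607.5 + 14683.67 * male + 4076 * years


# Same experience, different dummy value: the gap is the dummy coefficient.
female_salary = predicted_salary(10, male=0)
male_salary = predicted_salary(10, male=1)
print(female_salary, male_salary, male_salary - female_salary)
```

Whatever value of years is used, the difference between the two predictions is the dummy coefficient, because the dummy shifts the intercept and leaves the slope unchanged.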
Here is another example, using the maintenance dataset. Dummy variable⁶
Another dummy variable example and a cautionary tale
The mariokart dataset came from [OpenIntro Statistics](www.openintro.org), a wonderful free textbook for entry-level students. The dataset contains information on the price of Mario Kart in eBay auctions. First let’s test whether condition ‘new’ or ‘used’ makes a difference. Construct a new column called CONDUMMY, coded new = 1 and used = 0. Now run a regression of total price on your new dummy variable. The output is below.
⁶https://www.youtube.com/watch?v=Yv681upodDI
The condition dummy results
This result is surprisingly poor. The p value is 0.129, meaning that condition is not a statistically significant predictor of price. Surely the condition must have some significant effect? Wait, we forgot to do some visualization. To the right is a histogram of the total price variable.
Looks like we might have a problem with outliers: some of the observations are much larger than the others. Take another look with a bar chart at the higher prices.
Mario Kart total prices
We can identify these outliers using z scores (see the Glossary). Or we could just sort them by size and then make a value judgement based on the description. That’s what I have done in this YouTube. Outlier removal and regression⁷
The total prices as a bar chart
It seems that two of the items listed were for grouped items which were quite different from the others. There is therefore a legitimate reason for excluding them. Below are the new results:
⁷https://www.youtube.com/watch?v=ivLGteHuu3Q
Corrected Mario Kart output
The estimated regression equation is
$$\hat{y} = 42.87 + 10.89 \cdot \text{CONDUMMY}$$
The condition dummy was coded as new = 1, used = 0. If a Mario Kart is used, then its average price is 42.87; if it is new then the average price is 42.87 + 10.89. The average difference in price between used and new is nearly $11. Makes sense.
Take-home: check your data before doing the regression.
6.9 Several dummy variables
The gender and the Mario Kart examples above contained just one dummy variable. But it is possible to have more. For example, your sales territory contains four distinct regions. If you make one of the four the reference level, and then divide up the data with dummies for the remaining three regions, you can compare performance in each of the regions both to each other and to the reference level.
The dataset maintenance contains two variables which you can convert to dummies, following this YouTube.
Maintenance Regression⁸
6.10 Curvilinearity
So far, we have assumed that the relationship between the dependent variable and the independent variable was linear. However, in many situations this assumption does not hold. The plot below shows ethanol production in North America over time.
North American ethanol production. Source: BP
The source of the data is BP. It is clear that production of ethanol is increasing yearly but in a non-linear fashion. Just drawing a straight line through the data will miss the increasing rate of production. We can capture the increasing rate with a quadratic term, which is simply the time element squared. I have created a new variable which indexes the years, which I have called t, and a further variable which is t squared.
Curvilinear regression YouTube⁹
⁸https://www.youtube.com/watch?v=xWdcT7u9YFE&feature=em-upload_owner
The dataset with an index for time. Source: BP
A regression of output on t has an r-squared value of 0.797. Inclusion of the t squared term increases the r-squared markedly to 0.98. The regression output is below.
Regression output with the quadratic term
We would write the estimated regression equation as
$$\hat{y} = 4504 - 1301\,t + 194.9\,t^2$$
Notice that for early, smaller values of t, the effect of the quadratic term is small relative to the linear term. As t gets larger, the quadratic term swamps the linear t term.
⁹https://www.youtube.com/watch?v=jgjWSpyPBqg
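To see where the quadratic term takes over, compare the size of the two terms year by year. The Python sketch below uses only the two slope coefficients quoted in the estimated equation (1301 for t and 194.9 for t squared); the function names are hypothetical and the range of ten years is chosen purely for illustration.

```python
def linear_term(t):
    """Magnitude of the linear term, 1301 * t."""
    return 1301 * t


def quadratic_term(t):
    """Magnitude of the quadratic term, 194.9 * t squared."""
    return 194.9 * t ** 2


for t in range(1, 11):
    bigger = "quadratic" if quadratic_term(t) > linear_term(t) else "linear"
    print(t, linear_term(t), round(quadratic_term(t), 1), bigger)

# The two terms are equal in size where 194.9 * t = 1301.
print(1301 / 194.9)
```

Before the crossover the negative linear term dominates and predicted production falls; after it the quadratic term dominates and the curve bends upwards.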
Making a prediction. Let’s predict ethanol production after 5 years. Then
$$\hat{y} = 4504 - 1301 \cdot 5 + 194.9 \cdot 25 = 2871.5$$
The plot below shows predicted against observed values. While the fit is clearly imperfect, it is certainly better than a straight line.
Predicted against observed ethanol production.
6.11 Interactions
Increasing the price of a good usually reduces sales volume (although of course profit might not change if the price increase is sufficient to offset the loss of sales). Advertising also usually increases sales, otherwise why would we do it?
What about the joint effect of the two variables? How about reducing the price and increasing the advertising? The joint effect is called an interaction and can easily be included in the explanatory variables as an extra term. The output below is the result of regressing sales on Price and Ads for a luxury toiletries company. The estimated regression model is
$$\hat{y} = 864 - 281\,P + 4.48\,\text{Ads}$$
Regression output for just price and ads
So far so good. The signs of the coefficients are as we would expect from economic theory. Sales go down as price rises (negative sign on the coefficient). Sales go up with advertising (positive sign on the coefficient). Caution: do not pay too much attention to the absolute size of the coefficients. The fact that price has a much larger coefficient than advertising is irrelevant, because the size of a coefficient depends on the units in which its variable is measured.
Now create another term which is price multiplied by ads. Call this term PAD. The first few lines of the dataset are below.
First few lines with the new interaction variable
Now do the regression again, including the new term.
Inclusion of the interaction term
The new term is statistically significant and the r-squared has increased to 0.978109. The new model does a better job of explaining the variability. How come price is now positive and the interaction term is negative? We have to look at the results as a whole, remembering that the signs work when all the other terms are ‘held constant’. The explanation: as the price increases, the effect of advertising on sales is LESS. You might want to lower the price and see if the increased volume compensates.
Another interaction example
Here’s another example, this one relating to gender and pay. The dataset is called ‘paygender’ and contains information on the gender of the employee, his/her review score (performance), years of experience and pay increase. We want to know:
• Is there a gender bias in awarding salary increases?
• Is there a gender bias in awarding salary increases based on the interaction between gender and review score?
First, let’s regress salary increases on the dummy variable of gender and also Review. My results are below. (I have created a new variable which I have called G. It is just the gender variable coded with male = 0 and female = 1.)
G dummy and Review
Nothing very surprising here: women get paid on average 233.286 less than men. And, holding gender steady, each point increase in Review gives an increase in salary of 2.24689.
How do I know that women are paid less than men? The estimated regression equation is:
$$\hat{y} = 204.5061 - 233.286\,G + 2.24689 \cdot \text{Review}, \quad G = 1 \text{ if female},\ G = 0 \text{ if male}$$
Remember how we coded men and women? For men G = 0, so the gender term disappears and we are left with this equation for men:
$$\hat{y} = 204.5061 + 2.24689 \cdot \text{Review}$$
and this one for women (I’ve done the subtraction):
$$\hat{y} = -28.78 + 2.24689 \cdot \text{Review}$$
Conclusion: there is a gender bias against women. For the same Review standard, on average women are paid 233.286 less than men.
How about the second question: the possibility that the gender bias increases with Review level? We can test this with an interaction variable. Multiply together your gender dummy and the Review score to create a new variable called Interact. Then do the regression again with salary regressed against G, Review and Interact. My output is below.
Interaction output
For men, the equation is Salary = 59.94472 + 4.848995 * Review (the G and Interact terms go to zero because we coded male = 0). So for every one-point increase in Review, a man’s salary increases by 4.848995.
For women, the equation is Salary = 59.94472 - 29.714 + (4.848995 - 4.05468) * Review, so for every one-point increase in Review, a woman’s salary increases by only 4.848995 - 4.05468 = 0.79. Notice that the adjusted r-squared for this model is higher (0.927339 compared to 0.810309) than for the model with just G and Review. So the interaction is statistically significant (p value 0.000316) and has the meaning that for women, improving the Review score doesn’t increase the salary as much as for men.
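The interaction model can be written as one function in which the slope on Review depends on the gender dummy. The Python sketch below uses the coefficients quoted above. The negative signs on the G and Interact coefficients are an assumption read from the equations in the text, and the function names are hypothetical.

```python
# Coefficients quoted in the text; signs on B_G and B_INTERACT are assumed.
B0 = 59.94472
B_G = -29.714
B_REVIEW = 4.848995
B_INTERACT = -4.05468


def predicted_increase(review, g):
    """Salary increase with gender dummy g (male = 0, female = 1) and interaction."""
    interact = g * review  # the constructed Interact column
    return B0 + B_G * g + B_REVIEW * review + B_INTERACT * interact


def review_slope(g):
    """Change in predicted salary for one extra Review point, by gender."""
    return B_REVIEW + B_INTERACT * g


print(review_slope(0), review_slope(1))
print(predicted_increase(10, 0), predicted_increase(10, 1))
```

The dummy alone shifts the intercept, while the interaction term changes the slope, which is why the two groups gain different amounts from the same improvement in Review.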
Conclusion: The analysis showed that women are being treated unfairly. There is a gender bias against them in average salaries for the same performance review; and each increase in review points earns them considerably less in salary than for the equivalent male. Caution: this conclusion is based solely on the limited evidence provided by this small dataset.
6.12 The multicollinearity problem
Multicollinearity refers to the way in which two or more variables ‘explain’ the same aspect of the dependent variable. For example, let’s say that we had a regression model which was explaining someone’s salary. Employees tend to get paid more as they get older and also as their years of experience increase. So if we had both age and years of experience in the list of independent variables we would almost certainly suffer from multicollinearity.
Multicollinearity can lead to some frustrating and perplexing results. You run a regression. One of the variables has a p value larger than 0.05, so you decide to take it out. You run the regression again and the sign and/or significance of another variable changes. This happens because the two variables were explaining the same aspect of the dependent variable jointly.
How to solve this problem: check the correlation of the independent variables first, before putting them into your model. Chapter 14 has a section on correlation. If you find that the correlation of any two variables is higher than 0.7, be suspicious. These two may bring you some grief!
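The correlation check can be automated for any number of candidate variables. The Python sketch below computes the Pearson correlation for every pair of columns and lists the pairs above the 0.7 rule of thumb. The columns are invented, the function names are hypothetical, and treating strong negative correlations as equally suspicious is an assumption added here.

```python
from itertools import combinations
from math import sqrt
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)


def suspicious_pairs(columns, threshold=0.7):
    """Return (name_a, name_b, r) for pairs whose |r| exceeds the threshold."""
    flagged = []
    for (name_a, a), (name_b, b) in combinations(columns.items(), 2):
        r = pearson(a, b)
        if abs(r) > threshold:
            flagged.append((name_a, name_b, r))
    return flagged


# Hypothetical candidate explanatory variables.
columns = {
    "age": [25, 30, 35, 40, 45],
    "experience": [2, 7, 11, 17, 21],
    "deliveries": [3, 1, 4, 1, 3],
}
print(suspicious_pairs(columns))
```

Pairs that are flagged are candidates for dropping one of the two variables before the regression is run.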
6.13 How to pick the best model
You will be trying out different formulations, adding and removing variables to try to capture as much explanatory power as you can. How do you decide if one model is better than another? There are two approaches:
• Look only at the adjusted r-squared value, even if your model contains variables with a p value larger than 0.05. If the adjusted r-squared value goes up, leave such variables in.
• Prune your model so that it contains only variables with a p value <= 0.05. You still want as high an adjusted r-squared as you can get, but you also want all your explanatory variables to be statistically significant.
I take the latter approach. You usually have fewer variables but all of them have a reason (statistical significance) for being in your model and you can interpret their meaning intelligently. A parsimonious model is better than a complex model that fits your data very well. Simple and robust is good. The dataset that you used to fit your model is only a sample from a population. An overly complex model may not work well when presented with a different sample from the population.
6.14 The key points
• Think through your model before you start including variables. What variables do you think will have an effect on the dependent variable and in which direction (plus or minus)? It is tempting to just put in everything and hope for the best but this rarely works. Some software is able to do stepwise regression, pulling out insignificant variables for you, but Excel is not one of them.
• Keep your model as simple as possible. Complicated models rarely work well.
• Visualise your data first, even with a simple scatter plot as we have done throughout this chapter.
• Check whether you have all the variables that you might need. If you were trying to predict whether a shop selling expensive jewellery would make sufficient sales, you might want to know the average income of residents. If you don’t have it, get it. There is a huge amount of data lying about which you can obtain either free or quite cheaply. I’ve included a very brief list of URLs in Chapter 12.
6.15 Worked examples
1. The estimated regression equation for a model with two independent
variables and 10 observations is as follows:
$$\hat{y} = 29 + 0.59x_1 + 4.9x_2$$
What are the interpretations of b1 and b2 in this estimated regression equation?
Answer: the dependent variable changes by 0.59 on average when x1 changes by one
unit, holding x2 constant. Similarly for b2: the dependent variable changes by 4.9 on
average when x2 changes by one unit, holding x1 constant.
Predict the value of the dependent variable when x1 is 175 and x2 is 290:
$$\hat{y} = 29 + 0.59(175) + 4.9(290) = 1553.25$$
7. Checking your regression model
It isn’t difficult to build a regression model, as the previous chapter has shown. But
that is part of the problem: everybody does it. But not everybody takes the trouble
to check the results. Checking the results is an important step for two reasons: first, to
make sure your calculations and predictions are correct; and secondly to show third
parties that your work is solid. In this chapter we’ll work on doing just that. To
decide whether the models we have been working on are any good we need to look
at two areas:
• Is your model statistically significant?
• Are the assumptions behind the least squares method met? In particular,
linearity and residual distribution.
The first area is easier to work through than the second, which is probably why one
doesn’t always see residual analysis discussed when results are presented. This is a
shame because there is a great deal to be learned from picking through residuals. Your
analysis will be greatly improved by close attention to this area.
Below we’ll work through statistical significance and then test the assumptions.
7.1 Statistical significance
The regression trend line displays a rate of change between two variables. The line
slopes upwards when the dependent variable increases for a one-unit increase in the
independent variable (positive relationship) and vice-versa (negative relationship).
What we want to know is this: did that slope upwards or downwards occur by chance;
if we took another sample from the population would we achieve a similar result?
Key point: we are usually working with a sample drawn from a population. The sample
in the trucking example was the firm’s log book.
The null hypothesis is that there is no slope; with the alternative that there is in fact a
slope. The hypotheses (see the Glossary for help on hypotheses) to test whether there
is a statistically significant slope are:
The null:
$$H_0 : \beta = 0$$
and the alternative:
$$H_a : \beta \neq 0$$
where β is the symbol for the population coefficient of the independent
variable of interest. In words, the hypothesis is testing whether beta is zero or not. If
it is zero, then we are ‘flatlining’ and there is no point in continuing with the
analysis. If however we can reject the null, then we accept the alternative which is
that beta is not zero. It might be negative or positive, that doesn’t matter: what matters
is whether it is zero or not. Note that we use a Greek letter when discussing a
population parameter which will probably never be accurately known. We use the
Latin letter ‘b’ when we have estimated it using a sample drawn from the
population.
Excel does all the testing of statistical significance for us in two ways.
First, it tests the individual variables and provides a p value. Below is the result from
the first trucking regression we did. Note that the p value for distance is 0.004. It is
customary to use a cut-off of 0.05. The meaning of this result is that, following the
rejection rule, we reject the null if the p value is smaller than 0.05. Because 0.004 is
smaller than 0.05, we reject the null and therefore conclude that there is in fact a
statistically significant slope. In summary, we tested the hypothesis that the slope
was zero, and rejected it. Therefore we accept the alternative which is that there is a
slope of some sort.
The trucking regression output
This test tells us that there is a statistically significant slope. It doesn’t tell us the
sign of the slope (up or down) or its magnitude. But the fact that there is a significant
slope is important information. It is worth carrying on with the analysis because the
variables actually mean something. By the way, ignore the p value for the intercept.
It usually has no substantive meaning.
The second way that Excel tests the validity of the model is to test whether the model
as a whole is significant. The test is called the F test after the statistician R A Fisher.
For the trucking example, look under Significance F. The result is a p value, in this
case the same as the p value for the variable DISTANCE. We have only one explanatory
variable and so the p value will be the same. We hope that the p value is smaller than
0.05, which enables us to conclude that at least one of the slopes is non-zero.
The section above examined the first of the tests, which was for the statistical
significance of the model. Now we move on to the second area.
7.2 The standard error of the model
The standard error in Excel is found just under the adjusted r-squared output. It is
the estimated standard deviation of the amount of the dependent variable which is not
explainable by the model. In other words, it is the standard deviation of the residuals.
Under the assumption that the model is correct, it is the lower bound on the standard
deviation of any of the model’s forecast errors. The figure below shows the output of a
regression estimating risk of a stroke (multiplied by 100) against blood pressure and
age, with a dummy variable for smoking or not.
Excel output for stroke likelihood
The standard error is 6 (rounded up). For a normally-distributed variable, 95% of
the observations will be within two standard deviations of the mean. For our
purposes that means that 2 x 6 = 12 should be added to or subtracted from the predicted
value to find an approximate 95% confidence interval for predictions. The estimated
regression equation from the stroke model above is:
$$\hat{y} = -91 + 1.08\,\text{Age} + 0.25\,\text{Pressure} + 8.74\,\text{Smokedummy}$$
An imaginary person who smokes, is aged 68 and has a blood pressure of 175 will
have a risk of 34 (all figures rounded). We can be 95% confident that this estimate
will fall somewhere between 34-12 and 34+12, or 22 to 46. These are rather wide
confidence intervals and so we might want to work at improving the model by, for
example, increasing the sample size or adding more explanatory variables.
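The "prediction plus or minus two standard errors" rule is simple enough to write down as a small function. This is only a sketch of the rule of thumb used above (the function name is made up), taking the rounded prediction of 34 and standard error of 6 from the text as inputs.

```python
def rough_interval(prediction, standard_error, multiplier=2):
    """Approximate 95% interval: prediction +/- multiplier * standard error."""
    margin = multiplier * standard_error
    return prediction - margin, prediction + margin

# Figures from the stroke example in the text (both rounded).
low, high = rough_interval(prediction=34, standard_error=6)
print(low, high)
```

The output reproduces the 22 to 46 range quoted above.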
7.3 Testing the least squares assumptions
There are two key assumptions that the least squares method relies on which we need
to check. After checking we’ll work through ways of retrieving the situation should
the assumptions be violated. The assumptions are:
• The relationship between the dependent and the independent variables is
linear. Fortunately this is easy to check and also to fix if it isn’t satisfied.
• The residuals have a constant variance. (You’ll sometimes see in
other statistics textbooks other stricter requirements, such as the residuals
being normally distributed and with a mean of zero. For most purposes just
checking for constant variance is enough.) What does constant
variance mean? We’ll work through this by defining a residual and
identifying whether or not the assumption has been met. And finally what to
do about it. But first, checking for linearity.
Checking for linearity
The relationship between the dependent variable and the independent variable is
assumed to be linear. This is important because the model gives us a rate of change:
the coefficient shows the change in the dependent variable for a one-unit change in
the independent variable. If the relationship is non-linear, that coefficient will not
be valid in some portions of the range of the variables.
We want to see a straight line (either up or down) on a scatter plot. Put the dependent
variable on the vertical y axis, and one of the independent variables on the
horizontal x axis. If you include the trend line, you can observe how closely the
observed values match the predicted.
We can also check for linearity by plotting the predicted values and the observed
values. The plot below does this for risk of stroke:
Predicted against observed risk
This is a reasonable result, indicating that the linearity condition is met. Most of the
points are in a straight line, although I would be concerned about the lower risk
levels, especially around risk = 20. There appears to be considerable variance at this
point.
7.4 Checking the residuals
A residual is the difference between the observed value of y for any given x value,
and the predicted value of y for that same x value. It is therefore a prediction error,
and is given the notation e, after the Greek letter ‘epsilon’. It is the vertical distance
between the actual and predicted values of the dependent variable for the same value
of the independent variable. We discussed this in the previous chapter in connection
with the least squares method, where we wrote the estimation equation:
$$\hat{y} = b_0 + b_1 x$$
The hat on y, the dependent variable, indicates that it is an estimate of the dependent
variable. We know that the estimate cannot be correct unless all the predicted values
and the observed values match up exactly. If they don’t, then there are residuals. If we
call the errors ε (epsilon), which is the Greek letter matching our letter e, then we can
rewrite the regression equation as:
$$y = \beta_0 + \beta_1 x + \varepsilon$$
The errors have now been absorbed into epsilon and so we can remove the hat from
y. It is the behavior of epsilon that is of interest, because the least-squares method
rests on the assumption that the errors absorbed into epsilon have a constant
variance. This means that there is no relationship between the size of the error term and
the size of the independent variable. Therefore, if we plotted the errors against the
independent variables, we should see no clear pattern. How to do this is described
below.
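As a rough illustration of what a residual is, here is a minimal Python sketch that fits a one-variable least squares line, computes each residual as observed minus predicted, and then scales the residuals by their standard deviation. The distance and time figures are made up, the function names are hypothetical, and the scaling is one simple convention: Excel's own "standard residuals" may be computed slightly differently.

```python
import statistics

def fit_line(xs, ys):
    """Least-squares intercept b0 and slope b1 for one explanatory variable."""
    mean_x = statistics.fmean(xs)
    mean_y = statistics.fmean(ys)
    b1 = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
          / sum((x - mean_x) ** 2 for x in xs))
    b0 = mean_y - b1 * mean_x
    return b0, b1

def residuals(xs, ys):
    """Observed y minus predicted y for every observation."""
    b0, b1 = fit_line(xs, ys)
    return [y - (b0 + b1 * x) for x, y in zip(xs, ys)]

def standardize(values):
    """Divide each value by the sample standard deviation of the values."""
    sd = statistics.stdev(values)
    return [v / sd for v in values]

# Hypothetical delivery data: distance driven and hours taken.
distance = [100, 50, 100, 100, 50, 80, 75, 65, 90, 90]
hours = [9.3, 4.8, 8.9, 6.5, 4.2, 6.2, 7.4, 6.0, 7.6, 6.1]

res = residuals(distance, hours)
std_res = standardize(res)
for d, r in zip(distance, std_res):
    print(d, round(r, 2))
```

Plotting `std_res` against the predicted values (or against `distance`) is the pattern check described in the text: a shapeless cloud is what we hope to see.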
7.5 Constructing a standardized residuals plot
Excel provides what it calls standard residuals as part of its regression output, and
we will use these. Note however that these are not ‘true’ standardized residuals, but
they are probably close enough. Make sure you check the Standardized Residuals
box when setting up your regression.
Check the standardized residuals box
We want to plot the standard residuals against the predicted value yhat. We will
create a new column to the left of the column Standard Residuals, and copy the
column of predicted times into that new column. Then create a scatter plot of
predicted time and standard residuals. You should end up with the image below.
Standardized residual plot¹
¹https://www.youtube.com/watch?v=1wuyZfB39p4
Standard residuals
The Y axis indicates the number of standard deviations that each residual is away
from the mean, plotted against the predicted values on the horizontal axis. None of
the residuals are more than 2 standard deviations away from the mean of zero, so the
results are generally satisfactory, although there is one observation in excess of 1.5.
We still have a worrying fan shape which would seem to indicate non-constant
variance: we can observe a pattern.
Let’s run the regression again, but now including the second variable, which is
deliveries. The residuals plot is obtained in the same way, and here is the result.
Standardized residuals with two variables
The fan problem seems to have improved a little. The r-squared of the model with
two variables was higher, meaning that the residuals were smaller (because there is
less unexplained error).
7.6 Correcting when an assumption is violated
Above, we examined possible violations of two of the assumptions underlying linear
regression: linearity and constant variance. Now let’s look at what to do when
these assumptions are found to have been violated. First some good news: linear
regression is quite robust to such violations, and even so they are quite easy to correct
for. We’ll deal with the problems in the same order: linearity and then constant
variance.
7.7 Lack of linearity
The first check is to look at a scatter plot of the dependent variable against the
independent variable. The plot below shows volume of sales and length of time
employed, together with a trend line.
A curvilinear relationship
There are several different ways in which a linear relationship can be achieved so
that we can use the least squares method. A common method is to include a
quadratic term. This means adding the squared value of the independent variable to
the list of independent variables. This is easily done by creating another column,
consisting of squared values of the first variable.
A new column of the independent variable squared
To do this, create a new column and then label it. In its first cell type =, click on the
first value in the variable which you want to square, and add ^2. Enter. Then drag
downwards. The plot below shows the predicted and observed sales. The model is
clearly superior.
Predicted and observed after inclusion of quadratic term
If the dependent variable has a large number of low values, and is heavily skewed to
the right, then a good solution is to transform the dependent variable into its natural
logarithm.
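Both transformations amount to adding one derived column. As a small sketch with invented sales data (the variable names are assumptions, not the book's), the squared column and the natural-log column look like this in Python:

```python
import math

# Hypothetical data: years employed and volume of sales.
years_employed = [1, 2, 4, 6, 9, 12]
sales = [12, 25, 61, 110, 135, 140]

# Quadratic term: a new explanatory column holding the squared values.
years_squared = [x ** 2 for x in years_employed]

# Log transform of a right-skewed dependent variable (values must be positive).
log_sales = [math.log(y) for y in sales]

print(years_squared)
print([round(v, 3) for v in log_sales])
```

The regression is then run with both `years_employed` and `years_squared` as explanatory variables, or with `log_sales` in place of `sales`.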
7.8 What else could possibly go wrong?
Regression is a very commonly used analytical tool and you are most likely either
going to use it yourself or examine the work of others who have used regression.
Below is a list of common mistakes to watch for. And if I have made them
somewhere in this book, I’m sure you’ll be the first to let me know!
7.9 Linearity condition
Regression assumes that the relationship between the variables is linear: the trend
line that the software tries to find to minimize the squared difference between the
observed and predicted values is straight. So if you try to run a regression on
non-linear data, you’ll get a result but it will be meaningless.
Action: always visualize the data before doing any analysis. If you see that the data
is non-linear, you may be able to transform it using the techniques described in the
previous chapter. As an example: the curvilinear example which was transformed
by including a quadratic term as an explanatory variable.
7.10 Correlation and causation
Regression is a special case of correlation, and as we all know correlation doesn’t
mean causation (see the Glossary). In regression, no matter how good the model, all
that we have been able to show is that a change in an explanatory variable is
associated with a change in the dependent variable. For example, you record the hours
you put into studying and your grades. Surprise! More studying = better grades? Or
perhaps not: it could have been a better instructor. We cannot say that one caused
the other. So when writing up results or interpreting those of others, be very careful
not to claim more than you are able to.
There is a related problem which is ‘reverse causation’. You study more, your
grades go up. Tempting to think that one possibly caused the other. But perhaps it is
the other way round? Your grades were poor, so you studied harder? Or you had good
grades and that led you into taking the course seriously.
Action: try not to include explanatory variables which are affected by the
dependent variable. You could also try to ‘lag’ one variable. Cut and paste the hours
of studying variable so that it is one time-period behind the grades and then run the
regression again.
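Lagging a variable is just a shift by one row. The following is a minimal sketch with made-up study hours and grades (the `lag` helper is hypothetical): each grade ends up paired with the hours studied in the period before it.

```python
def lag(values, periods=1):
    """Shift a series down by `periods` (must be >= 1), padding the start with None."""
    return [None] * periods + values[:-periods]

# Hypothetical weekly records.
hours = [5, 8, 6, 10, 12]
grades = [62, 65, 70, 68, 75]

lagged_hours = lag(hours)
# Drop the first row, which has no earlier period to draw on.
pairs = [(h, g) for h, g in zip(lagged_hours, grades) if h is not None]
print(pairs)
```

The regression would then be run on `pairs`, with one fewer observation than before.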
7.11 Omitted variable bias
Under Correlation in the Glossary I give some examples of lurking variables.
Regression, like correlation, is susceptible to the same problem. Example: you
notice that in hotter weather there are more deaths by drowning. Did the hotter
weather cause the drownings? Well no, the extra swimming caused by the heat
presented more risk scenarios. The lurking variable is hours spent swimming rather
than temperature. Try to get to the real variable if you can.
7.12 Multicollinearity
When predicting the price of a house, square footage and number of rooms are
likely to be highly correlated because they are both explaining the same thing. This
problem results in unusual behavior in the regression model.
Action: correlate your explanatory variables BEFORE doing any regression.
Watch out if you have a pair which has an r value of 0.7 or higher. This doesn’t mean
that you shouldn’t use them, but it’s a red flag.
This is what might happen. You take out one of the variables in a regression, and the
one that’s left reverses its sign or suddenly becomes statistically insignificant.
If you do run the regression and you get an unlikely result, choose the variable with
the highest t value (or smallest p value) and ditch the other.
7.13 Don’t extrapolate
The coefficients from the regression are calculated based on the data you provided.
If you try to predict for a value beyond the range of that data, the results will be
unreliable if not totally wrong.
In the years of experience and salary example in Chapter 6, the coefficient for
years of experience provided the change in salary that an extra year of experience
would give. Would you feel comfortable predicting the salary of someone with 95
years of experience?
Action: before running the numbers, check that the inputs are within the range for
which you calculated the model.
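The range check is a one-liner, shown here as a small Python sketch. The experience figures are invented and the function name is an assumption for illustration.

```python
def within_fitted_range(value, training_values):
    """True if `value` lies inside the range of data used to fit the model."""
    return min(training_values) <= value <= max(training_values)

# Hypothetical years-of-experience column used to fit a salary model.
experience_years = [1, 3, 5, 8, 12, 20, 25]

print(within_fitted_range(10, experience_years))  # inside the fitted range
print(within_fitted_range(95, experience_years))  # far outside: don't predict
```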
8. Time Series Introduction and Smoothing Methods
A time series is a set of observations on the same variable measured at consistent
intervals of time. The variable of interest might be monthly sales volume, website
hits or any other data of relevance to the business. Using time series analysis we
can detect patterns and trends and, just possibly, make forecasts. Forecasting is
tricky because (of course!) we only have historical data to go on and there is no
assurance that the same pattern will repeat itself. Despite all this, time series
analysis has developed into a huge topic, fortunately with a large number of freely
available data-sets to use. Links to some of these data-sets are provided at the end of
the data chapter (Chapter 4).
There are two basic approaches: the smoothing approach and the regression
method. Both methods attempt to eliminate the background noise. This chapter
covers smoothing methods, the following chapter the regression approach.
8.1 Layout of the chapter
Here we will
1. identify the four components that make up a time series
2. introduce the naive, moving average and exponentially
weighted smoothing methods to make a forecast
3. discuss ways in which the accuracy of the forecasts can be calculated
8.2 Time Series Components
There are four potential components of a time series. One or more of the
components may co-exist. The components are: trend component; seasonal
component; cyclical component; random component.
• trend component: the overall pattern apparent in a time series. The plot
below shows French military expenditure as a percentage of GDP. The
downward trend is apparent.
French military expenditure as a percentage of GDP
• seasonal component: sales of skis or Christmas decorations are clearly
higher in the winter months, while ice cream and suntan cream move more
quickly in the summer. If we know the seasonal component, then we can
use this information for prediction. The time between the peaks of a
seasonal component is known as the period. Note here that season doesn’t
necessarily mean the season of the year.
The plot below shows sales of TV sets by quarter. Here there is a constant upward
trend because more people are buying TV sets, but there is also a seasonal effect.
TV Sales by quarter
• cyclical component: some time series have periods lasting longer than
one year. Business cycles such as the 50-year Kondratieff Wave are an
example. These are almost impossible to model, but that doesn’t stop the
hopeful from trying. Kondratieff himself ended up a victim of Joseph Stalin
because his suggestion that capitalist economies went in waves undermined
Stalin’s view that capitalism was doomed, and he was executed in 1938.
• random component: the random component is just that: the noise or bits
left over after we have accounted for everything else. The tell-tale signs of a
random component are: a constant mean; no systematic pattern of
observations; constant level of variation.
8.3 Which method to use?
In the next two chapters, we’ll work through the two basic approaches: the
smoothing approach and the regression approach. How to decide which to use?
The first step, as usual, is to make a simple scatter plot of your data, with time on the
horizontal x axis and the variable of interest on the vertical y axis. If the result is
something resembling a straight line, for example the French military expenditure,
then use a smoothing approach. If there is evidence of some seasonality, such as with
the TV sales data, then regression is the way to go. This is especially the case if you
want to calculate the size of the seasonal effect.
8.4 Naive forecasting and measuring error
Forecasting is just that: an educated guess about what might happen at some point in
the future. Forecasts are based on historical events, and we are hoping that some
pattern of behavior will repeat itself which will make the forecast ‘true’. The best
that we can do is to try out different forecasting methods and work out how well
they predicted data which we already knew about. Then we take a leap into the dark
and hope that the best of those methods will do a good job on data that we don’t
know about. This section concerns the measurement of forecasting accuracy.
We’re going to construct a naive forecast and use that forecast to demonstrate two
common methods of accuracy checking: the Mean Squared Error (MSE) and the
Mean Absolute Deviation (MAD) approaches.
A naive forecast assumes that the value of the variable at t+1 will be the same as at t.
The values are just carried forward by one time period. Here is an example:
Gas prices with naive forecast
The naive approach is sometimes surprisingly effective, perhaps because of its
simplicity. In general simple models perform well, perhaps because of their lack of
assumptions about the future. Now we will measure the accuracy of the Naive
Forecast with MAD and MSE. In both methods the error is calculated by subtracting
the forecast or predicted value from the observed value. The methods differ in what
is done with those errors.
MAD measures the absolute size of the errors, sums them and then divides by the
number of forecasts. The absolute value of the error is just its size, without the sign.
In Excel, you can find an absolute value with =ABS(F1 - F3) where F1 is the
observed value and F3 the predicted value. Use the little corner of the formula box to
drag it down. Then find the sum and divide by n, which is the number of
observations. In math this is
$$\mathrm{MAD} = \frac{\sum |\text{error}|}{n}$$
In MSE, the error is squared before being summed and divided.
Squaring the error removes the problem of the negative numbers, but creates
another one: large errors are given more weighting because of the squaring. This can
distort the accuracy of the results. The maths for the MSE is below
$$\mathrm{MSE} = \frac{\sum (\text{error})^2}{n}$$
The plot below shows the errors associated with the naive forecasting, the absolute
values of the errors and the MAD. In Excel, use
=ABS() to convert to an absolute value.
Errors and Mean Absolute Deviation
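The whole of this section fits in a few lines of code. The sketch below is an illustration in Python rather than Excel, with invented gas prices and hypothetical function names. It carries each observation forward one period as the naive forecast, takes error = observed minus forecast, and averages the absolute and squared errors over the periods that have a forecast.

```python
def naive_forecast(observed):
    """Forecast for period t+1 is the observation at t; the first period has none."""
    return [None] + observed[:-1]

def forecast_errors(observed, forecast):
    """Observed minus forecast, skipping periods without a forecast."""
    return [o - f for o, f in zip(observed, forecast) if f is not None]

def mad(errors):
    """Mean Absolute Deviation: average size of the errors, ignoring sign."""
    return sum(abs(e) for e in errors) / len(errors)

def mse(errors):
    """Mean Squared Error: average of the squared errors."""
    return sum(e ** 2 for e in errors) / len(errors)

# Hypothetical gas prices over five periods.
prices = [2.0, 2.5, 2.0, 3.0, 2.5]
errors = forecast_errors(prices, naive_forecast(prices))

print(errors)       # [0.5, -0.5, 1.0, -0.5]
print(mad(errors))  # 0.625
print(mse(errors))  # 0.4375
```

Note how the single error of 1.0 contributes more than half of the MSE total but only 40% of the MAD total: that is the extra weighting on large errors mentioned above.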
8.5 Moving averages
Slightly more complex than the naive approach is the moving average approach,
which relies on the fact that the mean has less variance than individual
observations. By focusing on the mean, we can see the general trend, less distracted
by noise.
The analyst decides how far back in time the average will go. The longer back in
time, the more values of the observation are averaged, resulting in a flatter,
more smoothed result. This is good for observing long-term trends, but less
satisfactory if you are interested in more recent history.
The number of observations which are to be averaged is known as k, and this
number is chosen by the analyst based on experience. In Excel’s Data Analysis
ToolPak there is a Moving Average tool which can do all the work. Positioning the
output is slightly tricky. Because k time periods are used to calculate the first
forecast, the first forecast belongs at period k+1. Where exactly to put the first cell
of the output is a bit fiddly: if there are k time periods being used, place the first
cell at k-1. Excel provides a chart output, example below. There is a YouTube here¹
¹http://youtu.be/zrioQOWfxjY
Moving average with k=3 for gas prices
The moving average technique is used by investors to detect the ‘golden cross’ and
the ‘death cross’. They examine 50 day and 200 day moving averages. When the 50
day moving average crosses above the 200 day moving average, this point is the
so-called golden cross and a signal to buy. By contrast, when the opposite happens
this is the ‘death cross’: time to get out!
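To make the "first forecast at period k+1" point concrete, here is a minimal moving average forecast in Python. It is an illustration only, not a description of how Excel's tool lays out its output; the prices and the function name are made up.

```python
def moving_average_forecast(observed, k):
    """Forecast each period as the mean of the k periods immediately before it."""
    forecasts = [None] * k          # the first k periods have no forecast
    for t in range(k, len(observed)):
        window = observed[t - k:t]  # the k most recent past observations
        forecasts.append(sum(window) / k)
    return forecasts

# Hypothetical gas prices for six periods.
prices = [3, 6, 9, 6, 3, 6]
forecast = moving_average_forecast(prices, k=3)
print(forecast)  # [None, None, None, 6.0, 7.0, 6.0]
```

With k=3 the first forecast appears in the fourth position, that is, at period k+1. A larger k gives a flatter series but leaves more empty periods at the start.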
8.6 Exponentially weighted moving averages
The exponential smoothing method (EWMA) is a further step forward from the
naive and the moving average approaches. The moving average forgets data older
than the k time periods specified, while the EWMA incorporates both the most
recent and more historical observations to construct a forecast. In the Glossary you’ll
find more detailed math showing how the EWMA brings forward all past history.
In general the further back in time you get, the less influence the observations have.
Using EWMA we choose a smoothing constant, alpha, which sets the weight given
to the most recent observation against previous forecasts. If we thought that the
forecast (t+1) would be similar to the current observation (t=0) and we thought that
older forecasts were of little value, then we would choose a high value of alpha.
Alpha runs from 0 to 1. The EWMA formula is
$$F_{t+1} = \alpha Y_t + (1 - \alpha) F_t$$
where F is the forecast, and Y is the actual value of the variable. Alpha is the
smoothing constant, ranging in value between 0 and 1, and chosen by the analyst.
In words, this equation means that the forecast value (at t+1) is the smoothing
constant alpha multiplying the actual value at t=0 plus (1-alpha) times the forecast
at t=0. So if the previous forecast was totally correct, the error would be zero and the
forecast would be exactly what we have today. It turns out that the exponential
smoothing forecast for any period is constructed from a weighted average of all the
previous actual values of the time series.
The equation above shows that the size of alpha, the smoothing constant, controls
the balance between weighting given to the most recent observation and previous
observations. If alpha is small, then the amount of weight given to Y at t=0 is
small and the weight given to previous observations is large, and vice-versa. If alpha
= 1, then previous observations are given no weight at all, and we assume that the
next value will be the same as the current one. This achieves the same result as naive
forecasting discussed above. Typically quite small values of alpha are used, such as 0.1.
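The recursion in the formula is short enough to write out directly. The sketch below is a Python illustration with an invented series; seeding the first forecast with the first observation is an assumption made here to get the recursion started, and other starting values are possible.

```python
def exponential_smoothing(observed, alpha):
    """Forecasts from F[t+1] = alpha * Y[t] + (1 - alpha) * F[t]."""
    # Assumption: the first forecast is seeded with the first observation.
    forecasts = [observed[0]]
    for y in observed[:-1]:
        forecasts.append(alpha * y + (1 - alpha) * forecasts[-1])
    return forecasts

series = [10, 12, 11, 15]  # hypothetical observations

print(exponential_smoothing(series, alpha=0.5))  # [10, 10.0, 11.0, 11.0]
print(exponential_smoothing(series, alpha=1))    # [10, 10, 12, 11]
```

With alpha = 1 each forecast after the seed is simply the previous observation, which is the naive forecast. With a smaller alpha the forecasts react more slowly to each new observation.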
Application of Excel’s exponential smoothing tool
An exponential smoothing tool is available in the Analysis ToolPak. We’ll use
Canadian Gross Domestic Product per capita as an example. Here is the data
plotted on its own. YouTube²
²http://youtu.be/w-GmxjrX1qg
Canadian GDP Per Capita
There is a steady upwards trend, but with a dip in 2009 due to the world financial
crisis. Excel asks for a damping factor. This is 1-alpha. I have done the forecasts
twice, once with an alpha of 0.1 and again with an alpha of 0.9. Therefore the
damping factors were 0.9 and 0.1. The result for alpha=0.9 is shown below.
Forecast with alpha=0.9
This is a pretty good forecast, with the forecast tracking the actual observations
closely.
The plot below shows sheep numbers as counts by head in Canada from 1961 to
2006. First, let’s plot this using 0.3 as alpha (arbitrarily chosen). The plot is below,
with a damping factor of 1-0.3=0.7.
Sheep numbers with alpha=0.3
9. Time Series Regression Methods
The smoothing methods we worked through in the previous chapter are fine when
there is no evidence of seasonality, for example with the sheep data. However,
smoothing has only limited use for longer term prediction and analysis. Data that is
more interesting for us as analysts may be non-linear in trend, and also have seasonal
peaks and troughs. Using regression, we can measure the quantitative effect of
seasonality and trend, either together or separately. The result is a model which can
be used for prediction.
Below we will:
1. use regression to quantify a trend in a time series
2. introduce a quadratic term to account for non-linearity in the time series
3. use dummy variables to measure seasonality
9.1 Quantifying a linear trend in a time series using regression
The plot below shows life expectancy at birth for Canadians.
Canadian life expectancy
This is pretty much a straight line, as one might expect in a developed country
with a large per capita expenditure on healthcare. Note that this is for both sexes:
for women only we might expect to see even better figures. I ran the regression with
Year as the independent variable, explaining life expectancy. The result is below.
Life expectancy regression results
Notice that the adjusted R-squared value is close to unity, reflecting
the straight line that we see on the graph. The coefficients for the intercept and the
independent variable provide this estimated regression equation
$$\hat{y}_t = 58.47 + 0.000565 \times \text{Year}$$
The meaning is that for every year after 1960 that a child was born, his/her life
expectancy increased by 0.000565 years, or about 5 hours.
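For readers who want to see the mechanics, the following Python sketch fits a straight-line trend by least squares. The life expectancy figures are invented and exactly linear so the result is easy to check by eye; they are not the Canadian data used above, and `fit_trend` is a hypothetical helper. Time is measured in whole years since a base year, so the slope reads as "years of life expectancy per calendar year". The units of any trend coefficient follow the units of the time column.

```python
def fit_trend(years, values, base_year):
    """Least-squares intercept and slope with time measured as years since base_year."""
    t = [year - base_year for year in years]
    n = len(t)
    mean_t = sum(t) / n
    mean_v = sum(values) / n
    slope = (sum((a - mean_t) * (b - mean_v) for a, b in zip(t, values))
             / sum((a - mean_t) ** 2 for a in t))
    intercept = mean_v - slope * mean_t
    return intercept, slope

# Made-up, exactly linear data for illustration only.
years = [1960, 1970, 1980, 1990, 2000, 2010]
life_expectancy = [71.0, 73.0, 75.0, 77.0, 79.0, 81.0]

intercept, slope = fit_trend(years, life_expectancy, base_year=1960)
forecast_2005 = intercept + slope * (2005 - 1960)
print(round(intercept, 2), round(slope, 3), round(forecast_2005, 2))
```

Here the intercept is the fitted value in the base year and the slope is the gain per year.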
9.2 Measuring seasonality
The data set for TV sales (from Modern Business Statistics by Anderson, Sweeney
and Williams) contains sales by quarter for four years. There are therefore sixteen
observations.
I created a column which I called index so that we can see sales in a consecutive
fashion. The first quarter is Spring and the last quarter is Winter. It is apparent that
quarters which are divisible by four are higher than others, and so forth. Perhaps sales
are higher in winter?
We can use the dummy variable method of Chapter 6 to determine whether this is
true and also the extent of the difference. We will create dummies for Summer, Fall
and Winter, leaving Spring as the reference level. Recall that we have four possible
states of the season variable, and so we will need k-1, or 4-1=3 dummies. It usually
doesn’t matter which state you choose as your reference level: just don’t forget
which one you picked. The dataset is below, with a column called Index, which
we’ll use to measure the time trend.
TV Sales with the quarterly dummies
I want to find out two things: whether sales are increasing over time, and whether
they differ by season. I can test the first by including the index as an
explanatory variable, and the second with the dummies. Because the regression tool
requires that all the independent variables be in one block, I have copied and pasted
the Index column to the right.
The regression output is
TV Sales regression output
Let’s walk through this output line by line. First notice that the p values for all
independent variables are smaller than 0.05. Therefore all are significant.
The reference level is Spring, and therefore there is no coefficient for this quarter.
Summer has a negative sign in front of its coefficient, meaning that sales for
Summer are smaller than those for Spring. Fall is positive, so more TVs are
sold in the Fall than in the Spring. Not unexpectedly, Winter has the largest
coefficient of all, reflecting the plot. The index has a positive sign, meaning that
average TV sales are increasing with time.
There are therefore two components in this series: a trend and a seasonal
component. YouTube¹
Making predictions from these results. Let’s predict the sales for Fall in seven
quarters’ time. This is
$$\hat{y} = 4.7 + 1.05875 + 7 \times 0.147 \approx 6.79$$
The observed value was 6.8, so the prediction was reasonable if slightly pessimistic.
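The prediction arithmetic can be wrapped in a small function that builds the dummies for a given season and applies the coefficients. This is a sketch in Python with hypothetical function names. Only the intercept, the Fall coefficient and the index coefficient are quoted in the text, so only those are used; the Summer and Winter coefficients are left out, which is harmless for a Fall prediction because their dummies are zero.

```python
SEASONS = ["Spring", "Summer", "Fall", "Winter"]

def season_dummies(season, reference="Spring"):
    """k-1 dummy variables: 1 for the matching season, 0 otherwise."""
    return {s: int(s == season) for s in SEASONS if s != reference}

def predict_sales(intercept, season_coefs, index_coef, season, index):
    """Intercept + seasonal effect + trend, for one quarter."""
    dummies = season_dummies(season)
    # Coefficients missing from season_coefs are treated as 0.0 (an assumption
    # made here only because the text does not quote them).
    seasonal = sum(season_coefs.get(s, 0.0) * d for s, d in dummies.items())
    return intercept + seasonal + index_coef * index

fall_q7 = predict_sales(
    intercept=4.7,
    season_coefs={"Fall": 1.05875},
    index_coef=0.147,
    season="Fall",
    index=7,
)
print(round(fall_q7, 5))  # 6.78775
```

With the coefficients as printed the prediction comes out at 6.78775, just under the observed 6.8.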
9.3 Curvilinear data
The data above followed a linear trend, which is perfect for ordinary least squares,
which assumes that the relationship between the dependent variable and the
independent variable is linear. But some interesting data does not follow a neat
linear trend. The plot below shows crude oil and gas prices over the period 1949 to
2003.
¹http://youtu.be/wyKIHInMbY8
Crude and gas prices showing crude price curvilinearity
However, if we show the two series on the same graph, as on the plot I did in Tableau,
with separate y axes for each type of fuel, we can see that the shapes are very close.
Crude and gas on different axes
The curvilinearity of crude prices is clear. The prices came to a peak
in the 1970s as a result of OPEC’s decision to restrict supply. In any event, crude
prices cannot be described as linear. The solution is to add an extra term to the
regression of price on time, and that is time squared. The first few lines of the
dataset are below. I have added a column called t to represent the year, and added tsq
which is just t squared.
Data with time squared
Now regress crude against time and time squared, and the pleasing result below
appears
The curvilinear regression for crude
The adjusted r-squared is high at 0.88 and the two time predictors are highly
significant statistically. Notice that the two predictors are quite different in size and
also have opposite signs. When t is small, then the variable t dominates and the trend
is upward. However, as t gets larger, then t-squared gets larger still. The
negative sign on t-squared pulls the regression line down.
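The tug-of-war between the two terms can be made explicit with a little calculus. Writing the fitted curve with generic coefficients (the symbols b0, b1 and b2 are placeholders, with b1 positive and b2 negative as in the output described above), the point at which the downward pull of t-squared overtakes the upward push of t is where the slope of the curve is zero:

```latex
\hat{y}_t = b_0 + b_1 t + b_2 t^2,
\qquad
\frac{d\hat{y}_t}{dt} = b_1 + 2 b_2 t = 0
\;\Longrightarrow\;
t^{*} = -\frac{b_1}{2 b_2}
```

Before t* the fitted line rises, after it the line falls. Because b2 is negative, t* is positive, and the smaller b2 is in size relative to b1 the later the peak arrives.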
10. Optimization
Linear Programming (LP) is the tool we use to optimize a particular objective
function. For example, a manufacturer of carpets wants to get the most profit out
of his raw materials and labor force. The objective function is the relationship
between the inputs and his costs and thus his profits. Optimization helps the
manufacturer find the most efficient allocation of resources. The objective function
doesn’t necessarily have to be in terms of money; it could very well be to allocate
working hours efficiently so that everyone has more time off. Frequently we want
to maximize the objective function (perhaps make more profit) but we might also
want to minimize an objective function, such as costs.
Optimization is not particularly difficult, and we will use the Solver add-in to Excel
to do the mathematically tricky parts. What is important is the writing up of a
correct model in the first place.
Layout of the chapter
Linear programming is not especially difficult, but the work has to be done in a
logical and orderly fashion. In this chapter, we will
• spend some time working through the basic steps in setting up an optimization
problem
• work through the meaning of the Solver output in terms of what the
coefficient values actually mean
10.1 How linear programming works
Linear programming works by solving a set of simultaneous equations. The
problem to be solved (to maximize a stock return, for example) is written as a set
of simultaneous equations. The equations may be
quite simple, but there may be many of them. Linear Programming finds the best
solution to the equations. The job is to find the coefficients for the variables in the
equation which maximize or minimize the outcome.
We’ll work through an example first on paper, and then put it into Excel’s Solver
add-in to get the solution. The three important steps are:
• Write the simultaneous equations, basically putting into math the wording
of the problem. This is probably the toughest part.
• Run the Optimization in Solver, to find the optimal solution.
• Sensitivity analysis to find out how much effect a unit change in a
constraint would have.
10.2 Setting up an optimization problem
Optimization problems are usually presented as written questions. The best approach
is to determine which are the
• decision variables: those variables which you can control. For example, in
the carpet factory example, the number of hours given to each worker is a
decision variable.
It is usually the size of the decision variable that we are trying to optimize. For the
workers, we would want them to have exactly the number of hours which produces
the most profitable output. Too few hours and the factory is working below
capacity. On the other hand, too many and money is being wasted. The decision
variables go into what Solver calls the changing cells. The convention is to
color-code these cells in Excel in red.
• constraints: constants or givens which we cannot change. The
jurisdiction where the factory is located might have legislation restricting
the maximum number of hours that an employee can work per day. We
would need to add a constraining equation to take account of this fact, even
if the solution was financially sub-optimal.
10.3 Example of model development
A farmer has 50 acres of land at his disposal. He also has up to 150 hours of
labor and up to 200 tonnes of fertilizer. He can plant either cotton or corn or a
mixture. Cotton produces a profit of $400 per acre, corn $200 per acre. Each acre of
cotton requires 5 labor hours and 6 tonnes of fertilizer. For corn the equivalent is 3
and 2. How much land should the farmer allocate to each crop?
Let’s call acres of cotton x and acres of corn y, with F for fertilizer and L for labor. These
four variables are his decision variables because he can allocate crops to land and he
can also decide how much labor and fertilizer to deploy. He is searching for the
combination of the four decision variables which maximizes his profit.
We know he has 50 acres, so the total acreage cannot exceed this amount. This is a
constraint, which can be written as an equation, as shown below. Other constraints
are the maximum labor and fertilizer. Obviously he cannot use more than he has.
10.4 Writing the constraint equations
The total acreage devoted to the two crops cannot exceed 50:
x + y ≤ 50
The labor and fertilizer are also constraints. Refer back to the question to get the
amounts of labor and fertilizer per acre per crop.
For labor, the constraint equation is:
5x + 3y ≤ 150
This equation comes about because for each acre of cotton, the farmer needs 5
hours of labor, while for corn it is 3. We do not know the values of x and y, but
whatever they are, we know that 5x plus 3y must be less than or equal to 150,
because we only have 150 hours available.
For fertilizer, the constraint equation is:
6x + 2y ≤ 200
10.5 Writing the objective function
What do we want to get out of this: what is the objective? Clearly it is the most
efficient mixture of inputs, subject to constraints, producing the largest profit. The
objective function is:
Max Profit = 400x + 200y
In words, find the combination of x and y that maximizes the profit, subject to the
constraints we have written. The next task is to put all this into Solver.
10.6 Optimization in Excel (with the Solver add-in)
We will use an Excel spreadsheet and Solver to achieve all three steps.
Optimization YouTube for the farmer problem¹
¹https://www.youtube.com/watch?v=WeTgK6wmvSY
Open an Excel spreadsheet so that you can follow along.
Cell coloring conventions
I suggest you follow these conventions for color-coding the cells:
• Input cells. These contain all the numeric data given in the statement of the
problem. Color input cells in BLUE.
• Changing cells. The values in these cells change to optimize the
objective. Code changing cells in RED.
• Objective cell. One cell contains the value of the objective. Color the
objective cell in GREY.
Now I’ll work through the farming problem above step by step.
Make sure you have the Solver add-in loaded. To check, open Excel, then click on
the Data tab. If Solver is loaded, you will see the Solver name to the right.
The Solver tab
Below is the Excel spreadsheet with the information that we know already typed in.
I have put random values in the red-colored changing cells, just as placeholders.
These numbers will change when Excel solves for the most profitable allocation.
• Type the address of the objective cell into the Solver dialogue box, and make
sure the Max radio button is selected.
• Type in the range of the changing cells.
• Work through the constraints. There are three: labor, fertilizer, and land.
• Select Simplex LP as the solving method.
• Press Solve.
The farming solution
Solver will change the values in the changing cells to maximize the objective cell. I
found that 30 acres of cotton and none of corn provided a profit of $12,000.
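As a cross-check on the Solver result, here is a minimal Python sketch that solves the same farmer problem by brute force. It relies on the standard fact that the optimum of a linear program lies at a corner of the feasible region, so it intersects every pair of constraint boundaries and keeps the best feasible point. The function and variable names are my own, and this is not a description of how Solver’s Simplex LP engine works internally.

```python
from itertools import combinations

# Each constraint is (a, b, c), meaning a*x + b*y <= c.
# The last two rows encode x >= 0 and y >= 0 as -x <= 0 and -y <= 0.
CONSTRAINTS = [
    (1, 1, 50),    # land: x + y <= 50
    (5, 3, 150),   # labor: 5x + 3y <= 150
    (6, 2, 200),   # fertilizer: 6x + 2y <= 200
    (-1, 0, 0),    # x >= 0
    (0, -1, 0),    # y >= 0
]


def profit(x, y):
    """Objective from the text: $400 per acre of cotton, $200 per acre of corn."""
    return 400 * x + 200 * y


def feasible(x, y, tol=1e-9):
    """True if the point satisfies every constraint, within a small tolerance."""
    return all(a * x + b * y <= c + tol for a, b, c in CONSTRAINTS)


def corner_points():
    """Intersect every pair of constraint boundaries; keep the feasible points."""
    points = []
    for (a1, b1, c1), (a2, b2, c2) in combinations(CONSTRAINTS, 2):
        det = a1 * b2 - a2 * b1
        if det == 0:  # parallel boundaries never cross
            continue
        x = (c1 * b2 - c2 * b1) / det
        y = (a1 * c2 - a2 * c1) / det
        if feasible(x, y):
            points.append((x + 0.0, y + 0.0))  # adding 0.0 turns -0.0 into 0.0
    return points


best = max(corner_points(), key=lambda point: profit(*point))
print("acres of cotton, corn:", best, "profit:", profit(*best))
```

Under these assumptions the best corner is 30 acres of cotton and no corn, matching the Solver answer above.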
10.7 Sensitivity analysis
Constraints can be either binding or slack. If a constraint is binding, that means that
all of that particular resource is being used, and more could be employed if it should
become available. We can find out whether constraints are binding and also their
shadow price by pressing Sensitivity Analysis after running Solver again. The sensitivity
analysis will appear as a tab at the bottom of your worksheet. For the farming
problem, it looks like this:
Farming sensitivity report
Let’s look at the Constraints section. Labor uses 150 hours and has a shadow price of
$80. The shadow price indicates the per-unit value of the constrained commodity if
the constraint was increased by one unit. If we could have one more hour of labor,
then the profit would increase by $80. Land and fertilizer have shadow prices of
zero because their constraints are slack, or non-binding. We are not completely using
these resources and so we do not need any extra inputs. There are sacks of fertilizer
lying around unused. No need to buy more.
The columns Allowable Increase and Allowable Decrease are relevant because
they tell us how much more of a binding constraint we could use before running the
model again. For labor, the amount is 16.67 (rounded). If we eased the constraint by
more than this amount then we would need to run the new model in Solver again.
The same logic applies to a decrease.
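One way to see what a shadow price means is to re-solve the farmer problem with one extra hour of labor and compare profits. The sketch below does that in Python with a small corner-point search. The function name `max_profit` is hypothetical, and the search is an illustration, not Solver’s own method.

```python
from itertools import combinations


def max_profit(labor_hours):
    """Best profit for the farmer problem with a given labor limit."""
    constraints = [
        (1, 1, 50),            # land: x + y <= 50
        (5, 3, labor_hours),   # labor: 5x + 3y <= labor_hours
        (6, 2, 200),           # fertilizer: 6x + 2y <= 200
        (-1, 0, 0),            # x >= 0
        (0, -1, 0),            # y >= 0
    ]
    best = 0.0  # planting nothing is always feasible and earns nothing
    for (a1, b1, c1), (a2, b2, c2) in combinations(constraints, 2):
        det = a1 * b2 - a2 * b1
        if det == 0:  # parallel boundaries never cross
            continue
        x = (c1 * b2 - c2 * b1) / det
        y = (a1 * c2 - a2 * c1) / det
        if all(a * x + b * y <= c + 1e-9 for a, b, c in constraints):
            best = max(best, 400 * x + 200 * y)
    return best


base = max_profit(150)
shadow_price = max_profit(151) - base           # value of the 151st labor hour
later_gain = max_profit(168) - max_profit(167)  # value of an hour past the allowable increase

print(f"base profit {base:.2f}")
print(f"one more labor hour adds {shadow_price:.2f}")
print(f"an hour beyond 166.67 adds {later_gain:.2f}")
```

In this sketch the 151st hour is worth about $80, in line with the report. Once labor passes roughly 166.67 hours the fertilizer constraint starts to bind and an extra hour is worth about $50 instead, which is why the model has to be re-run beyond the allowable increase.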
10.8 Infeasibility and Unboundedness
Solver is quite robust, but two problems may occur. A solution to an optimization
problem is feasible if it satisfies all the constraints. But it is possible for no feasible
solution to exist. This occurs if you make a mistake in writing the model, or the
model is too tightly constrained.
Unboundedness
Unboundedness occurs when you have missed out a constraint. There is no
maximum (or minimum). To see this happen, try changing all constraints to >= instead of <=.
10.9 Worked examples
The demanding mother
Most mothers are keen to keep contact with their children. One particular mother is
rather demanding. She requires at least 500 minutes per week of contact with you.
This can be through telephone, visiting or letter-writing (yes! Some people still do
that!). Her weekly minimum for phone is 200 minutes, visiting 40 minutes and letters
200 minutes. You assign a cost to these activities. For phone: $5 per minute; visiting
$10 per minute; letter-writing $20 per minute. How do you allocate your time so
that your cost is the minimum?
Write the equations first. What is the objective?
Let x stand for minutes of phone; y minutes of visiting; and z minutes of letters.
You want to minimize the cost of these activities, so the objective is to minimize
5x + 10y + 20z. The constraints are: x + y + z must be more than or equal to 500,
with x >= 200, y >= 40 and z >= 200.
Spreadsheet for demanding mother
Notice that we want to minimize the cost, so change the radio button to Min rather
than Max. And when you are typing in the constraints, check that the sign of the
inequality is the right way round.
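This particular problem is simple enough to check by hand, and the short Python sketch below mirrors that reasoning. It takes the per-minute costs ($5, $10, $20) and the weekly minimums from the problem statement, meets every minimum first, and then puts any minutes still missing into the cheapest activity. That shortcut is my own and only works because the single linking constraint is the 500-minute total. The dictionary names are hypothetical.

```python
# Per-minute cost and weekly minimum minutes, taken from the problem statement.
COST = {"phone": 5, "visiting": 10, "letters": 20}
MINIMUM = {"phone": 200, "visiting": 40, "letters": 200}
TOTAL_REQUIRED = 500  # at least 500 minutes of contact per week

# Step 1: meet every individual minimum.
plan = dict(MINIMUM)

# Step 2: any minutes still missing go to the cheapest activity.
shortfall = max(0, TOTAL_REQUIRED - sum(plan.values()))
cheapest = min(COST, key=COST.get)
plan[cheapest] += shortfall

total_cost = sum(COST[activity] * plan[activity] for activity in plan)
print(plan, "weekly cost:", total_cost)
```

The minimums add up to 440 minutes, so the remaining 60 minutes go to the phone, giving 260 phone, 40 visiting and 200 letter minutes at a weekly cost of $5,700.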
The theater manager
You are the manager of a theater which is in financial trouble. You have to optimize
the combination of plays that you will put on to make the most profit. You have five
plays in your repertoire: A, B, C, D and E. They have different draw points (a measure
of how attractive the play is to the general public) and therefore ticket prices to
match. The data looks like this:
Play   Draw   Ticket
A      2      20
B      3      20
C      1      20
D      5      35
E      7      40
You have these constraints: you have only 40 possible slots. And no play can be put
on less than twice or more than ten times (gives the actors a reasonable turn). Each
performance costs $10 per ticket sold, regardless of the ticket price (covers the
electricity etc). The total draw points have to be at least 160 to keep the critics happy.
How do you distribute your performances? Which combination of plays produces
the highest profit and satisfies the constraints?
We are trying to maximize profit, so look for an objective function that does just
that: 20A + 20B + 20C + 35D + 40E. But wait: we have the $10 cost. So better
take that out first, leaving 10A + 10B + 10C + 25D + 30E.
With these constraints:
The draw points (the attractiveness of the play to critics): 2A + 3B + 1C + 5D + 7E >= 160
and no play can be put on less than twice or more than ten times:
A >= 2 and A <= 10
B >= 2 and B <= 10
C >= 2 and C <= 10
D >= 2 and D <= 10
E >= 2 and E <= 10
and we have only 40 slots, so A + B + C + D + E <= 40
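Because each play can only run between 2 and 10 times, there are just 9^5 = 59,049 possible schedules, few enough to check every one. The Python sketch below does exactly that as a sanity check on the model. It assumes whole numbers of performances and uses the ticket price minus the $10 cost as the profit per performance, as in the objective above. Variable names are hypothetical.

```python
from itertools import product

DRAW = {"A": 2, "B": 3, "C": 1, "D": 5, "E": 7}
TICKET = {"A": 20, "B": 20, "C": 20, "D": 35, "E": 40}
COST_PER_TICKET = 10
PLAYS = list(DRAW)

best_profit = None
best_plans = []
# Every play runs between 2 and 10 times (range(2, 11) covers 2..10).
for counts in product(range(2, 11), repeat=len(PLAYS)):
    plan = dict(zip(PLAYS, counts))
    if sum(counts) > 40:                                  # only 40 slots
        continue
    if sum(DRAW[p] * plan[p] for p in PLAYS) < 160:       # draw points >= 160
        continue
    value = sum((TICKET[p] - COST_PER_TICKET) * plan[p] for p in PLAYS)
    if best_profit is None or value > best_profit:
        best_profit, best_plans = value, [plan]
    elif value == best_profit:
        best_plans.append(plan)                           # ties are kept

print("best objective value:", best_profit)
print("number of tied schedules:", len(best_plans))
print("one of them:", best_plans[0])
```

Several schedules tie for the best value, so Solver may report any one of them. All of the tied schedules run D and E the maximum ten times and share the remaining 20 slots among A, B and C.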
Spreadsheet for theatre problem
More worked examples
19th century farmer
I am a 19th century English farmer. I can grow wheat or barley. Wheat yields 10
bushels an acre, barley 8 bushels an acre. The price of wheat is ten shillings a bushel,
barley 5 shillings a bushel. The labour costs for wheat are 3 shillings an acre, for
barley 2 shillings an acre. The transportation cost to market for wheat is 1 shilling a
bushel, for barley ½ shilling a bushel. I have 100 acres. (A bushel is a measure of
volume; an acre is a measure of area; a shilling is a currency unit.)
Questions: How do I split up my land?
If I could get one more acre of land, how much would that be worth to me?
Yet another farming question (I use these because most people can imagine fields of
land, crops growing and the like. But of course the same techniques are applicable
to other business situations).
A farmer has 100 acres of land. He can plant crops or raise sheep. Each acre of
crops provides $100 but requires labor of $50 per acre. He can spend a maximum of
$500 on crop labor. Each acre of sheep provides $40 but needs only $10 in labor
charges. The labor budget for sheep is unlimited.
Worked solution
Call the area in crops X and in sheep Y. Then X + Y <= 100 and 50X <= 500. The
objective is 100X - 50X + 40Y - 10Y, which simplifies to 50X + 30Y. My Excel spreadsheet
is below.
Spreadsheet for the above problem
11. More complex optimization
Optimization using Solver is a powerful method of solving common business
resource-allocation problems. In the previous chapter the problems were relatively
simple, concerning the allocation of land or time. But linear programming can be
used to solve more complex and worthwhile problems, as we’ll see below. The key
requirement is that the analyst is able to define the problem as a set of equations.
There is no one method except for thinking carefully and writing out the problem
as a set of equations.
To demonstrate, we’ll work through three different types of problem:
* problems concerning proportionality: you need to allocate money to different
investments while minimizing risk and keeping returns above a certain amount.
What proportion do you put in each investment?
* supply chain problems
* blending problems where you need to mix together inputs from different sources
11.1 Proportionality
Investment decisions example
This example shows how we can ‘weight’ the inputs according to some criteria, in
this case their risk. We want to minimize the risk but ensure that the return is above
some minimum level. What is the mixture or blend of investments that can do that?
The problem: you have an inheritance of $300,000 from an uncle but there are
some restrictions: you must invest all the money in four funds; your annual return
has to be at least 5%. And you must minimize your risk. The four funds your uncle
has specified are:
Fund   Risk   Return
X1     10.7   4.2
X2     5      4
X3     6      5.6
X4     6.2    4
Let’s deal with the objective function first. That is to minimize the risk. We can
weight the size of the investment in each fund with its risk index. So, weighting
each investment and then dividing by the total investment gives the amount of the
risk, which is exactly what we want to minimize. So we want to minimize R (for
Risk) like this:
$$R = \frac{10.7X_1 + 5X_2 + 6X_3 + 6.2X_4}{300{,}000}$$
If you don’t get this, think about what would happen if all the funds were as risky
as X1: the total risk would increase. Another way to think of this: if we decrease the
amount invested in X1 and instead increase the amount in X4, what will happen?
The risk will decrease.
The constraints: the total investment has to add up to $300,000, so this constraint is
$$X_1 + X_2 + X_3 + X_4 = 300{,}000$$
We also have to achieve a return of at least 5% (no easy matter these days). Weighting
each fund’s return by the amount invested, in the same way as the risk, this
constraint is
$$\frac{4.2X_1 + 4X_2 + 5.6X_3 + 4X_4}{300{,}000} \geq 5$$
Below is my spreadsheet from this problem:
Investment blending
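To make the weighting concrete, the Python sketch below evaluates the risk and return of one candidate allocation. The allocation is hypothetical, chosen only to illustrate the arithmetic, and is not claimed to be the Solver optimum. The sketch assumes that both risk and return are dollar-weighted averages, with return measured in percent. Function and variable names are my own.

```python
TOTAL = 300_000
RISK = {"X1": 10.7, "X2": 5.0, "X3": 6.0, "X4": 6.2}
RETURN_PCT = {"X1": 4.2, "X2": 4.0, "X3": 5.6, "X4": 4.0}


def weighted_average(weights, dollars):
    """Dollar-weighted average of a per-fund figure (risk index or % return)."""
    return sum(weights[fund] * dollars[fund] for fund in dollars) / sum(dollars.values())


# Hypothetical allocation for illustration only (not the optimal answer).
candidate = {"X1": 0, "X2": 100_000, "X3": 200_000, "X4": 0}

risk = weighted_average(RISK, candidate)
annual_return = weighted_average(RETURN_PCT, candidate)
fully_invested = sum(candidate.values()) == TOTAL
meets_return = annual_return >= 5

print(f"risk index {risk:.3f}, return {annual_return:.3f}%")
print("all money invested:", fully_invested, "| return at least 5%:", meets_return)
```

Solver’s job is to search over allocations like this one for the lowest risk among those that pass both checks.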
11.2 Supply chain problems
Working out how much to ship from production centers to demand locations is a
common problem in supply chain optimization. As I have been stressing, the key to
solving problems of this type is to write out the equations which define the model.
Here is an example:
You run a company which has bakeries in locations A and B. The bakeries ship
cartons of bread to your retail stores at locations X, Y, Z. The bakeries have
different capacities and each retail store has different demands. The cost of
delivery per carton from each bakery to each store is below, along with each
bakery’s capacity and each store’s demand.
Bakery   X     Y     Z     Capacity
A        12    13    11    150
B        9     17    17    200
Demand   50    100   90
Notation: a delivery from bakery A to location X is Ax and so forth.
The objective function is to minimize the delivery costs. So we want to minimize:
12Ax + 13Ay + 11Az + 9Bx + 17By + 17Bz
Each bakery has a fixed capacity, so Ax + Ay + Az <= 150 and Bx + By + Bz <= 200
Supply has to exactly match demand:
Ax + Bx = 50
Ay + By = 100
Az + Bz = 90
Notice the strict equality sign. My results are below:
Distribution problem
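The equality constraints mean that once bakery A’s shipments are chosen, bakery B’s are fixed (B ships whatever demand is left). That makes the problem small enough to check by trying every whole-carton shipment from A, as in the Python sketch below. Whole cartons are my assumption, the names are hypothetical, and this exhaustive search is a check on the model, not how Solver works.

```python
COST = {"Ax": 12, "Ay": 13, "Az": 11, "Bx": 9, "By": 17, "Bz": 17}
CAPACITY = {"A": 150, "B": 200}
DEMAND = {"X": 50, "Y": 100, "Z": 90}

best_cost = None
best_plan = None
for ax in range(DEMAND["X"] + 1):
    for ay in range(DEMAND["Y"] + 1):
        for az in range(DEMAND["Z"] + 1):
            # Bakery B supplies whatever demand bakery A leaves unmet.
            bx, by, bz = DEMAND["X"] - ax, DEMAND["Y"] - ay, DEMAND["Z"] - az
            if ax + ay + az > CAPACITY["A"] or bx + by + bz > CAPACITY["B"]:
                continue
            cost = (COST["Ax"] * ax + COST["Ay"] * ay + COST["Az"] * az
                    + COST["Bx"] * bx + COST["By"] * by + COST["Bz"] * bz)
            if best_cost is None or cost < best_cost:
                best_cost = cost
                best_plan = {"Ax": ax, "Ay": ay, "Az": az,
                             "Bx": bx, "By": by, "Bz": bz}

print("minimum delivery cost:", best_cost)
print("shipping plan:", best_plan)
```

Under these assumptions, bakery B serves store X and part of store Y, while bakery A uses its full capacity on stores Y and Z.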
11.3 Blending problems
Above we discussed linear programming models which were simple but effective.
There are other types of optimization model which are helpful, especially
blending models, which can also be solved by linear programming.
Blending models are used in situations where we have two or more inputs which
have to be mixed to some formula. Through optimization we can find the most
profitable mixture. Wine, metals, oil, sausages, recycled paper: this is a powerful
technique. You could probably use it for marketing campaigns. We’ll work through
an example which is for oil.
The oil blending problem
The problem: an oil company has 15000 barrels of Crude oil 1 and 20000 barrels of
Crude oil 2 on hand. The company sells gasoline and heating oil. These products
are made by blending together Crude oil 1 and Crude oil 2. Each barrel of Crude oil
1 has a quality level of 10, and each barrel of Crude oil 2 has a quality level of 5. The
gasoline that we produce must have a quality level of at least 8. The heating oil must
have a quality level of at least 6. Gasoline sells for $75 a barrel, heating oil for $60.
How can we blend the oils together in such a way that meets minimum quality
requirements and maximizes profit?
Oil blending solution
First, let’s think through what the decision variables (what goes into the changing
cells) might be. You might very well think (as I did first off) that the decision
variables would be the amounts of the two oils used and the amounts produced. But
this isn’t enough: we have to blend together the two types of oil. They have to be
mixed before they can be sold, and the mixture has to reach some minimum quality
standard. The company needs a blending plan.
The inputs:
* selling prices (here gasoline = $75, heating oil = $60)
* availability of oil from suppliers
* quality level of crude oils: Crude 1 is 10 and Crude 2 is 5
The constraints:
* Gasoline quality >= 8
* Heating oil quality >= 6
* Quantity of Crude 1 used <= 15000 (the amount on hand)
* Quantity of Crude 2 used <= 20000 (the amount on hand)
The blending plan:
Gasoline has to have a quality level of at least 8, and heating oil must have a
quality level of at least 6. The Crude oil 1 we
have on hand has a quality level of 10 while Crude oil 2 has a quality level of 5. We
want to blend these two crude oils to both achieve the minimum quality standards
and make the greatest profit.
Let’s attack the problem by creating total ‘quality points’ (QPs) which represent the
quality of oil in a barrel multiplied by the number of barrels of that oil.
Write equations to calculate the quality points:
Total QPs in the gasoline = 10 * amount of Oil 1 + 5 * amount of Oil 2
If, for example, we mixed together 50 barrels of Crude oil 1 and 40 barrels of Crude
oil 2, the total QPs would be 50 x 10 + 40 x 5 = 700. The average per barrel would be
700/90 = 7.78. This is too low for gasoline (needs 8) but acceptable for heating oil.
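The quality-point arithmetic is easy to script. The Python sketch below reproduces the 50-barrel and 40-barrel example and checks the blend against the two quality minimums. The function name `blend_quality` is my own.

```python
QUALITY = {"crude1": 10, "crude2": 5}   # quality level per barrel
MIN_QUALITY = {"gasoline": 8, "heating oil": 6}


def blend_quality(barrels):
    """Return (total quality points, average quality per barrel) for a blend."""
    total_qp = sum(QUALITY[crude] * barrels[crude] for crude in barrels)
    total_barrels = sum(barrels.values())
    return total_qp, total_qp / total_barrels


qp, average = blend_quality({"crude1": 50, "crude2": 40})
good_for_gasoline = average >= MIN_QUALITY["gasoline"]
good_for_heating = average >= MIN_QUALITY["heating oil"]

print(f"total QPs {qp}, average quality {average:.2f}")
print("meets gasoline minimum:", good_for_gasoline)
print("meets heating oil minimum:", good_for_heating)
```

In Solver the same check is written as a linear constraint: total QPs in a product must be at least the minimum quality multiplied by the barrels of that product.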
Two points:
* We could sell the oil as heating oil, but it exceeds the minimum quality
requirement. We could make more profit by reducing the quality to the minimum,
or by charging a premium. (But this is prescriptive work, and we have to work within the
inputs given to us.)
* The only way to get the oil up to gasoline standards is to increase the amount of Crude
oil 1 in the mixture.
If you don’t get this, try a thought experiment: for gasoline, if there was no oil at all
from Oil 2, what would be the QP? It would be 10 * the quantity of oil from Oil 1.
Again, how about if we blended together 1000 barrels from each type of oil: how
many QPs would be produced? It would be 10 * 1000 + 5 * 1000 = 15000. Read
this through again… it is important.
The blending plan provides us with the constraints we need to ensure that the
minimum quality levels are achieved. In the example just above, we blended 2000
barrels to provide a QP of 15000. This is an average of 7.5. Good enough for heating
oil, not good enough for gasoline.
Oil problem solution
Discussion of the solution: note that gasoline sells for more money than heating oil,
but the optimal solution suggests that we should sell more heating oil than gasoline.
This is because of the constraints on quality.
12. Predicting items you can count one by one
Why you need to know this. Many business decisions involve counts: either
binary (yes/no) or within time or space. It would be good to know the probability of a
certain number of customers coming up to a service desk in a certain length of time;
or the probability of a certain number of car accidents at a given intersection. We are
looking for a discrete probability distribution; discrete because the number of
occurrences is an integer.
Chapter 6 on regression showed how to predict the size of an outcome which was
continuous (money or time perhaps). The dependent variable (what we were trying
to predict) could take on almost any value. Now we want to predict probabilities
for the occurrence of a variable which is an integer. Below we’ll
work through some hands-on applications, with the theory available in the
Glossary.
We’ll break this into two parts:
The probability of a binary outcome. The probability of two faulty items out of
the next twenty on the production line. Or, the probability of at least five
customers out of the next fifty actually buying something. This is estimated with
the binomial distribution.
The probability of a particular count of occurrences over time or area. The
probability of three or fewer people arriving at your Customer Service Desk within
the next half hour. Or more than two mistakes in the next ten lines of code. This is
estimated with the Poisson distribution.
12.1 Predicting with the binomial distribution
After the normal distribution (see the Glossary for a definition), the binomial is the
most important in statistics. The math for the binomial distribution is also defined in
the Glossary.
The binomial distribution provides the probability of a given number of ‘successes’ in a
certain number of ‘trials’. For example, you can calculate the probability of more than
seven out of the next twenty people through the door actually buying something.
Here a ‘success’ is somebody buying something, while the number of ‘trials’ is the
number of people coming through the door (here it is twenty).
Some definitions:
What we are looking for is the probability of a pre-defined number of ‘successes’.
Note that the definition of success is up to the analyst. It could be ‘spending more
than $20’ or wearing a hat.
In the example we’ll work through below you run a store. You know that the long-run
probability of somebody buying something is 0.6. You obtained this number by
counting total numbers of customers and, out of those, the numbers who actually
bought something (successes).
Worked example: You want to know the probability of exactly three people through
the door out of the next five buying something. Note that ‘success’ just means that
the defined event happens. Whether or not it is a ‘good thing’ isn’t relevant.
For Excel, the arguments required are: the value of the random variable whose
probability we want to predict; the number of trials; the long-run probability; and a
true/false statement. In the example, the number of trials is five. The value of the
random variable (X) whose probability we want to predict is 3. We also need the
long-run probability of a success. In the example, p = 0.6. We also want the
probability of exactly 3, so use the false statement. I’ll go into this in some detail shortly.
Open an Excel spreadsheet, and type in =BINOM.DIST(3,5,0.6,FALSE) and you
should get this: 0.3456. This is the probability of exactly three people out of the
next five buying something (number of successes out of the next five trials). Notice
the order of the arguments in the Excel function. And especially the last one, which
in the example above is false. (The alternative is true.) The difference is important
because the results are quite different.
True and false argument. Defining the argument as false provides the probability
of exactly the stated value of the random variable. Defining the argument as true
provides a cumulative probability.
Excel adds up probabilities from the left, so changing the argument to true gives
=BINOM.DIST(3,5,0.6,TRUE) = 0.66304, which is the sum of the probabilities of X=0,
X=1 and so on up to and including X=3. The probability of 0.66304 is therefore the
probability of three or fewer customers buying something. The table below shows
the probabilities of various values of X, both ‘false’ and ‘cumulative’. As you can
see, the cumulative is just the continued addition of each successive probability.
There is a further example here¹
Probabilities calculated both false and true
How about more than three customers buying something? In math notation, we’re
looking for P(X > 3). We know that probabilities must sum up to 1. We know that
three or fewer is 0.66304. So more than three has to be: 1 - 0.66304 = 0.33696.
¹https://www.youtube.com/watch?v=oZ1DmNQ8wW4
A slightly harder example: what is the probability that at least four customers out of the
next ten will buy something? We’re looking for P(X >= 4). Look carefully at the
notation. X >= 4 implies that the distribution of the ten customers is split into two
parts: less than four and four or more. It is the latter part whose probability we’re
looking for.
If we can find the probability of 0 + 1 + 2 + 3, that is the probability of less than four
(we only want integers here). So find that probability and then subtract from 1,
making use of the fact that probabilities must sum up to 1.
Let’s do this step by step.
First, find P(X <= 3), that is the probability that X is three or less:
=BINOM.DIST(3,10,0.6,TRUE) = 0.054762. Notice the ‘true’ which gives us the
cumulative probability.
We want four or more, so subtract from 1 like this: 1 - 0.054762 = 0.945238.
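For readers who want to see what BINOM.DIST is doing, here is a small Python sketch of the same calculations using only the standard library. The helper names `binom_pmf` and `binom_cdf` are my own. They play the roles of the FALSE and TRUE settings of the Excel function.

```python
from math import comb


def binom_pmf(k, n, p):
    """P(X = k): like BINOM.DIST(k, n, p, FALSE)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def binom_cdf(k, n, p):
    """P(X <= k): like BINOM.DIST(k, n, p, TRUE), adding up from X = 0."""
    return sum(binom_pmf(i, n, p) for i in range(k + 1))


exactly_three = binom_pmf(3, 5, 0.6)         # exactly 3 buyers out of 5
three_or_fewer = binom_cdf(3, 5, 0.6)        # 3 or fewer buyers out of 5
at_least_four = 1 - binom_cdf(3, 10, 0.6)    # 4 or more buyers out of 10

print(f"P(X = 3 | n = 5)   = {exactly_three:.4f}")
print(f"P(X <= 3 | n = 5)  = {three_or_fewer:.5f}")
print(f"P(X >= 4 | n = 10) = {at_least_four:.6f}")
```

The last line uses the complement rule from the text: find the cumulative probability up to three, then subtract from 1.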
Another example. It’s winter and you need to wear a sweater every day. You have
two blue and three red sweaters. Calculate the probability that during the week you
will wear a red sweater:
Exactly twice in the week.
Answer: the long-run probability of picking a red sweater is 3/5
= 0.6 because you have five sweaters, and three of them are red. The number of trials
is 7 because there are 7 days in a week. The wording of the question contains the
word ‘exactly’ which means that we don’t want a cumulative answer, so we’ll
include the FALSE argument. Therefore the answer is
=BINOM.DIST(2,7,0.6,FALSE) = 0.077414.
More than twice. When you see words such as ‘more than’ that’s a clue that
you’re looking for a cumulative probability. In math notation, we’re looking for
P(X > 2). So if we find the cumulative probability up to and including two, and
then subtract from one, we’re done. The cumulative probability of two
or fewer is P(X=0) + P(X=1) + P(X=2). So the answer is
=1-BINOM.DIST(2,7,0.6,TRUE) = 0.903744.
The keys:
Write out what you are trying to predict in math notation. This forces you to be clear.
Draw a little sketch (hopefully better than mine!) if you get confused.
If the wording of your problem contains ‘exactly’ or requires the probability of just
one particular outcome (e.g. P(X=3)) then you want to use FALSE.
12.2 Predicting with the Poisson distribution
The binomial distribution gave us the probability of a binary outcome (yes/no) out
of a certain number of trials. The Poisson gives us the probability of a certain
number of occurrences within a specified time-frame or area.
Here’s the example. You run the Customer Service Desk. You want to know the
probability of five or fewer customers arriving in the next half hour. That would be
useful for staffing, wouldn’t it? You know that, on average, 20 customers arrive
every half hour. You know that because you have counted them.
We want: P(X <= 5). Like the binomial above, the Poisson has a true/false argument
for whether we want cumulative probabilities or ‘exact’. The notation includes a <=
sign, which implies cumulative, so we use TRUE. In Excel:
=POISSON.DIST(5,20,TRUE). The answer is 7.19088E-05. This looks a bit weird,
but it is just scientific notation. The E-05 means that you should move the decimal point
5 places to the left. So there are four zeroes in front of the leading 7. In other words,
an extremely small probability. Perhaps send some staff out for lunch? It is very
small because 5 is a long way from your long-run average of 20.
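The same Poisson calculation can be sketched in a few lines of Python. The helper names are my own. `poisson_cdf` adds up the probabilities of 0, 1, 2, ... customers in the same way as the TRUE setting of POISSON.DIST.

```python
from math import exp, factorial


def poisson_pmf(k, mean):
    """P(X = k) for a Poisson variable: like POISSON.DIST(k, mean, FALSE)."""
    return exp(-mean) * mean**k / factorial(k)


def poisson_cdf(k, mean):
    """P(X <= k): like POISSON.DIST(k, mean, TRUE)."""
    return sum(poisson_pmf(i, mean) for i in range(k + 1))


five_or_fewer = poisson_cdf(5, 20)   # 5 or fewer arrivals when 20 are expected

print(f"scientific notation: {five_or_fewer:.5E}")
print(f"ordinary decimal:    {five_or_fewer:.10f}")
```

Printing the number both ways shows that the E-05 form and the long decimal are the same tiny probability.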
13. Choice under uncertainty
Virtually all business decisions involve at least some degree of uncertainty. For
example, you don’t know (and presumably can never know) some future ‘state of
nature’ which might be an exchange rate or business rental. A farmer does not know
the price of wheat at harvest time, but has to decide how much to plant many
months ahead. His decision is relatively simple compared to that of a decision-maker
confronted with an opponent who will react to take advantage of whatever decision
is made. The first situation is called ‘non-strategic’ and is easily handled by
the expected monetary value (EMV) method which we’ll work through in the chapter. The
second ‘strategic’ situation is more complicated but can be solved by an application
of game theory. We’ll cover the EMV method in some detail below. Game theory is
an enormous subject and we don’t cover it here, but I’ll give an example at the end of
the chapter along with some recommended reading.
13.1 Influence diagrams
An influence diagram is a sketch of the influences and outcomes of a decision. The
influence diagram allows us to think through and depict the forces that influence
our choices without having to assign probabilities. It is customary to use oval shapes
for variables which are uncertain; a box for a decision node; and lozenges for the
outcome.
Vacation influence diagram
You are an outdoors type and so the choice of activity is affected by the weather. The
weather forecast is affected by the weather condition (one hopes!). But you don’t
know the actual weather condition, just the forecast (which is why it is a forecast).
You choose your activity (box) based on the forecast. The amount of satisfaction you
get (lozenge) depends on the activity you chose AND the actual weather
condition: observe the two lines leading to the lozenge. There is a standard format:
1. Decision Nodes are presented in rectangles. In the example on the slides,
the decision is what activity to pursue when on vacation (climb a mountain?
read a book?). Which choice we make may to some extent be determined by
the weather forecast.
2. Uncertainty Nodes are presented in ovals. The weather forecast is
uncertain and so is the actual weather itself.
3. Outcome Nodes are presented in lozenges. The outcome in the vacation
example is how much satisfaction you took from the choice you made for
vacation activity. This is a utility.
4. Arcs display the flow of the influence. The actual weather condition
affects the weather forecast, which affects your planning. Notice that you
are taking a decision on activity in advance of knowing what the actual
weather will be. That is why there is no arc between weather condition and
vacation activity.
You can draw influence diagrams quite easily in Excel using the shapes drop-down
tool.
There are different outcomes here, which can be represented as payoffs. The
highest payoff comes from matching your choice of activity to the weather forecast
and then finding that the forecast matched reality.
13.2 Expected monetary value
The expected monetary value, or EMV, is just the value of some outcome given a
particular state of nature, multiplied by the probability of that state of nature, and
then summed over all the states of nature. Here, I
am using the customary expression ‘state of nature’ not necessarily to refer to
anything in the natural world, but to a particular set of conditions.
The states of nature have to be mutually exclusive for the probabilities to work. For
example, if we have separate probabilities for rain and sunshine, then we cannot
calculate for rain AND sunshine. The payoffs which result from each decision
are different, depending on the state of nature.
Let’s motivate this with an example using a friendly farmer as the decision-maker.
He has the choice of planting wheat, raising cattle, or some mixture of both. Those
three potential decisions represent his range of choices. From experience, he knows
how much he can expect to receive depending on the weather at the time of harvest.
That amount is called the payoff.
He faces three states of nature, representing different weather conditions at harvest.
These are rain, clear, and sunny. For each choice and for each state of nature he
knows the payoffs which are represented in the cells below.
The payoff table for the farmer
There are three decisions and three states of nature, making a total of 9 possible payoffs.
Payoffs can be negative, and are not always in terms of money. They could be time,
or any other appropriate metric. In the rows we put the choices available to the
decision-maker. In the columns we put the ‘states of nature’ which are the future
events not under the control of the decision-maker.
The farmer calculates the probability of each of the three states of nature by working
through old weather records. He finds:
* probability of rain is 0.4
* probability of clear is 0.3
* probability of sunny is 0.3
Of course these probabilities must all add up to 1. There could be many more ‘states of
nature’ but I have kept them to three for clarity.
The Expected Value is the sum of each probability multiplied by the outcome. In
math notation this is:
$$\mathrm{EMV} = \sum_{x} p(x)\,x$$
which in words is: the expected monetary value is the sum of all the outcomes
multiplied by their probability. In Excel, use the formula
=SUMPRODUCT(array1, array2). This is a mean, or average, with each outcome
weighted by its probability. In decisions involving money,
we usually call the Expected Value the Expected Monetary Value (EMV). For the
farmer, the EMVs are as below:
The EMVs for the farmer
I multiplied (horizontally by decision) the payoff by the probability of that payoff
and then summed. Usually we choose the decision with the highest EMV, so the
farmer should choose wheat. Expected value YouTube¹
Note that a ‘good’ decision is making the best decision at the time with the
information to hand at that time. There may be unlucky consequences but provided
the analyst has done a thorough job in selecting the optimal decision at the time,
then she should not be blamed!
It will be highly unusual for an outcome of exactly 30 to actually occur. The EMV is just a
weighted average and not a prediction.
13.3 Value of perfect information
The farmer might be tempted to ask for advice from a consultant. What is the
maximum he or she should pay for the advice? It is the difference between the best
EMV that the farmer calculates and how much he would expect to receive with perfect
information. We calculate the value of perfect information by comparing the EMV
of the best choice in each state of nature with the EMV without information. If you
knew that the weather would be rainy, you would select cattle; if clear, wheat; if sunny,
wheat again. Then find the EMV with perfect information by multiplying the payoffs
of these best options by their probabilities.
¹https://www.youtube.com/watch?v=fR7cBts1C1w
The working is: 10*0.4 + 30*0.3 + 70*0.3 = 4 + 9 + 21 = 34. The best we could come up
with without perfect information was 30, so the value of perfect information is 34 - 30 =
4. So if someone offered you perfect information for a price, the highest that you
would pay for the perfect information would be 4.
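The value-of-perfect-information arithmetic can be written out as a short Python sketch. It uses only the figures quoted in the text: the probabilities 0.4, 0.3 and 0.3, the best payoff in each state of nature (read from the working as 10, 30 and 70), and the best EMV without information of 30. The names are hypothetical. It mirrors what =SUMPRODUCT would do in Excel.

```python
PROBABILITY = {"rain": 0.4, "clear": 0.3, "sunny": 0.3}

# Best payoff available in each state of nature, as used in the working:
# cattle if rain, wheat if clear, wheat if sunny.
BEST_PAYOFF = {"rain": 10, "clear": 30, "sunny": 70}

BEST_EMV_WITHOUT_INFO = 30  # the EMV of wheat, the best decision without advice

# Same calculation as =SUMPRODUCT(probabilities, best_payoffs) in Excel.
emv_with_perfect_info = sum(PROBABILITY[s] * BEST_PAYOFF[s] for s in PROBABILITY)
value_of_perfect_info = emv_with_perfect_info - BEST_EMV_WITHOUT_INFO

print(f"EMV with perfect information: {emv_with_perfect_info:.1f}")
print(f"Most you should pay for it:   {value_of_perfect_info:.1f}")
```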
13.4 Risk-return ratio
There are ways of choosing the optimal decision other than the EMV,
although the EMV remains the most commonly used. One common measure is the
Return to Risk Ratio (RRR), which provides the dollars returned per dollar put at risk.
For each decision, we divide the Expected Value of the decision by the Standard
Deviation of the outcomes for that decision. We usually choose the decision with the
highest RRR, because then the dollar return for each dollar put at risk is highest.
The farmer’s worksheet is below.
Mixed has a zero because the payoffs are all the same. There is no risk. Cattle has the
highest at 0.715. This is higher than the RRR for wheat although wheat has the
highest EMV. The farmer might want to think more deeply about the probability of
sunny weather at harvest time, because this makes a huge difference.
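13.5 Maximax and maximin
A non-probabilistic approach to making choices is the use of maximin and
maximax. Maximin picks the decision whose worst payoff is the highest; maximax
picks the decision whose best payoff is the highest. You can see that your choice
depends on whether you are pessimistic
(maximin) or optimistic (maximax). If we can somehow obtain probabilities, then a
probabilistic approach works best.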
13.6 Worked examples
You are planning to market a new coffee-flavored drink. You have a choice
between packing the drink in returnable or non-returnable packaging. Your local
government is debating whether non-returnable bottles should be prohibited. The
table below shows your profits. If the non-returnable law is passed, you will still get
some sales from exports.
Decision        Law passed   Law not passed
Returnable      80           40
Nonreturnable   25           60
Example 1
1. What would be the best decision based on maximin and maximax
criteria?
2. A lobbyist tells you that the probability of the government banning
non-returnables is 0.7. Assuming the lobbyist is right, what is the best decision
based on the EMV?
3. At what level of probability would your decision change?
4. How much would you pay for perfect information?
Answers:
1. Maximin: the worst payoffs are 40 (returnable) and 25 (non-returnable). The best
of these is 40, meaning package with returnables. Maximax: the best are 80 and 60.
Again returnables.
2. The EMV for returnable is 80*0.7 + 40*0.3 = 68. For non-returnable it is
25*0.7 + 60*0.3 = 35.5. Under EMV, you should use returnable bottles.
3. Set this up as an equation, with p the unknown which has to balance both
sides: 80p + 40(1-p) = 25p + 60(1-p). Solve for
p and get 0.267. So if you hear rumours that suggest the probability is
coming down towards this level, think again.
4. If we had perfect information, the EMV would be 80*0.7 + 60*0.3 = 74. So you
should not pay more than 74 - 68 = 6.
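All four answers can be reproduced from the payoff table with a short Python sketch. The payoffs and the probability of 0.7 come from the example. The names and the layout of the dictionary are my own.

```python
PAYOFF = {
    "returnable":    {"passed": 80, "not_passed": 40},
    "nonreturnable": {"passed": 25, "not_passed": 60},
}
P_PASSED = 0.7  # lobbyist's probability that the law is passed


def emv(decision, p):
    """Expected monetary value of a decision when the law passes with probability p."""
    row = PAYOFF[decision]
    return p * row["passed"] + (1 - p) * row["not_passed"]


# 1. Maximin looks at each decision's worst payoff, maximax at its best payoff.
maximin = max(PAYOFF, key=lambda d: min(PAYOFF[d].values()))
maximax = max(PAYOFF, key=lambda d: max(PAYOFF[d].values()))

# 2. EMV of each decision at p = 0.7.
emvs = {d: emv(d, P_PASSED) for d in PAYOFF}
best_by_emv = max(emvs, key=emvs.get)

# 3. Break-even p solves 80p + 40(1-p) = 25p + 60(1-p).
r, n = PAYOFF["returnable"], PAYOFF["nonreturnable"]
break_even = (n["not_passed"] - r["not_passed"]) / (
    (r["passed"] - r["not_passed"]) - (n["passed"] - n["not_passed"])
)

# 4. With perfect information you pick the best decision in each state of nature.
emv_perfect = (P_PASSED * max(row["passed"] for row in PAYOFF.values())
               + (1 - P_PASSED) * max(row["not_passed"] for row in PAYOFF.values()))
value_of_perfect_info = emv_perfect - emvs[best_by_emv]

print("maximin:", maximin, "| maximax:", maximax)
print("EMVs:", {d: round(v, 2) for d, v in emvs.items()})
print(f"break-even probability: {break_even:.3f}")
print(f"value of perfect information: {value_of_perfect_info:.1f}")
```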
Example 2
You run a bank. A customer wants to borrow $15,000 for 1 year at 10% interest.
You believe that there is a 5% chance that the customer will default on the loan, in
which case you will lose all the money. If you don’t lend the money, you will instead
place the $15,000 in bonds which return 6% but are risk-free.
1. What are the EMVs of loan and not loan?
2. You have a credit investigation department which can help you with
more accuracy in the probability of default. What is the most you should be
willing to pay for their advice?
3. Calculate the level of probability of default at which lending the money
and investing in bonds have equal EMV.
14. Accounting for risk preferences
What this chapter is about: the previous chapter showed how we could pick the
most attractive choice given payoffs and probabilities. But it might have struck
you that there was little discussion of the risks involved and how different people
relate to risk. In this chapter we are going to work through picking optimal
decisions, taking into account the risk preferences of the decision makers.
Here is a question for you: you are given a lottery ticket which has a 0.5 probability
of winning $10,000 and a 0.5 probability of winning nothing. The EMV is therefore
10,000 × 0.5 + 0 × 0.5 = $5,000. Someone comes along and offers you a guaranteed
$3,000 for the ticket. Would you take the sure thing? Or would you hope that you win the
$10,000? Most people are ‘risk averse’ and would prefer to give up some of the
EMV in exchange for certainty. The smallest sure amount you would accept for the
ticket, $3,000 or however much it is, is called your
Certainty Equivalent (CE) in this particular gamble. The difference between the
EMV and the CE is called the Risk Premium. So
RP = EMV - CE
You can think of the certainty equivalent as a selling price. It is usually small when
the size of the gamble is small, but increases as the gamble gets larger. Here I am
using the term ‘gamble’ because there are probabilities involved. The certainty
equivalent is a useful concept which, amongst other qualities, allows us to categorize
approaches to risk.
• Risk-averse. If your CE is less than the EMV of a gamble, then you are
risk-averse (and most people are, so nothing to worry about. You’re
‘normal’).
• Risk-neutral. Your CE matches the EMV. You play the averages.
• Risk-seeking. You enjoy the gamble inherent in the EMV and need to
have a very high CE, above the EMV, before you’ll give it up. Most people in
business who are risk-seeking usually don’t last very long, although we sometimes see
risk-seeking decisions when a business is in trouble and a huge gamble is
the only possible solution.
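The three categories follow directly from the sign of the risk premium. The Python sketch below applies RP = EMV - CE to the lottery ticket example; the function names are illustrative assumptions, not anything defined in the book.

```python
def risk_premium(emv, ce):
    """Risk premium: how much expected value is given up for certainty."""
    return emv - ce


def risk_attitude(emv, ce):
    """Classify a decision maker by comparing the CE with the EMV."""
    rp = risk_premium(emv, ce)
    if rp > 0:
        return "risk-averse"
    if rp < 0:
        return "risk-seeking"
    return "risk-neutral"


# Lottery ticket: 0.5 chance of $10,000, 0.5 chance of nothing.
lottery_emv = 0.5 * 10_000 + 0.5 * 0

# Someone who would sell the ticket for $3,000 has a CE of 3,000.
print(risk_premium(lottery_emv, 3_000), risk_attitude(lottery_emv, 3_000))
```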
Business decisions are all about risk, and the probabilities themselves are often
unreliable and hard to find. The farmer whose cropping decision we studied in
Chapter 2 doesn’t know tomorrow’s weather, let alone the weather in six months.
We generally prefer to give up some of the EMV in return for more certainty because
we are risk averse. That’s why we buy insurance. I know that I will be covered if I
smash my car, and I can even put up with the knowledge that people in Zurich are
getting richer because of my aversion to risk.
14.1 Outline of the chapter
Here’s what we’re going to do:
• Describe utility and calculate utilities
• Show how to use the exponential utility function to calculate utilities based
on risk tolerance
• Convert those utilities to certainty equivalents which we can rank and
compare with the EMVs of the same decision
To rank choices in terms of their certainty equivalent rather than their EMV, we
need the concept of utility, which is a numerical measure of how well a good or
service satisfies a need or want. Do you prefer to be reading this book, outside
eating an ice-cream, or talking with a friend? You can rank these choices quickly
in your head. You can
subconsciously allocate utility numbers to the choices and then pick whichever has
the highest utility.
Utility numbers help us to rank our preferences, but there are no units and the
ranking is individual-specific. In other words, given a set of alternatives, different
people may well have different rankings for them. For the moment, let’s assume
that it is just you whose utilities we are interested in. If we can somehow figure out
your utilities for the different outcomes of a business decision, we can use those
utilities to help you make a more psychologically satisfying choice.
Utility
The plot below shows a hypothetical utility curve. The utility value is on the vertical
y axis and the outcome of a gamble is on the horizontal x axis.
A typical utility curve
Notice two things:
• The slope of the line is upwards. This reflects the fact that everyone prefers more
money to less.
• The rate of increase of the line is decreasing or slowing down. It was
steep early on, but towards the end it is almost a plateau. This is because the
value of each extra dollar is slightly less than that of the previous dollar.
This is the diminishing marginal utility of money.
14.2 Where do the utility numbers come from?
There are two approaches.
The first involves asking the decision-maker to choose between a guaranteed
payoff and a gamble.
The second uses a utility function, a mathematical function which converts two
inputs (risk tolerance and the size of the outcome) into one utility number.
With a utility number in hand we can begin the process of ranking the decisions in
terms of utility rather than EMV.
The intuitive model
We’ll work through the intuitive model first and then apply the same thinking to the
equation model. Once you get the hang of it, the equation model is quicker and
probably more accurate. We’ll use the payoff table from the farmer example,
repeated below.
The payoff table for the farmer
We’ll assign a utility number of 1 to the highest payoff and zero to the lowest.
The beginnings of the utility table look like this:
Payoff    Utility
70        1
40
30
20
10
-10
-30       0
Now we need to fill in the gaps. What we’ll do is to use each payoff as a certainty
equivalent, and ask this question:
What value of p would make you indifferent between receiving a guaranteed
payoff of the certainty equivalent (in the first case 40) and accepting a gamble of
receiving the highest payoff (70) with probability p or
losing 30 (the lowest payoff) with probability 1-p? Let’s say that you answer p=0.9.
That probability becomes the utility of 40.
The expected value of the gamble is 70*0.9 + (-30)*0.1 = 60.
This is higher than your certainty equivalent of 40, indicating that you are risk
averse. The difference between the expected value and the certainty equivalent,
here 60 - 40 = 20, is the
risk premium, the amount the individual is willing to give up to avoid risk.
Continue downwards, certainty equivalent by certainty equivalent, and complete the
table. I’ve made up the numbers just for illustration. A plot of the utility values is
also shown.
Payoff    Utility
70        1
40        0.9
30        0.8
20        0.65
10        0.4
-10       0.35
-30       0
Farmer’s utility curve
If you were risk-neutral, you would value each gamble at exactly its expected value
and so would be an
EMV maximizer. The choice of the risk-neutral person is indicated by the straight
line.
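Each row of the utility table is an answer to the indifference question, so each row also implies a risk premium. The Python sketch below recomputes the expected value of the reference gamble (70 with probability p, -30 otherwise) for every made-up utility in the table; the variable names are illustrative assumptions.

```python
best, worst = 70, -30  # highest and lowest payoffs in the farmer's table

# Payoff -> elicited indifference probability p, which is also its utility.
utility_table = {70: 1.0, 40: 0.9, 30: 0.8, 20: 0.65, 10: 0.4, -10: 0.35, -30: 0.0}

risk_premiums = {}
for certain_payoff, p in utility_table.items():
    # Expected value of the gamble the decision maker finds equally attractive.
    gamble_ev = p * best + (1 - p) * worst
    # Risk premium = expected value of the gamble minus the certainty equivalent.
    risk_premiums[certain_payoff] = gamble_ev - certain_payoff

for payoff, rp in risk_premiums.items():
    print(payoff, round(rp, 2))
```

A positive premium at a payoff means the answer given there was risk-averse. A risk-neutral person would produce premiums of zero all the way down, which is the straight line on the plot.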
Now, let’s replace the payoff values in the farmer table with utilities. We can calculate
the expected utilities in the same way as we found the EMVs: we multiply the utility
of each outcome by its probability and add up the results. The table shows both
for comparison.
Expected utilities
The farmer’s EMV decision would have been wheat (EMV=14), but taking into
account his risk preferences, mixed gives the highest expected utility. It makes
sense to spread the risk between two completely different crops.
## The exponential utility curve method
It is rather tedious asking so many questions, and people get tired and confused
rather quickly. An attractive alternative is to use a
mathematical function, the exponential utility curve. This requires the decision-maker
to supply only one number, R, the tolerance for risk. The equation is below:
$$U(x) = 1 - e^{-x/R}$$
Here x is the payoff; U(x) is the utility of the payoff x; and R is the individual’s
risk tolerance. A person with a large value of R is more likely to take risks than
someone with a smaller R value. As the R value increases, the behavior approaches
that of the EMV maximizer. The plot for someone with a risk tolerance of R=1000 is
below.
Actually performing the calculations to find utility at each level of payoff is easy
using Excel, as we’ll see
below. But first we have to determine R, the individual’s tolerance for risk.
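As a rough illustration of the exponential utility curve, here is a short Python sketch. It uses the R = 1000 mentioned for the plot; the payoff values are hypothetical, chosen only to show the shape of the curve.

```python
import math


def exp_utility(x, R):
    """Exponential utility U(x) = 1 - exp(-x / R) for payoff x and risk tolerance R."""
    return 1 - math.exp(-x / R)


R = 1000  # risk tolerance used for the plot mentioned in the text
payoffs = [0, 500, 1000, 1500, 2000]  # hypothetical payoffs for illustration
utilities = [exp_utility(x, R) for x in payoffs]

# Utility gained from each extra 500: the gains shrink as the payoff grows,
# which is the diminishing marginal utility of money.
gains = [later - earlier for earlier, later in zip(utilities, utilities[1:])]

for x, u in zip(payoffs, utilities):
    print(x, round(u, 4))
```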
Finding risk tolerance
There are two approaches.
The first is to ask this question: consider a gamble which has equal chances of making a
profit of X or a loss of X/2. What is the value of X for which you wouldn’t care
whether you had the gamble or not? In other words, what is the value of X for which the
certainty equivalent is
zero? That value of X is approximately your risk tolerance R. The expected value of
the gamble is 0.5*X - 0.5*(X/2) = 0.25X. As long
as X > 0, the decision-maker is displaying risk-averse behavior, because
his/her certainty equivalent (zero) is less than the expected value of the gamble. The greater
the R, the more tolerant of risk. In his book Thinking, Fast and Slow, Daniel
Kahneman notes that the ‘2’ in the denominator can vary a tad. It isn’t a precise
formula, but apparently it comes close. In diagram form:
An alternative approach, more used in
business and finance, is to employ guideline numbers calculated by Professor Ron
Howard¹, a pioneer of decision analysis. Based on his years of experience, Howard
suggests that the R value for a company is:
• 6.4% of net sales
• 124% of net income
• 15.7% of equity
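Howard's guidelines amount to three multipliers, sketched below in Python. The sketch reads the first guideline as a share of net sales, and the dictionary keys and function name are illustrative assumptions. The equity figure of 12.74 (in millions) is the one used in the example that follows.

```python
# Guideline ratios for a company's risk tolerance R, as listed above.
HOWARD_RATIOS = {
    "net_sales": 0.064,   # 6.4% of net sales
    "net_income": 1.24,   # 124% of net income
    "equity": 0.157,      # 15.7% of equity
}


def risk_tolerance(measure, amount):
    """Rule-of-thumb R from one financial measure (same units as `amount`)."""
    return HOWARD_RATIOS[measure] * amount


# A company with equity of 12.74 (millions) gets an R of about 2.
R_from_equity = risk_tolerance("equity", 12.74)
print(round(R_from_equity, 2))
```

The three measures will rarely give the same R for a real company, so treat the result as a starting point rather than a precise figure.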
## Example decision using utility maximisation
What we’ll do here is to examine a decision from both the EMV and Expected
Utility angles, using the exponential utility function. The calculation of expected
values is the same: the difference is that instead of multiplying each monetary payoff
by its probability and then adding up, we multiply the utility of each payoff by its
probability.
¹https://profiles.stanford.edu/ronald-howard
The decision set-up. All money figures are in millions. I run a coffee importing
business. I need a special import license to bring in a particular type of coffee. The
equity of my company is 12.74. So, my R value is 15.7% of 12.74 = 2.
I have a choice between ‘rush’ and ‘wait’ for the permit.
If I rush, I have to pay a special fee of 5, and there is a 50/50 chance of the permit
being granted. If it is granted, I will have sales of 8, giving me a net profit of 8-5=3.
If it isn’t granted, I still have to pay the 5, but my sales will be lower, leaving me a net
profit of 2. First, find the EMV and EU of rushing. The EMV is 0.5*3 + 0.5*2 = 2.5.
Using the exponential curve formula, the utility of 3 is =1-EXP(-3/2) = 0.77687,
and the utility of 2 is =1-EXP(-2/2) = 0.632121. The EU for rushing is 0.5*0.77687
+ 0.5*0.632121 = 0.704496.
Now for the alternative, which is to wait. In this scenario the probabilities are the
same at 50/50, but the cost is only 3 if the permit is issued, nothing if it isn’t. Sales are
8 if it is issued and 4 if it isn’t. The net profits are 8-3=5 and 4. The EMV is
0.5*5 + 0.5*4 = 4.5. The
utilities are U(5) = 1-EXP(-5/2) = 0.918 and U(4) = 1-EXP(-4/2) = 0.865. The EU is
0.5*0.918 + 0.5*0.865 = 0.892.
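The same comparison can be sketched in Python instead of a spreadsheet. The net profits (3 and 2 for rush, 5 and 4 for wait), the 50/50 probabilities and R = 2 come from the text; the data layout and names are illustrative assumptions.

```python
import math

R = 2.0  # risk tolerance, in millions


def utility(x, r=R):
    """Exponential utility of a net profit x."""
    return 1 - math.exp(-x / r)


# Each decision is a list of (probability, net profit) pairs.
decisions = {
    "rush": [(0.5, 3), (0.5, 2)],
    "wait": [(0.5, 5), (0.5, 4)],
}

# EMV weights the profits themselves; EU weights the utilities of the profits.
emv = {name: sum(p * x for p, x in outcomes) for name, outcomes in decisions.items()}
eu = {name: sum(p * utility(x) for p, x in outcomes) for name, outcomes in decisions.items()}

for name in decisions:
    print(name, emv[name], round(eu[name], 4))
```

Carrying full precision gives an EU for waiting of about 0.891 rather than the 0.892 obtained from the rounded utilities, which does not change the ranking.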
The table below shows the EMVs and EUs.
Spreadsheet for the rush decision
But we can go further using certainty equivalents, the subject of the next section.
14.3 Converting an expected utility number into a certainty equivalent
Fortunately there is an easy way to convert expected utilities back to certainty
equivalents. It is then straightforward to rank the decisions by certainty
equivalents. Remember the concept of the certainty equivalent? The amount of
money which it would take for you to sell your gamble? It turns out that we can
convert an expected utility number into a certainty equivalent using an equation:
$$CE = -R \ln(1 - EU)$$
where ln is the natural logarithm. We use this equation where we are talking profits
and want more. In the case of costs, take off the negative sign in front of R.
Above we found EU = 0.704496 for the rush decision branch. That is a CE of
=-2*LN(1-0.704496) = 2.43814
using Excel’s LN function. For the wait decision, the EU was 0.892, giving a
certainty equivalent of =-2*LN(1-0.892) = 4.451248
Larger CEs are preferred when the outcomes are profits, and smaller CEs when the
outcomes are costs. So I’d wait! This confirms the decision made with the Expected
Utilities.
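The conversion is simply the exponential utility function run backwards. The Python sketch below reproduces the two Excel calculations above, using the expected utilities quoted in the text; the function name is an illustrative assumption.

```python
import math


def certainty_equivalent(eu, R):
    """Invert U(x) = 1 - exp(-x / R): the sure profit whose utility equals eu."""
    return -R * math.log(1 - eu)


R = 2
ce_rush = certainty_equivalent(0.704496, R)  # EU of rushing, from the text
ce_wait = certainty_equivalent(0.892, R)     # EU of waiting, from the text

# Sanity check: converting the utility of a sure profit of 3 gives back 3.
round_trip = certainty_equivalent(1 - math.exp(-3 / R), R)

print(round(ce_rush, 5), round(ce_wait, 5), round(round_trip, 5))
```

Both certainty equivalents sit a little below the corresponding EMVs of 2.5 and 4.5, which is what risk aversion looks like in money terms.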
15. Glossary
This is a hands-on guide to doing some basic analytic work with data. However,
statistics has its own vocabulary and techniques. When you are presenting
your results to other people, you may wish to follow established techniques and
terminology. Also, here you’ll find more of the underlying theory if you are
interested in how Excel gets the results it does.
Binomial distribution
The binomial distribution is used to find the probability of the occurrence of a
certain number of events (called successes, although they could be anything clearly
defined) within a number of trials. The binomial distribution will give, for example,
the probability of at least ten parts being defective out of the next hundred, provided
the long-term defective rate is known. The binomial requires some conditions to
work properly. These are:
• a sequence of n identical trials
• there are two outcomes for each trial, denoted success and failure
• the probability of success doesn’t change
• the trials are independent.
It is quite easy to calculate the probabilities with Excel but for illustration here is
the equation which Excel uses:
$$f(x) = \binom{n}{x} p^{x} (1-p)^{n-x}$$
where x is the number of successes; p the probability of success on one trial; n the
number of trials; and f(x) the probability of x successes in n trials. This notation
points up the fact that the result is a function of x.
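To see the formula at work outside Excel, here is a minimal Python sketch. The 5% long-term defect rate is a hypothetical figure chosen for illustration, and the function names are assumptions of this sketch.

```python
from math import comb


def binom_pmf(x, n, p):
    """Probability of exactly x successes in n independent trials."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)


def binom_at_least(k, n, p):
    """Probability of k or more successes in n trials."""
    return sum(binom_pmf(x, n, p) for x in range(k, n + 1))


defect_rate = 0.05  # hypothetical long-term defect rate
p_ten_or_more = binom_at_least(10, 100, defect_rate)
print(p_ten_or_more)
```

The 'at least ten' probability is just the single-value formula added up over x = 10, 11, ..., 100.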
Central limit theorem (CLT)
The central limit theorem is fundamental in statistics. What the theorem states is
this: let’s say you have a large population and you keep taking samples from the
population and measuring the mean of each sample. Then you draw a histogram of
the means so that each sample mean is one observation in the distribution. In other
words, instead
of plotting each observation independently, we plot the means of the
samples. The histogram which appears will be approximately normally distributed, or
in the shape of the well-known bell curve. This is wonderful news because the
normal distribution is well behaved, and so we can calculate the area under any part
of it.
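A small simulation makes the theorem concrete. The Python sketch below is an illustration under stated assumptions: the 'population' is an exponential distribution with mean 1 and standard deviation 1 (deliberately skewed, nothing like a bell curve), and the sample size and number of samples are arbitrary choices.

```python
import random
import statistics

random.seed(42)  # fixed seed so the run is repeatable

SAMPLE_SIZE = 50     # observations per sample (assumed for illustration)
NUM_SAMPLES = 2000   # how many samples we draw


def draw_sample(n):
    """Draw n observations from a skewed population (exponential, mean 1)."""
    return [random.expovariate(1.0) for _ in range(n)]


sample_means = [statistics.mean(draw_sample(SAMPLE_SIZE)) for _ in range(NUM_SAMPLES)]

mean_of_means = statistics.mean(sample_means)
sd_of_means = statistics.stdev(sample_means)
expected_sd = 1.0 / SAMPLE_SIZE ** 0.5  # population sd divided by root n

# Share of sample means within one standard deviation of their centre.
# For a bell-shaped distribution this should be roughly two thirds.
share_within_1sd = sum(
    abs(m - mean_of_means) <= sd_of_means for m in sample_means
) / NUM_SAMPLES

print(mean_of_means, sd_of_means, expected_sd, share_within_1sd)
```

Plotting `sample_means` as a histogram in Excel or Tableau should show the bell shape, even though the individual observations are heavily skewed.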
Correlation
You have a pair of variables: for example, website hits and actual sales. You want
to know whether there is a relationship between the variables, and whether it is
‘statistically significant’, or whether what you thought was a relationship occurred
just by chance.
Correlation is the study of the linear relationship between two or more random
variables. Correlation analysis provides both the direction and the strength of the
relationship between the variables. For example, there is a relationship between the
height and weight of individuals. There is apparently also a relationship between
consumption of sugar and obesity rates at the national level. Correlation analysis
can test whether such relationships are statistically significant or could have arisen
by chance.
It’s true and worth repeating: correlation does not imply causation. It does not
necessarily follow that one variable causes the other even though they might be
highly correlated. There are many examples of such spurious correlations, for
example national consumption of chocolate and number of Nobel prizes won. If
you do find a correlation, first think about whether you have a lurking variable,
discussed below. The two examples given above for height/weight and sugar
consumption might very well include
causal effects, but this is not something that correlation analysis can establish on its
own. Indeed, some philosophers (such as David Hume) claim that causation can
never be decisively established.
Now we will:
• visualize a correlation with a scatter plot
• explain why the coefficient of correlation works
• interpret the value of the coefficient of correlation
• test for the statistical significance of the coefficient of correlation
Scatter plots and correlation
A canning factory has data on number of workers employed and output of cans. The
dataset is ‘canning’. The factory is interested in the relationship between the number
of workers and output. Load the data into Excel.
First few lines of the canning data
The columns of data represent numbers of workers and cans manufactured. For
the notation we’ll call these X and Y. When I mean the name of the variable, I’ll use
the upper case. For individual observations, such as x=23, I’ll use lower case. We
want to see the relationship between X and Y. To do this we need to have the data laid
out in columns so that each observation of X and Y is on the same line. It’s always a
good thing to visualise the relationship with a scatter plot, and the scatter plot is
below.
The cans scatter plot
I’ve drawn a vertical and a horizontal line which go through the means of the two
variables. We can divide the observations into quadrants, with quadrant one being
top right and the others numbered anticlockwise from there. If the relationship is
positive then we would expect to see most
of the observations in quadrant one and quadrant three. If the relationship was
negative then most of the observations would be in quadrants two and four.
The relationship between X and Y is not perfectly linear. If it was, all the pairs of
points would line up along a straight line. We need some way of measuring how
far off being in a straight line these observations are. There are two measures,
covariance and coefficient of correlation. There is some maths in what follows but
we’ll walk through it slowly.
Covariance: Try this thought experiment: if all the points were on a horizontal line,
every Y value would equal the mean of Y, so subtracting the mean from each Y value
would give us zero:
$$(y_i - \bar{y}) = 0$$
I’ve put in a little i for the y values to show that it could be any
of the values in the Y variable. Similarly for the X values, if all the points were on a
vertical line then
$$(x_i - \bar{x}) = 0$$
We can see that our points lie on neither a horizontal nor a vertical line, so there is
some difference between
each observed value and the mean. If we multiply together the two differences for each
observation, add up the products, and then
find the average by dividing by n-1, we find the covariance. We divide by n-1
because we are in most cases dealing with a sample. Subtracting 1 adjusts for this
fact. The sample covariance of X and Y is therefore:
$$S_{XY} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n-1}$$
The big problem with covariance is that the result is very dependent on the units. We
can get round this problem by standardizing the differences between each
observation and its mean. We do that by dividing the difference by the standard
deviation of the variable. You can think of this technique as providing the
difference in units of standard deviations, so
$$\frac{x_i - \bar{x}}{S_X}$$
and the same for the Y variable. Now we no longer need to worry about the units. The
coefficient of correlation is the covariance between the standardised
differences, not the differences themselves. The equation for the coefficient of
correlation of a sample is
$$r_{XY} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{(n-1)\, S_X S_Y}$$
Performing this calculation by hand is extremely tedious and we usually let the
computer do
the work. Type in =CORREL(array1, array2) and hit
enter. For the cans data the result is 0.991083. This is the Pearson product-moment
correlation coefficient, or ‘r’ value.
Correlation YouTube¹
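The covariance and correlation formulas can also be followed step by step in code. The Python sketch below uses a handful of invented workers and cans figures (not the book's canning dataset, which is not reproduced here), and the function names are illustrative assumptions.

```python
import statistics

# Hypothetical figures, invented for illustration (not the canning dataset).
workers = [10, 12, 15, 18, 20, 23, 25]
cans = [410, 480, 610, 700, 810, 905, 1010]


def covariance(x, y):
    """Sample covariance: sum of products of deviations, divided by n - 1."""
    mean_x, mean_y = statistics.mean(x), statistics.mean(y)
    products = [(xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)]
    return sum(products) / (len(x) - 1)


def correlation(x, y):
    """Coefficient of correlation: covariance rescaled by both standard deviations."""
    return covariance(x, y) / (statistics.stdev(x) * statistics.stdev(y))


r = correlation(workers, cans)
print(round(covariance(workers, cans), 2), round(r, 4))
```

Rescaling either list, for example measuring cans in thousands, changes the covariance but leaves r untouched, which is the point of standardizing.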
Values of r: values of r are restricted to the range -1 to +1. A negative value means
that the slope is negative: as one variable increases the other decreases. A
positive value means that both variables increase together. The larger the absolute
value of r, the closer the correlation. A correlation of zero means that there is no
linear relationship: the
points are scattered about with no linear pattern. The r value of 0.991083 for
cans means that the points are lined up very nearly along a straight line, which is what
we would expect from looking at the scatter plot.
Testing r: the observations that we have used to calculate r consist of a sample from
a population. Because it is only a sample, we don’t know whether the correlation
would hold up if we were somehow able to gain access to the population. r is the
notation for the coefficient of correlation for a sample. For a population we use the
Greek letter ρ “rho”. Generally, Greek letters are used for population parameters.
This distinguishes them from the Roman letters used for estimates.
The hypothesis we’re testing is:
$$H_a: \rho > 0$$
against the null
$$H_0: \rho \le 0$$
We find the test statistic using this equation
$$t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}$$
¹https://www.youtube.com/watch?v=QDj8aYfvccQ
We’ll first work out the value of the test statistic using the r value we found above.
Then we’ll find the p value from this.
The r value found from the correlation was 0.99 and there were 24 observations
(n=24). Plugging in the numbers, we find that t is approximately 32.9. This is a
one-tailed test with n-2 degrees of freedom, so use
=T.DIST.RT(your value for t, n-2) to find the p value. The result is of the order of
1E-20. The E-20
means move the decimal point 20 places to the left, so this number is effectively
zero. The p value is much smaller than 0.05,
so we can reject the null hypothesis. The r value we found is statistically
significant: there is a correlation.
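The test statistic itself needs nothing more than a square root. Here is a minimal Python sketch of the formula, with r = 0.99 and n = 24 taken from the text; the function name is an illustrative assumption. Turning t into a p value needs a t distribution, which Excel supplies and Python's standard library does not, so that step is left out.

```python
import math


def t_statistic(r, n):
    """Test statistic for a sample correlation r based on n observations."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)


t = t_statistic(0.99, 24)
print(round(t, 2))
```

Because 1 - r² is in the denominator, t grows very quickly as r approaches 1, which is why such a strong correlation gives such a tiny p value even with only 24 observations.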
Descriptive statistics uses graphs and tables to present what we know about the
data. This is a powerful method of gaining insights into the data, and also for
making presentations. However, descriptive statistics limits itself to the data available
and does not make any inferences beyond what we have from the data to hand.
Distribution of the observations within a dataset describes the density of the
observations: are they spread about evenly, or peaked in the middle and then spread
out on either side of their mean?
One of the most common and useful distributions is the normal distribution,
sometimes known as the bell curve. Many things in the natural world follow this
distribution. For example, if you were able to measure the heights of a large number
of men or women, you’d find that they followed a normal distribution.
The plot shows the distribution of records of heights of parents collected
by Francis Galton in the 19th century. Galton was trying to find out the relationship
between the height of parents and their children. Along the way, he discovered the
phenomenon known as ‘regression to the mean’.
I have plotted the heights as a histogram and added a normal curve with the same
mean and standard deviation as the original data. This clearly follows the bell curve.
Galton’s height observations
The normal distribution is very important in statistics for several reasons.
- The normal distribution is ‘well-behaved’ in the sense that it always follows the same
mathematical function. The equation is a little complex, but we can draw any normal
distribution provided we know its mean and variance. The mean is the average of the
observations, and in a normal distribution occurs right in the middle. The variance is a
measure of the average squared distance of each observation from the mean. The
larger the variance, the more ‘spread out’ or dispersed the data.
The graphs here show two datasets, with identical means but different variances.
In most of this book we won’t use the term variance, but instead standard
deviation. The standard deviation is just the square root of the variance. We use the
standard deviation primarily so that we can use the same units as the mean. The first
plot has a standard deviation of 5, and the second one a standard deviation of 2. The
plots were created by generating a random set of 1000 numbers, with a mean of ten
and the standard deviations just described.
- The second reason is connected with the most important result in statistics,
the Central Limit Theorem, or CLT. This states that if you keep drawing samples
with replacement from a large population, the means of the samples will follow
approximately a normal distribution. The important thing is that the distribution of
the original
population doesn’t matter as long as each sample contains at least 30 observations.
In fact, it gets
better than that: if the original population is itself close to normally distributed,
then we don’t
need as many as 30 in our sample. As few as 10 will do the trick.
Standard deviation = 5
Standard deviation = 2
Because the normal distribution follows an equation so well, we can find the area
under any part of it very easily.
The mean splits the area under the curve into halves, meaning that 50% of the
observations are smaller than the mean, and 50% larger. A common practice is to
mark off the horizontal axis in units of standard deviation. The further from the
mean, the greater the number of standard deviations. This helps both in visualizing
the shape of
the data and in using the empirical rule.
Empirical rule: a ‘rule of thumb’ which states that when data are normally
distributed,
- 68% of the data are within one standard deviation of the mean
- 95% of the data are within two standard deviations of the mean
- 99.7% of the data are within three standard deviations of the mean
This makes intuitive sense: only 100 - 99.7 = 0.3% of the data are more than three
standard deviations from the mean. They would therefore be in either of the
extreme ends or tails of the distribution. The probability of their occurrence would
be very small. By contrast, observations closer to the mean, those located only a
small number of standard deviations from the mean, are much more likely to occur.
That is why the histogram is highest at the mean. Thought experiment: very short
or very tall people are unusual (which is why you stare at them). By contrast, people
around the mean height are not at all unusual and you just pass them by.
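The three percentages in the empirical rule are areas under the normal curve, so they can be checked directly. This is a short Python sketch using the standard library's `statistics.NormalDist`; the helper name is an illustrative assumption.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)


def share_within(k):
    """Share of a normal distribution within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


shares = {k: share_within(k) for k in (1, 2, 3)}
for k, share in shares.items():
    print(k, round(100 * share, 1))
```

The same areas hold for any normal distribution, whatever its mean and standard deviation, because the rule is stated in units of standard deviations.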
Estimation: The statistical process is circular: we start off by choosing the
characteristic of the population that interests us. That characteristic is called a
parameter. We draw a random sample from that population and measure the same
characteristic in the sample, which gives us a statistic. We use the statistic to
infer information about the same quantifiable feature in the population. The
statistic (for example an average or a proportion) is an estimate of the population
parameter. The process of finding how close the estimate is to the parameter, and
building a confidence interval around the estimate, is inference. We test claims
about the parameter using hypothesis-testing, which is closely related to inference.
Hypothesis-testing is the final part of this chapter.
Excel – more screencasts
Inserting the data analysis toolpak add-in²
The normal distribution³
²https://www.youtube.com/watch?v=Ods1Z5BLD9g&list=UUVZFhqXDgJgIRq-ljotClgQ
³https://i1.ytimg.com/vi/1qn0EIm5-PQ/mqdefault.jpg
Exponential Smoothing
Hypotheses and p values: A hypothesis is a claim someone makes about a
population. For example, a coffee importer (me!) is told by the supplier that the
mean weight of the sacks of coffee beans is 40 kg. That is the claim. I want to test
that claim because I am paying for the coffee. All the sacks of coffee in the
shipment are the ‘population’. I take a random sample of sacks to check. I find that
the mean weight of the sample is 38 kg. Do I reject the whole shipment, or say I
must have got a low sample, let’s take them? The hypotheses here are: the null
hypothesis, that the population mean weight is 40 kg, and the alternative hypothesis,
that the population mean weight is not 40 kg. Hypothesis testing provides an answer
to the test in the form of a p value, defined below.
A p value is the probability of finding a sample with characteristics at least as
extreme as those measured IF
the null hypothesis was true. In the case of the coffee above, let’s say that we
calculated the p value (see Chapter 9) and it was 0.1. That means that there is a 10%
chance of finding a sample with a
mean weight as far from 40 kg as our 38 kg if the real true mean weight of the
shipment was 40 kg. The
most frequently used cut-off point is 0.05 or 5%. If the p value was smaller than 0.05
then we would say that there is a less than 5% chance of seeing a sample like this if
it came from a
population with the claimed weight. We would therefore conclude that the shipment
isn’t of the claimed weight and should be rejected.
Hypothesis writing and testing
A hypothesis is a statement or claim. For example, ‘the mean weight of the coffee
packs is 3 kg’ is a hypothesis or claim. The hypothesis needs to be tested so that we
know whether the claim stands or is shown to be false.
Hypotheses are written so that the claim, and the counter-claim, are distinct. There
are two hypotheses, one for the claim and the other for the counterclaim. H0, the
‘null hypothesis’, contains the claim. Ha, the ‘alternative hypothesis’, contains the
counterclaim. We write hypotheses by assuming that the null hypothesis is true.
The null hypothesis for the coffee example above would be
$$H_0: \mu = 3 \text{ kg}$$
This is read as: The null hypothesis is that the population mean weight (mu) of the
coffee is equal to 3 kg.
The alternative hypothesis (Ha) is that the population mean weight is not 3 kg,
written:
$$H_a: \mu \neq 3 \text{ kg}$$
We test the hypotheses using information (statistics) from the sample (the mean
weight, the sample size, and the standard deviation). Note that the counterclaim says
nothing about the mean weight in the population being more than or less than 3 kg,
it simply says that it is not equal to 3 kg. Also, importantly, the equality sign appears in
the null. This is always the case, whether the inequality is ‘strict’ or ‘weak’.
We can also test whether the population mean is smaller than or larger than some
constant. The shipper claims that the weight is ‘at least’ 3 kg. We write this as:
$$H_0: \mu \geq 3 \text{ kg}$$
Note that the equality remains with the null. The alternative is then the negation of
the null, and this must be that the weight is ‘less than 3 kg’.
$$H_a: \mu < 3 \text{ kg}$$
Convince yourself that the only possible way to negate H0 is with the alternative
hypothesis. If the claim was that the average weight is less than or equal to 3 kg,
we simply switch over the
inequality signs. There are other forms of hypothesis writing which we will work
through in this book. However, they all share the rule that the equality sign goes in
the null; and the alternative negates the null.
Inference is the key statistical process. It is the way in which information about
populations can be gleaned from surprisingly small samples. An example is polling
before an election. Provided the sample is drawn randomly and is representative of
the population, a sample of perhaps only 1,000 people can provide quite accurate
estimates of the proportion of the population predicted to vote for a particular
political party. The estimates are used to make inferences about the population. We
won’t know until election day whether or not the inferences were correct.
Inferential statistics is a very powerful technique which allows us to make the jump
from a sample to a population. We can use a small sample to make inferences about
the characteristics of a population about which we perhaps know very little. The key
is to obtain a sample which represents the population as accurately as possible. For
this, both randomization and good survey design are essential. We will discuss
these two important issues later in the chapter. Recently, a third class of statistical
methodologies has arisen. This is machine learning, which takes advantage of new
analytic techniques and greater computing power. We won’t be able to go into these
techniques in this book, unfortunately.
Lurking variables: It happens that there is close monthly correlation between
the consumption of ice cream and deaths by drowning. Is it that people eat an ice
cream, go swimming and then drown? The correlation misses the lurking variable
which is temperature. In hot weather, people both go swimming more often and buy
ice creams. The ice cream and the drowning variables are both correlated with
temperature and so are correlated with each other. Another example: it seems that
there is a correlation between kids needing reading glasses and sleeping with the
light on. So you
should switch the light off? Not so fast. The lurking variable is the short-sightedness
of the parents, not the kids. The parents leave the light on so they can see
the kid. The kids inherit bad eyesight from their parents.
Parameter: a quantifiable characteristic of interest in the population. The
average weight of all of the fish in the Fraser River is a parameter. The same
characteristic in the sample is known as a statistic. We can easily find this statistic
because the sample is only a subset of, and therefore smaller than, the population. We
usually don’t know the actual value of a parameter, but that doesn’t matter because
we can estimate it from the sample.
Point estimate: a statistic, such as the mean or average. It is called a point because it
is an exact number such as 4.21. It is not a range, it is a point. It is called an estimate
because it is an estimate of the population parameter. Therefore 4.21 is an estimate
of what the same quantifiable feature would be in the population. You can think of a
point estimate as being the ‘best guess’ for the population parameter.
If we selected a different sample from the population, then almost certainly the
point estimate would be different, giving a new best guess. If we continue to take
samples, then we are going to get as many best guesses as we have samples. The
range of best guesses is the sampling variation.
Poisson distribution
The Poisson probability distribution is, as with the Binomial, a discrete probability
tool. We can use it for calculating the probability of events which occur over time or
space. For example, number of mistakes in a given number of lines of code. The
formula is:
$$f(x) = \frac{\mu^{x} e^{-\mu}}{x!}$$
where f(x) is the probability of x occurrences in a given interval; mu (µ)
is the expected value or mean number of occurrences in the given interval (or area);
and e is a constant, approximately 2.71828.
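As an illustration of the Poisson formula, here is a minimal Python sketch. The mean of 2 mistakes per block of code is a hypothetical figure, and the function name is an assumption of this sketch.

```python
import math


def poisson_pmf(x, mu):
    """Probability of exactly x occurrences when the mean number is mu."""
    return mu ** x * math.exp(-mu) / math.factorial(x)


mu = 2.0  # hypothetical: on average 2 mistakes per 100 lines of code

p_none = poisson_pmf(0, mu)                               # no mistakes at all
p_at_most_two = sum(poisson_pmf(x, mu) for x in range(3))  # 0, 1 or 2 mistakes

print(round(p_none, 4), round(p_at_most_two, 4))
```

Unlike the binomial, there is no fixed number of trials here: x can be any whole number from zero upwards, although large values quickly become very unlikely.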
Population: the group of individuals about whom we want some information.
Each individual in the population is called an element or sometimes a case.
Examples of populations are all the fish in the Fraser River, Toyota cars made in
2012, or indeed the population of a country. It is usually not possible to deal with data
at the population level because it is either expensive or impossible to obtain. As a
result we draw a sample, or subset, from the population.
Randomisation is the technique of selecting elements from the population to build
a sample so that each element has the same known probability of being selected.
This technique greatly reduces bias, described in more detail in the following
chapter.
Regression
Sampling frame: a list of the items to be sampled. Imagine that you wish to survey
100 students from a University. The University provides you with a list of students
and their ID numbers. The students form the population, and the list is the sampling
frame. You would use a random sampling method, covered below, to randomly select
100 students from the University. The 100 students form the sample.
Standard error: Try this thought experiment. Let’s say that there are 1000 jelly
babies in a bowl and for some reason you’d like to know the mean weight of a jelly
baby without having to weigh all of them. You have a choice of taking a random
sample of either n = 100 or n = 500. Which sample would produce the more accurate
estimate? Obviously the sample size of 500, because it is ‘closer’ to the population size.
Yet it still won’t be completely accurate because it could happen that the sample you
select is unrepresentative in some way: perhaps you happened to select all the heavy
ones? We can
measure the amount of error with the standard error, which helps us to determine how
‘close’ we are to the
unknown parameter. Below I show how to calculate the standard error (SE). Here is
the standard error for a statistic such as a mean:
$$SE = \frac{\sigma}{\sqrt{n}}$$
In words, the standard error is the population standard deviation divided by the
square root of the sample size, denoted by n. Note that as the sample size increases,
the standard error decreases.
This is in line with your earlier intuition that the larger the sample size, the
more accurate the estimate. Usually we don’t know ‘sigma’, σ, so we replace it with
‘s’ which is the standard deviation of the sample.
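The effect of sample size on the standard error is easy to see with numbers. In the Python sketch below the standard deviation of 0.5 grams for jelly baby weights is a hypothetical figure, and the function name is an illustrative assumption.

```python
import math


def standard_error(sd, n):
    """Standard error of a sample mean: standard deviation over root n."""
    return sd / math.sqrt(n)


sd_weight = 0.5  # hypothetical standard deviation of jelly baby weights, in grams

se_100 = standard_error(sd_weight, 100)
se_500 = standard_error(sd_weight, 500)

print(round(se_100, 4), round(se_500, 4))
```

Note that five times as many jelly babies does not give a standard error five times smaller: because of the square root, the improvement is a factor of about 2.2.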
Transformations: this is a technique sometimes used in regression. The object is
usually to transform the distribution of the dependent variable so that it more
closely resembles a normal distribution. We do this because the theoretical
underpinning of linear regression is based on an assumption of normality in the
dependent variable. The dependent variable is typically transformed by taking its
natural logarithm and then using the transformed variable in the regression. This is
often used when the dependent variable has only positive values and is highly
skewed to the right. Independent variables can also be transformed, such as by
adding a squared version of the variable in the regression. Independent variables
can also be transformed by
z scores: A z score is a measure of how far an observation contained within a
variable is from the mean of that variable, in terms of standard deviations. It is
quite easy to calculate within Excel, and this YouTube video shows you how:
z scores YouTube⁴.
⁴https://www.youtube.com/watch?v=FSZpynSBev8
Why do this? By dividing the difference between the value of the
observation and the mean of the variable by the standard deviation of the variable,
like this:
$$z_i = \frac{x_i - \bar{x}}{s}$$
the differences are in units of standard deviations. We can use this for:
• Comparing the distributions of two completely different variables
• Detecting outliers. An outlier is an observation which is very far from the
rest of the distribution. In a normal distribution, 99.7% of the observations will be
within three standard deviations of the mean. So an observation which is more than
three standard deviations from the mean is likely to be questionable. This
can be good or bad:
  • Bad because perhaps someone made a data entry error which we can catch.
  • Good because perhaps there is something anomalous and interesting
  about that ‘weird’ observation.
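Outlier detection with z scores can be sketched in a few lines of Python. The data below are invented for illustration: twenty ordinary values around 50 and one suspicious value of 95.

```python
import statistics

# Hypothetical observations: twenty typical values and one suspicious one.
data = [48, 49, 50, 51, 52] * 4 + [95]

mean = statistics.mean(data)
sd = statistics.stdev(data)  # sample standard deviation, s

# z score: distance from the mean in units of standard deviations.
z_scores = [(x - mean) / sd for x in data]

# Flag anything more than three standard deviations from the mean.
outliers = [x for x, z in zip(data, z_scores) if abs(z) > 3]

print(outliers)
```

Whether 95 is a typing error or a genuinely interesting observation is a question for whoever knows the data. The z score only tells you where to look.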