Facts are stubborn, but statistics are more pliable.

Data Acquisition and Statistical Analysis
“Facts are stubborn, but statistics are more pliable.”
- Mark Twain
Laura Saba, PhD
Assistant Professor
Department of Pharmaceutical Sciences
Skaggs School of Pharmacy and Pharmaceutical Sciences
University of Colorado Denver
[email protected]
Reproducible Research
“the idea that the ultimate product of research is the paper along with the full computational environment used to produce the results in the paper such as the code, data, etc. necessary for reproduction of the results and building upon the research”
- Wikipedia
NSF Definition of Reproducibility/Replicability
• “reproducibility refers to the ability of a researcher to duplicate the results of a prior study using the same materials as were used by the original investigator”
• “replicability refers to the ability of a researcher to duplicate the results of a prior study if the same procedures are followed but new data is collected”
Reproducible Research
Fig 1. Studies reporting the prevalence of irreproducibility.
Freedman LP, Cockburn IM, Simcoe TS (2015) The Economics of Reproducibility in Preclinical Research. PLoS Biol 13(6): e1002165.
Cost of Irreproducible Research
Fig 2. Estimated US preclinical research spend and categories of errors that contribute to irreproducibility.
Freedman LP, Cockburn IM, Simcoe TS (2015) The Economics of Reproducibility in Preclinical Research. PLoS Biol 13(6): e1002165.
Objectives
• Data Acquisition
– Learn general guidelines for unbiased data collection
– Know the top 3 things to consider when storing data
– Explain who owns research data, and with whom and how it should be shared
• Statistical Analysis
– Learn how to identify outliers and what to do with them
– Recognize the tradeoffs between suitable, better, and best methods for statistical analyses
– Learn to identify common mistakes and deceptions in the display and interpretation of results
Data Acquisition
Data Collection
1. Appropriate Methods
– Garbage in, garbage out (biased data collection, e.g., sample selection, gives biased results)
2. Attention to detail
– Accuracy in recording, interpretation, publications
3. Authorized
– HIPAA, hazardous materials, copyrights, etc.
4. Recording
– Hard copy evidence should be entered into a numbered, bound notebook
– Electronic evidence should be validated in some way to assure that it was actually recorded on a particular date and not changed at some later date
– Not only should data derived from the research be accurately recorded, but also detailed information on procedures, including materials used, e.g., chemical agents
Taken from ORI Introduction to RCR (http://ori.hhs.gov/education/products/RCRintro/)
Pannucci CJ, Wilkins EG. Identifying and avoiding bias in research. Plast Reconstr Surg. 2010 Aug;126(2):619-25.
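One possible way to act on the recording guideline for electronic evidence is to store a cryptographic fingerprint of each data file together with the time it was recorded. The sketch below is only an illustration: the function names and the JSON-lines manifest layout are assumptions, not part of the ORI guidance, and a self-recorded timestamp is not independent proof of the date (a lab would still need a trusted log or a signed electronic notebook for that).

```python
import hashlib
import json
from datetime import datetime, timezone


def sha256_of_file(path):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def record_file(path, manifest_path):
    """Append the file's fingerprint and a UTC timestamp to a manifest.

    The manifest format (one JSON object per line) is a hypothetical choice.
    """
    entry = {
        "file": str(path),
        "sha256": sha256_of_file(path),
        "recorded_utc": datetime.now(timezone.utc).isoformat(),
    }
    with open(manifest_path, "a", encoding="utf-8") as manifest:
        manifest.write(json.dumps(entry) + "\n")
    return entry


def is_unchanged(path, entry):
    """True if the file still matches the fingerprint stored in `entry`."""
    return sha256_of_file(path) == entry["sha256"]
```

Calling `record_file` when a data file is first saved, and `is_unchanged` before each later analysis, gives a simple check that the file was not altered in between.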
Reasons to keep accurate records
• Reproducibility
• Future analyses
• Investigations of misconduct
• Proving ownership of intellectual properties
• Others?
Case Study
from Responsible Conduct of Research
• Dr. Z is mentoring a “promising” medical student over the summer in his research lab
• Student’s project:
– cancer cell line that requires 3 weeks to grow in order to test for a specific antibody
– the student has already written a short paper on his work
• Dr. Z’s dilemma:
– after going over the raw data, some data were on pieces of yellow pads without clear identification of which experiment the data came from
– some of the experiments were repeated several times without explanation as to why
– Dr. Z is not happy about the data, but doesn’t want to discourage the student from pursuing a career in research
Case Study
from Responsible Conduct of Research
QUESTIONS
• What is the primary responsibility of the mentor?
• Should the student write a short paper and send it for publication?
• Should the mentor write a short paper and send it for publication?
• If you were the mentor, what would you do?
Data Acquisition
Data Storage
“Over time, data, as the currency of research, become an investment in research. If the data are not properly protected, the investment, whether public or private, could become worthless”
– ORI Introduction to RCR
Considerations When Storing Data/Research
• Catastrophe
– Lab notebooks are in a “safe” place
– Electronic data are backed up and stored in a separate location
– Samples are stored properly to avoid contamination
• Confidentiality
– Information on human subjects – see HIPAA guidelines
– Information on intellectual property
• Period of retention
– NIH generally requires 3 years after project end
– Other agencies may require up to 7 years after project end
– Other unforeseen uses…
Taken from ORI Introduction to RCR (http://ori.hhs.gov/education/products/RCRintro/)
Data Acquisition
Data Ownership/Sharing
Who owns the data?
• Researchers
• Funders
– Grants vs. Contracts
• Research Institutions
– e.g., “for the most part, NIH makes awards to institutions and not individuals” – NIH Data Sharing Policy and Implementation Guidance
• Data Sources
– Subjects
– Countries
Illustration by David Zinn
Taken from ORI Introduction to RCR (http://ori.hhs.gov/education/products/RCRintro/)
In the headlines…
http://www.sandiegouniontribune.com/news/2015/jul/02/ucsd-sues-usc-aisen/
A few interesting quotes from the NIH Data Sharing Policy and Implementation Guidance on Data Sharing
• “Final research data are recorded factual material commonly accepted in the scientific community as necessary to document, support, and validate research findings.”
• “NIH expects timely release and sharing of data to be no later than the acceptance for publication of the main findings from the final dataset.”
• “For the most part, it is not appropriate for the initial investigator to place limits on the research questions and methods other investigators might pursue with the data.”
Case Study
from Responsible Conduct of Research
Drs. K and W are conducting an NIH-funded long-term (25 years), observational study of the health of pesticide applicators.
– Initial health assessment (health history, physical exam, blood and urine tests, DNA sample, and dust samples)
– Yearly health surveys and full health assessment every 4 years
After the first 15 years:
• Published more than a dozen papers from the database
• Require an elaborate data-sharing agreement before releasing the data
Drs. K and W’s dilemma is that they recently received requests for access to the database from:
• A pesticide company
• A competing research team
• A radical environmental group with an anti-pesticide agenda
Case Study
from Responsible Conduct of Research
Drs. Kessenbaum and Wilcox (Drs. K and W) recently received requests for access to the database from a pesticide company, a competing research team, and a radical environmental group with an anti-pesticide agenda.
QUESTIONS
• How should Drs. K and W handle these requests to access their database?
• Is it ethical to require people who request data to sign elaborate data-sharing agreements?
Statistical Analysis
Tips for Reproducible Statistical Analyses
1. ALWAYS keep a version of the “most raw” data
– Record when and where it was created, so you can easily tell if it has been changed since creation
2. Use a scripting language
– Programs like R and SAS allow you to follow your steps exactly if you (or someone else) had to redo your analysis
– Easily execute and document QC steps
– Avoid copy/paste errors
3. Add comments/notes directly to program
– Why are you doing this step?
– What is the goal of this step?
4. Export precise tables/figures from program
– Avoid transposition errors
– Save time/energy where changes are requested in initial steps
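The tips above can be sketched in a few lines of script. The slides name R and SAS; Python is used here only for illustration, and the function names, file names, and column name are hypothetical. The raw file is opened read-only, a QC check is written into the script, each step carries a comment explaining why it exists, and the result table is written by the program rather than retyped.

```python
import csv
import statistics


def summarize(raw_path, column):
    """Read one numeric column from the raw file without modifying the file."""
    # Why: the "most raw" data must stay untouched, so it is opened read-only.
    with open(raw_path, newline="", encoding="utf-8") as handle:
        values = [float(row[column]) for row in csv.DictReader(handle)]
    # QC step: fail loudly instead of silently analysing an empty column.
    if not values:
        raise ValueError(f"no rows found in {raw_path}")
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values) if len(values) > 1 else float("nan"),
    }


def export_table(summary, out_path):
    """Write the summary straight to CSV so no number is retyped by hand."""
    # Goal: avoid transposition errors and make re-runs cheap when an
    # earlier step changes.
    with open(out_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["statistic", "value"])
        for name, value in summary.items():
            writer.writerow([name, value])
```

A run might call `summarize("raw_weights.csv", "weight_lbs")` and pass the result to `export_table(..., "table1.csv")`; both file names are placeholders.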
Statistical Analysis
Outliers
[Figure: two scatter plots of Weight (lbs) against Height (inches). Panel “Without Outlier”: correlation coefficient = 0.7, p-value < 0.0001. Panel “With Outlier”: correlation coefficient = 0.06, p-value = 0.5636.]
Outlier Mitigation
1. Identify
– More than 2 or 3 standard deviations from the mean
– Unrealistic values
– Inconsistent values
2. Investigate
– Was there a technical issue? A typo? Etc.?
– Is it even a possible true value?
3. Remediate with DOCUMENTATION
– Make a rule and write it down
4. Sensitivity analysis
– What would have happened if you hadn’t eliminated values? Is your result robust?
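The identify and sensitivity-analysis steps above can be sketched as follows. The data are invented for illustration (they are not the data behind the slides), the 3-standard-deviation rule is just one of the thresholds mentioned, and the function names are hypothetical. Note that in small samples a single extreme value inflates the standard deviation, so a fixed threshold can miss it.

```python
import math
import statistics


def flag_outliers(values, threshold=3.0):
    """Return indices of values more than `threshold` SDs from the mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [i for i, v in enumerate(values) if abs(v - mean) > threshold * sd]


def pearson_r(x, y):
    """Pearson correlation coefficient, computed from its definition."""
    mx, my = statistics.mean(x), statistics.mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical heights (inches) and weights (lbs); the last weight looks like
# a data-entry error and would need to be investigated before any removal.
heights = [60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 62]
weights = [110, 118, 121, 130, 133, 141, 148, 152, 160, 163, 172, 178, 185, 1000]

# Step 1 (identify): apply the written-down rule.
flagged = flag_outliers(weights, threshold=3.0)

# Step 4 (sensitivity analysis): report the result both ways.
keep = [i for i in range(len(weights)) if i not in flagged]
r_with = pearson_r(heights, weights)
r_without = pearson_r([heights[i] for i in keep], [weights[i] for i in keep])

print("flagged indices:", flagged)
print("r with all values:", round(r_with, 3))
print("r without flagged values:", round(r_without, 3))
```

Reporting both correlations, together with the rule used to flag values, lets a reader judge whether the conclusion depends on the exclusion.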
Case Study
from Responsible Conduct of Research
Anonymous survey of college students on opinion about academic integrity
• 20 questions (Likert scale)
• 10 open-ended questions
• 480 surveys administered (320 responses)
Issues:
1. 8 surveys appear as practical jokes (obscenities, additional numbers added to scale, etc.)
– Some questions appear usable but some are not
2. 35 respondents appear to be confused about scale
– They answer “5” when “1” is more logical given their other answers
3. 29 surveys have names on them when respondents were instructed not to do so
Case Study
from Responsible Conduct of Research
QUESTIONS:
1. How should the researchers deal with these issues with their data?
2. Should they try to edit/fix surveys that have problems?
3. Should they throw away any surveys? Which ones?
4. How might their decisions concerning the disposition of these surveys affect their overall results?
Statistical Analysis
Suitable, better, and best methods for analysis
Methods for Statistical Analysis
• What is the norm in the field?
• A spectrum of alternative statistical methods
– Bias, inappropriate method
– General method with stated assumptions
– Most statistically rigorous method that evaluates most/all assumptions
• Along the spectrum: increasing scope, increasing monetary and time costs, increasing precision
Know the assumptions of any test
• Is the same subject/sample measured more than once?
• Are the data normally distributed?
• Is there equal variance in each group?
• Are subjects randomly assigned to a treatment? Are they matched?
• Is temporal order assumed?
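To illustrate why the first question in the list above matters, the sketch below computes two t statistics for the same made-up numbers: one that treats the values as paired measurements on the same subjects, and one (Welch's form) that treats them as two independent groups. The data and function names are hypothetical, and only the test statistics are computed, not p-values.

```python
import math
import statistics


def paired_t(before, after):
    """t statistic for paired measurements on the same subjects."""
    diffs = [b - a for b, a in zip(before, after)]
    standard_error = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return statistics.mean(diffs) / standard_error


def welch_t(group_a, group_b):
    """t statistic that treats the two groups as independent samples."""
    standard_error = math.sqrt(
        statistics.variance(group_a) / len(group_a)
        + statistics.variance(group_b) / len(group_b)
    )
    return (statistics.mean(group_a) - statistics.mean(group_b)) / standard_error


# Hypothetical measurements on the same five subjects, before and after.
before = [120, 135, 150, 165, 180]
after = [118, 132, 148, 161, 177]

print("paired t:", round(paired_t(before, after), 2))
print("independent-groups t:", round(welch_t(before, after), 2))
```

Every subject drops by a few units, so the paired statistic is large, while the independent-groups statistic is small because the between-subject spread swamps the change. Choosing a test whose assumptions do not match the design can hide or invent an effect.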
Statistical Analysis
Display and interpretation of results
Displaying Results
Confident in obtaining first NIH R01 Grant?
Interpreting Results
• Association vs. Causation
– Causation can only be proven in a carefully designed and carefully controlled prospective study
– Eating more chocolate will not cause you to become a Nobel Laureate
• Potential Confounding Issues
– Confounding variable – “extraneous variable in a statistical model that correlates with both the dependent variable and the independent variable” – Wikipedia
– e.g., Coffee drinkers are more likely to get lung cancer
– Smokers are more likely to be coffee drinkers and smokers are more likely to get cancer
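The coffee and smoking example can be made concrete with a small stratified calculation. The counts below are invented purely for illustration and are not real epidemiological data. They are chosen so that coffee has no association with cancer within smokers or within non-smokers, yet appears harmful when the two groups are pooled.

```python
# (smoker, coffee drinker) -> (number of people, number with lung cancer)
# All counts are hypothetical.
counts = {
    (True, True): (80, 16),
    (True, False): (20, 4),
    (False, True): (20, 1),
    (False, False): (80, 4),
}


def risk(groups):
    """Proportion with cancer across the listed (smoker, coffee) groups."""
    people = sum(counts[g][0] for g in groups)
    cases = sum(counts[g][1] for g in groups)
    return cases / people


def risk_ratio(smoker_levels):
    """Risk in coffee drinkers divided by risk in non-drinkers."""
    coffee = risk([(s, True) for s in smoker_levels])
    no_coffee = risk([(s, False) for s in smoker_levels])
    return coffee / no_coffee


pooled = risk_ratio([True, False])   # ignores smoking status
among_smokers = risk_ratio([True])   # smokers only
among_nonsmokers = risk_ratio([False])  # non-smokers only

print("pooled risk ratio:", pooled)
print("risk ratio among smokers:", among_smokers)
print("risk ratio among non-smokers:", among_nonsmokers)
```

In these made-up counts the pooled risk ratio is above 1 only because most coffee drinkers are smokers; within each smoking stratum the ratio is 1.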
Highlights of Ethical Guidelines for Reporting Statistical Analysis/Results in Publications
From American Statistical Association’s Ethical Guidelines for Statistical Practice
1. Report statistical and substantive assumptions made in the study.
2. Account for all data considered in a study and explain the sample(s) actually used.
3. Report the sources and assessed adequacy of the data.
4. Report the data cleaning and screening procedures used.
5. Clearly and fully report the steps taken to guard validity. Address the suitability of the analytic methods and their inherent assumptions relative to the circumstances of the specific study.
Acknowledgements/References
• Dr. Paula Hoffman
• Dr. Brandie Wagner
• ORC and CCTSI
References
Responsible Conduct of Research by Adil E. Shamoo and David B. Resnik. Second Ed. Oxford University Press, 2009.
NIH Data Sharing Policy and Implementation Guidance (https://grants.nih.gov/grants/policy/data_sharing/data_sharing_guidance.htm), March 5, 2003.
Ethical Guidelines for Statistical Practice, American Statistical Association (http://www.amstat.org/ASA/Your-Career/Ethical-Guidelines-for-Statistical-Practice.aspx), April 2016.
Introduction to RCR – 6. Data Management Practices, Office of Research Integrity, US Department of Health and Human Services (http://ori.hhs.gov/education/products/RCRintro/c06/0c6.html), Revised Edition August 2007.