Citing data

INFO 7470/ECON 7400/ILRLE 7400
Citing Literature, Citing Data
John M. Abowd and Lars Vilhuber
March 11, 2013
CITING LITERATURE
3/11/2013
© John M. Abowd and Lars Vilhuber 2013,
all rights reserved
2
Citing Literature
• Why? To prevent plagiarism, to establish
provenance of ideas
• How? Why do we cite as we do – publishing
cycles, uniqueness of sources
• Plagiarism: appropriating other people’s ideas
• Examples (Bruno Frey)
• Citing literature today: does it still work?
– Issues of versioning of articles
– Revisions/retractions/corrections
Why Do We Cite Literature?
• To give credit to the original authors of ideas
– To not give credit is plagiarism
• To allow readers to find the information cited
– Trace the evolution of ideas
– Document cited results
Plagiarism
Source: http://www.elsevier.com/authors/author-rights-and-responsibilities#responsibilities via RePEc
• More easily detected nowadays
– http://plagiarism.repec.org/offenders.html
– http://ideas.repec.org/a/che/chepap/v20y2008i1p20-25.html
• Software
– http://plagiarism.bloomfieldmedia.com/zwordpress/software/wcopyfind/
– Turnitin
– AEA uses http://www.aeaweb.org/crosscheck.php
Prominent Recent Examples of Plagiarism
• Bruno Frey
– AEA PP, others (see FreyPlag_Wiki but also
responses by Frey)
• German ministers
– Defense: Karl-Theodor Maria Nikolaus Johann Jacob
Philipp Franz Joseph Sylvester Freiherr von und zu
Guttenberg [German source]
– Education: Annette Schavan [German source]
• Russian presidents? [2006]
How Do We Cite?
• Multiple typographical standards
• Generally enough unique keys to correctly
identify the source
• Current conventions driven to a large extent
by the publishing model in effect through the
end of the 20th century (see also Margo
Anderson’s Session 1 on data publishing)
Examples
Based on and using images from http://bcs.bedfordstmartins.com/resdoc5e/RES5e_ch09_s1-0002.html (2013-03-08)
Examples
Declining uniqueness:
Online documents:
Permanent Links
• The URL (Uniform Resource Locator, or Web address) may
be temporary and may not function in the near or far future
• Links designated as “permanent”, “persistent”, or “stable”
are designed specifically to remain active and usable over
time.
• Permanent links
– Digital Object Identifiers (DOIs) (more formally: the Handle System)
• actionable, interoperable, persistent link
– Other Types of Permanent Links
• JSTOR (old)
• EBSCO
Adapted from http://library.concordia.ca/services/users/faculty/permanentlinks.php
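In practice, a DOI becomes an actionable link simply by prefixing the public resolver (dx.doi.org at the time of these slides; doi.org today). A minimal sketch:

```python
def doi_to_url(doi: str) -> str:
    """Turn a bare DOI into an actionable, persistent link via the public resolver."""
    return "https://doi.org/" + doi.strip()

# The resolver redirects to wherever the object currently lives, so the
# citation stays valid even if the publisher's own URL changes.
print(doi_to_url("10.1086/209919"))  # → https://doi.org/10.1086/209919
```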
DOI
DOI
DOI in References
Up to Here …
• … nothing new, or mostly
• Starting in 5th grade, we’ve been thoroughly
trained in citing our “sources”
• Or have we?
CITING DATA
Neal (1999)
• http://www.jstor.org/stable/10.1086/209919
References
References
No Data …
The Problem
• I want to replicate Neal’s analysis
• Process:
– Download NLSY data (latest!)
– Read article, replicate his described analysis in
software of my choice
– Get results, compare
• What happens if the results are not the same?
– Qualitatively
– Quantitatively
Attempts to Falsify
“5. Every genuine test of a theory is an attempt to
falsify it, or to refute it [...]
6. Confirming evidence should not count except
when it is the result of a genuine test of the theory;
and this means that it can be presented as a serious
but unsuccessful attempt to falsify the theory. (I
now speak in such cases of ‘corroborating
evidence.’)”
Karl Popper, “Science: Conjectures and Refutations,” p. 47
Replication Study
• Different results driven by
– Differences in data
– Differences in software
– Differences in implementation
– Errors by the original author…
• Start by keeping the setup as close to the original as possible
– Same data
– Same software
– Same implementation (programs)
Data for Replication
• What does “same data” imply?
– Ability to find the data
– Assurance that the data are, in fact, the same
• Data curation and citation are critical to the
replication exercise
• Increasing impetus by funding agencies
– NSF
– NIH
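One way to gain assurance that the data are, in fact, the same is to publish a cryptographic checksum alongside the data citation, which a replicator then recomputes on their own copy. A minimal sketch (the file name is hypothetical):

```python
import hashlib

def file_sha256(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA-256 digest of a data file, reading in chunks
    so that even large extracts fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# A replicator compares this against the digest the original author published:
# file_sha256("nlsy_extract.csv")  # hypothetical extract file
```

If the digests match, the two parties are analyzing byte-identical data; if not, the "same data" premise of the replication already fails.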
Not Futile
• Neal’s JOLE article is widely cited (60 citations on
RePEc, an undercount)
• Only instance of a substantive correction of a
JOLE article (as of 2013-03-08, search term:
“erratum”)
• Notable because the author publishing the erratum
– was referring to a (seminal) article from 5 years earlier
– was the editor-in-chief at the time
Example: JOLE
• “In the April 1999 issue of this Journal, I published an
article entitled “The Complexity of Job Mobility among
Young Men” (Journal of Labor Economics 17, no. 2
[1999]: 237–61). Recently, I began a dialogue with
another researcher who was attempting to replicate
the empirical results in that article. Through this
dialogue, I learned that, for some workers, I erred in
constructing my original counts of the number of
employer changes within specific careers.1 I have
corrected this error and have found that, given correct
variable constructions, several empirical results differ
quantitatively, although not qualitatively, from the
results reported in the original article.”
Other Items to Note
• Not available if not a subscriber …
• The original author’s publication count increased by 1
• The discrepancy’s reporter (Ronni Pavan) was not an author
on the erratum (Pavan did publish in the same journal in
2011)
• Neither the original data nor the corrected data (and the
associated programs) are available from the journal (they
are probably available from the author).
– The original data are public-use NLSY data, referenced as “1979-92”
• The online version of the original article contains a link to
the erratum (and is found when searching for “Erratum”)
Why Do We Cite Data This Way?
• Used to be sufficient
– Data were the same as a book (see Margo’s Session 1)
– If not, they were rarely modified (punch cards, tapes)
– Example: “NLSY 1979-1992” was a well-defined CD-ROM
• No longer sufficient
– Where is the NLSY CD-ROM?
– Which version does your library have?
Publications by the Census Bureau
• Decennial Census: SF1, SF2, SF3 … once every
ten years
• Economic Census: Limited number of tables
every 5 years
• LEHD: 4860 tables every three months
Improvements
• https://usa.ipums.org/usa/cite.shtml
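IPUMS’s cite page spells out a full recommended citation, version number included. As a sketch of how such a data citation might be recorded in a reference manager (the entry below paraphrases the circa-2010 recommendation; the exact wording on cite.shtml governs):

```bibtex
@misc{ipums_usa_v5,
  author    = {Ruggles, Steven and Alexander, J. Trent and Genadek, Katie and
               Goeken, Ronald and Schroeder, Matthew B. and Sobek, Matthew},
  title     = {Integrated Public Use Microdata Series: Version 5.0
               [Machine-readable database]},
  publisher = {University of Minnesota},
  address   = {Minneapolis, MN},
  year      = {2010}
}
```

The explicit version number is what makes this citation unambiguous in a way that “NLSY 1979-1992” no longer is.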
Improvements
These Are the Easy Cases
• NLSY, IPUMS-USA, ICPSR data
– Public-use datasets or
– Data distributor is also data custodian –
guarantees availability of the data
• Many other public-use datasets
– QCEW – none (a version can be inferred from the latest
date on file, but is not officially defined)
– QWI – version.txt, but hidden
– BDS – “yearly” releases (two listed, in fact three)
Data Availability Not a New Issue
• “In its first issue, the editor of Econometrica
(1933), Ragnar Frisch, noted the importance
of publishing data such that readers could
fully explore empirical results. Publication of
data, however, was discontinued early in the
journal’s history. [...] The journal arrived full-circle in late 2004 when Econometrica
adopted one of the more stringent policies on
availability of data and programs.”
http://www.econometricsociety.org/submissions.asp#4 as cited in Anderson et al (2005)
Citing Restricted-use Data
• Abowd, Kramarz, Margolis (1999): “The data
used in this paper are confidential but the
authors’ access is not exclusive.”
• But
– No current statistical agency has in place a way to
uniquely cite data
– Black box of restricted-access data enclaves
– Worries about “leakage” of confidential
information
Declining Role of Public-use Data
(Chetty, 2012)
Increasing Use of Administrative Data
(Chetty, 2012)
Not Just in Social Sciences
• Nature, 2012 “Many of the emerging ‘big
data’ applications come from private sources
that are inaccessible to other researchers. The
data source may be hidden, compounding
problems of verification, as well as concerns
about the generality of the results.”
Huberman, Nature 482, 308 (16 February 2012), doi:10.1038/482308d
Verification Is Important
• Falsifying data
– Andrew Wakefield (autism and vaccines)
– Yoshitaka Fujii (fabricated data in 172 out of 249
papers)
• “Believe it or not: how much can we rely on
published data on potential drug targets?”
doi:10.1038/nrd3439-c1 – Drug maker cannot
replicate more than 20-25% of findings
• “Why Most Published Research Findings Are
False” Ioannidis JPA (2005)
doi:10.1371/journal.pmed.0020124
But …
• Even studies that worry about replication… do
not provide their own data in a replicable way
“The questionnaire can be obtained from the
authors.” (doi:10.1038/nrd3439-c1)
Other Approaches:
Replication for a Fee
• “The Reproducibility Initiative takes advantage of
Science Exchange’s existing network of more than
1,000 core facilities and commercial research
organizations. Researchers submit their studies (…)
[which] will attempt to replicate the studies for a fee.
• Submitting researchers will have to pay for the
replication studies (…) one-tenth that of the original
study (…) 5 percent transaction fee to Science
Exchange.
• Participants will remain anonymous unless they choose
to publish the replication results in a PLoS ONE Special
Collection.” (source)
CORE ISSUES
Core Issues
a. Insufficient curation (starting with archiving)
b. No consistent way to learn about the data
(metadata)
c. No way to reference data (unique identifiers)
Core Requirements for Data Access
• Royal Society (2012)
– Accessible (a researcher can easily find it);
– Intelligible (to various audiences);
– Assessable (are researchers able to make judgments
about or assess the quality of the data);
– Usable (at minimum, by other scientists).
Identifying Data
“DOI names are assigned to any entity for use
on digital networks. They are used to provide
current information, including where they (or
information about them) can be found on the
Internet. Information about a digital object may
change over time, including where to find it, but
its DOI name will not change.”
http://datacite.org/whatisdoi, accessed on Sept 26, 2012.
Data Curation
• First step: make (some of) the data accessible
• Repositories/data custodians can address the
issue for some types of data
• Generally provide a way to identify data (DOI)
Repositories
• DataONE (bio sciences)
• Dryad (ecological data)
• DataVerse (data extracts and programs
accompanying papers)
• University Libraries (Dspace)
• UK Data Archive
• ICPSR (researcher-initiated surveys)
• FRED (St. Louis Fed, time-series)
Journals and Data Curation
• PLOS ONE
– Policy
– Limitations: data limited to 10MB…
• AEA
– Policy
– Example
• Econometrica
– Policy
PLoS ONE
• http://www.plosone.org/static/policies#sharing
• “PLOS is committed to ensuring the availability of
data and materials that underpin any articles
published in PLOS journals.”
• “PLOS reserves the right to post corrections on
articles, to contact authors' institutions and
funders, and in extreme cases to withdraw
publication, if restrictions on access to data or
materials come to light after publication of a
PLOS journal article.”
PLoS ONE (cont.)
• “(…) appropriate accession numbers or digital
object identifiers (DOIs) published with the
paper”
• Also guidelines for software (in particular
when it is critical to the paper)
AEA Policy
• http://www.aeaweb.org/aer/data.php
• “Authors of accepted papers that contain
empirical work, simulations, or experimental
work must provide to the Review, prior to
publication, the data, programs, and other
details of the computations sufficient to
permit replication.”
AEA Policy (cont.)
• http://www.aeaweb.org/aer/data.php
• For econometric and simulation papers, the
minimum requirement should
– include the data set(s) and programs used to run the
final models,
– plus a description of how previous intermediate data
sets and programs were employed to create the final
data set(s).
– Authors are invited to submit these intermediate data
files and programs as an option
– […] as well as instructing a user on how replication can
be conducted.
AEA Example: Abowd and Vilhuber
(2012)
• Article:
http://www.aeaweb.org/articles.php?doi=10.1257/aer.102.3.589
• Appendix
– Description at
http://www.aeaweb.org/aer/data/may2012/2012_2790_app.pdf
(note: no DOI!)
– Tried to be careful about referencing data, but no DOIs
available on any of the data
• Even our own data (National QWI, 38MB compressed)
– Only generic programs
– Final dataset was too large – not accepted.
Econometrica Policy
• http://www.econometricsociety.org/submissionprocedures.asp#replication
• “Econometrica has the policy that all empirical,
experimental and simulation results must be replicable.
• Therefore, authors of accepted papers must submit data
sets, programs, and information on empirical analysis,
experiments and simulations that are needed for
replication and some limited sensitivity analysis”
• Limited-access/proprietary datasets: “detailed data
description and the programs used to generate the
estimation data sets must be provided, as well as
information of the source of the data so that researchers
who do obtain access may be able to replicate the results”
Limitations of Current Repositories
• Do not (yet) provide full provenance
– For lack of citation tools
– For lack of guidance
• Limitations when using “big data”
– Repository not the solution (suggested size: <10MB,
although Econometrica has some in the 400MB range)
– Unique references to data publication, onus on
publisher?
• Do not work (well) for restricted-access data
Metadata Access
• Information about the data
• Can be
– Variable names
– Formats
– Values
– Distribution of values
– Description
– Provenance
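A structured metadata standard such as DDI Codebook can carry most of these elements in machine-readable form. A hedged sketch (element names follow the DDI 2.x vocabulary; the variable, its codes, and counts are invented for illustration):

```xml
<codeBook>
  <dataDscr>
    <var name="EDUC">
      <labl>Educational attainment</labl>       <!-- description -->
      <varFormat type="numeric"/>               <!-- format -->
      <catgry>
        <catValu>1</catValu>                    <!-- values -->
        <labl>Less than high school</labl>
      </catgry>
      <sumStat type="vald">12686</sumStat>      <!-- distribution of values -->
    </var>
  </dataDscr>
</codeBook>
```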
Metadata on Public-use Data
• IPUMS: Structured/browsable metadata
• Most other sites:
– PDF or ASCII files
– Generally not linked to actual data
• Restricted-access data in Census RDC
– Generic information outside
– PDF once access granted
IPUMS Metadata
IPUMS Metadata (Details)
ICPSR Metadata on ATUS
BLS Metadata on ATUS
Current Metadata on Confidential Data
• Mostly by inference
• Census Bureau (CES):
– links to public-use tabulations, documents (some by
yours truly), codebooks (Snapshot S2004)
– PDFs of detailed data in RDC
– Codebooks for a few data sets at ICPSR
• 1960 (ICPSR 21980); 1970 (21981); 1980 (21982); 1990
(21983); 2000 (21820)
• NCHS:
– what is in the questionnaire (PDF) but not in the
public-use codebook (PDF) might be accessible
Approaches and Solutions
• NCRN-Cornell node: Comprehensive
Extensible Data Documentation and Access
Repository (CED²AR)
– Based on existing metadata standards (DDI) with
possible extensions
– Provide structured mechanism to synchronize
confidential and public-use metadata
– Assign DOI where needed
NCRN-Cornell
Pruning Confidential Metadata
End Result (mid-2013)
End Result (mid-2013)
EASE OF ACCESS/REPLICATION
FRED: Federal Reserve Economic Data
• http://research.stlouisfed.org/fred2/
– Excellent job in providing easy access to a large
number of data series
– Also provide archival versions (data series ‘as-of’)
– Online graphs
Issues with FRED
• No link back to original data provider’s unique
ID (in large part because there is nothing to
link back to)
• Archival versions identified by “publication”
date (may be imprecise at times)
• Incomplete …
Accessing FRED
• Demo using Stata’s “freduse”
• Program used in this demo:
– stata-recession-fred.do
Stata
Stata Results
Accessing FRED
• Demo using R’s “quantmod”
• Program used in this demo:
– r-recession-fred.R
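The Stata and R demo programs themselves are not reproduced in this text version, but in spirit both pull a series by its FRED key and plot it. A hedged Python sketch of the same access pattern, using FRED’s public CSV endpoint (the choice of USREC, the NBER recession indicator, is an assumption based on the program names):

```python
import csv
import io
import urllib.request

FRED_CSV = "https://fred.stlouisfed.org/graph/fredgraph.csv?id={series_id}"

def fred_url(series_id: str) -> str:
    """Build the CSV download URL for a FRED series key."""
    return FRED_CSV.format(series_id=series_id)

def fetch_series(series_id: str):
    """Download a series as a list of (date, value) pairs. Needs network access."""
    with urllib.request.urlopen(fred_url(series_id)) as resp:
        rows = csv.reader(io.StringIO(resp.read().decode("utf-8")))
        next(rows)  # skip the header row
        return [(date, value) for date, value in rows]

# e.g. fetch_series("USREC")  # assumed series: NBER recession indicator
```

Like freduse and quantmod, this references the data only by its FRED key, which is exactly the citation question the next slides raise.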
Using quantmod
Results with R
FRED Issues
• Positives: it’s available!
• Trains people to use keys to look up online
references
• Issues:
– Not able to link to archival versions (always retrieves
the latest version)
– But it does store local copies (→ a de facto repository;
onus back on ad hoc data archiving)
– How to cite the data?
Tools and Replicability
• Tools can help with replication analysis
– Ability to reference URL of data (handle, DOI, etc.)
– Ability to access data through URL
• Even if/when run in restricted-access environments