INFO 7470/ECON 7400/ILRLE 7400
Citing Literature, Citing Data
John M. Abowd and Lars Vilhuber
March 11, 2013
© John M. Abowd and Lars Vilhuber 2013, all rights reserved

CITING LITERATURE

Citing Literature
• Why? To prevent plagiarism and to establish the provenance of ideas
• How? Why we cite as we do: publishing cycles, uniqueness of sources
• Plagiarism: appropriating other people's ideas
• Examples (Bruno Frey)
• Citing literature today: does it still work?
  – Issues of versioning of articles
  – Revisions/retractions/corrections

Why Do We Cite Literature?
• To give credit to the original authors of ideas
  – To not give credit is plagiarism
• To allow readers to find the information cited
  – Trace the evolution of ideas
  – Document cited results

Plagiarism
Source: http://www.elsevier.com/authors/author-rights-and-responsibilities#responsibilities via RePEc
• More easily detected nowadays
  – http://plagiarism.repec.org/offenders.html
  – http://ideas.repec.org/a/che/chepap/v20y2008i1p20-25.html
• Software
  – http://plagiarism.bloomfieldmedia.com/zwordpress/software/wcopyfind/
  – Turnitin
  – The AEA uses http://www.aeaweb.org/crosscheck.php

Prominent Recent Examples of Plagiarism
• Bruno Frey – AEA P&P, others (see the FreyPlag wiki, but also the responses by Frey)
• German ministers
  – Defense: Karl-Theodor Maria Nikolaus Johann Jacob Philipp Franz Joseph Sylvester Freiherr von und zu Guttenberg [German source]
  – Education: Annette Schavan [German source]
• Russian presidents? [2006]

How Do We Cite?
• Multiple typographical standards
• Generally enough unique keys to correctly identify the source
• Current conventions are driven to a large extent by the publishing model in effect through the end of the 20th century (see also Margo Anderson's Session 1 on data publishing)

Examples
Based on and using images from http://bcs.bedfordstmartins.com/resdoc5e/RES5e_ch09_s1-0002.html (2013-03-08)

Examples
Based on and using images from http://bcs.bedfordstmartins.com/resdoc5e/RES5e_ch09_s1-0002.html (2013-03-08)
Declining uniqueness: online documents

Permanent Links
• The URL (Uniform Resource Locator, or Web address) may be temporary: it may not function in the near or far future
• Links designated as "permanent", "persistent", or "stable" are designed specifically to remain active and usable over time
• Permanent links
  – Digital Object Identifiers (DOIs) (more formally: the Handle System)
    • actionable, interoperable, persistent links
  – Other types of permanent links
    • JSTOR (old)
    • EBSCO
Adapted from http://library.concordia.ca/services/users/faculty/permanentlinks.php

DOI (screenshots)

DOI in References (screenshot)

Up to Here …
• … nothing new, or mostly
• Starting in 5th grade, we have been thoroughly trained in citing our "sources"
• Or have we?

CITING DATA

Neal (1999)
• http://www.jstor.org/stable/10.1086/209919
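A DOI is made actionable by prepending a resolver host: the DOI name itself never changes, and the resolver (part of the Handle System) redirects to the object's current location. The slides show this only in screenshots; as a hedged illustration, the Python sketch below builds a resolver URL from a bare DOI name, using the DOI that is embedded in the JSTOR stable URL for Neal (1999).

```python
# Sketch: turning a DOI name into an actionable, persistent link.
# The DOI name (e.g. "10.1086/209919") is permanent; the resolver
# redirects to wherever the object currently lives.

def doi_to_url(doi: str) -> str:
    """Build a resolver URL from a bare DOI name."""
    return "http://dx.doi.org/" + doi

# The DOI embedded in the JSTOR stable URL for Neal (1999):
neal_doi = "10.1086/209919"
print(doi_to_url(neal_doi))  # http://dx.doi.org/10.1086/209919
```

A reference list entry can therefore cite the resolver URL (or the bare DOI name) instead of a publisher URL that may move or die.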
References (screenshots)

No Data …

The Problem
• I want to replicate Neal's analysis
• Process:
  – Download the NLSY data (the latest!)
  – Read the article; replicate his described analysis in software of my choice
  – Get results, compare
• What happens if the results are not the same?
  – Qualitatively
  – Quantitatively

Attempts to Falsify
"5. Every genuine test of a theory is an attempt to falsify it, or to refute it [...] 6. Confirming evidence should not count except when it is the result of a genuine test of the theory; and this means that it can be presented as a serious but unsuccessful attempt to falsify the theory. (I now speak in such cases of 'corroborating evidence.')"
Karl Popper, "Science: Conjectures and Refutations", p. 47

Replication Study
• Different results can be driven by
  – Differences in data
  – Differences in software
  – Differences in implementation
  – Errors by the original author…
• Start by keeping as much of the setup as possible the same
  – Same data
  – Same software
  – Same implementation (programs)

Data for Replication
• What does "same data" imply?
  – Ability to find the data
  – Assurance that the data are, in fact, the same
• Data curation and citation are critical to the replication exercise
• Increasing impetus from funding agencies
  – NSF
  – NIH
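One practical way to get assurance that the data are, in fact, the same (not shown in the slides, but common in replication work) is to record a cryptographic fingerprint of the data file alongside the citation. A minimal Python sketch, where the file name is hypothetical:

```python
import hashlib

def data_fingerprint(path: str) -> str:
    """SHA-256 checksum of a data file, read in chunks so that a
    large extract (e.g. an NLSY download) need not fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

# Record this digest with the citation; a replicator who downloads
# "the same" data can compare digests before comparing results.
# print(data_fingerprint("nlsy79_extract.dat"))  # hypothetical file name
```

If the digests differ, the replication already has an explanation candidate ("differences in data") before any code is run.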
Not Futile
• The Neal JOLE article is much cited (60 citations on RePEc, an undercount)
• It is the only instance of a substantive correction of a JOLE article (as of 2013-03-08; search term: "erratum")
• Notable because the author publishing the erratum
  – was referring to a (seminal) article from 5 years earlier
  – was the editor-in-chief at the time

Example: JOLE
"In the April 1999 issue of this Journal, I published an article entitled 'The Complexity of Job Mobility among Young Men' (Journal of Labor Economics 17, no. 2 [1999]: 237–61). Recently, I began a dialogue with another researcher who was attempting to replicate the empirical results in that article. Through this dialogue, I learned that, for some workers, I erred in constructing my original counts of the number of employer changes within specific careers. I have corrected this error and have found that, given correct variable constructions, several empirical results differ quantitatively, although not qualitatively, from the results reported in the original article."

Other Items to Note
• Not available if you are not a subscriber …
• The original author's publication count increased by 1
• The discrepancy's reporter (Ronni Pavan) was not an author on the erratum (Pavan did publish in the same journal in 2011)
• Neither the original data nor the corrected data (and the associated programs) are available from the journal (they are probably available from the author)
  – The original data are public-use NLSY data, referenced as "1979–92"
• The online version of the original article contains a link to the erratum (and is found when searching for "Erratum")

Why Do We Cite Data This Way?
• It used to be sufficient
  – Data were the same as a book (see Margo's Session 1)
  – If not, they were rarely modified (punch cards, tapes)
  – Example: "NLSY 1979-1992" was a well-defined CD-ROM
• It is no longer sufficient
  – Where is the NLSY CD-ROM?
  – Which version does your library have?

Publications by the Census Bureau
• Decennial Census: SF1, SF2, SF3 … once every ten years
• Economic Census: a limited number of tables every 5 years
• LEHD: 4,860 tables every three months

Improvements
• https://usa.ipums.org/usa/cite.shtml

Improvements (screenshot)

These Are the Easy Cases
• NLSY, IPUMS-USA, ICPSR data
  – Public-use datasets, or
  – The data distributor is also the data custodian and guarantees availability of the data
• Many other public-use datasets
  – QCEW – no version (one can be defined by the latest date on the file, but none is officially defined)
  – QWI – version.txt, but hidden
  – BDS – "yearly" releases (two listed, in fact three)

Data Availability Is Not a New Issue
"In its first issue, the editor of Econometrica (1933), Ragnar Frisch, noted the importance of publishing data such that readers could fully explore empirical results. Publication of data, however, was discontinued early in the journal's history. [...] The journal arrived full circle in late 2004 when Econometrica adopted one of the more stringent policies on availability of data and programs."
http://www.econometricsociety.org/submissions.asp#4 as cited in Anderson et al (2005)
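When a dataset has no official version identifier (the QCEW/QWI/BDS cases above), a usable data citation at minimum pins down the dataset, a version or vintage, the distributor, and the access date. A hedged sketch of assembling such a citation string, in the spirit of the IPUMS guidance; the field layout and values are illustrative, not an official format:

```python
def cite_data(author, title, version, distributor, year, url, accessed):
    """Assemble a data citation: name the dataset, its version or
    vintage, the distributor, and when it was retrieved, so that a
    replicator can identify the same data."""
    return (f"{author}. {title} [dataset]. Version {version}. "
            f"{distributor}, {year}. {url} (accessed {accessed}).")

print(cite_data(
    "Bureau of Labor Statistics",
    "Quarterly Census of Employment and Wages",
    "2013-03-08",  # no official version number: use the vintage date
    "U.S. Bureau of Labor Statistics", 2013,
    "http://www.bls.gov/cew/", "2013-03-08"))
```

The vintage-date stand-in is exactly the workaround the slides describe for QCEW: "can be defined by latest date on file, but not officially defined."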
Citing Restricted-use Data
• Abowd, Kramarz, and Margolis (1999): "The data used in this paper are confidential but the authors' access is not exclusive."
• But
  – No statistical agency currently has in place a way to uniquely cite data
  – Restricted-access data enclaves are a black box
  – Worries about "leakage" of confidential information

Declining Role of Public-use Data (Chetty, 2012)

Increasing Use of Administrative Data (Chetty, 2012)

Not Just in the Social Sciences
• Nature, 2012: "Many of the emerging 'big data' applications come from private sources that are inaccessible to other researchers. The data source may be hidden, compounding problems of verification, as well as concerns about the generality of the results."
Huberman, Nature 482, 308 (16 February 2012), doi:10.1038/482308d

Verification Is Important
• Falsified data
  – Andrew Wakefield (autism and vaccines)
  – Yoshitaka Fujii (fabricated data in 172 out of 249 papers)
• "Believe it or not: how much can we rely on published data on potential drug targets?" (doi:10.1038/nrd3439-c1)
  – The drug maker could not replicate more than 20–25% of the findings
• "Why Most Published Research Findings Are False", Ioannidis JPA (2005), doi:10.1371/journal.pmed.0020124

But …
• Even studies that worry about replication do not provide their own data in a replicable way: "The questionnaire can be obtained from the authors." (doi:10.1038/nrd3439-c1)
Other Approaches: Replication for a Fee
• "The Reproducibility Initiative takes advantage of Science Exchange's existing network of more than 1,000 core facilities and commercial research organizations. Researchers submit their studies (…) [which] will attempt to replicate the studies for a fee."
• "Submitting researchers will have to pay for the replication studies (…) one-tenth that of the original study (…) 5 percent transaction fee to Science Exchange."
• "Participants will remain anonymous unless they choose to publish the replication results in a PLoS ONE Special Collection" (source)

CORE ISSUES

Core Issues
a. Insufficient curation (starting with archiving)
b. No consistent way to learn about the data (metadata)
c. No way to reference data (unique identifiers)

Core Requirements for Data Access
• Royal Society (2012)
  – Accessible (a researcher can easily find it)
  – Intelligible (to various audiences)
  – Assessable (researchers are able to make judgments about, or assess, the quality of the data)
  – Usable (at a minimum, by other scientists)

Identifying Data
"DOI names are assigned to any entity for use on digital networks. They are used to provide current information, including where they (or information about them) can be found on the Internet. Information about a digital object may change over time, including where to find it, but its DOI name will not change."
http://datacite.org/whatisdoi, accessed on Sept 26, 2012
Data Curation
• First step: make (some of) the data accessible
• Repositories/data custodians can address the issue for some types of data
• They generally provide a way to identify the data (a DOI)

Repositories
• DataONE (biosciences)
• Dryad (ecological data)
• Dataverse (data extracts and programs accompanying papers)
• University libraries (DSpace)
• UK Data Archive
• ICPSR (researcher-initiated surveys)
• FRED (St. Louis Fed, time series)

Journals and Data Curation
• PLOS ONE
  – Policy
  – Limitations: data limited to 10MB…
• AEA
  – Policy
  – Example
• Econometrica
  – Policy

PLoS ONE
• http://www.plosone.org/static/policies#sharing
• "PLOS is committed to ensuring the availability of data and materials that underpin any articles published in PLOS journals."
• "PLOS reserves the right to post corrections on articles, to contact authors' institutions and funders, and in extreme cases to withdraw publication, if restrictions on access to data or materials come to light after publication of a PLOS journal article."

PLoS ONE (cont.)
• "(…) appropriate accession numbers or digital object identifiers (DOIs) published with the paper"
• Also guidelines for software (in particular when it is critical to the paper)

AEA Policy
• http://www.aeaweb.org/aer/data.php
• "Authors of accepted papers that contain empirical work, simulations, or experimental work must provide to the Review, prior to publication, the data, programs, and other details of the computations sufficient to permit replication."
AEA Policy (cont.)
• http://www.aeaweb.org/aer/data.php
• For econometric and simulation papers, the minimum requirement should
  – include the data set(s) and programs used to run the final models,
  – plus a description of how previous intermediate data sets and programs were employed to create the final data set(s).
  – Authors are invited to submit these intermediate data files and programs as an option
  – […] as well as instructing a user on how replication can be conducted.

AEA Example: Abowd and Vilhuber (2012)
• Article: http://www.aeaweb.org/articles.php?doi=10.1257/aer.102.3.589
• Appendix
  – Description at http://www.aeaweb.org/aer/data/may2012/2012_2790_app.pdf (note: no DOI!)
  – We tried to be careful about referencing data, but no DOIs were available for any of the data
    • Not even our own data (national QWI, 38MB compressed)
  – Only generic programs
  – The final dataset was too large and was not accepted.

Econometrica Policy
• http://www.econometricsociety.org/submissionprocedures.asp#replication
• "Econometrica has the policy that all empirical, experimental and simulation results must be replicable."
• "Therefore, authors of accepted papers must submit data sets, programs, and information on empirical analysis, experiments and simulations that are needed for replication and some limited sensitivity analysis"
• Limited-access/proprietary datasets: "detailed data description and the programs used to generate the estimation data sets must be provided, as well as information of the source of the data so that researchers who do obtain access may be able to replicate the results"
Limitations of Current Repositories
• Do not (yet) provide full provenance
  – For lack of citation tools
  – For lack of guidance
• Limitations when using "big data"
  – A repository is not the solution (suggested size: <10MB, although Econometrica has some files in the 400MB range)
  – Unique references to the data publication: is the onus on the publisher?
• Do not work (well) for restricted-access data

Metadata Access
• Information about the data
• Can be
  – Variable names
  – Formats
  – Values
  – Distribution of values
  – Description
  – Provenance

Metadata on Public-use Data
• IPUMS: structured, browsable metadata
• Most other sites:
  – PDF or ASCII files
  – Generally not linked to the actual data
• Restricted-access data in Census RDCs:
  – Generic information outside
  – PDFs once access is granted

IPUMS Metadata (screenshot)

IPUMS Metadata (Details) (screenshot)

ICPSR Metadata on ATUS (screenshot)

BLS Metadata on ATUS (screenshot)

Current Metadata on Confidential Data
• Mostly by inference
• Census Bureau (CES):
  – Links to public-use tabulations, documents (some by yours truly), codebooks (Snapshot S2004)
  – PDFs of detailed data in the RDC
  – Codebooks for a few data sets at ICPSR
    • 1960 (ICPSR 21980); 1970 (21981); 1980 (21982); 1990 (21983); 2000 (21820)
• NCHS:
  – What is in the questionnaire (PDF) but not in the public-use codebook (PDF) might be accessible
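The elements listed under "Metadata Access" above (variable names, formats, values, description, provenance) are exactly what structured codebooks record per variable. A minimal, hypothetical sketch of such a record in machine-readable form; the variable and field names are illustrative, loosely in the spirit of a DDI-style codebook, not an actual NLSY codebook entry:

```python
# Hypothetical variable-level metadata record: name, label, format,
# value labels, and provenance, as a structured codebook would hold them.
codebook = {
    "empchg": {
        "label": "Number of employer changes in career",
        "format": "integer",
        "values": {0: "none", 1: "one", 2: "two or more"},
        "provenance": "derived from employer rosters",  # illustrative
    }
}

def describe(var: str) -> str:
    """Human-readable one-liner built from the structured record."""
    m = codebook[var]
    return f"{var}: {m['label']} ({m['format']})"

print(describe("empchg"))  # empchg: Number of employer changes in career (integer)
```

The point of the structured form is that the same record can be rendered for humans, searched, linked to the data, and (as in CED²AR) pruned to a public-use subset without rewriting the confidential original.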
Approaches and Solutions
• NCRN-Cornell node: Comprehensive Extensible Data Documentation and Access Repository (CED²AR)
  – Based on existing metadata standards (DDI), with possible extensions
  – Provides a structured mechanism to synchronize confidential and public-use metadata
  – Assigns DOIs where needed

NCRN-Cornell (screenshot)

Pruning Confidential Metadata (screenshot)

End Result (mid-2013) (screenshots)

EASE OF ACCESS/REPLICATION

FRED: Federal Reserve Economic Data
• http://research.stlouisfed.org/fred2/
  – Does an excellent job of providing easy access to a large number of data series
  – Also provides archival versions (data series "as of" a given date)
  – Online graphs

Issues with FRED
• No link back to the original data provider's unique ID (in large part because there is nothing to link back to)
• Archival versions are identified by "publication" date (which may be imprecise at times)
• Incomplete …

Accessing FRED
• Demo using Stata's "freduse"
• Program used in this demo: stata-recession-fred.do

Stata (screenshot)

Stata Results (screenshot)

Accessing FRED
• Demo using R's "quantmod"
• Program used in this demo: r-recession-fred.R
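The demos pull series directly from FRED with Stata's freduse and R's quantmod. As a language-neutral sketch of the same idea, and deliberately offline since a replication archive should keep a local copy of what was downloaded, here is a hedged Python example that parses a FRED-style CSV extract (the two-column DATE,value layout FRED serves; the observations here are illustrative, not an actual download):

```python
import csv
import io
from datetime import date

# Tiny illustrative extract in FRED's two-column CSV layout;
# in practice this would be the locally archived download.
fred_csv = """DATE,UNRATE
2008-01-01,5.0
2008-07-01,5.8
2009-01-01,7.8
"""

def read_fred_csv(text):
    """Parse a FRED-style CSV into (series name, list of (date, value))."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)          # e.g. ['DATE', 'UNRATE']
    series = []
    for d, v in reader:
        y, m, dd = map(int, d.split("-"))
        series.append((date(y, m, dd), float(v)))
    return header[1], series

name, obs = read_fred_csv(fred_csv)
print(name, len(obs))  # UNRATE 3
```

Keeping and parsing the archived CSV, rather than re-querying FRED at replication time, sidesteps the "always the latest version" problem noted on the next slides.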
Using quantmod (screenshot)

Results with R (screenshot)

FRED Issues
• Positives: it's available!
• It trains people to use keys to look up online references
• Issues:
  – Not able to link to archival versions (you always get the latest version),
  – but it does store local copies (→ repository; the onus is back on ad hoc data archiving)
  – How to cite the data?

Tools and Replicability
• Tools help in doing replicability analysis
  – Ability to reference the URL of the data (handle, DOI, etc.)
  – Ability to access the data through that URL
    • Even if/when run in restricted-access environments