How do scientists determine data reusability? A quasi-experiment think-aloud study

Angela P. Murillo
University of North Carolina at Chapel Hill
100 Manning Hall, Chapel Hill, North Carolina
[email protected]

ABSTRACT
This poster presents preliminary findings of a quasi-experiment think-aloud study in which scientists were presented with four canned results containing information about earth science data in a counter-balanced design. Scientists were asked to think aloud about what information about the data assisted them in determining the reusability of each dataset. Sixteen scientists from various earth science fields participated in the study. Each scientist responded to four canned results, a post-result usefulness survey, a post-search rank-order survey, and a post-search survey. Participants stated that concise data descriptions, attribute and unit lists, and research methods steps were particularly important in their ability to determine the reusability of data. Participants preferred more robust results over less robust results and stated that they would rather have too much information than request the data only to find that it did not serve their needs.

Keywords
Data reuse, scientific data, quasi-experiment, think-aloud

ASIST 2016, October 14-18, 2016, Copenhagen, Denmark.
© Angela P. Murillo 2016, All rights reserved

INTRODUCTION
This poster paper presents a preliminary analysis of a quasi-experiment think-aloud study conducted to gain an understanding of the information scientists need to determine the reusability of data. Data sharing and reuse in the sciences have been topics of growing attention in recent years. Changes in scientific practices (Bell, Hey, & Szalay, 2009; Hey & Trefethen, 2003) and changes in journal and grant agency policies (National Institutes of Health, 2007; National Science Foundation, 2010) are driving scientists to attempt to share and reuse data. There are known benefits to reusing data, including extracting additional value from existing data, avoiding the reproduction of research, asking new questions of existing data, and advancing the state of science in general (Borgman, 2012; Lord & Macdonald, 2003).

The U.S. National Science Foundation created the Sustainable Digital Data Preservation and Access Network Partners (DataNet) program to support the development of long-term sustainable data infrastructures, interoperable data preservation and access, and cyberinfrastructure capabilities (National Science Foundation, 2006). The Data Observation Network for Earth (DataONE) provides cyberinfrastructure for "open, persistent, robust, and secure access to well-described and easily discovered earth science observational data" (DataONE, 2013). Scientists participating in DataONE are able to deposit, search, and reuse data available through the various DataONE tools. The purpose of this research is to determine how scientists reuse data. The study uses a quasi-experiment think-aloud method to examine data reusability. DataONE serves as the test environment for this study.

BACKGROUND LITERATURE
Reuse of scientific data provides a number of benefits, including advancing scientific development by avoiding duplication of work, allowing new questions to be asked of existing data, and encouraging diversity in analysis (Borgman, 2012; Lord & Macdonald, 2003). Additionally, funding agencies and journals have put pressure on scientists to make their data available.
Funding agencies have data sharing policies in place for data created through major grants (National Institutes of Health, 2007; National Science Foundation, 2010). Studies have gauged scientists' attitudes toward sharing and reuse and indicate that scientists want data sharing to be a norm in science (Borgman, 2012; Ceci, 1988; Lord & Macdonald, 2003). Additionally, studies have described motivators for data sharing and have found that ideas of data ownership, previous assistance from coworkers, journal policies, and grant agency requirements were among the motivations for sharing data (Blumenthal et al., 2006; Constant, Kiesler, & Sproull, 1994; Tenopir et al., 2011). Reasons for not sharing data include financial concerns, lack of time, lack of organizational support, lack of documentation and the complexity of metadata standards, and the difficulty of anticipating intended users (Birnholtz & Bietz, 2003; Tenopir et al., 2011; Zimmerman, 2003).

Researchers have also examined journal policies and data deposition, asking whether data sharing policies influence data deposition rates. These studies have indicated that while no journal has complete compliance, much research data is deposited along with the article (Noor, Zimmerman, & Teeter, 2006; Ochsner, Steffen, Stoeckert, & McKenna, 2008). Several studies have also investigated specific factors associated with data deposition, showing that authors with more experience and publications in high-impact-factor journals were more likely to deposit data alongside their journal articles (Piwowar, 2011; Piwowar & Chapman, 2010).

While the studies above have provided much information about the current data sharing and reuse environment for scientists, they have not determined what information scientists need about the data itself in order to determine reusability. To address this research gap, a quasi-experiment think-aloud study was developed to further understand what information scientists need to determine data reusability. Think-aloud protocols are useful in providing an understanding of the cognitive processes and knowledge acquisition involved in decision making and introduce fewer interpretation errors than other methods (Oh & Wildemuth, 2009; Someren, Barnard, & Sandberg, 1994). The quasi-experimental counter-balanced design was chosen to test what information is needed to determine whether data are reusable by manipulating the information presented to the participant in each result. In a counter-balanced design, the comparison of interest is within each subject's performance across the multiple treatment conditions, and therefore multiple treatments or interventions are applied to each subject (Hank & Wildemuth, 2009).

RESEARCH QUESTIONS
The overarching question of this study is: how do scientists determine the reusability of data? More specifically:
• How does information about the data, such as information regarding metadata standards, provenance information, research methods, and instruments, influence scientists' ability to determine whether that data is reusable?
• How does this information assist scientists in their ability to reuse this data?
RESEARCH METHODS & PROCEDURES

Participant Recruitment
Sixteen participants were recruited face-to-face at scientific conferences (American Geophysical Union and Geological Society of America), through email list-servs at the University of North Carolina at Chapel Hill and North Carolina State University, through targeted emails to specific UNC and NCSU departments, CODATA, and Drexel University, and through word of mouth. There were several major recruitment efforts in summer 2015, fall 2015, and spring 2016. Participants were paid $20 for their participation, either in cash or as an Amazon gift card. The study was conducted either face-to-face or via screen share over Skype and lasted approximately 45 minutes to 1 hour, with the longest session at 1 hour 30 minutes and the shortest at 43 minutes.

Quasi-Experimental Think-Aloud Set Up
Participants were walked through the experimental interface using the Qualtrics survey software. They were presented with the search interface and four canned results. The researcher incorporated a counter-balanced design into the experimental set-up to assist in testing which pieces of information helped scientists determine the reusability of data. The counter-balanced design was set up as shown in Figure 1.

Participant 1: X1 O X2 O X3 O X4 O
Participant 2: X2 O X4 O X1 O X3 O
Participant 3: X3 O X1 O X4 O X2 O
Participant 4: X4 O X3 O X2 O X1 O

Figure 1. Counter-Balanced Design

For this study:
• X1 refers to canned result 1, which contained robust metadata that included an abstract, a research methods section, and a unit/attribute list.
• X2 refers to canned result 2, which contained basic metadata robustness, an abstract, and a research methods section.
• X3 refers to canned result 3, which contained basic metadata robustness and an abstract.
• X4 refers to canned result 4, which contained basic metadata robustness and a research methods section.
This design was used to ensure that participants' responses to the different canned results were not based on the order of presentation; the sketch below illustrates this rotation.
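To make the rotation in Figure 1 concrete, the short sketch below generates each participant's presentation order. It is illustrative only: the paper does not describe any scripting, the names (CANNED_RESULTS, ORDERS, order_for) are hypothetical, and the assumption that the four orders simply cycle across the sixteen participants is not stated in the paper.

```python
# Illustrative sketch of the counter-balanced assignment in Figure 1.
# Assumption (not stated in the paper): the four orders cycle across the
# sixteen participants, so participant 5 repeats order 1, and so on.

CANNED_RESULTS = {  # summarised from the X1-X4 descriptions above
    "X1": "robust metadata: abstract + research methods + unit/attribute list",
    "X2": "basic metadata: abstract + research methods",
    "X3": "basic metadata: abstract only",
    "X4": "basic metadata: research methods only",
}

# Presentation orders copied from Figure 1; each "O" is an observation
# point (the post-result usefulness rating given after a canned result).
ORDERS = [
    ["X1", "X2", "X3", "X4"],  # Participant 1
    ["X2", "X4", "X1", "X3"],  # Participant 2
    ["X3", "X1", "X4", "X2"],  # Participant 3
    ["X4", "X3", "X2", "X1"],  # Participant 4
]


def order_for(participant_number: int) -> list:
    """Return the canned-result order for a participant (1-indexed)."""
    return ORDERS[(participant_number - 1) % len(ORDERS)]


if __name__ == "__main__":
    for p in range(1, 17):  # the study had sixteen participants
        sequence = order_for(p)
        print(f"Participant {p:2d}: " + " O ".join(sequence) + " O")
    print("\nCanned result contents:")
    for key, description in CANNED_RESULTS.items():
        print(f"  {key}: {description}")
```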
Quasi-Experimental Think-Aloud Procedures
There were multiple data gathering points, including the post-result usefulness survey, the rank-order survey, the post-search survey (general questions, open-ended questions, a data reuse factors survey, and a demographic survey), and the think-aloud itself, which gathered data throughout. Participants were asked to think aloud about what information was useful to them with regard to potentially reusing the data. After each canned result they were asked to rate its usefulness on a scale of 1 to 5 (Post-Result Usefulness Survey). Additionally, after seeing all results they were asked to rank them from most useful to least useful (Rank-Order Survey). Lastly, they completed the Post-Search Survey, which contained general questions, open-ended questions, a data reuse factors survey, and a demographic survey. Figure 2 provides a visual overview of the procedures.

Figure 2. Quasi-Experiment Think-Aloud Procedures/Data Collection

RESULTS

Demographic Results
Of the sixteen participants, 56% were male and 44% were female, and all were students, professors, or researchers in the earth sciences. Six participants considered their primary area of expertise to be Geology, four - Ecology, two - Atmospheric Science, two - Environmental Science, one - Physics, and one - Hydrology. Participants' sub-disciplines included paleoecology, geophysics, seismology, macroecology, evolutionary biology, planetary geology, sedimentology, and coral reef conservation. Of the participants, 62.5% were students, 31.25% were professors, and 6.25% worked in a professional organization; additionally, 31.3% had Ph.D.'s, 37.5% had master's degrees, and 31.3% had bachelor's degrees. None of the participants had used the DataONE system.

Post-Result Usefulness Survey
Canned result #1 was found to be the most useful, with a mean usefulness score of 3.56, and canned result #3 was found to be the least useful, with a mean of 2.25. As a reminder, canned result #1 contained the most robust metadata, which included an abstract, a research methods section, and a unit/attribute list; canned result #2 did not contain a unit/attribute list but contained the abstract and methods section; canned result #3 contained an abstract but no methods section; and canned result #4 contained a methods section but no abstract. As shown in Figure 3, there was a preference for more information over less information, and a preference for the methods section over the abstract section.

Canned Result #1: 3.56 (.81)
Canned Result #2: 3.31 (.79)
Canned Result #3: 2.25 (1.00)
Canned Result #4: 2.31 (.79)

Figure 3. Post-Result Usefulness Survey (Mean (SD))

Rank Order Survey
After seeing all canned results, participants were also asked to rank the canned results from most useful to least useful with regard to potential data reuse. Figure 4 shows the results of these rankings. Canned result #1 ranked highest most often and canned result #3 ranked lowest most often. Additionally, canned result #2 ranked second highest most often, while canned result #4 ranked third highest most often.

Figure 4. Rank Order Survey Results

Post-Search Survey (Reuse Factors)
Participants took a short survey to examine what information is needed in order to determine the reusability of data. Participants were provided options for 1) metadata standard, 2) provenance information, 3) permission and intellectual property, 4) instrumentation, and 5) research methods, and were asked to rate these from 1 to 7, with 1 being Not at all Important, 2 Very Unimportant, 3 Somewhat Unimportant, 4 Neither Important nor Unimportant, 5 Somewhat Important, 6 Very Important, and 7 Extremely Important. Figure 5 shows the mean and standard deviation from this survey.

Factor: Mean (SD)
Metadata Standard: 4.94 (1.53)
Provenance Information: 5.25 (1.18)
Intellectual Property Information: 4.75 (1.29)
Instrument Information: 5.88 (1.5)
Research Methods Information: 6.13 (1.45)
Other: 6.6 (0.52)

Figure 5. Post-Search Survey (Reuse Factors)

Post-Search Survey (Open-Ended Questions Results)
Along with the survey above, participants also had the opportunity to answer several open-ended questions regarding data reuse. As a reminder, these were:
1. When looking at the search results above, what information did you need to determine if the data is relevant?
2. In regard to the DataONE system, what information inhibits your ability to reuse data?
3. In regard to the DataONE system, what information facilitates your ability to reuse data?
4. When thinking about the DataONE system, what information did you need that the system did not provide?

Results: Open-Ended Question 1
Ten participants suggested that the methods and the attribute table were the information needed to determine if the data was relevant. For example, participant 1 stated, "The description box and the methods data was important in determining whether the data was relevant to my needs"; by description box they were referring to the attribute table. Six participants suggested that the data description was important for determining relevance. In fact, participant 12 suggested it was the most important: "The short summary of the dataset present in the final search result was perhaps the most pertinent information needed, but was unfortunately buried at the bottom of the result. The short summary provided the contents of the dataset, and a quick look of whether or not it would be applicable." These short summaries will be described in further detail in the discussion section. Other items that were described as important for determining relevance included the abstract (3 participants), temporal information (3 participants), and provenance (2 participants). Participant #5 stated they needed to know the "who, what, when, where, and how of the data" in order to determine if it was relevant for reuse. Participants also made suggestions for information that was not currently in the canned result examples, including field collection, uncertainty information, experimental setup, instruments and calibration, and data analysis techniques.

Results: Open-Ended Question 2
In regard to what participants found inhibiting about DataONE, four participants suggested that not knowing enough about the data format inhibited their ability to reuse the data. One participant stated, "The data may be in a format that is difficult to extract. Not knowing the data of the format may lead to the user to not want to use the data." Additionally, two participants suggested that there was too much information; however, one of them suggested that the organization of that information was the real problem: "Organization of information, for example, the dataset description was vital to understanding the dataset, however, was near the bottom. Also, I would prefer more of a snapshot of the data rather than the long list." Other factors that participants suggested inhibited their ability to reuse the data included unknown data quality and no links to secondary publications. Two participants stated that there were no factors that inhibited their ability to reuse the data, and three participants left the box blank or put dashes in the box, implying they did not consider there to be any factors that inhibited their ability to reuse the data.

Results: Open-Ended Question 3
Several participants suggested the layout of the page facilitated their ability to reuse the data. They stated that they "liked that all of the information was on one page" and appreciated the "easy to follow layout" of the page. Additionally, five participants stated that the attribute and unit list table facilitated their ability to reuse the data, four stated the methods section did, and three stated that the data description/summary did. Participants also stated their appreciation for the licensing information, the abstract, the geographic information (particularly the coordinates), and the instrument information.
One participant stated they appreciated that DataONE provided the "ability to quickly see most important aspects of study instead of having to read an entire article." Only two notes by participants suggested areas of improvement: a clearer location to download the data and a clearer licensing summary. It was suggested that these should simply be "download data here" and "data can be used" buttons, respectively.

Results: Open-Ended Question 4
Lastly, in regard to information that DataONE did not provide that they would have wanted, four participants suggested they would like more information about the actual data itself, such as a snapshot or description of the data. One participant suggested that they "would like to get a preview of the raw data. The dataset may contain other information that is not displayed that can be useful for my study." Another participant explained why more information about the actual data was so important: "In the case of datasets stored in particular formats, knowing the format and size of a dataset could be critical. For instance, if I needed image files in .PNG format, it would save time if I knew that a dataset were .JPG only. Similarly, knowing the size of the dataset could be critical (if I don't want to download 10 TBs worth of seed storage temperature data)." Additionally, two participants stated that along with the bounding coordinates they would like a map. Other suggestions included more information on the history of the data manipulation/provenance, sample size, publications, storage, uncertainty information, and naming conventions. One participant also suggested it would be nice to be able to control what information was provided through the use of drop-down menus. Additionally, six participants either did not answer this question or drew a dash, suggesting they were pleased with the information provided by DataONE.

Think-Aloud Results
While participants were thinking aloud, the researcher took notes on their thoughts regarding the information presented in each canned result and whether this information was useful for determining the reusability of the data. The following provides a summary of these results for each canned result.

Think-Aloud Results: Canned Result #1
The majority of the participants suggested that they appreciated the data description, attribute table, and research methods information and stated that these items were all important for determining whether the data was reusable. As shown in the usefulness results, most participants agreed that canned result #1 was particularly useful, and it ranked #1 in the rank order. In general, most participants preferred having too much information over not enough information. One participant stated they preferred this because it provided the "who, what, when, where, and how" of the data, which was the information they needed to determine if the data was reusable. On two occasions participants stated that there was too much information; however, this was not the opinion of the majority. Some suggestions included providing the type of data (e.g., experiment, field, sensor) and adding a drop-down menu so users can determine what they want to look at. Others suggested moving the data description up the page to make it more prominent. Nearly all participants saw the data description, attribute table, and research methods as the most vital pieces of information for determining reusability.
Think-Aloud Results: Canned Result #2
The majority of participants indicated that they appreciated having the methods and abstract (particularly the methods), which were useful for determining whether they were able to reuse the data. However, seven participants did state that they wanted more information, such as a description of the data or a snippet of the data. Even participants who had not seen the data description and attribute table stated they "really wants a short description of the actual dataset" (P4). Even without the data description and attribute table, participants stated that they liked the conciseness of this result. They stated that the abstract and methods were both very useful. Other items that were helpful were the bounding coordinates, contact information, and keywords.

Think-Aloud Results: Canned Result #3
Most participants found canned result #3 hard to work with. Participant #2 stated there was "not enough information" and participant #5 stated it was "mostly secondary information". In general, participants stated the abstract was too much text to parse and that it made it difficult to determine whether the data was reusable. They stated they did not like the "wall of text" organization. Additionally, they stated that a description of the data and an abstract were not the same thing and wanted to know whether the abstract was a paper abstract or a data abstract; from the abstract available, this distinction was not clear to participants. These results were similar to the results from the usefulness and rank order surveys: for participants, this was the least useful of all of the canned results.

Think-Aloud Results: Canned Result #4
For canned result #4, the majority of participants agreed that the most helpful item for determining whether the data was reusable was the research methods. Participants who had already seen canned result #3 suggested that they preferred having the methods over having the abstract, indicating that the methods were more important. One reason participants preferred the methods over the abstract was that it was easier for them to parse the information from the methods. Additionally, they were able to find information such as data collection and experimental set-up when the methods were available, whereas this information would not always be in an abstract. Those who had already seen canned result #1 did state that they still preferred having the attribute table and data description, but found the methods valuable. These results were similar to the results in the usefulness and rank order surveys; participants ranked this result #3 overall.

CONCLUSION
This research provides insight into what information is needed in order for scientists to determine the reusability of data. While there has been much discussion of the importance of data sharing and reuse, of changes in journal and granting agency policies, and of the reasons scientists want to share and reuse each other's data, very little attention has been given to the actual reuse of data and what information is needed to reuse it. This study provides meaningful insights into how scientists determine reusability, which will assist those creating cyberinfrastructure, such as DataONE, that aims to enable scientists to reuse data.

ACKNOWLEDGMENTS
The author would like to thank Dr. Jane Greenberg for her guidance and support. The author would also like to thank DataONE for their support.

REFERENCES
Bell, G., Hey, T., & Szalay, A. (2009). Beyond the data deluge. Science, 323(5919), 1297–1298. http://doi.org/10.1126/science.1170411
Birnholtz, J. P., & Bietz, M. J. (2003). Data at work: Supporting and sharing in science and engineering. In Proceedings of the 2003 International ACM SIGGROUP Conference on Supporting Group Work (pp. 339–348). New York, New York, USA: ACM Press. http://doi.org/10.1145/958160.958215
Blumenthal, D., Campbell, E. G., Gokhale, M., Yucel, R., Clarridge, B., & Hilgartner, S. (2006). Data withholding in genetics and the other life sciences: Prevalences and predictors. Academic Medicine, 81(2), 137–145.
Borgman, C. L. (2012). The conundrum of sharing research data. Journal of the American Society for Information Science and Technology, 63(6), 1059–1078. http://doi.org/10.1002/asi.22634
Ceci, S. J. (1988). Scientists' attitudes toward data sharing. Science, Technology, and Human Values, 13(1/2), 45–52.
Constant, D., Kiesler, S., & Sproull, L. (1994). What's mine is ours, or is it? A study of attitudes about information sharing. Information Systems Research, 5(4), 400–421.
DataONE. (2013). What is DataONE? Retrieved January 2, 2014, from http://www.dataone.org/what-dataone
Hank, C., & Wildemuth, B. M. (2009). Quasi-experimental studies. In B. M. Wildemuth (Ed.), Applications of social research methods to questions in information and library science (pp. 93–104). Westport, CT: Libraries Unlimited.
Hey, T., & Trefethen, A. E. (2003). The data deluge: An e-Science perspective. In F. Berman, A. J. G. Hey, & G. C. Fox (Eds.), Grid computing: Making the global infrastructure a reality (pp. 809–824). Wiley and Sons. Retrieved from http://en.scientificcommons.org/2325382
Lord, P., & Macdonald, A. (2003). e-Science curation report: Data curation for e-Science in the UK: An audit to establish requirements for future curation and provision (pp. 1–84). Twickenham, England: The JISC Committee for the Support of Research (JCSR). Retrieved from http://www.jisc.ac.uk/uploaded_documents/eScienceReportFinal.pdf
National Institutes of Health. (2007). NIH data sharing policy. Retrieved from http://grants.nih.gov/grants/policy/data_sharing/
National Science Foundation. (2006, November 7). Sustainable Digital Data Preservation and Access Network Partners (DataNet). Retrieved February 4, 2014, from http://www.nsf.gov/pubs/2007/nsf07601/nsf07601.htm
National Science Foundation. (2010, November 10). Dissemination and sharing of research results. Retrieved from http://www.nsf.gov/bfa/dias/policy/dmp.jsp
Noor, M. A. F., Zimmerman, K. J., & Teeter, K. C. (2006). Data sharing: How much doesn't get submitted to GenBank? PLoS Biology, 4(7), e228. http://doi.org/10.1371/journal.pbio.0040228
Ochsner, S. A., Steffen, D. L., Stoeckert, C. J., & McKenna, N. J. (2008). Much room for improvement in deposition rates of expression microarray datasets. Nature Methods, 5(12), 991. http://doi.org/10.1038/nmeth1208-991
Oh, S., & Wildemuth, B. M. (2009). Think-aloud protocols. In B. M. Wildemuth (Ed.), Applications of social research methods to questions in information and library science (pp. 178–188). Westport, CT: Libraries Unlimited.
Piwowar, H. A. (2011). Who shares? Who doesn't? Factors associated with openly archiving raw research data. PLoS ONE, 6(7), e18657, 1–13. http://doi.org/10.1371/journal.pone.0018657
Piwowar, H. A., & Chapman, W. W. (2010). Public sharing of research datasets: A pilot study of associations. Journal of Informetrics, 4, 148–156. http://doi.org/10.1016/j.joi.2009.11.010
Someren, M. W. van, Barnard, Y. F., & Sandberg, J. (1994). The think aloud method: A practical guide to modelling cognitive processes. London; San Diego: Academic Press.
Tenopir, C., Allard, S., Douglass, K., Aydinoglu, A. U., Wu, L., Read, E., … Frame, M. (2011). Data sharing by scientists: Practices and perceptions. PLoS ONE, 6(6), e21101, 1–21.
Zimmerman, A. S. (2003). Data sharing and secondary use of scientific data: Experiences of ecologists.