How do scientists determine data reusability? A quasi-experiment think-aloud study
Angela P. Murillo
University of North Carolina at Chapel Hill
100 Manning Hall, Chapel Hill, North Carolina
[email protected]
ABSTRACT
This poster presents preliminary findings of a quasi-experiment think-aloud study in which scientists were presented four canned results of information regarding earth
science data in a counter-balanced design. Scientists were
asked to think aloud regarding what information about the
data assisted them in their ability to determine reusability of
that dataset. Sixteen scientists from various earth science
fields participated in the study. Each scientist responded to
four canned results, a post-result usefulness survey, a post-search rank-order survey, and a post-search survey.
Participants stated that concise data descriptions, attribute
and unit lists, as well as research methods steps were
particularly important in their ability to determine
reusability of data. Participants preferred more robust results over less robust ones, and stated that they would rather have too much information than request the data only to find out that it did not serve their needs.
Keywords
Data reuse, scientific data, quasi-experiment, think-aloud
ASIST 2016, October 14-18, 2016, Copenhagen, Denmark.
© Angela P. Murillo 2016, All rights reserved.
INTRODUCTION
This poster-paper presents preliminary analysis of a quasi-experiment think-aloud study conducted to gain an understanding of the information scientists need to determine the reusability of data.
Data sharing and reuse in the sciences have been topics of
growing attention in recent years. Changes in scientific
practices (Bell, Hey, & Szalay, 2009; Hey & Trefethen,
2003) and changes in journal and grant agency policies
(National Institutes of Health, 2007; National Science
Foundation, 2010) are driving scientists to attempt to share
and reuse data.
There are known benefits to reusing data, including extracting additional value from existing data, avoiding the reproduction of research, asking new questions of existing data, and advancing the state of science in general (Borgman, 2012; Lord & Macdonald, 2003).
The U.S. National Science Foundation created the
Sustainable Digital Data Preservation and Access Network
Partners (DataNet) to support the development of long-term
sustainable data infrastructures, interoperable data
preservation and access, and cyberinfrastructure capabilities
(National Science Foundation, 2006). The Data Observation
Network for Earth (DataONE) provides cyberinfrastructure
for “open, persistent, robust, and secure access to well-described and easily discovered earth science observational
data” (DataONE, 2013). Scientists participating in
DataONE are able to deposit, search, and reuse data
available through the various DataONE tools.
The purpose of this research is to determine how scientists reuse data. The study uses a quasi-experiment think-aloud method to examine data reusability. DataONE serves as the test environment for this study.
BACKGROUND LITERATURE
Reuse of scientific data provides a number of benefits.
These benefits include advances in scientific development
by avoiding duplication of work, allowing new questions to be asked of existing data, and encouraging diversity in
analysis (Borgman, 2012; Lord & Macdonald, 2003).
Additionally, funding agencies and journals have put
pressure on scientists to make their data available. Funding
agencies have data sharing policies in place for data created
through major grants (National Institutes of Health, 2007;
National Science Foundation, 2010).
Studies have gauged scientists’ attitudes toward sharing and reuse, and these studies indicate that scientists want data sharing to be a norm in science (Borgman, 2012; Ceci, 1988; Lord & Macdonald, 2003). Additionally, studies have
described motivators for data sharing and have found that
ideas of data ownership, previous assistance from
coworkers, journal policies, and grant agency requirements
were some of the motivations for data sharing (Blumenthal
et al., 2006; Constant, Kiesler, & Sproull, 1994; Tenopir et
al., 2011). Reasons for not sharing data include financial
concerns, lack of time, lack of organizational support, lack
of documentation and complexity of metadata standards, as
well as the difficulty of anticipating intended users (Birnholtz
& Bietz, 2003; Tenopir et al., 2011; Zimmerman, 2003).
Researchers have also examined journal policies and data deposition, investigating whether data sharing policies influence data deposition rates. These studies have indicated that
while no journal has complete compliance, much research
data is deposited along with the article (Noor, Zimmerman,
& Teeter, 2006; Ochsner, Steffen, Stoeckert, & McKenna,
2008). Additionally, several studies have investigated specific factors associated with data deposition. These studies have shown that articles by more experienced authors and articles in high-impact-factor journals were more likely to have associated data deposited alongside them (Piwowar, 2011; Piwowar & Chapman, 2010).
While the above studies have provided much information about the current data sharing and reuse environment for scientists, they have not determined what information scientists need about the data to determine reusability. To address this research gap, a quasi-experiment think-aloud was developed to further understand what information scientists need to determine data reusability. Think-aloud protocols are useful in providing an understanding of the cognitive processes and knowledge acquisition involved in decision making, and they introduce fewer interpretation errors than other methods (Oh & Wildemuth, 2009; Someren, Barnard, & Sandberg, 1994).
The quasi-experimental counter-balanced design was chosen to test what information is needed to determine if data are reusable through a manipulation of the information presented to the participant regarding results. In a counterbalanced design, the comparison of interest is within each subject’s performance in the multiple treatment conditions, and therefore multiple treatments or interventions are applied to each subject (Hank & Wildemuth, 2009).
RESEARCH QUESTIONS
The overarching question of this study is: how do scientists determine the reusability of data?
More specifically:
• How does information about the data, such as information regarding metadata standards, provenance information, research methods, and instruments, influence scientists’ ability to determine if that data is reusable?
• How does this information assist scientists in their ability to reuse this data?
RESEARCH METHODS & PROCEDURES
Participant Recruitment
Sixteen participants were recruited through face-to-face recruitment at scientific conferences (American Geophysical Union and Geological Society of America), email listservs at the University of North Carolina at Chapel Hill and North Carolina State University, targeted emails to specific UNC and NCSU departments, CODATA, Drexel University, and through word of mouth.
There were several major recruitment efforts in summer 2015, fall 2015, and spring 2016. Participants were paid $20 for their participation, either in cash or as an Amazon gift card. The study was conducted either face-to-face or via screen share over Skype and lasted approximately 45 minutes to 1 hour, with the longest session lasting 1 hour 30 minutes and the shortest lasting 43 minutes.
Quasi-Experimental Think-Aloud Set Up
Participants were walked through the experimental interface
using the Qualtrics survey software. They were presented
the search interface and four canned results.
The researcher incorporated a counter-balanced design into the experimental set-up in order to test which pieces of information assisted scientists in determining the reusability of data.
The counter-balanced design was set up as shown in Figure 1.
Participant 1: X1 O X2 O X3 O X4 O
Participant 2: X2 O X4 O X1 O X3 O
Participant 3: X3 O X1 O X4 O X2 O
Participant 4: X4 O X3 O X2 O X1 O
Figure 1. Counter-Balanced Design
For this study:
• X1 refers to canned result 1, which contained robust metadata that included an abstract, a research methods section, and a unit/attribute list.
• X2 refers to canned result 2, which contained basic metadata robustness, an abstract, and a research methods section.
• X3 refers to canned result 3, which contained basic metadata robustness and an abstract.
• X4 refers to canned result 4, which contained basic metadata robustness and a research methods section.
This design was used to ensure that participants’ responses
to the different canned results were not based on the order
of presentation.
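To make the counterbalancing concrete, the sketch below (a Python illustration, not the study’s actual materials) encodes the four presentation orders from Figure 1, checks that each canned result appears exactly once in every presentation position, and assigns participants to orders in rotation; the cyclic assignment of the sixteen participants to the four orders is an assumption made for illustration.

# Illustrative sketch of the counterbalanced design in Figure 1 (not the author's code).
# Each row is one presentation order of the four canned results (X1-X4).
ORDERS = [
    ["X1", "X2", "X3", "X4"],  # order shown for Participant 1 in Figure 1
    ["X2", "X4", "X1", "X3"],  # order shown for Participant 2
    ["X3", "X1", "X4", "X2"],  # order shown for Participant 3
    ["X4", "X3", "X2", "X1"],  # order shown for Participant 4
]

def check_counterbalanced(orders):
    """Verify that every canned result appears exactly once in each presentation position."""
    for position in range(len(orders[0])):
        results_at_position = {order[position] for order in orders}
        assert len(results_at_position) == len(orders), f"Position {position + 1} is not balanced"

def assign_orders(n_participants, orders=ORDERS):
    """Assign each participant an order by cycling through the rows (assumed, not stated in the paper)."""
    return {p + 1: orders[p % len(orders)] for p in range(n_participants)}

if __name__ == "__main__":
    check_counterbalanced(ORDERS)
    for participant, order in assign_orders(16).items():
        print(f"Participant {participant}: {' -> '.join(order)}")

Because each canned result occupies each presentation position exactly once across the four orders, any order effects are distributed evenly across conditions.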
Quasi-Experimental Think-Aloud Procedures
There were multiple data gathering points, including the post-result usefulness survey, the rank-order survey, the post-search survey results (general questions, open-ended questions, data reuse factors survey, and demographic survey), and the think-aloud, which gathered data throughout.
Participants were asked to think aloud about what information was useful to them with regard to potentially reusing the data. After each canned result, they were asked to rate the result’s usefulness on a scale of 1 to 5 (Post-Result Usefulness Survey). Additionally, after seeing all results, they were asked to rank them from most useful to least useful (Rank-Order Survey). Lastly, they completed the Post-Search Survey, which contained general questions, open-ended questions, a data reuse factors survey, and a demographic survey. Figure 2 below provides a visual of the procedures.
Figure 2. Quasi-Experiment Think-Aloud Procedures/Data Collection
RESULTS
Demographic Results
Of the sixteen participants, 56% were male and 44% were female, and all were students, professors, or researchers in the earth sciences. Six participants considered their primary area of expertise to be Geology, four Ecology, two Atmospheric Science, two Environmental Science, one Physics, and one Hydrology. Participants’ sub-disciplines included paleoecology, geophysics, seismology, macro-ecology, evolutionary biology, planetary geology, sedimentology, and coral reef conservation. Of the participants, 62.5% were students, 31.25% were professors, and 6.25% worked in a professional organization; additionally, 31.3% held Ph.D.s, 37.5% held master’s degrees, and 31.3% held bachelor’s degrees. None of the participants had used the DataONE system.
Post-Result Usefulness Survey
Canned result #1 was found to be the most useful, with a mean usefulness score of 3.56, and canned result #3 was found to be the least useful, with a mean of 2.25. As a reminder, canned result #1 contained the most robust metadata, including an abstract, a research methods section, and a unit/attribute list; canned result #2 did not contain a unit/attribute list but contained the abstract and methods section; canned result #3 contained an abstract but no methods section; and canned result #4 contained a methods section but no abstract. As shown in Figure 3, there was a preference for more information over less information, and a preference for the methods section over the abstract section.
Canned Result #1: 3.56 (.81)
Canned Result #2: 3.31 (.79)
Canned Result #3: 2.25 (1.00)
Canned Result #4: 2.31 (.79)
Figure 3. Post-Result Usefulness Survey (Mean (SD))
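As a minimal illustration of how the mean (SD) values in Figure 3 can be computed, the Python snippet below summarizes a set of 1-to-5 usefulness ratings; the rating values shown are hypothetical placeholders, not the study’s data.

# Hypothetical example of summarizing post-result usefulness ratings as mean (SD).
# The ratings below are placeholders; statistics.stdev gives the sample standard deviation.
from statistics import mean, stdev

ratings = {
    "Canned Result #1": [4, 3, 5, 4, 3, 4],
    "Canned Result #2": [3, 4, 3, 3, 4, 3],
    "Canned Result #3": [2, 3, 1, 2, 3, 2],
    "Canned Result #4": [2, 3, 2, 2, 3, 2],
}

for result, scores in ratings.items():
    print(f"{result}: {mean(scores):.2f} ({stdev(scores):.2f})")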
Rank Order Survey
After seeing all canned results, participants were also asked to rank the canned results in order of usefulness for potential data reuse, from most useful to least useful. Figure 4 shows the results of these rankings. Canned result #1 was ranked highest most often, and canned result #3 was ranked lowest most often. Additionally, canned result #2 was most often ranked second, while canned result #4 was most often ranked third.
Figure 4. Rank Order Survey Results
Post-Search Survey (Reuse Factors)
Participants took a short survey to examine what information is needed in order to determine the reusability of data. Participants were provided options for 1) metadata standard, 2) provenance information, 3) permission and intellectual property, 4) instrumentation, and 5) research methods, and were asked to rate each from 1 to 7, with 1 being Not at all Important, 2 being Very Unimportant, 3 being Somewhat Unimportant, 4 being Neither Important nor Unimportant, 5 being Somewhat Important, 6 being Very Important, and 7 being Extremely Important. Figure 5 shows the mean and standard deviation for each factor.
Factor: Mean (SD)
Metadata Standard: 4.94 (1.53)
Provenance Information: 5.25 (1.18)
Intellectual Property Information: 4.75 (1.29)
Instrument Information: 5.88 (1.5)
Research Methods Information: 6.13 (1.45)
Other: 6.6 (0.52)
Figure 5. Post-Search Survey (Reuse Factors)
Post-Search Survey (Open-Ended Questions Results)
Along with the survey above, participants also had the opportunity to answer several open-ended questions regarding data reuse. As a reminder, these are:
1. When looking at the search results above, what information did you need to determine if the data is relevant?
2. In regard to the DataONE system, what information inhibits your ability to reuse data?
3. In regard to the DataONE system, what information facilitates your ability to reuse data?
4. When thinking about the DataONE system, what information did you need that the system did not provide?
Results Open-Ended Question 1
Ten participants suggested that the methods and the
attribute table were the information needed to determine if
the data was relevant. For example, participant 1 stated
“The description box and the methods data was important
in determining whether the data was relevant to my needs”;
by description box they were referring to the attribute table.
Six participants suggested that the data description was
important for determining relevance. In fact, participant 12
suggested it was the most important, stating, “The short
summary of the dataset present in the final search result
was perhaps the most pertinent information needed, but
was unfortunately buried at the bottom of the result. The
short summary provided the contents of the dataset, and a
quick look of whether or not it would be applicable.”
These short summaries will be described in further detail in the discussion section.
Other items that were described as important for determining relevance included: abstract (3 participants), temporal information (3 participants), and provenance (2 participants). Participant #5 stated they needed to know the “who, what, when, where, and how of the data” in order to determine if it was relevant for reuse for them. Participants also made suggestions for information that was not currently in the canned results example, including: field collection, uncertainty information, experimental setup, instruments and calibration, and data analysis techniques.
Results Open-Ended Question 2
In regard to what participants found inhibiting about the DataONE system, four participants suggested that not knowing enough
about the data format inhibited their ability to reuse the
data. One participant stated, “The data may be in a format
that is difficult to extract. Not knowing the data of the
format may lead to the user to not want to use the data.”
Additionally, two participants suggested that there was too much information; however, one of them suggested that the organization of that information was the real problem, stating, “Organization of information, for
example, the dataset description was vital to understanding
the dataset, however, was near the bottom. Also, I would
prefer more of a snapshot of the data rather than the long
list.”
Other factors that participants suggested inhibited their
ability to reuse the data included: unknown data quality and
no links to secondary publications. Two participants stated
that there were no factors that inhibited their ability to reuse the data, and three participants left the box blank or put dashes in the box, implying they did not consider there were
any factors that inhibited their ability to reuse the data.
Results Open-Ended Question 3
Several participants suggested the layout of the page
facilitated their ability to reuse the data. They stated that
they “liked that all of the information was on one page”
and the “easy to follow layout” of the page. Additionally,
five participants stated that the attribute and unit list table
facilitated their ability to reuse the data; four stated the
methods section facilitated their ability to reuse the data,
and three stated that the data description/summary
facilitated their ability to reuse the data. Participants also
stated their appreciation for the licensing information, the
abstract, the geographic information (particularly the
coordinates), and the instrument information. One
participant stated they appreciated that the DataONE
provided the “ability to quickly see most important aspects
of study instead of having to read an entire article.”
Only two notes by participants suggested areas of
improvement, which included having a clearer location to
download the data and a clearer licensing summary. It was
suggested that both of these should be simple “download
data here” and “data can be used” buttons, respectively.
Results Open-Ended Question 4
Lastly, in regard to information that the DataONE system did not provide that they would have wanted to have, four
participants suggested they would like more information
about the actual data itself, a snapshot or description of the
data. One participant suggested that they “would like to get
a preview of the raw data. The dataset may contain other
information that is not displayed that can be useful for my
study.” Another participant explained why more information about the actual data was so important, stating, “In
the case of datasets stored in particular formats, knowing
the format and size of a dataset could be critical. For
instance, if I needed image files in .PNG format, it would
save time if I knew that a dataset were .JPG only. Similarly,
knowing the size of the dataset could be critical (if I don't
want to download 10 TBs worth of seed storage
temperature data).”
Additionally, two participants stated that along with the
bounding coordinates they would like a map. Other suggestions included: more information on the history of the data manipulation/provenance, sample size, publications, storage, uncertainty information, and naming conventions. One participant also suggested it would be nice to be able to control the information that was provided through the use of drop-down menus.
Additionally, six participants either did not answer this
question or drew a dash, suggesting they were pleased with
the information provided by the DataONE.
Think-Aloud Results
While participants were thinking aloud, the researcher took notes on their thoughts regarding the information presented in each canned result and whether this information was useful for determining the reusability of the data. The following provides a summary of these results by canned result.
Think-Aloud Results: Canned Result #1
The majority of the participants suggested that they
appreciated the data description, attribute table, and
research methods information and stated that these items
were all important for them to determine if the data was
reusable. As shown in the usefulness results, most participants agreed that Canned Result #1 was particularly useful, and it ranked #1 in the rank order. In general, most participants preferred having too much information rather than not enough information. One participant stated they preferred
this because it provided the “who, what, when, where, and
how” of the data, and that was the information they
needed to determine if the data was reusable.
On two occasions participants stated that there was too much information; however, this was not the opinion of the majority. Some suggestions included providing the type of data (e.g., experiment, field, sensor) and adding a drop-down menu so users can determine what they want to look at. Others suggested moving the data description up the page to make it more prominent. Nearly all participants saw
the data description, attribute table, and research methods as
the most vital pieces of information for their determination
of reusability.
Think-Aloud Results: Canned Result #2
The majority of participants indicated that they appreciated having the methods and abstract (particularly the methods), which were useful in determining whether they would be able to reuse the data. However, seven participants did state that they wanted more information, such as a description of the data or a snippet of the data. Even those participants who had not seen the data description and attribute table stated they “really wants a short description of the actual dataset” (P4). Without the data description and attribute table, participants stated that they did like the conciseness of this result. They stated that the abstract and methods were very useful. Other items that were helpful were the bounding coordinates, contact information, and keywords.
Think-Aloud Results: Canned Result #3
Most participants really found canned result #3 hard to
work with. Participant #2 stated “not enough information”
and participant 5 stated “mostly secondary information”.
In general, participants stated that the abstract was too much text to parse through, which made it difficult to determine whether the data was reusable. They stated they did not like the “wall of text” organization. Additionally, they stated that a description of the data and an abstract were not the same thing and wanted to know if the abstract was a
paper abstract or a data abstract. From the abstract available
this distinction was not clear to participants. These results
were similar to the results from the usefulness and rank
order surveys. For participants, this was the least useful of
all of the canned results.
Think-Aloud Results: Canned Result #4
For canned result #4, the majority of participants agreed that the most helpful item for determining whether the data was reusable was the research methods. Participants who had already seen canned result #3 suggested that they preferred having the methods over having the abstract, indicating that the methods were more important. One of the reasons participants preferred the methods over the abstract was that it was easier for them to parse the information from the methods. Additionally, they were able to find out information such as data collection and experimental set-up if the methods were available; however, this information would not always be in an abstract. Those who had already seen canned result #1 did state that they still preferred having the attribute table and data description, but found the methods valuable. These results were similar to the results of the usefulness and rank order surveys; participants ranked this result third overall.
CONCLUSION
This research provides insight into what information is needed in order for scientists to determine the reusability of data. While there has been much discussion of the importance of data sharing and reuse, changes in journal and granting agency policies, and the reasons scientists want to share and reuse each other’s data, very little attention has been given to the actual reuse of data and what information is needed to reuse it. This study provides meaningful insights into how scientists determine reusability, which will assist those creating cyberinfrastructure, such as DataONE, that are attempting to enable scientists to reuse data.
ACKNOWLEDGMENTS
The author would like to thank Dr. Jane Greenberg for her
guidance and support. The author would also like to thank
the DataONE for their support.
REFERENCES
Bell, G., Hey, T., & Szalay, A. (2009). Beyond the data deluge. Science, 323(5919), 1297–1298. http://doi.org/10.1126/science.1170411
Birnholtz, J. P., & Bietz, M. J. (2003). Data at work: Supporting sharing in science and engineering. In Proceedings of the 2003 International ACM SIGGROUP Conference on Supporting Group Work (pp. 339–348). New York, NY: ACM Press. http://doi.org/10.1145/958160.958215
Blumenthal, D., Campbell, E. G., Gokhale, M., Yucel, R., Clarridge, B., & Hilgartner, S. (2006). Data withholding in genetics and the other life sciences: Prevalences and predictors. Academic Medicine, 81(2), 137–145.
Borgman, C. L. (2012). The conundrum of sharing research data. Journal of the American Society for Information Science and Technology, 63(6), 1059–1078. http://doi.org/10.1002/asi.22634
Ceci, S. J. (1988). Scientists’ attitudes toward data sharing. Science, Technology, & Human Values, 13(1/2), 45–52.
Constant, D., Kiesler, S., & Sproull, L. (1994). What’s mine is ours, or is it? A study of attitudes about information sharing. Information Systems Research, 5(4), 400–421.
DataONE. (2013). What is DataONE? Retrieved January 2, 2014, from http://www.dataone.org/what-dataone
Hank, C., & Wildemuth, B. M. (2009). Quasi-experimental studies. In B. M. Wildemuth (Ed.), Applications of social research methods to questions in information and library science (pp. 93–104). Westport, CT: Libraries Unlimited.
Hey, T., & Trefethen, A. E. (2003). The data deluge: An e-Science perspective. In F. Berman, A. J. G. Hey, & G. C. Fox (Eds.), Grid computing: Making the global infrastructure a reality (pp. 809–824). Wiley and Sons. Retrieved from http://en.scientificcommons.org/2325382
Lord, P., & Macdonald, A. (2003). e-Science curation report: Data curation for e-Science in the UK: An audit to establish requirements for future curation and provision (pp. 1–84). Twickenham, England: The JISC Committee for the Support of Research (JCSR). Retrieved from http://www.jisc.ac.uk/uploaded_documents/eScienceReportFinal.pdf
National Institutes of Health. (2007). NIH data sharing policy. Retrieved from http://grants.nih.gov/grants/policy/data_sharing/
National Science Foundation. (2006, November 7). Sustainable Digital Data Preservation and Access Network Partners (DataNet). Retrieved February 4, 2014, from http://www.nsf.gov/pubs/2007/nsf07601/nsf07601.htm
National Science Foundation. (2010, November 10). Dissemination and sharing of research results. Retrieved from http://www.nsf.gov/bfa/dias/policy/dmp.jsp
Noor, M. A. F., Zimmerman, K. J., & Teeter, K. C. (2006). Data sharing: How much doesn’t get submitted to GenBank? PLoS Biology, 4(7), e228. http://doi.org/10.1371/journal.pbio.0040228
Ochsner, S. A., Steffen, D. L., Stoeckert, C. J., & McKenna, N. J. (2008). Much room for improvement in deposition rates of expression microarray datasets. Nature Methods, 5(12), 991. http://doi.org/10.1038/nmeth1208-991
Oh, S., & Wildemuth, B. M. (2009). Think-aloud protocols. In B. M. Wildemuth (Ed.), Applications of social research methods to questions in information and library science (pp. 178–188). Westport, CT: Libraries Unlimited.
Piwowar, H. A. (2011). Who shares? Who doesn’t? Factors associated with openly archiving raw research data. PLoS ONE, 6(7), e18657, 1–13. http://doi.org/10.1371/journal.pone.0018657
Piwowar, H. A., & Chapman, W. W. (2010). Public sharing of research datasets: A pilot study of associations. Journal of Informetrics, 4, 148–156. http://doi.org/10.1016/j.joi.2009.11.010
Someren, M. W. van, Barnard, Y. F., & Sandberg, J. (1994). The think aloud method: A practical guide to modelling cognitive processes. London; San Diego: Academic Press.
Tenopir, C., Allard, S., Douglass, K., Aydinoglu, A. U., Wu, L., Read, E., … Frame, M. (2011). Data sharing by scientists: Practices and perceptions. PLoS ONE, 6(6), e21101, 1–21.
Zimmerman, A. S. (2003). Data sharing and secondary use of scientific data: Experiences of ecologists.