Research Lattices: Towards a scientific hypothesis data model

Bernardo Gonçalves ([email protected])
Fabio Porto ([email protected])
LNCC – National Laboratory for Scientific Computing
Av. Getúlio Vargas 333, Petrópolis, Brazil
ABSTRACT
As the problems of scientific interest rise in scale and complexity, scientists have to tacitly manage too many analytic
elements. Hypotheses are worked out to drive research towards successful explanation and prediction, which characterizes science as a dynamic activity that is partially ordered
towards progress. This paper motivates and introduces research lattices, carrying out a lattice-theoretic approach for
hypothesis representation and management in large-scale
science and engineering. The goal of this work is to equip scientists with tools to manipulate and query hypotheses while
keeping track of research progress. We refer to SciDB’s array data model and discuss how data and theories could be
managed in a unified model management framework.
Categories and Subject Descriptors
H.2.1 [Database Management]: Logical Design—Data
models; H.2.3 [Database Management]: Languages
General Terms
Languages, Design
Keywords
Scientific Hypothesis, Research Progress, Large-scale Science, Lattice Theory, Scientific Databases.
1. INTRODUCTION
Large-scale science increasingly relies on database theory
and technology in order to deal with the growing amount
of observational data. As the problems of interest rise in
scale and complexity [16], however, scientists have to tacitly manage too many analytic elements. Fig. 1 recalls the
scientific method, highlighting that data-driven science is,
likewise, aimed at the formulation and evaluation of hypotheses [3]. We point out the stage of hypothesis formulation,
drawing on the dynamics of science: hypotheses are worked
out to drive research towards increasingly successful explanation and prediction [11].
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are not
made or distributed for profit or commercial advantage and that copies bear
this notice and the full citation on the first page. Copyrights for components
of this work owned by others than ACM must be honored. Abstracting with
credit is permitted. To copy otherwise, or republish, to post on servers or to
redistribute to lists, requires prior specific permission and/or a fee. Request
permissions from [email protected].
SSDBM ’13, July 29 - 31 2013, Baltimore, MD, USA
Copyright 2013 ACM 978-1-4503-1921-8/13/07 $15.00
In large-scale biomedical research, hypothesis representation
has been receiving attention by means of semantic web technologies [14]. Hypothesis knowledge-base systems (KBS)
(viz., [13, 5, 8]) have allowed hypotheses about genes to be
explicitly proposed and tested against observational data.
Most of the analytics of large-scale science, however, is mathematically oriented and prominently
based on a multidimensional array data model [15]. Scientific analytics need a more general approach to capture what
is essential about Problem 1.
Problem 1. Representation and management of hypotheses in large-scale science and engineering.
We argue that scientists should be equipped with tools in order to manipulate and query hypotheses while keeping track
of their research progress. It is the goal of this paper to motivate and introduce research lattices, carrying out a lattice-theoretic approach to the problem. It is agnostic w.r.t.
the specific data model used by a scientific project, but we
shall refer to SciDB’s array model in order to discuss how
data and theories could be managed in a unified framework
along the lines of model management [1].
Section 2 discusses the problem background. Section 3 reviews related work. Section 4 introduces our abstraction to
the problem, which is founded in Lattice Theory [4]. Section
5 concludes the paper.
2. PROBLEM BACKGROUND
Unlike historical efforts to represent and manage the growth
of scientific knowledge [10], the age of data-driven science
opens the possibility of developing database technology for
integrating both data and theories in the same framework.
In order to achieve that, nevertheless, scientific knowledge
should be captured when it is still provisional, viz., by tracking hypotheses as a built-in database capability. In fact, for
the sake of rigor and reproducibility in large-scale scientific
discovery, scientists should be provided with a hypothesis
query language expressive enough for their own long-standing principles as they emerged from scientific practice (ibid.):
(P1) Consistency with existing knowledge;
(P2) Agreement with observation;
(P3) Falsifiability;
(P4) Parsimony;
(P5) Conceptual integration;
(P6) Breadth of scope;
(P7) Fertility.
[Figure 1 here: flow chart spanning the Context of Discovery and the Context of Justification, with stages Phenomenon observation → Hypothesis formulation → Computational modeling → Testing against data → valid? (yes: Publishing results; no: back to Hypothesis formulation).]

Figure 1: A simplified flow chart of the scientific method life cycle. It highlights hypothesis formulation and
a backward transition to reformulation if prediction (and/or explanation) disagrees with observation.
Trade-offs in P1–P7 pose a design issue for hypothesis management, one related to the no-overwrite tenet required by
scientists for observational data management [15]. In practice [9], hypothesis reformulation (Fig. 1) may not mean
overwriting the hypothesis of interest, say hq . Instead, scientists may insert a new hypothesis hs into the research to
compete in further analytics with hq ; or rather, they may
carry out a problem-shift on hq by conjecturing an hs to
account for ‘anomalous phenomena’ instead of putting hq
into question (ibid.). These are some of the reasons why
hypotheses are a highly interconnected kind of data.
Theoretical hypotheses must be combined with observation-level assumptions (e.g., initial and boundary conditions) for
(P3) actually deriving testable statements about phenomena. Also, ¬(P2–P4) auxiliary hypotheses can be inserted
(initially as workarounds) into a research program, as was
the case with Newtonian mechanics, when Neptune could
only be predicted near its observed location after connecting “A
trans-Uranic planet exists” into the system [9]. That illustrates not only that hypotheses are interconnected, but also
how principles P2–P4 may be balanced with, say, P1/P5–
P7, by means of this very interconnection. In recognition
that refutation neither is nor should be followed invariably
by rejection, research programs were eventually considered
more proper units of appraisal than individual hypotheses
[10]. In Section 4 we refer to both as representation units.
In our pursuit of Problem 1, large-scale research is to be operated under constraints and tracked by the scientist end-users. We raise the level of abstraction to capture the function of hypotheses as the drivers of research towards progress. We have set as research questions to find:

• Units of representation for hypothesis-driven research;
• Operations for hypothesis manipulation and querying;
• Constraints for the structure of scientific progress.

3. RELATED WORK

Recent initiatives are addressing Problem 1 in large-scale biomedical research: (I1) Robot Scientist [8] is a KBS for automated generation and testing of hypotheses about which genes encode enzymes in the yeast organism; (I2) HyBrow [13] is a KBS for scientists to test their hypotheses about events of the galactose metabolism of the same organism; and (I3) SWAN [5] is a KBS giving scientists shared access to hypotheses about causes of the Alzheimer disease. All of them (I1–I3) use an OWL ontology on top of the RDF data model. Initiatives I1–I2, in particular, also keep logic programming analytics hardwired in the application layer for generating and/or testing hypotheses of the kind ‘gene G has function A’ against RDF-encoded data. Initiative I3, in turn, forgoes functionality for hypothesis evaluation to focus on descriptive aspects: hypotheses are high-level natural language statements retrieved from publications. Each instance of the OWL class Hypothesis is related to claims (a more general OWL class) that support it on the basis of some experimental evidence (also RDF-encoded gene/protein information). This initiative (I3) is related to efforts on the semi-automatic retrieval of hypotheses (or claims) from the narrative structure of scientific reports.

All of I1–I3 have the potential to provide scientists with a community-wide (I3) platform in which their hypotheses can be explicitly represented and tested (I1–I2). I1–I2 directly support functional genomics science in terms of P1–P3, while I3 indirectly supports Alzheimer research in terms of P1–P3. All of I1–I3, however, seem to be opaque w.r.t. P4–P7. In our view, this is because they lack a standard abstraction for hypothesis interconnection, as motivated in the previous section.
4. RESEARCH LATTICES
A research lattice is a DB abstraction for hypothesis-driven
research, founded on Lattice Theory [4]. A preliminary description from a theoretical point of view can
be found elsewhere [6]. In this paper we focus on our data
modeling choices for the questions above in light of P1–P7.
4.1 Units of Representation
Hypotheses are interconnected to drive research towards
successful explanation and prediction. Science, then, can be
understood as a dynamic activity that, unlike other human
endeavors, is partially ordered towards progress [11]. We
model hypothesis-driven research as a lattice [4], a special
kind of partially-ordered set (poset).
A research poset R = ⟨R; ≤⟩ is a non-empty set R equipped
with a reflexive, anti-symmetric and transitive relation ≤
[6]. Research posets are bounded by definition,¹ and they
are constrained to be lattices so as to resemble the structure
of scientific progress in light of (P4) parsimony (cf. Section 4.3). From the point of view of logical-level representation, we write xi ≤ xj if xi ‘is based on or equal to’ xj,
and xi ∥ xj if xi ‘is incomparable to’ xj. Each analytic element x ∈ R is a distinguishable entity, even if it is inferred
equivalent to some other element y ∈ R. Scientist users are
supposed to operate (insert/delete/update) covering pairs
xi ≺ xj, read xi ‘is directly based on’ xj, if xi ≤ xj and, for
no x, xi < x < xj. Relation ‘≺’ is enough to determine a
finite poset, and it is used to build its Hasse diagram [4].
¹A zero of a poset P = ⟨P; ≤⟩ is an element 0 such that
0 ≤ x for all x ∈ P. A unit, 1, satisfies x ≤ 1 for all x ∈ P.
A bounded poset is one that has both 0 and 1.
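Since ‘≺’ determines the whole finite order, the ‘is based on or equal to’ relation ≤ can be recovered as the reflexive-transitive closure of the covering pairs. A minimal sketch in Python, over illustrative element names (not part of the model itself):

```python
# Covering pairs (x, y) meaning x ≺ y, read "x is directly based on y".
# Illustrative elements; "top" and "bot" play the roles of bounds 1 and 0.
covers = {
    ("h2", "h1"), ("h2", "d1"),    # a testable h2 builds on h1 and d1
    ("h1", "top"), ("d1", "top"),  # maximal elements sit directly under 1
    ("bot", "h2"), ("bot", "d2"), ("d2", "top"),
}

def leq(x, y):
    """x <= y iff x == y or some upward ≺-path leads from x to y."""
    return x == y or any(a == x and leq(b, y) for a, b in covers)
```

Here `leq("h2", "h1")` holds, while `h1` and `d1` remain incomparable, matching the intended reading of the Hasse diagram.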
[Figure 2 here: Hasse diagram with ⊤ at top; below it (h1) Law of free fall [d²s/dt² = 32] and (d1) Fall from rest [(ds/dt = 0, s = 0) at t = 0]; below both, (h2) Fall in quadratic time [s = 16t²]; (d2) Observed data above ⊥:

    Time (secs):  0    1     2     3      4      5      6      7      8
    Dist. (feet): 0.0  16.1  64.2  144.5  256.8  401.3  577.8  786.5  1027.2 ]

Figure 2: Hasse diagram of Galileo’s example research lattice. Connections (line segments) go upward from hi to hj if hi ≺ hj. Elements ⊤ and ⊥ are
specially defined to be the 1 and 0 of R = ⟨R; ≺⟩. Dotted
lines indicate the insertion of element d2.
As a first illustration of the concept, let us abstract Galileo’s
research on free falling bodies (cf. [7]) as a research lattice.
For his (h1 ) law of free fall to be actually falsifiable, it must
be provided with (d1 ) initial conditions in order to derive
(testable) h2 by rules of the integral calculus.
(h1). Every body near the Earth free falling towards the
Earth falls with an acceleration of 32 feet per second per
second. [d²s/dt² = 32].
(d1). The fall starts from rest. [ds/dt = 0, s = 0; at t = 0].
(h2). Every body starting from rest and free falling towards the Earth falls 16t² feet in t seconds. [s = 16t², given
d²s/dt² = 32 and (ds/dt = 0, s = 0) at t = 0].
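The derivation of h2 from h1 and d1 can also be spot-checked numerically, e.g. by integrating h1 under d1 with plain Euler steps (the step size is an illustrative choice):

```python
# h1 gives the ODE d^2 s/dt^2 = 32; d1 gives s(0) = 0 and (ds/dt)(0) = 0.
# Integrating twice should reproduce h2: s(t) = 16 t^2.
dt = 1e-4                  # step size (illustrative choice)
n = int(round(8.0 / dt))   # integrate over t in [0, 8] seconds
v = s = 0.0
for _ in range(n):
    s += v * dt            # ds/dt = v
    v += 32.0 * dt         # dv/dt = 32
# s is now within Euler error of the closed form 16 * 8**2 = 1024 feet
```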
Fig. 2 shows the diagram of Galileo’s research lattice. It
highlights the algorithmic insertion of (d2 ) observed data to
be confronted against h2 . We refine the set of analytic elements R into disjoint complete subsets R+ (theories) and
R− (data), R = R+ ∪ R− . Now, consider SciDB’s multidimensional array data model (cf. [15]) in order to ground
the research lattice abstraction: every array is structured as
dimension coordinates (defining a cell ) and a set of typed
attributes to be assigned values (defining tuples) along each
cell. Then Galileo’s hypothesis h1 should have array schema
H1 with attribute a along (1-D) time cells t ranging 0 : 8.
AQL% CREATE ARRAY H1 <a:double>[t=0:8,9,0];
AFL% store(build(<a:double>[t=0:8,9,0],32),H1);
where a is the body’s acceleration (d2 s/dt2 ), which is set
a = 32 for t=0:8, like a (materialized) view in relational
databases. Accordingly, D1’s schema should be:
AQL% CREATE ARRAY D1 <v:double,s:double>[t=0:0,1,0];
AQL% INSERT INTO D1 ‘[(0,0)]’;
where s is the body’s traveled distance (s), and v its velocity
(ds/dt). A mapping from d1 to h1 should cast (0, 0, 32)t=0
and generate further new tuples in H2, producing in AFL:
AFL% store(build(<s:double>[t=0:8,9,0],16*pow(t,2)),H2);
Hypothesis h2 ’s predicted data, Ih2 (H2), is structurally equivalent to (d2 ) observed data, Id2 (H2), and can be matched
with it for hypothesis validation through data analysis.
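Outside SciDB, the cell-wise match between predicted Ih2(H2) and observed Id2(H2) can be sketched as follows; the table values come from Fig. 2, while the 5% tolerance is an illustrative choice:

```python
# Observed data d2 and h2's predictions over the same 1-D time dimension.
observed = {0: 0.0, 1: 16.1, 2: 64.2, 3: 144.5, 4: 256.8,
            5: 401.3, 6: 577.8, 7: 786.5, 8: 1027.2}
predicted = {t: 16 * t**2 for t in observed}   # h2: s = 16 t^2

# Cell-wise residuals; agreement within a (hypothetical) 5% tolerance.
residuals = {t: observed[t] - predicted[t] for t in observed}
agrees = all(abs(residuals[t]) <= 0.05 * max(predicted[t], 1)
             for t in observed)
```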
4.2 Operations
Operations have to capture details at both the research and
the analytic-element levels. Firstly, we have designed algorithmic insert/delete/update operations to support the user
in managing connections of the kind hi ≺ hj at the research
level [6]. Now we are working on their refinement at the
analytic element level in a model management context [1].
We explore use cases of the operations (e.g., as in Galileo’s
example) in light of P1–P7 to provide current scientific data
programmability with abstractions of the scientific method.
We pay particular attention to scientists’ patterns of discovery in the manipulation of hypotheses [7]. E.g., a new
hypothesis may be inserted as a merge of prior hypotheses:
the user projects some structure from the former and may
drop other structure, possibly with a leap of creation (new structure).
Operations such as merge, match, extract, diff, invert, compose involve semi-automatic engineered mappings [1]. These
usually take three steps: (i) schema matching, (ii) design of
mapping constraints and (iii) executable transformation. In
Galileo’s example (Fig. 2), h2 ≺ h1 and h2 ≺ d1 shall comprise precise mappings Mh2 ,h1 and Mh2 ,d1 . This approach
has the potential to go beyond just tracking the lineage of
analytic elements, to actually account for their responsibility
(cf. [12]) on the results of analytic life cycles (Fig. 1).
4.3 Constraints
Research lattices are closed under the insert/delete/update
operations. That is, these operations take lattice R as input
and return lattice R′ as output, preserving the lattice
properties as a special poset [6], which are the following.
Let H ⊆ P and a ∈ P, for an arbitrary poset ⟨P; ≤⟩. Then a is
an upper bound of H if h ≤ a for all h ∈ H. An upper bound
a of H is the least upper bound of H, or supremum of H, if
for any upper bound b of H we have a ≤ b. It is written a =
sup H, and its uniqueness can be shown straightforwardly.
The concepts of lower bound and greatest lower bound, or
infimum, are defined dually. The latter is denoted by inf H,
and its uniqueness is verified likewise. The set of upper
(lower) bounds of an element h is denoted ↑h (↓h).

Def. 1. A poset L = ⟨L; ≤⟩ is a lattice if sup{a, b} and
inf{a, b} exist for all a, b ∈ L.
In our approach, the inf–sup existence property of lattices as
special posets characterizes (P4) parsimony. We can show its
relevance by referring to Newtonian mechanics [2]. Newton
built upon Galileo’s and Kepler’s research to perform the
major (P5) conceptual integration known as the (h10) Law
of universal gravitation. Had Newton been using our framework
(Fig. 3), the insertion of hypothesis h8 would violate the
lattice-theoretic principle of parsimony. The hypothesis
manipulation engine would then not have allowed him to commit
the operation that way (i.e., merging {h2, h3, h7}). It turns
out that the technology might have induced him to manage
it by means of the famous generalization (h10).
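The engine’s guard can be sketched as a pre-commit check that every pair of elements still has a supremum (the dual check for infima is analogous); the elements and covers below are illustrative, not Newton’s actual lattice:

```python
from itertools import combinations

def transitive_closure(rel):
    """Close a finite binary relation under transitivity."""
    rel = set(rel)
    changed = True
    while changed:
        changed = False
        for a, b in list(rel):
            for c, d in list(rel):
                if b == c and (a, d) not in rel:
                    rel.add((a, d))
                    changed = True
    return rel

def has_sup(els, leq, x, y):
    """Does {x, y} have a least upper bound under leq?"""
    ubs = [u for u in els if (x, u) in leq and (y, u) in leq]
    return any(all((u, v) in leq for v in ubs) for u in ubs)

def commit_insert(els, leq, new, above, below):
    """Insert `new` with new <= u for u in `above` and l <= new for l in
    `below`; reject if some pair would lose its supremum (parsimony)."""
    els2 = els | {new}
    rel = transitive_closure(leq | {(new, new)} |
                             {(new, u) for u in above} |
                             {(l, new) for l in below})
    if all(has_sup(els2, rel, x, y) for x, y in combinations(sorted(els2), 2)):
        return els2, rel   # committed
    return els, leq        # rejected: (P4) parsimony violated

# A tiny bounded lattice, then an insert that would give {a, b} two
# incomparable minimal upper bounds -- so the engine refuses to commit.
els = {"bot", "a", "b", "top"}
leq = transitive_closure({(x, x) for x in els} |
                         {("bot", "a"), ("bot", "b"),
                          ("a", "top"), ("b", "top")})
els2, _ = commit_insert(els, leq, "t2", above=set(), below={"a", "b"})
```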
Now, for the purpose of the query language we can refer to
lattices as algebraic structures by the notation:

    a ∨ b ≡ sup{a, b}
    a ∧ b ≡ inf{a, b}

which reads ∨ the join, and ∧ the meet. In lattices, these
are binary operations: they can be applied to any pair of
elements a, b ∈ L to yield again an element of L.
[Figure 3 here: two Hasse diagrams between ⊤ and ⊥ over the elements (h1) Law of free fall [ag = 32], (h2) First law, (h3) Second law [F = mag], (h5) Centripetal acceleration [ac = 4π²r/T²], (h6) Kepler’s 3rd law [r³/T² = c], (h7) Inverse square law [ac ∝ 1/r²], (h4) Gravitation of a body near the Earth [W = 32m], and (h8) Gravitation of a planetary body [Fg ∝ m/r²]. The left state is marked “Invalid operation: violating parsimony” (merging {h2, h3, h7} into h8); the right state, marked “Newton’s generalization insertion instead”, adds (h9) Third law and (h10) Law of universal gravitational attraction [Fg = GMm/r²].]

Figure 3: Hasse diagrams of two states of Newton’s research lattice. They illustrate how (P4) parsimony as
a lattice-theoretic property is revealed to be a neat property for the structure of scientific progress.
4.4 Query Language
Currently, the research lattice querying capabilities for large-scale research can be illustrated as follows (see Fig. 3).

(Q1) Given h4 and h8, find their join (or strongest weakest hypothesis): { hq ∈ R | hq = h4 ∨ h8 }. [ h10 ].
(Q2) Given h3 and h6, find their meet (or weakest strongest hypothesis): { hq ∈ R | hq = h3 ∧ h6 }. [ h8 ].
(Q3) List all hypotheses that h10 is based on or equal to: { h ∈ ↑h10 ⊆ R }. [ h10, h2, h3, h9, ⊤ ].

We can then support the user’s decision w.r.t. P1–P7.

(P1) hq’s consistency with its prior knowledge: (local) { h ∈ R | hq ≺ h }, or (global) ↑hq;
(P2) hq’s agreement with observation assertions: { d ∈ R− | hq ≤ d and d ≤ hq };
(P3) hq’s falsifiability trace (predictions vs. observations): { h ∈ R+ ∩ ↓hq, d ∈ R− | ⊥ ≺ h and ⊥ ≺ d };
(P4) Parsimony: constraint at the research level;
(P5) hs and ht’s conceptual integration: { h ∈ R+ | h = (hs ∨ ht) or h = (hs ∧ ht) };
(P6) hq’s breadth of scope: { h ∈ R+ ∩ ↓hq };
(P7) hq’s fertility (hypotheses fertilized by it): { h ∈ R+ | h ∥ hq and (↓h \ {⊥}) ∩ ↓hq ≠ ∅ }.
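These queries reduce to up-set/down-set traversals plus join/meet lookups on the finite lattice. A sketch of the up-set query behind Q3, with covering edges inferred from the answer given in the text (an assumption, since the full diagram is not reproduced here):

```python
# Covering pairs (x, y), x ≺ y: h10 directly based on h2, h3, h9
# (assumed from the answer to Q3); those directly under the top element.
covers = {("h10", "h2"), ("h10", "h3"), ("h10", "h9"),
          ("h2", "top"), ("h3", "top"), ("h9", "top")}

def up_set(x):
    """↑x: everything x is based on or equal to (upward ≺-reachability)."""
    out, frontier = {x}, {x}
    while frontier:
        frontier = {b for a, b in covers if a in frontier} - out
        out |= frontier
    return out
```

For Q3, `up_set("h10")` yields {h10, h2, h3, h9, top}, matching the answer listed above.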
The expressiveness of this query language will be further
enriched once we refine it to the analytic-element level.
5. CONCLUSIONS
In this paper we have motivated the relevance of Problem 1
for data-driven science [16]. Then we have introduced research lattices, carrying out a lattice-theoretic approach for
the problem. This abstraction is geared for scientists to be
able to manipulate and query hypotheses while keeping track
of their research progress. Future work includes its further
development and implementation. Our primary target to
make it operational is to refer to SciDB’s array data model
in order to allow data and theories to be managed in a unified framework along the lines of model management [1].
6. REFERENCES
[1] P. Bernstein and S. Melnik. Model management 2.0:
Manipulating richer mappings. In ACM SIGMOD’07.
[2] I. B. Cohen. The Newtonian revolution: With
illustrations of the transformation of scientific ideas.
Cambridge University Press, 1983.
[3] J. P. Collins. Sailing on an ocean of 0s and 1s. Science,
327(5972):1455–6, 2010.
[4] B. A. Davey and H. A. Priestley. Introduction to
Lattices and Order. Cambridge Univ. Press, 2002.
[5] Y. Gao et al. SWAN: A distributed knowledge
infrastructure for Alzheimer disease research. J. Web
Semantics, 4(3):222–8, 2006.
[6] B. Gonçalves and F. Porto. A lattice-theoretic
approach for representing and managing
hypothesis-driven research. In Alberto Mendelzon Int.
Workshop on Foundations of Data Management (AMW’13), 2013.
[7] N. R. Hanson. Patterns of Discovery: An Inquiry into
the Conceptual Foundations of Science. Cambridge
University Press, 1958.
[8] R. D. King et al. The automation of science. Science,
324(5923):85–9, 2009.
[9] I. Lakatos. Criticism and the growth of knowledge,
chapter Falsification and the methodology of scientific
research programmes. Cambridge Univ. Press, 1970.
[10] J. Losee. A historical introduction to the philosophy of
science. Oxford University Press, 4th edition, 2001.
[11] J. Losee. Theories of scientific progress: An
introduction. Routledge, 2003.
[12] A. Meliou et al. Causality in databases. IEEE Data
Eng. Bull., 33(3):59–67, 2010.
[13] S. Racunas et al. HyBrow: a prototype system for
computer-aided hypothesis evaluation. Bioinformatics,
20(1):257–64, 2004.
[14] L. N. Soldatova and A. Rzhetsky. Representation of
research hypotheses. J. Biomed. Sem., 2(S2), 2011.
[15] M. Stonebraker et al. The architecture of SciDB. In
Proc. of SSDBM’11, pages 1–16.
[16] A. Szalay and J. Gray. 2020 Computing: Science in an
exponential world. Nature, 440:413–4, 2006.