Research Lattices: Towards a Scientific Hypothesis Data Model

Bernardo Gonçalves, Fabio Porto
LNCC – National Laboratory for Scientific Computing
Av. Getúlio Vargas 333, Petrópolis, Brazil
[email protected], [email protected]

ABSTRACT
As the problems of scientific interest rise in scale and complexity, scientists have to tacitly manage too many analytic elements. Hypotheses are worked out to drive research towards successful explanation and prediction, which characterizes science as a dynamic activity that is partially ordered towards progress. This paper motivates and introduces research lattices, carrying out a lattice-theoretic approach to hypothesis representation and management in large-scale science and engineering. The goal of this work is to equip scientists with tools to manipulate and query hypotheses while keeping track of research progress. We refer to SciDB's array data model and discuss how data and theories could be managed in a unified model management framework.

Categories and Subject Descriptors
H.2.1 [Database Management]: Logical Design—Data models; H.2.3 [Database Management]: Languages

General Terms
Languages, Design

Keywords
Scientific hypothesis, research progress, large-scale science, lattice theory, scientific databases

1. INTRODUCTION
Large-scale science increasingly relies on database theory and technology to deal with the growing amount of observational data. As the problems of interest rise in scale and complexity [16], however, scientists have to tacitly manage too many analytic elements. Fig. 1 recalls the scientific method, highlighting that data-driven science is, likewise, aimed at the formulation and evaluation of hypotheses [3]. We point out the stage of hypothesis formulation, drawing on the dynamics of science: hypotheses are worked out to drive research towards increasingly successful explanation and prediction [11].
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
SSDBM '13, July 29–31, 2013, Baltimore, MD, USA.
Copyright 2013 ACM 978-1-4503-1921-8/13/07 $15.00

In large-scale biomedical research, hypothesis representation has been receiving attention by means of semantic web technologies [14]. Hypothesis knowledge-base systems (KBS) (viz., [13, 5, 8]) allow hypotheses about genes to be explicitly proposed and tested against observational data. Most of the analytics of large-scale science, however, is mathematically oriented and prominently based on a multidimensional array data model [15]. Scientific analytics thus needs a more general approach to capture what is essential about Problem 1.

Problem 1. Representation and management of hypotheses in large-scale science and engineering.

We argue that scientists should be equipped with tools to manipulate and query hypotheses while keeping track of their research progress. It is the goal of this paper to motivate and introduce research lattices, carrying out a lattice-theoretic approach to the problem. The approach is agnostic w.r.t. the specific data model used by a scientific project, but we shall refer to SciDB's array model in order to discuss how data and theories could be managed in a unified framework along the lines of model management [1]. Section 2 discusses the problem background. Section 3 reviews related work.
Section 4 introduces our abstraction for the problem, which is founded in Lattice Theory [4]. Section 5 concludes the paper.

2. PROBLEM BACKGROUND
Unlike historical efforts to represent and manage the growth of scientific knowledge [10], the age of data-driven science opens the possibility of developing database technology for integrating both data and theories in the same framework. To achieve that, nevertheless, scientific knowledge should be captured while it is still provisional, viz., by tracking hypotheses as a built-in database capability. In fact, for the sake of rigor and reproducibility in large-scale scientific discovery, scientists should be provided with a hypothesis query language expressive for their own long-standing principles as emerged from scientific practice (ibid.):

(P1) Consistency with existing knowledge;
(P2) Agreement with observation;
(P3) Falsifiability;
(P4) Parsimony;
(P5) Conceptual integration;
(P6) Breadth of scope;
(P7) Fertility.

Figure 1: A simplified flow chart of the scientific method life cycle (context of discovery vs. context of justification: phenomenon observation, hypothesis formulation, computational modeling, testing against data, and, if valid, publishing results). It highlights hypothesis formulation and a backward transition to reformulation if prediction (and/or explanation) disagrees with observation.

Trade-offs in P1–P7 pose for hypothesis management a design issue related to the no-overwrite tenet required by scientists for observational data management [15]. In practice [9], hypothesis reformulation (Fig. 1) may not mean overwriting the hypothesis of interest, say hq. Instead, scientists may insert a new hypothesis hs into the research to compete in further analytics with hq; or rather, they may carry out a problem-shift on hq by conjecturing an hs to account for 'anomalous phenomena' instead of putting hq into question (ibid.).
These are some of the reasons why hypotheses are a highly interconnected kind of data. Theoretical hypotheses must be combined with observation-level assumptions (e.g., initial and boundary conditions) for (P3) actually deriving testable statements about phenomena. Also, ¬(P2–P4) auxiliary hypotheses can be inserted (initially as workarounds) into a research program, as was the case with Newtonian mechanics, when Neptune could only be predicted near its observed location after connecting "A trans-Uranic planet exists" into the system [9]. That illustrates not only that hypotheses are interconnected, but also how principles P2–P4 may be balanced with, say, P1/P5–P7, by means of this very interconnection. In recognition that refutation neither is nor should be invariably followed by rejection, research programs were eventually considered more proper units of appraisal than individual hypotheses [10]. In Section 4 we refer to both as representation units.

3. RELATED WORK
Recent initiatives are addressing Problem 1 in large-scale biomedical research: (I1) Robot Scientist [8] is a KBS for automated generation and testing of hypotheses about which genes encode enzymes in the yeast organism; (I2) HyBrow [13] is a KBS for scientists to test their hypotheses about events of the galactose metabolism of the same organism; and (I3) SWAN [5] is a KBS for scientists to have shared access to hypotheses about causes of Alzheimer's disease. All of them (I1–I3) use an OWL ontology on top of the RDF data model. Initiatives I1–I2, in particular, also keep logic programming analytics hardwired in the application layer for generating and/or testing hypotheses of the kind 'gene G has function A' against RDF-encoded data. Initiative I3, in turn, disfavors functionality for hypothesis evaluation to focus on descriptive aspects: hypotheses are high-level natural language statements retrieved from publications. Each instance of the OWL class Hypothesis is related to claims (a more general OWL class) that support it on the basis of some experimental evidence (also RDF-encoded gene/protein information). This initiative (I3) is related to efforts on the semi-automatic retrieval of hypotheses (or claims) from the narrative structure of scientific reports.

All I1–I3 have the potential to provide scientists with a community-wide (I3) platform in which their hypotheses can be explicitly represented and tested (I1–I2). I1–I2 directly support functional genomics science in terms of P1–P3, while I3 indirectly supports Alzheimer research in terms of P1–P3. All I1–I3 seem to be opaque w.r.t. P4–P7. In our view, this is because they lack a standard abstraction for hypothesis interconnection, as motivated in the previous section.

In our pursuit of Problem 1, large-scale research is to be operated under constraints and tracked by the scientist end-users. We raise the level of abstraction to capture the function of hypotheses as the drivers of research towards progress. We have set as research questions to find:

• Units of representation for hypothesis-driven research;
• Operations for hypothesis manipulation and querying;
• Constraints for the structure of scientific progress.

4. RESEARCH LATTICES
Research lattice is a DB abstraction for hypothesis-driven research, which is founded on Lattice Theory [4]. A preliminary description from a theoretical point of view can be found elsewhere [6]. In this paper we focus on our data modeling choices for the questions above in light of P1–P7.

4.1 Units of Representation
Hypotheses are interconnected to drive research towards successful explanation and prediction. Science, then, can be understood as a dynamic activity that, unlike other human endeavors, is partially ordered towards progress [11]. We model hypothesis-driven research as a lattice [4], a special kind of partially-ordered set (poset).
A research poset R = ⟨R; ≤⟩ is a non-empty set R equipped with a reflexive, anti-symmetric and transitive relation ≤ [6]. Research posets are bounded by definition,¹ and they are constrained to be lattices to resemble the structure of scientific progress in light of (P4) parsimony (cf. Section 4.3). From the point of view of logical-level representation, we write xi ≤ xj if xi 'is based on or equal to' xj, and xi ∥ xj if xi 'is incomparable to' xj. Each analytic element x ∈ R is a distinguishable entity, even if it is inferred equivalent to some other element y ∈ R. Scientist users are supposed to operate (insert/delete/update) covering pairs xi ≺ xj, read xi 'is directly based on' xj, if xi ≤ xj and, for no x, xi < x < xj. Relation '≺' is enough to determine a finite poset, and it is used to build its Hasse diagram [4].

¹ A zero of a poset P = ⟨P; ≤⟩ is an element 0 such that 0 ≤ x for all x ∈ P. A unit, 1, satisfies x ≤ 1 for all x ∈ P. A bounded poset is one that has both 0 and 1.

Figure 2: Hasse diagram of Galileo's example research lattice, with elements ⊤ (top), (h1) Law of free fall [d²s/dt² = 32], (d1) Fall from rest [(ds/dt = 0, s = 0) at t = 0], (h2) Fall in quadratic time [s = 16t²], (d2) Observed data, and ⊥ (bottom). Connections (line segments) go upward from hi to hj if hi ≺ hj. Elements ⊤ and ⊥ are specially defined to be 1 and 0 of R = ⟨R; ≺⟩. Dotted lines indicate the insertion of element d2.

(d2) Observed data:
Time (secs): 0, 1, 2, 3, 4, 5, 6, 7, 8
Dist. (feet): 0.0, 16.1, 64.2, 144.5, 256.8, 401.3, 577.8, 786.5, 1027.2

As a first illustration of the concept, let us abstract Galileo's research on free falling bodies (cf. [7]) as a research lattice. For his (h1) law of free fall to be actually falsifiable, it must be provided with (d1) initial conditions in order to derive (testable) h2 by rules of the integral calculus.

(h1). Every body near the Earth free falling towards the Earth falls with an acceleration of 32 feet per second per second. [d²s/dt² = 32].

(d1). The fall starts from rest.
[ds/dt = 0, s = 0; at t = 0].

(h2). Every body starting from rest and free falling towards the Earth falls 16t² feet in t seconds. [s = 16t², given d²s/dt² = 32 and (ds/dt = 0, s = 0) at t = 0].

Fig. 2 shows the diagram of Galileo's research lattice. It highlights the algorithmic insertion of (d2) observed data to be confronted with h2. We refine the set of analytic elements R into disjoint complete subsets R+ (theories) and R− (data), R = R+ ∪ R−. Now, consider SciDB's multidimensional array data model (cf. [15]) in order to ground the research lattice abstraction: every array is structured as dimension coordinates (defining a cell) and a set of typed attributes to be assigned values (defining tuples) along each cell. Then Galileo's hypothesis h1 should have array schema H1 with attribute a along (1-D) time cells t ranging 0:8.

AQL% CREATE ARRAY H1 <a:double>[t=0:8,9,0];
AFL% store(build(<a:double>[t=0:8,9,0],32),H1);

where a is the body's acceleration (d²s/dt²), which is set a = 32 for t=0:8, like a (materialized) view in relational databases. Accordingly, D1's schema should be:

AQL% CREATE ARRAY D1 <v:double,s:double>[t=0:0,1,0];
AQL% INSERT INTO D1 '[(0,0)]';

where s is the body's traveled distance (s), and v its velocity (ds/dt). A mapping from d1 to h1 should cast (0, 0, 32) at t = 0 and generate further new tuples in H2, producing in AFL:

AFL% store(build(<s:double>[t=0:8,9,0],16*pow(t,2)),H2);

Hypothesis h2's predicted data, I_h2(H2), is structurally equivalent to (d2) observed data, I_d2(H2), and can be matched with it for hypothesis validation through data analysis.

4.2 Operations
Operations have to capture details at both the research and the analytic-element levels. Firstly, we have designed algorithmic insert/delete/update operations to support the user in managing connections of the kind hi ≺ hj at the research level [6]. Now we are working on their refinement at the analytic-element level in a model management context [1].
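As a sanity check outside SciDB, the matching of h2's predicted data against d2's observations can be sketched in plain Python. The function names and the 1% relative tolerance below are our own illustrative assumptions, not part of the paper's method.

```python
# Sketch of matching h2's predictions (s = 16 t^2) against Galileo's
# observed data (Fig. 2). The 1% relative-error tolerance is an
# illustrative assumption, not prescribed by the paper.

def predict_h2(t):
    """Predicted distance (feet) after t seconds, from s = 16 t^2."""
    return 16 * t ** 2

# (d2) Observed data from Fig. 2: time (secs) -> distance (feet).
observed = {0: 0.0, 1: 16.1, 2: 64.2, 3: 144.5, 4: 256.8,
            5: 401.3, 6: 577.8, 7: 786.5, 8: 1027.2}

def agrees(observed, predict, rel_tol=0.01):
    """Check (P2) agreement with observation cell by cell."""
    for t, s_obs in observed.items():
        s_pred = predict(t)
        if s_obs == s_pred == 0:
            continue  # both zero: trivially in agreement
        if abs(s_obs - s_pred) > rel_tol * max(abs(s_obs), abs(s_pred)):
            return False
    return True

print(agrees(observed, predict_h2))  # → True: h2 matches d2 within 1%
```

The same cell-by-cell comparison would be expressed in SciDB as a join of H2 with the observed-data array followed by a filter on the residuals.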
We explore use cases of the operations (e.g., as in Galileo's example) in light of P1–P7 to provide current scientific data programmability with abstractions of the scientific method. We pay particular attention to scientists' patterns of discovery in the manipulation of hypotheses [7]. E.g., a new hypothesis may be inserted as a merge of prior hypotheses: the user projects some structure from the former and may drop other structure, possibly with a leap of creation (new structure). Operations such as merge, match, extract, diff, invert, and compose involve semi-automatically engineered mappings [1]. These usually take three steps: (i) schema matching, (ii) design of mapping constraints, and (iii) executable transformation. In Galileo's example (Fig. 2), h2 ≺ h1 and h2 ≺ d1 shall comprise precise mappings M_h2,h1 and M_h2,d1. This approach has the potential to go beyond just tracking the lineage of analytic elements, to actually account for their responsibility (cf. [12]) on the results of analytic life cycles (Fig. 1).

4.3 Constraints
Research lattices are closed under the insert/delete/update operations. That is, these operations take lattice R as input and return lattice R′ as output by preserving the lattice properties as a special poset [6], which are the following. Let H ⊆ P, a ∈ P, for an arbitrary poset ⟨P; ≤⟩. Then a is an upper bound of H if h ≤ a for all h ∈ H. An upper bound a of H is the least upper bound of H, or supremum of H, if for any upper bound b of H we have a ≤ b. It is written a = sup H, and its uniqueness can be shown straightforwardly. The concepts of lower bound and greatest lower bound, or infimum, are defined dually. The latter is denoted by inf H, and its uniqueness is verified likewise. The set of upper (lower) bounds of an element h is denoted ↑h (↓h).

Def. 1. A poset L = ⟨L; ≤⟩ is a lattice if sup{a, b} and inf{a, b} exist for all a, b ∈ L.

In our approach, the inf–sup existence property of lattices as special posets characterizes (P4) parsimony.
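The sup/inf machinery of Def. 1 can be sketched for a finite poset given by its covering pairs. The helper names below are ours; the example encodes Galileo's lattice of Fig. 2, assuming d2 sits between ⊥ and h2, as the dotted insertion in the figure suggests.

```python
# Minimal sketch of Def. 1 on a finite poset specified by covering
# pairs (hi, hj), meaning hi ≺ hj ("is directly based on").
from itertools import combinations

def order_from_covers(elements, covers):
    """Order ≤ as the reflexive-transitive closure of the covers ≺."""
    leq = {(x, x) for x in elements} | set(covers)
    changed = True
    while changed:  # naive transitive closure; fine for small posets
        changed = False
        for (a, b) in list(leq):
            for (c, d) in list(leq):
                if b == c and (a, d) not in leq:
                    leq.add((a, d))
                    changed = True
    return leq

def sup(a, b, elements, leq):
    """Least upper bound of {a, b}, or None if it does not exist."""
    ubs = [u for u in elements if (a, u) in leq and (b, u) in leq]
    least = [u for u in ubs if all((u, v) in leq for v in ubs)]
    return least[0] if least else None

def is_lattice(elements, leq):
    """Def. 1: sup (and, dually, inf) exist for every pair."""
    inf = lambda a, b: sup(a, b, elements, {(y, x) for (x, y) in leq})
    return all(sup(a, b, elements, leq) is not None
               and inf(a, b) is not None
               for a, b in combinations(elements, 2))

# Galileo's research lattice (Fig. 2): ⊥ ≺ d2 ≺ h2, h2 ≺ h1, h2 ≺ d1,
# h1 ≺ ⊤, d1 ≺ ⊤ (bounds written "bot"/"top").
elements = ["bot", "d2", "h2", "h1", "d1", "top"]
covers = [("bot", "d2"), ("d2", "h2"), ("h2", "h1"),
          ("h2", "d1"), ("h1", "top"), ("d1", "top")]
leq = order_from_covers(elements, covers)
print(sup("h1", "d1", elements, leq))  # → top
print(is_lattice(elements, leq))       # → True
```

Dropping ⊤ from the example leaves {h1, d1} with no upper bound at all, so the structure would fail Def. 1, which is exactly the situation the parsimony constraint rules out.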
We can show its relevance by referring to Newtonian mechanics [2]. Newton built upon Galileo's and Kepler's research to perform the major (P5) conceptual integration known as the (h10) Law of universal gravitation. Were Newton using our framework (Fig. 3), the insertion of hypothesis h8 would violate the lattice-theoretic principle of parsimony. The hypothesis manipulation engine would then not have allowed him to commit the operation that way (i.e., merging {h2, h3, h7}). It turns out that the technology might have induced him to manage it by means of the (h10) famous generalization.

Now, for the purpose of the query language we can refer to lattices as algebraic structures by the notation:

a ∨ b ≡ sup{a, b}
a ∧ b ≡ inf{a, b}

which reads ∨ the join, and ∧ the meet. In lattices, these are binary operations—they can be applied to any pair of elements a, b ∈ L to yield again an element of L.

Figure 3: Hasse diagrams of two states of Newton's research lattice, over elements including (h1) Law of free fall [ag = 32], (h2) First law, (h3) Second law, (h9) Third law [F = m·ag], (h5) Kepler's 3rd law [r³/T² = c], (h6) Centripetal acceleration [ac = 4π²r/T²], (h7) Inverse square law [ac ∝ 1/r²], and (h4) Gravitation of a body near the Earth [W = 32m]. Left: the insertion of (h8) Gravitation of a planetary body [Fg ∝ m/r²] as a merge of {h2, h3, h7} is an invalid operation, violating parsimony. Right: Newton's generalization, the (h10) Law of universal gravitational attraction [Fg = GMm/r²], is inserted instead. They illustrate how (P4) parsimony as a lattice-theoretic property is revealed to be a neat property for the structure of scientific progress.
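The parsimony guard just described, i.e. rejecting an insertion that would break the lattice property, can be sketched as follows. The toy fragment and all names (guarded_insert and the underscore helpers) are our own illustration, not Newton's actual lattice.

```python
# Sketch of a guarded insert: a new element and its covering pairs are
# committed only if the result is still a lattice ((P4) parsimony).
from itertools import combinations

def _closure(elements, covers):
    """Order ≤ as the reflexive-transitive closure of the covers ≺."""
    leq = {(x, x) for x in elements} | set(covers)
    changed = True
    while changed:
        changed = False
        for (a, b) in list(leq):
            for (c, d) in list(leq):
                if b == c and (a, d) not in leq:
                    leq.add((a, d))
                    changed = True
    return leq

def _sup_exists(a, b, elements, leq):
    """True iff {a, b} has a least upper bound."""
    ubs = [u for u in elements if (a, u) in leq and (b, u) in leq]
    return any(all((u, v) in leq for v in ubs) for u in ubs)

def _is_lattice(elements, leq):
    dual = {(y, x) for (x, y) in leq}  # inf = sup in the dual order
    return all(_sup_exists(a, b, elements, leq) and
               _sup_exists(a, b, elements, dual)
               for a, b in combinations(elements, 2))

def guarded_insert(elements, covers, new_elem, new_covers):
    """Commit the insertion only if the result is still a lattice."""
    els = elements + [new_elem]
    cvs = covers + new_covers
    if not _is_lattice(els, _closure(els, cvs)):
        raise ValueError(f"inserting {new_elem} violates (P4) parsimony")
    return els, cvs

# Toy fragment (not Newton's lattice): bot ≺ a, b; a, b ≺ p; p ≺ top.
elements = ["bot", "a", "b", "p", "top"]
covers = [("bot", "a"), ("bot", "b"), ("a", "p"), ("b", "p"), ("p", "top")]

# Inserting q above both a and b gives {a, b} two incomparable minimal
# upper bounds (p and q), so sup{a, b} no longer exists: rejected.
try:
    guarded_insert(elements, covers, "q",
                   [("a", "q"), ("b", "q"), ("q", "top")])
except ValueError as e:
    print(e)  # → inserting q violates (P4) parsimony
```

This mirrors the Newton scenario above: the engine refuses the merge that destroys unique suprema, nudging the user towards a single generalizing hypothesis instead.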
4.4 Query Language
Currently, the research lattice querying capabilities for large-scale research can be illustrated as follows (see Fig. 3).

(Q1) Given h4 and h8, find their join (or strongest weakest hypothesis): { hq ∈ R | hq = h4 ∨ h8 }. [h10].
(Q2) Given h3 and h6, find their meet (or weakest strongest hypothesis): { hq ∈ R | hq = h3 ∧ h6 }. [h8].
(Q3) List all hypotheses that h10 is based on or equal to: { h ∈ ↑h10 ⊆ R }. [h10, h2, h3, h9, ⊤].

We can then support the user's decision w.r.t. P1–P7.

(P1) hq's consistency with its prior knowledge: (local) {h ∈ R | hq ≺ h}, or (global) ↑hq;
(P2) hq's agreement with observation assertions: {d ∈ R− | hq ≤ d and d ≤ hq};
(P3) hq's falsifiability trace (predictions vs. observations): {h ∈ R+ ∩ ↓hq, d ∈ R− | ⊥ ≺ h and ⊥ ≺ d};
(P4) Parsimony: constraint at the research level;
(P5) hs and ht's conceptual integration: {h ∈ R+ | h = (hs ∨ ht) or h = (hs ∧ ht)};
(P6) hq's breadth of scope: {h ∈ (R+ ∩ ↓hq)};
(P7) hq's fertility (hypotheses fertilized by it): {h ∈ R+ | h ∥ hq and (↓h \ {⊥}) ∩ ↓hq ≠ ∅}.

The expressiveness of this query language will be further enriched once we refine it to the analytic-element level.

5. CONCLUSIONS
In this paper we have motivated the relevance of Problem 1 for data-driven science [16]. We have then introduced research lattices, carrying out a lattice-theoretic approach to the problem. This abstraction is geared for scientists to be able to manipulate and query hypotheses while keeping track of their research progress. Future work includes its further development and implementation. Our primary target to make it operational is to refer to SciDB's array data model in order to allow data and theories to be managed in a unified framework along the lines of model management [1].

6. REFERENCES
[1] P. Bernstein and S. Melnik. Model management 2.0: Manipulating richer mappings. In Proc. of ACM SIGMOD'07.
[2] I. B. Cohen.
The Newtonian Revolution: With Illustrations of the Transformation of Scientific Ideas. Cambridge University Press, 1983.
[3] J. P. Collins. Sailing on an ocean of 0s and 1s. Science, 327(5972):1455–6, 2010.
[4] B. A. Davey and H. A. Priestley. Introduction to Lattices and Order. Cambridge University Press, 2002.
[5] Y. Gao et al. SWAN: A distributed knowledge infrastructure for Alzheimer disease research. J. Web Semantics, 4(3):222–8, 2006.
[6] B. Gonçalves and F. Porto. A lattice-theoretic approach for representing and managing hypothesis-driven research. In Alberto Mendelzon Int. Workshop on Foundations of Data Management (AMW'13), 2013.
[7] N. R. Hanson. Patterns of Discovery: An Inquiry into the Conceptual Foundations of Science. Cambridge University Press, 1958.
[8] R. D. King et al. The automation of science. Science, 324(5923):85–9, 2009.
[9] I. Lakatos. Criticism and the Growth of Knowledge, chapter Falsification and the methodology of scientific research programmes. Cambridge University Press, 1970.
[10] J. Losee. A Historical Introduction to the Philosophy of Science. Oxford University Press, 4th edition, 2001.
[11] J. Losee. Theories of Scientific Progress: An Introduction. Routledge, 2003.
[12] A. Meliou et al. Causality in databases. IEEE Data Eng. Bull., 33(3):59–67, 2010.
[13] S. Racunas et al. HyBrow: a prototype system for computer-aided hypothesis evaluation. Bioinformatics, 20(1):257–64, 2004.
[14] L. N. Soldatova and A. Rzhetsky. Representation of research hypotheses. J. Biomed. Semantics, 2(S2), 2011.
[15] M. Stonebraker et al. The architecture of SciDB. In Proc. of SSDBM'11, pages 1–16.
[16] A. Szalay and J. Gray. 2020 Computing: Science in an exponential world. Nature, 440:413–4, 2006.