
Experiments with Strategy Learning for E Prover
Daniel Kühlwein, Josef Urban
Radboud University Nijmegen
Nijmegen, Netherlands∗
[email protected]

Stephan Schulz
Technische Universität München
München, Germany
[email protected]

∗ The authors were supported by the NWO projects "MathWiki: a Web-based Collaborative Authoring Environment for Formal Proofs" and "Learning2Reason".
We give a brief summary of some recent experiments with the set of ATP strategies created for the theorem prover E as part of its automatic mode, and discuss possible future directions.
1  Strategies in E
Automated theorem provers (ATPs) consist of a number of complicated algorithms that can be parameterized and combined in different ways. Examples of such parameterizations are clause weighting and selection schemes, term orderings, the sets of inference and reduction rules used, etc. E [8] (like some other ATPs) has a language for packaging useful combinations of such parameterizations into strategies. Over 200 strategies have been named and are part of the E source code.
Such a large number of strategies can be used to experiment with data-driven methods that try to estimate how to solve a new problem by considering a large database of previously solved problems and a suitable characterization of them. E is probably the first ATP that has applied machine learning to strategy selection. There are different ways to do this, and to optimize a large set of strategies in general. We consider some of them below and report the results obtained.
2  Experiments with Strategy Learning
Characterizing ATP problems is probably a Turing-complete problem. An ambitious formulation could be, for example: given a set of ATP strategies, find a (typically finite) set of efficient algorithms whose results on a large class of ATP problems can be efficiently used to partition the problems into subclasses that share a common best strategy. The results of such algorithms run on a particular problem are the problem features.
The set of algorithms is in practice suggested by the intuition of the ATP implementor or a domain expert (for example, a mathematician who might know which features are important for distinguishing different classes of problems in his domain). Some of them correspond to precise knowledge in the ATP field, for example whether the problem is Horn or effectively propositional. Others rather express hunches, for example whether there are many or few clauses, symbols, terms, etc. If the set of problems uses symbols consistently (as is the case for the recently introduced large-theory corpora created from the Mizar and Isabelle libraries, and from the SUMO and Cyc common-sense ontologies), then the (combinations of) symbols (and derived structures, like terms) can often also indicate a typical problem structure.
We use two learning methods to estimate the best strategy for a new problem. Both are based on learning from the results of strategies run on the TPTP problems. The traditional one, by Stephan Schulz, is an instance of case-based reasoning. A newer one uses kernel-based learning. Both methods rely on TPTP being a sufficiently representative set of ATP problems of different kinds; however, the strategies can be tuned for any sufficiently large corpus of problems (for example, the MPTP [12] is a good candidate). Both methods require a relevant feature characterization of problems as input, and (after being trained) they return the predicted optimal strategy as output.
2.1  E's "Auto" mode
E supports an "automatic" mode that analyzes the problem and determines all major search parameters (literal selection function, clause selection heuristics, term ordering, and a number of mostly discrete options controlling optional simplifications and preprocessing steps). E's automatic mode has a long history and has been conservatively extended since the very first implementation in E 0.3 "Castleton", released in 1999.
E’s automatic mode is based on a static partitioning of the set of all CNF problems into disjoint
classes. Class membership is determined by a number of features, either naturally discrete, or numerical
and mapped into a small set of discrete values using threshold values.
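
As a minimal sketch of this partitioning step (the thresholds, feature names, and discrete labels below are illustrative assumptions, not E's actual values), mapping a problem's features to a class signature could look like this:

    # Hypothetical sketch of mapping a problem's features to a discrete class
    # signature for static partitioning. Thresholds and names are illustrative.
    THRESHOLDS = {
        "clauses":    [20, 100],     # few / some / many
        "literals":   [50, 300],
        "term_cells": [200, 1500],
    }

    def discretize(name, value):
        """Map a numeric feature value to a small discrete label via thresholds."""
        for i, bound in enumerate(THRESHOLDS[name]):
            if value <= bound:
                return i
        return len(THRESHOLDS[name])

    def class_signature(features):
        """Combine naturally discrete and discretized numeric features into a class key."""
        discrete = (features["axioms"], features["goals"], features["equality"])
        numeric = tuple(discretize(n, features[n]) for n in THRESHOLDS)
        return discrete + numeric

    print(class_signature({"axioms": "horn", "goals": "unit", "equality": "some",
                           "clauses": 42, "literals": 120, "term_cells": 800}))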
2.2  Features used for learning
As stated above, E’s auto-mode was first implemented in 1999, before the TPTP language [10] achieved
its current form and gained wide acceptance. Similarly, the automatic mode predates support for full
first-order formulas. Therefore E only analyzes the CNF problem after clausification and does not make
any use of meta-features encoded in the TPTP language. It uses only the logical structure of the clauses.
A clause is called negative if it has only negative literals. It is called positive if it has only positive literals. A ground clause is a clause that contains no variables. In this setting, we refer to all negative clauses as "goals" and to all other clauses as "axioms". Clauses can be unit (having only a single literal), Horn (having at most one positive literal), or general (no constraints on the form). Note that all unit clauses are Horn, and all Horn clauses are general.
The set of features used in E for problem characterization is listed in Table 1.
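
To make the terminology concrete, here is a small sketch (using an ad-hoc clause representation, not E's internal data structures) that computes these clause classes:

    # Toy representation: a clause is a list of literals, each literal a pair
    # (is_positive, is_ground). Illustrative only; E uses its own structures.
    def classify(clause):
        positives = sum(1 for (pos, _) in clause if pos)
        kind = ("unit" if len(clause) == 1 else
                "horn" if positives <= 1 else
                "general")
        role = "goal" if positives == 0 else "axiom"   # all-negative clauses are goals
        ground = all(is_ground for (_, is_ground) in clause)
        return kind, role, ground

    # p(a) | ~q(X): one positive and one negative literal, not ground -> Horn axiom
    print(classify([(True, True), (False, False)]))    # ('horn', 'axiom', False)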
2.3  Strategy Classification
E’s automatic mode is generated in two steps. First, the set of all training examples (typically the set
of all current TPTP problems) is classified into disjoint classes using the features listed in Table 1.
For the numeric features, threshold values have originally been selected to split the TPTP into 3 or 4
approximately equal subsets on each features. Over time, these have been manually adapted using trialand-error.
Once the classification is fixed, a Python program reads the different classes and a set of test protocols
describing the performance of different strategies on all problems from the test set. It assigns to each class
one of the strategies that solves the most examples in this class. For large classes (arbitrarily defined as
having more than 200 problems), it picks the strategy that also is fastest on that class. For small classes,
it picks the globally best strategy among those that solve the maximum number of problems. A class
with zero solutions by all strategies is assigned the overall best strategy.
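
A minimal sketch of this assignment step, assuming the per-class results are already available (the function and argument names are hypothetical, not the actual generator program shipped with E):

    # results[s][p]: runtime of strategy s on problem p, or None if unsolved.
    # global_order[s]: rank of s by overall performance (0 = globally best).
    def best_strategy_for_class(problems, results, global_order, large=200):
        solved = {s: [p for p in problems if runs.get(p) is not None]
                  for s, runs in results.items()}
        max_solved = max(len(v) for v in solved.values())
        if max_solved == 0:
            # class with zero solutions: assign the overall best strategy
            return min(global_order, key=global_order.get)
        candidates = [s for s, v in solved.items() if len(v) == max_solved]
        if len(problems) > large:
            # large class: among the tied candidates, pick the fastest on the class
            return min(candidates,
                       key=lambda s: sum(results[s][p] for p in solved[s]))
        # small class: pick the globally best among the tied candidates
        return min(candidates, key=lambda s: global_order[s])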
Feature                   Description
axioms                    Most specific class (unit, Horn, general) describing all axioms
goals                     Most specific class (unit, Horn) describing all goals
equality                  Problem has no equational literals, some equational literals, or only equational literals
non-ground units          Number (or fraction) of unit axioms that are not ground
ground-goals              Are all goals ground?
clauses                   Number of clauses
literals                  Number of literals
term cells                Number of all (sub)terms
ground positive axioms    Number (or fraction) of positive axioms that are ground
max fun arity             Maximal arity of a function or predicate symbol
avg fun arity             Average arity of symbols in the problem
sum fun arity             Sum of arities of symbols in the problem
clause max depth          Maximal clause depth

Table 1: Problem features used by strategy selection in E
2.4  Strategy Scheduling
An alternative to the run-the-single-best-strategy approach currently used by E is strategy scheduling (known mainly from Gandalf [11] and Vampire [7]). The idea is to run different strategies for fractions of the overall time. This can help when the strategies are sufficiently orthogonal (solve different problems) and the problem classification is not good enough to clearly point to one best strategy.
In the simplest setting, the suitable set of strategies is considered independently of the problem features. In that case, given a database of results of many strategies on a large set of problems (TPTP, MPTP, etc.), the goal is to cover the set of solvable problems by as few strategies as possible. This is an NP-complete problem, which however seems to be quite easy (for our instances) for systems like MiniSat++ [9]. For example, in an experiment covering 400 randomly chosen problems from the MPTP2078 benchmark¹ [1] by 280 E strategies, MiniSat++ finds the minimal cover (9 strategies) in 0.09 s.

¹ Available at http://wiki.mizar.org/twiki/bin/view/Mizar/MpTP2078.
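
The experiment above used MiniSat++ to compute an exact minimal cover. Purely as an illustration of the underlying set-cover problem (not the actual encoding used), a standard greedy approximation can be sketched as follows:

    # Greedy approximation of the strategy-cover problem (NOT the exact
    # MiniSat++-based minimal cover): repeatedly pick the strategy that
    # solves the most still-uncovered problems.
    def greedy_cover(solves):
        """solves: dict mapping strategy name -> set of problems it solves."""
        uncovered = set().union(*solves.values())
        cover = []
        while uncovered:
            best = max(solves, key=lambda s: len(solves[s] & uncovered))
            cover.append(best)
            uncovered -= solves[best]
        return cover

    print(greedy_cover({"s1": {1, 2, 3}, "s2": {3, 4}, "s3": {4, 5, 6}}))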
2.5  E-MaLeS 1.0
E-MaLeS 1.0 (E Machine Learning of Strategies)² is a meta-system for E that implements strategy scheduling using kernel-based learning.

² http://www.cs.miami.edu/~tptp/CASC/23/SystemDescriptions.html#E-MaLeS---1.0
2.5.1  Kernel-based learning
Kernel-based learning is a machine learning method that tries to find a suitable (typically non-linear)
transformation (kernel) of the data to a high-dimensional vector space, in which various fast versions of
linear regression can be applied to obtain good prediction functions. The training data are the problem
features together with the performance of the strategies on the problems. On these data, the kernel
learning produces a ranking function that for each combination of problem features ranks the strategies
according to their predicted usefulness. The higher a strategy is ranked, the greater the (predicted)
probability that it solves the problem. Specifically, for our experiments with the E-MaLeS system, the
Multi-Output Ranking kernel-based method (MOR) [1] is used.
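
The MOR method itself is described in [1]. Purely as a rough illustration of the general idea (using scikit-learn's KernelRidge as a stand-in, which is an assumption and not the authors' implementation), one could fit one kernelized regressor per strategy and rank strategies by their predicted scores:

    # Rough illustration of kernel-based strategy ranking, NOT the actual MOR
    # method of E-MaLeS: fit one kernel ridge regressor per strategy that
    # predicts a usefulness score (e.g. 1 if solved within the limit, else 0)
    # from problem features, then rank strategies on a new problem.
    import numpy as np
    from sklearn.kernel_ridge import KernelRidge      # assumed dependency

    def train_rankers(X, solved):
        """X: (n_problems, n_features); solved[s]: 0/1 vector over problems."""
        return {s: KernelRidge(kernel="rbf", alpha=1.0).fit(X, y)
                for s, y in solved.items()}

    def rank_strategies(models, x):
        scores = {s: float(m.predict(x.reshape(1, -1))[0]) for s, m in models.items()}
        return sorted(scores, key=scores.get, reverse=True)

    # toy usage with random data; 13 feature dimensions as a nod to Table 1
    X = np.random.rand(50, 13)
    solved = {"auto": np.random.randint(0, 2, 50), "s1": np.random.randint(0, 2, 50)}
    models = train_rankers(X, solved)
    print(rank_strategies(models, np.random.rand(13)))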
2.5.2  The E-MaLeS Setup
E-MaLeS uses the features described in Table 1 to rank the strategies. It then uses a combination of E's auto mode and the two highest-ranked strategies. First, the auto mode is run for 60 seconds and then the highest-ranked strategy t is run for 120 seconds. If the problem still is not solved, the prediction score of every other strategy s is multiplied by the probability that s solves the problem given that t does not solve it. Finally, the strategy that is ranked highest after this update is run for another 120 seconds. Note that this means that E-MaLeS can only solve problems for which there is a strategy that solves them in under 120 seconds. If a problem needs more than 120 seconds regardless of the strategy, E-MaLeS cannot solve it. However, since almost all E-solvable TPTP problems can be solved in less than 120 seconds, this is not a big restriction.
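
A schematic sketch of this schedule (the helper function run, the score table, and the conditional-probability table are hypothetical placeholders; the actual update in E-MaLeS is more involved):

    # Schematic of the E-MaLeS schedule described above.
    def emales_schedule(ranked, scores, cond_prob, run):
        """ranked: strategies sorted by predicted score; scores[s]: prediction
        score of s; cond_prob[(s, t)]: estimated P(s solves | t does not);
        run(strategy, secs) returns True iff the problem is solved in time."""
        if run("auto", 60):                    # 1. E's auto mode for 60 s
            return True
        t = ranked[0]
        if run(t, 120):                        # 2. highest-ranked strategy for 120 s
            return True
        # 3. re-rank the remaining strategies given that t failed
        updated = {s: scores[s] * cond_prob[(s, t)] for s in ranked if s != t}
        best = max(updated, key=updated.get)
        return run(best, 120)                  # 4. updated best strategy for 120 s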
3  Experiments
We compared E’s auto mode (see section 2.3) with MOR learning on four different datasets. First,
we consider three subsets of the MPTP. The results can be seen in table 2. Bushy and Chainy are the
problems from the MPTP Challenge. 3 1kMPTP are 1000 random problem taken from the MPTP dataset.
The table shows the number of problems, the number of problems that can be solved if one would always
pick the best strategy, the number of problems solved with E’s auto mode, and the number of problems
solved when learning the best strategy with MOR.
Table 2: Performance of E on MPTP using different strategy methods.

Dataset    Problems   Optimal   Auto   MOR
Bushy        252        169      144   167
Chainy       252        110      100   110
1kMPTP      1000        462      316   459
On these small Mizar-based datasets, MOR outperforms the auto mode, getting close to or even reaching the optimal number of problems solved. We also compared the auto mode with E-MaLeS on the complete TPTP dataset. At the time E-MaLeS was created, TPTP contained 14771 problems, 10269 of which are solvable by E.
Table 3: Performance of E on TPTP using different strategy methods.

Dataset    Problems   Optimal   Auto   E-MaLeS
TPTP        14771      10269    9583    10026
Finally, we can look at the performance of E and E-MaLeS at CASC 23. There were 300 problems in the FOF division, 98 of which were new, i.e., not part of the TPTP before. In total, E-MaLeS solves one more problem than E. Looking only at the new problems, we see that E-MaLeS solves four more of them than E. This could indicate that E's auto mode is too closely tuned to the TPTP data.
Table 4: Performance of E at CASC 23 using different strategy methods.

          Problems   Auto   E-MaLeS
All          300      232       233
New           98       57        61

4  Future Work
Strategies themselves are just combinations of many ATP parameters (currently determined by the guesses of the ATP implementors). An interesting application of machine learning to ATP is to develop new strategies by searching for good parameter combinations systematically (by methods like hill-climbing, as sketched below) on a large corpus of problems. A related task is to produce a set of strategies that are highly orthogonal, i.e., they solve very different problems, so that their collective coverage is higher.
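
For illustration only, a minimal sketch of such a search: hill-climbing over a small, hypothetical parameter space, scoring each candidate strategy by the number of benchmark problems it solves (the parameter names and the evaluation function are placeholders, not an existing E tool):

    # Sketch of hill-climbing over strategy parameters.
    import random

    PARAMS = {
        "literal_selection": ["sel_A", "sel_B", "sel_C"],   # placeholder values
        "ordering":          ["KBO", "LPO"],
        "split_clauses":     [0, 1],
    }

    def neighbours(strategy):
        """All strategies differing from `strategy` in exactly one parameter."""
        for name, values in PARAMS.items():
            for v in values:
                if v != strategy[name]:
                    yield {**strategy, name: v}

    def hill_climb(evaluate, steps=100):
        """evaluate(strategy) -> number of benchmark problems solved."""
        current = {name: random.choice(vals) for name, vals in PARAMS.items()}
        score = evaluate(current)
        for _ in range(steps):
            best = max(neighbours(current), key=evaluate)
            if evaluate(best) <= score:
                return current, score          # local optimum reached
            current, score = best, evaluate(best)
        return current, score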
We could also learn strategies based on conjecture features (such as symbols) in consistently named corpora like the MPTP, instead of using only abstract features as on the TPTP problems.
The methods that we develop can probably be used for any ATP, and it would be interesting to see how such learning and optimization works for systems like Vampire [7], Z3 [6], and iProver [5]. Meta-systems like Isabelle/Sledgehammer [4, 3] and MaLARea [13] could then use this deeper knowledge about ATPs to attack new conjectures with the strongest possible combinations of systems and strategies.
References
[1] Jesse Alama, Tom Heskes, Daniel Kühlwein, Evgeni Tsivtsivadze & Josef Urban (2011): Premise Selection
for Mathematics by Corpus Analysis and Kernel Methods. CoRR abs/1108.3446. Available at http://
arxiv.org/abs/1108.3446.
[2] Alessandro Armando, Peter Baumgartner & Gilles Dowek, editors (2008): Automated Reasoning, 4th International Joint Conference, IJCAR 2008, Sydney, Australia, August 12-15, 2008, Proceedings. Lecture Notes
in Computer Science 5195, Springer.
[3] Jasmin Christian Blanchette, Sascha Böhme & Lawrence C. Paulson (2011): Extending Sledgehammer with SMT Solvers. In Nikolaj Bjørner & Viorica Sofronie-Stokkermans, editors: CADE, Lecture
Notes in Computer Science 6803, Springer, pp. 116–130. Available at http://dx.doi.org/10.1007/
978-3-642-22438-6_11.
[4] Jasmin Christian Blanchette, Lukas Bulwahn & Tobias Nipkow (2011): Automatic Proof and Disproof in Isabelle/HOL. In Cesare Tinelli & Viorica Sofronie-Stokkermans, editors: FroCos, Lecture Notes in Computer
Science 6989, Springer, pp. 12–27. Available at http://dx.doi.org/10.1007/978-3-642-24364-6_2.
[5] Konstantin Korovin (2008): iProver - An Instantiation-Based Theorem Prover for First-Order Logic (System Description). In Armando et al. [2], pp. 292–298. Available at http://dx.doi.org/10.1007/
978-3-540-71070-7_24.
[6] Leonardo Mendonça de Moura & Nikolaj Bjørner (2008): Z3: An Efficient SMT Solver. In C. R. Ramakrishnan & Jakob Rehof, editors: TACAS, Lecture Notes in Computer Science 4963, Springer, pp. 337–340.
Available at http://dx.doi.org/10.1007/978-3-540-78800-3_24.
[7] Alexandre Riazanov & Andrei Voronkov (2002): The design and implementation of VAMPIRE. AI Commun. 15(2-3), pp. 91–110. Available at http://iospress.metapress.com/openurl.asp?genre=article&issn=0921-7126&volume=15&issue=2&spage=91.
[8] Stephan Schulz (2002): E - A Brainiac Theorem Prover. AI Commun. 15(2-3), pp. 111–126. Available at
http://iospress.metapress.com/content/n908n94nmvk59v3c/.
[9] Niklas Sörensson & Niklas Eén (2008): MiniSat 2.1 and MiniSat++ 1.0 SAT Race 2008 Editions. Technical
Report.
[10] Geoff Sutcliffe, Stephan Schulz, Koen Claessen & Allen Van Gelder (2006): Using the TPTP Language for Writing Derivations and Finite Interpretations. In Ulrich Furbach & Natarajan Shankar, editors: Proc. of the 3rd IJCAR, Seattle, LNAI 4130, Springer, pp. 67–81.
[11] Tanel Tammet: Gandalf version c-1.0c Reference Manual.
[12] Josef Urban (2006): MPTP 0.2: Design, Implementation, and Initial Experiments. J. Autom. Reasoning
37(1-2), pp. 21–43. Available at http://dx.doi.org/10.1007/s10817-006-9032-3.
[13] Josef Urban, Geoff Sutcliffe, Petr Pudlák & Jiří Vyskočil (2008): MaLARea SG1 - Machine Learner for Automated Reasoning with Semantic Guidance. In Armando et al. [2], pp. 441–456. Available at http://dx.doi.org/10.1007/978-3-540-71070-7_37.