Experiments with Strategy Learning for E Prover

Daniel Kühlwein, Josef Urban (Radboud University Nijmegen, Nijmegen, Netherlands∗)
Stephan Schulz (Technische Universität München, München, Germany)
[email protected]   [email protected]

∗ The authors were supported by the NWO projects "MathWiki a Web-based Collaborative Authoring Environment for Formal Proofs" and "Learning2Reason".

Submitted to: IWS 2012. © Daniel Kühlwein, Josef Urban and Stephan Schulz. This work is licensed under the Creative Commons Attribution License.

We give a brief summary of some recent experiments with the set of different ATP strategies created for the theorem prover E as part of its automatic mode, and discuss possible future directions.

1 Strategies in E

Automated theorem provers (ATPs) consist of a number of complicated algorithms that can be parameterized and combined in different ways. Examples of such parameterizations are clause weighting and selection schemes, term orderings, the sets of inference and reduction rules used, etc. E [8] (like some other ATPs) has a language for packaging useful combinations of such parameterizations into strategies. Over 200 strategies have been named and are part of the E source code.

Such a large number of strategies can be used to experiment with data-driven methods that try to estimate how to solve a new problem by considering a large database of previously solved problems and a suitable characterization of them. E is probably the first ATP that has applied machine learning to strategy selection. There are different ways to do this, and to optimize a large set of strategies in general. We consider some of them and report the results obtained.

2 Experiments with Strategy Learning

Characterizing ATP problems is probably a Turing-complete problem. An ambitious formulation could be, for example: given a set of ATP strategies, find a (typically finite) set of efficient algorithms whose results on a large class of ATP problems can be efficiently used to partition the problems into subclasses that share a common best strategy. The results of such algorithms run on a particular problem are the problem features. In practice, the set of algorithms is suggested by the intuition of the ATP implementor or of a domain expert (for example, a mathematician who might know which features could be important for distinguishing different classes of problems in their domain). Some features correspond to precise notions from the ATP field, for example whether the problem is Horn or effectively propositional. Others rather express hunches, for example whether there are many or few clauses, symbols, terms, etc. If the set of problems uses symbols consistently (as is the case for the recently emerged large-theory corpora created from the Mizar and Isabelle libraries, and from the SUMO and Cyc common-sense ontologies), then (combinations of) symbols, and derived structures such as terms, can often also indicate a typical problem structure.

We use two learning methods to estimate the best strategy for a new problem. Both are based on learning from the results of strategies run on the TPTP problems. The traditional one, by Stephan Schulz, is an instance of case-based reasoning. A newer one uses kernel-based learning. Both methods rely on the TPTP being a sufficiently representative set of ATP problems of different kinds; however, the strategies can be tuned for any sufficiently big corpus of problems (for example, the MPTP [12] is a good candidate). Both methods require a relevant feature characterization of problems as input and, after being trained, return the predicted optimal strategy as output.
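To make the notion of problem features more concrete, the following Python sketch computes a few of the syntactic features discussed above (clause and literal counts, unit/Horn/general classes, groundness of goals). The clause representation and all function names are hypothetical illustrations for this note, not E's actual code.

```python
# A minimal sketch of syntactic feature extraction (hypothetical data layout,
# not E's implementation). A clause is a list of literals; a literal is a pair
# (polarity, term); a term is a pair (symbol, list_of_argument_terms).
def is_ground(term):
    symbol, args = term
    # TPTP convention: variables start with an uppercase letter and have no arguments
    return not symbol[0].isupper() and all(is_ground(a) for a in args)

def clause_class(clause):
    positive = sum(1 for polarity, _ in clause if polarity)
    if len(clause) == 1:
        return "unit"
    return "horn" if positive <= 1 else "general"

def features(clauses):
    order = ["unit", "horn", "general"]
    goals = [c for c in clauses if not any(pol for pol, _ in c)]   # purely negative clauses
    axioms = [c for c in clauses if any(pol for pol, _ in c)]
    return {
        "clauses": len(clauses),
        "literals": sum(len(c) for c in clauses),
        "ground_goals": all(is_ground(t) for c in goals for _, t in c),
        # most specific class describing all axioms = most general class that occurs
        "axioms": max((clause_class(c) for c in axioms), key=order.index, default="unit"),
    }
```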
2.1 E's "Auto" mode

E supports an "automatic" mode that analyzes the problem and determines all major search parameters (literal selection function, clause selection heuristics, term ordering, and a number of mostly discrete options controlling optional simplifications and preprocessing steps). E's automatic mode has a long history and has been conservatively extended since the very first implementation in E 0.3 Castleton, released in 1999. It is based on a static partitioning of the set of all CNF problems into disjoint classes. Class membership is determined by a number of features that are either naturally discrete, or numerical and mapped into a small set of discrete values using thresholds.

2.2 Features used for learning

As stated above, E's auto mode was first implemented in 1999, before the TPTP language [10] achieved its current form and gained wide acceptance. Similarly, the automatic mode predates support for full first-order formulas. Therefore E analyzes only the CNF problem obtained after clausification and does not make any use of meta-features encoded in the TPTP language; it uses only the logical structure of the clauses. A clause is called negative if it has only negative literals, and positive if it has only positive literals. A ground clause is a clause that contains no variables. In this setting, we refer to all negative clauses as "goals" and to all other clauses as "axioms". Clauses can be unit (having only a single literal), Horn (having at most one positive literal), or general (no constraints on the form). Note that all unit clauses are Horn, and all Horn clauses are general. The set of features used in E for problem characterization is listed in Table 1.

Table 1: Problem features used by strategy selection in E

Feature                  Description
axioms                   Most specific class (unit, Horn, general) describing all axioms
goals                    Most specific class (unit, Horn) describing all goals
equality                 Problem has no equational literals, some equational literals, or only equational literals
non-ground units         Number (or fraction) of unit axioms that are not ground
ground goals             Are all goals ground?
clauses                  Number of clauses
literals                 Number of literals
term cells               Number of all (sub)terms
ground positive axioms   Number (or fraction) of positive axioms that are ground
max fun arity            Maximal arity of a function or predicate symbol
avg fun arity            Average arity of symbols in the problem
sum fun arity            Sum of arities of symbols in the problem
clause max depth         Maximal clause depth

2.3 Strategy Classification

E's automatic mode is generated in two steps. First, the set of all training examples (typically the set of all current TPTP problems) is classified into disjoint classes using the features listed in Table 1. For the numeric features, threshold values were originally chosen to split the TPTP into three or four approximately equal subsets on each feature; over time, these have been manually adapted by trial and error. Once the classification is fixed, a Python program reads the different classes and a set of test protocols describing the performance of different strategies on all problems from the test set. It assigns to each class one of the strategies that solve the most problems in that class. For large classes (arbitrarily defined as those with more than 200 problems), it picks the strategy that is also fastest on that class. For small classes, it picks the globally best strategy among those that solve the maximum number of problems. A class in which no strategy solves any problem is assigned the overall best strategy.
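As a rough illustration of this two-step construction, the following Python sketch performs the second step: assigning a strategy to each class from a set of test protocols. The data layout and names are assumptions made for this example; the actual generator used in E's build process may differ in many details.

```python
# Sketch of the per-class strategy assignment from Section 2.3 (not the actual
# generator shipped with E). Assumed data layout:
#   classes:   dict mapping a class id to the set of problems in that class
#   protocols: dict mapping a strategy name to a dict {problem: solution time in s}
LARGE_CLASS = 200

def assign_strategies(classes, protocols):
    # globally best strategy, used as a fallback for classes nobody solves
    global_best = max(protocols, key=lambda s: len(protocols[s]))
    assignment = {}
    for cid, problems in classes.items():
        solved = {s: [p for p in problems if p in protocols[s]] for s in protocols}
        best_count = max(len(ps) for ps in solved.values())
        candidates = [s for s in protocols if len(solved[s]) == best_count]
        if best_count == 0:
            # no strategy solves anything in this class
            assignment[cid] = global_best
        elif len(problems) > LARGE_CLASS:
            # large classes: among the best strategies, prefer the fastest one
            assignment[cid] = min(candidates,
                                  key=lambda s: sum(protocols[s][p] for p in solved[s]))
        else:
            # small classes: among the best strategies, prefer the globally strongest
            assignment[cid] = max(candidates, key=lambda s: len(protocols[s]))
    return assignment
```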
2.4 Strategy Scheduling

An alternative to E's current approach of running the single best predicted strategy is strategy scheduling (known mainly from Gandalf [11] and Vampire [7]). The idea is to run several different strategies, each for a fraction of the overall time. This can help when the strategies are sufficiently orthogonal (i.e., they solve different problems) and the problem classification is not good enough to clearly point to one best strategy. In the simplest setting, the set of strategies in the schedule is chosen independently of the problem features. In that case, given a database of results of many strategies on a large set of problems (TPTP, MPTP, etc.), the goal is to cover the set of solvable problems with as few strategies as possible. This is an NP-complete problem which, however, seems to be quite easy (for our instances) for systems like MiniSat++ [9]. For example, in an experiment covering 400 randomly chosen problems from the MPTP2078 benchmark [1] (available at http://wiki.mizar.org/twiki/bin/view/Mizar/MpTP2078) with 280 E strategies, MiniSat++ finds the minimal cover (9 strategies) in 0.09 s.

2.5 E-MaLeS 1.0

E-MaLeS 1.0 (E Machine Learning of Strategies, described at http://www.cs.miami.edu/~tptp/CASC/23/SystemDescriptions.html#E-MaLeS---1.0) is a meta-system for E that implements strategy scheduling using kernel-based learning.

2.5.1 Kernel-based learning

Kernel-based learning is a machine learning method that tries to find a suitable (typically non-linear) transformation (kernel) of the data into a high-dimensional vector space, in which various fast versions of linear regression can be applied to obtain good prediction functions. The training data are the problem features together with the performance of the strategies on the problems. On these data, kernel learning produces a ranking function that, for each combination of problem features, ranks the strategies according to their predicted usefulness. The higher a strategy is ranked, the greater the (predicted) probability that it solves the problem. Specifically, our experiments with the E-MaLeS system use the Multi-Output Ranking (MOR) kernel-based method [1].

2.5.2 The E-MaLeS Setup

E-MaLeS uses the features described in Table 1 to rank the strategies. It then runs a combination of E's auto mode and the two highest-ranked strategies. First, the auto mode is run for 60 seconds, and then the highest-ranked strategy t is run for 120 seconds. If the problem is still not solved, the prediction score of every other strategy s is multiplied by the probability that s solves the problem given that t does not solve it. Finally, the strategy that is ranked highest after this update is run for another 120 seconds. Note that this means that E-MaLeS can only solve problems for which some strategy solves them within 120 seconds; if a problem needs more than 120 seconds regardless of the strategy, E-MaLeS cannot solve it. However, since almost all E-solvable TPTP problems can be solved in less than 120 seconds, this is not a serious restriction.
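The schedule just described can be summarized with a short sketch. The helper functions below (an E runner, the MOR prediction scores, and the conditional probability estimate) are hypothetical placeholders, not the actual E-MaLeS code.

```python
# Sketch of the E-MaLeS 1.0 schedule from Section 2.5.2 (not the real system).
# Assumed helpers (hypothetical): run_e(problem, strategy, timeout) returns True
# iff E proves the problem with the given strategy within the time limit;
# scores maps each strategy to its MOR prediction score for this problem;
# p_solves_given_fails(s, t) estimates P(s solves | t fails), e.g. from protocols.
def emales_schedule(problem, scores, run_e, p_solves_given_fails):
    if run_e(problem, "auto", 60):             # E's built-in auto mode first
        return True
    top = max(scores, key=scores.get)          # highest-ranked strategy
    if run_e(problem, top, 120):
        return True
    # re-weight the remaining strategies by the chance that they succeed
    # where the top-ranked strategy has just failed
    updated = {s: scores[s] * p_solves_given_fails(s, top)
               for s in scores if s != top}
    second = max(updated, key=updated.get)
    return run_e(problem, second, 120)
```

The three phases together add up to 300 seconds of prover time.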
3 Experiments

We compared E's auto mode (see Section 2.3) with MOR learning on four different datasets. First, we consider three subsets of the MPTP; the results are shown in Table 2. Bushy and Chainy are the problems from the MPTP Challenge (www.tptp.org/MPTPChallenge/). 1kMPTP consists of 1000 random problems taken from the MPTP dataset. The table shows the number of problems, the number of problems that could be solved if one always picked the best strategy, the number of problems solved with E's auto mode, and the number of problems solved when learning the best strategy with MOR.

Table 2: Performance of E on MPTP using different strategy methods.

Dataset   Problems   Optimal   Auto    MOR
Bushy          252       169    144    167
Chainy         252       110    100    110
1kMPTP        1000       462    316    459

On these small Mizar-based datasets, MOR outperforms the auto mode, getting close to or even reaching the optimal number of problems solved. We also compared the auto mode with E-MaLeS on the complete TPTP dataset. At the time E-MaLeS was created, the TPTP contained 14771 problems, 10269 of which were solvable by E.

Table 3: Performance of E on TPTP using different strategy methods.

Dataset   Problems   Optimal   Auto   E-MaLeS
TPTP         14771     10269   9583     10026

Finally, we look at the performance of E and E-MaLeS at CASC 23. There were 300 problems in the FOF division; 98 of these were new, i.e., they were not part of the TPTP before. In total, E-MaLeS solved one more problem than E. Looking only at the new problems, we see that E-MaLeS solved four more of them than E. This could indicate that E's auto mode is too closely tuned to the TPTP data.

Table 4: Performance of E at CASC 23 using different strategy methods.

       Problems   Auto   E-MaLeS
All         300    232       233
New          98     57        61

4 Future Work

Strategies themselves are just combinations of many ATP parameters, currently determined by the guesses of the ATP implementors. An interesting application of machine learning to ATP is to develop new strategies by searching for good parameter combinations systematically (with methods like hill-climbing) on a large corpus of problems. A related task is to produce a set of strategies that are highly orthogonal, i.e., strategies that solve very different problems, so that their collective coverage is higher. We could also learn strategies based on conjecture features (such as symbols) in consistently named corpora like the MPTP, instead of using only abstract features as on the TPTP problems. The methods that we develop can probably be used for any ATP, and it would be interesting to see how such learning and optimization works for systems like Vampire [7], Z3 [6], and iProver [5]. Meta-systems like Isabelle/Sledgehammer [4, 3] and MaLARea [13] could then use this deeper knowledge about ATPs to attack new conjectures with the strongest possible combinations of systems and strategies.
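As a concrete illustration of the coverage and orthogonality goal, here is a minimal greedy sketch over an assumed result database (a mapping from strategies to the sets of problems they solve). It is only an approximation of the exact minimal-cover computation done with MiniSat++ in Section 2.4, but it conveys how orthogonality, i.e., large marginal gain, drives the choice of each additional strategy.

```python
# Greedy sketch of the strategy-cover idea from Sections 2.4 and 4: repeatedly
# pick the strategy that solves the most problems not yet covered.
# `solves` is an assumed dict mapping a strategy name to the set of problems
# it solves within the time limit (e.g. extracted from test protocols).
def greedy_cover(solves):
    remaining = set().union(*solves.values())   # all problems solvable by some strategy
    schedule = []
    while remaining:
        best = max(solves, key=lambda s: len(solves[s] & remaining))
        gained = solves[best] & remaining
        if not gained:        # defensive: no strategy adds anything new
            break
        schedule.append(best)
        remaining -= gained
    return schedule
```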
References

[1] Jesse Alama, Tom Heskes, Daniel Kühlwein, Evgeni Tsivtsivadze & Josef Urban (2011): Premise Selection for Mathematics by Corpus Analysis and Kernel Methods. CoRR abs/1108.3446. Available at http://arxiv.org/abs/1108.3446.

[2] Alessandro Armando, Peter Baumgartner & Gilles Dowek, editors (2008): Automated Reasoning, 4th International Joint Conference, IJCAR 2008, Sydney, Australia, August 12-15, 2008, Proceedings. Lecture Notes in Computer Science 5195, Springer.

[3] Jasmin Christian Blanchette, Sascha Böhme & Lawrence C. Paulson (2011): Extending Sledgehammer with SMT Solvers. In Nikolaj Bjørner & Viorica Sofronie-Stokkermans, editors: CADE, Lecture Notes in Computer Science 6803, Springer, pp. 116–130. Available at http://dx.doi.org/10.1007/978-3-642-22438-6_11.

[4] Jasmin Christian Blanchette, Lukas Bulwahn & Tobias Nipkow (2011): Automatic Proof and Disproof in Isabelle/HOL. In Cesare Tinelli & Viorica Sofronie-Stokkermans, editors: FroCoS, Lecture Notes in Computer Science 6989, Springer, pp. 12–27. Available at http://dx.doi.org/10.1007/978-3-642-24364-6_2.

[5] Konstantin Korovin (2008): iProver - An Instantiation-Based Theorem Prover for First-Order Logic (System Description). In Armando et al. [2], pp. 292–298. Available at http://dx.doi.org/10.1007/978-3-540-71070-7_24.

[6] Leonardo Mendonça de Moura & Nikolaj Bjørner (2008): Z3: An Efficient SMT Solver. In C. R. Ramakrishnan & Jakob Rehof, editors: TACAS, Lecture Notes in Computer Science 4963, Springer, pp. 337–340. Available at http://dx.doi.org/10.1007/978-3-540-78800-3_24.

[7] Alexandre Riazanov & Andrei Voronkov (2002): The design and implementation of VAMPIRE. AI Commun. 15(2-3), pp. 91–110. Available at http://iospress.metapress.com/openurl.asp?genre=article&issn=0921-7126&volume=15&issue=2&spage=91.

[8] Stephan Schulz (2002): E - A Brainiac Theorem Prover. AI Commun. 15(2-3), pp. 111–126. Available at http://iospress.metapress.com/content/n908n94nmvk59v3c/.

[9] Niklas Sörensson & Niklas Eén (2008): MiniSat 2.1 and MiniSat++ 1.0 - SAT Race 2008 Editions. Technical Report.

[10] Geoff Sutcliffe, Stephan Schulz, Koen Claessen & Allen Van Gelder (2006): Using the TPTP Language for Writing Derivations and Finite Interpretations. In Ulrich Furbach & Natarajan Shankar, editors: Proc. of the 3rd IJCAR, Seattle, LNAI 4130, Springer, pp. 67–81.

[11] Tanel Tammet: Gandalf version c-1.0c Reference Manual.

[12] Josef Urban (2006): MPTP 0.2: Design, Implementation, and Initial Experiments. J. Autom. Reasoning 37(1-2), pp. 21–43. Available at http://dx.doi.org/10.1007/s10817-006-9032-3.

[13] Josef Urban, Geoff Sutcliffe, Petr Pudlák & Jiří Vyskočil (2008): MaLARea SG1 - Machine Learner for Automated Reasoning with Semantic Guidance. In Armando et al. [2], pp. 441–456. Available at http://dx.doi.org/10.1007/978-3-540-71070-7_37.