Premise Selection for Mathematics by Corpus Analysis and Kernel Methods

Jesse Alama (1), Daniel Kühlwein (2), Evgeni Tsivtsivadze (2), Josef Urban (2), and Tom Heskes (2)⋆

(1) Center for Artificial Intelligence, New University of Lisbon
(2) Intelligent Systems, Institute for Computing and Information Sciences, Radboud University Nijmegen
Abstract. Smart premise selection is essential when using automated reasoning as a tool for large-theory formal verification, formal proof development, and experimental reverse mathematics. A strong method for premise selection in complex mathematical libraries is the application of machine learning to large corpora of proofs. This work develops learning-based premise selection in two ways. First, a newly available fine-grained dependency analysis of existing high-level formal mathematical proofs is used to build a large knowledge base of proof dependencies, providing new precise data for ATP-based re-verification and for training premise selection algorithms. Second, a new machine learning algorithm for premise selection based on kernel methods is proposed and implemented. The impact of both techniques is evaluated on large-theory formalized mathematical data, and shows significant improvement over existing methods for ATP assistance in real-world mathematics.
1 Introduction
In this paper we use two complementary methods to significantly improve learning-based premise selection and, consequently, automated theorem proving in large
formal mathematical libraries.
The first method makes the first practical use of the recently conducted
detailed proof dependency analysis of the large Mizar Mathematical Library
(mml). This analysis allows us to construct precise problems for ATP-based
re-verification of the Mizar proofs. More importantly, the precise dependency
data can be used as a vast repository of previous problem-solving knowledge
from which premise selection can be efficiently automatically learned by machine
learning algorithms.
⋆ Jesse Alama was funded by the FCT project “Dialogical Foundations of Semantics” (DiFoS) in the ESF EuroCoRes programme LogICCC (FCT LogICCC/0001/2007). The other authors were supported by the NWO projects “MathWiki a Web-based Collaborative Authoring Environment for Formal Proofs” and “Learning2Reason”.
The second complementary improvement is achieved by using a new kernel-based machine learning algorithm. This means that based on the large number
of previously solved mathematical problems, we can more accurately estimate
which premises will be needed for proving a new conjecture.
Such learned knowledge then significantly helps automated proving of new
formally expressed mathematical problems by recommending the most relevant
previous theorems and definitions from the very large existing libraries, shielding the existing ATP methods from a combinatorial explosion caused by thousands of irrelevant axioms. The better such symbiosis of formal mathematics
and learning-assisted automated reasoning gets, the better for both parties: Improved automated reasoning increases the efficiency of formal mathematicians,
and lowers the cost of producing formal mathematics. This in turn leads to
larger corpora of previously solved nontrivial problems from which the learning-assisted ATP can extract additional problem-solving knowledge covering larger
and larger parts of mathematics.
The rest of the paper is organized as follows. Section 2 describes recent
developments in large-theory automated reasoning and motivates our problem.
Section 3 summarizes the recent implementation of precise dependency analysis
over the large mml, and its use for ATP-based cross-verification and training
premise selection. Section 4 describes the general machine learning approach to
premise selection and an efficient kernel-based multi-output ranking algorithm
for premise selection. Section 5 evaluates the techniques, and Section 6 concludes
and discusses future work and extensions.
2 Large-theory Automated Reasoning
In recent years, large formal libraries of re-usable knowledge expressed in
rich formalisms have been built with interactive proof assistants, such as Mizar,
Isabelle, Coq, and HOL (light), among many others. Formal approaches are also
being used increasingly in non-mathematical fields such as software and hardware verification and common-sense reasoning about real-world knowledge. Such
trends lead to growth of formal knowledge bases in these fields.
One important development is that a number of these formal knowledge
bases and core logics have been translated to first-order formats suitable for
ATPs [8,21,9]. These translations give rise to large, semantically rich corpora
that present significant new challenges for the field of automated reasoning. The
techniques developed so far for ATP in large theories can be broadly divided
into two categories:
1. Heuristic symbolic analysis of the formulas appearing in problems, and
2. Analysis of previous proofs.
In the first category, the SInE preprocessor by K. Hoder [6,23] has so far been the
most successful. SInE is particularly strong in domains with many hierarchical
definitions, such as those of common-sense ontologies. In the second category,
machine learning of premise selection, as done e.g. by the MaLARea [24] system,
is a strong method in hard mathematical domains, where the knowledge bases
contain far fewer definitions than nontrivial lemmas and theorems, and previously
verified proofs can be used for learning proof guidance.
Automated reasoning in large mathematical corpora is an interesting new
field in several respects. Large theories permit data-driven approaches [14] to
constructing ATP algorithms; indeed, the sheer size of such libraries actually
necessitates such methods. It turns out that purely deductive, brute-force search
methods can be improved significantly by heuristic and inductive3 methods,
thus allowing research into successful, experimentally driven combinations [24]
of inductive and deductive methods. The MPTP Challenge benchmark4 , for
example, and the set of large benchmarks defined in [23] can serve for rigorous
evaluation of such novel AI methods over thousands of real-world mathematical
problems.5 Apart from the novel AI aspect, and the obvious proof assistance
aspect, automated reasoning over large formal mathematical corpora can also
become a new weapon in the established field of reverse mathematics [15]. This line of research has already been started, for example by Solovay’s analysis [16] of the connection between Tarski’s axiom6 and the axiom of choice, and by Alama’s analysis of Euler’s polyhedron formula [2], both conducted over the Mizar
Mathematical Library (mml).
3 Computing Fine-grained Dependencies in Mizar
In the world of automated theorem proving, proofs contain essentially all logical
steps, even very small ones (such as the steps taken in a resolution proof). In
the world of interactive theorem proving, one of the goals is to allow the users to
express themselves with minimal logical verbosity. Towards that end, ITP systems often come with mechanisms for suppressing some steps of an argument. By
design, an ITP can suppress logical and mathematical steps that might be necessary for a complete analysis of what a particular proof depends upon. In this
section we summarize a recently developed solution to this problem for the Mizar Mathematical Library (mml).7 The basis of the solution is advanced refactoring of the “articles” of the mml into one-item micro-articles, and computing their minimal dependencies by a brute-force minimization algorithm. For a more detailed discussion of Mizar, see [7,5]; for a more detailed discussion of refactoring and minimization algorithms, see [3].

3 The word inductive denotes here inductive reasoning, as opposed to deductive reasoning.
4 See http://www.tptp.org/MPTPChallenge
5 We do not evaluate data-driven ATP methods on the recent CASC LTB datasets, which are too small to allow useful application of learning, and which are, by design, limited to problems solvable without learning. Our goal is to help real mathematicians, who work with and re-use large amounts of previous difficult proofs and theorems.
6 Tarski’s axiom is the principle: for every set N there exists a set M such that
  – N ∈ M,
  – if X ∈ M and Y ⊂ X, then Y ∈ M,
  – if X ∈ M, then P(X) ∈ M, and
  – if X ⊂ M and |X| ≠ |M|, then X ∈ M.
  See [18].
7 http://www.mizar.org/
As a leading example of how inferences in ITP-assisted formal mathematical
proofs can be suppressed, consider a theorem of the form
∀x : σ [φ(f(x))],
where “x : σ” means that the variable x has type σ, and f is a unary function
symbol that accepts arguments of type τ . Suppose further that, prior to the assertion of this theorem, it is proved that σ is a subtype of τ . The well-formedness
of the theorem depends on this subtyping relationship. Moreover, the proof of
the theorem may not mention this fact; the subtyping relationship between σ
and τ may very well not be an outright theorem. In such a situation, the fact
∀x(x : σ → x : τ )
is suppressed. Indeed, we can see that by not requiring the author of a formal
proof to supply such subtyping relationships, we permit him to focus more on the
heart of the matter of his proof, rather than focusing on tedious (but logically
necessary) steps such as subtyping relationships. But if we are interested in giving
a complete answer to the question of what a formalized proof depends upon, we
must expose suppressed facts and inferences. Having the complete answer is
important for a number of applications, see [3] for examples. The particular
importance for the work described here is that when efficient first-order ATPs
are used to assist high-level formal proof assistants like Mizar, the difference
between the implicitly used facts and the explicitly used facts disappears. The
ATPs work in a much simpler setting, and they obviously need to explicitly
know all the facts that are necessary for finding the proofs. (If we were to omit
the subtyping axiom, for example, an ATP might find, to our surprise, that our
problem is unprovable or even countersatisfiable.)
The first step in the computation of fine-grained dependencies in Mizar is
to break up each article in the mml into a sequence of Mizar texts, each consisting of a single top-level item (e.g., theorem, definition). Each of these texts
can—with suitable preprocessing—be regarded as a complete, valid Mizar article
in its own right. The decomposition of a whole article from the mml into such
smaller articles typically requires a number of nontrivial refactoring steps, comparable, e.g., to automated splitting and re-factoring of large programs written
in programming languages with complicated syntactic mechanisms.
In Mizar, every article begins with a so-called environment specifying the
background knowledge (theorems, notations, etc.) that is used to verify the article. The actual Mizar content that is imported, given an environment, is, in
general, a rather conservative overestimate of the items that the article actually
needs. That is why a minimization process is applied to the environment to compute the smallest set of items that are sufficient to verify each “micro-article”.
This produces a minimal set of dependencies for each Mizar item, both syntactic (e.g., notational macros), and semantic (e.g., theorems, typings, etc.). The
drawback of this minimization process is that the brute-force minimization of
certain dependencies can be time consuming.8 The advantage is that (unlike in
any other proof assistant) the set of dependencies is truly minimal (with respect
to the power of the proof checker), and does not include redundant dependencies
which are typically drawn in by overly powerful proof checking algorithms (like
congruence closure over sets of all available equalities, etc.) when the dependency
tracking is implemented internally inside a proof assistant. Another advantage of
this approach is that it also provides syntactic dependencies, which are needed
for real-world recompilation of the particular item as written in the article. This
functionality is important for fast fine-grained recompilation in formal wikis [1]; however, for semantic applications like ATP we only consider the truly semantic dependencies, i.e., those that result in a formula when translated by the MPTP system [21] to first-order logic.
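The brute-force minimization step can be pictured as a simple removal loop over the environment of each micro-article. The sketch below is only an illustration of this idea (the actual implementation and its heuristics are described in [3]); the `verifies` callback standing in for a run of the Mizar checker is hypothetical.

def minimize_environment(item, environment, verifies):
    """Illustrative brute-force dependency minimization for one micro-article.

    item        -- the single top-level Mizar item to be checked
    environment -- list of candidate dependencies (theorems, typings, notations, ...)
    verifies    -- hypothetical callback: verifies(item, deps) is True iff the
                   Mizar checker accepts `item` with exactly `deps` as environment
    """
    needed = list(environment)
    for dep in list(environment):
        trial = [d for d in needed if d is not dep]
        if verifies(item, trial):
            # The item still verifies without `dep`, so drop it permanently.
            needed = trial
    return needed  # no single remaining element can be removed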
Table 1 provides a summary of the fine-grained dependency data for Mizar,
in particular for a set of 33 articles coming from the MPTP Challenge and used
for experiments in Section 5. For each theorem in these 33 Mizar articles we show how many explicit dependencies (on average) are involved in its proof and how many implicit dependencies (on average) it contains. We thus have, for each theorem, a list of its explicit and implicit dependencies. The table also shows how much of an improvement our exact dependency calculation is compared to a simple safe fixed-point MPTP construction of an over-approximation of what is truly needed.
4 Premise Selection in Large Theories by Machine Learning

4.1 Problem Definition
Knowledge of a large number of previous nontrivial proofs and problem-solving
techniques is used by mathematicians to guide their thinking about new problems. To our knowledge, the detailed mml proof analysis described above provides the currently largest computer-understandable corpus of dependencies of
mathematical proofs, on which we would like to train computers, and thus begin to emulate the training of human mathematicians. A particularly important
problem to learn in the context of large real-world mathematics is premise selection: Given a large knowledge base P of thousands of premises and proofs, and
a new conjecture x, find the premises Px that are most relevant for proving x.
Usually, we have the following setting: let Γ be the set of all first-order formulas over a fixed countable alphabet, let X = {x_i | 1 ≤ i ≤ n} ⊂ Γ be the set of conjectures, let P = {p_j | 1 ≤ j ≤ m} ⊂ Γ be the set of premises, and let Y : X × P → {0, 1} be the indicator function such that y_{x_i,p_j} = 1 if p_j is used to prove x_i and y_{x_i,p_j} = 0 otherwise. In the case of the whole mml we have approximately n = 50000 and m = 100000, while for the experiments in Section 5 we have n = 2078 and (on average) m = 1976.5. The corresponding indicator function Y_MML is provided by the analysis described in Section 3.

8 This can however be improved by various heuristics for guessing the needed dependencies.

Table 1. Effectiveness of fine-grained dependencies from the MPTP Challenge

Article    Theorems  Expl. Refs.  Uniq. Expl. Refs.  Fine Deps.  MPTP Deps.
xboole_0        7       4             2.7              11.57       12.62
xboole_1      117       5.34          2.3              15.27       17.86
enumset1       87       3.26          2.7              10.67       10.82
zfmisc_1      129       4.74          2.9              16.59       21.08
subset_1       43       4.62          2.4              22.30       28.15
setfam_1       48       6.56          2.4              25.62       37.93
relat_1       184       5.66          2.2              19.97       27.31
funct_1       107       7.69          3.4              22.94       42.05
ordinal1       37       7.81          3.9              26          61.92
wellord1       53      11.7           5.6              30.45       55.5
relset_1       32       4.71          2.6              27          55.59
mcart_1        92       5.71          2.8              21.25       29.77
wellord2       24      14.2           6.9              36.41       75.2
funct_2       124       4.14          2.5              30.77       92.45
finset_1       15       8.66          3.9              23.93       72.22
pre_topc       36       6.47          3.2              35.58       53.51
orders_2       56      10.6           5.3              46.28       79.77
lattices       27       5.59          3.1              43          72.96
tops_1         71       6.67          4.2              43.46       66.42
tops_2         65       7.36          4.0              37.41      103.4
compts_1       23      17.7           8.6              48.86      102.9
connsp_2       29       9.86          6.4              39.51       96.5
filter_1       61      14.6           5.3              52.27      122.8
lattice3       55       8.6           4.2              47.07       92.25
yellow_0       70       6.75          3.2              26.55       46.12
yellow_1       28       9.03          5.1              53.17      128.3
waybel_0       76       9.34          4.1              35.03       82.34
tmap_1        141       8.78          4.5              47.04      140.0
tex_2          74       7.66          5.0              37.31      155.6
yellow_6       44      11.2           6.4              50.31      173.5
waybel_7       46      13.0           7.7              57.10      140.3
waybel_9       41       9             5.2              51.56      156.5
yellow19       36       8.44          5.3              53.02      137.2

Article: Mizar article relevant to the MPTP Challenge.
Theorems: total number of theorems in the article.
Expl. Refs.: average number of (non-unique) explicit references (to theorems, definitional theorems, and schemes) per theorem in the article.
Uniq. Expl. Refs.: average number of unique explicit references per theorem.
Fine Deps.: average number of all needed items (explicitly referred-to theorems together with implicitly needed items) per theorem, as computed by the fine-grained dependency analysis.
MPTP Deps.: average number of items needed for re-proving per theorem, as approximated by the MPTP fixpoint algorithm.
For each premise p ∈ P we can construct a dataset D_p = {(x, y_{x,p}) | x ∈ X}. Based on D_p, a suitable algorithm can learn a classifier C_p(·) : Γ → R which, given a formula x as input, predicts whether the premise p is relevant for proving x. Typically, classifiers give a graded output. Having learned classifiers for all premises p, the classifier predictions C_p(x) can be ranked: the premises that are predicted to be most relevant will have the highest output C_p(x) (see [20]).
To give a simple example, one way to define a classifier for a premise p is to compute a set of features ϕ_k(x), 1 ≤ k ≤ l, for each formula x and set the classifier to be a linear combination of these features: C_p(x) = Σ_{k=1}^{l} α_{p,k} ϕ_k(x), where the α_{p,k} ∈ R are weights to be learned. The performance of such a simple linear classifier would obviously vary, depending a lot on how suitable the selected features are for characterizing formulas in various domains. The real-world classifier we use for learning premise selection is described below in Section 4.3, after kernel methods are introduced.
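To make this formulation concrete, the sketch below ranks premises for a conjecture with such per-premise linear classifiers over symbol-count features; the weights are assumed to have been learned already, and all names and data are hypothetical toy values rather than anything taken from the mml.

import numpy as np

def symbol_features(formula_symbols, signature):
    """Multiset-of-symbols feature vector phi(x) over a fixed signature."""
    index = {s: k for k, s in enumerate(signature)}
    phi = np.zeros(len(signature))
    for s in formula_symbols:
        phi[index[s]] += 1.0
    return phi

def rank_premises(conjecture_symbols, weights, signature):
    """Rank premises by the linear score C_p(x) = sum_k alpha_{p,k} * phi_k(x)."""
    phi = symbol_features(conjecture_symbols, signature)
    scores = {p: float(alpha @ phi) for p, alpha in weights.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical toy data: two premises over a three-symbol signature.
signature = ["in", "subset", "powerset"]
weights = {"t1_hyp": np.array([0.9, 0.1, 0.0]),
           "t2_hyp": np.array([0.2, 0.3, 0.8])}
print(rank_premises(["in", "in", "subset"], weights, signature))  # most relevant first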
4.2 Kernel Methods
While so far we have exclusively used multiclass naive Bayes [24] as a reasonably common learning method capable of very fast learning over the whole mml, in this work we also experiment with state-of-the-art kernel methods for premise selection in the formal mathematical domain, and in combination with ATP systems as a component of the MaLARea inductive/deductive AI loop in Section 5.4.
Kernel-based machine learning methods [4,13,14] are becoming increasingly
important for data analysis tasks. One key difference between traditional statistical methods and the kernel-based approach is that the former deals mainly
with the data represented in vector form, whereas the latter can be naturally
applied to structured (non-vectorial) data, such as sequences, trees, graphs, etc.
This is generally achieved by mapping the input data into the feature space
using the so-called kernel function, and then using a learning algorithm to discover relations in that space. The feature space so constructed can have a high
dimension; however, the “kernel trick” of using inner products allows the learning algorithm to operate in that feature space without explicitly computing the
coordinates of its constituent data points.
To summarize, several properties make kernel-based methods applicable to many real-world problems where data is naturally structured. Kernel methods allow non-linear models and can incorporate prior knowledge about the target domain, which often lets them perform significantly better than generic learning methods. Furthermore, kernel methods are naturally applicable in situations where data is not represented as a vector, thus avoiding potentially extensive (and expensive) preprocessing.
4.3 Kernel-Based Multi-Output Ranking Algorithm for Premise Selection
As introduced in Section 4.1, given Γ, X, P, Y, a premise p ∈ P, and the corresponding dataset D_p = {(x, y_{x,p}) | x ∈ X}, our goal is to find the classifier function C_p that best fits the data points in D_p, i.e.,

C_p = argmin_f  Σ_{i=1}^{n} l(f(x_i), y_{x_i,p}) + λ ‖f‖²,    (1)

where f ∈ R^Γ, l is a loss function, e.g. l(f(x), y) = (y − f(x))², λ ∈ R is a regularization parameter to prevent overfitting, and ‖·‖ is some norm measuring the complexity of f.
A kernel is a function k such that for all x, x′ ∈ Γ: k(x, x′) = ⟨φ(x), φ(x′)⟩, where φ : Γ → F is a mapping from Γ to a feature space F and ⟨·, ·⟩ : F × F → R. Often, F is an inner product space and ⟨·, ·⟩ is the corresponding inner product. A nice property of kernel methods is that this feature representation need not be computed explicitly: all necessary operations for learning and prediction can be written in terms of kernel functions alone. In mathematical terms, this observation follows from the so-called Representer Theorem, which we give here for completeness.
Let k be a kernel. If we restrict f from equation (1) to the following class of functions,

F = { f ∈ R^Γ | f(x) = Σ_{i=1}^{∞} β_i k(x, z_i), β_i ∈ R, z_i ∈ Γ, ‖f‖ < ∞ },

with ‖Σ_{i=1}^{∞} β_i k(·, z_i)‖² = Σ_{i,j=1}^{∞} β_i β_j k(z_i, z_j) (the norm in the Reproducing Kernel Hilbert Space associated with k, see [12]), then, by the Representer Theorem [12], the minimizer of equation (1) has the following form:

C_p(x) = Σ_{i=1}^{n} a_{p,i} k(x, x_i),    (2)

where a_{p,i} ∈ R and x_i ∈ X.9 So once we fix the kernel k, learning the classifier boils down to fitting the parameters a_{p,i}.

9 Note that if φ : Γ → R^n and ⟨·, ·⟩ is the dot product on R^n, then this minimizer can be rewritten as a linear combination of the features and we recover the example linear classifier from Section 4.1.

For our experiments, we defined a kernel-based multi-output ranking (MOR) algorithm that is a relatively straightforward extension of our preference learning algorithm presented in [19]. It is based on the regularized least-squares (RLS) algorithm [11]. The basic idea is as follows:
First, we fix the loss function as l(y, y′) = (y − y′)². Plugging (2) into (1), we need to solve

min_{a_{p,1},…,a_{p,n}}  Σ_{i=1}^{n} ( Σ_{j=1}^{n} a_{p,j} k(x_i, x_j) − y_{x_i,p} )² + λ Σ_{i,j=1}^{n} a_{p,i} a_{p,j} k(x_i, x_j).    (3)
Since all C_p depend on the values of k(x_i, x_j), 1 ≤ i, j ≤ n, we can speed up the computation: instead of learning the classifiers C_p for each premise separately, we learn all the weights a_{p,i} simultaneously. Let A = (a_{p,i})_{i,p} with 1 ≤ i ≤ n, p ∈ P, i.e. A is the matrix whose columns contain the parameters of the individual premise classifiers, and let K = (k(x_i, x_j))_{i,j}, 1 ≤ i, j ≤ n, be the kernel matrix. Similarly to (3), we need to solve the following problem to find the optimal parameters:

argmin_A  tr( (Y − KA)^t (Y − KA) + λ A^t K A ).    (4)

Setting the derivative to zero and solving with respect to A gives

A = (K + λI)^{−1} Y.    (5)
Note that the whole derivation is independent of the actual kernel and features,
allowing further experiments with different kernels and features. A particular
initial choice of the kernel and formula features for our domain is explained
below, in Section 4.4.
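Computationally, equations (5) and (2) amount to one linear solve followed by matrix-vector products. The following numpy sketch shows that step in isolation; it assumes the kernel matrix and indicator matrix are already available, and is an illustration under those assumptions rather than the authors' actual implementation.

import numpy as np

def train_mor(K, Y, lam):
    """A = (K + lam*I)^{-1} Y, i.e. equation (5).

    K   -- (n, n) kernel matrix over the training conjectures
    Y   -- (n, m) 0/1 indicator matrix: Y[i, j] = 1 iff premise j is used to prove conjecture i
    lam -- regularization parameter lambda
    """
    n = K.shape[0]
    # Solving the linear system is numerically preferable to forming the inverse.
    return np.linalg.solve(K + lam * np.eye(n), Y)

def rank_premises_for(k_new, A):
    """Equation (2): scores of all premises for a new conjecture.

    k_new -- (n,) vector of kernel values k(x_new, x_i) against the training conjectures
    A     -- (n, m) weight matrix returned by train_mor
    """
    scores = k_new @ A
    return np.argsort(-scores)  # premise indices, most relevant first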
4.4 MOR Experimental Setup
We saw that kernel methods are quite flexible. In our experiments we use the following settings. The features of a conjecture are the multiset of the symbols occurring in the conjecture, i.e. φ : Γ → R^n where n is the number of different symbols in the whole data set (the cardinality of the signature of Γ). As the kernel function we use the so-called Gaussian kernel, which is parameterized by a single parameter σ and defined as

k(x, x′) = exp( −(⟨φ(x), φ(x)⟩ − 2⟨φ(x), φ(x′)⟩ + ⟨φ(x′), φ(x′)⟩) / (2σ²) ),

with ⟨·, ·⟩ being the standard dot product on R^n. Thus, to fully determine the
classifiers, it remains to find good values for the parameters λ and σ. This is
done (as usual with such parameter optimization for kernel methods) by simple
(logarithmically scaled) grid search and testing on a small subset of the (MPTP)
dataset. These settings were then used for the experiments described in Sections 5.3 and 5.4. The MOR algorithm is not yet used in the large-scale experiment described in Section 5.2, because the efficient incremental training necessary for that experiment is so far only available with the SNoW toolkit.
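For concreteness, the kernel matrix K entering the MOR training step could be computed from the symbol-multiset features as in the following sketch; the symbol lists and the value of σ are hypothetical placeholders, with σ and λ in practice chosen by the grid search just described.

import numpy as np

def feature_matrix(formulas, signature):
    """Rows are multiset-of-symbols feature vectors over the given signature."""
    index = {s: k for k, s in enumerate(signature)}
    Phi = np.zeros((len(formulas), len(signature)))
    for i, symbols in enumerate(formulas):
        for s in symbols:
            Phi[i, index[s]] += 1.0
    return Phi

def gaussian_kernel_matrix(Phi, sigma):
    """K[i, j] = exp(-(<phi_i, phi_i> - 2<phi_i, phi_j> + <phi_j, phi_j>) / (2 sigma^2))."""
    sq = np.sum(Phi ** 2, axis=1)
    dists = sq[:, None] - 2.0 * Phi @ Phi.T + sq[None, :]
    return np.exp(-dists / (2.0 * sigma ** 2))

# Hypothetical toy data: each conjecture given as the list of its symbol occurrences.
formulas = [["in", "in", "subset"], ["subset", "powerset"], ["in", "powerset"]]
signature = ["in", "subset", "powerset"]
K = gaussian_kernel_matrix(feature_matrix(formulas, signature), sigma=1.0)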
5 Experiments
We evaluate the improvements obtained by using fine-grained dependency data
on a set of 2078 problems extracted from the Mizar library.10 Then we experimentally plug the MOR algorithm into the MaLARea metasystem, and compare
its performance on the standard large-theory MPTP Challenge benchmark with
MaLARea using naive Bayes for learning. All measurements are done on an Intel
Xeon E5520 2.27GHz server with 8GB RAM and 8MB CPU cache.
5.1 Using the Fine-Grained Dependency Analysis for Re-proving
The first experiment evaluates the effect of fine-grained dependencies on re-proving Mizar theorems automatically. We compare the performance of Vampire
[10] on three datasets:
– Versions of the 2078 problems containing all previous mml contents as premises.
This means that the conjecture is attacked with “all existing knowledge”,
without any premise selection. This is a common use-case for proving new
conjectures fully automatically, see also Section 5.2.
– Versions of the 2078 problems with premises pruned by the old heuristic dependency-pruning method used for constructing re-proving problems by the MPTP system.11
– Versions of the 2078 problems with premises pruned using the new fine-grained dependency information. This use-case has been introduced in proof assistants by Harrison’s MESON tactic, which takes an explicit list of premises from the large library selected by a knowledgeable user, and attempts to prove the conjecture just from these premises. We are interested in how powerful ATPs can get on the mml with such precise advice.
All datasets contain the same conjectures. They only differ in the number of redundant axioms. Note that the problems in the second and third dataset are considerably smaller than the unpruned problems. The average number of premises
is 1976.5 for the unpruned problems, 74 for the heuristically-pruned problems
and 31.5 for the problems pruned using fine-grained dependencies.
The results are shown in Table 2. Vampire (run in the unmodified automated CASC mode) solves 548 of the unpruned problems. If we use the ’-d1’ parameter12, Vampire solves 556 problems. Things change a lot with pruning.

10 The 33 Mizar articles from which problems were previously selected for constructing the MPTP Challenge are used. We however use a new version of Mizar and the mml allowing the dependency analysis, and test on all problems from these articles.
11 The MPTP heuristic proceeds by taking all explicit premises contained in the original human-written Mizar proof. To get all the premises used by Mizar implicitly, the heuristic watches the problem’s set of symbols, and adds the implicitly used formulas (typically typing formulas about the problem’s symbols) in a fixpoint manner. The heuristic tries hard to guarantee completeness; however, minimality is not achievable with such simple heuristics.
12 In [23], running Vampire with the -d1 pruning parameter resulted in a significant performance improvement on large Mizar problems.

Table 2. Performance of Vampire (10 s time limit) on 2078 MPTP problems with different axiom pruning.

                       Unpruned  Unpruned (-d1)  Heuristic Pruning  New Pruning
Average problem size    1976.5       1976.5             74              31.5
Solved problems          548          556              1023            1105
Solved as percentage      26.4         26.8               49.2            53.2
Vampire solves 1023 of the 2078 problems when the old heuristic pruning is applied. Using
the pruning based on the new fine-grained analysis Vampire solves 1105 problems, which is an 8% improvement over the heuristic pruning in the number of
problems solved. Since the heuristic pruning becomes more and more inaccurate
as the mml grows (see Table 1), we can conjecture that this improvement will
be even more significant when considering the whole mml. Also note that these
numbers point to the significant improvement potential that can be gained by
good premise selection: The performance on the pruned dataset is doubled in
comparison to the unpruned dataset. Again, this ratio grows as the mml grows and the number of premises approaches 100,000.
Thus, there is a lot to gain by providing good algorithms for premise selection,
which is demonstrated in the next subsections. In Section 5.2 the fast naive Bayes
(SNoW) algorithm is incrementally trained on the fine-grained data and used for
premise selection followed by ATP runs, significantly improving the performance
of Vampire/SInE on large problems. Then the prediction precision of the new kernel-based MOR algorithm is compared in Section 5.3 with that of the naive Bayes algorithm. Finally, in Section 5.4 we show that the MaLARea system is improved by using MOR as a learning component instead of naive Bayes.
5.2 Combining Fine-Grained Dependencies with Learning
For the next experiment, we emulate the growth of the library (limited to the
2078 problems), by considering all previous theorems and definitions when a new
conjecture is attempted. This is a natural “ATP advice over the whole library”
scenario, in which the ATP problems however become very large, containing
thousands of the previously proved formulas. Premise selection can therefore
help significantly. In this setting we use the fine-grained dependencies extracted
from previous proofs to train the premise-selection algorithms, use their advice
on the new problems, and compare the final ATP performance obtained on the
large problems filtered by the advice.
For each problem, the learning algorithm is allowed to learn on the dependencies of all previous problems, which is a natural scenario emulating that
the whole “previous” library is available whenever a new problem is attempted.
This approach however requires us to do 2078 trainings as the problems and
their proofs are added to the library and the dataset grows. This can so far only
be done efficiently with the SNoW toolkit, which (thanks to the simplicity of the
naive Bayes algorithm) allows so-called “incremental learning”.13 That is why
we do not (yet) evaluate the MOR algorithm in this scenario, and only compare
it on smaller fixed datasets with SNoW in the next sections.
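As an illustration of the kind of incrementally trainable premise ranker used in this scenario, the sketch below implements a generic naive-Bayes-style scorer over conjecture symbols; it is not SNoW’s actual multiclass naive Bayes, and all interfaces are assumptions made for the example.

import math
from collections import defaultdict

class NaiveBayesPremiseRanker:
    """Generic naive-Bayes-style premise ranker with incremental updates.

    Illustration only: SNoW's multiclass naive Bayes differs in its details.
    Each newly solved problem is added as a training example (conjecture
    symbols plus the premises used in its proof) without retraining from scratch.
    """

    def __init__(self):
        self.examples = 0
        self.premise_count = defaultdict(int)                      # premise -> number of proofs using it
        self.cooccurrence = defaultdict(lambda: defaultdict(int))  # premise -> symbol -> count

    def update(self, conjecture_symbols, used_premises):
        self.examples += 1
        for p in used_premises:
            self.premise_count[p] += 1
            for s in set(conjecture_symbols):
                self.cooccurrence[p][s] += 1

    def rank(self, conjecture_symbols, top_k):
        scores = {}
        for p, count_p in self.premise_count.items():
            # log P(p) + sum over symbols of log P(symbol | p), with add-one smoothing
            score = math.log(count_p / self.examples)
            for s in set(conjecture_symbols):
                score += math.log((self.cooccurrence[p][s] + 1) / (count_p + 2))
            scores[p] = score
        return sorted(scores, key=scores.get, reverse=True)[:top_k]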
Table 3. Performance of Vampire with pruning by trained SNoW.

                  Unpruned  Unpruned (-d1)  Top 40 Premises  Top 100 Premises  Top 200 Premises
Solved problems      548         556              656              610               600
Table 3 compares the performance of Vampire with a 10-second time limit run on problems with various amounts of pruning done by SNoW trained on all previous problems. If we use only 5 seconds instead of 10, Vampire with SNoW advice solves 572 problems when using only the top 40 premises and 554 problems with the top 200 premises. Combining these two shorter runs solves 717 problems altogether in 10 seconds, which is a 31% improvement over the 548 problems Vampire solves in auto-mode, and a 29% improvement over the 556 problems solved by Vampire using the -d1 option.14
5.3 MOR vs SNoW: Measuring the Coverage of Existing Proofs
A comparison of the standard machine learning performance of SNoW and the
new MOR learning algorithm on the MPTP Challenge proof data15 is shown in
Figure 1. This figure shows the average premise coverage graph for SNoW and
for MOR. The rankings we get from the algorithms are compared with the actual
premises used in the proof, by computing the size (ratio) of the overlap for the
increasing top segments of the ranked predicted premises (the size of the segment
is the x axis in Figure 1). I.e. on average 60% of the used premises are within the
24 highest MOR-ranked premises, whereas when we consider the SNoW ranking
only around 55% of the used premises are with the 24 highest ranked premises.
The average coverage is computed in the following way: The MPTP Challenge
dataset is randomly divided into ten equally sized parts. Each premise-selection
algorithm is trained ten times, each time leaving out in training one part, on
13
14
15
Incremental learning here means that the testing examples are immediately used
also as training examples for the learning algorithm. This obviously saves a great
amount of re-training.
Note that Vampire/SiNE does strategy scheduling internally, and with different SiNE
parameters. Thus combining two different premise selection strategies by us is perfectly comparable to the way Vampire’s automated mode is constructed and used.
Also note that combining the two different ways in which unadvised Vampire/SinE
was run is not productive: The union of the both unadvised runs is just 559 problems, which is only 3 more solved problems (generally in 20s) than with running
Vampire/SinE with ’-d1’.
http://www.tptp.org/MPTPChallenge
which evaluation of the premise selection is done after the training. Thus, in the
ten training/testing runs, we get exactly one ranked premise prediction for each
problem.
Fig. 1. Comparison of SNoW and MOR Coverage
The figure shows that the MOR algorithm converges faster than SNoW, which
is important for providing ATPs with the relevant premises without hindering
the ATPs with too many irrelevant axioms (recall the detrimental effect of irrelevant axioms shown in Table 2). Note that this kind of comparison is the
standard endpoint in machine learning applications like keyword-based document retrieval, consumer choice prediction, etc. However, in a semantic domain
like ours, we can go further, and see how this improved prediction performance
helps the theorem proving process. This is done in the next section.
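The coverage statistic plotted in Figure 1 is straightforward to reproduce once the rankings are available. The sketch below computes an averaged coverage curve from per-problem rankings and the sets of premises actually used in the known proofs; the data structures and toy values are hypothetical.

def coverage_curve(rankings, used_premises, max_k):
    """Average fraction of the actually used premises found among the top-k ranked ones.

    rankings      -- dict: problem -> list of premises, most relevant first
    used_premises -- dict: problem -> set of premises used in the known proof
    Returns a list whose (k-1)-th entry is the average coverage at cutoff k.
    """
    problems = list(rankings)
    curve = []
    for k in range(1, max_k + 1):
        total = 0.0
        for prob in problems:
            used = used_premises[prob]
            top = set(rankings[prob][:k])
            total += len(top & used) / len(used)
        curve.append(total / len(problems))
    return curve

# Hypothetical toy example: coverage reaches 50% at cutoff 2 and 100% at cutoff 4.
print(coverage_curve({"p1": ["a", "b", "c", "d"]}, {"p1": {"b", "d"}}, max_k=4))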
5.4 MOR vs SNoW: MaLARea
In this experiment, we plug the MOR algorithm into the MaLARea system,
and compare its speed and precision on the MPTP Challenge benchmark with
MaLARea running with naive Bayes (SNoW) as a learning algorithm. The MaLARea
metasystem interleaves deductive ATP runs with learning premise selection from
the discovered proofs in a closed feedback loop described in more detail in [22,24].
The overall performance of the metasystem can be improved both by plugging
into it stronger deductive (ATP) tools, and by providing machine learning algorithms that can make better use of the discovered proofs to guide premise
selection for related proofs. Given the better coverage performance of MOR over
SNoW demonstrated in the previous section, using MOR instead of SNoW should
also result in better overall theorem-proving performance of the whole metasystem. The experiment confirms this expectation.
Only eight MaLARea iterations are run, in order to remove the gradual effect
of using many different specifications, which can with sufficient time equalize
any advice algorithm with a random one. The eight learning passes are used to advise on a growing set of premises, from 4 to 256. The comparison is given in Table 4.
Table 4. Comparison between MaLARea-MOR and MaLARea-SNoW

Run  Time Limit  Axiom Limit  Solved MOR  Solved SNoW
 1      1 sec          0           57          58
 2      1 sec        256           79          74
 3      1 sec        256           83          74
 4      1 sec          4           90          84
 5      1 sec          8          117          99
 6      1 sec         16          135         119
 7      1 sec         32          140         127
 8      1 sec         64          141         129
Even though the MaLARea/MOR combination started with one problem fewer solved,16 it converges faster. After the sixth fast one-second run, the system using MOR-based premise selection has solved more problems than any non-learning (non-MaLARea-based) system run for 21 hours on the
MPTP Challenge problems. Both versions of MaLARea take less than one hour to
conduct the eight one-second theorem proving iterations and the corresponding
learnings and predictions.
6 Conclusion and Future Work
We have significantly improved the performance of automated theorem proving
over real-world mathematics by using detailed formally-assisted proof analysis,
by using improved prediction algorithms, and by combining the two with deductive tools. In particular, it was demonstrated that premise selection based on
learning from exact previous proof dependencies improves the ATP performance
in large mathematical theories by about 30% in comparison with state-of-the-art
general premise-selection heuristics like SInE.
Automated reasoning in large mathematical libraries is becoming a complex
AI field, allowing interplay of very different AI techniques. Manual tuning of strategies and heuristics does not scale to large complicated domains, and data-driven approaches are becoming necessary to handle them. At the same time, existing strong learning methods are typically developed on imprecise domains, where feedback loops between prediction and automated verified confirmation as done in MaLARea are not possible. The stronger such AI systems become, the closer we get to formally assisted mathematics, both in its “forward” and “reverse” form. And this is obviously another positive feedback loop that we explore here: the larger the body of formally expressed and verified ideas, the smarter the AI systems that learn from them.

16 This is a random event showing that ATP time limits do not behave perfectly deterministically on real-world computers.
The work started here can be improved in many ways. While we have achieved a 30% ATP improvement on large problems by better premise selection, resulting in 717 problems proved within 10 seconds, we know (from Section 5.1) that with perfect premise selection it is possible to prove 1105 problems. Thus,
there is still a big opportunity for improved premise selection algorithms. Our dependency analysis can be finer and faster, and combined with ATP and machine
learning systems, can be the basis for a research tool for experimental formal
(reverse) mathematics. The MOR algorithm has a number of parameterizations
that we have fixed for the experiments done here. One of the most interesting
parameterizations is the right choice of features for the formal mathematical
domain. So far, we have been using only the symbols (and terms) occurring in
formulas as their feature characterizations, but other features are possible. In
particular, for ad hoc problem collections like the TPTP library, where symbols
are used inconsistently across different problems, formula features that abstract
from particular symbols will likely be needed. Also, the output of the learning
algorithms does not have to be limited to the ranking of premises. In general,
all kinds of relevant problem-solving parameterizations can be learned, and an
attractive candidate for such treatment is the large set of ATP strategies and
options parameterizing the proof search. An obvious future target is to improve
the training efficiency of the MOR algorithm, so that its performance can be
evaluated in the same “growing library” scenario as SNoW was in Section 5.2.
References
1. Alama, J., Brink, K., Mamane, L., Urban, J.: Large formal wikis: Issues and solutions. In: CICM 11 (2011), to appear
2. Alama, J.: Formal Proofs and Refutations. Ph.D. thesis, Stanford University (2009)
3. Alama, J., Mamane, L., Urban, J.: Dependencies in formal mathematics,
http://centria.di.fct.unl.pt/~alama/materials/preprints/
dependencies-in-formal-mathematics.pdf, submitted to LMCS
4. Aronszajn, N.: Theory of reproducing kernels. Transactions of the American Mathematical Society 68 (1950)
5. Grabowski, A., Kornilowicz, A., Naumowicz, A.: Mizar in a nutshell. Journal of
Formalized Reasoning 3(2), 153–245 (2010)
6. Hoder, K., Voronkov, A.: Sine qua non for large theory reasoning. In: CADE 11
(2011), to appear
7. Matuszewski, R., Rudnicki, P.: Mizar: the first 30 years. Mechanized Mathematics
and Its Applications 4, 3–24 (2005)
8. Meng, J., Paulson, L.C.: Translating higher-order clauses to first-order clauses. J.
Autom. Reasoning 40(1), 35–60 (2008)
9. Pease, A., Sutcliffe, G.: First order reasoning on a large ontology. In: Sutcliffe et al.
[17]
10. Riazanov, A., Voronkov, A.: The design and implementation of vampire. AI Commun. 15(2-3), 91–110 (2002)
11. Rifkin, R., Yeo, G., Poggio, T.: Regularized Least-Squares Classification, pp. 131–154. IOS Press (2003), http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.15.3663
12. Schoelkopf, B., Herbrich, R., Williamson, R., Smola, A.J.: A Generalized Representer Theorem. In: Helmbold, D., Williamson, R. (eds.) Proceedings of the 14th
Annual Conference on Computational Learning Theory. pp. 416–426. Berlin, Germany (2001)
13. Scholkopf, B., Smola, A.J.: Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, Cambridge, MA, USA (2001)
14. Shawe-Taylor, J., Cristianini, N.: Kernel Methods for Pattern Analysis. Cambridge
University Press, New York, NY, USA (2004)
15. Simpson, S.G.: Subsystems of Second Order Arithmetic. Perspectives in Mathematical Logic, Springer, 2 edn. (2009)
16. Solovay, R.: AC and strongly inaccessible cardinals. Available on the Foundations of
Mathematics archives at http://www.cs.nyu.edu/pipermail/fom/2008-March/
012783.html (March 29 2008)
17. Sutcliffe, G., Urban, J., Schulz, S. (eds.): Proceedings of the CADE-21 Workshop
on Empirically Successful Automated Reasoning in Large Theories, Bremen, Germany, 17th July 2007, CEUR Workshop Proceedings, vol. 257. CEUR-WS.org
(2007)
18. Tarski, A.: On well-ordered subsets of any set. Fundamenta Mathematicae 32,
176–183 (1939)
19. Tsivtsivadze, E., Pahikkala, T., Boberg, J., Salakoski, T., Heskes, T.: Co-regularized least-squares for label ranking. In: Hüllermeier, E., Fürnkranz, J. (eds.) Preference Learning, pp. 107–123 (2010)
20. Tsivtsivadze, E., Urban, J., Geuvers, H., Heskes, T.: Semantic graph kernels for
automated reasoning. In: SIAM International Conference on Data Mining (2011),
to appear
21. Urban, J.: MPTP 0.2: Design, implementation, and initial experiments. J. Autom.
Reasoning 37(1-2), 21–43 (2006)
22. Urban, J.: MaLARea: a metasystem for automated reasoning in large theories. In:
Sutcliffe et al. [17]
23. Urban, J., Hoder, K., Voronkov, A.: Evaluation of automated theorem proving on
the Mizar mathematical library. In: Fukuda, K., van der Hoeven, J., Joswig, M.,
Takayama, N. (eds.) ICMS. Lecture Notes in Computer Science, vol. 6327, pp.
155–166. Springer (2010)
24. Urban, J., Sutcliffe, G., Pudlák, P., Vyskocil, J.: MaLARea SG1–machine learner
for automated reasoning with semantic guidance. In: Armando, A., Baumgartner,
P., Dowek, G. (eds.) IJCAR. Lecture Notes in Computer Science, vol. 5195, pp.
441–456. Springer (2008)