Learning One-Variable Pattern Languages Very Efficiently on Average, in Parallel, and by Asking Queries*

Thomas Erlebach¹, Peter Rossmanith¹, Hans Stadtherr¹, Angelika Steger¹, and Thomas Zeugmann²

¹ Institut für Informatik, Technische Universität München, 80290 München, Germany, {erlebach|rossmani|stadther|steger}@informatik.tu-muenchen.de
² Department of Informatics, Kyushu University, Fukuoka 812-81, Japan, [email protected]

* A full version of this paper is available as a technical report (cf. [4]).
Abstract. A pattern is a string of constant and variable symbols. The
language generated by a pattern π is the set of all strings of constant
symbols which can be obtained from π by substituting non-empty strings
for variables. We study the learnability of one-variable pattern languages
in the limit with respect to the update time needed for computing a new
single guess and the expected total learning time taken until convergence
to a correct hypothesis. The results obtained are threefold.
First, we design a consistent and set-driven learner that, using the concept of
descriptive patterns, achieves update time O(n² log n), where n is the size of the
input sample. The best previously known algorithm to compute descriptive
one-variable patterns requires time O(n⁴ log n) (cf. Angluin [1]). Second, we give
a parallel version of this algorithm requiring time O(log n) and O(n³/log n)
processors on an EREW-PRAM. Third, we devise a one-variable pattern learner
whose expected total learning time is O(ℓ² log ℓ) provided the sample strings are
drawn from the target language according to a probability distribution D with
expected string length ℓ. The distribution D must be such that strings of equal
length have equal probability, but can be arbitrary otherwise.
Thus, we establish the first one-variable pattern learner having an expected total learning time that provably differs from the update time by
a constant factor only.
Finally, we apply the algorithm for finding descriptive one-variable patterns to learn one-variable patterns with a polynomial number of superset
queries with respect to the one-variable patterns as query language.
1. Introduction
A pattern is a string of constant symbols and variable symbols. The language
generated by a pattern π is the set of all strings obtained by substituting strings
of constants for the variables in π (cf. [1]). Pattern languages and variations
thereof have been widely investigated (cf., e.g., [17, 18, 19]). This continuous
interest in pattern languages has many reasons, among them the learnability in
the limit of all pattern languages from text (cf. [1, 2]). Moreover, the learnability of pattern languages is very interesting with respect to potential applications
(cf., e.g., [19]). Given this, efficiency becomes a central issue. However, defining
an appropriate measure of efficiency for learning in the limit is a difficult problem (cf. [16]). Various authors studied the efficiency of learning in terms of the
update time needed for computing a new single guess. But what really counts in
applications is the overall time needed by a learner until convergence, i.e., the
total learning time. One can show that the total learning time is unbounded in
the worst case. Thus, we study the expected total learning time. For the purpose
of motivation, we briefly summarize what has been known in this regard.
The pattern languages can be learned by outputting descriptive patterns as
hypotheses (cf. [1]). The resulting learning algorithm is consistent and set-driven.
A learner is set-driven iff its output depends only on the range of its input
(cf., e.g., [20, 22]). In general, consistency and set-drivenness considerably limit
the learning capabilities (cf., e.g., [14, 22]). But no polynomial-time algorithm
computing descriptive patterns is known, and hence already the update time is
practically infeasible. Moreover, finding a descriptive pattern of maximum possible
length is NP-complete (cf. [1]). Thus, it is unlikely that there is a polynomial-time
algorithm computing descriptive patterns of maximum possible length.
It is therefore natural to ask whether efficient pattern learners can benefit from
the concept of descriptive patterns. Special cases, e.g., regular patterns, non-cross
patterns, and unions of at most k regular pattern languages (k a priori fixed)
have been studied. In all these cases, descriptive patterns are polynomial-time
computable (cf., e.g., [19]), and thus these learners achieve polynomial update
time but nothing is known about their expected total learning time. Another
restriction is obtained by a priori bounding the number k of different variables
occurring in a pattern (k-variable patterns), but it is open whether there are
polynomial-time algorithms computing descriptive k-variable patterns for any fixed k > 1
(cf. [9, 12]). On the other hand, k-variable patterns are PAC–learnable with
respect to unions of k-variable patterns as hypothesis space (cf. [11]).
Lange and Wiehagen [13] provided a learner LWA for all pattern languages
that may output inconsistent guesses. The LWA achieves polynomial update time.
It is set-driven (cf. [21]), and even iterative, thus beating Angluin’s [1] algorithm
with respect to its space complexity. For the LWA, the expected total learning
time is exponential in the number of different variables occurring in the target pattern (cf. [21]). Moreover, the point of convergence for the LWA definitely
depends on the appearance of sufficiently many shortest strings of the target language, while for the other algorithms mentioned above at least the corresponding
correctness proofs depend on it. Thus, the following problem arises naturally.
Does there exist an efficient pattern language learner that benefits from the concept
of descriptive patterns, does not depend on the presentation of sufficiently
many shortest strings from the target language, and still achieves an
expected total learning time polynomially bounded in the expected string length?
We provide a complete affirmative answer by studying the special case of
one-variable patterns. We believe this case to be a natural choice, since it is
non-trivial (there may be exponentially many consistent patterns for a given sample),
and since it has been the first case for which a polynomial-time algorithm
computing descriptive patterns has been known (cf. [1]). Angluin's [1] algorithm for
finding descriptive patterns runs in time O(n⁴ log n) for inputs of size n and it
always outputs descriptive patterns of maximum possible length. She was aware
of possible improvements of the running time only for certain special cases, but
hoped that further study would provide insight for a uniform improvement.
We present such an improvement, i.e., an algorithm computing descriptive one-variable
patterns in O(n² log n) steps. A key idea for achieving this goal is to give
up the requirement of finding descriptive patterns of maximum possible length. Note
that all results concerning the difficulty of finding descriptive patterns depend
on the additional requirement to compute ones having maximum possible length
(cf., e.g., [1, 12]). Thus, our result may at least support the conjecture that more
efficient learners may arise when one does not insist on finding descriptive
patterns of maximum possible length but settles for descriptive ones.
Moreover, our algorithm can also be efficiently parallelized, using O(log n)
time and O(n³/log n) processors on an EREW-PRAM. Previously, no efficient
parallel algorithm for learning one-variable pattern languages was known.
Our main result is a version of the sequential algorithm, still learning all one-variable
pattern languages, that has expected total learning time O(ℓ² log ℓ) if the
sample strings are drawn from the target language according to a probability
distribution D having expected string length ℓ. D can be arbitrary except that
strings of equal length have equal probability. In particular, all shortest strings
may have probability 0. Note that the expected total learning time differs only
by a constant factor from the time needed to update actual hypotheses. On the
other hand, we could only prove that O(log ℓ) many examples are sufficient to
achieve convergence.
Finally, we deal with active learning. Now the learner gains information about
the target object by asking queries to an oracle. We show how the algorithm for
descriptive one-variable patterns can be used for learning one-variable patterns
by asking polynomially many superset queries. Another algorithm learning all
pattern languages with a polynomial number of superset queries is known, but
it uses a much more powerful query language, i.e., patterns with more than one
variable even when the target pattern is a one-variable pattern (cf. [3]).
1.1. The Pattern Languages and Inductive Inference
Let IN = {0, 1, 2, . . .} be the set of all natural numbers, and let IN+ = IN \ {0}.
For all real numbers x we define ⌊x⌋ to be the greatest integer less than or equal
to x. Let A = {0, 1, . . .} be any finite alphabet containing at least two elements.
A∗ denotes the free monoid over A. By A+ we denote the set of all finite non-null
strings over A, i.e., A+ = A∗ \ {ε}, where ε denotes the empty string. Let
X = {xi | i ∈ IN} be an infinite set of variables with A ∩ X = ∅. Patterns are
strings from (A ∪ X)+ . The length of a string s ∈ A∗ and of a pattern π is
denoted by |s| and |π|, respectively. By Pat we denote the set of all patterns.
Let π ∈ Pat, 1 ≤ i ≤ |π|; by π(i) we denote the i-th symbol in π, e.g.,
0x0111(2) = x0. If π(i) ∈ A, then π(i) is called a constant; otherwise π(i) ∈ X,
i.e., π(i) is a variable. We use s(i), i = 1, . . . , |s|, to denote the i-th symbol in
s ∈ A+ . By #var(π) we denote the number of different variables occurring in π,
and #xi (π) denotes the number of occurrences of variable xi in π. If #var(π) = k,
then π is a k-variable pattern. By Pat k we denote the set of all k-variable patterns.
In the case k = 1 we denote the variable occurring by x.
Let π ∈ Pat k and u0, . . . , uk−1 ∈ A+. π[u0/x0, . . . , uk−1/xk−1] denotes the
string s ∈ A+ obtained by substituting uj for each occurrence of xj in π. The
tuple (u0, . . . , uk−1) is called a substitution. For every pattern π ∈ Pat k we define
the language generated by π by L(π) = {π[u0/x0, . . . , uk−1/xk−1] | u0, . . . , uk−1 ∈ A+}.³
By PAT k we denote the set of all k-variable pattern languages. PAT = ⋃_{k∈IN} PAT k
denotes the set of all pattern languages over A.
Note that several problems are decidable or even efficiently solvable for PAT 1
but undecidable or NP-complete for PAT. For example, for general pattern
languages the word problem is NP-complete (cf. [1]) and the inclusion problem
is undecidable (cf. [10]), but both problems are decidable in linear time for
one-variable pattern languages. On the other hand, PAT 1 is still incomparable to
the regular and context-free languages.
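To make the linear-time word problem for one-variable patterns concrete, here is a minimal Python sketch (our own illustration; the list representation with None standing for the variable x is an assumption, not notation from [1]). The length of the substituted string is forced by |s|, the number of constants in π, and #x(π), and a single left-to-right scan verifies it.

def member(pattern, s):
    # pattern: list of symbols; constants are one-character strings and every
    # occurrence of the variable x is the marker None (non-erasing semantics).
    k = pattern.count(None)          # number of occurrences of x
    l = len(pattern) - k             # number of constant positions
    if k == 0:
        return s == ''.join(pattern)
    if len(s) <= l or (len(s) - l) % k != 0:
        return False
    m = (len(s) - l) // k            # forced length of the substituted string
    sub, pos = None, 0
    for sym in pattern:
        if sym is None:
            if sub is None:
                sub = s[pos:pos + m]
            elif s[pos:pos + m] != sub:
                return False
            pos += m
        else:
            if s[pos] != sym:
                return False
            pos += 1
    return True

For example, member(['0', None, '1', None], '0ab1ab') returns True, while member(['0', None, '1', None], '0ab1ba') returns False.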
A finite set S = {s0, s1, . . . , sr} ⊆ A+ is called a sample. A pattern π is
consistent with a sample S if S ⊆ L(π). A (one-variable) pattern π is called
descriptive for S if it is consistent with S and there is no other consistent
(one-variable) pattern τ such that L(τ) ⊂ L(π).
Next, we define the relevant learning models. We start with inductive inference
of languages from text (cf., e.g., [15, 22]). Let L be a language; every infinite
sequence t = (sj)j∈IN of strings with range(t) = {sj | j ∈ IN} = L is said to be
a text for L. Text(L) denotes the set of all texts for L. Let t be a text, and
r ∈ IN. We set tr = s0, . . . , sr, and call tr the initial segment of t of length
r + 1. By tr+ we denote the range of tr, i.e., tr+ = {sj | 0 ≤ j ≤ r}. We define an
inductive inference machine (abbr. IIM) to be an algorithmic device taking as
its inputs initial segments of a text t, and outputting on every input a pattern
as hypothesis (cf. [7]).
Definition 1. PAT is called learnable in the limit from text iff there is an
IIM M such that for every L ∈ PAT and every t ∈ Text(L),
(1) for all r ∈ IN, M (tr ) is defined,
(2) there is a π ∈ Pat such that L(π) = L and for almost all r ∈ IN, M (tr ) = π.
The learnability of the one-variable pattern languages is defined analogously
by replacing PAT and Pat by PAT 1 and Pat 1 , respectively.
When dealing with set-driven learners, it is often technically advantageous to
describe them in dependence on the relevant set tr+, i.e., a sample, obtained as
input. Let S = {s0, s1, . . . , sr}; we set n = Σ_{j=0}^{r} |sj|, and refer to n as the size
of sample S.
³ We study non-erasing substitutions. Erasing substitutions (variables may be replaced by ε) have also been considered, leading to a different language class (cf. [5]).
PAT as well as PAT 1 constitute an indexable class L of uniformly recursive
languages, i.e., there are an effective enumeration (Lj )j∈IN of L and a recursive
function f such that for all j ∈ IN and all s ∈ A∗ we have f (j, s) = 1 if s ∈ Lj ,
and f (j, s) = 0 otherwise.
Except in Section 3, where we use the PRAM-model, we assume the same
model of computation and representation of patterns as in [1]. Next we define the
update time and the total learning time. Let M be any IIM. For every L ∈ PAT
and t ∈ Text(L), let Conv(M, t) denote the stage of convergence of M on t, i.e.,
the least number m such that M(tr) = M(tm) for all r ≥ m. By TM(tr) we
denote the time to compute M(tr), and we call TM(tr) the update time of M.
The total learning time taken by the IIM M on successive input t is defined as
TT(M, t) = Σ_{r=0}^{Conv(M,t)} TM(tr).
Finally, we define learning via queries. The objects to be learned are the elements of an indexable class L. We assume an indexable class H as well as a
fixed effective enumeration (hj )j∈IN of it as hypothesis space for L. Clearly, H
must comprise L. A hypothesis h describes a target language L iff L = h. The
source of information about the target L is queries to an oracle. We distinguish
membership, equivalence, subset, superset, and disjointness queries (cf. [3]). Input
to a membership query is a string s, and the output is yes if s ∈ L and no
otherwise. For the other queries, the input is an index j and the output is yes if
L = hj (equivalence query), hj ⊆ L (subset query), L ⊆ hj (superset query), and
L ∩ hj = ∅ (disjointness query), and no otherwise. If the reply is no, a counterexample
is returned, too, i.e., a string s ∈ L △ hj (the symmetric difference of L
and hj), s ∈ hj \ L, s ∈ L \ hj, and s ∈ L ∩ hj, respectively. We
always assume that all queries are answered truthfully.
Definition 2 (Angluin [3]). Let L be any indexable class and let H be a hypothesis space for it. A learning algorithm exactly identifies a target L ∈ L with
respect to H with access to a certain type of queries if it always halts and outputs
an index j such that L = hj .
Note that the learner is allowed only one hypothesis which must be correct.
The complexity of a query learner is measured by the number of queries to be
asked in the worst-case.
2. An Improved Sequential Algorithm
We present an algorithm computing a descriptive pattern π ∈ Pat 1 for a
sample S = {s0, s1, . . . , sr−1} ⊆ A+ given as input. Without loss of generality, we
assume that s0 is the shortest string in S. Our algorithm runs in time O(n |s0| log |s0|)
and is simpler and much faster than Angluin's [1] algorithm, which needs time
O(n² |s0|² log |s0|). Angluin's [1] algorithm computes explicitly a representation
of the set of all consistent one-variable patterns for S and outputs a descriptive
pattern of maximum possible length. We avoid finding a descriptive pattern of
maximum possible length and can thus work with a polynomial-size subset of all
consistent patterns. Next, we review and establish basic properties needed later.
Lemma 1 (Angluin [1]). Let τ ∈ Pat, and let π ∈ Pat 1. Then L(τ) ⊆ L(π) if
and only if τ can be obtained from π by substituting a pattern ϱ ∈ Pat for x.
For a pattern π to be consistent with S, there must be strings α0 , . . . , αr−1 ∈
A+ such that si can be obtained from π by substituting αi for x, for all 0 ≤ i ≤
r − 1. Given a consistent pattern π, the set {α0 , . . . , αr−1 } is denoted by α(S, π).
Moreover, a sample S is called prefix-free if |S| > 1 and no string in S is a prefix
of all other strings in S. Note that the distinction between prefix-free and
non-prefix-free samples pays off.
Lemma 2. If S is prefix-free then there exists a descriptive pattern π ∈ Pat 1
for S such that at least two strings in α(S, π) start with a different symbol.
Proof. Let u denote the longest common prefix of all strings in S. The pattern
ux is consistent with S because u is shorter than every string in S, since S is
prefix-free. Consequently, there exists a descriptive pattern π ∈ Pat 1 for S with
L(π) ⊆ L(ux). Now, by Lemma 1 we know that there exists a pattern ϱ ∈ Pat 1
such that π = ux[ϱ/x]. Since u is a longest common prefix of all strings in S, we
can conclude ϱ(1) ∉ A. Hence, ϱ = xτ for some τ ∈ Pat 1 ∪ {ε}, and at least two
strings in α(S, uxτ) must start with a different symbol.   □
Let Cons(S) = {π | π ∈ Pat 1, S ⊆ L(π), ∃i, j [i ≠ j, αi, αj ∈ α(S, π), αi(1) ≠ αj(1)]}.
Cons(S) is a subset of all consistent patterns for S, and Cons(S) = ∅ if
S is not prefix-free.
Lemma 3. Let S be any prefix-free sample. Then Cons(S) 6= ∅, and every
pattern π ∈ Cons(S) of maximum length is descriptive for S.
Proof. Let S be prefix-free. According to Lemma 2 there exists a descriptive
pattern for S belonging to Cons(S); thus Cons(S) ≠ ∅. Now, suppose there is a
pattern π ∈ Cons(S) of maximum length which is not descriptive for S. Thus,
S ⊆ L(π), and, moreover, there exists a pattern τ ∈ Pat 1 such that S ⊆ L(τ) as
well as L(τ) ⊂ L(π). Hence, by Lemma 1 we know that τ can be obtained from
π by substituting a pattern ϱ for x. Since at least two strings in α(S, π) start
with a different symbol, we immediately get ϱ(1) ∈ X. Moreover, at least two
strings in α(S, τ) must also start with a different symbol. Hence τ ∈ Cons(S) and
ϱ = xν for some pattern ν. Note that |ν| ≥ 1, since otherwise τ = π[ϱ/x] = π,
contradicting L(τ) ⊂ L(π). Finally, by |ν| ≥ 1, we may conclude |τ| > |π|, a
contradiction to π having maximum length. Thus, no such pattern τ can exist,
and hence, π is descriptive.   □
Next, we explain how to handle non-prefix-free samples. The algorithm checks
whether the input sample consists of a single string s. If this happens, it outputs
s and terminates. Otherwise, it tests whether s0 is a prefix of all other strings
s1 , . . . , sr−1 . If this is the case, it outputs ux ∈ Pat 1 , where u is the prefix of s0
of length |s0 | − 1, and terminates. Clearly, S ⊆ L(ux). Suppose there is a pattern
τ such that S ⊆ L(τ), and L(τ) ⊂ L(ux). Then Lemma 1 applies, i.e., there is a
ϱ such that τ = ux[ϱ/x]. But this implies ϱ = x, since otherwise |τ| > |s0| and,
thus, S ⊈ L(τ). Consequently, ux is descriptive.
Otherwise, |S| ≥ 2 and s0 is not a prefix of all other strings in S. Thus, by
Lemma 3 it suffices to find and to output a longest pattern in Cons(S).
The improved algorithm for prefix-free samples is based on the fact that
|Cons(S)| is bounded by a small polynomial. Let k, l ∈ IN, k > 0; we call
patterns π with #x(π) = k and l occurrences of constants (k, l)-patterns. A
(k, l)-pattern has length k + l. Every pattern π ∈ Cons(S) satisfies |π| ≤ |s0|.
Thus, there can only be a (k, l)-pattern in Cons(S) if there is an m0 ∈ IN+
satisfying |s0| = km0 + l. Here m0 refers to the length of the string substituted
for the occurrences of x in the relevant (k, l)-pattern to obtain s0. Therefore,
there are at most ⌊|s0|/k⌋ possible values of l for a fixed value of k. Hence, the
number of possible (k, l)-pairs for which (k, l)-patterns in Cons(S) can exist is
bounded by Σ_{k=1}^{|s0|} ⌊|s0|/k⌋ = O(|s0| · log |s0|).
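The following small sketch (our own illustration) enumerates exactly these (k, l)-pairs; the size of its output is the bound Σ_{k=1}^{|s0|} ⌊|s0|/k⌋, e.g., 35 pairs for |s0| = 12.

def kl_pairs(n0):
    # All (k, l) with k >= 1 occurrences of x, l constants, and some m0 >= 1
    # such that n0 = k*m0 + l; only these pairs can carry a pattern of Cons(S).
    pairs = []
    for k in range(1, n0 + 1):
        for m0 in range(1, n0 // k + 1):
            pairs.append((k, n0 - k * m0))
    return pairs

# len(kl_pairs(12)) == 35, in line with the O(|s0| log |s0|) bound.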
The algorithm considers all possible (k, l)-pairs in turn. We describe the algorithm for one specific (k, l)-pair. If there is a (k, l)-pattern π ∈ Cons(S), the
lengths mi of the strings αi ∈ α(S, π) must satisfy mi = (|si | − l)/k. αi is the
substring of si of length mi starting at the first position where the input strings
differ. If (|si| − l)/k ∉ IN for some i, then there is no consistent (k, l)-pattern and
no further computation is performed for this (k, l)-pair. The following lemma
shows that the (k, l)-pattern in Cons(S) is unique, if it exists at all.
Lemma 4. Let S = {s0 , . . . , sr−1 } be any prefix-free sample. For every given
(k, l)-pair, there is at most one (k, l)-pattern in Cons(S).
The proof of Lemma 4 directly yields Algorithm 1 below. It either returns the
unique (k, l)-pattern π ∈ Cons(S) or NIL if there is no (k, l)-pattern in Cons(S).
We assume a subprocedure taking as input a sample S, and returning the longest
common prefix u.
Algorithm 1. On input (k, l), S = {s0 , . . . , sr−1 }, and u do the following:
π ← u; b ← 0; c ← |u|;
while b + c < k + l and b ≤ k and c ≤ l do
if s0 (bm0 + c + 1) = si (bmi + c + 1) for all 1 ≤ i ≤ r − 1
then π ← πs0 (bm0 + c + 1); c ← c + 1
else π ← πx; b ← b + 1 fi od;
if b = k and c = l and S ⊆ L(π) then return π else return NIL fi
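In Python, the body of Algorithm 1 might look as follows (a sketch under the same representation as the member sketch from Section 1.1, i.e., None stands for x; the helper name and the final consistency test via member are our own choices):

def candidate(S, u, k, l):
    # S: sample as a list of strings with S[0] a shortest one; u: longest common
    # prefix of S.  Returns the unique (k, l)-pattern candidate of Cons(S), or None.
    if any((len(s) - l) % k != 0 or (len(s) - l) // k < 1 for s in S):
        return None
    m = [(len(s) - l) // k for s in S]     # forced substitution lengths m_i
    pi, b, c = list(u), 0, len(u)          # b occurrences of x, c constants so far
    while b + c < k + l and b <= k and c <= l:
        j = b * m[0] + c                   # next position (0-indexed) in S[0]
        if all(s[b * mi + c] == S[0][j] for s, mi in zip(S[1:], m[1:])):
            pi.append(S[0][j]); c += 1
        else:
            pi.append(None); b += 1
    if b == k and c == l and all(member(pi, s) for s in S):
        return pi
    return None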
Note that minor modifications of Algorithm 1 perform the consistency test
S ⊆ L(π) even while π is constructed. Putting Lemma 4 and the fact that there
are O(|s0 | · log |s0 |) many possible (k, l)-pairs together, we directly obtain:
Lemma 5. |Cons(S)| = O(|s0 | log |s0 |) for every prefix-free sample S =
{s0 , . . . , sr−1 }.
Using Algorithm 1 as a subroutine, Algorithm 2 below for finding a descriptive
pattern for a prefix-free sample S follows the strategy exemplified above. Thus, it
simply computes all patterns in Cons(S) and outputs one with maximum length.
For inputs of size n the overall complexity of the algorithm is O(n |s0| log |s0|) =
O(n² log n), since at most O(|s0| log |s0|) many tests must be performed, which
have time complexity O(n) each.
Algorithm 2. On input S = {s0 , . . . , sr−1 } do the following:
P ← ∅; for k = 1, . . . , |s0| and m0 = 1, . . . , ⌊|s0|/k⌋ do
if there is a (k, |s0 | − km0 )-pattern π ∈ Cons(S) then P ← P ∪ {π} fi od;
Output a maximum-length pattern π ∈ P .
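Putting the pieces together, Algorithm 2 can be sketched in Python as follows (again our own rendering, reusing the candidate sketch above; common_prefix is a small helper we introduce here):

def common_prefix(S):
    # Longest common prefix of all strings in S.
    p = S[0]
    for s in S[1:]:
        i = 0
        while i < min(len(p), len(s)) and p[i] == s[i]:
            i += 1
        p = p[:i]
    return p

def descriptive_pattern(S):
    # For a prefix-free sample S, returns a maximum-length pattern of Cons(S),
    # which is descriptive for S by Lemma 3.
    S = sorted(S, key=len)                 # S[0] is a shortest string
    u = common_prefix(S)
    best = None
    for k in range(1, len(S[0]) + 1):
        for m0 in range(1, len(S[0]) // k + 1):
            pi = candidate(S, u, k, len(S[0]) - k * m0)
            if pi is not None and (best is None or len(pi) > len(best)):
                best = pi
    return best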
Note that the number of (k, l)-pairs to be processed is often smaller than
O(|s0 | log |s0 |), since the condition (|si | − l)/k ∈ IN for all i restricts the possible
values of k if not all strings are of equal length. It is also advantageous to
process the (k, l)-pairs in order of non-increasing k + l. Then the algorithm can
terminate as soon as it finds the first consistent pattern. However, the worst-case
complexity is not improved if the descriptive pattern is x.
Finally, we summarize the main result obtained by the following theorem.
Theorem 1. Using Algorithm 2 as a subroutine, PAT 1 can be learned in the
limit by a set-driven and consistent IIM having update time O(n² log n) on input
samples of size n.
3. An Efficient Parallel Algorithm
Whereas the RAM model has been generally accepted as the most suitable
model for developing and analyzing sequential algorithms, such a consensus has
not yet been reached in the area of parallel computing. The PRAM model, introduced
in [6], is usually considered an acceptable compromise. A PRAM consists
of a number of processors, each of which has its own local memory and can execute its local program, and all of which can communicate by exchanging data
through a shared memory. Variants of the PRAM model differ in the constraints
on simultaneous accesses to the same memory location by different processors.
The CREW-PRAM allows concurrent read accesses but no concurrent write accesses. For ease of presentation, we describe our algorithm for the CREW-PRAM
model. The algorithm can be modified to run on an EREW-PRAM, however, by
the use of standard techniques.
No parallel algorithm for computing descriptive one-variable patterns has been
known previously. Algorithm 2 can be efficiently parallelized by using well-known
techniques including prefix-sums, tree-contraction, and list-ranking as subroutines (cf. [8]). A parallel algorithm can handle non-prefix-free samples S in the
same way as Algorithm 2. Checking whether S is a singleton or s0 is a prefix of
all other strings requires time O(log n) using O(n/log n) processors. Thus, we may
assume that the input sample S = {s0 , . . . , sr−1 } is prefix-free. Additionally, we
assume the prefix-test has returned the first position d where the input strings
differ and an index t, 1 ≤ t ≤ r − 1, such that s0 (d) 6= st (d).
A parallel algorithm can handle all O(|s0 | log |s0 |) possible (k, l)-pairs in parallel. For each (k, l)-pair, our algorithm computes a unique candidate π for the
(k, l)-pattern in Cons(S), if it exists, and checks whether S ⊆ L(π). Again, it
suffices to output any obtained pattern having maximum length. Next, we show
how to efficiently parallelize these two steps.
For a given (k, l)-pair, the algorithm uses only the strings s0 and st for calculating the unique candidate π for the (k, l)-pattern in Cons(S). This reduces
the processor requirements, and a modification of Lemma 4 shows the candidate
pattern to remain unique.
Position jt in st is said to be b-corresponding to position j0 in s0 if jt =
j0 +b(mt −m0 ), 0 ≤ b ≤ k. The meaning of b-corresponding positions is as follows.
Suppose there is a consistent (k, l)-pattern π for {s0 , st } such that position j0
in s0 corresponds to π(i) ∈ A for some i, 1 ≤ i ≤ |π|, and that b occurrences of
x are to the left of π(i). Then π(i) corresponds to position jt = j0 + b(mt − m0 )
in st .
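A small sequential sketch (ours) of this correspondence; it records exactly the comparisons collected in the Boolean array EQUAL defined in the next paragraph:

def equal_table(s0, st, k, l):
    # table[j][b] records whether position j of s0 (1-indexed) carries the same
    # symbol as its b-corresponding position jt = j + b*(mt - m0) in st.
    m0 = (len(s0) - l) // k
    mt = (len(st) - l) // k
    table = [[False] * (k + 1) for _ in range(len(s0) + 1)]
    for j in range(1, len(s0) + 1):
        for b in range(k + 1):
            jt = j + b * (mt - m0)
            table[j][b] = 1 <= jt <= len(st) and s0[j - 1] == st[jt - 1]
    return table

In the parallel setting the O(k|s0|) entries are filled independently of each other, which is what allows O(k|s0|/log n) processors to compute the array in time O(log n).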
For computing the candidate pattern from s0 and st , the algorithm calculates
the entries of an array EQUAL[j, b] of Boolean values first, where j ranges from
1 to |s0 | and b from 0 to k. EQUAL[j, b] is true iff the symbol in position j in s0
is the same as the symbol in its b-corresponding position in st . Thus, the array
is defined as follows: EQUAL[j, b] = true iff s0 (j) = st (j + b(mt − m0 )). The
array EQUAL has O(k|s0 |) entries each of which can be calculated in constant
time. Thus, using O(k|s0 |/ log n) processors, EQUAL can be computed in time
O(log n). Moreover, a directed graph G, which is a forest of binary in-trees, can be
built from EQUAL, and the candidate pattern can be calculated from G using
tree-contraction, prefix-sums and list-ranking. The details are omitted due to
space restrictions. Thus, we can prove:
Lemma 6. Let S = {s0 , . . . , sr−1 } be a sample, and n its size. Given the
array EQUAL, the unique candidate π for the (k, l)-pattern in Cons(S), or NIL,
if no such pattern exists, can be computed on an EREW-PRAM in time O(log n)
using O(k|s0 |/ log n) processors.
Now, the algorithm has either discovered that no (k, l)-pattern exists, or it has
obtained a candidate (k, l)-pattern π. In the latter case, it has to test whether
π is consistent with S.
Lemma 7. Given a candidate pattern π, the consistency of π with any sample S of size n can be checked in time O(log n) using O(n/log n) processors
on a CREW-PRAM.
Putting it all together, we obtain the following theorem.
Theorem 2. There exists a parallel algorithm computing descriptive one-variable
patterns in time O(log n) using O(|s0| max{|s0|², n log |s0|}/log n) =
O(n³/log n) processors on a CREW-PRAM for samples S = {s0, . . . , sr−1} of
size n.
Note that the product of the time and the number of processors of our algorithm is the same as the time spent by the improved sequential algorithm above
whenever |s0|² = O(n log |s0|). If |s0| is larger, the product exceeds the time of
the sequential algorithm by a factor of at most O(|s0|²/n) = O(|s0|).
4. Analyzing the Expected Total Learning Time
Now, we are dealing with the major result of the present paper, i.e., with
the expected total learning time of our sequential learner. Let π be the target
pattern. The total learning time of any algorithm trying to infer π is unbounded
in the worst case, since there are infinitely many strings in L(π) that can mislead
it. However, in the best case two examples, i.e., the two shortest strings π[0/x]
and π[1/x], always suffice for a learner outputting descriptive patterns as guesses.
Hence, we assume that the strings presented to the algorithm are drawn from
L(π) according to a certain probability distribution D and compute the expected
total learning time of our algorithm. The distribution D must satisfy two criteria:
any two strings in L(π) of equal length must have equal probability, and the
expected string length ℓ must be finite. We refer to such distributions as proper
probability distributions.
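One concrete proper probability distribution (our own example, purely for illustration): draw the length of the substituted string from a geometric distribution and then choose the substitution uniformly among all strings of that length. Since the map α ↦ π[α/x] is injective and all strings of L(π) of a given length come from substitutions of the same length, strings of equal length receive equal probability, and the expected string length is finite.

import random

def draw_example(pi, alphabet="01", p=0.25):
    # pi: one-variable pattern as a list, None marking the occurrences of x.
    m = 1
    while random.random() >= p:        # m - 1 ~ Geometric(p), so E(m) is finite
        m += 1
    alpha = ''.join(random.choice(alphabet) for _ in range(m))
    return ''.join(alpha if sym is None else sym for sym in pi)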
We design an algorithm 1LA inferring a pattern π ∈ Pat 1 with expected total
learning time O(ℓ² log ℓ). It is advantageous not to calculate a descriptive pattern
each time a new string is read. Instead, Algorithm 1LA reads a certain number
of strings before it starts to perform any computations at all. It waits until the
length of a sample string is smaller than the number of sample strings read so
far and until at least two different sample strings have been read. During these
first two phases, it outputs s1 , the first sample string, as its guess if all sample
strings read so far are the same, and x otherwise. If π is a constant pattern,
i.e., |L(π)| = 1, the correct hypothesis is always output and the algorithm never
reaches the third phase. Otherwise, the algorithm uses a modified version of
Algorithm 2 to calculate a set P′ of candidate patterns when it enters Phase 3.
More precisely, it does not calculate the whole set P′ at once. Instead, it uses
the function first cand once to obtain a longest pattern in P′, and the function
next cand repeatedly to obtain the remaining patterns of P′ in order of non-increasing
length. This substantially reduces the memory requirements.
The pattern τ obtained from calling first cand is used as the current candidate.
Each new string s is then compared to τ. If s ∈ L(τ), τ is output. Otherwise,
next cand is called to obtain a new candidate τ′. Now, τ′ is the current candidate
and is output, independently of whether s ∈ L(τ′). If the longest common prefix of
all sample strings including s is shorter than that of all sample strings excluding
s, however, first cand is called again and a new list of candidate patterns is
considered. Thus, Algorithm 1LA may output inconsistent hypotheses.
Algorithm 1LA is shown in Figure 1. Let S = {s, s′}, and let P′ = P′(S) be
defined as follows. If s = vc for some c ∈ A and s′ = sw for some string w,
then P′ = {vx}. Otherwise, denote by u the longest common prefix of s and
s′, and let P′ be the set of patterns computed by Algorithms 1 and 2 above if
we omit the consistency check. Hence, P′ ⊇ Cons(S), where Cons(S) is defined
as in Section 2. P′ necessarily contains the pattern π if s, s′ ∈ L(π) and if the
longest common prefix of s and s′ is the same as the longest constant prefix of π.
Assuming P′ = {π1, . . . , πt}, |πi| ≥ |πi+1| for 1 ≤ i < t, first cand(s, s′) returns
π1, and next cand(s, s′, πi) returns πi+1. Since we omit the consistency checks,
a call to first cand and all subsequent calls to next cand until either the correct
pattern is found or the prefix changes can be performed in time O(|s|² log |s|).
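A rough sketch of the candidate list behind first cand and next cand (our own rendering, reusing the helpers from Section 2 and assuming s is a shortest sample string, as in Algorithm 1LA; unlike the real algorithm, which produces the patterns lazily to save memory, this sketch builds the whole list, and it keeps the consistency check of the candidate routine, which only shrinks the list and never removes π):

def candidate_list(s, s_prime):
    # Returns a list ordered by non-increasing length; first cand corresponds to
    # its head, next cand to stepping to the successor of the current element.
    if s_prime.startswith(s):                    # s is a proper prefix of s'
        return [list(s[:-1]) + [None]]
    u = common_prefix([s, s_prime])
    P = []
    for k in range(1, len(s) + 1):
        for m0 in range(1, len(s) // k + 1):
            pi = candidate([s, s_prime], u, k, len(s) - k * m0)
            if pi is not None:
                P.append(pi)
    return sorted(P, key=len, reverse=True)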
We will show that Algorithm 1LA correctly infers one-variable pattern languages from text in the limit, and that it correctly infers one-variable pattern
languages from text with probability 1 if the sample strings are drawn from L(π)
according to a proper probability distribution.⁴
Theorem 3. Let π be an arbitrary one-variable pattern. Algorithm 1LA correctly infers π from text in the limit.
⁴ Note that learning a language L in the limit and learning L from strings that are
drawn from L according to a proper probability distribution are not the same.
{ Phase 1 }
r ← 0;
repeat r ← r + 1; read string sr ;
if s1 = s2 = · · · = sr then output hypothesis s1 else output hypothesis x fi
until |sr | < r;
{ Phase 2 }
while s1 = s2 = · · · = sr do r ← r + 1; read string sr ;
if sr = s1 then output hypothesis s1 else output hypothesis x fi
od;
{ Phase 3 }
s ← a shortest string in {s1 , s2 , . . . , sr };
u ← maximum length common prefix of {s1 , s2 , . . . , sr };
if u = s then s′ ← a string in {s1, s2, . . . , sr} that is longer than s
else s′ ← a string in {s1, s2, . . . , sr} that differs from s in position |u| + 1 fi;
τ ← first cand(s, s′);
forever do
read string s″;
if u is not a prefix of s″ then
u ← maximum length common prefix of s and s″;
s′ ← s″;
τ ← first cand(s, s′)
else if s″ ∉ L(τ) then τ ← next cand(s, s′, τ) fi;
output hypothesis τ ;
od
Fig. 1. Algorithm 1LA
Theorem 4. Let π ∈ Pat 1 . If sample strings are drawn from L(π) according
to a proper probability distribution, Algorithm 1LA learns π with probability 1.
Proof. If π ∈ A+ then Algorithm 1LA outputs π after reading the first string
and converges. Otherwise, let π = uxµ, where u ∈ A∗ is a string of d−1 constant
symbols and µ ∈ Pat 1 ∪{ε}. After Algorithm 1LA has read two strings that differ
in position d, pattern π will be one of the candidates in the set P 0 implicitly
maintained by the algorithm. As each string sr, r > 1, satisfies s1(d) ≠ sr(d)
with probability (|A| − 1)/|A| ≥ 1/2, this event must happen with probability 1.
After that, as long as the current candidate τ differs from π, the probability that
the next string read does not belong to L(τ ) is at least 1/2 (cf. Lemma 8 below).
Hence, all candidate patterns will be discarded with probability 1 until π is the
current candidate and is output. After that, the algorithm converges.
□
Lemma 8. Let π = uxµ be a one-variable pattern with constant prefix u, and
µ ∈ Pat 1 ∪ {ε}. Let s0 , s1 ∈ L(π) be arbitrary such that s0 (|u| + 1) 6= s1 (|u| + 1).
Let τ ≠ π be a pattern from P′({s0, s1}) with |τ| ≥ |π|. Then τ fails to generate
a string s drawn from L(π) according to a proper probability distribution with
probability at least (|A| − 1)/|A|.
Now, we analyze the expected total learning time of Algorithm 1LA. Obviously, the expected total learning time is O(ℓ) if the target pattern π ∈ A+.
Hence, we assume in the following that π contains at least one occurrence of x.
Next, we recall the definition of the median, and establish a basic property of
it that is used later. By E(R) we denote the expectation of a random variable R.
Definition 3. Let R be any random variable with range(R) ⊆ IN. The median
of R is the number µ ∈ range(R) such that Pr(R < µ) ≤ 1/2 and Pr(R > µ) < 1/2.
Proposition 1. Let R be a random variable with range(R) ⊆ IN+ . Then its
median µ satisfies µ ≤ 2E(R).
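A short justification of Proposition 1 (our addition): by the definition of the median, Pr(R ≥ µ) = 1 − Pr(R < µ) ≥ 1/2, while Markov's inequality gives Pr(R ≥ µ) ≤ E(R)/µ; combining both bounds yields µ ≤ 2E(R).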
Lemma 9. Let D be any proper probability distribution, let L be the random
variable taking as values the length of a string drawn from L(π) with respect to
D, let µ be the median of L, and let ℓ be its expectation. Then, the expected
number of steps performed by Algorithm 1LA during Phase 1 is O(µℓ).
Proof. Let Li be the random variable whose value is the length of the i-th string
read by the algorithm. Clearly, the distribution of Li is the same as that of L. Let
R be the random variable whose value is the number of strings Algorithm 1LA
reads in Phase 1. Let LΣ := L1 + · · · + LR be the number of symbols read by
Algorithm 1LA during Phase 1. Let W1 be the random variable whose value is
the time spent by Algorithm 1LA in Phase 1. Obviously, W1 = O(LΣ ).
Claim 1. For i < r, we have

    E(Li | R = r) ≤ E(L) / Pr(Li ≥ i).                                  (1)

As Li must be at least i provided R > i, Equation (1) can be proved as follows:

    E(Li | R = r) = Σ_{k=i}^{∞} k · Pr(Li = k | R = r)
                  = Σ_{k=i}^{∞} k · Pr(Li = k ∧ R = r) / Pr(R = r)
                  = Σ_{k=i}^{∞} k · [Pr(L1 ≥ 1) · · · Pr(Li = k) · · · Pr(Lr−1 ≥ r − 1) · Pr(Lr < r)]
                                  / [Pr(L1 ≥ 1) · · · Pr(Li ≥ i) · · · Pr(Lr−1 ≥ r − 1) · Pr(Lr < r)]
                  = Σ_{k=i}^{∞} k · Pr(Li = k) / Pr(Li ≥ i)
                  ≤ E(Li) / Pr(Li ≥ i) = E(L) / Pr(L ≥ i).

Similarly, it can be shown that

    E(Lr | R = r) ≤ E(L) / Pr(Lr < r).                                  (2)

Furthermore, it is clear that

    E(Lr | R = r) ≤ r − 1.                                              (3)

Now, we rewrite E(LΣ) as E(LΣ) = (α) + (β), where

    (α) = Σ_{r=1}^{µ} E(LΣ | R = r) Pr(R = r)   and   (β) = Σ_{r>µ} E(LΣ | R = r) Pr(R = r).

Using Pr(Li ≥ i) ≥ 1/2 for i ≤ µ as well as Equations (1) and (3), we obtain:

    (α) = Σ_{r=1}^{µ} ( E(L1 | R = r) + · · · + E(Lr | R = r) ) Pr(R = r)
        = Σ_{r=1}^{µ} ( Σ_{i=1}^{r−1} E(Li | R = r) + E(Lr | R = r) ) Pr(R = r)
        ≤ Σ_{r=1}^{µ} ( Σ_{i=1}^{r−1} E(L)/Pr(L ≥ i) + (r − 1) ) Pr(R = r)
        ≤ Σ_{r=1}^{µ} ( (r − 1) · 2E(L) + (r − 1) ) Pr(R = r) ≤ (µ − 1)(2E(L) + 1) = O(µℓ).

For (β), we use Equations (1) and (2) to obtain:

    (β) = Σ_{r=µ+1}^{∞} ( E(L1 | R = r) + · · · + E(Lr | R = r) ) Pr(R = r)
        ≤ Σ_{r=µ+1}^{∞} ( E(L)/Pr(L1 ≥ 1) + · · · + E(L)/Pr(Lr−1 ≥ r − 1) + E(L)/Pr(Lr < r) ) Pr(R = r)
        ≤ Σ_{r=µ+1}^{∞} ( (r − 1)E(L)/Pr(Lr−1 ≥ r − 1) + E(L)/Pr(Lr < r) ) · Π_{i=1}^{r−1} Pr(Li ≥ i) · Pr(Lr < r)
        ≤ Σ_{r=µ+1}^{∞} r E(L) Pr(L1 ≥ 1) · · · Pr(Lr−2 ≥ r − 2)
        ≤ Σ_{r=µ+1}^{∞} r E(L) (1/2)^{r−2−µ}        (using Pr(Li ≥ i) ≤ 1/2 for i > µ)
        = E(L) Σ_{r=0}^{∞} (r + µ + 1)(1/2)^{r−1}
        = E(L) ( Σ_{r=0}^{∞} r (1/2)^{r−1} + (µ + 1) Σ_{r=0}^{∞} (1/2)^{r−1} )
        = E(L)(4 + 4(µ + 1)) = E(L)(4µ + 8) = O(µℓ).                     □
Now, under the same assumptions as in Lemma 9, we can estimate the expected number of steps performed in Phase 2 as follows.
Lemma 10. During Phase 2, the expected number of steps performed by Algorithm 1LA is O(ℓ).
Finally, we deal with Phase 3. Again, let L be as in Lemma 9. Then, the
average amount of time spent in Phase 3 can be estimated as follows.
Lemma 11. During Phase 3, the expected number of steps performed in calls
to the functions first cand and next cand is O(µ² log µ).
Lemma 12. During Phase 3, the expected number of steps performed in reading strings is O(µℓ log µ).
Proof. Denote by W3r the number of steps performed while reading strings in
Phase 3. We make a distinction between strings read before the correct set of
candidate patterns is considered, and strings read afterwards until the end of
Phase 3. The former are accounted for by random variable V1, the latter by V2.
If the correct set of candidate patterns, i.e., the set containing π, is not yet
considered, the probability that a new string does not force the correct set of
candidate patterns to be considered is at most 1/2. Denote by K the random
variable whose value is the number of strings that are read in Phase 3 before the
correct set of candidate patterns is considered. We have:
    E(V1) = Σ_{k=0}^{∞} E(V1 | K = k) Pr(K = k) ≤ Σ_{k=0}^{∞} k E(L) (1/2)^k = O(E(L)).
Assume that the correct set of candidate patterns P′ contains M patterns that
are considered before pattern π. For any such pattern τ, the probability that a
string drawn from L(π) according to a proper probability distribution is in the
language of τ is at most 1/2, because either τ has an additional variable or τ has
an additional constant symbol (Lemma 8). Denote by V2i the steps performed
for reading strings while the i-th pattern in P′ is considered.
    E(V2 | M = m) = Σ_{i=1}^{m} E(V2i | M = m) ≤ Σ_{i=1}^{m} Σ_{k=0}^{∞} k E(L) (1/2)^k = O(m E(L)).
Since M = O(R log R), we obtain:
    E(V2) = Σ_{r=1}^{∞} E(V2 | R = r) Pr(R = r)
          = Σ_{r=1}^{µ} E(V2 | R = r) Pr(R = r) + Σ_{r=µ+1}^{∞} E(V2 | R = r) Pr(R = r),

where the first sum is O(E(L) µ log µ) and the second sum, denoted (α), satisfies

    (α) ≤ Σ_{r=µ+1}^{∞} O(E(L) r log r) (1/2)^{r−µ}
        = E(L) Σ_{r=1}^{∞} O((r + µ) log(r + µ)) (1/2)^{r} = O(µ E(L) log µ).

Hence, we have E(W3r) = E(V1) + E(V2) = O(µℓ log µ).                      □
Lemma 13. During Phase 3, the expected number of steps performed in checking whether the current candidate pattern generates a newly read sample string
is O(µℓ log µ).
Putting it all together, we arrive at the following expected total learning time
required by Algorithm 1LA.
Theorem 5. If the sample strings are drawn from L(π) according to a proper
probability distribution with expected string length ℓ, the expected total learning
time of Algorithm 1LA is O(ℓ² log ℓ).
5. Learning with Superset Queries
PAT is not learnable with polynomially many queries if only equivalence,
membership, and subset queries are allowed provided range(H) = PAT (cf. [3]).
This result may be easily extended to PAT 1 . However, positive results are also
known. First, PAT is exactly learnable using polynomially many disjointness
queries with respect to the hypothesis space PAT ∪ FIN , where FIN is the set of
all finite languages (cf. [13]). The proof technique easily extends to PAT 1 , too.
Second, Angluin [3] established an algorithm exactly learning PAT with respect
to PAT by asking O(|π|² + |π||A|) many superset queries. However, it requires
choosing general patterns τ for asking the queries, and does definitely not work
if the hypothesis space is PAT 1 . Hence, it is natural to ask:
Does there exist a superset query algorithm learning PAT 1 with respect to
PAT 1 that uses only polynomially many superset queries?
Using the results of previous sections, we are able to answer this question affirmatively. Nevertheless, whereas PAT can be learned with respect to PAT by
restricted superset queries, i.e., superset queries not returning counterexamples,
our query algorithm needs counterexamples. Interestingly, it does not need a
counterexample for every query answered negatively; instead, two counterexamples
always suffice. The next theorem shows that one-variable patterns are not
learnable by a polynomial number of restricted superset queries.
Theorem 6. Any algorithm exactly identifying all L ∈ PAT 1 generated by
a pattern π of length n with respect to PAT 1 by using only restricted superset
queries and restricted equivalence queries must make at least |A|^(n−2) ≥ 2^(n−2)
queries in the worst case.
Furthermore, we can show that learning PAT 1 with a polynomial number of
superset queries is impossible if the algorithm may ask for a single counterexample only.
Theorem 7. Any algorithm that exactly identifies all one-variable pattern
languages by restricted superset queries and one unrestricted superset query needs
at least 2^((k−1)/4) − 1 queries in the worst case, where k is the length of the
counterexample returned.
The new algorithm QL works as follows (see Figure 2). Assume the algorithm should learn some pattern π. First QL asks whether L(π) ⊆ L(0) = {0}
holds. This is the case iff π = 0, and if the answer is yes QL knows the
right result. Otherwise, QL obtains a counterexample C(0) ∈ L(π). By asking
L(π) ⊆ L(C(0)≤j x) for j = 1, 2, 3, . . . until the answer is no, QL computes
i = min{ j | L(π) ⊈ L(C(0)≤j x) }. Now we know that π starts with C(0)≤i−1,
but what about the i-th position of π? If |π| = i − 1, then π ∈ A+ and therefore
π = C(0). QL asks L(π) ⊆ L(C(0)) to determine if this is the case. If
L(π) ⊈ L(C(0)), then #x(π) ≥ 1 and π(i) ∉ A, since π(i) ∈ A would imply that
π(i) = C(0)(i) and, thus, L(π) ⊆ L(C(0)≤i x), a contradiction. Hence, π(i) = x.
Now QL uses the counterexample for the query L(π) ⊆ L(C(0)≤i x) to construct
a set S = {C(0), C(C(0)≤i x)}. By construction, the two counterexamples differ
in their i-th position, but coincide in their first i − 1 positions.
if L(π) ⊆ L(0) then τ ← 0
else i ← 1;
while L(π) ⊆ L(C(0)≤i x) do i ← i + 1 od;
if L(π) ⊆ L(C(0)) then τ ← C(0)
else S ← {C(0), C(C(0)≤i x)}; R ← Cons(S);
repeat τ ← max(R); R ← R \ {τ } until L(π) ⊆ L(τ )
fi
fi; return τ
Fig. 2. Algorithm QL. This algorithm learns a pattern π by superset queries. The
queries have the form “L(π) ⊆ L(τ ),” where τ ∈ Pat 1 is chosen by the algorithm. If the
answer to a query L(π) ⊆ L(τ ) is no, the algorithm can ask for a counterexample C(τ ).
By w≤i we denote the prefix of w of length i and by max(R) some maximum–length
element of R.
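For illustration, here is a Python sketch of QL against an assumed oracle interface (superset_query and counterexample are hypothetical callbacks standing for the oracle of the figure; patterns use the list representation with None for x, and candidate and common_prefix are the sketches from Section 2):

def all_candidates(S):
    # Cons(S) as enumerated by Algorithm 2, collected instead of maximized.
    S = sorted(S, key=len)
    u = common_prefix(S)
    out = []
    for k in range(1, len(S[0]) + 1):
        for m0 in range(1, len(S[0]) // k + 1):
            pi = candidate(S, u, k, len(S[0]) - k * m0)
            if pi is not None:
                out.append(pi)
    return out

def QL(superset_query, counterexample, alphabet="01"):
    zero = [alphabet[0]]
    if superset_query(zero):                     # L(pi) ⊆ L(0) iff pi = 0
        return zero
    c0 = counterexample(zero)                    # some string of L(pi)
    i = 1
    while superset_query(list(c0[:i]) + [None]): # i = min{ j : answer is no }
        i += 1
    if superset_query(list(c0)):                 # pi is a constant pattern
        return list(c0)
    c1 = counterexample(list(c0[:i]) + [None])   # second counterexample, differs at position i
    R = all_candidates([c0, c1])
    R.sort(key=len, reverse=True)
    for tau in R:                                # longest pattern with a yes answer
        if superset_query(tau):
            return tau

The final loop implements the repeat-until of Figure 2; by Lemma 14 the first maximum-length pattern receiving a yes answer is π itself.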
Algorithm 2 in Section 2 computes R = Cons(S). Since S ⊆ L(π) and since π
coincides with S in the first i − 1 positions, π ∈ R. Again we narrowed the search
for π to a set R of candidates. Let m be the length of the shortest counterexample
in S. Then |R| = O(m log m) by Lemma 5. Now, the only task left is to find π
among all patterns in R. We find π by removing other patterns from R by using
the following lemma.
Lemma 14. Let S ⊆ A+ and π, τ ∈ Cons(S), |π| ≤ |τ |. Then L(π) ⊆ L(τ )
implies π = τ .
QL tests L(π) ⊆ L(τ) for a maximum-length pattern τ ∈ R and removes τ
from R if L(π) ⊈ L(τ). Iterating this process finally yields the longest pattern τ
for which L(π) ⊆ L(τ ). Lemma 14 guarantees τ = π. Thus, we have:
Theorem 8. Algorithm QL learns PAT 1 with respect to PAT 1 by asking only
superset queries. The query complexity of QL is O(|π|+m log m) many restricted
superset queries plus two superset queries (these are the first two queries answered no) for every language L(π) ∈ PAT 1 , where m is the length of the shortest
counterexample returned.
Thus, the query complexity O(|π|+m log m) of our learner compares well with
that of Angluin’s [3] learner (which is O(|π||A|) when restricted to learn PAT 1 )
using the much more powerful hypothesis space PAT as long as the length of
the shortest counterexample returned is not too large.
Acknowledgements
A substantial part of this work has been done while the second author was visiting Kyushu University. This visit has been supported by the Japanese Society
for the Promotion of Science under Grant No. 106011.
The fifth author gratefully acknowledges the support by the Grant-in-Aid for
Scientific Research (C) from the Japan Ministry of Education, Science, Sports,
and Culture under Grant No. 07680403.
References
1. D. Angluin. Finding patterns common to a set of strings. Journal of Computer
and System Sciences, 21:46–62, 1980.
2. D. Angluin. Inductive inference of formal languages from positive data. Information and Control, 45:117–135, 1980.
3. D. Angluin. Queries and concept learning. Machine Learning, 2:319–342, 1988.
4. T. Erlebach, P. Rossmanith, H. Stadtherr, A. Steger, and T. Zeugmann. Efficient
Learning of One-Variable Pattern Languages from Positive Data. DOI-TR-128, Department of Informatics, Kyushu University, Dec. 12, 1996; http://www.i.kyushu-u.ac.jp/∼thomas/treport.html.
5. G. Filé. The relation of two patterns with comparable languages. In Proc. 5th
Ann. Symp. Theoretical Aspects of Computer Science, LNCS 294, pp. 184–192,
Berlin, 1988. Springer-Verlag.
6. S. Fortune and J. Wyllie. Parallelism in random access machines. In Proc. 10th
Ann. ACM Symp. Theory of Computing, pp. 114–118, New York, 1978, ACM Press.
7. E. M. Gold. Language identification in the limit. Inf. Control, 10:447–474, 1967.
8. J. JáJá. An introduction to parallel algorithms. Addison-Wesley, 1992.
9. K. P. Jantke. Polynomial time inference of general pattern languages. In
Proc. Symp. Theoretical Aspects of Computer Science, LNCS 166, pp. 314–325,
Berlin, 1984. Springer-Verlag.
10. T. Jiang, A. Salomaa, K. Salomaa, and S. Yu. Inclusion is undecidable for pattern
languages. In Proc. 20th Int. Colloquium on Automata, Languages and Programming, LNCS 700, pp. 301–312, Berlin, 1993. Springer-Verlag.
11. M. Kearns and L. Pitt. A polynomial-time algorithm for learning k–variable pattern languages from examples. In Proc. 2nd Ann. ACM Workshop on Computational Learning Theory, pp. 57–71, Morgan Kaufmann Publ., San Mateo, 1989.
12. K.-I. Ko and C.-M. Hua. A note on the two-variable pattern-finding problem.
J. Comp. Syst. Sci., 34:75–86, 1987.
13. S. Lange and R. Wiehagen. Polynomial-time inference of arbitrary pattern languages. New Generation Computing, 8:361–370, 1991.
14. S. Lange and T. Zeugmann. Set-driven and rearrangement-independent learning
of recursive languages. Mathematical Systems Theory, 29:599–634, 1996.
15. D. Osherson, M. Stob and S. Weinstein. Systems that learn: An introduction to
learning theory for cognitive and computer scientists. MIT Press, 1986
16. L. Pitt. Inductive inference, DFAs and computational complexity. In Proc. Analogical and Inductive Inference, LNAI 397, pp. 18–44, Berlin, 1989, Springer-Verlag.
17. A. Salomaa. Patterns. EATCS Bulletin 54:46–62, 1994.
18. A. Salomaa. Return to patterns. EATCS Bulletin 55:144–157, 1994.
19. T. Shinohara and S. Arikawa. Pattern inference. In Algorithmic Learning for
Knowledge-Based Systems, LNAI 961, pp. 259–291, Berlin, 1995. Springer-Verlag.
20. K. Wexler and P. Culicover. Formal Principles of Language Acquisition. MIT
Press, Cambridge, MA, 1980.
21. T. Zeugmann. Lange and Wiehagen’s pattern language learning algorithm: An
average-case analysis with respect to its total learning time. Annals of Mathematics
and Artificial Intelligence, 1997, to appear.
22. T. Zeugmann and S. Lange. A guided tour across the boundaries of learning recursive languages. In Algorithmic Learning for Knowledge-Based Systems, LNAI
961, pp. 190–258, Berlin, 1995. Springer-Verlag.