
20th of January 2009
ICAART’09, Oporto
Parsing Tree Adjoining Grammars Using
Evolutionary Algorithms
Adrian Horia Dediu and Cătălin Ionuţ Tîrnăucă
Research Group on Mathematical Linguistics, Rovira i Virgili University
Pl. Imperial Tàrraco 1, 43005, Tarragona, Spain
E-mail: [email protected]
[email protected]
Outline
• Introduction
• Evolutionary Algorithms
• Tree Adjoining Grammars
• Grammatical Evolution
• Parsing TAGs Using EAs: EATAGP
• Future Work
Introduction (I)
• Natural Computing → Evolutionary Algorithms: stochastically
solve high-dimensional search problems by mimicking natural
principles (select the fittest individual from a population).
• Several branches developed:
- genetic algorithms
- evolutionary programming
- evolutionary strategies
sharing common principles and components:
- search space of individuals
- fitness function
- operators to produce offspring.
• Problem: lack of a mathematical framework (proofs) => other bio-inspired
models simulate their behavior: eco-grammar systems, (possibly?) NEPs.
Introduction (II)
• One application: automatic program generation (LISP).
• New approach: grammatical evolution => parse trees of CFG to
automatically evolve computer programs in arbitrary languages.
• Analyzing (parsing) long sentences is a very difficult task for a
computer program.
• What about using the power of EAs (GEs) (reduced
complexity) to parse very long sentences of TAGs (high
complexity)?
• The algorithm EATAGP proved to be a solution at least in the tests
we performed.
EAs: Components
EA = (I, f, Ω, μ, λ, s, StopCondition)
• I : individuals forming a population;
• f : I → F: fitness function associated to individuals (F is the set of fitness values);
• Ω: set of genetic operators (mutation, crossover) which applied to
individuals of one generation (parents) produce new individuals
(offspring);
• μ: number of parents;
• λ: number of offspring;
• s: I^μ × I^λ → I^μ: selection operator for producing the next
generation of parents from parents and offspring;
• StopCondition: F^μ × Nat → {T, F}: stop criterion (“Stop when a
good enough value was reached by an individual fitness function”,
“Stop after a certain number of generations”).
EA: How does it work?
1. Randomly generate an initial population.
2. Evaluate each parent using the fitness function.
3. If the StopCondition applied to the current generation is true, then
STOP. Otherwise, go to step 4.
4. Apply the genetic operators to obtain offspring.
5. Evaluate each offspring using the fitness function.
6. Use the selection operator to obtain the next generation by
replacing the worst individuals with the genetically modified
offspring of the best individuals.
7. Go to step 2.
Note that a gene is a physical and functional unit of heredity that
carries information from one generation to the next.
Genetic coding is the sequence of chosen genes (how the genetic
material is encoded in some type of information).
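The seven steps above can be sketched in a few lines. This is only an illustrative loop, not the implementation behind the talk: the function name `evolve`, the one-point crossover, the single-gene mutation, and the "keep the best μ of parents plus offspring" selection are all assumptions chosen for brevity. The default sizes (15 individuals, 20 genes, values 0 to 255) follow the test setup reported later in the slides.

```python
import random

def evolve(fitness, n_genes=20, mu=15, lam=15, max_generations=200,
           target=None, seed=0):
    """Illustrative (mu + lam) evolutionary loop over integer genomes."""
    rng = random.Random(seed)
    # Step 1: randomly generate the initial population of mu parents.
    parents = [[rng.randrange(256) for _ in range(n_genes)] for _ in range(mu)]
    for generation in range(max_generations):
        # Step 2: evaluate the parents (best individual first).
        parents.sort(key=fitness, reverse=True)
        # Step 3: StopCondition, here "a good enough value was reached".
        if target is not None and fitness(parents[0]) >= target:
            return parents[0], generation
        # Step 4: genetic operators (assumed: crossover, then mutation).
        offspring = []
        for _ in range(lam):
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n_genes)
            child = a[:cut] + b[cut:]                          # one-point crossover
            child[rng.randrange(n_genes)] = rng.randrange(256)  # point mutation
            offspring.append(child)
        # Steps 5-6: evaluate offspring, keep the best mu of parents + offspring.
        parents = sorted(parents + offspring, key=fitness, reverse=True)[:mu]
        # Step 7: continue with the next generation.
    return parents[0], max_generations

# Toy usage: prefer genomes whose genes are small numbers.
best, generations = evolve(lambda genes: -sum(genes), max_generations=50)
print(len(best), generations)
```

The second stop criterion from the component list ("stop after a certain number of generations") corresponds to `max_generations` here.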
TAGs: Basic Notions
T = (X, N, I, A, S)
• X: terminal alphabet
• N: nonterminals
• I: set of initial trees (frontier: terminals + nonterminals marked with ↓)
• A: set of auxiliary trees (frontier: terminals + nonterminals marked with ↓, plus a foot node A* with the same label as the root A)
• S: start symbol (from N)
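Two operations: substitution and adjunction. Derived tree: a tree built
from two other trees by applying one of these operations.
[Figure: substitution at a node labeled B; adjunction of an auxiliary tree (root A, foot node A*) at a node labeled A]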
Tree set: trees derived from S-rooted initial trees (no substitution
nodes left)
Language generated by a TAG:
yields of all trees in the tree set
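To make the two operations concrete, here is a minimal sketch on a hand-made toy grammar. Everything in it is an assumption for illustration: the `Node` class, the function names, and the three trees `alpha`, `gamma`, and `beta` are hypothetical and are not taken from the talk.

```python
import copy
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Node:
    label: str
    children: List["Node"] = field(default_factory=list)
    subst: bool = False  # leaf marked for substitution
    foot: bool = False   # foot node of an auxiliary tree (A*)

def tree_yield(node: Node) -> List[str]:
    """Left-to-right sequence of leaf labels."""
    if not node.children:
        return [node.label]
    return [s for child in node.children for s in tree_yield(child)]

def find_foot(node: Node) -> Optional[Node]:
    if node.foot:
        return node
    for child in node.children:
        found = find_foot(child)
        if found is not None:
            return found
    return None

def substitute(leaf: Node, initial: Node) -> None:
    """Replace a substitution leaf by a copy of an initial tree with the same root label."""
    assert leaf.subst and not leaf.children and leaf.label == initial.label
    leaf.children = copy.deepcopy(initial).children
    leaf.subst = False

def adjoin(node: Node, aux: Node) -> None:
    """Insert a copy of an auxiliary tree at node; the old subtree hangs under the foot."""
    assert node.label == aux.label
    new = copy.deepcopy(aux)
    foot = find_foot(new)
    assert foot is not None and foot.label == node.label
    foot.children, foot.foot = node.children, False
    node.children = new.children

# Hypothetical toy grammar: alpha and gamma are initial trees, beta is auxiliary.
alpha = Node("S", [Node("X", subst=True)])
gamma = Node("X", [Node("c")])
beta = Node("S", [Node("a"), Node("S", foot=True), Node("b")])

derived = copy.deepcopy(alpha)
substitute(derived.children[0], gamma)  # S[X[c]]
adjoin(derived, beta)                   # S[a S[X[c]] b]
print("".join(tree_yield(derived)))     # acb
```

Adjoining `beta` again at the inner S node yields `aacbb`, so repeated adjunction in this toy grammar produces strings of the form a^n c b^n.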
Grammatical Evolution: What’s new
• It uses the derivation trees generated by CFGs and the fitness
function evaluation of EAs to automatically evolve computer
programs written in arbitrary high-level programming languages.
• The genetic coding is a sequence of natural numbers: 8 22 100 …
• The fitness function defines a multi-criteria optimization that
maximizes the number of fitting points and minimizes the error.
• The technique orders the productions for every nonterminal of
the CFG, and then uses the gene values to decide which production
to choose when it is necessary to expand a given nonterminal
(gene value mod number of choices).
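The "gene value mod number of choices" rule can be illustrated with a small sketch. The grammar, the gene values, the function name `ge_map`, the leftmost-expansion order, and the wrap-around when genes run out are all assumptions made for this example.

```python
def ge_map(genes, grammar, start, max_steps=100):
    """Expand the leftmost nonterminal; production = gene value mod number of choices."""
    sentence = [start]
    used = 0
    for _ in range(max_steps):
        # Position of the leftmost nonterminal, if any is left.
        pos = next((i for i, sym in enumerate(sentence) if sym in grammar), None)
        if pos is None:
            return " ".join(sentence)
        choices = grammar[sentence[pos]]      # ordered productions for this nonterminal
        gene = genes[used % len(genes)]       # assumed: reuse genes cyclically
        used += 1
        sentence[pos:pos + 1] = choices[gene % len(choices)]
    return None  # no complete sentence within max_steps

# Hypothetical toy CFG with ordered productions per nonterminal.
grammar = {
    "expr": [["expr", "op", "expr"], ["var"]],
    "op": [["+"], ["*"]],
    "var": [["x"], ["y"]],
}

print(ge_map([4, 7, 9, 3, 5, 2], grammar, "expr"))  # y * x
```

With these genes the derivation is expr => expr op expr => var op expr => y op expr => y * expr => y * var => y * x, one gene consumed per expansion.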
Parsing TAGs with EAs: EATAGP
GOAL: find a derived tree that
- has its root labeled with S
- has a yield that matches the given input string.
IDEA:
- start from an arbitrary S-rooted (initial) tree
- apply substitutions and adjunctions to progressively build the
target derived tree
- use EAs (gene values, fitness function) to speed up the search
process.
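Trees are internally represented as strings, e.g.:
S{NA}[a S[S{NA} * a]]
[Figure: the corresponding tree, with root S (NA constraint) and children a and S; the inner S has children S (NA constraint, foot node *) and a]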
EATAGP: Genetic Coding – The Key
WHY? Every gene selects:
- a node for substitution / adjunction
- a possible tree for substitution / adjunction.
HOW? Tuples (tree number, node number) completely
characterize all the nodes in all the trees since we order
- all the trees in the sets I and A
- all the nodes according to the node position in the string-tree
representation.
• Start tree: first gene value modulo the number of initial S-rooted trees.
• At each step, select the node to which a derivation is applied: next gene
value modulo the number of nonterminal nodes that do not have an
NA constraint.
• If it is a substitution node, then the next gene value modulo the number
of trees that can be substituted at that node selects the tree for
substitution.
• Analogously, if it is an adjunction node.
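The decoding scheme can be sketched without real trees by tracking only which nodes are still eligible. This is a strong simplification and not the EATAGP code: the tables `start_trees`, `nodes_of`, and `candidates`, the tree names, and the rule that the chosen node is replaced by the eligible nodes of the selected tree are assumptions for illustration.

```python
def decode(genes, start_trees, nodes_of, candidates):
    """Turn a gene sequence into a start tree plus a list of (node, label, tree) choices."""
    # First gene: start tree among the S-rooted initial trees.
    start = start_trees[genes[0] % len(start_trees)]
    open_nodes = list(nodes_of[start])  # eligible nonterminal nodes, in order
    steps = []
    i = 1
    while i + 1 < len(genes) and open_nodes:
        # Next gene: which eligible node receives the derivation.
        node_index = genes[i] % len(open_nodes)
        label = open_nodes[node_index]
        # Following gene: which applicable tree is substituted / adjoined there.
        options = candidates[label]
        tree = options[genes[i + 1] % len(options)]
        steps.append((node_index, label, tree))
        # Assumed bookkeeping: the node is replaced by the chosen tree's eligible nodes.
        open_nodes[node_index:node_index + 1] = nodes_of[tree]
        i += 2
    return start, steps

# Hypothetical grammar tables (names are illustrative only).
start_trees = ["alpha_S"]
nodes_of = {"alpha_S": ["S", "X"], "alpha_X": [], "beta_1": ["S"], "beta_2": ["S"]}
candidates = {"S": ["beta_1", "beta_2"], "X": ["alpha_X"]}

start, steps = decode([17, 5, 40, 2, 9, 7, 4], start_trees, nodes_of, candidates)
print(start, steps)
```

Because every choice is taken modulo the number of available options, any gene sequence decodes to a valid sequence of derivation steps, which is what makes random initialization and mutation safe.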
EATAGP: Fitness Function
• Fitness function encourages:
- the matching of characters in the input string and in the yield
of the derived tree
- the equal length of the two strings.
• Idea: the fitness function values could be triples:
1. the maximum length of a sequence of matched characters
2. the number of matches
3. negative values for yields longer than the input string
- When individuals are compared, criterion 1 takes precedence over 2, and 2 over 3.
• During our tests, the best results were obtained with a linear
function.
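The triple-valued fitness idea maps naturally onto tuples, which Python compares component by component from left to right. The sketch below is one possible reading, not the function used in the tests: comparing characters position by position and using the length difference as the negative third component are assumptions.

```python
def fitness(yield_str, target):
    """Return (longest matched run, number of matches, penalty for overlong yields)."""
    longest = run = matches = 0
    # Assumed: characters are compared at the same position in both strings.
    for produced, expected in zip(yield_str, target):
        if produced == expected:
            matches += 1
            run += 1
            longest = max(longest, run)
        else:
            run = 0
    # Negative only when the yield is longer than the input string.
    overflow = min(0, len(target) - len(yield_str))
    return (longest, matches, overflow)

target = "aacbb"
candidates = ["aabcb", "aacbbbb", "aacbb"]
# Tuples are ordered by the first component, then the second, then the third.
best = max(candidates, key=lambda y: fitness(y, target))
print(best, fitness(best, target))  # aacbb (5, 5, 0)
```

Here `aacbbbb` matches all five input characters but is ranked below the exact yield because its third component is -2.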
EATAGP: Running Tests
Implementation in JAVA and VBA.
15 individuals, each having 20 genes (values between 0 and 255).
Fitness function = number of matching characters between the input string and the yield of the
derived tree.
[Figure: best fitness function values of the individuals during one generation]
[Figure: estimated number of computations for the same input string]
Conclusions and Future Work
We proposed an Evolutionary Algorithm for TAG parsing.
Preliminary tests: three times fewer computations than classical TAG
parsing.
Drawback: for some examples, our algorithm cannot determine that there is
no solution.
Future developments:
- estimate the number of generations required to
find a solution for given lengths of input strings;
- more tests to further investigate when our algorithm performs
better (a conjecture?);
- more complex grammars, including natural language parsing.