20th of January 2009
ICAART’09, Oporto

Parsing Tree Adjoining Grammars Using Evolutionary Algorithms

Adrian Horia Dediu and Cătălin Ionuţ Tîrnăucă
Research Group on Mathematical Linguistics, Rovira i Virgili University
Pl. Imperial Tàrraco 1, 43005, Tarragona, Spain

Outline
• Introduction
• Evolutionary Algorithms
• Tree Adjoining Grammars
• Grammatical Evolution
• Parsing TAGs Using EAs: EATAGP
• Future Work

Introduction (I)
• Natural Computing: Evolutionary Algorithms (EAs) stochastically solve high-dimensional search problems by mimicking natural principles (selecting the fittest individuals from a population).
• Several branches have developed:
  - genetic algorithms
  - evolutionary programming
  - evolutionary strategies
• They share common principles and components:
  - a search space of individuals
  - a fitness function
  - operators to produce offspring.
• Problem: the lack of a mathematical framework (proofs) => other bio-inspired models simulate their behavior: eco-grammar systems, (possibly?) NEPs.

Introduction (II)
• One application: automatic program generation (LISP).
• New approach: grammatical evolution (GE) => uses parse trees of CFGs to automatically evolve computer programs in arbitrary languages.
• Analyzing (parsing) long sentences is a very difficult task for a computer program.
• What about using the power of EAs (GE) (reduced complexity) to parse very long sentences of TAGs (high complexity)?
• The EATAGP algorithm proved to be a solution, at least in the tests we performed.

EAs: Components
EA = (I, f, Ω, μ, λ, s, StopCondition)
• I: individuals forming a population;
• $f : I \to F$: fitness function associated with individuals (F is the set of fitness values);
• Ω: set of genetic operators (mutation, crossover) which, applied to individuals of one generation (parents), produce new individuals (offspring);
• μ: number of parents;
• λ: number of offspring;
• $s : I^{\mu} \times I^{\lambda} \to I^{\mu}$: selection operator producing the next generation of parents from parents and offspring;
• $\mathrm{StopCondition} : F^{\mu} \times \mathrm{Nat} \to \{T, F\}$: stop criterion (“Stop when an individual reaches a good enough fitness value”, “Stop after a certain number of generations”).

EA: How does it work?
1. Randomly generate an initial population.
2. Evaluate each parent using the fitness function.
3. If the StopCondition applied to the current generation is true, then STOP. Otherwise, go to step 4.
4. Apply the genetic operators to obtain offspring.
5. Evaluate each offspring using the fitness function.
6. Use the selection operator to obtain the next generation by replacing the worst individuals with the genetically modified offspring of the best individuals.
7. Go to step 2.
Note that a gene is a physical and functional unit of heredity that carries information from one generation to the next. Genetic coding is the sequence of chosen genes (how the genetic material is encoded as some type of information).

TAGs: Basic Notions
T = (X, N, I, A, S)
• X: terminal alphabet
• N: nonterminals
• I: set of initial trees
• A: set of auxiliary trees
• S: start symbol (from N)
[Figure: an initial tree and an auxiliary tree. Frontier nodes are terminals or nonterminals marked for substitution; the foot node of the auxiliary tree is marked A*.]
Two operations: substitution and adjunction.
Derived tree: a tree built from two other trees by applying one of these operations.
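The two TAG operations named above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and not the authors' implementation: the `Node` class, the helper names, and the toy grammar (initial trees `alpha` and `gamma`, auxiliary tree `beta`), which generates strings of the form a^n c b^n.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    label: str
    children: List["Node"] = field(default_factory=list)
    subst: bool = False  # leaf nonterminal waiting for a substitution
    foot: bool = False   # foot node of an auxiliary tree (written X*)
    na: bool = False     # null-adjoining (NA) constraint


def copy_tree(n: Node) -> Node:
    """Deep-copy a tree so elementary trees can be reused."""
    return Node(n.label, [copy_tree(c) for c in n.children], n.subst, n.foot, n.na)


def find_foot(n: Node) -> Optional[Node]:
    """Return the foot node of an auxiliary tree, or None if there is none."""
    if n.foot:
        return n
    for child in n.children:
        found = find_foot(child)
        if found is not None:
            return found
    return None


def substitute(target: Node, initial: Node) -> None:
    """Replace a substitution node by a copy of an initial tree with the same root label."""
    assert target.subst and target.label == initial.label
    fresh = copy_tree(initial)
    target.children, target.subst, target.na = fresh.children, False, fresh.na


def adjoin(target: Node, aux: Node) -> None:
    """Splice a copy of an auxiliary tree in at target; the old subtree moves under the foot."""
    assert not target.na and not target.subst and target.label == aux.label
    fresh = copy_tree(aux)
    foot = find_foot(fresh)
    assert foot is not None and foot is not fresh and foot.label == target.label
    foot.children, foot.foot = target.children, False
    target.children, target.na = fresh.children, fresh.na


def tree_yield(n: Node) -> str:
    """Concatenate the leaf labels from left to right."""
    if not n.children:
        return n.label
    return "".join(tree_yield(c) for c in n.children)


# Hypothetical toy grammar (not taken from the slides).
alpha = Node("S", [Node("C", subst=True)])
gamma = Node("C", [Node("c")])
beta = Node("S", [Node("a"), Node("S", [Node("S", foot=True, na=True), Node("b")])], na=True)

derived = copy_tree(alpha)
substitute(derived.children[0], gamma)  # S[C[c]], yield "c"
adjoin(derived, beta)                   # yield "acb"
adjoin(derived.children[1], beta)       # yield "aacbb"
print(tree_yield(derived))              # prints "aacbb"
```

After each adjunction the root of the spliced-in tree carries the NA constraint, so the only node still open for adjunction is the inner S; this is what keeps the number of a's and b's equal in this toy grammar.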
Tree set: the trees derived from S-rooted initial trees (with no substitution nodes left).
Language generated by a TAG: the yields of all trees in the tree set.

Grammatical Evolution: What’s new
• It uses the derivation trees generated by CFGs and the fitness function evaluation of EAs to automatically evolve computer programs written in arbitrary high-level programming languages.
• The genetic coding is a sequence of natural numbers: 8 22 100 …
• The fitness function is a multi-criteria optimization which maximizes the number of fitting points and minimizes the error.
• The technique orders the productions for every nonterminal of the CFG, and then uses the gene values to decide which production to choose whenever a given nonterminal must be expanded (gene value mod number of choices).

Parsing TAGs with EAs: EATAGP
GOAL: find a derived tree that
- has its root labeled with S
- has a yield that matches the given input string.
IDEA:
- start from an arbitrary S-rooted (initial) tree
- apply substitutions and adjunctions to progressively build the target derived tree
- use EAs (gene values, fitness function) to speed up the search.
Trees are internally represented as strings, e.g. S{NA}[a S[S{NA} * a]]
[Figure: the same tree drawn as a diagram.]

EATAGP: Genetic Coding – The Key
WHY? Every gene selects:
- a node for substitution / adjunction
- a possible tree for substitution / adjunction.
HOW? Tuples (tree number, node number) completely characterize all the nodes in all the trees, since we order
- all the trees in the sets I and A
- all the nodes according to their position in the string representation of the tree.
• Start tree: first gene value modulo the number of initial S-rooted trees.
• At each step, apply a suitable derivation step to the node selected by: next gene value modulo the number of nonterminal nodes that do not have an NA constraint.
• If it is a substitution node, then the next gene value modulo the number of trees that can be substituted at that node selects the tree for substitution.
• Analogously, if the node is an adjunction node.
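Both Grammatical Evolution and the EATAGP coding above rely on the same rule: a gene value, taken modulo the number of available choices, picks one choice from an ordered list. The sketch below shows that rule on a toy CFG in the Grammatical Evolution setting. The grammar, the function name `decode`, and the policy of returning `None` when the genes run out are assumptions for illustration; the slides do not specify them.

```python
from typing import Dict, List, Optional

# Hypothetical toy CFG; the productions of each nonterminal are ordered.
GRAMMAR: Dict[str, List[List[str]]] = {
    "<expr>": [["<var>"], ["<expr>", "<op>", "<expr>"], ["(", "<expr>", ")"]],
    "<op>": [["+"], ["*"]],
    "<var>": [["x"], ["y"]],
}


def decode(genes: List[int], grammar: Dict[str, List[List[str]]], start: str) -> Optional[str]:
    """Leftmost derivation in which each expansion consumes one gene.

    The production is chosen as: gene value mod number of productions.
    Returns None if the genes are exhausted before the derivation ends
    (an assumed policy; other variants reuse the genes cyclically).
    """
    pending = [start]  # symbols still to process, leftmost first
    output: List[str] = []
    codons = iter(genes)
    while pending:
        symbol = pending.pop(0)
        if symbol not in grammar:  # terminal symbol: emit it
            output.append(symbol)
            continue
        gene = next(codons, None)
        if gene is None:
            return None
        choices = grammar[symbol]
        production = choices[gene % len(choices)]
        pending = list(production) + pending
    return " ".join(output)


print(decode([8, 22, 100, 3, 4, 5, 6, 7, 8, 9, 10], GRAMMAR, "<expr>"))
# prints "( x * y + x )"
```

Here the first gene 8 selects production 8 mod 3 = 2 of `<expr>`, the next gene 22 selects production 22 mod 3 = 1, and so on. In EATAGP the ordered lists would hold candidate nodes and candidate trees instead of CFG productions.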

EATAGP: Fitness Function
• The fitness function encourages:
- the matching of characters between the input string and the yield of the derived tree
- equal length of the two strings.
• Idea: the fitness function values could be triples:
1. the maximum length of a sequence of matched characters
2. the number of matches
3. negative values for yields longer than the input string
- When individuals are compared, criterion 1 takes precedence over criterion 2, and criterion 2 over criterion 3 (1 => 2 => 3).
• During our tests, the best results were obtained with a linear function.

EATAGP: Running Tests
Implementation in Java and VBA.
15 individuals, each having 20 genes (values between 0 and 255).
Fitness function = matching characters between the input string and the yield of the derived tree.
[Plot: best fitness function values of the individuals during one generation]
[Plot: estimated number of computations for the same input string]

Conclusions and Future Work
We proposed an Evolutionary Algorithm for TAG parsing.
Preliminary tests: three times fewer computations than in classical TAG parsing.
Drawback: for some examples, our algorithm is not able to say that there is no solution.
Future developments:
- estimate the number of generations required to find a solution for given lengths of input strings;
- run more tests and further investigate when our algorithm performs better (a conjecture?);
- handle more complex grammars, including natural language parsing.
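The triple-valued fitness idea from the fitness function slide can be sketched as follows. This is only an illustration under stated assumptions: characters are compared position by position (the slides do not say how the yield is aligned with the input), the function name `fitness` is hypothetical, and the priority order 1 => 2 => 3 is realized through Python's lexicographic tuple comparison. It is not the linear function that gave the best results in the authors' tests.

```python
from typing import Tuple


def fitness(candidate_yield: str, target: str) -> Tuple[int, int, int]:
    """Return (longest matched run, number of matches, length penalty).

    Assumption: matches are counted position by position.
    The penalty is negative only when the yield is longer than the input.
    """
    matches = 0
    longest = 0
    run = 0
    for got, expected in zip(candidate_yield, target):
        if got == expected:
            matches += 1
            run += 1
            longest = max(longest, run)
        else:
            run = 0
    penalty = -max(0, len(candidate_yield) - len(target))
    return (longest, matches, penalty)


target = "aacbb"
population = ["acb", "aacbbb", "aacbb", "aaaaa"]
# Tuples compare lexicographically, so criterion 1 dominates 2, and 2 dominates 3.
ranked = sorted(population, key=lambda y: fitness(y, target), reverse=True)
print(ranked[0])  # prints "aacbb"
```

With this ordering, a yield that matches the whole input but is one character too long, such as "aacbbb", ranks just below the exact match.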