Questions from 東海支部:

1. Q: How far have you progressed?
   A: I am still in the design phase, but have started on the implementation. I am not yet at a point where I can obtain tangible results.

2. Q: Do you plan to consider relations more complex than simple linear ones?
   A: Yes, I plan to consider class (IsA) relations as well as hierarchical (Parent/Child) relations.

Response to question 1

1. Start off with a character sequence

   (ア) DataSource = DS = " this is a fine time for polka. "  (32 characters, 16 distinct elements)

   (イ) Add each character into a new alphabet model, creating an atomic representation as you go, and give each data element a uniform probability:

           p(di; M) = 1/Count(D) = 1/|D| = 1/16

        JOIN ModelDataList, DataElement WHERE ModelDataList.DataName EQUALS DataElement.DataName

        ModelDataList                                     | DataElement
        *DataName | *Evaluation | Eval.Class | *ModelName   | String | Type
        t         | 1/16        | Prob       | Alphabet-001 | "t"    | Word
        h         | 1/16        | Prob       | Alphabet-001 | "h"    | Word
        ...

        SELECT * FROM Representation WHERE DataSource EQUALS 'DS'

        Representation ("_" marks a space)
        *DataName | *TimeStart | *TimeEnd | *DataSource | ModelName
        _         | 0          | 1        | DS          | Alphabet-001
        t         | 1          | 2        | DS          | Alphabet-001
        h         | 2          | 3        | DS          | Alphabet-001
        ...

   (ウ) Form a cost evaluation based on a uniform log-likelihood and add these evaluations to the model's data list:

           cost(di) = -log2(p(di))

        ModelDataList (encoding)
        *DataName | *Evaluation | Eval.Class | *ModelName
        t         | 4           | Cost       | Alphabet-001
        h         | 4           | Cost       | Alphabet-001
        ...

   (エ) Take the new representation and evaluate it based on the cost:

           Evaluate(Representation, Model, EvalClass) = Σi [ length(m(di)) + Eval(m(di), EvalClass) * count(m(di), Rep.) ]

        JOIN Representation, ModelDataList
        WHERE Representation.DataName EQUALS ModelDataList.DataName,
              ModelName EQUALS Alphabet-001, EvaluationClass EQUALS Cost

        *DataName | *TimeStart | *TimeEnd | *DataSource | ModelName    | Eval.Class | *Evaluation
        _         | 0          | 1        | DS          | Alphabet-001 | Cost       | 4
        t         | 1          | 2        | DS          | Alphabet-001 | Cost       | 4
        h         | 2          | 3        | DS          | Alphabet-001 | Cost       | 4
        ...       | ...        | ...      | ...         | ...          | ...        | ...
        .         | 30         | 31       | DS          | Alphabet-001 | Cost       | 4
        _         | 31         | 32       | DS          | Alphabet-001 | Cost       | 4

        Per element, length(m(di)) = 1 for a single character, plus 4 bits per occurrence:
        Cost(_) = 1 + 4*8 = 33
        Cost(i) = 1 + 4*4 = 17
        Cost(t), (s), (a), (f), (e), (o) = 1 + 4*2 = 9 each
        Cost(h), (n), (m), (r), (p), (l), (k), (.) = 1 + 4*1 = 5 each

        SUM Evaluation FROM (previous join) == 33 + 17 + 6*9 + 8*5 == 144

        ModelDataEvaluation
        *DataSource | *Evaluation | Eval.Class | *ModelName
        DS          | 144         | Cost       | Alphabet-001

2. Attempt to improve the cost of the representation using probabilistic encoding

   (ア) Using the atomic representation, form a new alphabet model and use the observed frequencies to form a more accurate cost evaluation:

           p(di) = Count(di)/Count(D)
                == COUNT DataSource FROM Representation WHERE DataName EQUALS di, DataSource EQUALS DS
           Cost(di; M) = -log2(p(di; M))

        *DataName | *TimeStart | *TimeEnd | *DataSource | ModelName    | Eval.Class | *Evaluation
        _         | 0          | 1        | DS          | Alphabet-002 | Cost       | 2
        t         | 1          | 2        | DS          | Alphabet-002 | Cost       | 4
        h         | 2          | 3        | DS          | Alphabet-002 | Cost       | 5
        ...       | ...        | ...      | ...         | ...          | ...        | ...
        .         | 30         | 31       | DS          | Alphabet-002 | Cost       | 5
        _         | 31         | 32       | DS          | Alphabet-002 | Cost       | 2

   (イ) Evaluate the same representation using the new model:

        _  t  h  i  s  _  i  s  _  a  _  f  i  n  e  _  t  i  m  e  _  f  o  r  _  p  o  l  k  a  .  _
        2  4  5  3  4  2  3  4  2  4  2  4  3  5  4  2  4  3  5  4  2  4  4  5  2  5  4  5  5  4  5  2   = 116

   (ウ) Compared to the previous model, this is a saving of 28 bits, at an average of 3.63 bits/character (previously 4).

3. Use a bi-gram learning method to further improve cost

   (ア) Form a matrix of transition counts from one data element (character) to the next (rows = current element, columns = next element; "." marks zero). A short sketch reproducing these numbers follows the matrix.

             _  t  h  i  s  a  f  n  e  m  o  r  p  l  k  .
        _    .  2  .  1  .  1  2  .  .  .  .  .  1  .  .  .
        t    .  .  1  1  .  .  .  .  .  .  .  .  .  .  .  .
        h    .  .  .  1  .  .  .  .  .  .  .  .  .  .  .  .
        i    .  .  .  .  2  .  .  1  .  1  .  .  .  .  .  .
        s    2  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .
        a    1  .  .  .  .  .  .  .  .  .  .  .  .  .  .  1
        f    .  .  .  1  .  .  .  .  .  .  1  .  .  .  .  .
        n    .  .  .  .  .  .  .  .  1  .  .  .  .  .  .  .
        e    2  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .
        m    .  .  .  .  .  .  .  .  1  .  .  .  .  .  .  .
        o    .  .  .  .  .  .  .  .  .  .  .  1  .  1  .  .
        r    1  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .
        p    .  .  .  .  .  .  .  .  .  .  1  .  .  .  .  .
        l    .  .  .  .  .  .  .  .  .  .  .  .  .  .  1  .
        k    .  .  .  .  .  1  .  .  .  .  .  .  .  .  .  .
        .    1  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .
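   As a quick check, the figures in steps 1, 2 and 3(ア) can be reproduced with a minimal Python sketch. It assumes costs are rounded up to whole bits, which matches the worked values (space = 2, i = 3, h = 5, ...); the variable names are illustrative only.

        from collections import Counter
        from math import log2, ceil

        DS = " this is a fine time for polka. "   # 32 characters, 16 distinct elements
        counts = Counter(DS)

        # Step 1 (Alphabet-001): uniform p(di) = 1/16, so every element costs 4 bits;
        # Evaluate() also charges length(m(di)) = 1 per distinct element.
        uniform_total = sum(1 + 4 * n for n in counts.values())
        print(uniform_total)                       # 144

        # Step 2 (Alphabet-002): cost(di) = -log2(count(di)/|DS|), rounded up to whole bits.
        cost = {c: ceil(-log2(n / len(DS))) for c, n in counts.items()}
        freq_total = sum(cost[c] for c in DS)
        print(freq_total, freq_total / len(DS))    # 116  3.625

        # Step 3(ア): transition counts between adjacent characters (the matrix above).
        transitions = Counter(zip(DS, DS[1:]))
        print(transitions[("i", "s")], transitions[("l", "k")])   # 2  1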
   (イ) Attempt to form a new model using heuristic search.

        Heuristic: the success of our learning algorithm depends on the heuristic we use when searching for a better model of the data, e.g. a probabilistic or an information-theoretic method.

        1. Probabilistic: the probability of seeing "ab" is greater than the probability expected under independence:

               p(wi, wj) > p(wi) p(wj)
               p(wi(t+1) | wj(t)) > 1/|w|

        2. Information-theoretic: the entropy of one element decreases when the other is known to exist:

               MI(wi, wj) = H(wj) - H(wj|wi) = H(wi) + H(wj) - H(wi, wj) > 0
               i.e. H(wi, wj) < H(wi) + H(wj)

        The problem with these methods is that they fail to take the small sample size into account. In this example, almost any created relation will decrease the entropy or the cost of the representation, because of the high predictability of the data. An element that occurs once appears to be followed by its successor 100% of the time, and will not be treated differently from one that follows the same element 100% of the time but occurs 1000 times.

        To make a new rule, we can take two approaches to a heuristic estimate of ΔEvaluation: aggressive or conservative. A sketch of both heuristics follows below.

        1. Aggressive: trim the search by looking for strong connections, e.g. use the log-likelihood to evaluate cost and find the greatest difference to locate related data elements:

               argmax(a,b) [ -(log2 p(a) + log2 p(b)) + log2 p(ab) ]

           This method can find irregular relations like "qu", but it is sensitive to sparse-data problems. Here "lk" (Δ = 5 + 5 - 5 = 5 bits) will be considered as a connection before "is" (Δ = 3 + 4 - 4 = 3 bits), even though we consider "is" the more valuable connection. Another possible use is to prune the search space with this method before performing a conservative estimation.

        2. Conservative: estimate the effect that adding the rule will have on the cost of the entire representation, e.g. use the log-likelihood scaled by the occurrence counts of the elements:

               argmax(a,b) [ -(Count(a) log2 p(a) + Count(b) log2 p(b)) + Count(ab) log2 p(ab) ]

           Here "is" (Δ = 4*3 + 2*4 - 2*4 = 12 bits) will be considered before "lk" (Δ = 1*5 + 1*5 - 1*5 = 5 bits), because we are indirectly taking the reliability of the data into account.

        Add this rule (i, s → is) as the first in our new model:

        ModelRules
        *RuleName | *Yin | *Yang | *ModelName
        is        | i    | s     | BiGram-001

   (ウ) Since we have no non-trivial alternatives in the search, we know that the new representation will cost 2 bits more for the inclusion of the new data element (we do not need to include the cost of the rule itself, because it is a hierarchical relationship or other dependence relation) and 12 bits less from the representation shortened by the inclusion of "is":

        *DataName | *TimeStart | *TimeEnd | *DataSource | ModelName    | Eval.Class | *Evaluation
        _         | 0          | 1        | DS          | Alphabet-002 | Cost       | 2
        ...       | ...        | ...      | ...         | ...          | ...        | ...
        is        | 3          | 5        | DS          | BiGram-001   | Cost       | 4
        ...       | ...        | ...      | ...         | ...          | ...        | ...
        is        | 6          | 8        | DS          | BiGram-001   | Cost       | 4
        ...       | ...        | ...      | ...         | ...          | ...        | ...
        _         | 31         | 32       | DS          | Alphabet-002 | Cost       | 2

        If there is a non-trivial choice to make, such as whether "his" forms as "hi·s" or "h·is", then the estimated change in evaluation may not be correct, as the added entries will not be used as expected. The goal is to overestimate the change in cost that would result from the change, to adhere to the rules of optimal heuristic search (as in A*).
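        A minimal sketch of the two Δ-evaluation heuristics above, under the same whole-bit cost assumption as before; the function names are illustrative. It reproduces the ordering from the worked example: the aggressive score prefers "lk", the conservative score prefers "is".

        from collections import Counter
        from math import log2, ceil

        DS = " this is a fine time for polka. "
        unigrams = Counter(DS)                      # 32 characters
        bigrams = Counter(zip(DS, DS[1:]))          # 31 adjacent pairs
        N, M = len(DS), len(DS) - 1

        def bits(p):
            return ceil(-log2(p))                   # whole-bit cost, as in the tables above

        def aggressive(a, b):
            # Per-occurrence saving: cost(a) + cost(b) - cost(ab)
            return bits(unigrams[a] / N) + bits(unigrams[b] / N) - bits(bigrams[(a, b)] / M)

        def conservative(a, b):
            # Saving scaled by occurrence counts, so reliable pairs win
            return (unigrams[a] * bits(unigrams[a] / N)
                    + unigrams[b] * bits(unigrams[b] / N)
                    - bigrams[(a, b)] * bits(bigrams[(a, b)] / M))

        print(aggressive("l", "k"), aggressive("i", "s"))       # 5 3   -> "lk" ranked first
        print(conservative("l", "k"), conservative("i", "s"))   # 5 12  -> "is" ranked first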
Extensions for more complicated relations:

1. Multi-Gram (non-dependent)

   The use of multi-grams can improve performance by allowing variable-length encoding. These are an extension of the LinearRelation relationship. With relationships like this we can "skip" levels of relation forming. For example, in order to create an entry for the word "this" in the sequence above, we can simply calculate its cost directly by following a path through the transition matrix using dynamic programming (see the sketch at the end of this section). Otherwise we would have to explicitly make a quad-gram relation, or build it hierarchically by relating "th" and "is" or by some similar relation. That is too rigid, because we may never create elements like "th" that such a construction would depend on.

   [Transition matrix as in 3(ア) above, with the path _ → t → h → i → s traced through it.]

   The cost in the representation is now under 5 bits for "this", and we can avoid the inclusion of "th", which is a sub-optimal relation. One possible way to form these is to look for low cost-versus-length chains in the matrix, computed with dynamic programming as above.

2. Word Classes (dependent)

   WordClass relations extend from Relation, and their one-to-many nature can be represented by multiple entries in the database. They can be formed using similarity matrices bound on features such as the previous word or other constituents. They can be understood to shorten the description length by restricting the possible choices for a word to those contained in the class, reducing entropy (perplexity?) by adding more nodes to the decision tree.

        *DataName | *TimeStart | *TimeEnd | *DataSource | ModelName      | Eval.Class | *Evaluation
        is        | 3          | 5        | DS          | BiGram-001     | Cost       | 4
        Verb      | 3          | 5        | DS          | Classifier-001 | Cost       | 3

   In this case we can describe "is" with 3 bits, because it is the only Verb in our data list: if we see a Verb, we are 100% sure that it is "is".

3. Hierarchical (CFG) (dependent)

   The hierarchical relationship extends from the linear relationship, so it has links between elements in a lateral shape, but it also has links to children and parent(s). A good method for forming these is still an unsolved problem, but the benefit is obvious: these relations allow an analysis of natural language that is much closer to our intuitive understanding of it. The reduction in cost is like that of Word Classes, in that it reduces entropy (perplexity?) by restricting the choices in a decision tree used to determine the nature of a constituent data element.

Problems:
- Do I use the reevaluated costs of characters when determining the index cost of a word?
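Returning to the Multi-Gram extension above: a minimal sketch of scoring a chunk such as "this" by chaining conditional transition probabilities from the matrix. It follows a single path with exact (unrounded) bits; a full version would use dynamic programming to compare alternative segmentations, and starting from the preceding space as left context is an assumption.

    from collections import Counter
    from math import log2

    DS = " this is a fine time for polka. "
    bigrams = Counter(zip(DS, DS[1:]))
    outgoing = Counter(a for a, _ in zip(DS, DS[1:]))   # row totals of the matrix

    def chain_cost(chunk, context=" "):
        # Cost of a multi-gram as a path through the transition matrix:
        # sum of -log2 p(next | prev), starting from the given left context.
        total, prev = 0.0, context
        for c in chunk:
            total += -log2(bigrams[(prev, c)] / outgoing[prev])
            prev = c
        return total

    print(round(chain_cost("this"), 2))   # ~3.81 bits, under the 5-bit figure above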