

Questions from 東海支部 (Tokai Branch):
1. Q: How far have you progressed?
   A: I am still in the design phase, but have started the beginning of implementation. I am not yet at a point where I can obtain tangible results.
2. Q: Do you plan to consider relations more complex than simple linear relations?
   A: Yes, I plan to consider class (IsA) relations as well as hierarchical (Parent/Child) relations.

Response to question 1
1. Start off with a character sequence:
(ア) DataSource = DS = " this is a fine time for polka. "
(イ) Add each character to a new alphabet model, creating an atomic representation as you go, and give each data element a uniform probability:

p(di; M) = 1/Count(D) = 1/|D| = 1/16 (16 distinct characters)

JOIN ModelDataList, DataElement WHERE ModelDataList.DataName EQUALS DataElement.DataName

Model Data List x Data Element
*Data Name | Evaluation | *Evaluation Class | *Model Name  | String | Type
t          | 1/16       | Prob              | Alphabet-001 | "t"    | Word
h          | 1/16       | Prob              | Alphabet-001 | "h"    | Word
...

(Join of ModelDataList x DataElement Tables)
SELECT * FROM Representation WHERE DataSource EQUALS 'DS'

Representation
*Data Name | *Time Start | *Time End | *Data Source | Model Name
t          | 0           | 1         | DS           | Alphabet-001
h          | 1           | 2         | DS           | Alphabet-001
…
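A minimal Python sketch of (ア)-(イ), assuming simple in-memory structures instead of the actual database tables (the function and variable names are illustrative, not the real schema):

# Build an alphabet model with uniform probabilities and an atomic
# (one-element-per-character) representation of the data source.
DS = " this is a fine time for polka. "

def build_alphabet_model(data):
    """Return {element: probability} with a uniform distribution over distinct elements."""
    alphabet = sorted(set(data))
    return {d: 1.0 / len(alphabet) for d in alphabet}

def atomic_representation(data, source="DS", model="Alphabet-001"):
    """One (DataName, TimeStart, TimeEnd, DataSource, ModelName) row per character."""
    return [(d, t, t + 1, source, model) for t, d in enumerate(data)]

model = build_alphabet_model(DS)          # 16 distinct elements -> p = 1/16 each
representation = atomic_representation(DS)
print(len(model), model["t"])             # 16 0.0625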
(ウ) Form a cost evaluation based on the uniform log-likelihood and add these evaluations to the model's data list:

cost(di) = -log2(p(di)) = -log2(1/16) = 4 bits
Encoding Model Data List
*Data Name | Evaluation | *Evaluation Class | *Model Name
t          | 4          | Cost              | Alphabet-001
h          | 4          | Cost              | Alphabet-001
…
(エ) Take the new representation and evaluate it based on the cost

Evaluate(Representation, Model, EvalClass) = Σ_i [ length(m(di)) + Eval(m(di), EvalClass) * count(m(di), Rep.) ]

JOIN Representation, ModelDataList WHERE Representation.DataName EQUALS ModelDataList.DataName, ModelName EQUALS Alphabet-001, EvaluationClass EQUALS Cost

Representation x ModelDataList
*Data Name | *Time Start | *Time End | *Data Source | Model Name   | *Evaluation Class | Evaluation
' '        | 0           | 1         | DS           | Alphabet-001 | Cost              | 4
t          | 1           | 2         | DS           | Alphabet-001 | Cost              | 4
h          | 2           | 3         | DS           | Alphabet-001 | Cost              | 4
…          | …           | …         | …            | …            | …                 | …
o          | 13          | 14        | DS           | Alphabet-001 | Cost              | 4
r          | 14          | 15        | DS           | Alphabet-001 | Cost              | 4
' '        | 15          | 16        | DS           | Alphabet-001 | Cost              | 4

Per-element contributions (length + cost * count):
Cost(' ') = 1 + 4*8 = 33
Cost(i) = 1 + 4*4 = 17
Cost(t) = Cost(s) = Cost(a) = Cost(f) = Cost(e) = Cost(o) = 1 + 4*2 = 9
Cost(h) = Cost(n) = Cost(m) = Cost(r) = Cost(p) = Cost(l) = Cost(k) = Cost(.) = 1 + 4*1 = 5

SUM Evaluation over the join above = 33 + 17 + 6*9 + 8*5 = 144

Model Data Evaluation
*Data Source | Evaluation | *Evaluation Class | *Model Name
DS           | 144        | Cost              | Alphabet-001
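A sketch of the cost and Evaluate calculations in (ウ)-(エ), reproducing the 144-bit figure (again an illustrative stand-in for the database computation):

import math
from collections import Counter

DS = " this is a fine time for polka. "

def cost(p):
    """cost(d) = -log2(p(d)), in bits."""
    return -math.log2(p)

def evaluate(data, prob):
    """Evaluate = sum over distinct elements of [length(d) + cost(d) * count(d)]."""
    counts = Counter(data)
    return sum(len(d) + cost(prob[d]) * counts[d] for d in counts)

uniform = {d: 1.0 / len(set(DS)) for d in set(DS)}   # p = 1/16, cost = 4 bits each
print(cost(uniform["t"]))      # 4.0
print(evaluate(DS, uniform))   # 144.0 = 16*1 + 32*4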
2. Attempt to improve the cost of the representation using probabilistic encoding
(ア) Using the atomic representation, form a new alphabet model (Alphabet-002) and use the frequencies to form a more accurate cost evaluation:

p(di) = Count(di)/Count(D), where
Count(di) == COUNT DataSource FROM Representation WHERE DataName EQUALS di AND DataSource EQUALS DS
and Count(D) = 32 is the total number of elements in the representation

Cost(di; M) = -log2(p(di; M))
Representation x ModelDataList
*Data Name | *Time Start | *Time End | *Data Source | Model Name   | *Evaluation Class | Evaluation
' '        | 0           | 1         | DS           | Alphabet-002 | Cost              | 2
t          | 1           | 2         | DS           | Alphabet-002 | Cost              | 4
h          | 2           | 3         | DS           | Alphabet-002 | Cost              | 5
…          | …           | …         | …            | …            | …                 | …
o          | 13          | 14        | DS           | Alphabet-002 | Cost              | 4
r          | 14          | 15        | DS           | Alphabet-002 | Cost              | 5
' '        | 15          | 16        | DS           | Alphabet-002 | Cost              | 5
(イ) Evaluate the same representation using the new model ("_" marks the space character):

_ t h i s _ i s _ a _ f i n e _ t i m e _ f o r _ p o l k a . _
2 4 5 3 4 2 3 4 2 4 2 4 3 5 4 2 4 3 5 4 2 4 4 5 2 5 4 5 5 4 5 2
= 116 bits

(ウ) Compare to the previous model:
A saving of 28 bits, and an average of 3.63 bits/character (previously 4 bits/character).
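A sketch of step 2: re-estimate the probabilities from the frequencies (Alphabet-002) and re-evaluate the same representation; it reproduces the 116-bit total and the 3.63 bits/character average:

import math
from collections import Counter

DS = " this is a fine time for polka. "
counts = Counter(DS)

# Alphabet-002: p(d) = Count(d) / Count(D), cost(d) = -log2(p(d))
freq_cost = {d: -math.log2(counts[d] / len(DS)) for d in counts}

total = sum(freq_cost[d] for d in DS)
print(freq_cost[" "], freq_cost["i"], freq_cost["t"], freq_cost["h"])  # 2.0 3.0 4.0 5.0
print(total, total / len(DS))                                          # 116.0 3.625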
3. Use a Bi-Gram learning method to further improve the cost
(ア) Form a matrix of transitions from one data element (character) to the next:
Transition counts for DS (row element followed by its successors; "_" is the space character; zero entries omitted):
_ : t 2, f 2, i 1, a 1, p 1
t : h 1, i 1
h : i 1
i : s 2, n 1, m 1
s : _ 2
a : _ 1, . 1
f : i 1, o 1
n : e 1
e : _ 2
m : e 1
o : r 1, l 1
r : _ 1
p : o 1
l : k 1
k : a 1
. : _ 1
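A sketch of building these transition counts as a nested counter (rather than the table layout above):

from collections import Counter, defaultdict

DS = " this is a fine time for polka. "

# transitions[a][b] = number of times element b immediately follows element a
transitions = defaultdict(Counter)
for a, b in zip(DS, DS[1:]):
    transitions[a][b] += 1

print(transitions[" "])   # Counter({'t': 2, 'f': 2, 'i': 1, 'a': 1, 'p': 1})
print(transitions["i"])   # Counter({'s': 2, 'n': 1, 'm': 1})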
(イ) Attempt to form a new model using heuristic search

Heuristic: The success of our learning algorithm depends on the heuristic we use when searching for a better model for the data, for example a probabilistic or an information-theoretic method.
1. probabilistic: the probability of seeing "ab" is greater than the probability expected under independence:

p(wi, wj) > p(wi) p(wj)
p(wi(t+1) | wj(t)) > 1/|w|

2. info-theoretic: the entropy of one element is decreased when the other is known to exist:

MI(wi, wj) = H(wj) - H(wj|wi) = H(wi) + H(wj) - H(wi, wj) > 0
equivalently, H(wi, wj) < H(wi) + H(wj)
The problem with these methods is that they fail to take the small sample size into account. In this example, almost any created relation will decrease the entropy or cost of the representation because of the high predictability of the data. An element which occurs once appears to be followed by the next element 100% of the time, and will not be treated any differently from one which is followed by the same element 100% of the time but occurs 1000 times.
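A small numeric illustration of this problem (my own example, using simple maximum-likelihood estimates over the 32-character data source):

from collections import Counter

DS = " this is a fine time for polka. "
unigrams = Counter(DS)
bigrams = Counter(zip(DS, DS[1:]))

def conditional(a, b):
    """Maximum-likelihood estimate of p(b | a) from the transition counts."""
    return bigrams[(a, b)] / unigrams[a]

# A pair observed only once, whose left element also occurs only once,
# looks perfectly predictive...
print(conditional("l", "k"))   # 1.0  ("lk" occurs 1 time, "l" occurs 1 time)
# ...while the genuinely repeated pair scores lower.
print(conditional("i", "s"))   # 0.5  ("is" occurs 2 times, "i" occurs 4 times)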

e.g. when making a new rule, we can take two approaches to a heuristic estimate of the Δevaluation: aggressive or conservative.
1. aggressive: trim the search by looking for strong connections.

e.g. use the log-likelihood to evaluate cost, and find the greatest difference to locate related data elements:

argmax_{a,b} [ -( log(p(a)) + log(p(b)) ) + log(p(ab)) ]

This method can find irregular relations like "qu", but it is sensitive to sparse-data problems. In this case, "lk" (Δ = 5 bits + 5 bits - 5 bits) will be considered as a connection before "is" (Δ = 3 bits + 4 bits - 4 bits), which we consider to be the more valuable connection. Another possible use of this method is to prune the search space before performing a conservative estimation.
2. conservative: we attempt to estimate the effect that adding this rule will have on the cost of the entire representation.

e.g. use the log-likelihood scaled by the occurrence counts of the elements to evaluate cost, and find the greatest difference to locate related data elements:

argmax_{a,b} [ -( Count(a) log(p(a)) + Count(b) log(p(b)) ) + Count(ab) log(p(ab)) ]

In this case, "is" (Δ = 4*3 bits + 2*4 bits - 2*4 bits) will be considered before "lk" (Δ = 1*5 bits + 1*5 bits - 1*5 bits). This is because we are indirectly taking the reliability of the data into account.
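A sketch of the two Δevaluation heuristics side by side (the scoring functions below are my own reading of the formulas above, not the actual implementation):

import math
from collections import Counter

DS = " this is a fine time for polka. "
unigrams = Counter(DS)
bigrams = Counter(zip(DS, DS[1:]))
N, M = len(DS), len(DS) - 1   # unigram and bigram sample sizes

def cost(count, total):
    return -math.log2(count / total)

def aggressive(a, b):
    # cost(a) + cost(b) - cost(ab): large for rare but perfectly-predictive pairs
    return cost(unigrams[a], N) + cost(unigrams[b], N) - cost(bigrams[(a, b)], M)

def conservative(a, b):
    # the same difference, scaled by how often each element actually occurs
    return (unigrams[a] * cost(unigrams[a], N) + unigrams[b] * cost(unigrams[b], N)
            - bigrams[(a, b)] * cost(bigrams[(a, b)], M))

print(aggressive("l", "k"), aggressive("i", "s"))      # ~5.0 vs ~3.0 : "lk" scores higher
print(conservative("l", "k"), conservative("i", "s"))  # ~5.0 vs ~12.1: "is" scores higher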

Add this rule (i, s → is) as the first rule in our new model:

Model Rules
*Rule Name | *Yin | *Yang | *Model Name
is         | i    | s     | Bi-Gram-001
(ウ) Since we have no non-trivial alternatives in the search, we know that our new representation will cost 2 bits more for the inclusion of the new data element (we do not need to include the cost of the rule itself, because it is a hierarchical relationship or other dependence relation) and 12 bits less for the shortened representation resulting from the inclusion of "is".
Representation x ModelDataList
*Data Name | *Time Start | *Time End | *Data Source | Model Name   | *Evaluation Class | Evaluation
' '        | 0           | 1         | DS           | Alphabet-002 | Cost              | 2
…          | …           | …         | …            | …            | …                 | …
is         | 3           | 5         | DS           | Bi-Gram-001  | Cost              | 4
…          | …           | …         | …            | …            | …                 | …
is         | 6           | 8         | DS           | Bi-Gram-001  | Cost              | 4
…          | …           | …         | …            | …            | …                 | …
' '        | 15          | 16        | DS           | Alphabet-002 | Cost              | 5
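A sketch of applying the rule i, s → is to the atomic representation with a simple greedy left-to-right merge (in the actual system this choice is made during the search; the row tuples stand in for the database entries):

DS = " this is a fine time for polka. "

def apply_rule(data, yin, yang, name):
    """Merge adjacent (yin, yang) pairs into a single element, scanning left to right."""
    rows, t = [], 0
    while t < len(data):
        if data[t] == yin and t + 1 < len(data) and data[t + 1] == yang:
            rows.append((name, t, t + 2, "DS", "Bi-Gram-001"))
            t += 2
        else:
            rows.append((data[t], t, t + 1, "DS", "Alphabet-002"))
            t += 1
    return rows

merged = apply_rule(DS, "i", "s", "is")
print([r for r in merged if r[0] == "is"])
# [('is', 3, 5, 'DS', 'Bi-Gram-001'), ('is', 6, 8, 'DS', 'Bi-Gram-001')]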
If there is a non-trivial choice to make, such as "does 'his' form as 'hi・s' or 'h・is'?", then the estimated change in evaluation may not be correct, as the added entries will not be used as expected. The goal is to overestimate the change in cost that would result from the change, in order to adhere to the rules of optimal heuristic search (as in A*).

Extensions for more complicated relations:
1. Multi-Gram (non-dependent)

The use of multi-grams can improve performance by allowing variable-length encoding. These are an extension of the LinearRelation relationship. With relationships like this, we can "skip" levels of relation forming. For example, in order to create an entry for the word "this" in the sequence above, we can simply calculate the cost directly by following a path through the transition matrix using dynamic programming. Otherwise we would have to explicitly make a Quad-Gram relation, or build it hierarchically by relating "th" and "is" or by a similar relation. This is too rigid, because we may never create elements like "th" that this would depend on.
(Transition-count matrix as in 3-(ア) above, used here to follow a path through the matrix for the word "this".)
The cost in the representation is now under 5 bits for "this", and we can avoid the inclusion of "th", which is a sub-optimal relation. One possible way to form these is to look for chains in the matrix with low cost relative to their length, computed with dynamic programming as above.
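One possible reading of the dynamic-programming idea, as a sketch: chain the conditional transition probabilities along the path t→h→i→s (an illustration of the mechanism, not the author's exact calculation; it assumes every transition along the path was observed):

import math
from collections import Counter

DS = " this is a fine time for polka. "
unigrams = Counter(DS)
bigrams = Counter(zip(DS, DS[1:]))

def chain_cost(word):
    """-log2 of the product of transition probabilities along the word."""
    bits = 0.0
    for a, b in zip(word, word[1:]):
        bits += -math.log2(bigrams[(a, b)] / unigrams[a])
    return bits

# Cost of continuing t -> h -> i -> s, on top of however the first element is coded.
print(chain_cost("this"))   # 2.0 bits  (p(h|t)=1/2, p(i|h)=1, p(s|i)=1/2)

If the first element is also conditioned on the preceding space (p(t|' ') = 2/7, about 1.8 bits), the whole word comes to roughly 3.8 bits, which is one way to arrive at the "under 5 bits" figure above.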
2. Word Classes (dependent)

WordClass relations extend from Relation, and their one-to-many nature can be represented by multiple entries in the database. These can be formed using similarity matrices bound on features such as the previous word or other constituents. They can be understood to shorten the description length by restricting the possible choices for a word to those contained in the class, reducing entropy (perplexity?) by adding more nodes to the decision tree.
Representation x ModelDataList
*Data Name | *Time Start | *Time End | *Data Source | Model Name     | *Evaluation Class | Evaluation
is         | 3           | 5         | DS           | Bi-Gram-001    | Cost              | 4
Verb       | 3           | 5         | DS           | Classifier-001 | Cost              | 3
In this case, we can describe "is" with 3 bits because it is the only Verb in our data list, so if we see a Verb, we are 100% sure that it is "is".
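One way to read the 3-bit figure in terms of the costs above (my interpretation of the claim, not a calculation given in the text): cost(is) = cost(Verb) + cost(is | Verb) = 3 + (-log2 1) = 3 + 0 = 3 bits, since p(is | Verb) = 1 when "is" is the only Verb in the data list.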
3. Hierarchical (CFG) (dependent)

The hierarchical relationship extends from the linear relationship, so it has links between elements in a lateral shape, but it also has links to children and parent(s?). A good method for forming these is still an unsolved problem, but the benefit is obvious: these relations allow an analysis of natural language that is much closer to our intuitive understanding of it. The reduction in cost is like that of Word Classes, in that it reduces entropy (perplexity?) by restricting the choices in a decision tree used to determine the nature of a constituent data element.
Problems:
- Do I use the re-evaluated costs of characters when determining the index cost of a word?
