Analysis of Affinities and Anomalies through pTrees
Algorithm-1: Look for the dimension in which the clustering is best. Below, dimension=1 gives 3 clusters: {r1,r2,r3,O}, {v1,v2,v3,v4} and {0}. How do we determine this?
1.a: Take each dimension in turn, working left to right; when d(mean, median) > ¼ width, declare a cluster.
1.b: Next, take those clusters one at a time to the next dimension for further sub-clustering via the same algorithm.
At this point we declare {r1,r2,r3,O} a cluster and start over.
At this point we need to declare a cluster, but which one, {0,v1} or {v1,v2}? We will always take the one on the median side of the mean - in this case, {v1,v2}. And that makes {0} a cluster (actually an outlier, since it's a singleton). Continuing with {v1,v2}:
Declare {v1,v2,v3,v4} a cluster. Note we have to loop. However, rather than stepping one projection at a time, the step (delta) can cover the next m projections if they're close.
Algorithm-2: 2.a: Take each dimension in turn, working left to right; when density > Density_Threshold, declare a cluster (density ≡ count/size). 2.b = 1.b. (A sketch of both sweeps follows.)
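A minimal sketch of the 1.a sweep (with the 2.a density test noted as a comment), assuming plain NumPy arrays stand in for pTree projections; it is simplified in that it cuts at the point that triggers the test rather than choosing the median side of the mean, and frac/density_threshold are illustrative parameters:

import numpy as np

def project_clusters(proj, frac=0.25):
    """Sketch of Algorithm-1 (1.a): sweep the sorted 1-D projections left to
    right; whenever the running segment's |mean - median| exceeds
    frac * (total width), declare the segment so far a cluster and restart.
    frac=0.25 is the quarter-width test from the text."""
    xs = np.sort(np.asarray(proj, dtype=float))
    width = xs[-1] - xs[0]
    clusters, start = [], 0
    for i in range(1, len(xs)):
        seg = xs[start:i + 1]
        if abs(seg.mean() - np.median(seg)) > frac * width:
            clusters.append(list(xs[start:i]))   # points before the break
            start = i
        # Algorithm-2 (2.a) variant: replace the test above with a density
        # test, e.g. (i + 1 - start) / (seg[-1] - seg[0] + 1) > threshold.
    clusters.append(list(xs[start:]))
    return clusters

# dim1 projections of the example dataset below
print(project_clusters([0, 2, 3, 4, 4, 5, 7, 8, 9, 10, 11]))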
Oblique version: Take a grid of oblique direction vectors, e.g., for a 3D dataset, a direction vector pointing to the center of each PTM triangle. With projections onto those lines, do 1 or 2 above. Ordering = any sphere-surface grid: S^n ≡ {x=(x_1,...,x_n)∈R^n | Σ x_i² = 1}, in polar coordinates {p=(θ_1,...,θ_{n-1}) | 0 ≤ θ_i ≤ 179}.
Use lexicographical polar coordinates? Is 180^n too many? Use, e.g., 30° units, giving 6^n vectors for dim=n. Attribute relevance is important.
Algorithm-3: Another variation: calculate the dataset mean and the vector of medians (vom). Then, on the projections of the dataset onto the line connecting the two, do 1.a or 1.b. Then repeat on each declared cluster, but use a projection line other than the one through the mean and vom this second time, since the mean-vom line would likely be in approximately the same direction as in the first round. Do until no new clusters? Adjust, e.g., the projection lines and stopping condition? (A sketch of the projection step follows.)
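A sketch of Algorithm-3's projection step, assuming the 2D example dataset from the figure below (the mean/vom it prints match the slide's (6.3,5.9) and (6,5.5)):

import numpy as np

def mean_vom_projections(X):
    """Project the dataset onto the line through the mean and the
    vector of medians (vom); the scalar projections can then be fed
    to the 1.a / 2.a sweep."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    vom = np.median(X, axis=0)          # coordinate-wise medians
    d = vom - mean
    d = d / np.linalg.norm(d)           # unit direction mean -> vom
    return (X - mean) @ d               # scalar projections

X = [[11,10],[4,9],[2,8],[5,8],[4,6],[10,5],[9,4],[8,3],[7,2],[3,4]]
print(mean_vom_projections(X))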
[Figure: 2D scatter of the example dataset on dim1 × dim2. Points: (11,10)=O, (4,9), (2,8), (5,8), (4,6), (10,5), (9,4), (8,3), (7,2) and (3,4)=0, with labels r1,r2,r3 and v1..v4 marking the two clusters; mean=(6.3,5.9) and vom=(6,5.5); the running mean and median are marked after each projection step. Note: doubletons can be skipped, since for them the mean is always the same as the median.]
Algorithm-4: Project onto the line through the dataset mean and vom; mean=(6.3,5.9), vom=(6,5.5) ((11,10) is an outlier).
4.b: Repeat on any perpendicular line through the mean. (Mean and vom far apart ⇒ multi-modality.)
Algorithm-4.1: 4.b.1: In each cluster, find the 2 points furthest from the line? (Does that require projection be done one point at a time, or can we determine those 2 points in one pTree formula?)
Algorithm-4.2: 4.b.2: Use a grid of unit direction lines, {dv_i | i=1..m}. For each, calculate the mean and vom of the projections of each cluster (except singletons). Take the one for which the separation is maximal.
[Figure: 3D example with mean=(8.18, 3.27, 3.73) and vom=(7,4,3); points labeled by digit triples (435, 524, 504, 545, 323, b43, c63, 752, f72, 924, e43). Steps: 1. no clusters determined yet; 2. (9,2,4) determined as an outlier cluster; 3. using the red dim line, (7,5,2) is determined as an outlier cluster, the maroon points are determined as a cluster, the purple points too; 3.a however, continuing to use the line connecting the (new) mean and vom of the projections onto this plane, would the same be determined?]
Other option? Use (at some judicious point) a p-KMeans-type approach. This could be done using K=2 and a divisive top-down approach (using a GA mutation at various times to get us off a non-convergent track)?
Notes: Each round reduces the dimension by one (a lower bound on the loop). Each round we just need a good line (in the remaining hyperplane) to project the cluster (so far):
1. pick the line through the projected mean and vom (the vom is dependent on the basis used - better way?);
2. pick the line through the longest diameter? (or a diameter ≥ ½ the previous diameter?);
3. try a direction vector, then hill-climb it in the direction of increase in the diameter of the projected set.
From: Mark Silverman [mailto:[email protected]] April 21, 2012 8:22 AM Subject: RE: oblique faust
I’ve been doing some tests, so far not so accurate (I’m still validating the code – I “unhardcoded” it so I can deal with
arbitrary datasets and it’s possible there’s a bug, so far I think it’s ok).
Something rather unique about the test data I am using is that it has four attributes, but for all of the class decisions it is really
one of the attributes driving the classification decision (e.g. for classes 2-10, attribute 2 is dominant decision, class 11
attribute 1 is dominant, etc). I have very wide variability in std deviation in the test data (some very tight, some wider).
Thus, I think that placing “a” on the basis of relative deviation makes a lot of sense in my case (and probably in general).
My assumption is that all I need to do is to modify as follows:
Now: a[r][v] = (Mr + Mv) * d / 2. Changes to: a[r][v] = (Mr + Mv) * d * std(r) / (std(r) + std(s)). Is this correct?
FAUST Oblique (our best classifier?)
D ≡ the vector from m_R to m_V; d = D/|D|.
P_R = P(X∘d < a): 1 pass gives the class-R pTree.
Separate class R using the midpoint-of-means (mom) method: calculate a = (m_R + (m_V − m_R)/2)∘d = ((m_R + m_V)/2)∘d (works also if D is taken from m_V to m_R).
Training ≡ placing cut-hyper-plane(s) (CHP) (an (n−1)-dimensional hyperplane cutting the space in two). Classification is 1 horizontal program (AND/OR) across pTrees, giving a mask pTree for each entire predicted class (all unclassifieds at a time).
Accuracy improvement? Consider the dispersion within classes when placing the CHP. E.g., use the:
1. vector_of_medians, vom, to represent each class rather than the mean m_V, where vom_V ≡ (median{v_1 | v∈V}, median{v_2 | v∈V}, ...);
2. mom_std, vom_std methods: project each class onto the d-line; then calculate the std (one horizontal formula per class using Md's method); then use the std ratio to place the CHP (no longer at the midpoint between m_R and m_V). A sketch follows.
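A sketch of the oblique cut in ordinary vector arithmetic; NumPy stands in for the one-pass pTree computation of X∘d, and the std-ratio placement shown is one plausible reading of the mom_std variant:

import numpy as np

def faust_oblique_mask(X, mR, mV, sR=None, sV=None):
    """FAUST Oblique sketch: d = (mV - mR)/|mV - mR|; classify x as class R
    when x.d < a. mom method: a = ((mR + mV)/2).d. If stds of the
    projections are given, place a by the std ratio instead."""
    X, mR, mV = map(np.asarray, (X, mR, mV))
    D = mV - mR
    d = D / np.linalg.norm(D)
    if sR is None:                        # midpoint-of-means (mom)
        a = ((mR + mV) / 2) @ d
    else:                                 # std-ratio placement of the CHP
        a = mR @ d + ((mV @ d) - (mR @ d)) * sR / (sR + sV)
    return X @ d < a                      # mask pTree: True -> class R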
Note: training (finding a and d) is a one-time process. If we don't have training pTrees, we can use horizontal data for a, d (one time), then apply the formula to test data (as pTrees).
[Figure: dim1 × dim2 scatter of classes r and v with m_R, m_V, vom_R and vom_V marked, and the CHP placed on the d-line between the class means.]
[Figure: the PTreeSet bit matrix: bit-slice columns A1,bw1 ... A1,0, A2,bw2, ..., and category-bitmap columns Ak+1,c ... An,ccn; one bit row per data row 1..N.]
Big Vertical Data: PTreeSet (Dr. G. Wettstein's) is perfect for BVD! (pTrees both horizontal and vertical.)
PTreeSets include methods for horizontal querying and vertical DM, multihop Query/DM, and XML.
T(A1...An) as a PTreeSet data structure = a bit matrix with (typically) each numeric attribute converted to fixed point(?) (negatives?) and bit-sliced (point-position schema), and each categorical attribute bitmapped (or coded then bitmapped, or coded then bit-sliced; or left as-is, i.e., a char(25) NAME column stored outside the PTreeSet?).
With A1..Ak numeric with bitwidths bw1..bwk and Ak+1..An categorical with category counts cck+1..ccn, the PTreeSet is the bit matrix shown above.
Methods for this data structure can provide fast horizontal row access; e.g., an FPGA could (with zero delay) convert each bit row back to the original data row.
Methods already exist to provide vertical (level-0 or raw pTree) access. A sketch of the bit-slicing:
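A minimal sketch of building the bit-slice columns of such a PTreeSet from one numeric column (pure Python; fixed-point and negative-value handling omitted, sample values illustrative):

def bit_slices(column, bitwidth):
    """Vertical decomposition: slice j holds bit j of every value in the
    column, so T(A1..An) becomes a matrix of these slices (plus bitmaps
    for categorical columns)."""
    return [[(v >> j) & 1 for v in column] for j in reversed(range(bitwidth))]

SL = [38, 50, 50, 48, 50]                # e.g., a numeric attribute
for j, sl in enumerate(bit_slices(SL, 6)):
    print(f"P_SL,{5 - j} = {sl}")        # most significant slice first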
[Figure: the corresponding level-1 PTreeSet: the row-number axis is replaced by an interval number 1..roof(N/64); each level-1 pTree holds one predicate truth bit per stride-64 interval of the corresponding bit column.]
Any Level-1 PTreeSet can be added: given any row partition (e.g., an equiwidth=64 row intervalization) and a row predicate (e.g., ≥50% 1-bits).
Add "level-1 only" DM methods; e.g., an FPGA converts unclassified row sets to equiwidth=64, 50% level-1 pTrees, then the entire batch would be FAUST-classified in one horizontal program. Or level-1 pCKNN.
pDGP (pTree Darn Good Protection): permute the column ordering (the permutation = the key). A random pre-pad for each bit column would make it impossible to break the code by simply focusing on the first bit row.
More security? Make all pTrees the same (max) depth, with intron-like pads randomly interspersed...
[Figure: the AHG(P,bpp) bit matrix (1/0 for yes/no): People P (≈7B) × base-pair positions bpp 1, 2, 3, ..., 3B.]
The red person's features are used to define classes. AHG_p pTrees for data mining: we can look for similarity (near neighbors) in a particular chromosome, a particular gene sequence, overall, or anything else.
Order bpp? By chromosome and by gene or region (level-2 is chromosome, level-1 is gene within chromosome)? Do it to facilitate cross-organism bioinformatics data mining?
Create both People and BPP PTreeSets, with a human health-records feature table (a training set for classification and multi-hop ARM).
A comprehensive decomposition (ordering of bpps) for cross-species genomic DM: if there are separate PTreeSets for each chromosome (even each region - gene, intron, exon...), then we may be able to data mine horizontally across all of these vertical pTrees.
AHG [THG/GHG/CHG] is the relationship between People and adenine (A) [thymine (T)/guanine (G)/cytosine (C)].
Relationships (rolodex cards) are 2 PTreeSets: AHGPeoplePTreeSet (shown) and AHGBasePairPositionPTreeSet (a rotation of the one shown).
Vertical Rule Mining, Vertical Multi-hop Rule Mining and Classification/Clustering methods (viewing AHG as either a People table (cols=BPPs) or as a BPP table (cols=People)). MRM and Classification done in combination?
Any table is a relationship between row and column entities (heterogeneous entities) - e.g., an image = a [reflectance-labelled] relationship between a pixel entity and a wavelength-interval entity. Always PTreeSetting both ways facilitates new research and makes horizontal row methods (using FPGAs) instantaneous (1 pass across the row pTree).
Most bioinformatics done so far is not really data mining but is more toward the database-querying side (e.g., a BLAST search).
A radical approach: view the whole Human Genome as 4 binary relationships between People and base-pair positions (ordered by chromosome first, then gene region?).
The PTreeSet Genius for Big Data
[Figure labels: People feature table (pc bc lc cc pe age ht wt); bpp columns ordered by chromosome, then gene.]
Multi-hop Data Mining (MDM): relationship1 (Buys = B(P,I)) ties table1 (People = P) to table2 (Items, with attributes Category, color, size, wt, store, city, state, country), tied by relationship2 (Friends = F(P,P)) to table3 (also P). Can we do clustering and/or classification on one of the tables, using the relationships to define "close" or to define the other notions?
Find all strong rules A→C, A⊆P, C⊆I: Frequent iff ct(P_A) > minsup, and
Confident iff ct(&_{p∈A}P_p & &_{i∈C}P_i) / ct(&_{p∈A}P_p) > minconf.
Says: "a friend of all of A will buy C if all of A buy C." (the AND is always AND.)
Closures: A frequent ⇒ A+ frequent; A→C not confident ⇒ A→C− not confident.
OR variants: ct(|_{p∈A}P_p & &_{i∈C}P_i) / ct(|_{p∈A}P_p) > minconf says "a friend of any in A will buy C if any in A buy C," and
ct(|_{p∈A}P_p & |_{i∈C}P_i) / ct(|_{p∈A}P_p) > minconf changes it to "a friend of any in A will buy something in C if any in A buy C." A count sketch:
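A sketch of these counts with pTrees represented as Python integers (bit k = person k); the F and B values are hypothetical toy data, not from the figure:

from functools import reduce

def ct(pt):                                 # 1-bit count of a pTree
    return bin(pt).count("1")

# hypothetical toy pTrees over 5 people
F = {1: 0b01101, 2: 0b10110, 3: 0b01011}    # F[p] = friends-of-p pTree
B = {"c1": 0b01110, "c2": 0b11010}          # B[i] = bought-item-i pTree

def confidence(A, C, use_and=True):
    """AND version: ct(&_{p in A}F[p] & &_{i in C}B[i]) / ct(&_{p in A}F[p]).
    OR version replaces the antecedent AND with OR ('friend of any in A')."""
    op = (lambda x, y: x & y) if use_and else (lambda x, y: x | y)
    ante = reduce(op, (F[p] for p in A))
    cons = reduce(lambda x, y: x & y, (B[i] for i in C))
    return ct(ante & cons) / ct(ante) if ante else 0.0

print(confidence({1, 2}, {"c1"}), confidence({1, 2}, {"c1"}, use_and=False))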
Define the NearestNeighborVoterSet of {f} using strong R-rules with F in the consequent? A strong cluster based on several self-relationships (different relationships, so it's not just strong implication both ways) strongly implies itself (or strongly implies itself after several hops, or when closing a loop).
[Figure: rolodex-card bit matrices F(P,P)=Friends (People 1..5 × People 1..5) and B(P,I)=Buys (People 1..5 × Items 1..4), with P=People (features pc bc lc cc pe age ht wt) and I=Items.]
A Facebook member, m, purchases item, x, and tells all friends, F≡Friends(M,M) (let's make everyone a friend of him/herself). Each friend responds back with the items, y, s/he bought and liked.
Facebook-Buys: for X⊆I, M_X ≡ &_{x∈X}P_x = the people that purchased everything in X, and F_X ≡ OR_{m∈M_X}F_m = the friends of an M_X person.
[Figure: example F(M,M)=Friends and P≡Purchase(M,I) bit matrices (Members 1..5, Items 1..4) used in the calculations below.]
So, for X={x}: is M_x→x strong, where M_x = OR_{m∈P_x}F_m? It is frequent if ct(M_x) is large - a tractable calculation (take one x at a time and do the OR). It is confident if ct(M_x & P_x) / ct(M_x) > minconf, i.e., ct(OR_{m∈P_x}F_m & P_x) / ct(OR_{m∈P_x}F_m) > minconf.
Example: K2={1,2,4}, P2={2,4}, ct(K2)=3, ct(K2&P2)/ct(K2) = 2/3.
To mine X, start with X={x}; if it is not confident, then no superset is. Closure: go to X={x,y} only for x and y forming confident rules themselves... A sketch:
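A sketch of the one-x-at-a-time OR, again with pTrees as Python ints over toy data (F and P_x here are hypothetical, not the figure's values):

def ct(pt):
    return bin(pt).count("1")

# hypothetical: bit k = member k (5 members)
F = [0b10001, 0b01010, 0b00111, 0b01010, 0b10001]   # F[m] = friends of m
P_x = 0b01010                                       # members who bought x

def mine_item(P_x, F):
    """M_x = OR_{m in P_x} F[m]; the rule M_x -> x is frequent if ct(M_x)
    is large, confident if ct(M_x & P_x)/ct(M_x) > minconf."""
    M_x = 0
    for m in range(P_x.bit_length()):
        if (P_x >> m) & 1:                # m bought x: OR in m's friends
            M_x |= F[m]
    return ct(M_x), ct(M_x & P_x) / ct(M_x)

print(mine_item(P_x, F))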
A Facebook buddy, b, purchases x and tells friends; each friend tells all of his/her friends (the ad reaches friends of friends), with F≡Friends(K,B) relating Kiddos to Buddies and P≡Purchase(B,I). O_gx is frequent if K_x is large; this stays tractable (one x at a time and OR): K_x = the OR over the friends g of the buyers of x (g∈OR_{b∈P_x}F_b).
Strong purchase possible? Intersect rather than union (AND rather than OR): K2={2,4}, P2={2,4}, ct(K2)=2, ct(K2&P2)/ct(K2) = 2/2.
Extending the ad to friends of friends via Compatriots(G,K) (Groupies to Kiddos; also Others(G,K)): K2={1,2,3,4}, P2={2,4}, ct(K2)=4, ct(K2&P2)/ct(K2) = 2/4.
[Figure: example F(K,B), P(B,I) and Compatriots(G,K) bit matrices for these three calculations.]
The Multi-hop Closure Theorem. A hop is a relationship, R, hopping from entities E to F.
downward closure: If a condition is true of A, then it is true for all subsets D of A.
upward closure: If a condition is true of A, then it is true of all supersets D of A.
For a transitive (a+c)-hop strong-rule mine where the focus (count) entity is a hops from the antecedent and c hops from the consequent: if a (or c) is odd/even, then downward/upward closure applies to frequency (confidence). Odd ⇒ downward; even ⇒ upward.
[Figure: a chain A⊆E -R(E,F)- F -S(F,G)- G -T(G,H)- H -U(H,I)- I⊇C with example bit matrices R, S, T, U.]
The proof of the theorem: a pTree, X, is said to be "covered by" a pTree, Y, if for every 1-bit in X there is a 1-bit at that same position in Y.
Lemma-0: For any two pTrees, X and Y, X&Y is covered by X, and ct(X) ≥ ct(X&Y).
Proof-0: ANDing with Y may zero some of X's 1-positions but never turns on any of X's 0-positions.
Lemma-1: Let A⊆B. Then &_{a∈B}X_a is covered by &_{a∈A}X_a.
Proof-1: Let Z = &_{a∈B−A}X_a; then &_{a∈B}X_a = Z & (&_{a∈A}X_a), so the result follows from Lemma-0.
Lemma-2: For a (or c) = 0, frequency and confidence are upward closed.
Proof-2: ct(B) ≥ ct(A), so ct(A) > minsup ⇒ ct(B) > minsup, and ct(C&A)/ct(C) > minconf ⇒ ct(C&B)/ct(C) > minconf.
Lemma-3: If for a (or c) we have upward/downward closure of frequency or confidence, then for a+1 (or c+1) we have downward/upward closure.
Proof-3: Taking the a case with upward closure and going to a+1 with D⊆A, we are removing ANDs in the numerator for both frequency and confidence, so by Lemma-1 the a+1 numerator covers the a numerator, and therefore the a+1 count ≥ the a count. Therefore the condition (frequency or confidence) holds in the a+1 case and we have downward closure.
Lemma-2′: If n is even/odd, then ct(&_{a1}(... &_{a(n−1)}(&_{an∈A}R_an) ... S_a2) T_a1) > threshold is upward/downward closed on A.
Proof-2′: Let A⊆D. Then &_{an∈D}R_an is covered by &_{an∈A}R_an (Lemma-1); therefore &_{a(n−1)∈(&_{an∈D}R_an)}S_{a(n−1)} covers &_{a(n−1)∈(&_{an∈A}R_an)}S_{a(n−1)}; the a(n−2) level is then covered again, the a(n−3) level covers again, and so on - each added AND level flips the covering direction, giving upward closure for even n and downward closure for odd n.
Dear Dr. Perrizo and All,
I think I found a method to calculate the mode of a dataset using pTrees.
Assume we have a dataset that is represented by three pTrees, so the possible data values are 0 to 7. Now if we do the following operations:
F0 = count(P2'&P1'&P0') will give us the frequency of value 0
F1 = count(P2'&P1'&P0 ) will give us the frequency of value 1
F2 = count(P2'&P1 &P0') will give us the frequency of value 2
...
F7 = count(P2 &P1 &P0 ) will give us the frequency of value 7
Now Mode = Max(F0, F1, ..., F7).
The problem with this method is: if we have a large number of pTrees, then there will be a large number of F operations, and each F operation will involve many AND operations. For example, if we have 8 pTrees then we'll have 2^8 = 256 F's, and each F contains 8−1 = 7 AND operations.
I have thought of a solution that may overcome this problem. Assume we have 3 pTrees and value 2 is the mode. So F2 = count(P2'&P1&P0') would give us the maximum F value; assume it is m. Now if we get the counts of all individual components of F2, that is, its subsets (P2', P1, P0', P2'&P1, P2'&P0', P1&P0', P2'&P1&P0'), then all of them must be greater than or equal to m (downward closure property).
So to search for P2'&P1&P0' we can run an Apriori-like algorithm with the singleton itemsets P2, P2', P1, P1', P0, P0'; then form the doubletons P2P1, P2P1', ... etc. Now we need a support value for pruning. Obviously the support should be the mode, but we do not know it ahead of time, so we can set a minimum possible value of the mode as the support. (Note: there cannot be any PiPi' doubleton, as it is 0.)
The minimum value of the mode is Max(1, floor[Datasize/2^n]), where n is the number of pTrees.
Sorry I cannot give an example now, but I can try to give an example on the white board. Thanks.
Sincerely,
Mohammad
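A runnable sketch of the brute-force version described above (pTrees as Python ints over N rows; the Apriori-style pruning is left as described, and the toy values are illustrative):

from itertools import product

def ct(pt):
    return bin(pt).count("1")

def mode_from_ptrees(ptrees, N):
    """ptrees[j] = bit-slice j (most significant first) as an int over N
    rows. Enumerate all 2^n value patterns, AND the slices (complementing
    where the pattern bit is 0), and take the value with the max count."""
    n = len(ptrees)
    full = (1 << N) - 1                      # mask for complementation
    best_v, best_ct = None, -1
    for bits in product([0, 1], repeat=n):   # e.g. (0,1,0) -> value 2
        acc = full
        for slice_, b in zip(ptrees, bits):
            acc &= slice_ if b else (full & ~slice_)
        if ct(acc) > best_ct:
            best_v = int("".join(map(str, bits)), 2)
            best_ct = ct(acc)
    return best_v, best_ct

# toy: values [2,2,5,2,7] -> slices P2,P1,P0 over 5 rows (row i = bit i)
vals = [2, 2, 5, 2, 7]
P = [sum(((v >> j) & 1) << i for i, v in enumerate(vals)) for j in (2, 1, 0)]
print(mode_from_ptrees(P, len(vals)))        # expect (2, 3)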
Given an n-row table, a row predicate (e.g., a bit-slice predicate, or a category map) and a row ordering (e.g., ascending on key; or, for spatial data, column/row raster, Z, or Hilbert order), the sequence of predicate truth bits is the raw or level-0 predicate tree (pTree) for that table, row predicate and row order. A sketch:
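A minimal sketch of a level-0 pTree, using the IRIS sample tabulated below (the predicates are the two from the figure):

def level0_ptree(rows, predicate):
    """Raw (level-0) pTree: the sequence of predicate truth bits over
    the rows, in the given row order."""
    return [int(predicate(r)) for r in rows]

SL    = [38, 50, 50, 48, 50, 51, 56, 57, 54, 57, 73, 64, 72, 74, 67]
color = ["red","blue","red","white","blue","red","red","white","blue",
         "white","white","red","blue","red","red"]
P_SL0 = level0_ptree(SL, lambda v: v % 2 == 1)       # rem(SL/2) = 1
P_red = level0_ptree(color, lambda c: c == "red")    # Color = red
print(P_SL0, P_red)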
IRIS Table:
Name        SL  SW  PL  PW  Color
setosa      38  38  14   2  red
setosa      50  38  15   2  blue
setosa      50  34  16   2  red
setosa      48  42  15   2  white
setosa      50  34  12   2  blue
versicolor  51  24  45  15  red
versicolor  56  30  45  14  red
versicolor  57  28  32  14  white
versicolor  54  26  45  13  blue
versicolor  57  30  42  12  white
virginica   73  29  58  17  white
virginica   64  26  51  22  red
virginica   72  28  49  16  blue
virginica   74  30  48  22  red
virginica   67  26  50  19  red
[Figure: level-0 pTrees on this table in the given order - P_SL,0 (predicate rem(SL/2)=1), P_SL,1 (predicate rem(div(SL/2)/2)=1) and P_Color=red (predicate Color=red) - and their level-1 stride=5 versions under the bit-set predicates pure1, gte25%, gte50% and gte75%.]
Given a raw pTree, P, a partition of it, par, and a bit-set predicate, bsp (e.g., pure1, pure0, gte50%One), the level-1 (par, bsp) pTree is the string of truths of bsp on consecutive partitions of par. If the partition is an equiwidth=m intervalization, it's called the level-1 stride=m bsp pTree. A sketch:
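A sketch of the level-1 construction (default bsp = gte50%One), continuing from the level-0 sketch above:

def level1_ptree(raw, stride, bsp=lambda bits: sum(bits) * 2 >= len(bits)):
    """Level-1 stride=m bsp pTree: one truth bit of the bit-set predicate
    (default gte50%One) per consecutive stride-sized partition of the
    raw (level-0) pTree."""
    return [int(bsp(raw[i:i + stride])) for i in range(0, len(raw), stride)]

P_SL0 = [0,0,0,0,0, 1,0,1,0,1, 1,0,0,0,1]     # level-0 pTree from above
print(level1_ptree(P_SL0, 5))                 # gte50%, stride=5 -> [0, 1, 0]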
[Figure: further examples on the IRIS sample - the gte50% stride=5 pTree of P_PW<7 predicts setosa; P1_gte50%,s=4,SL,0 is the level-1 gte50% stride=4 pTree of the raw P_SL,0 (also shown with stride=2, 8 and 16); applying gte50% stride=2 to that level-1 pTree (a 1-column table) gives the level-2 pTree P2_gte50%,s=4,SL,0.]
FAUST Satlog evaluation. Class means (mn) and stds per band:
        R       G       ir1     ir2   |  R   G  ir1 ir2  (std)
cls 1   62.83   95.29  108.12   89.50 |  8  15  13   9
cls 2   48.84   39.91  113.89  118.31 |  8  13  13  19
cls 3   87.48  105.50  110.60   87.46 |  5   7   7   6
cls 4   77.41   90.94   95.61   75.35 |  6   8   8   7
cls 5   59.59   62.27   83.02   69.95 |  6  12  13  13
cls 7   69.01   77.42   81.59   64.13 |  5   8   9   7
(Per-class true/false positives for the NonOblique level-0 and NonOblique level-1 gte50% classifiers are consolidated in the summary table below.)
Oblique level-0 variants evaluated (per-class TP/FP counts are consolidated in the summary table below):
- using the midpoint of means;
- using means and stds of the projections (without class elimination);
- means and stds of projections, with class elimination in 2345671 order (note that none occurs - counts identical to the run without elimination);
- means and stds of projections, doubling pstd - no elimination;
- doubling pstdr, classify, eliminate in 2,3,4,5,7,1 order;
- doubling pstdr, classify, eliminate in 3,4,7,5,1,2 order.
2s1/(2s1+s2), elimination order 425713: per-class TP/FP in the summary table below.
[Table: for each band (red, green, ir1, ir2) and class, above = (std+std_up)/gap_up and below = (std+std_dn)/gap_dn; the class averages - 4: 2.12, 2: 2.36, 5: 4.03, 7: 4.12, 1: 4.71, 3: 5.27 - suggest elimination order 425713.]
2pstdr: a = pmr + (pmv − pmr)·2pstdr/(2pstdr + pstdv) = (pmr·pstdv + pmv·2pstdr)/(pstdv + 2pstdr).
With 2s1 (vs. s1), the number of FPs is reduced and TPs are somewhat reduced. Better? Parameterize the 2 to maximize TPs and minimize FPs. What is the best parameter?
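A quick numeric check of this cut placement with the weight parameterized (the mean/std pairing in the example is an illustrative pick from the Satlog table above, not a prescribed one):

def cut_a(pmr, pmv, pstdr, pstdv, k=2.0):
    """a = pmr + (pmv - pmr) * k*pstdr / (k*pstdr + pstdv); k=2 is the
    '2s1' variant, k=1 the plain std-ratio. Parameterize k to trade
    TPs against FPs."""
    return pmr + (pmv - pmr) * k * pstdr / (k * pstdr + pstdv)

# e.g., two R-band class means/stds from the table above (illustrative)
print(cut_a(pmr=62.83, pmv=95.29, pstdr=8, pstdv=15))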
Summary (Satlog; classes 1, 2, 3, 4, 5, 7):
method                              1's  2's  3's  4's  5's  7's   tot
actual                         TP:  461  224  397  211  237  470  2000
nonOblique L0 pure1            TP:   99  193  325  130  151  257  1155
nonOblique level-1 gte50%      TP:  212  183  314  103  157  330  1037
                               FP:   14    1   42  103   36  189   385
Oblique L0 MeansMidPoint       TP:  322  199  344  145  174  353  1537
                               FP:   28    3   80  171  107   74   463
Oblique L0 s1/(s1+s2)          TP:  359  205  332  144  175  324  1539
                               FP:   29   18   47  156  131   58   439
2s1/(2s1+s2), no elim          TP:  410  212  277  179  199  324  1601
                               FP:  114   40  113  259  235   58   819
2s1/(2s1+s2), elim 234571      TP:  309  212  277  154  163  248  1363
                               FP:   22   40   65  211  196   27   561
2s1/(2s1+s2), elim 347512      TP:  329  189  277  154  164  307  1420
                               FP:   25    1  113  211  121   33   504
2s1/(2s1+s2), elim 425713      TP:  355  189  277  154  164  307  1446
                               FP:   37   18   14  259  121   33   482
BandClass rules mined: G[0,46]→2, G[47,64]→5, G[65,81]→7, G[81,94]→4, G[94,255]→{1,3}; R[0,48]→{1,2}, R[49,62]→{1,5}, R[82,255]→3; ir1[0,88]→{5,7}; ir2[0,52]→5.
[Table: per-class TP/FP counts for BandClass rule mining.]
Conclusion? MeansMidPoint and Oblique std1/(std1+std2) are best, with the Oblique version slightly better. I wonder how these two methods would work on Netflix?
Two ways:
(u,m): umTrainingTbl = SubUTbl(Support(m), Support(u), m), from UTbl(User, M1,...,M17770);
(m,u): muTrainingTbl = SubMTbl(Support(u), Support(m), u), from MTbl(Movie, U1,...,U480189).
Netflix data: movies {m_k}, k=1..17770; users u_1..u_480189 (or U_2649429). Main table (m, u, r, d): 100,480,507 ratings; on average 209 movies/user and 5655 users/movie.
UserTable(uID, m_1, ..., m_17770): cell (u, m) holds rating r_m,u (with date d_m,u); as a UPTreeSet it is 3×17770 bit slices wide.
MTbl(mID, u_1, ..., u_480189): the rotation of the same data; as an MPTreeSet it is 3×480189 bit slices wide.
[Figure: the two rolodex-card views of the (movie, user, rating, date) data.]
For a (u,m) to be predicted, form umTrainingTbl = SubUTbl(Support(m), Support(u), m). There are lots of 0s in this vector space (umTrainingTbl); we want the largest subtable without zeros. How? SubUTbl(∩_{n∈Sup(u)}Sup(n), Sup(u), m)? Of course, the two supports won't be tight together like that, but they are put that way for clarity.
For the (u,m) to be predicted, from umTrainingTbl = SubUTbl(Support(m), Support(u), m), use coordinate-wise FAUST (not Oblique): in each coordinate, n∈Sup(u), divide up all users v∈Sup(n)∩Sup(m) into their rating classes, rating(m,v); then:
1. calculate the class means and stds; sort the means;
2. calculate the gaps;
3. choose the best gap and define the cutpoint using the stds.
Coordinate FAUST the other way: in each coordinate, v∈Sup(m), divide up all movies n∈Sup(v)∩Sup(u) into rating classes, then apply steps 1-3 as above.
This, of course, may be slow. How can we speed it up? (A sketch of steps 1-3 follows.)
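A sketch of steps 1-3 for one coordinate (the ratings-by-class dict is hypothetical toy data; the std-ratio placement inside the best gap follows the FAUST convention above):

import numpy as np

def best_gap_cutpoint(classes):
    """Steps 1-3: compute each rating class's mean and std, sort by mean,
    take the largest gap between adjacent means, and place the cutpoint
    within that gap by the std ratio."""
    stats = sorted((np.mean(v), np.std(v)) for v in classes.values() if len(v))
    gaps = [stats[i + 1][0] - stats[i][0] for i in range(len(stats) - 1)]
    i = int(np.argmax(gaps))
    (m1, s1), (m2, s2) = stats[i], stats[i + 1]
    return m1 + gaps[i] * s1 / (s1 + s2)     # cutpoint inside the best gap

# hypothetical ratings-by-class for one coordinate n in Sup(u)
classes = {1: [1, 2, 1], 3: [3, 3, 4], 5: [5, 4, 5]}
print(best_gap_cutpoint(classes))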
Gaps alone are not best (especially since the sum of the gaps is no more than 4 and there are 4 gaps). Weighting (correlation(m,n)-based) is useful (the higher the correlation, the more significant the gap?).
Cutpoints are constructed for just this one prediction, rating(u,m). Does it make sense to find all of them? Should we just find, e.g., which n-class-mean(s) rating(u,n) is closest to and make those the votes?