I’m working on implementing this. Here’s where I am so far: I get the furthest point using the HOB method, and I can construct a line and start looking for gaps.
1. Looking at this dataset, will we be lumping the cluster in the upper right with the cluster at the bottom?
2. Has anyone given thought to what constitutes a “gap”?

Subject: what we need to show
1. Clustering, automatic determination of k, anomaly detection (fmg?). Show on normal datasets.
2. Apply (1) to text: English (Reuters) and finally Arabic (I have a dataset and a translator).
The net will be a demonstration across a large segment of the government: DoD, Intel Community, DHS, other agencies. This is for all the marbles. (I had to throw in Arabic to win; let’s progress on what we can and deal with it when we get there.) We can tweak, change, or modify our goals as we progress with the experiment, so don’t get completely caught up in worrying if the approach doesn’t work; we can adjust. It is in their interest as well as ours to see this succeed.
Mark Silverman
Treeminer, Inc.
[email protected]
(240) 389-0750

Why fM? In many cases, e.g., when anomalies are rare, f will show up as an anomaly in the very first round.
If there are no gaps on the fM line: let fM = (a, b, c, d, ...). Next use the line (b, -a, 0, 0, ...). Then use (a, b, -c/(a^2+b^2), 0, ...).
f = p1, projection values x o fM, T = 2^3.
Illustration of the first round of finding gaps (f = p1):

X    x1  x2  xofM
p1    1   1    11
p2    3   1    27
p3    2   2    23
p4    3   3    34
p5    6   2    53
p6    9   3    80
p7   15   1   118
p8   14   2   114
p9   15   3   125
pa   13   4   114
pb   10   9   110
pc   11  10   121
pd    9  11   109
pe   11  11   125
pf    7   8    83

pTree gap finder using PTreeSet(xofM). The xofM values are held as bit slices p6 ... p0 with complements p6' ... p0'. The slide ANDs the four high-order slices or their complements (p6'p5'p4'p3' through p6 p5 p4 p3), one mask for each interval of width 2^3 = 8.
[Slide table: bit slices p6..p0 and p6'..p0' of xofM and the sixteen high-order masks; the bit columns are not recoverable from the extraction.]

Gaps found:
width = 2^3 = 8 gap:  [000 0000, 000 0111] = [0,8)
width = 2^3 = 8 gap:  [010 1000, 010 1111] = [40,48)
width = 2^3 = 8 gap:  [011 1000, 011 1111] = [56,64)
width = 2^4 = 16 gap: [100 0000, 100 1111] = [64,80)
width = 2^4 = 16 gap: [101 1000, 110 0111] = [88,104)

OR between gaps 1 and 2 for cluster C1 = {p1,p3,p2,p4}.
OR between gaps 2 and 3 for cluster C2 = {p5}.
OR between gaps 3 and 4 for cluster C3 = {p6,pf}.
OR for cluster C4 = {p7,p8,p9,pa,pb,pc,pd,pe}.

For FAUST SMM (Oblique), do a similar thing on the MrMv line. Record the number of r and v errors if RtEndPt is used to split. Take the RtEndPt where the sum is minimal. This parallelizes easily. Useful in pTree sorting?

1. MapReduce FAUST: Current_Relevancy_Score=9, Killer_Idea_Score=2
Nothing comes to mind as to what we would do here. MapReduce/Hadoop is a key-value approach to organizing complex BigData. In FAUST PREDICT/CLASSIFY we start with a Training TABLE, and in FAUST CLUSTER/ANOMALIZER we start with a vector space. Mark suggests (my understanding) capturing pTreeBases as Hadoop/MapReduce key-value bases. I suggested to Arjun developing XML to capture Hadoop datasets as pTreeBases. The former is probably wiser. A wish list of great things that might result would be a good start.
2. pTree Text Mining: Current_Relevancy_Score=10, Killer_Idea_Score=9
I think Oblique FAUST is the way to do this. There is also the very new idea of capturing the reading sequence, not just the term-frequency matrix (lossless capture) of a corpus.
3. FAUST CLUSTER/ANOMALIZER: Current_Relevancy_Score=9, Killer_Idea_Score=9
No one has taken up the proof that this is a breakthrough method. The applications are unlimited!
4. Secure pTreeBases: Current_Relevancy_Score=9, Killer_Idea_Score=10
This seems straightforward and a certainty (to be a killer advance)! It would involve becoming the world expert on what data security really means and how it has been done by others, and then comparing our approach to theirs. Truly a complete career is waiting for someone here!
5. FAUST PREDICTOR/CLASSIFIER: Current_Relevancy_Score=9, Killer_Idea_Score=10
No one has done a complete analysis showing that this is a breakthrough method. The applications are unlimited here too!
6. pTree Algorithmic Tools: Current_Relevancy_Score=10, Killer_Idea_Score=10
This is Md’s work. Expanding the algorithmic tool set to include quadratic tools and even higher-degree tools is very powerful. It helps us all!
7. pTree Alternative Algorithm Impl: Current_Relevancy_Score=9, Killer_Idea_Score=8
This is Bryan’s work. Implementing pTree algorithms in hardware/firmware (e.g., FPGAs): orders of magnitude performance improvement?
8. pTree O/S Infrastructure: Current_Relevancy_Score=10, Killer_Idea_Score=10
This is Matt’s work. I don’t yet know the details, but Matt, under the direction of Dr. Wettstein, is finishing up his thesis on this topic: such changes as very large page sizes, cache sizes, prefetching, ... I give it a 10/10 because I know the people; they always do double-digit work!

From: [email protected]
Sent: Thurs, Aug 09
Dear Dr. Perrizo,
Do you think a MapReduce class of FAUST algorithms could be built into a thesis? If the ultimate aim is to process big data, modifying the existing P-tree based FAUST algorithms for the Hadoop framework could be something to look into. I am not sure myself how far I can go, but if you approve, I can work on it.

From: Mark
To: Arjun
Aug 9
From an industry perspective, Hadoop is king (at least at this point in time).
I believe vertical data organization maps really well onto a map/reduce approach. The two are complementary, as Hadoop is organized more for unstructured data, so these topics are not mutually exclusive. So from the industry side I’d vote Hadoop; from the Treeminer side, text (although we are very interested in both).

From: [email protected]
Sent: Friday, Aug 10
I’m working through a list of what we need to get done. It will include implementing anomaly detection, which has been on my list for some time. I tried to establish a number of things such that even if we had difficulties with some parts we could show others (without digging us in too deep). Once I have this I’ll get a call going. I have another programming resource down here who has been working with me on our production code and who will also be picking up some of the work to get this across the finish line, and I also have someone who was previously a director at our customer assisting us in packaging it all up so the customer will perceive value received. I think Dale sounded happy yesterday.

pTree Text Mining
[Slide figure: the corpus pTreeSet as a data cube of Vocab Terms x Reading Position x Document (JSE, HHS, LMM, ...). Recoverable labels:]
lev0: corpusP (len = MaxDocLen*DocCt*VocabLen), one bit per term, document and reading position.
lev1: (len = DocCt*VocabLen) term existence te and term frequency tf per term and document, with tf bit slices tfPk; e.g., pred for tfP0: mod(sum(mdl-stride),2)=1.
lev2: (len = VocabLen) document frequency df count per term, with bit slices dfP0 ... dfP3 and hdfP; pred = pure1 on the tfP1 stride.
Library of Congress masks (document categories, e.g., a Math book mask) move us up the document semantic hierarchy.
Reading position masks (position categories, e.g., Preface, References, commas) move us up the position semantic hierarchy (and allow for placement of punctuation, etc.).
ptf: positional term frequency, the frequency of each term in each position across all documents. (Is this any good?)

Term-document matrix X (rows are terms, columns are documents):

tm\doc     c1 c2 c3 c4 c5 m1 m2 m3 m4
human       1  0  0  1  0  0  0  0  0
interface   1  0  1  0  0  0  0  0  0
computer    1  1  0  0  0  0  0  0  0
user        0  1  1  0  1  0  0  0  0
system      0  1  1  2  0  0  0  0  0
response    0  1  0  0  1  0  0  0  0
time        0  1  0  0  1  0  0  0  0
EPS         0  0  1  1  0  0  0  0  0
survey      0  1  0  0  0  0  0  0  1
trees       0  0  0  0  0  1  1  1  0
graph       0  0  0  0  0  0  1  1  1
minors      0  0  0  0  0  0  0  1  1

c1 Human machine interface for Lab ABC computer apps
c2 A survey of user opinion of comp system response time
c3 The EPS user interface management system
c4 System and human system engineering testing of EPS
c5 Relation of user-perceived response time to error measmnt
m1 The generation of random, binary, unordered trees
m2 The intersection graph of paths in trees
m3 Graph minors IV: Widths of trees and well-quasi-ordering
m4 Graph minors: A survey

X = T0 S0 D0^T, with T0 and D0 column-orthonormal. X^ keeps only the first 2 singular values: X ~ X^ = T S D^T. The corresponding columns of T and D give term and document coordinates in 2D.

T0 (rows are terms; the first two columns form T):
human      .22 -.11  .29 -.41 -.11 -.34  .52 -.06 -.41
interface  .20 -.07  .14 -.55  .28  .50 -.07 -.01 -.11
computer   .24  .04 -.16 -.59 -.11 -.25 -.30  .06  .49
user       .40  .06 -.34  .10  .33  .38  .00  .00  .01
system     .64 -.17  .36  .33 -.16 -.21 -.17  .03  .27
response   .27  .11 -.43  .07  .08 -.17  .28 -.02 -.05
time       .27  .11 -.43  .07  .08 -.17  .28 -.02 -.05
EPS        .30 -.14  .33  .19  .11  .27  .03 -.02 -.17
survey     .21  .27 -.18 -.03 -.54  .08 -.47 -.04 -.58
trees      .01  .49  .23  .03  .59 -.39 -.29  .25 -.23
graph      .04  .62  .22  .00 -.07  .11  .16 -.68  .23
minors     .03  .45  .14 -.01 -.30  .28  .34  .68  .18

S0 = diag(3.34, 2.54, 2.35, 1.64, 1.50, 1.31, 0.85, 0.56, 0.36); S = diag(3.34, 2.54).

D0 (rows are documents, as printed on the slide):
c1  .02 -.06  .11 -.95  .05 -.08  .18 -.01 -.06
c2  .61  .17 -.05 -.03 -.21 -.26 -.43  .05  .24
c3  .46 -.03  .21  .04  .38  .72 -.24  .01  .02
c4  .54 -.23  .57  .27 -.21 -.37  .26 -.02 -.08
c5  .28  .11 -.51  .15  .33  .03  .67 -.06 -.26
m1  .00  .19  .10  .02  .39 -.30 -.34  .45 -.62
m2  .01  .44  .19  .02  .35 -.21 -.15 -.76  .02
m3  .02  .62  .25  .01  .15  .00  .25  .45  .52
m4  .08  .53  .08 -.03 -.60  .36 -.04 -.07 -.45

Query q contains the terms human and computer (q = 1 0 1 0 0 0 0 0 0 0 0 0 over the twelve terms).

Distances d(doc,q) in the full space, using X:
c1 1.00, c2 2.45, c3 2.45, c4 2.45, c5 2.24, m1 1.73, m2 2.00, m3 2.24, m4 2.24
This tells us c1 is closest to q in the full space, but that the other c documents are no closer than the m documents. q is probably classified c (one voter in the 1.5 neighborhood), but it's not clear. This shows the need for SVD or Oblique FAUST!

Oblique FAUST on X, with class means
mc = (0.4, 0.4, 0.4, 0.6, 0.8, 0.4, 0.4, 0.4, 0.2, 0, 0, 0)
mm = (0, 0, 0, 0, 0, 0, 0, 0, 0.25, 0.75, 0.75, 0.5)
D = mc->mm, d = D/|D|, a = (mc+mm)/2 o d.
[Slide spreadsheet: intermediate columns for D, d and (mc+mm)/2; not recoverable.]
Since q o d = .65 is far less than a =~ 205, q is clearly in the c class.

X^ (rank-2 approximation):

term        c1    c2    c3    c4    c5    m1    m2    m3    m4
human      0.16  0.40  0.38  0.47  0.18 -0.05 -0.12 -0.16 -0.09
interface  0.14  0.37  0.33  0.40  0.16 -0.03 -0.07 -0.10 -0.04
computer   0.15  0.51  0.36  0.41  0.24  0.02  0.06  0.09  0.12
user       0.26  0.84  0.61  0.70  0.39  0.03  0.08  0.12  0.19
system     0.45  1.23  1.05  1.27  0.56 -0.07 -0.15 -0.21 -0.05
response   0.16  0.58  0.38  0.42  0.28  0.06  0.13  0.19  0.22
time       0.16  0.58  0.38  0.42  0.28  0.06  0.13  0.19  0.22
EPS        0.22  0.55  0.51  0.63  0.24 -0.07 -0.14 -0.20 -0.11
survey     0.10  0.53  0.23  0.21  0.27  0.14  0.31  0.44  0.42
trees     -0.06  0.23 -0.14 -0.27  0.14  0.24  0.55  0.77  0.66
graph     -0.06  0.34 -0.15 -0.30  0.20  0.31  0.69  0.98  0.85
minors    -0.04  0.25 -0.10 -0.21  0.15  0.22  0.50  0.71  0.62

q^ = TSDq'. Distances d(doc,q^):
c1 0.02, c2 1.47, c3 0.90, c4 1.23, c5 0.50, m1 0.92, m2 1.40, m3 1.82, m4 1.55

Using kNN, this SVD-transformed picture puts q clearly in the c class:
dis=.25  nbrs {c1}
dis=.5   nbrs {c1,c5}
dis=.9   nbrs {c1,c3,c5}
dis=1    nbrs {c1,c3,c5,m1}
dis=1.25 nbrs {c1,c3,c4,c5,m1}
dis=1.5  nbrs {c1,c2,c3,c4,c5,m1,m2}
Un-SVD transformed, it's not conclusive (i.e., using X instead of X^).

(Oblique FAUST on X^: D = mc->mm, d = D/|D|, a = (mc+mm)/2 o d = 33.00 and q o d = 2.61. Since 2.61 << a, q is classified as c. And we note that O'FAUST is more conclusive with X than it is with X^.)

I have put together a pBase of 75 Mother Goose Rhymes or Stories. I created a pBase of 15 of the documents with 30 words each (Universal Document Length, UDL), using all white-space separated strings as the vocabulary.

Vocabulary (positions 1-24 of 182): a, again., all, always, an, and, apple, April, are, around, ashes,, away, away, away., baby, baby., bark!, beans, beat, bed,, Beggars, begins., beside, between, ..., (182) your.

Lev-1 (term existence te and term frequency tf, with bit slices tf1, tf0). Rows not listed are all zero.

Little Miss Muffet:
pos VOCAB   te tf tf1 tf0
 1  a        1  2   1   0
 6  and      1  3   1   1
14  away.    1  1   0   1
23  beside   1  1   0   1

Humpty Dumpty:
pos VOCAB   te tf tf1 tf0
 1  a        1  2   1   0
 2  again.   1  1   0   1
 3  all      1  2   1   0
 6  and      1  1   0   1

Lev-0 has one bit column per reading position and one row per vocabulary term, with a 1 where the term occurs at that position. The reading positions for Little Miss Muffet are: Little Miss Muffet sat on a tuffet eating of curds and whey. There came a big spider and sat down... For Humpty Dumpty (05HDS) they are: Humpty Dumpty sat on a wall. Humpty Dumpty...
[Slide table: Lev-0 bit columns; not recoverable from the extraction.]

Level-2 pTrees (document frequency, with bit slices df3 df2 df1 df0):
VOCAB              df  df3 df2 df1 df0
a                   8    1   0   0   0
all                 3    0   0   1   1
and                13    1   1   0   1
away (pos 12, 13)   2    0   0   1   0
all other terms     1    0   0   0   1

Next we look at using only content words (this reduces VocabSize to 8 and CorpusSize to 11).
[Slide table: term existence columns te04, te05, te08, te09, te27, te29, te34; not recoverable.]
In this slide section, the vocabulary is reduced to content words (8 of them): mdl=5, vocab={baby,cry,dad,eat,man,mother,pig,shower}, VocabLen=8, and there are 11 docs of the 15 (11 survivors of the content word reduction).
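The three pTree levels described above (Level-0 position bits, Level-1 roll-up over positions, Level-2 roll-up over documents) can be sketched in a few lines. This is only an illustration under stated assumptions: plain Python lists stand in for compressed pTrees, the two documents are the truncated snippets visible on the slides rather than the full rhymes, and the names `build_levels` and `bit_slices` are hypothetical.

```python
# Illustrative sketch only: plain lists stand in for compressed pTrees.
# The two snippets are the truncated texts visible on the slides.
DOCS = {
    "04LMM": "Little Miss Muffet sat on a tuffet eating of curds and whey. "
             "There came a big spider and sat down",
    "05HDS": "Humpty Dumpty sat on a wall. Humpty Dumpty",
}


def build_levels(docs):
    """Return (vocab, level0, tf, te, df) for a dict of doc_id -> text."""
    doc_ids = sorted(docs)
    # Vocabulary = all white-space separated strings, as in the slides.
    tokens = {d: docs[d].split() for d in doc_ids}
    vocab = sorted({t for toks in tokens.values() for t in toks})
    max_len = max(len(toks) for toks in tokens.values())

    # Level 0: one bit per (term, document, reading position).
    level0 = {
        (t, d): [1 if p < len(tokens[d]) and tokens[d][p] == t else 0
                 for p in range(max_len)]
        for t in vocab for d in doc_ids
    }
    # Level 1: roll up reading positions -> term frequency and existence.
    tf = {key: sum(bits) for key, bits in level0.items()}
    te = {key: int(count > 0) for key, count in tf.items()}
    # Level 2: roll up documents -> document frequency.
    df = {t: sum(te[(t, d)] for d in doc_ids) for t in vocab}
    return vocab, level0, tf, te, df


def bit_slices(values, width):
    """Split non-negative ints into `width` bit slices, high-order first."""
    return [[(v >> k) & 1 for v in values] for k in reversed(range(width))]


VOCAB, LEVEL0, TF, TE, DF = build_levels(DOCS)
TF_LMM = [TF[(t, "04LMM")] for t in VOCAB]
TF1_LMM, TF0_LMM = bit_slices(TF_LMM, 2)  # the tf1 and tf0 slices
```

In this sketch `TF[("a", "04LMM")]` plays the role of the tf entry for "a" in the Little Miss Muffet table, and `TF1_LMM`, `TF0_LMM` are its bit slices across the vocabulary.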
First Content Word Mask (FCWM)

Level-0 (ordered by position, document, then vocab): one bit per term and reading position. There are 5 reading positions per document, so the 11 documents occupy positions 1-55: 04LMM 1-5, 05HDS 6-10, 08JSC 11-15, 09HBD 16-20, 27CBC 21-25, 29LFW 26-30, 46TTP 31-35, 53NAP 36-40, 54BOF 41-45, 71MWA 46-50, 73SSW 51-55.
[Slide table: Level-0 bit rows for baby, cry, dad, eat, man, mother, pig, shower over positions 1-55, and the FCW mask; the bit columns are not recoverable from the extraction.]

Level-1 (roll up position of Level-0): term frequency tf per term and document. Term existence te is 1 wherever tf > 0; tf1 and tf0 are the bit slices of tf.

tf      04LMM 05HDS 08JSC 09HBD 27CBC 29LFW 46TTP 53NAP 54BOF 71MWA 73SSW
baby      0     0     0     1     1     0     0     0     0     0     0
cry       0     0     0     0     2     0     1     0     0     0     0
dad       0     0     0     1     0     1     0     0     0     0     0
eat       1     0     2     0     0     0     1     0     0     0     0
man       0     1     0     0     0     0     0     1     0     0     0
mother    0     0     0     0     0     1     1     1     0     0     0
pig       0     0     0     0     0     0     0     0     2     0     1
shower    0     0     0     0     0     0     0     0     0     1     1

Level-2 (roll up document of Level-1): document frequency df with bit slices df1, df0.

      baby cry dad eat man mother pig shower
df      2   2   2   3   2    3     2    2
df1     1   1   1   1   1    1     1    1
df0     0   0   0   1   0    1     0    0

Take a very simple task: clustering the vocabulary by document frequency. Each cluster contains words of roughly equal importance, assuming that the more frequently a term occurs, the less important it is.

In the original data:
      baby cry dad eat man mother pig shower
df      2   2   2   3   2    3     2    2
The clusters in decreasing order of importance are: {shower, pig, man, dad, cry, baby} {mother, eat}

In the FCW-filtered data:
      baby cry eat man mother pig shower
df      1   1   2   2    1     2    2
df1     0   0   1   1    0     1    1
df0     1   1   0   0    1     0    0
The clusters in decreasing order of importance are: {mother, cry, baby} {shower, pig, man, eat}

One could argue that the latter is a better clustering. Crying, babies and mothers are strongly associated? Men, pigs, eating and needing a shower are strongly associated?
The point of this is to demonstrate (suggest?) that there may be new information in the expanded view of the text corpus that we take (not just starting from the tf matrix but including the reading sequences as well).
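The df-based clustering above amounts to grouping terms by equal document frequency and ordering the groups from low df (more important) to high df. A minimal sketch, assuming the two df tables from the slide as input; the function name `clusters_by_df` is hypothetical and the exact-equality grouping is the simplest reading of "relatively equal importance":

```python
from collections import defaultdict

# Document frequencies as listed on the slide.
ORIGINAL_DF = {"baby": 2, "cry": 2, "dad": 2, "eat": 3,
               "man": 2, "mother": 3, "pig": 2, "shower": 2}
FCW_DF = {"baby": 1, "cry": 1, "eat": 2, "man": 2,
          "mother": 1, "pig": 2, "shower": 2}


def clusters_by_df(df):
    """Group terms with equal df; lowest df (most important) comes first."""
    groups = defaultdict(list)
    for term, count in df.items():
        groups[count].append(term)
    return [sorted(groups[count]) for count in sorted(groups)]


ORIGINAL_CLUSTERS = clusters_by_df(ORIGINAL_DF)
FCW_CLUSTERS = clusters_by_df(FCW_DF)
```

With real corpora the df values would spread over a wider range, so a gap-based split of the df values (as in FAUST CLUSTER) would likely replace the exact-equality grouping used here.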
I'm sure others have considered an "abstract only" or "executive summary only" data mine, but horizontal structuring does not yield that input readily, whereas our pTree approach does (just by applying the "abstract" mask). In the general case, an additional weighting (other than the usual inverse-of-df type weightings of term importance within the corpus) could include the (inverse of the) position number of the first occurrence of the term (normalized), or even the (inverse of the) weighted average of the position numbers (or relative position numbers, since documents have different lengths).

[Slide tables: the Level-0, Level-1 (te, tf, tf1, tf0) and Level-2 (df, df1, df0) pTrees for the content-word corpus, repeated from the previous slide.]

APPENDIX:
Latent semantic indexing (LSI) is an indexing and retrieval method that uses singular value decomposition (SVD) to find patterns in the relationships between terms and concepts in text. LSI is based on the principle that words used in the same contexts tend to have similar meanings.
A key LSI feature is the ability to extract the conceptual content of a body of text by establishing associations between terms that occur in similar contexts.[1]
LSI overcomes synonymy and polysemy, which cause mismatches in information retrieval [3] and cause Boolean keyword queries to fail.
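LSI rests on a truncated SVD of the term-document matrix. The sketch below applies it to the 12-term, 9-document example matrix X used earlier in these notes and keeps the two largest singular values. It is an illustration under stated assumptions: it uses pure-Python power iteration with deflation in place of a library SVD routine (a swapped-in technique chosen only to keep the example dependency-free), and the function names are hypothetical.

```python
import math

TERMS = ["human", "interface", "computer", "user", "system", "response",
         "time", "EPS", "survey", "trees", "graph", "minors"]
DOC_IDS = ["c1", "c2", "c3", "c4", "c5", "m1", "m2", "m3", "m4"]
X = [
    [1, 0, 0, 1, 0, 0, 0, 0, 0],
    [1, 0, 1, 0, 0, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 1, 0, 0, 0, 0],
    [0, 1, 1, 2, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 1, 0, 0, 0, 0],
    [0, 1, 0, 0, 1, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 0, 1, 1, 1],
    [0, 0, 0, 0, 0, 0, 0, 1, 1],
]


def matvec(a, v):
    return [sum(x * y for x, y in zip(row, v)) for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def norm(v):
    return math.sqrt(sum(x * x for x in v))


def top_singular_triplets(a, k, iters=500):
    """k largest (sigma, u, v) triplets via power iteration with deflation."""
    a = [[float(x) for x in row] for row in a]
    triplets = []
    for _rank in range(k):
        at = transpose(a)
        v = [1.0 / (j + 1) for j in range(len(at))]  # arbitrary nonzero start
        for _step in range(iters):
            w = matvec(at, matvec(a, v))  # one step of A^T A v
            length = norm(w)
            v = [x / length for x in w]
        av = matvec(a, v)
        sigma = norm(av)
        u = [x / sigma for x in av]
        triplets.append((sigma, u, v))
        # Deflate: remove the component just found before the next pass.
        a = [[a[i][j] - sigma * u[i] * v[j] for j in range(len(v))]
             for i in range(len(u))]
    return triplets


def rank_k_approx(a, k):
    """X^ = sum of the k leading sigma * u * v^T terms."""
    trip = top_singular_triplets(a, k)
    return [[sum(s * u[i] * v[j] for s, u, v in trip)
             for j in range(len(a[0]))] for i in range(len(a))]


SIGMAS = [s for s, _, _ in top_singular_triplets(X, 2)]
X_HAT = rank_k_approx(X, 2)
```

Here `SIGMAS` corresponds to the first two diagonal entries of S0 and `X_HAT` to the X^ table. Singular vectors are only determined up to sign, so individual T and D entries may differ in sign from the slide while X^ itself does not.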
LSI also performs automatic document categorization, i.e., assignment of docs to predefined categories based on similarity to the conceptual content of the categories.[5] LSI uses example docs as the conceptual basis for each category: the concepts in the docs to be categorized are compared to the concepts contained in the example items, and a category (or categories) is assigned to the docs based on the similarities between the concepts they contain and the concepts contained in the example docs.

Mathematics of LSI (linear algebra techniques to learn the conceptual correlations in a collection of text): construct a weighted term-document matrix, do a Singular Value Decomposition on it, and use that to identify the concepts contained in the text.
Term-Document Matrix, A: each of the m terms is represented by a row and each of the n docs by a column, with each matrix cell a_ij initially holding the number of times the associated term appears in the indicated document, tf_ij. This matrix is usually large and very sparse.
SVD reduces the dimensionality of the matrix to a tractable size by finding the singular values. It involves matrix operations and may not be amenable to pTree operations (i.e., horizontal methods are highly developed and may be best). We should study it, though, to see if we can identify a pTree-based breakthrough for creating the reduction that SVD achieves.
Here is a good paper on the subject of LSI and SVD: http://www.cob.unt.edu/itds/faculty/evengelopoulos/dsci5910/LSA_Deerwester1990.pdf

SVD: Let $X$ be the $t \times d$ term frequency (tf) matrix. It can be decomposed as $X = T_0 S_0 D_0^T$, where $T_0$ and $D_0$ have orthonormal columns and $S_0$ has only the singular values on its diagonal, in descending order. Remove from $T_0, S_0, D_0$ the rows and columns of all but the highest $k$ singular values, giving $T, S, D$. Then
$X \approx \hat{X} \equiv T S D^T$ ($\hat{X}$, written X^ on the slides, is the rank-$k$ matrix closest to $X$).
We have reduced the dimension from $\mathrm{rank}(X)$ to $k$, and we note that $\hat{X}\hat{X}^T = T S^2 T^T$ and $\hat{X}^T\hat{X} = D S^2 D^T$.

There are three sorts of comparisons of interest:
1. terms (how similar are terms i and j?) (comparing rows)
2. documents (how similar are documents i and j?) (comparing columns)
3. terms and documents (how associated are term i and doc j?) (examining individual cells)

Comparing terms (how similar are terms i and j?) (comparing rows): the dot product between two rows of $\hat{X}$ reflects their similarity (similar occurrence pattern across the documents). $\hat{X}\hat{X}^T$ is the square $t \times t$ symmetric matrix containing all these dot products. Since $\hat{X}\hat{X}^T = T S^2 T^T$, the $ij$ cell of $\hat{X}\hat{X}^T$ is the dot product of rows $i$ and $j$ of $TS$ (the rows of $TS$ can be considered coordinates of terms).

Comparing documents (how similar are documents i and j?) (comparing columns): the dot product of two columns of $\hat{X}$ reflects their similarity (the extent to which two documents have a similar profile of terms). $\hat{X}^T\hat{X}$ is the square $d \times d$ symmetric matrix containing all these dot products. Since $\hat{X}^T\hat{X} = D S^2 D^T$, the $ij$ cell of $\hat{X}^T\hat{X}$ is the dot product of rows $i$ and $j$ of $DS$ (the rows of $DS$ can be considered coordinates of documents).

Comparing a term and a document (how associated are term i and document j?) (analyzing cell i,j of $\hat{X}$): since $\hat{X} = T S D^T$, cell $ij$ is the dot product of the $i$th row of $T S^{1/2}$ and the $j$th row of $D S^{1/2}$.

FAUST = Fast, Accurate Unsupervised and Supervised Teaching (teaching big data to reveal information)

FAUST CLUSTER-fmg (furthest-to-mean gaps, for finding round clusters): start with C = X (e.g., X ≡ {p1, ..., pf}, the 15 pix dataset).
1. While an incomplete cluster C remains, find M ≡ Medoid(C) (Mean or Vector_of_Medians or ?).
2. Pick f ∈ C furthest from M, using S ≡ SPTreeSet(D(x,M)) (e.g., for the HOBbit-furthest f, take any point from the highest-order S-slice).
3. If ct(C)/dis^2(f,M) > DT (DensThresh), C is complete; else split C wherever P ≡ PTreeSet(c o fM/|fM|) has a gap > GT (GapThresh).
4. End While.
Notes: a. Euclidean or HOBbit furthest. b. fM/|fM| or just fM in P. c. Find gaps by sorting P, or by an O(log n) pTree method?

In the first round on the 15-point example, C2 = {p5} is complete (singleton = outlier). C3 = {p6,pf} will split (details omitted), so {p6} and {pf} are complete (outliers).
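The first round of FAUST CLUSTER-fmg on the 15-point dataset can be sketched as follows. This is a minimal illustration under stated assumptions: M is the mean, f is the Euclidean-furthest point, the projection uses the unnormalized fM (note b) rounded to integers, plain integers stand in for pTree bit slices, and a gap is any empty aligned interval of width 2^3 on the projection line. The function names are hypothetical.

```python
import math

POINTS = {
    "p1": (1, 1), "p2": (3, 1), "p3": (2, 2), "p4": (3, 3), "p5": (6, 2),
    "p6": (9, 3), "p7": (15, 1), "p8": (14, 2), "p9": (15, 3), "pa": (13, 4),
    "pb": (10, 9), "pc": (11, 10), "pd": (9, 11), "pe": (11, 11), "pf": (7, 8),
}


def mean(points):
    """Coordinate-wise mean of a dict of name -> (x1, x2)."""
    n = len(points)
    return tuple(sum(p[i] for p in points.values()) / n for i in range(2))


def furthest_from(points, m):
    """Name of the point with the largest Euclidean distance to m."""
    return max(points, key=lambda name: math.dist(points[name], m))


def project(points, f, m):
    """Rounded dot product x o fM, where fM = M - f (not normalized)."""
    fm = (m[0] - f[0], m[1] - f[1])
    return {name: round(x * fm[0] + y * fm[1])
            for name, (x, y) in points.items()}


def split_at_gaps(proj, gap_bits=3):
    """Split wherever an aligned interval of width 2**gap_bits is empty."""
    clusters, last_prefix = [], None
    for name, value in sorted(proj.items(), key=lambda item: item[1]):
        prefix = value >> gap_bits  # the high-order bits select the interval
        if last_prefix is None or prefix - last_prefix > 1:
            clusters.append([])  # at least one empty interval was skipped
        clusters[-1].append(name)
        last_prefix = prefix
    return clusters


M = mean(POINTS)
FURTHEST = furthest_from(POINTS, M)
XOFM = project(POINTS, POINTS[FURTHEST], M)
CLUSTERS = split_at_gaps(XOFM)
```

On this dataset the furthest point is p1 and the projections match the xofM column of the slide, so the split reproduces C1 through C4. A pTree implementation would detect the empty intervals by ANDing high-order bit slices rather than by sorting, which is what note c alludes to.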
After the first round, C1 = {p1,p2,p3,p4} and C4 = {p7,p8,p9,pa,pb,pc,pd,pe} are still incomplete.
C1 is dense ( density(C1) = ~4/22=.5 > DT=.3 ?), thus C1 is complete: M1 6.3, f1 = p3, C1 doesn't split (complete).
Applying the algorithm to C4: {pa} is an outlier. C2 splits into {p9}, {pb,pc,pd} complete.
In both cases those are probably the best "round" clusters, so the accuracy seems high. The speed will be very high!

[Slide figure: 16 x 16 grid plot of p1 ... pf with the means M, M0 (8.3), M1, M4, the furthest point f, and the clusters C1 ... C4 marked.]

X    p1 p2 p3 p4 p5 p6 p7 p8 p9 pa pb pc pd pe pf
x1    1  3  2  3  6  9 15 14 15 13 10 11  9 11  7
x2    1  1  2  3  2  3  1  2  3  4  9 10 11 11  8
D(x,M0) 2.2 3.9 6.3 5.4 3.2 1.4 0.8 2.3 4.9 7.3 3.8 3.3 3.3 1.8 1.5 4.2 3.5

[Slide figure: "Interlocking horseshoes with an outlier", a grid plot of p1 ... pf with M and M1 marked.]

FAUST Oblique: separate classR and classV using the midpoint of means (mom) method.
P_R = P_(X o d) < a.
View mR and mV as vectors (mR ≡ the vector from the origin to the point mR). D ≡ mR->mV is the oblique vector, and d = D/|D|.
a = (mR + (mV - mR)/2) o d = (mR + mV)/2 o d
(The very same formula works when D = mV->mR, i.e., when D points to the left.)
Training ≡ choosing the "cut-hyper-plane" (CHP), which is always an (n-1)-dimensional hyperplane (which cuts the space in two).
Classifying is one horizontal program (AND/OR) across pTrees to get a mask pTree for each entire class (bulk classification).
Improve accuracy? E.g., by considering the dispersion within classes when placing the CHP. Use
1. the vector of medians, vom, to represent each class rather than the mean: vomV ≡ ( median{v1 | v ∈ V}, median{v2 | v ∈ V}, ... )
2. the projection of each class onto the d-line (e.g., the R class below); then calculate the std (one horizontal formula per class, using Md's method); then use the std ratio to place the CHP (no longer at the midpoint between mR [vomR] and mV [vomV]).

[Slide figure: scatter of r and v points in dim 1 x dim 2 with mR, mV, vomR, vomV and the d-line marked.]
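The midpoint-of-means cut above can be sketched as follows. This is an illustration under stated assumptions: plain tuples stand in for pTree-based bulk classification, the class labels are hypothetical (the points of C1 are treated as class R and the points of C4 as class V purely to have two separated classes), and the function names are not from the slides.

```python
import math

# Hypothetical training classes: the C1 and C4 points from the 15-point
# dataset, labeled R and V only for the purpose of this illustration.
CLASS_R = [(1, 1), (3, 1), (2, 2), (3, 3)]
CLASS_V = [(15, 1), (14, 2), (15, 3), (13, 4),
           (10, 9), (11, 10), (9, 11), (11, 11)]


def mean(points):
    """Coordinate-wise mean of a list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))


def train_midpoint_of_means(class_r, class_v):
    """Return (d, a): unit vector along mR->mV and the cut value a."""
    m_r, m_v = mean(class_r), mean(class_v)
    big_d = tuple(v - r for r, v in zip(m_r, m_v))  # D = mR->mV
    length = math.sqrt(sum(c * c for c in big_d))
    d = tuple(c / length for c in big_d)  # d = D/|D|
    # a = ((mR + mV)/2) o d, the projection of the midpoint onto the d-line.
    a = sum((r + v) / 2 * c for r, v, c in zip(m_r, m_v, d))
    return d, a


def classify(x, d, a):
    """Class R if x o d < a, else class V."""
    return "R" if sum(xi * di for xi, di in zip(x, d)) < a else "V"


D_UNIT, A_CUT = train_midpoint_of_means(CLASS_R, CLASS_V)
LABELS_R = [classify(p, D_UNIT, A_CUT) for p in CLASS_R]
LABELS_V = [classify(p, D_UNIT, A_CUT) for p in CLASS_V]
```

Placing the cut by the ratio of the projected standard deviations, as suggested in item 2, would change only how `a` is computed; the classification test `x o d < a` stays the same.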