
Object Orie’d Data Analysis, Last Time
Distance Weighted Discrimination:
• Revisit microarray data
• Face Data
• Outcomes Data
• Simulation Comparison
Twiddle ratios of subtypes
Why not adjust by means?

DWD robust against non-proportional
subtypes…

Mathematical Statistical Question:
Are there mathematics behind this?
(will answer next time…)
Distance Weighted Discrim’n
Maximal Data Piling
HDLSS Discrim’n Simulations
Main idea:
Comparison of
• SVM (Support Vector Machine)
• DWD (Distance Weighted Discrimination)
• MD (Mean Difference, a.k.a. Centroid)
Linear versions, across dimensions
HDLSS Discrim’n Simulations
Conclusions:
• Everything (sensible) is best sometimes
• DWD often very near best
• MD weak beyond Gaussian
Caution about simulations (and examples):
• Very easy to cherry pick best ones
• Good practice in Machine Learning
– “Ignore method proposed, but read
paper for useful comparison of others”
HDLSS Discrim’n Simulations
Can we say more about:
All methods come together
in very high dimensions???
Mathematical Statistical Question:
Mathematics behind this???
(will answer now)
HDLSS Asymptotics
Modern Mathematical Statistics:
 Based on asymptotic analysis
 I.e. Uses limiting operations
 Almost always $\lim_{n \to \infty}$
 Occasional misconceptions:
   Indicates behavior for large samples
   Thus only makes sense for “large” samples
   Models phenomenon of “increasing data”
   So other flavors are useless???
HDLSS Asymptotics
Modern Mathematical Statistics:
 Based on asymptotic analysis
 Real Reasons:
Approximation provides insights
Can find simple underlying structure
In complex situations
Thus various flavors are fine:
$\lim_{n \to \infty}$, $\lim_{d \to \infty}$, $\lim_{n,\,d \to \infty}$, and limits with other quantities $\to 0$
Even desirable! (find additional insights)
HDLSS Asymptotics: Simple Paradoxes
For $d$ dim’al Standard Normal dist’n:
$$Z = (Z_1, \ldots, Z_d)^T \sim N_d(0, I_d)$$
Euclidean Distance to Origin (as $d \to \infty$):
$$\|Z\| = \sqrt{d} + O_p(1)$$
HDLSS Asymptotics: Simple Paradoxes
As $d \to \infty$,
$$\|Z\| = \sqrt{d} + O_p(1)$$
- Data lie roughly on surface of sphere, with radius $\sqrt{d}$
- Yet origin is point of highest density???
- Paradox resolved by: density w.r.t. Lebesgue Measure
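A minimal simulation sketch of this concentration (assuming NumPy; the dimensions and the number of replications are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)

# For each dimension d, draw Z ~ N_d(0, I_d) many times and compare the
# Euclidean norm ||Z|| to sqrt(d): the gap stays O_p(1) as d grows.
for d in [2, 20, 200, 20000]:
    Z = rng.standard_normal((1000, d))
    norms = np.linalg.norm(Z, axis=1)
    print(f"d = {d:5d}:  mean(||Z|| - sqrt(d)) = {np.mean(norms - np.sqrt(d)):+.3f},"
          f"  sd(||Z||) = {np.std(norms):.3f}")
```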
HDLSS Asymptotics: Simple Paradoxes
For $d$ dim’al Standard Normal dist’n:
$$Z_1 \text{ indep. of } Z_2 \sim N_d(0, I_d)$$
Euclidean Dist. Between $Z_1$ and $Z_2$ (as $d \to \infty$):
Distance tends to non-random constant:
$$\|Z_1 - Z_2\| = \sqrt{2d} + O_p(1)$$
HDLSS Asymptotics: Simple Paradoxes
Distance tends to non-random constant:
$$\|Z_1 - Z_2\| = \sqrt{2d} + O_p(1)$$
• Factor $\sqrt{2}$, since $\mathrm{sd}(X_1 - X_2)^2 = \mathrm{sd}(X_1)^2 + \mathrm{sd}(X_2)^2$
Can extend to $Z_1, \ldots, Z_n$
Where do they all go???
(we can only perceive 3 dim’ns)
HDLSS Asymptotics: Simple Paradoxes
For $d$ dim’al Standard Normal dist’n:
$$Z_1 \text{ indep. of } Z_2 \sim N_d(0, I_d)$$
High dim’al Angles (as $d \to \infty$):
$$\mathrm{Angle}(Z_1, Z_2) = 90^\circ + O_p(d^{-1/2})$$
- Everything is orthogonal???
- Where do they all go???
(again our perceptual limitations)
- Again 1st order structure is non-random
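The same kind of numerical check for the pairwise behavior (assuming NumPy): distances between independent vectors settle near $\sqrt{2d}$ and angles near $90^\circ$, with the angle error shrinking like $d^{-1/2}$.

```python
import numpy as np

rng = np.random.default_rng(1)

# Pairs of independent N_d(0, I_d) vectors: distance near sqrt(2d),
# angle near 90 degrees with O_p(d^{-1/2}) fluctuations.
for d in [2, 20, 200, 20000]:
    Z1 = rng.standard_normal((1000, d))
    Z2 = rng.standard_normal((1000, d))
    dist = np.linalg.norm(Z1 - Z2, axis=1)
    cosang = np.sum(Z1 * Z2, axis=1) / (
        np.linalg.norm(Z1, axis=1) * np.linalg.norm(Z2, axis=1))
    ang = np.degrees(np.arccos(cosang))
    print(f"d = {d:5d}:  dist - sqrt(2d) = {np.mean(dist - np.sqrt(2 * d)):+.3f},"
          f"  angle = {np.mean(ang):6.2f} +/- {np.std(ang):.2f} deg")
```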
HDLSS Asy’s: Geometrical Represent’n
Assume $Z_1, \ldots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$
Study Subspace Generated by Data
Hyperplane through 0, of dimension $n$
Points are “nearly equidistant to 0”, & dist $\approx \sqrt{d}$
Within plane, can “rotate towards $\sqrt{d} \times$ Unit Simplex”
All Gaussian data sets are:
“near Unit Simplex Vertices”!!!
“Randomness” appears only in rotation of simplex
Hall, Marron & Neeman (2005)
HDLSS Asy’s: Geometrical Represent’n
Assume $Z_1, \ldots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$
Study Hyperplane Generated by Data
$n - 1$ dimensional hyperplane
Points are pairwise equidistant, dist $\approx \sqrt{2d}$
Points lie at vertices of: “regular $n$-hedron”
Again “randomness in data” is only in rotation
Surprisingly rigid structure in data?
HDLSS Asy’s: Geometrical Represen’tion
Simulation View: study “rigidity after rotation”
• Simple 3 point data sets
• In dimensions d = 2, 20, 200, 20000
• Generate hyperplane of dimension 2
• Rotate that to plane of screen
• Rotate within plane, to make “comparable”
• Repeat 10 times, use different colors
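A rough sketch of one way to generate such a view (assuming NumPy; the within-plane alignment, anchoring the first point on the positive x-axis, is just one guess at how to make the views “comparable”):

```python
import numpy as np

rng = np.random.default_rng(2)

def plane_view(X):
    """Project 3 points in R^d onto the 2-d plane through their mean that
    they span, then rotate within that plane so point 1 sits on the
    positive x-axis (the 'make comparable' step)."""
    Xc = X - X.mean(axis=0)                    # center the 3 points
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Xc @ Vt[:2].T                          # coordinates in the spanned plane
    theta = np.arctan2(P[0, 1], P[0, 0])       # current angle of point 1
    c, s = np.cos(-theta), np.sin(-theta)
    return P @ np.array([[c, -s], [s, c]]).T   # rotate point 1 onto x-axis

# Scaled by sqrt(d), the triangles look more and more like a fixed
# near-equilateral simplex as d grows: "rigidity after rotation".
for d in [2, 20, 200, 20000]:
    X = rng.standard_normal((3, d))
    print(f"d = {d}:\n{np.round(plane_view(X) / np.sqrt(d), 2)}")
```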
HDLSS Asy’s: Geometrical Represen’tion
Simulation View: shows “rigidity after rotation”
HDLSS Asy’s: Geometrical Represen’tion
Explanation of Observed (Simulation) Behavior:
“everything similar for very high d ”
• 2 popn’s are 2 simplices (i.e. regular n-hedrons)
• All are same distance from the other class
• i.e. everything is a support vector
• i.e. all sensible directions show “data piling”
• so “sensible methods are all nearly the same”
• Including 1 - NN
HDLSS Asy’s: Geometrical Represen’tion
Straightforward Generalizations:
non-Gaussian data: only need moments
non-independent: use “mixing conditions”
Mild Eigenvalue condition on Theoretical Cov.
(Ahn, Marron, Muller & Chi, 2007)
All based on simple “Laws of Large Numbers”
2nd Paper on HDLSS Asymptotics
Ahn, Marron, Muller & Chi (2007)
 Assume 2nd Moments
 Assume no eigenvalues too large in sense:
For covariance eigenvalues $\lambda_1, \ldots, \lambda_d$, define
$$\varepsilon = \frac{\left( \sum_{j=1}^{d} \lambda_j \right)^2}{d \sum_{j=1}^{d} \lambda_j^2}$$
assume $\varepsilon^{-1} = o(d)$, i.e. $\varepsilon \gg d^{-1}$ (min possible)
(much weaker than previous mixing conditions…)
2nd Paper on HDLSS Asymptotics
Background:
In classical multivariate analysis, the statistic
$$\varepsilon = \frac{\left( \sum_{j=1}^{d} \lambda_j \right)^2}{d \sum_{j=1}^{d} \lambda_j^2}$$
is called the “epsilon statistic”,
and is used to test “sphericity” of dist’n,
i.e. “are all cov’nce eigenvalues the same?”
2nd Paper on HDLSS Asymptotics
Can show: the epsilon statistic
$$\varepsilon = \frac{\left( \sum_{j=1}^{d} \lambda_j \right)^2}{d \sum_{j=1}^{d} \lambda_j^2}$$
satisfies: $\varepsilon \in \left[ d^{-1}, 1 \right]$
• For spherical Normal, $\varepsilon = 1$
• Single extreme eigenvalue gives $\varepsilon \approx \frac{1}{d}$
• So assumption $\varepsilon \gg d^{-1}$ is very mild
• Much weaker than mixing conditions
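A small sketch of this statistic and its extremes (assuming NumPy; the eigenvalue patterns are just illustrations):

```python
import numpy as np

def epsilon_stat(lam):
    """Sphericity measure: (sum of eigenvalues)^2 / (d * sum of squares)."""
    lam = np.asarray(lam, dtype=float)
    return lam.sum() ** 2 / (lam.size * (lam ** 2).sum())

d = 1000
print(epsilon_stat(np.ones(d)))                  # spherical: exactly 1
print(epsilon_stat(np.r_[1e6, np.ones(d - 1)]))  # one dominant eigenvalue: ~ 1/d
print(epsilon_stat(np.arange(d, 0, -1)))         # slowly decaying: order 1, >> 1/d
```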
2nd Paper on HDLSS Asymptotics
Ahn, Marron, Muller & Chi (2007)
 Assume 2nd Moments
 Assume no eigenvalues too large, $\varepsilon \gg d^{-1}$:
Then
$$\|X_i - X_j\| = O_p(1) \cdot \sqrt{d}$$
Not so strong as before:
$$\|Z_1 - Z_2\| = \sqrt{2d} + O_p(1)$$
2nd Paper on HDLSS Asymptotics
Can we improve on:
$$\|X_i - X_j\| = O_p(1) \cdot \sqrt{d} \ ?$$
John Kent example: Normal scale mixture
$$X_i \sim 0.5\, N_d(0, I_d) + 0.5\, N_d(0, 10 \cdot I_d)$$
Won’t get:
$$\|X_i - X_j\| = C \cdot \sqrt{d} + O_p(1)$$
2nd Paper on HDLSS Asymptotics
Notes on Kent’s Normal Scale Mixture
$$X_i \sim 0.5\, N_d(0, I_d) + 0.5\, N_d(0, 10 \cdot I_d)$$
• Data Vectors are indep’dent of each other
• But entries of each have strong depend’ce
• However, can show entries have cov = 0!
• Recall statistical folklore:
Covariance = 0 does not imply Independence
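A quick numerical check of both points (assuming NumPy; sizes are arbitrary). The shared mixture scale is also why pairwise distances between data vectors split into a few clusters, roughly $\sqrt{2d}$, $\sqrt{11d}$ and $\sqrt{20d}$ depending on which components the two vectors came from, rather than concentrating around one constant.

```python
import numpy as np

rng = np.random.default_rng(3)

# Kent's mixture: each data vector is entirely N(0, I_d) or entirely
# N(0, 10 I_d), with probability 1/2 each.
d, n = 1000, 20000
scale = np.where(rng.random(n) < 0.5, 1.0, np.sqrt(10.0))
X = scale[:, None] * rng.standard_normal((n, d))

# Entries within a vector are uncorrelated ...
print("cov(entry 1, entry 2)      ~", round(np.cov(X[:, 0], X[:, 1])[0, 1], 4))
# ... but not independent: magnitudes move together, because both entries
# are multiplied by the same (random) mixture scale.
print("corr(|entry 1|, |entry 2|) ~",
      round(np.corrcoef(np.abs(X[:, 0]), np.abs(X[:, 1]))[0, 1], 4))
```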
0 Covariance is not independence
Simple Example:
• Random Variables $X$ and $Y$
• Make both Gaussian: $X, Y \sim N(0, 1)$
• With strong dependence
• Yet 0 covariance
Given $c > 0$, define
$$Y = \begin{cases} X & |X| \le c \\ -X & |X| > c \end{cases}$$
0 Covariance is not independence
Simple Example: (plots of the joint dist’n of $X$ and $Y$ for varying $c$; $c$ chosen to make cov($X$, $Y$) = 0)
0 Covariance is not independence
Simple Example:
• Distribution is degenerate
• Supported on diagonal lines
• Not abs. cont. w.r.t. 2-d Lebesgue meas.
• For small $c$, have cov($X$, $Y$) < 0
• For large $c$, have cov($X$, $Y$) > 0
• By continuity, $\exists\, c$ with cov($X$, $Y$) = 0
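A short numerical sketch of locating such a $c$ (assuming SciPy; the truncated second moment formula for the standard normal is standard):

```python
from scipy.stats import norm
from scipy.optimize import brentq

# With Y = X on {|X| <= c} and Y = -X on {|X| > c},
# cov(X, Y) = E[X^2; |X| <= c] - E[X^2; |X| > c].
def cov_xy(c):
    inside = 2 * (norm.cdf(c) - 0.5) - 2 * c * norm.pdf(c)  # E[X^2; |X| <= c]
    return 2 * inside - 1                                   # inside - (1 - inside)

c_star = brentq(cov_xy, 0.1, 3.0)   # sign change is bracketed, so a root exists
print("c* =", round(c_star, 4), "  cov(X, Y) at c* =", round(cov_xy(c_star), 6))
```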
0 Covariance is not independence
Result:
• Joint distribution of X and Y :
– Has Gaussian marginals
– Has cov($X$, $Y$) = 0
– Yet strong dependence of X and Y
– Thus not multivariate Gaussian
Shows Multivariate Gaussian means more
than Gaussian Marginals
HDLSS Asy’s: Geometrical Represen’tion
Further Consequences of Geometric Represen’tion
1. Inefficiency of DWD for uneven sample size
(motivates weighted version, Xingye Qiao)
2. DWD more stable than SVM
(based on deeper limiting distributions)
(reflects intuitive idea of feeling sampling variation)
(something like mean vs. median)
3. 1-NN rule inefficiency is quantified.
HDLSS Math. Stat. of PCA, I
Consistency & Strong Inconsistency:
Spike Covariance Model, Paul (2007)
For Eigenvalues:
$$\lambda_{1,d} = d^{\alpha}, \quad \lambda_{2,d} = 1, \ \ldots, \ \lambda_{d,d} = 1$$
1st Eigenvector: $u_1$
How good are empirical versions
$$\hat{\lambda}_{1,d}, \ldots, \hat{\lambda}_{d,d}, \ \hat{u}_1$$
as estimates?
HDLSS Math. Stat. of PCA, II
Consistency (big enough spike): For $\alpha > 1$,
$$\mathrm{Angle}(u_1, \hat{u}_1) \to 0$$
Strong Inconsistency (spike not big enough): For $\alpha < 1$,
$$\mathrm{Angle}(u_1, \hat{u}_1) \to 90^\circ$$
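A quick simulation sketch of this dichotomy (assuming NumPy; the fixed sample size n = 25 and the dimensions shown are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(4)

def angle_to_spike(d, alpha, n=25):
    """Angle (degrees) between the first sample PC direction and the true
    spike direction e_1, for eigenvalues (d^alpha, 1, ..., 1), fixed n."""
    sd = np.ones(d)
    sd[0] = d ** (alpha / 2)                  # sd of the spiked coordinate
    X = rng.standard_normal((n, d)) * sd
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    cos = min(abs(Vt[0, 0]), 1.0)             # |<u_hat_1, e_1>|
    return np.degrees(np.arccos(cos))

for alpha in [1.5, 0.5]:                      # spike big enough / not big enough
    angles = [round(angle_to_spike(d, alpha), 1) for d in [100, 1000, 10000]]
    print(f"alpha = {alpha}:  angles over d = 100, 1000, 10000: {angles}")
```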
HDLSS Math. Stat. of PCA, III
Consistency of eigenvalues?
$$\frac{\hat{\lambda}_{1,d}}{\lambda_{1,d}} \xrightarrow{\ L\ } \frac{\chi^2_n}{n}$$
 Eigenvalues Inconsistent
 But known distribution
 Unless $n \to \infty$ as well
HDLSS Work in Progress, I
Batch Adjustment: Xuxin Liu
Recall Intuition from above:
 Key is sizes of biological subtypes
 Differing ratio trips up mean
 But DWD more robust
Mathematics behind this?
Liu: Twiddle ratios of subtypes
HDLSS Data Combo Mathematics
Xuxin Liu Dissertation Results:
 Simple Unbalanced Cluster Model
 Growing at rate $d^{\alpha}$
 Answers depend on $\alpha$, as $d \to \infty$
Visualization of setting….
HDLSS Data Combo Mathematics
Asymptotic Results (as $d \to \infty$):
 For $\alpha > \tfrac{1}{2}$, DWD Consistent:
$$\mathrm{Angle}(\mathrm{DWD}, \mathrm{Truth}) \to 0$$
 For $\alpha < \tfrac{1}{2}$, DWD Strongly Inconsistent:
$$\mathrm{Angle}(\mathrm{DWD}, \mathrm{Truth}) \to 90^\circ$$
HDLSS Data Combo Mathematics
Asymptotic Results (as $d \to \infty$):
 For $\alpha > \tfrac{1}{2}$, PAM Inconsistent:
$$\mathrm{Angle}(\mathrm{PAM}, \mathrm{Truth}) \to C_r > 0$$
 For $\alpha < \tfrac{1}{2}$, PAM Strongly Inconsistent:
$$\mathrm{Angle}(\mathrm{PAM}, \mathrm{Truth}) \to 90^\circ$$
HDLSS Data Combo Mathematics
Value of $C_r$, for sample size ratio $r$:
$$C_r = \cos^{-1}\!\left( \frac{r + 1}{\sqrt{2 r^2 + 2}} \right)$$
 $C_r = 0$, only when $r = 1$
 Otherwise, for $r \ne 1$, PAM Inconsistent
 Verifies intuitive idea in strong way
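A tiny numerical illustration, evaluating the $C_r$ formula above for a few imbalance ratios (assuming NumPy):

```python
import numpy as np

def C_r(r):
    """Limiting angle (degrees) between PAM direction and truth, ratio r."""
    return np.degrees(np.arccos((r + 1) / np.sqrt(2 * r ** 2 + 2)))

for r in [1, 2, 5, 10]:
    print(f"r = {r:2d}:  C_r = {C_r(r):5.2f} degrees")   # 0 only at r = 1
```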
The Future of Geometrical Repres’tion?
HDLSS version of “optimality” results?
•“Contiguity” approach?
Params depend on d?
•Rates of Convergence?
•Improvements of DWD?
(e.g. other functions of distance than inverse)
It is still early days …
State of HDLSS Research?
Development of Methods
Mathematical Assessment
…
(thanks to:
defiant.corban.edu/gtipton/net-fun/iceberg.html)