Curse.demo

Curse of dimensionality
Applying ML algorithms to high-dimensional data – an illustration
Barbora Hladká and Martin Holub, Charles University in Prague, NPFL 054, 2016/17
I. Distribution of feature vectors
– each feature is binary, with the same Bernoulli distribution (a binomial with a single trial)
N    <- 10^6   # number of observations
dim  <- 7      # number of dimensions
prob <- 1/10   # probability of Wi=1
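binom_vector <- character(N)
# encode each observation as a string of dim characters, each 0 or 1
for(i in 1:N) binom_vector[i] <- paste(rbinom(dim,1,prob), collapse="")
expected_values = 2^dim                        # number of possible different values
emerged_values = length(unique(binom_vector))  # number of different values actually observed
print( sort(table(binom_vector), decreasing=TRUE) )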
observations = 1,000,000
dimensions = 7
p(x = 1) = 0.1
number of possible different values = 128
number of emerged different values = 125
0000000 0010000 0100000 0001000 0000001 0000010 1000000 0000100 0001010 1001000
 477890   53566   53305   53303   53279   53272   53249   53023    6100    5995
1000100 0010001 0000011 0000101 0001001 0011000 0110000 0010100 0001100 0000110
   5981    5967    5961    5923    5901    5899    5898    5896    5893    5886
0100010 1000010 1100000 0101000 0100001 0100100 1010000 1000001 0010010 0010011
   5867    5844    5843    5815    5814    5810    5779    5778    5722     714
1001010 1000110 0010110 0101100 0100011 1001100 0001101 1100100 1010001 0001110
    704     700     692     690     686     680     673     673     672     669
0101010 1100010 1010010 0000111 0011100 0100110 0110100 0111000 0011010 0101001
    667     665     663     656     656     652     648     648     646     640
1001001 0110010 0010101 1010100 0011001 1100001 1000101 1101000 1110000 1000011
    639     637     635     635     634     634     632     630     630     626
0100101 0001011 1011000 0110001 1011010 1001011 0110110 1001101 1001110 1010101
    622     616     614     589      95      87      83      78      77      77
0010111 0101101 1011001 1011100 1101001 0011110 0111001 1101100 1110100 0011011
     76      76      75      75      75      73      73      73      73      72
0011101 1000111 1010011 1010110 1111000 0100111 0101011 1100011 1101010 1110001
     72      72      72      72      72      71      71      70      69      68
0110011 1100101 0110101 0101110 0111010 1110010 0001111 0111100 1100110 1011110
     66      66      65      63      62      62      56      56      55      14
1101011 1001111 1101110 0110111 0111101 0101111 1101101 0011111 0111011 1110101
     13      12      11      10      10       9       9       8       8       8
1110110 1010111 1111010 1110011 1111100 1011011 1100111 0111110 1011101 1110111
      8       7       7       6       6       5       5       4       4       3
1111001 0111111 1111011 1101111 1111110
      3       2       2       1       1
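The counts in this section can be compared with their expected values. The following is a minimal sketch, assuming the same settings as in the listing (N = 10^6, dim = 7, prob = 1/10). A pattern with k ones has probability prob^k * (1-prob)^(dim-k), so patterns with six or seven ones are expected less than once in a million draws, which is consistent with only 125 of the 128 possible values emerging. The names n_patterns, p_pattern, expected_count, p_never and expected_table are illustrative and not part of the original program.

```r
# Sketch: expected count of each pattern, grouped by its number of ones.
# Assumes the settings used above: N = 10^6, dim = 7, prob = 1/10.
N    <- 10^6
dim  <- 7
prob <- 1/10

k <- 0:dim                                        # number of ones in a pattern
n_patterns     <- choose(dim, k)                  # how many patterns have k ones
p_pattern      <- prob^k * (1 - prob)^(dim - k)   # probability of one such pattern
expected_count <- N * p_pattern                   # expected occurrences in N draws
p_never        <- (1 - p_pattern)^N               # chance a given pattern never appears

expected_table <- data.frame(ones = k, n_patterns, expected_count, p_never)
print(expected_table, digits = 3)

# Expected number of patterns that do not appear at all in the sample
message("Expected number of missing patterns: ",
        format(sum(n_patterns * p_never), digits = 3))
```

Each row of expected_table can be read against the observed counts: for example, the all-zero pattern is expected about 478,297 times and each pattern with a single one about 53,144 times.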
II. Distribution of distances from 0 of randomly distributed points in a unit cube
dim <- 6
n   <- 10000
cube <- data.frame(
x1 = runif(n),
x2 = runif(n),
x3 = runif(n),
x4 = runif(n),
x5 = runif(n),
x6 = runif(n)
)
distances <- numeric(n)
for(i in 1:n) distances[i] <- sqrt(sum(cube[i,]^2))
greater_than_1 <- sum(distances > 1)
message("Most of the distances (",
format(greater_than_1/n*100, digits=3), "%) are greater than 1.")
message("Frequency of distances in intervals:")
print(table(cut(distances, breaks=seq(0, 2.5, 0.5))))
---------------------------------------------------------------
This program generates 10,000 random 6-dimensional sample points in a unit cube.
Maximum possible distance from 0 is: 2.45
Distances from 0 in the sample of 10000 points:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  0.324   1.214   1.407   1.390   1.578   2.183
Most of the distances (91.7%) are greater than 1.
Frequency of distances in intervals:
(0,0.5] (0.5,1] (1,1.5] (1.5,2] (2,2.5]
      7     824    5591    3529      49
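The figures in this section can be cross-checked against two closed-form values. The sketch below is not the original program: it assumes a vectorized variant (a matrix and rowSums instead of a data frame and a loop) and an arbitrary seed. It uses two standard facts: E[x^2] = 1/3 for x uniform on (0,1), so the mean squared distance is dim/3; and the points within distance 1 of the origin fill the part of the unit ball lying in the positive orthant, whose volume is the volume of the dim-dimensional unit ball divided by 2^dim. The names points, mean_sq_theory, ball_volume and p_within_1 are illustrative.

```r
# Sketch: vectorized re-run of the experiment plus two closed-form checks.
set.seed(1)   # arbitrary seed, only for reproducibility (an assumption)
dim <- 6
n   <- 10000

points    <- matrix(runif(n * dim), nrow = n)   # n random points in the unit cube
distances <- sqrt(rowSums(points^2))            # distance of each point from 0

# E[x^2] = 1/3 for x ~ U(0,1), so the expected squared distance is dim/3.
mean_sq_theory <- dim / 3

# Share of the cube within distance 1 of the origin:
# volume of the unit dim-ball divided by 2^dim (positive orthant only).
ball_volume <- pi^(dim / 2) / gamma(dim / 2 + 1)
p_within_1  <- ball_volume / 2^dim

message("Mean squared distance: simulated ",
        format(mean(distances^2), digits = 4), ", theory ", mean_sq_theory)
message("Share of distances <= 1: simulated ",
        format(mean(distances <= 1), digits = 3),
        ", theory ", format(p_within_1, digits = 3))
```

For dim = 6 the theoretical share within distance 1 is about 8%, in line with the 91.7% of distances reported as greater than 1.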
III. Distribution of mutual distances between randomly distributed points in a unit cube
dim <- 6
n   <- 150
d   <- choose(n,2)
lim <- 0.5
message("Maximum possible distance between two points is: ",
format(sqrt(6), digits=3) )
message("Number of different pairs is: ", d)
cube <- data.frame(
x1 = runif(n),
x2 = runif(n),
x3 = runif(n),
x4 = runif(n),
x5 = runif(n),
x6 = runif(n)
)
distances <- numeric(d)
k <- 1
for(i in 1:(n-1) ) for(j in (i+1):n ) {
distances[k] <- sqrt( sum((cube[i,]-cube[j,])^2) ); k <- k+1
}
greater_than_lim <- sum(distances > lim)
message("Most of the distances (",
format(greater_than_lim/d*100, digits=3), "%) are greater than ", lim, ".")
message("Frequency of distances in intervals:")
print(table(cut(distances, breaks=seq(0, 2.5, 0.25))))
---------------------------------------------------------------
This program generates 150 random 6-dimensional sample points in a unit cube.
Maximum possible distance between two points is: 2.45
Number of different pairs is: 11,175
Mutual distances in the sample of 150 points:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 0.1173  0.8098  0.9797  0.9732  1.1420  1.8770
Most of the distances (97%) are greater than 0.5.
Frequency of distances in intervals:
  (0,0.25] (0.25,0.5] (0.5,0.75]   (0.75,1]   (1,1.25] (1.25,1.5] (1.5,1.75]   (1.75,2]   (2,2.25] (2.25,2.5]
         9        322       1704       3914       3793       1301        128          4          0          0
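The pairwise computation in this section can also be written without the double loop. The following sketch is an alternative formulation, not the original program: it assumes the points are held in a matrix and uses the base R function dist(), which returns the Euclidean distance for every pair of rows. As a sanity check, E[(x-y)^2] = 1/6 for independent x, y uniform on (0,1), so the expected squared mutual distance is dim/6. The seed and the names points and mean_sq_theory are illustrative.

```r
# Sketch: mutual distances via dist() instead of an explicit double loop.
set.seed(1)   # arbitrary seed, only for reproducibility (an assumption)
dim <- 6
n   <- 150
lim <- 0.5

points    <- matrix(runif(n * dim), nrow = n)   # n random points in the unit cube
distances <- as.vector(dist(points))            # Euclidean, one value per pair

# E[(x - y)^2] = 1/6 for independent x, y ~ U(0,1),
# so the expected squared mutual distance is dim/6.
mean_sq_theory <- dim / 6

message("Number of different pairs is: ", length(distances))
message("Mean squared distance: simulated ",
        format(mean(distances^2), digits = 4), ", theory ", mean_sq_theory)
message("Share of distances greater than ", lim, ": ",
        format(mean(distances > lim) * 100, digits = 3), "%")
print(table(cut(distances, breaks = seq(0, 2.5, 0.25))))
```

With dim = 6 the expected squared distance is 1, which fits the reported mean and median mutual distances of just under 1.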