Curse of dimensionality Applying ML algorithms to highlydimensional data – an illustration Barbora Hladká and Martin Holub, Charles University in Prague, NPFL 054, 2016/17 I. Distribution of feature vectors – each feature is binary, with the same binomial distribution N dim prob <- 10^6 <- 7 <- 1/10 # number of observations # number of dimensions # probability of Wi=1 binom_vector <- character(N) for(i in 1:N) binom_vector[i] <- paste(rbinom(dim,1,prob), collapse="") expected_values = 2^dim emerged_values = length(unique(binom_vector)) print( sort(table(binom_vector), dec=T) ) observations = 1,000,000 dimensions = 7 p(x = 1) = 0.1 number of possible different values = 128 number of emerged different values = 125 0000000 477890 1000100 5981 0100010 5867 1001010 704 0101010 667 1001001 639 0100101 622 0010111 76 0011101 72 0110011 66 1101011 13 1110110 8 1111001 3 0010000 53566 0010001 5967 1000010 5844 1000110 700 1100010 665 0110010 637 0001011 616 0101101 76 1000111 72 1100101 66 1001111 12 1010111 7 0111111 2 0100000 53305 0000011 5961 1100000 5843 0010110 692 1010010 663 0010101 635 1011000 614 1011001 75 1010011 72 0110101 65 1101110 11 1111010 7 1111011 2 0001000 53303 0000101 5923 0101000 5815 0101100 690 0000111 656 1010100 635 0110001 589 1011100 75 1010110 72 0101110 63 0110111 10 1110011 6 1101111 1 0000001 53279 0001001 5901 0100001 5814 0100011 686 0011100 656 0011001 634 1011010 95 1101001 75 1111000 72 0111010 62 0111101 10 1111100 6 1111110 1 0000010 53272 0011000 5899 0100100 5810 1001100 680 0100110 652 1100001 634 1001011 87 0011110 73 0100111 71 1110010 62 0101111 9 1011011 5 1000000 53249 0110000 5898 1010000 5779 0001101 673 0110100 648 1000101 632 0110110 83 0111001 73 0101011 71 0001111 56 1101101 9 1100111 5 0000100 53023 0010100 5896 1000001 5778 1100100 673 0111000 648 1101000 630 1001101 78 1101100 73 1100011 70 0111100 56 0011111 8 0111110 4 0001010 6100 0001100 5893 0010010 5722 1010001 672 0011010 646 1110000 630 1001110 77 1110100 73 1101010 69 1100110 55 0111011 8 1011101 4 1001000 5995 0000110 5886 0010011 714 0001110 669 0101001 640 1000011 626 1010101 77 0011011 72 1110001 68 1011110 14 1110101 8 1110111 3 II. Distribution of distances from 0 of randomly distributed points in a unit cube dim n <- 6 <- 10000 cube <- data.frame( x1 = runif(n), x2 = runif(n), x3 = runif(n), x4 = runif(n), x5 = runif(n), x6 = runif(n) ) distances <- numeric(n) for(i in 1:n) distances[i] <- sqrt(sum(cube[i,]^2)) greater_than_1 <- sum(distances > 1) message("Most of the distances (", format(greater_than_1/n*100, digits=3), "%) are greater than 1.") message("Frequency of distances in intervals:") print(table(cut(distances, breaks=seq(0, 2.5, 0.5)))) ---------------------------------------------------------------This program generates 10,000 random 6-dimensional sample points in a unit cube. Maximum possible distance from 0 is: 2.45 Distances from 0 in the sample of 10000 points: Min. 1st Qu. Median Mean 3rd Qu. Max. 0.324 1.214 1.407 1.390 1.578 2.183 Most of the distances (91.7%) are greater than 1. Frequency of distances in intervals: (0,0.5] (0.5,1] (1,1.5] (1.5,2] (2,2.5] 7 824 5591 3529 49 III. Distribution of mutual distances between randomly distributed points in a unit cube dim n d lim <<<<- 6 150 choose(n,2) 0.5 message("Maximum possible distance between two points is: ", format(sqrt(6), digits=3) ) message("Number of different pairs is: ", d) cube <- data.frame( x1 = runif(n), x2 = runif(n), x3 = runif(n), x4 = runif(n), x5 = runif(n), x6 = runif(n) ) distances <- numeric(d) k <- 1 for(i in 1:(n-1) ) for(j in (i+1):n ) { distances[k] <- sqrt( sum((cube[i,]-cube[j,])^2) ); k <- k+1 } greater_than_lim <- sum(distances > lim) message("Most of the distances (", format(greater_than_lim/d*100, digits=3), "%) are greater than ", lim, ".") message("Frequency of distances in intervals:") print(table(cut(distances, breaks=seq(0, 2.5, 0.25)))) ---------------------------------------------------------------This program generates 150 random 6-dimensional sample points in a unit cube. Maximum possible distance between two points is: 2.45 Number of different pairs is: 11,175 Mutual distances in the sample of 150 points: Min. 1st Qu. Median Mean 3rd Qu. Max. 0.1173 0.8098 0.9797 0.9732 1.1420 1.8770 Most of the distances (97%) are greater than 0.5. Frequency of distances in intervals: (0,0.25] (0.25,0.5] (0.5,0.75] 9 322 1704 (1.75,2] 4 (2,2.25] (2.25,2.5] 0 0 (0.75,1] 3914 (1,1.25] (1.25,1.5] (1.5,1.75] 3793 1301 128
© Copyright 2024 Paperzz