Tyler Olivieri
ECE 3512: Stochastic Processes in Signals and Systems
Department of Electrical and Computer Engineering, Temple University, Philadelphia, PA 19122
I. PROBLEM STATEMENT
This computer assignment performs principal component analysis (PCA) on several different ratings of cities. The city ratings are a data set included with MATLAB. The assignment shows which three cities are the most closely related and how to transform the ratings onto the same scale (equal variance).
II. APPROACH AND RESULTS
Figure 1: Percent of the total variance each eigenvalue contributes
Figure 1 shows the percentage of the total variance contributed by each eigenvalue. Because the ratings are on different scales, some categories dominate the variance while others contribute almost nothing, which makes the raw ratings a poor basis for finding similar cities.
According to the largest eigenvalues, housing and arts are the biggest contributing variables in explaining the variance of the data; their eigenvalues have the largest magnitudes.
Above are the indices of the two most closely related cities, found by taking differences directly in the ratings matrix. The index of the third most closely related city is shown below.
Next, distances are computed on the transformed data. The covariance of the transformed data is an identity matrix, which tells us that the data is uncorrelated and the energies (variances) are all equal. We can see that the two most closely related cities change. Below, the third most closely related city is shown; again a different, and more accurate, result.
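The transformation used here can be sketched as follows. This is a NumPy illustration with synthetic correlated data standing in for the ratings matrix (the real assignment uses MATLAB's `cities` data set); it applies the same map z = Λ^(-1/2) Eᵀ x as the script and checks that the resulting covariance is the identity:

```python
import numpy as np

# Synthetic correlated data: 200 observations of 3 categories.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) @ np.array([[2., 1., 0. ],
                                          [0., 3., 1. ],
                                          [0., 0., 0.5]])

C = np.cov(X, rowvar=False)
evals, E = np.linalg.eigh(C)         # C = E * diag(evals) * E'
W = np.diag(evals ** -0.5) @ E.T     # whitening matrix Lambda^(-1/2) * E'
Z = X @ W.T                          # transformed (whitened) observations

Cz = np.cov(Z, rowvar=False)         # identity up to rounding error
print(np.allclose(Cz, np.eye(3)))    # prints True
```

Because the covariance of `Z` is the identity, every category carries equal variance, so Euclidean distances between rows of `Z` weight all categories equally.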
Finally, only the first three categories are used as criteria to rank the cities. Below are the two most closely related cities, followed by the third most closely related city. The results are different yet again.
III. MATLAB CODE
In this assignment, a function was written to calculate the Euclidean distance between every pair of rows in a matrix.
In the main script, the data is loaded and its covariance computed. The eigenvectors and eigenvalues of the covariance matrix are found, and a check confirms that they reconstruct the covariance correctly. The data is then transformed into a different space to place the variances on the same scale; the covariance of the transformed data is an identity matrix, and this is checked as well. The eigenvalues are extracted into a vector and sorted, and the percentage each eigenvalue contributes to the total variance is calculated. The two eigenvectors corresponding to the two largest eigenvalues are then identified. Next, the distance is calculated between all pairs of cities. To find the two most closely related cities we want the minimum distance; however, the distance matrix has zeros on the diagonal, since the distance between a point and itself is 0. These zeros are replaced with the maximum value in the matrix so they cannot be selected, and the minimum remaining value identifies the two most closely related cities. The ratings of these two cities are then averaged, and the city closest to that average is the third most closely related. The same process is repeated on the transformed (equal-variance) data and again using only the first three rating categories.
%%CA 12
%Tyler Olivieri
clc;clear;
%load data
load cities
%compute covariance matrix
c = cov(ratings);
%eigenvectors and eigenvalues
[evector,evalue] = eig(c);
%check
Ccheck = evector*evalue*evector';
%whiten the matrix
z = (evalue^(-1/2))*evector'*ratings';
%Cz should be identity for successful transformation matrix
Cz = cov(z');
%extract eigenvalues into a vector
for i = 1:length(evalue)
evalue_VECTOR(i) = evalue(i,i);
end
%sort eigenvalues---largest to smallest
evalue_VECTOR_SORT = sort(evalue_VECTOR, 'descend');
%calculate the % each eigenvalue contributes to variance
for i = 1:length(evalue_VECTOR_SORT)
cent_var(i) = evalue_VECTOR_SORT(i)/sum(evalue_VECTOR_SORT);
end
%this should add up to one.
figure(1);
plot(cent_var)
title('% of variance each eigenvalue contributes')
xlabel('eigenvector size')
ylabel('% of variance')
%find the eigenvectors corresponding to the largest eigenvalue
[row,col] = find(evalue ==evalue_VECTOR_SORT(1));
[row1,col1] = find(evalue ==evalue_VECTOR_SORT(2));
firstlargest = evector(:,col);
secondlargest = evector(:,col1);
%find two most closely related using Euclidean distance
distance2 = smallest_distance(ratings);
%find the biggest distance max returns max of each row,
%taking max again will return max in matrix.
maxcolvalue = max(distance2);
maxdis = max(maxcolvalue);
%find index of entry that gives the maximum distance
[maxr, maxc] = find(distance2==maxdis);
%make zeros the maximum distance so that we insure that the main
%row will not be the minimum
for i = 1:length(distance2(:,1))
for j = 1:length(distance2(1,:))
if( i == j)
distance2(i,j) = distance2(i,j)+maxdis;
end
end
end
%using similar method, find minimum
mincolvalue = min(distance2);
mindis = min(mincolvalue);
%find-- these are the closest cities
[minr, minc] = find(distance2==mindis);
%first, we must average the ratings for the two closest cities
for i = 1:length(ratings(1,:))
avg_rate(i) = (ratings(minr(1),i) + ratings(minc(1),i))/2;
end
%now we can find the distance between the average and the ratings
%to find the third closest
for i = 1:length(ratings(:,1))
for j = 1:length(ratings(1,:))
d2(j) = (ratings(i,j)-avg_rate(j))^2;
end
distance3(i) = sqrt(sum(d2));
%dont want to test the cities already too closest
%make those arbitrarily large
if( (i==minc(1)) || (i==minr(1)) )
distance3(i) = 1e9;
end
end
%same thing to find the third closest
minc2 = find(distance3==min(distance3));
%repeat with the transformed!! lets see if they are different
%same process
z = z';
disz = smallest_distance(z);
maxcolz = max(disz);
maxdisz = max(maxcolz);
[maxrz, maxcz] = find(disz==maxdisz);
for i = 1:length(disz(:,1))
for j = 1:length(disz(1,:))
if( i == j)
disz(i,j) = disz(i,j)+maxdisz;
end
end
end
mincolz = min(disz);
mindisz = min(mincolz);
[minrz, mincz] = find(disz==mindisz);
for i = 1:1:length(z(1,:))
avg_ratez(i) = (z(minrz(1),i)+z(mincz(1),i))/2;
end
for i = 1:length(z(:,1))
for j = 1:length(z(1,:))
temp(j) = (z(i,j)-avg_ratez(j))^2;
end
dis2z(i) = sqrt(sum(temp));
if( (i==mincz(1)) || (i==minrz(1)) )
dis2z(i) = 1e6;
end
end
minc2z = find(dis2z==min(dis2z));
%first 3 criteria
ratings_lim = [ratings(:,1) ratings(:,2) ratings(:,3)];
zlim = [z(:,1) z(:,2) z(:,3)];
for i = 1:length(ratings_lim(:,1))
for j = 1:length(ratings_lim(:,1))
for k = 1:length(ratings_lim(1,:))
temp(k) = (ratings_lim(i,k)-ratings_lim(j,k))^2;
end
dis_lim(i,j) = sqrt(sum(temp));
end
end
maxcol_lim = max(dis_lim);
maxdis_lim = max(maxcol_lim);
[maxr_lim, maxc_lim] = find(dis_lim==maxdis_lim);
for i = 1:length(dis_lim(:,1))
for j = 1:length(dis_lim(1,:))
if( i == j)
dis_lim(i,j) = dis_lim(i,j)+maxdis_lim;
end
end
end
mincol_lim = min(dis_lim);
mindis_lim = min(mincol_lim);
[minr_lim, minc_lim] = find(dis_lim==mindis_lim);
for i = 1:1:length(ratings_lim(1,:))
avg_rate_lim(i) = (ratings_lim(minr_lim(1),i)+ratings_lim(minc_lim(1),i))/2;
end
for i = 1:length(ratings_lim(:,1))
for j = 1:length(ratings_lim(1,:))
temp(j) = (ratings_lim(i,j)-avg_rate_lim(j))^2;
end
dis2_lim(i) = sqrt(sum(temp));
if( (i==minc_lim(1)) || (i==minr_lim(1)) )
dis2_lim(i) = 1e6;
end
end
minc2_lim = find(dis2_lim==min(dis2_lim));
z_lim = zlim;
for i = 1:length(z_lim(:,1))
for j = 1:length(z_lim(:,1))
for k = 1:length(z_lim(1,:))
temp(k) = (z_lim(i,k)-z_lim(j,k))^2;
end
disz_lim(i,j) = sqrt(sum(temp));
end
end
maxcolz_lim = max(disz_lim);
maxdisz_lim = max(maxcolz_lim);
[maxrz_lim, maxcz_lim] = find(disz_lim==maxdisz_lim);
for i = 1:length(disz_lim(:,1))
for j = 1:length(disz_lim(1,:))
if( i == j)
disz_lim(i,j) = disz_lim(i,j)+maxdisz_lim;
end
end
end
mincolz_lim = min(disz_lim);
mindisz_lim = min(mincolz_lim);
[minrz_lim, mincz_lim] = find(disz_lim==mindisz_lim);
for i = 1:1:length(z_lim(1,:))
avg_ratez_lim(i) = (z_lim(minrz_lim(1),i)+z_lim(mincz_lim(1),i))/2;
end
for i = 1:length(z_lim(:,1))
for j = 1:length(z_lim(1,:))
temp(j) = (z_lim(i,j)-avg_ratez_lim(j))^2;
end
dis2z_lim(i) = sqrt(sum(temp));
if( (i==mincz_lim(1)) || (i==minrz_lim(1)) )
dis2z_lim(i) = 1e6;
end
end
minc2z_lim = find(dis2z_lim==min(dis2z_lim));
%find two most closely related using Euclidean distance
%Tyler Olivieri
function distance2 = smallest_distance(ratings)
for i = 1:length(ratings(:,1))
for j = 1:length(ratings(:,1))
for k = 1:length(ratings(1,:))
d(k) = (ratings(i,k)-ratings(j,k))^2;
end
distance2(i,j) = sqrt(sum(d));
end
end
end
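The diagonal-masking step that the script repeats for each distance matrix can be sketched compactly. This is a NumPy illustration, not part of the assignment's MATLAB, and the small ratings matrix is invented:

```python
import numpy as np

# Hypothetical ratings: 4 cities x 2 categories.
R = np.array([[3., 4.],
              [3., 5.],
              [9., 1.],
              [0., 0.]])

# Pairwise Euclidean distances between every two rows.
diff = R[:, None, :] - R[None, :, :]
D = np.sqrt((diff ** 2).sum(axis=-1))

# The diagonal is 0 (each city vs. itself); overwrite it with the
# largest distance so it cannot be selected as the minimum.
D[np.diag_indices_from(D)] = D.max()

i, j = np.unravel_index(D.argmin(), D.shape)
print(i, j)   # indices of the two most closely related cities -> 0 1
```

Masking the diagonal with the matrix maximum serves the same purpose as adding `maxdis` to the diagonal entries in the MATLAB script: the self-distances can never win the subsequent minimum search.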
IV. CONCLUSIONS
Using principal component analysis, we can take ratings on different scales and normalize them to the same scale so that each category has the same effect on a city's rating. The eigenvalues of the covariance matrix of the original data represent the variance of each category. The data is transformed into a different space in which the covariance of the transformed data is an identity matrix, showing that the transformed data is uncorrelated and the variances are all equal. Computing Euclidean distances on the transformed data therefore gives an accurate measure of the similarity of two observations. The results differed when the Euclidean distance was taken with and without the transformation. We also note that certain eigenvalues, and hence certain categories, are initially weighted much more heavily than others (Figure 1). The end of the assignment explored what happens when only some categories are considered in the similarity between cities. The results differed again, which is not surprising, because the cities are now matched on only a few categories instead of several. The variances remain equal even when only a few categories of the transformed data are considered.