Boris Mirkin
Computational Intelligence and Data Visualization
A unique course by an expert - looking at data from inside
rather than from outside as is common to CS
Lecture 6, 8/2/2010
According to the Popular Demand:
MatLab Lab sessions in Room 128:
8/2/2010, 15.30-17.00
15/2/2010, 15.30-17.00
22/2/2010, 14.00-17.00 (Reading week)
Today’s subjects:
Principal component analysis (see file summm.doc, pp. 14-33)
o Conventional formulation of PCA
o Latent Semantic analysis
Reformulation in terms of the covariance matrix
yiv = cv zi + eiv     (4.10)

The first-order optimality conditions for a triplet (μ, z, c) to be the least-squares solution to (4.10) under the unit-length constraints imply that μ = zTYc is the maximum value satisfying the equations

YTz = μc  and  Yc = μz     (4.12)
These equations for the optimal scores give the transformation of the data leading to the summaries z* and c*. The transformation, denoted by F(Y) in (4.1), appears to be linear and combines the optimal c and z so that each determines the other. It appears that this type of summarization is well known in linear algebra.
A triplet (μ, z, c) consisting of a non-negative μ and two normed vectors, c (size V×1) and z (size N×1), is referred to as a singular triplet for Y if it satisfies (4.12), with μ being the singular value and z, c the corresponding singular vectors, sometimes referred to as the left singular vector, z, and the right singular vector, c.
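A minimal MatLab sketch of computing the first singular triplet and checking (4.12); it assumes a centered data matrix Y is already in the workspace:
>> [u,m,v]=svd(Y); % full singular value decomposition of Y
>> mu=m(1,1); % the largest singular value
>> z=u(:,1); % first left singular vector: the entity scores z*
>> c=v(:,1); % first right singular vector: the feature loadings c*
>> norm(Y'*z-mu*c), norm(Y*c-mu*z) % both should be near zero, confirming (4.12)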
Property 1. The score vector z* is a linear combination of the columns of Y weighted by c*'s components; that is, c*'s components are the feature weights in the score z*.
This allows mapping additional features or entities onto the corresponding parts of the hidden factor model. Consider, for example, an additional N-dimensional feature vector y standardized the same way as Y. Its loading c*(y) is determined as c*(y) = <z*, y>/μ for the talent score z*. Similarly, an additional standardized V-dimensional entity point h has its hidden factor score defined according to the other part of (4.12), z*(h) = <c*, h>/μ.
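A hedged MatLab illustration of these two mappings, continuing the snippet above; the names yextra and hextra are mine, not from the text:
>> cy=(z'*yextra)/mu; % loading c*(y) of a supplementary N-by-1 feature column yextra
>> zh=(hextra*c)/mu; % score z*(h) of a supplementary 1-by-V entity row hextra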
Property 2. A Pythagorean decomposition of the data scatter T(Y), relating the least-squares criterion L² in (4.11) and the singular value μ, holds:

T(Y) = μ² + L²     (4.13)
This implies that the squared singular value μ² expresses the part of the data scatter explained by the principal component z*.
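Property 2 can be checked numerically as well, continuing the same sketch:
>> T=sum(Y(:).^2); % data scatter T(Y)
>> E=Y-mu*z*c'; % residuals of the one-component model (4.10)
>> L2=sum(E(:).^2); % least-squares criterion value
>> T-(mu^2+L2) % should be close to zero, as in (4.13)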
In the English-language literature, PCA is introduced not via the decoder model (4.10) or (4.14), but rather as a heuristic technique for building linear combinations of features with the help of the data covariance matrix. The covariance matrix is defined as the V×V matrix A = YTY/N, where Y is a centered version of the data, that is, matrix X with all its columns centered. The (v',v'')-entry in the covariance matrix is the covariance coefficient between features v' and v''; the diagonal elements are the variances of the corresponding features. The covariance matrix is referred to as the correlation matrix if Y has been z-score standardized, that is, further normalized by dividing each column by its standard deviation. Its elements then are the correlation coefficients between the corresponding variables. (Note how a bi-variate concept is carried through to multivariate data by using matrix multiplication.)
The PCA problem is conventionally formulated as follows. Given a centered N×V data matrix Y, find a normed V-dimensional vector c = (cv) such that the sum of the columns of Y weighted by c, f = Yc, has the largest variance possible. This vector is the principal component, termed so because it shows the direction of the maximum variance in the data. Vector f is centered for any c, since Y is centered. Therefore, its variance is s² = <f,f>/N = fTf/N. The last equation follows the convention that a V-dimensional vector is a V×1 matrix, that is, a column. By substituting Yc for f, this leads to s² = cTYTYc/N. Maximizing this with respect to all vectors c that are normed, that is, satisfy the condition cTc = 1, is equivalent to unconditionally maximizing the ratio
q(c) = cT(YTY/N)c / (cTc)     (4.18)
over all V-dimensional vectors c. Expression (4.18) is well known in linear algebra as the Rayleigh quotient for the matrix A = YTY/N, which is, of course, the covariance matrix. The maximum of the Rayleigh quotient is reached at c being an eigenvector, also termed a latent vector, of matrix A corresponding to its maximum eigenvalue (latent value), which is the maximum of q(c) in (4.18).
Vector a is referred to as an eigenvector of a square matrix B if Ba = λa for some number λ, which is referred to as the eigenvalue corresponding to a. In general, an eigenvalue need not be real and may contain an imaginary part. However, in the case of a covariance matrix all eigenvalues are not only real but non-negative as well. The number of non-zero eigenvalues of B is equal to the rank of B, and eigenvectors corresponding to different eigenvalues are orthogonal to each other.
Therefore, the first principal component, in the conventional definition, is the vector f = Yc defined by the eigenvector of the covariance matrix A corresponding to its maximum eigenvalue. The second principal component is defined as another linear combination of the columns of Y, which maximizes its variance under the condition that it is orthogonal to the first principal component. It is defined, of course, by the second eigenvalue and the corresponding eigenvector. Other principal components are defined similarly, in a recursive manner, under the condition that they are orthogonal to all preceding principal components; they correspond to the eigenvalues and eigenvectors further down the order.
This definition seems rather far from how the principal components were introduced above. However, it is rather straightforward to prove the computational equivalence between the two definitions. Indeed, take the equation Yc = μz from (4.12), express z from it as z = Yc/μ, and substitute this z into the other equation of (4.12), YTz = μc, so that YTYc/μ = μc. This implies that μ² and c, defined by the multiplicative decoder model, satisfy the equation

YTYc = μ²c,     (4.19)

that is, c is an eigenvector of the square matrix YTY corresponding to its maximum eigenvalue λ = μ². Since YTY differs from the covariance matrix A by the constant factor N only, this c is also an eigenvector of A corresponding to its maximum eigenvalue (equal to μ²/N). We not only proved the equivalence, but also established a relation between the eigenvalues of YTY and the singular values of Y, λ = μ².
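The equivalence and the eigenvalue/singular value relation can be verified numerically, continuing the sketch above:
>> N=size(Y,1);
>> A=Y'*Y/N; % covariance matrix of the centered data
>> [w,lam]=eig(A); [lmax,imax]=max(diag(lam)); % largest eigenvalue of A and its index
>> ce=w(:,imax); % corresponding eigenvector of A
>> norm(abs(ce)-abs(c)) % coincides with the first right singular vector c up to sign
>> N*lmax-mu^2 % N times the largest eigenvalue of A equals the squared singular value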
In spite of the computational equivalence, there are some conceptual differences between the two definitions.
In contrast to the definition in A.1 based on the multiplicative decoder, the conventional definition is a pure
heuristic, having no underlying model whatsoever. It makes sense only for centered data because of its
reliance on the concept of variance. Moreover, the fact that the principal component is a linear combination of
features is postulated in the conventional definition whereas this is a derived property of the multiplicative
decoder model which does not presume a linear relation.
Q. Could you write equations defining μ² and z as an eigenvalue and the corresponding eigenvector of matrix YYT?
Application to LSI
Latent semantic analysis: an application to document analysis, first of all to information retrieval, using document-to-keyword data.
Presentation
Information retrieval is a problem that no computational intelligence may skip: given a set of stored records or documents, find those related to a specific query expressed by a set of keywords. Initially, at the dawn of the computer era, when all the documents were stored in the same database, the problem was treated in a hard manner: only documents containing the query words were given to the user. Currently, this is very much a soft problem, requiring much intelligence, that is constantly being solved by various search engines such as Yahoo or Google for millions of World Wide Web users.
In its generic format, the problem can be illustrated with the data in Table 4.12, which refers to a number of newspaper articles on subjects such as entertainment, feminism and households, conveniently coded with letters E, F and H, respectively. Columns correspond to keywords, or terms, listed in the first line, and entries refer to the term frequency in the articles according to the conventional coding scheme:
0 – no occurrence,
1 – occurs once,
2 – occurs twice or more.
The user may wish to retrieve all the articles on the subject of households, but there is no way they can do that directly. Access can be gained only with a query that is a keyword or a set of keywords from those available in the database of Table 4.12. For example, the query “fuel” will retrieve six items, E1, E3 and all of H1-H4; the query “tax”, four items, E3, H1, H3, H4, which is a subset of the former and, thus, cannot be improved by using the combination “fuel and tax”.
As one can see, this is very much a one-class description problem; just the decision rules, the queries, must be combinations of keywords. The error of such a query is characterized by two characteristics, precision and recall. For example, the “fuel” query's precision is 4/6 = 2/3, since only four of the six returned items are relevant, and its recall is 1, because all of the relevant documents have been returned.
Table 4.12. An illustrative database of 12 newspaper articles along with 10 terms and the conventional coding of term frequencies. The articles are labeled according to their subjects – Feminism, Entertainment and Household. One additional line holds the document frequencies of the terms (Df) and another, the inverse document frequency weights (Idf).
Article   drink  equal  fuel  play  popular  price  relief  talent  tax   woman
F1          1      2     0     1      2       0       0       0      0      2
F2          0      0     0     1      0       1       0       2      0      2
F3          0      2     0     0      0       0       0       1      0      2
F4          2      1     0     0      0       2       0       2      0      1
E1          2      0     1     2      2       0       0       1      0      0
E2          0      1     0     3      2       1       2       0      0      0
E3          1      0     2     0      1       1       0       3      1      1
E4          0      1     0     1      1       0       1       1      0      0
H1          0      0     2     0      1       2       0       0      2      0
H2          1      0     2     2      0       2       2       0      0      0
H3          0      0     1     1      2       1       1       0      2      0
H4          0      0     1     0      0       2       2       0      2      0
Df          5      5     6     7      7       8       5       6      4      5
Idf       0.88   0.88  0.69  0.54   0.54    0.41    0.88    0.69   1.10   0.88
As one can see, the rigidity of the query format does not fit well with the polysemy of natural language – words such as “fuel” or “play” have more than one meaning, and the more so do phrases – which makes exact information retrieval impossible in many cases.
The method of latent semantic indexing (LSI) utilizes the singular value decomposition (SVD) of the document-to-term data to soften and thus improve the query system by embedding both documents and terms into a subspace of singular vectors of the data matrix.
Before proceeding to the SVD, the data table is sometimes standardized, typically with what is referred to as Term-Frequency Inverse-Document-Frequency (tf-idf) normalization. This procedure gives a different weight to each keyword according to the number of documents it occurs in (its document frequency). The intuition is that the greater the document frequency, the more common and thus less informative the word. The idf weighting assigns each keyword the logarithm of the inverse of its relative document frequency, idf = log(N/df) (see the Computation section for the corresponding MatLab commands). For Table 4.12, these weights are given in its last line.
After the SVD of the data matrix is obtained, the documents are considered as points in the subspace of the first singular vectors. The dimension of this space is not very important here, though it still should be much smaller than the original dimension. Good practical results have been reported at dimensions of about 100-200 when the number of documents is in the tens or hundreds of thousands and the number of keywords is in the thousands. A query is also represented as a point in the same space. The principal components, in general, are considered as “orthogonal” concepts underlying the meanings of the terms. This, however, should not be taken too literally, as the singular vectors can be quite difficult to interpret. The representation of documents and queries as points in a Euclidean space is also sometimes referred to as the vector space model in information retrieval.
This space makes it possible to measure the similarity between items using the inner product or even the correlation coefficient, sometimes referred to as the cosine. A query then returns the set of documents whose similarity to the query point is greater than a threshold. This tool effectively provides a better resolution in the problem of information retrieval, because it helps to separate the different meanings of polysemous words and to bring synonyms together.
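A minimal MatLab sketch of this retrieval rule, using the concept-space coordinates u1, u2 (documents) and d1, d2 (query) computed in the Computation section below; the similarity threshold 0.8 is an illustrative assumption, not a value from the text:
>> D=[u1 u2]; uq=[d1 d2]; % documents and the query in the two-dimensional concept space
>> cosines=(D*uq')./(sqrt(sum(D.^2,2))*norm(uq)); % cosine similarity of each document to the query
>> retrieved=find(cosines>0.8) % indices of the documents returned by the query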
This can be illustrated with the data in Table 4.12: the left part of Figure 4.17 corresponds to the original term frequency codes in Table 4.12 and the right part to the data weighted with the tf-idf transformation.
Figure 4.17. The space of two first principal components for data in Table 4.12 both in the original format, on
the left, and with tf-idf normalization, on the right. Query Q combining four terms, “fuel, price, relief, tax”,
corresponds to the pentagram which is connected to the origin with a line.
As one can see, both representations separate the three subjects, F, E and H, more or less similarly, and provide a rather good resolution for the query Q combining the four keywords “fuel, price, relief, tax”, relevant to Household. Taking into account the position of the origin of the concept space (the circle in the middle of the right boundary), the four H items indeed have a very good angular similarity to the pentagram representing the query Q.
Table 4.13. Two first singular vectors of the term frequency data in Table 4.12

Order       SV    Contrib,%   Left singular vectors normed
1st comp.   8.6     46.9      -0.25 -0.19 -0.34 -0.40 -0.39 -0.42 -0.29 -0.32 -0.24 -0.22
2d comp.    5.3     17.8       0.22  0.34 -0.25 -0.07  0.01 -0.22 -0.35  0.48 -0.33  0.51
Query                          0     0     1     0     0     1     1     0     1     0
Table 4.13 contains the data necessary for computing the coordinates of the query Q in the concept space: the first coordinate is computed by summing up the components of the first singular vector that correspond to the query keywords and dividing the result by the square root of the first singular value: u1 = (-0.34-0.42-0.29-0.24)/8.6½ = -0.44. The second coordinate is computed similarly from the second singular vector and singular value: u2 = (-0.25-0.22-0.35-0.33)/5.3½ = -0.48. These coordinates correspond to the pentagram in the left part of Figure 4.17.
The SVD representation of documents is utilized in other applications such as text mining, web search, text
categorization, software engineering, etc.
Formulation.
To define the characteristics of information retrieval, consider the following contingency table describing a query Q in its relation to a set S of documents to be retrieved: a + b is the number of documents returned by query Q, of which only a documents are relevant; c + d is the number of documents left behind, of which c documents are relevant.
Table 4.14. A statistical representation of the match between query Q and document set S. The entries in this contingency table are counts of S, or not-S, documents returned, or not, with respect to query Q.

            S      Not S    Total
Q           a      b        a+b
Not Q       c      d        c+d
Total       a+c    b+d      N
The precision of Q is defined as the proportion of relevant documents among all those returned:

P(Q) = a/(a+b),

and the recall is defined as the proportion of those returned among all the relevant documents:

R(Q) = a/(a+c).

These characteristics differ from the false positive and false negative rates in the problem of classification:

Fp(Q) = b/(a+b)   and   Fn(Q) = c/(c+d).

As a combined expression of the error, the harmonic mean of precision and recall is usually utilized:

F(Q) = 2/(1/P(Q) + 1/R(Q)) = a/(a + (b+c)/2).
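In MatLab terms, these characteristics can be computed from two logical indicator vectors; the names relevant and returned are illustrative assumptions, marking membership in S and in the answer to Q, respectively:
>> a=sum(relevant & returned); % relevant documents that were returned
>> b=sum(~relevant & returned); % irrelevant documents that were returned
>> c=sum(relevant & ~returned); % relevant documents left behind
>> P=a/(a+b); R=a/(a+c); % precision and recall
>> F=2/(1/P+1/R) % harmonic mean, equal to a/(a+(b+c)/2)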
Using SVD
The full SVD of data matrix F leads to equation F=UMVT where U and V are matrices whose columns are
right and left normed singular vectors of F and M is a diagonal matrix with the corresponding singular values
of X on the diagonal. By leaving only K columns in these matrices, we substitute matrix F by matrix FK=
UKMKVKT so that the entities are represented in the K-dimensional concept space by the rows of matrix
UKMK½.
To translate a query presented as a vector q in the V-dimensional space into the corresponding point u in the K-dimensional concept space, one needs to take the product g = VKTq, which is equal to g = uMK½ according to the definition of singular values, after which u is found by dividing out MK½, that is, u = gMK-½. Specifically, the k-th coordinate of vector u is calculated as uk = <vk, q>/μk½ (k = 1, 2, …, K).
The similarities between rows (documents), corresponding to the row-to-row inner products in the concept space, are computed as UKMK²UKT and, similarly, the similarities between columns (keywords) are computed according to the dual formula VKMK²VKT. Applying this to the case of the K-dimensional point u representing the original V-dimensional vector q, its similarities to the N original entities are computed as uMK²UKT.
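With the MatLab variables uK, mK and vK defined in the Computation section below, these similarity matrices can be sketched as:
>> docSim=uK*mK^2*uK'; % N-by-N inner products between documents in the concept space
>> termSim=vK*mK^2*vK'; % V-by-V inner products between keywords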
Q. Prove that P(Q) = 1 - Fp(Q).
Computation
Let X be an N×V array representing the original frequency data. To convert it to the conventional coding, in which all entries larger than 1 are coded by 2, one can use this operation:
>> Y=min(X,2*ones(N,V));
Computing vector df of document frequencies over matrix Y can be done with this line:
>>df=zeros(1,V); for k=1:V;df(k)=length(find(Y(:,k)>0));end;
and converting df to the inverse-document-frequency weights, with this:
>> idf=log(N./df);
After that, tf-idf normalization can be made by using the command
>>YI=Y.*repmat(idf, N,1);
Given term frequency matrix Y, its K-dimensional concept space is created with commands:
>> [u,m,v]=svd(Y);
>>uK=u(:, [1:K]); vK=v(:, [1:K]); mK=m([1:K], [1:K]);
Specifically, to draw the left part of Figure 4.17, one can define the coordinates with vectors u1 and u2:
>> u1=u(:,1)*sqrt(m(1,1)); %first coordinates of N entities in the concept space
>> u2=u(:,2)*sqrt(m(2,2)); %second coordinates of N entities in the concept space
Then prepare the query vector and its translation to the concept space:
>> q=[0 0 1 0 0 1 1 0 1 0]; % “fuel, price, relief, tax” query vector
>> d1=q*v(:,1)/sqrt(m(1,1)); %first coordinate of query q in the concept space
>> d2=q*v(:,2)/sqrt(m(2,2)); %second coordinate of query q in the concept space
After this, the auxiliary information should be prepared according to MatLab requirements:
>> tt={'E1','E2', …, 'H4'}; % cell array of the names of the items in Y
>> ll=[0:.04:1.5]; ud1=d1*ll; ud2=d2*ll; % line of points through the origin and the point (d1,d2)
Now we are ready for plotting the left drawing in Figure 4.17:
>> subplot(1,2,1);plot(u1,u2,'k.',d1,d2,'kp',0,0,'ko',ud1,ud2);text(u1,u2,tt);text(d1,d2,' Q');axis([-1.5 0 -1 1.2]);
Here the arguments of plot command are:
u1,u2,'k.' – black dots corresponding to the original entities;
d1,d2,'kp' – black pentagram corresponding to query q;
0,0,'ko' – black circle corresponding to the space origin;
ud1,ud2 – line through the query and the origin
Command text provides string labels at the corresponding points. Command axis specifies the boundaries of the Cartesian plane in the figure, which can be used to make different plots uniform.
The plot on the right of Figure 4.17 is coded similarly by using SVD of matrix YI rather than Y.