Independent Component Analysis (ICA) and PCA

Computational BioMedical Informatics
SCE 5095: Special Topics Course
Instructor: Jinbo Bi
Computer Science and Engineering Dept.
1
Course Information

Instructor: Dr. Jinbo Bi
– Office: ITEB 233
– Phone: 860-486-1458
– Email: [email protected]

– Web: http://www.engr.uconn.edu/~jinbo/
– Time: Mon / Wed. 2:00pm – 3:15pm
– Location: CAST 204
– Office hours: Mon. 3:30-4:30pm
HuskyCT
– http://learn.uconn.edu
– Login with your NetID and password
– Illustration
2
Review of previous classes
• Introduced unsupervised learning, particularly cluster analysis techniques
• Discussed one important application of cluster analysis: cardiac ultrasound view recognition
• Reviewed papers on medical, health, or public health topics
• In this class, we start to discuss dimension reduction
3
Topics today
• What is feature reduction?
• Why feature reduction?
• More general motivations of component analysis
• Feature reduction algorithms
  – Principal component analysis
  – Canonical correlation analysis
  – Independent component analysis
4
What is feature reduction?
• Feature reduction refers to the mapping of the original high-dimensional data onto a lower-dimensional space.
  – The criterion for feature reduction can differ depending on the problem setting.
  – Unsupervised setting: minimize the information loss.
  – Supervised setting: maximize the class discrimination.
Given a set of data points of $p$ variables $\{x_1, x_2, \ldots, x_n\}$, compute the linear transformation (projection)
$$G \in \mathbb{R}^{p \times d}: \quad x \in \mathbb{R}^{p} \mapsto y = G^{T} x \in \mathbb{R}^{d} \qquad (d \ll p).$$
5
What is feature reduction?
Linear transformation from the original data $X \in \mathbb{R}^{p}$ to the reduced data $Y \in \mathbb{R}^{d}$:
$$G \in \mathbb{R}^{p \times d}: \quad X \in \mathbb{R}^{p} \mapsto Y = G^{T} X \in \mathbb{R}^{d}, \qquad G^{T} \in \mathbb{R}^{d \times p}.$$
6
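As a concrete illustration of the mapping above, here is a minimal NumPy sketch (the data and the matrix G are made up for illustration; PCA, CCA, and ICA differ only in how G is chosen):

```python
import numpy as np

rng = np.random.default_rng(0)

p, d, n = 10, 2, 100            # original dim, reduced dim, number of points
X = rng.normal(size=(p, n))     # data matrix: one column per data point

# Any p x d matrix with orthonormal columns defines a linear projection.
G, _ = np.linalg.qr(rng.normal(size=(p, d)))

Y = G.T @ X                     # reduced data, shape (d, n)
print(Y.shape)                  # (2, 100)
```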
High-dimensional data
Gene expression
Face images
Handwritten digits
7
Feature reduction versus feature selection
• Feature reduction
  – All original features are used.
  – The transformed features are linear combinations of the original features.
• Feature selection
  – Only a subset of the original features are used.
8
Why feature reduction?
• Most machine learning and data mining techniques may not be effective for high-dimensional data
  – Curse of dimensionality
  – Query accuracy and efficiency degrade rapidly as the dimension increases.
• The intrinsic dimension may be small.
  – For example, the number of genes responsible for a certain type of disease may be small.
9
Why feature reduction?
• Understanding: reason about or obtain insights from data. Combinations of observed variables may be more effective bases for insights, even if the physical meaning is obscure.
• Visualization: projection of high-dimensional data onto 2D or 3D; too much noise in the data.
• Data compression: "reduce" the data to a smaller set of factors for efficient storage and retrieval; a better representation of the data without losing much information.
• Noise removal: can build more effective data analyses on the reduced-dimensional space: classification, clustering, pattern recognition (positive effect on query accuracy).
10
Application of feature reduction
• Face recognition
• Handwritten digit recognition
• Text mining
• Image retrieval
• Microarray data analysis
• Protein classification
11
More general motivations
• We study phenomena that cannot be directly observed
  – ego, personality, intelligence in psychology
  – underlying factors that govern the observed data
• We want to identify and operate with underlying latent factors rather than the observed data
  – e.g., topics in news articles
  – transcription factors in genomics
• We want to discover and exploit hidden relationships
  – "beautiful car" and "gorgeous automobile" are closely related
  – So are "driver" and "automobile"
  – But does your search engine know this?
  – Reduces noise and error in results
12
More general motivations
• Discover a new set of factors/dimensions/axes against which to represent, describe, or evaluate the data
  – For more effective reasoning, insights, or better visualization
  – Reduce noise in the data
  – Typically a smaller set of factors: dimension reduction
  – Better representation of data without losing much information
  – Can build more effective data analyses on the reduced-dimensional space: classification, clustering, pattern recognition
• Factors are combinations of observed variables
  – May be more effective bases for insights, even if physical meaning is obscure
  – Observed data are described in terms of these factors rather than in terms of the original variables/dimensions
13
Feature reduction algorithms
• Unsupervised
  – Principal component analysis (PCA)
  – Canonical correlation analysis (CCA)
  – Independent component analysis (ICA)
• Supervised
  – Linear discriminant analysis (LDA)
  – Sparse support vector machine (SSVM)
• Semi-supervised
  – Research topic
14
Feature reduction algorithms
• Linear
  – Principal Component Analysis (PCA)
  – Linear Discriminant Analysis (LDA)
  – Canonical Correlation Analysis (CCA)
  – Independent Component Analysis (ICA)
• Nonlinear
  – Nonlinear feature reduction using kernels: kernel PCA, kernel CCA, …
  – Manifold learning: LLE, Isomap
15
Basic Concept
• Areas of variance in data are where items can be best discriminated and key underlying phenomena observed
  – Areas of greatest "signal" in the data
• If two items or dimensions are highly correlated or dependent
  – They are likely to represent highly related phenomena
  – If they tell us about the same underlying variance in the data, combining them to form a single measure is reasonable
  – Parsimony; reduction in error
• So we want to combine related variables, and focus on uncorrelated or independent ones, especially those along which the observations have high variance
• We want a smaller set of variables that explain most of the variance in the original data, in a more compact and insightful form
16
Basic Concept
• What if the dependences and correlations are not so strong or direct?
• And suppose you have 3 variables, or 4, or 5, or 10,000?
• Look for the phenomena underlying the observed covariance/co-dependence in a set of variables
  – Once again, phenomena that are uncorrelated or independent, and especially those along which the data show high variance
• These phenomena are called "factors" or "principal components" or "independent components," depending on the methods used
  – Factor analysis: based on variance/covariance/correlation
  – Independent Component Analysis: based on independence
17
What is Principal Component Analysis?
• Principal component analysis (PCA)
  – Reduces the dimensionality of a data set by finding a new set of variables, smaller than the original set of variables
  – Retains most of the sample's information
  – Useful for the compression and classification of data
• By information we mean the variation present in the sample, given by the correlations between the original variables.
18
Principal Component Analysis
• Most common form of factor analysis
• The new variables/dimensions
  – are linear combinations of the original ones
  – are uncorrelated with one another (orthogonal in the original dimension space)
  – capture as much of the original variance in the data as possible
  – are called Principal Components
  – are ordered by the fraction of the total information each retains
19
Some Simple Demos
• http://www.cs.mcgill.ca/~sqrt/dimr/dimreduction.html
20
What are the new axes?
[Figure: PC 1 and PC 2 plotted over the axes Original Variable A and Original Variable B]
• Orthogonal directions of greatest variance in data
• Projections along PC1 discriminate the data most along any one axis
21
Principal Components
• First principal component is the direction of greatest variability (covariance) in the data
• Second is the next orthogonal (uncorrelated) direction of greatest variability
  – So first remove all the variability along the first component, and then find the next direction of greatest variability
• And so on … (a small sketch of this idea follows this slide)
22
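One way to make the "remove the variability along the first component, then repeat" idea concrete is power iteration with deflation. This is only an illustrative sketch (the function name and data layout are my own, not prescribed by the slides):

```python
import numpy as np

def top_pcs_by_deflation(X, k, iters=500):
    """Top-k principal directions of row-wise data X (n x p) via
    power iteration, deflating the covariance matrix after each PC."""
    rng = np.random.default_rng(0)
    Xc = X - X.mean(axis=0)             # center each variable
    C = Xc.T @ Xc / (len(Xc) - 1)       # p x p covariance matrix
    comps = []
    for _ in range(k):
        u = rng.normal(size=C.shape[0])
        for _ in range(iters):          # power iteration -> leading eigenvector
            u = C @ u
            u /= np.linalg.norm(u)
        lam = u @ C @ u                 # variance captured along u
        comps.append(u)
        C -= lam * np.outer(u, u)       # deflate: remove variability along u
    return np.array(comps)              # k x p, rows are principal directions
```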
Principal Components Analysis (PCA)
• Principle
  – Linear projection method to reduce the number of parameters
  – Transforms a set of correlated variables into a new set of uncorrelated variables
  – Maps the data into a space of lower dimensionality
  – A form of unsupervised learning
• Properties
  – It can be viewed as a rotation of the existing axes to new positions in the space defined by the original variables
  – New axes are orthogonal and represent the directions of maximum variability
23
Computing the Components
• Data points are vectors in a multidimensional space
• The projection of vector $x$ onto an axis (dimension) $u$ is $u^{T} x$
• The direction of greatest variability is that in which the average square of the projection is greatest
  – i.e., $u$ such that $E((u^{T} x)^2)$ over all $x$ is maximized
  – (we subtract the mean along each dimension, and center the original axis system at the centroid of all data points, for simplicity)
  – This direction of $u$ is the direction of the first Principal Component
24
Computing the Components
• $E((u^{T} x)^2) = E((u^{T} x)(u^{T} x)) = E(u^{T} x\, x^{T} u)$
• The matrix $C = x x^{T}$ contains the correlations (similarities) of the original axes based on how the data values project onto them
• So we are looking for $u$ that maximizes $u^{T} C u$, subject to $u$ being unit-length
• It is maximized when $u$ is the principal eigenvector of the matrix $C$, in which case
  – $u^{T} C u = u^{T} \lambda u = \lambda$ if $u$ is unit-length, where $\lambda$ is the principal eigenvalue of the covariance matrix $C$
  – The eigenvalue denotes the amount of variability captured along that dimension
25
Why the Eigenvectors?
Maximise $u^{T} x x^{T} u$ subject to $u^{T} u = 1$.
Construct the Lagrangian $u^{T} x x^{T} u - \lambda u^{T} u$.
Setting the vector of partial derivatives to zero:
$$x x^{T} u - \lambda u = (x x^{T} - \lambda I)\, u = 0.$$
Since $u \neq 0$, $u$ must be an eigenvector of $x x^{T}$ with eigenvalue $\lambda$.
26
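The derivation above translates directly into a few lines of NumPy: form the covariance matrix of the centered data and take its leading eigenvector. A minimal sketch with made-up data:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))           # 200 samples, 5 variables (rows = samples)

Xc = X - X.mean(axis=0)                 # subtract the mean along each dimension
C = Xc.T @ Xc / (len(Xc) - 1)           # covariance matrix C

eigvals, eigvecs = np.linalg.eigh(C)    # eigh: C is symmetric
order = np.argsort(eigvals)[::-1]       # sort by decreasing eigenvalue
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

u1 = eigvecs[:, 0]                      # first principal direction
print(u1 @ C @ u1, eigvals[0])          # u^T C u equals the principal eigenvalue
```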
Singular Value Decomposition
The first root is called the principal eigenvalue, which has an associated orthonormal ($u^{T} u = 1$) eigenvector $u$.
Subsequent roots are ordered such that $\lambda_1 > \lambda_2 > \dots > \lambda_M$, with $\mathrm{rank}(D)$ non-zero values.
Eigenvectors form an orthonormal basis, i.e., $u_i^{T} u_j = \delta_{ij}$.
The eigenvalue decomposition: $C = x x^{T} = U \Sigma U^{T}$, where $U = [u_1, u_2, \ldots, u_M]$ and $\Sigma = \mathrm{diag}[\lambda_1, \lambda_2, \ldots, \lambda_M]$.
Similarly, the eigenvalue decomposition of $x^{T} x = V \Sigma V^{T}$.
The SVD is closely related to the above: $x = U \Sigma^{1/2} V^{T}$.
The left eigenvectors are $U$, the right eigenvectors are $V$, and the singular values are the square roots of the eigenvalues.
27
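The relation between the eigen-decomposition of $x x^{T}$ and the SVD of $x$ can be checked numerically; a small sketch under the slide's convention that columns of the data matrix are data points:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(4, 50))                 # p x n data matrix (columns = points)
X -= X.mean(axis=1, keepdims=True)           # center

# Eigen-decomposition of C = X X^T
eigvals, U = np.linalg.eigh(X @ X.T)
eigvals, U = eigvals[::-1], U[:, ::-1]       # decreasing order

# SVD of X: singular values are square roots of the eigenvalues of X X^T
Usvd, svals, Vt = np.linalg.svd(X, full_matrices=False)

print(np.allclose(svals**2, eigvals))        # True (up to numerical error)
```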
Computing the Components
• Similarly for the next axis, etc.
• So, the new axes are the eigenvectors of the covariance matrix of the original variables, which captures the similarities of the original variables based on how data samples project onto them
• Geometrically: centering followed by rotation
  – Linear transformation
28
PCs, Variance and Least-Squares
• The first PC retains the greatest amount of variation in the sample
• The kth PC retains the kth greatest fraction of the variation in the sample
• The kth largest eigenvalue of the correlation matrix C is the variance in the sample along the kth PC
• The least-squares view: PCs are a series of linear least-squares fits to a sample, each orthogonal to all previous ones
29
How Many PCs?
• For n original dimensions, the correlation matrix is n×n and has up to n eigenvectors. So n PCs.
• Where does dimensionality reduction come from?
30
Dimensionality Reduction
Can ignore the components of lesser significance.
[Figure: bar chart of the variance (%) captured by PC1 through PC10; the leading components capture most of it]
You do lose some information, but if the eigenvalues are small, you don't lose much.
– p dimensions in the original data
– calculate p eigenvectors and eigenvalues
– choose only the first d eigenvectors, based on their eigenvalues
– the final data set has only d dimensions
31
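Choosing d is usually done by looking at how much variance the leading eigenvalues capture. A sketch of that bookkeeping (the 90% threshold and the synthetic data are my own arbitrary examples, not a rule from the slides):

```python
import numpy as np

def choose_d(X, target=0.90):
    """Smallest d such that the first d PCs explain `target` of the variance."""
    Xc = X - X.mean(axis=0)
    eigvals = np.linalg.eigvalsh(Xc.T @ Xc / (len(Xc) - 1))[::-1]
    ratio = np.cumsum(eigvals) / eigvals.sum()   # cumulative fraction of variance
    return int(np.searchsorted(ratio, target) + 1)

rng = np.random.default_rng(3)
latent = rng.normal(size=(500, 3))               # intrinsic dimension is 3
X = latent @ rng.normal(size=(3, 10)) + 0.05 * rng.normal(size=(500, 10))
print(choose_d(X))                               # typically prints 3
```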
Eigenvectors of a Correlation Matrix
32
Geometric picture of principal components
• The 1st PC $z_1$ is a minimum-distance fit to a line in X space.
• The 2nd PC $z_2$ is a minimum-distance fit to a line in the plane perpendicular to the 1st PC.
PCs are a series of linear least-squares fits to a sample, each orthogonal to all the previous ones.
33
Optimality property of PCA
Dimension reduction: $X \in \mathbb{R}^{p \times n} \;\rightarrow\; Y = G^{T} X \in \mathbb{R}^{d \times n}$, where $G \in \mathbb{R}^{p \times d}$ and $G^{T} \in \mathbb{R}^{d \times p}$.
Reconstruction: $Y = G^{T} X \in \mathbb{R}^{d \times n} \;\rightarrow\; \tilde{X} = G(G^{T} X) \in \mathbb{R}^{p \times n}$.
34
Optimality property of PCA
Main theoretical result:
The matrix $G$ consisting of the first $d$ eigenvectors of the covariance matrix $S$ solves the following minimization problem:
$$\min_{G \in \mathbb{R}^{p \times d}} \; \|X - G(G^{T} X)\|_F^2 \quad \text{subject to } G^{T} G = I_d,$$
where $\|X - \tilde{X}\|_F^2$ is the reconstruction error.
The PCA projection minimizes the reconstruction error among all linear projections of size $d$.
35
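The reconstruction view can be verified numerically: projecting with the top-d eigenvectors and mapping back with G gives a smaller Frobenius reconstruction error than some other orthonormal projection of the same size. A small sketch with synthetic data:

```python
import numpy as np

rng = np.random.default_rng(4)
p, n, d = 8, 300, 3
X = rng.normal(size=(p, n))                 # columns are data points
X -= X.mean(axis=1, keepdims=True)          # center

eigvals, eigvecs = np.linalg.eigh(X @ X.T)
G = eigvecs[:, ::-1][:, :d]                 # first d eigenvectors, G^T G = I_d

X_hat = G @ (G.T @ X)                       # reconstruction from the d-dim projection
err_pca = np.linalg.norm(X - X_hat, 'fro')**2

G_rand, _ = np.linalg.qr(rng.normal(size=(p, d)))   # some other orthonormal G
err_rand = np.linalg.norm(X - G_rand @ (G_rand.T @ X), 'fro')**2
print(err_pca <= err_rand)                  # True: PCA minimizes the error
```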
Applications of PCA
• Eigenfaces for recognition. Turk and Pentland. 1991.
• Principal Component Analysis for clustering gene expression data. Yeung and Ruzzo. 2001.
• Probabilistic Disease Classification of Expression-Dependent Proteomic Data from Mass Spectrometry of Human Serum. Lilien. 2003.
36
PCA applications – Eigenfaces
• The principal eigenface looks like a bland, androgynous average human face
http://en.wikipedia.org/wiki/Image:Eigenfaces.png
37
Eigenfaces – Face Recognition
• When properly weighted, eigenfaces can be summed together to create an approximate grayscale rendering of a human face.
• Remarkably few eigenvector terms are needed to give a fair likeness of most people's faces.
• Hence eigenfaces provide a means of applying data compression to faces for identification purposes.
• Similarly: Expert Object Recognition in Video
38
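A hedged sketch of the eigenface idea using scikit-learn (my choice of library and dataset, not something the slides specify; fetch_olivetti_faces downloads the Olivetti face data on first use):

```python
import numpy as np
from sklearn.datasets import fetch_olivetti_faces
from sklearn.decomposition import PCA

faces = fetch_olivetti_faces()            # 400 images of 64x64 = 4096 pixels
X = faces.data                            # shape (400, 4096)

pca = PCA(n_components=50).fit(X)         # eigenfaces = principal directions
eigenfaces = pca.components_.reshape(-1, 64, 64)

# Represent the first face by only 50 eigenface weights, then reconstruct it
weights = pca.transform(X[:1])            # 50 numbers instead of 4096 pixels
approx = pca.inverse_transform(weights).reshape(64, 64)
```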
PCA for image compression
[Figure: reconstructions of an image using d = 1, 2, 4, 8, 16, 32, 64, and 100 principal components, alongside the original image]
39
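A sketch of the kind of experiment behind the figure: treat the rows of a grayscale image as data points, keep d principal components, and reconstruct. The image array `img` is assumed to be supplied by the reader; the function name is my own.

```python
import numpy as np

def pca_compress(img, d):
    """Compress a 2-D grayscale image by keeping d principal components
    of its rows, then reconstruct it. `img` is an (h, w) float array."""
    mean = img.mean(axis=0)
    Xc = img - mean                            # center the rows
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    G = Vt[:d].T                               # w x d: top-d principal directions
    return (Xc @ G) @ G.T + mean               # reconstruction from d components

# e.g. compare pca_compress(img, 8) with pca_compress(img, 64) and the original
```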
Topics today
• Feature reduction algorithms
  – Principal component analysis
  – Canonical correlation analysis
  – Independent component analysis
40
Canonical correlation analysis (CCA)
• CCA was developed first by H. Hotelling.
  – H. Hotelling. Relations between two sets of variates. Biometrika, 28:321-377, 1936.
• CCA measures the linear relationship between two multidimensional variables.
• CCA finds two bases, one for each variable, that are optimal with respect to correlations.
• Applications in economics, medical studies, bioinformatics, and other areas.
41
Canonical correlation analysis (CCA)
• Two multidimensional variables
  – Two different measurements on the same set of objects
    Web images and associated text
    Protein (or gene) sequences and related literature (text)
    Protein sequence and corresponding gene expression
    In classification: feature vector and class label
  – Two measurements on the same object are likely to be correlated.
    May not be obvious on the original measurements.
    Find the maximum correlation on the transformed space.
42
Canonical Correlation Analysis (CCA)
[Figure: measurements $X^{T}$ and $Y^{T}$ are mapped by transformations $W_X$ and $W_Y$ to transformed data, between which the correlation is computed]
43
Problem definition
• Find two sets of basis vectors, one for x and the other for y, such that the correlations between the projections of the variables onto these basis vectors are maximized.
Given $x$ and $y$, compute two basis vectors $w_x$ and $w_y$ for the projections $x \mapsto \langle w_x, x \rangle$ and $y \mapsto \langle w_y, y \rangle$.
44
Problem definition

Compute the two basis vectors so that the
correlations of the projections onto these vectors
are maximized.
45
Algebraic derivation of CCA
The optimization problem is equivalent to
$$\max_{w_x, w_y} \; w_x^{T} C_{xy} w_y \quad \text{s.t. } w_x^{T} C_{xx} w_x = 1, \; w_y^{T} C_{yy} w_y = 1,$$
where $C_{xy} = XY^{T}$, $C_{xx} = XX^{T}$, $C_{yx} = YX^{T}$, $C_{yy} = YY^{T}$.
46
Geometric interpretation of CCA
• The geometry of CCA: let $a_j = X^{T} w_x$ and $b_j = Y^{T} w_y$. Then
$$\|a_j\|^2 = w_x^{T} X X^{T} w_x = 1, \qquad \|b_j\|^2 = w_y^{T} Y Y^{T} w_y = 1,$$
$$\|a_j - b_j\|^2 = \|a_j\|^2 + \|b_j\|^2 - 2\,\mathrm{corr}(a_j, b_j).$$
Maximization of the correlation is equivalent to the minimization of the distance.
47
Algebraic derivation of CCA
The optimization problem is equivalent to
$$\max_{w_x, w_y} \; w_x^{T} C_{xy} w_y \quad \text{s.t. } w_x^{T} C_{xx} w_x = 1, \; w_y^{T} C_{yy} w_y = 1.$$
48
Algebraic derivation of CCA
Note that $C_{yx} = C_{xy}^{T}$, and the eigenvalue problems for $w_x$ and $w_y$ share the same eigenvalue: $\lambda_x = \lambda_y = \lambda$.
49
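The derivation reduces CCA to a generalized eigenvalue problem in the covariance blocks defined above. A minimal NumPy/SciPy sketch under that formulation (the function name and the small ridge terms for numerical stability are my additions, not from the slides):

```python
import numpy as np
from scipy.linalg import eigh

def cca_first_pair(X, Y, reg=1e-8):
    """First pair of canonical vectors for column-wise data X (p x n), Y (q x n)."""
    X = X - X.mean(axis=1, keepdims=True)
    Y = Y - Y.mean(axis=1, keepdims=True)
    Cxx = X @ X.T + reg * np.eye(X.shape[0])
    Cyy = Y @ Y.T + reg * np.eye(Y.shape[0])
    Cxy = X @ Y.T
    # w_x solves  Cxy Cyy^{-1} Cyx w_x = lambda^2 Cxx w_x  (generalized eigenproblem)
    M = Cxy @ np.linalg.solve(Cyy, Cxy.T)
    vals, vecs = eigh(M, Cxx)               # generalized symmetric eigensolver
    wx = vecs[:, -1]                        # eigenvector with the largest eigenvalue
    wy = np.linalg.solve(Cyy, Cxy.T @ wx)   # corresponding w_y (up to scale)
    wy /= np.sqrt(wy @ Cyy @ wy)            # normalize so w_y^T Cyy w_y = 1
    return wx, wy
```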
Applications in bioinformatics
• CCA can be extended to multiple views of the data
  – Multiple (more than 2) data sources
• Two different ways to combine different data sources
  – Multiple CCA: consider all pairwise correlations
  – Integrated CCA: divide into two disjoint sources
50
Applications in bioinformatics
Source: Extraction of Correlated Gene Clusters from Multiple
Genomic Data by Generalized Kernel Canonical Correlation
Analysis. ISMB’03
http://cg.ensmp.fr/~vert/publi/ismb03/ismb03.pdf
51
Topics today
• Feature reduction algorithms
  – Principal component analysis
  – Canonical correlation analysis
  – Independent component analysis
52
Independent component analysis (ICA)
[Figure: PCA finds an orthogonal coordinate system; ICA finds a non-orthogonal coordinate system]
53
Cocktail-party problem
• Multiple sound sources in a room (independent)
• Multiple sensors receiving signals that are mixtures of the original signals
• Estimate the original source signals from the mixtures of received signals
• Can be viewed as Blind Source Separation, as the mixing parameters are not known
54
DEMO: Blind Source Separation
http://www.cis.hut.fi/projects/ica/cocktail/cocktail_en.cgi
55
BSS and ICA
• Cocktail party or Blind Source Separation (BSS) problem
  – Ill-posed problem, unless assumptions are made!
• The most common assumption is that the source signals are statistically independent. This means knowing the value of one of them gives no information about the others.
• Methods based on this assumption are called Independent Component Analysis methods
  – statistical techniques for decomposing a complex data set into independent parts.
• It can be shown that under some reasonable conditions, if the ICA assumption holds, then the source signals can be recovered up to permutation and scaling.
56
Source Separation Using ICA
[Figure: the two microphone signals are combined through the unmixing weights W11, W12, W21, W22 to produce Separation 1 and Separation 2]
57
Original signals (hidden sources) $s_1(t), s_2(t), s_3(t), s_4(t)$, $t = 1, \ldots, T$
58
The ICA model
$$x_i(t) = a_{i1} s_1(t) + a_{i2} s_2(t) + a_{i3} s_3(t) + a_{i4} s_4(t), \qquad i = 1, \ldots, 4.$$
In vector-matrix notation, and dropping the index $t$, this is $x = A s$.
[Figure: sources $s_1, \ldots, s_4$ are mixed through the weights $a_{ij}$ to produce the observations $x_1, \ldots, x_4$]
59
This is recorded by the microphones: a linear mixture of the sources
$$x_i(t) = a_{i1} s_1(t) + a_{i2} s_2(t) + a_{i3} s_3(t) + a_{i4} s_4(t).$$
60
Recovered signals
61
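A hedged end-to-end sketch of the toy experiment in the figures, using scikit-learn's FastICA (one common ICA implementation; the four source waveforms and the mixing matrix here are invented for illustration). As noted earlier, the recovered sources come back only up to permutation and scaling.

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
t = np.linspace(0, 8, 2000)

# Four "hidden" non-Gaussian sources s1(t)..s4(t)
S = np.c_[np.sin(2 * t),                 # sinusoid
          np.sign(np.sin(3 * t)),        # square wave
          ((t * 2) % 2) - 1,             # sawtooth
          rng.laplace(size=t.size)]      # impulsive noise

A = rng.normal(size=(4, 4))              # unknown mixing matrix
X = S @ A.T                              # observed mixtures x = A s (rows = time points)

ica = FastICA(n_components=4, random_state=0)
S_est = ica.fit_transform(X)             # recovered sources, up to permutation/scale
```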
BSS
Problem: determine the source signals s, given only the mixtures x.
• If we knew the mixing parameters aij, then we would just need to solve a linear system of equations.
• We know neither aij nor si.
• ICA was initially developed to deal with problems closely related to the cocktail party problem.
• Later it became evident that ICA has many other applications
  – e.g., from electrical recordings of brain activity from different locations of the scalp (EEG signals), recover underlying components of brain activity
62
ICA Solution and Applicability
• ICA is a statistical method, the goal of which is to decompose given multivariate data into a linear sum of statistically independent components.
• For example, given a two-dimensional vector $x = [x_1 \; x_2]^{T}$, ICA aims at finding the following decomposition:
$$\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} a_{11} \\ a_{21} \end{bmatrix} s_1 + \begin{bmatrix} a_{12} \\ a_{22} \end{bmatrix} s_2, \qquad \text{i.e., } x = a_1 s_1 + a_2 s_2,$$
where $a_1, a_2$ are basis vectors and $s_1, s_2$ are basis coefficients.
Constraint: the basis coefficients $s_1$ and $s_2$ are statistically independent.
63
Approaches to ICA
• Unsupervised Learning
  – Factorial coding (minimum entropy coding, redundancy reduction)
  – Maximum likelihood learning
  – Nonlinear information maximization (entropy maximization)
  – Negentropy maximization
  – Bayesian learning
• Statistical Signal Processing
  – Higher-order moments or cumulants
  – Joint approximate diagonalization
  – Maximum likelihood estimation
64
Application domains of ICA
• Blind source separation
• Image denoising
• Medical signal processing – fMRI, ECG, EEG
• Modelling of the hippocampus and visual cortex
• Feature extraction, face recognition
• Compression, redundancy reduction
• Watermarking
• Clustering
• Time series analysis (stock market, microarray data)
• Topic extraction
• Econometrics: finding hidden factors in financial data
65
Image denoising
[Figure: original image, noisy image, Wiener-filtered image, and ICA-filtered image]
66