S - R Project

Extensions of CCA and PLS to unravel relationships between two data sets
S. Déjean(1) - I. González(2) - K-A. Lê Cao(3)
1. Institut de Mathématiques de Toulouse, UMR 5219
Université de Toulouse et CNRS
[email protected]
2. Plateforme Biopuces, Genopôle Toulouse Midi-Pyrénées
Institut National des Sciences Appliquées
[email protected]
3. ARC Centre of Excellence in Bioinformatics
Institute for Molecular Bioscience, University of Queensland, Australia
[email protected]
Conference 2009 – Rennes, July 8-10
History
Once upon a time in Toulouse, a city in the southwest of France,
two groups of scientists lived close to each other without
talking to each other.
But one day, they decided to talk and to work together.
They supervised Ph.D. students, wrote articles and built R packages...
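[Figure: the « Stat » group (formulas such as $\frac{1}{n}\sum_{i=1}^{n} X_i$ and $(X'X)^{-1}X'Y$) and the « Bio » group (DNA, RNA, sequences such as ATGCC, TACCAGT) come together through articles (J. Stat. Soft., SAGMB, J. Biol. Syst., BMC Bioinformatics) and R packages (ofw, CCA, integrOmics)]
UseR Conference 2009, Rennes, July 8-10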
S. Déjean, I. González, K-A Lê Cao – integrOmics
Why integrOmics?
[Figure: a biological system surrounded by Gen-omics, Transcript-omics, Prote-omics, Metabol-omics, Lipid-omics, ...-omics data]
● Each « -omics » data set can be studied separately, but
● a large part of the relevant information can be extracted from the joint analysis of two or more data sets, so
⇒ "Integrate omics data" project, in short: integrOmics
math.univ-toulouse.fr/biostat
Methodology, Applications, Software
Methodology
Two ways to deal with the 'large p - small n' problem in the classical
framework provided by Canonical Correlation Analysis (CCA) and Partial
Least Squares (PLS) regression.
[Figure: data matrices X and Y, with n samples, p variables in X and q variables in Y]
CCA / regularization
● Maximize the correlation between linear combinations of the variables in X and Y
● Requires inversion of XX' and YY'
● Regularization of CCA: $(XX')^{-1} \Rightarrow (XX' + \lambda I_n)^{-1}$
PLS / selection
● Maximize the covariance between linear combinations of the variables in X and Y
● Selection obtained through Lasso penalization of the loading vectors
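The regularization idea on the Methodology slide can be illustrated with a minimal base-R sketch. It is not the integrOmics implementation. The function name `rcca_sketch`, the simulated data and the chosen penalties are hypothetical, and the sketch assumes that samples are in rows, so the matrices being inverted are the p × p and q × q sample covariance matrices. When p > n these matrices are singular, and adding a ridge term on the diagonal makes them invertible.

```r
# Minimal sketch of ridge-regularized CCA in base R (illustrative only;
# rcca_sketch and the simulated data are hypothetical, not integrOmics code).
set.seed(1)
n <- 10; p <- 30; q <- 20          # 'large p - small n' setting
X <- matrix(rnorm(n * p), n, p)    # assumption: samples in rows, variables in columns
Y <- matrix(rnorm(n * q), n, q)

rcca_sketch <- function(X, Y, lambda1, lambda2) {
  # Add a ridge term to each within-set covariance matrix so that it
  # is invertible even with more variables than samples.
  Sxx <- cov(X) + lambda1 * diag(ncol(X))
  Syy <- cov(Y) + lambda2 * diag(ncol(Y))
  Sxy <- cov(X, Y)
  # The squared (regularized) canonical correlations are the eigenvalues
  # of Sxx^{-1} Sxy Syy^{-1} Syx.
  M  <- solve(Sxx, Sxy) %*% solve(Syy, t(Sxy))
  ev <- Re(eigen(M, only.values = TRUE)$values)
  rho <- sqrt(pmax(ev, 0))
  # At most min(n - 1, p, q) correlations can be non-zero.
  k <- min(nrow(X) - 1, ncol(X), ncol(Y))
  sort(rho, decreasing = TRUE)[seq_len(k)]
}

# Without regularization cov(X) is rank deficient, so it cannot be inverted.
rank_Sxx <- qr(cov(X))$rank

# With small ridge penalties the computation goes through.
rho_ridge <- rcca_sketch(X, Y, lambda1 = 0.05, lambda2 = 0.01)
```

Larger values of `lambda1` and `lambda2` shrink the canonical correlations further away from 1. In practice the penalties would be tuned, for example by cross-validation.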
Applications
● E. Yergeau, S.A. Schoondermark-Stolk, E.L. Brodie, S. Déjean, T.Z. DeSantis, O. Gonçalves, Y.M. Piceno, G.L. Andersen and G.A. Kowalchuk (2008). Environmental microarray analyses of Antarctic soil microbial communities. The International Society for Microbial Ecology Journal, 3, 340-351
● S. Combes, I. González, S. Déjean, A. Baccini, N. Jehl, H. Juin, L. Cauquil, B. Gabinaud, F. Lebas, C. Larzul (2008). Relationships between sensory and physicochemical measurements in meat of rabbit from three different breeding systems using canonical correlation analysis. Meat Science, 80(3), 835-841
● I. González, S. Déjean, P.G.P. Martin, O. Gonçalves, P. Besse, A. Baccini (2009). Highlighting Relationships Between Heterogeneous Biological Data Through Graphical Displays Based On Regularized Canonical Correlation Analysis. Journal of Biological Systems, 17(2), 173-199
● K. A. Lê Cao, D. Rossouw, C. Robert-Granié, P. Besse (2008). A sparse PLS for variable selection when integrating Omics data. Statistical Applications in Genetics and Molecular Biology, 7(1), article 35
IntegrOmics R package
Using integrOmics
● Start from two matrices X and Y
● Preliminary view of the correlation matrices
R> imgCor(X, Y, type = "separate")
● Classical CCA
R> res.rcc = rcc(X, Y)
● Regularized CCA
R> res.rcc = rcc(X, Y, 0.05, 0.01)
● PLS
R> res.pls = pls(X, Y)
● Sparse PLS
R> res.pls = spls(X, Y, mode=c("regression", "canonical"),
+ keep.X=c(10, 10, 10), keep.Y=c(10, 10, 10))
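To give an idea of what the sparse PLS call does with its `keep.X` and `keep.Y` arguments, here is a rough base-R sketch of Lasso-type soft-thresholding applied to the loading vectors of a single component. It is an assumption-laden illustration, not the package code. The helper names `keep_top` and `spls_first_component`, the simulated data and the fixed number of iterations are all hypothetical, and no deflation for further components is shown.

```r
# Rough sketch of one sparse PLS component via soft-thresholding
# (hypothetical helpers; not the integrOmics implementation).

# Soft-threshold a vector so that only its k largest entries (in absolute
# value) stay non-zero: the threshold is the (k + 1)-th largest |a|.
keep_top <- function(a, k) {
  a <- drop(a)
  if (k >= length(a)) return(a)
  lambda <- sort(abs(a), decreasing = TRUE)[k + 1]
  sign(a) * pmax(abs(a) - lambda, 0)
}

spls_first_component <- function(X, Y, keep_x, keep_y, n_iter = 50) {
  X <- scale(X, center = TRUE, scale = FALSE)
  Y <- scale(Y, center = TRUE, scale = FALSE)
  M <- crossprod(X, Y)                 # p x q cross-product matrix X'Y
  sv <- svd(M, nu = 1, nv = 1)         # start from the leading singular vectors
  u <- sv$u[, 1]
  v <- sv$v[, 1]
  for (i in seq_len(n_iter)) {
    u <- keep_top(M %*% v, keep_x)     # sparse X loadings
    u <- u / sqrt(sum(u^2))
    v <- keep_top(crossprod(M, u), keep_y)  # sparse Y loadings
    v <- v / sqrt(sum(v^2))
  }
  list(loadings.X = u, loadings.Y = v,
       variates.X = drop(X %*% u), variates.Y = drop(Y %*% v))
}

set.seed(2)
n <- 15; p <- 40; q <- 12
X <- matrix(rnorm(n * p), n, p)
Y <- matrix(rnorm(n * q), n, q)
fit <- spls_first_component(X, Y, keep_x = 10, keep_y = 5)
```

Here `keep_x = 10` plays the role of one entry of `keep.X`: ten X variables keep a non-zero loading on the component and all others are set to zero, which is how variable selection arises.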
Graphical display
R> plotVar(res.rcc, X.label=TRUE, Y.label=TRUE)
R> plotIndiv(res.rcc, ind.names=nutrimouse$diet)
[Figure, left panel: Variables plot (axes Comp 1 and Comp 2), showing gene and fatty acid variables by name (e.g. CYP4A10, PLTP, C20.5n.3, C22.6n.3)]
[Figure, right panel: Individuals plot (axes Dimension 1 and Dimension 2), with samples labelled by diet (coc, fish, lin, ref, sun) and a WT / PPAR legend]
Future work
● Provide new graphical displays using graphs
● Methodologies to deal with more than 2 data sets
● Functional statistics to deal with metabolomics or proteomics data
[Figures: a graph with numbered nodes; data blocks with p1, p2, p3, p4 variables measured on n samples; a plot with numeric axes]
Summary
[Figure: timeline from 2003 to 2009 with the following milestones]
● First steps of collaborations between biologists and statisticians in Toulouse around Omics technologies
● Ph.D. K-A. Lê Cao (P. Besse, C. Robert-Granié)
● Ph.D. I. González (A. Baccini, J.R. Léon)
● First article « state of the art »
● R packages: CCA, ofw, integrOmics
● Conference, Rennes