The selection of relevant groups of explanatory variables in
GWA studies
Damian Brzyski
Department of Mathematics and Computer Science, Wroclaw University of
Technology, Poland
Institute of Mathematics, Jagiellonian University, Cracow, Poland
31.08.2015
Genome-wide association studies (GWAS)
The trait values y_1, . . . , y_n of individuals i_1, . . . , i_n are modeled through the genotype matrix X, whose columns correspond to genetic variants G_1, . . . , G_p:

\begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix} = \begin{pmatrix} x_{11} & \cdots & x_{1p} \\ \vdots & \ddots & \vdots \\ x_{n1} & \cdots & x_{np} \end{pmatrix} \begin{pmatrix} \beta_1 \\ \vdots \\ \beta_p \end{pmatrix} + z, \qquad \text{i.e. } y = X\beta + z
Model selection problem
Consider the linear regression model of the form y = Xβ + z, with design matrix X ∈ M(n, p) (with centered, ℓ2-normalized columns), observation vector y and z ∼ N(0, σ²I_n)
We assume that the number of nonzero coefficients in β is small compared to p (i.e. β is sparse)
The task is to find the support of β, which corresponds to identifying the relevant explanatory variables
Existing penalized methods: LASSO
LASSO is defined as a solution to

\arg\min_{b}\ \left\{ \frac{1}{2}\|y - Xb\|_2^2 + \lambda_L \|b\|_1 \right\}, \qquad \text{(LASSO)}

with λ_L > 0 being a tuning parameter
General rule: reducing λ_L identifies more elements of the true support (true discoveries), but at the same time produces more falsely identified variables (false discoveries)
Choosing λ_L is challenging: it is not obvious which sparsity level should be regarded as the proper one
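For illustration, a minimal proximal-gradient (ISTA) sketch for the LASSO objective above; the variable names and the step-size choice are illustrative assumptions, not part of the talk.

```python
import numpy as np

def lasso_ista(X, y, lam_L, n_iter=500):
    """Minimize 0.5*||y - X b||_2^2 + lam_L*||b||_1 by iterative soft-thresholding."""
    n, p = X.shape
    b = np.zeros(p)
    t = 1.0 / (np.linalg.norm(X, 2) ** 2)   # step size 1/L, L = largest eigenvalue of X^T X
    for _ in range(n_iter):
        u = b - t * (X.T @ (X @ b - y))      # gradient step on the smooth part
        b = np.sign(u) * np.maximum(np.abs(u) - t * lam_L, 0.0)  # soft-thresholding
    return b
```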
Existing penalized methods: SLOPE
SLOPE is defined as a solution to

\arg\min_{b}\ \left\{ \frac{1}{2}\|y - Xb\|_2^2 + \sigma \sum_{i=1}^{p} \lambda_i |b|_{(i)} \right\}, \qquad \text{(SLOPE)}

where |b|_{(1)} ≥ . . . ≥ |b|_{(p)} are the ordered magnitudes of the coefficients of b and λ_1 ≥ . . . ≥ λ_p ≥ 0 is the sequence of tuning parameters
The above optimization problem is convex and can be solved efficiently even for large design matrices
The method reduces to LASSO when λ_1 = . . . = λ_p
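As a small illustration of the sorted-ℓ1 penalty used above (not the solver itself), the following sketch evaluates σ·Σ λ_i |b|_(i) for a given b and a non-increasing λ sequence; all values are made up for the example.

```python
import numpy as np

def slope_penalty(b, lam, sigma=1.0):
    """Return sigma * sum_i lam[i] * |b|_(i), with |b|_(1) >= ... >= |b|_(p)."""
    mags = np.sort(np.abs(b))[::-1]          # ordered magnitudes, largest first
    return sigma * np.dot(lam, mags)

b = np.array([0.3, -2.0, 0.0, 1.1])
lam = np.array([2.0, 1.5, 1.0, 0.5])         # lambda_1 >= ... >= lambda_p >= 0
print(slope_penalty(b, lam))                 # 2.0*2.0 + 1.5*1.1 + 1.0*0.3 + 0.5*0.0 = 5.95
```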
False discovery rate (FDR) control
Let β̃ be an estimate of β
We define:
the number of all discoveries, R := |{i : β̃_i ≠ 0}|
the number of false discoveries, V := |{i : β_i = 0, β̃_i ≠ 0}|
the false discovery rate, FDR := E[ V / max{R, 1} ]
The goal is to construct a method for which the tuning parameters can be chosen in an explicit way so that the condition FDR ≤ q is met, for a predefined q ∈ (0, 1)
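A minimal sketch of these counts for a single experiment (the names are illustrative); note that FDR is the expectation of V/max{R, 1} over repeated experiments, so one run only yields the false discovery proportion.

```python
import numpy as np

def discovery_counts(beta_true, beta_hat):
    discovered = beta_hat != 0
    R = int(np.sum(discovered))                      # all discoveries
    V = int(np.sum(discovered & (beta_true == 0)))   # false discoveries
    return R, V, V / max(R, 1)                       # single-run false discovery proportion
```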
FDR control with SLOPE
In the orthogonal situation, i.e. when X_i^T X_j = 0 for i ≠ j, the condition FDR ≤ q is theoretically guaranteed when the λ sequence is defined as

\lambda_i := \Phi^{-1}\left(1 - i \cdot \frac{q}{2p}\right)

The heuristic procedure for choosing the smoothing parameters was derived for the near-orthogonal situation, which was modeled by assuming that the entries of X are realizations of independent, zero-mean normal distributions
It turns out that this heuristic works well for many other zero-mean distributions
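The sequence above translates directly into code; a short sketch with scipy's norm.ppf playing the role of Φ⁻¹ (names and values are illustrative):

```python
import numpy as np
from scipy.stats import norm

def slope_lambdas(p, q):
    """lambda_i = Phi^{-1}(1 - i*q/(2p)) for i = 1, ..., p (non-increasing)."""
    i = np.arange(1, p + 1)
    return norm.ppf(1.0 - i * q / (2.0 * p))

lam = slope_lambdas(p=1000, q=0.1)
```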
The structure of GWAS data
Nearby columns tend to be strongly correlated, while columns with distant indices are generally only weakly correlated
Figure: Heatmap of the correlation matrix (absolute values) for 100 predictors
SLOPE in GWA studies
Figure: FDR of SLOPE versus the number of relevant variables, for (a) normally distributed entries, n = p = 1000, and (b) GWAS data, n = p = 1000
New strategy: selecting groups
Let I = {I_1, . . . , I_m} be a partition of {1, . . . , p} and denote l_i := |I_i| for i = 1, . . . , m
Consider the linear regression model with m groups of the form

y = \sum_{j=1}^{m} X_{I_j} \beta_{I_j} + z

We define a truly relevant group by the condition ‖X_{I_j} β_{I_j}‖_2 > 0
The task is to identify the relevant groups instead of individual predictors
STANDARDIZATION: X_{I_j} can be decomposed as X_{I_j} = U_j R_j, where U_j^T U_j = I; we define β̃ := ((R_1 β_{I_1})^T, . . . , (R_m β_{I_m})^T)^T, i.e. β̃_{Ĩ_j} := R_j β_{I_j}. Then

‖X_{I_j} β_{I_j}‖_2 > 0 ⟺ ‖U_j β̃_{Ĩ_j}‖_2 > 0 ⟺ ‖β̃_{Ĩ_j}‖_2 > 0
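A sketch of the group-wise standardization above using a reduced QR decomposition of each block (numpy only; the group index sets and the full-column-rank assumption are illustrative):

```python
import numpy as np

def standardize_groups(X, groups):
    """groups: list of index arrays I_1, ..., I_m; returns blocks U_j, R_j with X_{I_j} = U_j R_j."""
    U_blocks, R_blocks = [], []
    for idx in groups:
        U, R = np.linalg.qr(X[:, idx], mode='reduced')  # U_j has orthonormal columns
        U_blocks.append(U)
        R_blocks.append(R)
    return U_blocks, R_blocks
```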
Group false discovery rate (gFDR) control
Let β̃ be an estimate of β and I = {I_1, . . . , I_m} be a partition of {1, . . . , p}
We define:
the number of all discovered groups, gR := |{i : ‖X_{I_i} β̃_{I_i}‖_2 > 0}|
the number of falsely discovered groups, gV := |{i : ‖X_{I_i} β_{I_i}‖_2 = 0, ‖X_{I_i} β̃_{I_i}‖_2 > 0}|
the group false discovery rate, gFDR := E[ gV / max{gR, 1} ]
The goal is to control gFDR at an assumed level q ∈ (0, 1)
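The group-level counts mirror the individual-level ones; a minimal sketch for a single experiment (illustrative names, gFDR itself being the expectation over repetitions):

```python
import numpy as np

def group_discovery_counts(X, beta_true, beta_hat, groups):
    gR = gV = 0
    for idx in groups:
        discovered = bool(np.linalg.norm(X[:, idx] @ beta_hat[idx]) > 0)
        truly_null = bool(np.linalg.norm(X[:, idx] @ beta_true[idx]) == 0)
        gR += discovered
        gV += discovered and truly_null
    return gR, gV, gV / max(gR, 1)          # single-run group false discovery proportion
```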
Group SLOPE (gSLOPE)
Let I = {I_1, . . . , I_m} be a partition of {1, . . . , p}, l_i = |I_i| and λ_1 ≥ . . . ≥ λ_m ≥ 0
We introduce the group SLOPE estimate, defined as a solution to

\arg\min_{b}\ \left\{ \frac{1}{2}\|y - Xb\|_2^2 + \sigma \sum_{i=1}^{m} \lambda_i \sqrt{l_{(i)}}\, \|b_{I_{(i)}}\|_2 \right\}, \qquad \text{(gSLOPE)}

where √l_(i) ‖b_{I_(i)}‖_2 is the i-th largest coefficient of the vector (√l_1 ‖b_{I_1}‖_2, . . . , √l_m ‖b_{I_m}‖_2)^T
gSLOPE is a solution to a convex optimization problem, which can be solved efficiently (e.g. by the proximal gradient method)
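A small sketch of how the gSLOPE penalty term pairs the sorted, size-weighted group norms with the non-increasing λ sequence (illustration only, not the proximal-gradient solver):

```python
import numpy as np

def gslope_penalty(b, groups, lam, sigma=1.0):
    """sigma * sum_i lam[i] * (i-th largest of sqrt(l_j)*||b_{I_j}||_2)."""
    w = np.array([np.sqrt(len(idx)) * np.linalg.norm(b[idx]) for idx in groups])
    return sigma * np.dot(lam, np.sort(w)[::-1])   # largest weighted group norm first
```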
Special cases: for singleton groups (l_1 = . . . = l_m = 1) gSLOPE reduces to SLOPE, and for a constant sequence λ_1 = . . . = λ_m it reduces to group LASSO (gLASSO); in turn, SLOPE with λ_1 = . . . = λ_p and gLASSO with l_1 = . . . = l_m = 1 both reduce to LASSO
Theorem (gFDR control in the orthogonal case)
Consider the linear regression model with m groups, in which X is a design matrix satisfying X_{I_i}^T X_{I_j} = 0 for any i ≠ j. Let m_0 denote the number of truly irrelevant groups. Apply the following steps:
redefine X, I, {l_i}_{i=1}^m and p by applying the standardization
fix q ∈ (0, 1)
define λ = [λ_1, . . . , λ_m]^T, with

\lambda_i := \max_{j=1,\ldots,m} \frac{1}{\sqrt{l_j}}\, F_{\chi_{l_j}}^{-1}\left(1 - \frac{q \cdot i}{m}\right),

where F_{χ_l} is the cumulative distribution function of the chi distribution with l degrees of freedom
compute

\tilde{\beta} := \arg\min_{b}\ \left\{ \frac{1}{2}\|y - Xb\|_2^2 + \sigma \sum_{i=1}^{m} \lambda_i \sqrt{l_{(i)}}\, \|b_{I_{(i)}}\|_2 \right\}

Then it holds that gFDR ≤ q · m_0 / m.
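The λ sequence from the theorem can be computed directly with scipy's chi distribution (chi.ppf is the inverse CDF); the group sizes and q below are illustrative:

```python
import numpy as np
from scipy.stats import chi

def gslope_lambdas_orthogonal(group_sizes, q):
    """lambda_i = max_j (1/sqrt(l_j)) * F_chi_{l_j}^{-1}(1 - q*i/m), i = 1, ..., m."""
    l = np.asarray(group_sizes, dtype=float)
    m = len(l)
    lam = np.empty(m)
    for i in range(1, m + 1):
        lam[i - 1] = np.max(chi.ppf(1.0 - q * i / m, l) / np.sqrt(l))
    return lam                                    # non-increasing in i

lam = gslope_lambdas_orthogonal(group_sizes=[3, 5, 5, 10], q=0.1)
```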
Results for the near-orthogonal situation
Our main goal was to find a procedure for generating tuning parameters for gSLOPE in the situation where explanatory variables belonging to different groups are weakly correlated (as in the GWAS case).
Building on the heuristic for SLOPE, we derived a procedure for gSLOPE in the situation where the entries of the design matrix are realizations of independent, zero-mean normal distributions and all groups have the same size
For arbitrary group sizes we considered two approaches: a conservative one (giving gFDR significantly lower than assumed) and a liberal one (giving gFDR slightly above the target level but identifying more truly relevant groups than the conservative approach)
For a fixed sequence of tuning parameters, we developed an iterative version of gSLOPE that allows the estimation of σ² (one possible alternating scheme is sketched below)
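The talk does not spell out the iterative procedure, so the following is only a minimal sketch of one plausible alternating scheme: solve gSLOPE with the current σ, re-estimate σ from the residuals, and repeat. Here `solve_gslope` is a hypothetical placeholder for a gSLOPE solver, and the initialization and stopping rule are illustrative assumptions.

```python
import numpy as np

def iterative_gslope(X, y, groups, lam, solve_gslope, n_iter=20):
    """Alternate between a gSLOPE fit and a residual-based estimate of sigma."""
    n, p = X.shape
    sigma = np.std(y)                                      # crude initial estimate
    beta = np.zeros(p)
    for _ in range(n_iter):
        beta = solve_gslope(X, y, groups, sigma * lam)     # effective penalty sigma*lambda
        resid = y - X @ beta
        k = int(np.sum(beta != 0))                         # number of selected coefficients
        sigma_new = np.sqrt(np.sum(resid**2) / max(n - k, 1))
        if abs(sigma_new - sigma) < 1e-6 * sigma:
            break
        sigma = sigma_new
    return beta, sigma
```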
GWAS experiment
We analyzed genetic data collected for 5402 individuals and 16427 genetic regions located on chromosome 1
Columns of the design matrix were divided into groups by using a hierarchical clustering algorithm (HCA)
As a result we obtained 1358 groups, with an average group size close to 12
We performed 200 iterations for each sparsity level from the set {1, 4, 8, 13, 18, 24}; in each iteration we generated observations using the linear regression model with σ = 1
Tuning parameters were obtained by applying the conservative and the liberal strategies for a target gFDR level of 0.1
Coefficients in β_{I_j}, for each truly relevant group j, were generated so that ‖X_{I_j} β_{I_j}‖_2 = 8 (see the sketch below)
Estimates were obtained by using the iterative version of gSLOPE with σ estimation
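A sketch of how signal coefficients for a relevant group could be generated so that ‖X_{I_j} β_{I_j}‖_2 equals a prescribed value (8 in the experiment above); the random direction is an illustrative choice, not necessarily the scheme used in the talk.

```python
import numpy as np

def group_signal(X, idx, target_norm=8.0, seed=0):
    """Draw a random direction within group idx and rescale so that ||X_{I_j} beta_{I_j}||_2 = target_norm."""
    rng = np.random.default_rng(seed)
    beta_g = rng.standard_normal(len(idx))
    return beta_g * target_norm / np.linalg.norm(X[:, idx] @ beta_g)
```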
Defining groups by HCA
Figure: Heatmaps of the absolute correlations between predictors, (a) in the original ordering and (b) after clustering with HCA
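A sketch of how columns could be grouped by hierarchical clustering with 1 − |correlation| as the dissimilarity; the linkage type and the cut threshold are illustrative assumptions, not values from the talk.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def cluster_columns(X, threshold=0.5):
    """Group columns of X by complete-linkage clustering of 1 - |corr|."""
    C = np.corrcoef(X, rowvar=False)                     # p x p correlation matrix
    D = 1.0 - np.abs(C)
    np.fill_diagonal(D, 0.0)
    Z = linkage(squareform(D, checks=False), method='complete')
    labels = fcluster(Z, t=threshold, criterion='distance')
    return [np.where(labels == g)[0] for g in np.unique(labels)]
```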
Results
Figure: (a) estimated gFDR and (b) power (the proportion of identified relevant groups) as functions of the number of relevant groups, for the liberal and the conservative strategy