Non-negative matrix factorization. - National Institute of Statistical

Linking Genetic Profiles to
Biological Outcome
Paul Fogel
Consultant, Paris
S. Stanley Young
National Institute of Statistical Sciences
NISS, NMF Workshop February 23, ‘07
Scotch whiskey database
[Figure: heat maps of the whisky flavor data. Original matrix: rows 1-86 by 12 flavors (Body, Sweetness, Smoky, Medicinal, Tobacco, Honey, Spicy, Winey, Nutty, Malty, Fruity, Floral). Factors: four components (Comp 1-4).]
Original matrix = Prototypical flavor patterns × Mixing levels (weights) + Residual
How many flavor patterns?
[Figure: three plots, each with x-axis "Rows" (ticks 0 to 13):
Profile likelihood (eigenvalues), after Zhu and Ghodsi; y-axis "Profile Likelihood", ticks -52 to -47.
Scree plot (eigenvalues); y-axis "Eigen Value", ticks 0 to 250.
Scree plot (determinant), volume filled; y-axis "Det", ticks 0.7 to 1.1.]
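The profile-likelihood idea credited to Zhu and Ghodsi can be sketched as follows. This is a minimal illustration, not the code behind the plot: the function names are hypothetical, and the pooled variance uses the plain maximum-likelihood denominator, which may differ from the estimator in the original paper. For each candidate split q, the first q eigenvalues and the remaining ones are modeled as two Gaussian groups with separate means and a common variance; the q with the highest log-likelihood is taken as the number of patterns.

```python
import numpy as np

def profile_log_likelihood(values):
    """Log-likelihood of each split q = 1 .. p-1 of the sorted values.

    Hypothetical helper: entry q-1 scores the model in which the q largest
    values form one Gaussian group and the rest form another, with a
    common variance.
    """
    d = np.sort(np.asarray(values, dtype=float))[::-1]  # decreasing order
    p = d.size
    ll = np.empty(p - 1)
    for q in range(1, p):
        g1, g2 = d[:q], d[q:]
        # Within-group sum of squares around each group mean
        ss = ((g1 - g1.mean()) ** 2).sum() + ((g2 - g2.mean()) ** 2).sum()
        # Pooled variance (ML estimate, an assumption); floor avoids log(0)
        sigma2 = max(ss / p, 1e-12)
        ll[q - 1] = -0.5 * p * np.log(2 * np.pi * sigma2) - 0.5 * ss / sigma2
    return ll

def choose_rank(values):
    """Return the split q with the largest profile log-likelihood."""
    return int(np.argmax(profile_log_likelihood(values))) + 1
```

In practice the input could be the eigenvalues of the covariance matrix of the data, for example from `np.linalg.eigvalsh(np.cov(X, rowvar=False))`, where `X` stands for the data matrix.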
AnCnoc
[Figure: flavor heat maps (86 rows by 12 flavors) with the four-component factorization.]
Flavors: Floral, Sweetness, Fruity, Malty, Nutty
Balmenach
[Figure: flavor heat maps (86 rows by 12 flavors) with the four-component factorization.]
Flavors: Winey, Body, Honey, Sweetness, Nutty, Malty
GlenGarioch
[Figure: flavor heat maps (86 rows by 12 flavors) with the four-component factorization.]
Flavors: Spicy, Fruity, Sweetness, Body, Malty
Lagavulin & Laphroig
[Figure: flavor heat maps (86 rows by 12 flavors) with the four-component factorization.]
Flavors: Medicinal, Smoky, Body
Statistical Issues
1. Massive testing: hundreds of “omic” predictors and several questions per sample.
2. Family-wise error rate versus false discovery rate.
3. Missing data, outliers.
Don’t fool yourself.
Matrix Factorization Methods
1. Principal component analysis.
2. Singular value decomposition.
3. Non-negative matrix factorization.
4. Independent component analysis.
5. Robust MF.
Area of active research.
Key Papers
1. Good (1969) Technometrics – SVD.
2. Liu et al. (2003) PNAS – rSVD.
3. Lee and Seung (1999) Nature – NMF.
4. Kim and Tidor (2003) Genome Research.
5. Brunet et al. (2004) PNAS – Microarray.
NMF commits one vector to each mechanism.
SVD eigenvectors come from a composite of mechanisms.
NMF Algorithm
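[Figure: data matrix A (Samples by Genes or Compounds) shown as the product of two factor matrices, drawn in red and green.]
Start with random elements in red and green.
Optimize so that $\sum_{i,j} (a_{ij} - (WH)_{ij})^2$ is minimized, where $A = WH + E$.
Green are the “spectra”. Red are the “weights”.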
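The optimization on this slide can be sketched in a few lines. The slide does not say which update rule is used, so this sketch assumes the multiplicative updates of Lee and Seung for the squared-error objective; the function name `nmf` and its parameters (`n_iter`, `seed`, `eps`) are illustrative choices, not the authors' software.

```python
import numpy as np

def nmf(A, k, n_iter=500, seed=0, eps=1e-9):
    """Factor a non-negative matrix A (m x n) as W (m x k) times H (k x n).

    Sketch under assumptions: Lee-Seung multiplicative updates that
    reduce sum_ij (a_ij - (WH)_ij)^2 from a random non-negative start.
    """
    A = np.asarray(A, dtype=float)
    rng = np.random.default_rng(seed)
    m, n = A.shape
    W = rng.random((m, k)) + eps  # random start, strictly positive
    H = rng.random((k, n)) + eps
    for _ in range(n_iter):
        # Each update multiplies by a non-negative ratio, so W and H stay
        # non-negative; eps only guards against division by zero.
        H *= (W.T @ A) / (W.T @ W @ H + eps)
        W *= (A @ H.T) / (W @ H @ H.T + eps)
    return W, H
```

The residual E is then `A - W @ H`. Because the start is random, different seeds can give different factorizations, so it may be worth comparing several runs.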
Inference
• Test each variable sequentially within an ordered set. Each set corresponds to a particular eigenvector, whose entries have been ordered by decreasing value.
Increase in statistical power.
Genomic example.
Simulation.
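One way to read the sequential procedure is as fixed-sequence testing inside each ordered set. The sketch below makes two assumptions that the slide does not state: the overall error rate is split equally across the sets, and testing within a set stops at the first variable that is not significant. The function name and the input layout are hypothetical.

```python
def sequential_tests(p_values_by_group, alpha=0.05):
    """Count the leading significant variables in each ordered group.

    p_values_by_group: list of lists; each inner list holds the p-values
    of one group's variables, already in testing order (for example by
    decreasing loading on the group's factor).
    """
    if not p_values_by_group:
        return []
    # Equal split of alpha across groups (an assumption of this sketch)
    alpha_g = alpha / len(p_values_by_group)
    rejected = []
    for p_values in p_values_by_group:
        count = 0
        for p in p_values:
            if p > alpha_g:
                break  # stop at the first non-significant variable
            count += 1
        rejected.append(count)
    return rejected
```

With two groups and alpha = 0.05, each group is tested at 0.025. Variables after the first failure are never tested, which is why the ordering within each set matters.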
Microarray Example
• Group AML: patients with acute myeloid leukemia
• Group ALL: patients with acute lymphoblastic leukemia
– Subgroup ALL-T: T cell subtypes
– Subgroup ALL-B: B cell subtypes
Golub, T.R. et al. (1999) Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, 286, 531–537.
Clustering
NMF clusters samples correctly.
Additional subgroup of ALL-B.
Brunet et al. (2004). PNAS vol. 101 no. 12, 4164–4169
Sequential testing
Cluster 1 ALL-B1
(33 genes)
Immune Response
MHC class II
10 genes (p=0.00019)
5 genes
Proteasome
7 genes
P = 0.00054
Immune Response
28 genes (p=0.00047)
MHC class I & II
6 genes
P = 0.00018
Upregulation in ALL-B2 genes
Higher rate of transcription and replication processes
More:
RNA Processing
Cluster 3 ALL-B2
11 genes
P = 0.00260
(169 genes)
DNA Repair and
Replication
Cell Growth and
Proliferation
11 genes
P = 0.01519
61 genes
Cell Cycle
12 genes
Transcription
16 genes
• Proliferative nature compared with ALL-B1
• Proteasomal activity
• Energy production.
Simulation
[Figure: three panels plotting Y (axis ticks 50 to 250) by Group (N, T1, T2):
Genes 1-5: upregulated by T1
Genes 6-10: upregulated by T2
Genes 11-20: upregulated by T1 and T2]
Intragroup correlation structure
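A data set with this layout can be generated along the following lines. Only the three gene blocks and the group labels N, T1 and T2 come from the slide; the sample size per group, the baseline of 100, the effect of 50, both standard deviations, and the shared per-sample term used to induce correlation within each block are all assumptions made for illustration.

```python
import numpy as np

def simulate_expression(n_per_group=10, baseline=100.0, effect=50.0,
                        shared_sd=10.0, noise_sd=10.0, seed=0):
    """Simulate a samples x 20 genes matrix for groups N, T1 and T2.

    All numeric defaults are hypothetical; only the gene blocks follow
    the slide (genes 1-5, 6-10 and 11-20).
    """
    rng = np.random.default_rng(seed)
    groups = np.repeat(["N", "T1", "T2"], n_per_group)
    n = groups.size
    t1_only, t2_only, both = slice(0, 5), slice(5, 10), slice(10, 20)
    Y = np.full((n, 20), baseline)
    is_t1, is_t2 = groups == "T1", groups == "T2"
    Y[is_t1, t1_only] += effect           # genes 1-5: upregulated by T1
    Y[is_t2, t2_only] += effect           # genes 6-10: upregulated by T2
    Y[is_t1 | is_t2, both] += effect      # genes 11-20: upregulated by both
    for block in (t1_only, t2_only, both):
        # A per-sample term shared by the block makes its genes correlated
        Y[:, block] += rng.normal(0.0, shared_sd, size=(n, 1))
    Y += rng.normal(0.0, noise_sd, size=Y.shape)  # independent gene noise
    return np.clip(Y, 0.0, None), groups  # keep the matrix non-negative
```

The returned matrix is non-negative, so it can be passed directly to a non-negative factorization.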
Simulation results
Increased power
Same level of FDR
For more details see paper
Summary
• The strategy is conceptually simple:
– Non-negative matrix factorization is used to create groups of genes that are moving together in the dataset.
– The error rate to be controlled is allocated over these groups.
– Within each group, genes are tested sequentially.
• The strategy should be effective if there are sets of genes moving together so that group formation reflects biological reality.
Areas of research:
Robust algorithms
Speed
Multiblock NMF (e.g. relate active motifs with differentially expressed genes)
Contact Information
Paul Fogel
Independent consultant
[email protected]
+33 1 43 26 16 86
Stan Young
National Institute of Statistical Sciences
[email protected]
919 685 9328
Literature
www.niss.org/irMF
Software