Computational approaches to
analyze proteins
Dr. Ian Walsh
December 2013
3 Part outline
• Part 1: Simulating random peptides
• Monte Carlo simulations
• Part 2: Introduction to machine learning
– Linear regression: Neural Networks
– Non-linear regression: Neural Networks
– Maximum margin classifier: Support Vector Machines
– Factors when constructing a data set
• Part 3: Case study for the practical on Monday 16th
– Secondary structure prediction
Monte Carlo approximation
• What is it?
– repeated random sampling algorithm (1940s)
• Simple example: two outcomes e.g. coin toss
Gambler: 100 coin tosses
• Bias or random fluctuation?
– Limited number of events
[Chart: observed 70 heads, 30 tails]
Monte Carlo approximation
• What is it?
– repeated random sampling algorithm (1940s)
Gambler: 100 coin tosses (observed: 70 heads, 30 tails)
We know a fair coin: P(heads) = P(tails) = 0.5
But the GAMBLER says: P(heads) = 0.7, P(tails) = 0.3
• What is the behaviour for many tosses?
– 100,000 human tosses at 5 seconds each = 6 boring days
– Can we simulate the process on a computer?
Monte Carlo approximation
• What is it?
– repeated random sampling algorithm (1940s)
• A coin flip is a simple experiment, but experiments are often complex with unknown behaviour.
Monte Carlo approximation
• What is it?
– repeated random sampling algorithm (1940s)
• Computer simulation:
– Draw a random number U between 0 and 1
– If U ≥ 0.5, record heads; otherwise tails
– Repeat N times and record the measurements
Monte Carlo approximation
• What is it?
– repeated random sampling algorithm (1940s)
• Computer simulation:
– Draw a random number Uᵢ between 0 and 1
– If Uᵢ ≥ 0.5, record heads; otherwise tails
– Repeat N times and record the measurements
Computer says: P(heads) = P(tails) = 0.5 ± error, with error < 0.00000001
Computer says: Gambling is stupid
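A minimal sketch of the coin-toss simulation described above, using only the Python standard library (the function name, seed and sample sizes are illustrative):

```python
import random

def simulate_coin_tosses(n_tosses, seed=0):
    """Monte Carlo coin-toss simulation: draw U in [0, 1);
    count U >= 0.5 as heads, otherwise as tails."""
    rng = random.Random(seed)
    heads = sum(1 for _ in range(n_tosses) if rng.random() >= 0.5)
    return heads / n_tosses, 1 - heads / n_tosses

for n in (100, 10_000, 1_000_000):
    p_heads, p_tails = simulate_coin_tosses(n)
    print(f"N={n:>9}  P(heads)={p_heads:.4f}  P(tails)={p_tails:.4f}")
```

As N grows the estimate converges towards 0.5, which is the point of the slide: a small sample of 100 tosses can easily show 70/30 by random fluctuation alone.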
Monte Carlo approximation
– repeated random sampling:
• Draw random points (x, y); a point is blue if x² + y² ≤ 1, red otherwise
• π/4 ≈ #(blue) / #(blue + red)
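A minimal sketch of this estimate, assuming points are drawn uniformly in the unit square (the function name and sample size are illustrative):

```python
import random

def estimate_pi(n_points, seed=0):
    """Estimate pi by uniform sampling in the unit square:
    the fraction of points with x^2 + y^2 <= 1 approximates pi/4."""
    rng = random.Random(seed)
    blue = 0
    for _ in range(n_points):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            blue += 1          # point falls inside the quarter circle
    return 4.0 * blue / n_points

print(estimate_pi(1_000_000))  # roughly 3.14
```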
Conclusion:
Random sampling on a large scale can be used to approximate many processes.
Monte Carlo simulated annealing
• Simulated annealing contains:
– a cooling parameter (temperature)
– the energy at a random point: a Monte Carlo technique
– movement to a new energy, proportional to the temperature
• Analogy to thermodynamics:
– high temperature: explore a lot
– low temperature: restrict exploration
[Figure: energy landscape (ENERGY vs. position). Source: Wikipedia]
Monte Carlo simulated annealing
[Figure: energy landscape (ENERGY vs. X), with ∆E marked between the current and proposed points]
1: start = E(x=0)
2: temperature = 100 .... “boiling”
3: next = E(x + random(x))
4: ∆E = next – start
5: if ∆E ≤ 0 ........ ACCEPT: start = next; go back to 3
6: if ∆E > 0 ........ REJECT: start = start; go back to 3
• Ideal situation: a simple energy surface.
• For many dimensions (x, y, z, ... a, b, ...) and many energy peaks this will take forever.
Monte Carlo simulated annealing
[Figure: energy landscape (ENERGY vs. X)]
1: start = E(x=0)
2: temperature = 100 .... “boiling”
3: next = E(x + random(x))
4: ∆E = next – start
5: if ∆E ≤ 0 ........ ACCEPT; go to 7
6: if ∆E > 0
   6a: U = random[0,1]
   6b: if U < e^(–∆E / temperature) ...... ACCEPT; go to 7
       else go back to 3 and repeat ....... REJECT
7: decrease the temperature
• The acceptance probability e^(–∆E / temperature) controls the amount of exploration (∝ temperature).
• e.g. it is possible to get trapped in a local minimum at low temperatures, but this is probably acceptable.
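A minimal sketch of the annealing loop above for a one-dimensional toy energy function (the energy function, step size and geometric cooling schedule are illustrative assumptions):

```python
import math
import random

def simulated_annealing(energy, x0, step=0.5, t_start=100.0,
                        cooling=0.95, n_steps=10_000, seed=0):
    """Simulated annealing following the slide's steps: propose a random move,
    always accept downhill moves, accept uphill moves with probability
    exp(-dE / temperature), then decrease the temperature."""
    rng = random.Random(seed)
    x, e = x0, energy(x0)
    temperature = t_start
    for _ in range(n_steps):
        x_new = x + rng.uniform(-step, step)          # 3: random move
        e_new = energy(x_new)
        d_e = e_new - e                               # 4: energy difference
        if d_e <= 0 or rng.random() < math.exp(-d_e / temperature):
            x, e = x_new, e_new                       # 5 / 6b: accept
        temperature *= cooling                        # 7: decrease temperature
    return x, e

# Toy energy surface with several local minima (illustrative only).
print(simulated_annealing(lambda x: x ** 2 + 3 * math.sin(3 * x), x0=4.0))
```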
Simulated annealing mutations
• Amyloid fibril formation is a toxic assembly of beta-strands
– insoluble fibrous aggregates
[Figure: intermediate-magnification micrograph of cerebral amyloid angiopathy with senile plaques in the cerebral cortex, consistent with amyloid beta, as may be seen in Alzheimer disease]
Nelson et al., Nature 435, 773–778 (9 June 2005)
Simulated annealing mutations
• PASTA: software which gives the energy of fibril formation
– Based on amino acid propensities for hydrogen-bond formation
– PASTA(MVGGVVIA) = energy of the self-pairings forming fibrils
– MVGGVVIA energy: –5.410074 kcal/mol
Pairing parallel 1: Energy -5.410074, pairing segments 5-8 and 5-8 (size 4)
Pairing parallel 2: Energy -4.956611, pairing segments 1-7 and 1-7 (size 7)
Pairing parallel 3: Energy -4.873594, pairing segments 2-7 and 2-7 (size 6)
Pairing parallel 4: Energy -4.440684, pairing segments 4-7 and 4-7 (size 4)
Pairing parallel 5: Energy -4.327075, pairing segments 1-8 and 1-8 (size 8)
Simulated annealing mutations
• Interesting question: given a seed sequence of length N, what set of mutations can induce amyloidogenicity?
– But given a peptide of length 8 there are 20⁸ = 25,600,000,000 possible peptides: a large energy surface (evaluating all possibilities would take 811 years on 1 CPU)
[Diagram: seed sequence → potentially better fibrils]
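A sketch of how simulated annealing could search this sequence space. It assumes a PASTA-like scoring function `fibril_energy(sequence)` is available; that name is a hypothetical placeholder, not the real PASTA interface.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def anneal_sequence(seed_seq, fibril_energy, t_start=10.0,
                    cooling=0.99, n_steps=5_000, seed=0):
    """Anneal over peptide sequences: propose a single-residue mutation and
    accept it with the Metropolis criterion on the fibril-formation energy.
    `fibril_energy` is an assumed callable returning an energy for a sequence."""
    rng = random.Random(seed)
    current = list(seed_seq)
    e_current = fibril_energy("".join(current))
    temperature = t_start
    for _ in range(n_steps):
        candidate = current[:]
        pos = rng.randrange(len(candidate))
        candidate[pos] = rng.choice(AMINO_ACIDS)      # random point mutation
        e_new = fibril_energy("".join(candidate))
        d_e = e_new - e_current
        if d_e <= 0 or rng.random() < math.exp(-d_e / temperature):
            current, e_current = candidate, e_new
        temperature *= cooling
    return "".join(current), e_current
```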
Example: Amyloid-β aggregates
• Forms fibrils and is known to be a factor in Alzheimer’s
– Are there mutants in nature with more potential to form fibrils?
– Starting from MVGGVVIA, simulated annealing found 1494 peptides with a more favourable energy
[Figure: blue to red = low-to-high predicted amyloid propensity]
Paper: PNAS, October 11, 2011, vol. 108, no. 41, 16938–16943
Annealed sequence similarity network – peptide exists in nature
• Exist: the scrambled peptides were searched in SwissProt
• Need to investigate further.
[Figure: black to red = less to more likely to form amyloids]
Machine learning of proteins
– Obviously, a Machine Learning (ML) algorithm must be quicker and more accurate than a human
[Diagram: ML at the intersection of statistics and computer science]
– Algorithms which learn from past data (e.g. experimentally annotated proteins) to predict the outcome of future data (e.g. a sequence with unknown structure: MSGGGDVVCTGWLRKSPPEKKLRRYAWKKRWFILRSGR)
Machine learning of proteins
• Given a large set of experimentally annotated proteins, is it possible to learn the experimental properties in order to predict unannotated proteins?
– From sequence (e.g. secondary structure): MSGGGDVVCTGWLRKSPPEKKLRRYAWKKRWFILRSGR
– From structure (e.g. binding site): possible binding site at position x–y
Machine learning of proteins
Pattern recognition in bioinformatics: de Ridder D et al., Brief Bioinform 2013;14:633–647
• Large amount of data + algorithms:
– Neural networks
– Support vector machines
• Find relationships among the patterns:
– clusters of patterns (not covered)
– linear and non-linear separation and function fitting (later)
Neural Networks
• Inspired by neuronal connections in the brain:
– neuron cells are connected by synapses
– the connections between different neurons determine its function
• In the artificial counterpart:
– each circle represents a neuron
– each edge represents a synaptic connection
– the connections between different neurons determine its function
Artificial Neuron
• Inspired by a simple biological neuron model:
yᵢ = f( Σⱼ₌₁..K wᵢⱼ yⱼ )
– K: number of previous neurons
– yᵢ: output of neuron i (this neuron)
– yⱼ: output of the previous neurons, j = 1..K
– wᵢⱼ: weight from neuron j to neuron i
– Σⱼ wᵢⱼ yⱼ: the net input
– f(): the activation function
Linear Neural Network
y = Mx + c
Two variables to move the line:
(1) slope M
(2) y-axis intercept c
Linear Neural Network
• Implementing decision lines and linear fitting functions:
y₃ = w₃₁ x + w₃₂
• Two free parameters (called weights):
(1) slope: w₃₁
(2) y-axis intercept: w₃₂
• As a network: y₁ = x (input neuron), y₂ = 1 (bias neuron), and the output is
y₃ = f( Σⱼ₌₁² w₃ⱼ yⱼ )
• Note: f(s) = s ..... the identity function
Linear Neural Network
[Plot: the fitted line, y₃ vs. x]
Note: this is a very simple 1-input, 1-output fit. Often there are many inputs and outputs, resulting in complicated neural networks (later).
Learning in Neural Networks
• Adequacy of the fit is measured by an error and improved by gradient descent on the weights wᵢₖ
Learning in Neural Networks
Gradient Descent
• Given a desired output d and a predicted output y:
Error = ½ (d − y)²
Learning in Neural Networks
Gradient Descent (move to the minimum error with respect to the weight parameters wᵢₖ)

Error = ½ (dᵢ − yᵢ)²

∂(Error)/∂wᵢₖ = ½ ∂(dᵢ − yᵢ)² / ∂wᵢₖ

Chain rule:
∂(Error)/∂wᵢₖ = [∂(Error)/∂yᵢ] · [∂yᵢ/∂wᵢₖ]

∂(Error)/∂yᵢ = −(dᵢ − yᵢ)

Given yᵢ = Σⱼ₌₁..K wᵢⱼ yⱼ (a linear neuron), then ∂yᵢ/∂wᵢₖ = ∂(Σⱼ wᵢⱼ yⱼ)/∂wᵢₖ = yₖ

Therefore:
∂(Error)/∂wᵢₖ = −(dᵢ − yᵢ) yₖ
Learning in Neural Networks
Gradient Descent (algorithm)
Given:
• input neurons: yⱼ
• predicted output neurons: yᵢ
• desired outputs: dᵢ
• Update the weights in the direction of minimum error:
wᵢⱼ ← wᵢⱼ + η (dᵢ − yᵢ) yⱼ    (η: a small learning rate)
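A minimal sketch of this update rule for a single linear neuron with one input and a bias (the data, learning rate and epoch count are illustrative):

```python
import random

def train_linear_neuron(samples, n_epochs=200, learning_rate=0.01, seed=0):
    """Gradient descent for one linear neuron y = w1*x + w2, where w2 is the
    bias weight fed by a constant input of 1, minimising 0.5*(d - y)^2."""
    rng = random.Random(seed)
    w1, w2 = rng.uniform(-0.5, 0.5), rng.uniform(-0.5, 0.5)
    for _ in range(n_epochs):
        for x, d in samples:                  # (input, desired output) pairs
            y = w1 * x + w2 * 1.0             # forward pass
            delta = d - y                     # (d_i - y_i)
            w1 += learning_rate * delta * x   # w_ij <- w_ij + eta*(d_i - y_i)*y_j
            w2 += learning_rate * delta * 1.0
    return w1, w2

# Fit the line d = 2x + 1 from sampled points (illustrative data).
data = [(x / 10.0, 2 * (x / 10.0) + 1.0) for x in range(-20, 21)]
print(train_linear_neuron(data))  # should approach (2.0, 1.0)
```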
Input-output dimension
Example: 3-dimensional input
• A 3-dimensional input → a separating plane (2 dimensions)
• An N-dimensional input → an (N−1)-dimensional separating hyperplane
• In fact, any number of inputs and outputs can be used to model the data
• The inputs are often called the features of the dataset
Non-linear Neural Network
[Plot: a non-linear curve, y₂ vs. x]
• Sometimes a non-linear fit is more appropriate
• What neural network models the curve above, 1.7·tanh(1.1x + 1.4) + 1.1?
Non-linear Neural Network
• What combination of neurons models the curve 1.7·tanh(1.1x + 1.4) + 1.1?
• Starting point, the generic neuron: yᵢ = f( Σⱼ₌₁..K wᵢⱼ yⱼ ) with weights wᵢ₁ ... wᵢₖ
Non-linear Neural Network
• What combination of neurons models the curve 1.7·tanh(1.1x + 1.4) + 1.1?
• Starting point:
– inputs: y₁ = x, y₂ = 1 (bias)
– hidden neuron: y₃ = tanh( Σⱼ₌₁² wⱼ yⱼ ) = tanh(1.1x + 1.4), with lower-layer weights (1.1, 1.4)
Non-linear Neural Network
• What combination of neurons models the curve 1.7·tanh(1.1x + 1.4) + 1.1?
• Multiple layers are allowed:
– inputs: y₁ = x, y₂ = 1 (bias)
– hidden neuron: y₃ = tanh(1.1x + 1.4), with lower-layer weights (1.1, 1.4)
– output neuron: y₄ = Σⱼ wⱼ yⱼ = 1.7·y₃ + 1.1·y₂, with upper-layer weights (1.7, 1.1)
– so y₄ = 1.7·tanh(1.1x + 1.4) + 1.1
Non-linear Neural Network
• In general, any curve of the form a·tanh(bx + c) + d, where a, b, c and d are constants, can be modelled using the following multiple-layer neural network and gradient descent:
– inputs: y₁ = x, y₂ = 1 (bias)
– hidden: y₃ = tanh( w₃₁·y₁ + w₃₂·y₂ ), with lower-layer weights w₃₁, w₃₂
– output: y₄ = w₄₃·y₃ + w₄₂·y₂, with upper-layer weights w₄₃, w₄₂
Universal approximation theorem:
Multi-layer networks can approximate any function
• Example: y₁ = x, y₂ = 1, y₃ = tanh(1.1x + 1.4), y₄ = 1.7·y₃ + 1.1·y₂ = 1.7·tanh(1.1x + 1.4) + 1.1
Universal approximation theorem:
Multi-layer networks can approximate any function
• In general notation: a hidden unit computes y₃ = σ₁( w₁ᵀx + b₁ ), with weights w₁, bias b₁ and activation σ₁; the output y₄ is a weighted sum of such hidden units (plus a bias)
General structure of a multi-layer network
• As a consequence of the universal approximation theorem, there exist N hidden neurons which can approximate any function f()
• A multi-layer neural network contains:
– at least one hidden neuron, an input layer and an output layer
– non-linear activation functions
– a set of weights/parameters separated into layers
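A minimal sketch of the two-layer network from the previous slides, using the weights shown there (reconstructing the lost operators as '+'; the function name is illustrative):

```python
import math

def two_layer_network(x, w_lower=(1.1, 1.4), w_upper=(1.7, 1.1)):
    """Forward pass of the slides' multi-layer network:
    y1 = x, y2 = 1 (bias), y3 = tanh(lower layer), y4 = linear upper layer."""
    y1, y2 = x, 1.0
    y3 = math.tanh(w_lower[0] * y1 + w_lower[1] * y2)   # hidden neuron
    y4 = w_upper[0] * y3 + w_upper[1] * y2              # output neuron
    return y4

# Matches the target curve 1.7*tanh(1.1x + 1.4) + 1.1 at a few sample points.
for x in (-2.0, 0.0, 2.0):
    print(x, two_layer_network(x), 1.7 * math.tanh(1.1 * x + 1.4) + 1.1)
```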
Support Vector Machine
• Classifies data by finding the separating line with the maximum separating margin
[Plots (features y₂ vs. x): a poor model; a good model with its margin and support vectors; a better model with a bigger margin]
• Here: a simple 2-dimensional surface with a 1-dimensional separation
Support Vector Machine
• Constructs a separating plane in a high-dimensional space
• Separation with the largest margin to the nearest training data
• This helps to classify unseen data (testing data)
• The nearest training data to the margin are called support vectors.
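A small illustration of a maximum-margin classifier, assuming scikit-learn is available (the toy data are illustrative, not from the lecture):

```python
from sklearn import svm

# Toy 2-dimensional feature vectors and binary class labels.
X = [[0.0, 0.0], [0.2, 0.3], [1.0, 1.0], [1.2, 0.9]]
y = [0, 0, 1, 1]

clf = svm.SVC(kernel="linear")   # maximum-margin linear separator
clf.fit(X, y)

print(clf.support_vectors_)                  # training points nearest the margin
print(clf.predict([[0.1, 0.1], [1.1, 1.0]]))
```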
Support Vector Machine: kernel
• An N-dimensional surface finds an (N−1)-dimensional linear separation.
• Here the data points are not linearly separable:
– 3 red correct, 3 blue correct, 3 red incorrect
Support Vector Machine: kernel
• An N-dimensional surface finds an (N−1)-dimensional linear separation.
• When the data points are not linearly separable, a kernel function tries to map the data into a feature space where they become linearly separable:
– kernel map: x → Ω(x), where Ω is a non-linear map
Secondary structure prediction
From sequence (e.g. secondary structure): MSGGGDVVCTGWLRKSPPEKKLRRYAWKKRWFILRSGR
Examining position i = 31, residue = W
The next slides will help you in the practical on Monday 16th December.
Secondary structure prediction
From sequence (e.g. secondary structure): MSGGGDVVCTGWLRKSPPEKKLRRYAWKKRWFILRSGR
Examining position i = 31, residue = W
• Secondary structure depends on position i
• and on the positions from i−w to i+w
• w = 3, window size = 7
Secondary structure prediction
• Data representation (example → target):
– amino acid at position i → class (Helix: H, Strand: S, Coil: C)
– window around position i → class (Helix: H, Strand: S, Coil: C)
– it is better to have windows
Secondary structure prediction
From sequence (e.g. secondary structure): MSGGGDVVCTGWLRKSPPEKKLRRYAWKKRWFILRSGR
• Secondary structure depends on position i and its surrounding molecular context (e.g. window size = 7):
Example    target
XMSGGGD    coil
GDWCTGW    helix
YAWKKRW    helix
Etc.………
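A minimal sketch of how such windowed examples could be built (the padding character, window half-width and per-residue labels are illustrative placeholders):

```python
def sequence_windows(sequence, targets, w=3, pad="X"):
    """Build (window, target) examples for secondary structure prediction:
    a window of 2*w + 1 residues around each position, padded at the ends."""
    padded = pad * w + sequence + pad * w
    return [(padded[i:i + 2 * w + 1], targets[i]) for i in range(len(sequence))]

seq = "MSGGGDVVCTGWLRKSPPEKKLRRYAWKKRWFILRSGR"
labels = "C" * len(seq)   # hypothetical per-residue labels (H/E/C), illustration only
print(sequence_windows(seq, labels)[:2])
```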
Secondary structure linear network
(1) Linear Neural Network in the practical
• Outputs: Helix: H, Strand: E, Coil: C
• Inputs (window): G (i−3), D (i−2), W (i−1), C (i), T (i+1), G (i+2), W (i+3), plus bias = 1
• Weight parameters (input×output + bias×output) = 7×3 + 3×1 = 24 parameters
Secondary structure non-linear network
(2) Non-linear Neural Network in the practical
• Outputs: Helix: H, Strand: E, Coil: C
• Hidden layer: 3 hidden neurons, plus bias = 1
• Inputs (window): G (i−3), D (i−2), W (i−1), C (i), T (i+1), G (i+2), W (i+3), plus bias = 1
• Weight parameters:
– upper layer (hidden×output + bias×output): 3×3 + 1×3 = 12 parameters
– lower layer (input×hidden + bias×hidden): 7×3 + 1×3 = 24 parameters
– total: 36 parameters
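A quick check of these parameter counts (the layer sizes follow the slides; the function name is illustrative):

```python
def count_parameters(n_inputs=7, n_hidden=3, n_outputs=3, bias=1):
    """Weight counts for the two networks used in the practical."""
    linear = n_inputs * n_outputs + bias * n_outputs        # 7*3 + 1*3 = 24
    lower = n_inputs * n_hidden + bias * n_hidden           # 7*3 + 1*3 = 24
    upper = n_hidden * n_outputs + bias * n_outputs         # 3*3 + 1*3 = 12
    return {"linear": linear, "non_linear": lower + upper}  # 24 and 36

print(count_parameters())
```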
Data split
In a simple scenario the data can be split into training data and testing data (unseen):
• Training data: used for weight/parameter estimation (gradient descent finds the weights by minimizing the error)
• Testing data: used to measure performance on unseen proteins
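A minimal sketch of such a split (a random split only; in practice the redundancy filtering described on the next slides, e.g. removing pairs with >40% sequence identity, would also be applied with an external tool):

```python
import random

def split_train_test(proteins, test_fraction=0.2, seed=0):
    """Randomly split a list of proteins into training and testing sets."""
    rng = random.Random(seed)
    shuffled = proteins[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]   # (training, testing)

train, test = split_train_test([f"protein_{i}" for i in range(100)])
print(len(train), len(test))   # 80 20
```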
Data diversity
Each data point must not be similar to another data point. Why?
Rule 1: Enforce, for all Data[i] and Data[j] in {Training set}: no sequence pair with >40% sequence identity
Reasons for the above:
• learning diversity → no over-learning of clusters of data points
• Analogy: a child learning about transport who has only seen cars (“I do not know what this is!”) – it is better to diversify
Data diversity
Each data point must not be similar to another data point. Why?
Rule 2: Enforce, for all Data[i] and Data[j] in {Training set} ∪ {Test set}: no sequence pair with >40% sequence identity
Reason for the above:
• The testing data stays truly unseen → a fair measure of performance
Data diversity
Each data point must not be similar to another data point. Why?
Rule 3: Enforce, for all Data[i] and Data[j] in {Test set}: no sequence pair with >40% sequence identity
Reason for the above:
• Testing data diversity → no over-estimation of performance
Bias-variance
Bias-variance trade-off:
• Many weights/parameters will make the model “too powerful”
• Consider the following curve --- to be the reality of some protein process:
– all proteins with features (x₁, x₂) outside the curve are of type “red”
– proteins with features (x₁, x₂) inside the curve are of type “green”
[Plot: feature space, x₂ vs. x₁, with the true boundary curve]
Bias-variance
• We don’t know anything about the process except for some data (experiments)
[Plot: x₂ vs. x₁]
Bias-variance
• This linear model is too simple; it has too few parameters: high bias
[Plot: x₂ vs. x₁]
Bias-variance
• This non-linear model is too complicated; it has too many parameters: high variance
[Plot: x₂ vs. x₁]
Bias-variance
Problems with high bias:
• The real process (---) is much more complex
[Plot: x₂ vs. x₁]
Bias-variance
Problems with high variance:
1. Small data sets and high-parameter models are very bad:
(a) There are gaps in our data; the smaller the data set, the more chance of gaps.
(b) There is experimental error (noise)
[Plot: x₂ vs. x₁, with the data gaps marked]
Balancing bias-variance
• Increase the size of the training set (more experiments)
[Plot: x₂ vs. x₁]
Balancing bias-variance
• Decrease the number of parameters (weights): the fitting function is “less powerful” – less variance
[Plot: x₂ vs. x₁]
Generalization
[Plots: a poor fitting model (high variance) vs. a good fitting model (balanced variance and bias)]
Generalization
• Over-fitting: the testing error is too high
– Error(training) is almost 0 (perfect)
– Error(testing) is high: many green proteins fall outside the real process, and some red proteins fall inside it
[Plots: the poor fitting model (high variance) on the training data and on the unseen testing data]
Generalization
• No over-fitting: the testing error is acceptable
– Error(training) is low
– Error(testing) is acceptable: green proteins are generally inside the real process, and red proteins are generally outside it
[Plots: the good fitting model (balanced variance and bias) on the training data and on the unseen testing data]
Over-fitting: prevention
• When there is limited data and we need a many-parameter model, we can stop the training/learning early
• Split the data into training data, a validation set, and unseen testing data
[Plot: ERROR vs. time updating the weights (gradient descent), showing the training error and the validation error; poor fitting model – high variance]
Over-fitting: prevention
• When there is limited data and we need a many-parameter model, we can stop the training/learning early
• STOP when the validation error starts to rise: the model is starting to over-fit on noise and data gaps
[Plot: ERROR vs. time updating the weights (gradient descent), showing the training error and the validation error with the stopping point marked]
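A minimal sketch of early stopping under these assumptions: `update_weights` performs one gradient-descent step and `validation_error` returns the current validation-set error (both are assumed callables supplied by the caller; the patience value is illustrative):

```python
def train_with_early_stopping(update_weights, validation_error,
                              max_steps=10_000, patience=5):
    """Keep updating the weights while the validation error improves;
    stop after `patience` consecutive steps without improvement."""
    best_error = float("inf")
    steps_without_improvement = 0
    for _ in range(max_steps):
        update_weights()                      # one gradient-descent update
        error = validation_error()
        if error < best_error:
            best_error = error
            steps_without_improvement = 0
        else:
            steps_without_improvement += 1
            if steps_without_improvement >= patience:
                break                         # STOP: starting to over-fit
    return best_error
```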
Machine learning prediction checklist
• Data:
1. Ideally from high-quality experiments (less noise)
2. Ideally in large quantities
3. Diversity is necessary
• Bias-variance trade-off:
1. If the data set is small, then high bias (a low number of parameters) is preferred
2. If the data set is large, then more expressive models are OK
• Over-fitting:
1. If the testing error is high, then try to prevent over-fitting