Bioe 190, Fall 2016
Quiz 2
(answers for selected problems)
Part 1: Benchmarking techniques
1. Give the formulas for Recall, Precision, Specificity and False Positive Rate. (4)
Recall = TP/(TP+FN)
Precision = TP/(TP+FP)
Specificity = TN/(TN+FP)
False Positive Rate = FP/(FP+TN)
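For concreteness (beyond what the quiz asked for), here is a minimal Python sketch that computes these
four statistics from raw confusion-matrix counts; the example counts at the end are made up:

def performance_stats(tp, fp, tn, fn):
    """Compute Recall, Precision, Specificity and FP rate from confusion-matrix counts."""
    recall = tp / (tp + fn)        # a.k.a. sensitivity, TP rate
    precision = tp / (tp + fp)
    specificity = tn / (tn + fp)
    fpr = fp / (fp + tn)           # false positive rate = 1 - specificity
    return recall, precision, specificity, fpr

# Toy example: 40 TP, 10 FP, 90 TN, 20 FN
print(performance_stats(40, 10, 90, 20))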
2. What is the purpose of a holdout method? (1)
Correct answer: The actual purpose is to evaluate the generalization capacity of a method, i.e., how well it
will perform on data it has not seen. (This is why correlations between training and test data cause
problems.)
Common mistake: only describing the process (dividing a dataset into training and test) and not explaining
the purpose (why we divide a dataset into separate training and test sets). I gave credit on this quiz for
answers that described the technique but did not describe the purpose (but if this question shows up on a
future quiz I’ll expect to see a more fleshed-out response).
3. Why is K-fold cross-validation seen as superior to a single holdout method? (2)
Most students got this right: it’s because in K-fold cross-validation every data point is used in testing the
method.
Common mistake: Some students wrote that each data point is used “at least once” – that’s incorrect, and
I deducted a small amount for this error. (In K-fold cross-validation each data point is tested exactly once.
If individual data points appeared in testing unequal numbers of times, the pooled performance estimate
would weight some points more heavily than others; and if every data point were tested multiple times, a
great deal more work would be done than is necessary.)
4. What is the primary downside to performing K-fold cross validation relative to a single holdout method?
(1)
Most students got this right: K-fold cross-validation is much more computationally expensive, taking
roughly K times as much time (and potentially space) as a single holdout.
5. Explain what is meant by “when data in a benchmark dataset are correlated, hold-out methods can
result in an over-estimate of method performance.” (1)
Correct answer: When benchmark data are correlated, a random division into testing and training (i.e., the
standard default approach to using a benchmark dataset) will often result in performance measures
(precision, recall, etc.) that are misleadingly good. Instead of the results on the test data being indicative of
the method’s ability to generalize to novel (unseen) data, the results show the ability of the method to store
and retrieve information about data used in training.
To understand this, consider one very common type of correlation in biological data: homology
relationships. Consider the case of catalytic site prediction: if your training set includes a human enzyme
with labeled catalytic residues and your test set includes the chimp (or any other mammalian) ortholog, all
your algorithm needs to do is store the information for each training sequence, run BLAST for each test
sequence against the sequences in the training set and transfer the annotations of catalytic residues based
on the pairwise alignment. With closely related sequences on both sides of the partition, performing well
will be trivial. This is an example of correlated data, and of why correlation can inflate the recall and
precision results measured on a test set.
Common mistake: Many students mistakenly ascribed this to model overfitting. In fact, model overfitting
describes a different problem (when a model fits the training data very well but performs poorly in testing
on novel data).
6. Give pseudocode for K-fold cross validation. (10)
Example detailed solution (I gave full credit to simpler solutions that left out some of these details, as long
as the key elements were included)
Input: benchmark dataset D.
Split D at random into k subsets.
Open file “Scores” to store all scores produced during testing phases
Open file “ModelParameters” to store all model parameter settings during training phases
For i = 1..k {
    Initialize model parameters to random values   /* wipe the slate clean between iterations */
    Set subset i aside for testing
    Train model parameters on {D-i}   /* D-i represents the set complement – everything in D minus the subset i */
    Write model parameters to ModelParameters file
    Score reserved subset i using the trained model; append scores (and relevant information) to Scores file
}
Sort Scores file from strongest to weakest
Submit sorted Scores file to Precision-Recall or ROC analysis to evaluate the performance of the model
estimation procedure.
The ModelParameters file data may be used (e.g., to set the output model parameters to the average) or
discarded (e.g., if the output model will be estimated from scratch on the entire dataset).
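For concreteness, here is a minimal Python sketch of the same procedure. The logistic-regression model and
the ROC AUC summary are stand-ins (any estimator that can be trained on one subset and score another
would do, and a Precision-Recall analysis could replace the ROC analysis); the data at the end are random
and purely illustrative:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import KFold

def k_fold_cv(X, y, k=5, seed=0):
    """Evaluate a model-estimation procedure with K-fold cross-validation.
    Every data point is scored exactly once, by a model that never saw it in training,
    and the pooled scores are evaluated together at the end."""
    scores = np.empty(len(y))
    kf = KFold(n_splits=k, shuffle=True, random_state=seed)     # split D at random into k subsets
    for train_idx, test_idx in kf.split(X):
        model = LogisticRegression()                            # fresh model: wipe the slate clean each fold
        model.fit(X[train_idx], y[train_idx])                   # train on D minus subset i
        scores[test_idx] = model.predict_proba(X[test_idx])[:, 1]   # score the reserved subset i
    return roc_auc_score(y, scores)                             # evaluate all held-out scores together

# Toy usage with random data (labels are arbitrary, so the AUC should land near 0.5)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = rng.integers(0, 2, size=200)
print(k_fold_cv(X, y, k=5))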
Common mistakes:
1. Biggest problem: Many students focused on the use of repeated partitioning into training and testing as
a way to derive optimal model parameters (potentially retaining the parameters that gave the best
performance in one of the partitions – you should be able to understand why this is not a good idea), and
ignored the focus on evaluating how well the algorithm for estimating model parameters performs overall
on all of the test data.
2. Many students forgot to include the last step, when the performance over the entire dataset is analyzed.
(2 points deducted for leaving this out)
3. Most students ignored the issue of resetting the model parameters to random values during each
iteration, i.e., wiping the slate clean between iterations.
Note that set theory is relevant to many areas of data science, and you may want to use set notation in
your work (and in pseudocode and comments). A handy resource for set notation is
http://www.purplemath.com/modules/venndiag2.htm.
7. What is plotted on an ROC curve; what is on the X-axis and what is on the Y-axis?
X-axis: FP rate = FP/(FP+TN)
Y-axis: TP rate = Sensitivity (or Recall) = TP/(TP+FN)
8. Computing Performance Statistics (10)
Common mistake: not initializing the FN (or other values) correctly, or not understanding the procedure
(or its purpose). See the toy example provided in the Powerpoint slides for section 2 of the class.
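The toy example from the slides is not reproduced here, but the following Python sketch shows one standard
way to do the computation from a sorted Scores file: initialize FN to the number of true positives and TN to
the number of true negatives, then update TP/FP/TN/FN as the score threshold is lowered past each item.
The (FP rate, recall) pairs it records are exactly the points plotted on the ROC curve from question 7; the
data at the end are made up:

def sweep_stats(scored_items):
    """Walk down (score, true_label) pairs sorted from strongest to weakest,
    updating TP/FP/TN/FN as the threshold drops, and record
    (recall, precision, FP rate) after each item."""
    items = sorted(scored_items, key=lambda x: x[0], reverse=True)
    tp, fp = 0, 0
    fn = sum(1 for _, label in items if label)    # all positives start out as false negatives
    tn = len(items) - fn                          # all negatives start out as true negatives
    curve = []
    for score, label in items:                    # lower the threshold one item at a time
        if label:
            tp += 1; fn -= 1                      # a positive crosses the threshold
        else:
            fp += 1; tn -= 1                      # a negative crosses the threshold
        recall = tp / (tp + fn)
        precision = tp / (tp + fp)
        fpr = fp / (fp + tn) if (fp + tn) else 0.0
        curve.append((recall, precision, fpr))
    return curve

# Toy scores with labels (1 = positive class, 0 = negative class)
print(sweep_stats([(0.9, 1), (0.8, 1), (0.7, 0), (0.4, 1), (0.2, 0)]))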
Part 2: Homology search algorithms (18)
1. On page 172, Pevsner presented the PSI-BLAST algorithm as a 5-step procedure. Describe that 5-step
procedure (full credit requires some detail). (5)
Correct answer – (with grading rubric: 1 point for each of the following):
1. Run BLAST given an input protein sequence;
2. Retrieve matches meeting an E-value significance cutoff; align them to the query to construct an MSA.
3. Update the PSSM parameters to fit the MSA (including sequence weighting/redundancy filtering, the
BLOSUM62 substitution matrix, etc.)
4. Score the search database with the PSSM.
5. Iterate steps 2-4 until convergence or until the maximum number of iterations is reached.
Common mistake: presenting a general pipeline for statistical modeling.
2. Let M be an MSA of N protein sequences and C columns, what is at position Mi,j ? (1)
Note: Questions 2-5 were addressed in a whiteboard presentation in lecture, but the basic idea of what a
profile is (and the mapping between an MSA and a profile) was also presented in the Powerpoint slides
uploaded to bCourses.
Correct answer: the character (amino acid or indel character) at row i and column j.
Common mistakes: assuming that the character at row i and column j must be an amino acid (forgetting
the possibility of an indel character); assuming that the character at row i and column j is the jth amino
acid of sequence i. (I deducted 0.2 for leaving out the indel character.)
3. Let P be a Profile derived from M. What is represented at position Pi,j ? (2)
Correct answer: the probability of amino acid i at column j.
4. Let n be a vector of amino acid counts (n1, n2, …, n20) from a multiple sequence alignment column; ni is
the number of amino acid of type i. Let pi = the estimated probability of amino acid i derived from the
multiple sequence alignment column. Give the formula for the Maximum Likelihood Estimate for pi (1)
Correct answer: pi = ni/|n|, where |n| = n1 + n2 + … + n20 is the total number of amino acids counted in
the column.
5. Using the same notation, give the formula for the Add-1 pseudocount estimate for pi (2)
Correct answer: pi = (ni + 1)/(|n| + 20)
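A small Python sketch tying questions 2-5 together: given one MSA column as a string of characters, it
computes both the Maximum Likelihood and the Add-1 pseudocount estimates for each amino acid type.
For simplicity it ignores indel characters when counting (a real profile-building step would also handle
gaps and sequence weighting); the toy column is made up:

from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"    # the 20 standard amino acid types

def column_estimates(column):
    """Return ML and Add-1 pseudocount probability estimates for one MSA column."""
    counts = Counter(c for c in column if c in AMINO_ACIDS)   # n1..n20 (indels skipped here)
    total = sum(counts.values())                              # |n|, the total count
    ml = {aa: counts[aa] / total for aa in AMINO_ACIDS}                   # ni / |n|
    add1 = {aa: (counts[aa] + 1) / (total + 20) for aa in AMINO_ACIDS}    # (ni + 1) / (|n| + 20)
    return ml, add1

# Toy column from a 6-sequence alignment
ml, add1 = column_estimates("AAAGCV")
print(ml["A"], add1["A"])    # 0.5 vs (3 + 1)/(6 + 20) ≈ 0.154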
Questions 6-11 pertain to the Park et al paper (presented in lecture and in the Powerpoint slides)
comparing 5 techniques for homolog detection, benchmarked on a dataset of structural domains from
SCOP.
6. What 5 techniques were compared? (2)
Correct answer: BLAST (or GAP-BLAST), FASTA, ISS, PSI-BLAST, UCSC HMM method (SAM-T98)
7. What was the worst performer of all the methods tested? (1)
BLAST
8. What was the top performer of all the methods tested? (1)
UCSC HMM method (SAM-T98)
9. SCOP calls two domains homologous if they are in which part of the SCOP hierarchy? (1)
a. Class b. Fold c. Superfamily
Correct answer: c. Superfamily
10. Two domains are called not homologous by SCOP if they are in: (1)
Different folds
11. The evolutionary relationship between two domains is indeterminate if they are: (1)
In the same fold but different superfamilies.
Part 3: Domain architecture, pairwise alignment and inferences based on alignments (7)
1. Define what is meant by a protein’s multi-domain architecture. (2)
Correct answer: the ordered series of structural domains composing the protein (listed in order from the
N-terminus to the C-terminus).
Common mistake: describing the protein’s overall fold as composed of different domains (but not
mentioning the order in which the domains are found in the primary sequence). You should be able to
convince yourself that including the order in which the domains are found from N-C terminus is necessary
in order to say that two proteins have the same – or different – multi-domain architectures. (1 point)
2. What is meant by a “promiscuous” domain? (1)
Correct answer: structural domains found in many different multi-domain architectures.
Common error: saying that promiscuous domains are not related by evolution, or that they are involved in
protein-protein interaction.
3. What is the structural definition of the term “domain”? (1)
Correct answer: independently folding globular building blocks.
4. Which of the following meets the structural definition of domain? (1)
a. transmembrane domain b. signal peptide c. kinase domain d. enzyme active site motif
Correct answer: c. kinase domain
5. When you hear people say that two proteins are 50% homologous, what do they actually mean? (1)
They mean that the two proteins have 50% identity in a pairwise alignment. (This is an incorrect use of the
term homology but people often use the term this way.)
Common error: saying that two proteins have 50% similarity. (The % similar is not the same as the %
identity – the number of similar residue pairs is almost always higher than the number identical, and
biologists generally use %ID as a rule of thumb in inferring homology.)
6. Under what circumstances is it technically correct (but still not a good idea because it’s likely to be
confusing) to say that two proteins are 50% homologous? (1)
If two proteins share only one (or more) domains in common, and this domain (or domains) comprise half
of the amino acids in each protein (and the two are not homologous outside of this region), it can be
technically correct (but still not a good idea) to say that they are 50% homologous.
Part 4: Gene Ontology and gene “function” (5)
1. Name the three main organizing principles of the Gene Ontology (GO) (2)
molecular function, biological process, cellular component (location)
2. What does the evidence code IEA indicate? (1)
Correct answer: Inferred from Electronic Annotation. (This represents automatic annotation transfer
protocols using bioinformatics tools such as BLAST – with all the errors associated with this approach;
IEA represents the weakest type of supporting evidence.)
Common error: Inferred from Experimental Annotation.
3. If two bacterial genes are in the same operon, they are most likely to: (1)
(a) have the same molecular function,
(b) participate in the same biological process
(c) be homologous.
Correct answer: (b) participate in the same biological process.
4. If two genes have correlated expression patterns, then they are most likely to (1)
(a) have the same molecular function,
(b) participate in the same biological process
(c) be homologous.
Correct answer: (b) participate in the same biological process.