answer key and motivation - updated 4Nov16 - b

HW 5: Structural phylogenomic analysis:
Voltage-gated Potassium channel membrane topology prediction
Answer key and motivation for steps in the homework
Bioe 190
Fall 2016
Notes on this answer key: You’ll see text in red sprinkled throughout this answer
key, along with a few figures. The figures come from a solution provided by Katelyn
Greene. Katelyn went far above what was required, but I’m grateful to her for this
work. The text in black is from the original homework assignment (which has been
edited down a bit for compactness).
1.
Go to UniProt and retrieve and examine the record for KCNA1_HUMAN.
a.
Draw the membrane topology for KCNA1_HUMAN, given the SwissProt
annotation. (For the purposes of this lab, you can assume the SwissProt topology
annotation, including the location of transmembrane domains and intramembrane
segments, is roughly correct.)
Purpose: To gain experience in diagramming membrane proteins. The standard
approach, as presented in lecture, for depicting membrane proteins is a “snake
diagram” (so-called because the protein snakes in and out of the membrane).
Katelyn’s solution (shown below) goes above and beyond what I expected – it’s
publication quality. For the purposes of this homework, I was happy to accept
pictures (taken with a cell phone) of simple hand drawings.
b.
Submit the sequence to Pfam. Include this information on your membrane
topology drawing …
i.
Confirm that the Ion_trans domain spans the entire region from TM segment
S1 to S6.
Purpose: To understand the correspondence between this important Pfam domain
and the membrane-spanning region.
c.
Submit the sequence to TMHMM and compare the results with the TM
domains and topology listed in the SwissProt record. Note which TM domains (or
intramembrane segments) are identified by both SwissProt and TMHMM; note any
disagreements.
Purpose: To understand the common errors produced by transmembrane
prediction tools.
Answer: Most of the proteins were problematic for TMHMM, which missed one or
more TM helices. In most cases, the positively charged S4 segment (the voltage
sensor) stumped TMHMM.
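To see why a charged helix is hard for hydrophobicity-driven predictors, here is a minimal sketch that averages Kyte-Doolittle hydropathy over a 19-residue window. This is not how TMHMM works internally (TMHMM is a hidden Markov model); the two sequences are hypothetical toy helices, not taken from any UniProt record, and the 1.6 cutoff is the commonly quoted Kyte-Doolittle threshold for a 19-residue window, used here as an assumption.

```python
# Kyte-Doolittle hydropathy scale (higher = more hydrophobic).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

THRESHOLD = 1.6  # assumed cutoff for calling a window "TM-like"


def window_hydropathy(seq, window=19):
    """Return the mean hydropathy of every full-length window in seq."""
    scores = [KD[aa] for aa in seq]
    return [
        sum(scores[i:i + window]) / window
        for i in range(len(seq) - window + 1)
    ]


# Hypothetical toy helices (19 residues each), for illustration only.
hydrophobic_helix = "LIVALLFAVILGLIVFALA"
# S4-like pattern: a positively charged residue at roughly every third position.
charged_helix = "LRVLRIVRLFRIAKLLRVL"

for name, seq in [("hydrophobic", hydrophobic_helix), ("charged", charged_helix)]:
    best = max(window_hydropathy(seq))
    print(name, round(best, 2), "TM-like" if best > THRESHOLD else "missed")
```

The charged toy helix falls below the cutoff even though most of its residues are hydrophobic, which is the same kind of miss you saw for S4.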
2.
Run BLAST to find homologs in SwissProt using the UniProt BLAST server.
Aim at having 3 or 4 members per subfamily (with species names that you recognize
and whose relative taxonomic relationships you understand).
Purpose: To gain experience in interpreting MSAs and phylogenetic trees.
3.
Confirm that your selection includes the following UniProt IDs (if not, add
them):
1. KCNA3_HUMAN
2. KCA10_HUMAN
3. KCNA7_HUMAN
4. KCNC1_HUMAN
5. KCNC4_HUMAN
6. KCNA1_ONCMY
7. KCNSK_CAEEL
8. KCNAB_DROME
9. KCNAG_CAEEL
Purpose: I chose these proteins because the SwissProt membrane segment
annotations (the intramembrane and/or transmembrane segments) of these
sequences disagree with the consensus.
4.
Then click the Align button just above the sequence selection box.
5.
Highlight the transmembrane and intramembrane regions using the control
box at the left side of the screen. Create a series of figures, with a screenshot of each
panel displaying either a TM domain or an intramembrane segment. You will
immediately observe that some members (including all of the sequences listed
above) disagree with an obvious consensus. Surprisingly, the sequence similarity in
these regions (between members having the consensus topology and those
disagreeing) can be really high, so you would expect them to have transmembrane
and intramembrane segments in the same positions. As shown in Figure 3, the
(annotated) membrane segment edges do not always align. Recall the Positive
Inside Rule, and draw a rectangle around the consensus region for each TM
and intramembrane segment. Your figure caption should include a title that
indicates which transmembrane/intramembrane segments are displayed. Label
each TM segment by the segment label in SwissProt (e.g., S1, S2, etc.).
Purpose: To gain experience in using a consensus annotation approach. Note that
there are many different ways to derive a consensus. The most common is a
majority-rule approach (as shown in the Pevsner text for secondary structure
prediction). If you used a majority rule consensus, you’d label a column as in the
membrane if a majority of the sequences were labeled as in the membrane for that
position. A strict consensus would require all the sequences to be labeled as in the
membrane. As you can imagine, which rule you use depends on whether you want to
optimize precision (use the strict consensus) or a balance between precision and
recall (use the majority rule). But since you have a separate source of information –
the Positive Inside Rule – you can use this to set the membrane segment border
more precisely: if you see positively charged residues (K, R and H) at the
cytoplasmic side of a membrane segment, you can position the end or beginning of
the membrane segment so that those residues are just outside the membrane.
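The two consensus rules described above can be sketched as follows. This is a minimal illustration, assuming each aligned sequence has already been reduced to a label string of equal length in which "M" marks a column annotated as in the membrane and "-" marks everything else. The label strings are hypothetical toy data, and the function name is my own.

```python
def consensus(labels, rule="majority"):
    """Combine per-sequence membrane labels into one consensus string.

    labels: list of equal-length strings, 'M' = in membrane, '-' = not.
    rule:   'majority' (more than half of the sequences) or
            'strict'   (every sequence must agree).
    """
    n = len(labels)
    out = []
    for column in zip(*labels):
        in_membrane = sum(c == "M" for c in column)
        if rule == "strict":
            keep = in_membrane == n
        else:
            keep = in_membrane > n / 2
        out.append("M" if keep else "-")
    return "".join(out)


# Hypothetical annotations for three aligned sequences (toy data).
toy = [
    "--MMMMMM--",
    "---MMMMMM-",
    "--MMMMM---",
]

print(consensus(toy, "majority"))  # --MMMMMM--
print(consensus(toy, "strict"))    # ---MMMM---
```

The strict consensus gives a shorter segment (higher precision), and the majority rule gives a longer one. Neither rule uses the Positive Inside Rule; you would still adjust the segment edges by eye based on where the K, R and H residues fall.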
6.
Now, examine the outliers. You should be able to see that each of the 9
sequences listed in point 3, above, diverges in fundamental ways from its
homologs. For each protein in the list, explain in what way(s) it departs from the
consensus.
Purpose: to understand the limitations of transmembrane prediction when the
membrane segment does not agree with the model used by the prediction tool (i.e.,
that the TM segment should be hydrophobic): the S4 segment is the voltage-sensor,
and includes positively charged residues. This throws the prediction tools off. The
other problem is that the TM prediction tools don’t (generally) recognize
intramembrane segments, and can mistake intramembrane segments for TM
segments. Since the SwissProt curators make use of TM prediction
webservers/tools, their annotations will include these errors. (SwissProt also
transfers annotations of membrane segments by homology: this is the most likely
explanation for both the accurate annotations and the errors.)
Here’s what you should have found.
1. KCNA3_HUMAN (missing intramembrane segment)
2. KCA10_HUMAN (missing intramembrane segment)
3. KCNA7_HUMAN (missing intramembrane segment)
4. KCNC1_HUMAN (missing intramembrane segment)
5. KCNC4_HUMAN (missing intramembrane segment)
6. KCNA1_ONCMY (missing S4 segment; has a predicted TM domain where
the consensus is an intramembrane segment)
7. KCNSK_CAEEL (missing S4 segment; missing intramembrane segment)
8. KCNAB_DROME (missing intramembrane segment)
9. KCNAG_CAEEL (missing S4 segment)
Note that SwissProt identified a candidate S4 segment for all of these proteins – but
in several cases, the S4 segment was not in the consensus position.
7.
Phylogenomic function prediction and interpreting phylogenetic trees for
protein superfamilies. Examine the tree (below the UniProt MSA display).
Purpose: knowing how to examine phylogenetic trees representing gene families (or
superfamilies, such as these Potassium channel proteins) is a useful skill. This
problem is designed to give you some experience.
a.
Examine the placement of KCNA1_ONCMY.
i.
If KCNA1_ONCMY had not been assigned to the KCNA1 subfamily and had
instead been labeled “Unknown” or “Hypothetical” and you had to predict a function
(gene subfamily assignment) based on its phylogenetic placement, what subfamily
would you assign it to?
Purpose: to gain experience in inferring function based on a phylogenetic
placement.
In this case, the correct inference is that the sequence cannot be classified
functionally based on the tree. If you examine the subtree containing
KCNA1_ONCMY and other sequences (see figure below), you’ll see that all the other
sequences are equally distant from KCNA1_ONCMY (based on a visual inspection of
the tree, recalling how tree distances are measured). Also: the most recent common
ancestor (MRCA) of KCNA1_ONCMY and all of the other subtypes is at the node just
above KCNA1_ONCMY.
Figure from Katelyn’s solution. You’ll note that the tree successfully clusters KCNA2
and KCNA3 subfamilies – but that the KCNA1 subfamily appears to be broken up,
with KCNA1_ONCMY isolated as an outgroup sequence apart from the other KCNA1
sequences.
ii.
If the KCNA1 subfamily assignment were correct, where would you expect
this sequence to be placed in the tree? Explain your logic.
Answer: you’d expect KCNA1_ONCMY to be placed within the KCNA1 subtree.
iii.
Look up the terms monophyletic/monophyly, paraphyletic/paraphyly and
polyphyletic/polyphyly, and decide which term describes the KCNA1 subfamily
(based on the SwissProt assignment of sequences to this subfamily, for sequences
included in the tree). See https://www.mun.ca/biology/scarr/Taxon_types.htm. See
Figure 1, at the end of this document.
Answer: the KCNA1 subfamily is polyphyletic.
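If you want to check monophyly mechanically rather than by eye, here is a minimal sketch. It assumes a rooted tree written as nested Python tuples with leaf names as strings. The tree and leaf names are hypothetical toy data (not the actual UniProt tree), arranged so that one member of group A1 sits outside the other A1 sequences. The test only answers "monophyletic or not"; it does not distinguish paraphyly from polyphyly.

```python
def leaves(node):
    """Return the set of leaf names under a node."""
    if isinstance(node, str):
        return frozenset([node])
    return frozenset().union(*(leaves(child) for child in node))


def clades(node):
    """Yield the leaf set of every subtree, including single leaves."""
    yield leaves(node)
    if not isinstance(node, str):
        for child in node:
            yield from clades(child)


def is_monophyletic(tree, group):
    """A group is monophyletic if some subtree contains exactly its members."""
    return frozenset(group) in set(clades(tree))


# Hypothetical toy tree: A1_outlier branches off before the A1/A2 split.
tree = ("B1_x", ("A1_outlier", (("A1_x", "A1_y"), ("A2_x", "A2_y"))))

print(is_monophyletic(tree, {"A2_x", "A2_y"}))                # True
print(is_monophyletic(tree, {"A1_x", "A1_y"}))                # True
print(is_monophyletic(tree, {"A1_outlier", "A1_x", "A1_y"}))  # False
```

The smallest subtree that contains all three A1 sequences also contains the A2 sequences, so the A1 group as labeled is not monophyletic.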
iv.
Go back to the SwissProt page and examine the evidence supporting the
functional assignment to the KCNA1 subfamily. What evidence is provided? How
strong is that evidence?
Answer: the annotation appears to be based on similarity (to some unnamed
protein). In fact, if you click on the Publications link in the left column, you’ll find a
paper that describes this protein as not belonging to any particular subfamily (see
below).
A publication is linked in (see below).
The abstract (see below) shows that the KCNA1_ONCMY sequence (tsha2) was
described as equally similar to KCNA1, KCNA2 and KCNA3 subtypes: “tsha2 did not
show a preferential sequence homology with a particular subtype of shaker, but
exhibited uniform similarity with mammalian Kv1.1, Kv1.2, and Kv1.3, respectively.”
(This appears to have been either overlooked or ignored by the SwissProt
curators.)
v.
Submit KCNA1_ONCMY to BLAST (either at UniProt or NCBI) to see what
matches come to the top. If you were using an annotation transfer protocol to
predict the function of KCNA1_ONCMY, what subfamily would you assign it to?
Purpose: This part of the homework is designed to demonstrate the problems with
the standard annotation transfer protocol.
Answer: The answers to this problem depend on what sequence database you
selected.
• If you ran BLAST against SwissProt, the top hit is: P22739 (KCNA2_XENLA),
annotated as “Potassium voltage-gated channel subfamily A member 2” (i.e.,
KCNA2 subfamily).
• If you ran BLAST against UniProt (including TrEMBL), the top hit is: G3G7Y7
(G3G7Y7_LATJA), annotated as “Potassium voltage-gated channel Kv1.3”. (A
separate search in SwissProt will show you that Kv1.3 indicates the KCNA3
subfamily – see the annotation for P15384.)
• If you ran BLAST against NR, the top hit is: XP_013981333.1 (from Salmo
salar) annotated as “shaker-related potassium channel tsha2”. (It’s not clear
where this functional description originated; perhaps from KCNA1_ONCMY.)
b.
Examine the subtrees containing three or more sequences with the same
gene name (e.g., KCNA1, KCNA2, KCNA3). Find the maximal subtree that is
restricted to sequences from the same subfamily. (See Figure 4 for an example of a
solution to this part of the lab.) For each subtree satisfying these criteria, do the
following:
i.
Insert a screen shot of the subtree (from the subtree root to the leaf labels).
Note that some students submitted screen shots that were not restricted to the
subtree for that subfamily. Unless your figure somehow indicated the subtree under
consideration (e.g., by putting a box around the subfamily tree – as Katelyn does in
her solution, as shown below -- or placing a disk at the subtree root node) it would
not be an effective figure. This is why the instructions specifically asked you to
restrict the subtree screenshot from the subtree root to the leaf labels.
(Note that Katelyn’s subtree images would have been more effective if she’d
restricted the screenshot to each subtree; this would have enabled her to expand the
image and make the leaf labels (sequence identifiers) larger.)
ii.
Create a figure caption with title “Subtree for <gene name> subfamily”.
iii.
In the figure caption text, explain the assumed evolutionary relationship of
these sequences (ortholog, paralog, super-ortholog) based on having been assigned
by SwissProt to the same subfamily/subtype, and the subtree topology. Use the
most precise term possible, and explain your logic.
Purpose: To gain experience with the orthology subtype definitions and inferring
these relationships by analysis of a phylogenetic tree.
Answer: All of the subtrees should satisfy the definition of super-orthology.
iv.
Explain whether the subfamily is monophyletic, paraphyletic or polyphyletic
– based on the gene name assigned by SwissProt and by the presence or absence of
other proteins with the same gene name in the tree.
Purpose: To gain experience with these phylogenetic terms.
Answer: With the exception of the KCNA1 subfamily tree (which would not include
KCNA1_ONCMY), all of the subtrees should satisfy the definition of monophyletic.
v.
Examine the branching order between the species in the subtree. Is the
branching order congruent with the trusted species phylogenies? (Recall that in step
2 you selected sequences from species whose taxonomic relationships you
understood.)
Purpose: To gain experience with interpreting phylogenetic trees and gene
tree/species tree reconciliation.
Answer: Many of the subtrees were not congruent with the reference species
phylogeny. Shown below are three examples (from Katelyn Greene’s solution), two
that are not congruent with the trusted species tree and one that is.
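A congruence check of this kind can be sketched in a few lines, under some simplifying assumptions: both trees are rooted, the gene subtree has exactly one sequence per species, each leaf has already been relabeled with its species name, and the species tree has been pruned to the same set of species. The gene trees below are hypothetical toy examples. The species tree uses the standard grouping of mammals, then birds, then amphibians.

```python
def leaf_set(node):
    """Return the set of leaf names under a node (nested-tuple tree)."""
    if isinstance(node, str):
        return frozenset([node])
    return frozenset().union(*(leaf_set(child) for child in node))


def clade_sets(node):
    """Return the leaf sets of all subtrees of a rooted tree."""
    found = {leaf_set(node)}
    if not isinstance(node, str):
        for child in node:
            found |= clade_sets(child)
    return found


def congruent(gene_tree, species_tree):
    """Two rooted trees on the same leaves agree if they have the same clades."""
    return clade_sets(gene_tree) == clade_sets(species_tree)


species = ((("human", "mouse"), "chicken"), "frog")

# Hypothetical gene subtrees, leaves relabeled by species.
gene_ok = ("frog", ("chicken", ("mouse", "human")))
gene_bad = ((("human", "chicken"), "mouse"), "frog")

print(congruent(gene_ok, species))   # True
print(congruent(gene_bad, species))  # False
```

The order in which children are written does not matter, only which leaves are grouped together. Subtrees with more than one sequence per species need a proper reconciliation method, which this sketch does not attempt.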
8.
Now, perform a detailed MDA and topology analysis for each of the 4
proteins shown in boldface in point 3 using the same techniques as in part 1 (for
KCNA1_HUMAN).
a.
Based on your analyses, answer the same questions (1a-c).
b.
Examine the SwissProt record for the hit to confirm that the transmembrane
and intramembrane alignment coloring in the UniProt MSA reflects the actual
assignments in the SwissProt record.
c.
Derive and evaluate the pairwise alignment of KCNA1_HUMAN and each hit
being evaluated using the NCBI BLAST server (select the “Align two or more
sequences” checkbox).
Purpose: To gain experience with using the BLAST “align two or more sequences”
tool, and in examining a pairwise alignment to evaluate whether the two share a
common MDA.
i.
Insert a screenshot of the pairwise alignment, including the alignment
statistics (%ID and E-value, etc.). Highlight (or draw a box around) each region
where the SwissProt labeling of transmembrane or intramembrane segments
disagrees with the consensus.
ii.
Evaluate the degree to which you can infer that the two proteins share the
same MDA based exclusively on the pairwise alignment. Is the alignment global to
each, or local to one or both?
Note: This was a slightly hard call. Many heuristic approaches for evaluating
whether two proteins are globally alignable examine only the E-value and fractional
(bi-directional) overlap, using a cutoff of 70% or 80% overlap. If you use this
approach, you have a lot of company. But in fact, it’s potentially problematic, as
these analyses will show you.
If you want to predict whether two proteins share a common MDA – based on the
pairwise alignment – you should be on the lookout for any extended regions that are
not included in the alignment. For all of the proteins you are asked to evaluate, the
pairwise alignments appear convincing: the alignments have strong E-values, good
to high pairwise percent identity and low (to moderate) gaps, and most of the
residues are included in the pairwise alignment. But if you look more closely, you’ll
see that several are in an ambiguous zone with long stretches at the N-terminus
(before the BTB/POZ domain) and/or at the C-terminus (after the Ion_trans
domain) that are not included. If you return to the MSA produced by the UniProt
server you’ll see that these regions appear variable across the set of homologs I
asked you to select – but then, I deliberately asked you to include sequences that
aligned to these two domains but allowed glocal matches.
Two sequences with extended regions not included in the pairwise BLAST
alignment (and small to moderate indels in the BTB/POZ and Ion_trans domains):
• KCNA1_ONCMY (very minor indels in the BTB/POZ and Ion_trans domains,
but the BLAST alignment leaves the last 110aa of KCNA1_HUMAN out).
• KCNSK_CAEEL (the first 134 amino acids, before the BTB/POZ domain, are
not included; moderate size gap (12aa) in the BTB/POZ domain).
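The point about overlap cutoffs can be made concrete with a short sketch. It computes the bi-directional overlap from the alignment coordinates and also reports the length of each unaligned terminal stretch. The sequence lengths and coordinates in the example are hypothetical (only the 110-residue C-terminal tail echoes the KCNA1_ONCMY case above), and both the 70% overlap cutoff and the 50-residue tail limit are assumptions chosen for illustration.

```python
def coverage_report(q_len, q_start, q_end, s_len, s_start, s_end,
                    min_cov=0.7, max_tail=50):
    """Summarize a pairwise alignment given 1-based inclusive coordinates.

    min_cov:  assumed fractional-overlap cutoff applied to both sequences.
    max_tail: assumed longest unaligned terminus before we get suspicious.
    """
    query_cov = (q_end - q_start + 1) / q_len
    subject_cov = (s_end - s_start + 1) / s_len
    tails = {
        "query_N": q_start - 1,
        "query_C": q_len - q_end,
        "subject_N": s_start - 1,
        "subject_C": s_len - s_end,
    }
    return {
        "query_cov": query_cov,
        "subject_cov": subject_cov,
        "passes_overlap": min(query_cov, subject_cov) >= min_cov,
        "tails": tails,
        "long_tail": max(tails.values()) > max_tail,
    }


# Hypothetical alignment: 500-residue query aligned over 1-390,
# 400-residue subject aligned over 5-395.
report = coverage_report(500, 1, 390, 400, 5, 395)
print(report["passes_overlap"])  # True
print(report["long_tail"])       # True
print(report["tails"])
```

The alignment passes the overlap cutoff, yet 110 residues at the query C-terminus are unaccounted for, which is enough room for an additional domain. That is why the overlap fraction alone should not be read as evidence of a shared MDA.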
iii.
To further confirm agreement in MDA, you can compare the Pfam domains
and domain order. Now compare the Pfam domains found in the hit against the
Pfam domains found in KCNA1_HUMAN:
1.
Are the same Pfam domains found in the same order? (Compare this finding
to your answer to the previous question.)
Answer: yes to all.
2.
Examine the ranges of each Pfam domain in each sequence in the pairwise
alignment: is the entire Pfam BTB_2 and Ion_trans domain included in the pairwise
alignment? Do the Pfam domain ranges overlap perfectly or only somewhat? If there
are insertions or deletions, are they primarily (or exclusively) outside of the two
Pfam domains?
Answer: The Pfam domain boundaries overlap.
iv.
Examine the membrane-spanning and intramembrane segments of
KCNA1_HUMAN (based on the SwissProt annotation). Are there any insertions or
deletions relative to KCNA1_HUMAN in these membrane segments? If there are
indels in the alignment, are they primarily outside of these (KCNA1_HUMAN)
membrane segments?
Purpose: This was designed to help you develop an intuition about the types of
structural/functional pressures on proteins – indels are generally well tolerated in
regions between evolutionary/structural/functional domains and you will seldom
see indels in membrane segments.
Answer: With almost no exceptions, the indels are outside of the membrane
segments identified (by SwissProt) for KCNA1_HUMAN. However, if you take a look
at the MSA produced by Clustal Omega (which I didn’t ask you to do), you’ll see that
the indel characters have been shifted out of the membrane segments – and the
alignments are generally better.
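Counting indels inside and outside the membrane segments can be sketched as follows. The function walks a gapped pairwise alignment and classifies each gap column by its position in the ungapped query. The segment range and both alignments are hypothetical toy data, and the convention that an insertion counts as inside a segment only when both flanking query residues lie in that segment is my own assumption.

```python
def in_segment(lo, hi, segments):
    """True if query positions lo..hi fall within a single segment."""
    return any(start <= lo and hi <= end for start, end in segments)


def gaps_by_region(aligned_query, aligned_hit, segments):
    """Count gap columns inside and outside the query's membrane segments.

    segments: list of (start, end), 1-based inclusive, in ungapped
              query coordinates (e.g. from the query's annotation).
    Returns (inside, outside).
    """
    inside = outside = 0
    pos = 0  # number of query residues consumed so far
    for q, h in zip(aligned_query, aligned_hit):
        if q != "-":
            pos += 1
        if h == "-" and q != "-":
            # Deletion in the hit: located at query residue pos.
            hit = in_segment(pos, pos, segments)
        elif q == "-" and h != "-":
            # Insertion in the hit: sits between query residues pos and pos + 1.
            hit = in_segment(pos, pos + 1, segments)
        else:
            continue
        if hit:
            inside += 1
        else:
            outside += 1
    return inside, outside


segments = [(3, 10)]  # hypothetical membrane segment on the query

# Toy alignment 1: all gaps fall outside the segment.
print(gaps_by_region("MKLLVVAILLGG--STK", "M-LLVVAILLGGPPSTK", segments))  # (0, 3)
# Toy alignment 2: one deletion inside the segment.
print(gaps_by_region("MKLLVVAILLGGSTK", "MKLL-VAILLGGSTK", segments))      # (1, 0)
```

Keep in mind that the counts depend on the aligner: as noted above, a different alignment of the same two sequences can move gap characters across a segment boundary.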
v.
For each region where the membrane segment labeling disagrees with the
consensus, attempt to characterize the similarity between the hit and
KCNA1_HUMAN (very similar, moderately similar, divergent).
Answer: The sequence identity appears high (very similar) in the membrane
segments, even though the annotations in SwissProt do not agree.
vi.
Try to explain the disagreement between the membrane-segment labeling of
KCNA1_HUMAN and the hit. If the sequence similarity is high and there are few
gaps, you should expect that the structural similarity should be high. So if the
membrane segment annotation disagrees dramatically, you might reasonably expect
that biological curator error is to blame. In your opinion, is this (1) biological
curator error (perhaps based on TMHMM or other transmembrane prediction tool
error), (2) sequence divergence (indicative of actual structural divergence), or (3)
some other reason.
Answer: I assume this is biological curator error.