
Protein motif databases
and search tools
PROSITE & PFAM
PROSITE
PROSITE: a documented database using
patterns and profiles as motif descriptors
• PROSITE is an annotated collection of motif descriptors dedicated to the identification of
protein families and domains. The motif descriptors used in PROSITE are either patterns
or profiles, both derived from multiple alignments of homologous sequences.
• The core of PROSITE is composed of two text files:
   • PROSITE.DAT (computer-readable): contains all the information necessary to scan
   sequence(s) for the occurrence of patterns or profiles.
   • PROSITE.DOC: contains textual information that fully documents each pattern or
   profile.
• On 8/3/2011 (Release 20.71), PROSITE contained 1308 patterns and 920 PSSM profiles.
Web: http://www.expasy.ch/prosite/
References
Sigrist et al (2002) Brief. Bioinfo. 3: 265-274
Hulo et al (2007) Nucl Acids Res. 36: D245-249.
PROSITE: Example
Test case: Protein Delta (from Mus musculus)
>tr|Q9DBU9|Q9DBU9_MOUSE Delta-like 4 (Drosophila) OS=Mus musculus GN=Dll4 PE=2 SV=1
MTPASRSACRWALLLLAVLWPQQRAAGSGIFQLRLQEFVNQRGMLANGQSCEPGCRTFFR
ICLKHFQATFSEGPCTFGNVSTPVLGTNSFVVRDKNSGSGRNPLQLPFNFTWPGTFSLNI
QAWHTPGDDLRPETSPGNSLISQIIIQGSLAVGKIWRTDEQNDTLTRLSYSYRVICSDNY
YGESCSRLCKKRDDHFGHYECQPDGSLSCLPGWTGKYCDQPICLSGCHEQNGYCSKPDEC
ICRPGWQGRLCNECIPHNGCRHGTCSIPWQCACDEGWGGLFCDQDLNYCTHHSPCKNGST
CSNSGPKGYTCTCLPGYTGEHCELGLSKCASNPCRNGGSCKDQENSYHCLCPPGYYGQHC
EHSTLTCADSPCFNGGSCRERNQGSSYACECPPNFTGSNCEKKVDRCTSNPCANGGQCQN
RGPSRTCRCRPGFTGTHCELHISDCARSPCAHGGTCHDLENGPVCTCPAGFSGRRCEVRI
THDACASGPCFNGATCYTGLSPNNFVCNCPYGFVGSRCEFPVGLPPSFPWVAVSLGVGLV
VLLVLLVMVVVAVRQLRLRRPDDESREAMNNLSDFQKDNLIPAAQLKNTNQKKELEVDCG
LDKSNCGKLQNHTLDYNLAPGLLGRGGMPGKYPHSDKSLGEKVPLRLHSEKPECRISAIC
SPRDSMYQSVCLISEERNECVIATEV
Protein sequence found in UniProt (ID: Q9DBU9)
http://www.uniprot.org/uniprot/Q9DBU9
Enter your query sequence (in FASTA format) or its UniProt ID in the search form.
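Before submitting a query, it can help to check that the FASTA record is well formed. The following is a minimal sketch, not part of the PROSITE tools: the helper names `parse_fasta` and `parse_uniprot_header` are hypothetical, and the header layout `db|accession|entry_name description` is assumed from the test-case record above (only its first sequence line is reproduced here).

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def parse_uniprot_header(header):
    """Split a header of the assumed form 'db|accession|entry_name description'."""
    identifier, _, description = header.partition(" ")
    db, accession, entry_name = identifier.split("|")
    return {"db": db, "accession": accession,
            "entry_name": entry_name, "description": description}


# Test-case record, truncated to its first sequence line for brevity.
example = """>tr|Q9DBU9|Q9DBU9_MOUSE Delta-like 4 (Drosophila) OS=Mus musculus GN=Dll4 PE=2 SV=1
MTPASRSACRWALLLLAVLWPQQRAAGSGIFQLRLQEFVNQRGMLANGQSCEPGCRTFFR
"""

header, sequence = parse_fasta(example)[0]
info = parse_uniprot_header(header)
print(info["accession"], info["entry_name"], sequence[:10])
```

The accession extracted this way (Q9DBU9) is the identifier that can be entered in place of the full sequence.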
Two groups of results are returned:
• Hits by profiles (PSSM motifs)
• Hits by patterns (consensus/regular expression)
Each hit is shown with its position (aa) in the query sequence.
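Since PROSITE patterns are essentially regular expressions, a pattern hit can be reproduced with an ordinary regex engine. The sketch below assumes the usual PROSITE pattern conventions (elements separated by `-`, `x` for any residue, `[...]` for allowed residues, `{...}` for forbidden residues, `(n)` or `(n,m)` for repeats, `<` and `>` for sequence ends). The function names, the pattern and the toy sequence are illustrative only and are not taken from PROSITE.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        prefix = suffix = ""
        if element.startswith("<"):      # anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):        # anchored at the C-terminus
            suffix, element = "$", element[:-1]
        # Separate the residue specification from an optional (n) or (n,m).
        m = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = m.group(1), m.group(2), m.group(3)
        if core == "x":
            core = "."                   # any residue
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"   # forbidden residues
        if low is not None:
            core += "{" + low + ("," + high if high else "") + "}"
        parts.append(prefix + core + suffix)
    return "".join(parts)


def find_pattern_hits(pattern, sequence):
    """Return non-overlapping hits as (start, end, matched_text), 1-based inclusive."""
    regex = re.compile(prosite_to_regex(pattern))
    return [(m.start() + 1, m.end(), m.group()) for m in regex.finditer(sequence)]


toy_pattern = "C-x(2)-[GP]-{C}-x(1,3)-C"   # illustrative, not a real PROSITE entry
toy_sequence = "AACDEGAKLCAA"
print(prosite_to_regex(toy_pattern))
print(find_pattern_hits(toy_pattern, toy_sequence))
```

Note that `re.finditer` reports non-overlapping matches only, so a real scanner would need extra care where overlapping occurrences matter.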
Click on a domain to obtain its detailed description.
Placing the cursor over a domain highlights it in the sequence. Here the cysteine pairs (C-C)
involved in disulfide bonds are also shown in green.
PFAM
PFAM: multiple sequence alignments
and HMM-profiles of protein domains
• Pfam is a comprehensive database of conserved protein domains and provides tools to search
query sequences for these motifs. These tools are based on profile hidden Markov
models (HMMER3).
• In 2010 (release 24.0), PFAM contained about 12 000 manually verified domain families,
which are described by profile HMMs.
• There are two components to Pfam: Pfam-A and Pfam-B. Pfam-A entries are high-quality,
manually curated families. Although Pfam-A entries cover a large proportion of the
sequences in the underlying sequence database, a supplement is generated from the PRODOM
database (http://prodom.prabi.fr/) to give a more comprehensive coverage of known
proteins. These automatically generated entries are called Pfam-B. Although of lower
quality, Pfam-B families can be useful for identifying functionally conserved regions
when no Pfam-A entries are found.
Web: http://pfam.janelia.org/
References:
Sonnhammer et al (1998) Nucl Acids Res. 26: 320-322.
Bateman et al (1999) Nucleic Acids Res. 27:260-2.
Finn et al (2010) Nucleic Acids Res. 38: D211-22.
PFAM: Example
Test case: Protein Delta (from Mus musculus), the same UniProt sequence (ID: Q9DBU9,
http://www.uniprot.org/uniprot/Q9DBU9) as in the PROSITE example.
Enter your query sequence (in FASTA format) in the search form:
http://pfam.sanger.ac.uk/search?tab=searchSequenceBlock
You can also search for Pfam-B motifs (additional HMM motifs, of lower quality).
Select a cut-off value: choose an E-value (default: 1) or the gathering threshold (the
program then uses a pre-defined threshold on the score for each HMM motif).
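The two cut-off modes can be sketched as a simple filter over a list of hits. This is a hedged illustration rather than the Pfam implementation: the `Hit` record, the `select_hits` function, and every score, E-value and threshold below are made-up values chosen only to show how the two modes can disagree.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    family: str                  # name of the HMM motif
    bit_score: float             # score of the match
    e_value: float               # expected number of chance matches
    gathering_threshold: float   # pre-defined score threshold of this family


def select_hits(hits, mode="gathering", max_e_value=1.0):
    """Keep hits that pass either the per-family threshold or a global E-value."""
    if mode == "gathering":
        return [h for h in hits if h.bit_score >= h.gathering_threshold]
    if mode == "evalue":
        return [h for h in hits if h.e_value <= max_e_value]
    raise ValueError("mode must be 'gathering' or 'evalue'")


# Hypothetical hits: FamB scores below its own threshold but has a small E-value.
hits = [
    Hit("FamA", bit_score=45.2, e_value=1e-12, gathering_threshold=21.0),
    Hit("FamB", bit_score=15.3, e_value=0.02, gathering_threshold=20.0),
]

print([h.family for h in select_hits(hits, mode="gathering")])
print([h.family for h in select_hits(hits, mode="evalue", max_e_value=1.0)])
```

In this toy case the gathering threshold keeps one hit while the E-value cut-off keeps both, which is why a match rejected by one mode may still deserve a closer look.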
The results page shows a schematic view of the domains, the significant matches and the
insignificant matches.
NB: some statistically "insignificant" matches seem biologically significant (EGF).
• Click on the family link for a detailed description of the motif.
• Motifs are grouped into clans.
• Score and E-value are reported (small E-value <=> high significance).
• Click on the alignment link to see the alignment (cf. next slide).
[Figure: the PFAM motif (HMM From, HMM To) aligned against your (query) sequence, with the
envelope start, alignment start, alignment end and envelope end positions marked.]
PFAM reports two sets of domain coordinates for each profile HMM match. The envelope
coordinates delineate the region on the sequence where the match has been probabilistically
determined to lie, whereas the alignment coordinates delineate the region over which the
program is confident that the alignment of the sequence to the profile HMM is correct.
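The relationship between the two coordinate sets can be sketched as a small record. The `DomainMatch` class, its field names and the numbers below are hypothetical; the sketch also assumes that the alignment region lies within the envelope and that coordinates are 1-based and inclusive.

```python
from dataclasses import dataclass


@dataclass
class DomainMatch:
    env_start: int    # envelope start on the query sequence
    env_end: int      # envelope end on the query sequence
    ali_start: int    # alignment start on the query sequence
    ali_end: int      # alignment end on the query sequence
    hmm_from: int     # first matched position of the profile HMM
    hmm_to: int       # last matched position of the profile HMM
    hmm_length: int   # total length of the profile HMM

    def is_consistent(self):
        """Check the assumed nesting: alignment region inside the envelope."""
        return self.env_start <= self.ali_start <= self.ali_end <= self.env_end

    def covers_full_hmm(self):
        """True when the match starts and ends at the end points of the HMM."""
        return self.hmm_from == 1 and self.hmm_to == self.hmm_length

    def aligned_region(self, sequence):
        """Extract the confidently aligned part of the query sequence."""
        return sequence[self.ali_start - 1:self.ali_end]


# Made-up coordinates on a made-up sequence.
match = DomainMatch(env_start=2, env_end=11, ali_start=4, ali_end=9,
                    hmm_from=3, hmm_to=8, hmm_length=10)
print(match.is_consistent(), match.covers_full_hmm())
print(match.aligned_region("ABCDEFGHIJKL"))
```

A match whose HMM coordinates do not span the whole model, as in this toy case, is a partial match to the motif.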
Hits which do not start and end at the end points of the matching HMM are highlighted.
The alignment includes the following rows:
• #HMM: consensus of the HMM. Capital letters indicate the most conserved positions.
• #MATCH: the match between the query sequence and the HMM. A '+' indicates a positive score, which
can be interpreted as a conservative substitution.
• #PP: posterior probability, i.e. the degree of confidence in each individual aligned residue.
0 means 0-5%, 1 means 5-15% and so on; 9 means 85-95% and a '*' means 95-100% posterior
probability.
• #SEQ: query sequence. A '-' indicates a deletion in the query sequence with respect to the HMM. Columns
are coloured according to the posterior probability.
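The #PP encoding described above can be decoded with a few lines of code. This is a minimal sketch: the function names and the example PP string are hypothetical, and skipping any character other than a digit or '*' (for instance a gap placeholder) is an assumption of this example.

```python
def pp_range(symbol):
    """Map one #PP character to its posterior probability range, in percent."""
    if symbol == "*":
        return (95, 100)
    if len(symbol) == 1 and symbol in "0123456789":
        digit = int(symbol)
        low = 0 if digit == 0 else 10 * digit - 5
        return (low, 10 * digit + 5)
    raise ValueError("not a posterior probability symbol: %r" % symbol)


def weak_columns(pp_line, min_percent):
    """Return 1-based columns whose lower probability bound is below min_percent."""
    weak = []
    for column, symbol in enumerate(pp_line, start=1):
        if symbol != "*" and symbol not in "0123456789":
            continue  # assumed to be a non-aligned column; skipped
        if pp_range(symbol)[0] < min_percent:
            weak.append(column)
    return weak


print(pp_range("0"), pp_range("1"), pp_range("9"), pp_range("*"))
print(weak_columns("689**9", min_percent=75))
```

Such a filter gives a quick way to spot the residues of an alignment that should be interpreted with caution.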
NB: you can access this description by clicking on the link at the top of the results page.
Motif Scan (MyHits)
Motif Scan (MyHits, Swiss Institute of Bioinformatics) is a program that allows the
user to run various motif search programs, including PROSITE and PFAM.
Availability: http://hits.isb-sib.ch/cgi-bin/PFSCAN
NCBI - CDD
Conserved Domain Database (CDD)
CDD is a protein annotation resource that consists of a
collection of well-annotated multiple sequence alignment
models for ancient domains and full-length proteins. These
are available as position-specific score matrices (PSSMs)
for fast identification of conserved domains in protein
sequences via RPS-BLAST.
CD-Search & Batch CD-Search
CD-Search is NCBI's interface to searching the Conserved Domain Database with protein or
nucleotide query sequences. It uses RPS-BLAST, a variant of PSI-BLAST, to quickly scan a set of
pre-calculated position-specific scoring matrices (PSSMs) with a protein query.
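To illustrate what scanning a query with a PSSM means, here is a deliberately simplified sketch. It is not RPS-BLAST: it slides an ungapped matrix along the sequence and sums the position-specific scores. The function name, the toy matrix, the default score for unlisted residues and the toy sequence are all assumptions made for this example.

```python
def scan_pssm(sequence, pssm, default_score=-1.0):
    """Score every ungapped window of the sequence against the PSSM.

    pssm is a list of dicts, one per motif position, mapping residue -> score.
    Returns a list of (start, score) with 1-based start positions.
    """
    width = len(pssm)
    results = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(column.get(residue, default_score)
                    for column, residue in zip(pssm, window))
        results.append((start + 1, score))
    return results


# Toy 3-position matrix with made-up scores.
toy_pssm = [
    {"C": 3.0},
    {"G": 1.0, "A": 0.5},
    {"C": 3.0},
]

scores = scan_pssm("ACGCA", toy_pssm)
best_hit = max(scores, key=lambda item: item[1])
print(scores)
print(best_hit)
```

A real search additionally handles gaps and converts scores into E-values, which this sketch leaves out.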
CDTree
CDTree is a helper application for your web browser that allows you to interactively view and
examine conserved domain hierarchies curated at NCBI.
Web: http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml
NCBI - Search for CDD
Availability: http://www.ncbi.nlm.nih.gov/Structure/cdd/wrpsb.cgi
Marchler-Bauer A et al. (2015) CDD: NCBI's conserved domain database. Nucleic Acids Res. 43: D222-6.
Marchler-Bauer A et al. (2011) CDD: a Conserved Domain Database for the functional annotation of proteins. Nucleic Acids Res. 39: D225-9.
Marchler-Bauer A et al. (2009) CDD: specific functional annotation with the Conserved Domain Database. Nucleic Acids Res. 37: D205-10.
Marchler-Bauer A, Bryant SH (2004) CD-Search: protein domain annotations on the fly. Nucleic Acids Res. 32: W327-331.
http://www.ncbi.nlm.nih.gov/Structure/cdd/docs/cdd_how_to_protein_function.html