Protein motif databases and search tools
PROSITE & PFAM

PROSITE

PROSITE: a documented database using patterns and profiles as motif descriptors

PROSITE is an annotated collection of motif descriptors dedicated to the identification of protein families and domains. The motif descriptors used in PROSITE are either patterns or profiles, both derived from multiple alignments of homologous sequences.

The core of PROSITE is composed of two text files:
• PROSITE.DAT (computer-readable): contains all the information necessary to scan sequence(s) for the occurrence of patterns or profiles.
• PROSITE.DOC: contains textual information that fully documents each pattern or profile.

On 8/3/2011 (Release 20.71), PROSITE contained 1308 patterns and 920 PSSM profiles.

Web: http://www.expasy.ch/prosite/

References
Sigrist et al (2002) Brief. Bioinfo. 3: 265-274.
Hulo et al (2007) Nucl Acids Res. 36: D245-249.

PROSITE: Example

Test case: Protein Delta (from Mus musculus)

>tr|Q9DBU9|Q9DBU9_MOUSE Delta-like 4 (Drosophila) OS=Mus musculus GN=Dll4 PE=2 SV=1
MTPASRSACRWALLLLAVLWPQQRAAGSGIFQLRLQEFVNQRGMLANGQSCEPGCRTFFR
ICLKHFQATFSEGPCTFGNVSTPVLGTNSFVVRDKNSGSGRNPLQLPFNFTWPGTFSLNI
QAWHTPGDDLRPETSPGNSLISQIIIQGSLAVGKIWRTDEQNDTLTRLSYSYRVICSDNY
YGESCSRLCKKRDDHFGHYECQPDGSLSCLPGWTGKYCDQPICLSGCHEQNGYCSKPDEC
ICRPGWQGRLCNECIPHNGCRHGTCSIPWQCACDEGWGGLFCDQDLNYCTHHSPCKNGST
CSNSGPKGYTCTCLPGYTGEHCELGLSKCASNPCRNGGSCKDQENSYHCLCPPGYYGQHC
EHSTLTCADSPCFNGGSCRERNQGSSYACECPPNFTGSNCEKKVDRCTSNPCANGGQCQN
RGPSRTCRCRPGFTGTHCELHISDCARSPCAHGGTCHDLENGPVCTCPAGFSGRRCEVRI
THDACASGPCFNGATCYTGLSPNNFVCNCPYGFVGSRCEFPVGLPPSFPWVAVSLGVGLV
VLLVLLVMVVVAVRQLRLRRPDDESREAMNNLSDFQKDNLIPAAQLKNTNQKKELEVDCG
LDKSNCGKLQNHTLDYNLAPGLLGRGGMPGKYPHSDKSLGEKVPLRLHSEKPECRISAIC
SPRDSMYQSVCLISEERNECVIATEV

Protein sequence found in UniProt (ID: Q9DBU9)
http://www.uniprot.org/uniprot/Q9DBU9

PROSITE: Example

Enter your query sequence (in FASTA format) or the UniProt ID in the search form.

PROSITE: Example

Two groups of results are returned:
• Hits by profiles (PSSM motifs)
• Hits by patterns (consensus/regular expression)
Labels on the result view: position, aa.

PROSITE: Example

Click on a domain to obtain a detailed description of it. Placing the cursor over a domain highlights it in the sequence. The cysteine pairs (C-C) involved in disulfide bonds are also shown, in green.

PFAM

PFAM: multiple sequence alignment and HMM-profiles of protein domains

Pfam is a comprehensive database of conserved protein domains and families, and provides tools to search query sequences for these motifs. These tools are based on profile hidden Markov models (HMMER3).

In 2010 (release 24.0), PFAM contained about 12 000 manually verified domain families, each described by a profile HMM.

There are two components to Pfam: Pfam-A and Pfam-B.
• Pfam-A entries are high-quality, manually curated families.
• Pfam-B entries are generated automatically. Although Pfam-A entries cover a large proportion of the sequences in the underlying sequence database, a supplement is generated from the PRODOM database (http://prodom.prabi.fr/) to give more comprehensive coverage of known proteins. Although of lower quality, Pfam-B families can be useful for identifying functionally conserved regions when no Pfam-A entries are found.

Web: http://pfam.janelia.org/

References
Sonnhammer et al (1998) Nucl Acids Res. 26: 302-322.
Bateman et al (1999) Nucleic Acids Res. 27: 260-2.
Finn et al (2010) Nucleic Acids Res. 38: D211-22.
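To make the "pattern" descriptors used by PROSITE more concrete, the sketch below translates a pattern written in PROSITE syntax into a regular expression and scans a sequence with it. It is only an illustration, not the PROSITE scanning software: the function names `prosite_to_regex` and `find_hits` are hypothetical, the EGF-like pattern is given as an example of the syntax (check the PROSITE entry for the authoritative pattern), and rarer syntax such as `>` inside square brackets is not handled. The test fragment is taken from the Delta-like 4 sequence used in the examples.

```python
import re
from typing import List, Tuple


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-syntax pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        # Optional anchors: '<' = N-terminus, '>' = C-terminus.
        prefix = suffix = ""
        if element.startswith("<"):
            prefix, element = "^", element[1:]
        if element.endswith(">"):
            suffix, element = "$", element[:-1]
        # Optional repetition: x(2) -> {2}, x(4,8) -> {4,8}.
        repeat = ""
        m = re.fullmatch(r"(.+?)\((\d+(?:,\d+)?)\)", element)
        if m:
            element, repeat = m.group(1), "{" + m.group(2) + "}"
        if element == "x":                      # any residue
            core = "."
        elif element.startswith("{") and element.endswith("}"):
            core = "[^" + element[1:-1] + "]"   # forbidden residues
        else:
            core = element                      # one residue or [ABC]
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)


def find_hits(sequence: str, pattern: str) -> List[Tuple[int, int, str]]:
    """Return (start, end, matched text) for each non-overlapping hit, 1-based."""
    regex = re.compile(prosite_to_regex(pattern))
    return [(m.start() + 1, m.end(), m.group()) for m in regex.finditer(sequence)]


if __name__ == "__main__":
    egf_like = "C-x-C-x(2)-[GP]-[FYW]-x(4,8)-C"   # illustrative pattern
    fragment = "ENGPVCTCPAGFSGRRCEVRI"            # from the Dll4 test sequence
    print(prosite_to_regex(egf_like))
    print(find_hits(fragment, egf_like))
```

Reporting 1-based positions follows the usual convention for protein coordinates; `re.finditer` returns non-overlapping matches only, so overlapping occurrences of a pattern would need a lookahead-based variant.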
PFAM: Example

Test case: Protein Delta (from Mus musculus)

>tr|Q9DBU9|Q9DBU9_MOUSE Delta-like 4 (Drosophila) OS=Mus musculus GN=Dll4 PE=2 SV=1
(same sequence as in the PROSITE example)

Protein sequence found in UniProt (ID: Q9DBU9)
http://www.uniprot.org/uniprot/Q9DBU9

PFAM: Example

Enter your query sequence (in FASTA format) in the search form.
http://pfam.sanger.ac.uk/search?tab=searchSequenceBlock
You can also search for Pfam-B motifs (additional HMM motifs, of lower quality).
Select a cut-off value: choose an E-value (default: 1) or the gathering threshold (the program then uses a pre-defined score threshold for each HMM motif).

PFAM: Example

Schematic view of the domains
Significant matches
Insignificant matches
NB: some statistically "insignificant" matches seem biologically significant (EGF).

PFAM: Example

Click on the motif link for a detailed description of the motif.
Motifs are grouped into clans.
Score and E-value (small E-value <=> high significance).
Click on the alignment link to see the alignment (cf. next slide).

[Figure: a PFAM motif, with its "HMM From" and "HMM To" positions, mapped onto your (query) sequence, with the envelope Start/End and the alignment Start/End marked.]

PFAM reports two sets of domain coordinates for each profile HMM match.
The envelope coordinates delineate the region on the sequence where the match has been probabilistically determined to lie, whereas the alignment coordinates delineate the region over which the program is confident that the alignment of the sequence to the profile HMM is correct.

PFAM: Example

The alignment includes the following rows:
• #HMM: consensus of the HMM. Capital letters indicate the most conserved positions.
• #MATCH: the match between the query sequence and the HMM. A '+' indicates a positive score, which can be interpreted as a conservative substitution.
• #PP: posterior probability, i.e. the degree of confidence in each individual aligned residue. 0 means 0-5%, 1 means 5-15% and so on; 9 means 85-95% and a '*' means 95-100% posterior probability.
• #SEQ: query sequence. A '-' indicates a deletion in the query sequence with respect to the HMM.
Hits which do not start and end at the end points of the matching HMM are highlighted. Columns are coloured according to the posterior probability.
NB: you can access this description through the link at the top of the results page.

Motif Scan (MyHits)

Motif Scan (MyHits, Swiss Institute of Bioinformatics) is a program that allows the user to run various motif search programs, including PROSITE and PFAM.

Availability: http://hits.isb-sib.ch/cgi-bin/PFSCAN

NCBI - CDD

Conserved Domain Database (CDD)
CDD is a protein annotation resource that consists of a collection of well-annotated multiple sequence alignment models for ancient domains and full-length proteins. These are available as position-specific score matrices (PSSMs) for fast identification of conserved domains in protein sequences via RPS-BLAST.

CD-Search & Batch CD-Search
CD-Search is NCBI's interface for searching the Conserved Domain Database with protein or nucleotide query sequences. It uses RPS-BLAST, a variant of PSI-BLAST, to quickly scan a set of pre-calculated position-specific scoring matrices (PSSMs) with a protein query.
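The idea of scanning a protein query against a pre-calculated PSSM can be illustrated with a minimal sliding-window sketch. This is not RPS-BLAST, which is far more elaborate (gapped alignment, E-value statistics); the matrix `TOY_PSSM`, its scores, the default score and the function name `scan_with_pssm` are all made up for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical toy PSSM for a 3-residue motif: one dictionary of integer
# log-odds-like scores per motif position. The values are invented.
TOY_PSSM: List[Dict[str, int]] = [
    {"C": 5},
    {"G": 2, "P": 2, "A": 1},
    {"C": 5},
]
DEFAULT_SCORE = -1  # assumed score for residues not listed at a position


def scan_with_pssm(
    sequence: str,
    pssm: List[Dict[str, int]],
    threshold: int,
    default: int = DEFAULT_SCORE,
) -> List[Tuple[int, int]]:
    """Slide the PSSM along the sequence (ungapped) and return
    (1-based start position, score) for every window scoring >= threshold."""
    width = len(pssm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        # Sum the per-position score of each residue in the window.
        score = sum(col.get(res, default) for col, res in zip(pssm, window))
        if score >= threshold:
            hits.append((start + 1, score))
    return hits


if __name__ == "__main__":
    print(scan_with_pssm("ACGCTCPC", TOY_PSSM, threshold=10))
```

Lowering the threshold returns more, weaker windows, which mirrors the trade-off between sensitivity and specificity when choosing a cut-off in the real tools.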
CDTree
CDTree is a helper application for your web browser that allows you to interactively view and examine conserved domain hierarchies curated at NCBI.

Web: http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml

NCBI - Search for CDD

Availability: http://www.ncbi.nlm.nih.gov/Structure/cdd/wrpsb.cgi

References
Marchler-Bauer A et al. (2015) CDD: NCBI's conserved domain database, Nucleic Acids Res. 43(D): 222-6.
Marchler-Bauer A et al. (2011) CDD: a Conserved Domain Database for the functional annotation of proteins, Nucleic Acids Res. 39(D): 225-9.
Marchler-Bauer A et al. (2009) CDD: specific functional annotation with the Conserved Domain Database, Nucleic Acids Res. 37(D): 205-10.
Marchler-Bauer A, Bryant SH (2004) CD-Search: protein domain annotations on the fly, Nucleic Acids Res. 32(W): 327-331.

NCBI - Search for CDD

http://www.ncbi.nlm.nih.gov/Structure/cdd/docs/cdd_how_to_protein_function.html