Primary Protein Sequence Databases

Bioinformatics – a definition?
The design, construction and use of software tools to generate,
store, annotate, access and analyse data and information relating
to Molecular Biology
OR
Biologists doing “stuff” with computers?
Here we consider the use of Bioinformatics tools rather than
their design and construction
Here we consider the access and analysis of data and
information items rather than their generation, storage or
annotation
Databases – Genes to Genomes
Bioinformatics – the simple view
Bioinformatics – the simple view 2
[Diagram: raw sequence; sequence assembly; sequence or genome annotation; annotated sequence]
Annotation: adding notes
AD-NOTARE
A textbook with your own notes is valuable…

Data (e.g. sequence)

Data on data (annotation, meta-data)

Data on annotations (ontologies,
meta-meta-data: defining the
language of annotations)
Anything added to the “standard description” is annotation
Building a database from raw data + annotations

Put raw data into database records

Add basic annotations (project name, date etc.)

Add annotations by similarity. This is called database searching (it gives
results such as: 95% similarity to trypsin → probably trypsin. But only probably!)

Add further information based on human knowledge (analysis programs,
literature search)
So our notes are partly trivial, partly based on guesses (similarity) or
on sophisticated background work.
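
The record-building steps above can be sketched in a few lines of Python. This is a minimal illustration, not how real database searching works: the `Record` class, the function names, the sequences and the 90% threshold are all invented for the example, and a position-by-position identity count stands in for the alignment-based comparison that real search tools perform.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    """One database record: raw data plus its annotations."""
    sequence: str
    annotations: dict = field(default_factory=dict)


def percent_identity(a: str, b: str) -> float:
    """Percentage of identical residues, compared position by position.

    Toy measure: no alignment is performed and only the length of the
    shorter sequence is compared.
    """
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return 100.0 * matches / n


def annotate_by_similarity(record, reference_name, reference_seq, threshold=90.0):
    """Add a hedged annotation if the record resembles a known sequence."""
    score = percent_identity(record.sequence, reference_seq)
    if score >= threshold:
        record.annotations["similarity"] = (
            f"{score:.0f}% identity to {reference_name}; probably {reference_name}"
        )
    return score


# Hypothetical data: both sequences and the name "known protein" are invented.
rec = Record(sequence="MRNGGTTLKA")               # raw data into a record
rec.annotations["project"] = "demo project"       # basic annotation
score = annotate_by_similarity(rec, "known protein", "MRNGGTTLKV")
print(score, rec.annotations)
```

The annotation text keeps the word "probably" on purpose: a similarity hit is a guess, and the record should say so.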
“Generalized structures” As Database Records
ANNOTATIONS
Identification
Name of protein
Organism
Function
Cross-references (other databases)
...
Domain structure
Secondary structure
Disulphides
...
CIPKWNRCGPKMDGVPCCEPYTCTSDYYGNC
Sequence (structure)
qfinetdttvivtwtpprarivgyrltvgllseeg
depqyldlpstatsvnipdllpgrkytvnvyeise
egeqnlilstsqttapdappdptvdqvddtsivvr
wsrprapitgyrivyspsvegsstelnlpetansv
tlsdlqpgvqynitiyaveenqestpvfiqqettg
vprsdkvppprdlqfvevtdvkitimwtppespvt
gyrvdvipvnlpgehgqrlpvsrntfaevtglspg
vtyhfkv
Database record, fields
STRUCTURE, e.g. SEQUENCE
Annotation of (sequence) data
Global descriptors, e.g. function
Local (positional) descriptors, e.g. domains
Annotation requires database searching and knowledge of “biology”
Structure: Theoretical topology + annotation
Theoretical topology (number line): positions 1, 2, 3, ... 20
Assigning amino acids (M, R, N, G, G, T, T...) = sequence
α-helix → secondary structures
Hydrophobicity or other numerical properties
Function (protease)
This is just to make a playful point: even a structure can be viewed as annotation, i.e. we need
broadly similar algorithms all the way. DNA and proteins have a linear, chain-like topology. We
assign properties to arbitrary segments of this chain. This is a database-centric view, as opposed
to the network-like systems-theory view of entities and relationships.
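
As a small illustration of assigning a numerical property along the chain, the sketch below attaches Kyte-Doolittle hydropathy values to each residue and averages them over a sliding window. The choice of the Kyte-Doolittle scale, the window size of 3, the function name and the numbering from position 1 are assumptions made for the example; the residues M, R, N, G, G, T, T are the ones shown on the number line.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(sequence, window=3):
    """Mean hydropathy of every run of `window` consecutive residues."""
    values = [KYTE_DOOLITTLE[aa] for aa in sequence.upper()]
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


sequence = "MRNGGTT"

# Per-position annotation: position on the number line, residue, property.
for position, residue in enumerate(sequence, start=1):
    print(position, residue, KYTE_DOOLITTLE[residue])

# Per-segment annotation: one value for each window of three residues.
profile = hydropathy_profile(sequence, window=3)
print(profile)
```

The same pattern (a position or a start-end segment, plus a value) would serve for secondary structure or function labels; only the type of the value changes.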
Annotation and the World Wide Web

Traditionally, annotations to a structure are validated and added
by humans: authors trying to suggest a function for a new
gene, database developers trying to add structural or functional
descriptions to molecular data, etc.

The WWW is the biggest annotation system: millions of non-validated links are added to data. Important types include
databases (bioinformatics and bibliographic), Wikipedia
(community-based encyclopedia), specialist wikis, blogs,
discussion lists. A Google search is a first step…

Today, database annotation means generating standard
language descriptions for data, validated via Internet links and
specialized programs. It still relies on human intervention.
Bioinformatics – the simple view 3
[Diagram: raw sequence; sequence assembly; DATABASE (annotated sequence); visualization (picture view, text view)]
Sequence codes
• One-letter codes are used
• Amino acids: 20-letter alphabet
• Nucleotides: 4-letter alphabet (with either T (DNA) or U (RNA))
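
The alphabets above can be written down directly as sets of letters. The following is a minimal sketch; the names `uses_only` and `dna_to_rna` are invented for the example, and ambiguity codes (such as N for an unknown nucleotide) are deliberately left out.

```python
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")  # 20 standard amino acids
DNA = set("ACGT")                          # nucleotides with T
RNA = set("ACGU")                          # nucleotides with U


def uses_only(sequence, alphabet):
    """True if every letter of the sequence belongs to the alphabet."""
    return set(sequence.upper()) <= alphabet


def dna_to_rna(sequence):
    """Rewrite a DNA sequence in the RNA alphabet (T becomes U)."""
    return sequence.upper().replace("T", "U")


seq = "ACAAGTTG"
print(uses_only(seq, DNA))          # fits the DNA alphabet
print(uses_only(seq, RNA))          # contains T, so it is not RNA
print(uses_only(seq, AMINO_ACIDS))  # A, C, G and T are also amino acid codes
print(dna_to_rna(seq))
```

Note that a short nucleotide string is also a valid string over the amino acid alphabet, so the letters alone do not always tell you which kind of sequence you are looking at.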
Sequence formats
• Simple (“FASTA”) format
>name
ACAAGTTG
• Multiple “Concatenated FASTA”
>name1
ACAAGTTG
>name2
ACAAGTTG
>name3
ACAAGTTG
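
Reading this format takes only a few lines. The sketch below is a minimal parser written for illustration (the function name `parse_fasta` is invented here); it returns a list of (name, sequence) pairs so that entries are kept in order even if two of them share a name, and it joins sequences that are wrapped over several lines.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (name, sequence) pairs."""
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                      # skip blank lines
        if line.startswith(">"):
            if name is not None:          # close the previous entry
                records.append((name, "".join(chunks)))
            name, chunks = line[1:].strip(), []
        elif name is not None:
            chunks.append(line)           # sequence may span several lines
    if name is not None:                  # close the last entry
        records.append((name, "".join(chunks)))
    return records


example = """>name1
ACAAGTTG
>name2
ACAAG
TTG
"""

records = parse_fasta(example)
print(records)
```

For real work an established library parser is the safer choice, since FASTA headers in the wild carry many database-specific conventions that this sketch ignores.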
PROTEIN SEQUENCE
ANNOTATED WITH DOMAINS
Domain A
Domain B
001-200  DOMAIN  PROTEASE A
205-230  DOMAIN  TRANSMEMBRANE
250-350  DOMAIN  SIGNAL BINDING
TABULAR DESCRIPTION: FEATURE TABLE, PTT TABLE
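
A feature table of this kind is easy to hold in memory as a list of (start, end, key, description) rows. The sketch below uses the three domain rows from this slide; the `Feature` type, the function names and the whitespace-separated layout are assumptions for illustration and do not reproduce the actual feature table or .ptt file formats.

```python
from typing import NamedTuple


class Feature(NamedTuple):
    """One row of a feature table: a segment of the chain plus a label."""
    start: int
    end: int
    key: str
    description: str


def parse_feature_table(text):
    """Parse lines of the form 'START-END KEY DESCRIPTION'."""
    features = []
    for line in text.strip().splitlines():
        span, key, description = line.split(None, 2)
        start, end = span.split("-")
        features.append(Feature(int(start), int(end), key, description.strip()))
    return features


def features_at(features, position):
    """All features whose segment covers the given sequence position."""
    return [f for f in features if f.start <= position <= f.end]


TABLE = """
001-200 DOMAIN PROTEASE A
205-230 DOMAIN TRANSMEMBRANE
250-350 DOMAIN SIGNAL BINDING
"""

features = parse_feature_table(TABLE)
print(features_at(features, 210))   # inside the transmembrane segment
print(features_at(features, 240))   # between two domains: no feature
```

The lookup is a linear scan, which is fine for a single protein; a genome-scale table with thousands of genes would usually call for a sorted or indexed structure.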
Sanjar Hudaiberdiev
GENOME SEQUENCE
ANNOTATED WITH GENES
[Figure: sequence view of a genome with Gene 1 and Gene 2 marked]
Genome annotation: .ptt table
Primary DNA Sequence Databases
Original submission by experimentalists
Content controlled by the submitter
Primary Protein Sequence Databases
UniProtKB, the protein knowledgebase,
consists of two sections:
• Swiss-Prot, manually annotated, reviewed.
• TrEMBL, automatically annotated, not reviewed.
Derivative Databases
Built from primary data
Primary databases: submission by experimentalists; controlled by the submitter; akin to the primary research literature
RefSeq: non-redundant; richly annotated; DNA, RNA, protein; diverse taxa; akin to the review literature
Derivative Databases
Protein domains, motifs, families
Protein domains/families represented as alignments and HMMs
Derived primarily from UniProtKB and GenPept
Manually curated models for several hundred protein domains
Derived from proteins from completely sequenced genomes
Derivative Databases
Protein domains, motifs, families
Protein motifs/domains represented as Patterns and/or HMMs
Both derived from UniProtKB/Swiss-Prot
Patterns are for highly conserved short regions. Example:
R-P-C-x(11)-C-V-S
HMMs are for less conserved longer regions.
Often there will be pattern(s) and an HMM for one domain.
HMM matches
Pattern matches
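
A pattern such as R-P-C-x(11)-C-V-S maps naturally onto a regular expression. The sketch below translates a subset of the pattern notation (single residues, `x`, `[...]` alternatives, `{...}` exclusions, repeat counts, and `<` / `>` anchors at the ends) and searches a sequence with it. The function name and the test sequence are invented for illustration, and this is not the matching code used by any pattern database.

```python
import re

# One pattern element: x, a residue, [alternatives] or {exclusions},
# optionally followed by a repeat count such as (11) or (2,4).
ELEMENT = re.compile(r"(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.strip().rstrip(".").split("-"):
        prefix = suffix = ""
        if element.startswith("<"):       # anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):         # anchored at the C-terminus
            suffix, element = "$", element[:-1]
        m = ELEMENT.fullmatch(element)
        if m is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, low, high = m.groups()
        if core == "x":                   # any residue
            core = "."
        elif core.startswith("{"):        # any residue except those listed
            core = "[^" + core[1:-1] + "]"
        repeat = ""
        if low is not None:
            repeat = "{" + low + ("," + high if high else "") + "}"
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)


regex = prosite_to_regex("R-P-C-x(11)-C-V-S")
print(regex)

# Hypothetical test sequence built to contain one match.
sequence = "AA" + "RPC" + "G" * 11 + "CVS" + "KK"
hit = re.search(regex, sequence)
print(hit.start() + 1, hit.end())         # 1-based start and end of the match
```

An exact pattern like this either matches or it does not, which is why it suits short, highly conserved regions; longer and less conserved regions need a scoring model such as an HMM.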
Database Access
Each database must have software to enable searching
Either by text term against annotation
And / Or by data comparison
Database inquiry by text search:
Sequence Retrieval System – SRS
Searching annotation by text match
Implemented in many places
Follows links between databases
Can allow in situ analysis of matches
Database Access
Database inquiry by data comparison:
Sequence databases
BLAST - for sequence databases, DNA or Protein
Implemented in very many places
Most notably at the NCBI
Protein domain, motif, family databases
Each database has a customised search tool
To search all databases is a lot of work!
Database Access
InterPro is a consortium of member databases
InterPro defines protein families, domains, regions, repeats
and sites according to matches against member databases
InterPro enables any subset of member databases to be
searched together
The End
Bioinformatics – the simple view 3
[Diagram: raw sequence; sequence assembly; DATABASE (annotated sequence); visualization; picture for your paper]