Bioinformatics – a definition?

The design, construction and use of software tools to generate, store, annotate, access and analyse data and information relating to Molecular Biology

OR

Biologists doing “stuff” with computers?

• Here we consider the use of Bioinformatics tools rather than their design and construction
• Here we consider the access and analysis of data and information items rather than their generation, storage or annotation

Databases – Genes to Genomes

Bioinformatics – the simple view

Bioinformatics – the simple view 2

Diagram: Raw sequence, Sequence assembly, Sequence or genome annotation, Annotated sequence

Annotation: adding notes (AD-NOTARE)

A textbook with your own notes is valuable…
• Data (e.g. sequence)
• Data on data (annotation, meta-data)
• Data on annotations (ontologies, meta-meta-data: defining the language of annotations)
Anything added to the “standard description” is annotation

Building a database from raw data + annotations

• Put raw data into database records
• Add basic annotations (project name, date etc.)
• Add annotations by similarity. This is called database searching (it gives results such as: 95% similarity to trypsin, so probably trypsin. But only probably!)
• Add further information based on human knowledge (analysis programs, literature search)
So our notes are partly trivial, partly based on guesses (similarity) or on sophisticated background work.

“Generalized structures” as database records

Database record, fields:
ANNOTATIONS
• Identification
• Name of protein
• Organism
• Function
• Cross-references ... (other db’s)
• Domain structure
• Sec. structure
• Disulphides ….
STRUCTURE, e.g. SEQUENCE
CIPKWNRCGPKMDGVPCCEPYTCTSDYYGNC
Sequence (structure)
qfinetdttvivtwtpprarivgyrltvgllseeg
depqyldlpstatsvnipdllpgrkytvnvyeise
egeqnlilstsqttapdappdptvdqvddtsivvr
wsrprapitgyrivyspsvegsstelnlpetansv
tlsdlqpgvqynitiyaveenqestpvfiqqettg
vprsdkvppprdlqfvevtdvkitimwtppespvt
gyrvdvipvnlpgehgqrlpvsrntfaevtglspg
vtyhfkv

Annotation of (sequence) data

• Global descriptors, e.g. function
• Local (positional) descriptors, e.g. domains
Annotation requires database searching and knowledge of “biology”

Structure: Theoretical topology + annotation

• Theoretical topology (number line): 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
• Assigning amino acids to positions = sequence: M R N G G T ... T...
• α-helix: secondary structures
• Hydrophobicity or other numerical properties
• Function (protease)
This is just to make a funny point: even a structure can be viewed as annotation, i.e. we need grossly similar algorithms all the way. DNA and proteins have a linear, chain-like topology. We assign properties to arbitrary segments of this chain. This is a database-centric view, as opposed to the network-like systems theory view of entities and relationships.

Annotation and the World Wide Web

• Traditionally, annotations to a structure are validated and added by humans: authors trying to suggest a function for a new gene, database developers trying to add structural or functional descriptions to molecular data, etc.
• The WWW is the biggest annotation system: millions of nonvalidated links are added to data. Important types include databases (bioinformatics and bibliographic), Wikipedia (community-based encyclopedia), specialist wikis, blogs, discussion lists. Google search is a first step…
• Today, database annotation means generating standard language descriptions for data, validated via Internet links and specialized programs. This relies on human intervention.
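
The database-centric view described earlier, in which global descriptors attach to the whole record and local descriptors attach to segments of a linear chain, can be sketched in a few lines of Python. This is only an illustration: the class names `Record` and `Feature`, the field names, the demo sequence and the coordinates are all made up for the example and do not correspond to any real database schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    """A local (positional) descriptor: a property of one segment of the chain."""
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive
    kind: str
    note: str


@dataclass
class Record:
    """A database record: raw data plus global and local annotations."""
    identifier: str
    sequence: str       # the raw data
    global_notes: dict  # global descriptors, e.g. function
    features: list      # local descriptors, e.g. domains

    def features_at(self, position):
        """Return every feature whose segment covers a 1-based position."""
        return [f for f in self.features if f.start <= position <= f.end]

    def segment(self, feature):
        """Return the part of the sequence that a feature annotates."""
        return self.sequence[feature.start - 1:feature.end]


# Hypothetical record: the sequence and both features are invented.
record = Record(
    identifier="demo_1",
    sequence="MRNGGTACDEFGHIKLMNPQ",
    global_notes={"function": "protease (hypothetical)"},
    features=[
        Feature(1, 6, "DOMAIN", "made-up domain"),
        Feature(4, 12, "HELIX", "made-up alpha-helix"),
    ],
)

print(record.segment(record.features[0]))
print([f.kind for f in record.features_at(5)])
```

Segments may overlap freely, which is why `features_at` returns a list rather than a single feature.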
Bioinformatics – the simple view 3

Diagram: Raw sequence, Sequence assembly, DATABASE (annotated sequence), Visualization (Picture view, Text view)

Sequence codes

• One-letter codes are used
• Amino acids: 20-letter alphabet
• Nucleotides: 4-letter alphabet (with either T (DNA) or U (RNA))

Sequence formats

• Simple (“FASTA”) format
>name
ACAAGTTG
• Multiple “Concatenated FASTA”
>name1
ACAAGTTG
>name2
ACAAGTTG
>name3
ACAAGTTG

PROTEIN SEQUENCE ANNOTATED WITH DOMAINS

Diagram: Domain A, Domain B
TABULAR DESCRIPTION: FEATURE TABLE, PTT TABLE
001-200 DOMAIN PROTEASE A
205-230 DOMAIN TRANSMEMBRANE
250-350 DOMAIN SIGNAL BINDING

Sanjar Hudaiberdiev

GENOME SEQUENCE ANNOTATED WITH GENES

Diagram: Sequence view of a genome, Gene 1, Gene 2
Genome annotation: .ptt table

Primary DNA Sequence Databases

• Original submission by experimentalists
• Content controlled by the submitter

Primary Protein Sequence Databases

Protein knowledgebase consists of two sections:
• Swiss-Prot, manually annotated, reviewed.
• TrEMBL, automatically annotated, not reviewed.

Derivative Databases

Built from primary data
Primary databases:
• Submission by experimentalists
• Controlled by the submitter
• Akin to the primary research literature
RefSeq:
• Non-redundant
• Richly annotated
• DNA, RNA, protein
• Diverse taxa
• Akin to the review literature

Derivative Databases

Protein domains, motifs, families
• Protein domains/families represented as alignments and HMMs
• Derived primarily from UniProtKB and GenPept
• Manually curated models for several hundred protein domains
• Derived from proteins from completely sequenced genomes

Derivative Databases

Protein domains, motifs, families
• Protein motifs/domains represented as Patterns and/or HMMs
• Both derived from UniProtKB/Swiss-Prot
• Patterns are for highly conserved short regions. Example: R-P-C-x(11)-C-V-S
• HMMs are for less conserved longer regions.
• Often there will be pattern(s) and an HMM for one domain.
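
The example pattern R-P-C-x(11)-C-V-S reads as: R, P, C, then any 11 residues, then C, V, S. The notation is the one used by PROSITE (the slides do not name the database). A minimal sketch of how such a pattern might be matched is to translate it into a regular expression. The function names `pattern_to_regex` and `find_matches` and the test sequence below are hypothetical, and only a subset of the notation is handled; this is not how any particular database search tool is implemented.

```python
import re

# One pattern element: a residue, x (any residue), [ABC] (one of) or
# {ABC} (none of), optionally followed by a repeat count (n) or (n,m).
ELEMENT = re.compile(r"([A-Zx]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?")


def pattern_to_regex(pattern):
    """Translate a subset of the dash-separated pattern notation to a regex."""
    parts = []
    for element in pattern.strip(".").split("-"):
        m = ELEMENT.fullmatch(element)
        if m is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, low, high = m.groups()
        if core == "x":
            token = "."                       # any residue
        elif core.startswith("{"):
            token = "[^" + core[1:-1] + "]"   # any residue except these
        else:
            token = core                      # literal residue or [ABC]
        if low is not None:
            token += "{" + low + ("," + high if high else "") + "}"
        parts.append(token)
    return "".join(parts)


def find_matches(pattern, sequence):
    """Return (start, end) of each non-overlapping match, 1-based inclusive."""
    regex = re.compile(pattern_to_regex(pattern))
    return [(m.start() + 1, m.end()) for m in regex.finditer(sequence.upper())]


# Invented test sequence containing one occurrence of the motif.
demo = "MK" + "RPC" + "A" * 11 + "CVS" + "LL"
print(pattern_to_regex("R-P-C-x(11)-C-V-S"))
print(find_matches("R-P-C-x(11)-C-V-S", demo))
```

A regular expression is enough here because a pattern describes a short, highly conserved region exactly; the less conserved, longer regions mentioned above need a probabilistic model such as an HMM instead.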
Diagram: HMM matches, Pattern matches

Database Access

Each database must have software to enable searching
• Either by text term against annotation
• And/or by data comparison

Database inquiry by text search: Sequence Retrieval System – SRS
• Searching annotation by text match
• Implemented in many places
• Follows links between databases
• Can allow in situ analysis of matches

Database Access

Database inquiry by data comparison:
Sequence databases
• BLAST, for sequence databases, DNA or protein
• Implemented in very many places
• Most notably at the NCBI
Protein domain, motif, family databases
• Each database has a customised search tool
• To search all databases is a lot of work!

Database Access

• InterPro is a consortium of member databases
• InterPro defines protein families, domains, regions, repeats and sites according to matches against member databases
• InterPro enables any subset of member databases to be searched together

The End

Bioinformatics – the simple view 3

Diagram: Raw sequence, Sequence assembly, DATABASE (annotated sequence), Visualization (Picture for your paper)