EMBOSS

"The European Molecular Biology Open Software Suite"
EMBOSS
• Open Source software
• Over 150 individual programs
– Sequence alignment
– Rapid database searching
– Protein motif identification
– Nucleotide sequence pattern analysis
– Codon usage analysis
– Identification of sequence patterns
– And much more…
• EMBOSS was initiated as a European
project when GCG (an American analysis
package) became commercial.
• They both provide roughly the same
services:
http://helix.nih.gov/apps/bioinfo/embossgcg.html
Advantages
• It is free
• It runs on practically every UNIX-based system (including
Linux and Mac OS X). A Windows version is also available
from the CSC website
• Free of arbitrary size limits
• Can be used from most programming
environments
• Programs of EMBOSS package can be combined and
piped together in countless ways
• Extremely stable
• Most useful at the UNIX command prompt, but
GUIs are also available
http://emboss.sourceforge.net/docs/emboss_tutorial/emboss_tutorial.html
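As an illustration of combining programs, here is a minimal sketch that builds a two-step pipeline and optionally runs it. It assumes the EMBOSS programs seqret and revseq and the global qualifiers -stdout, -auto and -filter; the input file name example.fasta and the helper names are hypothetical. Check the qualifiers against your installed EMBOSS version before relying on them.

```python
import shlex
import shutil
import subprocess


def build_pipeline(usa):
    """Build a shell pipeline: seqret reads `usa`, revseq reverse-complements it."""
    # -stdout and -auto ask seqret to write to standard output without prompting.
    first = ["seqret", usa, "-stdout", "-auto"]
    # -filter asks revseq to read standard input and write standard output.
    second = ["revseq", "-filter"]
    return " | ".join(shlex.join(cmd) for cmd in (first, second))


def run_pipeline(usa):
    """Run the pipeline only if both programs are found on PATH."""
    if shutil.which("seqret") is None or shutil.which("revseq") is None:
        return None  # EMBOSS does not appear to be installed
    return subprocess.run(build_pipeline(usa), shell=True,
                          capture_output=True, text=True)


# Hypothetical input file, given in "format::file" form.
pipeline = build_pipeline("fasta::example.fasta")
print(pipeline)
```

Building the command as a list and quoting it with shlex.join keeps file names with unusual characters from being misread by the shell.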
Programs are grouped
• Alignment
• Display
• Edit
• Enzyme kinetics
• Feature tables
• Information
• Nucleic
• Phylogeny
• Protein
• Utils
• The EMBOSS website has a comprehensive list of
programs
• Another list of EMBOSS programs can be found at
http://www.csc.fi/english/research/sciences/bioscience/programs/emboss/index_html
EMBOSS command syntax
• Follows normal UNIX syntax
• Uniform Sequence Addresses
– (=> USA syntax…nothing to do with the USA ;)
• Sequence format
– Multiple formats supported
• Alignment formats
• Feature formats
• Report formats
USA syntax
• ”format::file”
• ”format::file:entry”
• ”dbname:entry”
• ”@listfile” (a file of file-names)
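To make the four forms above concrete, the following sketch splits a USA string into its parts. It is not EMBOSS's own parser: the USA tuple and parse_usa function are hypothetical names, only the four forms listed here are handled, and a bare "name:entry" is always treated as a database reference, although a real installation can only tell a database from a file by checking its list of configured databases.

```python
from typing import NamedTuple, Optional


class USA(NamedTuple):
    kind: str               # "listfile", "file" or "database"
    format: Optional[str]   # sequence format, if one was given
    source: str             # file name, list file name or database name
    entry: Optional[str]    # entry name, if one was given


def parse_usa(usa: str) -> USA:
    """Split a USA string into its components (simplified sketch)."""
    if usa.startswith("@"):
        # "@listfile": a file that contains further file names
        return USA("listfile", None, usa[1:], None)
    if "::" in usa:
        # "format::file" or "format::file:entry"
        fmt, rest = usa.split("::", 1)
        source, _, entry = rest.partition(":")
        return USA("file", fmt, source, entry or None)
    source, sep, entry = usa.partition(":")
    if sep:
        # "dbname:entry" (assumption: the name before ":" is a database)
        return USA("database", None, source, entry)
    # plain file name with no format or entry
    return USA("file", None, usa, None)


for example in ("fasta::seqs.fa", "fasta::seqs.fa:entry1",
                "embl:x65923", "@mylist"):
    print(example, "->", parse_usa(example))
```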
Sequence Formats I
• There are at least a couple of dozen different
formats
• ”Nearly every collection of sequences that calls
itself a database has stored its data in its own
format”
• IDs and Accessions
– Most databases have both
– The ID was originally intended to be human-readable, but this
no longer works: there are far too many sequences to be
named by humans
– Accession numbers are unique identifiers intended more for
computer (= automated) use
Sequence Formats II
• Annotation and Features
– Every format has some line or field for holding annotation about
the sequence in question
• The Sequence
– Sequences are usually held in the IUPAC standard one-letter
codes
• Sequence Database Formats
– EMBL
– GenBank
– SwissProt
– PIR
• Formats supported by EMBOSS can be seen at
http://emboss.sourceforge.net/docs/themes/SequenceFormats.html
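The IUPAC one-letter codes mentioned above can be checked with a few lines of code. This sketch covers nucleotide sequences only: the four DNA bases, U for RNA, and the standard ambiguity codes. The function name is hypothetical, and gap characters are deliberately reported as non-IUPAC.

```python
# IUPAC nucleotide codes: bases (A, C, G, T, U) plus ambiguity codes
# R, Y, S, W, K, M (two bases), B, D, H, V (three bases) and N (any base).
IUPAC_NUCLEOTIDE = set("ACGTURYSWKMBDHVN")


def non_iupac_positions(sequence):
    """Return (index, character) pairs that are not IUPAC nucleotide codes."""
    return [(i, ch) for i, ch in enumerate(sequence)
            if ch.upper() not in IUPAC_NUCLEOTIDE]


print(non_iupac_positions("ACGTNXacgt"))
```

Upper- and lower-case letters are treated alike, since sequence files use both.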