Genome sequencing and assembly

Complete genome sequence of Genus/species
Author list
Institution list
Corresponding author: Robert Edwards, [email protected]
Keywords
5-6 key words that describe the organism and/or your findings
Abstract
You should write a 150-200 word abstract that describes what you have found and why it
is interesting.
Abbreviations
EMBL: European Molecular Biology Laboratory
NCBI: National Center for Biotechnology Information (Bethesda, MD, USA)
RDP: Ribosomal Database Project (East Lansing, MI, USA)
Introduction
In the introduction, you should provide additional information concerning the
background, purpose, and overall approach of what was done. You should describe the
sequencing (in general terms) as well as the annotation and analysis.
Organism information
In this section, we will describe the organism. The first thing that you will do is identify
the 16S rRNA genes from the organism, and build a phylogenetic tree that includes the
5-10 most closely related organisms.
You should use that information to generate a Genus and Species name for the strain that
you have sequenced, and then use that name throughout the paper.
This is an example figure and figure legend. You should make something that looks like
this:
Figure 1. Phylogenetic tree highlighting the position of Sphingomonas wittichii strain
RW1 relative to other type and non-type strains within the Sphingomonadaceae. Strains
shown are those within the Sphingomonadaceae having corresponding NCBI genome
project IDs listed within [REF]. The strains and their corresponding GenBank accession
numbers (and, when applicable, draft sequence coordinates) for 16S rRNA genes are
(type=T): N. aromaticivorans strain SMCC F199T, U20756; Erythrobacter sp. strain
NAP1, AAMW01000002.1:1127089-1128582; E. litoralis strain HTCC2594, CP000157;
Sphingomonas sp. strain SKA58, AAQG01000001.1:1-836; S. wittichii strain RW1T,
AB021492; and Z. mobilis strain ATCC 31821, AF281031. The tree uses sequences
aligned by the RDP aligner, and uses the Jukes-Cantor corrected distance model to
construct a distance matrix based on alignment model positions without the use of
alignment inserts, and uses a minimum comparable position of 200. The tree is built with
RDP Tree Builder, which uses Weighbor with an alphabet size of 4 and length size of
1000. The building of the tree also involves a bootstrapping process repeated 100 times to
generate a majority consensus tree [REF]. Z. mobilis (AF281031) was used as an outgroup.
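If you want to see what the Jukes-Cantor correction named in the legend does to your own alignment, the following minimal Python sketch may help. It assumes two pre-aligned sequences of equal length and skips any column that contains a gap or an ambiguous base. The function name `jukes_cantor_distance` and that gap handling are illustrative assumptions, not a description of how the RDP tools work internally.

```python
import math


def jukes_cantor_distance(seq_a, seq_b):
    """Jukes-Cantor corrected distance between two aligned sequences.

    Columns where either sequence has a gap or an ambiguous base are
    skipped (an assumption made for this sketch).
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")

    # Treat RNA and DNA alike by mapping U to T.
    seq_a = seq_a.upper().replace("U", "T")
    seq_b = seq_b.upper().replace("U", "T")

    compared = 0
    mismatches = 0
    for a, b in zip(seq_a, seq_b):
        if a not in "ACGT" or b not in "ACGT":
            continue  # skip gaps and ambiguity codes
        compared += 1
        if a != b:
            mismatches += 1

    if compared == 0:
        raise ValueError("no comparable positions")

    p = mismatches / compared  # observed proportion of differing sites
    if p >= 0.75:
        return math.inf  # the correction is undefined at saturation
    # d = -3/4 * ln(1 - 4/3 * p)
    return -0.75 * math.log(1.0 - (4.0 / 3.0) * p)
```

Calling `jukes_cantor_distance("ACGT-ACGT", "ACGA-ACGT")` compares eight positions, finds one mismatch, and returns a distance slightly larger than the raw mismatch proportion of 0.125, because the correction accounts for multiple substitutions at the same site.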
Table 1. Classification and general features of your strain.
You should fill this table in to the best of your ability, and you should add evidence codes
(either IDA, TAS, or NAS; see the website) as applicable. You may need to consult with
Dr. Edwards or Dr. Dinsdale to complete this table!
MIGS ID     Property                      Term       Evidence code (a)
            Current classification        Domain
                                          Phylum
                                          Class
                                          Order
                                          Family
                                          Genus
                                          Species
            Gram stain
            Cell shape
            Motility
            Sporulation
            Temperature range
            Optimum temperature
            Carbon source
            Energy source
            Terminal electron receptor
MIGS-6      Habitat
MIGS-6.3    Salinity
MIGS-22     Oxygen
MIGS-15     Biotic relationship
MIGS-14     Pathogenicity
MIGS-4      Geographic location
MIGS-5      Sample collection time
MIGS-4.1    Latitude
MIGS-4.2    Longitude
MIGS-4.3    Depth
MIGS-4.4    Altitude
Genome sequencing information
Genome project history
You should write a brief summary of the genome sequencing approach, including how
the bacteria were found, and how the sequencing was done. You may be able to fill in
some of these sentences, or you may not be able to.
The genome was selected because…
The genome sequence was completed in … and presented for public access on ….
Finishing was done using ….
Annotation was performed using … and….
A summary of the project information is shown in Table 2.
Table 2. Project information
MIGS ID      Property                   Term
MIGS-31      Finishing quality          (e.g., improved-high-quality draft)
MIGS-28      Libraries used             (…)
MIGS-29      Sequencing platforms       (e.g., 454, Sanger)
MIGS-31.2    Fold coverage              (e.g., 14.5 x)
MIGS-30      Assemblers                 (e.g., Arachne)
MIGS-32      Gene calling method        (e.g., Glimmer)
             Genome Database release    (…)
             GenBank ID
             GenBank Date of Release
             GOLD ID
             Project relevance          (e.g., biotechnological, pathway)
Growth conditions and DNA isolation
You should discuss how the organism was grown and how the DNA was isolated.
Genome sequencing and assembly
You should discuss general aspects of library construction and sequencing. Types of
technologies and methods that could be mentioned include 454 pyrosequencing reads,
Newbler assembler (Roche 454), size of overlapping fragments, q-scores.
This is an example of what you might write: This microbial genome was curated to
close as many gaps as possible. Each base pair has a minimum q (quality) score of xxx.
The genome of <your strain name> was sequenced at the San Diego State
University sequencing center. The error rate of the completed genome sequence is less
than xxx in 50000. On average, there is a xxx-fold coverage of the genome, although
there are some areas with higher coverage.
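To fill in the xxx placeholders above, it can help to convert quality scores and read totals into numbers yourself. The Python sketch below assumes that your q scores are Phred-scaled (Q = -10 log10 P, where P is the probability that a base call is wrong) and estimates average fold coverage as total sequenced bases divided by genome size. The function names and the example values in the usage note are hypothetical.

```python
def phred_to_error_probability(q):
    """Probability that a base call with Phred quality q is wrong."""
    return 10 ** (-q / 10)


def expected_errors(q, n_bases=50000):
    """Expected number of wrong bases among n_bases, if every base had quality q.

    This is a simplification: real assemblies have a range of qualities.
    """
    return phred_to_error_probability(q) * n_bases


def fold_coverage(read_lengths, genome_size):
    """Average fold coverage: total bases sequenced / genome size in bp."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return sum(read_lengths) / genome_size
```

For instance, a minimum quality of 40 corresponds to an error probability of 1 in 10,000, so `expected_errors(40)` is about 5 errors per 50000 bases, and 200,000 reads of 400 bp over a 4 Mb genome give `fold_coverage([400] * 200000, 4000000)`, which is 20-fold coverage.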
Genome annotation
You should describe the genome annotation approach (software and/or databases used –
with citations).
For example: Protein-encoding genes were identified using the Rapid Annotation using
Subsystem Technology (RAST) ORF caller [REF] as part of the RAST genome annotation
pipeline at Argonne National Laboratory, Argonne, IL, USA, and were compared to ORFs
identified using GLIMMER version 3.1 [REF]. The predicted protein-encoding genes
were translated and used to search the National Center for Biotechnology Information
(NCBI) non-redundant database, UniProt, TIGRFam, Pfam, PRIAM, KEGG, COG,
InterPro, and the RAST non-redundant databases. The tRNAscan-SE tool [REF] was used
to find tRNA genes, and ribosomal RNAs were found by searching [using what
software??] against the Greengenes ribosomal database [REF]. The RNA components of
the protein secretion complex and the RNaseP were identified by searching the genome
for the corresponding Rfam profiles using INFERNAL (http://infernal.janelia.org/).
Additional gene prediction analysis and manual functional annotation were performed
within the SEED platform developed at Argonne National Laboratory [REF].
Genome properties
Here you should describe what you find in the genome.
For example: The genome has a single, circular chromosome of XXX bp with an average
GC content of XX.X%. The genome also includes two plasmids, for a total size of XXX
bp. The chromosome contains XXX predicted genes, XXX of which are protein-encoding
genes. XXX of the protein-coding genes were assigned a putative function, with
the remaining genes annotated as hypothetical proteins, and YYY of the protein-encoding
genes have high-quality annotations because they are in subsystems. XXX protein-coding
genes belong to XXX paralogous families in this genome, corresponding to a gene content
redundancy of XX.X%. The properties and statistics of the genome are summarized
in Tables 3-5.
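The size and GC content of each replicon can be computed directly from your assembled sequences. The Python sketch below is one way to do that under simple assumptions: the sequences are in FASTA format, each record is one replicon, and ambiguous bases such as N are left out of the GC calculation. The function names and the record names in the usage note are hypothetical.

```python
def read_fasta(lines):
    """Parse FASTA-formatted lines into a dict of {record name: sequence}."""
    records = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first word of the header as the record name.
            words = line[1:].split()
            name = words[0] if words else "record_%d" % (len(records) + 1)
            records[name] = []
        elif name is not None:
            records[name].append(line.upper())
    return {n: "".join(parts) for n, parts in records.items()}


def gc_percent(seq):
    """Percent G+C among unambiguous bases (A, C, G, T) in seq."""
    seq = seq.upper()
    counted = sum(seq.count(base) for base in "ACGT")
    if counted == 0:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / counted
```

A typical use is to open your assembly file and loop over the replicons:

```python
import io

# A tiny in-memory example; replace with open("your_genome.fasta") for real data.
example = io.StringIO(">chromosome_1 demo\nATGCGC\nGGAT\n>plasmid_1\nATAT\n")
for name, seq in read_fasta(example).items():
    print(name, len(seq), "bp,", round(gc_percent(seq), 1), "% GC")
```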
Table 3. Summary of genome: one chromosome and two plasmids
An INSDC identifier is typically a GenBank or EMBL or DDBJ accession number. The
last column can be used to refer to another identifier scheme (e.g. RAST) depending on
what identifier aided the process of data retrieval.
Label           Size (Mb)    Topology    INSDC identifier    RAST ID
Chromosome 1
Plasmid 1
Plasmid 2
Table 4. Nucleotide content and gene count levels of the genome
You should add other rows to this table, depending on the analysis that you have done.
                                    Genome (total)
Attribute                           Value    % of total (a)
Size (bp)
G+C content (bp)
Coding region (bp)
Total genes (b)
RNA genes
Protein-coding genes
Genes in paralog clusters
Genes assigned to COGs
1 or more conserved domains
2 or more conserved domains
3 or more conserved domains
4 or more conserved domains
Genes with signal peptides
Genes with transmembrane helices
Paralogous groups
a) The total is based on either the size of the genome in base pairs or the total number of
protein coding genes in the annotated genome.
b) Also includes 54 pseudogenes and 5 other genes.
Table 5. Number of genes associated with the subsystem hierarchies.
Description                                            Number    % of total (a)
Amino Acids and Derivatives
Arabinose Sensor and transport module
Carbohydrates
Cell Division and Cell Cycle
Cell Wall and Capsule
Clustering-based subsystems
Cofactors, Vitamins, Prosthetic Groups, Pigments
DNA Metabolism
Dormancy and Sporulation
Fatty Acids, Lipids, and Isoprenoids
Iron acquisition and metabolism
Membrane Transport
Metabolism of Aromatic Compounds
Miscellaneous
Motility and Chemotaxis
Nitrogen Metabolism
Nucleosides and Nucleotides
Phages, Prophages, Transposable elements, Plasmids
Phosphorus Metabolism
Photosynthesis
Plasmids
Potassium metabolism
Protein Metabolism
RNA Metabolism
Regulation and Cell signaling
Respiration
Secondary Metabolism
Stress Response
Sulfur Metabolism
Virulence, Disease and Defense
General function prediction only
Function unknown
Not in Subsystems
a) The total is based on the total number of protein coding genes in the annotated
genome.
Additional Information
You should add other information relevant to this genome. Two examples include:
Profiles of metabolic network and pathways. For instance, how many genes can be
associated with a metabolic network? How many enzymes, enzymatic reactions,
metabolic pathways and metabolites may be found in the genome? A diagram of
interacting cellular components (e.g., amino acids, carbohydrates, proteins, purines,
cofactors, tRNAs, etc)? You can extract most or all of this information from RAST if you
check the box to have a metabolic model prediction run.
Comparisons with other fully sequenced genomes. For instance, how do the genome
properties compare to other members within the taxonomic family and the overall set of
fully sequenced bacterial and archaeal genomes? How do the genome properties
compare to other organisms from a similar environment?
Conclusion
Write some conclusions about your study.
Associated MIGS Record
Fill in the following table which will become an electronic record of MIGS compliance
associated with this paper. Values left blank will be filled in with the text “not reported”,
so fill in as much as you can!
Table S1. Associated MIGS record
MIGS-ID    field name                                    description
MIGS-1     Submit to INSDC/Trace archives
1.1        PID                                           Not reported
1.2        Trace Archive
MIGS-2     MIGS CHECK LIST TYPE
MIGS-3     Project Name
MIGS-4     Geographic Location
4.1        Latitude
4.2        Longitude
4.3        Depth
4.4        Altitude
MIGS-5     Time of Sample collection
MIGS-6     Habitat (EnvO)
6.1        temperature
6.2        pH
6.3        salinity
6.4        chlorophyll
6.5        conductivity
6.6        light intensity
6.7        dissolved organic carbon (DOC)
6.8        current
6.9        atmospheric data
6.10       density
6.11       alkalinity
6.12       dissolved oxygen
6.13       particulate organic carbon (POC)
6.14       phosphate
6.15       nitrate
6.16       sulfates
6.17       sulfides
6.18       primary production
MIGS-7     Subspecific genetic lineage
MIGS-9     Number of replicons
MIGS-10    Extrachromosomal elements
MIGS-11    Estimated Size
MIGS-12    Reference for biomaterial or Genome report
MIGS-13    Source material identifiers
MIGS-14    Known Pathogenicity
MIGS-15    Biotic Relationship
MIGS-16    Specific Host
MIGS-17    Host specificity or range (taxid)
MIGS-18    Health status of Host
MIGS-19    Trophic Level
MIGS-22    Relationship to Oxygen
MIGS-23    Isolation and Growth conditions
MIGS-27    Nucleic acid preparation
MIGS-28    Library construction                          Emulsion PCR
28.1       Library size
28.2       Number of reads
28.3       vector                                        N/A
MIGS-29    Sequencing method                             Roche Pyrosequencing
MIGS-30    Assembly
30.1       Assembly method                               Newbler
30.2       estimated error rate                          Not reported
30.3       method of calculation                         Not reported
MIGS-31    Finishing strategy
31.1       Status
31.2       coverage
31.3       contigs
MIGS-32    Relevant SOPs
MIGS-33    Relevant e-resources                          None
References
Please add references here using this format. I strongly recommend that you use Zotero
to organize and insert your references.