
training_modules Documentation
Release 1
Marc B Rossello
Jul 26, 2017
Contents
1 Interactive Submissions
1.1 Module 1: Submission Options
1.2 Module 2: Create a Project
1.3 Module 3: Register Source Samples
1.4 Module 4: Add Read files
1.5 Module 5: Updates (Samples and Projects)
2 Programmatic Submissions
2.1 Module 1: Create a Study
2.2 Module 2: Submit an Annotated Sequence
2.3 Module 3: Flat File upload - Submit an ENA Supported Sequence File
2.4 Module 4: Update a Study using REST API
2.5 Module 5: Submitting Sample objects
2.6 Module 6: Updating Sample objects using REST API
3 Tips and FAQs
3.1 Solving Error Notifications (runs)
3.2 Preparing a file for Upload
3.3 Taxonomic classifications for your samples
CHAPTER 1
Interactive Submissions
Module 1: Submission Options
The majority of submissions to the ENA begin here.
1. Log in and open the “new submission” tab.
2. If you have not already, create a study using this option. Complete this step BEFORE going on to step 3. Module
2 describes this step in more detail.
3. If you have not already, create sample objects to represent your source material. Complete this step before going
on to step 4. Module 3 describes this step in more detail.
4. You are nearly ready to register your NGS read files. First upload them to your ENA FTP directory (every
account has one). The Java applet offered here does not work in all environments. See here for alternative
upload methods.
5. This step combines multiple steps from above but it is preferable to split the job up (so that you have already
registered a study and some samples). Use this step to create Runs and experiments. Module 4 describes this
step in more detail. This step will link everything together under the project:
Experiment and run objects associate read files to their source sample and a study.
Module 2: Create a Project
This form is used to create a study object (see module 1 to access this form). It is possible to create a study before any
other data is added. Webin will report an accession id for the study that will look like this: PRJEB00000. This type
of accession is typically used in journal publications. Data can be added to the study at any time – the location of the
study in the ENA browser will stay the same. The study will not be visible in the ENA browser until the release date
has expired (*). This means that the data linked to the study will not be visible until the study itself is visible. Have a
look at an example of a study in the ENA browser.
Module 3: Register Source Samples
Part 1
This is the first (of 3) sample registration forms. See module 1 if you do not know where to access this form.
1. Find a checklist that suits your type of sample. A checklist comprises a list of attributes that are required to
annotate your samples. A well-annotated sample is more searchable in the ENA browser, so your data will get
more exposure.
2. Move on to the next sample step.
3. Use this option if you have created your samples as a spreadsheet file in a previous session. This spreadsheet
has a very specific format. You can obtain one in the next sample step.
Part 2
This is the second (of 3) sample registration forms.
1. Take a look at the list of attributes on the left. Some will be mandatory, others are recommended. Every checked
item in the list appears as a field on the right side of the form. Please select or deselect as appropriate. Remember
that the more fields you can provide the more you are enabling your users to make accurate interpretations of
your study.
2. You can create additional attributes that do not exist in the checklist. However, in most cases you should find
what you need among the default checklist fields.
3. The right side of the web form represents all samples as a template. Because this form represents all samples it is
only worth entering fields that are consistent with all the samples. Also use the web form to look up taxonomic
classifications which you will use later. Start typing your organism name to see the suggestions. Note that
environmental taxonomic classifications can look like “soil metagenome” as opposed to a specific organism
scientific name. You can also use the ‘i’ symbols to read definitions for each field, as well as checking the drop
down options for the fields that have a controlled vocabulary (*).
4. The download template button will download a tab-separated file which you can open using a spreadsheet
program. It is highly recommended to use this to register your samples. Each row represents an individual
sample. Please do not edit or remove the lines marked with a hash ‘#’ and do not change the order of the columns,
as this will impede the re-upload of the spreadsheet into the web form. Begin the first sample on the first row
available.
5. Step 5 is in parentheses because in most cases you fill in the spreadsheet offline and log in again after you have
completed it. The completed spreadsheet is loaded into the previous sample registration form (Part 1 step 3) and
this has the same effect as the ‘next’ button: it takes you to the third and final sample registration form.
Part 3
This is the final (of 3) sample registration forms. This form appears after uploading a spreadsheet into the form in Part
1 step 3, or directly from the form in Part 2 if you have not used a spreadsheet file and intend to type directly into
the webform.
1. If you have uploaded a spreadsheet file, the number of rows corresponds to the number of samples (you can skip
this step). If you have not used a spreadsheet you can specify how many samples to create using the template
you created in the second form.
2. Add some basic sample group details. A sample group has limited functionality. It is a collection of samples
that are created in the same submission event. The samples can be edited as part of the same group if necessary
later on. It is not possible to move samples in or out of a group. The study object is used to group samples and
other objects together in the public domain.
3. The samples are loaded into the webform below these 2 buttons. You can check each one in the list by using
these buttons to navigate one sample at a time.
4. Check if any fields are not accepted by the webform (where you see a red exclamation mark). Your values may
not be valid because some fields are controlled.
5. This table is a summary of all samples. It can be large but you can move through the pages using the arrows
(red asterisk in image). If all fields in a sample are accepted by the webform you will see a green tick under the
‘Valid’ column. If there are any red crosses, navigate to the sample in question (or click on that row in the table)
and go back to step 4 to correct the invalid fields. If it is easier to correct the samples in your offline spreadsheet,
do so and use the ‘previous’ button (red $ in image) two times to go back to the first form where you will see a
red cross symbol next to the file name. Click on the cross and you will be able to re load the spreadsheet file.
6. Click submit if all samples in the table are valid (previous step). Webin will deliver accessions for each
sample unless there is some problem/error. If there is an error you can go back to step 5 to correct the errors and
then try again. If accessions are delivered, the samples are now in the ENA database. They will not be affiliated
with any data or other objects. That happens in subsequent rounds of submissions. For the moment they are
‘free’.
Module 4: Add Read files
Part 1
This is the first page that you will come to when submitting runs and experiments (see module 1). A run object is used
to register a demultiplexed NGS read file (or pair of files), for example FASTQ, BAM, SFF or CRAM, that you have
uploaded to your ENA FTP directory. Without run objects the files cannot be registered and archived. An experiment
object represents a library solution used on the NGS machine. The experiment object will also link the run to the
sample, and to the study.
1. Select the study that you will be adding the runs and experiments to. If the study that you want to submit to does
not exist yet you can create one now (red asterisk). However, it is best to split your submission up and create the
study as part of an earlier session (see module 1).
2. Click next to move to the next stage, which is the sample generation stage. In most cases the samples
will have already been generated (it is best to submit the samples in a separate submission so that the work is
more divided). Find the ‘skip’ option to skip this step. If the samples do not exist, do not use the skip option;
you can create some samples during this step (see module 1 and module 3).
Part 2
This is the step for registering the files that you have uploaded to your personal ENA ftp directory. We need to wrap
each file or pair of files into a run object, point that run to an experiment object, and point that experiment object to
the correct sample.
1. Choose the type of file that you are submitting. Note that in the case of paired runs there is a 2 x fastq file option.
2. Any information you type into the webform will be lost if you log out before submitting. You are therefore strongly
advised to download a tab-separated spreadsheet file (step 6) and fill it in offline. First, note that some
fields have drop-down lists. Check the options in these so that you can apply them correctly in the spreadsheet
when offline.
3. Every row in the table represents one run and one experiment object, and each row needs a source sample. The
drop-down for the sample column does not work in most cases, so you should know how you have named your samples, or
you can check them in the sample tab (*). It is possible to give multiple runs the same source by repeating the
sample id in multiple rows (for instance, in the case of a deep coverage experiment where multiple lanes have
been used).
4. The file names correspond to the files that you have uploaded to your Webin ftp directory. Each run object gets
matched with files which are separately uploaded. Here is a list of ways you can upload your files. File names
should be written exactly as they appear in the ftp directory. For instance, FastQ files must be compressed and
so will carry the extension “.gz” or similar. The extension should be included when referencing the files in this
column.
5. The checksum is a fingerprint for the file. If a file is not fully transferred we are left with a corrupted or
truncated version which is of no use, so we need to check that this has not happened. We calculate the checksum
of the uploaded file and compare it with the one that you provide, which must therefore be calculated before the
file is uploaded. If the two differ we know there has been a problem. You can paste the checksum directly into
this column. It will be a 32 character string. You can also put the 32 character string into its own file and
upload this checksum file with the original file. The checksum file has to be named so that it can be recognised:
it needs to have the same name as the original file PLUS the extension “.md5” (so for file XXX the md5 checksum
should be in file XXX.md5). If you have uploaded a checksum file for each read file then you can leave this column
blank. Do not write the checksum file name (XXX.md5) into the field: Webin will report an error saying that it
is expecting a 32 character string. The Webin uploader tool automatically deposits a checksum file in your ftp
directory for every file that you upload, so if you have used this tool leave the column blank. The Webin uploader
tool uses Java applet technology, which is being reduced or discontinued in browsers due to security
risks, so the uploader tool may not be an option in your environment. So how do you create your
own checksum file? On a Linux machine it is easy: simply type (without the quotes, at the command line)
“md5sum <file name>” and it will display a line formatted like this: <32 character md5sum><2 spaces><file
name>. This is exactly the format our system will recognise in a checksum file, so simply redirect
(using the ‘>’ symbol) the output to a checksum file: “md5sum file_name > file_name.md5”. Then upload this file
along with the original one before you reach step 8. Apple Mac operating systems have a similar checksum
generator that you can use. It is also possible on a Windows operating system but you may have to download
3rd party software to do it. More info here.
6. Any information that you type into the webform will be lost if you log out before submitting. Therefore you
should download a template tab delimited spreadsheet file which you can open in a spreadsheet program like
MS Excel. Once you have filled it in offline, log back in, return to this submission page and upload it (step
7). The other advantage of having an offline copy of your experiments and runs is that if there is a problem
submitting the data you can send the spreadsheet file to ENA helpdesk and they can troubleshoot it for you.
7. Upload the completed spreadsheet file that you created in step 6. The web form should fill up with the data in
your spreadsheet. You can do a preliminary check to see if some fields are not recognised (check the controlled
drop down lists and that file names appear as expected).
8. This is the final step. If errors are reported you can remove the loaded table (use the cross that has appeared
in step 7), then make your edits to the tsv spreadsheet and try again (from step 7). If you need to send the tsv
spreadsheet to ENA help desk for troubleshooting (as mentioned in step 6), ensure that the project and samples
are already submitted (module 2 and 3) so that the ENA officer can focus only on the step that is failing.
Module 5: Updates (Samples and Projects)
The interactive web based GUI (Webin) has some support for editing existing objects. This module is concerned with
sample and project objects. Access existing objects from the following tabs (after logging into Webin):
Sample Edit
A sample group is an internal concept (do not quote sample group ids in any publications) which groups together
samples for one purpose: so that you can edit them in bulk. The only way to ensure a collection of samples is in the
same group is by submitting them at the same time (during the same submission event). If you need to edit samples in
bulk but they are not in the same sample group you can use the REST API (more details to come).
First choose a sample from the sample tab or a sample group from the sample group tab. Click the ‘edit’ button for
that sample/group. You will come to a screen like this:
1. The left panel is used to select the sample that you would like to edit. Even if you selected a single sample from
the sample tab the whole group will still be displayed.
2. This is another way to select the sample that you would like to edit: you can go through the list one by one.
3. It is not possible to add or remove samples from a group, or to change the associated checklist, but you can
add/remove fields from the previously selected checklist.
4. The right hand panel expands whichever sample you have selected in step 1. You can change the content of the
fields using this panel.
5. These little boxes are clickable. Click on this box to copy the content of the field to all the other samples in the
sample group (for fields that are common to all samples).
6. When you have completed your edits click save.
7. Warning! Although you can download a spreadsheet you cannot yet upload it again so you cannot use this
option to edit samples yet. It can be useful to obtain a spreadsheet similar to the one that you used to submit the
samples in the first place. Editing by tsv spreadsheet should be possible in the future.
Study Edit
Some parts of the study object can be edited. You can change the release date or release the whole study. You can also
edit titles and descriptions, as well as add publications which will become clickable links when the study goes live in
the ENA browser.
1. Log in to Webin and find the studies tab.
2. If you have a long list of studies you can search for one by name or accession. This functionality exists in the
other tabs too.
3. If your study is confidential you can change the release date by clicking on the pencil icon. A calendar will
open so that you can navigate to the required date. To release the study simply select the current date.
Releasing a study will cause all the data associated with that study to be released as well. Upon releasing a study
various stages are set in motion:
• Moving read files and sequence files from our confidential archive to the public servers
• Indexing and rendering the study and its affiliated objects so that they can be linked-to and
visualised in the ENA browser
• Mirroring to INSDC databases, which will then follow similar procedures so the data is searchable
and viewable in their web portals.
These stages are usually complete in a couple of days but please allow several days for busy times or
for times when technical problems are causing the queue of jobs to build up.
4. For edits besides changing the release date, click the edit button next to the study that you need to edit. This will
expand the study into an editable webform.
5. There are various text boxes that you can edit if you need to. The short name for the study will be visible in
search outputs and overview pages whereas the descriptive title and abstract will be viewable when the study
has its own webpage (when the hold date has expired).
6. You can add a publication by clicking the ‘Add’ button (a fresh row will appear) and inserting the pubmed id.
This will result in a hyperlink on the main study page allowing the publication to be linked from the study (when
it is public).
7. Study ‘attributes’ are optional. They act as key words and can help expose the study to more specific searches.
In some cases we will standardise some attributes and index them. These may be related to specific projects
known to ENA and will help filtering and searching. Each key word needs a ‘tag’ which is the name of the field,
and an actual value (called ‘FieldType’). Some submitters add their DOI as a keyword when they do not have a
pubmed id. So the tag is something like ‘DOI’ and the value is the DOI value.
8. Remember to save changes when you are finished!
CHAPTER 2
Programmatic Submissions
Module 1: Create a Study
The Study Object
Objects such as a study or a sample are stored in the ENA in XML form like this:
<?xml version = '1.0' encoding = 'UTF-8'?>
<PROJECT_SET>
    <PROJECT alias="iranensis_wgs" center_name="HKI JENA" accession="PRJEB5932">
        <NAME>WGS Streptomyces iranensis</NAME>
        <TITLE>Whole-genome sequencing of Streptomyces iranensis</TITLE>
        <DESCRIPTION>The genome sequence of Streptomyces iranensis (DSM41954) was obtained using Illumina HiSeq2000. The genome was assembled using a hybrid assembly approach based on Velvet and Newbler. The resulting genome has been annotated with a specific focus on secondary metabolite gene clusters.</DESCRIPTION>
        <SUBMISSION_PROJECT>
            <SEQUENCING_PROJECT>
                <LOCUS_TAG_PREFIX>SIRAN</LOCUS_TAG_PREFIX>
            </SEQUENCING_PROJECT>
            <ORGANISM>
                <TAXON_ID>576784</TAXON_ID>
                <SCIENTIFIC_NAME>Streptomyces iranensis</SCIENTIFIC_NAME>
                <CULTIVAR>DSM41954</CULTIVAR>
            </ORGANISM>
        </SUBMISSION_PROJECT>
        <PROJECT_LINKS>
            <PROJECT_LINK>
                <XREF_LINK>
                    <DB>PUBMED</DB>
                    <ID>25035323</ID>
                </XREF_LINK>
            </PROJECT_LINK>
        </PROJECT_LINKS>
    </PROJECT>
</PROJECT_SET>
Creating objects in XML format is not always necessary. The Webin submission tool can create a project from a
webform. It will convert the form data into XML and load it into the ENA database. However, you will find that
in some cases there is more flexibility in creating submittable XML objects yourself and by-passing the interactive
submission tool. Do consider using the interactive Webin submission tool to create a study and then adding the other
objects programmatically instead. It is fine to mix and match submission routes, and you may find that programmatic
submission is better suited to repetitive submission tasks, which project creation normally is not.
A study (sometimes referred to as a project) in the ENA is used to group other objects together, so we will look into
creating a project/study as a first step towards learning to submit ENA objects programmatically.
Create the XML
Below is a template. Do not keep any of the example values: enter your own information and save it as a file, for
example “project.xml”.
<?xml version = '1.0' encoding = 'UTF-8'?>
<PROJECT_SET>
    <PROJECT alias="cheddar_cheese" center_name="">
        <TITLE>Characterisation of Microbial Diversity and Chemical Properties of Cheddar Cheese Prepared from Heat-treated Milk</TITLE>
        <DESCRIPTION>This study aimed to characterise the interaction of microbial diversity and chemical properties of Cheddar cheese after three different heat treatments of milk</DESCRIPTION>
        <SUBMISSION_PROJECT>
            <SEQUENCING_PROJECT/>
        </SUBMISSION_PROJECT>
    </PROJECT>
</PROJECT_SET>
In your file “project.xml” paste the above XML but change the alias="" value to a unique name. You may need
this unique name to refer to your project when adding other objects to it. It can be a short acronym but it should be
meaningful/memorable in some way (instead of just a number). Also provide a center name in center_name="".
The center name is specific to your Webin account. You chose it when you set up the account. Log in to confirm your
center name. Within the <DESCRIPTION></DESCRIPTION> block add an abstract detailing the project, including
any information that may be useful for someone to interpret your project correctly. Within the <TITLE></TITLE>
block add a descriptive title.
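If several project files have to be prepared, the template can also be generated rather than edited by hand. The following is a minimal sketch using Python's standard xml.etree.ElementTree module. The function name build_project_xml and the argument values in the usage note are assumptions for illustration only; the element layout simply mirrors the template above.

```python
import xml.etree.ElementTree as ET


def build_project_xml(alias, center_name, title, description):
    """Return a PROJECT_SET document (as bytes) that mirrors the template."""
    project_set = ET.Element("PROJECT_SET")
    project = ET.SubElement(
        project_set, "PROJECT", alias=alias, center_name=center_name
    )
    ET.SubElement(project, "TITLE").text = title
    ET.SubElement(project, "DESCRIPTION").text = description
    submission_project = ET.SubElement(project, "SUBMISSION_PROJECT")
    # An empty SEQUENCING_PROJECT element, as in the template.
    ET.SubElement(submission_project, "SEQUENCING_PROJECT")
    # xml_declaration requires Python 3.8 or later.
    return ET.tostring(project_set, encoding="UTF-8", xml_declaration=True)
```

For example, the bytes returned by build_project_xml("my_alias", "MY CENTER", "My title", "My abstract") could be written to “project.xml” with open("project.xml", "wb"). Special characters in the title or abstract (such as & or <) are escaped automatically, which is one reason to prefer a generator over manual editing.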
Create a Submission XML
To register the submission of a project or any other object(s), you need an accompanying submission xml in a separate
file. Let’s call the file “sub.xml” for this purpose.
<?xml version="1.0" encoding="UTF-8"?>
<SUBMISSION alias="cheese" center_name="">
    <ACTIONS>
        <ACTION>
            <ADD source="project.xml" schema="project"/>
        </ACTION>
    </ACTIONS>
</SUBMISSION>
This file simply registers an ‘action’ on the ENA servers. In this case the action is to <ADD/> one or more project
objects using the XML file “project.xml”. Make sure that project.xml and sub.xml are in the same directory on a Linux
file system (Mac/Unix should work too). If you do not want to use the command line, or if you are using a Windows
operating system, it is also possible to register the submission via a web form in any internet browser (more details to come).
Add an alias to the submission XML to mark the submission event, and add your center name as before.
Send the XML files to ENA
cURL is a Linux/Unix command line program which you can use to send the XML files to the ENA server along with
your authentication details.
curl -k -F "SUBMISSION=@sub.xml" -F "PROJECT=@project.xml" "https://www-test.ebi.ac.uk/ena/submit/drop-box/submit/?auth=ENA%20Webin-NNN%20PASSWORD"
From the directory containing the files sub.xml and project.xml, run cURL as above. You must replace
Webin-NNN with your Webin account id and PASSWORD with your account password. The %20 is URL encoding
for a space character. Leave these in place. After running the command above a receipt in XML format is returned. It
will look like the one below (it won’t be line wrapped, but you can copy and paste it or redirect the cURL output to a
separate file).
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="receipt.xsl"?>
<RECEIPT receiptDate="2017-05-09T16:58:08.634+01:00" submissionFile="sub.xml" success="true">
    <PROJECT accession="PRJEB20767" alias="cheddar_cheese" status="PRIVATE" />
    <Submission accession="ERA912529" alias="cheese" />
    <MESSAGES>
        <INFO>This submission is a TEST submission and will be discarded within 24 hours</INFO>
    </MESSAGES>
    <ACTIONS>ADD</ACTIONS>
</RECEIPT>
It is possible to use a browser to register the XML files instead of using cURL at the command line. See here.
Simply use the study row and the submission row to browse and navigate to the project.xml file and the sub.xml file
respectively and then add your Webin account and password in the Username and password fields before clicking
submit. You should receive the receipt in the browser window.
The Receipt XML
Note the info message in the receipt
<INFO>This submission is a TEST submission and will be discarded within 24 hours</INFO>
It is advisable to run your submissions through the ENA test server, where changes are not permanent and are erased
every 24 hours. If you are happy with the result of the submission you can run the cURL command again, but this
time on the production server. Simply change the part in the URL from /www-test.ebi.ac.uk to /www.ebi.ac.uk
and remove the -k flag:
curl -F "SUBMISSION=@sub.xml" -F "PROJECT=@project.xml" "https://www.ebi.ac.uk/ena/submit/drop-box/submit/?auth=ENA%20Webin-NNN%20PASSWORD"
If you are using the webform instead of cURL at the command line you will get the receipt XML displayed in your
browser. Similarly, to submit via webform to the production server, change the part in the webform URL from
/www-test.ebi.ac.uk to /www.ebi.ac.uk.
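The test and production URLs differ only in the host name, and the %20 sequences are simply URL-encoded spaces. A small Python sketch that assembles the URL makes this explicit. The helper name submit_url is hypothetical, and Webin-NNN and PASSWORD remain placeholders exactly as in the commands above.

```python
from urllib.parse import quote

TEST_HOST = "www-test.ebi.ac.uk"
PRODUCTION_HOST = "www.ebi.ac.uk"


def submit_url(webin_id, password, production=False):
    """Build the drop-box submission URL for the test or production server."""
    host = PRODUCTION_HOST if production else TEST_HOST
    # quote() turns each space into %20 and also escapes any special
    # characters that may occur in a password.
    auth = quote(f"ENA {webin_id} {password}", safe="")
    return f"https://{host}/ena/submit/drop-box/submit/?auth={auth}"
```

Keeping the test server as the default means a script only touches the production server when production=True is passed deliberately.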
To know if the submission was successful look in the first line of the <RECEIPT> block. The attribute success will
have value true or value false. If the attribute is false then the submission did not succeed. If this is the case check the
rest of the receipt for error messages and after making corrections, try the submission again. If the success attribute
is true then the submission was successful. The receipt will contain the accession numbers of the objects that you
have submitted. In the case of an ENA study/project this is likely to be the accession that you will be including in a
publication.
<PROJECT accession="PRJEB20767" alias="cheddar_cheese" status="PRIVATE" />
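When submissions are scripted, the receipt can be checked automatically rather than by eye. The following is a minimal sketch using Python's standard library, assuming the receipt has the shape shown earlier in this module. The function name read_receipt is illustrative.

```python
import xml.etree.ElementTree as ET


def read_receipt(receipt_xml):
    """Return (success, objects, messages) extracted from a receipt XML string."""
    root = ET.fromstring(receipt_xml)
    # The success attribute sits on the RECEIPT element itself.
    success = root.get("success") == "true"
    # Every direct child that carries an accession is a submitted object.
    objects = [
        (child.tag, child.get("alias"), child.get("accession"))
        for child in root
        if child.get("accession") is not None
    ]
    # Collect whatever messages are present, keeping their element names.
    messages = [(m.tag, m.text) for m in root.findall("MESSAGES/*")]
    return success, objects, messages
```

If success is False, the messages list is the place to look for the reasons before correcting the XML and trying again.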
Module 2: Submit an Annotated Sequence
Annotated sequences can be any number of sequences that are assembled from shorter reads or sequenced using Sanger
capillary sequencing. They can be annotated with features such as coding domains, introns, exons, non coding RNA
etc. Typical sequences submitted to the ENA are rRNA genes, single CDS genomic DNA sequences, MHC genes,
mRNA and many more. Most submitters will use the interactive Webin submission system to submit these types of
sequences:
The analysis object
This is a guide for programmatic submission of annotated sequences. This submission route is useful for automating
your submissions if you expect to be submitting large numbers of sequences at regular intervals. For one off or small
scale submissions you are encouraged to use Webin instead. The ENA metadata model uses various objects to hold
information and group other objects together. Annotated sequences are wrapped in an analysis object. The other
objects are frequently used in read data submission and whole genome submissions. The analysis object can point
to a study and samples. It is not necessary to register a sample object for an annotated sequence submission, but
you should have a study available before you submit the analysis/annotated sequence. Studies are used to group other
objects together. You may well use the study again in the future to submit additional data types including read data
and whole genomes. A study can package together all elements of a typical publication.
A word about Accession Numbers
Annotated sequences are submitted as TSV spreadsheet files. One analysis object wraps one TSV file, but a TSV file
may contain many sequences (each row = 1 annotated sequence). Templates with predefined columns are available.
A TSV template is specific to a type of sequence so each tsv/analysis can have multiple sequences but they will all
be the same type. For example if you have 10 rRNA genes and 20 single protein coding genes as part of the same
study then you will use 2 different TSV templates, which will be submitted as 2 separate analysis objects, 1 with 10
rows and the other with 20 rows. All 30 rows will be converted into EMBL sequence files and each sequence file will
be accessioned. The analysis objects will be accessioned too but this is for internal ENA tracking. Do not quote an
analysis (ERZxxxxxx) accession when referring to an annotated sequence. Only quote the sequence accessions (as a
range for example, if there are many). You can also quote the study accession (PRJEBxxxx), especially if you have
a collection of data to report. The analysis object is used to submit other file types as well and in some cases it is
appropriate to reference an analysis accession.
At submission time you will not receive any sequence accessions. These will come later by email (multiple email
accounts can be registered per Webin account). After submission the TSV file is moved to a staging area and each
row is converted into an EMBL sequence flatfile. The flatfiles are then validated and accessioned. After this the
accessions are emailed and the sequences are moved to the confidential or public archive depending on the status of
the encompassing study (a public study will make the sequences public too).
Step 1: Create a study
If you already have a study you can add your annotated sequence entries to it. If you do not, you need to create one
first. Use either the interactive submission route or the programmatic submission route to do this.
Step 2: Get hold of the TSV template
Sequences are submitted as TSV spreadsheets. You can use the Webin submission option “Submit other assembled and
annotated sequences [formerly EMBL-Bank]” to get hold of the template that you will be using. You will only need to
do this once for each type of sequence that you are submitting. After you have the template(s) you can submit without
logging in to Webin.
For this example I chose sequence type rRNA gene and then navigated to the page where there was an option to
download the template.
The downloaded file is called something like “Sequence-ERT000002-5697110325950293078.tsv”. Take note of the
ERT number which in this example is ERT000002. It represents the sequence type (rRNA gene in this case). This is
required later - the system needs to know the sequence type so that it can create the right EMBL file from the TSV. To
fill in the TSV you can use a spreadsheet editor. Each row in the tsv is a separate sequence record. The last column is
for the sequence and the others are for annotation fields. It is a bit like a FASTA except that the header and sequence
are on one line instead of two and the fields are tab separated.
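To illustrate the one-row-per-record layout described above, here is a minimal Python sketch that flattens FASTA records into tab separated rows with the sequence in the last column. The annotation values and their order are hypothetical. The real columns come from the template you downloaded, and this sketch does not reproduce the template's header rows.

```python
import csv
import io


def fasta_to_rows(fasta_text, annotations):
    """Return one list per FASTA record: annotation fields first, sequence last.

    `annotations` maps a record id (first word after '>') to a list of
    annotation values. Which values are needed, and in what order, is an
    assumption here; take it from the template you downloaded.
    """
    rows = []
    record_id, chunks = None, []
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if record_id is not None:
                rows.append(list(annotations.get(record_id, [])) + ["".join(chunks)])
            record_id = (line[1:].split() or [""])[0]
            chunks = []
        else:
            chunks.append(line)
    if record_id is not None:
        rows.append(list(annotations.get(record_id, [])) + ["".join(chunks)])
    return rows


def rows_to_tsv(rows):
    """Serialise rows as tab separated text, one record per line."""
    buffer = io.StringIO()
    csv.writer(buffer, delimiter="\t", lineterminator="\n").writerows(rows)
    return buffer.getvalue()
```

Paste the resulting rows under the header of your template rather than submitting this output directly.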
Step 3: Upload the TSV file to your FTP directory
After submission, the TSV file will be accessed from your Webin FTP directory (all accounts have some space on
the ENA FTP server for this purpose) for processing. So before going any further you need to compress the TSV
file and upload it to your Webin ftp directory. A full set of instructions can be found here. You also need to register
the MD5 checksum for the TSV file. This can be done in the next step (by adding it to the analysis xml object) or
you can do it now by uploading a supplementary checksum file in addition to the TSV file. So if your tsv is called
ethylomonas.tsv.gz the file with the checksum in it is called ethylomonas.tsv.gz.md5. See here for guidelines on
preparing files for a submission.
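The compress-and-checksum step can be scripted. The following is a sketch only, using the Python standard library: it gzips the TSV, computes the MD5 of the compressed file (that is the file that gets transferred) and writes a companion .md5 file. I am assuming here that the .md5 file should contain just the hexadecimal digest; check the file preparation guidelines before relying on that.

```python
import gzip
import hashlib
import shutil
from pathlib import Path


def compress_and_checksum(tsv_path):
    """Gzip a file, then write <name>.gz.md5 holding the MD5 of the .gz file."""
    tsv_path = Path(tsv_path)
    gz_path = tsv_path.with_name(tsv_path.name + ".gz")
    with open(tsv_path, "rb") as src, gzip.open(gz_path, "wb") as dst:
        shutil.copyfileobj(src, dst)

    # Hash the compressed file in blocks so large files do not fill memory.
    md5 = hashlib.md5()
    with open(gz_path, "rb") as handle:
        for block in iter(lambda: handle.read(1 << 20), b""):
            md5.update(block)
    digest = md5.hexdigest()

    # Assumption: the checksum file holds only the hex digest.
    md5_path = gz_path.with_name(gz_path.name + ".md5")
    md5_path.write_text(digest + "\n")
    return gz_path, md5_path, digest
```

For a file called ethylomonas.tsv this produces ethylomonas.tsv.gz and ethylomonas.tsv.gz.md5 in the same directory. Upload both, or keep the digest for the analysis XML in the next step.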
Step 4: Prepare the Analysis XML file
The TSV file, now sitting in your Webin FTP directory, is registered/submitted using the ENA XML REST API.
Create an analysis object as an XML file. Note that this analysis object references a study (see step 1 above) and the
compressed tsv file. It also includes the MD5 checksum for the compressed TSV file (so we can check that the transfer
is 100% completed). You can omit the checksum attribute in the XML if you have already uploaded a checksum
file to your Webin ftp directory along with the compressed TSV file. See here for guidelines on preparing files for a
submission.
The analysis object also references the ERT number (corresponding to the rRNA sequence type in this case). In this
example I changed the name of the TSV file that was downloaded in step 2 above, but you do not have to.
<?xml version = '1.0' encoding = 'UTF-8'?>
<ANALYSIS_SET>
<ANALYSIS alias="ethylomonas" center_name="EBI">
<TITLE>16S of Methylomonas sp.</TITLE>
<DESCRIPTION>16S Methylomonas sp.</DESCRIPTION>
<STUDY_REF accession="PRJEBxxxx">
</STUDY_REF>
<ANALYSIS_TYPE>
<SEQUENCE_FLATFILE/>
</ANALYSIS_TYPE>
<FILES>
<FILE checklist="ERT000002" checksum="5831463bb16a4c14374a0962d5a353cc" checksum_method="MD5" filename="ethylomonas.tsv.gz" filetype="tab"/>
</FILES>
</ANALYSIS>
</ANALYSIS_SET>
Create a file; it can have any name, but in this example we will call it analysis.xml. You can use the above XML as a
template but be sure to change all the fields because this is an example only. Remember to:
1. Provide your own alias. This is a unique id for the analysis object and you may need it to identify your submission later.
2. Apply your own center name. When you created your Webin account you provided a center name acronym.
You can check it by logging in and looking at the account details section.
3. Add a similar title to the one in the example. It mentions the sequence type and the organism.
4. Use the same or a similar title in the <DESCRIPTION> block. Title and description are not used in the final EMBL flatfiles so these
fields do not have to be very detailed.
5. Apply the correct study id (PRJEBxxxx)
6. Apply the correct checklist id (ERTxxxxxx)
7. If registering the MD5 checksum, apply it.
8. Apply the correct file name to refer to the compressed TSV. Use the full path if you have uploaded it to a
subdirectory within your Webin FTP directory.
9. filetype and checksum_method should stay the same as the example
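If you prepare many analysis objects, the checklist above can be turned into a small generator. This is a sketch under my own assumptions, not an official ENA tool: the function name and parameters are hypothetical and it only reproduces the structure of the example XML shown above. The checksum and checklist attributes are left out when you do not pass them.

```python
import xml.etree.ElementTree as ET


def build_analysis_xml(alias, center_name, title, description, study_accession,
                       filename, filetype, checksum=None, checklist=None):
    """Build an ANALYSIS_SET document shaped like the example above."""
    analysis_set = ET.Element("ANALYSIS_SET")
    analysis = ET.SubElement(analysis_set, "ANALYSIS",
                             alias=alias, center_name=center_name)
    ET.SubElement(analysis, "TITLE").text = title
    ET.SubElement(analysis, "DESCRIPTION").text = description
    ET.SubElement(analysis, "STUDY_REF", accession=study_accession)
    analysis_type = ET.SubElement(analysis, "ANALYSIS_TYPE")
    ET.SubElement(analysis_type, "SEQUENCE_FLATFILE")

    # checksum_method is kept as in the example; checksum itself is optional
    # when a .md5 file was uploaded alongside the data file.
    file_attrs = {"checksum_method": "MD5", "filename": filename, "filetype": filetype}
    if checklist:
        file_attrs["checklist"] = checklist
    if checksum:
        file_attrs["checksum"] = checksum
    files = ET.SubElement(analysis, "FILES")
    ET.SubElement(files, "FILE", file_attrs)
    return ET.tostring(analysis_set, encoding="unicode")
```

Calling it with filetype="tab" and checklist="ERT000002" gives an object equivalent to the example. Write the returned string to analysis.xml and inspect it before submitting.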
Step 5: Prepare a Submission XML file
There is also a submission object which represents the submission event itself. An XML file with a submission object
needs to accompany the analysis object when it is sent to the ENA REST API server so that the system knows what to
do with the analysis object.
<?xml version="1.0" encoding="UTF-8"?>
<SUBMISSION alias="ethylomonas_submission" center_name="EBI">
<ACTIONS>
<ACTION>
<ADD source="analysis.xml" schema="analysis"/>
</ACTION>
</ACTIONS>
</SUBMISSION>
The submission XML file can have any name. In this example it is called submission.xml. Change the example
template above as follows:
1. Provide your own alias. This is a unique id for the submission event.
2. Apply your own center name. When you created your Webin account you provided a center name acronym.
You can check it by logging in and looking at the account details section.
3. Change the source attribute so that it has the name of the XML file containing the analysis object from step 4
above.
4. You can change the ‘ADD’ block to a ‘VALIDATE’ block if you just want to test and see what messages are
returned: <VALIDATE source="analysis.xml" schema="analysis"/>. When you validate, the
object will not be committed to the ENA database and no accession number will be assigned. We recommend
testing all your submissions like this, before using the ADD action.
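Switching between a test run and a real submission is easier if the action is a parameter. The sketch below (function name and defaults are my own, not part of any ENA library) builds the submission XML shown above and defaults to VALIDATE, so that nothing is committed unless you ask for ADD explicitly.

```python
import xml.etree.ElementTree as ET


def build_submission_xml(alias, center_name, source, schema, action="VALIDATE"):
    """Build a SUBMISSION document with a single ADD or VALIDATE action."""
    if action not in ("ADD", "VALIDATE"):
        raise ValueError("action must be 'ADD' or 'VALIDATE'")
    submission = ET.Element("SUBMISSION", alias=alias, center_name=center_name)
    actions = ET.SubElement(submission, "ACTIONS")
    action_block = ET.SubElement(actions, "ACTION")
    # The action element names the XML file to process and its schema.
    ET.SubElement(action_block, action, source=source, schema=schema)
    return ET.tostring(submission, encoding="unicode")
```

A typical sequence is to generate the VALIDATE version first, send it, read the messages, and only then regenerate the file with action="ADD".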
Step 6: Send the XMLs to ENA through the REST API
This step is the same as other REST API submissions. Please go to this section which is based on submitting a project
XML. Submitting an analysis XML is very similar.
Please note the following.
• Your cURL command will look something like this:
curl -k -F "SUBMISSION=@submission.xml" -F "ANALYSIS=@analysis.xml" "https://www-test.ebi.ac.uk/ena/submit/drop-box/submit/?auth=ENA%20Webin-NNN%20PASSWORD"
• If you are using the webform instead of cURL in Linux or Mac operating systems, use the analysis row and the
submission row to browse and navigate to the analysis.xml file and the submission.xml file respectively
• The receipt will look like this. Look out for the success="true" in the receipt.
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="receipt.xsl"?>
<RECEIPT receiptDate="2017-05-05T15:28:38.557+01:00" submissionFile="sub.xml" success="true">
<ANALYSIS accession="ERZ407913" alias="ethylomonas" status="PRIVATE" />
<SUBMISSION accession="ERA907974" alias="ethylomonas" />
<MESSAGES>
<INFO>This submission is a TEST submission and will be discarded within 24 hours</INFO>
</MESSAGES>
<ACTIONS>ADD</ACTIONS>
</RECEIPT>
• The URL in the cURL command above belongs to the test server (https://www-test.ebi.ac.uk/...), so the
accessions delivered are not genuine. If you are happy with the submission in TEST, change to
the production server (https://www.ebi.ac.uk/...) and remove the -k flag. Also remember that if you are
using the ‘VALIDATE’ action in the submission XML then, despite a success="true", the submission was not
committed!
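The points above can be checked automatically once you have the receipt as text. This is a minimal sketch that reads only the elements and attributes visible in the example receipt. Detecting a test submission by looking for the words "TEST submission" in an INFO message is my assumption based on that example, not a documented rule.

```python
import xml.etree.ElementTree as ET


def summarise_receipt(receipt_xml):
    """Report success, whether anything was committed, and analysis accessions."""
    root = ET.fromstring(receipt_xml)
    actions = [el.text for el in root.iter("ACTIONS")]
    infos = [el.text or "" for el in root.iter("INFO")]
    return {
        "success": root.get("success") == "true",
        # A VALIDATE run can succeed without committing anything.
        "validate_only": "VALIDATE" in actions,
        # Assumption: the test server announces itself in an INFO message.
        "test_server": any("TEST submission" in message for message in infos),
        "analyses": {el.get("alias"): el.get("accession")
                     for el in root.iter("ANALYSIS")},
    }
```

Treat the accessions as genuine only when success is true and both validate_only and test_server are false.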
Module 3: Flat File upload - Submit an ENA Supported Sequence File
Annotated sequence entries are stored in the ENA as ENA supported sequence files. Here is an example of an HLA
gene in ENA supported format. It is a text file that is computer readable due to the two-character line beginnings (ID,
AC, DE ...). The ENA browser renders the text file into a friendlier and more graphical view, but the computer readable
version is still available so that automatic pipelines downstream of the ENA can download and parse large numbers
of sequence entries.
Create your own ENA supported sequence file
In most cases it is not necessary to submit an ENA supported sequence file because the interactive tool Webin provides
spreadsheet templates for various types of sequences so that you can submit using a tab separated file (TSV) which
you can fill in using any spreadsheet editor. These are called ‘annotation checklists’. After the submission via Webin
or via programmatic REST API the TSV is converted into an ENA supported sequence file (or ‘flat file’) and validated
before accessions are delivered.
Not all sequence types are available as a TSV spreadsheet template/annotation checklist. For instance the HLA gene
above has multiple exons and this is difficult for us to turn into a template. Typically the more complicated sequences
with multiple and repeating features are the hardest to make into TSV templates. For these types of sequences you can
create an ENA supported sequence file yourself and submit it to the ENA using the programmatic REST API (this is
submission by “flat file upload”, previously “entry upload”).
For a list of sequence types that are available as annotation checklists (TSV spreadsheets) see here: http://www.ebi.ac.uk/ena/submit/annotation-checklists
Please do not use submission by flat file for any sequence type listed on the above webpage. The spreadsheet/annotation
checklist submission route is more robust because we do the file conversion.
For examples of ENA flat files that are not available for submission using annotation checklists/TSV see here: http://www.ebi.ac.uk/ena/submit/entry-upload-templates
Pay close attention to how the flat files are formatted. Use the web page above to construct your sequence flat file.
This will be submitted by flat file upload. As with a TSV/annotation checklist submission (module 2) you need to
create an analysis object in XML format to wrap the ENA flat file. Please check module 2: Analysis object for more
information. To see how the analysis object and the sequence entries will be accessioned please refer to module 2: A
word about Accession Numbers
Submission by Flat File Upload
Submitting an ENA flat file is the same as submitting a tab separated file, so much of the detail is in module 2.
The main difference is that for tsv spreadsheet submissions the tab/tsv file is converted to an ENA flat file and then
validation is applied. For a submission by flat file upload, the conversion is omitted because the file is already in the
ENA supported format. The system will try to validate your ENA flat file after only minimal processing. There is a
little more opportunity for error but this can be remedied by following the guidelines closely.
Step 1: Create a project
As with a TSV/annotation checklist submission (module 2), a project/study is required. If you already have a study
you can add your annotated sequence entries to it. If not, create one first. Use either the interactive submission route
or the programmatic submission route to do this. Note the project accession number when you receive it.
Step 2: Compress and upload the sequence flat file
As with a TSV/annotation checklist submission, the sequence flat file must be compressed and uploaded to
your Webin ftp directory. You may also need to calculate the MD5 checksum. Check here and here
for instructions. In this example I have an ENA flat file called Human_parvovirus_B19_entryupload.embl
which I have compressed to create file Human_parvovirus_B19_entryupload.embl.gz. The checksum of Human_parvovirus_B19_entryupload.embl.gz is 7138bf3320cad8d215b7e9930ded114b.
Step 3: Create the analysis and submission XMLs
First check how the analysis file was created in module 2 step 4.
In this example the analysis file looks like this:
<?xml version = '1.0' encoding = 'UTF-8'?>
<ANALYSIS_SET>
<ANALYSIS alias="Human_parvovirus_B19_entryupload" center_name="EBI">
<TITLE>Human parvovirus B19 isolate IRB_1_2008 NS1 and VP1 unique region genes, partial cds</TITLE>
<DESCRIPTION>Human parvovirus B19 isolate IRB_1_2008 NS1 and VP1 unique region genes, partial cds</DESCRIPTION>
<STUDY_REF accession="PRJEBXXXX">
</STUDY_REF>
<ANALYSIS_TYPE>
<SEQUENCE_FLATFILE/>
</ANALYSIS_TYPE>
<FILES>
<FILE checksum="7138bf3320cad8d215b7e9930ded114b" checksum_method="MD5" filename="Human_parvovirus_B19_entryupload.embl.gz" filetype="flatfile"/>
</FILES>
</ANALYSIS>
</ANALYSIS_SET>
In this case there is no ERT number/checklist attribute because no TSV annotation checklist template is being used.
Also the file type attribute is different: filetype="flatfile". The title and description can be a brief description
of what is presented in the sequence flat file. Make sure to add all your own attributes and field values as the above is
only for example purposes.
The submission XML in this example looks like this:
<?xml version="1.0" encoding="UTF-8"?>
<SUBMISSION alias="entry_upload_Human_parvovirus_B19" center_name="EBI">
<ACTIONS>
<ACTION>
<ADD source="analysis.xml" schema="analysis"/>
</ACTION>
</ACTIONS>
</SUBMISSION>
As in module 2 step 5, provide a unique alias for the submission object and reference the file containing the
analysis object (in this case I called it ‘analysis.xml’).
Step 4: Send both XMLs to ENA using REST API
This step is the same as module 2 step 6.
Use cURL or the web form to send the XMLs to ENA and register the flat file submission. Use the test server first and
if successful and you are happy with the receipt proceed to submit to the production server.
In this example I obtained the following receipt
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="receipt.xsl"?>
<RECEIPT receiptDate="2017-05-08T12:51:53.601+01:00" submissionFile="submission.xml" success="true">
<ANALYSIS accession="ERZ408000" alias="Human_parvovirus_B19_entryupload" status="PRIVATE" />
<SUBMISSION accession="ERA911540" alias="entry_upload" />
<ACTIONS>ADD</ACTIONS>
</RECEIPT>
In this example the analysis received accession ERZ408000 and the submission received accession ERA911540. You
will not need the submission accession, whereas the analysis accession may be useful if you need to enquire about the
progress of the submission. After the sequence entries are processed they will be accessioned and you will receive the
accession (or accession range if multiple sequences were in the flat file) via the email address that is registered with
your Webin account. Do not quote the analysis accession in any publication, always quote the sequence accessions
(which come later by email). You can also quote the project/study accession, especially if you have used the project
to group several submissions across different domains.
Module 4: Update a Study using REST API
Editing studies in the ENA using the REST API is an almost identical process to submitting a new one. The first
step is to obtain the original study in XML format. This step alone can be tricky if you did not submit the project using
the REST API to begin with. Note that Webin has good study editing functionality already.
However, learning to use the REST API with a simple project object can pave the way for submitting and updating
more complicated objects such as samples, experiments and runs. Also for making edits in bulk (to many projects) the
ENA REST API is more feasible than Webin.
Step 1: Get hold of the study in XML format
If you used REST API to submit the study in the first place you can use the XML files that you used previously.
If you don’t have an XML file containing the study you can copy the public version by using &display=xml at the end
of the study page. For example, http://www.ebi.ac.uk/ena/data/view/PRJEB5932&display=xml.
Note that the web version has additional blocks that are not part of the original XML as well as parts that have been
added automatically and can be cleaned up for the purpose of updating (besides, they will be added again automatically).
For example, the web version XML below can be cleaned up so that it looks like the submitted version that follows it.
Web Version
<?xml version="1.0" encoding="UTF-8"?>
<ROOT request="PRJEB14252&amp;display=xml">
<PROJECT alias="ena-STUDY-klanvin-03-06-2016-07:54:42:301-120" center_name="klanvin" accession="PRJEB14252" first_public="2016-08-02+01:00">
<IDENTIFIERS>
<PRIMARY_ID>PRJEB14252</PRIMARY_ID>
<SECONDARY_ID>ERP015887</SECONDARY_ID>
<SUBMITTER_ID namespace="klanvin">ena-STUDY-klanvin-03-06-2016-07:54:42:301-120</SUBMITTER_ID>
</IDENTIFIERS>
<NAME>Cheddar cheese</NAME>
<TITLE>Characterization of Microbial Diversity and Chemical Properties of Cheddar Cheese Prepared from Heat-treated Milk</TITLE>
<DESCRIPTION>This study aimed to characterize the interaction of microbial diversity and chemical properties of Cheddar cheese after three different heat treatments of milk; low temperature/long time (LTLT), thermization, and high temperature/short time (HTST). Cheese obtained from LTLT-treated milk (LC) and thermized milk (TC) .... </DESCRIPTION>
<SUBMISSION_PROJECT>
<SEQUENCING_PROJECT>
<LOCUS_TAG_PREFIX>BN8055</LOCUS_TAG_PREFIX>
</SEQUENCING_PROJECT>
</SUBMISSION_PROJECT>
<PROJECT_LINKS>
<PROJECT_LINK>
<XREF_LINK>
<DB>ENA-SUBMISSION</DB>
<ID>ERA645775</ID>
</XREF_LINK>
</PROJECT_LINK>
<PROJECT_LINK>
<XREF_LINK>
<DB>ENA-FASTQ-FILES</DB>
<ID><![CDATA[http://www.ebi.ac.uk/ena/data/warehouse/filereport?accession=PRJEB14252&result=read_run&fields=run_accession,fastq_ftp,fastq_md5,fastq_bytes]]></ID>
</XREF_LINK>
</PROJECT_LINK>
<PROJECT_LINK>
<XREF_LINK>
<DB>ENA-SUBMITTED-FILES</DB>
<ID><![CDATA[http://www.ebi.ac.uk/ena/data/warehouse/filereport?accession=PRJEB14252&result=read_run&fields=run_accession,submitted_ftp,submitted_md5,submitted_bytes,submitted_format]]></ID>
</XREF_LINK>
</PROJECT_LINK>
</PROJECT_LINKS>
<PROJECT_ATTRIBUTES>
<PROJECT_ATTRIBUTE>
<TAG>ENA-FIRST-PUBLIC</TAG>
<VALUE>2016-08-02</VALUE>
</PROJECT_ATTRIBUTE>
<PROJECT_ATTRIBUTE>
<TAG>ENA-LAST-UPDATE</TAG>
<VALUE>2016-06-03</VALUE>
</PROJECT_ATTRIBUTE>
</PROJECT_ATTRIBUTES>
</PROJECT>
</ROOT>
Submitted version
<?xml version="1.0" encoding="US-ASCII"?>
<PROJECT_SET>
<PROJECT center_name="klanvin" accession="PRJEB14252">
<NAME>Cheddar cheese</NAME>
<TITLE>Characterization of Microbial Diversity and Chemical Properties of Cheddar Cheese Prepared from Heat-treated Milk</TITLE>
<DESCRIPTION>This study aimed to characterize the interaction of microbial diversity and chemical properties of Cheddar cheese after three different heat treatments of milk; low temperature/long time (LTLT), thermization, and high temperature/short time (HTST). Cheese obtained from LTLT-treated milk (LC) and thermized milk (TC) .... </DESCRIPTION>
<SUBMISSION_PROJECT>
<SEQUENCING_PROJECT>
<LOCUS_TAG_PREFIX>BN8055</LOCUS_TAG_PREFIX>
</SEQUENCING_PROJECT>
</SUBMISSION_PROJECT>
</PROJECT>
</PROJECT_SET>
The submitted version is much shorter. I even removed the unique alias: now that the object has an accession
number, the server does not need both the alias and the accession number to identify the object that is being
overwritten.
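The clean-up from web version to submitted version can be scripted when you have many studies to edit. The sketch below mirrors what was done by hand in the example. It assumes that links and attributes whose names start with "ENA-" are the automatically added ones, which is my reading of the example rather than a documented rule, and it leaves any other links or attributes in place.

```python
import copy
import xml.etree.ElementTree as ET


def _drop_ena_entries(project, block_tag, entry_tag, key_path):
    """Remove entries whose key starts with 'ENA-'; drop the block if emptied."""
    block = project.find(block_tag)
    if block is None:
        return
    for entry in block.findall(entry_tag):
        if entry.findtext(key_path, default="").startswith("ENA-"):
            block.remove(entry)
    if len(block) == 0:
        project.remove(block)


def web_to_submittable(web_xml):
    """Convert the browser's <ROOT> document into a <PROJECT_SET> for MODIFY."""
    root = ET.fromstring(web_xml)
    project_set = ET.Element("PROJECT_SET")
    for original in root.iter("PROJECT"):
        project = copy.deepcopy(original)
        # The accession identifies the object, so alias is no longer needed.
        project.attrib.pop("alias", None)
        project.attrib.pop("first_public", None)
        for identifiers in project.findall("IDENTIFIERS"):
            project.remove(identifiers)
        _drop_ena_entries(project, "PROJECT_LINKS", "PROJECT_LINK", "XREF_LINK/DB")
        _drop_ena_entries(project, "PROJECT_ATTRIBUTES", "PROJECT_ATTRIBUTE", "TAG")
        project_set.append(project)
    return ET.tostring(project_set, encoding="unicode")
```

Compare the output with the submitted version above before sending it; the XML declaration is not included in the returned string.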
ERP version
If your study is not public yet and you do not have it in XML format you can try using the submit/drop-box/ REST
endpoint. Log in here with your Webin id and password and click on ‘STUDY’. You will see a list of studies
submitted from your account and you can view the XML for each by selecting the study and then clicking ‘xml’.
Studies obtained from this resource are actually different (you may have noticed). Previously a study in the read
domain had an accession like this ERP000001 whereas a project object (used for registering genome assemblies
among other things) would have an accession like this PRJEB0001. We no longer distinguish between the 2 objects
officially and we expose the PRJEB type more while the ERP type is kept for legacy reasons. You can edit either the
PRJEB type or the ERP type and most attributes will be carried over to the other one. Similarly when you create a
PRJEB type project then an ERP project is created automatically (and vice versa).
Step 2: Create a submission XML file
As with submitting a new study (see module 1), a submission object is required to accompany the study XML for
updating an existing study object too. You may have this from a previous submission or update but it is also very quick
to create.
<?xml version="1.0" encoding="UTF-8"?>
<SUBMISSION alias="cheese_update" center_name="">
<ACTIONS>
<ACTION>
<MODIFY source="project.xml" schema="project"/>
</ACTION>
</ACTIONS>
</SUBMISSION>
Make sure that you give the submission object a unique alias (which can be any string) and fill in the center_name for
your account (you can find this in the “my account details” drop down from inside Webin).
If you are updating the ‘ERP’ version of the project (see above) you also need to specify this in the submission XML
by changing schema="project" to schema="study" because the ERP style objects use a different schema.
The important part of this submission object is the <MODIFY> tag. Contrast this with the tag used to submit an object
for the first time (in module 1) which is <ADD>. This tells the REST server that we are updating an existing object
instead of adding a new one.
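When updating in bulk it is easy to pair the wrong schema with an accession. A small helper can choose it from the accession prefix. This sketch rests on the PRJEB and ERP accession styles described above; the function names are hypothetical.

```python
from xml.sax.saxutils import quoteattr


def schema_for_accession(accession):
    """Return the schema name that matches a project or study accession."""
    if accession.startswith("PRJEB"):
        return "project"
    if accession.startswith("ERP"):
        return "study"
    raise ValueError("expected a PRJEB or ERP accession, got %r" % accession)


def modify_action(accession, source):
    """Return the <MODIFY> element for updating the object in `source`."""
    schema = schema_for_accession(accession)
    # quoteattr adds the surrounding quotes and escapes special characters.
    return "<MODIFY source=%s schema=\"%s\"/>" % (quoteattr(source), schema)
```

Place the returned element inside the ACTION block of your submission XML.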
Make the edit and send to ENA
Now you can make changes to the study object contained in the XML file. For example as a test, you might try
modifying the title or the description.
The final step is identical to submitting a study for the first time in module 1. You will send the submission xml and
the study xml to the ENA REST server using cURL or the webform and you should receive a receipt in XML format.
If the receipt contains success="true" then your edit will have been committed to the database. If not, check the
error message(s), correct and repeat.
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="receipt.xsl"?>
<RECEIPT receiptDate="2017-07-17T13:22:11.020+01:00" submissionFile="sub.xml" success="true">
<PROJECT accession="PRJEB14252" alias="ena-STUDY-klanvin-03-06-2016-07:54:42:301-120" status="PUBLIC"/>
<SUBMISSION accession="" alias="cheese_update"/>
<ACTIONS>MODIFY</ACTIONS>
</RECEIPT>
Module 5: Submitting Sample objects
As with most modules in this programmatic series, this one draws on the basic principles laid out in the first module:
Create a Study. It is recommended that you work through the study module first. When you can create a study object
in the ENA, so too will you be able to create sample objects by the same means.
What does the XML file look like?
The sample below is from an actual project released in 2016. Its title is Different gastric microbiota compositions in
two human populations with high and low gastric cancer risk in Colombia.
Here is one of the samples
<?xml version="1.0" encoding="US-ASCII"?>
<SAMPLE_SET xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:noNamespaceSchemaLocation="ftp://ftp.sra.ebi.ac.uk/meta/xsd/sra_1_5/SRA.sample.xsd">
<SAMPLE alias="MT5176" center_name="">
<TITLE>human gastric microbiota, mucosal</TITLE>
<SAMPLE_NAME>
<TAXON_ID>1284369</TAXON_ID>
<SCIENTIFIC_NAME>stomach metagenome</SCIENTIFIC_NAME>
<COMMON_NAME></COMMON_NAME>
</SAMPLE_NAME>
<SAMPLE_ATTRIBUTES>
<SAMPLE_ATTRIBUTE>
<TAG>investigation type</TAG>
<VALUE>mimarks-survey</VALUE>
</SAMPLE_ATTRIBUTE>
<SAMPLE_ATTRIBUTE>
<TAG>project name</TAG>
<VALUE>Different gastric microbiota compositions in two human populations with high and low gastric cancer risk in Colombia</VALUE>
</SAMPLE_ATTRIBUTE>
<SAMPLE_ATTRIBUTE>
<TAG>sequencing method</TAG>
<VALUE>pyrosequencing</VALUE>
</SAMPLE_ATTRIBUTE>
<SAMPLE_ATTRIBUTE>
<TAG>collection date</TAG>
<VALUE>2010</VALUE>
</SAMPLE_ATTRIBUTE>
<SAMPLE_ATTRIBUTE>
<TAG>host body site</TAG>
<VALUE>Mucosa of stomach</VALUE>
</SAMPLE_ATTRIBUTE>
<SAMPLE_ATTRIBUTE>
<TAG>human-associated environmental package</TAG>
<VALUE>human-associated</VALUE>
</SAMPLE_ATTRIBUTE>
<SAMPLE_ATTRIBUTE>
<TAG>geographic location (latitude)</TAG>
<VALUE>1.81</VALUE>
<UNITS>DD</UNITS>
</SAMPLE_ATTRIBUTE>
<SAMPLE_ATTRIBUTE>
<TAG>geographic location (longitude)</TAG>
<VALUE>-78.76</VALUE>
<UNITS>DD</UNITS>
</SAMPLE_ATTRIBUTE>
<SAMPLE_ATTRIBUTE>
<TAG>geographic location (country and/or sea)</TAG>
<VALUE>Colombia</VALUE>
</SAMPLE_ATTRIBUTE>
<SAMPLE_ATTRIBUTE>
<TAG>geographic location (region and locality)</TAG>
<VALUE>Tumaco</VALUE>
</SAMPLE_ATTRIBUTE>
<SAMPLE_ATTRIBUTE>
<TAG>environment (biome)</TAG>
<VALUE>coast</VALUE>
</SAMPLE_ATTRIBUTE>
<SAMPLE_ATTRIBUTE>
<TAG>environment (feature)</TAG>
<VALUE>human-associated habitat</VALUE>
</SAMPLE_ATTRIBUTE>
<SAMPLE_ATTRIBUTE>
<TAG>environment (material)</TAG>
<VALUE>gastric biopsy</VALUE>
</SAMPLE_ATTRIBUTE>
<SAMPLE_ATTRIBUTE>
<TAG>ENA-CHECKLIST</TAG>
<VALUE>ERC000014</VALUE>
</SAMPLE_ATTRIBUTE>
</SAMPLE_ATTRIBUTES>
</SAMPLE>
</SAMPLE_SET>
A sample is ultimately connected to raw read data and can also be connected to an assembly and various types of
interpreted data. It provides most of the context and value to the data that it is connected to and it represents the
source material that has been sequenced. Note that most of the added value comes in the form of <TAG> and <VALUE> pairs that belong
in <SAMPLE_ATTRIBUTE> blocks. These blocks are not restricted so you can add as many as you like and you can
define them however you like. Most submitters will want to apply attributes that are recognised by ENA and that are
indexed for searching and filtering as this will increase the searchability and value of your sample even further. You
can also use a combination of your own attributes with those recognised by ENA.
Apply an ENA minimum information standard checklist to your samples
ENA offers sample ‘checklists’ which define all the mandatory and recommended attributes for specific types of samples.
By declaring that you would like to register your sample under a specific checklist you enable the sample
to be validated for correctness at submission time. You will also benefit from additional exposure of that sample to
various services downstream of ENA that are interested in using ENA data that has been annotated to the minimum
standards represented by the ENA checklists.
The sample above is using and will be validated against ENA checklist ERC000014. Note that the checklist itself
is declared using a SAMPLE_ATTRIBUTE block. The rest of the SAMPLE_ATTRIBUTE blocks are defined by
that checklist. You can omit a checklist reference if you do not want your samples to be confined to the minimum
annotation standards of one of ENA’s checklists. We advise against this and you can always add more of your own
attributes which will not be subject to strict validation.
Find all the sample checklists here. You can see that the sample in the example above is using checklist ERC000014,
which corresponds to the GSC MIxS annotation standard for human associated source samples. Use these webpages in
the ENA to find out which attributes are required by each checklist and which controlled vocabularies, regular
expressions and units are expected in each case. You may want to access the XML version of the checklist if you want to write
a script to validate your own samples before you submit them. The XML version of the checklist is available by appending
&display=xml to the URL for the specific checklist: http://www.ebi.ac.uk/ena/data/view/ERC000014&display=xml
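A pre-submission check along these lines might start as follows. This sketch does not parse the checklist XML, because its layout is not described here. Instead it takes the list of mandatory attribute names as an argument, which you would fill in yourself from the checklist page. The names used in the usage note are examples only and are not a statement of what ERC000014 requires.

```python
import xml.etree.ElementTree as ET


def missing_attributes(sample_xml, mandatory_tags):
    """Map each sample alias to the mandatory tags it lacks (or leaves empty)."""
    root = ET.fromstring(sample_xml)
    report = {}
    for sample in root.iter("SAMPLE"):
        present = {
            attribute.findtext("TAG")
            for attribute in sample.iter("SAMPLE_ATTRIBUTE")
            # An attribute with a blank value is treated as missing.
            if (attribute.findtext("VALUE") or "").strip()
        }
        missing = sorted(set(mandatory_tags) - present)
        if missing:
            report[sample.get("alias")] = missing
    return report
```

For example, missing_attributes(xml_text, ["collection date", "host body site"]) returns an empty dictionary when every sample carries both attributes. Passing this check does not guarantee that ENA's own validation will succeed.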
If there is not a suitable checklist that describes your type of source samples you can use the ENA default checklist
(http://www.ebi.ac.uk/ena/data/view/ERC000011). This checklist has virtually no mandatory fields but does include a
lot of optional attributes that you can review to help annotate your sample to the highest standard that is possible. A
well annotated sample will eventually lead to maximum exposure and usability of your data.
Submitting many samples simultaneously
The main attraction for using the REST API to submit samples (and other objects) is that you do not need to interact
with a manual web interface and that you can submit many objects in bulk at the same time. The example contains
one sample block inside one sample_set block <SAMPLE_SET></SAMPLE_SET>. Your submission is more likely
to have multiple samples in one sample_set. Make sure you highlight how the samples are different from each other if
it is not already clear from some of the attribute values. Merely naming them 1 to 4 will not help your users to do any
comparative analysis!
<?xml version="1.0" encoding="US-ASCII"?>
<SAMPLE_SET>
<SAMPLE alias="1" center_name="">
<TITLE>first human gastric microbiota sample</TITLE>
<SAMPLE_NAME>
<TAXON_ID>1284369</TAXON_ID>
</SAMPLE_NAME>
</SAMPLE>
<SAMPLE alias="2" center_name="">
<TITLE>second human gastric microbiota sample</TITLE>
<SAMPLE_NAME>
<TAXON_ID>1284369</TAXON_ID>
</SAMPLE_NAME>
</SAMPLE>
<SAMPLE alias="3" center_name="">
<TITLE>third human gastric microbiota sample</TITLE>
<SAMPLE_NAME>
<TAXON_ID>1284369</TAXON_ID>
</SAMPLE_NAME>
</SAMPLE>
<SAMPLE alias="4" center_name="">
<TITLE>fourth human gastric microbiota sample</TITLE>
<SAMPLE_NAME>
<TAXON_ID>1284369</TAXON_ID>
</SAMPLE_NAME>
</SAMPLE>
</SAMPLE_SET>
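With more than a handful of samples you will probably generate the sample_set from a table rather than type it. The sketch below builds one from a list of dictionaries. The dictionary keys (alias, title, taxon_id, attributes) are my own choice, units are not handled, and rejecting duplicate aliases is an assumption about what the server expects.

```python
import xml.etree.ElementTree as ET


def build_sample_set(samples, center_name):
    """Build a SAMPLE_SET document with one SAMPLE block per dictionary."""
    aliases = [sample["alias"] for sample in samples]
    if len(set(aliases)) != len(aliases):
        raise ValueError("sample aliases must be unique")

    sample_set = ET.Element("SAMPLE_SET")
    for sample in samples:
        block = ET.SubElement(sample_set, "SAMPLE",
                              alias=sample["alias"], center_name=center_name)
        ET.SubElement(block, "TITLE").text = sample["title"]
        name = ET.SubElement(block, "SAMPLE_NAME")
        ET.SubElement(name, "TAXON_ID").text = str(sample["taxon_id"])

        # Attributes are what distinguish one sample from another.
        attributes = sample.get("attributes", {})
        if attributes:
            attribute_block = ET.SubElement(block, "SAMPLE_ATTRIBUTES")
            for tag, value in attributes.items():
                attribute = ET.SubElement(attribute_block, "SAMPLE_ATTRIBUTE")
                ET.SubElement(attribute, "TAG").text = tag
                ET.SubElement(attribute, "VALUE").text = str(value)
    return ET.tostring(sample_set, encoding="unicode")
```

Include the checklist as an ordinary attribute, for example {"ENA-CHECKLIST": "ERC000014"}, alongside the attributes that differ between samples.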
Two more points about the sample XML file
XML Schema
Note the first 2 lines in the first example above.
<SAMPLE_SET xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:noNamespaceSchemaLocation="ftp://ftp.sra.ebi.ac.uk/meta/xsd/sra_1_5/SRA.sample.xsd">
This part points your XML editor (if you are using one) to a schema so that it can validate as you type. This is
the schema for the sample XML which is not the same as the checklist validation system. This schema defines the
order of the blocks and the controlled terms that may be available in some cases. It is more of a structural check and
unfortunately many ENA rules are not embedded into this first level schema so it can not guarantee that the submission
will be successful. However it will help you to compile properly written sample XML files.
Taxonomic classification
Note the sample_name block from the example above
<SAMPLE_NAME>
<TAXON_ID>1284369</TAXON_ID>
<SCIENTIFIC_NAME>stomach metagenome</SCIENTIFIC_NAME>
<COMMON_NAME></COMMON_NAME>
</SAMPLE_NAME>
Taxon, scientific name and common name are ways of classifying the organism of the sample. In this case, however, the
source sample is environmental and represents an unknown variety and quantity of organisms. Because every sample
still needs a taxonomic classification we have specific environmental terms in our taxonomy database, typically used
for metagenomic studies. More about these here.
Taxon, scientific name and common name are referencing the same node in our taxonomic database so you do not
need to include all 3. Including the unique taxon_id is sufficient and the other fields will be added automatically after
the sample is submitted and archived. To find the correct taxonomic information for your organism including taxon_id
and scientific_name see here.
Submitting the XML files
The procedure for submitting XML files is outlined in module 1. Module 1 describes submitting a study object but
the process for sample submission is the same. The submission XML file should look something like this (assuming
the samples are in another XML file called “samp.xml”). Also remember to apply the correct center name for your Webin
account. The alias can be any unique string.
<?xml version="1.0" encoding="UTF-8"?>
<SUBMISSION alias="MT5176_submission" center_name="">
<ACTIONS>
<ACTION>
<ADD source="samp.xml" schema="sample"/>
</ACTION>
</ACTIONS>
</SUBMISSION>
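If you create many submissions, the file above can be written from the shell. The following is only a minimal sketch: the function name write_submission_xml and its arguments are our own illustration and not part of any ENA tool.

```shell
# Hypothetical helper: write a submission XML that ADDs one sample XML file.
# Arguments: 1 = submission alias, 2 = centre name, 3 = sample XML file name,
#            4 = output file (defaults to sub.xml)
write_submission_xml() {
    alias_name=$1
    center_name=$2
    sample_xml=$3
    out_file=${4:-sub.xml}
    cat > "$out_file" <<EOF
<?xml version="1.0" encoding="UTF-8"?>
<SUBMISSION alias="$alias_name" center_name="$center_name">
    <ACTIONS>
        <ACTION>
            <ADD source="$sample_xml" schema="sample"/>
        </ACTION>
    </ACTIONS>
</SUBMISSION>
EOF
}
```

For example, write_submission_xml "MT5176_submission" "MY_CENTRE" "samp.xml" would write sub.xml in the current directory, where MY_CENTRE is a placeholder for the centre name of your own Webin account.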
Assuming that the above submission XML is saved in a file called “sub.xml”, a cURL statement to send the XMLs to
the ENA REST TEST server will look like this:
curl -k -F "SUBMISSION=@sub.xml" -F "SAMPLE=@samp.xml" "https://www-test.ebi.ac.uk/ena/submit/drop-box/submit/?auth=ENA%20Webin-NNN%20PASSWORD"
The cURL command will return a receipt in XML format containing the accession numbers. If accession numbers were not assigned because there was a problem, you will instead get a list of errors to work through before
trying again.
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="receipt.xsl"?>
<RECEIPT receiptDate="2017-07-25T16:07:50.248+01:00" submissionFile="sub.xml" success="true">
<SAMPLE accession="ERS1833148" alias="MT5176" status="PRIVATE">
<EXT_ID accession="SAMEA104174130" type="biosample"/>
</SAMPLE>
<SUBMISSION accession="ERA979927" alias="MT5176_submission"/>
<MESSAGES>
<INFO>This submission is a TEST submission and will be discarded within 24 hours</INFO>
</MESSAGES>
<ACTIONS>ADD</ACTIONS>
</RECEIPT>
The receipt can be quite large so you may prefer to redirect the cURL output to a file, for example “receipt.xml”.
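Once the receipt is saved to a file you can inspect it from the command line. The sketch below assumes the receipt has the layout shown above; the function names receipt_ok and receipt_accessions are our own and purely illustrative. It uses plain text matching, so a proper XML parser would be more robust.

```shell
# Hypothetical helpers for reading a receipt saved to a file.

# Succeeds (exit code 0) only if the RECEIPT element reports success="true".
# tr joins the lines first, in case the attribute wraps onto a new line.
receipt_ok() {
    tr '\n' ' ' < "$1" | grep -q '<RECEIPT[^>]*success="true"'
}

# Prints "ELEMENT accession" for each SAMPLE and SUBMISSION element,
# one per line, assuming each element starts on its own line.
receipt_accessions() {
    sed -nE 's/.*<(SAMPLE|SUBMISSION) accession="([^"]*)".*/\1 \2/p' "$1"
}
```

With the receipt above saved as “receipt.xml”, receipt_accessions receipt.xml would print the sample accession ERS1833148 and the submission accession ERA979927, each on its own line.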
Module 6: Updating Sample objects using REST API
under construction
CHAPTER
3
Tips and FAQs
Solving Error Notifications (runs)
Submission of read files such as BAM and FastQ involves uploading them to your confidential ftp directory (this
comes with your Webin account). Following this you will ‘submit’ the files: wrap each file (or pair of files) into a run
object. Registering the run objects triggers our file processing pipeline, which performs some preliminary checks on
the read files before moving them to an archive area. If any check fails you will receive an error notification by email.
To correct these preliminary validation errors you do not need to repeat any of the submission process. Simply upload
fixed versions of the files to replace the originals and, if necessary, update the registered MD5 checksum (more on this
later). The runs will be updated automatically because the processing pipeline cycles through all files that are flagged
or unvalidated to see if there has been a change. The duration of this cycle depends on the queue. At quiet times it can
be less than 24 hours, but it can take several days during busier times, so please allow some time after you have
implemented your fix for the automatic email notifications to cease.
Error Type 1: Invalid file checksum
If ‘Invalid file checksum’ appears in your emailed error report
List of file processing errors:
FILE_NAME | ERROR | MD5 | FILE_SIZE | DATE | RUN_ID/ANALYSIS_ID
mbr_depth_05.bam | Invalid file checksum | 594934819a1571f805ff299807431da4 | 895557023 | 20-DEC-2016 14:02:50 | ERR1766300
mbr_depth_minus_05.bam | Invalid file checksum | a2becdf04ab799c4e208de6161b470b3 | 341165746 | 20-DEC-2016 14:00:46 | ERR1766407
A file checksum is the output of a hash function run over the contents of a file: a short string that is, for practical
purposes, unique to that content. When you upload a file to our ftp server the transfer may not complete, leaving a
corrupted or truncated file on our side. To detect this we calculate the checksum of the uploaded file. If it differs
from the checksum of the original file before you uploaded it, we can be sure that the file on our server is not an
exact copy. You can read this page on Wikipedia for more information about hash functions.
We use the MD5 hash algorithm, which you can run easily at the Linux command line on your local read files (on a
Mac the equivalent command is ‘md5’):
> md5sum mbr_depth_05.bam
594934819a1571f805ff299807431da4  mbr_depth_05.bam
> md5sum mbr_depth_minus_05.bam
99cf94b7287658254dd1be689fbc447d  mbr_depth_minus_05.bam
Outcome One: Corrupt File: Upload Again
In the example above, according to the email notification, file “mbr_depth_05.bam” has a registered MD5 of
594934819a1571f805ff299807431da4. When we calculate the checksum of the original file ourselves we find the
same MD5, so the registered checksum is correct. The table in the email reports that the uploaded file does not match
the registered checksum, so we can assume that the file was not transferred completely. To remedy this, try to upload
the file again. The file processing pipeline checks for a match on every cycle, and once it finds one the run will update
itself.
Outcome Two: Wrongly Registered MD5 checksum: Register new one
File “mbr_depth_minus_05.bam” has a different story. The registered checksum according to the email notification is
a2becdf04ab799c4e208de6161b470b3. When we calculate it locally we get 99cf94b7287658254dd1be689fbc447d.
It appears that the wrong MD5 is registered. To remedy this we need to change the registered MD5 checksum. To do
this, upload the correct checksum as a separate file. For file XXX the MD5 checksum should be in file XXX.md5, so
we need to create a file called mbr_depth_minus_05.bam.md5 that contains the correct MD5 checksum.
We should then upload this MD5 file to the same location as the original file (your Webin ftp directory):
>md5sum mbr_depth_minus_05.bam > mbr_depth_minus_05.bam.md5 # create MD5 file
>cat mbr_depth_minus_05.bam.md5 # check contents of new MD5 file
99cf94b7287658254dd1be689fbc447d mbr_depth_minus_05.bam
Remember the file processing pipeline cycles through all files that are flagged with an error. You do not need to repeat
the submission. Uploading the file again, or the checksum file, or both (for extra security) is sufficient to update the
run but you may continue to get errors by email for a day or 2 after (depending on the queue).
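The decision between the two outcomes can be scripted. The following is a minimal sketch, assuming the md5sum command is available (Linux); the function name diagnose_checksum is our own and purely illustrative. Pass it the local file and the registered MD5 taken from column 3 of the email.

```shell
# Hypothetical helper: compare the MD5 of a local file with the registered
# MD5 quoted in the error email and suggest a remedy.
# Assumes md5sum is available (Linux); on a Mac, substitute the md5 command.
diagnose_checksum() {
    file=$1
    registered=$2
    # md5sum prints "<checksum>  <file name>"; keep only the checksum
    local_md5=$(md5sum "$file" | cut -d ' ' -f 1)
    if [ "$local_md5" = "$registered" ]; then
        echo "match: registered MD5 is correct, upload the file again"
    else
        echo "mismatch: register $local_md5 in $file.md5"
    fi
}
```

A match corresponds to Outcome One (corrupt upload) and a mismatch to Outcome Two (wrongly registered checksum).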
If you don’t remember registering the MD5 checksum for each file when you submitted it, it will have happened in
one of 3 ways:
1. Our file uploader tool calculates the MD5 checksum automatically for any file that you upload. It then deposits
the ‘XXX.md5’ file itself.
2. You registered the MD5 checksum using the tsv columns at submission time (module 4, part 2, step 5).
This method is the most common source of wrongly registered checksums. In most other cases it is sufficient to
re-upload the file and assume the registered checksum is correct.
3. You uploaded ‘XXX.md5’ checksum files along with the XXX read files.
If you are submitting new runs you can use the procedure described above to register an MD5 checksum for each
file that you upload. If you use option 2 from above (register the checksum in the metadata tsv table) it will override
the checksum file present in your ftp directory. If you provide a checksum file for every read file you can leave the
checksum column(s) blank at the metadata registration stage.
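When you have many read files, the checksum files can be created in one go. This is a minimal sketch, assuming md5sum is available (Linux); the function name write_md5_files is our own and purely illustrative.

```shell
# Hypothetical helper: create an XXX.md5 file next to every file given as an
# argument, following the XXX.md5 naming convention.
# Assumes md5sum is available (Linux).
write_md5_files() {
    for f in "$@"; do
        # Run md5sum from inside the file's directory so that the .md5 file
        # records the bare file name rather than a path
        ( cd "$(dirname "$f")" && md5sum "$(basename "$f")" > "$(basename "$f").md5" )
    done
}
```

For example, write_md5_files *.bam creates one checksum file for each BAM file in the current directory.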
Error Type 2: Number of lines is not multiple of 4
This validation check helps to pick up errors in FastQ files. It is by no means thorough, but it can catch badly formatted
FastQ files before they enter the processing pipeline (after which, errors are harder to fix). You will have received an
email with a table like this.
List of file processing errors:
FILE_NAME | ERROR | MD5 | FILE_SIZE | DATE | RUN_ID/ANALYSIS_ID
SOC9/MCONS1_R1.fq.gz | File content missing or malformed, Number of lines in fastq is not multiple of 4 | c2f8455c1a024cfb96a6c91f5d71f534 | 1358349886 | 01-DEC-2016 03:12:35 | ERR1755094
SOC9/MDSD8_R2.fq.gz | File content missing or malformed, Number of lines in fastq is not multiple of 4 | 3729df0ab14b2f00e863780281ec69fc | 3324175122 | 01-DEC-2016 03:14:33 | ERR1755093
This is the check that is done on FastQ files:
zcat MCONS1_R1.fq.gz | grep -c '[^[:space:]]'
zcat and grep are commands that exist on both Linux and Mac. ‘zcat’ uncompresses the file and prints its contents,
and the grep command counts the lines that contain at least one non-whitespace character. A read in FastQ format is 4
lines long (header line + base calls + quality score header line + quality score calls) and so the total line count should
be a multiple of 4.
The output of the command above is divided by 4, and if the result is not a whole number an error is flagged and
the email notification is sent. To remedy the error, upload a version of the file that has the correct line count, using
the same file name and directory location as before so that it overwrites the existing file. You can check your files
before uploading them using the above command on a Linux machine.
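The check above can be wrapped into a small function for testing files before upload. This is only a sketch of the same idea, not our validator itself; the function name fastq_lines_ok is our own. It uses gzip -dc in place of zcat, which should behave the same way on Linux and Mac.

```shell
# Hypothetical helper: report whether a gzipped FastQ file has a line count
# that is a multiple of 4, mirroring the check shown above.
fastq_lines_ok() {
    # Count lines containing at least one non-whitespace character
    n=$(gzip -dc "$1" | grep -c '[^[:space:]]')
    if [ $((n % 4)) -eq 0 ]; then
        echo "OK: $n lines, $((n / 4)) read(s)"
    else
        echo "ERROR: $n lines is not a multiple of 4"
        return 1
    fi
}
```

The function returns a non-zero exit code on failure, so it can be used in a loop over all files that you intend to upload.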
IMPORTANT Final Step: The new file you upload will have a different MD5 checksum to the registered MD5
checksum. The registered checksum for each file is provided in the table in the email (column 3). To remedy this
follow this step from the previous section: Outcome Two: Wrongly Registered MD5 checksum: Register new one
Error Type 3: File integrity check failed
This error occurs when we cannot unpack or read the file. The problem is usually related to the format of the file.
Here are a few examples of the error notification that you might receive.
List of file processing errors:
FILE_NAME | ERROR | MD5 | FILE_SIZE | DATE | RUN_ID/ANALYSIS_ID
UK/BR1-20_2.fq.gz | File integrity check failed, Can’t unzip file | ef7e73ed95f64355d7bf7d48636b704f | 3801612790 | 22-DEC-2016 04:08:41 | ERR0757927
cetbiorep1.bam | File integrity check failed, File cannot be read using samtools | cecfa479356456cb6770986a6141bc44 | 800838646 | 24-MAY-2016 03:02:08 | ERR0332189
frger.cram | File integrity check failed, Can’t count number of records in the file using cram tools | 807a0f61da013916c1ca5f60b9b42526 | 2347399950 | 11-JAN-2017 14:59:49 | ERR363314
The integrity checks are different for each file type but they follow the same principle.
File Types
for compressed fastq files
zcat BR1-20_2.fq.gz > /dev/null 2>&1
echo $? # exit code of 1 or higher means that there was an error.
The Linux zcat command uncompresses the gzipped file (use bzcat for bzip2) and reads through it. The output is not
important at this stage, just the exit code. The output (and any human readable error message) is redirected to
/dev/null, which discards it. If the exit code of the program is greater than 0 we know there was some issue in
uncompressing the file and the error report gets generated. To fix the problem, check that your local file can be
uncompressed. You can use a similar approach to the above or try the -t flag of the gzip program, which tests the
integrity of a gzipped file (gzip -t <filename>).
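To test several compressed files at once before uploading, the gzip -t check can be placed in a loop. This is a minimal sketch; the function name check_gzip_files is our own and purely illustrative.

```shell
# Hypothetical helper: test each gzipped file given as an argument and list
# the result for each; returns non-zero if any file fails.
check_gzip_files() {
    status=0
    for f in "$@"; do
        # gzip -t exits with a non-zero code if the file cannot be read
        if gzip -t "$f" 2>/dev/null; then
            echo "OK     $f"
        else
            echo "FAILED $f"
            status=1
        fi
    done
    return $status
}
```

For example, check_gzip_files *.fq.gz lists every gzipped FastQ file in the current directory together with its result.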
for BAM files
samtools view cetbiorep1.bam > /dev/null 2>&1
echo $? # exit code of 1 or higher means that there was an error.
The preliminary validation of BAM files simply runs the samtools ‘view’ command on the file to check that it can be
unpacked and read. If the exit code of the program is greater than 0 we know that the samtools program was
not able to fully read the BAM file and this triggers the error report to be emailed.
for CRAM files
CRAM files are similar to BAM files with some additional steps. The reference needs to be downloaded before the file
can be unpacked. The validation checks are based on this process and you can test cram file integrity yourself before
uploading the file in a similar way to the previous file formats.
How to Fix
1. Obtain a working file that passes the same preliminary test that our own validator applies. Upload the fixed file
(same name and location as the previous version so as to overwrite it) to your Webin ftp directory.
2. The fixed file that you upload will have a different MD5 checksum to the registered MD5 checksum. The
registered checksum for each file is provided in the table in the email (column 3). To remedy this follow this
step from the previous section: Outcome Two: Wrongly Registered MD5 checksum: Register new one
3. Do not attempt to redo the submission. Uploading the file and registering its checksum will be enough to fix
the run object. Our system checks for updates to files regularly. This can take a few days depending on the file
queue so please allow a couple of days for the emails to cease.
Preparing a file for Upload
Most files submitted to the ENA need to be transferred to the ENA server in a process that is separate from the submission itself. When we talk about submissions we are usually talking about registering the metadata: the information
about the file and about where it comes from. This metadata usually gets registered in the form of objects. For example
a sample object represents the physical source material that is sampled for eventual sequencing. The file itself can be
the result of sequencing the sample, such as the output of the sequencing machine. Having a separate transfer step
means that files can be large and handled separately without interrupting or delaying the submission/registration steps.
When data files are uploaded to the ENA ftp server the submission is not complete. There is usually more to come by
way of this metadata registration. For instance, a read file submission requires project, sample, experiment, and run
objects, while a whole genome FASTA file needs a sample and a project object. An annotated sequence submission
requires at the very least a project object to belong to.
Most files uploaded to the ENA ftp server need to:
1. Be compressed
2. Have their MD5 checksum registered
Step 1: Compress the file using gzip or bzip2
Files that are in a human readable text format (FastQ, FastA, VCF, tsv, csv ...) should be compressed before you upload
them to the ENA ftp server. Files that are not in a human readable text format, like BAM, CRAM and SFF, are already in a
format that is efficient for transferring, so additional compression is not required (the file will fail to validate if it is
wrongly compressed). Also, with the exception of Oxford Nanopore files, do not tar archive any collections of files: each file should be uploaded separately.
If you are unsure about the format that your files should be in you can check here for standard file formats and here
for platform specific formats.
Tools used for compressing files are 3rd party so you can find out more about how to do this from outside the ENA
(a simple web search should be sufficient). However here is a basic example of compressing a file from within a Mac
operating system using the Terminal application.
user_01$ ls *fq
eg_01.fq
user_01$ gzip eg_01.fq
user_01$ ls *gz
eg_01.fq.gz
user_01$ gunzip eg_01.fq.gz
user_01$ ls *fq
eg_01.fq
user_01$ bzip2 eg_01.fq
user_01$ ls *bz2
eg_01.fq.bz2
user_01$
In the above example the user has listed all files in the current directory that end in ‘fq’ (there is one called ‘eg_01.fq’).
The user then compresses the file with the ‘gzip’ command, then reverts it to uncompressed form with the ‘gunzip’
command. Next the user compresses the file with the ‘bzip2’ command. Note that compressed files end in ‘.gz’
or ‘.bz2’ depending on which tool is used.
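The rule from this step (compress text formats, leave binary read formats alone) can be sketched as a shell function. The function name prepare_for_upload and the list of file extensions are our own assumptions for illustration; adjust them to the formats you actually submit.

```shell
# Hypothetical helper: gzip text-format files and leave binary read formats
# (BAM, CRAM, SFF) and already compressed files untouched.
# The extensions listed here are an assumption made for this sketch.
prepare_for_upload() {
    for f in "$@"; do
        case "$f" in
            *.bam|*.cram|*.sff)
                echo "skip (already compact): $f" ;;
            *.gz|*.bz2)
                echo "skip (already compressed): $f" ;;
            *)
                # gzip replaces the original file with <name>.gz
                gzip "$f" && echo "compressed: $f.gz" ;;
        esac
    done
}
```

Note that gzip replaces the original file with the compressed version, as in the Terminal session above.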
Step 2: Calculate the MD5 checksum for the file
MD5 is a hash function that can be run on any file to create a 32-character string that is, for practical purposes,
unique to that file (see the Wikipedia page on MD5). It is a bit like a fingerprint for the file. If the contents of the file
change in any way the MD5 checksum will change as well. The file name can change without affecting the MD5 checksum
because the calculation is done on the contents of the file only. The idea is that when you transfer your large file to us
it may not be transferred completely. If you tell us the MD5 checksum of the file that you have before it is uploaded and
then we calculate the checksum of the file that has been uploaded to us, we can tell if the upload was successful. If the
checksum we calculate matches the one you provided then the transfer was a success.
Hash functions are a common way of testing file identity and integrity so you can find out more about how to do this
from outside the ENA (a simple web search should be sufficient). However here is a basic example of calculating the
checksum for a file called ‘eg_01.fq.bz2’ using the Terminal application within the Mac operating system.
user_01$ md5 eg_01.fq.bz2
MD5 (eg_01.fq.bz2) = 74f085a6f3dd8b2877b89fcb592c7f5c
user_01$ md5 eg_01.fq.bz2 > eg_01.fq.bz2.md5
user_01$ cat eg_01.fq.bz2.md5
MD5 (eg_01.fq.bz2) = 74f085a6f3dd8b2877b89fcb592c7f5c
In the above example the user uses command ‘md5’ to calculate the checksum for the file. In a Linux operating
system this is equivalent to ‘md5sum’ command. Then the user does it again, but redirects the output to a file called
‘eg_01.fq.bz2.md5’. Finally the user checks the contents of the new file. This is an md5 file and can be used to register
the MD5 checksum of the original file with ENA.
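If you work on both Linux and Mac, a small wrapper can hide the difference between the two commands. This is a minimal sketch; the function name md5_of is our own. It prints only the 32-character checksum, which is convenient if you want to paste the value into submission metadata.

```shell
# Hypothetical helper: print only the 32-character MD5 checksum of a file,
# using md5sum where it exists (Linux) and md5 otherwise (Mac).
md5_of() {
    if command -v md5sum >/dev/null 2>&1; then
        # md5sum prints "<checksum>  <file name>"; keep only the checksum
        md5sum "$1" | cut -d ' ' -f 1
    else
        md5 -q "$1"   # -q prints the checksum only
    fi
}
```

For example, md5_of eg_01.fq.bz2 prints the checksum of the file from the example above without the file name.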
Registering the MD5 checksum with ENA
In the example above the data file to be submitted is called ‘eg_01.fq.bz2’. It is a compressed version of the original
file ‘eg_01.fq’. Compressing large files is advantageous because it takes less time to transfer them and this increases
the likelihood of a complete transfer without corruption. The MD5 checksum of file ‘eg_01.fq.bz2’ is contained in file
‘eg_01.fq.bz2.md5’. ENA requires the checksum that you have calculated so that we can compare it to the one that
we calculate once the file is on our ftp server. You can therefore upload this checksum file in addition to the data file
and our system will find it. ENA will understand as long as you abide by the naming convention XXX.md5, where XXX is
the name of the data file and XXX.md5 is a text file containing the MD5 checksum.
This is not the only way to register the checksum for a data file. When you come to submit the uploaded data file you
will find that you can include the 32 character checksum string in with the submission metadata. If you do include the
checksums in with the metadata at submission time then you do not have to accompany each data file with an md5 file
at upload time. Also note that the ENA file uploader (one of the upload options available) will automatically create an
MD5 file for every data file that it uploads and it will deposit this MD5 file (using the naming convention discussed)
along with the data file on the ftp server. That means that you do not need to provide MD5 checksums in the metadata
at submission time if you have used the ENA file uploader.
You cannot pool checksums from several data files into a single md5 file. The ENA file processing system will not be
able to interpret this. Each file must have its own md5 file (if you are choosing to register it that way).
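Before uploading, you may want to confirm that every data file has its own md5 file and that no md5 file contains pooled checksums. The following is a minimal sketch of such a check; the function name check_md5_companions is our own, and it treats any 32-character hexadecimal string as a checksum.

```shell
# Hypothetical pre-upload check: every data file must have its own XXX.md5
# file, and that file must hold exactly one checksum (no pooled checksums).
check_md5_companions() {
    status=0
    for f in "$@"; do
        if [ ! -f "$f.md5" ]; then
            echo "missing: $f.md5"
            status=1
            continue
        fi
        # Count the lines that contain a 32-character hexadecimal string
        n=$(grep -Eic '[0-9a-f]{32}' "$f.md5")
        if [ "$n" -ne 1 ]; then
            echo "expected 1 checksum in $f.md5, found $n"
            status=1
        fi
    done
    return $status
}
```

The function prints nothing and returns 0 when every file passes.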
File Validation Errors
A common cause of file validation errors is when the checksum that you provide does not match the one that we have
calculated. Automatic email notifications are set up to alert you of these problems. Remember the data file will not
be validated until you have submitted it - uploading a data file does not constitute a submission. If you do receive an
email about checksum mismatches then there is a chance that your transfers could not complete 100% and the files are
corrupted. It could also be the case that you accidentally registered the wrong checksum. You can re-upload any file
you like. Make sure it has the same name and is placed in the same subdirectory (if any) as the original. This should
solve a corrupt file issue if the second upload is 100% successful because its checksum will now match the registered
checksum. Alternatively if you believe the wrong checksum is registered simply upload a new checksum file with the
correct MD5 checksum in it. The file processing system at ENA checks and recalculates all unvalidated files cyclically
so once there is a match between the calculated and the registered MD5 value the file will be validated. You do not
have to repeat any part of the submission but the queue of unvalidated files is variable so at busy times it can still take
some time for the error notifications to cease. It is recommended to re-upload the data file and a checksum file so that
both scenarios are covered and your file will be validated without any further trouble.
There are other possible validation errors. For example we may not be able to uncompress your data file because it
is corrupted. You will need to upload a fixed version of the data file, and you must always accompany fixed files with
checksum files because the new file, having been changed, will have a different MD5 checksum from the original.
Often submitters provide a fixed file but forget to update the registered checksum so the
validation still fails. Also remember that replacement data files must always have the same file name as the original or
the system will not pick them up as replacements. If the file name itself must change it is usually necessary to submit a
new data file and cancel the problem submission. For most validation errors this is completely unnecessary so do not be
tempted to repeat a submission if you do not have to!
Step 3: Uploading the file
This is the final step before the submission. Instructions for this are well detailed already:
http://www.ebi.ac.uk/ena/about/sra_data_upload
Remember to upload the checksum file in addition to the data file unless you are going to register the checksum at
submission time or you are using the ENA file uploader instead. Here is a basic example of using FTP to upload a
data file called ‘eg_01.fq.bz2’ and its md5 file ‘eg_01.fq.bz2.md5’. The example is using the Terminal application in
the Mac operating system. See above link for more detailed instructions.
user_01$ ftp webin.ebi.ac.uk
Connected to hh-webin.ebi.ac.uk.
220 (vsFTPd 2.2.2)
Name (webin.ebi.ac.uk:user_01): Webin-XXX
331 Please specify the password.
Password:
230 Login successful.
Remote system type is UNIX.
Using binary mode to transfer files.
ftp> mput eg_01.fq.bz2
229 Entering Extended Passive Mode (|||42382|).
150 Ok to send data.
100% |***********************************|    51       25.65 KiB/s    00:00 ETA
226 Transfer complete.
50000 bytes sent in 05:00 (1.57 KiB/s)
ftp> mput eg_01.fq.bz2.md5
229 Entering Extended Passive Mode (|||41642|).
150 Ok to send data.
100% |***********************************|    54       48.20 KiB/s    00:00 ETA
226 Transfer complete.
54 bytes sent in 00:00 (1.92 KiB/s)
ftp> bye
221 Goodbye.
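If you prefer a non-interactive upload, cURL can also send files over FTP with its -T option. The sketch below is an assumption on our part rather than an official ENA procedure: it assumes the server accepts the same plain FTP login as in the session above, and that your credentials are held in the environment variables WEBIN_USER and WEBIN_PASS (names chosen for this example). Setting DRY_RUN=1 only prints the commands, with the password masked.

```shell
# Hypothetical helper: upload a data file and its .md5 file with curl instead
# of an interactive ftp session. WEBIN_USER and WEBIN_PASS are assumed to be
# set in the environment. With DRY_RUN=1 the curl commands are only printed.
upload_with_md5() {
    for f in "$1" "$1.md5"; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "curl -T $f --user $WEBIN_USER:*** ftp://webin.ebi.ac.uk/"
        else
            # A URL ending in "/" makes curl keep the local file name
            curl -T "$f" --user "$WEBIN_USER:$WEBIN_PASS" "ftp://webin.ebi.ac.uk/" || return 1
        fi
    done
}
```

Running it first with DRY_RUN=1 lets you check the file names before anything is transferred.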
Taxonomic classifications for your samples
The Tax database
Every ENA sample object should have a taxonomic classification. The INSDC maintains a database of all unique
taxonomy classifications known to us and you should apply one from this database when you create your samples.
Each classification has a unique id and this is expanded to show the scientific name and common name of the organism
when the sample is viewed.
The interactive submission service has a look up table which you can use before you download the spreadsheet template
so that you already know what taxonomy identifications to apply when you are creating your samples offline.
Submitters using the REST API apply the taxonomic information to the sample object using the sample_name block:
<SAMPLE_NAME>
<TAXON_ID>450267</TAXON_ID>
<SCIENTIFIC_NAME>Chlamyphorus truncatus</SCIENTIFIC_NAME>
<COMMON_NAME>Pink fairy armadillo</COMMON_NAME>
</SAMPLE_NAME>
REST access to the tax database
Submitters using the REST API to programmatically submit samples in XML format can use the taxonomy database
lookup to find the tax id they need to apply to their sample, using these REST endpoints:
If you know the scientific name of the organism you can find the taxonomy id with this endpoint:
www.ebi.ac.uk/ena/data/taxonomy/v1/taxon/scientific-name/. Simply append the scientific name to the URL. You
can use a browser or use cURL at the command line (the “see URL” program available on Linux and Mac). Note the
use of %20 to represent a space character. This is URL encoding, and you may find the commands do not work unless
you replace space characters with %20:
> curl "http://www.ebi.ac.uk/ena/data/taxonomy/v1/taxon/scientific-name/Leptonycteris%20nivalis"
[
{
"taxId": "59456",
"scientificName": "Leptonycteris nivalis",
"commonName": "Mexican long-nosed bat",
"formalName": "true",
"rank": "species",
"division": "MAM",
"lineage": "Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia; Eutheria; Laurasiatheria; Chiroptera; Microchiroptera; Phyllostomidae; Glossophaginae; Leptonycteris; ",
"geneticCode": "1",
"mitochondrialGeneticCode": "2",
"submittable": "true"
}
]
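The URL for a scientific name can be built in the shell so that you do not have to encode the spaces by hand. This is a minimal sketch; the function name taxon_url is our own, and it encodes space characters only.

```shell
# Hypothetical helper: build the scientific-name lookup URL for an organism,
# replacing each space with %20. Only spaces are encoded in this sketch.
taxon_url() {
    encoded=$(printf '%s' "$1" | sed 's/ /%20/g')
    echo "http://www.ebi.ac.uk/ena/data/taxonomy/v1/taxon/scientific-name/$encoded"
}
```

It could then be combined with cURL, for example: curl "$(taxon_url "Leptonycteris nivalis")".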
You can do the same with the common name. Use the endpoint http://www.ebi.ac.uk/ena/data/taxonomy/v1/taxon/any-name/
and append the name:
> curl "http://www.ebi.ac.uk/ena/data/taxonomy/v1/taxon/any-name/golden%20arrow%20poison%20frog"
[
{
"taxId": "377316",
"scientificName": "Atelopus zeteki",
"commonName": "golden arrow poison frog",
"formalName": "true",
"rank": "species",
"division": "VRT",
"lineage": "Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Amphibia; Batrachia; Anura; Neobatrachia; Hyloidea; Bufonidae; Atelopus; ",
"geneticCode": "1",
"mitochondrialGeneticCode": "2",
"submittable": "true"
}
]
If you do not know the scientific name or the common name but you have an idea, you can use this suggest endpoint:
http://www.ebi.ac.uk/ena/data/taxonomy/v1/taxon/suggest-for-submission/
> curl "http://www.ebi.ac.uk/ena/data/taxonomy/v1/taxon/suggest-for-submission/curry"
[
{
"taxId": "159030",
"scientificName": "Murraya koenigii",
"displayName": "curry leaf"
},
{
"taxId": "261786",
"scientificName": "Helichrysum italicum",
"displayName": "curry plant"
}
]
In each case above a JSON document is returned and you will be looking for the taxId field. Because the output is in
JSON format, the call is straightforward to automate if appropriate.
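As a minimal sketch of such automation, the taxId values can be pulled out of the response with sed. The function name extract_taxids is our own. It is a plain text match that relies on the one-field-per-line layout shown above, so a JSON parser would be more robust.

```shell
# Hypothetical helper: read a taxonomy JSON response on standard input and
# print each taxId value, one per line. Relies on each "taxId" field being
# on its own line, as in the responses shown above.
extract_taxids() {
    sed -n 's/.*"taxId"[[:space:]]*:[[:space:]]*"\([0-9]*\)".*/\1/p'
}
```

For example: curl -s "http://www.ebi.ac.uk/ena/data/taxonomy/v1/taxon/suggest-for-submission/curry" | extract_taxids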
Environmental taxonomic classifications
Every sample object in the ENA must have a taxonomic classification assigned to it. Environmental samples, typically collected for metagenomic studies, cannot have a single organism identifier because they represent an
environment with an unknown variety and number of organisms. For this purpose we have entries in the taxonomic
database that apply exclusively to environmental samples. You can search for these terms using the methods described
above. They tend to have “metagenome” as part of the scientific name.
curl "www.ebi.ac.uk/ena/data/taxonomy/v1/taxon/suggest-for-submission/marsupial%20meta"
[
{
"taxId": "1477400",
"scientificName": "marsupial metagenome",
"displayName": "marsupial metagenome"
}
]
To give an idea of which environmental sample names are available, a list is given below. This list is not regularly
updated, so it may be worth trying the suggest-for-submission lookup described above to see if you can find a term that
better represents your environmental samples. The following terms go in the scientific name field of the sample object.
To find the tax id use the method outlined above (scientific-name endpoint). For example you can paste the following into
your browser to find the tax id for termite fungus garden metagenome:
http://www.ebi.ac.uk/ena/data/taxonomy/v1/taxon/scientific-name/termite fungus garden metagenome
metagenome
synthetic metagenome
ecological metagenomes
organismal metagenomes
Specific
ecological metagenomes sub nodes
activated carbon metagenome
activated sludge metagenome
aerosol metagenome
air metagenome
alkali sediment metagenome
anaerobic digester metagenome
anchialine metagenome
ant fungus garden metagenome
aquatic metagenome
aquifer metagenome
ballast water metagenome
beach sand metagenome
bioanode metagenome
biocathode metagenome
biofilm metagenome
biofilter metagenome
biofloc metagenome
biogas fermenter metagenome
bioreactor metagenome
bioreactor sludge metagenome
biosolids metagenome
cave metagenome
clinical metagenome
cloud metagenome
coal metagenome
cold seep metagenome
compost metagenome
concrete metagenome
coral reef metagenome
cow dung metagenome
crude oil metagenome
decomposition metagenome
dietary supplements metagenome
dust metagenome
electrolysis cell metagenome
estuary metagenome
fermentation metagenome
fertilizer metagenome
floral nectar metagenome
flotsam metagenome
food contamination metagenome
food fermentation metagenome
food metagenome
food production metagenome
freshwater metagenome
freshwater sediment metagenome
fuel tank metagenome
gas well metagenome
glacier lake metagenome
glacier metagenome
groundwater metagenome
halite metagenome
herbal medicine metagenome
honey metagenome
hospital metagenome
hot springs metagenome
HVAC metagenome
hydrocarbon metagenome
hydrothermal vent metagenome
hypersaline lake metagenome
hyphosphere metagenome
hypolithon metagenome
ice metagenome
indoor metagenome
industrial waste metagenome
interstitial water metagenome
lagoon metagenome
lake water metagenome
landfill metagenome
leaf litter metagenome
lichen crust metagenome
lobster shell metagenome
mangrove metagenome
manure metagenome
marine metagenome
marine plankton metagenome
marine sediment metagenome
metal metagenome
microbial fuel cell metagenome
microbial mat metagenome
milk metagenome
mine drainage metagenome
mine tailings metagenome
mixed culture metagenome
money metagenome
moonmilk metagenome
mud volcano metagenome
museum specimen metagenome
musk metagenome
neuston metagenome
oasis metagenome
oil field metagenome
oil metagenome
oil production facility metagenome
oil sands metagenome
outdoor metagenome
paper pulp metagenome
parchment metagenome
peat metagenome
periphyton metagenome
permafrost metagenome
phytotelma metagenome
pitcher plant inquiline metagenome
plastisphere metagenome
pond metagenome
poultry litter metagenome
power plant metagenome
probiotic metagenome
retting metagenome
rhizoplane metagenome
rhizosphere metagenome
rice paddy metagenome
riverine metagenome
rock metagenome
rock porewater metagenome
root associated fungus metagenome
saline spring metagenome
salt lake metagenome
salt marsh metagenome
salt mine metagenome
saltern metagenome
sand metagenome
seawater metagenome
sediment metagenome
shale gas metagenome
silage metagenome
sludge metagenome
snow metagenome
snowblower vent metagenome
soda lake metagenome
soil crust metagenome
soil metagenome
solid waste metagenome
steel metagenome
stromatolite metagenome
subsurface metagenome
surface metagenome
tar pit metagenome
termitarium metagenome
termite fungus garden metagenome
terrestrial metagenome
tidal flat metagenome
tin mine metagenome
tobacco metagenome
tomb wall metagenome
urban metagenome
wastewater metagenome
wetland metagenome
whale fall metagenome
wine metagenome
wood decay metagenome
organismal metagenomes sub nodes
algae metagenome
annelid metagenome
ant metagenome
aquatic viral metagenome
bat metagenome
bear gut metagenome
beetle metagenome
bird metagenome
blood metagenome
bovine gut metagenome
bovine metagenome
cetacean metagenome
chicken gut metagenome
ciliate metagenome
coral metagenome
crab metagenome
crustacean metagenome
ctenophore metagenome
dinoflagellate metagenome
ear metagenome
echinoderm metagenome
endophyte metagenome
epibiont metagenome
eye metagenome
feces metagenome
feline metagenome
fish gut metagenome
fish metagenome
flower metagenome
fossil metagenome
frog metagenome
fungus metagenome
gill metagenome
gonad metagenome
grain metagenome
grasshopper gut metagenome
gut metagenome
honeybee metagenome
human bile metagenome
human blood metagenome
human brain metagenome
human eye metagenome
human gut metagenome
human gut metagenome gcode 4
human lung metagenome
human metagenome
human milk metagenome
human nasopharyngeal metagenome
human oral metagenome
human reproductive system metagenome
human saliva metagenome
human semen metagenome
human skeleton metagenome
human skin metagenome
human tracheal metagenome
human vaginal metagenome
hydrozoan metagenome
insect gut metagenome
insect metagenome
invertebrate gut metagenome
invertebrate metagenome
jellyfish metagenome
koala metagenome
leaf metagenome
lichen metagenome
liver metagenome
lung metagenome
marsupial metagenome
mite metagenome
mollusc metagenome
mosquito metagenome
moss metagenome
mouse gut metagenome
mouse metagenome
mouse skin metagenome
nematode metagenome
oral metagenome
oral-nasopharyngeal metagenome
ovine metagenome
oyster metagenome
parasite metagenome
phage metagenome
phyllosphere metagenome
pig gut metagenome
pig metagenome
placenta metagenome
plant metagenome
pollen metagenome
primate metagenome
psyllid metagenome
rat gut metagenome
rat metagenome
reproductive system metagenome
respiratory tract metagenome
rodent metagenome
root metagenome
scorpion gut metagenome
sea anemone metagenome
sea squirt metagenome
sea urchin metagenome
seagrass metagenome
seed metagenome
sheep gut metagenome
sheep metagenome
shoot metagenome
shrimp gut metagenome
skin metagenome
snake metagenome
spider metagenome
sponge metagenome
stomach metagenome
symbiont metagenome
termite gut metagenome
termite metagenome
tick metagenome
upper respiratory tract metagenome
urine metagenome
urogenital metagenome
vaginal metagenome
viral metagenome
wallaby gut metagenome
wasp metagenome
zebrafish metagenome