training_modules Documentation
Release 1
Marc B Rossello
Jul 26, 2017

Contents

1 Interactive Submissions
  1.1 Module 1: Submission Options
  1.2 Module 2: Create a Project
  1.3 Module 3: Register Source Samples
  1.4 Module 4: Add Read files
  1.5 Module 5: Updates (Samples and Projects)
2 Programmatic Submissions
  2.1 Module 1: Create a Study
  2.2 Module 2: Submit an Annotated Sequence
  2.3 Module 3: Flat File upload - Submit an ENA Supported Sequence File
  2.4 Module 4: Update a Study using REST API
  2.5 Module 5: Submitting Sample objects
  2.6 Module 6: Updating Sample objects using REST API
3 Tips and FAQs
  3.1 Solving Error Notifications (runs)
  3.2 Preparing a file for Upload
  3.3 Taxonomic classifications for your samples

CHAPTER 1 Interactive Submissions

Module 1: Submission Options

The majority of submissions to the ENA begin here.

1. Log in and access the “new submission” tab.
2. If you have not already, create a study using this option. Complete this step BEFORE going on to step 3.
Module 2 describes this step in more detail.
3. If you have not already, create sample objects to represent your source material. Complete this step before going on to step 4. Module 3 describes this step in more detail.
4. You are nearly ready to register your NGS read files. First upload them to your ENA ftp directory (every account has one). This Java applet does not work in all environments. See here for alternative upload methods.
5. This step combines several of the steps above, but it is preferable to split the job up so that you have already registered a study and some samples. Use this step to create runs and experiments. Module 4 describes this step in more detail. This step links everything together under the project: experiment and run objects associate read files with their source sample and a study.

Module 2: Create a Project

This form is used to create a study object (see module 1 to access this form). It is possible to create a study before any other data is added. Webin will report an accession id for the study that looks like this: PRJEB00000. This type of accession is typically used in journal publications. Data can be added to the study at any time; the location of the study in the ENA browser will stay the same.

The study will not be visible in the ENA browser until the release date has passed (*). This means that the data linked to the study will not be visible until the study itself is visible. Have a look at an example of a study in the ENA browser.

Module 3: Register Source Samples

Part 1

This is the first of 3 sample registration forms. See module 1 if you do not know where to access this form.

1. Find a checklist that suits your type of sample. A checklist consists of a list of attributes that are required to annotate your samples.
A well-annotated sample is more searchable in the ENA browser, so your data will get more exposure.
2. Move on to the next sample step.
3. Use this option if you have created your samples as a spreadsheet file in a previous session. This spreadsheet has a very specific format. You can obtain one in the next sample step.

Part 2

This is the second of 3 sample registration forms.

1. Take a look at the list of attributes on the left. Some are mandatory, others are recommended. Every checked item in the list appears as a field on the right side of the form. Select or deselect items as appropriate. Remember that the more fields you provide, the more accurately your users can interpret your study.
2. You can create additional attributes that do not exist in the checklist. In most cases, however, you should find what you need among the default checklist fields.
3. The right side of the web form represents all samples as a template. Because this form applies to all samples, it is only worth entering values that are the same for every sample. Also use the web form to look up taxonomic classifications which you will use later. Start typing your organism name to see the suggestions. Note that environmental taxonomic classifications can look like “soil metagenome” as opposed to a specific organism scientific name. You can also use the ‘i’ symbols to read definitions for each field, and check the drop-down options for the fields that have a controlled vocabulary (*).
4. The download template button will download a tab-separated file which you can open using a spreadsheet program. It is highly recommended to use this to register your samples. Each row represents an individual sample.
Please do not edit or remove the lines marked with a hash ‘#’ and do not change the order of the columns, as this will prevent the spreadsheet from being uploaded back into the web form. Begin the first sample on the first available row.
5. Step 5 is in parentheses because in most cases you fill in the spreadsheet offline and log in again after you have completed it. The completed spreadsheet is loaded into the previous sample registration form (Part 1 step 3). This has the same effect as the ‘next’ button: it takes you to the third and final sample registration form.

Part 3

This is the last of 3 sample registration forms. This form appears after uploading a spreadsheet into the form in Part 1 step 3, or directly from the form in Part 2 if you have not used the spreadsheet file and intend to type directly into the webform.

1. If you have uploaded a spreadsheet file, the number of rows corresponds to the number of samples (you can skip this step). If you have not used a spreadsheet, you can specify how many samples to create using the template you created in the second form.
2. Add some basic sample group details. A sample group has limited functionality. It is a collection of samples that are created in the same submission event. The samples can be edited as part of the same group later on if necessary. It is not possible to move samples in or out of a group. The study object is used to group samples and other objects together in the public domain.
3. The samples are loaded into the webform below these 2 buttons. You can check each one in the list by using these buttons to navigate one sample at a time.
4. Check whether any fields are rejected by the webform (where you see a red exclamation mark). Your values may not be valid because some fields are controlled.
5. This table is a summary of all samples. It can be large, but you can move through the pages using the arrows (red asterisk in image).
If all fields in a sample are accepted by the webform you will see a green tick under the ‘Valid’ column. If there are any red crosses, navigate to the sample in question (or click on that row in the table) and go back to step 4 to correct the invalid fields. If it is easier to correct the samples in your offline spreadsheet, do so and use the ‘previous’ button (red $ in image) twice to go back to the first form, where you will see a red cross symbol next to the file name. Click on the cross and you will be able to reload the spreadsheet file.
6. Click submit if all samples in the table are valid (previous step). Webin will deliver accessions for each sample unless there is a problem. If there is an error you can go back to step 5 to correct it and then try again. If accessions are delivered, the samples are now in the ENA database. They are not yet affiliated with any data or other objects; that happens in subsequent rounds of submissions. For the moment they are ‘free’.

Module 4: Add Read files

Part 1

This is the first page that you will come to when submitting runs and experiments (see module 1). A run object is used to register a demultiplexed NGS read file (or pair of files) that you have uploaded (for example, Fastq, BAM, SFF, CRAM) to your ENA ftp directory. Without run objects the files cannot be registered and archived. An experiment object represents a library solution used on the NGS machine. The experiment object also links the run to the sample and to the study.

1. Select the study that you will be adding the runs and experiments to. If the study that you want to submit to does not exist yet you can create one now (red asterisk). However, it is best to split your submission up and create the study as part of an earlier session (see module 1).
2. Click next to move to the next stage. The next stage is the sample generation stage.
In most cases the samples will have already been generated (it is best to submit the samples in a separate submission so that the work is divided up). Find the ‘skip’ option to skip this step. If the samples do not exist, do not use the skip option; you can create some samples during this step (see module 1 and module 3).

Part 2

This is the step for registering the files that you have uploaded to your personal ENA ftp directory. We need to wrap each file or pair of files into a run object, point that run to an experiment object, and point that experiment object to the correct sample.

1. Choose the type of file that you are submitting. Note that in the case of paired runs there is a 2 x fastq file option.
2. Any information you type into the webform will be lost if you log out before submitting, so you are strongly advised to download a tab-separated spreadsheet file (step 6) and fill it in offline. First, note that some fields have drop-down lists. Check the options in these so that you can apply them correctly in the spreadsheet when offline.
3. Every row in the table represents one run and one experiment object, and each needs a source sample. The drop-down for the sample column does not work in most cases, so you should know how you have named your samples, or you can check by way of the sample tab (*). It is possible to give multiple runs the same source by repeating the sample id in multiple rows (for instance, in the case of a deep coverage experiment where multiple lanes have been used).
4. The file names correspond to the files that you have uploaded to your Webin ftp directory. Each run object gets matched with files which are uploaded separately. Here is a list of ways you can upload your files. File names should be written exactly as they appear in the ftp directory. For instance, FastQ files must be compressed and so will carry the extension “.gz” or similar.
The extension should be included when referencing the files in this column.
5. The checksum is a fingerprint for the file. If a file is not transferred completely we are left with a corrupted or truncated version, which is of no use, so we need to check that this has not happened. If the checksum differs after the transfer we know there has been a problem. You therefore need to supply the checksum of the file as it was before upload: we calculate the checksum of the uploaded file and compare it with the one that you have provided.
   You can paste the checksum directly into this column. It is a 32 character string. Alternatively, put the 32 character string into its own file and upload this checksum file with the original file. The checksum file has to be named so that it can be recognised: it needs the same name as the original file PLUS the extension “.md5” (so for file XXX the md5 checksum should be in file XXX.md5). If you have uploaded a checksum file for each read file then you can leave this column blank. Do not write the checksum file name (file XXX.md5) into the field: Webin will report an error saying that it expects a 32 character string.
   The Webin uploader tool automatically deposits a checksum file in your ftp directory for every file that you upload, so if you have used this tool leave the column blank. The Webin uploader tool uses Java applet technology, which browsers are generally restricting or discontinuing because of security risks, so the uploader tool may not be an option in your environment.
   So how do you create your own checksum file? On a Linux machine it is easy: at the command line, type (without the quotes) “md5sum <file name>” and it will display a line formatted like this: <32 character md5sum><2 spaces><file name>.
   This is exactly the format our system will recognise in a checksum file, so simply redirect the output (using the ‘>’ symbol) to a checksum file: “md5sum file_name > file_name.md5”. Then upload this file along with the original one before you reach step 8. Apple Mac operating systems have a similar checksum generator that you can use. It is also possible on a Windows operating system, but you may have to download third-party software to do it. More info here.
6. Any information that you type into the webform will be lost if you log out before submitting. You should therefore download a template tab-delimited spreadsheet file, which you can open in a spreadsheet program like MS Excel. Once you have filled it in offline, log back in, return to this submission page and upload it (step 7). The other advantage of having an offline copy of your experiments and runs is that if there is a problem submitting the data you can send the spreadsheet file to the ENA helpdesk and they can troubleshoot it for you.
7. Upload the completed spreadsheet file that you created in step 6. The web form should fill up with the data in your spreadsheet. You can do a preliminary check to see if some fields are not recognised (check the controlled drop-down lists and that file names appear as expected).
8. This is the final step. If errors are reported you can remove the loaded table (use the cross that has appeared in step 7), then make your edits to the tsv spreadsheet and try again (from step 7). If you need to send the tsv spreadsheet to the ENA helpdesk for troubleshooting (as mentioned in step 6), ensure that the project and samples are already submitted (modules 2 and 3) so that the ENA officer can focus only on the step that is failing.

Module 5: Updates (Samples and Projects)

The interactive web-based GUI (Webin) has some support for editing existing objects.
This module is concerned with sample and project objects. Access existing objects from the following tabs (after logging into Webin).

Sample Edit

A sample group is an internal concept (do not quote sample group ids in any publications) which groups samples together for one purpose: so that you can edit them in bulk. The only way to ensure a collection of samples is in the same group is to submit them at the same time (during the same submission event). If you need to edit samples in bulk but they are not in the same sample group you can use the REST API (more details to come).

First choose a sample from the sample tab or a sample group from the sample group tab. Click the ‘edit’ button for that sample/group. You will come to a screen like this:

1. The left panel is used to select the sample that you would like to edit. Even if you selected a single sample from the sample tab the whole group will still be displayed.
2. This is another way to select the sample that you would like to edit: you can go through the list one by one.
3. It is not possible to add or remove samples from a group, or to change the associated checklist, but you can add/remove fields from the previously selected checklist.
4. The right-hand panel expands whichever sample you have selected in step 1. You can change the content of the fields using this panel.
5. These little boxes are clickable. Click on a box to copy the content of its field to all the other samples in the sample group (for fields that are common to all samples).
6. When you have completed your edits click save.
7. Warning! Although you can download a spreadsheet you cannot yet upload it again, so you cannot use this option to edit samples yet. It can still be useful to obtain a spreadsheet similar to the one that you used to submit the samples in the first place. Editing by tsv spreadsheet should be possible in the future.

Study Edit

Some parts of the study object can be edited. You can change the release date or release the whole study. You can also edit titles and descriptions, and add publications, which will become clickable links when the study goes live in the ENA browser.

1. Log in to Webin and find the studies tab.
2. If you have a long list of studies you can search for one by name or accession. This functionality exists in the other tabs too.
3. If your study is confidential you can change the release date by clicking on the pencil icon. A calendar will open so that you can navigate to the required date. To release the study simply select the current date. Releasing a study causes all the data associated with that study to be released as well. Releasing a study sets several stages in motion:
   • Moving read files and sequence files from our confidential archive to the public servers
   • Indexing and rendering the study and its affiliated objects so that they can be linked to and visualised in the ENA browser
   • Mirroring to the INSDC databases, which then follow similar procedures so the data is searchable and viewable in their web portals.
   These stages are usually complete in a couple of days, but please allow several days at busy times or when technical problems cause the queue of jobs to build up.
4. For edits other than changing the release date, click the edit button next to the study that you need to edit. This will expand the study into an editable webform.
5. There are various text boxes that you can edit if you need to. The short name for the study is visible in search outputs and overview pages, whereas the descriptive title and abstract are viewable when the study has its own webpage (when the hold date has expired).
6. You can add a publication by clicking the ‘Add’ button (a fresh row will appear) and inserting the pubmed id.
This will result in a hyperlink on the main study page, so that the publication is linked from the study (when it is public).
7. Study ‘attributes’ are optional. They act as keywords and can help expose the study to more specific searches. In some cases we will standardise some attributes and index them. These may be related to specific projects known to ENA and will help filtering and searching. Each keyword needs a ‘tag’, which is the name of the field, and an actual value (called ‘FieldType’). Some submitters add their DOI as a keyword when they do not have a pubmed id, so the tag is something like ‘DOI’ and the value is the DOI itself.
8. Remember to save changes when you are finished!

CHAPTER 2 Programmatic Submissions

Module 1: Create a Study

The Study Object

Objects such as a study or a sample are stored in the ENA in XML form like this:

    <?xml version = '1.0' encoding = 'UTF-8'?>
    <PROJECT_SET>
       <PROJECT alias="iranensis_wgs" center_name="HKI JENA" accession="PRJEB5932">
          <NAME>WGS Streptomyces iranensis</NAME>
          <TITLE>Whole-genome sequencing of Streptomyces iranensis</TITLE>
          <DESCRIPTION>The genome sequence of Streptomyces iranensis (DSM41954) was obtained using Illumina HiSeq2000. The genome was assembled using a hybrid assembly approach based on Velvet and Newbler. The resulting genome has been annotated with a specific focus on secondary metabolite gene clusters.</DESCRIPTION>
          <SUBMISSION_PROJECT>
             <SEQUENCING_PROJECT>
                <LOCUS_TAG_PREFIX>SIRAN</LOCUS_TAG_PREFIX>
             </SEQUENCING_PROJECT>
             <ORGANISM>
                <TAXON_ID>576784</TAXON_ID>
                <SCIENTIFIC_NAME>Streptomyces iranensis</SCIENTIFIC_NAME>
                <CULTIVAR>DSM41954</CULTIVAR>
             </ORGANISM>
          </SUBMISSION_PROJECT>
          <PROJECT_LINKS>
             <PROJECT_LINK>
                <XREF_LINK>
                   <DB>PUBMED</DB>
                   <ID>25035323</ID>
                </XREF_LINK>
             </PROJECT_LINK>
          </PROJECT_LINKS>
       </PROJECT>
    </PROJECT_SET>

Creating objects in XML format is not always necessary.
The Webin submission tool can create a project from a webform. It will convert the form data into XML and load it into the ENA database. However, you will find that in some cases there is more flexibility in creating submittable XML objects yourself and bypassing the interactive submission tool. Do consider using the interactive Webin submission tool to create a study and then adding the other objects programmatically instead. It is fine to mix and match submission routes, and you may find that programmatic submission is better suited to repetitive submission tasks, of which project creation is not normally one.

A study (sometimes referred to as a project) in the ENA is used to group other objects together, so we will look into creating a project/study as a first step towards learning to submit ENA objects programmatically.

Create the XML

Below is a template. Do not use any default values: enter your own information and save it as a file. For example, you may call it “project.xml”.

    <?xml version = '1.0' encoding = 'UTF-8'?>
    <PROJECT_SET>
       <PROJECT alias="cheddar_cheese" center_name="">
          <TITLE>Characterisation of Microbial Diversity and Chemical Properties of Cheddar Cheese Prepared from Heat-treated Milk</TITLE>
          <DESCRIPTION>This study aimed to characterise the interaction of microbial diversity and chemical properties of Cheddar cheese after three different heat treatments of milk</DESCRIPTION>
          <SUBMISSION_PROJECT>
             <SEQUENCING_PROJECT/>
          </SUBMISSION_PROJECT>
       </PROJECT>
    </PROJECT_SET>

In your file “project.xml” paste the above XML but change the alias="" attribute, giving it a unique name. You may need this unique name to refer to your project when adding other objects to it. It can be a short acronym but it should be meaningful/memorable in some way (instead of just a number). Also provide a center name in center_name="". The center name is specific to your Webin account. You chose it when you set up the account. Log in to confirm your centre name.
Within the <DESCRIPTION></DESCRIPTION> block add an abstract detailing the project, including any information that may be useful for someone to interpret your project correctly. Within the <TITLE></TITLE> block add a descriptive title.

Create a Submission XML

To register the submission of a project or any other object(s), you need an accompanying submission XML in a separate file. Let’s call the file “sub.xml” for this purpose.

    <?xml version="1.0" encoding="UTF-8"?>
    <SUBMISSION alias="cheese" center_name="">
       <ACTIONS>
          <ACTION>
             <ADD source="project.xml" schema="project"/>
          </ACTION>
       </ACTIONS>
    </SUBMISSION>

This file simply registers an ‘action’ on the ENA servers. In this case the action is to <ADD/> a project object(s) using the XML file “project.xml”. Make sure that project.xml and sub.xml are in the same directory on a Linux file system (Mac/Unix should work too). If you do not want to use the command line, or if you are using a Windows operating system, it is also possible to register the submission via a web form in any internet browser (more details to come). Add an alias to the submission XML to mark the submission event, and add your centre name as before.

Send the XML files to ENA

cURL is a Linux/Unix command line program which you can use to send the XMLs to the ENA server along with authentication.

    curl -k -F "SUBMISSION=@sub.xml" -F "PROJECT=@project.xml" "https://www-test.ebi.ac.uk/ena/submit/drop-box/submit/?auth=ENA%20Webin-NNN%20PASSWORD"

From the directory containing the files sub.xml and project.xml, run cURL as above. You must replace Webin-NNN with your Webin account id and PASSWORD with your account password. The %20 is URL encoding for a space character. Leave these in place. After running the command above a receipt in XML format is returned.
It will look like the one below (it won’t be line wrapped, but you can copy and paste it or redirect the cURL output to a separate file).

    <?xml version="1.0" encoding="UTF-8"?>
    <?xml-stylesheet type="text/xsl" href="receipt.xsl"?>
    <RECEIPT receiptDate="2017-05-09T16:58:08.634+01:00" submissionFile="sub.xml" success="true">
       <PROJECT accession="PRJEB20767" alias="cheddar_cheese" status="PRIVATE" />
       <Submission accession="ERA912529" alias="cheese" />
       <MESSAGES>
          <INFO>This submission is a TEST submission and will be discarded within 24 hours</INFO>
       </MESSAGES>
       <ACTIONS>ADD</ACTIONS>
    </RECEIPT>

It is possible to use a browser to register the XML files instead of using cURL at the command line. See here.

Simply use the study row and the submission row to browse to the project.xml file and the sub.xml file respectively, then add your Webin account and password in the username and password fields before clicking submit. You should receive the receipt in the browser window.

The Receipt XML

Note the info message in the receipt:

    <INFO>This submission is a TEST submission and will be discarded within 24 hours</INFO>

It is advisable to run your submissions through the ENA test server, where changes are not permanent and are erased every 24 hours. If you are happy with the result of the submission you can run the cURL command again, but this time on the production server. Simply change the part in the URL from /www-test.ebi.ac.uk to /www.ebi.ac.uk and remove the -k flag:

    curl -F "SUBMISSION=@sub.xml" -F "PROJECT=@project.xml" "https://www.ebi.ac.uk/ena/submit/drop-box/submit/?auth=ENA%20Webin-NNN%20PASSWORD"

If you are using the webform instead of cURL at the command line you will get the receipt XML displayed in your browser. Similarly, to submit via webform to the production server, change the part in the webform URL from /www-test.ebi.ac.uk to /www.ebi.ac.uk.

To know whether the submission was successful, look at the first line of the <RECEIPT> block. The attribute success will have the value true or false. If the attribute is false the submission did not succeed. In that case check the rest of the receipt for error messages and, after making corrections, try the submission again. If the success attribute is true the submission was successful. The receipt will contain the accession numbers of the objects that you have submitted. In the case of an ENA study/project this is likely to be the accession that you will include in a publication.

    <PROJECT accession="PRJEB20767" alias="cheddar_cheese" status="PRIVATE" />

Module 2: Submit an Annotated Sequence

Annotated sequences can be any number of sequences that are assembled from shorter reads or sequenced using Sanger capillary sequencing. They can be annotated with features such as coding domains, introns, exons, non-coding RNA etc. Typical sequences submitted to the ENA are rRNA genes, single CDS genomic DNA sequences, MHC genes, mRNA and many more. Most submitters will use the interactive Webin submission system to submit these types of sequences:

The analysis object

This is a guide for programmatic submission of annotated sequences. This submission route is useful for automating your submissions if you expect to be submitting large numbers of sequences at regular intervals. For one-off or small-scale submissions you are encouraged to use Webin instead.

The ENA metadata model uses various objects to hold information and group other objects together. Annotated sequences are wrapped in an analysis object. The other objects are frequently used in read data submissions and whole genome submissions. The analysis object can point to a study and samples.
It is not necessary to register a sample object for an annotated sequence submission, but you should have a study available before you submit the analysis/annotated sequence. Studies are used to group other objects together. You may well use the study again in the future to submit additional data types including read data and whole genomes. A study can package together all elements of a typical publication.

A word about Accession Numbers

Annotated sequences are submitted as TSV spreadsheet files. One analysis object wraps one TSV file, but a TSV file may contain many sequences (each row = 1 annotated sequence). Templates with predefined columns are available. A TSV template is specific to a type of sequence, so each TSV/analysis can hold multiple sequences but they will all be of the same type. For example, if you have 10 rRNA genes and 20 single protein coding genes as part of the same study then you will use 2 different TSV templates, submitted as 2 separate analysis objects, one with 10 rows and the other with 20 rows. All 30 rows will be converted into EMBL sequence files and each sequence file will be accessioned. The analysis objects will be accessioned too, but this is for internal ENA tracking. Do not quote an analysis (ERZxxxxxx) accession when referring to an annotated sequence. Only quote the sequence accessions (as a range, for example, if there are many). You can also quote the study accession (PRJEBxxxx), especially if you have a collection of data to report. The analysis object is used to submit other file types as well, and in some cases it is appropriate to reference an analysis accession.

At submission time you will not receive any sequence accessions. These will come later by email (multiple email accounts can be registered per Webin account). After submission the TSV file is moved to a staging area and each row is converted into an EMBL sequence flatfile.
The flatfiles are then validated and accessioned. After this the accessions are emailed and the sequences are moved to the confidential or public archive depending on the status of the encompassing study (a public study will make the sequences public too).

Step 1: Create a study

If you already have a study you can add your annotated sequence entries to it. If you do not, you need to create one first. Use either the interactive submission route or the programmatic submission route to do this.

Step 2: Get hold of the TSV template

Sequences are submitted as TSV spreadsheets. You can use the Webin submission option “Submit other assembled and annotated sequences [formerly EMBL-Bank]” to get hold of the template that you will be using. You only need to do this once for each type of sequence that you are submitting. Once you have the template(s) you can submit without logging in to Webin.

For this example I chose sequence type rRNA gene and then navigated to the page where there was an option to download the template:

The downloaded file is called something like “Sequence-ERT000002-5697110325950293078.tsv”. Take note of the ERT number, which in this example is ERT000002. It represents the sequence type (rRNA gene in this case). This is required later: the system needs to know the sequence type so that it can create the right EMBL file from the TSV.

To fill in the TSV you can use a spreadsheet editor. Each row in the TSV is a separate sequence record. The last column is for the sequence and the others are for annotation fields. It is a bit like a FASTA file, except that the header and sequence are on one line instead of two and the fields are tab separated.

Step 3: Upload the TSV file to your FTP directory

After submission, the TSV file will be accessed from your Webin FTP directory (all accounts have some space on the ENA FTP server for this purpose) for processing.
So before going any further you need to compress the TSV file and upload it to your Webin FTP directory. A full set of instructions can be found here. You also need to register the MD5 checksum for the TSV file. This can be done in the next step (by adding it to the analysis XML object) or you can do it now by uploading a supplementary checksum file in addition to the TSV file. So if your TSV is called ethylomonas.tsv.gz, the file with the checksum in it is called ethylomonas.tsv.gz.md5. See here for guidelines on preparing files for a submission.

Step 4: Prepare the Analysis XML file

The TSV file, now sitting in your Webin FTP directory, is registered/submitted using the ENA XML REST API. Create an analysis object as an XML file. Note that this analysis object references a study (see step 1 above) and the compressed TSV file. It also includes the MD5 checksum for the compressed TSV file (so we can check that the transfer is 100% complete). You can omit the checksum attribute in the XML if you have already uploaded a checksum file to your Webin FTP directory along with the compressed TSV file. See here for guidelines on preparing files for a submission.

The analysis object also references the ERT number (corresponding to the rRNA sequence type in this case). In this example I changed the name of the TSV file that was downloaded in step 2 above, but you do not have to.
    <?xml version = '1.0' encoding = 'UTF-8'?>
    <ANALYSIS_SET>
        <ANALYSIS alias="ethylomonas" center_name="EBI">
            <TITLE>16S of Methylomonas sp.</TITLE>
            <DESCRIPTION>16S Methylomonas sp.</DESCRIPTION>
            <STUDY_REF accession="PRJEBxxxx">
            </STUDY_REF>
            <ANALYSIS_TYPE>
                <SEQUENCE_FLATFILE/>
            </ANALYSIS_TYPE>
            <FILES>
                <FILE checklist="ERT000002" checksum="5831463bb16a4c14374a0962d5a353cc" checksum_method="MD5" filename="ethylomonas.tsv.gz" filetype="tab"/>
            </FILES>
        </ANALYSIS>
    </ANALYSIS_SET>

Create a file to hold this XML. It can have any name but in this example we will call it analysis.xml. You can use the above XML as a template but be sure to change all the fields because this is an example only. Remember to:

1. Provide your own alias. This is a unique id for the analysis object and you may need it to identify your submission later.
2. Apply your own center name. When you created your Webin account you provided a center name acronym. You can check it by logging in and looking at the account details section.
3. Add a similar title to the one in the example. It mentions the sequence type and the organism.
4. Use the same or a similar title in the <DESCRIPTION> block. Title and description are not used in the final EMBL flatfiles so these fields do not have to be very detailed.
5. Apply the correct study id (PRJEBxxxx).
6. Apply the correct checklist id (ERTxxxxxx).
7. If registering the MD5 checksum, apply it.
8. Apply the correct file name to refer to the compressed TSV. Use the full path if you have uploaded it to a subdirectory within your Webin FTP directory.
9. Leave filetype and checksum_method the same as in the example.

Step 5: Prepare a Submission XML file

There is also a submission object which represents the submission event itself. An XML file with a submission object needs to accompany the analysis object when it is sent to the ENA REST API server so that the system knows what to do with the analysis object.
    <?xml version="1.0" encoding="UTF-8"?>
    <SUBMISSION alias="ethylomonas_submission" center_name="EBI">
        <ACTIONS>
            <ACTION>
                <ADD source="analysis.xml" schema="analysis"/>
            </ACTION>
        </ACTIONS>
    </SUBMISSION>

The submission XML file can have any name. In this example it is called submission.xml. Change the example template above as follows:

1. Provide your own alias. This is a unique id for the submission event.
2. Apply your own center name. When you created your Webin account you provided a center name acronym. You can check it by logging in and looking at the account details section.
3. Change the source attribute so that it has the name of the XML file containing the analysis object from step 4 above.
4. You can change the ‘ADD’ block to a ‘VALIDATE’ block if you just want to test and see what messages are returned.

    <VALIDATE source="analysis.xml" schema="analysis"/>

When you validate, the object will not be committed to the ENA database and no accession number will be assigned. We recommend testing all your submissions like this before using the ADD action.

Step 6: Send the XMLs to ENA through the REST API

This step is the same as for other REST API submissions. Please go to this section, which is based on submitting a project XML. Submitting an analysis XML is very similar. Please note the following.

• Your cURL command will look something like this:

    curl -k -F "SUBMISSION=@submission.xml" -F "ANALYSIS=@analysis.xml" "https://www-test.ebi.ac.uk/ena/submit/drop-box/submit/?auth=ENA%20Webin-NNN%20PASSWORD"

• If you are using the webform instead of running cURL from a Linux or Mac command line, use the analysis row and the submission row to browse to the analysis.xml file and the submission.xml file respectively.

• The receipt will look like this. Look out for the success="true" in the receipt.
    <?xml version="1.0" encoding="UTF-8"?>
    <?xml-stylesheet type="text/xsl" href="receipt.xsl"?>
    <RECEIPT receiptDate="2017-05-05T15:28:38.557+01:00" submissionFile="sub.xml" success="true">
        <ANALYSIS accession="ERZ407913" alias="ethylomonas" status="PRIVATE" />
        <SUBMISSION accession="ERA907974" alias="ethylomonas" />
        <MESSAGES>
            <INFO>This submission is a TEST submission and will be discarded within 24 hours</INFO>
        </MESSAGES>
        <ACTIONS>ADD</ACTIONS>
    </RECEIPT>

• The URL in the cURL command above belongs to the test server https://www-test.ebi.ac.uk/... so the accessions delivered are not genuine. If you are happy with the submission in TEST, change to the production server https://www.ebi.ac.uk/... and remove the -k flag. Also remember that if you are using the ‘VALIDATE’ action in the submission XML then, despite a success="true", the submission was not committed!

Module 3: Flat File upload - Submit an ENA Supported Sequence File

Annotated sequence entries are stored in the ENA as ENA supported sequence files. Here is an example of an HLA gene in ENA supported format. It is a text file that is computer readable due to the 2 character line beginnings (ID, AC, DE ...). The ENA browser renders the text file into a friendlier and more graphical view but the computer readable version is still available so that automatic pipelines downstream of the ENA can download and parse large numbers of sequence entries.

Create your own ENA supported sequence file

In most cases it is not necessary to submit an ENA supported sequence file because the interactive tool Webin provides spreadsheet templates for various types of sequences so that you can submit using a tab separated file (TSV) which you can fill in using any spreadsheet editor. These are called ‘annotation checklists’.
After the submission via Webin or via the programmatic REST API, the TSV is converted into an ENA supported sequence file (or ‘flat file’) and validated before accessions are delivered.

Not all sequence types are available as a TSV spreadsheet template/annotation checklist. For instance, the HLA gene above has multiple exons and this is difficult for us to turn into a template. Typically the more complicated sequences with multiple and repeating features are the hardest to make into TSV templates. For these types of sequences you can create an ENA supported sequence file yourself and submit it to the ENA using the programmatic REST API (this is submission by “flat file upload”, previously “entry upload”).

For a list of sequence types that are available as annotation checklists (TSV spreadsheets) see here: http://www.ebi.ac.uk/ena/submit/annotation-checklists

Please do not use submission by flat file for any sequence type listed on the above webpage. The spreadsheet/annotation checklist submission route is more robust because we do the file conversion.

For examples of ENA flat files that are not available for submission using annotation checklists/TSV see here: http://www.ebi.ac.uk/ena/submit/entry-upload-templates

Pay close attention to how the flat files are formatted. Use the web page above to construct your sequence flat file. This will be submitted by flat file upload.

As with a TSV/annotation checklist submission (module 2) you need to create an analysis object in XML format to wrap the ENA flat file. Please check module 2: Analysis object for more information. To see how the analysis object and the sequence entries will be accessioned please refer to module 2: A word about Accession Numbers.

Submission by Flat File Upload

Submitting an ENA flat file is the same as submitting a tab separated file (so much of the detail is in module 2).
The main difference is that for TSV spreadsheet submissions the tab/TSV file is converted to an ENA flat file and then validation is applied. For a submission by flat file upload, the conversion is omitted because the file is already in the ENA supported format. The system will try to validate your ENA flat file after only minimal processing. There is a little more opportunity for error but this can be remedied by following the guidelines closely.

Step 1: Create a project

As with a TSV/annotation checklist submission (module 2), a project/study is required. If you already have a study you can add your annotated sequence entries to it. If not, create one first. Use either the interactive submission route or the programmatic submission route to do this. Note the project accession number when you receive it.

Step 2: Compress and upload the sequence flat file

As with a TSV/annotation checklist submission, the sequence flat file must be compressed and uploaded to your Webin FTP directory. You may also need to calculate the MD5 checksum. Check here and here for instructions.

In this example I have an ENA flat file called Human_parvovirus_B19_entryupload.embl which I have compressed to create the file Human_parvovirus_B19_entryupload.embl.gz. The checksum of Human_parvovirus_B19_entryupload.embl.gz is 7138bf3320cad8d215b7e9930ded114b.
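The compress-and-checksum part of this step can be scripted. The following is a minimal sketch in Python, not an official ENA tool: the function names are hypothetical, and the layout of the .md5 file (checksum followed by file name, as printed by md5sum) is an assumption based on the example in the Tips and FAQs chapter. Uploading the resulting files to your Webin FTP directory is not covered here.

```python
import gzip
import hashlib
import shutil
from pathlib import Path
from typing import Tuple


def gzip_file(source: Path) -> Path:
    """Compress `source` to `<name>.gz` in the same directory and return the new path."""
    target = source.with_name(source.name + ".gz")
    with open(source, "rb") as src, gzip.open(target, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return target


def md5_of_file(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def prepare_for_upload(source: Path) -> Tuple[Path, str]:
    """Compress a flat file (or TSV) and compute the MD5 of the COMPRESSED file.

    Also writes `<name>.gz.md5` next to the compressed file, in md5sum-style
    layout (an assumption, see the lead-in). Returns (compressed path, checksum).
    """
    compressed = gzip_file(source)
    checksum = md5_of_file(compressed)
    md5_path = compressed.with_name(compressed.name + ".md5")
    md5_path.write_text(f"{checksum}  {compressed.name}\n")
    return compressed, checksum
```

Calling prepare_for_upload(Path("Human_parvovirus_B19_entryupload.embl")) would produce the .embl.gz file and its .md5 companion. The checksum is always taken from the compressed file because that is the file that is transferred and referenced in the analysis XML.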
Step 3: Create the analysis and submission XMLs

First check how the analysis file was created in module 2 step 4. In this example the analysis file looks like this:

    <?xml version = '1.0' encoding = 'UTF-8'?>
    <ANALYSIS_SET>
        <ANALYSIS alias="Human_parvovirus_B19_entryupload" center_name="EBI">
            <TITLE>Human parvovirus B19 isolate IRB_1_2008 NS1 and VP1 unique region genes, partial cds</TITLE>
            <DESCRIPTION>Human parvovirus B19 isolate IRB_1_2008 NS1 and VP1 unique region genes, partial cds</DESCRIPTION>
            <STUDY_REF accession="PRJEBXXXX">
            </STUDY_REF>
            <ANALYSIS_TYPE>
                <SEQUENCE_FLATFILE/>
            </ANALYSIS_TYPE>
            <FILES>
                <FILE checksum="7138bf3320cad8d215b7e9930ded114b" checksum_method="MD5" filename="Human_parvovirus_B19_entryupload.embl.gz" filetype="flatfile"/>
            </FILES>
        </ANALYSIS>
    </ANALYSIS_SET>

In this case there is no ERT number/checklist attribute because no TSV annotation checklist template is being used. Also the file type attribute is different: filetype="flatfile". The title and description can be a brief description of what is presented in the sequence flat file. Make sure to add all your own attributes and field values as the above is only for example purposes.

The submission XML in this example looks like this:

    <?xml version="1.0" encoding="UTF-8"?>
    <SUBMISSION alias="entry_upload_Human_parvovirus_B19" center_name="EBI">
        <ACTIONS>
            <ACTION>
                <ADD source="analysis.xml" schema="analysis"/>
            </ACTION>
        </ACTIONS>
    </SUBMISSION>

As in module 2 step 5, the next step is to complete a submission XML file. Provide a unique alias for the submission object and reference the file containing the analysis object (in this case I called it ‘analysis.xml’).

Step 4: Send both XMLs to ENA using REST API

This step is the same as module 2 step 6. Use cURL or the web form to send the XMLs to ENA and register the flat file submission.
Use the test server first and, if the submission is successful and you are happy with the receipt, proceed to submit to the production server. In this example I obtained the following receipt:

    <?xml version="1.0" encoding="UTF-8"?>
    <?xml-stylesheet type="text/xsl" href="receipt.xsl"?>
    <RECEIPT receiptDate="2017-05-08T12:51:53.601+01:00" submissionFile="submission.xml" success="true">
        <ANALYSIS accession="ERZ408000" alias="Human_parvovirus_B19_entryupload" status="PRIVATE" />
        <SUBMISSION accession="ERA911540" alias="entry_upload" />
        <ACTIONS>ADD</ACTIONS>
    </RECEIPT>

In this example the analysis received accession ERZ408000 and the submission received accession ERA911540. You will not need the submission accession, whereas the analysis accession may be useful if you need to enquire about the progress of the submission. After the sequence entries are processed they will be accessioned and you will receive the accession (or accession range if multiple sequences were in the flat file) via the email address that is registered with your Webin account. Do not quote the analysis accession in any publication; always quote the sequence accessions (which come later by email). You can also quote the project/study accession, especially if you have used the project to group several submissions across different domains.

Module 4: Update a Study using REST API

Editing studies in the ENA using the REST API is an almost identical process to submitting a new one. The first step is to obtain the original study in XML format. This step alone can be tricky if you did not submit the project using the REST API to begin with.

Note that Webin has good study editing functionality already. However, learning to use the REST API with a simple project object can pave the way for submitting and updating more complicated objects such as samples, experiments and runs.
Also, for making edits in bulk (to many projects), the ENA REST API is more feasible than Webin.

Step 1: Get hold of the study in XML format

If you used the REST API to submit the study in the first place you can use the XML files that you used previously. If you don’t have an XML file containing the study you can copy the public version by adding &display=xml to the end of the study page URL. For example, http://www.ebi.ac.uk/ena/data/view/PRJEB5932&display=xml. Note that the web version has additional blocks that are not part of the original XML, as well as parts that have been added automatically and can be cleaned up for the purpose of updating (besides, they will be added again automatically). For example, the web version XML below can be cleaned up so that it looks like the submitted version that follows it.

Web Version

    <?xml version="1.0" encoding="UTF-8"?>
    <ROOT request="PRJEB14252&display=xml">
    <PROJECT alias="ena-STUDY-klanvin-03-06-2016-07:54:42:301-120" center_name="klanvin" accession="PRJEB14252" first_public="2016-08-02+01:00">
        <IDENTIFIERS>
            <PRIMARY_ID>PRJEB14252</PRIMARY_ID>
            <SECONDARY_ID>ERP015887</SECONDARY_ID>
            <SUBMITTER_ID namespace="klanvin">ena-STUDY-klanvin-03-06-2016-07:54:42:301-120</SUBMITTER_ID>
        </IDENTIFIERS>
        <NAME>Cheddar cheese</NAME>
        <TITLE>Characterization of Microbial Diversity and Chemical Properties of Cheddar Cheese Prepared from Heat-treated Milk</TITLE>
        <DESCRIPTION>This study aimed to characterize the interaction of microbial diversity and chemical properties of Cheddar cheese after three different heat treatments of milk; low temperature/long time (LTLT), thermization, and high temperature/short time (HTST). Cheese obtained from LTLT-treated milk (LC) and thermized milk (TC) .... </DESCRIPTION>
        <SUBMISSION_PROJECT>
            <SEQUENCING_PROJECT>
                <LOCUS_TAG_PREFIX>BN8055</LOCUS_TAG_PREFIX>
            </SEQUENCING_PROJECT>
        </SUBMISSION_PROJECT>
        <PROJECT_LINKS>
            <PROJECT_LINK>
                <XREF_LINK>
                    <DB>ENA-SUBMISSION</DB>
                    <ID>ERA645775</ID>
                </XREF_LINK>
            </PROJECT_LINK>
            <PROJECT_LINK>
                <XREF_LINK>
                    <DB>ENA-FASTQ-FILES</DB>
                    <ID><![CDATA[http://www.ebi.ac.uk/ena/data/warehouse/filereport?accession=PRJEB14252&result=read_run&fields=run_accession,fastq_ftp,fastq_md5,fastq_bytes]]></ID>
                </XREF_LINK>
            </PROJECT_LINK>
            <PROJECT_LINK>
                <XREF_LINK>
                    <DB>ENA-SUBMITTED-FILES</DB>
                    <ID><![CDATA[http://www.ebi.ac.uk/ena/data/warehouse/filereport?accession=PRJEB14252&result=read_run&fields=run_accession,submitted_ftp,submitted_md5,submitted_bytes,submitted_format]]></ID>
                </XREF_LINK>
            </PROJECT_LINK>
        </PROJECT_LINKS>
        <PROJECT_ATTRIBUTES>
            <PROJECT_ATTRIBUTE>
                <TAG>ENA-FIRST-PUBLIC</TAG>
                <VALUE>2016-08-02</VALUE>
            </PROJECT_ATTRIBUTE>
            <PROJECT_ATTRIBUTE>
                <TAG>ENA-LAST-UPDATE</TAG>
                <VALUE>2016-06-03</VALUE>
            </PROJECT_ATTRIBUTE>
        </PROJECT_ATTRIBUTES>
    </PROJECT>
    </ROOT>

Submitted version

    <?xml version="1.0" encoding="US-ASCII"?>
    <PROJECT_SET>
        <PROJECT center_name="klanvin" accession="PRJEB14252">
            <NAME>Cheddar cheese</NAME>
            <TITLE>Characterization of Microbial Diversity and Chemical Properties of Cheddar Cheese Prepared from Heat-treated Milk</TITLE>
            <DESCRIPTION>This study aimed to characterize the interaction of microbial diversity and chemical properties of Cheddar cheese after three different heat treatments of milk; low temperature/long time (LTLT), thermization, and high temperature/short time (HTST). Cheese obtained from LTLT-treated milk (LC) and thermized milk (TC) .... </DESCRIPTION>
            <SUBMISSION_PROJECT>
                <SEQUENCING_PROJECT>
                    <LOCUS_TAG_PREFIX>BN8055</LOCUS_TAG_PREFIX>
                </SEQUENCING_PROJECT>
            </SUBMISSION_PROJECT>
        </PROJECT>
    </PROJECT_SET>

The submitted version is much shorter. I even removed the unique alias because, now that the object has an accession number, the server does not need both the alias and the accession number to identify the object that is being overwritten.

ERP version

If your study is not public yet and you do not have it in XML format you can try using the submit/drop-box/ REST endpoint. Log in to here with your Webin id and password and click on ‘STUDY’. You will see a list of studies submitted from your account and you can view the XML for each by selecting the study and then clicking ‘xml’.

Studies obtained from this resource are actually different (you may have noticed). Previously a study in the read domain had an accession like ERP000001, whereas a project object (used for registering genome assemblies among other things) would have an accession like PRJEB0001. We no longer distinguish between the 2 objects officially and we expose the PRJEB type more, while the ERP type is kept for legacy reasons. You can edit either the PRJEB type or the ERP type and most attributes will be carried over to the other one. Similarly, when you create a PRJEB type project an ERP project is created automatically (and vice versa).

Step 2: Create a submission XML file

As with submitting a new study (see module 1), a submission object is required to accompany the study XML when updating an existing study object. You may have this from a previous submission or update but it is also very quick to create.

    <?xml version="1.0" encoding="UTF-8"?>
    <SUBMISSION alias="cheese_update" center_name="">
        <ACTIONS>
            <ACTION>
                <MODIFY source="project.xml" schema="project"/>
            </ACTION>
        </ACTIONS>
    </SUBMISSION>

Make sure that you give the submission object a unique alias (which can be any string) and fill in the center_name for your account (you can find this in the “my account details” drop down inside Webin). If you are updating the ‘ERP’ version of the project (see above) you also need to specify this in the submission XML by changing schema="project" to schema="study" because the ERP style objects use a different schema.

The important part of this submission object is the <MODIFY> tag. Contrast this with the tag used to submit an object for the first time (in module 1), which is <ADD>. The <MODIFY> tag tells the REST server that we are updating an existing object instead of adding a new one.

Make the edit and send to ENA

Now you can make changes to the study object contained in the XML file. For example, as a test, you might try modifying the title or the description.

The final step is identical to submitting a study for the first time in module 1. You will send the submission XML and the study XML to the ENA REST server using cURL or the webform and you should receive a receipt in XML format. If the receipt contains success="true" then your edit will have been committed to the database. If not, check the error message(s), correct and repeat.
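If you send many updates, checking each receipt by eye becomes tedious. The following is a minimal sketch in Python of how a saved receipt could be checked automatically. The function name and the returned dictionary layout are hypothetical. The RECEIPT, MESSAGES, INFO and ACTIONS elements and the success and accession attributes are taken from the receipts shown in these modules; the assumption that error messages arrive as ERROR elements inside MESSAGES is mine, so confirm it against a real failed receipt.

```python
import xml.etree.ElementTree as ET
from typing import Any, Dict


def summarise_receipt(receipt_xml: bytes) -> Dict[str, Any]:
    """Extract the success flag, accessions and messages from a receipt.

    `receipt_xml` is the raw receipt as bytes, e.g. the content of a file
    to which the cURL output was redirected.
    """
    root = ET.fromstring(receipt_xml)
    messages = root.find("MESSAGES")

    def message_texts(tag: str) -> list:
        # MESSAGES is absent from some receipts, so guard against None.
        if messages is None:
            return []
        return [(el.text or "").strip() for el in messages.findall(tag)]

    # Collect (object type, accession, alias) for every child that carries a
    # non-empty accession attribute, e.g. PROJECT, ANALYSIS, SAMPLE.
    accessions = [
        (child.tag, child.get("accession"), child.get("alias"))
        for child in root
        if child.get("accession")
    ]

    return {
        "success": root.get("success") == "true",
        "actions": [(el.text or "").strip() for el in root.findall("ACTIONS")],
        "accessions": accessions,
        "info": message_texts("INFO"),
        "errors": message_texts("ERROR"),  # assumed element name
    }
```

A script can then stop as soon as summarise_receipt(...)["success"] is False and print the collected messages. Remember that a successful receipt from a VALIDATE action, or from the test server, does not mean anything was committed to the production database.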
A receipt for a successful update looks like this:

    <?xml version="1.0" encoding="UTF-8"?>
    <?xml-stylesheet type="text/xsl" href="receipt.xsl"?>
    <RECEIPT receiptDate="2017-07-17T13:22:11.020+01:00" submissionFile="sub.xml" success="true">
        <PROJECT accession="PRJEB14252" alias="ena-STUDY-klanvin-03-06-2016-07:54:42:301-120" status="PUBLIC"/>
        <SUBMISSION accession="" alias="cheese_update"/>
        <ACTIONS>MODIFY</ACTIONS>
    </RECEIPT>

Module 5: Submitting Sample objects

As with most modules in this programmatic series, this one draws on the basic principles laid out in the first module: Create a Study. It is recommended that you work through the study module first. Once you can create a study object in the ENA, you will be able to create sample objects by the same means.

What does the XML file look like?

The sample below is from an actual project released in 2016. Its title is Different gastric microbiota compositions in two human populations with high and low gastric cancer risk in Colombia. Here is one of the samples:

    <?xml version="1.0" encoding="US-ASCII"?>
    <SAMPLE_SET xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:noNamespaceSchemaLocation="ftp://ftp.sra.ebi.ac.uk/meta/xsd/sra_1_5/SRA.sample.xsd">
        <SAMPLE alias="MT5176" center_name="">
            <TITLE>human gastric microbiota, mucosal</TITLE>
            <SAMPLE_NAME>
                <TAXON_ID>1284369</TAXON_ID>
                <SCIENTIFIC_NAME>stomach metagenome</SCIENTIFIC_NAME>
                <COMMON_NAME></COMMON_NAME>
            </SAMPLE_NAME>
            <SAMPLE_ATTRIBUTES>
                <SAMPLE_ATTRIBUTE>
                    <TAG>investigation type</TAG>
                    <VALUE>mimarks-survey</VALUE>
                </SAMPLE_ATTRIBUTE>
                <SAMPLE_ATTRIBUTE>
                    <TAG>project name</TAG>
                    <VALUE>Different gastric microbiota compositions in two human populations with high and low gastric cancer risk in Colombia</VALUE>
                </SAMPLE_ATTRIBUTE>
                <SAMPLE_ATTRIBUTE>
                    <TAG>sequencing method</TAG>
                    <VALUE>pyrosequencing</VALUE>
                </SAMPLE_ATTRIBUTE>
                <SAMPLE_ATTRIBUTE>
                    <TAG>collection date</TAG>
                    <VALUE>2010</VALUE>
                </SAMPLE_ATTRIBUTE>
                <SAMPLE_ATTRIBUTE>
                    <TAG>host body site</TAG>
                    <VALUE>Mucosa of stomach</VALUE>
                </SAMPLE_ATTRIBUTE>
                <SAMPLE_ATTRIBUTE>
                    <TAG>human-associated environmental package</TAG>
                    <VALUE>human-associated</VALUE>
                </SAMPLE_ATTRIBUTE>
                <SAMPLE_ATTRIBUTE>
                    <TAG>geographic location (latitude)</TAG>
                    <VALUE>1.81</VALUE>
                    <UNITS>DD</UNITS>
                </SAMPLE_ATTRIBUTE>
                <SAMPLE_ATTRIBUTE>
                    <TAG>geographic location (longitude)</TAG>
                    <VALUE>-78.76</VALUE>
                    <UNITS>DD</UNITS>
                </SAMPLE_ATTRIBUTE>
                <SAMPLE_ATTRIBUTE>
                    <TAG>geographic location (country and/or sea)</TAG>
                    <VALUE>Colombia</VALUE>
                </SAMPLE_ATTRIBUTE>
                <SAMPLE_ATTRIBUTE>
                    <TAG>geographic location (region and locality)</TAG>
                    <VALUE>Tumaco</VALUE>
                </SAMPLE_ATTRIBUTE>
                <SAMPLE_ATTRIBUTE>
                    <TAG>environment (biome)</TAG>
                    <VALUE>coast</VALUE>
                </SAMPLE_ATTRIBUTE>
                <SAMPLE_ATTRIBUTE>
                    <TAG>environment (feature)</TAG>
                    <VALUE>human-associated habitat</VALUE>
                </SAMPLE_ATTRIBUTE>
                <SAMPLE_ATTRIBUTE>
                    <TAG>environment (material)</TAG>
                    <VALUE>gastric biopsy</VALUE>
                </SAMPLE_ATTRIBUTE>
                <SAMPLE_ATTRIBUTE>
                    <TAG>ENA-CHECKLIST</TAG>
                    <VALUE>ERC000014</VALUE>
                </SAMPLE_ATTRIBUTE>
            </SAMPLE_ATTRIBUTES>
        </SAMPLE>
    </SAMPLE_SET>

A sample is ultimately connected to raw read data and can also be connected to an assembly and various types of interpreted data. It represents the source material that has been sequenced and provides most of the context and value to the data that it is connected to. Note that most of the added value comes in the form of <TAG> and <VALUE> pairs that belong in <SAMPLE_ATTRIBUTE> blocks. These blocks are not restricted so you can add as many as you like and you can define them however you like. Most submitters will want to apply attributes that are recognised by ENA and that are indexed for searching and filtering, as this will increase the searchability and value of your sample even further. You can also use a combination of your own attributes with those recognised by ENA.

Apply an ENA minimum information standard checklist to your samples

ENA offers sample ‘checklists’ which define all the mandatory and recommended attributes for specific types of samples. By declaring that you would like to register your sample under a specific checklist you enable the sample to be validated for correctness at submission time. You will also benefit from additional exposure of that sample to various services downstream of ENA that are interested in using ENA data that has been annotated to the minimum standards represented by the ENA checklists. The sample above uses, and will be validated against, ENA checklist ERC000014. Note that the checklist itself is declared using a SAMPLE_ATTRIBUTE block.
The rest of the SAMPLE_ATTRIBUTE blocks are defined by that checklist. You can omit a checklist reference if you do not want your samples to be confined to the minimum annotation standards of one of ENA’s checklists. We advise against this, and you can always add more of your own attributes which will not be subject to strict validation.

Find all the sample checklists here. You can see that the sample in the example above is using checklist ERC000014, which corresponds to the GSC MIxS annotation standard for human associated source samples. Use these webpages in the ENA to find out which attributes are required by each checklist and what controlled vocabularies, regular expressions and units are expected in each case. You may want to access the XML version of the checklist if you want to write a script to validate your own samples before you submit them. The XML version of a checklist is available by appending &display=xml to the URL for the specific checklist: http://www.ebi.ac.uk/ena/data/view/ERC000014&display=xml

If there is no suitable checklist that describes your type of source samples you can use the ENA default checklist (http://www.ebi.ac.uk/ena/data/view/ERC000011). This checklist has virtually no mandatory fields but does include a lot of optional attributes that you can review to help annotate your sample to the highest possible standard. A well annotated sample will eventually lead to maximum exposure and usability of your data.

Submitting many samples simultaneously

The main attraction of using the REST API to submit samples (and other objects) is that you do not need to interact with a manual web interface and that you can submit many objects in bulk at the same time. The example contains one sample block inside one sample_set block <SAMPLE_SET></SAMPLE_SET>. Your submission is more likely to have multiple samples in one sample_set.
Make sure you highlight how the samples are different from each other if it is not already clear from some of the attribute values. Merely naming them 1 to 4 will not help your users to do any comparative analysis!

    <?xml version="1.0" encoding="US-ASCII"?>
    <SAMPLE_SET>
        <SAMPLE alias="1" center_name="">
            <TITLE>first human gastric microbiota sample</TITLE>
            <SAMPLE_NAME>
                <TAXON_ID>1284369</TAXON_ID>
            </SAMPLE_NAME>
        </SAMPLE>
        <SAMPLE alias="2" center_name="">
            <TITLE>second human gastric microbiota sample</TITLE>
            <SAMPLE_NAME>
                <TAXON_ID>1284369</TAXON_ID>
            </SAMPLE_NAME>
        </SAMPLE>
        <SAMPLE alias="3" center_name="">
            <TITLE>third human gastric microbiota sample</TITLE>
            <SAMPLE_NAME>
                <TAXON_ID>1284369</TAXON_ID>
            </SAMPLE_NAME>
        </SAMPLE>
        <SAMPLE alias="4" center_name="">
            <TITLE>fourth human gastric microbiota sample</TITLE>
            <SAMPLE_NAME>
                <TAXON_ID>1284369</TAXON_ID>
            </SAMPLE_NAME>
        </SAMPLE>
    </SAMPLE_SET>

Two more points about the sample XML file

XML Schema

Note the first 2 lines of the SAMPLE_SET element in the first example above.

    <SAMPLE_SET xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:noNamespaceSchemaLocation="ftp://ftp.sra.ebi.ac.uk/meta/xsd/sra_1_5/SRA.sample.xsd">

This part points your XML editor (if you are using one) to a schema so that it can validate as you type. This is the schema for the sample XML, which is not the same as the checklist validation system. This schema defines the order of the blocks and the controlled terms that may be available in some cases. It is more of a structural check and unfortunately many ENA rules are not embedded into this first level schema, so it cannot guarantee that the submission will be successful. However, it will help you to compile properly written sample XML files.

Taxonomic classification

Note the sample_name block from the example above:

    <SAMPLE_NAME>
        <TAXON_ID>1284369</TAXON_ID>
        <SCIENTIFIC_NAME>stomach metagenome</SCIENTIFIC_NAME>
        <COMMON_NAME></COMMON_NAME>
    </SAMPLE_NAME>

Taxon, scientific name and common name are ways of classifying the organism of the sample. In this case, however, the source sample is environmental and represents an unknown variety and quantity of organisms. Because every sample still needs a taxonomic classification we have specific environmental terms in our taxonomy database, typically used for metagenomic studies. More about these here.

Taxon, scientific name and common name all reference the same node in our taxonomic database so you do not need to include all 3. Including the unique taxon_id is sufficient and the other fields will be added automatically after the sample is submitted and archived. To find the correct taxonomic information for your organism, including taxon_id and scientific_name, see here.

Submitting the XML files

The procedure for submitting XML files is outlined in module 1. Module 1 describes submitting a study object but the process for sample submission is the same. The submission XML file should look something like this (assuming the samples are in another XML file called “samp.xml”). Also remember to apply the correct centre name for your Webin account. The alias can be any unique string.
    <?xml version="1.0" encoding="UTF-8"?>
    <SUBMISSION alias="MT5176_submission" center_name="">
        <ACTIONS>
            <ACTION>
                <ADD source="samp.xml" schema="sample"/>
            </ACTION>
        </ACTIONS>
    </SUBMISSION>

Assuming that the above submission XML is saved in a file called “sub.xml”, a cURL statement to send the XMLs to the ENA REST TEST server will look like this:

    curl -k -F "SUBMISSION=@sub.xml" -F "SAMPLE=@samp.xml" "https://www-test.ebi.ac.uk/ena/submit/drop-box/submit/?auth=ENA%20Webin-NNN%20PASSWORD"

The cURL command will return a receipt in XML format containing the accession numbers. If accession numbers were not assigned because there was a problem/error then you will get a list of errors to work through before trying again.

    <?xml version="1.0" encoding="UTF-8"?>
    <?xml-stylesheet type="text/xsl" href="receipt.xsl"?>
    <RECEIPT receiptDate="2017-07-25T16:07:50.248+01:00" submissionFile="sub.xml" success="true">
        <SAMPLE accession="ERS1833148" alias="MT5176" status="PRIVATE">
            <EXT_ID accession="SAMEA104174130" type="biosample"/>
        </SAMPLE>
        <SUBMISSION accession="ERA979927" alias="MT5176_submission"/>
        <MESSAGES>
            <INFO>This submission is a TEST submission and will be discarded within 24 hours</INFO>
        </MESSAGES>
        <ACTIONS>ADD</ACTIONS>
    </RECEIPT>

The receipt can be quite large so you may prefer to redirect the cURL output to a file, for example “receipt.xml”.

Module 6: Updating Sample objects using REST API

Under construction.

CHAPTER 3
Tips and FAQs

Solving Error Notifications (runs)

Submission of read files such as BAM and FastQ involves uploading them to your confidential FTP directory (this comes with your Webin account). Following this you will ‘submit’ the files: wrap each file (or pair of files) into a run object.
This action of registering the run objects triggers our file processing pipeline to do some preliminary checks on the read files before moving them to an archive area. If any check fails you will receive an error notification by email. To correct these preliminary validation errors you do not need to repeat any of the submission process. Simply upload and replace the files with fixed versions and, if necessary, update the registered md5 checksum (more on this later). The runs will be updated automatically because the processing pipeline cycles through all files that are flagged/unvalidated to see if there has been a change. The duration of this cycle depends on the queue. At quiet times it can be less than 24 hours but it can take several days during busier times, so please allow some time after you have implemented your fix for the automatic email notifications to cease.

Error Type 1: Invalid file checksum

If ‘Invalid file checksum’ appears in your emailed error report:

    List of file processing errors:

    FILE_NAME | ERROR | MD5 | FILE_SIZE | DATE | RUN_ID/ANALYSIS_ID
    mbr_depth_05.bam | Invalid file checksum | 594934819a1571f805ff299807431da4 | 895557023 | 20-DEC-2016 14:02:50 | ERR1766300
    mbr_depth_minus_05.bam | Invalid file checksum | a2becdf04ab799c4e208de6161b470b3 | 341165746 | 20-DEC-2016 14:00:46 | ERR1766407

File checksum refers to a hash function that can be performed on a file to create a string that is, for practical purposes, unique to its contents. When you upload a file to our ftp server it may not get transferred 100%. In this case we will have a corrupted or truncated file, which is no good. To check this we calculate the hash of the uploaded file. If it differs from the hash of the original file before you uploaded it, we can be sure the file on our server is not 100% complete. You can read this page on Wikipedia for more information about hash functions.
We use the MD5 hash algorithm, which you can run easily on your local read files from the Linux or Mac command line (the Mac Terminal provides the equivalent ‘md5’ command):

    > md5sum mbr_depth_05.bam
    594934819a1571f805ff299807431da4  mbr_depth_05.bam
    > md5sum mbr_depth_minus_05.bam
    99cf94b7287658254dd1be689fbc447d  mbr_depth_minus_05.bam

Outcome One: Corrupt File: Upload Again

In the example above, according to the email notification, file “mbr_depth_05.bam” has a registered MD5 of 594934819a1571f805ff299807431da4. When we calculate the checksum of the original file ourselves we find the same MD5, so the registered checksum is correct. The table is reporting that the uploaded file does not match the registered checksum, so we can assume that the file was not transferred completely. To remedy this, upload the file again. The file processing pipeline checks for a match systematically and, when it finds one, the run will update itself.

Outcome Two: Wrongly Registered MD5 checksum: Register new one

File “mbr_depth_minus_05.bam” has a different story. The registered checksum according to the email notification is a2becdf04ab799c4e208de6161b470b3. When we calculate it locally we get 99cf94b7287658254dd1be689fbc447d. It appears that the wrong MD5 is registered. To remedy this we need to change the registered MD5 checksum by uploading the correct checksum as a separate file. For file XXX the md5 checksum should be in file XXX.md5, so we need to create a file called mbr_depth_minus_05.bam.md5 containing the correct MD5 checksum. We then upload this MD5 file to the same location as the original file (your Webin ftp directory).

    > md5sum mbr_depth_minus_05.bam > mbr_depth_minus_05.bam.md5   # create MD5 file
    > cat mbr_depth_minus_05.bam.md5                               # check contents of new MD5 file
    99cf94b7287658254dd1be689fbc447d  mbr_depth_minus_05.bam

Remember the file processing pipeline cycles through all files that are flagged with an error. You do not need to repeat the submission.
Uploading the file again, or the checksum file, or both (for extra security) is sufficient to update the run, but you may continue to get errors by email for a day or 2 afterwards (depending on the queue).

If you don’t remember registering the MD5 checksum for each file when you submitted it, it would have happened in one of 3 ways:

1. Our file uploader tool calculates the MD5 checksum automatically for any file that you upload. It then deposits the ‘XXX.md5’ file itself.

2. You registered the MD5 checksum using the tsv columns at submission time (module 4, part 2, step 5). This method is the most common source of wrongly registered checksums. Most other times it is sufficient to re-upload the file and assume the registered checksum is correct.

3. You uploaded ‘XXX.md5’ checksum files along with the XXX read files.

If you are submitting new runs you can use the procedure described above to register an md5 checksum for each file that you upload. If you use option 2 from above (register the checksum in the metadata tsv table) it will override the checksum file present in your ftp directory. If you provide a checksum file for every read file you can leave the checksum column(s) blank at the metadata registration stage.

Error Type 2: Number of lines is not multiple of 4

This validation check helps to pick up errors in FastQ files. It is by no means thorough, but it can catch badly formatted FastQ files before they enter the processing pipeline (after which, errors are harder to fix). You will have received an email with a table like this.
    List of file processing errors:

    FILE_NAME | ERROR | MD5 | FILE_SIZE | DATE | RUN_ID/ANALYSIS_ID
    SOC9/MCONS1_R1.fq.gz | File content missing or malformed, Number of lines in fastq is not multiple of 4 | c2f8455c1a024cfb96a6c91f5d71f534 | 1358349886 | 01-DEC-2016 03:12:35 | ERR1755094
    SOC9/MDSD8_R2.fq.gz | File content missing or malformed, Number of lines in fastq is not multiple of 4 | 3729df0ab14b2f00e863780281ec69fc | 3324175122 | 01-DEC-2016 03:14:33 | ERR1755093

This is the check that is done on FastQ files:

    zcat MCONS1_R1.fq.gz | grep -c '[^[:space:]]'

zcat and grep are commands that exist on the Linux platform as well as the Mac platform. ‘zcat’ uncompresses the file and prints its contents, and the grep command counts the number of lines that contain at least one non-whitespace character. A read in FastQ format is 4 lines long (header line + base calls + quality score header line + quality score calls), so the total line count should be a multiple of 4. The output of the command above is simply divided by 4; if the result is not a whole number an error is flagged and the email notification is sent.

To remedy the error, upload a version of the file that has the correct line count, using the same file name and directory location as before so that it overwrites the existing file. You can check your files before uploading them using the above command on a Linux machine.

IMPORTANT Final Step: The new file you upload will have a different MD5 checksum to the registered MD5 checksum. The registered checksum for each file is provided in the table in the email (column 3). To remedy this follow this step from the previous section: Outcome Two: Wrongly Registered MD5 checksum: Register new one

Error Type 3: File integrity check failed

This error occurs when we cannot unpack or read the file. The type of problem is related to the format of the file. Here are a few examples of the error notification that you might receive.
    List of file processing errors:

    FILE_NAME | ERROR | MD5 | FILE_SIZE | DATE | RUN_ID/ANALYSIS_ID
    UK/BR1-20_2.fq.gz | File integrity check failed, Can’t unzip file | ef7e73ed95f64355d7bf7d48636b704f | 3801612790 | 22-DEC-2016 04:08:41 | ERR0757927
    cetbiorep1.bam | File integrity check failed, File cannot be read using samtools | cecfa479356456cb6770986a6141bc44 | 800838646 | 24-MAY-2016 03:02:08 | ERR0332189
    frger.cram | File integrity check failed, Can’t count number of records in the file using cram tools | 807a0f61da013916c1ca5f60b9b42526 | 2347399950 | 11-JAN-2017 14:59:49 | ERR363314

The integrity checks are different for each file type but they follow the same principle.

File Types

for compressed fastq files

    zcat BR1-20_2.fq.gz > /dev/null 2>&1
    echo $?   # exit code of 1 or higher means that there was an error.

The Linux zcat command uncompresses the gzipped file (bzcat for bzip2) and parses it. The output is not important at this stage, just the exit code. The output (and any human readable error message) is redirected to /dev/null (a way of discarding it). If the exit code of the program is greater than 0 we know there was some issue in uncompressing the file and the error report gets generated. To fix the problem, check that your local file can be uncompressed. You can use a similar approach to the one above or try the -t flag of the gzip program, which tests the integrity of the gzipped file (gzip -t <filename>).

for BAM files

    samtools view cetbiorep1.bam > /dev/null 2>&1
    echo $?   # exit code of 1 or higher means that there was an error.

The preliminary validation done on BAM files is simply to run the samtools ‘view’ command on the BAM file to check that it can be unpacked and read. If the exit code of the program is greater than 0 we know that samtools was not able to fully read the BAM file and this triggers the error report to be emailed.
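For gzipped FastQ files, the two local checks described so far (can the file be fully decompressed, and is the count of non-blank lines a multiple of 4) can also be run from Python before uploading. The following is a minimal sketch and not the validator that ENA runs; the function names check_fastq_gz and passes_preliminary_checks are hypothetical, and the sketch covers gzip only, not bzip2, BAM or CRAM.

```python
import gzip
import zlib


def check_fastq_gz(path):
    """Return (readable, non_blank_lines) for a gzipped FastQ file.

    readable is False if the file cannot be decompressed to the end,
    which mirrors a non-zero exit code from zcat.
    """
    non_blank = 0
    try:
        with gzip.open(path, "rb") as handle:
            for line in handle:
                # Same idea as: grep -c '[^[:space:]]'
                if line.strip():
                    non_blank += 1
    except (OSError, EOFError, zlib.error):
        # Not a gzip file, truncated stream, or corrupted compressed data.
        return False, non_blank
    return True, non_blank


def passes_preliminary_checks(path):
    """True if the file decompresses and its non-blank line count is a multiple of 4."""
    readable, non_blank = check_fastq_gz(path)
    return readable and non_blank % 4 == 0
```

A file that fails either check locally can be expected to trigger the corresponding email notification, so it is worth fixing before upload.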
for CRAM files

CRAM files are similar to BAM files with some additional steps. The reference needs to be downloaded before the file can be unpacked. The validation checks are based on this process, and you can test CRAM file integrity yourself before uploading the file in a similar way to the previous file formats.

How to Fix

1. Obtain a working file that passes the same preliminary test that our own validator applies. Upload the fixed file (same name and location as the previous version so as to overwrite it) to your Webin ftp directory.

2. The fixed file that you upload will have a different MD5 checksum to the registered MD5 checksum. The registered checksum for each file is provided in the table in the email (column 3). To remedy this follow this step from the previous section: Outcome Two: Wrongly Registered MD5 checksum: Register new one

3. Do not attempt to redo the submission. Uploading the file and registering its checksum will be enough to fix the run object. Our system checks for updates to files regularly. This can take a few days depending on the file queue, so please allow a couple of days for the emails to cease.

Preparing a file for Upload

Most files submitted to the ENA need to be transferred to the ENA server in a process that is separate from the submission itself. When we talk about submissions we are usually talking about registering the metadata: the information about the file and about where it comes from. This metadata usually gets registered in the form of objects. For example, a sample object represents the physical source material that is sampled for eventual sequencing. The file itself can be the result of sequencing the sample, such as the output of the sequencing machine. Having a separate transfer step means that files can be large and handled separately without interrupting or delaying the submission/registration steps. When data files are uploaded to the ENA ftp server the submission is not complete.
There is usually more to come by way of this metadata registration. For instance, a read file submission requires project, sample, experiment, and run objects, while a whole genome FASTA file needs a sample and a project object. An annotated sequence submission requires at the very least a project object to belong to.

Most files uploaded to the ENA ftp server need to:

1. Be compressed
2. Have their MD5 checksum registered

Step 1: Compress the file using gzip or bzip2

Files that are in a human readable text format (FastQ, FastA, VCF, tsv, csv ...) should be compressed before you upload them to the ENA ftp server. Files that are not in a human readable text format, like BAM, CRAM and SFF, are already in a format that is efficient for transferring, so additional compression is not required (the file will fail to validate if it is wrongly compressed). Also, with the exception of Oxford Nanopore files, do not tar archive any collections of files; each file should be uploaded separately. If you are unsure about the format that your files should be in you can check here for standard file formats and here for platform specific formats.

Tools used for compressing files are 3rd party, so you can find out more about how to do this from outside the ENA (a simple web search should be sufficient). However, here is a basic example of compressing a file from within a Mac operating system using the Terminal application.

    user_01$ ls *fq
    eg_01.fq
    user_01$ gzip eg_01.fq
    user_01$ ls *gz
    eg_01.fq.gz
    user_01$ gunzip eg_01.fq.gz
    user_01$ ls *fq
    eg_01.fq
    user_01$ bzip2 eg_01.fq
    user_01$ ls *bz2
    eg_01.fq.bz2
    user_01$

In the above example the user has listed all files in the current directory that end in ‘fq’ (there is one called ‘eg_01.fq’). The user then compresses the file with the ‘gzip’ command, then reverts it back to uncompressed form with the ‘gunzip’ command. Next the user compresses the file with the ‘bzip2’ command.
Note that files that are compressed end in ‘.gz’ or ‘.bz2’ depending on what tool is used.

Step 2: Calculate the MD5 checksum for the file

MD5 is a hash function that can be applied to any file to create a 32 character string that is, for practical purposes, unique to that file (see the Wikipedia page on MD5). It is a bit like a fingerprint for the file. If the contents of the file change in any way the MD5 checksum will change as well. The file name can change without affecting the MD5 checksum because the calculation is done on the contents of the file only.

The idea is that when you transfer your large file to us it may not get transferred 100%. If you tell us the MD5 checksum of the file that you have before it is uploaded, and we then calculate the checksum of the file that has been uploaded to us, we can tell if the upload was successful. If the checksum we calculate matches the one you provided then the transfer was a success.

Hash functions are a common way of testing file identity and integrity, so you can find out more about how to do this from outside the ENA (a simple web search should be sufficient). However, here is a basic example of calculating the checksum for a file called ‘eg_01.fq.bz2’ using the Terminal application within the Mac operating system.

    user_01$ md5 eg_01.fq.bz2
    MD5 (eg_01.fq.bz2) = 74f085a6f3dd8b2877b89fcb592c7f5c
    user_01$ md5 eg_01.fq.bz2 > eg_01.fq.bz2.md5
    user_01$ cat eg_01.fq.bz2.md5
    MD5 (eg_01.fq.bz2) = 74f085a6f3dd8b2877b89fcb592c7f5c

In the above example the user uses the ‘md5’ command to calculate the checksum for the file. On a Linux operating system the equivalent is the ‘md5sum’ command. Then the user does it again, but redirects the output to a file called ‘eg_01.fq.bz2.md5’. Finally the user checks the contents of the new file. This is an md5 file and can be used to register the MD5 checksum of the original file with ENA.
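If you prefer a method that behaves the same on Linux and Mac, the calculation can also be scripted. The sketch below is a minimal Python illustration, not an ENA tool: the function names md5_of_file and write_md5_file are hypothetical, and the choice to write the checksum file as an md5sum style line (checksum, two spaces, file name) is an assumption based on the md5sum example shown earlier in this chapter.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the MD5 checksum of a file as a 32 character hex string."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so that large read files do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_md5_file(path):
    """Create XXX.md5 next to data file XXX and return the new path."""
    data_path = Path(path)
    checksum = md5_of_file(data_path)
    md5_path = data_path.with_name(data_path.name + ".md5")
    # Assumed layout: same as md5sum output ("<checksum>  <file name>").
    md5_path.write_text(f"{checksum}  {data_path.name}\n")
    return md5_path
```

Calling write_md5_file("eg_01.fq.bz2") would create ‘eg_01.fq.bz2.md5’ in the same directory, following the XXX.md5 naming convention.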
Registering the MD5 checksum with ENA

In the example above the data file to be submitted is called ‘eg_01.fq.bz2’. It is a compressed version of the original file ‘eg_01.fq’. Compressing large files is advantageous because it takes less time to transfer them and this increases the likelihood of a complete transfer without corruption. The MD5 checksum of file ‘eg_01.fq.bz2’ is contained in file ‘eg_01.fq.bz2.md5’.

ENA requires the checksum that you have calculated so that we can compare it to the one that we calculate once the file is on our ftp server. You can therefore upload this checksum file in addition to the data file and our system will find it. ENA will understand as long as you abide by the naming convention XXX.md5, where XXX is the name of the data file and XXX.md5 is a text file containing the MD5 checksum.

This is not the only way to register the checksum for a data file. When you come to submit the uploaded data file you will find that you can include the 32 character checksum string with the submission metadata. If you do include the checksums with the metadata at submission time then you do not have to accompany each data file with an md5 file at upload time.

Also note that the ENA file uploader (one of the upload options available) will automatically create an MD5 file for every data file that it uploads and will deposit this MD5 file (using the naming convention discussed) along with the data file on the ftp server. That means that you do not need to provide MD5 checksums in the metadata at submission time if you have used the ENA file uploader.

You cannot pool checksums from several data files into a single md5 file. The ENA file processing system will not be able to interpret this. Each file must have its own md5 file (if you are choosing to register it that way).

File Validation Errors

A common cause of file validation errors is when the checksum that you provide does not match the one that we have calculated.
Automatic email notifications are set up to alert you of these problems. Remember the data file will not be validated until you have submitted it: uploading a data file does not constitute a submission.

If you do receive an email about checksum mismatches then there is a chance that your transfers did not complete 100% and the files are corrupted. It could also be the case that you accidentally registered the wrong checksum. You can re-upload any file you like. Make sure it has the same name and is placed in the same subdirectory (if any) as the original. This should solve a corrupt file issue if the second upload is 100% successful, because its checksum will now match the registered checksum. Alternatively, if you believe the wrong checksum is registered, simply upload a new checksum file with the correct MD5 checksum in it.

The file processing system at ENA checks and recalculates all unvalidated files cyclically, so once there is a match between the calculated and the registered MD5 value the file will be validated. You do not have to repeat any part of the submission, but the queue of unvalidated files is variable, so at busy times it can still take some time for the error notifications to cease. It is recommended to re-upload both the data file and a checksum file so that both scenarios are covered and your file will be validated without any further trouble.

There are other possible validation errors. For example, we may not be able to uncompress your data file because it is corrupted. You will need to upload a fixed version of the data file. Always accompany a fixed file with a checksum file: because you have changed the file, its MD5 checksum will differ from that of the original. Often submitters provide a fixed file but forget to update the registered checksum, so the validation still fails.
Also remember that replacement data files must always have the same file name as the original or the system will not pick them up as replacements. If the file name itself must change it is usually best to submit a new data file and cancel the problem submission. For most validation errors this is completely unnecessary, so do not be tempted to repeat a submission if you do not have to!

Step 3: Uploading the file

This is the final step before the submission. Instructions for this are well detailed already: http://www.ebi.ac.uk/ena/about/sra_data_upload

Remember to upload the checksum file in addition to the data file unless you are going to register the checksum at submission time or you are using the ENA file uploader instead.

Here is a basic example of using FTP to upload a data file called ‘eg_01.fq.bz2’ and its md5 file ‘eg_01.fq.bz2.md5’. The example is using the Terminal application in the Mac operating system. See the above link for more detailed instructions.

    user_01$ ftp webin.ebi.ac.uk
    Connected to hh-webin.ebi.ac.uk.
    220 (vsFTPd 2.2.2)
    Name (webin.ebi.ac.uk:user_01): Webin-XXX
    331 Please specify the password.
    Password:
    230 Login successful.
    Remote system type is UNIX.
    Using binary mode to transfer files.
    ftp> mput eg_01.fq.bz2
    229 Entering Extended Passive Mode (|||42382|).
    150 Ok to send data.
    100% |**********|    51       25.65 KiB/s    00:00 ETA
    226 Transfer complete.
    50000 bytes sent in 05:00 (1.57 KiB/s)
    ftp> mput eg_01.fq.bz2.md5
    229 Entering Extended Passive Mode (|||41642|).
    150 Ok to send data.
    100% |**********|    54       48.20 KiB/s    00:00 ETA
    226 Transfer complete.
    54 bytes sent in 00:00 (1.92 KiB/s)
    ftp> bye
    221 Goodbye.
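The same two transfers can be scripted with Python's standard ftplib module. This is only a sketch under stated assumptions: the host name is the one shown in the session above, the function names upload_with_checksum and upload_to_webin are hypothetical, and files are placed in the top level of the ftp directory. The detailed upload instructions linked above remain the reference.

```python
import os
from ftplib import FTP


def upload_with_checksum(ftp, data_path):
    """Upload a data file and its XXX.md5 file through an open FTP connection.

    Returns the list of remote file names that were stored.
    """
    uploaded = []
    for path in (data_path, data_path + ".md5"):
        name = os.path.basename(path)
        with open(path, "rb") as handle:
            # STOR writes the file in binary mode under the same name.
            ftp.storbinary(f"STOR {name}", handle)
        uploaded.append(name)
    return uploaded


def upload_to_webin(user, password, data_path, host="webin.ebi.ac.uk"):
    """Log in with Webin credentials and upload the file pair (assumed host)."""
    with FTP(host) as ftp:
        ftp.login(user, password)
        return upload_with_checksum(ftp, data_path)
```

Keeping the transfer logic in upload_with_checksum, separate from the login, makes it possible to try the function against a stand-in connection object before using real credentials.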
Taxonomic classifications for your samples

The Tax database

Every ENA sample object should have a taxonomic classification. The INSDC maintains a database of all unique taxonomy classifications known to us and you should apply one from this database when you create your samples. Each classification has a unique id, and this is expanded to show the scientific name and common name of the organism when the sample is viewed.

The interactive submission service has a look up table which you can use before you download the spreadsheet template, so that you already know what taxonomy identifications to apply when you are creating your samples offline.

Submitters using the REST API will apply the taxonomic information to the sample object using the sample_name block:

    <SAMPLE_NAME>
        <TAXON_ID>450267</TAXON_ID>
        <SCIENTIFIC_NAME>Chlamyphorus truncatus</SCIENTIFIC_NAME>
        <COMMON_NAME>Pink fairy armadillo</COMMON_NAME>
    </SAMPLE_NAME>

REST access to the tax database

Submitters using the REST API to programmatically submit samples in XML format can use the taxonomy database look up to find what tax id they need to apply to their sample using these REST endpoints:

If you know the scientific name of the organism you can find the taxonomy id with this endpoint: www.ebi.ac.uk/ena/data/taxonomy/v1/taxon/scientific-name/. Simply append the scientific name to the URL. You can use a browser or use cURL at the command line (the “see URL” program available on Linux and Mac). Note the use of %20 to represent a space character.
This is URL encoding and you may find the commands do not work unless you replace space characters with %20.

    > curl "http://www.ebi.ac.uk/ena/data/taxonomy/v1/taxon/scientific-name/Leptonycteris%20nivalis"
    [
      {
        "taxId": "59456",
        "scientificName": "Leptonycteris nivalis",
        "commonName": "Mexican long-nosed bat",
        "formalName": "true",
        "rank": "species",
        "division": "MAM",
        "lineage": "Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia; Eutheria; Laurasiatheria; Chiroptera; Microchiroptera; Phyllostomidae; Glossophaginae; Leptonycteris; ",
        "geneticCode": "1",
        "mitochondrialGeneticCode": "2",
        "submittable": "true"
      }
    ]

You can do the same with the common name. Use endpoint http://www.ebi.ac.uk/ena/data/taxonomy/v1/taxon/any-name/ and append the name.

    > curl "http://www.ebi.ac.uk/ena/data/taxonomy/v1/taxon/any-name/golden%20arrow%20poison%20frog"
    [
      {
        "taxId": "377316",
        "scientificName": "Atelopus zeteki",
        "commonName": "golden arrow poison frog",
        "formalName": "true",
        "rank": "species",
        "division": "VRT",
        "lineage": "Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Amphibia; Batrachia; Anura; Neobatrachia; Hyloidea; Bufonidae; Atelopus; ",
        "geneticCode": "1",
        "mitochondrialGeneticCode": "2",
        "submittable": "true"
      }
    ]

If you do not know the scientific name or the common name but you have an idea, you can use this suggest endpoint: http://www.ebi.ac.uk/ena/data/taxonomy/v1/taxon/suggest-for-submission/

    > curl "http://www.ebi.ac.uk/ena/data/taxonomy/v1/taxon/suggest-for-submission/curry"
    [
      {
        "taxId": "159030",
        "scientificName": "Murraya koenigii",
        "displayName": "curry leaf"
      },
      {
        "taxId": "261786",
        "scientificName": "Helichrysum italicum",
        "displayName": "curry plant"
      }
    ]

In each case above a JSON document is output and you will be looking for the taxId field. The JSON format will help you to automate the call if appropriate.
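As an illustration of such automation, the sketch below builds the look up URL with the space characters encoded as %20 and pulls the taxId values out of the JSON response. It is a minimal Python example under stated assumptions: the base URL and endpoint names are copied from the examples above and may have changed since this document was written, and the function names taxonomy_url, tax_ids and lookup_tax_ids are hypothetical.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

# Base URL as given in the examples above (assumed to still be valid).
TAXONOMY_BASE = "http://www.ebi.ac.uk/ena/data/taxonomy/v1/taxon"


def taxonomy_url(endpoint, name):
    """Build a look up URL; quote() turns each space into %20."""
    return f"{TAXONOMY_BASE}/{endpoint}/{quote(name, safe='')}"


def tax_ids(json_text):
    """Extract the taxId field from every entry in a JSON response."""
    return [entry["taxId"] for entry in json.loads(json_text)]


def lookup_tax_ids(endpoint, name):
    """Query the service (needs network access) and return the tax ids found."""
    with urlopen(taxonomy_url(endpoint, name)) as response:
        return tax_ids(response.read().decode("utf-8"))
```

For example, taxonomy_url("scientific-name", "Leptonycteris nivalis") produces the same URL as the first cURL command above, and passing that command's output to tax_ids yields a one item list containing "59456".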
Environmental taxonomic classifications

Every sample object in the ENA must have a taxonomic classification assigned to it. Environmental samples, typically collected for metagenomic studies, cannot have a single organism identifier because they represent an environment with an unknown variety and number of organisms. For this purpose we have entries in the taxonomic database that apply exclusively to environmental samples. You can search for these terms using the methods described above; they tend to have “metagenome” as part of the scientific name.

    curl "www.ebi.ac.uk/ena/data/taxonomy/v1/taxon/suggest-for-submission/marsupial%20meta"
    [
      {
        "taxId": "1477400",
        "scientificName": "marsupial metagenome",
        "displayName": "marsupial metagenome"
      }
    ]

To give an idea of what environmental sample names are available, a list is provided below. This list is not regularly updated, so it may be worth trying the suggest-for-submission look up method described above to see if you can find one that better represents your environmental samples. The following terms go in the scientific name field of the sample object. To find the tax id use the method outlined above (scientific-name endpoint).
For example you can paste the following into your browser to find the tax id for termite fungus garden metagenome:

    http://www.ebi.ac.uk/ena/data/taxonomy/v1/taxon/scientific-name/termite fungus garden metagenome

    metagenome
    synthetic metagenome
    ecological metagenomes
    organismal metagenomes

Specific ecological metagenomes sub nodes

    activated carbon metagenome
    activated sludge metagenome
    aerosol metagenome
    air metagenome
    alkali sediment metagenome
    anaerobic digester metagenome
    anchialine metagenome
    ant fungus garden metagenome
    aquatic metagenome
    aquifer metagenome
    ballast water metagenome
    beach sand metagenome
    bioanode metagenome
    biocathode metagenome
    biofilm metagenome
    biofilter metagenome
    biofloc metagenome
    biogas fermenter metagenome
    bioreactor metagenome
    bioreactor sludge metagenome
    biosolids metagenome
    cave metagenome
    clinical metagenome
    cloud metagenome
    coal metagenome
    cold seep metagenome
    compost metagenome
    concrete metagenome
    coral reef metagenome
    cow dung metagenome
    crude oil metagenome
    decomposition metagenome
    dietary supplements metagenome
    dust metagenome
    electrolysis cell metagenome
    estuary metagenome
    fermentation metagenome
    fertilizer metagenome
    floral nectar metagenome
    flotsam metagenome
    food contamination metagenome
    food fermentation metagenome
    food metagenome
    food production metagenome
    freshwater metagenome
    freshwater sediment metagenome
    fuel tank metagenome
    gas well metagenome
    glacier lake metagenome
    glacier metagenome
    groundwater metagenome
    halite metagenome
    herbal medicine metagenome
    honey metagenome
    hospital metagenome
    hot springs metagenome
    HVAC metagenome
    hydrocarbon metagenome
    hydrothermal vent metagenome
    hypersaline lake metagenome
    hyphosphere metagenome
    hypolithon metagenome
    ice metagenome
    indoor metagenome
    industrial waste metagenome
    interstitial water metagenome
    lagoon metagenome
    lake water metagenome
    landfill metagenome
    leaf litter metagenome
    lichen crust metagenome
    lobster shell metagenome
    mangrove metagenome
    manure metagenome
    marine metagenome
    marine plankton metagenome
    marine sediment metagenome
    metal metagenome
    microbial fuel cell metagenome
    microbial mat metagenome
    milk metagenome
    mine drainage metagenome
    mine tailings metagenome
    mixed culture metagenome
    money metagenome
    moonmilk metagenome
    mud volcano metagenome
    museum specimen metagenome
    musk metagenome
    neuston metagenome
    oasis metagenome
    oil field metagenome
    oil metagenome
    oil production facility metagenome
    oil sands metagenome
    outdoor metagenome
    paper pulp metagenome
    parchment metagenome
    peat metagenome
    periphyton metagenome
    permafrost metagenome
    phytotelma metagenome
    pitcher plant inquiline metagenome
    plastisphere metagenome
    pond metagenome
    poultry litter metagenome
    power plant metagenome
    probiotic metagenome
    retting metagenome
    rhizoplane metagenome
    rhizosphere metagenome
    rice paddy metagenome
    riverine metagenome
    rock metagenome
    rock porewater metagenome
    root associated fungus metagenome
    saline spring metagenome
    salt lake metagenome
    salt marsh metagenome
    salt mine metagenome
    saltern metagenome
    sand metagenome
    seawater metagenome
    sediment metagenome
    shale gas metagenome
    silage metagenome
    sludge metagenome
    snow metagenome
    snowblower vent metagenome
    soda lake metagenome
    soil crust metagenome
    soil metagenome
    solid waste metagenome
    steel metagenome
    stromatolite metagenome
    subsurface metagenome
    surface metagenome
    tar pit metagenome
    termitarium metagenome
    termite fungus garden metagenome
    terrestrial metagenome
    tidal flat metagenome
    tin mine metagenome
    tobacco metagenome
    tomb wall metagenome
    urban metagenome
    wastewater metagenome
    wetland metagenome
    whale fall metagenome
    wine metagenome
    wood decay metagenome

organismal metagenomes sub nodes

    algae metagenome
    annelid metagenome
    ant metagenome
    aquatic viral metagenome
    bat metagenome
    bear gut metagenome
    beetle metagenome
    bird metagenome
    blood metagenome
    bovine gut metagenome
    bovine metagenome
    cetacean metagenome
    chicken gut metagenome
    ciliate metagenome
    coral metagenome
    crab metagenome
    crustacean metagenome
    ctenophore metagenome
    dinoflagellate metagenome
    ear metagenome
    echinoderm metagenome
    endophyte metagenome
    epibiont metagenome
    eye metagenome
    feces metagenome
    feline metagenome
    fish gut metagenome
    fish metagenome
    flower metagenome
    fossil metagenome
    frog metagenome
    fungus metagenome
    gill metagenome
    gonad metagenome
    grain metagenome
    grasshopper gut metagenome
    gut metagenome
    honeybee metagenome
    human bile metagenome
    human blood metagenome
    human brain metagenome
    human eye metagenome
    human gut metagenome
    human gut metagenome gcode 4
    human lung metagenome
    human metagenome
    human milk metagenome
    human nasopharyngeal metagenome
    human oral metagenome
    human reproductive system metagenome
    human saliva metagenome
    human semen metagenome
    human skeleton metagenome
    human skin metagenome
    human tracheal metagenome
    human vaginal metagenome
    hydrozoan metagenome
    insect gut metagenome
    insect metagenome
    invertebrate gut metagenome
    invertebrate metagenome
    jellyfish metagenome
    koala metagenome
    leaf metagenome
    lichen metagenome
    liver metagenome
    lung metagenome
    marsupial metagenome
    mite metagenome
    mollusc metagenome
    mosquito metagenome
    moss metagenome
    mouse gut metagenome
    mouse metagenome
    mouse skin metagenome
    nematode metagenome
    oral metagenome
    oral-nasopharyngeal metagenome
    ovine metagenome
    oyster metagenome
    parasite metagenome
    phage metagenome
    phyllosphere metagenome
    pig gut metagenome
    pig metagenome
    placenta metagenome
    plant metagenome
    pollen metagenome
    primate metagenome
    psyllid metagenome
    rat gut metagenome
    rat metagenome
    reproductive system metagenome
    respiratory tract metagenome
    rodent metagenome
    root metagenome
    scorpion gut metagenome
    sea anemone metagenome
    sea squirt metagenome
    sea urchin metagenome
    seagrass metagenome
    seed metagenome
    sheep gut metagenome
    sheep metagenome
    shoot metagenome
    shrimp gut metagenome
    skin metagenome
    snake metagenome
    spider metagenome
    sponge metagenome
    stomach metagenome
    symbiont metagenome
    termite gut metagenome
    termite metagenome
    tick metagenome
    upper respiratory tract metagenome
    urine metagenome
    urogenital metagenome
    vaginal metagenome
    viral metagenome
    wallaby gut metagenome
    wasp metagenome
    zebrafish metagenome