The Ensembl online training series 2016
Ben Moore
Ensembl Outreach
EMBL-EBI

This webinar course
Date | Webinar topic | Instructor
24th March | Introduction to Ensembl | Emily Perry
31st March | Ensembl genes | Denise Carvalho-Silva
7th April | Data export with BioMart | Helen Sparrow
14th April | Variation data in Ensembl and the Ensembl VEP | Denise Carvalho-Silva
21st April | Comparing genes and genomes with Ensembl Compara | Helen Sparrow
28th April | Finding features that regulate genes – the Ensembl Regulatory Build | Emily Perry
5th May | Uploading your data to Ensembl and advanced ways to access Ensembl data | Ben Moore

Objectives
• What is Ensembl?
• What type of data can you get in Ensembl?
• How to navigate the Ensembl browser website.
• Where to go for help and documentation.

Structure
Presentation: What the data/tool is. How we produce/process the data.
Demo: Getting the data. Using the tool.
Exercises: On the Train online course.

Questions?
• Ask questions in the Chat box in the webinar interface
• My Ensembl colleagues will respond during the webinar
• There’s no threading so please respond with @username
Helen Sparrow, Emily Perry, Denise Carvalho-Silva

Poll Questions
- Poll 1: Did you attend the previous webinars?
- Poll 2: Have you done the previous exercises?

Course exercises
http://www.ebi.ac.uk/training/online/course/ensemblbrowser-webinar-series-2016
This text will be replaced by a YouTube (link to YouKu too) video of the webinar and a pdf of the slides.
A link to exercises and their solutions will appear in the page hierarchy.
The “next page” will be the exercises.

Get help with the exercises
• Use the exercise solutions in the online course
• Join our Facebook group and discuss the exercises with everybody (see the online course for the link)
• Email us [email protected]

Custom Data and Advanced Data Access
EBI is an Outstation of the European Molecular Biology Laboratory.

Viewing your own data in Ensembl
Add custom tracks with your own data:
- BAM files
- GTF/GFF
- BED/BEDGraph files
- PSL
- VCF
- Pairwise interactions
- BigWig
http://www.ensembl.org/info/website/upload/index.html#formats

Hands on
We’re going to map large-scale deletions from patients with microcephaly and developmental delay by uploading BED files.
    chr5 36821632 37091234 P1
    chr5 36731476 36978306 P2
    chr5 36908552 37108671 P3

Advanced data access
• Accessing data at different scales:
    • Full database download from the FTP site
    • Direct database access with MySQL
    • Programmatic access with the Perl API
    • Fast and flexible access with the REST API

Access scales
[Figure: access methods placed along a scale running from "One by one" through "Groups" to "Whole genome". Labels shown: Main browser, Mobile site, BioMart, REST API, VEP, Perl API, MySQL, FTP.]

FTP
• Files of our complete database:
    • Genomic, cDNA, CDS, ncRNA and protein sequence (FASTA)
    • Annotated sequence (EMBL, GenBank)
    • Gene sets (GTF, GFF)
    • Whole-genome multiple and gene-based multiple alignments (MAF)
    • Variants (VCF, GVF)
    • Constrained elements (BED)
    • Regulatory features (BED, BigWig)
    • RNA-Seq files (BAM, BigWig)
    • MySQL database

Access FTP
• Your favourite FTP client
• FTP site: ftp://ftp.ensembl.org/pub/
• FTP downloads page: http://www.ensembl.org/info/data/ftp/index.html

FTP files are big
• Multiple Mb/Gb
• Lots of time to download/unzip
• Do you really need this data?
• Make sure it’s the right file before you download.

FTP site summary
Skills needed: Web-browsing or FTP client use. Handling and parsing file types.
Scalability: Whole database only.
Speed: Many minutes to download and decompress a file.
Difficulty to query: Files easy to download and decompress.
Long-term: New files with each release. File types stay the same.
Sequences?
Yes

Ensembl data through MySQL
• Direct database querying using MySQL queries
• http://www.ensembl.org/info/data/mysql.html

    mysql -u anonymous -h ensembldb.ensembl.org
    mysql> use homo_sapiens_core_82_38;
    mysql> SELECT gene.stable_id
           FROM gene, xref
           WHERE gene.display_xref_id = xref.xref_id
           AND xref.display_label LIKE 'brca2';

MySQL schema

Ensembl data through MySQL
• I have an Ensembl gene ID - ENSG00000079950
• I want to get the EntrezGene ID for this gene
• We need to refer to the schema: http://www.ensembl.org/info/docs/api/core/core_schema.html

The schema
Get the xref ID: select xref.display_label
Choose our tables: from xref, object_xref, gene, external_db
Get the gene: gene.stable_id = "ENSG00000079950"
Link the object xref to the gene ID: object_xref.ensembl_id = gene.gene_id
Link the xref to the object xref: xref.xref_id = object_xref.xref_id
Specify you want gene xrefs: object_xref.ensembl_object_type = 'Gene'
Link the xref to the external database: external_db.external_db_id = xref.external_db_id
Choose the external database: external_db.db_display_name = "EntrezGene"

Our query
Use port 3337 for GRCh37.

    /usr/local/mysql/bin/mysql -h ensembldb.ensembl.org -u anonymous -P 3306
    use homo_sapiens_core_82_38;
    select gene.stable_id, xref.display_label
    from xref, object_xref, gene, external_db
    where xref.xref_id = object_xref.xref_id
      and object_xref.ensembl_id = gene.gene_id
      and object_xref.ensembl_object_type = 'Gene'
      and gene.stable_id = "ENSG00000079950"
      and external_db.external_db_id = xref.external_db_id
      and external_db.db_display_name = "EntrezGene";

MySQL summary
Skills needed: MySQL querying. Understanding of the schema.
Scalability: Can query whole genome.
Speed: Minimum query speed 9ms. Time cost for complexity of query.
Difficulty to query: Queries can get very complicated if extracting data from multiple tables and are often not reusable.
Long-term: The schema can change between releases.
Sequences?
No

Ensembl data through the Perl API
• Database querying using Perl scripts
• We use object-oriented Perl

    my $gene_adaptor = $registry->get_adaptor( 'human', 'core', 'gene' );
    my $gene = $gene_adaptor->fetch_by_display_label( 'brca2' );
    print $gene->stable_id, "\n";

http://www.ensembl.org/info/data/api.html

Perl API
• Learn Perl
• Download API modules (download more modules)
• Learn Ensembl API
• Write scripts
• Get out all possible Ensembl data. Output in any format you like.

Learn to use the API
• EBI Train Online course: http://www.ebi.ac.uk/training/online/course/ensembl-filmed-apiworkshop
• API documentation: http://www.ensembl.org/info/docs/Doxygen/core-api/index.html

Ensembl data through the Perl API
• I want a script that gets a gene name from the command line and prints its sequence.
• We’ve already learnt how to use the API and know our way around the documentation.
• We need to write a script.

Perl script

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Load the Ensembl API registry
    use Bio::EnsEMBL::Registry;
    my $reg = "Bio::EnsEMBL::Registry";
    $reg->load_registry_from_db(
        -host => 'ensembldb.ensembl.org',
        -user => 'anonymous'
    );
    # Use port 3337 for GRCh37

    # Get the gene name from the command line
    my $gene_name = shift;

    # Get the gene adaptor - this allows you to fetch genes from the database
    my $gene_adaptor = $reg->get_adaptor('human', 'core', 'gene');

    # Get genes using the gene adaptor
    my @genes = @{ $gene_adaptor->fetch_all_by_external_name($gene_name) };

    # move through the genes one-by-one
    while (my $gene = shift @genes) {
        # print the gene name, ID and sequence
        print "> ", $gene_name, " ", $gene->stable_id, "\n", $gene->seq, "\n";
    }

Perl API summary
Skills needed: Programming in Perl. Understanding of features in Ensembl.
Scalability: Can query whole database.
Speed: 1s for start-up plus minimum query speed 50ms. Time cost per datapoint.
Difficulty to query: Scripts needed, but these can be easily reused and adapted. API links data easily.
Long-term: The API takes database changes into account, so scripts do not need to change between releases.
Sequences? Yes

Data access via REST
• We’ve had a Perl API for a long time …
• … but not everybody works in Perl
• Our RESTful service allows language-agnostic access to our data.
• Visit rest.ensembl.org for installation, documentation and examples

What is REST?
• REST allows you to query the database using simple URLs, giving output in plain text format, e.g.
    http://rest.ensembl.org/xrefs/symbol/homo_sapiens/BRCA2?content-type=application/json
  gives
    [{"type":"gene","id":"ENSG00000139618"},{"type":"gene","id":"LRG_293"}]
• This means you can write scripts in any language to construct these URLs and read their output

Single endpoint demo
• I want to get a gene sequence from an Ensembl gene ID
• I need to use the docs to find an appropriate endpoint: http://rest.ensembl.org/
Use grch37.rest.ensembl.org for GRCh37
http://rest.ensembl.org/sequence/id/ENSG00000157764

Scripting demo
• I want a script that gets a gene name from the command line and prints its sequence.
• There’s no one endpoint that does this action, so I have to combine two endpoints with a script

Python script

    #!/usr/bin/env python

    # Get modules needed for script
    import sys
    import json
    import httplib2

    http = httplib2.Http(".cache")

    # Get the gene name from the command line
    gene_name = sys.argv[1]

    # define the general URL parameters
    server = "http://rest.ensembl.org"

    # define REST query to get the gene ID from the gene name
    ext_get_gene = "/xrefs/symbol/homo_sapiens/" + gene_name + "?"

    # submit the query
    resp, get_genes = http.request(server + ext_get_gene, method="GET",
                                   headers={"Content-Type": "application/json"})

    # decode the json output
    genes = json.loads(get_genes)

    # move through the genes one-by-one
    for gene in genes:
        # define the REST query to get the sequence from the gene
        ext_get_seq = "/sequence/id/" + gene["id"] + "?"

        # submit the query
        resp, get_seq = http.request(server + ext_get_seq, method="GET",
                                     headers={"Content-Type": "application/json"})

        # decode the json output
        seq = json.loads(get_seq)

        # print the gene name, ID and sequence
        print("> %s %s\n%s" % (gene_name, gene["id"], seq["seq"]))

POST demo
• Some endpoints can perform multiple queries at once using POST
• Use Postman https://www.getpostman.com/

POST demo
• Choose POST
• Input endpoint

POST demo
• Choose Body
• Input IDs in json

    { "ids" : [
      "ENST00000000233", "ENST00000000412", "ENST00000000442", "ENST00000001008",
      "ENST00000001146", "ENST00000002125", "ENST00000002165", "ENST00000002501",
      "ENST00000002596", "ENST00000002829", "ENST00000003084", "ENST00000003100",
      "ENST00000003302", "ENST00000003583", "ENST00000003912", "ENST00000004103",
      "ENST00000004531", "ENST00000004982", "ENST00000005082", "ENST00000005178",
      "ENST00000005180", "ENST00000005226", "ENST00000005257", "ENST00000005259",
      "ENST00000005260", "ENST00000005284", "ENST00000005286", "ENST00000005340",
      "ENST00000005374", "ENST00000005386", "ENST00000005558", "ENST00000005756",
      "ENST00000005995", "ENST00000006015", "ENST00000006053", "ENST00000006251",
      "ENST00000006275", "ENST00000006658", "ENST00000006724", "ENST00000006750",
      "ENST00000006777", "ENST00000007264", "ENST00000007390", "ENST00000007414",
      "ENST00000007510", "ENST00000007516", "ENST00000007699", "ENST00000007708",
      "ENST00000007722", "ENST00000007735"
    ] }

POST demo
• Choose Headers
• Click on the pencil
• Add Content-Type: application/json by typing the first few letters then selecting

REST summary
Skills needed: Understanding of features in Ensembl. Possibly programming in any language.
Scalability: With programming can query whole database.
Speed: Minimum query speed 150ms. Time cost per datapoint.
Difficulty to query: Need to construct URLs. May need scripts to dissect data.
Long-term: The API takes database changes into account, so URLs do not need to change between releases.
Sequences? Yes

Webinar course feedback
We will send a SurveyMonkey feedback survey for this webinar series by e-mail: PLEASE fill it out to tell us whether you have enjoyed and benefitted from the course!

Host an Ensembl course
We can teach an Ensembl course at your institute for free (except trainers’ expenses).
Email me: [email protected]
Browser course: ½-2 day course on the Ensembl browser, aimed at wet-lab scientists. One trainer.
API course: 1-4 day course on the Ensembl Perl API, aimed at bioinformaticians. 1-4 trainers.
http://www.ensembl.info/workshops/

Help and documentation
Course online: http://www.ebi.ac.uk/training/online/subjects/11
Tutorials: www.ensembl.org/info/website/tutorials
Flash animations: www.youtube.com/user/EnsemblHelpdesk, http://u.youku.com/Ensemblhelpdesk
Email us: [email protected]
Ensembl public mailing lists: [email protected], [email protected]

Publications
http://www.ensembl.org/info/about/publications.html
Yates, A. et al. Ensembl 2016. Nucleic Acids Research. http://nar.oxfordjournals.org/content/early/2015/12/19/nar.gkv1157.full
Xosé M. Fernández-Suárez and Michael K. Schuster. Using the Ensembl Genome Server to Browse Genomic Sequence Data. Current Protocols in Bioinformatics 1.15.1-1.15.48 (2010). www.ncbi.nlm.nih.gov/pubmed/20521244
Giulietta M Spudich and Xosé M Fernández-Suárez. Touring Ensembl: A practical guide to genome browsing. BMC Genomics 11:295 (2010). www.biomedcentral.com/1471-2164/11/295

Follow us
www.facebook.com/Ensembl.org
@Ensembl
www.ensembl.info

Acknowledgements
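
Scripted version of the POST demo
The POST demo in these slides builds its request by hand in Postman. The same request could probably be sent from a script, and the following is a minimal Python 3 sketch of that idea rather than anything shown in the webinar. Several things here are assumptions: the slides do not name the endpoint, so /lookup/id is used as a plausible choice that accepts an "ids" list; the https form of the server address is used instead of the http form on the slides; and the helper names build_post_request and post_ids are hypothetical.

```python
import json
import urllib.request

SERVER = "https://rest.ensembl.org"  # assumption: https form of the slides' server
ENDPOINT = "/lookup/id"              # assumption: the slides do not name the endpoint


def build_post_request(ids, server=SERVER, endpoint=ENDPOINT):
    """Build (but do not send) a POST request carrying the IDs as a JSON body."""
    body = json.dumps({"ids": list(ids)}).encode("utf-8")
    headers = {
        "Content-Type": "application/json",  # same header as added in Postman
        "Accept": "application/json",
    }
    return urllib.request.Request(
        server + endpoint, data=body, headers=headers, method="POST"
    )


def post_ids(ids):
    """Send the request and decode the JSON reply (needs network access)."""
    with urllib.request.urlopen(build_post_request(ids)) as response:
        return json.loads(response.read().decode("utf-8"))


# Inspect the request for the first two transcript IDs from the demo
# without contacting the server.
request = build_post_request(["ENST00000000233", "ENST00000000412"])
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Calling post_ids with the full ID list from the demo would send the request. Building the request separately from sending it makes it easy to check the URL, headers and body before anything goes over the network.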