The Ensembl online training series 2016 - EMBL-EBI

Ben Moore
Ensembl Outreach
EMBL-EBI
This webinar course
Date | Webinar topic | Instructor
24th March | Introduction to Ensembl | Emily Perry
31st March | Ensembl genes | Denise Carvalho-Silva
7th April | Data export with BioMart | Helen Sparrow
14th April | Variation data in Ensembl and the Ensembl VEP | Denise Carvalho-Silva
21st April | Comparing genes and genomes with Ensembl Compara | Helen Sparrow
28th April | Finding features that regulate genes – the Ensembl Regulatory Build | Emily Perry
5th May | Uploading your data to Ensembl and advanced ways to access Ensembl data | Ben Moore
Objectives
• What is Ensembl?
• What type of data can you get in Ensembl?
• How to navigate the Ensembl browser website.
• Where to go for help and documentation.
Structure
Presentation:
• What the data/tool is
• How we produce/process the data
Demo:
• Getting the data
• Using the tool
Exercises:
• On the Train online course
Questions?
• Ask questions in the Chat box in the webinar interface.
• My Ensembl colleagues will respond during the webinar: Helen Sparrow, Emily Perry and Denise Carvalho-Silva.
• There’s no threading, so please respond with @username.
Poll Questions
- Poll 1: Did you attend the previous webinars?
- Poll 2: Have you done the previous exercises?
Course exercises
http://www.ebi.ac.uk/training/online/course/ensemblbrowser-webinar-series-2016
• This text will be replaced by a YouTube video of the webinar (with a link to YouKu too) and a PDF of the slides.
• A link to the exercises and their solutions will appear in the page hierarchy.
• The “next page” will be the exercises.
Get help with the exercises
• Use the exercise solutions in the online course.
• Join our Facebook group and discuss the exercises with everybody (see the online course for the link).
• Email us: [email protected]
Custom Data and Advanced Data Access
EBI is an Outstation of the European Molecular Biology Laboratory.
Viewing your own data in Ensembl
Add custom tracks with your own data:
- BAM files
- GTF/GFF
- BED/BEDGraph files
- PSL
- VCF
- Pairwise interactions
- BigWig
http://www.ensembl.org/info/website/upload/index.html#formats
Hands on
We’re going to map large-scale deletions from patients with
microcephaly and developmental delay by uploading BED
files.
chr5 36821632 37091234 P1
chr5 36731476 36978306 P2
chr5 36908552 37108671 P3
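Before uploading, it can help to sanity-check the BED lines and see which region all three deletions share. The following is a minimal sketch, not part of any Ensembl tool: the function names `parse_bed` and `shared_region` are made up for illustration, and it assumes the standard BED convention of 0-based, half-open coordinates.

```python
# Illustrative sketch: parse the three patient deletions and find the
# interval that is deleted in every patient.
BED_TEXT = """\
chr5 36821632 37091234 P1
chr5 36731476 36978306 P2
chr5 36908552 37108671 P3
"""


def parse_bed(text):
    """Return (chrom, start, end, name) tuples from whitespace-separated BED lines."""
    records = []
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines, comments and track/browser header lines.
        if len(fields) < 3 or fields[0] in ("track", "browser") or fields[0].startswith("#"):
            continue
        name = fields[3] if len(fields) > 3 else ""
        records.append((fields[0], int(fields[1]), int(fields[2]), name))
    return records


def shared_region(records):
    """Return (chrom, start, end) covered by every record, or None if there is none."""
    chroms = {record[0] for record in records}
    if len(chroms) != 1:
        return None
    start = max(record[1] for record in records)
    end = min(record[2] for record in records)
    if start >= end:
        return None
    return (chroms.pop(), start, end)


deletions = parse_bed(BED_TEXT)
region = shared_region(deletions)
print(region)
```

The shared interval is a natural place to start looking for candidate genes once the tracks are displayed in the browser.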
Advanced data access
• Accessing data at different scales:
• Full database download from the FTP site
• Direct database access with MySQL
• Programmatic access with the Perl API
• Fast and flexible access with the REST API
Access scales
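[Figure: access methods arranged by the scale of query they suit, from one-by-one lookups through groups to the whole genome. Methods shown: main browser, mobile site, BioMart, REST API, VEP, Perl API, MySQL, FTP.]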
FTP
• Files of our complete database:
• Genomic, cDNA, CDS, ncRNA and protein sequence (FASTA)
• Annotated sequence (EMBL, GenBank)
• Gene sets (GTF, GFF)
• Whole-genome multiple and gene-based multiple alignments (MAF)
• Variants (VCF, GVF)
• Constrained elements (BED)
• Regulatory features (BED, BigWig)
• RNA-Seq files (BAM, BigWig)
• MySQL database
Access FTP
Your favourite FTP client
FTP site
ftp://ftp.ensembl.org/pub/
FTP downloads page
http://www.ensembl.org/info/data/ftp/index.html
FTP files are big
• Multiple MB/GB
• Lots of time to download/unzip
• Do you really need this data?
• Make sure it’s the right file before you download.
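One way to act on this advice is to ask the FTP server for a file's size before committing to the download. The sketch below uses Python's standard `ftplib`; the helper names `human_size`, `remote_size` and `check_file` are illustrative, and the path you pass in is an assumption you must fill in from the FTP downloads page.

```python
from ftplib import FTP


def human_size(n_bytes):
    """Format a byte count as B, KB, MB or GB."""
    size = float(n_bytes)
    for unit in ("B", "KB", "MB", "GB"):
        if size < 1024 or unit == "GB":
            return "%.1f %s" % (size, unit)
        size /= 1024


def remote_size(ftp, path):
    """Ask an already-connected FTP session for the size of a remote file."""
    # Many servers only answer SIZE reliably in binary mode.
    ftp.voidcmd("TYPE I")
    return ftp.size(path)


def check_file(path, host="ftp.ensembl.org"):
    """Log in anonymously and report the file size (needs network access)."""
    with FTP(host) as ftp:
        ftp.login()
        return human_size(remote_size(ftp, path))
```

Calling `check_file` with a path below ftp://ftp.ensembl.org/pub/ returns a string such as a size in MB or GB, which gives a rough idea of download and decompression time.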
FTP site summary
Skills needed: Web-browsing or FTP client use. Handling and parsing file types.
Scalability: Whole database only.
Speed: Many minutes to download and decompress a file.
Difficulty to query: Files are easy to download and decompress.
Long-term: New files with each release. File types stay the same.
Sequences? Yes
Ensembl data through MySQL
• Direct database querying using MySQL queries
• http://www.ensembl.org/info/data/mysql.html
mysql -u anonymous -h ensembldb.ensembl.org
mysql> use homo_sapiens_core_82_38;
mysql> SELECT gene.stable_id
       FROM gene, xref
       WHERE gene.display_xref_id = xref.xref_id
       AND xref.display_label LIKE 'brca2';
MySQL schema
Ensembl data through MySQL
• I have an Ensembl gene ID - ENSG00000079950
• I want to get the EntrezGene ID for this gene
• We need to refer to the schema:
http://www.ensembl.org/info/docs/api/core/core_schema.html
The schema
• Get the xref ID: select xref.display_label
• Choose our tables: from xref, object_xref, gene, external_db
• Get the gene: gene.stable_id = "ENSG00000079950"
• Link the object xref to the gene ID: object_xref.ensembl_id = gene.gene_id
• Link the xref to the object xref: xref.xref_id = object_xref.xref_id
• Specify you want gene xrefs: object_xref.ensembl_object_type = 'Gene'
• Link the xref to the external database: external_db.external_db_id = xref.external_db_id
• Choose the external database: external_db.db_display_name = "EntrezGene"
Our query
/usr/local/mysql/bin/mysql -h ensembldb.ensembl.org -u anonymous -P 3306
(Use port 3337 for GRCh37.)
use homo_sapiens_core_82_38;
select gene.stable_id, xref.display_label
from xref, object_xref, gene, external_db
where xref.xref_id = object_xref.xref_id and
object_xref.ensembl_id = gene.gene_id and
object_xref.ensembl_object_type = "Gene" and
gene.stable_id = "ENSG00000079950" and
external_db.external_db_id = xref.external_db_id and
external_db.db_display_name = "EntrezGene";
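To see how the four-table join behaves without connecting to the public server, the query can be tried on a toy database. This is only a sketch: it uses Python's built-in `sqlite3` rather than MySQL, the tables carry just the columns the query touches, and every row except the gene stable ID is invented (the labels starting with `TOY_` are not real identifiers).

```python
import sqlite3

# In-memory toy database with a cut-down version of the four tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE gene (gene_id INTEGER, stable_id TEXT);
CREATE TABLE object_xref (ensembl_id INTEGER, ensembl_object_type TEXT, xref_id INTEGER);
CREATE TABLE xref (xref_id INTEGER, external_db_id INTEGER, display_label TEXT);
CREATE TABLE external_db (external_db_id INTEGER, db_display_name TEXT);

-- Invented rows: one gene, two external databases, three xrefs.
INSERT INTO gene VALUES (1, 'ENSG00000079950');
INSERT INTO external_db VALUES (10, 'EntrezGene'), (20, 'OtherDB');
INSERT INTO xref VALUES (100, 10, 'TOY_ENTREZ_ID'),
                        (101, 20, 'TOY_OTHER_ID'),
                        (102, 10, 'TOY_TRANSCRIPT_XREF');
-- The last row belongs to a transcript that happens to share internal ID 1.
INSERT INTO object_xref VALUES (1, 'Gene', 100),
                               (1, 'Gene', 101),
                               (1, 'Transcript', 102);
""")

QUERY = """
SELECT gene.stable_id, xref.display_label
FROM xref, object_xref, gene, external_db
WHERE xref.xref_id = object_xref.xref_id
  AND object_xref.ensembl_id = gene.gene_id
  AND object_xref.ensembl_object_type = 'Gene'
  AND gene.stable_id = ?
  AND external_db.external_db_id = xref.external_db_id
  AND external_db.db_display_name = 'EntrezGene'
"""

rows = conn.execute(QUERY, ("ENSG00000079950",)).fetchall()
print(rows)
```

The transcript row shows why the object-type condition matters: internal IDs are only unique within one feature type, so without it a transcript xref could be reported against the gene.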
MySQL summary
Skills needed: MySQL querying. Understanding of the schema.
Scalability: Can query whole genome.
Speed: Minimum query speed 9ms. Time cost for complexity of query.
Difficulty to query: Queries can get very complicated if extracting data from multiple tables and are often not reusable.
Long-term: The schema can change between releases.
Sequences? No
Ensembl data through the Perl API
• Database querying using Perl scripts
• We use object-oriented Perl
my $gene_adaptor = $registry->get_adaptor( 'human', 'core', 'gene' );
my $gene = $gene_adaptor->fetch_by_display_label( 'brca2' );
print $gene->stable_id, "\n";
http://www.ensembl.org/info/data/api.html
Perl API
[Figure: steps to using the Perl API]
• Learn Perl
• Download the API modules
• Learn the Ensembl API
• Write scripts (download more modules)
• Get out all possible Ensembl data. Output in any format you like.
Learn to use the API
EBI Train Online course: http://www.ebi.ac.uk/training/online/course/ensembl-filmed-apiworkshop
API documentation: http://www.ensembl.org/info/docs/Doxygen/core-api/index.html
Ensembl data through the Perl API
• I want a script that gets a gene name from the command line and prints its sequence.
• We’ve already learnt how to use the API and know our way around the documentation.
• We need to write a script.
Perl script
#!/usr/bin/perl
use strict;
use warnings;

# Load the Ensembl API registry
use Bio::EnsEMBL::Registry;
my $reg = "Bio::EnsEMBL::Registry";
$reg->load_registry_from_db(
    -host => 'ensembldb.ensembl.org',
    -user => 'anonymous'
);
# (Use port 3337 for GRCh37)

# Get the gene name from the command line
my $gene_name = shift;

# Get the gene adaptor - this allows you to fetch genes from the database
my $gene_adaptor = $reg->get_adaptor('human', 'core', 'gene');

# Get genes using the gene adaptor
my @genes = @{ $gene_adaptor->fetch_all_by_external_name($gene_name) };

# move through the genes one-by-one
while (my $gene = shift @genes) {
    # print the gene name, ID and sequence
    print "> ", $gene_name, " ", $gene->stable_id, "\n", $gene->seq, "\n";
}
Perl API summary
Skills needed: Programming in Perl. Understanding of features in Ensembl.
Scalability: Can query whole database.
Speed: 1s for start-up plus minimum query speed 50ms. Time cost per datapoint.
Difficulty to query: Scripts needed, but these can be easily reused and adapted. API links data easily.
Long-term: The API takes database changes into account, so scripts do not need to change between releases.
Sequences? Yes
Data access via REST
• We’ve had a Perl API for a long time …
• … but not everybody works in Perl
• Our RESTful service allows language-agnostic access to our data.
• Visit rest.ensembl.org for installation, documentation and examples.
What is REST?
• REST allows you to query the database using simple URLs, giving output in plain text format,
e.g. http://rest.ensembl.org/xrefs/symbol/homo_sapiens/BRCA2?content-type=application/json
gives
[{"type":"gene","id":"ENSG00000139618"},{"type":"gene","id":"LRG_293"}]
• This means you can write scripts in any language to construct these URLs and read their output.
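As a small illustration of "construct these URLs and read their output", the sketch below builds the symbol lookup URL shown above and decodes the example response. The function names `symbol_lookup_url` and `gene_ids` are made up for this example, and the response text is the one quoted on the slide rather than a live query.

```python
import json
from urllib.parse import quote, urlencode

SERVER = "http://rest.ensembl.org"


def symbol_lookup_url(species, symbol, server=SERVER):
    """Build the xrefs/symbol URL for a species and gene symbol."""
    path = "/xrefs/symbol/%s/%s" % (quote(species), quote(symbol))
    # safe="/" keeps application/json readable instead of escaping the slash.
    query = urlencode({"content-type": "application/json"}, safe="/")
    return server + path + "?" + query


def gene_ids(json_text):
    """Return the IDs of all gene-type hits in a decoded response."""
    return [hit["id"] for hit in json.loads(json_text) if hit["type"] == "gene"]


# Example response copied from the slide.
EXAMPLE_RESPONSE = '[{"type":"gene","id":"ENSG00000139618"},{"type":"gene","id":"LRG_293"}]'

url = symbol_lookup_url("homo_sapiens", "BRCA2")
ids = gene_ids(EXAMPLE_RESPONSE)
print(url)
print(ids)
```

Note that a symbol can map to more than one ID (here an Ensembl gene and an LRG record), so a script should expect a list.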
Single endpoint demo
• I want to get a gene sequence from an Ensembl gene ID.
• I need to use the docs to find an appropriate endpoint: http://rest.ensembl.org/
http://rest.ensembl.org/sequence/id/ENSG00000157764
(Use grch37.rest.ensembl.org for GRCh37.)
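The same single-endpoint query can be scripted. The sketch below is one possible approach, with hypothetical helper names (`sequence_url`, `fetch_sequence`, `to_fasta`): it picks the server by assembly, requests plain text through the `content-type` URL parameter, and wraps the result as FASTA. Only `fetch_sequence` needs network access.

```python
from urllib.request import urlopen

# Server per assembly, as given on the slide.
SERVERS = {
    "GRCh38": "http://rest.ensembl.org",
    "GRCh37": "http://grch37.rest.ensembl.org",
}


def sequence_url(stable_id, assembly="GRCh38"):
    """Build the sequence/id URL, asking for plain-text output."""
    return "%s/sequence/id/%s?content-type=text/plain" % (SERVERS[assembly], stable_id)


def fetch_sequence(stable_id, assembly="GRCh38"):
    """Download the sequence as a string (needs network access)."""
    with urlopen(sequence_url(stable_id, assembly)) as response:
        return response.read().decode("utf-8").strip()


def to_fasta(header, sequence, width=60):
    """Format a sequence as a FASTA record with fixed-width lines."""
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return ">%s\n%s" % (header, "\n".join(lines))


print(sequence_url("ENSG00000157764"))
print(sequence_url("ENSG00000157764", assembly="GRCh37"))
```

A typical use would be `print(to_fasta("ENSG00000157764", fetch_sequence("ENSG00000157764")))`.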
Scripting demo
• I want a script that gets a gene name from the command line and prints its sequence.
• There’s no one endpoint that does this action, so I have to combine two endpoints with a script.
Python script
#!/usr/bin/env python

# Get modules needed for script
import sys
import json
import httplib2

http = httplib2.Http(".cache")

# Get the gene name from the command line
gene_name = sys.argv[1]

# define the general URL parameters
server = "http://rest.ensembl.org"

# define REST query to get the gene ID from the gene name
ext_get_gene = "/xrefs/symbol/homo_sapiens/" + gene_name + "?"

# submit the query
resp, get_genes = http.request(server + ext_get_gene, method="GET",
                               headers={"Content-Type": "application/json"})

# decode the json output
genes = json.loads(get_genes)

# move through the genes one-by-one
for gene in genes:
    # define the REST query to get the sequence from the gene
    ext_get_seq = "/sequence/id/" + gene["id"] + "?"

    # submit the query
    resp, get_seq = http.request(server + ext_get_seq, method="GET",
                                 headers={"Content-Type": "application/json"})

    # decode the json output
    seq = json.loads(get_seq)

    # print the gene name, ID and sequence
    print("> %s %s\n%s" % (gene_name, gene["id"], seq["seq"]))
POST demo
• Some endpoints can perform multiple queries at once using POST.
• Use Postman: https://www.getpostman.com/
• Choose POST and input the endpoint.
• Choose Body and input the IDs in JSON:
{ "ids" : ["ENST00000000233", "ENST00000000412", "ENST00000000442",
"ENST00000001008", "ENST00000001146", "ENST00000002125", "ENST00000002165",
"ENST00000002501", "ENST00000002596", "ENST00000002829", "ENST00000003084",
"ENST00000003100", "ENST00000003302", "ENST00000003583", "ENST00000003912",
"ENST00000004103", "ENST00000004531", "ENST00000004982", "ENST00000005082",
"ENST00000005178", "ENST00000005180", "ENST00000005226", "ENST00000005257",
"ENST00000005259", "ENST00000005260", "ENST00000005284", "ENST00000005286",
"ENST00000005340", "ENST00000005374", "ENST00000005386", "ENST00000005558",
"ENST00000005756", "ENST00000005995", "ENST00000006015", "ENST00000006053",
"ENST00000006251", "ENST00000006275", "ENST00000006658", "ENST00000006724",
"ENST00000006750", "ENST00000006777", "ENST00000007264", "ENST00000007390",
"ENST00000007414", "ENST00000007510", "ENST00000007516", "ENST00000007699",
"ENST00000007708", "ENST00000007722", "ENST00000007735" ] }
• Choose Headers, click on the pencil and add Content-Type: application/json by typing the first few letters then selecting it.
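The Postman steps can also be scripted. The following is a hedged sketch only: the slides do not name the endpoint, so `/sequence/id` is an assumption, as are the `https` server address, the batch size of 50 (chosen to match the number of IDs in the example body) and the helper names `batches`, `post_request` and `post_all`. Only `post_all` touches the network.

```python
import json
from urllib.request import Request, urlopen

SERVER = "https://rest.ensembl.org"   # assumption: https, so the POST is not redirected
ENDPOINT = "/sequence/id"             # assumption: the slide does not name the endpoint


def batches(ids, size):
    """Split a list of IDs into consecutive chunks of at most `size`."""
    return [ids[i:i + size] for i in range(0, len(ids), size)]


def post_request(ids, endpoint=ENDPOINT, server=SERVER):
    """Build (but do not send) a POST request with the IDs as a JSON body."""
    body = json.dumps({"ids": ids}).encode("utf-8")
    headers = {"Content-Type": "application/json", "Accept": "application/json"}
    return Request(server + endpoint, data=body, headers=headers, method="POST")


def post_all(ids, batch_size=50):
    """Send the IDs in batches and collect the decoded results (needs network)."""
    results = []
    for chunk in batches(ids, batch_size):
        with urlopen(post_request(chunk)) as response:
            # Assumes the endpoint answers with a JSON list, one item per ID.
            results.extend(json.loads(response.read().decode("utf-8")))
    return results


example = post_request(["ENST00000000233", "ENST00000000412"])
print(example.get_method(), example.full_url)
print(example.data.decode("utf-8"))
```

Batching keeps each request body small; check the endpoint's documentation on rest.ensembl.org for the actual maximum number of IDs per POST.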
REST summary
Skills needed: Understanding of features in Ensembl. Possibly programming in any language.
Scalability: With programming, can query the whole database.
Speed: Minimum query speed 150ms. Time cost per datapoint.
Difficulty to query: Need to construct URLs. May need scripts to dissect data.
Long-term: The API takes database changes into account, so URLs do not need to change between releases.
Sequences? Yes
Webinar course feedback
We will send a SurveyMonkey feedback survey for this webinar series by e-mail.
PLEASE fill it out to tell us whether you have enjoyed and benefitted from the course!
Host an Ensembl course
We can teach an Ensembl course at your institute for free (except trainers’ expenses).
Browser course: ½-2 day course on the Ensembl browser, aimed at wet-lab scientists. One trainer.
API course: 1-4 day course on the Ensembl Perl API, aimed at bioinformaticians. 1-4 trainers.
Email me: [email protected]
http://www.ensembl.info/workshops/
Help and documentation
Course online: http://www.ebi.ac.uk/training/online/subjects/11
Tutorials: www.ensembl.org/info/website/tutorials
Flash animations: www.youtube.com/user/EnsemblHelpdesk, http://u.youku.com/Ensemblhelpdesk
Email us: [email protected]
Ensembl public mailing lists: [email protected], [email protected]
Publications
http://www.ensembl.org/info/about/publications.html
Yates, A. et al
Ensembl 2016
Nucleic Acids Research
http://nar.oxfordjournals.org/content/early/2015/12/19/nar.gkv1157.full
Xosé M. Fernández-Suárez and Michael K. Schuster
Using the Ensembl Genome Server to Browse Genomic Sequence Data.
Current Protocols in Bioinformatics 1.15.1-1.15.48 (2010)
www.ncbi.nlm.nih.gov/pubmed/20521244
Giulietta M Spudich and Xosé M Fernández-Suárez
Touring Ensembl: A practical guide to genome browsing
BMC Genomics 11:295 (2010)
www.biomedcentral.com/1471-2164/11/295
Follow us
www.facebook.com/Ensembl.org
@Ensembl
www.ensembl.info
Acknowledgements