Introduction to RDF and the Semantic Web for the life sciences

Simon Jupp
Sample Phenotypes and Ontologies Team
European Bioinformatics Institute
[email protected]
Day 2 practical session
•  Exploring EBI RDF platform
•  Querying EBI resources
•  Federated queries from one SPARQL endpoint to another
Exercise 17
•  Explore the EBI RDF platform at http://www.ebi.ac.uk/rdf
•  A) On the ChEMBL endpoint get ChEMBL activities, assays
and targets for the drug Clotrimazole (CHEMBL104)
•  B) On the Atlas endpoint find expression for
ENSDARG00000042641 (Cyp51)
•  B2) Filter the results to rows whose property type contains “organism_part”
•  C) On the Reactome endpoint find pathways that
reference Cyp51 (http://purl.uniprot.org/uniprot/Q1JPY5)
•  D) Query the UniProt endpoint to describe
http://purl.uniprot.org/uniprot/Q1JPY5
Exercise 17 solution A
PREFIX cco: <http://rdf.ebi.ac.uk/terms/chembl#>
PREFIX chembl_molecule: <http://rdf.ebi.ac.uk/resource/chembl/molecule/>
SELECT ?activity ?assay ?target ?targetcmpt ?uniprot
WHERE {
?activity a cco:Activity ;
cco:hasMolecule chembl_molecule:CHEMBL104 ;
cco:hasAssay ?assay .
?assay cco:hasTarget ?target .
?target cco:hasTargetComponent ?targetcmpt .
?targetcmpt cco:targetCmptXref ?uniprot .
?uniprot a cco:UniprotRef
}
Exercise 17 solution B
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX atlasterms: <http://rdf.ebi.ac.uk/terms/atlas/>
PREFIX identifiers:<http://identifiers.org/ensembl/>
SELECT distinct ?diffValue ?expUri ?propertyType ?propertyValue ?pvalue
WHERE {
?expUri atlasterms:hasAnalysis ?analysis .
?analysis atlasterms:hasExpressionValue ?value .
?value rdfs:label ?diffValue .
?value atlasterms:hasFactorValue ?factor .
?factor atlasterms:propertyType ?propertyType .
?factor atlasterms:propertyValue ?propertyValue .
?value atlasterms:pValue ?pvalue .
?value atlasterms:isMeasurementOf ?probe .
?probe atlasterms:dbXref identifiers:ENSDARG00000042641 .
}
Exercise 17 solution B2
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX atlasterms: <http://rdf.ebi.ac.uk/terms/atlas/>
PREFIX identifiers:<http://identifiers.org/ensembl/>
SELECT distinct ?diffValue ?expUri ?propertyType ?propertyValue ?pvalue
WHERE {
?expUri atlasterms:hasAnalysis ?analysis .
?analysis atlasterms:hasExpressionValue ?value .
?value rdfs:label ?diffValue .
?value atlasterms:hasFactorValue ?factor .
?factor atlasterms:propertyType ?propertyType .
?factor atlasterms:propertyValue ?propertyValue .
?value atlasterms:pValue ?pvalue .
?value atlasterms:isMeasurementOf ?probe .
?probe atlasterms:dbXref identifiers:ENSDARG00000042641 .
FILTER regex (?propertyType, "organism_part")
}
Exercise 17 solution C
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX biopax3: <http://www.biopax.org/release/biopax-level3.owl#>
SELECT DISTINCT ?pathway ?pathwayname
WHERE
{?pathway rdf:type biopax3:Pathway .
?pathway biopax3:displayName ?pathwayname .
?pathway biopax3:pathwayComponent ?reaction .
?reaction rdf:type biopax3:BiochemicalReaction .
{
{?reaction ?rel ?protein .}
UNION
{
?reaction ?rel ?complex .
?complex rdf:type biopax3:Complex .
?complex ?comp ?protein .
}}
?protein rdf:type biopax3:Protein .
?protein biopax3:entityReference <http://purl.uniprot.org/uniprot/Q1JPY5>
}
LIMIT 100
Exercise 17 solution D
DESCRIBE <http://purl.uniprot.org/uniprot/Q1JPY5>
Federated querying
•  One of the biggest advantages of SPARQL and triple
stores is the ability to federate queries across endpoints
•  Integrating data at query time with SPARQL
[Diagram: GXA expression values linked to UniProt proteins and their GO annotations]
Federated SPARQL
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX atlasterms: <http://rdf.ebi.ac.uk/terms/atlas/>
SELECT DISTINCT ?experiment ?description WHERE {
SERVICE <http://www.ebi.ac.uk/rdf/services/atlas/sparql> {
?experiment a atlasterms:Experiment .
?experiment dcterms:identifier ?id .
?experiment dcterms:description ?description .
FILTER regex(?id, "E-GEOD-2852")
}
}
We can execute this query from any other endpoint using the SPARQL
SERVICE keyword
http://tinyurl.com/o9kvvzn
Exercise 19
•  Execute the following federated query on
•  1. The UniProt SPARQL endpoint
•  2. Your Sesame workbench SPARQL endpoint
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX atlasterms: <http://rdf.ebi.ac.uk/terms/atlas/>
SELECT DISTINCT ?experiment ?description WHERE {
SERVICE <http://www.ebi.ac.uk/rdf/services/atlas/sparql> {
?experiment a atlasterms:Experiment .
?experiment dcterms:identifier ?id .
?experiment dcterms:description ?description .
FILTER regex(?id, "E-GEOD-2852")
}
}
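Outside a web form, the same query can be submitted over HTTP. The following is a minimal sketch, assuming the endpoint follows the SPARQL 1.1 Protocol (query passed as a `query` URL parameter) and can return JSON results. The function name `build_sparql_request` is made up for illustration, and the request is only constructed here, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Federated query from the exercise, unchanged apart from indentation.
FEDERATED_QUERY = """
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX atlasterms: <http://rdf.ebi.ac.uk/terms/atlas/>
SELECT DISTINCT ?experiment ?description WHERE {
  SERVICE <http://www.ebi.ac.uk/rdf/services/atlas/sparql> {
    ?experiment a atlasterms:Experiment .
    ?experiment dcterms:identifier ?id .
    ?experiment dcterms:description ?description .
    FILTER regex(?id, "E-GEOD-2852")
  }
}
"""


def build_sparql_request(endpoint: str, query: str) -> Request:
    """Build (but do not send) a SPARQL Protocol GET request (hypothetical helper)."""
    url = endpoint + "?" + urlencode({"query": query})
    # Ask for the SPARQL JSON results format; assumes the endpoint offers it.
    return Request(url, headers={"Accept": "application/sparql-results+json"})


# UniProt endpoint URL as it appears in these slides.
request = build_sparql_request("http://beta.sparql.uniprot.org/sparql", FEDERATED_QUERY)
```

Passing `request` to `urllib.request.urlopen` would send it. Whether a given endpoint accepts SERVICE clauses depends on how it is configured.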
Constructing a Federated query
•  Basic query to get genes out of our dataset
•  How can we integrate this with data from the EMBL-EBI
Gene Expression Atlas?
PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#>
PREFIX mydata:<http://www.mydomain.com/mydata#>
PREFIX efo:<http://www.ebi.ac.uk/efo/>
SELECT DISTINCT ?geneid ?label WHERE {
?result mydata:dbXref ?geneid .
?geneid rdfs:label ?label .
}
Querying the Atlas
Example query 3
(http://www.ebi.ac.uk/rdf/services/atlas/sparql)
SELECT distinct ?diffValue ?expUri ?propertyType ?propertyValue ?pvalue
WHERE {
?expUri atlasterms:hasAnalysis ?analysis .
?analysis atlasterms:hasExpressionValue ?value .
?value rdfs:label ?diffValue .
?value atlasterms:hasFactorValue ?factor .
?factor atlasterms:propertyType ?propertyType .
?factor atlasterms:propertyValue ?propertyValue .
?value atlasterms:pValue ?pvalue .
?value atlasterms:isMeasurementOf ?probe .
?probe atlasterms:dbXref identifiers:ENSMUSG00000034450 .
}
ORDER BY ASC (?pvalue)
Our query
SELECT DISTINCT ?geneid ?label WHERE {
?result mydata:dbXref ?geneid .
?geneid rdfs:label ?label .
}
Integration point: the gene URI (?geneid in our query, the object of atlasterms:dbXref in the Atlas query)
Querying the Atlas
Example query 3
(http://www.ebi.ac.uk/rdf/services/atlas/sparql)
SELECT distinct ?diffValue ?expUri ?propertyType ?propertyValue ?pvalue
WHERE {
?expUri atlasterms:hasAnalysis ?analysis .
?analysis atlasterms:hasExpressionValue ?value .
?value rdfs:label ?diffValue .
?value atlasterms:hasFactorValue ?factor .
?factor atlasterms:propertyType ?propertyType .
?factor atlasterms:propertyValue ?propertyValue .
?value atlasterms:pValue ?pvalue .
?value atlasterms:isMeasurementOf ?probe .
?probe atlasterms:dbXref identifiers:ENSMUSG00000034450 .
}
ORDER BY ASC (?pvalue)
Our query
PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#>
PREFIX mydata:<http://www.mydomain.com/mydata#>
PREFIX atlasterms:<http://rdf.ebi.ac.uk/terms/atlas/>
SELECT DISTINCT ?geneid ?label ?probe WHERE {
?result mydata:dbXref ?geneid .
?geneid rdfs:label ?label .
SERVICE <http://www.ebi.ac.uk/rdf/services/atlas/sparql> {
?probe atlasterms:dbXref ?geneid
}
}
1st gotcha
PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#>
PREFIX mydata:<http://www.mydomain.com/mydata#>
PREFIX atlasterms:<http://rdf.ebi.ac.uk/terms/atlas/>
SELECT ?geneid ?label ?probe ?value WHERE {
?result mydata:dbXref ?geneid .
?geneid rdfs:label ?label .
SERVICE <http://www.ebi.ac.uk/rdf/services/atlas/sparql> {
?value atlasterms:isMeasurementOf ?probe .
?probe atlasterms:dbXref ?geneid
}
}
This should work but there is an issue with querying the EBI RDF
Platform with this version of Sesame (fix coming soon!)
1st gotcha
PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#>
PREFIX mydata:<http://www.mydomain.com/mydata#>
PREFIX atlasterms:<http://rdf.ebi.ac.uk/terms/atlas/>
SELECT ?label ?probe ?value WHERE {
?result mydata:dbXref <http://identifiers.org/ensembl/ENSMUSG00000024673> .
<http://identifiers.org/ensembl/ENSMUSG00000024673> rdfs:label ?label .
SERVICE <http://www.ebi.ac.uk/rdf/services/atlas/sparql> {
?value atlasterms:isMeasurementOf ?probe .
?probe atlasterms:dbXref <http://identifiers.org/ensembl/ENSMUSG00000024673>
}
}
Workaround: bind the query to a single gene, <http://identifiers.org/ensembl/ENSMUSG00000024673>
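Since the query has to be bound to one gene at a time, one option is to generate a bound query per gene on the client side. The following is a minimal sketch using Python's string.Template. The `$gene` placeholder and the `bound_query` helper are illustrative assumptions, not part of Sesame or the EBI RDF platform. Both gene URIs are taken from these slides.

```python
from string import Template

# The bound query from this slide, with the gene URI replaced by a $gene placeholder.
QUERY_TEMPLATE = Template("""\
PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#>
PREFIX mydata:<http://www.mydomain.com/mydata#>
PREFIX atlasterms:<http://rdf.ebi.ac.uk/terms/atlas/>
SELECT ?label ?probe ?value WHERE {
  ?result mydata:dbXref <$gene> .
  <$gene> rdfs:label ?label .
  SERVICE <http://www.ebi.ac.uk/rdf/services/atlas/sparql> {
    ?value atlasterms:isMeasurementOf ?probe .
    ?probe atlasterms:dbXref <$gene>
  }
}
""")


def bound_query(gene_uri: str) -> str:
    """Return the federated query with every $gene slot bound to gene_uri."""
    return QUERY_TEMPLATE.substitute(gene=gene_uri)


gene_uris = [
    "http://identifiers.org/ensembl/ENSMUSG00000024673",
    "http://identifiers.org/ensembl/ENSMUSG00000034450",
]
# One independent query per gene; each can be run on its own.
queries = [bound_query(uri) for uri in gene_uris]
```

Each generated query has the same shape as the bound query on this slide, so the results can be concatenated after running them one by one.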
Exercise 20
•  A) Using the previous query, extend it to query the Atlas
endpoint to also return the Experiment id and factors
(property values) where Ms4a1 (ENSMUSG00000024673) is
expressed
•  B) Filter those results to only include experiments where the
factor contains “liver”
Exercise 20 solution A)
PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#>
PREFIX mydata:<http://www.mydomain.com/mydata#>
PREFIX atlasterms:<http://rdf.ebi.ac.uk/terms/atlas/>
PREFIX identifiers:<http://identifiers.org/ensembl/>
SELECT ?label ?expUri ?propertyValue WHERE {
?result mydata:dbXref identifiers:ENSMUSG00000024673 .
identifiers:ENSMUSG00000024673 rdfs:label ?label .
SERVICE <http://www.ebi.ac.uk/rdf/services/atlas/sparql> {
?expUri atlasterms:hasAnalysis ?analysis .
?analysis atlasterms:hasExpressionValue ?value .
?value atlasterms:hasFactorValue ?factor .
?factor atlasterms:propertyValue ?propertyValue .
?value atlasterms:isMeasurementOf ?probe .
?probe atlasterms:dbXref identifiers:ENSMUSG00000024673
}
}
Exercise 20 solution B)
PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#>
PREFIX mydata:<http://www.mydomain.com/mydata#>
PREFIX atlasterms:<http://rdf.ebi.ac.uk/terms/atlas/>
PREFIX identifiers:<http://identifiers.org/ensembl/>
SELECT ?label ?expUri ?propertyValue WHERE {
?result mydata:dbXref identifiers:ENSMUSG00000024673 .
identifiers:ENSMUSG00000024673 rdfs:label ?label .
SERVICE <http://www.ebi.ac.uk/rdf/services/atlas/sparql> {
?expUri atlasterms:hasAnalysis ?analysis .
?analysis atlasterms:hasExpressionValue ?value .
?value atlasterms:hasFactorValue ?factor .
?factor atlasterms:propertyValue ?propertyValue .
?value atlasterms:isMeasurementOf ?probe .
?probe atlasterms:dbXref identifiers:ENSMUSG00000024673
}
FILTER regex(?propertyValue, "liver", "i")
}
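The FILTER in solution B keeps a row when the pattern occurs anywhere in the value, and the "i" flag makes the match case-insensitive. The same matching rule can be tried out locally. This is a small sketch in which the sample property values are invented for illustration and Python's re module stands in for the SPARQL regex function (the two regex dialects differ in places, but not for a plain word such as "liver").

```python
import re


def sparql_like_regex(text: str, pattern: str, flags: str = "") -> bool:
    """Approximate SPARQL regex(text, pattern, flags) for simple patterns."""
    re_flags = re.IGNORECASE if "i" in flags else 0
    # SPARQL regex is unanchored, so re.search (not re.match) is the analogue.
    return re.search(pattern, text, re_flags) is not None


# Invented factor values, only to show which rows the filter would keep.
sample_values = ["Liver", "fetal liver", "kidney", "hepatocyte"]
kept = [v for v in sample_values if sparql_like_regex(v, "liver", "i")]
print(kept)  # ['Liver', 'fetal liver']
```

Without the "i" flag, "Liver" would be dropped, which is why the solution passes it.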
Alzheimer’s Use Case – EBI RDF platform
•  EFO term for Alzheimer’s: EFO_0000249
•  Get genes differentially expressed for Alzheimer’s
•  Get proteins encoded by those genes
•  GO annotations from UniProt for those genes
•  Get pathways from Reactome in which those proteins are
involved
•  Get drugs that target proteins within those pathways
Q1. Get Ensembl genes diff expressed for Alzheimer’s
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX dcterms:<http://purl.org/dc/terms/>
PREFIX efo: <http://www.ebi.ac.uk/efo/>
PREFIX atlas: <http://rdf.ebi.ac.uk/resource/atlas/>
PREFIX atlasterms: <http://rdf.ebi.ac.uk/terms/atlas/>
SELECT distinct ?expressionValue ?dbXref ?pvalue ?propertyValue
WHERE {
?expUri atlasterms:hasAnalysis ?analysis .
?analysis atlasterms:hasExpressionValue ?value .
?value rdfs:label ?expressionValue .
?value atlasterms:pValue ?pvalue .
?value atlasterms:hasFactorValue ?factor .
?value atlasterms:isMeasurementOf ?probe .
?probe atlasterms:dbXref ?dbXref .
?dbXref rdf:type atlasterms:EnsemblDatabaseReference .
?factor atlasterms:propertyType ?propertyType .
?factor atlasterms:propertyValue ?propertyValue .
?factor rdf:type efo:EFO_0000249 .
}
Q2. Get UniProt proteins for those genes
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX efo: <http://www.ebi.ac.uk/efo/>
PREFIX atlas: <http://rdf.ebi.ac.uk/resource/atlas/>
PREFIX atlasterms: <http://rdf.ebi.ac.uk/terms/atlas/>
SELECT distinct ?expressionValue ?dbXref ?pvalue ?propertyValue
WHERE {
?expUri atlasterms:hasAnalysis ?analysis .
?analysis atlasterms:hasExpressionValue ?value .
?value rdfs:label ?expressionValue .
?value atlasterms:pValue ?pvalue .
?value atlasterms:hasFactorValue ?factor .
?value atlasterms:isMeasurementOf ?probe .
?probe atlasterms:dbXref ?dbXref .
?dbXref rdf:type atlasterms:UniprotDatabaseReference .
?factor atlasterms:propertyType ?propertyType .
?factor atlasterms:propertyValue ?propertyValue .
?factor rdf:type efo:EFO_0000249 .
}
Q3. Get UniProt GO Annotations for those genes
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX efo:<http://www.ebi.ac.uk/efo/>
PREFIX atlasterms: <http://rdf.ebi.ac.uk/terms/atlas/>
PREFIX upc:<http://purl.uniprot.org/core/>
PREFIX identifiers:<http://identifiers.org/ensembl/>
SELECT distinct ?valueLabel ?goid ?golabel
WHERE {
?expUri atlasterms:hasAnalysis ?analysis .
?analysis atlasterms:hasExpressionValue ?value .
?value rdfs:label ?expressionValue .
?value atlasterms:pValue ?pvalue .
?value atlasterms:hasFactorValue ?factor .
?value atlasterms:isMeasurementOf ?probe .
?probe atlasterms:dbXref ?dbXref.
?dbXref rdf:type atlasterms:EnsemblDatabaseReference .
?factor atlasterms:propertyType ?propertyType .
?factor atlasterms:propertyValue ?propertyValue .
?factor rdf:type efo:EFO_0000249 .
?value rdfs:label ?valueLabel .
?value atlasterms:isMeasurementOf ?probe .
?probe atlasterms:dbXref ?uniprot .
SERVICE <http://beta.sparql.uniprot.org/sparql> {
?uniprot a upc:Protein .
?uniprot upc:classifiedWith ?keyword .
?keyword rdfs:seeAlso ?goid .
?goid rdfs:label ?golabel .
}
}
Q4. get pathways from Reactome for those proteins
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX efo: <http://www.ebi.ac.uk/efo/>
PREFIX atlasterms: <http://rdf.ebi.ac.uk/terms/atlas/>
PREFIX biopax3:<http://www.biopax.org/release/biopax-level3.owl#>
SELECT DISTINCT ?pathway ?dbXref WHERE {
?expUri atlasterms:hasAnalysis ?analysis .
?analysis atlasterms:hasExpressionValue ?value .
?value rdfs:label ?expressionValue .
?value atlasterms:pValue ?pvalue .
?value atlasterms:hasFactorValue ?factor .
?value atlasterms:isMeasurementOf ?probe .
?probe atlasterms:dbXref ?dbXref .
?dbXref rdf:type atlasterms:UniprotDatabaseReference .
?factor atlasterms:propertyType ?propertyType .
?factor atlasterms:propertyValue ?propertyValue .
?factor rdf:type efo:EFO_0000249 .
SERVICE <http://www.ebi.ac.uk/rdf/services/reactome/sparql>
{?pathway rdf:type biopax3:Pathway .
?pathway biopax3:displayName ?pathwayname .
?pathway biopax3:pathwayComponent ?reaction .
?reaction rdf:type biopax3:BiochemicalReaction .
{
{?reaction ?rel ?protein .}
UNION
{
?reaction ?rel ?complex .
?complex rdf:type biopax3:Complex .
?complex ?comp ?protein .
}}
?protein rdf:type biopax3:Protein .
?protein biopax3:entityReference ?dbXref
}
}
Q5. Get drugs that target proteins within those pathways
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#>
PREFIX efo:<http://www.ebi.ac.uk/efo/>
PREFIX atlasterms: <http://rdf.ebi.ac.uk/terms/atlas/>
PREFIX biopax3:<http://www.biopax.org/release/biopax-level3.owl#>
PREFIX cco:<http://rdf.ebi.ac.uk/terms/chembl#>
SELECT distinct ?dbXrefProt ?pathwayname ?moleculeLabel ?expressionValue ?propertyValue
WHERE {
# Get differentially expressed genes (and proteins) where factor is Alzheimer’s
?value atlasterms:pValue ?pvalue .
?value atlasterms:hasFactorValue ?factor .
?value rdfs:label ?expressionValue .
?value atlasterms:isMeasurementOf ?probe .
?probe atlasterms:dbXref ?dbXrefProt .
?dbXrefProt a atlasterms:UniprotDatabaseReference .
?factor atlasterms:propertyType ?propertyType .
?factor atlasterms:propertyValue ?propertyValue .
?factor rdf:type efo:EFO_0000249 .
# Compounds that target them
SERVICE <http://www.ebi.ac.uk/rdf/services/chembl/sparql> {
?act a cco:Activity ;
cco:hasMolecule ?molecule ;
cco:hasAssay ?assay .
?molecule rdfs:label ?moleculeLabel .
?assay cco:hasTarget ?target .
?target cco:hasTargetComponent ?targetcmpt .
?targetcmpt cco:targetCmptXref ?dbXrefProt .
?targetcmpt cco:taxonomy <http://identifiers.org/taxonomy/9606> .
?dbXrefProt a cco:UniprotRef .
}
SERVICE <http://www.ebi.ac.uk/rdf/services/reactome/sparql> {
?protein rdf:type biopax3:Protein .
?protein biopax3:memberPhysicalEntity
[biopax3:entityReference ?dbXrefProt] .
?pathway biopax3:displayName ?pathwayname .
?pathway biopax3:pathwayComponent ?reaction .
?reaction ?rel ?protein
}
}
Summary
•  Why there is a need for new technologies in the life
sciences
•  Why RDF is a good fit for some of the problems
•  The role of ontologies
•  Generating RDF triples from data
•  Working with an RDF database
•  How to write a SPARQL query
•  How the EBI is using RDF
Conclusions
•  Generating RDF triples is relatively easy
•  Extracting the schema from your data can be tricky
•  Avoid over-modeling – have good use cases
•  Look for appropriate ontologies, reuse terms where possible
•  Good tooling now available
•  RDF APIs for most programming languages
•  Lots of scalable triple stores
•  SPARQL is a powerful query language for RDF
•  Also very unforgiving; debugging queries is hard
•  Treat it as you would SQL: not for your average user
Conclusions (cont.)
•  Lots of interest in Linked Data and RDF
•  See LOD clouds and DBpedia
•  Big name companies using/generating RDF content
(Facebook, Google, Oracle)
•  Some good examples of applications
•  Pharma industry (OpenPhacts project), Semantic publishing
(BBC), Government data (data.gov.uk)
•  Tread cautiously
•  This technology is still maturing
•  Not a panacea
•  Good solutions for some problems
Thinking beyond RDF and SPARQL
•  Selling SPARQL endpoints to biologists is hard, i.e. near
impossible
•  Entry level is too high and advantages too intangible
•  Let programmers code against SPARQL
•  Let everyone else use more familiar modes through Apps
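As one illustration of coding against SPARQL: endpoints that support the SPARQL 1.1 Query Results JSON Format return bindings that are easy to flatten into plain rows for an application. The following is a minimal sketch. The helper name `rows_from_results` and the sample response (including its URIs and description text) are made up for illustration.

```python
import json


def rows_from_results(payload: str) -> list:
    """Flatten a SPARQL JSON results document into a list of dicts."""
    doc = json.loads(payload)
    variables = doc["head"]["vars"]
    rows = []
    for binding in doc["results"]["bindings"]:
        # Unbound variables are absent from a binding; map them to None.
        rows.append({var: binding.get(var, {}).get("value") for var in variables})
    return rows


# Hand-written sample response, not real endpoint output.
SAMPLE = """
{
  "head": {"vars": ["experiment", "description"]},
  "results": {"bindings": [
    {"experiment": {"type": "uri", "value": "http://example.org/experiment/1"},
     "description": {"type": "literal", "value": "example description"}},
    {"experiment": {"type": "uri", "value": "http://example.org/experiment/2"}}
  ]}
}
"""

for row in rows_from_results(SAMPLE):
    print(row["experiment"], row["description"])
```

An App can then work with ordinary lists and dictionaries, and its users never need to see the query behind them.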
RDFApps
•  Our first RDFApp targets the existing community of R users
– an ArrayExpress R package already exists
•  Goal is to expose the power of the Atlas RDF+SPARQL
behind a conventional R interface
•  Enables those working with raw data to also use the power of
the Atlas
Codefest
•  Got an idea for an RDF App? Join us at Codefest 2014
•  http://www.open-bio.org/wiki/Codefest_2014
•  18th/19th September, Cambridge, UK
Interesting RDF resources for biology
•  EBI RDF (http://www.ebi.ac.uk/rdf)
•  Bio2RDF (http://bio2rdf.org)
•  BioPortal (https://bioportal.bioontology.org)
•  OpenPhacts (https://www.openphacts.org)
•  PubChem RDF (https://pubchem.ncbi.nlm.nih.gov/rdf/)
•  Identifiers.org (http://identifiers.org)
•  Wikipathways (http://wikipathways.org)
•  DisGeNet (http://ibi.imim.es/web/DisGeNET/v01/)
•  W3C Healthcare and Life Sciences Working Group (HCLS, http://www.w3.org/blog/hcls/)
Acknowledgments
•  Samples Phenotypes and Ontologies Group and Functional
Genomics Production Team
•  James Malone, Robert Petryszak, Tony Burdett, Helen
Parkinson
•  EBI RDF platform
•  Andy Jenkinson, Mark Davies, Marco Brandizi, Sarala
Wimalaratne, Leyla Garcia, Jerven Bolleman
Funding
Components of the RDF platform pilot are supported by a
number of sources, including:
•  EMBL
•  European Commission:
•  BioMedBridges [284209]
•  Diachron [601043]
•  OpenPhacts (Innovative Medicines Initiative)
•  National Institutes of Health
Questions?
Sign up for our mailing list:
http://listserver.ebi.ac.uk/mailman/listinfo/rdf-announce