University of Pisa, Italy
June 12, 2007
NETTAB 2007 - A Semantic Web for Bioinformatics
Tutorial T5
The Unified Medical Language System
(UMLS) and the Semantic Web
Olivier Bodenreider
Lister Hill National Center
for Biomedical Communications
Bethesda, Maryland - USA
Outline
- Information integration in biomedicine
  - Some issues: naming, normalization, mapping
  - Semantic Web perspective
- Terminology integration in biomedicine
  - Unified Medical Language System
- Some differences between UMLS and SW
Information integration
in biomedicine
Some issues: naming, normalization, mapping
1. Naming
- Many biomedical entities have several names (synonymy)
  - Drug names
  - Gene names
  - Disease names
  - …
- A given name may refer to several different entities (polysemy)
  - Nail (body part)
  - Nail (medical device)
Brand names for paracetamol (acetaminophen)
http://en.wikipedia.org/wiki/List_of_paracetamol_brand_names
Names for dystrophin
http://www.ncbi.nlm.nih.gov/sites/entrez
Names for renal cell carcinoma
http://www.clininfo.co.uk/clue5/clue.htm
Entity recognition
- Identifying biomedical entities in text
  - Named entity recognition
  - Tagging “mentions”
  - Semantic annotation
- Supported by terminology
  - Collects the names used in the domain
  - Often incompletely
- Example: BioCreative
  - 1A – Gene name identification
  - 2GM – Gene mention tagging
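A minimal sketch of terminology-supported mention tagging, assuming a tiny hand-made lexicon (the entries and the `tag_mentions` helper are illustrative, not part of BioCreative or any UMLS tool). It finds lexicon names in text and reports their character offsets; a name missing from the lexicon is simply not found, which is the "often incompletely" caveat in practice.

```python
import re

# Toy lexicon: name -> entity category (illustrative entries only).
LEXICON = {
    "paracetamol": "drug",
    "acetaminophen": "drug",
    "dystrophin": "gene",
    "renal cell carcinoma": "disease",
}

def tag_mentions(text, lexicon=LEXICON):
    """Return (start, end, matched_text, category) for each lexicon name found in text."""
    # Try longer names first so a multi-word name wins over a shorter name it contains.
    names = sorted(lexicon, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(name) for name in names) + r")\b",
        re.IGNORECASE,
    )
    return [
        (m.start(), m.end(), m.group(0), lexicon[m.group(0).lower()])
        for m in pattern.finditer(text)
    ]

mentions = tag_mentions("Mutations in dystrophin were studied; patients received Paracetamol.")
for start, end, surface, category in mentions:
    print(start, end, surface, category)
```

Matching is case-insensitive but otherwise exact, so spelling variants and unseen names are missed; real systems combine lexicons with statistical taggers.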
2. Normalization
- Biomedical entities are identified by unique identifiers in various terminology systems
- Resolve names into identifiers (in a given namespace)
- Supported (in part) by terminology resources
- Example: BioCreative
  - 1B and 2GN – Gene Normalization
Identifier for paracetamol (acetaminophen)
Source                               Identifier   Name
Master Drug Data Base. Medi-Span     5005         Acetaminophen
FDA National Drug Code Directory     50612        PARACETAMOL
FDA Structured Product Labels        362O9ITL9D   ACETAMINOPHEN
First DataBank NDDF Plus             001605       Acetaminophen
SNOMED Clinical Terms                90332006     Acetaminophen (product)
SNOMED Clinical Terms                387517004    Acetaminophen (substance)
VA National Drug File                4017513      ACETAMINOPHEN
Source: RxNorm database (5/3/2007)
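Normalization, as described here, amounts to resolving a name into an identifier within one namespace. The sketch below assumes a plain in-memory lookup built from the identifiers listed on this slide; the `normalize` function is a hypothetical helper, not an RxNorm API.

```python
# (namespace, case-folded name) -> identifier, copied from the slide.
IDENTIFIERS = {
    ("Master Drug Data Base. Medi-Span", "acetaminophen"): "5005",
    ("FDA National Drug Code Directory", "paracetamol"): "50612",
    ("FDA Structured Product Labels", "acetaminophen"): "362O9ITL9D",
    ("First DataBank NDDF Plus", "acetaminophen"): "001605",
    ("SNOMED Clinical Terms", "acetaminophen (product)"): "90332006",
    ("SNOMED Clinical Terms", "acetaminophen (substance)"): "387517004",
    ("VA National Drug File", "acetaminophen"): "4017513",
}

def normalize(name, namespace):
    """Resolve a name to its identifier in the given namespace, or None if unknown."""
    return IDENTIFIERS.get((namespace, name.casefold()))

print(normalize("PARACETAMOL", "FDA National Drug Code Directory"))    # 50612
print(normalize("Acetaminophen", "VA National Drug File"))             # 4017513
print(normalize("Acetaminophen", "FDA National Drug Code Directory"))  # None
```

The last call returns `None` because this toy table only holds the name each source uses on the slide; resolving synonyms across sources is the mapping problem addressed next.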
Identifier for dystrophin
http://www.ncbi.nlm.nih.gov/sites/entrez
Identifier for renal cell carcinoma
http://www.clininfo.co.uk/clue5/clue.htm
3. Mapping / Integration
- Identify equivalent entities across systems (across namespaces)
  - Shared identifiers
  - Existing mappings (e.g., SNOMED CT to ICD-9-CM)
  - Ontology alignment techniques (lexical + structural)
- Align equivalent entities
  - Pairwise: mapping
  - More broadly: integration
- Forms the basis for information integration in the Semantic Web (mashups)
Identifier for paracetamol (acetaminophen)
Source                               Identifier   Name
Master Drug Data Base. Medi-Span     5005         Acetaminophen
FDA National Drug Code Directory     50612        PARACETAMOL
FDA Structured Product Labels        362O9ITL9D   ACETAMINOPHEN
First DataBank NDDF Plus             001605       Acetaminophen
SNOMED Clinical Terms                90332006     Acetaminophen (product)
SNOMED Clinical Terms                387517004    Acetaminophen (substance)
VA National Drug File                4017513      ACETAMINOPHEN
RxNorm                               161          Acetaminophen
Identifier for dystrophin
http://www.ncbi.nlm.nih.gov/sites/entrez
Identifier for renal cell carcinoma
645875019
379798014
379801015
379800019
379797016
379803017
379802010
http://www.clininfo.co.uk/clue5/clue.htm
Information integration
in biomedicine
Semantic Web perspective
HCLS mashup
[Diagram: data sources combined in the mashup]
PDSPki, Gene Ontology, NeuronDB, Reactome, BAMS, Antibodies, NC Annotations, Entrez Gene, Allen Brain Atlas, MeSH, Mammalian Phenotype, SWAN, AlzGene, BrainPharm, PubChem, Homologene, Publications
http://esw.w3.org/topic/HCLS/HCLSIG_DemoHomePage_HCLSIG_Demo
Shared identifiers: Example
GO
HCLS mashup
[Diagram: entity types covered by each data source in the mashup]
- Antibodies: Genes, Antibodies
- NC Annotations: Genes/Proteins, Processes, Cells (maybe), PubMed ID
- Entrez Gene: Genes, Protein, GO, PubMedID, Interaction (g/p), Chromosome, C. location
- Allen Brain Atlas: Genes, Brain images, Gross anatomy -> neuroanatomy
- BAMS: Protein, Neuroanatomy, Cells, Metabolites (channels), PubMedID
- MeSH: Drugs, Anatomy, Phenotypes, Compounds, Chemicals, PubMedID
- NeuronDB: Protein (channels/receptors), Neurotransmitters, Neuroanatomy, Cell, Compartments, Currents
- BrainPharm: Drug, Drug effect, Pathological agent, Phenotype, Receptors, Channels, Cell types, PubMedID, Disease
- Other sources shown: PDSPki, GO, Reactome, SWAN, PubChem, Mammalian Phenotype, Homologene, AlzGene
- Entity types shown for these other sources: Genes/proteins, Interactions, Cellular location, Processes (GO), Molecular function, Cell components, Biological process, Annotation gene, PubMedID, Hypothesis, Questions, Evidence, Genes, Proteins, Chemicals, Neurotransmitters, Phenotypes, Disease, Species, Orthologies, Polymorphism Proofs, Population, Alz Diagnosis, Name, Structure, Properties, MeSH term
HCLS mashup
http://esw.w3.org/topic/HCLS/HCLSIG_DemoHomePage_HCLSIG_Demo
From glycosyltransferase to congenital muscular dystrophy
[Diagram, read as a chain of relations]
- GO:0008375 (acetylglucosaminyltransferase) isa GO:0008194 and GO:0016758, which in turn isa GO:0016757 (glycosyltransferase)
- EG:9215 (LARGE) has_molecular_function GO:0008375 (acetylglucosaminyltransferase)
- EG:9215 (LARGE) has_associated_phenotype MIM:608840 (Muscular dystrophy, congenital, type 1D)
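The path from glycosyltransferase to congenital muscular dystrophy can be sketched as a query over a handful of triples. This is an illustration only: the triples are a reading of the slide's diagram (the exact isa edges among the GO terms are an assumption), and the helper functions are hypothetical stand-ins for what an RDF store and a SPARQL query would do.

```python
# (subject, predicate, object) triples read off the diagram; isa edges are assumed.
TRIPLES = [
    ("GO:0008375", "isa", "GO:0008194"),
    ("GO:0008375", "isa", "GO:0016758"),
    ("GO:0008194", "isa", "GO:0016757"),
    ("GO:0016758", "isa", "GO:0016757"),
    ("EG:9215", "has_molecular_function", "GO:0008375"),
    ("EG:9215", "has_associated_phenotype", "MIM:608840"),
]

def objects(subject, predicate):
    """All objects o such that (subject, predicate, o) is asserted."""
    return {o for s, p, o in TRIPLES if s == subject and p == predicate}

def descendants(term):
    """All terms that reach `term` by following isa edges transitively."""
    found, frontier = set(), [term]
    while frontier:
        current = frontier.pop()
        for s, p, o in TRIPLES:
            if p == "isa" and o == current and s not in found:
                found.add(s)
                frontier.append(s)
    return found

def phenotypes_for_function(go_term):
    """(gene, phenotype) pairs for genes whose molecular function is go_term or a descendant."""
    functions = descendants(go_term) | {go_term}
    genes = {s for s, p, o in TRIPLES if p == "has_molecular_function" and o in functions}
    return {(gene, pheno) for gene in genes for pheno in objects(gene, "has_associated_phenotype")}

print(phenotypes_for_function("GO:0016757"))
```

The query starts from the general term GO:0016757, yet finds LARGE (EG:9215), which is annotated only with the more specific GO:0008375. The isa closure is what connects the two resources.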
Terminology integration
in biomedicine
Unified Medical Language System
Motivation
- Started in 1986
- National Library of Medicine
- “Long-term R&D project”

«[…] the UMLS project is an effort to overcome two significant barriers to effective retrieval of machine-readable information.
• The first is the variety of ways the same concepts are expressed in different machine-readable sources and by different people.
• The second is the distribution of useful information among many disparate databases and systems.»
Unified Medical Language System
- SPECIALIST Lexicon (lexical resources)
  - 200,000 lexical items
  - Part of speech and variant information
- Metathesaurus (terminological resources)
  - 5M names from over 100 terminologies
  - 1M concepts
  - 16M relations
- Semantic Network (ontological resources)
  - 135 high-level categories
  - 7000 relations among them
Addison’s disease
Example
Addison’s disease in medical vocabularies
- Synonyms
  - Addisonian syndrome
  - Bronzed disease
  - Addison melanoderma
  - Asthenia pigmentosa
  - Primary adrenal deficiency
  - Primary adrenal insufficiency
  - Primary adrenocortical insufficiency
  - Chronic adrenocortical insufficiency
[Slide annotations grouping the synonyms: eponym, symptoms, clinical variants]
Organize terms
- Synonymous terms clustered into a concept
- Preferred term
- Unique identifier (CUI)

Term                                   Source      Code
Addison Disease                        MeSH        D000224
Primary hypoadrenalism                 MedDRA      10036696
Primary adrenocortical insufficiency   ICD-10      E27.1
Addison's disease (disorder)           SNOMED CT   363732003

Concept: C0001403 Addison's disease
Metathesaurus Concepts (2007AA)
- Concept (~ 1.4 M), CUI: set of synonymous concept names
- Term (~ 4.9 M), LUI: set of normalized names
- String (~ 5.5 M), SUI: distinct concept name
- Atom (~ 6.8 M), AUI: concept name in a given source

Example:
AUI        Concept name (source)   SUI        LUI        CUI
A0000001   headache (source 1)     S0000001   L0000001   C0000001
A0000002   headache (source 2)     S0000001   L0000001   C0000001
A0000003   Headache (source 1)     S0000002   L0000001   C0000001
A0000004   Headache (source 2)     S0000002   L0000001   C0000001
A0000005   Cephalgia (source 1)    S0000003   L0000002   C0000001
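The atom, string, and term levels of the headache example can be reproduced with a short sketch. It assumes that case-folding is the only normalization applied when grouping strings into terms, which is a simplification of the real lexical normalization; the `assign_ids` function and its sequential numbering are illustrative.

```python
# Atoms from the slide: (AUI, concept name, source).
ATOMS = [
    ("A0000001", "headache", "source 1"),
    ("A0000002", "headache", "source 2"),
    ("A0000003", "Headache", "source 1"),
    ("A0000004", "Headache", "source 2"),
    ("A0000005", "Cephalgia", "source 1"),
]

def assign_ids(atoms):
    """Give each distinct string a SUI and each distinct normalized string a LUI."""
    sui_of, lui_of = {}, {}
    rows = []
    for aui, string, _source in atoms:
        # A new identifier is minted only the first time a key is seen.
        sui = sui_of.setdefault(string, f"S{len(sui_of) + 1:07d}")
        lui = lui_of.setdefault(string.casefold(), f"L{len(lui_of) + 1:07d}")
        rows.append((aui, sui, lui))
    return rows

for row in assign_ids(ATOMS):
    print(*row)
```

The CUI is deliberately not computed: grouping "headache" and "Cephalgia" under C0000001 is a judgment about synonymy that no string operation in this sketch could make.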
Addison’s Disease: Concept
CUI: C0001403
Semantic type: Disease or Syndrome
Sources: SNOMED, MeSH, AOD, Read Codes, …
Names:
- Addison’s Disease
- ADRENAL INSUFFICIENCY (ADDISON'S DISEASE)
- ADRENOCORTICAL INSUFFICIENCY, PRIMARY FAILURE
- Addison melanoderma
- Melasma addisonii
- Primary adrenal deficiency
- Asthenia pigmentosa
- Bronzed disease
- Insufficiency, adrenal primary
- Primary adrenocortical insufficiency
- Addison's, disease
Names in other languages:
- Maladie d'Addison - French
- Addison-Krankheit - German
- Morbo di Addison - Italian
- Doença de Addison - Portuguese
- АДДИСОНОВА БОЛЕЗНЬ - Russian
- アジソン病 - Japanese
Definition: A disease characterized by hypotension, weight loss, anorexia, weakness, and sometimes a bronze-like melanotic hyperpigmentation of the skin. It is due to tuberculosis- or autoimmune-induced disease (hypofunction) of the adrenal glands that results in deficiency of aldosterone and cortisol. In the absence of replacement therapy, it is usually fatal.
Metathesaurus: Evolution over time
- Concepts never die (in principle)
  - CUIs are permanent identifiers
- What happens when they do die (in reality)?
  - Concepts can merge or split
  - Resulting in new concepts and deletions
[Timeline, 1992 … 2007: Addison's disease, NOS (C0271735) and Addison's disease (C0001403)]
SNOMED International
  Diseases/Diagnoses
    Diseases of the endocrine system
      Diseases of the Adrenal Glands
        Addison’s Disease
MeSH
  Diseases
    Endocrine Diseases
      Adrenal Gland Diseases
        Adrenal Gland Hypofunction
          Addison’s Disease
AOD
  Endocrine disorder
    Adrenal disorder
      Adrenal cortical disorder
        Adrenal cortical hypofunction
          Addison’s Disease
Read Codes
  Endocrine disorder
    Disorder of adrenal gland
      Hypoadrenalism
        Adrenal Hypofunction
          Corticoadrenal insufficiency
            Addison’s Disease
ICD-10
  Disorders of other endocrine gland
    Other disorders of adrenal gland
      Primary adrenocortical insufficiency
Organize concepts
- Inter-concept relationships: hierarchies from the source vocabularies
- Redundancy: multiple paths
- One graph instead of multiple trees (multiple inheritance)
[Diagram: several source trees over nodes A to H merged into a single graph]
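Merging several source trees into one graph can be sketched with the Addison's disease hierarchies shown on the preceding slides. This is a simplification under a stated assumption: nodes are matched by their exact name, whereas the Metathesaurus matches them by concept. The `merge` function is a hypothetical helper.

```python
from collections import defaultdict

# Root-to-leaf paths copied from the source vocabularies (names as shown on the slides).
PATHS = {
    "SNOMED International": ["Diseases/Diagnoses", "Diseases of the endocrine system",
                             "Diseases of the Adrenal Glands", "Addison's Disease"],
    "MeSH": ["Diseases", "Endocrine Diseases", "Adrenal Gland Diseases",
             "Adrenal Gland Hypofunction", "Addison's Disease"],
    "AOD": ["Endocrine disorder", "Adrenal disorder", "Adrenal cortical disorder",
            "Adrenal cortical hypofunction", "Addison's Disease"],
    "Read Codes": ["Endocrine disorder", "Disorder of adrenal gland", "Hypoadrenalism",
                   "Adrenal Hypofunction", "Corticoadrenal insufficiency", "Addison's Disease"],
}

def merge(paths):
    """Combine all parent/child edges into one graph: child -> set of parents."""
    parents = defaultdict(set)
    for path in paths.values():
        for parent, child in zip(path, path[1:]):
            parents[child].add(parent)
    return parents

graph = merge(PATHS)
for parent in sorted(graph["Addison's Disease"]):
    print(parent)
```

Each source contributes a different parent for the same leaf, so the merged structure is a graph with multiple inheritance rather than a tree.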
Organize concepts
[Diagram: hierarchies from SNOMED, MeSH, AOD, and Read Codes merged into one UMLS graph around Addison’s Disease]
- Disease nodes shown: Diseases; Endocrine Diseases; Adrenal Dysfunction; Adrenal Gland Diseases; Adrenal Cortex Diseases; Disorders of other endocrine gland; Adrenal Cortex Dysfunction; Hypoadrenalism; Other disorders of adrenal gland; Adrenal Gland Hypofunction; Adrenal cortical hypofunction; Secondary hypocortisolism; Addison’s Disease; Addison’s disease due to autoimmunity
- Anatomy nodes shown: Endocrine System; Abdominal organ; Endocrine Glands; Adrenal Glands; Adrenal Cortex
Semantic Types
[Diagram: Metathesaurus concepts linked to semantic types in the Semantic Network]
- Semantic Network (semantic types): Anatomical Structure; Fully Formed Anatomical Structure; Embryonic Structure; Body Part, Organ or Organ Component; Disease or Syndrome; Pharmacologic Substance; Population Group
- Metathesaurus (concepts): Mediastinum; Saccular Viscus; Angina Pectoris; Esophagus; Heart; Left Phrenic Nerve; Heart Valves; Fetal Heart; Cardiotonic Agents; Tissue Donors
- Numeric labels on the diagram: 4, 97, 12, 9, 31, 225, 22
Source Vocabularies (2007AA)
- 139 source vocabularies
  - 17 languages
- Broad coverage of biomedicine
  - 5.5M names
  - 1.4M concepts
  - 16M relations
- Common presentation
Biomedical terminologies
- General vocabularies
  - anatomy (UWDA, Neuronames)
  - drugs (RxNorm, First DataBank, Micromedex, …)
  - medical devices (UMD, SPN)
- Several perspectives
  - clinical terms (SNOMED CT)
  - information sciences (MeSH, CRISP)
  - administrative terminologies (ICD-9-CM, CPT-4)
  - data exchange terminologies (HL7, LOINC)
Biomedical terminologies (cont’d)
- Specialized vocabularies
  - nursing (NIC, NOC, NANDA, Omaha, PCDS)
  - dentistry (CDT)
  - oncology (NCI Thesaurus, PDQ)
  - psychiatry (DSM, APA)
  - adverse reactions (COSTART, WHO ART, MedDRA)
  - primary care (ICPC)
  - genomics (Gene Ontology, HUGO, OMIM)
- Terminology of knowledge bases (AI/Rheum, DXplain, QMR)
Integrating subdomains
[Diagram: UMLS at the center, linking subdomains through their terminologies]
- Clinical repositories: SNOMED
- Genetic knowledge bases: OMIM
- Biomedical literature: MeSH
- Model organisms: NCBI Taxonomy
- Genome annotations: GO
- Anatomy: UWDA
- Other subdomains: …
Integrating subdomains
[Diagram: the same subdomains shown without the linking terminologies: Clinical repositories, Genetic knowledge bases, Other subdomains, Biomedical literature, Model organisms, Genome annotations, Anatomy]
How do they do that?
- Lexical knowledge
- Semantic pre-processing
- UMLS editors
Lexical knowledge
Adrenal gland diseases
Adrenal disorder
Disorder of adrenal gland
Diseases of the adrenal glands
C0001621
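How far lexical knowledge alone goes can be seen by normalizing the four names on this slide. The sketch below is loosely modeled on the idea of lexical normalization (lowercasing, dropping stop words, removing plural endings, sorting words); the stop-word list and the crude `singular` rule are assumptions made for this example, not the SPECIALIST tools.

```python
STOP_WORDS = {"of", "the"}

def singular(word):
    """Crude plural stripping, adequate only for this example."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

def norm(name):
    """Lowercase, drop stop words, strip plurals, and sort the remaining words."""
    words = [singular(w) for w in name.lower().split() if w not in STOP_WORDS]
    return " ".join(sorted(words))

NAMES = [
    "Adrenal gland diseases",
    "Adrenal disorder",
    "Disorder of adrenal gland",
    "Diseases of the adrenal glands",
]

for name in NAMES:
    print(f"{name!r:35} -> {norm(name)!r}")
```

Only "Adrenal gland diseases" and "Diseases of the adrenal glands" collapse to the same normalized form. Recognizing that "disorder" and "disease" name the same concept here (C0001621) takes semantic pre-processing or a human editor.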
Semantic pre-processing
- Metadata in the source vocabularies
- Tentative categorization
- Positive (or negative) evidence for tentative synonymy relations based on lexical features
Additional knowledge: UMLS editors
Adrenal Gland Diseases
Adrenal Cortex Diseases
Adrenal Cortex Dysfunction
Hypoadrenalism
Other disorders of adrenal gland
Adrenal Gland Hypofunction
Adrenal cortical hypofunction
Addison’s Disease
UMLS vs. Semantic Web
Similarities, differences
and unresolved issues
- Identifying biomedical entities
  - Trans-namespace integration
  - No UMLS-based URIs
- Availability
  - Intellectual property restrictions
  - Application Programming Interface
- Formats
  - RRF vs. SW languages
- UMLS as an ontology?
  - Underspecified semantics
1. Identifying biomedical entities
- Syntax vs. semantics
  - URI, LSID,… vs. reference ontologies
- Integrative resources vs. individual namespaces
  - Unified Medical Language System (UMLS) vs. GO, MeSH, SNOMED, …
No UMLS-based URIs: Syntax
- No officially supported UMLS-based URIs for biomedical entities
  - e.g., http://umls.org/C0001403
- Possible alternatives
  - Redirection service (e.g., PURL) http://purl.org/
- Resolution issues: what is expected to be returned?
  - Acknowledgment of existence
  - Preferred term
  - Set of names, relations,… in RDF
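The three possible answers to "what is expected to be returned?" can be made concrete with a toy resolver. Everything here is hypothetical: the `http://example.org/umls/` base, the `resolve` function, and its `level` parameter are invented for illustration, since no official UMLS-based URIs exist. The concept data comes from the Addison's disease example earlier in the tutorial.

```python
BASE = "http://example.org/umls/"  # hypothetical namespace, not an official UMLS URI base

CONCEPTS = {
    "C0001403": {
        "preferred": "Addison's disease",
        "names": ["Addison Disease", "Primary hypoadrenalism",
                  "Primary adrenocortical insufficiency"],
    }
}

def resolve(uri, level="existence"):
    """Answer a URI lookup at one of three levels: existence, preferred term, or RDF."""
    if not uri.startswith(BASE):
        raise ValueError(f"not in the {BASE} namespace: {uri}")
    concept = CONCEPTS.get(uri[len(BASE):])
    if level == "existence":
        return concept is not None
    if concept is None:
        return None
    if level == "preferred":
        return concept["preferred"]
    if level == "rdf":
        pref = concept["preferred"]
        alts = " , ".join(f'"{name}"@en' for name in concept["names"])
        return (
            "@prefix skos: <http://www.w3.org/2004/02/skos/core#> .\n"
            f"<{uri}> a skos:Concept ;\n"
            f'    skos:prefLabel "{pref}"@en ;\n'
            f"    skos:altLabel {alts} ."
        )
    raise ValueError(f"unknown level: {level}")

uri = BASE + "C0001403"
print(resolve(uri))
print(resolve(uri, "preferred"))
print(resolve(uri, "rdf"))
```

Using SKOS labels for the RDF answer is one design choice among several; relations to other concepts would need additional properties that this sketch leaves out.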
No UMLS-based URIs: Semantics
- Potential resources for trans-namespace identification of biomedical entities
  - Clinical medicine: UMLS CUIs
  - [Genomics: Entrez Gene]
- Ontology of biomedical relationships
  - No comprehensive integrative resource available
    - OBO relations
    - UMLS Semantic Network relations
    - GALEN relations
Trans-namespace integration
[Diagram: the subdomain map with UMLS concept C0001403 at the center]
- Clinical repositories: SNOMED, Addison's disease (disorder) (363732003)
- Biomedical literature: MeSH, Addison Disease (D000224)
- Genetic knowledge bases: OMIM
- Model organisms: NCBI Taxonomy
- Genome annotations: GO
- Anatomy: UWDA
- Other subdomains: …
Trans-namespace integration
- Advantages
  - Over shared identifiers (increased recall)
  - Over lexical mapping (increased recall + precision)
[Diagram: Addison Disease (MeSH:D000224) and Primary adrenocortical insufficiency (ICD9CM:E27.1) match neither by name (X) nor by identifier (X), but both link to UMLS:C0001403]
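Using the CUI as a pivot between namespaces can be sketched as follows. The codes are those shown for Addison's disease on the "Organize terms" slide; the `translate` function is a hypothetical helper, not a UMLS API call.

```python
# (namespace, code) -> CUI, from the Addison's disease example.
CODE_TO_CUI = {
    ("MeSH", "D000224"): "C0001403",
    ("MedDRA", "10036696"): "C0001403",
    ("ICD-10", "E27.1"): "C0001403",
    ("SNOMED CT", "363732003"): "C0001403",
}

def translate(namespace, code, target):
    """Map a code to the equivalent codes in the target namespace, via the shared CUI."""
    cui = CODE_TO_CUI.get((namespace, code))
    if cui is None:
        return []
    return sorted(c for (ns, c), k in CODE_TO_CUI.items() if ns == target and k == cui)

print(translate("MeSH", "D000224", "ICD-10"))     # ['E27.1']
print(translate("ICD-10", "E27.1", "SNOMED CT"))  # ['363732003']
```

The two codes share neither an identifier nor a name ("Addison Disease" vs. "Primary adrenocortical insufficiency"), so neither shared identifiers nor lexical matching would link them; the concept does.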
Ambiguity resolution
NF2 may refer to:
- Neurofibromatosis 2 [disease] C0027832
- Neurofibromin 2 [protein] C0254123
- Neurofibromatosis 2 gene [gene] C0085114
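A minimal sketch of how the category of each candidate concept can resolve the ambiguity of "NF2". The candidate list is taken from this slide; the `resolve` function and the idea that the caller knows which category to expect are assumptions for illustration.

```python
# Candidate concepts for the ambiguous name, with the category shown on the slide.
CANDIDATES = {
    "NF2": [
        ("C0027832", "Neurofibromatosis 2", "disease"),
        ("C0254123", "Neurofibromin 2", "protein"),
        ("C0085114", "Neurofibromatosis 2 gene", "gene"),
    ]
}

def resolve(name, expected_category=None):
    """Return candidate CUIs for a name, optionally restricted to one category."""
    return [
        cui
        for cui, _label, category in CANDIDATES.get(name, [])
        if expected_category is None or category == expected_category
    ]

print(resolve("NF2"))          # all three candidates
print(resolve("NF2", "gene"))  # ['C0085114']
```

In practice the expected category would come from context, for example a database field that holds only genes or the surrounding words in a sentence.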
Other integrative resources
HGNC:2928
http://www.ncbi.nlm.nih.gov/sites/entrez
HPRD:02303
2. Availability: Intellectual property restrictions
- UMLS: free license required
  http://www.nlm.nih.gov/research/umls/license.html
- Some intellectual property restrictions
  - 2/3 of the names freely available (in the US)
  http://www.nlm.nih.gov/research/umls/
- Web browser: username/password required
Availability: Application Programming Interfaces
- Remote server at NLM
- Local application connected through Java RMI
  - Java-based applications
  - Developer’s Guide: Chapter 3
  - Set of Java classes (part of the UMLSKS API download)
  - Detailed Javadoc documentation online and with API download
- Local application connected through TCP/IP socket
  - XML-based queries
  - Developer’s Guide: Chapter 5
  - XML schema
  - Socket server
    - Host: umlsks.nlm.nih.gov
    - Port: 8042
Availability: Web Services-based API
- Part of the Knowledge Source Server version 3
  - Portlet-based, customizable
  - WS architecture
- Coming soon
  - Alpha release in July 2007
  - Beta release in November 2007
3. Formats: Representation formalism
- UMLS
  - Rich Release Format (RRF)
  - [Original Release Format (ORF)]
  - Support for source transparency
- Semantic Web
  - RDF – Resource Description Framework
  - OWL – Web Ontology Language
  - SKOS – Simple Knowledge Organization System
- Other formats
  - OBO – Open Biological Ontologies http://obo.sourceforge.net/browse.html
  - LexGrid http://informatics.mayo.edu/LexGrid/
- Converters
  - OBO – OWL http://www.bioontology.org/tools/oboinowl/obo_converter.html
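The gap between RRF and Semantic Web languages can be illustrated by turning one pipe-delimited concept-name row into a SKOS statement. This is a sketch under several assumptions: the column list follows the MRCONSO.RRF layout as commonly documented (it may differ between releases), the sample row reuses the placeholder LUI, SUI, and AUI values from the headache example rather than real ones, and the `http://example.org/umls/` base is hypothetical.

```python
# Column order assumed for MRCONSO.RRF; check the documentation of the release in use.
MRCONSO_FIELDS = [
    "CUI", "LAT", "TS", "LUI", "STT", "SUI", "ISPREF", "AUI", "SAUI",
    "SCUI", "SDUI", "SAB", "TTY", "CODE", "STR", "SRL", "SUPPRESS", "CVF",
]

# Illustrative row: CUI, source, code, and name are from the Addison's disease example;
# the LUI, SUI, and AUI values are placeholders.
SAMPLE = "C0001403|ENG|P|L0000001|PF|S0000001|Y|A0000001||||MSH|MH|D000224|Addison Disease|0|N||"

LANGUAGE_TAGS = {"ENG": "en"}

def parse(line):
    """Split one RRF row into a field-name -> value dictionary."""
    values = line.rstrip("\n").split("|")
    if values and values[-1] == "":
        values = values[:-1]  # every RRF row ends with a trailing pipe
    return dict(zip(MRCONSO_FIELDS, values))

def to_skos(record, base="http://example.org/umls/"):
    """Render the name in a record as a SKOS label triple (Turtle syntax)."""
    preferred = record["TS"] == "P" and record["STT"] == "PF" and record["ISPREF"] == "Y"
    prop = "skos:prefLabel" if preferred else "skos:altLabel"
    cui, name = record["CUI"], record["STR"]
    lang = LANGUAGE_TAGS.get(record["LAT"], "und")
    return f'<{base}{cui}> {prop} "{name}"@{lang} .'

record = parse(SAMPLE)
print(record["SAB"], record["CODE"])
print(to_skos(record))
```

The conversion keeps the name and its language but drops the source, term type, and atom identifiers, which is one way "source transparency" can be lost when moving from RRF to a simpler RDF rendering.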
UMLS vocabularies available in RDF/OWL
- NCI Thesaurus (OWL)
  - http://ncicb.nci.nih.gov/core/EVS
- Gene Ontology
  - http://www.geneontology.org/
- Repository of biomedical ontologies (OBO, OWL)
  - http://www.bioontology.org/ncbo/faces/index.xhtml
Porting vocabularies to OWL: Experiments
- MeSH
  - Soualmia et al., KR-MED 2004
- Foundational Model of Anatomy (FMA)
  - Golbreich et al., JWS 2006 (OWL DL)
  - Noy and Rubin, SMI Tech Report 2007 (OWL Full)
- UMLS Semantic Network
  - Kashyap and Borgida, ISWC 2003
- UMLS Metathesaurus
  - Cornet and Abu-Hanna, AMIA 2002
UMLS as an “ontology”
[Diagram in three layers]
- UMLS Semantic Network (Semantic Types): Neoplastic Process; Gene or Genome; Amino Acid, Peptide, or Protein; Biologically Active Substance
- UMLS Metathesaurus (Concepts and relations):
  - Neurofibromatosis 2 (Type II neurofibromatosis, Bilateral acoustic neurofibromatosis), C0027832; shown with Neurofibromatoses and Benign neoplasms of cranial nerves
  - NF2 (Neurofibromin 2 gene), C0085114; shown with Tumor suppressor genes
  - Merlin (Schwannomin, Neurofibromin 2), C0254123; shown with Tumor suppressor proteins and Merlin, Drosophila
- External resources:
  - OMIM #101000: NEUROFIBROMATOSIS, TYPE II; NF2
  - Genbank U49724: Drosophila melanogaster merlin (Dmerlin) mRNA, complete cds.
4. UMLS as an ontology: Limitations
- Genes not systematically represented
  - Most gene products and diseases are
- Gene/Gene product-Disease relations
  - Not systematically represented
  - Not explicitly represented (e.g., co-occurrence)
- Cross-references not systematically represented
- Naming conventions (genes)
Underspecified semantics
- Relationship “attribute” not always present
- Relations used to create hierarchies vs. hierarchical relations
Summary
Biomedicine and Semantic Web
- Semantic Web technologies have not been widely adopted yet in biomedicine
  - OBO vs. OWL
  - caBIG vs. Taverna
- Use cases
  - Information/Data integration
- Recent efforts
  - W3C Health Care and Life Sciences Interest Group
UMLS and Semantic Web
- Terminology integration
- Based on existing terminologies
- Trans-namespace, permanent identifiers
- APIs available
- “Proprietary” representation (RRF)
- Some intellectual property restrictions
- Underspecified semantics
- No UMLS-based URIs
- Web Services-based API coming soon
- Can support information integration
Medical Ontology Research
Contact: [email protected]
Web: mor.nlm.nih.gov
Olivier Bodenreider
Lister Hill National Center
for Biomedical Communications
Bethesda, Maryland - USA
UMLS References
- UMLS: umlsinfo.nlm.nih.gov
- UMLS browsers (free, but UMLS license required)
  - Knowledge Source Server: umlsks.nlm.nih.gov
  - Semantic Navigator: http://mor.nlm.nih.gov/perl/semnav.pl
  - RRF browser (standalone application distributed with the UMLS)
UMLS References
- Gentle introduction
  - Bodenreider O. (2004). The Unified Medical Language System (UMLS): Integrating biomedical terminology. Nucleic Acids Research; D267-D270.
    http://mor.nlm.nih.gov/pubs/pdf/2004-nar-ob.pdf
- Seminal paper
  - Lindberg, D. A., Humphreys, B. L., & McCray, A. T. (1993). The Unified Medical Language System. Methods Inf Med, 32(4), 281-91.
Semantic Web for Health Care and Life Sciences
- W3C Health Care and Life Sciences Interest Group
  - http://www.w3.org/2001/sw/hcls/
- Ruttenberg A, Clark T, Bug W, Samwald M, Bodenreider O, Chen H, Doherty D, Forsberg K, Gao Y, Kashyap V, Kinoshita J, Luciano J, Marshall MS, Ogbuji C, Rees J, Stephens S, Wong GT, Wu E, Zaccagnini D, Hongsermeier T, Neumann E, Herman I, Cheung K-H. Advancing translational research with the Semantic Web. BMC Bioinformatics 2007;8(Suppl 3):S2.
  http://mor.nlm.nih.gov/pubs/pdf/2007-bmc_bioinformatics-ar.pdf
- Demo presented at the WWW2007 conference (May 2007)
  http://esw.w3.org/topic/HCLS/HCLSIG_DemoHomePage_HCLSIG_Demo
Biomedical information integration
through RDF
- Biomedical perspective
  - Sahoo S, Zeng K, Bodenreider O, Sheth AP. (2007). From “glycosyltransferase” to “congenital muscular dystrophy”: Integrating knowledge from NCBI Entrez Gene and the Gene Ontology. Proceedings of Medinfo (in press).
    http://mor.nlm.nih.gov/pubs/pdf/2007-medinfo-ss.pdf
- Semantic Web perspective
  - Sahoo S, Zeng K, Bodenreider O, Sheth AP. (2007). An experiment in integrating large biomedical knowledge resources with RDF: Application to associating genotype and phenotype information. Proceedings of the workshop on Health Care and Life Sciences Data Integration for the Semantic Web at the 16th International World Wide Web Conference (WWW2007) (in press).
    http://mor.nlm.nih.gov/pubs/pdf/2007-www_hcls-ss.pdf