Ontologies

A corpus-driven approach for design,
evolution and alignment of ontologies
Thomas Wächter (2), He Tan (1), André Wobst (2),
Patrick Lambrix (1), Michael Schroeder (2)
1 Dept. of Computer and Information Science,
Linköpings universitet, Sweden
2 Biotec, TU Dresden, Germany
Outline
Introduction
Ontology design and evolution
Ontology alignment
Conclusion
Ontologies
“Ontologies define the basic terms and relations
comprising the vocabulary of a topic area, as
well as the rules for combining terms and
relations to define extensions to the
vocabulary.”
Ontologies
Ontologies in biomedical research
- many biomedical ontologies, e.g. GO, OBO, SNOMED-CT
- practical use of biomedical ontologies, e.g. databases annotated with GO
GENE ONTOLOGY (GO)
immune response
i- acute-phase response
i- anaphylaxis
i- antigen presentation
i- antigen processing
i- cellular defense response
i- cytokine metabolism
i- cytokine biosynthesis
synonym cytokine production
…
p- regulation of cytokine biosynthesis
…
…
i- B-cell activation
i- B-cell differentiation
i- B-cell proliferation
i- cellular defense response
…
i- T-cell activation
i- activation of natural killer cell activity
…
Ontologies in Systems Biology
- Protein Interaction Ontology (PSI MI)
- Systems Biology Ontology (SBML)
- Biological Pathways Exchange (BioPAX)
Ontologies
Ontologies used
- for communication between people and organizations
- for enabling knowledge reuse and sharing
- as basis for interoperability between systems
- as repository of information
- as query model for information sources
Key technology for the Semantic Web
Ontology design and alignment
- Often done manually
- Difficult and time-consuming
Q: Can we automate part of the design and alignment processes?
Q: Can we use the vast amount of literature? (PubMed: 16 000 000 documents)
Ontology design and evolution
1. Superstring prediction
Terms in GO: 20223
Terms in abstracts: 14995
Terms having children containing the term: 3129
Example: membrane (parent), inner membrane (child containing the parent term)
Superstring prediction
Parent found in abstracts: 2692
Parent and one child found in abstracts: 2239
Parent and all children found in abstracts: 1761
Superstring prediction
Intuition: words that often precede or follow a term may be combined with that term to define new terms
Algorithm: identify the words that precede or follow a term and rank them by frequency of occurrence
Predicted candidates:
TOP5: 27% of the children
TOP50: 45% of the children
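The ranking step above can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `rank_superstring_candidates`, the simple tokenizer, and the choice to look only at the single word directly before or after the term are all hypothetical, and no stop-word or sentence-boundary handling is attempted.

```python
import re
from collections import Counter


def rank_superstring_candidates(term, texts, top_n=5):
    """Rank candidate terms built from the word just before or after `term`.

    Assumption: a candidate is the term extended by exactly one neighbouring
    word; candidates are ranked by how often they occur in `texts`.
    """
    term_tokens = term.lower().split()
    term_text = " ".join(term_tokens)
    n = len(term_tokens)
    counts = Counter()
    for text in texts:
        # Crude tokenizer: lower-case alphanumeric words, hyphens kept inside words.
        tokens = re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())
        for i in range(len(tokens) - n + 1):
            if tokens[i:i + n] != term_tokens:
                continue
            if i > 0:  # word preceding the term
                counts[f"{tokens[i - 1]} {term_text}"] += 1
            if i + n < len(tokens):  # word following the term
                counts[f"{term_text} {tokens[i + n]}"] += 1
    return counts.most_common(top_n)
```

Called with the parent term "membrane" and a list of abstracts, the function returns candidate child terms such as "inner membrane" together with their frequencies.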
Example term: GTPase activator activity
Example term: Death
2. Term co-occurrence analysis
Terms in GO: 20223
Terms in abstracts: 14995
Terms having children: 7541
Parent found in abstracts: 5964
Parent and one child found in abstracts: 5185
Parent and all children found in abstracts: 3757
Term co-occurrence analysis
Intuition: terms that frequently co-occur with a given term may indicate relevant ontology terms
Algorithm: count the frequency of co-occurrence and rank the candidates
Predicted candidates:
TOP5: 4% of the children
TOP50: 10% of the children
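A co-occurrence ranking of this kind can be sketched as follows. The sketch makes several assumptions that the slides do not state: co-occurrence is counted per abstract (each abstract counts once per word), candidates are single words rather than multi-word terms, and the small `STOPWORDS` set and the function name `rank_cooccurring_words` are illustrative only.

```python
import re
from collections import Counter

# Illustrative stop-word list; a real run would need a much larger one.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "from", "with", "in", "is", "are", "for"}


def rank_cooccurring_words(term, abstracts, top_n=5):
    """Rank words by the number of abstracts in which they co-occur with `term`."""
    term = term.lower()
    term_pattern = re.compile(rf"\b{re.escape(term)}\b")
    counts = Counter()
    for abstract in abstracts:
        text = abstract.lower()
        if not term_pattern.search(text):
            continue  # the term does not occur in this abstract
        words = set(re.findall(r"[a-z][a-z\-]+", text))
        words -= STOPWORDS
        words -= set(term.split())  # do not count the term itself
        counts.update(words)
    return counts.most_common(top_n)
```

The returned list pairs each candidate word with the number of abstracts it shares with the term, highest first.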
Example term: endosome
Ontology alignment
Ontology alignment - Motivation
Ontologies with overlapping information
GENE ONTOLOGY (GO)
immune response
i- acute-phase response
i- anaphylaxis
i- antigen presentation
i- antigen processing
i- cellular defense response
i- cytokine metabolism
i- cytokine biosynthesis
synonym cytokine production
…
p- regulation of cytokine biosynthesis
…
…
i- B-cell activation
i- B-cell differentiation
i- B-cell proliferation
i- cellular defense response
…
i- T-cell activation
i- activation of natural killer cell activity
…
SIGNAL-ONTOLOGY (SigO)
Immune Response
i- Allergic Response
i- Antigen Processing and Presentation
i- B Cell Activation
i- B Cell Development
i- Complement Signaling
synonym complement activation
i- Cytokine Response
i- Immune Suppression
i- Inflammation
i- Intestinal Immunity
i- Leukotriene Response
i- Leukotriene Metabolism
i- Natural Killer Cell Response
i- T Cell Activation
i- T Cell Development
i- T Cell Selection in Thymus
Ontology alignment - Motivation
- Use of multiple ontologies, e.g. custom-specific ontology + standard ontology
- Bottom-up creation of ontologies: experts can focus on their domain of expertise
→ important to know the inter-ontology relationships
Aligning ontologies
[Figure: the GO and SigO immune response hierarchies with alignment links between them; legend: equivalent concepts, equivalent relations, is-a relation]
Alignment Strategies
- Strategies based on linguistic matching
- Structure-based strategies
- Constraint-based approaches
- Instance-based strategies
- Use of auxiliary information
- Combining different approaches
[Example: GO: Complement Activation; SigO: complement signaling, synonym complement activation]
Alignment Strategies
[Figure: ontologies O1 and O2 with the concepts Person, Human and Animal]
Alignment Strategies
[Figure: an instance corpus associated with the ontology]
Alignment Strategies
[Figure: auxiliary resources (dictionary, thesauri, intermediate ontology) used by the alignment strategies]
A General Alignment Strategy
[Figure: alignment framework. Inputs: source ontologies; auxiliary resources: instance corpora, general dictionaries, domain thesauri. Alignment algorithm: matchers, combination, filter, suggestions, user, accepted suggestions, conflict checker. Output: alignments.]
1. Text classification approach
Intuition: A similarity measure between concepts can be computed based on the probability that documents about one concept are also about the other concept and vice versa.
Text classification approach - steps
- Generate corpora
  - Use concept as query term in PubMed
  - Retrieve most recent PubMed abstracts
- Generate classifiers
  - One classifier per ontology
  - Naive Bayes classifiers
Text classification approach - steps
- Classification
  - Abstracts related to one ontology are classified by the other ontology's classifier and vice versa
- Calculate similarities
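The classification and similarity steps can be sketched with a small multinomial Naive Bayes classifier. Everything specific here is an assumption made for illustration: the whitespace tokenizer, add-one smoothing, and in particular the similarity formula, which takes the fraction of abstracts of the two concepts that the other ontology's classifier assigns to the other concept. The slides do not give the exact formula used.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


class NaiveBayes:
    """Multinomial Naive Bayes over bags of words, one class per concept.

    `corpora` maps each concept of ONE ontology to a non-empty list of abstracts.
    """

    def __init__(self, corpora):
        self.word_counts = {
            concept: Counter(w for doc in docs for w in tokenize(doc))
            for concept, docs in corpora.items()
        }
        self.vocab = {w for wc in self.word_counts.values() for w in wc}
        total_docs = sum(len(docs) for docs in corpora.values())
        self.log_prior = {
            concept: math.log(len(docs) / total_docs)
            for concept, docs in corpora.items()
        }

    def classify(self, text):
        words = [w for w in tokenize(text) if w in self.vocab]

        def score(concept):
            wc = self.word_counts[concept]
            denom = sum(wc.values()) + len(self.vocab)  # add-one smoothing
            return self.log_prior[concept] + sum(
                math.log((wc[w] + 1) / denom) for w in words
            )

        return max(self.word_counts, key=score)


def similarity(c1, c2, corpora1, corpora2, nb1, nb2):
    """Assumed measure: share of abstracts of c1 and c2 classified as the other concept."""
    n12 = sum(nb2.classify(doc) == c2 for doc in corpora1[c1])
    n21 = sum(nb1.classify(doc) == c1 for doc in corpora2[c2])
    return (n12 + n21) / (len(corpora1[c1]) + len(corpora2[c2]))
```

Here `nb1` would be trained on the abstracts retrieved for the first ontology and `nb2` on those of the second; concept pairs whose similarity reaches a chosen threshold become alignment suggestions.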
Evaluation
- GO vs. SigO
  - Immune Defense: 70 terms from GO, 15 terms from SigO
  - Behavior: 60 terms from GO, 10 terms from SigO
- GO vs. EC
  - EC1.3: 141 terms from GO and 129 terms from EC
  - EC1.1.3: 29 terms from GO and 31 terms from EC
Text classification approach
- Total number of suggestions
- Number of correct suggestions
Evaluation – quality of suggestions
Classification approach
[Figure: precision and recall versus threshold (0.4 to 0.8) for the cases B, ID, EC1.3 and EC1.1.3]
Comparison with other matchers
- Matchers: Classification, Term, TermWN, Dom
- Parameters
  - Quality of suggestions: precision/recall
  - Thresholds: 0.4, 0.5, 0.6, 0.7, 0.8
- Data: GO - SigO, MA - MeSH
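Precision and recall of the suggestions at a given threshold can be computed as in the sketch below. The function name, the data layout (a dictionary from concept pairs to similarity values and a set of reference alignments) and the use of `>=` for the threshold test are assumptions made for this example.

```python
def precision_recall(scored_pairs, reference, threshold):
    """Return (precision, recall) of the pairs whose score reaches `threshold`.

    scored_pairs: {(concept_in_onto1, concept_in_onto2): similarity}
    reference:    set of pairs regarded as correct alignments
    """
    suggested = {pair for pair, score in scored_pairs.items() if score >= threshold}
    correct = suggested & reference
    precision = len(correct) / len(suggested) if suggested else 0.0
    recall = len(correct) / len(reference) if reference else 0.0
    return precision, recall
```

Evaluating the function at each of the thresholds 0.4 to 0.8 gives one precision value and one recall value per threshold, which is the kind of curve plotted in the evaluation slides.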
Evaluation – quality of suggestions
Classification approach
[Figure: precision and recall versus threshold (0.4 to 0.8) for the cases B, ID, nose, ear and eye]
Evaluation – quality of suggestions
Terminological matchers
[Figure: precision and recall versus threshold (0.4 to 0.8) for the cases B, ID, nose, ear and eye]
Evaluation – quality of suggestions
Domain matcher
[Figure: precision and recall versus threshold (0.4 to 0.8) for the cases B, ID, nose, ear and eye]
Evaluation – quality of suggestions
- Comparison of the matchers: CS_TermWN ⊇ CS_Dom ⊇ CS_Classification
- Combinations of the different matchers lead to higher quality results
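One simple way to combine matchers is a weighted average of their similarity values, sketched below. The slides do not say which combination scheme was used, so the weighted average, the function name `combine` and the treatment of a missing score as 0.0 are assumptions.

```python
def combine(matcher_scores, weights):
    """Weighted average of similarity values from several matchers.

    matcher_scores: {matcher_name: {(concept1, concept2): similarity}}
    weights:        {matcher_name: weight}; a pair missing from a matcher counts as 0.0
    """
    total_weight = sum(weights[m] for m in matcher_scores)
    pairs = set().union(*(scores.keys() for scores in matcher_scores.values()))
    return {
        pair: sum(weights[m] * matcher_scores[m].get(pair, 0.0) for m in matcher_scores)
        / total_weight
        for pair in pairs
    }
```

The combined values can then be filtered with a threshold in the same way as the output of a single matcher.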
2. Cluster approach
Intuition: A similarity measure between concepts can be computed based on normalized information distance.
Cluster approach - steps
- For each concept, use concept as query term in PubMed and retrieve the number of hits (documents)
- For each pair of concepts, use the conjunction of the concepts as query term in PubMed and retrieve the number of hits (documents)
- Calculate distance
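A hit-count approximation of the normalized information distance is the normalized Google distance of Cilibrasi and Vitányi. The sketch below assumes that this is the formula meant by the distance step; the function name and the convention of returning infinity when any hit count is zero are choices made for this example, and the hit counts themselves would come from PubMed queries that are not shown.

```python
import math


def normalized_distance(hits_x, hits_y, hits_xy, total_docs):
    """Normalized Google distance computed from document hit counts.

    hits_x, hits_y: number of documents matching each concept alone
    hits_xy:        number of documents matching both concepts
    total_docs:     size of the document collection (e.g. 16 000 000 for PubMed)
    """
    if hits_x == 0 or hits_y == 0 or hits_xy == 0:
        return float("inf")  # no evidence that the concepts are related
    log_x, log_y, log_xy = math.log(hits_x), math.log(hits_y), math.log(hits_xy)
    return (max(log_x, log_y) - log_xy) / (math.log(total_docs) - min(log_x, log_y))
```

Two concepts that always occur together get distance 0; the distance grows as the number of shared documents shrinks relative to the individual hit counts.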
Cluster approach - steps
- Cluster the concepts based on distance using complete-link hierarchical clustering
- For a given distance threshold, concepts from different ontologies in the same cluster are alignment candidates
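The two steps above can be sketched with a naive agglomerative procedure. The function names, the distance lookup keyed by `frozenset` pairs and the `ontology_of` mapping are assumptions for illustration; the procedure is a straightforward complete-link clustering that stops merging once the closest pair of clusters is farther apart than the threshold.

```python
from itertools import combinations


def complete_link_clusters(items, dist, threshold):
    """Merge clusters while their complete-link distance is <= threshold.

    items: unique concept identifiers
    dist:  {frozenset((x, y)): distance} for every pair of items
    """
    clusters = [frozenset([item]) for item in items]

    def link(a, b):
        # Complete link: distance between the two farthest members.
        return max(dist[frozenset((x, y))] for x in a for y in b)

    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda pair: link(*pair))
        if link(a, b) > threshold:
            break
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return clusters


def alignment_candidates(clusters, ontology_of):
    """Pairs of concepts from different ontologies that share a cluster."""
    return {
        tuple(sorted((x, y)))
        for cluster in clusters
        for x, y in combinations(sorted(cluster), 2)
        if ontology_of[x] != ontology_of[y]
    }
```

This naive version recomputes all cluster distances in every round, which is adequate for the small term sets used in the evaluation but would be slow for whole ontologies.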
Cluster approach
- Total number of clusters containing terms from both ontologies
- Number of clusters containing correct suggestions
- Number of correct suggestions
Evaluation
- Recall is better for cluster approach than for classification approach
- Many clusters containing elements from both ontologies contain a correct suggestion
- Often one correct suggestion per cluster
Evaluation
- The quality of suggestions heavily depends on the related documents in PubMed.
  - case B: 10%, no related documents; additional 4%, fewer than 10 related documents
  - case EC1.3: 40%, no related documents; additional 26%, fewer than 10 related documents
Conclusion
- Use of literature for designing and aligning ontologies
  - Possible approach
  - Much room for improvement of algorithms
  - Combining with other approaches may give superior results
Future work
- Design and evolution
  - Extend analysis to descendants/other terms
  - Using POS tagging
- Alignment
  - Classification approach
    - Use structure of ontologies
    - Different text classifiers
  - Cluster approach
    - Use clusters for filtering alignment suggestions