A corpus-driven approach for design, evolution and alignment of ontologies

Thomas Wächter (2), He Tan (1), André Wobst (2), Patrick Lambrix (1), Michael Schroeder (2)
(1) Dept. of Computer and Information Science, Linköpings universitet, Sweden
(2) Biotec, TU Dresden, Germany

Outline
- Introduction
- Ontology design and evolution
- Ontology alignment
- Conclusion

Ontologies
“Ontologies define the basic terms and relations comprising the vocabulary of a topic area, as well as the rules for combining terms and relations to define extensions to the vocabulary.”

Ontologies in biomedical research
- Many biomedical ontologies, e.g. GO, OBO, SNOMED-CT
- Practical use of biomedical ontologies, e.g. databases annotated with GO

GENE ONTOLOGY (GO)
immune response
  i- acute-phase response
  i- anaphylaxis
  i- antigen presentation
  i- antigen processing
  i- cellular defense response
  i- cytokine metabolism
  i- cytokine biosynthesis
     synonym cytokine production
  …
  p- regulation of cytokine biosynthesis
  …
  …
  i- B-cell activation
  i- B-cell differentiation
  i- B-cell proliferation
  i- cellular defense response
  …
  i- T-cell activation
  i- activation of natural killer cell activity
  …

Ontologies in Systems Biology
- Protein Interaction Ontology (PSI MI)
- Systems Biology Ontology (SBML)
- Biological Pathways Exchange (BioPAX)

Ontologies
Ontologies are used
- for communication between people and organizations
- for enabling knowledge reuse and sharing
- as basis for interoperability between systems
- as repository of information
- as query model for information sources
Key technology for the Semantic Web

Ontology design and alignment
- Often done manually
- Difficult and time-consuming
Q: Can we automate part of the design and alignment processes?
Q: Can we use the vast amount of literature? (PubMed: 16 000 000 documents)

Ontology design and evolution

1. Superstring prediction
- Terms in GO: 20223
- Terms in abstracts: 14995
- Terms having children containing the term: 3129 (example: "membrane" is contained in its child "inner membrane")
- Parent found in abstracts: 2692
- Parent and one child found in abstracts: 2239
- Parent and all children found in abstracts: 1761

Superstring prediction
Intuition: words that often precede or follow a term may be used together with the term to define new terms.
Algorithm: identify the words that precede or follow a term and rank them by frequency of occurrence.
Predicted candidates:
- TOP5: 27% of the children
- TOP50: 45% of the children

[Figure: example for the term "GTPase activator activity"]
[Figure: example for the term "Death"]

2. Term co-occurrence analysis
- Terms in GO: 20223
- Terms in abstracts: 14995
- Terms having children: 7541
- Parent found in abstracts: 5964
- Parent and one child found in abstracts: 5185
- Parent and all children found in abstracts: 3757

Term co-occurrence analysis
Intuition: co-occurrence with terms may indicate a relevant ontology term.
Algorithm: count the frequency of co-occurrence and rank.
Predicted candidates:
- TOP5: 4% of the children
- TOP50: 10% of the children

[Figure: example for the term "endosome"]

Ontology alignment

Ontology alignment - Motivation
Ontologies with overlapping information

GENE ONTOLOGY (GO)
immune response (same listing as shown above)

SIGNAL-ONTOLOGY (SigO)
Immune Response
  i- Allergic Response
  i- Antigen Processing and Presentation
  i- B Cell Activation
  i- B Cell Development
  i- Complement Signaling
     synonym complement activation
  i- Cytokine Response
  i- Immune Suppression
  i- Inflammation
  i- Intestinal Immunity
  i- Leukotriene Response
  i- Leukotriene Metabolism
  i- Natural Killer Cell Response
  i- T Cell Activation
  i- T Cell Development
  i- T Cell Selection in Thymus

Ontology alignment - Motivation
- Use of multiple ontologies, e.g. custom-specific ontology + standard ontology
- Bottom-up creation of ontologies: experts can focus on their domain of expertise
→ important to know the inter-ontology relationships

Aligning ontologies
[Figure: the GO and SigO immune response listings side by side, annotated with equivalent concepts, equivalent relations and is-a relations]

Alignment Strategies
- Strategies based on linguistic matching
    Example: GO: Complement Activation; SigO: complement signaling, synonym complement activation
- Structure-based strategies
- Constraint-based approaches
- Instance-based strategies (figure labels: instance corpus, ontology)
- Use of auxiliary information (figure labels: dictionary, thesauri, intermediate ontology, alignment strategies)
- Combining different approaches
[Figure: two ontologies O1 and O2, each containing the concept Animal, with the concepts Person and Human]

A General Alignment Strategy
[Figure labels: source ontologies; instance corpora, general dictionaries, domain thesauri; alignment algorithm with matchers, combination and filter; suggestions; user; accepted suggestions; conflict checker; alignments]
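
The components named on the general alignment strategy slide (matchers, combination, filter, suggestions) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the two toy matchers, the weighted-average combination, and every function name below are assumptions made for illustration only.

```python
from itertools import product
from typing import Callable, List, Tuple

Matcher = Callable[[str, str], float]  # returns a similarity in [0, 1]


def exact_name_matcher(a: str, b: str) -> float:
    """Hypothetical matcher: 1.0 if the names are equal ignoring case."""
    return 1.0 if a.lower() == b.lower() else 0.0


def token_overlap_matcher(a: str, b: str) -> float:
    """Hypothetical matcher: Jaccard overlap of the word sets of both names."""
    ta = set(a.lower().replace("-", " ").split())
    tb = set(b.lower().replace("-", " ").split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def combine(scores: List[float], weights: List[float]) -> float:
    """Assumed combination step: weighted average of the matcher scores."""
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)


def suggest_alignments(terms1: List[str], terms2: List[str],
                       matchers: List[Matcher], weights: List[float],
                       threshold: float) -> List[Tuple[str, str, float]]:
    """Run all matchers on all term pairs, combine, then filter by threshold."""
    suggestions = []
    for t1, t2 in product(terms1, terms2):
        score = combine([m(t1, t2) for m in matchers], weights)
        if score >= threshold:  # filter step
            suggestions.append((t1, t2, score))
    return sorted(suggestions, key=lambda s: s[2], reverse=True)


go_terms = ["B-cell activation", "T-cell activation", "anaphylaxis"]
sigo_terms = ["B Cell Activation", "T Cell Activation", "Inflammation"]
suggestions = suggest_alignments(
    go_terms, sigo_terms,
    [exact_name_matcher, token_overlap_matcher], [1.0, 1.0], threshold=0.5)
for t1, t2, score in suggestions:
    print(f"{t1} <-> {t2}: {score:.2f}")
```

In the strategy on the slide, the resulting suggestions would still go to a user for acceptance and to a conflict checker; those interactive steps are left out of the sketch.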
1. Text classification approach
Intuition: a similarity measure between concepts can be computed based on the probability that documents about one concept are also about the other concept, and vice versa.

Text classification approach - steps
- Generate corpora
    - Use concept as query term in PubMed
    - Retrieve most recent PubMed abstracts
- Generate classifiers
    - One classifier per ontology
    - Naive Bayes classifiers
- Classification
    - Abstracts related to one ontology are classified by the other ontology's classifier, and vice versa
- Calculate similarities

Evaluation
- GO vs. SigO
    - Immune Defense (ID): 70 terms from GO, 15 terms from SigO
    - Behavior (B): 60 terms from GO, 10 terms from SigO
- GO vs. EC
    - EC1.3: 141 terms from GO, 129 terms from EC
    - EC1.1.3: 29 terms from GO, 31 terms from EC

Text classification approach
[Table: total number of suggestions and number of correct suggestions]

Evaluation - quality of suggestions
[Figure: classification approach; recall and precision versus threshold (0.4 to 0.8) for the cases B, ID, EC1.3 and EC1.1.3]

Comparison with other matchers
- Matchers: Classification, Term, TermWN, Dom
- Parameters
    - Quality of suggestions: precision/recall
    - Thresholds: 0.4, 0.5, 0.6, 0.7, 0.8
- Data: GO - SigO, MA - MeSH

Evaluation - quality of suggestions
[Figure: classification approach; recall and precision versus threshold (0.4 to 0.8) for the cases B, ID, nose, ear and eye]
[Figure: terminological matchers; recall and precision versus threshold (0.4 to 0.8) for the cases B, ID, nose, ear and eye]
[Figure: domain matcher; recall and precision versus threshold (0.4 to 0.8) for the cases B, ID, nose, ear and eye]

Evaluation - quality of suggestions
Comparison of the matchers
- CS_TermWN ⊇ CS_Dom ⊇ CS_Classification
- Combinations of the different matchers lead to higher quality results

2. Cluster approach
Intuition: a similarity measure between concepts can be computed based on normalized information distance.

Cluster approach - steps
- For each concept, use the concept as query term in PubMed and retrieve the number of hits (documents)
- For each pair of concepts, use the conjunction of the concepts as query term in PubMed and retrieve the number of hits (documents)
- Calculate distance
- Cluster the concepts based on distance using complete-link hierarchical clustering
- For a given distance threshold, concepts from different ontologies in the same cluster are alignment candidates

Cluster approach
[Table: total number of clusters containing terms from both ontologies, number of clusters containing correct suggestions, number of correct suggestions]

Evaluation
- Recall is better for the cluster approach than for the classification approach
- Many clusters containing elements from both ontologies contain a correct suggestion
- Often one correct suggestion per cluster

Evaluation
The quality of suggestions heavily depends on the related documents in PubMed.
- Case B: no related documents for 10%; fewer than 10 related documents for an additional 4%
- Case EC1.3: no related documents for 40%; fewer than 10 related documents for an additional 26%

Conclusion
- Use of literature for designing and aligning ontologies
    - Possible approach
    - Much room for improvement of algorithms
    - Combining with other approaches may give superior results

Future work
- Design and evolution
    - Extend analysis to descendants/other terms
    - Use POS tagging
- Alignment
    - Classification approach
        - Use structure of ontologies
        - Different text classifiers
    - Cluster approach
        - Use clusters for filtering alignment suggestions
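
As an illustration of the superstring prediction algorithm from the design and evolution part (identify the words that precede or follow a term and rank them by frequency of occurrence), a minimal Python sketch could look as follows. The function name `superstring_candidates`, the simple tokenizer, and the three toy abstracts are assumptions for illustration; the slides do not specify tokenization or stop-word handling, so none is applied here.

```python
import re
from collections import Counter
from typing import List, Tuple


def superstring_candidates(term: str, abstracts: List[str],
                           top_n: int = 5) -> List[Tuple[str, int]]:
    """Rank candidate terms formed by the word before or after `term`."""
    term_tokens = term.lower().split()
    n = len(term_tokens)
    counts: Counter = Counter()
    for text in abstracts:
        # Assumed tokenizer: lowercase alphanumeric runs, punctuation dropped.
        tokens = re.findall(r"[a-z0-9]+", text.lower())
        for i in range(len(tokens) - n + 1):
            if tokens[i:i + n] != term_tokens:
                continue
            if i > 0:  # word preceding the term
                counts[f"{tokens[i - 1]} {term.lower()}"] += 1
            if i + n < len(tokens):  # word following the term
                counts[f"{term.lower()} {tokens[i + n]}"] += 1
    return counts.most_common(top_n)


# Toy abstracts, invented for this example.
abstracts = [
    "Proteins of the inner membrane are imported across the outer membrane.",
    "The inner membrane potential drives import.",
    "Inner membrane proteins were analysed.",
]
top_candidate = superstring_candidates("membrane", abstracts, top_n=1)
print(top_candidate)
```

On these toy abstracts the top-ranked candidate for "membrane" is "inner membrane", which matches the parent and child example given on the slides.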
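
The steps of the text classification approach (one Naive Bayes classifier per ontology, abstracts of one ontology classified by the other ontology's classifier and vice versa, then similarities) can be sketched in Python as below. The slides do not state the similarity formula, so the one used here is an assumption: the number of cross-classified abstracts in both directions divided by the total number of abstracts of the two concepts. The tiny classifier, the names `NaiveBayes` and `similarity`, and the toy corpora are likewise illustrative only.

```python
import math
from collections import Counter
from typing import Dict, List

Corpora = Dict[str, List[str]]  # concept name -> abstracts retrieved for it


def tokenize(text: str) -> List[str]:
    return text.lower().split()


class NaiveBayes:
    """Tiny multinomial Naive Bayes with add-one smoothing (illustrative)."""

    def __init__(self, corpora: Corpora) -> None:
        self.doc_counts = {c: len(docs) for c, docs in corpora.items()}
        self.word_counts = {c: Counter(w for d in docs for w in tokenize(d))
                            for c, docs in corpora.items()}
        self.totals = {c: sum(wc.values())
                       for c, wc in self.word_counts.items()}
        self.vocab_size = len(set().union(*self.word_counts.values()))
        self.n_docs = sum(self.doc_counts.values())

    def classify(self, text: str) -> str:
        """Return the concept with the highest log posterior for `text`."""
        def log_posterior(concept: str) -> float:
            score = math.log(self.doc_counts[concept] / self.n_docs)
            denom = self.totals[concept] + self.vocab_size
            for word in tokenize(text):
                score += math.log((self.word_counts[concept][word] + 1) / denom)
            return score
        return max(self.doc_counts, key=log_posterior)


def similarity(c1: str, c2: str, corpora1: Corpora, corpora2: Corpora,
               nb1: NaiveBayes, nb2: NaiveBayes) -> float:
    """Assumed measure: share of abstracts assigned to the other concept."""
    n12 = sum(nb2.classify(d) == c2 for d in corpora1[c1])
    n21 = sum(nb1.classify(d) == c1 for d in corpora2[c2])
    return (n12 + n21) / (len(corpora1[c1]) + len(corpora2[c2]))


# Toy corpora, invented for this example (real input: PubMed abstracts).
go = {"B-cell activation": ["b cell activation antibody lymphocyte",
                            "b lymphocyte activation antigen"],
      "anaphylaxis": ["severe allergic reaction histamine",
                      "allergic shock histamine release"]}
sigo = {"B Cell Activation": ["activation of b lymphocyte by antigen",
                              "b cell antibody response"],
        "Allergic Response": ["allergic reaction to allergen histamine",
                              "histamine allergic symptoms"]}
nb_go, nb_sigo = NaiveBayes(go), NaiveBayes(sigo)
sim_match = similarity("B-cell activation", "B Cell Activation",
                       go, sigo, nb_go, nb_sigo)
sim_mismatch = similarity("B-cell activation", "Allergic Response",
                          go, sigo, nb_go, nb_sigo)
print(sim_match, sim_mismatch)
```

Pairs whose similarity reaches a chosen threshold would then be reported as alignment suggestions.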
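
The cluster approach (hit counts per concept and per pair of concepts, a distance, complete-link hierarchical clustering, and a distance threshold) can be sketched in Python as follows. The slides name normalized information distance but do not give the formula; the sketch assumes the usual hit-count approximation known as the normalized Google distance. All hit counts are invented, only the corpus size of 16 000 000 documents comes from the slides, and the function names are hypothetical.

```python
import math
from itertools import combinations
from typing import Dict, FrozenSet, List, Set, Tuple

Concept = Tuple[str, str]  # (ontology name, concept name)


def hit_distance(fx: int, fy: int, fxy: int, n_docs: int) -> float:
    """Assumed distance from hit counts (normalized Google distance)."""
    if min(fx, fy, fxy) == 0:
        return math.inf  # no usable evidence in the corpus
    lx, ly, lxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(lx, ly) - lxy) / (math.log(n_docs) - min(lx, ly))


def complete_link_clusters(concepts: List[Concept],
                           dist: Dict[FrozenSet[Concept], float],
                           threshold: float) -> List[Set[Concept]]:
    """Merge the closest clusters until the linkage exceeds the threshold."""
    clusters: List[Set[Concept]] = [{c} for c in concepts]

    def linkage(a: Set[Concept], b: Set[Concept]) -> float:
        # Complete link: distance of the two most distant members.
        return max(dist[frozenset((x, y))] for x in a for y in b)

    while len(clusters) > 1:
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda p: linkage(clusters[p[0]], clusters[p[1]]))
        if linkage(clusters[i], clusters[j]) > threshold:
            break
        clusters[i] |= clusters[j]
        del clusters[j]  # safe because i < j
    return clusters


def alignment_candidates(clusters: List[Set[Concept]]):
    """Pairs of concepts from different ontologies within one cluster."""
    return [(a, b) for cluster in clusters
            for a, b in combinations(sorted(cluster), 2) if a[0] != b[0]]


N_DOCS = 16_000_000  # PubMed size given on the slides
go_b, sigo_b = ("GO", "B-cell activation"), ("SigO", "B Cell Activation")
go_a, sigo_i = ("GO", "anaphylaxis"), ("SigO", "Inflammation")
single_hits = {go_b: 10_000, sigo_b: 12_000, go_a: 20_000, sigo_i: 300_000}
pair_hits = {frozenset((go_b, sigo_b)): 9_000, frozenset((go_b, go_a)): 50,
             frozenset((go_b, sigo_i)): 500, frozenset((sigo_b, go_a)): 60,
             frozenset((sigo_b, sigo_i)): 600, frozenset((go_a, sigo_i)): 2_000}
dist = {}
for a, b in combinations(single_hits, 2):
    pair = frozenset((a, b))
    dist[pair] = hit_distance(single_hits[a], single_hits[b],
                              pair_hits[pair], N_DOCS)

clusters = complete_link_clusters(list(single_hits), dist, threshold=0.5)
candidates = alignment_candidates(clusters)
print(candidates)
```

With these invented counts only the two B-cell activation concepts fall into one cluster at threshold 0.5, so they form the single alignment candidate.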