Toward Making Online Biological Data Machine Understandable

Cui Tao
Data Extraction Research Group
Department of Computer Science, Brigham Young University, Provo, UT 84602

Introduction

PROBLEMS:
  Huge, evolving number of bio-databases, e.g. the molecular biology database collection
    2004: total 548, 162 more than 2003
    2005: total 719, 171 more than 2004
  Different access capabilities
  Syntactic heterogeneity
  Semantic heterogeneity
  Updated at any time by independent authorities

GOALS:
  To help biologists cross search various resources
  Examples:
    Cross-linked information (join queries)
      “Find genes which are longer than 5kbp, whose products have at least two helices, and participate in glycolysis” (GenBank, PDB, KEGG)
    Collecting information from similar data sources (union queries)
      “Find genes newly annotated after Jan. 2003 in the fly and worm genomes” (FlyBase, WormBase)

SOLUTION:
  Source page understanding
    Table interpretation
    Aligning with an ontology
  Source location through semantic annotation
    Metadata vs. instance data annotation
    Use of annotation in query processing
  Ontology evolution
    Adjustments to ISA and Part-Of hierarchies
    Addition of attributes

Source Page Understanding

SIBLING PAGE COMPARISON
  Key Concepts: sibling pages and sibling tables
  Main Idea: Compare two sibling tables: variable fields ~ values & fixed fields ~ labels
  Steps:
    Transfer each HTML table to a DOM tree
    Find sibling tree pairs
    Compare and find matched nodes
    Generate a structure pattern for all sibling tables

  [Figure: DOM trees (table, tr, td nodes) of two sibling tables. Cell contents:]

  Sibling table 1:
    Gene Model | Status | Nucleotides (coding/transcript) | Protein | Swissprot | Amino Acids
    F47G6.1 1, 2 | confirmed by cDNA(s) | 1773/7391 bp | WP:CE26812 | DTN1_CAEEL | 590 aa

  Sibling table 2:
    Gene Model | Status | Nucleotides (coding/transcript) | Protein | Amino Acids
    F18H3.5a 1, 2 | confirmed by cDNA(s) | 1029/3051 bp | WP:CE18608 | 342 aa
    F18H3.5b 1, 2, 3 | partially confirmed by cDNA(s) | 1221/1704 bp | WP:CE28918 | 406 aa

  [Figure panels: Structure pattern for one pair of sibling tables; General structure pattern for all sibling tables]

SAMPLE ONTOLOGY OBJECT RECOGNITION
  Key Concepts: sample ontology object, expected values
  Steps:
    Map the values with the sample ontology object set
    Map the labels with the ontology concepts
    Understand all pages from the same web site

  [Figure panels: A sample ontology object (partial information); Two sample pages (partial information). Recoverable labels: Gene, Location, Start, End, Species, Protein Name, Length in Amino Acid, Contact Information, Map to. Recoverable values: “3,?095”; “37,?612,?680”; “37,?610,?585”]

Source Location by Semantic Indexing

META-DATA ANNOTATION
  [Figure: concepts annotated with the source ProtoNet: Protein Name, Source Organism, Accession Number, Length in Amino Acid, Molecular Weight in Da]

DATA ANNOTATION
  [Figure: semantic annotation of Source Organism, Accession Number, Protein Name, and Length in Amino Acid for the Semantic Web]

Ontology Evolution

  Likely to have “imperfect” ontologies
  Can enrich semi-automatically
  Two possibilities:
    Value enrichment
    Object-set and relationship-set enrichment

VALUE ENRICHMENT
  [Figure: Update values]

OBJECT-SET ENRICHMENT
  [Figure: Possible new object sets that could be added to the ontology, e.g. Molecular Weight in Da]

RELATIONSHIP-SET ENRICHMENT
  [Figure: Old ontology and Updated ontology, with Source and Target]

Query Implementation

Conclusions

Status:
  Finished: sibling table comparison technique
  Working on:
    sample ontology object recognition
    ontology generation in the biological domain

Delimitations:
  Ontology: will not cover everything in the domain
  Source page understanding: structured/semi-structured
  Value enrichment: only value lexicons
  Object set and relationship set enrichment: only ISA and Part-Of hierarchies and simple attribute additions

Cui Tao, [email protected]
http://www.deg.byu.edu/
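
The main idea of the sibling table comparison (fixed fields ~ labels, variable fields ~ values) can be illustrated with a minimal Python sketch. This is not the poster's DOM-tree algorithm. It assumes each table has already been flattened into a list of cell strings, and it swaps in the standard library's difflib.SequenceMatcher for the node matching. The function name classify_cells and the reduction of each sample table to a header row plus one data row are assumptions made only for illustration.

```python
from difflib import SequenceMatcher


def classify_cells(table_a, table_b):
    """Align two flattened sibling tables (lists of cell strings).

    Cells that match in both tables are treated as labels (fixed fields);
    runs that differ are collected as candidate values (variable fields).
    Hypothetical helper, not the poster's implementation.
    """
    matcher = SequenceMatcher(a=table_a, b=table_b, autojunk=False)
    labels, values = [], []
    for tag, a0, a1, b0, b1 in matcher.get_opcodes():
        if tag == "equal":
            # Same text in both tables: likely a label.
            labels.extend(table_a[a0:a1])
        else:
            # Text differs (or exists in only one table): candidate values.
            values.append((table_a[a0:a1], table_b[b0:b1]))
    return labels, values


# Cell text taken from the two sample tables on the poster,
# simplified to one data row per table.
table_a = [
    "Gene Model", "Status", "Nucleotides (coding/transcript)",
    "Protein", "Swissprot", "Amino Acids",
    "F47G6.1 1, 2", "confirmed by cDNA(s)", "1773/7391 bp",
    "WP:CE26812", "DTN1_CAEEL", "590 aa",
]
table_b = [
    "Gene Model", "Status", "Nucleotides (coding/transcript)",
    "Protein", "Amino Acids",
    "F18H3.5b 1, 2, 3", "partially confirmed by cDNA(s)", "1221/1704 bp",
    "WP:CE28918", "406 aa",
]

labels, values = classify_cells(table_a, table_b)
print("labels:", labels)
for only_a, only_b in values:
    print("differs:", only_a, "vs", only_b)
```

In this simplified run the five shared header cells come out as labels and the data cells come out as differing runs. The Swissprot header, which exists in only one of the two tables, also shows up as a difference rather than as a label. A single pair of tables is therefore not enough to separate optional labels from values, which may be one reason the poster generalizes the structure pattern over all sibling tables.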