
Toward Making Online Biological Data Machine Understandable
Cui Tao
Data Extraction Research Group
Department of Computer Science, Brigham Young University, Provo, UT 84602
Introduction
Source Location by Semantic Indexing
Steps:
Convert each HTML table to a DOM tree
Find sibling tree pairs
Compare and find matched nodes
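The steps above can be sketched with Python's standard `html.parser`. This is only an illustration, not the group's implementation: the `Node` and `TableTreeBuilder` classes, the `shape` test for sibling trees, and the two tiny pages are assumptions made for the example (the cell texts are borrowed from the poster's sample tables).

```python
from html.parser import HTMLParser


class Node:
    """One node of a simplified DOM tree (table, tr, td, or th)."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = ""


class TableTreeBuilder(HTMLParser):
    """Builds a tree that keeps only table-structure tags and cell text."""

    KEEP = {"table", "tr", "td", "th"}

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        if tag in self.KEEP:
            node = Node(tag, self.current)
            self.current.children.append(node)
            self.current = node

    def handle_endtag(self, tag):
        if tag in self.KEEP and self.current.tag == tag:
            self.current = self.current.parent

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.current.text = (self.current.text + " " + text).strip()


def to_tree(html):
    """Step 1: convert an HTML table to a DOM tree (first table only)."""
    builder = TableTreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root.children[0]


def shape(node):
    """Tag structure without text; equal shapes are treated as sibling trees."""
    return (node.tag, tuple(shape(child) for child in node.children))


def compare(a, b, path=()):
    """Step 3: yield (path, text_a, text_b) for each pair of aligned leaf cells."""
    if not a.children and not b.children:
        yield path, a.text, b.text
    for i, (child_a, child_b) in enumerate(zip(a.children, b.children)):
        yield from compare(child_a, child_b, path + (i,))


# Two hypothetical sibling pages, reduced to two columns.
PAGE_A = ("<table><tr><td>Gene Model</td><td>Amino Acids</td></tr>"
          "<tr><td>F47G6.1</td><td>590 aa</td></tr></table>")
PAGE_B = ("<table><tr><td>Gene Model</td><td>Amino Acids</td></tr>"
          "<tr><td>F18H3.5a</td><td>342 aa</td></tr></table>")

tree_a, tree_b = to_tree(PAGE_A), to_tree(PAGE_B)
is_sibling_pair = shape(tree_a) == shape(tree_b)  # step 2
cells = list(compare(tree_a, tree_b))
matched = [ta for _, ta, tb in cells if ta == tb]
differing = [(ta, tb) for _, ta, tb in cells if ta != tb]
```

In this toy case the matched cells are the column labels and the differing cells are the values. Real pages would need a more tolerant sibling test than exact shape equality.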
PROBLEMS:
Large, rapidly growing number of biological databases
 e.g., the Molecular Biology Database Collection
2004: 548 databases in total, 162 more than in 2003
2005: 719 databases in total, 171 more than in 2004
Different access capabilities
Syntactic heterogeneity
Semantic heterogeneity
Updated at any time by independent authorities

[Figure: two sample tables and their DOM trees (table, tr, td nodes)]

Table 1:
Gene Model | Status | Nucleotides (coding/transcript) | Protein | Swissprot | Amino Acids
F47G6.1 1, 2 | confirmed by cDNA(s) | 1773/7391 bp | WP:CE26812 | DTN1_CAEEL | 590 aa

Table 2:
Gene Model | Status | Nucleotides (coding/transcript) | Protein | Amino Acids
F18H3.5a 1, 2 | confirmed by cDNA(s) | 1029/3051 bp | WP:CE18608 | 342 aa
F18H3.5b 1, 2, 3 | partially confirmed by cDNA(s) | 1221/1704 bp | WP:CE28918 | 406 aa

[Figure: META-DATA ANNOTATION. Field labels shown with ProtoNet: Protein Name, Source Organism, Accession Number, Length in Amino Acid, Molecular Weight in Da]

[Figure: RELATIONSHIP-SET ENRICHMENT and OBJECT-SET ENRICHMENT. Labels: Source, Target, Old ontology, Updated ontology]
GOALS:
Help biologists search across multiple resources
Examples:
Cross-linked information (join queries)
“Find genes that are longer than 5 kbp, whose products have at least two helices, and that participate in glycolysis” (GenBank, PDB, KEGG)
Collecting information from similar data sources (union queries)
“Find genes newly annotated after Jan. 2003 in the fly and worm genomes” (FlyBase, WormBase)
Semantic annotation
Semantic Web
[Figure: DATA ANNOTATION. Labels: Source Organism, Accession Number, Protein Name, Length in Amino Acid]
Generate a structure pattern for all sibling tables
SOLUTION:
Source page understanding
Table Interpretation
Aligning with an ontology
Source location through semantic annotation
Metadata vs. instance data annotation
Use of annotation in query processing
Ontology evolution
Adjustments to ISA and Part-Of hierarchies
Addition of attributes
Possible new object sets that could be added to the ontology
Molecular Weight in Da
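One way to picture source location through semantic annotation is an inverted index from ontology concepts to the sources whose labels were annotated with them. The sketch below is an assumption about how such an index could look, not the poster's design. The `ANNOTATIONS` mapping is hypothetical, although the ProtoNet concept names are taken from the poster's figures.

```python
from collections import defaultdict

# Hypothetical metadata annotations: source -> ontology concepts its labels map to.
ANNOTATIONS = {
    "ProtoNet": {"Protein Name", "Source Organism", "Accession Number",
                 "Length in Amino Acid", "Molecular Weight in Da"},
    "WormBase": {"Gene", "Protein Name", "Length in Amino Acid"},
}


def build_index(annotations):
    """Invert the annotations into concept -> set of sources."""
    index = defaultdict(set)
    for source, concepts in annotations.items():
        for concept in concepts:
            index[concept].add(source)
    return index


def locate_sources(index, query_concepts):
    """Return the sources annotated with every concept the query mentions."""
    source_sets = [index.get(concept, set()) for concept in query_concepts]
    return set.intersection(*source_sets) if source_sets else set()


index = build_index(ANNOTATIONS)
both = locate_sources(index, ["Protein Name", "Length in Amino Acid"])
only_protonet = locate_sources(index, ["Protein Name", "Molecular Weight in Da"])
```

This sketch indexes metadata (labels) only. Indexing instance data as well would add value-level entries to the same structure.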
Conclusions
Query
Implementation Status:
Finished: sibling table comparison technique
Ontology Evolution
SAMPLE ONTOLOGY OBJECT RECOGNITION
Working on: sample ontology object recognition and ontology generation in the biological domain
Likely to have “imperfect” ontologies
Delimitations:
Ontology: will not cover everything in the domain
Key Concepts: sample ontology object, expected values
Can enrich semi-automatically
Steps:
Map the values to the sample ontology object set
Map the labels to the ontology concepts
Understand all pages from the same web site
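The first two steps can be sketched as follows, assuming a sample ontology object is stored as expected-value patterns keyed by concept. The three patterns are the ones shown on the poster. Their pairing with the concept names `Length`, `Start`, and `End`, and the page's labels and values, are assumptions made for illustration.

```python
import re

# Sample ontology object: concept -> expected-value pattern.
# The patterns come from the poster; the concept pairing is assumed.
SAMPLE_OBJECT = {
    "Length": r"3,?095",
    "Start": r"37,?612,?680",
    "End": r"37,?610,?585",
}

# Hypothetical page for the same object, as (label, value) pairs.
PAGE = [
    ("Size (bp)", "3095"),
    ("Begin", "37,612,680"),
    ("Stop", "37610585"),
    ("Status", "confirmed"),
]


def map_labels(page, sample_object):
    """Map page labels to ontology concepts through their matching values."""
    mapping = {}
    for label, value in page:
        for concept, pattern in sample_object.items():
            if re.fullmatch(pattern, value.strip()):
                mapping[label] = concept  # the value matched, so the label maps too
                break
    return mapping


label_to_concept = map_labels(PAGE, SAMPLE_OBJECT)
unmapped = [label for label, _ in PAGE if label not in label_to_concept]
```

Once the labels of one page are mapped, the same mapping can be reused for sibling pages from the same web site. Labels left in `unmapped` are the ones the ontology does not yet cover.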
Two possibilities:
Value enrichment
Object-set and relationship-set enrichment
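The two possibilities can be illustrated with a small sketch, assuming the ontology is reduced to a value lexicon per object set. The `enrich` function, the starting lexicon, and the molecular weight value are hypothetical. New object sets are only proposed as candidates, in keeping with semi-automatic enrichment.

```python
def enrich(lexicons, page, label_to_concept):
    """Add values to known object sets; return labels that may be new object sets."""
    candidates = []
    for label, value in page:
        concept = label_to_concept.get(label)
        if concept is not None:
            # Value enrichment: extend the lexicon of an existing object set.
            lexicons.setdefault(concept, set()).add(value)
        else:
            # Object-set enrichment: flag the label for a human to review.
            candidates.append(label)
    return candidates


# Hypothetical starting ontology, reduced to value lexicons.
lexicons = {
    "Protein Name": {"DTN1_CAEEL"},
    "Length in Amino Acid": {"590 aa"},
}

# Hypothetical annotated page; the molecular weight value is made up.
page = [
    ("Protein Name", "DTN1_CAEEL"),
    ("Length in Amino Acid", "342 aa"),
    ("Molecular Weight in Da", "12345"),
]
label_to_concept = {
    "Protein Name": "Protein Name",
    "Length in Amino Acid": "Length in Amino Acid",
}

new_object_set_candidates = enrich(lexicons, page, label_to_concept)
```

Relationship-set enrichment (for example, deciding where a new object set attaches in an ISA or Part-Of hierarchy) is not covered by this sketch.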
Source page understanding: structured/semi-structured
Value enrichment: only value lexicons
Source Page Understanding
Object-set and relationship-set enrichment: only ISA and Part-Of hierarchies and simple attribute additions
VALUE ENRICHMENT
[Figure: page labels mapped to ontology concepts ("Map to"). Labels: Gene, Location, Contact Information, Length in Amino Acid]
SIBLING PAGE COMPARISON
Key Concepts: sibling pages and sibling tables
Main Idea:
Compare two sibling tables:
variable fields correspond to values; fixed fields correspond to labels
Structure pattern for one pair of sibling tables 
General structure pattern for all sibling tables
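A minimal sketch of the main idea, assuming each sibling table has already been flattened into a list of cell texts of equal length. The three one-row tables are hypothetical rearrangements of cell texts from the poster's sample tables. `None` marks a variable field (a value slot), and any surviving string is a fixed field (a label).

```python
def pair_pattern(a, b):
    """Structure pattern for one pair: keep cells that agree, mark the rest None."""
    return [x if x == y else None for x, y in zip(a, b)]


def general_pattern(tables):
    """Fold the pairwise comparison over all sibling tables."""
    pattern = list(tables[0])
    for table in tables[1:]:
        # A slot already marked None never equals a string, so it stays variable.
        pattern = pair_pattern(pattern, table)
    return pattern


# Hypothetical flattened sibling tables.
t1 = ["Gene Model", "Status", "F47G6.1", "confirmed by cDNA(s)"]
t2 = ["Gene Model", "Status", "F18H3.5a", "confirmed by cDNA(s)"]
t3 = ["Gene Model", "Status", "F18H3.5b", "partially confirmed by cDNA(s)"]

one_pair = pair_pattern(t1, t2)
all_tables = general_pattern([t1, t2, t3])
```

A single pair can mistake a coincidentally equal value for a label, as happens with the status cell of `t1` and `t2`. Comparing against more sibling tables corrects this, which is why a general pattern over all sibling tables is useful.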
[Figure: a sample ontology object (partial information). Labels: Species, Start, End. Expected-value patterns: “3,?095”, “37,?612,?680”, “37,?610,?585”. Action: Update values]
[Figure: two sample pages (partial information)]
Cui Tao, [email protected]
http://www.deg.byu.edu/