Quality views: capturing and exploiting
the user perspective on data quality
Paolo Missier, Suzanne Embury, Mark Greenwood
School of Computer Science
University of Manchester, UK
Alun Preece, Binling Jin
Department of Computing Science
University of Aberdeen, UK
http://www.qurator.org
Integration of public data (in biology)
Entrez
UniProt
EnsEMBL
GenBank
dbSNP
• Large volumes of data in many public repositories
• Increasingly creative uses for this data
• Their quality is largely unknown
Combining the strengths of UMIST and
The Victoria University of Manchester
Quality of e-science data
A data consumer’s view on quality:
Criteria for data acceptability within a specific data
processing context
Defining quality can be challenging:
• In-silico experiments express cutting-edge research
– Experimental data liable to change rapidly
– Definitions of quality are themselves experimental
• Scientists’ quality requirements often just a hunch
– Quality tests missing or based on experimental heuristics
– Often implicit and embedded in the experiment → not reusable
Example: protein identification
“Wet lab” experiment
Support evidence:
provenance metadata
Data output
Protein identification algorithm
Reference
databases
Protein Hitlist
Quality filtering
Remove likely false positives
→ Improve prediction accuracy
Protein function prediction
Goal:
to explicitly define and automatically add the additional
filtering step in a principled way
Our goals
Offer e-scientists a principled way to:
• Discover quality definitions for specific data domains
• Make them explicit using a formal model
• Implement them in their data processing environment
• Test them on their data
… in an incremental refinement cycle
Benefits:
• Automated processing
• Reusability
• “plug-in” quality components
Approach
Research hypothesis:
adding quality to data can be made cost-effective
– By separating out generic quality processing from domain-specific definitions
Define abstract quality views on the data
Map quality view to an executable process
Execute quality views
Qurator architectural framework:
- runtime environment
- data-specific quality services
Abstract quality view model
[Figure: layers of the abstract quality view model]
Data
Data annotation → quality metadata: evidence e1, e2, e3, … (e.g. Coverage, PeptidesCount)
Assertions: Classification1 over class space 1 (C11, C12, …), Classification2 over class space 2 (C21, C22, …)
Conditions: regions specification
Actions on regions
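The layers of the model above can be read as a small data structure. The following is a minimal sketch, not Qurator's actual model: the class names `AnnotatedItem` and `QualityView`, the field names, and the threshold of 12 on Coverage are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class AnnotatedItem:
    """A data item together with its quality metadata (hypothetical shape)."""
    item_id: str
    evidence: Dict[str, float] = field(default_factory=dict)  # e.g. Coverage
    classes: Dict[str, str] = field(default_factory=dict)     # tag -> class label


@dataclass
class QualityView:
    """Evidence to collect, assertions to compute, one region and its action."""
    evidence_names: List[str]
    assertions: Dict[str, Callable[[Dict[str, float]], str]]
    condition: Callable[[AnnotatedItem], bool]  # region specification
    action: str = "filter"                      # action applied to the region


# Hypothetical view: one classification over Coverage with a made-up threshold.
view = QualityView(
    evidence_names=["Coverage", "PeptidesCount"],
    assertions={
        "Classification1": lambda ev: "C11" if ev["Coverage"] > 12 else "C12"
    },
    condition=lambda item: item.classes["Classification1"] == "C11",
)

# Annotate one item, compute its assertions, then test region membership.
item = AnnotatedItem("d1", evidence={"Coverage": 20.0, "PeptidesCount": 4.0})
for tag, classify in view.assertions.items():
    item.classes[tag] = classify(item.evidence)
in_region = view.condition(item)
```

Keeping the condition as a predicate over the annotated item mirrors the idea that regions are specified over class spaces and evidence values rather than over the raw data.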
Semantic model for quality concepts
Quality “upper ontology” (OWL)
Quality evidence types
Evidence annotations are class instances
Evidence meta-data model (RDF)
Quality hypotheses discovery and testing
[Figure: refinement cycle for a quality hypothesis]
Abstract quality view
Targeted compilation
Target-specific quality component
Deployment
Quality-enhanced user environment
Execution on test data
Performance assessment
Multiple target environments:
• Workflow
• Query processor
Generic quality process pattern
Collect evidence
- Fetch persistent annotations (from persistent evidence)
- Compute on-the-fly annotations
Compute assertions (classifiers)
Evaluate conditions
Execute actions
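The four steps of the pattern can be sketched as one generic function into which the domain-specific parts are passed as parameters. This is an illustrative sketch only: the function name `run_quality_process`, its parameters, and the sample evidence values are assumptions, not part of the Qurator framework.

```python
def run_quality_process(items, stored, on_the_fly, classifiers, condition, action):
    """Apply the generic pattern: collect, assert, evaluate, act."""
    results = []
    for item in items:
        # 1. Collect evidence: persistent annotations first, then computed ones.
        evidence = dict(stored.get(item, {}))
        for name, compute in on_the_fly.items():
            evidence[name] = compute(item)
        # 2. Compute assertions: each classifier maps evidence to a class label.
        tags = {tag: classify(evidence) for tag, classify in classifiers.items()}
        # 3. Evaluate conditions over evidence values and class labels.
        if condition(evidence, tags):
            # 4. Execute the action on items that satisfy the condition.
            results.append(action(item, evidence, tags))
    return results


# Hypothetical inputs: two hits with made-up Coverage values.
stored = {"hitA": {"Coverage": 30}, "hitB": {"Coverage": 5}}
kept = run_quality_process(
    items=["hitA", "hitB"],
    stored=stored,
    on_the_fly={"NameLength": len},
    classifiers={"ScoreClass": lambda ev: "high" if ev["Coverage"] > 12 else "low"},
    condition=lambda ev, tags: tags["ScoreClass"] == "high",
    action=lambda item, ev, tags: item,
)
```

Only the arguments change from one data domain to another; the loop itself stays the same, which is the separation of generic processing from domain-specific definitions that the approach relies on.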
<variables>
  <var variableName="Coverage"
       evidence="q:Coverage"/>
  <var variableName="PeptidesCount"
       evidence="q:PeptidesCount"/>
</variables>
<QualityAssertion
  serviceName="PIScoreClassifier"
  serviceType="q:PIScoreClassifier"
  tagSemType="q:PIScoreClassification"
  tagName="ScoreClass"/>
<action> <filter> <condition>
  ScoreClass in {"q:high", "q:mid"} and
  Coverage > 12
</condition> </filter> </action>
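To make the fragment above concrete, the sketch below reads the variable and assertion declarations with Python's standard `xml.etree.ElementTree` and applies a hand-translated version of the filter condition. The enclosing `<QualityView>` root element and the `accept` helper are assumptions added so the example parses and runs; the slides do not show the real document root or how Qurator evaluates conditions.

```python
import xml.etree.ElementTree as ET

# The <QualityView> wrapper is hypothetical; it only provides a single root.
QV_XML = """
<QualityView>
  <variables>
    <var variableName="Coverage" evidence="q:Coverage"/>
    <var variableName="PeptidesCount" evidence="q:PeptidesCount"/>
  </variables>
  <QualityAssertion serviceName="PIScoreClassifier"
                    serviceType="q:PIScoreClassifier"
                    tagSemType="q:PIScoreClassification"
                    tagName="ScoreClass"/>
</QualityView>
"""

root = ET.fromstring(QV_XML.strip())

# Variable name -> evidence type it is bound to.
variables = {v.get("variableName"): v.get("evidence") for v in root.iter("var")}

# Name of the tag that the assertion service attaches to each data item.
tag_name = root.find("QualityAssertion").get("tagName")


def accept(bindings):
    """Hand translation of: ScoreClass in {q:high, q:mid} and Coverage > 12."""
    return bindings[tag_name] in {"q:high", "q:mid"} and bindings["Coverage"] > 12


# Made-up hits, used only to exercise the condition.
hits = [
    {"id": "h1", "ScoreClass": "q:high", "Coverage": 25},
    {"id": "h2", "ScoreClass": "q:mid", "Coverage": 9},
    {"id": "h3", "ScoreClass": "q:low", "Coverage": 40},
]
kept_ids = [h["id"] for h in hits if accept(h)]
```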
Bindings: assertion → service
(service registry)
service class → Web service endpoint
PIScoreClassifier → http://localhost/axis/services/PIScoreClassifierSvc
All services implement the same WSDL interface
• Makes concrete assertion functions homogeneous
• Facilitates compilation
• Uniform input / output messages
Common WSDL interface (services shown: PIScoreClassifierSvc, PI_Top_k_svc)
Input: D = {(d_i, evidence(d_i))}
Output: {class(d_i)} or {score(d_i)}
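The value of a common interface is that the caller can invoke any assertion service in the same way once the registry has resolved it. The sketch below imitates this in-process with a Python `Protocol` in place of WSDL. Everything in it is an assumption for illustration: the `invoke` method name, the two example classes, their thresholds, and the registry key `q:PeptideScorer`. In particular, the slides do not say how the real PIScoreClassifier computes its classes.

```python
from typing import Dict, Protocol

AnnotatedData = Dict[str, Dict[str, float]]  # item id -> evidence values


class AssertionService(Protocol):
    """Uniform shape: annotated data in, one class or score per item out."""
    def invoke(self, data: AnnotatedData) -> Dict[str, object]: ...


class CoverageClassifier:
    """Hypothetical classifier: labels items by Coverage, made-up thresholds."""
    def __init__(self, mid: float, high: float) -> None:
        self.mid, self.high = mid, high

    def _label(self, coverage: float) -> str:
        if coverage >= self.high:
            return "q:high"
        if coverage >= self.mid:
            return "q:mid"
        return "q:low"

    def invoke(self, data: AnnotatedData) -> Dict[str, object]:
        return {item: self._label(ev["Coverage"]) for item, ev in data.items()}


class PeptideScorer:
    """Hypothetical scorer: returns a numeric score instead of a class."""
    def invoke(self, data: AnnotatedData) -> Dict[str, object]:
        return {item: ev["PeptidesCount"] for item, ev in data.items()}


# Registry: service type named in the quality view -> implementation.
REGISTRY: Dict[str, AssertionService] = {
    "q:PIScoreClassifier": CoverageClassifier(mid=10, high=25),
    "q:PeptideScorer": PeptideScorer(),
}

data = {
    "d1": {"Coverage": 30, "PeptidesCount": 5},
    "d2": {"Coverage": 8, "PeptidesCount": 1},
}
classes = REGISTRY["q:PIScoreClassifier"].invoke(data)
scores = REGISTRY["q:PeptideScorer"].invoke(data)
```

Because both services accept the same input and return a mapping keyed by item, a compiler can generate the calling code without knowing which concrete service is bound.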
Execution model for Quality views
Binding → compilation → executable component
– Sub-flow of an existing workflow
– Query processing interceptor
[Figure labels: Abstract quality view, QV compiler, Qurator quality framework, Services registry, Services implementation, Host workflow, Embedded quality workflow]
Host workflow: D → D’
Quality view on D’
Example: original proteomics workflow
Taverna (*):
workflow language and enactment engine for e-science applications
Quality flow embedding point
(*) part of the myGrid project, University of Manchester - taverna.sourceforge.net
Example: embedded quality workflow
Interactive conditions / actions
Quality views for queries
[Figure labels: Query client, Query processor, Quality View manager, Data]
Query Q → result R
Quality view steps on R: annotate (evidence), assert, act → R1
Actions: filtering, dump to DB / file
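One way to picture a quality view applied to queries is as an interceptor wrapped around the query processor: the result R is annotated, asserted on and filtered before the client sees R1. The sketch below is a guess at that shape, not Qurator's implementation. The function `quality_aware_query`, its parameters, and the choice to dump the rejected rows (rather than the accepted ones) are all assumptions.

```python
def quality_aware_query(run_query, query, annotate, assert_quality, condition,
                        dump=None):
    """Run a query, then apply annotate / assert / act to its result set."""
    rows = run_query(query)                # R: the unfiltered result
    kept, rejected = [], []
    for row in rows:
        evidence = annotate(row)           # annotate: attach evidence
        tags = assert_quality(evidence)    # assert: compute class labels
        target = kept if condition(evidence, tags) else rejected
        target.append(row)                 # act: filter
    if dump is not None:
        dump(rejected)                     # act: dump (assumed: rejected rows)
    return kept                            # R1: what the client receives


# Hypothetical query processor and made-up rows.
def fake_processor(query):
    return [{"id": "a", "Coverage": 20}, {"id": "b", "Coverage": 5}]


dumped = []
r1 = quality_aware_query(
    run_query=fake_processor,
    query="Q",
    annotate=lambda row: {"Coverage": row["Coverage"]},
    assert_quality=lambda ev: {
        "ScoreClass": "q:high" if ev["Coverage"] > 12 else "q:low"
    },
    condition=lambda ev, tags: tags["ScoreClass"] == "q:high",
    dump=dumped.extend,
)
```

The client keeps issuing the same query Q; only the path that the result takes back to it changes.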
Qurator architecture
Summary
• For complex data types, there is often no single “correct” and
agreed-upon definition of data quality
• Qurator provides an environment for fast prototyping of
quality hypotheses
– Based on the notion of “evidence” supporting a quality hypothesis
– With support for an incremental learning cycle
• Quality views offer an abstract model for making data
processing environments quality-aware
– To be compiled into executable components and embedded
– Qurator provides an invocation framework for Quality Views
More info and papers: http://www.qurator.org
Live demos (informal) available
