Managing Information Quality in e-Science using Semantic Web technology Alun Preece, Binling Jin, Edoardo Pignotti Department of Computing Science, University of Aberdeen Paolo Missier, Suzanne Embury, Mark Greenwood School of Computer Science, University of Manchester David Stead, Al Brown Molecular and Cell Biology, University of Aberdeen www.qurator.org Describing the Quality of Curated e-Science Information Resources
Managing Information Quality in e-Science using Semantic Web technology
Alun Preece, Binling Jin, Edoardo Pignotti
Department of Computing Science, University of Aberdeen
Paolo Missier, Suzanne Embury, Mark Greenwood
School of Computer Science, University of Manchester
David Stead, Al Brown
Molecular and Cell Biology, University of Aberdeen
www.qurator.org
Describing the Quality of Curated e-Science Information Resources
Combining the strengths of UMIST and The Victoria University of Manchester
Information and quality in e-science
• Scientists required to place their data in the public domain
• Scientists use other scientists' experimental results as part of their own work
[Diagram labels: lab experiment; in silico experiments (e.g. workflow-based); e-science experiment; public BioDBs]
"How can I decide whether I can trust this data?"
• Variations in the quality of the data
• No control over the quality of public data
• Difficult to measure and assess quality: no standards
A concrete scenario
Qualitative proteomics: identification of proteins in a cell sample
[Diagram labels: Wet lab (Step 1, Step n); candidate data for matching (peptides peak lists); Information service ("Dry lab"); match algorithm; Reference DBs: MSDB, NCBI, SwissProt/Uniprot; hit list: {ID, Hit Ratio, Mass Coverage, …}]
• False negatives: incompleteness of reference DBs, pessimistic matching
• False positives: optimistic matching
Quality is personal
Scientists tend to express their quality requirements for data by giving acceptability criteria
These are personal and vary with the expected use of the data
“What is the right trade-off between false positives and false negatives?”
Requirements for IQ ontology
1. Establish a common vocabulary
– Let scientists express quality concepts and criteria in a controlled way
– Within homogeneous scientific communities
– Enable navigation and discovery of existing IQ concepts
2. Sharing and reuse: let users contribute to the ontology while ensuring consistency
– Achieve cost reduction
3. Making IQ computable in practice
– Automatically apply acceptability criteria to the data
Quality Indicators
Quality Indicators: measurable quantities that can be used to define acceptability criteria:
• “Hit Ratio”, “Mass Coverage”, “ELDP”
– provided by the matching algorithm
[Diagram labels: match algorithm; Information service ("Dry lab"); hit list: {proteinID, Hit Ratio, Mass Coverage, …}]
Experimentally established correlation between these indicators and the probability of mismatch
Data acceptability criteria
• Indicators used as indirect “clues” to assess quality
• Quality Assertions (QA) formally capture these clues as functions of indicators
• Data classification or ranking functions:
e.g. PIClassifier, defined as
f(proteinID, Hit Ratio, Mass Coverage, ELDP) → { (proteinID, rank) }
– This provides a custom ranking of the match results
• Formalized acceptability criteria are conditions on QAs
accept(proteinID) if PIClassifier(proteinID, …) > X OR …
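The idea on this slide can be sketched in a few lines of Python. This is an illustration only: the slide does not give the actual PIClassifier formula, so the weights, the indicator values, the threshold, and the names `Hit`, `pi_classifier`, `rank_hits` and `accept` are all assumptions. For simplicity the sketch applies the acceptability condition to the classifier's score rather than to the rank.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One hit-list entry with its quality indicators (values are made up)."""
    protein_id: str
    hit_ratio: float
    mass_coverage: float
    eldp: float


def pi_classifier(hit: Hit) -> float:
    """Hypothetical quality assertion: a function of the indicators.

    The weights are placeholders, not the ones used by the authors.
    """
    return 0.5 * hit.hit_ratio + 0.3 * hit.mass_coverage + 0.2 * hit.eldp


def rank_hits(hits):
    """Return (protein_id, rank) pairs, with rank 1 for the highest score."""
    ordered = sorted(hits, key=pi_classifier, reverse=True)
    return [(h.protein_id, i) for i, h in enumerate(ordered, start=1)]


def accept(hit: Hit, threshold: float) -> bool:
    """Acceptability criterion: a condition on the quality assertion."""
    return pi_classifier(hit) > threshold


hits = [
    Hit("P1", hit_ratio=0.9, mass_coverage=0.8, eldp=0.5),
    Hit("P2", hit_ratio=0.2, mass_coverage=0.1, eldp=0.0),
    Hit("P3", hit_ratio=0.6, mass_coverage=0.7, eldp=1.0),
]
ranking = rank_hits(hits)
accepted = [h.protein_id for h in hits if accept(h, threshold=0.5)]
```

Raising or lowering `threshold` is one way a scientist could express a personal trade-off between false positives and false negatives.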
IQ ontology backbone
Class restriction:
MassCoverage is-evidence-for . ImprintHitEntry
Class restriction:
PIScoreClassifier assertion-based-on-evidence . HitScore
PIScoreClassifier assertion-based-on-evidence . MassCoverage
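As a rough illustration of how such restrictions could be used in practice, the sketch below records them as plain (class, property, filler) tuples and checks which evidence a quality assertion still lacks. It is not OWL and does no reasoning; the slide does not show the quantifier of each restriction, so none is modelled, and the helper names `fillers` and `missing_evidence` are hypothetical.

```python
# Each restriction from the slide, recorded as (class, property, filler).
# The quantifier (existential or universal) is not given, so it is omitted.
RESTRICTIONS = [
    ("MassCoverage", "is-evidence-for", "ImprintHitEntry"),
    ("PIScoreClassifier", "assertion-based-on-evidence", "HitScore"),
    ("PIScoreClassifier", "assertion-based-on-evidence", "MassCoverage"),
]


def fillers(cls, prop):
    """List the filler classes of every restriction on `cls` via `prop`."""
    return [f for c, p, f in RESTRICTIONS if c == cls and p == prop]


def missing_evidence(assertion, available):
    """Return the evidence classes the assertion needs but does not have."""
    needed = fillers(assertion, "assertion-based-on-evidence")
    return [e for e in needed if e not in available]


needed = fillers("PIScoreClassifier", "assertion-based-on-evidence")
missing = missing_evidence("PIScoreClassifier", {"MassCoverage"})
```

A check of this kind could tell a user, before a quality assertion is applied, whether the data carries all the indicators that assertion depends on.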