Modelling and computing the quality of information in e-science
Paolo Missier, Suzanne Embury, Mark Greenwood
School of Computer Science, University of Manchester, UK
Alun Preece, Binling Jin
Department of Computing Science, University of Aberdeen, UK
http://www.qurator.org
Roma, 3/4/07
Combining the strengths of UMIST and The Victoria University of Manchester
Quality of data
Main driver, historically: data cleaning for
• Integration: use of same IDs across data sources
• Warehousing, analytics:
– restore completeness,
– reconcile referential constraints
– cross-validation of numeric data by aggregation
Focus:
• Record de-duplication, reconciliation, “linkage”
– Ample literature: see e.g. the Nov 2006 issue of IEEE TKDE
• Consistency of data across sources
• Managing uncertainty in databases (Trio - Stanford)
Data quality control in the data management practice
Common quality issues
• Completeness: not missing any of the results
• Correctness: each data item should reflect the actual real-world entity that it is intended to model
– The actual address where you live, the correct balance in your bank account…
• Timeliness: delivered in time for use by a consumer process
– E.g. stock information
• …
Taxonomy for data quality dimensions
Our motivation: quality in public e-science data
GenBank, UniProt, EnsEMBL, Entrez, dbSNP
• Large volumes of data in many public repositories
• Increasingly creative uses for this data
Problem: using third party data of unknown quality may result in misleading scientific conclusions
Some quality issues in biology
“Quality” covers a broader spectrum of issues than traditional DQ
• “X% of database A may be wrong (unreliable) – but I have no easy way to test that”
• “This microarray data looks ok but is testing the wrong hypothesis”
• The output from this sequence matching algorithm produces false positives
• …
Each of these issues calls for a separate testing procedure
Difficult to generalize
Correctness in biology - examples
Data type | Creation process | Correctness
UniProt protein annotation | Manual curation | Functional annotation f for p is correct if function f can reliably be attributed to p
Qualitative proteomics: protein identification | Generate peptide peak lists, match peak lists (e.g. Imprint) | No false positives: every protein in the output is actually present in the cell sample
Transcriptomics: gene expression report (up/down-regulation) | Microarray data analysis | No false positives, no false negatives
Defining quality in e-science is challenging
• In-silico experiments express cutting-edge research
– Experimental data liable to change rapidly
– Definitions of quality are themselves experimental
• Scientists’ quality requirements often just a hunch
– Quality tests missing or based on experimental heuristics
– Definitions of quality criteria are personal and subjective
• Quality controls tightly coupled to data processing
– Often implicit and embedded in the experiment
– Not reusable
“Quality”: personal criteria for data acceptability
Research goals
1. Make personal definitions of quality explicit and formal
– Identify a common denominator for quality concepts
– Expressed as a conceptual model for Information Quality
Elicit “nuggets” of latent quality knowledge from the experts
2. Make existing data processing quality-aware
– Define an architectural framework that accommodates personal definitions of quality
– Compute quality levels and expose them to the user
Example: protein identification
[Figure: processing pipeline. “Wet lab” experiment → protein identification algorithm → data output (protein hitlist) → protein function prediction]
Correct entry = true positive
Evidence:
• Mass coverage (MC) measures the amount of protein sequence matched
• Hit ratio (HR) gives an indication of the signal-to-noise ratio in a mass spectrum
• ELDP reflects the completeness of the digestion that precedes the peptide mass fingerprinting
This evidence is independent of the algorithm / SW package
It is readily available and inexpensive to obtain
Correctness of protein identification
Estimator function: (computes a score rather than a probability)
PMF score = (HR x 100) + MC + (ELDP x 10)
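The estimator above is simple enough to write down directly. The following is a minimal sketch in Python; the function name, the evidence values, and the units of HR, MC and ELDP are assumptions made for illustration, since the slides give only the formula.

```python
def pmf_score(hr: float, mc: float, eldp: float) -> float:
    """PMF score = (HR x 100) + MC + (ELDP x 10), as stated on the slide."""
    return (hr * 100) + mc + (eldp * 10)


# Hypothetical evidence values for two protein hits (illustrative, not real data).
hits = {
    "protein_A": {"hr": 0.5, "mc": 30.0, "eldp": 4},
    "protein_B": {"hr": 0.25, "mc": 12.0, "eldp": 1},
}

# The result is a score, not a probability: it is only meaningful for ranking.
scores = {name: pmf_score(**evidence) for name, evidence in hits.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
```

Because the output is a score rather than a probability, the sketch uses it only to order the hits.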
Prediction performance – comparing 3 models:
ROC curve: true positives vs false positives
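As a reminder of how such a curve is obtained, here is a small Python sketch that sweeps a threshold over scored hits and records the false positive and true positive rates. The data and the function name are hypothetical, and the sketch assumes that the input contains at least one correct and one incorrect hit.

```python
def roc_points(scored):
    """Return (false positive rate, true positive rate) pairs, one per threshold.

    `scored` is a list of (score, is_correct) pairs. A hit is predicted
    positive when its score is >= the current threshold.
    """
    positives = sum(1 for _, ok in scored if ok)
    negatives = len(scored) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing accepted
    for threshold in sorted({s for s, _ in scored}, reverse=True):
        tp = sum(1 for s, ok in scored if s >= threshold and ok)
        fp = sum(1 for s, ok in scored if s >= threshold and not ok)
        points.append((fp / negatives, tp / positives))
    return points


# Hypothetical data: (score, known to be a correct identification?)
example = [(120.0, True), (95.0, True), (60.0, False), (47.0, True), (20.0, False)]
points = roc_points(example)
```

Comparing models then amounts to comparing the curves traced by their respective score functions on the same labelled hits.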
Quality process components
[Figure: the protein identification pipeline (“wet lab” experiment, protein identification algorithm, data output as a protein hitlist, protein function prediction), with a quality filtering step added]
Goal: to automatically add the additional filtering step in a principled way
Quality assertion: PMF score = (HR x 100) + MC + (ELDP x 10)
Evidence:
• Mass coverage (MC)
• Hit ratio (HR)
• ELDP
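A filtering step of this kind might look as follows in Python. This is a sketch under stated assumptions: the `Hit` record, the `quality_filter` function and the threshold value are hypothetical, and only the score formula comes from the slides.

```python
from typing import List, NamedTuple


class Hit(NamedTuple):
    """One entry of a protein hitlist together with its quality evidence."""
    protein: str
    hr: float    # hit ratio
    mc: float    # mass coverage
    eldp: float  # ELDP


def pmf_score(hit: Hit) -> float:
    # Quality assertion from the slides: (HR x 100) + MC + (ELDP x 10)
    return (hit.hr * 100) + hit.mc + (hit.eldp * 10)


def quality_filter(hitlist: List[Hit], min_score: float) -> List[Hit]:
    """Keep only the hits whose score reaches the (user-chosen) threshold."""
    return [hit for hit in hitlist if pmf_score(hit) >= min_score]


# Hypothetical hitlist and threshold, for illustration only.
hitlist = [Hit("protein_A", 0.5, 30.0, 4), Hit("protein_B", 0.25, 12.0, 1)]
accepted = quality_filter(hitlist, min_score=50.0)
```

The filter sits between the hitlist and any downstream step, so the identification algorithm itself does not need to change.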
Quality Assertions
QA(D): any function of evidence (metadata for D) that computes equivalence classes on D
1. Score model (total or partial order)
2. Classification model
[Figure: quality-equivalent regions; labels in the figure: D, B, A, C]
Actions associated with regions: e.g. accept/reject, but possibly more
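The two kinds of quality assertion can be contrasted in a few lines of Python. This is an illustrative sketch only: the evidence values, the class boundary, the class names and the action names are all hypothetical.

```python
from collections import defaultdict


def score_model(dataset, score):
    """Score model: order the items of D by their score, highest first."""
    return sorted(dataset, key=score, reverse=True)


def classification_model(dataset, classify):
    """Classification model: group the items of D into equivalence classes."""
    regions = defaultdict(list)
    for item in dataset:
        regions[classify(item)].append(item)
    return dict(regions)


# Hypothetical evidence: item name -> a single numeric evidence value.
evidence = {"d1": 0.9, "d2": 0.4, "d3": 0.7, "d4": 0.1}


def label(item):
    # Hypothetical class boundary at 0.6.
    return "high" if evidence[item] >= 0.6 else "low"


ranked = score_model(evidence, evidence.get)
regions = classification_model(evidence, label)

# Actions are attached to regions, not to individual items.
actions = {"high": "accept", "low": "reject"}
decisions = {
    item: actions[region] for region, items in regions.items() for item in items
}
```

In both cases the assertion is a function of the evidence alone, which is what makes items in the same region interchangeable from a quality point of view.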
Layered definition of Quality
[Figure: layered architecture, with these elements:
• Data sources (DB)
• Annotation functions, producing quality evidence annotations (Env)
• Quality Assertion functions (QA), encoding custom quality knowledge
• Quality Views (QV): definition of acceptability regions
Side labels: long-lived, reusable; commodities, expert-defined; dynamic, user-controlled]
Abstract Quality Views
An operational definition for personal quality:
1. Formulate a quality assertion on the dataset
– i.e. a ranking of proteins by PMF score
– “quality knowledge, possibly subjective”
2. Identify the underlying evidence necessary to compute the assertion
– the variables used to compute the score (HR, MC, ELDP)
– objective, inexpensive
3. Define annotation functions that compute evidence values
– functions that compute HR, MC, ELDP
4. Define quality regions on the ranked dataset
– in this case, intervals of acceptability
5. Associate actions with each region
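The five steps can be put together in one small Python sketch. The `QualityView` class, the region bounds, the region names and the actions are hypothetical and chosen for illustration; this is not the Qurator implementation, only one way the steps could fit together.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Evidence = Dict[str, float]  # step 2: the evidence variables (HR, MC, ELDP)


@dataclass
class QualityView:
    annotate: Callable[[str], Evidence]     # step 3: annotation function
    assertion: Callable[[Evidence], float]  # step 1: quality assertion (a score)
    regions: List[Tuple[float, str]]        # step 4: (lower bound, name), descending
    actions: Dict[str, str]                 # step 5: region name -> action

    def apply(self, items):
        result = {}
        for item in items:
            score = self.assertion(self.annotate(item))
            # First region whose lower bound the score reaches.
            region = next(name for bound, name in self.regions if score >= bound)
            result[item] = self.actions[region]
        return result


# Hypothetical precomputed evidence standing in for real annotation functions.
EVIDENCE = {
    "protein_A": {"HR": 0.5, "MC": 30.0, "ELDP": 4},
    "protein_B": {"HR": 0.25, "MC": 12.0, "ELDP": 1},
    "protein_C": {"HR": 0.5, "MC": 20.0, "ELDP": 2},
}

view = QualityView(
    annotate=EVIDENCE.__getitem__,
    assertion=lambda e: e["HR"] * 100 + e["MC"] + e["ELDP"] * 10,
    # The last bound is -inf so that every score falls into some region.
    regions=[(100.0, "high"), (60.0, "medium"), (float("-inf"), "low")],
    actions={"high": "accept", "medium": "flag for review", "low": "reject"},
)
decisions = view.apply(EVIDENCE)
```

Keeping the regions and actions as plain data separates the user-controlled part of the view from the assertion and annotation functions, which can be reused across views.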
Computable quality views as commodities
Cost-effective quality-awareness for data processing:
• Reuse of high-level definitions of quality views
• Compilation of abstract quality views into quality components