Query-driven search methods for large microarray databases Matt Hibbs Troyanskaya Laboratory for BioInformatics and Functional Genomics

Jan 02, 2016

Page 1: Query-driven search methods for large microarray databases

Query-driven search methods for large microarray databases

Matt Hibbs

Troyanskaya Laboratory for BioInformatics and Functional Genomics

Page 2

Broad Goals/Challenges

• Characterize the function of proteins
• Learn the mechanisms of gene expression and regulation under many conditions
  – Growing amounts of data facilitate this goal
• Noise, heterogeneity, and biases in available data must be addressed

Page 3

Specific Goals

• Large collection of S. cerevisiae microarray data
  – From > 80 publications
  – Totaling ~2400 conditions
  – Divided into ~130 “datasets”
• How can such a large amount of data be leveraged?
  – What can we learn? Or not learn?
  – Accessibility, usefulness to community

Page 4

Outline

• Microarray methodology
• Analysis concerns
• Functional Biases
• Improved Approaches
• Preliminary Conclusions

Page 5

Outline

• Microarray methodology
• Analysis concerns
• Functional Biases
• Improved Approaches
• Preliminary Conclusions

Page 6

Central Dogma

• Transcription factors recruit or repress polymerase
• Transcription
  – DNA → mRNA
• Translation
  – mRNA → Proteins
• Proteins do work

[Diagram labels: DNA, mRNA, Proteins, Ribosome, TF, Polymerase]

Page 7

Molecular Measurements

• Measurements of protein abundance in a variety of conditions can suggest function
  – Difficult to measure accurately in a large-scale manner
• One step removed: measure abundance of mRNA transcripts as a proxy
  – Much easier to measure on a large scale
  – Several competing technologies reaching maturity

Page 8

Basic Microarray Methodology

Step 1: Prepare cDNA spots
Step 2: Add mRNA to slide for hybridization
Step 3: Scan hybridized array

[Diagram: reference mRNA (add green dye) and test mRNA (add red dye), then hybridize]

Page 9

Microarray Outputs

Measure amounts of green and red dye on each spot

Represent level of expression as a log ratio between these amounts

Raw image from Spellman et al., 1998
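The log-ratio representation can be sketched in a few lines. This is only an illustration: the slide does not say which logarithm base or which channel goes in the numerator, so base 2 and red (test) over green (reference) are assumptions here, and `log_ratio` and the sample intensities are hypothetical.

```python
import math

def log_ratio(red, green):
    """Return log2(red / green), or None when either intensity is not positive.

    Assumptions (not stated on the slide): base-2 logarithm, test (red)
    over reference (green), and None as the marker for an unusable spot.
    """
    if red <= 0 or green <= 0:
        return None  # treated as a missing value downstream
    return math.log2(red / green)

# Hypothetical spot intensities as (red, green) pairs
spots = [(800.0, 200.0), (150.0, 600.0), (0.0, 300.0)]
ratios = [log_ratio(r, g) for r, g in spots]
```

A positive value means the test sample is more abundant than the reference, a negative value the opposite, and a spot with no usable signal becomes a missing value.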

Page 10

Microarray Outputs

[Figure: data matrix of genes by experiments]

• Log ratios in data matrix
• Missing values present
• Potentially high levels of noise

Page 11

Additional Technology

• Two-color (homemade, Agilent)
  – Process just described, with 2 labeled samples undergoing competitive hybridization
• Single-color (Affymetrix)
  – Highly calibrated hybridization spots
  – Match and Mis-match spots for each oligo
• Other techniques/tricks
  – Randomized layouts, barcode arrays, tiling arrays, etc.

Page 12

Outline

• Microarray methodology
• Analysis concerns
• Functional Biases
• Improved Approaches
• Preliminary Conclusions

Page 13

Noise Sources

• Transcriptional noise
  – mRNA transcripts not a direct reflection of protein levels
  – Process of isolating mRNA can stress cells
    • Especially true of older protocols/data
• Chemical noise
  – Fluorescent labels sensitive to environment
• Operator noise
  – High variation between scientists running the same experiment

Page 14

Missing Values

• Several choices:
  – Ignore missing values
  – Remove genes with missing values
  – Impute missing values
• KNN-Impute
  – Replace missing values with a weighted average of the K-nearest neighbors
  – Used for analysis presented later
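A minimal sketch of the KNN-impute idea follows. The slide only says that a missing value is replaced by a weighted average of the K nearest neighbors, so the details are assumptions: neighbors are other genes (rows), distance is the root-mean-square difference over columns observed in both rows, and weights are inverse distances. The names `row_distance`, `knn_impute`, and the toy matrix are hypothetical.

```python
import math

def row_distance(a, b):
    """Root-mean-square difference over columns observed in both rows."""
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not shared:
        return math.inf
    return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

def knn_impute(matrix, k=2):
    """Return a copy of matrix with None entries filled from the k nearest rows."""
    filled = [row[:] for row in matrix]
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate neighbors: other rows that have a value in column j
            neighbors = []
            for m, other in enumerate(matrix):
                if m == i or other[j] is None:
                    continue
                d = row_distance(row, other)
                if d != math.inf:
                    neighbors.append((d, other[j]))
            neighbors.sort()
            nearest = neighbors[:k]
            if not nearest:
                continue  # leave the gap when no usable neighbor exists
            # Inverse-distance weights (assumption); epsilon avoids division by zero
            weights = [1.0 / (d + 1e-9) for d, _ in nearest]
            filled[i][j] = sum(w * v for w, (_, v) in zip(weights, nearest)) / sum(weights)
    return filled

# Toy matrix: rows are genes, columns are conditions, None marks a missing value
expression = [
    [1.0, 2.0, None],
    [1.0, 2.0, 3.0],
    [5.0, 6.0, 7.0],
]
imputed = knn_impute(expression, k=1)
```

With `k=1` the gap in the first gene is filled from the second gene, whose observed values match it exactly. The input matrix is left unchanged.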

Page 15

Normalization

• “Bright” arrays
  – Whole arrays often normalized by average intensity
• Two-color
  – Choice of reference population can affect measurements
  – Avoid divide by zero errors
• Affymetrix
  – Convert hybridization values to log ratios
    • Divide by average value
    • Log transform
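The two Affymetrix steps (divide by average value, then log transform) might look like the sketch below. The slide does not say which average is meant; this example assumes each gene is divided by its own mean across conditions, which acts as a pseudo-reference channel. The base-2 logarithm, the handling of non-positive values, and the name `to_log_ratios` are also assumptions.

```python
import math

def to_log_ratios(values):
    """Divide each positive intensity by the mean of the positive ones, then take log2.

    Non-positive intensities cannot be log transformed and are returned as None.
    """
    positive = [v for v in values if v > 0]
    if not positive:
        return [None] * len(values)
    mean = sum(positive) / len(positive)
    return [math.log2(v / mean) if v > 0 else None for v in values]

# Hypothetical single-color intensities for one gene across three conditions
gene_intensities = [2.0, 4.0, 6.0]
log_ratios = to_log_ratios(gene_intensities)
```

After this step single-color values are on the same log-ratio scale as two-color data, which is what allows the two technologies to be analyzed together.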

Page 16

Clustering Analysis

• Distance metrics
  – Euclidean
  – Pearson
  – Spearman
  – …
• Algorithms
  – Hierarchical
  – K-means
  – SOM
  – …
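For reference, the three distance metrics named above can be written with the standard library alone. This is a plain sketch, not the implementation used in the talk. Spearman is computed as Pearson on average ranks, and returning 0.0 for a constant profile is an assumption.

```python
import math

def euclidean(x, y):
    """Straight-line distance between two expression profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pearson(x, y):
    """Pearson correlation coefficient; 0.0 if either profile is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0
    return cov / (sx * sy)

def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        average_rank = (pos + end) / 2 + 1
        for i in order[pos:end + 1]:
            result[i] = average_rank
        pos = end + 1
    return result

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# A monotonic but non-linear relationship between two hypothetical profiles
profile_a = [1.0, 2.0, 3.0, 4.0]
profile_b = [1.0, 4.0, 9.0, 16.0]
```

Here Spearman is 1 because the ordering is identical, while Pearson is slightly below 1 because the relationship is not linear. Pearson and Spearman are similarities, so a clustering algorithm would typically use something like one minus the correlation as the distance.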

Page 17

Megaclustering

• Combining data from multiple sources can cause problems
  – Normalization differences
  – Technology differences
  – Noise biases
• Requires unified pre-processing and smart application of statistics

Page 18

Apples to Apples

• Pearson correlation distributions not always normal
  – Large dependence on number of conditions

[Figure: histograms of Pearson correlation coefficients for a 6-condition dataset and a 40-condition dataset]

Page 19

Apples to Apples

• Fisher’s Z-score transform normalizes the distributions
  – Z = ln[(1+r)/(1-r)] / 2, where r = Pearson corr. coeff.

[Figure: histograms of Z-scores for a 6-condition dataset and a 40-condition dataset]
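Fisher's transform is Z = ½ ln[(1 + r) / (1 − r)], which is the inverse hyperbolic tangent of r. A small sketch follows; the function name `fisher_z` and the clamping of r away from ±1 (to avoid infinite values) are assumptions, not something the slide specifies.

```python
import math

def fisher_z(r, limit=0.999999):
    """Fisher transform of a Pearson correlation r: 0.5 * ln((1 + r) / (1 - r)).

    r is clamped to [-limit, limit] so that r = 1 or r = -1 stays finite;
    the clamp width is an arbitrary choice for this sketch.
    """
    r = max(min(r, limit), -limit)
    return 0.5 * math.log((1 + r) / (1 - r))

# Hypothetical correlations from a small dataset
correlations = [-0.9, -0.5, 0.0, 0.5, 0.9]
z_scores = [fisher_z(r) for r in correlations]
```

The transform stretches values near ±1, where correlations from small datasets tend to pile up, so that scores from datasets with different numbers of conditions have more comparable distributions.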

Page 20

Evaluation Measurements

• Gene Ontology (GO)
  – Hierarchical organization of biological processes, molecular functions, and cellular components
  – Cross-organism structure, organism-specific annotations
  – Closest available approximation of a “gold standard”
• True Positives and False Positives can be defined from the ontology
  – Node size, depth, expert voting used for cutoffs

Page 21

Precision / Recall

• Calculate and sort distances between all pairs of genes
• Determine a cutoff, all pairs below cutoff are predicted “true,” above “false”
• Given these predictions, can calculate precision and recall
  – Precision = TP / (TP + FP)
  – Recall = TP / TotalPositives
• Slide the cutoff from smallest to largest distance to create a curve of precision / recall pairs
  – Ramp down from few, high confidence predictions to many, low confidence predictions
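The sliding-cutoff procedure can be sketched as follows. The input format, a list of (distance, is_positive) tuples where the label comes from a gold standard such as GO, is an assumption for illustration, as are the function name and the toy data.

```python
def precision_recall_curve(pairs):
    """Return one (precision, recall) point per cutoff, from smallest distance up.

    pairs: list of (distance, is_positive) tuples, one per gene pair.
    """
    total_positives = sum(1 for _, positive in pairs if positive)
    if total_positives == 0:
        raise ValueError("need at least one positive pair")
    curve = []
    tp = fp = 0
    # Moving the cutoff past each pair in turn predicts that pair "true"
    for _, positive in sorted(pairs, key=lambda pair: pair[0]):
        if positive:
            tp += 1
        else:
            fp += 1
        curve.append((tp / (tp + fp), tp / total_positives))
    return curve

# Hypothetical gene pairs: (distance, annotated to the same GO term?)
gene_pairs = [(0.7, True), (0.1, True), (0.4, False), (0.2, True)]
curve = precision_recall_curve(gene_pairs)
```

The first points correspond to the few high-confidence predictions and the last point to calling every pair “true”, where recall reaches 1. Pairs with tied distances are treated as separate cutoffs in this sketch.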

Page 22

Example

[Figure: Precision/Recall of various data types]

Page 23

Outline

• Microarray methodology
• Analysis concerns
• Functional Biases
• Improved Approaches
• Preliminary Conclusions

Page 24

Functional Biases

• Microarray experiments often targeted at a particular process, pathway, or function
• However, several “global” signals are often present
  – Ribosomal response
  – General Stress Response
• Some datasets do contain more targeted “local” signals as well

Page 25

Ribosome Bias

[Figure: Precision/Recall of various data types]

Page 26

Ribosome Bias

[Figure: Precision/Recall excluding Ribosome Biogenesis]

Page 27

Process-specific P/R

• Can generate PR-curves on a per-GO term basis
  – TPs are pairs of genes annotated to term
  – FPs are pairs with one gene in term, with smallest common ancestor in very large term
  – Normalize by size of GO term
• Results for individual data sets can expose functional biases

Page 28

Per-dataset Biases

Typical Results

Page 29

Per-dataset Biases

Poor Results

Page 30

Per-dataset Biases

Diverse Results

Page 31

Z-test for significance

• Difference between pair-wise distances for all genes in a term vs. background
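The slide does not give the test statistic, so the following is only one plausible reading: a one-sample Z-test that compares the mean pairwise score among genes in a term with the mean and variance of the background scores, using a two-sided normal P-value. The function name and the toy numbers are hypothetical.

```python
import math

def z_test(term_values, background_values):
    """Return (z, two-sided p) for the mean of term_values against the background.

    Assumes the background mean and variance are treated as known, which is
    reasonable when the background is all gene pairs in a dataset.
    """
    n = len(term_values)
    term_mean = sum(term_values) / n
    bg_n = len(background_values)
    bg_mean = sum(background_values) / bg_n
    bg_var = sum((v - bg_mean) ** 2 for v in background_values) / bg_n
    z = (term_mean - bg_mean) / math.sqrt(bg_var / n)
    # Two-sided tail probability of a standard normal
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p

# Toy data: background scores with mean 0 and variance 1, four within-term scores
background = [-1.0, 1.0] * 50
within_term = [0.5, 0.5, 0.5, 0.5]
z, p = z_test(within_term, background)
```

Running such a test once per GO term and per dataset gives a matrix of P-values, one row per term and one column per dataset.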

Page 32

A Global View

Z-test P-values
Columns: datasets
Rows: GO terms
Red at a cutoff of 10^-10

Page 33

A Global View

Page 34

A Global View

Page 35

A Local View

Page 36

A Local View

Page 37

Outline

• Microarray methodology
• Analysis concerns
• Functional Biases
• Improved Approaches
• Preliminary Conclusions

Page 38

Bi-clustering

• Traditional clustering will be driven by “global” signals and ignore “local” signals
• Bi-clustering identifies groups of genes and conditions rather than just genes

[Figure: traditional clustering vs. bi-clustering]

Page 39

Bi-clustering goals/issues

• Better capture biological reality
  – Genes only cooperate in certain conditions
  – Genes can have multiple functions
  – Datasets have functional biases
• Computationally difficult problem
  – Reducible to bi-clique finding
    • NP-complete
• Heuristics, simplifications, approximations
  – e.g. δ-biclusters, SAMBA, PISA

Page 40

Bi-clustering goals/issues

• Microarray noise can lead to spurious output
  – As compendiums increase in size, patterns by chance increase
  – Datasets have “smallest logical groupings”
    • Restrict co-expression to these groups
• Long running times + large result sets
  – Difficult to validate results
  – Scientifically frustrating

Page 41

Query-driven approach

• Allow users to specify a starting point for search
  – Leverages expert knowledge of domain
  – Known to be useful in other contexts
    • bioPIXIE
• Identify conditions/datasets of interest based on the set of query genes
• Expand query set to include additional related genes in these conditions

Page 42

Query-driven approach

• Reduces problem complexity to allow for real-time results
• Fast results allow for user-driven refinement of search criteria
• Extensible to larger data compendiums and more complex organisms
  – Locality sensitive hashing
  – Pre-processing
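The slide names locality sensitive hashing without saying which scheme is meant. As one possibility, random-hyperplane hashing suits correlation-based search, because the cosine similarity of mean-centered profiles equals their Pearson correlation. The sketch below is an assumption-laden illustration: `make_hyperplanes`, `signature`, the number of bits, and the seed are all hypothetical choices.

```python
import random

def make_hyperplanes(n_bits, n_dims, seed=0):
    """Draw n_bits random hyperplanes (Gaussian normal vectors) in n_dims dimensions."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(n_dims)] for _ in range(n_bits)]

def signature(profile, hyperplanes):
    """One bit per hyperplane: the side on which the mean-centered profile falls."""
    mean = sum(profile) / len(profile)
    centered = [v - mean for v in profile]
    return tuple(
        1 if sum(h * c for h, c in zip(plane, centered)) >= 0 else 0
        for plane in hyperplanes
    )

planes = make_hyperplanes(n_bits=16, n_dims=4)
profile = [1.0, 3.0, 2.0, 5.0]
rescaled = [2.0 * v + 7.0 for v in profile]   # perfectly correlated with profile
flipped = [-v for v in profile]               # perfectly anti-correlated
```

Profiles that are perfectly correlated receive the same signature, and the fraction of differing bits grows with the angle between centered profiles, so genes can be bucketed by signature and only near matches compared exactly.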

Page 43

Query Weighting

• Identify data conditions related in query set
  – Average correlation, distance, etc.
  – Signal to Noise ratio of query
  – Centroid significance
• Additional genes related to query
  – Correlation, distance, etc. weighted by identified condition sets
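The first option listed, average correlation, could be used to weight a dataset as in the sketch below. This is not the author's implementation: the data layout (a dict from gene name to expression profile), the clipping of negative averages to zero, and the names `dataset_weight`, `coherent`, and `incoherent` are assumptions made for illustration.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation coefficient; 0.0 if either profile is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0
    return cov / (sx * sy)

def dataset_weight(dataset, query):
    """Average pairwise correlation among the query genes present in one dataset."""
    present = [gene for gene in query if gene in dataset]
    pairs = list(combinations(present, 2))
    if not pairs:
        return 0.0
    mean_r = sum(pearson(dataset[a], dataset[b]) for a, b in pairs) / len(pairs)
    return max(mean_r, 0.0)  # assumption: an incoherent query earns no weight

# Two hypothetical datasets holding profiles for the same two query genes
coherent = {"geneA": [1.0, 2.0, 3.0, 4.0], "geneB": [2.0, 4.0, 6.0, 8.0]}
incoherent = {"geneA": [1.0, 2.0, 3.0, 4.0], "geneB": [2.0, 1.0, 2.0, 1.0]}
query = ["geneA", "geneB"]
weights = {
    "coherent": dataset_weight(coherent, query),
    "incoherent": dataset_weight(incoherent, query),
}
```

A dataset in which the query genes move together receives a weight near 1, while a dataset in which they do not contributes little or nothing to the search.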

Page 44

Simple Scheme

• Weighted by correlation of query

Page 45

Simple Scheme

• Results, weighted sum of correlation to query

[Figure: results display; both axes labeled “decreasing correlation”]
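The result ranking, a weighted sum of each gene's correlation to the query, can be sketched as below. The inputs are assumed to be precomputed: `weights` maps a dataset name to its query-derived weight, and `correlations` maps a dataset name to each candidate gene's correlation with the query in that dataset. All names and numbers are hypothetical.

```python
def rank_genes(correlations, weights):
    """Return (gene, score) pairs sorted by decreasing weighted sum of correlations."""
    totals = {}
    for dataset_name, weight in weights.items():
        for gene, r in correlations[dataset_name].items():
            totals[gene] = totals.get(gene, 0.0) + weight * r
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)

# Hypothetical dataset weights and per-dataset correlations to the query
weights = {"d1": 0.9, "d2": 0.1}
correlations = {
    "d1": {"g1": 0.8, "g2": 0.1},
    "d2": {"g1": -0.2, "g2": 0.9},
}
ranked = rank_genes(correlations, weights)
```

Gene g2 correlates strongly with the query only in the low-weight dataset, so it ranks below g1, which is the intended effect of letting the query decide which datasets matter.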

Page 46

Ongoing Work

• Compare query weighting schemes
• UI challenges
• Scalability concerns
  – Indexing, Locality Sensitive Hashing
  – Human data
• Assess biological usefulness

Page 47

Preliminary Conclusions

• Noise, functional biases, and collection sizes require consideration in microarray analysis
• Evaluation metrics can be influenced by biases, creating misleading results
• Query-driven approaches show promise
  – Targeted search
  – Computational feasibility / Real-time results
  – Extensibility

Page 48

Acknowledgements

• Olga Troyanskaya
• Chad Myers
• Curtis Huttenhower
• Kai Li and lab
• Botstein and Kruglyak labs
• Kara Dolinski, Maitreya Dunham

Jessy