Mining Public Data for Insights into Human Disease 11/16/2009 Baliga Lab Meeting Chris Plaisier
Transcript
Page 1: Mining Public Data for Insights into Human Disease

Mining Public Data for Insights into Human Disease

11/16/2009

Baliga Lab Meeting

Chris Plaisier

Page 2: Mining Public Data for Insights into Human Disease

Utility of Gene Expression for Human Disease

Page 3: Mining Public Data for Insights into Human Disease

Microarray Technology

Page 4: Mining Public Data for Insights into Human Disease

Big Picture

Page 5: Mining Public Data for Insights into Human Disease

Data Access

Page 6: Mining Public Data for Insights into Human Disease

Gene Expression Microarray Repositories

• Gene Expression Omnibus (GEO)
Hosted by: NCBI
Platform: All accepted
Normalization: Experiment by experiment basis
Access: R (GEOquery), EUtils
Meta-Information: GEOmetadb

• ArrayExpress
Hosted by: EMBL
Platform: All accepted
Normalization: Experiment by experiment basis
Access: Web interface, EMBL API
Meta-Information: ? (API)

• Many smaller repositories which have more phenotypic information for specific diseases
Phenotypic information may be hard to access
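
As a rough illustration of the EUtils access route listed above, the sketch below only builds an ESearch query URL against the GEO DataSets database; it does not send the request. The search term, the `retmax` value, and the helper name `geo_esearch_url` are assumptions made for the example, not something taken from the slides.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities service.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def geo_esearch_url(term, retmax=20):
    """Build an ESearch URL for the GEO DataSets database (db=gds).

    `term` is an Entrez query string; `retmax` caps the number of IDs returned.
    Both values used below are illustrative assumptions.
    """
    params = {"db": "gds", "term": term, "retmax": retmax}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


# Hypothetical query: glioma series on the HG-U133 Plus 2.0 platform (GPL570).
url = geo_esearch_url("glioma AND GPL570[ACCN]")
print(url)
```

The returned URL can be fetched with any HTTP client; the response lists GEO record IDs that can then be retrieved in a second step.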

Page 7: Mining Public Data for Insights into Human Disease

Gene Expression Omnibus

Page 8: Mining Public Data for Insights into Human Disease

Samples Per Platform in GEO

HGU133 Plus 2.0

HGU133A

Latest 3’ Affymetrix Array

Affymetrix arrays account for ~67% of human gene expression data in public repositories.

Page 9: Mining Public Data for Insights into Human Disease

Affymetrix Probesets

Probe

Probe Pair

Probeset (11 Probe Pairs)

Perfect Match

Mismatch

GeneChip U133 Plus 2.0 Array (Image stored as CEL file.)

>54,000 Probesets

25 nucleotides

Page 10: Mining Public Data for Insights into Human Disease

Pre-Processing 101

Page 11: Mining Public Data for Insights into Human Disease

Pre-Processing Gene Expression Data

Page 12: Mining Public Data for Insights into Human Disease

Removing Mis-Targeted and Non-Specific Probes

CEL File + CDF File → Intensities

Normally the CDF file comes from Affymetrix

Zhang, et al. 2005

CEL File + Alternative CDF File → Intensities

Alternative CDF file thoroughly cleaned

Page 13: Mining Public Data for Insights into Human Disease

Pre-Processing Gene Expression Data

Page 14: Mining Public Data for Insights into Human Disease

What Makes Cells Different?

Page 15: Mining Public Data for Insights into Human Disease

PANP: Presence/Absence Filtering

• Use Negative Strand Matching Probesets (NSMPs) to determine true background distribution

NSMP probesets are designed to hybridize to the opposite strand from the expressed strand

• Utilize this background distribution from these NSMPs to threshold the entire dataset

• Output is a call for each array for each gene

Calls are:

• P = Presence

• M = Marginal

• A = Absence
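
The thresholding idea above can be sketched in a few lines. This is a simplified stand-in, not the PANP implementation: it uses a plain empirical p-value (the fraction of NSMP background intensities at or above a probeset's intensity) and treats the two cutoffs, 0.01 and 0.02, as assumed parameters.

```python
from bisect import bisect_left


def panp_calls(intensities, nsmp_intensities, tight=0.01, loose=0.02):
    """Return a P/M/A call for each intensity, given NSMP background intensities.

    p-value = fraction of background values >= the observed intensity.
    The cutoffs `tight` and `loose` are assumptions for this sketch.
    """
    background = sorted(nsmp_intensities)
    n = len(background)
    calls = []
    for x in intensities:
        # bisect_left gives the number of background values strictly below x.
        p = (n - bisect_left(background, x)) / n
        if p < tight:
            calls.append("P")  # well above background
        elif p < loose:
            calls.append("M")  # borderline
        else:
            calls.append("A")  # indistinguishable from background
    return calls


# Toy background of 100 NSMP intensities (1..100) and three probesets.
example_calls = panp_calls([150, 100, 50], list(range(1, 101)))
print(example_calls)
```

In practice this would be run once per array, so that each gene receives one call per array.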

Page 16: Mining Public Data for Insights into Human Disease

Identifying Present Genes

• Filter out genes that are ≥ 50% absent

Whole dataset

Subsets

• Only present genes are utilized in future analyses
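
A minimal sketch of the ≥ 50% absent filter, assuming the calls are stored as one list of P/M/A calls per gene. How marginal calls are counted is not stated on the slide; here they are treated as not absent, which is an assumption. The gene names `GENE_X` and `GENE_Y` are placeholders.

```python
def present_genes(calls_by_gene, max_absent_fraction=0.5):
    """Keep genes whose fraction of 'A' calls is below `max_absent_fraction`."""
    kept = []
    for gene, calls in calls_by_gene.items():
        absent_fraction = sum(1 for c in calls if c == "A") / len(calls)
        if absent_fraction < max_absent_fraction:
            kept.append(gene)
    return kept


# Toy calls across four arrays.
toy_calls = {
    "CD14": ["P", "P", "M", "A"],    # 25% absent: kept
    "GENE_X": ["A", "A", "P", "P"],  # 50% absent: filtered out
    "GENE_Y": ["A", "A", "A", "P"],  # 75% absent: filtered out
}
kept = present_genes(toy_calls)
print(kept)
```

The same function can be applied to a subset by passing only the calls for the arrays in that subset.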

Page 17: Mining Public Data for Insights into Human Disease

Pre-Processing Gene Expression Data

Page 18: Mining Public Data for Insights into Human Disease

Removing Redundancy

Page 19: Mining Public Data for Insights into Human Disease

Reason for Removing Redundancy Before Running

Page 20: Mining Public Data for Insights into Human Disease

Removing Redundancy

• Collapse Affymetrix Probeset IDs to EntrezIDs

• Test for correlation between probesets

If correlation is ≥ 0.8 then combine probesets

If not then leave them separate
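
The collapsing rule can be sketched as follows. The slide does not say how probesets are combined or what happens with more than two probesets per gene, so this sketch assumes averaging and requires every pair to reach the 0.8 cutoff. The probeset names and Entrez IDs in the example are made up.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    denom = sqrt(sxx * syy)
    return sxy / denom if denom else 0.0


def collapse_probesets(expr, probeset_to_entrez, min_r=0.8):
    """Collapse probeset rows to Entrez IDs when all pairs correlate >= min_r."""
    by_gene = {}
    for probeset, gene in probeset_to_entrez.items():
        if probeset in expr:
            by_gene.setdefault(gene, []).append(probeset)

    collapsed = {}
    for gene, probesets in by_gene.items():
        pairs = combinations(probesets, 2)
        if all(pearson(expr[a], expr[b]) >= min_r for a, b in pairs):
            # Combine by averaging across probesets (assumed combining rule).
            rows = [expr[p] for p in probesets]
            collapsed[gene] = [sum(col) / len(col) for col in zip(*rows)]
        else:
            # Leave poorly correlated probesets as separate rows.
            for p in probesets:
                collapsed[f"{gene}|{p}"] = expr[p]
    return collapsed


toy_expr = {
    "ps1": [1, 2, 3, 4], "ps2": [2, 4, 6, 8],  # correlated pair
    "ps3": [4, 3, 2, 1], "ps4": [1, 3, 2, 4],  # anti-correlated pair
}
toy_map = {"ps1": "100", "ps2": "100", "ps3": "200", "ps4": "200"}
collapsed = collapse_probesets(toy_expr, toy_map)
print(sorted(collapsed))
```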

Page 21: Mining Public Data for Insights into Human Disease

Pre-Processing Gene Expression Data

Page 22: Mining Public Data for Insights into Human Disease

Pre-Processing Pipeline

= Implemented in R

= Implemented in Python

Page 23: Mining Public Data for Insights into Human Disease

Big Picture

Page 24: Mining Public Data for Insights into Human Disease

Glioma: A Deadly Brain Cancer

Wikimedia commons

Page 25: Mining Public Data for Insights into Human Disease

Brain Anatomy

Wikimedia commons

Page 26: Mining Public Data for Insights into Human Disease

What do they do?

Page 27: Mining Public Data for Insights into Human Disease

Neurophysiology

Page 28: Mining Public Data for Insights into Human Disease

Hierarchy of Nervous Tissue Tumors

Page 29: Mining Public Data for Insights into Human Disease

Glioma

WHO Grade    Tumor Type                          Percentage of CNS Tumors
I            Pilocytic Astrocytoma
II           Diffuse or Low-Grade Astrocytoma    9.8%
III          Anaplastic Astrocytoma
IV           Glioblastoma Multiforme             20.3%

Gliomas account for 40% of all tumors and 78% of malignant tumors.

Buckner et al., 2007

Page 30: Mining Public Data for Insights into Human Disease

Glioma Survival

http://www.neurooncology.ucla.edu/

5 years

10 years

Page 31: Mining Public Data for Insights into Human Disease
Page 32: Mining Public Data for Insights into Human Disease

Repository of Molecular Brain Neoplasia Data (REMBRANDT)

• REMBRANDT (Madhavan et al., 2009)

Currently 257 individual specimens

• Glioblastoma multiforme (GBM) = 110

• Astrocytoma = 50

• Oligodendroglioma = 55

• Mixed = 21

• Non-Tumor = 21

Phenotypes

• Tumor type: GBM, Astrocytoma, etc.

• WHO Grade: 176 individuals

• Age: 253 individuals

• Sex: 250 individuals (partially inferred using Y chromosome genes)

• Survival (days post diagnosis): 169 individuals

Page 33: Mining Public Data for Insights into Human Disease

REMBRANDT: Chromosome Y Expression

[Figure: sex-specific gene expression, Female vs. Male]

Conversions of male to female should be more common than the other way, because it is difficult for females to express the Y chromosome.

4 females cluster with males

8 males cluster with females

Page 34: Mining Public Data for Insights into Human Disease

REMBRANDT: Chr. Y Expression – Intelligent Reassignment

[Figure: sex-specific gene expression, Female vs. Male]

Intelligent Reassignment – If previous call of sex is for other group then the call is turned into an NA. All unknowns are given a call.
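
The reassignment rule stated above fits in one small function. This is a sketch under assumptions: the recorded sex comes from the clinical annotation (`None` when unknown), the inferred sex comes from the chromosome Y expression cluster, and the labels `"F"`, `"M"`, and `"NA"` are chosen for the example.

```python
def reassign_sex(recorded, inferred):
    """Combine a recorded sex label with one inferred from chr. Y expression.

    recorded: "F", "M", or None when the annotation is missing.
    inferred: "F" or "M", from the expression-based cluster.
    """
    if recorded is None:
        return inferred  # unknowns are given the expression-based call
    if recorded != inferred:
        return "NA"      # conflicting evidence: turn the call into an NA
    return recorded      # annotation and expression agree


samples = [("F", "F"), ("F", "M"), (None, "M")]
reassigned = [reassign_sex(rec, inf) for rec, inf in samples]
print(reassigned)
```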

Page 35: Mining Public Data for Insights into Human Disease

Progression of Astrocytic Glioma

Furnari, et al. (2007)

Page 36: Mining Public Data for Insights into Human Disease

Modeling Glioma

• Increasing metastatic potential and severity of glioma could be modeled using this simple schema

• Correlation of model to survival post diagnosis is -0.68

0

1

2

Page 37: Mining Public Data for Insights into Human Disease

Exploring Meta-Information

• Age explains 31% of survival post diagnosis

• Age explains 25% of the progression model

• Sex does not have a significant effect on either survival or the progression model

Yet it is known that glioblastoma is slightly more common in men than in women

Page 38: Mining Public Data for Insights into Human Disease

Summary

• Ample dataset with a good amount of meta-information

• Ready for dimensionality reduction and network inference!

Page 39: Mining Public Data for Insights into Human Disease

Big Picture

Page 40: Mining Public Data for Insights into Human Disease

Clustering as Dimensionality Reduction

Page 41: Mining Public Data for Insights into Human Disease

Big Picture

Page 42: Mining Public Data for Insights into Human Disease

Likely Issues

• Size of eukaryotic genomes

• Added complexity of regulatory regions

• Tissue and cell type heterogeneity

• Patient genetic and environmental heterogeneity

Page 43: Mining Public Data for Insights into Human Disease

Relative Genome Sizes

Page 44: Mining Public Data for Insights into Human Disease

Solutions

• Pre-process genomic sequences

• Reduce data complexity by collapsing redundancies

• Utilize filters that select for only the most variant genes
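
One way to select the most variant genes is to rank them by expression variance across samples and keep the top N. The slide does not name the statistic or the cutoff, so the use of sample variance and the value of `top_n` are assumptions; the gene names are placeholders.

```python
from statistics import variance


def most_variant_genes(expr, top_n):
    """Return the `top_n` gene names with the highest variance across samples."""
    ranked = sorted(expr, key=lambda gene: variance(expr[gene]), reverse=True)
    return ranked[:top_n]


toy_expr = {
    "GENE_A": [1, 1, 1, 1],    # flat across samples
    "GENE_B": [0, 10, 0, 10],  # highly variable
    "GENE_C": [4, 5, 6, 5],    # mildly variable
}
top_genes = most_variant_genes(toy_expr, top_n=2)
print(top_genes)
```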

Page 45: Mining Public Data for Insights into Human Disease

Likely Issues

• Size of eukaryotic genomes

• Added complexity of regulatory regions

• Tissue and cell type heterogeneity

• Patient genetic and environmental heterogeneity

Page 46: Mining Public Data for Insights into Human Disease

Eukaryotic Gene Structure

Page 47: Mining Public Data for Insights into Human Disease

Eukaryotic Gene Structure

Transcriptional Start Site

Start Codon

Untranslated Regions

Page 48: Mining Public Data for Insights into Human Disease

Eukaryotic Gene Structure

Exons

Page 49: Mining Public Data for Insights into Human Disease

Eukaryotic Gene Structure

Introns

Page 50: Mining Public Data for Insights into Human Disease

Regulatory Regions

3’ UTR

miRNA binding sites (4-9bp motifs)

Promoter

Transcription Factor Binding Sites (6-12bp motifs)

No set length for promoters in eukaryotes.

Grabbing 2Kbp, so we can use 2Kbp or smaller.

Median 3’ UTR length is 831bp

Page 51: Mining Public Data for Insights into Human Disease

Three Examples After Capture

85% (n = 36,177) of probesets are associated with a sequence

Page 52: Mining Public Data for Insights into Human Disease

Solution

• Do motif detection on both promoter and 3’ UTR sequences

• Incorporate both of these regulatory regions into the cMonkey bi-cluster scoring matrix

Page 53: Mining Public Data for Insights into Human Disease

Promoter Sequences

• Looking for transcription factor binding sites (TFBS)

Using MEME with 6-12bp motif widths

• Utilized RefSeq gene mapping to identify putative promoter regions

2Kbp of sequence upstream of transcriptional start site (TSS) was grabbed

• If two RefSeq gene mappings did not overlap then the longest transcript's promoter was taken
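
Grabbing sequence upstream of a TSS has to respect the strand of the gene. The sketch below assumes 0-based, half-open coordinates, with `tss` being the position of the first transcribed base; the function names and the toy sequence are illustrative and do not come from the pipeline described here.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Reverse-complement a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def promoter_interval(tss, strand, upstream=2000, chrom_length=None):
    """Return (start, end) of the region upstream of `tss`, clipped to the chromosome."""
    if strand == "+":
        start, end = tss - upstream, tss          # lower coordinates are upstream
    elif strand == "-":
        start, end = tss + 1, tss + 1 + upstream  # higher coordinates are upstream
    else:
        raise ValueError("strand must be '+' or '-'")
    start = max(start, 0)
    if chrom_length is not None:
        end = min(end, chrom_length)
    return start, end


def promoter_sequence(chrom_seq, tss, strand, upstream=2000):
    """Extract the promoter, reported 5' to 3' on the gene's own strand."""
    start, end = promoter_interval(tss, strand, upstream, len(chrom_seq))
    seq = chrom_seq[start:end]
    return seq if strand == "+" else reverse_complement(seq)


toy_chrom = "AAAACCCCGGGGTTTT"
plus_promoter = promoter_sequence(toy_chrom, tss=8, strand="+", upstream=4)
minus_promoter = promoter_sequence(toy_chrom, tss=7, strand="-", upstream=4)
print(plus_promoter, minus_promoter)
```

Shorter promoter windows (for example 500bp or 1Kbp) follow by passing a smaller `upstream` value.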

Page 54: Mining Public Data for Insights into Human Disease

3’ UTR Sequences

• Looking for miRNA binding sites

miRNAs are 21bp RNA molecules that bind to mRNA and alter expression

Using MEME with 4-9bp motif widths

Page 55: Mining Public Data for Insights into Human Disease

Likely Issues

• Size of eukaryotic genomes

• Added complexity of regulatory regions

• Tissue and cell type heterogeneity

• Patient genetic and environmental heterogeneity

Page 56: Mining Public Data for Insights into Human Disease

Complexity of Mammalian Systems

Page 57: Mining Public Data for Insights into Human Disease

Cellular Heterogeneity in Tissues

Page 58: Mining Public Data for Insights into Human Disease

What Makes Cells Different?

Page 59: Mining Public Data for Insights into Human Disease

Solution

• Filter out genes that are not expressed for each tissue, leaving only those that are expressed

• Enhance the capability of the software to handle missing data

Page 60: Mining Public Data for Insights into Human Disease

Likely Issues

• Size of eukaryotic genomes

• Added complexity of regulatory regions

• Tissue and cell type heterogeneity

• Patient genetic and environmental heterogeneity

Page 61: Mining Public Data for Insights into Human Disease

Intelligent Sample Collection

• Genetic and environmental heterogeneity are real world issues

• Can try to match for certain confounders

• Or stratify analyses based on particular confounders

Page 62: Mining Public Data for Insights into Human Disease

Running cMonkey

• Running cMonkey on AEGIR cluster

10 nodes with 8 cores per node

1 node has 24GB RAM

2 others have 16GB RAM

• Completion time depends heavily on the size of the run

Page 63: Mining Public Data for Insights into Human Disease

Beautiful New Result Interface

Page 64: Mining Public Data for Insights into Human Disease

Looking at a Cluster

Page 65: Mining Public Data for Insights into Human Disease

Chris’s Graphics Mods

Page 66: Mining Public Data for Insights into Human Disease

Original cMonkey Output

Page 67: Mining Public Data for Insights into Human Disease

Sorted cMonkey Output

Page 68: Mining Public Data for Insights into Human Disease

Boxplot For All Samples

Page 69: Mining Public Data for Insights into Human Disease

Boxplot for In Samples

Page 70: Mining Public Data for Insights into Human Disease

Integrating Phenotypes

Page 71: Mining Public Data for Insights into Human Disease

What to do when you find a cluster?

Page 72: Mining Public Data for Insights into Human Disease

Checking Out PSSM #1

Page 73: Mining Public Data for Insights into Human Disease

Known Motif?

Page 74: Mining Public Data for Insights into Human Disease

Motif Known?

Page 75: Mining Public Data for Insights into Human Disease

What do the genes do?

Page 76: Mining Public Data for Insights into Human Disease

Functional Enrichment?

Page 77: Mining Public Data for Insights into Human Disease

Functional Enrichment

Page 78: Mining Public Data for Insights into Human Disease

Genes?

Page 79: Mining Public Data for Insights into Human Disease

Interesting Cluster

Page 80: Mining Public Data for Insights into Human Disease

Phenotype Correlations

• Survival – Correlation coefficient = -0.48, P-value = 3.2 x 10^-11

• Progression Model – Correlation coefficient = 0.55, P-value = 6.7 x 10^-16

• Age – Correlation coefficient = 0.32, P-value = 2.2 x 10^-7

• Sex – Correlation coefficient = -0.27, P-value = 0.0012

Bonferroni corrected significance threshold: p-value ≤ 0.05 / (585*4) ≈ 2.1 x 10^-5
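
The Bonferroni arithmetic can be checked directly against the four p-values on this slide. The sketch assumes that 585 is the number of clusters tested and 4 the number of phenotypes, which is how the denominator reads; the variable names are illustrative.

```python
ALPHA = 0.05
N_CLUSTERS = 585   # assumed: number of clusters tested
N_PHENOTYPES = 4   # survival, progression model, age, sex

# Bonferroni: divide the significance level by the number of tests.
threshold = ALPHA / (N_CLUSTERS * N_PHENOTYPES)

# P-values reported on the slide for the cluster of interest.
p_values = {
    "Survival": 3.2e-11,
    "Progression Model": 6.7e-16,
    "Age": 2.2e-7,
    "Sex": 0.0012,
}
significant = {name: p <= threshold for name, p in p_values.items()}

print(f"threshold = {threshold:.2e}")
for name, is_significant in significant.items():
    print(name, "significant" if is_significant else "not significant")
```

Under this threshold the survival, progression model, and age correlations pass, while the sex correlation (p = 0.0012) does not.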

Page 81: Mining Public Data for Insights into Human Disease

Genes from Cluster

AFFY_ID Gene Symbol Gene Name

212067_S_AT C1R complement component 1, r subcomponent

208747_S_AT C1S complement component 1, s subcomponent

201743_AT CD14 cd14 antigen

215049_X_AT CD163 cd163 antigen

203854_AT CFI complement factor i

213060_S_AT CHI3L2 chitinase 3-like 2

208146_S_AT CPVL carboxypeptidase, vitellogenic-like

201798_S_AT FER1L3 fer-1-like 3, myoferlin (c. elegans)

206584_AT LY96 lymphocyte antigen 96

202180_S_AT MVP major vault protein

204150_AT STAB1 stabilin 1

204924_AT TLR2 toll-like receptor 2

= Previously known to be differentially expressed in GBM.

Page 82: Mining Public Data for Insights into Human Disease

Motif Matches

PSSM #2

PSSM #1

Page 83: Mining Public Data for Insights into Human Disease

Summary

• Very promising results

• Need to further develop certain aspects of cMonkey to better utilize the human data

• Then need to build network inference component

Page 84: Mining Public Data for Insights into Human Disease

General Questions

• Biclustering or not?

• How many genes to run?

• How much sequence to feed MEME?

• Can more than one experiment be included?

Page 85: Mining Public Data for Insights into Human Disease

Cluster Samples, or Not?

• Bi-clustering clusters not only on genes but also by experimental conditions (samples)

• Because we are using just one experiment it may not be necessary to cluster samples

• Although it may be useful again once other experiments are included

Page 86: Mining Public Data for Insights into Human Disease

Bi-clustering or Not?

Bi-clustering Gene Clustering Only

Page 87: Mining Public Data for Insights into Human Disease

Brief Glance

• Looks like for this dataset it may make more sense to only cluster genes

More clusters with significant motifs

• Although this is likely to change once we add more experiments to the mix

• Need a method to quantify this

Page 88: Mining Public Data for Insights into Human Disease

General Questions

• Biclustering or not?

• How many genes to run?

• How much sequence to feed MEME?

• Can more than one experiment be included?

Page 89: Mining Public Data for Insights into Human Disease

Maxing Out cMonkey

• Can cMonkey handle running all genes?

Yes, without doing motif finding

With motif finding this will take a long time (weeks?), and tends to crash out

• Essentially need to balance sequence length for motif finding with cluster size and number of clusters

• Need a method to quantify this

Page 90: Mining Public Data for Insights into Human Disease

General Questions

• Biclustering or not?

• How many genes to run?

• How much sequence to feed MEME?

• Can more than one experiment be included?

Page 91: Mining Public Data for Insights into Human Disease

Length for Promoters?

• MEME suggests 1Kbp or less for sequences as input

• Tried using 500bp, 1Kbp, 2Kbp, 2.5Kbp, and 5Kbp

Page 92: Mining Public Data for Insights into Human Disease

Brief Glance

• So far it looks like 500bp gives the most clusters with motifs

• Need a method to quantify this

Page 93: Mining Public Data for Insights into Human Disease

General Questions

• Biclustering or not?

• How many genes to run?

• How much sequence to feed MEME?

• Can more than one experiment be included?

Page 94: Mining Public Data for Insights into Human Disease

Breast Cancer Metastasis

Bos et al., 2009

Page 95: Mining Public Data for Insights into Human Disease

cMonkey for Eukaryotes

Future Modifications to cMonkey for eukaryotes:

Preprocess sequence data

Add 3’ UTR miRNA motif detection

Integrate 3’ UTR miRNA motif scores with promoter motif scores

Page 96: Mining Public Data for Insights into Human Disease

Network Inference

• cMonkey software is utilized to produce the bi-clusters

• Inferelator can then be used to identify regulatory factors

• Simple correlation with phenotypes can relate bi-clusters to disease

Page 97: Mining Public Data for Insights into Human Disease

Acknowledgements

Baliga Lab

• Nitin

• David

• Chris

• Dan

Hood Lab

• Burak Kutlu

• Luxembourg Project

• REMBRANDT