Algorithmic Analysis of Human DNA Replication Timing from Discrete Microarray Data Christopher Taylor Gabriel Robins & Anindya Dutta

Dec 23, 2015

Page 1

Algorithmic Analysis of Human DNA Replication Timing from Discrete Microarray Data

Christopher Taylor

Gabriel Robins & Anindya Dutta

Page 2

Thesis Statement

The DNA replication timing profile can be reconstructed efficiently and accurately from discrete time points.

(Glossary)

Page 3

Presentation Outline

• Biology background

• Microarray technology

• Experimental data
  – Challenges
• Algorithms
• Research Plans
  – Replication timing
  – Origins
  – Scale up

Page 4

Why Study DNA Replication?

• Natural Science
  – DNA is the blueprint for organisms
    • It must be passed on (organism, cell)
• Engineering
  – Gene therapy
    • Insertion, deletion, modification
  – Cancer is unchecked replication

Page 5

... A G G T C G A C A C ...

... T C C A G C T G T G ...

• Human genome > 3 billion bp
• Replication rate ~ 1000 bp/min
• Serial replication would take 5.7 years
• Actual replication takes 6 to 10 hours (speedup > 5000)
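The figures on this slide follow from simple arithmetic, which the sketch below reproduces. It assumes exactly 3 billion bp (the slide gives this as a lower bound), a single serial replication process at 1000 bp/min, and 365-day years; the function names are illustrative only.

```python
GENOME_BP = 3_000_000_000      # lower bound quoted on the slide
RATE_BP_PER_MIN = 1000         # approximate rate quoted on the slide


def serial_replication_years(genome_bp=GENOME_BP, rate=RATE_BP_PER_MIN):
    """Years needed if the genome were copied serially from one end to the other."""
    minutes = genome_bp / rate
    return minutes / (60 * 24 * 365)


def speedup(s_phase_hours, genome_bp=GENOME_BP, rate=RATE_BP_PER_MIN):
    """Ratio of serial replication time to the observed replication time."""
    serial_hours = genome_bp / rate / 60
    return serial_hours / s_phase_hours


print(round(serial_replication_years(), 1))  # about 5.7 years
print(speedup(10), speedup(6))               # 5000.0 at 10 hours, larger at 6 hours
```

Because the genome is larger than 3 billion bp and replication finishes in 10 hours or less, 5000 is a lower bound on the speedup, which implies that many stretches of DNA replicate in parallel.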

Page 6

Background

• Prokaryotes
  – E. coli
    • DnaA binds to oriC
• Eukaryotes
  – ORC
  – S. cerevisiae (yeast)
    • ARS 11 bp consensus
      – Mapping of origins
  – Human
    • No known consensus
    • Few origins characterized

Page 7

Genome Tiling Microarrays

• Interrogation at genomic scale
  – Large increase in data
    • Microarray data analysis
• Array of probes tiles genome
• Cross-hybridization
  – Repeats not tiled
    • Gaps in genome

[Figure: a genomic sequence tiled by short probes; a PM probe (GAGTACATAGCATACCATGACTAGA) is shown alongside its MM probe]

ATGGACTACGGATCAGTAAATCGATTAGGCACCAGATCAAGTACGATCCAGAGTACATAGCATACCATGACTAGATACCTGATGCCTAGTCATTTAGCTAATCCGTGGTCTAGTTCATGCTAGGTCTCATGTATCGTATGGTACTGATCT

Page 8

Image analysis computes intensity of each array probe

Page 9

The Cell Cycle

Start of S-phase (0 hour)

S-Phase

Page 10

Profiling DNA Replication Timing

• Ideal: f(chr, bp) = rtime
• Isolate DNA replicated in discrete parts of S-phase
  – One cell is not enough
  – Synchronize S-phase entry
    • Apply drugs
    • Release together
  – Synchronization error
  – Label in two-hour intervals
• Allelic Variation
  – mf(chr, bp) = {rtime1, rtime2, …}

Page 11

Allelic Variation

• Fluorescent in-situ Hybridization (FISH)
  – Replication timing at a given site

[Figure: two FISH panels, each over the time points 0hr, 2hr, 4hr, 6hr, 8hr, 10hr: temporally specific replication (TS) and temporally non-specific replication (TNS)]

Page 12

What is the Problem?

Reconstruct a continuous replication profile
  – Temporally (time points)
  – Spatially (probes)
from noisy data
  – Biological experiments
  – Synchronization error
  – Microarray artifacts
efficiently
  – Genomic data (> 3 billion bp)

Page 13

Initial Analysis

• Tiling Analysis Software (TAS)
  – Wilcoxon Rank Sum test in sliding window
    • Assess enrichment of treatment over control
  – Window slides to get p-value for each probe
    • O(kn) time complexity
      – n = # probes on array
      – k = # probes in a window
        » k scales linearly with window size
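To make the cost of this approach concrete, here is a rough sketch of a sliding-window rank sum test. It is not the TAS implementation: the window is defined by probe index rather than by base pairs, the p-value uses a one-sided normal approximation without tie correction, and all names are hypothetical. Every probe re-examines its whole window, so the work grows with both n and k (this naive version also re-sorts each window, adding a log factor on top of O(kn)).

```python
import math


def rank_sum_p_value(treatment, control):
    """One-sided p-value (normal approximation) that treatment exceeds control."""
    n1, n2 = len(treatment), len(control)
    # Label 0 marks treatment values, label 1 marks control values.
    pooled = sorted([(v, 0) for v in treatment] + [(v, 1) for v in control])
    rank_sum = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        # Tied values occupy ranks i+1 .. j and share their average rank.
        average_rank = (i + 1 + j) / 2.0
        rank_sum += average_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    mean = n1 * (n1 + n2 + 1) / 2.0
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (rank_sum - mean) / sd
    return 0.5 * math.erfc(z / math.sqrt(2))


def sliding_window_p_values(treatment, control, half_width):
    """p-value per probe; inputs are per-probe intensities in genomic order."""
    n = len(treatment)
    p_values = []
    for center in range(n):
        lo = max(0, center - half_width)
        hi = min(n, center + half_width + 1)
        p_values.append(rank_sum_p_value(treatment[lo:hi], control[lo:hi]))
    return p_values
```

A call such as `sliding_window_p_values(treatment, control, half_width=2)` returns one p-value per probe, with smaller values indicating enrichment of treatment over control in the surrounding window.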

Page 14

New Analysis

• Thesis Statement (revisited): The DNA replication timing profile can be reconstructed efficiently and accurately from discrete time points.
• Incorporate information from all time points
  – Continuous view of replication timing (TR50)

• Address temporally non-specific replication

• Scale up to the whole genome efficiently

Page 15

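Allelic Variation Examples

[Figure: two example signal distributions over the time axis 0, 2, 4, 6, 8, 10 (hours), each with TR50 marked at 5]

• Temporally specific replication: signal fractions 0, 0, 1/1, 0, 0 across the five intervals; TR50 = 5
• Temporally non-specific replication: signal fractions 1/6, 1/6, 1/3, 0, 1/3 across the five intervals; TR50 = 5

Challenge: From distribution of array signal, determine replication category.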
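The slides do not spell out how TR50 is computed. One reading that reproduces both examples (TR50 = 5) is: the time by which half of a probe's total signal has accumulated, interpolating linearly within a two-hour interval. The sketch below implements that reading; the function name and the interpolation rule are assumptions, not the author's stated method.

```python
def tr50(signal, interval_hours=2):
    """Hours into S-phase by which half of the total signal has accumulated.

    signal holds one non-negative value per consecutive labeling interval
    (0-2hr, 2-4hr, ...). Returns None for a probe with no signal.
    """
    total = sum(signal)
    if total <= 0:
        return None
    half = total / 2
    cumulative = 0
    for i, s in enumerate(signal):
        if s > 0 and cumulative + s >= half:
            # Assume signal accrues uniformly within the interval.
            return i * interval_hours + interval_hours * (half - cumulative) / s
        cumulative += s
    return None


print(tr50([0, 0, 1, 0, 0]))   # 5.0, the temporally specific example
print(tr50([1, 1, 2, 0, 2]))   # 5.0, the non-specific example scaled by 6
```

Both distributions give the same TR50, which is why TR50 alone cannot separate temporally specific from temporally non-specific replication.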

Page 16

Temporal Specificity Algorithm

// Is there evidence that all alleles are replicating together?
If (max sum of two adjacent time points ≥ 5/6 * total sum)
    then {probe is temporally specific}

// Is at least one allele replicating apart from the majority?
Else If (max sum of two adjacent time points not including the maximum time point ≥ 1/3 * total sum)
    then {probe is temporally non-specific}

// Isolated signal is not strong enough to be an allele.
Else {probe is temporally specific}
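A possible Python rendering of the pseudocode above follows. One phrase is ambiguous, so the sketch makes an assumption: "not including the maximum time point" is taken to mean adjacent pairs in which neither member is the single time point with the largest signal. The function name and the "TS"/"TNS" return labels are illustrative.

```python
def classify_probe(signal):
    """Return "TS", "TNS", or None (no signal) for one probe.

    signal holds one non-negative value per consecutive time point.
    """
    total = sum(signal)
    if total <= 0:
        return None
    # Sum of every pair of adjacent time points, tagged with the left index.
    pairs = [(signal[i] + signal[i + 1], i) for i in range(len(signal) - 1)]
    best_pair = max(s for s, _ in pairs)
    # best_pair >= 5/6 * total, written without division to stay exact.
    if 6 * best_pair >= 5 * total:
        return "TS"
    peak = max(range(len(signal)), key=lambda i: signal[i])
    away_from_peak = [s for s, i in pairs if peak not in (i, i + 1)]
    # max(away_from_peak) >= 1/3 * total
    if away_from_peak and 3 * max(away_from_peak) >= total:
        return "TNS"
    return "TS"


print(classify_probe([0, 0, 1, 0, 0]))   # TS
print(classify_probe([1, 1, 2, 0, 2]))   # TNS
print(classify_probe([1, 0, 8, 0, 1]))   # TS (isolated signal too weak)
```

Multiplying the thresholds out avoids floating-point rounding at the exact 5/6 and 1/3 boundaries.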

Page 17

Plotting TR50

[Figure: TR50 (hours, axis ticks 2 to 8) plotted against Chromosomal Position (in millions of bp, 33 to 34)]

• Smoothed TR50 curve recovers replication pattern
• Local minima → Possible locations of replication origin
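The idea of reading candidate origins off the smoothed curve can be sketched as follows. The deck mentions Lowess smoothing elsewhere; this sketch swaps in a plain moving average as a stand-in, and the function names and window parameter are assumptions.

```python
def moving_average(values, half_width):
    """Smooth values with a centered window, truncated at the ends."""
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half_width)
        hi = min(len(values), i + half_width + 1)
        window = values[lo:hi]
        smoothed.append(sum(window) / len(window))
    return smoothed


def local_minima(positions, values):
    """Positions whose value is strictly lower than both neighbours."""
    return [
        positions[i]
        for i in range(1, len(values) - 1)
        if values[i] < values[i - 1] and values[i] < values[i + 1]
    ]


positions = list(range(9))
tr50_values = [6, 5, 3, 5, 6, 4, 2, 4, 6]
print(local_minima(positions, tr50_values))   # [2, 6]
print(moving_average([1, 2, 3], 1))           # [1.5, 2.0, 2.5]
```

In practice the minima would be taken from the smoothed values, for example `local_minima(positions, moving_average(tr50_values, half_width))`, so that probe-level noise does not produce spurious candidates.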

Page 18

Segregation Algorithm

• Sliding window passes over probes to generate intervals
  – Ratio of TSP to TNSP determines temporal specificity
  – Average TR50 determines timing category

[Figure: state diagram over the categories TNS, Early, Mid, Late]
  TNS: Ratio < 2-to-1
  Early: Ratio ≥ 2-to-1 & Avg < 3.4
  Mid: Ratio ≥ 2-to-1 & 3.4 ≤ Avg ≤ 3.9
  Late: Ratio ≥ 2-to-1 & Avg > 3.9
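The per-window decision can be sketched as below, using the thresholds shown on the slide (a 2-to-1 ratio of TS to TNS probes, and average TR50 cut-offs of 3.4 and 3.9). The sketch assumes the average is taken over the TS probes in the window and that a window with no probes gets no category; the function and constant names are hypothetical.

```python
EARLY_BELOW = 3.4        # Avg < 3.4 is Early
LATE_ABOVE = 3.9         # Avg > 3.9 is Late
MIN_TS_PER_TNS = 2       # Ratio >= 2-to-1 means temporally specific


def classify_window(ts_tr50_values, tns_count):
    """Category for one window: "TNS", "Early", "Mid", "Late", or None.

    ts_tr50_values: TR50 of each temporally specific probe in the window.
    tns_count: number of temporally non-specific probes in the window.
    """
    ts_count = len(ts_tr50_values)
    if ts_count + tns_count == 0:
        return None  # no probes in this window
    if ts_count < MIN_TS_PER_TNS * tns_count:
        return "TNS"
    average = sum(ts_tr50_values) / ts_count
    if average < EARLY_BELOW:
        return "Early"
    if average <= LATE_ABOVE:
        return "Mid"
    return "Late"


print(classify_window([3.0, 3.2, 3.1], 1))   # Early
print(classify_window([3.5, 3.7], 0))        # Mid
print(classify_window([3.0], 1))             # TNS
```

Consecutive windows with the same category would then be merged into intervals, which is the role of the sliding window on the slide.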

Page 19

Research Plan: Profile Generation

[Figure: profile generation flowchart; stages and their outputs as far as they can be recovered]

Input: 0-2hr, 2-4hr, 4-6hr, 6-8hr, 8-10hr
→ Probe Classification (Temporal Specificity Algorithm) & TR50 Calculation: No Signal Probes, TNS Probes, TS Probes & TR50
→ Segregation Algorithm (Sliding Window): Low Probe Density, TNS Regions, TS Regions
→ Join Intervals: Joined TNS Regions, Joined TS Regions
→ Mask TS probes with JTS Regions: TS Probes that fall into JTS Regions
→ TR50 Smoothing: Smoothed TR50
→ Segregate JTS Regions into 1/3’s based on STR50: Early, Mid, Late
→ Join Intervals: Joined Early, Joined Mid, Joined Late

• Parameters to evaluate:
  – Segregation Algorithm: sliding window size, minimum probe density
  – Join Intervals: minimum interval size

Page 20

Evaluation

• Concordance of biological phenomena
  – Segregation intervals ↔ FISH
  – STR50 local minima ↔ Other origin methods
  – Correlation with other biological data
    • Gene density ↔ Early replication
    • AT content ↔ Late replication
    • Gene expression ↔ Early replication
    • Activating acetylation/methylation ↔ Early replication
• Performance on random data
  – Large quantity of TNS replication

Page 21

Research Plan: Replication Origins

• Drive DNA replication pattern
• Smoothed TR50 local minima
  – Cleaned up with new profiles
• Other biological assays
  – Early labeling fragments
  – Nascent strands
  – Bubble trapping
  – ORC binding

Page 22

Approach and Evaluation

• Correlation between methods
  – Consensus sets
    • Motif analysis
  – Positional attributes
    • Replication timing
    • Proximity to genes
• Evaluation is difficult (few validated origins)
  – Agreement between methods
  – Testing proposed correlations
  – Paper in preparation

Page 23

Scaling Up to Whole Genome

• Pilot 1% → 100% of human genome
  – Algorithms developed with scalability in mind
    • Incremental update sliding windows → Linear time
• Performance based evaluation
  – If 100% data available
    • Profile multiple runs
  – Else
    • Profile many 1% runs
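The incremental-update idea can be illustrated with a windowed mean: instead of re-summing every window, the running total is adjusted as probes enter and leave, so each probe is added once and removed at most once. This is a generic sketch under assumed names and a window measured in base pairs, not the author's code.

```python
def sliding_window_means(positions, values, window_bp):
    """Mean of values for probes within window_bp / 2 of each probe.

    positions must be sorted ascending. Total work is linear in the number
    of probes because both window edges only ever move forward.
    """
    half = window_bp / 2
    means = []
    lo = 0          # first probe inside the window
    hi = 0          # one past the last probe inside the window
    running = 0.0   # sum of values[lo:hi]
    for center in positions:
        while hi < len(positions) and positions[hi] <= center + half:
            running += values[hi]
            hi += 1
        while positions[lo] < center - half:
            running -= values[lo]
            lo += 1
        means.append(running / (hi - lo))
    return means


print(sliding_window_means([0, 10, 20, 30, 100], [1, 2, 3, 4, 5], 20))
# [1.5, 2.0, 3.0, 3.5, 5.0]
```

The same two-pointer pattern applies to the counts and sums a segregation window needs, which is what removes the factor k from an O(kn) per-window recomputation. For very long runs of floating-point data, the running sum may need periodic recomputation to limit accumulated rounding error.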

Page 24

Implementation Details

• Java
  – Class representation of proprietary microarray files
  – Algorithms to process raw microarray data
  – Diagnostic tools
• Perl
  – Scripts to process intermediate and final data
  – Correlations, data transformation, quality assurance
• R statistical language
  – Smoothing, statistical plots, correlation studies
• Shell scripts
  – Automated processing of microarray sets

Page 25

Current/Expected Contributions

• Algorithms, Software Infrastructure, Analysis
• Probe-by-probe TR50 analysis
  – Temporal Specificity Algorithm
    • Combinatorial analysis of allele locations
• Segregation Algorithm
  – TNS, Early, Mid, Late replicating areas
    • Used to design validation experiments
• Smoothed TR50 profile
  – Local minima provide candidate origin set
• Linear algorithms enable scale up
• Randomness testing

Page 26

Publications

Completed:
• ENCODE Project Consortium. The ENCODE (ENCyclopedia Of DNA Elements) Project. Science. 2004 Oct 22; 306(5696):636-40.
• ENCODE Project Consortium. Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature. {In Press, to appear in June 14, 2007 issue}
• Karnani N., Taylor C., Malhotra A., Dutta A. Pan-S replication patterns and chromosomal domains defined by genome tiling arrays of ENCODE genomic areas. Genome Research. {In Press, to appear in June 2007 issue}
• UCSC Browser Tracks: TR50, Smoothed TR50, Local Minima, Segregation

In Progress:
• Multi-million dollar NIH grant for scale up to full human genome
• Paper detailing origin methods, correlations, etc.

Page 27

Timeline

Spring 2007 (present to June 20):

Implement proposed replication profile generation algorithms
  – Generate new profiles for existing data and evaluate against FISH
  – Collect new origin sets and continue analysis for paper completion

Summer 2007 (June 21 to September 21):

Explore correlations of new profiles with other data sets

Submit paper to PSB 2008 based on new method and results

Develop random data sets to test profile generation algorithms

Fall 2007 (September 22 to December 21):

Evaluate performance for scale up to whole genome

Tie up loose ends and begin writing the dissertation

Winter 2007-2008 (December 22 to March 19):

Finish dissertation and schedule defense before May 2008

Page 28

Acknowledgements

• Advising:
  – Anindya Dutta, Gabriel Robins
• Biological Experiments:
  – Neerja Karnani, Patrick Boyle, Larry Mesner, Jamie Teer, Hakkyun Kim
• Collaborative Analysis:
  – Ankit Malhotra
• Discussions of Analysis:
  – Stefan Bekiranov

Page 29

THE END

Page 30

Why is this work computer science?

• Fred Brooks: The Computer Scientist as Toolsmith II
  – “Hitching our research to someone else’s driving problems, and solving those problems on the owners’ terms, leads us to richer computer science research.”
• Not an incremental improvement
  – Algorithmic techniques and analysis used to solve a problem previously addressed inadequately with a statistical approach that performed poorly

• Collaboration outside of engineering disciplines enhances visibility, funding opportunities, and demand for CS work

• Developed algorithms, time complexity analysis, combinatorial analysis, feedback to experimental design

Page 31

Will this work lead to any CS publications?

• The Nature article focused on analysis of the biological data and includes descriptions of some of my algorithms

• The Genome Research paper and origins paper will also contain writeups of my algorithms and analysis techniques

• The Pacific Symposium on Biocomputing focuses on algorithms and computational techniques

Page 32

Isn't your approach too simple?

• The approach isn’t simple:
  – Combinatorial analysis
  – Temporal specificity algorithm (many iterations)
  – Probewise computation to deal with binding affinity
  – Incremental updating sliding windows
    • Cross-hybridization
    • Synchronization error
  – Smoothing
    • Parameterization
  – Linear algorithms for scale up

Page 33

Can't your algorithm be replaced by a well-known statistical method?

• HMMs were used for segregation of intervals
  – Performed poorly in comparison to my algorithm
    • Less accurate categorization of replication intervals
    • Prone to rapid oscillation, producing tiny intervals
    • Parameterization was difficult
• Lowess smoothing is a statistical method
  – Parameterization was not easy

Page 34

What are the biggest challenges in this work?

• Noise!
  – The data to analyze comes from biological experiments with several sources of noise that compound upon one another
• Biology
  – I haven’t had a course in biology since 10th grade
• Microarrays
  – New, evolving technology we’re still learning to deal with
• Data size
  – Hundreds of GB of data to process
  – Replicates, failed experiments
  – Algorithms must be efficient

Page 35

What kind of career are you aiming for after graduation, and why?

• Teaching Computer Science (Small College)
  – I enjoyed learning in my undergraduate curriculum with meaningful interactions with professors
  – I taught Discrete Math at UVa in Fall ’02 and Spring ’03
    • Enjoyable, but 60-70 students too large
• Post-doctoral (Biological Computing)
  – Many opportunities around the world
  – Further exploration of the field

Page 36

How will you know when your work/thesis is done?

• Research is never really done, but you have to declare victory at some point

• The replication profiling algorithms I’ve developed already perform quite well
  – I have concrete plans to improve and finalize them