BTRY 7210: Topics in Quantitative Genomics and Genetics

Jason Mezey
Biological Statistics and Computational Biology (BSCB)
Department of Genetic Medicine
[email protected]

April 23, 2015

Lecture 8: Wrap-up and perspective: when does model fitting to genomic data “work”

A motivating example: one type of complex model fitting to data in the 1990s

Under what conditions will model fitting lead to discovery?

[Diagram: axes labeled "Signal Strength", "Robustness" and "Measure -> Mechanism"; regions labeled "Discovery" and "Prediction", each annotated with "error?" and "value?"]

Applications of computational methodology to genomic data that work?

• Methodology designed to take raw data -> accurate measurements for new technologies
    • e.g. see any new “-Seq” technology
• Cancer genomics
    • 1. evolution of cancer; 2. classification of cancer types; 3. correlating with survival outcomes
• Inherited genomics
    • 1. functional annotation by comparative genomics; 2. inferring ancestry; 3. inferring relatedness; 4. mapping of Mendelian disease variants; 5. eQTL; 6. eQTL-based network discovery

1. Cancer evolution example

intensities were converted to copy number calls. After refinement of the copy numbers, somatic copy number alterations (SCNA) were called by subtracting the signals in the tumor sample from those in the normal sample. The segments of the SCNA were identified by circular binary segmentation.

For omental samples, whole exome sequencing data was used to detect SCNA. Pair-end read data was processed by the Varscan2 copynumber and copyCaller [16] with whole exome sequencing data of blood and the following non-default parameters: max-segment-size, 250; data-ratio 0.301 for OM1 and 0.306 for OM2. These raw segment data were

Figure 1 Intra-tumoral mutational profiles of HGSC. (A) Sampling sites of tumor and normal control tissues. (B) Phylogenetic tree of somatic mutations. (C) Phylogenetic tree of somatic copy number variations. (D) Patterns of somatic mutations across samples. HGSC: high grade serous ovarian cancer, RO: right ovary, RF: right fimbriae, LO: left ovary, LF: left fimbriae, BP: bladder peritoneum, OM: omentum.

Lee et al. BMC Cancer (2015) 15:85 Page 3 of 9

2. Cancer classification example

[Figure 3 panels: (a) Class distribution of cluster, PDGM 1 to PDGM 5; (b) Kaplan–Meier curve, proportion (0.0 to 1.0) against time (months, 0 to 120), for PDGM 1 to PDGM 5; (c) Heat map of IPL, with rows labeled IL4 signalling, Thromboxane A2 signalling, IL23 signalling, IL12 signalling, TCR signalling, NFAT–calcineurin transcription, FOXM1 transcription, ERBB4, Endothelins, Angiopoietin receptor TIE2-mediated signalling; gene IPL colour scale from 2.00 to –2.00.]

and proteins function by interacting with DNA, RNA and proteins, and these interactions might be specific for a given disease subclass [102]. Many of the current targeted therapies focus on proteins that are involved in cell signalling pathways, which form a complex cellular communication system that governs basic cellular functions [103,104]. Established examples of targeted cancer treatment include EGFR-mutated non-small-cell lung cancer that can be treated with tyrosine kinase inhibitors (gefitinib or erlotinib) [105,106], ERBB2 (also known as HER2)-directed therapy in breast cancer [107,108], and melanomas with BRAF V600E mutations that can be targeted with vemurafenib [109].

A major challenge in drug development is to precisely define the subset of cancer patients that are likely to respond. Within each pathway, a range of drugs may be available, and the optimal target (and, hence, the optimal drug) will be determined by the rate-limiting protein and the individual perturbations in the pathway. In colorectal cancer, EGFR-directed therapy with monoclonal antibodies has proven to be effective [110]. However, in the presence of a downstream activating KRAS mutation, the inhibition of EGFR is ineffective [111]. It seems likely that similar mechanisms are present in cases with resistance to other cancer treatments (both targeted and more traditional chemotherapeutic agents). Iadevaia et al. [112] have proposed a computational procedure to generate experimentally testable intervention strategies for the optimal use of available drugs in a cocktail. They used reverse phase protein array to evaluate the changes in the phosphorylation status of proteins after stimulation of the MDA-MB 231 breast cancer cell line with insulin-like growth factor, and they were able to conclude that the simultaneous inhibition of MAPK and PI3K–AKT pathways was sufficient to significantly halt cell proliferation [112]. Future methods will require adding methylation and expression data to such integrative approaches. Introducing systematic clinical screenings for mutations that perturb these pathways is of great importance to identify the targets for targeted therapies and the patients that will respond to each treatment.

Outcome prediction that is based on genomic data is another central area of genomic research, and it has proven to be promising in breast cancer. One of the crucial issues in retrospective studies is that treatment selection is mostly based on the predicted risk of recurrence. Thus, treatment might be confounded by prognosis. This challenges the identification of pure prognostic markers, as the treatment interaction is not known. Even though the results from prospective validation trials, such as the Microarray In Node-negative and 1–3 positive lymph node Disease may Avoid Chemotherapy (MINDACT) trial and the Trial assigning individualized options for treatment (TailorX), are still pending, prediction tools based on gene expression are included in some clinical guidelines [113,114]. Optimal strategies for risk prediction are, however, not settled and remain controversial. Crowdsourcing strategies for problem solving, which were previously successfully applied to biology in areas such as the prediction of protein folding and function [115,116], have been applied to this problem. In the DREAM BCC competition [33], participants competed to create an algorithm that could predict, more accurately than current benchmarks, the prognosis of patients with breast cancer from clinical information (age, tumour size and histological grade), genome-scale tumour mRNA expression data and DNA copy-number data from 1,980 patients [33]. Integration of data was encouraged, and more than 1,400 models were submitted. The winners used a mathematical approach that was based on co-expression gene networks associated with tumour phenotype and

Figure 3 | Classifying breast cancer using PARADIGM. All multiple layers of high-throughput molecular data described in FIG. 2, including DNA methylation, DNA copy number alterations, mRNA expression, microRNA (miRNA) expression as well as TP53-mutation status, were subjected to integrated analysis using the PAthway Recognition Algorithm using Data Integration on Genomic Models (PARADIGM). This resulted in five clusters (part a) with survival differences (part b) and this was validated in multiple other datasets [87]. A heat map of integrated pathway levels (IPLs) is shown in part c. FOXM1, forkhead box M1; IL, interleukin; PDGM, PARADIGM cluster; TCR, T cell receptor; TIE2, tyrosine kinase, endothelial. Figure is reproduced, with permission, from REF. 87.
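The survival comparison in part b rests on Kaplan–Meier curves. As a rough illustration of how such a curve is computed for one cluster, here is a minimal sketch of the standard product-limit estimator. The function name and the toy follow-up times are hypothetical and are not taken from the figure or the cited study.

```python
from collections import Counter


def kaplan_meier(times, events):
    """Product-limit estimate of survival, returned as [(time, S(time))].

    times  : follow-up time per patient (e.g. months)
    events : 1 if the event (e.g. death) was observed at that time, 0 if censored
    """
    deaths = Counter(t for t, e in zip(times, events) if e)
    exits = Counter(times)      # every patient leaves the risk set at their own time
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        d = deaths.get(t, 0)
        if d:                   # the curve only steps down at observed event times
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        at_risk -= exits[t]     # remove both events and censored patients
    return curve


# Hypothetical toy cohort (illustration only)
times = [5, 8, 8, 12, 20, 20, 30]
events = [1, 1, 0, 1, 0, 1, 0]
curve = kaplan_meier(times, events)
for t, s in curve:
    print(f"t={t:>3}  S(t)={s:.3f}")
```

Running the estimator once per cluster and overlaying the resulting step functions gives a plot of the kind shown in part b; testing whether the curves differ would additionally need a log-rank test, which is not sketched here.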

REVIEWS

308 | MAY 2014 | VOLUME 14 www.nature.com/reviews/cancer

© 2014 Macmillan Publishers Limited. All rights reserved

3. Cancer correlating with outcomes

[Figure 5 panels: (a) heat map with annotation tracks GI (Low to High), IntClust (IntClust 1 to IntClust 10), PAM50 (Basal, HER2, Luminal A, Luminal B, Normal), ER (ER NEG, ER POS), TP53 (wild type, mutant, null allele), Grade (Grade 1 to Grade 3) and NPI (NPI 2 to NPI 6), and rows marked 1q21–42, 8p12, 8q24, 11q13/14, 6p13–q24, 17q12, 17q25; (b) Discovery set: disease-specific survival probability (0.0 to 1.0) against months (0 to 150), log-rank P = 1.2 × 10^-14, with samples (deaths) per cluster: IntClust1: 74 (18); IntClust2: 45 (20); IntClust3: 150 (19); IntClust4: 164 (32); IntClust5: 91 (48); IntClust6: 44 (14); IntClust7: 109 (21); IntClust8: 140 (34); IntClust9: 67 (24); IntClust10: 96 (30).]

would be to develop statistical models to identify crucial, rate-limiting molecular targets for intervention, out of the wealth of information that next-generation sequencing uncovers, on the background of great redundancy of pathways and heterogeneity of tumours. As we are moving towards an era in which the amount of data produced every year is increasing exponentially, the biomedical community needs to embrace this complexity and find new methods of shared analysis. We need to learn from physicists

Figure 5 | Classifying breast cancer using integrative clustering. Integrative clustering of 997 breast cancer cases from the METABRIC cohort, based on segmented copy number and gene expression for the top 1,000 cis-acting copy number-expression associations. Heatmap showing the product of scaled gene expression and copy number values for the selected features and for k = 10 clusters; columns represent breast cancer cases and rows represent features (part a). Kaplan–Meier plot of disease-specific survival (truncated at 15 years) for the integrative subgroups. For each cluster, the number of samples at risk is indicated as well as the total number of deaths in parentheses (part b). ER, oestrogen receptor; ER NEG, ER negative; ER POS, ER positive; GI, genomic instability based on the proportion of genome altered (black line) and jump measure (red line); grade, genomic grade; IntClust, groups found using integrative clustering with k = 10 clusters; NPI, Nottingham prognostic index; PAM50, gene expression subtyping based on the PAM50 gene signature. Figure is reproduced, with permission, from REF. 49 © (2012) Macmillan Publishers Ltd. All rights reserved.

REVIEWS

310 | MAY 2014 | VOLUME 14 www.nature.com/reviews/cancer

© 2014 Macmillan Publishers Limited. All rights reserved

1. Comparative genomics example

2. Discovery of Mendelian disease variants

[Excerpt from a Molecular Vision article, recovered from a font-encoding error in the extraction; digits and some capital letters were lost and are marked […].]

photoreceptors. This mat appeared to correspond to the gap seen between the RPE and the OSL by light microscopy […].

Of the approximately […] loci represented on the Affymetrix vs[…] SNP chip, […] passed the quality control incorporated into the M[…] algorithm. One affected sample was contaminated and was excluded from the analysis. Comparison of two duplicated samples, to assess the consistency of the M[…] program, found […] of the calls to be identical in the first duplicate, and […] in the second duplicate, for an average of […], and most mismatched calls were between an unclustered call and a defined call.

Fisher Exact analysis, comparing genotypes of cases to controls, identified an association signal on […] (Figure […]) extending over […] Mb and including […] SNPs that exceeded the Bonferroni corrected significance threshold (-Log p range […]–[…]; Table […]). The peak p-value (-Log p […]) was shared by six SNPs comprising an interval of approximately […] Mb ([…]). All affected dogs were homozygous at these SNP loci, suggesting a recessive mode of inheritance. With that in mind, all genotype calls on […] were aligned to identify a homozygosity block, that is, where all genotype calls for all affected dogs were homozygous, between

Figure […]. Results of genome-wide association study in canine cone-rod dystrophy […]. The association signal (y-axis; negative Log [Fisher exact test, […]-tailed probability]) for association between canine single nucleotide polymorphism (SNP) genotype and canine cone-rod dystrophy […] (crd[…]) phenotype, plotted against SNP chromosomal location […], demonstrates a distinct peak on canine chromosome […]. Green dots are SNPs for which the association signal exceeded the Bonferroni threshold for genome-wide significance. Chromosome X is represented by the numbers […] and […]. Homozygosity analysis of SNP genotypes […] in the region of […] yielding the peak association signal, reveals heterozygosity throughout the interval in […] nonaffected control dogs, and demonstrates a […] Mb homozygosity block in […] crd[…]-affected dogs. Genotypes are color coded as follows: pink and green represent the major and minor genotypes observed in affecteds, respectively; yellow is heterozygous, and white is missing data. Black lines border the […] Mb homozygosity block. Refseq genes screened as potential positional candidates for crd[…] in the present study (arrowheads), and […]M family genes identified within the crd[…] minimal linkage disequilibrium interval (arrows) are indicated with annotation and order consistent with the […]an[…]am[…] canine genome assembly ([…] not drawn to scale).

[…] http://www.molvis.org/molvis/v[…]a[…] © […] Molecular Vision
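Mapping a Mendelian variant with a case/control scan comes down to one association test per SNP and a multiple-testing threshold. The following is a minimal sketch of a two-sided Fisher's exact test on allele counts with a Bonferroni cutoff. The SNP names, the counts and the 0.05 significance level are hypothetical illustrations, not values from the study shown on this slide.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(a + b + c + d, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum every table that is no more probable than the observed one
    total = sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-9))
    return min(1.0, total)


# Hypothetical allele counts per SNP: (case A, case B, control A, control B)
allele_counts = {
    "snp_peak": (20, 0, 6, 14),   # cases carry only one allele, as for a recessive locus
    "snp_null": (10, 10, 11, 9),  # no difference between cases and controls
}

alpha = 0.05                              # assumed family-wise error rate
threshold = alpha / len(allele_counts)    # Bonferroni: divide by the number of tests
p_values = {snp: fisher_exact_two_sided(*t) for snp, t in allele_counts.items()}
hits = [snp for snp, p in p_values.items() if p < threshold]

for snp, p in p_values.items():
    print(f"{snp}: p = {p:.2e}, significant = {snp in hits}")
```

In a real genome-wide scan the number of tests is the number of SNPs passing quality control, so the threshold becomes far stricter than in this two-SNP toy case.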

3. Inferring ancestry example

applied fineSTRUCTURE to the PoBI samples’ genetic data without reference to the known geographical locations. The genetic clustering can be assessed with respect to geography by plotting individuals on a map of the UK (at the centroid of their grandparents’ places of birth) and examining the inferred genetic clusters, for different levels of the hierarchical clustering.

Figure 1 shows this map for 17 clusters, together with the tree showing how these clusters are related at coarser levels of the hierarchy. (There is nothing special about this level of clustering, but it is convenient for describing some of the main features of our analysis; Supplementary Fig. 1 depicts maps showing other levels of the hierarchical clustering.) The correspondence between the genetic clusters and geography is striking: most of the genetic clusters are highly localized, with many occupying non-overlapping regions. Because the genetic clustering made no reference to the geographical location of the samples, the resulting correspondence between genetic clusters and geography reassures us that our approach is detecting real population differentiation at fine scales. Our approach can separate groups in close proximity, such as in Cornwall and Devon in southwest England, where the genetic clusters closely match the modern county boundaries, or in Orkney, off the north coast of Scotland.

It is instructive to consider the tree that describes the hierarchical splitting of the 2,039 genotyped individuals into successively finer clusters (Fig. 1). The coarsest level of genetic differentiation (that is, the assignment into two clusters) separates the samples in Orkney from all others. Next the Welsh samples separate from the other non-Orkney samples. Subsequent splits reveal more subtle differentiation (reflected in the shorter distances between branches), including separation of north and south Wales, then separation of the north of England, Scotland and Northern Ireland from the rest of England, and separation of samples in Cornwall from the large English cluster. There is a single large cluster (red squares) that covers most of central and southern England and extends up the east coast. Notably, even at the finest level of differentiation returned by fineSTRUCTURE (53 clusters), this cluster remains largely intact and contains almost half the individuals (1,006) in our study.

Although larger than between the sampling locations, estimated FST values between the clusters represented in Fig. 1 are small (average 0.002, maximum 0.007, Supplementary Table 2), confirming that differentiation is subtle. On the other hand, all comparisons between pairs of clusters of their patterns of ancestry as estimated by fineSTRUCTURE show highly significant differences (Supplementary Table 3).
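To give a feel for how small an FST of about 0.002 is, here is a minimal sketch using Wright's heterozygosity-based definition, FST = (H_T - H_S) / H_T, for two equally sized groups. This is an assumption for illustration: the excerpt does not say which estimator the authors used, and the allele frequencies below are hypothetical.

```python
def fst_two_populations(freqs_pop1, freqs_pop2):
    """Wright's FST for two equally weighted populations.

    freqs_pop1, freqs_pop2 : per-locus frequencies of the same allele in each
    population. Heterozygosities are summed over loci before taking the ratio.
    """
    h_within = 0.0  # H_S: mean expected heterozygosity within populations
    h_total = 0.0   # H_T: expected heterozygosity of the pooled population
    for p1, p2 in zip(freqs_pop1, freqs_pop2):
        p_bar = (p1 + p2) / 2
        h_within += (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2
        h_total += 2 * p_bar * (1 - p_bar)
    if h_total == 0:
        return 0.0  # no variation at any locus, so no differentiation to measure
    return (h_total - h_within) / h_total


# Hypothetical single locus: allele frequency 0.50 in one cluster, 0.54 in the other
fst = fst_two_populations([0.50], [0.54])
print(f"FST = {fst:.4f}")
```

A four-percentage-point difference in allele frequency at a common variant already gives an FST of this order, which is why fine-scale structure of this kind only becomes visible when many loci are combined.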

[Figure 1 map labels (cluster names): Cornwall; Devon; S Pembrokeshire; N Pembrokeshire; Welsh borders; N Wales; N Ire./S Scotland; N Ire./W Scotland; NE Scotland 2; NE Scotland 1; Orkney 2; Orkney 1; Westray; Cent./S England; W Yorkshire; Cumbria; Northumbria. Tree labels: Orkney; Wales; Central/south England; Scotland/north England.]

Figure 1 | Clustering of the 2,039 UK individuals into 17 clusters based only on genetic data. For each individual, the coloured symbol representing the genetic cluster to which the individual is assigned is plotted at the centroid of their grandparents’ birthplaces. Cluster names are in side-bars and ellipses give an informal sense of the range of each cluster (see Methods). No relationship between clusters is implied by the colours/symbols. The tree (top right) depicts the order of the hierarchical merging of clusters (see Methods for the interpretation of branch lengths). Contains OS data © Crown copyright and database right 2012. © EuroGeographics for some administrative boundaries.

RESEARCH ARTICLE

310 | NATURE | VOL 519 | 19 MARCH 2015

©2015 Macmillan Publishers Limited. All rights reserved

4. Inferring relatedness example

Red: MZ twin / replicate
Orange: Parent-child / sibling
Yellow: Half-sibling / aunt / uncle / grandparent
Green: First cousin / great-grandparent
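The four colour classes correspond to expected kinship coefficients of 1/2, 1/4, 1/8 and 1/16. A minimal sketch of how an estimated pairwise kinship coefficient could be binned into these classes follows. The cutoffs are the powers of two halfway (on a log scale) between the expected values, as used in KING-style relationship inference; the slide does not state which method or thresholds produced the figure, so treat them as an assumption.

```python
# (lower cutoff, colour from the slide legend, relationship class)
# Expected kinship: MZ twin 1/2, first degree 1/4, second degree 1/8, third degree 1/16
CLASSES = [
    (2 ** -1.5, "red", "MZ twin / replicate"),
    (2 ** -2.5, "orange", "parent-child / sibling"),
    (2 ** -3.5, "yellow", "half-sibling / aunt / uncle / grandparent"),
    (2 ** -4.5, "green", "first cousin / great-grandparent"),
]


def relationship_class(kinship):
    """Return (colour, label) for an estimated kinship coefficient."""
    for cutoff, colour, label in CLASSES:
        if kinship > cutoff:
            return colour, label
    return None, "more distant or unrelated"


# Hypothetical estimates for a few pairs of samples
for phi in (0.5, 0.25, 0.125, 0.0625, 0.0):
    print(phi, relationship_class(phi))
```

Real estimates are noisy, so pairs near a cutoff can fall on either side; the bins are a convenience for colouring a plot, not a proof of pedigree.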

5. expression Quantitative Trait Loci (eQTL)

[Figure: ERAP2 expression (axis 3.5 to 6.0) plotted against rs1908530 genotype (T/T, T/C, C/C) and against rs27290 genotype (A/A, A/G, G/G), with panels labeled "cis eQTL" and "No eQTL"; genome-wide plot of -log10(p-value) against chromosome.]

• An eQTL is a case where allelic variation at a site in the genome is directly responsible for altering gene expression level.

• eQTL can be “discovered” by analyzing statistical associations between genome-wide genotypes and expression variables
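As an illustration of the association analysis in the second bullet, here is a minimal sketch of a single-SNP eQTL test: ordinary least-squares regression of expression on genotype dosage (0, 1 or 2 copies of one allele). The function name and the expression values are hypothetical; they are not read off the figure.

```python
from math import sqrt


def eqtl_regression(dosage, expression):
    """Regress expression on genotype dosage; return (slope, intercept, t statistic)."""
    n = len(dosage)
    mean_x = sum(dosage) / n
    mean_y = sum(expression) / n
    sxx = sum((x - mean_x) ** 2 for x in dosage)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(dosage, expression))
    slope = sxy / sxx                      # expression change per extra allele copy
    intercept = mean_y - slope * mean_x
    rss = sum((y - intercept - slope * x) ** 2 for x, y in zip(dosage, expression))
    se = sqrt(rss / (n - 2) / sxx)         # standard error of the slope
    return slope, intercept, slope / se


# Hypothetical data: three individuals per genotype class
dosage = [0, 0, 0, 1, 1, 1, 2, 2, 2]
expr_cis = [4.0, 4.2, 4.1, 4.8, 5.0, 4.9, 5.6, 5.8, 5.7]    # shifts with genotype
expr_none = [4.9, 5.1, 5.0, 5.0, 4.9, 5.1, 5.1, 4.9, 5.0]   # no genotype effect

slope_cis, _, t_cis = eqtl_regression(dosage, expr_cis)
slope_none, _, t_none = eqtl_regression(dosage, expr_none)
print(f"cis eQTL: slope={slope_cis:.2f}, t={t_cis:.1f}")
print(f"no eQTL:  slope={slope_none:.2f}, t={t_none:.1f}")
```

The t statistic is compared with a t distribution on n - 2 degrees of freedom to get a p-value; repeating the test for every SNP-gene pair and plotting -log10(p) by position gives the genome-wide view. A significant association alone does not show that the tested site is the causal one.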

6. Using eQTL to find new regulatory relationships example

Correlating derived variables from whole-genome data with health variables / life sensing data?

• Development of the appropriate regularization methodology underway...

• Are the signals strong and robust enough for prediction?

That’s it for today!