Sapio Sciences Analysis Report – Micronodular Adrenocortical Hyperplasia Research Paper Supplement
2 ANALYTIC PROCESS
2.1 STEP 1 – ASSOCIATION STATISTICS
2.2 STEP 2 – GENETIC ALGORITHM
2.3 STEP 3 – LOSS OF HETEROZYGOSITY
2.4 STEP 4 – EM ALGORITHM
3 STUDY GROUP INFORMATION
4.1 AS MODULE RESULTS SUMMARY
4.1.1 Statistics By Genotype
4.1.1.1 Top 35 Results By Genotype
4.1.2 Statistics By Allele
4.1.2.1 Top 35 Results By Allele
4.1.3 Fishers Exact Analysis Statistics Discussion
4.1.4 Statistics - Odds Ratio By Genotype
4.1.4.1 Top 35 Results By Genotype
4.1.5 Statistics: Odds Ratio By Allele
4.1.5.1 Top 35 Results By Allele
4.1.6 GA Module Results Summary
4.1.6.1 GA Based Analysis
4.1.6.2 GA Results
4.1.6.3 Genetic Algorithm SNP List
4.1.7 CA Module Results Summary
4.1.8 LOH Analysis on Select Individual Samples
4.1.8.1 Results for Family CAR545
4.1.8.2 Results for Family CAR36
4.1.8.3 Results for Family CAR14
4.1.9 EM Algorithm Results
4.1.9.1 Top 100 Results By p-value
5.1 FINAL SNP LIST FOR FURTHER REVIEW
3 04/24/06 3:33 PM, Sapio Sciences, LLC
6 APPENDIX A
6.1 LOSS OF HETEROZYGOSITY CALCULATIONS
6.1.1 Analysis-Set Defined Block Size
6.1.2 LOH P-Value
Supplementary Tables
TABLE 1 - STUDY GROUP DETAILS
TABLE 2 - STATISTICS - TOP 35 RESULTS BY GENOTYPE
TABLE 3 - STATISTICS - TOP 35 RESULTS BY ALLELE
TABLE 4 - ODDS RATIO - TOP 35 RESULTS BY GENOTYPE
TABLE 5 - ODDS RATIO - TOP 35 RESULTS BY ALLELE
TABLE 6 - GENETIC ALGORITHM SNP LIST
TABLE 7 - LOSS OF HETEROZYGOSITY RESULTS
TABLE 8 - EM ALGORITHM - TOP 100 RESULTS BY P-VALUE
TABLE 9 - FINAL SNP LIST FOR FURTHER REVIEW
Supplementary Figures
FIGURE 1 - GA MODEL 1, EXP 124 (RUN 20)
FIGURE 2 - GA MODEL 1, EXP 124 (RUN 20) - PREDICTIONS
FIGURE 3 - GA MODEL 7, EXP 124 (RUN 20)
FIGURE 4 - GA MODEL 7, EXP 124 (RUN 20) - PREDICTIONS
FIGURE 5 - GA MODEL 8, EXP 124 (RUN 20)
FIGURE 6 - GA MODEL 8, EXP 124 (RUN 20) - PREDICTIONS
FIGURE 7 - GA MODEL 9, EXP 124 (RUN 20)
FIGURE 8 - GA MODEL 9, EXP 124 (RUN 20) - PREDICTIONS
FIGURE 9 - GA MODEL 1, EXP 128 (RUN 23)
FIGURE 10 - GA MODEL 1, EXP 128 (RUN 23) - PREDICTIONS
FIGURE 11 - GA MODEL 2, EXP 128 (RUN 23)
FIGURE 12 - GA MODEL 2, EXP 128 (RUN 23) - PREDICTIONS
1 Overview
Sapio Sciences was engaged to analyze genotyping data from samples submitted by NICHD to Genome Explorations. Genome Explorations ran the samples on Affymetrix 10K Xba 142 arrays, producing more than 10,000 genotypes for each sample. The samples were derived from the adrenal tissue of affected individuals as well as from peripheral blood. There were 34 total samples, 18 cases and 16 controls. The case phenotype was a form of adrenal cancer called “Micronodular (non-pigmented) Adrenocortical Hyperplasia” (MAH) that occurs early in life.
Sapio Sciences will use its Exemplar Genotyping Analysis Suite to perform various analyses on the supplied data. The objective is to reduce the 10,000+ SNPs to a smaller subset of interesting candidates for further exploration. Extensive reports are included in this document to detail analytic results and support the findings of the analyses. The Exemplar modules to be utilized are:
1. Genetic Algorithm Module (GA Module) – This module implements an Artificial Intelligence approach to finding logical combinations of SNPs for association-based studies.
2. Association Study Module (AS Module) – This module calculates many useful statistics such as Chi Square, Yates, Fisher Exact, Odds Ratio, LD, D’, etc.
3. Chromosome Alteration Module (CA Module) – This module performs LOH analysis on the dataset, using user-specified controls as the reference set, to identify possible deletions in the chromosome.
4. EM Algorithm – This module performs haplotype-based analysis on the dataset to identify possible associations between haplotype pairs and the phenotype.
The difficulty with such a small sample size is the lack of statistical power. Nonetheless, we hoped that by performing multiple types of analysis on the data, we could reduce the problem space from ~10,000 SNPs to fewer than 50 SNPs for consideration. The NICHD researchers could then apply their biological knowledge of these disorders to this reduced set of data and further prune the list of interesting SNPs. Resequencing would then be used to determine whether any of the final SNP set was related to MAH.
2 Analytic Process
2.1 STEP 1 – Association Statistics
Exemplar’s AS Module will first be utilized to provide extensive statistical analysis of the dataset, including:
1. Fisher’s exact test by genotype and by allele
2. Odds ratio by genotype and by allele
The AS Module will also be used for feature selection on the dataset before it is input to the GA Module.
2.2 STEP 2 – Genetic Algorithm
Exemplar’s GA Module will be run against the dataset many times with various parameter settings. A brief overview of what will be run follows:
1. The GA Module will be run against the entire input dataset and will attempt to build models of the smallest size that can effectively predict outcomes while minimizing false positives and maximizing true positives. Models of different sizes and types will be attempted to improve results as necessary.
2. Various feature selection methods will be employed to reduce the input parameter space. These will include:
a. Statistical reduction (usually Fisher’s exact test), whereby a p-value is calculated for each SNP and any SNP whose p-value does not fall below a certain threshold is eliminated.
b. Minor allele frequency changes: the minor allele frequency is calculated for each SNP for cases and for controls, and if the difference between the two (called the variance in this report) is below a defined threshold, the SNP is eliminated from consideration.
Comprehensive model results will be provided in this report, including:
1. Model predictive results for each sample
2. Model statistical p-values when possible
3. Relevant ontologies for GA-discovered SNPs
4. Complete details of each discovered SNP, including its ID, position, chromosome, and related genes.
2.3 STEP 3 – Loss of Heterozygosity
Exemplar’s CA Module will be run against the dataset to detect possible deletions in the chromosomes by looking for loss of heterozygosity (LOH). Each SNP will be assigned a p-value.
2.4 STEP 4 – EM Algorithm
Exemplar’s EM Algorithm will be run against the dataset to identify possible associations between haplotype pairs and the phenotype. Each haplotype will be assigned a p-value. NOTE: this analysis was done after the PDE11A gene had already been identified, as a test of this new method’s addition to the Exemplar software. These results were therefore not presented to researchers when they began detailed analysis of the final SNP list and regions of interest we provided. Notably, this analysis confirmed PDE11A’s involvement: a haplotype pairing from within the gene was associated with MAH with a significant p-value.
3 Study Group Information
There were 35 samples in total, with 18 affected and 17 unaffected. Two samples existed for most cases, one from peripheral blood and the other taken directly from the tumor. When possible, blood samples were collected from unaffected parents, and these served as the controls in the study. There is some ethnic diversity within the group, although most were Caucasian. The table below gives the details of each sample to be analyzed:
#   Code            Relation  Notes
1   CAR 14.02       mother    onset much later in life
    CAR 14.03       proband   mother also affected
    CAR 14.03.AT
2   CAR 24.02       proband   onset later in life
    CAR 24.03
    CAR 24.04
    CAR 24.02.AT
3   CAR 36.01       father
    CAR 36.02       mother
    CAR 36.03       proband
    CAR 36.03.AT
4   CAR 54.02       mother
    CAR 54.03       proband
    CAR 54.03.AT
5   CAR 61.01       father
    CAR 61.02       mother
    CAR 61.03       proband
    CAR 61.03.AT
6   CAR 504.01      proband   chromosomal aberration
    CAR 504.02      mother    Pakistani
    CAR 504.03      father
    CAR 504.01.AT
7   CAR.538.01      father
    CAR.538.02      mother
    CAR.538.03      proband   African American
    CAR 538.03.AT
8   CAR 545.01      father
    CAR 545.02      mother
    CAR 545.03      proband
    CAR 545.03.AT
9   CAR 551.01      father    a bit later in life
    CAR 551.02      mother
    CAR 551.03      proband
    CAR 551.03.AT
10  CAR 559.01      father    a bit later in life
    CAR 559.02      mother    Chilean
    CAR 559.03      proband
    CAR 559.03.AT
11  CAR 583.01
    CAR 583.02
    CAR 583.03      proband
Table 1 - Study Group Details
4 Analysis Results
4.1 AS Module Results Summary
Multiple statistics were generated for each SNP in the input dataset. We utilized Fisher’s exact two-tailed test to compute all p-values.
4.1.1 Statistics By Genotype
This statistic is generated by building 2 x 2 contingency tables from genotype counts (note that these are genotype counts, not allele counts). For example, suppose there is a SNP RS001. Given the three possible genotypes AA, AB and BB, Exemplar generates a contingency table for each of the three and therefore obtains a p-value for each SNP/genotype combination. The provided p-values are corrected for type 1 errors using Bonferroni correction.
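The per-genotype test described above can be sketched in Python as follows. This is a minimal illustration, not Exemplar's implementation: the function names, the "NoCall" handling, and the example genotypes are assumptions made for this sketch. The Bonferroni step multiplies each raw p-value by the number of tests performed; capping the result at 1 is a common convention assumed here, whereas the adjusted column in Table 2 contains values above 1, so the report's own figures appear to be uncapped.

```python
from math import comb

CALLED = ("AA", "AB", "BB")  # anything else (e.g. "NoCall") is ignored


def fisher_exact_two_tailed(a, b, c, d):
    """Two-tailed Fisher's exact p-value for the 2 x 2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    observed = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum every table that is at least as unlikely as the observed one.
    p = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= observed * (1 + 1e-9))
    return min(1.0, p)


def genotype_p_values(case_genotypes, control_genotypes):
    """One raw p-value per genotype (AA, AB, BB) for a single SNP."""
    cases = [g for g in case_genotypes if g in CALLED]
    controls = [g for g in control_genotypes if g in CALLED]
    results = {}
    for genotype in CALLED:
        a = cases.count(genotype)      # cases with this genotype
        c = controls.count(genotype)   # controls with this genotype
        results[genotype] = fisher_exact_two_tailed(
            a, len(cases) - a, c, len(controls) - c
        )
    return results


def bonferroni(p_value, n_tests):
    """Bonferroni-adjusted p-value, capped at 1 (the cap is an assumption)."""
    return min(1.0, p_value * n_tests)


# Hypothetical genotypes for the example SNP RS001 (not data from this study).
rs001_cases = ["AB", "AB", "AB", "AA"]
rs001_controls = ["AA", "AA", "AA", "AB"]
raw = genotype_p_values(rs001_cases, rs001_controls)
print(raw["AB"])  # 34/70, about 0.4857
```

Each SNP yields three raw p-values, one per genotype, and each would be passed through `bonferroni` with the total number of tests run.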
4.1.1.1 Top 35 Results By Genotype:
SNP Category Fisher Exact Adjusted p-value Chromosome Position Cytoband Related Genes
RS1404090 AB 0.00000685 0.00767358 2 76402264 2p12 C2orf3 LRRTM4
RS27220 AB 0.00009541 0.10686091 5 59791021 5q12.1 PDE4D
RS1221150 AB 0.00009592 0.10743016 2 81960032 2p12 SUCLG1
RS1964562 AB 0.00009592 0.10743016 15 76762310 15q25.1 ADAMTS7 CHRNB4
RS6539866 AB 0.00009592 0.10743016 12 83740879 12q21.31 SLC6A15 DKFZp762A217
RS58982 AB 0.00014616 0.16370309 5 104236349 5q21.2 EFNA5 NUDT12
RS691140 AA 0.00021229 0.23776227 2 219829544 2q35 TTLL4 CYP27A1
RS967305 AB 0.00021229 0.23776227 20 49145368 20q13.13 SLC9A8
RS2616552 AB 0.00027406 0.30694330 3 11641798 3p25.3 KIAA0121
RS967445 AB 0.00027406 0.30694330 21 43090530 21q22.3 WDR4 PDE9A
RS2395166 AB 0.00030146 0.33763763 6 32459456 6p21.32
RS344214 AB 0.00030146 0.33763763 8 62987120 8q12.3 ASPH FLJ39630
RS719316 AB 0.00030146 0.33763763 6 16780739 6p22.3 SCA1
RS1382883 AB 0.00035772 0.40064178 5 31568782 5p13.3 RNASE3L
RS1411445 AB 0.00043849 0.49110928 9 120294727 9q33.2 NDUFA8
RS1601981 AB 0.00046001 0.51520703 12 28961846 12p11.22 MLSTD1 FLJ11088
RS1014984 AB 0.00086670 0.97070819 6 16778258 6p22.3 SCA1
RS1380833 AB 0.00086670 0.97070819 18 39749816 18q12.3 SETBP1 SYT4
RS1380896 AB 0.00086670 0.97070819 8 71285928 8q13.3 NCOA2
RS1383618 AB 0.00086670 0.97070819 4 74135822 4q13.3 ADAMTS3 FLJ38991
RS1570487 AB 0.00086670 0.97070819 6 16791838 6p22.3 SCA1
RS1371624 AB 0.00098097 1.09868175 2 182400530 2q31.3 ITGA4 UBE2E3
RS2085808 AB 0.00109945 1.23138431 13 64216386 13q21.32 PCDH9 FLJ25694
RS2622821 AB 0.00112926 1.26477112 15 76762512 15q25.1 CHRNB4 ADAMTS7
RS1342180 AB 0.00120585 1.35055053 6 116117682 6q22.1 FRK HS3ST5
Table 2 - Statistics - Top 35 Results by Genotype
4.1.2 Statistics By Allele
This statistic is generated by building 2 x 2 contingency tables from allele counts. For example, suppose there is a SNP RS001. Given two possible alleles A and B, Exemplar generates a 2 x 2 contingency table from allele counts and computes a p-value based on the SNP’s allele frequencies. The provided p-values are corrected for type 1 errors using Bonferroni correction.
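To make the difference from the genotype-based table concrete, the following minimal Python sketch builds the allele-count table for one SNP. It assumes that each called genotype contributes two alleles (AA gives two A, AB gives one of each, BB gives two B) and that uncalled genotypes are skipped; the function names and example genotypes are hypothetical.

```python
CALLED = ("AA", "AB", "BB")  # anything else (e.g. "NoCall") is ignored


def allele_counts(genotypes):
    """Return (count of A alleles, count of B alleles) over called genotypes."""
    a_count = b_count = 0
    for genotype in genotypes:
        if genotype not in CALLED:
            continue
        a_count += genotype.count("A")
        b_count += genotype.count("B")
    return a_count, b_count


def allele_table(case_genotypes, control_genotypes):
    """2 x 2 table: rows are cases/controls, columns are A/B allele counts."""
    return [list(allele_counts(case_genotypes)),
            list(allele_counts(control_genotypes))]


# Hypothetical genotypes for the example SNP RS001 (not data from this study).
table = allele_table(["AA", "AB", "AB", "NoCall"], ["BB", "AB", "BB"])
print(table)  # [[4, 2], [1, 5]]
```

The resulting table can be passed to any two-tailed Fisher's exact test implementation, giving one p-value per SNP rather than one per SNP/genotype combination.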
4.1.2.1 Top 35 Results By Allele:
SNP Fisher Exact Adjusted p-value Chromosome Position Cytoband Related Genes
4.1.3 Fishers Exact Analysis Statistics Discussion
As stated earlier, the statistical power of this study is low. Once correction was applied to the genotype and allele statistics, only 2 SNPs fell below the significance threshold of p<.05 (RS1404090 and RS691140). To expand the number of SNPs to consider, we looked for SNPs from proximate cytobands between the two analyses. We have color coded the SNPs in the genotype-based and allele-based statistics where there was SNP proximity. Only 3 regions appeared in both tests: 2q35, 2q31 and 15q25. We further note that a SNP from 2q33.3 is on the allele list, giving us several SNPs in the 2q3# region that were statistically significant. As a reference point, the Affymetrix 10K platform that served as the basis for this study has 155 SNPs in the region between 2q31.1 and 2q35, which covers positions 178470389-221390185, or roughly 43 million base pairs.
4.1.4 Statistics - Odds Ratio By Genotype
This statistic is generated by building 2 x 2 contingency tables from genotype counts (note that these are genotype counts, not allele counts). For example, suppose there is a SNP RS001. Given the three possible genotypes AA, AB and BB, Exemplar generates a contingency table for each of the three and therefore obtains an odds ratio for each SNP/genotype combination. These can then be read as: “RS691149 as AA has an Odds Ratio of 28.00. RS691149 as AB has an Odds Ratio of ##.##,” etc.
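The odds ratio for one SNP/genotype combination can be sketched as below, using the standard definition (a x d) / (b x c) for the table [[a, b], [c, d]]. The report does not say how Exemplar treats tables containing a zero cell, so returning infinity or NaN in that case is an assumption of this sketch, as are the function names and the example counts.

```python
import math

CALLED = ("AA", "AB", "BB")  # anything else (e.g. "NoCall") is ignored


def odds_ratio(a, b, c, d):
    """Odds ratio for [[a, b], [c, d]]; zero-cell handling is an assumption."""
    if b * c == 0:
        return math.inf if a * d > 0 else math.nan
    return (a * d) / (b * c)


def genotype_odds_ratios(case_genotypes, control_genotypes):
    """One odds ratio per genotype (AA, AB, BB) for a single SNP."""
    cases = [g for g in case_genotypes if g in CALLED]
    controls = [g for g in control_genotypes if g in CALLED]
    ratios = {}
    for genotype in CALLED:
        a = cases.count(genotype)      # cases with this genotype
        c = controls.count(genotype)   # controls with this genotype
        ratios[genotype] = odds_ratio(a, len(cases) - a, c, len(controls) - c)
    return ratios


# Hypothetical counts (not data from this study): 12 of 18 cases and
# 2 of 16 controls carry AB, so the AB odds ratio is (12 * 14) / (6 * 2).
ratios = genotype_odds_ratios(["AB"] * 12 + ["AA"] * 6,
                              ["AB"] * 2 + ["AA"] * 14)
print(ratios["AB"])  # 14.0
```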
4.1.4.1 Top 35 Results By Genotype:
SNP Category Odds Ratio Chromosome Position Cytoband Related Genes
RS3848096 AB 15.00 13 31692672 13q13.1 RFC3 STARD13
RS536243 AB 15.00 10 89623659 10q23.31 FLJ11218 PTEN
RS1112573 AB 14.00 14 76835369 14q24.3 NRXN3
RS1751005 AB 14.00 13 93527871 13q32.1 ---
RS1952514 AB 14.00 14 19601896 14q11.2 HNRPC
RS2014422 AB 13.71 18 60158077 18q22.1 CDH7 MGC39571
RS1856133 AB 13.50 6 102146470 6q16.3 GRIK2
RS2213177 AB 13.50 12 11852139 12p13.2 ETV6
RS4496103 AB 13.50 15 86073423 15q25.3 NTRK3 FLJ31461
Table 4 - Odds Ratio - Top 35 Results by Genotype
4.1.5 Statistics: Odds Ratio By Allele
This statistic is generated by building 2 x 2 contingency tables and doing proper counts of alleles. To give an example, suppose there is a SNP RS001. Given two
possible alleles A and B, Exemplar generates a 2x2 contingency table from allele counts and computes an Odds Ratio based on the SNP’s allele frequencies.
4.1.5.1 Top 35 Results By Allele:
SNP Odds Ratio Chromosome Position Cytoband Related Genes
4.1.6 GA Module Results Summary
The Genetic Algorithm in Exemplar is a powerful method for investigating genetic diseases of a complex nature. Such diseases are often multigenic and/or comprised of moderate-risk alleles. By combining groups of SNPs logically, Exemplar’s models are able to identify the various factors that may be involved in a complex phenotype. It is not known whether ---- has complex genetics (is it single gene or multigenic?).
4.1.6.1 GA Based Analysis
The GA has many settings that need to be tuned to optimize results. Many different combinations of settings were used, and they will not be detailed in this report. The types of feature selection that were used are pertinent, however. Each of the following methods was employed in separate experiments/analyses to reduce the input dataset and therefore the problem space:
1. Statistical Reduction – Fisher’s exact test was applied to all SNPs with a cutoff of .005. Any SNPs above the cutoff were eliminated from consideration by the GA.
2. Minor Allele Frequency Changes – Minor allele frequencies were calculated separately for the cases and controls for each SNP (the minor allele was derived empirically from the controls). If the variance between the two groups was less than 35%, we eliminated the SNP from consideration.
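The minor allele frequency filter can be sketched in Python as follows. This is an illustration under stated assumptions rather than Exemplar's code: "variance" is interpreted as the absolute difference between the case and control minor allele frequencies, ties in the control allele counts are broken arbitrarily in favor of B, uncalled genotypes are skipped, and all names and example genotypes are hypothetical.

```python
CALLED = ("AA", "AB", "BB")  # anything else (e.g. "NoCall") is ignored


def minor_allele(control_genotypes):
    """Determine the minor allele empirically from the controls."""
    called = [g for g in control_genotypes if g in CALLED]
    a_count = sum(g.count("A") for g in called)
    b_count = sum(g.count("B") for g in called)
    # Assumption: on a tie, B is treated as the minor allele.
    return "A" if a_count < b_count else "B"


def allele_frequency(genotypes, allele):
    """Frequency of one allele among called genotypes, or None if none called."""
    called = [g for g in genotypes if g in CALLED]
    if not called:
        return None
    return sum(g.count(allele) for g in called) / (2 * len(called))


def keep_snp(case_genotypes, control_genotypes, min_difference=0.35):
    """Keep the SNP only if case and control frequencies differ enough."""
    allele = minor_allele(control_genotypes)
    case_freq = allele_frequency(case_genotypes, allele)
    control_freq = allele_frequency(control_genotypes, allele)
    if case_freq is None or control_freq is None:
        return False
    return abs(case_freq - control_freq) >= min_difference


# Hypothetical genotypes (not data from this study).
controls = ["AA", "AA", "AB", "AA"]   # B is the minor allele, frequency 0.125
shifted_cases = ["AB", "BB", "AB", "BB"]   # B frequency 0.75
similar_cases = ["AA", "AB", "AA", "AA"]   # B frequency 0.125
print(keep_snp(shifted_cases, controls))   # True
print(keep_snp(similar_cases, controls))   # False
```

The statistical reduction method is simpler still: a SNP is kept only when its Fisher's exact p-value falls below the chosen cutoff.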
We also employed several different model types. These included:
1. Models with ANDs only
2. Models with ANDs and ORs
3. Homozygous-only models
4. Models incorporating more SNPs
4.1.6.2 GA Results
We proceeded to run the GA multiple times using the model types and feature selection methods listed above. A graphical representation of each model is shown along with a screenshot of the model’s performance. A green ball indicates a sample that was classified correctly by the model, a red ball indicates an incorrect classification, and a gold ball indicates that the model could not be evaluated for that sample due to a missing SNP value (NoCall). Since we are merely identifying SNPs/regions of interest here, the actual structure of the resulting models is not important. Instead, we are interested in which SNPs the GA prefers when building multi-loci models, and these will be added to our list of SNPs for further consideration.
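As an illustration of how a logical multi-SNP model of this kind could be scored against a sample, consider the Python sketch below. The nested-tuple model representation, the function names, and the rule that any NoCall among the model's SNPs makes the sample unevaluable are all assumptions made for this sketch; the example model is hypothetical and is not one of the models found by the GA, although it borrows SNP IDs and genotypes from the GA SNP list.

```python
def snps_in(model):
    """Collect the SNP IDs referenced anywhere in a model."""
    if model[0] == "SNP":
        return {model[1]}
    return set().union(*(snps_in(child) for child in model[1]))


def evaluate(model, sample):
    """True if the model predicts 'case' for this sample's genotypes."""
    kind = model[0]
    if kind == "SNP":
        _, snp_id, genotype = model
        return sample[snp_id] == genotype
    results = [evaluate(child, sample) for child in model[1]]
    return all(results) if kind == "AND" else any(results)


def classify(model, sample, is_case):
    """Return 'green' (correct), 'red' (incorrect) or 'gold' (not evaluable)."""
    if any(sample.get(snp, "NoCall") == "NoCall" for snp in snps_in(model)):
        return "gold"
    return "green" if evaluate(model, sample) == is_case else "red"


# Hypothetical model: RS1404090 is AB, OR (RS691140 is AA AND RS585859 is BB).
model = ("OR", [
    ("SNP", "RS1404090", "AB"),
    ("AND", [("SNP", "RS691140", "AA"), ("SNP", "RS585859", "BB")]),
])

# Hypothetical samples (not data from this study).
case = {"RS1404090": "AB", "RS691140": "AB", "RS585859": "AA"}
control = {"RS1404090": "AA", "RS691140": "AA", "RS585859": "BB"}
missing = {"RS1404090": "NoCall", "RS691140": "AA", "RS585859": "BB"}

print(classify(model, case, is_case=True))      # green
print(classify(model, control, is_case=False))  # red
print(classify(model, missing, is_case=True))   # gold
```

Counting green results over all samples gives a simple measure of model performance of the kind reported as, for example, 31 of 32 samples classified correctly.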
4.1.6.2.1 Model Summary Results
Several models were able to achieve near perfect classification of the entire dataset. The best models are detailed below.
4.1.6.2.1.1 Models 1 and 7-9 - Exp 124 (Run 20)
These models were created by first performing feature selection on the input dataset. The feature selection method utilized was Fisher’s exact test with p<.005. This resulted in a final SNP count of just 99 from over 10,000. The models performed well, with all 5 categorizing 31 of 32 samples correctly. The models’ graphical representations follow:
Figure 1 - GA Model 1, Exp 124 (Run 20)
Figure 2 - GA Model 1, Exp 124 (Run 20) - Predictions
Figure 3 - GA Model 7, Exp 124 (Run 20)
Figure 4 - GA Model 7, Exp 124 (Run 20) - Predictions
Figure 5 - GA Model 8, Exp 124 (Run 20)
Figure 6 - GA Model 8, Exp 124 (Run 20) - Predictions
Figure 7 - GA Model 9, Exp 124 (Run 20)
Figure 8 - GA Model 9, Exp 124 (Run 20) - Predictions
4.1.6.2.1.2 Models 1-2 - Exp 128 (Run 23)
These models were created by first performing feature selection on the input dataset. The feature selection method utilized was minor allele frequency changes >.25. This was computed as follows:
1. The controls were scanned for each SNP to empirically determine the minor allele for that SNP; BB was not presumed.
2. The frequency of occurrence of the minor allele was computed for the controls and the cases separately.
3. If the variance between case and control minor allele frequencies was less than 25%, the SNP was eliminated from consideration.
This resulted in a final SNP count of just 93 from over 10,000. The models performed well, with both categorizing 29 of 30 samples correctly. Note that the CAR559_01 and CAR503_01 controls could not be evaluated for these models due to NoCalls for needed SNPs. The models’ graphical representations follow:
Figure 9 - GA Model 1, Exp 128 (Run 23)
Figure 10 - GA Model 1, Exp 128 (Run 23) - Predictions
Figure 11 - GA Model 2, Exp 128 (Run 23)
Figure 12 - GA Model 2, Exp 128 (Run 23) - Predictions
4.1.6.3 Genetic Algorithm SNP List
The GA identified 10 unique SNPs from all the above models, which included:
SNP        Genotype  Chromosome  Position   Cytoband  Related Genes
RS2393537  AB        10          60345806   10q21.1   PHYHIPL
RS2896587  AB        11          11734173   11p15.3   USP47
RS2085808  AB        13          64216386   13q21.32  PCDH9 FLJ25694
RS1964562  BB        15          76762310   15q25.1   ADAMTS7 CHRNB4
RS727150   AB        21          18764912   21q21.1   NCAM2 PRSS7
RS1404090  AB        2           76402264   2p12      C2orf3 LRRTM4
RS585859   BB        2           219585661  2q35      USP37
RS691140   AA        2           219829544  2q35      TTLL4 CYP27A1
RS1354083  BB        3           125680764  3q21.2    TRAD
RS1503466  AB        4           27495090   4p15.2    PCDH7 STIM2
Table 6 - Genetic Algorithm SNP List
4.1.7 CA Module Results Summary
The CA module was run with the controls used as the reference set and only the tumor samples as cases (Peripheral Blood samples were not utilized). The cases were
analyzed as a whole against the reference set, rather than on an individual basis. A mean p-value was calculated for each SNP to determine its probability of LOH at that
point in the genome. We exported a table with a stringent cutoff of p<.005. Details on the method for calculating p-values are in Appendix A attached to this document.
Results for the 100 most significant SNPs (lowest p-values) follow:
SNP pValue Chromosome Position Cytoband Related Genes
We subsequently analyzed for LOH several samples from families suspected of having a deletion in the PDE11A gene. Each individual sample was compared against
the controls, and each SNP in the sample was assigned a probability score for Loss of Heterozygosity.
4.1.8.1 Results for family CAR545
Family CAR545, samples ID 1, ID 2, ID 3 and ID 3AT (columns for each sample: SNP, pValue, Cytoband)
4.1.9 EM Algorithm Results
This statistic is generated by computing the log-likelihoods of the cases, the controls and the combined cases/controls for haplotype pairings. The formula 2(Lcase +
Lcontrol - Lcombined) gives us a chi-square value for the haplotype pairings. Below is a table showing each log-likelihood score and the associated p-values. Note
the highlighted rows, which are the only haplotype pair that matches a previously identified region of the genome (the 2q31-35 region).
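As a worked illustration of the formula 2(Lcase + Lcontrol - Lcombined), the sketch below computes the statistic and converts it to a p-value with a chi-square upper-tail function. The report does not state the degrees of freedom used, so `df` is left as a caller-supplied parameter; the function names and the example log-likelihood values are hypothetical.

```python
import math


def lrt_statistic(l_case, l_control, l_combined):
    """Likelihood ratio statistic 2(Lcase + Lcontrol - Lcombined).

    All three arguments are log-likelihoods (natural log).
    """
    return 2.0 * (l_case + l_control - l_combined)


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df >= 1.

    Uses the recurrence Q(a + 1, z) = Q(a, z) + z**a * exp(-z) / Gamma(a + 1)
    for the regularized upper incomplete gamma function, with a = df / 2
    and z = x / 2.
    """
    if x <= 0:
        return 1.0
    z = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-z)              # closed form for df = 2
    else:
        a, q = 0.5, math.erfc(math.sqrt(z))   # closed form for df = 1
    while a < df / 2.0:
        q += z ** a * math.exp(-z) / math.gamma(a + 1.0)
        a += 1.0
    return q


# Hypothetical log-likelihoods, for illustration only.
stat = lrt_statistic(-100.0, -120.0, -225.0)
p_value = chi2_sf(stat, df=2)
```

Fitting cases and controls separately can never fit worse than the combined fit, so the statistic is non-negative; a large value indicates that haplotype frequencies differ between the two groups.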
Table 8 - EM Algorithm - Top 100 Results by pValue
5 Discussion
We note the multiple occurrences of SNPs from the 2q35 region identified by the GA. Further, one model contained only two SNPs from chromosome 2, RS2085808 and RS1404090, both as heterozygous genotypes, and was able to classify almost the entire dataset. Be aware that when a model specifies a genotype such as AB for a SNP, it implies that the person does NOT have AA or BB. So in the case of this 2 SNP model, it may indicate that, for the related genes, 2 copies are required to avoid the phenotype.
Stratakis suspected the 2p region in prior studies as a possible region of interest for MAH. In our results, 2p12 showed as significant. There was congruence between several statistically significant SNPs and the SNPs the GA identified, which is not surprising in the cases where statistics were used as the feature selection method.
The CA module was utilized to detect possible chromosomal deletions/additions. With a cutoff of p<.005, ~200 SNPs remained. We selected the top 100 from the list for review. Notable are regions of significance shared with the statistical and machine learning approaches, further confirming these regions as candidates for detailed investigation.
In order to provide a comprehensive list, we combined the results of all the statistical and genetic algorithm analyses, removed duplicates, and also removed one SNP from any pair of neighboring SNPs that were in high LD. The next page provides the resultant list. Although this list covers many chromosomes, we pointed the researchers specifically to several regions, including the 2q3# region (SNPs in this region are highlighted by green shading) due to its significance under multiple tests. We also noted 2p12 and 15q25# (SNPs in this region are highlighted by blue shading) due to their frequent occurrence in several evaluations. Notably, LOH analysis also identified SNPs in the 2q3# region as well as the 15q25# region, but not the 2p# region.
As noted above, the haplotype-based analysis was completed after the PDE11A discovery. We note that the 2q3# regions are the only regions that were implicated in all analyses, including the haplotype analysis. We also note that the haplotype analysis directly identified the PDE11A gene. The haplotype analysis was not part of the results returned to the researchers, from which they discovered PDE11A’s involvement.
Given that the source data for this study was family-based triads with 2 unaffected parents, it lends itself to the Transmission Disequilibrium Test and Haplotype-based Relative Risk, which are powerful methods for association-based analysis. Currently these methods are not in the Exemplar product, but they will be added in the near future for studies of this nature.
5.1 Final SNP List for Further Review
NOTE: This list did not include SNPs from Haplotype analysis.
RS1412197 20 59360564 20q13.33 CDH4 FLJ33860
RS727150 21 18764912 21q21.1 NCAM2 PRSS7
RS1981391 21 35045926 21q22.12 RUNX1 CLIC6
RS967445 21 43090530 21q22.3 WDR4 PDE9A
Table 9 - Final SNP List for Further Review
6 Appendix A
6.1 Loss of Heterozygosity Calculations
LOH in Exemplar is calculated via a probability-based method. This method computes the probability that a string of homozygous genotypes would occur by chance
alone. If a block of homozygous neighboring SNPs in the analysis set has a very low probability of being homozygous in the reference set, then that block is shown to
have a high probability of LOH. LOH is calculated for each individual SNP in the input dataset, and then the results are ‘smoothed’ across blocks of SNPs to reduce
noise.
For each SNPj that appears in the input dataset, look across all samples in the reference set and find the number of homozygous call values for that SNP. That number is
then divided by the total number of genotype calls for that SNP in the reference set to get the probability that SNPj is homozygous.
Pj = (# homozygous calls on SNPj) / (total # genotype calls on SNPj)
where Pj is the probability that SNPj is homozygous in the reference set.
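A minimal sketch of the Pj calculation follows. It assumes genotype calls are encoded as the strings "AA", "AB" and "BB", and that any other value (for example a NoCall) is excluded from the total; the report does not say how NoCalls are counted, so that exclusion is an assumption, as is the function name.

```python
HOMOZYGOUS = {"AA", "BB"}
CALLED = {"AA", "AB", "BB"}  # any other value is treated as a NoCall (assumption)


def homozygous_probability(reference_calls):
    """Pj: fraction of reference-set genotype calls on one SNP that are homozygous.

    Returns None when the SNP has no usable calls in the reference set.
    """
    called = [call for call in reference_calls if call in CALLED]
    if not called:
        return None
    homozygous = sum(1 for call in called if call in HOMOZYGOUS)
    return homozygous / len(called)


# Three of the four usable reference calls are homozygous.
p_j = homozygous_probability(["AA", "AB", "BB", "NoCall", "AA"])
```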
After Pj is calculated for each SNPj in the input dataset against the reference set, the SNPs that make up each ‘block’ are determined. This is
done by the following method.
6.1.1 Analysis-Set Defined Block Size
When using an ‘analysis-set defined’ block size, the size of the blocks is determined by examining the call values in the analysis set. For each array to be analyzed, scan
the genotype call values for the SNPs ordered by position on the chromosome. Group runs of homozygous call values together such that each ‘block’ of SNPs is
bounded by a SNP with a heterozygous call value, and all other SNPs in the block are homozygous. Once each block is identified, calculate the product of the
homozygous probabilities for each SNP in the block on the reference set. That becomes the p-Value for that block.
Figure 1 - LOH - Analysis-Set Defined Block Size
P(SNPm->n homozygous) = PRODUCT(Pj) where j = m->n (SNPs in the block)
When using this block type, all SNPs in a given block will have the same probability value (for a given analysis array), which is a different effect than the sliding block
approach used above with user-defined block sizes, where SNPs in a block would normally have a variable final p-Value.
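The block identification and block probability for one analysis array can be sketched as below. The sketch takes the block probability to be the product of the per-SNP reference probabilities, as the text of 6.1.1 describes. The genotype encoding, the function names, and the choice to treat any non-homozygous call (heterozygous or NoCall) as a block boundary are assumptions made for illustration.

```python
from math import prod

HOMOZYGOUS = {"AA", "BB"}


def homozygous_blocks(calls):
    """Return (start, end) index pairs, inclusive, for each run of homozygous calls.

    `calls` must already be ordered by position on the chromosome.
    """
    blocks, start = [], None
    for i, call in enumerate(calls):
        if call in HOMOZYGOUS:
            if start is None:
                start = i
        elif start is not None:
            blocks.append((start, i - 1))
            start = None
    if start is not None:
        blocks.append((start, len(calls) - 1))
    return blocks


def block_probabilities(calls, p_hom):
    """Assign every SNP in a block the product of the block's reference Pj values.

    p_hom[j] is the reference-set homozygous probability for SNP j.
    SNPs outside any block get None.
    """
    result = [None] * len(calls)
    for start, end in homozygous_blocks(calls):
        block_p = prod(p_hom[start:end + 1])
        for j in range(start, end + 1):
            result[j] = block_p
    return result
```

Because every SNP in a run receives the same product, longer homozygous runs yield smaller block probabilities, which is what makes a long run in a sample stand out against the reference set.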
6.1.2 LOH P-Value
Now that we have the probability that a given block of SNPs is homozygous on the reference set, we can examine each SNP for each array in the analysis set. For each
SNPj in the analysis set, if it is homozygous, the probability that it has lost heterozygosity is calculated in the following manner:
LOH = (-1)(Log10 P(SNPm->n homozygous))
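As a small numeric illustration of this score, the sketch below converts a block probability into an LOH value. The function name and the handling of a zero probability (returned as infinity) are assumptions for the example.

```python
import math


def loh_score(block_probability):
    """LOH = -log10 of the probability that the block is homozygous by chance."""
    if block_probability <= 0.0:
        return math.inf  # assumption: a zero probability maps to an unbounded score
    return -math.log10(block_probability)


# A block of 10 SNPs, each homozygous in 70% of reference calls.
score = loh_score(0.7 ** 10)
```

Here the block probability is 0.7 raised to the 10th power, roughly 0.028, giving an LOH score of about 1.55; a longer run, or SNPs that are rarely homozygous in the reference set, pushes the score higher.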
For analysis-set defined blocks, each array in the analysis set could cause a different block to be identified on the reference set, so each LOH value on SNPj in each
array in the analysis set could be different. For this reason, standard deviation is calculated on each homozygous SNPj across the analysis set and plotted with the LOH