Considering positional bias in regulatory motifs of genes associated with breast cancer
Nathaniel Gustafson
Dr. Garry Larson (City of Hope)
Feb 24, 2016
Cancer Studies
Can we tie genetic variation (e.g. SNPs) to cancer risk?
Myriad Genetics found BRCA1 and BRCA2
Mutations in BRCA1/2 tied to 800% increase in breast cancer risk
Most research is on exonic regions
Changes protein composition
http://members.cox.net/amgough/Fanconi-genetics-genetics-primer.htm
Our approach
What about regulatory regions?
Motif: recurring sequence, usually 6-20 bp. Generally functional
Hypothesis: Regulatory motifs upstream of the transcriptional start site (TSS) may play some role in breast cancer
[Figure: gene diagram running 5' to 3', with the ATG start and the stop marked; 5' = upstream]
YGCGYRCGCATCMNTCCGYTGAYRTCAGCTNWTTGK...
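The motifs above are written with IUPAC degenerate base codes (Y, R, M, N, W, K and so on). As a minimal sketch of how such a motif could be located relative to the TSS, the Python below translates the codes into a regular expression and reports each match as a negative offset upstream of the TSS. The function names and the 30 bp example fragment are invented for illustration and are not taken from the talk, whose own scripts were written in Perl.

```python
import re

# Standard IUPAC nucleotide codes mapped to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def motif_to_regex(motif):
    """Translate a degenerate motif such as SCGGAAGY into a compiled regex."""
    return re.compile("".join(IUPAC[base] for base in motif.upper()))


def find_motif(upstream_seq, motif):
    """Return motif start positions relative to the TSS (negative = upstream).

    Assumption: upstream_seq is forward-strand sequence that ends
    immediately before the TSS, so its last base sits at position -1.
    """
    seq = upstream_seq.upper()
    offset = len(seq)
    # A zero-width lookahead lets overlapping occurrences be reported too.
    lookahead = re.compile("(?=" + motif_to_regex(motif).pattern + ")")
    return [m.start() - offset for m in lookahead.finditer(seq)]


# Hypothetical 30 bp upstream fragment, invented for illustration.
example = "TTACCGGAAGTACGTGCGGAAGCTTTAGCA"
positions = find_motif(example, "SCGGAAGY")
print(positions)
```

Only the forward strand is scanned here; a fuller scan would also search the reverse complement.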
Disease Mutations in Phylogenetically Conserved Motifs
[Figure: sequence phylogeny. Orthologous bases are identical by descent; nonorthologous bases were shown in red.]
Sequence                 Divergence
G A C C T A C T A C A
G A G C T A C T A C T    ~5 myr
G A C T T A A T T C A    ~70 myr
G A G C T A C - A G A    ~300 myr
G A G T T A A T G G T    ~475 myr
G A C C T T C T A C A    BrCa Pt. Mutation
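The idea behind the figure can be sketched in a few lines: find the alignment columns where every species carries the same base, then ask whether a patient's mutation falls in one of them. The code below is an illustration only. It assumes the first, unlabeled sequence in the figure is the reference the patient sequence is compared against, which the slide does not state, and the function names are hypothetical.

```python
def conserved_columns(aligned_seqs):
    """Return 0-based columns where every aligned sequence has the same base."""
    length = len(aligned_seqs[0])
    assert all(len(s) == length for s in aligned_seqs), "sequences must be aligned"
    return [
        i for i in range(length)
        if aligned_seqs[0][i] != "-" and len({s[i] for s in aligned_seqs}) == 1
    ]


def mutations_in_conserved(reference, patient, aligned_seqs):
    """List (column, reference base, patient base) for changes at conserved columns."""
    conserved = set(conserved_columns(aligned_seqs))
    return [
        (i, reference[i], patient[i])
        for i in range(len(reference))
        if reference[i] != patient[i] and i in conserved
    ]


# Sequences as read from the figure, top to bottom (spaces removed).
alignment = [
    "GACCTACTACA",  # unlabeled; ASSUMED to be the reference
    "GAGCTACTACT",  # ~5 myr
    "GACTTAATTCA",  # ~70 myr
    "GAGCTAC-AGA",  # ~300 myr
    "GAGTTAATGGT",  # ~475 myr
]
patient = "GACCTTCTACA"  # BrCa patient mutation row

hits = mutations_in_conserved(alignment[0], patient, alignment)
print(hits)
```

With these sequences the patient's single change lands in a column that is identical across all five aligned rows.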
Background
Meta-analysis pools several BrCa ER+/-* studies
Statistics used to find genes that have consistent differences in expression levels in ER+ vs ER- cell lines
*ER = Estrogen Receptor: a common way of classifying breast cancer cells
[Figure: example motif counts, GCCATnTT x 50 vs. GCCATnTT x 9]
Aims
Investigate regulatory motifs for these genes
Compare occurrences of each motif across gene sets
Hypothesis: genes overexpressed in the same tumor type share motifs
Weak Results
Are we missing the signal?
Old counting method
[Figure: motif occurrences along the upstream region (axis ticks -2000, -1500, -1000, -500, TSS) for the ER+ < ER- and ER+ > ER- gene sets; counts shown: 15 and 10]
P-val: .30 (NOT significant)
Are we missing the signal?
New counting method: use position bias
[Figure: motif occurrences along the upstream region (axis ticks -2000, -1500, -1000, -500, TSS) for the ER+ < ER- and ER+ > ER- gene sets; counts shown: 12 and 3]
P-val: .03 (significant)
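The difference between the two counting methods can be sketched as follows. The old method counts every occurrence in the scanned upstream region; the new one keeps only occurrences near the motif's positional bias. This is a minimal illustration, not the talk's implementation: the gene names and positions are invented, and treating the window as plus or minus 100 bp around the bias is an assumption based on the later "Reading 100 bp from positional bias" caption.

```python
def count_all(positions_by_gene):
    """Old method: count every motif occurrence in the scanned region."""
    return sum(len(positions) for positions in positions_by_gene.values())


def count_near_bias(positions_by_gene, bias, window=100):
    """New method: count only occurrences within `window` bp of the bias.

    Positions and bias are offsets relative to the TSS (negative = upstream).
    The +/- window interpretation is an assumption, not stated in the slides.
    """
    return sum(
        1
        for positions in positions_by_gene.values()
        for pos in positions
        if abs(pos - bias) <= window
    )


# Hypothetical motif positions per gene, invented for illustration.
er_pos_up = {"geneA": [-30, -1500], "geneB": [-110, -900, -20]}
er_neg_up = {"geneC": [-1800], "geneD": [-700, -50]}

bias = -24  # positional bias reported for SCGGAAGY in the results table
print(count_all(er_pos_up), count_all(er_neg_up))
print(count_near_bias(er_pos_up, bias), count_near_bias(er_neg_up, bias))
```

Restricting the count to the window discards occurrences far from the bias, which is what lets a difference between gene sets stand out if the motif is only functional near that position.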
Tools
Perl
Handy scripting language
Great for parsing textual data
MySQL
Storage and retrieval of structured data
www.yusoft.net/yu-graph/main/logo-mysql.jpg
Problems
Lack of data specificity
What do Xie's pos. biases mean?
Insufficient data
Needed position of motif relative to TSS
Improperly annotated data
Position shown to be inconsistent
Collaboration
Norway is about 10 time zones away
Results

Reading 100 bp from positional bias
Motif     1down count  1up cnt  5down count  5up cnt  pos. bias  Pvaltop1  Pvaltop5
SCGGAAGY  5            8        36           71       -24        0.40116   0.00019
...       ...          ...      ...          ...      ...        ...       ...

No window (Previous results)
Motif     1down count  1up cnt  5down count  5up cnt  pos. bias  Pvaltop1  Pvaltop5
SCGGAAGY  31           41       168          206      -24        0.10299   0.00719
...       ...          ...      ...          ...      ...        ...       ...
Any SNPs in this motif?
One SNP was found from HapMap in this motif
But it was at a degenerate position (e.g. Y = C or T), so it still satisfied the motif
Might still affect expression
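Whether a SNP breaks a degenerate motif can be checked by substituting each allele into the site and testing it position by position against the IUPAC code. The sketch below is illustrative only: the site sequence, SNP indices and alleles are hypothetical, since the slides do not give the actual SNP or its position.

```python
# Standard IUPAC nucleotide codes as sets of allowed bases.
IUPAC_SETS = {
    "A": {"A"}, "C": {"C"}, "G": {"G"}, "T": {"T"},
    "R": {"A", "G"}, "Y": {"C", "T"}, "S": {"C", "G"}, "W": {"A", "T"},
    "K": {"G", "T"}, "M": {"A", "C"}, "N": {"A", "C", "G", "T"},
}


def satisfies(site, motif):
    """True if every base of `site` is allowed by the matching motif code."""
    return len(site) == len(motif) and all(
        base in IUPAC_SETS[code] for base, code in zip(site, motif)
    )


def snp_preserves_motif(site, motif, snp_index, alleles):
    """True if the site still matches the motif under every SNP allele."""
    return all(
        satisfies(site[:snp_index] + allele + site[snp_index + 1:], motif)
        for allele in alleles
    )


motif = "SCGGAAGY"
site = "CCGGAAGC"  # hypothetical genomic instance of the motif

# Hypothetical C/T SNP at the degenerate Y position: motif still satisfied.
print(snp_preserves_motif(site, motif, 7, "CT"))
# Hypothetical G/A SNP at a fixed G position: one allele breaks the motif.
print(snp_preserves_motif(site, motif, 3, "GA"))
```

A SNP that passes this check leaves the consensus intact, though, as the slide notes, it might still affect expression.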
Biological Significance
SCGGAAGY found more in ER+ overexpressed genes
Known as a binding site for ELK-1
Might provide some insight into ER+/ER- cell differentiation
Verification in vivo remains to be done
[Flowchart; box labels as extracted:]
3'UTR motif list (6/7mer miRNA seeds; phylogenetically conserved motifs)
HapMap
BrCa GWAS datasets: Hunter, et al. (CGEMS); Gold, et al. (MSKCC); Easton, et al. (UK) (unavailable); Stacey (deCode) (unavailable)
SNP_list
SNPs rank & biological testing
BrCa somatic mutations (Sjöblom)
Linkage studies in BrCa (Smith, et al.)
LOH (aCGH) in BrCa
Thermodynamic profiling (STarMir, PITA)
In-house independent association studies
3'UTR-luc fusion assay
Reciprocal allelic testing: effect
Evolutionary conservation (miRNA seeds)
LD mapping proxy SNPs
Allele frequency in HapMap population(s)
Reciprocal allelic testing: no effect
Additional biological testing
GWAS
"Genome Wide Association Study"
Genotypes cases and controls at thousands of loci
Intended to be an unbiased approach
Potentially identifies pertinent mutations
http://www2.bioinformatics.tll.org.sg/img/species/karyotype_Homo_sapiens.png
BrCa GWAS Datasets

Study: Hunter, et al. (Nat Genet 39, 2007)
  Assay Platform: Illumina Hap 550 (keep 528K)
  Cases / Controls: 1,145 / 1,142
  Comment_1: Prospective, post-menopausal women
  Comment_2: Logistic Regression
  Public Dataset: YES (CGEMS)

Study: Easton, et al. (Nature 447, 2007)
  Assay Platform: Affy, 266k SNPs (keep 227k)
  Cases / Controls: Stage 1: 380 / 364; Stage 2: 3,990 / 3,916 ctrls; Stage 3: 21,860 / 22,578 ctrls
  Comment_1: Stage 1 cases (2 first-degree relatives with Fam Hx)
  Comment_2: 3 stage association; Stage 2: top 5% of Stage 1; Stage 3: top 30 SNPs from Stage 2; TNRC9 high score
  Public Dataset: NO

Study: Stacey, et al. (Nat Genet 39, 2007), deCode Dataset
  Assay Platform: Illumina Hap300 (keep 311k)
  Cases / Controls: 1,600 Icelandic cases / 11,563 ctrls
  Comment_1: Top 10 SNPs GTP'd in 2nd Icelandic sample and 2-3 ind. European cohorts
  Comment_2: 1 SNP in strong LD with 9995 BRCA2, removed from study; found SNP near TNRC9
  Public Dataset: NO

Study: Gold, et al. (PNAS 105, March 2008)
  Assay Platform: Affy GTP 435K SNPs (keep 150k)
  Cases / Controls: 249 AJ Fam Hx (3 cases, BRCA1 & 2 neg) vs. 299 Ca-free AJ ctrls
  Comment_1: 3 stage design
  Comment_2: Reproduced FGFR2 region
  Public Dataset: MAYBE?
[BrCa GWAS Datasets slide repeated; the only added text is "YES"]
Bring this...
SM70 SG74 LF52 SM17 SH14 L5721 SM56 SF63 L5957 L5349 L5420 L5713 SH5 LF48 SJG4 L6029 SG21 L5352 L6121 SG69 L5952 SM78 SM113 SF23 L5573 SN6 SF1 SM91 L5895 L5518 L5501 L5328 L5772 SG08 SG28 SM52 SM106 SM67 L5463 L5494 SA17 L5796 L6014 SN15
rs2180341 chr6 127642323 + ncbi_b35 MSKCCOffit AffyEAv3 PhaseIGold_et_al TT TT TT CT CT CC TT CT TT TT TT TT CT TT CT TT CT TT TT CT TT TT TT CT CT CT CT CT CT CT TT CT CT TT TT CT CC TT TT CT TT TT CT TT CT CT CT CT TT TT TT CT CT CT TT TT CT TT TT TT TT TT TT CT TT TT TT TT CT TT TT CT TT TT TT CT CT TT CT CT TT CT TT ...
rs6569480 chr6 127663441 + ncbi_b35 MSKCCOffit AffyEAv3 PhaseIGold_et_al GG GG GG GG AG AA GG AG GG GG GG GG AG GG GG GG AG GG GG AG GG GG GG AG AG AG AG AG AG AG GG AG AG GG GG AG AA GG GG AG GG GG AG GG AG AG AG AG GG GG GG AG AG AG GG GG AG GG GG NN GG GG GG AG GG GG GG GG AG GG GG AG GG GG GG AG GG GG AG AG GG AG GG...

To this...
rs_num      chr    pos        analysis_name          p_value  OR_het  OR_hom  build
rs10510126  chr10  124992475  chi square - genotype  4e-06    0.5918  0.6387  ncbi_b36
rs10510126  chr10  124992475  chi square - allele    2e-06    0.5918  0.6387  ncbi_b36
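One way the raw genotype rows could be reduced to a "chi square - allele" line is sketched below: count alleles in cases and in controls, then run a 2x2 chi-square test. This is an illustration under stated assumptions, not the pipeline used in the talk. The case and control genotype lists are invented (the excerpt does not say which samples are cases), the test has 1 degree of freedom with no continuity correction, and both alleles are assumed to be observed at least once.

```python
import math
from collections import Counter


def allele_counts(genotypes):
    """Count alleles across genotype calls such as 'CT'; 'NN' (no call) is skipped."""
    counts = Counter()
    for genotype in genotypes:
        if genotype == "NN":
            continue
        counts.update(genotype)  # adds one count per allele character
    return counts


def allele_chi_square(case_genotypes, control_genotypes, allele_a, allele_b):
    """Return (chi2, p_value) for a 2x2 table of allele counts by status."""
    case = allele_counts(case_genotypes)
    ctrl = allele_counts(control_genotypes)
    table = [[case[allele_a], case[allele_b]], [ctrl[allele_a], ctrl[allele_b]]]
    row = [sum(r) for r in table]
    col = [table[0][j] + table[1][j] for j in range(2)]
    total = sum(row)
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row[i] * col[j] / total
            chi2 += (table[i][j] - expected) ** 2 / expected
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Hypothetical genotype calls, invented for illustration.
cases = ["TT", "CT", "CT", "CC"]
controls = ["TT", "TT", "CT", "TT", "NN"]

chi2, p_value = allele_chi_square(cases, controls, "T", "C")
print(round(chi2, 4), round(p_value, 4))
```

The genotype-based test and the odds ratios in the summary table would need the three genotype classes counted separately; they are omitted here to keep the sketch short.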
Future Work
We've digested the Gold data set
Employ this in the triage for producing a gene list
Combine with other triage methods to find the most interesting genes
Test these in vivo
Special Thanks
Dr. Garry Larson
SoCalBSI program
SoCalBSI mentors
City of Hope
Funding
Komen for the Cure
National Science Foundation
National Institutes of Health
Employment and Workforce Development
References
Xie X, Mikkelsen TS, Gnirke A, Lindblad-Toh K, Kellis M, Lander ES. Systematic discovery of regulatory motifs in conserved regions of the human genome, including thousands of CTCF insulator sites. Proc Natl Acad Sci U S A. 2007 Apr 24;104(17):7145-50.
Smith D, Sætrom P, Snøve O Jr, Lundberg C, Rivas G, Glackin C, Larson G. Meta-analysis of breast cancer microarray studies in conjunction with conserved cis-elements suggest patterns for coordinate regulation. BMC Bioinformatics. 2008;9:63.