Prediction of clonal subtypes in breast invasive carcinoma
Hunter Boyce, Alice Yu, Anna Shcherbina
CS229, Fall 2015

Goals and Data Availability

Breast cancer is the most commonly diagnosed cancer and the second leading cause of cancer death among women. This highly heterogeneous disease has many clonal subtypes, and patients respond differently to treatment because of genetic differences in both the disease and the patient. To better understand this complexity, The Cancer Genome Atlas (TCGA), a rich database of genetic and electronic medical information, can be mined to identify distinctive features for each clonal subtype and for patient survivability. Using the significant genes found, unsupervised and supervised methods can be applied to find genetic differences within clonal subgroups and to predict clinical outcomes from these features.

[Figure: Patient Distribution in Breast Cancer TCGA Clinical Features. The number of patients in each clinical categorical feature; the colored bars represent data used in further analysis.]

Feature Selection

Input: RNA-Seq gene expression levels for 20,531 genes across 800 subjects

Pre-processing:
1. Removed 475 genes with low expression (< 1 FPKM).
2. Removed 699 genes with < 0.1 FPKM^2 expression variance.

Forward subset selection
1. Linear/logistic regression of each feature against continuous/categorical outcomes (respectively) was performed.
2. Features with FDR > 0.05 for association with outcome were excluded.
3. The regsubsets library in R was used to perform greedy forward subset selection on the remaining features.
4. Feature subset size was selected to minimize the BIC.

Elastic net regularization and group lasso
Glmnet package in R

Result (genes selected per outcome):
Survivability: 277 genes
Age of Diagnosis: 194 genes
Cancer stage: 140 genes
Estrogen Receptor: 176 genes
Progesterone Receptor: 175 genes
HER2 Receptor: 52 genes

[Figure panels: forward selection, using 10-fold cross-validation and selecting feature subset size to minimize BIC; group lasso (alpha=1); elastic net regularization (alpha=0.5). P-values printed on the panels: 1.26e-31, 0.89, 1.38e-07, 1.66e-26.]

[Figure: Venn diagram of features chosen by Forward Selection, Elastic Net Regularization, and Group Lasso. Region counts: 391, 149, 156, 72, 31, 204, 7.]

[Figure labels, Tumor Stage and Survivability: ZNF705A, LCE3E, PZP; PRDM14, ZPBP2; ZIC4, LOC93432, LCT, GSDMC, B3GNT5, SOX11, SEL1L3, HAPLN3, MALL, CLDN1, CDH3, MMP1, S100A1, PADI2]

Unsupervised Clustering of Patient Gene Profiles

Hierarchical Clustering

ANOVA p-value for Survivability = 1.80e-4

Tukey HSD Difference of Means (Days)
Expected | +346 | -346
Lower | +304 | -304
Upper | +386 | -386

K-Means Clustering

PCA and K-Means++ were used to create the clusters shown in the figure. The distinct groups suggest additional patient subgroups that could inform more precise treatments.

Chi-Squared p-value for Estrogen Receptor = 4.95e-43

Chi-Squared Standardized Residuals (one column per cluster)
Outcome 0 | -1.0 | 2.4 | -7.8 | 16.3 | -1.5 | -4.2 | -2.2 | -3.6 | -2.8
Outcome 1 | 1.0 | -2.5 | 7.9 | -16.1 | 1.3 | 4.2 | 2.3 | 3.6 | 2.8

[Figure, two panels: Genes with the strongest effect size. Gene labels: TRIM72, ZFP42, LACRT, TMIGD1, ACSM1, CCNE1, SCUBE2, PP1R1C, IL19, CPB2]

Supervised Classification and Regression

Classification
[Figure: Receiver Operating Characteristic curves for the prediction of two subtypes of cancer. Area under the curve is given with the classifier in the legend.]

Regression
Statistical metrics of regression performance on RNA expression for predicting tumor stage and survival time.

Survivability (Range: 158-4456 days)
Regressor | RMSE | Median Absolute Error | R^2
SVR Linear Kernel | 764.29 | 497.39 | 0.60
SVR Sigmoid Kernel | 1212.04 | 909.00 | 0.00
Linear Regression | 479.35 | 381.36 | 0.84

Tumor Stage (Range: 1-10)
Regressor | RMSE | Median Absolute Error | R^2
SVR Linear Kernel | 2.31 | 1.36 | 0.10
SVR Sigmoid Kernel | 2.54 | 0.90 | -0.08
Linear Regression | 2.05 | 1.35 | 0.30

Biological Significance: Feature Selection Highlights Gene Ontology Networks and KEGG Pathways Implicated in Cancer

Outcome | Pathway | FDR | Genes in Feature Set
Estrogen Receptor | TGF-Beta-dependent induction of EMT via MAPK | 5.75e-2 | TGF-Beta, TGF-Beta 3, PTEN, Claudin-1
Progesterone Receptor | Slit-Robo signaling | 7.96e-2 | SLIT1, SLIT3
Progesterone Receptor | Cadherin-mediated cell adhesion | 7.96e-2 | M-cadherin, P-cadherin
HER2 Receptor | ERBB-family signaling | 5.01e-2 | ERBB2, GRB7
HER2 Receptor | Role of alpha-6/beta-4 integrins in carcinoma progression | 5.01e-2 | ERBB2, ITGB4, 14-3-3 tau
Survivability | Cell adhesion and ECM remodeling | 5.51e-2 | Kallikrein 3, Kallikrein 2, Kallikrein 1
Survivability | Bradykinin/kallidin maturation | 4.88e-2 | CPB2, Tissue Kallikrein, Plasma Kallikrein
Survivability | Role of Parkin in the ubiquitin-proteasomal pathway | 8.75e-2 | UBE1, Cyclin E, HSP70, Tubulin Beta

[Figure: pathway maps for "Role of alpha-6/beta-4 integrins in carcinoma progression" and "Cell adhesion and ECM remodeling". Red bars indicate that the pathway gene is present in the feature set.]

ER gene set overlaps with the ESR1 (nuclear), MALL, Hyaluronic acid, NCOA3, Estradiol network (54/61 genes present in feature set).
GO processes: Reproductive structure development (20%, p=5.72e-7), Female sex differentiation (10.9%, p=8.24e-6), Growth hormone receptor signaling pathway (5.2%, p=3.43e-18)

HER2 gene set overlaps with the Survivin, Bcl-6, AKT (PKB), Fra-1, Bim network (114/180 genes present in feature set).
GO processes: Regulation of cell death (43%, p=2.4e-29), Regulation of apoptotic process (41.4%, p=9.82e-29), Regulation of programmed cell death (41.4%, p=1.65e-28), Apoptotic signaling pathway (22.5%, p=4.78e-27)

[Figure: gene networks; labeled nodes include p53 and Bcl-6.]

Survival Analysis

[Figure: Kaplan-Meier plots for the survival of sub-populations of patients.]
Results show the opposite of what is expected. This should be further explored.
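The pre-processing filters and the FDR cutoff described under Feature Selection can be illustrated with a minimal sketch. This is not the authors' code (the poster names R's regsubsets and glmnet for the later steps). It assumes that "low expression" means a mean below 1 FPKM across subjects, that variance is the sample variance, and that FDR is controlled with the Benjamini-Hochberg procedure; none of these is stated on the poster. The function names `filter_genes`, `benjamini_hochberg`, and `select_by_fdr` are hypothetical.

```python
from statistics import mean, variance


def filter_genes(expr, min_mean=1.0, min_var=0.1):
    """Keep genes that pass both the expression and the variance threshold.

    expr maps gene name -> list of FPKM values, one per subject.
    Assumption: "low expression" is read as mean FPKM below min_mean,
    and variance is the sample variance in FPKM^2.
    """
    return {
        gene: values
        for gene, values in expr.items()
        if mean(values) >= min_mean and variance(values) >= min_var
    }


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order.

    Assumption: the poster does not say which FDR procedure was used.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that adjusted
    # values never increase as the raw p-value decreases.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def select_by_fdr(gene_pvalues, threshold=0.05):
    """Keep genes whose adjusted p-value is at or below the threshold."""
    genes = list(gene_pvalues)
    adjusted = benjamini_hochberg([gene_pvalues[g] for g in genes])
    return [g for g, q in zip(genes, adjusted) if q <= threshold]
```

In this sketch the per-gene p-values would come from the univariate linear or logistic regressions; the genes returned by `select_by_fdr` are the candidates passed on to forward subset selection.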