Optimisation of cDNA Microarray Tumour Profiling and Molecular Analysis of Epithelial Ovarian Cancer
Ryan van Laar
Submitted in total fulfilment of the requirements of the degree of Doctor of Philosophy
June 2005
The Peter MacCallum Cancer Centre and The Department of Biochemistry and Molecular Biology
The University of Melbourne
Abstract
The advent of microarray technology has allowed the study of diseases such as epithelial
ovarian cancer (EOC) to occur at an unprecedented level of molecular resolution.
EOC is the fifth leading cause of female cancer death worldwide. The prognosis of women
diagnosed with this disease is often extremely poor, partially due to the difficulty of
detection in its early and most treatable stages. It is hypothesised that gene expression
profiling can shed light on the molecular events responsible for EOC development and
progression. This information could one day be used to develop novel screening methods
and therapeutic approaches based on individual tumour profiling.
This thesis first describes the optimisation of several aspects of the microarray work flow
and demonstrates their impact on the sensitivity and robustness of cDNA microarray data.
An evaluation of reference RNA options was conducted, in which gene expression data
generated using either a pool of RNA sourced from a diverse range of cell lines, or from a
cohort of EOC specimens were compared. The cell line RNA was found to be the most
suitable choice for a large-scale tumour profiling study based on the diverse criteria
applied. A number of factors with the potential to impact on the spatial distribution of
gene expression are also described and a novel method for quantification of this type of
systematic bias is proposed.
The findings from these comparisons are then used to create and analyse two clinically
annotated datasets of EOC specimens. These data are interrogated to identify gene
expression patterns related to overall length of patient survival and the phenotypic
differences between the invasive and low malignant potential EOC subtypes. These
analyses generated several validated sets of differentially regulated genes, many of which
were clinically relevant or previously implicated in other cancer types. The molecular
signatures identified were technically and biologically validated before bioinformatic
analyses to identify the key biological processes and functional relationships they
represent.
Comparison of the gene expression signatures deduced for patient survival and serous
low malignant potential vs. invasive cancer to studies of similar and disparate cancer
types was carried out. The universality of the molecular events regulated by these genes
in order to mediate survival and/or the malignant potential of EOC was evaluated. A
significant relationship involving the altered expression of interacting calcium-dependent
cell adhesion molecules was found to be important for both aspects of this disease.
Declaration
This is to certify that
(i) the thesis comprises only my original work towards the PhD except where
indicated in the Preface*,
(ii) due acknowledgement has been made in the text to all other material used,
(iii) the thesis is less than 100,000 words in length, exclusive of tables, maps,
bibliographies and appendices.
Ryan van Laar
Preface
The work presented in this thesis is the result of a number of collaborations. Samples of
ovarian cancer were kindly provided by Dr Georgia Chenevix-Trench of the Royal
Brisbane Hospital and Dr Anna DeFazio of the Westmead Millennium Institute, Sydney.
The preparation and hybridisation of tumour material to cDNA microarrays used in this
study were carried out by Sophie Katsabanis and Dileepa Diyagama of the Peter
MacCallum Cancer Centre Microarray Facility, Melbourne.
Clinical and gene expression data representing nine primary tumour types used to
generate a signature of primary EOC were provided by Dr Richard Tothill. An additional
dataset comprising gastric cancer gene expression profiles and associated follow up
information was provided by Dr Alex Boussioutas.
Tissue microarray construction and immunohistochemistry were carried out in
collaboration with Dr Melissa Robbie from St. Vincent’s Hospital, Melbourne and also Mr
Neil O’Callaghan and Dr Melanie Trivett, of the Peter MacCallum Cancer Centre
Pathology Department.
Acknowledgements
I am indebted to a large number of people for their invaluable assistance in the
completion of this thesis.
Firstly I would like to thank my primary supervisor, Professor David Bowtell, for
allowing me to complete this body of work under his guidance, in his excellent research
group and utilising the unparalleled framework of the Australian Ovarian Cancer Study.
I would also like to thank Andrew Holloway, for his involvement in the supervision of
this project and the mentoring, scientific or otherwise, I have received over the past six
rather eventful years. He is responsible for much of what I have learnt about cancer and
also for a great deal of the enjoyment, satisfaction and growth I have experienced
working at Peter Mac.
Other senior members of the Australian Ovarian Cancer Study (AOCS) including Georgia
Chenevix-Trench of the Royal Brisbane Hospital and Anna DeFazio of the Westmead
Millennium Institute, Sydney have welcomed me into the AOCS group and provided
valuable assistance and materials throughout.
Fundamental in my understanding of ovarian pathology has been Dr Melissa Robbie, who
worked extremely hard to review a large number of cases and provided much appreciated
assistance with the biological validation stages of this project.
I would also like to thank Nadia Traficante, Sian Fereday and Anna Tinker, who make a
great team and have been truly amazing people to interact with on a daily basis. Between
them they are responsible for much of the current, and no doubt future, success of the
AOCS.
Members of the Bowtell Lab, Microarray Core Facility and wider Peter Mac Research
Division have also played a significant role in the completion of this project. These
include Izi Haviv and Alex Boussioutas who between them have enough enthusiasm and
ideas for 100 people; Sophie Katsabanis, Dileepa Diyagama and Bianca Locandro, who
have been instrumental in creating a world-class microarray facility, and the many others
who make Peter Mac an outstanding place to work and study.
The ever-fashionable Linda Stevens deserves a special mention for her support and
friendship, which has also been one of the most enjoyable and reliable aspects of working
at Peter Mac.
Successful collaborations that resulted in high quality publications have arisen from my
interactions with Richard Tothill, Bedrich Eckhardt and Melissa Peart, whom I thank for
including me in their projects. An additional thank-you also to Melissa for her help at the
bench and also over the occasional, but much deserved Treasury Café latte.
Thank-you to my parents for giving me a good head-start in life and finally, to my closest
friends Dean Chesterman, Dane McManus, and Chris Sherman, whose laughter and
unfailing support I could not have done without.
Publications and presentations
The following publications arose out of collaborative work during this project:
Eckhardt, B. L., Parker, B. S., van Laar, R. K., Restall, C. M., Natoli, A. L., Tavaria, M.
D., Stanley, K. L., Sloan, E. K., Moseley, J. M., and Anderson, R. L. (2005). Genomic
analysis of a spontaneous model of breast cancer metastasis to bone reveals a role for the
extracellular matrix. Mol Cancer Res 3, 1-13.
Holloway, A. J., van Laar, R. K., Tothill, R. W., and Bowtell, D. D. (2002). Options
available--from start to finish--for obtaining data from DNA microarrays II. Nat Genet 32
Suppl, 481-489.
Peart, M. J., Smyth, G. K., van Laar, R. K., Bowtell, D. D., Richon, V. M., Marks, P. A.,
Holloway, A. J., and Johnstone, R. W. (2005). Identification and functional significance
of genes regulated by structurally different histone deacetylase inhibitors. Proc Natl Acad
Sci U S A 102, 3697-3702.
Tothill, R. W., Kowalczyk, A., Rischin, D., Bousioutas, A., Haviv, I., van Laar, R. K.,
Waring, P. M., Zalcberg, J., Ward, R., Biankin, A. V., et al. (2005). An expression-based
site of origin diagnostic method designed for clinical application to cancer of unknown
origin. Cancer Res 65, 4031-4040.
The following invited presentations were given based on this thesis:
February 2003: Bioinformatics tools for expression-based tumour classification.
Bioinformatics Workshop, St. Vincent’s Hospital, Melbourne.
October 2004: Microarray Profiling of Low Malignant Potential & Invasive Ovarian
Cancer. Familial Cancer 2004: Research and Practice - A combined meeting of kConFab
& Australian Ovarian Cancer Study (AOCS) & Family Cancer Clinics of Australia and
New Zealand, Couran Cove.
February 2005: Understanding invasive ovarian cancer by microarray analysis and
comparison with the low malignant potential phenotype. AACR Oncogenomics 2005,
San Diego, USA. AstraZeneca Travelling Scholar award recipient.
Table of Contents
1. Literature Review
1.1. Overview
1.2. Microarray technology and its impact on ovarian cancer research
1.2.1. Selection of an appropriate reference RNA for cDNA microarray analysis
1.2.2. The impact of microarray scanning hardware on gene expression data
1.2.3. Spatial bias in cDNA microarray data
1.3. Ovarian cancer
1.3.1. Clinical background
1.3.2. Histology and associated genetic aberrations
1.3.3. Current needs in ovarian cancer diagnosis and treatment
1.3.4. Molecular pathology of EOC and its relevance to patient prognosis
1.3.5. The ovarian tumour marker CA-125 and EOC prognosis
1.3.6. The use of DNA microarrays to discover novel biomarkers of EOC
1.3.7. Current status of microarray-based EOC prognostic signatures
1.4. Low malignant potential ovarian cancer
1.4.1. Molecular background and clinical information
1.4.2. Molecular characteristics of LMP tumours
1.4.3. Mucinous EOC and tumours metastatic to the ovary
1.4.4. Existing microarray profiling studies of LMP ovarian cancer
1.4.5. Other microarray studies of invasive vs. non-invasive cancer subtypes
1.5. Summary and goals of this thesis
2. Materials & Methods
2.1. Ethical Issues
2.1.1. Structure of ethical governance
2.1.2. Ethical use of human tissues
2.1.3. Patient Identifiers used in this thesis
2.1.4. Protection of privacy
2.1.5. Ethical contingencies
2.3. In-vitro methods
2.3.1. Construction of cDNA microarrays
2.3.2. Collection and processing of tumour samples
2.3.3. Construction of reference RNA pools
2.3.4. Target labelling
2.3.5. Slide hybridisation
2.3.6. RT-PCR
2.3.7. Tissue microarray construction
2.3.8. Immunohistochemistry
2.4. In-silico methods
2.4.1. Image capture and data extraction
2.4.2. Microarray image analysis
2.4.3. Normalisation of cDNA microarray data
2.4.4. Microarray data visualisation methods
2.4.5. Unsupervised identification of differential gene expression
2.4.6. Identification of genes differentially expressed between tumour subtypes
2.4.7. Machine-learning approaches for class prediction
2.4.8. Class Prediction
2.4.9. Gene ontology analysis
2.4.10. Quantification of IHC staining
3. Optimisation of cDNA microarray profiling for large-scale tumour profiling studies
3.1. Introduction
3.1.1. A method for quantification of spatial bias in cDNA microarray data
3.1.2. Reference RNA options for large-scale cDNA microarray profiling studies
3.1.3. The impact of experimental replication on the robustness of cDNA microarray gene expression measurements
3.1.4. The impact of scanning hardware on cDNA microarray data quality
3.2. Results
3.2.1. Develop a method for measuring the degree of spatial bias present on a cDNA microarray
3.2.2. Evaluation of reference RNA options for a large-scale tumour profiling study
3.2.3. Analysis of cDNA microarray slide scanning technology on data quality
3.3. Discussion
3.3.1. The use of Mood's Median Test to quantify spatial bias in cDNA microarray data
3.3.2. Evaluation of reference RNA options suitable for large-scale tumour profiling studies
3.3.3. Microarray scanners and cDNA gene expression data quality
3.4. General conclusions
4. Gene expression analysis of epithelial ovarian cancer overall survival
4.2.1. Case selection and pathology review aimed at ensuring suitability for arraying and outcome analysis
4.2.2. A descriptive statistical analysis of the study cohort
4.2.3. Processing of microarray data prior to investigation of molecular signatures of patient survival
4.2.4. Identification of genes differentially expressed between patient survival groups
4.2.5. Experimentation with normalisation algorithms to improve detection of survival-related gene expression
4.2.6. RT-PCR validation: selection of genes with minimum 2-fold change in expression between patient survival groups
4.2.7. Analysis of published gene lists for predicting EOC prognosis
4.2.8. Network and pathway analysis of genes differentially expressed between survival groups
4.3. Discussion
4.3.1. The impact of residual disease and distribution of survival times on the identification of genes related to length of survival
4.3.2. EOC heterogeneity and its impact on the success of genomic analyses
4.3.3. Attempts to identify gene expression patterns with statistically significant relationships to length of survival
4.3.4. Biological and clinical relevance of genes identified
4.3.5. General conclusions
5. Molecular analysis of invasive and low malignant potential ovarian tumours
5.1. Introduction
5.2. Results
5.2.1. Case selection and pathology review of suitable cases
5.2.2. Generation of cDNA microarray expression dataset
5.2.3. Creation of an EOC gene expression signature for assistance in confirmation of primary ovarian origin
5.2.4. Application of the trained predictive algorithms to the invasive/LMP dataset
5.2.5. Gene expression based prediction of EOC histological subtype
5.2.6. Confirmation of LMP/invasive status with gene expression prediction analysis
5.2.7. Identification of differentially expressed genes between serous LMP and serous invasive EOC
5.2.8. Molecular pathway analysis of the invasive and LMP EOC gene expression signature
5.2.9. Validation of selected differentially expressed genes with RT-PCR
5.2.10. Biological validation of the LMP/invasive expression signature
5.3. Discussion
5.3.1. Findings from this analysis and relevance to published studies of LMP or invasive EOC
5.3.2. Analysis of differentially expressed genes identified by multiple studies
5.3.3. Use of gene expression based predictive analysis to confirm specimen diagnosis and identify metastatic disease
5.3.4. Cell adhesion molecules and EOC malignancy
5.3.5. High throughput analysis of TMA IHC
5.4. Summary and conclusions from chapter
6. Discussion & Conclusion
6.1. Summary of major findings
6.1.1. Optimisation of microarray technology for large-scale tumour profiling studies
6.1.2. Gene expression based prediction of patient survival
6.1.3. Molecular characterisation of ovarian LMP and invasive epithelial cancer
6.1.4. EOC and the differential expression of genes involved in cell adhesion processes; a reoccurring theme
6.2. Future directions
6.2.1. Meta-analysis of gene expression datasets
6.2.2. Extension of cDNA expression dataset with Affymetrix GeneChip profiling
6.2.3. Translation of findings to in vivo studies of gene function and the potential for clinical application
6.2.4. Conclusion
Appendix A: FIGO staging of EOC
Appendix B: Specimens of EOC included in TMA
Appendix C: Details of pooled tumour and cell line reference RNAs
Appendix D: MMT scores from the analysis of spatial bias on LOOCV prediction accuracy
Appendix E: Members of the ‘response to stimulus’ gene ontology
Appendix F: Reference RNA comparison: Predictions of histological subtype
Appendix G: Genes with minimum two-fold mean expression differences between survival groups
Appendix H: Higher-level gene ontologies represented by genes differentially expressed between survival groups
Appendix I: Samples used to generate predictive gene expression signature of primary EOC
Appendix J: Output of prediction of primary ovarian origin for LMP and invasive EOC cohort
Appendix K: Predictive gene expression signature of primary EOC
Appendix L: KEGG pathways significantly represented in gene expression signature of serous LMP and invasive EOC
Appendix M: Microsoft Access gene ontology filter
Appendix N: Visual Basic script for batch export of IHC image histogram statistics
Appendix O: UniGene annotated genes included in thesis
Appendix P: Genes differentially expressed between serous LMP and invasive EOC after excluding those involved in cell-cycle regulation and the immune response
Appendix Q: Microarray images from Gilks et al study of LMP and invasive EOC
1. Literature Review
1.1. Overview
This review will focus on how microarray technology has evolved and been applied to
address some of the needs of ovarian cancer research. It will also cover some of the
advances in the microarray work flow that have increased the robustness and accuracy of
cDNA microarray-generated gene expression data to the point where research findings
are beginning to be applied in the clinic to impact on disease diagnosis and treatment. The
current understanding of the molecular basis of primary epithelial ovarian cancer (EOC)
development and progression will be discussed, including examples of how microarray
analysis has been applied to determine molecular signatures associated with a range of
clinically-relevant aspects of the disease.
As with any new technology in its early stages of development, microarrays have suffered
from a range of teething problems. These include incorrect clone annotations, technical
biases introduced by standard laboratory practices and pitfalls associated with the
application of traditional experimental designs and methods of statistical analyses to data
of a structure and magnitude previously unfamiliar to many biomedical researchers.
Methods for addressing many of these issues are reviewed and the outstanding needs that
form the basis of this project are highlighted.
1.2. Microarray technology and its impact on ovarian cancer research
As knowledge about the underlying molecular mechanisms for human cancers has
accumulated, the full extent of their complexity, cellular origins, interactions with
non-cancerous tissue, and other previously unconsidered aspects have become apparent
(Hanahan and Weinberg, 2000; Liotta and Petricoin, 2000). At the same time, advances in
the fields of laboratory robotics and desktop computing have enabled a rapid increase in
the rate at which information about the individual components of the human genome can
be generated, stored and exploited. As a result, discoveries based on molecular
information generated from high-throughput technologies such as microarrays are today
occurring faster than ever before (Ochs and Godwin, 2003).
Whilst not specifically designed for cancer research, microarray technology has been one
of the most significant advances in cancer research in recent years. The application of this
technology to cancer research has arisen from the recognition that cancer is primarily a
disease of the genes (Hanahan and Weinberg, 2000; Holloway et al., 2002). The field of
microarrays has undergone a rapid evolution, from nylon membrane arrays with fewer
than 100 unique genes (Chen et al., 1998) to the latest commercial “whole-genome”
single-chip oligonucleotide arrays which contain sequence-verified clones for every
known and purported gene in the human genome (Woo et al., 2004).
Two main types of gene expression microarrays are presently used for cancer
research: spotted glass slide arrays and in situ synthesised oligonucleotide arrays. Both
types are based on the concept of DNA fragments (‘probes’) of known identity positioned
at high density on a solid support. Glass slide cDNA microarrays can be produced with
equipment that is in the economic reach of many academic or smaller-scale research
facilities, whereas the specialised machinery required for in situ synthesised
oligonucleotide arrays limits their production to commercial settings (Singh-Gasson et al.,
1999). Commercial cDNA and oligonucleotide microarrays can be purchased from
companies such as Agilent Technologies and Affymetrix, respectively, and are a common
alternative to in-house manufacturing (Bowtell, 1999; Holloway et al., 2002). A diagram
summarising the key differences in the chemical processes used by each array type in
order to measure gene expression is shown in Figure 1-1.
Although the array-to-array variability of spotted arrays, particularly those created at
smaller microarray facilities, has the potential to be quite large, the relative nature of the
gene expression measurements produced effectively controls for this source of variation
(Woo et al., 2004).
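The way a relative measurement absorbs array-to-array variation can be illustrated with a minimal sketch. It assumes, as a simplification, that variation in the amount of probe deposited scales the test and reference signals of a feature by the same factor; the function name, intensities and scaling factors are hypothetical and chosen only for illustration.

```python
def expression_ratio(test_intensity, reference_intensity):
    """Two-colour expression ratio: test intensity divided by reference intensity."""
    return test_intensity / reference_intensity


# Hypothetical gene whose true test:reference abundance is 2:1.
true_test, true_reference = 2000.0, 1000.0

# 'spot_factor' stands in for array-to-array variation in the amount of probe
# deposited; in this simplified model it scales both channels equally.
for spot_factor in (0.5, 1.0, 1.8):
    test = true_test * spot_factor
    reference = true_reference * spot_factor
    print(spot_factor, test, reference, expression_ratio(test, reference))
```

Under this assumption the absolute intensities differ between arrays while the ratio stays at approximately 2. Sources of variation that affect the two channels unequally would not cancel in this way.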
A key difference with one-colour platforms, such as the Affymetrix GeneChip, is that a
single biological sample is hybridised to the in-situ synthesised oligonucleotides present
on the glass substrate. The precision of the measurement is achieved by the minimisation
of array-to-array variability and the highly controlled environment in which the arrays are produced.
Success using either spotted cDNA or in-situ synthesised oligonucleotide microarrays
depends on tightly controlled array production and hybridisation methods because of the
intrinsic qualities of each platform type and the minute physical amounts of genetic
material actually being quantified (Lockhart et al., 1996).
Figure 1-1: Hybridisation properties and differences of cDNA and oligonucleotide microarrays. (A) For cDNA microarrays, three elements are involved in the generation of gene expression measurements. Firstly, the arrays are prepared by depositing thousands of individual nanolitre amounts of concentrated PCR product, produced from cDNAs, onto a glass substrate in a predefined grid pattern. Next, fluorescently labelled cDNAs obtained from two RNA sources (usually a ‘test’ sample which is compared to a ‘reference’ sample) are competitively hybridised to the prepared substrate. The relative fluorescence intensity measured for each label, per spotted feature, is used to determine a gene expression ratio, in the form of test intensity divided by reference intensity. (B) Oligonucleotide in-situ synthesised microarrays, such as the Affymetrix GeneChip, rely on a direct hybridisation of labelled transcript from a sample of interest to up to 20 micro squares of 25-mer oligonucleotides for each gene present on the array. Each of these squares includes perfect and mismatch pairs for each probe or feature. The intensity from the mismatched probes is subtracted from the perfect matches and an average is determined. (Gibson, 2002)
The amount of variation within a technology and the amount of agreement between the
presently available platforms are crucial issues that are still being addressed by the
microarray field (Baker et al., 2005). A number of studies have been carried out in which
data from different platforms has been analysed to determine the robustness of gene
expression measurements made with these technologies, but there is no clear consensus.
Some studies have shown a significant divergence exists across platforms (Kuo et al.,
2002; Rogojina et al., 2003; Tan et al., 2003), while others state that the level of
concordance is acceptable (Ishii et al., 2000; Yuen et al., 2002). To date no publications
have appeared in which spotted cDNA and in-situ synthesised oligonucleotide
microarray data have been analysed in parallel, possibly reflecting the divergence noted by
some studies.
As summarised in Figure 1-2, generating data from a cDNA microarray involves four
major steps, all of which can be carried out with assistance from the Microarray Core
Facility at the Peter MacCallum Cancer Centre (Melbourne, Australia). These steps are:
(i) Preparation of ready-to-print cDNA probes and precise depositing onto glass
slides.
(ii) Extraction of RNA from test and reference biological specimens (e.g. tissue,
cell lines), reverse transcription, Cy3/Cy5 dye labelling and hybridisation of
the target to the printed slide.
(iii) Scanning of slide with a high-resolution imaging device at two laser
wavelengths.
(iv) Analysis of scanned image and quantification of the bound target as
numerical gene expression ratios.
Recently, a standard data format for recording the specific steps in a microarray
experiment has been proposed by a committee of microarray users and organisations and
has since been adopted by a large proportion of journals and publishers. The Minimal
Information About a Microarray Experiment (MIAME) standard describes a minimum set
of information that scientists are required to provide about gene expression data to ensure
that it can be easily interpreted and that results derived from its analysis can be
independently verified (Brazma et al., 2001). The introduction of this standard has
addressed some of the early problems with microarray studies, in which results could not
be replicated because the information required to create the actual microarray slides or
process the biological specimens was not provided with the processed findings.
In addition to the laboratory-based processes of microarray fabrication and usage, data
management is an important part of microarray research. Approximately 15 different
measurements for each feature on a cDNA array can be generated depending on the
image analysis software used, describing foreground and pixel intensity, spot size and
shape, foreground and background variation and a range of quality measures. This can
result in almost 160,000 individual data points per 10.5k cDNA microarray hybridisation,
which presents data storage and manipulation challenges that must be met by the use of
complex relational databases, such as BASE (Saal et al., 2002).
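The scale of the figure quoted above can be checked with simple arithmetic. The sketch below assumes that "10.5k" corresponds to 10,500 features; the function name is hypothetical.

```python
def data_points_per_hybridisation(n_features, measurements_per_feature):
    """Total number of values recorded for one hybridised array."""
    return n_features * measurements_per_feature


n_features = 10_500   # assumption: "10.5k" taken as 10,500 features
measurements = 15     # approximate number of measurements per feature
total = data_points_per_hybridisation(n_features, measurements)
print(total)  # 157500
```

This is consistent with the "almost 160,000" data points per hybridisation stated in the text.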
Figure 1-2: Schematic diagram of the cDNA microarray workflow. The process can be viewed in three stages: (i) Probe preparation: Thousands of cDNAs of known identity are prepared in large quantities and robotically printed in a grid structure onto a glass substrate. (ii) Target preparation: RNA from the tissue or cell line of interest is extracted, purified and labelled with either a Cy3 (green) or Cy5 (red) dye before being competitively hybridised to the printed microarray. (iii) Data analysis: Specific-wavelength lasers are used to excite the probes bound to the microarray surface and a high resolution TIFF image for each dye is created. Image analysis software is used to convert the image to numerical data which is then analysed in the form of gene expression ratios.
1.2.1. Selection of an appropriate reference RNA for cDNA microarray analysis
The competitive hybridisation design of cDNA microarrays allows the researcher the
flexibility of choosing the reference RNA that best suits the experimental design. The
intensity measurements obtained from the amount of Cy3-labelled reference RNA bound
to each probe (or feature) on the array are used as the denominator value in the
calculation of the final expression ratios. The choice of an appropriate reference RNA is
therefore essential, as it has the potential to impact significantly on the entire expression profile
generated. For example, if a particular feature on the array is not bound by a labelled
reference target, no expression ratio can be generated, even if the probe is bound by the
Cy5-labelled sample material.
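The role of the reference channel as the denominator can be sketched as follows. This is a minimal illustration, not the calculation used by any particular image analysis package: the function name, the `min_reference` cut-off and the example intensities are all assumptions.

```python
import math


def log2_ratio(cy5_test, cy3_reference, min_reference=0.0):
    """Return log2(test / reference), or None when no ratio can be formed.

    A feature with no usable reference signal (at or below min_reference)
    yields no ratio, regardless of how strongly the test sample is bound.
    """
    if cy3_reference <= min_reference or cy5_test <= 0:
        return None
    return math.log2(cy5_test / cy3_reference)


# Hypothetical (Cy5 test, Cy3 reference) intensities for three features.
features = {"feature_1": (800.0, 200.0),
            "feature_2": (150.0, 600.0),
            "feature_3": (900.0, 0.0)}   # bound by the sample, not the reference

for name, (test, reference) in features.items():
    print(name, log2_ratio(test, reference))
```

In this sketch the third feature returns no value even though the test channel is bright, which is the situation described above.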
The most common options used in large-scale cDNA microarray experiments are
commercially available RNA stocks such as the Stratagene Human Reference RNA
(Stratagene, USA), genomic DNA (Gadgil et al., 2005), pooled RNA from all (or a
subset) of the samples actually being investigated (van 't Veer et al., 2002), or a ‘home
grown’ universal RNA produced from cell lines, such as the Stanford pooled 11 cell line
reference (Khan et al., 1998; Ross et al., 2000). While this decision has a major impact on
the final microarray data, few comparative studies have been carried out to determine the
impact, if any, of the type of RNA used (Novoradovskaya et al., 2004; Weil et al., 2002).
The concept of using a pool of cell-line derived RNA as a universal experimental
reference was first introduced by Ross et al (Ross et al., 2000) who combined an equal
mixture of RNA from 12 different cell lines to create a gene expression ‘baseline’ for a
microarray comparison of 60 different cell lines (known as the NCI 60 (Stinson et al.,
1992)). The pool comprised RNA extracted from a range of cell lines known to
have maximally diverse gene expression, based on previously conducted two-dimensional
gel analyses (Khan et al., 1998). These included HL-60 (acute myeloid
leukemia) and K562 (chronic myeloid leukemia); NCI-H226 (non-small-cell-lung);
OVCAR-3 and OVCAR-4 (ovarian); CAKI-1 (renal); PC-3 (prostate); and MCF7 and Hs578T
(breast).
By excluding those genes on the microarray without significant intensity readings in the
reference channel, 6,831 of the 9,703 total cDNA features were identified, indicating that
the reference pool successfully bound to 70.4% of the particular microarray used. Other
groups have reported over 90% array coverage from pooled-cell line universal references
(Bergstrom et al., 2002); however, this figure is dependent on the type of array used and the
method for calculating the number of successful reference channel hybridisations.
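Array coverage of this kind is a simple proportion, as the sketch below shows. The intensity cut-off used to decide whether a feature has a "significant" reference signal is an assumption here, since the criterion differs between studies; the function name and example intensities are hypothetical.

```python
def array_coverage(reference_intensities, min_intensity):
    """Percentage of features whose reference-channel intensity exceeds min_intensity."""
    detected = sum(1 for value in reference_intensities if value > min_intensity)
    return 100.0 * detected / len(reference_intensities)


# Figures quoted above for the pooled cell line reference.
print(round(100.0 * 6831 / 9703, 1))  # 70.4

# Hypothetical reference-channel intensities for five features.
print(array_coverage([1500, 40, 820, 15, 300], min_intensity=100))  # 60.0
```

Because the result depends directly on the chosen cut-off, coverage figures from different studies are only comparable when the detection criterion is the same.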
Reference RNA stocks generated from tumour cell lines have the advantage of being
scalable because of the unlimited growth potential of the cell lines used, however there
are concerns over batch-to-batch variations arising from the use of different passages of
cells as well as changes in gene expression patterns resulting from minor variation in
culture conditions (Holloway et al., 2002; Sterrenburg et al., 2002).
Interestingly, Yang et al (Yang et al., 2002a) determined that pooling of RNA from a
small number of tissue samples or cell lines with diverse gene expression profiles can be
superior to the use of more complex RNA mixes. It was hypothesised that while some
cell lines actively express more genes than others, the level at which each gene is
expressed can vary between individual lines. Thus by adding more cell lines to a pool,
those genes expressed at lower levels may be diluted to a level at which they are
undetectable to the microarray platform. Yang et al demonstrated that using a
combination of only three cell lines from dissimilar tissues gives similar array coverage to
the commercial Stratagene universal reference, composed of RNA isolated from ten
different lines (Stratagene, USA).
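The dilution argument can be made concrete with a small numerical sketch. It assumes equal-parts pooling, a linear relationship between abundance and signal, and an arbitrary detection limit; the expression levels and function names are hypothetical and are not taken from Yang et al.

```python
def pooled_level(levels_per_line):
    """Expression level of one gene in an equal-parts pool of cell lines."""
    return sum(levels_per_line) / len(levels_per_line)


def detectable(levels_per_line, detection_limit):
    """True when the pooled level reaches the platform's detection limit."""
    return pooled_level(levels_per_line) >= detection_limit


DETECTION_LIMIT = 10.0  # assumption: arbitrary units

# A gene expressed at 60 units in a single cell line and absent from the rest.
three_line_pool = [60.0, 0.0, 0.0]
ten_line_pool = [60.0] + [0.0] * 9

print(pooled_level(three_line_pool), detectable(three_line_pool, DETECTION_LIMIT))
print(pooled_level(ten_line_pool), detectable(ten_line_pool, DETECTION_LIMIT))
```

With these illustrative numbers the gene is detectable in the three-line pool (20 units) but falls below the limit in the ten-line pool (6 units).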
Genomic DNA is readily available, inexpensive, invariant over time and between
laboratories and represents all genes with a uniform signal, rendering it a theoretically
useful reference for competitive hybridisation. Mouse genomic DNA has been
demonstrated to have an extremely high coverage of a 16k mouse microarray and
outperformed the Stratagene Universal Mouse Reference RNA (Stratagene, USA) in this
regard (Williams et al., 2004). A benefit of using genomic material (or cDNA) is its
ability to identify low abundance genes that may be undetectable or unstable with the use
of RNA references. With newer arrays including more genes expressed at relatively low
abundance, this may be an important factor in future evaluations of reference RNA
options.
The differences between the use of genomic DNA and pooled RNA were studied
by Kim et al (Kim et al., 2002a). The results from this comparison indicated that genomic
DNA was the inferior option on the basis of a decreased correlation seen in self-self
hybridisations. The accuracy of data obtained with the use of a pooled RNA reference
was comparable to that achieved with self-self hybridisation of a single sample of RNA,
as shown in Figure 1-3. Self-self hybridisations are carried out by labelling a stock of
RNA with both Cy3 and Cy5 and hybridising to a single microarray, resulting in
theoretically perfect 1:1 expression ratios for all genes detectably hybridised. The number
and identity of differentially expressed genes were concordant between the pooled RNA
arrays and the direct hybridisations, but varied substantially from the genomic-DNA
hybridised slides.
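The correlation measure used to assess self-self hybridisations can be sketched in a few lines. The function below is a plain implementation of the Pearson coefficient; the log intensities are hypothetical and are not data from Kim et al.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical log2 intensities of the same RNA labelled with Cy3 and Cy5.
cy3 = [8.1, 9.4, 10.2, 11.7, 12.5, 13.9]
cy5 = [8.3, 9.1, 10.4, 11.5, 12.8, 13.7]

print(round(pearson_r(cy3, cy5), 3))
```

An ideal self-self hybridisation would place every feature on the 1:1 line and give r = 1; values below this reflect technical noise rather than differential expression.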
Sterrenburg et al reported a method for using the pooled cDNA products actually used to
print the microarray as a reference material (Sterrenburg et al., 2002). This method was
shown to yield excellent array coverage (>99%), allowing expression ratios to be
calculated for virtually every array feature, although it was not compared to the other
reference types described in this section and would result in cross-experiment analyses
being restricted to data generated from the same microarray platform.
To date, no comprehensive analyses of reference RNA types for tumour profiling have
been published, particularly for aspects other than relative array coverage. Questions still
exist around the use of a project-specific pool of sample RNA versus a ‘universal’ cell
line reference for such tasks as the identification of discriminating genes between
histological subtypes of a given cancer type, predictive machine-learning analyses or
accuracy of any quality control features contained on the microarray.
Figure 1-3: Comparison of reference RNA options via self-self hybridisation. Scatter plots of self versus self hybridization intensities for (A) genomic DNA (gDNA), (B) a pool of RNA from 3 separate isolations, and (C) RNA from a single isolation. Pearson correlation coefficients (r) for the two channels are shown in each plot. (Kim et al., 2002a)
1.2.2. The impact of microarray scanning hardware on gene expression data
The microarray scanner is one of the most expensive and important pieces of equipment
in a cDNA microarray laboratory. By scanning hybridised microarray slides and
generating the high-resolution electronic images that are converted to numerical
expression data, the scanner is effectively the bridge between the in vitro and ‘in silico’ or
bioinformatic stages of an experiment. Due to the rapidly expanding market for microarray
products since the technology’s inception, many companies have introduced scanners
with increasingly sophisticated features. Furthermore, within each scanner type the
settings that control the laser power and photomultiplier tube (PMT) voltage can be either
varied by the operator or controlled by electronic feed-back systems, in response to the
characteristics of the particular slide being scanned (Holloway et al., 2002).
All microarray scanners have a limited range of feature intensity detection, outside of
which the measurements are unreliable, as described by Lyng et al (Lyng et al., 2004). At
the higher end of the spectrum (>50,000 pixel intensity in the Lyng study) saturation of
the detectors became a source of significant error. In recognition of this, the image
analyses carried out for this thesis contained a filter to exclude array features with three
percent or higher pixel saturation. To avoid reducing low-intensity features to an
undetectable level by reducing the overall laser power, Lyng et al suggest scanning each
microarray twice – once at a low PMT setting then again at a higher setting, followed by
the use of a novel algorithm for excluding faint or saturated spots respectively. Whilst this
approach may be suitable for smaller array experiments, the amount of data duplication
and extra image analysis that would be required is unfeasible for most larger-scale
projects.
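A saturation filter of the kind described above can be sketched as follows. The sketch assumes a 16-bit image in which a pixel value of 65,535 indicates saturation; the function names and pixel values are hypothetical and do not reproduce the image analysis software actually used.

```python
SATURATED_PIXEL_VALUE = 65535  # assumption: 16-bit scanner image


def saturated_fraction(pixels, saturation_value=SATURATED_PIXEL_VALUE):
    """Fraction of a feature's pixels at or above the saturation value."""
    return sum(1 for p in pixels if p >= saturation_value) / len(pixels)


def passes_saturation_filter(pixels, max_fraction=0.03):
    """Keep a feature only if fewer than max_fraction of its pixels are saturated."""
    return saturated_fraction(pixels) < max_fraction


# Hypothetical features of 100 pixels each.
mostly_unsaturated = [42000] * 98 + [65535] * 2   # 2% saturated
at_threshold = [42000] * 97 + [65535] * 3         # 3% saturated

print(passes_saturation_filter(mostly_unsaturated))  # True
print(passes_saturation_filter(at_threshold))        # False
```

The comparison is strict so that a feature with exactly three percent saturated pixels is excluded, matching the "three percent or higher" criterion.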
Few direct comparisons of microarray scanning hardware have been published to date.
One such study is that by Ramdas et al (Ramdas et al., 2001b) in which three types of
scanners were compared, although the identity of each was not revealed. The main
differences between the scanners were summarised as follows: Scanner A was a
four-laser-based imaging system that used PMT detectors and a proprietary dark-field
illumination to minimize background signal; Scanner B was a simultaneous dual laser
scanner with a large field depth of 60 µm; and Scanner C used a patented confocal laser
scanning system with the capability to automatically calibrate the PMT.
For the comparison of these scanners, a single image analysis package was used, so as to
avoid introducing variation based on the use of different image analysis algorithms. The
correlation between data generated by different scanners was in the range of r = 0.90 to
r=0.96, which was not significantly higher than the correlation obtained from scanning
one slide multiple times on a single machine (r=0.93). This indicates that variation
generated from the use of different scanners is equivalent to that generated from scanning
the same slide multiple times on a single machine. Furthermore, the most differentially
expressed genes, as assessed by a 3-fold change in expression, exhibited a 95%
agreement in identity between all three scanners. No gene expression quality control
measures were analysed in this study to determine if a significant difference existed in the
accuracy of data produced from these three scanners. Levels of spatial bias, variation in
background intensity or the overall dynamic range of the data are also important measures
of scanner performance that were not tested in this comparison. In addition, no
information about the normalisation algorithm used was given, making it difficult to
extend the findings to other datasets or laboratories. The authors state that their findings
indicate data from disparate microarray scanners can be interchanged and successfully
analysed; however, the limitations of this comparison, as described, should be addressed
before accepting these conclusions.
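One way in which agreement between scanners on differentially expressed genes could be measured is sketched below. The definition of agreement used here (genes called by every scanner as a percentage of genes called by any scanner) is an assumption, as the study does not describe how its 95% figure was calculated; the gene identifiers and ratios are hypothetical.

```python
def differentially_expressed(ratios, fold=3.0):
    """Gene IDs whose expression ratio is at least `fold` up or down."""
    return {gene for gene, r in ratios.items() if r >= fold or r <= 1.0 / fold}


def percent_agreement(gene_sets):
    """Genes called by every scanner as a percentage of genes called by any scanner."""
    union = set().union(*gene_sets)
    common = set.intersection(*gene_sets)
    if not union:
        return None  # no gene passed the threshold on any scanner
    return 100.0 * len(common) / len(union)


# Hypothetical expression ratios for the same slide read on two scanners.
scanner_a = {"g1": 3.5, "g2": 0.25, "g3": 1.1, "g4": 4.0}
scanner_b = {"g1": 3.2, "g2": 0.30, "g3": 0.9, "g4": 2.8}

calls = [differentially_expressed(scanner_a), differentially_expressed(scanner_b)]
print(calls)
print(percent_agreement(calls))
```

In this example gene g4 passes the 3-fold threshold on one scanner only, which lowers the agreement even though the two sets of ratios are closely correlated. This illustrates why correlation and fold-change agreement are reported as separate measures.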
During the course of this project, the Peter MacCallum Cancer Centre (Peter Mac)
Microarray Facility acquired a new microarray scanner manufactured by Agilent
Technologies (USA). The Agilent Microarray Scanner BA was claimed to offer
substantial improvements in the quality of cDNA expression data when compared to
other scanners, through the inclusion of features such as ‘Sure Scan’ technology, whereby
the focal point of the lasers is dynamically maintained throughout the duration of the
scan. This is a point of difference compared to other scanners, such as the Packard
Scanarray 5000 (Packard Bioscience, USA) in which the scanning lasers are focused
before the beginning of the scan and the focal point maintained constant for the duration
of the scan. Despite claims about the benefits of such hardware advances reducing the
level of systematic noise in cDNA microarray data, the actual benefits appear not to have
been rigorously tested outside of the manufacturer’s own literature, an example of which is
shown in Figure 1-4.
Figure 1-4: Representation of background fluorescence intensity variation with and without dynamic auto-focus, a feature of the Agilent Microarray Scanner. A trend towards lower background intensity measurements in relation to the physical location of the feature can be observed, as reflected by darkened upper-right corner of the lower scanned image (Agilent Technology, USA).
1.2.3. Spatial bias in cDNA microarray data
In some microarray hybridisations, the values of the expression ratios are dependent on
their physical location on the array, more so than their true expression in the specimen of
interest. This is known as spatially dependent gene expression and has been observed in
cDNA microarray data by several groups and identified as a significant source of technical
error (Lee, 2004; Miles, 2001; Quackenbush, 2002; Yang et al., 2002b).
False colour representations of microarrays are an effective way of visualising these
patterns, as shown in Figure 1-5. This type of systematic noise can be caused by a range
of factors including small variations in the dimensions of printing tips used to spot
individual microarray features onto the glass substrate, inadequate distribution of the
labelled target during the hybridisation stage, variation in the thickness of the glass
substrate or a slight angle in the position of the hybridised slide during the scanning
process.
The spatial arrangement of probes on the array can also lead to the appearance of
spatially dependent patterns of differential expression (Balazsi et al., 2003); however,
some randomisation of probe types (by known gene function or sequence homology) is
usually incorporated into the assignment of probes throughout the array layout to avoid
this factor, as was the case with the Peter Mac 10.5k cDNA microarray used for this
thesis. Often spatial bias appears as a gradual effect from one corner of the array to that
diagonally opposite (Figure 1-5); however, as shown in Figure 1-6, a spatially dependent
variation in expression ratios can occur in other patterns, depending on its cause.
Figure 1-5: False-colour or ‘virtual array’ images representing different components of a microarray affected by spatial bias. (A) Probe intensities of the Cy5 channel, (B) Corresponding background intensities for the same channel. The gradual fading of intensities can be observed in the background-subtracted image in (C). (Lee, 2004)
Figure 1-6: Position effect or spatial bias in cDNA microarray data as visualised by a high-density graph of relative fold change vs. array position. This method of visualisation shows that several different patterns of spatial bias can be evident in a dataset, this particular slide generating a Cy5 bias in approximately the second quarter of the data set (Miles, 2001)
Two methods for addressing this issue that are commonly used by researchers are print-
tip lowess normalisation (Yang et al., 2002b) and the Statistical Normalisation of
Microarray Data (SNOMAD) method (Colantuoni et al., 2002). Both methods use the
robust local linear regression (lowess) curve fitting algorithm (Cleveland, 1979) to
identify a line-of-best-fit through non-linear data. The Yang et al method groups
individual gene expression measurements into ‘bins’ for normalisation according to the
printing tip used to spot the respective array feature onto the slide. This algorithm has the
benefit of allowing the identification of individual tips that may be releasing too much or
too little cDNA with each printing cycle. A line is fitted through the data for each print tip
using the lowess curve fitting method. Next, this curve is corrected to fit a linear 1:1
intensity line and the amount of correction required at each point of the line is applied to
the individual expression points, effectively correcting for any variation between printing
tips.
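The print-tip procedure described above can be sketched in simplified form. The sketch assumes the usual formulation in which the log ratio M = log2(Cy5/Cy3) is fitted against the mean log intensity A = 0.5 × log2(Cy5 × Cy3) within each print-tip group. To remain dependency-free it uses a single-pass locally weighted linear fit with tricube weights in place of the full robust lowess procedure (no robustness iterations), and all function names and example values are hypothetical.

```python
import math
from collections import defaultdict


def local_linear_fit(x, y, frac=0.5):
    """Locally weighted linear fit of y on x, evaluated at every x.

    Simplified stand-in for lowess: tricube weights over the nearest
    `frac` of the points, with no robustness iterations.
    """
    n = len(x)
    k = min(n, max(2, math.ceil(frac * n)))  # points used in each local fit
    fitted = []
    for x0 in x:
        distances = sorted(abs(xi - x0) for xi in x)
        bandwidth = distances[k - 1] or 1e-12  # distance to the k-th nearest point
        weights = [(1.0 - min(abs(xi - x0) / bandwidth, 1.0) ** 3) ** 3 for xi in x]
        total = sum(weights)
        mean_x = sum(w * xi for w, xi in zip(weights, x)) / total
        mean_y = sum(w * yi for w, yi in zip(weights, y)) / total
        sxx = sum(w * (xi - mean_x) ** 2 for w, xi in zip(weights, x))
        sxy = sum(w * (xi - mean_x) * (yi - mean_y) for w, xi, yi in zip(weights, x, y))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(mean_y + slope * (x0 - mean_x))
    return fitted


def print_tip_normalise(cy5, cy3, tips, frac=0.5):
    """Subtract a per-print-tip fitted curve from each feature's log ratio."""
    m = [math.log2(r / g) for r, g in zip(cy5, cy3)]
    a = [0.5 * math.log2(r * g) for r, g in zip(cy5, cy3)]
    groups = defaultdict(list)
    for index, tip in enumerate(tips):
        groups[tip].append(index)  # bin features by the tip that printed them
    normalised = list(m)
    for indices in groups.values():
        curve = local_linear_fit([a[i] for i in indices], [m[i] for i in indices], frac)
        for i, fitted_value in zip(indices, curve):
            normalised[i] = m[i] - fitted_value
    return normalised


# Hypothetical array: tip 1 over-reports Cy5 two-fold, tip 2 under-reports it two-fold.
cy3_example = [100.0, 200.0, 400.0, 800.0, 1600.0] * 2
cy5_example = [2.0 * g for g in cy3_example[:5]] + [0.5 * g for g in cy3_example[5:]]
tips_example = [1] * 5 + [2] * 5
print([round(v, 6) for v in print_tip_normalise(cy5_example, cy3_example, tips_example)])
```

In this constructed example the only signal is the tip-specific offset, so the normalised log ratios are all close to zero. Inspecting the fitted curve for each tip before subtraction is what allows an individual faulty tip to be identified.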
Because not all spatial irregularity is caused by variation in print-tip dimensions or
similar printing attributes, this method may not always be effective for minimising
spatially dependent bias in cDNA microarray data. The SNOMAD method uses the
physical X-Y (i.e. Row X, Column Y) coordinates of each array measurement and adjusts
each according to a mean intensity that is determined locally across the microarray
surface (Colantuoni et al., 2002). This technique is a multi-step approach and first
involves normalising the array to its median expression ratio in order to assist in
visualising the spatial bias present. The main point of difference between SNOMAD and
print-tip normalisation is the two-dimensional local estimation of mean hybridisation
intensity that is used to normalise each array feature. Again, the lowess function is used to
estimate the local mean intensity as a function of its specific location within the array.
The area or ‘window’ of the array used in this estimation can be controlled by the user.
Because this method is not limited to grouping data points into predetermined categories
associated with only one of the causes of spatial bias, such as printing tips, it is potentially
a more versatile approach to addressing this issue associated with cDNA microarrays.
However, both the print-tip and SNOMAD methods take into account the location of a
microarray feature, and both are therefore effective for correcting spatial bias in cDNA
expression data. Print-tip normalisation is available through the Bioconductor analysis
package (Gentleman et al., 2004) and also has recently been implemented through an
online interface: http://gepas.bioinfo.cnio.es (Herrero et al., 2003), similar to SNOMAD
(http://pevsnerlab.kennedykrieger.org/snomad.php). While normalisation methods such as
those described are effective for correcting spatial bias, it can be difficult to determine
when this type of normalisation is required and the extent to which the bias is reduced, as
neither method described provides a quantification of the level of bias present.
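The two-dimensional local adjustment that distinguishes SNOMAD from print-tip normalisation can be illustrated with a simplified sketch. A moving-window mean is swapped in here for the lowess-based local estimate used by SNOMAD, purely to keep the example short and dependency-free; the function name, window size and example grid are hypothetical.

```python
def local_mean_correct(log_ratios, half_window=1):
    """Subtract the mean log ratio of the surrounding window from each feature.

    `log_ratios` is a grid (list of rows) indexed by array row and column.
    The window extends `half_window` features in each direction and is
    truncated at the edges of the array.
    """
    rows, cols = len(log_ratios), len(log_ratios[0])
    corrected = []
    for r in range(rows):
        corrected_row = []
        for c in range(cols):
            neighbours = [log_ratios[i][j]
                          for i in range(max(0, r - half_window), min(rows, r + half_window + 1))
                          for j in range(max(0, c - half_window), min(cols, c + half_window + 1))]
            local_mean = sum(neighbours) / len(neighbours)
            corrected_row.append(log_ratios[r][c] - local_mean)
        corrected.append(corrected_row)
    return corrected


# Hypothetical 5 x 5 array with a left-to-right gradient in log ratio.
gradient = [[0.1 * col for col in range(5)] for _ in range(5)]
for row in local_mean_correct(gradient):
    print([round(v, 3) for v in row])
```

The size of the window plays the same role as the user-controlled area described above: a very small window would also remove genuine local differences in expression, whereas a very large one would leave localised bias uncorrected. Truncation of the window at the array edges means that edge features are corrected less completely in this simplified version.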
As well as normalisation algorithms, various aspects of the laboratory-based stages of the
microarray workflow may be adjusted to minimise the introduction of spatial bias into
cDNA expression data. These include changes to hybridisation methods as new
techniques are proposed and validated (McQuain et al., 2004; Yuen et al., 2003) and
scanning equipment as previously discussed in section 1.2.2. Despite spatial bias being an
obvious problem for cDNA microarray experimentation, to date no method for
quantifying its extent has been described in the literature.
1.3. Ovarian cancer
1.3.1. Clinical background
Three main categories of ovarian cancer exist: epithelial, stromal and germ cell tumours,
each having a distinct aetiology and clinical course. Of all gynaecological cancers, EOC
is the most common and has the poorest prognosis, rendering it the fifth leading cause of
female cancer deaths world-wide (Ries LAG, 2004). In patients where the disease is
confined to the ovaries (FIGO Stage 1 – See Appendix A) surgery alone can achieve a
cure in up to 90% of cases. However for the 80% of patients who present with more
advanced disease (FIGO stages 2-4), combined therapy of debulking surgery and
chemotherapy is required (Agarwal and Kaye, 2003). Platinum agents such as cisplatin
and carboplatin are the most active and frequently used chemotherapeutic drugs for
ovarian cancer. Recent randomised trials have suggested additional benefits of adding
taxanes to platinum drugs (Harper, 2002). The standard treatment for Australian women
with ovarian cancer is presently a combination of carboplatin and paclitaxel (Harries and
Gore, 2002a; Markman et al., 2001; Marsden et al., 2000; Piccart et al., 2000).
While survival times have significantly increased over the past 20 years, this has not
correlated with an equally significant improvement in the cure rate (Engel et al., 2002).
Development of drug resistance is a large factor in this statistic, with the majority of
women who are diagnosed with more advanced stages of ovarian cancer
eventually experiencing relapse following their initial treatment and ultimately dying
from drug-resistant disease (Agarwal and Kaye, 2003). Drug-resistant disease is observed
in more than 75% of cases four years from diagnosis and consequently the 5-year survival
rate in Australia is around 42%, lower than the 63% mean combined 5-year survival rate
for all other female cancer sufferers between 1992 and 1997 (Australian Institute of
Health and Welfare and Australasian Association of Cancer Registries, 2001).
1.3.2. Histology and associated genetic aberrations
EOC is classified into five main histological categories according to the cellular
appearance of the tumour. These classes are serous, mucinous, endometrioid, clear cell
and transitional cell (the latter sometimes referred to as Brenner tumours) (World Health
Organization, 1999). The resemblance of the differentiation present in a tumour to other
tissues is the basis of the classifications. Serous tumours most closely resemble fallopian
tube epithelium, mucinous tumours the gastrointestinal tract or endocervical epithelium,
endometrioid tumours the proliferative endometrium, clear cell tumours the gestational
endometrium and transitional cell tumours the epithelium of the urinary tract. A range of
malignant behaviours is also observed between these groups. These are classified as (i)
benign, with simple non-stratified epithelium in which no cytologic atypia is present, (ii)
low malignant potential (LMP) in which epithelial proliferation featuring stratification
and tufting is observed (with varied mitotic activity and atypical nuclei) and finally (iii)
malignant carcinoma in which stromal invasion and cytologic atypia is observed
(Kurman, 2003).
Approximately 10% of all EOCs are associated with autosomal dominant genetic
predisposition, primarily inherited mutations in the BRCA1¹ or BRCA2 tumour suppressor
genes (Jazaeri et al., 2002; Lakhani et al., 2004; Malander et al., 2004). Mutations of
these genes are also seen in a small proportion (~5%) of sporadic ovarian cancers
(Matias-Guiu and Prat, 1998). Other genetic features tend to relate to specific types of
ovarian cancer. For example, invasive serous and undifferentiated ovarian carcinomas are
characterized by mutations of the tumour-suppressing gene TP53 and accumulation of the
protein it encodes (Baekelandt et al., 1999). The loss of genetic material from
chromosome 17, where the TP53 gene is located, is also common (Chenevix-Trench et
al., 1997). Overexpression of the apoptosis-suppressing gene BCL2 is reported in
endometrioid carcinomas (90% of cases) (Baekelandt et al., 1999). Mutations of the
KRAS oncogene are characteristic features of mucinous carcinomas (detected in 40-50%
of cases), although they are less frequent in mucinous tumours of low malignant potential (LMP),
where they are detected in approximately 30% of cases (Cuatrecasas et al., 1998). The
LMP form of ovarian cancer shares many of the characteristics of its invasive
counterpart, but exhibits a markedly different clinical course, and women diagnosed
with this form of the disease have a significantly more favourable prognosis (Kliman et
al., 1986; Trimble and Trimble, 2003), as discussed later in section 1.4.
¹ The UniGene symbol is used as the primary gene identifier in this thesis. A full list of UniGene symbols and complete gene names can be found in Appendix O and also in a spreadsheet format on the CD-ROM attached to this document (file name: “RvL_Thesis_Genelist.xls”).
Despite what is already known about the underlying molecular basis of EOC, a much
deeper understanding of the events leading to EOC development and progression is
needed. The high cure rate for those patients diagnosed in the early stages of this disease
is responsible for a keen interest in identifying the specific genes or proteins whose
expression or silencing indicate the first stages of tumorigenesis. Insight into the events
responsible for malignancy, particularly those required for a tumour to spread beyond the
confines of the ovary, may also lead to the discovery of novel therapeutic agents or
molecular targets for treating patients diagnosed with invasive or advanced stage disease.
1.3.3. Current needs in ovarian cancer diagnosis and treatment
Like many forms of human malignancy, when diagnosed early in its clinical course,
EOC is a disease that can be treated effectively and often cured using the currently
available range of surgical and chemotherapeutic strategies (Karlan, 1995; Smart and
Chu, 1992; Teneriello and Park, 1995). When the disease is identified before it has a
chance to invade into nearby tissues, or grow to a size where bowel obstruction becomes
a serious risk to the patient’s life, the 5-year survival rate is between 80 and 90%, with a
steady decline in this rate as the cancer progresses, as shown in Figure 1-5 (Society,
2005). Unfortunately, only approximately one fifth of all cases are detected before local
spread to other pelvic and abdominal structures has occurred (i.e. FIGO stage 1) (Agarwal
and Kaye, 2003), hence the often poor prognosis of most EOC patients.
The challenge of identifying EOC in its early stages, where prognosis is significantly
more favourable, is compounded by the vagueness of its most common symptoms. Many
of the symptoms are often interpreted by patients and health-care professionals as normal
events associated with childbearing, menopause or the aging process (Bankhead et al.;
Fitch et al., 2002). The most common symptoms experienced by women with EOC,
according to large retrospective studies, are gastrointestinal discomfort, weight gain, pain
and swelling of the abdomen, indigestion and shortness of breath (Fitch et al., 2002).
In a large survey of US women diagnosed with ovarian cancer it was found that 95%
experienced a range of symptoms prior to their diagnosis, despite the common belief that
early stage EOC is largely asymptomatic (Bankhead et al.; Goff et al., 2000). Women
who ignored these indications were significantly more likely to be diagnosed with
advanced stage disease compared to those who acted upon them (p=0.002). The study
concluded that ovarian cancer may not be as asymptomatic as once thought (Chan et al.,
2003); however, the most common symptoms are often not considered indicative of a
gynaecologic condition, sometimes resulting in delayed or incorrect diagnoses (Ferrell et
al., 2003). Some studies have suggested that educating patients and doctors to
consider EOC as a possible cause of the symptoms described, coupled with more
effective screening using existing methods (e.g. pelvic examinations, CA-125 or
ultrasound), may increase the frequency of early-stage diagnoses (Igoe,
1997). In spite of this, others have recently shown that advanced stage disease at
presentation, and consequently a poor prognosis, is rarely attributable to a delay in
diagnosis caused by misinterpreted symptoms (Lataifeh et al., 2005).
Because of the known relationship between advancing disease stage and poor prognosis,
there remains a pressing need to understand the molecular events underlying the
transition from one stage of EOC to the next. Of particular interest are the genes controlling a
tumour’s ability to spread from the originating ovary to nearby tissues, as this phase of
disease progression is associated with the largest change in treatment course and a
significant decrease in patient survival (Clark et al., 2001; Friedlander, 1998). Advances
in our understanding of these processes will aid the development of tests designed to
identify the first stages of ovarian tumorigenesis and may also allow the development of
novel therapeutics targeted towards the specific gene products responsible for disease
progression.
Given the absence of any effective late-stage treatment of EOC, research into the
molecular events responsible for EOC development, particularly those mediating a
tumour’s drug resistance and invasive potential, offers the most promise for reducing the
impact of this disease on the community. Although outside the scope of this thesis,
microarray technology is also being used to investigate the precise mechanisms of drug
resistance (Sakamoto et al., 2001).
[Figure 1-7, panel A: chart of 5-year survival rate (%), scale 0 to 100, against EOC stage at diagnosis (Ia, Ib, Ic, II, IIIa, IIIb, IIIc, IV). Panel B: diagram.]
Figure 1-7: (A) 5-year survival rates of EOC patients by tumour stage at time of diagnosis (Society, 2005). (B) Diagram representing the region of the body to which the tumour has spread, corresponding to the four main FIGO stages of disease progression. Full descriptions of each stage can be found in Appendix 1.
1.3.4. Molecular pathology of EOC and its relevance to patient prognosis
Alterations in the tumour suppressor gene TP53 and its downstream targets p21 (cell cycle inhibitor),
BAX (apoptosis agonist) and BCL-2 (apoptosis antagonist) are often observed in EOC;
however, there is still debate concerning the prognostic ability of these changes. Schuyer
et al used a range of molecular and immunohistological methods to examine the
relationship of these genes with important clinico-pathological variables including
outcome and response to platinum-based chemotherapy drugs including cisplatin
(Schuyer et al., 2001). Interestingly, while TP53 mutations were present in up to 50% of
EOCs, no correlation with increased rate of progression or death was observed, nor with
expression of p21 or BCL-2 in this study. Higher TP53 expression levels were correlated
with shorter overall survival rate (p=0.03). Factoring TP53 mutation and over-expression
resulted in a more significant correlation with overall survival than the expression data
alone (p=0.08), as observed in other studies (Wen et al., 1999). The other gene
downstream of TP53 investigated as part of this study, BAX, was however significantly
linked to progression-free and overall survival. Furthermore, patients with expression of
both BAX and BCL-2 exhibited longer survival times than those with tumours expressing
BAX alone. The authors concluded that high expression of BAX may therefore be a
potential independent prognostic indicator for this disease.
Expression of p21/WAF1, a tumour suppressor gene, is inversely correlated to TP53. It
has been associated with higher EOC grades (i.e. a less differentiated cellular structure)
and later FIGO stages (Anttila et al., 1999). DNA damaging agents that result in cell
cycle arrest of wild-type TP53 cells in the G1 phase are capable of inducing the
p21/WAF1 gene. Anttila et al used immunohistochemical profiling of over 300 ovarian
tumour specimens to explore the relationship between expression of p21/WAF1 and
patient outcome. Statistical analysis of expression levels and patient clinical information
revealed that high level expression of p21/WAF1 was associated with lower levels of
cellular proliferation. In a univariate approach, the gene appeared to be a negative
prognostic factor. Patients whose tumours had minimal or no expression appeared to have
a higher risk of tumour recurrence after treatment and shorter disease-free and overall
survival, particularly those who were also positive for TP53. Whilst not statistically
significant, there was also a trend of higher p21/WAF1 expression in patients who
exhibited a complete response to chemotherapy.
The gene KLK4 (Kallikrein 4) has been associated with disease progression and survival
time in EOC. KLK4 has been implicated in other hormonally regulated cancers including
those of the breast and prostate (Obiezu et al., 2001). In 147 EOC samples, expression of
this gene was detected by RT-PCR in 69 cases (55%). Furthermore, a significant
association with tumour grade and stage was observed. Overall the authors of this study
concluded that KLK4 expression was related to a more aggressive phenotype, which
generally translated to an increased risk of disease relapse and ultimately death. When
tested against chemotherapy response rates, a correlation between positive expression and
lack of treatment efficacy was detected. Interestingly, comparing the expression of KLK4
in grade 1 and 2 versus grade 3 tumours showed that positive expression in grade 1 and 2
cases indicated a 2.5-fold increase in relative risk of relapse yet was not significantly
predictive for relapse of the least differentiated grade 3 tumours (see Figure 1-8), which
may indicate loss of KLK4 expression as tumours dedifferentiate.
Figure 1-8: Variation in rate of tumour relapses between grade 1-2 (A) and grade 3 (B) tumours by KLK4 expression (Obiezu et al., 2001). The level of this gene appears to be related to survival in moderate and well differentiated tumours but not to the same extent in those of poor differentiation.
The Fanconi anemia-BRCA pathway has been implicated in the molecular changes
occurring in cisplatin-resistant EOC. According to research by Taniguchi et al,
interruption of this genetic pathway ultimately leads to the development and selection of
drug-resistant cancer cells (Taniguchi et al., 2003). This pathway is made up of six genes
(FANC-A, -C, -D2, -E, -F and -G) plus BRCA1 and BRCA2 and normally regulates
cellular reaction to cisplatin and other DNA cross-linking substances. The pathway gets
its name from Fanconi anemia, a rare autosomal recessive disease causing abnormal
development and predisposition to a wide range of tumours. The authors of this study
showed that cisplatin resistance in EOC cell lines could be attributed to initial
methylation-induced inactivation and subsequent demethylation of FANCF. A proposed
model of tumour progression based on these findings is shown in Figure 1-9.
In this model, methylation of FANCF occurs during the early stages of tumour
progression. This results in chromosomal instability and accumulation of other tumour
causing mutations. The majority of cells in the growing tumour remain hypersensitive to
cisplatin, due to their underlying Fanconi pathway defect. As a result, cisplatin treatment
results in significant apoptosis of the drug-susceptible cell population. In rare cells,
demethylation of FANCF occurs, leading to reactivation of the pathway and selective
growth of these cells, eventually forming a cisplatin-resistant tumour mass. As shown in
Figure 1-9, the use of a small-molecule Fanconi-pathway inhibitor may be clinically
useful for resensitising these relapsed tumours (Taniguchi et al., 2003).
As this Fanconi pathway analysis was carried out using EOC cell lines, validation using
other methods such as expression profiling of RNA extracted from human tissue should
be carried out. Microarray profiling of cell lines and primary ovarian tumours has revealed
that significant molecular differences exist between these two forms of the disease.
This calls into question the validity of cell line models for the study of human cancers
unless the observations made are confirmed using actual human tissue (Ross
and Perou, 2001). The extent of these differences between EOC cell lines and human
tumours has been described by Sawiris et al, who used principal component analysis
(PCA), a data reduction technique that allows complex gene expression patterns to be
visualised in three dimensions, as shown in Figure 1-10. In this analysis, the primary EOC
tissue specimens appear as closely related to primary colorectal tissue specimens as to
ovarian cell lines (Sawiris et al., 2002).
Figure 1-9: A proposed model of EOC progression and development of chemo-resistance. In the early stages of tumour progression, the FANCF gene is methylated, which results in chromosomal instability and build up of other tumour causing mutations. The majority of cells in the growing tumour remain hypersensitive to cisplatin, due to their underlying Fanconi pathway defect. As a result, cisplatin treatment results in significant apoptosis of the susceptible cell population. In rare cells, demethylation of FANCF occurs, reactivating the pathway and leading to selective growth of these cells, resulting in a cisplatin resistant tumour mass. As shown in the model, the use of a small-molecule Fanconi-pathway inhibitor may be clinically useful for resensitising these relapsed tumours (Taniguchi et al., 2003).
Figure 1-10: Three dimensional plot of principal component analysis of microarray gene expression data generated using EOC cell lines, EOC tissue samples and colon tumour samples. Significant differences between the expression profiles generated from cell lines and human tissues can be observed (Sawiris et al., 2002).
1.3.5. The ovarian tumour marker CA-125 and EOC prognosis
The currently established EOC prognostic factors are (Clark et al., 2001):
Age at diagnosis,
Histology,
Tumour stage and grade,
Volume of ascites,
Performance status according to the ZUBROD-ECOG-WHO scale (Oken et al.,
1982),
Findings at second-look laparotomy and
Debulk status
Along with the above prognosticators, the abundance of a cell-surface molecule called
CA-125 detected by a blood test is frequently used to assess the risk of ovarian
malignancy. To a lesser extent, this marker is also used to identify disease stage and
histological subtype, although this is more often determined pathologically using staging
systems such as the FIGO scale (Benedet et al., 2000). The clinical usefulness of CA-125
was first identified in 1981 (Bast et al., 1981) and it remains one of the most commonly
measured indicators of the disease to this day (Agarwal and Kaye, 2005). The molecule is
expressed by over 80% of ovarian cancers and secreted into the blood stream, enabling its
detection through a non-invasive blood test. CA-125 levels are measured at regular
intervals throughout the course of a woman’s treatment and used to predict the likelihood
of a favourable response to chemotherapy and also the probability of disease recurrence
up to 60 days following treatment (Meyer and Rustin, 2000).
When CA-125 levels are used to decide whether to continue, modify or stop therapy altogether,
the definition of treatment response recently proposed by the Gynaecological Cancer
Intergroup (GCIG) is a 50% reduction in the level of the protein that is sustained for 28
days (Rustin, 2004; Rustin et al., 2004). This was determined by comparing patient
response rates according to CA-125 with response rates expected according to standard
criteria, and by calculating the proportion of patients in whom the CA-125 prediction
agreed or differed with the response determined by standard criteria. The accuracy of the
definition of response according to CA-125 has also been assessed by examining
how accurately the CA-125-defined response predicted the activity of drugs in
phase II trials, compared with response rates obtained by standard criteria (Rustin et al.,
2000).
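The response definition quoted above (a 50% fall in CA-125 that is sustained for 28 days) can be illustrated with a short Python sketch. This is a simplified illustration only and not the GCIG algorithm itself: the function name, the input format and the rule for resetting the 28-day clock are assumptions made for the example, and the published criteria contain further conditions that are not modelled here.

```python
from datetime import date


def ca125_response(baseline, followups, reduction=0.5, min_days=28):
    """Return True if CA-125 falls by at least `reduction` relative to
    `baseline` and stays at or below that level for `min_days` or longer.

    followups: list of (date, level) tuples sorted by date (assumed format).
    """
    threshold = baseline * (1.0 - reduction)
    run_start = None  # date on which the current run of low values began
    for day, level in followups:
        if level <= threshold:
            if run_start is None:
                run_start = day
            if (day - run_start).days >= min_days:
                return True
        else:
            run_start = None  # reduction was not sustained; start again
    return False


# Hypothetical patient: baseline 400 U/ml, three follow-up measurements.
measurements = [
    (date(2005, 1, 1), 180.0),
    (date(2005, 1, 15), 150.0),
    (date(2005, 2, 1), 120.0),
]
print(ca125_response(400.0, measurements))
```

In this invented series every follow-up value is below half of the baseline and the first and last samples are 31 days apart, so the patient would be scored as a responder under the simplified rule.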
Despite the demonstrated link between CA-125 and the onset or progression of several
clinically important aspects of EOC, and despite extensive evaluation, none of the indices
has gained universal acceptance in disease prognostication. The sample size of many of the
evaluating studies is a frequent limiting factor, along with the lack of prospective studies
to confirm the original observations. Finally, the predictive ability of
the indices when applied to an individual patient is insufficient to justify a change in
management (Cruickshank et al., 1992; Rustin, 2004).
According to the literature, over 230 papers have been published on potential prognostic
factors for EOC in the past 5 years (Agarwal and Kaye, 2005). Despite this
volume of research, no single factor has passed all the criteria necessary for acceptance
into routine clinical practice for this disease (Agarwal and Kaye, 2003). The prognostic
value of the tumour suppressor gene TP53 has been studied extensively in EOC, although
its precise role in tumour response to DNA damage remains controversial. In a review of
published TP53 analyses, 43% of studies reported a significant correlation between
TP53 status and clinical end point with respect to chemoresistance. However, only six
studies met the minimum criteria established in the review, none of which found a
reliable correlation between TP53 status and drug-resistance end points (Hall et al., 2004).
The criteria used for evaluating the TP53 studies included variables such as sample size, the
adequacy of positive and negative controls and the use of more than one antibody to assess TP53 levels.
One explanation for the difficulty in finding truly prognostic markers of EOC may be the
univariate methods of analysis used to evaluate most novel candidates. This method does
not account for the impact of other established prognostic variables, known to be
important in determining patient outcome or chemotherapy response (e.g. the amount of
residual disease remaining after surgery, patient age, etc.) (Altman, 2001). Another reason
for the absence of a truly universally applicable EOC prognosticator may be the single
gene/protein nature of most studies to date, such as those of TP53 (Wen et al., 1999), ERBB2
(Meden and Kuhn, 1997) and MDR (Ikeda et al., 2003). EOC is known to be a complex
and heterogeneous disease and it therefore may require the simultaneous measurement
and analysis of multiple molecular markers and/or clinical variables to accurately
determine patient prognosis (Hernandez et al., 1984; Pieretti et al., 2002).
1.3.6. The use of DNA microarrays to discover novel biomarkers of EOC
Although microarray technology is still undergoing rapid development, early indications
are that it has the potential to impact significantly on diagnosis of diseases with
underlying molecular causes and also methods for assessing patient prognosis.
Identification of gene expression signatures associated with patient prognosis has been
achieved for a range of cancer types, including breast (van de Vijver et al., 2002), B-cell
lymphoma (Alizadeh et al., 2000), ovarian (Berchuck et al., 2004), prostate
(Dhanasekaran et al., 2001), renal cell (Vasselli et al., 2003) and oesophageal cancer
(Kihara et al., 2001). In breast cancer, the predictive gene expression signature described
by Van’t Veer et al has been taken forward into a clinical trial in which treatment
decisions are made using the expression profile of the 70 genes in
the signature, which has been demonstrated to correlate with either a good or poor prognosis
(Branca, 2003). Women enrolled in the study (N>5000) will be assigned to
one of two treatment groups, based either on their molecular profile or on conventional
assessment by clinicians. Patient outcome in the ‘microarray’ and clinician-assigned
groups will be compared throughout the study to determine if either method is superior
for identifying those women most at risk of recurrent disease and therefore requiring
more aggressive treatment.
The ability of microarrays to monitor and predict the response of a tumour to a specific
chemotherapy agent, a variable that impacts on prognosis, is currently being
demonstrated for the multiple myeloma drug Velcade (Jung et al., 2004; Mitsiades et al.,
2002). By analysing the Affymetrix GeneChip expression profiles of patients before and
after drug treatment, a predictive signature was devised for identifying whether a tumour
is likely to respond favourably to the drug. This study is one of the first government-
approved clinical trials in the USA to incorporate microarray-based expression profiling
in a trial protocol.
Several groups have attempted to use microarray profiling to discover novel biomarkers
for EOC, particularly markers of early stage disease. One of the first published studies
was carried out using Affymetrix HuGeneFL GeneChips, an early version of the presently
available GeneChips that contained approximately 6,000 oligonucleotide features (Welsh
et al., 2001). A novel “array of arrays” format was used whereby 49 separate microarrays,
separated by individual chambers, were hybridised in parallel on a single glass wafer.
Criteria for selection of genes as potential diagnostic markers were: (i) low expression in
normal tissue and high expression in neoplastic tissue and (ii) a clear and unambiguous
difference in expression between these two tissue types. In the analysis of the expression
data generated, certain samples in the cohort were found to be expressing high levels of
genes that are normally associated with stroma or infiltrating immune cells. These
tumours were confirmed pathologically to have low epithelial content and subsequently
excluded from the analysis, reducing the sample size significantly and therefore limiting
the statistical power of the study. The hybridisation intensity of each gene in the normal
and malignant specimens was analysed with three different methods for detection of
differential expression: (i) difference of means, (ii) fold change and (iii) unpaired t-test.
The genes were ranked according to each of these measures and the three ranks were
summed to give an overall estimate of differential expression, as reproduced in Figure
1-11.
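The combination of the three measures by rank can be sketched in Python as follows. This is a minimal illustration under stated assumptions and is not the code used by Welsh et al: the gene names and intensity values are invented, the pooled-variance form of the unpaired t-test is assumed, ties are broken alphabetically, and every gene is assumed to have non-zero variance and a non-zero mean in normal tissue.

```python
from math import sqrt
from statistics import mean, variance


def t_statistic(a, b):
    """Unpaired t statistic using the pooled sample variance."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(pooled * (1 / na + 1 / nb))


def ranks(scores):
    """Map each gene to its rank; rank 1 is the largest score."""
    ordered = sorted(scores, key=lambda g: (-scores[g], g))
    return {gene: position + 1 for position, gene in enumerate(ordered)}


def rank_sum_order(tumour, normal):
    """Order genes by the sum of their ranks under three measures of
    tumour-versus-normal differential expression (best candidate first)."""
    diff = {g: mean(tumour[g]) - mean(normal[g]) for g in tumour}
    fold = {g: mean(tumour[g]) / mean(normal[g]) for g in tumour}
    tval = {g: t_statistic(tumour[g], normal[g]) for g in tumour}
    per_measure = [ranks(diff), ranks(fold), ranks(tval)]
    total = {g: sum(r[g] for r in per_measure) for g in tumour}
    return sorted(total, key=lambda g: (total[g], g))


# Invented hybridisation intensities for three hypothetical genes.
tumour = {"GENE_A": [900, 1100, 1000], "GENE_B": [210, 190, 200], "GENE_C": [100, 120, 110]}
normal = {"GENE_A": [100, 120, 110], "GENE_B": [180, 220, 200], "GENE_C": [300, 280, 290]}
print(rank_sum_order(tumour, normal))
```

Summing ranks rather than raw scores keeps the three measures on a common scale, so a gene must score well on all of them to appear near the top of the list.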
Genes identified by this approach included several cell-proliferation genes (e.g. CCNB1,
CDC20, RAN), previously identified tumour specific genes (COX5B, PRSS8, PRAME),
stromal genes upregulated in normal tissues (CNN1, MLCK, MUC18), genes only
expressed in normal tissue (EGR1, IGFBP5, BTG2) and several ribosomal genes.
Important factors for determining the potential of a novel disease biomarker include the
copy number of the gene, particularly for mRNA-based detection, or the translation of the
gene into circulating protein product for a biomarker that can potentially be identified
from blood, urine or saliva samples.
Figure 1-11: Affymetrix GeneChip gene expression measurements of the 30 highest ranked potential EOC biomarkers by Welsh et al (Welsh et al., 2001). Red and blue squares correspond to mean expression level of each gene in malignant and normal ovarian tissue respectively. Green bars correspond to expression of the gene in a pool of six normal tissues for comparison. 95% confidence intervals shown. Studies such as these demonstrate the power of microarrays to identify large numbers of potential biomarker candidates.
The gene Prostasin (PRSS8) identified by Welsh et al, was also proposed as a potential
serum marker for the early detection of EOC in an independent cDNA-microarray based
study by Mok et al (Mok et al., 2001). Using a 2.4k commercial cDNA array
(MICROMAX human cDNA microarray system, manufactured by Perkin Elmer, USA)
this group identified genes coding for overexpressed proteins potentially suitable
for use as early detection markers. Thirty genes with expression ratios greater than five
(EOC cell line to normal ovarian surface epithelial (OSE) cell) were identified. Included
in the output of this analysis was PKB, which encodes a protein marker that is already
used clinically for assessing renal cell carcinoma and lung cancer prognosis. PRSS8 had a
Cy3:Cy5 expression ratio of 170, indicating an extreme increase in its abundance relative
to normal OSE. This overexpression was confirmed with RT-PCR, and
immunohistochemistry was also carried out to determine its cellular localisation.
Antibody staining of EOC sections revealed high levels of serum prostasin in 64 cases of
EOC compared to 134 control cases (examples of staining shown in Figure 1-12). This
independent validation step is crucial in the process of evaluating a novel biomarker
(Statnikov et al., 2005). Importantly, variables that can potentially confound analyses of
novel biomarkers, such as patients’ age and specimen quality, were controlled for in the
statistical analysis of gene expression and disease state in this study. After factoring in
these clinical variables, a highly significant difference (P < 0.001) between EOC and
normal tissue expression of PRSS8 was still observed.
One caveat to this study was the exclusion of residual disease information from the estimate of
significance, a potential oversight given the prognostic value of this variable (Hoskins et
al., 1992; Hoskins et al., 1994).
Figure 1-12: Immunohistochemistry validation of Prostasin (PRSS8), a novel serum marker for EOC identified by microarray analysis. Low prostasin expression in normal surface epithelial cells (A) and serous LMP tumour (B). Higher expression of the marker in a grade 3 EOC specimen is shown in (C) and no positive signal is observed for the same case in (D), for which a preimmune rabbit serum was used. S = stroma and the horizontal scale bar indicates a length of 50 µm (Mok et al., 2001).
Another candidate marker from microarray profiling of EOC is Osteopontin (SPP1), one
of 30 candidate genes identified by Wong et al (Wong et al., 2001) from cell-line
experiments also using the MICROMAX platform. This gene was observed to be 150- to
180-fold overexpressed in the tumour derived cell lines relative to the seven cultured
normal OSE cell lines used as a reference. However, this study was largely a validation of
the microarray platform itself and only a rudimentary data analysis was used to identify
candidate genes. The product of the SPP1 gene is an acidic calcium-binding
glycophosphoprotein found in virtually all body fluids and in the components of the
extracellular matrix. It is thought to be involved in regulation of cell adhesion and also to
act as a cytokine for CD44 and several integrins (Standal et al., 2004).
The expression of SPP1 was validated initially by Kim et al (Kim et al., 2002b) using
normal and cancerous cell lines, archival paraffin-embedded ovarian tissue as well as
fresh tissue and plasma from 144 patients treated for a pelvic mass at two locations in the
United States (Brakora et al., 2004; Schorge et al., 2004). RT-PCR on microdissected
tumour material revealed higher expression of osteopontin mRNA relative to normal
tissue, but the difference was not statistically significant. Immunohistochemical analysis
revealed a histological-subtype specific pattern of staining, for example high cytoplasmic
staining in mucinous tumours compared to psammoma-body localised expression in the
serous subtype. Ovarian tumours of low malignant potential have also been noted to
express higher levels of osteopontin protein than their invasive counterparts, which
suggests a role for this molecule in regulation of tumour dissemination to other tissues
(Tiniakos et al., 1998). Serum testing by osteopontin ELISA in this study revealed clearer
differences between healthy controls and tumour patients, with preoperative plasma
levels being significantly higher in patients for all histological subtypes tested.
Continuing the exploration of the potential diagnostic value of SPP1, genome-comprehensive
Affymetrix U95 GeneChips were used to assess the expression of this
gene in 42 EOC and normal OSE samples, along with ten other potential tumour
markers (Lu et al., 2004). Of the eleven markers used, SPP1 was not selected by the
recursive descent partition analysis (Hastie et al., 2001). Instead, a formula based on the
expression of HE4, CA-125 and MUC1 was derived which was able to discriminate
between the tumour and OSE samples tested with 100% accuracy when the expression levels of all
markers were found to be elevated. Another set of genes with high classification accuracy
was identified from the available data. Claudin 3 (CLDN3) expression by itself could
classify all serous, clear cell, endometrioid and one of eight mucinous samples from the
non-cancerous OSE. With the addition of vascular endothelial growth factor (VEGF)
expression into the classifier, the remaining mucinous samples were correctly classified.
In a follow up IHC analysis of 158 EOC cases, it was demonstrated that a combination of
CLDN3, CA-125, MUC1 and VEGF staining was able to distinguish all tumour samples
from normal tissue. This study demonstrated the potential for microarrays to aid in the
development or improvement of cancer diagnostics. As the authors commented, one
limitation of such studies is that, for a candidate to become a clinically useful diagnostic
tool, the marker needs to be present in serum, not just expressed at the mRNA level in a
given tissue type.
Studies such as these describe the exhaustive process required to identify new biomarkers
for EOC using microarrays as the initial discovery platform and other more established
methods such as RT-PCR and IHC for validation.
Meta-analysis is an approach to data mining in which raw gene expression data from
separately conducted microarray experiments are combined to create one dataset with
increased statistical power. This method has been used in successful studies of prostate
cancer (Rhodes et al., 2002) and also to identify a transcriptional profile commonly
activated in a large range of cancer types (Rhodes et al., 2004).
By compiling a database of gene expression information from 14 different microarray
studies of EOC gene expression relative to OSE or other forms of non-malignant tissue,
Heinzelmann-Schwarz et al (Heinzelmann-Schwarz et al., 2004) identified three cell-
adhesion genes that were overexpressed in all histological subtypes tested. This approach,
as well as increasing the statistical power of the analysis, is an effective way of
controlling for variation between laboratory protocols, microarray platforms and data
analysis methods. Genes that are found to be differentially expressed between phenotypes
of interest in more than one independent study are more likely to be significant on a
population-level as opposed to those identified by one study alone.
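The idea that genes reported by more than one independent study are the more credible candidates can be illustrated with a simple counting sketch in Python. This is an assumption-laden simplification and not the method of Heinzelmann-Schwarz et al: it counts how many published gene lists contain each gene, whereas the meta-analysis described above pools the underlying expression data. The study labels and gene names are placeholders.

```python
from collections import Counter


def recurrent_genes(study_gene_lists, min_studies=2):
    """Return {gene: number of studies reporting it} for genes that appear
    in at least `min_studies` of the supplied gene lists."""
    counts = Counter()
    for genes in study_gene_lists.values():
        counts.update(set(genes))  # count each gene once per study
    return {gene: n for gene, n in counts.items() if n >= min_studies}


# Placeholder gene lists from three hypothetical studies.
studies = {
    "study_1": ["GENE_A", "GENE_B"],
    "study_2": ["GENE_A", "GENE_C"],
    "study_3": ["GENE_A", "GENE_B", "GENE_B"],
}
print(recurrent_genes(studies))
```

Raising `min_studies` trades sensitivity for confidence: fewer genes survive, but each has been observed under a wider range of laboratory protocols and platforms.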
By using a database created from these compiled studies in association with the authors’
own unpublished dataset of EOC Affymetrix profiles, 69 genes differentially expressed
between EOC and OSE were found in common. From this list, cellular localisation,
minimal expression in normal ovarian tissue and the gene’s individual p-value for
differential expression were used to identify candidate tumour markers of EOC. Three
cell-adhesion markers were chosen for follow up analysis with immunohistochemistry;
molecule (EP-CAM). The relative levels of these molecules in surface epithelium and a
range of EOC histological subtypes are shown in Figure 1-13. Immunohistochemistry
revealed low expression of these candidate markers in normal surface epithelium and
significantly higher expression in the EOC subtypes profiled.
Whilst none of these potential markers was predictive of relapse-free survival, compared
to other variables tested such as age, debulk status and tumour stage, patients with lower
CLDN3 expression exhibited a trend towards shorter survival (p=0.068).
These types of studies highlight the benefit that microarray technology brings to the search
for novel biomarkers, but also the immense amount of work that is still required to find
suitable molecules. Whilst the scientific community now has the capacity to screen
thousands of genes in parallel due to the continual refinement of microarray technology,
much effort is still required to ensure any candidate genes are expressed specifically in
the tissue of interest and that the gene product is present in corresponding and detectable
levels in plasma. Making use of the growing databases of tissue expression profiles
(Ramaswamy et al., 2001; Su et al., 2001) is one method for assessing tissue specificity;
however, issues of cross-platform and inter-laboratory variation still exist (King and
Sinha, 2001; Simon et al., 2003b).
As well as improved means of early detection and population screening, a need exists for
more accurate prognostic factors to assist in individualising the treatment EOC patients
receive. This need is pressing as most ovarian cancer patients are diagnosed with
advanced stage disease and present treatment options are only effective in a portion of
these cases (Harries and Gore, 2002a). Debate still exists in the medical community on
the most effective methods for determining whether a patient receives any form of
chemotherapy, the route of administration, dosage levels and also the most appropriate
surgical approach to take for maximum benefit (Agarwal and Kaye, 2003; Harries and
Gore, 2002a; Harries and Gore, 2002b; Marsden et al., 2000).
Figure 1-13: IHC expression of potential EOC markers. Units are mean percentage of cells expressing each marker. SOC: Serous ovarian cancer; MOC: Mucinous ovarian cancer; EnOC: Endometrioid ovarian cancer; ClCCA: Clear cell ovarian cancer. Black and white bars correspond to cytoplasmic and membrane expression respectively (Heinzelmann-Schwarz et al., 2004).
After initial surgery, some patients can be identified with a sufficiently favourable
histological assessment (stage 1) that indicates the chance of being cured by surgery alone
is sufficiently high as to avoid the personal and financial cost of chemotherapy. However,
this category of EOC is still remarkably heterogeneous and substantial variation exists in
survival times, underscoring the need for reliable prognostic measures. A comprehensive
retrospective study by Vergote et al (Vergote et al., 2001) based on 1545 women with
stage 1 EOC identified the most important prognostic factors for probability of relapse
as degree of tumour differentiation, the presence of cyst rupture, bilateralism
(presence of tumour in both ovaries) and the age of the patient at diagnosis.
Another recently reported prognostic factor for risk of relapse amongst stage 1 patients is
DNA ploidy (Kristensen et al., 2003; Trope et al., 2000). Kristensen et al found that those
patients with polyploid and aneuploid tumours had 10-year relapse-free survival rates of
70% and 29% respectively. Those with diploid and tetraploid tumours had significantly
higher rates of 95% and 89% respectively. In a multivariate analysis including tumour
grade, FIGO stage and histological subtype variables, DNA ploidy was found to be the
strongest predictor of survival, while all variables were independently prognostic at
statistically significant levels. The investigators identified low, medium and high risk
relapse groups based on DNA ploidy and other variables and propose the routine use of
DNA ploidy analysis for the selection of early-stage patients likely to benefit from post-
surgical adjuvant chemotherapy.
1.3.7. Current status of microarray-based EOC prognostic signatures
Possibly due to the difficulty in accruing sufficient numbers of patients representing a
broad range of survival times, microarray-based identification of novel EOC prognostic
factors has lagged behind that of other cancer types such as breast (Huang et al., 2003;
Sorlie et al., 2001; van de Vijver et al., 2002) or B-cell lymphoma (Lossos et al., 2004;
Rosenwald et al., 2002; Yeoh et al., 2002).
Based on a 68-patient cohort, Spentzos et al (Spentzos et al., 2004) identified a set of 115
genes referred to as the ‘Ovarian Cancer Prognostic Profile (OCPP)’. These genes were
narrowed down from over 12,000 contained on the Affymetrix U95A GeneChip and their
expression patterns grouped patients into classes with statistically significant differences
in survival times, based on a three-step process. Genes were first selected by the
comparison of expression data from patients at the extreme ends of the survival
distribution. Following this, those samples corresponding to patients who had survival
times in-between the extreme long and short term groups were classified using the
deduced OCPP. The authors noted a bias in debulking efficiency and patient age between
the favourable and unfavourable groups, with both of these variables being significantly
prognostic, based upon univariate analysis. This observation has been reported by a
number of other studies (Clark et al., 2001; Friedlander, 1998; Vergote et al., 2001).
Despite this, the OCPP maintained its prognostic independence in multivariate analysis
corrected for age and debulking status. Whilst an independent test set showed the OCPP
to be significantly prognostic for samples not used in the training process, the total
number of cases in the study (n=102) and small number of specimen collection sites
would need to be expanded before the OCPP could be confidently applied to EOC
patients for diagnostic purposes. Importantly, many of the genes in the list of 115 have
previously been implicated in processes such as invasion and disease progression. These
include:
Fibronectin (FN1); known to be integral in neovascularisation and metastasis,
immunosuppressive and apoptotic pathways, and in a large
immunohistochemistry study based in Germany was significantly correlated with
other established prognostic factors as well as overall patient survival (Franke et
al., 2003).
Plasminogen activator inhibitor 1 (PAI1); elevated expression of this gene and
its target Urokinase-type plasminogen activator (PLAU) were significantly
associated with disease prognosis and progression by quantitative ELISA of a
large cohort of patients by Konecny et al (Konecny et al., 2001). This enzyme
and inhibitor complex are thought to mediate a tumour's ability to degrade
extracellular matrix and basement membranes, essential for invasion to occur
(Dano et al., 1985; Schmitt et al., 1997). They have also been implicated as
prognostic markers in breast (Duffy et al., 1988; Foekens et al., 1992), kidney
(Hofmann et al., 1996), colon (Ganesh et al., 1994), lung (Pedersen et al.,
1994) and gastrointestinal cancers (Nekarda et al., 1994).
Thrombospondin 2 (TSP2); the role of this gene in EOC is still debated, but
over expression has been associated with a more aggressive phenotype and
shorter survival (Kodama et al., 2001). It is a disulfide-linked glycoprotein that
controls cell-cell and cell-matrix adhesion and interaction, potently inhibiting
tumour growth and angiogenesis (Lopes et al., 2003).
Figure 1-14: Kaplan Meier analysis of EOC patients classified into prognosis groups on the basis of a 115-gene expression profile. (A) Prognostic model applied to a validation set of EOC (independent to the cohort used for creation of the original model) (B) Model applied to entire cohort. Highly significant differences between survival curves were observed. (Spentzos et al., 2004)
The potential of microarray technology to reveal important information about the
underlying causes of variation in EOC survival rates, as well as uncovering novel markers
of disease development and progression, was demonstrated by Lancaster et al (Lancaster
et al., 2004). By comparing 31 advanced stage serous EOCs obtained from patients with
either (i) less than two years or (ii) greater than seven years survival times, a list of
differentially expressed genes was obtained. A gene called Tumour Necrosis Factor-
related Apoptosis-inducing Ligand (TRAIL) was flagged for further validation after it was
found to be expressed at a 7.4-fold higher level in ovarian cancer compared to normal epithelium. It
was also observed to be 1.5-fold higher in patients with longer survival times compared to
those with shorter survival times in the cohort investigated.
Using RT-PCR profiling in a follow up study involving 120 EOCs, the authors describe a
significant relationship between TRAIL expression and increased length of survival.
Patients who lived for more than five years had 2.2-fold higher expression of this gene
than those who died within 12 months of diagnosis (Lancaster et al., 2003). This gene is
one of the “death ligands” and is involved in the regulation of apoptosis, increasing the
chemosensitivity of tumours in which it is expressed at high levels. A follow up study has
since independently demonstrated that the combination of TRAIL and chemotherapy led to a
significant increase in apoptosis and growth inhibition of EOC cell lines, further adding
weight to the potential clinical use of this molecule to improve the effectiveness of
current chemotherapeutics (Cuello et al., 2001).
In another study with clinically relevant findings, Hartmann et al (Hartmann et al., 2005)
identified a gene expression signature that discriminated between ovarian cancer patients
having either a short or long time to recurrence, following platinum-paclitaxel
combination chemotherapy. This type of platinum-based chemotherapy given after
surgery has the highest clinical benefit as defined by response rate, time to recurrence and
overall survival making it the current standard of care for EOC (Harries and Gore, 2002a;
McGuire et al., 1996). Gene expression profiling of a cohort of 79 patients with advanced
stage, high grade EOC was carried out using cDNA microarrays. A 14-gene signature
was identified that was able to classify patients to either an early (≤21 months) or late
(>21 months) relapse category. This classification was carried out with an accuracy of
86%, as measured by cross validation of the dataset and the use of an independent test
cohort of patients not involved in the determination of the 14-gene signature. This study
demonstrates that gene expression data may be able to identify those EOC patients at risk
of early disease relapse, making them candidates for more aggressive treatment
modalities or novel therapies. This analysis, however, was limited by not considering
other prognostic variables such as residual disease levels or patient age. There was also
very little overlap between the genes in the prognostic signature and those identified in
other studies of EOC. As discussed later, this phenomenon has been observed for
other cancer types and reflects the heterogeneity of EOC, the method of gene selection
and the sample size (Ein-Dor et al., 2005).
1.4. Low malignant potential ovarian cancer
1.4.1. Molecular background and clinical information
Cancers of the ovary are a heterogeneous class of malignancies (Hernandez et al., 1984).
Classification is primarily carried out according to cell type, with the main subtypes described
in section 1.3.2. These labels refer to the histological appearance of the tumour as
observed by the pathologist. Each of these major categories is then classified further
according to the behaviour of the tumour – benign, malignant or low malignant potential
(LMP), the latter sometimes referred to in the past as ‘borderline’ due to it once being
thought of as an intermediate stage between benign and malignant disease and not an entity
of its own (World Health Organization, 1999).
The LMP subtype of EOC is of keen interest in the field of EOC research because it
shares several characteristics of the invasive counterpart, yet has a markedly different
clinical course. Several important characteristics distinguish this subtype, introduced into the
FIGO grading system in 1971 (International Federation of Gynecology and Obstetrics,
1971), from the invasive form of the disease. These include:
Atypical cellular proliferation, but the lack of stromal invasion despite sharing
other malignant characteristics such as cellular stratification and nuclear atypia
Significantly better prognosis; the 5-year survival rate for women diagnosed with
stage 1 disease is in excess of 95% (compared to 30% for all EOC)
Younger age of patients at diagnosis; in a retrospective study the median age of
the 339 women with LMP tumours was found to be 39 years (Zanetta et al.,
2001).
The efficacy of conservative (fertility-sparing) treatment, as indicated by the
higher 5-year survival rate for this subtype of EOC, compared to those
described previously.
LMP ovarian tumours account for 15% of all EOC diagnoses (Ries LAG, 2004). Debate
once existed as to whether these tumours are a separate class of tumour from invasive
EOC or represent a transitional stage from benign to invasive cancer (Kurman and
Trimble, 1993). There is a general consensus now that true LMP tumours rarely develop
invasive characteristics. Occasionally, LMP tumours of the mucinous or endometrioid
type are associated with invasive carcinomas in the same patient, unlike the serous type
which appears to rarely progress or be associated with invasive disease.
The overall percentage of true LMP (i.e. not metastases from a primary tumour in another
tissue) to invasive carcinoma conversions is extremely low. In the Zanetta et al study of
339 women with LMP disease the percentage of LMP tumours that progressed to invasive
disease was two percent (Zanetta et al., 2001). Consequently this cancer type is no longer
referred to as ‘borderline’ EOC by many clinicians and scientists, a label that implies the
previously held belief that this form of EOC represents a transitional entity rather than a
clinically distinct tumour subtype.
Figure 1-15: Survival rates for serous LMP (A) and serous carcinomas (B), mucinous LMP (C) and mucinous invasive (D) EOC. For both histological subtypes the LMP phenotype has a markedly higher survival rate than invasive tumours. Data is shown grouped by disease stages with a greater spread of tumour corresponding to a shorter survival time (Sherman et al., 2004)
In the past, the treatment of LMP tumours with chemotherapy has been controversial,
reflecting the uncertainty of how this subtype related to the more common invasive form
of EOC. Because of the low mitotic index of LMP cells, some have argued chemotherapy
is theoretically incapable of producing the desired response (Kurman and Trimble, 1993),
while others contend a high response rate is observed despite this characteristic (Fort et
al., 1989). In current practice, treatment usually involves conservative surgery, followed
by observation (Trope et al., 2000). In a review of retrospective patient data, Kurman and
Trimble (Kurman and Trimble, 1993) found that the number of deaths caused by complications
associated with chemotherapy or radiotherapy exceeded the number caused by disease
progression to the invasive type. A number of studies have reported no significant
difference in the frequency of disease recurrence or progression between women who did
and did not receive postoperative chemotherapy, therefore today it is rarely used to treat
this form of EOC (Kliman et al., 1986; Nikrui, 1981; Trope et al., 1993). At the 10-year
survival point used by some studies, 97% of women diagnosed with serous LMP tumours
are alive, compared to only 30% of those with invasive disease at this same time point
(Sherman et al., 2004). This difference in survival times highlights the importance of
exploring the molecular differences between these two classes of EOC.
1.4.2. Molecular characteristics of LMP tumours
Several genetic mutations have been demonstrated to be differentially represented
between LMP and invasive ovarian tumours. For example mutations in TP53 and somatic
or germ-line BRCA1/2 abnormalities are commonly observed in invasive EOC but not in
LMP. Conversely, point mutations in the KRAS and BRAF genes and microsatellite
instability are well documented traits of the LMP type (Russell and McCluggage, 2004;
Singer et al., 2003).
KRAS and BRAF are members of the RAS-RAF-MEK-ERK-MAP kinase pathway
(hereafter referred to as the RAS pathway), the primary function of which is to control
how a cell responds to a range of growth signals (Davies et al., 2002). The RAS pathway
is mutated in approximately 15% of all human cancers and is involved in regulating the
tumour-suppressing functions of the TP53 pathway. To determine the role of BRAF and
KRAS in EOC, Singer et al (Singer et al., 2003) screened for three common mutations in a
series of serous LMP and invasive tumours. 15 of 22 (68%) invasive micropapillary
serous carcinomas (MPSCs) were found to have mutations in either codon 599 of BRAF
or codons 12 and 13 of KRAS. This subtype displays a micropapillary architecture and
low grade nuclei and is thought to arise from atypical serous tumours, as opposed to
conventional serous carcinomas which are thought to develop de novo (Smith Sehdev et
al., 2003). 31 of 51 (61%) serous LMP tumours tested contained the same BRAF or KRAS
mutations, suggesting a shared pathway of carcinogenesis between MPSC and LMP
disease, involving these members of the RAS pathway. 72 high-grade invasive serous
carcinomas were also tested by Singer et al and neither type of mutation was detected.
Thus the appearance of KRAS and BRAF mutations in only low grade serous carcinomas
suggests separate development pathways for low and high-grade serous EOC. No tumour
tested yielded mutations in both KRAS and BRAF genes.
Based on findings such as these, a model for the development of EOC was proposed by
Shih Ie and Kurman (Shih Ie and Kurman, 2004), illustrated in Figure 1-16. The model
is comprised of two main pathways that can lead to EOC and attempts to resolve the
position of the LMP type within the spectrum of all ovarian malignancies. The two
pathways are:
Type I: Low grade neoplasms that arise in a linear fashion from LMP tumours.
This type is composed of low grade serous tumours, mucinous, clear-cell and
endometrioid carcinomas as well as malignant Brenner tumours.
Type II: High grade neoplasms that do not arise from a precursor lesion or
morphologically distinguishing transitional state. High grade serous carcinomas,
malignant mesodermal tumours (carcinosarcoma) and undifferentiated
carcinomas make up this group.
One of the key reasons for this division is the associated molecular changes of Type I
tumours rarely found in Type II, as described above. Other than the high frequency of
TP53 mutations, little is known about other possible genetic alterations present in Type II
tumours.
It is currently believed that advanced stage LMP tumours, defined by the detection of
nodal metastases or peritoneal implants (Rao et al., 2004), do not represent a precursor to
grade 1 serous invasive EOC, a hypothesis supported by studies such as Ortiz et al (Ortiz
et al., 2001). This analysis focused on a group of eight patients who initially presented
with advanced stage serous LMP tumours and later developed grade 1 invasive serous
disease. Single-stranded conformational polymorphism-PCR was used to investigate
mutations in TP53 and KRAS. Differences in the mutations of the primary and secondary
tumours were observed in seven of the eight cases, suggesting the secondary tumours had
arisen independently of the previous LMP cancer.
Figure 1-16: Proposed two-pathway model of ovarian carcinoma development. The Type I pathway has frequent BRAF/KRAS mutations, low cellular proliferation, a gradual increase in chromosomal instability (CIN) and a 5-year survival rate of approximately 55%. The Type II pathway has a high frequency of TP53 mutation, higher cellular proliferation and CIN and a lower 5-year survival rate of approximately 30%. (Shih Ie and Kurman, 2004)
1.4.3. Mucinous EOC and tumours metastatic to the ovary
The relationship between mucinous LMP and mucinous invasive EOC is complicated by
the fact that invasive mucinous tumours found in the ovary are often metastatic from
another primary site (Ronnett et al., 2004). In one study, 40 of 52 mucinous tumours of
the ovary collected in a consecutive series of 124 ovarian malignancies were found to be
of metastatic origin (77%), and three of the remaining tumours were atypical proliferative
mucinous tumours with microinvasion (Seidman et al., 2003). Overall, only three tumours from the
total cohort of 124 (2.4%) were classified as primary invasive mucinous EOC. Mucinous
EOC has historically been reported as representing up to 25% of all ovarian cancer
diagnoses, although recent advances in the interpretation of histological features,
immunohistochemistry and other molecular classification methods, suggest that the actual
proportion may be substantially lower. In agreement with the Seidman et al figures, only
2% of patients recruited for the Australian Ovarian Cancer Study (http://www.aocstudy.org/)
have a confirmed diagnosis of primary mucinous invasive
EOC (Unpublished data).
The most common origins of metastatic mucinous tumours to the ovary are the
gastrointestinal tract (45% of the Seidman et al metastatic cases), pancreas (20%),
gynaecologic malignancies such as cervical or endometrial cancer (18%), breast (7%) and
unknown primary site (10%). A rule for identifying primary mucinous carcinomas is
proposed; any unilateral EOC greater than or equal to 10cm in diameter is deemed to
have arisen from the ovary, with all others being metastatic. 90% of the tumours in this
cohort were correctly classified by this formula although no independent validation was
carried out (Seidman et al., 2003). Other histological parameters that are indicative of
metastases include surface implants (i.e. microscopic surface involvement by epithelial
cells) and an infiltrative pattern of invasion (Lee and Young, 2003).
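The size and laterality rule summarised above can be written as a short Python sketch. The function name, argument names and example values are illustrative assumptions and are not taken from Seidman et al; the sketch only encodes the rule as it is described in this section (a unilateral tumour of at least 10 cm is called primary, all others metastatic).

```python
def classify_mucinous_tumour(is_unilateral: bool, diameter_cm: float) -> str:
    """Apply the size/laterality rule as summarised in the text.

    Hypothetical helper: unilateral tumours of 10 cm or more are labelled
    primary ovarian; every other combination is labelled metastatic.
    """
    if is_unilateral and diameter_cm >= 10.0:
        return "primary"
    return "metastatic"


# Illustrative cases only (not patient data): (unilateral?, diameter in cm)
examples = [(True, 14.0), (True, 6.5), (False, 12.0)]
labels = [classify_mucinous_tumour(u, d) for u, d in examples]
```

Such a rule is deliberately coarse; as noted above, it was not independently validated, so any use of it would need to sit alongside full pathological review.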
Recent studies suggest mucinous ovarian carcinomas may have a substantially better
prognosis than previously described, most likely due to metastatic tumours from the
pancreas and intestines being misclassified as primary ovarian tumours and wrongly
included in survival analyses (Lee and Scully, 2000; Ronnett et al., 1997). Findings that
most, if not all, ovarian mucinous cystic tumours associated with a grossly visible
accumulation of mucus in the pelvis or abdomen (pseudomyxoma peritonei) are
metastatic from the appendix (or sometimes the gastrointestinal tract), have prompted
calls for changes to official grading systems such as the FIGO, Union Internationale Contre le
Cancer (UICC) and American Joint Committee on Cancer (AJCC) systems.
Because of the controversy surrounding the classification and treatment of mucinous and
serous ovarian tumours (LMP and invasive), microarray analysis is a suitable approach
for gaining insight into the underlying molecular differences between these classes of
EOC (Benedet et al., 2000). Comparing gene expression profiles of these tumour types
may yield an increased understanding of the genes and molecular events responsible for a
true LMP tumour’s inability to invade the tissues by which it is surrounded. The
therapeutic manipulation of these processes may therefore have potential to greatly
reduce the mortality of invasive EOC.
Given the frequency of metastatic tumours being incorrectly diagnosed as primary
disease, accurate pathological diagnosis based on up-to-date guidelines of EOC
classification is therefore essential for any microarray-based study, to avoid
contamination of data with non-ovarian gene expression data. Possibly reflecting the
difficulty of this task, only a small number of studies comparing gene expression data
generated from true LMP and invasive EOC have been published to date.
1.4.4. Existing microarray profiling studies of LMP ovarian cancer
One of the first groups to publish a microarray-based analysis of LMP and invasive
ovarian carcinoma was Lee et al (Lee et al., 2003), using the Atlas 1.2k cDNA array
platform (Clontech, USA). 76 of the total 1176 (6.4%) nylon array features were
identified as differentially expressed between normal ovarian tissue (n=4), LMP (n=2)
and invasive EOC (n=4). A higher proportion of genes were upregulated in the invasive
tumours relative to the less malignant types, although little information was given about
the statistical method used to define differential expression or about selection and review
of the samples involved. Several of the differentially expressed genes observed in this
study had been previously implicated in the glucose/insulin pathway (e.g. S100A1,
ERBB3, HMG1), suggesting its importance in the progression of EOC. Both the small
sample size and type of microarray used are limiting factors in this study. However, a
number of biologically-relevant differentially expressed genes were identified and a
pathway describing their potential interactions with the well characterised glucose/insulin
pathway was deduced. A link was also made between molecular events in ovarian and
breast cancer based on a number of genes being previously implicated in breast cancer
studies. These include COUPTFII, one of the few down regulated genes in the invasive
EOC cases, which has reduced expression levels in 30% of breast cancer and has been
demonstrated to bind to the insulin promoter as well as influencing the expression of
CCND1 and p21.
Warrenfeltz et al (Warrenfeltz et al., 2004) used Affymetrix U95A GeneChips to profile
the expression of a small cohort (n=18) of ovarian tumours including two mucinous and
two serous LMP, with the goal of identifying genes that correlated with malignant
potential. A set of 163 genes (1.6% of reliably detected probe sets) was found to be
differentially expressed between the benign, LMP and invasive tumours compared. A
relationship between loss of insulin-like growth factor (IGF) binding proteins, molecules
involved in regulation of cell adhesion and malignant potential was observed with several
examples of multiple differentially expressed genes sharing chromosomal locations. The
authors state that the expression levels of a significant proportion of the genes identified
in the LMP tumours were intermediate between the benign and invasive samples and
suggest this as evidence of LMP tumours representing a transitional state between the two
extremes of benign and invasive EOC. However, the small sample size, the lack of
information given about the pathology review process applied to ensure the true primary
ovarian status of the samples and the large amount of evidence in the literature to the
contrary weigh against the validity of this statement.
In a study focusing on the serous type of LMP and invasive tumour, Gilks et al (Gilks et
al., 2005) generated cDNA microarray profiles for 23 tumours subject to thorough
pathologic review from two pathologists according to WHO criteria (World Health
Organization, 1999). This study used extremely comprehensive 43k cDNA microarrays;
however, only a relatively small number of genes were identified as being differentially
expressed with supervised or unsupervised analyses, leading the authors to postulate that
the mechanisms responsible for the phenotypic difference under investigation may be
outside the scope of microarray detection. A list of 541 genes (1.25% of the total
microarray feature set) was identified as being differentially expressed across the dataset
as determined by an unsupervised filter consisting of a minimum 2-fold change in at least
three samples.
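An unsupervised variation filter of this kind can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions rather than the procedure used by Gilks et al: the data layout (a dictionary of log2 ratios per gene), the function name and the choice to measure fold change relative to each gene's median across samples are all assumptions, since the text specifies only "a minimum 2-fold change in at least three samples".

```python
import math
from statistics import median


def variable_genes(log2_ratios, min_fold=2.0, min_samples=3):
    """Return the genes that pass an unsupervised variation filter.

    log2_ratios: dict mapping gene name -> list of log2 expression ratios,
    one value per sample (assumed layout). A gene passes when at least
    `min_samples` samples differ from that gene's median by `min_fold`
    or more in either direction.
    """
    threshold = math.log2(min_fold)  # a 2-fold change is 1.0 on a log2 scale
    selected = []
    for gene, values in log2_ratios.items():
        centre = median(values)
        n_changed = sum(1 for v in values if abs(v - centre) >= threshold)
        if n_changed >= min_samples:
            selected.append(gene)
    return selected


# Toy values for illustration only (not real expression data).
toy = {
    "geneA": [0.0, 0.1, 1.5, -1.2, 2.0, 0.0, -0.1],  # 3 samples beyond 2-fold
    "geneB": [0.2, 0.1, 0.0, -0.1, 0.3, 0.1, 0.0],   # no sample beyond 2-fold
    "geneC": [1.4, 0.0, 0.1, -0.2, 0.0, 0.1, 0.0],   # 1 sample beyond 2-fold
}
selected = variable_genes(toy)
```

Because the filter ignores class labels, the number of genes it retains depends only on the overall spread of the data and on the two thresholds, which is why the proportion of features passing it can be read as a rough measure of variation across the dataset.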
This figure seems disproportionately small for a microarray platform containing over
43,000 features and indicates that no significant variation was observed for over 98% of
the clones represented. Furthermore, a permutation based approach to identifying
differentially expressed genes between invasive and LMP tumours also yielded a
similarly small list of genes (n = 217; 0.5% of total clone set). Somewhat unusually, all
these differentially expressed genes were over expressed in the LMP type relative to the
invasive tumours. This observation implies that no genes are expressed at a higher level
in the invasive EOC subtype, which is known to have a comparatively faster growth rate
and proliferative ability, biological processes known to involve substantial gene
regulation.
Whilst ontology analysis of the 217 differentially expressed genes was not
carried out, one would expect a large number of differentially expressed cell cycle or
proliferation genes to be identified from a comparison of LMP and invasive tumours
because of the known difference in mitotic rate. However, based on the discussion of the
genes selected, this does not appear to have been observed.
Gilks et al make the observation that many previous array studies of ovarian cancer have
used RNA from cultured cell lines or normal OSE as a reference for normal ovary (Lu et al.,
2004; Zorn et al., 2003). However, as the normal precursor of EOC is still under question
and isolation of uncontaminated and appropriate quantities of surface epithelium
is notoriously difficult, this may not be the most suitable reference material to study the
gene expression of malignant ovarian tissue. Several potential novel markers for EOC
that have been identified from such studies, such as HE4, MUC1, MSLN and PAX8 were
observed to have higher expression in the LMP tumours compared to the invasive type in
this study. The authors highlight this as a pitfall of not including LMP tumours in studies
designed to investigate tumourigenic pathways or identify novel molecular markers.
Overall there is a trend towards larger numbers of genes being upregulated in invasive
EOC relative to those upregulated in LMP tumours, with the exception of the Gilks et al
study, reasons for which are discussed further in Chapter 5. Genes involved in the
insulin/glucose pathway have been identified as differentially expressed between the
phenotypes by at least two independent studies. Only limited analysis appears to have
been carried out for the majority of LMP microarray datasets published to date including
only minimal in silico or other functional characterisation of those genes that appear to
discriminate between the LMP and invasive subtype.
The relationship of LMP tumours to invasive serous carcinomas of varying grade, or
differentiation status, has recently been investigated by the use of oligonucleotide
microarray profiling. It was found that the LMP tumours shared many of the molecular
characteristics of the well differentiated (grade 1) invasive tumours, compared to those
with moderately (grade 2) or poorly (grade 3) differentiated features. A high degree of
similarity in gene expression was present between the higher grade tumours. These
microarray observations were supported by findings from comparative genomic
hybridisation (CGH), in which the same similarity between LMP and low grade invasive
EOC was observed: both generally had far fewer chromosomal abnormalities than the grade 2 and
3 specimens (Meinhold-Heerlein et al., 2005). Taken together, it appears that LMP EOC
has a common transcriptional profile to grade 1 invasive EOC which is lost as
dedifferentiation occurs in association with tumour progression.
1.4.5. Other microarray studies of invasive vs. non-invasive cancer subtypes
The ability of microarray-based studies to elucidate the underlying molecular processes
involved in other models of invasive vs. non-invasive cancer has been demonstrated by a
number of other studies. These include analyses of breast cancer (Iacobuzio-Donahue et
al., 2002; Kluger et al., 2004; Seth et al., 2003; van 't Veer et al., 2002; van de Vijver et
al., 2002), gastric cancer (Notterman et al., 2001), bladder cancer (Dyrskjot et al., 2003)
and prostate cancer (Bull et al., 2001; Calvo et al., 2002; Singh et al., 2002).
Invasive ductal breast carcinoma (IDC) and ductal carcinoma in-situ (DCIS) represent
two well characterised subtypes of breast cancer that are widely held to be part of a
progression from normal to malignant tissue.
In a comparison by Seth et al (Seth et al., 2003) 9k cDNA microarrays were used to
generate an expression model which consisted of 303 genes differentially expressed at a
two-fold level between DCIS and IDC. The most upregulated genes in the invasive
tumours were immunoglobulin heavy constant gamma 3 (IGHG3) and calgranulin B
(S100A9) – both known to be involved in the immune system and inflammatory response
to cancer.
Ma et al used laser microdissection of the premalignant stages of breast cancer to isolate
sufficient quantities of uncontaminated RNA to profile against breast tumours of other
stages (Ma et al., 2003). One unexpected finding from this study was the consistency of
molecular profiles from tumours of distinct pathological stages. Significant changes were
observed in the profiles of patient-matched normal breast epithelium and the first
recognised stage of malignancy, atypical ductal hyperplasia (ADH). These changes then
appear to be maintained as the tumour progresses through the following DCIS and IDC
stages. This suggests the metastatic potential of a tumour is determined in the very early
stages of its development, a hypothesis that has been validated by Van’t Veer et al (van 't
Veer et al., 2002) whereby the metastatic potential of a large cohort of tumours was
predicted based on the expression pattern of a gene set already present in the primary
tumour.
Another interesting finding from the Ma et al study was the set of genes identified as
having a relationship to both the grade of the tumours as well as the transition from DCIS
to IDC. These genes may represent a connection between tumour stage and grade,
suggesting that the mechanisms that lead to loss of differentiation and increasing
malignancy may also control a tumour’s invasive ability. RRM2 was identified as
correlating with advanced grade and stage and is thought to play a dual role in
encouraging accelerated cell proliferation as well as conferring invasive capacity to the
tumour in which it is over expressed.
While the model for breast cancer development and progression is far more established
than those presently deduced for EOC, studies such as this indicate the potential clinical
benefit that can be obtained from using microarrays to analyse tumour subtypes of
varying metastatic potential. Multi-gene signatures capable of predicting complex clinical
variables, such as probability of disease recurrence and development of metastases, as
well as identification of individual genes potentially responsible for tumour invasion, can
be achieved from microarray-based studies with appropriately reviewed sample cohorts and
independent validation methods.
1.5. Summary and goals of this thesis
This review has described the positive impact on cancer research that has resulted from
the advent of DNA microarray technology, along with some of its shortcomings and
areas in which progress is still to be made.
It also covers the clinical and molecular background of EOC, the fifth leading cause of
cancer death in women world-wide (Ries LAG, 2004). There is a clear need for a greater
understanding of the precise molecular events that the epithelial cells of the ovary
undergo during the transition from a normal to a malignant state. Attempts to identify
novel prognostic markers or molecular signatures through the use of microarray analysis
are described, both for EOC as well as cancer types such as breast, for which significant
progress has been made towards clinical application.
Furthermore, the potential benefits that may come from an understanding of the genes
and processes responsible for dictating either an invasive or non-invasive (LMP)
phenotype are also described, along with the current state of research into these areas.
This study therefore aims to:
(i) Experimentally determine the optimal conditions for carrying out a large-
scale tumour profiling study using cDNA microarrays, including selection of
an appropriate reference RNA, methods for monitoring data quality, and
the impact of scanning hardware, normalisation and replication.
(ii) Analyse a cohort of EOC gene expression data to identify genes or molecular
processes related to length of patient survival, thereby gaining an insight into
the malignant events responsible for death from this disease, and
(iii) Analyse differences in expression patterns between invasive and non-
invasive (LMP) EOC to identify genes responsible for the observed
phenotypic differences and clinical course of these disease subtypes.
2. Materials & Methods
2.1. Ethical Issues
This project took place during the establishment phase of the Australian Ovarian
Cancer Study (AOCS) (http://www.aocstudy.org). As such, the task of obtaining
appropriate ethical approvals to conduct a molecular profiling study of human tissues
obtained from a diverse range of hospitals and research institutes was undertaken by the AOCS
Management Committee, as detailed below.
2.1.1. Structure of ethical governance
Human Research Ethics Committee (HREC) approval was first obtained at Peter Mac
and the Queensland Institute of Medical Research (QIMR) (AOCS host institutions).
Approval for the study was then obtained from each of the 19 collaborating centres across
the country. Thereafter, modifications to the protocol were first considered by Peter Mac
and QIMR, and then submitted to each of the collaborating centres.
2.1.2. Ethical use of human tissues
Tissues removed from patients in the normal course of that person’s treatment are
required by law to be stored (archived) for diagnostic or forensic reasons. This tissue can
be used for teaching and quality assurance without the consent of the patient and is not
considered tissue banking.
The context of this project is in reference to the National Health and Medical Research
Council’s (NHMRC) “National Statement on Ethical Conduct in Research Involving
Humans” definition of tissue banking: “The collection and storage of human tissue into a
database specifically for the purpose of medical research which researchers and the
HREC have deemed conforms to the guidelines outlined by the National Statement”
(NHMRC, 1999).
The principle of obtaining informed consent prior to collection of tissue samples was
strongly adhered to in this study. Patients scheduled for surgery with a suspected
diagnosis of ovarian cancer were identified through the surgeons or hospital pre-
admission clinics. A research nurse or research assistant approached patients and
explained the study. Using a research assistant was considered preferable for recruitment,
rather than the treating clinician, to avoid the potential for unintentional coercion. Written
informed consent was obtained for all patients.
All aspects of the study were explained to the subject and they were asked to volunteer
for blood and tissue collection and follow up.
The disclosure of “no duty of care” towards the subject was emphasized during the
consent process. No researcher in the study was a primary care provider for the subject
and any medical questions that arose during the study were referred to the treating doctor.
Any concerns on the part of the researcher were discussed with the treating surgeon and,
if necessary, addressed by the Peter Mac HREC.
All tissue was catalogued using a unique identifier to protect the privacy of the individual.
Access to identifying information was necessary for initial case discovery and for clinical
information collection, including follow up; however, this was restricted to the Chief
Investigators, the Program Manager, and Research Nurses dealing with individual
patients.
2.1.3. Patient Identifiers used in this thesis
No personally identifying information can be interpreted from the system used to
enumerate biological specimens in this study. In general, the method used involves letters
and numbers which refer to the hospital/institute where the specimen was processed and
stored and the order in which they were selected for the study. This does not necessarily
relate to the location or time that a patient received treatment.
2.1.4. Protection of privacy
The privacy of participants was maintained in a number of ways and conformed to the
provisions of the Privacy Act 2001 (Privacy Act No.119 1988 as amended)
(www.privacy.gov.au). Personal identifiers were retained in the master database and were
accessible only to the Tissue Bank Manager, Chief Investigators and staff with specific
access rights. Electronic databases containing patient information and questionnaire data
were stored on a server at QIMR and backed up regularly. Study nurses around Australia
had password protected on-line access to the patient database for real-time monitoring of
recruitment. Data transmitted to and from the database is protected with a 128-bit
encryption algorithm implemented via Secure Socket Layer (SSL). The web server sits
behind a firewall implemented by the QIMR IT department. Biospecimen, pathology and
clinical data is stored in electronic password protected databases on a server (firewall
protected) at the Peter MacCallum Cancer Centre that is backed up regularly.
Records were kept in locked cabinets within the Tissue Bank and electronic records were
kept on the database, behind a firewall and utilizing standard Microsoft Windows 2000
administrative security measures. All records and communications with patients were
kept confidential and de-identified once entered into the database.
2.1.5. Ethical contingencies
There were no adverse events during the study that required HREC intervention.
2.2. Pathology review and associated tumour classifications
Standard pathology procedures were used to review EOC specimens for inclusion in this
study. A number of classifications were used to create groupings of patients which could
be used to compare gene expression profiles, described below.
2.2.1. Assessment of relative percentage tumour content
Hematoxylin and eosin (H&E) stained sections of fixed or fresh tumour were analysed by
either Dr Melissa Robbie or Dr Paul Waring to determine their suitability for microarray,
RT-PCR or IHC analysis. Sections were reviewed for percentage necrosis by cross-
sectional area and percentage tumour epithelial cells by the tumour nuclei method (i.e.
percentage of tumour cells present). This was on the basis that the RNA content is likely
to correlate best with the percentage of cells, with large areas of collagen containing
occasional fibroblasts likely to have a different RNA profile from a section composed
of densely packed epithelial cells. Estimates were made based on light microscopy survey
of the whole section on low to medium power.
2.2.2. Residual disease
The level of residual disease remaining after debulking surgery was categorised by
measurement of the thickness of the largest visible area of tumour remaining after
surgery.
The categories used were: 0 cm: Nil, 0-1 cm: Minimal, 1-2 cm: Moderate, >2 cm:
Maximum.
2.2.3. Tumour grade
Tumours were graded according to the level of cellular differentiation observed by the
reviewing pathologist. Grade 1: the least malignant appearance with well differentiated
cells, Grade 2: intermediate with moderately differentiated cells and Grade 3: the most
malignant, with poorly differentiated cells.
2.2.4. Tumour stage
Tumours were staged based on the international FIGO staging guidelines (Benedet et al.,
2000), detailed in Appendix A.
2.2.5. Patient status
In order to identify those patients suitable for the gene expression analysis of survival
times carried out in Chapter 4, a classification system was used to define the status of the
patient at the time of last follow-up. The status criteria are:
0 = Patient alive, disease absent,
1 = Patient alive, disease present,
2 = Patient deceased from cancer,
3 = Patient deceased from other causes,
4 = Patient deceased as a result of treatment,
5 = Patient deceased, cause unknown,
6 = Patient lost to follow-up, disease absent at last point of contact,
7 = Patient lost to follow-up, disease present at last point of contact,
8 = No registry follow-up
2.3. In-vitro methods
Table 2-1: General reagents and suppliers for in vitro work.
Equation 2-1: Formula for calculation of RT-PCR gene expression ratios
CT refers to the number of cycles at the threshold (most linear part of amplification
graph). ∆CT is the difference in CT of a particular gene (GENE) compared to reference
and normalised against the gene HPRT. HPRT is a gene that did not vary its expression
significantly across all samples tested and was used as a DNA loading control. Ratio
refers to the expression ratio compared to the reference. The reference in these
experiments was the universal reference, allowing comparison to the microarrays given
the same reference was used.
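The calculation described above can be illustrated with a minimal sketch. It assumes the widely used 2^-ΔΔCT form of the ratio and an amplification efficiency of 100%, which may differ from the exact form of Equation 2-1; the function names and CT values are hypothetical.

```python
def delta_ct(ct_gene: float, ct_hprt: float) -> float:
    """CT of the gene of interest minus CT of the HPRT control in the same sample."""
    return ct_gene - ct_hprt


def expression_ratio(ct_gene_sample: float, ct_hprt_sample: float,
                     ct_gene_ref: float, ct_hprt_ref: float) -> float:
    """Expression ratio of a gene in a sample relative to the reference RNA.

    Assumption: ratio = 2 ** -(dCT_sample - dCT_reference), i.e. the product
    doubles with every PCR cycle.
    """
    ddct = (delta_ct(ct_gene_sample, ct_hprt_sample)
            - delta_ct(ct_gene_ref, ct_hprt_ref))
    return 2.0 ** (-ddct)


# Hypothetical CT values: the gene crosses the threshold two cycles earlier
# (relative to HPRT) in the tumour than in the reference RNA.
ratio = expression_ratio(24.0, 22.0, 26.0, 22.0)
print(ratio)
```

Because the same universal reference is used as the denominator, a ratio computed this way is on the same scale as the test:reference ratio measured on the microarray.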
2.3.7. Tissue microarray construction
Tissue Microarrays (TMAs) were created for high-throughput validation and large-scale
experimental design. Tissue was sourced from both the AOCS collection and from the
St Vincent's Hospital (Melbourne) by Dr Melissa Robbie. TMAs were produced
essentially as described in Sambrook and Bowtell (Sambrook and Bowtell, 2003) and
schematically in Figure 2-1.
Briefly, H&E stained slides from cases identified as suitable for inclusion in the study
(i.e. ovarian serous carcinomas, invasive or LMP) were reviewed to confirm the
diagnosis, find areas of tumour typical of the diagnosis and check that these areas
contained features plentifully represented elsewhere (i.e. the area was diagnostically
redundant). This area was then circled on the slide and used to locate the matching area
on the paraffin block for needle punch biopsy. The diagnosis was recorded but no other
information retained. Agar blocks were processed in paraffin for the recipient block.
After melting, the histology scientist Neal O’Callaghan attempted to press the cores down
so that both long and short cores were present on the base of the cassette, which becomes the
cutting face.
Two identical copies of each TMA were constructed to allow a large number of sections
to be cut for this and future analyses. The finished blocks were stored by the Pathology
Department in appropriate conditions. The layout of each TMA constructed and relevant
information for each specimen is shown in Appendix B. Tissue histology and original
specimen number were recorded for each grid reference and stored in a Microsoft Excel
spreadsheet.
Figure 2-1: Schematic diagram of the TMA-construction process. Multiple formalin-fixed tissue specimens are embedded in individual paraffin blocks. A 2mm core of tumour-representative tissue is then taken with a punch biopsy tool. The cores (shown here packed together) are inserted into a pre-cored, paraffin-embedded recipient block of agar as described in the text. The final block, consisting of up to 54 tumour cores, is then sectioned into thin slices and placed onto standard microscopy slides for IHC analysis (Liotta and Petricoin, 2000).
2.3.8. Immunohistochemistry
Blocks were routinely sectioned at 3µm. The sections were then stored in foil at room
temperature to avoid exposure to light and air. Immediately prior to use, sections were de-
waxed with xylene for 3 minutes twice and then rehydrated by passage through a series of
ethanol solutions (100%, 100% to 70% and then tap water).
Antigen retrieval (Shi et al., 2001) was necessary for all antibodies used. Sections were
placed into 10 mM Sodium Citrate buffer (pH 6.0) and boiled under pressure for two
minutes using Biocare Decloaker (Biocare Medical, USA).
IHC was performed on a Dako Autostainer (Dako, USA) and all incubations were
performed in a humidified chamber at room temperature for 30 minutes. Blocking was
carried out for 10 mins in 3% hydrogen peroxide. Dako diluent was used in the
concentrations shown in Table 2-3 (Dako Product Code S0809).
The primary antibody was then detected with a polymer linked detection system,
Envision+ (Dako) with a 30 minute incubation. The chromogen, DAB+ (Dako K3468)
was applied for 10 minutes. Slides were finally washed and then counterstained with
Haematoxylin and progressively dehydrated through an ethanol series (70% to 100%),
then placed in xylene. Cover slips were mounted using DPX mounting medium (BDH)
and air dried overnight in a fume hood.
Table 2-3: Antibody information used for IHC on TMAs.
Antibody  Clone  Supplier    Product code    Dilution  Detection system
ER        6F11   Novocastra  NCL-ER-L-6F11   1:100     Envision+ mouse
Ki67      MIB1   Dako        M7240           1:100     Envision+ mouse
2.4. In-silico methods
A variety of data analysis methods were used in this thesis to interrogate several
raw data types (predominantly microarray gene expression data) and to explore a
range of biological questions.
2.4.1. Image capture and data extraction
Hybridised microarray slides were scanned with either a ScanArray 5000 (Packard
Bioscience, USA) or an Agilent Microarray Scanner BA (Agilent Technologies, USA), as
indicated in the text.
For the ScanArray 5000, the confocal laser was focused using both channels. Excitatory
wavelengths of 570 nm and 670 nm were used for Cy3 and Cy5 channels, respectively.
The ScanArray 5000 scanner required manual allocation of laser power and
photomultiplier tube (PMT) settings. These settings were selected to produce the largest
dynamic range of signal detection, with minimal increase in background intensity.
Finally, the settings for each laser (Cy3 and Cy5) were adjusted to give equivalent
excitation to avoid bias due to the dominant Cy5 signal.
For the Agilent Microarray Scanner, excitatory wavelengths of 570 nm and 670 nm were
used for Cy3 and Cy5 channels, respectively. Unlike the ScanArray 5000, the Agilent
scanner does not require manual adjustment of laser power and PMT settings. It therefore
minimises the photo-bleaching of features caused by the iterative scanning otherwise needed to obtain the optimum
dynamic range of signal intensity. This process is done independently and simultaneously
for both Cy3 and Cy5 channels, which significantly improves the signal-to-noise ratio.
Furthermore, the dynamic auto-focus ability of the Agilent scanner achieves better spot-
to-spot consistency by minimizing spatial bias that results from glass curvature and
misaligned slides. These features are compared and contrasted further in Chapter 3.
2.4.2. Microarray image analysis
A 16-bit TIFF image was obtained for each hybridisation channel, which was stored
initially on a dedicated hard-drive at Peter Mac and subsequently archived onto DVD-
RW media. The images were reviewed using a pseudo-colour overlay image of the
Cy5/Cy3 channels, with red allocated to Cy5 and green to Cy3. The overlay images
provide the ability to conduct a visual assessment of background non-specific staining and
other staining artefacts, consistency of spot morphology, and relative signal intensity
between the two excitation channels.
Data extraction from TIFF images and conversion was performed with either GenePix
Pro 4.1 (“GenePix”) (Molecular Devices, USA) or Quantarray (Packard Bioscience,
USA) as specified. These programs function by overlaying a user-defined grid structure
onto the scanned TIFF images corresponding to each hybridisation channel. The identity,
layout and size of the probes are built into the grid file which, after being positioned
accurately, converts the pixel intensities to numerical measurements. The areas of the
microarray used for hybridisation intensity quantification are described in Figure 2-2.
Specific features on each microarray with poor morphology or no signal detection were
flagged as ‘absent’ in the image analysis software, which assigns a code to the
hybridisation values recorded that can be used to inform the data analysis software of this
fact. In Quantarray this is carried out manually by the operator by visual inspection of the
array images whereas GenePix uses a series of numerical criteria to identify poor quality
or missing features. The formula used to exclude poor quality features for microarrays in
this thesis is shown below in Equation 2-2.
These criteria translate to the flagging of features for which fewer than 55% of pixels
in both channels have an intensity greater than the median local background intensity plus
one standard deviation, or which have a diameter of less than 80 µm or greater than 150 µm, more than
3% saturated pixels in either channel, or a sum of Cy3 and Cy5 median intensities of less than 300. Spots that are
flagged by these rules can be excluded from downstream data analysis to reduce the
chance of introducing systematic noise into a gene expression profile.
[% > B635+1SD] > 55 or [% > B532+1SD] > 55 And
[Dia.] > 80 And [Dia.] < 150 And
[Flags] <> [Bad] And
[Flags] <> [Absent] And
[Flags] <> [Not Found] And
[F532 % Sat.] < 3 And
[F635 % Sat.] < 3 And
[Sum of Medians] > 300
Equation 2-2: Criteria for flagging poor quality cDNA microarray features in GenePix image analysis software
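As an illustration, the criteria of Equation 2-2 can be expressed as a small predicate. This is a sketch only: the field names are hypothetical stand-ins for the GenePix result columns, and it assumes that the lower-case "or" groups only the two background tests, with every other condition joined by "And".

```python
from dataclasses import dataclass


@dataclass
class Feature:
    pct_above_b635_1sd: float  # % of pixels above Cy5 background + 1 SD
    pct_above_b532_1sd: float  # % of pixels above Cy3 background + 1 SD
    diameter_um: float         # feature diameter in micrometres
    flag: str                  # e.g. "Good", "Bad", "Absent", "Not Found"
    f532_pct_sat: float        # % saturated pixels, Cy3 channel
    f635_pct_sat: float        # % saturated pixels, Cy5 channel
    sum_of_medians: float      # sum of Cy3 and Cy5 median intensities


def passes_quality(f: Feature) -> bool:
    """Return True if a feature satisfies every criterion of the query."""
    signal_ok = f.pct_above_b635_1sd > 55 or f.pct_above_b532_1sd > 55
    size_ok = 80 < f.diameter_um < 150
    flag_ok = f.flag not in {"Bad", "Absent", "Not Found"}
    saturation_ok = f.f532_pct_sat < 3 and f.f635_pct_sat < 3
    return (signal_ok and size_ok and flag_ok and saturation_ok
            and f.sum_of_medians > 300)


spot = Feature(pct_above_b635_1sd=82.0, pct_above_b532_1sd=75.0,
               diameter_um=110.0, flag="Good", f532_pct_sat=0.0,
               f635_pct_sat=0.5, sum_of_medians=1450.0)
print(passes_quality(spot))
```

Features for which the predicate is false would be the ones flagged and excluded from downstream analysis.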
The measurements of hybridisation intensities are stored in a tab-delimited text file which
can be read into a number of different analysis packages or opened directly by most
spreadsheet applications.
Figure 2-2: Schematic diagram of cDNA microarray image analysis. The region shaded black indicates the area of the slide in between spotted probes used to assess the level of background hybridisation. The GenePix program uses the median intensity level of this area whilst Quantarray uses the mean level. A 2-pixel gap is left between the areas used for quantification to avoid including small fragments of the spotted probe, or other artefact, in the calculation of background intensity (Axon, 2004).
2.4.3. Normalisation of cDNA microarray data
Normalisation refers to adjustment of systematic differences in the relative intensity of
the Cy3 and Cy5 channels so that data can be compared within and between microarrays
(Yang et al., 2002b).
Normalisation of microarray data in this study was carried out using a range of methods.
These include simple median normalisation, intensity-dependent normalisation (Yang et
al., 2002b) or one of two spatially-dependent normalisation algorithms, SNOMAD
(Colantuoni et al., 2002) and print-tip normalisation (Yang et al., 2002b).
The latter of these was incorporated into a broad range of bioinformatic tools toward the
later stages of this study and has therefore been used in a larger number of the published
analyses to date that have made use of these tools. Print-tip normalisation is implemented
in packages such as BRB ArrayTools (Biometric Research Branch, National Cancer
Institute, USA) and Bioconductor (www.bioconductor.org) (Gentleman et al., 2004).
Both SNOMAD and print tip methods use information about a gene's location within a
microarray when determining the level of correction that needs to be applied.
2.4.3.1. Median normalisation
Median normalisation is a simple method for addressing variation between cDNA
microarrays within one experiment and also for centring the distribution of genes within each
array around the expression ratio of 1.0, which indicates equal expression in both test and
reference RNA samples.
This is achieved by dividing the value of each gene (i) by its median value across the
entire dataset and (ii) by the median of all values on the particular microarray.
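The two steps can be sketched as follows. This is a minimal illustration that assumes a complete matrix of positive expression ratios with no missing values and applies the per-gene step before the per-array step; the function name and the data are hypothetical.

```python
from statistics import median


def median_normalise(ratios):
    """ratios: list of genes, each a list of expression ratios (one per array)."""
    # (i) divide each gene by its median across the entire dataset
    by_gene = [[value / median(row) for value in row] for row in ratios]
    # (ii) divide each value by the median of all values on its array
    n_arrays = len(by_gene[0])
    array_medians = [median(row[j] for row in by_gene) for j in range(n_arrays)]
    return [[row[j] / array_medians[j] for j in range(n_arrays)]
            for row in by_gene]


# Three genes (rows) measured on three arrays (columns)
raw = [[1.0, 2.0, 4.0],
       [2.0, 2.0, 8.0],
       [0.5, 1.0, 1.0]]
for row in median_normalise(raw):
    print(row)
```

After the second step the median ratio on every array is 1.0, which is the centring property described above.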
2.4.3.2. Intensity-dependent normalisation
Intensity-dependent normalisation (also called non-linear or lowess normalisation) is a
technique that is used to eliminate dye-related artefacts in two-colour experiments that
cause the Cy5/Cy3 ratio to be affected by the total intensity of the spot. This
normalisation process attempts to correct for artefacts caused by non-linear rates of dye
incorporation as well as inconsistencies in the relative fluorescence intensity between
some red and green dyes. Such artefacts often result in a curve in the graph of raw versus
control signal (Quackenbush, 2002; Yang et al., 2002b; Yang and Speed, 2002).
In the absence of bias, one would expect the Cy5:Cy3 ratio to be independent of signal
intensity, and thus the data points would be scattered symmetrically around the 1:1 line of
Cy5:Cy3 expression. Intensity-dependent normalisation fits a curve through the
expression data and uses this curve to adjust the control value for each measurement.
When the resulting normalised data are graphed versus the adjusted control value, the
points are distributed more symmetrically between hybridisation channels. In this project,
20% of the total data is used for the smoothing process (Quackenbush, 2002).
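A simplified sketch of this kind of correction is given below. It is not the implementation used by any particular package: it works on the log-ratio (M) versus mean log-intensity (A) representation, fits a locally weighted straight line with tricube weights and no robustness iterations, and uses the 20% smoothing span mentioned above. All names and values are illustrative.

```python
import math


def lowess_fit(x, y, frac=0.2):
    """Locally weighted linear fit of y on x, evaluated at every x."""
    n = len(x)
    k = max(2, math.ceil(frac * n))  # number of neighbours in each local fit
    fitted = []
    for xi in x:
        nearest = sorted((abs(xj - xi), j) for j, xj in enumerate(x))[:k]
        h = nearest[-1][0] or 1e-12  # bandwidth: distance to k-th neighbour
        sw = swx = swy = swxx = swxy = 0.0
        for d, j in nearest:
            w = (1.0 - (d / h) ** 3) ** 3  # tricube weight
            sw += w
            swx += w * x[j]
            swy += w * y[j]
            swxx += w * x[j] * x[j]
            swxy += w * x[j] * y[j]
        denom = sw * swxx - swx * swx
        if abs(denom) < 1e-12:  # degenerate neighbourhood: weighted mean
            fitted.append(swy / sw)
        else:
            slope = (sw * swxy - swx * swy) / denom
            fitted.append((swy - slope * swx) / sw + slope * xi)
    return fitted


def intensity_dependent_normalise(cy5, cy3, frac=0.2):
    """Return log2(Cy5/Cy3) for each spot after subtracting the fitted curve."""
    m = [math.log2(r / g) for r, g in zip(cy5, cy3)]
    a = [0.5 * math.log2(r * g) for r, g in zip(cy5, cy3)]
    return [mi - fi for mi, fi in zip(m, lowess_fit(a, m, frac))]


# Hypothetical array on which Cy5 is uniformly twice as bright as Cy3
cy3 = [100.0 * (i + 1) for i in range(20)]
cy5 = [2.0 * g for g in cy3]
print(intensity_dependent_normalise(cy5, cy3))
```

In practice a robust lowess routine from an established statistics library would normally be preferred over a hand-written fit such as this one.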
2.4.3.3. Print tip normalisation
Examples of an individual microarray slide, pre and post normalisation, as well as a
representation of the entire dataset at these same stages, are shown in Figure 2-3. Each
array print tip (n=24, corresponding to the number of array sub-grids present) is
represented by a box plot in A and B of this figure. The variation of data from each print-
tip from the baseline expression of 0.0 (log2 of 1.0) can be observed and was corrected by
the normalisation process as shown in panel B. Panels C and D of this figure show the
entire dataset, with each array represented by an individual box and whisker plot. The
normalisation process has centred the distribution of data points for each array
about the baseline expression level, effectively correcting for any bias in fluorescence
intensity (Yang et al., 2002b).
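The centring effect visible in the box plots can be illustrated with a deliberately simplified sketch that subtracts the median log2 ratio of each print-tip group from its members. This is an assumption made for brevity: published print-tip normalisation fits an intensity-dependent curve within each print-tip group rather than removing a single median, and the names and values below are hypothetical.

```python
from collections import defaultdict
from statistics import median


def print_tip_median_centre(log_ratios, tip_ids):
    """Centre the log2 ratios of each print-tip group on 0.0."""
    groups = defaultdict(list)
    for m, tip in zip(log_ratios, tip_ids):
        groups[tip].append(m)
    centre = {tip: median(values) for tip, values in groups.items()}
    return [m - centre[tip] for m, tip in zip(log_ratios, tip_ids)]


# Two print tips drifting in opposite directions from the baseline
log_ratios = [0.5, 0.7, 0.9, -1.0, -0.8, -0.6]
tip_ids = [1, 1, 1, 2, 2, 2]
print(print_tip_median_centre(log_ratios, tip_ids))
```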
2.4.3.4. SNOMAD normalisation
SNOMAD uses a two-dimensional approach whereby a topographical view of each
hybridisation channel is created using the lowess algorithm, and the difference between
the patterns of expression in each channel is used to determine the level of adjustment to
apply to each data point.
Figure 2-3: Impact of print-tip normalisation on cDNA microarray expression data (Herrero et al., 2003; Vaquerizas et al., 2004). (A) Sample individual microarray dataset prior to normalisation. Each box and whisker corresponds to data generated by an individual print tip in the microarray fabrication process. Some drift from base line expression (horizontal dashed line) can be observed. (B) The same sample microarray post normalisation – all data is now centred on the baseline expression level. (C) Representation of entire dataset used for Chapter 5 before normalisation shows a large degree of variation in the data range of individual arrays with the majority having a median expression ratio of <1, suggesting a bias towards the Cy3 channel. (D) Chapter 5 dataset following print tip normalisation.
[Figure 2-3 image: panels A and B, axis label "Print tip ID"; panels C and D, axis label "Individual array ID"]
Figure 2-4: Diagram of 3D lowess-based mapping of array data carried out during SNOMAD normalisation. (A) Mean pixel intensity vs. Cy5:Cy3 ratio for each array feature (Ratio Intensity Plot) of sample array. Blue dots indicate upregulated genes, red dots down-regulated. (B) Distribution of same up and down regulated features in two dimensional ‘virtual array’ view of microarray slide. A disproportionate number of similarly regulated genes appear in opposite corners of this array, indicating a technical error has resulted in spatial bias. The lowess curve fitting algorithm is then used to map the variation in feature intensity relative to the location on the array and used to normalise the features for (C) the test channel and (D) the reference channel. (E) Ratio-Intensity plot of normalised data shows the redistributed expression data still contains similar proportions of up and down regulated genes. (F) ‘Virtual array’ view of slide reveals a more even distribution of up and down regulated features across the array surface. (Colantuoni et al., 2002)
2.4.4. Microarray data visualisation methods
2.4.4.1. Hierarchical Clustering
Clustering is one method for uncovering patterns of gene expression and the relationships
between these patterns, and for reducing data complexity to facilitate visualisation.
Hierarchical clustering uses similarity algorithms to divide genes or samples into groups
with similar gene expression profiles (Quackenbush, 2001). In this thesis clustering was
carried out using GeneSpring (Silicon Genetics, USA) and unless otherwise stated, genes
are displayed on the horizontal axis and samples on the vertical.
In any clustering algorithm, the calculation of a ‘distance’ between any two objects is
fundamental to placing them into groups. Correlations of multiple experiments (arrays)
are performed through a weighted correlation in which the weight of each experiment can
be specified. It is possible to make one sample more important in the clustering process
than another. If all of the experiments or experiment sets are given the same weight, they
are averaged equally. For example, Experiment 1 could be given a weight of 2 and
Experiment 2 a weight of 1. In this example, the correlations found in
Experiment 1 are twice as influential in creating the tree as the correlations between the
genes in Experiment 2.
The equation used to determine the overall correlation is shown as Equation 2-3 where
the variables are:
A: The correlation coefficient between the gene in question in experiment 1 and the gene
named in the Experiments to Use box, also from Experiment 1.
a: the weight specified for Experiment 1.
B: The correlation coefficient of the gene in question in experiment 2, to the gene named
in the title bar, also from Experiment 2.
b: The weight associated with Experiment 2
C: The correlation coefficient of the gene in question in experiment 3 to the gene named
in the title-bar, also from Experiment 3
c: The weight associated with Experiment 3, and so on.
$$X = \frac{Aa + Bb + Cc + \dots}{a + b + c + \dots}$$
Equation 2-3: Equation for determining the overall correlation coefficient for hierarchical clustering
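As a worked illustration of the overall correlation, the sketch below computes the weighted average of per-experiment correlation coefficients and checks it against a user-specified range. The function names, correlation values and bounds are hypothetical and are not taken from GeneSpring.

```python
def overall_correlation(correlations, weights):
    """Weighted average of per-experiment correlations: (Aa + Bb + ...) / (a + b + ...)."""
    weighted_sum = sum(c * w for c, w in zip(correlations, weights))
    return weighted_sum / sum(weights)


def passes(x, minimum, maximum):
    """True if the overall correlation lies within the specified range."""
    return minimum <= x <= maximum


# Experiment 1 (correlation 0.9) is given twice the weight of Experiment 2 (0.3)
x = overall_correlation([0.9, 0.3], [2, 1])
print(round(x, 3), passes(x, 0.5, 1.0))
```

With equal weights the same two correlations would average to 0.6, so doubling the weight of Experiment 1 moves the overall value towards its correlation.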
If X is between the minimum and maximum correlations specified by the researcher, the
gene in question passes the correlation filter. The minimum distance and separation ratio
equation is shown as Equation 2-4.
Equation 2-4: Equation for determining minimum distance and separation ratio for hierarchical clustering
To make a tree or dendrogram, GeneSpring calculates the correlation for each gene with
every other gene in the set. Then it takes the highest correlation and pairs those two
genes, averaging their expression profiles. GeneSpring then compares this new composite
gene with all of the other unpaired genes.
This is repeated until all of the genes have been paired. At this point the minimum
distance and the separation ratio come into play. Both of these affect the branching
behaviour of the tree. The minimum distance deals with how far down the tree discrete
branches are depicted. Using a value smaller than p=0.001 has minimal impact because
few genes are more highly correlated than this cut off. A higher number tends to
incorporate more genes into each group, making the groups less specific.
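The pairing procedure described above can be sketched in a few lines. This is an illustrative reimplementation, not GeneSpring's code: it uses an unweighted Pearson correlation, merges the most highly correlated pair by averaging their profiles, and ignores the minimum distance and separation ratio settings. Gene names and values are hypothetical.

```python
from itertools import combinations
from statistics import mean


def pearson(x, y):
    """Pearson correlation of two profiles (0.0 if either profile is constant)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    denom = (sxx * syy) ** 0.5
    return sxy / denom if denom else 0.0


def build_tree(profiles):
    """profiles: dict of gene name -> expression values. Returns a nested tuple."""
    nodes = dict(profiles)
    while len(nodes) > 1:
        # find the pair of nodes with the highest correlation
        a, b = max(combinations(nodes, 2),
                   key=lambda pair: pearson(nodes[pair[0]], nodes[pair[1]]))
        pa, pb = nodes.pop(a), nodes.pop(b)
        # replace the pair with a composite "gene": the average of both profiles
        nodes[(a, b)] = [(u + v) / 2 for u, v in zip(pa, pb)]
    return next(iter(nodes))


tree = build_tree({"g1": [1, 2, 3, 4], "g2": [2, 4, 6, 8], "g3": [4, 3, 2, 1]})
print(tree)
```

Here g1 and g2 are paired first because their profiles are perfectly correlated, and the composite is then joined to g3.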
2.4.4.2. Principal component analysis and multidimensional scaling
Principal component analysis (PCA) is a decomposition technique that produces a set of
expression patterns known as principal components (Holter et al., 2000). Linear
combinations of these patterns can be assembled to represent the behaviour of all of the
genes in a given data set. The application of PCA to microarray gene expression data was
carried out according to principles first described by Raychaudhuri et al (Raychaudhuri et
al., 2000).
Principal Components Analysis is a covariance analysis between different factors.
Covariance is always measured between two factors. So with three factors, covariance is
measured between factor x and y; y and z, and x and z. When more than 2 factors are
involved, covariance values can be placed into a matrix. This is where PCA becomes
useful. PCA will find eigenvectors and eigenvalues relevant to the data using a
covariance matrix. Eigenvectors can be thought of as “preferential directions” of a data
set, or in other words, main patterns in the data. For PCA on genes, an eigenvector would
be represented as an expression profile that is most representative of the data. For PCA on
conditions, an eigenvector could be similar to main condition profiles. For either PCA,
there cannot be more components than there are conditions in the data. Eigenvalues can
be thought of as quantitative assessment of how much a component represents the data.
The higher the eigenvalues of a component, the more representative it is of the data.
Eigenvalues can also be representative of the level of explained variance as a percentage
of total variance. By themselves, eigenvalues are not informative. The percent of
variance explained is dependent on how well all the components summarize the data. In
theory, the sum of all components explains 100% of the variability in the data.
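The relationship between the covariance matrix, its leading eigenvector and the proportion of variance explained can be shown with a small sketch. It uses power iteration to find only the first principal component, which is an illustrative shortcut rather than the decomposition used by the analysis software; the data are hypothetical, and the method assumes the starting vector is not orthogonal to the leading eigenvector.

```python
def covariance_matrix(rows):
    """Sample covariance between the columns (factors) of rows."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    return [[sum((r[i] - means[i]) * (r[j] - means[j]) for r in rows) / (n - 1)
             for j in range(p)] for i in range(p)]


def leading_component(cov, iterations=200):
    """Largest eigenvalue and its unit eigenvector, by power iteration."""
    p = len(cov)
    v = [1.0 + i for i in range(p)]  # arbitrary non-zero starting vector
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * sum(cov[i][j] * v[j] for j in range(p))
                     for i in range(p))
    return eigenvalue, v


# Four samples measured under two strongly correlated conditions
data = [[2.0, 1.9], [0.5, 0.7], [-1.0, -1.1], [-1.5, -1.5]]
cov = covariance_matrix(data)
eigenvalue, eigenvector = leading_component(cov)
total_variance = sum(cov[i][i] for i in range(len(cov)))
explained = eigenvalue / total_variance
print(round(explained, 4))
```

Because the eigenvalues of a covariance matrix sum to its trace, dividing one eigenvalue by the total variance gives the share of variance explained by that component.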
2.4.5. Unsupervised identification of differential gene expression
In order to reduce the size of a microarray dataset to facilitate other analyses, an
unsupervised filter is often applied to remove genes not significantly contributing to the
phenotype of interest. The unsupervised nature of these filters means that no information
about the class structure present in the dataset is used to select genes at this stage.
2.4.5.1. Fold change method
A common method for identifying genes without substantial variation in expression
compared to the reference channel intensity is to filter on the basis of a minimum fold
change. This is done by specifying a proportion of a dataset in which a given gene must
be x-fold differentially expressed. A common application of this filter is to exclude genes
that do not vary at least 1.5-fold in at least 20% of the dataset. Studies have shown that
cDNA microarray platforms are accurate at detecting gene expression fold changes of
1.4-fold or greater (Yue et al., 2001).
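A filter of this kind can be sketched as follows. The sketch assumes that fold change is measured relative to the reference (a ratio of 1.0) and is symmetric, so that a ratio of 1/1.5 or lower also counts as a 1.5-fold change; some analysis packages measure the change from the gene's median instead. Function names and data are hypothetical.

```python
def passes_fold_change(ratios, min_fold=1.5, min_fraction=0.2):
    """True if the gene is at least min_fold changed in min_fraction of arrays."""
    changed = sum(1 for r in ratios if r >= min_fold or r <= 1.0 / min_fold)
    return changed / len(ratios) >= min_fraction


expression = {
    "geneA": [1.0, 1.1, 0.9, 1.6, 0.5],  # changed on 2 of 5 arrays
    "geneB": [1.0, 1.1, 0.9, 1.2, 1.0],  # never changed 1.5-fold
}
kept = [gene for gene, ratios in expression.items() if passes_fold_change(ratios)]
print(kept)
```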
2.4.5.2. Signal-to-noise
The signal-to-noise method of gene selection involves the calculation of the following
formula for each gene: (µ0 - µ1) / (σ0 + σ1); where µ and σ represent the mean and
standard deviation of the expression level of each class, respectively.
A threshold can then be applied to the signal-to-noise scores, or the genes can be ranked
by this metric and an algorithm which iteratively tests for the optimal number, and
combination, of genes for a particular task can use this ranking to prioritise genes within
the search.
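A minimal sketch of the score and the ranking step is shown below. It uses the sum of the two class standard deviations in the denominator, which is the conventional form of this statistic, and sample standard deviations; the gene names and expression values are hypothetical.

```python
from statistics import mean, stdev


def signal_to_noise(class0, class1):
    """Difference in class means divided by the sum of class standard deviations."""
    return (mean(class0) - mean(class1)) / (stdev(class0) + stdev(class1))


# Log expression values for each gene in class 0 and class 1
genes = {
    "geneA": ([2.0, 3.0, 4.0], [0.0, 1.0, 2.0]),
    "geneB": ([1.0, 2.0, 3.0], [1.5, 2.0, 2.5]),
}
scores = {name: signal_to_noise(c0, c1) for name, (c0, c1) in genes.items()}
ranked = sorted(scores, key=lambda name: abs(scores[name]), reverse=True)
print(scores, ranked)
```

Ranking by the absolute value of the score places genes that are strongly higher in either class at the top of the list.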
2.4.5.3. Log-expression variation method
An alternative method to using absolute fold-changes for identifying differentially
expressed genes is the calculation of p-values for each gene describing their level of
statistical variance. This method does not rely on setting a fixed threshold for excluding
genes; rather, it is based upon statistical comparison of the variation of a particular gene
across the dataset to either baseline expression (i.e. 1.0) or the median variation of all
genes on the array. Those genes not significantly more variable than the median gene, at a
pre-determined p-value, are filtered out. A p-value of 0.001 was used for most log-expression
variation filtering in this thesis.
Specifically, the quantity (n-1) Vari / Varmed is computed for each gene: i. Vari is the
variance of the log intensity for gene i across the entire set of n arrays and Varmed is the
median of these gene-specific variances. This quantity is compared to a percentile of the
chi-square distribution with n-1 degrees of freedom. This is an approximate test of the
hypothesis that gene i has the same variance as the median variance
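The variance statistic can be sketched as below. To keep the example free of external libraries, the chi-square critical value is passed in by the caller; the example uses three arrays (two degrees of freedom), for which the upper 0.001 critical value has the closed form -2 ln(0.001). For other numbers of arrays the critical value would normally be taken from a statistics library or table. The function name and data are hypothetical.

```python
import math
from statistics import median, variance


def variance_filter(log_values_by_gene, critical_value):
    """Keep genes whose (n - 1) * Var_i / Var_med exceeds `critical_value`.

    `critical_value` is the chosen upper percentile of the chi-square
    distribution with n - 1 degrees of freedom (supplied by the caller).
    """
    variances = {g: variance(v) for g, v in log_values_by_gene.items()}
    var_med = median(variances.values())
    kept = []
    for gene, var_i in variances.items():
        n = len(log_values_by_gene[gene])
        statistic = (n - 1) * var_i / var_med
        if statistic > critical_value:
            kept.append(gene)
    return kept


# Hypothetical log-ratios for three genes on n = 3 arrays.
data = {
    "stable_1": [0.0, 0.1, -0.1],
    "stable_2": [0.2, 0.1, 0.3],
    "variable": [-2.0, 0.0, 2.0],
}
# With 2 degrees of freedom the chi-square survival function is exp(-x / 2),
# so the p = 0.001 critical value is -2 * ln(0.001), approximately 13.8.
critical = -2.0 * math.log(0.001)
kept = variance_filter(data, critical)
print(kept)
```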
2.4.6. Identification of genes differentially expressed between tumour subtypes
2.4.6.1. Significance Analysis of Microarrays (SAM)
The SAM method is a popular method for identifying genes with significant differential
expression from microarray data, described by Tusher et al (Tusher et al., 2001). The
SAM algorithm is one method of controlling the False Discovery Rate (FDR), which is
defined in SAM as the median number of false positive genes divided by the number of
significant genes. The SAM algorithm is an alternative to the multivariate permutation
test used by several other algorithms described in this chapter.
Firstly, for each gene in the dataset, a modified F-statistic (or t-statistic for two-class data)
is computed, in which a “fudge factor for standard deviation” is included in the denominator to
stabilize the gene-specific standard deviation estimates. The F-statistics are then
sorted from smallest to largest (F(1), F(2), …, F(i), …, F(n)), where n is the number of
genes. Next, the class labels are permuted and a set of ordered F-statistics is re-computed
for each permutation. The expected ordered statistics are estimated as the average
of the ordered statistics over the set of permutations. A cut point is then defined as
F(i*(∆)), where i*(∆) is the first index i at which the actual ordered F-statistic is larger than
the expected ordered F-statistic by a threshold value ∆. Genes
which have an F-statistic larger than this cut point are considered to be “significant”. For
random permutations, any “significant” genes are presumed to be false positives, and a
median number of false positive genes can be computed over the set of permutations. The
median number of false positive genes is then multiplied by a shrinkage factor π, which
represents the proportion of true null genes in the dataset, and is computed as the number
of actual F-statistics which fall within the interquartile range of the set of F-statistics
computed for all permutations and all genes, divided by the quantity of 0.5 times the
number of genes. If this π factor is greater than 1, then a π factor of 1 is used instead. The
median number of false positive genes, multiplied by π and divided by the number of
“significant” genes, yields the FDR for a given ∆ value.
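Two pieces of the procedure above are sketched here for the two-class case: the modified t-statistic with the fudge factor added to the denominator, and the final FDR calculation. This is an illustrative sketch only. The pooled standard error formula follows the usual two-sample form, the fudge factor `s0` is passed in rather than estimated from the data, and the permutation counts in the example are invented.

```python
import math
from statistics import mean, median


def sam_d(group0, group1, s0):
    """Modified two-class t-statistic: the fudge factor `s0` is added to the
    gene-specific pooled standard error in the denominator."""
    n0, n1 = len(group0), len(group1)
    m0, m1 = mean(group0), mean(group1)
    ss = sum((x - m0) ** 2 for x in group0) + sum((x - m1) ** 2 for x in group1)
    s = math.sqrt((1 / n0 + 1 / n1) * ss / (n0 + n1 - 2))
    return (m1 - m0) / (s + s0)


def sam_fdr(false_positive_counts, n_significant, pi0):
    """FDR for one delta: median permutation false-positive count, multiplied
    by the factor pi0 (capped at 1), divided by the number of genes called
    significant in the real data."""
    if n_significant == 0:
        return 0.0
    return min(pi0, 1.0) * median(false_positive_counts) / n_significant


# Hypothetical values: one gene, three arrays per class.
d_plain = sam_d([1.0, 2.0, 3.0], [4.0, 5.0, 6.0], s0=0.0)
d_stabilised = sam_d([1.0, 2.0, 3.0], [4.0, 5.0, 6.0], s0=1.0)

# Hypothetical false-positive counts from three permutations, with 20 genes
# called significant in the real data and pi0 = 0.9.
fdr = sam_fdr([2, 4, 6], n_significant=20, pi0=0.9)
print(d_plain, d_stabilised, fdr)
```

Adding the fudge factor shrinks the statistic, most strongly for genes whose standard deviation estimate is small.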
2.4.6.2. Class comparison using data blocking
Genes that were differentially expressed between classes of samples were determined
using a multivariate permutation test (Korn et al., 2004b; Simon et al., 2003a). In
comparing classes of samples, gene expression variation from a potentially confounding
factor was controlled for using ‘data blocking’. This refers to the inclusion of
confounding factors in the general linear model (ANOVA) used to identify genes with
expression differences between classes of interest. This could be different batches of
microarray slides in the one experiment, or a variable such as residual disease volume
remaining after surgery. As such, variation in the expression data that is attributable to
these variables is controlled for, allowing the test to identify genes with expression
differences between the true classes of interest.
The multivariate permutation test was used to provide 90% confidence that the false
discovery rate was less than 10%. The false discovery rate is the proportion of the list of
genes claimed to be differentially expressed that are false positives. The test statistics
used are random variance F-statistics for the effect of tumour type for each gene (Wright
and Simon, 2003). The F-statistics were computed from a two-way analysis of variance
with tumour type and gender as factors. Although F-statistics were used, the multivariate
permutation test is non-parametric and does not require the assumption of Gaussian
distributions. In our analyses to find genes that were differentially expressed among
classes, technical replicates of the same sample were averaged.
2.4.7. Machine-learning approaches for class prediction
2.4.7.1. Survival analysis
Survival analysis was used to identify genes whose expression was significantly related to
survival of the patients in a given experiment. A statistical significance level was
computed for each gene based on univariate proportional hazards models (Cox, 1972).
These p values were then used in a multivariate permutation test (Korn et al., 2004b;
Simon et al., 2003a) in which the survival times and censoring indicators were randomly
permuted among arrays. The multivariate permutation test was used to provide 90%
confidence that the false discovery rate was less than 10%. The false discovery rate is the
proportion of the list of genes claimed to be differentially expressed that are false
positives. The multivariate permutation test is non-parametric and does not require the
assumption of Gaussian distributions.
2.4.7.2. Quantitative Trait Analysis
Genes whose expression was significantly related to a continuous variable such as patient
survival were identified with Quantitative Trait Analysis. A statistical significance level
was computed for each gene for testing the hypothesis that the Spearman’s correlation
between gene expression and survival time (in months) was zero. These p values were
then used in a multivariate permutation test (Korn et al., 2004b; Simon et al., 2003a) in
which the survival times were randomly permuted among arrays. The multivariate permutation test
was used to provide 90% confidence that the false discovery rate was less than 10%.
The false discovery rate is the proportion of the list of
genes claimed to be differentially expressed that are false positives. The multivariate
permutation test is non-parametric and does not require the assumption of Gaussian
distributions.
2.4.8. Class Prediction
Using the BRB ArrayTools package (Simon and Lam), models for utilising gene
expression profile to predict the class of future samples were created. Models developed
were based on the Compound Covariate Predictor (Radmacher et al., 2002), Diagonal
Linear Discriminant Analysis (Dudoit et al., 2002), Nearest Neighbour Classification
(Dudoit et al., 2002), and Support Vector Machines with linear kernel (Ramaswamy et
al., 2001). The models incorporated genes that were differentially expressed among classes
at the 0.001 significance level as assessed by the random variance t-test (Wright and
Simon, 2003). The prediction error of each model was estimated using leave-one-out
cross-validation (LOOCV) as described by Simon et al (Simon et al., 2003b). For each
LOOCV training set, the entire model building process was repeated, including the gene
selection process. It was also evaluated whether the cross-validated error rate estimate for
a model was significantly less than one would expect from random prediction. The class
labels were randomly permuted and the entire LOOCV process was repeated. The
significance level is the proportion of the random permutations that gave a cross-
validated error rate no greater than the cross-validated error rate obtained with the real
data. 1000 random permutations were used.
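The key point of the cross-validation described above is that gene selection is repeated inside every training fold, so the left-out sample never influences which genes enter the model. A minimal sketch is given below. The toy model builder (select the single gene with the largest difference in class means, then classify by the nearest class mean) is a stand-in chosen for brevity and is not one of the predictors used in this thesis; all names and data are hypothetical.

```python
from statistics import mean


def build_single_gene_model(train_x, train_y):
    """Toy model builder: gene selection followed by a nearest-mean rule.
    Assumes exactly two classes, each present in the training set."""
    a, b = sorted(set(train_y))
    n_genes = len(train_x[0])

    def class_mean(gene, label):
        return mean(x[gene] for x, y in zip(train_x, train_y) if y == label)

    # Gene selection happens here, using the training samples only.
    best = max(range(n_genes),
               key=lambda g: abs(class_mean(g, a) - class_mean(g, b)))
    mean_a, mean_b = class_mean(best, a), class_mean(best, b)

    def predict(x):
        return a if abs(x[best] - mean_a) <= abs(x[best] - mean_b) else b

    return predict


def loocv_error(samples, labels, build_model):
    """Leave-one-out error; the whole model-building process, including
    gene selection, is repeated for every training set."""
    errors = 0
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        predict = build_model(train_x, train_y)
        if predict(samples[i]) != labels[i]:
            errors += 1
    return errors / len(samples)


# Hypothetical log-ratios for two genes in six specimens.
samples = [[1.0, 0.2], [1.2, 0.1], [0.9, 0.3],
           [3.0, 0.2], [3.1, 0.25], [2.9, 0.15]]
labels = ["LMP", "LMP", "LMP", "INV", "INV", "INV"]
error_rate = loocv_error(samples, labels, build_single_gene_model)
print(error_rate)
```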
2.4.8.1. Multivariate permutation tests for controlling the number and proportion of false discoveries
The multivariate permutation tests for controlling the number and proportion of false
discoveries are used for class comparison, survival analysis, and quantitative trait
analysis. Using a stringent p<0.001 threshold for identifying differentially expressed
genes is a valid way for controlling the number of false discoveries. A false discovery is a
gene that is declared differentially expressed among the classes, when in fact it is not.
There are two problems with this approach to controlling the number of false discoveries.
One is that it is based on p values computed from the parametric t/F tests or random
variance t/F tests. These parametric p values may not be accurate in the extreme tails of
the normal distribution for small numbers of samples. The second problem is that this
approach does not take into account the correlation among the genes. Using stringent p
value thresholds on the univariate permutation p values will not be effective when there are
few samples and will not account for correlations. Multivariate permutation tests that
accomplish both objectives were used in this study, as described in Technical Report 3,
Biometric Research Branch, National Cancer Institute, 2002
(http://linus.nci.nih.gov/~brb) and also by Reiner et al (Reiner et al., 2003).
The multivariate permutation tests are based on permutations of the labels of which
experiments are in which classes. If there are fewer than 1000 possible permutations, then
all permutations are considered. Otherwise, a large number of random permutations are
considered. For each permutation, the parametric tests are re-computed to determine a p
value for each gene that is a measure of the extent it appears differentially expressed
between the random classes determined by the random permutation. The genes are
ordered by their p values computed for the permutation (genes with smallest p values at
the top of the list). For each potential p value threshold, the program records the number
of genes in the list. This process is repeated for a large number of permutations.
Consequently, for any p value threshold, we can compute the distribution of the number
of genes that would have p values smaller than that threshold for permutations. That is the
distribution of the number of false discoveries, since genes that are significant for random
permutations are false discoveries. The algorithm selects a threshold p value so that the
number of false discoveries is no greater than that specified by the user C% of the time,
where C denotes the desired confidence level.
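The threshold selection step can be sketched as follows, assuming that the per-gene p values for each permutation have already been computed. The number of false discoveries in a permutation exceeds u exactly when its (u+1)-th smallest p value falls below the threshold, so the threshold is taken as a low quantile of that order statistic across permutations. The function name, the integer handling of the confidence level and the example p values are assumptions made for the illustration.

```python
def mpt_threshold(permutation_p_values, max_false_discoveries, confidence_percent):
    """Choose a p value threshold so that, in at least `confidence_percent`
    of permutations, no more than `max_false_discoveries` genes have a
    p value strictly below the threshold.

    `permutation_p_values` holds one list of per-gene p values per permutation.
    """
    u = max_false_discoveries
    # (u + 1)-th smallest p value within each permutation, sorted across permutations.
    order_stats = sorted(sorted(p_values)[u] for p_values in permutation_p_values)
    # Number of permutations allowed to exceed u false discoveries.
    allowed = (100 - confidence_percent) * len(order_stats) // 100
    return order_stats[allowed]


# Hypothetical p values from 10 permutations of a 3-gene dataset.
smallest = [0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.10]
permutation_p_values = [[m, m + 0.2, m + 0.4] for m in smallest]

threshold = mpt_threshold(permutation_p_values,
                          max_false_discoveries=0, confidence_percent=90)

# Genes from the real (unpermuted) data whose p value is below the threshold.
observed = {"gene_a": 0.001, "gene_b": 0.015, "gene_c": 0.3}
selected = [gene for gene, p in observed.items() if p < threshold]
print(threshold, selected)
```

In this invented example only one of the ten permutations contains a p value below the chosen threshold, which corresponds to 90% confidence of no false discoveries.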
The procedures for controlling the number or proportion of false discoveries are based on
multivariate permutation tests. Although parametric p values are used in the procedures,
the permutation distribution of these p values is determined, and hence the false discovery
control is non-parametric and does not depend on normal distribution assumptions.
The multivariate permutation tests also take advantage of the correlation among the
genes. For a given p value for truncating an ordered gene list, the expected number of
false discoveries does not depend on the correlations among the genes, but the distribution of
the number of false discoveries does. The distribution of the number of false discoveries is
skewed for highly correlated data. If the confidence coefficient is specified at 50%, the
program provides the length of the gene list associated with a specified median number of
false discoveries or given proportion of false discoveries.
2.4.8.2. Compound Covariate Predictor
The Compound Covariate (CC) method of prediction uses a weighted linear combination
of log-ratios (or log intensities for single-channel experiments) for genes that are
univariately significant at the specified level. By specifying a more stringent significance
level, fewer genes are included in the multivariate predictor. Genes in which larger values
of the log-ratio pre-dispose to class 2 rather than class 1 have weights of one sign,
whereas genes in which larger values of the log-ratios pre-dispose to class 1 rather than
class 2 have weights of the opposite sign. The univariate t-statistics for comparing the
classes are used as the weights. Detailed information about the Compound Covariate
Predictor is available in Hedenfalk et al (Hedenfalk et al., 2001) or in Technical Report
01, 2001, Biometric Research Branch, National Cancer Institute, USA
(http://linus.nci.nih.gov/~brb/TechReport.htm).
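A minimal sketch of a compound covariate predictor for two classes is shown below. It assumes that the genes passed in have already been selected as univariately significant, uses the pooled-variance t-statistic as the weight, and classifies a new sample by comparing its compound covariate to the midpoint of the two class means of that score in the training set. The names and data are hypothetical.

```python
import math
from statistics import mean


def t_statistic(a, b):
    """Two-sample pooled-variance t-statistic (class 1 minus class 2)."""
    na, nb = len(a), len(b)
    ma, mb = mean(a), mean(b)
    ss = sum((x - ma) ** 2 for x in a) + sum((x - mb) ** 2 for x in b)
    pooled_sd = math.sqrt(ss / (na + nb - 2))
    return (ma - mb) / (pooled_sd * math.sqrt(1 / na + 1 / nb))


def train_ccp(class1, class2):
    """Return a predict(sample) function giving class 1 or 2.
    Each sample is a list of log-ratios for the pre-selected genes."""
    n_genes = len(class1[0])
    weights = [t_statistic([s[g] for s in class1], [s[g] for s in class2])
               for g in range(n_genes)]

    def score(sample):
        # Compound covariate: t-statistic weighted sum of log-ratios.
        return sum(w * x for w, x in zip(weights, sample))

    cut = (mean(score(s) for s in class1) + mean(score(s) for s in class2)) / 2

    def predict(sample):
        # With weights defined as class 1 minus class 2, class 1 scores higher.
        return 1 if score(sample) > cut else 2

    return predict


# Hypothetical training data: two genes, three specimens per class.
class1 = [[2.0, -1.0], [2.2, -0.8], [1.8, -1.2]]
class2 = [[0.0, 1.0], [0.2, 1.2], [-0.2, 0.8]]
predict = train_ccp(class1, class2)
print(predict([1.9, -0.9]), predict([0.1, 1.1]))
```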
2.4.8.3. Diagonal Linear Discriminant Analysis
The Diagonal Linear Discriminant Analysis (DLDA) is similar to the Compound
Covariate Predictor. It is a version of linear discriminant analysis that ignores correlations
among the genes in order to avoid over-fitting the data. Many complex methods have too
many parameters for the amount of data available. Consequently they appear to fit the
training data used to estimate the parameters of the model, but they have poor prediction
performance for independent data. The study by Dudoit et al (Dudoit et al., 2002) found
that diagonal linear discriminant analysis performed as well as much more complicated
methods on a range of microarray data sets.
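The following sketch illustrates the diagonal form of linear discriminant analysis: each gene contributes its squared distance from the class mean, scaled by a pooled within-class variance, and covariances between genes are ignored. Equal class priors are assumed, and the names and data are hypothetical.

```python
from statistics import mean


def train_dlda(samples_by_class):
    """Return a predict(x) function for diagonal linear discriminant analysis.

    `samples_by_class` maps class label -> list of samples (lists of
    log-ratios for the selected genes). Assumes equal class priors.
    """
    classes = list(samples_by_class)
    n_genes = len(samples_by_class[classes[0]][0])
    means = {c: [mean(s[g] for s in samples_by_class[c]) for g in range(n_genes)]
             for c in classes}
    n_total = sum(len(v) for v in samples_by_class.values())

    # Pooled within-class variance per gene; correlations between genes are ignored.
    pooled_var = []
    for g in range(n_genes):
        ss = sum((s[g] - means[c][g]) ** 2
                 for c in classes for s in samples_by_class[c])
        pooled_var.append(ss / (n_total - len(classes)))

    def predict(x):
        def distance(c):
            return sum((x[g] - means[c][g]) ** 2 / pooled_var[g]
                       for g in range(n_genes))
        return min(classes, key=distance)

    return predict


# Hypothetical training data: two genes, three specimens per class.
training = {
    "LMP": [[2.0, -1.0], [2.2, -0.8], [1.8, -1.2]],
    "invasive": [[0.0, 1.0], [0.2, 1.2], [-0.2, 0.8]],
}
predict = train_dlda(training)
print(predict([1.7, -0.5]), predict([0.3, 0.9]))
```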
2.4.8.4. k Nearest Neighbour Predictor
The k Nearest Neighbour (kNN) Predictor is based on determining which expression
profile in the training set is most similar to the expression profile of the specimen whose
class is to be predicted.
The expression profile is a vector of log-ratios or log-intensities for the genes selected for
inclusion in the multivariate predictor. Euclidean distance is used as the distance metric
for the Nearest Neighbour Predictor. Once the nearest neighbour in the training set of the
test specimen is determined, the class of that nearest neighbour is used as the predicted
class of the test specimen. kNN prediction is an extension of the Nearest Neighbour
method. For example, with the 3-Nearest Neighbour algorithm, the expression profile of
the test specimen is compared to the expression profiles of all of the specimens in the
training set and the 3 specimens in the training set most similar to the expression profile
of the test specimen are determined. The distance metric is also Euclidean distance with
regard to the genes that are univariately significantly differentially expressed between the
two classes at the threshold significance level specified. Once the 3 nearest specimens are
identified, their classes vote and the majority class among the 3 is the class predicted for
the test specimen.
This approach was first applied to microarray data by Golub et al (Golub et al., 1999) and,
despite its relative simplicity compared to other algorithms, is frequently used for microarray
analysis due to its realistic computational requirements and potential for producing highly
accurate results.
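The voting scheme described above can be sketched in a few lines. The example assumes that the expression profiles contain only the genes already selected for the predictor; the function name and data are hypothetical.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, x, k=3):
    """Predict the class of profile `x` by majority vote among the `k`
    training profiles closest to it in Euclidean distance."""
    by_distance = sorted(range(len(train_x)),
                         key=lambda i: math.dist(train_x[i], x))
    votes = Counter(train_y[i] for i in by_distance[:k])
    return votes.most_common(1)[0][0]


# Hypothetical two-gene expression profiles.
train_x = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
           [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]]
train_y = ["LMP", "LMP", "LMP", "invasive", "invasive", "invasive"]

print(knn_predict(train_x, train_y, [0.5, 0.5], k=3))
print(knn_predict(train_x, train_y, [5.5, 5.5], k=1))
```

Setting `k=1` gives the Nearest Neighbour predictor; an odd `k` avoids tied votes in a two-class problem.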
2.4.8.5. Nearest Centroid Predictor
Nearest Centroid Prediction (NC) is another algorithm implemented in a range of data
analysis tools, including the ArrayTools suite (Simon and Lam). In the training set there
are samples belonging to class 1 and to class 2. The centroid of each class is determined.
The centroid of class 1, for example, is a vector containing the means of the log-ratios (or
log intensities for single label data) of the training samples in class 1. There is a
component of the centroid vector for each gene represented in the multivariate predictor;
that is, for each gene that is univariately significantly differentially expressed between the
two classes at the threshold significance level specified. The distance of the expression
profile for the test sample to each of the two centroids is measured and the test sample is
predicted to belong to the class corresponding to the nearest centroid.
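A minimal sketch of nearest centroid prediction for two classes follows. Euclidean distance is assumed as the distance measure, and the names and data are hypothetical.

```python
import math
from statistics import mean


def centroid(samples):
    """Per-gene mean of the log-ratios across the samples of one class."""
    return [mean(gene_values) for gene_values in zip(*samples)]


def nearest_centroid_predict(class1, class2, x):
    """Return 1 or 2 according to which class centroid is closer to `x`.
    Assumption: Euclidean distance is used to measure closeness."""
    d1 = math.dist(centroid(class1), x)
    d2 = math.dist(centroid(class2), x)
    return 1 if d1 <= d2 else 2


# Hypothetical two-gene training profiles.
class1 = [[2.0, -1.0], [2.2, -0.8], [1.8, -1.2]]
class2 = [[0.0, 1.0], [0.2, 1.2], [-0.2, 0.8]]

print(centroid(class1))
print(nearest_centroid_predict(class1, class2, [1.5, -0.5]))
print(nearest_centroid_predict(class1, class2, [0.4, 0.6]))
```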
2.4.8.6. Support Vector Machines
A Support Vector Machine (SVM) is a class prediction algorithm that has appeared
effective in other contexts and is currently of great interest to the bioinformatics and
machine learning communities. SVMs were developed by V. Vapnik (Vapnik, 1998).
The SVM predictor is a linear function of the log-ratios or the log-intensities that best
separates the data subject to penalty costs on the number of specimens misclassified. The
SVM implementation used in this study is LIBSVM of Chang and Lin (Fan et al., 2005).
For all classification tasks the SVM algorithm used a one-vs.-all architecture, which
means that separate classifiers were trained to distinguish each class from all the other
cases in the data set. An individual support vector machine solution is determined by the
vector w and the constant b (bias) obtained as the minimiser of the so-called ‘regularised
risk’, where n is the number of genes; w ∈ R^n is a vector of dimensionality equal to
the number of genes, n, and ||w|| is its Euclidean norm; b ∈ R is a constant (bias);
x_i ∈ R^n and y_i = ±1 are the sample measurement vector and label, i = 1, …, S
(number of samples); “·” denotes the dot product in R^n; and C > 0.
Equation 2-5: Equation for determining an individual support vector machine solution
The SVM creates a hyperplane in n-dimensional space where n is the number of genes
selected. This hyperplane in effect allows a decision of whether a case is within a class or
not. The relative distance of a case from the hyperplane provides a measure of decision
confidence.
The output from the SVM is a score between -1 and +1. A score of +1 denotes that the case
lies within the class in question, whereas -1 denotes that the case does not belong to that class. In Figure
2.2, Class 1 may be the default and is hence labelled +1. All class 2 cases will score between -1 and 0,
depending on the confidence of the prediction. The more confident the SVM prediction,
the closer the resultant score is to -1 or +1. Therefore, in a three-class problem, the highest
score determines the class label (Brown et al., 2000).
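The one-vs.-all decision rule can be sketched as follows for linear classifiers that have already been trained. The weight vectors and biases below are invented for illustration, and the scores shown are raw decision values w·x + b; any rescaling of these values to the range -1 to +1 is not shown.

```python
def decision_value(w, b, x):
    """Signed score of sample `x` for one linear classifier: w . x + b.
    Positive values fall on the 'in class' side of the hyperplane."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def one_vs_all_predict(classifiers, x):
    """`classifiers` maps class label -> (w, b), one classifier per class,
    each trained to separate that class from all others. The class whose
    classifier gives the highest score is predicted."""
    scores = {label: decision_value(w, b, x)
              for label, (w, b) in classifiers.items()}
    return max(scores, key=scores.get), scores


# Hypothetical classifiers for a three-class problem with two genes.
classifiers = {
    "serous": ([1.0, -1.0], 0.0),
    "mucinous": ([-1.0, 1.0], 0.0),
    "endometrioid": ([0.5, 0.5], -1.0),
}
label, scores = one_vs_all_predict(classifiers, [2.0, 0.5])
print(label, scores)
```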
2.4.8.7. Cross validated calculations of misclassification probabilities
To determine the p-value for the cross-validated misclassification error rate, permutation
analysis was carried out. For each random permutation of class labels, the entire cross-
validation procedure was repeated to determine the cross-validated misclassification rate
obtained from developing a multivariate predictor with two random classes. The final p-
value is the proportion of the random permutations that gave as small a cross-validated
misclassification rate as was obtained when using the true class labels. A cross-validated
misclassification rate and a corresponding p-value for each class prediction method used
is obtained.
One thousand permutations were carried out for each test of this kind, unless otherwise
stated, to obtain a statistically robust permutation p-value for the cross-validated
misclassification rate of a given algorithm.
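The permutation procedure can be sketched as below. The cross-validated error function is passed in as an argument, so that the entire cross-validation is repeated for every permutation of the class labels. For brevity, the example uses a leave-one-out nearest neighbour rule on a single hypothetical gene as the error function; the names, the seed and the data are assumptions made for the illustration.

```python
import random


def loo_nearest_neighbour_error(values, labels):
    """Leave-one-out error of a 1-nearest-neighbour rule on a single gene."""
    errors = 0
    for i, value in enumerate(values):
        others = [j for j in range(len(values)) if j != i]
        nearest = min(others, key=lambda j: abs(values[j] - value))
        if labels[nearest] != labels[i]:
            errors += 1
    return errors / len(values)


def permutation_p_value(values, labels, cv_error, n_permutations=1000, seed=0):
    """Proportion of label permutations whose cross-validated error is no
    greater than the error obtained with the true labels."""
    rng = random.Random(seed)
    observed = cv_error(values, labels)
    as_good = 0
    for _ in range(n_permutations):
        shuffled = list(labels)
        rng.shuffle(shuffled)
        # The whole cross-validation is repeated on the permuted labels.
        if cv_error(values, shuffled) <= observed:
            as_good += 1
    return observed, as_good / n_permutations


# Hypothetical log-ratios of one gene in eight specimens.
values = [1.0, 1.1, 1.3, 1.6, 3.0, 3.1, 3.3, 3.6]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
observed_error, p_value = permutation_p_value(
    values, labels, loo_nearest_neighbour_error)
print(observed_error, p_value)
```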
2.4.9. Gene ontology analysis
2.4.9.1. EASE
Gene ontology analysis was carried out on the list of overlapping genes identified as
being significantly differentially expressed between histological subtypes to determine
whether the difference in the size of these lists corresponded to particular classes of genes
or known functional groups. The online ontology analysis tool EASE (Hosack et al.,
2003) was used to determine significantly represented ontologies or biological ‘themes’,
in both sets of histologically discriminant genes.
Briefly, this method annotates a given list of genes with their known ontology
memberships and statistically compares the biological themes represented in the list with
the total ontology profile of a ‘background list’ of genes, usually the entire list of genes
present on the array being used. It also takes into account the total number of genes in the
genome known to belong to each ontology classification. A Fisher's exact p-value and
EASE score are calculated and the ontologies are ranked in order of significance. The
number and names of genes present in each class can be viewed, alongside the number of
genes present in the background list belonging to the same class, as well as the total
number of genes on a genome level with the same classification. These values are used to
determine the significance of observing groups of genes with similar functions in a given
list of genes.
The EASE score is a conservative adjustment to the Fisher exact probability that weights
significance in favour of themes supported by more genes. The concept of jack-knifing a
probability is the theoretical basis of the EASE score (Baty et al., 2005). The stability of
any given statistic can be ascertained by a procedure called jack-knifing. This is where a
single data point is removed and the statistic is recalculated many times to give a
distribution of probabilities that is broad if the result is highly variable and tight if the
result is robust. The EASE score is calculated by penalizing (removing) one gene within
the given category from the list and calculating the resulting Fisher exact probability for
that category. It therefore represents the upper bound of the distribution of jack-knife
Fisher exact probabilities and has advantages in terms of penalizing the significance of
categories supported by few genes.
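The relationship between the Fisher exact probability and the EASE score can be sketched as follows. The 2x2 table layout (list hits, list non-hits, background hits and background non-hits, with the background counts excluding the list) and the use of a one-sided test for over-representation are assumptions made for the example; the EASE score is obtained by removing one gene from the list hits and recomputing the probability.

```python
from math import comb


def fisher_upper_tail(a, b, c, d):
    """One-sided Fisher exact probability of observing at least `a` list hits.

    a: genes in the list that belong to the category
    b: genes in the list that do not
    c: background genes (outside the list) that belong to the category
    d: background genes (outside the list) that do not
    """
    row1, col1, total = a + b, a + c, a + b + c + d
    upper = min(row1, col1)
    favourable = sum(comb(col1, x) * comb(total - col1, row1 - x)
                     for x in range(a, upper + 1))
    return favourable / comb(total, row1)


def ease_score(a, b, c, d):
    """Fisher exact probability after removing one gene from the list hits."""
    if a == 0:
        return 1.0
    return fisher_upper_tail(a - 1, b, c, d)


# Hypothetical counts: 2 of 3 list genes in the category, 1 of 7 background genes.
p_fisher = fisher_upper_tail(2, 1, 1, 6)
p_ease = ease_score(2, 1, 1, 6)
print(p_fisher, p_ease)
```

In this small example the EASE score is larger (more conservative) than the Fisher exact probability, reflecting the penalty on categories supported by few genes.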
2.4.9.2. FatiGO
Comparative ontology analysis was carried out using the FatiGO algorithm, which
operates on similar concepts to EASE, described above, but has the ability to
statistically compare the ontologies represented by two separate gene lists (Al-Shahrour et
al., 2004). This method was used for comparing gene lists generated from a range of
analyses to determine if significantly different biological themes were represented by
either list. A Fisher's exact test is used to calculate the level of significance and
permutation testing is used to control the false discovery rate, i.e. the chance of selecting
ontologies as being differentially represented between two gene lists by chance alone, due
to the large numbers of individual tests carried out.
2.4.10. Quantification of IHC staining
In order to analyse the staining intensity of the antibodies used for IHC validation work in
this thesis, three high power (400x) fields of each stained tumour section were captured.
An effort was made to select those areas most representative of the predominant staining
pattern in each tumour. These images were backed up onto external media then processed
with a modified protocol for quantifying IHC staining intensity using Adobe Photoshop
CS2 (Adobe Systems Inc., USA) described by a number of groups (Lehr et al., 1997;
Lehr et al., 1999; Matkowskyj et al., 2000).
This method involves the application of a threshold filter to each image, calibrated for
each antibody or IHC batch if necessary, to exclude the unstained sections of the image as
well as any non-specific staining. The threshold command in Adobe Photoshop converts
colour images to high-contrast black-and-white. All pixels lighter than the threshold
value, determined by the operator, are converted to white; all pixels darker are converted
to black. The threshold can be set between 0 and 255, the digital tonal range of the image.
Next the image is inverted and converted to greyscale, resulting in the areas of antibody
staining appearing white and the rest of the field black. A histogram of pixel intensities is
then created for the inverted image and the mean and standard deviation values exported
into a tab-delimited text file. Statistical analysis of these mean and standard deviation
values was then carried out using Minitab version 13 (Minitab Inc, USA) to determine if a
significant difference existed between the IHC intensities of the tumour classes under
investigation. Examples of this method are given in chapter five.
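The image processing steps above were performed in Adobe Photoshop; the following sketch only mirrors the arithmetic on a list of grey-level pixel values (0 = black, 255 = white). The handling of pixels exactly equal to the threshold and the use of the population standard deviation are assumptions, and the pixel values are invented.

```python
from statistics import mean, pstdev


def quantify_staining(pixels, threshold):
    """Threshold, invert and summarise grey-level pixel values.

    Pixels lighter than `threshold` become white (255) and the rest black (0);
    the binary image is then inverted so that stained (dark) areas are white.
    Returns the mean and standard deviation of the inverted image.
    """
    binary = [255 if p > threshold else 0 for p in pixels]
    inverted = [255 - p for p in binary]
    return mean(inverted), pstdev(inverted)


# Hypothetical field: two darkly stained pixels and two unstained pixels.
pixels = [30, 40, 200, 220]
mean_intensity, sd_intensity = quantify_staining(pixels, threshold=128)
print(mean_intensity, sd_intensity)
```

A field in which half of the pixels are stained therefore gives a mean intensity of half the tonal range.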
3. Optimisation of cDNA microarray profiling for large-scale tumour profiling studies
3.1. Introduction
This chapter aims to experimentally analyse several technical aspects of the cDNA
microarray work flow in order to determine the optimal parameters for large-scale tumour
profiling studies. The use of a universal or project-specific reference RNA and impact of
microarray scanner on the robustness and statistical accuracy of expression data generated
are evaluated. Issues of replication and normalisation are investigated and a novel method
for quantification of spatial bias in expression data is proposed.
The concepts introduced and findings generated in this chapter have implications for
chapters four and five, in which cDNA microarrays are used to profile EOC specimens
with the goals of exploring the molecular basis of clinically important phenotypic
differences.
3.1.1. A method for quantification of spatial bias in cDNA microarray data
Systematic error is frequently detected in data generated from microarray experiments.
Common sources of such non-biological variability include differences between the
labelling efficiency of the Cy3/Cy5 dyes and inconsistencies in the surface on which the
probes are printed (Quackenbush, 2002). These factors can be minimised by careful
laboratory work and selection of quality reagents and substrates, but cannot be
completely avoided. Different methods of normalisation and other data manipulations,
such as analysis of gene expression ratio ranks rather than actual intensities, have been
proposed (Bilban et al., 2002; Broberg, 2003; Hoffmann et al., 2002; Qin and Kerr, 2004)
and are routinely used to correct for sources of technical noise in microarray data.
Specific examples of normalisations include intensity-dependent (non-linear)
normalisation and print-tip normalisation, both of which use a locally weighted
regression method of curve fitting (Lowess) (Cleveland, 1979).
Intensity-dependent normalisation is used to compensate for differences in the labelling
efficiency of the Cy dyes used to fluorescently label the reverse-transcribed sample RNA,
whilst print-tip lowess normalisation is used to compensate for variation in the amount of
probe deposited by individual pins during the printing process (Park et al., 2003).
The issue of technically introduced spatial bias, or position effect (Miles, 2001; Smyth et
al., 2003) is more difficult to identify as it cannot be readily detected with a Ratio-
Intensity plot (also known as an ‘M vs. A’ plot). These are a common method for
visualising the range and distribution of both Cy3/Cy5 expression ratios and absolute
hybridisation intensities of microarray data (Yang et al., 2002b). Inspection of these plots
can assist in determining the type of normalisation required, particularly in the case of
bias due to higher incorporation or fluorescence intensity of one Cy dye. Print-tip
normalisation (Yang et al., 2002b) and Statistical Normalisation of Microarray Data
(SNOMAD) (Colantuoni et al., 2002) are two methods that take the physical position of
the array features into account in the normalisation process. However, as with all
statistical manipulations of raw data, normalisation of array data can have a detrimental
impact. Often the overall (dynamic) range of data points is substantially reduced in the
process of correcting for technical bias (Yang and Speed, 2002). This can potentially
impact on the biological question under investigation by reducing the proportion of genes
observed to be differentially expressed over a given threshold or between known classes
of samples (Hoffmann et al., 2002). Therefore, wherever possible, it is preferable to
identify and correct the cause of systematic error at the experimental level, rather than
relying on statistical manipulations, themselves a potential source of systematic variation
(Tsodikov et al., 2002).
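For reference, the two quantities shown on a Ratio-Intensity ('M vs. A') plot can be computed for each array feature as sketched below, using the conventional definitions M = log2(Cy5/Cy3) and A = ½ log2(Cy5 × Cy3). The function name and intensity values are hypothetical.

```python
import math


def ma_values(cy5, cy3):
    """Return (M, A) for one array feature.

    M is the log2 expression ratio and A is the mean log2 intensity of the
    two channels. Both intensities must be positive.
    """
    m = math.log2(cy5 / cy3)
    a = 0.5 * math.log2(cy5 * cy3)
    return m, a


# Hypothetical background-corrected intensities for three features.
features = [(4000.0, 1000.0), (500.0, 500.0), (250.0, 1000.0)]
ma = [ma_values(cy5, cy3) for cy5, cy3 in features]
for m, a in ma:
    print(round(m, 3), round(a, 3))
```

Plotting M against A for all features shows any intensity-dependent dye bias as curvature away from M = 0, but it carries no information about where each feature sits on the slide, which is why spatial bias is not apparent in such a plot.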
3.1.2. Reference RNA options for large-scale cDNA microarray profiling studies
As microarray technology gives researchers the ability to investigate the expression of
thousands of genes in parallel, it is ideally suited for studying diseases with genetic
foundations such as EOC. To date, the wide availability and lower cost of cDNA
‘spotted’ microarrays has led to many academic research institutes, including the Peter
MacCallum Cancer Centre, adopting this technology for large-scale tumour profiling
studies. Microarray technology is a recent development compared to more traditional
methods for analysing patterns of gene expression information, such as differential
display (DD) (Liang and Pardee, 1992; Martin and Pardee, 1999), representational
differential analysis (RDA) (Lisitsyn et al., 1993) or suppression subtractive hybridisation
(SSH) (Diatchenko et al., 1996). Therefore, before conducting a large-scale project using
valuable human tissue derived RNA, extensive planning and validation of the techniques
involved should be carried out in order to conserve resources and ensure the data
produced accurately represents true biology. Key decisions around choice of reference
RNA, appropriate levels of replication, selection of equipment, software and algorithms
for scanning, data extraction, normalisation and analysis need to be made at the outset of
the project to ensure the output is free of any technical bias and represents a true
measurement of gene expression in a particular tissue or cell line at a given time.
At the beginning of this project, little information was available on the impact of using
different types of reference RNA on the data generated. The selection of the most
appropriate material for the study in question is imperative as this provides a ‘base line’
expression level for the genes being investigated. Because the reference RNA is an
experimental constant used for every microarray slide in a cDNA microarray-based study,
it is important to select one most appropriate for the goals of the planned study. Changing
either the type or even batch of reference RNA used has the potential to significantly alter
the results and may lead to a significant ‘batch effect’ in the final expression data. If this
occurs, one cannot determine if any observed differences in gene expression correspond
to the type or batch of reference used, rather than true variation in gene expression
profiles in specimens being profiled.
To investigate the most suitable reference RNA for the EOC-profiling study planned as
part of the AOCS, an experiment was carried out using two conceptually different
reference RNA types to determine what, if any, impact each reference had on the quality
of the expression data produced. The reference RNAs used for this comparison were:
(i) Pooled cell line reference: composed of RNA from 11 human cell lines (of
leukaemia, myeloma, acute promyelocytic leukaemia, fibrosarcoma and
hepatocellular origins) as first described by Perou et al (Perou et al., 2000).
(ii) Pooled tumour reference: composed of RNA extracted from a subset of the
ovarian tumours to be profiled for future analysis as part of the AOCS.
Tumours were selected for the pool to represent the most common
histological subtypes; predominantly serous and mucinous, with smaller
numbers of endometrioid, benign and other less frequent subtypes.
Specific sample information for the cell lines and tumour specimens used to create the
reference RNA stocks are given in Appendix C.
The pooled tumour reference represented a project-specific reference RNA. Samples were
chosen for inclusion in the ‘tumour pool’ in approximately the same histological subtype
proportions as observed in the general population. Hybridisations against this reference
will result in expression ratios with a common denominator approximately equivalent to
an ‘average’ specimen of EOC. The pooled cell-line based reference is designed to
represent a molecular ‘average’ of gene expression across a selection of the most
common forms of the disease and therefore is not specific to cancer of one primary
origin.
One of the key goals of the AOCS gene expression profiling study is to identify
potentially subtle differences in gene expression between known classes of EOC, such as
histological subtypes for example, as well as the identification of novel classes that
correspond to variables of clinical importance. Because of this, the ability of a reference
RNA to give maximum resolution of gene expression differences between histological
and other subtypes was an important factor in evaluating the performance of these two
reference types. Other bioinformatic measures used to evaluate the references were:
Accuracy and reproducibility of synthetic quality control genes. These are
printed in the last row of each sub-grid and targeted by synthetic RNAs spiked
into the reverse transcribed specimen RNA to be hybridised. Comparison of the
observed expression ratios to the theoretical values of these features can be used
to assess the accuracy of the microarray platform.
The proportion of the genes on the array which were identified as detectably
expressed. As the final gene expression measurements obtained from two-colour
cDNA arrays are ratios of two intensity readings (test and reference intensity), a
reference that binds to a larger proportion of the total array will produce a larger
number of valid expression ratios available for downstream analysis.
The proportion of genes on the array identified as differentially expressed.
Differential expression can be defined as a gene above or below a predetermined
ratio threshold or based on statistical assessment of differential expression based
on the global variance present on each array analysed. Many analytical methods
involve the exclusion of genes without sufficient variation across a series of
arrays to constitute a contribution to the phenotype or variable of interest,
therefore any difference in the proportion of genes excluded on the basis of the
type of reference RNA used is important to note.
The number and type of genes with statistically significant differences in
expression between known histological subtypes in the dataset. A common
goal of microarray analysis is to identify genes that are differentially expressed
between two tumour subtypes. Analysis was carried out to determine if a
significant difference in the number of genes or represented ontologies identified
were attributable to the reference RNA used.
The performance of the total dataset as assessed with a sample predictive
analysis. A large gene expression dataset of samples representing different
tumour or histological types can be used as a reference for the correct
classification of unknown samples (Huang et al., 2003; Kan et al., 2004;
Ramaswamy et al., 2001; Shedden et al., 2003; Tothill et al., 2005). In this
section, a number of classification algorithms were trained to predict the
histological subtype of a set of tumours for which the subtype was unknown to
the classifier. The percentage accuracy generated by each algorithm was
compared to determine if either reference type conveyed a predictive advantage
for this type of investigation.
As well as bioinformatic comparisons, several practical issues are considered and taken
into account when evaluating the most appropriate reference RNA for a large-scale
tumour profiling study. These include the time and expense required to maintain and
extract RNA from eleven different cell lines, the longevity of the reference in terms of the
maximum amount able to be produced and its ability to be regenerated if the original
supply is exhausted, plus the significant advantage of being able to combine data sets
from disparate sources using the same reference material.
3.1.3. The impact of experimental replication on the robustness of cDNA microarray gene expression measurements
Due to a range of factors, including the relative expense of microarray analysis and often
the limited nature of the material being profiled, it is rarely feasible to replicate the
microarray analysis of every sample included in a large-scale tumour profiling study. It
has been widely demonstrated that cDNA microarray experiments are susceptible to a
range of technical biases, including variation attributable to individual array batches,
hybridisation method and the type of scanner used (Holloway et al., 2002; King and
Sinha, 2001; Ramdas et al., 2001a; Simon et al., 2003b; Wang et al., 2001; Yang et al.,
2001). As a common goal of tumour profiling studies is to identify a small number of
genes with robust differential expression between classes, it is important for the gene
expression values generated to reflect actual biology rather than technical or systematic
noise.
ScoreCard (GE Healthcare, USA) gene expression measurements from a series of quality-
control microarray hybridisations routinely carried out by the Peter Mac Microarray
facility were used to assess the impact of replication on data quality. The same batch of
commercially-supplied solution of synthetic genes used to spike into the Cy3 and Cy5-
labelled samples prior to hybridisation was used for all arrays in this analysis. In this
sense each hybridisation including this material can be regarded as a replicate
measurement for these genes. Therefore, it was assumed for the purposes of this
investigation, using an average calculated from multiple ScoreCard measurements was
the equivalent of replicate profiling of true biological specimens. As more replicates were
included in the average of a particular ScoreCard measurement, variation from its
theoretical expected value was analysed using ANOVA.
Another method of assessing the level of replication required to generate statistically
accurate cDNA expression data was the use of individual Genbank Accession numbers
from the same UniGene clusters. UniGene Cluster IDs represented with multiple
accession numbers in the 10.5k cDNA human clone set used to generate the Peter Mac
10.5k cDNA microarray were identified (UniGene Build number 160). The questions
posed by this analysis were (i) what proportion of UniGene IDs represented on the array
by multiple accession numbers have significant variation in the expression values of these
array features and (ii) does replication and averaging reduce this variation to a level
where the difference between accession numbers from the same UniGene cluster is no
longer statistically significant?
3.1.4. The impact of scanning hardware on cDNA microarray data quality
A key piece of equipment in any microarray experiment is the scanner, effectively the
bridge from the in vitro to in silico components of a microarray experiment. The primary
role of the scanner in two-colour microarray systems is to generate a high-resolution
digital image of the hybridised slide at two laser wavelengths selected to excite the
labelled probe bound to genetic material printed on the surface of the slide. Another
important function is to measure the amount of non-specific or background hybridisation.
Accurate measurement of this is important to calculate the proportion of a given gene’s
intensity reading due to random binding of labelled cDNA probes to the slide substrate or
non-specific probes.
Microarray scanners are commonly comprised of lasers used to excite fluorescently
labelled probes at two wavelengths and dual channel photomultiplier tubes (PMTs) for
signal detection. During the scanning process a series of filters, mirrors and lenses are
used to convert and digitise the electronic signal produced by laser excitation of the
printed region of the microarray. One image per dye is recorded for each microarray
scanned by either moving the slide or the laser itself (Lyng et al., 2004).
Newer model scanners such as the Agilent Microarray Scanner (Model BA, Product
#G2565BA) (Agilent Technologies, USA) feature a laser focusing technology referred to
as ‘dynamic focussing’ whereby the focal point of the laser detectors is continually
adjusted throughout the scan in response to minor variations in the substrate surface or
slide angle. Theoretically this is designed to reduce technical variation in the
measurement of fluorescence intensities that can be introduced by even minor variations
in glass thickness and topography. There is little in the way of reviewed literature
comparing the effectiveness of dynamic focussing on microarray data quality although
manufacturer’s data suggests that this feature substantially decreases the spatial
dependency of background hybridisation measurements. A series of microarray
hybridisations was scanned on two different microarray scanners, with and without this
dynamic focussing technology to determine the impact, if any, on the data produced.
Significant differences observed in the datasets produced by the scanners tested have
implications for the previous analysis of replication and are discussed later.
An important caveat to consider with this analysis is that a number of other potentially
confounding variables exist between the two microarray scanners used for this
comparison, including the age of the machines. The Agilent Microarray Scanner was less
than one year old at the time of this project whilst the Packard Scanarray model had been
in operation for approximately three years. As a result, it is possible that some depletion
in laser intensity may have occurred over time in the Packard scanner. However to
counteract this, the machine was regularly serviced and the quality of its output
monitored by staff of the Microarray Facility. (Sambrook and Bowtell, 2003)
Figure 3-1: Schematic diagram of two microarray slides being scanned with and without adaptive focussing. (A) Slide loaded into machine with slight tilt results in measurements being made at different focal points; (B) Slide on same angle results in adjustment of the focal point in the Agilent Microarray Scanner and measurements recorded with each feature in focus. (C) Slide scanned with variation in glass thickness resulting in feature intensities again being read at varying focal points. (D) The same slide scanned with the Agilent Microarray Scanner (Agilent Technologies, USA).
3.2. Results
3.2.1. Develop a method for measuring the degree of spatial bias present on a cDNA microarray
3.2.1.1. The importance of quantifying spatial-bias, application and calibration of Moods Median Test
Some arrays had significant areas where all features appeared to be up- or down-
regulated. As the distribution of features on the arrays was intentionally randomised with
respect to the biological properties of the associated genes, these spatial patterns were
most likely to be due to technical problems. A number of normalisation algorithms exist
for addressing the issue of spatial bias in cDNA microarray data (Colantuoni et al., 2002;
Schuchhardt et al., 2000; Yang et al., 2002b); however, no method for visualising and
quantifying the degree of this type of error has been described.
Spotted cDNA microarrays such as the Peter Mac 10.5k human array are printed by a
series of pins which results in a 4 x 6 sub-grid structure in the final product, each sub-grid
containing 19 rows and 24 columns of spotted cDNAs. It was hypothesised that for a
given cDNA array not affected by spatial bias, the difference between the median
expression ratio of each sub-grid should be small, relative to the same measure generated
on an array with a more extensive level of spatial bias. Based on this premise, a statistical
test was sought that determined the significance of observed differences in median
expression between array sub-grids.
Mood's Median Test (MMT) is a statistical test, implemented in the Minitab statistical
package (Minitab Inc, USA), designed to measure the significance of variation between
the median values of a series of ‘data blocks’. This test was implemented on each
array in the series using the sub-grid number to group the expression ratios into data
blocks. The graphical output of the test reflected the pattern of spatial bias and a chi-
square score provided a numerical measurement of the degree of variation between sub-
grid median values. This value is referred to as an MMT score from here on. This test is
non-parametric and being median-based is robust to outliers, frequently present in gene
expression data. These factors make the test suitable for microarray data where the
expression ratio of one individual gene theoretically has no impact on the expression of
an adjacent gene. Because of this, small numbers of highly upregulated features can be
present in regions of a microarray where the majority of other features are down-
regulated or not expressed at all, which would impact on the mean, but not median
expression ratio of an array sub-grid.
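The calculation behind an MMT score can be illustrated with a minimal sketch. It assumes the usual form of Mood's median test: each ratio is counted as above, or not above, the grand median of the whole array, and a chi-square statistic is computed from the resulting 2 x k table of counts, where k is the number of sub-grids. The function name `mmt_score` and the input layout are hypothetical, and the tie-handling convention (values equal to the median counted as "not above") is an assumption that may differ in detail from the Minitab implementation used in this work.

```python
from statistics import median


def mmt_score(ratios_by_subgrid):
    """Chi-square statistic of Mood's median test across array sub-grids.

    ratios_by_subgrid: dict mapping sub-grid number -> list of expression ratios.
    """
    all_values = [v for values in ratios_by_subgrid.values() for v in values]
    grand_median = median(all_values)
    total = len(all_values)
    total_above = sum(1 for v in all_values if v > grand_median)
    total_not_above = total - total_above

    chi_square = 0.0
    for values in ratios_by_subgrid.values():
        n = len(values)
        above = sum(1 for v in values if v > grand_median)
        observed = (above, n - above)
        # Expected counts if this sub-grid shared the array-wide median.
        expected = (n * total_above / total, n * total_not_above / total)
        for o, e in zip(observed, expected):
            if e > 0:
                chi_square += (o - e) ** 2 / e
    return chi_square


# Two illustrative sub-grids: identical distributions give a score of zero,
# completely separated distributions give a large score.
unbiased = {1: [0.9, 1.0, 1.1, 1.2], 2: [0.9, 1.0, 1.1, 1.2]}
biased = {1: [0.5, 0.6, 0.7, 0.8], 2: [1.5, 1.6, 1.7, 1.8]}
print(mmt_score(unbiased), mmt_score(biased))
```

Because only counts above and below the median enter the statistic, a handful of extreme ratios within a sub-grid changes the score very little, which is the robustness property described above.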
For all Peter Mac cDNA microarrays tested with Mood's Median Test, the p-value for
variation in sub-grid median expression ratio was below 0.001, even when no visible
spatial bias was present, indicating that a statistically significant difference existed between
at least two sub-grids. Examination of the graphical output of the large number of tests
carried out using expression data from several studies being carried out through the Peter
Mac Microarray Facility (e.g. Figure 3-2) showed that even on those arrays without
apparent spatial bias, as judged by visual inspection of virtual array images, the 95%
confidence intervals for each sub-grid median expression ratio differed sufficiently to result
in a p-value of less than 0.001. However, after applying the test to a large number of
arrays, it was determined that the Mood's Median Test chi-square values for the series of
arrays analysed correlated positively with the extent of spatial bias that was
visible to the naked eye.
By applying this test to a large number of arrays and relating both the MMT score and
confidence-interval diagrams to the level of spatial bias observed in virtual-array
representations of the data, the MMT chi-square value was calibrated to determine an
appropriate cut-off value for determining when the level of spatial bias present warrants
correction by spatially-dependent normalisation algorithms as previously described. In
the case where the MMT chi-square value exceeded 1000, it was found that the use
of such algorithms did not result in a sufficient MMT reduction and the array was
excluded from further analysis. Arrays that generated an MMT score of below 200 had no
visible spatial bias on inspection of their virtual array representation, therefore this was
determined to be the threshold for an acceptable level of spatial bias. Examples of the test
applied to arrays with and without spatial bias are shown in Figure 3-2.
Another method to calibrate an acceptable MMT score for a particular microarray would
be the use of a box plot to visualise the range of MMT values present in a series of
hybridisations belonging to the one experiment, such as those generated in the statistical
package Minitab (Minitab Inc, 2002). This type of graph highlights any individual data
points (MMT scores) that are sufficiently larger than the main distribution as outliers. It is
an effective method for identifying any sub-standard arrays in a large series based on the
overall distribution of MMT scores in the entire dataset and the MMT scores
corresponding to these arrays can be used to determine the threshold of acceptable spatial
bias.
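The box-plot approach can be sketched as follows. The example assumes the common box-plot convention in which a point is flagged as an outlier when it lies more than 1.5 times the interquartile range above the third quartile; the exact quartile method used by Minitab may differ. The function name and all scores except 1224 (the value reported for array AB016 later in this chapter) are hypothetical.

```python
from statistics import quantiles


def high_mmt_outliers(scores, whisker=1.5):
    """Return the MMT scores lying above the upper box-plot fence."""
    q1, _, q3 = quantiles(scores, n=4, method="inclusive")
    upper_fence = q3 + whisker * (q3 - q1)
    return [s for s in scores if s > upper_fence]


# Hypothetical series of MMT scores from one experiment, plus one extreme array.
scores = [90, 100, 110, 120, 130, 140, 150, 1224]
print(high_mmt_outliers(scores))
```

The lowest flagged score in a series gives a data-driven candidate for the acceptance threshold, which can then be compared with the fixed cut-off of 200 derived by visual calibration.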
Figure 3-2: (A) Example of MMT output for a microarray slide with low spatial bias as observed with a virtual array diagram (right). The MMT score for this array is 107, as reflected by the small range of sub-grid confidence intervals (i.e. Scale 0.6 – 1.14). (B) Example of a second array with a high degree of spatial bias observed upon inspection of the virtual array figure on right. The range of sub-grid median confidence intervals is far greater (Cy5:Cy3 ratio 0.7 – 1.6) and the corresponding chi square value is 2976, reflecting the substantially greater degree of bias present in this array.
Mood median test for microarray UP127 Chi-Square = 106.70 P = 0.000
(v) Annotates each ratio with a number identifying the array sub-grid to which it
belongs.
(vi) Copies the following columns to a new spreadsheet: Feature identifier
(generally the Genbank Accession number), Subgrid identifier, background-
subtracted expression ratio
(vii) Closes the raw data file
(viii) Opens the next file in directory and repeats steps (iv) – (vii).
(ix) Saves a compiled summary spreadsheet in the directory created in step (ii).
This summary spreadsheet can then be directly opened in Minitab and the MMT score for
each microarray profile determined rapidly from the now correctly formatted expression
data.
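Steps (v) and (vi) can be illustrated outside Excel with a short sketch. It is not the macro itself: the input layout (array-wide row and column indices for each spot), the function names and the accession numbers are hypothetical, and the arrangement of the 24 sub-grids as 4 across and 6 down is an assumption. Only the sub-grid dimensions of 19 rows by 24 columns are taken from the text.

```python
SUBGRID_ROWS, SUBGRID_COLS = 19, 24  # spots per sub-grid (from the text)
GRID_COLS = 4                        # assumed: sub-grids laid out 4 across, 6 down


def subgrid_number(spot_row, spot_col):
    """1-based sub-grid number for a spot, given 0-based array-wide row and column."""
    grid_row = spot_row // SUBGRID_ROWS
    grid_col = spot_col // SUBGRID_COLS
    return grid_row * GRID_COLS + grid_col + 1


def compile_summary(arrays):
    """Build summary rows: (array name, feature identifier, sub-grid, ratio).

    arrays: dict of array name -> list of (accession, spot_row, spot_col, ratio).
    """
    summary = []
    for array_name, features in arrays.items():
        for accession, spot_row, spot_col, ratio in features:
            summary.append(
                (array_name, accession, subgrid_number(spot_row, spot_col), ratio)
            )
    return summary


# Hypothetical features on one array.
rows = compile_summary({"UP127": [("AA000001", 0, 0, 1.2), ("AA000002", 19, 24, 0.8)]})
print(rows)
```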
Figure 3-3: Custom Excel Macro worksheet interface created to automate several aspects of microarray data manipulation. A number of functions are available, including the ability to prepare multiple arrays for SNOMAD normalisation (#2) and converting expression ratios from log to linear scale (#4).
3.2.1.3. Using Moods Median Test to evaluate the impact of image analysis software on cDNA microarray spatial bias
The MMT test described in this chapter can also be used to assess the impact of other
variables in the microarray work flow that have the potential to impact on spatial
distribution of data. These include variation in methods of array hybridisation or the type of
software used for image analysis.
During the period of this project, developments in image analysis software were made,
including methods for dynamically adjusting the circumference of the region used to
identify an individual array feature (Kim et al., 2001; Korn et al., 2004a). Minor
variations in the amount of material spotted onto the slide are a common feature of cDNA
arrays, and can be caused by slight variations in the physical dimensions of the spotting
pins used or the length of time each pin is in contact with the slide. Because of this, even
the most carefully controlled array platform will produce arrays with a range of feature
diameters. An image analysis tool that is not capable of responding to these variations, by
having a spot detection window of fixed size, is likely to include areas of background in
the quantification of feature intensity. This has the potential to impact on the final
intensity measurement due to the inclusion of pixels actually corresponding to the glass
slide and not the spotted probe.
To investigate the impact of image analysis software on spatial bias, a series of cDNA
microarray images was analysed using two different programs: GenePix (Axon, USA),
which can dynamically adjust the size of the circular window used to identify the
hybridised array feature from the surrounding background area, and Quantarray (Packard
Bioscience, USA), which maintains a fixed spot size for the entire array. Both packages
use the histogram method for determining intensity threshold levels (Ahmed et al., 2004)
and also allow the spots to be moved in any direction, individually or in blocks, to cater
for some misalignment in the printing process.
MMT scores were calculated for the series of 45 cDNA microarrays analysed in duplicate
using these two packages to determine if the distribution and magnitude of spatial bias
differed according to the image analysis software used. A box-plot of the MMT scores
was used to compare the series of MMT scores and analysis of variance carried out to
determine if the difference was statistically significant.
As shown in Figure 3-4, the data resulting from the GenePix analysis exhibited an overall
lower and less varied degree of spatial bias. The mean MMT score for the Quantarray-
analysed images was 169.8 compared to 111.9 for the same array images analysed with
GenePix, representing a reduction in spatial bias of 34% (p=0.006).
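The quoted reduction follows directly from the two mean MMT scores, as the short check below shows. Only the two means are taken from the text; the variable names are illustrative.

```python
quantarray_mean = 169.8  # mean MMT score, Quantarray-analysed images (from the text)
genepix_mean = 111.9     # mean MMT score, GenePix-analysed images (from the text)

# Relative reduction in mean MMT score, expressed as a percentage.
reduction_percent = (quantarray_mean - genepix_mean) / quantarray_mean * 100
print(round(reduction_percent))
```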
This analysis demonstrates the utility of MMT scores for evaluating advances in
microarray technology and their impact on data quality, specifically a reduction in
systematic spatial bias. It also describes the benefit of image analysis algorithms that
respond to varying feature diameters, a characteristic of this type of microarray.
Figure 3-4: MMT scores from 45 EOC microarray profiles analysed using (i) Quantarray and (ii) GenePix. Data generated using Quantarray had on average a higher and more varied range of MMT scores than data from the same microarray images analysed with GenePix. p-value for difference between mean values (indicated by red dots) = 0.006
Figure 3-5: Box plot of MMT scores from Gastric Cancer Prediction of Recurrent Disease dataset (described in section 3.2.1.4). Four arrays are indicated as outliers by this method, with one an extreme outlier, appearing at the very top of the figure. This particular array produced a MMT score of 1224 and was subsequently removed from the dataset after normalisation failed to reduce its MMT score to the level determined as acceptable for this microarray platform (200).
[Figure 3-4 plot labels: y-axis, MMT score (0 to 500); groups, MMT scores from Quantarray analysed array images (MMT - QA) and MMT scores from GenePix analysed array images (MMT - GP).]
3.2.1.4. Use of MMT to identify cDNA microarrays with extreme spatial bias and the impact of their exclusion on sample dataset
In order to assess the impact of a reduction in spatial bias on the bioinformatic
performance of a sample dataset, a comparison of cross-validation accuracies obtained for
a tumour classification problem was carried out. In a parallel study being carried out by
Dr Alex Boussioutas using the same Peter Mac 10.5k microarray platform (Boussioutas
et al., 2003), specimens of gastric cancer were profiled to identify a gene expression
signature capable of predicting the likelihood of disease recurrence following initial
treatment. Two versions of the dataset were created, one normalised with intensity-
dependent lowess and the other using a spatially dependent normalisation method
(SNOMAD). MMT scores for the spatially-normalised dataset were calculated, as shown
in Appendix D. A box plot was used to visually analyse the distribution of MMT scores
across the dataset and identify any outliers, shown in Figure 3-5.
Array AB016 produced an MMT score of 1,224 and is represented by the asterisk at the
top of Figure 3-5. Following the identification of this array with extensive spatial bias, a
third version of the dataset was created without array AB016 in order to determine whether
excluding arrays with extremely high MMT scores (even after spatially-dependent
normalisation) affected the cross-validated accuracy obtained when the dataset is
used for a sample classification task.
For each of the three data set versions, the optimal number of genes required to predict
the recurrent-disease status of each sample with the highest leave-one-out cross validation
accuracy was then determined. This was achieved using GeneCluster version 1.0 from the
Broad Institute (USA) (Reich et al., 2004). This method is based on a recursive signal-to-
noise approach for identifying genes of interest and a kNN method of class prediction, as
described in Material & Methods section 2.4.8.4. GeneCluster allows the researcher to
specify a range of gene numbers to search (e.g. 2 – 1,000) and calculates the classification
accuracy of each combination of genes within this range. Because of computing
limitations, the number of genes used in each iteration increases by a factor of two. The
cross validation accuracy of each dataset was also recorded along with the number of
genes used by the kNN algorithm to ascertain if a reduction in spatial bias also impacted
on the optimal number of genes required to perform the classifications.
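The search procedure described above can be sketched in outline. This is not the GeneCluster implementation: it is a simplified two-class version in which genes are ranked by the absolute signal-to-noise score (difference in class means divided by the sum of class standard deviations), gene selection is repeated inside each leave-one-out fold, and the number of genes tested doubles at each step. The function names, the choice of k = 3, the Euclidean distance and the toy data are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def signal_to_noise(values_a, values_b):
    """(mean_a - mean_b) / (sd_a + sd_b); 0 when both classes are constant."""
    spread = pstdev(values_a) + pstdev(values_b)
    return (mean(values_a) - mean(values_b)) / spread if spread else 0.0


def top_genes(samples, labels, n_genes):
    """Indices of the n_genes with the largest absolute signal-to-noise score."""
    class_a, class_b = sorted(set(labels))
    scores = []
    for g in range(len(samples[0])):
        a = [s[g] for s, lab in zip(samples, labels) if lab == class_a]
        b = [s[g] for s, lab in zip(samples, labels) if lab == class_b]
        scores.append((abs(signal_to_noise(a, b)), g))
    return [g for _, g in sorted(scores, reverse=True)[:n_genes]]


def knn_predict(train, train_labels, test, genes, k=3):
    """Majority vote among the k nearest training samples (squared Euclidean)."""
    dists = sorted(
        (sum((t[g] - test[g]) ** 2 for g in genes), lab)
        for t, lab in zip(train, train_labels)
    )
    votes = [lab for _, lab in dists[:k]]
    return max(set(votes), key=votes.count)


def loocv_accuracy(samples, labels, n_genes, k=3):
    """Leave-one-out accuracy, re-selecting genes without the held-out sample."""
    correct = 0
    for i in range(len(samples)):
        train = samples[:i] + samples[i + 1:]
        train_labels = labels[:i] + labels[i + 1:]
        genes = top_genes(train, train_labels, n_genes)
        correct += knn_predict(train, train_labels, samples[i], genes, k) == labels[i]
    return correct / len(samples)


def best_gene_number(samples, labels, max_genes):
    """Test 2, 4, 8, ... genes; prefer the smallest number among the most accurate."""
    results = {}
    n = 2
    while n <= max_genes:
        results[n] = loocv_accuracy(samples, labels, n)
        n *= 2
    return max(results, key=lambda m: (results[m], -m)), results


# Toy data: six samples, four genes; only the first gene separates the classes.
samples = [[5.0, 1.0, 0.2, 0.9], [5.2, 0.8, 0.1, 1.1], [4.9, 1.1, 0.3, 1.0],
           [1.0, 0.9, 0.2, 1.0], [1.1, 1.0, 0.1, 0.9], [0.9, 1.2, 0.3, 1.1]]
labels = ["recurrent"] * 3 + ["non-recurrent"] * 3
print(best_gene_number(samples, labels, 4))
```

Because only powers of two are tested, the reported optimum is the best of a coarse grid rather than an exact minimum, which is relevant when comparing gene numbers between dataset versions.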
The results of this analysis are shown in Figure 3-6. In summary, it appears that as spatial
bias is reduced in the three versions of this dataset, the number of genes required for
optimal classification also decreases, while the corresponding accuracy of the predictions
(as assessed by leave-one-out cross validation) increases. This observation supports the
hypothesis that reducing the level of technical bias present in a cDNA microarray dataset,
in this case spatial bias, results in more accurate classifications.
The highest cross-validation accuracy was obtained using the spatially-normalised dataset
minus one array exhibiting extremely high spatial bias (87% correct). This was followed
by the complete spatially-normalised dataset (78% correct) and finally the dataset
normalised with intensity-dependent normalisation (73% correct). Following an inverse
trend to that of the classification accuracies, the number of genes selected by the
algorithm decreased from 1024 for the intensity-dependent lowess data to 256 for the
complete spatially-normalised data and to only 8 genes for optimal classification of
the spatially-normalised dataset without array AB016. This large reduction in gene
number is amplified by the 2-fold increasing method of gene selection and evaluation
implemented in GeneCluster. This observation suggests that by reducing the level of
spatial bias present in a cDNA microarray dataset, by removing unacceptably biased
arrays and then minimising the spatial bias in those remaining, predictive algorithms such
as kNN are able to perform with a higher degree of accuracy using a smaller number of
gene expression data points.
The substantial increase in cross-validation accuracy observed in this study, attributable
to the exclusion of only one array from the dataset, also reflects the impact of sample size
on cDNA datasets of this size. This phenomenon has been recently described by Ein-Dor
et al (Ein-Dor et al., 2005), who noted that the genes selected for prediction of the probability
of breast cancer patients developing metastatic disease are highly contingent on the subset
of patients used for gene selection processes.
Figure 3-6: Sample machine learning based predictive analysis to demonstrate impact of spatial bias on dataset performance. (A) Percent of gastric cancer specimens assigned to the correct disease recurrence category on the basis of leave-one-out cross-validation following lowess normalisation, spatially-dependent normalisation and then exclusion of a single array with an extremely high MMT score. (B) The optimal number of genes (tested in doubling increments) selected by a kNN classification algorithm resulting in the corresponding prediction accuracy shown in (A). Genes were selected based on signal-to-noise ranking.
[Figure 3-6 plot labels: (A) y-axis, Percentage of samples correctly classified (65 to 90); (B) y-axis, No. genes (log scale, 1 to 10000); x-axis categories in both panels: Intensity-dependent Lowess, SNOMAD, SNOMAD - high MMT sample.]
3.2.2. Evaluation of reference RNA options for a large-scale tumour profiling study
3.2.2.1. Selection of samples and creation of a dataset to compare the performance of the two reference RNA options
To create a dataset for comparison of these reference options, RNA was extracted from
95 tissue samples obtained from patients diagnosed with EOC. The histological subtypes
of these cases are summarised in Figure 3-7. Samples were selected to resemble the
frequency of the EOC histological subtypes observed in the general population as closely
as possible (Ries LAG, 2004). The methods for RNA extraction and labelling are
described in Materials and Methods section 2.3.2.1. Specimen RNA was reverse
transcribed incorporating the Cy5 fluorescing dye and hybridised to cDNA microarrays in
duplicate by Sophie Katsabanis, as described in Materials and Methods section 2.3.4 and
2.3.5. The first series of arrays was hybridised against the Cy3 labelled pooled tumour
reference and the second series using the Cy3 labelled pooled cell line material.
In total, RNA from 95 specimens of EOC was hybridised to 10.5k cDNA microarrays
using the two types of reference RNA as described. Image analysis and data extraction
were carried out using GenePix, as per Materials and Methods section 2.4.2.
Figure 3-7: Summary of histological subtypes. cDNA gene expression dataset used to compare the use of either a universal pooled cell line or project-specific pooled tumour reference RNA. N = 95. The serous and mucinous endometrioid subtypes of EOC represent 73% of the samples in the dataset.
3.2.2.2. Hierarchical clustering of EOC datasets
Hierarchical clustering was used to visualise the natural grouping of samples on the basis
of their expression profiles. Initially an unsupervised method was used whereby all genes
identified as reliably hybridised were used in the clustering process. Genes not varying
significantly according to the Log-ratio method from baseline expression (ratio = 1.0) in
20% of samples were excluded and the remaining data used for unsupervised
hierarchical clustering. The dendrograms resulting from the hierarchical clustering
analyses are shown in Figure 3-8.
From these clustering figures it can be observed that the first and most divergent branch
points in the dendrogram structure correspond to the mucinous and serous histological
subtypes present in this cohort of patients. The endometrioid subtype does not appear to
form a discrete cluster with either reference RNA type or gene set. It is important to note
however, clustering is not used to predict or classify samples in this study, rather as a
method of visualising the natural groupings in the data available (Simon et al., 2003b).
Overall, both reference types produced expression data that yielded
biologically-driven hierarchical clustering results, even without supervised gene
selection. One notable difference between clustering associated with the reference RNA
used is the small number of serous tumours that are grouped in the predominantly
mucinous branch of the cell-line reference cluster (Figure 3-8 (B)). One of these four cases is
also grouped with the mucinous tumours in the pooled-tumour cluster; however, the other
three are positioned in the main serous branch.
This observation suggests that the dataset created using the project-specific pooled
tumour RNA may allow the known histological subtypes to be detected with greater
resolution than with the use of the universal 11 cell line reference, particularly with
samples that are difficult to classify. It is important to emphasise, however, that
hierarchical clustering is not a recommended method for classification of tumour
samples, but is an effective method for visualising complex gene expression
data and its relationship to known clinical variables including histological subtype (Simon
et al., 2003b).
Figure 3-8: Dendrogram structure of unsupervised hierarchically clustered (A) pooled tumour-reference arrays and (B) pooled-cell line reference arrays. Both patient IDs and histological subtype information is given to allow identification of individual specimens within the figure. The separation of mucinous and serous tumours is the dominant pattern observed, with the majority of the endometrioid samples being clustered in the mucinous branch.
[Figure 3-8 panels A and B; legend: Serous, Mucinous, Endometrioid.]
3.2.2.3. Differences in the distribution of expression data based on reference RNA type.
In order to determine if either the pooled 11 cell line or tumour RNA resulted in a greater
dynamic range of gene expression measurements, the mean expression ratio for each gene
detectably hybridised was calculated for the two datasets. Intensity-dependent
normalisation was used. The mean normalised values were analysed with a test for equal
variances using Minitab and visualised with a histogram of intensities, shown in Figure
3-9. The range of mean expression ratios was significantly wider in the data generated
from the use of the 11 cell line reference (p<0.001).
The wider range of gene expression ratios resulting from the hybridisation with the cell
line reference RNA is a reflection of the relative molecular differences between the
specimens being profiled and the composition of the reference RNA. In a hybridisation
against the pooled tumour RNA, each gene in the specimen of interest is quantified
relative to the level of that gene in a pool of other primary EOCs. When the same
specimen is profiled against the 11 cell line reference RNA, each gene measurement
made is relative to the expression of that gene in a much broader range of cancer types,
resulting in a larger expression ratio. As well as representing 11 unique cell lines, this
reference RNA also contains the molecular differences known to exist between cultured
cell lines and actual human tissue, further widening the difference in molecular profiles of
the two specimens being competitively hybridised to the microarray.
Figure 3-9: Histogram of mean normalised gene expression measurements for a series of microarray hybridisations using either a pooled cell line (universal) or pooled tumour (project-specific) reference RNA. A test for equal variances revealed the range of gene expression ratios in the dataset generated with the 11 cell line reference was significantly different (i.e. wider) than for the same dataset constructed using the pooled tumour reference RNA (p<0.001).
[Figure 3-9 axes. x-axis: mean expression ratio (log2), -5 to 4; y-axis: frequency, 0 to 1500. Series: pooled tumour reference, 11 cell line reference.]
3.2.2.4. Comparison of the proportion of the microarray clone set represented by each reference type
After background subtraction, gene expression ratios are created by dividing the
intensity of hybridisation in the test (sample) channel by that of the reference channel;
therefore the degree of differential expression observed for a particular gene is relative to
the abundance of that gene in the reference RNA used. As a result, an important
measure of reference RNA performance is the proportion of the total microarray for
which valid expression ratios are generated.
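The ratio calculation described above can be sketched as follows. This is a minimal illustration only, not the software used in this study; the function name and the choice to return None when either background-subtracted intensity is not positive are assumptions made for the example.

```python
import math


def log2_ratio(test_fg, test_bg, ref_fg, ref_bg):
    """Background-subtract both channels and return log2(test / reference).

    Returns None when either background-subtracted intensity is not
    positive, because no valid ratio can be formed in that case.
    """
    test = test_fg - test_bg
    ref = ref_fg - ref_bg
    if test <= 0 or ref <= 0:
        return None
    return math.log2(test / ref)


# Test channel 2000 above background, reference channel 500 above background.
print(log2_ratio(2100, 100, 600, 100))  # 2.0, i.e. a four-fold ratio
```

A feature that is weak or absent in the reference channel yields no valid ratio, which is why the proportion of the array covered by the reference matters.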
Using the BRB ArrayTools microarray analysis package (Biometric Research Branch,
National Institutes of Health, USA) the proportion of array features with (i) minimum
absolute expression of 300 in the reference Cy3-labelled channel and (ii) fewer than 20%
missing values across all arrays was determined for each individual array, in both
datasets. The minimum intensity level of 300 has been determined by the Peter Mac
Microarray Facility as an appropriate intensity cut-off, below which measurements are
unreliable. The number of features passing these criteria was tallied for
each of the 98 arrays and a one-way ANOVA used to determine if a significant difference
existed between these proportions of microarray coverage.
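The two filtering criteria can be illustrated with a short sketch. It is not the BRB ArrayTools implementation; it assumes that a reference-channel measurement below the intensity cut-off is treated as a missing value, and all names are hypothetical.

```python
def is_valid(value, min_intensity=300.0):
    """A measurement is usable if it is present and at or above the cut-off."""
    return value is not None and value >= min_intensity


def filter_features(ref_intensity, min_intensity=300.0, max_missing=0.20):
    """Return features with fewer than max_missing unusable measurements.

    ref_intensity maps a feature name to its reference-channel intensities,
    one per array (None marks a missing spot).
    """
    kept = []
    for feature, values in ref_intensity.items():
        n_missing = sum(1 for v in values if not is_valid(v, min_intensity))
        if n_missing / len(values) < max_missing:
            kept.append(feature)
    return kept


example = {
    "g1": [500, 800, 650, 700, 900, 1200],  # always detected
    "g2": [120, None, 400, 90, 310, 150],   # mostly below the cut-off
    "g3": [350, 290, 600, 720, 810, 505],   # one of six below the cut-off
}
print(filter_features(example))  # ['g1', 'g3']
```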
The mean number of array features successfully bound by the Cy3 labelled tumour
reference (9,056) was marginally higher (a difference of less than 1% of the total array) than for the pooled cell
line reference (9,018). ANOVA of this difference reveals it is not significant (p=0.499),
as indicated by the box plots shown in Figure 3-10. Therefore both types of reference
RNA used in this analysis hybridise to the same proportion of the microarray. It is
important to note that the proportion of array features bound is highly dependent on array
quality, RNA-labelling and hybridisation efficiency, all variables that are difficult to
quantify. However, by using a large number of arrays such as in this comparison the
impact of these variables is reduced and the conclusion that there is no significant
difference in array coverage between these types of reference RNA can be drawn.
Figure 3-10: Box plot of mean proportion of array features successfully bound by each reference type. Over 90% of array features were identified by both reference types in the dataset analysed. The difference between means was not statistically significant.
[Figure 3-10 axes. y-axis: proportion of total array detectably hybridised, 0.6 to 1.0; x-axis: pooled cell-line reference, pooled tumour reference.]
3.2.2.5. Analysis of genes uniquely detected by either reference type
Although no difference in the proportion of features on the microarray detectably bound
by the reference types being investigated was observed, 241 genes (2.6%) were bound by
the tumour reference that were not detected by the cell line material, and a further 213
(2.4%) vice versa.
In order to characterise these genes differentially represented between the two reference
types, gene ontology analysis was carried out using the EASE method (Materials and
Methods section 2.4.9.1) (Hosack et al., 2003). This method determines the biological
functions, molecular processes and cellular components that are significantly represented
by a list of genes, relative to the composition of the particular microarray platform used.
The ontologies identified as statistically significant for these two lists of genes are shown
in Table 3-1 (tumour reference) and Table 3-2 (cell line reference).
This analysis revealed that genes known to be involved in cell-cell communication (e.g.
WNT1, FGF14, TNFRSF2) and genes structurally involved in membranes and extracellular
features (CCR5, MUC4, NLGN1) were identified by the tumour reference and not by the
cell line reference. It is possible these genes were not expressed in the cell line reference
RNA because the in vitro culture conditions in which the cells are grown
differ substantially from those present in the human body. The reference generated by
extraction of RNA from whole tumour contains a significantly higher proportion of genes
involved in the tumour's interactions with its microenvironment. No cellular component
ontologies were significantly identified in the list of genes unique to the cell-line
reference, further supporting this hypothesis. This difference may be an important
consideration if the goal of a microarray study is to explore these interactions.
Genes uniquely detected by the use of a cell line reference RNA include those involved in
morphogenesis and the detection and response to external and mechanical stimuli (e.g.
OPHN1, ALDH7A1). This again most likely reflects the growth conditions of these cells,
whereby they are dislodged from their substrates and transferred between different tissue
culture flasks, where most resettle resulting in a change in cell morphology in response to
contact with each other and the flask itself.
Table 3-1: Significantly represented gene ontologies in the list of 241 genes uniquely detected by the pooled tumour reference RNA
Biological process ontologies                              EASE score
Sexual reproduction                                        0.00861
Cell surface receptor linked signal transduction           0.0147
Neurogenesis                                               0.0228
G-protein coupled receptor protein signalling pathway      0.0248
Ion transport                                              0.028
Organogenesis                                              0.0295
Cell communication                                         0.0396
Organismal physiological process                           0.0462

Cellular component ontologies                              EASE score
Integral to plasma membrane                                0.00139
Plasma membrane                                            0.00471
Cell fraction                                              0.0158
Extracellular                                              0.0163
Integral to membrane                                       0.0209

Molecular function ontologies                              EASE score
Signal transducer activity                                 4.5 x 10^-6
Receptor binding                                           4.17 x 10^-5
Cytoskeleton protein binding                               0.00467
Cation channel activity                                    0.0134
Cytokine activity                                          0.0144
Rhodopsin-like receptor activity                           0.0163
G-protein coupled receptor activity                        0.0239
Growth factor activity                                     0.0492
Table 3-2: Significantly represented gene ontologies in the list of 213 genes uniquely hybridised by the pooled cell line reference RNA
Biological process ontologies                              EASE score
Organogenesis                                              0.00897
Ectoderm development                                       0.011
Detection of abiotic stimulus                              0.0117
Development                                                0.0181
Morphogenesis                                              0.0184
Sensory perception of mechanical stimulus                  0.0217
Detection of mechanical stimulus                           0.0236
Skeletal development                                       0.0292
Histogenesis                                               0.03
Detection of external stimulus                             0.039
Regulation of transcription from Pol II promoter           0.039
Purine ribonucleotide biosynthesis                         0.0487

Molecular function ontologies                              EASE score
Monooxygenase activity                                     0.02
Nucleic acid binding                                       0.0457
3.2.2.6. Comparison of the proportion of the microarray set identified as differentially expressed relative to baseline expression
Two methods were used to compare the proportion of differentially expressed genes
present in a given proportion of arrays from both datasets: (i) a minimum fold change of
1.5-fold or greater in at least 20% of samples and (ii) log-expression variation, whereby
the variance of the log-ratios for each gene is compared to the median of all the variances.
Genes whose variance is not significantly greater than that of the median gene are filtered
out.
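The fold-change criterion in (i) can be sketched as below. This is an illustrative reimplementation, not the filter as implemented in BRB ArrayTools; it assumes log2 expression ratios as input, ignores missing values when computing the fraction of samples, and uses hypothetical names throughout.

```python
import math


def fold_change_filter(log2_ratios, min_fold=1.5, min_fraction=0.20):
    """Keep genes changed by at least min_fold in at least min_fraction of samples.

    log2_ratios maps a gene name to its log2 expression ratios, one per
    array (None marks a missing value). Up- and down-regulation both count.
    """
    threshold = math.log2(min_fold)
    kept = []
    for gene, ratios in log2_ratios.items():
        observed = [r for r in ratios if r is not None]
        if not observed:
            continue
        n_changed = sum(1 for r in observed if abs(r) >= threshold)
        if n_changed / len(observed) >= min_fraction:
            kept.append(gene)
    return kept


example = {
    "a": [0.1, -0.2, 0.9, 0.0, 0.3],  # one of five samples exceeds 1.5-fold
    "b": [0.1, 0.2, -0.1, 0.0, 0.3],  # never exceeds 1.5-fold
}
print(fold_change_filter(example))  # ['a']
```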
The decision to set the cut-off for differential expression relative to the reference at 1.5
fold was based on this level representing a 50% increase or decrease of relative
expression and also on published work in which a fold change of 1.4 was shown to be the
minimum gene expression ratio fold change that can be accurately detected by cDNA
microarrays (Yue et al., 2001).
Intensity-dependent lowess normalisation (Cleveland, 1979) was carried out prior to
applying the filter for differentially expressed genes to reduce bias in the data due to
factors such as variation in the efficiency of Cy-dye incorporation. The same filtering as
described in section 3.2.2.4 was used to screen out those features with unreliably low
intensity measurements and/or failing to produce valid expression ratios in over 20% of
the total number of samples.
Using the fold-change method of identifying differential expression, 2,956 genes from the
cell line reference data and 2,259 genes from the tumour reference set were identified as
differentially expressed. A statistical test of proportions indicated this difference is highly
significant (p<0.001). Therefore at the 1.5-fold level of differential expression, use of the
cell-line based reference yields a significantly larger proportion of differentially
expressed genes, relative to the expression of each gene in the reference channel. This
result reflects the greater dynamic range of mean expression ratios present in the cell line
reference dataset compared to the tumour reference data (section 3.2.2.3).
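A test of two proportions of this kind can be sketched with a pooled two-sample z-test. The exact test and denominator used for the comparison above are not stated here, so the sketch is an assumption on both counts: the denominator of 9,995 is borrowed from the number of genes quoted in section 3.2.3.6 purely for illustration, and the function name is hypothetical.

```python
import math


def two_proportion_z_test(x1, n1, x2, n2):
    """Pooled two-sample z-test for equality of two proportions.

    Returns the z statistic and the two-sided p-value from the standard
    normal distribution.
    """
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Assumed denominator of 9,995 features per dataset (illustrative only).
z, p = two_proportion_z_test(2956, 9995, 2259, 9995)
print(p < 0.001)  # True
```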
A more sophisticated method of identifying differential gene expression, whilst taking
into account sources of variation such as low overall intensity, is the use of t-tests to
compare the variation of a given gene versus the median variation of all genes on the
same array. This method was developed after it became apparent that fold change cut-offs
can vary in their biological relevance depending on various technical factors that impact
on the dynamic range of data points on a given microarray, although both methods are
still in use (Cui and Churchill, 2003; Yang et al., 2002a).
Using a log-expression variation filter with a p-value cut-off of 0.001, 4,092 features from
the tumour reference arrays and 4,026 features from the cell-line reference arrays were
identified as differentially expressed. This difference was not statistically significant
(p=0.356) as determined by a t-test of proportions. This indicates that, unlike the result
obtained by the fold-change method of gene selection, no difference exists between the
proportions of genes identified as differentially expressed based on log-expression
variation selection.
It is to be expected that a filtering method based on the comparison of each individual
gene's variation across the dataset to the median degree of variation will result in similar
numbers of genes detected between reference types. This is because the median variation
is calculated for each dataset; therefore the actual fold change that correlates to the same
p-value can vary substantially. This method is generally accepted as being the most
robust approach to filtering out genes that are not contributing to the biological question
being investigated as it is not dependent on setting a fixed ratio threshold which may not
always correspond to the same increase or decrease in gene copy number (Cui and
Churchill, 2003).
Further analyses of microarray data in this chapter will use the log-expression filtered
gene lists generated from the reference comparison datasets.
3.2.2.7. Analysis of differentially expressed genes between histological subtypes
Histological subtype information was available for 82 of the profiled EOC specimens, as
summarised in Figure 3-7. As different subtypes of ovarian cancer have markedly
different clinical outcomes (Pieretti et al., 2002), the genes and molecular pathways
underlying these differences are of interest. The ability of each reference type to identify
differentially expressed genes between these categories was tested.
Using the Significance Analysis of Microarrays (SAM) method for identification of
differentially expressed genes (Tusher et al., 2001), the largest three subtypes of ovarian
cancer in the available data set were analysed. This method of gene selection assigns a
score to each gene based on its change in expression relative to the standard deviation of
repeated measurements. For those genes with a score over a predetermined threshold,
permutation testing is then carried out to estimate the percentage of the gene list
identified by chance (i.e. the false discovery rate (FDR)) and then a gene list is created based
on the FDR computed from the permutation analysis. This method allows for the
possibility of dependent measurements in the dataset, i.e. genes whose expression levels
are dependent on the level of other genes in the same sample.
SAM identified 1,737 genes as being differentially expressed between
histological subtypes in the cell line reference dataset, compared to 1,287 from the pooled
tumour reference data. The difference in the proportion of differentially
expressed genes was significant (p<0.001), indicating that significantly more genes were identified by the
use of cell line reference RNA. 862 genes were selected by SAM analysis in both
datasets. This overlap is highly significant (p<0.001, representation factor =
4.2), as calculated by the online tool provided by the Kim Laboratory, Stanford
University (CA, USA) (Lund, 2003). Of
the top 100 genes in each list, 99 are identical; the top 50 are listed, along with the
mean expression for each histology type, in Table 3-3.
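The representation factor quoted above is the observed overlap between two gene lists divided by the overlap expected by chance. A minimal sketch of that ratio follows; it is not the Kim Laboratory tool itself, and the total number of genes is a parameter that must be supplied, since the value used for the calculation above is not stated here.

```python
def representation_factor(n_list1, n_list2, n_overlap, n_total):
    """Observed overlap divided by the overlap expected for two random lists.

    The expected overlap of two independent lists drawn from n_total genes
    is n_list1 * n_list2 / n_total. Values above 1 indicate more overlap
    than expected by chance.
    """
    expected = n_list1 * n_list2 / n_total
    return n_overlap / expected


# Toy example: lists of 100 and 50 genes from 1,000, sharing 10 genes.
print(representation_factor(100, 50, 10, 1000))  # 2.0
```

For the SAM lists, the call would take the form representation_factor(1737, 1287, 862, n_total), with n_total set to the number of genes on the filtered array.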
Table 3-3: Top 50 most differentially expressed genes between histological subtypes for both reference types. The order of genes was identical for both datasets and mean expression ratios for each EOC subtype are given. Unless otherwise stated, UniGene build #184 was used for all annotations. Clones without current UniGene IDs are represented as GenBank accession numbers.
Molecular function ontologies                              No. genes in list   EASE score
Transferase activity, transferring hexosyl groups          15                  0.000297
Oxidoreductase activity                                    42                  0.00144
Anion transporter activity                                 10                  0.00924
Calcium ion binding                                        40                  0.00951
Acyl-CoA or acyl binding                                   7                   0.0176
Organic anion transporter activity                         7                   0.0217
Ion transporter activity                                   24                  0.0241
Electrochemical potential-driven transporter activity      16                  0.0254
Serine-type endopeptidase inhibitor activity               9                   0.0315
Protease inhibitor activity                                12                  0.033
Trypsin activity                                           10                  0.0341
3.2.2.9. Comparison of reference RNA types with the use of a sample machine learning-based classification task
To determine if either the pooled tumour or pooled cell line reference RNA offered an
advantage when using machine learning algorithms to predict a clinical variable of
interest, a sample classification task was set. Using all genes on the array so as to avoid
algorithm over-fitting, leave-one-out cross-validation (LOOCV) was used in conjunction
with three different learning algorithms to assess the ability of each dataset to predict the
histological subtype of an ‘unknown’ sample. The algorithms used were: kNN (with 1
and 3 neighbours), NC and DLDA as described in Materials and Methods section 2.4.8.
Genes were included in the classification model if they were significantly different
between the histology classes at the univariate level of p=0.001 using a randomised
variance model. Gene selection was repeated for each iteration of the learning process,
minimising the chance of over-fitting the model to the cohort being analysed.
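The key design point, that gene selection is repeated inside every cross-validation iteration using the training samples only, can be illustrated with a small sketch. It is not the BRB ArrayTools procedure: a fixed cut-off on a one-way F statistic stands in for the p=0.001 randomised variance test, only a nearest-centroid classifier is shown, and all names and data are hypothetical.

```python
from statistics import mean


def f_statistic(values, labels):
    """One-way ANOVA F statistic for a single gene across the classes."""
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)
    grand = mean(values)
    between = 0.0
    within = 0.0
    for members in groups.values():
        centre = mean(members)
        between += len(members) * (centre - grand) ** 2
        within += sum((v - centre) ** 2 for v in members)
    between /= len(groups) - 1
    within /= len(values) - len(groups)
    return between / within if within > 0 else float("inf")


def nearest_centroid(train_x, train_y, genes, sample):
    """Predict the class whose centroid (over the selected genes) is closest."""
    best_label, best_dist = None, float("inf")
    for label in sorted(set(train_y)):
        rows = [x for x, y in zip(train_x, train_y) if y == label]
        centroid = [mean(r[g] for r in rows) for g in genes]
        dist = sum((sample[g] - c) ** 2 for g, c in zip(genes, centroid))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def loocv_accuracy(x, y, f_cutoff=10.0):
    """Leave-one-out cross-validation with gene selection inside each fold."""
    n_genes = len(x[0])
    correct = 0
    for i in range(len(x)):
        train_x = x[:i] + x[i + 1:]
        train_y = y[:i] + y[i + 1:]
        # Select genes from the training samples only; the held-out sample
        # plays no part in choosing them.
        genes = [g for g in range(n_genes)
                 if f_statistic([r[g] for r in train_x], train_y) >= f_cutoff]
        if not genes:
            genes = list(range(n_genes))  # fall back to all genes
        if nearest_centroid(train_x, train_y, genes, x[i]) == y[i]:
            correct += 1
    return correct / len(x)


# Toy data: gene 0 separates the classes, genes 1 and 2 are noise.
x = [[2.0, 0.1, -0.3], [2.2, -0.2, 0.4], [1.9, 0.3, 0.1],
     [-2.1, 0.2, -0.1], [-1.8, -0.1, 0.3], [-2.0, 0.0, -0.2]]
y = ["serous"] * 3 + ["mucinous"] * 3
print(loocv_accuracy(x, y))  # 1.0
```

Selecting genes once on the full cohort and then cross-validating would let the held-out sample influence the model and would overstate accuracy.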
The complete output from the LOOCV predictions of histological subtype using both
reference types is listed in Appendix F along with a summary of the percentage of
samples correctly classified. The output of this sample classification task shows that for 3
of the 4 classification attempts, the cell line reference data achieved fewer errors in
classifying samples according to their correct histological subtype. The DLDA, 3-NN
and NC classifiers correctly predicted the subtype of each specimen in 84%, 82% and
84% of cases respectively. This compared to 75%, 76% and 76% accuracy when using
the same algorithms on data generated using a pooled tumour reference.
The mean number of genes selected by the algorithms was 654 from the cell line
reference data compared to 424 from the tumour reference data. The difference between
number of genes selected is statistically significant (p<0.001), which supports the
previous observation of more genes being robustly differentially expressed between
histological subtypes with the use of a cell line based reference.
3.2.3. Analysis of cDNA microarray slide scanning technology on data quality
To test the impact of two common scanning methods on the quality of expression data, 46
10.5k cDNA microarray slides were scanned in duplicate, once using a Packard
Scanarray 5000 scanner (Packard Bioscience, USA) and immediately afterwards with an
Agilent Microarray Scanner BA (Agilent Technologies, USA).
These slides were hybridised with EOC RNA using the pooled 11 cell line reference, as
part of the reference RNA comparison study (section 3.2.2), using protocols described
previously.
3.2.3.1. Correlation of expression data generated from a series of cDNA microarray slides scanned on different scanners
Pearson correlation coefficients were calculated for each pair of array scans. The mean
correlation was 0.91 and a box plot of the correlation scores is shown in Figure 3-11. This
correlation indicates that while the overall agreement between data generated by different
microarray scanners is high, some variation does exist. This variation reflects the large number of
measurements made and indicates the potential for introduction of systematic noise
into microarray data at this stage of the work flow. By comparison, the Pearson
correlation obtained when an individual slide was scanned multiple
times on the one scanner was 0.98.
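The per-slide comparison can be sketched as a Pearson correlation between the expression ratios from the two scans of the same slide. This is a minimal illustration with hypothetical names; it assumes the two lists hold log ratios for the same features in the same order, with features missing from either scan already removed.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical log ratios for five features measured on two scanners.
scan_a = [0.10, -1.20, 2.30, 0.45, -0.60]
scan_b = [0.05, -1.00, 2.10, 0.70, -0.40]
print(round(pearson(scan_a, scan_b), 3))
```

Repeating this for every slide gives one coefficient per pair of scans, which is the quantity summarised in the box plot.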
This finding is contrary to that of Ramdas et al. (2001b), in which the
difference in correlation of data between scanners was equivalent to that of data generated
by a single machine.
3.2.3.2. Impact of the microarray scanner used on the consistency of repeated microarray measurements.
To determine whether a significant difference existed in the measurement of individual
array features between the Agilent and Packard scanners, a comparison of quality control
features was carried out. Variation between repeated measurements of eleven different
synthetic quality control genes, printed multiple times throughout the layout of the cDNA
microarray, was assessed. A general linear model was used to control for the expected
variation between different types of ScoreCard gene (i.e. quality controls at
predetermined ratios and intensities) and also between individual arrays.
Highly significant variation (p<0.001) was observed between the QC expression ratios
generated using the two microarray scanners, after factoring in the expected variation
between the individual 46 arrays (p<0.001) and also the expected differences between
gene ratios (p<0.001). This comparison revealed that statistically significant differences
exist between the expression ratios generated for theoretically identical features when
measured by different microarray scanners. Further analysis was then carried out to
determine which of the two scanners produced the more accurate measurement of these
genes (section 3.2.3.5).
3.2.3.3. Comparison of spatial bias present in cDNA microarray data from two different scanners
MMT scores (section 3.2.1) were then calculated for each dataset. As shown in Figure
3-12, the measurements of spatial bias obtained from the images generated by the Agilent
microarray scanner exhibited significantly lower (mean: 140) and less varied (standard
deviation: 105) MMT scores compared to those obtained from the Scanarray machine
(mean: 461, standard deviation: 445). Analysis of variance of the two series of MMT
scores revealed a highly significant statistical difference (p<0.001). Visual inspection of
virtual array images validated the MMT scores for these datasets with arrays having the
highest MMT scores having clearly visible position effects, such as those shown in Figure
3-13.
[Figure 3-11 axes. y-axis: Pearson correlation, 0.5 to 1.0.]
Figure 3-11: Box plot of Pearson correlations for duplicate scanned microarrays (N=45). On average there is good agreement between microarray data generated by two different slide scanners (Pearson correlation >0.95)
[Figure 3-12 axes. x-axis: Packard, Agilent; y-axis: MMT, 0 to 2000.]
Figure 3-12: Box plot of MMT scores from duplicate scanned microarrays. The Agilent Microarray Scanner results in a dataset with significantly lower mean and range of MMT scores compared to the Packard Machine. Outliers are observed with both machines indicating a small number of arrays heavily affected by spatial bias compared to the dataset as a whole.
Figure 3-13: Virtual array images of two microarray slides with visible spatial bias scanned on: (A & B) Packard and (C & D) Agilent microarray scanners. A visible reduction in spatial bias is observed in the image generated by the Agilent scanner by the reduction of the overwhelming Cy5 (red) in the lower right and lower left of these two arrays respectively.
3.2.3.4. Comparison of spatially-dependent variation in background hybridisation between sample microarrays scanned on different scanners
To explore the variation in spatial bias introduced between scans of the same microarray
slide in finer detail, one array was selected where a large difference in MMT score was
observed between the images generated by the two available microarray scanners. An
example of how MMT reflects and quantifies spatial bias is shown in Figure 2-4. 200-
point moving average plots of the foreground and background hybridisation intensity
measurements were generated for both Cy3 and Cy5 dyes from each scan of this array,
from the first to final array feature.
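A moving average of this kind can be sketched as follows. The function is a minimal illustration with hypothetical names; it assumes the intensities are supplied in array feature order and returns one value for each complete window.

```python
def moving_average(values, window=200):
    """Simple moving average over consecutive array features.

    Returns len(values) - window + 1 values, one per complete window,
    using a running sum so that each step costs constant time.
    """
    if window <= 0 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    running = sum(values[:window])
    averages = [running / window]
    for i in range(window, len(values)):
        running += values[i] - values[i - window]
        averages.append(running / window)
    return averages


print(moving_average([1, 2, 3, 4, 5], window=2))  # [1.5, 2.5, 3.5, 4.5]
```

With a window of 200 features, spot-to-spot noise is smoothed away and any trend that follows array location becomes visible.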
As seen in Figure 3-14, both foreground and background absolute intensity measurements
exhibit substantial variation across the surface of this sample microarray. The lines
representing background intensity appear to track with array location to a greater degree
in the Packard-generated data compared to the Agilent scan of the same microarray slide.
The same trends can also be observed in the measurement of the foreground (feature)
intensities with that of the Packard increasing with array location to a greater degree than
the Agilent data.
As both background and foreground measurements generated by the Packard scanner
appear to follow the same trend it was hypothesised that the routine practice of
subtracting the background intensity from the foreground may in fact eradicate the spatial
bias in the final expression data. Background subtracted intensity measurements were
then plotted as shown in Figure 3-15. By visualising the intensity measurements
generated by these two scanner types in this manner it is possible to observe the impact of
bias in background hybridisation on the overall level of spatial bias in the final expression data.
Figure 3-14: 200-point moving average graphs of individual detection channels from a sample microarray slide scanned on two cDNA microarray scanners. (A) Packard-scanned microarray slide with visible spatial bias. Measurement of both foreground and background intensity correlates with array location (x-axis). (B) Slide rescanned using continual laser focus, resulting in a reduction in visible spatial bias and MMT score. Measurement of both foreground and background intensity correlates with array location (x-axis). Background intensity readings from this scan are noticeably less variable and less correlated with location than in the data from the Packard machine.
Figure 3-15: 200 point moving average graphs of the difference between foreground and background intensity measurements on a sample microarray (A) Packard-scanned microarray with visible spatial bias. The correlation with array location indicates that the difference between foreground and background hybridisation is not constant for the entire slide. (B) Slide rescanned using the Agilent Microarray Scanner. The difference between background and foreground intensity does not track with array location to the same extent.
[Figure panels A and B. x-axis: feature number; y-axis: hybridisation intensity (log10).]
3.2.3.5. Analysis of ScoreCard quality control features to compare microarray scanner accuracy
As previously detailed, the ScoreCard quality control system allows researchers to
determine the accuracy of their expression data by comparing the difference between
observed and expected expression ratios at a range of ratios and absolute abundances. A
reliable microarray platform should generate expression data without significant variation
between the known correct values for these features distributed throughout the array and
those observed from a given hybridisation.
Intensity-dependent lowess normalised mean ratios for ScoreCard ratio control (RC)
genes were compared between both datasets. Differences between down-regulated
expression ratios were tested with ANOVA, factoring in differences between gene types
(four ratio control genes), individual arrays and differences between scanners. P-values
for all comparisons were highly significant (p<0.001), indicating that a statistically significant
difference existed in data generated by the two scanners tested. Box plots were created for
the mean expression of each ScoreCard gene generated by both Agilent and Packard
microarray scanners. As shown in Figure 3-16, measurements of ratio-control ScoreCard
genes were significantly closer to their expected values, indicated by the dashed
horizontal lines, in data generated by the Agilent scanner. Furthermore, the range of
values obtained for these controls was significantly smaller (p<0.001) as determined by a
test for equal variances, implemented in Minitab and illustrated by the size of the
interquartile ranges in Figure 3-16.
It is important that a microarray platform can accurately detect differences in relative
gene expression levels at varying absolute concentrations, as genes are present at a broad
spectrum of concentrations in all biological samples. Dynamic range control genes are an
effective way of monitoring the sensitivity of a platform to varying RNA abundances. Six
synthetic genes are double-spotted on the array, six times each, so that each dynamic
range gene is represented 12 times. Mean expression ratios of each dynamic range control are
taken to compare the accuracy of the array platform at detecting hybridisation of probes
at a range of absolute concentrations. The exact concentrations of the dynamic range
RNA spiked into the biological sample applied to the array range from 33 pg/5 µL to
33,000 pg/5 µL as described in Materials and Methods.
Analysis of the mean raw expression ratio obtained for these dynamic range controls is
shown in Figure 3-16. For all six of the dynamic range controls it was observed that the
distribution of observed values crossed the expected value, indicated by a dashed line.
The distributions of dynamic control measurements from the fixed focus scanner were
further from the expected value and also more varied within each individual control.
This comparison indicated that the Agilent microarray scanner produced data with a
significantly higher degree of accuracy for a range of genes with known Cy3:Cy5 ratios
and absolute abundances. From this one can expect that data produced by this machine is
superior in its ability to generate expression data accurately reflecting the true biological
signal in the sample being profiled.
Figure 3-16: Box plots of per-array mean expression ratios for selected ScoreCard quality control genes. Dashed lines indicate the theoretically expected values for each gene. Ag: Agilent Microarray Scanner BA; Pa: Packard Scanarray 5000. (A) Upregulated quality control features, Cy5:Cy3 ratios given in figure. (B) Down regulated quality control features. (C) Dynamic range feature. In all cases, the Agilent data was closer to the theoretical values of these quality control genes, indicating a higher degree of accuracy at a range of expression ratios and relative abundances.
[Figure 3-16 axes. Panel A, x-axis: Cy5:Cy3 ratio (3:1 Ag, 3:1 Pa, 10:1 Ag, 10:1 Pa). Panel B, x-axis: Cy5:Cy3 ratio (1:3 Ag, 1:3 Pa, 1:10 Ag, 1:10 Pa). Panel C, x-axis: dynamic range relative abundance (Cy5:Cy3), 0.00%, 0.01%, 0.03%, 0.1%, 1.0% and 3.3%, for Agilent and Packard.]
3.2.3.6. Analysis of the effect of scanner type on the biological outcome of a sample microarray experiment
A crucial question when comparing technical modifications of existing methods is
whether the changes have any significant impact on the biological question being posed
by the experiment. The proportion of genes differentially expressed (2-fold) in a
proportion of the two datasets was used as a measure of how the differences in scanning
methods impact on the range of gene expression ratios in a given dataset. The fold change
method was used for gene selection specifically because it does not compensate for
variation between slide intensities potentially attributable to the type of microarray
scanner used.
This expression threshold was applied to the data after an intensity-based normalisation
was carried out to compensate for subtle differences in Cy3 and Cy5 incorporation during
the RNA labelling process. In 10 of the 46 arrays scanned on the fixed focus scanner (an
arbitrary proportion), 1,176 of the 9,995 genes available were either 2-fold up or down
regulated compared to the reference. When scanned on an adaptively focussing system a
total of 1,925 genes passed this filter. The difference in proportions of differentially expressed
genes is highly significant in a two-sided t-test (p<0.001).
The variation in quality control features can also be used to infer how varying scanning
methods impact on the biological meaning of a microarray. Although two colour
microarrays produce values that are relative to a reference RNA, relating absolute
intensity levels back to control genes of known abundance can give a measure of a gene’s
actual concentration or abundance in the sample being profiled. Accuracy of quality
control is required to achieve reliable calculations of gene absolute abundances. Mean
values for all quality control genes were significantly closer to the expected values in data
generated using the newer Agilent microarray scanner, indicating a higher potential for
reliable estimates of absolute gene abundances than from data generated by the Packard
scanner.
3.3. Discussion
3.3.1. The use of Mood's Median Test to quantify spatial bias in cDNA microarray data
Spatial bias in cDNA microarray data is often observed as a spatial gradient or trend in
gene expression ratios related to their position on the microarray, rather than their actual
expression level in the specimen of interest. By randomising genes of similar structure
and function throughout the printing area, as much as possible, a random distribution of
up and down regulated features is expected across the surface of a cDNA microarray.
Therefore, patterns of hybridisation related to physical location are an indication of
technical error, such as inadequate probe distribution during hybridisation, irregular slide
topography or the angle of the microarray slide during scanning.
These effects can be visualised by creation of a virtual-array image, as shown in Figure
3-2. Often the effect is more visible after intensity-based normalisation is applied and any
imbalance between Cy3 and Cy5 labelling efficiency has been corrected. A method for
quantifying the degree of this bias is proposed whereby a chi-square statistic is generated
from a test of disparity of array sub-grid median ratio values. As the median sub-grid
expression values deviate from one another, the magnitude of the chi-square test statistic
generated by MMT increases. Examples of the test applied to arrays with and without
obvious spatial irregularities are shown in Figure 3-2. A global measure of position effect
is useful for identifying arrays in a particular experiment that may need to be repeated if
the effect cannot be corrected to an acceptable level with normalisation or other statistical
manipulation. As demonstrated, this test can also be used to compare differences between
array scanners, image analysis packages or other variables in the microarray process that
could potentially impact on the degree of position effect present in the final data obtained.
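The statistic described above can be sketched as follows. This is a minimal illustration only, assuming each sub-grid is supplied as a list of normalised log ratios and that values equal to the grand median are counted in the "not above" group; the function name `moods_median_chi2` is hypothetical and this is not the implementation used to generate the results in this chapter.

```python
from statistics import median


def moods_median_chi2(subgrids):
    """Chi-square statistic of Mood's median test across array sub-grids.

    subgrids: list of lists, one inner list of log ratios per sub-grid.
    Assumption: values equal to the grand median count as "not above".
    """
    pooled = [value for grid in subgrids for value in grid]
    grand_median = median(pooled)
    n = len(pooled)

    # 2 x k contingency table: features above / not above the grand median
    above = [sum(value > grand_median for value in grid) for grid in subgrids]
    not_above = [len(grid) - a for grid, a in zip(subgrids, above)]
    total_above, total_not_above = sum(above), sum(not_above)

    chi2 = 0.0
    for grid, a, b in zip(subgrids, above, not_above):
        for observed, row_total in ((a, total_above), (b, total_not_above)):
            expected = row_total * len(grid) / n
            if expected > 0:
                chi2 += (observed - expected) ** 2 / expected
    return chi2


# Four sub-grids with identical ratio distributions (no position effect)
flat_array = [[-1, 1, -1, 1] for _ in range(4)]
# Alternate sub-grids uniformly up- or down-regulated (strong position effect)
biased_array = [[1, 1, 1, 1], [-1, -1, -1, -1], [1, 1, 1, 1], [-1, -1, -1, -1]]

print(moods_median_chi2(flat_array), moods_median_chi2(biased_array))
```

The statistic is zero when every sub-grid has the same share of features above the grand median and grows as the sub-grid medians diverge. For k sub-grids it is conventionally compared against a chi-square distribution with k - 1 degrees of freedom.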
MMT is based on non-parametric statistics and is therefore robust to outliers, which are often
present in microarray data, depending on the type of sample RNA being investigated.
RNA extracted from human tissues, for example, will generally bind a higher proportion of array
features than RNA extracted from cell lines, potentially resulting in
more features being identified as outliers. While array data often approach a normal
distribution, this cannot be assumed (Aris et al., 2004); therefore the non-parametric
nature of MMT is another factor which makes it suitable for analysing microarray data.
In section 3.2.1.2, the test is used to evaluate the impact of different microarray image
analysis software packages on the degree of spatial bias present in the final data. A key
difference between the two packages compared is the variable spot size algorithm
implemented in GenePix. This allows the diameter of each individual array feature to be
adjusted in response to variation in the actual amount of cDNA deposited on the glass
surface. The benefit of this is a reduction in the amount of background hybridisation
included in the measurement of smaller array features and conversely the amount of
foreground (or feature) intensity included with the measurement of background
hybridisation nearby larger features. Another important difference is the method
employed for calculating background hybridisation. The Quantarray package uses the
average pixel intensity of a circular area surrounding each spot, whereas GenePix takes
four separate measurements of background intensity and uses the median of these as the
final measurement.
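The practical difference between the two background estimates can be illustrated with a small sketch. It assumes, purely for illustration, that each of the four GenePix measurements is the mean pixel intensity of a separate local region; how the packages actually summarise pixels within a region is not described here, and the function names and pixel values are hypothetical.

```python
from statistics import mean, median


def background_mean(pixels):
    """Average intensity of all pixels in one area surrounding the spot."""
    return mean(pixels)


def background_median_of_regions(regions):
    """Median of separate regional background measurements.

    Assumption: each regional measurement is the mean of that region's pixels.
    """
    return median(mean(region) for region in regions)


# Four local background regions; the last is contaminated by a bright neighbour
regions = [[100, 102], [98, 100], [101, 99], [400, 420]]
all_pixels = [pixel for region in regions for pixel in region]

print(background_mean(all_pixels), background_median_of_regions(regions))
```

In this toy case a single contaminated region inflates the mean-based estimate considerably, while the median of the regional measurements stays close to the uncontaminated background level.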
By calculating the MMT scores for a series of arrays analysed in duplicate by these two
methods it was observed that GenePix produced expression data with a lower and
less variable degree of spatial bias overall, as determined by MMT scores. From this observation it
can be concluded that the image analysis stage of microarray analysis can also
introduce spatial bias into the final data obtained.
Other methods for correcting spatial irregularity in expression data exist and operate on
similar principles to SNOMAD. These include the widely used print-tip normalisation
method, implemented in the R language (Ihaka, 1997) through the Bioconductor package
(Gentleman et al., 2004) and also available online at GEPAS Tools (http://gepas.bioinfo.cnio.es/)
(Herrero et al., 2003). In this method the adjustment factor is determined by a lowess regression
curve fitted through the expression data, binned into groups according to the printing tip
used to deposit the probe onto the glass substrate during the array fabrication process.
To demonstrate how MMT can be used in a practical application, a sample classification
task was carried out using a dataset designed to develop a predictive algorithm for gastric
cancer disease recurrence. By creating three separate versions of the dataset and carrying
out the cross-validated prediction analysis, it was possible to observe the increase in
prediction accuracy associated with a reduction in the overall level of spatial bias present
in a cDNA microarray dataset. The number of genes required by the algorithm, designed
to iteratively add genes to the predictive set based on their signal-to-noise ranking until
optimal classification is reached, decreased with the reduction in spatial bias. This may
further reflect the reduction in the systematic error component of each gene's expression
profile across the dataset, allowing the algorithm to perform optimally using a smaller
number of genes whose expression patterns more closely resemble the underlying
biology, rather than characteristics of the laboratory procedures.
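The gene ranking step mentioned above can be sketched as follows. The signal-to-noise definition used here (difference in class means divided by the sum of class standard deviations) is a commonly used form and is an assumption, as is the two-class 0/1 labelling; the function names are hypothetical and the iterative addition of genes to the predictive set is not shown.

```python
from statistics import mean, stdev


def signal_to_noise(class_a, class_b):
    """Assumed definition: (mean_a - mean_b) / (sd_a + sd_b)."""
    return (mean(class_a) - mean(class_b)) / (stdev(class_a) + stdev(class_b))


def rank_genes(expression, labels):
    """Order genes by absolute signal-to-noise score, highest first.

    expression: dict mapping gene name -> list of log ratios, one per sample.
    labels: list of 0/1 class labels in the same sample order.
    Note: a gene with zero variance in both classes would raise ZeroDivisionError.
    """
    scores = {}
    for gene, values in expression.items():
        class_a = [v for v, label in zip(values, labels) if label == 0]
        class_b = [v for v, label in zip(values, labels) if label == 1]
        scores[gene] = abs(signal_to_noise(class_a, class_b))
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical data: g1 separates the two classes, g2 does not
expression = {
    "g1": [1.0, 1.2, 0.8, -1.0, -1.2, -0.8],
    "g2": [0.1, -0.1, 0.0, 0.0, 0.1, -0.1],
}
labels = [0, 0, 0, 1, 1, 1]
print(rank_genes(expression, labels))
```

Because the score divides by within-class variability, any systematic error that inflates that variability lowers a gene's rank, which is consistent with fewer genes being needed once spatial bias is reduced.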
In general, the effect of reducing the overall amount of spatial bias in a dataset appears to
impact favourably on classification accuracy. Switching from intensity-dependent to
spatially-dependent normalisation resulted in a 9% increase in classification accuracy and
a four-fold reduction in the number of genes required to achieve optimal prediction.
MMT scores calculated for arrays in this experiment after SNOMAD normalisation
showed a significant reduction in the mean level of spatial bias; however, one array
continued to generate a high MMT score. Removing this one array from the dataset
resulted in a further classification-accuracy increase and also a decrease in the optimal
number of genes required by the algorithm to perform the classifications.
This section of the analysis further supports the benefits of identifying and reducing the
degree of spatial bias in cDNA microarray datasets. As technical noise in gene
expression data is reduced through improved wet-lab protocols or bioinformatic methods like SNOMAD
or print-tip normalisation, the biological signal appears to become less obscured,
resulting in the ability to generate predictive signatures with higher
cross-validation accuracy and requiring fewer genes. This is important, as often a core set of
genes is sought for follow up analysis using other methods such as RT-PCR, in-situ
hybridisation or immunohistochemistry. Methods that reduce redundancy in
gene sets identified as predictive of a phenotype of interest therefore facilitate these validation
analyses by allowing the minimal number of most influential genes to be determined.
3.3.2. Evaluation of reference RNA options suitable for large-scale tumour profiling studies
The advantages and disadvantages of the two distinct types of reference RNA analysed
for suitability in a large-scale tumour profiling study vary depending on the scope of the
project, method of data analysis planned and technical considerations such as cost of
production, longevity of the material and the ability to combine data for meta-analysis
studies.
An important measure of reference RNA performance is the proportion of array features
successfully hybridised and thus capable of generating an expression ratio that passes the
predetermined quality criteria. An expression ratio is only generated if the image analysis
software detects an adequate level of fluorescence in both sample and reference channels
of the scanned microarray images. In this study, no significant difference was detected in
the proportion of the microarray successfully hybridised by either reference type;
in this regard, therefore, both references perform equally, producing viable expression
measurements for the vast majority of array features. While no statistically significant
difference was found, the box plots of the proportion of successful hybridisations (Figure
2-4) show that five of the cell-line reference microarrays were outliers, with proportions
as low as 55%. No outliers were identified in the tumour reference arrays; however, it is
difficult to know whether this is a reflection of individual array quality, the particular
printing batch the arrays were sourced from, a hybridisation problem, or a by-product of
the reference type used.
Despite no difference in the proportion of genes hybridised by the reference types being
detected, several important ontology differences were observed between the lists of genes
bound by only one reference type, approximately 2% of the total set. In line with the
structurally more diverse tissue used to generate the pooled tumour reference, a
significant proportion of genes on the microarray involved in cell-cell communication,
membrane and extracellular structure were detected by this reference type. This
observation may be extremely important in selecting a reference type for a microarray
study in which the interactions between a tumour and its surrounding tissue or the
immune system are of interest. Those genes identified by the cell line reference alone
included genes related to cell morphology and response to external stimuli, reflecting the
tissue culture conditions in which they are grown.
Reduction of spatial correlation or bias in expression data is important for robust and
repeatable array results (Qian et al., 2003). The proposed method of visualising and
quantifying this factor revealed a significantly lower degree of spatial bias present across
the arrays using the pooled cell line reference compared to that observed in the tumour-
derived reference hybridisations. One reason for the significant difference in spatial bias
could be a difference in the overall mean intensity values of the hybridised features. With
higher feature intensity, the resulting expression ratios are less affected by the subtraction
of background hybridisation when compared to features where the distinction between
background and spot intensity is smaller. As a result, any spatially dependent variation in
background hybridisation, due to uneven slide topography for example, is less likely to
result in up or down regulation of all genes in a specific section of an array. It is also
important to note that the difference in spatial bias between reference types was reduced
to a non-significant level after SNOMAD normalisation had been applied, indicating that
for the extent of spatial bias present in the datasets tested, this normalisation method is
effective for correcting this source of technical error (Colantuoni et al., 2002).
Approximately 600 more genes were identified from the cell line reference data using an
unsupervised method of identifying those genes potentially contributing to a phenotype of
interest (i.e. at least 1.5-fold differentially expressed in at least 20% of samples). This
indicates that the cell line reference RNA results in expression data with a wider dynamic
range, supported by an analysis of variance in the overall mean expression ratios for each
dataset, illustrated in Figure 3-9. The use of a t-test based approach (log expression
variation) for identification of genes with significant variation in a given proportion of a
dataset yielded no difference in the proportion of genes identified. By adapting to the
specifics of each individual microarray, this method of data reduction, or unsupervised
gene selection, is therefore more suited to cDNA microarray data for which the overall
dynamic range of expression ratios can vary based on a number of variables, including
the type of reference RNA used.
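The fixed fold-change filter described above (at least 1.5-fold differentially expressed in at least 20% of samples) can be sketched as follows. It assumes each gene is represented by log2 expression ratios relative to the reference and that a change in either direction counts; the function name and example values are hypothetical.

```python
import math


def passes_fold_filter(log2_ratios, fold=1.5, min_fraction=0.2):
    """True if |log2 ratio| >= log2(fold) in at least min_fraction of samples."""
    threshold = math.log2(fold)
    hits = sum(abs(ratio) >= threshold for ratio in log2_ratios)
    return hits >= min_fraction * len(log2_ratios)


# One of five samples (20%) shows a 2-fold change: the gene is retained
print(passes_fold_filter([1.0, 0.0, 0.0, 0.0, 0.0]))
# One of ten samples (10%) shows a 2-fold change: the gene is filtered out
print(passes_fold_filter([1.0] + [0.0] * 9))
```

Because the threshold is fixed, a dataset with a wider dynamic range of expression ratios will pass more genes through this filter, whereas a variance-based filter adapts to the spread of each dataset.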
In a comparison of genes identified as being differentially expressed between
histological subtypes present in the dataset, 450 more genes were identified from the cell-
line reference data. Therefore, while an unsupervised gene selection filter identified a
similar number of genes, a supervised approach yielded a substantially larger number of
genes from the cell-line data. In the top 50 differentially expressed genes listed in Table
3-3, identified in both datasets, a number of biologically relevant genes were observed.
These include:
Trefoil factor 1 (TFF1) – upregulated 2-fold in mucinous tumours and
implicated in a number of other cancers including breast and colorectal. Trefoil
factor genes are thought to be involved in protecting the mucosa from damage and
stabilising the mucosal layer, and to have a role in epithelium healing and tumour
suppression (Dossinger et al., 2002; Schwartz et al., 2002).
Folate receptor 1 (FOLR1) – upregulated 5-fold in serous and 2-fold in
endometrioid samples and used as a marker for ovarian tumour progression. A
membrane glycoprotein not present on ovarian surface epithelium but found
during early transformation stages (Galmozzi et al., 2001; Tomassetti et al.,
2003).
Kallikrein 8 (KLK8) – upregulated 3-fold in serous and 1.3-fold in endometrioid
samples. Proposed as a biomarker for ovarian cancer and thought to be involved
in invasion and metastasis. One of several kallikrein genes identified as
differentially expressed (Shigemasa et al., 2004).
Mesothelin (MSLN) – upregulated 7-fold in serous and 2.5-fold in endometrioid
samples. Binds specifically to the ovarian cancer antigen CA-125 and controls
cellular adhesion to the mesothelial epithelium, therefore mediating one of the key
stages of tumour invasion (Rump et al., 2004).
Although comparison of gene expression profiles between subtypes of EOC was not one of
the goals of this chapter, the data generated have the potential to form the basis of such
an analysis. In Chapter 5 the molecular differences between several subtypes of EOC are
explored, making use of data and findings from this preliminary chapter.
For differential ontology analysis, gene lists were restricted to those differentially
expressed between the mucinous and serous subtype only. Comparing the significantly
represented ontologies in these lists revealed that only a single class or ontology was
differentially represented between the two lists. This again indicates that neither reference
type identified a substantially different list of genes differentially expressed between EOC
subtypes. Furthermore, it shows that the Peter Mac cDNA microarray platform is sufficiently
robust that the same classes of genes are identified from a duplicate experiment, even
when a major experimental variable, such as the type of reference RNA used, is changed.
By using the data in a sample classification task it was possible to observe whether either
reference type conferred an advantage in what is often the main purpose of a microarray
dataset: pattern learning and prediction of one or more samples for which the class or
phenotype of interest is unknown. The data were used in conjunction with learning
algorithms to generate a predictive signature for ovarian cancer histological subtype. In
three of the four prediction trials, with LOOCV used to evaluate the performance of the
algorithm, the data generated with the cell line RNA reference produced fewer
misclassifications. On average 81% of the samples were assigned their true histological
subtype, compared to 78% for the classifier trained on the tumour reference data. While
this is only a small difference in terms of the number of actual samples predicted incorrectly
with one dataset and correctly with the other, it is still an important advantage, particularly
when an algorithm is being used to predict clinically relevant phenotypes.
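The leave-one-out procedure used to evaluate these classifiers can be sketched as follows. The nearest-centroid classifier is a stand-in chosen for brevity and is an assumption, not one of the learning algorithms applied in this chapter; the function names and the toy expression values are likewise hypothetical.

```python
def nearest_centroid_predict(train_x, train_y, sample):
    """Assign the sample to the class whose mean expression profile is closest."""
    centroids = {}
    for label in set(train_y):
        rows = [x for x, y in zip(train_x, train_y) if y == label]
        centroids[label] = [sum(column) / len(rows) for column in zip(*rows)]

    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(sample, centroid))

    return min(centroids, key=lambda label: squared_distance(centroids[label]))


def loocv_accuracy(samples, labels):
    """Hold out each sample in turn, train on the rest, and score the prediction."""
    correct = 0
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        if nearest_centroid_predict(train_x, train_y, samples[i]) == labels[i]:
            correct += 1
    return correct / len(samples)


# Hypothetical two-gene profiles for two histological subtypes
samples = [[1.0, 0.9], [1.1, 1.0], [0.9, 1.1],
           [-1.0, -0.9], [-1.1, -1.0], [-0.9, -1.1]]
labels = ["serous"] * 3 + ["mucinous"] * 3
print(loocv_accuracy(samples, labels))
```

In a real analysis any gene selection step would normally be repeated inside each leave-one-out iteration, so that the held-out sample does not influence the genes chosen.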
The practical and technical considerations outlined such as cost of production, long term
availability and portability of data generated are also important factors to consider when
selecting a reference RNA type. Experiments carried out using the same reference type
have a greater potential for meta-analysis, in which the statistical power of a dataset is
increased by combining raw expression data from separate studies. Where a common
reference is used, such as the 11 cell line reference described in this chapter, it acts as
a common denominator or baseline between studies, even if they have been carried out in
separate laboratories. Therefore this is a significant advantage over project-specific
references, such as the pooled tumour material, for which the expression ratios generated
are relative to the expression profiles of a specific group of tumours.
Additional time and expense can be associated with the construction of a pooled cell line
reference RNA. Large volumes of 11 different cell lines need to be generated, each with
their own nutrient requirements and varying growth rates. In contrast, a pooled tumour
reference can be made by combining amplified or non-amplified RNA from tissue
specimens at the same time as those being processed for individual hybridisation,
requiring little additional time or expense.
After weighing up the advantages and disadvantages of each reference type, summarised
in Table 3-5, the pooled 11 cell line reference was determined to be the most appropriate
choice for large scale EOC-profiling studies.
Table 3-5: Summary of reference RNA comparisons carried out in this study. In most of the comparisons conducted, the pooled cell line reference out-performed the pooled tumour reference.
Method of assessment | Pooled cell line reference | Pooled tumour reference
Proportion of array features identified as reliably expressed
Wide dynamic range of expression ratios
Identification of genes related to interactions between tumour cells and their environment
Proportion of genes differentially expressed – fold change
Proportion of genes differentially expressed – log-variance
Lower degree of spatial bias as measured by MMT
Number of genes significantly differentially expressed between histological subtype
Accuracy of machine-learning based predictions of EOC histological subtype
Comparability of data generated to other datasets
Long term availability
Reproducibility by other laboratories
Cost of production
3.3.3. Microarray scanners and cDNA gene expression data quality
During the course of this project the Agilent Microarray Scanner BA was introduced with
several new features not found in other scanners on the market at the time. One of the
most significant new features was its ability to maintain the focal point of the scanning
lasers throughout the duration of the scan. This feature was claimed to result in higher
quality data because the individual measurements were made with the correct laser focal
point, irrespective of variation in substrate topography, tilt or any movement of the slide
during the scanning process. By scanning a series of hybridised microarray slides on an
Agilent Microarray Scanner and then immediately again using a Packard Bioscience
Scanarray 5000 without this dynamic focusing technology, several important differences
in the final data were observed that supported the claim of improved data quality over
scanners lacking this feature.
The comparisons of the datasets produced by the two scanners revealed a number of
significant differences in the accuracy of individual measurements and other methods of
assessment including variation in the degree of spatial bias present. These findings
suggest the use of multiple scanners for one experiment may result in a systematic bias
being introduced in the data generated. Through normalisation, or inclusion of scanner-
type as a variable in the data analysis stage, the impact of the observed differences
between scanners could possibly be minimised; however this may limit the statistical
power of the dataset. Therefore, where possible it is preferable to use a single scanner
when creating a cDNA microarray dataset.
By analysing variation in raw background hybridisation intensity, for both Cy3 and Cy5
channels of sample arrays with and without visible spatial bias, it was possible to identify
that a significant proportion of the spatial bias present in cDNA microarray data comes
from a spatial relationship between background hybridisation intensity and array location.
This variation is transferred to the final gene expression ratios when the measurement of
background hybridisation is subtracted from the actual intensity of the hybridised probe, a
step intended to control for non-specific binding. Significantly lower variation was
observed in the background channels of array images produced by the Agilent Microarray
Scanner, explaining their overall lower MMT scores and greater accuracy of individual
data points.
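The way background variation carries through to expression ratios can be shown with a short worked sketch. The intensities are hypothetical, and the log2 ratio of background-subtracted Cy5 to Cy3 signal is assumed as the expression measure; which channel carries the sample and which the reference is not relevant to the illustration.

```python
import math


def log_ratio(cy5_fg, cy3_fg, cy5_bg=0.0, cy3_bg=0.0):
    """log2 of the background-subtracted Cy5 / Cy3 intensity ratio."""
    return math.log2((cy5_fg - cy5_bg) / (cy3_fg - cy3_bg))


# Same feature intensities, but the local Cy5 background is higher in one
# region of the slide (200) than in another (100)
even_region = log_ratio(600, 600, cy5_bg=100, cy3_bg=100)
uneven_region = log_ratio(600, 600, cy5_bg=200, cy3_bg=100)

# The same background difference applied to a ten-fold brighter feature
bright_feature = log_ratio(6000, 6000, cy5_bg=200, cy3_bg=100)

# Without background subtraction the two regions give the same ratio
unsubtracted = log_ratio(600, 600)

print(even_region, uneven_region, bright_feature, unsubtracted)
```

In this example an equally expressed gene appears down-regulated only where the background estimate is elevated, and the distortion shrinks as feature intensity increases relative to background.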
These results also suggest that forgoing the background subtraction stage of cDNA
microarray data processing may reduce the level of spatial bias present in the final data,
particularly for microarrays where a distinct relationship between background
hybridisation and array location is observed.
The analysis of Scorecard expression measurements obtained from the two scanner types
revealed a significant difference in the accuracy of individual feature measurements. At a
range of ratio levels and absolute RNA concentrations, values generated from the Agilent
Microarray Scanner were significantly closer to the expected values and exhibited less
variation within repeated measurements of the same array feature.
3.4. General conclusions
In this chapter a number of options available for several important steps of the microarray
work flow are compared to determine the optimal choice for future tumour profiling
studies.
A novel method for identifying and quantifying spatial bias in cDNA microarray
data is proposed. Its usefulness for identifying problem arrays whose removal
from a dataset dramatically increases the predictive accuracy of the dataset is
demonstrated.
A comparison of two types of reference RNA revealed that for a large-scale
project such as that planned for the AOCS, a pooled cell line material is superior
to a more project-specific material made of pooled RNA from the samples to be
profiled.
A comparison of data generated by different microarray scanners is also
described in which the Agilent Microarray Scanner generated the most accurate
and robust measurements of gene expression with a significantly lower degree of
spatial bias.
Adoption of the optimal methods determined by analyses described in this chapter can be
expected to result in cDNA microarray data that is highly accurate, thus requiring less
technical replication. The data will also be readily comparable to other studies with
minimal statistical manipulation required and have a lower degree of technical error
present compared to data generated with other methods. These findings are applied to the
following chapters in this thesis, analysing patterns of gene expression that correlate with
length of EOC patient survival (Chapter 4) and the molecular characterisation of invasive
and LMP EOC subtypes (Chapter 5).
4. Gene expression analysis of epithelial ovarian cancer overall survival
4.1. Introduction
Ovarian cancer remains one of the most lethal of all cancer types, with a five-year survival
rate of 42% (Ries LAG, 2004). Patients who are diagnosed in the early stages of the
disease’s progression have a significantly better prognosis than those diagnosed with
advanced-stage tumours. Approximately 80% of invasive epithelial ovarian cancer
diagnoses are stage III or IV, in which tumour is present in both ovaries and has spread to
other organs in the body (see Appendix A). Substantial variation in the prognosis and
survival time for these women is observed (Australian Institute of Health and Welfare and
Australasian Association of Cancer Registries, 2001). It is hypothesised that the
molecular differences underlying differences in clinical behaviour, including survival time,
are of interest as they may represent aspects of the disease which could be therapeutically
manipulated to improve patient outcome.
It is difficult to assign EOC patients into valid survival groups because of the complexity
of treatment variables. These include both the number of cycles and specific type of
chemotherapy used, type of surgery and also level of residual disease remaining after
surgery (debulk status) (van der Burg et al., 1995). In one recent study of EOC that
sought to relate survival time to gene expression profiles, a group of patients classified as
‘short-term survivors’ had a median survival time of 30 months (Spentzos et al., 2004).
However this definition was based on the specifics of the cohort under investigation and
may or may not be appropriate for other studies.
Microarrays are one of the most promising high-throughput methods for analysing
diseases at a fundamental molecular level. By integrating gene expression and clinical
data it is possible to gain insight into the foundations of clinical variables such as
disparate survival times. In recent times they have been used to explore molecular
differences between patients with short or long term survival (which can be defined as
disease-free or overall survival) in a range of other cancer types including breast (Jenssen
et al., 2002; van de Vijver et al., 2002), mesothelioma (Gordon et al., 2003), kidney
(Vasselli et al., 2003), prostate (Singh et al., 2002), diffuse large-B-cell lymphoma
(Rosenwald et al., 2002) and most recently, EOC (Hartmann et al., 2005; Spentzos et al.,
2004).
This chapter describes various bioinformatic approaches to identifying genes whose
expression patterns correlate with the variable of patient survival, in a cohort of EOC
cases analysed with the Peter Mac 10.5k cDNA microarray platform. The first section
describes case selection and the review processes carried out to create the best possible
cohort of patients for this type of analysis. A variety of approaches are then described in
which the survival variable is related to gene expression data, either as a categorical (e.g.
short versus long-term survival) or as a continuous variable (number of months). Methods for
assessing the significance of gene expression signatures obtained from these analyses are
then explored, along with different approaches for normalising cDNA microarray data to
address technical error. Finally genes whose expression patterns are identified as being
significantly related to length of survival are analysed with respect to other functional
information and their discovery in other studies of a similar nature is reported.
4.2. Results
4.2.1. Case selection and pathology review aimed at ensuring suitability for arraying and outcome analysis
4.2.1.1. Identification of appropriate cases from the AOCS microarray database
As part of the AOCS microarray project, a database of EOC microarray profiles was
created from retrospectively collected specimens of tissue sourced from several
participating institutions including Westmead Hospital, Royal Brisbane Hospital and
Peter Mac. The 10.5k cDNA microarrays were used as previously described. Sample
processing and microarray hybridisation were carried out by Sophie Katsabanis of the
Peter Mac Microarray Facility between 2001 and 2003.
A series of criteria for case selection was devised with assistance from Sian Fereday,
AOCS Data Manager and Dr Sherene Loi, a medical oncologist at Peter Mac. Patients
were included in the study if the following clinical and pathological information was
available (descriptions of these criteria are given in Materials and Methods section 2.2):
Date of diagnosis,
Date of last follow up or date of death,
Patient status at last follow up,
If patient deceased, cause of death,
Pathology grade and stage,
Histological subtype,
Chemotherapy information,
The amount of disease remaining at the conclusion of surgery (debulk status),
Arrayed using the pooled cell line reference RNA
In total, 26 cases that satisfied these criteria were identified. A large proportion of the
total number of cases available initially were excluded on the basis of missing
information about residual disease levels after surgical debulking. Given that the extent of
residual disease is known to be a significant prognosticator, it was not possible to
discount this factor and include cases without this data in the study (Bristow et al., 2002;
Grossi et al., 2002; van der Burg et al., 1995). The samples selected for this study are
described in Table 4-1.
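The inclusion criteria listed above amount to a completeness check on each case record, which can be sketched as follows. The field names, the record layout and the function name `is_eligible` are hypothetical and do not reflect the structure of the AOCS database.

```python
# Hypothetical field names standing in for the listed criteria
REQUIRED_FIELDS = (
    "date_of_diagnosis",
    "date_of_last_follow_up_or_death",
    "status_at_last_follow_up",
    "grade",
    "stage",
    "histological_subtype",
    "chemotherapy",
    "residual_disease",
)


def is_eligible(case):
    """True if the record carries every item of information required for inclusion."""
    def present(field):
        return case.get(field) not in (None, "")

    if not all(present(field) for field in REQUIRED_FIELDS):
        return False
    # Cause of death is only required when the patient is deceased
    if case["status_at_last_follow_up"] == "deceased" and not present("cause_of_death"):
        return False
    # Only arrays hybridised against the pooled cell line reference are used
    return case.get("reference_rna") == "pooled cell line"


complete_case = {
    "date_of_diagnosis": "1994-03-01",
    "date_of_last_follow_up_or_death": "1995-06-01",
    "status_at_last_follow_up": "deceased",
    "cause_of_death": "cancer",
    "grade": 3,
    "stage": "IIIC",
    "histological_subtype": "serous",
    "chemotherapy": "recorded",
    "residual_disease": "min",
    "reference_rna": "pooled cell line",
}
missing_debulk = {k: v for k, v in complete_case.items() if k != "residual_disease"}

print(is_eligible(complete_case), is_eligible(missing_debulk))
```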
The serous EOC subtype was selected as this represents the majority of EOC diagnoses.
The molecular and genetic differences between serous and the other major EOC subtypes,
endometrioid and mucinous, have been documented and it is clear that these subtypes
represent distinct disease types, each requiring separate investigation (Hess et al., 2004;
Pieretti et al., 2002; Schwartz et al., 2002).
Table 4-1 Samples selected for analysis of gene expression profiles associated with length of survival. Table sorted by disease stage. All patients had deceased of disease at time of last follow up. Patient status: 2 = deceased from cancer, 3 = deceased due to other causes. Grade 9 = unknown.

| Patient ID | Morphology Description | Grade | Stage | No. Months survival | Patient Status | Residual Disease |
|---|---|---|---|---|---|---|
| 85.005 | Papillary serous cystadenocarcinoma | 9 | III | 15 | 2 | Mod |
| 92.015 | Serous cystadenocarcinoma | 3 | III | 1 | 3 | Min |
| 91.052 | Papillary serous cystadenocarcinoma | 2 | IIB | 17 | 2 | Nil |
| 93.131 | Papillary serous adenocarcinoma | 9 | IIIA | 20 | 2 | Nil |
| 92.071 | Serous cystadenocarcinoma | 2 | IIIB | 14 | 2 | Min |
| 90.061 | Papillary serous adenocarcinoma | 3 | IIIC | 10 | 2 | Mod |
| 91.007 | Papillary serous adenocarcinoma | 2 | IIIC | 9 | 2 | Max |
| 92.003 | Serous surface papillary carcinoma | 3 | IIIC | 7 | 2 | Min |
| 92.004 | Papillary serous adenocarcinoma | 0 | IIIC | 45 | 2 | Min |
| 92.074 | Papillary serous adenocarcinoma | 3 | IIIC | 6 | 2 | Max |
| 93.021 | Papillary serous adenocarcinoma | 3 | IIIC | 13 | 2 | Min |
| 93.035 | Papillary serous adenocarcinoma | 0 | IIIC | 15 | 2 | Min |
| 93.108 | Papillary serous cystadenocarcinoma | 0 | IIIc | 16 | 2 | Min |
| 93.120 | Serous carcinoma | 2 | IIIC | 15 | 2 | Min |
| 93.128 | Serous carcinoma | 3 | IIIC | 27 | 2 | Min |
| 94.044 | Papillary serous cystadenocarcinoma | 3 | IIIC | 101 | 2 | Nil |
| 94.050 | Papillary serous adenocarcinoma | 3 | IIIC | 40 | 2 | Min |
| 94.056 | Serous cystadenocarcinoma | 3 | IIIC | 1 | 3 | Nil |
| 94.067 | Papillary serous cystadenocarcinoma | 3 | IIIC | 20 | 2 | Min |
| 94.070 | Serous cystadenocarcinoma | 2 | IIIC | 32 | 2 | Nil |
| 94.084 | Serous carcinoma | 3 | IIIC | 14 | 2 | Min |
| 94.093 | Serous cystadenocarcinoma | 1 | IIIC | 15 | 2 | Max |
| 94.113 | Papillary serous adenocarcinoma | 9 | IIIC | 25 | 2 | Min |
| 94.116 | Serous cystadenocarcinoma | 2 | IIIC | 12 | 2 | Mod |
| 94.019 | Papillary serous adenocarcinoma | 3 | IV | 11 | 2 | Min |
| 94.068 | Serous cystadenocarcinoma | 3 | IV | 13 | 2 | Max |
| 92.001 | Serous carcinoma | 3 | x | 8 | 2 | Max |
4.2.1.2. Pathology review of selected cases for outcome analysis
Cases identified as suitable for this study were reviewed by pathologists Dr. Paul Waring
and Dr. Melissa Robbie to confirm adequate percentage tumour content and also the
histological subtype and grade of each specimen. All samples were determined to be over
50% tumour according to the tumour-nuclei method (Materials and Methods section
2.2.1) and the histological subtype of each was confirmed to match the original diagnosis.
4.2.2. A descriptive statistical analysis of the study cohort
To explore and visualise the characteristics of the 26 samples comprising the cohort for
analysis, a descriptive statistical analysis was carried out using Minitab and Microsoft
Excel. The median survival time for these 26 cases was 15 months (Figure 4-1d) with a
mean and standard deviation each of 19 months. These figures indicate that a wide range of
survival times was present in this cohort; however, the distribution of survival times does
not accurately resemble the distribution of EOC survival in the general population. All
but two patients in this sample set had died in less than 5 years, whereas according to
the Australian Institute of Health and Welfare, on average 42% of Australian women
survive past this time point (Australian Institute of Health and Welfare and Australasian
Association of Cancer Registries, 2001).
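The summary statistics quoted above can be recomputed with a short sketch. The survival times are transcribed from the "No. Months survival" column of Table 4-1 as printed; the original analysis was carried out in Minitab and Microsoft Excel, so this Python version and the function name `describe` are illustrative only.

```python
from statistics import mean, median, stdev

# Survival in months, transcribed from Table 4-1 in table order
survival_months = [15, 1, 17, 20, 14, 10, 9, 7, 45, 6, 13, 15, 16, 15,
                   27, 101, 40, 1, 20, 32, 14, 15, 25, 12, 11, 13, 8]


def describe(values):
    """Median, mean and sample standard deviation of a list of survival times."""
    return {
        "median": median(values),
        "mean": mean(values),
        "sd": stdev(values),
    }


summary = describe(survival_months)
print({name: round(value, 1) for name, value in summary.items()})
```

With these values the sketch gives a median of 15 months and a mean and standard deviation of approximately 19 months each, in line with the figures reported above.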
The Australian statistics are used for this study as they do not include the LMP form of
EOC in the calculations of survival rates, unlike figures from US agencies such as SEER
(Parkin and Muir, 1992; Ries LAG, 2004). As described in Chapter 5, this form of the
disease has a more favourable prognosis and is associated with statistically significantly
longer survival times (Behtash et al., 2004; Sherman et al., 2004).
The overall median survival of optimally-debulked women with stage III disease who are
treated with standard chemotherapy (consisting of the drugs cisplatin or carboplatin plus
paclitaxel) is approximately 50 months (Gadducci et al., 2000; Markman et al., 2001;
Piccart et al., 2000) (shown as a dashed horizontal line across the survival time box plot
in Figure 4-1).
Based on the clinical information available for the study cohort, a descriptive
statistical analysis was carried out using Minitab, the results of which are shown in Figure
4-1. This analysis showed that over half the samples in the cohort were classified as stage
III, which indicates that the majority of tumours had spread to both ovaries, beyond the
pelvis to the abdomen lining and/or into the lymph system by the time the tumour was
diagnosed. Grade 3 tumours represented over half of the cohort, indicating an overall
poor differentiation status. More than three quarters of samples were stage IIIC, which
indicates that abdominal deposits of 2cm or greater were observed in these patients,
corresponding to an advanced stage of EOC. The level of residual disease was assessed
using the following categories: nil: 0cm, min: 0-1cm, mod: 1-2cm, max: >2cm thick
section of tumour remaining after surgery. Two thirds of patients
were recorded as having nil or minimal levels of residual disease following surgery, with
the remainder having moderate (1-2cm) or maximum (>2cm) levels, indicating that the
surgeon was unable to adequately debulk the tumour.
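The residual disease categories above can be expressed as a small lookup function. This is a minimal sketch only: the function name is hypothetical, and because the stated ranges share their end points (0-1cm, 1-2cm), the assignment of exactly 1cm to "min" and exactly 2cm to "mod" is an assumption rather than something the text specifies.

```python
def residual_disease_category(thickness_cm: float) -> str:
    """Map residual tumour thickness (cm) to nil/min/mod/max.

    Boundary handling (1cm -> "min", 2cm -> "mod") is an assumption;
    the source only gives the ranges 0cm, 0-1cm, 1-2cm and >2cm.
    """
    if thickness_cm < 0:
        raise ValueError("thickness must be non-negative")
    if thickness_cm == 0:
        return "nil"
    if thickness_cm <= 1:
        return "min"
    if thickness_cm <= 2:
        return "mod"
    return "max"


if __name__ == "__main__":
    # Hypothetical measurements, not taken from the study cohort.
    for cm in (0, 0.5, 1.5, 3):
        print(cm, residual_disease_category(cm))
```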
In summary, the average patient in this study had grade 3, stage IIIC EOC with
between 0 and 1cm of residual disease after surgery and lived for approximately 15
months after diagnosis. By comparison, in the Spentzos et al. study of gene expression
profiles of EOC, the most common clinical profile was grade 3, stage III, residual disease
of less than 1cm with a median overall survival of 49 months (Spentzos et al., 2004).
Figure 4-1: Descriptive statistical analysis of patient cohort. (A) Tumour grade summary of 26 sample cohort – over 50% were grade three (Key: 1: the least malignant, with well-differentiated cells, 2: intermediate, with moderately differentiated cells, 3: the most malignant, with poorly differentiated cells, 9: unknown). (B) Pathology stage (FIGO) of tumours in cohort, with stage III tumours representing the bulk of this cohort. (C) Residual disease summary – nil: 0cm, min: 0-1cm, mod: 1-2cm, max: >2cm thick section of tumour remaining after debulking surgery. (D) Box plot of patient survival times (months). Asterisks correspond to outlier cases RBH 94.019 (101 months survival) and RBH 94.116 (116 months survival). Dashed horizontal line indicates population median survival time (50 months) for optimally debulked stage III EOC (Gadducci et al., 2000; Markman et al., 2001; Piccart et al., 2000).
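[Figure 4-1 panels: (A) tumour grade, categories Zero, One, Two, Three, Nine; (B) FIGO stage, categories II, IIIA, IIIB, IIIC, IVB; (C) residual disease, categories Nil, Min, Mod, Max; (D) survival time in months, axis 0-120.]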
4.2.2.1. Analysis of interaction between survival time and debulk status/residual disease
Due to the large body of evidence concerning the relationship between residual disease
and patient prognosis, the interaction of these two variables in the cohort was analysed.
By plotting the sample cohort in order of increasing survival, and then assigning each
class of residual disease a shading code, as shown in Figure 4-2A, a trend agreeing with
the literature was observed. Patients with higher levels of disease present after surgery
also appeared to have shorter survival periods. All patients with moderate or maximum
levels of residual disease experienced survival times equal to or less than the cohort’s
median survival of 15 months.
To determine if a statistically significant relationship existed, patients were grouped into
two categories: (i) nil (0cm) or minimal (0-1cm) residual disease and (ii) moderate
(1-2cm) or maximum (>2cm) residual disease. A box plot of the survival times for each
group (Figure 4-2B) shows the overall shorter and less varied range of survival times seen
in group (ii) compared to group (i). Whilst a one way ANOVA revealed that this
difference was not statistically significant (p=0.149), there is a trend towards shorter
survival time correlating with higher levels of residual disease, agreeing with the current
literature on the impact of this variable as previously described (Hoskins et al., 1992;
Hoskins et al., 1994). Had the cohort available for this study been larger it can be
hypothesised that this trend may have reached a statistically significant level.
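The one way ANOVA used for this comparison can be sketched as follows. This is an illustrative implementation under stated assumptions, not the Minitab procedure itself: the function name and the survival times are hypothetical, and only the F statistic and degrees of freedom are computed (the p-value would additionally require the F distribution).

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of numeric groups."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of individual values around their own group mean.
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


if __name__ == "__main__":
    # Hypothetical survival times in months, NOT the thesis data.
    group_i = [10, 15, 22, 30, 48, 101]   # nil or minimal residual disease
    group_ii = [4, 8, 12, 15]             # moderate or maximum residual disease
    f_stat, df_b, df_w = one_way_anova_f([group_i, group_ii])
    print(f"F({df_b}, {df_w}) = {f_stat:.2f}")
```

With only two groups, this F statistic is the square of the two-sample t statistic, so the comparison is equivalent to an equal-variance t-test.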
Figure 4-2: (A) Samples ordered by increasing survival time and coloured by residual disease status (B) Box plot of patient survival times grouped by residual disease categories. Categories: i – nil or minimal (0-1cm); ii – moderate or maximum (>1cm). Difference between two groups was not statistically significant (p=0.149) although the trend of shorter survival time associated with greater residual disease can be observed. Red dots indicate mean survival time for each group.
[Figure 4-2 panels: (A) months survival (axis 0-120) plotted by patient ID, shaded by residual disease: Nil, Min, Mod, Max; (B) months survival (axis 0-100) by residual disease code, groups (i) and (ii).]
4.2.3. Processing of microarray data prior to investigation of molecular signatures of patient survival
Microarray image sets generated by an Agilent Microarray Scanner BA for the 26 EOC
specimens were processed using GenePix image analysis software. The standard
algorithm for identifying poor quality and unreliable features was applied as described in
Materials and Methods section 2.4.2. MMT scores were calculated following a standard
intensity-dependent lowess normalisation. All were within the range of acceptable levels
of spatial bias (100-200); therefore no further normalisation was applied, in the interest of
preserving the dynamic range present in the data and avoiding unnecessary manipulation.
Genes were then filtered to remove those missing values in 50% or more of the sample
set or having a log-ratio variation p-value > 0.01. Genes excluded by these criteria are
assumed not to contribute to the molecular phenotype of interest as described in Materials
and Methods section 2.4.3. After this filtering, 474 genes remained, corresponding to a
95.5% reduction. Because this filter resulted in the exclusion of such a high proportion of
the total gene set, it was decided to use a less strict method for excluding genes with
lower (unsupervised) variation across the dataset, in order to supply the downstream
analyses with an adequately large list of genes. To achieve this, a minimum fold change
filter (1.5-fold change in either direction in >20% of samples) was used, which resulted in
a list of 4508 candidate genes for further analysis (42% of the total clone set).
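The minimum fold change filter described above can be sketched in a few lines. This is a hedged illustration rather than the software actually used: the function name and parameters are hypothetical, expression values are assumed to be log2 ratios with `None` marking a missing value, and the fraction of changed samples is assumed to be taken over all samples rather than only those with a value present.

```python
import math


def passes_fold_change_filter(log2_ratios, fold=1.5, min_fraction=0.20,
                              max_missing_fraction=0.50):
    """Return True if a gene passes the missing-value and fold-change filters.

    log2_ratios: one value per sample; None marks a missing value.
    """
    n = len(log2_ratios)
    if n == 0:
        return False
    present = [r for r in log2_ratios if r is not None]
    # Exclude genes missing in 50% or more of the sample set.
    if (n - len(present)) / n >= max_missing_fraction:
        return False
    # A 1.5-fold change in either direction is |log2 ratio| >= log2(1.5).
    threshold = math.log2(fold)
    n_changed = sum(1 for r in present if abs(r) >= threshold)
    return n_changed / n > min_fraction


if __name__ == "__main__":
    # Hypothetical gene measured in 10 samples, NOT thesis data.
    gene = [1.0, -1.2, 0.9, 0.1, 0.0, -0.2, 0.3, None, 0.1, 0.2]
    print(passes_fold_change_filter(gene))
```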
Ein-Dor et al. (2005) have recently noted that in several published analyses
of breast cancer survival, no single gene was found to have a strong individual correlation
with patient survival. Rather, a large number of genes appeared to have a moderate
relationship with this variable. Thus it was the combination of these genes which was able
to accurately predict patient survival. This observation supports the decision to
compromise on the magnitude of individual gene variation in order to increase the
number of genes available for predictive analysis.
4.2.4. Identification of genes differentially expressed between patient survival groups
Several methods were used to identify genes with expression patterns that correlate with
length of survival, as described in Materials and Methods section 2.4.8.
• Quantitative trait analysis – identifies genes that have a significant correlation
  with survival time (no. months) for each patient, using either Spearman or
  Pearson correlation coefficients
• Class comparison – T or F-tests between patients grouped into the following
  classes, including the variable of residual disease in the statistical calculations:
  o two survival groups (median)
  o three survival groups based on approximately 1/3 of the cohort
    represented in each (<12 months, 12-24 months, >24 months)
  o first and last quartiles
• Survival analysis – specifically designed algorithm for identifying genes that are
  predictive of patient survival based on Cox’s proportional hazards model and
  censoring of any patients still alive at the time of last follow up (Cox, 1972).
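The survival classes used for class comparison can be assigned as sketched below. The function names are hypothetical, and the handling of boundary values (exactly 12 or 24 months placed in the middle class, values equal to the median placed in the shorter class) is an assumption, since the text does not state how ties at the cut-offs were treated.

```python
from statistics import median


def three_survival_groups(months: float) -> str:
    """Assign one of the three survival classes (<12, 12-24, >24 months)."""
    if months < 12:
        return "<12 months"
    if months <= 24:
        return "12-24 months"
    return ">24 months"


def median_split(survival_months):
    """Label each patient 'short' or 'long' relative to the cohort median."""
    cutoff = median(survival_months)
    return ["short" if m <= cutoff else "long" for m in survival_months]


if __name__ == "__main__":
    # Hypothetical survival times in months, NOT the thesis data.
    cohort = [5, 10, 15, 20, 30, 101]
    print([three_survival_groups(m) for m in cohort])
    print(median_split(cohort))
```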
The quantitative trait analysis has the advantage of using a Spearman, gene-rank based,
correlation measure to assess the relationship between gene expression and the survival
variable. Using the rank level of genes rather than absolute expression ratios can reduce
variation within a dataset by minimising the impact of outliers on the identification of
genes with significant correlation to the survival variable (Broberg, 2003; Troyanskaya et
al., 2002).
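The effect of using ranks rather than absolute values can be illustrated with a small sketch. The helper names are hypothetical and the data are invented; the Spearman coefficient is computed here as the Pearson correlation of average ranks, which is the standard definition and handles tied values.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    # Hypothetical data with one extreme survival time (an outlier).
    expression = [0.2, 0.4, 0.6, 0.8, 1.0]
    survival = [5, 9, 14, 20, 116]
    print("Pearson :", round(pearson(expression, survival), 3))
    print("Spearman:", round(spearman(expression, survival), 3))
```

Because the outlier only changes its own rank position, not the ordering, the rank-based measure is unaffected by how extreme the value is.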
By combining the expression profiles of patients into discrete categories associated with
length of survival, the class comparison method allows the use of T or F-tests
(ANOVA) to identify genes with large differences in expression between survival classes,
relative to the variation within each class.
A survival analysis based on Cox’s proportional hazards model is a semi-parametric
method which makes no assumptions about the shape of the baseline hazard, although it
does assume that hazards remain proportional over time (Cox, 1972). This approach allows
the inclusion of data from patients still
alive at the time of last follow up; however, as all patients included in the cohort for this
chapter were deceased at the time of last follow up, it was found that this method did not
differ greatly from the quantitative trait method.
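The way censoring enters Cox's model can be seen from the partial log-likelihood, sketched below for a single covariate such as one gene's expression ratio. This is an illustrative sketch, not the algorithm used by the analysis software: the function name is hypothetical, and tied event times are handled in the simplest (Breslow-style) manner.

```python
import math


def cox_partial_log_likelihood(beta, times, events, x):
    """Cox partial log-likelihood for one covariate.

    times:  survival or follow-up time per patient
    events: 1 if the patient died at that time, 0 if censored (still alive)
    x:      covariate value per patient (e.g. a log expression ratio)
    """
    ll = 0.0
    for i, (t_i, d_i) in enumerate(zip(times, events)):
        if not d_i:
            continue  # censored patients contribute only to risk sets
        # Risk set: everyone still under observation at time t_i.
        risk = sum(math.exp(beta * x[j])
                   for j, t_j in enumerate(times) if t_j >= t_i)
        ll += beta * x[i] - math.log(risk)
    return ll


if __name__ == "__main__":
    # Hypothetical data, NOT the thesis cohort.
    times = [5, 10, 15, 20]
    events = [1, 1, 1, 1]          # all deceased, as in this chapter's cohort
    expression = [1.2, 0.8, 0.1, -0.5]
    for beta in (-1.0, 0.0, 1.0):
        print(beta, round(cox_partial_log_likelihood(beta, times, events, expression), 3))
```

When every patient has an observed event, as here, the likelihood depends only on the ordering of survival times, which is consistent with the observation that the method behaved similarly to a rank-based correlation.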
4.2.4.1. Quantitative trait analysis of gene expression data to find genes related to patient survival
Samples were analysed for genes with significant correlation to survival time (months). A
Spearman correlation was used to assess the relationship between each gene and the
survival variable, with a univariate significance threshold of 0.001. The
maximum number and proportion of false positives were set at 10 and 0.1 respectively.
This was achieved through the use of multivariate permutation testing, based on 1000
random permutations of the dataset, to control the false discovery rate. This provides 90%
confidence that the list generated contains no more than 10% false discoveries.
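The idea behind permutation-based assessment can be sketched as follows. This is a simplified global test, not the multivariate permutation procedure of the analysis software: all names are hypothetical, a correlation cut-off stands in for the univariate p-value threshold, and the rank helper assumes no tied values. It asks how often randomly shuffled survival times yield at least as many "significant" genes as the real ordering.

```python
import random


def ranks(values):
    """1-based ranks; assumes no tied values (a simplification)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def spearman_no_ties(x, y):
    """Spearman correlation via the rank-difference formula (no ties)."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n * (n * n - 1))


def count_hits(expression, survival, rho_cutoff):
    """Number of genes whose |Spearman rho| with survival meets the cut-off."""
    return sum(1 for gene in expression
               if abs(spearman_no_ties(gene, survival)) >= rho_cutoff)


def global_permutation_p(expression, survival, rho_cutoff, n_perm=1000, seed=0):
    """Return (observed hits, proportion of permutations with >= that many)."""
    rng = random.Random(seed)
    observed = count_hits(expression, survival, rho_cutoff)
    shuffled = list(survival)
    at_least = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break any real gene-survival association
        if count_hits(expression, shuffled, rho_cutoff) >= observed:
            at_least += 1
    return observed, at_least / n_perm


if __name__ == "__main__":
    # Hypothetical expression matrix (genes x patients), NOT thesis data.
    survival = [3, 8, 12, 15, 22, 40]
    expression = [
        [0.1, 0.3, 0.2, 0.6, 0.9, 1.4],
        [1.0, 0.2, 0.7, 0.4, 0.9, 0.3],
        [0.5, 0.1, 0.8, 0.3, 0.2, 0.6],
    ]
    print(global_permutation_p(expression, survival, rho_cutoff=0.9, n_perm=200))
```

A large returned proportion, such as the p=0.825 reported below for the real data, indicates that the observed number of genes is no greater than would be expected by chance.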
Two genes were identified at the 0.001 significance level, which was not statistically
significantly larger than that expected by chance, as measured by permutation analysis
(p=0.825). Genes identified by this analysis, up to the maximum number of false
positives permitted, are shown in Table 4-2 although the relationship between these ten
genes and the months-survival variable is not statistically significant on a multivariate
level, due to the large number of repeated tests carried out.
The list includes a gene involved in apoptosis (PHLDA2), several genes related to endoplasmic
reticulum function (SEC22L3, GALNT1 and CHST2), the negative regulator of
transcription ZNF189 and a gene involved in muscle development and contraction
regulation, TNNT1. Of these ten genes, GALNT1 has been identified as being over
expressed in colorectal cancer compared to normal tissue (Kohsaki et al., 2000; White et
al., 1995). PHLDA2, whose expression was negatively correlated with survival in this study, has been
mapped to a chromosomal region frequently altered in breast, lung and ovarian cancer
(Hu et al., 1997). GNB5 over expression has been identified as predictive of lymph node
metastasis in oesophageal squamous cell carcinoma (Jones et al., 1998; Kan et al., 2004),
and TNNT1 was contained in a gene set used to identify small round blue-cell tumours
(Barton et al., 1999; Khan et al., 2001).
To visualise the strength of the relationship between the two significantly correlated
genes (at the univariate level) and length of survival, scatter plots were constructed as
shown in Figure 4-3. SCOC is under expressed, whilst PPAP2B is over expressed in
patients with shorter survival times.
Limited information exists about the function of SCOC. Although widely expressed in
human tissues, it does not appear to have been previously implicated in EOC. The
primary function of this gene is to act as a binding partner for the gene ARL1 which
regulates intracellular vesicular membrane trafficking (Van Valkenburgh et al., 2001).
The VEGF-inducible gene PPAP2B, also known as VCIP, has been demonstrated to
function in the regulation of cell-cell interaction and aggregation. Cell line studies have
demonstrated that the recombinant expression of this gene leads to greater cell adhesion
and spreading in endothelial cells (Humtsoe et al., 2003).
Whilst observing two genes correlated with patient survival is not
statistically significant given the large number of genes tested, inspection
of these plots reveals the potential of this method to identify interesting and
clinically relevant genes from microarray expression data. Removal of the two outlier
cases (RBH 94.019 with 101 months survival and RBH 94.116 with 116 months survival)
in the second half of this figure further shows the association of these two genes with
length of survival for the remaining 24 patients.
This method did not take into account the important factor of residual disease in
determining the significance of each gene’s relationship with survival. Therefore an
ANOVA was also performed in which data from SCOC and PPAP2B expression levels
were analysed in regard to the residual disease categories corresponding to each patient.
No significant difference in the expression of these two genes was detected between
patient residual disease categories (P >0.05 for all tests). This suggests that the expression
of SCOC and PPAP2B correlates with the length of survival of these patients,
independent of the amount of residual disease present, with the important caveat that,
because of the large number of genes to begin with, such patterns of expression may be
detected by chance alone.
Table 4-2: Genes identified by quantitative trait analysis. The first two are significant at the 0.001 level of the univariate test.

UniGene Name: Short coiled-coil protein
UniGene Symbol: SCOC
Spearman correlation coefficient: -0.684
P-value: 0.0004424
Summary of function:
- Little functional information known. Binds to GTPases in the ARF family (Van Valkenburgh et al., 2001)
Summary of gene ontology memberships: N/A

UniGene Name: Phosphatidic acid phosphatase type 2B
UniGene Symbol: PPAP2B
Spearman correlation coefficient: 0.811
P-value: 0.0007129
Summary of function:
- Membrane glycoprotein localized at the cell plasma membrane.
- Expression is enhanced by epidermal growth factor and is known to promote growth and motility in EOC (Kai et al., 1997; Luquain et al., 2003)
Summary of gene ontology memberships: germ cell migration | hydrolase activity | integral to membrane | lipid metabolism

UniGene Name: Carbohydrate (N-acetylglucosamine-6-O) sulfotransferase 2
UniGene Symbol: CHST2
Spearman correlation coefficient: 0.774
P-value: 0.00108
Summary of function:
- Involved in the inflammatory response of vascular endothelial cells
- Sulfination of the leukocyte adhesion molecule L-selectin (Li and Tedder, 1999)
Summary of gene ontology memberships: carbohydrate metabolism | inflammatory response | integral to membrane | sulfotransferase activity

UniGene Name: Polypeptide N-acetylgalactosaminyl-transferase 1 (GalNAc-T1)
UniGene Symbol: GALNT1
Spearman correlation coefficient: -0.723
P-value: 0.0021738
Summary of function:
- Initiates mucin-type O-linked glycosylation in the Golgi apparatus
- Expressed in varied levels in colorectal cancer, trend towards higher expression in tumour tissue compared to normal. (Kohsaki et al., 2000; White et al., 1995)
Summary of gene ontology memberships: O-linked glycosylation | integral to membrane | manganese ion binding

UniGene Name: Zinc finger protein 189
UniGene Symbol: ZNF189
Spearman correlation coefficient: -0.672
P-value: 0.0040614
Summary of function:
- Maps to chromosomal region commonly deleted in bladder cancer (Odeberg et al., 1998)
Summary of gene ontology memberships: Negative regulation of transcription from RNA | nucleus | zinc ion binding

UniGene Name: SWI/SNF related, matrix associated, actin dependent regulator of chromatin, subfamily a-like 1
UniGene Symbol: SMARCAL1
Spearman correlation coefficient: -0.571
P-value: 0.0042025
Summary of function:
- Has helicase and ATPase activities
- Regulation of transcription of certain genes by altering chromatin structure
- Mutations in this gene are a cause of a condition associated with T-cell immunodeficiency. (Coleman et al., 2000)
Summary of gene ontology memberships: ATP binding | DNA binding | helicase activity

UniGene Name: Pleckstrin homology-like domain, family A, member 2
UniGene Symbol: PHLDA2
Spearman correlation coefficient: -0.725
P-value: 0.0044573
Summary of function:
- Located at 11p15.5, an important tumour suppressor gene region.
- Alterations associated with lung, ovarian, and breast cancers and potentially involved in regulation of placental growth. (Hu et al., 1997)
Summary of gene ontology memberships: apoptosis | imprinting

UniGene Name: SEC22 vesicle trafficking protein-like 3 (S. cerevisiae)
UniGene Symbol: SEC22L3
Spearman correlation coefficient: -0.699
P-value: 0.0048764
Summary of function:
- Vesicle trafficking proteins, localized at the endoplasmic reticulum
- Down-regulated in diffuse-type gastric cancer (Hasegawa et al., 2002; Tang et al., 1998)
Summary of gene ontology memberships: ER to Golgi transport | integral to membrane

UniGene Name: Guanine nucleotide binding protein, beta 5
UniGene Symbol: GNB5
Spearman correlation coefficient: 0.698
P-value: 0.0048764
Summary of function:
- Identified as predictive of lymph node metastasis in oesophageal squamous cell carcinoma (Jones et al., 1998; Kan et al., 2004)
Summary of gene ontology memberships: G-protein coupled receptor protein signalling pathway

UniGene Name: Troponin T1, skeletal, slow
UniGene Symbol: TNNT1
Spearman correlation coefficient: -0.675
P-value: 0.005147
Summary of function:
- Component of the troponin complex, forming the calcium-sensitive molecular switch that regulates striated muscle contraction
- Identified in analysis of predictive genes for small round blue-cell tumours (Barton et al., 1999; Khan et al., 2001)
Summary of gene ontology memberships: muscle development | regulation of muscle contraction | tropomyosin binding
Figure 4-3: Scatter plots of the two genes significantly correlated with the variable of months-survival at the 0.001 level; (a) SCOC (Spearman correlation = -0.684) and (b) PPAP2B (0.811). Panels (c) and (d) show the correlation of these genes after excluding patients identified as survival time outliers: RBH 94.019 (101 months survival) and RBH 94.116 (116 months survival).
[Figure 4-3 panels: SCOC expression vs. survival (A, C) and PPAP2B expression vs. survival (B, D); axes are expression ratio and survival time (months). R2 values shown on the panels: 0.469, 0.151, 0.233 and 0.470.]
4.2.4.2. F-test class comparison
This analysis was carried out to identify genes with statistically significant expression
differences between patients grouped into three categories of survival
time: <12 months (n=7), 12-24 months (n=12) and >24 months (n=7). The variable of
residual disease was included as a potential source of gene expression variation in the
data.
A random variance version of the F-test was used because of the small sample sizes
present for each group. The significance level of each univariate test was set at
0.001, with a 90% confidence level for the false discovery rate assessment and a maximum of
10 false positive genes, as per the quantitative trait analysis described above.
Five genes were identified as having significant differential expression between these
classes of tumours at the 0.001 significance level. An additional five genes are included
whose expression difference between the groups was approaching significance at the
0.001 level defined. These are listed in Table 4-3, along with functional summaries and
gene ontology information. The probability of obtaining five genes from a total of 4,508
F-tests, if there are no real differences between classes (i.e. Null hypothesis), is 0.577.
This is based on 1000 random permutations of the dataset with the number of significant
genes between randomised classes calculated for each permutation. Again this reveals
that it is not possible to state whether the genes identified by this approach were selected
by chance or they represent genuine differences between these groups of patients.
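A rough calculation shows why five genes is unremarkable here. Assuming, for illustration only, that all 4,508 univariate tests are true nulls and each is assessed at a significance level of 0.001, the expected number of genes passing the threshold by chance alone is:

```latex
\[
\mathbb{E}[V] \;=\; m\,\alpha \;=\; 4508 \times 0.001 \;\approx\; 4.5
\]
```

Here $m$ is the number of tests, $\alpha$ the univariate threshold and $V$ the number of false positives (symbols introduced for this note). Observing five genes is therefore close to the chance expectation, in keeping with the permutation-based probability of 0.577.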
Supervised hierarchical clustering of the ten genes (Figure 4-4) reveals imperfect
segregation of samples into survival group categories. The first two major branch points
of the dendrogram separate five of the seven shortest-term survivors from the
remainder of the cohort, but there is no real discrimination between the 12-24 and >24
month categories in this representation of the data.
Closer inspection of the genes selected and the associated literature reveals a broad range
of molecular functions represented. These include NAPSA (generating the most
significant p-value of all genes identified from this analysis), which increases the cell-
surface expression of E-cadherin protein on breast cancer cells and is a novel therapeutic
target presently being investigated (Tatnell et al., 1998; Thibout et al., 1999). NAPSA is
more highly expressed in the longer-term survival groups in this dataset.
Also included was BRF1 (expressed at lower levels in the longer-term survival groups),
which degrades AU-rich element containing mRNA. This gene is thought to facilitate
oncogenesis by regulating the decay of mRNA of the proliferation-associated genes IL3,
GM-CSF and TNF (Stoecklin et al., 2002; Wang and Roeder, 1995).
Another gene identified, ACP6, increased in expression in association with longer
survival, has been demonstrated to mediate cell proliferation and protect cells from
apoptosis when treated in vitro with cisplatin, a commonly used chemotherapeutic agent
(Hiroyama and Takenawa, 1999; Mackeigan et al., 2005). The protein encoded by this
gene has been suggested as a novel biomarker for EOC as it is detected at significantly
higher levels in plasma from patients with EOC compared to normal controls (Xu et al.,
1998).
MAPK11 was also selected, which has been found to be amplified and over expressed in
EOC (Benetkiewicz et al., 2005; Goedert et al., 1997) and is important for a range of
processes such as proliferation and cellular differentiation. This gene was found to be
expressed at higher levels in the long term survival group in this dataset.
Finally, expression of the oncogene ETV3, also known as PE1, was also found to be
significantly different between patient groups, although it was expressed at its highest level in
the mid-term survival group, making the result difficult to interpret in terms of a linear
relationship to the survival variable. This gene has an anti-proliferative effect by blocking
the oncogenic RAS-pathway of genes, suggesting it may be upregulated in those patients
with longer survival times (Klappacher et al., 2002).
Table 4-3: Genes identified by random-variance F-test with residual disease category used as a blocking variable. Groups used for ANOVA were (a) <12 months (b) 12-24 months and (c) >24 months survival.

UniGene Name: Napsin A aspartic peptidase
UniGene Symbol: NAPSA
P-value: 0.000120
Mean ratios (<12 months; 12-24 months; >24 months): 0.39; 0.633; 0.669
Summary of function:
- Important for correct folding, targeting, and control of the activation of aspartic proteinase zymogens.
- Increases expression of E-cadherin on surface of breast cancer cells
- Novel therapeutic agent, clinical trials underway. (Tatnell et al., 1998; Thibout et al., 1999)
Summary of gene ontology memberships: pepsin A activity | proteolysis and peptidolysis

UniGene Name: Family with sequence similarity 50, member B
UniGene Symbol: FAM50B
P-value: 0.000429
Mean ratios (<12 months; 12-24 months; >24 months): 0.032; 0.653; 0.876
Summary of function:
- Functional retroposon expressed in wide range of tissues (Sedlacek et al., 1999)
Summary of gene ontology memberships: nucleus

UniGene Name: Acid phosphatase 6, lysophosphatidic
UniGene Symbol: ACP6
P-value: 0.000478
Mean ratios (<12 months; 12-24 months; >24 months): 0.205; 0.692; 0.647
Summary of function:
- Increased basal cell survival and provides significant protection from cisplatin-induced apoptosis in HeLa cell studies (Hiroyama and Takenawa, 1999; Mackeigan et al., 2005)
Summary of gene ontology memberships: acid phosphatase activity

UniGene Name: Hypothetical protein MGC15875
UniGene Symbol: MGC15875
P-value: 0.000639
Mean ratios (<12 months; 12-24 months; >24 months): 0.528; 0.443; 0.752
Summary of function:
- Sequences as part of National Institutes of Health Mammalian Gene Collection. (Strausberg et al., 2002)
Summary of gene ontology memberships: mitochondrion | pyridoxal phosphate binding | transaminase activity | transferase activity

UniGene Name: Ets variant gene 3
UniGene Symbol: ETV3
P-value: 0.000639
Mean ratios (<12 months; 12-24 months; >24 months): 0.284; 0.400; 0.199
Summary of function:
- An Ets repressor suggested to contribute to growth arrest during terminal macrophage differentiation
- Represses Ets target genes involved in Ras-dependent proliferation (Klemsz et al., 1994; Sawka-Verhelle et al., 2004)
Summary of gene ontology memberships: nucleus | regulation of transcription, DNA-dependent | transcription factor activity

UniGene Name: BRF1 homolog, subunit of RNA polymerase III transcription initiation factor IIIB (S. cerevisiae)
UniGene Symbol: BRF1
P-value: 0.00112
Mean ratios (<12 months; 12-24 months; >24 months): 0.252; 0.157; 0.148
Summary of function:
- Central role in transcription initiation by RNA polymerase III on genes encoding tRNA, 5S rRNA, and other structural RNAs.
- Promotes degradation of ARE (AU-rich element)-containing mRNA; important in cell activation and oncogenesis (Stoecklin et al., 2002; Wang and Roeder, 1995)
Summary of gene ontology memberships: RNA polymerase III transcription factor activity | tRNA transcription | zinc ion binding

UniGene Name: Williams Beuren syndrome chromosome region 20C
UniGene Symbol: WBSCR20C
P-value: 0.00122
Mean ratios (<12 months; 12-24 months; >24 months): 0.294; 0.270; 0.256
Summary of function:
- Deleted in Williams syndrome, a multi-system developmental disorder caused by the deletion of contiguous genes at 7q11.23. (Doll and Grzeschik, 2001)
Summary of gene ontology memberships: N/A

UniGene Name: Mitogen-activated protein kinase 11
UniGene Symbol: MAPK11
P-value: 0.00138
Mean ratios (<12 months; 12-24 months; >24 months): 0.186; 0.120; 0.519
Summary of function:
- Member of the MAP kinase family, which is an integration point for multiple biochemical signals
- Involved in proliferation, differentiation, transcription regulation and development
- Identified as amplified and differentially expressed in EOC. (Benetkiewicz et al., 2005; Goedert et al., 1997)
Summary of gene ontology memberships: MAP kinase activity | protein kinase cascade | response to stress | signal transduction | transferase activity

UniGene Name: A kinase (PRKA) anchor protein 8
UniGene Symbol: AKAP8
P-value: 0.00169
Mean ratios (<12 months; 12-24 months; >24 months): 0.414; 0.255; 0.271
Summary of function:
- Binds to the regulatory subunit of protein kinase (PKA) and confines the holoenzyme to discrete locations within the cell.
- Has a cell cycle-dependent interaction with the RII subunit of PKA. (Eide et al., 1998)
Summary of gene ontology memberships: DNA binding | membrane | mitosis | protein kinase A binding | signal transduction | sugar porter activity | transport | zinc ion binding

UniGene Name: Double C2-like domains, beta
UniGene Symbol: DOC2B
P-value: 0.00169
Mean ratios (<12 months; 12-24 months; >24 months): 0.531; 1.413; 0.508
Summary of function:
- Interacts with Ca2+ and phospholipid (Orita et al., 1995)
Summary of gene ontology memberships: N/A
Figure 4-4: Hierarchical clustering using 10 genes identified by a random variance F-test of 26 EOC microarray profiles grouped into three survival-length categories. All but one case of the shortest-term survivors are clustered away from the mid and long term groups at the first major branch point in the dendrogram. Most genes appear to be down regulated in this short term group suggesting that a reduction or loss of their expression may confer a more aggressive tumour phenotype. Gray squares indicate absent expression values for these genes, which may have influenced the clustering.
[Figure 4-4 heat map: columns labelled by survival category (<12 months, 12-24 months, >24 months); rows labelled NAP1, AI478508, D6S2654, MGC1587, ETV3, AKAP8, ACP6, BRF1, WBSCR20C, MAPK11; colour scale shows expression ratio from 0.3 to 3.0.]
4.2.4.3. Genes with significant expression differences between survival groups independent of residual disease status
In an attempt to increase the chances of identifying genes with statistically significant
expression between groups of varying survival times, a two-class ANOVA was carried
out between the <12 month (n=7) and >24 month (n=7) survival groups present in the
cohort. This approach of comparing patients representing the outer regions of the survival
distribution is similar to that used by Spentzos et al (Spentzos et al., 2004).
Five of the seven patients in the <12 month group had either moderate or maximum
residual disease, compared to only minimal or nil residual disease in those patients who
lived for more than 24 months after diagnosis. This factor was taken into consideration
during the T-test so as to identify genes that were differentially expressed between survival
groups, independent of the level of residual disease present.
In summary, this approach identified 27 genes significant at the 0.001 level of univariate
testing. Four of these genes overlapped with the previous 3-group ANOVA results
(NAP1, MGC15875, ETV3 and ACP6). Again, 1000 permutations of the dataset were
carried out to assess the significance of observing this number of differentially expressed
genes in a dataset of this size. The p-value for this assessment was not significant
(p=0.194), although this was an improvement on previous attempts at identifying a
molecular signature of patient survival using other approaches as described. The gene
identities, along with functional and ontology information are listed in Table 4-4.
Standard hierarchical clustering using the 27 genes is shown in Figure 4-5. Most of the
genes appear upregulated in both classes of tumours when the data is median-centred,
making the differences between patient groups difficult to visualise with this technique.
Whilst the mean fold change differences between the two groups are small, their
consistency within groups resulted in a high level of statistical significance.
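The fold differences reported for this comparison are ratios of group geometric means, which can be checked with a short sketch. The function names are hypothetical; the three pairs of geometric means are those listed for LTBP2, ZNF161 and ODZ4 in Table 4-4, and the ratio is assumed to be taken as the <12 month group over the >24 month group.

```python
import math


def geometric_mean(ratios):
    """Geometric mean of positive expression ratios (mean taken on the log scale)."""
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


def fold_difference(geo_mean_short, geo_mean_long):
    """Ratio of the <12 month geometric mean to the >24 month geometric mean."""
    return geo_mean_short / geo_mean_long


if __name__ == "__main__":
    # Geometric means (<12 months, >24 months) as listed in Table 4-4.
    table_4_4 = {
        "LTBP2": (0.07, 0.233),
        "ZNF161": (0.21, 0.311),
        "ODZ4": (0.583, 0.813),
    }
    for symbol, (short, long_) in table_4_4.items():
        print(symbol, round(fold_difference(short, long_), 3))
```

The printed ratios can be compared directly against the fold difference column of Table 4-4.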
Gene ontology analysis was carried out on these 27 genes using the EASE method
(Hosack et al., 2003), with the total list of genes exhibiting significant variation in an
unsupervised 20% of the dataset used as a reference list. Reflecting the small number
of genes in this list and their varied range of individual functions as summarised in
Table 4-4, no significantly represented gene ontologies were identified.
Using the Fisher's exact method of assessing significance, which does not take into
account the potential for co-variation of gene expression, one significantly represented
ontology class was identified. This was the transcription regulator activity (p=0.03)
category, represented in this gene list by ETV3, ESR1, ZNF161 and SOX13. Of these, ZNF161 and
ESR1 are expressed at higher levels in the longer term survival group, whereas
ETV3 and SOX13 are expressed at lower levels in the longer term survival group.
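The one-sided Fisher's exact test for over-representation of an ontology class is equivalent to an upper-tail hypergeometric probability, sketched below. The function name and the counts in the example are hypothetical, since the number of reference-list genes annotated to the transcription regulator activity category is not stated here.

```python
from math import comb


def hypergeom_upper_tail(k, n, K, N):
    """P(X >= k): probability of seeing k or more category genes in a list.

    k: category genes observed in the gene list
    n: size of the gene list
    K: category genes in the reference list
    N: size of the reference list
    """
    total = comb(N, n)
    upper = min(n, K)
    return sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, upper + 1)) / total


if __name__ == "__main__":
    # Hypothetical counts: 4 of 27 listed genes fall in a category that
    # contains 200 of the 4508 reference genes (K=200 is an assumption).
    print(hypergeom_upper_tail(k=4, n=27, K=200, N=4508))
```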
Genes in this ontology play a role in regulation of transcription, for example by
interacting with a DNA-binding factor, or binding a promoter or enhancer DNA
sequence. Loss of the normal controls over DNA replication is a process associated with
tumourigenesis and disease progression. Therefore, the relative down regulation of most
genes in this ontology within the short term survival group may reflect this occurrence of
aberrant DNA replication, resulting in a more aggressive phenotype and shorter survival
times.
Table 4-4: Genes differentially expressed between patients with either <12 months or >24 months survival at the 0.001 significance level. Genes are sorted by significance of differential expression between the two survival groups. Several genes involved in calcium transport are observed as well as genes implicated in the progression of a number of other cancer types.

Each entry gives the UniGene symbol and UniGene name, followed by the P-value, the geometric mean (<12 months; >24 months), the fold difference of geometric means, a summary of function and a summary of gene ontology memberships.

LTBP2: Latent transforming growth factor beta binding protein 2
P-value: p < 1.0 x 10^-7
Geom. mean (<12 months; >24 months): 0.07; 0.233
Fold difference of geom. means: 0.3
Summary of function:
- Extracellular matrix protein with multi-domain structure.
- Possesses unique regions similar to the fibrillins.
- Multiple functions: member of the TGF-beta latent complex, structural component of microfibrils, and role in cell adhesion. (Moren et al., 1994)
Summary of gene ontology memberships: calcium ion binding | extracellular matrix (sensu Metazoa) | growth factor binding | protein secretion | regulation of cell cycle | transforming growth factor beta receptor signalling pathway

ZNF161: Zinc finger protein 161
P-value: p < 1.0 x 10^-7
Geom. mean (<12 months; >24 months): 0.21; 0.311
Fold difference of geom. means: 0.675
Summary of function:
- Involved in both normal and abnormal cellular proliferation and differentiation.
- Possible transcription factor, binds to the CT/GC-rich region of the interleukin-3 promoter and mediates tax transactivation of IL-3 (Koyano-Nakagawa et al., 1994)
Summary of gene ontology memberships: DNA binding | cellular defense response | regulation of transcription from RNA polymerase II promoter | zinc ion binding

ODZ4: Odz, odd Oz/ten-m homolog 4 (Drosophila)
P-value: p < 1.0 x 10^-7
Geom. mean (<12 months; >24 months): 0.583; 0.813
Fold difference of geom. means: 0.717
Summary of function:
- Type II transmembrane molecule
- Chromosomal translocation that leads to the fusion of DOC4 and HGL, on chromosomes 11 and 8 in breast cancer may lead to activation of ErbB signalling through the production of an autocrine ligand (Ben-Zur et al., 2000)
Summary of gene ontology memberships: N/A

ESR1: Estrogen receptor 1
P-value: p < 1.0 x 10^-7
Geom. mean (<12 months; >24 months): 2.144; 2.359
Fold difference of geom. means: 0.909
Summary of function:
- Ligand-activated transcription factor composed of several domains important for hormone binding, DNA binding, and activation of transcription. (Green et al., 1986)
Summary of gene ontology memberships: DNA binding | cell growth | chromatin remodelling complex | negative regulation of mitosis | steroid hormone receptor activity

SOX13: SRY (sex determining region Y)-box 13
P-value: p < 1.0 x 10^-7
Geom. mean (<12 months; >24 months): 0.517; 0.459
Fold difference of geom. means: 1.126
Summary of function:
- Involved in the regulation of embryonic development and in the determination of cell fate. (Roose et al., 1999)
Summary of gene ontology memberships: morphogenesis | nucleus | regulation of transcription, DNA-dependent

XTP2: BAT2 domain containing 1
P-value: p < 1.0 x 10^-7
Geom. mean (<12 months; >24 months): 0.354; 0.288
Fold difference of geom. means: 1.229
Summary of function:
- Amplified and over expressed in invasive bladder cancer
- Highest expression levels found in ovary (Huang et al., 2002)
Summary of gene ontology memberships: N/A

CROP: Cisplatin resistance-associated overexpressed protein
P-value: p < 1.0 x 10^-7
Geom. mean (<12 months; >24 months): 0.24; 0.179
Fold difference of geom. means: 1.341
Summary of function:
- This protein localizes with a speckled nuclear pattern
- Could be involved in the formation of spliceosome via the RE and RS domains.
- Isolated from cisplatin-resistant cell line (Umehara et al., 2003)
Summary of gene ontology memberships: RNA splicing | apoptosis | nucleus | response to stress

SP100: Nuclear antigen Sp100
P-value: p < 1.0 x 10^-7
Geom. mean (<12 months; >24 months): 0.864; 0.636
Fold difference of geom. means: 1.358
Summary of function:
- Interacts with ETS1 transcription factor
- Inhibits the invasion of breast cancer cells and is induced by Interferon-alpha, shown to inhibit the invasion of cancer cells (Seeler et al., 2001; Yordy et al., 2004)
Summary of gene ontology memberships: DNA binding | chromatin | regulation of transcription, DNA-dependent

LTB: Lymphotoxin beta (TNF superfamily, member 3)
P-value: p < 1.0 x 10^-7
Geom. mean (<12 months; >24 months): 1.274; 0.789
Fold difference of geom. means: 1.615
Summary of function:
- Type II membrane protein of the TNF family.
- Anchors lymphotoxin-alpha to the cell surface through heterotrimer formation.
- LTB is an inducer of the inflammatory response system
- Immune interaction with LTB receptor promotes tumour growth by inducing angiogenesis. (Browning et al., 1993)
Summary of gene ontology memberships: cell-cell signalling | immune response | integral to membrane | membrane | signal transduction | tumour necrosis factor receptor binding

KIAA1240: KIAA1240 protein
P-value: p = 0.0001125
Geom. mean (<12 months; >24 months): 0.754; 1.224
Fold difference of geom. means: 0.616
Summary of function:
- Isolated from brain cDNA libraries.
- No functional information available (Nagase et al., 1999)
Summary of gene ontology memberships: ATP binding | DNA binding | cell cycle | nucleoside-triphosphatase activity

BGLAP: Bone gamma-carboxyglutamate (gla) protein (osteocalcin)
P-value: p = 0.0005158
Geom. mean (<12 months; >24 months): 0.186; 0.412
Fold difference of geom. means: 0.451
Summary of function:
- Highly conserved bone-specific osteoblast-synthesised protein
- May be involved in calcium phosphate deposition in psammoma bodies of ovarian serous papillary cystadenocarcinomas, associated with cellular degradation. (Raymond et al., 1999)
Summary of gene ontology memberships: calcium ion binding | cell adhesion | hydroxyapatite binding | odontogenesis | regulation of bone mineralization

ACP6: Acid phosphatase 6, lysophosphatidic
P-value: p = 0.0007346
Geom. mean (<12 months; >24 months): 0.205; 0.647
Fold difference of geom. means: 0.317
Summary of function:
- Increased basal cell survival and provides significant protection from cisplatin-induced apoptosis in HeLa cell studies (Hiroyama and Takenawa, 1999; MacKeigan et al., 2005)
Summary of gene ontology memberships: acid phosphatase activity

FLJ20152: Hypothetical protein FLJ20152
P-value: p = 0.0007346
Geom. mean (<12 months; >24 months): 0.213; 0.549
Fold difference of geom. means: 0.388
Summary of function:
- Sequences as part of National Institutes of Health Mammalian Gene Collection. (Strausberg et al., 2002)
Summary of gene ontology memberships: N/A

NAPSA: Napsin A aspartic peptidase
P-value: p = 0.0007346
Geom. mean (<12 months; >24 months): 0.39; 0.669
Fold difference of geom. means: 0.583
Summary of function:
- Important for correct folding, targeting, and control of the activation of aspartic proteinase zymogens.
- Increases expression of E-cadherin on breast cancer cells
- Potential novel therapeutic agent, clinical trials under way. (Tatnell et al., 1998; Thibout et al., 1999)
Summary of gene ontology memberships: pepsin A activity | peptidase activity | proteolysis and peptidolysis

MGC15875: Hypothetical protein MGC15875
P-value: p = 0.0007346
Geom. mean (<12 months; >24 months): 0.528; 0.752
Fold difference of geom. means: 0.702
Summary of function:
- Sequences as part of National Institutes of Health Mammalian Gene Collection. (Strausberg et al., 2002)
Summary of gene ontology memberships: mitochondrion | pyridoxal phosphate binding | transaminase activity | transferase activity

CARD8: Caspase recruitment domain family, member 8
P-value: p = 0.0007346
Geom. mean (<12 months; >24 months): 0.388; 0.535
Fold difference of geom. means: 0.725
Summary of function:
- Involved in pathways leading to activation of caspases or nuclear factor kappa-B in the context of apoptosis or inflammation, respectively
- Member of CARD family that selectively suppresses apoptosis. Expression correlates with shorter survival time in colorectal cancer (Pathan et al., 2001)
Summary of gene ontology memberships: nucleus | protein binding | regulation of apoptosis

RYR1: Ryanodine receptor 1 (skeletal)
P-value: p = 0.0007346
Geom. mean (<12 months; >24 months): 0.6; 0.725
Fold difference of geom. means: 0.828
Summary of function:
- A calcium release channel of the sarcoplasmic reticulum as well as a bridging structure connecting the sarcoplasmic reticulum and transverse tubule (Phillips et al., 1996)
Summary of gene ontology memberships: B-cell proliferation | apoptosis | calcium channel activity | cell motility | positive regulation of transcription | protein folding | regulation of cell cycle | regulation of endo & exocytosis

REPS2: RALBP1 associated Eps domain containing 2
P-value: p = 0.0007346
Geom. mean (<12 months; >24 months): 0.261; 0.272
Fold difference of geom. means: 0.96
Summary of function:
- Expression in prostate cancer cells induces apoptosis
- Affects drug accumulation lowers drug reflux in cancer cells (Ikeda et al., 1998)
Summary of gene ontology memberships: calcium ion binding | epidermal growth factor receptor signalling pathway | protein complex assembly

GPR20: G protein-coupled receptor 20
P-value: p = 0.0007346
Geom. mean (<12 months; >24 months): 0.633; 0.575
Fold difference of geom. means: 1.101
Summary of function:
- Integral membrane protein highly expressed in gastrointestinal stromal tumours (O'Dowd et al., 1997)
Summary of gene ontology memberships: G-protein coupled receptor protein signalling pathway | integral to plasma membrane

PCDH12: Protocadherin 12
P-value: p = 0.0007346
Geom. mean (<12 months; >24 months): 0.248; 0.223
Fold difference of geom. means: 1.112
Summary of function:
- Cellular adhesion molecule important for cell-cell interactions at interendothelial junctions
- Promotes homotypic calcium dependent aggregation and adhesion and clusters at intercellular junctions. (Ludwig et al., 2000)
Summary of gene ontology memberships: calcium ion binding | cell adhesion | cytoskeleton | homophilic cell adhesion | integral to plasma membrane | neuronal cell recognition

NIFIE14: Seven transmembrane domain protein
P-value: p = 0.0007346
Geom. mean (<12 months; >24 months): 0.374; 0.328
Fold difference of geom. means: 1.14
Summary of function:
- Sequences as part of National Institutes of Health Mammalian Gene Collection. (Strausberg et al., 2002)
Summary of gene ontology memberships: integral to membrane

ADAMTS4: A disintegrin-like and metalloprotease with thrombospondin type 1 motif, 4
P-value: p = 0.0007346
Geom. mean (<12 months; >24 months): 0.581; 0.49
Fold difference of geom. means: 1.186
Summary of function:
- Disintegrin and metalloproteinase with thrombospondin motifs-4, which is a member of the ADAMTS protein family.
- Responsible for the degradation of aggrecan, a major proteoglycan of cartilage (Tortorella et al., 2000)
Summary of gene ontology memberships: extracellular matrix (sensu Metazoa) | integrin-mediated signalling pathway | metalloendopeptidase activity

HSD11B2: Hydroxysteroid (11-beta) dehydrogenase 2
P-value: p = 0.0007346
Geom. mean (<12 months; >24 months): 0.356; 0.293
Fold difference of geom. means: 1.215
Summary of function:
- Plays role in modulating mineralocorticoid and glucocorticoid receptor occupancy by glucocorticoids
- Detected in adult adrenal cortical carcinoma and adenoma (Albiston et al., 1994)
Summary of gene ontology memberships: cell-cell signalling | glucocorticoid biosynthesis | metabolism | microsome | oxidoreductase activity

SCARA3: Scavenger receptor class A, member 3
P-value: p = 0.0007346
Geom. mean (<12 months; >24 months): 1.496; 1.094
Fold difference of geom. means: 1.367
Summary of function:
- A macrophage scavenger receptor-like protein.
- Depletes reactive oxygen species, protecting cells from oxidative stress, which induces its expression (Han et al., 1998)
Summary of gene ontology memberships: UV protection | cytoplasm | phosphate transport | response to oxidative stress | scavenger receptor activity

ETV3: Ets variant gene 3
P-value: p = 0.0007346
Geom. mean (<12 months; >24 months): 0.284; 0.199
Fold difference of geom. means: 1.427
Summary of function:
- An Ets repressor suggested to contribute to growth arrest during terminal macrophage differentiation
- Represses Ets target genes involved in Ras-dependent proliferation (Klemsz et al., 1994; Sawka-Verhelle et al., 2004)
Summary of gene ontology memberships: regulation of transcription, DNA-dependent | transcription factor activity

BTN3A1: Butyrophilin, subfamily 3, member A1
P-value: p = 0.0007346
Geom. mean (<12 months; >24 months): 0.524; 0.299
Fold difference of geom. means: 1.753
Summary of function:
- In the B box family of proteins.
- Involved in cell proliferation and development.
- Sequence analysis suggests a cell surface receptor function (Rhodes et al., 2001; Taylor et al., 1996)
Summary of gene ontology memberships: integral to membrane | lipid metabolism

GPX3: Glutathione peroxidase 3 (plasma)
P-value: p = 0.0007346
Geom. mean (<12 months; >24 months): 1.661; 0.859
Fold difference of geom. means: 1.934
Summary of function:
- Expressed highly in clear-cell ovarian cancer, a highly malignant subtype
- Functions in the protection of cells against oxidative damage. (Hough et al., 2001; Takahashi et al., 1987)
Summary of gene ontology memberships: electron transporter activity | extracellular region | glutathione peroxidase activity | oxidoreductase activity | response to lipid hydroperoxide | soluble fraction
Figure 4-5: Hierarchical cluster generated from EOC specimens associated with either <12 or >24 months survival. The 27 genes used were identified as being differentially expressed between survival classes using a t-test approach with a random variance model. Variation in the level of residual disease remaining after surgery was factored into the model. The univariate level of significance was set at 0.001. The p-value for observing this number of genes from a dataset of this size is p=0.194; therefore some genes present may have been selected by chance alone. This approach was the closest to being statistically significant of all methods tried, and a number of biologically relevant genes were identified, as discussed.
4.2.4.4. Cox’s proportional hazards model survival analysis of gene expression data for predictive model of EOC survival
A Cox proportional hazards model was fitted one gene at a time, and the Wald statistic
(also known as a T statistic) was used to test for an association between the expression
of that gene and the survival time variable. The significance cut-off was set at 0.001 and
1,000 permutations of the dataset were performed to determine the statistical significance
of the genes identified in relation to the size of the available dataset. As all patients in
the cohort were deceased at the last date of follow-up, no censoring of data was required.
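A minimal sketch of a gene-by-gene Cox fit is given below for illustration. It assumes a single covariate (the expression of one gene), no censoring (as was the case for this cohort) and a simple Newton-Raphson maximisation of the partial likelihood. The function name and the example values are hypothetical; this is not the software used for the analysis in this chapter.

```python
import math


def cox_wald(times, expression, max_iter=50, tol=1e-8):
    """Fit a one-covariate Cox model by Newton-Raphson.

    Assumes every patient has an observed event (no censoring).
    Returns (beta, wald_z), where wald_z = beta / standard error.
    """
    beta, info = 0.0, 0.0
    for _ in range(max_iter):
        score, info = 0.0, 0.0
        for t_i, x_i in zip(times, expression):
            # Risk set: patients still alive just before time t_i.
            risk = [x for t, x in zip(times, expression) if t >= t_i]
            weights = [math.exp(beta * x) for x in risk]
            s0 = sum(weights)
            s1 = sum(w * x for w, x in zip(weights, risk))
            s2 = sum(w * x * x for w, x in zip(weights, risk))
            risk_mean = s1 / s0
            score += x_i - risk_mean          # first derivative
            info += s2 / s0 - risk_mean ** 2  # minus second derivative
        step = score / info
        beta += step
        if abs(step) < tol:
            break
    # Standard error is 1/sqrt(information) from the final iteration,
    # so the Wald z statistic is beta * sqrt(information).
    return beta, beta * math.sqrt(info)


# Hypothetical survival times (months) and log expression ratios for one gene.
months = [5, 8, 12, 20, 30, 41]
log_ratio = [1.2, 0.4, 0.9, -0.3, 0.5, -0.8]
beta, wald_z = cox_wald(months, log_ratio)
print(f"beta = {beta:.3f}, Wald z = {wald_z:.3f}")
```

In this toy example higher expression accompanies shorter survival, so the fitted coefficient is positive. In a real analysis the statistic would be converted to a p-value and compared with the 0.001 cut-off for each gene in turn.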
Only one gene was identified as significant at the specified level, and this observation
was not statistically robust following permutation testing (p=0.737). The gene identified,
Hypothetical LOC388298, most likely has no relationship to survival time and was
selected by chance alone due to the large number of tests performed. The function of this
gene, based on 99.2% sequence homology to VAS1, is thought to be the acidification of
intracellular compartments, and it does not appear to have been associated with
cancer-specific events in the literature (Nelson and Harvey, 1999).
These experiments highlight the importance of evaluating the significance of ‘predictive’
gene lists identified from gene expression data. This can be done by performing large
numbers of random permutations to determine the number of genes that can be expected
by chance alone and also by analysis of the associated literature available for each gene.
At a univariate significance level of 0.001, approximately 10 false-positive genes would
be expected by chance alone from the ten thousand present on the array platform in use.
In practice this number is often smaller due to the unsupervised filtering of genes without
significant variation from baseline expression across a subset of the total experiment.
Multivariate permutation testing, as used for the analyses carried out in this chapter, is
effective for controlling the proportion of false discoveries whilst taking into account (i)
the small number of samples relative to individual genes measured, (ii) potential for
inaccuracy at the extreme ends of the normal distribution of gene expression values and
(iii) the known correlation between genes of similar structure and/or function (Reiner et
al., 2003).
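The two checks described above, the number of genes expected by chance and a permutation-based assessment of a gene list, can be sketched as follows. This is a simplified illustration only: it assumes a plain Welch t-statistic with a fixed cut-off standing in for the random-variance t-test and the 0.001 criterion, and all function names and data are hypothetical.

```python
import math
import random


def expected_false_positives(alpha, n_genes):
    """Number of genes expected to pass a univariate cut-off by chance."""
    return alpha * n_genes


def welch_t(a, b):
    """Welch's t-statistic for two groups of expression values."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    var_a = sum((v - mean_a) ** 2 for v in a) / (len(a) - 1)
    var_b = sum((v - mean_b) ** 2 for v in b) / (len(b) - 1)
    return (mean_a - mean_b) / math.sqrt(var_a / len(a) + var_b / len(b))


def count_hits(data, labels, t_cutoff):
    """Count genes whose |t| between the two classes reaches the cut-off."""
    hits = 0
    for row in data:
        a = [v for v, g in zip(row, labels) if g == 0]
        b = [v for v, g in zip(row, labels) if g == 1]
        if abs(welch_t(a, b)) >= t_cutoff:
            hits += 1
    return hits


def global_permutation_p(data, labels, t_cutoff, n_perm=1000, seed=0):
    """Proportion of label permutations giving at least as many hits."""
    rng = random.Random(seed)
    observed = count_hits(data, labels, t_cutoff)
    shuffled = list(labels)
    at_least = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if count_hits(data, shuffled, t_cutoff) >= observed:
            at_least += 1
    return observed, at_least / n_perm


# Hypothetical data: 50 genes x 12 samples of pure noise, 6 samples per class.
rng = random.Random(1)
noise = [[rng.gauss(0.0, 1.0) for _ in range(12)] for _ in range(50)]
classes = [0] * 6 + [1] * 6
observed, p_global = global_permutation_p(noise, classes, t_cutoff=4.5, n_perm=200)
print(expected_false_positives(0.001, 10000), observed, p_global)
```

A large global p-value, such as the 0.194 and 0.737 reported in this chapter, indicates that a gene list of the observed size arises frequently when class labels are assigned at random.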
4.2.5. Experimentation with normalisation algorithms to improve detection of survival-related gene expression
The available gene expression dataset was re-normalised with a range of different
methods in order to determine if technical bias or noise introduced by a particular
normalisation algorithm was the cause of the inability of the feature selection approaches
to identify a survival-related set of genes.
The following normalisation methods, as described in section 2.4.3, were applied to
separate copies of the dataset:
(i) Median per-gene and per-array normalisation
(ii) Lowess intensity-based normalisation
(iii) SNOMAD
(iv) Print-tip lowess with background subtraction
(v) Print-tip lowess without background subtraction
For each type of normalisation listed above, the feature selection approaches described in
sections 4.2.4.1 - 4.2.4.4 were repeated to determine if the manipulation carried out by
each algorithm impacted on the ability to identify genes with expression patterns
significantly related to the survival variable. No combination of normalisation and feature
selection yielded a set of genes with significantly different expression between survival
categories, correlation to the continuous variable of survival time or significant
performance with Cox’s proportional hazards model.
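As an illustration of the simplest of the normalisation methods listed above, the following sketch applies median centring per array and then per gene to a small matrix of log ratios. The order of the two steps, the function name and the example values are assumptions made for illustration; the procedure actually used is the one described in section 2.4.3.

```python
from statistics import median


def median_centre(log_ratios):
    """Median-centre a genes x arrays matrix of log expression ratios.

    Step 1 subtracts each array's (column) median; step 2 subtracts each
    gene's (row) median from the column-centred values.
    """
    n_arrays = len(log_ratios[0])
    array_medians = [median(row[j] for row in log_ratios) for j in range(n_arrays)]
    per_array = [
        [value - array_medians[j] for j, value in enumerate(row)]
        for row in log_ratios
    ]
    return [[value - median(row) for value in row] for row in per_array]


# Hypothetical log2 ratios: three genes (rows) measured on three arrays (columns).
raw = [
    [1.0, 2.0, 4.0],
    [0.0, 1.0, 5.0],
    [2.0, 2.0, 3.0],
]
centred = median_centre(raw)
print(centred)
```

After the second step every gene has a median of zero across arrays, which is why most genes in a median-centred cluster image appear balanced between up and down regulation regardless of their absolute ratios.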
Analysis of data from Lucidea Microarray ScoreCard features revealed that
normalisations (ii) – (v) described above resulted in measurements with no significant
difference between the theoretical expected and observed values. Methods (i) and (vi),
which involved scaling all genes to the median expression of each gene and also each
array, did result in average ScoreCard values significantly different (p<0.05 in all cases)
from their expected values. In this case the use of spatial lowess (print-tip and SNOMAD)
or intensity-only based lowess had no significant effect on the accuracy of the ScoreCard
quality control features. This is most likely a reflection of the low spatial bias present in
the data generated using an Agilent Microarray Scanner and GenePix image analysis
package, as demonstrated in Chapter 3.
4.2.6. RT-PCR validation: selection of genes with minimum 2-fold change in expression between patient survival groups
In order to obtain a set of genes to validate with RT-PCR and carry the analysis through
to the stage of independent validation, with the caveat of the lack of statistical
significance described above, the mean expression ratios of all samples with <12 months
or >24 months survival were plotted (Figure 4-6). From this approach, 130 genes were
identified with 2-fold or greater differences in mean expression between the two classes.
The full list is given in Appendix G. Six genes were selected based on their fold change
and literature searches of their purported involvement in EOC biology: four with
higher levels in patients with longer survival times (KLK7, SLPI, S100A2 and TNFSF10)
and two with reduced mean expression with increased survival time (FN1 and UPA).
Details and literature information for these genes are given in Table 4-5.
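The fold-change selection described above can be sketched as follows, using the class means reported in Table 4-5 for the six selected genes. The function name and the additional unchanged gene (GENE_X) are hypothetical, and the means are assumed to be expression ratios on a linear scale.

```python
def two_fold_genes(mean_short, mean_long, threshold=2.0):
    """Return genes whose class means differ by at least `threshold`-fold.

    The fold change is expressed as (>24 months mean) / (<12 months mean),
    so values above 1 indicate higher expression with longer survival.
    """
    selected = {}
    for gene, short in mean_short.items():
        fold = mean_long[gene] / short
        if fold >= threshold or fold <= 1.0 / threshold:
            selected[gene] = fold
    return selected


# Class means from Table 4-5, plus one hypothetical gene below the threshold.
mean_under_12 = {"KLK7": 7.171, "SLPI": 5.296, "S100A2": 1.432,
                 "TNFSF10": 1.826, "FN1": 1.114, "UPA": 3.909, "GENE_X": 1.0}
mean_over_24 = {"KLK7": 17.949, "SLPI": 13.572, "S100A2": 4.058,
                "TNFSF10": 4.239, "FN1": 0.535, "UPA": 1.947, "GENE_X": 1.5}

selected = two_fold_genes(mean_under_12, mean_over_24)
for gene, fold in selected.items():
    print(f"{gene}: {fold:.2f}-fold")
```

A selection of this kind takes no account of within-class variability, which is why it is treated in this chapter as a means of choosing validation candidates rather than as a test of significance.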
The identity of each gene was confirmed by sequencing, and RT-PCR primers
corresponding to the sequences listed in Table 4-6 were designed using the GenScript
Real-time PCR (TaqMan) Primer Design tool (GenScript, USA).
[Figure 4-6: scatter plot. x-axis: mean (<12 months survival); y-axis: mean (>24 months survival); both axes run from -7 to 5.]
Figure 4-6: Mean expression profile for short (a) and long (c) term survival cases. Diagonal lines indicate 2-fold up and down regulation. Red dots indicate genes above or below this threshold. An annotated list of all genes over 2-fold up or down regulated by this analysis is shown in Appendix G.
Table 4-5: Selected genes with 2-fold or greater differences between mean phenotype profiles. Mean expression ratios for each class are given along with the direction of expression change with increased length of survival and the mean fold change. KLK7, SLPI, S100A2 and TNFSF10 have higher mean levels of expression in patients with longer survival whereas FN1 and UPA have lower levels.

Each entry gives the UniGene symbol and UniGene name, followed by the mean expression ratio for each class, the change in expression with increased length of survival and the relevance to EOC survival.

KLK7: Kallikrein 7 (chymotryptic, stratum corneum)
Mean (<12 months): 7.171
Mean (>24 months): 17.949
Change in expression with increased length of survival: Increases (2.5 fold)
Relevance to EOC survival: KLK7 is a potential biomarker for EOC, where it is expressed at significantly higher levels in late-stage disease compared to normal or benign adenoma. KLK7 status (negative/positive) has been demonstrated to be a predictor of both disease free and overall survival in EOC and correlates to amount of residual disease remaining after surgery (Dong et al., 2003; Kyriakopoulou et al., 2003)

SLPI: Secretory leukocyte protease inhibitor (antileukoproteinase)
Mean (<12 months): 5.296
Mean (>24 months): 13.572
Change in expression with increased length of survival: Increases (2.6 fold)
Relevance to EOC survival: Promotes the tumourigenic and metastatic potential of cancer cells. Promotes reporter and suicide gene expression. Has been proposed as a candidate for adenovirus-mediated gene therapy for EOC (Barker et al., 2003; Shigemasa et al., 2001)

S100A2: S100 calcium binding protein A2
Mean (<12 months): 1.432
Mean (>24 months): 4.058
Change in expression with increased length of survival: Increases (2.8 fold)
Relevance to EOC survival: S100A2 has a potential role as a tumour suppressor gene and also regulates the accumulation of calcium in normal mammary epithelial cells. Its expression is elevated in EOC relative to normal ovary tissue. (Hough et al., 2001; Santin et al., 2004)

TNFSF10: Tumour necrosis factor (ligand) superfamily, member 10
Mean (<12 months): 1.826
Mean (>24 months): 4.239
Change in expression with increased length of survival: Increases (2.3 fold)
Relevance to EOC survival: Also known as TRAIL and associated with favourable outcome in EOC. It is a potent death protein that favours the killing of various types of cancer cells to normal cells. High expression of this gene in EOC is a significant indicator of longer survival times. (Lancaster et al., 2004; Lancaster et al., 2003; Wiley et al., 1995)

FN1: Fibronectin 1
Mean (<12 months): 1.114
Mean (>24 months): 0.535
Change in expression with increased length of survival: Decreases (2.1 fold)
Relevance to EOC survival: FN1 is an extracellular matrix protein which promotes tumour migration and invasion through important cell-adhesion functions. It is also reported to have immunosuppressive functions. This gene is known to be upregulated in invasive tumours but not in tumours of low malignant potential (LMP). (Franke et al., 2003; Shigemasa et al., 2001)

UPA/PLAU: Plasminogen activator, urokinase
Mean (<12 months): 3.909
Mean (>24 months): 1.947
Change in expression with increased length of survival: Decreases (2.0 fold)
Relevance to EOC survival: High UPA associated with residual disease and shortened disease-free survival. PLAU/PAI-1 axis may play an important role in the intra-abdominal spread and reimplantation of ovarian cancer cells. The prognostic relevance of PLAU and PAI-1 supports their possible role in the malignant progression of ovarian cancer (Konecny et al., 2001; Schmitt et al., 1997)
Table 4-6: Primer sequences designed for RT-PCR validation of genes identified with >two-fold mean differential expression between patients of either <12 or >24 months survival
Table 4-8: Mean microarray and RT-PCR gene expression ratios for technical validation subset of primary patient outcome cohort. Ratios calculated using data from specimens available for RT-PCR validation experiment only.
4.2.6.2. Independent biological validation of genes with two-fold mean differential expression between survival groups
In order to validate the expression differences for these selected genes in samples not
used in the primary analysis, RNA was obtained from EOC specimens not used in the
original cohort. Five samples from patients with survival times of <12 months (median 9
months) and nine samples from patients with survival times of >24 months (median 57
months) were obtained. Specimens were reviewed by pathologist Dr Melissa Robbie as
previously described to ensure adequate tumour content and correct diagnosis. RNA was
extracted by Anna Tinker as part of the ongoing AOCS tumour profiling study and 5 µg of
total RNA was used to produce cDNA template.
RT-PCR was carried out using the housekeeping gene HPRT as a control gene (de Kok et
al., 2005). Each measurement was repeated three times per sample. Normalised RT-PCR
results are shown in Table 4-9 and a summary of the mean fold change differences for the
two survival classes is given in Table 4-10. Overall, the mean fold changes measured by
RT-PCR agreed with the microarray data, although the exact fold changes varied slightly,
as expected for a sample set of this limited size. This result does, however, validate the
ability of this approach to identify a subset of genes with differential expression based on
observed mean fold changes and literature analysis. The standard deviation between
triplicate RT-PCR reactions was 0.14, indicating a high degree of reproducibility between
multiple measurements of the same gene/template combination.
This approach is not the ideal one for identifying genes with significant expression
differences between groups of patients; rather, it was tailored to the sample size of this
study and the inability of more accepted methods of survival analysis to identify a set of
genes for validation.
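The handling of triplicate RT-PCR measurements and the calculation of a mean fold change between survival classes can be sketched as below. The text does not state how values were normalised to HPRT, so the delta-Ct calculation shown here is an assumption, as are the function names and the Ct values; only the two PCR2 class means for KLK7 are taken from Table 4-10.

```python
from statistics import mean, stdev


def relative_expression(target_ct, reference_ct):
    """Expression of a target gene relative to a reference gene.

    Assumes the delta-Ct method with perfect amplification efficiency:
    relative expression = 2 ** -(mean target Ct - mean reference Ct).
    """
    delta_ct = mean(target_ct) - mean(reference_ct)
    return 2.0 ** (-delta_ct)


def group_fold_change(short_survival, long_survival):
    """Mean of the >24 months class divided by mean of the <12 months class."""
    return mean(long_survival) / mean(short_survival)


# Hypothetical triplicate Ct values for one sample.
target_triplicate = [24.1, 24.0, 23.9]
hprt_triplicate = [26.0, 26.1, 25.9]
print(f"replicate SD = {stdev(target_triplicate):.2f}")
print(f"relative expression = {relative_expression(target_triplicate, hprt_triplicate):.2f}")

# Class means for KLK7 in the independent cohort (PCR2 values, Table 4-10).
print(f"KLK7 fold change = {group_fold_change([38.68], [54.18]):.1f}")
```

Reporting the replicate standard deviation alongside each normalised value makes it straightforward to flag reactions with a large divergence between replicates, which are shown as dashes in Table 4-9.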
Table 4-9: Independent biological validation set and RT-PCR data for selected mean 2-fold differentially expressed genes. Residual disease summary: nil: 0cm, min: 0-1cm, mod: 1-2cm, max: >2cm thick section of tumour remaining after surgery. RT-PCR scores normalised to HPRT expression and are average of triplicate measurements. Dashes indicate either the absence of data generated from this reaction, or a large divergence between replicate measures.

Patient ID | Pathology classification | Months survival | Residual Disease | KLK7 | SLPI | S100A2 | TNFSF10n | FN1 | UPA
85.064 | Papillary serous cystadenocarcinoma | 2 | Max | 58.87 | - | - | 8.28 | 17.32 | 673.61
93.086 | Papillary serous adenocarcinoma | 3 | Max | 109.04 | 7472.15 | 1146.19 | 13.85 | 13.08 | 154.34
86.027 | Serous carcinoma | 5 | Min | 5.63 | 914.51 | 31.95 | 9.74 | 2.11 | 15.34
85.031 | Papillary serous adenocarcinoma | 9 | Min | 11.40 | 3275.01 | 285.31 | 15.14 | 72.95 | 1255.68
95.014 | Serous carcinoma | 9 | Min | 17.84 | 3434.70 | 94.15 | 16.19 | 1.72 | 48.61
94.113 | Papillary serous adenocarcinoma | 25 | Nil | 165.04 | 18669.01 | 6090.14 | 26.23 | 126.42 | 919.00
93.075 | Papillary serous adenocarcinoma | 27 | Nil | 20.93 | 195.83 | 35.24 | 7.23 | 0.51 | 19.38
93.072 | Serous adenocarcinoma | 33 | Nil | 149.99 | 1185.98 | 11.08 | 24.96 | 4.30 | 48.91
93.006 | Papillary serous cystadenocarcinoma | 34 | Max | 56.45 | 2761.14 | 260.00 | 37.66 | 11.20 | 118.41
95.002 | Papillary serous adenocarcinoma | 80 | Min | 9.67 | 1817.17 | 202.25 | 8.03 | 3.06 | 83.23
87.035 | Serous cystadenocarcinoma | 86 | Max | - | - | - | - | 4.25 | 10.96
93.056 | Papillary serous cystadenocarcinoma | 126 | Min | 16.30 | 2690.87 | 1020.06 | 5.11 | 41.41 | 804.65
86.028 | Serous cystadenocarcinoma | 211 | Min | 42.31 | 5825.52 | 2543.18 | 8.04 | 6.44 | 260.70
86.058 | Serous cystadenocarcinoma | 214 | Nil | 4.95 | 10655.47 | 741.92 | 8.70 | 9.18 | 92.80

Table 4-10: Summary of gene expression measurements from microarray data, RT-PCR validation where available (PCR1) and RT-PCR independent validation cohort (PCR2). The mean fold change of each gene calculated from the complete primary microarray dataset is shown for comparison.

Gene | Measurement | <12 months | >24 months | Mean fold change
KLK7 | Array | 7.17 | 17.95 | 2.5
KLK7 | PCR1 | 777.25 | 2401.96 | 3.09
KLK7 | PCR2 | 38.68 | 54.18 | 1.4
SLPI | Array | 5.30 | 13.57 | 2.6
SLPI | PCR1 | 1.79 | 3.92 | 2.19
SLPI | PCR2 | 2921.05 | 5042.28 | 1.7
S100A2 | Array | 1.43 | 4.06 | 2.8
S100A2 | PCR2 | 299.21 | 1266.40 | 4.2
TNFSF10n | Array | 1.83 | 4.24 | 2.3
TNFSF10n | PCR2 | 12.30 | 18.42 | 1.5
FN1 | Array | 1.11 | 0.54 | 0.4
FN1 | PCR1 | 5.61 | 4.06 | 0.72
FN1 | PCR2 | 31.05 | 20.75 | 0.6
UPA | Array | 3.91 | 1.947 | 0.5
UPA | PCR1 | 2.03 | 1.18 | 0.58
UPA | PCR2 | 459.62 | 237.25 | 0.5
[Figure 4-7: six box-plot panels (A to F) for UPA, SLPI, FN1, TNFSF10n, S100A2 and KLK7; the x-axis categories in each panel are <12 months and >24 months.]
Figure 4-7: Box plots of RT-PCR assessed gene expression of selected genes on independent validation samples. Red dots indicate mean expression levels per class. All p-values for comparison of the mean expression level between groups were >0.05. The mean fold changes in expression did, however, agree with values observed in the microarray data. (A) KLK7 (B) SLPI (C) S100A2 (D) TNFSF10 (E) FN1 (F) UPA.
4.2.7. Analysis of published gene lists for predicting EOC prognosis
Prognostic gene lists identified by other published studies of EOC were used to
interrogate the dataset generated for this chapter in order to determine the relationship
between datasets. Univariate F-tests were carried out on expression data corresponding to
the published gene lists and hierarchical clustering was performed to visualise the
patterns of expression formed.
4.2.7.1. Comparison to EOC gene expression prognostic signature
The 115-gene independent prognostic signature of Spentzos et al. (Spentzos et al., 2004)
was matched via UniGene IDs (Wheeler et al., 2003) to 90 genes on the Peter Mac 10.5k
cDNA microarray. None of the genes exhibited statistically significant variation between
the survival groups present in this study or correlated significantly with survival time.
Hierarchical clustering of the 26 samples used in this chapter using the ‘Ovarian
Cancer Prognostic Signature’ is shown in Figure 4-9. While regions of co-expressed
genes can be observed in the cluster image generated, the samples were not grouped into
survival groups, nor into groups corresponding to other clinical variables.
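The matching of a published signature to the array platform via UniGene IDs amounts to a set intersection, sketched below. The function name and the identifiers are hypothetical placeholders; the actual matching for this study was performed against the UniGene build cited in the text.

```python
def match_signature(signature_ids, platform_ids):
    """Return the UniGene IDs shared by a published signature and the array."""
    return sorted(set(signature_ids) & set(platform_ids))


# Hypothetical UniGene cluster IDs, for illustration only.
published_signature = ["Hs.0001", "Hs.0002", "Hs.0003", "Hs.0004"]
array_platform = ["Hs.0002", "Hs.0004", "Hs.0100", "Hs.0200"]

shared = match_signature(published_signature, array_platform)
coverage = len(shared) / len(published_signature)
print(shared, f"{coverage:.0%} of the signature is represented on the array")
```

The coverage figure is worth reporting with any such comparison, since a signature that is only partly represented on the platform (90 of 115 genes here, or 12 of 32 in the following section) has reduced power to reproduce the published grouping.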
4.2.7.2. Comparison to a molecular signature of EOC residual disease levels
Microarrays have been used to investigate the molecular component of residual disease
following surgery, thought to be dependent on both the physiological characteristics of
the tumour and the ability of the operating surgeon. Berchuck et al. (Berchuck et al.,
2004) found 32 genes that could distinguish between optimal and suboptimal debulking
with 72.2% accuracy. Only 12 of these genes were present on the Peter Mac array (based
on UniGene ID linking, build #184), reducing the power of this comparison. Using
this 12-gene subset, no significant grouping of patients into either survival or
residual disease categories on the basis of hierarchical clustering was observed
(Figure 4-8).
Furthermore, none of these twelve genes were present in any of the other published gene
lists used to interrogate these 26 EOC expression profiles. This suggests that the signature
of residual disease identified by Berchuck et al may be specific to the 44-patient cohort
from which it was developed, or these genes are not involved in the range of other
tumour-related processes represented by these gene lists.
Figure 4-8: EOC survival dataset clustered using the 12 genes that overlap between the Peter Mac 10.5k cDNA microarray and a predictive signature of residual disease (Berchuck et al., 2004). Sample colour bar corresponds to residual disease categories nil (0cm tumour remaining), min (0-1cm), mod (1-2cm) or max (>2cm).
Table 4-11: Known molecular networks identified by Ingenuity Pathway Analysis as having a significant representation of the 27 genes differentially expressed between EOC patients of <12 or >24 months survival. ‘Focus’ genes are those present in the list of genes input into the Ingenuity program (shown below in bold).
Genes in Network Network Score
No. Genes overlapping
Gene ontologies significantly represented by network
Figure 4-11: Gene interaction network identified as containing a statistically significant proportion of the 27 genes differentially expressed between patients with <12 months or >24 months survival times (shown in grey). Other genes contained in this figure were not present in the initial list of 27 but have documented relationships to the 14 present, based on literature and database mining using the Ingenuity system. This network is implicated in a range of cancer types (including breast, colorectal and ovarian), as well as apoptosis, cell growth, proliferation, morphology and movement. These gene ontologies are related to this specific gene network at a significance level of p<0.001.
4.3. Discussion
This chapter describes the use of cDNA gene expression data to identify genes related to
the variable of EOC patient survival. Although several methods of data analysis were
applied, ranging in complexity and having been used successfully by others for similar
analyses, it was not possible to identify a statistically significant gene set from the
particular cohort available. Several confounding factors were identified as contributing to
the inability to achieve this goal of the study.
4.3.1. The impact of residual disease and distribution of survival times on the identification of genes related to length of survival
One of the most significant confounding factors in the attempt to identify genes related to
patient survival was the pattern of residual disease present in this cohort, particularly in
light of the demonstrated importance of this variable on EOC prognosis (Berchuck et al.,
2004; Bristow et al., 2002; Hoskins et al., 1994). A distinct trend of increasing levels of
residual disease associated with shorter survival times could be observed in the 26
patients analysed for this chapter, although it was not statistically significant. By
grouping patients into categories of nil/minimal (<1cm) or moderate/maximum (>1cm)
levels of residual disease and incorporating this information into the analysis of
differential gene expression between survival groups, it was hoped to identify genes
whose expression was related to survival, independent of the level of residual disease
present. Whilst a number of gene lists were obtained, none were found to be more
significant than could be expected by chance alone. The method most closely resembling
that of Spentzos et al (Spentzos et al., 2004), in which patients representing the outer
edges of the survival time distribution were compared, resulted in the gene list that was
the closest to statistical significance on the basis of permutation analysis.
Various retrospective studies have demonstrated the benefit of optimal surgical debulking
for advanced stage EOC. Current figures from these reports show a median survival of
approximately 5 years for patients who are diagnosed with <1cm diameter residual
tumour nodules, compared to 3 years in cases where larger volumes of tumour remain
(Hoskins et al., 1992; Hoskins et al., 1994). Microarray analysis of late stage tumours of
varying debulk status has revealed a small gene signature capable of predicting
this variable in approximately 75% of cases tested. This implies that the benefits obtained
from optimal debulking are at least partially linked to the molecular characteristics of the
individual tumour and not just due to the physical removal of tumour bulk alone
(Berchuck et al., 2004). The model generated from this gene expression study does not fit
every sample in the cohort and as such other theories about the reason for the prognostic
significance of residual disease remain valid. These include hypotheses that smaller
tumour masses are more susceptible to chemotherapy, more likely to trigger an effective
immune response and have a lower chance of developing chemoresistance (Berek, 1995;
Memarzadeh et al., 2003; van der Burg et al., 1995).
Another significant hurdle this dataset presented was its limited sample size, a factor
whose influence was most likely amplified in this study of EOC due to the previously
discussed heterogeneity of this disease and influence of varying residual disease levels on
patient survival (Hoskins et al., 1992; Hoskins et al., 1994). In keeping with the criteria
determined for selecting suitable cases of EOC from the total set available at the start of
this project, many samples had to be excluded because of incomplete clinical data,
particularly information concerning the level of residual disease present. Several attempts
were made over the course of this project to obtain more clinical information about the
total cohort, most of which was obtained through a collaboration with the Royal Brisbane
Hospital, QLD. The age of the specimens, some collected up to 15 years ago, as well as
the varying locations at which the women were treated, made this information difficult to
obtain.
One of the few studies to successfully identify a set of genes with prognostic capabilities
for EOC is that by Spentzos et al (Spentzos et al., 2004). This work involved the use of 68
EOC microarray profiles generated with the 12,625 feature Affymetrix U95A2 array
(Affymetrix, Santa Clara, CA USA), more than twice the number of samples used for this
study. A method of analysis similar to that employed in section 4.2.4.3 was used, in
which groups of patients with the shortest and longest survival times were compared to
identify genes with significant expression differences. The substantially greater range of
survival times present in the Spentzos et al cohort permitted a greater separation between
the short and long term survival groups, with the former having survival times below 26
months and the latter 58 months or greater. It is reasonable to assume that the further
apart two groups of specimens are, as defined by a linear variable such as survival time,
the greater the molecular contrast between the two classes, improving the chances of
identifying differentially expressed genes.
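The extreme-group comparison discussed here can be illustrated with a short sketch. The Python below is illustrative only and is not the analysis code used for this thesis or by Spentzos et al: the function names and toy values are hypothetical, and the default cut-offs (<12 and >24 months) are simply those used in this chapter. Cases falling between the two cut-offs are discarded, and a Welch t statistic is computed for a single gene.

```python
import math
from statistics import mean, variance


def split_extremes(survival_months, short_cutoff=12, long_cutoff=24):
    """Return indices of short- and long-term survivors.

    Cases with survival between the two cut-offs are left out of both groups.
    """
    short = [i for i, m in enumerate(survival_months) if m < short_cutoff]
    long_ = [i for i, m in enumerate(survival_months) if m > long_cutoff]
    return short, long_


def welch_t(a, b):
    """Welch's t statistic (unequal variances); needs >= 2 values per group
    and a non-zero pooled standard error."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se


# Hypothetical toy data: survival in months and log-ratios for one gene.
survival = [5, 8, 11, 15, 20, 26, 30, 41]
gene = [1.2, 1.0, 1.1, 0.7, 0.6, 0.2, 0.1, 0.3]

short_idx, long_idx = split_extremes(survival)
t = welch_t([gene[i] for i in short_idx], [gene[i] for i in long_idx])
print(short_idx, long_idx, round(t, 2))
```

Widening the gap between the cut-offs increases the contrast between the groups but reduces the number of cases in each, which is the trade-off a small cohort imposes.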
The distribution of survival times in the cohort analysed in this chapter did not permit an
equivalent separation of short and long term survival groups to that used by Spentzos et al.
However, should this dataset be extended by incorporation of gene expression profiles
generated from specimens prospectively collected by the AOCS, associated with
substantially more comprehensive clinical information, such separation may be possible.
4.3.2. EOC heterogeneity and its impact on the success of genomic analyses
Microarray profiling of whole tumour specimens gives a ‘global overview’ of gene
expression in a given piece of tissue. However as human tissue is a complex network of
tissue and cell types, it can be difficult to ascertain the exact cell type responsible for a
particular expression pattern generated when profiling macroscopic sized specimens of
tissue (Liotta and Petricoin, 2000). The heterogeneity of EOC has been described
previously and is a confounding factor for any study seeking to identify its underlying
molecular causes using high throughput approaches such as microarray technology where
samples must be grouped into classes of sufficient numbers to allow statistically valid
comparisons to be made (Hernandez et al., 1984; Pieretti et al., 2002).
Concentrating the analysis on a single histological subtype is one method of reducing
heterogeneity in a dataset. Limiting the analysis in this chapter to the serous subtype
achieved this and also reduced the risk of the dataset being contaminated by metastases
which are frequently of mucinous histology (Lee and Young, 2003; Seidman et al., 2003).
Despite limiting the cohort to the serous type only, the importance of full pathology
review of specimens intended for microarray analysis cannot be overstated. One
crucial measure obtained from a review of the tissue prior to array analysis is the
percentage of tumour present in the section relative to the stroma and other non-malignant
tissue. As the tissue processing protocol used for this study did not incorporate
microdissection, any non-tumour tissue present will be homogenised and the genetic
material it contains extracted along with that of the tumour, contributing to the overall
gene expression profile generated for the specimen. In this study all samples were
reviewed as having sufficient tumour content for cDNA microarray profiling by the
reviewing pathologist, as described in Material & Methods section 2.2.1. It has been
documented that particularly for heterogeneous tumour types such as EOC the specific
location relative to the entire tumour, from which the biopsy for microarray analysis is
taken can have a significant impact on the resulting gene expression profile (Pieretti et al.,
2002). This may be due to certain areas of a tumour of this type being more malignant,
containing more stroma or a higher level of infiltrating immune cells for example, each
influencing the molecular signature generated (Liotta and Petricoin, 2000).
Another method that could be employed to increase the clarity of information obtained by
microarray analysis would be the use of microdissection and RNA amplification. This
process is extremely time-consuming to carry out, but would facilitate the exclusion
of all non-tumour tissue from the biopsy being analysed resulting in microarray data
generated from a more pure cell population. Therefore one could state with confidence
that the gene expression signature obtained from the amplified RNA was truly
representative of the malignant tissue rather than its surrounding stroma or nearby normal
tissue (Player et al., 2004; Sambrook and Bowtell, 2003). Comparisons could then be
made to microarray profiles of the microdissected stroma and other non-tumour cells in
order to profile the expression of genes in these cells.
To date this microdissection of tumour material has not been employed in the vast
majority of EOC gene expression studies (Adib et al., 2004; Donninger et al., 2004; Gilks
et al., 2005; Hartmann et al., 2005; Jazaeri et al., 2003; Lancaster et al., 2004; Lee et al.,
2003; Sakamoto et al., 2001; Santin et al., 2004; Schaner et al., 2003; Schwartz et al.,
2002; Spentzos et al., 2004; Tonin et al., 2001), which in light of a growing body of
evidence concerning the importance of tumour-stroma interaction in disease progression,
represents somewhat of a deficiency in this area of research. The future use of
microdissection coupled with microarray profiling may increase the level of
understanding of the interaction of EOC and its environment.
4.3.3. Attempts to identify gene expression patterns with statistically significant relationships to length of survival
A gene or set of genes whose expression correlated with survival length in a linear
fashion would make an ideal prognostic marker. Theoretically such a marker could be
measured throughout the course of a patient’s treatment regime to assess disease
progression and possibly contribute to any decision concerning how aggressively a
tumour should be treated. To this end, several approaches were attempted to find such
genes in this dataset. Both the quantitative trait (section 4.2.4.1) and Cox proportional
hazards (section 4.2.4.4) analyses interrogate the microarray data with respect to the
continuous variable of survival time, the hazards model permitting the censoring of data
from any patients who were still alive at the last data collection point (follow-up date in
this instance).
Unfortunately, neither analysis approach yielded a list of genes more significantly
correlated with survival than could be expected by chance association. However,
inspection of the genes identified does show a significant correlation with the length of
survival experienced by the 26 cases in this study, demonstrating the theoretical ability of
this approach for identifying a small number of potentially clinically important genes
from a very large starting set (4,508 genes were tested after excluding non-varying genes
using an unsupervised filter).
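An unsupervised filter of the kind mentioned above ignores class labels and survival times and uses only the spread of each gene's measurements. The sketch below is a hypothetical illustration, not the filter applied in this chapter (whose exact criterion is not restated here): it assumes genes are ranked by sample variance of their log-ratios and that a fixed fraction is kept. The function name, gene names and values are invented for the example.

```python
from statistics import variance


def filter_by_variance(expression, keep_fraction=0.5):
    """Keep the most variable genes without reference to any clinical variable.

    expression: dict mapping gene name -> list of log-ratios (one per sample).
    keep_fraction: assumed proportion of genes to retain (hypothetical default).
    """
    ranked = sorted(expression, key=lambda g: variance(expression[g]), reverse=True)
    n_keep = max(1, int(len(ranked) * keep_fraction))
    return ranked[:n_keep]


# Hypothetical toy data: four genes measured on four samples.
expression = {
    "geneA": [0.0, 0.1, -0.1, 0.0],   # almost constant
    "geneB": [1.5, -1.2, 0.8, -1.0],  # highly variable
    "geneC": [0.2, 0.2, 0.2, 0.2],    # constant
    "geneD": [-0.9, 1.1, -0.7, 0.6],  # variable
}

kept = filter_by_variance(expression)
print(kept)
```

Because the filter never sees the survival data, it reduces the number of tests performed without biasing the subsequent significance estimates.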
This phenomenon of non-overlapping gene sets claimed to reflect the same clinical
question has been observed by a number of groups (Ein-Dor et al., 2005). It may be a
reflection of the heterogeneity of ovarian cancer, a disease whose subtypes are known to
have markedly different clinical courses (Hess et al., 2004; Ronnett et al., 2004; Zanetta
et al., 2001). Alternatively, or perhaps additionally, it may be due to the sample sizes used
in each study being inadequate to produce a truly population-representative expression
signature. Other explanations may be the disparate clone sets used to create the respective
microarrays used by each study or method of bioinformatic analysis employed (Lossos et
al., 2004). Certain statistical approaches have been shown to cause overfitting,
which would result in an expression signature only being applicable to the
samples it was generated from (Ambroise and McLachlan, 2002; Simon et al., 2003b).
Ein-Dor et al (Ein-Dor et al., 2005) have recently shown that the creation of a predictive
gene list is highly dependent on the subset of patients used in the analysis and even small
changes in the number of samples used can result in significant changes in the number
and identity of genes selected by many feature selection algorithms. Other reasons these
authors proposed for the difficulty of identifying universal sets of genes with prognostic
expression patterns include the possibility that while a large number of genes are correlated
with survival, the scale of these differences is often very small. This hypothesis agrees
with analyses carried out in this chapter in which the statistically significant changes in
expression detected between patients of <12 or >24 months survival corresponded to
seemingly very small differences in relative fold changes (e.g. less than 0.1 difference in
mean fold change in some instances).
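The dependence of a gene list on the patient subset can be explored by simulation. The following sketch is a simplified illustration under stated assumptions and does not reproduce the procedure of Ein-Dor et al: it simulates genes that are each only weakly related to survival, ranks genes by absolute Pearson correlation within random patient subsets, and reports the average overlap between the resulting top-k lists. All names, the cohort size of 26 (borrowed from this chapter) and the simulated effect size are hypothetical choices.

```python
import random
from statistics import mean


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either vector has no variation."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5 if sxx > 0 and syy > 0 else 0.0


def top_genes(expression, survival, patients, k):
    """Top-k genes by |r| with survival, using only the given patient indices."""
    surv = [survival[p] for p in patients]
    score = {g: abs(pearson([v[p] for p in patients], surv))
             for g, v in expression.items()}
    return set(sorted(score, key=score.get, reverse=True)[:k])


def mean_overlap(expression, survival, subset_size, k, n_rounds, rng):
    """Average shared fraction of top-k lists from pairs of random subsets."""
    n = len(survival)
    overlaps = []
    for _ in range(n_rounds):
        a = top_genes(expression, survival, rng.sample(range(n), subset_size), k)
        b = top_genes(expression, survival, rng.sample(range(n), subset_size), k)
        overlaps.append(len(a & b) / k)
    return mean(overlaps)


rng = random.Random(0)  # fixed seed so the simulation is repeatable
n_patients, n_genes = 26, 200
survival = [rng.uniform(1, 60) for _ in range(n_patients)]
# Simulated genes: a small survival-related trend buried in unit-variance noise.
expression = {
    f"gene{j}": [0.005 * s + rng.gauss(0, 1) for s in survival]
    for j in range(n_genes)
}

overlap = mean_overlap(expression, survival, subset_size=20, k=10,
                       n_rounds=50, rng=rng)
print(f"mean overlap of top-10 lists: {overlap:.2f}")
```

Increasing the simulated effect size or the cohort size in this sketch is one way to see how quickly the lists stabilise.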
Attempts to identify genes correlated with EOC survival in this chapter were most likely
hindered by the number of samples available having the requisite clinical information, the
number of genes contained on the microarray platform used and also the known
heterogeneity of EOC, even within an individual histological subtype (Hernandez et al.,
1984; Pieretti et al., 2002; Sevin and Perras, 1997).
Several methods for identifying and correcting sources of technical error described in
Chapter 3 were applied to the dataset in an attempt to reduce any systematic noise that
may have been clouding true biological information. The level of spatial bias present in
the dataset was determined using the MMT method and was found to be at an acceptable level,
indicating this source of error was not a significant concern. Despite this, a range of
normalisation approaches were applied to the dataset and the analyses of survival times
repeated; however, no increase in the statistical significance of the resulting gene lists was
obtained.
4.3.4. Biological and clinical relevance of genes identified
Despite the statistical uncertainty of the gene lists identified, a number of biologically
relevant genes were identified from the analyses carried out. The t-test carried out
between patients with <12 or >24 months survival, resembling the method carried out by Spentzos
et al (Spentzos et al., 2004), was the closest of those used in this chapter to generating a
list of differentially expressed genes with greater significance than would be obtained by
chance alone on the basis of 1000 permutations of the dataset (p=0.019). As discussed
previously and in light of this approach generating the closest to a statistically significant
list of genes, a larger sample cohort with a greater divide in survival times between
classes could be expected to yield results similar to those published.
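A permutation assessment of a whole gene list, as referred to above, can be sketched as follows. This is one common form of global test and is an assumption rather than a description of the software used in this chapter: class labels are shuffled, the number of genes passing a fixed t-statistic cut-off is recounted each time, and the p-value is the proportion of permutations giving at least as many genes as the real labels. The function names, the cut-off of 3.0 and the simulated data are hypothetical.

```python
import math
import random


def _mean_var(x):
    """Sample mean and unbiased sample variance."""
    n = len(x)
    m = sum(x) / n
    return m, sum((v - m) ** 2 for v in x) / (n - 1)


def welch_t(a, b):
    """Welch's t statistic; returns 0.0 if the standard error is zero."""
    ma, va = _mean_var(a)
    mb, vb = _mean_var(b)
    se = math.sqrt(va / len(a) + vb / len(b))
    return (ma - mb) / se if se > 0 else 0.0


def count_hits(expression, labels, t_cutoff):
    """Number of genes whose |t| between label 0 and label 1 exceeds t_cutoff."""
    hits = 0
    for values in expression:
        g0 = [v for v, lab in zip(values, labels) if lab == 0]
        g1 = [v for v, lab in zip(values, labels) if lab == 1]
        if abs(welch_t(g0, g1)) > t_cutoff:
            hits += 1
    return hits


def global_permutation_p(expression, labels, t_cutoff=3.0, n_perm=1000, seed=0):
    """Return (observed hit count, permutation p-value for that count)."""
    rng = random.Random(seed)
    observed = count_hits(expression, labels, t_cutoff)
    shuffled = list(labels)
    as_extreme = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break any link between labels and expression
        if count_hits(expression, shuffled, t_cutoff) >= observed:
            as_extreme += 1
    # Adding 1 to numerator and denominator avoids reporting p = 0.
    return observed, (as_extreme + 1) / (n_perm + 1)


# Simulated data: 50 pure-noise genes plus 5 genes shifted in group 1.
rng = random.Random(1)
labels = [0] * 6 + [1] * 6
noise = [[rng.gauss(0, 1) for _ in labels] for _ in range(50)]
signal = [[rng.gauss(0, 1) + 3.0 * lab for lab in labels] for _ in range(5)]

observed, p = global_permutation_p(noise + signal, labels)
print(observed, p)
```

With small groups the number of distinct label arrangements is limited, so the smallest attainable p-value is bounded by the cohort size as well as by the number of permutations.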
Nine genes are differentially expressed at the most significant univariate level possible in
this analysis (p<1e-07). While it is not possible to state that the genes in this list were not
selected by chance alone, some have interesting and relevant biology behind them, which
in itself is a form of experimental validation. Amongst these is Cisplatin resistance-
associated over expressed protein (CROP) which is a stress-response molecule isolated
from a cisplatin-resistant cell line. CROP is expressed at higher levels in the tumours of
the short term survivors (Umehara et al., 2003). While its precise function remains
unknown, its location within the cell is altered following cisplatin treatment, suggesting a
mode of activation for this chemotherapeutic agent commonly used for EOC that involves
modulation of stress-response genes including CROP (Umehara et al., 2003). Its higher
expression, detected in both cisplatin-treated patients with shorter survival times and a cell
line resistant to the same drug, may represent a molecular mechanism for evading the
cytotoxic effects of this treatment, leading to chemoresistant disease and a poorer
prognosis.
Expressed at a 3-fold lower level in short term survivors was the extracellular
matrix protein Latent transforming growth factor beta binding protein 2 (LTBP2). The
cell adhesive properties of this gene have been demonstrated in melanoma adhesion
assays in which cell attachment to LTBP2 was inhibited in a dose-dependent manner
by antibodies against beta-1 integrin (Moren et al., 1994; Vehvilainen et al., 2003). A
reduction in the expression or functionality of cell adhesion genes is frequently associated
with increased proliferation of tumour cells (Wijnhoven et al., 2000), this trend being
supported by the observation of lower levels of LTBP2 in patients with shorter survival
times.
Other genes expressed at lower levels in those patients with shorter survival times
identified by this analysis are:
ACP6, which has been demonstrated to protect cells from cisplatin induced death
and contributes to overall cell survival (Mackeigan et al., 2005);
BGLAP, which is thought to contribute to the accumulation of calcium phosphate
in EOC psammoma bodies and associated with cellular degradation (Raymond et
al., 1999); and
NAPSA, for which therapeutic targeting is currently under way. In breast cancer,
increased expression of this gene results in higher levels of E-cadherin (CDH1),
an important cell surface molecule heavily implicated in EOC progression
(Tatnell et al., 1998; Thibout et al., 1999).
The reduction of expression of these genes is associated with shorter survival times and
agrees with the current literature concerning their molecular functions. Other genes
selected by this analysis, as listed in Table 4-4, function to control events associated with
the cell cycle and cell adhesion and to regulate apoptosis, all important processes in tumour
growth and progression.
Amongst those genes found to have higher expression levels in those patients with shorter
survival times were:
GPX3, which protects cells against damage from oxidative stress and has
approximately 6-fold higher expression in the highly malignant clear-cell ovarian
cancer subtype relative to mucinous and serous EOC (Hough et al., 2001;
Takahashi et al., 1987);
LTB, which promotes tumour growth by interacting with the immune system and
triggering angiogenesis allowing the tumour cells to obtain the required nutrients
to proliferate (Browning et al., 1993);
SCARA3, also protecting from oxidative stress, a condition under which non-
malignant cells would die, by depleting the levels of reactive oxygen present. The
expression of this gene is increased in response to an elevation of these damaging
molecules (Han et al., 1998).
Notably, in the list of 27 genes differentially expressed between short and long term
survivors in this study, five genes are involved in either calcium binding (LTBP2,
BGLAP, REPS2, and PCDH12) or calcium channel activity (RYR1). All genes besides
PCDH12 are expressed at higher levels in the longer term (>24 months) survival group
relative to the short-term group (<12 months) and although the differences are small they
are highly statistically significant (p<0.0001).
A large epidemiological study has described a significant link between levels of dietary
calcium intake and rates of ovarian cancer (Goodman et al., 2002). This study noted an
inverse association between the consumption of lactose, thought to increase calcium
absorption, and the risk of EOC (Gueguen and Pointillart, 2000). In summary, the data
revealed that women who consume higher levels of both calcium and lactose, particularly
from dairy sources, are at a significantly decreased risk of EOC development. Dietary
calcium intake has also been reported to be inversely related to breast cancer (Lipkin and
Newmark, 1999) and colorectal cancer (Martinez and Willett, 1998) and positively
related to prostate cancer (Giovannucci et al., 1998) suggesting an important role for this
nutrient and the molecular processes in which it is involved for a range of malignancies.
A potential mechanism to explain the association between calcium and EOC is the
requirement of this compound for correct function of cellular adhesion molecules,
specifically transmembrane glycoprotein cadherins. Variation in calcium availability,
either through dietary consumption or defective regulation of calcium-processing genes,
may alter the adhesive functionality of cadherins. This in turn may lead to increased rates
of tumour progression and invasion as cells lose or gain adhesive abilities that confer a
malignant phenotype. The regulation of cadherin expression and function has been
demonstrated as a pivotal stage of EOC progression (Patel et al., 2003). In another
microarray based study of EOC malignant potential, a significant number of calcium
channel related genes were observed to be differentially expressed between benign,
borderline and malignant ovarian specimens (Warrenfeltz et al., 2004). These genes were
under expressed in the malignant tissues relative to the other tissues profiled in the
Warrenfeltz et al study. This agrees with the observation made in this chapter in which lower
calcium-related gene expression was associated with those patients experiencing shorter
survival times, possibly reflecting a more malignant disease phenotype of EOC present in
these patients.
4.3.5. General conclusions
In this chapter a range of methods were used to identify genes with patterns of expression
related to EOC survival. Whilst no single list of genes was identified with more statistical
significance than could be expected by chance, a number of biologically and clinically
relevant genes were identified, particularly in the list of 27 genes found by comparing the
two groups of patients with either <12 or >24 month survival. Several of these genes were
identified as being involved in cell adhesion and calcium binding or transport, therefore
regulation of these processes is hypothesised to be important for EOC progression and
ultimately patient survival.
Significant gene lists from a number of published studies of EOC survival or malignancy
were used to interrogate the microarray data generated for this study; however none were
found to reproduce the same result as observed in their respective published analyses. This
is thought to reflect the heterogeneity of EOC, limited genome-coverage of some
microarray platforms and also the small cohort sizes of most EOC studies of this kind
carried out to date.
RT-PCR was used to attempt technical and biological validation of a series of genes
found to have a mean fold change of 2 or greater between patient survival groups and a
relevant literature base. Despite complications arising from the limited quantities of RNA
available for validation purposes, in general good agreement was found between
microarray and independent RT-PCR based measurements of gene expression.
5. Molecular analysis of invasive and low malignant potential ovarian tumours
5.1. Introduction
The invasive and low malignant potential (LMP) subtypes of ovarian cancer both arise
from the epithelial lining of the ovary, yet have a number of important differences. By
studying the molecular differences between these two tumour types it is hoped to increase
the understanding of those events responsible for EOC progression and invasion.
The defining characteristics of LMP EOC include:
Atypical cellular proliferation, but the lack of stromal invasion despite sharing
other malignant characteristics such as cellular stratification and nuclear atypia
Significantly better prognosis; the 5-year survival rate for women diagnosed with
stage 1 LMP disease is in excess of 95% (compared to 30% for all EOC)
Younger age at diagnosis.
The efficacy of conservative (fertility-sparing) surgery in achieving a cure.
Arising from the same tissue as invasive cancers, LMP tumours are an excellent model to
assist in identifying the molecular basis of EOC invasion.
This chapter describes a microarray-based investigation of invasive and LMP ovarian
tumours, including those of mucinous and serous histology. The work describes
pathology review of specimens and the creation of an expression-based method for
confirming pathological classifications, including tissue of origin. Bioinformatic analysis
was carried out on the resulting dataset to characterise the molecular differences between
LMP and invasive subtypes. Relationships to other invasive/non-invasive cancer models
are analysed and ontology and pathway analyses are carried out to determine key
processes that may be responsible for controlling the invasive potential of EOC.
High throughput methods such as automated RT-PCR and microarrays have been applied
to a range of cancer models in recent years to advance the understanding of their
molecular foundations, particularly in relation to important clinical variables
(Dhanasekaran et al., 2001; Dyrskjot et al., 2003; Golub et al., 1999; Ramaswamy et al.,
2001; Spentzos et al., 2004; van 't Veer et al., 2002; Zembutsu et al., 2002). One such
variable is the malignant or invasive potential of a tumour, as this can influence how a
patient is treated. Tumours showing early indications of a more aggressive phenotype
may be treated with a broader range of therapies (radiotherapy/chemotherapy and
surgery), whereas those exhibiting more benign characteristics may be curable by surgery
alone. Where this is the case, the patient is spared the time, expense and significant
morbidity associated with other forms of treatment.
An unexpected challenge that arose during the course of this chapter was the higher
than expected frequency of metastatic tumours to the ovary diagnosed as primary EOC.
During the pathology review process, the original diagnosis of several cases was queried
by the reviewing pathologist, Dr Melissa Robbie. As a result of interrogating the
associated clinical information, revision of diagnostic slides and the comparison of the
gene expression profiles of these queried samples to a large database of other primary
tumour types (Tothill et al., 2005), a proportion of the cohort had to be excluded from
further analysis. The mucinous invasive type of EOC was the most frequently
misdiagnosed, as observed by a number of other studies of these tumours (Ji et al., 2002;
Lee and Scully, 2000; Lee and Young, 2003; Ronnett et al., 2004).
An improved understanding of the molecular differences between invasive and non-
invasive forms of common cancer types, such as EOC, has the potential to assist
clinicians in making important treatment decisions. Exploring the functions of those
genes differentially expressed between tumour subtypes may also lead to the discovery of
specific genes that can be therapeutically manipulated to reduce the malignant potential
of a tumour, thus increasing the efficacy of other treatments (Alizadeh et al., 2000; Liotta
and Petricoin, 2000).
This chapter describes the development of a bioinformatic approach for using cDNA
microarray data, gene ontology analysis and pathway discovery to identify key molecular
events whose aberrant regulation may be responsible for clinically important phenotypic
differences. The relationship of these molecular differences to other models of cancer
progression and invasion is also investigated. This was achieved by comparing the gene
expression signature of LMP or invasive EOC to other published and in-house microarray
analyses of biological relevance. Studies compared include those where gene expression
related to the invasion of breast (van 't Veer et al., 2002), prostate (Singh et al., 2002) and
gastrointestinal tract (Boussioutas et al., 2003) carcinoma were identified. As well as
these studies of other cancer types, the expression profile that distinguishes between LMP
and invasive EOC was technically validated by comparison to other recently published
microarray-based studies of EOC malignancy (Gilks et al., 2005; Schwartz et al., 2002).
To biologically validate the gene expression differences found, two methods were used –
RT-PCR and immunohistochemistry. Both techniques were applied to samples
independent of those in the cohort used to generate the LMP/invasive expression
signature, an important step in genomic profiling of any disease type to ensure the
findings are widely applicable on a population level and not restricted to one particular
group of patients (King and Sinha, 2001; Liotta and Petricoin, 2000; Simon et al., 2003b).
Appropriate paraffin-embedded specimens of EOC, confirmed by pathology review, were
used to create two tissue microarrays (TMA). These were used for technical and
biological validation of the microarray signature through IHC analysis with labelled
antibodies specific to the protein product of several genes identified as having differential
expression levels between the EOC subtypes.
A method for capturing and objectively analysing large collections of IHC data using
commercially available image-processing and statistical software is described in the
validation section of this chapter. This technique was used to quantify the differences in
staining intensities for a range of antibodies used to identify specific proteins
corresponding to a selection of genes differentially expressed between the LMP and
invasive tumour types.
[Figure 5-1 flowchart boxes: Identify suitable cases of EOC from AOCS cohort and Peter Mac Tissue bank; Pathology review of specimens to confirm original diagnosis; Process tissue and hybridise RNA to cDNA microarrays; Analyse patterns of gene expression; Investigate gene ontology and pathways represented by differentially expressed genes; Explore relationships to other in-house and published studies of invasive/non-invasive cancers and also EOC; Technically validate expression of differentially expressed genes with RT-PCR; Biologically validate expression of differentially expressed genes with RT-PCR on independent samples; Biologically validate expression of differentially expressed genes with qIHC.]
Figure 5-1: Overview of gene expression based analysis of LMP and invasive EOC; from identification of suitable samples through to quantitative immunohistochemistry (qIHC) on tissue microarrays. Dashed line between pathology review and analysis of gene expression indicates the parallel nature of these two stages. A number of samples were found to be metastatic rather than primary tumours and excluded from the study to avoid contaminating the dataset with non primary EOC gene expression information.
5.2. Results
5.2.1. Case selection and pathology review of suitable cases
Patients diagnosed with mucinous or serous EOC were identified from the AOCS and
Peter Mac Tissue Bank databases. H&E stained sections were inspected by either Dr
Melissa Robbie or Dr Paul Waring to confirm the specimen of tumour available matched
the diagnosis given. As the ratio of tumour to non-tumour cells present in a specimen is
important for microarray work, an objective assessment of tumour content was made.
Comments from the review of each case are summarised in Table 5-1. Unless otherwise
stated, percentage tumour was judged to be sufficient for microarray analysis (>50%
tumour cell content by assessment of the number of tumour cell nuclei present per high-
power field, as described in Materials and Method).
A number of samples originally classified as primary EOC were questioned by one or
both of the pathologists during the review process, based on inspection of the original
pathology report and the corresponding H&E stained section. Where possible, the full
range of diagnostic slides was called in for further information, as well as other clinical
notes on the patient available from the treating hospital. Where no further information
was available or slides could not be obtained, the specimen in question was excluded
from further analysis so as to avoid contamination of the dataset with poor quality or non-
primary ovarian tumour material. Identification of non-primary or metastatic tumours to
the ovary is crucial for a study of microarray gene expression data. It has been
demonstrated that tumour metastases maintain the expression profile of their originating
tissue, therefore not identifying and excluding such specimens may potentially
contaminate a dataset with gene expression patterns of tissue type other than the one
being investigated (Su et al., 2001). Further analysis of the relationship between
metastatic disease and the tissue it originated from has revealed a molecular signature that
can discriminate between metastases and their primary tumours, however the
predominant gene expression profile obtained from these microarray analyses was
observed to be that of the primary tissue (Ramaswamy et al., 2003).
Four specimens of mucinous invasive, 20 mucinous LMP, 12 serous invasive and 19
serous LMP EOC were reviewed and profiled by cDNA microarray analysis (total no. =
55), as shown in Table 5-1.
Table 5-1: Pathology information for original EOC cohort (n=55).
Patient ID | Subtype | Invasive/LMP | Grade | FIGO Stage | Comments from reviewing pathologist
93.064 | Mucinous | Invasive | 1 | 1C | -
90.007 | Mucinous | Invasive | 2 | 1C | Complex, atypia, necrosis consistent with well differentiated invasive carcinoma
94.036 | Mucinous | Invasive | 3 | 3C | Omental spread noted in pathology report.
94.112 | Mucinous | Invasive | 3 | | Suspicious as to LMP status. Small glands but minimal atypia, no stromal reaction
93.002 | Mucinous | LMP | | LMP | Focal serous areas
P00627 | Mucinous | LMP | 1 | LMP | -
P00784 | Mucinous | LMP | 1 | LMP | Originally listed as invasive - path report and array analysis indicate LMP. Possibly sampling/biopsy site issue
P00934 | Mucinous | LMP | 1 | LMP | Mostly benign sample, small area of LMP foci
WM223 | Mucinous | LMP | 1 | LMP | -
WM438 | Mucinous | LMP | 1 | LMP | Path review indicated metastatic colorectal disease
P00488 | Mucinous | LMP | 3 | LMP | Path report suggests sample is metastatic from appendix (hyperplastic polyp)
92.011 | Mucinous | LMP | 5 | LMP | -
93.077 | Mucinous | LMP | 5 | LMP | Original report: mentioned tumour had bosselated features, i.e. rounded nodules on surface
94.030 | Mucinous | LMP | 5 | LMP | Original report mentions tumour on broad ligament
94.072 | Mucinous | LMP | 5 | LMP | -
94.080 | Mucinous | LMP | 5 | LMP | Very little tumour present on H&E slide, predominantly stroma
93.085 | Mucinous | LMP | 6 | LMP | Report says metastatic & review indicates less than 1% tumour content in H&E section
44247 | Mucinous | LMP | 9 | LMP | High grade LMP tumour - may exhibit molecular characteristics similar to invasive specimen
51026 | Mucinous | LMP | 9 | LMP | -
51030 | Mucinous | LMP | 9 | LMP | -
P00718 | Mucinous | LMP | 9 | LMP | No tumour in H&E section, necrotic cyst
P00807 | Mucinous | LMP | 9 | LMP | -
P00935 | Mucinous | LMP | 9 | LMP | Sparse tumour cells, original report states mostly benign, small LMP foci
WM439A | Mucinous | LMP | 9 | LMP | Path report indicates sample is metastatic from appendix. Deposit on uterus also noted
93.117 | Serous | Invasive | | 3B | -
86.058 | Serous | Invasive | 1 | 2C | Original pathology mentions Psammoma bodies
85.064 | Serous | Invasive | 2 | 3 | -
91.007 | Serous | Invasive | 2 | 3C | -
91.039 | Serous | Invasive | 2 | 1A | Sample is endometrioid upon review of H&E stained section
91.052 | Serous | Invasive | 2 | 2B | -
93.004 | Serous | Invasive | 2 | 3B | Original report mentions existence of primary peritoneal tumour
93.001 | Serous | Invasive | 3 | 3C | -
94.017 | Serous | Invasive | 3 | 2B | Focal TCC, Psammoma bodies (i.e. areas of calcification), potentially related to sarcoma
94.127 | Serous | Invasive | 3 | 3C | -
P00756 | Serous | Invasive | 3 | | -
93.131 | Serous | Invasive | 9 | 3A | -
92.014 | Serous | LMP | | LMP | -
92.018 | Serous | LMP | | LMP | -
93.007 | Serous | LMP | | LMP | -
93.079 | Serous | LMP | | LMP | -
94.046 | Serous | LMP | | LMP | -
95.006 | Serous | LMP | | LMP | -
93.073 | Serous | LMP | 0 | LMP | -
90.037 | Serous | LMP | 5 | LMP | -
91.077 | Serous | LMP | 5 | LMP | -
93.090 | Serous | LMP | 5 | LMP | -
90.063 | Serous | LMP | 9 | LMP | -
22027 | Serous | LMP | 9 | LMP | -
44232 | Serous | LMP | 9 | LMP | -
70056 | Serous | LMP | 9 | LMP | -
70057 | Serous | LMP | 9 | LMP | -
P00633 | Serous | LMP | 9 | LMP | -
WM389A | Serous | LMP | 9 | LMP | -
WM542A | Serous | LMP | 9 | LMP | -
WM578A | Serous | LMP | 9 | LMP | Potential seromucinous case. Invasive implants
5.2.2. Generation of cDNA microarray expression dataset
Fresh-frozen pathology-reviewed biopsies from selected cases of EOC were processed by
Dileepa Diyagama according to standard protocols, as previously described (Boussioutas
et al., 2003). Amplified RNA was hybridised to 10.5k cDNA microarray slides using the
pooled cell line reference as described in Materials and Methods section 2.3.3.1 and
by Sambrook and Bowtell (2003). Hybridised arrays were scanned
on an Agilent Microarray Scanner BA and the scanned images were converted to gene
expression ratios with Axon GenePix image analysis software. Individual array features
were marked or flagged as either ‘present’, ‘marginal’ or ‘absent’ based on predetermined
quality control settings, described in section 2.4.2, to exclude hybridisation artefacts and
poor quality array features from downstream analysis.
MMT scores were calculated to assess the level of spatial bias present in the microarray
data generated for this study as described in Chapter 3. All scores were in the acceptable
range of <200, as determined by calibration of this test to the Peter Mac 10.5k cDNA
platform. During this work, an online tool that facilitated the batch-wise normalisation of
microarray data using the print-tip normalisation method was released (Herrero et al.,
2003; Vaquerizas et al., 2004). This method uses the lowess method of curve fitting to
correct for bias associated with the individual printing pins used to spot the cDNA
material onto the glass substrate (Sambrook and Bowtell, 2003). Over the course of this
thesis, this method of spatially-dependent normalisation has been adopted by the
microarray field more widely than the SNOMAD method previously described (section 2.4.3).
As both methods are based on similar principles, the print-tip method was selected for
this chapter, described in Materials & Methods section 2.4.3.3.
Figure 5-2: Overview of gene expression based predictions used in association with the pathology sample review process. (i) Initially each specimen of EOC (n=55) was compared to a database of nine other tumour types to identify any samples of non-primary ovarian origin. (ii) Samples confirmed as primary EOC were then analysed to confirm their histological subtype. This was a concern for specimens where the pathology report indicated the specimen contained regions of mixed histological subtype. (iii) The invasive or LMP phenotype of each microarray profile was then predicted based on LOOCV. This assumes that the majority of samples in the cohort were originally correctly classified at this level. Discrepancies can exist between the official pathology classification of a specimen and that obtained from this process because of sampling bias and heterogeneity within individual tumours. Genes were re-selected at each iteration of LOOCV, for all three levels of classification (Simon et al., 2003b).
[Flowchart: Predict tissue of origin. (i) Predict primary ovarian vs. nine primary tumour classes; samples predicted as 'Other' are excluded from study. (ii) Predict histological subtype (Mucinous/Serous/Other) of the ovarian samples; samples predicted as 'Other' are excluded from study. (iiia) Predict invasive/LMP status of mucinous samples, giving (iv-a) confirmed as primary mucinous LMP EOC or (iv-b) confirmed as primary mucinous invasive EOC. (iiib) Predict invasive/LMP status of serous samples, giving (iv-c) confirmed as primary serous LMP EOC or (iv-d) confirmed as primary serous invasive EOC.]
5.2.3. Creation of an EOC gene expression signature for assistance in confirmation of primary ovarian origin
In order to assist in the pathology review process and confirm some of the observations
and classifications made, predictive algorithms were trained using gene expression data
from this and other studies and applied to all cases in the study. In a hierarchical process,
samples were first analysed to confirm their primary ovarian origin, as previously
described. To achieve this, a classifier was created using gene expression data provided
by Richard Tothill representing over nine different tumour types (Tothill et al., 2005). An
overview of this process is shown in Figure 5-2.
5.2.3.1. Selection of cases for use in training set for the prediction of primary ovarian origin
To build the first classifier in the predictive pathway for this study, raw gene expression
data for specimens of nine types of primary carcinoma were obtained from a parallel
study at the Peter MacCallum Cancer Centre into carcinoma of unknown primary (Tothill
et al., 2005). For some tumour types, a larger number of samples were available than
could be used in the training process. In this case those samples with the highest
percentage tumour and pathology review agreement with the original diagnosis were
selected. A full list of the samples used in this analysis and associated pathology review
comments is given in Appendix I.
To train the first level binary predictor, capable of separating ovarian tumours from other
tumour types, two groups of samples were created: (i) 19 confirmed primary ovarian
tumours from the carcinoma of unknown primary project as described and (ii) 115
samples representing nine other tumour types as summarised in Figure 5-3. These
included lung, breast, colorectal, gastric, renal, melanoma, uterine, SCC and pancreatic
cancers, representing the most common origins of metastatic disease found in the ovary
(Blaustein, 1982; Fujiwara et al., 1995; Giordano et al., 2001).
Figure 5-3: Pie chart of tumour types and sample numbers used to train a range of predictive algorithms to identify primary EOC. Tumour types were selected based on the most frequently observed origins of metastatic disease found in the ovary, according to literature reports (Blaustein, 1982; Fujiwara et al., 1995; Giordano et al., 2001). Individual specimens were selected from the total pool available based on highest tumour percentage and pathology review agreement with the original diagnosis (Tothill et al., 2005).
[Pie chart segments (tumour type, number of samples): Lung, 20; Breast, 20; Ovarian, 19; Colorectal, 16; Gastric, 15; Renal, 10; Melanoma, 10; Uterine, 9; SCC, 9; Pancreas, 6.]
5.2.3.2. Algorithm training for the gene expression based prediction of ovarian vs. non-ovarian primary origin
An unsupervised data-reduction filter was first applied to the training set to remove genes
not expressed at detectable levels, or not significantly varying across the dataset relative
to the median level of variation present. After removing any gene with (i) no signal in
50% or more of the samples or (ii) a log-ratio variation p-value of 0.001 or greater, a list of
2,907 genes was left for further supervised analyses.
Next a range of algorithms were trained to distinguish between primary EOC and the
group of nine other tumour types using methods described in Materials and Methods
section 2.4.8. After the most significantly predictive subset of these 2,907 genes had been
identified by LOOCV of the training set, the trained algorithms were applied to the 54
gene expression profiles specifically generated for the comparison of LMP and invasive
EOC. These data were not included in the selection of the ‘EOC:other’ predictive genes.
A combination of algorithms, implemented in the BRB ArrayTools analysis package,
was used to classify each of the 54 profiles as either primary EOC or not (Simon and
Lam). These algorithms were LDA, 1-NN, 3-NN, and NCC, as described in Materials and
Methods. By using multiple algorithms, the primary EOC status of each sample was
predicted four independent times and it was possible to identify the algorithm most suited
to this form of classification.
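The LOOCV procedure referred to above, in which genes are re-selected using only the training samples of each iteration, can be sketched as follows. This is a minimal illustration and not the BRB ArrayTools implementation: the t-statistic gene ranking, the fixed number of selected genes, the function names and the toy expression values are all assumptions made for the example.

```python
import math

def t_score(values_a, values_b):
    """Absolute two-sample t statistic (unequal variances) for one gene."""
    mean_a = sum(values_a) / len(values_a)
    mean_b = sum(values_b) / len(values_b)
    var_a = sum((v - mean_a) ** 2 for v in values_a) / (len(values_a) - 1)
    var_b = sum((v - mean_b) ** 2 for v in values_b) / (len(values_b) - 1)
    se = math.sqrt(var_a / len(values_a) + var_b / len(values_b))
    return abs(mean_a - mean_b) / se if se > 0 else 0.0

def select_genes(profiles, labels, n_genes):
    """Rank genes by t score using the supplied (training) samples only."""
    classes = sorted(set(labels))
    scores = []
    for g in range(len(profiles[0])):
        a = [p[g] for p, lab in zip(profiles, labels) if lab == classes[0]]
        b = [p[g] for p, lab in zip(profiles, labels) if lab == classes[1]]
        scores.append((t_score(a, b), g))
    return [g for _, g in sorted(scores, reverse=True)[:n_genes]]

def predict_1nn(train, train_labels, sample, genes):
    """Label of the nearest training profile (squared Euclidean distance)."""
    def dist(p):
        return sum((p[g] - sample[g]) ** 2 for g in genes)
    nearest = min(range(len(train)), key=lambda i: dist(train[i]))
    return train_labels[nearest]

def loocv_accuracy(profiles, labels, n_genes):
    """Leave each sample out in turn; gene selection is repeated per fold."""
    correct = 0
    for i in range(len(profiles)):
        train = profiles[:i] + profiles[i + 1:]
        train_labels = labels[:i] + labels[i + 1:]
        genes = select_genes(train, train_labels, n_genes)
        if predict_1nn(train, train_labels, profiles[i], genes) == labels[i]:
            correct += 1
    return correct / len(profiles)

# Hypothetical log-ratio profiles (3 genes); gene 0 separates the classes.
toy_profiles = [
    [2.0, 0.1, 1.0], [2.2, -0.1, 0.9], [1.9, 0.0, 1.1], [2.1, 0.2, 1.0],
    [0.1, 0.0, 1.0], [-0.2, 0.1, 1.1], [0.0, -0.1, 0.9], [0.2, 0.1, 1.0],
]
toy_labels = ["ovarian"] * 4 + ["other"] * 4
toy_accuracy = loocv_accuracy(toy_profiles, toy_labels, n_genes=1)
```

Selecting genes inside the loop, rather than once on the full dataset, keeps the left-out sample from influencing the gene list used to classify it.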
All four algorithms were highly accurate in their ability to assign samples from the
training set into their correct class. On average the classification accuracy observed was
98.3%. As the gene selection process is repeated for each cycle of the LOOCV, a slightly
different number of genes may be used by the algorithm for each classification. The mean
number of genes required for these predictions was 213. A summary of the predictions
made on the training set of samples for each algorithm is shown in Table 5-2. Several
important criteria are given to evaluate the performance of each algorithm. These are:
• Sensitivity: True positive rate – the probability of predicting a true primary
ovarian sample as ‘ovarian’
• Specificity: True negative rate – the probability of predicting a non-ovarian
sample as ‘non-ovarian’
• Positive predictive value (PPV): The probability a sample is actually primary
ovarian cancer, if given an ‘ovarian’ prediction by the algorithm.
• Negative predictive value (NPV): The probability that a sample is NOT
ovarian if predicted as ‘non-ovarian’
The closer these four values are to 1.0, the greater the accuracy of the algorithm and the
lower the chance of false negative or false positive predictions being made. The 1-NN
algorithm generated only one misclassification from the 134 samples tested and
consequently produced the optimal sensitivity, specificity, PPV and NPV. In general, all
algorithms trialled performed with a high degree of accuracy, comparable or superior to
several published analyses of tumour origin on the basis of molecular profiling.
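The four criteria defined above follow directly from the counts of true and false predictions. The sketch below shows the arithmetic; the function name and the example counts are hypothetical. The source states only that 1-NN produced one misclassification among 134 samples (19 ovarian, 115 other), so placing that single error in the ovarian class is an assumption made purely for illustration.

```python
def classifier_metrics(tp, fn, tn, fp):
    """Sensitivity, specificity, PPV and NPV from confusion-matrix counts.

    tp: ovarian samples predicted 'ovarian'
    fn: ovarian samples predicted 'non-ovarian'
    tn: non-ovarian samples predicted 'non-ovarian'
    fp: non-ovarian samples predicted 'ovarian'
    """
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }

# Hypothetical split of a single error: one ovarian sample called non-ovarian.
example = classifier_metrics(tp=18, fn=1, tn=115, fp=0)
```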
Table 5-2: Performance of the machine learning classifiers in predicting the primary ovarian origin of a given specimen, as determined by LOOCV.
5.2.3.3. Permutation analysis to assess the statistical significance of predictions
In order to determine whether the error rate reported by the LOOCV experiments was
significantly lower than one would expect from random predictions, permutation analysis
was carried out. This involved randomly permuting the class labels of the samples
and repeating the entire LOOCV process. The proportion of random permutations that
produced a cross-validated error rate no higher than that obtained when the correct class labels
were assigned was used to determine the significance level of each predictor. A total of
2,000 permutations of the class labels, each followed by the full predictive analysis, were carried
out per algorithm to ensure adequate sampling. Gene selection was repeated for each trial. The p-value
for each predictor was < 5 × 10⁻⁴, confirming that the predictions made by these
algorithms are not based on random noise within the dataset.
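The permutation procedure described above can be sketched as below. This is a simplified illustration under stated assumptions, not the analysis actually run: a nearest-centroid classifier stands in for the four algorithms, gene selection is omitted, only 200 permutations are used, and the two-gene toy data and all function names are hypothetical.

```python
import random

def loocv_error(profiles, labels):
    """LOOCV error rate of a nearest-centroid classifier (no gene selection)."""
    errors = 0
    for i, sample in enumerate(profiles):
        centroids = {}
        for cls in sorted(set(labels)):
            members = [p for j, (p, lab) in enumerate(zip(profiles, labels))
                       if j != i and lab == cls]
            if members:
                centroids[cls] = [sum(col) / len(members) for col in zip(*members)]
        predicted = min(
            centroids,
            key=lambda c: sum((x - m) ** 2 for x, m in zip(sample, centroids[c])),
        )
        errors += predicted != labels[i]
    return errors / len(profiles)

def permutation_p_value(profiles, labels, n_permutations=200, seed=0):
    """Proportion of label permutations whose LOOCV error is no higher
    than the error obtained with the correct labels."""
    rng = random.Random(seed)
    observed = loocv_error(profiles, labels)
    shuffled = list(labels)
    as_good = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if loocv_error(profiles, shuffled) <= observed:
            as_good += 1
    return observed, as_good / n_permutations

# Hypothetical, well-separated two-class data (two genes per sample).
perm_profiles = [
    [2.0, 1.9], [2.1, 2.2], [1.8, 2.0], [2.2, 2.1], [1.9, 1.8],
    [0.0, 0.1], [0.2, -0.1], [-0.1, 0.0], [0.1, 0.2], [-0.2, -0.1],
]
perm_labels = ["ovarian"] * 5 + ["other"] * 5
observed_error, p_value = permutation_p_value(perm_profiles, perm_labels)
```

With 2,000 permutations, as used in this chapter, the smallest non-zero proportion that can be observed is 1/2,000, which is why a predictor never matched by a permuted run is reported as p < 5 × 10⁻⁴.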
5.2.3.4. Use of multiple algorithms and LOOCV to predict primary ovarian origin of specimens
One benefit of using multiple predictive algorithms is that each serves as an
‘independent’ attempt at classification. Prediction results can be compared to determine
whether a sample was incorrectly predicted on a single occasion or by more than one
algorithm. The misclassified samples from the training set are shown in Table 5-3, along
with the number of misclassifications. Comments based on further interrogation of the
pathology associated with these cases are also given. It is important to investigate
incorrect classifications to determine if a particular class is frequently being misclassified,
which may indicate a problem with the quality of the expression data for that class or
inadequate number of samples to generate a robust predictive signature.
These results indicate that this approach is effective for identifying samples that may have
been incorrectly diagnosed as EOC or potentially mislabelled during the experimental or
data analysis stages of the experiment.
Both samples of endometrioid uterine cancer present in the training set were predicted as
EOC. This cancer arises from the endometrial lining of the uterus, and the classification of these samples
as ovarian in this analysis suggests these tumours may have a higher degree of molecular
similarity to EOC than to the other tumour types present.
Of the four samples of EOC assigned to the non-ovarian category, review of the
pathology revealed one case of Malignant Mixed Mullerian type, one case of
endometrioid cancer and one case of metastatic colorectal cancer, all of which had been incorrectly assigned
to the ovarian category. No discrepancy between the original pathology report and the review
information was found for the remaining incorrectly classified sample. This tumour was
of mucinous LMP histology.
Table 5-3: Predictions and associated pathology review comments for samples resulting in incorrect classifications during the algorithm training stage.
Sample ID | Class used for algorithm training | Details about case available at time of analysis | No. times incorrectly predicted | Comments
UP415 | Other | Endometrioid uterine cancer | 1 | Only two samples of endometrioid cancer included in training set
UP421 | Other | Endometrioid uterine cancer | 2 | Only two samples of endometrioid cancer included in training set
UP075 | Ovarian | Mucinous LMP EOC | 2 | No discrepancy between pathology review and original diagnostic report noted.
UP146 | Ovarian | Serous EOC | 3 | Sample appears to be a Malignant Mixed Mullerian Tumour (MMMT) of the Ovary
UP165 | Ovarian | Serous EOC | 1 | Sample initially diagnosed as serous but appears to be endometrioid upon review.
UP286 | Ovarian | Mucinous LMP EOC | 5 | Sample is metastatic colorectal tissue and not primary EOC on review of pathology report and H&E stained section
5.2.4. Application of the trained predictive algorithms to the invasive/LMP dataset
Following the algorithm training stage and development of the 231 gene predictor of
EOC primary origin, predictive analysis of all prospective studies for this project was
carried out. Thirty-nine of the 55 samples analysed by this method were predicted as being
primary EOC by a majority of the algorithms used (71%). The prediction details for each
specimen are given in Appendix J.
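A majority call across the four classifiers, as used above to summarise the predictions for each sample, can be sketched as follows. The function name, the example predictions and the handling of a two-against-two split (reported here as 'no majority') are assumptions for illustration; the source does not state how ties were resolved.

```python
from collections import Counter

def majority_call(predictions):
    """Return the label given by more than half of the algorithms,
    or 'no majority' if no label reaches that threshold."""
    label, count = Counter(predictions.values()).most_common(1)[0]
    return label if count > len(predictions) / 2 else "no majority"

# Hypothetical predictions for two samples from the four algorithms.
clear_case = majority_call(
    {"LDA": "ovarian", "1-NN": "ovarian", "3-NN": "other", "NCC": "ovarian"}
)
split_case = majority_call(
    {"LDA": "ovarian", "1-NN": "other", "3-NN": "other", "NCC": "ovarian"}
)
```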
Those samples predicted as non-ovarian by one or more of the algorithms used are listed in Table
5-4, together with comments from the pathology review process, which was carried out
completely independently of the gene expression based analysis. Array scores for the EOC
markers cytokeratin 7 and 20 are also given where data were present for these genes.
High levels of cytokeratin 20 and low levels of cytokeratin 7 are suggestive of metastatic
disease; however, immunohistochemistry is usually performed to determine their relative
abundance (Ji et al., 2002; Loy et al., 1996; Nishizuka et al., 2003). The information
obtained from the pathology review of the majority of these specimens agreed with their
microarray based prediction of non-primary ovarian origin, confirming the validity of
gene expression based predictions of tumour origins for metastatic disease (Ramaswamy
et al., 2003; Ramaswamy et al., 2001; Tothill et al., 2005).
After the predictions of primary origin had been carried out and compared to the
information obtained from the pathology review process, it became evident that the
invasive mucinous subtype of EOC is the most frequently misclassified or most subject to
sampling bias, of those included in this analysis. Consultation of the literature confirmed
this observation (Hess et al., 2004; Lee and Scully, 2000; Lee and Young, 2003; Ronnett
et al., 2004; Seidman et al., 2003). Only one case of serous carcinoma was predicted to be
non-ovarian by the expression-based classifiers. Upon review of the H&E stained section
associated with the case it was determined to be of endometrioid histology, a subtype not
intended to be included in this study.
After excluding samples that did not pass the pathology review process, all of which were
predicted to be non-ovarian based on their expression profile, the breakdown of samples
remaining in the cohort was: mucinous invasive EOC (2), mucinous LMP (14), serous
invasive (11) and serous LMP (19), for a total number of 46 cases.
Table 5-4: Samples of LMP or invasive EOC predicted as non-ovarian by at least one algorithm. Comments from pathology review listed with gene expression ratios for cytokeratin markers 7 (CK7) and 20 (CK20), which are used diagnostically for identifying metastatic disease to the ovary.
Patient ID | Histological subtype | Invasive/LMP | Comment | Array CK7 / Array CK20 | Excluded from study
90.007 | Mucinous | Invasive | Complex, atypia, necrosis consistent with well differentiated invasive carcinoma | 3.08 / 0.74 | Yes
91.039 | Serous | Invasive | Sample is endometrioid upon review of H&E section | 0.93 / 1.48 | Yes
94.036 | Mucinous | Invasive | Omental spread indicated in path report - suggestive of metastatic disease | 1.20 / 23.37 | Yes
51026 | Mucinous | LMP | Nothing suspicious in path report or on review of H&E section | 3.64 / 1.61 |
51030 | Mucinous | LMP | High array CK20 - suggestive of colorectal metastases. | 1.82 / 24.87 |
92.011 | Mucinous | LMP | Nothing suspicious in path report or on review of H&E section | 18.02 |
93.002 | Mucinous | LMP | Nothing suspicious in path report or on review of H&E section | 2.05 |
93.085 | Mucinous | LMP | Report says tumour may be metastatic and review indicates section is approximately 1% tumour content | 1.36 | Yes
94.072 | Mucinous | LMP | Nothing suspicious in path report or on review of H&E section | 3.21 |
94.080 | Mucinous | LMP | Very little tumour present on H&E slide + high array CK20. | 12.65 | Yes
P00488 | Mucinous | LMP | Path report indicates sample is metastatic from appendix | 37.58 | Yes
P00627 | Mucinous | LMP | Nothing suspicious in path report or on review of H&E section | 1.07 |
P00718 | Mucinous | LMP | No tumour in H&E section | 4.26 | Yes
P00807 | Mucinous | LMP | Nothing suspicious in path report - request CK7 CK20. High array CK20. | 6.79 |
WM438 | Mucinous | LMP | Path review indicated metastatic colorectal disease present | | Yes
WM439A | Mucinous | LMP | Path report indicates sample is metastatic from appendix | | Yes
5.2.4.1. Hierarchical clustering analysis of the 231 gene EOC expression signature
A list of 231 genes was chosen by the classifier for prediction of ‘test’ samples based on
the number of times each gene was selected in the LOOCV iterations and its individual
parametric p-value for discriminating between ovarian and ‘other’ tumours of 9 varieties.
This list of genes was annotated using the UniGene database (Build #184) and is provided
in Appendix K. The genes are discussed in section 5.2.5.2.
Hierarchical clustering (as described in section 2.4.4.1) using the genes identified by
LOOCV allowed inspection of differences in their expression levels between ovarian and
the other tumour types. The resulting cluster image is shown in Figure 5-4. A clear divide
between ovarian and non-ovarian tissues can be observed, which is expected from a
supervised clustering analysis. It is interesting to note however, that despite the binary
method that was used to select these 231 genes, i.e. EOC or non-EOC, the other nine
tumour types have formed sub-clusters corresponding to their site of origin. By inspecting
the body of the cluster figure it is possible to observe patterns of up and down regulation
that have resulted in the grouping of samples belonging to the same tumour type,
particularly for the breast, colorectal, uterine and renal cancers which have formed
discrete sub-clusters. This observation implies that tissue-specific patterns exist in gene
expression data at several levels. By design, the largest and most significant difference in
the expression of these 231 genes is between EOC and the other nine tumour types
grouped as a single class. Despite this, inspection of the hierarchical cluster generated
shows clear tissue or site-specific patterns of expression.
Inspection of the relative positioning of ovarian samples in Figure 5-4 revealed a
histologically driven dendrogram structure. An expanded view of this section is shown in
Figure 5-5 with the colour bar below the cluster corresponding to histological subtype of
the tumours. Again, the intrinsic molecular differences between the predominant
EOC subtypes present (serous and mucinous) appear to have created a biologically
relevant sub-structure, despite these samples being treated as a single class for the gene
selection and evaluation section of the analysis.
In general, this molecular signature is capable of distinguishing EOC from the nine other
tumour types tested as well as identifying a sample as either mucinous or serous with a
high degree of accuracy.
Figure 5-4: Supervised hierarchical cluster of ovarian and nine other classes of primary carcinomas using 231 differentially expressed genes (p<0.001). Clear separation of ovarian (green section of colour bar at base of cluster) and the other nine tumour types can be observed on the basis of these selected genes. Even though a binary comparison of EOC vs. non-EOC was used to generate the gene list, grouping of the non-EOC portion of the cluster is according to tumour types. This indicates that as well as distinguishing EOC from other tissue types, these genes have expression characteristics unique to other types of cancer.
[Figure 5-4 legend. Colour bar classes: Lung, Ovarian, Breast, Colorectal, Melanoma, Uterine, Gastric, Pancreas, Renal, SCC. Colour scale: expression ratio, 0.3 to 3.0.]
Figure 5-5: Expansion of EOC tree from supervised cluster of EOC-predictive genes (Figure 5-4). EOC specimens have clustered according to histological subtype even though genes were selected for differential expression between EOC and a group of nine other primary tumour types. This reflects the innate structure within the microarray data that corresponds to known clinical parameters.
[Figure 5-5 legend. EOC subtype colour bar: Mucinous, Serous, Endometrioid. Colour scale: expression ratio, 0.3 to 3.0.]
5.2.4.2. Gene ontology analysis of the 231 gene signature predictive of primary EOC
Gene ontology analysis was carried out to investigate the functional composition of the
genes contained in the list of 231 genes selected by LOOCV. The list of 2,907 genes
generated by unsupervised data-reduction filtering was used as a reference list of genes to
provide a measure of the frequency of genes belonging to a specific ontological group.
The analysis was carried out using the EASE method, which uses a modified Fisher's
exact test to determine statistical significance as described in section 2.4.9.1.
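The idea behind a modified Fisher's exact test of this kind can be sketched as below. This is an illustration under an assumption, not a reproduction of section 2.4.9.1 or of the EASE software: the score is taken to be the one-tailed Fisher exact (hypergeometric) probability recomputed after removing one gene from the list's hits in the category, which makes it more conservative for categories supported by few genes. The function names and the category counts in the example are hypothetical; only the list size (231) and reference size (2,907) come from the text.

```python
from math import comb

def upper_tail(list_hits, list_size, pop_hits, pop_size):
    """One-tailed Fisher exact probability: P(X >= list_hits) when
    list_size genes are drawn from pop_size genes, of which pop_hits
    belong to the category."""
    max_hits = min(list_size, pop_hits)
    total = comb(pop_size, list_size)
    favourable = sum(
        comb(pop_hits, k) * comb(pop_size - pop_hits, list_size - k)
        for k in range(list_hits, max_hits + 1)
    )
    return favourable / total

def ease_like_score(list_hits, list_size, pop_hits, pop_size):
    """Same probability after removing one category gene from the list hits
    (assumed form of the modification; see lead-in)."""
    return upper_tail(max(list_hits - 1, 0), list_size, pop_hits, pop_size)

# Hypothetical category: 20 of the 231 signature genes, 120 of the 2,907
# reference genes.
fisher_p = upper_tail(20, 231, 120, 2907)
ease_p = ease_like_score(20, 231, 120, 2907)
```

Using the 2,907 filtered genes as the reference, rather than every gene on the array, means enrichment is judged only against genes that could have been selected in the first place.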
The biological process ‘development’ ontology is the most significantly represented gene
classification in the EOC signature (EASE score = 0.00292). This second-level ontology
represents genes involved in the developmental progression of an organism over time
(Zeeberg et al., 2003). It contains a number of genes involved in skeletal development,
as well as genes with demonstrated involvement in ovarian malignancy, including BMP5,
ADAMTS4, KLK7 and KLK8 (Dong et al., 2003; Shigemasa et al., 2004). Genes in this
higher-level ontology have a broad range of functions, with the central link being their
involvement in tissue development, possibly representing pathways of progression unique
to EOC.
The ontologies ‘cell adhesion’, ‘basement membrane’, ‘extracellular matrix’ and
‘binding’ represent genes expressed by both the tumour cells and cells of the extracellular
matrix involved in regulation of important adhesion interactions which are characteristic
of EOC development (Gardner et al., 1995; Kim et al., 2003; Patel et al., 2003; Sundfeldt,
2003). Examples include NID2 and LAMB2, both major components of the basement membrane and
important regulators of tumour/stroma interaction and associated growth regulation;
ADAMTS4 and other genes with matrix-degradation properties; and CD31, CD44, CHL1,
CDH2 and other cell-surface expressed molecules involved in the physical attachment of
cancer cells to each other, or to the extracellular matrix (Gardner et al., 1995; Martin et
al., 2003; Sillanpaa et al., 2003). As observed in Chapter 4 and also discussed in later
sections, cell-cell and cell-matrix genes play a crucial role in EOC invasion and
progression and have certain ovarian-specific properties not observed in other cells of the
body. Therefore the observation of these ontologies in the 231 gene signature of EOC is
well supported by the literature as an important class of genes in the development and
progression of this disease type.
The ‘transcription factor activity’ ontology (GO consortium ID: 0003700) contains a
number of known genes whose mutation or altered expression results in tumorigenesis.
These include ESR1, MYCN and WT1 which are capable of altering DNA transcription
leading to over expression of oncogenes or under expression of tumour suppressor genes
relative to a non-cancerous state (Bardin et al., 2004; Lee et al., 2002; Slamon et al.,
1986). Whilst this is a process common to many cancers, the transcription factors
from this gene ontology that are included in the list of 231 genes are expressed at
significantly different levels in EOC relative to the nine other tumour types analysed.
Table 5-5: Gene ontologies significantly represented in the 231 gene signature of EOC. The identification of adhesion, membrane and extracellular matrix ontologies indicates that the nature or extent of extracellular interaction of EOC may be a defining characteristic.
Ontology Genes (listed in order of statistical significance for discriminating between EOC and nine other tumour types)
Next the LMP or invasive status of each specimen was confirmed by gene expression
based prediction, using the same method as for the prediction of histological subtype. The
two samples of mucinous invasive tumour were both predicted as being LMP. This
most likely reflects the inadequate number of this class of tumours present in the dataset
for ANOVA-based gene selection and predictive analysis, rather than true misdiagnoses.
Even prior to obtaining this result, it was apparent that the number of true mucinous
invasive samples available was inadequate for a balanced comparison between invasive
and LMP EOC within the two histological subtypes. As a result, it was decided to focus
this study on the serous subtype only, for which 11 invasive and 19 LMP reviewed and
diagnosis-confirmed samples were available. Mucinous EOC is increasingly being
regarded as a separate entity to the serous type of ovarian cancer on the basis of a
growing body of molecular and clinical evidence (Gilks, 2004; Hess et al., 2004; Lassus
et al., 2001).
5.2.7. Identification of differentially expressed genes between serous LMP and serous invasive EOC
To identify those genes with robust patterns of differential expression between the serous
LMP and invasive tumours, the Significance Analysis of Microarrays (SAM) method was
used (Tusher et al., 2001). An unsupervised filter of log-ratio variation p-value < 0.001
and fewer than 50% missing values was again applied to the normalised data, leaving
2,965 genes available for SAM testing.
In total, 1,302 genes were identified as having significant differential expression between
invasive and LMP serous EOC. The median false discovery rate among the list of 1,302 genes was
0.0088, indicating that fewer than 1% of the genes in this list are expected to have been
selected by chance alone. A SAM plot is shown in Figure 5-6 and reveals the balanced
distribution of up- and down-regulated genes in the list created. Approximately 13% of the
entire microarray and approximately 44% (1,302 of 2,965) of the unsupervised-filtered gene
list was therefore identified as differentially expressed between these tumour subtypes. This
indicates the substantial variation that exists at the molecular level between these classes
of EOC.
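The median false discovery rate can be read as an expected number of false positives in the selected list. The short calculation below is only a worked illustration of that interpretation; the function name is hypothetical.

```python
def expected_false_discoveries(n_selected, fdr):
    """Expected number of genes in the list selected by chance alone."""
    return n_selected * fdr


# 1,302 SAM-selected genes at a median FDR of 0.0088.
n_false = expected_false_discoveries(1302, 0.0088)
print(round(n_false, 1))
```

On these figures, roughly 11 or 12 of the 1,302 genes would be expected to be false discoveries.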
5.2.7.1. Visualisation of expression differences between serous LMP and invasive EOC
Hierarchical clustering was used to visualise the degree of difference between these classes
of tumours on the basis of the genes identified by SAM analysis, as shown in Figure 5-7. The
clear divide between these samples can be observed, as well as the balance between up- and
down-regulated genes selected for each class.
Principal component analysis (PCA) was carried out to further visualise the difference
between LMP and invasive serous tumours represented by these genes. This method
reduces the variation in gene expression values to a small number of ‘principal
components’, each representing a percentage of the total variation present. By plotting the
first three principal components it is possible to visualise the samples being investigated
in three-dimensional space, as shown in Figure 5-8, where the extent of difference
between these tumour classes can again be clearly observed. Visualisation techniques
such as these assist in gauging the extent of the divide between two classes of samples
and in the identification of misclassified samples.
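The idea behind PCA can be sketched in a few lines. The example below is an illustration only, not the software used for Figure 5-8: it uses pure-Python power iteration with deflation so that it has no dependencies, and the toy expression values and the function name `pca` are assumptions.

```python
import math


def pca(samples, n_components=2, iters=200):
    """Return (fraction of variance per component, per-sample scores).

    samples: list of rows, one row per tumour, one column per gene.
    """
    n, p = len(samples), len(samples[0])
    means = [sum(col) / n for col in zip(*samples)]
    x = [[v - m for v, m in zip(row, means)] for row in samples]
    cov = [[sum(r[i] * r[j] for r in x) / (n - 1) for j in range(p)]
           for i in range(p)]
    total = sum(cov[i][i] for i in range(p))
    fractions, axes = [], []
    for _component in range(n_components):
        # Power iteration: repeated multiplication converges on the
        # direction of greatest remaining variance.
        v = [float(i + 1) for i in range(p)]
        norm = math.sqrt(sum(c * c for c in v))
        v = [c / norm for c in v]
        for _step in range(iters):
            w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
            norm = math.sqrt(sum(c * c for c in w))
            if norm == 0.0:
                break
            v = [c / norm for c in w]
        lam = sum(v[i] * cov[i][j] * v[j] for i in range(p) for j in range(p))
        fractions.append(lam / total)
        axes.append(v)
        # Deflation: remove the variance already explained by this axis.
        cov = [[cov[i][j] - lam * v[i] * v[j] for j in range(p)]
               for i in range(p)]
    scores = [[sum(a * b for a, b in zip(row, axis)) for axis in axes]
              for row in x]
    return fractions, scores


# Toy data: three "invasive" and three "LMP" samples, three genes.
toy = [[2.0, 0.1, 0.0], [2.2, -0.1, 0.1], [1.8, 0.0, -0.1],
       [-2.0, 0.1, 0.0], [-2.1, -0.1, 0.1], [-1.9, 0.0, -0.1]]
fractions, scores = pca(toy, n_components=2)
```

In this toy case the first component carries almost all of the variance, and the two groups of samples fall on opposite sides of it.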
Figure 5-6: SAM plot for selection of genes differentially expressed between serous LMP and invasive EOC. Genes indicated by red dots above the threshold (dashed) line represent those over expressed in the invasive EOC cases relative to the LMP tumours. Green dots indicate genes with lower expression in invasive EOC than LMP tumours. 1,302 differentially expressed genes were identified by this approach.
Figure 5-7: Hierarchical clustering of 1302 SAM-selected genes identified as differentially expressed between serous LMP and invasive EOC. A slightly larger proportion of the genes identified by SAM are upregulated in the serous invasive tumours relative to the proportion upregulated in LMP samples.
[Figure 5-7 legend: colour scale shows expression ratio (0.3 to 3.0); colour bar indicates histology (serous invasive or serous LMP).]
Figure 5-8: Principal component analysis of SAM-identified differentially expressed genes between serous LMP and invasive EOC. This three-dimensional view of the three most significant principal components reveals a clear divide between these subtypes of EOC on the basis of the differentially expressed genes identified.
5.2.7.2. Gene ontology analysis of 1302 genes differentially expressed between serous LMP and invasive EOC
Gene ontology analysis was carried out as previously described using the 1302 genes
differentially expressed between LMP and invasive serous EOC. The EASE method was
used for ontology analysis as it takes into account the composition of the array platform
used, the size of each ontology classification and also the potential for co-expression of
genes in computing the significance scores. Table 5-7 lists the classifications with EASE
scores < 0.05. The analysis was carried out to three levels of the gene ontology hierarchy
as recommended by Hosack et al (Hosack et al., 2003).
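The EASE score is a conservative variant of the one-tailed Fisher's exact probability, obtained by removing one gene in the category from the list before testing. The sketch below shows one common formulation of that adjustment; the exact layout of the 2x2 table differs between implementations, so this should be read as an assumption rather than the calculation performed by the EASE software, and the function names and counts are hypothetical.

```python
from math import comb


def hypergeom_upper_tail(k, draws, successes, population):
    """P(X >= k) for X ~ Hypergeometric(population, successes, draws)."""
    upper = min(draws, successes)
    total = comb(population, draws)
    hits = sum(comb(successes, i) * comb(population - successes, draws - i)
               for i in range(max(k, 0), upper + 1))
    return hits / total


def fisher_enrichment(list_hits, list_total, pop_hits, pop_total):
    """One-tailed Fisher's exact probability of over-representation."""
    return hypergeom_upper_tail(list_hits, list_total, pop_hits, pop_total)


def ease_score(list_hits, list_total, pop_hits, pop_total):
    """Jackknifed probability: one category member is removed from the list."""
    if list_hits < 1:
        return 1.0
    return hypergeom_upper_tail(list_hits - 1, list_total - 1,
                                pop_hits, pop_total)


# Illustrative counts only: 3 of 300 list genes fall in a category that
# contains 40 of the 30,000 genes on a hypothetical platform.
p_fisher = fisher_enrichment(3, 300, 40, 30000)
p_ease = ease_score(3, 300, 40, 30000)
```

Because one supporting gene is removed, the EASE score is never smaller than the corresponding Fisher probability, which penalises categories supported by only a few genes.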
Inspection of the significantly represented gene ontologies revealed that a high proportion
of the genes differentially expressed between LMP and invasive EOC are involved in
regulation of the immune response, control of the cell cycle, as well as cell proliferation,
movement and adhesion. The cellular localisation most significantly represented suggests
that a large number of these differentially expressed genes are membrane bound or
expressed in the extracellular matrix.
The most significant biological processes distinguishing serous LMP and invasive EOC
involved genes that regulate the immune system. This gene ontology includes INHBA,
which regulates cell proliferation, has a tumour-suppressing role and facilitates
TGFβ-mediated immunosuppression in thymocytes (T-cell precursors) (Ying and Becker, 1995);
CXCL9, a chemotactic protein for activated T-cells, the presence of which in EOC has
been shown to have a significant prognostic value (Zhang et al., 2003); and GAGEB1,
which codes for a specific antigen recognised by cytolytic T-cells (Van den Eynde et al.,
1995). Genes in this ontology are upregulated in the invasive tumours relative to the LMP
type and represent one aspect of the body’s own response to the detection of invading
tumour cells.
The ontologies representing genes involved in mitosis and proliferation are also highly
significant by EASE analysis. These genes are again predominantly upregulated in the
invasive tumours, reflecting the faster growth rate of this subtype. Differentially
expressed genes representing these ontologies include CCNB2, essential for control of the
cell cycle at the G2/M transition and whose over-expression can lead to chromosomal
instability (Sarafan-Vasseur et al., 2002); ANAPC7, a ubiquitin ligase that controls
progression through mitosis and the G1 phase of the cell cycle (Sarafan-Vasseur et al.,
2002); and PTTG1, which blocks segregation of chromosomes during mitosis and also has
been shown to negatively regulate the transcriptional and related apoptosis activity of
TP53, a known tumour suppressor gene implicated in many cancers (Zhang et al., 1999).
Another distinct theme present in the 1,302 differentially expressed genes is the set of
ontologies representing cell-cell and cell-matrix adhesion genes. Reflecting the cellular
location of a large number of these genes, the cellular component ontologies ‘integral to
plasma membrane’ and ‘extracellular matrix’ are also highly statistically significant. Genes
representing these categories include FN1, a component of the extracellular matrix and
prognostic EOC marker with an important role in the attachment of tumour cells to the
mesothelium (Franke et al., 2003); MSLN, which has recently been shown to bind the
cell-surface EOC marker CA-125 and mediate cell-matrix adhesion (Rump et al., 2004);
CLDN10, which plays a major role in tight junctions, a type of cell adhesion that serves as
a physical barrier to prevent solutes and water from passing freely through the space
between epithelial or endothelial cell sheets (Kubota et al., 1999); and finally CDH11,
one of a large number of calcium-dependent cadherins differentially expressed between
invasive and LMP tumours, required for homophilic cell adhesion (Tanihara et al., 1994).
Table 5-7: Significantly represented gene ontologies in the 1,302 genes differentially expressed between invasive and LMP EOC.
5.2.7.3. Identification of differentially expressed genes significantly representing known molecular (KEGG) pathways
The KEGG database is a comprehensive collection of biological pathways created from
studies of gene-gene interactions in various cellular processes (Kanehisa, 1997; Kanehisa
and Goto, 2000). It presently contains over 24,000 molecular pathways, onto which gene
expression data can be overlaid to gain insight into the outcome of microarray
experiments. Using statistical approaches similar to those described for gene ontology analyses,
the list of genes differentially expressed between serous LMP and invasive EOC was
queried against the KEGG database and those pathways with statistically significant
representation were identified.
The cell-cycle pathway was amongst those represented with high significance (p = 2.81E-5)
and contains a large number of genes that were detected as upregulated in the invasive
tumours compared to those of LMP (Appendix L). Aberrant expression of genes involved
in regulating the cell cycle has been noted in a wide range of tumour types, and their
upregulation in the invasive EOC type corresponds to the known difference in growth rate
and speed of disease progression for this disease as previously described (D'Andrilli et
al., 2004). Over-expression of cyclin D1, for example, has been demonstrated to correlate
with chromosomal instability in breast cancer (Lung et al., 2002) although other research
suggests that distinct pathways of cyclin activation and effect exist between breast and
ovarian cancer (Courjal et al., 1996; D'Andrilli et al., 2004). The mitosis check-point
control gene MAD2 is upregulated in this pathway. Studies in ovarian cancer cell lines
have demonstrated that steady-state amounts of this molecule are required for cells to
maintain proper control of the replication process (Wang et al., 2002).
The complement and coagulation cascade pathway contains several differentially
expressed genes previously linked to ovarian cancer invasion and metastasis, including
thrombin/F2 (Wilhelm et al., 1998) and the PLAU/PLAUR combination (Konecny et al.,
2001; van der Burg et al., 1996). The complement pathway is a crucial part of the body's
immune and inflammatory response to a tumour (Smith and Oi, 1984), which has been
identified as a prognostic factor for EOC (Zhang et al., 2003).
Another significantly represented KEGG pathway in the EOC LMP/invasive signature is
the cytokine-cytokine receptor interaction pathway (p = 5.28E-5). It contains chemokines
such as CXCL9 and CXCL10, molecules also involved in the immune and inflammatory
responses. These genes function particularly in the recruitment of T-cells, which can
infiltrate a malignancy and assist in the body's ability to slow or terminate the
uncontrolled growth by engaging other immune cells (Zhang et al., 2003).
5.2.8. Molecular pathway analysis of the invasive and LMP EOC gene expression signature
Pathway analysis is an emerging technique for mining existing databases of biological
and clinical knowledge to add value to microarray data. This method involves the use of
large databases, created by applying intelligent text-mining algorithms to data
sources such as PubMed. Sources such as this contain decades of medical, clinical and
molecular information; however, much of the information is reliant on human
interpretation of long strings of text (Wheeler et al., 2003). The use of these algorithms on
such data sources has generated an index of over 500,000 biological interactions. These
data exist in a proprietary format (ResNet) that can be queried with standard database
protocols (Daraselia et al., 2004). The pathway analysis tool PathwayAssist (Ariadne
Genomics, USA) was used to explore interactions between the list of 1,302 genes
identified by statistical analysis of the serous LMP and invasive EOC profiles generated
for this study (Nikitin et al., 2003).
Initially the entire list of genes identified as differentially regulated between LMP and
invasive serous EOCs was used to interrogate the ResNet database and generate a network of
interactions. The analysis was restricted to finding direct interactions between ‘nodes’
(i.e. genes) and to only those genes coding for known Homo sapiens proteins. Despite these
restrictions the size of the network returned was prohibitively large for display and
interpretation with the available computing resources (data not shown). This result likely
reflects the representation of ResNet entries involving the dominant gene ontologies
represented in the 1,302 differentially expressed genes: cell cycle regulation, proliferation
and the immune response, as described in section 5.2.7.2.
5.2.8.1. Gene ontology-based filtering of the LMP and invasive EOC gene expression signature
In order to explore the key biological processes represented by the LMP and invasive
EOC signature other than cell cycle regulation, proliferation and the immune response, a
custom database query was written to identify and exclude genes implicated in these
processes. This query was applied to a UniGene-annotated (Build #184) version of the gene
list using Microsoft Access. The details of the query composed are shown in Appendix
M.
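The logic of a keyword-based exclusion of this kind can be sketched as follows. This is not the Microsoft Access query given in Appendix M; the keyword list, the annotation strings and the function name are hypothetical and serve only to show the idea.

```python
# Keywords for the processes to exclude (assumed wording, for illustration).
EXCLUDED_KEYWORDS = ("cell cycle", "proliferation", "immune")


def filter_genes(annotations, excluded=EXCLUDED_KEYWORDS):
    """Return genes whose annotation text mentions none of the keywords."""
    kept = []
    for gene, terms in annotations.items():
        text = " ".join(terms).lower()
        if not any(keyword in text for keyword in excluded):
            kept.append(gene)
    return kept


# Toy annotations: gene symbols appear in this chapter, but the ontology
# strings are simplified placeholders rather than the actual annotations.
toy_annotations = {
    "CCNB2": ["regulation of cell cycle", "mitosis"],
    "CXCL9": ["immune response", "chemotaxis"],
    "CDH11": ["cell adhesion"],
    "CLDN10": ["cell-cell adhesion"],
}
remaining = filter_genes(toy_annotations)
```

A simple substring match like this is sensitive to the exact wording of the annotations, so the choice of keywords determines which genes are retained.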
This filtering resulted in a list of 142 genes with functions other than cell cycle
regulation, proliferation and immune system activity. These are shown in hierarchical
cluster format in Figure 5-9 and listed with annotations in Appendix P. It can be seen
that the expression of this 142-gene subset remains highly distinct between tumour
subtypes.
The large decrease (89%) in the number of differentially expressed genes remaining after
excluding those involved in the cell cycle, proliferation and the immune response
correlates with their degree of over-representation observed from EASE analysis of the
total list. Manual inspection of the annotated set of 142 genes remaining, coupled with the
output of hierarchical clustering, was carried out. This revealed that a large proportion of
this subset of the LMP and invasive EOC signature was involved in cell-cell or
cell-matrix adhesion processes. These adhesion-related genes also appeared to be
differentially expressed at extreme levels between the EOC subtypes, based on the
intensity of the hierarchical clustering colouring. Gene ontology analysis of the 142-gene
subset, again using EASE, confirmed this observation, with the p-values for
over-representation of adhesion-related ontologies being less than 0.001.
This list of 142 genes was then used for pathway analysis and discovery to explore
potential interactions between the individual members and identify other molecular
events that may be differentially regulated between these EOC subtypes.
Figure 5-9: Hierarchical cluster of 142 genes remaining after key-word filtering. Columns indicated by red and blue section of colour-bar correspond to invasive and LMP tumours, respectively. Yellow indicators on right correspond to genes in cell-cell adhesion or cell-matrix adhesion gene ontologies. Gene names and mean expression ratios per class given in Appendix P and on the attached CD-ROM.
[Figure 5-9 legend: colour bar indicates histology (serous invasive or serous LMP); colour scale shows expression ratio (0.3 to 3.0).]
5.2.8.2. Pathway analysis of 142 gene subset of the invasive and LMP EOC expression signature
Using the PathwayAssist interface to the ResNet database of molecular interactions,
all linkages between the 142 ontology-filtered genes differentially expressed between
LMP and invasive EOC were determined. Genes without connection to the main
network were removed and the network arranged for optimal viewing of the relationships
present. Each linkage was manually checked by reading the PubMed abstract from which
it was obtained to confirm that the natural language processor used had interpreted the
information correctly. The final result is shown in Figure 5-10.
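Removing genes that are not connected to the main network amounts to keeping the largest connected component of the interaction graph. The sketch below illustrates that step only; it does not use PathwayAssist or ResNet, the toy edges are a subset of the interactions listed in Table 5-8, and `GENE_X` and `GENE_Y` are hypothetical placeholders for unconnected genes.

```python
from collections import deque


def largest_component(nodes, edges):
    """Return the set of nodes in the largest connected component."""
    adjacency = {node: set() for node in nodes}
    for a, b in edges:
        if a in adjacency and b in adjacency:
            adjacency[a].add(b)
            adjacency[b].add(a)
    seen, best = set(), set()
    for start in nodes:
        if start in seen:
            continue
        # Breadth-first search collects everything reachable from start.
        component, queue = {start}, deque([start])
        while queue:
            current = queue.popleft()
            for neighbour in adjacency[current]:
                if neighbour not in component:
                    component.add(neighbour)
                    queue.append(neighbour)
        seen |= component
        if len(component) > len(best):
            best = component
    return best


genes = ["CDH1", "PLAU", "CD44", "VIL2", "FN1", "LGALS1", "GENE_X", "GENE_Y"]
links = [("CDH1", "PLAU"), ("CDH1", "CD44"), ("CD44", "VIL2"),
         ("FN1", "PLAU"), ("FN1", "CD44"), ("FN1", "LGALS1"),
         ("GENE_X", "GENE_Y")]
main_network = largest_component(genes, links)
```

Here the two placeholder genes interact only with each other and are therefore dropped from the main network.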
For selected genes from the network created with PathwayAssist, expression box plots, a
description of their potential function in EOC malignancy and the other members in the
network with which they are known to interact are shown in Table 5-8. The network
members are coloured according to their mean expression level in invasive EOC.
A hierarchical cluster of genes contained in the network generated was also produced to
visualise the relative consistency of each gene with the two EOC classes (Figure 5-11).
Figure 5-10: Gene expression network created from pathway analysis of 142 keyword-filtered invasive/LMP EOC differentially expressed genes. Regulation of cell-cell and cell-matrix adhesion is a primary function of the majority of these interacting genes with significant differences in expression between LMP and invasive EOC. Genes shown in grey shading were not present in the dataset after filtering, but were added based on literature information.
[Figure 5-10 legend: node colour shows expression ratio (0.3 to 3.0); grey indicates a gene not present on the microarray; network elements: protein, extracellular protein, receptor; connection types.]
Table 5-8: Details of key genes from the differentially expressed LMP/invasive network. Box plots of Log2 expression ratios from the microarray dataset are shown. Left box: invasive EOC; right box: LMP EOC. All genes are differentially expressed at the univariate level (p < 0.001).

Gene: MSLN (Mesothelin)
Interacts with: CA-125
Summary of purported involvement in EOC development and/or progression:
- A cell surface molecule expressed in the mesothelial lining in many tumour cells.
- CA-125 and MSLN are co-expressed in advanced grade ovarian adenocarcinoma.
- Initiates cell attachment to the mesothelial epithelium via binding to mesothelin, contributing to the metastasis of ovarian cancer to the peritoneum.
(Gilks et al., 2005; Hippo et al., 2001; Lu et al., 2004; Rump et al., 2004; Schaner et al., 2003)
[Box plot: Log2 expression ratio of MSLN by class (LMP, invasive).]

Gene: LGALS1 (Galectin 1)
Interacts with: CA-125, FN1
Summary of purported involvement in EOC development and/or progression:
- LGALS1 binds to CA-125.
- Is a component of the extracellular matrix and implicated in the regulation of cell adhesion, apoptosis and tumour progression.
- LGALS1 export to the cell surface may be regulated by CA-125 activity.
(Seelenmeyer et al., 2003)
[Box plot: Log2 expression ratio of LGALS1 by class (LMP, invasive).]

Gene: CDH1 (E-cadherin)
Interacts with: CDH6, CDH11, VIL2, BAIP1, PLAU, CA-125, CD44
Summary of purported involvement in EOC development and/or progression:
- The cell-adhesion molecule CDH1 plays an important role in maintaining tissue integrity.
- Disappearance or impaired function of CDH1 has often been associated with tumour formation and invasion in vivo and in vitro.
- In normal ovaries, the expression of CDH1 is limited to inclusion cysts or deep clefts lined with OSE, whereas no CDH1 staining of the OSE is detected at the ovary surface.
- Benign and borderline EOC tumours uniformly express CDH1.
(Auersperg et al., 1999; Davies et al., 1998; Hiscox and Jiang, 1999; Sasaki et al., 1999; Sundfeldt et al., 1997; Xu and Yu, 2003)
[Box plot: Log2 expression ratio of CDH1 by class (LMP, invasive).]

Gene: FN1 (Fibronectin)
Interacts with: CD44, LGALS1, PLAU, LCP1, CD9
Summary of purported involvement in EOC development and/or progression:
- Involved in adhesion, motility, opsonization, wound healing and maintenance of cell shape.
- Chemotaxis of EOC cells is partially prevented by antibodies against FN1.
- The mesothelium plays an active role in inducing the intraperitoneal spread of EOC cells, and FN1 is one of the main mediators of mesothelium-induced cell motility.
- EOC cells bound to fibronectin are protected from apoptosis when treated with cisplatin or other drugs, including paclitaxel.
(Franke et al., 2003; Rieppi et al., 1999; Zand et al., 2003)
[Box plot: Log2 expression ratio of FN1 by class (LMP, invasive).]

Gene: PLAU (Plasminogen activator, urokinase)
Interacts with: CDH1, FN1
Summary of purported involvement in EOC development and/or progression:
- Over-expression of PLAU or PLAUR is a feature of malignancy and is correlated with tumour progression and metastasis.
- Expression is predictive of patient outcome, particularly when residual disease is present.
- PLAU expression is activated after contact with FN1 and initiation of cell-cell interactions that are mediated by CDH1.
(Foekens et al., 1992; Konecny et al., 2001; Nekarda et al., 1994; Pedersen et al., 1994; Sasaki et al., 1999; Schmitt et al., 1997; van der Burg et al., 1996)
[Box plot: Log2 expression ratio of PLAU by class (LMP, invasive).]

Gene: CD44
Interacts with: VIL2 (ezrin), CDH1, TNFAIP6, CD9
Summary of purported involvement in EOC development and/or progression:
- Ovarian cancer cell adhesion to mesothelium can be inhibited by antibodies against CD44, suggesting a role in mediating cellular adhesion.
- CD44 interacts with members of the ezrin family (ERM family) such as VIL2 and forms a complex with properties that suggest its importance in tumour-endothelium interactions, cell migration, cell adhesion, tumour progression and metastasis.
(Bar et al., 2004; Gardner et al., 1995; Lessan et al., 1999; Martin et al., 2003; Sillanpaa et al., 2003; Xu and Yu, 2003)
[Box plot: Log2 expression ratio of CD44 by class (LMP, invasive).]

Gene: VIL2 (Ezrin)
Interacts with: CDH1, CD44
Summary of purported involvement in EOC development and/or progression:
- The function of proteins such as that encoded by VIL2 is to link the plasma membrane to the actin cytoskeleton.
- Inhibition of VIL2 expression in colorectal cancer cells causes reduced cell-cell adhesiveness together with a gain in motility and invasive behaviour. These cells also displayed increased spreading over matrix-coated surfaces.
- VIL2 regulates cell-cell and cell-matrix adhesion by interacting with the cell adhesion molecules CDH1 and beta-catenin, and is thought to play an important role in the control of adhesion and invasiveness of EOC cells.
(Hiscox and Jiang, 1999; Martin et al., 2003; Shen et al., 2003)
[Box plot: Log2 expression ratio of VIL2 by class (LMP, invasive).]
Figure 5-11: Hierarchical cluster of genes included in adhesion-related network found to be differentially expressed between LMP and invasive EOC. Red and blue sections of colour bar correspond to invasive and LMP tumours, respectively. A high level of consistent differential expression can be seen between the two EOC subtypes for these genes, particularly those upregulated in the LMP tumours, e.g. MSLN and PTTG1.
[Figure 5-11 legend: colour bar indicates histology (serous LMP or serous invasive); colour scale shows expression ratio (0.3 to 3.0).]
5.2.8.3. Comparison of the LMP/invasive gene expression signature to published studies of other invasive/non-invasive cancers
After performing the ontology analysis described previously, there remained a large
proportion of the 1,302 differentially expressed genes whose function or involvement in
these tumours had not been accounted for. To further explore the LMP/invasive gene
expression signature, comparative analysis was carried out using a range of published
datasets and lists of genes found to be differentially expressed between other forms of
human cancers.
Similarity between gene lists was determined by first mapping the lists of differentially
expressed genes from a study of interest to the Peter Mac 10.5k cDNA microarray clone
set. This was done by converting the gene IDs used in the study to current
UniGene cluster IDs (Build #184). The significance of any overlap with the
UniGene-annotated list of LMP/invasive genes was then calculated using a standard Fisher's exact
test to ascertain if the degree of overlap observed could be expected by chance alone.
Table 5-9 summarises those gene lists found to have a significant homology to the
invasive/LMP EOC signature.
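The overlap test described above can be sketched as follows. This is an illustration under stated assumptions rather than the exact procedure used: the one-sided Fisher's exact probability is computed as a hypergeometric upper tail, genes without a cluster mapping are simply dropped, and the identifiers (`clone_*`, `Hs.*`) and function names are hypothetical.

```python
from math import comb


def map_ids(gene_ids, id_to_cluster):
    """Convert study-specific gene IDs to cluster IDs, dropping unmapped IDs."""
    return {id_to_cluster[g] for g in gene_ids if g in id_to_cluster}


def overlap_p_value(list_a, list_b, universe):
    """Return (overlap size, one-sided Fisher's exact p-value).

    universe is the set of cluster IDs represented on the array; both lists
    are restricted to it before testing.
    """
    universe = set(universe)
    a = set(list_a) & universe
    b = set(list_b) & universe
    k = len(a & b)
    n_univ, n_a, n_b = len(universe), len(a), len(b)
    total = comb(n_univ, n_b)
    tail = sum(comb(n_a, i) * comb(n_univ - n_a, n_b - i)
               for i in range(k, min(n_a, n_b) + 1))
    return k, tail / total


# Toy example: 10 clusters on the array, 4 in each list, 3 shared.
array_clusters = {f"Hs.{i}" for i in range(10)}
signature = {"Hs.0", "Hs.1", "Hs.2", "Hs.3"}
lookup = {"clone_a": "Hs.0", "clone_b": "Hs.1", "clone_c": "Hs.2",
          "clone_d": "Hs.8"}
published = map_ids(["clone_a", "clone_b", "clone_c", "clone_d", "clone_e"],
                    lookup)
shared, p_value = overlap_p_value(signature, published, array_clusters)
```

Restricting both lists to the clusters present on the array matters, because genes that could never have been detected should not count towards the background.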
This analysis revealed a high similarity to a range of other models of invasive and non-
invasive cancer subtypes that have been studied with microarrays, including several
studies of ovarian cancer. The statistically significant overlap of genes identified by this
study and those described by Gilks et al (Gilks et al., 2005), Warrenfeltz et al
(Warrenfeltz et al., 2004) and Santin et al (Santin et al., 2004) represents a validation step
in the process of characterising these EOC subtypes. These studies were performed using
tissue from EOC patients from independent cohorts, processed in different laboratories,
hybridised to different array platforms and analysed with different algorithms and tools.
Despite these extensive differences, the genes found to be associated with EOC
malignancy overlap with those identified by this study at a statistically significant level.
This finding provides some validation for the use of microarray profiling as a means of
identifying molecular signatures of human disease. It shows that the findings generated
are transferable between patient cohorts and laboratories.
The overlap with the recently published meta-signature of undifferentiated tumours
further strengthens the power of this analysis. It also supports the hypothesis that
tumorigenesis and progression are reliant on a common transcriptional profile, found to be
present in a large proportion of cancer types (Rhodes et al., 2004). The similarity with the
meta-signature furthermore suggests that the molecular differences between LMP and
invasive EOC are similar to the core processes responsible for the malignant
transformation of a broad range of tissues.
This meta-analysis of microarray datasets also revealed a relationship between the
invasive/LMP signature and that from a microarray study of the breast cancer (BrCa)
subtypes ductal carcinoma in situ (DCIS) and invasive ductal carcinoma (IDC) (Ma et al.,
2003). BrCa is frequently studied as a progressive cancer model with defined pathological
stages thought to correlate with distinct molecular events. These stages are: atypical
ductal hyperplasia (ADH), DCIS and IDC.
Initially the overlap between the genes thought to characterise these stages of BrCa
progression and the invasive/LMP signature deduced in this chapter suggested a
progressive relationship between LMP and invasive EOC. Warrenfeltz et al have
proposed that LMP tumours exist as an intermediate state between normal ovarian
epithelium and fully invasive EOC (Warrenfeltz et al., 2004). However, one important
observation from the Ma et al study was that the same genes found to correlate with the
stages of breast cancer were also significantly related to the grade of the tumours used to
generate the signature. It therefore appears unlikely that these stages of breast cancer are
separated by distinct molecular profiles, at least those detectable by microarray analysis.
The overlap between breast and ovarian cancer expression profiles most likely reflects
the increasing tumour grade associated with the development of a more aggressive
phenotype.
Genes found to be differentially expressed between pre-invasive and invasive prostate
cancer (Dhanasekaran et al., 2001) had no significant overlap with the LMP/invasive
EOC signature. Also not significantly related was a gene expression signature deduced
from a comparison of human embryonic stem cells and a somatic cell line (Sperger et al.,
2003). This indicates a different range of molecular events responsible for these
phenotypes compared to the events occurring in LMP and invasive EOC.
Table 5-9: Published gene lists or studies with significant homology to LMP/invasive EOC signature
List name | p value
Differentially expressed genes between low/intermediate grade breast DCIS and high grade DCIS/IDC (Seth et al., 2003) | 0.000554
Genes differentially expressed between cell cultures of serous EOC and normal ovarian epithelium (Santin et al., 2004) | 0.000122
Prognostic signature of breast cancer capable of predicting a short interval to the development of metastatic disease after surgery (van 't Veer et al., 2002) | 4.82E-5
Genes with increased expression in association with both increasing breast tumour grade and the transition of tumours from premalignant (benign), pre-invasive (DCIS) to malignant (IDC) (Ma et al., 2003) | 1.33E-5
Genes identified by SAM analysis of serous LMP and invasive EOC (Gilks et al., 2005) | 4.67E-8
Genes identified as differentially expressed between gastric cancer specimens of different stages of invasive (Boussioutas et al., 2003) | 5.16E-8
Genes differentially expressed between benign and malignant EOC – proposed as signature of EOC malignant potential (Warrenfeltz et al., 2004) | 7.15E-9
Metasignature of undifferentiated cancer identified by profiling of >3,700 tumours to identify core genes involved in malignant transformation and tumour progression in the majority of human cancer types (Rhodes et al., 2004) | 1.51E-11

Table 5-10: Published gene lists from microarray studies in which no significant homology to the LMP/invasive EOC signature was observed
List name | p value
Genes differentially expressed between human embryonic stem cells and somatic cell line (Sperger et al., 2003) | 0.342
Genes differentially expressed between prostate cancer and benign prostate hyperplasia (Dhanasekaran et al., 2001) | 0.089
5.2.9. Validation of selected differentially expressed genes with RT-PCR
5.2.9.1. Selection of appropriate genes for expression signature validation and design of RT-PCR primers
RT-PCR was chosen as one method for both technically and biologically validating the
expression of several genes identified by microarray analysis as having statistically
significant differences in expression between the EOC subtypes of interest. A
high-throughput format of this technique was employed using the ABI 7900HT system
(Applied Biosystems, USA), enabling up to 384 RT-PCR reactions to be performed
simultaneously whilst requiring half the reaction volume of the standard 96-well plate
format (Pinhasov et al., 2004). The performance of RT-PCR as a method to validate
microarray results has been widely reported (Jenson et al., 2003; Mutch et al., 2001).
Several criteria were formulated to select genes appropriate for this validation method:
(i) Statistically significant difference between serous invasive and LMP EOC as
identified by SAM analysis.
(ii) Upregulated (mean normalised expression ratio >1.0) in at least one class, i.e.
not down-regulated to differing extents in both LMP and invasive tumours.
(iii) Representative of significantly enriched gene ontologies.
(iv) Mean differences in expression levels that are robust to changes in the
number of samples in each class.
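Criteria (ii) and (iv) lend themselves to a simple computational check, sketched below. How robustness was actually assessed is not specified above, so the leave-one-out test used here is an assumption, as are the function names and the toy expression ratios.

```python
from statistics import mean


def upregulated_in_a_class(invasive, lmp):
    """Criterion (ii): mean expression ratio > 1.0 in at least one class."""
    return max(mean(invasive), mean(lmp)) > 1.0


def direction_is_robust(invasive, lmp):
    """Criterion (iv), assumed form: the direction of the mean difference
    must not change when any single sample is left out.
    Each class needs at least two samples."""
    invasive_higher = mean(invasive) > mean(lmp)
    for i in range(len(invasive)):
        rest = invasive[:i] + invasive[i + 1:]
        if (mean(rest) > mean(lmp)) != invasive_higher:
            return False
    for j in range(len(lmp)):
        rest = lmp[:j] + lmp[j + 1:]
        if (mean(invasive) > mean(rest)) != invasive_higher:
            return False
    return True


# Toy expression ratios (illustrative values only).
consistent_gene = ([2.0, 2.5, 1.8], [0.6, 0.5, 0.7])
outlier_driven_gene = ([0.9, 0.9, 5.0], [1.0, 1.1, 1.2])
```

In the second toy gene the apparent upregulation in invasive tumours depends on a single sample, so it would fail the robustness check even though its mean ratio exceeds 1.0.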
Primers for the RT-PCR reactions were designed using the online primer design tool
provided by GenScript (GenScript Corporation, 2005). Details of the primers designed
are shown in Table 5-11. All sequences were checked for homology to other unrelated
genes by BLAST search (Altschul et al., 1990). Official UniGene names, mean
expression ratios for the two EOC subtypes of interest and a summary of the genes'
relevant functions are shown in Table 5-12. Gene ontologies represented by each gene are
shown in Table 5-13.
Table 5-11: RT-PCR primer sequences designed for validation of gene expression signature of LMP and invasive EOC
Table 5-13: Gene ontology information for selected RT-PCR genes. Ontologies identified as significantly represented by the 1,302 genes differentially expressed between LMP and invasive EOC are shown in bold
Figure 5-13: Mean fold changes in the expression ratios for selected genes as determined by (A) microarray and (B) RT-PCR from the validation cohort of 10 cases. Confidence intervals based on the standard-error are shown. In general good agreement was observed between these assessments of gene expression with larger mean fold change differences observed by RT-PCR.
[Figure 5-13 panels: (A) Microarray; (B) RT-PCR. Vertical axis: mean fold change (0 to 10). Horizontal axis: KLK5, FN1, CHL1, STK6, TNFSF10, SSPN, MSLN, BIRC5, CLDN10, PTTG1, CXCL9. Series: LMP, Invasive.]
5.2.10. Biological validation of the LMP/invasive expression signature
To validate the extent to which protein expression changes in relation to the observed
mRNA changes, immunohistochemical analysis using EOC tissue microarrays (TMAs)
was performed. The use of TMAs, in association with microarray data, as a high
throughput method of evaluating protein expression has been widely demonstrated for a
range of cancer types (Abd El-Rehim et al., 2005; Rihl et al., 2004; Sallinen et al., 2000).
Tissue microarrays add significant structural and cellular localisation information to gene
expression data and enable large numbers of correctly preserved tumour specimens to be
analysed simultaneously.
The cases chosen for TMA validation represent samples completely independent of those
used to generate the microarray expression data. The use of independent samples to
biologically validate microarray-based findings is an essential step to ensure the
observations are not specific to the sample cohort used for the primary analysis.
Previously, the expression profiles analysed in this study have been validated technically
(section 5.2.9) and also by comparison to lists of differentially expressed genes from
other independent microarray studies of EOC or other invasive/non-invasive cancer
models (section 5.2.8.3).
5.2.10.1. Selection of independent EOC cases for validation and TMA
Paraffin-embedded, formalin fixed specimens of EOC were obtained from the AOCS and
also Mercy Hospital Tissue Banks, with the assistance of Dr Melissa Robbie. This was
done by searching the respective specimen databases for the terms ‘ovarian’ and ‘serous’.
De-identified pathology reports were reviewed by Dr Robbie to confirm the suitability of
each specimen for this study; however, a full pathology review including determination of
tumour grade was not possible due to time constraints. Dr Robbie did, however, attempt to
select cases for this validation cohort that resembled the histological characteristics of the
primary cohort used for microarray analysis.
H&E stained sections of each case were reviewed by Dr Robbie to confirm the diagnosis
of serous LMP or serous invasive EOC and to assess the relative tumour content of each
specimen to ensure adequate material was present for immunohistochemistry. The
method for TMA construction described in Sambrook and Bowtell (Sambrook and
Bowtell, 2003) was followed with a few modifications (Kononen et al., 1998).
Areas of each tumour that were typical of the diagnosis given and suitable for TMA inclusion
were identified. Confirmation was made that the features present in the section to be punch
biopsied were plentifully represented elsewhere in the block, allowing the block to be used for
future studies if necessary. The appropriate area was then circled on the H&E slide which
was in turn placed over the tumour block and used to locate the matching area from
which to take the needle punch biopsy.
Tumour punches were inserted into an agar block, which was then processed into paraffin to
form the recipient block. After melting the block now containing the punches, the histology
scientist (Mr. Neal O’Callaghan) attempted to press the cores down so that punches of varying
lengths were all present on the base of the cassette, which became the cutting face. Sections
of 5 µm were cut from the final blocks for each antibody to be analysed. IHC staining
was performed on a Dako Autostainer (Dako, USA) using standard IHC protocols.
In total, 84 cores (3mm diameter) were taken from areas of representative EOC content
from 52 cases of serous invasive EOC and 32 serous LMP tumours. These are detailed in
Appendix B.
5.2.10.2. Selection of antibodies corresponding to differentially expressed genes identified by microarray analysis
Using the online antibody database AbCam (AbCam Inc., UK) and gene and protein
information from GeneCards (Rebhan et al., 1998), IHC markers used by the Peter Mac
Pathology Department were mapped to specific features present on the 10.5k cDNA
microarray used for this project. Those genes overlapping with the list of 1,302 genes
differentially expressed between serous LMP and invasive EOC were identified.
Antibodies were chosen to represent gene ontologies previously identified as being
differentially expressed, including proliferation (MKI67), regulation of the cell cycle
(CCND1), and cell adhesion (CDH1).
IHC was conducted on one section of each TMA created for this project. As the
antibodies selected were in routine use by the Peter Mac Pathology Department, the IHC
was carried out by Dr Melanie Trivet using routine diagnostic IHC protocols (Materials
and Methods section 2.3.8). Sample images of EOC sections showing areas of
representative staining can be seen in Figure 5-14. Box plots of the gene expression levels
corresponding to each antibody are also shown. These plots allow the extent of variation
in the expression of these genes to be observed for the two EOC subtypes.
Table 5-15: Differentially expressed genes corresponding to diagnostic antibodies used by the Peter Mac Pathology Department

Columns: UniGene symbol / antibody name; mean expression ratio (invasive; LMP EOC); fold difference of mean expression ratios; UniGene name; summary of function and literature references.

PECAM1 / CD31; 1.352; 0.594; fold difference 2.276; Platelet/endothelial cell adhesion molecule
- A surface glycoprotein expressed on platelets and endothelial cell junctions.
- Expressed in a range of solid tumours and thought to positively regulate the attachment of tumour cells to endothelium (Newman et al., 1990; Tang et al., 1993).

MKI67 / Ki67; 2.662; 0.63; fold difference 4.225; Antigen identified by monoclonal antibody Ki-67
- Required for maintenance of cell proliferation.
- Expressed in G1, S and G2 phases of cell cycle.
- Correlates with poor survival in EOC (Anttila et al., 1998; Schluter et al., 1993).

CCND1 / Cyclin D1; 0.901; 1.53; fold difference 0.587; Cyclin D1 (PRAD1: parathyroid adenomatosis 1)
- This cyclin forms a complex with CDK4 or CDK6, whose activity is required for cell cycle G1/S transition.
- Mutations, amplification and overexpression alter cell cycle progression, contributing to tumorigenesis (Motokura et al., 1991).

ESR1 / Estrogen receptor α; 0.87; 1.76; fold difference 0.507; Estrogen receptor 1
- A ligand-activated transcription factor composed of several domains important for hormone binding, DNA binding, and activation of transcription.
- Expression of this molecule is used broadly to determine clinical management of breast cancer patients (Green et al., 1986).

CDH1 / E-cadherin; 0.576; 1.404; fold difference 0.41; Cadherin 1, type 1, E-cadherin (epithelial)
- A calcium dependent cell-cell adhesion molecule.
- Mutations in this gene are correlated with gastric, breast, colorectal, thyroid and ovarian cancer.
- Loss of function contributes to progression in cancer by increasing proliferation, invasion, and/or metastasis (Bussemakers et al., 1993).
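The fold differences listed in Table 5-15 appear to be the ratio of the two class means (invasive divided by LMP). As a minimal illustration, assuming that simple definition, the following Python check recomputes three of the tabulated values. The function name is hypothetical and the log2 conversion is added only for convenience; it is not part of the original analysis.

```python
from math import log2

def fold_difference(mean_invasive, mean_lmp):
    """Ratio of mean expression ratios: invasive relative to LMP."""
    return mean_invasive / mean_lmp

# (mean invasive, mean LMP) as listed in Table 5-15
table_5_15 = {
    "PECAM1": (1.352, 0.594),
    "MKI67": (2.662, 0.63),
    "CDH1": (0.576, 1.404),
}

folds = {gene: fold_difference(inv, lmp) for gene, (inv, lmp) in table_5_15.items()}
for gene, fold in folds.items():
    # log2 > 0: higher in invasive tumours; log2 < 0: higher in LMP tumours
    print(f"{gene}: fold = {fold:.3f}, log2 = {log2(fold):+.2f}")
```

A value above 1 therefore indicates higher expression in the invasive tumours, and a value below 1 indicates higher expression in the LMP tumours.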
Figure 5-14: IHC stained sections of LMP and invasive EOC from tissue microarray biological validation. Box plots indicate the microarray-quantified gene expression levels in these two EOC subtypes.
[Figure panels: columns IHC: LMP, IHC: Invasive and Microarray (box plots for LMP and invasive tumours); rows Ki67, CD31, E-cadherin, Cyclin D and ER-α.]
5.2.10.3. Quantification of tissue microarray immunohistochemistry
Three high-power (400x) images were captured of representative staining for each tumour
on each array. In total, 993 images were captured from 10 separate IHC-stained TMA
sections. As a small proportion of sectioned TMA cores had floated off or otherwise been
negatively affected during the staining process, it was not possible to record images for
every specimen included in the two TMA designs described previously.
In order to facilitate the quantification and statistical analysis of IHC staining in this large
number of digital images, an automated procedure was created, based on protocols
described by several groups (Lehr et al., 1997; Lehr et al., 1999; Matkowskyj et al.,
2000). Briefly, these methods first apply a threshold to the image to exclude background or
haematoxylin staining, and then use the histogram-analysis feature of Adobe Photoshop to
calculate the median and standard deviation of pixel intensity for the image.
Using Adobe Photoshop CS2 (Adobe Systems Inc., USA) and Microsoft Visual Basic
(Microsoft Corporation, USA) a program was written that allowed an entire directory of
images to be processed through a series of functions. An output file containing the mean,
median and standard deviation of pixel intensities for each image was created. A sample
script from Adobe for exporting of image histogram statistics, provided with Adobe
Photoshop, was used as a foundation for the output section of the program. The full VBA
code required to run the program is given in Appendix N.
The program created carried out the following steps (as shown in Figure 5-15):
• Open the first image in a specified directory; images stored in TIFF format.
• Apply a threshold adjustment at a tonal level of 150 (the range 0 – 255,
represents the full tonal range of any digital image); consequently eliminating
the unstained sections of the image and low-level background or non-specific
staining.
• Invert the image and convert to greyscale; this results in IHC stained sections
showing up as white areas on a largely black background.
• Export the image histogram statistics to a tab-delimited text file created in a
sub-directory of the images currently being processed.
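The steps listed above can be sketched in a few lines of Python. This is an illustrative sketch only and is not the Photoshop/Visual Basic program given in Appendix N: the function names are hypothetical, the image is represented as a nested list of greyscale values (0 to 255) rather than a TIFF file, and it is assumed that the threshold maps values at or above the chosen level to white and that the standard deviation is the population form.

```python
from statistics import mean, median, pstdev

def threshold(pixels, level=150):
    """Binarise a greyscale image: values at or above `level` become white (255), the rest black (0)."""
    return [[255 if value >= level else 0 for value in row] for row in pixels]

def invert(pixels):
    """Invert tones so that dark (stained) pixels become white."""
    return [[255 - value for value in row] for row in pixels]

def histogram_stats(pixels):
    """Mean, median and population standard deviation of all pixel intensities."""
    flat = [value for row in pixels for value in row]
    return {"mean": mean(flat), "median": median(flat), "sd": pstdev(flat)}

def quantify(pixels, level=150):
    """Threshold, invert, then summarise; a higher mean reflects a larger stained area."""
    return histogram_stats(invert(threshold(pixels, level)))

# Hypothetical 2 x 2 greyscale image: two pale (unstained) and two dark (stained) pixels
image = [[200, 100],
         [250, 30]]
stats = quantify(image)
# One tab-delimited line per image, mirroring the exported statistics file
print("\t".join(f"{key}={value}" for key, value in stats.items()))
```

Because the thresholded image contains only black and white pixels, the mean intensity in this sketch is proportional to the fraction of the image classed as stained, which is why a single summary value per image can be carried forward to statistical analysis.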
(A) Sample image 1: Ki67 low staining
(B) Sample image 2: Ki67 high staining
(C) Sample image 1: Image converted to greyscale and threshold level of 150 applied.
(D) Sample image 2: Image converted to greyscale and threshold level of 150 applied.
(E) Sample image 1: Image inverted
(F) Sample image 2: Image inverted
Histogram statistics (sample 1): Mean: 0.04 Standard deviation: 2.7
Histogram statistics (sample 2): Mean: 52 Standard deviation: 96.73
Figure 5-15: Examples of the IHC quantification program applied to sample specimens with low (sample 1) and high (sample 2) expression of Ki67. The images are converted to greyscale and a threshold level between 0 and 255 is applied. The appropriate threshold level can be calibrated to the level of background or contrast staining present. The image statistics from the inverted and thresholded image, including pixel intensity mean and standard deviation, are then determined and exported.
5.2.10.4. Statistical analysis of quantified IHC data
Using Microsoft Excel and Minitab Statistical software, a multivariate analysis of IHC
intensity data was carried out. Using a general linear model, variation between the two
TMAs, between replicate measurements from the same tumour and, specifically, between EOC
histological types was analysed to determine whether significant differences in expression
existed for these proteins. Including the TMA number and replicate measurements in the ANOVA model
allowed these variables to be controlled for when determining the statistical significance
of any observed difference in the mean pixel intensities calculated from each high power
image of invasive or LMP EOC. Combining the data from the two separate arrays for
each antibody increased the statistical power of the analysis. The results of the general
linear model analyses are summarised in Table 5-16.
Table 5-16: Summary of statistical analysis and comparison of microarray and qIHC data used for biological validation of findings.

              Microarray data (no. samples = 30)          qIHC data (n=84)
Antibody      Mean: Invasive   Mean: LMP   P-value        Mean: Invasive   Mean: LMP   P-value
E-cadherin    1.032            2.469       <0.001         12.66            8.98        0.068
Ki67          2.662            0.63        <0.001         13.49            2.465       <0.001
CD31          1.352            0.594       <0.001         2.878            1.584       0.005
Cyclin D1     0.901            1.535       0.01           5.023            6.052       0.18
ERα           0.87             1.716       >0.05          14.36            14.74       0.849
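The logic of the general linear model comparison described in section 5.2.10.4 can be illustrated with a small Python sketch. It is not the Minitab analysis itself: it assumes an additive model with only two terms (histological type and TMA number), omits the replicate-measurement term, uses invented intensity values, and reports only the F statistic for the type effect (obtained by comparing the full model with one lacking the type term) without converting it to a P-value. All function names are hypothetical.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def rss(design, y):
    """Residual sum of squares of the least-squares fit of y on the design matrix."""
    p = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(design, y)) for i in range(p)]
    beta = solve(xtx, xty)
    return sum((yi - sum(b * v for b, v in zip(beta, r))) ** 2 for r, yi in zip(design, y))

def type_f_statistic(records):
    """records: (intensity, tumour_type, tma) with type 'INV' or 'LMP' and tma 1 or 2."""
    y = [float(r[0]) for r in records]
    full = [[1.0, float(r[1] == "INV"), float(r[2] == 2)] for r in records]
    reduced = [[row[0], row[2]] for row in full]  # same model without the type term
    rss_full, rss_reduced = rss(full, y), rss(reduced, y)
    df_effect = len(full[0]) - len(reduced[0])
    df_resid = len(y) - len(full[0])
    f_value = ((rss_reduced - rss_full) / df_effect) / (rss_full / df_resid)
    return f_value, df_effect, df_resid

# Invented mean pixel intensities: two images per type on each of two TMAs
records = [(10, "INV", 1), (12, "INV", 1), (11, "INV", 2), (13, "INV", 2),
           (4, "LMP", 1), (6, "LMP", 1), (5, "LMP", 2), (7, "LMP", 2)]
f_value, df_effect, df_resid = type_f_statistic(records)
print(f"F({df_effect}, {df_resid}) = {f_value:.2f}")
```

Including the TMA term in both models means that the F statistic reflects only the variation attributable to histological type after array-to-array differences have been accounted for, which is the sense in which the TMA number is controlled for.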
5.2.10.5. Summary of IHC-based biological validation of invasive/LMP gene expression profile
From this analysis it was observed that increases in the expression of the genes coding for
the proliferation marker MKI67 (Ki67) and the angiogenesis and tumour-recurrence marker
PECAM1 (CD31) correlate with protein expression in an independent cohort of samples
as measured with quantitative IHC (qIHC). Both these genes were detected as being
upregulated in invasive tumours relative to the LMP type by microarray analysis and
were also observed to be upregulated with high confidence from qIHC.
The cell-cycle regulating gene cyclin D1 (CCND1) was observed to be down-regulated by
microarray analysis in the invasive tumours relative to the LMP type (Table 5-16). The same
trend was observed with qIHC analysis, although the difference between LMP and invasive
specimens was not statistically significant.
The qIHC data from TMA1 for the cell-adhesion protein E-cadherin, when analysed
separately, produced mean differences in the same direction as observed by microarray
analysis and by RT-PCR validation of the same cohort, although this was not
statistically significant (p=0.096).
This section of the study shows that changes in gene expression, identified by microarray
analysis and appropriate analytical methods, can be extended to an independent cohort.
These results also show that changes in mRNA expression levels correlate with changes
in protein expression, as measured by a novel, automated qIHC analysis method.
5.3. Discussion
This chapter describes the molecular characterisation of serous LMP and invasive EOC
through the use of microarray, RT-PCR and qIHC methods. The considerations around
correct pathological classification are also covered, and novel methods for
confirming the primary ovarian origin of an individual specimen, based on its microarray
gene expression profile, are described. A robust list of genes with differential expression
between LMP and invasive EOC subtypes was determined. The difference in expression
of a subset of this list was then validated using RT-PCR. Gene ontology and pathway
analyses were carried out to determine the key molecular processes represented within the
total list of genes and also for an ontology-filtered subset. IHC was performed on sections
of two TMAs, comprised of needle biopsies taken from an independent cohort of EOC
specimens. An automated method for objectively quantifying IHC staining intensity was
also described and demonstrated for the antibodies selected.
5.3.1. Findings from this analysis and relevance to published studies of LMP or invasive EOC
At the commencement of this project, few studies had been published involving the gene
expression profiling of LMP EOC, despite the potential for insight into the events
responsible for the ultimately fatal invasive capability of serous EOC. During the course
of this study, however, two studies were published involving
microarray profiling of this invasive/non-invasive EOC model (Gilks et al., 2005;
Warrenfeltz et al., 2004).
One of the published microarray studies of EOC that was found to significantly overlap
with the genes identified by comparison of LMP and invasive EOC in this chapter was
that by Warrenfeltz et al in 2004. This analysis was done using the oligonucleotide
Affymetrix U95Av2 chips, making it difficult to directly compare gene expression
measurements to those determined with cDNA microarrays. The Warrenfeltz study was
limited by its small sample size (n=13) and the inclusion of only two mucinous and two
serous LMP specimens. Another limitation of the study was the combining of the LMP
tumours into one class, despite the known extensive molecular and behavioural
differences between these subtypes. No mention is made in the publication of how the
specimens were reviewed to confirm their suitability for the study, nor is any comparison
of the generated expression profiles to other datasets reported that would confirm the primary
ovarian origin of these mucinous type tumours. This comparison could have been carried
out using publicly available datasets from studies comparing broad ranges of human
cancer types using the Affymetrix platform (Ramaswamy et al., 2001; Su et al., 2001).
One conclusion drawn by the authors of this study is that borderline tumours represent an
intermediate stage between benign adenomas and malignant adenocarcinomas, based on
the expression levels for selected genes being mid-way between those observed for the
benign and malignant types. The small sample size and the combination of mucinous and
serous type LMP samples call the validity of this statement into question. If either or both
mucinous LMP specimens profiled were in fact metastatic invasive tumours from another
site such as the appendix, pooling the profiles of these samples may result in an increase
in the mean level of genes associated with malignancy and invasion, thus making the
class appear as a transitional one between the extremities of the other two.
The literature concerning the frequency of malignant transformation of LMP EOC
suggests this phenomenon is extremely rare (Puls et al., 1992). One meta-analysis of over
137 serous LMP tumours found that only a single case of recurrence in the form of invasive
cancer was observed among the 66 stage I diagnoses investigated. Of the 45 stage II-IV
tumours, six (13%) were noted to possess invasive implants; the women
affected experienced an unfavourable outcome, and three of them died of their condition.
The presence of invasive implants in serous LMP tumours is associated with a
significantly poorer prognosis (Prat and De Nictolis, 2002). From this study it could be
hypothesised that tumours with these pathological features would exhibit a more
‘invasive-like’ gene expression profile if the biopsy of tumour used for microarray
analysis included areas of invasive implants and was not microdissected prior to RNA
extraction and processing (Liotta and Petricoin, 2000).
Warrenfeltz et al do not state if microdissection of the LMP tumours was performed or if
any evidence of invasive implants was noted during the review process. Despite these
factors, the list of 163 differentially expressed genes between benign, LMP and invasive
EOC overlaps with high statistical significance with that produced from the larger cohort
(but smaller microarray) used in this analysis. This list contains a high proportion of genes
involved in processes similar to those observed in the gene ontology analysis carried out
in section 5.2.7.3, including regulation of growth/proliferation, adhesion and control of DNA
replication and associated events.
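The statistical test used to assess the overlap between the two gene lists is not stated in this section. One common way of judging whether an overlap is larger than expected by chance is a one-sided hypergeometric test, sketched below in Python. The function name is hypothetical and the numbers in the example are invented; in practice the gene universe would need to be restricted to genes measurable on both microarray platforms.

```python
from math import comb

def overlap_p_value(universe, size_a, size_b, overlap):
    """P(X >= overlap) when size_b genes are drawn at random from a universe
    containing size_a genes of interest (hypergeometric upper tail)."""
    total = comb(universe, size_b)
    upper = min(size_a, size_b)
    tail = sum(
        comb(size_a, k) * comb(universe - size_a, size_b - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total

# Invented example: 10 genes in common, lists of 5 genes each, all 5 shared
p = overlap_p_value(universe=10, size_a=5, size_b=5, overlap=5)
print(f"p = {p:.5f}")
```

A small P-value indicates that an overlap of at least the observed size would rarely arise if the two lists had been selected independently at random from the shared gene universe.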
Of the differentially expressed genes that overlap with this study, the cell-adhesion
molecule E-cadherin (CDH1) is noted as an example of a gene that progressively
increases over 2-fold in expression from benign, to LMP and then invasive cancer. The
authors propose that, since the adhesive function of cadherins is calcium-dependent
(Pokutta et al., 1994) and their list of differentially expressed genes contained molecules
involved in controlling calcium transport or channel activity, dietary calcium intake
may be related to the development and progression of EOC. Studies of the dietary
patterns of women with and without EOC have in fact shown a reduced risk of
ovarian cancer with increased dietary calcium intake (Goodman et al., 2002), supporting
this hypothesis.
Another more recently published microarray study of LMP EOC and comparison to the
invasive type is that by Gilks et al (Gilks et al., 2005). This study used genome-
comprehensive cDNA microarrays to profile 23 samples of serous invasive or LMP
cancer. Contrary to the findings of this study, only a small number of genes were found to
be differentially expressed between the compared tumour classes. Unsupervised filtering
of this dataset reduced the number of genes from >45,000 to 541 on the basis of
excluding genes without log2 2-fold variation from the mean value in at least 3 arrays,
from which a total of 217 genes were identified by SAM gene selection (Tusher et al.,
2001). Inspection of the SAM output reveals all genes selected as differentially expressed
were upregulated in the LMP tumours relative to the invasive cases. This is a surprising
observation as other studies comparing high and low grade tumours have revealed a
significant number of genes with higher expression in the more metabolically-active high-grade
samples. The small number of class-discriminating genes and the uni-directional change
in expression may be an indication of a technical fault in the experimental stages of this
study. Visual inspection of the hybridised microarray images available from the Stanford
Microarray Database at http://genome-www5.stanford.edu revealed extremely poor array
printing and hybridisation quality (Sherlock et al., 2001). Examples of three randomly
selected microarrays from this dataset are shown in Appendix Q. Individual array features
had poor morphology and there appear to have been problems with uneven hybridisation
of labelled probe across the surface of the printed area. This most likely resulted in a
spatial bias in the final gene expression ratios obtained, affecting all down-stream data
analysis. No mention of the use of a spatially-dependent normalisation algorithm, which may
have lessened the impact of these technical issues, was made in the manuscript or supporting
material. As a result of these observations, the claims that only a small
number of genes discriminate LMP and invasive serous EOC, or that the molecular
events responsible for such marked phenotypic differences are below the threshold of
detection by microarrays, must be questioned.
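The unsupervised variation filter described for the Gilks et al dataset can be sketched as follows. This is one reading of the criterion rather than the published implementation: it assumes that expression values are log2 ratios, that "2-fold variation" corresponds to a deviation of at least one log2 unit from the gene's mean across arrays, and that a gene is retained when at least three arrays meet that condition. The function name and the example values are hypothetical.

```python
def passes_variation_filter(log2_ratios, min_deviation=1.0, min_arrays=3):
    """Keep a gene if enough arrays deviate from its mean by at least min_deviation."""
    centre = sum(log2_ratios) / len(log2_ratios)
    deviating = sum(1 for value in log2_ratios if abs(value - centre) >= min_deviation)
    return deviating >= min_arrays

# Invented log2 ratios for two genes across five arrays
variable_gene = [2.0, -2.0, 2.0, -2.0, 0.0]  # four arrays deviate by 2 log2 units
flat_gene = [0.1, -0.1, 0.2, 0.0, 3.0]       # only one array deviates markedly

kept = [name for name, values in
        [("variable_gene", variable_gene), ("flat_gene", flat_gene)]
        if passes_variation_filter(values)]
print(kept)
```

A filter of this kind is applied without reference to class labels, so it cannot by itself bias a later comparison towards one tumour class; it does, however, determine how many genes are available to the subsequent gene selection step.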
5.3.2. Analysis of differentially expressed genes identified by multiple studies
Only three differentially expressed genes were identified in common between this study
and those by Gilks et al and Warrenfeltz et al. These were: progestagen-associated
endometrial protein (PAEP), connective tissue growth factor (CTGF) and anterior
gradient 2 homolog (AGR2).
PAEP, five-fold upregulated in LMP tumours in this study, codes for a protein known as
glycodelin. This molecule has been detected in both normal and malignant ovaries. It is
also secreted in the endometrium during the menstrual cycle and during the first
trimester of pregnancy (Joshi et al., 1982). One form of the protein encoded by this gene
(glycodelin-A) is known to have immunosuppressive activity, which may assist the slower
growing LMP tumours in slowing or preventing the immune system from successfully
engaging to the extent observed in invasive disease (Kamarainen et al., 1996). IHC
analysis of 460 serous invasive EOC specimens using the TMA method revealed that higher
levels of glycodelin are associated with a higher 5-year survival rate. Expression decreased
with increasing tumour stage, defined by the extent of invasion observed, with the most
significant reduction occurring in stage III and IV tumours compared to stage I.
The reduction at stage II was not statistically significant, suggesting that loss of
expression or function of glycodelin may be required for successful invasion beyond the
pelvis, a pathological distinction between LMP and invasive EOC (Mandelin et al.,
2003).
Over-expression of CTGF, coding for a secreted integrin-binding protein, has been
implicated in inhibition of lung adenocarcinoma metastasis and invasion (Chang et al.,
2004); however, literature concerning its involvement in ovarian cancer is scarce.
One group identified overexpression of this gene in a cisplatin-resistant ovarian cancer
cell line through the use of microarray profiling (Sakamoto et al., 2001). However, CTGF
is down-regulated in invasive tumours relative to the LMP type in both this and
the two published studies, which may suggest another mode of action for this molecule in
ovarian cancer.
AGR2/HAG-2 is a human homolog of the cement gland gene in Xenopus laevis and is on
average 5.6-fold upregulated in the LMP tumours profiled in this chapter. The expression
of this gene in breast cancer has been shown to be significantly higher in malignant cell
lines and human tumours compared to benign cell lines and non-cancerous tissues.
Transfection of this gene into non-metastatic cell lines has been shown to result in
metastatic lung formations in animals transplanted with these cells. An increase in the
rate of adhesion was also observed in the AGR2-transfected cells further implicating this
gene in the process of metastasis (Liu et al., 2005a). Increased expression of this gene has
been detected in prostate cancer compared to benign disease, which is the inverse pattern
to that detected in LMP/invasive EOC (Zhang et al., 2005). However another study has
shown this over-expression in prostate cancer is not of prognostic significance
(Kristiansen et al., 2005).
Analysis of genes associated with pancreatic cancer has identified AGR2 as a marker of
malignancy; however, besides it being highly over-expressed in this malignancy relative to
normal pancreas, little is known of its specific involvement in this disease, as is the case
with ovarian cancer (Missiaglia et al., 2004; Ryu et al., 2002).
In breast cancer this gene is co-expressed with the oestrogen receptor and is thought to
play a role in metastasis through the regulation of receptor adhesion and function. AGR2
is proposed as a novel molecular marker of disease progression or potential therapeutic
target for hormone responsive breast tumours (Fletcher et al., 2003).
Each of these three overlapping genes has an interesting clinical and molecular profile
across a range of human cancers. Their expression is clearly involved in core processes
associated with tumour invasion, including that of EOC. Each gene has been shown to
have some form of adhesive functionality: whether by increasing the rate of cell
attachment to a specific substrate (AGR2), significantly decreasing its expression in
response to an EOC adhering to and invading the pelvis (PAEP), or binding to
cell-attachment integrins (CTGF), the adhesive function is shared by all three genes.
The very small overlap between all three lists of differentially expressed genes between
LMP and invasive EOC is perhaps not surprising in light of the known heterogeneity of
this disease. Furthermore, observations from microarray studies of other cancer types,
particularly breast cancer, have indicated that sample size, specific cohort composition and
method of analysis play a major role in the robustness of lists describing differential gene
expression between tumour subtypes.
5.3.3. Use of gene expression based predictive analysis to confirm specimen diagnosis and identify metastatic disease
One distinguishing feature between this and other studies of LMP EOC is the use of
cross-validated gene-expression based confirmation of tumour diagnosis. Making use of a
large microarray dataset comprising expression profiles of tumours from 10 different
primary sites, a signature of genes with ovarian-specific expression patterns was identified.
Using a combination of machine learning algorithms and leave-one-out cross-validation
(LOOCV), each sample of EOC
was subjected to classification as either primary ovarian or not, before inclusion in the final
cohort. This was done in association with a standard pathology review which consisted of
review of the H&E stained section taken from the tumour specimen and the original
diagnostic pathology report associated with each case.
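The leave-one-out procedure described above can be sketched in Python as follows. The specific machine learning algorithms used are not named in this section, so a simple nearest-centroid classifier is substituted here purely for illustration; the function names, the two-gene profiles and the class labels are all hypothetical.

```python
def nearest_centroid_predict(training, profile):
    """Assign `profile` to the class whose mean expression profile is closest."""
    groups = {}
    for values, label in training:
        groups.setdefault(label, []).append(values)
    best_label, best_distance = None, float("inf")
    for label, members in groups.items():
        centroid = [sum(column) / len(members) for column in zip(*members)]
        distance = sum((a - b) ** 2 for a, b in zip(profile, centroid))
        if distance < best_distance:
            best_label, best_distance = label, distance
    return best_label

def loocv_predictions(samples):
    """Predict each sample from a classifier trained on all the other samples."""
    predictions = []
    for index, (profile, _label) in enumerate(samples):
        training = samples[:index] + samples[index + 1:]
        predictions.append(nearest_centroid_predict(training, profile))
    return predictions

# Invented expression profiles (two signature genes) with known primary site
samples = [
    ([5.0, 0.1], "ovarian"), ([4.8, 0.0], "ovarian"), ([5.2, -0.1], "ovarian"),
    ([0.1, 4.9], "other"), ([0.0, 5.1], "other"), ([-0.2, 5.0], "other"),
]
predicted = loocv_predictions(samples)
accuracy = sum(p == label for p, (_, label) in zip(predicted, samples)) / len(samples)
print(predicted, accuracy)
```

In a real analysis, any gene selection step would also need to be repeated inside each leave-one-out iteration, so that the held-out sample plays no part in choosing the signature used to classify it.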
In a landmark study by Ramaswamy et al, it was determined that the molecular profile of
a metastatic tumour was more similar to the tissue from which it arose than to that from
which it was excised, suggesting that metastatic potential is a variable that is determined in the
early stages of tumorigenesis (Ramaswamy et al., 2003). Since then, a number of other
studies have demonstrated the use of gene expression profiling to identify the origin of
metastases taken from a range of sites (Li et al., 2003; Roepman et al., 2005; Shedden et
al., 2003; Talbot et al., 2005). This technique is used and extended here to confirm the
primary ovarian origin and other relevant classifications of the specimens being analysed.
The list of genes identified (shown in hierarchical cluster format in Figure 5-4), validated
with extensive permutation testing, serves as a molecular fingerprint for EOC and with
further refinement could form the basis of a high-throughput screening test for confirming
the origin of a suspicious sample of tumour found in the ovary, but exhibiting
characteristics of another tissue type. Several examples of this approach exist in the
literature to date, including a study in which a predictive microarray signature of breast
cancer recurrence was translated to 384-well RT-PCR format using the ABI 7900 real-
time thermal cycler (Paik et al., 2004) and also a recently published PCR-based predictor
of the primary origin for cancers of unknown primary by Tothill et al (Tothill et al.,
2005).
In this chapter, further to confirming the primary ovarian origin of a given tumour
specimen, predictive analysis was also carried out to confirm the histological subtype and
LMP/invasive classification. With further development it may be possible to combine the
outcome of these lists of predictive genes and generate a single test capable of predicting
a range of clinically important parameters on the basis of a small number of genes,
therefore minimising expense and delay in such information being available to the
treating clinician.
5.3.4. Cell adhesion molecules and EOC malignancy
The importance of cell-cell and cell-matrix functions in controlling the malignant
potential of a range of cancer types has become apparent over recent years. The
metastatic process in EOC involves cancer cells being shed from the epithelium,
dissemination throughout the pelvis and localised proteolysis leading to invasion
throughout the body (Rodriguez et al., 2001). A defining characteristic of true LMP
tumours is their inability to spread beyond the pelvis, yet ability to grow successfully in
this area, avoiding the immune response for many years (Allen et al., 1987; Kurman and
Trimble, 1993; Prat and De Nictolis, 2002). It is therefore hypothesised that differential
expression of cell adhesion molecules plays a pivotal role in determining the clinical
behaviour and consequent mortality rates of these two disease subtypes.
From this study it was observed that genes with cell-cell/matrix adhesive functionality
were a differentially regulated class of genes between invasive and LMP EOC. This was
found only after removal of the large number of genes whose selection during the
analysis of LMP and invasive EOCs was due to the faster growth rate and increased
metabolic activity of the invasive form of this disease (D'Andrilli et al., 2004; Prat and De
Nictolis, 2002). This observation agrees with the study by Warrenfeltz et al (Warrenfeltz
et al., 2004), who also noted the over-representation of genes of these classes in their
analysis of tumours of varying malignant potential. The finding is taken further in this
chapter by the use of pathway analysis to elucidate a network of interconnected genes
whose combined regulation may be responsible for mediating a tumour’s invasive ability.
The gene expression network deduced from pathway analysis in section 5.2.8 represents a
collection of genes whose products are crucial for the ability of tumour cells to adhere to each
other, as well as to the mesothelial lining of the peritoneal cavity, a major route of invasion
for EOC (Gardner et al., 1995). Amongst these genes is mesothelin (MSLN), for which a
soluble product has recently been demonstrated to bind exclusively to the CA-125
molecule. This binding has been shown to initiate cell adhesion processes. Conversely,
blocking the mesothelin protein with antibodies results in the inability of an
ovarian cell line to successfully attach to a mesothelin coated substrate. Both MSLN and
CA-125 exhibit higher levels of expression in advanced grade tumours and it is
hypothesised that their interaction may represent a crucial stage in the attachment of
cancer cells to the mesothelial epithelium, facilitating the invasion process (Rump et al.,
2004). While the cDNA clone for CA-125 was not present on the particular cDNA
microarray used for this study, the mesothelin gene was on average 2.1-fold higher
expressed in the invasive tumours profiled.
The tumour antigen CA-125 is measured routinely as an indicator of disease burden and
EOC patient prognosis. It is expressed in both normal and malignant ovarian tissue;
however release into the extracellular space is strongly associated with tumorigenesis
(Meyer and Rustin, 2000). Although it was discovered some time ago and is accepted as a
reliable prognostic marker, its precise function in EOC remains unknown (Bast et al.,
1981; Meyer and Rustin, 2000). It has a demonstrated high-affinity interaction with an
extracellular matrix protein, galectin 1 (LGALS1). This molecule is a cell surface lectin
implicated in the regulation of cell-cell/matrix adhesion. It has also been demonstrated
that the binding characteristics of CA-125 may be regulated by gene expression in tissues
surrounding the tumour cell on which it is expressed (Seelenmeyer et al., 2003). Other
studies have demonstrated that the addition of galectin 1 results in a dose-dependent increase in
EOC cell adhesion to both laminin-1 and fibronectin (FN1) (van den Brule et al., 2003).
Another recent finding that has stimulated interest in galectin 1 concerns its
immunosuppressive function. By directly causing T cell apoptosis, specifically at the site
of tumour infiltration, this molecule appears to aid in tumour progression by preventing
an effective immune response (He and Baum, 2004).
Fibronectin (FN1) is expressed on the cell surface and also in the extracellular matrix of
EOC cells; however, it is the intracellular form of this molecule that has the most
significant impact on the behaviour of invasive EOC (Zand et al., 2003). This gene, like
galectin 1, to which it binds, also has multiple functions additional to its adhesive
properties, including tumour neovascularisation, immunosuppression and prevention of
apoptosis – all crucial events during tumour proliferation (Franke et al., 2003). In this
study, a 1.5-fold statistically significant increase in its expression was detected in
invasive EOC compared to the LMP tumours. Others have demonstrated its correlation
with tumour stage and growth fraction making it a candidate prognostic factor. In a study
to determine the level of influence the mesothelium exerts over the motility and
invasiveness of ovarian cancer cells, monoclonal antibody-blocking of fibronectin was
observed to significantly inhibit ovarian cancer cell motility (Rieppi et al., 1999). When
EOC cells are bound to fibronectin, however, they are protected from apoptosis when
exposed to chemotherapy drugs including cisplatin and paclitaxel, further underscoring
the importance of this molecule in ovarian malignancies (Jun S, 2003).
One of the adhesion genes upregulated in the LMP tumours analysed is CD44,
traditionally known as a homing receptor that interacts with members of the ezrin (VIL2)
family of genes and is also capable of binding to extracellular fibronectin (Jalkanen and
Jalkanen, 1992; Martin et al., 2003). The CD44/ezrin combination has been found in a
range of cancer types and is thought to assist the tumour cell in locating and binding to
favourable organs for invasion and metastasis. Both genes are down-regulated in the invasive
tumours profiled in this analysis. CD44 is a surface glycoprotein with a range of functions
including cell-matrix adhesion, whereas ezrin is a member of the ERM protein family
(Ezrin, Radixin, and Moesin) which performs structural and regulatory roles in plasma
membrane domains. In ovarian cancer, CD44 expression is detected in primary tumours
but much less frequently in metastatic growths (Bar et al., 2004), which is thought to be
due to hypermethylation of its promoter region. This reduction in expression correlates
with stage, survival and dissemination of cancer cells to surrounding tissues (Martin et
al., 2003).
In colorectal cancer studies the function of ezrin in controlling the adhesive potential of
tumour cells has been demonstrated through the use of antisense oligonucleotides.
Inhibiting ezrin expression was observed to result in reduced cell-cell adhesion and
a subsequent increase in motility and invasiveness. An association with the cell adhesion
molecule E-cadherin (CDH1), another member of the differentially expressed gene network
identified in this chapter, was also noted through coprecipitation studies (Hiscox and
Jiang, 1999).
Also present in the expression network are CD9, CD36 and CD47 which code for
membrane proteins involved in mediating a tumour cell’s adhesive interactions with the
surrounding stroma and (unlike CD44) are upregulated in the majority of invasive EOC
specimens profiled. Expression of these genes is associated with cellular differentiation
This study points to the importance of cell-cell adhesion junctions in EOC as a major
determinant of a tumour’s malignant potential. Impairment of a range of adhesion
interactions resulting from (or as a consequence of) over- or under-expression of a
relatively small number of genes, with seemingly crucial roles, appears to be closely
connected to a tumour’s intra-abdominal spread and subsequent invasion to other parts of
the body.
E-cadherin is one of the central molecules in the gene expression network shown in
Figure 5-10 and was identified as being, on average, 2-fold upregulated in LMP tumours.
Conflicting reports exist in the literature concerning the regulation of this gene and its
impact on cellular adhesion, differentiation and tumour proliferation. Some studies show
that a decrease in expression correlates with dedifferentiation and increased invasive
properties in free floating EOC cells, invasive endometrial carcinoma and other cancer
types including bladder and prostate (Fujimoto et al., 1997; Ross et al., 1995; Sakuragi et
al., 1994; Umbas et al., 1992; Veatch et al., 1994). Other studies have shown that levels of
E-cadherin mRNA are higher in weakly metastatic cell lines compared to those with high
metastatic capability (Hashimoto et al., 1989), although cell lines can often exhibit
substantial differences in gene expression compared to primary tumours (Ross and Perou,
2001). Conceptually, genes involved in extracellular interactions such as adhesion to
other cells and extracellular matrix may be the most severely affected by the in-vitro
culturing process. Despite this, several studies have shown that E-cadherin can be detected
in benign, LMP and malignant EOC (Sundfeldt et al., 1997). One hypothesis for this
apparent variation in expression between tumour types and stages of malignancy is a
transient up and down-regulation of E-cadherin, and feasibly other associated
differentially expressed genes identified in this study, to promote tumour proliferation
and multiple phases of invasion (Sundfeldt, 2003). As a significant number of the genes
found to discriminate between invasive and LMP EOC were involved in promotion or
suppression of the immune response, a punctuated process of invasion may actually assist
the tumour in preventing a successful attack from the body’s natural defences.
In the biological validation section of this chapter no significant difference was detected
in IHC detection levels of E-cadherin between the LMP and invasive tumours, despite the
significant reduction observed in the invasive EOC samples by microarray analysis. One
explanation for this apparent lack of correlation may be the higher stromal content of LMP
tumours, in which E-cadherin is not expressed. Profiling of tumours of this kind would
result in a reduction of the amount of E-cadherin expression detected by IHC per
high-power field compared to an equivalent area
of invasive EOC tumour, which has a generally higher tumour to stroma ratio.
Figure 5-16: Examples of ER-stained LMP and invasive tumour. (A) Sample LMP tumour with a higher proportion of stroma, and lower proportion of tumour present per high-power field (HPF). On average, mRNA expression of ERα is higher in LMP tumours than the invasive type, but lower expression or no change was observed at the protein level via IHC analysis. (B) Sample invasive tumour showing more tumour material and less stroma per HPF.
5.3.5. High throughput analysis of TMA IHC
Also described in this chapter is a method for automated quantification of antibody
staining intensity for large numbers of tumour sections, as generated from the use of TMAs.
A number of studies have demonstrated the use of commercially available image analysis and
statistics software for the objective analysis of IHC (Lehr et al., 1997; Lehr et al., 1999;
Matkowskyj et al., 2000). Such analysis is a frequent objective of gene expression studies
seeking to determine the
biological significance of differentially expressed mRNA levels in a given tissue sample
(Abd El-Rehim et al., 2005; Al Kuraya et al., 2004; Liu et al., 2005b; Pacifico et al.,
2004). Other groups have published and distributed software designed for handling the
large numbers of images and relevant clinical information associated with a TMA (Liu et
al., 2002).
The method proposed and demonstrated in this study makes use of Microsoft Visual
Basic scripting (Microsoft Corporation, USA) and several automated features of Adobe
Photoshop (Adobe Systems Inc., USA) to reduce complex digital images of IHC-stained
tumour sections to a mean pixel intensity and an associated
standard deviation value for each image. The method makes use of the threshold function to eliminate
unstained areas of a section during the quantification process. Because this function is
based on a user-definable level from 1 – 255 (arbitrary units), it allows the flexibility of
adapting the method for varying levels of background or secondary haematoxylin
staining. Whilst the method is not fully automated in the form of a packaged graphical
user interface, the Visual Basic code written and protocols described allowed the analysis
of almost 1000 IHC images in a relatively short period of time and could potentially be
developed into a more user-friendly application.
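
The thresholding and summary step described above can be illustrated with a short sketch. This is not the Visual Basic and Photoshop code used in this study; it is a minimal Python/NumPy illustration that assumes an 8-bit greyscale image in which stained tissue is darker than the unstained background, so that pixels at or above the threshold are discarded. The function name and the default threshold are hypothetical.

```python
import numpy as np

def stain_intensity(gray, threshold=200):
    """Return (mean, SD) of the pixels darker than `threshold`.

    Assumption: `gray` is an 8-bit greyscale image (0 = black, 255 = white)
    and stained tissue is darker than background, so pixels at or above
    the threshold are treated as unstained and excluded.
    """
    gray = np.asarray(gray, dtype=float)
    stained = gray[gray < threshold]      # keep only the "stained" pixels
    if stained.size == 0:                 # nothing passed the threshold
        return float("nan"), float("nan")
    return float(stained.mean()), float(stained.std())
```

Applying the function to every image exported from a TMA and tabulating the two values per core would reproduce the general shape of the workflow; the threshold would need to be tuned to the level of background or haematoxylin counterstaining, as noted above.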
5.4. Summary and conclusions from chapter
This chapter describes the use of molecular profiling to characterise EOC in relation to
other tumour types and also to explore potential reasons for the differences between the
LMP and invasive subtypes of the disease. Through extensive pathology reviewing and
microarray-based predictions, a high quality cDNA microarray dataset was generated
which was capable of predicting the primary origin of a tumour found in the ovary, as
either primary ovarian or metastatic, on the basis of a 231 gene signature. Extensive
bioinformatic analysis of the microarray data confirmed the diagnosis of the specimens of
primary serous invasive and LMP cancer and also revealed a large number of
differentially expressed genes. A significant proportion of the genes differentiating
between these tumour types were found to be involved in the body’s immune reaction to
invading tumour cells and in control - or perhaps loss of control - of the cell cycle, leading to
unrestrained proliferation. After ontology filtering, a number of inter-connected cell-cell
or cell-matrix adhesion molecules were also identified as having differential expression
between EOC subtypes. These genes regulate a tumour cell’s ability to attach to surfaces
and it is proposed that their dysregulation represents a crucial first step of the invasion process.
Gene expression pathway analysis was performed and a network of interacting
differentially expressed genes was deduced with a central function of cell adhesion
regulation, a recurring theme in several sections of this chapter. Finally, RT-PCR and
IHC on TMAs were used to confirm the expression of selected genes and their
corresponding proteins on a series of independent samples.
6. Discussion & Conclusion
6.1. Summary of major findings
6.1.1. Optimisation of microarray technology for large-scale tumour profiling studies
Despite the significant advances in molecular biology, robotics and computing power that
have made cDNA microarray technology accessible to many laboratories, ranging from
major pharmaceutical companies to small academic research groups, the generation of
high quality gene expression data remains a complex process that is dependent on many
factors. Chapter Three of this thesis describes the analysis of a number of important
variables in the microarray work flow that have the ability to impact on the quality and
interpretation of cDNA microarray data and thus impact on the success or failure of an
experiment carried out using this tool.
A novel method for quantifying and visualising spatial bias in cDNA microarray data was
proposed and its ability to favourably impact on a bioinformatic analysis of genes that
were predictive of tumour relapse was demonstrated. This simple statistical test is
applicable to any cDNA microarray platform and makes use of widely available statistical
software. An automated script to facilitate its application to large datasets was also
described. This test was also demonstrated to be an effective way of monitoring the
impact of new image analysis algorithms and software packages, or changes to the
printing or scanning protocols and/or equipment by providing objective measurements of
spatial bias, rather than relying on a subjective visual interpretation.
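
The test itself is described in Chapter Three and is not restated here. Purely as an illustration of how spatial bias can be reduced to a single objective number, the sketch below computes a one-way ANOVA F statistic for log-ratios grouped by print block. The choice of ANOVA, the grouping by block and the function name are assumptions made for this example and should not be read as the method proposed in this thesis.

```python
import numpy as np

def block_f_statistic(log_ratios, block_ids):
    """One-way ANOVA F statistic of log-ratios grouped by array block.

    A value near 1 suggests block means differ no more than expected by
    chance; larger values suggest block-level (spatial) bias.
    """
    log_ratios = np.asarray(log_ratios, dtype=float)
    block_ids = np.asarray(block_ids)
    grand_mean = log_ratios.mean()
    blocks = np.unique(block_ids)
    ss_between = 0.0
    ss_within = 0.0
    for b in blocks:
        values = log_ratios[block_ids == b]
        ss_between += values.size * (values.mean() - grand_mean) ** 2
        ss_within += ((values - values.mean()) ** 2).sum()
    df_between = blocks.size - 1
    df_within = log_ratios.size - blocks.size
    return (ss_between / df_between) / (ss_within / df_within)
```

Because the statistic is computed per array, it could be tracked before and after a change of scanner, image analysis software or printing protocol, in the spirit of the monitoring use described above.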
There are important practical and data-quality issues associated with the choice of a
reference RNA for a large-scale gene expression study. By comparing gene expression
data obtained from a series of EOC specimens hybridised to two different types of
reference RNA, it was determined that a reference comprised of cell-line RNA was, albeit
marginally, the most appropriate choice for a study of EOC gene expression. This
conclusion was based on the proportion of the total probe set detectably hybridised by the
reference RNA, the ability of the data generated to discriminate between tumour
subtypes, and practical considerations associated with ease of cross-study comparison and
the durability of the reference RNA resource. Although equal proportions of the
microarray analysed were detectably hybridised by each of the references tested, the
reference material generated by combining RNA extracted from a subset of the cohort of
interest (i.e. project-specific) identified a number of extra genes involved in a tumour’s
interaction with its environment. This may be an important factor for studies seeking to
explore this aspect of tumour biology.
By comparing expression data generated by two microarray scanners with a number of
different features, it was observed that a significant amount of systematic error can be
introduced at this stage of the cDNA microarray work flow. It was found that one scanner
offered clear advantages in terms of lower spatial bias and more accurate individual gene
expression measurements at a range of different Cy3:Cy5 ratios and absolute abundances.
Importantly, the findings from Chapter 3 were applied to the expression profiling and
bioinformatic analyses of EOC carried out in Chapters 4 and 5.
6.1.2. Gene expression based prediction of patient survival
Gene expression profiling has been used to explore the molecular basis of a wide range of
solid human cancers, including malignancies of the breast (Hedenfalk et al., 2001; Seth et
al., 2003; Sorlie et al., 2001), prostate (Bull et al., 2001; Calvo et al., 2002; Dhanasekaran
et al., 2001; Singh et al., 2002) and gastrointestinal tract (Boussioutas et al., 2003; Hasegawa
et al., 2002; Hippo et al., 2002). It has also been used successfully to profile various
forms of leukaemia and lymphoma (Alizadeh et al., 2000; Golub, 2001; Khan et al., 1998;
Lossos et al., 2004). In these studies, the gene expression data was analysed in the context
of various types of clinical information, such as histological subtypes, response to
treatment and length of survival or in relation to genetic information such as the mutation
status of genes such as BRCA1 or BRCA2.
Studies of EOC using microarrays to profile gene expression related to clinically
important variables have been limited by small sample sizes and the heterogeneous nature
of this type of cancer. During the course of this thesis a number of studies were published
in which comparisons between EOC and normal ovarian surface epithelium or
histological subtypes were made (Ono et al., 2000; Schaner et al., 2003; Schwartz et al.,
2002; Wang et al., 1999; Welsh et al., 2001). During the later stages of this work several
studies were published in which expression profiles were related to tumour grade,
malignant potential or patient prognosis (Gilks et al., 2005; Jazaeri et al., 2003; Spentzos
et al., 2004; Warrenfeltz et al., 2004).
Chapter 4 sought to profile specimens of EOC surgically removed from patients with
varying survival times. The goal of this chapter was to identify gene expression patterns
that correlate with the length of survival and explore the biology behind these patterns.
Based on compelling evidence that the amount of residual disease following surgery
impacts substantially on patient survival, analyses were confined to those patients with
adequate clinical information. The resulting cohort was analysed using a range of
approaches, including methods for considering survival as a continuous or a categorical
variable. None of the analyses generated a statistically significant list of survival-related
genes, almost certainly due to the limited sample size. Despite this, within the gene lists
that were generated, a number of biologically relevant genes were observed,
many of which are implicated in the development and progression of other cancer types,
other reproductive system diseases, as well as regulation of cell growth and proliferation.
An analysis similar to that by Spentzos et al (Spentzos et al., 2004) came the closest to
generating a statistically significant list of differentially expressed genes between patient
survival groups. The list of 27 genes obtained was enriched for molecules known to be
involved in calcium-binding and calcium channel functions. Coupled with interesting
epidemiological evidence about the level of dietary calcium intake and EOC-risk and also
the known importance of calcium-dependent cell-adhesion molecules in EOC progression
and invasion, these 27 genes may represent a prognostic group worthy of further
investigation.
Gene lists obtained from other studies of EOC were related to the data generated for this
chapter to determine whether they were able to segregate patients on the basis of survival,
level of residual disease or probability of tumour relapse. These gene lists were not able
to predict these key clinical parameters to the same extent achieved in their original
studies. This may reflect the heterogeneity of EOC leading to the difficulty of applying
findings based on one cohort of specimens to other samples or a deficiency in the quality
of the microarray data generated for this section of the thesis. It may also indicate that the
sample sizes used to analyse EOC have been insufficient to result in a truly universal
prognostic signature for this cancer type.
The phenomenon of non-overlapping sets of genes generated by independent studies of
the same cancer type, particularly breast cancer, for which a number of molecular
signatures for predicting the development of metastases have been published, has been
described by Ein-Dor et al (Ein-Dor et al., 2005). By focusing on one published dataset,
that of van 't Veer et al
(van 't Veer et al., 2002), it was found that no single gene had a very high correlation to
the outcome variable, rather a large number of genes in the dataset had moderately
correlating patterns of expression. As a result, a number of different non-overlapping
predictive subsets were identified that produced the same classification accuracy as the
published 70-gene profile. The ranking of genes in order of prediction accuracy was
observed to fluctuate drastically with even small changes in the training cohort of
samples.
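
The instability of gene rankings described above can be explored with a simple resampling experiment. The sketch below is an illustration only and is not the procedure of Ein-Dor et al; it assumes a samples-by-genes expression matrix and a numeric outcome vector, ranks genes by absolute Pearson correlation with outcome in repeated random subsets of the samples, and reports the mean pairwise overlap of the resulting top-k lists. All function and parameter names are hypothetical.

```python
import numpy as np

def top_genes(expr, outcome, k):
    """Indices of the k genes with the highest |Pearson r| to outcome."""
    expr = np.asarray(expr, dtype=float)
    outcome = np.asarray(outcome, dtype=float)
    x = expr - expr.mean(axis=0)
    y = outcome - outcome.mean()
    corr = (x * y[:, None]).sum(axis=0) / np.sqrt(
        (x ** 2).sum(axis=0) * (y ** 2).sum()
    )
    return set(np.argsort(-np.abs(corr))[:k].tolist())

def mean_top_k_overlap(expr, outcome, k=20, n_resamples=50, frac=0.8, seed=0):
    """Mean pairwise overlap (0 to 1) of top-k lists from random subsets."""
    expr = np.asarray(expr, dtype=float)
    outcome = np.asarray(outcome, dtype=float)
    rng = np.random.default_rng(seed)
    n = expr.shape[0]
    size = int(frac * n)
    lists = []
    for _ in range(n_resamples):
        idx = rng.choice(n, size=size, replace=False)  # random training subset
        lists.append(top_genes(expr[idx], outcome[idx], k))
    overlaps = [
        len(a & b) / k for i, a in enumerate(lists) for b in lists[i + 1:]
    ]
    return float(np.mean(overlaps))
```

A value close to 1 would indicate that the same genes are selected regardless of which samples are used for training, whereas a low value would be consistent with the fluctuating rankings reported for datasets in which many genes correlate only moderately with outcome.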
The issue of sample size on the discovery of significant predictive gene expression
signatures using microarrays has been evaluated by Ntzani et al (Ntzani and Ioannidis,
2003). This was achieved by comparing the cohort size and prediction accuracies of 84
published studies in which microarray data was used to predict clinical outcomes such as
death, metastasis, recurrence or response to therapy. It was found that a doubling in the
number of samples profiled led to a 3.5-fold increase in the probability of a study
identifying a significant association between gene expression and outcome. Furthermore,
it was also revealed that a significant association was 9.7 times more likely with each
ten-fold increase in the number of clones represented by the microarray platform used.
Together these observations about the possibility for generating multiple signatures of
outcome from a given microarray dataset, coupled with the impact of cohort and
microarray clone set size, explain how studies of the one cancer type can result in
non-overlapping sets of outcome-related genes. As cohort sizes in published studies continue
to expand over time and newer microarrays are developed in which the vast majority of
genes in the human genome can be profiled simultaneously, a convergence of predictive
gene signatures is expected in the future.
6.1.3. Molecular characterisation of ovarian LMP and invasive epithelial cancer
The LMP type of EOC represents a subtype of ovarian cancer with a number of clinically
important differences to the invasive form, which result in it having a significantly more
favourable prognosis. An analysis of both the mucinous and serous types of LMP tumour
was planned; however, pathology review and microarray
analysis determined that a significant proportion of the mucinous EOC specimens were actually
metastases from other tissues. Consequently, a decision was made to restrict the
investigation to the serous type tumours only.
Through pathology review and gene expression-based predictive analysis using LOOCV,
a high quality gene expression dataset and associated clinical information was generated.
A large proportion of the detectably hybridised genes present on the Peter Mac 10.5k
human cDNA microarray were found to be differentially expressed between carefully
selected and validated samples of serous LMP and invasive EOC.
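
For readers unfamiliar with LOOCV (leave-one-out cross-validation), the general procedure can be sketched as follows. The classifier used here, a nearest-centroid rule on Euclidean distance, is an assumption chosen for brevity and is not necessarily the classifier applied in Chapter 5; the function name is hypothetical. In a real analysis any gene selection step would also need to be repeated inside each fold to avoid an optimistic bias.

```python
import numpy as np

def loocv_accuracy(expr, labels):
    """Leave-one-out accuracy of a nearest-centroid classifier.

    expr   : samples x genes matrix
    labels : one class label per sample (e.g. "LMP" or "INV")
    """
    expr = np.asarray(expr, dtype=float)
    labels = np.asarray(labels)
    n = len(labels)
    correct = 0
    for i in range(n):
        train = np.arange(n) != i                 # hold out sample i
        classes = np.unique(labels[train])
        centroids = np.array(
            [expr[train & (labels == c)].mean(axis=0) for c in classes]
        )
        distances = np.linalg.norm(centroids - expr[i], axis=1)
        if classes[np.argmin(distances)] == labels[i]:
            correct += 1
    return correct / n
```

Samples that are repeatedly misclassified when held out are natural candidates for further pathology review, which is the sense in which prediction and review were combined above.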
As a part of this analysis, a predictive signature of EOC based on 231 genes was
identified. This was shown to be capable of accurately discriminating between samples of
primary EOC and a series of several hundred tumours of other origins, representing nine
types of primary tumour. Characterisation of these genes with gene ontology analysis
revealed that the EOCs have a unique pattern of cell-adhesion gene expression which
distinguishes them from the other cancer types used in the analysis.
Using a combination of gene ontology and novel pathway/network analysis a series of
interacting genes were identified that share a common function of controlling the
processes required to maintain cell-cell or cell-matrix adhesion, a critical step in the
process of tumour invasion. Also differentially expressed between LMP and invasive
serous EOC were large numbers of genes involved in regulation of the cell cycle and a
significant proportion of genes related to the body’s immune reaction to an invading
tumour.
Comparison of the genes identified as differentially expressed between LMP and invasive
EOC in this study to other published studies revealed a number of similarities, particularly
in terms of the biological processes represented. Statistically significant overlaps were
observed between the genes found in this chapter and those found from studies of breast
cancer DCIS and IDC, gastric cancer depth of invasion and a recently identified
transcriptional profile of undifferentiated cancer that appears to be almost universally
activated in human cancer. These findings suggest that the molecular events responsible
for the phenotypic differences between LMP and invasive EOC are similar to those
responsible for malignancy and invasion in other parts of the body.
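
The significance of an overlap between two gene lists is commonly assessed with a hypergeometric test, and a minimal sketch is given below. Which test was applied to the comparisons above is not restated in this section, so the example should be read as one standard option under that assumption rather than as the method used. The function name and argument names are hypothetical, and the test presumes both lists are drawn from the same universe of genes (for example, those common to both microarray platforms).

```python
from math import comb

def overlap_p_value(universe, list_a, list_b, shared):
    """P(overlap >= shared) if list_b genes were drawn at random.

    universe : number of genes that could appear in either list
    list_a   : size of the first gene list
    list_b   : size of the second gene list
    shared   : number of genes observed in both lists
    """
    total = comb(universe, list_b)
    upper = min(list_a, list_b)
    tail = sum(
        comb(list_a, k) * comb(universe - list_a, list_b - k)
        for k in range(shared, upper + 1)
    )
    return tail / total
```

The result is sensitive to the choice of universe: using all genes on an array rather than only those detectably expressed will generally make an overlap appear more significant.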
6.1.4. EOC and the differential expression of genes involved in cell adhesion processes; a recurring theme
Genes that have been demonstrated to regulate a cell’s ability to adhere to other cells of
the same kind and/or to components of the extracellular matrix were identified in this study as
having key roles in (i) molecular differentiation of EOC from nine other primary tumour
types, (ii) processes relating to length of patient survival and (iii) the phenotypic
differences between LMP and invasive disease.
The importance of adhesion genes in ovarian malignancies has been described by a
number of authors (Davies et al., 1998; Hashimoto et al., 1989; Lessan et al., 1999; Patel
et al., 2003; Rump et al., 2004; Sundfeldt, 2003; Zand et al., 2003). This study extends
previous findings by further underscoring their correlation with extent of disease spread
and survival. The extent to which cell adhesion genes are involved in EOC malignancy,
relative to nine other tumour types, was indicated by the significant representation of this
gene ontology in the 231-gene signature of EOC identified in Chapter 5. This gene set
was capable of distinguishing primary EOC from metastatic disease in the ovary, thus the genes
can be viewed as representing EOC-unique processes. That genes of this functionality
were also observed to have a key role in the LMP/invasive
phenotype highlights their importance and the potential benefits that may come from the
therapeutic manipulation of these processes.
The cell adhesion gene PLAU was identified as differentially expressed between serous LMP
and invasive EOC in Chapter 5. The interaction of PLAU with its receptor was recently
studied by Krol et al in a model of EOC progression and invasion. Tri-functional
inhibitors composed of N-TIMP-1 or -3 (human matrix metalloproteinase inhibitors) and
a chicken variant of the protease inhibitor cystatin, harbouring the PLAUR binding site of
PLAU (chCys-PLAU19-31), were transfected into ovarian cancer cell lines to
test their ability to reduce the growth and spread of ovarian cancer cells (Krol et al.,
2003). The transfected cell lines were observed to display the same adhesive and
proliferative features as a vector-only transfected control line, but exhibited a
significant reduction in invasive potential in vitro. When the cell lines were inoculated into the
peritoneum of nude mice, a significant reduction in tumour burden was observed with the
inhibitor expressing cell lines relative to those with the vector alone, indicating the
potential for these inhibitors to be used as gene therapy agents against solid ovarian
malignancies.
Another promising study into the potential therapeutic benefits of anti-adhesive drugs
involved their use in conjunction with standard chemotherapeutic agents to improve the
efficiency of tumour cell killing. Tumour cells grown as multicellular spheroids are
known to be inherently more resistant to a large array of chemotherapeutic drugs
compared to the same cells grown as dispersed monolayer cell cultures. This process is
known as acquired multicellular resistance (Kobayashi et al., 1993).
The drug hyaluronidase has been demonstrated to sensitise tumour cells to a range of
chemotherapeutic agents. In a study of the effect of this drug on mouse mammary cell
lines it was observed that this agent was able to disrupt tight clusters of cells which
resulted in a significant increase in chemosensitivity. For in vitro and in vivo models, this
drug was able to disrupt inter-cellular adhesion and sensitise cells and tumours to the
chemotherapeutic agent tested. The observation that the drug actually disperses clusters
of cells supports the hypothesis that its chemosensitising ability is a result not of
increased drug penetration but rather of its anti-adhesive properties. It is suggested that by over-riding cell
contact-dependent growth inhibition more cells are actively dividing, thus increasing the
proportion of tumour cells sensitive to cytotoxic agents (Croix et al., 1996).
Other methods for therapeutically manipulating cell adhesion genes include the use of
anti-E-cadherin monoclonal antibodies. By using these antibodies to disrupt E-cadherin-
mediated cell adhesion interactions in multicellular spheroids of colorectal cancer cells a
resensitisation to a range of chemotherapy agents was observed, including paclitaxel
(commonly used for EOC) but not to cisplatin. This demonstrates the principle of
modifying a tumour’s adhesive ability to enhance the efficacy of conventional therapies
(Green et al., 2004).
FAK/PTK2 (footnote 2) is a non-receptor protein tyrosine kinase that becomes phosphorylated and activated
during integrin-mediated EOC cell adhesion. Recent cell line studies have demonstrated
that the expression of this gene is up-regulated in invasive EOC and is significantly
associated with an aggressive phenotype, corresponding to a poor outcome in patients. In
addition, it was found that inhibiting FAK phosphorylation by introducing a
dominant-negative construct called FAK-related non-kinase (FRNK) into highly aggressive EOC
cells decreased invasion (56-85% decrease), migration (52-68% decrease) and cell
spreading. This indicates both the importance of this gene’s function in
key metastatic events and also its potential use in cell-adhesion based gene therapy
approaches to EOC treatment (Sood et al., 2004).
In summary, there appears to be growing interest in the use of gene therapy approaches to
modify the expression and/or function of molecules involved in regulating cell adhesion
interactions for a range of cancer types. This study provides several lists of candidate
cell-cell and cell-matrix adhesion genes that, with further validation and translational studies,
could be used in this fashion to alter the clinical course of invasive EOC.
2 The feature corresponding to PTK2 on the Peter Mac 10.5k human cDNA microarray was excluded during the unsupervised filtering of non-expressing genes in the Chapter 5 analysis of LMP and invasive EOC (Log-ratio variation P-value = 0.98). However, an individual ANOVA of this gene revealed a significantly higher PTK2 expression in the serous LMP tumours (P=0.002).
6.2. Future directions
6.2.1. Meta-analysis of gene expression datasets
Combining publicly available datasets is one method for increasing the
total sample size available for the type of bioinformatic analyses carried out in this thesis.
With the adoption of the MIAME guidelines and requirement for complete gene
expression profiles to be made publicly available prior to manuscript acceptance, the
body of raw data available for meta-analysis is continually expanding.
Many microarray studies referenced in this thesis have not made their entire datasets
publicly available. Most have opted to provide only the data specifically relating
to identified differentially expressed genes. This has restricted the opportunity for
meta-analysis of data generated for this study and other suitable microarray studies of EOC
outcome (Lancaster et al., 2004; Spentzos et al., 2004) or malignancy (Warrenfeltz et al.,
2004). The raw gene expression data used by Gilks et al (Gilks et al., 2005) to analyse
serous LMP and invasive EOC was recently made available through the Stanford
Microarray Database (SMD) (Sherlock et al., 2001). However, as discussed in Chapter 5,
inspection of these arrays revealed a range of hybridisation artefacts and irregular probe
distribution, reducing the value of the data extracted from these microarrays for
meta-analysis.
Databases such as SMD, ArrayExpress (Brazma et al., 2003) and the Gene Expression
Omnibus (Edgar et al., 2002) are online repositories of a wide range of high-throughput
data generated from both single and dual channel microarray experiments of mRNA,
genomic DNA and protein abundance (proteomics). It is hoped that with the increasing
adoption of policies requiring the full disclosure of raw expression data prior to
publication, along with the associated clinical information required to repeat the analyses
described, opportunities to extend this work by meta-analysis will arise.
6.2.2. Extension of cDNA expression dataset with Affymetrix GeneChip profiling
During the course of this project, the cost of Affymetrix GeneChip microarrays reduced
to a level where they are a viable option for large-scale tumour profiling studies, such as
that being carried out by the AOCS. As well as the price reduction, the refinement of the
construction of these chips and the extensive clone sets from which they are generated
have enabled the creation of the ‘whole genome’ expression microarray known as the
Human Genome U133 Plus 2.0. This single chip array represents over 47,000 transcripts,
covering far more of the human genome than the 10.5k cDNA microarray used for this study.
It includes 38,500 well characterised human genes and provides multiple independent measures
of each transcript per array, which is claimed to increase data accuracy and reproducibility by
lowering the probability of identifying differentially expressed genes by chance.
As a result of these developments, future tumour profiling by the AOCS will be carried
out using Human Genome U133 Plus 2.0 arrays. Approximately 500 specimens of EOC
with extensive clinical annotation are planned to be profiled on this platform. As shown
in Figure 6-1, the clone set used to create the Affymetrix array contains 70.3% of those
genes present on the cDNA microarray used in this study. This overlap includes 78.6% of
those identified as differentially expressed between LMP and invasive EOC in Chapter 5.
The future microarray profiling planned by the AOCS includes an extensive analysis of
genes related to patient survival and will involve many more samples than those analysed
in this thesis. The majority of specimens to be analysed on the Affymetrix platform are
from patients prospectively recruited into the study and as a result a more extensive and
complete level of clinical annotation will be available. Another branch of the AOCS
study will extend the analysis of LMP and invasive tumours carried out here, and will
also examine other histological subtypes such as clear cell and endometrioid tumours.
These future studies will benefit from the experience gained in the present body of work,
particularly with respect to issues of sample size and the value of specimen review by
pathologists who are experts in the field of gynaecological pathology. The analytical
methods outlined throughout this thesis are entirely scalable and will potentially serve as
a guide for the bioinformatic investigation of future Affymetrix GeneChip datasets.
The substantial overlap of genes present on the Peter Mac 10.5k cDNA microarray with
the Affymetrix platform will allow the data generated in this study to be incorporated
into future analyses. This will increase the total sample size and the probability of observing
significant relationships between gene expression and clinical variables. Another possible
future use of data from this thesis is as an independent validation set for evaluating
findings generated by Affymetrix profiling. This would achieve the double purpose of
confirming the significance of a discovered molecular signature on biological specimens
completely independent of the original cohort and of doing so with a microarray platform
based on a different probe type, demonstrating platform independence.
Figure 6-1: Visualisation of clone set overlap between Peter Mac 10.5k cDNA microarray and Affymetrix U133A Plus 2.0 genechip. Array features were matched using the gene-list homology function of Silicon Genetics Genespring 7.2 (Agilent Technologies, USA) which identifies genes shared between two lists on the basis of UniGene (build #184) and/or LocusLink identifiers. The region shown in orange corresponds to the overlap between clone sets used to generate the two microarrays. The red and yellow regions indicate those features unique to the Peter Mac and Affymetrix platforms respectively.
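[Figure 6-1 labels: Affymetrix U133A Plus 2.0, total features: 54,978 (47,283 unique to this platform). Peter Mac 10.5k cDNA microarray, total features: 10,944 (3,249 unique to this platform). Features shared by both platforms: 7,695.]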
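The overlap figures quoted in this section can be checked with a few lines of arithmetic. The sketch below is illustrative only: it is not the GeneSpring homology function, the toy identifiers are hypothetical, and the assignment of the three Figure 6-1 counts to regions (7,695 shared, 3,249 unique to the cDNA array, 47,283 unique to the Affymetrix array) is inferred from the fact that they sum to the two stated totals.

```python
def overlap_summary(list_a, list_b):
    """Count identifiers shared by, and unique to, two gene lists."""
    a, b = set(list_a), set(list_b)
    return {
        "shared": len(a & b),
        "only_a": len(a - b),
        "only_b": len(b - a),
    }


def percent(part, whole):
    """Percentage of `whole` represented by `part`, to one decimal place."""
    return round(100.0 * part / whole, 1)


# Toy example with hypothetical UniGene-style identifiers.
cdna_ids = ["Hs.1", "Hs.2", "Hs.3", "Hs.4"]
affy_ids = ["Hs.2", "Hs.3", "Hs.4", "Hs.5", "Hs.6"]
toy = overlap_summary(cdna_ids, affy_ids)

# Counts as read from Figure 6-1 (region assignment inferred from the totals).
shared, cdna_only, affy_only = 7695, 3249, 47283
cdna_total = shared + cdna_only   # 10,944 features on the cDNA array
affy_total = shared + affy_only   # 54,978 features on the Affymetrix array

# Share of the cDNA array that is also represented on the Affymetrix array.
cdna_coverage = percent(shared, cdna_total)
print(toy, cdna_total, affy_total, cdna_coverage)
```

Under these assumptions the shared fraction of the cDNA array reproduces the 70.3% quoted in the text.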
6.2.3. Translation of findings to in vivo studies of gene function and the potential for clinical application
A number of findings described in this thesis could be extended to determine their
clinical relevance through the use of translational approaches. These include the use of
RNA-interference (RNAi) to modulate the expression of the cell adhesion genes
identified in Chapters 4 and 5 in cell lines derived from either LMP or invasive EOC.
Any observed difference in cell growth, proliferation, adhesive or spreading ability may
suggest the suitability of a gene for further analysis using three dimensional cell culturing
systems or animal models of EOC.
Monoclonal antibodies targeted to gene products of differentially expressed genes could
also be used to investigate the in vivo effect of blocking cell adhesion gene products
and/or their specific receptors. Findings from this thesis suggest that such an intervention
could result in a significant reduction in a tumour’s ability to invade other tissues
and organs. As a result, its growth may be restricted to a more localised area, increasing
the chance of it being completely excised by surgery alone and consequently
improving patient prognosis, as indicated by the significant relationship between residual
disease and outcome.
6.2.4. Conclusion
The work outlined in this study critically evaluates several important aspects of the
microarray workflow and determines a number of methods for generating high quality
cDNA microarray gene expression data with a minimum of systematic error. It also
provides an analysis of genes involved in both length of patient survival and the LMP or
invasive phenotype of this highly lethal disease. These analyses point to
calcium-dependent cell adhesion molecules as potential novel therapeutic targets which could be
manipulated to improve patient prognosis and enhance the efficacy of existing treatments.
The future microarray and translational work planned as part of the ongoing AOCS, as
well as meta-analysis of any suitable publicly available data, will extend and hopefully
support the findings of this thesis. With the increase in understanding of the molecular
foundation of EOC, brought about by high throughput genomic tools and analytical
approaches such as those used in this study, it is hoped a reduction in the burden of this
disease on the community will be seen in the near future.
7. Bibliography
Abd El-Rehim, D. M., Ball, G., Pinder, S. E., Rakha, E., Paish, C., Robertson, J. F.,
Macmillan, D., Blamey, R. W., and Ellis, I. O. (2005). High-throughput protein
expression analysis using tissue microarray technology of a large well-characterised
series identifies biologically distinct classes of breast cancer confirming recent cDNA
expression analyses. Int J Cancer.
Adib, T. R., Henderson, S., Perrett, C., Hewitt, D., Bourmpoulia, D., Ledermann, J., and
Boshoff, C. (2004). Predicting biomarkers for ovarian cancer using gene-expression
microarrays. Br J Cancer 90, 686-692.
Agarwal, R., and Kaye, S. B. (2003). Ovarian cancer: strategies for overcoming resistance
to chemotherapy. Nat Rev Cancer 3, 502-516.
Agarwal, R., and Kaye, S. B. (2005). Prognostic factors in ovarian cancer: how close are
we to a complete picture? Ann Oncol 16, 4-6.
Ahmed, A. A., Vias, M., Iyer, N. G., Caldas, C., and Brenton, J. D. (2004). Microarray
segmentation methods significantly influence data precision. Nucleic Acids Res 32, e50.
Al-Shahrour, F., Diaz-Uriarte, R., and Dopazo, J. (2004). FatiGO: a web tool for finding
significant associations of Gene Ontology terms with groups of genes. Bioinformatics 20,
578-580.
Al Kuraya, K., Simon, R., and Sauter, G. (2004). Tissue microarrays for high-throughput
molecular pathology. Ann Saudi Med 24, 169-174.
Albiston, A. L., Obeyesekere, V. R., Smith, R. E., and Krozowski, Z. S. (1994). Cloning
and tissue distribution of the human 11 beta-hydroxysteroid dehydrogenase type 2
enzyme. Mol Cell Endocrinol 105, R11-17.
Alizadeh, A. A., Eisen, M. B., Davis, R. E., Ma, C., Lossos, I. S., Rosenwald, A.,
Boldrick, J. C., Sabet, H., Tran, T., Yu, X., et al. (2000). Distinct types of diffuse large B-
cell lymphoma identified by gene expression profiling. Nature 403, 503-511.
Allen, H. J., Porter, C., Gamarra, M., Piver, M. S., and Johnson, E. A. (1987). Isolation
and morphologic characterization of human ovarian carcinoma cell clusters present in
effusions. Exp Cell Biol 55, 194-208.
Altman, D. G. (2001). Systematic reviews of evaluations of prognostic variables. BMJ
323, 224-228.
Altschul, S. F., Gish, W., Miller, W., Myers, E. W., and Lipman, D. J. (1990). Basic local
alignment search tool. J Mol Biol 215, 403-410.
Ambroise, C., and McLachlan, G. J. (2002). Selection bias in gene extraction on the basis
of microarray gene-expression data. Proc Natl Acad Sci U S A 99, 6562-6566.
Ambrosini, G., Adida, C., and Altieri, D. C. (1997). A novel anti-apoptosis gene,
survivin, expressed in cancer and lymphoma. Nat Med 3, 917-921.
Anttila, M., Kosma, V. M., Ji, H., Wei-Ling, X., Puolakka, J., Juhola, M., Saarikoski, S.,
and Syrjanen, K. (1998). Clinical significance of alpha-catenin, collagen IV, and Ki-67
expression in epithelial ovarian cancer. J Clin Oncol 16, 2591-2600.
Anttila, M. A., Kosma, V. M., Hongxiu, J., Puolakka, J., Juhola, M., Saarikoski, S., and
Syrjanen, K. (1999). p21/WAF1 expression as related to p53, cell proliferation and
prognosis in epithelial ovarian cancer. Br J Cancer 79, 1870-1878.
Aris, V. M., Cody, M. J., Cheng, J., Dermody, J. J., Soteropoulos, P., Recce, M., and
Tolias, P. P. (2004). Noise filtering and nonparametric analysis of microarray data
underscores discriminating markers of oral, prostate, lung, ovarian and breast cancer.
BMC Bioinformatics 5, 185.
Ashburner, M., Ball, C. A., Blake, J. A., Botstein, D., Butler, H., Cherry, J. M., Davis, A.
P., Dolinski, K., Dwight, S. S., Eppig, J. T., et al. (2000). Gene ontology: tool for the
unification of biology. The Gene Ontology Consortium. Nat Genet 25, 25-29.
Auersperg, N., Pan, J., Grove, B. D., Peterson, T., Fisher, J., Maines-Bandiera, S.,
Somasiri, A., and Roskelley, C. D. (1999). E-cadherin induces mesenchymal-to-epithelial
transition in human ovarian surface epithelium. Proc Natl Acad Sci U S A 96, 6249-6254.
Australian Institute of Health and Welfare, and Australasian Association of Cancer
Registries (2001). Cancer survival in Australia, 2001: [relative survival data for selected
cancers for the period 1982 to 1997], (Canberra: Australian Institute of Health and
Welfare).
Axon (2004). GenePix Pro. In, pp. GenePix Pro is the complete standalone image
analysis software for microarrays, tissue arrays and cell arrays.
Baekelandt, M., Kristensen, G. B., Nesland, J. M., Trope, C. G., and Holm, R. (1999).
Clinical significance of apoptosis-related factors p53, Mdm2, and Bcl-2 in advanced
ovarian cancer. J Clin Oncol 17, 2061.
Baker, S. C., Bauer, S. R., Beyer, R. P., Brenton, J. D., Bromley, B., Burrill, J., Causton,
H., Conley, M. P., Elespuru, R., Fero, M., et al. (2005). The External RNA Controls
Consortium: a progress report. Nat Methods 2, 731-734.
Balazsi, G., Kay, K. A., Barabasi, A. L., and Oltvai, Z. N. (2003). Spurious spatial
periodicity of co-expression in microarray data due to printing design. Nucleic Acids Res
31, 4425-4433.
Bankhead, C. R., Kehoe, S. T., and Austoker, J. Symptoms associated with diagnosis of
ovarian cancer: a systematic review.
Bar, J. K., Grelewski, P., Popiela, A., Noga, L., and Rabczynski, J. (2004). Type IV
collagen and CD44v6 expression in benign, malignant primary and metastatic ovarian
tumors: correlation with Ki-67 and p53 immunoreactivity. Gynecol Oncol 95, 23-31.
Bardin, A., Hoffmann, P., Boulle, N., Katsaros, D., Vignon, F., Pujol, P., and Lazennec,
G. (2004). Involvement of estrogen receptor beta in ovarian carcinogenesis. Cancer Res
64, 5861-5869.
Barker, S. D., Coolidge, C. J., Kanerva, A., Hakkarainen, T., Yamamoto, M., Liu, B.,
Rivera, A. A., Bhoola, S. M., Barnes, M. N., Alvarez, R. D., et al. (2003). The secretory
leukoprotease inhibitor (SLPI) promoter for ovarian cancer gene therapy. J Gene Med 5,
300-310.
Barton, P. J., Cullen, M. E., Townsend, P. J., Brand, N. J., Mullen, A. J., Norman, D. A.,
Bhavsar, P. K., and Yacoub, M. H. (1999). Close physical linkage of human troponin
genes: organization, sequence, and expression of the locus encoding cardiac troponin I
and slow skeletal troponin T. Genomics 57, 102-109.
Bast, R. C., Jr., Feeney, M., Lazarus, H., Nadler, L. M., Colvin, R. B., and Knapp, R. C.
(1981). Reactivity of a monoclonal antibody with human ovarian carcinoma. J Clin Invest
68, 1331-1337.
Baty, F., Bihl, M. P., Perriere, G., Culhane, A. C., and Brutsche, M. H. (2005). Optimized
between-group classification: a new jackknife-based gene selection procedure for
Zhou, H., Kuang, J., Zhong, L., Kuo, W. L., Gray, J. W., Sahin, A., Brinkley, B. R., and
Sen, S. (1998). Tumour amplified kinase STK15/BTAK induces centrosome
amplification, aneuploidy and transformation. Nat Genet 20, 189-193.
Zorn, K. K., Jazaeri, A. A., Awtrey, C. S., Gardner, G. J., Mok, S. C., Boyd, J., and
Birrer, M. J. (2003). Choice of normal ovarian control influences determination of
differentially expressed genes in ovarian cancer expression profiling studies. Clin Cancer
Res 9, 4811-4818.
Appendix A: FIGO staging of EOC
Stage I: The cancer is still contained within the ovary (or ovaries).
Stage IA: Cancer has developed in one ovary, and the tumor is confined to the inside of the ovary. There is no cancer on the outer surface of the ovary. Laboratory examination of washings from the abdomen and pelvis did not find any cancer cells.
Stage IB: Cancer has developed within both ovaries without any tumor on their outer surfaces. Laboratory examination of washings from the abdomen and pelvis did not find any cancer cells.
Stage IC: The cancer is present in one or both ovaries and one or more of the following are present:
• Cancer on the outer surface of at least one of the ovaries
• In the case of cystic tumors (fluid-filled tumors), the capsule (outer wall of the tumor) has ruptured (burst)
• Laboratory examination found cancer cells in fluid or washings from the abdomen.
Stage II: The cancer is in one or both ovaries and has involved other organs (such as the uterus, fallopian tubes, bladder, the sigmoid colon, or the rectum) within the pelvis.
Stage IIA: The cancer has spread to or has actually invaded the uterus or the fallopian tubes, or both. Laboratory examination of washings from the abdomen did not find any cancer cells.
Stage IIB: The cancer has spread to other nearby pelvic organs such as the bladder, the sigmoid colon, or the rectum. Laboratory examination of fluid from the abdomen did not find any cancer cells.
Stage IIC: The cancer has spread to pelvic organs as in stages IIA or IIB and laboratory examination of the washings from the abdomen found evidence of cancer cells.
Stage III: The cancer involves one or both ovaries, and one or both of the following are present: (1) cancer has spread beyond the pelvis to the lining of the abdomen; (2) cancer has spread to lymph nodes.
Stage IIIA: During the staging operation, the surgeon can see cancer involving the ovary or ovaries, but no cancer is grossly visible (can be seen without using a microscope) in the abdomen and the cancer has not spread to lymph nodes. However, when biopsies are checked under a microscope, tiny deposits of cancer are found in the lining of the upper abdomen.
Stage IIIB: There is cancer in one or both ovaries, and deposits of cancer large enough for the surgeon to see, but smaller than 2 cm (about 3/4 inch) across, are present in the abdomen. Cancer has not spread to the lymph nodes.
Stage IIIC: The cancer is in one or both ovaries, and one or both of the following are present:
• Cancer has spread to lymph nodes.
• Deposits of cancer larger than 2 cm (about 3/4 inch) across are seen in the abdomen.
Stage IV: This is the most advanced stage of ovarian cancer. The cancer is in one or both ovaries. Distant metastasis (spread of the cancer to the inside of the liver, the lungs, or other organs located outside of the peritoneal cavity) has occurred. Finding ovarian cancer cells in pleural fluid (from the cavity that surrounds the lungs) is also evidence of stage IV disease.
Recurrent ovarian cancer: This means that the disease has come back (recurred) after completion of treatment.
Source: American Cancer Society: http://www.cancer.org/
Appendix B: Specimens of EOC included in TMA Details of AOCS cases used for TMA #AOCS-01
p78 (mouse) Hs.515369 TYROBP TYRO protein tyrosine kinase binding protein Hs.351279 HLA-DMA Major histocompatibility complex, class II, DM alpha Hs.347270 HLA-
DPA1 Major histocompatibility complex, class II, DP alpha 1
APBB1IP Amyloid beta (A4) precursor protein-binding, family B, member 1 interacting protein -0.073620
AKR1C2 Aldo-keto reductase family 1, member C2 (dihydrodiol dehydrogenase 2; bile acid binding protein; 3-alpha hydroxysteroid dehydrogenase, type III)
Appendix H: Higher-level gene ontologies represented by genes differentially expressed between survival groups
High Level Function | Significance | # Genes
Cancer | 4.36 x 10^-7 to 9.83 x 10^-3 | 18
Cell Death | 2.71 x 10^-6 to 9.83 x 10^-3 | 18
Reproductive System Disease | 3.95 x 10^-6 to 9.83 x 10^-3 | 12
Cellular Growth and Proliferation | 4.88 x 10^-6 to 9.83 x 10^-3 | 13
Skeletal and Muscular Disorders | 4.88 x 10^-6 to 9.83 x 10^-3 | 8
Tumor Morphology | 2.34 x 10^-5 to 9.83 x 10^-3 | 7
Gastrointestinal Disease | 2.36 x 10^-5 to 9.83 x 10^-3 | 7
Cellular Assembly and Organization | 2.36 x 10^-5 to 9.83 x 10^-3 | 6
Ophthalmic Disease | 2.36 x 10^-5 to 9.83 x 10^-3 | 3
Cell Morphology | 2.36 x 10^-5 to 9.83 x 10^-3 | 11
Cellular Movement | 3.21 x 10^-5 to 4.92 x 10^-3 | 10
Cell Cycle | 4.72 x 10^-5 to 9.83 x 10^-3 | 14
Connective Tissue Development and Function | 7.05 x 10^-5 to 9.83 x 10^-3 | 9
Tissue Development | 7.05 x 10^-5 to 9.83 x 10^-3 | 9
Gene Expression | 1.33 x 10^-4 to 4.92 x 10^-3 | 13
Renal and Urological Disease | 1.41 x 10^-4 to 4.92 x 10^-3 | 3
Developmental Disorder | 2.33 x 10^-4 to 9.83 x 10^-3 | 4
Organ Morphology | 2.33 x 10^-4 to 9.83 x 10^-3 | 6
Hematological Disease | 2.33 x 10^-4 to 9.83 x 10^-3 | 6
Organismal Injury and Abnormalities | 2.33 x 10^-4 to 9.83 x 10^-3 | 6
Cellular Development | 3.29 x 10^-4 to 9.83 x 10^-3 | 11
Skeletal and Muscular System Development and Function | 3.29 x 10^-4 to 9.83 x 10^-3 | 7
Respiratory Disease | 3.49 x 10^-4 to 9.83 x 10^-3 | 5
Organ Development | 3.66 x 10^-4 to 9.83 x 10^-3 | 5
Hepatic System Development and Function | 4.48 x 10^-4 to 9.83 x 10^-3 | 3
Reproductive System Development and Function | 4.87 x 10^-4 to 9.83 x 10^-3 | 6
Cellular Function and Maintenance | 4.87 x 10^-4 to 9.83 x 10^-3 | 7
Tissue Morphology | 4.87 x 10^-4 to 9.83 x 10^-3 | 11
Connective Tissue Disorders | 5.57 x 10^-4 to 4.92 x 10^-3 | 5
Hematological System Development and Function | 6.48 x 10^-4 to 9.83 x 10^-3 | 7
Immune & Lymphatic System Development & Function | 6.48 x 10^-4 to 5.59 x 10^-3 | 5
Organismal Functions | 7.58 x 10^-4 to 7.58 x 10^-4 | 3
Dermatological Diseases and Conditions | 7.58 x 10^-4 to 9.83 x 10^-3 | 6
Neurological Disease | 8.35 x 10^-4 to 9.83 x 10^-3 | 10
Immunological Disease | 1.17 x 10^-3 to 9.83 x 10^-3 | 6
DNA Replication, Recombination, and Repair | 1.18 x 10^-3 to 9.83 x 10^-3 | 5
Embryonic Development | 1.26 x 10^-3 to 9.83 x 10^-3 | 8
Post-Translational Modification | 1.78 x 10^-3 to 1.78 x 10^-3 | 2
Cell Signaling | 1.84 x 10^-3 to 3.71 x 10^-3 | 5
Vitamin and Mineral Metabolism | 1.84 x 10^-3 to 1.84 x 10^-3 | 4
Inflammatory Disease | 2.07 x 10^-3 to 4.92 x 10^-3 | 3
Cellular Compromise | 2.07 x 10^-3 to 9.83 x 10^-3 | 4
Organismal Development | 2.78 x 10^-3 to 9.83 x 10^-3 | 5
Small Molecule Biochemistry | 3.71 x 10^-3 to 3.71 x 10^-3 | 3
Organismal Survival | 3.82 x 10^-3 to 9.83 x 10^-3 | 10
Cardiovascular Disease | 3.82 x 10^-3 to 9.83 x 10^-3 | 3
Cardiovascular System Development and Function | 4.23 x 10^-3 to 9.83 x 10^-3 | 3
Cell-To-Cell Signaling and Interaction | 4.67 x 10^-3 to 9.83 x 10^-3 | 4
Energy Production | 4.67 x 10^-3 to 4.67 x 10^-3 | 2
Nucleic Acid Metabolism | 4.67 x 10^-3 to 9.83 x 10^-3 | 3
Hair and Skin Development and Function | 4.92 x 10^-3 to 9.83 x 10^-3 | 5
Carbohydrate Metabolism | 4.92 x 10^-3 to 4.92 x 10^-3 | 1
Endocrine System Development and Function | 4.92 x 10^-3 to 9.83 x 10^-3 | 4
Genetic Disorder | 4.92 x 10^-3 to 9.83 x 10^-3 | 3
Hepatic System Disease | 4.92 x 10^-3 to 9.83 x 10^-3 | 2
Metabolic Disease | 4.92 x 10^-3 to 4.92 x 10^-3 | 1
Nutritional Disease | 4.92 x 10^-3 to 4.92 x 10^-3 | 1
Visual System Development and Function | 4.92 x 10^-3 to 9.83 x 10^-3 | 4
Nervous System Development and Function | 4.92 x 10^-3 to 9.83 x 10^-3 | 5
Renal & Urological System Development and Function | 4.92 x 10^-3 to 9.83 x 10^-3 | 3
Free Radical Scavenging | 4.92 x 10^-3 to 4.92 x 10^-3 | 1
Immune Response | 4.92 x 10^-3 to 4.92 x 10^-3 | 1
Endocrine System Disorders | 4.92 x 10^-3 to 4.92 x 10^-3 | 1
Viral Function | 4.92 x 10^-3 to 9.83 x 10^-3 | 1
Cellular Response to Therapeutics | 4.92 x 10^-3 to 9.83 x 10^-3 | 1
Behaviour | 4.92 x 10^-3 to 9.83 x 10^-3 | 1
Protein Synthesis | 9.83 x 10^-3 to 9.83 x 10^-3 | 1
Appendix I: Samples used to generate predictive gene expression signature of primary EOC
Array ID, Cancer type, Subtype, Pathology review comments, % Tumour
UP012 Breast Ductal Breast (ductal) 90% (30% necrotic)
UP014 Breast Lobular Breast (lobular) 70%
UP016 Breast Breast(ductal) 70%
UP064 Breast Lobular Breast(lobularg2) 90%
UP082 Breast Ductal Breast 100%
UP096 Breast Ductal Breast (ductal) 90%
UP097 Breast Lobular Breast(lobular) 95%
UP098 Breast Lobular Breast (lobular) 90%
UP102 Breast Ductal Breast (ductal) 90%
UP111 Breast Ductal Breast(ductal) 95%
UP113 Breast Ductal Breast (ductal) 80%
UP116 Breast Lobular Breast (lobular) 70%
UP119 Breast Ductal Breast (ductal) 90%
UP161 Breast Ductal Breast(ductal) T130% T250%
UP166 Breast Lobular Breast(lobular) 95%
UP213 Breast Ductal Breast(ductal) 90%
UP423 Breast Breast Infiltrating ductal 60%
UP426 Breast Breast lobular 80%
UP428 Breast Breast Infiltrating ductal 70%
UP451 Breast Infiltrating ductal 60%
UP017 Colorectal Colorectal(moderate) 80%
UP019 Colorectal Colorectal (moderate) 70%
UP047 Colorectal Colorectal (moderate) 30%
UP062 Colorectal Colorectal 100%
UP063 Colorectal Colorectal (moderate) 70%
UP069 Colorectal
UP080 Colorectal Colorectal (moderate) 30%
UP341 Colorectal Colon adenocarcinoma 40%
UP356 Colorectal Mucinous adenocarcinoma 100%
UP369 Colorectal Colon adenocarcnimoa 100%
UP371 Colorectal Colonic adenocarcinoma 70%
UP380 Colorectal Adenocarinoma of rectum 20%
UP388 Colorectal Colon adenocarcinoma (moderate) 5%
UP399 Colorectal Colorectal adenocarcinoma (moderate) 80%
UP442 Colorectal Colon adenocarcinoma (moderate) 85%
UP453 Colorectal Colon adenocarcinoma (moderate) 70%
UP034 Gastric Gastric (intestinal) 80%
UP040 Gastric Diffuse Gastric (diffuse) 10%
UP045 Gastric Signet ring Gastric(diffuse) 80%
UP057 Gastric Intestinal Gastric (moderate) 85%
UP058 Gastric Gastric(diffuse) 40%
UP085 Gastric diffuse Gastric (diffuse) 30%
UP127 Gastric Gastric (diffuse) 50%
Appendix J: Output of prediction of primary ovarian origin for LMP and invasive EOC cohort
Patient ID | Histology | Invasive/LMP | 1kNN | 3kNN | Nearest Centroid | Linear Discriminant Analysis
91.039 | Serous | Invasive | Ovarian | Ovarian | Ovarian | Other
85.064 | Serous | Invasive | Ovarian | Ovarian | Ovarian | Ovarian
86.058 | Serous | Invasive | Ovarian | Ovarian | Ovarian | Ovarian
91.007 | Serous | Invasive | Ovarian | Ovarian | Ovarian | Ovarian
91.052 | Serous | Invasive | Ovarian | Ovarian | Ovarian | Ovarian
93.001 | Serous | Invasive | Ovarian | Ovarian | Ovarian | Ovarian
93.004 | Serous | Invasive | Ovarian | Ovarian | Ovarian | Ovarian
93.117 | Serous | Invasive | Ovarian | Ovarian | Ovarian | Ovarian
93.131 | Serous | Invasive | Ovarian | Ovarian | Ovarian | Ovarian
94.017 | Serous | Invasive | Ovarian | Ovarian | Ovarian | Ovarian
94.127 | Serous | Invasive | Ovarian | Ovarian | Ovarian | Ovarian
P00756 | Serous | Invasive | Ovarian | Ovarian | Ovarian | Ovarian
90.037 | Serous | LMP | Ovarian | Ovarian | Ovarian | Ovarian
90.063 | Serous | LMP | Ovarian | Ovarian | Ovarian | Ovarian
91.077 | Serous | LMP | Ovarian | Ovarian | Ovarian | Ovarian
93.007 | Serous | LMP | Ovarian | Ovarian | Ovarian | Ovarian
92.014 | Serous | LMP | Ovarian | Ovarian | Ovarian | Ovarian
92.018 | Serous | LMP | Ovarian | Ovarian | Ovarian | Ovarian
93.073 | Serous | LMP | Ovarian | Ovarian | Ovarian | Ovarian
93.079 | Serous | LMP | Ovarian | Ovarian | Ovarian | Ovarian
94.046 | Serous | LMP | Ovarian | Ovarian | Ovarian | Ovarian
95.006 | Serous | LMP | Ovarian | Ovarian | Ovarian | Ovarian
22027 | Serous | LMP | Ovarian | Ovarian | Ovarian | Ovarian
44232 | Serous | LMP | Ovarian | Ovarian | Ovarian | Ovarian
70056 | Serous | LMP | Ovarian | Ovarian | Ovarian | Ovarian
70057 | Serous | LMP | Ovarian | Ovarian | Ovarian | Ovarian
93.090 | Serous | LMP | Ovarian | Ovarian | Ovarian | Ovarian
P00633 | Serous | LMP | Ovarian | Ovarian | Ovarian | Ovarian
WM389A | Serous | LMP | Ovarian | Ovarian | Ovarian | Ovarian
WM542A | Serous | LMP | Ovarian | Ovarian | Ovarian | Ovarian
WM578A | Serous | LMP | Ovarian | Ovarian | Ovarian | Ovarian
90.007 | Mucinous | Invasive | Other | Other | Other | Other
94.036 | Mucinous | Invasive | Other | Other | Other | Other
93.064 | Mucinous | Invasive | Ovarian | Ovarian | Ovarian | Ovarian
94.112 | Mucinous | Invasive | Ovarian | Ovarian | Ovarian | Ovarian
P00488 | Mucinous | LMP | Other | Other | Other | Other
WM439A | Mucinous | LMP | Other | Other | Other | Other
P00718 | Mucinous | LMP | Other | Other | Other | Other
WM438 | Mucinous | LMP | Other | Other | Other | Other
WM223 | Mucinous | LMP | Other | Other | Other | Other
94.080 | Mucinous | LMP | Other | Ovarian | Other | Other
92.011 | Mucinous | LMP | Other | Ovarian | Other | Other
93.085 | Mucinous | LMP | Ovarian | Ovarian | Other | Other
51030 | Mucinous | LMP | Ovarian | Ovarian | Other | Other
P00807 | Mucinous | LMP | Ovarian | Other | Other | Ovarian
93.002 | Mucinous | LMP | Ovarian | Ovarian | Other | Ovarian
51026 | Mucinous | LMP | Ovarian | Ovarian | Other | Ovarian
P00627 | Mucinous | LMP | Ovarian | Ovarian | Other | Ovarian
94.072 | Mucinous | LMP | Other | Other | Ovarian | Ovarian
93.077 | Mucinous | LMP | Ovarian | Ovarian | Ovarian | Ovarian
44247 | Mucinous | LMP | Ovarian | Ovarian | Ovarian | Ovarian
94.030 | Mucinous | LMP | Ovarian | Ovarian | Ovarian | Ovarian
P00784 | Mucinous | LMP | Ovarian | Ovarian | Ovarian | Ovarian
P00935 | Mucinous | LMP | Ovarian | Ovarian | Ovarian | Ovarian
P00934 | Mucinous | LMP | Ovarian | Ovarian | Ovarian | Ovarian
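For readers unfamiliar with the classifier names in the column headings, the following is a minimal sketch of how k-nearest-neighbour (1kNN, 3kNN) and nearest-centroid predictions can be produced. It is not the software used to generate this appendix: the function names, the Euclidean distance metric and the two-gene toy expression values are all assumptions made purely for illustration, and linear discriminant analysis is omitted for brevity.

```python
import math
from collections import Counter


def euclidean(u, v):
    """Euclidean distance between two expression vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def knn_predict(train, labels, sample, k):
    """Majority vote among the k training samples closest to `sample`."""
    order = sorted(range(len(train)), key=lambda i: euclidean(train[i], sample))
    votes = Counter(labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def nearest_centroid_predict(train, labels, sample):
    """Assign `sample` to the class whose mean profile is closest."""
    centroids = {}
    for label in set(labels):
        rows = [x for x, y in zip(train, labels) if y == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return min(centroids, key=lambda lab: euclidean(centroids[lab], sample))


# Hypothetical training data: expression ratios for two signature genes.
train = [[2.5, 3.0], [2.8, 2.6], [2.2, 3.4],   # ovarian-like profiles
         [0.9, 0.8], [1.1, 0.7], [0.8, 1.0]]   # profiles of other origin
labels = ["Ovarian"] * 3 + ["Other"] * 3

unknown = [2.4, 2.9]
calls = {
    "1kNN": knn_predict(train, labels, unknown, k=1),
    "3kNN": knn_predict(train, labels, unknown, k=3),
    "Nearest Centroid": nearest_centroid_predict(train, labels, unknown),
}
print(calls)
```

Each row of the table corresponds to one such set of calls for a single specimen; disagreement between columns reflects samples lying near the boundary between the two classes.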
35
8 App
endi
x K
: Pre
dict
ive
gene
s exp
ress
ion
sign
atur
e of
pri
mar
y E
OC
Ran
k
Uni
Gen
e N
ame
Uni
gene
Sym
bol
t-va
lue
Para
met
ric
p-va
lue
% C
V
supp
ort
Mea
n of
rat
ios
in c
lass
1:
othe
r
Mea
n of
rat
ios i
n cl
ass 2
: Ova
rian
1 F-
box
prot
ein
21
FBXO
21
-12.
74
p <
0.00
0001
10
0 0.
856
2.68
2
Wilm
s tum
or 1
W
T1
-11.
74
p <
0.00
0001
10
0 0.
653
7.57
3 3
Zinc
fing
er p
rote
in 2
61
ZNF2
61
-7.4
1 p
< 0.
0000
01
100
0.85
5 1.
791
4 M
yelin
ass
ocia
ted
glyc
opro
tein
M
AG
-7.0
5 p
< 0.
0000
01
100
0.86
7 2.
398
5 Zi
nc fi
nger
pro
tein
, mul
tityp
e 2
ZFPM
2 -6
.98
p <
0.00
0001
10
0 0.
726
3.04
8
6 SW
I/SN
F re
late
d, m
atrix
ass
ocia
ted,
act
in d
epen
dent
regu
lato
r of c
hrom
atin
, su
bfam
ily d
, mem
ber 3
SM
ARC
D3
-6.8
9 p
< 0.
0000
01
100
0.81
3 2.
118
7 R
ap g
uani
ne n
ucle
otid
e ex
chan
ge fa
ctor
(GEF
) 3
RAPG
EF3
-6.8
7 p
< 0.
0000
01
100
0.83
2 2.
278
8 Sp
ondi
n 1,
ext
race
llula
r mat
rix p
rote
in
SPO
N1
-6.7
1 p
< 0.
0000
01
100
0.71
1 4.
235
10
Cat
enin
(cad
herin
-ass
ocia
ted
prot
ein)
, alp
ha-li
ke 1
C
TNN
AL1
-6.5
9 p
< 0.
0000
01
100
0.87
5 2.
195
11
LY6/
PLA
UR
dom
ain
cont
aini
ng 1
LY
PDC
1 -6
.58
p <
0.00
0001
10
0 1.
081
4.99
4 12
Sc
aven
ger r
ecep
tor c
lass
A, m
embe
r 3
SCAR
A3
-6.5
3 p
< 0.
0000
01
100
0.80
1 2.
855
13
Fibr
obla
st g
row
th fa
ctor
18
FGF1
8 -6
.51
p <
0.00
0001
10
0 0.
997
2.90
3 15
N
eura
l cel
l adh
esio
n m
olec
ule
1 N
CAM
1 -6
.22
p <
0.00
0001
10
0 0.
758
2.48
16
M
yosi
n, h
eavy
pol
ypep
tide
10, n
on-m
uscl
e M
YH10
-6
.17
p <
0.00
0001
10
0 0.
759
1.77
1 17
PD
Z do
mai
n co
ntai
ning
3
PDZK
3 -6
.15
p <
0.00
0001
10
0 0.
82
1.79
8 18
Pr
otei
n ki
nase
C, i
ota
PRK
CI
-6.1
5 p
< 0.
0000
01
100
0.96
9 2.
175
19
Gro
wth
arr
est-s
peci
fic 6
G
AS6
-6.1
4 p
< 0.
0000
01
100
0.71
5 1.
895
20
Arg
inin
osuc
cina
te sy
nthe
tase
AS
S -6
.11
p <
0.00
0001
10
0 0.
76
2.30
6 21
Pa
tern
ally
exp
ress
ed 3
PE
G3
-6.0
2 p
< 0.
0000
01
100
0.56
7 4.
481
22
Inhi
bito
r of D
NA
bin
ding
4, d
omin
ant n
egat
ive
helix
-looP
-hel
ix p
rote
in
ID4
-5.9
4 p
< 0.
0000
01
100
0.77
5 2.
661
23
Rep
rodu
ctio
n 8
D8S
2298
E -5
.93
p <
0.00
0001
10
0 1.
048
2.28
8 24
R
NA
-bin
ding
regi
on (R
NP1
, RR
M) c
onta
inin
g 1
RNPC
1 -5
.85
p <
0.00
0001
10
0 0.
771
2.00
1 25
B
one
mor
phog
enet
ic p
rote
in 6
BM
P6
-5.7
5 p
< 0.
0000
01
100
1.04
6 3.
169
26
GR
B2-
asso
ciat
ed b
indi
ng p
rote
in 2
G
AB2
-5.7
1 p
< 0.
0000
01
100
0.85
5 1.
751
27
Myo
-inos
itol 1
-pho
spha
te sy
ntha
se A
1 IS
YNA1
-5
.63
p <
0.00
0001
10
0 0.
777
2.34
6 28
Pa
rane
opla
stic
ant
igen
MA
1 PN
MA1
-5
.58
p <
0.00
0001
10
0 0.
771
1.41
5 29
H
omeo
box
D4
HO
XD4
-5.5
8 p
< 0.
0000
01
100
0.93
7 2.
695
359
Ran
k
Uni
Gen
e N
ame
Uni
gene
Sym
bol
t-va
lue
Para
met
ric
p-va
lue
% C
V
supp
ort
Mea
n of
rat
ios
in c
lass
1:
othe
r
Mea
n of
rat
ios i
n cl
ass 2
: Ova
rian
30
Ret
icul
ocal
bin
2, E
F-ha
nd c
alci
um b
indi
ng d
omai
n RC
N2
-5.5
6 p
< 0.
0000
01
100
1.01
4 1.
966
31
Hom
eo b
ox D
8 H
OXD
8 -5
.56
p <
0.00
0001
10
0 0.
888
2.57
6 32
A
nti-M
ulle
rian
horm
one
rece
ptor
, typ
e II
AM
HR2
-5
.51
p <
0.00
0001
10
0 0.
925
2.66
33
EG
F-co
ntai
ning
fibu
lin-li
ke e
xtra
cellu
lar m
atrix
pro
tein
2
EFEM
P2
-5.4
9 p
< 0.
0000
01
100
0.83
9 1.
615
35
Ets v
aria
nt g
ene
1 ET
V1
-5.3
8 p
< 0.
0000
01
100
0.87
4 1.
848
37
Mei
s1, m
yelo
id e
cotro
pic
vira
l int
egra
tion
site
1 h
omol
og (m
ouse
) M
EIS1
-5
.34
p <
0.00
0001
10
0 0.
948
2.59
7 38
G
uano
sine
mon
opho
spha
te re
duct
ase
GM
PR
-5.3
3 p
< 0.
0000
01
100
0.83
2 2.
23
39
Hyp
othe
tical
pro
tein
MG
C20
235
MG
C20
235
-5.3
1 p
< 0.
0000
01
100
0.87
1 1.
885
40
Kin
esin
fam
ily m
embe
r 5C
K
IF5C
-5
.3
p <
0.00
0001
10
0 0.
89
2.65
4 41
G
amm
a-am
inob
utyr
ic a
cid
(GA
BA
) A re
cept
or, a
lpha
1
GAB
RA1
-5.2
8 p
< 0.
0000
01
100
0.81
4 2.
605
42
Paire
d bo
x ge
ne 8
PA
X8
-5.2
7 p
< 0.
0000
01
100
1.01
8 5.
205
43
Neu
ral c
ell a
dhes
ion
mol
ecul
e 1
NC
AM1
-5.2
5 p
< 0.
0000
01
100
0.77
5 2
44
Mat
rilin
2
MAT
N2
-5.2
4 p
< 0.
0000
01
100
0.83
2 2.
215
45
Ephr
in-B
3 EF
NB3
-5
.22
p <
0.00
0001
10
0 0.
889
1.66
4 46
C
ell a
dhes
ion
mol
ecul
e w
ith h
omol
ogy
to L
1CA
M (c
lose
hom
olog
of L
1)
CH
L1
-5.2
1 p
< 0.
0000
01
100
0.77
7 3.
33
48
Hyp
othe
tical
pro
tein
FLJ
1244
2 FL
J124
42
-5.2
p
< 0.
0000
01
100
0.81
8 1.
556
51
Ret
inol
bin
ding
pro
tein
1, c
ellu
lar
RBP1
-5
.16
p <
0.00
0001
10
0 0.
728
2.79
8 52
Sa
rcos
pan
(Kra
s onc
ogen
e-as
soci
ated
gen
e)
SSPN
-5
.12
1.00
E-06
10
0 0.
844
1.81
4 53
K
IAA
0020
K
IAA0
020
-5
2.00
E-06
10
0 0.
945
5.49
7 54
Sy
ndec
an 3
(N-s
ynde
can)
SD
C3
-5
2.00
E-06
10
0 0.
742
1.54
8 56
K
IAA
1240
pro
tein
K
IAA1
240
-4.9
4 2.
00E-
06
100
0.92
4 1.
943
57
RA
B11
fam
ily in
tera
ctin
g pr
otei
n 5
(cla
ss I)
RA
B11F
IP5
-4.8
9 3.
00E-
06
100
0.71
8 1.
717
58
Zinc
fing
er p
rote
in 2
58
ZNF2
58
-4.8
6 3.
00E-
06
100
0.89
2.
37
59
Pros
tate
tum
or o
vere
xpre
ssed
gen
e 1
PTO
V1
-4.8
6 3.
00E-
06
100
0.87
6 2.
239
60
Yip
1 in
tera
ctin
g fa
ctor
hom
olog
(S. c
erev
isia
e)
YIF1
-4
.84
4.00
E-06
10
0 0.
85
1.45
62
U
biqu
itin-
conj
ugat
ing
enzy
me
E2E
2 (U
BC
4/5
hom
olog
, yea
st)
UBE
2E2
-4.8
1 4.
00E-
06
100
0.86
2 1.
615
63
Kal
likre
in 8
(neu
rops
in/o
vasi
n)
KLK
8 -4
.79
7.00
E-06
10
0 0.
978
3.89
5 64
Fu
ll le
ngth
inse
rt c
DN
A cl
one
ZD63
C05
-4
.72
7.00
E-06
10
0 0.
828
2.62
4 65
N
erve
gro
wth
fact
or re
cept
or (T
NFR
SF16
) ass
ocia
ted
prot
ein
1 N
GFR
AP1
-4.7
2 6.
00E-
06
100
0.73
8 1.
553
67
Ner
ve g
row
th fa
ctor
rece
ptor
(TN
FRSF
16) a
ssoc
iate
d pr
otei
n 1
NG
FRAP
1 -4
.72
6.00
E-06
10
0 0.
802
1.58
4 68
B
one
mar
row
stro
mal
cel
l ant
igen
2
BST2
-4
.71
6.00
E-06
10
0 0.
716
1.61
2
36
0
Ran
k
Uni
Gen
e N
ame
Uni
gene
Sym
bol
t-va
lue
Para
met
ric
p-va
lue
% C
V
supp
ort
Mea
n of
rat
ios
in c
lass
1:
othe
r
Mea
n of
rat
ios i
n cl
ass 2
: Ova
rian
69 | PTPRF interacting protein, binding protein 1 (liprin beta 1) | PPFIBP1 | -4.71 | 8.00E-06 | 100 | 0.919 | 1.795
70 | Transducin-like enhancer of split 4 (E(sp1) homolog, Drosophila) | TLE4 | -4.71 | 6.00E-06 | 100 | 0.861 | 1.731
71 | Astrotactin | ASTN | -4.64 | 1.10E-05 | 100 | 0.885 | 1.836
72 | Melanoma associated gene | D2S448 | -4.62 | 9.00E-06 | 100 | 0.759 | 1.573
74 | Cadherin 2, type 1, N-cadherin (neuronal) | CDH2 | -4.57 | 1.10E-05 | 100 | 0.913 | 2.662
75 | Hypothetical protein MGC22014 | MGC22014 | -4.57 | 1.30E-05 | 100 | 0.882 | 1.605
76 | GATA binding protein 4 | GATA4 | -4.55 | 1.30E-05 | 100 | 1.163 | 4.162
77 | ADP-ribosylation factor 4-like | ARF4L | -4.52 | 1.40E-05 | 100 | 0.872 | 1.985
78 | Delta-like 1 homolog (Drosophila) | DLK1 | -4.47 | 1.60E-05 | 100 | 0.868 | 5.331
79 | Lectin, galactoside-binding, soluble, 3 binding protein | LGALS3BP | -4.45 | 1.70E-05 | 100 | 0.857 | 1.553
80 | Phospholipid transfer protein | PLTP | -4.44 | 1.80E-05 | 100 | 0.899 | 1.723
81 | Transcribed locus | | -4.44 | 1.90E-05 | 100 | 0.894 | 1.808
82 | Doublecortin domain containing 2 | DCDC2 | -4.43 | 2.10E-05 | 100 | 0.837 | 2.783
83 | BTG family, member 3 | BTG3 | -4.41 | 2.20E-05 | 100 | 1.029 | 1.877
84 | Integrin, alpha 9 | ITGA9 | -4.4 | 2.70E-05 | 100 | 1.008 | 1.95
85 | Cyclin E1 | CCNE1 | -4.37 | 2.50E-05 | 100 | 1.008 | 1.944
86 | Discs, large (Drosophila) homolog-associated protein 3 | DLGAP3 | -4.36 | 2.60E-05 | 100 | 0.768 | 1.525
87 | IGF-II mRNA-binding protein 2 | IMP-2 | -4.3 | 3.30E-05 | 100 | 0.704 | 1.738
88 | V-myc myelocytomatosis viral related oncogene, neuroblastoma derived (avian) | MYCN | -4.28 | 3.60E-05 | 100 | 0.887 | 2.301
89 | Paternally expressed 10 | PEG10 | -4.28 | 3.60E-05 | 100 | 0.738 | 3.072
90 | ST6 (alpha-N-acetyl-neuraminyl-2,3-beta-galactosyl-1,3)-N-acetylgalactosaminide alpha-2,6-sialyltransferase 5 | SIAT7E | -4.23 | 5.80E-05 | 100 | 1.087 | 2.616
91 | Plexin D1 | PLXND1 | -4.2 | 4.70E-05 | 100 | 0.85 | 1.511
92 | Phosphofructokinase, muscle | PFKM | -4.2 | 4.80E-05 | 100 | 0.9 | 1.491
93 | Bone morphogenetic protein 7 (osteogenic protein 1) | BMP7 | -4.19 | 5.00E-05 | 100 | 0.946 | 2.848
94 | Cadherin 6, type 2, K-cadherin (fetal kidney) | CDH6 | -4.19 | 5.20E-05 | 100 | 0.908 | 2.841
95 | Brain expressed X-linked 2 | BEX2 | -4.19 | 5.10E-05 | 100 | 0.9 | 1.901
96 | Laminin, beta 2 (laminin S) | LAMB2 | -4.16 | 5.50E-05 | 100 | 0.829 | 1.345
97 | Sal-like 2 (Drosophila) | SALL2 | -4.15 | 6.00E-05 | 100 | 0.842 | 1.584
98 | Kallikrein 7 (chymotryptic, stratum corneum) | KLK7 | -4.1 | 9.10E-05 | 100 | 1.013 | 4.22
99 | Latrophilin 2 | LPHN2 | -4.09 | 7.50E-05 | 100 | 0.84 | 1.718
100 | Death-associated protein kinase 1 | DAPK1 | -4.07 | 7.90E-05 | 100 | 0.841 | 1.618
101 | Calmodulin-like 3 | CALML3 | -4.05 | 0.000102 | 100 | 1.089 | 3.555
102 | Norrie disease (pseudoglioma) | NDP | -4.03 | 0.000104 | 99 | 0.908 | 1.904
104 | ATPase, Ca++ transporting, plasma membrane 1 | ATP2B1 | -3.97 | 0.000118 | 100 | 0.854 | 1.993
105 | Plexin B1 | PLXNB1 | -3.96 | 0.000126 | 100 | 0.936 | 1.615
107 | Hypothetical protein MGC35048 | MGC35048 | -3.95 | 0.000145 | 100 | 0.787 | 1.634
108 | Carboxypeptidase Z | CPZ | -3.94 | 0.000139 | 100 | 0.786 | 1.719
109 | Trophinin | TRO | -3.94 | 0.000143 | 99 | 0.815 | 1.631
110 | Sarcoglycan, beta (43kDa dystrophin-associated glycoprotein) | SGCB | -3.92 | 0.000141 | 100 | 0.804 | 1.9
112 | WAS protein family, member 1 | WASF1 | -3.88 | 0.000167 | 100 | 0.916 | 1.548
113 | SAC3 domain containing 1 | SHD1 | -3.88 | 0.000165 | 100 | 0.885 | 1.708
114 | Leucine zipper, down-regulated in cancer 1 | LDOC1 | -3.86 | 0.000173 | 100 | 0.729 | 1.517
115 | Protein tyrosine phosphatase, receptor type, U | PTPRU | -3.86 | 0.000176 | 100 | 0.909 | 1.698
116 | Prostaglandin I2 (prostacyclin) synthase | PTGIS | -3.84 | 0.000191 | 99 | 0.885 | 1.513
117 | Fibronectin leucine rich transmembrane protein 2 | FLRT2 | -3.83 | 0.000197 | 97 | 0.932 | 2.003
118 | Sarcoglycan, epsilon | SGCE | -3.83 | 0.000196 | 99 | 0.763 | 1.632
119 | IMP (inosine monophosphate) dehydrogenase 2 | IMPDH2 | -3.83 | 0.000197 | 100 | 0.888 | 1.448
120 | Transcribed locus, moderately similar to XP_534476.1 similar to Adhesion regulating molecule 1 precursor (110 kDa cell membrane glycoprotein) (Gp110) [Canis familiaris] | | -3.82 | 0.000224 | 98 | 1.025 | 1.776
121 | Tumor necrosis factor, alpha-induced protein 2 | TNFAIP2 | -3.81 | 0.000214 | 99 | 0.962 | 1.787
122 | Kallikrein 5 | KLK5 | -3.81 | 0.00022 | 100 | 0.931 | 4.19
123 | Hypothetical LOC401022 | LOC401022 | -3.81 | 0.000219 | 99 | 1.014 | 1.982
124 | Syntaxin 6 | STX6 | -3.81 | 0.000244 | 99 | 0.963 | 2.115
125 | Secretory leukocyte protease inhibitor (antileukoproteinase) | SLPI | -3.79 | 0.000228 | 100 | 0.751 | 3.159
126 | Suppression of tumorigenicity 5 | ST5 | -3.78 | 0.000236 | 99 | 0.869 | 1.333
127 | Discoidin, CUB and LCCL domain containing 2 | DCBLD2 | -3.77 | 0.00024 | 100 | 0.81 | 1.62
128 | Pleiomorphic adenoma gene-like 2 | PLAGL2 | -3.77 | 0.000253 | 99 | 1.069 | 1.781
129 | WAS protein family, member 1 | WASF1 | -3.76 | 0.000266 | 99 | 0.819 | 1.435
130 | Cyclin-dependent kinase inhibitor 2C (p18, inhibits CDK4) | CDKN2C | -3.75 | 0.000275 | 99 | 0.927 | 1.531
132 | Thyroid hormone receptor interactor 6 | TRIP6 | -3.73 | 0.00029 | 98 | 0.747 | 1.275
133 | Maternally expressed 3 | MEG3 | -3.73 | 0.000297 | 98 | 0.979 | 2.153
134 | N-acetylated alpha-linked acidic dipeptidase 2 | NAALAD2 | -3.71 | 0.000305 | 97 | 0.874 | 1.449
135 | Cullin 7 | CUL7 | -3.71 | 0.000303 | 96 | 0.878 | 1.372
136 | Cyclin G1 | CCNG1 | -3.7 | 0.000316 | 97 | 0.907 | 1.43
137 | Prepronociceptin | PNOC | -3.7 | 0.000338 | 97 | 0.994 | 1.555
138 | Folate receptor 1 (adult) | FOLR1 | -3.66 | 0.000414 | 97 | 1.113 | 3.385
139 | TBP-interacting protein | TIP120B | -3.66 | 0.000413 | 97 | 0.942 | 1.678
140 | Similar to RIKEN cDNA 2310016C16 | LOC493869 | -3.64 | 0.00039 | 98 | 0.879 | 1.435
141 | WW domain binding protein 5 | WBP5 | -3.63 | 4.00E-04 | 97 | 0.86 | 1.297
142 | Ceruloplasmin (ferroxidase) | CP | -3.63 | 0.000407 | 96 | 0.992 | 2.661
144 | Actin binding LIM protein 1 | ABLIM1 | -3.62 | 0.000414 | 95 | 0.779 | 1.514
145 | SP110 nuclear body protein | SP110 | -3.62 | 0.000462 | 96 | 0.917 | 1.484
146 | Solute carrier family 6 (neurotransmitter transporter, creatine), member 8 | SLC6A8 | -3.6 | 0.000437 | 96 | 0.746 | 1.871
147 | F-box and leucine-rich repeat protein 7 | FBXL7 | -3.6 | 0.000443 | 96 | 0.897 | 1.639
148 | TU3A protein | TU3A | -3.6 | 0.000472 | 96 | 0.905 | 1.504
149 | Protein phosphatase 2 (formerly 2A), catalytic subunit, beta isoform | PPP2CB | -3.58 | 0.000473 | 95 | 0.959 | 1.365
150 | Golgin-67 | GOLGIN-67 | -3.58 | 0.000477 | 96 | 0.893 | 1.464
151 | Tetratricopeptide repeat domain 7A | TTC7A | -3.58 | 0.000482 | 96 | 0.899 | 1.43
152 | Estrogen receptor 1 | ESR1 | -3.58 | 0.00048 | 98 | 1.204 | 4.977
153 | Haptoglobin | HP | -3.57 | 0.000523 | 94 | 1.235 | 3.391
154 | Cyclin-dependent kinase inhibitor 1C (p57, Kip2) | CDKN1C | -3.57 | 0.000498 | 96 | 0.89 | 1.525
155 | Neuronal pentraxin II | NPTX2 | -3.57 | 0.000506 | 96 | 0.834 | 1.761
156 | DKFZP566O084 protein | DKFZp566O084 | -3.56 | 0.000512 | 98 | 0.738 | 1.656
157 | Lysosomal associated protein transmembrane 4 beta | LAPTM4B | -3.55 | 0.000525 | 97 | 0.922 | 1.7
158 | CDNA FLJ36931 fis, clone BRACE2005290 | | -3.55 | 6.00E-04 | 96 | 0.804 | 1.663
159 | Pre-B-cell leukemia transcription factor 1 | PBX1 | -3.5 | 0.000623 | 96 | 0.983 | 1.629
160 | Glycine dehydrogenase (decarboxylating; glycine decarboxylase, glycine cleavage system protein P) | GLDC | -3.48 | 0.000689 | 93 | 0.934 | 3.622
161 | Carbohydrate sulfotransferase 10 | CHST10 | -3.47 | 0.00073 | 93 | 0.9 | 1.583
162 | Amyloid beta (A4) precursor-like protein 1 | APLP1 | -3.45 | 0.000802 | 96 | 0.846 | 1.626
163 | Nidogen 2 (osteonidogen) | NID2 | -3.43 | 0.000799 | 94 | 0.911 | 1.647
164 | KIAA1238 protein | KIAA1238 | -3.43 | 0.000798 | 91 | 0.839 | 1.463
165 | Carbonic anhydrase XIV | CA14 | -3.42 | 0.000841 | 94 | 0.956 | 1.428
166 | CUG triplet repeat, RNA binding protein 2 | CUGBP2 | -3.41 | 0.000847 | 92 | 0.931 | 1.725
167 | TTK protein kinase | TTK | -3.41 | 0.000863 | 92 | 1.044 | 1.761
168 | Transmembrane protein with EGF-like and two follistatin-like domains 1 | TMEFF1 | -3.38 | 0.000953 | 49 | 0.841 | 1.544
170 | Forkhead box F2 | FOXF2 | 3.38 | 0.000978 | 44 | 1.205 | 0.59
172 | Ubiquitin fusion degradation 1-like | UFD1L | 3.39 | 0.000962 | 55 | 1.401 | 0.657
173 | Gardner-Rasheed feline sarcoma viral (v-fgr) oncogene homolog | FGR | 3.39 | 0.000937 | 67 | 1.268 | 0.656
174 | Interferon stimulated gene 20kDa | ISG20 | 3.4 | 0.000886 | 80 | 1.156 | 0.667
175 | Ectonucleoside triphosphate diphosphohydrolase 1 | ENTPD1 | 3.4 | 0.000925 | 63 | 1.237 | 0.712
176 | IQ motif containing GTPase activating protein 2 | IQGAP2 | 3.41 | 0.000857 | 91 | 1.055 | 0.58
177 | Glutaredoxin (thioltransferase) | GLRX | 3.41 | 0.000846 | 90 | 1.273 | 0.722
180 | Paired-like homeodomain transcription factor 2 | PITX2 | 3.43 | 0.000834 | 93 | 1.514 | 0.615
181 | Mannosidase, alpha, class 2A, member 1 | MAN2A1 | 3.44 | 0.000798 | 90 | 1.426 | 0.366
182 | Hypothetical protein FLJ38564 | FLJ38564 | 3.44 | 0.000813 | 93 | 1.207 | 0.628
183 | Integrin, alpha 4 (antigen CD49D, alpha 4 subunit of VLA-4 receptor) | ITGA4 | 3.45 | 0.000753 | 93 | 1.207 | 0.737
184 | KIAA0408 | KIAA0408 | 3.47 | 0.000727 | 97 | 1.242 | 0.572
185 | Platelet/endothelial cell adhesion molecule (CD31 antigen) | PECAM1 | 3.55 | 0.000531 | 95 | 1.176 | 0.665
186 | KIAA1012 | KIAA1012 | 3.59 | 0.00051 | 96 | 1.444 | 0.658
187 | DnaJ (Hsp40) homolog, subfamily D, member 1 | DNAJD1 | 3.6 | 0.00045 | 97 | 1.157 | 0.695
188 | Trophoblast-derived noncoding RNA | TncRNA | 3.62 | 0.000407 | 96 | 0.981 | 0.584
189 | Smcy homolog, Y-linked (mouse) | SMCY | 3.63 | 0.000408 | 99 | 2.551 | 0.436
191 | Ubiquitously transcribed tetratricopeptide repeat gene, Y-linked | UTY | 3.67 | 0.000424 | 99 | 1.343 | 0.531
192 | Chromosome 6 open reading frame 4 | C6orf4 | 3.73 | 0.000288 | 99 | 1.199 | 0.553
193 | LYRIC/3D3 | LYRIC | 3.76 | 0.000287 | 99 | 1.143 | 0.538
194 | BENE protein | BENE | 3.76 | 0.000252 | 100 | 1.229 | 0.555
195 | Potassium voltage-gated channel, delayed-rectifier, subfamily S, member 3 | KCNS3 | 3.78 | 0.000237 | 99 | 1.313 | 0.607
196 | TRNA (5-methylaminomethyl-2-thiouridylate)-methyltransferase 1 | TRMT1 | 3.78 | 0.00023 | 100 | 1.278 | 0.781
197 | Calcium/calmodulin-dependent protein kinase II | CaMKIINalpha | 3.8 | 0.000217 | 100 | 1.137 | 0.49
198 | A disintegrin-like and metalloprotease (reprolysin type) with thrombospondin type 1 motif, 4 | ADAMTS4 | 3.81 | 0.00022 | 99 | 1.233 | 0.7
200 | Kinase insert domain receptor (a type III receptor tyrosine kinase) | KDR | 3.86 | 0.000175 | 100 | 1.149 | 0.568
201 | RAN-binding protein 2-like 1 short isoform | LOC400966 | 3.87 | 0.00018 | 99 | 1.098 | 0.626
202 | Potassium inwardly-rectifying channel, subfamily J, member 15 | KCNJ15 | 3.94 | 0.000132 | 100 | 1.348 | 0.639
203 | Neuritin 1 | NRN1 | 3.97 | 0.000117 | 100 | 1.308 | 0.576
204 | Integrin, alpha 6 | ITGA6 | 3.99 | 0.00011 | 100 | 1.12 | 0.509
206 | Vav 3 oncogene | VAV3 | 3.99 | 0.000107 | 100 | 1.654 | 0.538
207 | V-ets erythroblastosis virus E26 oncogene homolog 2 (avian) | ETS2 | 3.99 | 0.000113 | 100 | 1.414 | 0.673
208 | Ectonucleotide pyrophosphatase/phosphodiesterase 3 | ENPP3 | 3.99 | 0.000114 | 99 | 2.104 | 0.327
209 | Synaptotagmin VII | SYT7 | 4 | 0.000109 | 99 | 1.352 | 0.442
211 | KIAA1539 | KIAA1539 | 4.1 | 7.20E-05 | 100 | 1.159 | 0.704
212 | Protease, serine, 23 | PRSS23 | 4.18 | 5.30E-05 | 100 | 1.321 | 0.615
213 | Calcium/calmodulin-dependent protein kinase II | CaMKIINalpha | 4.19 | 5.00E-05 | 100 | 1.06 | 0.474
214 | T cell receptor alpha locus | TRA@ | 4.26 | 3.80E-05 | 100 | 1.306 | 0.629
215 | Periostin, osteoblast specific factor | POSTN | 4.3 | 3.30E-05 | 100 | 1.168 | 0.309
216 | Chromosome 14 open reading frame 147 | C14orf147 | 4.34 | 2.80E-05 | 100 | 1.193 | 0.617
217 | KIAA0753 gene product | KIAA0753 | 4.37 | 2.40E-05 | 100 | 1.076 | 0.393
218 | Hypothetical protein LOC152485 | LOC152485 | 4.4 | 2.30E-05 | 100 | 1.117 | 0.656
219 | DEAD (Asp-Glu-Ala-Asp) box polypeptide 3, Y-linked | DDX3Y | 4.41 | 2.20E-05 | 99 | 1.35 | 0.416
221 | BRCA1 associated protein-1 (ubiquitin carboxy-terminal hydrolase) | BAP1 | 4.54 | 1.20E-05 | 100 | 1.268 | 0.297
222 | Carcinoembryonic antigen-related cell adhesion molecule 1 | CEACAM1 | 4.58 | 1.00E-05 | 100 | 1.668 | 0.493
223 | Eukaryotic translation initiation factor 1A, Y-linked | EIF1AY | 4.6 | 1.00E-05 | 100 | 1.821 | 0.59
224 | Chromosome Y open reading frame 15B | CYorf15B | 4.61 | 9.00E-06 | 100 | 3.544 | 0.383
225 | Stomatin | STOM | 4.64 | 8.00E-06 | 100 | 1.256 | 0.73
226 | Bone morphogenetic protein 5 | BMP5 | 4.65 | 9.00E-06 | 100 | 1.401 | 0.608
227 | Angiotensinogen (serine (or cysteine) proteinase inhibitor, clade A (alpha-1 antiproteinase, antitrypsin), member 8) | AGT | 4.75 | 5.00E-06 | 100 | 1.909 | 0.451
228 | CD44 antigen (homing function and Indian blood group system) | CD44 | 4.81 | 4.00E-06 | 100 | 1.289 | 0.565
229 | Integrin, alpha 2 (CD49B, alpha 2 subunit of VLA-2 receptor) | ITGA2 | 4.83 | 4.00E-06 | 100 | 1.118 | 0.434
230 | Myosin, light polypeptide kinase | MYLK | 4.88 | 3.00E-06 | 100 | 1.184 | 0.519
231 | Tenascin C (hexabrachion) | TNC | 4.89 | 3.00E-06 | 100 | 1.306 | 0.414
Appendix L: KEGG pathways significantly represented in gene expression signature of serous LMP and invasive EOC
Cell cycle KEGG pathway. P-value for overlap with LMP/invasive EOC differentially expressed gene list: 2.81 x 10^-5. Red circles: genes over-expressed in invasive tumours.
Complement and coagulation KEGG pathway. P-value for overlap with LMP/invasive EOC differentially expressed gene list: 3.88 x 10^-3. Red circles: genes over-expressed in invasive tumours. Green circles: genes under-expressed in invasive tumours.
Cytokine-cytokine receptor interaction pathway. P-value for overlap with LMP/invasive EOC differentially expressed gene list: 5.28 x 10^-5. Red circles: genes over-expressed in invasive tumours. Green circles: genes under-expressed in invasive tumours.
Appendix M: Microsoft Access gene ontology filter SQL query applied to total list of differentially expressed genes to exclude cell-cycle
regulating and immune-response genes from the LMP/invasive EOC expression signature
SELECT s.ID, s.Acc, s.Name, s.Symbol, s.SumFunc, s.GOabr
FROM Invasive_LMP_SAM_source2 AS s
WHERE (s.SumFunc Like "*cancer*"
    OR s.SumFunc Like "*adhesion*"
    OR s.SumFunc Like "*tumour*"
    OR s.SumFunc Like "*epith*"
    OR s.SumFunc Like "*apoptosis*"
    OR s.SumFunc Like "*invasion*"
    OR s.SumFunc Like "*metas*"
    OR s.SumFunc Like "*ovarian*"
    OR s.SumFunc Like "*growth*")
  AND s.GOabr Not Like "*cell cycle*"
  AND s.GOabr Not Like "*immune*";
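The logic of the query above can be illustrated outside Access with a minimal Python sketch. It is not part of the original thesis workflow: the function name `keep_gene` and the two tuples are hypothetical, and the sketch assumes the default case-insensitive behaviour of the Access `Like` operator. A gene is kept when its functional summary (`SumFunc`) contains at least one cancer-related keyword and its gene ontology abbreviation (`GOabr`) mentions neither "cell cycle" nor "immune".

```python
# Keywords copied from the Appendix M query; the names below are illustrative only.
INCLUDE_KEYWORDS = ("cancer", "adhesion", "tumour", "epith", "apoptosis",
                    "invasion", "metas", "ovarian", "growth")
EXCLUDE_KEYWORDS = ("cell cycle", "immune")


def keep_gene(sum_func, go_abr):
    """Return True if a gene passes the Appendix M style filter."""
    # A missing field is treated as a failed match, mirroring how a NULL
    # comparison drops the row in the SQL WHERE clause.
    if sum_func is None or go_abr is None:
        return False
    summary = sum_func.lower()
    ontology = go_abr.lower()
    has_keyword = any(k in summary for k in INCLUDE_KEYWORDS)
    is_excluded = any(k in ontology for k in EXCLUDE_KEYWORDS)
    return has_keyword and not is_excluded
```

For example, a gene annotated as a cell adhesion molecule with a signal transduction ontology term is kept, whereas a tumour suppressor annotated with a cell cycle term is dropped.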
Appendix N: Visual basic script for batch export of IHC image histogram statistics

Dim appRef, startRulerUnits, startTypeUnits, startDisplayDialogs, docRef
Dim totalCount, channelIndex, activeChannels, myChannels, secondaryIndex
Dim largestCount, histogramIndex, pixelsPerX, outputX, a, visibleChannelCount
Dim fsoRef, fileRef
Dim folderRef, fileCollection
Dim ImageCount, ImageCountTotal
Dim i, newFolderName
Dim aChannelArray(), aChannelIndex, fileOut, hist
Dim histogramCount, convertedFolderRef

Set appRef = CreateObject("Photoshop.Application")

' Save the current preferences
startRulerUnits = appRef.Preferences.RulerUnits
startTypeUnits = appRef.Preferences.TypeUnits
startDisplayDialogs = appRef.DisplayDialogs

' Set Photoshop CS2 to use pixels and display no dialogs
appRef.Preferences.RulerUnits = 1 'for PsUnits --> 1 (psPixels)
appRef.Preferences.TypeUnits = 1 'for PsTypeUnits --> 1 (psPixels)
appRef.DisplayDialogs = 3 'for PsDialogModes --> 3 (psDisplayNoDialogs)

i = 0
Set fsoRef = CreateObject( "Scripting.FileSystemObject" )
Set folderRef = fsoRef.GetFolder( "SPECIFY FULL DIRECTORY OF IMAGES HERE" )
Set fileCollection = folderRef.Files
newFolderName = folderRef & "\Histogram_reports"
Set convertedFolderRef = fsoRef.CreateFolder( newFolderName )
Set fileOut = fsoRef.CreateTextFile(newFolderName & "\" & "compiled_histogram_report.txt")

For Each fileRef In fileCollection
    On Error Resume Next
    Set docRef = appRef.Open( fileRef.Path )

    ' find out how many pixels I have
    totalCount = docRef.Width * docRef.Height

    ' more info to the out file
    'fileOut.WriteLine " with a total pixel count of " & totalCount

    ' remember which channels are currently active
    activeChannels = appRef.ActiveDocument.ActiveChannels

    ' document histogram only works in these modes
    If docRef.Mode = 2 Or docRef.Mode = 3 Or docRef.Mode = 6 Then
        'enumerated values = PsDocumentMode --> 2 (psRGB), 3 (psCMYK), 6 (psIndexedColor)
        ' activate the main channels so we can get the document’s histogram
        ' using the TurnOnDocumentHistogramChannels function
        Call TurnOnDocumentHistogramChannels(docRef)
        ' Output the documents histogram
        Call OutputHistogram(docRef.Histogram, "Luminosity", fileOut)
    End If

    ' local reference to work from
    Set myChannels = docRef.Channels

    ' loop through each channel and output the histogram
    For channelIndex = 1 To myChannels.Count
        ' the channel has to be visible to get a histogram
        myChannels(channelIndex).Visible = True
        ' turn off all the other channels
        For secondaryIndex = 1 To myChannels.Count
            If Not channelIndex = secondaryIndex Then
                myChannels(secondaryIndex).Visible = False
            End If
        Next
        ' Use the function to dump the histogram
        Call OutputHistogram(myChannels(channelIndex).Histogram, myChannels(channelIndex).Name, fileOut)
    Next

    ' close down the output file
    'fileOut.Close

    ' reset the active channels
    docRef.ActiveChannels = activeChannels

    ' Reset the application preferences
    appRef.Preferences.RulerUnits = startRulerUnits
    appRef.Preferences.TypeUnits = startTypeUnits
    appRef.DisplayDialogs = startDisplayDialogs

    appRef.ActiveDocument.Close()
    i = i + 1
Next

fileOut.Close
MsgBox i & " files processed by Ryans Histogram analysis tool!"

' Utility function that takes a histogram and name
' and dumps to the output file
Private Function OutputHistogram (inHistogram, inHistogramName, inOutFile)
    ' find out which count has the largest number
    ' I scale everything to this number for the output
    largestCount = 0
    ' a simple indexer I can reuse
    histogramIndex = 0
    ' reset the running pixel total so it does not accumulate across calls
    histogramCount = 0

    ' search through all and find the largest single item
    For Each hist In inHistogram
        histogramCount = histogramCount + CLng(hist)
        ' keep the larger value (the original compared with <>, which kept the last value instead)
        If CLng(hist) > largestCount Then
            largestCount = CLng(hist)
        End If
    Next

    'These should match
    If Not histogramCount = totalCount Then
        MsgBox "Something bad is happening!"
    End If

    'inOutFile.WriteLine
    'see how much each "X" is going to count as
    pixelsPerX = largestCount / 100

    'output this data to the file
    'output the name of this histogram
    'inOutFile.WriteLine inHistogramName
    inOutFile.WriteLine docRef.Name & " " & inHistogramName & " Mean Pixels: " & AverageHistogram(inHistogram) & " Std. Dev. Pixels: " & StandardDeviationHistogram(inHistogram) & " Median Pixels: " & MedianHistogram(inHistogram,histogramCount)
    'inOutFile.WriteLine docRef.Name & " Std. Dev. Pixels: " & StandardDeviationHistogram(inHistogram)
    'inOutFile.WriteLine docRef.Name & " Median Pixels: " & MedianHistogram(inHistogram,histogramCount)

    ' loop through all the items and output in the following format
    ' 001
    ' 002
    ' For histogramIndex = 0 To (inHistogram.Count - 1)
End Function

' Function to active all the channels according to the document’s mode
' Takes a document reference for input
Private Function TurnOnDocumentHistogramChannels (inDocument)
    ' see how many channels we need to activate
    visibleChannelCount = 0

    'based on the mode of the document
    Select Case inDocument.Mode
        Case 1
            visibleChannelCount = 1
        Case 5
            visibleChannelCount = 1
        Case 6
            visibleChannelCount = 1
        Case 8
            visibleChannelCount = 2
        Case 2
            visibleChannelCount = 3
        Case 4
            visibleChannelCount = 3
        Case 3
            visibleChannelCount = 4
        Case 8
            ' never reached: mode 8 is already handled by the earlier Case 8
            visibleChannelCount = 4
        Case 7
            visibleChannelCount = (inDocument.Channels.Count + 1)
        Case Else
            visibleChannelCount = (inDocument.Channels.Count + 1)
    End Select

    ' now get the channels to activate into a local array
    ReDim aChannelArray(visibleChannelCount)

    ' index for the active channels array
    aChannelIndex = 1
    For channelIndex = 1 To inDocument.Channels.Count
        If channelIndex <= visibleChannelCount Then
            Set aChannelArray(aChannelIndex) = inDocument.Channels(channelIndex)
            aChannelIndex = aChannelIndex + 1
        End If
    Next
End Function

Private Function StandardDeviationHistogram(inputArray)
    Dim numPixels, sum1, sum2, x, gray
    numPixels = 0
    sum1 = 0.0
    sum2 = 0.0
    ' Compute totals for the various statistics
    For gray = 0 To 255
        x = inputArray(gray)
        numPixels = numPixels + x
        sum1 = sum1 + x * gray
        sum2 = sum2 + x * (gray * gray)
    Next
    StandardDeviationHistogram = Sqr((sum2 - (sum1 * sum1) / numPixels) / (numPixels - 1))
End Function

Private Function AverageHistogram(inputArray)
    Dim numPixels, sum1, sum2, x, gray
    numPixels = 0
    sum1 = 0.0
    sum2 = 0.0
    ' Compute totals for the various statistics
    For gray = 0 To 255
        x = inputArray(gray)
        numPixels = numPixels + x
        sum1 = sum1 + x * gray
        sum2 = sum2 + x * (gray * gray)
    Next
    AverageHistogram = sum1 / numPixels
End Function

Private Function MedianHistogram(inputArray, numPixels)
    Dim gray, total, mid
    gray = 0
    total = inputArray(0)
    mid = (numPixels + 1) / 2
    ' walk up the grey levels until the cumulative count reaches the midpoint
    Do While (total < mid)
        gray = gray + 1
        total = total + inputArray(gray)
    Loop
    MedianHistogram = gray
End Function
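The three statistics reported by the script (mean, standard deviation and median grey level) are all computed from bin counts rather than from individual pixels. The following Python sketch restates that arithmetic for a single histogram so it can be checked independently of Photoshop. It is an illustration only: the function name `histogram_stats` is hypothetical, the input is assumed to be a list of pixel counts indexed by grey level, and at least two pixels are assumed so that the sample standard deviation is defined.

```python
import math


def histogram_stats(hist):
    """Return (mean, standard deviation, median) of grey levels from bin counts."""
    n = sum(hist)                                          # total number of pixels
    sum1 = sum(count * gray for gray, count in enumerate(hist))
    sum2 = sum(count * gray * gray for gray, count in enumerate(hist))

    mean = sum1 / n
    # Sample standard deviation from the running sums, as in StandardDeviationHistogram
    std_dev = math.sqrt((sum2 - sum1 * sum1 / n) / (n - 1))

    # Median: first grey level at which the cumulative count reaches (n + 1) / 2,
    # matching the loop in MedianHistogram
    midpoint = (n + 1) / 2
    cumulative = 0
    median = 0
    for gray, count in enumerate(hist):
        cumulative += count
        if cumulative >= midpoint:
            median = gray
            break
    return mean, std_dev, median
```

As a small check, a histogram with two pixels at grey level 10 and two at grey level 20 has a mean of 15; because the pixel count is even, the cumulative rule selects the upper of the two middle values as the median.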
Appendix O: UniGene annotated genes included in thesis
List sorted alphabetically by UniGene Symbol (Build #184). Also present on the CD-ROM attached to this thesis in Microsoft Excel format.
UniGene Symbol | UniGene Cluster | UniGene Name
ABCC3 | Hs.463421 | ATP-binding cassette, sub-family C (CFTR/MRP), member 3
ABCG2 | Hs.480218 | ATP-binding cassette, sub-family G (WHITE), member 2
ABLIM1 | Hs.438236 | Actin binding LIM protein 1
ACADS | Hs.507076 | Acyl-Coenzyme A dehydrogenase, C-2 to C-3 short chain
ACP6 | Hs.528084 | Acid phosphatase 6, lysophosphatidic
ADA | Hs.407135 | Adenosine deaminase
ADAMTS4 | Hs.211604 | A disintegrin-like and metalloprotease (reprolysin type) with thrombospondin type 1 motif, 4
SMARCD3 | Hs.444445 | SWI/SNF related, matrix associated, actin dependent regulator of chromatin, subfamily d, member 3
SMCY | Hs.80358 | Smcy homolog, Y-linked (mouse)
SOC | Hs.145061 | Socius
SPINK1 | Hs.407856 | Serine protease inhibitor, Kazal type 1
SPON1 | Hs.445818 | Spondin 1, extracellular matrix protein
SPP1 | Hs.313 | Secreted phosphoprotein 1 (osteopontin, bone sialoprotein I, early T-lymphocyte activation 1)
SPTBN1 | Hs.503178 | Spectrin, beta, non-erythrocytic 1
SRY | Hs.1992 | Sex determining region Y
SSPN | Hs.183428 | Sarcospan (Kras oncogene-associated gene)
ST5 | Hs.117715 | Suppression of tumorigenicity 5
STAT3 | Hs.463059 | Signal transducer and activator of transcription 3 (acute-phase response factor)
STK6 | Hs.250822 | Serine/threonine kinase 6
STOM | Hs.253903 | Stomatin
STX6 | Hs.518417 | Syntaxin 6
SULF1 | Hs.409602 | Sulfatase 1
SYK | Hs.371720 | Spleen tyrosine kinase
SYT7 | Hs.502730 | Synaptotagmin VII
T1 | Hs.26814 | Tularik gene 1
TBP | Hs.1100 | TATA box binding protein
TCF7 | Hs.519580 | Transcription factor 7 (T-cell specific, HMG-box)
TCF8 | Hs.124503 | Transcription factor 8 (represses interleukin 2 expression)
TFF1 | Hs.162807 | Trefoil factor 1 (breast cancer, estrogen-inducible sequence expressed in)
TFF3 | Hs.82961 | Trefoil factor 3 (intestinal)
TGFA | Hs.170009 | Transforming growth factor, alpha
TIMELESS | Hs.118631 | Timeless homolog (Drosophila)
TLE4 | Hs.444213 | Transducin-like enhancer of split 4 (E(sp1) homolog, Drosophila)
TLR3 | Hs.29499 | Toll-like receptor 3
TMEFF1 | Hs.336224 | Transmembrane protein with EGF-like and two follistatin-like domains 1
TNC | Hs.143250 | Tenascin C (hexabrachion)
TNF | Hs.241570 | Tumor necrosis factor (TNF superfamily, member 2)
TNFAIP2 | Hs.525607 | Tumor necrosis factor, alpha-induced protein 2
TNFAIP6 | Hs.437322 | Tumor necrosis factor, alpha-induced protein 6
TNNT1 | Hs.534085 | Troponin T1, skeletal, slow
TNRC9 | Hs.460789 | Trinucleotide repeat containing 9
TRA1 | Hs.192374 | Tumor rejection antigen (gp96) 1
TREX1 | Hs.344812 | Three prime repair exonuclease 1
TRIP6 | Hs.534360 | Thyroid hormone receptor interactor 6
TRMT1 | Hs.439524 | TRNA (5-methylaminomethyl-2-thiouridylate)-methyltransferase 1
TRO | Hs.434971 | Trophinin
TTK | Hs.169840 | TTK protein kinase
TYROBP | Hs.515369 | TYRO protein tyrosine kinase binding protein
UCC1 | Hs.416007 | Ependymin related protein 1 (zebrafish)
UCHL1 | Hs.518731 | Ubiquitin carboxyl-terminal esterase L1 (ubiquitin thiolesterase)
UCP2 | Hs.80658 | Uncoupling protein 2 (mitochondrial, proton carrier)
UTY | Hs.115277 | Ubiquitously transcribed tetratricopeptide repeat gene, Y-linked
VAV3 | Hs.267659 | Vav 3 oncogene
VEGF | Hs.73793 | Vascular endothelial growth factor
VIL1 | Hs.534364 | Villin 1
VIL2 | Hs.487027 | Villin 2 (ezrin)
WAS | Hs.2157 | Wiskott-Aldrich syndrome (eczema-thrombocytopenia)
WASF1 | Hs.75850 | WAS protein family, member 1
WBP5 | Hs.533287 | WW domain binding protein 5
WNT1 | Hs.248164 | Wingless-type MMTV integration site family, member 1
WT1 | Hs.408453 | Wilms tumor 1
XPR1 | Hs.227656 | Xenotropic and polytropic retrovirus receptor
XTP2 | Hs.494614 | BAT2 domain containing 1
YIF1 | Hs.446445 | Yip1 interacting factor homolog (S. cerevisiae)
ZAK | Hs.444451 | Sterile alpha motif and leucine zipper containing kinase AZK
ZFPM2 | Hs.431009 | Zinc finger protein, multitype 2
Appendix P: Genes differentially expressed between serous LMP and invasive EOC after excluding those involved in cell-cycle regulation and the immune response
Table sorted by increasing mean difference in LMP:invasive EOC expression.
Symbol | Mean expression ratio: Invasive EOC | Mean expression ratio: LMP | Mean fold change difference | UniGene symbol | UniGene Cluster | UniGene Name
CLDN10 | 0.554 | 7.051 | 0.079 | CLDN10 | Hs.534377 | Claudin 10
CHL1 | 0.197 | 2.129 | 0.093 | CHL1 | Hs.148909 | Cell adhesion molecule with homology to L1CAM (close homolog of L1)
UPK1B | 0.333 | 1.783 | 0.187 | UPK1B | Hs.271580 | Uroplakin 1B
TSPAN1 | 0.223 | 1.085 | 0.205 | TSPAN1 | Hs.38972 | Tetraspanin 1
DLEC1 | 0.961 | 4.546 | 0.211 | DLEC1 | Hs.277589 | Deleted in lung and esophageal cancer 1
TFF3 | 0.277 | 1.206 | 0.230 | TFF3 | Hs.82961 | Trefoil factor 3 (intestinal)
KLK11 | 0.586 | 2.374 | 0.247 | KLK11 | Hs.57771 | Kallikrein 11
MUC4 | 0.717 | 2.727 | 0.263 | MUC4 | Hs.369646 | Mucin 4, tracheobronchial
ARHI | 0.346 | 1.252 | 0.276 | ARHI | Hs.194695 | DIRAS family, GTP-binding RAS-like 3
TCF21 | 0.251 | 0.906 | 0.277 | TCF21 | Hs.78061 | Transcription factor 21
ANXA4 | 0.305 | 1.076 | 0.283 | ANXA4 | Hs.422986 | Annexin A4
ARHI | 0.437 | 1.488 | 0.294 | ARHI | Hs.194695 | DIRAS family, GTP-binding RAS-like 3
UCC1 | 0.305 | 1.031 | 0.296 | UCC1 | Hs.416007 | Ependymin related protein 1 (zebrafish)
PROM1 | 0.365 | 1.203 | 0.303 | PROM1 | Hs.479220 | Prominin 1
CAV2 | 0.342 | 1.115 | 0.307 | CAV2 | Hs.212332 | Caveolin 2
FLRT3 | 0.445 | 1.388 | 0.321 | FLRT3 | Hs.41296 | Fibronectin leucine rich transmembrane protein 3
PPL | 1.008 | 3.050 | 0.330 | PPL | Hs.192233 | Periplakin
CDH8 | 0.623 | 1.821 | 0.342 | CDH8 | Hs.368322 | Cadherin 8, type 2
ARGBP2 | 0.432 | 1.246 | 0.346 | ARGBP2 | Hs.481342 | Arg/Abl-interacting protein ArgBP2
PAR1 | 0.482 | 1.388 | 0.347 | PAR1 | Hs.546847 | Prader-Willi/Angelman region-1
SPRY2 | 0.754 | 2.146 | 0.352 | SPRY2 | Hs.18676 | Sprouty homolog 2 (Drosophila)
TACC1 | 0.446 | 1.247 | 0.357 | TACC1 | Hs.279245 | Transforming, acidic coiled-coil containing protein 1
MET | 0.690 | 1.814 | 0.380 | MET | Hs.132966 | Met proto-oncogene (hepatocyte growth factor receptor)
NRCAM | 1.066 | 2.792 | 0.382 | NRCAM | Hs.21422 | Neuronal cell adhesion molecule
CTSL2 | 1.110 | 2.893 | 0.384 | CTSL2 | Hs.87417 | Cathepsin L2
PTPN3 | 0.668 | 1.722 | 0.388 | PTPN3 | Hs.436429 | Protein tyrosine phosphatase, non-receptor type 3
DHX34 | 0.579 | 1.485 | 0.390 | DHX34 | Hs.151706 | DEAH (Asp-Glu-Ala-His) box polypeptide 34
PRKCG | 0.435 | 1.105 | 0.394 | PRKCG | Hs.2890 | Protein kinase C, gamma
INHBB | 0.668 | 1.681 | 0.398 | INHBB | Hs.1735 | Inhibin, beta B (activin AB beta polypeptide)
DDX6 | 0.618 | 1.538 | 0.402 | DDX6 | Hs.408461 | DEAD (Asp-Glu-Ala-Asp) box polypeptide 6
CD44 | 0.521 | 1.285 | 0.405 | CD44 | Hs.502328 | CD44 antigen (homing function and Indian blood group system)
SSPN | 1.095 | 2.696 | 0.406 | SSPN | Hs.183428 | Sarcospan (Kras oncogene-associated gene)
CDH1 | 0.576 | 1.404 | 0.410 | CDH1 | Hs.461086 | Cadherin 1, type 1, E-cadherin (epithelial)
ITGA2 | 0.481 | 1.159 | 0.415 | ITGA2 | Hs.482077 | Integrin, alpha 2 (CD49B, alpha 2 subunit of VLA-2 receptor)
TNFRSF21 | 0.556 | 1.265 | 0.439 | TNFRSF21 | Hs.443577 | Tumor necrosis factor receptor superfamily, member 21
MMP10 | 0.780 | 1.772 | 0.440 | MMP10 | Hs.2258 | Matrix metalloproteinase 10 (stromelysin 2)
PDGFRA | 0.520 | 1.176 | 0.442 | PDGFRA | Hs.74615 | Platelet-derived growth factor receptor, alpha polypeptide
FLRT2 | 0.392 | 0.882 | 0.445 | FLRT2 | Hs.533710 | Fibronectin leucine rich transmembrane protein 2
SOX4 | 0.715 | 1.574 | 0.454 | SOX4 | Hs.357901 | SRY (sex determining region Y)-box 4
PTPRN2 | 0.337 | 0.737 | 0.457 | PTPRN2 | Hs.490789 | Protein tyrosine phosphatase, receptor type, N polypeptide 2
ABCC2 | 0.516 | 1.121 | 0.460 | ABCC2 | Hs.368243 | ATP-binding cassette, sub-family C (CFTR/MRP), member 2
ANXA13 | 0.497 | 1.078 | 0.461 | ANXA13 | Hs.181107 | Annexin A13
PTGS2 | 0.934 | 2.006 | 0.466 | PTGS2 | Hs.196384 | Prostaglandin-endoperoxide synthase 2 (prostaglandin G/H synthase and cyclooxygenase)
CD47 | 1.044 | 2.238 | 0.466 | CD47 | Hs.446414 | CD47 antigen (Rh-related antigen, integrin-associated signal transducer)
MET | 0.686 | 1.462 | 0.469 | MET | Hs.132966 | Met proto-oncogene (hepatocyte growth factor receptor)
NRXN3 | 0.885 | 1.881 | 0.470 | NRXN3 | Hs.368307 | Neurexin 3
SORL1 | 1.040 | 2.193 | 0.474 | SORL1 | Hs.368592 | Sortilin-related receptor, L(DLR class) A repeats-containing
LIFR | 0.630 | 1.321 | 0.477 | LIFR | Hs.133421 | Leukemia inhibitory factor receptor
ACVR1B | 0.729 | 1.510 | 0.483 | ACVR1B | Hs.438918 | Activin A receptor, type IB
VIL2 | 0.653 | 1.350 | 0.484 | VIL2 | Hs.487027 | Villin 2 (ezrin)
MAPK10 | 0.479 | 0.989 | 0.485 | MAPK10 | Hs.25209 | Mitogen-activated protein kinase 10
PTPRU | 0.967 | 1.992 | 0.486 | PTPRU | Hs.19718 | Protein tyrosine phosphatase, receptor type, U
CDH11 | 0.519 | 1.048 | 0.495 | CDH11 | Hs.116471 | Cadherin 11, type 2, OB-cadherin (osteoblast)
FGFR3 | 0.632 | 1.235 | 0.511 | FGFR3 | Hs.1420 | Fibroblast growth factor receptor 3 (achondroplasia, thanatophoric dwarfism)
TM4SF9 | 0.563 | 1.080 | 0.521 | TM4SF9 | Hs.118118 | Tetraspanin 5
LAD1 | 0.592 | 1.126 | 0.526 | LAD1 | Hs.519035 | Ladinin 1
PTK6 | 0.472 | 0.892 | 0.530 | PTK6 | Hs.51133 | PTK6 protein tyrosine kinase 6
KLK6 | 1.171 | 2.171 | 0.539 | KLK6 | Hs.79361 | Kallikrein 6 (neurosin, zyme)
AKAP12 | 0.636 | 1.160 | 0.549 | AKAP12 | Hs.371240 | A kinase (PRKA) anchor protein (gravin) 12
FGFR3 | 0.705 | 1.248 | 0.565 | FGFR3 | Hs.1420 | Fibroblast growth factor receptor 3 (achondroplasia, thanatophoric dwarfism)
PDCD4 | 0.599 | 1.061 | 0.565 | PDCD4 | Hs.232543 | Programmed cell death 4 (neoplastic transformation inhibitor)
WISP3 | 0.941 | 1.666 | 0.565 | WISP3 | Hs.549081 | WNT1 inducible signaling pathway protein 3
ITGA6 | 0.557 | 0.981 | 0.568 | ITGA6 | Hs.133397 | Integrin, alpha 6
PTGES | 0.990 | 1.740 | 0.569 | PTGES | Hs.146688 | Prostaglandin E synthase
TFF1 | 0.521 | 0.914 | 0.570 | TFF1 | Hs.162807 | Trefoil factor 1 (breast cancer, estrogen-inducible sequence expressed in)
FGFBP1 | 0.984 | 1.714 | 0.574 | FGFBP1 | Hs.1690 | Fibroblast growth factor binding protein 1
TM4SF4 | 0.536 | 0.925 | 0.579 | TM4SF4 | Hs.133527 | Transmembrane 4 L six family member 4
CLDN11 | 0.557 | 0.952 | 0.585 | CLDN11 | Hs.31595 | Claudin 11 (oligodendrocyte transmembrane protein)
PTGS2 | 1.058 | 1.805 | 0.586 | PTGS2 | Hs.196384 | Prostaglandin-endoperoxide synthase 2 (prostaglandin G/H synthase and cyclooxygenase)
ACSL5 | 0.526 | 0.892 | 0.589 | ACSL5 | Hs.11638 | Acyl-CoA synthetase long-chain family member 5
BDNF | 0.549 | 0.928 | 0.592 | BDNF | Hs.502182 | Brain-derived neurotrophic factor
AKAP12 | 0.639 | 1.060 | 0.603 | AKAP12 | Hs.371240 | A kinase (PRKA) anchor protein (gravin) 12
PPP2R2B | 1.451 | 2.398 | 0.605 | PPP2R2B | Hs.193825 | Protein phosphatase 2 (formerly 2A), regulatory subunit B (PR 52), beta isoform
ITGB4 | 0.799 | 1.315 | 0.607 | ITGB4 | Hs.370255 | Integrin, beta 4
ITGA9 | 0.557 | 0.901 | 0.618 | ITGA9 | Hs.113157 | Integrin, alpha 9
GPC6 | 0.467 | 0.749 | 0.624 | GPC6 | Hs.444329 | Glypican 6
KRT19 | 0.783 | 1.253 | 0.625 | KRT19 | Hs.514167 | Keratin 19
CDH5 | 0.624 | 0.969 | 0.643 | CDH5 | Hs.76206 | Cadherin 5, type 2, VE-cadherin (vascular epithelium)
LIMS1 | 2.334 | 3.627 | 0.644 | LIMS1 | Hs.469593 | LIM and senescent cell antigen-like domains 1
ANXA3 | 0.684 | 1.062 | 0.644 | ANXA3 | Hs.480042 | Annexin A3
BAG5 | 0.844 | 1.302 | 0.648 | BAG5 | Hs.5443 | BCL2-associated athanogene 5
LAMA3 | 0.754 | 1.151 | 0.655 | LAMA3 | Hs.436367 | Laminin, alpha 3
CERKL | 0.888 | 1.351 | 0.657 | ITGA4 | Hs.440955 | Ceramide kinase-like
PCDHA6 | 0.729 | 1.095 | 0.666 | PCDHA6 | Hs.199343 | Protocadherin alpha 6
LAMB1 | 0.554 | 0.830 | 0.668 | LAMB1 | Hs.489646 | Laminin, beta 1
FGF13 | 1.092 | 0.688 | 1.587 | FGF13 | Hs.6540 | Fibroblast growth factor 13
CAPN9 | 1.092 | 0.688 | 1.587 | CAPN9 | Hs.498021 | Calpain 9
GPC1 | 1.415 | 0.885 | 1.599 | GPC1 | Hs.328232 | Glypican 1
TSTA3 | 1.250 | 0.781 | 1.600 | TSTA3 | Hs.404119 | Tissue specific transplantation antigen P35B
IMP-3 | 1.547 | 0.958 | 1.615 | IMP-3 | Hs.432616 | IGF-II mRNA-binding protein 3
LGALS1 | 0.987 | 0.606 | 1.627 | LGALS1 | Hs.445351 | Lectin, galactoside-binding, soluble, 1 (galectin 1)
PDCD2 | 1.474 | 0.870 | 1.694 | PDCD2 | Hs.367900 | Programmed cell death 2
SLC39A14 | 1.485 | 0.850 | 1.749 | SLC39A14 | Hs.491232 | Solute carrier family 39 (zinc transporter), member 14
SELENBP1 | 1.271 | 0.709 | 1.792 | SELENBP1 | Hs.334841 | Selenium binding protein 1
CDH13 | 2.100 | 1.147 | 1.830 | CDH13 | Hs.436040 | Cadherin 13, H-cadherin (heart)
STEAP | 1.182 | 0.639 | 1.851 | STEAP | Hs.61635 | Six transmembrane epithelial antigen of the prostate 1
TM4SF3 | 1.191 | 0.637 | 1.870 | TM4SF3 | Hs.170563 | Tetraspanin 8
CD9 | 1.976 | 1.034 | 1.911 | CD9 | Hs.114286 | CD9 antigen (p24)
PTPNS1 | 1.476 | 0.755 | 1.955 | PTPNS1 | Hs.128846 | Protein tyrosine phosphatase, non-receptor type substrate 1
LTBP1 | 1.368 | 0.686 | 1.993 | LTBP1 | Hs.49787 | Latent transforming growth factor beta binding protein 1
CDH6 | 2.591 | 1.275 | 2.033 | CDH6 | Hs.171054 | Cadherin 6, type 2, K-cadherin (fetal kidney)
MAD | 1.408 | 0.685 | 2.056 | MAD | Hs.468908 | MAX dimerization protein 1
PLAU | 2.334 | 1.090 | 2.141 | PLAU | Hs.77274 | Plasminogen activator, urokinase
NGFR | 1.306 | 0.596 | 2.192 | NGFR | Hs.415768 | Nerve growth factor receptor (TNFR superfamily, member 16)
MSLN | 1.147 | 0.519 | 2.209 | MSLN | Hs.408488 | Mesothelin
PRSS8 | 1.597 | 0.723 | 2.209 | PRSS8 | Hs.75799 | Protease, serine, 8 (prostasin)
BMP7 | 3.031 | 1.338 | 2.266 | BMP7 | Hs.473163 | Bone morphogenetic protein 7 (osteogenic protein 1)
SNCG | 2.222 | 0.977 | 2.275 | SNCG | Hs.349470 | Synuclein, gamma (breast cancer-specific protein 1)
CSRP2 | 1.958 | 0.833 | 2.350 | CSRP2 | Hs.530904 | Cysteine and glycine-rich protein 2
DEFA4 | 1.357 | 0.559 | 2.427 | DEFA4 | Hs.2582 | Defensin, alpha 4, corticostatin
CEBPA | 1.688 | 0.687 | 2.457 | CEBPA | Hs.76171 | CCAAT/enhancer binding protein (C/EBP), alpha
LCP1 | 1.816 | 0.724 | 2.510 | LCP1 | Hs.381099 | Lymphocyte cytosolic protein 1 (L-plastin)
FABP3 | 1.622 | 0.644 | 2.520 | FABP3 | Hs.112669 | Fatty acid binding protein 3, muscle and heart (mammary-derived growth inhibitor)
CD36 | 1.428 | 0.562 | 2.540 | CD36 | Hs.120949 | CD36 antigen (collagen type I receptor, thrombospondin receptor)
L1CAM | 2.569 | 0.986 | 2.606 | L1CAM | Hs.522818 | L1 cell adhesion molecule
THBS2 | 1.570 | 0.587 | 2.675 | THBS2 | Hs.371147 | Thrombospondin 2
DAB2 | 1.132 | 0.422 | 2.685 | DAB2 | Hs.481980 | Disabled homolog 2, mitogen-responsive phosphoprotein (Drosophila)
PRRX1 | 2.159 | 0.787 | 2.742 | PRRX1 | Hs.283416 | Paired related homeobox 1
STEAP | 1.388 | 0.504 | 2.754 | STEAP | Hs.61635 | Six transmembrane epithelial antigen of the prostate 1
COL4A2 | 1.348 | 0.482 | 2.796 | COL4A2 | Hs.508716 | Collagen, type IV, alpha 2
KRT10 | 3.411 | 1.213 | 2.813 | KRT10 | Hs.99936 | Keratin 10 (epidermolytic hyperkeratosis; keratosis palmaris et plantaris)
FN1 | 1.534 | 0.521 | 2.944 | FN1 | Hs.203717 | Fibronectin 1
PTGS1 | 3.120 | 1.048 | 2.977 | PTGS1 | Hs.201978 | Prostaglandin-endoperoxide synthase 1 (prostaglandin G/H synthase and cyclooxygenase)
DXS9879E | 2.020 | 0.675 | 2.994 | DXS9879E | Hs.444619 | DNA segment on chromosome X (unique) 9879 expressed sequence
DNMT1 | 2.232 | 0.735 | 3.038 | DNMT1 | Hs.202672 | DNA (cytosine-5-)-methyltransferase 1
GAGEB1 | 1.699 | 0.538 | 3.160 | GAGEB1 | Hs.128231 | P antigen family, member 1 (prostate associated)
TNFAIP6 | 2.689 | 0.816 | 3.295 | TNFAIP6 | Hs.437322 | Tumor necrosis factor, alpha-induced protein 6
MEOX1 | 3.147 | 0.935 | 3.364 | MEOX1 | Hs.438 | Mesenchyme homeo box 1
SNX13 | 2.069 | 0.610 | 3.392 | SNX13 | Hs.487648 | Sorting nexin 13
MEOX1 | 2.648 | 0.768 | 3.449 | MEOX1 | Hs.438 | Mesenchyme homeo box 1
KLK5 | 7.933 | 2.276 | 3.485 | KLK5 | Hs.50915 | Kallikrein 5
HOXA5 | 2.104 | 0.595 | 3.537 | HOXA5 | Hs.37034 | Homeo box A5
UCHL1 | 3.108 | 0.831 | 3.740 | UCHL1 | Hs.518731 | Ubiquitin carboxyl-terminal esterase L1 (ubiquitin thiolesterase)
GPNMB | 1.677 | 0.446 | 3.759 | GPNMB | Hs.190495 | Glycoprotein (transmembrane) nmb
MMP15 | 2.140 | 0.550 | 3.888 | MMP15 | Hs.80343 | Matrix metalloproteinase 15 (membrane-inserted)
IGF1 | 1.907 | 0.481 | 3.961 | IGF1 | Hs.160562 | Insulin-like growth factor 1 (somatomedin C)
RAD51 | 2.600 | 0.532 | 4.889 | RAD51 | Hs.446554 | RAD51 homolog (RecA homolog, E. coli) (S. cerevisiae)
BAP1 | 3.428 | 0.559 | 6.134 | BAP1 | Hs.106674 | BRCA1 associated protein-1 (ubiquitin carboxy-terminal hydrolase)
CENPF | 4.000 | 0.637 | 6.281 | CENPF | Hs.497741 | Centromere protein F, 350/400ka (mitosin)
TOP2A | 3.446 | 0.546 | 6.311 | TOP2A | Hs.156346 | Topoisomerase (DNA) II alpha 170kDa
PTTG1 | 4.168 | 0.545 | 7.642 | PTTG1 | Hs.350966 | Pituitary tumor-transforming 1
CRABP2 | 9.172 | 0.806 | 11.382 | CRABP2 | Hs.405662 | Cellular retinoic acid binding protein 2
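The fold change values tabulated here appear to equal the invasive EOC mean expression ratio divided by the LMP mean expression ratio (for MMP10, 0.780 / 1.772 ≈ 0.440). This relationship is inferred from the numbers themselves and is not stated in this appendix. Small differences in the third decimal place presumably arise because the displayed means are rounded. A minimal Python sketch of the check, in which the names `fold_change` and `check_rows` are illustrative only:

```python
# Illustrative only: recompute the fold change column from the two mean ratios.
# Assumption (inferred from the tabulated values, not stated in the appendix):
# fold change = mean ratio in invasive EOC / mean ratio in LMP.

def fold_change(invasive_mean: float, lmp_mean: float) -> float:
    """Return invasive/LMP; values below 1 mean lower expression in invasive EOC."""
    if lmp_mean <= 0:
        raise ValueError("LMP mean expression ratio must be positive")
    return invasive_mean / lmp_mean


# Rows copied from the table: (symbol, invasive mean, LMP mean, reported fold change)
ROWS = [
    ("MMP10", 0.780, 1.772, 0.440),
    ("PDGFRA", 0.520, 1.176, 0.442),
    ("GPC1", 1.415, 0.885, 1.599),
    ("CRABP2", 9.172, 0.806, 11.382),
]


def check_rows(rows, tolerance=0.01):
    """Return symbols whose recomputed value differs from the reported one by more than tolerance."""
    return [
        symbol
        for symbol, invasive, lmp, reported in rows
        if abs(fold_change(invasive, lmp) - reported) > tolerance
    ]
```

Under this assumption, values below 1 correspond to genes expressed at a lower level in invasive EOC than in LMP tumours, and values above 1 to genes expressed at a higher level.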
Appendix Q: Microarray images from Gilks et al. study of LMP and invasive EOC
These randomly selected arrays from the total set (n=23) show the extent and range of
hybridisation patterns present in this dataset, visible even at this macroscopic level. In
particular, the feature intensity appears to fade from top to bottom of each sub-grid.
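
One simple way to put a number on the top-to-bottom fade described above is the least-squares slope of mean feature intensity against row position within a sub-grid. The sketch below is a generic illustration and is not the quantification method developed in this thesis. The function name `row_intensity_slope` and the example intensities are hypothetical.

```python
# Generic illustration only: summarise a top-to-bottom intensity fade in one
# sub-grid as the slope of mean row intensity versus row index.
# The function name and the example data are hypothetical.

def row_intensity_slope(subgrid):
    """Least-squares slope of mean row intensity against row index.

    subgrid: list of rows ordered top to bottom, each a list of feature
    intensities. A negative slope means intensity falls towards the bottom.
    """
    row_means = [sum(row) / len(row) for row in subgrid]
    n = len(row_means)
    if n < 2:
        raise ValueError("need at least two rows to estimate a slope")
    x_mean = (n - 1) / 2
    y_mean = sum(row_means) / n
    sxy = sum((i - x_mean) * (m - y_mean) for i, m in enumerate(row_means))
    sxx = sum((i - x_mean) ** 2 for i in range(n))
    return sxy / sxx


# Synthetic sub-grid whose row means drop by 200 intensity units per row.
example_subgrid = [
    [1000, 980, 1020],
    [800, 790, 810],
    [600, 610, 590],
    [400, 405, 395],
]
slope = row_intensity_slope(example_subgrid)
```

A slope near zero would indicate no systematic vertical trend, while a consistently negative slope across sub-grids would match the fading pattern seen in these images.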