JOURNAL OF CHEMOMETRICS
J. Chemometrics 2007; 21: 451–458
Published online 15 July 2007 in Wiley InterScience (www.interscience.wiley.com) DOI: 10.1002/cem.1042

Fluorescence spectroscopy and chemometrics for classification of breast cancer samples – a feasibility study using extended canonical variates analysis

Lars Nørgaard 1*, György Sölétormos 2, Niels Harrit 3, Morten Albrechtsen 4, Ole Olsen 5, Dorte Nielsen 6, Kristoffer Kampmann 3,5 and Rasmus Bro 1

1 Department of Food Science, The Faculty of Life Sciences, University of Copenhagen, DK-1958 Frederiksberg C, Denmark
2 Department of Clinical Biochemistry, Copenhagen University Hospital Hillerød, DK-3400 Hillerød, Denmark
3 Department of Chemistry, University of Copenhagen, Universitetsparken 5, DK-2100 Copenhagen, Denmark
4 Wexotec Management, Holgersvej 15, DK-2920 Charlottenlund, Denmark
5 Medico-Chemical Lab, Skelstedet 5, DK 2950 Vedbæk, Denmark
6 Department of Oncology, Copenhagen University Hospital Herlev, Herlev Ringvej 75, DK-2730 Herlev, Denmark

Received 9 February 2007; Accepted 23 March 2007

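*Correspondence to: L. Nørgaard, Department of Food Science, Faculty of Life Sciences, University of Copenhagen, Chemometrics Group, Rolighedsvej 30, DK-1958 Frederiksberg C, Denmark. E-mail: [email protected]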

The objective of this phase I feasibility study is to investigate whether fluorescence spectroscopy of

serum samples in combination with multivariate data analysis can be used to discriminate healthy

females from breast cancer patients with solitary and multiple metastases, respectively. Serum

samples were obtained from 39 females: 13 healthy females (controls) and 26 clinically diagnosed

patients with either solitary metastases (11 patients) or multiple metastases (15 patients). Fluorescence

spectra were measured on undiluted samples and samples diluted 20 times and 500 times.

Extended Canonical Variates Analysis (ECVA) was applied to develop classification models on the

data. Three-group ECVA based on all spectroscopic data (5221 variables) gave five misclassifications

in total, while sequential ECVA models on selected excitation wavelengths yielded two errors. The

fluorescence spectroscopic results were compared with results based on the three tumor markers

cancer antigen 15-3 (CA 15-3), carcinoembryonic antigen (CEA), and tissue polypeptide antigen

(TPA). The lowest number of errors obtained using ECVA on the biomarkers was seven. Furthermore,

fluorescence spectroscopy made it possible to discover sample subgroupings: females with solitary

and multiple metastases could be divided into two subgroups according to the spectral patterns of the

samples. Copyright © 2007 John Wiley & Sons, Ltd.

KEYWORDS: cancer; fluorescence; biomarkers; chemometrics; multivariate; ECVA

1. INTRODUCTION

Fluorescence spectroscopy is a sensitive, specific, and fast

tool for detection of micro-environment changes and molecular

interactions in complex samples [1]. As a consequence

of this fact a fluorescence landscape measured directly on a

diluted serum sample yields a characteristic multivariate

spectroscopic pattern. Wolfbeis, Leiner, and co-workers

introduced the hypothesis that this pattern of fluorescence in

diluted serum samples contains information about the health

status of the individual. The first analyses performed by

Wolfbeis, Leiner, and co-workers with excitation–emission

(2D) fluorescence spectroscopy on diluted blood/serum

samples were published in the 1980s. They analyzed diluted

serum from healthy and tumor-induced rats with ultraviolet

[2] (500 times dilution) and visible [3] (20 times dilution) 2D

fluorescence spectroscopy. Comparison of the recorded

signals was performed by inspection of the raw data,

inspection of difference matrices, and in the last-mentioned

paper also by a cluster analysis on two selected fluorescence

intensities. Later they introduced fluorescence spectroscopic

measurements on diluted human serum samples [4] and they

also compared 2D fluorescence signals measured on diluted

serum from subjects with gynecological tumors with serum

from healthy subjects [5]. They demonstrated that fluorescence

spectroscopy measured on the diluted, but otherwise

untreated serum samples, was an attractive way of obtaining

information due to the complex pattern obtained but at the

same time they encountered problems on how to extract the

relevant information. Hence, their approach was not operational

at that time.


In the present study, excitation–emission fluorescence

spectroscopic measurements on undiluted, 20 times and

500 times diluted serum samples are combined with multivariate

chemometric modeling for discriminating subjects

from two groups of patients and a control group. Thus, we

use essentially the same instrumental setup as did Wolfbeis

and Leiner [4], but we add to that the measurement of undiluted

samples as well as the use of multivariate data analysis.

Measuring undiluted samples is typically avoided in

traditional chemical analysis because it can lead to quenching

and other poorly described phenomena. However, with

the introduction of multivariate data analysis, this may not

be a problem and lack of a dilution step would provide a

significant practical advantage. Focus in the present study is

to show how a multivariate classification technique can be

applied to extract information from the complex spectroscopic

fingerprints. The results obtained are compared with

the results from using biomarker diagnostics (cancer antigen

15-3 (CA 15-3), carcinoembryonic antigen (CEA), and tissue

polypeptide antigen (TPA) [6]) on the same subjects. Both the

fluorescence spectroscopic and biomarker methods are

compared to the clinical diagnoses of the subjects.

2. THEORY

2.1. Fluorescence spectroscopy

Fluorescence refers to light emission (luminescence) by electron transfer in the singlet state when molecules are excited by photons [1,7]. Fluorescence is a three-stage process that occurs in certain molecules called fluorophores: (1) the fluorophore is excited to an electronic singlet state by absorption of an external photon; (2) the excited state undergoes conformational changes and interacts with the molecular environment in a number of different ways, including vibrational relaxation, quenching, and energy transfer; (3) a photon is emitted at a longer wavelength, while the fluorophore returns to its ground state. The fluorescence excitation and emission of light typically appear within nanoseconds. The molecular structure and environment are decisive for whether a compound is fluorescent, and fluorescence is often exhibited by organic compounds with rigid molecular skeletons, usually polyaromatic hydrocarbons and heterocycles. Fluorescence is unique among spectroscopic techniques, because it is inherently multidimensional and fluorophores have independent and specific spectral excitation and emission profiles. These profiles can be measured as excitation and emission spectra or as a complete excitation–emission matrix (EEM), also known as a fluorescence landscape. An example of a fluorescence landscape (contour plot) of a serum sample is depicted in Figure 1 (left). Fluorescence spectroscopy is also a very sensitive analytical method (compared to e.g. absorption-based spectrophotometry) with possibilities to measure down to parts per billion levels. The fluorescence signals are ideally additive in mixtures; that is, the overall fluorescence signal of a given sample can be expressed as the sum of the fluorescence contribution from each of the inherent fluorophores. However, in complex mixtures such as serum samples, the fluorescence may not be additive due to quenching phenomena and interactions with the molecular environment of the fluorophore.

Figure 1. Left: Contour plots of fluorescence spectra of a selected solitary sample at no dilution, 20 times dilution, and 500 times dilution (diagonal lines in the plots are due to Rayleigh scattering). For 20 times dilution the signals saturate the detector at excitation wavelengths from 230 to 300 nm and these data are not included. Right: The concatenated spectra for the 39 analyzed samples for each of the three measurement conditions. Data from undiluted and 500 times dilution contain emission spectra from 18 excitation wavelengths (from 230 to 400 nm with 10 nm intervals), while the 20 times dilution data contains emission spectra from 10 excitation wavelengths. This figure is available online at www.interscience.wiley.com/journal/cem


2.2. Extended Canonical Variates Analysis (ECVA)

ECVA [8] was recently developed as an extension of

Canonical Variates Analysis (CVA) which is a well-

established method for classification of full rank data. ECVA

is developed to handle highly collinear and low rank data, for

example, spectroscopic data tables where the number of

variables is larger than the number of samples. ECVA

estimates directions in space that maximize the differences

between the groups in the data according to a well-defined

optimization criterion. The following description is based on

reference [8] and references therein; references [9] and [10]

are of particular interest.

Assume a data matrix $X$ ($n \times v$) where the samples are from $g$ different groups with $n_i$ samples in the $i$th group ($n = \sum_{i=1}^{g} n_i$).

The within-group covariance matrix is defined as

$$S_{\text{within}} = \frac{1}{n-g} \sum_{i=1}^{g} \sum_{j=1}^{n_i} (x_{ij} - \bar{x}_i)(x_{ij} - \bar{x}_i)' \tag{1}$$

and the between-group covariance matrix is defined as

$$S_{\text{between}} = \frac{1}{g-1} \sum_{i=1}^{g} n_i (\bar{x}_i - \bar{x})(\bar{x}_i - \bar{x})' \tag{2}$$

where $x_{ij}$ is the $j$th sample in the $i$th group (represented as a column vector), $\bar{x}_i = \frac{1}{n_i} \sum_{j=1}^{n_i} x_{ij}$ is the mean vector in the $i$th group, and $\bar{x} = \frac{1}{n} \sum_{i=1}^{g} \sum_{j=1}^{n_i} x_{ij} = \frac{1}{n} \sum_{i=1}^{g} n_i \bar{x}_i$ is the overall mean vector. Note that the dimensions of $S_{\text{within}}$ and $S_{\text{between}}$ are $v \times v$.
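Equations (1) and (2) can be sketched numerically as follows. This is a minimal NumPy illustration on synthetic data, not the ECVA Toolbox code; the function name `group_covariances` and the demo arrays are assumptions introduced here.

```python
import numpy as np

def group_covariances(X, labels):
    """Return (S_within, S_between) as defined in Equations (1) and (2).

    X is an (n x v) array with one sample per row; labels holds the
    group membership of each row. Names are illustrative only.
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    n, v = X.shape
    groups = np.unique(labels)
    g = len(groups)
    overall_mean = X.mean(axis=0)
    S_within = np.zeros((v, v))
    S_between = np.zeros((v, v))
    for group in groups:
        Xi = X[labels == group]
        mean_i = Xi.mean(axis=0)
        centered = Xi - mean_i
        S_within += centered.T @ centered            # double sum in Eq. (1)
        diff = (mean_i - overall_mean)[:, None]
        S_between += Xi.shape[0] * (diff @ diff.T)   # n_i-weighted term in Eq. (2)
    return S_within / (n - g), S_between / (g - 1)

# Synthetic demo: 3 groups, 4 samples each, 5 variables (hypothetical data)
rng = np.random.default_rng(0)
labels_demo = np.repeat([0, 1, 2], 4)
X_demo = rng.normal(size=(12, 5)) + labels_demo[:, None]
S_within, S_between = group_covariances(X_demo, labels_demo)
```

Both matrices are $v \times v$, and the rank of `S_between` is at most $g-1$, which is why at most $g-1$ canonical variates exist.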

It is now possible to define standard CVA as the problem of finding a direction, $w$, that maximizes

$$J(w) = \frac{w' S_{\text{between}} w}{w' S_{\text{within}} w} \tag{3}$$

The solution can be written as an eigenvector equation

$$S_{\text{between}} w = \lambda S_{\text{within}} w \tag{4}$$

If $S_{\text{within}}$ is non-singular we have

$$S_{\text{within}}^{-1} S_{\text{between}} w = \lambda w \tag{5}$$

which is an eigenvalue problem, where $\lambda$ is the eigenvalue and $w$ is the eigenvector.
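For the full-rank case, Equation (5) can be solved directly. The sketch below is an illustration only, assuming a well-conditioned within-group matrix; `cva_directions` and the two-variable demo matrices are hypothetical names and values chosen so the result can be checked by hand.

```python
import numpy as np

def cva_directions(S_within, S_between):
    """Solve S_within^{-1} S_between w = lambda w (Equation (5)).

    Only meaningful when S_within is non-singular.
    """
    M = np.linalg.solve(S_within, S_between)   # S_within^{-1} S_between
    eigvals, eigvecs = np.linalg.eig(M)
    order = np.argsort(eigvals.real)[::-1]     # largest eigenvalue first
    return eigvals.real[order], eigvecs.real[:, order]

# Hand-checkable demo: two groups whose means differ by d = (1, 1)'
S_within_demo = np.array([[2.0, 0.0], [0.0, 1.0]])
d = np.array([1.0, 1.0])
S_between_demo = np.outer(d, d)
eigvals, W = cva_directions(S_within_demo, S_between_demo)
w1 = W[:, 0]   # proportional to S_within^{-1} d = (0.5, 1)'
```

Here the leading eigenvalue equals $d' S_{\text{within}}^{-1} d = 1.5$ and the second eigenvalue is zero, in line with the single canonical direction available for two groups.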

If $S_{\text{within}}$ is singular, left-multiplication by the inverse of $S_{\text{within}}$ is not possible and this is why standard CVA breaks down when analyzing, for example, multicollinear data.

In ECVA, the following is suggested: For the two-group situation Equation (4) can be rewritten as [11]

$$(\bar{x}_1 - \bar{x}_2)(\bar{x}_1 - \bar{x}_2)' w = \lambda S_{\text{within}} w \tag{6}$$

$(\bar{x}_1 - \bar{x}_2)' w$ is a scalar, $k$, so the equation can be written as

$$(\bar{x}_1 - \bar{x}_2) k = \lambda S_{\text{within}} w \tag{7}$$

Equation (7) is then transformed into a multivariate regression problem

$$y = Rb + f \tag{8}$$

where $y = (\bar{x}_1 - \bar{x}_2)$ is the dependent variable, $R = S_{\text{within}}$ contains the independent variables, and $b = w$ is the regression vector. The vector $f$ holds the residuals. Since $k$ and $\lambda$ are constants they do not change the direction of $w$ in Equations (7) and (8). In ECVA Equations (7) and (8) are solved with a PLS regression method. By multiplication of the mean centered data matrix, $X_{MC}$, with the weight vector, $w$, the canonical variates, $t_{CV}$, are obtained ($t_{CV} = X_{MC} w$). The mean centering is performed using the mean vector of all calibration samples (the same mean vector as used in the calculation of $S_{\text{between}}$). The calculated canonical variates can be used directly in, for example, a Linear Discriminant Analysis (LDA) classifier (described later).
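The two-group procedure can be sketched as below. This is a hedged illustration rather than the published ECVA algorithm: it uses a textbook NIPALS PLS1 without centering of $R$ and $y$ (an assumption), and the names `pls1_coefficients`, `ecva_two_group`, and the synthetic data are introduced only for this example.

```python
import numpy as np

def pls1_coefficients(R, y, n_components):
    """PLS1 (NIPALS) regression vector b for y ~ R b, without centering."""
    R = np.array(R, dtype=float)   # copies; deflated in place below
    y = np.array(y, dtype=float)
    weights, loadings, inner = [], [], []
    for _ in range(n_components):
        w = R.T @ y
        norm = np.linalg.norm(w)
        if norm < 1e-12:           # nothing left to model
            break
        w /= norm
        t = R @ w                  # scores
        tt = t @ t
        p = R.T @ t / tt           # loadings
        q = (y @ t) / tt           # inner regression coefficient
        R -= np.outer(t, p)        # deflate R and y
        y -= q * t
        weights.append(w)
        loadings.append(p)
        inner.append(q)
    W = np.column_stack(weights)
    P = np.column_stack(loadings)
    return W @ np.linalg.solve(P.T @ W, np.array(inner))

def ecva_two_group(X, labels, n_components):
    """Estimate w from Equation (8) and return (w, t_CV = X_MC w)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    groups = np.unique(labels)
    X1, X2 = X[labels == groups[0]], X[labels == groups[1]]
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    S_within = ((X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)) / (len(X) - 2)
    w = pls1_coefficients(S_within, m1 - m2, n_components)
    t_cv = (X - X.mean(axis=0)) @ w   # centering with the overall mean
    return w, t_cv

# Synthetic demo with more variables (50) than samples (12)
rng = np.random.default_rng(1)
X_demo = np.vstack([rng.normal(0.0, 1.0, (6, 50)), rng.normal(0.5, 1.0, (6, 50))])
labels_demo = np.repeat([0, 1], 6)
w_demo, t_demo = ecva_two_group(X_demo, labels_demo, n_components=2)
```

The number of PLS components is a tuning parameter; in the present study it is chosen by leave-one-out cross validation (Section 3.3). With as many components as the rank of a non-singular $S_{\text{within}}$, the PLS solution coincides with the ordinary least-squares solution of Equation (8).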

When more than two groups are considered the directions can also be estimated from Equation (4), as there will generally be more than one eigenvalue/eigenvector pair:

$$S_{\text{between}} w_a = \lambda_a S_{\text{within}} w_a \tag{9}$$

where $a$ indexes the directions. Equation (9) has $\min(v, g-1)$ non-zero eigenvalues, which is thus the maximum dimensionality of the canonical space. For high-dimensional data (large $v$) the maximum number of canonical variates is $g-1$ (the number of groups minus one).

The regression equation for the multigroup case is

$$Y = RB + F \tag{10}$$

where $Y$ contains the differences $(\bar{x}_i - \bar{x})$ as column vectors, $R$ is $S_{\text{within}}$ (as in the two-group case, Equation (8)), and the columns of $B$ are $w_a$ (designated as $W$ in the following). $F$ is the residual matrix. The dimension of $Y$, $B$, and $F$ is $v \times g$.

PLS2 is used as the regression technique to solve Equation

(10). The number of weights calculated corresponds to the

number of groups and the weights are sorted according to

their values when inserted one-by-one in the optimization

criterion (Equation (3)). The weight with the lowest value is

left out before application of the classifier because there is an

intrinsic rank deficiency due to the closure properties of the

dependent variables. The properties of PLS2 will ensure that

the space spanned by the retained g−1 weights covers the full

space of the solution which is all that is needed.

By multiplication of the mean centered data matrix, XMC,

with the canonical weights matrix, W, the canonical variates,

TCV, are obtained (TCV ¼ XMCW) and LDA is then applied on

these. The advantage of ECVA is that no dimension-reducing

step is necessary before the classifier is applied to the

canonical variates; the discriminative directions are estimated

directly in the original multidimensional space, which

is of interest in, for example, spectroscopic applications.

2.3. LDA classifier

An LDA [9] was used as the classifier; the discriminant function for the canonical variates is

$$L_i(t) = \log(p_i) - \frac{1}{2}(t - \bar{t}_i)' S_{\text{within},T_{CV}}^{-1} (t - \bar{t}_i) + \log\left|S_{\text{within},T_{CV}}\right| \tag{11}$$

where $i$ is a group index ($1, \ldots, g$), $t$ contains the canonical variates (as a column vector) for the sample to be classified, $\bar{t}_i$ is the mean vector of the canonical variates for group $i$, and $S_{\text{within},T_{CV}}$ is the pooled covariance matrix for the canonical variates (an analog to $S_{\text{within}}$ for the raw data presented above). The priors, $p_i$, were set to equal probabilities; for example, if three groups are analyzed the prior is 1/3 for each group. The sample is classified to the group that gives the highest value of $L_i$.
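A minimal sketch of the discriminant function in Equation (11) is given below, written term by term as printed. The function name `lda_scores` and the three-group demo values are hypothetical; note that the log-determinant term is identical for all groups and therefore does not affect which group attains the highest score.

```python
import numpy as np

def lda_scores(t, group_means, S_pooled, priors):
    """Evaluate L_i(t) from Equation (11) for every group i."""
    S_inv = np.linalg.inv(S_pooled)
    _, logdet = np.linalg.slogdet(S_pooled)    # log|S_pooled|
    scores = []
    for mean_i, p_i in zip(group_means, priors):
        diff = t - mean_i
        scores.append(np.log(p_i) - 0.5 * diff @ S_inv @ diff + logdet)
    return np.array(scores)

# Hypothetical demo: three groups in a two-dimensional canonical space
group_means_demo = [np.array([0.0, 0.0]), np.array([3.0, 0.0]), np.array([0.0, 3.0])]
S_pooled_demo = np.eye(2)
priors_demo = [1 / 3, 1 / 3, 1 / 3]            # equal priors, as in the text
t_new = np.array([2.5, 0.2])
scores_demo = lda_scores(t_new, group_means_demo, S_pooled_demo, priors_demo)
predicted_group = int(np.argmax(scores_demo))  # index of the highest L_i
```

With equal priors and a shared covariance matrix, the rule reduces to assigning the sample to the group with the nearest mean in the Mahalanobis sense; in the demo this is the second group (index 1).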


2.4. Interval ECVA models (iECVA)

Based on the same idea as interval PLS modeling [12], iECVA

modeling was implemented as an exploratory

multivariate modeling tool; see, for example, the paper by

Munck [13] in this volume. In the present case an ECVA

model is calculated for each excitation wavelength recorded

for each of the dilutions.
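The interval layout can be illustrated with a short sketch. It assumes the excitation grids stated in Section 3.4 and the Figure 2 caption (18 wavelengths for the undiluted and 500 times diluted samples, 10 for the 20 times dilution); the function names and the generic `count_errors` callback are hypothetical placeholders for an ECVA fit, not part of the ECVA Toolbox.

```python
import numpy as np

def excitation_intervals():
    """List (dilution, excitation wavelength in nm) for each interval, in order."""
    full_range = list(range(230, 401, 10))      # 18 excitation wavelengths
    reduced_range = list(range(310, 401, 10))   # 10 excitation wavelengths
    intervals = [("undiluted", ex) for ex in full_range]
    intervals += [("20 times", ex) for ex in reduced_range]
    intervals += [("500 times", ex) for ex in full_range]
    return intervals

def interval_errors(X, labels, column_slices, count_errors):
    """Apply a user-supplied error-counting routine to each interval of X."""
    return [count_errors(X[:, cols], labels) for cols in column_slices]

intervals = excitation_intervals()

# Toy call: the callback here only reports the interval width, standing in
# for a cross-validated model that would return a misclassification count.
X_demo = np.arange(24.0).reshape(4, 6)
slices_demo = [slice(0, 2), slice(2, 6)]
widths_demo = interval_errors(X_demo, None, slices_demo, lambda Xi, y: Xi.shape[1])
```

In this ordering there are 46 intervals in total, and interval number 29 is the 230 nm excitation of the 500 times diluted samples.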

2.5. Normalization of individual emission spectra for each excitation wavelength

Each emission spectrum ($x$) at a given excitation wavelength is normalized to length one: $x \leftarrow x / \lVert x \rVert$, where the norm is the Euclidean length.
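As a small illustration of this pre-processing step, the sketch below scales each row of an array to unit Euclidean length. The function name `normalize_spectra` and the guard against all-zero rows are assumptions added for the example.

```python
import numpy as np

def normalize_spectra(spectra):
    """Scale each emission spectrum (one per row) to Euclidean length one."""
    spectra = np.asarray(spectra, dtype=float)
    norms = np.linalg.norm(spectra, axis=1, keepdims=True)
    norms = np.where(norms == 0.0, 1.0, norms)   # leave all-zero rows unchanged
    return spectra / norms

# Two toy "spectra" with lengths 5 and 2
normalized_demo = normalize_spectra([[3.0, 4.0], [0.0, 2.0]])
```

After normalization only the spectral shape remains; overall intensity differences between samples are removed.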

3. MATERIAL AND METHODS

3.1. Samples

The serum samples were selected from three sample banks collected previously [6,14]. The selected samples were collected from 1988 to 1992 and stored at −80 °C with 24-hour surveillance. None of the samples had been thawed previously. We investigated three groups of females. Group 1, the control (C) group, consisted of 13 healthy subjects [6]. Group 2 (S) consisted of 11 metastatic breast cancer patients with solitary metastases [14]. The samples were drawn prior to the first cycle of first-line treatment for metastatic breast

cancer. Group 3 (M) consisted of 15 metastatic breast cancer

patients with multiple metastases [14]. The samples were

drawn during clinical tumor progression at least 2 weeks

after the latest course of chemotherapy except for two

patients who received different continuous oral medication.

The mean/minimum/maximum value for age for women in

groups C, S, and M are 49/39/69 years, 52/34/67 years, and

56/35/70 years, respectively.

Dilution of samples 20 and 500 times was performed with

0.067 M phosphate buffer at pH 7.4.

3.2. Biomarker analysis

The traditional biomarkers CA 15-3, CEA, and TPA were

analyzed according to the description given in Reference [6].

3.3. Calibration set and test set: validation principle

The number of samples in the present study is low, and in order to get an impression of the possible over-fitting of the ECVA models it was decided to split the 38 samples (one of the 39 samples was excluded as an outlier, see Section 4.1) into two sets: a calibration set consisting of 29 samples and a randomly selected test set consisting of 9 samples (three from each group).

All ECVA models on calibration sets are leave-one-out

cross validated and the number of optimal PLS components

in the inner relation is estimated based on this validation

scheme. The test set samples are not involved in the ECVA

modeling procedure at any time.
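The validation scheme can be sketched as follows. The leave-one-out loop is generic; the nearest-class-mean rule is only a stand-in classifier chosen to keep the example short (it is not ECVA), and all function names and demo data are hypothetical. The group sizes assume that the excluded sample belongs to the multiple metastases group, as stated in Section 4.1.

```python
import numpy as np

def nearest_mean_predict(X_train, y_train, x_new):
    """Stand-in classifier: assign x_new to the group with the closest mean."""
    classes = np.unique(y_train)
    means = np.array([X_train[y_train == c].mean(axis=0) for c in classes])
    return classes[np.argmin(np.linalg.norm(means - x_new, axis=1))]

def loo_errors(X, y, predict=nearest_mean_predict):
    """Leave-one-out cross validation: count misclassified held-out samples."""
    errors = 0
    for i in range(len(y)):
        keep = np.arange(len(y)) != i          # all samples except sample i
        if predict(X[keep], y[keep], X[i]) != y[i]:
            errors += 1
    return errors

# Well-separated toy data: no held-out sample is misclassified
X_demo = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]])
y_demo = np.array([0, 0, 0, 1, 1, 1])
n_errors_demo = loo_errors(X_demo, y_demo)

# Set sizes implied by the text: three test samples drawn from each group
group_sizes = {"C": 13, "S": 11, "M": 14}      # after excluding one M sample
cal_sizes = {name: size - 3 for name, size in group_sizes.items()}
```

The calibration set sizes sum to 29, and the C and S groups together give the 18 calibration and 6 test samples used for the two-group C versus S model.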

3.4. Fluorescence spectroscopic analysis

The serum samples were analyzed using an LS55 Luminescence Spectrometer, Perkin Elmer, Boston, MA, USA. The instrumental settings were: excitation wavelengths recorded at 230 to 400 nm with 10 nm intervals, and emission spectra scanned from 250 to 600 nm with 0.5 nm steps (for the 20 times dilution only excitations from 310 to 400 nm are recorded due to signal saturation at lower excitation wavelengths). The undiluted samples were analyzed in front face mode using the front face accessory produced by Perkin Elmer and a 3 × 10 mm cuvette, while analysis of the diluted samples was performed using an ordinary 10 × 10 mm cuvette in a right angle set-up. The slit widths for the experiments with undiluted and 20 times diluted samples were 5.0 nm (excitation) and 4.5 nm (emission) and for the 500 times dilution experiments the slit widths were 4.0 nm (excitation) and 3.0 nm (emission). Scan rate for all experiments was set to 600 nm/min. For each of the three sample sets the samples were measured in a random order during one day and instrumental changes were not significant as tested by sugar standards measured three times a day [15].

3.5. Software

All multivariate data analyses were performed using

MATLAB version 7.3.0.267 (R2006b) (The MathWorks, Natick,

MA, USA). An ECVA Toolbox [8] is available at

http://www.models.life.ku.dk. Compared to the originally published

ECVA algorithm, the current algorithm has been optimized

with respect to speed using initial loss-less compression and

more efficient use of the algebraic properties of the covariance

matrices. The fluorescence spectra were transferred to

MATLAB using an in-house routine written in MATLAB.

4. RESULTS AND DISCUSSION

4.1. Fluorescence spectroscopy

The spectral data acquired are shown in Figure 1 for the

undiluted, the 20 times diluted and the 500 times diluted

sample sets. The contour plots of the fluorescence spectra for

undiluted, dilution 20 times and dilution 500 times are

included for a selected solitary sample. Due to signal

saturation the spectra with excitation from 230 to 300 nm are

not included at 20 times dilution. For the undiluted data it is

observed that a single sample (multiple metastases) has a

deviating pattern; this is the same sample that deviates in the

first part of the 20 times (excitation 300 and 310 nm) and 500

times (excitation 230 to 270 nm) dilution spectra. The sample

is from one of the two patients who received continuous oral

medication which might explain the deviating pattern and

the sample is excluded from the further data analysis. For the

20 times dilutions a group of five spectra have a higher

intensity clearly seen at excitation wavelengths from 340 nm

and upwards. Four samples are from subjects with solitary

metastases and one sample is from a subject with multiple

metastases. These five samples are not excluded from further

data analysis due to their common patterns.

Each individual set of emission spectra recorded at a given

wavelength is normalized before application of the ECVA.

The results were in general better with than without

normalization so this is chosen as the default pre-processing.

Normalization of spectra is often used in auto-fluorescence

spectroscopy where ratios and spectral shapes sometimes are

more relevant than quantitative measures [4,5,16].

ECVA models on the calibration set (29 samples) were calculated on the concatenated spectral data set (all excitations for all dilutions) which contains 5221 (2043, 1135, and 2043) variables. The first model was calculated in order to discriminate between the three groups (C, S, M) simultaneously. The number of false positives and false negatives for the calibration set (five inner PLS components) and test set are given in Table I. In total five errors are observed out of 38 samples (13.2%).

Table I. Classification results from ECVA on fluorescence spectroscopy data

| Spectral variables and dilutions | Groups | Data set | False positives | False negatives | Total error |
|---|---|---|---|---|---|
| All excitations and all dilutions (5221 variables) | C, S, M | Cal. set (n = 29)* | 2 | 2 | |
| | | Test set (n = 9) | 0 | 1 | 5 |
| | C, S&M | Cal. set (n = 29) | 2 | 5 | |
| | | Test set (n = 9) | 1 | 2 | 10 |
| | C&S, M | Cal. set (n = 29) | 2 | 2 | |
| | | Test set (n = 9) | 1 | 2 | 7 |
| | C, S | Cal. set (n = 18) | 0 | 1 | |
| | | Test set (n = 6) | 1 | 0 | 2 |
| Excitation 230 nm, 500 times dilution | C&S, M | Cal. set (n = 29) | 0 | 0 | |
| | | Test set (n = 9) | 0 | 0 | 0 |

C, control; S, solitary metastases; M, multiple metastases.
* One misclassification between S and M (not included as false positive or false negative).

It is interesting to investigate if it is possible to discriminate

C from S&M; that is, the solitary and multiple metastases

groups are considered as one group. The number of

misclassifications increases to 10 (five inner PLS components)

indicating that the solitary and multiple metastases groups are

quite different in their data structure. The primary challenge is

to discriminate solitary samples (S) from controls (C) in order

to make an early detection of cancer. An alternative

classification strategy is then to discriminate multiple

metastases samples from the other two groups and in

the next step focus only on discrimination of control and

solitary. As observed from the results presented in Table I the

total number of misclassifications is seven for M versus C&S

(pooled, four inner PLS components), while the number of

errors is two when ECVA is used on only C versus S samples

(seven inner PLS components). This sequential approach gives

a total number of errors equal to nine (seven and two), which

is higher than the three-group model.

The result from an iECVA for the M versus C&S model is

given in Figure 2. Several of the intervals (each excitation

wavelength) perform better than the full data model. An

ECVA model on variables 3179–3274 (interval number 29,

two inner PLS components) corresponding to excitation

wavelength 230 nm for the 500 times diluted samples gives

zero misclassifications. Besides the low error this is

interesting for at least two reasons: (1) it is possible to make

a much faster method using one excitation wavelength and

only one dilution and (2) it indicates spectral ranges that are

relevant to investigate with respect to specific fluorophores.

Application of the interval model followed by the C versus

S model (full data) gives two errors out of 38 samples (5.3%).

The drawback of this approach is that the C versus S model is

based on all excitation wavelengths and all dilutions, and no

model based on a single excitation wavelength has a lower

error than the global model (results not shown).

4.2. Biomarkers

Descriptive statistics for CA 15-3, CEA, and TPA are

presented in Table II for each clinical group. Using the

recommended cutoff values [6] the diagnostic results

reported in Table III are obtained, with CA 15-3 as the best

performing biomarker with eight false negatives (seven

solitary, one multiple). For the CEA and TPA biomarkers the

numbers of false negatives are 9 and 12, respectively, and for

all three biomarkers no false positives are detected.
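The cutoff-based classification can be sketched as below. The cutoff values are those listed in Table III; treating a value strictly above the cutoff as positive is an assumption, and the function names are hypothetical. The demo values are the minimum and maximum CA 15-3 values of the control and multiple metastases groups reported in Table II.

```python
# Cutoff values from Table III (units as given there)
CUTOFFS = {"CA 15-3": 30.0, "CEA": 7.5, "TPA": 356.0}

def is_positive(marker, value):
    """Flag a sample as positive when the marker exceeds its cutoff (assumed rule)."""
    return value > CUTOFFS[marker]

def count_errors(marker, values, diseased):
    """Return (false positives, false negatives) for one biomarker."""
    false_pos = sum(1 for v, d in zip(values, diseased)
                    if is_positive(marker, v) and not d)
    false_neg = sum(1 for v, d in zip(values, diseased)
                    if not is_positive(marker, v) and d)
    return false_pos, false_neg

# CA 15-3 extremes from Table II: control min/max, multiple metastases min/max
values_demo = [6.2, 28.1, 18.7, 7200.0]
diseased_demo = [False, False, True, True]
fp_demo, fn_demo = count_errors("CA 15-3", values_demo, diseased_demo)
```

In this small example the highest control value (28.1 kU/L) stays below the cutoff, so no false positive arises, while the lowest multiple metastases value (18.7 kU/L) is a false negative.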

ECVA, which boils down to an ordinary CVA in the case of

a low number of variables, is used to model the three

biomarkers together to see if there is a synergy advantage in

using a combination of biomarkers.

Table IV gives the results for the ECVA model of the

biomarker data (note that log10 is used to transform the

biomarker data prior to ECVA modeling due to skewed

biomarker distributions). The lowest number of errors for the

same type of groups as tested for fluorescence spectroscopy

is seven (for C, S, M). The sequential model (C&S, M followed

by C, S) gives eight errors (seven and one). Calculating

exactly the same models for the best performing biomarker,

CA 15-3, gives an error of eight for the best performing model

(Table IV).

4.3. Subgroupings and outlier detection

By inspection of the emission spectra for each excitation

wavelength, it is observed that the samples from subjects

with multiple metastases can be separated into two

apparently distinct groups according to the presence of an

emission peak at 515–520 nm (excitation 390 nm) for

undiluted samples (Figure 3a). A very clear distinction

between these two spectral signatures is observed also in a

PCA model on these data (not shown). This grouping of

samples from the multiple metastases subjects is not

observed for the diluted sample sets. In Figure 3b raw

emission spectra recorded at excitation 360 nm and 20 times

dilution for all the solitary samples are plotted. Four samples

show a deviating pattern with the same shape for all the four

samples. One of the multiple metastases samples also shows

this spectral pattern, but no control samples have this pattern

so it might be assumed that this pattern only is connected to

samples from diseased subjects.

4.4. Spectral assignments

The serum samples (both undiluted and diluted) are very

complex from a chemical point of view but it is still possible

to provide some indications on the fluorophores or groups of

fluorophores that might cause the spectral differences


Figure 2. Interval ECVA model for C&S versus M based on fluorescence spectroscopic data.

The first 18 intervals correspond to excitation wavelengths 230 to 400 nm for no dilution, the

next 10 intervals correspond to excitation wavelengths 310 to 400 nm for 20 times dilution,

and the last 18 intervals correspond to excitation wavelengths 230 to 400 nm for 500 times

dilution. The bars reflect the number of misclassifications for each model for the calibration set

(n = 29) based on leave-one-out cross validation. The average spectrum (solid line) and the

error using all variables (dotted line) are also shown.

Table II. Characterization of the biomarker data CA 15-3, CEA, and TPA for 39 samples from three different groups: control, solitary metastases, and multiple metastases

Biomarker         Minimum value   Maximum value   Median   Standard deviation
CA 15-3 (kU/L)    6.2             7200            28.1     1146.6
  Control         6.2             28.1            12.6     5.6
  Solitary        6.5             214             23       59.6
  Multiple        18.7            7200            195      1814.6
CEA (µg/L)        1               367             5.9      76.3
  Control         1.3             7.3             2.6      1.8
  Solitary        1               25.3            4.9      7.1
  Multiple        2.6             367             26.6     108
TPA (U/L)         25              8150            82       1838.4
  Control         25              89.7            38       17.2
  Solitary        30.8            183             61       42.3
  Multiple        322             8150            1006     2531.1

Table III. Classification results for all samples based on recommended cutoff values [6]

                  CA 15-3                    CEA                        TPA
Cutoff value      30 kU/L                    7.5 µg/L                   356 U/L
False positives   0                          0                          0
False negatives   8                          9                          12
                  (7 solitary, 1 multiple)   (7 solitary, 2 multiple)   (11 solitary (all), 1 multiple)


Table IV. Classification results from ECVA on biomarker (log10) data

Biomarkers             Group     Data set              False positives   False negatives   Total error
All three biomarkers   C, S, M   Cal. set (n = 29)     1                 4
                                 Test set (n = 9)*     1                 1                 7
                       C, S&M    Cal. set (n = 29)     0                 8
                                 Test set (n = 9)      0                 1                 9
                       C&S, M    Cal. set (n = 29)     0                 0
                                 Test set (n = 9)      0                 1                 1
                       C, S      Cal. set (n = 18)     2                 3
                                 Test set (n = 6)      1                 1                 7
CA 15-3                C, S, M   Cal. set (n = 29)**   1                 4
                                 Test set (n = 9)***   2                 1                 8
                       C, S&M    Cal. set (n = 29)     1                 6
                                 Test set (n = 9)      0                 2                 9
                       C&S, M    Cal. set (n = 29)     1                 2
                                 Test set (n = 9)      1                 1                 5
                       C, S      Cal. set (n = 18)     1                 4
                                 Test set (n = 6)      2                 1                 8

* One misclassification between S and M (not included as false positive or false negative).
** Three misclassifications between S and M (not included as false positive or false negative).
*** Two misclassifications between S and M (not included as false positive or false negative).


observed. A selection of the identified fluorophores in serum at 20 or 500 times dilution are tyrosine (ex/em = 275 nm/300 nm), free and bound tryptophan (ex/em = 280–290 nm/320–350 nm), indoxyl sulfate (ex/em = 290 nm/385 nm), 3-hydroxyanthranilic acid (ex/em = 325 nm/425 nm), 5-hydroxyanthranilic acid (ex/em = 340 nm/430 nm), 4-pyridoxic acid (ex/em = 315–320 nm/425–440 nm), pyridoxal phosphate Schiff base (imine form) (ex/em = 325 nm/430 nm), free NAD(P)H (ex/em = 345 nm/460 nm), enzyme-bound NAD(P)H (ex/em = 340–350 nm/440–480 nm), pyridoxic acid lactone (ex/em = 365 nm/425 nm), riboflavin/FMN (ex/em = 370 nm/500–520 nm), pyridoxal phosphate Schiff base (enamine form) (ex/em = 410 nm/510 nm) (see Wolfbeis and Leiner [4] and references therein for assignments). A selection of the mentioned chemical components are also involved in other types of cancer [16–18]. A hypothesis from the present study is that the developed method can also be applied in the diagnosis of these. The assignments of spectral peaks are tentative and could be backed up using a chromatographic technique with, for example, fluorescence and ultraviolet/visible detector in order to obtain a unique and complete identification [19]. Alternatively, it might be possible in future studies to apply the PARAFAC [20] technique for mathematical de-convolution of the spectral peaks contributing to the total signals. It will be of great interest also to further investigate the observed spectral subgroupings.

Figure 3. (a) Raw emission spectra recorded at excitation 390 nm (no dilution) for the multiple metastases samples. Two subgroups are identified based on the presence or absence of the peak at 515–520 nm. (b) Raw emission spectra recorded at excitation 360 nm and 20 times dilution for the solitary samples. Four samples (gray) show a deviating pattern with the same shape, separating the solitary group into two subgroups.

It is important to stress that the spectral differences

observed between the three groups might not necessarily be

ascribed to one or several fluorophores but also to the

chemical environment of these fluorophores.
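For reference, the trilinear PARAFAC model mentioned earlier in this section as a possible de-convolution tool [20] can be written as below for a set of fluorescence landscapes. The index and symbol names are notational choices made here for illustration, following common usage in the PARAFAC literature; they are not defined in the present paper, and no PARAFAC model was fitted in this study.

```latex
x_{ijk} = \sum_{f=1}^{F} a_{if}\, b_{jf}\, c_{kf} + e_{ijk}
```

Here $x_{ijk}$ is the intensity of sample $i$ at emission wavelength $j$ and excitation wavelength $k$, $a_{if}$ is the score of component $f$ in sample $i$ (ideally proportional to its concentration), $b_{jf}$ and $c_{kf}$ are the emission and excitation loadings of component $f$, $F$ is the number of components, and $e_{ijk}$ is the residual. Under ideal conditions each component would correspond to one fluorophore, which is why the technique is attractive for backing up the tentative assignments.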

4.5. Dilutions and other chemometric models

The fluorescence signals saturate in samples measured at

20 times dilution (Figure 1), that is, the recorded signals are

higher than for the undiluted samples. This is due to intense

quenching and inner filter effects in the undiluted highly

concentrated sample. When such a sample is diluted, more

light is allowed to escape at the low excitation wavelengths,

giving rise to detector saturation. Another point for further

investigation is whether the selected dilution levels are optimal for

obtaining discrimination. We chose to use the dilutions

suggested by Wolfbeis and Leiner [4], but it is possible that

other dilutions might also be relevant; for instance, a gradient of

many dilutions can be measured by using, for example, a

flow injection analysis system [21].
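The observation that a diluted sample can give a higher signal than the undiluted one can be illustrated with a rough numerical sketch. It assumes the widely used first-order inner filter approximation, in which the observed signal equals the ideal, concentration-proportional signal attenuated by a factor 10^(-(A_ex + A_em)/2). This approximation, the function name, and the absorbance values are assumptions chosen for illustration; they are not measurements or methods from this study, the approximation is crude at high absorbance, and detector saturation is not modeled.

```python
def observed_fluorescence(dilution_factor, a_ex_undiluted, a_em_undiluted):
    """Relative observed signal under a first-order inner filter approximation.

    dilution_factor: 1 for undiluted, 20 for a 20 times dilution, and so on.
    a_ex_undiluted, a_em_undiluted: hypothetical absorbances of the undiluted
    sample at the excitation and emission wavelengths.
    """
    if dilution_factor < 1:
        raise ValueError("dilution_factor must be >= 1")
    concentration = 1.0 / dilution_factor      # relative fluorophore concentration
    a_ex = a_ex_undiluted * concentration      # absorbance scales with concentration
    a_em = a_em_undiluted * concentration
    attenuation = 10.0 ** (-(a_ex + a_em) / 2.0)
    return concentration * attenuation         # ideal signal times attenuation


# Hypothetical absorbances for a strongly absorbing undiluted sample.
A_EX, A_EM = 4.0, 1.0

undiluted = observed_fluorescence(1, A_EX, A_EM)
diluted_20 = observed_fluorescence(20, A_EX, A_EM)
diluted_500 = observed_fluorescence(500, A_EX, A_EM)

print("undiluted:", undiluted)
print("20 times dilution:", diluted_20)
print("500 times dilution:", diluted_500)
```

With these assumed absorbances the 20 times diluted sample gives a larger relative signal than the undiluted sample, because the loss of fluorophore concentration is more than compensated by the reduced attenuation. At 500 times dilution the attenuation is negligible and the signal is governed mainly by concentration.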

When more samples become available in future studies, it

is plausible that more advanced chemometric methods,

including non-linear methods and wavelength/range selection

methods, can be applied with even better results.

5. CONCLUSIONS

In this study a multivariate classification model of fluorescence spectroscopic data is demonstrated to yield a promisingly low number of false positives and negatives for the 38 samples under study. The results compare favorably to results using the biomarkers CA 15-3, CEA, and TPA. It should be stressed that the number of samples available in this study is too small to provide conclusive evidence that fluorescence spectroscopy and chemometrics is a suitable approach for detecting breast cancer, but the study definitely shows that fluorescence spectroscopic analysis on intact and diluted serum samples in combination with multivariate data analysis has the potential to become a valuable tool for improving breast cancer diagnosis and monitoring and for detecting deviating samples and discovering subgroups. The subgrouping is especially important and can be helpful, for example, in developing targeted treatment or medicine. It is also possible that other types of cancer can be detected with this technique.

We suggest that the combination of medical data and

multivariate data analysis be termed medicometrics, as also

suggested by Bailly [22], although in a different context.

REFERENCES

1. Lakowicz JR. Principles of Fluorescence Spectroscopy (2nd edn). Kluwer Academic/Plenum Publishers: New York, 1999.

2. Leiner M, Wolfbeis OS, Schaur RJ, Tillian HM. Fluorescence topography in biology. 1. Ultraviolet fluorescence topograms of rat sera and decrease of tryptophan fluorescence in Yoshida ascites hepatoma-bearing rats. IRCS Med. Sci.-Biochem. 1983; 11: 675–676.

3. Leiner M, Schaur RJ, Wolfbeis OS, Tillian HM. Fluorescence topography in biology. 2. Visible fluorescence topograms of rat sera and cluster-analysis of fluorescence parameters of sera of Yoshida ascites hepatoma-bearing rats. IRCS Med. Sci.-Biochem. 1983; 11: 841–842.

4. Wolfbeis OS, Leiner M. Mapping of the total fluorescence of human-blood serum as a new method for its characterization. Anal. Chim. Acta. 1985; 167: 203–215.

5. Leiner MJP, Schaur RJ, Desoye G, Wolfbeis OS. Fluorescence topography in biology. 3. Characteristic deviations of tryptophan fluorescence in sera of patients with gynecological tumors. Clin. Chem. 1986; 32: 1974–1978.

6. Soletormos G, Schioler V, Nielsen D, Skovsgaard T, Dombernowsky P. Interpretation of results for tumor-markers on the basis of analytical imprecision and biological variation. Clin. Chem. 1993; 39: 2077–2083.

7. Christensen J, Nørgaard L, Bro R, Engelsen SB. Multivariate autofluorescence of intact food systems. Chem. Rev. 2006; 106: 1979–1994.

8. Norgaard L, Bro R, Westad F, Engelsen SB. A modification of Canonical Variates Analysis to handle highly collinear multivariate data. J. Chemometr. 2006; 20: 425–435.

9. Krzanowski WJ. Principles of Multivariate Analysis (Revised edn). Oxford University Press: New York, USA, 2000.

10. Naes T, Indahl U. A unified description of classical classification methods for multicollinear data. J. Chemometr. 1998; 12: 205–220.

11. Duda RO, Hart PE, Stork DG. Pattern Classification (2nd edn). John Wiley & Sons: New York, USA, 2001.

12. Norgaard L, Saudland A, Wagner J, Nielsen JP, Munck L, Engelsen SB. Interval partial least-squares regression (iPLS): a comparative chemometric study with an example from near-infrared spectroscopy. Appl. Spectrosc. 2000; 54: 413–419.

13. Munck L. A new holistic exploratory approach to Systems Biology by Near Infrared spectroscopy evaluated by chemometrics and data inspection. J. Chemometr. (in press).

14. Soletormos G, Nielsen D, Schioler V, Skovsgaard T, Dombernowsky P. Tumor markers cancer antigen 15.3, carcinoembryonic antigen, and tissue polypeptide antigen for monitoring metastatic breast cancer during first-line chemotherapy and follow-up. Clin. Chem. 1996; 42: 564–575.

15. Norgaard L. Direct standardization in multi wavelength fluorescence spectroscopy. Chemometr. Intell. Lab. Syst. 1995; 29: 283–293.

16. Masilamani V, Al-Zhrani K, Al-Salhi M, Al-Diab A, Al-Ageily M. Cancer diagnosis by autofluorescence of blood components. J. Lumin. 2004; 109: 143–154.

17. Aiken JH, Huie CW, Terzian JA. Characteristic visible fluorescence emission-spectra of sera from cancer-patients. Anal. Lett. 1994; 27: 511–521.

18. Hubmann MR, Leiner MJP, Schaur RJ. Ultraviolet fluorescence of human sera. 1. Sources of characteristic differences in the ultraviolet fluorescence-spectra of sera from normal and cancer-bearing humans. Clin. Chem. 1990; 36: 1880–1883.

19. Roth M, Uebelhart D. Liquid chromatography with fluorescence detection in the analysis of biological fluids. Anal. Lett. 2000; 33: 2353–2372.

20. Bro R. PARAFAC. Tutorial and applications. Chemometr. Intell. Lab. Syst. 1997; 38: 149–171.

21. Ruzicka J, Hansen EH. Flow Injection Analysis (2nd edn). John Wiley & Sons: New York, USA, 1988.

22. Bailly A. Cardiovascular prevention under the scrutiny of medicometry. Schweiz. Med. Wochenschr. 1995; 125: 2487–2493.

J. Chemometrics 2007; 21: 451–458. DOI: 10.1002/cem