JOURNAL OF CHEMOMETRICS
J. Chemometrics 2007; 21: 451–458
Published online 15 July 2007 in Wiley InterScience (www.interscience.wiley.com) DOI: 10.1002/cem.1042

Fluorescence spectroscopy and chemometrics for classification of breast cancer samples—a feasibility study using extended canonical variates analysis

Lars Nørgaard 1*, György Sölétormos 2, Niels Harrit 3, Morten Albrechtsen 4, Ole Olsen 5, Dorte Nielsen 6, Kristoffer Kampmann 3,5 and Rasmus Bro 1

1 Department of Food Science, The Faculty of Life Sciences, University of Copenhagen, DK-1958 Frederiksberg C, Denmark
2 Department of Clinical Biochemistry, Copenhagen University Hospital Hillerød, DK-3400 Hillerød, Denmark
3 Department of Chemistry, University of Copenhagen, Universitetsparken 5, DK-2100 Copenhagen, Denmark
4 Wexotec Management, Holgersvej 15, DK-2920 Charlottenlund, Denmark
5 Medico-Chemical Lab, Skelstedet 5, DK 2950 Vedbæk, Denmark
6 Department of Oncology, Copenhagen University Hospital Herlev, Herlev Ringvej 75, DK-2730 Herlev, Denmark

*Correspondence to: L. Nørgaard, Department of Food Science, Faculty of Life Sciences, University of Copenhagen, Chemometrics Group, Rolighedsvej 30, DK-1958 Frederiksberg C, Denmark. E-mail: [email protected]

Received 9 February 2007; Accepted 23 March 2007

The objective of this phase I feasibility study is to investigate whether fluorescence spectroscopy of serum samples in combination with multivariate data analysis can be used to discriminate healthy females from breast cancer patients with solitary and multiple metastases, respectively. Serum samples were obtained from 39 females: 13 healthy females (controls) and 26 clinically diagnosed patients with either solitary metastases (11 patients) or multiple metastases (15 patients). Fluorescence spectra were measured on undiluted samples and samples diluted 20 times and 500 times. Extended Canonical Variates Analysis (ECVA) was applied to develop classification models on the data. Three-group ECVA based on all spectroscopic data (5221 variables) gave five misclassifications in total, while sequential ECVA models on selected excitation wavelengths yielded two errors. The fluorescence spectroscopic results were compared with results based on the three tumor markers cancer antigen 15-3 (CA 15-3), carcinoembryonic antigen (CEA), and tissue polypeptide antigen (TPA). The lowest number of errors obtained using ECVA on the biomarkers was seven. Furthermore, fluorescence spectroscopy made it possible to discover sample subgroupings: females with solitary and multiple metastases could be divided into two subgroups according to the spectral patterns of the samples.

Copyright © 2007 John Wiley & Sons, Ltd.

KEYWORDS: cancer; fluorescence; biomarkers; chemometrics; multivariate; ECVA

1. INTRODUCTION

Fluorescence spectroscopy is a sensitive, specific, and fast tool for detection of micro-environment changes and molecular interactions in complex samples [1]. As a consequence, a fluorescence landscape measured directly on a diluted serum sample yields a characteristic multivariate spectroscopic pattern. Wolfbeis, Leiner, and co-workers introduced the hypothesis that this pattern of fluorescence in diluted serum samples contains information about the health status of the individual. Their first analyses with excitation–emission (2D) fluorescence spectroscopy on diluted blood/serum samples were published in the 1980s. They analyzed diluted serum from healthy and tumor-induced rats with ultraviolet [2] (500 times dilution) and visible [3] (20 times dilution) 2D fluorescence spectroscopy. Comparison of the recorded signals was performed by inspection of the raw data, inspection of difference matrices, and in the last-mentioned paper also by a cluster analysis on two selected fluorescence intensities. Later they introduced fluorescence spectroscopic measurements on diluted human serum samples [4] and they also compared 2D fluorescence signals measured on diluted serum from subjects with gynecological tumors with serum from healthy subjects [5]. They demonstrated that fluorescence spectroscopy measured on diluted, but otherwise untreated, serum samples was an attractive way of obtaining information because of the complex pattern obtained, but at the same time they encountered problems in extracting the relevant information. Hence, their approach was not operational at that time.
In the present study, excitation–emission fluorescence spectroscopic measurements on undiluted, 20 times diluted, and 500 times diluted serum samples are combined with multivariate chemometric modeling for discriminating subjects from two groups of patients and a control group. Thus, we use essentially the same instrumental setup as did Wolfbeis and Leiner [4], but we add to that the measurement of undiluted samples as well as the use of multivariate data analysis.

Measuring undiluted samples is typically avoided in traditional chemical analysis because it can lead to quenching and other poorly characterized phenomena. However, with the introduction of multivariate data analysis, this may not be a problem, and the lack of a dilution step would provide a significant practical advantage. The focus of the present study is to show how a multivariate classification technique can be applied to extract information from the complex spectroscopic fingerprints. The results obtained are compared with the results from using biomarker diagnostics (cancer antigen 15-3 (CA 15-3), carcinoembryonic antigen (CEA), and tissue polypeptide antigen (TPA) [6]) on the same subjects. Both the fluorescence spectroscopic and biomarker methods are compared to the clinical diagnoses of the subjects.
2. THEORY
2.1. Fluorescence spectroscopy

Fluorescence refers to light emission (luminescence) by electron transfer in the singlet state when molecules are excited by photons [1,7]. Fluorescence is a three-stage process that occurs in certain molecules called fluorophores: (1) the fluorophore is excited to an electronic singlet state by absorption of an external photon; (2) the excited state undergoes conformational changes and interacts with the molecular environment in a number of different ways, including vibrational relaxation, quenching, and energy transfer; (3) a photon is emitted at a longer wavelength, while the fluorophore returns to its ground state. The fluorescence excitation and emission of light typically appear within nanoseconds. The molecular structure and environment are decisive for whether a compound is fluorescent, and fluorescence is often exhibited by organic compounds with rigid molecular skeletons, usually polyaromatic hydrocarbons and heterocycles.

Fluorescence is unique among spectroscopic techniques because it is inherently multidimensional and fluorophores have independent and specific spectral excitation and emission profiles. These profiles can be measured as excitation and emission spectra or as a complete excitation–emission matrix (EEM), also known as a fluorescence landscape. An example of a fluorescence landscape (contour plot) of a serum sample is depicted in Figure 1 (left). Fluorescence spectroscopy is also a very sensitive analytical method (compared to, e.g., absorption-based spectrophotometry) with possibilities to measure down to parts per billion levels. The fluorescence signals are ideally additive in mixtures; that is, the overall fluorescence signal of a given sample can be expressed as the sum of the fluorescence contributions from each of the inherent fluorophores. However, in complex mixtures such as serum samples, the fluorescence may not be additive due to quenching phenomena and interactions with the molecular environment of the fluorophore.

Figure 1. Left: Contour plots of fluorescence spectra of a selected solitary sample at no dilution, 20 times dilution, and 500 times dilution (diagonal lines in the plots are due to Rayleigh scattering). For 20 times dilution the signals saturate the detector at excitation wavelengths from 230 to 300 nm and these data are not included. Right: The concatenated spectra for the 39 analyzed samples for each of the three measurement conditions. Data from undiluted and 500 times dilution contain emission spectra from 18 excitation wavelengths (from 230 to 400 nm with 10 nm intervals), while the 20 times dilution data contains emission spectra from 10 excitation wavelengths. This figure is available online at www.interscience.wiley.com/journal/cem
2.2. Extended Canonical Variates Analysis (ECVA)

ECVA [8] was recently developed as an extension of Canonical Variates Analysis (CVA), which is a well-established method for classification of full rank data. ECVA is developed to handle highly collinear and low rank data, for example, spectroscopic data tables where the number of variables is larger than the number of samples. ECVA estimates directions in space that maximize the differences between the groups in the data according to a well-defined optimization criterion. The following paragraphs are based on reference [8] and references therein; [9] and [10] are of particular interest.
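Assume a data matrix $\mathbf{X}$ ($n \times v$) where the samples are from $g$ different groups with $n_i$ samples in the $i$th group ($n = \sum_{i=1}^{g} n_i$).

The within-group covariance matrix is defined as

$$\mathbf{S}_{\mathrm{within}} = \frac{1}{n-g}\sum_{i=1}^{g}\sum_{j=1}^{n_i}(\mathbf{x}_{ij}-\bar{\mathbf{x}}_i)(\mathbf{x}_{ij}-\bar{\mathbf{x}}_i)' \tag{1}$$

and the between-group covariance matrix is defined as

$$\mathbf{S}_{\mathrm{between}} = \frac{1}{g-1}\sum_{i=1}^{g} n_i(\bar{\mathbf{x}}_i-\bar{\mathbf{x}})(\bar{\mathbf{x}}_i-\bar{\mathbf{x}})' \tag{2}$$

where $\mathbf{x}_{ij}$ is the $j$th sample in the $i$th group (represented as a column vector), $\bar{\mathbf{x}}_i = \frac{1}{n_i}\sum_{j=1}^{n_i}\mathbf{x}_{ij}$ is the mean vector in the $i$th group, and $\bar{\mathbf{x}} = \frac{1}{n}\sum_{i=1}^{g}\sum_{j=1}^{n_i}\mathbf{x}_{ij} = \frac{1}{n}\sum_{i=1}^{g} n_i\bar{\mathbf{x}}_i$ is the overall mean vector. Note that the dimensions of $\mathbf{S}_{\mathrm{within}}$ and $\mathbf{S}_{\mathrm{between}}$ are $v \times v$.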
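As an illustration of Equations (1) and (2), the following is a minimal NumPy sketch, not the ECVA Toolbox code. The function name `scatter_matrices` and the six-sample toy data are hypothetical and chosen only so the result can be checked by hand.

```python
import numpy as np

def scatter_matrices(X, labels):
    """Return (S_within, S_between) following Equations (1) and (2).

    X      : (n, v) array with one sample per row
    labels : length-n sequence of group labels
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    groups = np.unique(labels)
    n, v = X.shape
    g = len(groups)
    grand_mean = X.mean(axis=0)

    S_within = np.zeros((v, v))
    S_between = np.zeros((v, v))
    for grp in groups:
        Xi = X[labels == grp]
        mean_i = Xi.mean(axis=0)
        centered = Xi - mean_i
        S_within += centered.T @ centered            # inner double sum of Eq. (1)
        d = mean_i - grand_mean
        S_between += Xi.shape[0] * np.outer(d, d)    # n_i-weighted term of Eq. (2)
    return S_within / (n - g), S_between / (g - 1)

# Hypothetical toy data: 6 samples, 2 variables, 2 groups.
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 3.0],
              [6.0, 5.0], [7.0, 7.0], [8.0, 6.0]])
labels = [0, 0, 0, 1, 1, 1]
S_w, S_b = scatter_matrices(X, labels)
```

A quick sanity check is that $(n-g)\,\mathbf{S}_{\mathrm{within}} + (g-1)\,\mathbf{S}_{\mathrm{between}}$ equals the total scatter of the mean-centered data.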
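It is now possible to define standard CVA as the problem of finding a direction, $\mathbf{w}$, that maximizes

$$J(\mathbf{w}) = \frac{\mathbf{w}'\mathbf{S}_{\mathrm{between}}\mathbf{w}}{\mathbf{w}'\mathbf{S}_{\mathrm{within}}\mathbf{w}} \tag{3}$$

The solution can be written as an eigenvector equation

$$\mathbf{S}_{\mathrm{between}}\mathbf{w} = \lambda\,\mathbf{S}_{\mathrm{within}}\mathbf{w} \tag{4}$$

If $\mathbf{S}_{\mathrm{within}}$ is non-singular we have

$$\mathbf{S}_{\mathrm{within}}^{-1}\mathbf{S}_{\mathrm{between}}\mathbf{w} = \lambda\,\mathbf{w} \tag{5}$$

which is an eigenvalue problem, where $\lambda$ is the eigenvalue and $\mathbf{w}$ is the eigenvector.

If $\mathbf{S}_{\mathrm{within}}$ is singular, left-multiplication by the inverse of $\mathbf{S}_{\mathrm{within}}$ is not possible, and this is why standard CVA breaks down when analyzing, for example, multicollinear data.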
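The singularity is easy to reproduce numerically. The sketch below uses hypothetical random data (10 samples, 50 variables, 2 groups) rather than the serum spectra. Since the rank of $\mathbf{S}_{\mathrm{within}}$ cannot exceed $n - g$, a $50 \times 50$ matrix built from 10 samples in 2 groups has rank at most 8 and cannot be inverted. By the same bound, a calibration set of 29 samples in three groups gives rank at most 26, against 5221 variables.

```python
import numpy as np

rng = np.random.default_rng(0)
n_per_group, v = 5, 50            # hypothetical sizes: 10 samples, 50 variables
X = rng.normal(size=(2 * n_per_group, v))
labels = np.repeat([0, 1], n_per_group)

# Pooled within-group covariance, Equation (1), for g = 2 groups.
S_within = np.zeros((v, v))
for grp in (0, 1):
    Xi = X[labels == grp]
    centered = Xi - Xi.mean(axis=0)
    S_within += centered.T @ centered
S_within /= (len(X) - 2)

rank = np.linalg.matrix_rank(S_within)
print(f"rank(S_within) = {rank} for a {v} x {v} matrix")
```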
In ECVA, the following is suggested: for the two-group situation Equation (4) can be rewritten as [11]

$$(\bar{\mathbf{x}}_1-\bar{\mathbf{x}}_2)(\bar{\mathbf{x}}_1-\bar{\mathbf{x}}_2)'\mathbf{w} = \lambda\,\mathbf{S}_{\mathrm{within}}\mathbf{w} \tag{6}$$

$(\bar{\mathbf{x}}_1-\bar{\mathbf{x}}_2)'\mathbf{w}$ is a scalar, $k$, so the equation can be written as

$$(\bar{\mathbf{x}}_1-\bar{\mathbf{x}}_2)\,k = \lambda\,\mathbf{S}_{\mathrm{within}}\mathbf{w} \tag{7}$$

Equation (7) is then transformed into a multivariate regression problem

$$\mathbf{y} = \mathbf{R}\mathbf{b} + \mathbf{f} \tag{8}$$

where $\mathbf{y} = (\bar{\mathbf{x}}_1-\bar{\mathbf{x}}_2)$ is the dependent variable, $\mathbf{R} = \mathbf{S}_{\mathrm{within}}$ contains the independent variables, and $\mathbf{b} = \mathbf{w}$ is the regression vector. The vector $\mathbf{f}$ holds the residuals. Since $k$ and $\lambda$ are constants they do not change the direction of $\mathbf{w}$ in Equations (7) and (8). In ECVA, Equations (7) and (8) are solved with a PLS regression method. By multiplication of the mean centered data matrix, $\mathbf{X}_{\mathrm{MC}}$, with the weight vector, $\mathbf{w}$, the canonical variates, $\mathbf{t}_{\mathrm{CV}}$, are obtained ($\mathbf{t}_{\mathrm{CV}} = \mathbf{X}_{\mathrm{MC}}\mathbf{w}$). The mean centering is performed using the mean vector of all calibration samples (the same mean vector as used in the calculation of $\mathbf{S}_{\mathrm{between}}$). The calculated canonical variates can be used directly in, for example, a Linear Discriminant Analysis (LDA) classifier (described later).
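The two-group procedure can be sketched as follows. This is a simplified illustration and not the ECVA Toolbox implementation: `pls1_coefficients` is a basic NIPALS PLS1 written for this example, it applies no centering or scaling to $\mathbf{R}$ and $\mathbf{y}$ (an assumption), and the toy data are hypothetical. With a full-rank $\mathbf{S}_{\mathrm{within}}$ and all PLS components, the result coincides with $\mathbf{S}_{\mathrm{within}}^{-1}(\bar{\mathbf{x}}_1-\bar{\mathbf{x}}_2)$; with fewer components the same code still runs when $\mathbf{S}_{\mathrm{within}}$ is singular.

```python
import numpy as np

def pls1_coefficients(R, y, n_components):
    """Regression vector b for y ~ R b from a basic NIPALS PLS1 (no centering)."""
    R_res = np.array(R, dtype=float)
    y_res = np.array(y, dtype=float)
    W, P, q = [], [], []
    for _ in range(n_components):
        w = R_res.T @ y_res
        w /= np.linalg.norm(w)           # weight vector
        t = R_res @ w                    # score vector
        tt = t @ t
        p = R_res.T @ t / tt             # loading of R
        q_a = y_res @ t / tt             # loading of y
        R_res -= np.outer(t, p)          # deflate R
        y_res -= q_a * t                 # deflate y
        W.append(w); P.append(p); q.append(q_a)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    return W @ np.linalg.solve(P.T @ W, q)

def ecva_two_group(X, labels, n_components):
    """Canonical weight vector w and canonical variates t_CV for two groups."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    g1, g2 = np.unique(labels)
    X1, X2 = X[labels == g1], X[labels == g2]
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    S_within = ((X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)) / (len(X) - 2)
    w = pls1_coefficients(S_within, m1 - m2, n_components)   # Equation (8)
    t_cv = (X - X.mean(axis=0)) @ w                          # t_CV = X_MC w
    return w, t_cv

# Hypothetical toy data: 6 samples, 2 variables, 2 groups.
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 3.0],
              [6.0, 5.0], [7.0, 7.0], [8.0, 6.0]])
labels = np.array([0, 0, 0, 1, 1, 1])
w, t_cv = ecva_two_group(X, labels, n_components=2)
```

In the study itself the number of PLS components is chosen by leave-one-out cross validation rather than fixed in advance.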
When more than two groups are considered the directions can also be estimated from Equation (4), as there will generally be more than one eigenvalue/eigenvector pair:

$$\mathbf{S}_{\mathrm{between}}\mathbf{w}_a = \lambda_a\,\mathbf{S}_{\mathrm{within}}\mathbf{w}_a \tag{9}$$

where $a$ is the number of directions. Equation (9) has $a = \min(v, g-1)$ non-zero eigenvalues and the maximum dimensionality for the canonical space is thus $a$. For high-dimensional data (large $v$) the maximum number of canonical variates is $g-1$ (the number of groups minus one).

The regression equation for the multigroup case is

$$\mathbf{Y} = \mathbf{R}\mathbf{B} + \mathbf{F} \tag{10}$$

where $\mathbf{Y}$ contains the differences $(\bar{\mathbf{x}}_i-\bar{\mathbf{x}})$ as column vectors, $\mathbf{R}$ is $\mathbf{S}_{\mathrm{within}}$ (as in the two-group case, Equation (8)), and the columns of $\mathbf{B}$ are $\mathbf{w}_a$ (designated as $\mathbf{W}$ in the following). $\mathbf{F}$ is the residual matrix. The dimension of $\mathbf{Y}$, $\mathbf{B}$, and $\mathbf{F}$ is $v \times g$.
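The count of $g-1$ non-zero eigenvalues can be checked numerically in the full-rank case, where Equation (5) applies directly. The sketch below uses hypothetical simulated data (three groups, five variables, 20 samples per group); the group centers, noise level, and the relative threshold for "non-zero" are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
g, n_i, v = 3, 20, 5                       # hypothetical: 3 groups, 5 variables
centers = rng.normal(scale=3.0, size=(g, v))
X = np.vstack([c + rng.normal(size=(n_i, v)) for c in centers])
labels = np.repeat(np.arange(g), n_i)

grand_mean = X.mean(axis=0)
S_w = np.zeros((v, v))
S_b = np.zeros((v, v))
for i in range(g):
    Xi = X[labels == i]
    mi = Xi.mean(axis=0)
    S_w += (Xi - mi).T @ (Xi - mi)
    S_b += n_i * np.outer(mi - grand_mean, mi - grand_mean)
S_w /= (len(X) - g)
S_b /= (g - 1)

# Eigenvalues of S_within^{-1} S_between, Equation (5); valid because n >> v here.
eigvals = np.sort(np.linalg.eigvals(np.linalg.solve(S_w, S_b)).real)[::-1]
n_nonzero = int(np.sum(eigvals > 1e-8 * eigvals[0]))
print("non-zero eigenvalues:", n_nonzero)
```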
PLS2 is used as the regression technique to solve Equation (10). The number of weights calculated corresponds to the number of groups, and the weights are sorted according to their values when inserted one-by-one in the optimization criterion (Equation (3)). The weight with the lowest value is left out before application of the classifier because there is an intrinsic rank deficiency due to the closure properties of the dependent variables. The properties of PLS2 will ensure that the space spanned by the retained $g-1$ weights covers the full space of the solution, which is all that is needed.

By multiplication of the mean centered data matrix, $\mathbf{X}_{\mathrm{MC}}$, with the canonical weights matrix, $\mathbf{W}$, the canonical variates, $\mathbf{T}_{\mathrm{CV}}$, are obtained ($\mathbf{T}_{\mathrm{CV}} = \mathbf{X}_{\mathrm{MC}}\mathbf{W}$) and LDA is then applied on these. The advantage of ECVA is that no dimension-reducing step is necessary before the classifier is applied to the canonical variates; the discriminative directions are estimated directly in the original multidimensional space, which is of interest for, for example, spectroscopic applications.
2.3. LDA classifier

An LDA [9] was used as the classifier; the discriminant function for the canonical variates is

$$L_i(\mathbf{t}) = \log(p_i) - \frac{1}{2}(\mathbf{t}-\bar{\mathbf{t}}_i)'\,\mathbf{S}^{-1}_{\mathrm{within},\mathbf{T}_{\mathrm{CV}}}(\mathbf{t}-\bar{\mathbf{t}}_i) + \log\left|\mathbf{S}_{\mathrm{within},\mathbf{T}_{\mathrm{CV}}}\right| \tag{11}$$

where $i$ is a group index ($1,\ldots,g$), $\mathbf{t}$ contains the canonical variates (as a column vector) for the sample to be classified, $\bar{\mathbf{t}}_i$ is the mean vector of the canonical variates for group $i$, and $\mathbf{S}_{\mathrm{within},\mathbf{T}_{\mathrm{CV}}}$ is the pooled covariance matrix for the canonical variates (an analog to $\mathbf{S}_{\mathrm{within}}$ for the raw data presented above). The prior, $p$, was selected as equal probabilities; for example, if three groups are analyzed the prior is 1/3 for each group. The sample is classified to the group that gives the highest value of $L_i$.
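Equation (11) translates almost line by line into code. The sketch below is an illustration only: the function name `lda_scores`, the group means, and the identity covariance are hypothetical values for a three-group problem with two canonical variates.

```python
import numpy as np

def lda_scores(t, group_means, S_pooled, priors=None):
    """Evaluate L_i(t) from Equation (11) for every group i.

    t           : (k,) canonical variates of the sample to classify
    group_means : (g, k) mean canonical variates, one row per group
    S_pooled    : (k, k) pooled within-group covariance of the canonical variates
    priors      : (g,) prior probabilities; equal priors if omitted
    """
    t = np.asarray(t, dtype=float)
    group_means = np.asarray(group_means, dtype=float)
    S_pooled = np.asarray(S_pooled, dtype=float)
    g = group_means.shape[0]
    priors = np.full(g, 1.0 / g) if priors is None else np.asarray(priors, dtype=float)
    S_inv = np.linalg.inv(S_pooled)
    _, logdet = np.linalg.slogdet(S_pooled)      # log|S|, same for all groups
    scores = []
    for mean_i, p_i in zip(group_means, priors):
        d = t - mean_i
        scores.append(np.log(p_i) - 0.5 * d @ S_inv @ d + logdet)
    return np.array(scores)

# Hypothetical example: three groups, two canonical variates, equal priors.
group_means = [[-2.0, 0.0], [0.0, 1.5], [2.0, 0.0]]
S_pooled = np.eye(2)
scores = lda_scores([1.8, 0.2], group_means, S_pooled)
predicted_group = int(np.argmax(scores))         # index of the highest L_i
```

Because the covariance matrix is pooled, the log-determinant term is identical for all groups and does not affect which group attains the highest score; with equal priors the rule reduces to assigning the sample to the nearest group mean in the Mahalanobis sense.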
2.4. Interval ECVA models (iECVA)

Based on the same idea as interval PLS modeling [12], iECVA modeling is programmed and it is intended as an exploratory multivariate modeling tool. See, for example, the paper by Munck [13] in this volume. In the present case an ECVA model is calculated for each excitation wavelength recorded for each of the dilutions.
2.5. Normalization of individual emission spectra for each excitation wavelength

Each emission spectrum ($\mathbf{x}$) at a given excitation wavelength is normalized to length one: $\mathbf{x}_{\mathrm{norm}} = \mathbf{x} / \lVert\mathbf{x}\rVert$, where the norm is the Euclidean length.
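A minimal sketch of this pre-processing step, assuming the emission spectra for one excitation wavelength are stored as the rows of an array (the function name and the toy values are hypothetical):

```python
import numpy as np

def normalize_emission_spectra(spectra):
    """Scale each row (one emission spectrum) to unit Euclidean length."""
    spectra = np.asarray(spectra, dtype=float)
    norms = np.linalg.norm(spectra, axis=1, keepdims=True)
    return spectra / norms   # an all-zero spectrum would need special handling

# Hypothetical toy spectra: two samples, two emission wavelengths.
raw = np.array([[3.0, 4.0],
                [10.0, 0.0]])
normalized = normalize_emission_spectra(raw)
```

After normalization only the spectral shape is retained; the overall intensity of each emission spectrum is removed.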
3. MATERIAL AND METHODS
3.1. Samples

The serum samples were selected from three sample banks collected previously [6,14]. The selected samples were collected from 1988 to 1992 and stored at −80 °C with 24-hour surveillance. None of the samples had been thawed previously. We investigated three groups of females. Group 1, the control (C) group, consisted of 13 healthy subjects [6]. Group 2, the solitary (S) group, consisted of 11 metastatic breast cancer patients with solitary metastases [14]. The samples were drawn prior to the first cycle of first-line treatment for metastatic breast cancer. Group 3, the multiple (M) group, consisted of 15 metastatic breast cancer patients with multiple metastases [14]. The samples were drawn during clinical tumor progression at least 2 weeks after the latest course of chemotherapy, except for two patients who received different continuous oral medication. The mean/minimum/maximum ages of the women in groups C, S, and M are 49/39/69 years, 52/34/67 years, and 56/35/70 years, respectively.

Dilution of samples 20 and 500 times was performed with 0.067 M phosphate buffer at pH 7.4.
3.2. Biomarker analysis

The traditional biomarkers CA 15-3, CEA, and TPA were analyzed according to the description given in Reference [6].
3.3. Calibration set and test set: validation principle

The number of samples in the present study is low, and in order to get an impression of the possible over-fitting of the ECVA models it was decided to split the 38 samples (the 39 collected samples minus one sample excluded as deviating, see Section 4.1) into two sets: a calibration set consisting of 29 samples and a randomly selected test set consisting of 9 samples (three from each group).

All ECVA models on calibration sets are leave-one-out cross validated, and the optimal number of PLS components in the inner relation is estimated based on this validation scheme. The test set samples are not involved in the ECVA modeling procedure at any time.
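The validation scheme can be sketched with the standard library alone. The helper names, the random seed, and the ordering of the labels are assumptions for illustration; the group sizes (13 C, 11 S, and 14 M after exclusion of one M sample) follow from the text.

```python
import random

def split_calibration_test(labels, n_test_per_group, seed=0):
    """Randomly draw n_test_per_group test samples from each group."""
    rng = random.Random(seed)
    test_idx = []
    for grp in sorted(set(labels)):
        members = [i for i, lab in enumerate(labels) if lab == grp]
        test_idx.extend(rng.sample(members, n_test_per_group))
    test_idx.sort()
    held_out = set(test_idx)
    cal_idx = [i for i in range(len(labels)) if i not in held_out]
    return cal_idx, test_idx

def leave_one_out(indices):
    """Yield (training indices, left-out index) for each calibration sample."""
    for k, left_out in enumerate(indices):
        yield indices[:k] + indices[k + 1:], left_out

# 38 samples: 13 controls, 11 solitary, 14 multiple (one M sample excluded).
labels = ["C"] * 13 + ["S"] * 11 + ["M"] * 14
cal_idx, test_idx = split_calibration_test(labels, n_test_per_group=3)
folds = list(leave_one_out(cal_idx))
print(len(cal_idx), len(test_idx), len(folds))
```

Each cross-validation fold would be used to fit an ECVA model on 28 calibration samples and classify the left-out one; the test indices are touched only after the number of PLS components has been fixed.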
3.4. Fluorescence spectroscopic analysis

The serum samples were analyzed using an LS55 Luminescence Spectrometer, Perkin Elmer, Boston, MA, USA. The instrumental settings were: excitation wavelengths recorded at 230 to 400 nm with 10 nm intervals, and emission spectra scanned from 250 to 600 nm with 0.5 nm steps (for the 20 times dilution only excitations from 310 to 400 nm are recorded due to signal saturation at lower excitation wavelengths). The undiluted samples were analyzed in front face mode using the front face accessory produced by Perkin Elmer and a 3 × 10 mm cuvette, while analysis of the diluted samples was performed using an ordinary 10 × 10 mm cuvette in a right angle set-up. The slit widths for the experiments with undiluted and 20 times diluted samples were 5.0 nm (excitation) and 4.5 nm (emission), and for the 500 times dilution experiments the slit widths were 4.0 nm (excitation) and 3.0 nm (emission). Scan rate for all experiments was set to 600 nm/min. For each of the three sample sets the samples were measured in a random order during one day, and instrumental changes were not significant as tested by sugar standards measured three times a day [15].
3.5. Software

All multivariate data analyses were performed using MATLAB version 7.3.0.267 (R2006b) (The MathWorks, Natick, MA, USA). An ECVA Toolbox [8] is available at http://www.models.life.ku.dk. Compared to the originally published ECVA algorithm, the current algorithm has been optimized with respect to speed using initial loss-less compression and more efficient use of the algebraic properties of the covariance matrices. The fluorescence spectra were transferred to MATLAB using an in-house routine written in MATLAB.
4. RESULTS AND DISCUSSION
4.1. Fluorescence spectroscopy

The spectral data acquired are shown in Figure 1 for the undiluted, the 20 times diluted, and the 500 times diluted sample sets. The contour plots of the fluorescence spectra for undiluted, dilution 20 times, and dilution 500 times are included for a selected solitary sample. Due to signal saturation the spectra with excitation from 230 to 300 nm are not included at 20 times dilution. For the undiluted data it is observed that a single sample (multiple metastases) has a deviating pattern; this is the same sample that deviates in the first part of the 20 times (excitation 300 and 310 nm) and 500 times (excitation 230 to 270 nm) dilution spectra. The sample is from one of the two patients who received continuous oral medication, which might explain the deviating pattern, and the sample is excluded from the further data analysis. For the 20 times dilutions a group of five spectra have a higher intensity, clearly seen at excitation wavelengths from 340 nm and upwards. Four samples are from subjects with solitary metastases and one sample is from a subject with multiple metastases. These five samples are not excluded from further data analysis due to their common patterns.

Each individual set of emission spectra recorded at a given wavelength is normalized before application of the ECVA. The results were in general better with than without normalization, so this is chosen as the default pre-processing. Normalization of spectra is often used in auto-fluorescence spectroscopy, where ratios and spectral shapes sometimes are more relevant than quantitative measures [4,5,16].
ECVA models on the calibration set (29 samples) were calculated on the concatenated spectral data set (all excitations for all dilutions), which contains 5221 (2043, 1135, and 2043) variables. The first model was calculated in order to discriminate between the three groups (C, S, M) simultaneously. The number of false positives and false negatives for the calibration set (five inner PLS components) and test set are given in Table I. In total five errors are observed out of 38 samples (13.2%).

Table I. Classification results from ECVA on fluorescence spectroscopy data

| Spectral variables and dilutions | Groups | Data set | False positives | False negatives | Total error (cal. + test) |
|---|---|---|---|---|---|
| All excitations and all dilutions (5221 variables) | C, S, M | Cal. set (n = 29)* | 2 | 2 | |
| | | Test set (n = 9) | 0 | 1 | 5 |
| | C, S&M | Cal. set (n = 29) | 2 | 5 | |
| | | Test set (n = 9) | 1 | 2 | 10 |
| | C&S, M | Cal. set (n = 29) | 2 | 2 | |
| | | Test set (n = 9) | 1 | 2 | 7 |
| | C, S | Cal. set (n = 18) | 0 | 1 | |
| | | Test set (n = 6) | 1 | 0 | 2 |
| Excitation 230 nm, 500 times dilution | C&S, M | Cal. set (n = 29) | 0 | 0 | |
| | | Test set (n = 9) | 0 | 0 | 0 |

C, control; S, solitary metastases; M, multiple metastases.
*One misclassification between S and M (not included as false positive or false negative).
It is interesting to investigate if it is possible to discriminate C from S&M; that is, the solitary and multiple metastases groups are considered as one group. The number of misclassifications increases to 10 (five inner PLS components), indicating that the solitary and multiple metastases groups are quite different in their data structure. The primary challenge is to discriminate solitary samples (S) from controls (C) in order to make an early detection of cancer. An alternative classification strategy is then to discriminate multiple metastases samples from the other two groups and in the next step focus only on discrimination of control and solitary. As observed from the results presented in Table I, the total number of misclassifications is seven for M versus C&S (pooled, four inner PLS components), while the number of errors is two when ECVA is used on only C versus S samples (seven inner PLS components). This sequential approach gives a total number of errors equal to nine (seven and two), which is higher than the three-group model.
The result from an iECVA for the M versus C&S model is given in Figure 2. Several of the intervals (each excitation wavelength) perform better than the full data model. An ECVA model on variables 3179–3274 (interval number 29, two inner PLS components), corresponding to excitation wavelength 230 nm for the 500 times diluted samples, gives zero misclassifications. Besides the low error this is interesting for at least two reasons: (1) it is possible to make a much faster method using one excitation wavelength and only one dilution and (2) it indicates spectral ranges that are relevant to investigate with respect to specific fluorophores. Application of the interval model followed by the C versus S model (full data) gives two errors out of 38 samples (5.3%). The drawback of this approach is that the C versus S model is based on all excitation wavelengths and all dilutions, and no model based on a single excitation wavelength has a lower error than the global model (results not shown).
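The bookkeeping behind these numbers can be verified with a few lines of arithmetic. The sketch uses only figures quoted in the text (block sizes 2043, 1135, and 2043; 18, 10, and 18 excitation intervals; error counts out of 38 samples); the variable names are illustrative.

```python
# Variables and intervals per concatenated block, in the order used in the text.
block_sizes = {"undiluted": 2043, "20x": 1135, "500x": 2043}
intervals_per_block = {"undiluted": 18, "20x": 10, "500x": 18}

total_variables = sum(block_sizes.values())

# The 500 times dilution block starts right after the first two blocks,
# so its first interval (excitation 230 nm) begins at this 1-based index.
first_variable_500x = block_sizes["undiluted"] + block_sizes["20x"] + 1
first_interval_500x = intervals_per_block["undiluted"] + intervals_per_block["20x"] + 1

n_samples = 38
three_group_error_pct = round(100 * 5 / n_samples, 1)   # three-group model
sequential_error_pct = round(100 * 2 / n_samples, 1)    # interval model, then C vs S

print(total_variables, first_variable_500x, first_interval_500x)
print(three_group_error_pct, sequential_error_pct)
```

The computed start of the 500 times dilution block agrees with the quoted variable range 3179–3274 and interval number 29.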
4.2. Biomarkers

Descriptive statistics for CA 15-3, CEA, and TPA are presented in Table II for each clinical group. Using the recommended cutoff values [6] the diagnostic results reported in Table III are obtained, with CA 15-3 as the best performing biomarker with eight false negatives (seven solitary, one multiple). For the CEA and TPA biomarkers the numbers of false negatives are 9 and 12, respectively, and for all three biomarkers no false positives are detected.

ECVA, which reduces to an ordinary CVA in the case of a low number of variables, is used to model the three biomarkers together to see if there is a synergy advantage in using a combination of biomarkers.

Table IV gives the results for the ECVA model of the biomarker data (note that log10 is used to transform the biomarker data prior to ECVA modeling due to skewed biomarker distributions). The lowest number of errors for the same type of groups as tested for fluorescence spectroscopy is seven (for C, S, M). The sequential model (C&S, M followed by C, S) gives eight errors (one and seven, respectively). Calculating exactly the same models for the best performing biomarker, CA 15-3, gives an error of eight for the best performing model (Table IV).
4.3. Subgroupings and outlier detection

By inspection of the emission spectra for each excitation wavelength, it is observed that the samples from subjects with multiple metastases can be separated into two apparently distinct groups according to the presence of an emission peak at 515–520 nm (excitation 390 nm) for undiluted samples (Figure 3a). A very clear distinction between these two spectral signatures is observed also in a PCA model on these data (not shown). This grouping of samples from the multiple metastases subjects is not observed for the diluted sample sets. In Figure 3b raw emission spectra recorded at excitation 360 nm and 20 times dilution for all the solitary samples are plotted. Four samples show a deviating pattern with the same shape for all four samples. One of the multiple metastases samples also shows this spectral pattern, but no control samples have this pattern, so it might be assumed that this pattern is connected only to samples from diseased subjects.
4.4. Spectral assignments

The serum samples (both undiluted and diluted) are very complex from a chemical point of view but it is still possible to provide some indications on the fluorophores or groups of fluorophores that might cause the spectral differences
Figure 2. Interval ECVA model for C&S versus M based on fluorescence spectroscopic data. The first 18 intervals correspond to excitation wavelengths 230 to 400 nm for no dilution, the next 10 intervals correspond to excitation wavelengths 310 to 400 nm for 20 times dilution, and the last 18 intervals correspond to excitation wavelengths 230 to 400 nm for 500 times dilution. The bars reflect the number of misclassifications for each model for the calibration set (n = 29) based on leave-one-out cross validation. The average spectrum (solid line) and the error using all variables (dotted line) are also shown.
Table II. Characterization of the biomarker data CA 15-3, CEA, and TPA for 39 samples from three different groups: control,
solitary metastases, and multiple metastases
Biomarker Minimum value Maximum value Median Standard deviation
Table IV. Classification results from ECVA on biomarker (log10) data
Biomarkers Group Data set False positives False negatives Total error
All three biomarkers C, S, M Cal. set (n¼ 29) 1 4Test set (n¼ 9)� 1 1 7
C, S&M Cal. set (n¼ 29) 0 8Test set (n¼ 9) 0 1 9
C&S, M Cal. set (n¼ 29) 0 0Test set (n¼ 9) 0 1 1
C, S Cal. set (n¼ 18) 2 3Test set (n¼ 6) 1 1 7
CA 15.3 C, S, M Cal. set (n¼ 29)�� 1 4Test set (n¼ 9)��� 2 1 8
C, S&M Cal. set (n¼ 29) 1 6Test set (n¼ 9) 0 2 9
C&S, M Cal. set (n¼ 29) 1 2Test set (n¼ 9) 1 1 5
C, S Cal. set (n¼ 18) 1 4Test set (n¼ 6) 2 1 8
�One misclassification between S and M (not included as false positive or false negative).��Three misclassifications between S and M (not included as false positive or false negative).���Two misclassifications between S and M (not included as false positive or false negative).
observed. A selection of the identified fluorophores in serum
at 20 or 500 times dilution are tyrosine (ex/em = 275 nm/
300 nm), free and bound tryptophan (ex/em = 280–290 nm/
Figure 3. (a) Raw emission spectra recorded at excitation 390 nm (no dilution) for the
multiple metastases samples. Two subgroups are identified based on the presence or absence
of the peak at 515–520 nm. (b) Raw emission spectra recorded at excitation 360 nm and
20 times dilution for the solitary samples. Four samples (gray) show a deviating pattern
with the same shape, separating the solitary group into two subgroups.
base (enamine form) (ex/em = 410 nm/510 nm) (see Wolfbeis
and Leiner [4] and references therein for assignments). A
selection of the mentioned chemical components are also
involved in other types of cancer [16–18]. A hypothesis from
the present study is that the developed method can also be
applied in the diagnosis of these. The assignments of spectral
peaks are tentative and could be backed up using a
chromatographic technique with, for example, fluorescence
and ultraviolet/visible detector in order to obtain a unique
and complete identification [19]. Alternatively, it might be
possible in future studies to apply the PARAFAC [20]
technique for mathematical de-convolution of the spectral
peaks contributing to the total signals. It will be of great
interest also to further investigate the observed spectral
subgroupings.
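For orientation, the PARAFAC model mentioned above is usually written as a trilinear decomposition. In the notation assumed here (it is not taken from the original text), x_{ijk} is the fluorescence intensity of sample i at emission wavelength j and excitation wavelength k, and F is the number of components:

```latex
x_{ijk} = \sum_{f=1}^{F} a_{if}\, b_{jf}\, c_{kf} + e_{ijk}
```

Under ideal conditions each component f would correspond to one fluorophore, with a_{if} proportional to its concentration in sample i, b_{jf} describing its emission spectrum, c_{kf} describing its excitation spectrum, and e_{ijk} the residual. Quenching and inner filter effects break this trilinear structure, so such a decomposition would presumably be most applicable to the strongly diluted samples.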
It is important to stress that the spectral differences
observed between the three groups might not necessarily be
ascribed to one or several fluorophores but also to the
chemical environment of these fluorophores.
4.5. Dilutions and other chemometric models
The fluorescence signals saturate in samples measured at
20 times dilution (Figure 1), that is, the recorded signals are
higher than for the undiluted samples. This is due to intense
quenching and inner filter effects in the undiluted highly
concentrated sample. When such a sample is diluted more
light is allowed to escape at the low excitation wavelengths
giving rise to detector saturation. Another point for further
investigation is whether the selected dilution levels are optimal
for obtaining discrimination. We chose the dilutions
suggested by Wolfbeis and Leiner [4], but other dilutions
might also be relevant; for instance, a gradient of
many dilutions could be measured using a
flow injection analysis system [21].
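The observation that a diluted sample can give a higher recorded signal than the undiluted one can be illustrated numerically. The sketch below uses a commonly used first-order inner filter approximation, F_obs = F_ideal · 10^(-(A_ex + A_em)/2), together with the assumption that absorbance scales linearly with concentration. The absorbance values and the function name are hypothetical and were chosen only to show the effect qualitatively; they were not measured on serum, and the approximation is not accurate at very high absorbance.

```python
def observed_fluorescence(relative_conc, a_ex_undiluted=4.0, a_em_undiluted=1.0):
    """Relative observed fluorescence under a first-order inner filter model.

    relative_conc: concentration relative to the undiluted sample (1.0 = undiluted).
    a_ex_undiluted, a_em_undiluted: hypothetical absorbances of the undiluted
    sample at the excitation and emission wavelengths.
    """
    ideal = relative_conc  # ideal signal is proportional to concentration
    a_ex = a_ex_undiluted * relative_conc  # absorbance assumed linear in concentration
    a_em = a_em_undiluted * relative_conc
    return ideal * 10 ** (-(a_ex + a_em) / 2)


undiluted = observed_fluorescence(1.0)
diluted_20 = observed_fluorescence(1 / 20)
diluted_500 = observed_fluorescence(1 / 500)
```

With these assumed absorbances, the 20 times diluted sample gives a larger observed signal than both the undiluted and the 500 times diluted sample. This is the kind of behavior that can push the detector into saturation at an intermediate dilution.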
When more samples become available in future studies, it
is plausible that more advanced chemometric methods,
including non-linear methods and wavelength/range selection
methods, can be applied with even better results.
5. CONCLUSIONS
In this study a multivariate classification model of fluorescence
spectroscopic data is demonstrated to yield a
promisingly low number of false positives and negatives for
the 38 samples under study. The results compare favorably with
results obtained using the biomarkers CA 15-3, CEA, and TPA. It
should be stressed that the number of samples available in
this study is too small to provide conclusive evidence that
fluorescence spectroscopy and chemometrics is a suitable
approach for detecting breast cancer, but the study definitely
shows that fluorescence spectroscopic analysis on intact and
diluted serum samples in combination with multivariate
data analysis has the potential to become a valuable tool for
improving breast cancer diagnosis and monitoring and for
detecting deviating samples and discovering subgroups.
The subgrouping is especially important and can be helpful,
for example, in developing targeted treatment or medicine. It
is also possible that other types of cancer can be detected with
this technique.
We suggest that the combination of medical data and
multivariate data analysis be termed medicometrics, as also
suggested by Bailly [22], although in a different context.
REFERENCES
1. Lakowicz JR. Principles of Fluorescence Spectroscopy (2nd edn). Kluwer Academic/Plenum Publishers: New York, 1999.
2. Leiner M, Wolfbeis OS, Schaur RJ, Tillian HM. Fluorescence topography in biology .1. Ultraviolet fluorescence topograms of rat sera and decrease of tryptophan fluorescence in Yoshida ascites hepatoma-bearing rats. IRCS Med. Sci.-Biochem. 1983; 11: 675–676.
3. Leiner M, Schaur RJ, Wolfbeis OS, Tillian HM. Fluorescence topography in biology .2. Visible fluorescence topograms of rat sera and cluster-analysis of fluorescence parameters of sera of Yoshida ascites hepatoma-bearing rats. IRCS Med. Sci.-Biochem. 1983; 11: 841–842.
4. Wolfbeis OS, Leiner M. Mapping of the total fluorescence of human-blood serum as a new method for its characterization. Anal. Chim. Acta. 1985; 167: 203–215.
5. Leiner MJP, Schaur RJ, Desoye G, Wolfbeis OS. Fluorescence topography in biology .3. Characteristic deviations of tryptophan fluorescence in sera of patients with gynecological tumors. Clin. Chem. 1986; 32: 1974–1978.
6. Soletormos G, Schioler V, Nielsen D, Skovsgaard T, Dombernowsky P. Interpretation of results for tumor-markers on the basis of analytical imprecision and biological variation. Clin. Chem. 1993; 39: 2077–2083.
7. Christensen J, Nørgaard L, Bro R, Engelsen SB. Multivariate autofluorescence of intact food systems. Chem. Rev. 2006; 106: 1979–1994.
9. Krzanowski WJ. Principles of Multivariate Analysis (Revised edn). Oxford University Press: New York, USA, 2000.
10. Naes T, Indahl U. A unified description of classical classification methods for multicollinear data. J. Chemometr. 1998; 12: 205–220.
11. Duda RO, Hart PE, Stork DG. Pattern Classification (2nd edn). John Wiley & Sons: New York, USA, 2001.
12. Norgaard L, Saudland A, Wagner J, Nielsen JP, Munck L, Engelsen SB. Interval partial least-squares regression (iPLS): a comparative chemometric study with an example from near-infrared spectroscopy. Appl. Spectrosc. 2000; 54: 413–419.
13. Munck L. A new holistic exploratory approach to Systems Biology by Near Infrared spectroscopy evaluated by chemometrics and data inspection. J. Chemometr. (in press).
14. Soletormos G, Nielsen D, Schioler V, Skovsgaard T, Dombernowsky P. Tumor markers cancer antigen 15.3, carcinoembryonic antigen, and tissue polypeptide antigen for monitoring metastatic breast cancer during first-line chemotherapy and follow-up. Clin. Chem. 1996; 42: 564–575.
15. Norgaard L. Direct standardization in multi wavelength fluorescence spectroscopy. Chemometr. Intell. Lab. Syst. 1995; 29: 283–293.
16. Masilamani V, Al-Zhrani K, Al-Salhi M, Al-Diab A, Al-Ageily M. Cancer diagnosis by autofluorescence of blood components. J. Lumin. 2004; 109: 143–154.
17. Aiken JH, Huie CW, Terzian JA. Characteristic visible fluorescence emission-spectra of sera from cancer-patients. Anal. Lett. 1994; 27: 511–521.
18. Hubmann MR, Leiner MJP, Schaur RJ. Ultraviolet fluorescence of human sera .1. Sources of characteristic differences in the ultraviolet fluorescence-spectra of sera from normal and cancer-bearing humans. Clin. Chem. 1990; 36: 1880–1883.
19. Roth M, Uebelhart D. Liquid chromatography with fluorescence detection in the analysis of biological fluids. Anal. Lett. 2000; 33: 2353–2372.
20. Bro R. PARAFAC. Tutorial and applications. Chemometr. Intell. Lab. Syst. 1997; 38: 149–171.
21. Ruzicka J, Hansen EH. Flow Injection Analysis (2nd edn). John Wiley & Sons: New York, USA, 1988.
22. Bailly A. Cardiovascular prevention under the scrutiny of medicometry. Schweiz. Med. Wochenschr. 1995; 125: 2487–2493.