Journal of Pain and Symptom Management, 2013

Original Article

Calibration of Quality-Adjusted Life Years for Oncology Clinical Trials

Jeff A. Sloan, PhD, Daniel J. Sargent, PhD, Paul J. Novotny, MS, Paul A. Decker, MS, Randolph S. Marks, MD, and Heidi Nelson, MD

Departments of Health Sciences Research (J.A.S., D.J.S., P.J.N., P.A.D.); Medical Oncology (R.S.M.); and Colon and Rectal Surgery and Gastrointestinal Endoscopy (H.N.), Mayo Clinic, Rochester, Minnesota, USA

Abstract

Context. Quality-adjusted life year (QALY) estimation is a well-known but little used technique to compare survival adjusted for complications. Lack of calibration and interpretation guidance hinders implementation of QALY analyses.

Objectives. We conducted simulation studies to assess the impact of differences in survival, toxicity rates, and utility values on QALY results.

Methods. Survival comparisons used both log-rank and Wilcoxon testing. We examined power considerations for a North Central Cancer Treatment Group Phase III lung cancer clinical trial (89-20-52).

Results. Sample sizes of 100 events per treatment have low power to generate a statistically significant difference in QALYs unless the toxicity rate is 44% higher in one arm. For sample sizes of 200 per arm and equal survival times, toxicity needs to be at least 38% more in one arm for the result to be statistically significant, using a utility of 0.3 for days with toxicity. Sample sizes of 300 (500)/arm provide 80% power if there is a 31% (25%) toxicity difference. If the overall survival hazard ratio between the two treatment arms is 1.25, then samples of at least 150 patients and 13% increased toxicity are necessary to have 80% power to detect QALY differences. In study 89-20-52, there was only 56% power to determine the statistical significance of the observed QALY differences, clarifying the enigmatic conclusion of no statistically significant difference in QALY despite an observed 14.5% increase in toxicity between treatments.

Conclusion. This calibration allows researchers to interpret the clinical significance of QALY analyses and facilitates QALY inclusion in clinical trials through improved study design.

J Pain Symptom Manage 2013;-:-e-. © 2013 U.S. Cancer Pain Relief Committee. Published by Elsevier Inc. All rights reserved.

This work was presented at the 2010 American Society of Clinical Oncology meeting (abstract 6108) and the 2010 International Society for Quality of Life Research meeting (abstract 1783).

Address correspondence to: Paul J. Novotny, MS, Division of Biomedical Statistics and Informatics, Department of Health Sciences Research, 200 First Street SW, Rochester, MN 55905, USA. E-mail: [email protected]

Accepted for publication: July 24, 2013.

0885-3924/$ - see front matter
http://dx.doi.org/10.1016/j.jpainsymman.2013.07.016


Key Words

QALY, quality-adjusted life year, Q-TWiST, QOL, quality of life, simulation

Introduction

Quality-adjusted life years (QALYs) are a method of comparing treatment arms in clinical trials that combines survival time, time after cancer progression, and days with toxicity. The QALYs combine these three endpoints into one measure that can be tested using standard survival analysis methods. First proposed in 1989 by Gelber et al,1 this method held intuitively appealing promise to combine quality and quantity of survival. The QALYs assume that days with high toxicity levels (such as days with any Grade 2 or higher adverse events) and days after cancer progression are not as valuable to the patient as days without toxicity or progression. These days are counted as less than one day of survival.

A comprehensive review of QALY literature can be found in Tsuchiya and Dolan.2 Methodological considerations for incorporating quality of life (QOL) data into the QALY model are discussed in Revicki et al3 and Mounier et al.4 Previously, we have proposed methods for summarizing QALY data graphically.5

There are numerous examples of successful implementation of QALY estimates in clinical trials. For example, women who test positive for BRCA1/2 mutations in the U.S. may experience greater QALYs from prevention strategies.6 Among patients with non-small-cell lung cancer in Canada, it was demonstrated that adjuvant chemotherapy produced superior QALY estimates.7 In Europe, patients with sepsis who were treated with intensive care reported improved QALYs.8

The recent focus by health care policy and regulatory agencies on comparative effectiveness and cancer care delivery research would suggest that QALY estimates and analyses based on QALYs could contribute much to decision-making processes.9 The U.S. Public Health Service Panel on Cost-Effectiveness has recommended the use of QALYs as the best way to estimate outcomes in a cost-effectiveness analysis. The law enacting the Patient-centered Outcomes Research Institute (PCORI), however, explicitly forbids the PCORI from using cost per QALY as a decision-making tool, and cites the methodological, political, and interpretational challenges. The executive director of the PCORI (P. Selby) indicated that the QALY estimates themselves remain a valid and viable method for analysis. Lipscomb et al10 summarized the problems with QALY estimates as being related to methodology, assumptions, and interpretation while indicating that QALYs remain a natural benchmark for preference-based approaches. Meta-analytic summaries of studies that have used QALYs, however, have empirically suggested that the method may be inadequate for capturing the vital QOL issues quantitatively.11

The major barrier to the routine use of QALYs in clinical research is one of calibration.3 Specifically, it is difficult to define how much of a difference in QALYs is clinically meaningful. This clinical significance could be in relation to the patient, clinician, payer, or society. Among the key data needed to define clinical significance is the variability of the metric. To gain that understanding, one needs to be able to express statements of power regarding observed changes in QALYs so that the clinical meaning of a statistically significant value can be interpreted.

Although there is research dealing with the choice of the weights and defining clinical significance of QALYs, the power of QALY analyses has not been thoroughly explored. This manuscript explores the power of QALY analyses using simulations and calculates the power associated with a published clinical trial that used a QALY analysis.

QALY Definition

The QALY analyses involve partitioning the survival time for each patient into three parts, namely time without symptoms or toxicity (TWiST), time with high levels of toxicity (TOX), and time after disease progression or relapse (REL). The overall survival time (OS) for a patient can then be written as:

OS = TOX + TWiST + REL

Days with toxicity and days after relapse are considered to be of less value than other days and are given less weight in the analyses. For example, days with high levels of toxicity may be counted as only half as good as a day without toxicity. The TOX and REL days are weighted by utilities to come up with the final QALYs for each patient:

QALY = Ut × TOX + TWiST + Ur × REL

where Ut = utility for days with toxicity (between zero and one) and Ur = utility for days after relapse (between zero and one).

The differences in these QALYs can then be tested between treatment arms using standard survival analysis methods such as log-rank tests, Kaplan-Meier curves, and Cox proportional hazards models.

For a general overview of QALYs and a discussion of different methods of selecting utilities, see the monograph by Sloan et al.12 A discussion of QALY application models in cancer is found in Cole et al.13 and a recent discussion of clinically important differences and QALYs is found in Revicki et al.3 Kilbridge14 provides a review of the current use of QALYs in oncology (health care), and their methodological limitations are discussed by Garau et al.11

Fig. 1. Power for n = 300 per group, HR = 1.0, 10% toxicity in the reference arm, and varying the toxicity rate in the other arm. [Plot of power (0 to 100) against utility (0 to 1), with one curve each for equal, 50% more, 75% more, double, triple, and quadruple toxicity.]

Table 1. Simulation Margins of Error for Each Sample Size

| Sample Size Per Group | Number of Simulations | SE for a 5% Observed Power Result | SE for a 50% Observed Power Result | SE for an 80% Observed Power Result | SE for a 90% Observed Power Result |
|---|---|---|---|---|---|
| 50 | 10,000 | 0.002 | 0.005 | 0.004 | 0.003 |
| 100 | 5000 | 0.003 | 0.007 | 0.006 | 0.004 |
| 150 | 5000 | 0.003 | 0.007 | 0.006 | 0.004 |
| 200 | 5000 | 0.003 | 0.007 | 0.006 | 0.004 |
| 300 | 3000 | 0.004 | 0.009 | 0.007 | 0.005 |
| 500 | 2000 | 0.005 | 0.011 | 0.009 | 0.007 |
| 1000 | 1000 | 0.007 | 0.016 | 0.013 | 0.009 |

SE = standard error.
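The standard errors in Table 1 are consistent with the usual binomial standard error of a simulated proportion, sqrt(p(1 - p) / number of simulations). The short Python sketch below recomputes them from the replication counts listed in the table. The function name is hypothetical and is not taken from the authors' SAS code.

```python
import math

def power_estimate_se(power: float, n_simulations: int) -> float:
    """Binomial standard error of a power estimate based on n_simulations replications."""
    return math.sqrt(power * (1.0 - power) / n_simulations)

# Replication counts per sample size, as listed in Table 1.
replications = {50: 10_000, 100: 5000, 150: 5000, 200: 5000,
                300: 3000, 500: 2000, 1000: 1000}

for n_per_group, n_sims in replications.items():
    row = [round(power_estimate_se(p, n_sims), 3) for p in (0.05, 0.50, 0.80, 0.90)]
    print(n_per_group, n_sims, row)
```

Because the standard error depends only on the observed power and the number of replications, rows with the same replication count (100, 150, and 200 per group) share the same margins of error.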

Methods

Simulations

Simulations were run using SAS 9.2 (SAS Institute, Inc., Cary, NC) assuming a Weibull distribution of survival times with constant failure rates over time (that is, the exponential special case of the Weibull distribution). No censoring was used in the simulations, so all sample sizes in this article must be interpreted as the number of events in the study. Days with toxicity and days after progression were given the same weights and not simulated separately for the sake of simplicity without loss of generality. There is also no strong evidence that days with toxicity and days after relapse should be weighted with different utilities.13 Study design parameters were varied to explore their effects on the subsequent power. Parameters included utility scores (0 to 1 by 0.1), number of events (50, 100, 150, 200, 300, 500, and 1000 per group), hazard rates (HRs: 0.5, 0.75, 1.0, 1.25, and 1.5), and toxicity rate differences (assuming 10% of days in one arm have toxicity and simulating 10-100% toxicity in the other arm).
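As a minimal illustration of this setup, the Python sketch below generates QALYs for one simulated arm. It is not the authors' SAS code: the function name is hypothetical, and it assumes that a fixed fraction of every patient's survival time is spent with toxicity or after progression, which is one simple reading of "10% of days in one arm have toxicity".

```python
import random

def simulate_qalys(n_events, hazard, toxicity_fraction, utility, rng):
    """Simulate QALYs for one arm with no censoring.

    Assumptions (illustrative only): exponential survival, a fixed fraction
    of each patient's survival spent with toxicity or after progression,
    and a single utility applied to all of those days.
    """
    qalys = []
    for _ in range(n_events):
        overall_survival = rng.expovariate(hazard)         # constant failure rate
        impaired_days = toxicity_fraction * overall_survival
        healthy_days = overall_survival - impaired_days    # TWiST
        qalys.append(utility * impaired_days + healthy_days)
    return qalys

rng = random.Random(2013)
reference_arm = simulate_qalys(300, hazard=1.0, toxicity_fraction=0.10, utility=0.3, rng=rng)
other_arm = simulate_qalys(300, hazard=1.0, toxicity_fraction=0.40, utility=0.3, rng=rng)
print(sum(reference_arm) / 300, sum(other_arm) / 300)
```

With a utility of 1 the QALY equals overall survival, and with a utility of 0 the toxic days are discarded entirely, which matches the two extremes of the utility range explored in the simulations.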

Between 1000 and 10,000 replications were done with each set of parameters depending on the sample size. The number of replications was constrained for larger sample sizes because of computer memory limitations. Table 1 shows the number of replications performed for each sample size and the resulting standard errors of the estimates. Power to detect a difference (effect size) between treatment arms for survival was based on the percentage of replications with a log-rank P-value less than 0.05, with separate analyses based on a Wilcoxon P-value less than 0.05. Power was calculated for the log-rank test of the hypothesis that the survival distributions in two populations are equal. The power is the probability of declaring a significant difference in QALYs between groups given that there truly is a difference. We assumed a Type I error rate of 5%.

Table 2. Smallest Difference in Percent of Days With Toxicity Needed for 80% Power With HR = 1.00 and 10% Toxicity in One Arm

| N Per Group | Ut = 0.0 | Ut = 0.1 | Ut = 0.2 | Ut = 0.3 | Ut = 0.4 | Ut = 0.5 | Ut = 0.6 | Ut = 0.7 |
|---|---|---|---|---|---|---|---|---|
| 50 | - | - | - | - | - | - | - | - |
| 100 | Quadruple (40%) | - | - | - | - | - | - | - |
| 150 | Quadruple (40%) | Quadruple (40%) | - | - | - | - | - | - |
| 200 | Quadruple (40%) | Quadruple (40%) | Quadruple (40%) | - | - | - | - | - |
| 300 | Triple (30%) | Quadruple (40%) | Quadruple (40%) | - | - | - | - | - |
| 500 | Triple (30%) | Triple (30%) | Triple (30%) | Quadruple (40%) | - | - | - | - |
| 1000 | Triple (30%) | Triple (30%) | Triple (30%) | Triple (30%) | Triple (30%) | Quadruple (40%) | Quadruple (40%) | - |

HR = hazard rate.
Missing cells (-) mean that the toxicity rate has to be more than four times higher to reach 80% power.
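The power calculation described in the Methods (simulate both arms, apply a log-rank test to the QALYs, and count the share of replications with P < 0.05) can be sketched in Python as follows. This is an illustration rather than the authors' SAS program: the function names are hypothetical, survival is taken as exponential with no censoring, toxicity is assumed to occupy a fixed fraction of each patient's survival, and the log-rank statistic is written for untied event times only.

```python
import random

CHI2_CRITICAL = 3.841  # 95th percentile of chi-square with 1 degree of freedom

def logrank_chi2(sample_a, sample_b):
    """Log-rank chi-square for two fully observed (uncensored, untied) samples."""
    events = sorted([(t, 0) for t in sample_a] + [(t, 1) for t in sample_b])
    at_risk_a, at_risk_total = len(sample_a), len(sample_a) + len(sample_b)
    observed_minus_expected, variance = 0.0, 0.0
    for _, arm in events:  # one event per time point
        p_a = at_risk_a / at_risk_total  # expected share of this event in arm A under H0
        observed_minus_expected += (1.0 if arm == 0 else 0.0) - p_a
        variance += p_a * (1.0 - p_a)
        at_risk_total -= 1
        if arm == 0:
            at_risk_a -= 1
    return observed_minus_expected ** 2 / variance

def simulated_power(n_per_arm, hazard_ratio, tox_ref, tox_other, utility,
                    n_replications=1000, seed=0):
    """Share of replications in which the log-rank test on QALYs has P < 0.05."""
    rng = random.Random(seed)

    def arm(hazard, tox):
        weight = 1.0 - tox * (1.0 - utility)  # QALY credited per unit of survival
        return [weight * rng.expovariate(hazard) for _ in range(n_per_arm)]

    hits = 0
    for _ in range(n_replications):
        reference = arm(1.0, tox_ref)
        other = arm(hazard_ratio, tox_other)  # arm with the higher hazard
        if logrank_chi2(reference, other) > CHI2_CRITICAL:
            hits += 1
    return hits / n_replications

print(simulated_power(100, 1.5, 0.10, 0.10, 0.5, n_replications=500))
```

The number of replications controls the Monte Carlo error of the estimate, which is why the study used more replications for the smaller sample sizes where each replication is cheap.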

Results

Power Relation to Utility for a Given Sample Size

Fig. 1 shows power estimates for utilities varying from zero to one for a trial with 300 events in each group and an HR of 1.00 (no differences in survival). The simulation data for this figure assumed that 10% of a patient's days had toxicity on one arm and the second arm had toxicity rates varying from being equal with the first arm to having quadruple the amount of toxicity (40%).

For small-to-moderate differences in toxicity (less than triple the rate in the first treatment arm) with a sample of 300 patients, 80% power is never achieved regardless of the value of the utility used. If the toxicity in one arm is triple the other (30% vs. 10%), the utility for days with toxicity or after recurrence would have to be reduced to Ut = 0.12 to achieve 80% power. Even with a quadrupling of toxicity, utilities need to be set to 0.4 or less to achieve 80% power. Hence, for a trial of this size (n = 300 events per group), for a QALY analysis to have 80% power to detect a difference between arms, the utilities would have to be very low even in the presence of substantial toxicity differences.

Table 2 shows the differences needed in toxicity rates to achieve a power of 80% as the sample size and utility vary. These results show that even for large sample sizes (n = 1000) and low utility values (Ut), the toxicity rate has to triple to have a good chance of finding a statistically significant result.

Power Relation to Sample Size for Given Utilities

Tables 3-5 show the power of log-rank tests using QALYs with varying values of sample sizes, HRs, and differences in toxicity levels. These tables are based on a utility value of 0.5 and assume 10% of the days in one arm have toxicity or relapse. Fig. 2 shows power estimates by the HRs of the arm that had more toxicity. This figure is based on using a utility weight of 0.3 and twice as much toxicity in one arm.

Table 3. Power for QALY Test of Equality for Utility = 0.5 and HR = 1.00 and a Basic Toxicity Rate of 10% in One Arm

Columns give the difference in percent of days with toxicity.

| N Per Group | Equal (10% of Days) | 50% More (15% of Days) | 75% More (17.5% of Days) | Double (20% of Days) | Triple (30% of Days) | Quadruple (40% of Days) |
|---|---|---|---|---|---|---|
| 50 | 5 | 6 | 6 | 6 | 9 | 14 |
| 100 | 5 | 6 | 6 | 7 | 13 | 24 |
| 150 | 5 | 6 | 7 | 8 | 17 | 31 |
| 200 | 5 | 6 | 7 | 8 | 20 | 41 |
| 300 | 5 | 7 | 7 | 10 | 26 | 57 |
| 500 | 4 | 7 | 8 | 14 | 42 | 78 |
| 1000 | 4 | 8 | 12 | 22 | 72 | 98 |

QALY = quality-adjusted life year.
Values are power (%); entries of 80 or higher meet the 80% power threshold.
If the overall survival time is the same, a quadrupling of the toxicity rate (from 10% to 40%) and samples of 500 events are needed to have 80% power to detect a difference in QALYs.

If the OS is the same in both arms (HR = 1), a clinical trial would need a quadrupling of the toxicity rate (from 10% to 40%) and 500 observed events per arm to have 80% power to detect a difference in QALYs. If the HR is 1.25, then a sample of 150 events per group will provide 80% power if the toxicity rates in the two arms are 10% and 30%. If the HR is 1.5, then even a sample size of 200 events per arm will virtually guarantee a significant QALY test for any difference in toxicity.
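These patterns can be roughly reproduced with a closed-form approximation. The derivation below is a sketch under stated assumptions, not the authors' method: it assumes exponential survival, a fixed fraction p of each patient's days spent with toxicity, a single utility U_t, 1:1 allocation, and the standard Schoenfeld approximation for log-rank power with D total events. The symbols w, HR_Q, and D are introduced here for illustration.

```latex
\begin{align*}
Q &= w \cdot \mathrm{OS}, \qquad w = 1 - p\,(1 - U_t) \\
\mathrm{HR}_{Q} &= \mathrm{HR} \cdot \frac{w_{\mathrm{ref}}}{w_{\mathrm{other}}}
  \qquad \text{(if OS has hazard } \lambda \text{, then } w \cdot \mathrm{OS} \text{ has hazard } \lambda / w\text{)} \\
\text{power} &\approx \Phi\!\left( \frac{\sqrt{D}}{2}\,\bigl\lvert \ln \mathrm{HR}_{Q} \bigr\rvert - z_{0.975} \right) \\[6pt]
\text{Example: } & n = 300 \text{ per arm } (D = 600),\; U_t = 0.5,\; p_{\mathrm{ref}} = 0.10,\; p_{\mathrm{other}} = 0.40,\; \mathrm{HR} = 1 \\
w_{\mathrm{ref}} &= 0.95, \quad w_{\mathrm{other}} = 0.80, \quad \mathrm{HR}_{Q} = 1.1875 \\
\text{power} &\approx \Phi(12.25 \times 0.172 - 1.96) = \Phi(0.14) \approx 0.56
\end{align*}
```

Under these assumptions the approximation lands close to the simulated value of 57 in Table 3 for the same design, which suggests that toxicity acts on the QALY comparison mainly by rescaling the effective hazard ratio.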

Power Relation to Percent of Toxicity in the Reference Arm for Given Sample Sizes, HRs, Utilities, and Differences in Toxicity Rates

Figs. 3 and 4 show how power estimates change as the percent of days with toxicity in the reference arm increases. These figures assume that there are 300 events in each arm, a utility of 0.3, and an HR of 1.0. Fig. 3 shows power estimates as the toxicity rate in the second arm increases by 10%, 20%, 30%, and 40%. Fig. 4 shows the power estimates when the toxicity rate in the second arm is 50% more, 75% more, double, triple, or quadruple the toxicity rate in the first arm.

If the toxicity rate in the reference arm is between 0% and 40%, then QALY tests for a 10% difference in toxicity rates between two treatment arms will have almost the same power. The rate of toxicity in the reference arm has only a small impact on the resulting power. However, if the toxicity rate in the reference arm is more than 40% or the difference in toxicity rates is more than 10%, then the toxicity rate in the reference arm will affect the power. If the toxicity rate in the reference arm is low (about 10%), then it takes a large difference in toxicity rates (at least 30% or a tripling of toxicity) to have 80% power. If the toxicity rate in the reference arm is high (60%), then the toxicity rate in the other arm only has to be 20 percentage points higher to reach 80% power.

Table 4. Power for QALY Test of Equality for Utility = 0.5 and HR = 1.25

Columns give the difference in percent of days with toxicity.

| N Per Group | Equal | 50% More | 75% More | Double | Triple | Quadruple |
|---|---|---|---|---|---|---|
| 50 | 20 | 24 | 26 | 28 | 38 | 49 |
| 100 | 34 | 44 | 45 | 50 | 64 | 79 |
| 150 | 48 | 58 | 62 | 66 | 83 | 93 |
| 200 | 60 | 70 | 74 | 78 | 91 | 98 |
| 300 | 78 | 85 | 90 | 92 | 98 | 100 |
| 500 | 94 | 97 | 99 | 99 | 100 | 100 |
| 1000 | 100 | 100 | 100 | 100 | 100 | 100 |

QALY = quality-adjusted life year.
Values are power (%); entries of 80 or higher meet the 80% power threshold.
If the overall survival time hazard rate is 1.25, then a sample of 150 events per group will provide 80% power if the toxicity rates in the two groups are 10% and 30%.

Table 5. Power for QALY Test of Equality for Utility = 0.5 and HR = 1.50

Columns give the difference in percent of days with toxicity.

| N Per Group | Equal | 50% More | 75% More | Double | Triple | Quadruple |
|---|---|---|---|---|---|---|
| 50 | 51 | 58 | 59 | 62 | 71 | 81 |
| 100 | 80 | 86 | 87 | 89 | 95 | 98 |
| 150 | 93 | 96 | 97 | 98 | 99 | 100 |
| 200 | 98 | 99 | 100 | 100 | 100 | 100 |
| 300 | 100 | 100 | 100 | 100 | 100 | 100 |
| 500 | 100 | 100 | 100 | 100 | 100 | 100 |
| 1000 | 100 | 100 | 100 | 100 | 100 | 100 |

QALY = quality-adjusted life year.
Values are power (%); entries of 80 or higher meet the 80% power threshold.
If the overall survival time hazard rate is 1.5, even a sample of 200 events per group will virtually guarantee a significant QALY test.

Differences in Percent of Days With Toxicity Needed to Obtain Significant Results

Table 6 shows the differences in toxicity needed to achieve 80% or 90% power. These results assume that one arm has 10% of its days with toxicity and a utility of 0.3. If the HRs are the same in the two arms, there needs to be a very large difference in toxicities to obtain 80% power. With 100 events in each arm, 80% power is achieved if the difference in toxicity rates is 44%. If the HR is 1.25, then sample sizes of 200-300 provide adequate power for a trial. With an HR of 1.5 or higher, any sample size of at least 100 and any difference in toxicities will likely result in a significant difference in QALYs.
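The required toxicity differences can be approximated without simulation by inverting the Schoenfeld formula for log-rank power. The Python sketch below does this under assumptions that are not taken from the article: exponential survival, a fixed fraction of each patient's days spent with toxicity, one utility for all impaired days, and 1:1 allocation. The function name is hypothetical.

```python
import math
from statistics import NormalDist

def min_toxicity_difference(n_per_arm, hazard_ratio, tox_ref, utility,
                            power=0.80, alpha=0.05):
    """Approximate extra fraction of toxic days needed in the higher-hazard arm.

    Returns 0.0 when the survival difference alone already gives the target
    power, and None when even 100% toxic days would not be enough.
    """
    if not 0.0 <= utility < 1.0:
        raise ValueError("utility must be in [0, 1)")
    normal = NormalDist()
    z_sum = normal.inv_cdf(1 - alpha / 2) + normal.inv_cdf(power)
    # QALY-scale hazard ratio that yields the target power with 2 * n_per_arm events
    required_ratio = math.exp(2.0 * z_sum / math.sqrt(2 * n_per_arm))
    if hazard_ratio >= required_ratio:
        return 0.0
    weight_ref = 1.0 - tox_ref * (1.0 - utility)
    weight_other = hazard_ratio * weight_ref / required_ratio
    tox_other = (1.0 - weight_other) / (1.0 - utility)
    if tox_other > 1.0:
        return None
    return tox_other - tox_ref

for n in (50, 100, 150, 200, 300, 500, 1000):
    print(n, min_toxicity_difference(n, 1.0, 0.10, 0.3))
```

For n = 100 per group, HR = 1.00, and Ut = 0.3, this approximation gives a difference of about 0.43, in line with the 44% reported in Table 6. Small discrepancies elsewhere are to be expected because the table comes from simulations with their own Monte Carlo error.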

Differences in Percent of Days With Toxicity Needed to Dampen a Significant Survival Difference

Table 7 shows the differences in toxicity needed to achieve 10% or 20% power. With these low power levels, a study can have a nonsignificant QALY result even if there is a significant difference in survival between the two arms. These results assume that one arm has 10% of its days with toxicity and a utility of 0.3. With 150-300 events in each arm and an HR of 0.75, the QALY result will have low power if there is about 20% more toxicity in the arm with better survival times.

Fig. 2. Power for double toxicity rate in one arm and utility of 0.3 for varying sample sizes and hazard rates. [Plot of power (0 to 100) against the hazard rate for the arm with the higher toxicity rate (0.50 to 2.00), with one curve per sample size: n = 50, 100, 150, 200, 300, 500, and 1000 per group.]
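One way to see why extra toxicity can cancel a survival advantage is to compute the toxicity level at which the two arms have the same expected QALYs. The Python sketch below is illustrative only. It assumes exponential survival, a fixed fraction of each patient's days spent with toxicity, and a single utility; the function name is hypothetical.

```python
def break_even_toxicity(hazard_ratio, tox_ref, utility):
    """Toxic-day fraction in the better-surviving arm at which its QALY hazard
    equals that of the reference arm (QALY-scale hazard ratio of 1).

    hazard_ratio is the hazard of the better-surviving arm relative to the
    reference arm, so values below 1 mean longer survival.
    """
    if not 0.0 <= utility < 1.0:
        raise ValueError("utility must be in [0, 1)")
    weight_ref = 1.0 - tox_ref * (1.0 - utility)   # QALY credited per day, reference arm
    weight_needed = hazard_ratio * weight_ref      # weight that offsets the survival gain
    return (1.0 - weight_needed) / (1.0 - utility)

for hr in (0.50, 0.75):
    tox = break_even_toxicity(hr, tox_ref=0.10, utility=0.3)
    print(hr, round(tox, 3), round(tox - 0.10, 3))
```

Near this break-even point a QALY test behaves as if there were no difference between arms, so its power falls toward the Type I error rate no matter how many events are observed.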

Reanalysis of the North Central Cancer Treatment Group Study 89-20-52

Simulations can be used to determine the observed power for a completed clinical trial. The North Central Cancer Treatment Group study 89-20-52 was a blinded study of once vs. twice daily radiation in patients with small-cell lung cancer.15 The QALY analysis for this study showed no significant difference between arms although there was a large difference in toxicities. Details of this analysis are available in Creagan et al.15 or Sloan et al.12

Fig. 3. Power for n = 300 per group, HR = 1.0, and Ut = 0.3 for varying toxicity rates in the reference and other arm. [Plot of power (0 to 100) against the percent of days with toxicity in the reference arm (0 to 0.9), with one curve each for 0.10, 0.20, 0.30, and 0.40 higher toxicity in the other arm.]

Fig. 4. Power for n = 300 per group, HR = 1.0, and Ut = 0.3 for varying toxicity rates in the reference and other arm. [Plot of power (0 to 100) against the percent of days with toxicity in the reference arm (0 to 0.6), with one curve each for 50% more, 75% more, double, triple, and quadruple toxicity in the other arm.]

In this study, each arm had about 130 patients, both arms had about 10% of patients with censored survival times, and the HRs in the two arms were similar (HR not significantly different from one). However, patients receiving twice daily radiation had toxicity on 54% of their survival days, whereas patients in the other arm had toxicity on only 39% of their days. Based on simulations (Table 8), this study had only 56% power to detect a significant difference in QALYs. The simulations likely explain why the QALY difference was nonsignificant although the toxicity differences between arms were large.

The Appendix (available at jpsmjournal.com) provides SAS code that can be used to explore the power of a QALY analysis for any study.

Table 6. Difference in Toxicity Rate Needed for 80% or 90% Power With Ut = 0.3 and 10% Toxicity in One Arm

| N Per Group | 80% Power, HR = 1.00 | 80% Power, HR = 1.25 | 80% Power, HR = 1.50 | 90% Power, HR = 1.00 | 90% Power, HR = 1.25 | 90% Power, HR = 1.50 |
|---|---|---|---|---|---|---|
| 50 | 59 | 40 | 20 | 65 | 48 | 31 |
| 100 | 44 | 22 | 0 | 50 | 29 | 8 |
| 150 | 38 | 13 | 0 | 42 | 18 | 0 |
| 200 | 33 | 8 | 0 | 38 | 14 | 0 |
| 300 | 28 | 2 | 0 | 31 | 5 | 0 |
| 500 | 22 | 0 | 0 | 25 | 0 | 0 |
| 1000 | 16 | 0 | 0 | 19 | 0 | 0 |

HR = hazard rate.
For n = 100 per group, a hazard rate of 1, and 10% of days with toxicity in one arm, the other arm would need to have a 44% higher toxicity rate (or 54% of days with toxicity) to have 80% power of finding a significant difference between the two arms.

Table 7. Difference in Toxicity Needed To Get Below 10% or 20% Power With Ut = 0.3 and 10% Toxicity in One Arm (a)

| N Per Group | 10% Power, HR = 0.50 | 10% Power, HR = 0.75 | 20% Power, HR = 0.50 | 20% Power, HR = 0.75 |
|---|---|---|---|---|
| 50 | 56 | 16 | 47 | 4 |
| 100 | 60 | 23 | 54 | 13 |
| 150 | 61 | 25 | 56 | 18 |
| 200 | 62 | 26 | 58 | 20 |
| 300 | 63 | 28 | 60 | 23 |
| 500 | 64 | 28 | 61 | 26 |
| 1000 | 65 | 31 | 64 | 29 |

HR = hazard rate.
For n = 100 per group, if one arm has 10% of days with toxicity and the other arm has a lower hazard rate of 0.50 but 60% more days with toxicity (a total of 70% of days with toxicity), then there is only a 10% chance that the results will be significantly different. Therefore, the study will likely report no difference in QALYs although the overall survival rate is lower in one arm.
(a) Likely to report no significant difference between groups even with one arm having a lower hazard rate because it also has more toxicity.

Discussion

The algorithm developed herein can be used to design QALY-based studies with appropriate (realistic) power and effect sizes, and is useful for reviewing the observed power of completed clinical trials. These simulations provide general guidelines that are helpful for designing future QALY studies.

The simulation results indicate that relatively large sample sizes are needed for designing QALY studies with a reasonable likelihood of detecting statistically significant differences. If double or triple the toxicity in one arm relative to another is expected, the study will need to accrue in excess of 1000 patients if there is no expected survival difference (e.g., if the goal is to demonstrate a QALY difference in a noninferiority trial). If the HR is 1.25 or greater, then sample sizes of 200 or 150 are needed to detect a significant difference if the toxicity rates are double or triple, respectively. To obtain significant results, utilities must be small, with values between 0.3 and 0.5 working reasonably well. If the HR is 1.5 or greater and the sample size exceeds 50 per arm, any utilities and any differences in toxicities will provide at least 80% power. In this case, there is no benefit in using QALYs. Large differences in toxicity rates are needed if the percentage of days with toxicity in the reference arm is small. The conclusions are similar if the Wilcoxon P-value is used instead of the log-rank P-value.

Results also indicated that the definition used for duration of toxicity has a huge impact on the design sensitivity and results of QALY studies. There are numerous alternative definitions that can be used to reflect either acute short-term toxicity or lingering long-term toxicity. Some authors have assumed that every incidence of toxicity lasted three months; others, including ourselves, assumed a more modest one cycle of impact as a result of any reported toxicity. This assumption by itself constrains the sensitivity of QALY models, but realistic assumptions need to be included in the model for its ultimate veracity.

Table 8. Power for QALY in Study 89-20-52 for Varying Survival Hazard Rates (HRs) and Utilities

| HR | Ut = 0.0 | Ut = 0.3 | Ut = 0.5 | Ut = 0.7 |
|---|---|---|---|---|
| 0.50 | 90 | 99 | 100 | 100 |
| 0.75 | 6 | 20 | 32 | 45 |
| 1.00 (Observed HR) | 56 | 22 | 11 | 7 |
| 1.25 | 97 | 82 | 68 | 57 |
| 1.50 | 100 | 99 | 97 | 94 |
| 1.75 | 100 | 100 | 100 | 100 |
| 2.00 | 100 | 100 | 100 | 100 |

QALY = quality-adjusted life year.
The HR = 1.00 row corresponds to no survival difference.
No comparisons approached statistical significance because power for QALY differences was insufficient in this study.
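The observed-HR row of Table 8 can be roughly checked with a closed-form approximation instead of a simulation. The Python sketch below is not the authors' method. It assumes exponential survival, a fixed fraction of each patient's days spent with toxicity (39% and 54%, as reported for this study), one utility for all toxic days, and the Schoenfeld approximation to log-rank power. The figure of 234 events (about 130 patients per arm with about 10% censored) is an assumption made here, and the function name is hypothetical.

```python
import math
from statistics import NormalDist

def approximate_qaly_power(n_events_total, hazard_ratio, tox_ref, tox_other,
                           utility, alpha=0.05):
    """Schoenfeld-style approximation to two-sided log-rank power on the QALY scale."""
    weight_ref = 1.0 - tox_ref * (1.0 - utility)
    weight_other = 1.0 - tox_other * (1.0 - utility)
    # Scaling survival by a weight divides the hazard by that weight.
    log_hr_qaly = math.log(hazard_ratio * weight_ref / weight_other)
    normal = NormalDist()
    z_alpha = normal.inv_cdf(1 - alpha / 2)
    shift = abs(log_hr_qaly) * math.sqrt(n_events_total) / 2.0
    return normal.cdf(shift - z_alpha) + normal.cdf(-shift - z_alpha)

# Assumed event count: about 130 patients per arm, about 10% censored.
for utility in (0.0, 0.3, 0.5, 0.7):
    power = approximate_qaly_power(234, 1.0, 0.39, 0.54, utility)
    print(utility, round(100 * power))
```

Under these assumptions the approximation falls within a few points of the simulated values in the HR = 1.00 row, and it shows the same pattern: the higher the utility assigned to toxic days, the less a 15-point toxicity difference can move the QALY comparison.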

These results have implications for QALYs in health care research. Based on our simulations, QALYs cannot be expected to be significantly different between two treatments unless there are large differences in either HRs or toxicities. Perhaps another metric can be created to generate results that better match intuition. For example, if having twice the toxicity rate is considered clinically meaningful, then QALYs should be set up to detect that. Future QALY research studies need to keep these power considerations in mind. There is also a need to refine the QALY parameters to bring the math in line with clinical perspective. If doctors are seeing a clinically meaningful difference in toxicities, the QALYs need to be adjusted to detect those differences. Until QALY models can more accurately reflect the clinical reality, they will remain a rarely used and poorly understood analytic method.

Disclosures and Acknowledgments

This study was supported in part by Public Health Service grants CA-25224 and 5U10CA149950-02. The authors report no conflicts of interest in this work.

References

1. Gelber RD, Gelman RS, Goldhirsch A. A quality-of-life-oriented endpoint for comparing therapies. Biometrics 1989;45:781-795.

2. Tsuchiya A, Dolan P. The QALY model and individual preferences for health states and health profiles over time: a systematic review of literature. Med Decis Making 2005;25:460-467.

3. Revicki DA, Feeny D, Hunt TL, Cole BF. Analyzing oncology clinical trial data using the Q-TWiST method: clinical importance and sources for health state preference data. Qual Life Res 2006;15:411-423.

4. Mounier N, Ferme C, Flechtner H, Henzy-Amar M, Lepage E. Model-based methodology for analyzing incomplete quality-of-life data and integrating them into the Q-TWiST framework. Med Decis Making 2003;23:54-66.

5. Sloan JA, Sargent DJ, Lindman J, et al. A new graphic for quality adjusted life years (Q-TWiST) survival analysis: the Q-TWiST plot. Qual Life Res 2002;11:37-45.

6. Grann VR, Jacobson JS, Thomason D, et al. Effect of prevention strategies on survival and quality-adjusted survival of women with BRCA1/2 mutations: an updated decision analysis. J Clin Oncol 2002;20:2520-2521.

7. Jang RW, Le Maitre A, Ding K, et al. Quality-adjusted time without symptoms or toxicity analysis of adjuvant chemotherapy in non-small-cell lung cancer: an analysis of the National Cancer Institute of Canada Clinical Trials Group JBR.10 Trial. J Clin Oncol 2009;27:4268-4273.

8. Karlsson S, Ruokonen E, Varpula T, Ala-Kokko TI, Pettila V, Finnsepsis Study Group. Long-term outcome and quality-adjusted life years after severe sepsis. Crit Care Med 2009;37:1268-1274.

9. Weinstein MC, Skinner JA. Comparative effectiveness and health care spending - implications for reform. N Engl J Med 2010;362:460-465.

10. Lipscomb J, Drummond M, Fryback D, Gold M, Revicki D. Retaining, and enhancing, the QALY. Value in Health 2009;12:S18-S26.

11. Garau M, Shah KK, Mason AR, et al. Using QALYs in cancer: a review of the methodological limitations. Pharmacoeconomics 2011;29:673-685.

12. Sloan JA, Dueck A, Frost MH, et al. Applying QOL assessments: solution for oncology clinical practice and research, part 2. Curr Probl Cancer 2006;30:235-331.

13. Cole BF, Gelber RD, Gelber S, Mukhopadhyay P. A quality-adjusted survival (Q-TWiST) model for evaluating treatments for advanced cancer. J Biopharm Stat 2004;14:111-124.

14. Kilbridge KL. Quality-adjusted life-years, comparative effectiveness in cancer care, and measuring outcomes in the underserved. Oncology 2010;24:530-537.

15. Creagan ET, Dalton RJ, Ahmann DL, et al. Randomized, surgical adjuvant clinical trial of recombinant interferon alfa-2a in selected patients with malignant melanoma. J Clin Oncol 1995;13:2776-2783.
