Understanding the WWC Procedures and Standards Webinar
Transcript
WWC Recertification for Standards Version 4.0 Webinar Transcript
January 12, 2018
Slide 1
Hello everyone, and thank you for attending today’s
webinar, “What Works Clearinghouse (WWC) Reviewer Recertification
Training.” The webinar will begin with a brief introduction from
Chris Weiss, Senior Education Research Scientist, National Center
for Education Evaluation and Regional Assistance, Institute of
Education Sciences, U.S. Department of Education.
Following Chris’s introduction, Neil Seftor, Mathematica’s
project director for the What Works Clearinghouse, will provide an
overview of today’s presentation, the recertification process, and
what has changed in version 4.0 of the WWC procedures and
standards. Next, Allison McKie and Dana Rotz will provide
information on some of the major changes in procedures and
standards to help reviewers prepare for recertification. We have
saved about 20 minutes at the end of the presentation for Q&A
related to the recertification process. You can also submit technical
questions on the standards and procedures throughout the webinar;
however, we will respond to those during the office hours.
I will be briefly going through some housekeeping information
before we get started. You can make the slides larger on your
screen by clicking the bottom right corner of the slides window and
dragging. If you have accessed the audio for the webinar through
the teleconference line, you may experience a slight delay. If
possible, we encourage you to listen to the webinar through your
computer or device speakers. We encourage you to submit questions
throughout the webinar, using the Q&A tool on the webinar
software on your screen. You can ask a question when it comes to
mind; you don't have to wait until the question and answer session.
Because we're recording this, every member of the audience is in
listen‐only mode. That improves the sound quality of the recording,
but it also means that the only way to ask questions is through the
Q&A tool, so please use that. We will try to answer as many
questions as possible. The slide deck, and a recording and
transcript of the webinar will be available on the WWC website for
download. So, with that introduction, let’s get started.
Chris, you now have the floor.
Thank you, Brice. Good afternoon. I’m Chris Weiss, team lead for
the What Works Clearinghouse. On behalf of the Institute of
Education Sciences, it’s my pleasure to welcome you to this
webinar. As you know, the WWC updated our standards and procedures
a few months ago. Today’s webinar will discuss changes between
version 3.0 and the new version 4.0 handbooks. We’re very excited
to have this opportunity to tell you more about the new standards
and procedures. I also wanted to take a moment to express my thanks
to you, not only for your participation in today’s webinar, but
also for the contributions you make to the What Works
Clearinghouse. And thank you for this. And with that, I’ll turn it
over to Neil Seftor. Neil?
Thanks, Chris. And I’d like to thank everyone for joining us for
today’s webinar, Recertification for the version 4 standards. I’m
Neil Seftor, a senior researcher at Mathematica Policy Research, and
I oversee Mathematica’s work on the What Works Clearinghouse. My
co‐presenters today are Allison McKie and Dana Rotz, also senior
researchers at Mathematica, who are involved in WWC training
activities.
Slide 2
Before we begin, a little context for this webinar. The WWC
updates its procedures and standards for two key reasons. First,
existing research methods evolve and new research methods are
developed. In order to keep pace with the field, we revise and
develop standards to deal with these changes that appear in the
research we review. Second, we sometimes find that our procedures
or standards are unclear or don’t deal with specific situations. We
find these in the course of reviewing studies, and through
questions and comments from education decision makers and
researchers. In the short run, we provide additional guidance to
reviewers and answers to people who send us questions, but then we
incorporate all of that information in the next version of the
handbook. After an extensive process involving many levels of
review by methodological experts, the version 4 handbooks were
released in October, and they reflect the most up‐to‐date
procedures and standards used by the WWC. In today’s webinar, I’ll
first explain the process that reviewers can use to keep their
certification current, and then give a high‐level overview of the
changes in version 4. Next, I’ll turn it over to Allison and Dana
to discuss some things that have changed significantly, including
procedures for how the WWC defines a study and distinguishes
different types of findings, along with some updates to standards
for baseline equivalence and missing data.
Slide 3
Let’s go over the process for recertification.
Slide 4
The recertification process has been created to allow reviewers
who are certified on version 3 of the WWC’s group design standards
to update their certification to version 4. If you are not
currently certified on version 3 of the standards, you may not use
this process to be certified on version 4. Instead, you’ll need to
go through the full training on Group Design standards, which has
been updated to reflect the latest changes.
The recertification process uses several resources to give you
the information you need to know about the changes to the standards
and procedures. Today’s webinar is one resource, which will cover
the topics I listed before. At the end of today’s webinar, we will
answer questions related to the recertification process only. In
two weeks, we will be providing follow‐up office hours, during
which we will answer questions you may have related to the
standards and procedures. Another resource is the training module
on cluster‐level assignment. The standards for reviewing these
studies have changed significantly. Rather than try to add all of
that material to this webinar, we will point you to a video that
describes the standards for cluster‐level assignment in detail. The
webinar, office hours, and cluster module will all be made
available for your viewing. In addition to these webinars and
videos, we recommend reviewing the procedures and standards
handbooks.
Next, we ask you to demonstrate your understanding of the
material. The recertification exam consists of ten multiple‐choice
questions. To pass, you must answer eight of these correctly, and
you’ll have two opportunities to pass the test. The exam is
challenging, because the new standards deal with complex issues.
However, if you take your time and use the resources, which can
also be used during the test, you should not have trouble. Finally,
after passing the exam, we ask you to view one more training
module, which introduces you to the new online study review
guide. The certification status of reviewers listed on the WWC
website will be updated as recertification is completed.
Slide 5
So, what’s new in version 4?
Slide 6
The most obvious change is that we’ve split our procedures and
standards handbooks into two separate documents.
The five steps listed below comprise our systematic review
process. The WWC procedures handbook contains information on most
of the steps in the process, including finding studies,
screening them for eligibility, and reporting on findings. The WWC
standards handbook contains detailed information on the standards we
use to assess quality. These two version 4 documents replace the
single version 3 document. They are now split to make future
updates to procedures and standards easier.
Slide 7
What content has changed?
On the procedure side, we will talk about the bolded items
today, including how the WWC defines a study and reports on
different types of findings. Other changes that we won’t cover, and
are not included in the exam, focus on aspects of the WWC reporting
of findings. Information on these changes can be found in the
procedures handbook.
Slide 8
On the standards side, we will talk about updates to the
baseline equivalence and missing data standards. You will also
learn about the new standards for reviewing studies with
cluster‐level assignment in a separate training module. The
remaining items, which are described in detail in the standards
handbook, include some clarifications on attrition terminology and
outcome requirements, along with extensive changes to the standards
used to review studies with specific designs and analyses. And with
that, I’ll turn it over to Allison to start describing some of the
more significant changes in version 4.
Slide 9
Thank you, Neil.
Slide 10
Because studies are the building blocks of systematic reviews,
the WWC needs a definition for what constitutes a study. In
practice, a single manuscript, such as a journal article or report,
might include more than one study, or multiple manuscripts might
all report on the same study. So drawing the line to distinguish
one from the other can be challenging. The version 4 procedures
include a new, more formal definition of a study that we can
implement more consistently. When analyses of the same intervention
share certain characteristics, we consider them parts of the same
study. These characteristics include:
The assignment or selection process used to create the
intervention and comparison groups in the study sample: A random
assignment process is one way of creating the study sample. Another
way is to form groups by matching students who received the
intervention to comparison students who did not.
The study sample: This is the set of students, classrooms,
teachers, or schools that the study analyzed. Findings from
analyses that include some or all of the same sample members may be
related. Note that the samples do not need to be identical; they
just need to overlap.
The data collection and analysis procedures used to produce the
findings: When authors use identical or nearly identical procedures
to collect and analyze data, the findings may be related.
And finally, the research team that conducted the study: When
manuscripts share one or more authors, the findings reported in
those manuscripts may be related.
When two findings on the effectiveness of the same intervention
share at least three of these characteristics, we consider them
parts of the same study. Findings that meet this criterion
demonstrate similarity or continuity in the intervention and
comparison groups, and similarity or continuity in the procedures
used to produce the findings.
Slide 11
Let's consider a few scenarios to help clarify how the WWC
defines a study. First, consider a manuscript that presents a
finding based on combining data across related samples, such as
those from multiple time periods within the same school.
For example, imagine that the researchers randomly assigned
classrooms within a middle school to conditions, and the same
intervention is implemented in the intervention classrooms over
several school years. The study examines the effectiveness of the
intervention in three consecutive cohorts of students in the middle
school and measures the effectiveness of the intervention based on
each cohort separately, but using the same analytic procedures.
Findings for the three cohorts are presented in a single
manuscript.
In this case, we would review all the findings in the manuscript
as a single study because they share all four of the study
characteristics: the same random assignment process was used for
each cohort, the study samples in each cohort were formed within
the same school, the researchers used the same data collection and
analysis procedures, and the same research team was responsible for
all of the findings.
Slide 12
Now, imagine that instead of reporting the combined finding in
one manuscript, the study authors released a series of reports,
each reporting a single‐year finding.
These findings again share all four characteristics, so the WWC
would review the series of reports as a single study. The fact that
the authors released the findings separately in different reports
does not affect how we review the findings.
Slide 13
Now consider a journal article on the effectiveness of an
algebra curriculum on students’ scores on a standardized math
assessment.
The study conducted a randomized controlled trial in New Jersey
schools, and used a quasi‐experimental design in Wisconsin
schools.
Although the study authors report findings for both samples in
the same journal article, both the sample members and the
assignment process differ in the two samples. Because the findings
share fewer than three of the characteristics, we would review the
randomized controlled trial and quasi‐experimental design as two
separate studies.
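The three-of-four rule applied in the scenarios above can be sketched in a few lines of Python. This is a minimal sketch, not WWC tooling: the class and function names are hypothetical, and each characteristic is entered as a reviewer's yes/no judgment rather than derived from data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SharedCharacteristics:
    """Reviewer judgments about two findings on the same intervention.

    Field names are hypothetical labels for the four characteristics
    described in the webinar.
    """
    same_assignment_process: bool
    overlapping_sample: bool
    similar_procedures: bool
    shared_authors: bool

    def count(self) -> int:
        # Booleans sum as 0/1, giving the number of shared characteristics.
        return sum((
            self.same_assignment_process,
            self.overlapping_sample,
            self.similar_procedures,
            self.shared_authors,
        ))


def is_same_study(shared: SharedCharacteristics) -> bool:
    """Two findings belong to one study if they share at least three characteristics."""
    return shared.count() >= 3


# Cohort findings (one manuscript or a series of reports): all four shared.
cohorts = SharedCharacteristics(True, True, True, True)

# Algebra curriculum article: assignment process and sample members differ.
# Procedures and authors are ASSUMED shared here for illustration only.
algebra = SharedCharacteristics(False, False, True, True)

print(is_same_study(cohorts))  # True
print(is_same_study(algebra))  # False
```

Even under the generous assumption that the New Jersey and Wisconsin analyses share procedures and authors, only two characteristics are shared, so the sketch treats them as separate studies.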
Slide 14
Now let’s turn to updates to reporting procedures, starting with
the guidelines the WWC follows in distinguishing among main
findings, supplemental findings, and sensitivity analyses.
Slide 15
Often studies will report multiple findings, sometimes for the
same outcome measure on different samples, or for multiple outcomes
within a single outcome domain. In most products, the WWC reports
on all eligible findings that meet WWC design standards with or
without reservations.
Among those eligible findings that meet standards, we designate
one or more findings reported in the study as the main findings
that contribute to the WWC’s summary of the evidence. In
particular, we identify a main finding using three criteria. A main
finding:
‐ uses the full sample, rather than a subgroup;
‐ uses an aggregate or composite outcome measure, rather than
individual subscales; and
‐ is the finding closest to the preferred follow‐up time point, as
specified in the review protocol.
This preferred follow‐up period is most often either the
immediate posttest, meaning the earliest follow‐up period after the
conclusion of the intervention, or the latest follow‐up period,
depending on the topic of interest.
If no one finding meets all three of these criteria, we will
select a set of findings that together meet the criteria. In doing
so, we seek to avoid overlap in subscales or subsamples.
For example, if a finding for a composite measure does not meet
standards but findings for separate subscales of a composite
measure meet standards, are based on the full sample, and are
reported at the preferred follow‐up period, then we would report
the findings for each subscale as main findings.
As we did under version 3 standards and procedures, under
version 4 we combine subsamples into an aggregate finding if no
aggregate finding meets WWC group design standards.
These rules allow the WWC to characterize a study’s findings
based on the most comprehensive information available. However, not
all studies will report a single finding or set of findings that
reflects
the full sample for the composite outcome measure and preferred
time period. When applying these rules is not straightforward
because of incomplete information about findings, overlapping
samples, or other complications, the review team leadership has
discretion for a study or group of studies under review to identify
main findings in a way that best balances the goals of
comprehensively characterizing each study’s findings and presenting
the findings in a clear and straightforward manner.
Slide 16
All eligible findings that meet WWC group design standards and
are not identified as main findings are considered either
supplemental findings or sensitivity analyses.
Supplemental findings do not contribute to the WWC’s summary of
evidence but may be reported separately in the WWC product that
includes the study review. In intervention reports, for example,
supplemental findings are reported in an appendix.
Sensitivity analyses are acknowledged in a note in the WWC
product but are not reported.
Findings for subsamples, subscales, and time periods other than
the preferred follow‐up period that are not identified as main
findings are supplemental findings.
Note that when the WWC calculates an aggregate finding from
subsample findings, we report each subsample finding that meets
standards as a supplemental finding.
Slide 17
A sensitivity analysis uses the same (or a very similar) sample
as the main finding, but applies a different analytic method, such
as using a different set of control variables.
To differentiate main findings from sensitivity analyses among
the findings that meet standards, we identify as the main finding
the one from the analysis that
‐ receives the highest rating,
‐ accounts for the baseline measures specified in the review
protocol,
‐ uses the most comprehensive sample, and
‐ is most robust to threats to internal validity based on the
judgment of either the study authors, as reported in the study, or
review team leadership.
Let’s go through an example.
Slide 18
A study estimated effects of a 1‐year intervention on SAT‐9
scores for students in grade 2. The study is reviewed using the
Beginning Reading protocol, which prioritizes immediate posttest
findings. All of the findings we will discuss are from eligible
analyses.
First, suppose the following findings meet WWC group design
standards:
‐ 1‐year impact for girls,
‐ 1‐year impact for all students,
‐ 2‐year impact for all students, and
‐ 1‐year impact for all students estimated using a simpler
regression model.
We assume the simpler regression model has weaker internal
validity than the full regression model used for the first three
findings.
The 1‐year impact for boys does not meet standards.
Under this scenario, the main finding is the 1‐year impact for
all students because it is the full‐sample finding closest to the
preferred immediate posttest time period that is the most robust to
threats to internal validity.
The findings for subsamples and other time periods that meet
standards are supplemental findings: 1‐year impact for girls and
2‐year impact for all students.
The 1‐year impact for all students estimated using the simpler
and presumably weaker regression model is classified as a
sensitivity analysis.
Note that the 1‐year impact for boys does not appear under any
of our classifications because the finding does not meet standards.
The WWC reports only findings that meet design standards.
Slide 19
Now let’s consider the same study but instead suppose only the
following findings meet standards:
‐ 1‐year impact for girls,
‐ 2‐year impact for all students, and
‐ 1‐year impact for all students estimated using a simpler
regression model.
Now, the main finding is the 1‐year impact for all students
estimated using a simpler regression model because it is the only
full‐sample finding for the preferred time period that meets design
standards.
The 1‐year impact for girls and the 2‐year impact for all
students are again supplemental findings.
There are no sensitivity analyses. No findings from analyses
that use a similar sample as the main finding but a different
analytic method meet standards.
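The two scenarios on Slides 18 and 19 can be reproduced with a small Python sketch. It is deliberately simplified and rests on assumptions: the `Finding` fields and the numeric `robustness` rank are hypothetical stand-ins for reviewer judgments, and the sketch ignores the composite-versus-subscale criterion, rating comparisons, and the aggregation of subsamples.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    label: str
    full_sample: bool           # False for subgroup findings such as girls only
    preferred_time_point: bool  # True for the immediate (1-year) posttest here
    robustness: int             # hypothetical rank; higher = more robust analysis
    meets_standards: bool


def classify(findings):
    """Label each finding that meets standards; the rest are not reported."""
    eligible = [f for f in findings if f.meets_standards]
    # Main-finding candidates: full sample at the preferred time point.
    candidates = [f for f in eligible if f.full_sample and f.preferred_time_point]
    main = max(candidates, key=lambda f: f.robustness, default=None)
    labels = {}
    for f in eligible:
        if f == main:
            labels[f.label] = "main"
        elif f in candidates:
            # Same sample and time point as the main finding, different method.
            labels[f.label] = "sensitivity analysis"
        else:
            labels[f.label] = "supplemental"
    return labels


def scenario(full_model_meets_standards):
    return [
        Finding("1-year, girls", False, True, 2, True),
        Finding("1-year, boys", False, True, 2, False),
        Finding("1-year, all students", True, True, 2, full_model_meets_standards),
        Finding("2-year, all students", True, False, 2, True),
        Finding("1-year, all students, simpler model", True, True, 1, True),
    ]


slide_18 = classify(scenario(True))
slide_19 = classify(scenario(False))
print(slide_18)
print(slide_19)
```

In the first scenario the simpler model is a sensitivity analysis; in the second it becomes the main finding because no stronger full-sample finding at the preferred time point meets standards.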
Slide 20
Another important reporting update concerns when to apply a
multiple comparison adjustment.
Under version 4 procedures, the WWC applies a multiple
comparison adjustment only to main findings.
We do not apply the adjustment to supplemental findings or
sensitivity analyses.
The procedure itself has not changed. We perform the multiple
comparison adjustment using the same Benjamini‐Hochberg correction
that we used under version 3 of the WWC Standards and
Procedures.
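For readers who want to see the mechanics, here is a minimal Python sketch of the textbook Benjamini-Hochberg step-up procedure applied to a set of main findings. The function name and the p-values are made up for illustration, and the procedures handbook remains the authority on exactly how the WWC implements the correction (for example, how findings are grouped by outcome domain).

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans: True where a finding stays significant.

    Textbook step-up rule: sort the m p-values, find the largest rank k
    with p_(k) <= k * alpha / m, and keep every finding ranked 1..k.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    largest_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * alpha / m:
            largest_rank = rank
    significant = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= largest_rank:
            significant[idx] = True
    return significant


# Hypothetical p-values for four main findings in one domain.
print(benjamini_hochberg([0.01, 0.04, 0.03, 0.20]))
# Step-up behavior: the first two ranks fail their own cutoffs but are
# still kept because a later rank passes.
print(benjamini_hochberg([0.03, 0.03, 0.03, 0.04]))
```

Under version 4, only the main findings would be passed to such a function; supplemental findings and sensitivity analyses are left unadjusted.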
Slide 21
Next we will look at an update related to baseline equivalence.
Version 4 standards use a more flexible statistical adjustment
requirement for studies with moderate levels of differences in key
baseline measures.
Slide 22
As you know, we assess baseline equivalence for randomized
controlled trials with high attrition, randomized controlled trials
with compromised random assignment, and quasi‐experimental designs
by examining the standardized mean difference, or effect size,
between the analytic intervention and comparison groups on baseline
measures specified in the protocol.
If the absolute value of the baseline effect size is less than
or equal to 0.05, the study satisfies the baseline equivalence
requirement.
If the absolute value of the baseline effect size is larger than
0.25, the study does not satisfy the equivalence requirement.
For differences between 0.05 and 0.25, the WWC requires a
statistical adjustment to satisfy equivalence. Under version 3
standards, this statistical adjustment had to be a regression
covariate adjustment or an analysis of covariance.
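The three ranges just described can be written as a short Python sketch. The function name and the returned labels are hypothetical; only the 0.05 and 0.25 cutoffs come from the standards as described here.

```python
def baseline_equivalence_status(effect_size):
    """Classify a baseline effect size (in standard deviation units)."""
    size = abs(effect_size)  # the sign of the difference does not matter
    if size <= 0.05:
        return "satisfied"
    if size > 0.25:
        return "not satisfied"
    return "requires statistical adjustment"


for value in (0.03, -0.07, 0.25, 0.30):
    print(value, baseline_equivalence_status(value))
```

Note that a difference of exactly 0.25 still falls in the adjustment range, while a difference of exactly 0.05 satisfies the requirement outright.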
Slide 23
Under version 4 standards, other analytical approaches may
satisfy the statistical adjustment requirement if two conditions
are met:
First, the baseline characteristic and the outcome measure must
be measured using the same units.
This condition would be met if the researchers administered the
same test and used the same scales and scoring procedures for the
baseline characteristic and the outcome measure.
This condition would not be met if the researchers administered
different tests for the baseline and outcome measures, or if they
used the same test, but used different scoring procedures.
Second, the baseline characteristic and outcome measure must
have a correlation of 0.6 or higher.
When these two conditions are met, there are three additional
methods that can be used to satisfy the statistical adjustment
requirement:
‐ a difference‐in‐differences adjustment,
‐ gain scores, and
‐ fixed effects.
When the study authors do not perform a statistical adjustment,
but these two conditions hold, the WWC will perform its own
difference‐in‐differences adjustment to satisfy the statistical
adjustment requirement.
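The version 4 adjustment rule for baseline differences between 0.05 and 0.25 can be sketched as follows. This is an illustration under assumptions: the method strings and the function name are hypothetical, `None` stands for "the authors made no adjustment", and the sketch only answers whether the statistical adjustment requirement can be satisfied.

```python
# Methods accepted under version 3 and still accepted under version 4.
ALWAYS_ACCEPTABLE = {"regression covariate adjustment", "analysis of covariance"}
# Methods accepted under version 4 only when both conditions hold.
CONDITIONAL = {"difference-in-differences", "gain scores", "fixed effects"}


def adjustment_requirement_satisfied(method, same_units, correlation):
    """Check the adjustment requirement for a moderate baseline difference.

    method: one of the strings above, or None if the authors did not adjust
            (the WWC then applies its own difference-in-differences
            adjustment, provided both conditions hold).
    same_units: baseline and outcome use the same test, scale, and scoring.
    correlation: baseline-outcome correlation.
    """
    conditions_met = same_units and correlation >= 0.6
    if method in ALWAYS_ACCEPTABLE:
        return True
    if method is None or method in CONDITIONAL:
        return conditions_met
    raise ValueError(f"method not covered by this sketch: {method!r}")


print(adjustment_requirement_satisfied("analysis of covariance", False, 0.3))  # True
print(adjustment_requirement_satisfied("gain scores", True, 0.59))             # False
print(adjustment_requirement_satisfied(None, True, 0.7))                       # True
```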
Going back to reporting for a moment, recall that in reporting
findings that meet standards, the WWC performs a
difference‐in‐differences adjustment when an analysis does not
adjust for baseline differences. We still do a
difference‐in‐differences adjustment for such studies, even if the
correlation between the baseline characteristic and the outcome is
below 0.6.
Let’s do a couple of knowledge checks.
Slide 24
Knowledge Check 1. A quasi‐experimental design study analyzed
raw Peabody Picture Vocabulary Test (or PPVT) scores for 60
intervention and 60 comparison group students, estimating impacts
by comparing unadjusted means. We are told that the baseline effect
size is 0.07 standard deviations. The correlation between the raw
PPVT scores at baseline and follow‐up is 0.82.
What is the highest rating this study is eligible to
receive?
A. Meets WWC Group Design Standards Without Reservations,
B. Meets WWC Group Design Standards With Reservations, or
C. Does Not Meet WWC Group Design Standards
Slide 25
The correct answer is B: Meets WWC Group Design Standards With
Reservations.
The study uses a quasi‐experimental design and unadjusted means
to assess impacts. The baseline difference between the intervention
and comparison groups is 0.07 standard deviations, and the pretest
and the posttest are measured using the same units. The correlation
between the pretest and posttest is over 0.6. Therefore, the WWC
can do a difference‐in‐differences adjustment and the study is
eligible to receive the Meets WWC Group Design Standards With
Reservations rating.
A and C are incorrect answers.
Slide 26
Knowledge Check 2. A quasi‐experimental study analyzed the
impact of a high school tutoring program by comparing student
scores on a grade 11 district‐wide math test at the end of the
school year for 40 students who received the program to scores for
40 students who did not receive the program. Normalized baseline
scores on the grade 10 district‐wide math test have a mean of 0.04
for the intervention group, a mean of ‐0.04 for the comparison
group, and a standard deviation of 1 for both groups.
To estimate impacts, the authors normalized the grade 10 and
grade 11 scores so that each had a mean of 0 and standard deviation
of 1, stacked the data, and estimated a model with student fixed
effects, a time fixed effect, and an interaction between time and
an indicator for being in the intervention group. The coefficient
on the interaction term was the measure of the program’s
effect.
What is the highest rating this study is eligible to
receive?
A. Meets WWC Group Design Standards Without Reservations,
B. Meets WWC Group Design Standards With Reservations, or
C. Does Not Meet WWC Group Design Standards
Slide 27
The correct answer is “C”: Does Not Meet WWC Group Design
Standards.
The baseline difference between the intervention and comparison
groups is 0.08 standard deviations. The authors use individual
fixed effects to account for the baseline difference, but the
baseline and outcome scores do not have the same units. Even though
the grade 10 and grade 11 math tests are both district‐wide
assessments and have been transformed to the same scale, they are
not the same assessment. Therefore, the study does not satisfy the
baseline equivalence requirement, and it should receive the Does
Not Meet WWC Group Design Standards rating.
“A” and “B” are incorrect answers.
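The 0.08 baseline difference quoted in this answer follows directly from the group means given in the question. The Python sketch below reproduces the arithmetic using a plain pooled-standard-deviation difference; that formula choice is an assumption made for illustration, and the standards handbook gives the exact effect size computation the WWC uses.

```python
import math


def standardized_difference(mean_i, mean_c, sd_i, sd_c, n_i, n_c):
    """Difference in means divided by the pooled standard deviation."""
    pooled_variance = (
        (n_i - 1) * sd_i ** 2 + (n_c - 1) * sd_c ** 2
    ) / (n_i + n_c - 2)
    return (mean_i - mean_c) / math.sqrt(pooled_variance)


# Knowledge Check 2: normalized grade 10 scores, 40 students per group.
difference = standardized_difference(0.04, -0.04, 1.0, 1.0, 40, 40)
print(round(difference, 2))  # 0.08
```

Because 0.08 falls between 0.05 and 0.25, an acceptable statistical adjustment is required, which is why the question turns on whether the fixed effects approach qualifies.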
I will now turn it over to Dana to talk about the revised
missing data standards.
Thanks, Allison.
Slide 28
We’re going to spend the rest of the webinar today, or the bulk
of the rest of the webinar today, discussing the revised missing
data standards. I’m going to start off by giving you a broad
picture of what’s changed with the WWC missing data standards. I’m
then going to talk about a particular case that reviewers often run
into, and that’s when baseline data is missing for some members of
the analytic sample. I’m then going to dig a little bit deeper into
the overall review process for reviewing studies with missing
data.
Slide 29
So as I mentioned, there have been major revisions to the
missing data standards. Let me just tell you briefly about how
that’s operationalized.
Under standards version 3.0, authors had to use an acceptable
approach to address all missing data in the analytic sample. That
itself hasn’t changed under standards version 4.0. However, there
have been some revisions to the list of methods for addressing
missing data that the WWC has classified as acceptable.
Under standards version 3.0, studies could satisfy baseline
equivalence only by using non‐imputed data for the entire analytic
sample. That’s been revised under standards version 4.0. Under
standards version 4.0, studies can satisfy baseline equivalence
using data on a subset of the analytic sample, or imputed data for
the analytic sample.
Finally, under standards version 3.0, only low‐attrition RCTs
that impute outcome data were eligible to meet WWC group design
standards. Under version 4.0, low‐attrition RCTs, high‐attrition
RCTs, and QED studies that impute outcome data are all eligible to
meet group design standards, so long as they limit the potential
bias from analyzing imputed data. And I’ll tell you more about what
that means in a moment.
Slide 30
So let's talk about one common situation that a reviewer might
run into. This is possibly the most common situation with missing
data, and this is when baseline data are missing for a small number
of
observations in the analytic sample, but we still need to assess
baseline equivalence. So for example, some students with outcome
data were absent on the day of a pretest.
The WWC has calculated new formulas that estimate how large the
baseline difference between the intervention and comparison groups
in the full analytic sample could reasonably be based on a few
different pieces of information.
When some baseline data is missing, the largest plausible
baseline difference between the intervention and comparison groups
is determined by the baseline effect size calculated using the
observed baseline data in the analytic sample. So we use the
information on the baseline we have. And smaller effect sizes
calculated using the observed baseline data in the analytic sample
will imply smaller plausible differences for the full analytic
sample.
These new formulas also bring in information on outcome means.
So we use information on the outcome, to think about the baseline.
So, in particular, the standardized difference in outcome means
between the full analytic sample and the subsample with observed
baseline data measured separately for the intervention and
comparison groups tells us something about the baseline difference
for the full analytic sample. And here, smaller differences in
outcomes between the full analytic sample and the subsample with
observed baseline data will imply smaller plausible baseline
differences within the full analytic sample.
Finally, it brings in information on the correlation between the
baseline and the outcome measures, where larger correlations will
imply smaller plausible baseline differences for the full analytic
sample.
Slide 31
I’m now going to show you some of the formulas that the SRG uses
in order to calculate the largest plausible baseline difference.
These formulas are complex. The SRG has tools to support reviewers
of these studies, and the SRG will do all the calculations for you.
So don’t stress about the formulas. We’re not going to test on the
formulas either – just the underlying concepts and the review
process when studies have missing data.
But I want to go through the formulas for this specific case
that you’re most likely to run into in a bit of detail, so that you
are comfortable with the concepts and also to build some of the
intuition behind what the SRG is doing in these cases.
So to assess baseline equivalence when some baseline data is
missing, instead of comparing the baseline effect size to cutoff
values – the .05 and .25 that you usually use – you’re going to
compare the maximum of four different quantities to the same cutoff
values, where those are defined by these equations. There’s a lot
of stuff going on in these equations, but they can actually be
broken down into a few easy‐to‐understand pieces.
The first piece that forms these equations is the effect size in
the observed baseline data. So you take the baseline data for the
analytic sample that you do have observed data for, and you
calculate the effect size.
Another piece of this is the correlation between the outcome and
the baseline measure.
The third and fourth pieces are the differences in the outcome
between the full analytic sample and the analytic sample with
observed baseline data. Like I mentioned before, we’re using
information on the outcomes to infer something about the
baseline.
And the final piece of these formulas is the normalization
factor or the standard deviation of the outcome measure.
All four formulas draw information from those four types of
statistics that I just mentioned. So even though they look
complicated, they can really be broken down into several different
pieces, many of which reviewers are very used to working with.
Slide 32
Again, this slide is to help to build intuition on what the SRG
is doing when it’s assessing whether the largest plausible baseline
difference satisfies baseline equivalence. On the X axis here, I’ve
plotted the intervention group standardized difference in outcome
means. So that’s the difference in the outcome between the full
analytic sample and analytic sample with observed baseline data,
divided by the standard deviation of the outcome. On the Y axis is
the same measure but for the comparison group.
This gold region that you see here shows the largest differences in
the outcome means that still satisfy baseline equivalence. And in
particular this assumes a correlation of .50. Here is that region,
or the shape, for the correlation of .60. Here it is for .70 and
here it is for .80.
There are three main points that I want to make with this
diagram. The first is that higher correlations between the baseline
and the outcome measure imply larger regions. So higher
correlations allow for larger possible differences in means between
the analytic sample and the analytic sample with non‐missing
data.
The second main point I want to make is that if differences are
of the same sign for the intervention and the comparison groups,
the differences can be pretty big. The case that we’re looking at
here is when the study did not adjust for the baseline measure. So
differences could be, for instance, around .1 standard deviations
for both the intervention and the comparison groups, and baseline
equivalence would still be satisfied because the largest plausible
baseline difference would be below that .05 standard deviation cut
off.
The final thing I want to point out about this graph is that
it’s totally symmetric. It doesn’t matter what you call the
intervention group versus the comparison group, it doesn’t matter
if a favorable outcome is negative or positive. Everything about
this graph is symmetric.
Slide 33
Ok, so that’s a very specific case of how you would handle
missing data. More generally, the WWC uses the following review
process for studies with imputed outcome data or missing or imputed
baseline data.
Again, the example we looked at is a special case of this review
process. Now I’m going to discuss each step of the process in
detail, and cover the other cases where this review process is
used. In step one, a reviewer will assess whether the study uses an
acceptable approach to address all missing data in the analytic
sample. In step two, a reviewer will assess whether the study is a
low‐attrition RCT. In step
three, a reviewer determines whether the study limits potential
bias from imputed outcome data, if any outcome data are
imputed.
Slide 34
In step four, a reviewer checks whether the study is a
high‐attrition RCT that analyzes the full randomly assigned sample
using imputed data. And in step five, the reviewer assesses whether
baseline equivalence is satisfied, where a different process is
used to do that based on whether data in the analytic sample is
missing or imputed for any baseline measure specified in the review
protocol.
I’m going to talk about each of these steps in more detail
now.
Slide 35
Step one: Does the study use an acceptable approach to address
all missing data in the analytic sample?
Slide 36
Version 4.0 of the Standards Handbook lists acceptable
approaches and key considerations for each. The acceptable
approaches are also listed on this slide, and the slide that
follows. These slides can be used as a reference. I will not go
through all details on these slides. I just want to introduce you
to each of the methods that the WWC has classified as acceptable
approaches to dealing with missing data.
The first approach, which you most commonly see, is the complete
case analysis approach. In this approach study authors only analyze
observations for which all needed data are not missing. If you’re
reviewing a study where this approach is used, you should follow
the usual Group Design review process, counting any omitted data as
attrition. There is no need to follow the remaining steps of the
missing data review process.
Other acceptable approaches include regression imputation, where
authors use a regression model to impute values for missing
data.
Slide 37
Maximum likelihood, where they use an iterative routine to
simultaneously estimate model parameters and account for missing
data.
Nonresponse weights are also an acceptable method to account for
missing outcome data only. Not for missing baseline data.
Slide 38
And then finally the dummy‐variable adjustment method, where
authors set all missing values for a baseline measure to a single
value and include an indicator variable for records missing data in
the impact estimation model, is an acceptable method to correct for
missing baseline data but not for missing outcome data. And note,
for the dummy variable adjustment, when applied to baseline
measures that the review protocol requires to satisfy baseline
equivalence, the method is acceptable only for randomized
controlled trials (both low‐ and high‐attrition RCTs), but not for
QEDs or compromised RCTs.
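To make the dummy‐variable adjustment concrete, here is a minimal Python sketch of the data preparation step only. The function name, the fill value of 0.0, and the pretest scores are hypothetical; any single constant could serve as the fill value, and the impact estimation model itself is not shown.

```python
def dummy_variable_adjust(baseline, fill_value=0.0):
    """Return (filled_values, missing_indicator) for one baseline measure."""
    # Set every missing value to the same single value.
    filled = [fill_value if x is None else x for x in baseline]
    # Indicator variable: 1 for records missing the baseline measure.
    missing = [1 if x is None else 0 for x in baseline]
    return filled, missing

pretest = [52.0, None, 47.5, None, 60.0]  # hypothetical scores; None = missing
filled, missing = dummy_variable_adjust(pretest)
print(filled)   # [52.0, 0.0, 47.5, 0.0, 60.0]
print(missing)  # [0, 1, 0, 1, 0]
```

Both resulting columns would then be entered as covariates in the impact estimation model.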
The WWC may also consider other methods of addressing missing
data acceptable. If you’re reviewing a study, and you think that an
author uses a method of addressing missing data that’s acceptable,
but it’s not in this list, you should consult with review team
leadership.
It should also be noted that a single analysis may use multiple
methods to deal with missing data. That’s totally acceptable, so
long as all of the methods used are acceptable.
And then finally, if a study does not use an acceptable approach
to address all missing data in the analytic sample, it receives the
Does Not Meet WWC Group Design Standards rating. Note that this
applies regardless of whether the measure with missing data is
specified by the review protocol as required for satisfying
baseline equivalence. So for example, if race and ethnicity was
imputed using a method which is not acceptable, a study would not
meet standards, even if we did not require equivalence based on
race and ethnicity to be demonstrated in the review protocol.
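The step one rules can be sketched as a small lookup, under some simplifying assumptions. The names APPROACHES, is_acceptable, and uses_only_acceptable are hypothetical, the design labels are illustrative strings ("RCT" here means an RCT without compromised random assignment), and any approach not in the table is treated as unacceptable even though review team leadership may decide otherwise.

```python
# Which kinds of missing data each listed approach can address.
APPROACHES = {
    "complete case": {"baseline", "outcome"},
    "regression imputation": {"baseline", "outcome"},
    "maximum likelihood": {"baseline", "outcome"},
    "nonresponse weights": {"outcome"},
    "dummy variable": {"baseline"},
}

def is_acceptable(method, data_kind, design="RCT", required_for_equivalence=False):
    """Check one use of a missing data approach against the listed rules."""
    if data_kind not in APPROACHES.get(method, set()):
        return False
    # Dummy-variable adjustment on a measure required for baseline
    # equivalence is acceptable only for non-compromised RCTs.
    if method == "dummy variable" and required_for_equivalence:
        return design == "RCT"
    return True

def uses_only_acceptable(uses):
    """A study passes step one only if every approach it uses is acceptable."""
    return all(is_acceptable(*use) for use in uses)

study = [
    ("regression imputation", "outcome", "QED", False),
    ("dummy variable", "baseline", "QED", True),
]
print(uses_only_acceptable(study))  # False
```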
Slide 39
Ok, so let's just go over a quick knowledge check on this topic.
Which of the following studies cannot meet WWC group design
standards?
A. A low‐attrition RCT that imputes missing baseline and outcome
data using regression imputation.
B. An RCT with compromised random assignment that accounts for
missing baseline data using a dummy variable adjustment.
C. A QED that estimates effects using unadjusted means and
standard deviations and is missing baseline data for 40 percent of
the analytic sample.
D. A QED that uses maximum likelihood to analyze a sample with
missing baseline and outcome data.
E. All of the above can meet WWC group design standards.
Slide 40
The correct answer here is “B” – An RCT with compromised random
assignment that accounts for missing baseline data using a dummy
variable adjustment. Dummy‐variable adjustment is an acceptable
approach for handling missing baseline data only for RCTs without
compromised random assignment. So this study will receive the Does
Not Meet WWC Group Design Standards rating.
“A” ‐ A low‐attrition RCT that imputes missing baseline and
outcome data using regression imputation – that study uses an
acceptable approach for handling missing data. It’s a low‐attrition
RCT. Thus, unless there are any other design issues, the study will
receive the Meets Group Design Standards Without Reservations
rating.
“C” is also an incorrect answer. “C” was a QED that estimates
effects using unadjusted means and standard deviations and is
missing baseline data for 40 percent of the analytic sample.
Although a large amount of baseline data is missing, the study is
eligible to receive the Meets Standards With Reservations rating,
as long as the largest anticipated baseline difference in the
analytic sample is less than 0.05 standard deviations.
“D” is also incorrect. “D” was a QED that uses maximum
likelihood to analyze a sample with missing baseline and outcome
data. Maximum likelihood is an acceptable method for accounting for
missing data. So unless there are other design issues, the study
would be eligible to receive the Meets Standards With Reservations
rating.
Slide 41
Let's go on to talk about step two: Is the study a low attrition
RCT?
Slide 42
To assess whether the study is a low attrition RCT, a reviewer
will calculate overall and differential attrition counting sample
members with imputed outcome data as missing. So the denominator
for this calculation will be the full sample subject to random
assignment, and the numerator is the analytic sample excluding any
imputed outcome data. The WWC treats imputed data in the same way
as missing data when assessing attrition, because both present a
potential threat of bias. So a low attrition RCT would be eligible
to receive the meets standards without reservations rating, so long
as the study used an acceptable method to address missing data.
For QEDs, high attrition RCTs, and compromised RCTs, a reviewer
should proceed to step three.
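The attrition calculation described above can be sketched in a few lines of Python. The function name and the counts are hypothetical. The observed counts include only sample members with observed outcome data, so anyone with an imputed outcome counts as attrition. Whether the resulting rates are low or high depends on the attrition boundary in the review protocol, which is not encoded here.

```python
def attrition_rates(assigned_i, assigned_c, observed_i, observed_c):
    """Overall and differential attrition, as proportions.

    observed_* counts only sample members with observed (not imputed)
    outcome data; imputed outcomes are treated as missing.
    """
    overall = 1 - (observed_i + observed_c) / (assigned_i + assigned_c)
    rate_i = 1 - observed_i / assigned_i
    rate_c = 1 - observed_c / assigned_c
    return overall, abs(rate_i - rate_c)

# Hypothetical study: 120 students assigned per group, with observed
# outcomes for 96 intervention and 102 comparison students.
overall, differential = attrition_rates(120, 120, 96, 102)
print(f"{overall:.1%} overall, {differential:.1%} differential")
```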
Slide 43
Before moving on though, let's go through a quick knowledge
check.
Researchers randomly assigned 50 students to a comparison group,
and 50 students to an intervention group. The intervention group
received a one‐year intensive tutoring program. By visiting schools
on two days, the researchers were able to obtain pretest scores for
43 students in the intervention group and 41 students in the
comparison group, and posttest scores for 45 students in the
intervention group and 40 students in the comparison group. Using
an acceptable method of imputation, they analyzed data from an
analytic sample including the 47 intervention group students and 43
comparison group students who had either pre‐ or posttest data (or
both).
What is the overall rate of attrition? Is it:
A. 10 percent
B. 15 percent
C. 16 percent or
D. 21 percent
The correct answer here is “B”, 15%.
Slide 44
The difference between the number of students randomly assigned
and the number of students with observed outcome data should be
used to calculate attrition. In this case, the researchers randomly
assigned 100 students and 85 had observed outcome data (45 in the
intervention group and 40 in the comparison group). Therefore, the
overall rate of attrition is 15 percent.
A, C, and D are all incorrect answers. The quoted levels of
attrition that correspond to these options result from treating
observations with observed pretest data and missing posttest data
or observed posttest data and missing pretest data incorrectly in
assessing attrition.
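The four answer options can be reproduced from the counts in the question. The sketch below is an illustration only: the variable names are hypothetical, and the explanation attached to each incorrect option is an inference from the arithmetic rather than something stated in the webinar.

```python
assigned = 100
pre = {"intervention": 43, "comparison": 41}     # observed pretests
post = {"intervention": 45, "comparison": 40}    # observed posttests
either = {"intervention": 47, "comparison": 43}  # analytic sample

# Students with both tests, by inclusion-exclusion: pre + post - either.
both = {g: pre[g] + post[g] - either[g] for g in pre}

def attrition_pct(retained):
    return 100 * (assigned - retained) / assigned

options = {
    "A": attrition_pct(sum(either.values())),  # treats imputed posttests as retained
    "B": attrition_pct(sum(post.values())),    # observed posttests only (correct)
    "C": attrition_pct(sum(pre.values())),     # uses pretests instead of posttests
    "D": attrition_pct(sum(both.values())),    # requires both tests to be observed
}
print(options)  # {'A': 10.0, 'B': 15.0, 'C': 16.0, 'D': 21.0}
```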
Slide 45
Great, so let's move on to step three. Does the study limit
potential bias from imputed outcome data, if any outcome data are
imputed?
Slide 46
When a QED, high‐attrition RCT, or RCT with compromised random
assignment analyzes imputed outcome data, the study must
demonstrate that the potential bias from that imputed outcome data
is limited. And by limited, we mean less than 0.05 standard
deviations.
The potential bias from analyzing imputed outcome data is
determined by the standardized difference in baseline means between
the full analytic sample and the subsample with observed outcome
data, measured separately for the intervention and comparison
groups (where smaller differences will imply smaller biases). So
just like when we talked about the case of missing baseline data we
used information on the outcome measure to infer something about
the missing baseline data, we’re using information from the
baseline here to infer something about the imputed outcome
data.
The potential bias is also determined by the fraction of the
analytic sample missing outcome data measured separately for the
intervention and comparison groups (where smaller fractions will
imply smaller biases); and the correlation between the baseline and
the outcome measure (where larger correlations imply smaller
biases). So for this last item, the correlation, you’re typically
using the baseline measure in order to impute the outcome measure.
So if there’s a stronger link between the baseline measure and the
outcome measure, we can guess that our imputation is in some sense
better, and larger correlations will imply smaller possible biases.
If a study must limit potential bias from imputed outcome data, but
does not do so, it will receive the does not meet WWC Group Design
Standards rating.
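As a rough sketch, the per‐group inputs described above could be computed as follows. The function name, argument names, and numbers are hypothetical, and the sign convention (full sample minus observed subsample) is an assumption. The step that combines these inputs with the correlation into a single potential bias figure is handled by the SRG and is deliberately not reproduced here.

```python
def imputation_bias_inputs(mean_full, mean_observed, sd_baseline, n_full, n_observed):
    """Inputs for one study group (intervention or comparison).

    mean_full:     baseline mean for the full analytic sample
    mean_observed: baseline mean for the subsample with observed outcome data
    sd_baseline:   standard deviation of the baseline measure
    """
    # Standardized difference in baseline means (sign convention assumed).
    std_diff = (mean_full - mean_observed) / sd_baseline
    # Fraction of the analytic sample whose outcome data are not observed.
    frac_missing = (n_full - n_observed) / n_full
    return std_diff, frac_missing

# Hypothetical intervention group values.
std_diff, frac_missing = imputation_bias_inputs(50.0, 50.5, 10.0, 95, 90)
print(round(std_diff, 3), round(frac_missing, 3))  # -0.05 0.053
```

Smaller standardized differences and smaller missing fractions both point toward smaller potential bias.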
Slide 47
So, here’s another one of the graphs that I showed you
previously. Now on the X axis, I’ve plotted the intervention group
standardized difference in baseline means (so that’s the difference
in baseline means for the full analytic sample versus the analytic
sample with observed outcome data divided by the standard deviation
of the baseline measure). And then the same measure for the
comparison group is plotted on the Y axis. Now the gold region
outlines the largest differences in baseline means that limit
potential biases from imputation, and in this case, we’re looking
at the region when the correlation between baseline measure and the
outcome measure is equal to .50. Here is the region for a
correlation of .60, for .70, and for .80.
So again, three main take‐aways here for this graph. First, the
graph is symmetric. So just like before, it doesn’t matter what you
call the intervention group, it doesn’t matter what you call the
comparison group, it doesn’t matter whether positive numbers refer
to favorable impacts, or positive numbers refer to non‐favorable
impacts. It’s all symmetric.
Second, like before, the bigger the correlation between the
baseline and the outcome measure, the larger the differences can be
and still lead to limited potential bias from imputation.
Finally I want to note that this diagram was created by assuming
that there was no missing or imputed baseline data. This is a
specific case but we could create a similar diagram for different
levels of missing or imputed baseline data. There’s nothing
particularly special about this particular case. It’s just one to
highlight.
Slide 48
Moving on to step four. In step four a reviewer will assess
whether a study is a high attrition RCT that analyzes the full
randomly assigned sample.
Slide 49
In general, the WWC requires that high attrition RCTs satisfy
the baseline equivalence requirement because of a risk of bias from
compositional differences between the intervention and comparison
group members that remain in the sample after attrition.
However some high attrition RCTs impute all missing outcome data
and analyze the original randomly assigned sample. These high
attrition RCTs do not need to satisfy the baseline equivalence
requirement because of a presumption that the intervention and
comparison groups that result from random assignment are unlikely
to have substantive compositional differences. So put another way,
a high attrition RCT is eligible to receive the meets group design
standards with reservations rating, if it does the following:
1. Uses an acceptable approach to address all missing data, that
you assessed in step one,
2. Limits the potential bias from imputed outcome data, as
assessed in step three, and
3. Imputes outcome data for all randomly assigned sample members
who are missing outcome data.
Notably absent from this list is satisfying baseline
equivalence. High attrition RCTs that satisfy the above three
requirements do not need to satisfy baseline equivalence to meet
group design standards.
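The three requirements above amount to a simple all‐or‐nothing check, sketched here with hypothetical function and argument names. Each argument is a yes/no judgment that the reviewer has already made in the earlier steps.

```python
def high_attrition_rct_skips_equivalence(acceptable_approach,
                                         imputation_bias_limited,
                                         all_missing_outcomes_imputed):
    """True if a high attrition RCT need not satisfy baseline equivalence."""
    return (acceptable_approach               # assessed in step one
            and imputation_bias_limited       # assessed in step three
            and all_missing_outcomes_imputed) # full randomly assigned sample analyzed

# A study that leaves some randomly assigned students out of the analysis
# fails the third requirement and moves on to step five.
print(high_attrition_rct_skips_equivalence(True, True, False))  # False
```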
Slide 50
Let's move on to step five: Assessing Baseline Equivalence.
We’re going to use a slightly different method for assessing
baseline equivalence based on whether data in the analytic sample
are missing or imputed for any baseline measure specified in the
review protocol.
Slide 51
If the analytic sample includes no subjects with missing or
imputed data for the measures required by the protocol to satisfy
baseline equivalence, use the usual WWC procedures to assess
baseline equivalence. You don’t need to do anything special in this
case, you just look at your means and standard deviations of the
baseline measure for the full analytic sample, and calculate the
baseline effect size, comparing that to our typical cutoff
values.
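For a continuous baseline measure, the usual calculation might look like the sketch below. It assumes the baseline effect size is Hedges' g (the difference in means over the pooled standard deviation, with a small‐sample correction); the webinar does not spell out the formula, so check the Procedures Handbook before relying on it. The function name and the numbers are hypothetical.

```python
import math

def hedges_g(mean_i, sd_i, n_i, mean_c, sd_c, n_c):
    """Standardized mean difference with a small-sample correction."""
    pooled_sd = math.sqrt(
        ((n_i - 1) * sd_i**2 + (n_c - 1) * sd_c**2) / (n_i + n_c - 2)
    )
    omega = 1 - 3 / (4 * (n_i + n_c) - 9)  # small-sample correction factor
    return omega * (mean_i - mean_c) / pooled_sd

# Hypothetical pretest statistics for the full analytic sample.
g = hedges_g(51.0, 10.0, 50, 50.0, 10.0, 50)
print(round(abs(g), 3))  # 0.099
```

A value like this one falls between the .05 and .25 cutoffs, so the study would need to adjust for the baseline measure in its analysis.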
If you’re not in this situation, though, you need to proceed to
step 5b in conducting a review.
Slide 52
Does the study satisfy baseline equivalence using the largest
baseline difference accounting for missing or imputed baseline
data?
If a study has missing or imputed baseline data, you want to
measure the largest reasonable baseline difference accounting for
these data.
The exact formulas that you’ll use in order to calculate this
largest reasonable baseline difference will depend on the precise
situation you’re in, but can include means, standard deviations,
correlations, and sample sizes. All of the formulas just use these
standard statistics.
As discussed earlier, if baseline data are missing, these
formulas essentially take the baseline difference estimated using
observed baseline data, and then adjust it. If baseline data are
imputed, the formulas can adjust either the baseline difference
estimated using both observed and imputed data, or the baseline
difference estimated using observed data only.
Slide 53
So this slide here is a reference slide, and again, these slides
should be available to you through the ON24 environment. If you
have any questions how to access the slides, you should let me
know. Hopefully we’ll be able to deal with those questions as they
come up.
But this particular slide is a reference slide that you should
use in order to understand what types of additional data are
needed. You do not have to memorize it, you just have to be able to
use it to understand what you need to request in an author query.
And an SRG will also help you out here, as it will indicate the
types of data you need to provide in different cases.
So what this slide does is, for each of the six cases you could
possibly be in, based on whether outcome data are always observed
or sometimes imputed, and baseline data are always observed,
sometimes missing, or sometimes imputed, it contains X’s for the
boxes specifying the statistics that you need to gather in order to
conduct a review.
For two of the cases, there are options one and two. This means
we need either the option one, or the option two information to
assess baseline equivalence, but not both. If information from
option one is available we’ll typically use it. It’s preferred to
option two. But we can just as easily use the information from
option two.
Slide 54
So that’s all five steps of the missing data review process. I’m
now going to go through an example. This is the most complex
missing data scenario in that it’s designed to walk reviewers
through all five steps. Typically, a reviewer would not necessarily
have to walk through all five of the steps, but we designed this
example in order to do just that.
So in this example, we’re looking at an RCT which assigned 100
students to an intervention group and 100 students to a comparison
group. Pretest data was observed for 90 students in the
intervention group and 85 students in the comparison group.
Posttest data was observed for 90 students in the intervention
group and 80 students in the comparison group. In total, 95
students in the intervention
group and 90 students in the comparison group had observed data
for either the pretest or the posttest, or both.
For all students who took either the pretest or the posttest
(but not both) the study authors imputed missing test scores using
multiple imputation. The imputation model included controls for
study group, all covariates that were used in the impact estimation
model, and, when imputing pretest scores, the model controls for
the posttest. The correlation between the pretest and the posttest
was 0.60.
The study authors used a regression framework, controlling for
pretest scores, and all imputed and observed data to estimate
impacts.
The review protocol specifies that the cautious attrition
threshold should be used. What is the highest rating the study is
eligible to receive?
Let's walk through the process that would be involved in
reviewing this study and determining the highest rating the study
is eligible to receive.
Slide 55
In step one, a reviewer would assess whether the study uses an
acceptable approach to address all missing data in the analytic
sample. In this case the study authors do use an acceptable
approach to address all missing data in the analytic sample. The
study authors used multiple imputation, which is a form of
regression imputation, to impute missing values when exactly one
test score was observed. The imputation model included all of the
necessary covariates. So, the WWC would classify this as an
acceptable approach to address all missing data.
In step two you would determine whether the study is a low
attrition RCT. In this case, the answer to the question in step two
is no. It is a high attrition RCT. Posttest data was observed for
90 students in the intervention group and 80 students in the
comparison group. Counting imputed data as attrition, the overall
attrition rate was 15%, and the differential attrition rate was 10
percentage points. Under the cautious attrition boundary, we would
classify this as high attrition.
Slide 56
Given that, we move on to step 3: Does the study limit potential
bias from imputed outcome data, if any outcome data are imputed? In
this case, we would need a little bit more information. And in
particular, we would need information on the unadjusted mean of the
pretest for the sample with observed pretest data, the unadjusted
mean of the pretest for those with observed pretest and posttest
data, and the standard deviation of the pretest for those with
observed pretest data.
Based on this information, the potential bias from imputed
outcome data can be calculated to be .03 standard deviations. So
since .03 is less than .05, we can determine that the study authors
limited potential bias from imputed outcome data.
Based on this we move on to step four: is the study a high
attrition RCT that analyzes the full randomly assigned sample? In
this case, the answer is no. Some students had missing data on both
the pretest and the posttest. These students were omitted from the
analysis. So even though this is a high attrition RCT, it does not
analyze the full randomly assigned sample, and we need to move on
to assess baseline equivalence in step 5.
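The counts behind this answer can be worked out from the numbers given in the example. The sketch below is illustrative, with hypothetical variable names; it uses inclusion‐exclusion to find how many students had both tests, how many test scores were imputed, and how many students were left out of the analysis.

```python
assigned = {"intervention": 100, "comparison": 100}
pretest = {"intervention": 90, "comparison": 85}    # observed pretests
posttest = {"intervention": 90, "comparison": 80}   # observed posttests
analytic = {"intervention": 95, "comparison": 90}   # pretest, posttest, or both

summary = {}
for group in assigned:
    summary[group] = {
        # Inclusion-exclusion: both = pre + post - (pre or post).
        "both_observed": pretest[group] + posttest[group] - analytic[group],
        "posttest_imputed": analytic[group] - posttest[group],
        "pretest_imputed": analytic[group] - pretest[group],
        "omitted": assigned[group] - analytic[group],
    }

analyzes_full_sample = all(s["omitted"] == 0 for s in summary.values())
print(summary["intervention"])
print(summary["comparison"])
print(analyzes_full_sample)  # False
```

Five intervention students and ten comparison students have neither test score, which is why the study does not analyze the full randomly assigned sample.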
Slide 57
We’re going to use step 5b as opposed to step 5a because the
analytic sample includes some observations with missing baseline
data. In order to assess the largest baseline difference accounting
for the missing or imputed baseline data, we would need a few more
pieces of information. In particular, we need the unadjusted mean
of the pretest for those with observed or imputed pretest data; the
unadjusted mean of the posttest for those with observed posttest
data; the unadjusted mean of the posttest for those with observed
pretest and posttest data; and the standard deviation of the
posttest for those with observed posttest data. So you put all of
this information into the SRG, and the SRG would tell you that the
largest baseline difference accounting for imputed data is 0.09
standard deviations. If you’ll recall, the study authors controlled
for the pretest in the regression analysis used to estimate
impacts, so the study satisfies baseline equivalence because that
largest baseline difference is in the adjustment range and the
authors adjusted for the pretest in the analysis. Therefore,
overall we can conclude this study receives the Meets WWC Group
Design Standards With Reservations rating.
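The whole walk‐through can be condensed into a simplified decision function. This is a sketch of the five steps as described in this webinar, not the SRG's logic: the function and argument names are hypothetical, the attrition judgment and the two numeric inputs (.03 and .09 in this example) are taken as given, and other design issues that can affect a rating are ignored.

```python
MEETS = "Meets WWC Group Design Standards Without Reservations"
RESERVATIONS = "Meets WWC Group Design Standards With Reservations"
DOES_NOT_MEET = "Does Not Meet WWC Group Design Standards"

def highest_rating(design, acceptable_approach, low_attrition,
                   outcomes_imputed, imputation_bias, full_sample_analyzed,
                   largest_baseline_diff, adjusts_for_baseline):
    # Step 1: acceptable approach to all missing data?
    if not acceptable_approach:
        return DOES_NOT_MEET
    # Step 2: low attrition RCT (imputed outcomes counted as attrition)?
    if design == "RCT" and low_attrition:
        return MEETS
    # Step 3: potential bias from imputed outcomes must be below .05.
    if outcomes_imputed and imputation_bias >= 0.05:
        return DOES_NOT_MEET
    # Step 4: high attrition RCT analyzing the full randomly assigned sample?
    if design == "RCT" and outcomes_imputed and full_sample_analyzed:
        return RESERVATIONS
    # Step 5: baseline equivalence on the largest baseline difference.
    if largest_baseline_diff <= 0.05:
        return RESERVATIONS
    if largest_baseline_diff <= 0.25 and adjusts_for_baseline:
        return RESERVATIONS
    return DOES_NOT_MEET

# Values from the example: high attrition RCT, bias .03, difference .09,
# with the pretest controlled for in the impact regression.
rating = highest_rating("RCT", True, False, True, 0.03, False, 0.09, True)
print(rating)
```

Had the authors not controlled for the pretest, the same inputs would lead to the Does Not Meet rating, because .09 falls in the adjustment range.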
Slide 58
Let's just conclude with a couple of knowledge checks.
An RCT assigned 60 students to the intervention group and the
same number to the comparison group. Outcome data are available for
50 students in the intervention group and 45 students in the
comparison group. Baseline data are available for 45 students in
the intervention group and 50 students in the comparison group. The
study authors imputed all missing baseline and outcome data using
an acceptable approach. There is no reason to