WWC Recertification for Standards Version 4.0, Webinar Transcript · 2018-01-23

Understanding the WWC Procedures and Standards Webinar Transcript

WWC Recertification for Standards Version 4.0 Webinar Transcript January 12, 2018

Slide 1 Hello everyone, and thank you for attending today’s webinar, “What Works Clearinghouse (WWC) Reviewer Recertification Training.” The webinar will begin with a brief introduction from Chris Weiss, Senior Education Research Scientist, National Center for Education Evaluation and Regional Assistance, Institute of Education Sciences, U.S. Department of Education.

Following Chris’s introduction, Neil Seftor, Mathematica’s project director for the What Works Clearinghouse, will provide an overview of today’s presentation, the recertification process, and what has changed in version 4.0 of the WWC procedures and standards. Next, Allison McKie and Dana Rotz will provide information on some of the major changes in procedures and standards to help reviewers prepare for recertification. We have saved about 20 minutes at the end of the presentation for Q&A related to the recertification process. You can also submit technical questions on the standards and procedures throughout the webinar; however, we will respond to those during the office hours.

I will be briefly going through some housekeeping information before we get started. You can make the slides larger on your screen by clicking the bottom right corner of the slides window and dragging. If you have accessed the audio for the webinar through the teleconference line, you may experience a slight delay. If possible, we encourage you to listen to the webinar through your computer or device speakers. We encourage you to submit questions throughout the webinar, using the Q&A tool on the webinar software on your screen. You can ask a question when it comes to mind; you don't have to wait until the question and answer session. Because we're recording this, every member of the audience is in listen‐only mode. That improves the sound quality of the recording, but it also means that the only way to ask questions is through the Q&A tool, so please use that. We will try to answer as many questions as possible. The slide deck, a recording, and a transcript of the webinar will be available on the WWC website for download. So, with that introduction, let’s get started.

Chris, you now have the floor.

Thank you, Brice. Good afternoon. I’m Chris Weiss, team lead for the What Works Clearinghouse. On behalf of the Institute of Education Sciences, it’s my pleasure to welcome you to this webinar. As you know, the WWC updated our standards and procedures a few months ago. Today’s webinar will discuss changes between version 3.0 and the new version 4.0 handbooks. We’re very excited to have this opportunity to tell you more about the new standards and procedures. I also wanted to take a moment to express my thanks to you, not only for your participation in today’s webinar, but also for the contributions you make to the What Works Clearinghouse. And thank you for this. And with that, I’ll turn it over to Neil Seftor. Neil?

Thanks, Chris. And I’d like to thank everyone for joining us for today’s webinar, Recertification for the version 4 standards. I’m Neil Seftor, a senior researcher at Mathematica Policy Research, and I oversee Mathematica’s work on the What Works Clearinghouse. My co‐presenters today are Allison McKie and Dana Rotz, also senior researchers at Mathematica, who are involved in WWC training activities.

Slide 2

Before we begin, a little context for this webinar. The WWC updates its procedures and standards for two key reasons. First, existing research methods evolve and new research methods are developed. In order to keep pace with the field, we revise and develop standards to deal with these changes that appear in the research we review. Second, we sometimes find that our procedures or standards are unclear or don’t deal with specific situations. We find these in the course of reviewing studies, and through questions and comments from education decision makers and researchers. In the short run, we provide additional guidance to reviewers and answers to people who send us questions, but then we incorporate all of that information in the next version of the handbook. After an extensive process involving many levels of review by methodological experts, the version 4 handbooks were released in October, and they reflect the most up‐to‐date procedures and standards used by the WWC. In today’s webinar, I’ll first explain the process that reviewers can use to keep their certification current, and then give a high‐level overview of the changes in version 4. Next, I’ll turn it over to Allison and Dana to discuss some things that have changed significantly, including procedures for how the WWC defines a study and distinguishes different types of findings, along with some updates to standards for baseline equivalence and missing data.

Slide 3

Let’s go over the process for recertification.

Slide 4

The recertification process has been created to allow reviewers who are certified on version 3 of the WWC’s group design standards to update their certification to version 4. If you are not currently certified on version 3 of the standards, you may not use this process to be certified on version 4. Instead, you’ll need to go through the full training on Group Design standards, which has been updated to reflect the latest changes.

The recertification process uses several resources to give you the information you need to know about the changes to the standards and procedures. Today’s webinar is one resource, which will cover the topics I listed before. At the end of today’s webinar, we will answer questions related to the recertification process only. In two weeks, we will be providing follow‐up office hours, during which we will answer questions you may have related to the standards and procedures. Another resource is the training module on cluster‐level assignment. The standards for reviewing these studies have changed significantly. Rather than try to add all of that material to this webinar, we will point you to a video that describes the standards for cluster‐level assignment in detail. The webinar, office hours, and cluster module will all be made available for your viewing. In addition to these webinars and videos, we recommend reviewing the procedures and standards handbooks.

Next, we ask you to demonstrate your understanding of the material. The recertification exam consists of ten multiple‐choice questions. To pass, you must answer eight of these correctly, and you’ll have two opportunities to pass the test. The exam is challenging, because the new standards deal with complex issues. However, if you take your time and use the resources, which can also be used during the test, you should not have trouble. Finally, after passing the exam, we ask you to view one more training module, which introduces you to the new online study review guide. The certification status of reviewers listed on the WWC website will be updated as recertification is completed.

Slide 5

So, what’s new in version 4?

Slide 6

The most obvious change is that we’ve split our procedures and standards handbooks into two separate documents.

The five steps listed below comprise our systematic review process. The WWC procedures handbook contains information on most of the steps in the process, including finding studies, screening them for eligibility, and reporting on findings. The WWC standards handbook contains detailed information on the standards we use to assess quality. These two version 4 documents replace the single version 3 document. They are now split to make future updates to the procedures and standards easier.

Slide 7

What content has changed?

On the procedure side, we will talk about the bolded items today, including how the WWC defines a study and reports on different types of findings. Other changes that we won’t cover, and are not included in the exam, focus on aspects of the WWC reporting of findings. Information on these changes can be found in the procedures handbook.

Slide 8

On the standards side, we will talk about updates to the baseline equivalence and missing data standards. You will also learn about the new standards for reviewing studies with cluster‐level assignment in a separate training module. The remaining items, which are described in detail in the standards handbook, include some clarifications on attrition terminology and outcome requirements, along with extensive changes to the standards used to review studies with specific designs and analyses. And with that, I’ll turn it over to Allison to start describing some of the more significant changes in version 4.

Slide 9

Thank you, Neil.

Slide 10

Because studies are the building blocks of systematic reviews, the WWC needs a definition for what constitutes a study. In practice, a single manuscript, such as a journal article or report, might include more than one study, or multiple manuscripts might all report on the same study. So drawing the line to distinguish one from the other can be challenging. The version 4 procedures include a new, more formal definition of a study that we can implement more consistently. When analyses of the same intervention share certain characteristics, we consider them parts of the same study. These characteristics include:

The assignment or selection process used to create the intervention and comparison groups in the study sample: A random assignment process is one way of creating the study sample. Another way is to form groups by matching students who received the intervention to comparison students who did not.

The study sample: This is the set of students, classrooms, teachers, or schools that the study analyzed. Findings from analyses that include some or all of the same sample members may be related. Note that the samples do not need to be identical; they just need to overlap.

The data collection and analysis procedures used to produce the findings: When authors use identical or nearly identical procedures to collect and analyze data, the findings may be related.

And finally, the research team that conducted the study: When manuscripts share one or more authors, the findings reported in those manuscripts may be related.

When two findings on the effectiveness of the same intervention share at least three of these characteristics, we consider them parts of the same study. Findings that meet this criterion demonstrate similarity or continuity in the intervention and comparison groups, and similarity or continuity in the procedures used to produce the findings.
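To make the three‐of‐four rule concrete, here is a minimal sketch in Python. The `Finding` class, its field names, and the example values are hypothetical and exist only for illustration. Comparing procedures by an exact label match is also a simplification, since the definition allows "nearly identical" procedures and in practice that is a reviewer's judgment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    """Hypothetical record of the four characteristics for one finding."""
    assignment_process: str    # label for how the groups were formed
    sample_members: frozenset  # units analyzed (students, classrooms, schools)
    procedures: str            # label for data collection and analysis procedures
    authors: frozenset         # authors of the manuscript reporting the finding


def shared_characteristics(a: Finding, b: Finding) -> int:
    """Count how many of the four characteristics two findings share."""
    return sum([
        a.assignment_process == b.assignment_process,
        bool(a.sample_members & b.sample_members),  # overlap is enough
        a.procedures == b.procedures,               # simplification: exact match
        bool(a.authors & b.authors),                # one shared author is enough
    ])


def same_study(a: Finding, b: Finding) -> bool:
    """Two findings on the same intervention belong to one study if they
    share at least three of the four characteristics."""
    return shared_characteristics(a, b) >= 3


# Illustrative values only: one article, two samples formed in different ways.
rct = Finding("random assignment", frozenset({"school_A"}),
              "same test and model", frozenset({"Author 1"}))
qed = Finding("matching", frozenset({"school_B"}),
              "same test and model", frozenset({"Author 1"}))
print(same_study(rct, qed))  # False: only procedures and authors are shared
```

Counting the shared characteristics separately from the yes/no decision keeps the threshold of three visible in one place.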

Slide 11

Let's consider a few scenarios to help clarify how the WWC defines a study. First, consider a manuscript that presents a finding based on combining data across related samples, such as those from multiple time periods within the same school.

For example, imagine that the researchers randomly assigned classrooms within a middle school to conditions, and the same intervention is implemented in the intervention classrooms over several school years. The study examines the effectiveness of the intervention in three consecutive cohorts of students in the middle school and measures the effectiveness of the intervention based on each cohort separately, but using the same analytic procedures.

Findings for the three cohorts are presented in a single manuscript.

In this case, we would review all the findings in the manuscript as a single study because they share all four of the study characteristics: the same random assignment process was used for each cohort, the study samples in each cohort were formed within the same school, the researchers used the same data collection and analysis procedures, and the same research team was responsible for all of the findings.

Slide 12

Now, imagine that instead of reporting the combined finding in one manuscript, the study authors released a series of reports, each reporting a single‐year finding.

These findings again share all four characteristics, so the WWC would review the series of reports as a single study. The fact that the authors released the findings separately in different reports does not affect how we review the findings.

Slide 13

Now consider a journal article on the effectiveness of an algebra curriculum on students’ scores on a standardized math assessment.

The study conducted a randomized controlled trial in New Jersey schools, and used a quasi‐experimental design in Wisconsin schools.

Although the study authors report findings for both samples in the same journal article, both the sample members and the assignment process differ in the two samples. Because the findings share fewer than three of the characteristics, we would review the randomized controlled trial and quasi‐experimental design as two separate studies.

Slide 14

Now let’s turn to updates to reporting procedures, starting with the guidelines the WWC follows in distinguishing among main findings, supplemental findings, and sensitivity analyses.

Slide 15

Often studies will report multiple findings, sometimes for the same outcome measure on different samples, or for multiple outcomes within a single outcome domain. In most products, the WWC reports on all eligible findings that meet WWC design standards with or without reservations.

Among those eligible findings that meet standards, we designate one or more findings reported in the study as the main findings that contribute to the WWC’s summary of the evidence. In particular, we identify a main finding using three criteria. A main finding:

‐ uses the full sample, rather than a subgroup;
‐ uses an aggregate or composite outcome measure, rather than individual subscales; and
‐ is the finding closest to the preferred follow‐up time point, as specified in the review protocol.

This preferred follow‐up period is most often either the immediate posttest, meaning the earliest follow‐up period after the conclusion of the intervention, or the latest follow‐up period, depending on the topic of interest.

If no one finding meets all three of these criteria, we will select a set of findings that together meet the criteria. In doing so, we seek to avoid overlap in subscales or subsamples.

For example, if a finding for a composite measure does not meet standards but findings for separate subscales of a composite measure meet standards, are based on the full sample, and are reported at the preferred follow‐up period, then we would report the findings for each subscale as main findings.

As we did under version 3 standards and procedures, under version 4 we combine subsamples into an aggregate finding if no aggregate finding meets WWC group design standards.

These rules allow the WWC to characterize a study’s findings based on the most comprehensive information available. However, not all studies will report a single finding or set of findings that reflects the full sample for the composite outcome measure and preferred time period. When applying these rules is not straightforward because of incomplete information about findings, overlapping samples, or other complications, the review team leadership has discretion for a study or group of studies under review to identify main findings in a way that best balances the goals of comprehensively characterizing each study’s findings and presenting the findings in a clear and straightforward manner.

Slide 16

All eligible findings that meet WWC group design standards and are not identified as main findings are considered either supplemental findings or sensitivity analyses.

Supplemental findings do not contribute to the WWC’s summary of evidence but may be reported separately in the WWC product that includes the study review. In intervention reports, for example, supplemental findings are reported in an appendix.

Sensitivity analyses are acknowledged in a note in the WWC product but are not reported.

Findings for subsamples, subscales, and time periods other than the preferred follow‐up period that are not identified as main findings are supplemental findings.

Note that when the WWC calculates an aggregate finding from subsample findings, we report each subsample finding that meets standards as a supplemental finding.

Slide 17

A sensitivity analysis uses the same (or a very similar) sample as the main finding, but applies a different analytic method, such as using a different set of control variables.

To differentiate main findings from sensitivity analyses among the findings that meet standards, we identify as the main finding the one from the analysis that

‐ receives the highest rating,
‐ accounts for the baseline measures specified in the review protocol,
‐ uses the most comprehensive sample, and
‐ is most robust to threats to internal validity based on the judgment of either the study authors, as reported in the study, or review team leadership.

Let’s go through an example.

Slide 18

A study estimated effects of a 1‐year intervention on SAT‐9 scores for students in grade 2. The study is reviewed using the Beginning Reading protocol, which prioritizes immediate posttest findings. All of the findings we will discuss are from eligible analyses.

First, suppose the following findings meet WWC group design standards:

‐ 1‐year impact for girls,
‐ 1‐year impact for all students,
‐ 2‐year impact for all students, and
‐ 1‐year impact for all students estimated using a simpler regression model.

We assume the simpler regression model has weaker internal validity than the full regression model used for the first three findings.

The 1‐year impact for boys does not meet standards.

Under this scenario, the main finding is the 1‐year impact for all students because it is the full‐sample finding closest to the preferred immediate posttest time period that is the most robust to threats to internal validity.

The findings for subsamples and other time periods that meet standards are supplemental findings: 1‐year impact for girls and 2‐year impact for all students.

The 1‐year impact for all students estimated using the simpler and presumably weaker regression model is classified as a sensitivity analysis.

Note that the 1‐year impact for boys does not appear under any of our classifications because the finding does not meet standards. The WWC reports only findings that meet design standards.

Slide 19

Now let’s consider the same study but instead suppose only the following findings meet standards:

‐ 1‐year impact for girls,
‐ 2‐year impact for all students, and
‐ 1‐year impact for all students estimated using a simpler regression model.

Now, the main finding is the 1‐year impact for all students estimated using a simpler regression model because it is the only full‐sample finding for the preferred time period that meets design standards.

The 1‐year impact for girls and the 2‐year impact for all students are again supplemental findings.

There are no sensitivity analyses. No findings from analyses that use a similar sample as the main finding but a different analytic method meet standards.
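The two scenarios can be traced with a small Python sketch. It is a simplification under stated assumptions: it takes as input only findings that already meet standards, it considers only the full‐sample, preferred‐time‐point, and robustness criteria, and it ignores subscales, ratings, and baseline measures. The `Finding` class and the numeric `robustness` rank are hypothetical stand‐ins for reviewer judgment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    """Hypothetical record for a finding that already meets standards."""
    label: str
    full_sample: bool    # True for all students, False for a subgroup
    followup_years: int  # 1 = immediate posttest for a 1-year intervention
    robustness: int      # assumed rank; higher = more robust to validity threats


def classify(findings, preferred_years=1):
    """Return (main, supplemental, sensitivity) for one outcome measure."""
    # Candidates for the main finding: full sample at the preferred time point.
    candidates = [f for f in findings
                  if f.full_sample and f.followup_years == preferred_years]
    if not candidates:
        raise ValueError("no full-sample finding at the preferred time point; "
                         "this sketch does not cover that case")
    main = max(candidates, key=lambda f: f.robustness)
    # Same sample and time point, different analytic method.
    sensitivity = [f for f in candidates if f is not main]
    # Subsamples and other time periods.
    supplemental = [f for f in findings if f not in candidates]
    return main, supplemental, sensitivity


girls_1yr = Finding("1-year impact for girls", False, 1, 2)
all_1yr = Finding("1-year impact for all students", True, 1, 2)
all_2yr = Finding("2-year impact for all students", True, 2, 2)
simple_1yr = Finding("1-year impact for all students, simpler model", True, 1, 1)

# First scenario: all four findings meet standards.
main_a, supp_a, sens_a = classify([girls_1yr, all_1yr, all_2yr, simple_1yr])

# Second scenario: the full-model 1-year finding does not meet standards.
main_b, supp_b, sens_b = classify([girls_1yr, all_2yr, simple_1yr])
```

In the first scenario the sketch picks the full‐model 1‐year finding as the main finding and the simpler model as a sensitivity analysis. In the second, the simpler model becomes the main finding and the sensitivity list is empty.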

Slide 20

Another important reporting update concerns when to apply a multiple comparison adjustment.

Under version 4 procedures, the WWC applies a multiple comparison adjustment only to main findings.

We do not apply the adjustment to supplemental findings or sensitivity analyses.

The procedure itself has not changed. We perform the multiple comparison adjustment using the same Benjamini‐Hochberg correction that we used under version 3 of the WWC Standards and Procedures.
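For reviewers who want a reminder of the mechanics, here is a minimal Python sketch of the textbook Benjamini‐Hochberg step‐up procedure. It is not the WWC's own implementation. The function name, the 0.05 significance level, and the example p‐values are assumptions, and the sketch does not address which main findings are grouped together for the adjustment.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Textbook Benjamini-Hochberg step-up procedure.

    Returns a list of booleans in the original order: True where the
    finding remains statistically significant after the adjustment.
    """
    m = len(p_values)
    # Indices of the p-values, sorted from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k (1-based) with p_(k) <= (k / m) * alpha.
    largest_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            largest_k = rank
    # Every finding ranked at or below largest_k stays significant.
    significant = [False] * m
    for idx in order[:largest_k]:
        significant[idx] = True
    return significant


# Hypothetical p-values for four main findings.
print(benjamini_hochberg([0.01, 0.04, 0.03, 0.20]))
# [True, False, False, False]
```

Under version 4, only the p‐values of main findings would be passed to a function like this. Supplemental findings and sensitivity analyses are left unadjusted.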

Slide 21

Next we will look at an update related to baseline equivalence. Version 4 standards use a more flexible statistical adjustment requirement for studies with moderate levels of differences in key baseline measures.

Slide 22

As you know, we assess baseline equivalence for randomized controlled trials with high attrition, randomized controlled trials with compromised random assignment, and quasi‐experimental designs by examining the standardized mean difference, or effect size, between the analytic intervention and comparison groups on baseline measures specified in the protocol.

If the absolute value of the baseline effect size is less than or equal to 0.05, the study satisfies the baseline equivalence requirement.

If the absolute value of the baseline effect size is larger than 0.25, the study does not satisfy the equivalence requirement.

For differences between 0.05 and 0.25, the WWC requires a statistical adjustment to satisfy equivalence. Under version 3 standards, this statistical adjustment had to be a regression covariate adjustment or an analysis of covariance.

Slide 23

Under version 4 standards, other analytical approaches may satisfy the statistical adjustment requirement if two conditions are met:

First, the baseline characteristic and the outcome measure must be measured using the same units.

This condition would be met if the researchers administered the same test and used the same scales and scoring procedures for the baseline characteristic and the outcome measure.

This condition would not be met if the researchers administered different tests for the baseline and outcome measures, or if they used the same test, but used different scoring procedures.

Second, the baseline characteristic and outcome measure must have a correlation of 0.6 or higher.

When these two conditions are met, there are three additional methods that can be used to satisfy the statistical adjustment requirement:

‐ a difference‐in‐differences adjustment,
‐ gain scores, and
‐ fixed effects.

When the study authors do not perform a statistical adjustment, but these two conditions hold, the WWC will perform its own difference‐in‐differences adjustment to satisfy the statistical adjustment requirement.
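The version 4 decision rule for a single baseline measure can be sketched in Python as follows. The function name, the adjustment labels, and the argument names are hypothetical. The thresholds and conditions are the ones just described. The sketch covers only the statistical adjustment requirement, not the rest of the review.

```python
# Adjustments that satisfied the requirement under version 3 and still do.
ALWAYS_ACCEPTABLE = {"regression", "ancova"}
# Adjustments acceptable under version 4 only when both conditions hold.
CONDITIONAL = {"difference_in_differences", "gain_scores", "fixed_effects"}


def satisfies_baseline_equivalence(effect_size, adjustment=None,
                                   same_units=False, correlation=0.0):
    """Return True if the baseline equivalence requirement is satisfied.

    effect_size: baseline standardized mean difference between groups
    adjustment:  label of the authors' adjustment, or None if unadjusted
    same_units:  baseline and outcome measured in the same units
    correlation: correlation between the baseline and outcome measures
    """
    size = abs(effect_size)
    if size <= 0.05:
        return True   # equivalent without any adjustment
    if size > 0.25:
        return False  # no adjustment can satisfy the requirement
    if adjustment in ALWAYS_ACCEPTABLE:
        return True
    conditions_hold = same_units and correlation >= 0.6
    if adjustment in CONDITIONAL or adjustment is None:
        # With no author adjustment, the WWC applies its own
        # difference-in-differences adjustment if the conditions hold.
        return conditions_hold
    return False      # unrecognized adjustment label


# Hypothetical study: 0.10 baseline difference, gain scores on the same test.
print(satisfies_baseline_equivalence(0.10, "gain_scores",
                                     same_units=True, correlation=0.7))  # True
```

Treating a missing adjustment the same way as the conditional methods mirrors the last point above: when both conditions hold, the WWC's own adjustment fills the gap.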

Going back to reporting for a moment, recall that in reporting findings that meet standards, the WWC performs a difference‐in‐differences adjustment when an analysis does not adjust for baseline differences. We still do a difference‐in‐differences adjustment for such studies, even if the correlation between the baseline characteristic and the outcome is below 0.6.

Let’s do a couple of knowledge checks.

Slide 24

Knowledge Check 1. A quasi‐experimental design study analyzed raw Peabody Picture Vocabulary Test (or PPVT) scores for 60 intervention and 60 comparison group students, estimating impacts by comparing unadjusted means. We are told that the baseline effect size is 0.07 standard deviations. The correlation between the raw PPVT scores at baseline and follow‐up is 0.82.

What is the highest rating this study is eligible to receive?

A. Meets WWC Group Design Standards Without Reservations,

B. Meets WWC Group Design Standards With Reservations, or

C. Does Not Meet WWC Group Design Standards

Slide 25

The correct answer is B: Meets WWC Group Design Standards With Reservations.

The study uses a quasi‐experimental design and unadjusted means to assess impacts. The baseline difference between the intervention and comparison groups is 0.07 standard deviations, and the pretest and the posttest are measured using the same units. The correlation between the pretest and posttest is over 0.6. Therefore, the WWC can do a difference‐in‐differences adjustment and the study is eligible to receive the Meets WWC Group Design Standards With Reservations rating.

A and C are incorrect answers.

Slide 26

Knowledge Check 2. A quasi‐experimental study analyzed the impact of a high school tutoring program by comparing student scores on a grade 11 district‐wide math test at the end of the school year for 40 students who received the program to scores for 40 students who did not receive the program. Normalized baseline scores on the grade 10 district‐wide math test have a mean of 0.04 for the intervention group, a mean of ‐0.04 for the comparison group, and a standard deviation of 1 for both groups.

To estimate impacts, the authors normalized the grade 10 and grade 11 scores so that each had a mean of 0 and standard deviation of 1, stacked the data, and estimated a model with student fixed effects, a time fixed effect, and an interaction between time and an indicator for being in the intervention group. The coefficient on the interaction term was the measure of the program’s effect.

What is the highest rating this study is eligible to receive?

A. Meets WWC Group Design Standards Without Reservations,

B. Meets WWC Group Design Standards With Reservations, or

C. Does Not Meet WWC Group Design Standards

Slide 27

The correct answer is “C”: Does Not Meet WWC Group Design Standards.

The baseline difference between the intervention and comparison groups is 0.08 standard deviations. The authors use individual fixed effects to account for the baseline difference, but the baseline and outcome scores do not have the same units. Even though the grade 10 and grade 11 math tests are both district‐wide assessments and have been transformed to the same scale, they are not the same assessment. Therefore, the study does not satisfy the baseline equivalence requirement, and it should receive the Does Not Meet WWC Group Design Standards rating.

“A” and “B” are incorrect answers.
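The 0.08 baseline difference in this knowledge check follows directly from the reported means and standard deviations. A minimal Python sketch of the arithmetic is below. The function name is hypothetical, and the sketch applies no small‐sample correction, which is a simplifying assumption.

```python
import math


def standardized_mean_difference(mean_i, mean_c, sd_i, sd_c, n_i, n_c):
    """Difference in group means divided by the pooled standard deviation.

    Assumption: no small-sample correction is applied in this sketch.
    """
    pooled_var = ((n_i - 1) * sd_i ** 2 + (n_c - 1) * sd_c ** 2) / (n_i + n_c - 2)
    return (mean_i - mean_c) / math.sqrt(pooled_var)


# Values from the knowledge check: 40 students per group, SD of 1 in each.
d = standardized_mean_difference(0.04, -0.04, 1.0, 1.0, 40, 40)
print(round(d, 2))  # 0.08

# 0.08 is above 0.05 and at most 0.25, so an adjustment is required.
needs_adjustment = 0.05 < abs(d) <= 0.25
```

Because the difference falls in the range that requires an adjustment, the rating depends on whether the fixed effects adjustment is acceptable, and here it is not because the baseline and outcome tests differ.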

I will now turn it over to Dana to talk about the revised missing data standards.

Thanks, Allison.

Slide 28

We’re going to spend the rest of the webinar today, or the bulk of the rest of the webinar today, discussing the revised missing data standards. I’m going to start off by giving you a broad picture of what’s changed with the WWC missing data standards. I’m then going to talk about a particular case that reviewers often run into, and that’s when baseline data is missing for some members of the analytic sample. I’m then going to dig a little bit deeper into the overall review process for reviewing studies with missing data.

Slide 29

So as I mentioned, there have been major revisions to the missing data standards. Let me just tell you briefly about how that’s operationalized.

Under standards version 3.0, authors had to use an acceptable approach to address all missing data in the analytic sample. That itself hasn’t changed under standards version 4.0. However, there have been some revisions to the list of methods for addressing missing data that the WWC has classified as acceptable.

Under standards version 3.0, studies could satisfy baseline equivalence only by using non‐imputed data for the entire analytic sample. That’s been revised under standards version 4.0. Under standards version 4.0, studies can satisfy baseline equivalence using data on a subset of the analytic sample, or imputed data for the analytic sample.

Finally, under standards version 3.0, only low‐attrition RCTs that impute outcome data were eligible to meet WWC group design standards. Under version 4.0, low‐attrition RCTs, high‐attrition RCTs, and QED studies that impute outcome data are all eligible to meet group design standards, so long as they limit the potential bias from analyzing imputed data. And I’ll tell you more about what that means in a moment.

Slide 30

So let's talk about one common situation that a reviewer might run into. This is possibly the most common situation with missing data, and this is when baseline data are missing for a small number of observations in the analytic sample, but we still need to assess baseline equivalence. So for example, some students with outcome data were absent on the day of a pretest.

The WWC has calculated new formulas that estimate how large the baseline difference between the intervention and comparison groups in the full analytic sample could reasonably be based on a few different pieces of information.

When some baseline data is missing, the largest plausible baseline difference between the intervention and comparison groups is determined by the baseline effect size calculated using the observed baseline data in the analytic sample. So we use the information on the baseline we have. And smaller effect sizes calculated using the observed baseline data in the analytic sample will imply smaller plausible differences for the full analytic sample.

These new formulas also bring in information on outcome means. So we use information on the outcome, to think about the baseline. So, in particular, the standardized difference in outcome means between the full analytic sample and the subsample with observed baseline data measured separately for the intervention and comparison groups tells us something about the baseline difference for the full analytic sample. And here, smaller differences in outcomes between the full analytic sample and the subsample with observed baseline data will imply smaller plausible baseline differences within the full analytic sample.

Finally, it brings in information on the correlation between the baseline and the outcome measures, where larger correlations will imply smaller plausible baseline differences for the full analytic sample.

Slide 31

I’m now going to show you some of the formulas that the SRG uses in order to calculate the largest plausible baseline difference. These formulas are complex. The SRG has tools to support reviewers of these studies, and the SRG will do all the calculations for you. So don’t stress about the formulas. We’re not going to test on the formulas either – just the underlying concepts and the review process when studies have missing data.

But I want to go through the formulas for this specific case that you’re most likely to run into in a bit of detail, so that you are comfortable with the concepts and also to build some of the intuition behind what the SRG is doing in these cases.

So to assess baseline equivalence when some baseline data is missing, instead of comparing the baseline effect size to cutoff values – the .05 and .25 that you usually use – you’re going to compare the maximum of four different quantities to the same cutoff values, where those are defined by these equations. There’s a lot of stuff going on in these equations, but they can actually be broken down into a few easy‐to‐understand pieces.

The first piece that forms these equations is the effect size in the observed baseline data. So you take the baseline data for the analytic sample that you do have observed data for, and you calculate the effect size.

Another piece of this is the correlation between the outcome and the baseline measure.

The third and fourth pieces are the differences in the outcome between the full analytic sample and the analytic sample with observed baseline data. Like I mentioned before, we’re using information on the outcomes to infer something about the baseline.

And the final piece of these formulas is the normalization factor or the standard deviation of the outcome measure.

All four formulas draw information from those four types of statistics that I just mentioned. So even though they look complicated, they can really be broken down into several different pieces, many of which reviewers are very used to working with.

Slide 32

Again, this slide is to help to build intuition on what the SRG is doing when it’s assessing whether the largest plausible baseline difference satisfies baseline equivalence. On the X axis here, I’ve plotted the intervention group standardized difference in outcome means. So that’s the difference in the outcome between the full analytic sample and analytic sample with observed baseline data, divided by the standard deviation of the outcome. On the Y axis is the same measure but for the comparison group.

This gold region that you see here shows the largest differences in the outcome means that still satisfy baseline equivalence. And in particular this assumes a correlation of .50. Here is that region, or the shape, for the correlation of .60. Here it is for .70 and here it is for .80.

There are three main points that I want to make with this diagram. The first is that higher correlations between the baseline and the outcome measure imply larger regions. So higher correlations allow for larger possible differences in means between the full analytic sample and the analytic sample with non‐missing baseline data.

The second main point I want to make is that if differences are of the same sign for the intervention and the comparison groups, the differences can be pretty big. The case that we’re looking at here is when the study did not adjust for the baseline measure. So differences could be, for instance, around .1 standard deviations for both the intervention and the comparison groups, and baseline equivalence would still be satisfied because the largest plausible baseline difference would be below that .05 standard deviation cutoff.

The final thing I want to point out about this graph is that it’s totally symmetric. It doesn’t matter what you call the intervention group versus the comparison group, it doesn’t matter if a favorable outcome is negative or positive. Everything about this graph is symmetric.

Slide 33

Ok, so that’s a very specific case of how you would handle missing data. More generally, the WWC uses the following review process for studies with imputed outcome data or missing or imputed baseline data.

Again, the example we looked at is a special case of this review process. Now I’m going to discuss each step of the process in detail, and cover the other cases where this review process is used. In step one, a reviewer will assess whether the study uses an acceptable approach to address all missing data in the analytic sample. In step two, a reviewer will assess whether the study is a low‐attrition RCT. In step three, a reviewer determines whether the study limits potential bias from imputed outcome data, if any outcome data are imputed.

Slide 34

In step four, a reviewer checks whether the study is a high‐attrition RCT that analyzes the full randomly assigned sample using imputed data. And in step five, the reviewer assesses whether baseline equivalence is satisfied, where a different process is used to do that based on whether data in the analytic sample is missing or imputed for any baseline measure specified in the review protocol.

I’m going to talk about each of these steps in more detail now.

Slide 35

Step one: Does the study use an acceptable approach to address all missing data in the analytic sample?

Slide 36

Version 4.0 of the Standards Handbook lists acceptable approaches and key considerations for each. The acceptable approaches are also listed on this slide, and the slide that follows. These slides can be used as a reference. I will not go through all details on these slides. I just want to introduce you to each of the methods that the WWC has classified as acceptable approaches to dealing with missing data.

The first approach, which you most commonly see, is the complete case analysis approach. In this approach study authors only analyze observations for which all needed data are not missing. If you’re reviewing a study where this approach is used, you should follow the usual Group Design review process, counting any omitted data as attrition. There is no need to follow the remaining steps of the missing data review process.

Other acceptable approaches include regression imputation, where authors use a regression model to impute values for missing data.

Slide 37

Maximum likelihood, where they use an iterative routine to simultaneously estimate model parameters and account for missing data.

Nonresponse weights are also an acceptable method to account for missing outcome data only. Not for missing baseline data.

Slide 38

And then finally the dummy‐variable adjustment method, where authors set all missing values for a baseline measure to a single value and include an indicator variable for records missing data in the impact estimation model, is an acceptable method to correct for missing baseline data but not for missing outcome data. And note, for the dummy variable adjustment, when applied to baseline measures that the review protocol requires to satisfy baseline equivalence, the method is acceptable only for randomized controlled trials (both low‐ and high‐attrition RCTs), but not for QEDs or compromised RCTs.

The WWC may also consider other methods of addressing missing data acceptable. If you’re reviewing a study, and you think that an author uses a method of addressing missing data that’s acceptable, but it’s not in this list, you should consult with review team leadership.

It should also be noted that a single analysis may use multiple methods to deal with missing data. That’s totally acceptable, so long as all of the methods used are acceptable.

And then finally, if a study does not use an acceptable approach to address all missing data in the analytic sample, it receives the Does Not Meet WWC Group Design Standards rating. Note that this applies regardless of whether the measure with missing data is specified by the review protocol as required for satisfying baseline equivalence. So for example, if race and ethnicity was imputed using a method which is not acceptable, a study would not meet standards, even if we did not require equivalence based on race and ethnicity to be demonstrated in the review protocol.
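The rules just described for step one can be sketched as a small lookup. This is only an illustration of the approaches named in this webinar, not an official tool. The method labels, design labels, and function name are hypothetical. A method that is not in the table is simply reported as not acceptable here, whereas in practice a reviewer would consult review team leadership.

```python
# Kinds of missing data each approach may address, as described in the webinar.
ACCEPTABLE_FOR = {
    "complete_case": {"baseline", "outcome"},
    "regression_imputation": {"baseline", "outcome"},
    "maximum_likelihood": {"baseline", "outcome"},
    "nonresponse_weights": {"outcome"},   # outcome data only
    "dummy_variable": {"baseline"},       # baseline data only
}


def is_acceptable(method, data_kind, design="rct", required_for_equivalence=True):
    """Return True if `method` is an acceptable way to handle missing `data_kind`.

    design: "rct" (not compromised, low or high attrition),
            "compromised_rct", or "qed" (hypothetical labels).
    """
    if data_kind not in ACCEPTABLE_FOR.get(method, set()):
        return False
    # Dummy-variable adjustment on a measure required for baseline
    # equivalence is acceptable only for RCTs that are not compromised.
    if method == "dummy_variable" and required_for_equivalence:
        return design == "rct"
    return True


print(is_acceptable("nonresponse_weights", "baseline"))                # False
print(is_acceptable("dummy_variable", "baseline", design="qed"))       # False
print(is_acceptable("maximum_likelihood", "outcome", design="qed"))    # True
```

A single analysis may combine several approaches, in which case each one would need to pass this kind of check for the data it addresses.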

Slide 39

Ok, so let's just go over a quick knowledge check on this topic. Which of the following studies cannot meet WWC group design standards?

A. A low‐attrition RCT that imputes missing baseline and outcome data using regression imputation.

B. An RCT with compromised random assignment that accounts for missing baseline data using a dummy variable adjustment.

C. A QED that estimates effects using unadjusted means and standard deviations and is missing baseline data for 40 percent of the analytic sample.

D. A QED that uses maximum likelihood to analyze a sample with missing baseline and outcome data.

E. All of the above can meet WWC group design standards.

Slide 40

The correct answer here is “B” – An RCT with compromised random assignment that accounts for missing baseline data using a dummy variable adjustment. Dummy‐variable adjustment is an acceptable approach for handling missing baseline data only for RCTs without compromised random assignment. So this study will receive the Does Not Meet WWC Group Design Standards rating.

“A” ‐ A low‐attrition RCT that imputes missing baseline and outcome data using regression imputation – that study uses an acceptable approach for handling missing data. It’s a low‐attrition RCT. Thus, unless there are any other design issues, the study will receive the Meets Group Design Standards Without Reservations rating.

“C” is also an incorrect answer. “C” was a QED that estimates effects using unadjusted means and standard deviations and is missing baseline data for 40 percent of the analytic sample. Although a large amount of baseline data is missing, the study is eligible to receive the Meets Standards With Reservations rating, as long as the largest anticipated baseline difference in the analytic sample is less than 0.05 standard deviations.

“D” is also incorrect. “D” was a QED that uses maximum likelihood to analyze a sample with missing baseline and outcome data. Maximum likelihood is an acceptable method for accounting for missing data. So unless there are other design issues, the study would be eligible to receive the Meets Standards With Reservations rating.

Slide 41

Let's go on to talk about step two: Is the study a low attrition RCT?

Slide 42

To assess whether the study is a low attrition RCT, a reviewer will calculate overall and differential attrition, counting sample members with imputed outcome data as missing. So the denominator for this calculation will be the full sample subject to random assignment, and the numerator is the number of those sample members who are not in the analytic sample once any imputed outcome data are excluded. The WWC treats imputed data in the same way as missing data when assessing attrition, because both present a potential threat of bias. So a low attrition RCT would be eligible to receive the meets standards without reservations rating, so long as the study used an acceptable method to address missing data.

For QEDs, high attrition RCTs, and compromised RCTs, a reviewer should proceed to step three.

Slide 43

Before moving on though, let's go through a quick knowledge check.

Researchers randomly assigned 50 students to a comparison group, and 50 students to an intervention group. The intervention group received a one‐year intensive tutoring program. By visiting schools on two days, the researchers were able to obtain pretest scores for 43 students in the intervention group and 41 students in the comparison group, and posttest scores for 45 students in the intervention group and 40 students in the comparison group. Using an acceptable method of imputation, they analyzed data from an analytic sample including the 47 intervention group students and 43 comparison group students who had either pre‐ or posttest data (or both).

What is the overall rate of attrition? Is it:

A. 10 percent

B. 15 percent

C. 16 percent or

D. 21 percent

The correct answer here is “B”, 15%.

Slide 44

The difference between the number of students randomly assigned and the number of students with observed outcome data should be used to calculate attrition. In this case, the researchers randomly assigned 100 students and 85 had observed outcome data (45 in the intervention group and 40 in the comparison group). Therefore, the overall rate of attrition is 15 percent.

A, C, and D are all incorrect answers. The quoted levels of attrition that correspond to these options result from treating observations with observed pretest data and missing posttest data or observed posttest data and missing pretest data incorrectly in assessing attrition.
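The arithmetic behind this knowledge check can be sketched in a few lines. The function name is hypothetical. The key assumption, taken from the discussion above, is that only sample members with observed (not imputed) outcome data count as retained.

```python
def attrition_rates(assigned_i, assigned_c, observed_i, observed_c):
    """Return (overall, differential) attrition as proportions.

    assigned_*: number randomly assigned to each group.
    observed_*: number with observed outcome data (imputed outcomes excluded).
    """
    attrition_i = 1 - observed_i / assigned_i
    attrition_c = 1 - observed_c / assigned_c
    overall = 1 - (observed_i + observed_c) / (assigned_i + assigned_c)
    differential = abs(attrition_i - attrition_c)
    return overall, differential


# Knowledge check: 50 students assigned per group, posttest scores
# observed for 45 intervention and 40 comparison students.
overall, differential = attrition_rates(50, 50, 45, 40)
print(f"overall: {overall:.0%}, differential: {differential:.0%}")
```

Whether a given combination of overall and differential attrition counts as low or high depends on the attrition boundary specified in the review protocol, which this sketch does not encode.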

Slide 45

Great, so let's move on to step three. Does the study limit potential bias from imputed outcome data, if any outcome data are imputed?

Slide 46

When a QED, high‐attrition RCT, or RCT with compromised random assignment analyzes imputed outcome data, the study must demonstrate that the potential bias from that imputed outcome data is limited. And by limited, we mean less than 0.05 standard deviations.

The potential bias from analyzing imputed outcome data is determined by the standardized difference in baseline means between the full analytic sample and the subsample with observed outcome data, measured separately for the intervention and comparison groups (where smaller differences will imply smaller biases). So just like when we talked about the case of missing baseline data we used information on the outcome measure to infer something about the missing baseline data, we’re using information from the baseline here to infer something about the imputed outcome data.

The potential bias is also determined by the fraction of the analytic sample missing outcome data measured separately for the intervention and comparison groups (where smaller fractions will imply smaller biases); and the correlation between the baseline and the outcome measure (where larger correlations imply smaller biases). So for this last item, the correlation, you’re typically using the baseline measure in order to impute the outcome measure. So if there’s a stronger link between the baseline measure and the outcome measure, we can guess that our imputation is in some sense better, and larger correlations will imply smaller possible biases. If a study must limit potential bias from imputed outcome data, but does not do so, it will receive the does not meet WWC Group Design Standards rating.

Slide 47

So, here’s another one of the graphs that I showed you previously. Now on the X axis, I’ve plotted the intervention group standardized difference in baseline means (so that’s the difference in baseline means for the full analytic sample versus the analytic sample with observed outcome data divided by the standard deviation of the baseline measure). And then the same measure for the comparison group is plotted on the Y axis. Now the gold region outlines the largest differences in baseline means that limit potential biases from imputation, and in this case, we’re looking at the region when the correlation between baseline measure and the outcome measure is equal to .50. Here is the region for a correlation of .60, for .70, and for .80.

So again, three main take‐aways here for this graph. First, the graph is symmetric. So just like before, it doesn’t matter what you call the intervention group, it doesn’t matter what you call the comparison group, it doesn’t matter whether positive numbers refer to favorable impacts, or positive numbers refer to non‐favorable impacts. It’s all symmetric.

Second, like before, the bigger the correlation between the baseline and the outcome measure, the larger the differences can be and still lead to limited potential bias from imputation.

Finally I want to note that this diagram was created by assuming that there was no missing or imputed baseline data. This is a specific case but we could create a similar diagram for different levels of missing or imputed baseline data. There’s nothing particularly special about this particular case. It’s just one to highlight.

Slide 48

Moving on to step four. In step four a reviewer will assess whether a study is a high attrition RCT that analyzes the full randomly assigned sample.

Slide 49

In general, the WWC requires that high attrition RCTs satisfy the baseline equivalence requirement because of a risk of bias from compositional differences between the intervention and comparison group members that remain in the sample after attrition.

However some high attrition RCTs impute all missing outcome data and analyze the original randomly assigned sample. These high attrition RCTs do not need to satisfy the baseline equivalence requirement because of a presumption that the intervention and comparison groups that result from random assignment are unlikely to have substantive compositional differences. So put another way, a high attrition RCT is eligible to receive the meets group design standards with reservations rating, if it does the following:

1. Uses an acceptable approach to address all missing data, that you assessed in step one,

2. Limits the potential bias from imputed outcome data, as assessed in step three, and

3. Imputes outcome data for all randomly assigned sample members who are missing outcome data.

Notably absent from this list is satisfying baseline equivalence. High attrition RCTs that satisfy the above three requirements do not need to satisfy baseline equivalence to meet group design standards.

Slide 50

Let's move on to step five: Assessing Baseline Equivalence. We’re going to use a slightly different method for assessing baseline equivalence based on whether data in the analytic sample are missing or imputed for any baseline measure specified in the review protocol.

Slide 51

If the analytic sample includes no subjects with missing or imputed data for the measures required by the protocol to satisfy baseline equivalence, use the usual WWC procedures to assess baseline equivalence. You don’t need to do anything special in this case, you just look at your means and standard deviations of the baseline measure for the full analytic sample, and calculate the baseline effect size, comparing that to our typical cutoff values.

If you’re not in this situation, though, you need to proceed to step 5b in conducting a review.
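The comparison against the typical cutoff values can be sketched as follows. The function name is hypothetical, and the handling of values exactly at .05 and .25 (treated here as inclusive upper bounds) is an assumption rather than something stated in this webinar. The same comparison applies whether the input is the usual baseline effect size or the largest baseline difference used when baseline data are missing or imputed.

```python
def baseline_equivalence_satisfied(effect_size, adjusted):
    """Apply the .05 and .25 cutoffs to a baseline difference in SD units.

    adjusted: True if the analysis statistically adjusts for the baseline measure.
    """
    size = abs(effect_size)
    if size <= 0.05:
        return True        # satisfied with or without adjustment
    if size <= 0.25:
        return adjusted    # adjustment range: satisfied only if adjusted
    return False           # too large to satisfy baseline equivalence


print(baseline_equivalence_satisfied(0.03, adjusted=False))  # True
print(baseline_equivalence_satisfied(0.12, adjusted=False))  # False
print(baseline_equivalence_satisfied(0.12, adjusted=True))   # True
```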

Slide 52

Does the study satisfy baseline equivalence using the largest baseline difference accounting for missing or imputed baseline data?

If a study has missing or imputed baseline data, you want to measure the largest reasonable baseline difference accounting for these data.

The exact formulas that you’ll use in order to calculate this largest reasonable baseline difference will depend on the precise situation you’re in, but can include means, standard deviations, correlations, and sample sizes. All of the formulas just use these standard statistics.

As discussed earlier, if baseline data are missing, these formulas essentially take the baseline difference estimated using observed baseline data, and then adjust it. If baseline data are imputed, the formulas can adjust either the baseline difference estimated using both observed and imputed data, or the baseline difference estimated using observed data only.

Slide 53

So this slide here is a reference slide, and again, these slides should be available to you through the ON24 environment. If you have any questions about how to access the slides, you should let me know. Hopefully we’ll be able to deal with those questions as they come up.

But this particular slide is a reference slide that you should use in order to understand what types of additional data are needed. You do not have to memorize it, you just have to be able to use it to understand what you need to request in an author query. And an SRG will also help you out here, as it will indicate the types of data you need to provide in different cases.

So what this slide does is, for each of the six cases you could possibly be in, based on whether outcome data are always observed or sometimes imputed, and baseline data are always observed, sometimes missing, or sometimes imputed, it contains X’s for the boxes specifying the statistics that you need to gather in order to conduct a review.

For two of the cases, there are options one and two. This means we need either the option one, or the option two information to assess baseline equivalence, but not both. If information from option one is available we’ll typically use it. It’s preferred to option two. But we can just as easily use the information from option two.

Slide 54

So that’s all five steps of the missing data review process. I’m now going to go through an example. This is the most complex missing data scenario in that it’s designed to walk reviewers through all five steps. Typically, a reviewer would not necessarily have to walk through all five of the steps, but we designed this example in order to do just that.

So in this example, we’re looking at an RCT which assigned 100 students to an intervention group and 100 students to a comparison group. Pretest data was observed for 90 students in the intervention group and 85 students in the comparison group. Posttest data was observed for 90 students in the intervention group and 80 students in the comparison group. In total, 95 students in the intervention group and 90 students in the comparison group had observed data for either the pretest or the posttest, or both.

For all students who took either the pretest or the posttest (but not both) the study authors imputed missing test scores using multiple imputation. The imputation model included controls for study group, all covariates that were used in the impact estimation model, and, when imputing pretest scores, the model controls for the posttest. The correlation between the pretest and the posttest was 0.60.

The study authors used a regression framework, controlling for pretest scores, and all imputed and observed data to estimate impacts.

The review protocol specifies that the cautious attrition threshold should be used. What is the highest rating the study is eligible to receive?

Let's walk through the process that would be involved in reviewing this study and determining the highest rating the study is eligible to receive.

Slide 55

In step one, a reviewer would assess whether the study uses an acceptable approach to address all missing data in the analytic sample. In this case the study authors do use an acceptable approach to address all missing data in the analytic sample. The study authors used multiple imputation, which is a form of regression imputation, to impute missing values when exactly one test score was observed. The imputation model included all of the necessary covariates. So, the WWC would classify this as an acceptable approach to address all missing data.

In step two you would determine whether the study is a low attrition RCT. In this case, the answer to the question in step two is no. It is a high attrition RCT. Posttest data was observed for 90 students in the intervention group and 80 students in the comparison group. Counting imputed data as attrition, the overall attrition rate was 15%, and the differential attrition rate was 10 percentage points. Under the cautious attrition boundary, we would classify this as high attrition.

Slide 56

Given that, we move on to step 3: Does the study limit potential bias from imputed outcome data, if any outcome data are imputed? In this case, we would need a little bit more information. And in particular, we would need information on the unadjusted mean of the pretest for the sample with observed pretest data, the unadjusted mean of the pretest for those with observed pretest and posttest data, and the standard deviation of the pretest for those with observed pretest data.

Based on this information, the potential bias from imputed outcome data can be calculated to be .03 standard deviations. So since .03 is less than .05, we can determine that the study authors limited potential bias from imputed outcome data.

Based on this we move on to step four: is the study a high attrition RCT that analyzes the full randomly assigned sample? In this case, the answer is no. Some students had missing data on both the pretest and the posttest. These students were omitted from the analysis. So even though this is a high attrition RCT, it does not analyze the full randomly assigned sample, and we need to move on to assess baseline equivalence in step 5.

Slide 57

We’re going to use step 5b as opposed to step 5a because the analytic sample includes some observations with missing baseline data. In order to assess the largest baseline difference accounting for the missing or imputed baseline data, we would need a few more pieces of information. In particular, we need the unadjusted mean of the pretest for those with observed or imputed pretest data; the unadjusted mean of the posttest for those with observed posttest data; the unadjusted mean of the posttest for those with observed pretest and posttest data; and the standard deviation of the posttest for those with observed posttest data. So you put all of this information into the SRG, and the SRG would tell you that the largest baseline difference accounting for imputed data is 0.09 standard deviations. If you’ll recall, the study authors controlled for the pretest in the regression analysis used to estimate impacts, so the study satisfies baseline equivalence because that largest baseline difference is in the adjustment range and the authors adjusted for the pretest in the analysis. Therefore, overall we can conclude this study receives the Meets WWC Group Design Standards With Reservations rating.
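The five-step walk-through for this example can be sketched as a simple decision function. This is an illustrative reading of the process as described in this webinar, not the SRG's logic. The function and argument names are hypothetical, and each argument stands in for a judgment or calculation that a reviewer (or the SRG) would make separately.

```python
MEETS_WITHOUT = "Meets WWC Group Design Standards Without Reservations"
MEETS_WITH = "Meets WWC Group Design Standards With Reservations"
DOES_NOT_MEET = "Does Not Meet WWC Group Design Standards"


def highest_rating(acceptable_approach, low_attrition_rct, high_attrition_rct,
                   outcomes_imputed, outcome_bias, analyzes_full_assigned_sample,
                   baseline_equivalence_satisfied):
    # Step 1: every missing-data approach must be acceptable.
    if not acceptable_approach:
        return DOES_NOT_MEET
    # Step 2: low-attrition RCT (imputed outcomes counted as missing).
    if low_attrition_rct:
        return MEETS_WITHOUT
    # Step 3: potential bias from imputed outcomes must be below 0.05 SD.
    if outcomes_imputed and not outcome_bias < 0.05:
        return DOES_NOT_MEET
    # Step 4: high-attrition RCT analyzing the full randomly assigned sample.
    if high_attrition_rct and analyzes_full_assigned_sample:
        return MEETS_WITH
    # Step 5: otherwise baseline equivalence must be satisfied.
    return MEETS_WITH if baseline_equivalence_satisfied else DOES_NOT_MEET


# Inputs as described in the example above.
rating = highest_rating(
    acceptable_approach=True,             # multiple imputation, step 1
    low_attrition_rct=False,              # high attrition, step 2
    high_attrition_rct=True,
    outcomes_imputed=True,
    outcome_bias=0.03,                    # step 3
    analyzes_full_assigned_sample=False,  # step 4
    baseline_equivalence_satisfied=True,  # 0.09 SD with adjustment, step 5b
)
print(rating)
```

Keeping each step as a separate early return mirrors the way a reviewer can stop as soon as a step settles the rating.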

Slide 58

Let's just conclude with a couple of knowledge checks.

An RCT assigned 60 students to the intervention group and the same number to the comparison group. Outcome data are available for 50 students in the intervention group and 45 students in the comparison group. Baseline data are available for 45 students in the intervention group and 50 students in the comparison group. The study authors imputed all missing baseline and outcome data using an acceptable approach. There is no reason to
