Of Cabbages and Kings: Classroom Observations & Value-Added Measures

Julie Cohen & Pam Grossman
Stanford University

March 30, 2011

Paper to be presented at the Annual Meeting of AERA, April 2011. We would like to thank the Carnegie Corporation, W.T. Grant Foundation, and the Spencer Foundation for funding this work.
Introduction and research questions
Teachers and teacher quality are the focus of current discussions about
educational improvement. A number of studies suggest teachers represent one of the most
important factors affecting student achievement (cf. Rivkin, Hanushek, & Kain, 2005;
Rockoff, 2004). This recognition has led to policies, codified into the Race to the Top
legislation, promoting more rigorous evaluation of teachers. There is tremendous
enthusiasm among policy-makers about the use of value-added methodologies to assess
teacher effectiveness, including using such measures instead of years of experience or
education coursework to make consequential tenure decisions. This enthusiasm is
tempered by many researchers’ skepticism that such measures can be used to evaluate
individual teachers (cf. Raudenbush, 2004; Rothstein, 2009; McCaffrey et al., 2004). To
address these concerns, researchers advocate using multiple measures, including
structured observations, to capture features of teaching (Gitomer, 2009; Goe, Bell, &
Little, 2008). This allows researchers and practitioners alike to understand the process,
the work that teachers do in classrooms, that is associated with outcomes such as student
achievement gains. However, relatively few studies have tried to go into the classrooms
of teachers identified as more or less effective to try to understand the relationship
between value-added measures and classroom instruction (cf. Grossman et al., 2009;
Kane, Taylor, Tyler, & Wooten, 2010). While many might assume a straightforward
relationship between teaching practices and student achievement gains, the nature and
strength of that relationship may largely depend upon how both classroom practice and
student learning are conceptualized and measured.
In this paper, we focus on a variety of issues related to measuring classroom
practice, using both value-added measures and one subject-specific observation
instrument, the Protocol for Language Arts Teaching Observation or PLATO, as the basis
for this exploration. Our research questions are:
• How do classroom practices of more effective teachers differ from those of less effective teachers? Are there consistent patterns both within and across schools?
• Are value-added measures identifying teachers who score higher on measures of classroom practice?
Background
Even as research has begun to document that teachers matter, there is less
certainty about what attributes of teachers actually make the most difference in raising
student achievement. Some studies point to teacher attributes, including certification, the
selectivity of teachers’ undergraduate institutions, and scores on tests of general
knowledge and verbal ability, as factors related to student achievement gains (see Rice,
2003 for a review). Other work indicates that programmatic differences in teacher
preparation and other measured characteristics may account for only a limited portion of
the variation in student achievement among teachers, particularly in English/Language
level, yet we believe that middle school is a consequential time in students’ academic
lives (Carnegie Corporation, 1989).
In order to look at the quality of instruction provided in English/Language Arts,
we need measures of teaching that can be used across multiple settings and schools.
Structured observation protocols direct observers to focus on specific facets of
instruction, and provide a common technical vocabulary for describing those facets.
Consistent language for describing teaching is rare (Grossman & McDonald, 2008), yet
such a shared vocabulary allows for comparisons of practice across classrooms. Some researchers have focused on
more “generic” features of classroom practice that cut across grade levels and subject
areas, such as behavior management and instructional planning and reflection (Danielson,
2006; Pianta, LaParo, Stuhlman, 2004). A few protocols have targeted specific content
areas, particularly mathematics (Hill, 2005) and elementary reading (Taylor, Pearson,
Peterson, & Rodriguez, 2005; Hoffman, Sailors, & Duffy, 2004). None of the existing
observation protocols, however, provides a way to measure the quality of ELA classroom
practice across the multiple domains of ELA, particularly at the secondary level.
The paucity of discipline-based observation approaches has been a persistent
problem in efforts to develop assessments of teaching (Kennedy, in press). Indeed, as
Kennedy argues, “until recently, assessments have not attended to the intellectual
substance of teaching: to the content actually presented, how that content is represented,
and whether and how students are engaged with it…Documenting the intellectual
meaning of teaching events remains the elusive final frontier in performance assessment”
(p. 21). To that end, the PLATO instrument builds on existing observation tools and
research on effective teaching practices in ELA in an attempt to parse the different facets
of teaching events in secondary ELA classrooms.
Research Design and Methods

Sample
We first identified 45 middle schools (6th-8th grade) in New York City that are
similar in terms of student demographics (more than 70% minority students, more than
50% of students qualifying for free and reduced lunch). To maximize potential observation
time, we identified a subset of 37 schools that were geographically clustered and had at
least 10 ELA teachers. We then contacted principals to request participation in the study.
After identifying willing schools, we worked with the school’s literacy coach or another
administrator to recruit teachers. We observed in all schools in which the majority of
teachers were willing to participate. Though we did not sample based on value-added
models as we had in previous rounds of data collection (see Grossman et al., 2009), we
assumed that if the majority of teachers in a school participated we would have a range of
levels of “effectiveness” within the sample (see Table 1 for background characteristics of
teachers in this sample).1

1 There were a number of instances in which teachers were eliminated from the study. For example, a few teachers designated as teaching ELA actually only taught special education classes or classes in which Spanish was the language of instruction. We did not observe in their classrooms.
Table 1: Background Characteristics of Teachers in the Sample

| Variable | 2008-9 Sample (N=179)* | 2008-9 Teachers without value-added coefficients used in analyses (N=54)* | 2008-9 Teachers with value-added coefficients used in analyses (N=125)* | New York City Middle School Teachers (N=3,777) |
|---|---|---|---|---|
| College Recommended (%) | 48.8 | 53.49 | 47.15 | 47.3 |
| New York City Teaching Fellows (%) | 16.3 | 18.6 | 15.45 | 24.3 |
| Teach For America (%) | 1.2 | 4.65 | 0 | 8.7 |
| Individual Evaluation (%) | 16.3 | 11.63 | 17.89 | 9 |
| Temporary License (%) | 13.3 | 6.98 | 14.45 | 0.9 |
| Female (%) | 83.05 | 80.77 | 84 | 83.1 |
| White (%) | 77.51 | 88.46 | 72.65 | 69.5 |
| Black (%) | 11.24 | 5.77 | 13.68 | 14.2 |
| Hispanic (%) | 5.92 | 3.84 | 6.84 | 9.7 |
| Other ethnicity (%) | 5.33 | 1.92 | 6.84 | 6.5 |
| Age | 36.67 | 32.07 (10.65) | 38.58 (10.65) | 32 |
| Years of Experience | 7.17 | 4.29 (5.33) | 8.368 (4.93) | |
| SAT Math (diff. N) | 500.44 | 526.36 (90.48) | 492.06 (100.02) | 495 |
| SAT Verbal (diff. N) | 526.22 | 505 (114.22) | 533.09 (88.83) | 509 |
| LAST score first time (diff. N) | 257.013 (24) | 264.32 (18.9) | 254.24 (25.24) | 258 (21.8) |

Our final sample of 13 schools included sites located in all of the boroughs of
New York City except Staten Island. Six of those schools were located in Queens,
five were in Brooklyn, one was in the Bronx, and one was in Manhattan. These schools
represented a large range in school size: at the smallest school, there were only 10 ELA
teachers, and at the largest, there were 38 (see Table 2 for demographics of the
schools in the study). Final participation rates also varied, from 45% of ELA teachers
in a school participating in the study to 90%. As can be seen in Table 1, the teachers in
our sample do not differ significantly from other ELA teachers in NYC public schools in
terms of gender or scores on the LAST, the Liberal Arts and Sciences Test required of
teachers in New York. Our sample differs somewhat in terms of pathway from the larger
population of middle school ELA teachers in NYC: we have fewer teachers in our sample
who entered teaching through Teach For America or the NYC Teaching Fellows. Perhaps
most surprising, we have a much larger percentage of teachers who entered teaching on a
Temporary License.
Table 2: Information on Middle Schools Included in the Study

| School Code | Environment Grade 2008-9 | Student Performance Score 2008-9 (out of 25) | Student Progress Score 2008-9 (out of 60) | # of ELA Teachers Total | ELA Teachers Participating in Study | Participating Teachers for whom we can calculate a value-added coefficient | Free and Reduced Lunch | English Language Learners | Black |
|---|---|---|---|---|---|---|---|---|---|
| 1 | A | 20.80 | 43.70 | 31 | 45.2% | 71.4% | 93% | 36% | 7% |
| 2 | A | 20.30 | 41.80 | 25 | 72.0% | 72.2% | 89% | 23% | 3% |
| 3 | B | 18.20 | 44.60 | 20 | 90.0% | 61.1% | 91% | 18% | 15% |
| 4 | B | 20.90 | 44.80 | 27 | 59.3% | 68.7% | 88% | 11% | 3% |
| 5 | A | 19.50 | 29.80 | 23 | 60.9% | 78.6% | 92% | 65% | 6% |
| 6 | A | 21.90 | 38.90 | 16 | 50.0% | 75.0% | 94% | 5% | 91% |
| 7 | D | 17.40 | 42.00 | 25 | 80.0% | 70.0% | 58% | 4% | 74% |
| 8 | C | 15.40 | 37.90 | 10 | 50.0% | 100.0% | 76% | 13% | 18% |
| 9 | C | 21.70 | 43.10 | 38 | 57.9% | 90.9% | 78% | 12% | 7% |
| 10 | D | 18.60 | 33.70 | 12 | 58.3% | 42.9% | 90% | 4% | 80% |
| 11 | A | 18.10 | 44.30 | 20 | 50.0% | 80.0% | 94% | 30% | 7% |
| 12 | A | 22.7 | 50.9 | 16 | 81.3% | 84.6% | 83% | 8% | 5% |
| 13 | B | 22.4 | 44.8 | 17 | 76.5% | 69.2% | 82% | 16% | 5% |
Raters
We worked with EDC to recruit potential raters for this study. Members of the
PLATO team trained 12 new raters in New York City with both an ELA background and
experience teaching in the middle grades. The two-day training was face-to-face and
focused on all of the PLATO elements. We provided numerous opportunities for
potential raters to score video clips and receive feedback on their scores. By the end of
the training and with some follow-up with individual raters, 12 of 14 potential raters
achieved 80% reliability, with exact score matches on at least five videos of ELA
instruction that had been master coded by several members of the PLATO team (for more
details about the raters, their use of the instrument and overall reliability, see Cor, 2011).
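As a minimal sketch of the exact-match agreement criterion used in rater training, consider the computation below; the scores are invented for illustration, not actual PLATO codes.

```python
import numpy as np

# Invented example: a trainee's ratings of ten 15-minute segments compared
# against the master codes assigned by the PLATO team.
master = np.array([3, 2, 4, 1, 3, 2, 2, 4, 3, 1])
trainee = np.array([3, 2, 4, 2, 3, 2, 1, 4, 3, 1])

# Exact-match agreement: the share of segments scored identically.
agreement = np.mean(master == trainee)
print(f"exact agreement: {agreement:.0%}")  # prints "exact agreement: 80%"
```

A trainee at or above the 80% threshold on the master-coded videos would count as reliable under the criterion described above.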
Revision of PLATO rubric
The PLATO protocol used in this study included twelve elements of instruction
highlighted in existing literature on adolescent literacy and effective instruction in
secondary ELA: purpose, intellectual challenge, representations of content, connections
to prior knowledge, connections to personal and cultural experiences, modeling, strategy
use and instruction, guided practice, classroom discourse, text-based instruction,
accommodations for language learning, and classroom environment (for details on the
development of these elements and the background literature on each, see Grossman et al.,
2009). These
elements were refined based on analysis of previous rounds of data collection. Text-based
instruction was added to assess how regularly and effectively a teacher and students
reference texts and use those textual references to meet the goals of ELA instruction. The
classroom environment element looks at both time and behavior management to assess
the teacher’s efficient organization of classroom routines and materials to ensure that
instructional time is maximized, and the degree to which student behavior facilitates
academic work. A factor analysis of a previous round of data collection suggests
three underlying factors: disciplinary demand and representation of content, instructional
scaffolding, and classroom environment.
The first version of PLATO used a seven-point scale, in which reliability was
achieved based on 80% agreement with exact and adjacent score matching. Based on
feedback from experts (Gitomer, personal communication), we switched to a 4-point
scale. Each element includes a rubric that details how to score instruction on a scale from
one (lowest) to four (highest). In addition, PLATO captures the content of instruction
(writing, reading, literature, grammar) and activity structures (whole group, small group,
independent work). PLATO is designed for use over multiple segments of instruction in
each lesson; each observation cycle captures 15 minutes of instruction with five minutes
for scoring. Observations focus on the classroom experience of the “average” student,
and try to weigh the balance of evidence across a fifteen-minute segment.
Observation Process
Teachers were observed on three separate days of instruction for at least two class
periods per day in two waves of observations. The number of observation cycles varied
depending on the length of class periods, but on average, teachers were observed for
twelve PLATO cycles. Neither observers nor teachers knew the teachers’ value-added
coefficient or quartile during observations. To ensure consistency among raters, 15% of
observations were double-coded. The study was designed to be able to conduct a
generalizability study, so teachers were observed by multiple raters, and raters were
assigned to multiple schools. (For results of the g-study, see Cor, 2011).
Value-added models
To calculate teachers’ value-added scores, we chose to run a number of different
specifications. The base model used to estimate teacher effects is summarized by Equation 1.
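As a rough sketch of how teacher effects can be estimated in a covariate-adjustment value-added model, the simulation below regresses students' current scores on prior scores plus teacher indicators; everything here (data, effect sizes, variable names) is invented for illustration and is not the study's actual specification.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 5 teachers, 40 students each. The estimated teacher
# coefficients are the "value-added" effects.
n_teachers, students_per_teacher = 5, 40
true_effects = rng.normal(0, 0.3, n_teachers)

prior = rng.normal(0, 1, n_teachers * students_per_teacher)
teacher = np.repeat(np.arange(n_teachers), students_per_teacher)
current = 0.7 * prior + true_effects[teacher] + rng.normal(0, 0.5, prior.size)

# Design matrix: prior score plus one dummy column per teacher (no intercept,
# so each teacher coefficient is that teacher's effect directly).
dummies = (teacher[:, None] == np.arange(n_teachers)).astype(float)
X = np.column_stack([prior, dummies])
coef, *_ = np.linalg.lstsq(X, current, rcond=None)

persistence, teacher_effects = coef[0], coef[1:]
print("estimated teacher effects:", np.round(teacher_effects, 2))
```

Actual value-added specifications typically add student and classroom covariates and shrink the estimates; this stripped-down version only shows the basic regression logic.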
| | Regression Coefficient | Std. Error |
|---|---|---|
| Representations of Content | -0.081 | 0.135 |
| Connections-Prior Knowledge | -0.028 | 0.265 |
| Connections-Personal/Cultural Experiences | -0.175 | 0.229 |
| Models/Modeling | -0.112 | 0.245 |
| Explicit Strategy Instruction | 0.213 | 0.205 |
| Guided Practice | -0.414 | 0.258 |
| Classroom Discourse | 0.055 | 0.264 |
| Text-based Instruction | -0.736* | 0.309 |
| Accommodations for Language Learning | 0.016 | 0.177 |
| Classroom Environment | 0.067 | 0.256 |

Number of teachers = 125
We then ran logistic regressions to compare the classroom practices of teachers in
the 1st and 3rd quartiles (N=46).2 Table 7 below shows the odds of being in the 3rd quartile
group based on a one-unit higher score on each of the PLATO elements. While a number of
the elements (Purpose, Connections to Prior Knowledge, Connections to Personal
Experience), have odds ratios above 2.0, only two elements are significant: Modeling
(4.340) and Explicit Strategy Instruction (4.698).
Table 7: Effect of PLATO Elements on the Odds Ratios Predicting Being in the 3rd Value-Added Quartile Versus the First Value-Added Quartile

| | Likelihood of Being in 3rd Quartile | Standard Error |
|---|---|---|
| Purpose | 2.271 | (2.308) |
| Intellectual Challenge | 0.914 | (0.683) |
| Representations of Content | 0.767 | (0.973) |
| Connections to Prior Knowledge | 2.392 | (1.604) |
| Connections to Personal and Cultural Experiences | 2.163 | (1.612) |
| Modeling | 4.340* | (3.026) |
| Strategy Instruction | 4.698+ | (3.829) |
| Guided Practice | 0.777 | (0.509) |
| Classroom Discourse | 1.182 | (0.840) |
| Text-Based Instruction | 0.974 | (0.535) |
| Accommodations for Language Learning | 1.445 | (1.359) |
| Classroom Environment | 1.135 | (0.831) |

Number of Teachers = 51. Results presented as odds-ratios. + p<.10, * p<.05, ** p<.01

2 We have many fewer teachers in the logistic regressions because the value-added model used for this study is a composite of two separate models. Teachers were identified in a certain quartile only if they were in that quartile in both value-added specifications. For this reason, we lost a large number of teachers when the sample was divided into quartile groups.
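The odds ratios reported above come from exponentiating logistic regression slopes. The sketch below illustrates that logic on simulated data; it is a bare-bones fit written for exposition, not the estimation procedure used in the study, and all numbers are invented.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated data: PLATO element scores on the 4-point scale and an indicator
# for top- (vs. bottom-) quartile value-added membership.
n = 200
scores = rng.uniform(1, 4, n)
p_true = 1 / (1 + np.exp(-(-2.5 + 1.0 * scores)))  # true slope = 1.0
top_quartile = (rng.random(n) < p_true).astype(float)

# Plain gradient ascent on the logistic log-likelihood
# (scores centered for numerical stability).
xc = scores - scores.mean()
b0 = b1 = 0.0
for _ in range(5000):
    p = 1 / (1 + np.exp(-(b0 + b1 * xc)))
    b0 += 0.1 * np.mean(top_quartile - p)
    b1 += 0.1 * np.mean((top_quartile - p) * xc)

# Exponentiating the slope gives the odds ratio associated with a
# one-unit higher score, the quantity reported in Table 7.
odds_ratio = np.exp(b1)
print(f"odds ratio per one-point higher PLATO score: {odds_ratio:.2f}")
```

An odds ratio above 1 means a higher score on the element raises the odds of top-quartile membership; for example, an odds ratio of 2 doubles those odds per point.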
Results for teachers with 3-10 years of experience in New York City
We then re-ran the analyses focusing on teachers with between 3 and 10 years of
teaching experience in New York City schools. Since our original study focused on
teachers with between 3-6 years of experience, and because experience may be associated
with the use of different classroom practices, we were interested in exploring differences
in a more restricted sample. Table 8 below provides the mean years of experience of ELA
teachers at the schools in our sample. It indicates that the experience profile of the
teachers in our sample varied among schools. Schools 6 and 7 have teachers with the
highest mean level of experience, while schools 1 and 4 have the lowest mean
experience. As a result, restricting the sample in this way differentially impacted the
| Accommodations for Language Learning | 5.816 | (7.122) |
| Classroom Environment | 1.172 | (1.047) |

Number of Teachers = 39. Results presented as odds-ratios. * p<0.05, ** p<0.01, *** p<0.001
Table 10: OLS regressions for teachers with 3-10 years of experience

| | Regression Coefficient | Std. Error |
|---|---|---|
| Purpose | 0.017 | (0.203) |
| Intellectual Challenge | -0.035 | (0.276) |
| Representations of Content | -0.100 | (0.165) |
| Connections-Prior Knowledge | -0.120 | (0.340) |
| Connections-Personal/Cultural Experiences | 0.116 | (0.324) |
| Models/Modeling | 0.281 | (0.332) |
| Explicit Strategy Instruction | 0.439 | (0.272) |
| Guided Practice | -0.231 | (0.309) |
| Classroom Discourse | 0.073 | (0.336) |
| Text-based Instruction | -0.636 | (0.401) |
| Accommodations for Language Learning | 0.035 | (0.233) |
| Classroom Environment | 0.078 | (0.346) |

Number of Teachers = 96
Adjusting for psychometric properties of PLATO
We then reran these analyses with adjusted scores that predicted a PLATO score
of each teacher on each item based on the following measurement model:
log(P_nijkr / P_nijk(r-1)) = B_n - D_i - C_j - F_k - G_r

This is a polytomous Rasch model that adjusts for the difficulty of the occasion and
segment of measurement as well as the severity of the rater. In the model, P_nijkr is the
probability of observing category r for teacher n encountering occasion i, segment j, and
judge k, and P_nijk(r-1) is the probability of observing category r-1. B_n is the ability of
teacher n, D_i is the difficulty of occasion i, C_j is the difficulty of segment j, F_k is the
severity of judge k, and G_r is the difficulty of being observed in category r relative to
category r-1. The scores can therefore be viewed as estimates of each teacher's underlying
skill in each instructional element, adjusted for the measurement circumstances of each
observation. A reliability analysis found that Cronbach's alpha for the 12 PLATO
items used in conjunction increased from .72 to .85 when using the adjusted scores.
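A numeric sketch may clarify the adjacent-category structure of this model. All parameter values below are invented; the category probabilities are built by cumulating the adjacent log-odds theta - G_r, where theta = B_n - D_i - C_j - F_k.

```python
import numpy as np

# Invented parameters for one teacher/occasion/segment/rater combination.
B = 1.2                       # teacher ability
D, C, F = 0.3, -0.1, 0.4      # occasion difficulty, segment difficulty, rater severity
G = np.array([-0.5, 0.2, 0.9])  # step difficulties G_r for categories 2, 3, 4

theta = B - D - C - F         # adjusted "location" for this observation

# In an adjacent-category model, log(P_r / P_{r-1}) = theta - G_r, so the
# unnormalized log-probability of category r is the cumulative sum of the
# steps, with category 1 as the baseline.
steps = theta - G
log_unnorm = np.concatenate([[0.0], np.cumsum(steps)])
probs = np.exp(log_unnorm) / np.exp(log_unnorm).sum()

print("P(category 1..4) =", np.round(probs, 3))
```

Raising the rater severity F, holding everything else fixed, shifts probability mass toward lower categories, which is exactly the adjustment the model makes when converting raw scores to teacher skill estimates.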
Table 11 shows the relationship between PLATO elements and teachers’ value-
added scores, once we adjust for the various sources of error.
Table 11: Comparison of Logistic Regression Results with Raw and Adjusted Scores, Total Sample of Teachers

| Likelihood of being in Q3 | Original Scores | Adjusted Scores |
|---|---|---|
| Purpose | 2.271 | 1.326 |
| Intellectual Challenge | 0.914 | 1.077 |
| Representations of Content | 0.767 | 1.027 |
| Connections-Prior Knowledge | 2.392 | 1.700 |
| Connections-Personal/Cultural Experiences | 2.163 | 1.925 |
| Models/Modeling | 4.340* | 2.658 |
| Explicit Strategy Instruction | 4.698 | 1.579 |
| Guided Practice | 0.777 | 0.890 |
| Classroom Discourse | 1.182 | 1.458 |
| Text-based Instruction | 0.974 | 1.068 |
| Accommodations for Language Learning | 1.445 | 1.123 |
| Classroom Environment | 1.135 | 0.959 |

Results presented as odds-ratios. * p<0.05, ** p<0.01, *** p<0.001
Table 12: Comparison of Logistic Regression Results with Raw and Adjusted Scores, Teachers with 3-10 Years of Experience

| Likelihood of being in Q3 | Original Scores | Adjusted Scores |
|---|---|---|
| Purpose | 3.791 | 1.546 |
| Intellectual Challenge | 1.182 | 1.204 |
| Representations of Content | 0.993 | 1.101 |
| Connections-Prior Knowledge | 2.727 | 1.789 |
| Connections-Personal/Cultural Experiences | 3.488 | 3.312 |
| Models/Modeling | 17.430** | 6.196* |
| Explicit Strategy Instruction | 12.414* | 4.993* |
| Guided Practice | 1.544 | 1.174 |
| Classroom Discourse | 0.620 | 1.006 |
| Text-based Instruction | 0.983 | 0.892 |
| Accommodations for Language Learning | 5.816 | 2.267 |
| Classroom Environment | 1.172 | 0.961 |

Results presented as odds-ratios. * p<0.05, ** p<0.01, *** p<0.001
These results suggest that while adjusting the scores based on our g-study generally
reduced the odds ratios for many of the elements, it also raised the odds ratios for
Classroom Discourse and Representation of Content in both the full and restricted
samples.
In general, the reason unadjusted and adjusted scores differentially predict the
likelihood of being in the third quartile is that the raw scores are systematically affected by
measurement features in the data collection process. For example, it could be that the
teachers with the highest unadjusted scores were more likely to be rated by the most
lenient raters. As a result, their scores are spuriously high compared to the rest of the
teachers. Alternatively, it could be that the teachers with the higher scores were observed
on fewer occasions and on fewer segments than teachers with lower scores. Once again
the result would be a spuriously high likelihood of being in the third quartile. In order to
verify these and other potential explanations, the measurement characteristics of the
teachers most likely to be in the third quartile must be tracked to identify potential
predictors of systematic differences.
Discussion
These results provide a mixed picture of the relationship between teachers’
classroom practices, as measured by PLATO, and teachers’ value-added scores. When
we use logistic regression, two of the elements, Modeling and Explicit Strategy
Instruction, seem to distinguish between teachers in the 1st and 3rd value-added quartiles.
This relationship seems to hold up across a number of different analyses, and is
particularly strong among teachers with 3-10 years of experience. It is easy to understand
why “Strategy Use and Instruction,” explicitly teaching students how to successfully
complete academic tasks, and “Modeling,” demonstrating or enacting the processes in
which students will engage, might be considered “high-leverage practices” for
achievement gains on standardized English Language Arts assessments. At the highest
level of Modeling, the teacher provides a specific, concrete image of what student work,
including process or intellectual work, can and should look like and decomposes the
process, highlighting specific features for students to replicate. The opportunity to watch
an “expert,” the teacher, engage in the same activity as students helps make visible how a
more experienced reader or writer approaches the task. This, in turn, might help students
better understand the processes being modeled and help them become more persistent and
flexible when they approach novel tasks such as those they face on standardized
assessments, the primary tool for determining a teacher’s value-added. Moreover, to be
successful on such assessments, students must consistently employ strategies to interpret
literary text, make a compelling argument, or analyze grammatical errors. It is the
flexibility of strategies that makes them critical for success on a range of ELA tasks.
When students understand when and how to use specific strategies, as well as why they
are useful, they are better able to attack less familiar tasks or material. During our first
year of data collection, we found the vast majority of teachers provided students with
directions for completing activities, but they did not instruct them on the nuances of how
to complete those activities effectively. In literature circles, students were often told to
analyze a character’s actions or determine the meaning of unknown words without any
discussion of the strategies that would enable them to do so. Similarly, teachers
highlighted the features of cinquains or editorials but did not teach students how they
might approach different types of writing based on those features. Thus the goal of many
lessons was completion of the specific task rather than mastering a more broadly
applicable skill. Those teachers who actually taught students strategies were nearly
always in the highest value-added quartile.
However, other elements do not show a clear relationship to teachers’ value-
added scores. While in some of these analyses, most relationships between elements and
value-added scores are positive, few of the other relationships are significant. In the
remainder of this paper, we explore a number of hypotheses related to our findings.
Measuring well: Measuring what matters
One hypothesis is that these particular classroom practices are not associated with
teachers’ impact on student achievement in ELA. While certainly possible, this
hypothesis seems unlikely to be true for all elements, given the previous research on the
effects of a teacher’s time on task on student achievement (Denham & Lieberman, 1980)
and on the relationship between the cognitive demand of classrooms and student learning
(Newmann, Lopez, & Bryk, 1998).
A second hypothesis is that we are not measuring these practices well. There are
a number of reasons to question how well we measured these instructional
elements. We had just significantly revised the PLATO instrument from the version used
in our earlier study, and the revised instrument may not have adequately captured the
qualities of classroom practice we intended to capture. We know, for example, that some
of the measures, such as Representation of Content, shifted their meaning during data
collection. Raters began to use the score of 3 as a “default score,” so we may not have
measured well the quality of teachers’ representations of content during these
observations. Secondly, we shifted from a 7-point scale to a 4-point scale; in this
version of PLATO, we may not have done a good job of distinguishing among the
various score points. In addition, some of these elements were new to this version of
PLATO, such as Text-Based Instruction and Classroom Environment. So it is possible
that while these practices are indeed components of effective instruction, our instrument
did not do a good job of capturing them during this round of data collection.
Another potential problem with PLATO as a measurement tool is that some of our
elements measure aspects of instruction that are always present in some form, such as
intellectual challenge, purpose, representation of content, and classroom environment,
while other elements normally occur only at discrete points in the lesson, such as strategy
instruction, guided practice, etc. The g-study finding that greater variation occurs across
segments for the same teacher suggests that an accurate measure of classroom practice
would require multiple observations. In the first year of
data collection using PLATO we observed six days of instruction, while in this second
year, we observed only three days of instruction. This may have meant that we missed
low-incidence practices among teachers who do, in fact, use these practices.
There may also have been problematic scoring by raters during this year of
classroom observations. In the first year of the study, most raters had expertise in the
instrument; in fact, many of the raters were also the instrument developers. In contrast,
during year 2, none of the developers regularly observed in classrooms. While our g-
study indicates that relatively little of the error is attributable to raters, it is possible that
raters agreed with each other, but may have scored differently from the way the first
group of raters scored. Because we do not have videos from these observations, we are
unable to check this hypothesis. Raters might have been consistent with each other, but
they might not have been consistent with the developers of PLATO. The one rater who
had the most experience in PLATO actually scored differently from other raters during
Year 2, lending some support to this hypothesis. In looking at the school-level data,
however, we can spot some effects of the school context on rater behavior. In schools
where there were few, or no, 3rd and 4th quartile teachers, scores may have drifted
upward, as raters re-calibrated to what was typical practice in a school. Without
consistent examples of what PLATO would consider high-quality teaching, raters may
have redefined the scale somewhat.
While PLATO may not have done a good job measuring the quality of teacher
practice, it is also possible that the standardized tests in ELA do not do a good job of
measuring some aspects of student outcomes. For example, the quality of classroom
discourse may be important in developing students’ reasoning abilities and conceptual
understanding of literature and writing, but these abilities may not be measured well by
the tests that were used to construct value-added scores. Few would argue that students
need not develop their ability to engage in productive academic discourse, but the skills
learned in these discussions may not be captured on state assessments. In this case, the
observation scores might provide better measures of this particular outcome than value-
added scores. In addition to developing multiple measures of instruction, we need to
develop multiple measures of student outcomes to ensure that classroom instruction
supports the development of a broad range of learning outcomes for students.
The model matters
One of the striking findings relates to the difference in the results depending upon
whether we used OLS or logistic regression. Again, there might be several hypotheses
for why there are negative correlations between PLATO and value-added coefficients
under OLS and positive relationships using logistic regression, particularly in the
restricted sample. One hypothesis might be that OLS may not be the best approach to
looking at relationships between practices and teachers’ value-added scores. In
particular, OLS assumes that there is a normal distribution of errors, which is highly
unlikely given the skewed distribution of both our dependent and independent variables.
As such, the sample is heavily weighted towards teachers with lower value-added
coefficients, so the few observations in the 4th quartile have an inappropriately large
influence on the slope of the fitted line. OLS is also extremely sensitive to outliers; a few
outliers in our sample—a few 4th quartile teachers who scored lower on some PLATO
elements or 1st quartile teachers who scored higher—would affect the overall results. In
contrast, logistic regression creates two groups (“more and less effective” teachers)
defined by set parameters, in this case percentile cut-off points, which eliminates the
potential for outlier effects and does not require normally distributed data.
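The outlier-sensitivity point can be illustrated with a toy simulation (all numbers invented): a single high-leverage teacher, here a high value-added score paired with a low PLATO score, can noticeably move an OLS slope fitted to data that otherwise contain no relationship.

```python
import numpy as np

rng = np.random.default_rng(2)

def ols_slope(x, y):
    # Least-squares slope of y on x.
    xd, yd = x - x.mean(), y - y.mean()
    return (xd * yd).sum() / (xd ** 2).sum()

# Simulated sample: PLATO scores and value-added coefficients with no
# true relationship between them.
plato = rng.uniform(1.5, 3.0, 40)
va = rng.normal(0, 0.2, 40)
slope_clean = ols_slope(plato, va)

# Add a single outlier: a high value-added teacher with a low PLATO score.
plato_out = np.append(plato, 1.0)
va_out = np.append(va, 1.5)
slope_outlier = ols_slope(plato_out, va_out)

print(f"slope without outlier: {slope_clean:+.3f}")
print(f"slope with one outlier: {slope_outlier:+.3f}")
```

A quartile-based comparison of group means is far less affected by the same point, which is one reason the logistic specification may behave differently here.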
We also know that value-added estimates for individual teachers tend to fluctuate
significantly from year to year (McCaffrey, Sass, Lockwood, & Mihaly, 2009),
particularly those in the middle of the distribution of value-added scores. So including the
full range of teachers might mean that some teachers who are in our 2nd and 3rd quartiles
are misidentified and are really more or less effective at raising achievement than these
particular VA scores might suggest. Because of this, small distinctions between teachers
with similar value added scores may be less meaningful. As a result, treating value-added
coefficients as a continuous variable is problematic. Blunter categories, such as quartiles
used in a logistic regression, may make more sense.
Another possibility for the lack of strong relationships between VA scores and
scores on PLATO could be the influence of school context. As illustrated by the work of
Julie Cohen and Michelle Brown, certain classroom practices may be associated with
particular school contexts, which makes it more difficult to distinguish among teachers in
the same context. While value-added models control for school characteristics, they do
not address what economists call “unobservables,” which could include the school
culture, collegial interaction, curriculum, and instructional coherence.3 Schools that are
higher-functioning may have higher quality instruction overall.
Implications: Leveraging measurement for improvement
Our experience designing PLATO suggests that it is not easy work developing a
systematic observation system and that it is even more difficult to train others to use such
a tool reliably.

3 Models with school fixed effects are designed to address these issues but also limit comparisons to teachers within the same school.

Given the challenges, we should be somewhat wary of the Race to the
Top mandate that districts develop teacher evaluation systems that include multiple
indicators, including classroom observations. Such systems will take time and expertise
to develop.
Despite these various issues, continuing to develop better measures of both
learning and classroom practice is work worth doing. Classroom observation systems
offer the potential to both measure and improve the quality of teaching. Part of the
challenge involved in improving the quality of teaching in our nation’s schools is the lack
of valid and reliable measures for assessing teaching effectiveness or tools for targeting
specific features of instruction (Gitomer, 2009). Without such tools, it is nearly
impossible to identify effective classroom practices and support teachers’ growth in
classroom instruction. Identifying classroom practices associated with high student
achievement gains and then targeting these practices in professional development
provides a potentially powerful approach for improving the quality of instruction for all
students. Value-added measures may be able to distinguish between teachers who have
differential impacts on student achievement scores, but they tell us nothing about the
mechanisms through which teachers achieve this impact.
Among the middle school ELA classrooms we studied, the practices of modeling
and strategy instruction seem to be strong predictors of teachers’ effectiveness as
measured by value-added. The good news about this finding is that these are practices
that teachers can develop, providing a lever for instructional improvement. The fact that
they occur so infrequently also makes them an easy target for reform. Helping teachers
integrate more strategy instruction and modeling into their ELA lessons would provide
much more support for students in the areas of reading and writing. If these findings
continue to hold up, a next step would be to leverage these findings through targeted
professional development in these practices. We believe that the new tools and
technology being developed now have tremendous potential for raising the floor on
classroom practice.