Of Cabbages and Kings: Classroom Observations & Value-Added Measures

Julie Cohen & Pam Grossman
Stanford University

March 30, 2011

Paper to be presented at the Annual Meeting of AERA, April 2011. We would like to thank the Carnegie Corporation, W.T. Grant Foundation, and the Spencer Foundation for funding this work.
Introduction and research questions
Teachers and teacher quality are the focus of current discussions about
educational improvement. A number of studies suggest teachers represent one of the most
important factors affecting student achievement (cf. Rivkin, Hanushek, & Kain, 2005;
Rockoff, 2004). This recognition has led to policies, codified into the Race to the Top
legislation, promoting more rigorous evaluation of teachers. There is tremendous
enthusiasm among policy-makers about the use of value-added methodologies to assess
teacher effectiveness, including using such measures instead of years of experience or
education coursework to make consequential tenure decisions. This enthusiasm is
tempered by many researchers’ skepticism that such measures can be used to evaluate
individual teachers (cf. Raudenbush, 2004; Rothstein, 2009; McCaffrey et al., 2004). To
address these concerns, researchers advocate using multiple measures, including
structured observations, to capture features of teaching (Gitomer, 2009; Goe, Bell, &
Little, 2008). This allows researchers and practitioners alike to understand the process,
the work that teachers do in classrooms, that is associated with outcomes such as student
achievement gains. However, relatively few studies have tried to go into the classrooms
of teachers identified as more or less effective to try to understand the relationship
between value-added measures and classroom instruction (cf. Grossman et al., 2009;
Kane, Taylor, Tyler, & Wooten, 2010). While many might assume a straightforward
relationship between teaching practices and student achievement gains, the nature and
strength of that relationship may largely depend upon how both classroom practice and
student learning are conceptualized and measured.
In this paper, we focus on a variety of issues related to measuring classroom
practice, using both value-added measures and one subject-specific observation
instrument, the Protocol for Language Arts Teaching Observation or PLATO, as the basis
for this exploration. Our research questions are:
• How do classroom practices of more effective teachers differ from those of less effective teachers? Are there consistent patterns both within and across schools?
• Are value-added measures identifying teachers who score higher on measures of classroom practice?
Background
Even as research has begun to document that teachers matter, there is less
certainty about what attributes of teachers actually make the most difference in raising
student achievement. Some studies point to teacher attributes, including certification, the
selectivity of teachers’ undergraduate institutions, and scores on tests of general
knowledge and verbal ability, as factors related to student achievement gains (see Rice,
2003 for a review). Other work indicates that programmatic differences in teacher
preparation and other measured characteristics may account for only a limited portion of
the variation in student achievement among teachers, particularly in English/Language
level, yet we believe that middle school is a consequential time in students’ academic
lives (Carnegie Corporation, 1989).
In order to look at the quality of instruction provided in English/Language Arts,
we need measures of teaching that can be used across multiple settings and schools.
Structured observation protocols direct observers to focus on specific facets of
instruction, and provide a common technical vocabulary for describing those facets.
Consistent language for describing teaching is rare (Grossman & McDonald, 2008), yet
such a shared vocabulary allows for comparisons of practice across classrooms. Some researchers have focused on
more “generic” features of classroom practice that cut across grade levels and subject
areas, such as behavior management and instructional planning and reflection (Danielson,
2006; Pianta, LaParo, Stuhlman, 2004). A few protocols have targeted specific content
areas, particularly mathematics (Hill, 2005) and elementary reading (Taylor, Pearson,
Peterson, & Rodriguez, 2005; Hoffman, Sailors, & Duffy, 2004). None of the existing
observation protocols, however, provides a way to measure the quality of ELA classroom
practice across the multiple domains of ELA, particularly at the secondary level.
The paucity of discipline-based observation approaches has been a persistent
problem in efforts to develop assessments of teaching (Kennedy, in press). Indeed, as
Kennedy argues, “until recently, assessments have not attended to the intellectual
substance of teaching: to the content actually presented, how that content is represented,
and whether and how students are engaged with it…Documenting the intellectual
meaning of teaching events remains the elusive final frontier in performance assessment”
(p. 21). To that end, the PLATO instrument builds on existing observation tools and
research on effective teaching practices in ELA in an attempt to parse the different facets
of teaching events in secondary ELA classrooms.
Research Design and Methods

Sample
We first identified 45 middle schools (6th-8th grade) in New York City that are
similar in terms of student demographics (more than 70% minority students, more than
50% of students qualifying for free and reduced lunch). To maximize potential observation
time, we identified a subset of 37 schools that were geographically clustered and had at
least 10 ELA teachers. We then contacted principals to request participation in the study.
After identifying willing schools, we worked with the school’s literacy coach or another
administrator to recruit teachers. We observed in all schools in which the majority of
teachers were willing to participate. Though we did not sample based on value-added
models as we had in previous rounds of data collection (see Grossman et al., 2009), we
assumed that if the majority of teachers in a school participated we would have a range of
levels of “effectiveness” within the sample (see Table 1 for background characteristics of
teachers in this sample).1

1 There were a number of instances in which teachers were eliminated from the study. For example, a few teachers designated as teaching ELA actually only taught special education classes or classes in which Spanish was the language of instruction. We did not observe in their classrooms.
Table 1: Background Characteristics of Teachers in the Sample

| Variable | 2008-9 Sample (N=179)* | 2008-9 Teachers without value-added coefficients used in analyses (N=54)* | 2008-9 Teachers with value-added coefficients used in analyses (N=125)* | New York City Middle School Teachers (N=3,777) |
|---|---|---|---|---|
| College Recommended (%) | 48.8 | 53.49 | 47.15 | 47.3 |
| New York City Teaching Fellows (%) | 16.3 | 18.6 | 15.45 | 24.3 |
| Teach For America (%) | 1.2 | 4.65 | 0 | 8.7 |
| Individual Evaluation (%) | 16.3 | 11.63 | 17.89 | 9 |
| Temporary License (%) | 13.3 | 6.98 | 14.45 | 0.9 |
| Female (%) | 83.05 | 80.77 | 84 | 83.1 |
| White (%) | 77.51 | 88.46 | 72.65 | 69.5 |
| Black (%) | 11.24 | 5.77 | 13.68 | 14.2 |
| Hispanic (%) | 5.92 | 3.84 | 6.84 | 9.7 |
| Other ethnicity (%) | 5.33 | 1.92 | 6.84 | 6.5 |
| Age | 36.67 | 32.07 (10.65) | 38.58 (10.65) | 32 |
| Years of Experience | 7.17 | 4.29 (5.33) | 8.368 (4.93) | |
| SAT Math (diff. N) | 500.44 | 526.36 (90.48) | 492.06 (100.02) | 495 |
| SAT Verbal (diff. N) | 526.22 | 505 (114.22) | 533.09 (88.83) | 509 |
| LAST score first time (diff. N) | 257.013 (24) | 264.32 (18.9) | 254.24 (25.24) | 258 (21.8) |

Our final sample of 13 schools included sites located in all of the boroughs of
New York City except Staten Island. Six of those schools were located in Queens,
five were in Brooklyn, one was in the Bronx, and one was in Manhattan. These schools
represented a large range in school size: at the smallest school, there were only 10 ELA
teachers, and at the largest, there were 38 (see Table 2 for demographics of the
schools in the study). Final participation rates also varied, from 45% of ELA teachers
in a school participating in the study to 90%. As can be seen in Table 1, the teachers in
our sample do not differ significantly from other ELA teachers in NYC public schools in
terms of gender or scores on the LAST, the Liberal Arts and Sciences Test required of
teachers in New York. Our sample differs somewhat in terms of pathway from the larger
population of middle school ELA teachers in NYC: we have fewer teachers in our sample
who entered teaching through Teach For America or the NYC Teaching Fellows. Perhaps
most surprising, we have a much larger percentage of teachers who entered teaching on a
Temporary License.
Table 2: Information on Middle Schools Included in the Study

| School Code | Environment Grade 2008-9 | Student Performance Score 2008-9 (out of 25) | Student Progress Score 2008-9 (out of 60) | # of ELA Teachers Total | ELA Teachers Participating in Study | Participating Teachers for whom we can calculate a value-added coefficient | Free and Reduced Lunch | English Language Learners | Black |
|---|---|---|---|---|---|---|---|---|---|
| 1 | A | 20.80 | 43.70 | 31 | 45.2% | 71.4% | 93% | 36% | 7% |
| 2 | A | 20.30 | 41.80 | 25 | 72.0% | 72.2% | 89% | 23% | 3% |
| 3 | B | 18.20 | 44.60 | 20 | 90.0% | 61.1% | 91% | 18% | 15% |
| 4 | B | 20.90 | 44.80 | 27 | 59.3% | 68.7% | 88% | 11% | 3% |
| 5 | A | 19.50 | 29.80 | 23 | 60.9% | 78.6% | 92% | 65% | 6% |
| 6 | A | 21.90 | 38.90 | 16 | 50.0% | 75.0% | 94% | 5% | 91% |
| 7 | D | 17.40 | 42.00 | 25 | 80.0% | 70.0% | 58% | 4% | 74% |
| 8 | C | 15.40 | 37.90 | 10 | 50.0% | 100.0% | 76% | 13% | 18% |
| 9 | C | 21.70 | 43.10 | 38 | 57.9% | 90.9% | 78% | 12% | 7% |
| 10 | D | 18.60 | 33.70 | 12 | 58.3% | 42.9% | 90% | 4% | 80% |
| 11 | A | 18.10 | 44.30 | 20 | 50.0% | 80.0% | 94% | 30% | 7% |
| 12 | A | 22.7 | 50.9 | 16 | 81.3% | 84.6% | 83% | 8% | 5% |
| 13 | B | 22.4 | 44.8 | 17 | 76.5% | 69.2% | 82% | 16% | 5% |
Raters
We worked with EDC to recruit potential raters for this study. Members of the
PLATO team trained 12 new raters in New York City with both an ELA background and
experience teaching in the middle grades. The two-day training was face-to-face and
focused on all of the PLATO elements. We provided numerous opportunities for
potential raters to score video clips and receive feedback on their scores. By the end of
the training and with some follow-up with individual raters, 12 of 14 potential raters
achieved 80% reliability, with exact score matches on at least five videos of ELA
instruction that had been master coded by several members of the PLATO team (for more
details about the raters, their use of the instrument and overall reliability, see Cor, 2011).
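As a minimal sketch of the exact-match agreement criterion used in rater training, consider the computation below; the scores are invented for illustration, not actual PLATO codes.

```python
import numpy as np

# Invented example: a trainee's ratings of ten 15-minute segments compared
# against the master codes assigned by the PLATO team.
master = np.array([3, 2, 4, 1, 3, 2, 2, 4, 3, 1])
trainee = np.array([3, 2, 4, 2, 3, 2, 1, 4, 3, 1])

# Exact-match agreement: the share of segments scored identically.
agreement = np.mean(master == trainee)
print(f"exact agreement: {agreement:.0%}")  # prints "exact agreement: 80%"
```

A trainee at or above the 80% threshold on the master-coded videos would count as reliable under the criterion described above.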
Revision of PLATO rubric
The PLATO protocol used in this study included twelve elements of instruction
highlighted in existing literature on adolescent literacy and effective instruction in
secondary ELA: purpose, intellectual challenge, representations of content, connections
to prior knowledge, connections to personal and cultural experiences, modeling, strategy
use and instruction, guided practice, classroom discourse, text-based instruction,
accommodations for language learning, and classroom environment (for details on the
development of these elements and the background literature on each, see Grossman et al.,
2009). These
elements were refined based on analysis of previous rounds of data collection. Text-based
instruction was added to assess how regularly and effectively a teacher and students
reference texts and use those textual references to meet the goals of ELA instruction. The
classroom environment element looks at both time and behavior management to assess
the teacher’s efficient organization of classroom routines and materials to ensure that
instructional time is maximized, and the degree to which student behavior facilitates
academic work. A factor analysis of a previous round of data collection suggests
three underlying factors: disciplinary demand and representation of content, instructional
scaffolding, and classroom environment.
The first version of PLATO used a seven-point scale, in which reliability was
achieved based on 80% agreement with exact and adjacent score matching. Based on
feedback from experts (Gitomer, personal communication), we switched to a 4-point
scale. Each element includes a rubric that details how to score instruction on a scale from
one (lowest) to four (highest). In addition, PLATO captures the content of instruction
(writing, reading, literature, grammar) and activity structures (whole group, small group,
independent work). PLATO is designed for use over multiple segments of instruction in
each lesson; each observation cycle captures 15 minutes of instruction with five minutes
for scoring. Observations focus on the classroom experience of the “average” student,
and try to weigh the balance of evidence across a fifteen-minute segment.
Observation Process
Teachers were observed on three separate days of instruction for at least two class
periods per day in two waves of observations. The number of observation cycles varied
depending on the length of class periods, but on average, teachers were observed for
twelve PLATO cycles. Neither observers nor teachers knew the teachers’ value-added
coefficient or quartile during observations. To ensure consistency among raters, 15% of
observations were double-coded. The study was designed to be able to conduct a
generalizability study, so teachers were observed by multiple raters, and raters were
assigned to multiple schools. (For results of the g-study, see Cor, 2011).
Value-added models
To calculate teachers’ value-added scores, we chose to run a number of different
specifications. The base model used to estimate teacher effects is summarized by Equation 1.
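As a rough sketch of how teacher effects can be estimated in a covariate-adjustment value-added model, the simulation below regresses students' current scores on prior scores plus teacher indicators; everything here (data, effect sizes, variable names) is invented for illustration and is not the study's actual specification.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 5 teachers, 40 students each. The estimated teacher
# coefficients are the "value-added" effects.
n_teachers, students_per_teacher = 5, 40
true_effects = rng.normal(0, 0.3, n_teachers)

prior = rng.normal(0, 1, n_teachers * students_per_teacher)
teacher = np.repeat(np.arange(n_teachers), students_per_teacher)
current = 0.7 * prior + true_effects[teacher] + rng.normal(0, 0.5, prior.size)

# Design matrix: prior score plus one dummy column per teacher (no intercept,
# so each teacher coefficient is that teacher's effect directly).
dummies = (teacher[:, None] == np.arange(n_teachers)).astype(float)
X = np.column_stack([prior, dummies])
coef, *_ = np.linalg.lstsq(X, current, rcond=None)

persistence, teacher_effects = coef[0], coef[1:]
print("estimated teacher effects:", np.round(teacher_effects, 2))
```

Actual value-added specifications typically add student and classroom covariates and shrink the estimates; this stripped-down version only shows the basic regression logic.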
| | Regression Coefficient | Std. Error |
|---|---|---|
| Representations of Content | -0.081 | 0.135 |
| Connections-Prior Knowledge | -0.028 | 0.265 |
| Connections-Personal/Cultural Experiences | -0.175 | 0.229 |
| Models/Modeling | -0.112 | 0.245 |
| Explicit Strategy Instruction | 0.213 | 0.205 |
| Guided Practice | -0.414 | 0.258 |
| Classroom Discourse | 0.055 | 0.264 |
| Text-based Instruction | -0.736* | 0.309 |
| Accommodations for Language Learning | 0.016 | 0.177 |
| Classroom Environment | 0.067 | 0.256 |

Number of teachers = 125
We then ran logistic regressions to compare the classroom practices of teachers in
the 1st and 3rd quartiles (N=46).2 Table 7 below shows the odds of being in the 3rd quartile
group based on a one-unit higher score on each of the PLATO elements. While a number of
the elements (Purpose, Connections to Prior Knowledge, Connections to Personal
Experience), have odds ratios above 2.0, only two elements are significant: Modeling
(4.340) and Explicit Strategy Instruction (4.698).
Table 7: Effect of PLATO Elements on the Odds Ratios Predicting Being in the 3rd Value-Added Quartile Versus the First Value-Added Quartile

| | Likelihood of Being in 3rd Quartile | Standard Error |
|---|---|---|
| Purpose | 2.271 | (2.308) |
| Intellectual Challenge | 0.914 | (0.683) |
| Representations of Content | 0.767 | (0.973) |
| Connections to Prior Knowledge | 2.392 | (1.604) |
| Connections to Personal and Cultural Experiences | 2.163 | (1.612) |
| Modeling | 4.340* | (3.026) |
| Strategy Instruction | 4.698+ | (3.829) |
| Guided Practice | 0.777 | (0.509) |
| Classroom Discourse | 1.182 | (0.840) |
| Text-Based Instruction | 0.974 | (0.535) |
| Accommodations for Language Learning | 1.445 | (1.359) |
| Classroom Environment | 1.135 | (0.831) |

Number of Teachers = 51. Results presented as odds-ratios. + p<.10, * p<.05, ** p<.01

2 We have many fewer teachers in the logistic regressions because the value-added model used for this study is a composite of two separate models. Teachers were identified in a certain quartile only if they were in that quartile in both value-added specifications. For this reason, we lost a large number of teachers when the sample was divided into quartile groups.
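The odds ratios reported above come from exponentiating logistic regression slopes. The sketch below illustrates that logic on simulated data; it is a bare-bones fit written for exposition, not the estimation procedure used in the study, and all numbers are invented.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated data: PLATO element scores on the 4-point scale and an indicator
# for top- (vs. bottom-) quartile value-added membership.
n = 200
scores = rng.uniform(1, 4, n)
p_true = 1 / (1 + np.exp(-(-2.5 + 1.0 * scores)))  # true slope = 1.0
top_quartile = (rng.random(n) < p_true).astype(float)

# Plain gradient ascent on the logistic log-likelihood
# (scores centered for numerical stability).
xc = scores - scores.mean()
b0 = b1 = 0.0
for _ in range(5000):
    p = 1 / (1 + np.exp(-(b0 + b1 * xc)))
    b0 += 0.1 * np.mean(top_quartile - p)
    b1 += 0.1 * np.mean((top_quartile - p) * xc)

# Exponentiating the slope gives the odds ratio associated with a
# one-unit higher score, the quantity reported in Table 7.
odds_ratio = np.exp(b1)
print(f"odds ratio per one-point higher PLATO score: {odds_ratio:.2f}")
```

An odds ratio above 1 means a higher score on the element raises the odds of top-quartile membership; for example, an odds ratio of 2 doubles those odds per point.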
Results for teachers with 3-10 years of experience in New York City
We then re-ran the analyses focusing on teachers with between 3 and 10 years of
teaching experience in New York City schools. Since our original study focused on
teachers with between 3-6 years of experience, and because experience may be associated
with the use of different classroom practices, we were interested in exploring differences
in a more restricted sample. Table 8 below provides the mean years of experience of ELA
teachers at the schools in our sample. It indicates that the experience profile of the
teachers in our sample varied among schools. Schools 6 and 7 have teachers with the
highest mean level of experience, while schools 1 and 4 have the lowest mean
experience. As a result, restricting the sample in this way differentially impacted the
| Accommodations for Language Learning | 5.816 | (7.122) |
| Classroom Environment | 1.172 | (1.047) |

Number of Teachers = 39. Results presented as odds-ratios. * p<0.05, ** p<0.01, *** p<0.001
Table 10: OLS regressions for teachers with 3-10 years of experience

| | Regression Coefficient | Std. Error |
|---|---|---|
| Purpose | 0.017 | (0.203) |
| Intellectual Challenge | -0.035 | (0.276) |
| Representations of Content | -0.100 | (0.165) |
| Connections-Prior Knowledge | -0.120 | (0.340) |
| Connections-Personal/Cultural Experiences | 0.116 | (0.324) |
| Models/Modeling | 0.281 | (0.332) |
| Explicit Strategy Instruction | 0.439 | (0.272) |
| Guided Practice | -0.231 | (0.309) |
| Classroom Discourse | 0.073 | (0.336) |
| Text-based Instruction | -0.636 | (0.401) |
| Accommodations for Language Learning | 0.035 | (0.233) |
| Classroom Environment | 0.078 | (0.346) |

Number of Teachers = 96
Adjusting for psychometric properties of PLATO
We then reran these analyses with adjusted scores that predicted a PLATO score
of each teacher on each item based on the following measurement model:
log(P_nijkr / P_nijk(r-1)) = B_n - D_i - C_j - F_k - G_r

This is a polytomous Rasch model that adjusts for the difficulty of the occasion and
segment of measurement as well as the severity of the rater. In the model, P_nijkr is the
probability of observing category r for teacher n encountering occasion i, segment j, and
judge k, and P_nijk(r-1) is the probability of observing category r-1. B_n is the ability of
teacher n, D_i is the difficulty of occasion i, C_j is the difficulty of segment j, F_k is the
severity of judge k, and G_r is the difficulty of being observed in category r relative to
category r-1. The scores can therefore be viewed as estimates of each teacher's underlying
skill in each instructional element, adjusted for the measurement circumstances of each
observation. A reliability analysis found that Cronbach's alpha for the 12 PLATO
items used in conjunction increased from .72 to .85 when using the adjusted scores.
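A numeric sketch may clarify the adjacent-category structure of this model. All parameter values below are invented; the category probabilities are built by cumulating the adjacent log-odds theta - G_r, where theta = B_n - D_i - C_j - F_k.

```python
import numpy as np

# Invented parameters for one teacher/occasion/segment/rater combination.
B = 1.2                       # teacher ability
D, C, F = 0.3, -0.1, 0.4      # occasion difficulty, segment difficulty, rater severity
G = np.array([-0.5, 0.2, 0.9])  # step difficulties G_r for categories 2, 3, 4

theta = B - D - C - F         # adjusted "location" for this observation

# In an adjacent-category model, log(P_r / P_{r-1}) = theta - G_r, so the
# unnormalized log-probability of category r is the cumulative sum of the
# steps, with category 1 as the baseline.
steps = theta - G
log_unnorm = np.concatenate([[0.0], np.cumsum(steps)])
probs = np.exp(log_unnorm) / np.exp(log_unnorm).sum()

print("P(category 1..4) =", np.round(probs, 3))
```

Raising the rater severity F, holding everything else fixed, shifts probability mass toward lower categories, which is exactly the adjustment the model makes when converting raw scores to teacher skill estimates.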
Table 11 shows the relationship between PLATO elements and teachers’ value-
added scores, once we adjust for the various sources of error.
Table 11: Comparison of Logistic Regression Results with Raw and Adjusted Scores, Total Sample of Teachers

| Likelihood of being in Q3 | Original Scores | Adjusted Scores |
|---|---|---|
| Purpose | 2.271 | 1.326 |
| Intellectual Challenge | 0.914 | 1.077 |
| Representations of Content | 0.767 | 1.027 |
| Connections-Prior Knowledge | 2.392 | 1.700 |
| Connections-Personal/Cultural Experiences | 2.163 | 1.925 |
| Models/Modeling | 4.340* | 2.658 |
| Explicit Strategy Instruction | 4.698 | 1.579 |
| Guided Practice | 0.777 | 0.890 |
| Classroom Discourse | 1.182 | 1.458 |
| Text-based Instruction | 0.974 | 1.068 |
| Accommodations for Language Learning | 1.445 | 1.123 |
| Classroom Environment | 1.135 | 0.959 |

Results presented as odds-ratios. * p<0.05, ** p<0.01, *** p<0.001
Table 12: Comparison of Logistic Regression Results with Raw and Adjusted Scores, Teachers with 3-10 Years of Experience

| Likelihood of being in Q3 | Original Scores | Adjusted Scores |
|---|---|---|
| Purpose | 3.791 | 1.546 |
| Intellectual Challenge | 1.182 | 1.204 |
| Representations of Content | 0.993 | 1.101 |
| Connections-Prior Knowledge | 2.727 | 1.789 |
| Connections-Personal/Cultural Experiences | 3.488 | 3.312 |
| Models/Modeling | 17.430** | 6.196* |
| Explicit Strategy Instruction | 12.414* | 4.993* |
| Guided Practice | 1.544 | 1.174 |
| Classroom Discourse | 0.620 | 1.006 |
| Text-based Instruction | 0.983 | 0.892 |
| Accommodations for Language Learning | 5.816 | 2.267 |
| Classroom Environment | 1.172 | 0.961 |

Results presented as odds-ratios. * p<0.05, ** p<0.01, *** p<0.001
These results suggest that while adjusting the scores based on our g-study generally
reduced the odds ratios for many of the elements, it also raised the odds ratios for
Classroom Discourse and Representation of Content in both the full and restricted
samples.
In general, the reason unadjusted and adjusted scores differentially predict the
likelihood of being in the third quartile is that the raw scores are systematically affected by
measurement features in the data collection process. For example, it could be that the
teachers with the highest unadjusted scores were more likely to be rated by the most
lenient raters. As a result, their scores are spuriously high compared to the rest of the
teachers. Alternatively, it could be that the teachers with the higher scores were observed
on fewer occasions and on fewer segments than teachers with lower scores. Once again
the result would be a spuriously high likelihood of being in the third quartile. In order to
verify these and other potential explanations, the measurement characteristics of the
teachers most likely to be in the third quartile must be tracked to identify potential
predictors of systematic differences.
Discussion
These results provide a mixed picture of the relationship between teachers’
classroom practices, as measured by PLATO, and teachers’ value-added scores. When
we use logistic regression, two of the elements, Modeling and Explicit Strategy
Instruction, seem to distinguish between teachers in the 1st and 3rd value-added quartiles.
This relationship seems to hold up across a number of different analyses, and is
particularly strong among teachers with 3-10 years of experience. It is easy to understand
why “Strategy Use and Instruction,” explicitly teaching students how to successfully
complete academic tasks, and “Modeling,” demonstrating or enacting the processes in
which students will engage, might be considered “high-leverage practices” for
achievement gains on standardized English Language Arts assessments. At the highest
level of Modeling, the teacher provides a specific, concrete image of what student work,
including process or intellectual work, can and should look like and decomposes the
process, highlighting specific features for students to replicate. The opportunity to watch
an “expert,” the teacher, engage in the same activity as students helps make visible how a
more experienced reader or writer approaches the task. This, in turn, might help students
better understand the processes being modeled and help them become more persistent and
flexible when they approach novel tasks such as those they face on standardized
assessments, the primary tool for determining a teacher’s value-added. Moreover, to be
successful on such assessments, students must consistently employ strategies to interpret
literary text, make a compelling argument, or analyze grammatical errors. It is the
flexibility of strategies that makes them critical for success on a range of ELA tasks.
When students understand when and how to use specific strategies, as well as why they
are useful, they are better able to attack less familiar tasks or material. During our first
year of data collection, we found the vast majority of teachers provided students with
directions for completing activities, but they did not instruct them on the nuances of how
to complete those activities effectively. In literature circles, students were often told to
analyze a character’s actions or determine the meaning of unknown words without any
discussion of the strategies that would enable them to do so. Similarly, teachers
highlighted the features of cinquains or editorials but did not teach students how they
might approach different types of writing based on those features. Thus the goal of many
lessons was completion of the specific task rather than mastering a more broadly
applicable skill. Those teachers who actually taught students strategies were nearly
always in the highest value-added quartile.
However, other elements do not show a clear relationship to teachers’ value-
added scores. While in some of these analyses, most relationships between elements and
value-added scores are positive, few of the other relationships are significant. In the
remainder of this paper, we explore a number of hypotheses related to our findings.
Measuring well: Measuring what matters
One hypothesis is that these particular classroom practices are not associated with
teachers’ impact on student achievement in ELA. While certainly possible, this
hypothesis seems unlikely to be true for all elements, given the previous research on the
effects of a teacher’s time on task on student achievement (Denham & Lieberman, 1980)
and on the relationship between the cognitive demand of classrooms and student learning
(Newmann, Lopez, & Bryk, 1998).
A second hypothesis is that we are not measuring these practices well. There are
a number of reasons to question how well we measured these instructional
elements. We had just significantly revised the PLATO instrument from the version used
in our earlier study, and the revised instrument may not have adequately captured the
qualities of classroom practice we intended to capture. We know, for example, that some
of the measures, such as Representation of Content, shifted their meaning during data
collection. Raters began to use the score of 3 as a “default score,” so we may not have
measured well the quality of teachers’ representations of content during these
observations. Secondly, we shifted from a 7-point scale to a 4-point scale; in this
version of PLATO, we may not have done a good job of distinguishing among the
various score points. In addition, some of these elements were new to this version of
PLATO, such as Text-Based Instruction and Classroom Environment. So it is possible
that while these practices are indeed components of effective instruction, our instrument
did not do a good job of capturing them during this round of data collection.
Another potential problem with PLATO as a measurement tool is that some of our
elements measure aspects of instruction that are always present in some form, such as
intellectual challenge, purpose, representation of content, and classroom environment,
while other elements normally occur only at discrete points in the lesson, such as strategy
instruction, guided practice, etc. The g-study finding that greater variation occurs across
segments for the same teacher suggests that an accurate measure of classroom practice
would require multiple observations. In the first year of
data collection using PLATO we observed six days of instruction, while in this second
year, we observed only three days of instruction. This may have meant that we missed
low-incidence practices among teachers who do, in fact, use these practices.
There may also have been problematic scoring by raters during this year of
classroom observations. In the first year of the study, most raters had expertise in the
instrument; in fact, many of the raters were also the instrument developers. In contrast,
during year 2, none of the developers regularly observed in classrooms. While our g-
study indicates that relatively little of the error is attributable to raters, it is possible that
raters agreed with each other, but may have scored differently from the way the first
group of raters scored. Because we do not have videos from these observations, we are
unable to check this hypothesis. Raters might have been consistent with each other, but
they might not have been consistent with the developers of PLATO. The one rater who
had the most experience in PLATO actually scored differently from other raters during
Year 2, lending some support to this hypothesis. In looking at the school-level data,
however, we can spot some effects of the school context on rater behavior. In schools
where there were few, or no, 3rd and 4th quartile teachers, scores may have drifted
upward, as raters re-calibrated to what was typical practice in a school. Without
consistent examples of what PLATO would consider high-quality teaching, raters may
have redefined the scale somewhat.
While PLATO may not have done a good job measuring the quality of teacher
practice, it is also possible that the standardized tests in ELA do not do a good job of
measuring some aspects of student outcomes. For example, the quality of classroom
discourse may be important in developing students’ reasoning abilities and conceptual
understanding of literature and writing, but these abilities may not be measured well by
the tests that were used to construct value-added scores. Few would argue that students
need not develop their ability to engage in productive academic discourse, but the skills
learned in these discussions may not be captured on state assessments. In this case, the
observation scores might provide better measures of this particular outcome than value-
added scores. In addition to developing multiple measures of instruction, we need to
develop multiple measures of student outcomes to ensure that classroom instruction
supports the development of a broad range of learning outcomes for students.
The model matters
One of the striking findings relates to the difference in the results depending upon
whether we used OLS or logistic regression. Again, there might be several hypotheses
for why there are negative correlations between PLATO and value-added coefficients
under OLS and positive relationships using logistic regression, particularly in the
restricted sample. One hypothesis might be that OLS may not be the best approach to
looking at relationships between practices and teachers’ value-added scores. In
particular, OLS assumes that there is a normal distribution of errors, which is highly
unlikely given the skewed distribution of both our dependent and independent variables.
As such, the sample is heavily weighted towards teachers with lower value-added
coefficients, so the few observations in the 4th quartile have an inappropriately large
influence on the slope of the fitted line. OLS is also extremely sensitive to outliers; a few
outliers in our sample—a few 4th quartile teachers who scored lower on some PLATO
elements or 1st quartile teachers who scored higher—would affect the overall results. In
contrast, logistic regression creates two groups (“more and less effective” teachers)
defined by set parameters, in this case percentile cut-off points, which eliminates the
potential for outlier effects and does not require normally distributed data.
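The outlier-sensitivity point can be illustrated with a toy simulation (all numbers invented): a single high-leverage teacher, here a high value-added score paired with a low PLATO score, can noticeably move an OLS slope fitted to data that otherwise contain no relationship.

```python
import numpy as np

rng = np.random.default_rng(2)

def ols_slope(x, y):
    # Least-squares slope of y on x.
    xd, yd = x - x.mean(), y - y.mean()
    return (xd * yd).sum() / (xd ** 2).sum()

# Simulated sample: PLATO scores and value-added coefficients with no
# true relationship between them.
plato = rng.uniform(1.5, 3.0, 40)
va = rng.normal(0, 0.2, 40)
slope_clean = ols_slope(plato, va)

# Add a single outlier: a high value-added teacher with a low PLATO score.
plato_out = np.append(plato, 1.0)
va_out = np.append(va, 1.5)
slope_outlier = ols_slope(plato_out, va_out)

print(f"slope without outlier: {slope_clean:+.3f}")
print(f"slope with one outlier: {slope_outlier:+.3f}")
```

A quartile-based comparison of group means is far less affected by the same point, which is one reason the logistic specification may behave differently here.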
We also know that value-added estimates for individual teachers tend to fluctuate
significantly from year to year (McCaffrey, Sass, Lockwood, & Mihaly, 2009),
particularly those in the middle of the distribution of value-added scores. So including the
full range of teachers might mean that some teachers who are in our 2nd and 3rd quartiles
are misidentified and are really more or less effective at raising achievement than these
particular VA scores might suggest. Because of this, small distinctions between teachers
with similar value added scores may be less meaningful. As a result, treating value-added
coefficients as a continuous variable is problematic. Blunter categories, such as quartiles
used in a logistic regression, may make more sense.
Another possibility for the lack of strong relationships between VA scores and
scores on PLATO could be the influence of school context. As illustrated by the work of
Julie Cohen and Michelle Brown, certain classroom practices may be associated with
particular school contexts, which makes it more difficult to distinguish among teachers in
the same context. While value-added models control for school characteristics, they do
not address what economists call “unobservables,” which could include the school
culture, collegial interaction, curriculum, and instructional coherence.3 Schools that are
higher-functioning may have higher quality instruction overall.
Implications: Leveraging measurement for improvement
Our experience designing PLATO suggests that it is not easy work developing a
systematic observation system and that it is even more difficult to train others to use such
a tool reliably.

3 Models with school fixed effects are designed to address these issues but also limit comparisons to teachers within the same school.

Given the challenges, we should be somewhat wary of the Race to the
Top mandate that districts develop teacher evaluation systems that include multiple
indicators, including classroom observations. Such systems will take time and expertise
to develop.
Despite these various issues, continuing to develop better measures of both
learning and classroom practice is work worth doing. Classroom observation systems
offer the potential to both measure and improve the quality of teaching. Part of the
challenge involved in improving the quality of teaching in our nation’s schools is the lack
of valid and reliable measures for assessing teaching effectiveness or tools for targeting
specific features of instruction (Gitomer, 2009). Without such tools, it is nearly
impossible to identify effective classroom practices and support teachers’ growth in
classroom instruction. Identifying classroom practices associated with high student
achievement gains and then targeting these practices in professional development
provides a potentially powerful approach for improving the quality of instruction for all
students. Value-added measures may be able to distinguish between teachers who have
differential impacts on student achievement scores, but they tell us nothing about the
mechanisms through which teachers achieve this impact.
Among the middle school ELA classrooms we studied, the practices of modeling
and strategy instruction seem to be strong predictors of teachers’ effectiveness as
measured by value-added. The good news about this finding is that these are practices
that teachers can develop, providing a lever for instructional improvement. The fact that
they occur so infrequently also makes them an easy target for reform. Helping teachers
integrate more strategy instruction and modeling into their ELA lessons would provide
much more support for students in the areas of reading and writing. If these findings
continue to hold up, a next step would be to leverage these findings through targeted
professional development in these practices. We believe that the new tools and
technology being developed now have tremendous potential for raising the floor on
classroom practice.