Classroom Observation Schemes: Where are the Errors?
BARRY McGAW and JAMES L. WARDROP University of Illinois at
Urbana-Champaign
MARY ANNE BUNDA The Ohio State University
Much of the confusion about the notion of reliability was cleared with the publication of Technical Recommendations for Psychological Tests and Diagnostic Techniques (American Psychological Association, 1954) and Technical Recommendations for Achievement Tests (American Educational Research Association, 1955). The subsequent publication of Standards for Educational and Psychological Tests and Manuals (American Psychological Association, 1966), while it introduced considerable changes in the conception of validity, redefined reliability in essentially the same terms.
The term "reliability" was clarified through the recognition of
its several meanings and, consequently, that " 'reliability
coefficient' is a generic term referring to various types of
evidence; each type of evidence suggesting a different meaning"
(APA, 1966, p. 25). Although the explicit definition of a series of
alternative conceptions of reliability clarified the issue it did
not resolve a number of important questions of interpretation. For
example, there is no clear guide for the selection of an
appropriate reliability coefficient nor is there any resolution of
the problem of interpreting differences in the values of the
various coefficients for the same test.
Further problems arise when consideration is given to the reliability of observation schedules. Here a further variable is introduced into the measurement situation, and inter-observer disagreement becomes an additional source of errors of measurement. This type of error is somewhat akin to that between alternate forms of a test, which is typically estimated with an equivalent forms correlation coefficient. It is a more important source of error with an observation schedule, however, since it is usually unavoidable. Whereas, with a test, it is possible in many situations to use only a single form and so avoid this source of unreliability, with an observation schedule the obvious physical limitations necessitate the use of more than one observer in all but the smallest studies.¹

¹ This is not intended to imply that the use of only one observer is desirable, for in such a study no estimate of the effect of observer bias can be obtained.
A major problem with a series of reliability indices, each of
which measures the effect of only one or two sources of
unreliability, is that there is no way of obtaining a "total"
picture of the combined effects of all relevant sources of error.
The range of sources of error with observational techniques is discussed by Medley and Mitzel (1963). At the descriptive level, with respect to classroom observation techniques, a measure can be claimed to be reliable "to the extent that the average difference between two measurements independently obtained in the same classroom is smaller than the average difference between two measurements obtained in different classrooms" (p. 250). Unreliability occurs when two measures of the same object (person, classroom) tend to differ too much, either because the behaviors being observed are too variable or because independent observers cannot agree on what is happening.
In addition to these sources, unreliability may also be due to the smallness of differences among the objects of observation on the dimensions observed. This is a less important point for test construction, where preliminary item analyses are used to reject items which do not discriminate, than it is for the development of observation schedules. The point seems seldom to have been recognized, particularly by those who have developed classroom observation schemes. It is clearly possible that an instrument for which the level of inter-judge agreement is very high may be quite unreliable in spite of the judges' consistency.
Coefficients of Observer Agreement

The inadequacy of measures of inter-observer agreement as reliability indices has been discussed (see, e.g., Medley and Mitzel, 1963; Westbury, 1967), yet these discussions appear to have had little impact. Most authors of classroom observation schemes deal with reliability exclusively in terms of observer agreement. For example, Bellack, et al. (1966), describing their technique to measure observer agreement, claim that the results "indicate a consistently high degree of reliability for all major categories of analysis: agreement ranged from 84 to 96 percent" (p. 35). They established these figures using two pairs of independent coders.
Smith, et al. (1964, 1967) used the technique previously employed by Smith and Meux (1962) to calculate the consistency with which independent judges identified and classified their units of behavior (ventures). These two estimates are essentially percent agreement measures. Flanders (1967) also specified reliability only in terms of observer agreement. The π coefficient he used is superior to simple percentage agreement measures since it estimates, instead, the extent to which chance agreement has been exceeded. A notable exception to these studies is that of Brown, Mendenhall, and Beaver (1968), in which the reliability estimate was a measure of the consistency with which teachers could be discriminated from one another.
Medley and Mitzel (1963) argued strongly against using observer
agreement indices as measures of reliability, in fact referring to
them as coefficients of observer agreement to distinguish them from
coefficients of reliability.
Weick (1968), referring to observation schemes in general,
observed that "the most common reliability measure in observational
studies is observer agreement" but, unfortunately, supported this
state of affairs by arguing that inter-observer agreement is more
important than replicability, i.e., intra-observer agreement over
occasions. While this is undoubtedly true, it misses the main
point. The establishment and maintenance of a suitably high level
of inter-judge agreement is important only insofar as it operates
as a limiting factor on reliability.
However great the problems apparently introduced by a failure to
obtain suitable inter-judge agreement, this is a situation under
fairly direct control of the experimenter. In the case of a
category system clearer definition of the categories, minimization
of overlap, and more extensive training of observers should raise
the level of agreement, while in the case of sign analysis more
precise definitions of the behaviors to be recorded should achieve
the same results.
If the objects do not differ sufficiently in the behavior observed, in comparison with their own stability over time, no level of inter-judge agreement can render the observation scheme acceptably reliable (if one insists, as is done in this discussion, on restricting the use of "reliability" to its measurement-theoretic sense).
The confusion introduced into the literature through failure to distinguish clearly the different sources of unreliability, and through over-emphasis on inter-judge agreement, has resulted from a confusion of the importance of primacy with prime importance. Inter-judge agreement is the first, but not the most important, issue to be faced.
Stability of Behavior
Variability of the object of observation is the most important source of error variance. Unless stable estimates of behavior can be obtained, inter-object variability will inevitably be swamped by intra-object variability. Thus sampling of behavior is of critical importance. Medley and Mitzel (1963, p. 268) contended that it is better to increase the number of observations than the number of observers. They cite an example of a study of teacher behavior in which increasing the number of observers on a single occasion from 1 to 12 increased reliability from .47 to .50, whereas using the 12 observers individually on 12 separate occasions increased reliability from .47 to .92. Errors due to instability of behavior were clearly greater than errors due to inter-observer disagreement in this example.
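The arithmetic behind this contrast can be reproduced with a small sketch. The variance components below are hypothetical values (chosen only so that the single-observation reliability falls near .47; they are not Medley and Mitzel's data), and the projection is the standard one for the reliability of a mean of sampled observations:

# Hypothetical variance components for a sketch of the Medley-Mitzel contrast.
# These numbers are invented for illustration; only the pattern matters.
VAR_TEACHER = 1.000    # true between-teacher variance
VAR_OCCASION = 0.989   # error from occasion-to-occasion instability
VAR_OBSERVER = 0.139   # error from observer disagreement (incl. residual)

def reliability(n_occasions: int, n_observers_per_occasion: int) -> float:
    """Reliability of a mean score: true variance / (true + averaged error).

    Occasion error is averaged over occasions; observer error is averaged
    over every independent rating that is made."""
    n_ratings = n_occasions * n_observers_per_occasion
    error = VAR_OCCASION / n_occasions + VAR_OBSERVER / n_ratings
    return VAR_TEACHER / (VAR_TEACHER + error)

print(f"1 observer, 1 occasion:             {reliability(1, 1):.2f}")   # approx. .47
print(f"12 observers, 1 occasion:           {reliability(1, 12):.2f}")  # approx. .50
print(f"1 observer on each of 12 occasions: {reliability(12, 1):.2f}")  # approx. .92

Averaging twelve observers on a single occasion reduces only the observer term, while spreading the same twelve observations over twelve occasions divides the whole error term by twelve; hence the dramatic difference in the projected coefficients.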
A critical point in the conception of reliability expounded by Medley and Mitzel is that instability of behavior over occasions (i.e., time) is due to random error in the environment, the person (object), or both. This implies that the characteristic being measured is stable in a sense that does not allow for lawful change. While this may be a reasonable assumption in relation to relatively enduring aspects of personality, it is unreasonable when other types of behavior patterns are being observed.
In the case of observations of teacher behavior, efforts to
develop indices to characterize particular teachers appear to be
misplaced unless there is some allowance made for lawful
adaptations of behavior to different situations.
One study (Flanders, et al., 1969) has attempted to examine the effect of variability in teacher behavior. In this study a series of different situations, over which teacher behavior might be expected to vary systematically, was chosen. Ten situations were included and treated as independent, though it appears that at least two dimensions underlay the situations. One dimension (six situations) involved variations in subject matter (language arts, science, etc.). The other (four situations) involved differences in activities (introducing new material, extending old material, planning, etc.). Assuming that two dimensions are involved, then, it is clear that there are actually 24 (6 × 4) possible situations rather than ten. The authors collected data only in the broad categories, without indicating that they had balanced (or controlled) one dimension while observing the other.
Twenty teachers were observed in each of the ten situations (there were some empty cells, in fact) and their teaching behavior in each situation was categorized using Flanders' Interaction Analysis Schedule. Thus ten i/d ratios were obtained for most of the teachers, fewer for those teachers not observed in all situations. The standard deviations of the i/d ratios for each teacher were calculated and taken to be measures of variability, or even "flexibility" as the authors chose to call it. This index of flexibility was then used as a dependent variable in comparing groups of teachers.
The view, implicit in the Flanders study, that changes in
behavior over situations are lawful, is premature without some
empirical evidence. An appropriate way to demonstrate such
lawfulness, or at least that the changes are systematic, would be
to assess the reliability of the observation schedule for measuring
behavior within situations. From this standpoint, stability would
be expected only over separate occasions within each situation.
Variability among situations, within teachers, would no longer be
attributed to error variance but to true variance, i.e., variation
in the measure reflecting real, systematic variation in the
person.
To attempt to assess each of the sources of error variance in a complex model such as this via inter-class correlation coefficients would be impossible. Apart from the work involved, the difficulties discussed earlier with respect to interpretations of coefficients of equivalence, stability, and observer agreement would be multiplied since, for example, there would now be a separate coefficient of stability for each situation involved. The most appropriate approach would be to use a variance components analysis such as that proposed by Medley and Mitzel (1963) or by Cronbach and his associates (Gleser, et al., 1965; Cronbach, et al., in press). In the following pages, the basic concepts of generalizability theory, developed by Cronbach, et al., will be described, and then a solution to the above problem of estimating reliability will be suggested.
Generalizability Theory

The conditions under which observations are made can be classified in various ways. For example, in the previous section, in the study of "variability," observations could be classified with respect to situations, occasions (within situations), and observers. Each of these aspects of the observations is termed a facet in the terminology of generalizability theory. Each facet is, of course, a source of variability in the observations in addition to the between-persons variability. The variation attributable to each facet can be estimated by analysis of variance procedures.
The procedure will first be illustrated with respect to a single-facet design. If observations are made on P persons in I situations, the ratings of all persons in all situations can be represented by a P × I matrix X, in which X_pi is the rating of person p in situation i. In this simple example only one observer is used for all observations. If more than one observer were used, observers would constitute a second facet.

If the I situations are exactly the same for all P persons, the example satisfies the definition of "matched data" (Cronbach, et al., 1963) and, in terms of analysis of variance, constitutes a two-way crossed design. If, as is more likely to be the case, the situations in which observations are made for different persons are different, the data would be "unmatched." In terms of analysis of variance, this would be a nested design.
The P persons and the I situations are considered to be random samples from their respective universes of persons and situations. The only assumptions or requirements made are:

1. The universe is described unambiguously, so that it is clear what situations fall within the universe.

2. Situations are experimentally independent; a person's score in situation i does not depend on whether he has or has not been previously observed under other conditions.

3. Scores X_pi are on an interval scale.² (See Cronbach, et al., 1963, p. 145.)

² It is unlikely that data derived from classroom observation schedules meet this assumption. On the other hand, while the assumption is necessary for the formal application of the variance components approach, its violation would seem to be of little consequence in practice.
For each person p, the average of the X_pi over all situations i in the universe is his universe score. That is,

μ_p = E(X_pi | p) = E_i(X_pi).

Similarly, the universe score for situation i is

μ_i = E(X_pi | i) = E_p(X_pi),

and the universe score for all persons and conditions is

μ = E_p E_i(X_pi).

From the additive model for two-way analysis of variance, with the restriction of only one observation per cell and consequent confounding of interaction and error, each person's score can be expressed as:

[1] X_pi = μ + (μ_p − μ) + (μ_i − μ) + e_pi,

where (μ_p − μ), (μ_i − μ), and e_pi are the person effect, the situation effect, and the residual, respectively.
The population variances due to each of these effects are:

σ²(μ_p) = E(μ_p − μ)²,
σ²(μ_i) = E(μ_i − μ)²,
σ²(e_pi) = E(e_pi²).

The total observed score variance for the population would be

σ²(X_pi) = E(X_pi − μ)²,

and since there are no covariances among the effects in the additive model, the population observed score variance may be expressed in terms of its components as:

[2] σ²(X_pi) = σ²(μ_p) + σ²(μ_i) + σ²(e_pi).

If a nested design (unmatched data) were used there would be no identifiable situation effects, since each person would be observed in different situations. In this case the model would reduce to

[3] X_pi = μ + (μ_p − μ) + e_pi,

where the residual e_pi now absorbs the situation effect, and the identifiable components of variance would be

[4] σ²(X_pi) = σ²(μ_p) + σ²(i, e_pi)

(see Travers, 1969).
Equation [1] may be rewritten, with the interaction ω_pi and specific error ε_pi effects separated, as:

X_pi = μ_p + (μ_i − μ) + ω_pi + ε_pi,  or  X_pi = μ_p + e_pi,

where μ_p is the universe score for person p over all situations (equivalent to the generic true score in Lord and Novick's, 1968, treatment) and where e_pi, the generic error score, contains both the residual and situation effects of equation [1]. For one person p, over all situations i, the generic error variance is:

σ²(e_p·) = E_i(e_pi²),
and over all persons, the error variance becomes

[5] σ²(e) = E_p σ²(e_p·) = σ²(μ_i) + σ²(e_pi).

This expression for error variance can be seen to contain the between-situations variance component and the residual variance.
From the two-way analysis of variance design it can be seen that the expected mean squares may be expressed in terms of variance components as:

E(MS_p) = σ²(e_pi) + I σ²(μ_p),
E(MS_i) = σ²(e_pi) + P σ²(μ_i),
E(MS_res) = σ²(e_pi),

and, therefore, unbiased estimators of the variance components may be obtained as:

[6] σ̂²(μ_p) = (1/I)(MS_p − MS_res)
[7] σ̂²(μ_i) = (1/P)(MS_i − MS_res)
[8] σ̂²(e_pi) = MS_res

For the nested design the expected mean squares would be

E(MS_p) = σ²(i, e_pi) + I σ²(μ_p),
E(MS_wp) = σ²(i, e_pi),

where MS_wp represents the mean square within persons since, in the case of nesting, this is the residual mean square. In this design the estimates of the components of variance may be found as:

[9] σ̂²(μ_p) = (1/I)(MS_p − MS_wp)
[10] σ̂²(i, e_pi) = MS_wp
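As a computational illustration of equations [6] through [8] (a sketch with invented sample sizes and component values, not an analysis from any study cited here), the following Python fragment simulates matched data under the additive model of equation [1] and recovers the three components from the observed mean squares:

import numpy as np

rng = np.random.default_rng(1)
P, I = 40, 10  # hypothetical numbers of persons and situations

# Matched data X (P x I) under equation [1]: grand mean + person effect
# + situation effect + residual, with true variances 1.00, 0.25, and 0.49.
X = (5.0
     + rng.normal(0.0, 1.0, (P, 1))   # person effects
     + rng.normal(0.0, 0.5, (1, I))   # situation effects
     + rng.normal(0.0, 0.7, (P, I)))  # residual (interaction + error)

grand = X.mean()
ms_p = I * ((X.mean(axis=1) - grand) ** 2).sum() / (P - 1)  # persons MS
ms_i = P * ((X.mean(axis=0) - grand) ** 2).sum() / (I - 1)  # situations MS
resid = X - X.mean(axis=1, keepdims=True) - X.mean(axis=0, keepdims=True) + grand
ms_res = (resid ** 2).sum() / ((P - 1) * (I - 1))           # residual MS

var_p = (ms_p - ms_res) / I  # equation [6]
var_i = (ms_i - ms_res) / P  # equation [7]
var_res = ms_res             # equation [8]
print(f"persons {var_p:.2f}, situations {var_i:.2f}, residual {var_res:.2f}")

The estimates should fall near the simulated values of 1.00, 0.25, and 0.49, apart from sampling error.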
Coefficients of Generalizability

In classical theory only a single estimate of reliability is obtained, as an estimate of the correlation between scores obtained on two administrations of parallel tests.
In generalizability theory, the range of different reliability
coefficients is made explicit. The universe for which the universe
score is estimated may be variously defined and, hence, for each
different universe of scores a different coefficient of correlation
(generalizability) between universe and observed scores may be
obtained. Thus a clear definition of the universe of generalization
for any particular study is important. Furthermore, reliability,
defined in this sense and estimated as a coefficient of
generalizability, is clearly dependent on the design of the study
in which the instrument is to be used since the relevant universe
is defined in terms of the facets included in the study.
Still considering the simple case of a single observer rating P
persons in I situations, the most likely event in an experimental
study would be that the situations would differ from person to
person. The coefficient of generalizability required is one for the
population of situations. Cronbach et al. (1963) show how this
coefficient, for unmatched data, may be obtained from a
generalizability study with matched data (i.e., in which all P
persons are observed in the same I situations).
The coefficient of generalizability may be defined as the ratio of the variance of the universe scores for persons, μ_p, to the variance, over persons and situations, of the observed scores X_pi. Thus,

[11] ρ²(X, μ_p) = σ²(μ_p) / σ²(X).

From equations [5] through [8] we see that the variance of the observed scores in the population may be estimated as:

[12] σ̂²(X) = σ̂²(μ_p) + σ̂²(e)
           = σ̂²(μ_p) + σ̂²(μ_i) + σ̂²(e_pi)
           = (1/I)(MS_p − MS_res) + (1/P)(MS_i − MS_res) + MS_res
           = (1/I)MS_p + (1/P)MS_i + [(I·P − I − P)/(I·P)]MS_res.
An estimate for the variance of universe scores is given as [6].³

³ Cronbach, et al. (1963), and Lord and Novick (1968) provide methods for estimating the reliability of observations in a single situation (analogous to the reliability of a single test form, or a single observer) from data obtained in more than one situation. These formulae are not relevant in the present context where the concern is with reliability across the universe of situations.
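In computational form, equations [11] and [12] reduce to a few lines. The helper below is hypothetical (its name and the illustrative mean squares are invented), but the algebra is exactly that of [6], [11], and [12]:

def coefficient_of_generalizability(ms_p: float, ms_i: float, ms_res: float,
                                    P: int, I: int) -> float:
    """Estimate rho^2(X, mu_p) for the single-facet matched (crossed) design."""
    var_persons = (ms_p - ms_res) / I                      # equation [6]
    var_observed = (ms_p / I + ms_i / P
                    + (I * P - I - P) / (I * P) * ms_res)  # equation [12]
    return var_persons / var_observed                      # equation [11]

# Illustrative mean squares from a hypothetical generalizability study:
print(coefficient_of_generalizability(ms_p=12.4, ms_i=3.1, ms_res=0.9,
                                      P=40, I=10))  # approx. .55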
Generalizability and Variations in Behavior

Earlier in this paper a distinction was made between instability of behavior and lawful variations in behavior in response to varying conditions. A data layout, indicating the type of data to be collected in a generalizability study which allows for such variations, is shown in Table 1.⁴ The model for this analysis is:
[13] X_ikmn = μ + a^T_i + a^S_k + a^O_ikm + a^J_n + a^TS_ik + a^TJ_in + a^SJ_kn + a^TSJ_ikn + a^OJ_ikmn + ε_ikmn,

where

μ = general mean,
ε_ikmn = specific error (as in true score theory),

and, in terms such as a^TS_ik, the superscripts indicate the effect and the subscripts have the usual meaning. Where the number of subscripts is greater than the number of superscripts, the extraneous subscripts refer to the factors within which the effect is nested. Thus, a^O_ikm represents the effect of the mth level of Factor O (occasions) nested in the ith level of Factor T (teachers) and the kth level of Factor S (situations).
In the usual generalizability theory analysis all but the teacher effect would be considered as part of the generic error, e_ikmn, contributed by the various facets in the design over which generalizations are to be made. From this standpoint [13] could be reduced to

[14] X_ikmn = μ + a^T_i + e_ikmn.

In the present analysis, however, systematic (non-random) changes in behavior over situations are not considered as contributing to error. This is also true for systematic differences among teachers in their changes in behavior (the teachers × situations interaction component). Changes in behavior over occasions (within T × S) are considered to be random fluctuations and thus contribute to error. Differences among judges are clearly in the same category.
⁴ The design portrayed in Table 1 is an elaboration of Design 5 in Gleser, et al. (1965). It is the introduction of a "situations" facet and the consideration of a teachers-by-situations interaction which represents the major departure from the treatment given by Cronbach, et al. (in press). The proposed design may also be seen as merging the Medley and Mitzel (1963) conception with the reformulation presented by Cronbach, et al. (in press).
TABLE 1
[Data layout for the proposed study: teachers observed in situations, on occasions nested within teacher-situation combinations, by judges. The body of the table is not legible in this copy.]
Thus, the model for the analysis may be written as:

[15] X_ikmn = μ + a^T_i + a^S_k + a^TS_ik + e_ikmn.

In terms of the partitioning of variance provided in the analysis of variance, the observed score variance may be expressed as:

[16] σ²(X_ikmn) = σ²(a^T) + σ²(a^S) + σ²(a^TS) + σ²(e_ikmn).

Converting to a simpler notation, this may be written as:

[17] σ²_X = σ²_t + σ²_s + σ²_ts + σ²_e,

where

[18] σ²_e = σ²_o(ts) + σ²_j + σ²_tj + σ²_sj + σ²_tsj + σ²_o(ts)j + σ²_ε,

and where, since the analysis has only one observation per cell, σ²_ε + σ²_o(ts)j must be estimated as σ²_res.
All factors in the design are considered to be random, with
levels chosen at random from an infinite universe. The expected
mean square for each of the sources of variation is shown in Table
2. Unbiased estimators of each of the components of variance are
given in Table 3, in terms of the observed mean squares.
TABLE 2
Expected mean squares (all facets random; I teachers, K situations, M occasions within each teacher-situation combination, N judges)

Source of variation            Expected mean square
Teachers (t)                   σ²_res + Nσ²_o(ts) + Mσ²_tsj + KMσ²_tj + MNσ²_ts + KMNσ²_t
Situations (s)                 σ²_res + Nσ²_o(ts) + Mσ²_tsj + IMσ²_sj + MNσ²_ts + IMNσ²_s
Occasions within t × s         σ²_res + Nσ²_o(ts)
Judges (j)                     σ²_res + Mσ²_tsj + KMσ²_tj + IMσ²_sj + IKMσ²_j
t × s                          σ²_res + Nσ²_o(ts) + Mσ²_tsj + MNσ²_ts
t × j                          σ²_res + Mσ²_tsj + KMσ²_tj
s × j                          σ²_res + Mσ²_tsj + IMσ²_sj
t × s × j                      σ²_res + Mσ²_tsj
Residual (o(ts) × j, ε)        σ²_res

TABLE 3
Components of variance

σ̂²_t = (1/KMN)(MS_t − MS_ts − MS_tj + MS_tsj)
σ̂²_s = (1/IMN)(MS_s − MS_ts − MS_sj + MS_tsj)
σ̂²_o(ts) = (1/N)(MS_o(ts) − MS_res)
σ̂²_j = (1/IKM)(MS_j − MS_tj − MS_sj + MS_tsj)
σ̂²_ts = (1/MN)(MS_ts − MS_tsj − MS_o(ts) + MS_res)
σ̂²_tj = (1/KM)(MS_tj − MS_tsj)
σ̂²_sj = (1/IM)(MS_sj − MS_tsj)
σ̂²_tsj = (1/M)(MS_tsj − MS_res)
σ̂²_res = MS_res

In this analysis, three generalizability coefficients may be determined. The first, ρ²_t, provides a measure of the reliability with which teachers' behavior (within situations) may be observed. This coefficient may be estimated as:

ρ̂²_t = σ̂²_t / (σ̂²_t + σ̂²_e),

where σ̂²_t can be estimated as indicated in Table 3 and σ̂²_e by substituting in [18] the appropriate estimators from Table 3.

A second coefficient of generalizability, ρ²_s, provides an index of the reliability with which situations may be distinguished. This coefficient may be estimated as:

ρ̂²_s = σ̂²_s / (σ̂²_s + σ̂²_e).

If this coefficient is small then the implication is that behavior changes from situation to situation are either random (and, therefore, not lawful) or vary systematically from teacher to teacher in such a way that no overall differences among situations may be detected. If this latter is the case it will be detected by the third coefficient of generalizability proposed, viz.,

[19] ρ̂²_ts = σ̂²_ts / (σ̂²_ts + σ̂²_e).

This coefficient provides an index of the reliability with which observers can detect systematic variations among teachers in their changes in behavior from one situation to another.
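As a final sketch, the function below assumes the nine mean squares of Table 2 have already been computed and applies the estimators of Table 3 together with equation [18] to produce the three proposed coefficients; the function name and the illustrative mean-square values are invented for the example:

def g_coefficients(ms: dict, I: int, K: int, M: int, N: int) -> dict:
    """Variance components (Table 3) and the three generalizability
    coefficients, from mean squares keyed 't', 's', 'o(ts)', 'j', 'ts',
    'tj', 'sj', 'tsj', and 'res'."""
    c = {
        't':     (ms['t'] - ms['ts'] - ms['tj'] + ms['tsj']) / (K * M * N),
        's':     (ms['s'] - ms['ts'] - ms['sj'] + ms['tsj']) / (I * M * N),
        'o(ts)': (ms['o(ts)'] - ms['res']) / N,
        'j':     (ms['j'] - ms['tj'] - ms['sj'] + ms['tsj']) / (I * K * M),
        'ts':    (ms['ts'] - ms['tsj'] - ms['o(ts)'] + ms['res']) / (M * N),
        'tj':    (ms['tj'] - ms['tsj']) / (K * M),
        'sj':    (ms['sj'] - ms['tsj']) / (I * M),
        'tsj':   (ms['tsj'] - ms['res']) / M,
        'res':   ms['res'],
    }
    # Generic error variance, equation [18]; the o(ts) x j interaction and
    # the specific error are carried together by the residual component.
    err = c['o(ts)'] + c['j'] + c['tj'] + c['sj'] + c['tsj'] + c['res']
    return {'rho2_t': c['t'] / (c['t'] + err),
            'rho2_s': c['s'] / (c['s'] + err),
            'rho2_ts': c['ts'] / (c['ts'] + err)}

# Illustrative mean squares (invented) for 20 teachers, 10 situations,
# 2 occasions per teacher-situation cell, and 2 judges:
ms = {'t': 40.0, 's': 18.0, 'o(ts)': 2.0, 'j': 6.0,
      'ts': 9.0, 'tj': 3.0, 'sj': 2.5, 'tsj': 1.8, 'res': 1.2}
print(g_coefficients(ms, I=20, K=10, M=2, N=2))

Negative component estimates, which can occur in small samples, would ordinarily be set to zero before the coefficients are formed.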
Obviously considerable amounts of data need to be collected to
obtain these estimates, but without them it seems futile, on the
one hand, to treat all changes in behavior as though they were
lawful or, on the other, to treat all variations in behavior as
errors of measurement.
REFERENCES

American Educational Research Association. Technical recommendations for achievement tests. Washington, D. C.: NEA, 1955.

American Psychological Association. Technical recommendations for psychological tests and diagnostic techniques. Washington, D. C.: APA, 1954.

American Psychological Association. Standards for educational and psychological tests and manuals. Washington, D. C.: APA, 1966.

BELLACK, A. A., KLIEBARD, H. M., HYMAN, R. T., & SMITH, F. L., Jr. The language of the classroom. New York: Teachers College Press, Columbia University, 1966.
BROWN, B. B., MENDENHALL, W., & BEAVER, R. The reliability of observations of teachers' classroom behavior. Journal of Experimental Education, 1968, 36, 1-10.

CRONBACH, L. J., RAJARATNAM, N., & GLESER, G. Theory of generalizability: a liberalization of reliability theory. British Journal of Statistical Psychology, 1963, 16, 137-163.

CRONBACH, L. J., GLESER, G., & RAJARATNAM, N. Dependability of behavioral measurements. New York: Wiley, in press.

FLANDERS, N. A. The problems of observer training and reliability. In Amidon, E., and Hough, J. (Eds.) Interaction analysis: theory, research, and application. Massachusetts: Addison-Wesley, 1967.
FLANDERS, N. A., et al. Teacher influence patterns and pupil
achievement in second, fourth and sixth grade levels. Vols. I-II.
Cooperative Research Project No. 5-1055, U.S. Dept. of Health,
Education, and Welfare: Office of Education. University of
Michigan, 1969.
GLESER, G., CRONBACH, L. J., & RAJARATNAM, N.
Generalizability of scores influenced by multiple sources of
variance. Psychometrika, 1965, 30, 395-418.
LORD, F. M., & NOVICK, M. R. Statistical theories of mental test scores. Massachusetts: Addison-Wesley, 1968.
MEDLEY, D. M., & MITZEL, H. E. Measuring classroom behavior
by systematic observation. In N. L. Gage (Ed.) Handbook of research
on teaching. Chicago: Rand McNally, 1963.
SMITH, B. O., & MEUX, M. A study of the logic of teaching.
Cooperative Research Project No. 258, U.S. Dept. of Health,
Education, and Welfare: Office of Education. University of
Illinois, 1962.
TRAVERS, K. J. Correction for attenuation: a generalizability
approach using components of covariance. College of Education,
University of Illinois, 1969 (mimeo).
WEICK, K. E. Systematic observational methods. In G. Lindzey and
E. Aronson (Eds.) The Handbook of Social Psychology Vol. II (2nd
Ed.) Massachusetts: Addison-Wesley, 1968.
WESTBURY, I. The reliability of measures of classroom behavior.
Ontario Journal of Educational Research, 1967, 10, 125-138.
(Received June, 1971) (Revised September, 1971)
AUTHORS
McGAW, BARRY Address: Center for Instructional Research and
Curriculum Evaluation, College of Education, University of
Illinois, Urbana, Illinois 61801. Title: University of Illinois
Fellow. Age: 30. Degrees: B.Sc., B.Ed. (Hons.) University of
Queensland (Australia); M.Ed., University of Illinois.
Specialization: Measurement, learning, evaluation.
WARDROP, JAMES L. Address: Center for Instructional Research and
Curriculum Evaluation, College of Education, University of
Illinois, Urbana, Illinois 61801. Title: Associate Professor. Age:
30. Degrees: B.A., Ph.D., Washington University. Specialization:
Measurement, evaluation, statistics.
BUNDA, MARY ANNE Address: The Ohio State University, Evaluation
Center, 1712 Neil Avenue, Columbus, Ohio 43210. Title: Assistant
Professor. Age: 27. Degrees: B.S., M.Ed., Loyola University of
Chicago; Ph.D., University of Illinois. Specialization:
Measurement, statistics, evaluation.