Classroom Observation Schemes: Where are the Errors?
BARRY McGAW and JAMES L. WARDROP University of Illinois at
Urbana-Champaign
MARY ANNE BUNDA The Ohio State University
Much of the confusion about the notion of reliability was cleared with the publication of Technical Recommendations for Psychological Tests and Diagnostic Techniques (American Psychological Association, 1954) and Technical Recommendations for Achievement Tests (American Educational Research Association, 1955). The subsequent publication of Standards for Educational and Psychological Tests and Manuals (American Psychological Association, 1966), while it introduced considerable changes in the conception of validity, redefined reliability in essentially the same terms.
The term "reliability" was clarified through the recognition of
its several meanings and, consequently, that " 'reliability
coefficient' is a generic term referring to various types of
evidence; each type of evidence suggesting a different meaning"
(APA, 1966, p. 25). Although the explicit definition of a series of
alternative conceptions of reliability clarified the issue it did
not resolve a number of important questions of interpretation. For
example, there is no clear guide for the selection of an
appropriate reliability coefficient nor is there any resolution of
the problem of interpreting differences in the values of the
various coefficients for the same test.
Further problems arise when consideration is given to the reliability of observation schedules. Here a further variable is introduced into the measurement situation, and inter-observer disagreement becomes an additional source of errors of measurement. This type of error is somewhat akin to that between alternate forms of a test, which is typically estimated with an equivalent forms correlation coefficient. It is a more important source of error with an observation schedule, however, since it is usually unavoidable. Whereas, with a test, it is possible in many situations to use only a single form and so avoid this source of unreliability, with an observation schedule the obvious physical limitations necessitate the use of more than one observer in all but the smallest studies.¹

¹ This is not intended to imply that the use of only one observer is desirable, for in such a study no estimate of the effect of observer bias can be obtained.
A major problem with a series of reliability indices, each of
which measures the effect of only one or two sources of
unreliability, is that there is no way of obtaining a "total"
picture of the combined effects of all relevant sources of error.
The range of sources of error with observational techniques is discussed by Medley and Mitzel (1963). At the descriptive level, with respect to classroom observation techniques, a measure can be claimed to be reliable "to the extent that the average difference between two measurements independently obtained in the same classroom is smaller than the average difference between two measurements obtained in different classrooms" (p. 250). Unreliability occurs when two measures of the same object (person, classroom) tend to differ too much, either because the behaviors being observed are too variable or because independent observers cannot agree on what is happening.
In addition to these sources, unreliability may also be due to the smallness of differences among the objects of observation on the dimensions observed. This is a less important point for test construction, where preliminary item analyses are used to reject items which do not discriminate, than it is for the development of observation schedules. The point seems seldom to have been recognized, particularly by those who have developed classroom observation schemes. It is clearly possible that an instrument for which the level of inter-judge agreement is very high may be quite unreliable in spite of the judges' consistency.
Coefficients of Observer Agreement

The inadequacy of measures of inter-observer agreement as reliability indices has been discussed (see, e.g., Medley and Mitzel, 1963; Westbury, 1967), yet these discussions appear to have had little impact. Most authors of classroom observation schemes deal with reliability exclusively in terms of observer agreement. For example, Bellack, et al. (1966), describing their technique to measure observer agreement, claim that the results "indicate a consistently high degree of reliability for all major categories of analysis: agreement ranged from 84 to 96 percent" (p. 35). They established these figures using two pairs of independent coders.
Smith, et al. (1964, 1967) used the technique previously employed by Smith and Meux (1962) to calculate the consistency with which independent judges identified and classified their units of behavior (ventures). These two estimates are essentially percent agreement measures. Flanders (1967) also specified reliability only in terms of observer agreement. The π coefficient he used is superior to simple percentage agreement measures since it estimates, instead, the extent to which chance agreement has been exceeded. A notable exception to these studies is that of Brown, Mendenhall, and Beaver (1968), in which the reliability estimate was a measure of the consistency with which teachers could be discriminated from one another.
Medley and Mitzel (1963) argued strongly against using observer
agreement indices as measures of reliability, in fact referring to
them as coefficients of observer agreement to distinguish them from
coefficients of reliability.
Weick (1968), referring to observation schemes in general,
observed that "the most common reliability measure in observational
studies is observer agreement" but, unfortunately, supported this
state of affairs by arguing that inter-observer agreement is more
important than replicability, i.e., intra-observer agreement over
occasions. While this is undoubtedly true, it misses the main
point. The establishment and maintenance of a suitably high level
of inter-judge agreement is important only insofar as it operates
as a limiting factor on reliability.
However great the problems apparently introduced by a failure to
obtain suitable inter-judge agreement, this is a situation under
fairly direct control of the experimenter. In the case of a
category system clearer definition of the categories, minimization
of overlap, and more extensive training of observers should raise
the level of agreement, while in the case of sign analysis more
precise definitions of the behaviors to be recorded should achieve
the same results.
If the objects do not differ sufficiently in the behavior observed, in comparison with their own stability over time, no level of inter-judge agreement can render the observation scheme acceptably reliable (if one insists, as is done in this discussion, on restricting the use of "reliability" to its measurement-theoretic sense).
The confusion introduced into the literature through failure to distinguish clearly the different sources of unreliability, and through over-emphasis on inter-judge agreement, has resulted from a confusion of the importance of primacy with prime importance. Inter-judge agreement is the first, but not the most important, issue to be faced.
Stability of Behavior
Variability of the object of observation is the most important source of error variance. Unless stable estimates of behavior can be obtained, inter-object variability will inevitably be swamped by intra-object variability. Thus sampling of behavior is of critical importance. Medley and Mitzel (1963, p. 268) contended that it is better to increase the number of observations than the number of observers. They cite an example of a study of teacher behavior in which increasing the number of observers on a single occasion from 1 to 12 increased reliability from .47 to .50, whereas using the 12 observers individually on 12 separate occasions increased reliability from .47 to .92. Errors due to instability of behavior were clearly greater than errors due to inter-observer disagreement in this example.
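The arithmetic behind this contrast can be reproduced with a small sketch. The variance components below are hypothetical values (chosen only so that the single-observation reliability falls near .47; they are not Medley and Mitzel's data), and the projection is the standard one for the reliability of a mean of sampled observations:

# Hypothetical variance components for a sketch of the Medley-Mitzel contrast.
# These numbers are invented for illustration; only the pattern matters.
VAR_TEACHER = 1.000    # true between-teacher variance
VAR_OCCASION = 0.989   # error from occasion-to-occasion instability
VAR_OBSERVER = 0.139   # error from observer disagreement (incl. residual)

def reliability(n_occasions: int, n_observers_per_occasion: int) -> float:
    """Reliability of a mean score: true variance / (true + averaged error).

    Occasion error is averaged over occasions; observer error is averaged
    over every independent rating that is made."""
    n_ratings = n_occasions * n_observers_per_occasion
    error = VAR_OCCASION / n_occasions + VAR_OBSERVER / n_ratings
    return VAR_TEACHER / (VAR_TEACHER + error)

print(f"1 observer, 1 occasion:             {reliability(1, 1):.2f}")   # approx. .47
print(f"12 observers, 1 occasion:           {reliability(1, 12):.2f}")  # approx. .50
print(f"1 observer on each of 12 occasions: {reliability(12, 1):.2f}")  # approx. .92

Averaging twelve observers on a single occasion reduces only the observer term, while spreading the same twelve observations over twelve occasions divides the whole error term by twelve; hence the dramatic difference in the projected coefficients.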
A critical point in the conception of reliability expounded by Medley and Mitzel is that instability of behavior over occasions (i.e., time) is due to random error in the environment, the person (object), or both. This implies that the characteristic being measured is stable in a sense that does not allow for lawful change. While this may be a reasonable assumption in relation to relatively enduring aspects of personality, it is unreasonable when other types of behavior patterns are being observed.
In the case of observations of teacher behavior, efforts to
develop indices to characterize particular teachers appear to be
misplaced unless there is some allowance made for lawful
adaptations of behavior to different situations.
One study (Flanders, et al., 1969) has attempted to examine the effect of variability in teacher behavior. In this study a series of different situations, over which teacher behavior might be expected to vary systematically, was chosen. Ten situations were included and treated as independent, though it appears that at least two dimensions underlay the situations. One dimension (six situations) involved variations in subject matter (language arts, science, etc.). The other (four situations) involved differences in activities (introducing new material, extending old material, planning, etc.). Assuming that two dimensions are involved, then, it is clear that there are actually 24 (6 × 4) possible situations rather than ten. The authors collected data only in the broad categories, without indicating that they had balanced (or controlled) one dimension while observing the other.
Twenty teachers were observed in each of the ten situations (there were some empty cells, in fact) and their teaching behavior in each situation was categorized using Flanders' Interaction Analysis Schedule. Thus ten i/d ratios were obtained for most of the teachers, fewer for those teachers not observed in all situations. The standard deviations of the i/d ratios for each teacher were calculated and taken to be measures of variability, or even "flexibility" as the authors chose to call it. This index of flexibility was then used as a dependent variable in comparing groups of teachers.
The view, implicit in the Flanders study, that changes in
behavior over situations are lawful, is premature without some
empirical evidence. An appropriate way to demonstrate such
lawfulness, or at least that the changes are systematic, would be
to assess the reliability of the observation schedule for measuring
behavior within situations. From this standpoint, stability would
be expected only over separate occasions within each situation.
Variability among situations, within teachers, would no longer be
attributed to error variance but to true variance, i.e., variation
in the measure reflecting real, systematic variation in the
person.
To attempt to assess each of the sources of error variance in a complex model such as this via inter-class correlation coefficients would be impossible. Apart from the work involved, the difficulties discussed earlier with respect to interpretations of coefficients of equivalence, stability, and observer agreement would be multiplied since, for example, there would now be a separate coefficient of stability for each situation involved. The most appropriate approach would be to use a variance components analysis such as that proposed by Medley and Mitzel (1963) or by Cronbach and his associates (Gleser, et al., 1965; Cronbach, et al., in press). In the following pages, the basic concepts of generalizability theory, developed by Cronbach, et al., will be described, and then a solution to the above problem of estimating reliability will be suggested.
Generalizability Theory

The conditions under which observations are made can be classified in various ways. For example, in the previous section, in the study of "variability," observations could be classified with respect to situations, occasions (within situations), and observers. Each of these aspects of the observations is termed a facet in the terminology of generalizability theory. Each facet is, of course, a source of variability in the observations in addition to the between-persons variability. The variation attributable to each facet can be estimated by analysis of variance procedures.
The procedure will first be illustrated with respect to a single-facet design. If observations are made on P persons in I situations, the ratings of all persons in all situations can be represented by a P × I matrix X, in which X_pi is the rating of person p in situation i. In this simple example only one observer is used for all observations. If more than one observer were used, observers would constitute a second facet.

If the I situations are exactly the same for all P persons, the example satisfies the definition of "matched data" (Cronbach, et al., 1963) and, in terms of analysis of variance, constitutes a two-way crossed design. If, as is more likely to be the case, the situations in which observations are made for different persons are different, the data would be "unmatched." In terms of analysis of variance, this would be a nested design.
The P persons and the I situations are considered to be random samples from their respective universes of persons and situations. The only assumptions or requirements made are:

1. The universe is described unambiguously, so that it is clear what situations fall within the universe.

2. Situations are experimentally independent; a person's score in situation i does not depend on whether he has or has not been previously observed under other conditions.

3. Scores X_pi are on an interval scale.² (See Cronbach, et al., 1963, p. 145.)

² It is unlikely that data derived from classroom observation schedules meet this assumption. On the other hand, while the assumption is necessary for the formal application of the variance components approach, its violation would seem to be of little consequence in practice.
For each person p, the average of the X_pi over all situations i in the universe is his universe score. That is,

μ_p = E(X_pi | p) = E_i(X_pi).

Similarly, the universe score for situation i is

μ_i = E(X_pi | i) = E_p(X_pi),

and the universe score for all persons and conditions is

μ = E_p E_i(X_pi).

From the additive model for two-way analysis of variance, with the restriction of only one observation per cell and consequent confounding of interaction and error, each person's score can be expressed as:

[1] X_pi = μ + (μ_p − μ) + (μ_i − μ) + e_pi,

where (μ_p − μ), (μ_i − μ), and e_pi are the person effect, the situation effect, and the residual, respectively.
The population variances due to each of these effects are:

σ²(μ_p) = E(μ_p − μ)²,
σ²(μ_i) = E(μ_i − μ)²,
σ²(e_pi) = E(e_pi²).

The total observed score variance for the population would be

σ²(X_pi) = E(X_pi − μ)²,

and since there are no covariances among the effects in the additive model, the population observed score variance may be expressed in terms of its components as:

[2] σ²(X_pi) = σ²(μ_p) + σ²(μ_i) + σ²(e_pi).

If a nested design (unmatched data) were used there would be no identifiable situation effects, since each person would be observed in different situations. In this case the model would reduce to

[3] X_pi = μ + (μ_p − μ) + e_pi,

where the residual e_pi now absorbs the situation effect, and the identifiable components of variance would be

[4] σ²(X_pi) = σ²(μ_p) + σ²(i, e_pi)

(see Travers, 1969).
Equation [1] may be rewritten, with the interaction ω_pi and specific error ε_pi effects separated, as:

X_pi = μ_p + (μ_i − μ) + ω_pi + ε_pi,  or  X_pi = μ_p + e_pi,

where μ_p is the universe score for person p over all situations (equivalent to the generic true score in Lord and Novick's, 1968, treatment) and where e_pi, the generic error score, contains both the residual and situation effects of equation [1]. For one person p, over all situations i, the generic error variance is:

σ²(e_p·) = E_i(e_pi²),
and over all persons, the error variance becomes

[5] σ²(e) = E_p σ²(e_p·) = σ²(μ_i) + σ²(e_pi).

This expression for error variance can be seen to contain the between-situations variance component and the residual variance.
From the two-way analysis of variance design it can be seen that the expected mean squares may be expressed in terms of variance components as:

E(MS_p) = σ²(e_pi) + I σ²(μ_p),
E(MS_i) = σ²(e_pi) + P σ²(μ_i),
E(MS_res) = σ²(e_pi),

and, therefore, unbiased estimators of the variance components may be obtained as:

[6] σ̂²(μ_p) = (1/I)(MS_p − MS_res)
[7] σ̂²(μ_i) = (1/P)(MS_i − MS_res)
[8] σ̂²(e_pi) = MS_res

For the nested design the expected mean squares would be

E(MS_p) = σ²(i, e_pi) + I σ²(μ_p),
E(MS_wp) = σ²(i, e_pi),

where MS_wp represents the mean square within persons since, in the case of nesting, this is the residual mean square. In this design the estimates of the components of variance may be found as:

[9] σ̂²(μ_p) = (1/I)(MS_p − MS_wp)
[10] σ̂²(i, e_pi) = MS_wp
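As a computational illustration of equations [6] through [8] (a sketch with invented sample sizes and component values, not an analysis from any study cited here), the following Python fragment simulates matched data under the additive model of equation [1] and recovers the three components from the observed mean squares:

import numpy as np

rng = np.random.default_rng(1)
P, I = 40, 10  # hypothetical numbers of persons and situations

# Matched data X (P x I) under equation [1]: grand mean + person effect
# + situation effect + residual, with true variances 1.00, 0.25, and 0.49.
X = (5.0
     + rng.normal(0.0, 1.0, (P, 1))   # person effects
     + rng.normal(0.0, 0.5, (1, I))   # situation effects
     + rng.normal(0.0, 0.7, (P, I)))  # residual (interaction + error)

grand = X.mean()
ms_p = I * ((X.mean(axis=1) - grand) ** 2).sum() / (P - 1)  # persons MS
ms_i = P * ((X.mean(axis=0) - grand) ** 2).sum() / (I - 1)  # situations MS
resid = X - X.mean(axis=1, keepdims=True) - X.mean(axis=0, keepdims=True) + grand
ms_res = (resid ** 2).sum() / ((P - 1) * (I - 1))           # residual MS

var_p = (ms_p - ms_res) / I  # equation [6]
var_i = (ms_i - ms_res) / P  # equation [7]
var_res = ms_res             # equation [8]
print(f"persons {var_p:.2f}, situations {var_i:.2f}, residual {var_res:.2f}")

The estimates should fall near the simulated values of 1.00, 0.25, and 0.49, apart from sampling error.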
Coefficients of Generalizability

In classical theory only a single estimate of reliability is obtained, as an estimate of the correlation between scores obtained on two administrations of parallel tests.
In generalizability theory, the range of different reliability
coefficients is made explicit. The universe for which the universe
score is estimated may be variously defined and, hence, for each
different universe of scores a different coefficient of correlation
(generalizability) between universe and observed scores may be
obtained. Thus a clear definition of the universe of generalization
for any particular study is important. Furthermore, reliability,
defined in this sense and estimated as a coefficient of
generalizability, is clearly dependent on the design of the study
in which the instrument is to be used since the relevant universe
is defined in terms of the facets included in the study.
Still considering the simple case of a single observer rating P
persons in I situations, the most likely event in an experimental
study would be that the situations would differ from person to
person. The coefficient of generalizability required is one for the
population of situations. Cronbach et al. (1963) show how this
coefficient, for unmatched data, may be obtained from a
generalizability study with matched data (i.e., in which all P
persons are observed in the same I situations).
The coefficient of generalizability may be defined as the ratio of the variance of the universe scores for persons, μ_p, to the variance, over persons and situations, of the observed scores X_pi. Thus,

[11] ρ²(X, μ_p) = σ²(μ_p) / σ²(X).

From equations [5] through [8] we see that the variance of the observed scores in the population may be estimated as:

[12] σ̂²(X) = σ̂²(μ_p) + σ̂²(e)
           = σ̂²(μ_p) + σ̂²(μ_i) + σ̂²(e_pi)
           = (1/I)(MS_p − MS_res) + (1/P)(MS_i − MS_res) + MS_res
           = (1/I)MS_p + (1/P)MS_i + [(I·P − I − P)/(I·P)]MS_res.
An estimate for the variance of universe scores is given as [6].³

³ Cronbach, et al. (1963), and Lord and Novick (1968) provide methods for estimating the reliability of observations in a single situation (analogous to the reliability of a single test form, or a single observer) from data obtained in more than one situation. These formulae are not relevant in the present context where the concern is with reliability across the universe of situations.
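In computational form, equations [11] and [12] reduce to a few lines. The helper below is hypothetical (its name and the illustrative mean squares are invented), but the algebra is exactly that of [6], [11], and [12]:

def coefficient_of_generalizability(ms_p: float, ms_i: float, ms_res: float,
                                    P: int, I: int) -> float:
    """Estimate rho^2(X, mu_p) for the single-facet matched (crossed) design."""
    var_persons = (ms_p - ms_res) / I                      # equation [6]
    var_observed = (ms_p / I + ms_i / P
                    + (I * P - I - P) / (I * P) * ms_res)  # equation [12]
    return var_persons / var_observed                      # equation [11]

# Illustrative mean squares from a hypothetical generalizability study:
print(coefficient_of_generalizability(ms_p=12.4, ms_i=3.1, ms_res=0.9,
                                      P=40, I=10))  # approx. .55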
Generalizability and Variations in Behavior

Earlier in this paper a distinction was made between instability of behavior and lawful variations in behavior in response to varying conditions. A data layout, indicating the type of data to be collected in a generalizability study which allows for such variations, is shown in Table 1.⁴ The model for this analysis is:
[13] X_ikmn = μ + a^T_i + a^S_k + a^O_ikm + a^J_n + a^TS_ik + a^TJ_in + a^SJ_kn + a^TSJ_ikn + a^OJ_ikmn + ε_ikmn,

where

μ = general mean,
ε_ikmn = specific error (as in true score theory),

and, in terms such as a^TS_ik, the superscripts indicate the effect and the subscripts have the usual meaning. Where the number of subscripts is greater than the number of superscripts, the extraneous subscripts refer to the factors within which the effect is nested. Thus, a^O_ikm represents the effect of the mth level of Factor O (occasions) nested in the ith level of Factor T (teachers) and the kth level of Factor S (situations).
In the usual generalizability theory analysis all but the teacher effect would be considered as part of the generic error, e_ikmn, contributed by the various facets in the design over which generalizations are to be made. From this standpoint [13] could be reduced to

[14] X_ikmn = μ + a^T_i + e_ikmn.

In the present analysis, however, systematic (non-random) changes in behavior over situations are not considered as contributing to error. This is also true for systematic differences among teachers in their changes in behavior (the teachers × situations interaction component). Changes in behavior over occasions (within T × S) are considered to be random fluctuations and thus contribute to error. Differences among judges are clearly in the same category.
⁴ The design portrayed in Table 1 is an elaboration of Design 5 in Gleser, et al. (1965). It is the introduction of a "situations" facet and the consideration of a teachers-by-situations interaction which represents the major departure from the treatment given by Cronbach, et al. (in press). The proposed design may also be seen as merging the Medley and Mitzel (1963) conception with the reformulation presented by Cronbach, et al. (in press).
TABLE 1
[Data layout for the proposed study: teachers observed in situations, on occasions nested within teacher-situation combinations, by judges. The body of the table is not legible in this copy.]
Thus, the model for the analysis may be written as:

[15] X_ikmn = μ + a^T_i + a^S_k + a^TS_ik + e_ikmn.

In terms of the partitioning of variance provided in the analysis of variance, the observed score variance may be expressed as:

[16] σ²(X_ikmn) = σ²(a^T) + σ²(a^S) + σ²(a^TS) + σ²(e_ikmn).

Converting to a simpler notation, this may be written as:

[17] σ²_X = σ²_t + σ²_s + σ²_ts + σ²_e,

where

[18] σ²_e = σ²_o(ts) + σ²_j + σ²_tj + σ²_sj + σ²_tsj + σ²_o(ts)j + σ²_ε,

and where, since the analysis has only one observation per cell, σ²_ε + σ²_o(ts)j must be estimated as σ²_res.
All factors in the design are considered to be random, with
levels chosen at random from an infinite universe. The expected
mean square for each of the sources of variation is shown in Table
2. Unbiased estimators of each of the components of variance are
given in Table 3, in terms of the observed mean squares.
TABLE 2
Expected mean squares (all facets random; I teachers, K situations, M occasions within each teacher-situation combination, N judges)

Source of variation            Expected mean square
Teachers (t)                   σ²_res + Nσ²_o(ts) + Mσ²_tsj + KMσ²_tj + MNσ²_ts + KMNσ²_t
Situations (s)                 σ²_res + Nσ²_o(ts) + Mσ²_tsj + IMσ²_sj + MNσ²_ts + IMNσ²_s
Occasions within t × s         σ²_res + Nσ²_o(ts)
Judges (j)                     σ²_res + Mσ²_tsj + KMσ²_tj + IMσ²_sj + IKMσ²_j
t × s                          σ²_res + Nσ²_o(ts) + Mσ²_tsj + MNσ²_ts
t × j                          σ²_res + Mσ²_tsj + KMσ²_tj
s × j                          σ²_res + Mσ²_tsj + IMσ²_sj
t × s × j                      σ²_res + Mσ²_tsj
Residual (o(ts) × j, ε)        σ²_res

TABLE 3
Components of variance

σ̂²_t = (1/KMN)(MS_t − MS_ts − MS_tj + MS_tsj)
σ̂²_s = (1/IMN)(MS_s − MS_ts − MS_sj + MS_tsj)
σ̂²_o(ts) = (1/N)(MS_o(ts) − MS_res)
σ̂²_j = (1/IKM)(MS_j − MS_tj − MS_sj + MS_tsj)
σ̂²_ts = (1/MN)(MS_ts − MS_tsj − MS_o(ts) + MS_res)
σ̂²_tj = (1/KM)(MS_tj − MS_tsj)
σ̂²_sj = (1/IM)(MS_sj − MS_tsj)
σ̂²_tsj = (1/M)(MS_tsj − MS_res)
σ̂²_res = MS_res

In this analysis, three generalizability coefficients may be determined. The first, ρ²_t, provides a measure of the reliability with which teachers' behavior (within situations) may be observed. This coefficient may be estimated as:

ρ̂²_t = σ̂²_t / (σ̂²_t + σ̂²_e),

where σ̂²_t can be estimated as indicated in Table 3 and σ̂²_e by substituting in [18] the appropriate estimators from Table 3.

A second coefficient of generalizability, ρ²_s, provides an index of the reliability with which situations may be distinguished. This coefficient may be estimated as:

ρ̂²_s = σ̂²_s / (σ̂²_s + σ̂²_e).

If this coefficient is small then the implication is that behavior changes from situation to situation are either random (and, therefore, not lawful) or vary systematically from teacher to teacher in such a way that no overall differences among situations may be detected. If this latter is the case it will be detected by the third coefficient of generalizability proposed, viz.,

[19] ρ̂²_ts = σ̂²_ts / (σ̂²_ts + σ̂²_e).

This coefficient provides an index of the reliability with which observers can detect systematic variations among teachers in their changes in behavior from one situation to another.
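As a final sketch, the function below assumes the nine mean squares of Table 2 have already been computed and applies the estimators of Table 3 together with equation [18] to produce the three proposed coefficients; the function name and the illustrative mean-square values are invented for the example:

def g_coefficients(ms: dict, I: int, K: int, M: int, N: int) -> dict:
    """Variance components (Table 3) and the three generalizability
    coefficients, from mean squares keyed 't', 's', 'o(ts)', 'j', 'ts',
    'tj', 'sj', 'tsj', and 'res'."""
    c = {
        't':     (ms['t'] - ms['ts'] - ms['tj'] + ms['tsj']) / (K * M * N),
        's':     (ms['s'] - ms['ts'] - ms['sj'] + ms['tsj']) / (I * M * N),
        'o(ts)': (ms['o(ts)'] - ms['res']) / N,
        'j':     (ms['j'] - ms['tj'] - ms['sj'] + ms['tsj']) / (I * K * M),
        'ts':    (ms['ts'] - ms['tsj'] - ms['o(ts)'] + ms['res']) / (M * N),
        'tj':    (ms['tj'] - ms['tsj']) / (K * M),
        'sj':    (ms['sj'] - ms['tsj']) / (I * M),
        'tsj':   (ms['tsj'] - ms['res']) / M,
        'res':   ms['res'],
    }
    # Generic error variance, equation [18]; the o(ts) x j interaction and
    # the specific error are carried together by the residual component.
    err = c['o(ts)'] + c['j'] + c['tj'] + c['sj'] + c['tsj'] + c['res']
    return {'rho2_t': c['t'] / (c['t'] + err),
            'rho2_s': c['s'] / (c['s'] + err),
            'rho2_ts': c['ts'] / (c['ts'] + err)}

# Illustrative mean squares (invented) for 20 teachers, 10 situations,
# 2 occasions per teacher-situation cell, and 2 judges:
ms = {'t': 40.0, 's': 18.0, 'o(ts)': 2.0, 'j': 6.0,
      'ts': 9.0, 'tj': 3.0, 'sj': 2.5, 'tsj': 1.8, 'res': 1.2}
print(g_coefficients(ms, I=20, K=10, M=2, N=2))

Negative component estimates, which can occur in small samples, would ordinarily be set to zero before the coefficients are formed.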
Obviously considerable amounts of data need to be collected to
obtain these estimates, but without them it seems futile, on the
one hand, to treat all changes in behavior as though they were
lawful or, on the other, to treat all variations in behavior as
errors of measurement.
REFERENCES

American Educational Research Association. Technical recommendations for achievement tests. Washington, D. C.: NEA, 1955.

American Psychological Association. Technical recommendations for psychological tests and diagnostic techniques. Washington, D. C.: APA, 1954.

American Psychological Association. Standards for educational and psychological tests and manuals. Washington, D. C.: APA, 1966.

BELLACK, A. A., KLIEBARD, H. M., HYMAN, R. T., & SMITH, F. L., Jr. The language of the classroom. New York: Teachers College Press, Columbia University, 1966.
BROWN, B. B., MENDENHALL, W., & BEAVER, R. The reliability of observations of teachers' classroom behavior. Journal of Experimental Education, 1968, 36, 1-10.

CRONBACH, L. J., RAJARATNAM, N., & GLESER, G. Theory of generalizability: a liberalization of reliability theory. British Journal of Statistical Psychology, 1963, 16, 137-163.

CRONBACH, L. J., GLESER, G., & RAJARATNAM, N. Dependability of behavioral measurements. New York: Wiley, in press.

FLANDERS, N. A. The problems of observer training and reliability. In Amidon, E., and Hough, J. (Eds.) Interaction analysis: theory, research, and application. Massachusetts: Addison-Wesley, 1967.
FLANDERS, N. A., et al. Teacher influence patterns and pupil
achievement in second, fourth and sixth grade levels. Vols. I-II.
Cooperative Research Project No. 5-1055, U.S. Dept. of Health,
Education, and Welfare: Office of Education. University of
Michigan, 1969.
GLESER, G., CRONBACH, L. J., & RAJARATNAM, N.
Generalizability of scores influenced by multiple sources of
variance. Psychometrika, 1965, 30, 395-418.
LORD, F. M., & NOVICK, M. R. Statistical theories of mental test scores. Massachusetts: Addison-Wesley, 1968.
MEDLEY, D. M., & MITZEL, H. E. Measuring classroom behavior
by systematic observation. In N. L. Gage (Ed.) Handbook of research
on teaching. Chicago: Rand McNally, 1963.
SMITH, B. O., & MEUX, M. A study of the logic of teaching.
Cooperative Research Project No. 258, U.S. Dept. of Health,
Education, and Welfare: Office of Education. University of
Illinois, 1962.
TRAVERS, K. J. Correction for attenuation: a generalizability
approach using components of covariance. College of Education,
University of Illinois, 1969 (mimeo).
WEICK, K. E. Systematic observational methods. In G. Lindzey and
E. Aronson (Eds.) The Handbook of Social Psychology Vol. II (2nd
Ed.) Massachusetts: Addison-Wesley, 1968.
WESTBURY, I. The reliability of measures of classroom behavior.
Ontario Journal of Educational Research, 1967, 10, 125-138.
(Received June, 1971) (Revised September, 1971)
AUTHORS
McGAW, BARRY Address: Center for Instructional Research and
Curriculum Evaluation, College of Education, University of
Illinois, Urbana, Illinois 61801. Title: University of Illinois
Fellow. Age: 30. Degrees: B.Sc., B.Ed. (Hons.) University of
Queensland (Australia); M.Ed., University of Illinois.
Specialization: Measurement, learning, evaluation.
WARDROP, JAMES L. Address: Center for Instructional Research and
Curriculum Evaluation, College of Education, University of
Illinois, Urbana, Illinois 61801. Title: Associate Professor. Age:
30. Degrees: B.A., Ph.D., Washington University. Specialization:
Measurement, evaluation, statistics.
BUNDA, MARY ANNE Address: The Ohio State University, Evaluation
Center, 1712 Neil Avenue, Columbus, Ohio 43210. Title: Assistant
Professor. Age: 27. Degrees: B.S., M.Ed., Loyola University of
Chicago; Ph.D., University of Illinois. Specialization:
Measurement, statistics, evaluation.