
VALIDITY AND RELIABILITY

VALIDITY

In scientific research, validity refers to whether a study is able to scientifically answer the questions it is intended to answer.

Instrument selection is important for validity because instruments are used to collect data.

Data are used to make inferences related to the questions.

Thus, the inferences about the specific uses of an instrument should also be validated.

What do we mean by validity of inferences?

Our inferences should be relevant to the purpose of the study (appropriate).

If we want to see what our students’ attitudes are towards learning English, there is no use in making inferences using their scores in English tests.

Our inferences should be meaningful and correct.

We should say something about the meaning of the information we collect. E.g. What does a high score on a particular test mean?

Our inferences should be useful.

They should help researchers make a decision related to what they are trying to find out.

E.g. If you want to see positive effects of formative assessment on student achievement, you should have information that will help you infer whether your students’ achievement is affected by formative assessment or not.

Thus, validity depends on the amount and type of the evidence you have!

Kinds of evidence of validity

Content-Related evidence of validity

Content and format of the instrument: the degree to which an instrument logically appears to measure an intended variable

How appropriate is the content? Is the format appropriate? Does it logically get at the intended variable? How adequately does the sample of items or questions represent the content to be assessed?

Two points to consider in content-related evidence

i) adequacy of sampling

Whether the content of the instrument is an adequate sample of the domain of content it is supposed to represent.

E.g. If you want to see your students’ achievement at macro level, you should have a sufficient number of items that show this skill.

ii) format of the instrument

Clarity of printing, size of type, adequacy of work space, appropriateness of language, clarity of directions, etc.

E.g. If you want to see students’ attitudes towards English, the questionnaire should be in their native language if their level of target language proficiency is not high enough.

How do we obtain content-related evidence of validity?

Write out the definition of what you want to measure and give this definition (together with the instrument and the intended sample) to a number of judges.

The judges look at the definition and place a checkmark in front of each item in the instrument that they feel does not measure the objectives.

They also place a checkmark in front of each aspect of the definition that is not assessed by the instrument.

They evaluate the appropriateness of the format. Then the researcher rewrites the checked items and gives them back to the judges. This continues until all judges approve of all items.

Example

Judge No: ___________

Match to Portfolio Assessment Objectives (1 = No Match, 5 = Perfect Match)

A RANGE
1. ability to link ideas in a variety of ways: 1 2 3 4 5
2. ability to use wide range of genres (stories, reports, articles, etc.): 1 2 3 4 5
3. evidence of various topics: 1 2 3 4 5

B FLEXIBILITY
4. evidence of variations in the style, vocab, tone, lang., voice and ideas: 1 2 3 4 5
5. evidence for the appropriateness of style, vocab, tone, lang. and voice: 1 2 3 4 5

C CONNECTIONS
6. evidence of applications of already-known concepts to newly-learned ones: 1 2 3 4 5
7. evidence of new concepts and/or metaphors: 1 2 3 4 5

General Aims of the Portfolio Assessment System
1. improving students’ writing abilities
2. improving students’ metacognitive skills
3. leading students to become autonomous language learners

Specific objectives of the Portfolio Assessment System

I- Helping students improve their linguistic skills in writing from the point of
A) Grammar, punctuation and spelling
B) Vocabulary
C) Coherence and Cohesion

II- Helping students improve their metacognitive skills from the point of
A) Applying and/or creating new concepts or ideas
B) Using varieties in writing appropriately
C) Analysing and Synthesising what they have learned/read
D) Using other sources

III- Helping students become autonomous language learners from the point of
A) Applying their own views
B) Connecting other sources with what they know

Criterion-Related Evidence

Comparing performance on one instrument with performance on some other, independent measure (the criterion).

Two forms are available:

a) predictive validity: compares the scores on the original test with scores on one or more criterion measures obtained in a follow-up testing

b) concurrent validity: compares the test results with results obtained at about the same time through a parallel, substitute measure

In both forms, a correlation coefficient is used.

The correlation coefficient (r) shows the degree of relationship that exists between the scores individuals obtain on two instruments.

A positive relationship: a high (low) score on one instrument is accompanied by a high (low) score on the other.

A negative relationship: a high (low) score on one instrument is accompanied by a low (high) score on the other.

Correlation coefficients fall somewhere between +1.00 and -1.00. An r of .00 indicates that no relationship exists.
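To make the coefficient concrete, here is a minimal sketch of how r could be computed by hand. The function name `pearson_r` and all scores are invented for illustration; they are not taken from any real study.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two score lists of equal length (at least 2 scores)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    dev_x = [a - mean_x for a in x]
    dev_y = [b - mean_y for b in y]
    covariation = sum(a * b for a, b in zip(dev_x, dev_y))
    spread = sqrt(sum(a * a for a in dev_x) * sum(b * b for b in dev_y))
    if spread == 0:
        raise ValueError("correlation is undefined when one list has no variation")
    return covariation / spread

# Hypothetical scores for five students (illustration only).
placement_test = [55, 62, 70, 81, 90]   # original instrument
final_exam = [58, 60, 75, 79, 93]       # criterion measure from a follow-up testing
inverse_measure = [100 - s for s in placement_test]  # made-up measure that falls as scores rise

r_positive = pearson_r(placement_test, final_exam)       # close to +1.00
r_negative = pearson_r(placement_test, inverse_measure)  # close to -1.00
print(round(r_positive, 2), round(r_negative, 2))
```

The first pair illustrates a positive relationship (high goes with high), the second a negative one (high goes with low).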

Construct-Related Evidence

Establishing a link between the underlying theoretical construct we wish to measure and the visible performance we choose to observe

Construct validation consists of building a strong logical case based on circumstantial evidence that a test measures the construct it is intended to measure

Generally there are 3 steps:

i) the variable being measured is clearly defined

ii) hypotheses, based on a theory underlying the variable, are formed about how people who possess a lot versus a little of the variable will behave in a particular situation

iii) hypotheses are tested both logically and empirically

RELIABILITY

The consistency of the scores obtained.

It is possible to have quite reliable but invalid scores. (Unreliable scores can never be valid!)

What is desirable is to have both high reliability and high validity.

Errors of Measurement

When someone takes the same test twice, they rarely perform exactly the same, due to many factors.

Such factors result in errors of measurement.

Because of errors of measurement, researchers expect some variation in scores.

Reliability estimates help researchers have an idea of how much variation to expect.

This estimate is another application of the correlation coefficient, known as a reliability coefficient.

A reliability coefficient is again a relationship, but it is between scores of the same individuals on the same instrument at two different times, or between two parts of the same instrument.

There are three best-known ways to obtain a reliability coefficient.

1. Test-Retest Method

Administering the same test twice to the same group after a certain time. A reliability coefficient indicates the relationship between the two sets of scores obtained.

The reliability coefficient is affected by the length of the time interval. The longer the interval, the lower the reliability coefficient is likely to be.

The researcher should choose an interval over which the individuals would be expected to retain their relative position in the group.

Most of the time, a 1-3 month interval is sufficient!

2. Equivalent-Forms Method

Two different but equivalent (parallel) forms of an instrument are administered to the same group of individuals during the same period of time.

The questions (items) are different but they sample the same content.

A high reliability coefficient is strong evidence that the two forms are measuring the same thing.

3. Internal-Consistency Methods

There are several internal-consistency methods and they all require only a single administration of an instrument.

Split-half procedure

Two halves of a test (odd items vs even items) are scored separately and a correlation coefficient is calculated for the two sets of scores.

The Spearman-Brown prophecy formula is then used to estimate the reliability of the full-length test from this half-test correlation.

The reliability of a test (instrument) can be increased by adding more items.
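The split-half procedure can be sketched as follows. This is only an illustration: the function names and the 0/1 answer matrix are hypothetical, and the Spearman-Brown formula is used in its usual form, r_new = n·r / (1 + (n − 1)·r), where n is the factor by which the test is lengthened (n = 2 when going from a half to the whole test).

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def spearman_brown(r, factor=2):
    """Predicted reliability when a test is lengthened by `factor`."""
    return factor * r / (1 + (factor - 1) * r)

def split_half_reliability(item_scores):
    """item_scores: one row per person, one score per item."""
    odd_totals = [sum(row[0::2]) for row in item_scores]   # items 1, 3, 5, ...
    even_totals = [sum(row[1::2]) for row in item_scores]  # items 2, 4, 6, ...
    r_half = pearson_r(odd_totals, even_totals)
    return r_half, spearman_brown(r_half)

# Hypothetical answers of six students to six right/wrong items (1 = correct).
answers = [
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 0],
    [1, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 0, 0, 0, 0, 0],
    [0, 0, 0, 1, 0, 0],
]

r_half, r_full = split_half_reliability(answers)
r_doubled = spearman_brown(r_full)  # predicted reliability if the test were twice as long
print(round(r_half, 2), round(r_full, 2), round(r_doubled, 2))
```

The last line illustrates the point about test length: under the formula's assumptions, the predicted reliability rises when comparable items are added.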

Kuder-Richardson Approaches

Two formulas: KR20 and KR21

KR21 is used when all items are of equal difficulty: you need the number of items on the test, the mean, and the standard deviation

KR20 is more complicated but must be used when you cannot assume that all items are of equal difficulty
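A minimal sketch of both formulas, assuming the common textbook forms KR20 = k/(k−1) · (1 − Σpq / σ²) and KR21 = k/(k−1) · (1 − M(k−M) / (k·σ²)), where k is the number of items, p the proportion answering an item correctly, q = 1 − p, M the mean and σ² the variance of total scores. Using the population variance (dividing by N) is an assumption made here; the function names and data are hypothetical.

```python
def population_variance(values):
    """Variance with N (not N - 1) in the denominator."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def kr20(item_scores):
    """KR20 for right/wrong items; item_scores has one row of 0/1 per person."""
    n_people = len(item_scores)
    k = len(item_scores[0])
    var_total = population_variance([sum(row) for row in item_scores])
    sum_pq = 0.0
    for i in range(k):
        p = sum(row[i] for row in item_scores) / n_people  # proportion correct on item i
        sum_pq += p * (1 - p)
    return (k / (k - 1)) * (1 - sum_pq / var_total)

def kr21(k, mean, sd):
    """KR21 needs only the number of items, the mean, and the standard deviation."""
    return (k / (k - 1)) * (1 - mean * (k - mean) / (k * sd ** 2))

# Hypothetical answers of six students to six right/wrong items (1 = correct).
answers = [
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 0],
    [1, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 0, 0, 0, 0, 0],
    [0, 0, 0, 1, 0, 0],
]

totals = [sum(row) for row in answers]
mean_total = sum(totals) / len(totals)
sd_total = population_variance(totals) ** 0.5

kr20_value = kr20(answers)
kr21_value = kr21(len(answers[0]), mean_total, sd_total)
print(round(kr20_value, 2), round(kr21_value, 2))
```

In this made-up data the items differ in difficulty, so KR21 comes out lower than KR20; the two agree only when every item is equally difficult.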

Alpha Coefficient (Cronbach alpha) (α)

General form of the KR20 formula

Used to calculate the reliability of items that are not scored right versus wrong, e.g. some essays where more than one answer is possible
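As a rough sketch, alpha can be computed with the usual formula α = k/(k−1) · (1 − Σ(item variances) / variance of total scores). The function names and the essay ratings below are hypothetical, and using the population variance is an assumption of this example.

```python
def population_variance(values):
    """Variance with N (not N - 1) in the denominator."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def cronbach_alpha(item_scores):
    """item_scores: one row per person, one numeric score per item or criterion."""
    k = len(item_scores[0])
    item_variances = [
        population_variance([row[i] for row in item_scores]) for i in range(k)
    ]
    total_variance = population_variance([sum(row) for row in item_scores])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)

# Hypothetical essay ratings: five students, four criteria, each scored 1-5.
essay_ratings = [
    [4, 5, 4, 5],
    [3, 3, 4, 3],
    [2, 2, 3, 2],
    [5, 4, 5, 5],
    [1, 2, 1, 2],
]

alpha = cronbach_alpha(essay_ratings)
print(round(alpha, 2))
```

When every item is scored 0 or 1, each item variance equals p·q, so the same function gives the KR20 value.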

Scoring Agreement

When there is subjective evaluation (like essay scoring), there is the possibility of observer differences. In that case, scoring agreement should be reported.

Such cases require training to obtain as high reliability as possible.

The expected level is a correlation of at least .90 or at least 80% agreement.

In case of subjective rating, we can talk about two kinds of reliability:

Intra-rater reliability: similar to test-retest strategy.

The same raters score the papers of the same group of students on two separate occasions (e.g. two weeks apart).

Thus, the intra-rater reliability is an estimate of the consistency of judgments over time

Inter-rater reliability: similar to the equivalent-forms strategy since the scores are obtained from two different raters

Inter-rater reliability estimates the extent to which two or more raters agree on the score that should be assigned to a written sample.

A correlation coefficient is calculated between the scores. Then the obtained coefficient is adjusted by the use of the Spearman-Brown prophecy formula.
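The two ways of reporting scoring agreement can be sketched like this. All names and scores are hypothetical, percent agreement is taken here to mean exact agreement on the score, and the adjustment uses the general Spearman-Brown form n·r / (1 + (n − 1)·r) with n as the number of raters whose scores are pooled.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def percent_agreement(rater_a, rater_b):
    """Percentage of papers given exactly the same score by both raters."""
    same = sum(1 for a, b in zip(rater_a, rater_b) if a == b)
    return 100 * same / len(rater_a)

def adjusted_reliability(r, n_raters):
    """Spearman-Brown adjustment for the pooled scores of n_raters raters."""
    return n_raters * r / (1 + (n_raters - 1) * r)

# Hypothetical scores (1-5) given by two raters to the same ten essays.
rater_a = [4, 3, 5, 2, 4, 1, 3, 5, 2, 4]
rater_b = [4, 3, 4, 2, 4, 1, 3, 5, 3, 4]

agreement = percent_agreement(rater_a, rater_b)
r_between = pearson_r(rater_a, rater_b)
r_adjusted = adjusted_reliability(r_between, n_raters=2)
print(agreement, round(r_between, 2), round(r_adjusted, 2))
```

For intra-rater reliability the same functions could be applied to one rater's scores from the two occasions instead of two raters' scores.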
