Test tasks for speaking – balancing between authenticity and reliability


Raili Hildén, University of Helsinki, [email protected]

TBLT 2009, Lancaster

‘Tasks: context, purpose and use’, 3rd Biennial International Conference on Task-Based Language Teaching, 13-16 September 2009


Raili Hildén, 15.9.2009

BACKGROUND: HY-TALK PROJECT OF SPEAKING ASSESSMENT

The project is funded by the University of Helsinki.

Aim: to validate the illustrative scales of speaking included in the national core curricula for general education and upper secondary level by trialing a prototype test of speaking.

The subscales (overall task completion, fluency, pronunciation, range and accuracy) are empirically aligned to the relevant scales of the CEFR.

http://blogs.helsinki.fi/hy-talk/


THE CONCEPTUAL FRAMEWORK

Validity argumentation scheme for interpretation of the HY-Talk project data (adapted from Kane, 2001; Fulcher & Davidson, 2007, 164-174; Bachman, 2005)

The claim to be probed: “The illustrative scales of descriptors of oral proficiency included in the national core curricula for language education enable sufficiently valid conclusions on students' oral proficiency in general school education in Finland.”


THE PURPOSE OF THE HY-TALK STUDY

The validity claim is supported and challenged by warrants and rebuttals regarding relevance, utility (intended consequences) and sufficiency.


WARRANTS

The tasks used to elicit student performance correspond to pedagogic tasks and target language use (TLU) tasks of students at the age of general education. (utility)

The reliability of assessments based on the scale and on the tasks used to elicit performances is found to be high enough. (sufficiency)


BACKING TO SUPPORT THE UTILITY CLAIM

Rater and test taker feedback confirms the perceived authenticity of the tasks and the appropriateness of administration.

The level ratings correspond to the target levels in the curricula.


BACKING DATA TO SUPPORT THE SUFFICIENCY CLAIM

Statistical reliability evidence confirms a sufficient level of consistency across raters, tasks, languages and interlocutors.


COUNTERCLAIMS

The tasks used to elicit student performance correspond inadequately to pedagogic tasks or TLU tasks of students. (utility)

The link to the scale descriptors may be weak. (utility)

The level assignments do not match the target levels set in the curricula.

The reliability of assessments is not stable, but varies too much across tasks, raters or languages, or is caused by intervening variables or an inadequate evidence base. (sufficiency)


REBUTTAL DATA TO SUPPORT THE UTILITY CLAIM

Statistical evidence challenges the intended utility of the tasks.

Verbal data from students and teachers question the utility and/or sufficiency of the tasks for the purpose.


RESEARCH QUESTIONS

1. How high is the inter-rater reliability of the judgements?

2. How are the tasks and corresponding salient task features related to target level judgements, assessment criteria and their combination? (numeric data, analysed with Facets)

3. How are the tasks perceived by students and raters? (verbal data based on feedback sheets and audio recorded rating sessions)
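The Facets analysis named in question 2 rests on a many-facet Rasch model, in which the log-odds of a rating in category k rather than k-1 equal ability minus task difficulty minus rater severity minus a category threshold. The sketch below is only an illustration under that rating-scale formulation; the slides do not give the project's actual model specification, and all numeric values and function names are hypothetical.

```python
import math


def category_probabilities(ability, task_difficulty, rater_severity, thresholds):
    """Probabilities of categories 0..len(thresholds) under a rating-scale MFRM."""
    # Each step k adds (ability - difficulty - severity - threshold_k) to the
    # running log-numerator; category 0 is the reference with value 0.
    cumulative = [0.0]
    for threshold in thresholds:
        step = ability - task_difficulty - rater_severity - threshold
        cumulative.append(cumulative[-1] + step)
    # Normalise with a numerically stable softmax.
    peak = max(cumulative)
    weights = [math.exp(value - peak) for value in cumulative]
    total = sum(weights)
    return [weight / total for weight in weights]


def expected_score(probabilities):
    """Model-expected rating: sum of category index times its probability."""
    return sum(index * p for index, p in enumerate(probabilities))


# Hypothetical logit values: same student and task, two raters 1 logit apart.
thresholds = [-1.0, 0.0, 1.0]
lenient = category_probabilities(0.5, 0.0, -0.5, thresholds)
severe = category_probabilities(0.5, 0.0, 0.5, thresholds)

print("expected score, lenient rater:", round(expected_score(lenient), 2))
print("expected score, severe rater:", round(expected_score(severe), 2))
```

With everything else held equal, the more severe rater yields the lower expected rating, which is the effect the rater facet is meant to capture.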


SPEAKING TASKS

Tasks were designed to reflect the average target level specified for good mastery of the syllabus:

English (grade 7: A1.3, grade 1: A2.2)

German etc. (grade 7: A1.2, grade 1: A2.1)

They also draw on the thematic content of the curricula.

Discussed, revised and piloted by the project group.


PROTOTYPE TASKS (WITH EXAMPLES)

1. Presentation (A2.2): partly controlled monologue

2. Everyday life (A2.1 – A2.2): rigidly controlled dialogues
   At the airport, grade 7
   At home, grade 7
   Accommodation, grade 1
   On the way home, grade 1

3. Negotiation: partly controlled dialogue
   Planning an outing (A2.1 – B1.1)


SPEAKING TASKS

Prompts in L1

Time on task 10-15 min

Conducted in pairs

Rated by 5-10 language experts


DATA OF THIS STUDY

Speech samples in English (56)

Speech samples in German (66)


FACETS EXAMINED IN THIS STUDY

Raters (5 English, 7 German)

Tasks 1-4

Task dimensions: overall task performance, fluency, pronunciation, range, accuracy


RESULTS: RQ1 ENGLISH SAMPLES: OVERALL INTER-RATER AGREEMENT

The majority of total ratings were placed between levels 5 and 6 (CEFR A2-B1).

Across all facets and raters, the distance between the most severe and the most lenient rater was 1 logit (levels 5/6).

Average of ratings given by R4: 6.66

Average of ratings given by R1: 5.87

For a more detailed record, please contact the author.
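As a rough reading aid for the figures above: on a logit scale, a gap of 1 logit corresponds to multiplying the odds by e (about 2.72). The sketch below computes that factor and the gap between the two raw rater averages reported for the English samples. The variable names are illustrative, and treating R4 and R1 as the two extremes is an assumption based on the order in which the slide reports them.

```python
import math

# Reported distance between the most severe and the most lenient rater.
logit_gap = 1.0

# A difference of d logits multiplies the odds by exp(d).
odds_ratio = math.exp(logit_gap)

# Raw average ratings reported for the English samples.
average_r4 = 6.66
average_r1 = 5.87
raw_gap = round(average_r4 - average_r1, 2)

print(f"odds ratio for a {logit_gap:.0f}-logit gap: {odds_ratio:.2f}")
print(f"gap between raw rater averages: {raw_gap:.2f} scale levels")
```

Under a Rasch-type model this means that, for the same performance, the odds of receiving the higher of two adjacent levels are roughly 2.7 times greater from the most lenient rater than from the most severe one.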


RESULTS: RQ1 ENGLISH SAMPLES: OVERALL TASK DIFFICULTY

“The easiest” task: Presentation was assigned the highest fair average of 6.29.

“The trickiest” task: the Everyday life task “Accommodation” was assigned the lowest fair average of 6.21.



RESULTS: RQ1 ENGLISH SAMPLES: CRITERIA

“The easiest” criterion: Pronunciation (fair average 6.39)

“The trickiest” criterion: Range (fair average 6.02)



RESULTS: RQ1 ENGLISH SAMPLES: COMBINED DIFFICULTY = TASK + CRITERIA

“The easiest” combinations: Presentation + Accuracy; Presentation + Fluency

“The trickiest” combination: Everyday situation: Accommodation + Range



RESULTS: RQ1 GERMAN SAMPLES: OVERALL INTER-RATER AGREEMENT

The majority of total ratings were placed between levels 5-6/10 (CEFR A2-B1).

Across all facets and raters, the distance between the most severe and the most lenient rater was 1 logit (levels 5-6).

Average of ratings given by R6: 3.96/10

Average of ratings given by R2: 3.57/10



RESULTS: RQ1 GERMAN SAMPLES: OVERALL TASK DIFFICULTY

“The easiest” task: the Presentation task was assigned the highest fair average of 4.21/10.

“The trickiest” task: the Everyday life task “On the way home” was assigned the lowest fair average of 3.57/10.



RESULTS: RQ1 GERMAN SAMPLES: CRITERIA

“The easiest” criterion: Pronunciation (fair average 4.24/10)

“The trickiest” criterion: Range (fair average 3.49/10)
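The fair averages reported for the two languages can be put side by side to compare how far the easiest and the trickiest task or criterion lie apart. The sketch below uses only the extreme values given on the slides; intermediate tasks and criteria are not reported, and the comparison assumes that both sets of averages refer to the same 10-level scale.

```python
# Highest and lowest fair averages as reported on the slides.
fair_averages = {
    ("English", "task"): {
        "Presentation": 6.29,
        "Everyday life: Accommodation": 6.21,
    },
    ("English", "criterion"): {"Pronunciation": 6.39, "Range": 6.02},
    ("German", "task"): {
        "Presentation": 4.21,
        "Everyday life: On the way home": 3.57,
    },
    ("German", "criterion"): {"Pronunciation": 4.24, "Range": 3.49},
}


def spread(values):
    """Distance between the highest and the lowest fair average."""
    return round(max(values.values()) - min(values.values()), 2)


spreads = {key: spread(values) for key, values in fair_averages.items()}

for (language, facet), gap in spreads.items():
    print(f"{language} {facet} spread: {gap:.2f}")
```

On these figures the spread is smaller for the English samples than for the German samples, for both tasks and criteria.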



RESULTS: RQ1 GERMAN SAMPLES: COMBINED DIFFICULTY = TASK + CRITERIA

“The easiest” combination: Presentation + Pronunciation (level 6 = B1.1)

“The trickiest” combination: Negotiation (Planning an outing) + Range (level 5 = A2.2 lower band)



RQ2: ENGLISH & GERMAN

The tasks were conceived as authentic with regard to themes and situations.

Authenticity (Bachman & Palmer, 1996) was questioned by raters during the sessions because of the high degree of control imposed by the L1 prompts (intended to increase reliability).

Students regarded the tasks as relevant and highly probable in real life.

The raters of German discussed the interlocutor impact of the pair setting as a biasing factor.

The results suggest that the target level requirements set in the Finnish curricula are attained reasonably well.


DISCUSSION

The utility claim was confirmed with regard to the high level of rater agreement across facets (reliability).

Sufficiency and relevance were partly questioned because of the claimed inauthenticity of the tasks (rigour of instructions).

How should this dilemma be addressed in future versions of the test?


REFERENCES

Bachman, L. F. (2005). Building and supporting a case for test use. Language Assessment Quarterly, 2(1), 1–34.

Fulcher, G. & Davidson, F. (2007). Language Testing and Assessment: An advanced resource book. Abingdon & New York: Routledge.

Hildén, R. & Takala, S. (2007). Relating Descriptors of the Finnish School Scale to the CEF Overall Scales for Communicative Activities. In Koskensalo, A., Smeds, J., Kaikkonen, P. & Kohonen, V. (Eds.), Foreign languages and multicultural perspectives in the European context; Fremdsprachen und multikulturelle Perspektiven im europäischen Kontext. Dichtung, Wahrheit und Sprache (pp. 73–88). LIT-Verlag.


BIBLIOGRAPHY

National Core Curriculum for the Comprehensive School 2004. Helsinki: Finnish National Board of Education. In Finnish: http://www.oph.fi/info/ops/

National Core Curriculum for the Upper Secondary Level 2003. Helsinki: Finnish National Board of Education. In Finnish: http://www.oph.fi/pageLast.asp?path=1,17627,1830,23059

Kane, M. T. (2001). Current concerns in validity theory. Journal of Educational Measurement, 38(4), 319–342.





Thank you!
