TEST TASKS FOR SPEAKING – BALANCING BETWEEN AUTHENTICITY AND RELIABILITY
Raili Hildén, University of Helsinki, Finland
[email protected]
TBLT 2009 Lancaster, ‘Tasks: context, purpose and use’, 3rd Biennial International Conference on Task-Based Language Teaching, 13-16 September 2009
THE CONCEPTUAL FRAMEWORK
Validity argumentation scheme for interpretation of the HY-Talk project data (adapted from Kane, 2001; Fulcher & Davidson, 2007, 164–174; Bachman, 2005)
The claim to be probed: “The illustrative scales of descriptors of oral proficiency included in the national core curricula for language education enable sufficiently valid conclusions on students' oral proficiency in general school education in Finland.”
Raili Hildén 15.9.2009 4
THE PURPOSE OF THE HY-TALK STUDY
The validity claim is supported and [...] performance correspond to pedagogic tasks and target language use tasks of students at the age of general education. (utility)
Reliability of assessments based on the scale and the tasks to elicit performances is found to be high enough. (sufficiency)
BACKING TO SUPPORT THE UTILITY CLAIM
Rater and test taker feedback confirm the perceived authenticity of the tasks and appropriateness of administration.
The level ratings correspond to the target levels in the curricula.
BACKING DATA TO SUPPORT THE SUFFICIENCY CLAIM
Statistical reliability evidence confirms a sufficient level of consistency across raters, tasks, languages and interlocutors.
COUNTERCLAIMS
The tasks used to elicit student performance correspond inadequately to pedagogic tasks or TLU tasks of students. (utility)
The link to the scale descriptors may be weak. (utility)
The level assignments do not match the target levels set in the curricula.
Reliability of assessments is not stable, but varies too much across tasks, raters or languages, or is caused by intervening variables or inadequate evidence base. (sufficiency)
REBUTTAL DATA TO SUPPORT THE UTILITY CLAIM
Statistical evidence challenges the intended utility of the tasks.
Verbal data from students and teachers question the utility and/or sufficiency of the tasks for the purpose.
RESEARCH QUESTIONS
1. How high is the inter-rater reliability of the judgements?
2. How are the tasks and corresponding salient task features related to target level judgements, assessment criteria and their combination? (numeric data, analysed with Facets)
3. How are the tasks perceived by students and raters? (verbal data based on feedback sheets and audio-recorded rating sessions)
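Facets is software for many-facet Rasch measurement. The slides do not state the exact model specification, so the following is only a sketch of a common rating-scale formulation, under the assumption that test takers, tasks, criteria and raters were entered as facets. All symbols are introduced here for illustration and do not come from the slides.

```latex
% Hypothetical notation (not from the slides):
%   P_{ntcjk} : probability that test taker n is awarded level k
%               on task t and criterion c by rater j
%   B_n : ability of test taker n      D_t : difficulty of task t
%   E_c : difficulty of criterion c    C_j : severity of rater j
%   F_k : threshold of level k relative to level k-1
\[
\log\frac{P_{ntcjk}}{P_{ntcj(k-1)}} = B_n - D_t - E_c - C_j - F_k
\]
```

All terms on the right-hand side are expressed in logits, which is why rater severity, task difficulty and criterion difficulty can be compared on one scale.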
SPEAKING TASKS
Tasks were designed to reflect the average target level specified for good mastery of the syllabus
  English (grade 7: A1.3, grade 1: A2.2)
  German etc. (grade 7: A1.2, grade 1: A2.1)
They also draw on the thematic content of the curricula
Discussed, revised and piloted by the [...]
[...] controlled dialogues
  At the airport, grade 7
  At home, grade 7
  Accommodation, grade 1
  On the way home, grade 1
3. Negotiation: partly controlled dialogue
  Planning an outing (A2.1 – B1.1)
SPEAKING TASKS
Prompts in L1
Time on task 10-15 min
Conducted in pairs
Rated by 5-10 language experts
DATA OF THIS STUDY
Speech samples in English (56)
Speech samples in German (66)
FACETS EXAMINED IN THIS STUDY
Raters (5 English, 7 German)
Tasks 1-4
Task dimensions
  Overall task performance
  Fluency
  Pronunciation
  Range
  Accuracy
RESULTS: RQ1 ENGLISH SAMPLES: OVERALL INTER-RATER AGREEMENT
The majority of total ratings were placed between levels 5-6 (CEFR A2-B1)
Across all facets, the distance between the most severe and the most lenient rater was 1 logit (levels 5/6)
  Average of ratings given by R4: 6.66
  Average of ratings given by R1: 5.87
For more detailed record please contact the author.
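To make the size of the rater effect concrete, the short sketch below works through the two figures reported above. It assumes the standard Rasch reading of a logit difference as a log-odds difference; the variable names are illustrative and the calculation is not part of the original analysis.

```python
import math

# Raw averages reported on the slide (scale levels, not logits).
rater_averages = {"R1": 5.87, "R4": 6.66}

# Gap between the highest and the lowest rater average, in scale levels.
raw_gap = max(rater_averages.values()) - min(rater_averages.values())

# Reported severity distance in logits; exp() converts a logit gap
# into an odds ratio under the Rasch model (assumed interpretation).
severity_gap_logits = 1.0
odds_ratio = math.exp(severity_gap_logits)

print(f"Raw gap between rater averages: {raw_gap:.2f} levels")
print(f"Odds ratio implied by a 1-logit gap: {odds_ratio:.2f}")
```

Under this reading, the same performance has roughly 2.7 times higher odds of being placed in the higher of two adjacent levels by the most lenient rater than by the most severe one, while the raw averages differ by less than one scale level.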
RESULTS: RQ1 ENGLISH SAMPLES: OVERALL TASK DIFFICULTY
“The easiest” task: Presentation was assigned the highest fair average of 6.29
“The trickiest” task: Everyday life task “Accommodation” was assigned the lowest fair average of 6.21
RESULTS: RQ1 ENGLISH SAMPLES: CRITERIA
“The easiest” criterion: Pronunciation (fair average 6.39)
“The trickiest” criterion: Range (fair average 6.02)
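As a rough illustration of how little the tasks differ compared with the criteria, the sketch below computes the spread between the highest and lowest fair averages reported for the English samples. Only the extreme values are given on the slides, so the dictionaries hold just those; the function name `spread` is hypothetical.

```python
# Fair averages as reported on the slides (English samples, extremes only).
task_fair_avg = {"Presentation": 6.29, "Accommodation": 6.21}
criterion_fair_avg = {"Pronunciation": 6.39, "Range": 6.02}


def spread(fair_averages):
    """Return the gap between the highest and lowest fair average."""
    values = fair_averages.values()
    return round(max(values) - min(values), 2)


task_spread = spread(task_fair_avg)
criterion_spread = spread(criterion_fair_avg)

print(f"Spread across tasks: {task_spread:.2f} levels")
print(f"Spread across criteria: {criterion_spread:.2f} levels")
```

On these figures the tasks are separated by 0.08 of a level and the criteria by 0.37, so the choice of criterion appears to matter more than the choice of task.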
RESULTS: RQ1 ENGLISH SAMPLES: COMBINED DIFFICULTY = TASK + CRITERIA
“The trickiest” combination: Negotiation (Planning an outing) + Range (level 5 = A2.2 lower band)
RQ2: ENGLISH & GERMAN
The tasks were conceived as authentic with regard to themes and situations
Authenticity (Bachman & Palmer, 1996) was questioned by raters during the sessions due to the high degree of control imposed by the L1 prompts (to increase reliability)
Students regarded the tasks as relevant and highly probable in real life.
The raters of German discussed the interlocutor impact of the pair setting as a biasing factor.
The results suggest that the target level requirements set in the Finnish curricula are attained reasonably well.
DISCUSSION
Utility claim was confirmed as to the high level of agreement of raters across facets (reliability)
Sufficiency and relevance were partly questioned due to the claimed inauthenticity of the task (rigor of instructions)
How should this dilemma be addressed in future versions of the test?
REFERENCES
Bachman, L. F. (2005). Building and supporting a case for test use. Language Assessment Quarterly, 2(1), 1–34.
Fulcher, G. & Davidson, F. (2007). Language Testing and Assessment: An advanced resource book. Abingdon & New York: Routledge.
Hildén, R. & Takala, S. (2007). Relating Descriptors of the Finnish School Scale to the CEF Overall Scales for Communicative Activities. In Koskensalo, A., Smeds, J., Kaikkonen, P. & Kohonen, V. (Eds.), Foreign languages and multicultural perspectives in the European context; Fremdsprachen und multikulturelle Perspektiven im europäischen Kontext. Dichtung, Wahrheit und Sprache (pp. 73–88). LIT-Verlag.
BIBLIOGRAPHY
National Core Curriculum for the Comprehensive School 2004. Helsinki: Finnish National Board of Education. In Finnish: http://www.oph.fi/info/ops/
National Core Curriculum for the Upper Secondary Level 2003. Helsinki: Finnish National Board of Education. In Finnish