Effects of Position and Number of Relevant Documents on Users' Evaluations of System Performance
A presentation by Meg Eastwood on the 2010 paper by D. Kelly, X. Fu, and C. Shah
INF 384H, September 26th, 2011
2. Diane Kelly
Associate Professor, School of Library and Information Science, UNC Chapel Hill
3. Ph.D., Rutgers University (Information Science)
4. MLS, Rutgers University (Information Retrieval)
5. BA, University of Alabama (Psychology and English)
6. Graduate Certificate in Cognitive Science, Rutgers Center for Cognitive Science
7. Primary Aim of Research
"to investigate the relationship between actual system performance and users' evaluations of system performance" (pg 9:2)
8. Secondary Aim of Research
"to develop an experimental method that can be used to isolate and study specific aspects of the search process" (pg 9:2)
9. Previous Experimental Protocols
Traditional lab-based: TREC Interactive Track; studies entire search episodes
Naturalistic: Thomas and Hawking (2006); trades control for ecological validity
"Both designs include so many variables that it can be difficult to establish causal relationships" (pg 9:2)
10. Literature Review
Main criticisms of previous studies:
Evaluation measures were calculated from TREC assessors' relevance judgments, not users' judgments
Users were not provided with explicit instructions
Users may have been fatigued
Low sample sizes
11. Methods
12. Studies 1 and 2: effect of position of relevant documents on users' evaluation of system performance
Study 3: effect of number of relevant documents
13. Participants were asked to help researchers evaluate four search engines
For each search engine, participants read a topic and posed one query
14. After issuing a query, all participants were redirected to the same results page with 10 standardized results
15. Participants were asked to evaluate the full text of each search result in the order presented and judge its relevance
16. After evaluating all the documents on the results page, participants were asked to evaluate the search engine
17. Study 1
Operationalized average precision at n
Subjects required to evaluate all 10 documents
18. Study 2
Also operationalized average precision at n
Subjects instructed to find five relevant documents
19. Study 3
Operationalized precision at n
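For readers unfamiliar with the measures operationalized in studies 1-3, a minimal sketch of precision at n and average precision over a ranked list of binary relevance judgments (these are the standard textbook definitions, not code from the paper):

```python
def precision_at_n(judgments, n):
    """Fraction of the first n ranked documents judged relevant.

    `judgments` is a list of 0/1 relevance judgments in rank order.
    """
    return sum(judgments[:n]) / n

def average_precision(judgments):
    """Mean of precision@k taken at each rank k where a relevant document appears."""
    precisions = [precision_at_n(judgments, k + 1)
                  for k, rel in enumerate(judgments) if rel]
    return sum(precisions) / len(precisions) if precisions else 0.0

# A 10-result list with relevant documents at ranks 1, 2, and 5:
run = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]
print(precision_at_n(run, 10))  # 0.3
print(average_precision(run))   # (1/1 + 2/2 + 3/5) / 3 ≈ 0.867
```

The distinction matters for the studies: average precision is sensitive to the *positions* of the relevant documents (studies 1 and 2), while precision at n depends only on their *number* within the top n (study 3).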
20. Topics and Documents
Selected topics associated with newspaper articles about current events
Selected documents with a "high probability of being judged relevant or not relevant" (pg 9:12)
21. Study Participants
"Convenience sample" (pg 9:27) of undergraduates from UNC
27 participants in each study (1-3)
Demographic information collected:
Sex
Age
Major
Search experience
Search frequency
22. Results: Relevance Assessments
23. Did users' relevance judgments agree with the baseline assessments?
24. Did users' relevance judgments agree with the baseline assessments?
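The comparison above asks how often users' binary relevance judgments matched the baseline assessments. A common way to quantify such agreement is raw percent agreement, optionally chance-corrected with Cohen's kappa; the sketch below is illustrative (the data are made up), not the authors' exact analysis:

```python
def percent_agreement(user, baseline):
    """Fraction of documents on which two sets of binary judgments agree."""
    return sum(u == b for u, b in zip(user, baseline)) / len(user)

def cohens_kappa(user, baseline):
    """Chance-corrected agreement between two binary raters."""
    n = len(user)
    po = percent_agreement(user, baseline)          # observed agreement
    p_user, p_base = sum(user) / n, sum(baseline) / n
    # Expected agreement if both raters judged independently at these rates:
    pe = p_user * p_base + (1 - p_user) * (1 - p_base)
    return (po - pe) / (1 - pe)

# Hypothetical judgments for one 10-document results page:
user     = [1, 1, 0, 1, 0, 0, 1, 0, 1, 0]
baseline = [1, 1, 0, 0, 0, 0, 1, 0, 1, 1]
print(percent_agreement(user, baseline))  # 0.8
print(cohens_kappa(user, baseline))       # 0.6
```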
25. Did the topic affect differences in relevance assessments?
26. How much did relevance assessments vary between documents?
27. Results: Evaluations of System Performance
28. Did participants modify evaluation ratings?
29. Participant ratings compared between performance levels and studies
30. Participant ratings compared between performance levels and studies
Study 1 showed no significant differences in ratings according to performance level
31. Participant ratings compared between performance levels and studies
Studies 2 and 3 did show significant differences in ratings according to performance level
32. What are the differences between study 1 and study 2?
Intended difference: subjects in study 2 were instructed to find five relevant documents
Completion time?
33. What are the differences between study 1 and study 2?
Unintended differences:
Instructions for study 2 provided a clearer performance objective
Did subjects feel more successful in study 2?
34. User-Experienced Precision
"experimental manipulations [of precision] were only 90% effective" (pg 9:24)
35. Are user-experienced precision values correlated with user ratings of system performance?
36. Are user-experienced precision values correlated with user ratings of system performance?
37. Regression analysis: can you use experienced precision to predict user evaluation?
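A regression of this kind can be sketched as a simple ordinary-least-squares fit of evaluation ratings on experienced precision. The data below are hypothetical, and the paper's actual analysis may have included additional predictors; this only shows the mechanics:

```python
def fit_line(x, y):
    """Ordinary least-squares fit y ≈ a + b*x; returns (intercept, slope)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sxy / sxx          # slope: rating change per unit of precision
    return my - b * mx, b  # intercept, slope

# Hypothetical data: user-experienced precision@10 vs. system rating
precision = [0.2, 0.3, 0.5, 0.6, 0.8, 0.9]
rating    = [2,   3,   4,   4,   6,   7]
a, b = fit_line(precision, rating)
print(round(a, 2), round(b, 2))  # intercept ≈ 0.67, slope ≈ 6.67
```

A positive, statistically significant slope would mean experienced precision is predictive of the evaluation rating, which is the question the slide poses.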
38. Authors' Discussion and Conclusions
"variations in precision at 10 scores have the greatest impact on subjects' evaluation ratings" (pg 9:26)
Thoughtful analysis of experimental caveats and generalizability of results:
Convenience sample of students
Only one genre of documents represented
Are these results specific to informational/exploratory tasks?
39. Suggested Class Discussion Topics
Areas where the experiment may have been too tightly controlled/artificial:
Controlling the order in which users could rate documents?
Areas where the experiment may not have been as controlled as the authors intended:
Allowing subjects to formulate their own queries
Did study 2 allow participants to feel successful?
Ten-point evaluation scale versus five-point evaluation scale?
40. References
Kelly, D., Fu, X., and Shah, C. 2010. Effects of position and number of relevant documents retrieved on users' evaluations of system performance. ACM Trans. Inf. Syst. 28, 2, Article 9 (May 2010), 29 pages. DOI: 10.1145/1740592.1740597. http://doi.acm.org/10.1145/1740592.1740597