Effects of Position and Number of Relevant Documents on Users' Evaluations of System Performance
A presentation by Meg Eastwood on the 2010 paper by D. Kelly, X. Fu, and C. Shah
INF 384H, September 26th, 2011
2. Diane Kelly
Associate Professor, School of Library and Information Science, UNC Chapel Hill
3. Ph.D., Rutgers University (Information Science)
4. MLS, Rutgers University (Information Retrieval)
5. BA, University of Alabama (Psychology and English)
6. Graduate Certificate in Cognitive Science, Rutgers Center for Cognitive Science
7. Primary Aim of Research
"to investigate the relationship between actual system performance and users' evaluations of system performance" (pg 9:2)
8. Secondary Aim of Research
"to develop an experimental method that can be used to isolate and study specific aspects of the search process" (pg 9:2)
9. Previous Experimental Protocols
Traditional lab-based: TREC Interactive Track; studies entire search episodes
Naturalistic: Thomas and Hawking (2006); trades control for ecological validity
"Both designs include so many variables that it can be difficult to establish causal relationships" (pg 9:2)
10. Literature Review
Main criticisms of previous studies:
Evaluation measures were calculated from TREC assessors' relevance judgments, not users' judgments
Users were not provided with explicit instructions
Users may have been fatigued
Low sample sizes
11. Methods
12. Studies 1 and 2: effect of position of relevant documents on users' evaluation of system performance
Study 3: effect of number of relevant documents
13. Participants were asked to help researchers evaluate four search engines
For each search engine, participants read a topic and posed one query
14. After issuing a query, all participants were redirected to the same results page with 10 standardized results
15. Participants were asked to evaluate the full text of each search result in the order presented and judge its relevance
16. After evaluating all the documents on the results page, participants were asked to evaluate the search engine
17. Study 1
Operationalized average precision at n
Subjects required to evaluate all 10 documents
18. Study 2
Also operationalized average precision at n
Subjects instructed to find five relevant documents
19. Study 3
Operationalized precision at n
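For readers unfamiliar with the measures operationalized in studies 1-3, a minimal sketch of precision at n and average precision over a ranked list of binary relevance judgments (these are the standard textbook definitions, not code from the paper):

```python
def precision_at_n(judgments, n):
    """Fraction of the first n ranked documents judged relevant.

    `judgments` is a list of 0/1 relevance judgments in rank order.
    """
    return sum(judgments[:n]) / n

def average_precision(judgments):
    """Mean of precision@k taken at each rank k where a relevant document appears."""
    precisions = [precision_at_n(judgments, k + 1)
                  for k, rel in enumerate(judgments) if rel]
    return sum(precisions) / len(precisions) if precisions else 0.0

# A 10-result list with relevant documents at ranks 1, 2, and 5:
run = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]
print(precision_at_n(run, 10))  # 0.3
print(average_precision(run))   # (1/1 + 2/2 + 3/5) / 3 ≈ 0.867
```

The distinction matters for the studies: average precision is sensitive to the *positions* of the relevant documents (studies 1 and 2), while precision at n depends only on their *number* within the top n (study 3).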
20. Topics and Documents
Selected topics associated with newspaper articles about current events
Selected documents with a "high probability of being judged relevant or not relevant" (pg 9:12)
21. Study Participants
"Convenience sample" (pg 9:27) of undergraduates from UNC
27 participants in each study (1-3)
Demographic information collected:
Sex
Age
Major
Search experience
Search frequency
22. Results: Relevance Assessments
23. Did users' relevance judgments agree with the baseline assessments?
24. Did users' relevance judgments agree with the baseline assessments?
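The comparison above asks how often users' binary relevance judgments matched the baseline assessments. A common way to quantify such agreement is raw percent agreement, optionally chance-corrected with Cohen's kappa; the sketch below is illustrative (the data are made up), not the authors' exact analysis:

```python
def percent_agreement(user, baseline):
    """Fraction of documents on which two sets of binary judgments agree."""
    return sum(u == b for u, b in zip(user, baseline)) / len(user)

def cohens_kappa(user, baseline):
    """Chance-corrected agreement between two binary raters."""
    n = len(user)
    po = percent_agreement(user, baseline)          # observed agreement
    p_user, p_base = sum(user) / n, sum(baseline) / n
    # Expected agreement if both raters judged independently at these rates:
    pe = p_user * p_base + (1 - p_user) * (1 - p_base)
    return (po - pe) / (1 - pe)

# Hypothetical judgments for one 10-document results page:
user     = [1, 1, 0, 1, 0, 0, 1, 0, 1, 0]
baseline = [1, 1, 0, 0, 0, 0, 1, 0, 1, 1]
print(percent_agreement(user, baseline))  # 0.8
print(cohens_kappa(user, baseline))       # 0.6
```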
25. Did the topic affect differences in relevance assessments?
26. How much did relevance assessments vary between documents?
27. Results: Evaluations of System Performance
28. Did participants modify evaluation ratings?
29. Participant ratings compared between performance levels and studies
30. Participant ratings compared between performance levels and studies
Study 1 showed no significant differences in ratings according to performance level
31. Participant ratings compared between performance levels and studies
Studies 2 and 3 did show significant differences in ratings according to performance level
32. What are the differences between study 1 and study 2?
Intended difference: subjects in study 2 were instructed to find five relevant documents
Completion time?
33. What are the differences between study 1 and study 2?
Unintended differences:
Instructions for study 2 provided a clearer performance objective
Did subjects feel more successful in study 2?
34. User-Experienced Precision
"experimental manipulations [of precision] were only 90% effective" (pg 9:24)
35. Are user-experienced precision values correlated with user ratings of system performance?
36. Are user-experienced precision values correlated with user ratings of system performance?
37. Regression analysis: can you use experienced precision to predict user evaluation?
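A regression of this kind can be sketched as a simple ordinary-least-squares fit of evaluation ratings on experienced precision. The data below are hypothetical, and the paper's actual analysis may have included additional predictors; this only shows the mechanics:

```python
def fit_line(x, y):
    """Ordinary least-squares fit y ≈ a + b*x; returns (intercept, slope)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sxy / sxx          # slope: rating change per unit of precision
    return my - b * mx, b  # intercept, slope

# Hypothetical data: user-experienced precision@10 vs. system rating
precision = [0.2, 0.3, 0.5, 0.6, 0.8, 0.9]
rating    = [2,   3,   4,   4,   6,   7]
a, b = fit_line(precision, rating)
print(round(a, 2), round(b, 2))  # intercept ≈ 0.67, slope ≈ 6.67
```

A positive, statistically significant slope would mean experienced precision is predictive of the evaluation rating, which is the question the slide poses.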
38. Authors' Discussion and Conclusions
"variations in precision at 10 scores have the greatest impact on subjects' evaluation ratings" (pg 9:26)
Thoughtful analysis of experimental caveats and generalizability of results:
Convenience sample of students
Only one genre of documents represented
Are these results specific to informational/exploratory tasks?
39. Suggested Class Discussion Topics
Areas where the experiment may have been too tightly controlled/artificial:
Controlling the order in which users could rate documents?
Areas where the experiment may not have been as controlled as the authors intended:
Allowing subjects to formulate their own queries
Did study 2 allow participants to feel successful?
Ten-point evaluation scale versus five-point evaluation scale?
40. References
Kelly, D., Fu, X., and Shah, C. 2010. Effects of position and number of relevant documents retrieved on users' evaluations of system performance. ACM Trans. Inf. Syst. 28, 2, Article 9 (May 2010), 29 pages. DOI: 10.1145/1740592.1740597. http://doi.acm.org/10.1145/1740592.1740597