Leabharlann UCD
An Coláiste Ollscoile, Baile Átha Cliath, Belfield, Baile Átha Cliath 4, Eire
UCD Library
University College Dublin, Belfield, Dublin 4, Ireland
Joseph Greene
Research Repository Librarian
University College Dublin
[email protected]
http://researchrepository.ucd.ie
#iCanHazRobot?
Improved robot detection for IR usage statistics
Open Repositories 2016
Dublin, 14 June
Overview and take-home points
• Usage stats are important
  – (go to the Usage Stats panel on Thursday, 16/Jun/2016: 11:00am - 12:30pm)
• Robot filtration is a problem, especially in repositories
• Robot detection has an exponential effect on usage stats’ accuracy in repositories
• 2-3 ways to improve DSpace and EPrints’ usage stats by 20% or more will be demonstrated
Experimental study
• Simple random sample of 2 years of UCD repository’s download data
  – n=341, N=3.3 million; 96.20% certainty
• Manually checked to determine if robot or human
• Applied DSpace, EPrints robot detection algorithms to the dataset
  – This is an EXPERIMENT, simulating algorithms on a DSpace repository’s usage data and Apache logs
  – The data is real, live data, and the algorithms were very easy to simulate
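For a rough sense of how much sampling error a sample of n=341 carries, the standard normal-approximation margin of error for a proportion can be computed as below. This is only an illustrative sketch: the 95% confidence level (z = 1.96) and the use of the 85% robot share as the proportion are assumptions made here, and the slide does not say how its "96.20% certainty" figure was calculated.

```python
import math


def margin_of_error(p, n, population=None, z=1.96):
    """Normal-approximation margin of error for a sample proportion.

    p: observed proportion, n: sample size, z: critical value
    (1.96 is assumed here, i.e. a 95% confidence level).
    """
    se = math.sqrt(p * (1 - p) / n)
    if population is not None:
        # Finite population correction; negligible when n is tiny next to N.
        se *= math.sqrt((population - n) / (population - 1))
    return z * se


# n and N come from the slide; p = 0.85 is the robot share reported later.
moe = margin_of_error(p=0.85, n=341, population=3_300_000)
print(f"margin of error: +/- {moe:.1%}")
```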
First finding
85% of unfiltered repository downloads come from robots
• This is confirmed by a 2013 IRUS-UK white paper on 20 IRs, which also found 85% of downloads to come from robots
Catching more robots improves stats
(But how much depends on the number of robots)
[Figure: accuracy of download stats (inverse precision) plotted against recall (robots), both axes from 0 to 1. Annotations: "Catch more robots", "Get better stats". One curve per traffic profile:]
• Typical website, 15% robot traffic
• OA journal, 40% robot
• Internet Archive, 91% robot
• OA repositories, 85% robot
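The relationship between recall and the accuracy of the reported stats can be sketched numerically. The model below is an assumption, not taken from the slides: it supposes that no human download is ever discarded (perfect precision), so the reported count is all human downloads plus whatever robots were missed. The function name `stats_accuracy` is hypothetical; the robot shares are the ones quoted for the figure.

```python
def stats_accuracy(recall, robot_share):
    """Share of reported downloads that are human, for a given robot recall.

    Assumes perfect precision: no human download is ever thrown away.
    """
    humans = 1.0 - robot_share
    robots_missed = robot_share * (1.0 - recall)
    return humans / (humans + robots_missed)


# Robot shares as quoted for the figure.
ROBOT_SHARES = {
    "Typical website": 0.15,
    "OA journal": 0.40,
    "OA repositories": 0.85,
    "Internet Archive": 0.91,
}

for name, share in ROBOT_SHARES.items():
    row = ", ".join(
        f"{stats_accuracy(r, share):.2f}" for r in (0.0, 0.5, 0.9, 1.0)
    )
    print(f"{name:17s} accuracy at recall 0, 0.5, 0.9, 1: {row}")
```

Under this simplified model, a site with 15% robot traffic is already about 98% accurate at 90% recall, while a repository with 85% robot traffic reaches only about 64% at the same recall.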
Robot detection techniques used
Systems compared: DSpace, EPrints, Minho DSpace Statistics Add-on

Rate of requests: ✓3
User agent string: ✓ ✓ ✓
robots.txt access: ✓
Volume of requests: ✓2 ✓3
List of known robot IP addresses: ✓ ✓
Reverse DNS name lookup: ✓1
Trap file: ✓
User agents per IP address
Width of traversal in the URL space: ✓3

1 Only implemented nominally or experimentally
2 Via the repeat download or ‘double-click’ filter
3 Data available as a configurable report for manual decision making
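Two of the techniques listed, user agent string matching and rate of requests, can be combined in a few lines. The sketch below is illustrative only and is not the DSpace, EPrints or Minho implementation: the regular expression, the 30-requests-per-minute threshold and the function name `flag_robots` are all hypothetical.

```python
import re
from collections import defaultdict

# Hypothetical patterns for illustration; production lists are far longer.
ROBOT_UA = re.compile(r"bot|crawl|spider|slurp", re.IGNORECASE)


def flag_robots(requests, max_per_minute=30):
    """Return the set of IP addresses flagged as robots.

    requests: iterable of (ip, unix_timestamp, user_agent) tuples.
    An IP is flagged if any of its user agents matches ROBOT_UA, or if it
    makes more than max_per_minute requests within one clock minute.
    """
    flagged = set()
    per_minute = defaultdict(int)
    for ip, ts, user_agent in requests:
        # Technique 1: user agent string.
        if ROBOT_UA.search(user_agent or ""):
            flagged.add(ip)
        # Technique 2: rate of requests, counted per (IP, minute) bucket.
        bucket = (ip, int(ts) // 60)
        per_minute[bucket] += 1
        if per_minute[bucket] > max_per_minute:
            flagged.add(ip)
    return flagged
```

A robot that sends a browser-like user agent and stays under the rate threshold slips through both checks.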
Measurements used in robot detection
• All measurements are a number between 0 and 1
• Recall: proportion of robots detected
  – I can haz robot?
• Precision: proportion of true positives in robot detection
  – Proportion of discounted downloads that are actually made by robots (sometimes humans are counted as robots)
• Accuracy of download stats, measured as inverse precision:
  – Proportion of reported downloads that are actually made by humans
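The three measurements can be computed from a manually labelled sample as follows. This is a minimal sketch that treats "robot" as the positive class; the function name and the input format (two parallel lists of booleans) are assumptions made for illustration.

```python
def robot_detection_metrics(is_robot, flagged):
    """Compute recall, precision and inverse precision.

    is_robot[i]: True if download i was really made by a robot (manual check).
    flagged[i]:  True if the detector discounted download i as a robot.
    Returns None for a measure whose denominator is zero.
    """
    pairs = list(zip(is_robot, flagged))
    tp = sum(1 for r, f in pairs if r and f)          # robots caught
    fn = sum(1 for r, f in pairs if r and not f)      # robots missed
    fp = sum(1 for r, f in pairs if not r and f)      # humans discounted
    tn = sum(1 for r, f in pairs if not r and not f)  # humans counted

    def ratio(num, den):
        return num / den if den else None

    return {
        "recall": ratio(tp, tp + fn),
        "precision": ratio(tp, tp + fp),
        # Share of the downloads left in the stats that are really human.
        "inverse_precision": ratio(tn, tn + fn),
    }
```

Inverse precision is what a repository user ultimately sees: of the downloads still counted after filtering, the fraction that came from people.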
How they perform, out-of-the-box
[Bar chart: "Robot detection in OA IR systems". Vertical axis from 0 to 1. Bars for Recall, Precision, and Negative precision (accuracy of download stats) for each of: DSpace; EPrints; Minho; Minho with monthly manual checking; No robot detection.]
Room for improvement?
1. Ability to manually check for outliers
• At UCD, once a month, we check:
  – Daily downloads for the last 2-4 months
  – Top 10 most downloaded items
  – Top 20 downloading IP addresses for the last 2-4 months
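The three monthly reports above can be produced from a flat list of download records. The sketch below assumes each record is an (ip, item_id, date) tuple already restricted to the period of interest; the record layout and function names are hypothetical and do not describe how UCD generates its reports.

```python
from collections import Counter


def daily_downloads(downloads):
    """Total downloads per day, in date order."""
    return sorted(Counter(day for _ip, _item, day in downloads).items())


def top_items(downloads, n=10):
    """The n most downloaded items with their counts."""
    return Counter(item for _ip, item, _day in downloads).most_common(n)


def top_downloaders(downloads, n=20):
    """The n IP addresses with the most downloads."""
    return Counter(ip for ip, _item, _day in downloads).most_common(n)
```

A sudden spike in the daily totals, or a single IP address or item far ahead of the rest, is the kind of outlier a person would then inspect by hand.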
[Bar chart: "Robots caught (Recall)". Vertical axis from 0 to 1. Bars for DSpace, EPrints and Minho.]
[Bar chart: "Accuracy of reported download stats (Inverse precision)". Vertical axis from 0 to 1. Bars for DSpace, EPrints, Minho and without robot detection.]