Not All Mementos Are Created Equal: Measuring The Impact Of Missing Mementos

Not All Mementos Are Created Equal: Measuring the Impact of Missing Resources

Justin F. Brunelle, Mat Kelly, Hany SalahEldeen, Michele C. Weigle, Michael L. Nelson

Old Dominion University

{jbrunelle, mkelly, hany, mweigle, mln}@cs.odu.edu

Goal: Automatically measure the quality of the archives

[Figure: four example pages, labeled 20%, 14%, 28%, and 7% missing]

“Live” XKCD

• Missing 17% of embedded resources

• Looks complete

“Live” XKCD

• Take three resources:

• Logo

• Main Comic

• Navigation Strip

• Relative importance?

• All present in “Live” XKCD

Damaging XKCD

• Created a local memento

• Removed the logo and navigation strip

• Now missing 29% of embedded resources

• Human assessment: looks OK

Damaging XKCD

• From our local memento

• Removed the Main Comic

• Human assessment: Not a usable memento

• Percent of missing embedded resources is not a suitable metric for memento quality

Image Importance

• Size (as percentage of all pixels)

• Position (in viewport?)

• Centrality (in the vertical or horizontal center?)
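The three factors above (size, position, centrality) can be combined into a single score. The following is only a minimal sketch: the `Box` type, the function name, the equal weights, and the additive combination are all illustrative assumptions, not the weighting used by the authors.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Hypothetical bounding box of an embedded image, in page pixels."""
    left: float
    top: float
    width: float
    height: float


def image_importance(img, page_w, page_h, viewport_h,
                     w_size=1.0, w_viewport=1.0, w_central=1.0):
    """Illustrative importance score; the weights are assumptions."""
    # Size: the image's share of all pixels on the page.
    size = (img.width * img.height) / (page_w * page_h)

    # Position: does the image start inside the initial viewport?
    in_viewport = 1.0 if img.top < viewport_h else 0.0

    # Centrality: does the image cross the page's horizontal or
    # vertical center line?
    cx, cy = page_w / 2, page_h / 2
    h_central = img.left <= cx <= img.left + img.width
    v_central = img.top <= cy <= img.top + img.height
    central = 1.0 if (h_central or v_central) else 0.0

    return w_size * size + w_viewport * in_viewport + w_central * central
```

Under these assumptions, a large image that crosses the center of the page (such as the main comic) scores higher than a small logo in the top-left corner, which matches the human assessment on the earlier slides.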

Missing CSS

• Damage not limited to images

• When missing CSS, content shifts left

Missing CSS

• Partitioned snapshot into thirds

• Background color determined

• Pixel-by-pixel comparison

Missing CSS

• Calculated the amount of content in each vertical third

• If at least 80% of the content is in the left third and CSS is missing, the CSS is considered important

• Only performed if stylesheets are missing
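The check described on these slides can be sketched as follows. This is a hedged illustration: the snapshot is assumed to be a plain list of pixel rows, the background color is assumed to be the most frequent pixel value, and the function names are hypothetical. Only the thirds partition and the 80% threshold come from the slides.

```python
from collections import Counter


def content_by_thirds(pixels):
    """Count non-background pixels in the left, middle, and right thirds.

    pixels: list of rows, each a list of hashable color values.
    Assumption: the most common color is the background.
    """
    width = len(pixels[0])
    background = Counter(p for row in pixels for p in row).most_common(1)[0][0]
    counts = [0, 0, 0]
    for row in pixels:
        for x, p in enumerate(row):
            if p != background:
                counts[3 * x // width] += 1  # index of the vertical third
    return counts


def css_is_important(pixels, stylesheets_missing, threshold=0.80):
    """Flag CSS as important when content has collapsed to the left third."""
    if not stylesheets_missing:
        return False  # the check is only performed if stylesheets are missing
    counts = content_by_thirds(pixels)
    total = sum(counts)
    return total > 0 and counts[0] / total >= threshold
```

A snapshot whose content sits entirely in the left third is flagged only when a stylesheet is actually missing; a centered layout is not flagged.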

Percent Missing vs. Weighted Damage

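• $M_M$ = Percent of embedded resources missing

$$M_M = \frac{\text{Embedded Resources Missing}}{\text{Total Embedded Resources}}$$

• $D_M$ = Damage rating of missing embedded resources

$$D_M = \frac{D_{M_{\text{Actual}}}}{D_{M_{\text{Potential}}}}$$

$$D_{M_{\text{Potential}}} = \frac{\sum_{i=1}^{n_{[I|MM]}} D_{[I|MM]}(i)}{n_{[I|MM]}} + \frac{\sum_{i=1}^{n_{[C]}} D_{[C]}(i)}{n_{[C]}}$$

where $I$ = Image, $MM$ = MultiMedia, and $C$ = CSS.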
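To make the difference between the two metrics concrete, here is a small sketch that computes both for one page. The resource records, field names, and per-resource damage values are invented for illustration. The slide defines only the potential damage; computing the actual damage with the same formula restricted to missing resources is an assumption of this sketch.

```python
def percent_missing(resources):
    """Percent missing: missing embedded resources / total embedded resources."""
    return sum(r["missing"] for r in resources) / len(resources)


def _mean_damage(resources, kinds, only_missing):
    """Sum of damage over a resource group, divided by the group size."""
    group = [r for r in resources if r["kind"] in kinds]
    if not group:
        return 0.0
    total = sum(r["damage"] for r in group
                if r["missing"] or not only_missing)
    return total / len(group)


def damage_rating(resources):
    """Damage rating: actual damage / potential damage."""
    media = {"image", "multimedia"}
    potential = (_mean_damage(resources, media, only_missing=False)
                 + _mean_damage(resources, {"css"}, only_missing=False))
    # Assumption: actual damage counts only the resources that are missing.
    actual = (_mean_damage(resources, media, only_missing=True)
              + _mean_damage(resources, {"css"}, only_missing=True))
    return actual / potential if potential else 0.0


# Hypothetical page: logo and navigation strip missing, comic and CSS present.
page = [
    {"kind": "image", "damage": 0.1, "missing": True},   # logo
    {"kind": "image", "damage": 0.8, "missing": False},  # main comic
    {"kind": "image", "damage": 0.1, "missing": True},   # navigation strip
    {"kind": "css",   "damage": 0.5, "missing": False},
]
```

With these made-up weights, half of the resources are missing, yet the damage rating stays low because the missing resources carry little weight. That is the gap between the two metrics that the slides argue for.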

Calculated Damage

$M_M = 0.29$, $D_M = 0.36$

$M_M = 0.24$, $D_M = 0.41$

What do Web users think?

Setting up the Turk Test

• Amazon Mechanical Turk workers (“turkers”) stand in for real web users

• Two legs of the experiment:

• Manually damaged memento vs. live resource

• 10 manually damaged mementos and resources

• Real memento vs. real memento

• 100 URI-Rs, one memento per year

Quantifying Turker Response

• 5 turkers for each comparison

• Assume $D_A < D_B$ (i.e., A is less damaged)

• Measure turker agreement:

Each of the five turkers marks one image (Y); the Result row totals the marks per image:

Image A   Image B   Split   Outcome
5         0         5-0     Agreement
4         1         4-1     Agreement
0         5         0-5     No agreement!
3         2         3-2     Split decision: no agreement!

• Measure turker agreement: defined only by 4-1 and 5-0 splits
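The agreement rule above can be written as a short function. This is a sketch under stated assumptions: votes are encoded as the strings "A" and "B", the function name is hypothetical, and "agreement" is read as at least four of the five turkers choosing the image that the damage metric rates as less damaged (image A, given the assumption that A is less damaged).

```python
def classify_split(votes, expected="A"):
    """Return the split string and whether the turkers agree with the metric.

    votes:    list of "A" / "B" choices, one per turker.
    expected: the image the metric rates as less damaged (assumed "A").
    """
    a = votes.count("A")
    b = votes.count("B")
    votes_for_expected = a if expected == "A" else b
    # Only 5-0 and 4-1 splits in favor of the expected image count as agreement.
    agrees = votes_for_expected >= 4
    return f"{a}-{b}", agrees
```

For example, `classify_split(list("AAAAB"))` yields a 4-1 split that counts as agreement, while a 3-2 split or a 0-5 split does not.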

Turk Results

• Compared damage ($D_M$) and percent missing ($M_M$)

• M0: Manually damaged mementos

• D: Internet Archive mementos

• M: Percent missing in Internet Archive mementos

• $D_M$ vs. live: 78.9% true positives

• $M_M$ vs. live: 47.2% true positives

• Worse than a 50/50 chance!

• $D_M$ vs. $D_M$: 58.4% true positives

Damage in the Internet Archive

• 1,000 URI-Rs from Bitly

• 1,000 URI-Rs from Archive-It

• Remove non-HTML representations

• 1,861 URI-Rs remaining

• Sample 1 memento per year from Internet Archive

• Measure damage

• Measured Internet Archive mementos

• Damage ratings generally improve over time, even though mementos are missing more resources over time
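The "sample 1 memento per year" step could look like the sketch below. It assumes each memento is identified by a 14-digit timestamp string (YYYYMMDDhhmmss) and, arbitrarily, keeps the earliest capture in each year. The slides do not say which memento within a year was chosen, so that choice and the function name are assumptions.

```python
def one_memento_per_year(timestamps):
    """Pick one memento timestamp per calendar year.

    timestamps: iterable of 14-digit strings, e.g. "20100301000000".
    Assumption: the earliest capture of each year is kept.
    """
    chosen = {}
    for ts in sorted(timestamps):
        year = ts[:4]
        chosen.setdefault(year, ts)  # first (earliest) timestamp wins
    return chosen
```

The damage measurement would then be run once per selected memento, giving one data point per URI-R per year.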

Conclusions

• $D_M$ is a better measure of memento quality than $M_M$

• On average, the Internet Archive is improving its quality over time

• Internet Archive is also missing more embedded resources over time

• Improved damage weighting (58.4% correct can be improved)

• Measure cumulative temporal damage ratings

• E.g., a logo that never changes for 10 years and is used by 100 mementos is more important than one used in a single memento.
