Summarizing archival collections using storytelling techniques Yasmin AlNoamany Michele C. Weigle Michael L. Nelson Old Dominion University Web Science & Digital Libraries Research Group www.cs.odu.edu/~mln/ @phonedude_mln Research Funded by IMLS LG-71-15-0077-15 Dodging the Memory Hole Los Angeles, CA, 2016-10-14
Nelson, Michael: Summarizing Archival Collections Using Storytelling Techniques
2
Archive-It, a subscription-based service, allows creation of collections
> 3,000 collections
~340 institutions
> 10B archived pages
3
Collection title
Collection categorization based on the curator
Seed URI
Metadata about the collection
Text search box
The group that the resource belongs to
List of the seed URIs
Timespan of the resource and the number of times it has been captured
4
Collection understanding and collection summarization are not currently supported.
It is not easy to answer “what’s in that collection?” or “how is this collection different from others?”
5
There is more than one collection about “Egyptian Revolution”
• “2010-2011 Arab Spring” https://archive-it.org/collections/3101
• “North Africa & the Middle East 2011-2013” https://archive-it.org/collections/2349
• “Egypt Revolution and Politics” https://archive-it.org/collections/2358
9
Our early attempts at collection understanding tried to include everything…
“Visualizing digital collections at Archive-It”, JCDL 2012.
http://ws-dl.blogspot.com/2012/08/2012-08-10-ms-thesis-visualizing.html
Schneider and McCown, “First Steps in Archiving the Mobile Web: Automated Discovery of Mobile Websites”, JCDL 2013.
Kelly et al., “A Method for Identifying Personalized Representations in Web Archives”, D-Lib Magazine 2013.
25
Fixed Page, Sliding Time
[Figure: snapshots dated Feb 1, Feb 1, Feb 2, Feb 4, Feb 5, Feb 7, Feb 9, Feb 11, Feb 11]
26
Sliding Page, Fixed Time
Feb. 11, 2011: Mubarak resigns
27
Sliding Page, Sliding Time
[Figure: snapshots dated Jan 25, Jan 27, Jan 31, Feb 2, Feb 4, Feb 7, Feb 10, Feb 11, Feb 11]
28
The Dark and Stormy Archives (DSA) framework
Establish a baseline
• Characteristics of human-generated stories
• Characteristics of Archive-It collections
Reduce the candidate pool of archived pages
• Exclude duplicates
• Exclude off-topic pages
• Exclude non-English language pages
Select good representative pages
• Dynamically slice the collection
• Cluster the pages in each slice
• Select high-quality pages from each cluster
• Order pages by time
• Visualize
https://pbs.twimg.com/media/BQcpj7ACMAAHRp4.jpg
29
Establish a baseline of social media stories
“Characteristics of Social Media Stories”, TPDL 2015, IJDL 2016.
30
What is the length of a story (the number of resources per story)?
This story has 31 resources
31
What are the types of resources that compose a story?
This story has:
• 19 quotes
• 8 images
• 4 videos
32
What are the most frequently used domains?
This story has:
• 90% twitter.com
• 7% instagram.com
• 3% facebook.com
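The per-story domain breakdown above can be sketched in a few lines. This is only an illustration, not the authors' tooling: the function name `domain_shares` and the sample URIs are hypothetical, and the sketch groups by hostname (after dropping a leading "www.") rather than by registered domain.

```python
from collections import Counter
from urllib.parse import urlsplit


def domain_shares(uris):
    """Return {hostname: share of resources}, most frequent first."""
    hosts = []
    for uri in uris:
        host = urlsplit(uri).hostname or ""
        # Assumption: treat www.example.com and example.com as the same domain.
        if host.startswith("www."):
            host = host[4:]
        hosts.append(host)
    if not hosts:
        return {}
    counts = Counter(hosts)
    return {host: n / len(hosts) for host, n in counts.most_common()}


# Hypothetical story with 10 resources (made-up URIs, for illustration only).
story = ["https://twitter.com/user/status/%d" % i for i in range(8)] + [
    "https://www.instagram.com/p/abc/",
    "https://www.facebook.com/page/posts/1",
]
shares = domain_shares(story)
for host, share in shares.items():
    print(f"{share:.0%} {host}")
```

Grouping by hostname keeps the sketch dependency-free; mapping hostnames such as abcnews.go.com to a registered domain would need a public-suffix list.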
33
The top 25 domains represent 92% of all domains
34
What differentiates a popular story? (popular = stories with the top 25% of views)
19,795 views 64 views
35
The distributions for the features of the stories
• Based on the Kruskal-Wallis test, at the p ≤ 0.05 significance level, the popular and the unpopular stories differ in most of the features
• Compared with unpopular stories, popular stories tend to have:
  • more web elements (medians of 28 vs. 21)
  • a longer timespan (5 hours vs. 2 hours)
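For readers unfamiliar with the test named above, here is a minimal standard-library sketch of a two-group Kruskal-Wallis comparison. It is not the authors' analysis code: the function names and the sample values are made up for illustration, and the p-value shortcut used here is valid only for two groups (one degree of freedom).

```python
import math
from collections import Counter


def average_ranks(values):
    """Rank values 1..N, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def kruskal_wallis_two_groups(a, b):
    """Return (H, p) for two samples; assumes the pooled values are not all equal."""
    pooled = list(a) + list(b)
    n = len(pooled)
    ranks = average_ranks(pooled)
    sum_a, sum_b = sum(ranks[:len(a)]), sum(ranks[len(a):])
    h = 12 / (n * (n + 1)) * (sum_a ** 2 / len(a) + sum_b ** 2 / len(b)) - 3 * (n + 1)
    # Standard tie correction: divide by 1 - sum(t^3 - t) / (N^3 - N).
    ties = Counter(pooled).values()
    h /= 1 - sum(t ** 3 - t for t in ties) / (n ** 3 - n)
    h = max(h, 0.0)  # guard against tiny negative rounding error
    # Chi-square survival function with 1 degree of freedom.
    p = math.erfc(math.sqrt(h / 2))
    return h, p


# Made-up web-element counts, for illustration only.
popular = [31, 28, 40, 25, 35]
unpopular = [21, 18, 24, 20, 27]
h, p = kruskal_wallis_two_groups(popular, unpopular)
print(f"H = {h:.3f}, p = {p:.4f}, differ at 0.05: {p <= 0.05}")
```

In practice a statistics library would normally be used instead; the hand-rolled version is here only to show what the test computes from the ranks.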
36
Do popular stories have a lower decay rate?
The 75th percentile of decay rate per popular story is 10% of the resources, while it is 15% in the unpopular stories
37
We found that 28 mementos is a good number for the resources in the stories.
38
Establish a baseline of current Archive-It collections
"Characteristics of Social Media Stories. What makes a good story?", International Journal on Digital Libraries 2016.
39
The mean and median number of URIs in a collection
This collection has 435 seed URIs
40
The mean and median number of mementos per URI
This seed URI has 16 mementos
41
The most frequently used domains
abcnews.go.com
blogspot.com
This collection has 30% abcnews.com, 10% blogspot.com, 3% facebook.com
42
Archive-It top 25 is fundamentally different than Storify top 25
43
Archive-It top 25 is fundamentally different than Storify top 25
Twitter is #10 not #1
44
What we archive and what we put in our stories are different subsets of the web
45
Detecting off-topic pages
“Detecting Off-Topic Pages in Web Archives”, TPDL 2015, IJDL 2016.
46
Archive-It provides their partners with tools that allow them to build themed collections
47
Archive-It tools are about HTTP events / mechanics, not “content”
48
Over 60% of archived versions of hamdeensabahy.com are off-topic
May 13, 2012: The page started as on-topic.
May 24, 2012: Off-topic due to a database error.
Mar. 21, 2013: Not working because of financial problems.
May 21, 2013: On-topic again.
June 5, 2014: The site has been hacked.
Oct. 10, 2014: The domain has expired.
Selecting representative pages for generating stories (skipping clustering details, but goal is k=28)
68
Quality metrics for selecting mementos
• In the DSA, memento quality Mq is calculated as follows: Mq = (1 − wm*Dm) + wql*Sql + wqc*Sqc
• Dm is the memento damage (Brunelle, JCDL 2014)
• Sql is the snippet quality based on the URI level
• Sqc is the snippet quality based on the URI category
• wm, wql, wqc are the weights of memento damage, level, and category
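The quality formula can be written out as a small function. This is a sketch under stated assumptions: the slides do not give the weight values, so the defaults below (0.4, 0.3, 0.3) are placeholders, and the function and field names are hypothetical.

```python
def memento_quality(damage, level_score, category_score,
                    w_damage=0.4, w_level=0.3, w_category=0.3):
    """Mq = (1 - wm*Dm) + wql*Sql + wqc*Sqc.

    The default weights are placeholders, not values from the slides.
    """
    return (1 - w_damage * damage) + w_level * level_score + w_category * category_score


def best_memento(cluster):
    """Pick the highest-quality memento from a cluster of candidate dicts."""
    return max(
        cluster,
        key=lambda m: memento_quality(m["damage"], m["level"], m["category"]),
    )


# Hypothetical cluster: two candidates with made-up scores.
cluster = [
    {"uri": "memento-a", "damage": 0.6, "level": 0.5, "category": 0.5},
    {"uri": "memento-b", "damage": 0.1, "level": 1.0, "category": 0.5},
]
print(best_memento(cluster)["uri"])
```

Lower damage and higher snippet scores both raise Mq, so the less damaged memento with the better snippet scores is the one selected from the cluster.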
• Generated by domain experts
• Sliding Page, Sliding Time
• The Boston Marathon Bombing collection
83
Automatically generated stories from archived collections
1. Obtain the seed list and the TimeMap of URIs from the front-end interface of Archive-It
2. Extract the HTML of the mementos from the WARC files (locally hosted at ODU) and download the collections that we do not have in the ODU mirror from Archive-It
3. Extract the text of the page using the Boilerpipe library
4. Eliminate the off-topic pages based on the best-performing method, (Cosine, Word-Count), with the suggested thresholds (0.1, −0.85)
5. Exclude the duplicates of each TimeMap
6. Eliminate the non-English language pages
7. Slice the collection dynamically and then cluster the mementos of each slice using the DBSCAN algorithm
8. Apply the quality metrics to select the best representative pages
9. Sort the selected mementos chronologically, then put them and their metadata in a JSON object
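Step 4 can be illustrated with a small sketch. The thresholds (0.1, −0.85) come from the slide; everything else is an assumption made for illustration: each memento is compared with a reference text (for example, the first memento in its TimeMap, presumed on-topic), raw term frequencies stand in for whatever term weighting the authors used, and a memento is flagged when either measure falls below its threshold.

```python
import math
import re
from collections import Counter

COSINE_THRESHOLD = 0.1         # from the slide
WORD_COUNT_THRESHOLD = -0.85   # from the slide


def tokens(text):
    """Lowercase word tokens; a deliberately simple tokenizer."""
    return re.findall(r"[a-z0-9']+", text.lower())


def cosine(a, b):
    """Cosine similarity of raw term-frequency vectors."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[t] * cb[t] for t in ca.keys() & cb.keys())
    norm = (math.sqrt(sum(v * v for v in ca.values()))
            * math.sqrt(sum(v * v for v in cb.values())))
    return dot / norm if norm else 0.0


def word_count_change(reference, candidate):
    """Relative change in word count versus the reference."""
    if not reference:
        return 0.0
    return (len(candidate) - len(reference)) / len(reference)


def is_off_topic(reference_text, memento_text):
    """Assumption: flag the memento if either measure is below its threshold."""
    ref, cand = tokens(reference_text), tokens(memento_text)
    return (cosine(ref, cand) < COSINE_THRESHOLD
            or word_count_change(ref, cand) < WORD_COUNT_THRESHOLD)


# Made-up page texts, for illustration only.
first = "Egyptian presidential candidate campaign news and election updates from Cairo"
later = "Database error: cannot connect"
print(is_off_topic(first, first), is_off_topic(first, later))
```

How the two measures are combined is a guess on my part; the cited paper should be consulted for the exact rule.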
84
https://storify.com/mturk_exp/3649b0s
• Automatically generated story
• Sliding Page, Sliding Time
• The Boston Marathon Bombing collection
85
Random stories
28 mementos were randomly selected from each collection before excluding off-topic and duplicate pages
• Randomly generated story
• Sliding Page, Sliding Time
• The Boston Marathon Bombing collection
87
https://storify.com/mturk_exp/3649bads
if someone prefers this story, we exclude their results
• Poorly generated story
• Sliding Page, Sliding Time
• The Boston Marathon Bombing collection
88
MT experiment setup
• Three HITs for each story (69 HITs to evaluate 23 stories), with two comparisons per HIT:
  • HIT1: human vs. automatic, human vs. poor
  • HIT2: human vs. random, human vs. poor
  • HIT3: random vs. automatic, automatic vs. poor
• 15 distinct turkers per HIT, each with the master qualification (a high acceptance rate)
• We rejected submissions that preferred a poorly generated story, as well as HITs completed in less than 10 seconds (mean time per HIT = 7 minutes)
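The HIT design and the rejection rule can be captured in a short sketch. The comparison pairs, the 23 stories, and the 10-second cutoff come from the slide; the `Submission` record and its field names are hypothetical, since the slides do not describe how the Mechanical Turk results were stored.

```python
from dataclasses import dataclass

STORIES = 23
MIN_SECONDS = 10  # HITs completed faster than this were rejected

# Two comparisons per HIT, as listed on the slide.
HIT_COMPARISONS = {
    "HIT1": [("human", "automatic"), ("human", "poor")],
    "HIT2": [("human", "random"), ("human", "poor")],
    "HIT3": [("random", "automatic"), ("automatic", "poor")],
}


@dataclass
class Submission:
    """Hypothetical record of one turker's answers to one HIT."""
    worker_id: str
    hit: str
    seconds_spent: float
    preferred: tuple  # story type chosen in each of the two comparisons


def total_hits():
    """Three HITs per story."""
    return STORIES * len(HIT_COMPARISONS)


def is_accepted(sub):
    """Reject if the turker preferred the poor story or finished too quickly."""
    prefers_poor = "poor" in sub.preferred
    too_fast = sub.seconds_spent < MIN_SECONDS
    return not (prefers_poor or too_fast)


# Made-up submissions, for illustration only.
subs = [
    Submission("w1", "HIT1", 420.0, ("human", "human")),
    Submission("w2", "HIT1", 8.0, ("automatic", "human")),
    Submission("w3", "HIT3", 300.0, ("automatic", "poor")),
]
accepted = [s.worker_id for s in subs if is_accepted(s)]
print(total_hits(), accepted)
```

Every HIT includes one comparison against the poor story, which is what lets a preference for it serve as an attention check.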