MS Thesis Defense, Aug 2012 - Visualizing Digital Collections at Archive-It

Post on 16-Jan-2015

DESCRIPTION

Advisor: Dr. Michele C. Weigle. Committee: Dr. Michael L. Nelson, Dr. Ravi Mukkamala.

Slides 32-34 contain videos, which appear in this SlideShare as embedded YouTube videos. Direct links to the videos:
Treemap: http://www.youtube.com/watch?v=BJDrxQEEFYM
Time Cloud: http://www.youtube.com/watch?v=YYkI6aBO0to
Bubble Chart, Image Plot and Timeline: http://www.youtube.com/watch?v=j94clxqKQk8

Abstract: Archive-It, a subscription service from the Internet Archive, allows users to create, maintain, and view digital collections of web resources. The current interface of Archive-It is largely text-based, supporting drill-down navigation using lists of URIs. While this interface provides good searching capabilities, it is not very efficient for browsing. In the absence of keywords, a user has to spend a large amount of time trying to locate a webpage of interest. To provide a better visual experience, we studied the underlying characteristics of Archive-It collections and implemented six visualizations (treemap, time cloud, bubble chart, image plot, timeline, and wordle), each highlighting one or more of those characteristics. Archive-It supports grouping webpages into categories but does not enforce it. As a result, many collections have missing or improper grouping. For such collections, we present a method of grouping webpages based on a set of predefined rules.

Transcript

1

Visualizing Digital Collections at Archive-It

Kalpesh Padia

Director: Michele C. Weigle

Committee: Michael L. Nelson, Ravi Mukkamala

7/20/2012 MS Thesis - August 2012

MS Thesis - August 2012 2

Agenda

Introduction

Motivation

Related Work

Collection Retrieval and Processing

Visualizations

Case Studies

Future Work

Conclusion

MS Thesis - August 2012 3

INTRODUCTION AND MOTIVATION

MS Thesis - August 2012 4

Digital Archives

http://www.loc.gov/index.html
http://digitalcollections.library.yale.edu/

MS Thesis - August 2012 5

Archive-It

http://archive-it.org/

MS Thesis - August 2012 6

Archive-It Collection Hierarchy

Root: Collection Title
  Level 1: Category 1 ... Category n
    Level 2: Web page 1 ... Web page n
      Level 3 (Leaf Nodes): Archived Version 1 ... Archived Version n
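The hierarchy on this slide can be sketched as a nested data structure. The following is a minimal illustration only, not the thesis implementation: the class names, field names, and the example values are all hypothetical, and the slides do not say how the hierarchy was actually stored.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class WebPage:
    # Level 2 node; its archived versions are the Level 3 leaf nodes.
    uri: str
    archived_versions: list = field(default_factory=list)


@dataclass
class Category:
    # Level 1 node grouping several web pages.
    name: str
    pages: list = field(default_factory=list)


@dataclass
class Collection:
    # Root node, identified by the collection title.
    title: str
    categories: list = field(default_factory=list)


def to_json(collection):
    """Serialize the whole hierarchy; asdict() recurses into nested dataclasses."""
    return json.dumps(asdict(collection), indent=2)


def count_leaves(collection):
    """Count archived versions (leaf nodes) across all categories and pages."""
    return sum(len(page.archived_versions)
               for category in collection.categories
               for page in category.pages)


# Hypothetical example data, not taken from any real collection.
example = Collection(
    title="Example collection",
    categories=[
        Category("Category 1", [
            WebPage("http://example.org/", ["20120101000000", "20120201000000"]),
        ]),
    ],
)
```

Calling `to_json(example)` yields a nested JSON document with the same four levels as the diagram, which is one plausible shape for feeding a hierarchical visualization such as a treemap.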

MS Thesis - August 2012 7

Exploring Archive-It Collections

http://archive-it.org/collections/1068

MS Thesis - August 2012 8

Exploring Archive-It Collections

http://archive-it.org/collections/1068

MS Thesis - August 2012 9

Exploring Archive-It Collections

http://archive-it.org/collections/1068

MS Thesis - August 2012 10

Exploring Archive-It Collections

http://wayback.archive-it.org/1068/*/http://acda.co/

MS Thesis - August 2012 11

Exploring Archive-It Collections

http://archive-it.org/collections/1068

MS Thesis - August 2012 12

Exploring Archive-It Collections

http://archive-it.org/collections/2836

MS Thesis - August 2012 13

Drawbacks

No visual feedback

Discovering individual pages is difficult

Optional metadata and categorization

Collection structure known only to curator

MS Thesis - August 2012 14

Contribution

Interactive visualizations: Treemap, Time cloud, Bubble chart, Image plot, Wordle, Timeline

Temporal exploration of collections

Uncover collection structure

MS Thesis - August 2012 15

RELATED WORK

MS Thesis - August 2012 16

Microsoft Pivot

http://www.microsoft.com/silverlight/pivotviewer/

MS Thesis - August 2012 17

Page History Explorer

A. Jatowt, Y. Kawai, and K. Tanaka, “Visualizing Historical Content of Web Pages,” in Proceedings of the 17th International Conference on World Wide Web, 2008.

MS Thesis - August 2012 18

3D Wall

http://www.webarchive.org.uk/ukwa/wall/Blogs

MS Thesis - August 2012 19

Treemap

Johnson and Shneiderman, “Space-Filling Approach to the Visualization of Hierarchical Information Structures,” in Proceedings of the 2nd Conference on Visualization '91.

MS Thesis - August 2012 20

Series Browser

M. Whitelaw, “Visualising Archival Collections: The Visible Archive Project,” in Archives and Manuscripts, vol. 37, Issue 2, 2009.

MS Thesis - August 2012 21

A1 Explorer

M. Whitelaw, “Visualising Archival Collections: The Visible Archive Project,” in Archives and Manuscripts, vol. 37, Issue 2, 2009.

MS Thesis - August 2012 22

EASY

Scharnhorst et al., “Looking at a digital research data archive - Visual interfaces to EASY,” in CoRR, 2012, http://arxiv.org/abs/1204.3200

MS Thesis - August 2012 23

Wordle

Jonathan Feinberg, http://wordle.net/, Dogear

MS Thesis - August 2012 24

DATA RETRIEVAL AND PROCESSING

MS Thesis - August 2012 25

11 Collections, 2K+ Web pages, 70K+ Mementos

MS Thesis - August 2012 26

Data Retrieval & Processing

Retrieval:
  Screen scrape
  Copy collection hierarchy
  Store page content

Processing:
  Calculate TF and TF-IDF
  Generate screenshots
  Generate wordles
  Rule-based categorization
  Construct JSON
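The "Calculate TF and TF-IDF" step can be illustrated with a short sketch. The slides do not state which weighting variant was used, so the choices here are assumptions: term frequency is normalized by document length, and inverse document frequency is log(N / df). The function names and the sample tokens are hypothetical.

```python
import math
from collections import Counter


def term_frequencies(tokens):
    """Relative frequency of each term within one tokenized page."""
    counts = Counter(tokens)
    total = len(tokens)
    return {term: count / total for term, count in counts.items()}


def tf_idf(documents):
    """Return one {term: score} dict per document.

    Assumed weighting: tf * log(N / df), where N is the number of
    documents and df is the number of documents containing the term.
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))  # count each term once per document

    scores = []
    for tokens in documents:
        tf = term_frequencies(tokens)
        scores.append({term: freq * math.log(n_docs / doc_freq[term])
                       for term, freq in tf.items()})
    return scores


# Hypothetical tokenized pages.
pages = [["flood", "relief", "flood"], ["flood", "news"]]
page_scores = tf_idf(pages)
```

Under this weighting, a term that occurs in every page (here "flood") scores zero, while terms specific to one page score higher, which is the property that makes TF-IDF useful for picking the words shown in a wordle.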

MS Thesis - August 2012 27

No categorization

http://www.archive-it.org/collections/2836

MS Thesis - August 2012 28

Improper Categorization

http://www.archive-it.org/collections/2323

MS Thesis - August 2012 29

Rule-based categorization

News Web pages

Blogs

Social Media

Videos

http://www.archive-it.org/collections/2836

MS Thesis - August 2012 30

Special URI and TLD based categorization

Pakistani news web pages

http://www.archive-it.org/collections/2836
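A rule-based grouping of this kind might look like the sketch below. The category names come from the slides, but the rule table itself is an assumption made for illustration: the host suffixes, the "news" substring hint, and the TLD label mapping are hypothetical and are not the rules used in the thesis.

```python
from urllib.parse import urlparse

# Hypothetical host-suffix rules, checked in order.
HOST_RULES = [
    ("Videos", ("youtube.com", "vimeo.com")),
    ("Social Media", ("twitter.com", "facebook.com")),
    ("Blogs", ("blogspot.com", "wordpress.com")),
]

# Hypothetical hint: hosts containing this substring are treated as news sites.
NEWS_HINTS = ("news",)

# Hypothetical TLD-based refinement for news pages.
TLD_LABELS = {"pk": "Pakistani"}


def categorize(uri):
    """Assign a category to a URI using host and TLD rules."""
    host = (urlparse(uri).hostname or "").lower()

    for category, suffixes in HOST_RULES:
        if any(host == s or host.endswith("." + s) for s in suffixes):
            return category

    if any(hint in host for hint in NEWS_HINTS):
        tld = host.rsplit(".", 1)[-1]
        label = TLD_LABELS.get(tld)
        return f"{label} news web pages" if label else "News Web pages"

    return "Uncategorized"
```

For example, `categorize("http://news.example.pk/story")` returns "Pakistani news web pages" under these assumed rules, while a URI that matches nothing falls back to "Uncategorized" so that it can be reviewed by hand.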

MS Thesis - August 2012 31

VISUALIZATIONS

MS Thesis - August 2012 32

Treemap

MS Thesis - August 2012 33

Time Cloud

MS Thesis - August 2012 34

Bubble Chart, Image Plot & Timeline

MS Thesis - August 2012 35

CASE STUDIES

MS Thesis - August 2012 36

1. Collection Building and Growth

MS Thesis - August 2012 37

2. Re-Categorization (Pakistan Flood: no categorization)

MS Thesis - August 2012 38

2. Re-Categorization (Pakistan Flood: after categorization)

MS Thesis - August 2012 39

3. Collection Synopsis

MS Thesis - August 2012 40

3. Collection Synopsis

MS Thesis - August 2012 41

3. Collection Synopsis

MS Thesis - August 2012 42

3. Collection Synopsis

MS Thesis - August 2012 43

3. Collection Synopsis

MS Thesis - August 2012 44

4. Theme Tracking

MS Thesis - August 2012 45

4. Theme Tracking

MS Thesis - August 2012 46

4. Theme Tracking

MS Thesis - August 2012 47

4. Theme Tracking

MS Thesis - August 2012 48

Informal User Evaluation

Alex Thurman, Columbia University Libraries

Feedback on:
  ease of browsing and obtaining information
  user-friendliness of the interface
  whether they prefer a textual or graphical interface
  most effective visualization
  effectiveness of the rule-based categorization in exploring archives

MS Thesis - August 2012 49

Feedback

Effective visualizations:
  Treemap: color coding useful for identifying newer additions
  Image plot: screenshots with mouse-over wordles allow for good navigation
  Timeline: useful for visualizing development of groups in collection

Suggestions:
  Broader timescale for treemaps
  Include stop words from other languages

MS Thesis - August 2012 50

FUTURE WORK AND CONCLUSION

MS Thesis - August 2012 51

Future Work

N-Gram wordles

Term expansion

Krovetz stemmer (dictionary based stemmer)

Integration with Archive-It

Detailed user evaluation

Implementation for other archives

MS Thesis - August 2012 52-59

Conclusion

Identified metrics for collections

Visualizations: Treemap, Time cloud, Bubble chart, Image plot, Wordle, Timeline

Rule-based categorization

MS Thesis - August 2012 60

BACKUP

MS Thesis - August 2012 61

Time Span

Small: 1 Day - 2 Weeks
Medium: 2 Weeks - 4 Months
Large: > 4 Months

http://wayback.archive-it.org/1068/*/http://amigosdemujeres.org/

MS Thesis - August 2012 62

Groups

Small: 1
Medium: 2 - 5
Large: > 5

http://www.archive-it.org/collections/1068

MS Thesis - August 2012 63

URI Domains

Small: 1 - 10
Medium: 11 - 20
Large: > 20

http://www.archive-it.org/collections/2836

MS Thesis - August 2012 64

Number of Web Pages

Small: 1 - 10
Medium: 11 - 99
Large: > 99

http://www.archive-it.org/collections/2836
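The Small/Medium/Large thresholds on the four backup slides above (time span, groups, URI domains, number of web pages) can be expressed as simple bucketing functions. This is a sketch under stated assumptions: the function names are hypothetical, the time-span boundaries overlap on the slide (2 weeks appears in both Small and Medium), so the boundary value is placed in the lower bucket here, and a month is approximated as 30 days.

```python
from datetime import timedelta


def bucket_count(value, small_max, medium_max):
    """Map a count to Small / Medium / Large using inclusive upper bounds."""
    if value <= small_max:
        return "Small"
    if value <= medium_max:
        return "Medium"
    return "Large"


def bucket_groups(n_groups):
    # Small: 1, Medium: 2 - 5, Large: > 5
    return bucket_count(n_groups, 1, 5)


def bucket_uri_domains(n_domains):
    # Small: 1 - 10, Medium: 11 - 20, Large: > 20
    return bucket_count(n_domains, 10, 20)


def bucket_web_pages(n_pages):
    # Small: 1 - 10, Medium: 11 - 99, Large: > 99
    return bucket_count(n_pages, 10, 99)


def bucket_time_span(span, days_per_month=30):
    """Small: up to 2 weeks, Medium: up to 4 months, Large: longer.

    Assumptions: boundary values fall in the lower bucket, and a month
    is approximated as `days_per_month` days.
    """
    if span <= timedelta(weeks=2):
        return "Small"
    if span <= timedelta(days=4 * days_per_month):
        return "Medium"
    return "Large"
```

Keeping the thresholds in one helper makes it easy to adjust a boundary, for instance if a month should be counted differently, without touching the per-metric functions.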

MS Thesis - August 2012 65

Jigsaw

Stasko et al., IEEE VAST 2007

MS Thesis - August 2012 66

Themeriver

Wei et al. in SIGKDD, 2010

MS Thesis - August 2012 68

Time Cloud

MS Thesis - August 2012 69

Bubble Chart

http://tmix.cs.odu.edu:8080/project/test.php?coll_id=1068

MS Thesis - August 2012 70

Image Plot with Wordle

http://tmix.cs.odu.edu:8080/project/test.php?coll_id=1068

MS Thesis - August 2012 71

Timeline

http://tmix.cs.odu.edu:8080/project/test.php?coll_id=1068
