The Cornell Web Library (or Laboratory) William Y. Arms A Very Large Digital Library for Research on the History of the Web

Jan 03, 2016

Derek Parsons
Transcript
Page 1:

The Cornell Web Library (or Laboratory)

William Y. Arms

A Very Large Digital Library for Research on the History of the Web

Page 2:

Research Team

Faculty

William Arms, Geri Gay, Dan Huttenlocher, Jon Kleinberg, Michael Macy, David Strang

Cornell Theory Center

Manuel Calimlim, Dave Lifka, Ruth Mitchell, and the Petabyte Data Store team

Ph.D. Students

Selcuk Aya, Pavel Dmitriev, Blazej Kot, with more than 20 M.Eng. and undergraduate students

Internet Archive

Brewster Kahle, Tracey Jacquith, John Berry

Page 3:

The Internet Archive

Page 4:

Archive of www.cni.org

Page 5:

www.cni.org in 1998

Page 6:

The Internet Archive Web Collection

The Data

Complete crawls of the Web, every two months since 1996, with some gaps:

• Range of formats and depth of crawl have increased with time.

• No data from sites that are protected by robots.txt or whose owners have asked not to be archived

• Some missing or lost data

• Metadata contains format, links, anchor text

• Organized to facilitate historical access to a known URL (Wayback Machine)
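Historical access to a known URL amounts to looking up the capture closest to a requested date. A minimal sketch, assuming a hypothetical in-memory index of capture dates per URL (the dates below are invented for illustration and are not taken from the Internet Archive):

```python
from bisect import bisect_right
from datetime import date

# Hypothetical capture index: URL -> sorted list of crawl dates (invented values).
captures = {
    "http://www.cni.org/": [date(1996, 12, 20), date(1998, 1, 25), date(2000, 3, 1)],
}

def capture_as_of(url, when):
    """Return the latest capture date of `url` on or before `when`, or None."""
    dates = captures.get(url, [])
    i = bisect_right(dates, when)  # number of captures dated <= when
    return dates[i - 1] if i else None

print(capture_as_of("http://www.cni.org/", date(1998, 6, 1)))
```

Keeping the dates sorted lets each lookup run as a binary search, which matters when a URL has been captured many times.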

Page 7:

Outline

Web-scale research

Building the Web Lab

Observations about very large digital libraries

Page 8:

Web-scale Research

Interviews with researchers

Interviews with 15 faculty and graduate students from social sciences and computer science.

• Focused Web Crawling

• The Structure and Evolution of the Web

• Diffusion of Ideas and Practices

• Social and Information Networks

Page 9:

NSF Cyberinfrastructure Tools

Sociology: Michael Macy (Principal Investigator), David Strang

Computing and Information Science: Bill Arms, Dan Huttenlocher, Jon Kleinberg

Very Large Semi-Structured Datasets for Social Science Research

"Computer scientists have learned through experience that it is usually best to build software tools in close collaboration with users. Hence, our proposal is two-fold – to build an intelligent front-end that will make the Internet Archive data broadly accessible to social scientists, and to develop, test, and refine these tools through a specific research application – the diffusion of innovation."

Began January 2006

Page 10:

Social Science Research

The Web as a social phenomenon

Political campaigns

Online retailing

Polarization of opinions

The Web as evidence of current social events

The spread of urban legends ("The Internet is doubling every six months")

Development of legal concepts across time

Page 11:

An Outsider's View of Diffusion Research (Current)

Ryan and Gross (1943)

Studied factors that influenced adoption of hybrid corn

Hybrid corn, introduced by Iowa State in 1928

Adopted by most Iowa farmers by 1940.

Found communication between previous and potential adopters important

Found an S-shaped rate of adoption

Methodology

Hypothesis, e.g., a model of diffusion

Retrospective survey interviews (small sample size)

Coding of data, by hand with high quality control

Analysis of coded data by hand or by computer
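An S-shaped adoption curve of the kind mentioned above is often summarised with a logistic function. The sketch below is illustrative only: the midpoint and rate are assumed values chosen to span 1928 to 1940, not parameters fitted to the Ryan and Gross data.

```python
import math

def adoption_share(year, midpoint=1934.0, rate=0.8):
    """Cumulative share of adopters under a logistic model.

    midpoint and rate are illustrative assumptions, not fitted values.
    """
    return 1.0 / (1.0 + math.exp(-rate * (year - midpoint)))

for year in range(1928, 1941, 3):
    print(year, f"{adoption_share(year):.2f}")
```

The share starts near zero, rises fastest around the midpoint, and levels off near one, which is the S shape.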

Page 12:

An Outsider's View of Diffusion Research (Future)

A vast collection of information that is used for many studies and experiments (with many gaps, known and unknown)

Automatic extraction of items (e.g., Web pages) that appear relevant to a hypothesis (using search methods that will have errors of inclusion and omission)

Automatic coding of these items for factors believed relevant to the hypothesis (using Artificial Intelligence methods that have significant error rates)

Analysis of the encoded data usually by computer
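The "errors of inclusion and omission" above correspond to what information retrieval measures as precision and recall. A minimal sketch, assuming a small hand-labelled sample of relevant pages is available for comparison (the page identifiers are made up):

```python
def precision_recall(retrieved, relevant):
    """Precision falls with errors of inclusion, recall with errors of omission."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical example: four pages extracted automatically, three truly relevant.
extracted = ["p1", "p2", "p3", "p4"]
truly_relevant = ["p2", "p3", "p5"]
precision, recall = precision_recall(extracted, truly_relevant)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Reporting both numbers alongside a study makes the error rates of the automatic extraction step explicit.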

Page 13:

Social and Information Networks

Studying social networks intertwined with online information networks

A major area of research at Cornell

– In sociology, communication, economics, information science and computer science

People mediated by information artifacts and vice versa

– Not just social connections or just links between documents

Page 14:

Sources of Data

Model systems

– Cornell e-print arXiv (scientific topics, coauthorship/collaboration, trends over time)

– Usenet, with Marc Smith at Microsoft (conversational structure, topic dynamics)

Medium-scale systems

– On-line communities (LiveJournal, MySpace)

Web-scale

– Internet Archive data, showing evolution of Web 1996-2006.

Page 15:

Small Scale: Evolution of the arXiv Networks

Changes in the arXiv citation network over time

– Number of edges grows superlinearly in number of nodes, $e \propto n^{1.69}$

– Average distance between nodes decreases over time

Challenges theoretical models in which diameter is a slowly growing function of number of nodes

Have similarly observed densification laws for many other networks
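Superlinear growth of edges in nodes means the edge count follows a power law in the node count, so the exponent is the slope of a straight line on a log-log plot. A minimal sketch of that estimate follows; the snapshots are synthetic, generated with exponent 1.69, and are not the arXiv measurements.

```python
import math

def fit_power_law_exponent(nodes, edges):
    """Least-squares slope of log(edges) against log(nodes)."""
    xs = [math.log(n) for n in nodes]
    ys = [math.log(e) for e in edges]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den

# Synthetic snapshots of a growing graph (assumed data, exponent 1.69 by construction).
nodes = [1000, 2000, 4000, 8000, 16000]
edges = [n ** 1.69 for n in nodes]
exponent = fit_power_law_exponent(nodes, edges)
print(f"estimated exponent: {exponent:.2f}")
```

An exponent of 1 would mean constant average degree; any value above 1 means the graph densifies as it grows.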

Page 16:

Medium-Scale: Online Social Networking Systems

• A fundamental question in the diffusion of innovation:

– What is the probability an individual will adopt a new behavior, as a function of the number of his/her friends who are adopters?

– New behavior could be: adopting a new technology, joining a social movement, believing a rumor, …

• Large online social networks can have 100,000+ explicitly defined user “communities”.

– New behavior: choosing to join a community.

Page 17:

Joining a Community

• Unprecedented scale for such a curve

– Close to one billion (user, community) instances

• Most standard models predict S-shaped probability curves

• Further: machine learning to predict joining
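The curve in question is the probability of joining as a function of the number of friends already in the community. A minimal sketch of how such a curve could be tabulated from (user, community) instances, assuming each instance has already been reduced to a pair (k, joined); the sample data is invented:

```python
from collections import defaultdict

def join_probability_by_friends(instances):
    """instances: iterable of (k, joined) pairs, one per (user, community) instance.

    k is the number of the user's friends already in the community and
    joined is True if the user later joined. Returns {k: fraction that joined}.
    """
    seen = defaultdict(int)
    joined_count = defaultdict(int)
    for k, joined in instances:
        seen[k] += 1
        if joined:
            joined_count[k] += 1
    return {k: joined_count[k] / seen[k] for k in sorted(seen)}

# Invented sample; real studies use close to one billion instances.
sample = [(0, False), (0, False), (0, False),
          (1, False), (1, True),
          (2, True), (2, True), (2, False)]
curve = join_probability_by_friends(sample)
for k, p in curve.items():
    print(k, f"{p:.2f}")
```

The shape of the resulting curve can then be compared with the S-shaped prediction of the standard models.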

Page 18:

Web-Scale: Network Evolution

• Researchers are acquiring a large vocabulary of patterns for static networks

– small-world, scale-free, preferential attachment, PageRank, hubs and authorities, bow-ties, bipartite cores, network motifs, …

• Because of the computational challenges, most of the standard results have been measured very few times, e.g., on early AltaVista crawls.

• Little is known about the characteristic ways in which networks grow over time

– What are the analogous collections of patterns?

• Most studies have used link structures. There have been few studies of the evolution of terminology over time
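As one concrete instance of the static link-structure measures listed above, the following is a minimal sketch of PageRank by power iteration. The four-page graph is a toy example; the damping factor of 0.85 and the even redistribution of rank from dangling pages are conventional choices assumed here, not details from the talk.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links: dict mapping each node to the list of nodes it links to."""
    nodes = set(links)
    for targets in links.values():
        nodes.update(targets)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        new_rank = {v: (1.0 - damping) / n for v in nodes}
        for v in nodes:
            targets = links.get(v, [])
            if targets:
                share = damping * rank[v] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling node: spread its rank evenly over all nodes.
                for t in nodes:
                    new_rank[t] += damping * rank[v] / n
        rank = new_rank
    return rank

toy_web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(toy_web)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, f"{ranks[page]:.3f}")
```

At Web scale the same iteration has to run over a sparse matrix with billions of entries, which is part of why such results have been measured so rarely.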

Page 19:

Building the Web Library/Laboratory

The Cornell petabyte data store allows us to mount many crawls of the Web online for a broad range of Web research.

• Copy snapshots of the Web from the Internet Archive

• Index snapshots and store online at Cornell

• Extract feature sets for researchers

• Provide APIs for researchers (program interface, download of datasets, Web Services API)

• Provide Web GUI for social science researchers

Page 20:

The Internet Archive Web Collection

Sizes

• Current crawls are about 40-60 TByte (compressed)

• Total archive is about 600 TByte (compressed)

• Compression ratio:

– up to 25:1

– best estimate of overall average is 10:1

• Rate of increase is about 1 TByte/day (compressed)
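The figures above imply some rough totals. The arithmetic below is a back-of-envelope sketch using only the numbers quoted in this list; the decimal unit conversion (1 PByte = 1000 TByte) is an assumption.

```python
# Figures quoted in the list above.
TOTAL_COMPRESSED_TBYTE = 600
GROWTH_TBYTE_PER_DAY = 1
AVERAGE_COMPRESSION_RATIO = 10  # best estimate of overall average, 10:1

# Assumption: decimal units, 1 PByte = 1000 TByte.
uncompressed_pbyte = TOTAL_COMPRESSED_TBYTE * AVERAGE_COMPRESSION_RATIO / 1000
growth_tbyte_per_year = GROWTH_TBYTE_PER_DAY * 365
years_to_add_600 = TOTAL_COMPRESSED_TBYTE / growth_tbyte_per_year

print(f"uncompressed size: about {uncompressed_pbyte:.0f} PByte")
print(f"growth: about {growth_tbyte_per_year} TByte/year (compressed)")
print(f"years to add another 600 TByte at this rate: {years_to_add_600:.1f}")
```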

Total storage requirement at Cornell will differ because:

• Elimination of data that is duplicated between crawls

• Expansion of metadata for research

• Database indexes
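One common way to eliminate data duplicated between crawls is to store each distinct page body once, keyed by a hash of its content, and record every capture as a reference to that hash. The sketch below illustrates the idea; the class, the crawl labels, and the choice of SHA-1 are assumptions made for illustration, not a description of the Cornell page store.

```python
import hashlib

class PageStore:
    """Toy content-addressed store: identical page bodies are kept once."""

    def __init__(self):
        self.blobs = {}     # content hash -> page bytes
        self.captures = []  # (url, crawl label, content hash)

    def add(self, url, crawl, body):
        digest = hashlib.sha1(body).hexdigest()
        self.blobs.setdefault(digest, body)  # store the body only if it is new
        self.captures.append((url, crawl, digest))
        return digest

store = PageStore()
h1 = store.add("http://example.org/", "crawl-A", b"<html>hello</html>")
h2 = store.add("http://example.org/", "crawl-B", b"<html>hello</html>")
h3 = store.add("http://example.org/", "crawl-C", b"<html>changed</html>")
print(len(store.captures), "captures,", len(store.blobs), "stored bodies")
```

Pages that do not change between crawls then cost one small capture record rather than a second copy of the content.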

Page 21:

Data Processing Overview

Page 22:

Scale of Data Processing

Balance of Resources

                  Ideal                Realistic
Networking        500 Mbit/sec         100 Mbit/sec
Data online       all                  few crawls/year
Metadata online   all                  all?
Disk              750 TB               240 TB
Tape archive      all                  few crawls/year
Computers         research separate    shared with storage

Page 23:

Equipment

Scidata1 -- Initial Configuration

16-Processor Unisys ES7000 Servers

– 16 GByte RAM

– 8 GByte/sec aggregate I/O bandwidth

100 TByte RAID Online Storage

ADIC Scalar 10K robotic tape library for archive

Separate Web server

Near-term Expansion

• Disk capacity will expand to 240 TByte by end of 2007

Network

• Internet2 with dedicated 100 Mbit/sec link to Internet Archive

Page 24:

Data Processing

Transfer 300-500 GByte per day

Internet2 -- 100 Mbit/sec maximum throughput

Archive raw data to tape

Process raw data

Uncompress and unpack Web pages (ARC) and metadata (DAT) files

Create IDs for pages and content hashes

Database load

Load batches of metadata about pages and links into the database (MS SQL Server 2000)

Store compressed page files
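The transfer figures on this slide can be checked against the link capacity. A back-of-envelope sketch, reading the 100 mbs link as 100 Mbit/sec and assuming decimal units; the 40 and 60 TByte crawl sizes are the compressed sizes quoted earlier in the talk:

```python
LINK_MBIT_PER_SEC = 100
SECONDS_PER_DAY = 24 * 60 * 60

# 100 Mbit/sec = 12.5 MByte/sec; convert a full day of transfer to GByte.
max_gbyte_per_day = LINK_MBIT_PER_SEC / 8 * SECONDS_PER_DAY / 1000

def utilisation(gbyte_per_day):
    """Fraction of the link's daily capacity used by a given transfer rate."""
    return gbyte_per_day / max_gbyte_per_day

def days_to_transfer(crawl_tbyte, gbyte_per_day):
    """Days needed to copy one crawl at a sustained daily rate."""
    return crawl_tbyte * 1000 / gbyte_per_day

print(f"link ceiling: {max_gbyte_per_day:.0f} GByte/day")
for rate in (300, 500):
    print(f"{rate} GByte/day uses {utilisation(rate):.0%} of the link")
print(f"40 TByte crawl at 500 GByte/day: {days_to_transfer(40, 500):.0f} days")
print(f"60 TByte crawl at 300 GByte/day: {days_to_transfer(60, 300):.0f} days")
```

At these rates a single current crawl takes months to copy, which is consistent with the realistic plan of keeping only a few crawls per year online.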

Page 25:

Metadata

URLs, pages and links:

• URLs contained in metadata may link to pages never crawled

• URLs not canonicalized: different URLs may refer to same page

• Links are from a page to a URL
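To see why different URL strings may refer to the same page, it helps to look at what canonicalization would normalise. The sketch below applies a few common rules (lower-case scheme and host, drop default ports, drop fragments, treat an empty path as "/"); these rules are an illustrative assumption, and the metadata described here stores URLs without applying them.

```python
from urllib.parse import urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}

def canonicalize(url):
    """Apply a few common normalisation rules; userinfo, if any, is dropped."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    port = parts.port
    # Keep the port only when it is not the default for the scheme.
    if port is None or port == DEFAULT_PORTS.get(scheme):
        netloc = host
    else:
        netloc = f"{host}:{port}"
    path = parts.path or "/"
    return urlunsplit((scheme, netloc, path, parts.query, ""))

variants = [
    "http://www.cni.org",
    "HTTP://WWW.CNI.ORG:80/",
    "http://www.cni.org/#top",
]
print({canonicalize(u) for u in variants})
```

All three variants reduce to the same string, so a graph built on raw URLs would count them as three nodes where a canonicalized graph would have one.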

Web graph:

• Nodes are union of pages crawled and URLs seen

• Each node and edge has time interval(s) over which it exists

Content:

• Anchor text in more recent crawls

• File and mime types
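A Web graph whose edges carry time intervals can be queried for the graph as it stood on a given date. A minimal sketch, assuming a hypothetical in-memory representation (the class and the example URLs and dates are invented; the Web Lab itself keeps this metadata in a relational database):

```python
from datetime import date

class TemporalGraph:
    """Edges from a page to a URL, each with one or more validity intervals."""

    def __init__(self):
        self.edges = {}  # (source page, target URL) -> list of (first_seen, last_seen)

    def add_edge(self, src, dst, first_seen, last_seen):
        self.edges.setdefault((src, dst), []).append((first_seen, last_seen))

    def snapshot(self, when):
        """Edges with at least one interval that includes `when`."""
        return {
            edge for edge, intervals in self.edges.items()
            if any(start <= when <= end for start, end in intervals)
        }

    def nodes(self, when):
        """Union of linking pages and linked-to URLs in the snapshot."""
        live = self.snapshot(when)
        return {src for src, _ in live} | {dst for _, dst in live}

g = TemporalGraph()
a, b, c = "http://a.example/", "http://b.example/", "http://c.example/"
g.add_edge(a, b, date(1998, 1, 1), date(2001, 12, 31))
g.add_edge(a, b, date(2004, 1, 1), date(2006, 1, 1))  # link reappears later
g.add_edge(a, c, date(2003, 1, 1), date(2006, 1, 1))
print(sorted(g.snapshot(date(2005, 1, 1))))
```

Storing a list of intervals per edge, rather than a single interval, covers links that disappear and later return.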

Page 26:

Current Status

Data Capture:

• Connection of Internet Archive to Internet2 (October 2005)

• Parallel loading of crawls of DAT and ARC files (January 2006)

Storage:

• Relational database and preload system: under test

Two complete crawls available April/May 2006

• Page store: preliminary design work

Page 27:

User Services

Page 28:

Services for Researchers

Under development (available in 2006)

API for users to extract data and download it to their own computers, or to process it on the Scidata1 computer

Retro Browser (browse the Web as it was on a given date)

Subset extraction (select dataset by query of a relational database with indexes by date, URL, domain name, file type, anchor text, etc.)

Extract Web graph from subset, organized as sparse matrix

Full text index of subset (using Nutch/Lucene)
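Subset extraction followed by Web graph extraction could look roughly like the sketch below. It uses an in-memory SQLite database as a stand-in for the relational store; the table names, columns, and sample rows are all hypothetical and do not reflect the actual Web Lab schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE page (id INTEGER PRIMARY KEY, url TEXT, domain TEXT,
                       crawl_date TEXT, mime_type TEXT);
    CREATE TABLE link (src INTEGER, dst INTEGER);
    CREATE INDEX page_by_domain_date ON page (domain, crawl_date);
""")
conn.executemany("INSERT INTO page VALUES (?, ?, ?, ?, ?)", [
    (1, "http://a.example.edu/", "example.edu", "2005-03-01", "text/html"),
    (2, "http://a.example.edu/x", "example.edu", "2005-03-01", "text/html"),
    (3, "http://b.example.com/", "example.com", "2005-03-01", "text/html"),
    (4, "http://a.example.edu/old", "example.edu", "1999-01-01", "text/html"),
])
conn.executemany("INSERT INTO link VALUES (?, ?)", [(1, 2), (2, 1), (1, 3), (4, 1)])

# Step 1: select the subset by domain, date, and file type.
subset = [row[0] for row in conn.execute(
    "SELECT id FROM page WHERE domain = ? AND crawl_date >= ? AND mime_type = ? "
    "ORDER BY id",
    ("example.edu", "2005-01-01", "text/html"))]

# Step 2: keep only links inside the subset, as a sparse adjacency structure
# (row index -> list of column indices of the non-zero entries).
index = {page_id: i for i, page_id in enumerate(subset)}
rows = {}
for src, dst in conn.execute("SELECT src, dst FROM link"):
    if src in index and dst in index:
        rows.setdefault(index[src], []).append(index[dst])

print(subset, rows)
```

Only the non-zero entries are stored, which is what makes a graph over millions of pages fit in memory on a researcher's own computer.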

Future

NLP and machine learning tools to analyze text

Full text index of entire collection

Page 29:

Use of the Library (Draft)

Custodianship of data

Make no changes to the content of the Web pages and the identifying metadata such as URL and date.

Copyright

Assume an implied license to use Web data for archiving and academic research. Respect robots.txt exclusions and other requests from copyright owners.

Privacy

Research that might identify individuals is subject to standards that apply to research involving human subjects.

Authentication of users

All users of this library, whether at Cornell or elsewhere, are authenticated. Use restricted to academic, non-commercial research.

Page 30:

Very Large Scale Digital Libraries

Only the computer reads every word

• Researchers interact with the library through computer programs that act as their agents.

• Users rarely view individual items except after preliminary screening by programs.

• The library is a highly technical computer system that is used by researchers who are not computing specialists.

• The library is a super-computing application.

• Use of the library depends on automated tools. These tools require state-of-the-art indexes for text and semi-structured data, natural language processing, and machine learning.

Page 31:

Design Guidelines for Builders of Digital Libraries

Every online collection or service needs an application program interface (API) for computers to interact with the library.

A primary methodology is:

– select a subset of the collection

– download to the researcher's computer

– use programs on the researcher's computer to analyze the data

Almost all metadata will be computer generated, but human cooperative editing can correct errors (see Crane, D-Lib Magazine, March 2006)

Page 32:

Further Information

Web site: http://www.infosci.cornell.edu/SIN/

General overview: Arms, et al., "A Research Library based on the Historical Collections of the Internet Archive". D-Lib Magazine, February 2006. http://www.dlib.org/dlib/february06/arms/02arms.html.

Technical information: Arms, et al., "Building a Research Library for the History of the Web". Joint Conference on Digital Libraries, 2006.

Page 33:

Thanks

This work would not be possible without the forethought and longstanding commitment of the Internet Archive to capture and preserve the content of the Web for future generations.

This work has been funded in part by the National Science Foundation, grants CNS-0403340, DUE-0127308, and SES-0537606, with equipment support from Unisys and by an E-Science grant and a gift from Microsoft Corporation.

Page 34:

The Cornell Web Library (or Laboratory)

William Y. Arms

A Very Large Digital Library for Research on the History of the Web