The Cornell Web Library (or Laboratory)
William Y. Arms
A Very Large Digital Library for Research on the History of the Web
Jan 03, 2016
Research Team
Faculty
William Arms, Geri Gay, Dan Huttenlocher, Jon Kleinberg, Michael Macy, David Strang
Cornell Theory Center
Manuel Calimlim, Dave Lifka, Ruth Mitchell, and the Petabyte Data Store team
Ph.D. Students
Selcuk Aya, Pavel Dmitriev, Blazej Kot, with more than 20 M.Eng. and undergraduate students
Internet Archive
Brewster Kahle, Tracey Jacquith, John Berry
The Internet Archive Web Collection
The Data
Complete crawls of the Web, every two months since 1996, with some gaps:
• Range of formats and depth of crawl have increased with time.
• No data from sites that are protected by robots.txt or where owners have requested not to be archived
• Some missing or lost data
• Metadata contains format, links, anchor text
• Organized to facilitate historical access to a known URL (Wayback Machine)
Web-scale Research
Interviews with researchers
Interviews with 15 faculty and graduate students from social sciences and computer science.
• Focused Web Crawling
• The Structure and Evolution of the Web
• Diffusion of Ideas and Practices
• Social and Information Networks
NSF Cyberinfrastructure Tools
Sociology: Michael Macy (Principal Investigator), David Strang
Computing and Information Science: Bill Arms, Dan Huttenlocher, Jon Kleinberg
Very Large Semi-Structured Datasets for Social Science Research
"Computer scientists have learned through experience that it is usually best to build software tools in close collaboration with users. Hence, our proposal is two-fold – to build an intelligent front-end that will make the Internet Archive data broadly accessible to social scientists, and to develop, test, and refine these tools through a specific research application – the diffusion of innovation."
Began January 2006
Social Science Research
The Web as a social phenomenon
Political campaigns
Online retailing
Polarization of opinions
The Web as evidence of current social events
The spread of urban legends ("The Internet is doubling every six months")
Development of legal concepts across time
An Outsider's View of Diffusion Research (Current)
Ryan and Gross (1943)
Studied factors that influenced adoption of hybrid corn
Hybrid corn, introduced by Iowa State in 1928, was adopted by most Iowa farmers by 1940
Found that communication between previous and potential adopters was important
Found an S-shaped rate of adoption
Methodology
Hypothesis, e.g., a model of diffusion
Retrospective survey interviews (small sample size)
Coding of data, by hand with high quality control
Analysis of coded data by hand or by computer
An Outsider's View of Diffusion Research (Future)
A vast collection of information that is used for many studies and experiments (with many gaps, known and unknown)
Automatic extraction of items (e.g., Web pages) that appear relevant to a hypothesis (using search methods that will have errors of inclusion and omission)
Automatic coding of these items for factors believed relevant to the hypothesis (using Artificial Intelligence methods that have significant error rates)
Analysis of the encoded data usually by computer
Social and Information Networks
Studying social networks intertwined with online information networks
A major area of research at Cornell
– In sociology, communication, economics, information science and computer science
People mediated by information artifacts and vice versa
– Not just social connections or just links between documents
Sources of Data
Model systems
– Cornell e-print arXiv (scientific topics, coauthorship/collaboration, trends over time)
– Usenet, with Marc Smith at Microsoft (conversational structure, topic dynamics)
Medium-scale systems
– On-line communities (LiveJournal, MySpace)
Web-scale
– Internet Archive data, showing evolution of Web 1996-2006.
Small Scale: Evolution of the arXiv Networks
Changes in the arXiv citation network over time
– Number of edges grows superlinearly in the number of nodes: e ∝ n^1.69
– Average distance between nodes decreases over time
Challenges theoretical models in which diameter is a slowly growing function of number of nodes
Have similarly observed densification laws for many other networks
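A densification exponent like the 1.69 above is estimated by fitting a line on a log-log plot of edges versus nodes across snapshots. A minimal sketch, using synthetic snapshots generated to obey e = c·n^1.69 exactly (real counts would be noisy):

```python
import math

# Synthetic (nodes, edges) snapshots obeying a densification law
# e = c * n^a with a = 1.69, the exponent observed for the arXiv
# citation network. The constant c is an arbitrary assumption.
a_true, c = 1.69, 0.5
snapshots = [(n, c * n ** a_true) for n in (10_000, 20_000, 40_000, 80_000)]

# Least-squares fit of the exponent on a log-log scale.
xs = [math.log(n) for n, _ in snapshots]
ys = [math.log(e) for _, e in snapshots]
mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
        / sum((x - mx) ** 2 for x in xs)
print(f"estimated densification exponent: {slope:.2f}")  # → 1.69
```

Because the law is linear in log space, ordinary least squares on (log n, log e) recovers the exponent directly.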
Medium-Scale: Online Social Networking Systems
• A fundamental question in the diffusion of innovation:
– What is the probability an individual will adopt a new behavior, as a function of the number of his/her friends who are adopters?
– New behavior could be: adopting a new technology, joining a social movement, believing a rumor, …
• Large online social networks can have 100,000+ explicitly defined user “communities”.
– New behavior: choosing to join a community.
Joining a Community
• Unprecedented scale for such a curve
– Close to one billion (user, community) instances
• Most standard models predict S-shaped probability curves
• Further: machine learning to predict joining
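The measurement behind such a curve can be sketched as follows: for each user not yet in a community, count the friends who are already members, then estimate P(join) per count. The friendship and membership data here are hypothetical, not the project's data.

```python
from collections import defaultdict

# Hypothetical friendship graph and community membership at two times.
friends = {
    "a": {"b", "c"}, "b": {"a", "c", "d"}, "c": {"a", "b"}, "d": {"b"},
}
members_before = {"b", "c"}       # in the community at time t
members_after = {"a", "b", "c"}   # at time t+1; "a" joined

trials = defaultdict(int)  # k -> non-members with k member-friends
joins = defaultdict(int)   # k -> how many of those then joined
for u in friends:
    if u in members_before:
        continue
    k = len(friends[u] & members_before)
    trials[k] += 1
    if u in members_after:
        joins[k] += 1

# Estimated probability of joining as a function of k.
curve = {k: joins[k] / trials[k] for k in sorted(trials)}
print(curve)  # → {1: 0.0, 2: 1.0}
```

At Web scale the same counting runs over close to a billion (user, community) instances rather than four users.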
Web-Scale: Network Evolution
• Researchers are acquiring a large vocabulary of patterns for static networks
– small-world, scale-free, preferential attachment, PageRank, hubs and authorities, bow-ties, bipartite cores, network motifs, …
• Because of the computational challenges, most of the standard results have been measured only a few times, e.g., on early AltaVista crawls.
• Little is known about the characteristic ways in which networks grow over time
– What are the analogous collections of patterns?
• Most studies have used link structures. There have been few studies of the evolution of terminology over time
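One of the standard static-network measures named above, PageRank, can be computed on a link graph by straightforward power iteration. A minimal sketch on a tiny hypothetical graph (not a real crawl):

```python
def pagerank(graph, d=0.85, iters=50):
    """Power-iteration PageRank on {node: [out-neighbors]}."""
    nodes = list(graph)
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iters):
        new = {u: (1 - d) / n for u in nodes}
        for u, outs in graph.items():
            if not outs:  # dangling node: spread its rank uniformly
                for v in nodes:
                    new[v] += d * rank[u] / n
            else:
                for v in outs:
                    new[v] += d * rank[u] / len(outs)
        rank = new
    return rank

ranks = pagerank({"a": ["b"], "b": ["c"], "c": ["a", "b"]})
print(ranks)  # ranks sum to 1; "b", with two in-links, scores highest
```

On Web-scale graphs the same iteration is run over a sparse adjacency structure rather than Python dicts, but the algorithm is unchanged.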
Building the Web Library/Laboratory
The Cornell petabyte data store allows us to mount many crawls of the Web online for a broad range of Web research.
• Copy snapshots of the Web from the Internet Archive
• Index snapshots and store online at Cornell
• Extract feature sets for researchers
• Provide APIs for researchers (program interface, download of datasets, Web Services API)
• Provide Web GUI for social science researchers
The Internet Archive Web Collection
Sizes
• Current crawls are about 40-60 TByte (compressed)
• Total archive is about 600 TByte (compressed)
• Compression ratio: up to 25:1; best estimate of the overall average is 10:1
• Rate of increase is about 1 TByte/day (compressed)
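A back-of-envelope check of the figures above, taking the 10:1 average compression ratio at face value:

```python
# Archive sizes from the slide (compressed terabytes).
compressed_tb = 600        # total archive, compressed
ratio = 10                 # best-estimate overall compression ratio
growth_tb_per_day = 1      # compressed growth rate

uncompressed_pb = compressed_tb * ratio / 1000  # TB -> PB
yearly_growth_tb = growth_tb_per_day * 365
print(uncompressed_pb, yearly_growth_tb)  # → 6.0 PB uncompressed, 365 TB/year
```

So the collection is roughly 6 petabytes uncompressed and grows by a few hundred compressed terabytes per year, which frames the storage decisions on the next slides.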
Total storage requirement at Cornell will differ because:
• Elimination of data that is duplicated between crawls
• Expansion of metadata for research
• Database indexes
Scale of Data Processing
Balance of Resources
                  Ideal                  Realistic
Networking        500 Mbit/sec           100 Mbit/sec
Data online       all                    few crawls/year
Metadata online   all                    all?
Disk              750 TB                 240 TB
Tape archive      all                    few crawls/year
Computers         separate for research  shared with storage
Equipment
Scidata1 -- Initial Configuration
16-Processor Unisys ES7000 Servers
– 16 GByte RAM
– 8 GByte/sec aggregate I/O bandwidth
100 TByte RAID Online Storage
ADIC Scalar 10K robotic tape library for archive
Separate Web server
Near-term Expansion
• Disk capacity will expand to 240 TByte by end of 2007
Network
• Internet2 with dedicated 100 Mbit/sec link to Internet Archive
Data Processing
Transfer 300-500 GByte per day
Internet2 -- 100 Mbit/sec maximum throughput
Archive raw data to tape
Process raw data
Uncompress and unpack Web pages (ARC) and metadata (DAT) files
Create IDs for pages and content hashes
Database load
Load batches of metadata about pages and links into the database (MS SQL Server 2000)
Store compressed page files
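The ID and content-hash step above can be sketched as follows. The record fields and function name are illustrative assumptions, not the actual Cornell schema:

```python
import gzip
import hashlib

def ingest(url: str, timestamp: str, body: bytes) -> dict:
    """Assign a page a stable ID and a content hash (hypothetical schema)."""
    # Content hash lets duplicate pages across crawls be detected.
    content_hash = hashlib.sha1(body).hexdigest()
    # Page ID derived from (URL, crawl timestamp), unique per capture.
    page_id = hashlib.sha1(f"{url} {timestamp}".encode()).hexdigest()
    return {
        "page_id": page_id,
        "url": url,
        "timestamp": timestamp,
        "content_hash": content_hash,
        "stored_body": gzip.compress(body),  # pages kept compressed on disk
    }

rec = ingest("http://example.com/", "20060101000000", b"<html>hello</html>")
```

Hashing content separately from identity is what allows data duplicated between crawls to be eliminated, as noted on the sizes slide.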
Metadata
URLs, pages and links:
• URLs contained in metadata may link to pages never crawled
• URLs are not canonicalized: different URLs may refer to the same page
• Links are from a page to a URL
Web graph:
• Nodes are the union of pages crawled and URLs seen
• Each node and edge has time interval(s) over which it exists
Content:
• Anchor text in more recent crawls
• File and mime types
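Because URLs are stored as crawled, different spellings of the same URL appear as distinct nodes. A canonicalizer like the sketch below (lower-case host, drop default port and fragment, ensure a path) would merge them; these exact rules are an assumption, not the project's actual policy:

```python
from urllib.parse import urlsplit, urlunsplit

def canonicalize(url: str) -> str:
    """Normalize a URL so trivially different spellings compare equal."""
    parts = urlsplit(url)
    host = parts.hostname or ""          # hostname is already lower-cased
    if parts.port and parts.port not in (80, 443):
        host = f"{host}:{parts.port}"    # keep only non-default ports
    path = parts.path or "/"             # empty path becomes "/"
    # Drop the fragment: it never reaches the server.
    return urlunsplit((parts.scheme.lower(), host, path, parts.query, ""))

print(canonicalize("HTTP://Example.COM:80/index.html#top"))
# → http://example.com/index.html
```

Whether to canonicalize is a real design trade-off: merging URLs shrinks the graph, but loses the distinction between spellings actually seen in the crawl.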
Current Status
Data Capture:
• Connection of Internet Archive to Internet 2 (October 2005)
• Parallel loading of crawled DAT and ARC files (January 2006)
Storage:
• Relational database and preload system: under test; two complete crawls available April/May 2006
• Page store: preliminary design work
Services for Researchers
Under development (available in 2006)
API for users to extract data and download it to their own computers, or to process it on the Scidata1 computer
Retro Browser (browse the Web as it was on a given date)
Subset extraction (select dataset by query of a relational database with indexes by date, URL, domain name, file type, anchor text, etc.)
Extract Web graph from subset, organized as sparse matrix
Full text index of subset (using Nutch/Lucene)
Future
NLP and machine learning tools to analyze text
Full text index of entire collection
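The "extract Web graph from subset" service above can be sketched as follows: given link records selected by a query, build a sparse adjacency matrix in coordinate form. The link data below is hypothetical:

```python
# Hypothetical (from_url, to_url) link records returned by a subset query.
links = [
    ("a.com/1", "b.com/1"),
    ("a.com/1", "c.com/1"),
    ("b.com/1", "c.com/1"),
]

# Nodes are the union of pages crawled and URLs seen (metadata slide).
nodes = sorted({u for pair in links for u in pair})
index = {url: i for i, url in enumerate(nodes)}

# Sparse matrix in coordinate (COO) form: one (row, col) pair per link.
rows = [index[src] for src, _ in links]
cols = [index[dst] for _, dst in links]
print(list(zip(rows, cols)))  # → [(0, 1), (0, 2), (1, 2)]
```

The coordinate form downloads compactly and converts directly into whatever sparse-matrix library the researcher's own analysis code uses.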
Use of the Library (Draft)
Custodianship of data
Make no changes to the content of the Web pages or to identifying metadata such as URL and date.
Copyright
Assume an implied license to use Web data for archiving and academic research. Respect robots.txt exclusions and other requests from copyright owners.
Privacy
Research that might identify individuals is subject to standards that apply to research involving human subjects.
Authentication of users
All users of this library, whether at Cornell or elsewhere, are authenticated. Use restricted to academic, non-commercial research.
Very Large Scale Digital Libraries
Only the computer reads every word
• Researchers interact with the library through computer programs that act as their agents.
• Users rarely view individual items except after preliminary screening by programs.
• The library is a highly technical computer system that is used by researchers who are not computing specialists.
• The library is a super-computing application.
• Use of the library depends on automated tools. These tools require state-of-the-art indexes for text and semi-structured data, natural language processing, and machine learning.
Design Guidelines for Builders of Digital Libraries
Every online collection or service needs an application program interface (API) for computers to interact with the library.
A primary methodology is:
– select a subset of the collection
– download to the researcher's computer
– use programs on the researcher's computer to analyze the data
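The three-step methodology can be illustrated with a toy run; the in-memory collection and query below are hypothetical stand-ins for the library's API:

```python
# A toy stand-in for the collection behind the library's API.
collection = [
    {"url": "a.com/1", "date": "2001-05", "text": "hybrid corn adoption"},
    {"url": "b.com/1", "date": "2003-07", "text": "weblog diffusion"},
    {"url": "c.com/1", "date": "2001-11", "text": "corn futures"},
]

# 1. Select a subset of the collection (pages from 2001 mentioning "corn").
subset = [r for r in collection
          if r["date"].startswith("2001") and "corn" in r["text"]]

# 2. Download the subset to the researcher's computer (here: already local).
# 3. Analyze with the researcher's own programs.
print(len(subset))  # → 2
```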
Almost all metadata will be computer generated, but human cooperative editing can correct errors (see Crane, D-Lib Magazine, March 2006)
Further Information
Web site: http://www.infosci.cornell.edu/SIN/
General overview: Arms, et al., "A Research Library based on the Historical Collections of the Internet Archive". D-Lib Magazine, February 2006. http://www.dlib.org/dlib/february06/arms/02arms.html.
Technical information: Arms, et al.,"Building a Research Library for the History of the Web". Joint Conference on Digital Libraries, 2006.
Thanks
This work would not be possible without the forethought and longstanding commitment of the Internet Archive to capture and preserve the content of the Web for future generations.
This work has been funded in part by the National Science Foundation, grants CNS-0403340, DUE-0127308, and SES-0537606, with equipment support from Unisys and by an E-Science grant and a gift from Microsoft Corporation.