Warcbase: Building a Scalable Web Archiving Platform on HBase and Hadoop Jimmy Lin University of Maryland @lintool Ian Milligan University of Waterloo @ianmilligan1 Thursday, June 4, 2015

Feb 10, 2017

Page 1: Warcbase: Building a Scalable Web Archiving Platform on HBase ...

Warcbase: Building a Scalable Web Archiving Platform on HBase and Hadoop

Jimmy Lin, University of Maryland, @lintool

Ian Milligan, University of Waterloo, @ianmilligan1

Thursday, June 4, 2015

Page 2: Warcbase: Building a Scalable Web Archiving Platform on HBase ...

Source: Wikipedia (Opposition to United States involvement in the Vietnam War)

When does an event become “history”?

Page 3: Warcbase: Building a Scalable Web Archiving Platform on HBase ...

When does an event become “history”?

~20-30 years later

The history of the 1960s was written in the 1980s!

Page 4: Warcbase: Building a Scalable Web Archiving Platform on HBase ...
Page 5: Warcbase: Building a Scalable Web Archiving Platform on HBase ...
Page 6: Warcbase: Building a Scalable Web Archiving Platform on HBase ...
Page 7: Warcbase: Building a Scalable Web Archiving Platform on HBase ...

This scandal was brought to you by the digital revolution. That meant we could access all the information we wanted, when we wanted it, anytime, anywhere, and when the story broke in January 1998, it broke online. It was the first time the traditional news was usurped by the Internet for a major news story, a click that reverberated around the world…

… it was before social media, but people could still comment online, email stories, and, of course, email cruel jokes. News sources plastered photos of me all over to sell newspapers, banner ads online, and to keep people tuned to the TV.

Monica Lewinsky – The Price of Shame

TED Talk, March 2015

So, where are those web pages now?

Page 8: Warcbase: Building a Scalable Web Archiving Platform on HBase ...

When does an event become “history”?

~20-30 years later

Historians are getting ready to write about the 90s…

Right about now!

Page 9: Warcbase: Building a Scalable Web Archiving Platform on HBase ...

Can you write a history of the 1990s without the web?

So, where are those web pages now?

Page 10: Warcbase: Building a Scalable Web Archiving Platform on HBase ...
Page 11: Warcbase: Building a Scalable Web Archiving Platform on HBase ...

And beyond…

70+ web archiving efforts worldwide!

Page 12: Warcbase: Building a Scalable Web Archiving Platform on HBase ...

Can you write a history of the 1990s without the web?

So, where are those web pages now?

Great, let’s get to work…

Page 13: Warcbase: Building a Scalable Web Archiving Platform on HBase ...

Source: http://www.flickr.com/photos/cheryne/8417457803/

Users can’t do much with current web archives

Hard to develop tools for non-existent needs

Page 14: Warcbase: Building a Scalable Web Archiving Platform on HBase ...

We need deep collaborations between:

Users (e.g., archivists, journalists, historians, digital humanists, etc.)

Tool builders (me and my colleagues)

Goal: tools to support exploration and discovery in web archives

Beyond browsing…

Beyond searching…

Source: http://waterloocyclingclub.ca/wp-content/uploads/2013/05/Help-Wanted-Sign.jpg

Page 15: Warcbase: Building a Scalable Web Archiving Platform on HBase ...
Page 16: Warcbase: Building a Scalable Web Archiving Platform on HBase ...

Source: Google

What would a web archiving platform built on modern big data infrastructure look like?

Page 17: Warcbase: Building a Scalable Web Archiving Platform on HBase ...

OpenWayback: monolithic Tomcat application

Scalable storage of archived data

Efficient random access

Scalable processing and analytics

Scalable storage and access of derived data

Desiderata

Some work by the Internet Archive, Common Crawl, and others…

Petabox by Internet Archive; NAS, SAN, etc. by others

Ad hoc storage in flat text WAT files

Page 18: Warcbase: Building a Scalable Web Archiving Platform on HBase ...

Desiderata

Scalable storage of archived data: HDFS

Efficient random access: HBase

Scalable processing and analytics: Hadoop

Scalable storage and access of derived data: HBase

Existing tools aren’t adequate!

Page 19: Warcbase: Building a Scalable Web Archiving Platform on HBase ...

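[Diagram: HDFS architecture. An application uses the HDFS client, which sends (file name, block id) to the HDFS namenode and receives (block id, block location). The namenode holds the file namespace (e.g., /foo/bar maps to block 3df2), sends instructions to the datanodes, and receives datanode state. The client then sends (block id, byte range) to an HDFS datanode, each of which stores blocks on a Linux file system, and receives block data.]

Stores data blocks across commodity servers

Scales to 100s of PBs of data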

Open source implementation of the Google File System
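The read path in the diagram can be sketched as a toy, in-memory model. This is only an illustration of the division of labour between namenode and datanode, not the HDFS API: the class names, the `read_range` helper, the datanode names `dn1`/`dn2`, and the use of a block's position in the file (`block_index`) for the first lookup are all assumptions made for the sketch. Only the path `/foo/bar` and the block id `3df2` come from the slide.

```python
class ToyNamenode:
    """Holds metadata only: the file namespace and block locations (hypothetical class)."""

    def __init__(self):
        self.namespace = {}   # file name -> ordered list of block ids
        self.locations = {}   # block id -> names of datanodes holding a replica

    def add_block(self, file_name, block_id, datanode_names):
        self.namespace.setdefault(file_name, []).append(block_id)
        self.locations[block_id] = list(datanode_names)

    def lookup(self, file_name, block_index):
        # Answers the client's metadata request with (block id, block locations).
        block_id = self.namespace[file_name][block_index]
        return block_id, self.locations[block_id]


class ToyDatanode:
    """Holds the actual bytes of each block (hypothetical class)."""

    def __init__(self):
        self.blocks = {}      # block id -> bytes

    def read(self, block_id, start, end):
        # Answers a (block id, byte range) request with block data.
        return self.blocks[block_id][start:end]


def read_range(namenode, datanodes, file_name, block_index, start, end):
    # Step 1: ask the namenode where the block lives (metadata only).
    block_id, holders = namenode.lookup(file_name, block_index)
    # Step 2: fetch the bytes directly from the first datanode holding a replica.
    return datanodes[holders[0]].read(block_id, start, end)


namenode = ToyNamenode()
datanodes = {"dn1": ToyDatanode(), "dn2": ToyDatanode()}
for name in ("dn1", "dn2"):
    datanodes[name].blocks["3df2"] = b"web archive bytes"
namenode.add_block("/foo/bar", "3df2", ["dn1", "dn2"])

data = read_range(namenode, datanodes, "/foo/bar", 0, 4, 11)
print(data)
```

The point of the split is that file contents never pass through the namenode: it answers small metadata questions, and the bulk data flows between client and datanodes.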

Page 20: Warcbase: Building a Scalable Web Archiving Platform on HBase ...

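[Diagram: MapReduce data flow. Input pairs (k1, v1) … (k6, v6) are split across four map tasks, which emit the intermediate pairs (a, 1) (b, 2); (c, 3) (c, 6); (a, 5) (c, 2); (b, 7) (c, 8). Shuffle and sort aggregates values by key: a → 1, 5; b → 2, 7; c → 2, 3, 6, 8. Three reduce tasks then emit (r1, s1), (r2, s2), (r3, s3).]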

Suitable for batch processing on HDFS data

Open source implementation of Google’s framework
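The shuffle-and-sort step on this slide can be reproduced in a few lines of single-process Python. This is a sketch of the programming model only, not of Hadoop itself: the function names are made up for illustration, and the summing reducer is an assumption, since the slide shows only generic outputs (r1, s1), (r2, s2), (r3, s3). The intermediate pairs are the ones shown in the diagram.

```python
from collections import defaultdict


def shuffle_and_sort(map_outputs):
    """Group intermediate values by key, as happens between map and reduce."""
    groups = defaultdict(list)
    for pairs in map_outputs:            # one list of (key, value) per map task
        for key, value in pairs:
            groups[key].append(value)
    # Keys are visited in sorted order; values are sorted to match the slide.
    return {key: sorted(groups[key]) for key in sorted(groups)}


def run_reducers(grouped, reducer):
    """Apply the reducer once per key to that key's list of values."""
    return {key: reducer(key, values) for key, values in grouped.items()}


# Intermediate pairs emitted by the four map tasks in the diagram.
map_outputs = [
    [("a", 1), ("b", 2)],
    [("c", 3), ("c", 6)],
    [("a", 5), ("c", 2)],
    [("b", 7), ("c", 8)],
]

grouped = shuffle_and_sort(map_outputs)
# Assumed reducer for illustration: sum the values for each key.
totals = run_reducers(grouped, lambda key, values: sum(values))
print(grouped)
print(totals)
```

In a real cluster the grouping is distributed: each reduce task receives a partition of the keys, but the contract is the same, with every value for a given key arriving at a single reducer.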

Page 21: Warcbase: Building a Scalable Web Archiving Platform on HBase ...

~ Google’s Bigtable

A collection of tables, each of which represents a sparse, distributed, persistent multidimensional sorted map

Source: Bonaldo Big Table by Alain Gilles
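To make "sparse multidimensional sorted map" concrete, here is a toy, single-machine sketch in Python. It is not the HBase API and it ignores distribution and persistence entirely: `ToySortedTable`, its methods, the reversed-domain row keys, and the integer timestamps are all assumptions chosen for illustration. What it does show is the core idea: cells are addressed by (row, column, timestamp), only cells that exist are stored, and keys are kept in sorted order so that range scans over rows are cheap.

```python
import bisect


class ToySortedTable:
    """A sparse map from (row, column, timestamp) to value, kept in key order."""

    def __init__(self):
        self._keys = []    # sorted list of (row, column, -timestamp)
        self._cells = {}   # same key -> value; absent cells take no space

    def put(self, row, column, timestamp, value):
        key = (row, column, -timestamp)   # negate so newer versions sort first
        if key not in self._cells:
            bisect.insort(self._keys, key)
        self._cells[key] = value

    def get(self, row, column):
        """Return the most recent version of a cell, or None if it is absent."""
        i = bisect.bisect_left(self._keys, (row, column, float("-inf")))
        if i < len(self._keys) and self._keys[i][:2] == (row, column):
            return self._cells[self._keys[i]]
        return None

    def scan(self, row_prefix):
        """Yield (row, column, timestamp, value) for rows starting with row_prefix."""
        i = bisect.bisect_left(self._keys, (row_prefix,))
        while i < len(self._keys) and self._keys[i][0].startswith(row_prefix):
            row, column, neg_ts = self._keys[i]
            yield row, column, -neg_ts, self._cells[self._keys[i]]
            i += 1


table = ToySortedTable()
# Hypothetical row keys and timestamps, for illustration only.
table.put("edu.umd.www/", "content", 20150101, "older capture")
table.put("edu.umd.www/", "content", 20150604, "newer capture")
table.put("ca.uwaterloo.www/", "content", 20150604, "another site")

latest = table.get("edu.umd.www/", "content")
edu_rows = list(table.scan("edu."))
print(latest)
print(edu_rows)
```

Keeping several timestamped versions of the same cell is what makes this shape a natural fit for web archives, where the same URL is captured many times.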

Page 22: Warcbase: Building a Scalable Web Archiving Platform on HBase ...

Warcbase

An open-source platform for managing web archives built on Hadoop and HBase

http://warcbase.org/

Source: Wikipedia (Archive)

Page 23: Warcbase: Building a Scalable Web Archiving Platform on HBase ...

WARC data

Ingestion

Applications and Services

Processing & Analytics

text analysis, link analysis, …

Page 24: Warcbase: Building a Scalable Web Archiving Platform on HBase ...

Warcbase: here.

Page 25: Warcbase: Building a Scalable Web Archiving Platform on HBase ...

Warcbase in a box

Warcbase: here.

✗ cylinder

Page 26: Warcbase: Building a Scalable Web Archiving Platform on HBase ...

Portable Warcbase

Warcbase: here.

Page 27: Warcbase: Building a Scalable Web Archiving Platform on HBase ...

What’s the big deal?

Historians probably can’t afford Hadoop clusters…

But they can probably afford a Mac Pro or MacBook

How will this change historical scholarship?

Visual graph analysis on longitudinal data, select subsets for further textual analysis

Drill down to examine individual pages

… all on your desktop/laptop!

Page 28: Warcbase: Building a Scalable Web Archiving Platform on HBase ...

Warcbase: here.

The price of three cocktails

Page 29: Warcbase: Building a Scalable Web Archiving Platform on HBase ...

What’s the big deal?

Store every page you’ve ever visited in your pocket!

Throw in search, lightweight analytics, …

What will you do with the web in your pocket?

How will this change how you interact with the web?

Page 30: Warcbase: Building a Scalable Web Archiving Platform on HBase ...

So what can you do with Warcbase?