AWS Summit Berlin 2012 Talk on Web Data Commons


Large-Scale Analysis of Web Pages - on a Startup Budget?

Hannes Mühleisen, Web-Based Systems Group

AWS Summit 2012 | Berlin

Our Starting Point

• Websites now embed structured data in HTML

• Various vocabularies possible

• schema.org, Open Graph protocol, ...

• Various encoding formats possible

• μFormats, RDFa, Microdata

Question: How are vocabularies and formats used?
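
To make that concrete, here is a small illustrative sketch (not from the talk; the page snippet and the extraction loop are made up) of what schema.org Microdata embedded in HTML looks like and how it can be pulled back out:

```python
# Tiny illustration of schema.org Microdata embedded in HTML and how it can
# be extracted. The snippet and the extraction loop are illustrative only.
from bs4 import BeautifulSoup

html = """
<div itemscope itemtype="http://schema.org/Product">
  <span itemprop="name">Example Phone</span>
  <span itemprop="price">199.00</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
for item in soup.find_all(attrs={"itemscope": True}):
    item_type = item.get("itemtype")
    props = {p.get("itemprop"): p.get_text(strip=True)
             for p in item.find_all(attrs={"itemprop": True})}
    print(item_type, props)
# -> http://schema.org/Product {'name': 'Example Phone', 'price': '199.00'}
```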

Web Indices

• To answer our question, we need access to raw Web data.

• However, maintaining Web indices is insanely expensive

• Re-crawling, storage, currently ~50 B pages (Google)

• Google and Bing have indices, but do not let outsiders in

Common Crawl

• Non-profit organization

• Runs crawler and provides HTML dumps

• Available data:

• Index 02/2012: 1.7 B URLs (21 TB)

• Index 2009/2010: 2.8 B URLs (29 TB)

• Available on AWS Public Data Sets

Why AWS?

• Now that we have a web crawl, how do we run our analysis?

• Unpacking and DOM parsing on 50 TB? (CPU-heavy!)

• Preliminary analysis: 1 GB / hour / CPU possible

• 8-CPU Desktop: 8 months

• 64-CPU Server: 1 month

• 100 8-CPU EC2 Instances: ~3 days

[Chart: data processed in one hour vs. the Common Crawl dataset size, for 1 CPU, a 1000 € PC, a 5000 € server, and 17 € worth of EC2 instances]
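
As a rough back-of-the-envelope check of those estimates (a sketch; the ~50 TB corpus size and the 1 GB / hour / CPU throughput are taken from the slide above):

```python
# Back-of-the-envelope estimate of total processing time, based on the
# figures from the slide: ~1 GB per CPU-hour, ~50 TB of crawl data.
corpus_gb = 50 * 1000          # ~50 TB expressed in GB
gb_per_cpu_hour = 1            # measured throughput: 1 GB / hour / CPU

cpu_hours = corpus_gb / gb_per_cpu_hour   # ~50,000 CPU-hours in total

for label, cpus in [("8-CPU desktop", 8),
                    ("64-CPU server", 64),
                    ("100 x 8-CPU EC2 instances", 800)]:
    days = cpu_hours / cpus / 24
    print(f"{label}: ~{days:.0f} days")
# -> roughly 260 days (~8 months), 33 days (~1 month), and 2-3 days respectively
```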

AWS Setup

• Data Input: Read index splits from S3

• Job Coordination: SQS message queue

• Workers: 100 EC2 Spot Instances (c1.xlarge, ~0.17 € / h)

• Result Output: Write to S3

• Logging: SDB

[Diagram: input splits from the CC bucket on S3 flow through an SQS task queue (tasks 42, 43, ...) to EC2 workers, which write results (R42, R43, ...) back to the WDC bucket on S3]

• Each input file queued in SQS

• EC2 workers take tasks from SQS

• Workers read and write S3 buckets
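
A minimal sketch of what one such worker loop might look like, written against today's boto3 SDK (the talk does not include the actual extraction code; queue URL, bucket names, and the processing step are placeholders):

```python
# Hypothetical worker loop: pull one task (an input split name) from SQS,
# fetch the split from the crawl bucket on S3, process it, and write the
# result back to an output bucket. Names and processing logic are placeholders.
import boto3

sqs = boto3.client("sqs")
s3 = boto3.client("s3")
QUEUE_URL = "https://sqs.eu-west-1.amazonaws.com/123456789012/wdc-tasks"  # placeholder

def process(html_dump: bytes) -> bytes:
    # Placeholder for the actual work: unpack the archive, parse the DOM,
    # and extract RDFa / Microdata / microformat statements.
    return html_dump[:0]

while True:
    msgs = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=1,
                               WaitTimeSeconds=20)
    for msg in msgs.get("Messages", []):
        split_key = msg["Body"]                      # e.g. "splits/42"
        obj = s3.get_object(Bucket="commoncrawl-input", Key=split_key)
        result = process(obj["Body"].read())
        s3.put_object(Bucket="wdc-output", Key="results/" + split_key, Body=result)
        # Delete the message only after the result is written, so a crashed
        # worker's task becomes visible again in SQS and is retried.
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
```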

Results - Types of Data

[Chart: entity count (log scale) by type, for Microdata 02/2012, RDFa 02/2012, RDFa 2009/2010, and Microdata 2009/2010]

2012 Microdata breakdown:

• Website Structure: 23 %

• Products, Reviews: 19 %

• Movies, Music, ...: 15 %

• Geodata: 8 %

• People, Organizations: 7 %

• Available data largely determined by major player support

• “If Google consumes it, we will publish it”

Results - Formats

• URLs with embedded data: +6%

• Microdata: +14% (schema.org?)

• RDFa: +26% (Facebook?)

[Chart: percentage of URLs per format (RDFa, Microdata, geo, hcalendar, hcard, hreview, XFN), 2009/2010 vs. 02-2012]

Results - Extracted Data

• Extracted data available for download at

• www.webdatacommons.org

• Formats: RDF (~90 GB) and CSV tables for microformats (!)

• Have a look!
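
For a sense of how that download can be consumed, here is a minimal sketch (not from the talk) assuming the RDF dump is in N-Quads form, one statement per line with the source page as the fourth element; the file name and the naive line parsing are placeholders:

```python
# Sketch: count extracted statements per source host in a gzipped N-Quads
# dump. The file name is a placeholder and the parsing is deliberately naive,
# only meant to illustrate the line-oriented quad layout.
import gzip
from collections import Counter
from urllib.parse import urlparse

hosts = Counter()
with gzip.open("wdc-sample.nq.gz", "rt", encoding="utf-8", errors="replace") as f:
    for line in f:
        line = line.rstrip()
        if not line.endswith("."):
            continue
        tokens = line[:-1].split()
        if len(tokens) < 4:
            continue
        source_page = tokens[-1].strip("<>")   # last element of a quad: the page it came from
        hosts[urlparse(source_page).netloc] += 1

print(hosts.most_common(10))
```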

AWS Costs

• Ca. 5500 machine-hours were required

• 1100 € billed by AWS for that

• Cost for other services negligible *

• * At first, we underestimated SDB cost
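
Those figures are roughly consistent with the spot price quoted on the setup slide (a back-of-the-envelope check; the exact billing breakdown is not given in the talk):

```python
# Rough sanity check of the AWS bill against the quoted spot price.
machine_hours = 5500
spot_price_eur = 0.17                           # c1.xlarge spot price from the setup slide

compute_cost = machine_hours * spot_price_eur   # ~935 €
print(f"EC2 spot cost: ~{compute_cost:.0f} €")
# The remaining ~165 € of the ~1100 € bill presumably covers S3, SQS, data
# transfer and the initially underestimated SDB logging.
```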

Takeaways

• Web Data Commons now publishes the largest set of structured data from Web pages available

• Large-scale Web analysis is now possible with Common Crawl datasets

• AWS is great for massive ad-hoc computing power and complexity reduction

• Choose your architecture wisely and test by experiment; for us, EMR was too expensive.

Thank You!

Web Resources: http://webdatacommons.org | http://hannes.muehleisen.org

Questions? Want to hire me?
