Large-Scale Analysis of Web Pages on a Startup Budget? Hannes Mühleisen, Web-Based Systems Group AWS Summit 2012 | Berlin
AWS Summit Berlin 2012 Talk on Web Data Commons

May 06, 2015

Transcript
Page 1: AWS Summit Berlin 2012 Talk on Web Data Commons

Large-Scale Analysis of Web Pages - on a Startup Budget?

Hannes Mühleisen, Web-Based Systems Group

AWS Summit 2012 | Berlin

Our Starting Point

• Websites now embed structured data in HTML

• Various Vocabularies possible

    • schema.org, Open Graph protocol, ...

• Various Encoding Formats possible

    • μFormats, RDFa, Microdata

Question: How are Vocabularies and Formats used?

Web Indices

• To answer our question, we need access to raw Web data.

• However, maintaining Web indices is insanely expensive

    • Re-crawling, storage; currently ~50 B pages (Google)

• Google and Bing have indices, but do not let outsiders in

Common Crawl

• Non-Profit Organization

• Runs crawler and provides HTML dumps

• Available data:

    • Index 02-12: 1.7 B URLs (21 TB)

    • Index 09/12: 2.8 B URLs (29 TB)

• Available on AWS Public Data Sets

Why AWS?

• Now that we have a web crawl, how do we run our analysis?

• Unpacking and DOM-Parsing on 50 TB? (CPU-heavy!)

• Preliminary analysis: 1 GB / hour / CPU possible

    • 8-CPU Desktop: 8 months

    • 64-CPU Server: 1 month

    • 100 8-CPU EC2-Instances: ~ 3 days
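
The runtime estimates above can be reproduced with a short calculation. This is only a back-of-the-envelope sketch: it assumes 1 TB = 1000 GB, perfectly parallel work, and no I/O or scheduling overhead, and the function and variable names are illustrative rather than taken from the talk.

```python
# Back-of-the-envelope runtime estimate for the figures on this slide.
DATA_GB = 50 * 1000          # ~50 TB of crawl data, assuming 1 TB = 1000 GB
GB_PER_CPU_HOUR = 1.0        # throughput from the preliminary analysis


def wall_clock_hours(total_cpus: int) -> float:
    """Hours needed if the work parallelises perfectly across all CPUs."""
    return DATA_GB / (GB_PER_CPU_HOUR * total_cpus)


setups = {
    "8-CPU desktop": 8,
    "64-CPU server": 64,
    "100 x 8-CPU EC2 instances": 100 * 8,
}

for name, cpus in setups.items():
    hours = wall_clock_hours(cpus)
    print(f"{name}: {hours:.1f} h = {hours / 24:.1f} days")
```

Under these assumptions the three setups need 6250 h, about 781 h, and 62.5 h, which matches the rounded figures of 8 months, 1 month, and roughly 3 days.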

[Figure: Common Crawl Dataset Size. Labels: 1 CPU, 1 h; 1000 € PC, 1 h; 5000 € Server, 1 h; 17 € EC2 Instances, 1 h]

AWS Setup

• Data Input: Read Index Splits from S3

• Job Coordination: SQS Message Queue

• Workers: 100 EC2 Spot Instances (c1.xlarge, ~0.17 € / h)

• Result Output: Write to S3

• Logging: SimpleDB (SDB)

[Diagram: input files 42, 43, ... in the CC bucket on S3; file 42 queued in SQS; EC2 workers; results R42, R43, ... in the WDC bucket on S3]

• Each input file queued in SQS

• EC2 Workers take tasks from SQS

• Workers read and write S3 buckets
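
The queue-and-worker pattern described above can be sketched locally. The following is not the project's code: it uses Python's in-memory `queue.Queue` and plain dictionaries as stand-ins for SQS and the S3 buckets, threads as stand-ins for EC2 instances, and a hypothetical `extract` function in place of the real extraction step.

```python
import queue
import threading

# In-memory stand-ins (assumptions for illustration only):
#   input_bucket  ~ S3 bucket holding the crawl files
#   output_bucket ~ S3 bucket receiving extraction results
#   tasks         ~ SQS queue holding one message per input file
input_bucket = {str(n): f"<html>page data {n}</html>" for n in range(42, 52)}
output_bucket = {}
output_lock = threading.Lock()
tasks = queue.Queue()


def extract(raw: str) -> str:
    """Hypothetical placeholder for the real structured-data extraction."""
    return raw.upper()


def worker() -> None:
    """Take file keys from the queue until it is empty, then exit."""
    while True:
        try:
            key = tasks.get_nowait()
        except queue.Empty:
            return
        result = extract(input_bucket[key])      # read input "from S3"
        with output_lock:
            output_bucket[f"R{key}"] = result    # write result "to S3"
        tasks.task_done()


def run(num_workers: int = 4) -> None:
    for key in input_bucket:                     # queue every input file once
        tasks.put(key)
    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


run()
print(sorted(output_bucket))
```

Workers never talk to each other; the queue is the only coordination point, so adding or losing a worker does not require any reconfiguration. A real SQS queue differs from this stand-in in one important way: a received message stays hidden only for a visibility timeout and reappears unless the worker deletes it, which lets another worker pick up a task after an instance is terminated.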

Results - Types of Data

[Figure: Entity Count (log) by Type. Series: Microdata 02/2012, RDFa 02/2012, RDFa 2009/2010, Microdata 2009/2010]

2012 Microdata Breakdown

Website Structure 23 %

Products, Reviews 19 %

Movies, Music, ... 15 %

Geodata 8 %

People, Organizations 7 %

• Available data largely determined by major player support

    • “If Google consumes it, we will publish it”

Results - Formats

• URLs with embedded Data: +6%

• Microdata +14% (schema.org?)

• RDFa +26% (Facebook?)

[Figure: Percentage of URLs by Format (RDFa, Microdata, geo, hcalendar, hcard, hreview, XFN). Series: 2009/2010, 02-2012]

Results - Extracted Data

• Extracted data available for download at www.webdatacommons.org

• Formats: RDF (~90 GB) and CSV Tables for Microformats (!)

• Have a look!

AWS Costs

• Approx. 5500 machine-hours were required

• 1100 € billed by AWS for that

• Cost for other services negligible *

* At first, we underestimated SDB cost
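
As a rough cross-check of these numbers, the sketch below derives the effective hourly rate and compares it with the approximate spot price of 0.17 € / h quoted for c1.xlarge in the setup. It assumes the load was spread evenly over 100 workers; the variable names are illustrative, and the slides do not break down the difference between the two totals.

```python
MACHINE_HOURS = 5500         # approximate machine-hours from the slide
BILLED_EUR = 1100            # amount billed by AWS
SPOT_PRICE_EUR = 0.17        # approximate c1.xlarge spot price per hour
NUM_WORKERS = 100            # worker count from the AWS setup

effective_rate = BILLED_EUR / MACHINE_HOURS
at_quoted_price = MACHINE_HOURS * SPOT_PRICE_EUR
hours_per_worker = MACHINE_HOURS / NUM_WORKERS   # assumes an even spread

print(f"effective rate: {effective_rate:.2f} EUR per machine-hour")
print(f"cost at quoted spot price: {at_quoted_price:.0f} EUR")
print(f"wall clock per worker: {hours_per_worker:.0f} h")
```

The effective rate works out to 0.20 € per machine-hour, somewhat above the quoted spot price, and 55 hours per worker is a little over two days, in line with the earlier estimate of about 3 days.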

Takeaways

• Web Data Commons now publishes the largest set of structured data from Web pages available

• Large-Scale Web Analysis now possible with Common Crawl datasets

• AWS great for massive ad-hoc computing power and complexity reduction

• Choose your architecture wisely and test by experiment; for us, EMR was too expensive.

Thank You!

Web Resources:

http://webdatacommons.org

http://hannes.muehleisen.org

Questions? Want to hire me?