Information extraction 3. Design considerations, crawling and scraping Simon Razniewski Winter semester 2019/20 1

Information extraction 3. Design considerations, crawling and scraping · 2019-10-31 · Information extraction 3. Design considerations, crawling and scraping Simon Razniewski Winter

Jul 16, 2020


Information extraction

3. Design considerations,

crawling and scraping

Simon Razniewski

Winter semester 2019/20


Announcements

• Assignments

• Do not plagiarize

• Submit outputs where asked

• No lecture nor tutorial next week

• Automating extraction?

• Stay tuned…

• Visualizing KGs

• https://www.wikidata.org/wiki/Wikidata:Tools/Visualize_data

• https://angryloki.github.io/wikidata-graph-builder/?property=P40&item=Q3044&iterations=100&limit=100

• https://angryloki.github.io/wikidata-graph-builder/?property=P737&item=Q937&iterations=100&limit=100

• https://gate.d5.mpi-inf.mpg.de/webyago3spotlxComp/SvgBrowser/

• https://developers.google.com/knowledge-graph


• https://www.reddit.com/r/wikipedia/comments/dg6pnl/the_death_date_of_lucius_pinarius_wasnt_added_so/

• https://www.wikidata.org/wiki/Wikidata:Project_chat#unknown_values_for_people_who_have_long-since_died


Outline

1. Design considerations

2. Crawling

3. Scraping


IE design considerations

1. What should be the output?

• Type of information

• Quality requirements

2. What is the best suited input?

3. Which method to get from input to output?


Inputs

• Premium Sources (Wikipedia, IMDB, …)

• Semi-Structured Data (Infoboxes, Tables, Lists …)

• Text Documents & Web Pages

• Conversations & Behavior

• Online Forums & Social Media

• Queries & Clicks

• Difficult Text (Books, Interviews …)

• High-Quality Text (News Articles, Wikipedia …)

• Web collections (Web crawls)

Methods

• Rules & Patterns

• Logical Inference

• Statistical Inference

• Deep Learning

• NLP Tools

Outputs

• Entity Names, Aliases & Classes

• Entities in Taxonomy

• Relational Statements

• Rules & Constraints

• Canonicalized Statements

[Overview diagram of inputs, methods and outputs repeated, with the Crawling step highlighted]

[Overview diagram of inputs, methods and outputs repeated, with the Scraping step highlighted]

Outline

1. Design considerations

2. Crawling

3. Scraping


Acknowledgment

• Material adapted from Fabian Suchanek and Antoine Amarilli


Freshness problem (2)

• Prediction problem: Estimate page change frequency

• From previous change behavior

• Or from page content

• Optimization problem: Decide crawl frequency

• Fixed budget → how to distribute the crawls across pages

• Flexible budget → cost-benefit framework needed


Estimating change frequencies

• Cho and Garcia-Molina, TOIT 2003

• Model changes as Poisson processes (i.e., memoryless/statistically independent)

• Extrapolate change frequency from previous visits

• Example: daily visits for 10 days, 6 changes detected → change frequency 0.6 changes/day?

• Extrapolation underestimates the change frequency, because several changes between two visits are detected as a single change

• Liang et al., IJCAI 2017

• Monitor news websites

• Build supervised prediction models based on page features

• Wijaya et al., EMNLP 2015

• Wikipedia-specific

• Learn state-change-indicating terms

• E.g., engage, divorce
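The underestimation in the 10-visit example can be made concrete with a small sketch. It assumes the Poisson model named above: with rate λ and visit interval Δ, a visit detects a change with probability 1 − e^(−λΔ), so the rate can be recovered as −ln(1 − X/n)/Δ from X detected changes in n visits. The function names are illustrative only, and this is the plain maximum-likelihood form, not necessarily the exact estimator proposed in the cited paper.

```python
import math


def naive_rate(visits: int, changes_detected: int, interval: float = 1.0) -> float:
    """Naive extrapolation: detected changes divided by observed time."""
    return changes_detected / (visits * interval)


def poisson_rate(visits: int, changes_detected: int, interval: float = 1.0) -> float:
    """Rate estimate under a Poisson change model (assumption of this sketch).

    A visit sees "changed" with probability 1 - exp(-rate * interval),
    no matter how many changes happened since the previous visit.
    """
    if changes_detected >= visits:
        raise ValueError("a change at every visit: rate cannot be bounded from above")
    unchanged_fraction = 1 - changes_detected / visits
    return -math.log(unchanged_fraction) / interval


# The example from the slide: daily visits for 10 days, 6 changes detected.
print(naive_rate(10, 6))               # 0.6
print(round(poisson_rate(10, 6), 3))   # 0.916
```

The corrected estimate is higher than 0.6 changes/day because some of the six detected changes are likely to hide more than one actual change.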


Wijaya et al., EMNLP 2015


Distributing crawl resources

[Razniewski, CIKM 2016]

• Ingredients:

• Benefit of an up-to-date website

• Synonymous: cost of outdated website

• Cost of a crawl action

• Decay behavior

→ Page-specific recrawl frequency that maximizes benefit minus cost


Decay behaviour


Observed decay behaviour


Average freshness F


Net income NI

B … benefit per time unit

F … average freshness

Λ … decay coefficient

u … update interval length

C … cost of an update

Optimum via common algebra

Examples for address updates

[Plot: NI over u]

Assumption: benefit over one year = 100 × cost of a single crawl

Actual ratio is orders of magnitude lower, e.g., 0.003 cents/crawl

[http://www.michaelnielsen.org/ddi/how-to-crawl-a-quarter-billion-webpages-in-40-hours/]

(and for $580 on Amazon EC2)
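The formulas for average freshness and net income were shown as images and are not part of this transcript. The following is therefore only a sketch under a stated assumption: freshness decays exponentially with the decay coefficient, so that F(u) = (1 − e^(−Λu)) / (Λu) and NI(u) = B · F(u) − C/u. The benefit-to-cost ratio of 100 is the one from the slide; the decay coefficient of 0.1 per year and all function names are hypothetical. A grid search stands in for the algebraic optimum.

```python
import math


def average_freshness(decay: float, u: float) -> float:
    """Mean of exp(-decay * t) over one update interval [0, u] (assumed decay model)."""
    return (1.0 - math.exp(-decay * u)) / (decay * u)


def net_income(benefit: float, cost: float, decay: float, u: float) -> float:
    """Benefit per time unit scaled by average freshness, minus update cost per time unit."""
    return benefit * average_freshness(decay, u) - cost / u


def best_interval(benefit: float, cost: float, decay: float,
                  lo: float = 0.001, hi: float = 10.0, steps: int = 10000) -> float:
    """Grid search for the update interval u that maximizes net income."""
    candidates = [lo + i * (hi - lo) / steps for i in range(steps + 1)]
    return max(candidates, key=lambda u: net_income(benefit, cost, decay, u))


B, C = 100.0, 1.0   # slide assumption: benefit over one year = 100 x cost of one crawl
DECAY = 0.1         # hypothetical decay coefficient per year, chosen for illustration

u_star = best_interval(B, C, DECAY)
print(round(u_star, 2))  # about 0.45 years between crawls for these assumed values
```

Crawling more often than this pays more in crawl cost than it gains in freshness; crawling less often loses more benefit than it saves.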


and later Google)


https://www.mpi-inf.mpg.de/robots.txt

https://www.google.de/robots.txt


Try often enough


Deep web / dark web


Insights from crawling mpi-inf.mpg.de

• URL ending inclusion/exclusion criteria need thought

• Long (machine-generated) URLs need exclusion

• Beyond that no issues

• 35 lines in Python

• Sequential runtime for 2000 pages: ~10 minutes

• Completeness?
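The crawler itself is not included in the transcript. A minimal sketch of a breadth-first crawl with the two filters mentioned above (URL endings and overly long URLs) might look as follows. Everything here is an assumption for illustration: the excluded endings, the length limit, the function names, and the in-memory `FAKE_WEB` that replaces real HTTP requests so that the example runs offline.

```python
from collections import deque
from urllib.parse import urlparse

SKIP_ENDINGS = (".pdf", ".jpg", ".png", ".zip", ".ps")  # assumed exclusion list
MAX_URL_LENGTH = 100                                     # assumed cut-off for generated URLs


def should_visit(url: str, host: str) -> bool:
    """Keep only same-host URLs that are short and do not end in an excluded suffix."""
    parsed = urlparse(url)
    if parsed.netloc != host or len(url) > MAX_URL_LENGTH:
        return False
    return not parsed.path.lower().endswith(SKIP_ENDINGS)


def bfs_crawl(seed: str, fetch_links, max_pages: int = 2000) -> list:
    """Visit pages breadth-first; fetch_links(url) returns the outgoing links of a page."""
    host = urlparse(seed).netloc
    queue = deque([seed])
    seen = {seed}
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for link in fetch_links(url):
            if link not in seen and should_visit(link, host):
                seen.add(link)
                queue.append(link)
    return visited


# Hypothetical link graph standing in for a real site.
FAKE_WEB = {
    "https://example.org/": ["https://example.org/a", "https://example.org/b.pdf",
                             "https://other.example/x"],
    "https://example.org/a": ["https://example.org/", "https://example.org/c",
                              "https://example.org/?" + "q=1&" * 40],
    "https://example.org/c": [],
}

order = bfs_crawl("https://example.org/", lambda url: FAKE_WEB.get(url, []))
print(order)
```

In a real crawl, `fetch_links` would download the page and extract its links, and it should also respect robots.txt and rate limits.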


Outline

1. Design considerations

2. Crawling

3. Scraping

[Overview diagram of inputs, methods and outputs repeated, first with Crawling and then with Scraping highlighted]

Scraping aims to reconstruct the KB


[https://www.w3schools.com/xml/xml_xpath.asp]

[https://devhints.io/xpath]

https://www.freeformatter.com/xpath-tester.html

<html>

<body>

<b>Shrek</b>

<ul>

<li>Creator: <b>W. Steig</b></li>

<li>Duration: <i>84m</i></li>

</ul>

</body>

</html>
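The toy page above can also be queried from Python. The sketch below uses the standard library's `xml.etree.ElementTree`, which supports only a limited subset of XPath; this works here because the snippet is well-formed XML. Real-world HTML usually needs a dedicated HTML parser, and full XPath needs a library such as lxml. The variable names are illustrative.

```python
import xml.etree.ElementTree as ET

PAGE = """<html>
<body>
<b>Shrek</b>
<ul>
<li>Creator: <b>W. Steig</b></li>
<li>Duration: <i>84m</i></li>
</ul>
</body>
</html>"""

root = ET.fromstring(PAGE)  # root is the <html> element

title = root.find("./body/b").text     # <b> directly below <body>
creator = root.find(".//li/b").text    # <b> inside any <li>
duration = root.find(".//li/i").text   # <i> inside any <li>
labels = [li.text.strip() for li in root.findall(".//li")]  # text before the child tag

print(title, creator, duration)  # Shrek W. Steig 84m
print(labels)                    # ['Creator:', 'Duration:']
```

The first query relies on position in the tree (a `<b>` directly under `<body>`), which is exactly the kind of template regularity that scraping exploits.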


Scraping: Browser

• “Try XPath” Firefox add-on

• //h3[@class='pi-data-label pi-secondary-font']

• Firefox console

• $x('//h3[@class=\'pi-data-label pi-secondary-font\']')

• //h3[@class='pi-data-label pi-secondary-font'] | //div[@class='pi-data-value pi-font']


Scraping in Python - XPath


gender: string)))


Crescenzi et al., VLDB 2001

http://www.vldb.org/conf/2001/P109.pdf

Finds least upper bounds in regex lattice


Crescenzi et al., VLDB 2001

http://www.vldb.org/conf/2001/P109.pdf


Scraping in Python – BeautifulSoup (1)

• Python library for pulling data out of HTML and XML files.

<html>
<head>
<title>The Dormouse's story</title>
</head>
<body>
Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and …

soup.title
# <title>The Dormouse's story</title>

soup.title.string
# "The Dormouse's story"

soup.title.parent.name
# 'head'

soup.a
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

soup.find_all('a')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]


Scraping in Python – BeautifulSoup (2)


XPath vs. BeautifulSoup vs …

• XPath: Generic query language to select nodes in XML (HTML) documents

• Queries can be issued from Python, Java, C, …

• BeautifulSoup

• Python library to manipulate websites as Python objects

• Scrapy

• Python library to crawl websites

• Selenium

• Actual scripted browser interaction

→ To get around JavaScript etc.


https://www.udemy.com/tutorial/scrapy-tutorial-web-scraping-with-python/scrapy-vs-beautiful-soup-vs-selenium/


Assignment 3

• No crawling (ethics…)

• 1x Extraction from dump – infobox treasure

• Remember design considerations

• XML format, but essential content not structured by XML tags → pattern matching/regex

• 2x Scraping

• BeautifulSoup recommended, but XPath fine as well

• Reading on large-scale WP extraction: DBpedia extraction framework
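As a hedged illustration of the pattern-matching point: in a Wikipedia dump the infobox sits inside the page text as template markup, not as XML elements, so its fields have to be picked out with patterns. The wikitext below is a hypothetical snippet modelled on the Shrek toy page, not real dump content, and the regex and function name are assumptions of this sketch.

```python
import re

# One "| key = value" line per infobox field (simplifying assumption).
INFOBOX_FIELD = re.compile(r"^\|[ \t]*(\w+)[ \t]*=[ \t]*(.*?)[ \t]*$", re.MULTILINE)


def parse_infobox(wikitext: str) -> dict:
    """Return the key/value pairs of all single-line infobox fields."""
    return {key: value for key, value in INFOBOX_FIELD.findall(wikitext)}


# Hypothetical wikitext, mirroring the toy example used for XPath.
WIKITEXT = """{{Infobox film
| name     = Shrek
| creator  = W. Steig
| duration = 84m
}}"""

fields = parse_infobox(WIKITEXT)
print(fields)  # {'name': 'Shrek', 'creator': 'W. Steig', 'duration': '84m'}
```

Real infoboxes contain nested templates, links and multi-line values, which this single-line pattern does not handle.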


Take home

1. Think about goal, sources, methods

2. Crawling

• BFS to achieve coverage

• Challenges with traps and deep web

3. Scraping

• Reverse-engineering of template-based websites
