Data and Information Systems Laboratory, University of Illinois Urbana-Champaign. WWW Conference, April 30, 2010. CETR: Content Extraction via Tag Ratios. Tim Weninger, William H. Hsu and Jiawei Han. Department of Computer Science, University of Illinois Urbana-Champaign, Urbana, IL. Department of Computing and Information Sciences, Kansas State University, Manhattan, KS.

CETR: Content Extraction via Tag Ratios

Dec 30, 2015

Transcript
Page 1: CETR: Content Extraction via Tag Ratios

Data and Information Systems Laboratory, University of Illinois Urbana-Champaign

WWW Conference, April 30, 2010

CETR: Content Extraction via Tag Ratios

Tim Weninger, William H. Hsu and Jiawei Han

Department of Computer Science, University of Illinois Urbana-Champaign, Urbana, IL

Department of Computing and Information Sciences, Kansas State University, Manhattan, KS

Page 2: CETR: Content Extraction via Tag Ratios

Problem: Too much junk in a web page

Goal: Extract only the content of a page

Taken from The Hutchinson News on 8/14/2008

Page 3: CETR: Content Extraction via Tag Ratios

Rendered HTML Document

Text content of the document

Published online 8/13/2008

A home away from school

Day care has after-school duties as some clients start academic year

By Kristen Roderick - The Hutchinson News - [email protected]

(Travis Morisse/The Hutchinson News) Mary Waln, 7, and Nija Morris, 6, read “The Magic Mat” together Wednesday at Hadley Day Care.

The doors at Hadley Day Care opened Wednesday afternoon, and children scurried in with tales of their first day of school.

Nija Morris, a 6-year-old attending Faris Elementary, smiled as she hung her pink-and-blue flowered backpack on a hook and talked to her classmates about her first day.

"I played and I did art and I played outside and I went to the gym, and I went inside and did centers," she said. "And then I went to meet the other classes and then we went home."

The school-aged children were a little more wound up on Wednesday, program director Christie Gardner said. The excitement is always higher the first day of school, and not everyone is in a routine.

Example

Page 4: CETR: Content Extraction via Tag Ratios

Naïve Approach
› Remove all HTML tags

Original, Rendered HTML Document

RSSCIRCULATIONYOUR ACCOUNTCONTACT USHomeNews

Top StoriesLocal/Regional NewsBriefsEducationAsk HutchBusinessPublic RecordSpecial sectionsVideosPhotosSlideshowsForums

WeatherObituariesSports

Latest SportsNJCAA Tournament

Opinion EditorialsLetters to the EditorColumnists

Lifestyles FoodHealthReligionOutdoors

Life’s Little MomentsWeddingsEngagementsAnniversariesComing EventsEt cetera

Entertainment PreviewTV listings

JobsAutosClassifiedsMarketplace Archive searchThursday, August 14, 2008 10 : 35 AM Search the Web using Hutch News Weather Forecast today’s top stories

Published online 8/13/2008

A home away from schoolDay care has after-school duties as some clients start academic yearBy Kristen Roderick - The Hutchinson News - [email protected]

(Travis Morisse/The Hutchinson News) Mary Waln, 7, and Nija Morris, 6, read “The Magic Mat” together Wednesday at Hadley Day Care.

The doors at Hadley Day Care opened Wednesday afternoon, and children scurried in with tales of their first day of school.

Nija Morris, a 6-year-old attending Faris Elementary, smiled as she hung her pink-and-blue flowered backpack on a hook and talked to her classmates about her first day.

"I played and I did art and I played outside and I went to the gym, and I went inside and did centers," she said. "And then I went to meet the other classes and then we went home."

All Text of the Document
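The naïve approach can be sketched in a few lines of Python. This is only an illustration using the standard-library `html.parser`; the class name `TextOnly`, the function `strip_all_tags` and the miniature `page` string are hypothetical and not taken from the slides. It shows why tag removal alone is not enough: navigation text survives alongside the article text.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collect every non-empty text node and ignore all tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())


def strip_all_tags(html):
    """Return the text of an HTML string with every tag removed."""
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical miniature page: two navigation links, then one article paragraph.
page = (
    "<ul><li><a href='/'>Home</a></li><li><a href='/news'>News</a></li></ul>"
    "<p>The doors at Hadley Day Care opened Wednesday afternoon.</p>"
)
print(strip_all_tags(page))
```

The menu words "Home" and "News" come out mixed in with the article sentence, which is the junk the remaining slides try to filter.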

Page 5: CETR: Content Extraction via Tag Ratios

Tag-based Approach
› Use HTML tags as clues for content
› Problem: Style-sheets

Original, Rendered HTML Document

<div>
  <div></div>
  <div>
    <div>Eat at Joes</div>
  </div>
  <div>
    <div>
      <div>
        The doors at Hadley Day Care opened Wednesday afternoon, and children scurried in with tales of their first day of school.
      </div>
      <div>
        Nija Morris, a 6-year-old attending Faris Elementary, smiled as she hung her pink-and-blue flowered backpack on a hook and talked to her classmates about her first day.
      </div>
    </div>
  </div>
</div>

Page 6: CETR: Content Extraction via Tag Ratios

• Wrapper Generation
› Learn rules via machine learning from webpage
› Problem: pages vary, web changes

Page 7: CETR: Content Extraction via Tag Ratios

Text-to-Tag Ratio

Algorithm 1: Text-To-Tag Ratio pseudocode

input
    h ← HTML source code
begin
    Remove all script, remark tags and empty lines
    for each line k ← 1 to numLines(h) do
        x ← number of non-tag ASCII characters in h[k]
        y ← number of tags in h[k]
        if y = 0 then
            TTRArray[k] ← x
        else
            TTRArray[k] ← x / y
        end if
    end for
    return TTRArray
end
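Algorithm 1 might look like this in Python. It is a minimal sketch, not the authors' implementation. The regular expressions are a simplification of real HTML parsing, and counting all non-tag characters after trimming surrounding whitespace (rather than ASCII characters only) is an assumption. The names `TAG`, `NOISE` and `text_to_tag_ratios` are illustrative.

```python
import re

# A tag is approximated as anything between "<" and the next ">".
TAG = re.compile(r"<[^>]*>")
# Script blocks and HTML comments ("remarks") are dropped before counting.
NOISE = re.compile(r"<script\b.*?</script\s*>|<!--.*?-->", re.DOTALL | re.IGNORECASE)


def text_to_tag_ratios(html):
    """Return one Text-to-Tag Ratio per non-empty line of the HTML source."""
    cleaned = NOISE.sub("", html)
    lines = [line for line in cleaned.splitlines() if line.strip()]
    ratios = []
    for line in lines:
        tags = len(TAG.findall(line))            # y in Algorithm 1
        text = len(TAG.sub("", line).strip())    # x in Algorithm 1
        ratios.append(text if tags == 0 else text / tags)
    return ratios


# Hypothetical three-line document: a menu line, an article line, a bare text line.
sample = (
    "<div><a href='/'>Home</a></div>\n"
    "\n"
    "<p>Children scurried in with tales of their first day of school.</p>\n"
    "plain text line\n"
)
print(text_to_tag_ratios(sample))
```

The menu line scores low because four tags wrap four characters, while the article line scores much higher. That contrast is the signal the later steps exploit.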

Page 8: CETR: Content Extraction via Tag Ratios

Example

http://www2010.org/www/2010/04/program-guide/

Text: 21 - Tags: 8 -> TTR: 2.63

Text: 22 - Tags: 8 -> TTR: 2.75

Text: 298 - Tags: 6 -> TTR: 49.67

Text: 0 - Tags: 0 -> TTR: 0

Text: 0 - Tags: 1 -> TTR: 0

Page 9: CETR: Content Extraction via Tag Ratios

[Figure: Text-to-Tag Ratio plotted against Line Number]

Page 10: CETR: Content Extraction via Tag Ratios

Preprocessing – Blur the tag ratios

[Figure: two plots of Text-to-Tag Ratio against Line Number]
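One way to blur a tag-ratio array is a small Gaussian-weighted window, sketched below. The slides do not state the kernel or its width, so `radius`, `sigma` and the clamped border handling are all assumptions, and `smooth` is a hypothetical name.

```python
import math


def smooth(ratios, radius=2, sigma=1.0):
    """Blur a TTR array with a normalised Gaussian window; borders are clamped."""
    offsets = range(-radius, radius + 1)
    weights = [math.exp(-(k * k) / (2 * sigma * sigma)) for k in offsets]
    total = sum(weights)
    weights = [w / total for w in weights]  # weights now sum to 1

    n = len(ratios)
    blurred = []
    for i in range(n):
        acc = 0.0
        for k, w in zip(offsets, weights):
            j = min(max(i + k, 0), n - 1)  # reuse the first/last value at the edges
            acc += w * ratios[j]
        blurred.append(acc)
    return blurred


# A single high-TTR line is spread over its neighbours instead of standing alone.
print(smooth([0, 0, 10, 0, 0]))
```

Blurring keeps a short line inside an article, such as a one-word paragraph, from being cut out just because its own ratio is low.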

Page 11: CETR: Content Extraction via Tag Ratios

Apply a Threshold

Threshold based on the standard deviation

[Figure: Text-to-Tag Ratio against Line Number. TTR for Hutchinson News document]

Std. Dev. is 20.3
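The thresholding step can be sketched as follows. Treating the cut-off as a coefficient times the standard deviation follows the later "σ coefficient (λ)" slide. Using the population standard deviation and a `>=` comparison are assumptions, and the function name `content_lines` is hypothetical.

```python
import statistics


def content_lines(smoothed, coefficient=1.0):
    """Return indices of lines whose smoothed TTR reaches coefficient * std. dev."""
    # pstdev needs at least one value; an empty document is not handled here.
    threshold = coefficient * statistics.pstdev(smoothed)
    return [i for i, value in enumerate(smoothed) if value >= threshold]


# Hypothetical smoothed ratios: three high-TTR lines surrounded by low-TTR lines.
ratios = [1, 2, 1, 40, 45, 50, 2, 1]
print(content_lines(ratios))
```

Raising `coefficient` keeps fewer lines and lowering it keeps more. With a standard deviation of 20.3, as reported for the Hutchinson News page, a coefficient of 1 keeps only lines scoring at least 20.3.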

Page 12: CETR: Content Extraction via Tag Ratios

What’s wrong with this?

Page 13: CETR: Content Extraction via Tag Ratios

Worst Case

American Declaration of Independence Web page

[Figure: Text-to-Tag Ratio against Line Number for the American Declaration of Independence. TTR computed from digital copy at http://www.ushistory.org/declaration/document/index.htm]

Page 14: CETR: Content Extraction via Tag Ratios

Histogram Clustering in 2-Dimensions

Looks for jumps in the moving average of TTR

[Figure: two plots against Line Number; the first plots Text-to-Tag Ratio]

Page 15: CETR: Content Extraction via Tag Ratios

Histogram Clustering in 2-Dimensions

Absolute value gives insight

[Figure: two plots against Line Number]
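The "jumps in the moving average" idea, together with the absolute value, can be sketched like this. The window length of 3, the choice to include the current line in the look-ahead window, and the shortened window at the end of the document are assumptions. `jump_sizes` is a hypothetical name.

```python
def jump_sizes(smoothed, window=3):
    """For each line, |mean of the next `window` values - value at that line|."""
    jumps = []
    for i, value in enumerate(smoothed):
        ahead = smoothed[i:i + window]  # shorter near the end of the document
        jumps.append(abs(sum(ahead) / len(ahead) - value))
    return jumps


# A rise into content and a fall out of it both yield positive jump sizes.
print(jump_sizes([0, 0, 0, 30, 30, 30]))
print(jump_sizes([30, 30, 30, 0, 0, 0]))
```

Without the absolute value the second series would produce negative numbers. Taking the magnitude puts the start and the end of a content block on the same scale.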

Page 16: CETR: Content Extraction via Tag Ratios

Histogram Clustering in 2-Dimensions

Make a scatterplot

[Figure: two scatterplots with axes TTR (hʹ) and Differences (g')]

Page 17: CETR: Content Extraction via Tag Ratios

How should we cluster this?

[Figure: scatterplot with axes TTR (hʹ) and Differences (g')]

Page 18: CETR: Content Extraction via Tag Ratios

Modified k-Means

[Figure: scatterplot with axes TTR (hʹ) and Differences (g')]
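The slides name the clustering step only as "Modified k-Means" and do not spell out the modification. The sketch below assumes one plausible reading: one centroid is pinned at the origin, and every point that ends up in that cluster is treated as non-content. The value `k=3`, the deterministic initialisation from the points farthest from the origin, and the name `cluster_content` are all assumptions made for illustration.

```python
import math


def cluster_content(points, k=3, iterations=50):
    """Return one boolean per (ttr, diff) point: True if outside the origin cluster."""
    # Centroid 0 stays at the origin; the rest start at the points farthest from it.
    unique = list(dict.fromkeys(points))
    far_first = sorted(unique, key=lambda p: math.hypot(p[0], p[1]), reverse=True)
    centroids = [(0.0, 0.0)] + [(float(x), float(y)) for x, y in far_first[:k - 1]]

    labels = []
    for _ in range(iterations):
        # Assignment step: nearest centroid by Euclidean distance.
        labels = [
            min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: recompute every centroid except the pinned one.
        updated = [centroids[0]]
        for c in range(1, len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                mean_x = sum(x for x, _ in members) / len(members)
                mean_y = sum(y for _, y in members) / len(members)
                updated.append((mean_x, mean_y))
            else:
                updated.append(centroids[c])
        if updated == centroids:
            break
        centroids = updated
    return [lab != 0 for lab in labels]


# Hypothetical (TTR, difference) pairs: four near the origin, four far from it.
pts = [(1, 0), (2, 1), (0, 0), (1, 1), (50, 5), (55, 3), (60, 40), (58, 45)]
print(cluster_content(pts))
```

Under this reading no threshold has to be chosen by hand: lines whose smoothed ratio and jump size are both small fall to the origin cluster, and everything else is kept.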

Page 19: CETR: Content Extraction via Tag Ratios

Can also use a Max-Margin Approach

But we don’t

[Figure: scatterplot with axes TTR (hʹ) and Differences (g')]

Page 20: CETR: Content Extraction via Tag Ratios

Evaluation method

CleanEval Dataset (Standard Evaluation dataset)
› 741 English documents
› 713 Chinese documents

Myriad 40 Dataset
› 40 News Websites – 206 documents at random

Big 5 Dataset
› NY Post, Freep, Suntimes, Techweb, Tribune – 50 each

BBC and NYTimes
› 50 documents each

Evaluation Metrics
› Precision, Recall, F1-Measure
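The three metrics can be computed as in the sketch below. The slides do not say what unit is compared, so scoring bags of whitespace-separated tokens is an assumption, and `precision_recall_f1` is a hypothetical helper.

```python
from collections import Counter


def precision_recall_f1(extracted, gold):
    """Score extracted text against gold-standard text as bags of tokens."""
    got = Counter(extracted.split())
    want = Counter(gold.split())
    overlap = sum((got & want).values())  # tokens present in both, with multiplicity

    precision = overlap / sum(got.values()) if got else 0.0
    recall = overlap / sum(want.values()) if want else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical case: the extractor kept two menu words along with the full article text.
print(precision_recall_f1("Home News children scurried in", "children scurried in"))
```

Leftover navigation text lowers precision, and dropped article text lowers recall. F1 is the harmonic mean of the two.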

Page 21: CETR: Content Extraction via Tag Ratios

CETR Results

Page 22: CETR: Content Extraction via Tag Ratios

Case study

Page 23: CETR: Content Extraction via Tag Ratios

CETR-TM as the σ coefficient (λ) (i.e. threshold) varies

Page 24: CETR: Content Extraction via Tag Ratios

Worst Case Revisited

Non-HTML or all content pages

[Figure: Text-to-Tag Ratio against Line Number; labels: "approximation", "WWW’10 Paper"]

Page 25: CETR: Content Extraction via Tag Ratios

Conclusions and Future Work

Tag Ratio Approach
› Content extraction technique – outperforms existing methods
› Simple… easy to implement and use
› Static method
› no training required
› Just give a document and get content

Future Work
› Use for page segmentation
› Use in Search/Retrieval

Page 26: CETR: Content Extraction via Tag Ratios

Questions?