Page 1: Adaptive web page content identification

Adaptive Web-page Content Identification
Authors: John Gibson, Ben Wellner, Susan Lubar

Publication: WIDM’2007

Presenter: Jhih-Ming Chen


Page 2: Adaptive web page content identification

Outline

• Introduction
• Content Identification as Sequence Labeling
• Models
  • Conditional Random Fields
  • Maximum Entropy Classifier
  • Maximum Entropy Markov Models
• Data
  • Harvesting and Annotation
  • Dividing into Blocks
• Experimental Setup
• Results and Analysis
• Conclusion and Future Work

Page 3: Adaptive web page content identification

Introduction

• Web pages containing news stories also include many other pieces of extraneous information such as navigation bars, JavaScript, images and advertisements.

• This paper’s goal:
  • Detect identical and near duplicate articles within a set of web pages from a conglomerate of websites.

• Why do this work?
  • Provide input for an application such as a Natural Language tool or for an index for a search engine.
  • Re-display content on a small screen such as a cell phone or PDA.


Page 4: Adaptive web page content identification

Introduction

• Typically, content extraction is done via a hand-crafted tool targeted to handle a single web page format.

• Shortcomings:
  • When the page format changes, the extractor is likely to break.
  • It is labor intensive.

• Web page formats change fairly quickly and custom extractors often become obsolete a short time after they are written.

• Some websites use multiple formats concurrently; identifying and handling each one properly makes this a complex task.

• In general, the site-specific extractors are unworkable as a long-term solution.

• The approach described in this paper is meant to overcome these issues.

Page 5: Adaptive web page content identification

Introduction

• The data set for this work consisted of web pages from 27 different news sites.

• Identifying portions of relevant content in web-pages can be construed as a sequence labeling problem.
  • i.e. each document is broken into a sequence of blocks and the task is to label each block as Content or NotContent.

• The best system, based on Conditional Random Fields, can correctly identify individual Content blocks with recall above 99.5% and precision above 97.9%.


Page 6: Adaptive web page content identification

Content Identification as Sequence Labeling

• Problem description
  • Identify the portions of news-source web-pages that contain relevant content – i.e. the news article itself.

• Two general ways to approach this problem:
  • Boundary Detection Method
    • Identify positions where content begins and ends.
  • Sequence Labeling Method
    • Divide the original Web document into a sequence of appropriately sized units or blocks.
    • Categorize each block as Content or NotContent.


Page 7: Adaptive web page content identification

Content Identification as Sequence Labeling

• In this work, the authors focus largely on the sequence labeling method.
• Shortcomings of boundary detection methods:
  • A number of web pages contain ‘noncontiguous’ content.
    • The paragraphs of the article body have other page content, such as advertisements, interspersed.
  • Boundary detection methods cannot cleanly model transitions from Content to NotContent and back to Content.
  • If a boundary is not identified at all, an entire section of content can be missed.
  • When developing a statistical classifier to identify boundaries, there are far more negative examples of boundaries than positive ones.
    • It may be possible to sub-sample negative examples or to identify a reasonable set of candidate boundaries and train a classifier on those, but this complicates matters greatly.


Page 8: Adaptive web page content identification

Models

• This section describes the three statistical sequence labeling models employed in the experiments:
  • Conditional Random Fields (CRF)
  • Maximum Entropy Classifiers (MaxEnt)
  • Maximum Entropy Markov Models (MEMM)


Page 9: Adaptive web page content identification

Conditional Random Fields

• Let x = x_1, x_2, …, x_n be a sequence of observations,
  • such as a sequence of words, paragraphs, or, as in this setting, a sequence of HTML “segments”.

• Given a set of possible output values (i.e., labels), sequence CRFs define the conditional probability of a label sequence y = y_1, y_2, …, y_n as:

• Z_x is a normalization term over all possible label sequences.
• The current position is i.
• The current and previous labels are y_i and y_{i-1}.
• Often, the range of the feature functions f_k is {0, 1}.
• Associated with each feature function f_k is a learned parameter λ_k that captures the strength and polarity of the correlation between the feature and the label transition from y_{i-1} to y_i.


p(y \mid x) = \frac{1}{Z_x} \exp\left( \sum_{i=1}^{n} \sum_{k} \lambda_k f_k(y_{i-1}, y_i, x, i) \right)
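To make the definition concrete, here is a toy sketch (not the paper's implementation; the label names, feature functions, and weights are invented for illustration) that computes p(y|x) exactly by enumerating every label sequence to form Z_x:

```python
import itertools
import math

LABELS = ["Content", "NotContent"]

def score(y_prev, y_cur, x, i, weights):
    """Sum of lambda_k * f_k(y_{i-1}, y_i, x, i) for two toy features."""
    s = 0.0
    if y_prev == y_cur:                        # f1: label continuity
        s += weights["same_label"]
    if y_cur == "Content" and len(x[i]) > 20:  # f2: long blocks look like content
        s += weights["long_block_content"]
    return s

def crf_prob(y, x, weights):
    """p(y|x) = exp(total score of y) / Z_x, with Z_x by brute-force enumeration."""
    def total(seq):
        return sum(score(seq[i - 1] if i else None, seq[i], x, i, weights)
                   for i in range(len(x)))
    z = sum(math.exp(total(seq))
            for seq in itertools.product(LABELS, repeat=len(x)))
    return math.exp(total(tuple(y))) / z

x = ["short", "a much longer block of article text here", "nav"]
w = {"same_label": 1.0, "long_block_content": 2.0}
p = crf_prob(("NotContent", "Content", "NotContent"), x, w)
```

Brute-force enumeration is exponential in the sequence length and is only meant to mirror the formula; real CRF implementations compute Z_x with the forward algorithm.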

Page 10: Adaptive web page content identification

Conditional Random Fields

• Training
  • D = {(y^(1), x^(1)), (y^(2), x^(2)), …, (y^(m), x^(m))}.
    • A set of training data consisting of pairs of label and observation sequences.
  • The model parameters are learned by maximizing the conditional log-likelihood of the training data, which is simply the sum of the log-probabilities assigned by the model to each label-observation sequence pair:
  • The second term in the equation below is a Gaussian prior over the parameters, which helps the model avoid over-fitting.


p(D) = \prod_{i=1}^{m} p(y^{(i)} \mid x^{(i)})

L(D) = \sum_{i=1}^{m} \log p(y^{(i)} \mid x^{(i)}) - \sum_{k} \frac{\lambda_k^2}{2\sigma^2}

(the second sum is the regularization term)

Page 11: Adaptive web page content identification

Conditional Random Fields

• Decoding (Testing)
  • Given a trained model, decoding in sequence CRFs involves finding the most likely label sequence for a given observation sequence.
  • There are N^M possible label sequences.
    • N is the number of labels.
    • M is the length of the sequence.
  • Dynamic programming, specifically a variation on the Viterbi algorithm, can find the optimal sequence in time linear in the length of the sequence and quadratic in the number of possible labels.
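A minimal sketch of that dynamic program, assuming the per-position log-potentials are supplied as a function trans_score(prev, cur, i) (an invented interface, not the authors' code; prev is None at position 0):

```python
def viterbi(labels, n, trans_score):
    """Most likely label sequence of length n under trans_score.
    Runs in O(n * |labels|^2): linear in length, quadratic in labels."""
    # delta[y] = best score of any prefix ending in label y
    delta = {y: trans_score(None, y, 0) for y in labels}
    backptrs = []
    for i in range(1, n):
        new_delta, ptr = {}, {}
        for y in labels:
            # best previous label to transition from, for each current label y
            best = max(labels, key=lambda yp: delta[yp] + trans_score(yp, y, i))
            new_delta[y] = delta[best] + trans_score(best, y, i)
            ptr[y] = best
        delta = new_delta
        backptrs.append(ptr)
    # trace back from the best final label
    y = max(delta, key=delta.get)
    path = [y]
    for ptr in reversed(backptrs):
        y = ptr[y]
        path.append(y)
    return path[::-1]
```

Each position considers every (previous label, current label) pair once, which is where the quadratic factor in the number of labels comes from.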


Page 12: Adaptive web page content identification

Maximum Entropy Classifier

• MaxEnt classifiers are conditional models that, given a set of parameters, produce a conditional multinomial distribution according to:

• In contrast with CRFs, MaxEnt models (and classifiers generally) are “state-less” and do not model any dependence between different positions in the sequence.


p(y_i = y \mid x) = \frac{\exp\left( \sum_{k} \lambda_k f_k(y, x_i) \right)}{\sum_{y'} \exp\left( \sum_{k} \lambda_k f_k(y', x_i) \right)}
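A small sketch of this softmax computation (the feature names and weights are invented; real MaxEnt training would fit the λ values from data):

```python
import math

def maxent_prob(active_feats, weights, labels):
    """p(y|x) proportional to exp(sum_k lambda_k f_k(y, x)).
    active_feats: set of feature names firing for this block;
    weights: maps (label, feature) -> lambda.
    Stateless: each block is scored independently of its neighbors."""
    scores = {y: sum(weights.get((y, f), 0.0) for f in active_feats)
              for y in labels}
    m = max(scores.values())  # subtract max for numerical stability
    exps = {y: math.exp(s - m) for y, s in scores.items()}
    z = sum(exps.values())
    return {y: e / z for y, e in exps.items()}

w = {("Content", "long_block"): 2.0, ("NotContent", "anchor_heavy"): 1.5}
p = maxent_prob({"long_block"}, w, ["Content", "NotContent"])
```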

Page 13: Adaptive web page content identification

Maximum Entropy Markov Models

• MEMMs model a state sequence just as CRFs do, but use a “local” training method rather than performing global inference over the sequence at each training iteration.

• Viterbi decoding is employed as with CRFs to find the best label sequence.

• MEMMs are prone to various biases that can reduce their accuracy.
  • They do not normalize over the entire sequence as CRFs do.

p(y \mid x) = \prod_{i} p(y_i \mid y_{i-1}, x)

p(y_i \mid y_{i-1}, x) = \frac{\exp\left( \sum_{k} \lambda_k f_k(y_{i-1}, y_i, x, i) \right)}{\sum_{y'} \exp\left( \sum_{k} \lambda_k f_k(y_{i-1}, y', x, i) \right)}

Page 14: Adaptive web page content identification

Data

• Harvesting and Annotation
  • 1620 labeled documents were collected from 27 news sites.
  • There are 388 distinct articles within the 1620 documents.
  • This large number of duplicate and near duplicate documents introduced bias into the training set.

Page 15: Adaptive web page content identification

Data

• Dividing into Blocks
  • Sanitize the raw HTML and transform it into XHTML.
  • Exclude all words inside style and script tags.
  • Tokenize the document.
  • Divide up the entire sequence of <lex> tokens into smaller sequences called blocks.
    • Block boundaries occur at every tag except the following: <a>, <span>, <strong>, ...etc.
  • Wrap each block as tightly as possible with a <span>.
  • Create features based on the material from the <span> tags in each block.

• There is a large skew of NotContent vs. Content blocks:
  • 234,436 NotContent blocks.
  • 24,388 Content blocks.
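A simplified sketch of the block-division idea using Python's standard html.parser (the paper works on sanitized XHTML with <lex> token tags and a specific exception list; the INLINE set below is an assumption, not the paper's exact list):

```python
from html.parser import HTMLParser

# Tags assumed NOT to start a new block; all other tag boundaries end one.
INLINE = {"a", "span", "strong", "em", "b", "i", "lex"}

class BlockSplitter(HTMLParser):
    """Collect text into blocks, breaking at non-inline tag boundaries.
    Text inside <script>/<style> is discarded, as in the paper."""
    def __init__(self):
        super().__init__()
        self.blocks, self.current, self.skip_depth = [], [], 0
    def _flush(self):
        text = " ".join(self.current).strip()
        if text:
            self.blocks.append(text)
        self.current = []
    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip_depth += 1
        elif tag not in INLINE:
            self._flush()
    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag not in INLINE:
            self._flush()
    def handle_data(self, data):
        if not self.skip_depth and data.strip():
            self.current.append(data.strip())

def split_blocks(html):
    p = BlockSplitter()
    p.feed(html)
    p.close()
    p._flush()  # emit any trailing block
    return p.blocks
```

For example, a paragraph containing a <strong> span stays one block, while a following <div> starts a new one.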

Page 16: Adaptive web page content identification

Experimental Setup

• Words: The number of times each token occurs in the block.
• Inverse Stop-Wording: The number of times tokens from a list of stop words occur in the block.
• Named Entities: A count of the named entities in a block.
• Title Casing: Whether every token in a block begins with an upper case letter, or any token begins with a lower case letter.
• Anchor Percentage: The percentage of tokens in a block contained within an anchor tag.
• Title Matching: The percentage of the page title that matched the block.
• Ancestor Tags: The names of the parent and grandparent tags of a block.
• Descendant Tags: The names of any descendant tags found within a block.
• Sibling Tags: The names of the previous and next sibling tags of the current block.
• Word Count: The count of tokens found in a block.
• After Image Tag: Whether an <img> appears before the current block and after the previous one; intended to exclude photo captions from the content.
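A few of these features can be sketched as follows (the function name, inputs, stop-word list, and exact definitions are assumptions for illustration, not the paper's implementation):

```python
STOP_WORDS = {"the", "a", "of", "in", "and", "to"}  # tiny illustrative list

def block_features(tokens, n_anchor_tokens, page_title):
    """Compute a handful of the per-block features listed above."""
    block_set = {t.lower() for t in tokens}
    title_toks = {t.lower() for t in page_title.split()}
    return {
        # Word Count
        "word_count": len(tokens),
        # Inverse Stop-Wording
        "stop_word_count": sum(t.lower() in STOP_WORDS for t in tokens),
        # Anchor Percentage
        "anchor_percentage": n_anchor_tokens / len(tokens) if tokens else 0.0,
        # Title Casing (every alphabetic token starts upper case)
        "all_title_case": all(t[0].isupper() for t in tokens if t[0].isalpha()),
        # Title Matching (fraction of title tokens present in the block)
        "title_match_pct": (sum(t in block_set for t in title_toks) / len(title_toks)
                            if title_toks else 0.0),
    }
```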

Page 17: Adaptive web page content identification

Experimental Setup

• Data Set Creation
  • The authors ran four separate cross validation experiments to measure the bias introduced by duplicate articles and mixed sources.

1. Duplicates, Mixed Sources.
   • Split documents with 75% in the training set and 25% in the testing set.
2. No Duplicates, Mixed Sources.
   • This split prevented duplicate documents from spanning the training/testing boundary.
3. Duplicates Allowed, Separate Sources.
   • Create four separate bundles, each containing 6 or 7 sources.
   • Fill the bundles in round-robin fashion by selecting a source at random.
   • Three bundles were assigned to training and one to testing.
4. No Duplicates, Separate Sources.
   • Ensure that no duplicates crossed the training/test set boundary.
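The bundle construction for the "Separate Sources" setups might look like the following sketch (the random-selection details are assumptions):

```python
import random

def make_bundles(sources, n_bundles=4, seed=0):
    """Assign sources to bundles round-robin, picking sources in random order,
    so 27 sources yield bundles of 6 or 7 sources each."""
    rng = random.Random(seed)
    pool = list(sources)
    rng.shuffle(pool)  # random selection order
    bundles = [[] for _ in range(n_bundles)]
    for i, src in enumerate(pool):
        bundles[i % n_bundles].append(src)  # round-robin assignment
    return bundles
```

With three bundles used for training and one for testing, no source's pages appear on both sides of the split.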

Page 18: Adaptive web page content identification

Results and Analysis

• Each of the three models (CRF, MEMM, and MaxEnt) is evaluated on each of the four data setups.
  • All results are a weighted average of four-fold cross validation.

Page 19: Adaptive web page content identification

Results and Analysis

• Feature Type Analysis
  • Shows results for the CRF when individually removing each class of feature, using the “No Duplicates; Separate Sources” data set.

Page 20: Adaptive web page content identification

Results and Analysis

• Contiguous vs. Non-Contiguous performance• Document level results.


Page 21: Adaptive web page content identification

Results and Analysis

• Amount of Training Data
  • Shows the results for the CRF with varying quantities of training data, using the “No Duplicates; Separate Sources” experimental setup.

Page 22: Adaptive web page content identification

Results and Analysis

• Error Analysis
  1. NotContent blocks interspersed within portions of Content were falsely labeled as Content.
  2. Section headers were sometimes incorrectly singled out as NotContent.
  3. At article boundaries, the first or last block was sometimes incorrectly labeled NotContent.

Page 23: Adaptive web page content identification

Conclusion and Future Work

• Sequence labeling emerged as the clear winner, with CRF edging out MEMM.
  • MaxEnt was only competitive on the easiest of the four data sets and is not a viable alternative to site-specific wrappers.

• Future work includes applying the techniques to additional data: additional sources, different languages, and other types of data such as weblogs.

• Another interesting avenue: semi-Markov CRFs.