Learning to Extract Form Labels
Nguyen et al.

Dec 20, 2015

Transcript
Page 1

Learning to Extract Form Labels

Nguyen et al.

Page 2

The Challenge

We want to retrieve and integrate online databases

Most online databases are accessed through forms

The better we can understand the forms, the better we know the databases

Page 3

Web forms

Most forms on the web are very different

Page 4

The Solution

Introducing … LABELEX

A learning-based approach for automatically parsing and extracting element labels of forms used by humans

Page 5

Overview

Page 6

Basic Definitions

Forms contain elements and labels

Elements are textboxes, lists, etc.

Labels represent attributes or fields

Elements are associated with labels

Element domain is the range of elements

Page 7

Algorithm Description

Generating candidate mappings

Extracting features

Learning to identify mappings

Using prior knowledge to discover new labels

Page 8

Generating Mapping Candidates

Mappings between labels and elements are generated

We consider only text close to the element
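The slide only says that text close to an element is considered, so the following is a minimal sketch under stated assumptions rather than the authors' procedure. It assumes each text fragment and form element comes with a rendered bounding box, and it keeps a (label, element) pair when the two box centers are within a pixel threshold. The `Box` type, the `max_dist` value, and the sample boxes are all hypothetical.

```python
from dataclasses import dataclass
from math import hypot


@dataclass(frozen=True)
class Box:
    """A rendered page item: a name plus its bounding box in pixels (assumed)."""
    name: str
    x: float
    y: float
    w: float
    h: float

    @property
    def center(self):
        return (self.x + self.w / 2, self.y + self.h / 2)


def distance(a, b):
    """Euclidean distance between the centers of two boxes."""
    (ax, ay), (bx, by) = a.center, b.center
    return hypot(ax - bx, ay - by)


def candidate_mappings(texts, elements, max_dist=150.0):
    """Pair every element with each text box whose center is within max_dist."""
    return [
        (t.name, e.name)
        for e in elements
        for t in texts
        if distance(t, e) <= max_dist
    ]


# Hypothetical layout: one label next to the input, one banner far away.
texts = [Box("From", 10, 10, 40, 20), Box("Save $220", 400, 300, 80, 20)]
elements = [Box("origin_input", 60, 10, 120, 20)]
candidates = candidate_mappings(texts, elements)
```

With these sample boxes only the nearby "From" text survives as a candidate for the input; the distant "Save $220" text is never paired with it.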

Page 9

Generating Mapping Candidates

Example

Page 10

Extracting Features

Form Elements and Labels
  Elements: type
  Labels: font and size

Label-Element Similarity
  Uses internal name and default value (LCS)

Spatial Feature
  Topological features: top, bottom, left, etc.
  Label-element distance (normalized)
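The similarity and spatial features listed above could be computed roughly as follows. This is a sketch under several assumptions: "LCS" is read as longest common subsequence (the slide gives only the acronym), the similarity is normalized by the longer string, the topological relation is decided from box centers with the y axis growing downward, and the distance is normalized by the form's diagonal. None of these choices is confirmed by the slides, and all function names are illustrative.

```python
from math import hypot


def lcs_length(a, b):
    """Length of the longest common subsequence of a and b (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def similarity(label, internal_name):
    """LCS length divided by the longer string's length (assumed normalization)."""
    a, b = label.lower(), internal_name.lower()
    if not a or not b:
        return 0.0
    return lcs_length(a, b) / max(len(a), len(b))


def center(box):
    x, y, w, h = box
    return x + w / 2, y + h / 2


def topological_relation(label_box, element_box):
    """Where the label sits relative to the element: left, right, top or bottom."""
    (lx, ly), (ex, ey) = center(label_box), center(element_box)
    dx, dy = lx - ex, ly - ey
    if abs(dx) >= abs(dy):
        return "left" if dx < 0 else "right"
    return "top" if dy < 0 else "bottom"


def normalized_distance(label_box, element_box, form_size):
    """Center-to-center distance divided by the form diagonal (assumed)."""
    (lx, ly), (ex, ey) = center(label_box), center(element_box)
    width, height = form_size
    return hypot(lx - ex, ly - ey) / hypot(width, height)


def extract_features(label, label_box, internal_name, element_box, form_size):
    return {
        "similarity": similarity(label, internal_name),
        "relation": topological_relation(label_box, element_box),
        "distance": normalized_distance(label_box, element_box, form_size),
    }


# Hypothetical label/element pair; boxes are (x, y, width, height) in pixels.
features = extract_features(
    "From", (10, 10, 40, 20), "from_city", (60, 10, 120, 20), (300, 400)
)
```

Element type and label font and size would be added to the same feature dictionary; they are omitted here because they are read directly from the markup rather than computed.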

Page 11

Extracting Features

Page 12

Identifying Mappings

We need to prune first

We choose a classifier to prune mappings

Page 13

Learning Mappings

We choose a classifier for selecting correct mappings
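The prune-then-select flow from the last two slides can be sketched as a two-stage cascade. The slides do not say which classifiers are used, so the two stages below are plain threshold rules standing in for trained models; the thresholds, feature names, and sample candidates are assumptions made only for illustration.

```python
def identify_mappings(candidates, keep, is_correct):
    """Stage 1 drops unlikely candidates; stage 2 selects mappings among the rest."""
    survivors = [c for c in candidates if keep(c[2])]
    return [(label, element) for label, element, feats in survivors if is_correct(feats)]


def pruning_rule(feats):
    """Stand-in for the pruning classifier: discard far-away text (assumed rule)."""
    return feats["distance"] <= 0.5


def selection_rule(feats):
    """Stand-in for the selection classifier (assumed rule)."""
    return feats["similarity"] >= 0.5 or feats["distance"] <= 0.2


# Hypothetical candidates: (label text, element name, feature dictionary).
candidates = [
    ("From", "origin", {"similarity": 0.2, "distance": 0.1}),
    ("Save $220", "origin", {"similarity": 0.0, "distance": 0.8}),
    ("To", "origin", {"similarity": 0.1, "distance": 0.4}),
]
mappings = identify_mappings(candidates, pruning_rule, selection_rule)
```

Passing the two stages in as callables keeps the cascade independent of the model type, so either rule can be swapped for a trained classifier's prediction function.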

Page 14

The Reconciliation process

A vocabulary is created to reconcile ambiguous mappings

Terms with high frequency might be labels

Ex: “Save $220” and “From”

Two tables for single terms and multiple ones
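One way to picture the vocabulary is as two frequency tables built from labels that were already extracted, then used to rank competing candidate texts for an element. The sketch below is an assumption-laden illustration: the scoring formula, the way words are counted, and the sample labels are not taken from the slides.

```python
from collections import Counter


def build_vocabulary(known_labels):
    """Return (single-term table, multi-term table) of frequencies."""
    single, multi = Counter(), Counter()
    for label in known_labels:
        words = label.lower().split()
        single.update(words)                # every word counts as a single term
        if len(words) > 1:
            multi[" ".join(words)] += 1     # whole phrase counts as a multi-term label
    return single, multi


def score(text, single, multi):
    """Phrase frequency plus average word frequency (assumed scoring)."""
    words = text.lower().split()
    if not words:
        return 0.0
    phrase = " ".join(words)
    return multi[phrase] + sum(single[w] for w in words) / len(words)


def reconcile(candidate_texts, single, multi):
    """Pick the candidate text that looks most like a known label."""
    return max(candidate_texts, key=lambda t: score(t, single, multi))


# Hypothetical labels extracted from other forms.
known = ["From", "To", "From", "Departure date", "From city"]
single, multi = build_vocabulary(known)
best = reconcile(["Save $220", "From"], single, multi)
```

Here "From" is frequent in the vocabulary while neither word of "Save $220" appears in it, so the ambiguous element is reconciled to "From", matching the slide's example.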

Page 15

Experimental Evaluation

Datasets

Page 16

Results

Best configuration

Page 17

Results

Domain specific (DSCE)

Page 18

Results

DSCE vs Generic

Page 19

Results

Comparison to state of the art: HSP, IEXP

Page 20

Strengths

Lots of experiments

Good charts

Well explained

Page 21

Weaknesses

One typo

Their approach is layout dependent

Page 22

Future Work

Handle N:M mappings

Go beyond the naïve approach

Consider other features for classification

Page 23

?’s