Visual Web Information Extraction With Lixto Robert Baumgartner Sergio Flesca Georg Gottlob

Dec 19, 2015

Page 1: Visual Web Information Extraction With Lixto Robert Baumgartner Sergio Flesca Georg Gottlob.

Visual Web Information Extraction With Lixto

Robert Baumgartner

Sergio Flesca

Georg Gottlob

Page 2

Overview

Introduction and Motivation
Wrapper Generation
Extraction Language/Mechanisms
Testing Lixto
Results
Strengths & Weaknesses
Current/Future Work

Page 3

HTML vs. XML

HTML & XML represent semi-structured data
HTML mainly presentation oriented
Web content typically formatted in HTML
HTML lacks data querying

Page 4

XML Advantages

XML structure/layout separation
XML provides suitable data representation
XML sets act as database
XML sets queried via XML-GL, XML-QL, XQuery

Page 5

eBay Example

No data querying ability increases cost and time to retrieve information from web pages

Example: watch interesting eBay offers of notebooks

Criteria:
– Auction contains the word “notebook”
– Current value between GBP 1500 and 3000
– Received at least 3 bids
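The three criteria above amount to a simple structured query, which becomes easy once the auction data is available in structured form. A minimal Python sketch follows; the `Auction` record, its field names, and the sample data are hypothetical, and the price bounds are assumed to be inclusive.

```python
from dataclasses import dataclass


@dataclass
class Auction:
    # Hypothetical record for one extracted auction entry.
    title: str
    price_gbp: float
    bids: int


def is_interesting(a: Auction) -> bool:
    """Apply the three criteria from the slide (bounds assumed inclusive)."""
    return (
        "notebook" in a.title.lower()
        and 1500 <= a.price_gbp <= 3000
        and a.bids >= 3
    )


# Made-up sample data, for illustration only.
auctions = [
    Auction("Notebook 15 inch", 1800.0, 5),
    Auction("Notebook bag", 20.0, 7),
    Auction("Desktop PC", 2000.0, 4),
    Auction("Fast notebook", 2500.0, 1),
]

interesting = [a for a in auctions if is_interesting(a)]
```

Only the first sample auction satisfies all three conditions; the others fail on price, title, or bid count respectively.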

Page 6

eBay Problems

eBay does not support complex queries
Similar sites do not give restricted queries
Large number of results returned with no possibility to further restrict the results
Only one site can be queried at a time
Results from different queries cannot be compiled into a single structured file

Page 7

eBay Solution

Lixto introduces new ideas and programming language concepts for wrapper generation
Lixto translates HTML to XML
Resulting XML can then be queried and further processed
Wrappers applied automatically to extract information from changing web pages

Page 8

Lixto Advantages

Easy to learn
Full visual and interactive UI provided
No fine tuning required
No knowledge of internal language necessary
No knowledge of HTML necessary
Graphical region marking and selection
Works directly on browser-displayed pages, no additional view necessary

Page 9

Lixto Advantages

Extraction of target patterns based on:
– Surrounding landmarks
– Actual content
– HTML attributes
– Order of appearance
– Semantic and syntactic concepts
Extraction from flat strings possible
Semi-automatic wrapper generation

Page 10

Advanced Lixto Features

Disjunctive pattern definitions
Crawling page links during extraction
Recursive wrapping
Extracted data can have disjoint structure from HTML source page
Internal data structure language Elog

Page 11

Implemented Lixto System

Page 12

Architecture and Implementation

Lixto created with Java using Swing, OroMatcher and JDOM

Lixto toolkit contains three modules:
– Interactive Pattern Builder
– Extractor
– XML Generator

Page 13

Creating Wrappers

Lixto wrappers created interactively using patterns in a hierarchical order
Pattern names act as default XML elements, e.g. <Item>, <Price>
Sub patterns express 1:* relationships
Each pattern characterizes one kind of information
Each pattern is defined by one or more filters
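To illustrate how a pattern hierarchy could map onto XML, here is a minimal Python sketch using the standard library. It is not Lixto's XML Generator: the nested-dictionary representation and the `to_xml` helper are assumptions made for illustration, and `Description` is a made-up pattern name (`Item` and `Price` come from the slide). Each parent instance holds a list of child instances, mirroring the 1:* relationship.

```python
import xml.etree.ElementTree as ET


def to_xml(pattern_name, instance):
    """Turn one pattern instance into an XML element named after the pattern.

    A string is a leaf value; a dict maps sub-pattern names to lists of
    sub-pattern instances (the 1:* relationship).
    """
    elem = ET.Element(pattern_name)
    if isinstance(instance, str):
        elem.text = instance
        return elem
    for child_name, child_instances in instance.items():
        for child in child_instances:
            elem.append(to_xml(child_name, child))
    return elem


# Hypothetical extraction result: one document with two Item instances.
extracted = {
    "Item": [
        {"Description": ["Notebook A"], "Price": ["1800"]},
        {"Description": ["Notebook B"], "Price": ["2500"]},
    ]
}

root = to_xml("document", extracted)
xml_text = ET.tostring(root, encoding="unicode")
```

The resulting tree has one `<document>` root with two `<Item>` children, each carrying its own `<Description>` and `<Price>`.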

Page 14

Filter Creation

User highlights desired target
– Internally Elog rule created describing filter
Add restrictive conditions to filter
– Goals added to Elog rule body
Filter conditions:
– Before/after
– Not before/not after
– Internal
– Range
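The condition types listed above can be sketched in plain Python. This is a strong simplification and not Elog: the page is modelled as a flat list of text tokens, and all helper names (`before`, `after`, `negate`, `internal`, `apply_filter`) are hypothetical. Lixto's actual conditions operate on the document tree and may carry further parameters not shown on the slide.

```python
from typing import Callable, List, Optional, Tuple

# A condition decides whether the token at index i qualifies as a target.
Condition = Callable[[List[str], int], bool]


def before(landmark: str) -> Condition:
    """The landmark must occur somewhere before the target."""
    return lambda tokens, i: landmark in tokens[:i]


def after(landmark: str) -> Condition:
    """The landmark must occur somewhere after the target."""
    return lambda tokens, i: landmark in tokens[i + 1:]


def negate(cond: Condition) -> Condition:
    """Builds 'not before' / 'not after' from the positive conditions."""
    return lambda tokens, i: not cond(tokens, i)


def internal(substring: str) -> Condition:
    """The target itself must contain the given text."""
    return lambda tokens, i: substring in tokens[i]


def apply_filter(
    tokens: List[str],
    conditions: List[Condition],
    keep_range: Optional[Tuple[int, int]] = None,
) -> List[str]:
    """Keep tokens satisfying all conditions; optionally keep a slice of them."""
    matches = [
        i for i in range(len(tokens))
        if all(cond(tokens, i) for cond in conditions)
    ]
    if keep_range is not None:
        start, end = keep_range
        matches = matches[start:end]  # range condition on the matched targets
    return [tokens[i] for i in matches]


tokens = ["Items", "GBP 1800", "GBP 2500", "Total", "GBP 4300"]

prices = apply_filter(tokens, [internal("GBP"), before("Items"), after("Total")])
first_price = apply_filter(
    tokens, [internal("GBP"), negate(before("Total"))], keep_range=(0, 1)
)
```

Each added condition narrows the set of matched targets, which corresponds to adding one more goal to the rule body.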

Page 15

Pattern Creation Algorithm

Loading initial document creates a <document> pattern
User highlights instance of the pattern
Lixto displays all matched instances of the pattern

Page 16

Pattern Creation Algorithm

User can add filters to limit the matched targets
The set of filters is added to the <document> pattern
Test if <document> pattern extracts exactly the desired set of data
If yes, save the pattern; if no, select a new instance of the pattern

Page 17

Generation of a New Pattern

Page 18

The Lixto Browser

Page 19

Conditional Generation

Page 20

Visual Interface

Visual tree pattern construction
Regular expression string patterns
XML visualization tool
Concept generator
– Regular expression / database driven
– Creates “isCity”, “isDate”
– Requires no regular expression knowledge
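The two ways of defining a concept (regular expression driven and database driven) can be pictured with a short Python sketch. The factory functions, the date format, and the city list are all assumptions for illustration; the slide does not say how Lixto implements “isCity” or “isDate” internally.

```python
import re


def concept_from_regex(pattern):
    """Build a predicate that holds when the whole text matches the pattern."""
    compiled = re.compile(pattern)
    return lambda text: compiled.fullmatch(text.strip()) is not None


def concept_from_database(entries):
    """Build a predicate that holds when the text is one of the known entries."""
    known = {entry.lower() for entry in entries}
    return lambda text: text.strip().lower() in known


# Assumed date format: day.month.year with a four-digit year.
is_date = concept_from_regex(r"\d{1,2}\.\d{1,2}\.\d{4}")

# Stand-in for a database table of city names.
is_city = concept_from_database(["Vienna", "Rome", "London"])
```

A user of such predicates only picks a concept by name and never sees the underlying expression, which is the point made in the last bullet.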

Page 21

Main Menu / Pattern Generation Menu

Page 22

Elog

Internal data storage language
Datalog-like syntax and semantics
Invisible to the user
Specifically designed for hierarchical and modular data extraction
Flexible, intuitive, easily extensible
Patterns stored as narrowing (logical and) and broadening (logical or) steps
Elog rules are implementations of the visually defined filters
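The narrowing and broadening steps can be read as a disjunction of conjunctions: a candidate belongs to a pattern if at least one filter accepts it, and a filter accepts it only if all of its conditions hold. The Python sketch below shows just that logic. It is not Elog syntax, and the candidate dictionaries and the two example filters are made up.

```python
def filter_matches(conditions, candidate):
    """Narrowing: every condition of the filter must hold (logical and)."""
    return all(cond(candidate) for cond in conditions)


def pattern_matches(filters, candidate):
    """Broadening: one accepting filter is enough (logical or)."""
    return any(filter_matches(f, candidate) for f in filters)


# Two hypothetical filters describing where a price may appear.
price_in_bold = [lambda c: c["tag"] == "b", lambda c: "GBP" in c["text"]]
price_in_cell = [lambda c: c["tag"] == "td", lambda c: c["text"].startswith("£")]
price_pattern = [price_in_bold, price_in_cell]

candidates = [
    {"tag": "b", "text": "GBP 1800"},
    {"tag": "b", "text": "Hot offer"},
    {"tag": "td", "text": "£2500"},
    {"tag": "td", "text": "3 bids"},
]

extracted = [c["text"] for c in candidates if pattern_matches(price_pattern, c)]
```

Adding a condition to a filter can only shrink its result set, while adding a filter to a pattern can only grow the pattern's result set.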

Page 23

Elog Extraction Program for eBay Example

Page 24

Document Model

Brackets specify character offsets
Nodes numbered in depth-first left-to-right fashion
HTML tags refer to element sets containing attribute names and values
– <body> tag contains attributes
  • {(name,body), (bgcolor,FFFFFF),(elementtext,…)}
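A rough way to picture this document model is to parse a page and record, for each element, its depth-first number and an attribute dictionary that includes `name` and `elementtext` next to the real HTML attributes. The sketch below uses Python's standard `html.parser`; the `NodeCollector` class and the sample markup are assumptions, it expects well-formed input with explicit closing tags, it does not track character offsets, and a real HTML attribute called `name` would overwrite the tag name in this simplified form.

```python
from html.parser import HTMLParser


class NodeCollector(HTMLParser):
    """Collect elements in depth-first, left-to-right order."""

    def __init__(self):
        super().__init__()
        self.nodes = []       # all elements, in the order their start tags occur
        self.open_nodes = []  # stack of elements that are currently open

    def handle_starttag(self, tag, attrs):
        # Start tags arrive in document order, which is the depth-first order.
        node = {"number": len(self.nodes), "name": tag, "elementtext": ""}
        node.update(dict(attrs))
        self.nodes.append(node)
        self.open_nodes.append(node)

    def handle_endtag(self, tag):
        if self.open_nodes:
            self.open_nodes.pop()

    def handle_data(self, data):
        # Text belongs to every element that currently encloses it.
        for node in self.open_nodes:
            node["elementtext"] += data


collector = NodeCollector()
collector.feed('<body bgcolor="FFFFFF"><p>Hello <b>Lixto</b></p></body>')
body = collector.nodes[0]
```

For the sample markup, `body` ends up holding the pairs (name, body), (bgcolor, FFFFFF) and an `elementtext` entry with the concatenated text of the element.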

Page 25

HTML Example Page

Page 26

XML Translation

Page 27

Extraction Mechanisms

Tree extraction
– Elements identified by tree path (*.table*.tr)
– Attribute constraints reduce matched elements
– Element path definition (epd): tree path + attribute constraints
String extraction
– Strings stored in ‘context’ nodes
– Regular expression matching
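Both mechanisms can be approximated in a few lines of Python. The sketch below is not Lixto's implementation: the path syntax is an assumption (dot-separated tag names, with `*` standing for any number of intermediate levels), the helpers `match_path` and `subelements` are hypothetical, and the sample page is well-formed markup parsed with the standard XML parser rather than real-world HTML.

```python
import re
import xml.etree.ElementTree as ET


def match_path(elem, steps):
    """Yield descendants of elem reached by following steps."""
    if not steps:
        yield elem
        return
    head, rest = steps[0], steps[1:]
    if head == "*":
        yield from match_path(elem, rest)       # wildcard skips zero levels
        for child in elem:
            yield from match_path(child, steps)  # wildcard skips one more level
    else:
        for child in elem:
            if child.tag == head:
                yield from match_path(child, rest)


def subelements(root, path, constraints=None):
    """Tree path plus attribute constraints, with duplicate matches removed."""
    seen, result = set(), []
    for elem in match_path(root, path.split(".")):
        if id(elem) in seen:
            continue
        seen.add(id(elem))
        if all(elem.get(k) == v for k, v in (constraints or {}).items()):
            result.append(elem)
    return result


page = ET.fromstring(
    "<html><body><table>"
    "<tr class='item'><td>Notebook A</td><td>GBP 1800</td></tr>"
    "<tr class='header'><td>Ads</td></tr>"
    "<tr class='item'><td>Notebook B</td><td>GBP 2500</td></tr>"
    "</table></body></html>"
)

# Tree extraction: table rows anywhere in the page, restricted by an attribute.
rows = subelements(page, "*.table.*.tr", {"class": "item"})

# String extraction: a regular expression applied to each row's text.
price_re = re.compile(r"GBP (\d+)")
prices = [
    int(m.group(1))
    for row in rows
    for m in price_re.finditer("".join(row.itertext()))
]
```

The tree step selects the two rows marked `item`, and the string step then pulls the numeric price out of each row's flat text.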

Page 28

HTML Tree Extraction

Page 29

Lixto Test Sites

Page 30

Results

Page 31

Strengths & Weaknesses

Intuitive UI (If it needs a manual it’s not a good program)
Highly customizable
Supports crawling across web sites
No tree output after crawling
Slow
Extracts only one target type at a time

Page 32

Current/Future Work

Extend tree structure to support crawling across multiple sites (crawling is currently supported)
Server based Lixto system
Automated heuristics
Support for multiple example targets at once
Embedding Lixto wrappers into information channel system