Page 1

Extraction of Segments from Web 2.0 Pages

Dipl.-Inform. Renato Dominguez Garcia
[email protected] Tel. +49 6151 165842
13 October 2008

KOM - Multimedia Communications Lab
Prof. Dr.-Ing. Ralf Steinmetz (Director)
Dept. of Electrical Engineering and Information Technology
Dept. of Computer Science (adjunct Professor)
TUD – Technische Universität Darmstadt
Merckstr. 25, D-64283 Darmstadt, Germany
Tel. +49 6151 166150, Fax +49 6151 166152
www.KOM.tu-darmstadt.de

httc – Hessian Telemedia Technology Competence-Center e.V - www.httc.de

© author(s) of these slides 2008 including research results of the research network KOM and TU Darmstadt otherwise as specified at the respective slide

File: RDG___Web2.0_Extraction___2008.09.22.ppt

Processing pipeline: URL → Genre Detection → Page Segmentation → Segment Classification → Output Format

Page 2

KOM Research Program

[Figure: KOM research program diagram. Category labels: Application Areas, Fundamentals, Research Areas. Entries: Ubiquitous Communication; Mobile Networking; Peer-to-Peer Networking; IT Architectures; E-Learning; E-Business & E-Finance; Communication Services & IP Telephony; Workflows; Network Mechanisms; Quality of Service, Dependability & Security.]

Highlighted: E-Learning, Knowledge Media: Filtering of irrelevant information

Page 3

Motivation

The influence of the internet on the US presidential election has grown since 2004 (+11%) [1]
The internet is the primary source of information about the US presidential election for young Americans (18–29 years old) (42%) [1]
One in three Americans reads blogs [2]
PR professionals recognize the importance of blogs [3]

→ Information extraction from blogs can help to understand public opinion

→ Automatic detection of blogs, wikis and forums may be useful

→ Information extraction from small segments is easier than from large web pages

→ Genre Detection and Information extraction can be used in other fields: Community Mining, Improvement of search results

Source: http://a.abcnews.com/

[1] http://people-press.org/report/384/internets-broader-role-in-campaign-2008
[2] DER SPIEGEL 30/2008 of 21.07.2008, Page 94
[3] http://www.conversationblog.com/journal/2007/3/18/survey-pr-professionals-recognise-importance-of-blogs-but-do-not-know-how-to-integrate-them-in-their-planning.html

[Figure: annotated blog post with labeled fields: Topic, No. of comments, Author, Date, Content]

Source: http://www.techcruch.com/

Page 4

Our Approach: Genre Detection

Genre Detection
  6 genres
    Blogs (start pages, post pages)
    Wikis
    Forums (start pages, thread pages)
    Others
  Based on the structure of web pages (patterns)
  Machine learning techniques
    Support Vector Machines
    336 features
    Corpus: ~33,000 web pages
  Evaluation
    1345 instances (1000 blogs/wikis/forums)
    87.5% correctly classified instances

   a    b    c    d    e    f   <-- classified as
 156    6    1    0    0   37 | a = Blog_Page
  15  170    0    1    1   13 | b = Blog_Post
   1    1  171    0    1   26 | c = Wiki_Page
   1    0    1  184    4   10 | d = Forum_Page
   0    0    1    4  185   10 | e = Forum_Thread
  13   12    4    5    0  311 | f = Outlier
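The reported 87.5% can be checked against the confusion matrix. The following is a minimal sketch, assuming the usual reading of this output layout (rows are the actual genre, columns the predicted genre). The names `LABELS`, `MATRIX`, `accuracy` and `recall_per_class` are illustrative and not taken from the slides.

```python
# Confusion matrix as printed on the slide.
# Assumption: rows = actual genre, columns = predicted genre.
LABELS = ["Blog_Page", "Blog_Post", "Wiki_Page",
          "Forum_Page", "Forum_Thread", "Outlier"]
MATRIX = [
    [156,   6,   1,   0,   0,  37],
    [ 15, 170,   0,   1,   1,  13],
    [  1,   1, 171,   0,   1,  26],
    [  1,   0,   1, 184,   4,  10],
    [  0,   0,   1,   4, 185,  10],
    [ 13,  12,   4,   5,   0, 311],
]


def total_instances(matrix):
    """Sum of all cells, i.e. the number of evaluated pages."""
    return sum(sum(row) for row in matrix)


def accuracy(matrix):
    """Share of instances on the diagonal (correctly classified)."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total_instances(matrix)


def recall_per_class(matrix, labels):
    """For each actual genre, the share of its pages that were recognized."""
    return {label: row[i] / sum(row)
            for i, (label, row) in enumerate(zip(labels, matrix))}


if __name__ == "__main__":
    print(total_instances(MATRIX))
    print(round(100 * accuracy(MATRIX), 1))
    for label, value in recall_per_class(MATRIX, LABELS).items():
        print(f"{label:13s} {value:.3f}")
```

The cells sum to 1345 instances, of which 1177 lie on the diagonal, which agrees with the 87.5% stated on the slide. Under the row/column assumption, most errors are blog, wiki and forum pages labeled as Outlier (last column).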

Features include:
  Number of detected patterns
  Number of outer patterns
  Ratio of patterned vs unpatterned code
  Length of patterns
  Offset before first pattern starts
  Depth of patterns
  …
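The slides do not define how a pattern is detected, so the following is only a sketch under a loudly stated assumption: a "pattern" is taken to be a run of two or more adjacent sibling elements with identical tag structure. It computes three simplified features (pattern count, longest run, deepest pattern). All names (`Node`, `TreeBuilder`, `find_patterns`, `pattern_features`) are hypothetical, and this is not the authors' feature extractor.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class Node:
    def __init__(self, tag, depth):
        self.tag, self.depth, self.children = tag, depth, []

    def signature(self):
        # Structural signature: the tag plus the signatures of all children.
        inner = ",".join(child.signature() for child in self.children)
        return f"{self.tag}({inner})"


class TreeBuilder(HTMLParser):
    """Builds a bare tag tree; text and attributes are ignored."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root", 0)
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag, len(self.stack))
        self.stack[-1].children.append(node)
        if tag not in VOID:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; stray closing tags are ignored.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break


def find_patterns(node, found=None):
    """Assumed definition: >= 2 adjacent siblings with equal signatures."""
    if found is None:
        found = []
    kids = node.children
    i = 0
    while i < len(kids):
        j = i + 1
        while j < len(kids) and kids[j].signature() == kids[i].signature():
            j += 1
        if j - i >= 2:
            found.append({"tag": kids[i].tag, "repeats": j - i,
                          "depth": kids[i].depth})
        i = j
    for kid in kids:
        find_patterns(kid, found)
    return found


def pattern_features(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    patterns = find_patterns(builder.root)
    return {
        "num_patterns": len(patterns),
        "max_repeats": max((p["repeats"] for p in patterns), default=0),
        "max_depth": max((p["depth"] for p in patterns), default=0),
    }


BLOG_LIKE = ("<html><body><h1>Blog</h1>"
             "<div><h2>a</h2><p>x</p></div>"
             "<div><h2>b</h2><p>y</p></div>"
             "<div><h2>c</h2><p>z</p></div>"
             "</body></html>")
WIKI_LIKE = "<html><body><h1>T</h1><p>a</p><ul><li>x</li></ul></body></html>"

if __name__ == "__main__":
    print(pattern_features(BLOG_LIKE))
    print(pattern_features(WIKI_LIKE))
```

On the blog-like toy page the three identically structured `div` blocks form one pattern, while the wiki-like page yields none. Real pages would need a more tolerant similarity measure than exact signature equality.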


Page 5

Our Features: Examples

Blog's start page: many long patterns. All posts share a similar structure.
Wiki page: few patterns.
Forum's start page: many short patterns. All threads share a similar structure …

Other labels in the figure: title, set of links

Page 6

Our Approach: Segmentation

Page segmentation (four steps)
  Pre-processing (cleaning HTML)
  Segmentation based on the hierarchical structure of web pages
  Visual-based segmentation
  Filtering based on heuristics
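A much-simplified sketch of three of the four steps above, using only the Python standard library. The slides give no implementation details, so everything here is an assumption: pre-processing is reduced to dropping `script` and `style` content, the hierarchical step to splitting at a fixed set of block-level tags (`BLOCKS`), and the heuristic filter to a minimum text length (`min_chars`). The visual-based step is omitted because it requires rendering information. The names `Segmenter` and `segment_page` are hypothetical.

```python
from html.parser import HTMLParser

SKIP = {"script", "style"}                          # pre-processing: dropped content
BLOCKS = {"div", "article", "section", "li", "td"}  # assumed segment boundaries


class Segmenter(HTMLParser):
    """Collects visible text and starts a new segment at every block boundary."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0
        self.segments = [[]]

    def handle_starttag(self, tag, attrs):
        if tag in SKIP:
            self.skip_depth += 1
        elif tag in BLOCKS:
            self.segments.append([])

    def handle_endtag(self, tag):
        if tag in SKIP and self.skip_depth:
            self.skip_depth -= 1
        elif tag in BLOCKS:
            self.segments.append([])

    def handle_data(self, data):
        text = data.strip()
        if text and not self.skip_depth:
            self.segments[-1].append(text)


def segment_page(html, min_chars=20):
    """Return text segments; short ones are filtered out (assumed heuristic)."""
    parser = Segmenter()
    parser.feed(html)
    parser.close()
    texts = [" ".join(chunks) for chunks in parser.segments]
    return [text for text in texts if len(text) >= min_chars]


SAMPLE = """
<html><head><style>p{color:red}</style>
<script>var x = "this is a long script string";</script></head>
<body>
<div>nav</div>
<div><h2>Post title</h2><p>This is the body of a blog post.</p></div>
<div><p>A comment left by a reader.</p></div>
</body></html>
"""

if __name__ == "__main__":
    for segment in segment_page(SAMPLE):
        print(segment)
```

On the sample page the script and style content is discarded, the short navigation block is filtered out, and the post and the comment come out as separate segments that a segment classifier could then label.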

Segment classification
  Machine learning techniques
    Random Forest
    139 features
    Corpus: ~500 instances
  Evaluation
    Genres (blog posts, comments, others)
    97.2% correctly classified instances

Source: http://www.techcruch.com/