DAISY Dutch lAnguage Investigation of Summarization technologY Katholieke Universiteit Leuven Rijksuniversiteit Groningen Q-go

Dec 16, 2015

Jared Nicholson
Transcript
Page 1

DAISY
Dutch lAnguage Investigation of Summarization technologY

Katholieke Universiteit Leuven
Rijksuniversiteit Groningen

Q-go

Page 2

DAISY on one slide

Segmentation

Rhetorical classification

Sentence compression

Sentence generation

Multi-document summarization: detect differences

Improvement of question answering, e.g. e-mail answering

Summarization of web content

Page 3

Overview

Report of our current progress in:
• Corpus building and preprocessing
• Segmentation
• Sentence generation

Page 4

Corpus Building and Preprocessing

Target: corpus of questions, short texts and web pages about the same topic

• Freely available:
  – UWV (questions & answer texts)
  – SVB (questions)

• Available for internal use: KLM (questions, answer texts, web pages)

• Todo:
  – SVB web pages
  – ABN AMRO (committed, not delivered)

Page 5

Corpus Building and Preprocessing

• POS-tagged and parsed: KLM and UWV
• SVB corpus: in progress
• Coreference resolution: in progress

Page 6

Segmentation

Find main content in webpage

Smaller segments
• Can be obtained from HTML structure: <H#>, <P>, <BR>, <UL>, ...
• Hierarchical
• Will be refined in relation to rhetorical roles
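
A minimal sketch of how smaller segments might be read off the HTML structure, using Python's standard html.parser module. The class name SegmentCollector, the function segment_html and the choice of boundary tags are assumptions made for illustration, not the project's implementation. Each segment is tagged with the level of the heading it falls under, which gives a simple hierarchy.

```python
from html.parser import HTMLParser

HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}
# Tags treated as segment boundaries; this set is an assumption of the sketch.
BOUNDARIES = HEADINGS | {"p", "br", "ul", "ol", "li", "div"}


class SegmentCollector(HTMLParser):
    """Collects (heading_level, is_heading, text) triples from an HTML string."""

    def __init__(self):
        super().__init__()
        self.segments = []
        self._buffer = []
        self._level = 0  # level of the most recent heading; 0 means none seen yet

    def _flush(self, is_heading=False):
        # Normalise whitespace and store the buffered text as one segment.
        text = " ".join("".join(self._buffer).split())
        if text:
            self.segments.append((self._level, is_heading, text))
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in BOUNDARIES:
            self._flush()
        if tag in HEADINGS:
            self._level = int(tag[1])

    def handle_endtag(self, tag):
        if tag in BOUNDARIES:
            self._flush(is_heading=tag in HEADINGS)

    def handle_data(self, data):
        self._buffer.append(data)


def segment_html(html):
    """Return the list of segments found in an HTML string."""
    collector = SegmentCollector()
    collector.feed(html)
    collector.close()
    collector._flush()  # keep any text that follows the last boundary tag
    return collector.segments
```

Refinement by rhetorical role, as announced on the slide, is not covered by this sketch.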

Page 7

Segmentation

Page 8

Segmentation

Page 9

Segmentation

Search for block with highest density of text
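
The slide does not say how text density is measured. One possible reading, sketched here under that assumption: split the page into tag tokens and text tokens, reward text characters, penalise tags, and take the contiguous token span with the highest total score (a maximum-subarray search). The function name densest_block and the tag_penalty value are hypothetical.

```python
import re

# A token is either a tag ("<...>") or a run of text between tags.
TOKEN_RE = re.compile(r"<[^>]*>|[^<]+")


def densest_block(html, tag_penalty=10):
    """Return (start, end, text) for the token span [start, end) with the best score.

    Text tokens score their number of non-whitespace characters; every tag
    costs tag_penalty. The penalty value is an assumption, not a tuned setting.
    """
    tokens = TOKEN_RE.findall(html)
    scores = [
        -tag_penalty if tok.startswith("<") else len("".join(tok.split()))
        for tok in tokens
    ]

    # Kadane's algorithm: best-scoring contiguous span of tokens.
    best_sum, best_start, best_end = 0, 0, 0
    cur_sum, cur_start = 0, 0
    for i, score in enumerate(scores):
        if cur_sum <= 0:
            cur_sum, cur_start = score, i  # start a fresh span here
        else:
            cur_sum += score
        if cur_sum > best_sum:
            best_sum, best_start, best_end = cur_sum, cur_start, i + 1

    text = " ".join(
        tok.strip()
        for tok in tokens[best_start:best_end]
        if not tok.startswith("<") and tok.strip()
    )
    return best_start, best_end, text
```

On a page with a link-heavy navigation bar and footer, the span tends to settle on the paragraphs in between, because the many short link texts do not outweigh the tags around them.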

Page 10

Segmentation

Page 11

Segmentation

Additional heuristics to extend the selection:
• Find closing tags for all tags that were opened in the selection
• Include all text delimited by known tag patterns occurring just before and after the selection
• Take the smallest enclosing DIV block
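
A rough sketch of two of these heuristics (closing open tags, and the smallest enclosing DIV), assuming the page has been split into tag and text tokens and the selection is a token span [start, end). All names here (tag_name, close_open_tags, smallest_enclosing_div, VOID_TAGS) are hypothetical. The "known tag patterns" heuristic is left out because the slides do not list the patterns.

```python
import re

TOKEN_RE = re.compile(r"<[^>]*>|[^<]+")
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}  # tags without a closing tag


def tag_name(token):
    """Return (name, is_closing) for a tag token, or (None, False) for text."""
    match = re.match(r"<\s*(/?)\s*([a-zA-Z][a-zA-Z0-9]*)", token)
    if not match:
        return None, False
    return match.group(2).lower(), match.group(1) == "/"


def close_open_tags(tokens, start, end):
    """Extend end until every tag opened inside the selection is closed."""
    stack = []
    i = start
    while i < len(tokens) and (i < end or stack):
        name, closing = tag_name(tokens[i])
        if name and name not in VOID_TAGS and not tokens[i].endswith("/>"):
            if not closing:
                stack.append(name)
            elif name in stack:
                # Pop up to and including the matching opening tag.
                while stack.pop() != name:
                    pass
        i += 1
    return i


def matching_close(tokens, open_idx):
    """Index of the </div> that closes the <div> at open_idx, or None."""
    depth = 0
    for j in range(open_idx, len(tokens)):
        name, closing = tag_name(tokens[j])
        if name == "div":
            depth += -1 if closing else 1
            if depth == 0:
                return j
    return None


def smallest_enclosing_div(tokens, start, end):
    """Return the span of the innermost DIV that contains [start, end)."""
    for i in range(start, -1, -1):
        name, closing = tag_name(tokens[i])
        if name == "div" and not closing:
            j = matching_close(tokens, i)
            if j is not None and j + 1 >= end:
                return i, j + 1
    return start, end  # no enclosing DIV found; keep the selection as is
```

Scanning backwards from the selection means the first DIV whose closing tag lies at or beyond the end of the selection is also the smallest one.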

Page 12

Sentence generation

• Specification of abstract dependency trees
  – Specify grammatical relations between lexical items and constituents dominating over lexical items
  – Alpino dependency trees without adjacency information
  – More variation through underspecification in lexical items, handling of particles
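
To illustrate what "without adjacency information" could mean in practice, here is a small data-structure sketch in which the children of a node form an unordered set, so the tree fixes grammatical relations but not word order. The class DepNode, the helpers dep and lemmas, and the relation and category labels in the usage note are illustrative assumptions; this is not Alpino's actual input format.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class DepNode:
    rel: str                      # grammatical relation to the parent node
    cat: Optional[str] = None     # phrasal category, for constituents
    lemma: Optional[str] = None   # lexical item, given as a lemma only
    # Children are a set: no order, hence no adjacency information.
    # Caveat of this sketch: identical sibling nodes collapse into one.
    children: FrozenSet["DepNode"] = frozenset()


def dep(rel, *children, cat=None, lemma=None):
    """Convenience constructor for a DepNode."""
    return DepNode(rel=rel, cat=cat, lemma=lemma, children=frozenset(children))


def lemmas(node):
    """All lemmas in the tree, sorted alphabetically (the tree has no word order)."""
    found = [node.lemma] if node.lemma is not None else []
    for child in node.children:
        found.extend(lemmas(child))
    return sorted(found)
```

For example, dep("top", dep("su", dep("det", lemma="de"), dep("hd", lemma="man"), cat="np"), dep("hd", lemma="lezen"), cat="smain") gives only the lemma "lezen" for the verb, so a generator is free to choose inflection and word order. That is one simple form of underspecification.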

Page 13

Sentence generation

• Initial implementation of the generator:
  – Chart generator (Kay, 1996)
  – Top-down guidance through expected dependency relations
  – Generates a substantial part of the input created from the Alpino test suites
  – Included in recent Alpino versions

• Further work: optimization (time and space)

Page 14

Sentence generation

• Selecting the most fluent sentence through fluency ranking:
  – N-gram language model
  – Log-linear model
  – Experiments with Velldal (2007) and parse disambiguation feature templates
• Need more insight into feature overlap
• Experiment with more feature templates
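
A toy sketch of fluency ranking, assuming a bigram language model with add-one smoothing whose log-probability is one feature in a log-linear score. The feature names, the weights and all function names are hypothetical; the feature templates of Velldal (2007) and the parse disambiguation templates are not reproduced here.

```python
import math
from collections import Counter


def tokens_of(sentence):
    """Lower-cased tokens with sentence boundary markers."""
    return ["<s>"] + sentence.lower().split() + ["</s>"]


def train_bigram(sentences):
    """Count bigrams and their left contexts; return (contexts, bigrams, vocab_size)."""
    contexts, bigrams, vocab = Counter(), Counter(), set()
    for sentence in sentences:
        toks = tokens_of(sentence)
        contexts.update(toks[:-1])
        bigrams.update(zip(toks, toks[1:]))
        vocab.update(toks[1:])
    return contexts, bigrams, len(vocab)


def bigram_logprob(sentence, model):
    """Log-probability of a sentence under the add-one smoothed bigram model."""
    contexts, bigrams, vocab_size = model
    toks = tokens_of(sentence)
    return sum(
        math.log((bigrams[(a, b)] + 1) / (contexts[a] + vocab_size))
        for a, b in zip(toks, toks[1:])
    )


def loglinear_score(features, weights):
    """Unnormalised log-linear score: weighted sum of feature values."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())


def rank_by_fluency(candidates, model, weights):
    """Sort candidate realizations from most to least fluent."""
    def score(sentence):
        features = {
            "lm_logprob": bigram_logprob(sentence, model),
            "n_tokens": float(len(sentence.split())),
        }
        return loglinear_score(features, weights)
    return sorted(candidates, key=score, reverse=True)
```

Because all candidates realize the same input, the normalisation constant of the log-linear model is shared and can be dropped when only the ranking matters.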

Page 15

Sentence generation

• Evaluation:
  – Corpus sentences used as a reference for the most fluent realization
  – Fairly strict, since there can be multiple fluent sentences
  – Where is the ceiling?
  – More annotated material!
  – FLAN: FLuency ANnotator (web application)

Page 16

Thanks!