Naive Algorithms for Key-phrase Extraction and Text Summarization from a Single Document inspired by the Protein Biosynthesis Process. Daniel Gayo Avello (University of Oviedo)


Jan 22, 2016


Page 1: Daniel Gayo Avello (University of Oviedo)

Naive Algorithms for Key-phrase Extraction and Text Summarization from a Single Document inspired by the Protein Biosynthesis Process

Daniel Gayo Avello (University of Oviedo)

Page 2: Daniel Gayo Avello (University of Oviedo)

What’s the problem?

• Document reading is a time-consuming task…

• Many common documents (e.g., e-mail, newsgroup posts, web pages) lack an abstract or keywords…

• But, they are “electronic” so we can work on them in some way…


Page 3: Daniel Gayo Avello (University of Oviedo)

What’s the problem? (cont.)

• Many techniques exist to perform useful Natural Language Processing (NLP) tasks:
– Language identification.
– Document categorization and clustering.
– Keyword extraction.
– Text summarization.

• They are quite different:
– With/without human supervision.
– With/without training.
– With/without complex linguistic data.
– With/without document corpora.


Page 4: Daniel Gayo Avello (University of Oviedo)

Any suggestion?

• It would be great to use only one technique to carry out several of those tasks.

• Desirable goals:
– Simple (only free text, no linguistic data).
– Fully automatic (neither supervision nor ad hoc heuristics).
– Scalable (from one web page to several web sites).

• Could it be a bio-inspired solution?


Page 5: Daniel Gayo Avello (University of Oviedo)

Our (bio-inspired) hypothesis

• Living beings are defined by their genome.
• Document from a corpus ≈ individual from a population.
• So…?
• Let’s imagine a “document genome”…
– Similar documents (similar language/topic) → similar genomes.
– More interesting: translation from “document genome” to “significance proteins” (i.e., keyphrases and summaries).


Page 6: Daniel Gayo Avello (University of Oviedo)


Our biological inspiration

• The protein biosynthesis process…

[Figure: the protein biosynthesis process. Transcription: DNA is copied into a single-stranded mRNA molecule (mRNA: AUGAUGCCGGGUUACUAAUAAUAC). Initiation, elongation, termination: amino acids are assembled into a polypeptide chain. Folding process: the protein is folded into a 3D structure.]

Could we mimic this to distill keyphrases and summaries from a single document!?

Page 7: Daniel Gayo Avello (University of Oviedo)

The “ingredients”…

Biological element    Computational “counterpart”
tRNA                  Spliced document “genome”
mRNA                  Document’s plain text
Ribosome              Algorithm
Polypeptide chain     Document chunks with significance weights
Protein               Keyphrases


Page 8: Daniel Gayo Avello (University of Oviedo)

A “DNA” for Natural Language?

• n-grams (slices of n adjoining characters).

• Frequency is not the most relevant weight for each n-gram.

• Different measures exist to show the relation between the two elements of a bigram:
– Mutual information.
– Dice coefficient.
– Loglike.
– …

• They cannot be applied directly to n-grams…
• …but they can be generalized (Ferreira and Pereira, 1999).
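
As an illustration of two of the bigram measures listed above, here is a minimal Python sketch that computes specific (pointwise) mutual information and the Dice coefficient for a pair of adjacent characters. The function name `bigram_association`, the use of character bigrams, and the base-2 logarithm are assumptions made for this example; the slides do not say how the measures were computed.

```python
import math
from collections import Counter


def bigram_association(text, x, y):
    """Return (specific mutual information, Dice coefficient) for the
    character bigram x+y in text, or None if the bigram never occurs.
    Hypothetical helper for illustration only."""
    chars = Counter(text)                 # single-character counts
    pairs = Counter(zip(text, text[1:]))  # adjacent-pair counts
    f_xy = pairs[(x, y)]
    if f_xy == 0:
        return None
    p_xy = f_xy / (len(text) - 1)         # probability of the bigram
    p_x = chars[x] / len(text)            # probability of each element
    p_y = chars[y] / len(text)
    smi = math.log2(p_xy / (p_x * p_y))   # base 2 is an assumption
    dice = 2 * f_xy / (chars[x] + chars[y])
    return smi, dice


sample = "The rain in Spain stays mainly in the plain"
print(bigram_association(sample, "a", "i"))
```

Both measures compare how often the pair occurs with how often its two elements occur on their own, which is why raw frequency alone is not needed as a weight.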


Page 9: Daniel Gayo Avello (University of Oviedo)

A “DNA” for Natural Language? (cont.)

Original document: The rain in Spain stays mainly in the plain.

n-grams: < in > < mai> < pla> < rai> < Spa> < sta> < the> <ain > <ainl> <ays > <e pl> <e ra>…

Assigning weights to n-grams:

n-gram    Relative frequency    Fair Specific Mutual Information
<inly>    0.025                 1.975
<Spai>    0.025                 2.013
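
The weighting step can be sketched in Python as follows. The relative frequency part follows directly from the slide: the example sentence without its final period has 43 characters, hence 40 overlapping 4-grams, so an n-gram seen once gets 1/40 = 0.025. The `fair_smi` function is only one possible reading of the generalized measure: it assumes that the product of the two element probabilities is replaced by the average over every way of splitting the n-gram in two, and it assumes a base-2 logarithm. It is not claimed to reproduce the 1.975 and 2.013 values shown above, and all function names are hypothetical.

```python
import math
from collections import Counter


def ngram_counts(text, n):
    """Count every overlapping character n-gram of length n."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def relative_frequencies(text, n=4):
    """Map each n-gram to its count divided by the number of n-grams."""
    counts = ngram_counts(text, n)
    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.items()}


def fair_smi(text, gram):
    """Assumed generalization of specific mutual information: compare the
    n-gram probability with the average, over all split points, of the
    product of the probabilities of its two parts."""
    def prob(sub):
        counts = ngram_counts(text, len(sub))
        return counts[sub] / sum(counts.values())

    p_whole = prob(gram)
    if p_whole == 0:
        return None
    splits = range(1, len(gram))
    avg_split = sum(prob(gram[:i]) * prob(gram[i:]) for i in splits) / len(splits)
    return math.log2(p_whole / avg_split)


sample = "The rain in Spain stays mainly in the plain"
freqs = relative_frequencies(sample, 4)
print(freqs["Spai"], freqs["inly"])
print(fair_smi(sample, "Spai"), fair_smi(sample, "inly"))
```

In this sketch two n-grams with the same relative frequency can still receive different weights, because the weight depends on how common their parts are elsewhere in the document.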


Page 10: Daniel Gayo Avello (University of Oviedo)

Document genome translation

Original document: The rain in Spain stays mainly in the plain.

Pseudo-tRNA (n-gram, weight): <The-> 20, <he-r> 29, <e-ra> 24

Pseudo-mRNA (text covered so far, accumulated weight): “The ” 20, “The r” 49, “The ra” 73, etc.

• So…
– “Document genome” spliced into “pseudo-tRNA”.
– Document used as “pseudo-mRNA”.
– We “attach” pseudo-tRNA “molecules” (with max. weight) to the document while the average significance per character continues growing.

• Result: Document spliced into “chunks” with maximum average significance.

[Figure: the example sentence spliced into chunks: TheraininSpainstays mainly inthe plain]
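
The translation step can be sketched as a greedy loop. This is only one interpretation of the slide, under several assumptions: the hyphens in the pseudo-tRNA labels stand for spaces, each new character adds the weight of the n-gram that ends on it, a chunk is closed as soon as the average weight per character stops growing, and n-grams without a known weight count as 0. The name `split_into_chunks` and the toy weight table are hypothetical; the three weights are the ones shown on the slide.

```python
def split_into_chunks(text, weights, n=4):
    """Greedy sketch: grow each chunk one character at a time while the
    average weight per character keeps increasing, then start a new chunk.
    Returns a list of (chunk, average weight per character) pairs."""
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + n, len(text))
        # A chunk starts with one n-gram (a shorter tail weighs 0 here).
        total = weights.get(text[start:end], 0.0)
        best_avg = total / (end - start)
        while end < len(text):
            gram = text[end - n + 1:end + 1]   # n-gram ending at the new character
            new_total = total + weights.get(gram, 0.0)
            new_avg = new_total / (end + 1 - start)
            if new_avg <= best_avg:            # average stopped growing
                break
            total, best_avg, end = new_total, new_avg, end + 1
        chunks.append((text[start:end], best_avg))
        start = end
    return chunks


# Weights taken from the slide, with "-" read as a space (assumption).
demo_weights = {"The ": 20, "he r": 29, "e ra": 24}
demo_chunks = split_into_chunks("The rain", demo_weights)
print(demo_chunks)
```

With these three weights the accumulated totals for the first chunk are 20, 49 and 73, the same running sums shown for the pseudo-mRNA. The sketch does not model the "(with max. weight)" selection mentioned on the slide.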


Page 11: Daniel Gayo Avello (University of Oviedo)

Folding the “protein” / summarization

Work at an early stage

• To obtain keyphrases, the “protein” (text chunks) must be folded…

• At this moment we are studying different alternatives:
– Mutual reinforcement?
– Chunks ≈ documents → apply classical IR techniques?
– Others?

• Automatic text summarization:
– Simple but useful approach.
– Use the shortest paragraphs with the most significant keyphrases.
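
The summarization idea (shortest paragraphs with the most significant keyphrases) could be sketched as below. The scoring rule is an assumption made for this example: each paragraph is scored by the summed weight of the keyphrases it contains divided by its length in characters, and the best-scoring paragraphs are returned in document order. The function name `summarize` and the sample weights are hypothetical.

```python
def summarize(paragraphs, keyphrase_weights, max_paragraphs=1):
    """Pick the paragraphs with the highest keyphrase weight per character.
    Assumed scoring rule, for illustration only."""
    def score(paragraph):
        lowered = paragraph.lower()
        hits = sum(w for phrase, w in keyphrase_weights.items()
                   if phrase.lower() in lowered)
        return hits / max(len(paragraph), 1)   # favors short paragraphs

    ranked = sorted(range(len(paragraphs)),
                    key=lambda i: score(paragraphs[i]),
                    reverse=True)
    chosen = sorted(ranked[:max_paragraphs])   # restore document order
    return [paragraphs[i] for i in chosen]


demo_paragraphs = [
    "The rain in Spain stays mainly in the plain.",
    "This much longer paragraph mentions rain only once while talking "
    "at length about many other, unrelated things.",
    "Nothing relevant here.",
]
demo_keyphrases = {"rain": 2.0, "Spain": 3.0}
print(summarize(demo_paragraphs, demo_keyphrases))
```

Dividing by length is what makes a short paragraph win over a long one that mentions the same keyphrases.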


Page 12: Daniel Gayo Avello (University of Oviedo)

• To test feasibility of these ideas a prototype was developed.

• blindLight – http://www.purl.org/NET/blindLight

• It receives a user-provided URL and produces:

– A “blindlighted” version of the original URL.

– A list of keyphrases.

– An automatic summary.


Page 13: Daniel Gayo Avello (University of Oviedo)

Conclusions

• Proof-of-concept tests have been performed:
– Details in the paper…
– Results can be improved.
– Thorough study and analysis is needed.
– Really promising!

• Summary of the proposal:
1. Free text from just one document.
2. Language independent (currently only western languages).
3. Bio-inspired.
4. Extremely simple to implement.


Page 14: Daniel Gayo Avello (University of Oviedo)

Merci beaucoup! ¡Muchas gracias! Thank you!