Daniel Gayo Avello (University of Oviedo)

Naive Algorithms for Key-phrase Extraction and Text Summarization from a Single Document inspired by the Protein Biosynthesis Process


What’s the problem?

• Document reading is a time consuming task…

• Many common documents (e.g., e-mail, newsgroup posts, web pages) lack an abstract or keywords…

• But they are “electronic”, so we can process them automatically in some way…

What’s the problem? (cont.)

• Many techniques exist to perform useful Natural Language Processing (NLP) tasks:

– Language identification.

– Document categorization and clustering.

– Keyword extraction.

– Text summarization.

• These techniques are quite different:

– With/Without human supervision.

– With/Without training.

– With/Without complex linguistic data.

– With/Without document corpora.

Any suggestion?

• It would be great to use only one technique to carry out several of those tasks.

• Desirable goals:

– Simple (only free text, no linguistic data).

– Fully automatic (neither supervision nor ad hoc heuristics).

– Scalable (from one web page to several web sites).

• Could it be a bio-inspired solution?

Our (bio-inspired) hypothesis

• Living beings are defined by their genome.

• Document from a corpus ≈ Individual from a population.

• So…?

• Let’s imagine a “document genome”…

– Similar documents (similar language/topic) → Similar genomes.

– More interesting: translation from “document genome” to “significance proteins” (i.e., keyphrases and summaries).

Our biological inspiration

• The protein biosynthesis process…

[Figure: the protein biosynthesis process. Transcription: DNA is copied into a single-stranded mRNA molecule (mRNA: AUGAUGCCGGGUUACUAAUAAUAC). Initiation, elongation and termination: amino acids are assembled into a polypeptide chain. Folding process: the protein is folded into a 3D structure.]

Could we mimic this to distill keyphrases and summaries from a single document?

The “ingredients”…

Biological element     Computational “counterpart”

tRNA                   Spliced document “genome”

mRNA                   Document’s plain text

Ribosome               Algorithm

Polypeptide chain      Document chunks with significance weights

Protein                Keyphrases

A “DNA” for Natural Language?

• n-grams (slices of adjoining n characters)

• Frequency is not the most relevant weight for each n-gram.

• Different measures exist to show the relation between the two elements of a bigram:

– Mutual information.

– Dice coefficient.

– Log-likelihood.

– …

• They cannot be applied straightforwardly to n-grams…

• …but they can be generalized (Ferreira and Pereira, 1999).
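The ideas on this slide can be sketched in a few lines of Python. This is only an illustration, not the prototype's code: the function names are hypothetical, and the two bigram measures shown (pointwise mutual information with a base-2 logarithm, and the Dice coefficient computed from raw counts) are standard textbook formulations assumed here for the sake of the example.

```python
import math
from collections import Counter


def char_ngrams(text, n):
    """Return every slice of n adjoining characters in text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def relative_frequencies(text, n):
    """Map each character n-gram to its relative frequency in text."""
    grams = char_ngrams(text, n)
    total = len(grams)
    return {gram: count / total for gram, count in Counter(grams).items()}


def bigram_association(text, bigram):
    """Return (PMI, Dice) for a character bigram, or None if it is absent.

    Assumption: PMI uses base-2 logs; Dice is computed from raw counts.
    """
    pair_counts = Counter(char_ngrams(text, 2))
    char_counts = Counter(text)
    c_xy = pair_counts.get(bigram, 0)
    if c_xy == 0:
        return None
    c_x, c_y = char_counts[bigram[0]], char_counts[bigram[1]]
    p_xy = c_xy / sum(pair_counts.values())
    p_x, p_y = c_x / len(text), c_y / len(text)
    pmi = math.log2(p_xy / (p_x * p_y))
    dice = 2 * c_xy / (c_x + c_y)
    return pmi, dice
```

For instance, `char_ngrams("The rain", 4)` yields the five overlapping slices of the string. Both association measures are defined for pairs only, which is why a generalization is needed before they can weight longer n-grams.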

A “DNA” for Natural Language? (cont.)

Original document: The rain in Spain stays mainly in the plain.

n-grams: < in > < mai> < pla> < rai> < Spa> < sta> < the> <ain > <ainl> <ays > <e pl> <e ra>…

[Figure: assigning weights to n-grams. Labels: relative frequency; Fair Specific Mutual Information. Example weights shown: <inly> 1.975, <Spai> 2.013.]
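One way to generalize mutual information from bigrams to n-grams is to average the product of the probabilities of the two parts over every possible split point of the n-gram, and compare the n-gram's own probability against that average. The sketch below follows that idea. It is an assumption about how such a "fair" measure can be computed, not the exact formula or log base used in the prototype, so it is not expected to reproduce the weights shown on the slide; all function names are hypothetical.

```python
import math
from collections import Counter


def ngram_probabilities(text, max_n):
    """Relative frequency of every character n-gram for n = 1..max_n."""
    probs = {}
    for n in range(1, max_n + 1):
        grams = [text[i:i + n] for i in range(len(text) - n + 1)]
        total = len(grams)
        for gram, count in Counter(grams).items():
            probs[gram] = count / total
    return probs


def fair_specific_mutual_information(gram, probs):
    """log(p(gram) / average over split points of p(left) * p(right)).

    Assumption: natural logarithm; gram and all its prefixes and
    suffixes must be present in probs.
    """
    n = len(gram)
    if n < 2:
        raise ValueError("the measure needs at least two characters")
    split_products = [probs[gram[:i]] * probs[gram[i:]] for i in range(1, n)]
    average = sum(split_products) / (n - 1)
    return math.log(probs[gram] / average)


text = "The rain in Spain stays mainly in the plain."
probs = ngram_probabilities(text, 4)
weight = fair_specific_mutual_information("Spai", probs)
```

An n-gram whose parts co-occur more often than chance would predict gets a positive weight, whatever its raw frequency.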

Document genome translation

Pseudo-mRNA: The rain in Spain stays mainly in the plain.

[Figure: the n-grams <The->, <he-r>, <e-ra>, etc. are attached one after another to the pseudo-mRNA; the values shown are 20 for “The ”, 49 for “The r” and 73 for “The ra”.]

• So…

– “Document genome” spliced into “pseudo-tRNA”.

– Document used as “pseudo-mRNA”.

– We “attach” pseudo-tRNA “molecules” (with max. weight) to the document while the average significance per character continues growing.

• Result: document spliced into “chunks” with maximum average significance.

[The] [rain] [in] [Spain] [stays mainly] [in] [the plain]
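The splicing step can be read as a greedy procedure: keep attaching overlapping n-grams to the current chunk while the average significance per character keeps growing, and start a new chunk as soon as it stops. The sketch below is one possible interpretation under that assumption. The function name, the fixed n-gram size and the left-to-right extension (instead of an explicit search for the maximum-weight "molecule") are simplifications made here, not details taken from the prototype.

```python
def split_into_chunks(text, weights, n=4):
    """Greedily split text into chunks of growing average significance.

    weights maps character n-grams to significance values; n-grams
    missing from weights count as 0. Returns (chunk, average) pairs.
    """
    chunks = []
    start = 0
    while start < len(text):
        # Seed the chunk with the first n-gram at this position.
        end = min(start + n, len(text))
        total = weights.get(text[start:end], 0.0)
        best_average = total / (end - start)
        while end < len(text):
            # Attach the next overlapping n-gram (one more character).
            gram = text[end - n + 1:end + 1]
            candidate_total = total + weights.get(gram, 0.0)
            candidate_average = candidate_total / (end + 1 - start)
            if candidate_average <= best_average:
                break  # average significance stopped growing
            total, best_average, end = candidate_total, candidate_average, end + 1
        chunks.append((text[start:end], best_average))
        start = end
    return chunks
```

Concatenating the returned chunks always gives back the original text, so the procedure only decides where the cut points fall.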

Folding the “protein” / summarization

• To obtain keyphrases, the “protein” (text chunks) must be folded…

• At this moment we are studying different alternatives:

– Mutual reinforcement?

– Chunks ≈ Documents → Apply classical IR techniques?

– Others?

• Automatic text summarization:

– Simple but useful approach.

– Use the shortest paragraphs with the most significant keyphrases.
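The summarization heuristic (shortest paragraphs with the most significant keyphrases) can be sketched as a density score: the total weight of the keyphrases a paragraph contains, divided by the paragraph's length. The scoring formula, the case-insensitive substring matching and the function name are assumptions made for illustration; the slides do not specify how the two criteria are combined.

```python
def summarize(paragraphs, keyphrase_weights, max_paragraphs=2):
    """Pick the paragraphs with the highest keyphrase weight per character.

    keyphrase_weights maps keyphrases to significance values. Selected
    paragraphs are returned in their original document order.
    """
    scored = []
    for index, paragraph in enumerate(paragraphs):
        lowered = paragraph.lower()
        significance = sum(
            weight
            for phrase, weight in keyphrase_weights.items()
            if phrase.lower() in lowered
        )
        if significance > 0:
            # Shorter paragraphs with the same keyphrases score higher.
            scored.append((significance / len(paragraph), index))
    top = sorted(scored, reverse=True)[:max_paragraphs]
    return [paragraphs[index] for _, index in sorted(top, key=lambda item: item[1])]
```

Dividing by length is what favors short paragraphs; a different trade-off (for example, a length penalty only above some threshold) would fit the same description.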

Work at an early stage

• To test the feasibility of these ideas, a prototype was developed.

• blindLight – http://www.purl.org/NET/blindLight

• It receives a user-provided URL and produces:

– A “blindlighted” version of the original URL.

– A list of keyphrases.

– An automatic summary.

Conclusions

• Proof-of-concept tests have been performed:

– Details in the paper…

– Results can be improved.

– Thorough study and analysis is needed.

– Really promising!

• Summary of the proposal:

1. Free text from just one document.

2. Language independent (currently only western languages).

3. Bio-inspired.

4. Extremely simple to implement.

Merci beaucoup! ¡Muchas gracias! Thank you!
