Naive Algorithms for Key-phrase Extraction and Text Summarization from a Single Document inspired by the Protein Biosynthesis Process
Daniel Gayo Avello (University of Oviedo)
What’s the problem?
• Document reading is a time-consuming task…
• Many common documents (e.g., e-mail, newsgroup posts, web pages) lack an abstract or keywords…
• But they are “electronic”, so we can work on them in some way…
What’s the problem? (cont.)
• Many techniques exist to perform several useful Natural Language Processing (NLP) tasks:
– Language identification.
– Document categorization and clustering.
– Keyword extraction.
– Text summarization.
• They differ considerably:
– With/Without human supervision.
– With/Without training.
– With/Without complex linguistic data.
– With/Without document corpora.
Any suggestion?
• It would be great to use only one technique to carry out several of those tasks.
• Desirable goals:
– Simple (only free text, no linguistic data)
– Fully automatic (neither supervision nor ad hoc heuristics)
– Scalable (from one web page to several web sites)
Could it be a bio-inspired solution?
• Living beings are defined by their genome.
• Document from a corpus ≈ Individual from a population
• So…?
• Let’s imagine a “document genome”…
– Similar documents (similar language/topic) ⇒ similar genomes.
– More interesting: translation from “document genome” to “significance proteins” (i.e., keyphrases and summaries).
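[Figure: the sentence “The rain in Spain stays mainly in the plain.” split into overlapping four-character fragments (“The-”, “he-r”, “e-ra”, …, where “-” apparently stands for a space) with weights 20, 29 and 24. Attached to the pseudo-mRNA, the accumulated weights are 20 for “The ”, 49 for “The r”, 73 for “The ra”, etc.]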
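The idea of a “document genome” can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the fragment length of 4 is read off the figure, and plain occurrence counts stand in for the weights, since the slides do not say how the weights shown (20, 29, 24) are computed. The names `document_genome` and `SENTENCE` are hypothetical.

```python
from collections import Counter

SENTENCE = "The rain in Spain stays mainly in the plain."

def document_genome(text: str, n: int = 4) -> Counter:
    """Count every overlapping character n-gram in `text`.

    Assumption: raw counts are used as a stand-in for the
    significance weights, which the slides do not define.
    """
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

genome = document_genome(SENTENCE)

# Fragments shared by several words ("rain ", "Spain ") occur more than once.
print(genome["ain "], genome[" in "], genome["The "])
```

Two documents in the same language or on the same topic would be expected to share many high-count fragments, which is the intuition behind “similar documents ⇒ similar genomes”.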
• So…
– “Document genome” spliced into “pseudo-tRNA”.
– Document used as “pseudo-mRNA”.
– We “attach” pseudo-tRNA “molecules” (with max. weight) to the document while the average significance per character keeps growing.
• Result: Document spliced into “chunks” with maximum average significance.
[Figure: the example sentence “The rain in Spain stays mainly in the plain” divided into chunks.]
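The chunking step can be illustrated with a greedy, left-to-right sketch. It is an assumption-laden reading of the slide, not the algorithm from the paper: a chunk starts with one fragment and grows one character at a time, adding the weight of the fragment that ends at the new character, for as long as the average weight per character increases. The selection of the pseudo-tRNA “with max. weight” is not modelled, and `split_into_chunks` and `weights` are hypothetical names. The three weights are the ones shown in the figure.

```python
def split_into_chunks(text: str, weights: dict, n: int = 4) -> list:
    """Greedily split `text` into chunks of growing average weight.

    Assumption: `weights` maps character n-grams to a significance
    score; n-grams missing from the mapping weigh 0.
    """
    chunks = []
    start = 0
    while start < len(text):
        # A chunk begins with the first n-gram (or whatever text is left).
        end = min(start + n, len(text))
        total = weights.get(text[start:end], 0.0)
        best_avg = total / (end - start)
        while end < len(text):
            # Weight of the n-gram that ends at the next character.
            gain = weights.get(text[end - n + 1:end + 1], 0.0)
            new_avg = (total + gain) / (end + 1 - start)
            if new_avg <= best_avg:
                break  # the average stopped growing: close the chunk
            total += gain
            best_avg = new_avg
            end += 1
        chunks.append(text[start:end])
        start = end
    return chunks

# Weights taken from the figure: "The " = 20, "he r" = 29, "e ra" = 24.
weights = {"The ": 20, "he r": 29, "e ra": 24}
chunks = split_into_chunks("The rain", weights)
print(chunks)
```

With these three weights the accumulated totals are 20, 49 and 73, as in the figure, and the first chunk closes once an unweighted fragment would lower the average.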
Folding the “protein” / summarization
(Work at an early stage)
• To obtain keyphrases the “protein” (text chunks) must be folded…
• At this moment we are studying different alternatives:
– Mutual reinforcement?
– Chunks ≈ Documents ⇒ Apply classical IR techniques?
– Others?
• Automatic text summarization
– Simple but useful approach.
– Use the shortest paragraphs with the most significant keyphrases.
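The summarization rule (“the shortest paragraphs with the most significant keyphrases”) could be sketched as follows. How the two criteria are combined is not stated in the slides, so dividing the total keyphrase weight by the paragraph length is an assumption, as are the names `summarize`, `paragraphs` and `keyphrases` and the example weights.

```python
def summarize(paragraphs: list, keyphrases: dict, k: int = 2) -> list:
    """Return the k best paragraphs, kept in document order.

    Assumption: a paragraph's score is the summed weight of the
    keyphrases it contains divided by its length in characters,
    so short, keyphrase-rich paragraphs rank first.
    """
    def score(paragraph: str) -> float:
        if not paragraph:
            return 0.0
        lowered = paragraph.lower()
        significance = sum(
            weight for phrase, weight in keyphrases.items()
            if phrase.lower() in lowered
        )
        return significance / len(paragraph)

    ranked = sorted(
        range(len(paragraphs)),
        key=lambda i: score(paragraphs[i]),
        reverse=True,
    )
    return [paragraphs[i] for i in sorted(ranked[:k])]

paragraphs = [
    "Protein biosynthesis inspires the method.",
    "This long paragraph talks about many unrelated things and mentions "
    "protein biosynthesis only once in passing.",
    "Nothing relevant here.",
]
keyphrases = {"protein biosynthesis": 3.0, "method": 1.0}  # made-up weights
summary = summarize(paragraphs, keyphrases, k=1)
print(summary)
```

The keyphrase weights would come from the folding step, which the slides describe as still under study.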
• To test the feasibility of these ideas, a prototype was developed.
• blindLight – http://www.purl.org/NET/blindLight
• It receives a user-provided URL and produces:
– A “blindlighted” version of the original URL.
– A list of keyphrases.
– An automatic summary.
Conclusions
• Proof-of-concept tests have been performed
– Details in the paper…
– Results can be improved.
– Thorough study and analysis are needed.
– Really promising!
• Summary of the proposal
1. Free text from just one document.
2. Language independent (currently only western languages).
3. Bio-inspired.
4. Extremely simple to implement.