Web Document Analysis: How can Natural Language Processing Help in Determining Correct Content Flow?
Hassan Alam, Fuad Rahman and Yuliya Tarnikova
Human Computer Interaction Group
BCL Technologies Inc., Santa Clara, CA 95050
www.bcltechnologies.com
[email protected]
Overview of the talk
Web document re-authoring
HTML data structure and segmentation
Merging and the "mess"
Semantic Relatedness of Textual Segments
Spoken Language User Interface Toolkit (SLUITK)
How do we do it?
Some applications
Conclusion and future work
Web Page Data Structure
Merging R Us
While merging two segments, the only information available to the merging algorithm is the proximity map and broad content classification.
Totally unrelated content can easily pass these tests, in which case the merging algorithm fails.
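The failure mode described above can be illustrated with a minimal sketch. The `Segment` fields, the Manhattan-distance proximity test and the `max_distance` threshold are all hypothetical stand-ins for the proximity map and broad content classification; they are not taken from the original system.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    text: str
    content_class: str  # broad class, e.g. "text", "links" (hypothetical labels)
    x: int              # top-left corner in layout units (hypothetical)
    y: int


def should_merge(a: Segment, b: Segment, max_distance: int = 50) -> bool:
    """Merge test that sees only proximity and broad content class."""
    close = abs(a.x - b.x) + abs(a.y - b.y) <= max_distance
    same_class = a.content_class == b.content_class
    return close and same_class


story = Segment("The central bank raised interest rates today.", "text", 10, 100)
advert = Segment("Cheap flights to Hawaii, book now!", "text", 10, 130)

# Both segments are plain text and sit 30 units apart, so the test passes
# even though the two topics have nothing in common.
print(should_merge(story, advert))  # True
```

Nothing in this test looks at what the text says, which is the gap the rest of the talk addresses.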
eMerging Questions?
How do we determine if two separate web document segments contain related information?
What is the definition of 'relatedness'?
If a segment is geometrically embedded within closely related segments, can we determine whether it is also related to the surrounding segments?
When a hyperlink is followed and a new page is accessed, how do we know which exact segment within that new page is directly related to the link we just followed?
Natural Language Processing
Syntax
Semantics
Context
Anaphora
Tokenizing
Theme
Our Answer
Lexical Chains
Lexical Chains
A lexical chain is a sequence of related words in a narrative. It can be composed of adjacent words or sentences or can cover elements from the complete narrative.
Cohesion is a way of connecting different parts of a text into a single theme. A lexical chain is a list of semantically related words, constructed by the use of co-reference, ellipses and conjunctions.
The aim is to identify the relationships between words that tend to co-occur in the same lexical context.
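As a rough illustration of the idea, the sketch below greedily groups words into chains. The `RELATED` table is a small hand-made assumption standing in for a real thesaurus or lexical database, and the greedy strategy is only one simple way to build chains; it is not claimed to be the method used in this work.

```python
from __future__ import annotations

# Hypothetical, hand-made relatedness table for illustration only.
RELATED = {
    "car": {"vehicle", "engine", "driver"},
    "vehicle": {"car", "engine"},
    "bank": {"money", "loan"},
    "money": {"loan"},
}


def related(a: str, b: str) -> bool:
    """Two words are related if identical or linked in either direction."""
    return a == b or b in RELATED.get(a, set()) or a in RELATED.get(b, set())


def build_chains(words: list[str]) -> list[list[str]]:
    """Append each word to the first chain holding a related word,
    or start a new chain if no existing chain matches."""
    chains: list[list[str]] = []
    for word in words:
        for chain in chains:
            if any(related(word, member) for member in chain):
                chain.append(word)
                break
        else:
            chains.append([word])
    return chains


words = ["car", "bank", "engine", "loan", "driver", "money", "vehicle"]
chains = build_chains(words)
print(chains)
# [['car', 'engine', 'driver', 'vehicle'], ['bank', 'loan', 'money']]
```

Note that the chain members need not be adjacent in the input, which matches the observation that a chain can span the complete narrative.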
Lexical Chains
Coreference: The grammatical relation between two words that have a common referent
– Example: You said you would come. In this sentence, both occurrences of 'you' have the same referent.
Ellipsis: Omission or suppression of parts of words or sentences
– Example: 'the virtues I admire' for 'the virtues which I admire'
Conjecture: Reasoning that involves the formation of conclusions from incomplete evidence
– Example: Scientists supposed that large dinosaurs lived in swamps
What is SLUITK?
SLUI is a set of tools that allows programmers to rapidly develop applications with Natural Language Processing functionality.
[Figure: SLUI Toolkit overview. Labels: SLUI, Input Sentences, Dialogs, Program, SLUI Toolkit, C++, Java, End User, Programmer, Action Code & VSP]
SLUITK: Steps for the Programmer to Follow while Setting up the Toolkit
Stages: Input Setup Information, Expand SFT, Debug SFT, Deploy Program
Steps for Programmer
1. Insert the domain-specific lexicon
2. Enter sample sentences
3. Enter Variable Sentence Parameter values
4. Enter an action code for each sample sentence
5. Expand SFT
6. Debug the results in the Semantic Frame Table (SFT)
7. Direct user input to the SLUI
Flow shown in the diagram:
SLUI Toolkit analyzes sentences
End user runs the SLUI-enabled program
SLUI analyzes user input and handles errors
SLUI maps user input to actions and returns action code
Program executes tasks
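The last part of this flow, mapping user input to an action code, can be sketched in a few lines. This is not the SLUITK API: the sample sentences, the action code names and the word-overlap matching below are all assumptions chosen to keep the example small, whereas the toolkit itself works through a Semantic Frame Table.

```python
import re
from typing import Optional

# Hypothetical sample sentences paired with action codes (steps 2 and 4).
SAMPLES = {
    "open the file": "ACTION_OPEN",
    "print the document": "ACTION_PRINT",
    "close the window": "ACTION_CLOSE",
}


def tokens(sentence: str) -> set:
    """Lower-case the sentence and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", sentence.lower()))


def map_to_action(user_input: str, min_overlap: int = 2) -> Optional[str]:
    """Return the action code of the sample sharing the most words with
    the input, or None when the best overlap is below min_overlap."""
    words = tokens(user_input)
    best_code, best_score = None, 0
    for sample, code in SAMPLES.items():
        score = len(words & tokens(sample))
        if score > best_score:
            best_code, best_score = code, score
    return best_code if best_score >= min_overlap else None


print(map_to_action("Please print this document"))  # ACTION_PRINT
print(map_to_action("What time is it?"))            # None
```

Returning `None` for unmatched input gives the calling program a place to handle errors, in the spirit of the "handles errors" box in the diagram.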
[Figure: SLUI processing components. Labels: Speech Recognition, Input ter, Sentence Tokenizer, Query Classifier, Auto Spell Correct, Syntax Recognizer, Parser, Anaphora Resolution, Translator, Frame Generator, Frame Handler, Action Handler]
Our Frame
Example sentence: Can you suggest some internet sites or books that give details on lowering the LDL by 50 points without including information on cancer risks?
Frame slots: Sentence Type, Predicate, Object (Arg 2), Subject (Arg 1), Action, Object (Arg 3), Mod 1 (Head), Mod 1 (Comp), Mod 2 (Head), Mod 3 (Comp)
Frame contents, one line per frame (---- marks an empty slot):
---- include information ---- ? ---- information on cancer risks ---- ----
---- give detail ---- ? ---- detail on lower LDL by 50 points ---- ----
---- lower LDL ---- ? ---- lower by 50 points ---- ----
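One way to hold such frames in code is sketched below. The class name and field names are hypothetical, only a few of the slide's slots are modelled, and the assignment of the recovered words to predicate, object and modifier slots is an assumption based on their order on the slide.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SemanticFrame:
    """A reduced frame; field names are illustrative, not from the toolkit."""
    predicate: Optional[str] = None
    obj: Optional[str] = None           # "Object (Arg 2)" on the slide
    mod_head: Optional[str] = None      # "Mod 1 (Head)"
    mod_comp: Optional[str] = None      # "Mod 1 (Comp)"

    def filled_slots(self) -> dict:
        """Return only the slots that carry a value."""
        return {k: v for k, v in vars(self).items() if v is not None}


# Assumed slot assignment for the three frames of the example sentence.
frames = [
    SemanticFrame("give", "detail", "detail", "on lower LDL by 50 points"),
    SemanticFrame("lower", "LDL", "lower", "by 50 points"),
    SemanticFrame("include", "information", "information", "on cancer risks"),
]

for frame in frames:
    print(frame.filled_slots())
```

Splitting one long question into several small frames like this makes each predicate available separately to whatever handles the action.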
BCL Database
Sentences collected from email messages received between June 2000 and May 2001
Deleted attachments, HTML and other tags, header files, and senders' information
Salutations and greetings were also deleted
Total of 34,640 lines and 170,000 words
We constantly update our corpus with new emails from our customers
Our Lexical Chains
Relatedness Factor
An Application: Web Page Re-authoring
Segment Scores
Example Output
Future Work
Currently, only a single main theme can be handled per document. In the future, we plan to develop a more generic solution that can handle documents with multiple themes.
Integration of this NLP method in building commercial summarizers and in aiding existing web page summarization techniques based on structural analysis alone is already well underway.
Determining the flow of web information between different web pages as the browser loads up new pages following hyperlinks.
Aiding geometric web parsers in determining the correct logical layout by complementing geometric information with linguistic coherence.
Conclusions
A novel approach to determining semantic relationships among segments of web documents using lexical chain computation.
Two related papers in ICDAR 2003
– One will explore the application of lexical chains in building a commercial summarizer capable of summarizing any document
– The other will concentrate on a hybrid approach to web page summarization, combining structural and NLP techniques