Formalizing and Querying Heterogeneous Documents with Tables Krishnaprasad Thirunarayan and Trivikram Immaneni Department of Computer Science and Engineering Wright State University Dayton, OH-45435
Dec 27, 2015

Formalizing and Querying Heterogeneous Documents with Tables

Krishnaprasad Thirunarayan and Trivikram Immaneni

Department of Computer Science and Engineering, Wright State University

Dayton, OH-45435

Overall R&D Agenda

Develop semi-automatic techniques for information extraction/retrieval to enable man and machine to complement each other in assimilation of semi-structured, heterogeneous documents

=> Semantic Web Technologies.

Goal (What?)

Background and Motivation (Why?)

Implementation Details (How?)

Evaluation and Applications (Why?)

Conclusions

Goal

Define, embed, and use metadata in semi-structured documents containing tables.

Content-oriented, domain-specific annotation of a human-sensible document:

Makes the semantics of complex data explicit.

Enables augmentation of an interpretation in a modular fashion.

Heterogeneous Document

Background and Motivation

Generate an XML master document that is machine-processable and can also serve as the basis for a human-sensible presentation.

Basis of semi-automation in practice.

Embedding metadata improves traceability, thereby facilitating content extraction, verification, and update.

Implementation Details (How?)

XML Technology

Document-Centric View: XML is used to annotate documents for use by humans in the realm of document processing and content extraction.

Data-Centric View: XML is used as a text-based format for information exchange and serialization in the context of Web Services.

Basic idea behind our approach

Unify the two views: use XML elements to materialize the abstract syntax and, together with XML attributes and XML element definitions, to formalize the content.

Key advantage: Minimizes the maintenance of additional data structures needed to relate the original document to its formalization.

Two Concrete Implementations

Use the Web Services language Water, which amalgamates XML technology with programming language concepts

Use XML/XSLT infrastructure

Water-based approach

Each annotation reflects the semantics of the text fragment it encloses. The annotated data can be interpreted by viewing it as a function/procedure call in Water. The correspondence between formal parameter and actual argument is position-based.

The semantics of an annotation is defined separately in Water, as a method definition in a class.

Example Table

Thickness (mm)    Tensile Strength (ksi)    Yield Strength (ksi)
0.50 and under    165                       155
0.50 – 1.00       160                       150
1.00 – 1.50       155                       145

Example of Tagged Table

Thickness (mm) Tensile Strength (ksi) Yield Strength (ksi)

table.<setHeading thickness strength.tensile strength.yield/>

0.50 and under 165 155

table.<addRow 0 0.50 165 155 />

0.50 - 1.00 160 150

table.<addRow 0.50 1.00 160 150 />

1.00 - 1.50 155 145

table.<addRow 1.00 1.50 155 145 /> ...

Example of Processing Code

<defclass table rows=required=vector heading=optional=vector>
  <defmethod setHeading t=required ts=required ys=required>
    <set heading=<vector t ts ys/>/>
  </>
  <defmethod addRow smin smax ts ys>
    <set rows=table.rows.<insert <vector smin smax ts ys/>/>/>
  </>
  <defmethod computeYieldStrength> … </>
  <defmethod computeTensileStrength> … </>
</>
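The position-based interpretation can be sketched outside Water as well. The following Python fragment is only an illustration, not the authors' implementation: the `Table` class, the `METHODS` mapping, and the `interpret` helper are hypothetical names, and annotations are simplified to whitespace-separated strings. It shows how each annotation is read as a method call whose arguments bind to parameters by position.

```python
class Table:
    """Hypothetical Python stand-in for the Water `table` class."""

    def __init__(self):
        self.heading = None
        self.rows = []

    def set_heading(self, t, ts, ys):
        # Mirrors setHeading: store the three column names as a vector.
        self.heading = [t, ts, ys]

    def add_row(self, smin, smax, ts, ys):
        # Mirrors addRow: append one row; values are parsed as numbers.
        self.rows.append([float(smin), float(smax), float(ts), float(ys)])


# Assumed mapping from annotation names to methods (not part of the slides).
METHODS = {"setHeading": Table.set_heading, "addRow": Table.add_row}


def interpret(table, annotation):
    """Treat an annotation such as 'addRow 0 0.50 165 155' as a call."""
    name, *args = annotation.split()
    METHODS[name](table, *args)  # actual arguments bind by position


table = Table()
for annotation in [
    "setHeading thickness strength.tensile strength.yield",
    "addRow 0 0.50 165 155",
    "addRow 0.50 1.00 160 150",
    "addRow 1.00 1.50 155 145",
]:
    interpret(table, annotation)
```

Because binding is positional, swapping two arguments in an annotation silently changes its meaning, which is one motivation for the name-based alternative.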

XML/XSLT-based approach

Each annotation reflects the semantics of the text fragment it encloses.

To make the annotated data XML-compliant, dummy attributes such as one, two, three, etc. are introduced. The correspondence between a formal attribute and its actual value is name-based.

The semantics is defined separately and modularly, by interpreting each XML element and its attributes via XSLT.

Example of Tagged Table

<table type="Tensile">
  <dependency name="Yield Offset" value="0.2%"/>
  <tableSchema one="Thickness(min)" two="Thickness(max)"
               three="Tensile Strength" four="Yield Strength"/>
  <tableUnits one="in" two="in" three="ksi" four="ksi" />
  <tableData one="0" two="0.50" three="165" four="155" />
  <tableData one="0.50" two="1.00" three="160" four="150" />
  ...
</table>
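As a rough illustration of the name-based correspondence, the sketch below recovers a plain table from tagged data of this shape. It uses Python's standard `xml.etree.ElementTree` rather than XSLT, so it is a stand-in for the stylesheet approach, not the authors' code; the `extract` function and the `KEYS` list are hypothetical, and the embedded XML is a two-row copy of the example with the elided rows left out.

```python
import xml.etree.ElementTree as ET

# Two-row copy of the tagged table; elided rows are not reconstructed.
TAGGED = """
<table type="Tensile">
  <dependency name="Yield Offset" value="0.2%"/>
  <tableSchema one="Thickness(min)" two="Thickness(max)"
               three="Tensile Strength" four="Yield Strength"/>
  <tableUnits one="in" two="in" three="ksi" four="ksi"/>
  <tableData one="0" two="0.50" three="165" four="155"/>
  <tableData one="0.50" two="1.00" three="160" four="150"/>
</table>
"""

# The dummy attribute names, in column order (an assumption of this sketch).
KEYS = ["one", "two", "three", "four"]


def extract(xml_text):
    """Return (header, rows), pairing values with columns by attribute name."""
    root = ET.fromstring(xml_text.strip())
    schema = root.find("tableSchema").attrib
    units = root.find("tableUnits").attrib
    header = [f"{schema[k]} ({units[k]})" for k in KEYS]
    rows = [[data.get(k) for k in KEYS] for data in root.findall("tableData")]
    return header, rows


header, rows = extract(TAGGED)
```

Since values are looked up by attribute name, the order in which attributes appear inside an element does not affect the result.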

XSLT Stylesheets can be used to:

Query: to perform table look-ups.

Transform: to change units of measure such as from standard SI units to FPS units and vice versa.

Format: to display the table in HTML form.

Extract: to recover the original table.

Verify: to check static semantic constraints on table data values.
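Two of these uses, Verify and Transform, can be sketched in a few lines. The Python below is a hedged illustration rather than an XSLT stylesheet: the specific constraints checked in `verify` (each range has min below max, consecutive ranges are contiguous, yield strength does not exceed tensile strength) are assumptions chosen for the example, and `to_mpa` is a hypothetical helper that uses the standard conversion 1 ksi ≈ 6.894757 MPa.

```python
KSI_TO_MPA = 6.894757  # 1 ksi expressed in megapascals

# Rows as (min thickness, max thickness, tensile strength, yield strength).
rows = [
    (0.0, 0.50, 165.0, 155.0),
    (0.50, 1.00, 160.0, 150.0),
    (1.00, 1.50, 155.0, 145.0),
]


def verify(table_rows):
    """Return a list of violated constraints (assumed for this sketch)."""
    problems = []
    for i, (lo, hi, ts, ys) in enumerate(table_rows):
        if not lo < hi:
            problems.append(f"row {i}: min thickness {lo} is not below max {hi}")
        if ys > ts:
            problems.append(f"row {i}: yield strength exceeds tensile strength")
    for i in range(1, len(table_rows)):
        if table_rows[i][0] != table_rows[i - 1][1]:
            problems.append(f"row {i}: range does not start where row {i - 1} ends")
    return problems


def to_mpa(table_rows):
    """Convert both strength columns from ksi to MPa, rounded to 0.1 MPa."""
    return [
        (lo, hi, round(ts * KSI_TO_MPA, 1), round(ys * KSI_TO_MPA, 1))
        for lo, hi, ts, ys in table_rows
    ]
```

A check of this kind would flag, for instance, a mistyped lower bound such as 0.05 in place of 0.50, because the second range would no longer start where the first one ends.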

Evaluation and Application (Why?)

Advantage

Only tabular data in each document is annotated. The annotation definition is factored out as background knowledge. Thus, the semantics of each table type is specified just once outside the document and is reused with different documents containing similar tables.

Disadvantage

Both avenues require mature tool support for widespread adoption.

For example, develop an MS FrontPage-like interface in which the master document is the annotated form, the user interacts with and edits only a view of the annotated document (for readability), and an export-as-XML feature generates a well-formed XML document.

Prolog rendition

strengthTableRow(0, 0.50, 165, 155).
strengthTableRow(0.50, 1.00, 160, 150).
strengthTableRow(1.00, 1.50, 155, 145).
...

strengthTable(Thickness, TensileStrength, YieldStrength) :-
    strengthTableRow(L, U, TensileStrength, YieldStrength),
    L =< Thickness, U > Thickness.

thicknessToTensileStrength(Thickness, TensileStrength) :-
    strengthTable(Thickness, TensileStrength, _).
thicknessToYieldStrength(Thickness, YieldStrength) :-
    strengthTable(Thickness, _, YieldStrength).

?- thicknessToYieldStrength(0.6,YS).
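The look-up rule treats each row as a half-open interval: a thickness matches when it is at least the lower bound and strictly below the upper bound. The Python sketch below mirrors that rule to make the boundary behaviour easy to inspect; the function names are transliterations of the Prolog predicates, the `ROWS` list holds only the three rows shown, and returning `None` for a thickness outside every range is an assumption standing in for Prolog's failure.

```python
# The three rows listed above: (L, U, tensile strength, yield strength).
ROWS = [
    (0.0, 0.50, 165, 155),
    (0.50, 1.00, 160, 150),
    (1.00, 1.50, 155, 145),
]


def strength_table(thickness):
    """Mirror of strengthTable/3: match when L =< Thickness and U > Thickness."""
    for lower, upper, tensile, yield_strength in ROWS:
        if lower <= thickness < upper:
            return tensile, yield_strength
    return None  # no matching row; the Prolog query would simply fail


def thickness_to_yield_strength(thickness):
    """Mirror of thicknessToYieldStrength/2."""
    match = strength_table(thickness)
    return None if match is None else match[1]


def thickness_to_tensile_strength(thickness):
    """Mirror of thicknessToTensileStrength/2."""
    match = strength_table(thickness)
    return None if match is None else match[0]
```

With these rows, a thickness of 0.6 falls in the second range and yields 150. A thickness of exactly 0.50 also lands in the second range under this rule, even though a row labelled "0.50 and under" might suggest the first, so the choice of inclusive bound is worth stating explicitly in the formalization.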

Conclusion and Future Work

Develop a catalog of predefined tables, specifying them using Semantic Web formalisms (such as RDF and OWL), and map tabular data into a set of these predefined tables, possibly qualified.

Develop techniques for manual mapping of complex tables into simpler ones:

To provide semantics to data.

To improve traceability.

To facilitate automatic manipulation.

Tailor and improve IE and IR techniques developed in the context of text processing for Semantic Web documents (such as XML and RDF), benefiting from the additional support of ontologies (such as those in OWL).

Holy Grail

Ultimately, develop principles, techniques, and tools to author and extract the human-readable and machine-comprehensible parts of a document hand in hand, and to keep them side by side.