Rainbow: Bridging XML and Relational Databases Using a Flexible Mapping
The Design, Implementation, and Evaluation of the Rainbow System
A Major Qualifying Project Report
Submitted to the Faculty
Of the
WORCESTER POLYTECHNIC INSTITUTE
In partial fulfillment of the requirements for the
Degree of Bachelor of Science
By
Tien Vu, John Lee, Mirek Cymer
Date: 5/02/2001
Approved: Professor Elke A. Rundensteiner
Authorship Page
T = Tien Vu, J = John Lee, M = Mirek Cymer
Section    Author(s)
1 Introduction    T
1.1 Motivation    T
1.2 Our Approach    T
1.3 The Rainbow System    T
1.4 MQP Project Goals    T
1.5 Additional Team Goals    T
1.6 Outline of the Remaining Sections    TJ
2 Background    M
2.1 Readings    M
2.2 Basics of XML and DTDs    M
2.3 Technologies
2.3.1 SQL, Relational Databases, and Oracle 8i    TJM
2.3.2 JDBC and ResultSet Classes    J
2.4 Software Development    TM
2.4.1 Object Oriented (OO) Design    TJ
2.4.2 Software Migration    TJ
3 DTD Metadata Management    J
3.1 Metadata Tables    J
3.2 Data Schema    J
3.3 DTD Manager and XML Manager Extensions    J
3.3.1 Original DTD Manager and XML Manager    J
3.3.2 Support for Multiple DTDs and XMLs    J
4 Implementation of the Rainbow System    J
4.1 Restructuring Subsystem    J
4.1.1 The Restructuring Functionality    J
4.1.2 A Prototype Design    J
4.1.3 Implementation Details    J
4.2 Restructuring Operators    T
4.2.1 Pushup and Pushdown Attribute Operators    T
4.2.2 Rename Item and Attribute Operators    T
4.2.3 Pushup and Pushdown Nesting Operators    T
4.2.4 Other Operators    T
4.3 Rainbow Graphical User Interface    M
5 Implementation Details    T
5.1 System Architecture    T
5.2 Code Facts    T
5.3 Existing System Packages    J
5.4 Implementation Environment    J
6 Experimental Evaluation    J
6.1 Experimental Setup    J
6.1.1 Scope and Design of a Test Plan    J
6.1.2 Designing an Experimental Test Bed    J
6.2 Performance Considerations    J
6.2.1 Restructuring Time    J
6.2.2 Query Time    J
6.3 Cost Factors    J
6.4 Experimental Data    J
6.5 Restructuring Setup Time Evaluations    J
6.5.1 Experiment 1: Scalability of Increase in Operations    J
6.5.2 Experiment 2: Operation Scalability    J
6.6 Query Time Evaluations for Restructured Schema    J
6.6.1 Experiment 3: Query Performance    J
6.7 Analyses    J
7 Conclusions    TJ
7.1 Summary of the Rainbow Project    T
7.2 Experience Gained and Lessons Learned    T
7.2.1 Object-Oriented Design    T
7.2.2 UML    T
7.2.3 The Java Programming Language    T
7.2.4 XML    TJ
7.2.5 Database Management Systems    TJ
7.2.6 Software Engineering Experience    TJ
7.2.7 Designing the Test Plan    TJ
7.2.8 Working as a Team    TJ
7.3 Future Work    TJ
References    T
Past Works and Books    TJ
Web-pages    TJ
Appendixes    TJM
Readme for System Environment Setup and Demo    TJM
Abstract
The use of Extensible Markup Language (XML) documents to model data and exchange data over the web is becoming increasingly prominent and promising. Due to the maturity and performance of existing relational database technology, there is great interest in exploiting this technology as a backend engine to store, manage, and query XML data. It is well known that different relational schemas yield different query performance for a given query load. Hence, one fixed way of mapping XML into relational databases is not sufficient to reach an overall optimized query performance for a given query workload. The Rainbow System proposes a flexible mapping approach that first loads the XML data into a relational database system and then applies relational restructuring techniques, with the help of SQL queries and database views, to the loaded data and schema. A key ingredient of the Rainbow solution is the management, in relational format, of metadata describing both the XML structure information (DTD) and the chosen mapping. In this project, we designed, implemented, and performed a preliminary evaluation of this flexible mapping component of the Rainbow system.
Table of Contents
1 Introduction
  1.1 Motivation
  1.2 Our Approach
  1.3 The Rainbow System
  1.4 MQP Project Goals
  1.5 Additional Team Goals
  1.6 Outline of the Remaining Sections
2 Background
  2.1 Readings
  2.2 Basics of XML and DTDs
  2.3 Technologies
    2.3.1 SQL, Relational Databases, and Oracle 8i
    2.3.2 JDBC and ResultSet Classes
3 DTD Metadata Management
  3.1 Metadata Tables
  3.2 Data Schema
  3.3 DTD Manager and XML Manager Extensions
    3.3.1 Original DTD Manager and XML Manager
    3.3.2 Support for Multiple DTDs and XMLs
4 Flexible Mapping Support in the Rainbow System
  4.1 Restructuring Subsystem
    4.1.1 The Restructuring Functionality
    4.1.2 A Prototype Design
    4.1.3 Implementation Details
  4.2 Restructuring Operators
    4.2.1 Pushup and Pushdown Attribute Operators
    4.2.2 Rename Item and Attribute Operators
    4.2.3 Pushup and Pushdown Nesting Operators
    4.2.4 Other Operators
  4.3 Rainbow Graphical User Interface
5 Implementation Details
  5.1 System Architecture
  5.2 Code Facts
  5.3 Existing System Packages
  5.4 Implementation Environment
    6.1.1 Scope and Design of a Test Plan
    6.1.2 Designing an Experimental Test Bed
  7.1 Summary of the Rainbow Project
  7.2 Experience Gained and Lessons Learned
    7.2.1 Object-Oriented Design
    7.2.2 UML
    7.2.3 The Java Programming Language
    7.2.4 XML
    7.2.5 Database Management Systems
    7.2.6 Software Engineering Experience
    7.2.7 Designing the Test Plan
    7.2.8 Working as a Team
  Past Works and Books
  Web-pages
Appendixes
  Readme for System Environment Setup and Demo
List of Illustrations
Figure 1: Proposed Rainbow Architecture
Figure 2: Examples of XML Elements
Figure 3: XML Content Definitions
Figure 4: XML/DTD Example Documents
Figure 5: Algorithm of Mapping DTD into Relational Schema
Figure 6: DTD Manager and XML Manager
Figure 7: Restructure Function
Figure 8: Restructure Subsystem
Figure 9: Restructuring Subsystem Class Diagram
Figure 10: Pushup Attribute Operator
Figure 11: Pushup and Pushdown Attribute
Figure 12: Pushup and Pushdown Nesting
Figure 13: Screenshot 1
Figure 14: Screenshot 2
Figure 15: Screenshot 3
Figure 16: Rainbow Architecture with RDBMS
Figure 17: Statistics of Class Implementation
Figure 18: Experiment DTD
Figure 19: Batch versus Serial Restructuring
Figure 20: Restructuring Overhead Results
Figure 21: Join Query Performance Results
List of Tables
Table 1: Student Address Relation
Table 2: Relation Resulting from a Query Evaluation
Table 3: Item DTDM (DTDM-Item table)
Table 4: Nesting DTDM (DTDM-Nesting table)
Table 5: Attribute DTDM (DTDM-Attribute table)
Table 6: DTDMS for Figure 2’s DTD
Table 7: Data Relations for Figure 2’s XML Document
Table 8: Parameters of Restructuring Evaluations
1 Introduction
1.1 Motivation
The use of Extensible Markup Language (XML) documents to store information
is becoming increasingly prominent and promising [14]. XML’s main strength of
organizing information using a human-readable and machine-interpretable file format
makes it ideal for exchanging data between different systems. Unlike Hypertext Markup
Language (HTML), which stores information about the physical presentation of a web page,
XML represents information about the meaning of the data itself by means of appropriate tags [9].
An element of an XML document refers to a defined tag [9]. See Figure 2 for an
example of an XML document. The nesting of elements represents the logical hierarchy
among the elements. For this reason, information can be extracted more easily out of an
XML document in response to a user request.
Efforts by industry groups to specify standard structures for XML documents in
the form of Document Type Definitions (DTDs), or more recently in the form of XML
Schema, facilitate the exchange of XML documents among enterprises [14]. By enabling
automatic data flow among businesses, XML is pushing the world into the electronic
commerce era. Collecting, analyzing, mining, and managing XML data will hence
become tremendously important tasks for future web-based applications [2]. An
XML-bound system is required to store, retrieve, update, and query XML documents.
One prominent method for such a system is to store XML documents in
relational databases. Relational Database Management Systems (RDBMSs)
handle data storage, querying, concurrency, and other features. Many database
vendors such as Microsoft, IBM, Informix, and Oracle have started to support XML in
their own RDBMS products. The benefits of managing XML in a relational
database are manifold, including the availability of mature database tools, efficient
query and analysis tools, and easy integration with existing business databases.
However, there are open issues to be resolved concerning XML. These issues include
the mapping between XML and the relational model, XML update propagation, and XML
query translation and optimization. This MQP mainly focuses on solving the issue of
mapping between XML and the relational model.
1.2 Our Approach
Zhang et al. [2] propose a metadata-driven approach that addresses the issue of
flexible mapping. The proposed approach can generate a relational schema from a DTD
and store the XML data compliant with that DTD in relational tables that can then be
queried with Structured Query Language (SQL) queries. This metadata-driven approach
includes the loading of a DTD into relational metatables, construction of a relational
schema called Metadata Tables (DTDMs), restructuring of the DTDMs for efficient
querying purposes, automatic construction of a relational schema for the XML documents
that conform to this DTD, and loading of the XML data into the prepared relational
database schema.
As an additional feature, the approach of Zhang et al. [3] also handles updates to
the XML documents. The information in the database is kept consistent with the
external sources, which hold the up-to-date XML pages, by a synchronization
process that utilizes the DTDMs. The initial store of
the XML data utilizes a fixed mapping that retains the hierarchical semantics of the
loaded DTD. However, once the XML data is stored, a restructuring process may be called
upon to modify the DTDM schemas and the XML data, because different mappings yield
different query processing performance [8].
To reap the benefits of the metadata-driven approach, the data contained within
an XML document has to be accessed efficiently. This approach should allow for easy
retrieval and modification of the XML data in a database system by means of
SQL. Such an approach is more beneficial than requiring a programmer to
traverse the XML and DTD documents by means of specialized miniature
parsing programs. It would not only save development time and money through code reuse,
but would also reduce the possibility of errors arising from additional programming.
1.3 The Rainbow System
Much of the information presented here is extracted from [8]. To keep track of
XML updates and provide optimal query performance, the metadata-driven system,
hereafter referred to as Rainbow, is composed of a DTD manager, a basic storage manager,
a schema creator, a restructurer, an XML query engine, and an XML schema, as depicted in
Figure 1. A working system was already in place and included some of the components
of this figure.
Figure 1: Proposed Rainbow Architecture
The DTD Manager will load DTD documents into our system by storing them in
DTDMs as part of the system dictionary tables. DTDMs model the DTD as a collection
of items, attributes and nesting relationships. After the DTDMs repository is loaded, the
schema creator will infer a relational schema from the DTDMs repository.
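The breakdown of a DTD into items, attributes, and nesting relationships can be sketched as follows. This is only a minimal illustration in Python, not the Rainbow DTD Manager itself: the function name load_dtd, the dictionary keys, and the isbn attribute are hypothetical, and the regular expressions assume flat, comma-separated content models without choices or nested groups.

```python
import re

# Small sample DTD; the ATTLIST line and its isbn attribute are hypothetical.
DTD = """
<!ELEMENT prices (book*)>
<!ELEMENT book (title, source, price)>
<!ATTLIST book isbn CDATA #IMPLIED>
<!ELEMENT title (#PCDATA)>
<!ELEMENT source (#PCDATA)>
<!ELEMENT price (#PCDATA)>
"""

ELEMENT_RE = re.compile(r"<!ELEMENT\s+(\w+)\s+\(([^)]*)\)\s*>")
ATTLIST_RE = re.compile(r"<!ATTLIST\s+(\w+)\s+(\w+)\s+(\w+)\s+([^>]+?)\s*>")


def load_dtd(dtd_text):
    """Split a simple DTD into item, nesting, and attribute rows."""
    items, nestings, attributes = [], [], []
    for name, content in ELEMENT_RE.findall(dtd_text):
        # Every declared element type becomes one item row.
        items.append({"name": name, "has_text": "#PCDATA" in content})
        for child in content.split(","):
            child = child.strip()
            if not child or child == "#PCDATA":
                continue
            # A trailing *, + or ? records how often the child may occur.
            occurrence = child[-1] if child[-1] in "*+?" else "1"
            nestings.append({
                "parent": name,
                "child": child.rstrip("*+?"),
                "occurrence": occurrence,
            })
    for element, attr, attr_type, default in ATTLIST_RE.findall(dtd_text):
        attributes.append({
            "item": element, "name": attr,
            "type": attr_type, "default": default,
        })
    return items, nestings, attributes


items, nestings, attributes = load_dtd(DTD)
```

Each of the three lists corresponds to one kind of metadata row; in a relational setting, each list would be written to its own table before a schema is inferred from it.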
The basic storage manager maintains XML documents with the help of three
modules: an importer, an exporter, and a synchronizer. The importer imports XML documents
compliant with a previously specified DTD into our system. The exporter will export the
relational data into XML documents. The synchronizer is used to keep the internal
relational representation and external XML representation consistent with each other
under data updates.
The restructure operator library stores a collection of restructuring operators for
optimization purposes. An optimizer takes as input a given XML query load specified by a
database administrator (DBA) and the DTDMs, which model the current structure of
the relational database. It generates a mapping by applying the restructuring
operators provided by the restructuring operator library. A mapping specifies
a sequence of restructuring operators to be applied to the different element
types defined in that DTD. Then, the restructuring manager actually transforms the
initially loaded data into the desired optimized format. The latter is to be utilized for
efficient query purposes.
The end user can issue XML queries through the XML Query Engine subsystem.
The Query Translator, based on the mapping provided by the Optimizer, will translate the
XML query into a sequence of SQL queries. Then the relational query engine of the
RDBMS will execute the SQL queries and return the corresponding relational query
result. The query result translator will translate the query result back into the XML
model and return it to the end user.
The Rainbow architecture was partially implemented when the project team
started working on the development of its components. The DTD Manager and Basic
Storage Manager were capable of loading a single DTD and a single XML document.
The Basic Storage Manager, hereafter referred to as the XML Manager, did not have a
synchronizer process to maintain the integrity of the internal data. Instead, the
synchronizer, called Clock, was a separate component developed at WPI, but not yet
integrated. A set of restructuring operators had been researched and designed for the
Restructuring Operator Library [8], but no implementations of the Optimizer or
Restructuring Manager were in place. Lastly, the XML Query Engine remained at the
conceptual stage and had yet to be realized.
1.4 MQP Project Goals
The scope of this MQP is to continue the necessary development of the remaining
subsystems in addition to extensions of the existing ones. With the benefit of an
extended schedule, the project team was able to pursue the extensions of the DTD
Manager and XML Manager and the design and development of a prototype of the
Restructuring Manager. With the completion of these tasks, the Rainbow system is now
able to store multiple DTDs and XML documents, and to restructure the initial fixed
mapping of the XML data utilizing an administrator-specified mapping. To evaluate the
system we developed, the project team designed a test bed and an experimental outline and
performed experimental studies on the working system.
1.5 Additional Team Goals
An additional goal of this MQP is for the project team members to learn and
develop competency in database technology, XML, SQL, and Java programming
for RDBMSs. With respect to developing the subsystems of Rainbow, the
team aimed to learn how to maintain and extend existing software, and how to engineer
a complete software system from the design phase through to experimental
evaluation. Reuse of previous code that went beyond simple extension of functionality
became essential, both for the team’s quicker adaptation to several of the needed
technologies mentioned and for completing the development of several subsystems within
the time constraints of the project.
By the conclusion of this project, members of the team had not only developed the
software engineering skills necessary to succeed in the field of Computer Science, but
each individual had also learned about team dynamics. The project members had to
work closely to ensure that the separable tasks led to the development of compliant parts
as well as to show progress to the project advisor. To guarantee deliverables in a timely
manner, the team learned about the presentation and communication skills pertinent to a
manageable work schedule.
1.6 Outline of the Remaining Sections
This project report has the following structure. In the following section, we
describe the background technologies and tools that one needs in order to understand
the remaining sections of this paper. Section 3 describes the metadata model for the
Rainbow architecture, how it is used to load XML data, and the extensions to the
existing subsystems, namely the XML Manager and DTD Manager. Section 4 details the
implementation of the Restructuring subsystem, including what a restructuring
operator is and the list of operators that are implemented. Section 5 presents
implementation details such as the system architecture and implementation environment.
Section 6 details experiments conducted to evaluate the Restructuring subsystem.
Finally, a summary and discussion of future work in Section 7 concludes the report.
2 Background
2.1 Readings
The team made extensive use of chapters two and nine of Database Management
Systems [1] by Ramakrishnan, and of a significant number of
documents contributed by graduate students and the professor for the purpose of this
project, including: “Metadata-Driven Approach to Integrating XML and Relational Data”
[2], “Clock: Synchronizing Internal Relational Storage with External XML Document”
[3], “Incremental Maintenance of Virtual XML Repository” [5], “ISP-EAR555: XML
Relational Management” [4], “A Performance Evaluation of Alternative Mapping
Schemes for Storing XML Data in a Relational Database” [6], and “DyDa: Dynamic
Data Warehousing” [7]. Since the Relational Database Management System (RDBMS)
that hosted the information for the team project was Oracle 8i running on a Microsoft NT
Server PC, the team learned the skills essential to manipulate information in and navigate
through that system. The project team also acquired background knowledge in design and
programming techniques, including the use of Java and its Javadoc documentation
standard, XML, RDBMSs, and SQL.
2.2 Basics of XML and DTDs
XML is a markup language that allows a document to contain structured
information. A markup language is a mechanism to identify structures in a document.
The XML specification defines a standard way to add markup to documents. The content
of these documents may include descriptions, pictures, headings, etc. XML documents
also hold information about each type of content. Similar to an HTML document, an
XML document contains tags that specify these types of content. In HTML documents,
both the tag semantics and the tag set are fixed. Even with efforts by industry to improve
the flexibility of HTML, any changes are always strictly confined by what the browser
vendors have implemented and by the fact that backward compatibility is paramount.
XML, on the other hand, specifies neither semantics nor a tag set. While HTML
specifies how a document should be displayed, it does not describe what kind of
information the document contains. XML allows document authors to organize
information in a flexible way. In fact, XML is really a meta-language for describing
markup languages. In other words, XML provides a facility to define tags and the
structural relationships between them. Since there is no predefined tag set, there cannot
be any preconceived semantics. All of the semantics of an XML document will be
defined either by the applications that process it or by style sheets.
Many applications of XML are Internet-related, but XML is in no way limited to
Internet use. In fact, XML's main strength is organizing information, which makes it well
suited for exchanging data between different systems, regardless of whether the Internet
is part of the picture.
To process XML, a program called an XML parser is needed. The parser reads
an XML document and makes its structure available to an application; a browser can then
display the document in a user-friendly way based on a stylesheet. Both
Microsoft and Netscape are working to add XML parsing capabilities to their browsers.
XML can benefit e-commerce by enabling back-end systems to communicate
business transaction information in a known format. For example, business partners can
standardize on specific XML syntax they use to describe purchase orders and can then
automate the transfer of that information across otherwise incompatible systems.
An example of XML is given in the following figure:
Description: Empty element with attributes
Example:
    <ELEMENT ATTR1="value" ATTR2="value"/>
Description: Element with content and end tag
Example:
    <ELEMENT>Element Content Here</ELEMENT>
Description: Parent element with attributes and child elements
Example:
    <PARENT ATTR1="value">
        <CHILD1>
            Content
        </CHILD1>
        <CHILD2 ATTR1="value"/>
    </PARENT>
Figure 2: Examples of XML Elements
The allowable contents of an element type are EMPTY, ANY, Mixed, or children
element types [16].
Allowable Contents: Definition
EMPTY: Refers to tags that are empty.
ANY: Refers to anything at all, as long as XML rules are followed. ANY is useful to use when you have yet to decide the allowable contents of the element.
Children elements: You can place any number of element types within another element type. These are called children elements, and the elements they are placed in are called parent elements.
Mixed content: Refers to a combination of (#PCDATA) and children elements. PCDATA stands for parsed character data, that is, text that is not markup. Therefore, an element that has the allowable content (#PCDATA) may not contain any children.
Figure 3: XML Content Definitions
For simplification purposes, we assume that the XML documents that this
particular project handles receive their tag definitions through one standalone external DTD.
Therefore, the tags contained within each XML document are defined in a separate DTD.
A DTD holds definitions for tag elements, the nesting relationships of these elements, as well
as the attributes of these elements and other relations of the data types. To reiterate, DTDs
are defined by industry groups to specify the standard schema of XML documents in
order to facilitate their exchange. Therefore, limiting the project scope to XML documents
that are compliant with a DTD is reasonable.
An example of a DTD and an XML document [14]:
DTD:
<!ELEMENT prices (book*)>
<!ELEMENT book (title, source, price)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT source (#PCDATA)>
<!ELEMENT price (#PCDATA)>
Compliant XML:
<prices>
  <book>
    <title>Advanced Programming in the Unix environment</title>
    <source>www.amazon.com</source>
    <price>65.95</price>
  </book>
  <book>
    <title> TCP/IP Illustrated </title>
    <source>www.amazon.com</source>
    <price>65.95</price>
  </book>
</prices>
Figure 4: XML/DTD Example Documents
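To illustrate how the nesting in such a document lends itself to tabular storage, the sketch below reads the compliant XML of Figure 4 and produces one flat record per book element. It is a minimal illustration written in Python with the standard xml.etree.ElementTree parser, not part of the Rainbow system; the function name books is hypothetical, and the flattening shown is only one possible mapping.

```python
import xml.etree.ElementTree as ET

# The compliant XML document from Figure 4.
XML_DOC = """<prices>
  <book>
    <title>Advanced Programming in the Unix environment</title>
    <source>www.amazon.com</source>
    <price>65.95</price>
  </book>
  <book>
    <title> TCP/IP Illustrated </title>
    <source>www.amazon.com</source>
    <price>65.95</price>
  </book>
</prices>"""


def books(xml_text):
    """Return one dictionary per <book>, keyed by child element name."""
    root = ET.fromstring(xml_text)
    rows = []
    for book in root.findall("book"):
        # Each child element (title, source, price) becomes one column.
        rows.append({child.tag: child.text.strip() for child in book})
    return rows


rows = books(XML_DOC)
for row in rows:
    print(row["title"], row["source"], row["price"])
```

Because the DTD fixes the children of book as title, source, and price, every record has the same three fields, which is what makes a relational table a natural target.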
2.3 Technologies
2.3.1 SQL, Relational Databases, and Oracle 8i
SQL is a query language that allows users to access data in an RDBMS (Ullman,
1997). Commercial RDBMS products from corporations such as Oracle, Sybase,
Informix, Microsoft, and others allow a user to describe the data of interest that the user
wishes to receive through support of standard SQL. SQL provides these services by
allowing users to define relations, manipulate relations, and query them. These relations
are simple tables that each have a schema, and may or may not be interconnected by
various constraints and keys. The collection of the schemas
of all the relations of concern is referred to as a relational schema. The execution
of a SQL query against the relational database returns a relation whose
schema is specified by the query.
Most of our information about relational databases came from [DMS]. We will
give a brief overview of how to access and manipulate data in SQL. The main objective
of this overview is to show how SQL, run against an RDBMS, serves the
purpose of this project: to effectively manage XML documents in a relational database.
In a relational database, data is stored in tables. The following table relates Social
Security Number, Name, and Address:
StudentAddressTable
SSN        FirstName  LastName  Address             City        State
124368537  John       Lee       100 Institute Road  Jackson     Nebraska
339152314  Tien       Vu        23 Grover Street    Louisville  Louisiana
736192613  Jane       Doe       34 Main Street      New York    New York
Table 1: Student-Address Relation
To see the address of each student, you could use the SELECT statement:
SELECT FirstName, LastName, Address, City, State FROM
StudentAddressTable;
Table 2 contains the result of the above query against the relation in Table 1.
FirstName  LastName  Address             City        State
John       Lee       100 Institute Road  Jackson     Nebraska
Tien       Vu        23 Grover Street    Louisville  Louisiana
Jane       Doe       34 Main Street      New York    New York
Table 2: Relation Resulting from a Query Evaluation
Let us look at what just happened in detail. The query asked for every row of
StudentAddressTable, restricted to the columns called FirstName, LastName,
Address, City, and State. Note that all query statements end with a semicolon and that
table names and column names do not contain spaces. The general template of a
SELECT statement retrieving all of the rows in the table is:
SELECT ColumnName, ColumnName, ... FROM TableName;
To get all columns of a table without typing all column names, use * as in:
SELECT * FROM TableName;
The SELECT statement can be written in a great number of ways, giving
wide access to the data contained in the tables. SQL also supports conditional
expressions (e.g., querying data greater or less than certain amounts). More complex
conditions may be combined with the typical logical operators AND, OR, and
NOT. SQL uses the keyword DISTINCT to eliminate duplicate rows from a query result,
so that each combination of selected values (name, address, number, etc.) appears only
once. There may be nested queries,
objects, joins of tables, and more advanced SQL syntax providing functionality that goes
beyond what is needed for the scope of the project.
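The SELECT forms described above can be tried out with the following sketch. It is a minimal illustration that assumes an in-memory SQLite database (via Python's standard sqlite3 module) as a stand-in for Oracle 8i; the fourth student row is hypothetical and was added only so that DISTINCT has a duplicate State value to remove.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE StudentAddressTable (
           SSN TEXT PRIMARY KEY, FirstName TEXT, LastName TEXT,
           Address TEXT, City TEXT, State TEXT)"""
)
conn.executemany(
    "INSERT INTO StudentAddressTable VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("124368537", "John", "Lee", "100 Institute Road", "Jackson", "Nebraska"),
        ("339152314", "Tien", "Vu", "23 Grover Street", "Louisville", "Louisiana"),
        ("736192613", "Jane", "Doe", "34 Main Street", "New York", "New York"),
        # Hypothetical extra row, not part of Table 1.
        ("000000000", "Ann", "Example", "1 Sample Way", "New York", "New York"),
    ],
)

# Plain projection: every row, five of the six columns.
all_rows = conn.execute(
    "SELECT FirstName, LastName, Address, City, State FROM StudentAddressTable;"
).fetchall()

# Conditions combined with a logical operator.
two_states = conn.execute(
    "SELECT FirstName, LastName FROM StudentAddressTable "
    "WHERE State = 'Nebraska' OR State = 'Louisiana';"
).fetchall()

# DISTINCT removes the duplicate 'New York' value.
distinct_states = conn.execute(
    "SELECT DISTINCT State FROM StudentAddressTable;"
).fetchall()

conn.close()
```

Without an ORDER BY clause, SQL makes no promise about the order of the returned rows, so code that depends on order should sort explicitly.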
2.3.2 JDBC and ResultSet Classes
Due to the complexity of this system, it had to be implemented in a high-level
computer language. The language had to be object-oriented for the purpose of extending
existing classes, and needed to be able to make calls to databases quickly and easily. The
language we chose was Java 1.2 due to its flexibility and its extensive use of strict
object-oriented principles such as inheritance, encapsulation, and polymorphism (Horton,
1997). Another feature that was convenient for the Rainbow system was its ability to
make calls to databases quickly and easily through the use of Java Database Connectivity
(JDBC). In addition, Java manages to avoid many of the difficulties that can be
experienced when using other programming languages (Horton, 1998). Lastly, it was
more convenient to use Java because all of the existing code that was included in the
Rainbow design had been written in Java.
To ease future work with our code, we used Javadoc comments extensively. Javadoc
comments are structured comments within the code that document it, for example by
naming the author of a class and listing the parameters of a method. This helped the
team read code more easily and quickly and should prove valuable to the members of
future projects concerning Rainbow [7].
To establish a connection to a DBMS, a Java program uses JDBC, the Java API that
defines connection objects (Taylor, 1997). A connection object, once initialized with the
proper login information for a DBMS, allows a Java program to execute queries or
update statements on the database. The Java SQL package also provides ResultSet,
which allows Java programs to traverse a returned relation. It bridges the two languages,
Java and SQL, to overcome the impedance mismatch between them: ResultSet
essentially lets Java handle the tuple structure of the data returned by a DBMS.
Together, JDBC and ResultSet provide all the data retrieval and manipulation
functionality the project team needs to interface Java processes with a DBMS.
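As a minimal sketch of the pattern just described, the following Java fragment runs a SELECT on an open connection and walks the returned relation with a ResultSet. The class and method names (JdbcSketch, fetchStudents) are illustrative assumptions and are not taken from the Rainbow code; the table and column names come from the StudentAddressTable example above. The try-with-resources form requires a newer Java version than the Java 1.2 used in this project.

```java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; not the Rainbow JDBCClient code.
class JdbcSketch {
    // Query against the example table; no trailing semicolon is passed to JDBC.
    static final String QUERY =
        "SELECT FirstName, LastName FROM StudentAddressTable";

    // Runs QUERY on an already-open connection and copies each tuple
    // of the returned relation into a Java list.
    static List<String> fetchStudents(Connection conn) throws SQLException {
        List<String> names = new ArrayList<>();
        try (Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(QUERY)) {
            while (rs.next()) { // advance the cursor to the next tuple
                names.add(rs.getString("FirstName") + " " + rs.getString("LastName"));
            }
        }
        return names;
    }
}
```

A caller would obtain the connection from DriverManager.getConnection(url, user, password), where the URL format depends on the installed database driver.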
2.4 Software Development
2.4.1 Object Oriented (OO) Design
Because the project team strives to continue the development of the Rainbow
system in accordance with its architectural design, the team's main objective in
understanding and developing the project was to translate the architecture presented by
the previous work into an actual system design, in addition to extending the established
initial subsystems.
Once the project team had a firm understanding of the Rainbow system
architecture, the next phase in developing a new subsystem from scratch was
to put it into a concrete design using the Unified Modeling Language (UML). UML is a
common design language that consists of many different diagram types (such as class
diagrams, activity diagrams and sequence diagrams). These diagrams serve as a type of
‘blueprint’ for the entire system, as each gives a different level and type of description of
the system. To utilize the benefits of UML, the team found that it was necessary to
become familiar with the software design tool, Object Domain [15]. Utilizing Object
Domain, the project team designed class diagrams for the Restructure subsystem.
2.4.2 Software Migration
The Rainbow system itself is very extensive, containing a large number of classes
and a large amount of code. The established subsystems provided a great deal of
existing code for the implementation of Rainbow. The project team encountered both
difficulties and advantages in reusing the existing code base. The code base needed
to be examined to determine the portions that were suitable for reuse, the portions that
needed to be modified and enhanced, and the portions that had to be completely
re-implemented because they could not support an extension. To re-engineer the
previous code, the team also had to apply various software engineering skills
obtained from courses, the most important being proper documentation. The team
documented the code it added and also documented any reused code that had been
undocumented or insufficiently documented, once that code was understood.
3 DTD Metadata Management
This section presents the details of the original metadata model that enables
flexible mapping as proposed by Zhang et al. [14]. The system assumes that the
compliant XML documents share exactly one external DTD document, that this file has
no nested DTDs, and that the XML documents contain no internal DTD. The data model
only focuses on XML documents that meet these requirements.
Figure 5: Algorithm of Mapping DTD into Relational Schema
As shown in Figure 5, the system first stores the DTD into metadata tables. Then
it can optionally restructure the metadata tables. At the end it will generate the relational
schema from the metadata. The storing module identifies the characteristics of the DTD
and stores them as metadata. The restructuring module identifies the multi-valued
attributes of the DTD and also identifies the items that could be represented as attributes.
Lastly, mapping a DTD into a relational schema is achieved by applying mapping rules
over the metadata tables storing the DTD.
[Figure 5 labels: DTD, Store, Restructure, Generate Relational Schema, Metadata]
This metadata approach includes the storing stage, the mapping stage, and an
optional restructuring stage. We show how the metadata approach allows the metadata
to be restructured in the restructuring stage in order to provide various relational
schemas. The following subsections explain these stages in more detail along
with a working example of storing a DTD and loading the XML document.
3.1 Metadata Tables
Storing the DTD properties into relational tables makes it practical to use
relational query facilities to query the metadata. The metadata tables keep track of the
mappings to allow the system to automatically load the XML data into the generated
relational schema.
Let’s focus on the details of this metadata driven approach of managing XML
data; an approach that incorporates the loading of a DTD into DTDMs in a relational
database as part of the process for managing XML data. In order to capture all the
necessary information in the DTD, there are three DTDMs, one for each of the three
identified types of pertinent information. The three types of information captured are:
items, nesting, and attributes. The Items relation essentially corresponds to any element
defined as well as groupings of elements. An item represents an element type or group in
a DTD. The Nesting relation captures information regarding the relationships of the
various elements defined in a DTD. Finally, the Attribute relation captures all the
attributes defined for any of the particular elements defined in the DTD. An attribute is a
property of an item. The following tables have been extracted from [3].
In Tables 3 through 5, the schema for each of the three DTDMs is depicted.
Fields   Meaning
ID       Internal ID for items.
Name     Element Type or Group Name.
Type     Defines the type of this item from the domain: PCDATA, ELEMENT.ELEMENT, ELEMENT.EMPTY, ELEMENT.ANY, ELEMENT.MIX, and GROUP.
Table 3: Item DTDM (DTDM-Item table)
The Type field defines the type of an item, or rather the type of the element content
in an element type declaration. ELEMENT.ELEMENT represents element content.
ELEMENT.MIX represents mixed content. ELEMENT.EMPTY represents empty
content. ELEMENT.ANY represents ANY content. There are two additional item types:
PCDATA represents a PCDATA definition, and GROUP represents a group definition.
Fields     Meaning
ID         Internal ID of this nesting relationship.
FromID     ID of parent item of this nesting relationship.
ToID       ID of child item of this nesting relationship.
Ratio      Cardinality between the parent element and child element.
Optional   Used to indicate whether a child element is optional or not.
Index      The schema order of the child element.
Table 4: Nesting DTDM (DTDM-Nesting table)
The two fields FromID and ToID reference a parent item and a child item that
participate in a nesting relationship. The Index field captures the Schema Ordering
Property, denoting the position of this child item in the parent item’s definition. In a
sequence group, each child item has a different index value. If all children belong to a
choice group, all the index fields have the same value.
The occurrence property for a child element is captured by a combination of the
Ratio and Optional fields. The Ratio field shows the cardinality between the instances of
the parent item and of the child item. Since the nesting relationships are always from one
element type to its sub-elements in the DTD, there are only one-to-one or one-to-many
nesting relationships in the Ratio field. The Optional field has the value true or false
depending on whether this relationship is defined as optional in the DTD.
Fields    Meaning
ID        Internal ID for this attribute.
PID       ID of parent item.
Name      Name of this attribute.
Type      Type of this attribute, e.g., ID, IDREFS.
Default   A keyword or a default literal value of this attribute, e.g., #IMPLIED.
Table 5: Attribute DTDM (DTDM-Attribute table)
To better understand how a DTD document is mapped into each of the described
DTDMs, let us revisit the DTD document example given in Figure 2.
DTD:
<!ELEMENT prices (book*)>
<!ELEMENT book (title, source, price)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT source (#PCDATA)>
<!ELEMENT price (#PCDATA)>
This DTD document will be loaded into the three relations as shown in Table 6.
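To make the three schemas more concrete, the sketch below mirrors Tables 3 through 5 as plain Java value classes and stores the element declarations of the prices DTD in them. It is an illustration under stated assumptions: the class names, the in-memory lists, the generated ID values, the "1:1"/"1:N" encoding of Ratio, and the type chosen for the (#PCDATA)-only elements are all hypothetical and do not reproduce the tuples of Table 6. The PCDATA item and its nestings are left out for brevity.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative in-memory mirror of the three DTDM relations.
// Class names, field types, and ID values are assumptions.
class DtdmSketch {
    static class Item {                        // Table 3: ID, Name, Type
        final int id; final String name; final String type;
        Item(int id, String name, String type) {
            this.id = id; this.name = name; this.type = type;
        }
    }

    static class Nesting {                     // Table 4
        final int id, fromId, toId, index;
        final String ratio;                    // assumed encoding: "1:1" or "1:N"
        final boolean optional;
        Nesting(int id, int fromId, int toId, String ratio, boolean optional, int index) {
            this.id = id; this.fromId = fromId; this.toId = toId;
            this.ratio = ratio; this.optional = optional; this.index = index;
        }
    }

    static class Attribute {                   // Table 5
        final int id, pid; final String name, type, defaultValue;
        Attribute(int id, int pid, String name, String type, String defaultValue) {
            this.id = id; this.pid = pid; this.name = name;
            this.type = type; this.defaultValue = defaultValue;
        }
    }

    final List<Item> items = new ArrayList<>();
    final List<Nesting> nestings = new ArrayList<>();
    final List<Attribute> attributes = new ArrayList<>();
    private int nextId = 1;                    // hypothetical ID generator

    int addItem(String name, String type) {
        int id = nextId++;
        items.add(new Item(id, name, type));
        return id;
    }

    void addNesting(int fromId, int toId, String ratio, boolean optional, int index) {
        nestings.add(new Nesting(nextId++, fromId, toId, ratio, optional, index));
    }

    // Stores the element declarations of the prices DTD.
    static DtdmSketch pricesExample() {
        DtdmSketch d = new DtdmSketch();
        int prices = d.addItem("prices", "ELEMENT.ELEMENT");
        int book = d.addItem("book", "ELEMENT.ELEMENT");
        d.addNesting(prices, book, "1:N", true, 1);          // book* : zero or more
        String[] children = {"title", "source", "price"};
        for (int i = 0; i < children.length; i++) {
            // The type used for (#PCDATA)-only elements is an assumption.
            int child = d.addItem(children[i], "ELEMENT.MIX");
            d.addNesting(book, child, "1:1", false, i + 1);  // sequence: distinct indices
        }
        return d;
    }
}
```

Because the DTD declares no attributes, the attribute list stays empty in this sketch.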
[Table 6 shows the DTDM-Item, DTDM-Nesting, and DTDM-Attribute relations; the DTDM-Attribute relation has the columns ID, PID, Name, Type, and Default.]
Table 6: DTDMs for Figure 2’s DTD
The five elements, namely, prices, book, title, source, and price get stored as
tuples in the DTDM-Item relation. The relationships between these elements are stored
as tuples in the DTDM-Nesting relation. For example: the one-to-many relationship
between element prices and element book is recorded in the tuple with ID equal to 7 within
the DTDM-Nesting relation. Lastly, the attributes are stored in the DTDM-Attribute
relation. The three elements, namely, title, source, and price each have PCDATA, so
their relationship with a PCDATA item is stored in DTDM-Nesting tuples with IDs 11,
12, and 13. The PCDATA information is stored in the Name field of tuple 14 in the
team designed a Restructurer class where its running process will take as input a
mapping. Such a mapping object contains a series of restructuring operations to be
performed on the XML data mapped by the XML manager in conjunction with the DTD
manager. The other input for the process is the Restructuring Operator Library. The
contents of this library will be discussed later in this section. The library essentially
contains the SQL templates for manipulating the XML data mapped in the RDBMS.
The Restructurer process will read the restructuring operations needed from the
mapping object and then call the corresponding restructuring operators of the
Restructuring Operator Library to perform the necessary restructuring.
Figure 8 shows the Restructuring subsystem breakdown into its components.
Figure 8: Restructuring Subsystem
This Restructuring subsystem does not incorporate the Optimizer process that
takes as input a query load and intelligently generates a flexible mapping that best
optimizes the query performance for that load utilizing information from the mapping,
DTDMs, and the Restructuring Library. This subsystem is instead a simplified version
that assumes the administrator decides upon a good mapping for the XML data and then
calls the Restructurer process to perform restructuring with the mapping object as input.
[Figure 8 labels: Restructuring subsystem containing Mapping, Restructuring Operator Library, and Restructurer; legend: Subsystem, Data, Process]
4.1.3 Implementation Details
The implementation of the Restructuring Subsystem follows the UML design that
was created first. Figure 9 shows the Restructuring Subsystem broken down into
classes in UML. Mapping is an object that holds all the restructuring operations. The
OperatorInterface class is a template for all operators to follow. All operators that
implement this OperatorInterface must provide code for the public method Execute().
The 11 operator classes in this figure correspond to the 11 operators that are defined later
in this section. Lastly, the Restructurer class contains a Java Vector container that it
initializes with the public method readOperators(), given the operations specified by the
Mapping input file. Its public method runOperators() calls the method Execute() of
each operator in the Vector container.
Figure 9: Restructuring Subsystem Class Diagram
[Figure 9 contents:
Mapping: Op1, Op2, ...
Restructurer: Vector operators (contains list of operators); private ReadOperators(File inp) reads operators from the input file and stores them in vector format; public runOperators() matches each operator to the corresponding method and runs Execute() for that method with the appropriate arguments.
OperatorInterface: <virtual> Execute()
Operator: String OperatorName; String Parameters[ ]
Operator classes, each with public Execute(): RenameAttribute, PushDownAttribute, PushUpAttribute, RenameItem, Dereference, Reference, SplitNesting, MergeNesting, PushDownNesting, PushUpNesting, SwitchNesting]
After having broken down the components necessary for this subsystem, namely,
the Mapping object, the Restructuring Operator Library object, and the Restructurer
process, the Restructurer process was the first component to be developed.
The Restructurer process has to read from the Mapping component, so its first
task is to parse an input file. This input file is essentially the Mapping component. It
contains a series of operators with specified arguments of type item, attribute, or nesting,
selected by a user to yield a mapping that may be beneficial for particular
kinds of queries. Once these operators are instantiated with the specified arguments, the
project team refers to them as operations. The Restructurer
process parses the series of operations, stores them locally, and instantiates the
Restructuring Operator Library operator classes into the mapping object. Once the entire
series of operations has been parsed and the individual operators of the library have been
instantiated, the Restructurer process calls these operators to execute one by one.
Executing an operator runs its instantiated query templates, thereby changing both the
DTDMs and the XML mapping.
The Restructuring Operator Library is a set of restructuring operator classes. The
library is built on an operator interface that describes the functionality
each operator must provide when called by the Restructurer process. Each operator
must contain a method for instantiation and a method for execution of the instantiated
SQL template. The SQL templates are defined within the operator classes and are
described in detail later in this section. Once the templates are instantiated, they are
stored in the local process space of the running operator class. When the Restructurer
process calls the operators to execute, they run the instantiated SQL templates, which by
then are complete SQL statements, to perform the restructuring. The execution of a
series of these operators generates the mapping that was specified by the user.
To illustrate how the classes in Figure 9 work together, let’s observe an example.
If Mapping contains the operation “fooOperator(arg1, arg2, arg3)”, the Restructurer class
adds an instance of fooOperator with the arguments arg1, arg2, and arg3 to the Vector
container when the method readOperators() is called by the Restructurer. When the
method runOperators() is called, the Restructurer class calls the method Execute() for
each object in the Vector container. In this example, the only object will be an instance
of fooOperator and calling its method Execute() will evaluate the code inside the
fooOperator class. The code in the fooOperator class utilizes SQL queries which do the
actual updates to the DTDMs and the restructuring of the XML data that is mapped.
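The interplay just described can be sketched in Java as follows. This is a simplified illustration, not the Rainbow implementation: the class name RestructurerSketch, the log list, and the choice to pass the mapping as a list of text lines (rather than a file) are assumptions, and the hypothetical FooOperator only records its call instead of issuing SQL. The method name Execute() follows the spelling used in this report.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Vector;

// Illustrative sketch of the classes in Figure 9; bodies are assumptions.
class RestructurerSketch {
    // Template that every operator follows.
    interface OperatorInterface {
        void Execute();
    }

    // Hypothetical operator: records what it would do instead of running SQL.
    static class FooOperator implements OperatorInterface {
        private final String[] args;
        private final List<String> log;
        FooOperator(String[] args, List<String> log) {
            this.args = args;
            this.log = log;
        }
        public void Execute() {
            log.add("fooOperator(" + String.join(", ", args) + ")");
        }
    }

    private final Vector<OperatorInterface> operators = new Vector<>();
    final List<String> log = new ArrayList<>();

    // Parses lines such as "fooOperator(arg1, arg2, arg3)" and stores
    // one operator instance per line in the Vector container.
    void readOperators(List<String> mappingLines) {
        for (String line : mappingLines) {
            int open = line.indexOf('(');
            int close = line.lastIndexOf(')');
            String name = line.substring(0, open).trim();
            String[] args = line.substring(open + 1, close).trim().split("\\s*,\\s*");
            if (name.equals("fooOperator")) {
                operators.add(new FooOperator(args, log));
            } else {
                throw new IllegalArgumentException("Unknown operator: " + name);
            }
        }
    }

    // Calls Execute() on each stored operator, in order.
    void runOperators() {
        for (OperatorInterface op : operators) {
            op.Execute();
        }
    }
}
```

Keeping the operators behind one interface means the Restructurer does not need to know which SQL template a particular operator runs; adding an operator only requires a new class and a new branch in the parsing step.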
4.2 Restructuring Operators
To support the restructuring functionalities of the Rainbow System to achieve
flexible mapping, we have developed a set of restructuring operators implemented by
view technology. The restructuring operators will restructure the relational data set into
another relational format optimized for query evaluation.
So far, there are 11 restructuring operators defined in the Restructuring Operator
Library. The library stores a collection of reversible restructuring operators for
optimization purposes. Reversible means that the restructuring operators keep track of
their changes, so that the original data is easy to restore. An optimizer takes as input a
given XML query load specified by a database administrator and the DTDM tables,
which model the current structure of the relational database. It generates a mapping
by applying the restructuring operators provided by the Restructuring Operator Library.
A mapping specifies a sequence of restructuring operators to be applied
to the different element types defined in that DTD. Then, the restructuring manager
actually transforms the initially loaded data into the desired optimized format. The latter
is to be utilized for efficient query processing [8].
Reversible restructuring operators include Rename Item, Rename Attribute,
CREATE VIEW <new.ChildItemName> AS
SELECT c.<all-columns>, <ParentAttributeName> as <ChildAttributeName>
FROM <old.ParentItemName> p, <old.ChildItemName> c
WHERE p.iid = c.pid
4.2.2 Rename Item and Attribute Operators
Rename item and rename attribute will rename an item and an attribute
respectively. They can easily be implemented using the DTDM primitives. Here are the
SQL templates:
renameItem(OldItemName, NewItemName):
CREATE VIEW <new.NewItemName> AS
SELECT *
FROM <old.OldItemName>;
renameAttribute (ParentItemName, OldAttributeName, NewAttributeName)
CREATE VIEW <new.ParentItemName> AS
SELECT <OldAttributeName> as <NewAttributeName>, <rest-of-columns>
FROM <old.ParentItemName>;
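As a rough illustration of how such templates might be instantiated, the Java sketch below fills in the two rename templates by string substitution. It is an assumption-laden example rather than the Rainbow operator code: the class name RenameTemplates is hypothetical, the "old." and "new." prefixes are kept as literal text because this section does not say how they are resolved, and the item and attribute names in the usage note are invented. The trailing semicolon is omitted on the assumption that the statement is sent through JDBC.

```java
import java.util.List;

// Illustrative template instantiation; not the Rainbow operator code.
class RenameTemplates {
    // Fills the renameItem template with concrete item names.
    static String renameItem(String oldItemName, String newItemName) {
        return "CREATE VIEW new." + newItemName
             + " AS SELECT * FROM old." + oldItemName;
    }

    // Fills the renameAttribute template. The caller supplies the
    // remaining columns, which the template calls <rest-of-columns>.
    static String renameAttribute(String parentItemName,
                                  String oldAttributeName,
                                  String newAttributeName,
                                  List<String> restOfColumns) {
        StringBuilder cols = new StringBuilder();
        cols.append(oldAttributeName).append(" AS ").append(newAttributeName);
        for (String column : restOfColumns) {
            cols.append(", ").append(column);
        }
        return "CREATE VIEW new." + parentItemName
             + " AS SELECT " + cols
             + " FROM old." + parentItemName;
    }
}
```

For example, renameItem("book", "publication") yields the statement CREATE VIEW new.publication AS SELECT * FROM old.book. Since the names are concatenated directly into SQL, a real implementation would need to validate them first.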
4.2.3 Pushup and Pushdown Nesting Operators
The pushup nesting operator pushes a child item up so that it becomes a sibling of
its parent; conversely, the pushdown nesting operator pushes an item down so that it
becomes a child of its sibling item.
Figure 12: Pushup and Pushdown Nesting
Here is the SQL template:
pushUpNesting (MovedItemName, FromPosition, ChildItemName, ParentPosition, ParentItemName, ToPosition)
Without considering the position, this would correspond to the query given below:
CREATE VIEW new.MovedItemName AS
SELECT m.<all-columns-but-pid>, c.pid
FROM old.MovedItemName m, old.ChildItemName c, old.ParentItemName p
WHERE m.pid = c.iid AND c.pid = p.iid
pushDownNesting (MovedItemName, FromPosition, ChildItemName, ParentPosition, ChildItemName, ToPosition)
Without considering the position, this would correspond to the query given below:
CREATE VIEW <new.MovedItemName> AS
SELECT m.<all-columns-but-pid>, c.pid
FROM <old.MovedItemName> m, <old.ChildItemName> c, <old.ParentItemName> p
WHERE m.pid = p.iid AND c.pid = p.iid
4.2.4 Other Operators
Due to time constraints, we were not able to implement the Switch Nesting, Merge
Nesting, Split Nesting, Reference, and Dereference operators. Switch Nesting was
partially implemented but needs further modification and improvement. Switch Nesting
switches two nesting relationships within the same parent. Merge Nesting merges the
nestings of two items. Split Nesting splits the nesting between two items. Reference
breaks a nesting relationship between two items by assigning an ID attribute to the child
item and adding an IDREF(s) attribute to the parent item, which together are used to
represent that nesting relationship. Dereference creates a nesting relationship between
the items that have the ID and IDREF(s) attributes respectively [8].
4.3 Rainbow Graphical User Interface
The Rainbow Interface allows the administrator to restructure a DTD
and its loaded XML from within a GUI environment by giving access to the functions of
the Rainbow System. The GUI environment eliminates the chore of having to manually
run classes of the Rainbow System. In other words, it gives the administrator a more
convenient way of selecting XML documents for loading, specifying parameters for the
operators, and viewing the tables contained in the database at any time (before or after
the restructuring).
Let us examine the sequence of steps one would take to do a simple restructuring.
The primary step that must be taken before anything else can be done is to establish a
connection with the Oracle Database. Then, an XML document has to be imported into
the database so that it can be restructured. Any imported documents can be viewed in a
table format. In order to do the restructuring, the administrator has to select a sequence of
operators and give each a set of parameters. Once the restructuring is done the
administrator can choose to export the modified data back into a DTD file on the
administrator's local computer.
The following screenshots of the interface give the main idea of its appearance.
(To switch between the various tabs of the Work Window, the administrator only has
to click on the tab corresponding to the appropriate window.) The first screenshot is the
main window of the Rainbow Interface. Its menu bar contains options for importing and
exporting documents, establishing connections, entering manual queries into the
database, etc. Screenshot 2 in Figure 14 shows the Work Window with
the DB Tab selected. The main purpose of this window is to give the administrator
information about what kind of data is currently in the database. It displays all the tables
in the database and the data of each table. Screenshot 3 shows the Work Window
with the Operators Tab selected. In this window the administrator does the restructuring
by selecting the desired operators and inputting the appropriate arguments. The main
window lists all the tables the user requested.
The left column represents the names of the tables. The right column represents (in
order) the ID# of the item, the item name, the item type, and the item DTD id.
Figure 13: Screenshot 1
Main window message field.
The administrator is entering a query manually.
Figure 14: Screenshot 2
The administrator selects which table to view.
The data of the selected table appears here.
Figure 15: Screenshot 3
The administrator selects an operator.
An argument is selected and a value is entered.
All the selected operators appear here.
5 Implementation Details
5.1 System Architecture
Prior to the start of this MQP, the DTD and XML Managers were
implemented to handle only one XML/DTD pair. The project team modified and
extended these modules to support multiple XMLs and their DTDs. The team also
designed and implemented the Restructuring Subsystem. Lastly, the architecture shown
in Figure 16 includes the XML Query Engine, which is outside the scope of this project
and has not yet been implemented.
Figure 16: Rainbow Architecture with RDBMS
[Figure 16 labels: User, XML Query, XML, XML Query Engine, XML Manager, DTD Manager, Restructuring Subsystem, RDBMS, DTD; legend: XML Data, Subsystem]
5.2 Code Facts
The completed Rainbow system totals 44 classes, 17 of which have been coded
from scratch by the Rainbow MQP team. In addition to the creation of 17 new classes,
the Rainbow System takes advantage of existing code, much of which was extended to
support new functionality. Eight classes are preexisting and unchanged, and
nineteen are preexisting but extended. Pie charts of the class facts can be seen in Figure
17.
Figure 17: Statistics of the Class Implementation
5.3 Existing System Packages
The implementation of the Rainbow System is contained in 8 packages. The
DTDMObjects package contains classes that encapsulate the DTDMs into objects with
methods for accessing and modifying the data of each of the DTDM relations. The
exportDTD package contains the classes that provide the functionality of exporting a
DTD from the database. The JDBCClient package contains classes that encapsulate
database connections into easy-to-understand objects utilized by every class that needs
connections to the database. The MetadataDrivenLoader package contains a class that
allows for the generation of unique identifying numbers for relations in a database. The
Operators package contains the operator interface class and all the restructuring operator
classes. The Restructuring package contains the class that encapsulates the XML Catalog
relation in objects for easy accessing and modifications. It also contains the Restructurer
class. The StoreDTD package contains the classes that generate the DTDM schema and
the loading of multiple DTDs into a database. The XMLRDBMSUpdate package
contains the classes that generate the XML data schema and the loading of multiple
XMLs into a database. Two other packages, namely, DTDWrapper and Utils, were used
to facilitate implementation in general.
5.4 Implementation Environment
All class extensions and implementations were programmed in Java 1.2 using
JDK 1.2.2 running on a Digital UNIX 64 terminal on the WPI LAN. The database server
is a PC, PII 300MHz with 256 MB memory, running Microsoft NT Server with Oracle8i
software. The GUI was developed in Visual Café on a PII 400MHz with 128 MB
memory, running Windows 98. It was tested on a Windows NT system and compiled and
ran successfully under various Java development environments (Visual Café, JDeveloper, etc.).
6 Experimental Evaluation
The purpose of the experiments is twofold: first, to evaluate the performance of
loading and restructuring XML data and their DTDs, and second, to evaluate the
performance of queries evaluated against fixed-mapping and restructured data. In
evaluating the outcome of the experiments, one must consider the overhead associated
with loading the data and with building the internal representation of the data in the
RDBMS. When we speak of restructuring data, we refer to one restructuring operator or
a set of restructuring operators applied in sequence. The motivation for using this set of
restructuring operators is the expectation that it will improve query time.
This section is logically divided into two major parts: evaluation of restructuring time
and evaluation of query processing time. These are the two areas in which we hope our
experiments will lead to satisfactory conclusions.
6.1 Experimental Setup
6.1.1 Scope and Design of a Test Plan
The proposal of the Rainbow system is a product of analyses done by many
graduate students and Professor Elke A. Rundensteiner. The main focus when we
designed the test plans was outlined by what the system had to achieve: update
propagation capacity, and query evaluation optimization.
6.1.2 Designing an Experimental Test Bed
After the system was determined to be complete and functional through cycles of
testing and debugging, our goal was to design an evaluation system. The evaluation
system should not only yield conclusive data that outlines the benefits and limitations of
the system in terms of performance versus overhead under varying scenarios; it must also
be reliable. The evaluation system was set up so that it either tolerates un-factored
influences, such as outside processes taking up processor time, or avoids these
unexpected costs. It ran each experiment five times to reduce the effect of un-factored
influences that may distort a particular timing, such as another scheduled, CPU-heavy
process executing at some point during the testing.
Even with precautions taken during the design of such an evaluation system,
experiments had to be performed under the same conditions. By this we mean that a
dedicated client and server must be selected, that this pair of machines must be used
for all experiments, and that the machines must not be reconfigured or modified
in any significant way. The project team chose a PC, Pentium 233MHz with 128 MB
memory, running Microsoft NT Workstation as the database client and a PC, PII
300MHz with 256 MB memory, running Microsoft NT Server with Oracle8i as the
database server. The network between the client and the server PCs remained unchanged
throughout the experiments.
6.2 Performance Considerations
As described in the introductory portion of this section, the performance of
restructuring data for query efficiency can be evaluated by considering two types of
actions, namely, restructuring and querying. Each experiment outlined in this report
measures one of the two actions. The following describes in further detail what it means
to evaluate either type of action.
6.2.1 Restructuring Time
Restructuring time includes the loading of data and additionally the restructuring
applied to the data. First we measured the performance of loading a set of documents and
then we measured the performance of applying a set of restructuring operations on the
loaded data.
Two different methods of applying a set of restructuring operations were used
to evaluate the performance of restructuring the data: series (single) restructuring and
batch restructuring. Series restructuring runs one operation on a set of data at a
time. Batch restructuring runs a set of operations on a set of data all at one time.
Since this project includes a restructuring component that executes all restructuring
operations, the difference lies in the input to this component: for series restructuring, a
single operation is provided repeatedly, once per operation, whereas for batch
restructuring a list of operations is provided. With respect to Oracle8i, the difference
between series and batch restructuring is when the views created by the restructuring
operations are materialized. For series restructuring, materialization is performed after
each operation; for batch restructuring, it is performed once after each set of operations.
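The difference between the two modes can be sketched as below. This is only an illustration of the control flow: the class RestructuringModes and its methods are hypothetical, operations and the materialization step are modeled as plain Runnable objects rather than SQL statements, and the returned value is simply the number of materializations performed.

```java
import java.util.List;

// Illustrative contrast of the two restructuring modes; not Rainbow code.
class RestructuringModes {
    // Series: run one operation, materialize, then move to the next.
    static int runSeries(List<Runnable> operations, Runnable materialize) {
        int materializations = 0;
        for (Runnable operation : operations) {
            operation.run();
            materialize.run();
            materializations++;
        }
        return materializations;
    }

    // Batch: run the whole list of operations, then materialize once.
    static int runBatch(List<Runnable> operations, Runnable materialize) {
        for (Runnable operation : operations) {
            operation.run();
        }
        materialize.run();
        return 1;
    }
}
```

With three operations, the series mode in this sketch materializes three times and the batch mode once, which is the difference the timing experiments are meant to expose.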
6.2.2 Query Time
To evaluate the performance of query processing, we measured the time it took
for a set of queries to evaluate on the data before and after restructuring. The
measurement was performed on each query, not on the set of queries as a whole. Each
query thereby yields a query-performance time for a set of data. All queries were
designed by the project team and were therefore not randomly generated or selected
from some list.
6.3 Cost Factors
Numerous factors can influence the performance evaluation of the whole concept
of restructuring data for query efficiency.
Parameter   Description
OP#         Number of operations
OP-TYPE     Type of operator
DAT-SIZE    Data size
QY#         Number of queries
DU#         Number of data updates
Table 8: Parameters of Restructuring Evaluations
6.4 Experimental Data
The DTD designed by the project team for the experiment is depicted in Figure
18.
<!ELEMENT one (two+)>
<!ELEMENT two (three)>
<!ELEMENT three (four)>
<!ELEMENT four (five)>
<!ELEMENT five (six)>
<!ELEMENT six (seven)>
<!ELEMENT seven EMPTY>
<!ATTLIST seven attribute #REQUIRED>
Figure 18: Experiment DTD
The project team designed this DTD to yield deep nesting levels for the
evaluation of the experiments. The attribute embedded in the seventh level allows for
attribute information that may be queried. XML documents were then randomly
generated from this DTD utilizing IBM’s XML-generator [17]. With the data in place,
we were able to perform evaluations that led to conclusive results.
6.5 Evaluations of Restructuring Setup Time
We performed each experiment five times to obtain average results. The
motivation, data models, and analysis methods for each experiment are discussed in
their respective sections. Note that these experiments are not necessarily mutually
exclusive in their variable settings.
For simplicity, the three experiments discussed in this section will not observe
any data updates; data updates will be fixed at 0. Update propagations were ignored for
this set of preliminary experiments.
6.5.1 Experiment 1: Scalability of Increase in Operations
In this experiment, we aim to evaluate the overhead associated with restructuring
with a varying number of operations.
Nine tests were conducted, each with a different number of restructuring
operations. Each test also compared the two restructuring methods: the fixed set of
operations was applied first serially and then in batch.
We aimed to characterize the relation between restructuring overhead and the
number of restructuring operations, and to measure the performance difference
between batch restructuring and series restructuring in Oracle8i.
The result of this experiment is a plot of the number of renameItem
operations (one of the most important operators) versus the average processing time in
seconds for both batch and series restructuring. For Batch, the processing time is the
time to process the whole set of operations; for Series, it is the sum of the processing
times of the individual operations.
To account for the un-factored influences that may obscure the results, as
mentioned earlier in this section, each point on the graph is derived from five
identical runs: the greatest and smallest times are discarded and the remaining
three are averaged.
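The averaging rule described above amounts to a trimmed mean. The following is a minimal sketch of it, assuming the timings are available as an array of seconds; the class and method names are introduced here for illustration and are not part of the Rainbow code.

```java
import java.util.Arrays;

/** Sketch of the averaging rule: drop the extremes, average the rest. */
public class TrimmedAverage {

    /**
     * Discards the single smallest and single largest timing and returns
     * the mean of the remaining values. For five runs this averages three.
     */
    public static double trimmedAverage(double[] timings) {
        if (timings.length < 3) {
            throw new IllegalArgumentException("need at least three runs");
        }
        double[] sorted = timings.clone(); // leave the caller's array untouched
        Arrays.sort(sorted);
        double sum = 0.0;
        for (int i = 1; i < sorted.length - 1; i++) {
            sum += sorted[i];
        }
        return sum / (sorted.length - 2);
    }
}
```

For example, with the made-up timings 12, 10, 50, 11 and 9 seconds, the values 9 and 50 are discarded and the remaining three average to 11 seconds, so a single slow outlier does not distort the plotted point.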
The fixed-parameter settings for both batch and series restructuring:
OP-TYPE: renameItem
DAT-SIZE: 104KB
QY#: 0
Expectations
1. Since there is an overhead associated with restructuring beyond the direct
modifications of the DTDMs and materialization of the XML data views
generated by an operation, batch restructuring should take less processing
time than series restructuring. Each operation evaluated in series will
accumulate its own overhead.
2. We can expect the processing time to increase linearly as the number of
restructuring operations increases, for both batch and series restructuring.
The graph in Figure 19 shows that although both series and batch restructuring
exhibit linear growth, batch restructuring required less processing time and
therefore incurred less overhead.
Figure 19: Batch versus Serial Restructuring
This result is expected because batch restructuring only requires the
materialization of the views created by a set of operations. Materialization only occurs
once per mapping for batch restructuring. The average processing time is mostly taken
up by the materialization of the views created by the restructuring operations in the
database.
Batch restructuring will be used for the evaluations of the remaining experiments.
6.5.2 Experiment 2: Operation Scalability
Since any particular type of operation may be performed many times with
different parameters, a performance evaluation of the batch restructuring of many
operations, all of the same operator type, would yield some idea of each operator type’s
cost for evaluation. To get a better grasp of actual overhead costs, we materialize only
after the set of restructuring operations.
An operator type was tested with an increasing number of operations, using batch
restructuring. The performance of the operator type was determined by measuring the
processing time of a batch of operations of that operator type. The
processing time for each set of operations is the sum of the processing times
of its operations. We started with one operation and added one operation at a
time until we reached a batch restructuring of six operations.
This experiment yields a graph plotting the number of pushUpAttribute operations
versus the average processing time in seconds for restructuring with a set of that
many operations. The processing time is that of the whole batch of pushUpAttribute
operations, whose size is the number indicated on the x-axis.
Again, to account for the un-factored influences that may obscure the
results, as mentioned earlier in this section, each point on the graph is derived
from five identical runs: the greatest and smallest times are discarded and the
remaining three are averaged.
The fixed-parameter settings:
OP-TYPE: pushUpAttribute
DAT-SIZE: 22MB
QY#: 0
Expectations
1. We hope to be able to conclude that the cost of an operator type observes
linear growth over the number of operations of its type.
The Operation Scalability experiment yielded the following results for the
pushUpAttribute operator as depicted in Figure 20.
Figure 20: Restructuring Overhead Results
We had hoped that the restructuring overhead would increase linearly with the
number of restructuring operations. The results in Figure 20, however,
suggest that the overhead cost grew along an exponential or polynomial curve
rather than linearly. Much of the overhead cost came from the materialization of the views
generated from the series of operations. What we should keep in mind is that
restructuring the XML data yields better query performance, as one can see
in the following query time evaluation. The more queries performed on the restructured
data, the greater the benefits of restructuring become.
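The trade-off between a one-time restructuring overhead and a per-query saving can be made concrete with a break-even calculation. The sketch below is an illustration only: the class name, the method name, and the numbers in the usage note are assumptions introduced here, not measurements from our experiments.

```java
/** Sketch: how many queries are needed before restructuring pays off. */
public class BreakEven {

    /**
     * Returns the smallest number of queries for which the accumulated
     * per-query saving covers the one-time restructuring overhead, or -1
     * if restructuring does not make the query faster at all.
     * All times are in seconds.
     */
    public static long breakEvenQueries(double overhead,
                                        double queryTimeBefore,
                                        double queryTimeAfter) {
        double savingPerQuery = queryTimeBefore - queryTimeAfter;
        if (savingPerQuery <= 0) {
            return -1; // no saving, so the overhead is never recovered
        }
        return (long) Math.ceil(overhead / savingPerQuery);
    }
}
```

As a hypothetical example, an overhead of 120 seconds with a query that drops from 2.0 to 0.5 seconds is recovered after 80 queries; every query after that is a net gain.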
6.6 Query Time Evaluations for Restructured Schema
The experiment in this section evaluated query performance. The queries used for
evaluation in this section are performed on materialized restructured data.
6.6.1 Experiment 3: Query Performance
This experiment is concerned with the optimization of query performance. The
motivation for this experiment is the general assumption that the pushing up of XML
information with respect to nesting would yield better query evaluation time as a result of
a reduction in the number of joins necessary to find the data.
To evaluate query performance, we applied restructuring operations of the operator
type pushUpAttribute and then measured query performance over a fixed data set,
varying only the number of operations performed on it. The query retrieved
the value of a nested attribute, which required joins. The evaluation
ranged from one to six operations and, as discussed, the query was run on actual
tables, not non-materialized views.
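To illustrate why pushing an attribute up reduces the number of joins, the sketch below generates the join-query for the experiment DTD of Figure 18. It rests on several assumptions that are not taken from the Rainbow mapping: one table per element type, hypothetical columns `id`, `parent_id` and `attr_value`, and each pushUpAttribute operation moving the attribute exactly one level closer to the root. The queries use comma-style joins with WHERE conditions.

```java
/** Sketch: joins needed to reach the attribute after n pushUpAttribute operations. */
public class JoinQuerySketch {

    // Element types of the experiment DTD, from the root down to the leaf.
    static final String[] LEVELS = {"one", "two", "three", "four", "five", "six", "seven"};

    /** Number of joins from the root table to the table holding the attribute. */
    public static int joinsNeeded(int pushUps) {
        if (pushUps < 0 || pushUps > LEVELS.length - 1) {
            throw new IllegalArgumentException("pushUps must be between 0 and 6");
        }
        return LEVELS.length - 1 - pushUps;
    }

    /** Builds the query; table and column names are hypothetical. */
    public static String buildQuery(int pushUps) {
        int owner = joinsNeeded(pushUps); // index of the table that holds the attribute
        StringBuilder from = new StringBuilder(LEVELS[0]);
        StringBuilder where = new StringBuilder();
        for (int i = 1; i <= owner; i++) {
            from.append(", ").append(LEVELS[i]);
            where.append(i == 1 ? " WHERE " : " AND ")
                 .append(LEVELS[i]).append(".parent_id = ")
                 .append(LEVELS[i - 1]).append(".id");
        }
        return "SELECT " + LEVELS[owner] + ".attr_value FROM " + from + where;
    }
}
```

Under these assumptions the unrestructured data needs six join conditions to reach the attribute on element seven, while after six operations the attribute sits in the root table and the query needs no join at all.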
This experiment yields a graph with a plot of the number of pushUpAttribute
operations performed versus the average processing time of the join-query needed to
retrieve the attribute data as described. The processing time is that of the
query alone; the x-axis indicates the number of pushUpAttribute operations applied
before the query was run.
Each point on the graph corresponds to the average of several runs, five identical
runs with the greatest and smallest numbers taken out and the remaining three averaged.
The queries are designed to specifically query for the restructured data.
- Go to the 'root' directory
- You can directly type in:
      java Demo DBURI <username> <userpassword>
  to run the system. For example:
      java Demo jdbc:oracle:thin:@shiba.wpi.edu:1521:ORCL foo foo
  Or do the following three steps.
  a. First add two entries for your database in the source file
     edu/wpi/cs/DSRG/xmldb/JDBCClient/JDBCClient.java
Once the interface is run, a main window pops up. This window is
composed of a main menu and a text box which displays messages to
the administrator.
1.2. Establishing a Connection
In order for any interaction to occur with the database, a connection
must first be established. An administrator selects the "System" option
from the main window menu and clicks on "Connect". A connect window pops
up with three text fields. The database path is entered into the first
field, the user name into the second field, and the user password into
the third. When all information is entered the administrator clicks
on the "Connect" button and if successful a connection is established
with Oracle.
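For readers who want to see what such a connection looks like in code, here
is a minimal sketch using the standard java.sql API. It is not the Rainbow
JDBCClient source: the class ConnectSketch and its helper ConnectInfo are
hypothetical names used only for illustration, and the Oracle JDBC driver is
assumed to be on the classpath.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

/** Sketch of opening a database connection from the three connect fields. */
public class ConnectSketch {

    /** Holds the database path, user name and password. */
    public static final class ConnectInfo {
        public final String url;
        public final String user;
        public final String password;

        ConnectInfo(String url, String user, String password) {
            this.url = url;
            this.user = user;
            this.password = password;
        }
    }

    /** Validates the three command-line style arguments. */
    public static ConnectInfo parseArgs(String[] args) {
        if (args.length != 3) {
            throw new IllegalArgumentException("usage: <DBURI> <username> <userpassword>");
        }
        if (!args[0].startsWith("jdbc:")) {
            throw new IllegalArgumentException("DBURI must be a JDBC URL");
        }
        return new ConnectInfo(args[0], args[1], args[2]);
    }

    /** Opens the connection; fails with SQLException if the database is unreachable. */
    public static Connection connect(ConnectInfo info) throws SQLException {
        return DriverManager.getConnection(info.url, info.user, info.password);
    }

    public static void main(String[] args) throws SQLException {
        ConnectInfo info = parseArgs(args);
        try (Connection conn = connect(info)) { // closed automatically on exit
            System.out.println("Connected: " + !conn.isClosed());
        }
    }
}
```

The three arguments correspond to the three text fields of the connect
window, in the same order as in the "java Demo" command line shown earlier.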
1.3. Sending Manual Queries to the Database
The administrator may send an SQL query to the database by selecting
"System" from the main window menu and clicking "Manual". A window pops
up with one text field. Once the administrator enters the query string
into the field and clicks on the "Send" button, the query is sent
to Oracle and any output received is echoed in the main window text
box.
1.4. Importing XML documents
In order to import an XML document into the database, the administrator
selects "Import" from the main window menu. An "open file" window pops
up which allows the administrator to select the XML file.

1.5. Exporting a DTD
In order to export a DTD document from the database and save it as a
file, the administrator selects "Export" from the main window menu. A
"Save file" window pops up which allows the administrator to select
the name and path of the DTD file to create.
1.6. Using the Work Window
1.6.1. The work window is initially invisible. In order for it to become
visible, the administrator must select "Window" from the main window
menu and click "WorkWindow". The work window contains three tabs (each
tab is a separate sub-window). The first tab (DB) brings up the
Database data, the second tab (DTD/XML) is not yet implemented and is
intended to display the DTD and XML structure, and the third tab
(Operators) is for the purpose of doing restructuring.
1.6.2. Viewing Tables
When the administrator clicks on "Get Table List" on the first
tab (DB), the list of tables contained in the database will be displayed.
When the administrator clicks on one of the table names, the data of
that particular table is displayed in the adjacent "Table Data" text
box.
1.6.3. Restructuring
In order for a restructuring to be done, the administrator must first
select what operators to run and give the parameters for each of the
operators. The third tab (Operators) contains three text boxes. The
first box is a list of all available operators. The administrator must
first select one of the operators. Once selected, it will appear in
the second box. This process may be repeated for as many operators as
are intended to be run. Each selected operator that is clicked on in
the second box will cause a list of parameters for that operator to
appear in the third box. In order to enter values for each of these
parameters, the administrator must click on a particular parameter and
enter its value in the text field. Once operators are ready to be run,
the "Run" button is clicked. The operators then run sequentially and
do the restructuring.
ADDITIONAL NOTES:
---------------------
The Rainbow system has been tested to successfully compile and run on a PC
running Windows NT 5.01 as well as under Windows 98. The Java development
environments used were JDeveloper by Oracle and Visual Cafe.
FIND OUT MORE:
--------------
- Please look at the javadocs for the source files, in particular the one for src\Demo.java
TELL US ABOUT IT:
---------------------
- If you have any questions or comments, please send us an email at [email protected], noting that it relates to 'Rainbow Project 2000-2001'.