
Rainbow: Bridging XML and Relational Databases Using a Flexible Mapping

The Design, Implementation, and Evaluation of the Rainbow System

A Major Qualifying Project Report

Submitted to the Faculty

Of the

WORCESTER POLYTECHNIC INSTITUTE

In partial fulfillment of the requirements for the

Degree of Bachelor of Science

By

Tien Vu, John Lee, Mirek Cymer

Date: 5/02/2001

Approved: Professor Elke A. Rundensteiner

Authorship Page

Key: T = Tien Vu, J = John Lee, M = Mirek Cymer

1 Introduction: T
1.1 Motivation: T
1.2 Our Approach: T
1.3 The Rainbow System: T
1.4 MQP Project Goals: T
1.5 Additional Team Goals: T
1.6 Outline of the Remaining Sections: TJ

2 Background: M
2.1 Readings: M
2.2 Basics of XML and DTDs: M
2.3 Technologies
2.3.1 SQL, Relational Databases, and Oracle 8i: TJM
2.3.2 JDBC and ResultSet Classes: J
2.4 Software Development: TM
2.4.1 Object Oriented (OO) Design: TJ
2.4.2 Software Migration: TJ

3 DTD Metadata Management: J
3.1 Metadata Tables: J
3.2 Data Schema: J
3.3 DTD Manager and XML Manager Extensions: J
3.3.1 Original DTD Manager and XML Manager: J
3.3.2 Support for Multiple DTDs and XMLs: J

4 Implementation of the Rainbow System: J
4.1 Restructuring Subsystem: J
4.1.1 The Restructuring Functionality: J
4.1.2 A Prototype Design: J
4.1.3 Implementation Details: J
4.2 Restructuring Operators: T
4.2.1 Pushup and Pushdown Attribute Operators: T
4.2.2 Rename Item and Attribute Operators: T
4.2.3 Pushup and Pushdown Nesting Operators: T
4.2.4 Other Operators: T
4.3 Rainbow Graphical User Interface: M

5 Implementation Details: T
5.1 System Architecture: T
5.2 Code Facts: T
5.3 Existing System Packages: J
5.4 Implementation Environment: J

6 Experimental Evaluation: J
6.1 Experimental Setup: J
6.1.1 Scope and Design of a Test Plan: J
6.1.2 Designing an Experimental Test Bed: J
6.2 Performance Considerations: J
6.2.1 Restructuring Time: J
6.2.2 Query Time: J
6.3 Cost Factors: J
6.4 Experimental Data: J
6.5 Restructuring Setup Time Evaluations: J
6.5.1 Experiment 1: Scalability of Increase in Operations: J
6.5.2 Experiment 2: Operation Scalability: J
6.6 Query Time Evaluations for Restructured Schema: J
6.6.1 Experiment 3: Query Performance: J
6.7 Analyses: J

7 Conclusions: TJ
7.1 Summary of the Rainbow Project: T
7.2 Experience Gained and Lessons Learned: T
7.2.1 Object-Oriented Design: T
7.2.2 UML: T
7.2.3 The Java Programming Language: T
7.2.4 XML: TJ
7.2.5 Database Management Systems: TJ
7.2.6 Software Engineering Experience: TJ
7.2.7 Designing the Test Plan: TJ
7.2.8 Working as a Team: TJ
7.3 Future Work: TJ

References: T
Past Works and Books: TJ
Web-pages: TJ

Appendixes: TJM
Readme for System Environment Setup and Demo: TJM

Abstract

The use of Extensible Markup Language (XML) documents to model data and exchange data over the web is becoming increasingly prominent. Given the maturity and performance of existing relational database technology, there is great interest in using this technology as a backend engine to store, manage, and query XML data. It is well known that different relational schemas yield different query performance for a given load. Hence, a single fixed way of mapping XML into relational databases is not sufficient to reach an overall optimized query performance for a given query workload. The Rainbow System proposes a flexible mapping approach: it first loads the XML data into a relational database system and then applies relational restructuring techniques, with the help of SQL queries and database views, to the loaded data and schema. A key ingredient of the Rainbow solution is the management, in relational format, of metadata describing both the XML structure information (DTD) and the chosen mapping. In this project we designed, implemented, and carried out a preliminary evaluation of this flexible mapping component of the Rainbow system.


Table of Contents

1 Introduction
  1.1 Motivation
  1.2 Our Approach
  1.3 The Rainbow System
  1.4 MQP Project Goals
  1.5 Additional Team Goals
  1.6 Outline of the Remaining Sections
2 Background
  2.1 Readings
  2.2 Basics of XML and DTDs
  2.3 Technologies
    2.3.1 SQL, Relational Databases, and Oracle 8i
    2.3.2 JDBC and ResultSet Classes
  2.4 Software Development
    2.4.1 Object Oriented (OO) Design
    2.4.2 Software Migration
3 DTD Metadata Management
  3.1 Metadata Tables
  3.2 Data Schema
  3.3 DTD Manager and XML Manager Extensions
    3.3.1 Original DTD Manager and XML Manager
    3.3.2 Support for Multiple DTDs and XMLs
4 Flexible Mapping Support in the Rainbow System
  4.1 Restructuring Subsystem
    4.1.1 The Restructuring Functionality
    4.1.2 A Prototype Design
    4.1.3 Implementation Details
  4.2 Restructuring Operators
    4.2.1 Pushup and Pushdown Attribute Operators
    4.2.2 Rename Item and Attribute Operators
    4.2.3 Pushup and Pushdown Nesting Operators
    4.2.4 Other Operators
  4.3 Rainbow Graphical User Interface
5 Implementation Details
  5.1 System Architecture
  5.2 Code Facts
  5.3 Existing System Packages
  5.4 Implementation Environment
6 Experimental Evaluation
  6.1 Experimental Setup
    6.1.1 Scope and Design of a Test Plan
    6.1.2 Designing an Experimental Test Bed
  6.2 Performance Considerations
    6.2.1 Restructuring Time
    6.2.2 Query Time
  6.3 Cost Factors
  6.4 Experimental Data
  6.5 Evaluations of Restructuring Setup Time
    6.5.1 Experiment 1: Scalability of Increase in Operations
    6.5.2 Experiment 2: Operation Scalability
  6.6 Query Time Evaluations for Restructured Schema
    6.6.1 Experiment 3: Query Performance
  6.7 Analyses
7 Conclusions
  7.1 Summary of the Rainbow Project
  7.2 Experience Gained and Lessons Learned
    7.2.1 Object-Oriented Design
    7.2.2 UML
    7.2.3 The Java Programming Language
    7.2.4 XML
    7.2.5 Database Management Systems
    7.2.6 Software Engineering Experience
    7.2.7 Designing the Test Plan
    7.2.8 Working as a Team
  7.3 Future Work
References
  Past Works and Books
  Web-pages
Appendixes
  Readme for System Environment Setup and Demo

List of Illustrations

Figure 1: Proposed Rainbow Architecture
Figure 2: Examples of XML Elements
Figure 3: XML Content Definitions
Figure 4: XML/DTD Example Documents
Figure 5: Algorithm of Mapping DTD into Relational Schema
Figure 6: DTD Manager and XML Manager
Figure 7: Restructure Function
Figure 8: Restructure Subsystem
Figure 9: Restructuring Subsystem Class Diagram
Figure 10: Pushup Attribute Operator
Figure 11: Pushup and Pushdown Attribute
Figure 12: Pushup and Pushdown Nesting
Figure 13: Screenshot 1
Figure 14: Screenshot 2
Figure 15: Screenshot 3
Figure 16: Rainbow Architecture with RDBMS
Figure 17: Statistics of Class Implementation
Figure 18: Experiment DTD
Figure 19: Batch versus Serial Restructuring
Figure 20: Restructuring Overhead Results
Figure 21: Join Query Performance Results

List of Tables

Table 1: Student Address Relation
Table 2: Relation Resulting from a Query Evaluation
Table 3: Item DTDM (DTDM-Item table)
Table 4: Nesting DTDM (DTDM-Nesting table)
Table 5: Attribute DTDM (DTDM-Attribute table)
Table 6: DTDMS for Figure 2’s DTD
Table 7: Data Relations for Figure 2’s XML Document
Table 8: Parameters of Restructuring Evaluations

1 Introduction

1.1 Motivation

The use of Extensible Markup Language (XML) documents to store information is becoming increasingly prominent [14]. XML’s main strength, organizing information in a file format that is both human-readable and machine-interpretable, makes it well suited to exchanging data between different systems. Unlike Hypertext Markup Language (HTML), which describes the physical presentation of a web page, XML uses tags to describe the meaning of the data itself [9]. An element of an XML document corresponds to a defined tag [9]; see Figure 2 for examples. The nesting of elements represents the logical hierarchy among them, which makes it easier to extract information from an XML document in response to a user request.

Industry groups have specified standard structures for XML documents in the form of Document Type Definitions (DTDs) and, more recently, XML Schema; these standards facilitate the exchange of XML documents among enterprises [14]. By enabling automatic data flow among businesses, XML is pushing the world into the electronic commerce era. Collecting, analyzing, mining, and managing XML data will hence become tremendously important tasks for future web-based applications [2]. Such applications require a system that can store, retrieve, update, and query XML documents.

One prominent approach to building such a system is to store XML documents in relational databases. Relational Database Management Systems (RDBMSs) handle data storage, querying, concurrency control, and other features. Many database vendors, such as Microsoft, IBM, Informix, and Oracle, have started to support XML in their own RDBMSs. The benefits of managing XML in a relational database are manifold, including the availability of mature database tools, efficient query and analysis tools, and easy integration with existing business databases. However, open issues remain: mapping between XML and the relational model, XML update propagation, and XML query translation and optimization. This MQP focuses mainly on the issue of mapping between XML and the relational model.

1.2 Our Approach

Zhang et al. [2] propose a metadata-driven approach that addresses the issue of flexible mapping. The approach generates a relational schema from a DTD and stores the XML data compliant with that DTD in relational tables, which can then be queried using Structured Query Language (SQL). It includes loading a DTD into relational metatables, constructing a relational schema called the Metadata Tables (DTDMs), restructuring the DTDMs for efficient querying, automatically constructing a relational schema for the XML documents that conform to the DTD, and loading the XML data into the prepared relational database schema.

As an additional feature, the approach of Zhang et al. [3] also handles updates to the XML documents. The information in the database is kept consistent with the external sources, which hold the up-to-date XML pages, by a synchronization process that uses the DTDMs. The initial storage of the XML data uses a fixed mapping that retains the hierarchical semantics of the loaded DTD. However, once the XML data is stored, a restructuring process may be invoked to modify the DTDM schemas and the XML data, because different mappings yield different query performance [8].

To reap the benefits of the metadata-driven approach, the data contained within an XML document has to be accessed efficiently. The approach should allow easy retrieval and modification of the XML data in a database system by means of SQL. This is preferable to requiring a programmer to traverse the XML and DTD documents with specialized miniature parsing programs. It would not only save development time and money through code reuse, but would also avoid the errors that such additional programming can introduce.

1.3 The Rainbow System

Much of the information presented here is drawn from [8]. To keep track of XML updates and provide good query performance, the metadata-driven system, hereafter referred to as Rainbow, is composed of a DTD manager, a basic storage manager, a schema creator, a restructurer, an XML query engine, and an XML schema, as depicted in Figure 1. A working system that included some of the components in this figure was already in place.

Figure 1: Proposed Rainbow Architecture

The DTD Manager loads DTD documents into our system by storing them in DTDMs as part of the system dictionary tables. DTDMs model the DTD as a collection of items, attributes, and nesting relationships. After the DTDM repository is loaded, the schema creator infers a relational schema from it.
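To make the idea of viewing a DTD as items and nesting relationships concrete, the following Java sketch extracts both from element declarations with a regular expression. It is an illustration only and not the Rainbow DTD Manager: the class name DtdItems, its methods, and the "parent>child" string encoding are assumptions made for this example. It handles only simple comma-separated content models, ignores attribute declarations, and is written for a current Java release rather than Java 1.2.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DtdItems {
    // Matches declarations such as <!ELEMENT book (title, source, price)>.
    // Nested parentheses and choice groups (a | b) are not supported.
    private static final Pattern ELEMENT =
        Pattern.compile("<!ELEMENT\\s+(\\w+)\\s+\\(([^)]*)\\)\\s*[*+?]?\\s*>");

    /** Returns the declared element names (the "items") in document order. */
    public static List<String> items(String dtd) {
        List<String> names = new ArrayList<>();
        Matcher m = ELEMENT.matcher(dtd);
        while (m.find()) {
            names.add(m.group(1));
        }
        return names;
    }

    /** Returns the nesting relationships, each encoded as "parent>child". */
    public static List<String> nestings(String dtd) {
        List<String> pairs = new ArrayList<>();
        Matcher m = ELEMENT.matcher(dtd);
        while (m.find()) {
            for (String part : m.group(2).split(",")) {
                // Strip whitespace and a trailing occurrence indicator (*, +, ?).
                String child = part.trim().replaceAll("[*+?]$", "");
                // Skip #PCDATA, which is text content rather than a child element.
                if (!child.isEmpty() && !child.startsWith("#")) {
                    pairs.add(m.group(1) + ">" + child);
                }
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        String dtd = "<!ELEMENT prices (book*)>"
            + "<!ELEMENT book (title, source, price)>"
            + "<!ELEMENT title (#PCDATA)>";
        System.out.println("Items:    " + items(dtd));
        System.out.println("Nestings: " + nestings(dtd));
    }
}
```

In a metadata-driven design, each item and each nesting pair would become a row in a metadata table; the in-memory lists here stand in for those rows.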

The basic storage manager maintains XML documents with the help of three modules: an importer, an exporter, and a synchronizer. The importer imports XML documents that comply with a previously specified DTD into our system. The exporter exports the relational data as XML documents. The synchronizer keeps the internal relational representation and the external XML representation consistent with each other under data updates.

The restructuring operator library stores a collection of restructuring operators for optimization purposes. An optimizer takes as input a given XML query load, specified by a database administrator (DBA), and the DTDMs, which model the current structure of the relational database. It generates a mapping by applying the restructuring operators provided by the restructuring operator library. A mapping specifies a sequence of restructuring operators to be applied to the different element types defined in the DTD. The restructuring manager then transforms the initially loaded data into the desired optimized format, which is used for efficient querying.

The end user can issue XML queries through the XML Query Engine subsystem. The Query Translator, based on the mapping provided by the Optimizer, translates the XML query into a sequence of SQL queries. The relational query engine of the RDBMS then executes the SQL queries and returns the corresponding relational query result. The query result translator translates the query result back into the XML model and returns it to the end user.

The Rainbow architecture was partially implemented when the project team started working on the development of its components. The DTD Manager and Basic Storage Manager were capable of loading a single DTD and a single XML document. The Basic Storage Manager, hereafter referred to as the XML Manager, did not have a synchronizer process to maintain the integrity of the internal data. Instead, the synchronizer, called Clock, was a separate component developed at WPI but not yet integrated. A constituent operator set had been researched and designed for the Restructuring Operator Library [8], but no implementations of the Optimizer or Restructuring Manager were in place. Lastly, the XML Query Engine remained at the conceptual stage and had yet to be realized.

1.4 MQP Project Goals

The scope of this MQP is to continue the necessary development of the remaining subsystems and to extend the existing ones. With the benefit of an extended schedule, the project team was able to pursue the extension of the DTD Manager and XML Manager and the design and development of a prototype of the Restructuring Manager. With the completion of these tasks, the Rainbow system is now able to store multiple DTDs and XML documents, and to restructure the initial fixed mapping of the XML data using an administrator-specified mapping. To evaluate the system we developed, the project team designed a test bed and experimental outline and performed experimental studies on the working system.

1.5 Additional Team Goals

An additional goal of this MQP was for the project team members to learn and develop competency with database, XML, and SQL technologies and with Java programming for RDBMSs. With respect to developing the subsystems of Rainbow, the team aimed to learn how to maintain and extend existing software and how to engineer a complete software system from the design phase through experimental evaluation. Reuse of previous code that went beyond simple extension of functionality became essential, both for the team’s quicker adaptation to several of the technologies mentioned and for completing the development of several subsystems within the time constraints of the project.

By the conclusion of this project, the members of the team had not only developed the software engineering skills necessary to succeed in the field of Computer Science, but had also each come to understand team dynamics. The project members had to work closely together to ensure that the separable tasks led to the development of compatible parts, as well as to show progress to the project advisor. To guarantee deliverables in a timely manner, the team learned about the presentation and communication skills pertinent to a manageable work schedule.

1.6 Outline of the Remaining Sections

This project report has the following structure. Section 2 describes the background technologies and tools that one needs to understand the remaining sections of this report. Section 3 describes the metadata model for the Rainbow architecture, how it is used to load XML data, and the extensions to the existing subsystems, namely the XML Manager and DTD Manager. Section 4 details the design and implementation of the Restructuring subsystem, explains what a restructuring operator is, and lists the operators implemented for the subsystem. Section 5 discusses implementation details such as the system architecture and the implementation environment. Section 6 details the experiments conducted to evaluate the Restructuring subsystem. Finally, Section 7 concludes the report with a summary and a discussion of future work.

2 Background

2.1 Readings

The team made extensive use of the following references: chapters two and nine of Database Management Systems [1] by Ramakrishnan, and a number of documents contributed by graduate students and the professor for the purpose of this project, including “Metadata-Driven Approach to Integrating XML and Relational Data” [2], “Clock: Synchronizing Internal Relational Storage with External XML Document” [3], “Incremental Maintenance of Virtual XML Repository” [5], “ISP-EAR555: XML Relational Management” [4], “A Performance Evaluation of Alternative Mapping Schemes for Storing XML Data in a Relational Database” [6], and “DyDa: Dynamic Data Warehousing” [7]. Since the Relational Database Management System (RDBMS) that hosted the information for the team project was Oracle 8i running on a Microsoft Windows NT Server PC, the team learned the skills essential to manipulate information and navigate through that system. The project team also acquired background knowledge in design and programming techniques, including the use of Java and its documentation standard, XML, RDBMSs, and SQL.

2.2 Basics of XML and DTDs

XML is a markup language that allows a document to contain structured information. A markup language is a mechanism for identifying structures in a document, and the XML specification defines a standard way to add markup to documents. The content of these documents may include descriptions, pictures, headings, and so on. XML documents also hold information about each type of content. Like an HTML document, an XML document contains tags that specify these types of content. In HTML documents, both the tag semantics and the tag set are fixed. Even with efforts by industry to improve the flexibility of HTML, any changes are strictly confined by what the browser vendors have implemented and by the fact that backward compatibility is paramount.

XML, on the other hand, specifies neither semantics nor a tag set. While HTML specifies how a document should be displayed, it does not describe what kind of information the document contains. XML allows document authors to organize information in a flexible way. In fact, XML is really a meta-language for describing markup languages. In other words, XML provides a facility to define tags and the structural relationships between them. Since there is no predefined tag set, there cannot be any preconceived semantics. All of the semantics of an XML document are defined either by the applications that process it or by style sheets.

Many applications of XML are Internet-related, but XML is in no way limited to Internet use. In fact, XML’s main strength is the way it organizes information, which makes it well suited to exchanging data between different systems, regardless of whether the Internet is part of the picture.

To process XML, a program called an XML parser is needed. The parser reads an XML document and makes its structure available to an application, which can then display the document in a user-friendly way based on a stylesheet. Both Microsoft and Netscape are working to add XML parsing capabilities to their browsers.

XML can benefit e-commerce by enabling back-end systems to communicate business transaction information in a known format. For example, business partners can standardize on a specific XML syntax for describing purchase orders and can then automate the transfer of that information across otherwise incompatible systems.

An example of XML is given in the following figure:

Description: Empty element with attributes
Example: <ELEMENT ATTR1="value" ATTR2="value"/>

Description: Element with content and end tag
Example: <ELEMENT>Element Content Here</ELEMENT>

Description: Parent element with attributes and child elements
Example:
<PARENT ATTR1="value">
  <CHILD1>
    Content
  </CHILD1>
  <CHILD2 ATTR1="value"/>
</PARENT>

Figure 2: Examples of XML Elements

The allowable contents of an element type are EMPTY, ANY, Mixed, or children element types [16].

Allowable contents and their definitions:

EMPTY: Refers to tags that are empty.

ANY: Refers to anything at all, as long as XML rules are followed. ANY is useful when you have yet to decide the allowable contents of the element.

Children elements: You can place any number of element types within another element type. These are called children elements, and the elements they are placed in are called parent elements.

Mixed content: Refers to a combination of (#PCDATA) and children elements. PCDATA stands for parsed character data, that is, text that is not markup. Therefore, an element that has the allowable content (#PCDATA) may not contain any children.

Figure 3: XML Content Definitions


For simplicity, we assume that the XML documents handled in this project receive their tag definitions through one standalone external DTD. That is, the tags contained within each XML document are defined in a separate DTD. A DTD holds definitions for tag elements, the nesting relationships of these elements, the attributes of these elements, and other relations among the data types. To reiterate, DTDs are defined by industry groups to specify the standard schema of XML documents in order to facilitate their exchange. Restricting the project scope to XML documents that comply with a DTD is therefore a reasonable limit.

An example of a DTD and an XML document [14]:

DTD:

<!ELEMENT prices (book*)>
<!ELEMENT book (title, source, price)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT source (#PCDATA)>
<!ELEMENT price (#PCDATA)>

Compliant XML:

<prices>
  <book>
    <title>Advanced Programming in the Unix environment</title>
    <source>www.amazon.com</source>
    <price>65.95</price>
  </book>
  <book>
    <title> TCP/IP Illustrated </title>
    <source>www.amazon.com</source>
    <price>65.95</price>
  </book>
</prices>

Figure 4: XML/DTD Example Documents
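As a rough illustration of how the nesting in such a document lines up with rows of data, the following Java sketch reads a prices document with the standard DOM parser and returns one string per book element. The class name PricesReader, the method bookRows, and the "title|source|price" row encoding are assumptions made for this example; the code is not part of Rainbow, it assumes the input complies with the DTD above, and it targets a current Java release rather than Java 1.2.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class PricesReader {
    /** Returns one "title|source|price" string per book element. */
    public static List<String> bookRows(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList books = doc.getElementsByTagName("book");
            List<String> rows = new ArrayList<>();
            for (int i = 0; i < books.getLength(); i++) {
                Element book = (Element) books.item(i);
                rows.add(text(book, "title") + "|"
                    + text(book, "source") + "|"
                    + text(book, "price"));
            }
            return rows;
        } catch (Exception e) {
            // Parsing problems are rethrown unchecked to keep the sketch short.
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    // Reads the trimmed text of the first child with the given tag name.
    // Assumes the child exists, as the DTD in Figure 4 requires.
    private static String text(Element parent, String tag) {
        return parent.getElementsByTagName(tag).item(0).getTextContent().trim();
    }

    public static void main(String[] args) {
        String xml = "<prices><book>"
            + "<title> TCP/IP Illustrated </title>"
            + "<source>www.amazon.com</source>"
            + "<price>65.95</price>"
            + "</book></prices>";
        for (String row : bookRows(xml)) {
            System.out.println(row);
        }
    }
}
```

Each book element yields one row with three values, which hints at why a nested element type is a natural candidate for a relational table.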


2.3 Technologies

2.3.1 SQL, Relational Databases, and Oracle 8i

SQL is a query language that allows users to access data in an RDBMS (Ullman, 1997). Commercial RDBMS products from corporations such as Oracle, Sybase, Informix, Microsoft, and others support standard SQL, which lets a user describe the data he or she wishes to receive. SQL provides these services by allowing users to define relations, manipulate relations, and query them. These relations are simple tables, each with its own schema, and they may or may not be interconnected by various constraints and keys. The collection of the schemas of all the relations of concern is referred to as a relational schema. Executing a SQL query against the relational database returns a relation whose schema is specified by the query.

Most of our information about relational databases came from [DMS]. We give a brief overview of how to access and manipulate data in SQL. The main objective of this overview is to show that using SQL against an RDBMS is an effective way to manage XML documents stored in a relational database, which is the purpose of this project.

In a relational database, data is stored in tables. The following table relates Social Security Number, Name, and Address:

StudentAddressTable

SSN | FirstName | LastName | Address | City | State
124368537 | John | Lee | 100 Institute Road | Jackson | Nebraska
339152314 | Tien | Vu | 23 Grover Street | Louisville | Louisiana
452078093 | Mirek | Cymer | 19 Terrace Ave | Miami Beach | Florida
736192613 | Jane | Doe | 34 Main Street | New York | New York

Table 1: Student-Address Relation

To see the address of each student, you could use the SELECT statement:

SELECT FirstName, LastName, Address, City, State FROM StudentAddressTable;

Table 2 contains the result of the above query against the relation in Table 1.

FirstName | LastName | Address | City | State
John | Lee | 100 Institute Road | Jackson | Nebraska
Tien | Vu | 23 Grover Street | Louisville | Louisiana
Mirek | Cymer | 19 Terrace Ave | Miami Beach | Florida
Jane | Doe | 34 Main Street | New York | New York

Table 2: Relation Resulting from a Query Evaluation

Let us look at what just happened in detail. The query asked for every row of the StudentAddressTable, restricted to the columns FirstName, LastName, Address, City, and State; the SSN column was not requested and therefore does not appear in the result. Note that query statements end with a semicolon and that table names and column names do not contain spaces. The general template of a SELECT statement that retrieves all of the rows in a table is:

SELECT ColumnName, ColumnName, ... FROM TableName;

To get all columns of a table without typing all column names, use * as in:

SELECT * FROM TableName;

The SELECT statement can be written in a great number of ways, giving wide access to the data contained in the tables. SQL also supports conditions in a WHERE clause (e.g., selecting rows whose values are greater or less than a certain amount). More complex conditions may be combined with the usual logical operators AND, OR, and NOT. The keyword DISTINCT removes duplicate rows from a result, so that each combination of selected values (name, address, number, etc.) appears only once. SQL also offers nested queries, objects, joins of tables, and more advanced syntax, but these go beyond what is needed for the scope of the project.

2.3.2 JDBC and ResultSet Classes

Due to the complexity of this system, it had to be implemented in a high-level programming language. The language had to be object-oriented so that existing classes could be extended, and it needed to make calls to databases quickly and easily. We chose Java 1.2 for its flexibility and its strict use of object-oriented principles such as inheritance, encapsulation, and polymorphism (Horton, 1997). Java also suited the Rainbow system because it can call databases through Java Database Connectivity (JDBC). In addition, Java avoids many of the difficulties that can be experienced with other programming languages (Horton, 1998). Lastly, all of the existing code included in the Rainbow design had already been written in Java, which made Java the most convenient choice.

To make future work with our code easier, Javadoc comments were used extensively throughout the code. Javadoc comments are specially formatted comments within the source code that document it, for example by naming the author of a class and describing the parameters of a method. They helped the team read code more easily and quickly and should prove valuable to the members of future projects concerning Rainbow [7].

To establish a connection to a DBMS, a Java program uses the JDBC API, which defines connection objects (Taylor, 1997). Once a connection object has been initialized with the proper login information for a DBMS, the Java program can use it to execute queries or update statements on the database. The java.sql package also provides ResultSet, which allows a Java program to traverse a returned relation one tuple at a time. ResultSet bridges the two languages, Java and SQL, and so addresses the impedance mismatch between them: it essentially allows Java to handle the tuple structure returned by a DBMS. Together, JDBC and ResultSet provide all the data retrieval and manipulation functionality the project team needs to interface Java processes with a DBMS.
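The interplay of a connection object, a statement, and a ResultSet can be sketched as follows. This is a minimal illustration rather than the Rainbow code: the class name StudentQuery and its two methods are hypothetical, the sketch uses the try-with-resources syntax of later Java versions rather than Java 1.2, and it assumes the caller already holds an open java.sql.Connection. The statement text is passed without a trailing semicolon, since that terminator belongs to interactive SQL tools rather than to the text handed to JDBC.

```java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;

public class StudentQuery {
    /** Builds "SELECT c1, c2 FROM table"; identifiers are assumed to be trusted. */
    public static String selectSql(String table, String... columns) {
        String cols = columns.length == 0 ? "*" : String.join(", ", columns);
        return "SELECT " + cols + " FROM " + table;
    }

    /** Runs the query and copies each returned tuple into a String array. */
    public static List<String[]> fetch(Connection conn, String table, String... columns)
            throws SQLException {
        List<String[]> rows = new ArrayList<>();
        try (Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(selectSql(table, columns))) {
            int width = rs.getMetaData().getColumnCount();
            while (rs.next()) {                    // advance the cursor one tuple at a time
                String[] row = new String[width];
                for (int i = 0; i < width; i++) {
                    row[i] = rs.getString(i + 1);  // JDBC column indexes start at 1
                }
                rows.add(row);
            }
        }
        return rows;
    }
}
```

With a connection obtained from java.sql.DriverManager, a call such as StudentQuery.fetch(conn, "StudentAddressTable", "FirstName", "LastName") would return one array per student row.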

2.4 Software Development

2.4.1 Object Oriented (OO) Design

Because the project team strove to continue the development of the Rainbow system in accordance with its architectural design, the team’s main objective in understanding and developing the project was to translate the architecture presented by the previous work into an actual system design, in addition to extending the established initial subsystems.

Once the project team had a firm understanding of the Rainbow system architecture, the next phase in developing a new subsystem from scratch was to put it into a concrete design using the Unified Modeling Language (UML). UML is a common design language that consists of many different diagram types (such as class diagrams, activity diagrams, and sequence diagrams). These diagrams serve as a type of ‘blueprint’ for the entire system, as each gives a different level and type of description of the system. To make use of UML, the team found it necessary to become familiar with the software design tool Object Domain [15]. Using Object Domain, the project team designed class diagrams for the Restructure subsystem.

2.4.2 Software Migration

The Rainbow system itself is very extensive, containing a large number of classes and a large amount of code. The established subsystems supplied a great deal of existing code for the implementation of Rainbow. The project team encountered both difficulties and advantages in reusing the existing code base. The code base had to be examined to determine which portions were suitable for reuse, which needed to be modified and enhanced, and which had to be completely re-implemented because they could not support an extension. To re-engineer the previous code, the team also had to apply software engineering skills obtained from courses, the most important being proper documentation. The team documented the code it added and also documented reused code, once it was understood, wherever that code had been undocumented or insufficiently documented.

3 DTD Metadata Management

This section presents the details of the original metadata model that enables

flexible mapping as proposed by Zhang et al. [14]. The system assumes that there exists

only one external DTD document for the compliant XML documents and that file has no

nested DTDs, and there is no internal DTD in the XML documents. The data model only

focuses on XML documents that meet these requirements.

Figure 5: Algorithm of Mapping DTD into Relational Schema

As shown in Figure 5, the system first stores the DTD into metadata tables. Then

it can optionally restructure the metadata tables. At the end it will generate the relational

schema from the metadata. The storing module identifies the characteristics of the DTD

and stores them as metadata. The restructuring module identifies the multi-valued

attributes of the DTD and also identifies the items that could be represented as attributes.

Lastly, mapping a DTD into a relational schema is achieved by applying mapping rules

over the metadata tables storing the DTD.

[Figure 5 diagram: DTD -> Store -> Metadata -> Restructure (optional) -> Generate Relational Schema]

This metadata approach includes a storing stage, a mapping stage, and an optional restructuring stage. The restructuring stage is what makes the approach flexible: by restructuring the metadata, the system can produce various relational schemas. The following subsections explain these stages in more detail, along with a working example of storing a DTD and loading an XML document.

3.1 Metadata Tables

Storing the DTD properties into relational tables makes it practical to use

relational query facilities to query the metadata. The metadata tables keep track of the

mappings to allow the system to automatically load the XML data into the generated

relational schema.

Let us focus on the details of this metadata-driven approach to managing XML data, in which a DTD is loaded into DTDMs in a relational database. To capture all the necessary information in the DTD, there are three DTDMs, one for each of the three identified types of pertinent information: items, nesting, and attributes. The Items relation corresponds to the elements defined in the DTD as well as to groupings of elements; an item represents an element type or a group. The Nesting relation captures the relationships among the elements defined in the DTD. Finally, the Attribute relation captures the attributes defined for each element in the DTD; an attribute is a property of an item. The following tables have been extracted from [3].

In Tables 3 through 5, the schema for each of the three DTDMs is depicted.

Field | Meaning
ID | Internal ID for items.
Name | Element type or group name.
Type | Defines the type of this item from the domain: PCDATA, ELEMENT.ELEMENT, ELEMENT.EMPTY, ELEMENT.ANY, ELEMENT.MIX, and GROUP.

Table 3: Item DTDM (DTDM-Item table)

The type field defines the type of an item or rather the type of the element content

in an element type declaration. ELEMENT.ELEMENT represents an element content.

ELEMENT.MIX represents a mix content. ELEMENT.EMPTY represents an empty

content. ELEMENT.ANY represents an ANY content. There are two new item types,

i.e., PCDATA represents PCDATA definition, and GROUP represents a group definition.

Field | Meaning
ID | Internal ID of this nesting relationship.
FromID | ID of parent item of this nesting relationship.
ToID | ID of child item of this nesting relationship.
Ratio | Cardinality between the parent element and child element.
Optional | Used to indicate whether a child element is optional or not.
Index | The schema order of the child element.

Table 4: Nesting DTDM (DTDM-Nesting table)

The two fields FromID and ToID reference the parent item and the child item that participate in a nesting relationship. The Index field captures the Schema Ordering Property, denoting the position of this child item in the parent item’s definition. In a sequence group, each child item has a different index value. When all children belong to a choice group, all of their index fields have the same value.

The occurrence property for a child element is captured by the combination of the Ratio and Optional fields. The Ratio field shows the cardinality between the instances of the parent item and of the child item. Since nesting relationships always run from one element type to its sub-elements in the DTD, the Ratio field holds only one-to-one or one-to-many relationships. The Optional field is true or false depending on whether this relationship is defined as optional in the DTD.
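One way to picture how the Ratio and Optional fields could be derived is from the occurrence indicator that follows a child name in a DTD content model. The sketch below is an assumption made for illustration, not the Rainbow implementation: the class name Occurrence is hypothetical, and only the cases for book* (1:n, optional) and an unmarked child (1:1, not optional) are confirmed by the example tuples in this report; the treatment of ? and + is inferred from the usual DTD meaning of those indicators.

```java
public class Occurrence {
    public final String ratio;      // "1:1" or "1:n"
    public final boolean optional;  // true when the child may be absent

    private Occurrence(String ratio, boolean optional) {
        this.ratio = ratio;
        this.optional = optional;
    }

    /** Maps a child specification such as "book*" or "title" to Ratio and Optional. */
    public static Occurrence of(String childSpec) {
        char last = childSpec.isEmpty() ? ' ' : childSpec.charAt(childSpec.length() - 1);
        switch (last) {
            case '?': return new Occurrence("1:1", true);   // zero or one (assumed mapping)
            case '*': return new Occurrence("1:n", true);   // zero or more
            case '+': return new Occurrence("1:n", false);  // one or more (assumed mapping)
            default:  return new Occurrence("1:1", false);  // exactly one
        }
    }
}
```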

Field | Meaning
ID | Internal ID for this attribute.
PID | ID of parent item.
Name | Name of this attribute.
Type | Type of this attribute, e.g., ID, IDREFS.
Default | A keyword or a default literal value of this attribute, e.g., #IMPLIED

Table 5: Attribute DTDM (DTDM-Attribute table)

To better understand how a DTD document is mapped into each of the described DTDMs, let us revisit the DTD document example given in Figure 2.

DTD:
<!ELEMENT prices (book*)>
<!ELEMENT book (title, source, price)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT source (#PCDATA)>
<!ELEMENT price (#PCDATA)>

This DTD document will be loaded into the three relations as shown in Table 6.

DTDM-Item

ID | Name | Type
1 | prices | ELEMENT.ELEMENT
2 | book | ELEMENT.ELEMENT
3 | title | ELEMENT.MIX
4 | source | ELEMENT.MIX
5 | price | ELEMENT.MIX
6 | PCDATA | PCDATA

DTDM-Nesting

ID | FromID | ToID | Ratio | Optional | Index
7 | 1 | 2 | 1:n | true | 0
8 | 2 | 3 | 1:1 | false | 1
9 | 2 | 4 | 1:1 | false | 2
10 | 2 | 5 | 1:1 | false | 3
11 | 3 | 6 | 1:1 | false | 0
12 | 4 | 6 | 1:1 | false | 0
13 | 5 | 6 | 1:1 | false | 0

DTDM-Attribute

ID | PID | Name | Type | Default
14 | 6 | value | PCDATA | #REQUIRED

Table 6: DTDMs for Figure 2’s DTD

The five elements, namely prices, book, title, source, and price, are stored as tuples in the DTDM-Item relation, together with a sixth item for PCDATA. The relationships between these items are stored as tuples in the DTDM-Nesting relation. For example, the one-to-many relationship between element prices and element book is recorded in the tuple with ID 7 of the DTDM-Nesting relation. The three elements title, source, and price each contain PCDATA, so their relationships with the PCDATA item are stored in the DTDM-Nesting tuples with IDs 11, 12, and 13. Lastly, attributes are stored in the DTDM-Attribute relation: the PCDATA content is represented by tuple 14, whose Name field is value.

3.2 Data Schema

DTDMs provide meta-information about the structure of an XML document in a form that can be queried to generate a relational schema. That schema is used to load the data of XML documents that comply with the DTD from which the DTDMs were generated. One relation is generated for each item tuple in the Items relation, and each generated relation stores the occurrences of the corresponding item type from the loaded XML documents. By default, these relations contain three columns: an internal identification number (iid), a parent identification number (pid), and an order number among sibling items. The iid value identifies a particular item instance and is the primary key of each item’s relation, so it must be unique; each tuple represents one instance of the item. The pid field references the iid of the item within which this item is nested. Finally, the order field identifies the item’s position among its sibling items, that is, the items that have the same pid. If an item has attributes, each attribute becomes an additional column of the item’s relation.
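The rule just described (three default columns plus one column per attribute) can be sketched as a small generator of CREATE TABLE statements. This is an illustrative sketch under stated assumptions, not the Rainbow schema creator: the class name ItemSchema is hypothetical, the column types INTEGER and VARCHAR(255) are assumptions because the report does not state them, and the order column is written as a quoted identifier because ORDER is a reserved word in SQL.

```java
import java.util.List;

public class ItemSchema {
    /** Builds the DDL for one item relation: iid, pid, order, then one column per attribute. */
    public static String createTable(String itemName, List<String> attributeNames) {
        StringBuilder sql = new StringBuilder("CREATE TABLE " + itemName + " (");
        // Default columns shared by every item relation (types are assumed).
        sql.append("iid INTEGER PRIMARY KEY, pid INTEGER, \"order\" INTEGER");
        for (String attr : attributeNames) {
            sql.append(", ").append(attr).append(" VARCHAR(255)");
        }
        return sql.append(")").toString();
    }
}
```

For the PCDATA item of the running example, whose only attribute is named value, the generator yields a relation with the columns iid, pid, order, and value.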

To illustrate the loading of an XML document using the DTDMs, let us continue with the example from Figure 2.

Compliant XML:
<prices>
  <book>
    <title>Advanced Programming in the Unix environment</title>
    <source>www.amazon.com</source>
    <price>65.95</price>
  </book>
  <book>
    <title>TCP/IP Illustrated</title>
    <source>www.amazon.com</source>
    <price>65.95</price>
  </book>
</prices>

With the DTD for this XML document already loaded as shown in Table 6, the relational schema generated from these DTDMs, together with the loaded XML data, is shown in Table 7.

prices

iid | pid | order
1 | 0 | 1

book

iid | pid | order
2 | 1 | 0
3 | 1 | 0

title

iid | pid | order
4 | 2 | 1
5 | 3 | 1

source

iid | pid | order
6 | 2 | 2
7 | 3 | 2

price

iid | pid | order
8 | 2 | 3
9 | 3 | 3

PCDATA

iid | pid | order | value
10 | 4 | 1 | Advanced Programming in the Unix environment
11 | 6 | 2 | www.amazon.com
12 | 8 | 3 | 65.95
13 | 5 | 1 | TCP/IP Illustrated
14 | 7 | 2 | www.amazon.com
15 | 9 | 3 | 65.95

Table 7: Data Relations for Figure 2’s XML Document

The six relations in Table 7 capture all the information in the XML document. Each element type has its own relation, and the instances of each element type in the XML document are stored as tuples of the corresponding relation. Here, the one instance of the element prices contains two instances of the element book. Each of the two book elements has one instance of each of the elements title, source, and price. The values of the title, source, and price elements, as mentioned earlier, are stored in the value field of the PCDATA relation.
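The loading step can be pictured as a walk over the parsed XML document that emits one tuple per element or text node. The following is a minimal sketch, assuming a DOM parser from the standard javax.xml.parsers package; the class XmlShredder and its Row type are hypothetical and are not the XML Manager's code. It numbers items in document order and counts sibling positions from 1, so the iid and order values it produces will differ from the numbering used in Table 7, which follows the original system's own scheme.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class XmlShredder {
    /** One tuple of an item relation; value is used only by the PCDATA relation. */
    public static final class Row {
        public final String relation;
        public final int iid, pid, order;
        public final String value;
        Row(String relation, int iid, int pid, int order, String value) {
            this.relation = relation; this.iid = iid; this.pid = pid;
            this.order = order; this.value = value;
        }
    }

    private final List<Row> rows = new ArrayList<>();
    private int nextIid = 1;

    /** Parses the XML text and returns its tuples in document order. */
    public static List<Row> shred(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            XmlShredder shredder = new XmlShredder();
            shredder.visit(doc.getDocumentElement(), 0, 1);  // the root has pid 0
            return shredder.rows;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    private void visit(Node node, int pid, int order) {
        int iid = nextIid++;
        if (node.getNodeType() == Node.TEXT_NODE) {
            rows.add(new Row("PCDATA", iid, pid, order, node.getNodeValue().trim()));
            return;
        }
        rows.add(new Row(node.getNodeName(), iid, pid, order, null));
        NodeList children = node.getChildNodes();
        int position = 1;  // position among siblings that share this pid
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            boolean isElement = child.getNodeType() == Node.ELEMENT_NODE;
            boolean isText = child.getNodeType() == Node.TEXT_NODE
                    && !child.getNodeValue().trim().isEmpty();  // skip layout whitespace
            if (isElement || isText) {
                visit(child, iid, position++);
            }
        }
    }
}
```

Each returned Row names the relation it belongs to, so inserting the rows into the generated item relations is a matter of grouping them by relation name.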

3.3 DTD Manager and XML Manager Extensions

When this project started, the DTD Manager and XML Manager were the two main functionalities of the Rainbow system that had already been implemented. Because these two subsystems are responsible for loading and exporting DTDs and XML documents, understanding and extending them was the first step of the implementation phase.

Using these resources, the team first designed an XML document and DTD from scratch. Then, using the existing DTD Manager and XML Manager subsystems, the team loaded the XML document and its DTD into relational tables stored on the Oracle8i system. The first task the project approached was exporting a single DTD from its database store back into the form of a DTD document. This task gave the project team the exposure necessary to understand what XML documents and their DTDs are, as well as how the existing managers integrate. With an output in the form of a proper DTD, the team completed its first task and gained a basic understanding of the inner workings of the existing system.

3.3.1 Original DTD Manager and XML Manager

At the start of this project, the DTD Manager and XML Manager were already in place. To recap the functions of these two subsystems: the DTD Manager is responsible for importing DTD documents into and exporting them out of an RDBMS, and the XML Manager handles the importing and exporting of XML documents into and out of the RDBMS. Together, these two management subsystems handle the loading of a single DTD and an XML document that is compliant with the DTD.

The DTD Manager generates DTDMs from a loaded DTD, and the manager’s

schema creator component takes the information from these DTDMs and generates XML

data schemas for the loading of XML documents. In turn, the XML manager will utilize

a fixed mapping to store XML data into these XML data relations. Figure 6 shows the

XML Manager and DTD Manager subsystems with a separate process, Schema Creator.

The actual implementation of the DTD Manager incorporated the Schema Creator

process’s function of creating XML data schemas.

3.3.2 Support for Multiple DTDs and XMLs

A useful data management system requires support for multiple documents. Hence, we needed to provide the ability to load multiple DTDs and XML documents into the relational database, which allows operations to be performed on several documents at one time. To support the loading of multiple DTDs and XML documents, an additional column that stores the ID of the particular document must be added to the existing DTDMs. Because the data relations that store instances of items from XML documents have their schemas generated from these DTDMs, this generation must also be extended to add an ID field identifying the XML document from which each item instance originated.

[Figure 6 diagram: XML and DTD documents (XML data), the XML Manager and DTD Manager (subsystems), and the Schema Creator (process)]

Figure 6: DTD Manager and XML Manager

An XML data relation for items has a column for the ID of the XML document from which each item instance originated, and a DTDM has a field for the ID of the DTD document from which each item originated. This ID field becomes part of the key and is used to differentiate similar items from different DTDs or XML documents. Two catalog relations store information about the documents: one stores DTD IDs and their corresponding URIs, and the other stores XML IDs and their corresponding URIs. These catalog relations allow the database administrator to refer to documents by URI without needing to use their internal IDs.

4 Flexible Mapping Support in the Rainbow System

The motivation and architecture of the Rainbow System have already been

discussed in Section 1. Here we present some highlights of the Rainbow System that

pertain to this project as a guideline for our project goals.

Highlights of the system include:

Rainbow keeps track of DTD documents in the DTDM repository.

Rainbow automatically generates the table schema from the DTDM.

Rainbow has a reversible restructuring feature built in, stored as restructure operators in the restructure operator library.

Java is the language of choice for the implementation of the restructuring subsystem since it is a platform-independent, object-oriented language. Because the DTD and XML Managers were already written in Java, the project team continued to use this language when extending them to support the loading of multiple documents. The existing subsystems, the DTD and XML Managers, managed XML data using the commercial Oracle8i relational database management system (RDBMS). Oracle8i has many features for the manipulation of relational tables. This RDBMS provides the services necessary for querying its data using the standard Structured Query Language (SQL), often pronounced ‘sequel’. Querying the database provides an efficient means of accessing and modifying the data stored in it. Given these advantages, the team decided to continue using an Oracle8i RDBMS as the database server.

Because the first two highlights concern the DTD Manager and XML Manager, these were the first subsystems we worked on.

4.1 Restructuring Subsystem

4.1.1 The Restructuring Functionality

The Restructuring subsystem aims to provide services that allow the Rainbow system to achieve one of its primary goals, query optimization. To achieve this goal, a system must be in place to enable flexible mapping of the XML data mapped by the XML Manager. This system should perform various restructurings on the initial fixed mapping to achieve flexible mapping. The flexible mapping capacity of the Rainbow system aims to decrease the query processing time of specific query loads given by a database administrator. As shown in Figure 1 and discussed in Section 1, an Optimizer process intelligently selects a set of restructuring operations to perform on the mapped data, given this query load as input. Figure 7 captures the original components of the Rainbow system that work together to provide the restructuring functionality.

[Figure 7 diagram. Components shown: Basic Storage Manager, DTD Manager, Restructure, Optimizer, Restructure Operator Library, Query Storage Mapping, DBA, and XML Query Load. Legend: Sub-system, XML Data, Process, Relational Model.]

Figure 7: Restructure Function

The possible restructuring operations come from the Restructuring Operator Library. Later in this section, we describe the operators constituting this library in further detail. As an overview, a restructuring operator fits into the restructuring functionality by applying a query to the mapped data to generate a new mapping. The Restructure process executes a series of restructuring operations to produce flexible mappings from a fixed data mapping.

4.1.2 A Prototype Design

The first goal in designing the restructuring functionality was to define a subsystem for restructuring. This restructuring subsystem provides the core functionality that the Rainbow restructuring function requires. The first requirement identified was a Restructurer process that executes a restructuring mapping. The project team designed a Restructurer class whose running process takes a mapping as input. Such a mapping object contains a series of restructuring operations to be performed on the XML data mapped by the XML Manager in conjunction with the DTD Manager. The other input for the process is the Restructuring Operator Library, whose contents are discussed later in this section. The library essentially contains the SQL templates for manipulating the XML data mapped in the RDBMS.

The Restructurer process reads the required restructuring operations from the mapping object and then calls the corresponding restructuring operators of the Restructuring Operator Library to perform the necessary restructuring.

Figure 8 shows the breakdown of the Restructuring subsystem into its components.

[Figure 8 diagram: the Restructuring subsystem, containing the Mapping and the Restructuring Operator Library (data) and the Restructurer (process)]

Figure 8: Restructuring Subsystem

This Restructuring subsystem does not incorporate the Optimizer process, which takes a query load as input and intelligently generates a flexible mapping that best optimizes the query performance for that load, using information from the mapping, the DTDMs, and the Restructuring Operator Library. Instead, this subsystem is a simplified version that assumes the administrator decides upon a good mapping for the XML data and then calls the Restructurer process to perform the restructuring with the mapping object as input.

4.1.3 Implementation Details

The implementation of the Restructuring Subsystem follows the UML design that was produced first. Figure 9 shows the Restructuring Subsystem broken down into classes in UML. Mapping is an object that holds all the restructuring operations. The OperatorInterface class is a template for all operators to follow: every operator that implements OperatorInterface must provide an implementation of the public method Execute(). The 11 operator classes in this figure correspond to the 11 operators that are defined later in this section. Lastly, the Restructurer class contains a Java Vector container, which it initializes through the method readOperators() from the operations specified by the Mapping input file. Its public method runOperators() calls the method Execute() of each operator in the Vector container.

[Figure 9 class diagram, summarized:

Mapping: holds the operations Op1, Op2, ...

Restructurer: Vector of Operator objects named operators (contains the list of operators); private ReadOperators(File inp) reads operators from the input file and stores them in vector format; public runOperators() matches each operator to the corresponding method and runs execute for that method with the appropriate arguments.

Operator: String OperatorName; String Parameters[ ]

OperatorInterface: <virtual> Execute()

Operator classes, each providing public Execute(): RenameAttribute, PushDownAttribute, PushUpAttribute, RenameItem, Dereference, Reference, SplitNesting, MergeNesting, PushDownNesting, PushUpNesting, SwitchNesting]

Figure 9: Restructuring Subsystem Class Diagram

After breaking the subsystem down into its necessary components, namely the Mapping object, the Restructuring Operator Library object, and the Restructurer process, the team developed the Restructurer process first.

The Restructurer process has to read from the Mapping component, so its first task is to parse an input file. This input file is essentially the Mapping component. It contains a series of operators with specified arguments of type item, attribute, or nesting, selected by a user to yield a mapping that may be beneficial for particular kinds of queries. Once these operators are instantiated with the specified arguments, we refer to them as operations. The Restructurer process parses the series of operations, stores them locally, and instantiates the corresponding operator classes of the Restructuring Operator Library. Once the entire series of operations has been parsed and the individual operators of the library have been instantiated, the Restructurer process calls these operators to execute one by one. Executing an operator runs its instantiated query templates, thereby changing both the DTDMs and the XML mapping.

The Restructuring Operator Library is a set of restructuring operator classes. The library is built around an operator interface that describes the functionality each operator must provide when called by the Restructurer process. Each operator must contain a method for instantiation and a method for executing the instantiated SQL template. The SQL templates are defined within the operator classes and are described in detail later in this section. Once the templates are instantiated, they are stored in the local process space of the running operator class. When the Restructurer process calls the operator processes to execute, they run the instantiated SQL templates, which by then are complete SQL statements, to perform the restructuring. The execution of a series of these operator processes generates the mapping that was specified by the user.

To illustrate how the classes in Figure 9 work together, let’s observe an example.

If Mapping contains the operation “fooOperator(arg1, arg2, arg3)”, the Restructurer class

adds an instance of fooOperator with the arguments arg1, arg2, and arg3 to the Vector

container when the method readOperators() is called by the Restructurer. When the

method runOperators() is called, the Restructurer class calls the method Execute() for

each object in the Vector container. In this example, the only object will be an instance

of fooOperator and calling its method Execute() will evaluate the code inside the

fooOperator class. The code in the fooOperator class utilizes SQL queries which do the

actual updates to the DTDMs and the restructuring of the XML data that is mapped.
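The fooOperator walkthrough above can be sketched in a few lines of Java. This is a simplified illustration and not the Rainbow source: the class name RestructurerSketch is hypothetical, a List stands in for the Vector container, method names follow the lower-case Java convention, the mapping is passed as a list of text lines instead of a file, and the stand-in FooOperator records a log entry where a real operator would issue its SQL statements against the DTDMs and data relations.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class RestructurerSketch {
    /** Contract that every restructuring operator fulfils. */
    interface OperatorInterface {
        void execute(List<String> log);
    }

    /** Stand-in operator: records what it would do instead of running SQL. */
    static final class FooOperator implements OperatorInterface {
        private final String[] args;
        FooOperator(String[] args) { this.args = args; }
        public void execute(List<String> log) {
            log.add("fooOperator" + Arrays.toString(args));
        }
    }

    private final List<OperatorInterface> operators = new ArrayList<>();

    /** Parses lines such as "fooOperator(arg1, arg2, arg3)" into operator objects. */
    public void readOperators(List<String> mappingLines) {
        for (String line : mappingLines) {
            String text = line.trim();
            int open = text.indexOf('(');
            int close = text.lastIndexOf(')');
            if (open < 0 || close < open) {
                throw new IllegalArgumentException("Malformed operation: " + line);
            }
            String name = text.substring(0, open).trim();
            String inner = text.substring(open + 1, close).trim();
            String[] args = inner.isEmpty() ? new String[0] : inner.split("\\s*,\\s*");
            operators.add(create(name, args));
        }
    }

    /** Looks up the operator class for a name; only the stand-in is registered here. */
    private static OperatorInterface create(String name, String[] args) {
        if (name.equals("fooOperator")) {
            return new FooOperator(args);
        }
        throw new IllegalArgumentException("Unknown operator: " + name);
    }

    /** Executes the stored operations in the order in which they were read. */
    public List<String> runOperators() {
        List<String> log = new ArrayList<>();
        for (OperatorInterface op : operators) {
            op.execute(log);
        }
        return log;
    }
}
```

Keeping the parsing in readOperators() and the work in each operator's execute method means that adding a twelfth operator only requires a new class and one more case in the lookup.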

4.2 Restructuring Operators

To support the restructuring functionalities of the Rainbow System to achieve

flexible mapping, we have developed a set of restructuring operators implemented by

view technology. The restructuring operators will restructure the relational data set into

another relational format optimized for query evaluation.

So far, 11 restructuring operators are defined in the Restructuring Operator Library. The library stores a collection of reversible restructuring operators for optimization purposes. Reversible means that the restructuring operators keep track of the changes they make, so that the original data can easily be restored. An optimizer takes as input a given XML query load specified by a database administrator and the DTDM tables, which model the current structure of the relational database. It generates a mapping by applying the restructuring operators provided by the restructuring operator library. A mapping specifies a sequence of restructuring operators to be applied to the different element types defined in the DTD. Then, the restructuring manager actually transforms the initially loaded data into the desired optimized format, which is then used for efficient querying [8].

Reversible restructuring operators include Rename Item, Rename Attribute,

Pushup Attribute, Pushdown Attribute, Pushup Nesting, Pushdown Nesting, Switch

Nesting, Merge Nesting, Split Nesting, Reference, and Dereference. Each operator is

composed of two parts, the DTDM transformation and corresponding relational data

transformation.

Next, we will explain the pushup attribute operator in more detail as an example to illustrate the general concept of an operator.

[Figure 10 diagram. Left, DTD modifications: before the pushup (In), attribute x belongs to element B, the child of element A; after the pushup (Out), x belongs to A. Right, data changes:

CREATE VIEW out.$A AS
SELECT p.<all_columns>, c.$x
FROM in.$A p, in.$B c
WHERE c.pid = p.iid

CREATE VIEW out.$B AS
SELECT <all-columns-but-x>
FROM in.$B]

Figure 10: Pushup Attribute Operator

On the left of Figure 10, the pushup attribute operator pushes attribute X of element B up to element A as attribute X. The change made to the DTD metadata is that the pid (parent ID) field of attribute X changes from the iid (item ID) of element B to the iid of element A.

In addition to the changes made to the DTD schema, the operator uses two queries to restructure the XML data, as depicted on the right of Figure 10. The first query creates a view on top of the relation corresponding to element A, adding attribute X as a new field. The second query creates a view on top of the relation corresponding to element B that projects every field except the field for attribute X. The details of the remaining operators follow.

4.2.1 Pushup and Pushdown Attribute Operators

Pushup/down attribute operators will push up an attribute from a child item to its

parent item, or vice versa it will push down an attribute from an item to its child item.

[Figure: elements A and B and attribute X, with Push-up and Push-down arrows between the two schemas.]

Figure 11: Pushup and Pushdown Attribute

Here are the SQL templates:

pushUpAttribute (ChildItemName, ChildAttributeName, ParentItemName, ParentAttributeName)

    CREATE VIEW <new.ParentItemName> AS
    SELECT p.<all-columns>, c.<ChildAttributeName> AS <ParentAttributeName>
    FROM <old.ChildItemName> c, <old.ParentItemName> p
    WHERE c.pid = p.iid

    CREATE VIEW <new.ChildItemName> AS
    SELECT <all-columns-but-ChildAttributeName>
    FROM <old.ChildItemName>

pushDownAttribute (ParentItemName, ParentAttributeName, ChildItemName, ChildAttributeName)

    CREATE VIEW <new.ParentItemName> AS
    SELECT <all-columns-but-ParentAttributeName>
    FROM <old.ParentItemName>

    CREATE VIEW <new.ChildItemName> AS
    SELECT c.<all-columns>, <ParentAttributeName> AS <ChildAttributeName>
    FROM <old.ParentItemName> p, <old.ChildItemName> c
    WHERE p.iid = c.pid
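The placeholders in these templates have to be expanded into concrete column lists before the statements can run. The sketch below shows one way this could be done for pushDownAttribute. It is an illustration under stated assumptions rather than the Rainbow implementation: SQLite stands in for Oracle8i, the helpers columns and push_down_attribute are hypothetical, the new_ prefix stands in for the new schema, and the book and chapter relations are made up.

```python
import sqlite3

def columns(conn, relation):
    """Return the column names of a table or view (SQLite-specific PRAGMA)."""
    return [row[1] for row in conn.execute(f"PRAGMA table_info({relation})")]

def push_down_attribute(conn, parent, parent_attr, child, child_attr):
    """Instantiate the pushDownAttribute template; new_<name> stands in for <new.name>."""
    # Expands <all-columns-but-ParentAttributeName>.
    parent_cols = ", ".join(c for c in columns(conn, parent) if c != parent_attr)
    # Expands c.<all-columns>.
    child_cols = ", ".join(f"c.{c}" for c in columns(conn, child))
    conn.execute(f"CREATE VIEW new_{parent} AS SELECT {parent_cols} FROM {parent}")
    conn.execute(
        f"CREATE VIEW new_{child} AS "
        f"SELECT {child_cols}, p.{parent_attr} AS {child_attr} "
        f"FROM {parent} p, {child} c WHERE p.iid = c.pid"
    )

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE book (iid INTEGER, pid INTEGER, lang TEXT);
CREATE TABLE chapter (iid INTEGER, pid INTEGER);
INSERT INTO book VALUES (1, 0, 'en');
INSERT INTO chapter VALUES (10, 1), (11, 1);
""")
push_down_attribute(conn, "book", "lang", "chapter", "book_lang")

new_book_columns = columns(conn, "new_book")
new_chapter_rows = conn.execute("SELECT * FROM new_chapter ORDER BY iid").fetchall()
print(new_book_columns)
print(new_chapter_rows)
```

Every chapter row receives a copy of its parent's lang value, and the view over book no longer has that column.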

4.2.2 Rename Item and Attribute Operators

Rename item and rename attribute rename an item and an attribute, respectively. They can easily be implemented using the DTDM primitives. Here are the SQL templates:

renameItem (OldItemName, NewItemName)

    CREATE VIEW <new.NewItemName> AS
    SELECT *
    FROM <old.OldItemName>;

renameAttribute (ParentItemName, OldAttributeName, NewAttributeName)

    CREATE VIEW <new.ParentItemName> AS
    SELECT <OldAttributeName> AS <NewAttributeName>, <rest-of-columns>
    FROM <old.ParentItemName>;

4.2.3 Pushup and Pushdown Nesting Operators

The pushup and pushdown nesting operators move an item within the nesting hierarchy: pushup makes a child item a sibling of its former parent, and pushdown conversely makes an item a child of one of its sibling items.

Figure 12: Pushup and Pushdown Nesting

Here are the SQL templates:

pushUpNesting (MovedItemName, FromPosition, ChildItemName, ParentPosition, ParentItemName, ToPosition)

Without considering the position, this would correspond to the query given below:

    CREATE VIEW <new.MovedItemName> AS
    SELECT m.<all-columns-but-pid>, c.pid
    FROM <old.MovedItemName> m, <old.ChildItemName> c, <old.ParentItemName> p
    WHERE m.pid = c.iid AND c.pid = p.iid

pushDownNesting (MovedItemName, FromPosition, ChildItemName, ParentPosition, ParentItemName, ToPosition)

Without considering the position, this would correspond to the query given below:

    CREATE VIEW <new.MovedItemName> AS
    SELECT m.<all-columns-but-pid>, c.iid AS pid
    FROM <old.MovedItemName> m, <old.ChildItemName> c, <old.ParentItemName> p
    WHERE m.pid = p.iid AND c.pid = p.iid
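The effect of the pushUpNesting query, which replaces the moved item's pid by the pid of its former parent, can be checked on a three-level toy hierarchy. This is a minimal sketch assuming SQLite in place of Oracle8i; the relation names old_P, old_C, old_M and new_M, the label column, and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- P is the parent of C, and C is the parent of the moved item M.
CREATE TABLE old_P (iid INTEGER, pid INTEGER);
CREATE TABLE old_C (iid INTEGER, pid INTEGER);
CREATE TABLE old_M (iid INTEGER, pid INTEGER, label TEXT);
INSERT INTO old_P VALUES (1, 0);
INSERT INTO old_C VALUES (10, 1), (11, 1);
INSERT INTO old_M VALUES (100, 10, 'a'), (101, 11, 'b');

-- pushUpNesting without positions: M takes over the pid of its former parent C.
CREATE VIEW new_M AS
  SELECT m.iid, m.label, c.pid
  FROM old_M m, old_C c, old_P p
  WHERE m.pid = c.iid AND c.pid = p.iid;
""")

moved = conn.execute("SELECT iid, pid, label FROM new_M ORDER BY iid").fetchall()
print(moved)
```

After the pushup, both M rows point at P (iid 1), so M has become a sibling of C.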

4.2.4 Other Operators

Due to time constraints, we were not able to implement the Switch Nesting, Merge Nesting, Split Nesting, Reference, and Dereference operators. Switch Nesting was partially implemented but needs further modification and improvement. Switch Nesting switches two nesting relationships within the same parent. Merge Nesting merges the nestings of two items. Split Nesting splits the nesting between two items. Reference breaks a nesting relationship between two items by assigning an ID attribute to the child item and adding an IDREF(s) attribute to the parent item, which together are used to represent that nesting relationship. Dereference creates a nesting relationship between the items that have the ID and IDREF(s) attributes, respectively [8].

4.3 Rainbow Graphical User Interface

The Rainbow Interface allows the administrator to do the restructuring of a DTD

and its loaded XML from within a GUI environment by giving access to the functions of

the Rainbow System. The GUI environment eliminates the chore of having to manually

run classes of the Rainbow System. In other words, it gives the administrator a more

convenient way of selecting XML documents for loading, specifying parameters for the

operators, and viewing the tables contained in the database at any time (before or after the

restructuring).

Let us examine the sequence of steps one would take to do a simple restructuring.

The primary step that must be taken before anything else can be done is to establish a

connection with the Oracle Database. Then, an XML document has to be imported into

the database so that it can be restructured. Any imported documents can be viewed in a

table format. In order to do the restructuring, the administrator has to select a sequence of

operators and give each a set of parameters. Once the restructuring is done the

administrator can choose to export the modified data back into a DTD file on the

administrator's local computer.

The following screenshots give the main idea of the interface's appearance. (To switch between the various tabs of the Work Window, the administrator only has to click on the tab corresponding to the appropriate window.) The first screenshot is the main window of the Rainbow Interface. Its menu bar contains options for importing and exporting documents, establishing connections, entering manual queries into the database, etc. Screenshot 2 shows the Work Window with the DB Tab selected. The main purpose of this window is to give the administrator information about what kind of data is currently in the database. It displays all the tables in the database and the data of each table. Screenshot 3 shows the Work Window with the Operators Tab selected. In this window the administrator does the restructuring by selecting the desired operators and inputting the appropriate arguments. The main window lists all the tables the user requested.

The left column represents the names of the tables. The right column represents (in

order) the ID# of the item, the item name, item type, the item DTD id

Figure 13: Screenshot 1

[Screenshot annotations: Main window message field. The administrator is entering a query manually.]

[Screenshot annotations: The administrator selects which table to view. The data of the selected table appears here.]

[Screenshot annotations: The administrator selects an operator. An argument is selected and a value is entered. All the selected operators appear here.]

5 Implementation Details

5.1 System Architecture

Prior to the start of this MQP, the DTD and XML Managers were implemented to handle only one XML/DTD pair. The project team modified and extended these modules to support multiple XMLs and their DTDs. The team also designed and implemented the Restructuring Subsystem. Lastly, the XML Query Engine shown in the architecture in Figure 16 lies outside the scope of this project and has not yet been implemented.

Figure 16: Rainbow Architecture with RDBMS

[Figure components: User, XML Query, XML Query Engine, XML Manager, DTD Manager, Restructuring Subsystem, and RDBMS, with XML and DTD inputs. Legend: XML Data, Subsystem.]

5.2 Code Facts

The completed Rainbow system totals 44 classes, 17 of which were coded from scratch by the Rainbow MQP team. In addition to these 17 new classes, the Rainbow System takes advantage of existing code, much of which was extended to support new functionality. Of the remaining 27 classes, eight are preexisting and unchanged, and nineteen are preexisting but extended. Pie charts of the class facts can be seen in Figure 17.

Figure 17: Statistics of the Class Implementation

5.3 Existing System Packages

The implementation of the Rainbow System is contained in 8 packages. The

DTDMObjects package contains classes that encapsulate the DTDMs into objects with

methods for accessing and modifying the data of each of the DTDM relations. The

exportDTD package contains the classes that provide the functionality of exporting a

DTD from the database. The JDBCClient package contains classes that encapsulate

database connections into easy-to-understand objects utilized by every class that needs

connections to the database. The MetadataDrivenLoader package contains a class that

allows for the generation of unique identifying numbers for relations in a database. The

Operators package contains the operator interface class and all the restructuring operator

classes. The Restructuring package contains the class that encapsulates the XML Catalog

relation in objects for easy accessing and modifications. It also contains the Restructurer

class. The StoreDTD package contains the classes that generate the DTDM schema and

the loading of multiple DTDs into a database. The XMLRDBMSUpdate package

contains the classes that generate the XML data schema and the loading of multiple

XMLs into a database. Two other packages, namely, DTDWrapper and Utils, were used

to facilitate implementation in general.

5.4 Implementation Environment

All class extensions and implementations were programmed in Java 1.2 using

JDK 1.2.2 running on a Digital UNIX 64 terminal on the WPI LAN. The database server

is a PC, PII 300MHz with 256 MB memory, running Microsoft NT Server with Oracle8i

software. The GUI was developed in Visual Café on a PII 400MHz with 128 MB

memory, running Windows 98. It was tested on a Windows NT system and compiled and ran

successfully in various Java development environments (Visual Café, JDeveloper, etc.).

6 Experimental Evaluation

The purpose of the experiments is twofold: one, to evaluate the performance of

loading and restructuring XML data and their DTDs, and two, to evaluate the

performance of queries evaluated against fixed mapping and restructured data. In

evaluating the outcome of the experiment, one must consider the overhead associated

with loading the data and with getting the internal representation of the data in RDBMS.

When we speak of restructuring data, we refer to one or a set of restructuring operators

applied in sequence. The motivation for using this set of restructuring operators is the

expectation that this will improve the performance of query time.

The remainder of this section is divided into two major parts: evaluation of restructuring time and evaluation of query processing time. These are the two areas in which we hope our experiments will support some satisfactory conclusions.

6.1 Experimental Setup

6.1.1 Scope and Design of a Test Plan

The proposal of the Rainbow system is a product of analyses done by many

graduate students and Professor Elke A. Rundensteiner. The main focus when we

designed the test plans was outlined by what the system had to achieve: update

propagation capacity, and query evaluation optimization.

6.1.2 Designing an Experimental Test Bed

After cycles of testing and debugging established that the system was complete and functional, our goal was to design an evaluation system. The evaluation system should not only yield conclusive data that outlines the benefits and limitations of the system in terms of performance versus overhead under varying scenarios; it must also be reliable. The evaluation system was set up so that it either tolerates unfactored influences, such as outside processes taking up microprocessor time, or avoids these unexpected costs. It ran each experiment five times to reduce the effect of unfactored influences that may distort a particular timing, such as another CPU-heavy process scheduled to execute at some point during the testing.

Even with these precautions in the design of the evaluation system, all experiments had to be performed under the same conditions. By this we mean that a dedicated client and server must be selected, that this pair of machines be used for all experiments, and that the machines not be reconfigured or modified in any significant way. The project team chose a PC (Pentium 233MHz with 128 MB memory) running Microsoft NT Workstation as the database client and a PC (PII 300MHz with 256 MB memory) running Microsoft NT Server with Oracle8i as the database server. The network between the client and the server PCs remained unchanged throughout the experiments.

6.2 Performance Considerations

As described in the introductory portion of this section, the performance of restructuring data for query efficiency can be evaluated by considering two types of actions, namely restructuring and querying. Each experiment outlined in this report measures one of the two actions. The following describes in further detail what it means to evaluate each type of action.

6.2.1 Restructuring Time

Restructuring time includes the loading of data and additionally the restructuring

applied to the data. First we measured the performance of loading a set of documents and

then we measured the performance of applying a set of restructuring operations on the

loaded data.

Two different methods of applying a set of restructuring operations were utilized

to evaluate the performance of restructuring the data: single (series) restructuring and

batch restructuring. Series restructuring is running one operation on a set of data at a

time. Batch restructuring is running a set of operations on a set of data all at one time.

Since this project includes a restructuring component that executes all restructuring operations, the difference lies in the input given to this component: for series restructuring, a single operation is provided at a time, repeatedly until every operation has been applied, whereas for batch restructuring a list of operations is provided at once. With respect to Oracle8i, Series and Batch restructuring differ in when the views created by the restructuring operations are materialized. For Series restructuring, materialization is performed after each operation; for Batch restructuring, it is performed once after each set of operations.
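The difference in materialization points can be made concrete with a small sketch. It is not the Rainbow Restructurer: SQLite stands in for Oracle8i, every operation is a renameItem-style view (SELECT * over its source), CREATE TABLE ... AS SELECT is used as an assumed stand-in for materialization, and the names load, restructure, item_N and view_N are hypothetical.

```python
import sqlite3

def load():
    """Create an in-memory database holding one small, made-up relation."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE item_0 (iid INTEGER, pid INTEGER)")
    conn.executemany("INSERT INTO item_0 VALUES (?, ?)", [(1, 0), (2, 1), (3, 1)])
    return conn

def restructure(conn, n_ops, batch):
    """Apply n_ops view-based operations; return (final relation, materializations)."""
    source = "item_0"
    materializations = 0
    for i in range(1, n_ops + 1):
        view = f"view_{i}"
        conn.execute(f"CREATE VIEW {view} AS SELECT * FROM {source}")
        is_last = (i == n_ops)
        if not batch or is_last:
            # Series: materialize after every operation. Batch: only after the last one.
            table = f"item_{i}"
            conn.execute(f"CREATE TABLE {table} AS SELECT * FROM {view}")
            materializations += 1
            source = table
        else:
            # Batch: the next view is stacked on top of this unmaterialized view.
            source = view
    return source, materializations

series_conn, batch_conn = load(), load()
series_table, series_count = restructure(series_conn, 5, batch=False)
batch_table, batch_count = restructure(batch_conn, 5, batch=True)
series_rows = series_conn.execute(f"SELECT * FROM {series_table} ORDER BY iid").fetchall()
batch_rows = batch_conn.execute(f"SELECT * FROM {batch_table} ORDER BY iid").fetchall()
print(series_count, batch_count, series_rows == batch_rows)
```

Both strategies end with the same data, but series restructuring copies it once per operation while batch restructuring copies it once per set.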

6.2.2 Query Time

To evaluate the performance of query processing, we measured the time it took

for a set of queries to evaluate on the data before and after restructuring. The

measurement performed was on each query, not the set of queries as a whole. Each query

thereby yields a query-performance time for a set of data. All queries were designed by

the project team and were therefore not randomly generated or selected from some list.

6.3 Cost Factors

Numerous factors can influence the performance evaluation of the whole concept

of restructuring data for query efficiency.

Parameter   Description
OP#         Number of operations
OP-TYPE     Type of operator
DAT-SIZE    Data size
QY#         Number of queries
DU#         Number of data updates

Table 8: Parameters of Restructuring Evaluations

6.4 Experimental Data

The DTD designed by the project team for the experiment is depicted in Figure

18.

<!ELEMENT one (two+)>
<!ELEMENT two (three)>
<!ELEMENT three (four)>
<!ELEMENT four (five)>
<!ELEMENT five (six)>
<!ELEMENT six (seven)>
<!ELEMENT seven EMPTY>
<!ATTLIST seven attribute #REQUIRED>

Figure 18: Experiment DTD

The project team designed this DTD to yield deep nesting levels for the

evaluation of the experiments. The attribute embedded in the seventh level allows for

attribute information that may be queried. XML documents were then randomly

generated from this DTD utilizing IBM’s XML-generator [17]. With the data in place,

we could carry out the evaluations described in the following sections.
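To make the shape of such documents concrete, the sketch below builds a small document that follows the element structure of Figure 18. It is only an illustration using Python's standard xml.etree.ElementTree module, not IBM's XML generator; the function make_document and the sample attribute values are hypothetical, and no DTD validation is performed.

```python
import xml.etree.ElementTree as ET

# Elements nested below each <two>, outermost first, as declared in Figure 18.
CHAIN = ["two", "three", "four", "five", "six", "seven"]

def make_document(values):
    """Build <one> with one nested two..seven chain per attribute value."""
    root = ET.Element("one")
    for value in values:
        node = root
        for tag in CHAIN:
            node = ET.SubElement(node, tag)
        # The innermost element <seven> carries the queried attribute.
        node.set("attribute", value)
    return root

document = make_document(["red", "green", "blue"])
xml_text = ET.tostring(document, encoding="unicode")
print(xml_text)
```

Each attribute value sits six levels below the root, which is why retrieving it from the initially loaded relations requires a chain of joins.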

6.5 Evaluations of Restructuring Setup Time

We performed each experiment below five times to gather average findings. The motivation, data models, and analysis methods for each experiment are discussed in their respective sections. Note that these experiments are not necessarily mutually exclusive in their variable settings.

For simplicity, the three experiments discussed in this section will not observe

any data updates; data updates will be fixed at 0. Update propagations were ignored for

this set of preliminary experiments.

6.5.1 Experiment 1: Scalability of Increase in Operations

In this experiment, we aim to evaluate the overhead associated with restructuring

with a varying number of operations.

Nine tests were conducted, varying the number of restructuring operations in each test. Each test also evaluated the performance of the two restructuring methods: the fixed set of operations was applied first serially and then in batch.

We aimed to formulate some idea about the relation between restructuring

overhead and the number of restructuring operations. Additionally, we aimed to

formulate some idea about the performance difference between batch restructuring and

series restructuring in Oracle8i.

The result of this experiment is a plot of the number of renameItem operations (one of the most important operators) versus the average processing time in seconds for both batch and series restructuring. For Batch, the processing time is that of the whole set of operations; for Series, it is the sum of the processing times of the individual operations.

To account for the unfactored influences mentioned earlier in this section that may distort the results, each point on the graph corresponds to five identical runs, with the greatest and smallest timings discarded and the remaining three averaged.
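This averaging rule can be written down in a few lines. The function name trimmed_average and the sample timings below are hypothetical and serve only to illustrate the rule.

```python
def trimmed_average(timings):
    """Average five timings after discarding the greatest and the smallest."""
    if len(timings) != 5:
        raise ValueError("expected exactly five runs")
    ordered = sorted(timings)
    middle = ordered[1:-1]  # drop the smallest and the greatest timing
    return sum(middle) / len(middle)

# Made-up timings in seconds; the outlier 9.0 does not affect the result.
average = trimmed_average([4.0, 2.0, 9.0, 3.0, 1.0])
print(average)
```

A single run disturbed by another process therefore cannot shift the plotted value.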

The fixed-parameter settings for both batch and series restructuring:

OP-TYPE: renameItem
DAT-SIZE: 104KB
QY#: 0

Expectations

1. Since there is an overhead associated with restructuring beyond the direct

modifications of the DTDMs and materialization of the XML data views

generated by an operation, batch restructuring should take less processing

time than series restructuring. Each operation evaluated in series will

accumulate its own overhead.

2. We can expect the processing time to increase linearly as the number of restructuring operations increases, for both batch and series restructuring.

The graph in Figure 19 shows that although both series and batch restructuring exhibit linear growth, batch restructuring required less processing time.

Figure 19: Batch versus Serial Restructuring

This result is expected because batch restructuring only requires the

materialization of the views created by a set of operations. Materialization only occurs

once per mapping for batch restructuring. The average processing time is mostly taken

up by the materialization of the views created by the restructuring operations in the

database.

Batch restructuring will be used for the evaluations of the remaining experiments.

6.5.2 Experiment 2: Operation Scalability

Since any particular type of operation may be performed many times with

different parameters, a performance evaluation of the batch restructuring of many

operations, all of the same operator type, would yield some idea of each operator type’s

cost for evaluation. To get a better grasp of actual overhead costs, we materialize only

after the set of restructuring operations.

An operator type was tested with an increasing number of operations, using batch restructuring. The performance of the operator type was determined by measuring the processing time of a batch of operations of that type. The processing time for each set of operations is the sum of the processing times of its operations. We tested an operator type starting with one operation and incrementally added one operation at a time until we reached a batch restructuring of six operations.

This experiment yields a graph plotting the number of pushUpAttribute operations versus the average processing time, in seconds, needed to restructure with that many operations. The processing time is that of the whole batch of operations of the specific operator type, whose size is indicated on the x-axis.

Again, in order to account for the un-factored influences that may obscure the

results as mentioned earlier in this section, each plot point on the graph corresponds to

the average of several runs, five identical runs with the greatest and smallest numbers

taken out and the remaining three averaged.

The fixed-parameter settings:

OP-TYPE: pushUpAttribute
DAT-SIZE: 22MB
QY#: 0

Expectations

1. We hope to be able to conclude that the cost of an operator type observes

linear growth over the number of operations of its type.

The Operation Scalability experiment yielded the following results for the

pushUpAttribute operator as depicted in Figure 20.

Figure 20: Restructuring Overhead Results

We had hoped the restructuring overhead would increase linearly with the number of restructuring operations. The results in Figure 20, however, suggest that the overhead cost grew along an exponential or polynomial curve rather than a linear one. Much of the overhead cost came from the materialization of the views generated by the series of operations. What we should keep in mind is that the restructuring of the XML data yields better query performance, as one can see in the following query time evaluation. The more queries performed on the restructured data, the greater the benefits of restructuring become.

6.6 Query Time Evaluations for Restructured Schema

The experiment in this section evaluated query performance. The queries used for

evaluation in this section are performed on materialized restructured data.

6.6.1 Experiment 3: Query Performance

This experiment is concerned with the optimization of query performance. The

motivation for this experiment is the general assumption that the pushing up of XML

information with respect to nesting would yield better query evaluation time as a result of

a reduction in the number of joins necessary to find the data.

To evaluate query performance, we applied restructuring operations of the operator type pushUpAttribute and then measured the performance over a fixed data set, varying only the number of operations performed on it. The information we tried to retrieve was the value of a nested attribute, whose retrieval required joins. The evaluation ranges from one to six operations and, as discussed, the query runs on actual tables, not on non-materialized views.
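The relationship between pushups and joins can be sketched on a tiny instance of the experiment DTD. The code below is an illustration under stated assumptions, not the experimental code: SQLite replaces Oracle8i, each pushup is materialized directly as a table, the helper names load, push_up, join_query and measure and the values value0..value2 are hypothetical, and the sketch counts joins rather than measuring time.

```python
import sqlite3

LEVELS = ["one", "two", "three", "four", "five", "six", "seven"]

def load(conn, chains=3):
    """Create one relation per element and load `chains` nested two..seven chains."""
    for name in LEVELS:
        conn.execute(f"CREATE TABLE {name} (iid INTEGER, pid INTEGER)")
    conn.execute("ALTER TABLE seven ADD COLUMN attribute TEXT")
    conn.execute("INSERT INTO one VALUES (1, NULL)")
    iid = 1
    for c in range(chains):
        parent = 1
        for name in LEVELS[1:-1]:
            iid += 1
            conn.execute(f"INSERT INTO {name} VALUES (?, ?)", (iid, parent))
            parent = iid
        iid += 1
        conn.execute("INSERT INTO seven VALUES (?, ?, ?)", (iid, parent, f"value{c}"))

def push_up(conn, holder):
    """Move the attribute from LEVELS[holder] to its parent, materialized as tables."""
    child, parent = LEVELS[holder], LEVELS[holder - 1]
    conn.execute(
        f"CREATE TABLE {parent}_new AS SELECT p.iid, p.pid, c.attribute "
        f"FROM {child} c, {parent} p WHERE c.pid = p.iid"
    )
    conn.execute(f"CREATE TABLE {child}_new AS SELECT iid, pid FROM {child}")
    for name in (child, parent):
        conn.execute(f"DROP TABLE {name}")
        conn.execute(f"ALTER TABLE {name}_new RENAME TO {name}")

def join_query(pushups):
    """Build the query reaching the attribute from <one>; return (sql, join count)."""
    tables = LEVELS[: len(LEVELS) - pushups]
    aliases = [f"t{i}" for i in range(len(tables))]
    from_clause = ", ".join(f"{t} {a}" for t, a in zip(tables, aliases))
    conditions = [f"{aliases[i]}.pid = {aliases[i - 1]}.iid" for i in range(1, len(tables))]
    sql = f"SELECT {aliases[-1]}.attribute FROM {from_clause}"
    if conditions:
        sql += " WHERE " + " AND ".join(conditions)
    return sql, len(conditions)

def measure(pushups):
    """Apply `pushups` operations to fresh data; return (join count, sorted values)."""
    conn = sqlite3.connect(":memory:")
    load(conn)
    for step in range(pushups):
        push_up(conn, len(LEVELS) - 1 - step)
    sql, joins = join_query(pushups)
    return joins, sorted(row[0] for row in conn.execute(sql))

for k in range(7):
    print(k, measure(k))
```

In this sketch every pushup removes one join from the query while the retrieved values stay the same, which is the effect the experiment aims to quantify.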

This experiment yields a graph with a plot of the number of pushUpAttribute operations performed versus the average processing time of the join query needed to retrieve the attribute data as described. The x-axis indicates the number of operations of the specific operator type, and the processing time is that of the query.

Each point on the graph corresponds to the average of several runs, five identical

runs with the greatest and smallest numbers taken out and the remaining three averaged.

The queries are designed to specifically query for the restructured data.

The fixed-parameter settings:

OP-TYPE: pushUpAttribute
DAT-SIZE: 22MB
QY#: 1
QY-TYPE: join

Expectations:

1. As more operations are performed on the data, we should observe a linear

increase in query performance for queries requiring joins.

This experiment yielded the necessary results to evaluate query optimization

provided by the Restructuring subsystem.

Figure 21: Join Query Performance Results

The query results lead to the conclusion that an increase in the number of pushUpAttribute operations performed on the data leads to a linear decrease in query evaluation time. The greater the number of pushups performed on the queried attribute, the smaller the number of joins necessary to evaluate the join query. The linear increase in performance confirms the hypothesis of this experiment and also presents some preliminary support for flexible mapping.

6.7 Analyses

Given our time constraint, we were not able to evaluate the different types of

queries in conjunction with other operators of the Restructuring Operator Library.

However, these results establish preliminary findings that begin to justify the need for flexible mapping, which in turn is the ultimate goal of this project. At the cost of some restructuring overhead, an intelligent mapping will yield restructured data that allows for faster query evaluations. The realization of this benefit motivates further evaluations, possibly the development of new operators, and most importantly, the Optimizer module that is based on a query load from a DBA.

7 Conclusions

7.1 Summary of the Rainbow Project

The Rainbow project itself started with the theories and ideas expressed by Zhang

et al.’s technical reports. The project team started this project by reading and analyzing

these research documents, and understanding what subsystems were in place, what had to

be extended as well as what subsystems had to be more thoroughly designed and

developed. Because the project team started with a partially implemented system, the

first task after understanding the existing code was to extend its capability to reflect

more of Rainbow’s architecture.

The project team had to revise some of the existing code in order to ensure

compatibility with the extensions needed to fully develop the DTD and XML manager

subsystems. Once the existing subsystems were properly extended, the following phase

for the team was to thoroughly design a new subsystem called the Restructuring

Subsystem. After learning how to properly design the classes pertaining to the design of

the Restructuring subsystem using Object Domain, the members laid out the subsystem

components in UML. The team broke down the implementation of this subsystem with

each member having separate tasks to complete in order to meet the demanded schedule

for completion. All implementations were tested and debugged in the implementation

phase by the individuals working on the particular components.

One of the team members focused on the implementation of the Restructurer

component for the Restructuring Subsystem. Additionally, this member was also

responsible for any modifications and cleanups necessary for the existing and extended

subsystems to be easily utilized by the team. The team had to use the existing

subsystems to set up the environment necessary for the integration of the Restructuring

Subsystem since it required XML data and DTD to be loaded and an input for mapping

data.

The remaining members of the team focused on the implementation of each of the

restructuring operators that make up the Restructuring Operator Library component. In

implementing these operator classes, these members learned to reuse and modify code

that a graduate student had developed in order to access the DTDMs more efficiently.

In parallel with the implementation phase was the team’s work on this project

report as well as the outlined experiments. The final phase of this MQP project was for

the members to conclude the preliminary experimental evaluations, create the

presentation for the Rainbow project, and finalize this project report.

During the implementation phase we stayed in very close contact with the primary

author of the Rainbow technical report (Zhang, 1999), Xin Zhang. As design changes

were made, or if guidance was needed, the Rainbow MQP team would consult Xin for

assistance and keep him updated on the progress of the project in general [7].

Once the implementation phase concluded, the integration phase began. The

integration phase started with importing the code source tree into Visual Café and setting

up an environment that allowed for database client/server connection for

experimentation.

Lastly, the team learned the logistics of setting up the experimental test bed,

implemented the experimental code, and conducted each experiment.

7.2 Experience Gained and Lessons Learned

By the conclusion of the Rainbow project, we had learned many different concepts and practices. The skills that we developed include object-oriented design, UML, the Java language, database and SQL concepts, software engineering experience, and teamwork skills. The following sections discuss these concepts.

7.2.1 Object-Oriented Design

The first task given to the group was to read and comprehend the technical report

written by Zhang and others. To reach an understanding of these documents was

necessary in order to begin work with the Rainbow system. After reviewing and

analyzing the report, ideas were discussed and decisions were made about how to extend

and design parts of the Rainbow system. It was a practical and rewarding experience to

assist in the turning of complex technical reports explaining algorithms and modules into

a large, organized and well-documented system [7].

7.2.2 UML

Creating the design of the Rainbow system through the use of the Unified

Modeling Language was also a practical learning experience. Software engineering

knowledge had to be reviewed, new concepts had to be researched, and an understanding

of the state-of-the-art software, Object Domain, had to be achieved. Using UML helped

deliver a better understanding of how different modules and classes are represented, the

order of processes and events, and how objects cooperate over time. UML is becoming a

very popular tool in the software engineering industry, and here our exposure will be

beneficial.

7.2.3 The Java Programming Language

Before starting the Rainbow MQP, the team members had limited Java

knowledge. After completing the implementation, integration, and evaluation, all team

members acquired a great deal of Java knowledge and a better understanding of object-

oriented programming. During the project, all team members dealt with concepts

pertaining to inheritance, polymorphism, encapsulation, abstract classes, and how to code

in a visual environment [7].

7.2.4 XML

The project members learned XML at the start of the project to understand what

types of information are contained in XML documents as well as how they can be

mapped into an RDBMS. Because XML is a markup language that is popular for web-based

applications, it is an important technology to learn and to understand.

7.2.5 Database Management Systems

All three members of the Rainbow MQP team had to acquire knowledge of SQL

as well as the Oracle database platform. Additionally, the background knowledge learned

in the introductory databases course taught at WPI helped in dealing with breaking down

queries and having SQL commands embedded into Java code using JDBC connections.

Mostly, the experience with Rainbow helped give a more thorough understanding of the

SQL language and all of its components.

7.2.6 Software Engineering Experience

From taking the software engineering undergraduate class at WPI (CS 3733), all

three members of the Rainbow MQP team had knowledge of the software engineering

process. Some of the software engineering concepts such as the reuse and the integration

of existing code pertained to this project and were helpful throughout its evolution. The

Rainbow project went through the stages of requirements, design, implementation,

integration, testing, evaluation, and analysis. The Rainbow MQP members gained hands-

on experience in developing a large software system.

7.2.7 Designing the Test Plan

To learn the particulars of what aspects are involved with a test plan, the team

reviewed the DyDa project team’s experimental outline as an example [7]. Our project

team followed that general template as a guide when designing the test plan.

A test plan should begin with an introduction to the categories the testing aims

to evaluate, for example one-time setup costs versus continuous run-time costs. The test

plan should follow with a comprehensive listing of the cost factors. The cost factors

make up the variables and constants for each experiment, but they identify only the

constituting factors and not the specifics such as how many or whether or not a particular

factor is observed within a particular experiment. Next, we decided upon each individual

experiment that composed the test plan.

Each experiment stated the hypothesis of the experimenter, described what data

the experiment gathered, and noted whether the results would be presented as tables

or graphs. Once the data to be collected was identified, the cost factor settings were

listed to make clear the limits of what the gathered data could show. The

experimental outline followed with a list of expected conclusions about what one is likely

to see. After the individual experimental outlines came a conclusion for the test plan that

summarized which observations were conclusive with respect to each evaluated

category.

7.2.8 Working as a Team

A fundamental goal of our project was to distribute the work among the project

team members to complete the project in a timely fashion. Much of the early work was

completed by the team as a whole, because the members lacked prior knowledge of

most of the technologies required for the project. The project team later

moved from working as one unit to collaborating on separate tasks, to make

better progress and to utilize the strengths of the individuals. Even when the members

worked individually, the team communicated frequently via email and scheduled

meetings to keep individual work consistent, compliant, and free of redundancy.

Though some tasks were assigned to one member, all project members

would provide assistance and contribute to the successful completion of such tasks.

Additionally, the MQP project team supported a leader who, beyond recording and

updating notes and websites, was responsible for the successful and punctual

completion of tasks.

Good teamwork skills are necessary when working on a large project like the

Rainbow system. By working together for several months, the Rainbow MQP

team developed the skills needed to work together effectively. The work was split up evenly

and meetings were held weekly. Research and design were done as a team. The final

stages of evaluation and report write-up were again split up among the members, but final

modifications were completed by the group.

7.3 Future Work

The conclusion of this MQP has left many opportunities for future studies. The

challenging Optimizer process has yet to be developed. With its development and

integration, the goal of query optimization could then be realized given this intelligent

process. Please refer to the research documents from this project’s website for further

details regarding the Optimizer process.

The XML query engine component has yet to be developed, and with its

development comes a friendlier interface for users who are concerned only with the XML

technology. The database administrator would provide XML queries and receive results

in the form of XML documents. The details of the various subsystems could be hidden, and

even the interface with the Rainbow subsystems would be abstracted away, since the only

user interface points would be XML queries provided to the XML query engine and query

loads given to the Optimizer process.

The set of restructuring operators is not exhaustive. To realize all potential

benefits of flexible mapping, further development, evaluation, and analysis

should be performed on components related to the restructuring of data. The scope of

this MQP was substantial, and the time constraint did not permit a comprehensive study

of the various operators.

References

Past Works and Books:

[1][DMS] Ramakrishnan, Raghu. Database Management Systems. WCB McGraw-Hill,

1998.

[2][MDA-IXRD] Zhang, Xin, Wang-Chien Lee, and Gail Mitchell. Metadata-Driven

Approach to Integrating XML and Relational Data. February 22, 2000.

[3][CLOCK] Zhang, Xin, Wang-Chien Lee, and Elke A. Rundensteiner. Clock:

Synchronizing Internal Relational Storage with External XML Documents.

October 9, 2000.

[4][ISP-EAR555] Zhang, Xin, Aparna Pillai, and Wei Huang. ISP-EAR555: XML

Relational Management. Summer 2000.

[5][IMVXR] Zhang, Xin. Incremental Maintenance of Virtual XML Repository.

May 1, 2000.

[6][PEAMS] Florescu, Daniela, and Donald Kossmann. A Performance Evaluation of

Alternative Mapping Schemes for Storing XML Data in a Relational Database.

August 3, 1999.

[7][DDW] DyDa MQP Project Team. DyDa: Dynamic Data Warehousing. May 4, 2000.

[8][RAINBOW] Zhang, Xin, Gail Mitchell, Wang-Chien Lee, and Elke A. Rundensteiner. Rainbow: A Flexible Bridge

between XML Documents and Relational Data based on Relational Database

Restructuring. February 26, 2001.

[9] Bates, Chris. Web Programming. John Wiley & Sons: Chichester, 2000.

[10] Horton, Ivor. Beginning Java. Wrox Press: USA, 1997.

[11] Taylor, Art. JDBC Developer’s Resource. Prentice Hall: New Jersey, 1997.

[12] Ullman, Jeffrey. A First Course in Database Systems. Prentice Hall: New Jersey, 1997.

Web-pages:

[13] http://msdn.microsoft.com/workshop/delivery/cdf/reference/channels.asp -

information pertaining to XML.

[14] http://www.w3.org/TR/xmlquery-use-cases#xmp-dtd - information

pertaining to DTD and an XML document.

[15] http://www.objectdomain.com - information pertaining to the

Object Domain tool.

[16] http://xmlwriter.net/xml_guide/element_declaration.shtml -

information pertaining to XML Content Definitions.

[17] http://www.alphaworks.ibm.com - download for XML-generator

software.

Appendixes

Readme for System Environment Setup and Demo

README
===============================================================================

AUTHOR: Tien & John
-------

REQUIREMENTS
--------------

- JDK 1.2 or higher

- Database Server should be running MS Windows NT Server w/Oracle 7 or higher

- username, password, and URI for the database (example: shiba.dsrg)

INSTALLATION
--------------

- This description covers Windows NT only; for other platforms, adapt the steps using your own judgment.

- Unzip the compressed archive dtdm-dtd-project.zip into a new directory.

- This new directory, say 'root', will contain the following directories:

  - \src contains .java
  - \data contains .xml
  - \classes contains .class
  - \doc contains .javadoc
  - \lib contains .java

- classpath=.;%CD%\lib\jdbc\classes12.zip;%CD%\lib\jdbc\jdbcodbc.zip;%CD%\lib\xml4j.jar;%CD%\lib\xerces.jar;%CD%\classes

* %CD% is the new directory 'root'.

START POINT
--------------

- Go to the 'root' directory.

- You can directly type in: java Demo DBURI <username> <userpassword> to run the system.
  For example: java Demo jdbc:oracle:thin:@shiba.wpi.edu:1521:ORCL foo foo

Or do the following three steps.

a. First add two entries for your database in the source file edu/wpi/cs/DSRG/xmldb/JDBCClient/JDBCClient.java

- add entry 1: final static String YOURDB_URL = "<your db's URI>";

example A: final static String SHIBA_URL = "jdbc:oracle:thin:@shiba.wpi.edu:1521:ORCL";

NOTE: insert entry 1 immediately after the example A

- add entry 2: else if (uri.toUpperCase().equals("YOURDB")) return JDBCClient.YOURDB_URL;

example B: else if (uri.toUpperCase().equals("SHIBA")) return JDBCClient.SHIBA_URL;

NOTE: insert entry 2 immediately after the example B

b. Compile all .java files in this directory and its subdirectories with: javac -g -d /classes <.java file here>

- example1: javac -g -d /classes edu/wpi/cs/DSRG/xmldb/storeDTD/StoreDTD.java

- example2: javac -g -d /classes Demo.java

c. Run: java Demo YOURDB <username> <userpassword>
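
The alias lookup that entries 1 and 2 in step a extend can be pictured with the minimal sketch below. It is not the actual JDBCClient source: the class name, the method name resolve, the placeholder value of YOURDB_URL, and the pass-through of anything that is not a known alias are all assumptions made for illustration. Only SHIBA_URL and the shape of the else-if chain come from the README.

```java
import java.util.Locale;

// Hypothetical stand-in for the alias lookup described in step a.
// Class name, method name, and the YOURDB_URL value are illustrative assumptions.
final class JdbcAliasSketch {

    // example A from the README
    static final String SHIBA_URL = "jdbc:oracle:thin:@shiba.wpi.edu:1521:ORCL";

    // entry 1: placeholder URL standing in for "<your db's URI>"
    static final String YOURDB_URL = "jdbc:oracle:thin:@yourhost.example.com:1521:ORCL";

    // Maps a short alias to a full JDBC URL.
    // Assumption: a string that is not a known alias is treated as a URL and returned unchanged.
    static String resolve(String uri) {
        String key = uri.toUpperCase(Locale.ROOT);
        if (key.equals("SHIBA")) return SHIBA_URL;          // example B
        else if (key.equals("YOURDB")) return YOURDB_URL;   // entry 2
        return uri;
    }

    public static void main(String[] args) {
        System.out.println(resolve("yourdb"));
        System.out.println(resolve("jdbc:oracle:thin:@shiba.wpi.edu:1521:ORCL"));
    }
}
```

With a lookup of this shape, "java Demo YOURDB <username> <userpassword>" and the direct form that passes a full JDBC URL would both reach the same connection code.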

RESULT
-------

- The Demo process will load /data/book.xml and /data/book.dtd into your database

- The following relations should now exist on your account on your database:

+ ALL_DTDS_DTDM_Item contains every element of the dtd

+ ALL_DTDS_DTDM_Nesting contains every nesting relationship between the elements

+ ALL_DTDS_DTDM_Attribute contains every attribute for any of the elements

+ ALL_DTDS_DTD_ID_Mapping contains the dtd URI and its internal id

+ UNIQUEID contains some uniqueid used by the UniqueID.class

+ XML_CATALOG contains the XML URI and its internal id

+ DATAVIEW_CATALOG contains the current view for each individual element's relation

+ other relations that store the XML's individual elements

DEMO GUI
=================================================================================

AUTHOR: Mirek
-------

RUNNING INSTRUCTIONS:
---------------------

The GUI is started by running "java Rainbow".

OPERATION INSTRUCTIONS:
----------------------------

1. Rainbow Interface

1.1. General Structure

Once the interface is run, a main window pops up. This window is composed of a main menu and a text box which displays messages to the administrator.

1.2. Establishing a Connection

In order for any interaction to occur with the database, a connection must first be established. An administrator selects the "System" option from the main window menu and clicks on "Connect". A connect window pops up with three text fields. The database path is entered into the first field, the user name into the second field, and the user password into the third. When all information is entered, the administrator clicks on the "Connect" button and, if successful, a connection is established with Oracle.

1.3. Sending Manual Queries to the Database

The administrator may send an SQL query to the database by selecting "System" from the main window menu and clicking "Manual". A window pops up with one text field. Once the administrator enters the query string into the field and clicks on the "Send" button, the query is sent to Oracle and any output received is echoed in the main window text box.

1.4. Importing XML documents

In order to import an XML document into the database, the administrator selects "Import" from the main window menu. An "open file" window pops up which allows the administrator to select the XML file.

1.5. Exporting a DTD

In order to export a DTD document from the database and save it as a file, the administrator selects "Export" from the main window menu. A "Save file" window pops up which allows the administrator to select the name and path of the DTD file to create.

1.6. Using the Work Window

1.6.1. The work window is initially invisible. In order for it to become visible, the administrator must select "Window" from the main window menu and click "WorkWindow". The work window contains three tabs (each tab is a separate sub-window). The first tab (DB) brings up the database data, the second tab (DTD/XML) is not yet implemented and is intended to display the DTD and XML structure, and the third tab (Operators) is for restructuring.

1.6.2. Viewing Tables

When the administrator clicks on "Get Table List" on the first tab (DB), the list of tables contained in the database is displayed. When the administrator clicks on one of the table names, the data of that particular table is displayed in the adjacent "Table Data" text box.

1.6.3. Restructuring

In order for a restructuring to be done, the administrator must first select which operators to run and give the parameters for each of the operators. The third tab (Operators) contains three text boxes. The first box is a list of all available operators. The administrator first selects one of the operators. Once selected, it appears in the second box. This process may be repeated for as many operators as are intended to be run. Clicking a selected operator in the second box causes the list of parameters for that operator to appear in the third box. To enter a value for a parameter, the administrator clicks on that parameter and enters its value in the text field. Once the operators are ready to be run, the "Run" button is clicked. The operators then run sequentially and perform the restructuring.
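
The select, parameterize, and run sequence described in 1.6.3 can be pictured with the small sketch below. It is an illustration under stated assumptions, not the Rainbow GUI code: every class, method, operator, and parameter name here is hypothetical, and the actual restructuring is replaced by a log line.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: class, method, operator, and parameter names are hypothetical.
final class OperatorRunSketch {

    // One operator picked into the second box, with the values typed in for its parameters.
    static final class SelectedOperator {
        final String name;
        final Map<String, String> params = new LinkedHashMap<String, String>();

        SelectedOperator(String name) {
            this.name = name;
        }

        SelectedOperator with(String param, String value) {
            params.put(param, value);
            return this;
        }
    }

    // Checks that every parameter has a value, then "runs" the operators in selection order.
    // Running is simulated by a log line; a real system would restructure the database here.
    static List<String> runAll(List<SelectedOperator> selected) {
        for (SelectedOperator op : selected) {
            for (Map.Entry<String, String> e : op.params.entrySet()) {
                String value = e.getValue();
                if (value == null || value.trim().isEmpty()) {
                    throw new IllegalArgumentException(op.name + ": no value for " + e.getKey());
                }
            }
        }
        List<String> log = new ArrayList<String>();
        for (SelectedOperator op : selected) {
            log.add("ran " + op.name + " " + op.params);
        }
        return log;
    }

    public static void main(String[] args) {
        List<SelectedOperator> selected = new ArrayList<SelectedOperator>();
        selected.add(new SelectedOperator("RenameItem").with("oldName", "book").with("newName", "publication"));
        selected.add(new SelectedOperator("PushupAttribute").with("attribute", "year"));
        for (String line : runAll(selected)) {
            System.out.println(line);
        }
    }
}
```

Checking all parameters before running anything is a design choice of this sketch, so that a missing value does not leave a restructuring half applied; the README does not say how the GUI handles that case.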

ADDITIONAL NOTES:
---------------------

The Rainbow system has been tested to successfully compile and run on a PC running Windows NT 5.01 as well as under Windows 98. The Java development environments used were JDeveloper by Oracle and Visual Cafe.

FIND OUT MORE:
--------------

- Please look at the javadocs for the source files, in particular the one for src\Demo.java

TELL US ABOUT IT:
---------------------

- If you have any questions or comments, please email us at [email protected], noting that the message relates to 'Rainbow Project 2000-2001'.
