Tp database


Term Paper

MANAGING DATABASE

SUBMITTED TO: Respected Gargi mam

SUBMITTED BY: ANKUR SINGH

RE3801A29

CAP 200

Inclusion of New Types in Relational Data Base Systems

Problem statement

The needs of business processing applications were the impetus for

many of the built-in data types (e.g. floating point, money, date, etc.)

and operators (e.g. +, -, etc.) found in commercial database

management systems. However, these built-in types are of little use

for a wider range of applications in areas such as engineering and

scientific research. Applications used for scientific research, for

example, require a database to store large complex structures and

have the ability to make efficient queries on this data. Geographic

applications usually require data types such as points, lines, and

polygons. Other current examples include storage of images

and other multimedia data. Thus, a database management system

needs to have extendible data types to serve a wider community of

users and applications that use these systems.

To achieve this goal, a database should allow for the addition of

extendible data types. When new data types exist in a DBMS, new

operators for these types may be needed. For example, if a DBMS is

extended with the data type “box”, a user may want to issue a query

to find all boxes that overlap one another. Therefore, an “overlap”

operator is needed for this purpose. In addition to extensible

operators, built-in access methods for native data types using

existing data structures (e.g. B-trees, hash tables) may not be

suitable to store the user-defined data types. For example, in

Geographic Information Systems (GIS) that require data types such

as regions and lines, queries that use intersection and existence

operators cannot use B-Trees as an efficient or useful access method.

In this situation, it may be appropriate to use an R-tree or K-D-B-tree

data structure. When

extensible data types use these new data structures in their access

methods, the problem of query optimization comes into play.

Therefore, a DBMS that allows the extension of data types should

also pass relevant performance information to the query optimizer.

The query optimizer should be aware of the cost of user-defined

operations, know how to optimize these new operations, and select

the best execution plans. To summarize, a DBMS that allows

extensible data types should provide the following four features:

1) A method for defining new data types

2) A method for defining operators for these new data types

3) A method for implementing access paths for these new data types

4) A method for allowing the query optimizer to process new

commands for new data types and operators

The formal problem statement this paper addresses is as follows:

• Given:

o A core DBMS with built-in data types, operators, access

methods, and a query plan optimizer

• Find:



o A framework for adding user-defined data types, along with

relevant operators, access methods, and statistics estimation

techniques for query plan optimization

• Objective:

o Minimize the amount of work for implementing new data

types

o Possibility of re-using existing data structures (e.g., B-Tree)

in access methods for user-defined data types

• Constraints:

o Possible safety loopholes when implementing new access

methods

o Performance (e.g., of transaction management, query plan)

of DBMS using new data types

Major contributions

This paper discusses a complete framework for implementing user-

defined data types.

It presents a solution addressing the four main areas mentioned in the

previous section. To the best of our knowledge, the contributions

presented here encompass the first comprehensive solution for

extendible data types in a relational database management system.

Portions of the framework (namely solutions to points 1 and 2 in the

previous section) come from a previous work by the author [2], but

are presented in this paper to provide a complete picture of the

extensible data type solution. The major contributions, therefore,

categorically address the four needs when implementing extensible

data types. Each of these contributions will be discussed in

the next section on key concepts.

• Definition of abstract data types (ADT): the author offers a method

for defining extensible data types within a DBMS

• Definition of ADT operators: the author offers a method for

defining operators for the new extensible data types

• Access methods: the author describes how new access paths can

be implemented to efficiently support extensible data types.

• Query optimization: the author describes how query optimization

takes place inside the DBMS when extensible data types are

present.

Motivation

The needs of business processing applications were the impetus for

many of the built-in data types (e.g. floating point, money, date, etc.)

and operators (e.g. +, -, etc.) found in commercial database

management systems. However, these built-in types are of little use

for a wider range of applications in areas such as engineering and

scientific research. Applications used for scientific research, for

example, require a database to store large complex structures and



have the ability to make efficient queries on this data. Geographic

applications usually require data types such as points, lines, and

polygons. Other current examples include storage of images and other

multimedia data. Thus, a database management system needs to

have extendible data types to serve a wider community of users and

applications that use these systems.

Key Concepts

Data type definition

Because of space constraints, we assume that the reader understands

the concept of native types in relation to a DBMS or programming

language. If a database allows for extendible data types, the method

described in this paper involves a simple syntax to define the data type.

Define type-name length=value,

Input = file-name,

Output = file-name

In this example, length is the fixed amount of space that the data type

will occupy, while the input and output properties define routines that

will convert the data type to and from character strings for storage.
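The definition above can be pictured as registering a fixed length together with a pair of conversion routines. The following is a minimal Python sketch under stated assumptions: the registry (`TYPE_REGISTRY`, `define_type`), the textual box format "x1,y1,x2,y2", and the 16-byte layout are all hypothetical and do not come from the paper.

```python
import struct
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class TypeDef:
    name: str
    length: int                        # fixed storage size in bytes
    input: Callable[[str], bytes]      # character string -> stored form
    output: Callable[[bytes], str]     # stored form -> character string


# Hypothetical in-memory catalog of user-defined types.
TYPE_REGISTRY: Dict[str, TypeDef] = {}


def define_type(name, length, input, output):
    """Register a type, mirroring: define type-name length=..., input=..., output=..."""
    TYPE_REGISTRY[name] = TypeDef(name, length, input, output)


# Assumed "box" layout: two corner points stored as four 32-bit integers.
def box_in(text: str) -> bytes:
    x1, y1, x2, y2 = (int(part) for part in text.split(","))
    return struct.pack("<4i", x1, y1, x2, y2)


def box_out(raw: bytes) -> str:
    return ",".join(str(v) for v in struct.unpack("<4i", raw))


define_type("box", length=16, input=box_in, output=box_out)

stored = TYPE_REGISTRY["box"].input("0,0,2,3")
round_trip = TYPE_REGISTRY["box"].output(stored)
```

Here `stored` occupies exactly the declared length, and `round_trip` recovers the original character string, which is the contract the input and output routines are meant to satisfy.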

Operator Definition

Because of space constraints, we assume that the reader has a basic

understanding of an operator in relation to a DBMS. Such operators

could be any of the set {=, <, >}. To define an operator for a

user-defined type, the method described in the paper involves a

structure similar to the type definitions.

Define operator token = value,

Left-operand = type-name,

Right-operand = type-name,

Result = type-name,

Precedence-level like operator-2,

File = file-name

Here, the operator definition encompasses both right and left operand

types, along with a precedence level if multiple operators exist. The

file attribute names the file containing the procedure that performs

the operator logic.
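To make the operator definition concrete, here is a small Python sketch of an "overlap" operator for boxes. It is an illustration only: the catalog (`OPERATORS`, `define_operator`, `apply_operator`), the token "!!", and the tuple representation of a box are assumptions, and a Python function stands in for the procedure file.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple

# Assumed representation: (x1, y1, x2, y2) with lower-left and upper-right corners.
Box = Tuple[int, int, int, int]


@dataclass(frozen=True)
class OperatorDef:
    token: str
    left_operand: str
    right_operand: str
    result: str
    precedence_like: str      # borrow the precedence of an existing operator
    procedure: Callable       # stands in for the file holding the operator logic


# Hypothetical operator catalog keyed by token and operand types.
OPERATORS: Dict[Tuple[str, str, str], OperatorDef] = {}


def define_operator(op: OperatorDef) -> None:
    OPERATORS[(op.token, op.left_operand, op.right_operand)] = op


def box_overlap(a: Box, b: Box) -> bool:
    # Axis-aligned boxes overlap when their extents intersect on both axes.
    # Boxes that merely touch at an edge count as overlapping in this sketch.
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


define_operator(OperatorDef("!!", "box", "box", "boolean", "*", box_overlap))


def apply_operator(token, left_type, right_type, left, right):
    """Look up the operator by token and operand types, then run its procedure."""
    return OPERATORS[(token, left_type, right_type)].procedure(left, right)


overlapping = apply_operator("!!", "box", "box", (0, 0, 2, 2), (1, 1, 3, 3))
disjoint = apply_operator("!!", "box", "box", (0, 0, 1, 1), (2, 2, 3, 3))
```

Keying the catalog on the operand types as well as the token is a design choice of this sketch; it lets the same token carry different procedures for different types.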

Access Methods

Access methods are the routines for managing access to disk-based

data structures supported by the system. An example of such a data

structure is a B+-Tree. In a B+-Tree, all data is saved at the leaf

level, while the internal nodes only contain search keys and tree

pointers. The leaf nodes are also stored as a linked list, making

range queries easy.

[Figure: example B+-Tree]

The paper describes a method to extend access methods to either

re-use existing data structures or make use of completely new data

structures, depending on the properties of the user-defined data type.

For instance, if a user were to issue the query [4]:

retrieve (target-list) where relation.key <= 3

A B+-Tree would work very well in this case since the operator (OPR) is

‘<=’. The access method would start at the root node and follow the

leftmost pointer to the node pointing to data values d1, d2, and d3. A

B+-Tree works well for the integer data type. However, if the extended

data type is a box, the access methods may require a different data

structure, such as an R-Tree, that is more suited for spatial data. To

extend access methods, the paper defines access method templates. Each

template defines an access method, along with the operator information

necessary to implement that access method. The paper gives an example

of a template for a B-Tree.

In this template, only the <= operator is required (reading from the

opt column, it is the only value of “req”) since it is the only operator

necessary to implement a B-Tree. Other columns in this template

define the left and right operands as well as the result for a given

operator. Along with this template, an access method table must also

be in place, which defines a collection of operators that satisfy the

template. This table also contains values that the query processor

may use to estimate the number of tuples that satisfy the operator

qualification, and the number of pages touched when using the

operator to compare a key field to a constant. The paper gives an

example of such a table in the context of regular integer operators

for a B-Tree, along

with “box” operators (AE – area equal, AL – area less-than, AG – area

greater-than) that are used in a B-Tree access method.

In this case, both the box (defined as the area-op class) and

integer (defined as the int-ops class) operators are defined for use

with a B-Tree. The paper also defines a “using class” clause to

change a relation to use a particular access method. For instance,

if a user wanted a relation storing “box” information to use the

operators AE, AL, and AG within the B-Tree access method, they

would issue the command:

modify box to B-Tree on desc using area-op
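The idea that one access method can serve several data types, as long as each type supplies the operator the template requires, can be sketched as follows. This is not the paper's implementation: a sorted Python list stands in for the B-Tree, and `OPERATOR_CLASSES`, `build_ordered_index`, and the area-based "<=" procedure are hypothetical names chosen for illustration.

```python
from functools import cmp_to_key


def area(box):
    x1, y1, x2, y2 = box
    return (x2 - x1) * (y2 - y1)


# Hypothetical operator classes: each one fills the template's required "<="
# slot with a concrete procedure for a single data type.
OPERATOR_CLASSES = {
    "int-ops": {"<=": lambda a, b: a <= b},
    "area-op": {"<=": lambda a, b: area(a) <= area(b)},
}


def build_ordered_index(keys, class_name):
    """Order keys using only the class's "<=" operator (stand-in for a B-Tree)."""
    le = OPERATOR_CLASSES[class_name]["<="]

    def compare(a, b):
        if le(a, b) and le(b, a):
            return 0                  # equal under this operator class
        return -1 if le(a, b) else 1

    return sorted(keys, key=cmp_to_key(compare))


boxes = [(0, 0, 3, 3), (0, 0, 1, 1), (5, 5, 7, 6)]
box_index = build_ordered_index(boxes, "area-op")     # ordered by area
int_index = build_ordered_index([3, 1, 2], "int-ops")  # ordinary integer order
```

The index-building code never inspects the keys directly; it only calls the "<=" procedure of the chosen class, which is why the same structure can hold integers or boxes.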

The actual implementation of the access methods comes through

implementing procedure calls which will use the access method

information previously defined. Two examples of these procedure

calls are:

Open(relation-name) – returns a pointer to a structure containing

information about the relation

Get-first(descriptor, OPR, value) – returns the first record which

satisfies the “where key OPR value” clause.
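A rough Python sketch of these two calls is given below. It assumes a hypothetical in-memory relation of (key, record) pairs kept in key order, supports only three operators, and uses invented names (`Descriptor`, `RELATIONS`, `open_relation`, `get_first`); a real access method would work against disk pages.

```python
from bisect import bisect_left


class Descriptor:
    """Hypothetical result of open(): (key, record) pairs held in key order."""

    def __init__(self, rows):
        self.rows = sorted(rows)
        self.keys = [key for key, _ in self.rows]


# Made-up relation contents for illustration.
RELATIONS = {"emp": [(5, "e"), (1, "a"), (3, "c"), (2, "b")]}


def open_relation(relation_name):
    return Descriptor(RELATIONS[relation_name])


def get_first(descriptor, opr, value):
    """Return the first record satisfying "where key OPR value", or None."""
    keys = descriptor.keys
    if opr == "<=":
        # The smallest key qualifies if any key does.
        pos = 0 if keys and keys[0] <= value else len(keys)
    elif opr == ">=":
        pos = bisect_left(keys, value)
    elif opr == "=":
        pos = bisect_left(keys, value)
        if pos < len(keys) and keys[pos] != value:
            pos = len(keys)
    else:
        raise ValueError(f"unsupported operator: {opr}")
    return descriptor.rows[pos] if pos < len(keys) else None


emp = open_relation("emp")
first_le_3 = get_first(emp, "<=", 3)
first_ge_4 = get_first(emp, ">=", 4)
```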

In the case of extensible data types, new access methods may

have to handle tasks such as logging, concurrency control, and



buffer management. In the case of logging, if a DBMS supports

logical logging, then the access methods must implement

REDO and UNDO methods when a log manager rolls forward or

rolls backward log events. In the case of concurrency control,

the access method may have to make use of system calls (e.g.,

read, begin, abort, etc.) to a DBMS scheduler that will in turn

respond with a yes/no/abort response for each request. Finally, if

buffer management is a concern for access method designers,

the author suggests that a set of procedures (e.g., get, fix,

unfix, put, order) must be made available so the access

method may perform buffer manipulation.

Query Optimization

Query optimization is a function of many database management

systems that examines multiple query plans

for satisfying a particular query. Most optimizers consider

statistics when analyzing query plans. The statistical

categories are usually in the area of CPU cost and disk

storage service time. The optimizer also examines

different query paths by looking at the indexes available

and relational table join techniques to choose an optimal

query path. As a simple example, consider the query

Select employee.name

From employee

Where employee.level = 5

In this case, the query optimizer will want to find the cheapest

way to find all employees with the level of 5. The query could

scan all tuples in the employee relation to find the employees

with level equal to 5. However, if an index exists on the

employee level column, the number of operations will be

greatly reduced as the query can use this index to scan only a

subset of employee records (i.e., employees with level 5). In

the case of join ordering, consider three tables A, B, and C that

must be joined to satisfy a query. Table A contains 50 records,

while B and C contain 400,000 records. The job of the query



optimizer is to find the optimal join order and join method

which will optimize the query performance. In this case, if

table B is first joined with table C, then the result is joined with

table A, this plan can take several orders of magnitude more time

than a plan that first joins tables A and C [5]. Also, if hash join

is a feasible strategy for joining A and C, the optimizer may

choose this option over a nested-loop join. In this case, hash join is

appealing since table A is small enough to fit in memory,

resulting in a one-pass join algorithm.
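The one-pass behaviour described above can be sketched in Python as follows. The tables and column names (`a_id`, `a_ref`, and so on) are made up for illustration; the sketch only assumes that the smaller input fits in memory.

```python
from collections import defaultdict


def hash_join(small, large, small_key, large_key):
    """One-pass hash join: build on the small input, probe with the large one."""
    buckets = defaultdict(list)
    for row in small:                  # build phase: the small table fits in memory
        buckets[row[small_key]].append(row)
    for row in large:                  # probe phase: a single scan of the large table
        for match in buckets.get(row[large_key], []):
            yield {**match, **row}


# Miniature stand-ins for table A (small) and table C (large).
table_a = [{"a_id": 1, "dept": "x"}, {"a_id": 2, "dept": "y"}]
table_c = [
    {"c_id": 10, "a_ref": 1},
    {"c_id": 11, "a_ref": 3},
    {"c_id": 12, "a_ref": 1},
]

joined = list(hash_join(table_a, table_c, "a_id", "a_ref"))
```

Each input is read exactly once, whereas a nested-loop join would rescan one table for every row of the other.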

When user-defined types and operators are present in a DBMS,

the query optimizer must have a way to estimate the

selectivity and join methods available for tables containing

these new types in order to make decisions as described

above. Otherwise, optimization becomes a daunting (if not

impossible) task. This paper proposes that four pieces of

information must be available when defining an extensible data

type operator [4]:

• Stups: an estimate of the number of records satisfying the clause

o Where relname.field-name OPR value

• Selectivity factor S: the expected number of records which satisfy

the clause

o Where relname-1.field-1 OPR relname-2.field-2

• Whether merge-sort is feasible for the operator

• Whether hash-join is a feasible joining strategy for this operator

With this information in place, the query optimizer has enough

information to produce a more optimal query path than random

selection when a query is issued on user-defined data types.
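One way to picture how an optimizer might consume these four pieces of information is the Python sketch below. The record layout (`OperatorStats`), the preference order among join methods, the 10% index threshold, and the sample numbers are all assumptions made for illustration; the paper only specifies which information must be supplied.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OperatorStats:
    stups: float          # expected matches for "field OPR constant"
    s: float              # expected matches for "field-1 OPR field-2"
    merge_sort_ok: bool   # is merge-sort feasible for this operator?
    hash_join_ok: bool    # is hash-join feasible for this operator?


def choose_join_method(stats: OperatorStats) -> str:
    """Pick a join strategy the operator permits; nested loop always works."""
    if stats.hash_join_ok:
        return "hash-join"
    if stats.merge_sort_ok:
        return "merge-sort"
    return "nested-loop"


def prefer_index(stats: OperatorStats, relation_size: int,
                 threshold: float = 0.1) -> bool:
    """Assumed rule of thumb: use an index when few records should match."""
    return stats.stups <= threshold * relation_size


# Invented figures for an equality operator and a box "overlap" operator.
equality = OperatorStats(stups=1, s=100, merge_sort_ok=True, hash_join_ok=True)
overlap = OperatorStats(stups=500, s=20000, merge_sort_ok=False, hash_join_ok=False)
```

With these figures, the equality operator can be joined by hashing, while the overlap operator falls back to a nested-loop join because neither hashing nor sorting is declared feasible for it.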



Validation

The author mainly provides a general framework to add user-

defined types to the database. As mentioned previously, the

methods for defining extensible data types and their operators

were presented in [2] and implemented in the INGRES DBMS at UC-Berkeley. For the discussion on access paths and query

optimization, the author does not mention if these methods had

been implemented in a DBMS. Therefore, he seems only to be

discussing the vision and rationale of how to implement the

constructs for access methods and query optimization for

extensible data types.

The actual implementation of access methods and query

optimization was probably beyond the scope of this paper.

Therefore, the ideas could not be validated through

experimental evidence. However, in sections 3 and 4, the

author provides good case studies (through examples) when

discussing his proposals for implementing access methods and

performing query optimization in the context of user-defined

types and operators.

Assumptions

The author discusses performance of extensible data types in the

context of implementation on commercial systems by writing, “An

‘industrial strength’ implementation might choose to specify the

user types which an installation wants at the time the DBMS is

installed” [4]. This is an alternative to dynamically linking user-

defined routines for the extensible data types. While this would

certainly be a performance benefit, the author does not discuss if

this could actually happen in a commercial setting. It seems that

commercial database vendors would want to keep user-defined code

away from the native code.

The author also implicitly assumes that creating new constructs (i.e.,

data types and operators) is empirically better than using

built-in data types to model these non-standard types. In other

words, he is assuming that this custom work (coupled with long-

term support) outweighs the problem of query logic complexity (as

presented in section

o when using native data-types.

When discussing the implementation of access methods, the author

limits his discussion to support for single key fields.

Furthermore, the author also assumes single-dimension access

methods. These two assumptions seem valid given the scope

of the paper, as it discusses a whole framework for extensible

types, operators, access methods, and query optimization.

Making these assumptions allows the author to cover each

topic, rather than covering one particular topic (e.g., access

methods) in-depth while glossing over the other topics.

Rewrite

In general, this paper is very well organized and its ideas are

presented in a succinct manner. If we were to rewrite the paper

today, we would focus on improving the following points:

• Add a discussion on query rewrite in query optimization section.

The author did not discuss query rewrite in the context of user-

defined types

• Actual implementation of access methods in a DBMS (such as

INGRES) may be beyond the scope of this paper. However, there

could be simulation data and a larger discussion of performance

drawbacks for extensible data types

• Add more discussion on how this proposal, along with the





References

[1] Hellerstein, J. and Stonebraker, M., “Anatomy of a Database

System,” Readings in Database Systems, Cambridge, Mass.: MIT Press,

2005, 42-95.

[2] Stonebraker, M. et al., “Application of Abstract Data Types and

Abstract Indices to CAD Data,” Proc. Engineering Applications Stream

of Database Week/83, San Jose, Ca., May 1983.

[3] “B+ Trees.” Wikipedia, The Free Encyclopedia. 17 Sep 2006, 10:55

UTC. Wikimedia Foundation, Inc. 10 Aug 2004

<http://en.wikipedia.org/wiki/B%2B_tree>.

[4] Stonebraker, M., “Inclusion of New Types in Relational Data Base

Systems,” Proceedings of ICDE, 1986.
