8/7/2019 Tp database
http://slidepdf.com/reader/full/tp-database 1/15
Term Paper
MANAGING DATABASE
SUBMITTED TO: Respected Gargi mam
SUBMITTED BY: ANKUR SINGH
RE3801A29
CAP 200
Inclusion of New Types in Relational Data Base Systems
Problem statement
The needs of business processing applications were the impetus for
many of the built-in data types (e.g. floating point, money, date, etc.)
and operators (e.g. +, -, etc.) found in commercial database
management systems. However, these built-in types are of little use
for a wider range of applications in areas such as engineering and
scientific research. Applications used for scientific research, for
example, require a database to store large complex structures and
have the ability to make efficient queries on this data. Geographic
applications usually require data types such as points, lines, and
polygons. Other current examples include storage of images
and other multimedia data. Thus, a database management system
needs to have extendible data types to serve a wider community of
users and applications that use these systems.
To achieve this goal, a database should allow new data types to be
added. When new data types exist in a DBMS, new operators for these
types may be needed. For example, if a DBMS is extended with the data
type “box”, a user may want to issue a query to find all boxes that
overlap one another. An “overlap” operator is therefore appropriate
for this purpose. In addition to extensible operators, the built-in
access methods for native data types, which use existing data
structures (e.g. B-trees, hash tables), may not be suitable for
user-defined data types. For example, in Geographic Information
Systems (GIS) that require data types such as regions and lines,
queries that use intersection and existence operators cannot use
B-Trees as an efficient or useful access method. In this situation, it
may be appropriate to use an R-tree or a K-D-B tree instead. When
extensible data types use these new data structures in their access
methods, the problem of query optimization comes into play.
Therefore, a DBMS that allows the extension of data types should
also pass relevant performance information to the query optimizer.
The query optimizer should be aware of the cost of user-defined
operations, know how to optimize these new operations, and select
the best execution plans. To summarize, a DBMS that allows
extensible data types should provide the following four features:
1) A method for defining new data types
2) A method for defining operators for these new data types
3) A method for implementing access paths for these new data types
4) A method for allowing the query optimizer to process new
commands for new data types and operators
The formal problem statement this paper addresses is as follows:
• Given:
o A core DBMS with built-in data types, operators, access
methods, and a query plan optimizer
• Find:
o A framework for adding user-defined data types, along with
relevant operators, access methods, and statistic estimation
techniques for query plan optimization
• Objective:
o Minimize the amount of work for implementing new data
types
o Possibility of re-using existing data structures (e.g., B-Tree)
in access methods for user-defined data types
• Constraints:
o Possible safety loopholes when implementing new access
methods
o Performance (e.g., of transaction management, query plan)
of DBMS using new data types
Major contributions
This paper discusses a complete framework for implementing user-
defined data types.
It presents a solution addressing the four main areas mentioned in the
previous section. To the best of our knowledge, the contributions
presented here encompass the first comprehensive solution for
extendible data types in a relational database management system.
Portions of the framework (namely solutions to points 1 and 2 in the
previous section) come from a previous work by the author [2], but
are present in this paper to provide a complete picture of the
extensible data type solution. The major contributions, therefore,
categorically address the four needs when implementing extensible
data types. Each of these contributions will be discussed in
the next section on key concepts.
• Definition of abstract data types (ADT): the author offers a method
for defining extensible data types within a DBMS
• Definition of ADT operators: the author offers a method for
defining operators for the new extensible data types
• Access methods: the author describes how new access paths can
be implemented to efficiently support extensible data types.
• Query optimization: the author describes how query optimization
takes place inside the DBMS when extensible data types are
present.
Motivation
The needs of business processing applications were the impetus for
many of the built-in data types (e.g. floating point, money, date, etc.)
and operators (e.g. +, -, etc.) found in commercial database
management systems. However, these built-in types are of little use
for a wider range of applications in areas such as engineering and
scientific research. Applications used for scientific research, for
example, require a database to store large complex structures and
have the ability to make efficient queries on this data. Geographic
applications usually require data types such as points, lines, and
polygons. Other current examples include storage of images and other
multimedia data. Thus, a database management system needs to
have extendible data types to serve a wider community of users and
applications that use these systems.
Key Concepts
Data type definition
Given space constraints, we assume that the reader understands the
concept of native types in relation to a DBMS or programming
language. If a database allows for extendible data types, the method
described in this paper involves a simple syntax to define the data
type.
Define type-name length=value,
Input = file-name,
Output = file-name
In this example, length is a fixed amount of space that the data type
will occupy, while the input and output properties define routines that
will convert the data type to and from character strings for storage.
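The definition above can be pictured as one entry in a type catalog. The following is a minimal Python sketch under stated assumptions: the names `TypeDef`, `TYPE_CATALOG`, and `define_type`, the "x1,y1,x2,y2" string form of a box, and the 16-byte length are all hypothetical and do not come from the paper. The conversion routines are passed as functions rather than as file names.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Tuple


@dataclass(frozen=True)
class TypeDef:
    """Catalog entry for a user-defined type (hypothetical structure)."""
    name: str
    length: int                          # fixed storage size in bytes
    input_fn: Callable[[str], Any]       # character string -> internal value
    output_fn: Callable[[Any], str]      # internal value -> character string


TYPE_CATALOG: Dict[str, TypeDef] = {}


def define_type(name: str, length: int,
                input_fn: Callable[[str], Any],
                output_fn: Callable[[Any], str]) -> None:
    """Register a type, mirroring 'define type-name length=..., input=..., output=...'."""
    TYPE_CATALOG[name] = TypeDef(name, length, input_fn, output_fn)


Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), an assumed layout


def box_in(text: str) -> Box:
    """Parse the assumed external form 'x1,y1,x2,y2' into a box."""
    x1, y1, x2, y2 = (float(part) for part in text.split(","))
    return (x1, y1, x2, y2)


def box_out(box: Box) -> str:
    """Convert a box back to its external character-string form."""
    return ",".join(f"{v:g}" for v in box)


# Four 4-byte coordinates give 16 bytes; this size is an assumption.
define_type("box", length=16, input_fn=box_in, output_fn=box_out)
```

Once registered, the DBMS would only need to look up the entry for "box" to convert values in either direction, without knowing anything else about the type.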
Operator Definition
Given space constraints, we assume that the reader has a basic
understanding of an operator in relation to a DBMS. Such operators
could be any of the set {=, <, >}. To define an operator for a
user-defined type, the method described in the paper uses a structure
similar to that of the type definitions.
Define operator token = value,
Left-operand = type-name,
Right-operand = type-name,
Result = type-name,
Precedence-level like operator-2,
File = file name
Here, the operator definition encompasses both the right and left
operand types, along with the precedence level if multiple operators
exist. The file attribute names the file that stores the procedure
performing the operator logic.
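As an illustration of the operator definition, here is a minimal Python sketch of an operator catalog holding an "overlap" operator for boxes. The names `OperatorDef`, `OPERATOR_CATALOG`, `define_operator`, and `apply_operator`, the token `!!`, the precedence choice, and the box layout are assumptions made for this example only. The procedure is passed as a function instead of a file name.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Tuple

# (x1, y1, x2, y2) with x1 <= x2 and y1 <= y2; an assumed layout
Box = Tuple[float, float, float, float]


@dataclass(frozen=True)
class OperatorDef:
    """Catalog entry for a user-defined operator (hypothetical structure)."""
    token: str
    left_operand: str
    right_operand: str
    result: str
    precedence_like: str                   # existing operator whose precedence is copied
    procedure: Callable[[Any, Any], Any]   # stands in for the 'file' attribute


OPERATOR_CATALOG: Dict[Tuple[str, str, str], OperatorDef] = {}


def define_operator(op: OperatorDef) -> None:
    """Register an operator under (token, left type, right type)."""
    OPERATOR_CATALOG[(op.token, op.left_operand, op.right_operand)] = op


def box_overlap(a: Box, b: Box) -> bool:
    """True when two axis-aligned boxes share at least one point."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


define_operator(OperatorDef(
    token="!!", left_operand="box", right_operand="box",
    result="boolean", precedence_like="*", procedure=box_overlap,
))


def apply_operator(token: str, left_type: str, right_type: str,
                   left: Any, right: Any) -> Any:
    """Look up the operator by token and operand types, then run its procedure."""
    return OPERATOR_CATALOG[(token, left_type, right_type)].procedure(left, right)
```

Keying the catalog on the operand types as well as the token lets the same token carry different meanings for different types.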
Access Methods
Access methods are the routines for managing access to disk-based
data structures supported by the system. An example of such a data
structure is a B+-Tree. In a B+-Tree, all data is saved at the leaf
level, while the internal nodes only contain search keys and tree
pointers. The leaf nodes are also stored as a linked list, making
range queries easy.
[Figure: example B+-Tree]
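To make the role of the leaf-level linked list concrete, the following is a small Python sketch of a range scan over a hand-built, two-level B+-Tree. It is an in-memory illustration only: the class names, the separator convention, and the sample keys are assumptions, and insertion, node splitting, and disk pages are left out.

```python
from bisect import bisect_right
from dataclasses import dataclass, field
from typing import List, Optional, Union


@dataclass
class Leaf:
    keys: List[int]
    next: Optional["Leaf"] = None      # sibling pointer forming the leaf linked list


@dataclass
class Internal:
    keys: List[int]                    # separator keys only, no data
    children: List[Union["Internal", Leaf]] = field(default_factory=list)


def find_leaf(node: Union[Internal, Leaf], key: int) -> Leaf:
    """Descend from the root to the leaf whose key range covers `key`."""
    while isinstance(node, Internal):
        node = node.children[bisect_right(node.keys, key)]
    return node


def range_scan(root: Union[Internal, Leaf], low: int, high: int) -> List[int]:
    """Return all keys k with low <= k <= high by walking the leaf list."""
    result: List[int] = []
    leaf: Optional[Leaf] = find_leaf(root, low)
    while leaf is not None:
        for k in leaf.keys:
            if k > high:
                return result          # keys are sorted, so the scan can stop here
            if k >= low:
                result.append(k)
        leaf = leaf.next
    return result


# Hand-built tree: root separators [4, 7] over three linked leaves.
l3 = Leaf([7, 8, 9])
l2 = Leaf([4, 5, 6], next=l3)
l1 = Leaf([1, 2, 3], next=l2)
root = Internal(keys=[4, 7], children=[l1, l2, l3])
```

Only one descent from the root is needed; after that the scan follows sibling pointers, which is why range predicates are cheap on this structure.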
The paper describes a method to extend access methods to either
re-use existing data structures or make use of completely new data
structures, depending on the properties of the user-defined data type.
For instance, if a user were to issue the query [4]:
retrieve (target-list) where relation.key <= 3
A B+-Tree would work very well in this case since the operator (OPR)
is ‘<=’. The access method would start at the root node and follow the
leftmost pointer to the node pointing to data values d1, d2, and d3. A
B+-Tree works well for the integer data type. However, if the extended
data type is a box, the access methods may require a different data
structure, such as an R-Tree, which is better suited to spatial data.
To extend access methods, the paper defines access method templates.
Each template defines an access method, along with the operator
information necessary to implement that access method. The paper gives
an example of a template for a B-Tree.
In this template, only the <= operator is required (in the opt
column, it is the only operator with the value “req”), since it is the
only operator necessary to implement a B-Tree. Other columns in this
template define the left and right operands as well as the result for
a given operator. Along with this template, an access method table
must also be in place, which defines a collection of operators that
satisfy the template. This table also contains values that the query
processor may use to estimate the number of tuples that satisfy the
operator qualification, and the number of pages touched when using the
operator to compare a key field to a constant. The paper gives an
example of such a table in the context of regular integer operators
for a B-Tree, along with “box” operators (AE – area equal, AL – area
less-than, AG – area greater-than) that are used in a B-Tree access
method.
In this case, both the box operators (defined as the area-op class)
and the integer operators (defined as the int-ops class) are defined
for use with a B-Tree. The paper also defines a “using class” clause
to change a relation to use a particular access method. For instance,
if a user wanted a relation storing “box” information to use the
operators AE, AL, and AG within the B-Tree access method, they would
issue the command:
modify box to B-Tree on desc using area-op
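The idea of binding one access method to different operator classes can be sketched as follows. This is a hedged Python illustration, not the paper's mechanism: a sorted list stands in for the B-Tree, and `OPERATOR_CLASSES`, `OrderedIndex`, `scan_le`, and the box layout are hypothetical names. The class names int-ops and area-op follow the text, and each class supplies only the one required "less than or equal" operator.

```python
from functools import cmp_to_key
from typing import Any, Callable, Dict, List, Tuple

Box = Tuple[float, float, float, float]   # (x1, y1, x2, y2), an assumed layout


def area(b: Box) -> float:
    """Area of an axis-aligned box."""
    return abs(b[2] - b[0]) * abs(b[3] - b[1])


# Each operator class supplies the single required operator, "<=".
OPERATOR_CLASSES: Dict[str, Callable[[Any, Any], bool]] = {
    "int-ops": lambda a, b: a <= b,
    "area-op": lambda a, b: area(a) <= area(b),   # area less-than or area equal
}


class OrderedIndex:
    """Stand-in for a B-Tree: keys kept sorted by the class's <= operator."""

    def __init__(self, keys: List[Any], using: str) -> None:
        self.le = OPERATOR_CLASSES[using]
        self.keys = sorted(keys, key=cmp_to_key(self._compare))

    def _compare(self, a: Any, b: Any) -> int:
        # Derive a three-way comparison from <= alone.
        if self.le(a, b) and self.le(b, a):
            return 0
        return -1 if self.le(a, b) else 1

    def scan_le(self, value: Any) -> List[Any]:
        """Return keys k with 'k <= value', stopping at the first key that fails."""
        out: List[Any] = []
        for k in self.keys:
            if not self.le(k, value):
                break
            out.append(k)
        return out
```

For example, `OrderedIndex(boxes, using="area-op")` orders boxes by area, while `OrderedIndex(numbers, using="int-ops")` reuses the same code for integers. The index logic never inspects the key type directly.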
The actual implementation of the access methods comes through
implementing procedure calls which use the access method information
previously defined. Two examples of these procedure calls are:
• Open(relation-name) – returns a pointer to a structure containing
information about the relation
• Get-first(descriptor, OPR, value) – returns the first record which
satisfies the “where key OPR value” clause
In the case of extensible data types, new access methods may
have to handle tasks such as logging, concurrency control, and
buffer management. In the case of logging, if a DBMS supports
logical logging, then the access methods must implement REDO and
UNDO methods that are invoked when a log manager rolls log events
forward or backward. In the case of concurrency control, the access
method may have to make use of system calls (e.g., read, begin,
abort, etc.) to a DBMS scheduler that will in turn respond with a
yes/no/abort response for each request. Finally, if buffer
management is a concern for access method designers, the author
suggests that a set of procedures (e.g., get, fix, unfix, put,
order) must be made available so the access method may perform
buffer manipulation.
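The logging requirement can be illustrated with a toy access method that records logical events and supplies its own REDO and UNDO routines. This Python sketch rests on several assumptions: the index is just an in-memory set, the log is a list, and the names `LogRecord`, `LoggedIndex`, and `roll_back` are hypothetical. A real log manager, concurrency control, and buffer management are out of scope here.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass(frozen=True)
class LogRecord:
    """Logical log event: the operation and its key, not a page image."""
    op: str       # "insert" or "delete"
    key: int


class LoggedIndex:
    """Toy access method that logs logical events and can replay or reverse them."""

    def __init__(self) -> None:
        self.keys: Set[int] = set()
        self.log: List[LogRecord] = []

    def insert(self, key: int) -> None:
        if key in self.keys:
            return                      # nothing changes, so nothing is logged
        self.log.append(LogRecord("insert", key))
        self.keys.add(key)

    def delete(self, key: int) -> None:
        if key not in self.keys:
            return
        self.log.append(LogRecord("delete", key))
        self.keys.discard(key)

    def redo(self, rec: LogRecord) -> None:
        """Re-apply an event while the log is rolled forward."""
        if rec.op == "insert":
            self.keys.add(rec.key)
        else:
            self.keys.discard(rec.key)

    def undo(self, rec: LogRecord) -> None:
        """Reverse an event while the log is rolled backward."""
        if rec.op == "insert":
            self.keys.discard(rec.key)
        else:
            self.keys.add(rec.key)


def roll_back(index: LoggedIndex, from_position: int) -> None:
    """Undo every event at or after `from_position`, newest first."""
    for rec in reversed(index.log[from_position:]):
        index.undo(rec)
    del index.log[from_position:]
```

The point of the sketch is that only the access method knows how to interpret its own logical events, which is why the designer of a new access method has to provide both routines.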
Query Optimization
Query optimization is a function of many database management
systems that examines multiple query plans for satisfying a
particular query. Most optimizers consider statistics when analyzing
query plans. The statistical categories are usually in the area of
CPU cost and disk storage service time. The optimizer also examines
different query paths by looking at the indexes available and the
relational table join techniques to choose an optimal query path. As
a simple example, consider the query
Select employee.name
From employee
Where employee.level = 5
In this case, the query optimizer will want to find the cheapest
way to find all employees with the level of 5. The query could
scan all tuples in the employee relation to find the employees
with level equal to 5. However, if an index exists on the
employee level column, the number of operations will be
greatly reduced, as the query can use this index to scan only a
subset of employee records (i.e., employees with level 5). In
the case of join ordering, consider three tables A, B, and C that
must be joined to satisfy a query. Table A contains 50 records,
while B and C contain 400,000 records. The job of the query
optimizer is to find the join order and join method that give the
best query performance. In this case, if table B is first joined with
table C and the result is then joined with table A, this plan can take
several orders of magnitude more time than a plan that first joins
tables A and C [5]. Also, if hash join is a feasible strategy for
joining A and C, the optimizer may choose this option over a
nested-loop join. In this case, hash join is appealing since table A
is small enough to fit in memory, resulting in a one-pass join
algorithm.
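The join-ordering argument can be made concrete with a rough cardinality estimate and a one-pass hash join. The Python sketch below makes two labeled assumptions: B and C are taken to hold 400,000 rows each, and the join selectivity of 1/400,000 is invented for illustration. The function names are hypothetical, and the estimate is the common textbook formula rather than anything specified in the paper.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def estimated_join_size(left_rows: int, right_rows: int, selectivity: float) -> float:
    """Textbook cardinality estimate: |L| * |R| * selectivity."""
    return left_rows * right_rows * selectivity


# Table sizes follow the text (B and C assumed to be 400,000 rows each).
ROWS = {"A": 50, "B": 400_000, "C": 400_000}
SELECTIVITY = 1 / 400_000        # assumed: each row matches about one other row

# Size of the intermediate result produced by the first join of each plan.
bc_first = estimated_join_size(ROWS["B"], ROWS["C"], SELECTIVITY)   # (B join C) join A
ac_first = estimated_join_size(ROWS["A"], ROWS["C"], SELECTIVITY)   # (A join C) join B


def hash_join(small: Iterable[Tuple[int, str]],
              large: Iterable[Tuple[int, str]]) -> List[Tuple[int, str, str]]:
    """One-pass hash join: build a table on the small input, probe with the large one."""
    table: Dict[int, List[str]] = defaultdict(list)
    for key, payload in small:             # build phase, small side fits in memory
        table[key].append(payload)
    joined: List[Tuple[int, str, str]] = []
    for key, payload in large:             # probe phase, single scan of the large side
        for match in table.get(key, []):
            joined.append((key, match, payload))
    return joined
```

Under these assumptions the plan that joins B and C first carries an intermediate result of roughly 400,000 rows, against roughly 50 rows when A and C are joined first, which is the kind of gap the optimizer is meant to detect.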
When user-defined types and operators are present in a DBMS,
the query optimizer must have a way to estimate the
selectivity and join methods available for tables containing
these new types in order to make decisions as described
above. Otherwise, optimization becomes a daunting (if not
impossible) task. This paper proposes that four pieces of
information must be available when defining an extensible data
type operator [4]:
• Stups: an estimate of the number of records satisfying the clause
“where rel-name.field-name OPR value”
• Selectivity factor S: the expected number of records which satisfy
the clause “where relname-1.field-1 OPR relname-2.field-2”
• Whether merge-sort is feasible for the operator
• Whether hash-join is a feasible joining strategy for this operator
With this information in place, the query optimizer has enough
information to produce a better query plan than random selection
would when a query is issued on user-defined data types.
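A minimal Python sketch of how an optimizer might consume these four pieces of information is given below. Several things here are assumptions: Stups and S are treated as fractions to be multiplied by table sizes, the numeric values are invented, and the names `OperatorStats`, `estimate_selection`, `estimate_join`, and `choose_join_method` are hypothetical. The preference order among join methods is only one plausible policy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OperatorStats:
    """Optimizer hints registered with a user-defined operator (hypothetical structure)."""
    stups: float          # assumed fraction of records satisfying 'field OPR value'
    s: float              # assumed fraction of record pairs satisfying 'field-1 OPR field-2'
    merge_sort_ok: bool   # whether merge-sort is feasible for the operator
    hash_join_ok: bool    # whether hash-join is feasible for the operator


def estimate_selection(rows: int, stats: OperatorStats) -> float:
    """Expected number of records returned by a single-table qualification."""
    return rows * stats.stups


def estimate_join(left_rows: int, right_rows: int, stats: OperatorStats) -> float:
    """Expected number of records returned by a join qualification."""
    return left_rows * right_rows * stats.s


def choose_join_method(stats: OperatorStats, small_side_fits_in_memory: bool) -> str:
    """Pick a join strategy the operator supports, falling back to nested-loop."""
    if stats.hash_join_ok and small_side_fits_in_memory:
        return "hash-join"
    if stats.merge_sort_ok:
        return "merge-sort"
    return "nested-loop"


# Invented values for two box operators, for illustration only.
AREA_EQUAL = OperatorStats(stups=0.01, s=0.001, merge_sort_ok=True, hash_join_ok=True)
OVERLAP = OperatorStats(stups=0.05, s=0.01, merge_sort_ok=False, hash_join_ok=False)
```

An equality-like operator such as area equal can be hashed or sorted on, whereas an operator such as overlap offers neither option, so the optimizer is left with a nested-loop join.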
Validation
The author mainly provides a general framework to add user-
defined types to the database. As mentioned previously, the
methods for defining extensible data types and their operators
were presented in [2] and implemented in the INGRES DBMS at UC-Berkeley. For the discussion on access paths and query
optimization, the author does not mention if these methods had
been implemented in a DBMS. Therefore, he seems only to be
discussing the vision and rationale of how to implement the
constructs for access methods and query optimization for
extensible data types.
The actual implementation of access methods and query
optimization was probably beyond the scope of this paper.
Therefore, the ideas could not be validated through
experimental evidence. However, in sections 3 and 4, the
author provides good case studies (through examples) when
discussing his proposals for implementing access methods and
performing query optimization in the context of user-defined
types and operators.
Assumptions
The author discusses performance of extensible data types in the
context of implementation on commercial systems by writing, “An
‘industrial strength’ implementation might choose to specify the
user types which an installation wants at the time the DBMS is
installed” [4]. This is an alternative to dynamically linking
user-defined routines for the extensible data types. While this would
certainly be a performance benefit, the author does not discuss whether
this could actually happen in a commercial setting. It seems that
commercial database vendors would want to keep user-defined code away
from the native code.
The author also implicitly assumes that creating new constructs
(i.e., data types and operators) is empirically better than using
built-in data types to model these non-standard types. In other
words, he is assuming that this custom work (coupled with long-term
support) outweighs the problem of query logic complexity (as
presented in the paper) when using native data types.
When discussing the implementation of access methods, the author
limits his discussion to support for single key fields.
Furthermore, the author also assumes single-dimension access
methods. These two assumptions seem valid given the scope
of the paper, as it discusses a whole framework for extensible
types, operators, access methods, and query optimization.
Making these assumptions allows the author to cover each
topic, rather than covering one particular topic (e.g., access
methods) in depth while glossing over the other topics.
Rewrite
In general, this paper is very well organized and its ideas are
presented in a succinct manner. If we were to rewrite the paper
today, we would focus on improving the following points:
• Add a discussion on query rewrite in the query optimization section.
The author did not discuss query rewrite in the context of
user-defined types.
• Actual implementation of access methods in a DBMS (such as
INGRES) may be beyond the scope of this paper. However, there
could be simulation data and a larger discussion of performance
drawbacks for extensible data types.
• Add more discussion on how this proposal, along with the
References
[1] Hellerstein, J. and Stonebraker, M., “Anatomy of a Database
System,” Readings in Database Systems, Cambridge, Mass.: MIT Press,
2005, 42-95.
[2] Stonebraker, M. et al., “Application of Abstract Data Types and
Abstract Indices to CAD Data,” Proc. Engineering Applications Stream
of Database Week/83, San Jose, Ca., May 1983.
[3] “B+ Trees,” Wikipedia, The Free Encyclopedia. 17 Sep 2006, 10:55
UTC. Wikimedia Foundation, Inc. 10 Aug 2004
<http://en.wikipedia.org/wiki/B%2B_tree>.
[4] Stonebraker, M., “Inclusion of New Types in Relational Data Base
Systems,” Proceedings of ICDE, 1986.