MULTI KEY INDEXING FOR DISTRIBUTED DATABASE MANAGEMENT SYSTEM

BY
MD. SHAZZAD HOSAIN
STUDENT NO: 9505025

For the partial fulfillment of the B.Sc. Engineering Degree in Computer Science and Engineering

SUPERVISED BY
MD. HUMAYUN KABIR, Assistant Professor
&
ABDUL HAKIM NEWTON, Lecturer
Department of Computer Science & Engineering, BUET

Department of Computer Science & Engineering
Bangladesh University of Engineering & Technology, Dhaka – 1000, Bangladesh
Acknowledgement
I would like to express my sincerest appreciation and profound gratitude to my
supervisors Md. Humayun Kabir, Assistant Professor, and Abdul Hakim Newton,
Lecturer, Department of Computer Science and Engineering, BUET, for their
supervision, encouragement, and guidance. Md. Humayun Kabir, in particular, has a
keen interest in distributed databases, and his valuable suggestions and advice were a
constant source of inspiration to me. I would also like to convey my gratitude to all
my course teachers; their teaching helped me greatly in starting and completing this
thesis work. Finally, I would like to acknowledge the assistance and contributions of
a large number of other individuals and express my gratefulness to them.
(Md. Shazzad Hosain )
Abstract

Complex access structures such as indexes are a major aspect of centralized
databases, and support for these structures is one of the most important parts of a
database management system (DBMS). Indexes are provided to obtain fast and
efficient access to data. Most centralized database management systems use B-trees
or other index structures such as bit vectors, graph structures, or grid files. In
distributed databases, however, no index structure is commonly used to obtain fast
and efficient access to the data, so efficient access is a major problem there. We
propose a distributed index model: a data-structure-based index comprising two types
of index structures, the Global Index (GI) and the Local Index (LI). The GI is created
and maintained by the distributed database component (DDB) and the LI by the local
database component (DB) of a distributed database management system. Our
proposed global index uses the techniques of bit vectors, graph structures, and grid
file organization. The distributed index is implemented with a multi-key indexing
technique, which enables us to search for a record by more than one attribute value.
A simulation program tested the proposed model and produced satisfactory results.
Chapter 1
Introduction
1.1 An overview of distributed database system
The traditional database approach keeps all data centrally and accesses them
mostly in a client-server model. In a distributed database system, by contrast, data are
distributed geographically over sites. Let us clarify this feature through an example.
Say a bank has three branches at different sites. There will then be two types of
transactions: local transactions and global transactions. A local transaction operates
on accounts at the same site. If money needs to be transferred from an account at one
site to an account at another site, a global transaction occurs. In that case the program
has to access data across sites, which demands attention to many issues: transactions
over the network, speed, efficient access, integrity, recovery, concurrency control,
privacy, security, and so on. All of these tasks are handled by the DDBMS
(Distributed Database Management System). When a local operation occurs
everything proceeds as usual, but when a global operation is needed it is the DDBMS
that determines which site to access and how to perform the operation.
An important property of a DDBMS is whether it is homogeneous or
heterogeneous. Homogeneity and heterogeneity can be considered at different levels
in a distributed database: the hardware, the operating system, and the local DBMSs.
The level that matters for us, however, is that of the local DBMSs, because
differences at the lower levels are managed by the communication software. The
term homogeneous DDBMS therefore refers to a DDBMS with the same DBMS at
each site, even if the computers and/or operating systems are not the same. A
heterogeneous DDBMS instead uses at least two different DBMSs. This adds to the
complexity of a homogeneous DDBMS the problem of translating between the data
models of the different local DBMSs. So if a global schema for a new system is to be
designed top-down, a homogeneous system is a fine solution; but if different local
DBMSs already exist and must be integrated, a heterogeneous system will inevitably
emerge, and the DDBMS has to cope with this heterogeneity.
1.2 Reference architecture of distributed database
In order to understand distributed database systems we have to study the
reference architecture for distributed databases. It is not explicitly implemented in all
distributed databases, but analyzing and understanding all the components of this
architecture gives better insight into distributed databases.
Fig 1.1: The reference architecture for distributed databases [8]
The reference model has two main parts
1. Site independent schemas
2. Site dependent schemas
1.2.1 Site independent schemas:
These schemas have the following parts:
• Global schema
• Fragmentation schema
• Allocation schema
1.2.1.1 Global schema:
This schema defines all the data contained in the distributed database as if the
database were not distributed at all; it is therefore defined exactly like the schema of
a nondistributed database. However, the data model used for the definition of the
global schema should be suitable for defining the mappings to the other levels of the
distributed database. For this purpose the relational data model will be used. With
this model, the global schema consists of the definition of a set of global relations.
1.2.1.2 Fragmentation schema:
Each global relation can be split into several nonoverlapping portions called
fragments. The mapping between global relations and fragments is defined in the
fragmentation schema. This mapping is one-to-many: several fragments correspond
to one global relation, but only one global relation corresponds to each fragment.
Fragments are indicated by a global relation name with an index (the fragment
index); for example, Ri indicates the ith fragment of global relation R.
1.2.1.3 Allocation schema:
Fragments are logical portions of global relations that are physically located at
one or more sites of the network. The allocation schema defines at which sites a
fragment is located. All the fragments that correspond to the same global relation R
and are located at the same site j constitute the physical image of R at site j. There is
therefore a one-to-one mapping between a physical image and a pair (global relation
name, site index), so a physical image can be indicated by a global relation name
with a site index. To distinguish it from fragments, we will use a superscript; for
example, Rj indicates the physical image of the global relation R at site j.
An example of the relationship between the object types defined above is shown
in fig 1.2. A global relation R is split into four fragments R1, R2, R3, and R4. These
fragments are allocated redundantly at the three sites of a computer network, thus
building three physical images R1, R2, and R3 (site superscripts).

Fig 1.2: Fragments and physical images for a global relation [8]
To complete the terminology, we will refer to a copy of a fragment at a given
site and denote it by the global relation name with two indexes. For example, in fig
1.2 the notation R32 indicates the copy of fragment R2 that is located at site 3.
Finally, note that two physical images can be identical. In this case we say that
one physical image is a copy of the other. For example, in fig 1.2 the physical image
R1 is a copy of R2.
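The fragmentation and allocation schemas above can be sketched as plain mappings. This is a minimal illustration, not part of any real DDBMS; the particular site assignments are an assumption chosen to be consistent with the relationships stated for fig 1.2 (images at sites 1 and 2 are copies, and images 2 and 3 share only fragment 2).

```python
# Fragmentation schema: global relation R is split into fragments R1..R4.
fragments = {1, 2, 3, 4}

# Allocation schema: which fragments are stored at which site
# (an assumed allocation, consistent with fig 1.2's copy/overlap relations).
allocation = {
    1: {1, 2},      # site 1 holds R1 and R2
    2: {1, 2},      # site 2 holds R1 and R2
    3: {2, 3, 4},   # site 3 holds R2 (replicated), R3, and R4
}

def physical_image(site):
    """The set of fragments making up the physical image of R at a site."""
    return allocation[site]

# Physical images R^1 and R^2 are identical, i.e., copies.
assert physical_image(1) == physical_image(2)
# Images R^2 and R^3 overlap exactly in the replicated fragment R2.
assert physical_image(2) & physical_image(3) == {2}
# Completeness: every fragment is allocated somewhere.
assert set().union(*allocation.values()) == fragments
```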
1.2.2 Site dependent schemas:
These schemas have the following parts:
• Local mapping schema
• DBMS of the local site
• Local database at that site
1.2.2.1 Local mapping schema:
Since the top three levels are site independent, they do not depend on the data
model of the local DBMS. At a lower level, it is necessary to map the physical
images to the objects manipulated by the local DBMS. This mapping is called the
local mapping schema and depends on the type of the local DBMS; in a
heterogeneous system we therefore have different types of local mappings at
different sites.
This reference architecture provides a very general conceptual framework for
understanding distributed database systems. The three most important features that
motivated its design are
• Separation of data fragmentation and allocation.
• Explicit control of redundancy.
• Independence from the local DBMSs.
• Separation of data fragmentation and allocation:
This separation allows us to distinguish two different levels of distribution
transparency: fragmentation transparency and location transparency. Fragmentation
transparency is the highest degree of transparency, whereas location transparency is a
lower degree. The separation between the concepts of fragmentation and allocation
is very convenient for distributed database design, because determining the relevant
portions of the data is thereby distinguished from the problem of their optimal
allocation.
• Explicit control of redundancy:
In fig 1.2 the two physical images R2 and R3 overlap; i.e., they contain
common data. Defining disjoint fragments as the building blocks of physical images
allows us to refer explicitly to this overlapping part: the replicated fragment R2. As
we shall see, explicit control over redundancy is useful in several aspects of
distributed database management.
• Independence from the local DBMSs:
The feature of local mapping transparency allows us to build distributed
database systems that are homogeneous or heterogeneous. In a homogeneous system
the site-independent schemata may be defined using the same data model as the local
DBMSs, while in a heterogeneous system the local mapping schemata help
coordinate the different kinds of DBMSs.
Another kind of transparency, strictly related to location transparency, is
replication transparency: the user is unaware of the replication of fragments.
1.2.3 Types of data fragmentation:
Data fragmentation is of two types: horizontal fragmentation and vertical
fragmentation. More complex fragmentation can be obtained by combining these two
types. In all cases, a fragment can be defined by an expression in a relational
language that takes global relations as operands and produces the fragment as its
result. There are some rules that must be followed when defining fragments.
1.2.3.1 Completeness condition:
All the data of a global relation must be mapped into its fragments; i.e., it must
not happen that a data item belonging to a global relation does not belong to any
fragment.
1.2.3.2 Reconstruction condition:
It must always be possible to reconstruct the global relation from its fragments.
This condition is necessary because a distributed database stores only the fragments
at the different sites, and the global relation has to be built through this
reconstruction operation when needed.
1.2.3.3 Disjointness condition:
It is convenient for fragments to be disjoint, so that the replication of data can
be controlled explicitly at the allocation level. This condition is satisfied by
horizontal fragmentation, while in vertical fragmentation it is sometimes violated.
1.2.4 Horizontal fragmentation:
Horizontal fragmentation consists of partitioning the tuples of a global relation
into subsets, which is very useful in a distributed database system: if data are
fragmented by some common property, they can be stored at geographically
convenient sites. We can clarify this with an example. Let a global relation be

SUPPLIER (SNUM, NAME, CITY)

Here SUPPLIER contains the supplier number, the supplier name, and the city
where the supplier lives. If all the suppliers come from the cities of San Francisco
("SF") and Los Angeles ("LA"), then a horizontal fragmentation can be defined in
the following way (SL denoting the selection operation):

SUPPLIER1 = SL CITY="SF" SUPPLIER
SUPPLIER2 = SL CITY="LA" SUPPLIER

The above fragmentation satisfies the completeness condition because "SF" and
"LA" are the only possible values of the CITY attribute; otherwise we would not
know to which fragment tuples with other CITY values belong.
The reconstruction condition is also easily verified, because it is always
possible to reconstruct the SUPPLIER global relation through the union operation
(UN):

SUPPLIER = SUPPLIER1 UN SUPPLIER2
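The fragmentation above can be sketched in a few lines of code, with tuples standing for rows. The supplier rows themselves are made up for illustration; SL (selection) becomes a filter, and UN (union) recombines the fragments, checking the reconstruction condition.

```python
# A toy SUPPLIER relation: (SNUM, NAME, CITY). Rows are invented examples.
SUPPLIER = [
    (1, "Smith", "SF"),
    (2, "Jones", "SF"),
    (3, "Adams", "LA"),
]

def SL(rows, city):
    """Selection: the rows whose CITY attribute equals the given value."""
    return [r for r in rows if r[2] == city]

SUPPLIER1 = SL(SUPPLIER, "SF")
SUPPLIER2 = SL(SUPPLIER, "LA")

# Reconstruction condition: the union of the fragments gives back SUPPLIER.
assert sorted(SUPPLIER1 + SUPPLIER2) == sorted(SUPPLIER)
```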
1.2.5 Vertical fragmentation:
The vertical fragmentation of a global relation is obtained by subdividing its
attributes into groups; fragments are obtained by projecting the global relation over
each group, with the key included in every group so that reconstruction is possible.
Vertical fragmentation is useful when the groups of attributes are used at different
geographical sites. For example, consider a global relation

EMPLOYEE (EMPNUM, SAL, TAX, MGRNUM, DEPTNUM)

A vertical fragmentation of this relation can be defined (PJ denoting projection) as

EMPLOYEE1 = PJ EMPNUM, SAL, TAX EMPLOYEE
EMPLOYEE2 = PJ EMPNUM, MGRNUM, DEPTNUM EMPLOYEE
Fig 3.7: Simplification of vertically fragmented relations: (a) canonical form of query Q4; (b) simplified query [8]
Chapter 4
Multi-key Processing

4.1 Introduction
For primary key indexing, each index entry identifies a unique record in the
main file. Many applications, however, require multi-key retrieval, that is, retrieval
from a file of all records having some combination of attribute values. For example,
a college dean might want to generate a list of all students with
• U.S. citizenship
• Physics or math major
• GPA of at least 3.3
• Student identification number less than 1,500,000
There are likely to be many records in a file with the same value of a particular
secondary attribute, and indexes of various forms are one mechanism for finding
them. Typically a secondary index produces a list of pointers to records having a
particular value of a secondary key.
However, there is a price to be paid. Indexes take up space, and if the file is
changed frequently, much time may be spent updating secondary indexes. Multi-key
queries involving ranges of attribute values are awkward to handle with
conventional indexes; they may be better served by the grid file organization.
4.2 Threaded files
In a threaded file a pointer field is associated with each indexed secondary key
field. The value in the pointer field identifies the next record in the file with the same
value of the secondary key; thus a number of threads run through the file. An index
entry points to the first record having a given attribute value and acts as the header
of a linked list.
Consider a file in which records have k attributes, and suppose that two of these
attributes are threaded in the manner described. If the first of these attributes has N
different values and the second has M different values, then the general form of the
file and its indexes is as shown below.
Record Number | Attribute 1 | Next Pointer | Attribute 2 | Next Pointer | Attribute 3 | Attribute 4 | ... | Attribute K
1             | -           | -            | -           | -            | -           | -           |     | -
2             | -           | -            | -           | -            | -           | -           |     | -
3             | -           | -            | -           | -            | -           | -           |     | -

Fig 4.1: General threaded file [21]
A part of a file of car records with one set of threads (for manufacturer), and
part of the corresponding index, is shown below.

Manufacturer index: Ford: 1, VW: 2, BMW: 3, Audi: 11, Honda: 15

Record No | Manuf. | Next Manuf. | Model   | Color | Next Color | License
1         | Ford   | 4           | Pinto   | White | 4          | HORS4ME
2         | VW     | 6           | Bug     | Red   | 9          | SKIBNY
3         | BMW    | 9           | 322I    | Black | 6          | DADIOUI
4         | Ford   | 5           | Mustang | White | 7          | VALEGRL
5         | Ford   | 7           | Pinto   | Blue  | 8          | RATFACE
6         | VW     | 8           | Rabbit  | Black | 12         | 910VCD
7         | Ford   | 10          | Pinto   | White | 11         | PACMAX
8         | VW     | 12          | Rabbit  | Blue  | 10         | BYE2NOW
9         | BMW    | 16          | 320     | Red   | 14         | CMEGO
10        | Ford   | 14          | Mustang | Blue  | 17         | DPGURI
11        | Audi   | ?           | 5000    | White | 16         | OULSK
12        | VW     | 13          | Jetta   | Black | 18         | LKWJE
13        | VW     | ?           | Bug     | Green | 15         | WEIOU
14        | Ford   | 18          | Mustang | Red   | 19         | SD2332
15        | Honda  | ?           | Civic   | Green | 20         | SDF33
16        | BMW    | 17          | 320     | White | ?          | SDF3
17        | BMW    | ?           | 322I    | Blue  | ?          | SFSDF23
18        | Ford   | 19          | Tempo   | Black | ?          | TRE344
19        | Ford   | 20          | Pinto   | Red   | ?          | 34DRTER
20        | Ford   | ?           | Mustang | Green | ?          | LKJPE

Fig 4.2: Threaded file [21]
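Following a thread is a plain linked-list traversal. The sketch below reproduces just the manufacturer field and next-manufacturer pointer of the Ford records from fig 4.2 (the other fields are omitted); '?' in the figure becomes None here and ends the thread.

```python
# Each entry: record number -> (Manuf., next record with same Manuf. or None).
records = {
    1:  ("Ford", 4),   4:  ("Ford", 5),   5:  ("Ford", 7),
    7:  ("Ford", 10),  10: ("Ford", 14),  14: ("Ford", 18),
    18: ("Ford", 19),  19: ("Ford", 20),  20: ("Ford", None),
}
index = {"Ford": 1}  # the index entry is the header of the linked list

def thread(manufacturer):
    """Yield record numbers by chasing next-pointers from the header."""
    n = index.get(manufacturer)
    while n is not None:
        yield n
        n = records[n][1]

assert list(thread("Ford")) == [1, 4, 5, 7, 10, 14, 18, 19, 20]
```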
4.3 Multi-lists
In the multi-list organization the threads in the main file have the same
structure as in the threaded file organization, but the index entries are different.
Instead of an index entry pointing simply to the beginning of a thread, it now points
to every kth record on the thread (for some value of k). In effect, we have a number
of sublists of length k, and there is a pointer in the index to each sublist. An index
entry now has two links: one to the entry for the next value of the attribute, and a
second to a list of pointers into the main file. With this additional information in the
index, merge operations can be speeded up. Here we give an example of a multi-list
index over fig 4.2, with k taken as 3.
Manufacturer index          Color index
Ford : 1-7-18               White : 1-11
VW   : 2-12                 Red   : 2-19
BMW  : 3-17                 Black : 3-18
Audi : 11                   Blue  : 5-17
Honda: 15                   Green : 13

Fig 4.3: Multi-lists [21]
4.4 Inverted files
The threaded file represents one extreme of the multi-list organization, with
k = ∞. The other extreme is k = 1; in this case the index points to every record with a
particular attribute value. This type of multi-list organization is known as an inverted
file. Virtually all commercially available systems are based on inverted file designs.
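With k = 1 the index is just a mapping from each attribute value to the full list of its record numbers, so a multi-key query becomes a list merge. A minimal sketch, using the Ford and Green record numbers read off fig 4.2 and Python sets for the intersection:

```python
# Inverted file (k = 1): attribute value -> all record numbers with that value.
manuf_index = {"Ford": {1, 4, 5, 7, 10, 14, 18, 19, 20}}
color_index = {"Green": {13, 15, 20}}

# Multi-key query "green Fords": intersect the two inverted lists.
green_fords = manuf_index["Ford"] & color_index["Green"]
assert green_fords == {20}  # record 20 is the only green Ford in fig 4.2
```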
4.4.1 STAIRS: An Application of Inverted Files
IBM’s STAIRS (Storage And Information Retrieval System) is a powerful
document retrieval system. Users can retrieve documents based on their content; for
example, they can retrieve documents containing an arbitrary word, or those
satisfying a complex Boolean expression of words.
The STAIRS system has this capability because it indexes every word
occurrence in the text, in contrast to most document systems, which index a few
selected keywords. STAIRS can thus be classified as a full-text document retrieval
system. Here we give a brief description of the file structures that make retrievals
efficient and show how queries are answered. The file structures we describe are
simplified versions of the actual STAIRS structures.
Matrix → Dictionary → Occurrence file → Index → Documents

Fig 4.4: STAIRS file hierarchy [21]
4.4.2 File structure:
The STAIRS system contains five levels of data structure files, as depicted in
fig 4.4. The lowest level, the documents file, contains the machine-readable
documents. The only change from the conventional representation of a document is
that each paragraph is tagged with a label such as TITLE, TEXT, ABSTRACT, and
so on. In addition, the document contains end-of-sentence codes that the system can
recognize.
The next level in the structure, the index, has one entry for each document. The
entry contains information such as a pointer to the document, protection codes, and
date of entry into the system. The three file levels above the index refer to a document
by a unique document number: the number of its entry in the index.
The occurrence file contains one record for each word occurrence in the document
collection. The information recorded for each word occurrence is:
• Document number
• Paragraph code
• Sentence number
• Position within sentence
The entries in the occurrence file are ordered so that all records for a particular
word are contiguous. Within this grouping, records are stored in the order of the four
fields listed above.
The dictionary contains an entry for each different word in the document
collection, including such common words as ‘the’, ‘of’, and ‘in’. Summary
information, such as the number of times the word occurs and the number of
different documents in which it occurs, is stored together with the word.
Large dictionaries in book form usually have a thumb index, which enables the
user to find an alphabetical section rapidly. The matrix takes this one step further. It
has 26*27 entries, each of which identifies the start of the section of the dictionary
for words beginning with a particular pair of letters. In fact, the matrix eliminates the
need to store the first two letters of the words in the dictionary; this key compression
saves a certain amount of space.
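The matrix lookup can be sketched as simple arithmetic on the first two letters. The 26*27 layout is taken from the text (26 first letters; 27 "second letters" counting a blank for one-letter words); the particular slot formula below is an assumption about how such a matrix might be laid out, not the actual STAIRS code.

```python
def matrix_slot(word):
    """Map a word to its matrix entry from its first two letters.

    First letter: 0..25; second "letter": 0 for a blank (one-letter
    word), else 1..26. Slot = first * 27 + second (assumed layout).
    """
    first = ord(word[0].upper()) - ord("A")
    second = ord(word[1].upper()) - ord("A") + 1 if len(word) > 1 else 0
    return first * 27 + second

# Words sharing their first two letters fall in the same dictionary section,
# so the section for "ma" serves both example words from the text.
assert matrix_slot("macabre") == matrix_slot("mainframe")
```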
Fig 4.5 represents a small part of the top three levels of an example file
collection. We assume in this example that, in the document collection, “macabre” is
alphabetically the first word starting with the letters “ma”. We also assume that the
word “mainframe” occurs a total of 109 times in 20 different documents.
Chapter 5
Index implementation

5.1 Introduction
With multi-list and inverted file indexes there is the problem of maintaining a
variable-length list for each attribute value. This is one of the major problems of
multi-list indexing. Here we consider two alternatives to simple lists: bit vectors and
a general graph structure.
5.2 Bit vectors
A bit vector, in the context of indexes, is an array of two-valued elements
having as many elements as there are records in the main file. Each element
indicates whether or not the corresponding main file record has a particular attribute
value. Fig 5.1 shows bit vector indexes on the Manufacturer attribute of the car file
of fig 4.2.

Figure 5.1: Bit Vector Index [22]
5.3 Graph structure
We can save a certain amount of space in a structure by combining those list
elements that point to the same main file record. Thus each main file record is
represented in an inverted directory file by a single node that is on as many lists as
there are indexed attributes. A node will have, for each indexed attribute, a pointer to
the next node representing a main file record with the same value of that attribute. It is
convenient if each node also contains a pointer back to the owner of the list, that is,
to the index entry for the particular attribute value. Fig 5.2 shows what a node might
look like: a pointer to the main file record, a pointer to the owner node, and, for each
indexed attribute, a pointer to the next node with the same value of that attribute.
Fig 5.3 shows a small part of the graph structure for our car file; only the nodes for
records 1, 4, and 5 are shown.

Fig 5.2: General graph node [22]

Fig 5.3: Partial graph structure [22]
To find green Fords we choose, arbitrarily, one of the two attributes, say
manufacturer. We follow the manufacturer pointers from the first node pointed to by
“Ford”; at each node visited we check the owner of the color attribute to see whether
it is Green.
Since the nodes are distributed all over the file, we may have to access much of
the file to traverse the path, which takes a lot of time for a specific query. The grid
file organization can eliminate this problem.
5.3.1 Comparison of Bit Vectors and Graphs
Bit vectors may at first appear to have higher storage costs than graph
structures. Suppose, however, that the main file contains M records, that P different
attributes are indexed, and that the attributes have an average of N different values.
The storage required for the bit vectors is

N*M*P bits

Assuming that we combine nodes as described above, the storage for the
comparable part of the graph structure (M nodes) is

M*(2*P+1) pointers, which equals (2*M*P)+M pointers

Roughly speaking, if the number of different values of an attribute is less than
the number of bits required to hold two pointers, then a set of bit vectors occupies
less space than the comparable graph structure. A hybrid system might be a
preferred compromise: bit vectors for attributes with few different values, and lists
for attributes with many different values.
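The two formulas can be compared on a worked instance. The record count, attribute count, and 32-bit pointer size below are assumptions chosen for illustration; note the rough rule says bit vectors win when N < 64 (two 32-bit pointers), while the exact break-even for these numbers is N = 72 because of the extra main-file pointer per node.

```python
# M records, P indexed attributes, N values per attribute on average.
M, P, bits_per_pointer = 100_000, 4, 32  # assumed sizes

def bit_vector_bits(N):
    """Storage for the bit vectors: N*M*P bits."""
    return N * M * P

def graph_bits():
    """Storage for M graph nodes: (2*M*P + M) pointers, in bits."""
    return (2 * M * P + M) * bits_per_pointer

# Few distinct values per attribute: bit vectors are smaller.
assert bit_vector_bits(N=50) < graph_bits()
# Many distinct values per attribute: the graph structure is smaller.
assert bit_vector_bits(N=100) > graph_bits()
```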
The biggest advantage of bit vectors is the speed with which simple set operations
can be performed on conventional hardware. Most computers have machine-level
instructions for performing logical operations on bit patterns. In contrast, list merging
is comparatively slow.
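The speed advantage can be seen in a small sketch, with Python integers standing in for fixed-width hardware words. The bit layouts below are transcribed from fig 4.2 (bit i from the left, over the 20 records, is record i+1); the "green Fords" query then reduces to one AND.

```python
# One 20-bit vector per attribute value, records 1..20 left to right.
FORD  = int("10011010010001000111", 2)  # records 1,4,5,7,10,14,18,19,20
GREEN = int("00000000000010100001", 2)  # records 13,15,20

# "Green AND Ford" is a single bitwise AND on the two vectors.
both = FORD & GREEN

# Decode the set bits back to record numbers.
hits = [i + 1 for i in range(20) if both & (1 << (19 - i))]
assert hits == [20]  # record 20 is the only green Ford
```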
5.3.2 Index maintenance
Secondary indexes, like primary indexes, must reflect the contents of the main file
at all times. Although there is only one primary index, there may be several secondary
indexes. Thus maintenance of correct index entries may become a big overhead. Here
we consider how bit lists and graph structures compare in the amount of maintenance
required.
5.3.2.1 Updating
An inverted file must be updated in three cases: when we insert a new record
into the main file, when we delete a record, and when we change the value of a
secondary key of an existing record. In the case of an insertion or a deletion, all
indexes for the file have to be modified. In the case of a single-field modification, at
most one index has to be updated.
To delete a record from the main file we could simply mark it deleted. The
alternative is to rewrite the file and omit the deleted record. The marking operation is
a logical rather than a physical deletion. Logical deletion is simpler. A disadvantage
of physical deletion is that records may change position in the file and thus require
changes in pointers to them. An advantage of physical deletion is the reduction in list
length; this makes subsequent traversals shorter. For some insertions we may be able
to reuse logically deleted records in the main file, in which case there is no problem in
updating a bit vector.
If we change the value of an attribute, the changes in the bit lists are very
small: we simply clear a bit in one list and set it in another. When we modify an
attribute in a graph structure, on the other hand, we must release the node from one
inverted attribute list and assign it to another.
Both the insertion of a new record and a change in an attribute value might
introduce an attribute value not previously represented in the file. In the case of a bit
vector, we must create a new vector with exactly one bit set. In the case of the graph
structure, there will be a new list with exactly one node.
5.3.3 Reliability
Secondary storage tends to be more vulnerable than primary storage to data
corruption. Some of the data structures we have described can be rendered useless if
a critical pointer is damaged. We therefore consider methods of making the
structures more robust, that is, less likely to be damaged irreparably. A general
technique is to provide two or more paths to any record. A file maintenance program
can then check the integrity of access paths; if a damaged path is detected, we can
possibly use another route and repair it. This recovery principle suggests that a
doubly-linked circular list is preferable to a singly-linked list.

5.4 Grid Files
Nievergelt, Hinterberger, and Sevcik describe a secondary key accessing
technique using grids that performs well on both stable and volatile files. We
describe it briefly here.
5.4.1 Design Aims
The design aims of the grid file organization are fourfold:
1. Point queries: The processing of a completely specified query should require no
   more than two disc accesses. A completely specified query (or point query) is
   one in which a single value is specified for each key attribute. An example from
   the car file is: find Manufacturer = Ford, Color = Black, License = 1GWN821,
   Model = Bug.
2. Range queries: Processing of range queries and partially specified queries should
   be efficient. Two examples of such queries are: find Manufacturer = VW, Model
   = Bug, D < License < X; and find Manufacturer = Ford, Color = Green.
3. Dynamic adaptation: The file structure should adapt smoothly to insertions and
deletions.
4. Symmetry: All key fields, whether primary or secondary, should be treated
equally.
5.4.2 Ideal Solution
Assume that records have k keys, and consider the k-dimensional space defined
by the k sets of attribute values. Modifying the bit vector idea given above, we can
conceive of a k-dimensional bit matrix in which each dimension has as many
elements as the corresponding attribute has different values. If a particular element
of the matrix is set to 1, a record exists with the corresponding combination of k
attribute values; if the bit is 0, no such record exists. The matrix of bits is therefore a
complete representation of the set of records.
This organization satisfies the four design aims above, although we have
assumed nothing about how it might be stored. Processing a point query involves
examining a single element. Processing a range query involves processing all
elements in a particular j-dimensional submatrix, j <= k. Insertions and deletions are
carried out by setting and clearing single elements of the matrix. All key fields are
treated equally.
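The ideal matrix can be sketched compactly by storing only the 1-bits, as a set of attribute-value tuples (a representation chosen here for illustration; the text's matrix is conceptually dense, which is exactly why it is impractical to store). Membership in the set plays the role of "the element is 1".

```python
# Sparse view of the ideal k-dimensional bit matrix, k = 4 here:
# a 1-bit is membership of the (Manuf., Model, Color, License) tuple.
matrix = set()

# Insertion: set the element for this combination of attribute values.
matrix.add(("Ford", "Pinto", "White", "HORS4ME"))

# Point query: examine a single element.
assert ("Ford", "Pinto", "White", "HORS4ME") in matrix
assert ("Ford", "Pinto", "Green", "HORS4ME") not in matrix

# Deletion: clear the element again.
matrix.discard(("Ford", "Pinto", "White", "HORS4ME"))
assert not matrix
```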
The k-dimensional matrix, however, is an ideal rather than a practical file
organization; in practice the matrix would be far too large to store. If a file has
records with 4 keys and each attribute has 100 different values, the matrix will have
100,000,000 elements. The grid file organization that we discuss next is in some
ways an approximation of the matrix ideal.
5.4.3 Practical Grid File Implementation
In the grid file organization, partitioning the sets of attribute values reduces the
size of the matrix. For example, we could partition the set of colors into four subsets.
If we regard attribute values as character strings, we might have
Color < F
F <= Color < K
K <= Color < Q
Q <= Color
Thus the color lemon, for example, falls into the third partition (K <= lemon < Q).
The partition points are held in linear scales. The set of k linear scales, one for each
attribute, defines a grid on the k-dimensional attribute space. The space is thus
divided into grid blocks. The number of grid blocks is much smaller than the number
of elements in the matrix. What we have lost, however, is the one-to-one
correspondence between elements of the grid/matrix and possible records.
In the grid file organization, records are stored in buckets. Buckets have a fixed
size, but there can be arbitrarily many of them in a file. The dynamic assignment of
buckets to grid blocks is maintained in the grid directory, which consists of the
linear scales and a grid array. Each element of the grid array contains a pointer to a
bucket; grid array elements that form a k-dimensional rectangle may point to the
same bucket, in which the records with the corresponding attribute values are stored.
We can represent the partitioning of the Color attribute described above by the
following linear scale
Color (F, K, Q)
Suppose that the other three attributes are partitioned similarly and that the linear
scales are
Manufacturer (G, R)
Model (C, H, N, T)
License (B, M, S)
Consider the record with
Manufacturer = Ford, Model = Pinto, Color = Blue, License = BBC1500
The partitions into which the attribute values fall are 1, 4, 1, and 2 respectively.
Therefore the bucket pointed to by

Grid-array [1, 4, 1, 2]

is the only place where the record can be stored.
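Converting a value to its 1-based partition number is a binary search over the linear scale, which `bisect` provides directly. This sketch encodes the four scales given above and reproduces the indexes [1, 4, 1, 2] for the example record.

```python
import bisect

# Linear scales: the partition points from the text, one list per attribute.
scales = {
    "Manufacturer": ["G", "R"],
    "Model":        ["C", "H", "N", "T"],
    "Color":        ["F", "K", "Q"],
    "License":      ["B", "M", "S"],
}

def grid_indexes(record):
    """Map each attribute value to its 1-based partition number."""
    return [bisect.bisect(scales[attr], value) + 1
            for attr, value in record.items()]

record = {"Manufacturer": "Ford", "Model": "Pinto",
          "Color": "Blue", "License": "BBC1500"}
assert grid_indexes(record) == [1, 4, 1, 2]

# The color "Lemon" falls into the third partition (K <= Lemon < Q).
assert bisect.bisect(scales["Color"], "Lemon") + 1 == 3
```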
The grid array of pointers is normally so large that it must be held in secondary
memory. On the other hand, the linear scales would normally fit into main memory.
Next we consider how well the grid file organization achieves the four design aims of
point queries, range queries, dynamic adaptation, and symmetry.
5.4.4 Performance of Grid Files
Point queries: In response to a point query, each of the k specified attribute values
is first transformed into a grid index using the appropriate linear scale. The element of
the grid array selected by the set of k indexes can now be fetched from disc.
Nievergelt, Hinterberger, and Sevcik make a number of suggestions for implementing
the grid array. The calculations involved in mapping a rectangular k-dimensional
array onto linear memory are not complex. The address of a particular element is easy
to compute, and the element can be fetched in one access. A second disc access
fetches the bucket pointed to from the array element. Thus the first design aim is
achieved.
5.4.4.1 Range queries
The second design aim is to answer range queries efficiently. To satisfy this aim it
must be possible to move efficiently along an arbitrary axis of the grid array. That is,
given the address of a particular element, it must be easy to compute the address of
the next or previous element in any of the k-dimensions. For example, to satisfy the
range query
Manufacturer = VW, Model = Bug, D < license < X
We need to process records in buckets pointed to by elements in the rectangle
Grid array [3, 1, 1…4, 2…4]
A linked list implementation of the matrix would satisfy this requirement. An
array implementation enabling direct access of an element given its indexes would
also be suitable.
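The rectangle of grid elements touched by such a range query can be sketched as follows (illustrative code under our own naming; an unspecified attribute receives the full index range):

```python
from bisect import bisect_right

# Linear scales from the running example (partition points per attribute).
SCALES = {
    "Manufacturer": ["G", "R"],
    "Model": ["C", "H", "N", "T"],
    "Color": ["F", "K", "Q"],
    "License": ["B", "M", "S"],
}

def index_range(scale, low=None, high=None):
    """1-based (first, last) grid indexes touched by low <= value <= high.
    None means that bound (or the whole attribute) is unspecified."""
    first = 1 if low is None else bisect_right(scale, low) + 1
    last = len(scale) + 1 if high is None else bisect_right(scale, high) + 1
    return first, last

# Manufacturer = VW, Model = Bug, D < License < X, Color unspecified:
bounds = {"Manufacturer": ("VW", "VW"), "Model": ("Bug", "Bug"),
          "License": ("D", "X")}
rect = [index_range(SCALES[a], *bounds.get(a, (None, None)))
        for a in ("Manufacturer", "Model", "Color", "License")]
print(rect)  # [(3, 3), (1, 1), (1, 4), (2, 4)]
```

The result corresponds to the rectangle Grid array [3, 1, 1…4, 2…4] of the example.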
5.4.4.2 Dynamic adaptation
The third design aim is that the organization should adapt smoothly to insertions
and deletions. Let us consider these in turn.
5.4.4.3 Insertions
If a record must be inserted into a bucket that is already full, then a new bucket is
allocated to the file and records are distributed between the two buckets. There are
two cases to consider: when only one pointer points to the full bucket, and when
more than one pointer does.
If the full bucket is pointed to from more than one element of the grid array, we do
not need to change the partitioning. Records are distributed between the two
buckets according to the current partitioning, and some of the pointers change from
the old bucket to the new one.
Suppose, for example, that the grid element representing records with
Color < F
G <= Manufacturer < R
H <= Model < N
License < B
and the grid element representing records with
Color < F
G <= Manufacturer < R
N <= Model < T
License < B
both point to the same bucket, and that this bucket overflows. A new bucket is
allocated to the file, and one of the two elements of the grid array is changed to point
to it. Records in the overflowing bucket are distributed between the two buckets
according to whether the value of the model attribute is less than N.
If only one grid element points to a bucket, the grid must be refined. One of the
sub ranges represented by the bucket contents must be divided. A new partition point
is added to the appropriate linear scale. One bucket is assigned to each half of the
original grid element, and the records are distributed according to the new
partitioning.
Suppose that after further insertions there is overflow in the bucket pointed to by
the element representing records with
Color < F
G <= Manufacturer < R
N <= Model < T
License < B
Assume further that only one element points to this bucket. We therefore need to
split one of the sub ranges. Choosing, arbitrarily, the Manufacturer dimension, we
could insert a partition at M. The corresponding linear scale is now
Manufacturer (G, M, R)
The number of elements in the grid array increases by 33% because the
manufacturer dimension now has four rather than three sub ranges. Most of the new
elements will point to an existing bucket. For example, the element representing
records with
F <= Color < K
G <= Manufacturer < M
Model < C
M <= License < S
will point to the same bucket as the element representing records with
F <= color < K
M <= Manufacturer < R
Model < C
M <= License < S
We need to allocate a new bucket to the file and distribute records from the
overflowing one. The two buckets will now be pointed to by the array elements
representing
Color < F
G <= Manufacturer < M
N <= Model < T
License < B
And
Color < F
M <= Manufacturer < R
N <= Model < T
License < B
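The two overflow cases can be sketched with toy dictionary structures (this layout is our illustration, not the grid file implementation itself; the scale refinement required in case 1 is elided):

```python
CAPACITY = 2  # tiny bucket capacity so the split is easy to watch

def insert(grid, buckets, element, record):
    """grid maps grid-array elements (index tuples) to bucket ids; buckets
    maps bucket ids to lists of (element, record) pairs."""
    bucket = grid[element]
    if len(buckets[bucket]) < CAPACITY:
        buckets[bucket].append((element, record))
        return
    if sum(1 for b in grid.values() if b == bucket) > 1:
        # Case 2: several grid elements share the full bucket. Allocate a
        # new bucket, repoint this element to it, and redistribute the
        # records according to the (unchanged) grid partitioning.
        new = max(buckets) + 1
        grid[element] = new
        pairs = buckets[bucket]
        buckets[bucket] = [p for p in pairs if grid[p[0]] == bucket]
        buckets[new] = [p for p in pairs if grid[p[0]] == new]
        insert(grid, buckets, element, record)
    else:
        # Case 1: only one element points here, so a linear scale must be
        # refined before splitting (scale refinement elided in this sketch).
        raise NotImplementedError("refine a linear scale, then split")

# Elements (1, 1) and (2, 1) initially share bucket 0:
grid = {(1, 1): 0, (2, 1): 0}
buckets = {0: [((1, 1), "r1"), ((2, 1), "r2")]}
insert(grid, buckets, (2, 1), "r3")
print(grid)     # {(1, 1): 0, (2, 1): 1}
print(buckets)  # {0: [((1, 1), 'r1')], 1: [((2, 1), 'r2'), ((2, 1), 'r3')]}
```

After the split, only pointer movement and record redistribution were needed, exactly as in case 2 of the text.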
5.4.4.4 Deletions
To maintain reasonable storage utilization, two candidate buckets might be
merged if their combined number of records falls below some threshold. The records
would be moved into one of the buckets and pointers to the other reassigned to it. The
empty bucket would be de-allocated from the file. Note that not every pair of buckets
can be merged. Only elements that form a k-dimensional rectangle can point to a
particular bucket.
5.4.4.5 Symmetry
The symmetry of the matrix ideal is preserved in the grid organization. There is no
performance difference between primary and secondary indexing because all indexed
attributes are treated in the same way.
Chapter 6
PROPOSED DISTRIBUTED DATABASE INDEX
6.1 Introduction
In previous chapters we have discussed index techniques and their uses in the
databases. Indices are associated with the main data file. They enable the Database
Management System to access records in the data file quickly and in a random
fashion. There are different types of index structures and algorithms. Each has some
advantages and disadvantages over the others. One structure may be suitable in one
context but not in another. For example, B, B* and B+ trees are often used to
implement dynamic primary key indexes, while for secondary keys, i.e. multi-key
indexing, the usual methods are Bit Vectors, Graph Structures and Grid File
Organization. We have discussed those methods and their advantages and
disadvantages in previous chapters. Now we will show how
we can improve searching in distributed database management systems.
6.2 Distributed Database
In recent years, distributed databases have become an important area of
information processing, and it is easy to foresee that their importance will rapidly
grow. There are both organizational and technological reasons for this trend.
Distributed databases eliminate many of the shortcomings of centralized databases
and fit more naturally in the decentralized structures of many organizations.
A distributed database is a collection of data, which belong logically to the same
system but are spread over the sites of a computer network.
For example, consider a bank that has three branches at different locations. At
each branch, a computer controls the teller terminals of the branch and the account
database of the branch. Each computer with its local account database at one branch
constitutes one site of the distributed database; a communication network connects
computers. During normal operation, the applications requested from the terminals
of a branch need to access only the database of that branch. These
applications are called local applications. An example of a local
application is a debit or a credit application performed on an account stored at the
same branch at which the application is requested. Some applications are called global
applications or distributed applications. A typical global application is a transfer of
funds from an account of one branch to an account of another branch. This application
requires updating the databases at two different branches.
Therefore, a distributed database is a collection of data distributed over different
computers of a computer network. Each site of a network has autonomous processing
capability and can perform local applications. Each site also participates in the
execution of at least one global application, which requires accessing data at several
sites using a communication subsystem. Figure given below shows a typical
Distributed Database.
6.3 Finding records in a Distributed Database
At present, distributed databases are inefficient in locating records since they do
not use any global index structure. For example, if we have a book data file in a
distributed database, the single book data file should be fragmented into several data
files and these fragments should be allocated to different sites of the distributed
database. The fragment information is stored in the fragmentation schema, and the
information regarding the allocation of fragments to sites is stored in the allocation
schema. When a query searches for a book by a particular author, it is split into
subqueries according to the fragmentation and allocation schemas: the fragmentation
schema gives the number and the names of the fragments of the data file, and the
allocation schema gives the sites from which to get those fragments. Using this
information, the Distributed Database Management System submits the subqueries
to the sites and collects the results from the different sites. Sites that do not contain
the queried information will return an empty set, so it is a waste of time to submit
subqueries to those sites.
If we know beforehand which sites contain the required information and submit
the subqueries only to those sites, then searching will be faster and more efficient.
We can achieve this goal by implementing a global index.
6.4 PROPOSED DISTRIBUTED INDEX
6.4.1 INTRODUCTION
The reason for providing indexes is to obtain fast and efficient access to data.
Indexing is a data structure based technique for accessing records in a file. Multi-key
indexing is often graph structured or grid file organized. Though a local database
has an index file to search for a record efficiently, a distributed database has no such
facility. Our aim is to organize the whole structure so that global queries can be
executed more efficiently and faster.
6.4.2 Local Index Architecture
In local Database Management Systems the primary key index is implemented
with B/B*/B+ trees. Multi-key indexes are often implemented by Bit Vectors, Graph
Structures or Grid File Organization, of which Graph Structure and Grid File are the
most widely used.
6.4.3. Comparison among Bit Vector, Graph Structure and Grid File
Organization
6.4.3.1 Advantages and disadvantages of Bit vector
If we implement an index on one attribute, then we need a two-dimensional bit
vector as shown in chapter 5. This bit vector is efficient because of the speed with
which simple set operations can be performed on conventional hardware. But if we
want to implement a multi-list index with bit vectors, then a three-dimensional bit
vector is necessary, where the third dimension represents the fields on which the
index is to be implemented. For example, suppose we want to make a multi-list
index on the attributes Manufacturer, Model and Color of a car. Then the third
dimension, or the z-axis, of the bit vector represents the bit vectors of Manufacturer,
Model and Color. Now, to answer the query Manufacturer = Ford, Model =
Mustang, and Color = Green, we have to access the corresponding bits from the bit
vector and check that all three bits are 1. But it is not easy to access those three bits
at the same time. We face the same problem when we delete or update a record.
Again, different attributes have different numbers of attribute values, so the bit
vector is not always equilateral. This also makes maintaining the bit vectors
difficult. These problems are solved by graph structures.
6.4.3.2 Advantages and disadvantages of Graph Structure
To eliminate the problems of the bit vector, the graph structure represents the
record information in a concise way. As described in previous chapters, to answer a
query we have to traverse the nodes along a single path of any one attribute. As the
attribute's nodes may be scattered all over the index file, this requires checking
nodes throughout the entire index file, which may cause the index file to be accessed
several times and makes the process inefficient. These problems are solved in grid
file organization, where a particular record can be found in only two disk accesses.
6.4.4 Distributed Index Architecture
6.4.4.1 Introduction
To provide efficient access to the data, we propose a distributed index. The
distributed index is also a data structure based index comprising two types of index
structures: the Global Index (GI) and the Local Index (LI). Figure 6.2
shows the proposed distributed structure.
Fig 6.2: Architecture of Distributed Index [8]
GI is created and maintained by distributed database component (DDB) of
distributed database management systems (DDBMS). LI is created and maintained by
local database management component (DB) of DDBMS. Our study shows that Bit
Vector, Graph Structure and Grid File Organization all have advantages in some
ways and disadvantages in others. For this reason GI uses a combination of the
techniques of Bit Vector, Graph Structure and Grid File Organization, while LI is
implemented as a grid file. For every site there is a Local Index (LI), which is
created, updated and used independently. Like the other local database management
components, LI enjoys autonomy at each site. There is a single global index (GI) for
a distributed index. GI is created, updated and used based on the local indexes. All
the local indexes are mapped onto the global index.
When a record is searched in a distributed database, GI is used first to determine
which LI needs to be used to find the data. After selecting the right LI it is used to
access records in the corresponding site. In this way, distributed index ensures
efficient access to the data in a distributed database. If a record inserted, deleted or
updated at a site introduces a new combination of indexed field values, or removes
the last record with such a combination, the record information is passed to all other
sites. The other sites update their own global indexes according to that information.
The proposed Global Index (GI) is a combination of Bit Vector, Graph Structure
and Grid File. In a grid file the actual records of the database are stored in buckets,
as described in previous chapters, and the index file is created to access those
buckets. The grid directory accelerates this access; it consists of linear scales and a
grid array, where each element contains a pointer to a bucket. In a Graph Structure,
on the other hand, all the records are stored concisely in the index file as graph
nodes, where each node holds only two pointers per indexed attribute, the forward
pointer and the back pointer, plus an original pointer to the record in the main file.
Our main goal is to optimize query submission to database sites so that
unnecessary submissions are avoided. Hence we want to know every distinct record
held at the other sites, which can be recorded compactly in the form of graph nodes.
For example, if there are M records in all sites but only N (N < M) distinct records,
then we have to keep information on just those N records. Let us clarify this with an
example. We have a database of Manufacturer, Model, Color and License of cars at
two different sites, where a multi-key index is created on the Manufacturer, Model
and Color fields. The site records are given below.
Records in site1:
Rec. no Manuf. Model Color License
1 Ford Pinto Green 23023234
2 VW Civic White 23424244
3 BMW Bug Red 43543535
4 Ford Mustang Black 65435435
5 BMW Mustang White 45645654
6 Honda Tempo Green 23432543
7 VW Civic White 54654645
8 BMW Bug Red 34543543
9 Ford Pinto Green 54654664
10 Honda Tempo Green 54654654
Here there are 10 records but only 6 distinct ones according to the three indexed
fields, since records 1 & 9, 2 & 7, 3 & 8, and 6 & 10 have the same values in those
fields.
Records at site2:
Record no Manuf. Model Color License
1 Ford Mustang White 23432424
2 VW Civic White 34655466
3 Ford Pinto Green 45654646
4 BMW Pinto Green 65765353
5 Ford Mustang White 32984983
6 Ford Pinto Green 56765756
7 Honda Tempo Red 54366547
8 VW Civic White 54765466
9 BMW Pinto Green 45645664
10 Honda Tempo Red 45654654
Similarly, here there are only 5 distinct records. Across both sites there are only 9
distinct records in total, and we have to keep information about these 9 records in
our global index (GI) file.
Our aim is that if we search for a record with Manuf = Honda, Model = Tempo
and Color = Green, we submit the query only to site 1, as site 2 has no such record;
but if we search for a record with Manuf = Ford, Model = Pinto and Color = Green,
we submit the query to both sites.
To achieve this goal we store the 9 distinct records in the global index file in the
form of graph nodes, and the graph nodes are stored in the fashion of a Grid File
Organization. Here each graph node consists of three back pointers and a bit vector,
where the bit vector gives the site addresses of that specific record and replaces the
original pointer to the record in the data file. A graph node thus has the following
layout:
Bit vector
Back Ptr1
Back Ptr2
Back Ptr3
The corresponding graph nodes of the two records with Manufacturer = Honda,
Model = Tempo, Color = Green and Manufacturer = Ford, Model = Pinto, Color =
Green have site bit vectors 10 and 11 respectively: the first record exists only in
site 1, while the second exists in both sites.
In the global index file these records are stored just as the original records of a
database are stored in a grid file. By partitioning the entire range of values of the
individual fields into linear scales, we access the bucket of these records through the
grid array and then find the desired records by searching through the nodes linearly.
The whole process works like Grid File Organization, with the nodes treated as
original records. Let us make this clear using the example of the two database sites
above.
The different values of the three attributes are given in the following table
Manufacturer    Model      Color
BMW             Bug        Black
Ford            Civic      Green
Honda           Mustang    Red
VW              Pinto      White
                Tempo
Now we could partition the three sets of attribute values into subsets. If we regard
attribute values as character strings, we might have
Manufacturer < G , G <= Manufacturer
Model < K, K <= Model < R, R <= Model
Color < H, H <= Color
Thus the color Black falls into the first partition (Black < H). Similarly, Model
Pinto falls into the second partition (K <= Pinto < R) and Manufacturer Ford falls into
the first partition (Ford < G). The partition points are held in linear scales. The set of k
linear scales, one for each attribute, defines a grid on the k-dimensional attribute
space. The space is thus divided into grid blocks.
Now, consider the site1 record with
Manufacturer = Ford, Model = Mustang and Color = Black
The partitions into which the attribute values fall are 1, 2 and 1 respectively.
Therefore the bucket pointed to by Grid-array[1,2,1] is the only place where the
record would be stored.
The grid array of pointers is normally so large that it must be held in secondary
memory. On the other hand, the linear scales would normally fit into main memory.
The grid array contains the pointers of the buckets where the records are stored. The
grid array looks like figure 6.3.
Fig 6.3: Grid-Array pointing to buckets [15]
So the record with Manufacturer = Ford, Model = Mustang and Color = Black
falls into bucket3 along with the other records that belong to the group of grid-array
[1, 2, 1]. We can find the position of the group [1, 2, 1] in the grid array using the
following formula:
In general, if the grid element is [i, j, k] and the three linear scales have P, Q and R
subsets respectively, then the grid array index is (i-1)*Q*R + (j-1)*R + (k-1). In our
example i = 1, j = 2, k = 1, P = 2, Q = 3 and R = 2, so the index value of the grid
array is (1-1)*3*2 + (2-1)*2 + (1-1) = 2.
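This row-major calculation can be checked with a few lines of Python (the function name is ours):

```python
def grid_array_index(i, j, k, Q, R):
    """Row-major index of 1-based grid element [i, j, k]; the scales have
    P, Q and R subsets, but P is not needed for the calculation."""
    return (i - 1) * Q * R + (j - 1) * R + (k - 1)

print(grid_array_index(1, 2, 1, Q=3, R=2))  # 2, the pointer to bucket 3
print(grid_array_index(2, 3, 2, Q=3, R=2))  # 11, the last of the 12 elements
```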
After finding the bucket pointer in the grid-array, we access the bucket from the
file and search linearly through the entire bucket. The record can exist only in that
bucket. During the search we visit each node in the bucket and follow its back
pointers to check the real attribute values. The organization of Bucket 3 is given in
figure 6.4.
Fig 6.4: Proposed Global Index File Organization
From the above figure we see that the record with Manufacturer = Ford, Model =
Mustang and Color = Black is found at the second position of the bucket. At that
node the bit vector is ‘10’, that is, the record exists only in site1 and not in site2.
Hence we submit the query for the record to site1 but not to site2.
6.4.4.2 Distributed Global Index Searching (GI Searching)
Searching point query
When searching for a record with a given key value, we search the global index
first to find the right local indexes. We start searching the global index by
computing the grid array index using the linear scales on which the field values are
partitioned. From the grid array index we obtain the bucket pointer from the grid
array and access the corresponding bucket. Then we search for the record
throughout the entire bucket with the help of the back pointers of the nodes. If the
record is found, we read the site bit vector from the node and send the query to the
sites in which the record actually exists.
Search Point Query in GI ( searchKey )
{
get the grid array index using linear scales
find the bucket pointer from grid array using the index
access the bucket from index file
search the desired record in the bucket
if the record is found, get the bit vector of that record
submit the query, according to the bit vector, to the sites where the corresponding
record actually exists
}
When the subqueries are submitted to the local sites, the local sites search for the
records using their own local indexes. Here the actual records are stored in a grid
file, so the local index consists of a grid directory. Using this LI, the local database
management system searches for the record and returns the result to the site that
issued the original query. The search technique is the same as the GI search for a
particular graph node in the global index file. The LI search technique is given
below:
Search Point Query in LI ( searchKey )
{
get the grid array index using linear scales
find the bucket pointer from grid array using the index
access the bucket from main file
search the desired record in the bucket
if the record is found, return it to the site that issued the original query
}
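The GI point search above can be sketched in Python; the dict-based grid array and bucket layout below are our illustration, with the linear scales taken from the running example:

```python
from bisect import bisect_right

# Linear scales of the running example (chapter 6).
SCALES = {"Manufacturer": ["G"], "Model": ["K", "R"], "Color": ["H"]}
ATTRS = ("Manufacturer", "Model", "Color")

def gi_point_search(grid_array, buckets, record):
    """Return the site bit vector for `record`, or None if no site has it."""
    # Step 1: linear scales -> 1-based grid indexes.
    idx = tuple(bisect_right(SCALES[a], record[a]) + 1 for a in ATTRS)
    # Step 2: grid array -> bucket pointer, then fetch the bucket.
    bucket = buckets[grid_array[idx]]
    # Step 3: linear scan of the bucket's graph nodes.
    for node in bucket:
        if all(node[a] == record[a] for a in ATTRS):
            return node["bits"]          # e.g. "10" = site 1 only
    return None

grid_array = {(1, 2, 1): 3}
buckets = {3: [{"Manufacturer": "Ford", "Model": "Mustang",
                "Color": "Black", "bits": "10"}]}
rec = {"Manufacturer": "Ford", "Model": "Mustang", "Color": "Black"}
print(gi_point_search(grid_array, buckets, rec))  # 10
```

Only the bucket selected by the grid array is scanned, mirroring the two-access behaviour of the grid file.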
Searching range query
Two examples of such queries are: find Manufacturer = VW, Model = Bug,
D < License < X; and find Manufacturer = Ford, Color = Green.
Search Range Query in GI ( searchKey )
{
get the grid array index range using linear scales
find the bucket pointers from grid array using the index
access the buckets from index file
search the desired record in the buckets
if records are found, get the bit vectors of those records
do an OR operation among the bit vectors
if all bits become one
submit the query to all the sites
else
submit the query only to the sites that have those records
}
Algorithm for searching in Global Index of a Distributed Database
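The final site-selection step of the range search, OR-ing the matching nodes' bit vectors, can be sketched as follows (the function name is ours; bit vectors are kept as strings, one character per site, as in the running example):

```python
def sites_to_query(bit_vectors, n_sites):
    """OR together site bit vectors and list the 1-based sites to query."""
    combined = 0
    for bv in bit_vectors:
        combined |= int(bv, 2)          # interpret "10" as binary
    return [s + 1 for s in range(n_sites)
            if combined & (1 << (n_sites - 1 - s))]

# Two matching nodes: one in site 1 only, one in both sites.
print(sites_to_query(["10", "11"], n_sites=2))  # [1, 2]
print(sites_to_query(["10", "10"], n_sites=2))  # [1]
```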
6.4.4.3 Global Index Insertion (GI Insertion)
Suppose a record with Manufacturer = Ford, Model = Mustang, and Color = Black
is inserted into the local database at site 2. Each time a record is inserted, it is
checked whether it is a new combination of values of the indexed attributes. Here
this record is new for site2, so site2 sends the values of this record to the other sites,
such as site1. Site1 then locates the bucket where this record information could be
stored. As mentioned above, this record falls into Grid Array [1, 2, 1], and from this
we know that the record information is stored in bucket 3. Bucket3 is then searched,
and we find the record at position two of the bucket. We then turn on the bit of that
node for site2 and save the node in the bucket again. The site bit vector now looks
like 11.
Let us now try another example. We want to insert a record with Manufacturer =
Audi, Model = Pinto, and Color = White. After inserting the record into site1, it is
found to be a new combination of indexed attribute values, so its information is
passed to the other sites to maintain the GI. By the linear scales this record falls into
Grid Array [1, 2, 2], i.e. bucket4. Bucket4 is searched to see whether the record
already exists. In this case it does not, so a new node is created with attribute value
pointers pointing to the original values, and its site bit vector is set to 10 as the
record exists only in site1.
While creating a new node in a bucket, it may be found that the bucket is already
full. In that case the bucket should be split into two buckets, and the records
distributed between them. The splitting process depends on two cases:
1. when only one pointer points to the bucket
2. when more than one pointer points to the bucket
Let us now clarify the process with two examples. Before this we assume that, at
present, each of the Grid Array elements ([1, 1, 1] to [2, 3, 2]) points to its own
bucket.
Case 1
Suppose that while inserting a record with Manufacturer = BMW, Model = Pinto,
and Color = Black we find that the corresponding bucket is full. According to the
linear scales this record falls into the bucket pointed to by Grid Array [1, 2, 1], i.e.
the 3rd bucket, and only one pointer points to it. In that case one of the sub ranges
represented by the
bucket contents must be divided. Choosing, arbitrarily, the Manufacturer dimension,
we could insert a partition at C. The corresponding linear scale is now
Manufacturer (C, G), i.e.
Manufacturer < C, C <= Manufacturer < G, G <= Manufacturer
Previously the number of Grid Array elements was 2*3*2, or 12, but now the
number of Grid Array elements is 3*3*2, or 18. The number of elements increases
by 50%. Most of the new elements will point to an existing bucket. The new Grid
Array structure is given below:
Fig 6.5: Insertion of record in (case 1)
Previously bucket1 was pointed to by only one pointer, Grid Array [1, 1, 1], but
now bucket1 is pointed to by two pointers, Grid Array [1, 1, 1] and Grid Array [2, 1, 1].
Previously the records, which fall into group
Manufacturer < G
Model < K
Color < H
Now fall into two groups. The groups are
Manufacturer < C, Model < K, Color < H
and
C <= Manufacturer < G, Model < K, Color < H
But, since the records of both groups were in bucket1, now both the pointers point
to bucket1. Similarly both Grid Array [1, 1, 2] and Grid Array [2, 1, 2] point to
bucket2 and so on.
We actually split bucket3, which was previously pointed to by Grid Array [1, 2, 1].
Now we allocate a new bucket, bucket13, as there were 12 buckets previously, and
distribute the records of bucket3 to bucket3 and bucket13 according to the following
two groups:
1. Manufacturer < C, K <= Model < R, Color < H
2. C <= Manufacturer < G, K <= Model < R, Color < H
Suppose we keep group1 in bucket3 and group2 in bucket13. Then Grid Array [1, 2, 1]
still points to bucket3 and Grid Array [2, 2, 1] now points to bucket13.
Case 2
Let us now insert a record that falls into bucket2, which is pointed by two
pointers, Grid Array [1, 1, 2] and Grid Array [2, 1, 2]. Now if we find that the bucket
is full then we don’t split the linear scales. Rather we will allocate a new bucket,
bucket14, and distribute the records of bucket2 into those two buckets according to
the following groups:
1. Manufacturer < C, Model < K, H <= Color
2. C <= Manufacturer < G, Model < K, H <= Color
If we keep group1 in bucket2 and group2 in bucket14 then Grid Array [1, 1, 2]
still points to bucket2 and Grid Array [2, 1, 2] now points to the newly created
bucket14. The related picture is drawn below for further clarification:
Fig 6.6: Insertion of record in (case 2)
Insert LI ()
{
Insert the record in local database
If the record is new
send it to all other sites
Insert GI ( inRec )
}
Insert GI ( inRec )
{
Search GI ( inRec )
If inRec exists in global index
then simply turn on the bit of the corresponding site
Else
insert the inRec into the bucket
if bucket is full
find whether the bucket is pointed by one pointer or more than
one pointer
if the bucket is pointed by only one pointer
randomly select any one of the linear scales
divide it at its middle
create a bucket and distribute the records according to
new linear scales
create a node in the corresponding bucket
turn on the bit of the corresponding site
save the node in the bucket
else if the bucket is pointed by more than one pointer
create a new bucket
make necessary change in grid array pointers
distribute records according to grid array pointers and
their subdivisions
create a graph node in the appropriate bucket
turn on the bit of the corresponding site and save it
}
Algorithm for inserting records in Global Index of a Distributed Database
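The non-splitting path of Insert GI, turning on the inserting site's bit or creating a fresh node, can be sketched as follows (the dictionary-based nodes are our illustration; bucket overflow is handled as in the algorithm above):

```python
def insert_gi(bucket, record, site, n_sites):
    """bucket: list of graph nodes the record's grid element points to;
    record: dict of indexed attribute values; sites are numbered from 1."""
    for node in bucket:
        if all(node[k] == record[k] for k in record):
            bits = list(node["bits"])
            bits[site - 1] = "1"          # turn on this site's bit
            node["bits"] = "".join(bits)
            return
    # New combination of indexed values: create a node for it.
    node = dict(record)
    node["bits"] = "".join("1" if s == site else "0"
                           for s in range(1, n_sites + 1))
    bucket.append(node)

bucket = [{"Manufacturer": "Ford", "Model": "Mustang",
           "Color": "Black", "bits": "10"}]
insert_gi(bucket, {"Manufacturer": "Ford", "Model": "Mustang",
                   "Color": "Black"}, site=2, n_sites=2)
print(bucket[0]["bits"])  # 11
```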
6.4.4.4. Global Index Deletion (GI Deletion)
When a record is deleted from a local database, it is checked whether any more
such records remain in that database. If none remain, the information is passed to
the other sites so they can delete the record information from their GIs. When a site
receives record information to delete from its global index (GI), it searches for the
corresponding node in the buckets, and if it finds the node, it turns off the
corresponding site bit in the site bit vector of that node. It then checks whether all
the bits of the site bit vector are zero. If so, there are no such records at any site, and
it deletes the node from the bucket.
To maintain reasonable storage utilization, two candidate buckets might be
merged if their combined number of records falls below some threshold. The records
would be moved into one of the buckets and pointers to the other reassigned to it. The
empty bucket would be de-allocated from the file. Note that not every pair of buckets
can be merged. Only elements that form a k-dimensional rectangle can point to a
particular bucket.
Delete LI ()
{
delete a record from local database
if there is no such record any more
call Delete GI ( outRec ) at all other sites to delete the record information
}
Delete GI ( outRec )
{
Search GI ( outRec )
If the outRec is found in the bucket
Turn off the bit of the corresponding site
If all site bits are zero
delete the record from bucket
If the number of records in the bucket falls below some threshold
Move records into one bucket
Re-assign the other pointers to it
}
Algorithm for deleting records in Global Index of a Distributed Database
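The bit-maintenance core of Delete GI can be sketched as follows (our illustrative structures again; bucket merging is elided):

```python
def delete_gi(bucket, record, site):
    """Turn off `site`'s bit for `record`; drop the node when no site
    holds the record any longer."""
    for node in bucket:
        if all(node[k] == record[k] for k in record):
            bits = list(node["bits"])
            bits[site - 1] = "0"          # turn off this site's bit
            node["bits"] = "".join(bits)
            if set(node["bits"]) == {"0"}:
                bucket.remove(node)       # record gone from every site
            return

bucket = [{"Manufacturer": "Ford", "Model": "Pinto",
           "Color": "Green", "bits": "11"}]
rec = {"Manufacturer": "Ford", "Model": "Pinto", "Color": "Green"}
delete_gi(bucket, rec, site=2)
print(bucket[0]["bits"])  # 10
delete_gi(bucket, rec, site=1)
print(bucket)             # []
```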
6.5 Performance evaluation
The Global Index structure combines the techniques of Bit Vectors, the Graph
Structure and Grid File Organization. If one indexed attribute has P distinct values,
another has Q distinct values and the third has R distinct values, then the index has to
keep at most P*Q*R records, which is less than the total number of actual records
across all sites. Previously the Graph Structure used two pointers per attribute; now
only one pointer is used. Furthermore, the bit vector of site addresses replaces the
pointer into the original record data file, so a graph node now takes almost half the
space it did before. Since the graph nodes, i.e. the index records, are stored in the GI
file using Grid File Organization, a point query can locate a record in two disk file
accesses. The organization also supports range queries and dynamic adaptation such
as insertion, deletion and updating. Because a single bit records whether or not a
record exists at a site, both the space requirement and the execution time are reduced.
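The space bounds above can be checked with a quick calculation. The attribute-value counts below are illustrative assumptions (only the 8-site and 400,000-record figures come from the evaluation that follows):

```python
# Quick arithmetic for the space bounds discussed above; P, Q, R are
# illustrative assumptions, not measurements from the thesis.

P, Q, R = 20, 50, 10           # distinct values per indexed attribute
sites, total_records = 8, 400_000

max_gi_nodes = P * Q * R       # at most one GI node per distinct key
assert max_gi_nodes <= total_records

# One bit per site replaces a per-site pointer: with 8 sites a single
# byte stands in for up to eight 4-byte record pointers.
bit_vector_bytes = (sites + 7) // 8
pointer_bytes = sites * 4
print(max_gi_nodes, bit_vector_bytes, pointer_bytes)  # 10000 1 32
```

The point of the sketch is only that the GI node count is bounded by the product of distinct attribute values, independent of how many copies of each record the sites hold.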
The performance of the global index depends on how the records of the original
database are distributed over the sites. For example, if a record with Manufacturer =
Ford, Model = Pinto and Color = Black exists at 2 or 3 of 8 sites, we gain
considerable efficiency. If the record exists at more than 4 sites, we gain no better
performance. And if the record exists at all sites, the global index adds overhead:
without the GI the record is searched at the local sites only, whereas now it is
searched in the global index file as well as at every local site where it exists. In
short, we gain efficiency from the GI when a record exists at fewer than 50% of the
sites, so how the records are distributed over the sites is a matter for further inquiry.
We wrote a simulation program that assumes 8 sites and 100,000 distinct
individual records with respect to the indexed attributes. Each site holds 50,000
records chosen at random, so 400,000 actual records are distributed over the 8 sites;
an individual record can therefore exist at no more than 4 sites, i.e. 50% of the total
number of sites. Under these conditions we ran 10,000 random point queries. The
results show that a query needs 8*N comparisons before implementing the GI, where
N is the number of records in a bucket, but only 7*N comparisons after implementing
it. Our performance gain is therefore equivalent to 12.5% in terms of comparisons.
After implementing the GI, the network overhead is also reduced to 50% of its
previous cost.
The above result holds when the probability of record distribution among the sites
is 0.5. As that probability falls, the performance gain increases linearly.