
Watermarking Relational Databases Using Optimization-Based Techniques

Mohamed Shehab, Member, IEEE, Elisa Bertino, Fellow, IEEE, and Arif Ghafoor, Fellow, IEEE

Abstract—Proving ownership rights on outsourced relational databases is a crucial issue in today’s Internet-based application environments and in many content distribution applications. In this paper, we present a mechanism for proof of ownership based on the secure embedding of a robust imperceptible watermark in relational data. We formulate the watermarking of relational databases as a constrained optimization problem and discuss efficient techniques to solve the optimization problem and to handle the constraints. Our watermarking technique is resilient to watermark synchronization errors because it uses a partitioning approach that does not require marker tuples, which overcomes a major weakness of previously proposed watermarking techniques. Watermark decoding uses a threshold-based technique characterized by an optimal threshold that minimizes the probability of decoding errors. We built a proof-of-concept implementation of our watermarking technique and show experimentally that it is resilient to tuple deletion, alteration, and insertion attacks.

Index Terms—Watermarking, digital rights, optimization.

1 INTRODUCTION

The rapid growth of the Internet and related technologies has offered an unprecedented ability to access and redistribute digital content. In such a context, enforcing data ownership is an important requirement, which requires articulated solutions encompassing technical, organizational, and legal aspects [25]. Although we are still far from such comprehensive solutions, in recent years watermarking techniques have emerged as an important building block that plays a crucial role in addressing the ownership problem. Such techniques allow the owner of the data to embed an imperceptible watermark into the data. A watermark describes information that can be used to prove the ownership of data, such as the owner, origin, or recipient of the content. Secure embedding requires that the embedded watermark must not be easily tampered with, forged, or removed from the watermarked data [26]. Imperceptible embedding means that the presence of the watermark is unnoticeable in the data. Furthermore, the watermark detection is blinded, that is, it requires knowledge of neither the original data nor the watermark. Watermarking techniques have been developed for video, images, audio, and text data [24], [12], [15], [2], and also for software and natural language text [7], [3].

By contrast, the problem of watermarking relational data has not been given appropriate attention. There are, however, many application contexts for which data represent an important asset, the ownership of which must thus be carefully enforced. This is the case, for example, of weather data, stock market data, power consumption, consumer behavior data, and medical and scientific data. Watermark embedding for relational data is made possible by the fact that real data can very often tolerate a small amount of error without any significant degradation with respect to their usability. For example, when dealing with weather data, changing some daily temperatures by 1 or 2 degrees is a modification that leaves the data still usable.

To date, only a few approaches to the problem of watermarking relational data have been proposed [1], [23]. These techniques, however, are not very resilient to watermark attacks. In this paper, we present a watermarking technique for relational data that is highly resilient compared to these techniques. In particular, our proposed technique is resilient to tuple deletion, alteration, and insertion attacks. The main contributions of the paper are summarized as follows:

• We formulate the watermarking of relational databases as a constrained optimization problem and discuss efficient techniques to handle the constraints. We present two techniques to solve the formulated optimization problem based on genetic algorithms (GAs) and pattern search (PS) techniques.

• We present a data partitioning technique that does not depend on marker tuples to locate the partitions and, thus, it is resilient to watermark synchronization errors.

• We develop an efficient technique for watermark detection that is based on an optimal threshold. The optimal threshold is selected by minimizing the probability of decoding error.

• With a proof of concept implementation of our watermarking technique, we have conducted experiments using both synthetic and real-world data. We have compared our watermarking technique with previous approaches [1], [23] and shown the superiority of our technique with respect to all types of attacks.

116 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 20, NO. 1, JANUARY 2008

• M. Shehab is with the Department of Software and Information Systems, University of North Carolina at Charlotte, 9201 University City Blvd., Charlotte, NC 28223. E-mail: [email protected].

• E. Bertino is with the Department of Computer Science, Purdue University, 250 N. University Street, West Lafayette, IN 47906. E-mail: [email protected].

• A. Ghafoor is with the School of Electrical and Computer Engineering, Purdue University, Electrical Engineering Building, 465 Northwestern Ave., West Lafayette, IN 47907. E-mail: [email protected].

Manuscript received 8 Aug. 2005; revised 20 June 2007; accepted 18 July 2007; published online 4 Sept. 2007. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number TKDE-0303-0805. Digital Object Identifier no. 10.1109/TKDE.2007.190668.

1041-4347/08/$25.00 © 2008 IEEE Published by the IEEE Computer Society

The paper is organized as follows: Section 2 discusses related work, including the available relational database watermarking techniques and their shortcomings. Section 3 gives an overview of our watermarking technique, covering the watermark encoding and decoding stages. Section 4 discusses the data partitioning algorithm. The watermark embedding algorithm is described in Section 5. Sections 6 and 7 discuss the decoding threshold evaluation and the watermark detection scheme. Section 8 presents the attacker model. The experimental results are presented in Section 9. Finally, conclusions are given in Section 10.

2 RELATED WORK

Agrawal and Kiernan [1] proposed a watermarking algorithm that embeds the watermark bits in the least significant bits (LSBs) of selected attributes of a selected subset of tuples. This technique does not provide a mechanism for multibit watermarks; instead, only a secret key is used. For each tuple, a secure message authentication code (MAC) is computed using the secret key and the tuple’s primary key. The computed MAC is used to select candidate tuples, attributes, and the LSB position in the selected attributes. Hiding bits in LSBs is efficient. However, the watermark can be easily compromised by very trivial attacks. For example, a simple manipulation of the data by shifting the LSBs by one position easily leads to watermark loss without much damage to the data. Therefore, the LSB-based data hiding technique is not resilient [21], [8]. Moreover, it assumes that the LSBs in any tuple can be altered without checking data constraints. Simple unconstrained LSB manipulations can easily generate undesirable results such as changing the age from 20 to 21. Li et al. [18] have presented a technique for fingerprinting relational data by extending Agrawal et al.’s watermarking scheme.

Sion et al. [23] proposed a watermarking technique that embeds watermark bits in the data statistics. The data partitioning technique used is based on the use of special marker tuples, which makes it vulnerable to watermark synchronization errors resulting from tuple deletion and tuple insertion; thus, such a technique is not resilient to deletion and insertion attacks. Furthermore, Sion et al. recommend storing the marker tuples to enable the decoder to accurately reconstruct the underlying partitions; however, this violates the blinded watermark detection property. A detailed discussion of such attacks is presented in Section 8. The data manipulation technique used to change the data statistics does not systematically investigate the feasible region; instead, a naive unstructured technique is used, which does not make use of the feasible alterations that could be performed on the data without affecting its usability. Furthermore, Sion et al. proposed a threshold technique for bit decoding that is based on two thresholds. However, the thresholds are arbitrarily chosen without any optimality criteria. Thus, the decoding algorithm exhibits errors resulting from the nonoptimal threshold selection, even in the absence of an attacker.

Gross-Amblard [11] proposed a watermarking technique for XML documents and theoretically investigated links between query result preservation and acceptable watermarking alterations. Another interesting related research effort is found in [17], where the authors proposed a fragile watermark technique to detect and localize alterations made to a database relation with categorical attributes.

3 APPROACH OVERVIEW

Fig. 1 shows a block diagram summarizing the main components of the watermarking system model used. A data set $D$ is transformed into a watermarked version $D_W$ by applying a watermark encoding function that also takes as inputs a secret key $K_s$, known only to the copyright owner, and a watermark $W$. Watermarking modifies the data. However, these modifications are controlled by providing usability constraints referred to by the set $G$. These constraints limit the amount of alteration that can be performed on the data; they will be discussed in detail in the following sections. The watermark encoding can be summarized by the following steps:

Step E1. Data set partitioning: By using the secret key $K_s$, the data set $D$ is partitioned into $m$ nonoverlapping partitions $\{S_0, \ldots, S_{m-1}\}$.

Step E2. Watermark embedding: A watermark bit is embedded in each partition by altering the partition statistics while still verifying the usability constraints in $G$. This alteration is performed by solving a constrained optimization problem.

Step E3. Optimal threshold evaluation: The bit embedding statistics are used to compute the optimal threshold $T^*$ that minimizes the probability of decoding error.

Fig. 1. Stages of watermark encoding and decoding.

The watermarked version $D_W$ is delivered to the intended recipient. Then, it can suffer from unintentional distortions or attacks aimed at destroying the watermark information. Note that even intentional attacks are performed without any knowledge of $K_s$ or $D$, since these are not publicly available.

Watermark decoding is the process of extracting the embedded watermark using the watermarked data set $D_W$, the secret key $K_s$, and the optimal threshold $T^*$. The decoding algorithm is blind, as the original data set $D$ is not required for the successful decoding of the embedded watermark. The watermark decoding is divided into three main steps:

Step D1. Data set partitioning: By using the data partitioning algorithm used in E1, the data partitions are generated.

Step D2. Threshold-based decoding: The statistics of each partition are evaluated, and the embedded bit is decoded using a threshold-based scheme based on the optimal threshold $T^*$.

Step D3. Majority voting: The watermark bits are decoded using a majority voting technique.
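Steps D2 and D3 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the decoding algorithm of the paper: the rule that partition k carries watermark bit k mod l, the comparison `value >= threshold` for decoding a 1, the tie-breaking rule, and the names `decode_bits` and `majority_vote` are all hypothetical choices made for the example. The per-partition statistics are taken as given.

```python
from collections import Counter

def decode_bits(partition_stats, threshold):
    """Step D2: a statistic at or above the threshold decodes as 1, otherwise 0."""
    return [1 if value >= threshold else 0 for value in partition_stats]

def majority_vote(decoded_bits, watermark_length):
    """Step D3: recover each watermark bit by majority over its partitions.

    Assumption (not from the text): partition k carries bit k mod watermark_length.
    """
    votes = [Counter() for _ in range(watermark_length)]
    for k, bit in enumerate(decoded_bits):
        votes[k % watermark_length][bit] += 1
    # Ties are resolved in favor of 1; this is an arbitrary choice for the sketch.
    return [1 if v[1] >= v[0] else 0 for v in votes]

stats = [0.30, 0.05, 0.28, 0.07, 0.10, 0.06]   # made-up partition statistics
bits = decode_bits(stats, threshold=0.15)
watermark = majority_vote(bits, watermark_length=2)
```

In this made-up run, one partition carrying the first bit decodes incorrectly, and the vote over the three partitions that carry it still recovers the bit.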

In the following sections, we discuss each of the encoding and decoding steps in detail.

4 DATA PARTITIONING

In this section, we present the data partitioning algorithm that partitions the data set based on a secret key $K_s$. The data set $D$ is a database relation with scheme $D(P, A_0, \ldots, A_{\nu-1})$, where $P$ is the primary key attribute, $A_0, \ldots, A_{\nu-1}$ are $\nu$ attributes that are candidates for watermarking, and $|D|$ is the number of tuples in $D$. The data set $D$ is to be partitioned into $m$ nonoverlapping partitions, namely, $\{S_0, \ldots, S_{m-1}\}$, such that each partition $S_i$ contains on average $|D|/m$ tuples from the data set $D$. Partitions do not overlap, that is, for any two partitions $S_i$ and $S_j$ such that $i \neq j$, we have $S_i \cap S_j = \{\}$. For each tuple $r \in D$, the data partitioning algorithm computes a MAC, which is considered to be secure [22] and is given by $H(K_s \,\|\, H(r.P \,\|\, K_s))$, where $r.P$ is the primary key of the tuple $r$, $H()$ is a secure hash function, and $\|$ is the concatenation operator. Using the computed MAC, tuples are assigned to partitions. For a tuple $r$, its partition assignment is given by

$$\mathrm{partition}(r) = H(K_s \,\|\, H(r.P \,\|\, K_s)) \bmod m.$$

Using the property that secure hash functions generate uniformly distributed message digests, this partitioning technique places, on average, $|D|/m$ tuples in each partition. Furthermore, an attacker cannot predict the tuple-to-partition assignment without knowledge of the secret key $K_s$ and the number of partitions $m$, which are kept secret. Keeping $m$ secret is not a requirement. However, keeping it secret makes it harder for the attacker to regenerate the partitions. The partitioning algorithm is described in Fig. 2.
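The partition assignment above can be sketched in Python as follows. The sketch assumes SHA-256 as the secure hash function $H$, string-valued primary keys, and rows given as (primary key, value) pairs; none of these choices comes from the paper, and the names `partition_index` and `get_partitions` are hypothetical.

```python
import hashlib

def partition_index(primary_key: str, secret_key: bytes, m: int) -> int:
    """Return H(Ks || H(r.P || Ks)) mod m for one tuple (H assumed to be SHA-256)."""
    inner = hashlib.sha256(primary_key.encode("utf-8") + secret_key).digest()
    outer = hashlib.sha256(secret_key + inner).digest()
    return int.from_bytes(outer, "big") % m

def get_partitions(rows, secret_key: bytes, m: int):
    """Assign each (primary_key, value) row to one of m nonoverlapping partitions."""
    partitions = [[] for _ in range(m)]
    for primary_key, value in rows:
        index = partition_index(primary_key, secret_key, m)
        partitions[index].append((primary_key, value))
    return partitions

rows = [(f"id{i}", float(i)) for i in range(1000)]
partitions = get_partitions(rows, b"owner-secret", m=10)
```

Because each tuple's partition depends only on its own primary key and the secret key, deleting or inserting other tuples does not move it to a different partition, which is the property the marker-free design relies on.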

Although the presence of a primary key in the relation being watermarked is common practice in relational data, our technique can be easily extended to handle cases in which the relation has no primary key. Assuming a single-attribute relation, the most significant $\omega$ bits (MSBs) of the data could be used as a substitute for the primary key. The use of the MSBs assumes that the watermark embedding data alterations are unlikely to alter the $\omega$ MSBs. However, if too many tuples share the same $\omega$ MSBs, this would enable the attacker to infer information about the partition distribution. The solution would be to select the $\omega$ that minimizes the duplicates. Another technique, in the case of a relation with multiple attributes, is to use identifying attributes instead of the primary key; for example, in medical data, we could use the patient’s full name, address, and date of birth.

Our data partitioning algorithm does not rely on special marker tuples for the selection of data partitions, which makes it resilient to watermark synchronization attacks caused by tuple deletion and tuple insertion. By contrast, Sion et al. [23] use special marker tuples, having the property that $H(K_s \,\|\, H(r.P \,\|\, K_s)) \bmod m = 0$, to partition the data set. In Sion’s approach, a partition is defined as the set of tuples between two markers. Marker-based techniques use markers not only to define partitions but also to define boundaries between the embedded watermark bits. Such a technique is very fragile to tuple deletion and insertion due to the errors caused by the addition and deletion of marker tuples. This attack is discussed in more detail in Section 8.

Fig. 2. Data partitioning algorithm.

5 WATERMARK EMBEDDING

In this section, we describe the watermark embedding algorithm by formalizing the bit encoding as a constrained optimization problem. Then, we propose a GA and a PS technique that can be used to efficiently solve such an optimization problem. The choice of optimization algorithm depends on the application time and processing requirements, as will be discussed further. At the end of this section, we give the overall watermark embedding algorithm. Our watermarking technique is able to handle tuples with multiple attributes, as we will discuss in Section 7. However, to simplify the following discussion, we assume the tuples in a partition $S_i$ contain a single numeric attribute. In such a case, each partition $S_i$ can be represented as a numeric data vector $S_i = [s_{i1}, \ldots, s_{in}] \in \mathbb{R}^n$. Some of the notation used is listed in Fig. 3.

5.1 Single Bit Encoding

Given a watermark bit $b_i$ and a numeric data vector $S_i = [s_{i1}, \ldots, s_{in}] \in \mathbb{R}^n$, the bit encoding algorithm maps the data vector $S_i$ to a new data vector $S_i^W = S_i + \Delta_i$, where $\Delta_i = [\Delta_{i1}, \ldots, \Delta_{in}] \in \mathbb{R}^n$ is referred to as the manipulation vector. The performed manipulations are bounded by the data usability constraints referred to by the set $G_i = \{g_{i1}, \ldots, g_{ip}\}$. The encoding is based on optimizing an encoding function referred to as the hiding function, which is defined as follows:

Definition 1. A hiding function is a function $\Theta_\eta : \mathbb{R}^n \to \mathbb{R}$, where $\eta$ is the set of secret parameters decided by the data owner.

The set $\eta$ can be regarded as part of the secret key. Note that when the hiding function is applied to $S_i + \Delta_i$, the only variable is the manipulation vector $\Delta_i$, whereas $S_i$ and $\eta$ are constants. To encode bit $b_i$ into set $S_i$, the bit encoding algorithm optimizes the hiding function $\Theta_\eta(S_i + \Delta_i)$. Whether the hiding function is maximized or minimized depends on the bit $b_i$: if $b_i$ is equal to 1, then the bit encoding algorithm solves the following maximization problem:

$$\max_{\Delta_i} \; \Theta_\eta(S_i + \Delta_i) \quad \text{subject to } G_i.$$

However, if the bit $b_i$ is equal to 0, then the problem is simply changed into a minimization problem. The solution to the optimization problem generates the manipulation vector $\Delta_i^*$ at which $\Theta_\eta(S_i + \Delta_i^*)$ is optimal. The new data set $S_i^W$ is computed as $S_i + \Delta_i^*$. Using contradicting objectives, namely, maximization for $b_i = 1$ and minimization for $b_i = 0$, ensures that the values of $\Theta_\eta(S_i + \Delta_i^*)$ generated in both cases are located at maximal distance and thus makes the inserted bit more resilient to attacks, in particular, to alteration attacks.

Fig. 4 depicts the bit encoding algorithm. The bit encoding algorithm embeds bit $b_i$ in the partition $S_i$ if $|S_i|$ is greater than $\xi$. The value of $\xi$ represents the minimum partition size. The maximize and minimize operations in the bit encoding algorithm optimize the hiding function $\Theta_\eta(S_i + \Delta_i^*)$ subject to the constraints in $G_i$. The maximization and minimization solution statistics are recorded for each encoding step in $X_{max}$ and $X_{min}$, respectively, as indicated in lines 4 and 7 of the encoding algorithm. These statistics are used to compute optimal decoding parameters, as will be discussed in Section 6.
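The control flow described above might look roughly like the following Python sketch. It is not a transcription of Fig. 4: the optimizer is a plain random search standing in for the GA and PS techniques, only a single interval constraint with `lo <= 0 <= hi` is supported, returning a too-small partition unchanged is an assumption, and the function and parameter names are hypothetical.

```python
import random

def encode_single_bit(values, bit, hiding_fn, lo, hi, min_size, x_max, x_min,
                      trials=200, seed=0):
    """Embed one bit by maximizing (bit 1) or minimizing (bit 0) hiding_fn.

    values:       numeric partition S_i
    lo, hi:       interval bounds on every entry of the manipulation vector
    x_max, x_min: lists collecting the optimized hiding-function values
    """
    if len(values) <= min_size:           # partition too small: embed nothing
        return list(values)
    rng = random.Random(seed)
    sign = 1.0 if bit == 1 else -1.0      # minimizing f equals maximizing -f
    best_delta = [0.0] * len(values)      # zero change is feasible by assumption
    best_score = sign * hiding_fn(list(values))
    for _ in range(trials):               # stand-in optimizer: random search
        delta = [rng.uniform(lo, hi) for _ in values]
        score = sign * hiding_fn([v + d for v, d in zip(values, delta)])
        if score > best_score:
            best_delta, best_score = delta, score
    # Record the achieved statistic for the later threshold computation.
    (x_max if bit == 1 else x_min).append(sign * best_score)
    return [v + d for v, d in zip(values, best_delta)]
```

The hiding function is passed in as a callable so that the same skeleton works for any choice of statistic; a real implementation would replace the random search with a constrained optimizer.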

The set of usability constraints $G_i$ represents the bounds on the tolerated change that can be performed on the elements of $S_i$. These constraints describe the feasible space for the manipulation vector $\Delta_i$ for each bit encoding step. These constraints are application and data dependent. The usability constraints are similar to the constraints enforced on watermarking algorithms for audio, images, and video, which mainly require that the watermark is not detectable by the human auditory and visual system [24], [12], [15], [8]. For example, interval constraints could be used to control the magnitude of the alteration for $\Delta_{ij}$, that is,

$$\Delta_{ij}^{\min} \leq \Delta_{ij} \leq \Delta_{ij}^{\max}.$$

Another example of usability constraints is classification-preserving constraints, which constrain the encoding alterations to generate data that belong to the same classification as the original data. For example, when watermarking age data, the results after the alteration should fall in the same age group, for example, “preschool” (0-6 years), “child” (7-13), “teenager” (14-18), “young male” (19-21), or “adult” (22+). These constraints can be easily described using interval constraints, as they are similar to defining bounds on $\Delta_i$. Another interesting type of constraint may require that the watermarked data set maintain certain statistics. For example, if the mean of the generated data set must be equal to the mean of the original data set, the constructed constraint is of the form

$$\sum_{j=1}^{n} \Delta_{ij} = 0.$$

Several other usability constraints could be devised depending on the application requirements. These constraints are handled by the bit encoding algorithm by using constrained optimization techniques when optimizing the hiding function, as will be discussed in the subsequent sections.
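The three kinds of usability constraints mentioned above can be written as simple feasibility checks. The following Python sketch is illustrative only: the function names are hypothetical, ages are assumed to be integers, and the numerical tolerance in the mean-preserving check is an assumption. The age groups are the ones listed in the text.

```python
def within_intervals(delta, lower, upper):
    """Interval constraints: lower[j] <= delta[j] <= upper[j] for every j."""
    return all(lo <= d <= hi for d, lo, hi in zip(delta, lower, upper))

# Age groups from the example: preschool, child, teenager, young male, adult.
AGE_GROUPS = [(0, 6), (7, 13), (14, 18), (19, 21), (22, float("inf"))]

def age_group(age):
    """Index of the age group that contains an integer age."""
    for index, (low, high) in enumerate(AGE_GROUPS):
        if low <= age <= high:
            return index
    raise ValueError(f"age out of range: {age}")

def preserves_classification(values, delta):
    """Classification-preserving constraint: every altered age keeps its group."""
    return all(age_group(v) == age_group(v + d) for v, d in zip(values, delta))

def preserves_mean(delta, tol=1e-9):
    """Mean-preserving constraint: the entries of delta sum to zero."""
    return abs(sum(delta)) <= tol
```

For instance, changing an age from 20 to 21 stays within the same group, whereas changing 21 to 22 crosses a group boundary and is rejected.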

Fig. 3. Notation.

Fig. 4. Bit encoding algorithm.

For the sake of comparison in this paper, we use the statistics-based hiding function used by Sion et al. [23]. The mean and variance estimates of the new set $S_i^W = S_i + \Delta_i$ are referred to as $\mu_{(S_i+\Delta_i)}$ and $\sigma^2_{(S_i+\Delta_i)}$, respectively; in short, we will use $\mu$ and $\sigma^2$. We define the reference point as $\mathrm{ref} = \mu + c \times \sigma$, where $c \in (0, 1)$ is a secret real number that is a part of the secret parameter set $\eta$. The data points in $S_i + \Delta_i$ that are above ref are referred to as the “tail” entries, as illustrated in Fig. 5. The hiding function $\Theta_c$ is defined as the number of tail entries normalized by the cardinality of $S_i$, also referred to as the normalized tail count. It is computed as follows:

$$\Theta_c(S_i + \Delta_i) = \frac{1}{n} \sum_{j=1}^{n} \mathbf{1}_{\{s_{ij} + \Delta_{ij} \geq \mathrm{ref}\}},$$

where $n$ is the cardinality of $S_i$, and $\mathbf{1}_{\{\}}$ is the indicator function defined as follows:

$$\mathbf{1}_{\{\text{condition}\}} = \begin{cases} 1 & \text{if condition} = \text{TRUE}, \\ 0 & \text{otherwise}. \end{cases}$$

Note that the reference ref depends on both $\mu$ and $\sigma$, which means that it is not fixed and varies dynamically with the statistics of $S_i + \Delta_i$. Also note that the normalized tail count $\Theta_c(S_i + \Delta_i)$ depends on the distribution of $S_i + \Delta_i$ and the dynamic ref.
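The normalized tail count can be computed directly from its definition. The Python sketch below assumes the population standard deviation as the estimate of $\sigma$ (the text does not say which estimator is used), and the function name `normalized_tail_count` is hypothetical.

```python
import statistics

def normalized_tail_count(values, c):
    """Fraction of entries at or above ref = mean + c * std of the same data."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)   # assumption: population standard deviation
    ref = mu + c * sigma                # ref moves whenever the data changes
    return sum(1 for v in values if v >= ref) / len(values)

data = [1.0, 2.0, 3.0, 4.0, 10.0]
count = normalized_tail_count(data, c=0.5)
shifted = normalized_tail_count([v + 100.0 for v in data], c=0.5)
```

For this data, the mean is 4 and the reference point is about 5.58, so only the entry 10 lies in the tail. Adding the same constant to every entry moves ref by that constant and leaves the count unchanged, which illustrates why ref is described as dynamic.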

The objective function $\Theta_c(S_i + \Delta_i)$ is nonlinear and nondifferentiable, which makes the optimization problem at hand a nonlinear constrained optimization problem. In such problems, traditional gradient-based approaches turn out to be inapplicable. To solve this optimization problem, we propose two techniques based on GA and PS, respectively. The choice of technique depends on the application processing requirements. Solving the optimization problem does not necessarily require finding a global solution, because finding such a solution may require a large number of computations. Our main goal is to find a near-optimal solution that ensures that the solutions of the minimization of $\Theta_c(S_i + \Delta_i)$ and of the maximization of $\Theta_c(S_i + \Delta_i)$ are separated as far as possible from each other. As we will discuss further, GA could be used to determine globally optimal solutions at the cost of additional processing time, whereas PS could be used to provide a locally optimal solution without that cost. Note that these optimization techniques will also work for simpler hiding functions. However, a simple linear hiding function is easier to attack. For example, if the average is used as the hiding function, the optimization problem reduces to merely adding a constant term to, or subtracting it from, the data vector to maximize or minimize the average.

5.2 Genetic Algorithm Technique

A GA is a search technique that is based on the principles of natural selection or survival of the fittest. Pioneering work in this field was conducted by Holland in the 1960s [13], [6]. Many researchers have refined his initial approach. Instead of using gradient information, the GA uses the objective function directly in the search. The GA searches the solution space by maintaining a population of potential solutions. Then, by using evolving operations such as crossover, mutation, and selection, the GA creates successive generations of solutions that evolve and inherit the positive characteristics of their parents and thus gradually approach optimal or near-optimal solutions. By using the objective function directly in the search, GAs can be effectively applied in nonconvex, highly nonlinear, complex problems [10], [5]. GAs have been frequently used to solve combinatorial optimization problems and nonlinear problems with complicated constraints or nondifferentiable objective functions. A GA is not guaranteed to find the global optimum; however, it is less likely to get trapped at a local optimum than traditional gradient-based search methods when the objective function is not smooth and generally well behaved. A GA usually analyzes a larger portion of the solution space than conventional methods and is therefore more likely to find feasible solutions in heavily constrained problems.

The feasible set $\Omega_i$ is the set of values of $\Delta_i$ that satisfy all constraints in $G_i$. GAs do not work directly with points in the set $\Omega_i$ but rather with a mapping of the points in $\Omega_i$ into a string of symbols called chromosomes. A simple binary representation scheme uses symbols from $\{0, 1\}$; each chromosome is $L$ symbols long. As an example, the binary chromosome representing the vector $\Delta_i = [\Delta_{i1}, \ldots, \Delta_{in}]$ is indicated in Fig. 6. Note that each component of $\Delta_i$ uses $L/n$ bits, where $n = |S_i|$. This chromosome representation automatically handles interval constraints on $\Delta_i$. For example, if $\Delta_{ij}$ can only take values in the interval $[l_{ij}, h_{ij}]$, then mapping the integers in the interval $[0, 2^{L/n} - 1]$ to values in the interval $[l_{ij}, h_{ij}]$ via simple translation and scaling ensures that, whatever operations are performed on the chromosome, the entries are guaranteed to stay within the feasible interval.
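The translation-and-scaling step described above can be sketched as a chromosome decoder. In this Python sketch the chromosome is assumed to be a string of '0' and '1' characters, and the name `decode_chromosome` is hypothetical; the mapping itself follows the description in the text.

```python
def decode_chromosome(bits, bounds):
    """Map a binary chromosome of length L to a manipulation vector.

    bits:   string of '0'/'1' symbols
    bounds: list of (l_ij, h_ij) intervals, one per component (n = len(bounds))
    """
    n = len(bounds)
    if len(bits) == 0 or len(bits) % n != 0:
        raise ValueError("chromosome length must be a positive multiple of n")
    width = len(bits) // n                  # L / n bits per component
    max_int = 2 ** width - 1                # largest integer a gene can encode
    delta = []
    for j, (low, high) in enumerate(bounds):
        gene = int(bits[j * width:(j + 1) * width], 2)
        # Translate and scale [0, max_int] onto [low, high].
        delta.append(low + (high - low) * gene / max_int)
    return delta

delta = decode_chromosome("00001111", [(-1.0, 1.0), (0.0, 2.0)])
```

Because every gene decodes to a point inside its own interval, crossover and mutation on the bit string cannot produce a manipulation vector that violates the interval constraints.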

Each chromosome has a corresponding value of the objective function, referred to as the fitness of the chromosome. To handle other types of constraints, we penalize the infeasible chromosomes by reducing their fitness value according to a penalty function $\Phi(\Delta_i)$, which represents the degree of infeasibility. Without loss of generality, if we are solving the maximization problem in Section 5.1 with constraints $G_i = \{g_{i1}, \ldots, g_{ip}\}$, then the fitness function used is $\Theta_c(S_i + \Delta_i) + \lambda \Phi(\Delta_i)$, where $\lambda \in \mathbb{R}^-$ is the penalty multiplier, chosen large enough in magnitude to penalize the objective function in case of infeasible $\Delta_i$. The penalty function $\Phi(\Delta_i)$ is given by

$$\Phi(\Delta_i) = \sum_{j=1}^{p} g_{ij}^{+}(\Delta_i),$$

where

$$g_{ij}^{+}(\Delta_i) = \begin{cases} 0 & \text{if } \Delta_i \text{ is feasible w.r.t. } g_{ij}, \\ \rho(g_{ij}, \Delta_i) & \text{otherwise,} \end{cases}$$

and $\rho(g_{ij}, \Delta_i) \in \mathbb{R}^+$ represents the amount of infeasibility with respect to the constraint $g_{ij}$. For example, if the constraint $g_{ij}$ is $\sum_{j=1}^{n} \Delta_{ij} = 0$, then $\rho(g_{ij}, \Delta_i) = \left\| \sum_{j=1}^{n} \Delta_{ij} \right\|$. For a detailed discussion of penalty-based techniques, the interested reader is referred to [20], [5].

120 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 20, NO. 1, JANUARY 2008

Fig. 5. The distribution of the set $S_i + \Delta_i$ on the number line and the tail entries circled.

Fig. 6. Binary chromosome representing $\Delta_i$.
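A minimal sketch of the penalized fitness for a maximization problem follows. The names `penalized_fitness` and `zero_sum_violation`, the default multiplier of -1000, and the toy objective in the usage lines are illustrative assumptions; the paper does not prescribe them.

```python
def penalized_fitness(objective, delta, violations, penalty_multiplier=-1000.0):
    """Return objective(delta) + multiplier * (sum of constraint violations).

    Each function in `violations` must return 0 when its constraint is
    satisfied and a positive amount of infeasibility otherwise.
    """
    if penalty_multiplier >= 0:
        raise ValueError("the penalty multiplier must be negative when maximizing")
    total_violation = sum(v(delta) for v in violations)
    return objective(delta) + penalty_multiplier * total_violation


def zero_sum_violation(delta):
    """Infeasibility of the example constraint sum_j delta_j = 0."""
    return abs(sum(delta))


# Toy objective (an assumption for illustration): always returns 1.0.
feasible_score = penalized_fitness(lambda d: 1.0, [0.2, -0.2], [zero_sum_violation])
infeasible_score = penalized_fitness(lambda d: 1.0, [0.5, 0.5], [zero_sum_violation])
```

A feasible vector keeps its raw objective value, while an infeasible one is pushed far below any feasible competitor, so selection gradually removes it from the population.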

A GA is less likely to get stuck in local optima. However, a GA requires a large number of function evaluations to converge to a global optimum. Thus, we recommend the use of GAs only when the processing time is not a strict requirement and watermarking is performed offline. For faster performance, we recommend the use of the PS techniques discussed in Section 5.3.

5.3 Pattern Search Technique

PS methods are a class of direct search methods for nonlinear optimization. PS methods [4], [14] have been widely used because of their simplicity and the fact that they work well in practice on a variety of problems. More recently, they have been shown to be provably convergent [16], [9]. PS starts at an initial point and samples the objective function at a predetermined pattern of points centered about that point with the goal of producing a new better iterate. Such moves are referred to as exploratory moves; Fig. 8 shows an example pattern in $\mathbb{R}^2$. If such sampling is successful (that is, it produces a new better iterate), the process is repeated with the pattern centered about the new best point. If not, the size of the pattern is reduced, and the objective function is again sampled about the current point. For a detailed discussion on PS, refer to [16] and [9]. To improve the performance of PS, the objective function $\Theta_c(S_i + \Delta_i)$ is approximated by smooth sigmoid functions as follows:

$$\hat{\Theta}_c(S_i + \Delta_i) = \frac{1}{n} \sum_{j=1}^{n} \mathrm{Sigmoid}_{(\alpha, ref)}(s_{ij} + \Delta_{ij}),$$

where $\mathrm{Sigmoid}_{(\alpha, \tau)}(x)$, shown in Fig. 7, is a sigmoid function with parameters $(\alpha, \tau)$ defined as $1 - (1 + e^{\alpha(x - \tau)})^{-1}$.
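The effect of the smoothing can be illustrated with a short Python sketch. It assumes the sigmoid has a steepness parameter and a center (named `alpha` and `tau` here), and that the reference value `ref` is supplied by the caller; the function names are hypothetical.

```python
import math


def sigmoid(x, alpha, tau):
    """Evaluate 1 - 1 / (1 + exp(alpha * (x - tau))) without overflow."""
    z = alpha * (x - tau)
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def tail_count(values, ref):
    """Exact normalized tail count: fraction of entries greater than ref."""
    return sum(1 for v in values if v > ref) / len(values)


def smooth_tail_count(values, ref, alpha):
    """Sigmoid approximation of the normalized tail count, centered at ref."""
    return sum(sigmoid(v, alpha, ref) for v in values) / len(values)


data = [1.0, 2.0, 3.0, 4.0]
exact = tail_count(data, 2.5)
approx = smooth_tail_count(data, 2.5, alpha=50.0)
```

A larger steepness brings the approximation closer to the step-like exact count, at the cost of a less smooth surface for the search to follow.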

Constraints could be handled using the techniques discussed earlier. However, PS can easily handle the constraints by limiting the exploratory moves to only the directions that end up in the feasible space, thus ensuring that the generated solution is feasible. For a more detailed discussion, refer to [16]. The systematic behavior of PS and the adaptable pattern size lead to fast convergence to optimal feasible solutions. However, PS is not guaranteed to find a global optimum. This problem can be overcome by starting the algorithm from different initial feasible points. For the sake of comparison, we conducted an experiment using normally distributed data, where the tail count $\Theta_c(S_i + \Delta_i)$ was maximized and minimized using both PS and GA with interval constraints. Both algorithms were restricted to an equal number of objective function evaluations. Fig. 9 reports the results of this experiment, which show that PS generates better optimized tail counts and, thus, better separation between the maximization and minimization results. However, if the GA is given more function evaluations, it converges to globally optimal solutions.
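The search loop described in this section can be sketched as a simple coordinate search with interval constraints. This is an illustrative simplification, not the authors' implementation: the function name, the step-halving rule, the stopping criteria, and the quadratic test objective are all assumptions.

```python
def pattern_search(objective, start, bounds, step=1.0, min_step=1e-6, max_evals=10_000):
    """Maximize `objective` by coordinate search, polling only feasible points.

    start  : feasible initial point (list of floats)
    bounds : list of (low, high) interval constraints, one per coordinate
    """
    best = list(start)
    best_value = objective(best)
    evals = 1
    while step > min_step and evals < max_evals:
        improved = False
        for j in range(len(best)):
            for direction in (1.0, -1.0):
                trial = list(best)
                trial[j] += direction * step
                low, high = bounds[j]
                if not (low <= trial[j] <= high):
                    continue  # skip exploratory moves that leave the feasible box
                value = objective(trial)
                evals += 1
                if value > best_value:
                    best, best_value, improved = trial, value, True
        if not improved:
            step /= 2.0  # unsuccessful poll: shrink the pattern
    return best, best_value


# Toy concave objective; the second coordinate is constrained to be >= -1.
point, value = pattern_search(
    lambda x: -(x[0] - 1.0) ** 2 - (x[1] + 2.0) ** 2,
    start=[0.0, 0.0],
    bounds=[(-5.0, 5.0), (-1.0, 5.0)],
)
```

Restarting the same routine from several feasible starting points, as suggested above, is a cheap way to reduce the risk of stopping at a poor local optimum.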

5.4 Watermark Embedding Algorithm

A watermark is a set of $l$ bits $W = b_{l-1}, \ldots, b_0$ that are to be embedded in the data partitions $\{S_0, \ldots, S_{m-1}\}$. To enable multiple embeddings of the watermark in the data set, the watermark length $l$ is selected such that $l \ll m$. The watermark embedding algorithm embeds a bit $b_i$ in partition $S_k$ such that $k \bmod l = i$. This technique ensures that each watermark bit is embedded $\lfloor m/l \rfloor$ times in the data set $D$. The watermark embedding algorithm is reported in Fig. 10. The algorithm generates the partitions by calling get_partitions; then, for each partition $S_k$, a watermark bit $b_i$ is encoded by using the single bit encoding algorithm (encode_single_bit) that was discussed in the previous sections. The generated altered partition $S_k^W$ is inserted into the watermarked data set $D_W$. Statistics $(X_{max}, X_{min})$ are collected after each bit embedding and are used by the get_optimal_threshold algorithm to compute the optimal decoding threshold; these details are discussed further in the following sections.

Fig. 7. $\mathrm{Sigmoid}_{(\alpha, \tau)}$, where $\tau = 0$ and $\alpha \in \{1, 2, 8\}$.

Fig. 8. Example pattern for coordinate search in $\mathbb{R}^2$, as part of a larger grid.

Fig. 9. Comparison between GA and PS for the same number of functional evaluations.
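The outer embedding loop can be sketched as follows. The signature of the single bit encoder is an assumption made for this illustration: it is passed in as a callable that returns the altered partition together with its optimized hiding function value, and the stub used in the usage lines is purely hypothetical.

```python
def embed_watermark(partitions, watermark_bits, encode_single_bit):
    """Embed bit b_(k mod l) in partition S_k and collect the threshold statistics.

    partitions        : list of partitions S_0 .. S_(m-1)
    watermark_bits    : list indexed so that watermark_bits[i] is b_i
    encode_single_bit : callable (partition, bit) -> (altered partition, hiding value)
                        [hypothetical signature]
    """
    l = len(watermark_bits)
    watermarked, x_max, x_min = [], [], []
    for k, partition in enumerate(partitions):
        bit = watermark_bits[k % l]
        altered, hiding_value = encode_single_bit(partition, bit)
        watermarked.append(altered)
        # Maximized values (bit 1) and minimized values (bit 0) are kept apart.
        (x_max if bit == 1 else x_min).append(hiding_value)
    return watermarked, x_max, x_min


# Stub encoder standing in for the optimization step (illustration only).
stub = lambda partition, bit: (partition, 0.9 if bit == 1 else 0.1)
marked, x_max, x_min = embed_watermark([[1], [2], [3], [4], [5]], [1, 0], stub)
```

With five partitions and a two-bit watermark, bit $b_0$ lands in partitions 0, 2, and 4 and bit $b_1$ in partitions 1 and 3, which is the repetition pattern the detector later relies on.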

6 DECODING THRESHOLD EVALUATION

In the previous sections, we discussed the bit encoding technique that embeds a watermark bit $b_i$ in a partition $S_i$ to generate a watermarked partition $S_i^W$. In this section, we discuss the bit decoding technique that is used to extract the embedded watermark bit $b_i$ from the partition $S_i^W$. The bit decoding technique is based on an optimal threshold $T^*$ that minimizes the probability of decoding error. The evaluation of this optimal threshold is discussed in this section.

Presented with the data partition $S_i^W$, the bit decoding technique computes the hiding function $\Theta_\xi(S_i^W)$ and compares it to the optimal decoding threshold $T^*$ to decode the embedded bit $b_i$. If $\Theta_\xi(S_i^W)$ is greater than $T^*$, then the decoded bit is 1; otherwise, the decoded bit is 0. For example, using the hiding function described in Section 5.1, the decoding technique computes the normalized tail count of $S_i^W$ by computing the reference $ref$ and by counting the number of entries in $S_i^W$ that are greater than $ref$. Then, the computed normalized tail count is compared to $T^*$ (see Fig. 11). The decoding technique is simple; however, the value of the threshold $T^*$ should be carefully calculated so as to minimize the probability of bit decoding error, as discussed in this section.
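A minimal sketch of the threshold decoder follows. It assumes that the reference is computed as the partition mean plus `c` times the (population) standard deviation; that definition belongs to Section 5.1 and is not restated here, so treat it, along with the function names, as an assumption.

```python
import statistics


def normalized_tail_count(values, c):
    """Fraction of entries above ref, with ref = mean + c * std (assumed definition)."""
    ref = statistics.mean(values) + c * statistics.pstdev(values)
    return sum(1 for v in values if v > ref) / len(values)


def decode_bit(values, c, threshold):
    """Decode 1 if the normalized tail count exceeds the threshold, else 0."""
    return 1 if normalized_tail_count(values, c) > threshold else 0


partition = [0, 0, 0, 0, 10]   # mean 2, population std 4, so ref = 5 for c = 0.75
count = normalized_tail_count(partition, 0.75)
```

The decoder needs only the watermarked partition, the secret constant, and the threshold, which is what makes detection blind with respect to the original data.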

The probability of bit decoding error is defined as the probability that an embedded bit is decoded incorrectly. The decoding threshold $T^*$ is selected such that it minimizes the probability of decoding error. The bit embedding stage discussed in Section 5.1 is based on the maximization or minimization of the tail count; these optimized hiding function values computed during the encoding stage are used to compute the optimum threshold $T^*$. The maximized hiding function values, corresponding to the $b_i$'s that are equal to 1, are stored in the set $X_{max}$. Similarly, the minimized hiding function values are stored in $X_{min}$ (see algorithm in Fig. 4).

Let $P_{err}$, $P_0$, and $P_1$ represent the probability of decoding error, the probability of encoding a bit $= 0$, and the probability of encoding a bit $= 1$, respectively. Furthermore, let $b_e$, $b_d$, and $f(x)$ represent the encoded bit, decoded bit, and a probability density function, respectively. $P_{err}$ is calculated as follows:

$$\begin{aligned}
P_{err} &= P(b_d = 0, b_e = 1) + P(b_d = 1, b_e = 0) \\
&= P(b_d = 0 \mid b_e = 1) P_1 + P(b_d = 1 \mid b_e = 0) P_0 \\
&= P(x < T \mid b_e = 1) P_1 + P(x > T \mid b_e = 0) P_0 \\
&= P_1 \int_{-\infty}^{T} f(x \mid b_e = 1)\,dx + P_0 \int_{T}^{\infty} f(x \mid b_e = 0)\,dx.
\end{aligned}$$

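To minimize the probability of decoding error ($P_{err}$) with respect to the threshold $T$, we take the first order derivative of $P_{err}$ with respect to $T$ to locate the optimal threshold $T^*$, as follows:

$$\begin{aligned}
\frac{\partial P_{err}}{\partial T} &= P_1 \frac{\partial}{\partial T} \int_{-\infty}^{T} f(x \mid b_e = 1)\,dx + P_0 \frac{\partial}{\partial T} \int_{T}^{\infty} f(x \mid b_e = 0)\,dx \\
&= P_1 f(T \mid b_e = 1) - P_0 f(T \mid b_e = 0).
\end{aligned}$$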

The distributions $f(x \mid b_e = 0)$ and $f(x \mid b_e = 1)$ are estimated from the statistics of the sets $X_{min}$ and $X_{max}$, respectively. From our experimental observations of $X_{min}$ and $X_{max}$, the distributions $f(x \mid b_e = 0)$ and $f(x \mid b_e = 1)$ pass the chi-square test of normality and thus can be estimated as Gaussian distributions $N(\mu_0, \sigma_0)$ and $N(\mu_1, \sigma_1)$, respectively. However, the following analysis can still be performed with other types of distributions. $P_0$ can be estimated by $\frac{|X_{min}|}{|X_{max}| + |X_{min}|}$, and $P_1 = 1 - P_0$. Substituting the Gaussian expressions for $f(x \mid b_e = 0)$ and $f(x \mid b_e = 1)$, the first order derivative of $P_{err}$ is as follows:

$$\frac{\partial P_{err}}{\partial T} = \frac{P_1}{\sigma_1 \sqrt{2\pi}} \exp\left(-\frac{(T - \mu_1)^2}{2\sigma_1^2}\right) - \frac{P_0}{\sigma_0 \sqrt{2\pi}} \exp\left(-\frac{(T - \mu_0)^2}{2\sigma_0^2}\right).$$

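By equating the first order derivative of $P_{err}$ to zero and taking logarithms, we get the following quadratic equation, the roots of which include the optimal threshold $T^*$ that minimizes $P_{err}$. The second order derivative of $P_{err}$ is evaluated at $T^*$ to ensure that the second order sufficient condition $\left(\frac{\partial^2 P_{err}(T^*)}{\partial T^2} > 0\right)$ is met.

$$\frac{\sigma_0^2 - \sigma_1^2}{2\sigma_0^2\sigma_1^2}\, T^{*2} + \frac{\mu_0\sigma_1^2 - \mu_1\sigma_0^2}{\sigma_0^2\sigma_1^2}\, T^{*} + \ln\left(\frac{P_0\sigma_1}{P_1\sigma_0}\right) + \frac{\mu_1^2\sigma_0^2 - \mu_0^2\sigma_1^2}{2\sigma_0^2\sigma_1^2} = 0.$$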
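The threshold computation can be sketched directly from the quadratic. The function names are hypothetical, the Gaussian model is the one assumed in the text, and the sketch further assumes that $0 < P_0 < 1$ and that the two means differ.

```python
import math


def gaussian_pdf(x, mu, sigma):
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))


def optimal_threshold(mu0, sigma0, mu1, sigma1, p0):
    """Return the root of the quadratic at which P_err has a local minimum."""
    p1 = 1.0 - p0
    v0, v1 = sigma0 ** 2, sigma1 ** 2
    a = (v0 - v1) / (2 * v0 * v1)
    b = (mu0 * v1 - mu1 * v0) / (v0 * v1)
    c = math.log((p0 * sigma1) / (p1 * sigma0)) + (mu1 ** 2 * v0 - mu0 ** 2 * v1) / (2 * v0 * v1)
    if math.isclose(v0, v1):
        roots = [-c / b]  # equal variances: the equation degenerates to a linear one
    else:
        disc = b * b - 4 * a * c
        if disc < 0:
            raise ValueError("no real threshold for these parameters")
        roots = [(-b + s * math.sqrt(disc)) / (2 * a) for s in (1.0, -1.0)]

    def second_derivative(t):
        # d/dT of P1 * f(T | b_e = 1) - P0 * f(T | b_e = 0) for Gaussian densities.
        d1 = -(t - mu1) / v1 * gaussian_pdf(t, mu1, sigma1)
        d0 = -(t - mu0) / v0 * gaussian_pdf(t, mu0, sigma0)
        return p1 * d1 - p0 * d0

    minima = [t for t in roots if second_derivative(t) > 0]
    if not minima:
        raise ValueError("no root satisfies the second order condition")
    return minima[0]


t_equal = optimal_threshold(0.1, 0.05, 0.4, 0.05, p0=0.5)
t_unequal = optimal_threshold(0.0, 1.0, 3.0, 2.0, p0=0.5)
```

In practice the four Gaussian parameters would be estimated from $X_{min}$ and $X_{max}$, and $P_0$ from their relative sizes, as described above. With equal variances and equal priors the threshold reduces to the midpoint of the two means.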

From the above analysis, the selection of the optimal $T^*$ is based on the collected output statistics of the watermark embedding algorithm. The optimal threshold $T^*$ minimizes the probability of decoding error and thus enhances the strength of the embedded watermark by increasing the chances of successful decoding. To show the strong dependence of the probability of decoding error on the choice of decoding threshold, we conducted an experiment using real-life¹ data with usability constraints of $\pm 0.5$ percent of the original data value. The histograms and the Gaussian estimates of $X_{max}$ and $X_{min}$ obtained from the experiment are reported in Fig. 12a. The optimal computed threshold $T^*$ is indicated by the dotted vertical line. As we can see in Fig. 12a, the two distributions are far apart, which is a direct result of using the competing objectives for $b_i$ equal to 1 and 0. Fig. 12b shows the probability of decoding error for different values of the decoding threshold, which confirms the presence of an optimal threshold that minimizes the probability of decoding error. Furthermore, Fig. 12b shows both the Gaussian approximation and the experimental values of the probability of decoding error; the Gaussian approximation matches the experimental results.

Fig. 10. Watermark embedding algorithm.

Fig. 11. Threshold-based decoding scheme.

The probability of decoding error is also dependent on the usability constraints. If the usability constraints are tight, the amount of alteration allowed to the data set $D$ may not be enough for the watermark insertion. Fig. 13a shows the effect of varying the usability constraints on the separation between $f(x \mid b_e = 0)$ and $f(x \mid b_e = 1)$. Note that as the usability bounds are widened, more data manipulation is allowed during encoding, which makes $f(x \mid b_e = 0)$ and $f(x \mid b_e = 1)$ more separated. Fig. 13b shows the minimum probability of decoding error computed using the optimal threshold $T^*$ for data subject to different usability constraints. The overall watermark probability of decoding error is reduced by embedding the watermark multiple times in the data set, which is basically a repetition error correcting code.

7 WATERMARK DETECTION

In this section, we discuss the watermark detection algorithm that extracts the embedded watermark using the secret parameters including $K_s$, $m$, $\xi$, $c$, and $T^*$. The algorithm starts by generating the data partitions $\{S_0, \ldots, S_{m-1}\}$ using the watermarked data set $D_W$, the secret key $K_s$, and the number of partitions $m$ as input to the data partitioning algorithm discussed in Section 4. Each partition encodes a single watermark bit; to extract the embedded bit, we use the threshold decoding scheme based on the optimal threshold $T^*$ that minimizes the probability of decoding error, as discussed in Section 6. If the partition size is smaller than $\xi$, the bit is decoded as an erasure; otherwise, it is decoded using the threshold scheme.


Fig. 12. (a) $f(x \mid b_e = 0)$, $f(x \mid b_e = 1)$, and the computed $T^* = 0.24142$. (b) Gaussian approximation and experimental values of $P_{err}$ for different decoding threshold ($T$) values.

Fig. 13. (a) $f(x \mid b_e = 0)$, $f(x \mid b_e = 1)$, and $T^*$ for different usability constraints. (b) The minimum $P_{err}$ at $T^*$ for different usability constraints.

1. Description of such data is discussed in Section 9.


As the watermark $W = b_{l-1}, \ldots, b_0$ is embedded several times in the data set, each watermark bit is extracted several times; a bit $b_i$ is extracted from every partition $S_k$ with $k \bmod l = i$. The extracted bits are decoded using the majority voting technique, which is used in the decoding of repetition error correcting codes. Each bit $b_i$ is extracted $\lfloor m/l \rfloor$ times, so it represents a $\lfloor m/l \rfloor$-fold repetition code [19]. The majority voting technique is illustrated by the example in Fig. 14. The detailed algorithm used for watermark detection is reported in Fig. 15.
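The voting step can be sketched as follows. Erasures are represented by `None`, and ties are resolved to 0; both choices, like the function name, are assumptions made for this illustration rather than details taken from Fig. 15.

```python
def majority_vote(extracted, l):
    """Recover [b_0, ..., b_(l-1)] from the per-partition decoded bits.

    extracted : list where extracted[k] is 0, 1, or None (erasure) for partition S_k
    l         : watermark length
    """
    ones = [0] * l
    zeros = [0] * l
    for k, bit in enumerate(extracted):
        if bit is None:
            continue  # erasures cast no vote but keep the bit positions aligned
        if bit == 1:
            ones[k % l] += 1
        else:
            zeros[k % l] += 1
    return [1 if ones[i] > zeros[i] else 0 for i in range(l)]


# Nine partitions, three-bit watermark, one erasure and two flipped bits.
recovered = majority_vote([1, 0, 1, 1, None, 0, 0, 0, 1], l=3)
```

Because an erasure occupies its slot in the sequence instead of being dropped, the remaining votes still line up with the correct watermark positions.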

In the case of a relation with multiple attributes, the watermark resilience can be increased by embedding the watermark in multiple attributes. This is a simple extension to the presented encoding and decoding techniques in which the watermark is embedded in each attribute separately. For a $\gamma$-attribute relation, the watermark bit is embedded in each of the $\gamma$ columns separately using the bit embedding technique discussed in Section 5.1. The use of multiple attributes enables the embedding of each watermark bit $\gamma$ times in each partition; such embedding can be considered as an inner $\gamma$-fold repetition code [19]. For decoding purposes, the statistics $X_{max}$ and $X_{min}$ are collected for each attribute separately. The optimal threshold is computed for each attribute using the collected statistics to minimize the probability of decoding error, as discussed in Section 6. In the decoding phase, the watermark is extracted separately from each of the $\gamma$ attributes using the watermark detection algorithm discussed above; then, majority voting is used to detect the final watermark.

8 ATTACKER MODEL

In this section, we discuss the attacker model and the possible malicious attacks that can be performed. Assume that Alice is the owner of the data set $D$ and has marked $D$ by using a watermark $W$ to generate a watermarked data set $D_W$. The attacker Mallory can perform several types of attacks in the hope of corrupting or even deleting the embedded watermark. A robust watermarking technique must be able to survive all such attacks.

We assume that Mallory has no access to the original data set $D$ and does not know any of the secret information used in the embedding of the watermark, including the secret key $K_s$, the secret number of partitions $m$, the secret constant $c$, the optimization parameters, and the optimal decoding threshold $T^*$. Given these assumptions, Mallory cannot generate the data partitions $\{S_0, \ldots, S_{m-1}\}$ because this requires the knowledge of both the secret key $K_s$ and the number of partitions $m$; thus, Mallory cannot intentionally attack certain watermark bits. Moreover, any data manipulations executed by Mallory cannot be checked against the usability constraints because the original data set $D$ is unknown. Under these assumptions, Mallory is faced with the dilemma of trying to destroy the watermark while at the same time not destroying the data. We classify the attacks performed by Mallory into three types, namely, deletion, alteration, and insertion attacks.

8.1 Deletion Attack

Mallory deletes $\alpha$ tuples from the marked data set. If the tuples are randomly deleted, then, on average, each partition loses $\alpha/m$ tuples. The watermarking techniques available in the literature rely on special tuples, referred to as marker tuples. Agrawal and Kiernan [1] use marker tuples to locate the embedded bit, and Sion et al. [23] use marker tuples to locate the start and end of data partitions. The embedded watermark is a stream of bits where the marker tuples identify the boundaries between these bits in the stream. The successful deletion of marker tuples deletes these boundaries between the bits of the watermark stream, which makes such marker-based watermarking techniques [1], [23] susceptible to watermark synchronization errors. For example, using the watermarking technique presented by Sion et al. [23], Fig. 16a shows an example partitioned data set and the corresponding majority voting map used to decode the embedded watermark. The marker tuples are represented by the shaded cells; these markers are used to identify the start and end of each partition. The embedded bit is noted in each partition; the embedded watermark is “101010.” Now, if Mallory successfully deletes the marker tuple controlling the first bit ($b_0$), this results in the deletion of the first bit (see Fig. 16b). The decoder, unaware of the deleted bit, will generate “×10101” instead, which is the result of decoding a shifted version of the embedded bits. This results in a watermark synchronization error at the decoder. Moreover, the resynchronization of the watermark bit stream becomes more complicated in the presence of flipped bits due to other decoding errors. Thus, the successful deletion of a single marker could result in a large number of errors in the decoding phase. To avoid watermark synchronization errors in marker-based techniques, the $m$ marker tuples should be stored, as indicated by Sion et al. in [23]. Note that this violates the requirement that the watermark decoding is blinded.

Fig. 14. An example illustrating the majority bit matching decoding algorithm for a watermark $W = 011010$, with “×” representing the erasures.

Fig. 15. Watermark detection algorithm.

On the other hand, our partitioning technique is resilient to such synchronization errors as it does not rely on marker tuples to locate the partition limits; instead, our partitioning technique assigns tuples to partitions based on a different approach, as discussed in Section 4. We also use erasures to indicate the loss of a bit due to insufficient partition size and, thus, to maintain synchronization and ensure that our technique is resilient to the watermark synchronization error.

8.2 Alteration Attack

In this attack, Mallory alters the data values of tuples. Here, Mallory is faced with the challenge that altering the data may disturb the watermark; however, at the same time, Mallory does not have access to the original data set $D$ and, thus, may easily violate the usability constraints and render the data useless. The alteration attack basically perturbs the data in the hope of introducing errors in the embedded watermark bits. The attacker is trying to move the hiding function values from the left of the optimal threshold to the right and vice versa. However, using the conflicting objectives in encoding the watermark bits, that is, maximizing the tail count for $b_i = 1$ and minimizing the tail count for $b_i = 0$, maximizes the distance between the hiding function values in both cases; thus, it makes it more difficult for the attacker to alter the embedded bit. In addition, the repeated embedding of the watermark and the use of the majority voting technique discussed in Section 7 help mitigate this attack.

8.3 Insertion Attack

Mallory decides to insert tuples into the data set $D_W$, hoping to perturb the embedded watermark. The insertion of new tuples acts as additive noise to the embedded watermark. However, the watermark embedding is not based on a single tuple; it is based on a cumulative hiding function that operates on all the tuples in the partition. Thus, the effect of adding tuples is a minor perturbation to the value of the hiding function and thus to the embedded watermark bit. Marker-based watermarking techniques may suffer badly from this attack because the addition of tuples may introduce new markers in the data set and thus lead to the addition of new bits in the embedded watermark sequence. Consequently, this results in a watermark synchronization error. Using the example mentioned earlier, Fig. 16a shows a partitioned data set and its corresponding majority voting map using the Sion et al. technique [23], where the embedded watermark is “101010.” Now, if Mallory successfully adds a marker tuple after the third marker tuple, this results in the addition of a new bit between ($b_2$) and ($b_3$) (see Fig. 16c, where the black cell represents the added marker tuple). The decoder, unaware of the added bit, will generate “010111” instead, which is the result of decoding a shifted version of the embedded bits. This problem is further complicated in the presence of bit errors in the watermark stream. To ensure synchronization at the decoder, the marker-based watermarking techniques require the storage of the $m$ marker tuples to ensure successful partitioning of the data set in the presence of the insertion attack [23]. On the other hand, our partitioning algorithm is not dependent on special marker tuples, which makes it resilient to such an attack, and watermark synchronization is guaranteed during decoding.

Fig. 16. Watermarked data set subject to the deletion and insertion attacks, respectively, and their corresponding majority voting maps. Gray-shaded cells represent the original marker tuples, and the black cells represent the added marker tuples. (a) Watermarked data set. (b) After deletion attack. (c) After insertion attack.

Fig. 17. Resilience to deletion attack.

The experimental results presented in Section 9 support the claims made about the resilience of our watermarking technique to all the above attacks.

9 EXPERIMENTAL RESULTS

In this section, we report the results of an extensive experimental study that analyzes the resilience of the proposed watermarking scheme to the attacks described in Section 8. All the experiments were performed on 3.2 GHz Intel Pentium IV CPUs with 512 Mbytes of RAM. We use real-life data from a relatively small database that contains the daily power consumption rates of some customers over a period of one year. Such data sets are made available through the CIMEG² project. The database size is approximately 5 Mbytes; for testing purposes only, a subset of the original data with 150,000 tuples is used. We used $c = 75$ percent, a 16-bit watermark, a minimum partition size $\xi = 10$, a number of partitions $m = 2{,}048$, and the data change was allowed within $\pm 0.5$ percent. The PS algorithm was used for the optimization. The optimal threshold was computed using the technique described in Section 6 to minimize the probability of decoding error. The watermarked data set was subject to different types of attacks, including deletion, alteration, and insertion attacks. The results were averaged over multiple runs. Similar results were obtained for both uniform and normally distributed synthetic data. We show that it is difficult for Mallory to remove or alter the watermark without destroying the data.

We assessed computation times and observed polynomial behavior with respect to the input data size. Given the setup described above, with a local database, we obtained an average of around 300 tuples/sec for watermark embedding, whereas detection turned out to be roughly five times as fast or faster. These figures were measured on the nonoptimized interpreted Java proof of concept implementation. We expect speedups of orders of magnitude in a real-life deployment version. For comparison purposes, we have implemented the Sion et al. [23] approach with no stored markers, where the marker tuples are generated on the fly during both the encoding and the decoding phases.

Fig. 18. Resilience to $\alpha$-insertion and $(\alpha, \beta)$-insertion attacks. (a) $\alpha$-selected insertion attack. (b) $(\alpha, \beta)$-insertion attack. (c) $(\alpha, \beta)$-insertion attack.

2. Consortium for the Intelligent Management of the Electric Power Grid (CIMEG). http://helios.ecn.purdue.edu/~cimeg.

9.1 Deletion Attack

In this attack, Mallory randomly drops tuples from the watermarked data set; the watermark is then decoded, and the watermark loss is measured for different numbers of deleted tuples. Furthermore, in this test, we compare our implementation with the Sion et al. (No Stored Markers) [23] approach. Fig. 17 shows the experimental results; they clearly show that our watermarking technique is resilient to the random deletion attack.

Using our technique, the watermark was successfully extracted with 100 percent accuracy even when more than 80 percent of the tuples were deleted. On the other hand, the technique by Sion et al. deteriorates badly when only 10 percent of the tuples are deleted. We believe that the high resilience of our watermarking technique is due to the marker-free data partitioning algorithm, which is resilient to the watermark synchronization errors caused by tuple deletion.

Because our technique is highly resilient to tuple deletion attacks, the watermark can be retrieved from a small sample of the data. This important property, combined with the high efficiency of our watermark detection algorithm, makes it possible to develop tools able to effectively and efficiently search the Web to detect illegal copies of data. We could think of an agent-based tool where the agent visits sites and selectively tests parts of the stored data sets to check for ownership rights. Such a technique would only need to inspect 20 percent of the data for successful watermark detection.

Fig. 19. (a), (b), and (c) Resilience to fixed-$(\alpha, \beta)$ alter attacks. (d), (e), and (f) Resilience to random-$(\alpha, \beta)$ alter attacks.

9.2 Insertion Attack

In this experiment, Mallory attempts to add a number of tuples, hoping to weaken the embedded watermark. However, by adding tuples, Mallory is adding meaningless data to the current data. Mallory could simply generate the new tuples by replicating values in randomly selected existing tuples; we refer to such an attack as the $\alpha$-selected insertion attack. Alternatively, Mallory could randomly generate the tuple values by generating random data from the range $(\mu_{D_W} - \beta\sigma_{D_W},\ \mu_{D_W} + \beta\sigma_{D_W})$, where $\mu_{D_W}$ and $\sigma_{D_W}$ are the mean and standard deviation of the data set $D_W$, respectively. We refer to such an attack as the $(\alpha, \beta)$-insertion attack. Fig. 18a shows a comparison between our approach and the Sion et al. (No Stored Markers) [23] approach. The comparison shows that our technique is resilient to the $\alpha$-selected attack even when $\alpha$ is up to 100 percent of the data set size. On the other hand, the Sion et al. marker-based technique deteriorates after adding just 10 percent of the data set size. Figs. 18b and 18c show the resilience of our watermarking technique to the $(\alpha, \beta)$-insertion attack, where the watermark was recovered with 100 percent accuracy even when tuples amounting to up to 80 percent of the data set size were inserted.
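The two insertion attacks can be simulated with a few lines of Python. This is a sketch under assumptions: the function names are invented here, the number of inserted values is passed as `count`, the half-width of the sampling range is `spread` standard deviations, and the random values are drawn uniformly, since the text only says that they come from a range around the mean.

```python
import random
import statistics


def selected_insertion(values, count, rng):
    """Append `count` values replicated from randomly chosen existing entries."""
    return list(values) + [rng.choice(values) for _ in range(count)]


def random_insertion(values, count, spread, rng):
    """Append `count` values drawn from (mean - spread*std, mean + spread*std).

    Uniform sampling is an assumption made for this sketch.
    """
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)
    low, high = mu - spread * sigma, mu + spread * sigma
    return list(values) + [rng.uniform(low, high) for _ in range(count)]


rng = random.Random(7)                      # fixed seed for repeatability
original = [10.0, 12.0, 11.0, 13.0, 9.0]
replicated = selected_insertion(original, 5, rng)
noisy = random_insertion(original, 5, 1.0, rng)
```

Running a detector on `replicated` or `noisy` and comparing the recovered bits with the embedded ones gives the kind of accuracy curve reported in Fig. 18.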

9.3 Alteration Attack

We tested our watermarking technique against two types of alteration attacks, namely, the fixed and the random $(\alpha, \beta)$ alter attacks. In the fixed-$(\alpha, \beta)$ alter attack, Mallory randomly selects $\alpha$ tuples and alters them by multiplying $\alpha/2$ tuples by exactly $(1 + \beta)$ and the other $\alpha/2$ tuples by $(1 - \beta)$. In this attack, the value of $\beta$ is fixed. In the random-$(\alpha, \beta)$ alter attack, $\alpha$ tuples are randomly selected; $\alpha/2$ tuples are then multiplied by $(1 + x)$ and the other $\alpha/2$ tuples by $(1 - x)$, where $x$ is a uniform random variable in the range $[0, \beta]$.

Figs. 19a, 19b, and 19c show the behavior of our watermarking technique subject to the fixed-$(\alpha, \beta)$ alter attack. As we can see in Fig. 19a, the watermark is decoded with 100 percent accuracy even when 100 percent of the tuples are altered by more than 1.0 percent. This shows the strong resilience of our watermarking technique to fixed alteration attacks. Furthermore, Fig. 19b shows the number of corrupted tuples as the attack proceeds. Tuples that exceed the usability constraints are referred to as corrupted tuples. Fig. 19b shows a sudden increase in the number of corrupted tuples once the alteration exceeds 0.9 percent; this increase is due to the usability constraints used in this experiment, which are set to $\pm 0.5$ percent. Fig. 19c is a clear description of the dilemma that the attacker is facing. The dotted lines show the number of corrupted tuples, whereas the solid lines represent the detected watermark accuracy. By intensifying the attack, the attacker is able to reduce the watermark accuracy to 80 percent; however, at the same time, 75 percent of the tuples are corrupted. Similar results were observed for the random-$(\alpha, \beta)$ attack, which are shown in Figs. 19d, 19e, and 19f.
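The attacker's dilemma can be reproduced in a small simulation. The sketch below assumes two attack parameters, a number of altered tuples (`count`) and a relative alteration magnitude (`magnitude`), and counts a tuple as corrupted when its relative change exceeds the usability bound; all names are hypothetical.

```python
import random


def alter_attack(values, count, magnitude, fixed=True, rng=None):
    """Scale half of `count` randomly chosen entries up and the other half down.

    fixed=True  : every chosen entry is scaled by exactly (1 +/- magnitude)
    fixed=False : each chosen entry uses its own x drawn uniformly from [0, magnitude]
    """
    rng = rng or random.Random()
    attacked = list(values)
    chosen = rng.sample(range(len(values)), count)
    half = count // 2
    for position, index in enumerate(chosen):
        x = magnitude if fixed else rng.uniform(0.0, magnitude)
        factor = 1.0 + x if position < half else 1.0 - x
        attacked[index] *= factor
    return attacked


def count_corrupted(original, attacked, limit=0.005):
    """Count entries whose relative change exceeds the usability bound."""
    return sum(1 for o, a in zip(original, attacked) if abs(a - o) > limit * abs(o))


data = [100.0] * 10
attacked = alter_attack(data, count=4, magnitude=0.01, rng=random.Random(0))
corrupted = count_corrupted(data, attacked)
```

With a usability bound of 0.5 percent, every tuple altered by 1 percent counts as corrupted, which mirrors the trade-off shown in Fig. 19c: a stronger attack damages the data faster than it damages the watermark.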

Experiments performed at lower usability constraints still showed similar resilience trends of the watermark encoding and decoding when subject to the above attacks. Table 1 shows a comparison between our technique and the technique by Sion et al. based on the different watermark attacks and the main characteristics of each technique.


TABLE 1. Comparison between Our Technique and the Technique by Sion et al. (No Stored Markers) [23]


10 CONCLUSION

In this paper, we have presented a resilient watermarking technique for relational data that embeds watermark bits in the data statistics. The watermarking problem was formulated as a constrained optimization problem that maximizes or minimizes a hiding function based on the bit to be embedded. GA and PS techniques were employed to solve the proposed optimization problem and to handle the constraints. Furthermore, we presented a data partitioning technique that does not depend on special marker tuples to locate the partitions and proved its resilience to watermark synchronization errors. We developed an efficient threshold-based technique for watermark detection that is based on an optimal threshold that minimizes the probability of decoding error. The watermark resilience was improved by the repeated embedding of the watermark and by using the majority voting technique in the watermark decoding phase. Moreover, the watermark resilience was improved by using multiple attributes.

A proof of concept implementation of our watermarking technique was used to conduct experiments using both synthetic and real-world data. A comparison of our watermarking technique with previously proposed techniques shows the superiority of our technique in resisting deletion, alteration, and insertion attacks.

REFERENCES

[1] R. Agrawal and J. Kiernan, “Watermarking Relational Databases,”Proc. 28th Int’l Conf. Very Large Data Bases, 2002.

[2] M. Atallah and S. Lonardi, “Authentication of LZ-77 CompressedData,” Proc. ACM Symp. Applied Computing, 2003.

[3] M. Atallah, V. Raskin, C. Hempelman, M. Karahan, R. Sion, K.Triezenberg, and U. Topkara, “Natural Language Watermarkingand Tamperproofing,” Proc. Fifth Int’l Information Hiding Workshop,2002.

[4] G. Box, “Evolutionary Operation: A Method for IncreasingIndustrial Productivity,” Applied Statistics, vol. 6, no. 2, pp. 81-101, 1957.

[5] E. Chong and S. _Zak, An Introduction to Optimization. John Wiley &Sons, 2001.

[6] D. Coley, “Introduction to Genetic Algorithms for Scientists andEngineers,” World Scientific, 1999.

[7] C. Collberg and C. Thomborson, “Software Watermarking:Models and Dynamic Embeddings,” Proc. 26th ACM SIGPLAN-SIGACT Symp. Principles of Programming Languages, Jan. 1999.

[8] I. Cox, J. Bloom, and M. Miller, Digital Watermarking. MorganKaufmann, 2001.

[9] E. Dolan, R. Lewis, and V. Torczon, “On the Local Convergence ofPattern Search,” SIAM J. Optimization, vol. 14, no. 2, pp. 567-583,2003.

[10] D. Goldberg, Genetic Algorithm in Search, Optimization and MachineLearning. Addison-Wesley, 1989.

[11] D. Gross-Amblard, “Query-Preserving Watermarking of Rela-tional Databases and XML Documents,” Proc. 22nd ACMSIGMOD-SIGACT-SIGART Symp. Principles of Database Systems(PODS ’03), pp. 191-201, 2003.

[12] F. Hartung and M. Kutter, “Multimedia Watermarking Techniques,” Proc. IEEE, vol. 87, no. 7, pp. 1079-1107, July 1999.

[13] J. Holland, Adaptation in Natural and Artificial Systems. The MIT Press, 1992.

[14] R. Hooke and T. Jeeves, “Direct Search Solution of Numerical and Statistical Problems,” J. Assoc. for Computing Machinery, vol. 8, no. 2, pp. 212-229, 1961.

[15] G. Langelaar, I. Setyawan, and R. Lagendijk, “Watermarking Digital Image and Video Data: A State-of-the-Art Overview,” IEEE Signal Processing Magazine, vol. 17, no. 5, pp. 20-46, Sept. 2000.

[16] R. Lewis and V. Torczon, “Pattern Search Methods for Linearly Constrained Minimization,” SIAM J. Optimization, vol. 10, no. 3, pp. 917-941, 2000.

[17] Y. Li, H. Guo, and S. Jajodia, “Tamper Detection and Localization for Categorical Data Using Fragile Watermarks,” Proc. Fourth ACM Workshop Digital Rights Management (DRM ’04), pp. 73-82, 2004.

[18] Y. Li, V. Swarup, and S. Jajodia, “Fingerprinting Relational Databases: Schemes and Specialties,” IEEE Trans. Dependable and Secure Computing, vol. 2, no. 1, pp. 34-45, Jan.-Mar. 2005.

[19] R. Morelos-Zaragoza, The Art of Error Correcting Coding. John Wiley & Sons, 2002.

[20] J. Nocedal and S. Wright, Numerical Optimization. Prentice Hall, 1999.

[21] F. Petitcolas, R. Anderson, and M. Kuhn, “Attacks on Copyright Marking Systems,” LNCS, vol. 1525, pp. 218-238, Apr. 1998.

[22] B. Schneier, Applied Cryptography. John Wiley & Sons, 1996.

[23] R. Sion, M. Atallah, and S. Prabhakar, “Rights Protection for Relational Data,” IEEE Trans. Knowledge and Data Eng., vol. 16, no. 6, June 2004.

[24] M. Swanson, M. Kobayashi, and A. Tewfik, “Multimedia Data-Embedding and Watermarking Technologies,” Proc. IEEE, vol. 86, pp. 1064-1087, June 1998.

[25] L. Vaas, “Putting a Stop to Database Piracy,” eWEEK, Enterprise News and Revs., Sept. 2003.

[26] R. Wolfgang, C. Podilchuk, and E. Delp, “Perceptual Watermarks for Digital Images and Video,” Proc. IEEE, vol. 87, pp. 1108-1126, July 1999.

Mohamed Shehab received the PhD degree from the School of Electrical and Computer Engineering, Purdue University, West Lafayette, Indiana. He is currently working as an assistant professor in the Department of Software and Information Systems, University of North Carolina, Charlotte. His research interests include information security, distributed access control, distributed workflow management systems, and watermarking of relational databases. He is a member of the IEEE.

Elisa Bertino is a professor of computer science and electrical and computer engineering at Purdue University and serves as a research director of the Center for Education and Research in Information Assurance and Security (CERIAS). Her main research interests include security and database systems. In those areas, she has published more than 300 papers. She is the coordinating editor in chief of the Very Large Database Systems Journal and serves on the editorial boards of several journals. She is a fellow of the IEEE and the ACM. She received the 2002 IEEE Computer Society Technical Achievement Award for outstanding contributions to database systems and database security and advanced data management systems.

Arif Ghafoor is currently a professor in the School of Electrical and Computer Engineering, Purdue University, West Lafayette, Indiana, and is the director of the Distributed Multimedia Systems Laboratory and Information Infrastructure Security Research Laboratory. His research interests include database security, parallel and distributed computing, and multimedia information systems, and he has published extensively in these areas. He has served on the editorial boards of various journals. He is a fellow of the IEEE. He has received the IEEE Computer Society 2000 Technical Achievement Award for his research contributions on multimedia systems.
