
A General Framework for Information Leakage

Jiachun Liao, Oliver Kosut, Lalitha Sankar
School of Electrical, Computer and Energy Engineering, Arizona State University
Email: {jiachun.liao,lalithasankar,okosut}@asu.edu

Flavio P. Calmon
School of Engineering and Applied Sciences, Harvard University
Email: [email protected]

Abstract

Consider a setting where a dataset needs to be shared or published for a specific set of learning tasks. For a given release of the data, the maximal α-leakage quantifies an estimate of the adversary's ability to learn any (potentially randomized) function of the original data from the published data. The choice of α (ranging from 1 to ∞) captures the adversary's varying ability to refine its posterior belief about the original data, with α = 1 capturing a purely belief-refining adversary and α = ∞ capturing an adversary that guesses the best posterior. Varying α allows continuous interpolation between the two extremes. For α = 1 the leakage introduced simplifies to mutual information, and for α = ∞ it simplifies to maximal leakage. Several properties of this measure are proven; these include: (i) quasi-convexity in the privacy mechanism (the map from the original to the published dataset); (ii) a data processing inequality; and (iii) a composition property. Finally, the privacy-utility trade-offs under a novel hard distortion constraint as well as Hamming distortion are developed.

I. INTRODUCTION

In this age of inference, the ability to gather personal information from an expanding digital footprint is outstripping any individual's capability to keep this information private [1]. While the data collected can have tremendous benefit for consumers and data curators via technologies built on machine learning and artificial intelligence, this benefit must be tempered with meaningful assurances of privacy. As the marketplace for data explodes, a more nuanced notion of privacy that goes beyond identity is emerging: it is the requirement that a good privacy-assuring mechanism should limit all inferences beyond what the data owner or curator intended.

Early privacy models for data disclosure [2]–[13] have given way to two robust privacy approaches: (i) differential privacy (DP) [14]–[16], quantified by a leakage parameter ε,¹ which restricts the distinguishability of any two "neighboring" datasets from the published data; and (ii) information-theoretic (IT) privacy, which takes a statistical approach to modeling datasets and quantifying utility and privacy, and obtains the privacy-utility trade-off (PUT) as an optimization problem [17]–[35]. IT privacy has predominantly been quantified by mutual information (MI), which models how well an adversary with access to the released data can refine its belief of the private data. Recently, Issa et al. [36] introduced maximal leakage (MaxL) to quantify leakage to a strong adversary capable of guessing any function of the dataset, and showed that their adversarial model encompasses local DP (wherein the mechanism ensures limited distinction for any pair of entries, a stronger guarantee without a neighborhood constraint [37]) [38].

In theory, DP and MaxL protect against a worst-case adversary that can guess any function of the dataset. In practice, researchers have shown that DP provides very limited utility and comes with tremendous sample (data size) complexity [39]–[47] in the desired leakage regimes. An attempt to make it more practical is approximate DP, which dilutes the guarantees by revealing the dataset as is with non-zero probability [48], [49]. We have recently shown that MaxL behaves analogously to approximate DP, i.e., utility requirements cause MaxL to reveal a subset of the dataset as is with non-zero probability [50].

We introduce a new, tunable privacy metric, called maximal α-leakage. The choice of α (ranging from 1 to ∞) captures the adversary's varying ability to refine its posterior belief about the original data, with α = 1 capturing a purely belief-refining adversary and α = ∞ capturing an adversary that guesses the best posterior. (The maximal α-leakage simplifies to MI and MaxL by assigning 1 and ∞ to α, respectively.) Varying α allows continuous interpolation between the two extremes.

Let X, Y and S be three discrete random variables, with alphabets \mathcal{X}, \mathcal{Y} and \mathcal{S}, respectively, such that S − X − Y forms a Markov chain. We denote the joint distribution of (X, Y) by P_{XY}. Following the usual supervised learning setting,² a learning algorithm receives as input a set of n i.i.d. observations {(X_i, Y_i)}_{i=1}^{n}, (X_i, Y_i) ∼ P_{XY}, and outputs a function that captures the belief about X given an observation of Y, denoted by \hat{P}_{X|Y}. The belief \hat{P}_{X|Y} is often represented by the output of a parametric function f_θ : \mathcal{Y} → Δ_{|\mathcal{X}|} (e.g., a neural network), where θ denotes the parameters of the model (e.g., the weights of a neural network), and Δ_{|\mathcal{X}|} is the set of all probability distributions over \mathcal{X}.

The value of \hat{P}_{X|Y} is selected in order to minimize an expected loss metric, synonymously referred to as the risk or generalization error [53]. Denoting the loss function by \ell(x, y, \hat{P}_{X|Y}), the expected loss is given by \mathbb{E}\big[\ell(X, Y, \hat{P}_{X|Y})\big]. In practice, the empirical loss is minimized as a proxy for the expected loss:

\hat{P}_{X|Y} = \arg\min_{P_{X|Y}} \frac{1}{n}\sum_{i=1}^{n} \ell(X_i, Y_i, P_{X|Y}),

and the minimization may be further restricted to parametric functions. However, in the interest of evaluating fundamental limits, here we assume that the expected loss is minimized directly instead of its empirical approximation, and that this minimization is computed over all conditional distributions P_{X|Y}.

¹ A smaller ε ∈ [0, ∞) implies smaller leakage and a stronger privacy guarantee.
² We alert the reader that machine learning texts usually denote the observed feature by X and the label to be predicted by Y. We switch the roles of X and Y here to comply with the notation of the information-theoretic privacy literature [51]–[54].

In many privacy applications, the engineering goal is to design the mapping P_{Y|X} in order to thwart an adversary's ability to estimate any property S of X, while still disclosing some information about X that may be useful for a given application. This naturally leads to a trade-off between privacy and utility [52], [53], [55]–[57]. For example, X may represent a user's age, education level, and zip code, S the user's income, and Y a differentially private version of X used for producing population statistics. Alternatively, X may represent a user's shopping habits, S her health condition, and Y data used for producing targeted advertisements.

Privacy metrics seek to quantify an adversary's ability to infer information that is deemed private by a user. In the privacy setting, Y represents the disclosed data, X a user's private data, and S some property of X that is targeted by an adversary. Upon observing Y, a machine learning adversary updates its belief of S as

\hat{P}_{S|Y} = \arg\min_{P_{S|Y}} \mathbb{E}\big[\ell(S, Y, P_{S|Y})\big].

The privacy risk incurred by an adversary's observation of Y can then be quantified as the gain in expected loss [58]. This gain, in turn, captures the change in the adversary's belief of S upon observing the disclosed data.

The choice of the loss function \ell provides a concrete measure of the gain in adversarial inference. The choice of loss function will also influence the design of the privacy mechanism P_{Y|X}. One possible loss function is the 0-1 loss, where

\ell(s, y, \hat{P}_{S|Y}) = 1 - \hat{P}_{S|Y}(s|y). (1)

Here, the optimal belief \hat{P}^{*}_{S|Y} is the standard maximum a posteriori (MAP) estimator, given by

\hat{P}^{*}_{S|Y}(s|y) = 1 if s = \arg\max_{s'} P_{S|Y}(s'|y), and 0 otherwise,

and the expected loss is simply the average probability of error, i.e., \Pr(\hat{S} \neq S). An alternative is the logarithmic loss [58]–[61]:

\ell(s, y, \hat{P}_{S|Y}) = \log \frac{1}{\hat{P}_{S|Y}(s|y)}, (2)

which results in a cross-entropy upon taking an expected value. The minimizing belief is the true posterior distribution of S given Y, i.e., \hat{P}^{*}_{S|Y} = P_{S|Y}, and the expected loss is simply the conditional entropy H(S|Y).

Note that the error-probability loss is minimized by the belief assigning probability 1 to the most likely s, while the log loss is minimized by the posterior distribution. These represent two extreme adversarial strategies: to guess the unknown S exactly, or to improve its belief about S. Individually, however, each metric produces only a limited picture of what an adversary can learn from Y.
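As a concrete illustration of these two extremes, the following sketch (ours, not from the paper; it assumes NumPy and uses an arbitrary toy joint distribution) computes the optimal adversarial belief and the resulting expected loss under the 0-1 loss (1) and under the logarithmic loss (2).

```python
import numpy as np

# Toy joint distribution P_SY (rows: s, columns: y); the numbers are arbitrary.
P_SY = np.array([[0.30, 0.10],
                 [0.05, 0.25],
                 [0.10, 0.20]])
P_Y = P_SY.sum(axis=0)            # marginal of Y
P_S_given_Y = P_SY / P_Y          # posterior P_{S|Y}(s|y), one column per y

# 0-1 loss (1): the optimal belief is the MAP point mass for each y,
# and the expected loss is the average probability of error.
map_belief = (P_S_given_Y == P_S_given_Y.max(axis=0)).astype(float)
prob_error = 1.0 - (P_Y * P_S_given_Y.max(axis=0)).sum()

# Log loss (2): the optimal belief is the posterior itself, and the expected
# loss is the conditional entropy H(S|Y) (here in bits).
H_S_given_Y = -(P_SY * np.log2(P_S_given_Y)).sum()

print("MAP beliefs (one column per y):\n", map_belief)
print("average probability of error:", round(prob_error, 4))
print("conditional entropy H(S|Y) [bits]:", round(H_S_given_Y, 4))
```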

We propose a new leakage measure called maximal α-leakage that allows continuous interpolation between these extremes via the loss function

\ell(s, \hat{P}_S) = \frac{\alpha}{\alpha-1}\Big(1 - \hat{P}_S(s)^{1-\frac{1}{\alpha}}\Big), (3)

for any constant α > 1. For very large α, this loss is essentially the same as the probability of error. As α decreases, the convexity of the loss function encourages the estimator \hat{S} to be probabilistic, because it rewards correct inferences of less-likely outcomes. In the limit as α → 1, this loss function essentially becomes the logarithmic loss, and the optimal belief \hat{P}_S is simply the posterior belief.

The expression \hat{P}_S(s)^{1-\frac{1}{\alpha}} in (3) can be viewed as the reward for correctly inferring the outcome s.³ Therefore, the expected reward of correctly inferring S without side information is

\sum_{s\in\mathcal{S}} P_S(s)\,\hat{P}_S(s)^{1-\frac{1}{\alpha}} = \sum_{s\in\mathcal{S}} P_S(s)\,\hat{P}_S(s)\left(\frac{1}{\hat{P}_S(s)}\right)^{\frac{1}{\alpha}}. (4)

The minimization in (3) is equivalent to maximizing \hat{P}_S(s)^{1-\frac{1}{\alpha}}, such that the optimal belief that maximizes the expected reward in (4) also minimizes the average loss; it is a tilted posterior distribution given by

\hat{P}^{*}_{S}(s) = \frac{P_S(s)^{\alpha}}{\sum_{s\in\mathcal{S}} P_S(s)^{\alpha}}. (5)

³ The expression \big(\hat{P}_S(s)\big)^{-\frac{1}{\alpha}} gives: (1) an increasing reward for a given outcome (with a certain probability) as α tends to 1; and (2) in general, a higher reward for a lower-probability outcome, but the same reward for all possible outcomes when α = ∞.

Note that for α > 1, the function f : t → t^α (t ≥ 0) is non-decreasing, such that outcomes with higher probabilities are assigned higher beliefs. Our proposed maximal α-leakage considers all possible functions (properties) S of X, and measures the maximal multiplicative gain in the expected reward of correctly inferring S given Y. In the limit as α → 1, the maximal α-leakage becomes mutual information; in the other limit, as α → ∞, the maximal α-leakage becomes maximal leakage [62, Def. 1].
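The following sketch (ours; it assumes NumPy and SciPy and uses an illustrative prior) numerically checks that the belief minimizing the expected α-loss in (3) matches the tilted distribution in (5).

```python
import numpy as np
from scipy.optimize import minimize

alpha = 2.0
P_S = np.array([0.5, 0.3, 0.2])   # illustrative prior on S

def expected_alpha_loss(belief):
    # Expected alpha-loss (3) of a belief over S, averaged under the true prior P_S.
    return (alpha / (alpha - 1.0)) * np.sum(P_S * (1.0 - belief ** (1.0 - 1.0 / alpha)))

# Closed-form minimizer (5): the tilted distribution P_S(s)^alpha / sum_s P_S(s)^alpha.
tilted = P_S ** alpha / np.sum(P_S ** alpha)

# Numerical check: minimize the expected loss over the probability simplex.
constraints = ({'type': 'eq', 'fun': lambda p: p.sum() - 1.0},)
result = minimize(expected_alpha_loss,
                  x0=np.full_like(P_S, 1.0 / P_S.size),
                  bounds=[(1e-9, 1.0)] * P_S.size,
                  constraints=constraints)

print("tilted belief (5):  ", np.round(tilted, 4))
print("numerical minimizer:", np.round(result.x, 4))
```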

This report is organized as follows: in Section II we formally define maximal α-leakage for α ∈ [1, ∞] and provide an operational meaning. In Section III, we present properties of maximal α-leakage. Finally, in Section IV, we present two privacy-utility trade-off (PUT) problems and highlight properties of the optimal mechanisms.

II. PRIVACY METRICS

The maximal α-leakage is an effort to go beyond specific measures, such as mutual information and maximal leakage (i.e., probability of guessing). To this end, we review the informational measures that will play a part in quantifying the maximal α-leakage, namely Rényi divergence and Sibson mutual information [63].

A. Preliminaries

In [64], Alfréd Rényi defined a divergence, denoted D_α, to measure the distance between two discrete distributions P_X and Q_X as follows: for α ∈ [0, ∞],

D_\alpha(P_X \| Q_X) = \frac{1}{\alpha-1}\log\left(\sum_{x\in\mathcal{X}} P_X(x)^{\alpha} Q_X(x)^{1-\alpha}\right), (6)

which is called the Rényi divergence of order α. Several years later, in [65] Sibson proposed an extended version of mutual information to quantify the shared information between two random variables X and Y, which is called α-mutual information by Verdú in [63]. To be precise, we call it Sibson mutual information and review its definition as follows: let P_X be the marginal distribution of X, P_Y be the marginal distribution of Y resulting from the conditional probability matrix P_{Y|X}, and Q_Y be an arbitrary marginal distribution of Y. The Sibson mutual information of order α is given by

I^{S}_\alpha(X;Y) \triangleq \inf_{Q_Y} D_\alpha(P_{XY} \| P_X \times Q_Y). (7)

For discrete random variables, the Sibson mutual information in (7) can be rewritten as [63]

I^{S}_\alpha(X;Y) = \frac{\alpha}{\alpha-1}\log \sum_{y\in\mathcal{Y}} \left(\sum_{x\in\mathcal{X}} P_X(x)\, P_{Y|X}(y|x)^{\alpha}\right)^{\frac{1}{\alpha}}. (8)

In [66], Arimoto gave new proofs of the coding theorem and its strong converse for memoryless channels by introducing another α-mutual information. To distinguish it from the Sibson MI, we call it the Arimoto mutual information; it is given by

I^{A}_\alpha(X;Y) = \frac{\alpha}{\alpha-1}\log \sum_{y\in\mathcal{Y}} \left(\frac{\sum_{x\in\mathcal{X}} P_X(x)^{\alpha}\, P_{Y|X}(y|x)^{\alpha}}{\sum_{x\in\mathcal{X}} P_X(x)^{\alpha}}\right)^{\frac{1}{\alpha}}. (9)

These alternative mutual informations generalize Shannon's (they are equal in the limit as α → 1), and have a number of interesting and useful properties in various problems [63], [65]–[67].
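For reference, a minimal NumPy sketch (ours; the function names and toy channel below are assumptions, not from the paper) evaluating the discrete forms (8) and (9) is given below.

```python
import numpy as np

def sibson_mi(P_X, W, alpha):
    # Sibson mutual information of order alpha, Eq. (8), natural logarithm.
    # W[x, y] = P_{Y|X}(y|x).
    inner = (P_X[:, None] * W ** alpha).sum(axis=0) ** (1.0 / alpha)
    return alpha / (alpha - 1.0) * np.log(inner.sum())

def arimoto_mi(P_X, W, alpha):
    # Arimoto mutual information of order alpha, Eq. (9), natural logarithm.
    num = ((P_X ** alpha)[:, None] * W ** alpha).sum(axis=0)
    inner = (num / (P_X ** alpha).sum()) ** (1.0 / alpha)
    return alpha / (alpha - 1.0) * np.log(inner.sum())

P_X = np.array([0.6, 0.4])
W = np.array([[0.9, 0.1],
              [0.2, 0.8]])
for a in (1.01, 2.0, 10.0, 100.0):
    print(a, sibson_mi(P_X, W, a), arimoto_mi(P_X, W, a))
```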

B. Definition

Let us go back to our model with three random variables S, X and Y, where X is the data that is revealed as Y and S is any function of X. In [62], Issa, Kamath and Wagner introduced a privacy measure, called maximal leakage, to quantify how much knowledge about S is leaked by Y. We review the definition below.

Definition 1. Given a joint distribution P_{XY} on finite alphabets \mathcal{X} and \mathcal{Y}, the maximal leakage from X to Y is

\mathcal{L}_{\mathrm{ML}}(X \to Y) \triangleq \sup_{S-X-Y} \log \frac{\max_{P_{\hat{S}|Y}} \Pr(\hat{S} = S \mid Y)}{\max_{P_{\hat{S}}} \Pr(\hat{S} = S)}, (10)

where S and \hat{S} share a finite support \mathcal{S}.

Remark 1. Note that in Definition 1, S represents any (possibly random) function of X. The numerator represents the maximal probability of correctly guessing S based on Y, while the denominator represents the maximal probability of correctly guessing S without knowing Y. Thus, maximal leakage quantifies the multiplicative gain that a potential adversary can achieve, by accessing Y, in guessing any possible random function of X.

In [62, Thm. 1], the authors proved that maximal leakage simplifies to

\mathcal{L}_{\mathrm{ML}}(X \to Y) = \log \sum_{y} \max_{x:\, P_X(x) > 0} P_{Y|X}(y|x). (11)

The right-hand side may also be written as I^{S}_{\infty}(X;Y), where I^{S}_{\infty} is the Sibson mutual information in (7) with α = ∞.
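A direct implementation of the closed form (11) is sketched below (ours; it assumes NumPy and reuses the toy channel from the earlier sketch); for that channel it matches the large-α values of the Sibson mutual information computed above.

```python
import numpy as np

def maximal_leakage(P_X, W):
    # Maximal leakage, Eq. (11): log of the sum over y of the largest
    # channel entry among inputs x in the support of P_X.
    return np.log(W[P_X > 0].max(axis=0).sum())

P_X = np.array([0.6, 0.4])
W = np.array([[0.9, 0.1],
              [0.2, 0.8]])
print(maximal_leakage(P_X, W))   # log(0.9 + 0.8) ~ 0.5306 nats
```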

The operational definition of maximal leakage in (10) evaluates the estimator \hat{S} of S only in terms of its probability of being correct (often referred to as the 0-1 loss in machine learning). To present a complete picture of what an adversary can learn from Y about properties S of X between the two extremes, i.e., guessing the unknown S exactly or improving its belief about S, we formally define maximal α-leakage as follows.

Definition 2. Given a joint distribution P_{XY} on finite alphabets \mathcal{X} and \mathcal{Y}, the maximal α-leakage from X to Y is defined as

\mathcal{L}_\alpha(X \to Y) = \sup_{S-X-Y} \lim_{\alpha'\to\alpha} \frac{\alpha'}{\alpha'-1}\log \frac{\max_{P_{\hat{S}|Y}} \mathbb{E}\Big[\big(\Pr(\hat{S} = S \mid S, Y)\big)^{1-\frac{1}{\alpha'}}\Big]}{\max_{P_{\hat{S}}} \mathbb{E}\Big[\big(\Pr(\hat{S} = S \mid S)\big)^{1-\frac{1}{\alpha'}}\Big]}, (12)

where α ∈ [1, ∞], and S and \hat{S} take values from the same finite but arbitrary alphabet.

In (12), the expectation in the denominator of the logarithm is exactly the expected reward in (4) for correctly inferring S without Y, and the expectation in the numerator is

\mathbb{E}\Big[\big(\Pr(\hat{S} = S \mid S, Y)\big)^{1-\frac{1}{\alpha'}}\Big] = \sum_{s\in\mathcal{S},\,y\in\mathcal{Y}} P_{SY}(s,y)\,\hat{P}_{S|Y}(s|y)\,\big(\hat{P}_{S|Y}(s|y)\big)^{-\frac{1}{\alpha'}}, (13)

which is the expected reward for correctly inferring S with Y. The optimal belief of S given Y is a tilted posterior distribution given by

\hat{P}^{*}_{S|Y}(s|y) = \frac{P_{S|Y}(s|y)^{\alpha}}{\sum_{s\in\mathcal{S}} P_{S|Y}(s|y)^{\alpha}}. (14)

The two maximizations inside the logarithm in (12) indicate that the two expected rewards compared in maximal α-leakage result from the best inferences that adversaries can make for the reward functions determined by α. Therefore, maximal α-leakage is the maximal multiplicative increase in the expected reward for correctly inferring any function of X that is gained by using the side information Y, compared to having no side information.

III. PROPERTIES AND APPROXIMATIONS OF α-LEAKAGE

A. Properties

In Definition 2, two maximizations are included inside the logarithm. By solving the two maximizations, we simplify the expression of the maximal α-leakage in (12) in the following theorem.

Theorem 1. For α ∈ [1, ∞], the maximal α-leakage defined in (12) can be equivalently expressed as

\mathcal{L}_\alpha(X \to Y) =
  \sup_{P_X} I^{S}_\alpha(X;Y) = \sup_{P_X} I^{A}_\alpha(X;Y),  for 1 < α < ∞, (15a)
  I(X;Y),  for α = 1, (15b)
  I^{S}_{\infty}(X;Y),  for α = ∞, (15c)

where I^{S}_\alpha(X;Y) and I^{A}_\alpha(X;Y) are the Sibson and Arimoto mutual information of order α in (8) and (9), respectively.
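For a binary X, the supremum in (15a) can be approximated by a simple grid search over P_X, as in the following sketch (ours; the grid resolution and the example channel are arbitrary assumptions).

```python
import numpy as np

def sibson_mi(P_X, W, alpha):
    inner = (P_X[:, None] * W ** alpha).sum(axis=0) ** (1.0 / alpha)
    return alpha / (alpha - 1.0) * np.log(inner.sum())

def maximal_alpha_leakage(W, alpha, grid=2001):
    # Theorem 1, 1 < alpha < infinity, binary X: take the supremum of the
    # Sibson mutual information over P_X by brute force over p = P_X(0).
    ps = np.linspace(0.0, 1.0, grid)
    return max(sibson_mi(np.array([p, 1.0 - p]), W, alpha) for p in ps)

W = np.array([[0.9, 0.1],
              [0.2, 0.8]])
for a in (1.5, 2.0, 5.0, 50.0):
    print(a, maximal_alpha_leakage(W, a))
# As alpha grows, the values approach the maximal leakage log(0.9 + 0.8).
```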


To prove Theorem 1, we first solve the two maximizations in (12), and find that for α ∈ (1, ∞),

\mathcal{L}_\alpha(X \to Y) = \sup_{S-X-Y} I^{A}_\alpha(S;Y), (16a)
\mathcal{L}_{1}(X \to Y) = \sup_{S-X-Y} I(S;Y) = I(X;Y), (16b)
\mathcal{L}_{\infty}(X \to Y) = \mathcal{L}_{\mathrm{ML}}(X \to Y), (16c)

i.e., \mathcal{L}_{\infty}(X \to Y) equals the maximal leakage in (10), so that \mathcal{L}_{\infty}(X \to Y) equals (15c) by Theorem 1 in [62]. We then upper bound \sup_{S-X-Y} I^{A}_\alpha(S;Y), for α ∈ (1, ∞), by \sup_{P_X} I^{A}_\alpha(X;Y), and show that the upper bound can be achieved by a specific S with H(X|S) = 0. A detailed proof is in Appendix A-A.

Remark 2. Note that, referring to Theorem 1, MI and MaxL are captured by maximal α-leakage with α = 1 and α = ∞, respectively. Mutual information measures knowledge of a probability distribution: knowing the distribution of a random variable, we can derive all of its statistics/moments, e.g., its mean (the first raw moment) and variance (the second central moment), so mutual information captures complete statistical knowledge. Maximal leakage, in contrast, measures knowledge of the maximum-likelihood outcome. Therefore, maximal α-leakage measures a continuum of knowledge about a random variable, ranging from complete statistical knowledge to maximum-likelihood knowledge.

For α ∈ (1, ∞], the maximal α-leakage \mathcal{L}_\alpha(X \to Y) in (15) is determined only by the conditional probability matrix P_{Y|X}, and can therefore be written as \mathcal{L}_\alpha(P_{Y|X}). In the following lemma, we present privacy mechanisms that achieve zero leakage (i.e., {P_{Y|X} : \mathcal{L}_\alpha(P_{Y|X}) = 0}) or the maximum of maximal α-leakage.

Lemma 1. For α ∈ [1, ∞], the maximal α-leakage in (15) satisfies:
a. if X is independent of Y, i.e., the conditional probability matrix of Y given X is a rank-1 row-stochastic matrix, denoted P_{X⊥Y}, then

\mathcal{L}_\alpha(P_{X⊥Y}) = 0; (17)

b. if X is a deterministic function of Y, i.e., the conditional probability matrix, denoted P_{X⇐Y}, has only one non-zero entry in each column (e.g., the identity matrix), then

\mathcal{L}_\alpha(P_{X⇐Y}) = \begin{cases}\log|\mathcal{X}| & \alpha > 1 \\ H(P_X) & \alpha = 1;\end{cases} (18)

c. any column permutation of a conditional probability matrix P_{Y|X} gives the same leakage \mathcal{L}_\alpha(P_{Y|X}).

The conclusions in Lemma 1 are derived directly from the expressions for maximal α-leakage in (15); a detailed proof is in Appendix A-B.

For α = 1 and α = ∞, the maximal α-leakage is MI and MaxL, respectively, whose properties are both well studied. Therefore, we concentrate on the properties of maximal α-leakage for α ∈ (1, ∞) in (15a), which is the maximum of the Sibson mutual information I^{S}_\alpha(X;Y) of order α over all distributions of X. Using properties of Sibson mutual information of order α ∈ (1, ∞), we will prove that the maximal α-leakage \mathcal{L}_\alpha(P_{Y|X}) is
1. quasi-convex in P_{Y|X},
2. monotonically non-decreasing in α, and
3. satisfies data processing inequalities.
We also provide a composition theorem. All proofs are gathered in Appendix A; they include detailed proofs of the corresponding properties of I^{S}_\alpha(X;Y).

Proposition 1. For α ∈ (1,∞), the maximal α-leakage Lα(PY |X) in (15a) is quasi-convex in PY |X .

This property of \mathcal{L}_\alpha(P_{Y|X}) follows from the fact that I^{S}_\alpha(X;Y) is quasi-convex in P_{Y|X}. A detailed proof is in Appendix A-C, which includes the proof of the quasi-convexity of I^{S}_\alpha(X;Y).

Remark 3. Note that since MI is convex in P_{Y|X} and MaxL is a linear function of P_{Y|X}, \mathcal{L}_1(P_{Y|X}) is convex in P_{Y|X} and \mathcal{L}_\infty(P_{Y|X}) is linear in P_{Y|X}. Therefore, for all α ∈ [1, ∞], \mathcal{L}_\alpha(P_{Y|X}) is quasi-convex in P_{Y|X}.

In PUT problems, we are invariably minimizing leakage functions. Therefore, for maximal α-leakage (α ∈ (1, ∞)), we have to maximize Sibson mutual information over marginal distributions and minimize it over conditional probability matrices. To solve this min-max problem systematically, we can transform it into a convex-concave min-max problem by replacing the Sibson mutual information with the function in the following definition.


Definition 3. For two random variables X and Y with joint distribution P_{XY}, define the function

k_\alpha(P_X, P_{Y|X}) \triangleq \sum_{y\in\mathcal{Y}} \left(\sum_{x\in\mathcal{X}} P_X(x)\, P^{\alpha}_{Y|X}(y|x)\right)^{\frac{1}{\alpha}}, (19)

where P_X and P_{Y|X} are the marginal distribution and conditional probability matrix obtained from P_{XY}, respectively. The function k_\alpha(P_X, P_{Y|X}) is concave in P_X and convex in P_{Y|X}.
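A direct implementation of (19) is sketched below (ours, assuming NumPy); it also checks, via (8), that the Sibson mutual information equals (α/(α−1)) log k_α, i.e., a monotonically increasing function of k_α, as used in the discussion that follows.

```python
import numpy as np

def k_alpha(P_X, W, alpha):
    # k_alpha(P_X, P_{Y|X}) from Eq. (19); W[x, y] = P_{Y|X}(y|x).
    return ((P_X[:, None] * W ** alpha).sum(axis=0) ** (1.0 / alpha)).sum()

P_X = np.array([0.6, 0.4])
W = np.array([[0.9, 0.1],
              [0.2, 0.8]])
alpha = 2.0
k = k_alpha(P_X, W, alpha)
# By (8) and (19), I_alpha^S(X;Y) = alpha/(alpha-1) * log k_alpha(P_X, P_{Y|X}).
print(k, alpha / (alpha - 1.0) * np.log(k))
```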

For α ∈ (1, ∞), I^{S}_\alpha(P_X, P_{Y|X}) is a monotonically increasing function of k_\alpha(P_X, P_{Y|X}), and therefore so is the maximal α-leakage \mathcal{L}_\alpha(P_{Y|X}); i.e., if k_\alpha(P_X, P_{Y_1|X}) ≥ k_\alpha(P_X, P_{Y_2|X}) for every P_X, then \mathcal{L}_\alpha(P_{Y_1|X}) ≥ \mathcal{L}_\alpha(P_{Y_2|X}). The convexity and concavity of k_\alpha(P_X, P_{Y|X}) are justified as follows.
(1) The function k_\alpha(P_X, P_{Y|X}) is a sum of weighted α-norms of the columns of P_{Y|X}, since for all y ∈ \mathcal{Y}

\left(\sum_{x\in\mathcal{X}} P_X(x)\, P^{\alpha}_{Y|X}(y|x)\right)^{\frac{1}{\alpha}} = \left\|\big[P_X^{\frac{1}{\alpha}}\big]\, P_{Y|X}(y|\cdot)\right\|_{\alpha}, (20)

where [P_X^{1/α}] is a diagonal matrix with entries P_X(x)^{1/α} on the diagonal and P_{Y|X}(y|·) is the column of P_{Y|X} corresponding to Y = y. Therefore, k_\alpha(P_X, P_{Y|X}) is convex in P_{Y|X}.
(2) For α > 1, the function f : x → x^{1/α} is concave. Therefore, k_\alpha(P_X, P_{Y|X}) is a sum of compositions of a concave function with linear functions of P_X, and is thus concave in P_X.
By exploring the monotonicity of maximal α-leakage in α, we show that maximal α-leakage is non-negative and obtain upper and lower bounds for it.

Proposition 2. For α ∈ (1, ∞), the maximal α-leakage in (15a) satisfies:
(a) \mathcal{L}_\alpha(P_{Y|X}) is monotonically non-decreasing in α;
(b) \mathcal{L}_\alpha(P_{Y|X}) \leq I^{S}_{\infty}(P_{Y|X}), with equality if P_{Y|X} = P_{X⊥Y} or P_{X⇐Y};
(c) \mathcal{L}_\alpha(P_{Y|X}) \geq \frac{\alpha}{\alpha-1}\log\left(\sum_{y}\Big(\sum_{x} P^{\alpha}_{Y|X}(y|x)\Big)^{\frac{1}{\alpha}}\right) - \frac{1}{\alpha-1}\log|\mathcal{X}|, with equality if P_{Y|X} is symmetric⁴ or rank-1.

Part (a) of Proposition 2 follows directly from the fact that Sibson mutual information is a non-decreasing function of α; part (b) is a direct result of part (a) and Lemma 1; and the lower bound in part (c) is given by the uniform source distribution. A detailed proof is in Appendix A-D, which includes the proof of the monotonicity of Sibson mutual information.
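As a numerical check of the lower bound in part (c), the sketch below (ours, assuming NumPy; the binary symmetric channel is an arbitrary example) shows that the bound coincides with the Sibson mutual information evaluated at the uniform input, which is the maximizer over P_X for a symmetric channel.

```python
import numpy as np

def sibson_mi(P_X, W, alpha):
    inner = (P_X[:, None] * W ** alpha).sum(axis=0) ** (1.0 / alpha)
    return alpha / (alpha - 1.0) * np.log(inner.sum())

alpha = 3.0
# Binary symmetric channel: rows are permutations of each other, and so are columns.
W = np.array([[0.8, 0.2],
              [0.2, 0.8]])
n = W.shape[0]

# Lower bound of Proposition 2(c) ...
lower_bound = (alpha / (alpha - 1.0)
               * np.log(((W ** alpha).sum(axis=0) ** (1.0 / alpha)).sum())
               - np.log(n) / (alpha - 1.0))
# ... equals the Sibson mutual information at the uniform input distribution.
uniform_value = sibson_mi(np.full(n, 1.0 / n), W, alpha)
print(lower_bound, uniform_value)
```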

In [63], Theorem 11 provides a lower bound on the minimal average decoding error probability as a function of the maximal Sibson mutual information. From Theorem 1, we know that the maximal Sibson mutual information is, in fact, the maximal α-leakage; making use of the quasi-concavity of Sibson mutual information⁵ and Proposition 2, we have the following corollary.

Corollary 1. Let M be any positive integer. Given a conditional probability matrix P_{Y|X}, the minimal average error probability ε*(M) achievable with a size-M code satisfies

\varepsilon^{*}(M) \geq 1 - \exp\left(\inf_{\alpha\in(1,\infty)}\left\{\Big(1 - \frac{1}{\alpha}\Big)\big(\mathcal{L}_\alpha(P_{Y|X}) - \log M\big)\right\}\right). (21)

If P_{Y|X} is a symmetric channel, i.e., all rows of P_{Y|X} are permutations of each other and so are its columns, the tightest lower bound is achieved by α* = ∞.

Remark 4. Note that for a symmetric channel P_{Y|X},⁶ Sibson mutual information is a symmetric function of P_X and, due to its quasi-concavity in P_X, it is a Schur-concave function of P_X. Therefore, the uniform distribution, which is majorized by all other distributions, achieves the maximum of Sibson mutual information over all marginal distributions. Therefore, the exponential factor in (21) becomes

\Big(1 - \frac{1}{\alpha}\Big)\big(\mathcal{L}_\alpha(P_{Y|X}) - \log M\big)
 = \Big(1 - \frac{1}{\alpha}\Big)\left[\frac{\alpha}{\alpha-1}\log\sum_{y}\Big(\sum_{x} P^{\alpha}_{Y|X}(y|x)\Big)^{\frac{1}{\alpha}} - \frac{1}{\alpha-1}\log M - \log M\right] (22a)
 = \log\sum_{y}\Big(\sum_{x} P^{\alpha}_{Y|X}(y|x)\Big)^{\frac{1}{\alpha}} - \log M (22b)
 = \log\Big(|\mathcal{Y}|\,\big\|P_{Y|X}(y|\cdot)\big\|_{\alpha}\Big) - \log M, (22c)

where the last equality follows from the condition that P_{Y|X} is a symmetric channel. For an arbitrary vector, its p-norm decreases as p grows; therefore, the optimal α achieving the minimum is ∞.

⁴ All rows of P_{Y|X} are permutations of each other, and so are its columns.
⁵ The quasi-concavity of Sibson mutual information is proved within the proof of Proposition 1 (Appendix A-C).
⁶ A symmetric P_{Y|X} means that its rows are permutations of each other, and so are its columns.

Intuitively, the knowledge leaked about the original data should not increase under further processing of the released data. A reasonable measure of information leakage should therefore reflect this intuition. The following proposition verifies that the maximal α-leakage in (15a) satisfies data processing inequalities.

Proposition 3 (Data processing inequality). Let α ∈ (1, ∞). If the random variables X, Y, Z form a Markov chain, i.e., X − Y − Z, then

\mathcal{L}_\alpha(X \to Z) \leq \mathcal{L}_\alpha(X \to Y), (23a)
\mathcal{L}_\alpha(X \to Z) \leq \mathcal{L}_\alpha(Y \to Z). (23b)

The conclusion that maximal α-leakage satisfies (23a) and (23b) is based on the fact that Sibson mutual information satisfies data processing inequalities [63, Thm. 3]. Sibson mutual information is not symmetric in its arguments, i.e., in general I^{S}_\alpha(X;Y) \neq I^{S}_\alpha(Y;X); therefore, the proofs of the two data processing inequalities differ slightly, and the detailed proofs are included in Appendix A-E.

For two different applications, we may generate two published versions Y_1 and Y_2 of the data X and measure the corresponding knowledge leaked about X using maximal α-leakage. If an adversary has Y_1 and Y_2 simultaneously, what is an upper bound on the information leakage of X under maximal α-leakage? The answer is in the following theorem.

Theorem 2 (Composition Theorem). Let Y_1 and Y_2 be two random variables such that Y_1 − X − Y_2, and let P_{Y_1|X} and P_{Y_2|X} be the two corresponding conditional probability matrices. Then, for all α ∈ [1, ∞],

\mathcal{L}_\alpha(X \to Y_1, Y_2) \leq \mathcal{L}_\alpha(X \to Y_1) + \mathcal{L}_\alpha(X \to Y_2), (24)

and equality holds if P_{Y_1|X} or P_{Y_2|X} is rank-1, i.e., if Y_1 or Y_2 is independent of X.

A detailed proof is in Appendix A-F.

Remark 5. Regarding Y_1 and Y_2 as two published versions of the data X, the information leakage of X from (Y_1, Y_2) is upper bounded by the sum of the information about X independently leaked by Y_1 and Y_2. Thus, if the data curator publishes multiple versions of an original dataset for different applications, we can still upper bound the information leakage of the original dataset even if an adversary collects all of these published datasets.
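The composition bound (24) can be checked numerically by forming the product mechanism for Y_1 − X − Y_2, as in the following sketch (ours; it assumes NumPy, binary X, and arbitrary example channels).

```python
import numpy as np

def sibson_mi(P_X, W, alpha):
    inner = (P_X[:, None] * W ** alpha).sum(axis=0) ** (1.0 / alpha)
    return alpha / (alpha - 1.0) * np.log(inner.sum())

def max_alpha_leakage(W, alpha, grid=1001):
    ps = np.linspace(0.0, 1.0, grid)
    return max(sibson_mi(np.array([p, 1.0 - p]), W, alpha) for p in ps)

W1 = np.array([[0.9, 0.1],
               [0.3, 0.7]])
W2 = np.array([[0.8, 0.2],
               [0.4, 0.6]])
# Y1 - X - Y2: the joint mechanism factorizes, so row x of the combined channel
# is the outer product of row x of W1 with row x of W2.
W12 = np.stack([np.outer(W1[x], W2[x]).ravel() for x in range(2)])

alpha = 2.0
lhs = max_alpha_leakage(W12, alpha)
rhs = max_alpha_leakage(W1, alpha) + max_alpha_leakage(W2, alpha)
print(lhs, "<=", rhs, lhs <= rhs + 1e-9)   # composition bound (24)
```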

B. Approximations

For α ∈ (1, ∞), using the Taylor expansion of the function k_\alpha defined in (19), we provide approximations of maximal α-leakage in regimes around the two perfect/null privacy points, i.e., P_{X⊥Y} and P_{X⇐Y} in Lemma 1. Let W be the conditional probability matrix with W_{ij} = P_{Y|X}(y_j|x_i) for all x_i ∈ \mathcal{X} and y_j ∈ \mathcal{Y}. We use W_0 to denote a perfect privacy mechanism P_{X⊥Y}, and, without loss of generality, use the identity matrix I as a representative of the null privacy mechanisms P_{X⇐Y}. By restricting all entries of W to small regions, whose radii are determined by ρ (0 ≤ ρ ≪ 1), around the corresponding entries of W_0 or I, we give formal definitions of the two sets of W belonging to the two extremal regimes as follows.

Definition 4. For α ∈ (1, ∞), in the high-privacy regime the collection of privacy mechanisms W is

\mathcal{W}_{\mathrm{hp}} \triangleq \{W = W_0 + \Theta\}, (25)

where, letting the row vector w_0 denote the (identical) rows of W_0, the matrix Θ must satisfy

|\Theta_{ij}| \leq \rho\, w_{0j}  for all i, j, (26a)
w_{0j} \leq \frac{1}{1+\rho}  for all j, (26b)
\sum_{j} \Theta_{ij} = 0  for all i; (26c)

in the high-utility regime the collection of privacy mechanisms W is

\mathcal{W}_{\mathrm{hu}} \triangleq \{W : W = I + \Theta\}, (27)

where the matrix Θ satisfies

-\rho \leq \Theta_{ii} \leq 0  for all i, (28a)
0 \leq \Theta_{ij} \leq \rho  for all i \neq j, (28b)
\sum_{j} \Theta_{ij} = 0  for all i. (28c)

In Definition 4, the conditions in (26a), (26b) and (28a), (28b) ensure that all entries of W in (25) and (27), respectively, belong to the interval [0, 1]; (26c) and (28c) come from the requirement that the rows of W sum to 1.

Lemma 2. For α ∈ (1, ∞), if W belongs to \mathcal{W}_{\mathrm{hp}} defined in (25), the maximal α-leakage can be expressed as

\mathcal{L}_\alpha(W) = \max_{q} \frac{\alpha}{2}\left(\sum_{l}\sum_{k} \frac{q_k}{w_{0l}}\,\Theta^{2}_{kl} - \sum_{l} \frac{\big(\sum_{k} q_k \Theta_{kl}\big)^{2}}{w_{0l}}\right) + o(\rho^{2})  (nats); (29)

if W belongs to \mathcal{W}_{\mathrm{hu}} defined in (27), the maximal α-leakage can be expressed as

\mathcal{L}_\alpha(W) = \log\left(\sum_{i} W_{ii}^{\frac{\alpha}{\alpha-1}}\right) + o(\rho^{2}). (30)

The detailed derivation is in Appendix A-G.

Remark 6. Note that:
1. The quadratic approximation in (29) is continuous only for α ∈ (1, ∞), and the linear approximation in (30) is continuous for α ∈ (1, ∞], for the following two reasons:
a. The approximations in (29) and (30) come from the quadratic and linear approximations of I^{S}_\alpha, respectively, so that for α = 1 the corresponding original expression is

\max_{q} \lim_{\alpha'\to 1} I^{S}_{\alpha'}(q, W) = \max_{q} I(q, W), (31)

which is, in fact, the channel capacity of W. However, for α = 1 the maximal α-leakage \mathcal{L}_1 is mutual information, so the limits (as α → 1) of the two approximations in Lemma 2 are not approximations of \mathcal{L}_1.
b. For α = ∞ the corresponding original expression is

\max_{q} \lim_{\alpha'\to\infty} I^{S}_{\alpha'}(q, W) = I^{S}_{\infty}(W), (32)

which is the maximal α-leakage \mathcal{L}_\infty for α = ∞, so the approximation in (30) can be used to approximate \mathcal{L}_\infty. In fact, for W belonging to \mathcal{W}_{\mathrm{hu}} in (27), the maximal entry in each column lies on the diagonal, so that (30) is exactly the expression of maximal leakage for α = ∞.
2. For ρ = 0, i.e., W = W_0 in (29) and W = I in (30), the equalities in (29) and (30) hold exactly.

IV. PRIVACY AND UTILITY TRADE-OFF PROBLEM

Let U_k(P_X, P_{Y|X}), k = 1, 2, ..., be a set of utility functions of the marginal distribution P_X and the conditional probability matrix P_{Y|X}. A general model of privacy-utility trade-off (PUT) problems with maximal α-leakage as the privacy measure is

\min_{P_{Y|X}} \mathcal{L}_\alpha(P_{Y|X}) (33a)
s.t.  U_k(P_X, P_{Y|X}) \geq u_k,  k = 1, 2, ..., (33b)

where u_k, k = 1, 2, ..., are the required lower bounds on the corresponding utility measures U_k. Since, for α ∈ (1, ∞), \mathcal{L}_\alpha(P_{Y|X}) is a monotonically increasing function of k_\alpha(P_X, P_{Y|X}), to solve the problem in (33) it suffices to solve a possibly simpler optimization problem given in the following corollary.

Corollary 2. For α ∈ (1, ∞), to attain the optimal solution of the privacy-utility trade-off problem in (33), it suffices to solve

\min_{P_{Y|X}} \max_{P_X} k_\alpha(P_X, P_{Y|X}) (34a)
s.t.  U_k(P_X, P_{Y|X}) \geq u_k,  k = 1, 2, ..., (34b)

where k_\alpha(P_X, P_{Y|X}) is the function defined in (19), which is concave in P_X and convex in P_{Y|X}.


A. Privacy-Utility Trade-off with a Hard Distortion Constraint

We now consider the PUT problem using maximal α-leakage as the measure of privacy leakage, subject to a hard distortion constraint. Namely, the mechanism is constrained to satisfy d(X, Y) ≤ D with probability 1, where d(·, ·) is a distortion function. Let

\mathrm{PUT}_{\mathrm{Hd},\alpha}(D) = \inf_{P_{Y|X}:\, d(X,Y) \leq D} \mathcal{L}_\alpha(X \to Y). (35)

Recall that an f-divergence D_f, for a convex function f : \mathbb{R} \to \mathbb{R} with f(1) = 0, measures the distance between two distributions as

D_f(P \| Q) = \int dQ\, f\!\left(\frac{dP}{dQ}\right). (36)

We define a leakage function based on such an f-divergence as follows.

Definition 5. Given a joint distribution P_{XY} on finite alphabets \mathcal{X} and \mathcal{Y}, the f-divergence-based leakage from X to Y is defined as

\mathcal{L}_f(X \to Y) = \sup_{P_X} \inf_{Q_Y} D_f(P_{XY} \| P_X \times Q_Y). (37)

Note that the Shannon capacity is an f-divergence-based leakage with D_f in (37) being the relative entropy.

Lemma 3. Maximal α-leakage can be represented as a monotonic function of an f-divergence-based leakage, namely

\mathcal{L}_\alpha(X \to Y) = \frac{1}{\alpha-1}\log\big(1 + (\alpha-1)\mathcal{L}_{f_\alpha}(X \to Y)\big) (38)

with the convex function

f_\alpha(t) = \frac{1}{\alpha-1}\big(t^{\alpha} - 1\big). (39)

The convex function f_\alpha in Lemma 3 is exactly that of the Hellinger divergence [68]; a detailed proof is in Appendix B-A. Therefore, we also consider an even more general problem, with a privacy leakage measure based on an arbitrary f-divergence as in Definition 5. We now define, for any convex f with f(1) = 0, the PUT as

\mathrm{PUT}_{\mathrm{HD},f}(D) = \inf_{P_{Y|X}:\, d(X,Y) \leq D} \mathcal{L}_f(X \to Y). (40)

This tradeoff is characterized by the following theorem.

Theorem 3. For any f-divergence D_f, the optimal PUT is given by

\mathrm{PUT}_{\mathrm{HD},f}(D) = q^{\star} f\big((q^{\star})^{-1}\big) + (1 - q^{\star}) f(0), (41)

where

q^{\star} = \sup_{Q_Y} \inf_{x} Q_Y\big(B_D(x)\big) (42)

and

B_D(x) = \{y : d(x, y) \leq D\}. (43)

Moreover, if there exists a distribution Q^{\star}_Y achieving the supremum in (42), then for any f an optimal mechanism P_{Y|X} is given by

\frac{dP_{Y|X=x}}{dQ^{\star}_Y}(y) = \frac{\mathbf{1}\big(d(x,y) \leq D\big)}{Q^{\star}_Y\big(B_D(x)\big)}. (44)

A detailed proof is in Appendix B-B.

Corollary 3. For any α > 1, the PUT for maximal α-leakage is given by

\mathrm{PUT}_{\mathrm{Hd},\alpha}(D) = -\log q^{\star}, (45)

where q^{\star} is defined in (42), and for all α the optimal mechanism is given by (44).

Proof: Applying Theorem 3 to the Hellinger divergence, for any α > 1 we have

\mathrm{PUT}_{\mathrm{Hd},f_\alpha}(D) = q^{\star} f_\alpha\big((q^{\star})^{-1}\big) + (1 - q^{\star}) f_\alpha(0) = \frac{1}{\alpha-1}\big[(q^{\star})^{1-\alpha} - 1\big]. (46)

Again using the monotonic relationship (38) between maximal α-leakage and the Hellinger-divergence-based leakage,

\mathrm{PUT}_{\mathrm{Hd},\alpha}(D) = \frac{1}{\alpha-1}\log\big(1 + (\alpha-1)\mathrm{PUT}_{\mathrm{Hd},f_\alpha}(D)\big) = -\log q^{\star}. (47)
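For a finite alphabet, the saddle value q* in (42) is a small linear program, so the hard-distortion PUT in (45) can be computed directly; the sketch below is ours (it assumes NumPy and SciPy, and the alphabet and distortion function are arbitrary illustrative choices).

```python
import numpy as np
from scipy.optimize import linprog

# Illustrative setup: X = Y = {0,...,4}, d(x, y) = |x - y|, distortion bound D.
n, D = 5, 1
B = np.array([[abs(x - y) <= D for y in range(n)] for x in range(n)], dtype=float)

# q* = sup_{Q_Y} min_x Q_Y(B_D(x)) is the linear program:
#   maximize t  subject to  B @ Q >= t,  sum(Q) = 1,  Q >= 0.
# Variables are [Q_0, ..., Q_{n-1}, t]; linprog minimizes, so the objective is -t.
c = np.concatenate([np.zeros(n), [-1.0]])
A_ub = np.hstack([-B, np.ones((n, 1))])                 # t - (B @ Q)_x <= 0 for all x
b_ub = np.zeros(n)
A_eq = np.concatenate([np.ones(n), [0.0]])[None, :]     # sum(Q) = 1
b_eq = np.array([1.0])
res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
              bounds=[(0, None)] * n + [(None, None)])

q_star = -res.fun
print("q* =", q_star, "  PUT_{Hd,alpha}(D) = -log q* =", -np.log(q_star))
```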

B. Privacy-Utility Trade-off with Average Binary Hamming Distortion

Consider the PUT problem using maximal α-leakage to measure the information leakage subject to an average Hamming distortion constraint, as follows:

\mathrm{PUT}_{\mathrm{Ham},\alpha}(D) = \min_{P_{Y|X}} \mathcal{L}_\alpha(P_{Y|X}) (48a)
s.t.  \sum_{x,y\in\mathcal{X}} P_{XY}(x,y)\,\mathbf{1}(y \neq x) \leq D, (48b)

where D ∈ [0, 1] is the upper bound on the permitted average Hamming distortion. We consider the simplest, binary, sources, i.e., |\mathcal{X}| = 2. Let p be the parameter of the Bernoulli distribution P_X, i.e., P_X = Ber(p), and represent P_{Y|X} as

P_{Y|X} = \begin{pmatrix} 1-\rho_1 & \rho_1 \\ \rho_2 & 1-\rho_2 \end{pmatrix}, (49)

where the crossover probabilities satisfy \rho_1, \rho_2 ∈ [0, 1]. The problem in (48) can then be written explicitly as

\mathrm{PUT}_{\mathrm{Ham},\alpha}(D) = \min_{\rho_1,\rho_2}
  \max_{q\in[0,1]} \frac{\alpha}{\alpha-1}\log\Big(\big((1-q)(1-\rho_1)^{\alpha} + q\rho_2^{\alpha}\big)^{\frac{1}{\alpha}} + \big((1-q)\rho_1^{\alpha} + q(1-\rho_2)^{\alpha}\big)^{\frac{1}{\alpha}}\Big),  α ∈ (1, ∞),
  I(X;Y),  α = 1,
  I^{S}_{\infty}(X;Y) = \log\big(\max\{2-\rho_1-\rho_2,\ \rho_1+\rho_2\}\big),  α = ∞, (50a)
s.t.  (1-p)\rho_1 + p\rho_2 \leq D, (50b)
      0 \leq \rho_1, \rho_2 \leq 1. (50c)
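Before stating the closed-form solution, the α ∈ (1, ∞) branch of (50) can be brute-forced numerically, which is useful as a sanity check against Theorem 4 below; the following sketch is ours (it assumes NumPy, and the grid resolution and example parameters are arbitrary).

```python
import numpy as np

def max_sibson_over_q(W, alpha, qs):
    # Inner maximization of the Sibson mutual information over q = P_X(1), vectorized.
    P = np.stack([1.0 - qs, qs], axis=1)                                   # (len(qs), 2)
    inner = (P[:, :, None] * W[None, :, :] ** alpha).sum(axis=1) ** (1.0 / alpha)
    return (alpha / (alpha - 1.0) * np.log(inner.sum(axis=1))).max()

def binary_put(p, D, alpha, grid=201):
    # Brute-force sketch of (50) for alpha in (1, infinity): minimize the maximal
    # alpha-leakage over crossover probabilities (rho1, rho2) meeting (50b)-(50c).
    qs = np.linspace(0.0, 1.0, grid)
    best = np.inf
    for r1 in np.linspace(0.0, 1.0, grid):
        for r2 in np.linspace(0.0, 1.0, grid):
            if (1.0 - p) * r1 + p * r2 > D:
                continue
            W = np.array([[1.0 - r1, r1],
                          [r2, 1.0 - r2]])
            best = min(best, max_sibson_over_q(W, alpha, qs))
    return best

print(binary_put(p=0.3, D=0.1, alpha=2.0))   # compare against Theorem 4 below
```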

The optimal trade-off is characterized by the following theorem.

Theorem 4. The optimal PUTs in (50) are given as follows.

(1) For D = 0, the optimal solution is \rho^{*}_1 = \rho^{*}_2 = 0, such that

\mathrm{PUT}_{\mathrm{Ham},\alpha}(0) = \begin{cases} H(p) & \alpha = 1 \\ 1 \text{ (bit)} & \alpha > 1. \end{cases} (51)

(2) For D ≥ min{p, 1-p}, the optimal solutions are (\rho^{*}_1, \rho^{*}_2) = (1, 0) for p ≥ 0.5 and (\rho^{*}_1, \rho^{*}_2) = (0, 1) for p ≤ 0.5, such that

\mathrm{PUT}_{\mathrm{Ham},\alpha}(D) = 0  for all α ∈ [1, ∞]. (52)

(3) For 0 < D < min{p, 1-p} and p ∈ (0, 1),

\mathrm{PUT}_{\mathrm{Ham},\alpha}(D) (53)
 = H(p) - H(D),  α = 1,
 = \log\Big(2 - \frac{D}{\min\{1-p,\,p\}}\Big),  α = ∞,
 = \frac{\alpha}{\alpha-1}\log\bigg(\Big((1-\rho^{*}_1)^{\alpha}(1-\rho^{*}_2)^{\alpha} - (\rho^{*}_1\rho^{*}_2)^{\alpha}\Big)^{\frac{1}{\alpha}}\Big(\big((1-\rho^{*}_1)^{\alpha} - \rho^{*\alpha}_2\big)^{\frac{1}{1-\alpha}} + \big((1-\rho^{*}_2)^{\alpha} - \rho^{*\alpha}_1\big)^{\frac{1}{1-\alpha}}\Big)^{\frac{\alpha-1}{\alpha}}\bigg),  α ∈ (1, ∞), (54)

where the optimal solution (\rho^{*}_1, \rho^{*}_2) satisfies (1-p)\rho^{*}_1 + p\rho^{*}_2 = D. If the parameters (D, α) satisfy

(a)\ \frac{d g(x)}{dx}\Big|_{x=0} \geq 0\ (p \leq 0.5)  or  (b)\ \frac{d g(x)}{dx}\Big|_{x=\frac{D}{1-p}} \leq 0\ (p \geq 0.5), (55)

then the corresponding optimal \rho^{*}_1 equals 0 (resp. \frac{D}{1-p}) for p ≤ 0.5 (resp. p ≥ 0.5); otherwise, \rho^{*}_1 lies in (0, D) (resp. (D, \frac{D}{1-p})) for p ≤ 0.5 (resp. p ≥ 0.5) and satisfies

\frac{d g(x)}{dx}\Big|_{x=\rho^{*}_1} = 0, (56)

with the function g(x) defined as

g(x) \triangleq \Big((1-x)^{\alpha}\big(1 - \tfrac{D}{p} + \tfrac{1-p}{p}x\big)^{\alpha} - x^{\alpha}\big(\tfrac{D}{p} - \tfrac{1-p}{p}x\big)^{\alpha}\Big)^{\frac{1}{\alpha-1}} \cdot \Big(\big((1-x)^{\alpha} - \big(\tfrac{D}{p} - \tfrac{1-p}{p}x\big)^{\alpha}\big)^{\frac{1}{1-\alpha}} + \big(\big(1 - \tfrac{D}{p} + \tfrac{1-p}{p}x\big)^{\alpha} - x^{\alpha}\big)^{\frac{1}{1-\alpha}}\Big). (57)

A detailed proof of Theorem 4 is presented in Appendix B-C.

Remark 7. Note that, by observing the structure of the optimal mechanisms for different (p, D, α), we find that there exist regions of α over which the optimal solution is invariant in α; sometimes such a region has the simple form of an interval extending to ∞. Therefore, it is not necessary to explore all values of α.

APPENDIX A
PROOFS FOR SECTION III

A. Proof for Theorem 1

Proof. The expression (12) can be written explicitly as

\mathcal{L}_\alpha(X \to Y) = \sup_{S-X-Y} \lim_{\alpha'\to\alpha} \frac{\alpha'}{\alpha'-1}\log \frac{\max_{P_{\hat{S}|Y}} \sum_{s\in\mathcal{S},y\in\mathcal{Y}} P_{SY}(s,y)\big(P_{\hat{S}|Y}(s|y)\big)^{1-\frac{1}{\alpha'}}}{\max_{P_{\hat{S}}} \sum_{s\in\mathcal{S}} P_S(s)\big(P_{\hat{S}}(s)\big)^{1-\frac{1}{\alpha'}}}. (58)

The first step in simplifying the expression in (58) is to solve the two maximizations inside the logarithm. We concentrate on the maximization in the denominator; the one in the numerator can be solved by the same analysis. The maximization in the denominator can be equivalently written as

\max_{P_{\hat{S}}} \sum_{s\in\mathcal{S}} P_S(s)\big(P_{\hat{S}}(s)\big)^{1-\frac{1}{\alpha'}} (59a)
s.t.  \sum_{s\in\mathcal{S}} P_{\hat{S}}(s) = 1, (59b)
      P_{\hat{S}}(s) \geq 0  for all s \in \mathcal{S}. (59c)

For α' ∈ [1, ∞), the problem in (59) is a convex program. Therefore, using the Karush-Kuhn-Tucker (KKT) conditions, we derive the optimal solution and the corresponding optimal value of (59) as follows. Let L(P_{\hat{S}}, \mu) denote the Lagrangian of (59) with \mu as the Lagrange multiplier for the equality constraint (59b), i.e.,

L(P_{\hat{S}}, \mu) = \sum_{s\in\mathcal{S}} P_S(s)\big(P_{\hat{S}}(s)\big)^{1-\frac{1}{\alpha'}} + \mu\Big(1 - \sum_{s\in\mathcal{S}} P_{\hat{S}}(s)\Big). (60)

By the KKT conditions, the optimal solution P^{*}_{\hat{S}} of (59) must satisfy (59b), (59c) and the stationarity condition

\frac{\partial L(P_{\hat{S}}, \mu)}{\partial P_{\hat{S}}(s)} = 0 \ \Rightarrow\ P_S(s)\Big(1 - \frac{1}{\alpha'}\Big)\big(P_{\hat{S}}(s)\big)^{-\frac{1}{\alpha'}} = \mu  for all s \in \mathcal{S}. (61)

Combining (59b) and (61), we have

P^{*}_{\hat{S}}(s) = \frac{\big(P_S(s)\big)^{\alpha'}}{\sum_{s\in\mathcal{S}}\big(P_S(s)\big)^{\alpha'}}  for all s \in \mathcal{S}, (62)


which satisfies (59c), so that P^{*}_{\hat{S}} in (62) is the optimal solution of (59); the optimal value is therefore

\max_{P_{\hat{S}}} \sum_{s\in\mathcal{S}} P_S(s)\big(P_{\hat{S}}(s)\big)^{1-\frac{1}{\alpha'}} = \Big(\sum_{s\in\mathcal{S}}\big(P_S(s)\big)^{\alpha'}\Big)^{\frac{1}{\alpha'}}. (63)

Similarly, we obtain the optimal solution P^{*}_{\hat{S}|Y} of the maximization in the numerator of the logarithm in (58) as

P^{*}_{\hat{S}|Y}(s|y) = \frac{\big(P_{S|Y}(s|y)\big)^{\alpha'}}{\sum_{s\in\mathcal{S}}\big(P_{S|Y}(s|y)\big)^{\alpha'}}  for all s \in \mathcal{S},\ y \in \mathcal{Y}, (64)

and therefore

\max_{P_{\hat{S}|Y}} \sum_{s\in\mathcal{S},y\in\mathcal{Y}} P_{SY}(s,y)\big(P_{\hat{S}|Y}(s|y)\big)^{1-\frac{1}{\alpha'}} = \sum_{y\in\mathcal{Y}} P_Y(y)\Big(\sum_{s\in\mathcal{S}}\big(P_{S|Y}(s|y)\big)^{\alpha'}\Big)^{\frac{1}{\alpha'}}. (65)

Thus, for α ∈ [1, ∞), we have

\mathcal{L}_\alpha(X \to Y) = \sup_{S-X-Y} \lim_{\alpha'\to\alpha} \frac{\alpha'}{\alpha'-1}\log \frac{\sum_{y\in\mathcal{Y}} P_Y(y)\Big(\sum_{s\in\mathcal{S}} P^{\alpha'}_{S|Y}(s|y)\Big)^{\frac{1}{\alpha'}}}{\Big(\sum_{s\in\mathcal{S}} P^{\alpha'}_S(s)\Big)^{\frac{1}{\alpha'}}}. (66)

For α ∈ (1, ∞), let P_{S_\alpha}(s) = \frac{P^{\alpha}_S(s)}{\sum_{s\in\mathcal{S}} P^{\alpha}_S(s)} for all s ∈ \mathcal{S}, so that the expression in (66) can be written as

\mathcal{L}_\alpha(X \to Y) = \sup_{S-X-Y} \frac{\alpha}{\alpha-1}\log \sum_{y\in\mathcal{Y}} P_Y(y)\left(\frac{\sum_{s\in\mathcal{S}} P^{\alpha}_{S|Y}(s|y)}{\sum_{s\in\mathcal{S}} P^{\alpha}_S(s)}\right)^{\frac{1}{\alpha}} (67a)
 = \sup_{S-X-Y} \frac{\alpha}{\alpha-1}\log \sum_{y\in\mathcal{Y}} \left(\frac{\sum_{s\in\mathcal{S}} P^{\alpha}_{Y|S}(y|s)\, P^{\alpha}_S(s)}{\sum_{s\in\mathcal{S}} P^{\alpha}_S(s)}\right)^{\frac{1}{\alpha}} (67b)
 = \sup_{S-X-Y} \frac{\alpha}{\alpha-1}\log \sum_{y\in\mathcal{Y}} \left(\sum_{s\in\mathcal{S}} P^{\alpha}_{Y|S}(y|s)\, P_{S_\alpha}(s)\right)^{\frac{1}{\alpha}} (67c)
 = \sup_{S-X-Y} I^{A}_\alpha(S;Y), (67d)

where I^{A}_\alpha(\cdot;\cdot) is the Arimoto mutual information of order α [63], [69]. Note that since

I^{A}_\alpha(S;Y) = \frac{\alpha}{1-\alpha}\, E_0\Big(\frac{1}{\alpha}-1,\ P_{S_\alpha}\Big), (68a)
I^{S}_\alpha(S;Y) = \frac{\alpha}{1-\alpha}\, E_0\Big(\frac{1}{\alpha}-1,\ P_S\Big), (68b)

where I^{S}_\alpha(\cdot;\cdot) is the Sibson mutual information of order α [63] and E_0(\cdot,\cdot) is the widely used Gallager error exponent function [66], we have

\sup_{S} I^{A}_\alpha(S;Y) = \sup_{S} I^{S}_\alpha(S;Y), (69)

which means that for any given conditional probability matrix, e.g., P_{Y|S} in (69), the Arimoto and Sibson mutual information have the same supremum over all source distributions. To attain (15a), we first provide an upper bound, and then give an achievable scheme, as follows.

(1) Converse: we derive an upper bound on \mathcal{L}_\alpha(X \to Y) as

\mathcal{L}_\alpha(X \to Y) = \sup_{S-X-Y} I^{A}_\alpha(S;Y) (70a)
 \leq \sup_{P_{X|S}:\, P_{X|S}(\cdot|s) \ll P_X}\ \sup_{P_S} I^{A}_\alpha(S;Y) (70b)
 = \sup_{P_{X|S}:\, P_{X|S}(\cdot|s) \ll P_X}\ \sup_{P_S} I^{S}_\alpha(S;Y) (70c)
 = \sup_{\tilde{P}_X \ll P_X} I^{S}_\alpha(X;Y) (70d)
 = \sup_{\tilde{P}_X \ll P_X} I^{A}_\alpha(X;Y), (70e)

where \tilde{P}_X \ll P_X means that the support of the distribution \tilde{P}_X is a subset of the support of P_X. The inequality in (70b) holds because we consider all joint distributions P_{S,X} whose X-marginal is supported on \mathcal{X} (the support of X), instead of only the P_{S,X} in (70a) induced by the source distribution P_X. The equalities in (70c) and (70e) follow from the fact that the Arimoto and Sibson mutual information have the same supremum over the same set of source distributions, and (70d) is due to the fact that S − X − Y is a Markov chain and Sibson mutual information obeys the data processing inequality [63, Thm. 3].⁷

⁷ A detailed proof that Sibson mutual information obeys the data processing inequality is included in the proof that maximal α-leakage satisfies the data processing inequality (Appendix A-E).

(2) Achievability: we consider a function S of X such that H(X|S) = 0. Specifically, let the support \mathcal{S} consist of sets \mathcal{S}_x, each collecting the values of S mapped to a given x ∈ \mathcal{X}, i.e.,

\mathcal{S} = \cup_{x\in\mathcal{X}} \mathcal{S}_x  with  S = s \in \mathcal{S}_x if and only if X = x. (71)

Therefore, for this specific variable S, we have

P_{Y|S}(y|s) = P_{Y|X}(y|x) for all s \in \mathcal{S}_x, and 0 otherwise. (72)

Construct a probability distribution \tilde{P}_X over \mathcal{X} from P_S as

\tilde{P}_X(x) = \frac{\sum_{s\in\mathcal{S}_x} P^{\alpha}_S(s)}{\sum_{x\in\mathcal{X}}\sum_{s\in\mathcal{S}_x} P^{\alpha}_S(s)}  for all x \in \mathcal{X}, (73)

so that the Arimoto mutual information for this S and Y can be rewritten as

I^{A}_\alpha(S;Y) = \frac{\alpha}{\alpha-1}\log \sum_{y\in\mathcal{Y}} \left(\frac{\sum_{x\in\mathcal{X}}\sum_{s\in\mathcal{S}_x} P^{\alpha}_{Y|S}(y|s)\, P^{\alpha}_S(s)}{\sum_{x\in\mathcal{X}}\sum_{s\in\mathcal{S}_x} P^{\alpha}_S(s)}\right)^{\frac{1}{\alpha}} (74a)
 = \frac{\alpha}{\alpha-1}\log \sum_{y\in\mathcal{Y}} \left(\sum_{x\in\mathcal{X}} P^{\alpha}_{Y|X}(y|x)\, \frac{\sum_{s\in\mathcal{S}_x} P^{\alpha}_S(s)}{\sum_{x\in\mathcal{X}}\sum_{s\in\mathcal{S}_x} P^{\alpha}_S(s)}\right)^{\frac{1}{\alpha}} (74b)
 = \frac{\alpha}{\alpha-1}\log \sum_{y\in\mathcal{Y}} \left(\sum_{x\in\mathcal{X}} P^{\alpha}_{Y|X}(y|x)\, \tilde{P}_X(x)\right)^{\frac{1}{\alpha}} (74c)
 = I^{S}_\alpha(\tilde{X};Y), (74d)

where (74b) and (74c) are direct consequences of (72) and (73), respectively, and \tilde{X} \sim \tilde{P}_X. Therefore,

\sup_{\tilde{P}_X \ll P_X} I^{S}_\alpha(X;Y) = \sup_{S:\, H(X|S)=0} I^{A}_\alpha(S;Y) \leq \sup_{S-X-Y} I^{A}_\alpha(S;Y) = \mathcal{L}_\alpha(X \to Y). (75)

Therefore, from (70d), (70e) and (75), we attain (15a).

For α = 1, the optimal solutions in (62) and (64) exist, so the expression for \mathcal{L}_1(X \to Y) can be obtained using L'Hôpital's rule as follows:

\lim_{\alpha'\to 1} \frac{\alpha'}{\alpha'-1}\log \frac{\sum_{y\in\mathcal{Y}} P_Y(y)\big(\sum_{s\in\mathcal{S}} P^{\alpha'}_{S|Y}(s|y)\big)^{\frac{1}{\alpha'}}}{\big(\sum_{s\in\mathcal{S}} P^{\alpha'}_S(s)\big)^{\frac{1}{\alpha'}}}
 = \frac{d}{d\alpha'}\left[\alpha'\log \frac{\sum_{y\in\mathcal{Y}} P_Y(y)\big(\sum_{s\in\mathcal{S}} P^{\alpha'}_{S|Y}(s|y)\big)^{\frac{1}{\alpha'}}}{\big(\sum_{s\in\mathcal{S}} P^{\alpha'}_S(s)\big)^{\frac{1}{\alpha'}}}\right]\Bigg|_{\alpha'=1} (76a)
 = \sum_{y\in\mathcal{Y}} P_Y(y)\Big(\sum_{s\in\mathcal{S}} P_{S|Y}(s|y)\log P_{S|Y}(s|y)\Big) - \sum_{s\in\mathcal{S}} P_S(s)\log P_S(s) (76b)
 = \sum_{y\in\mathcal{Y}}\sum_{s\in\mathcal{S}} P_{SY}(s,y)\log\frac{P_{S|Y}(s|y)}{P_S(s)} = I(Y;S) = I(S;Y). (76c)

Furthermore, mutual information satisfies the data processing inequality [70, Thm. 2.8.1]; therefore, for S − X − Y we have

I(S;Y) \leq I(X;Y)\ \Rightarrow\ \sup_{S-X-Y} I(S;Y) \leq I(X;Y). (77)

Since S can be any function of X, letting S = X gives

I(X;Y) \leq \sup_{S-X-Y} I(S;Y). (78)

Therefore, from (77) and (78), we attain (15b).

Note that for α = ∞, the optimal solution in (62) takes the indeterminate form 0/0, so we cannot use the strategy above to simplify \mathcal{L}_\infty(X \to Y); we analyze it as follows. For α = ∞, the expression in (58) becomes

\mathcal{L}_\infty(X \to Y) = \sup_{S-X-Y} \log\left(\frac{\max_{P_{\hat{S}|Y}} \sum_{s\in\mathcal{S},y\in\mathcal{Y}} P_{SY}(s,y)\, P_{\hat{S}|Y}(s|y)}{\max_{P_{\hat{S}}} \sum_{s\in\mathcal{S}} P_S(s)\, P_{\hat{S}}(s)}\right). (79)

Since the largest convex combination is attained by putting all weight on the maximal value, the optimal values of the two maximizations in (79) are

\max_{P_{\hat{S}|Y}} \sum_{s\in\mathcal{S},y\in\mathcal{Y}} P_{SY}(s,y)\, P_{\hat{S}|Y}(s|y) = \sum_{y\in\mathcal{Y}} P_Y(y)\max_{s\in\mathcal{S}} P_{S|Y}(s|y), (80a)
\max_{P_{\hat{S}}} \sum_{s\in\mathcal{S}} P_S(s)\, P_{\hat{S}}(s) = \max_{s\in\mathcal{S}} P_S(s). (80b)

Therefore, for α = ∞, the maximal α-leakage is

\mathcal{L}_\infty(X \to Y) = \sup_{S-X-Y} \log\left(\frac{\sum_{y\in\mathcal{Y}} P_Y(y)\max_{s\in\mathcal{S}} P_{S|Y}(s|y)}{\max_{s\in\mathcal{S}} P_S(s)}\right), (81)

which is the same as the definition of the operational measure of information leakage in [62] and was proved to be equivalent to maximal leakage [62, Thm. 1]. That is, for α = ∞, the expression (58) simplifies to (15c).

B. Proof for Lemma 1

Proof. For α ∈ (1, ∞), referring to (15a) and (8), we have

\mathcal{L}_\alpha(P_{Y|X}) = \sup_{P_X} \frac{\alpha}{\alpha-1}\log\sum_{y\in\mathcal{Y}}\left(\sum_{x\in\mathcal{X}} P_X(x)\big(P_{Y|X}(y|x)\big)^{\alpha}\right)^{\frac{1}{\alpha}} (82)
 \geq \sup_{P_X} \frac{\alpha}{\alpha-1}\log\sum_{y\in\mathcal{Y}}\left(\Big(\sum_{x\in\mathcal{X}} P_X(x)\, P_{Y|X}(y|x)\Big)^{\alpha}\right)^{\frac{1}{\alpha}} (83)
 = \sup_{P_X} \frac{\alpha}{\alpha-1}\log 1 = 0, (84)

where (83) results from applying Jensen's inequality to the convex function f : t → t^α (t ≥ 0); equality holds if and only if, for every y ∈ \mathcal{Y}, the values P_{Y|X}(y|x) are the same for all x ∈ \mathcal{X} (equal to P_Y(y)), i.e., P_{Y|X} is a rank-1 row-stochastic matrix, which means that X and Y are independent. We use P_{X⊥Y} to denote these privacy mechanisms that guarantee perfect privacy (i.e., achieve zero leakage). For α = 1 and α = ∞, we have

\mathcal{L}_1(P_{X⊥Y}) = I(X;Y) = 0, (85)
\mathcal{L}_\infty(P_{X⊥Y}) = I^{S}_\infty(X;Y) = 0. (86)

Therefore, \mathcal{L}_\alpha(P_{X⊥Y}) = 0 for all α ∈ [1, ∞].

Let P_{X⇐Y} be a conditional probability matrix with only one non-zero entry in each column, and for each y ∈ \mathcal{Y} let x_y denote the unique x with P_{X⇐Y}(y|x) > 0. For α ∈ (1, ∞), from (15a) and (8) we have

\mathcal{L}_\alpha(P_{X⇐Y}) = \sup_{P_X} \frac{\alpha}{\alpha-1}\log\sum_{y\in\mathcal{Y}} P_X^{\frac{1}{\alpha}}(x_y)\, P_{X⇐Y}(y|x_y) = \sup_{P_X} \frac{\alpha}{\alpha-1}\log\sum_{x\in\mathcal{X}} P_X^{\frac{1}{\alpha}}(x); (87)

in addition, since the function maximized in (87) is symmetric and concave in P_X, it is Schur-concave in P_X, and therefore the distribution of X achieving the supremum in (87) is uniform. Thus,

\mathcal{L}_\alpha(P_{X⇐Y}) = \log|\mathcal{X}|  for α ∈ (1, ∞). (88)

For α = 1, referring to (15b), we have

\mathcal{L}_1(P_{X⇐Y}) = \sum_{y\in\mathcal{Y}} P_X(x_y)\, P_{X⇐Y}(y|x_y)\log\frac{P_{X⇐Y}(y|x_y)}{P_X(x_y)\, P_{X⇐Y}(y|x_y)} = \sum_{y\in\mathcal{Y}} P_X(x_y)\, P_{X⇐Y}(y|x_y)\log\frac{1}{P_X(x_y)} = \sum_{x\in\mathcal{X}} P_X(x)\log\frac{1}{P_X(x)} = H(P_X),

which is exactly the upper bound of I(X;Y). For α = ∞, referring to (15c), we have

\mathcal{L}_\infty(P_{X⇐Y}) = \log\sum_{y\in\mathcal{Y}} P_{X⇐Y}(y|x_y) = \log|\mathcal{X}|, (89)

which is exactly the upper bound of maximal leakage [?, Lem. 1]. Therefore, for P_{X⇐Y},

\mathcal{L}_\alpha(P_{X⇐Y}) = \begin{cases}\log|\mathcal{X}| & \alpha > 1 \\ H(P_X) & \alpha = 1.\end{cases} (90)

Column permutations of a conditional probability matrix preserve its Sibson mutual information, mutual information and maximal leakage; thus conditional probability matrices generated by column permutations have the same maximal α-leakage.

C. Proof for Proposition 1

Proof. The quasi-convexity of maximal α-leakage results from the quasi-convexity of the Sibson mutual information of order α > 1 [67, Thm. 10] and the property of quasi-convex functions that the supremum of a family of quasi-convex functions is quasi-convex; i.e., if a function f(a, b) is quasi-convex in b, then \sup_a f(a, b) is also quasi-convex in b [71]. In addition, the quasi-convexity of the Sibson mutual information of order α > 1 is based on the quasi-convexity of the conditional Rényi divergence of order α > 1 [67, Thm. 8], shown as follows.

Let P_X be a probability distribution of X and let P_{Y|X} and Q_{Y|X} be two transition probability matrices from X to Y, so that the conditional Rényi divergence of order α is

D_\alpha(P_{Y|X}\|Q_{Y|X}|P_X) = \frac{1}{\alpha-1}\log\sum_{x\in\mathcal{X}} P_X(x)\sum_{y\in\mathcal{Y}} P_{Y|X}(y|x)^{\alpha}\, Q_{Y|X}(y|x)^{1-\alpha}. (91)

Let P_{Y|X,\lambda} = \lambda P_{Y|X,1} + (1-\lambda)P_{Y|X,2} and Q_{Y|X,\lambda} = \lambda Q_{Y|X,1} + (1-\lambda)Q_{Y|X,2}, so that

\sum_{y\in\mathcal{Y}} P_{Y|X,\lambda}(y|x)^{\alpha}\, Q_{Y|X,\lambda}(y|x)^{1-\alpha} = \sum_{y\in\mathcal{Y}} Q_{Y|X,\lambda}(y|x)\left(\frac{P_{Y|X,\lambda}(y|x)}{Q_{Y|X,\lambda}(y|x)}\right)^{\alpha}. (92)

For α > 1, the function f : x → x^α is convex for x ≥ 0. Therefore,

\frac{\lambda Q_{Y|X,1}(y|x)}{Q_{Y|X,\lambda}(y|x)}\left(\frac{P_{Y|X,1}(y|x)}{Q_{Y|X,1}(y|x)}\right)^{\alpha} + \frac{(1-\lambda)Q_{Y|X,2}(y|x)}{Q_{Y|X,\lambda}(y|x)}\left(\frac{P_{Y|X,2}(y|x)}{Q_{Y|X,2}(y|x)}\right)^{\alpha} \geq \left(\frac{\lambda Q_{Y|X,1}(y|x)}{Q_{Y|X,\lambda}(y|x)}\frac{P_{Y|X,1}(y|x)}{Q_{Y|X,1}(y|x)} + \frac{(1-\lambda)Q_{Y|X,2}(y|x)}{Q_{Y|X,\lambda}(y|x)}\frac{P_{Y|X,2}(y|x)}{Q_{Y|X,2}(y|x)}\right)^{\alpha} = \left(\frac{P_{Y|X,\lambda}(y|x)}{Q_{Y|X,\lambda}(y|x)}\right)^{\alpha} (93)

\Rightarrow\ \lambda Q_{Y|X,1}(y|x)\left(\frac{P_{Y|X,1}(y|x)}{Q_{Y|X,1}(y|x)}\right)^{\alpha} + (1-\lambda)Q_{Y|X,2}(y|x)\left(\frac{P_{Y|X,2}(y|x)}{Q_{Y|X,2}(y|x)}\right)^{\alpha} \geq Q_{Y|X,\lambda}(y|x)\left(\frac{P_{Y|X,\lambda}(y|x)}{Q_{Y|X,\lambda}(y|x)}\right)^{\alpha} (94)

\Rightarrow\ \lambda\sum_{y\in\mathcal{Y}} Q_{Y|X,1}(y|x)\left(\frac{P_{Y|X,1}(y|x)}{Q_{Y|X,1}(y|x)}\right)^{\alpha} + (1-\lambda)\sum_{y\in\mathcal{Y}} Q_{Y|X,2}(y|x)\left(\frac{P_{Y|X,2}(y|x)}{Q_{Y|X,2}(y|x)}\right)^{\alpha} \geq \sum_{y\in\mathcal{Y}} Q_{Y|X,\lambda}(y|x)\left(\frac{P_{Y|X,\lambda}(y|x)}{Q_{Y|X,\lambda}(y|x)}\right)^{\alpha}, (95)

which means that for α > 1, \sum_{y\in\mathcal{Y}} P_{Y|X}(y|x)^{\alpha} Q_{Y|X}(y|x)^{1-\alpha} is convex in (P_{Y|X}, Q_{Y|X}). In addition, for α > 1, the function \frac{1}{\alpha-1}\log(\cdot) is monotonically non-decreasing. Since the composition of a non-decreasing function with a convex function is quasi-convex [71], D_\alpha is quasi-convex in (P_{Y|X}, Q_{Y|X}).

The Rényi mutual information of order α > 1 is defined as [67, Def. 7]

I_\alpha(P_X, P_{Y|X}) = \min_{Q_Y} D_\alpha(P_{Y|X}\|Q_Y|P_X). (96)

Consider two conditional probability matrices P_{Y|X,1} and P_{Y|X,2}, and let Q^{*}_{Y,i}, i = 1, 2, minimize (96) for P_{Y|X,i}, i.e.,

Q^{*}_{Y,i} = \arg\min_{Q_Y} D_\alpha(P_{Y|X,i}\|Q_Y|P_X). (97)

Therefore, for λ ∈ [0, 1],

I_\alpha(P_X, \lambda P_{Y|X,1} + (1-\lambda)P_{Y|X,2}) = \min_{Q_Y} D_\alpha(\lambda P_{Y|X,1} + (1-\lambda)P_{Y|X,2}\|Q_Y|P_X) (98)
 \leq D_\alpha(\lambda P_{Y|X,1} + (1-\lambda)P_{Y|X,2}\|\lambda Q^{*}_{Y,1} + (1-\lambda)Q^{*}_{Y,2}|P_X) (99)
 \leq \max\{D_\alpha(P_{Y|X,1}\|Q^{*}_{Y,1}|P_X),\ D_\alpha(P_{Y|X,2}\|Q^{*}_{Y,2}|P_X)\} (100)
 = \max\{I_\alpha(P_X, P_{Y|X,1}),\ I_\alpha(P_X, P_{Y|X,2})\}, (101)

i.e., for α > 1, I_\alpha(P_X, P_{Y|X}) is quasi-convex in P_{Y|X}; the inequality in (99) follows directly from the minimization, and (100) is based on the quasi-convexity of the conditional Rényi divergence of order α > 1.

D. Proof for Proposition 2

Proof. In [72, Thm. 3], the Rényi divergence of order α > 1 is shown to be non-decreasing in the order, as follows. Let β > α > 1, and let D_\alpha(P_X\|Q_X) and D_\beta(P_X\|Q_X) denote the Rényi divergences of order α and β, respectively, between two distributions P_X ≪ Q_X of X. The function z → z^{\frac{\alpha-1}{\beta-1}} is strictly concave for z ≥ 0, and therefore, by Jensen's inequality,

D_\alpha(P_X\|Q_X) = \frac{1}{\alpha-1}\log\left(\sum_{x\in\mathcal{X}} P^{\alpha}_X(x)\, Q^{1-\alpha}_X(x)\right)
 = \frac{1}{\alpha-1}\log\left(\sum_{x\in\mathcal{X}} P_X(x)\left(\frac{P_X(x)}{Q_X(x)}\right)^{(\beta-1)\frac{\alpha-1}{\beta-1}}\right) (102a)
 = \frac{1}{\alpha-1}\log\sum_{x\in\mathcal{X}} P_X(x)\left(\frac{P^{\beta-1}_X(x)}{Q^{\beta-1}_X(x)}\right)^{\frac{\alpha-1}{\beta-1}} (102b)
 \leq \frac{1}{\alpha-1}\log\left(\sum_{x\in\mathcal{X}} P_X(x)\frac{P^{\beta-1}_X(x)}{Q^{\beta-1}_X(x)}\right)^{\frac{\alpha-1}{\beta-1}} (102c)
 = \frac{1}{\beta-1}\log\left(\sum_{x\in\mathcal{X}} P_X(x)\frac{P^{\beta-1}_X(x)}{Q^{\beta-1}_X(x)}\right) = D_\beta(P_X\|Q_X), (102d)

where the inequality in (102c) holds with equality if and only if P_X = Q_X.

Based on the non-decreasing order property of Rényi divergence, [67, Thm. 4] shows that the mutual information of order α > 1 is also non-decreasing in α. Given P_X and P_{Y|X}, the mutual information of order α, I_\alpha(P_X, P_{Y|X}), is defined as

I_\alpha(P_X, P_{Y|X}) = \min_{Q} D_\alpha(P_{Y|X}\|Q|P_X). (103)

Let β > α > 1, and denote Q^{*}_{\beta} = \arg\min_{Q} D_\beta(P_{Y|X}\|Q|P_X). Therefore,

I_\alpha(P_X, P_{Y|X}) \leq D_\alpha(P_{Y|X}\|Q^{*}_{\beta}|P_X) (104a)
 \leq D_\beta(P_{Y|X}\|Q^{*}_{\beta}|P_X) (104b)
 = I_\beta(P_X, P_{Y|X}), (104c)


where the inequality in (104a) holds with equality if and only if Q^{*}_{\beta} = \arg\min_{Q} D_\alpha(P_{Y|X}\|Q|P_X), and (104b) follows from the non-decreasing order property of Rényi divergence. Based on the non-decreasing order property of the α-mutual information, part (a) can be proved as follows. Let β > α > 1, and, given P_{Y|X}, denote P^{*}_{X\alpha} = \arg\max_{P_X} I_\alpha(P_X, P_{Y|X}). Therefore,

\mathcal{L}_\alpha(P_{Y|X}) = I_\alpha(P^{*}_{X\alpha}, P_{Y|X}) (105a)
 \leq I_\beta(P^{*}_{X\alpha}, P_{Y|X}) (105b)
 \leq \max_{P_X} I_\beta(P_X, P_{Y|X}) = \mathcal{L}_\beta(P_{Y|X}), (105c)

where (105b) follows from the non-decreasing order property of the α-mutual information, and the equality in (105c) holds if and only if P^{*}_{X\alpha} = \arg\max_{P_X} I_\beta(P_X, P_{Y|X}).

The upper bound in part (b) follows directly from the fact that maximal α-leakage is non-decreasing in α, and from parts a and b of Lemma 1 it is known that for the two special kinds of mechanisms, i.e., P_{X⊥Y} and P_{X⇐Y}, the upper bound is tight.

Given P_{Y|X}, the lower bound in part (c) is exactly the α-mutual information for the uniform distribution of X. Using the concavity of I_\alpha(P_X, P_{Y|X}) in P_X from [67, Thm. 8], it can be shown that the lower bound is tight not only for P_{X⊥Y} and P_{X⇐Y}, but for all symmetric P_{Y|X}. First, we show the proof of the concavity of I_\alpha(P_X, P_{Y|X}), which follows from the fact that the conditional Rényi divergence is concave in P_X [67]. Given two conditional probability matrices P_{Y|X} and Q_{Y|X}, the conditional Rényi divergence D_\alpha(P_{Y|X}\|Q_{Y|X}|P_X) in (106) is concave in P_X, since it is the composition of an increasing concave function \frac{1}{\alpha-1}\log(\cdot) with an affine mapping of P_X [71, pp. 79]:

D_\alpha(P_{Y|X}\|Q_{Y|X}|P_X) = \frac{1}{\alpha-1}\log\sum_{x\in\mathcal{X}} P_X(x)\sum_{y\in\mathcal{Y}} P^{\alpha}_{Y|X}(y|x)\, Q^{1-\alpha}_{Y|X}(y|x). (106)

Recall that I_\alpha(P_X, P_{Y|X}) is defined as

I_\alpha(P_X, P_{Y|X}) = \min_{Q} D_\alpha(P_{Y|X}\|Q|P_X). (107)

Let P_{X_1} and P_{X_2} be two probability distributions of X, and let P_X = \lambda P_{X_1} + (1-\lambda)P_{X_2} with λ ∈ [0, 1]. Let Q^{*} denote the probability distribution minimizing (107) for this P_X. Therefore,

I_\alpha(P_X, P_{Y|X}) = D_\alpha(P_{Y|X}\|Q^{*}|\lambda P_{X_1} + (1-\lambda)P_{X_2}) (108a)
 \geq \lambda D_\alpha(P_{Y|X}\|Q^{*}|P_{X_1}) + (1-\lambda)D_\alpha(P_{Y|X}\|Q^{*}|P_{X_2}) (108b)
 \geq \lambda\min_{Q} D_\alpha(P_{Y|X}\|Q|P_{X_1}) + (1-\lambda)\min_{Q} D_\alpha(P_{Y|X}\|Q|P_{X_2}) (108c)
 = \lambda I_\alpha(P_{X_1}, P_{Y|X}) + (1-\lambda)I_\alpha(P_{X_2}, P_{Y|X}), (108d)

where the inequality in (108b) follows from the concavity of the conditional Rényi divergence in the marginal distribution. Thus, the concavity of I_\alpha(P_X, P_{Y|X}) in the marginal distribution is proved.

Since a symmetric and concave function is Schur-concave, for a symmetric P_{Y|X}, I_\alpha(P_X, P_{Y|X}) is Schur-concave in P_X [67, Col. 9]. Let f(x) be a function that is Schur-concave in a vector variable x ∈ \mathbb{R}^n, and let x_1 and x_2 be two decreasingly ordered vectors in the domain of f(x). If x_1 majorizes x_2, i.e.,

\sum_{i=1}^{k} x_{1i} \geq \sum_{i=1}^{k} x_{2i}  for all k \leq n, (109a)
\sum_{i=1}^{n} x_{1i} = \sum_{i=1}^{n} x_{2i}, (109b)

then f(x_1) ≤ f(x_2). Using this property and the fact that the uniform distribution is majorized by all distributions, we conclude that for a symmetric P_{Y|X}, the uniform distribution maximizes (15a) for α ∈ (1, ∞), and its α-mutual information is the maximal α-leakage of order α. Therefore, the lower bound is tight for all symmetric P_{Y|X}. From part a of Lemma 1, it is known that for the special mechanisms P_{X⊥Y}, the lower bound is also tight.

E. Proof for Proposition 3

Proof. The proof is based on the expression of the maximal α-leakage in (15a) as well as the statement that Sibson mutual information satisfies the data processing inequality [63, Thm. 3]. First, we prove that for α ∈ (0,∞), the maximal α-leakage


satisfies (23) by using the following statement for Sibson mutual information: for the Markov chain $X - Y - Z$,
$$I_\alpha(X;Z) \le I_\alpha(X;Y), \qquad (110a)$$
$$I_\alpha(X;Z) \le I_\alpha(Y;Z); \qquad (110b)$$

we then prove this statement. Let $P_{Z|X}$, $P_{Y|X}$ and $P_{Z|Y}$ be the conditional probability matrices between $X\,\&\,Z$, $X\,\&\,Y$ and $Y\,\&\,Z$, respectively. Let $P^*_X = \arg\sup_{P_X} I_\alpha(P_X, P_{Z|X})$; then, because of $X - Y - Z$, we have

$$\mathcal{L}_\alpha(X \to Z) = I_\alpha(P^*_X, P_{Z|X}) \qquad (111a)$$
$$\le I_\alpha(P^*_X, P_{Y|X}) \qquad (111b)$$
$$\le \sup_{P_X} I_\alpha(P_X, P_{Y|X}) = \mathcal{L}_\alpha(X\to Y), \qquad (111c)$$

where (111a) and (111c) follow from the expression of the maximal α-leakage for α ∈ (0,∞) in (15a), and (111b) follows from (110a). Similarly, the inequality in (23) can also be proved by using (15a) and (110b). To prove that Sibson mutual information satisfies the data processing inequality, we first need to prove that Rényi divergence of order α ∈ (1,∞) satisfies the data processing inequality, i.e., for α > 1, given a conditional probability matrix $P_{Y|X}$, two probability distributions $P_X$ and $Q_X$ of $X$, and the corresponding output distributions $P_Y$ and $Q_Y$ of $Y$, there is

Dα(PY ‖QY ) ≤ Dα(PX‖QX). (112)

For $\alpha > 1$, the function $f: x \mapsto x^\alpha$ is convex for all nonnegative $x$. Let $a(x) = \frac{Q_X(x)P_{Y|X}(y|x)}{Q_Y(y)}$, so that $\sum_{x\in\mathcal{X}} a(x) = 1$.

Referring to Jensen's inequality [70, Thm. 2.6.2], we have
$$\sum_{x\in\mathcal{X}} a(x)\left(\frac{P_X(x)}{Q_X(x)}\right)^\alpha \ge \left(\sum_{x\in\mathcal{X}} a(x)\frac{P_X(x)}{Q_X(x)}\right)^\alpha \qquad (113a)$$
$$\Leftrightarrow \sum_{x\in\mathcal{X}} \frac{Q_X(x)P_{Y|X}(y|x)}{Q_Y(y)}\left(\frac{P_X(x)}{Q_X(x)}\right)^\alpha \ge \left(\frac{\sum_{x\in\mathcal{X}} P_X(x)P_{Y|X}(y|x)}{Q_Y(y)}\right)^\alpha \qquad (113b)$$
$$\Leftrightarrow \sum_{x\in\mathcal{X}} P_{Y|X}(y|x)\,Q_X(x)\left(\frac{P_X(x)}{Q_X(x)}\right)^\alpha \ge Q_Y(y)\left(\frac{P_Y(y)}{Q_Y(y)}\right)^\alpha \qquad (113c)$$
$$\Longrightarrow \sum_{y\in\mathcal{Y}}\sum_{x\in\mathcal{X}} P_{Y|X}(y|x)\,Q_X(x)\left(\frac{P_X(x)}{Q_X(x)}\right)^\alpha \ge \sum_{y\in\mathcal{Y}} Q_Y(y)\left(\frac{P_Y(y)}{Q_Y(y)}\right)^\alpha \qquad (113d)$$
$$\Leftrightarrow \sum_{x\in\mathcal{X}} P_X^\alpha(x)\,Q_X^{1-\alpha}(x) \ge \sum_{y\in\mathcal{Y}} P_Y^\alpha(y)\,Q_Y^{1-\alpha}(y) \qquad (113e)$$
$$\Longrightarrow \frac{1}{\alpha-1}\log\left(\sum_{x\in\mathcal{X}} P_X^\alpha(x)\,Q_X^{1-\alpha}(x)\right) \ge \frac{1}{\alpha-1}\log\left(\sum_{y\in\mathcal{Y}} P_Y^\alpha(y)\,Q_Y^{1-\alpha}(y)\right) \quad \text{for } \alpha>1, \qquad (113f)$$

which shows that the inequality in (112) holds for α > 1. Using this conclusion, we prove the data processing inequality for Sibson mutual information as stated in (110a) and (110b). We first present the proof of (110a). Let $Q_Y$ be an arbitrary probability distribution of $Y$ (independent of $X$), and let $Q_Z$ be the corresponding probability distribution of $Z$ generated by $P_{Z|Y}$. For α > 1, by the data processing inequality for Rényi divergence proved above, we have^8

$$\frac{1}{\alpha-1}\log\left(\sum_{x\in\mathcal{X}} P_X(x)\,2^{(\alpha-1)D_\alpha(P_{Z|x}\|Q_Z)}\right) \le \frac{1}{\alpha-1}\log\left(\sum_{x\in\mathcal{X}} P_X(x)\,2^{(\alpha-1)D_\alpha(P_{Y|x}\|Q_Y)}\right) \qquad (114a)$$
$$\Longleftrightarrow D_\alpha(P_{Z|X}\|Q_Z|P_X) \le D_\alpha(P_{Y|X}\|Q_Y|P_X). \qquad (114b)$$

Let $Q^*_Y$ be the optimal solution of $\min_{Q_Y} D_\alpha(P_{Y|X}\|Q_Y|P_X)$ and let $Q^*_Z$ be the distribution induced from $Q^*_Y$ by $P_{Z|Y}$. By the definition of Sibson mutual information [63, Def. 4], we have

$$I_\alpha(X;Y) = I_\alpha(P_X, P_{Y|X}) = D_\alpha(P_{Y|X}\|Q^*_Y|P_X) \qquad (115a)$$
$$\ge D_\alpha(P_{Z|X}\|Q^*_Z|P_X) \qquad (115b)$$
$$\ge \min_{Q_Z} D_\alpha(P_{Z|X}\|Q_Z|P_X) = I_\alpha(X;Z), \qquad (115c)$$

^8 For any $x\in\mathcal{X}$, $P_{Z|X=x}$ and $Q_Z$ are generated by passing $P_{Y|X=x}$ and $Q_Y$ through $P_{Z|Y}$, respectively.


where (115a) and (115c) come directly from the definition of Sibson mutual information, and (115b) results from (114b). To prove the inequality in (110b), we consider the pairs of random variables $(X,Z)$ and $(Y,Z)$. For $X - Y - Z$, let $P_{XZ|YZ}$ be the conditional probability matrix of $(X,Z)$ given $(Y,Z)$, such that for an arbitrary probability distribution $Q_Z$ of $Z$, we have

Dα(PXZ‖PXQZ) ≤ Dα(PY Z‖PYQZ), (116)

since $P_{XZ}$ and $P_X Q_Z$ are generated by passing $P_{YZ}$ and $P_Y Q_Z$ through $P_{XZ|YZ}$, respectively. Let $Q^*_Z = \arg\min_{Q_Z} D_\alpha(P_{YZ}\|P_Y Q_Z)$; then,

$$I_\alpha(Y;Z) = \min_{Q_Z} D_\alpha(P_{YZ}\|P_Y Q_Z) = D_\alpha(P_{YZ}\|P_Y Q^*_Z) \qquad (117a)$$
$$\ge D_\alpha(P_{XZ}\|P_X Q^*_Z) \qquad (117b)$$
$$\ge \min_{Q_Z} D_\alpha(P_{XZ}\|P_X Q_Z) = I_\alpha(X;Z). \qquad (117c)$$

Therefore, the inequality in (110b) is proved.
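A numerical sanity check of (110a) is sketched below (illustrative only; the random alphabets and the helper `sibson_mi` are assumptions): a random Markov chain X − Y − Z is drawn and the Sibson mutual information is verified not to increase through the second channel.

```python
import numpy as np

def sibson_mi(p_x, W, alpha):
    # I_alpha^S(X;Y) = alpha/(alpha-1) * log( sum_y (sum_x p_x(x) W(y|x)^alpha)^(1/alpha) )
    return alpha / (alpha - 1) * np.log(((p_x @ W**alpha) ** (1 / alpha)).sum())

rng = np.random.default_rng(0)
alpha = 3.0

# Random Markov chain X - Y - Z on small alphabets.
p_x = rng.dirichlet(np.ones(3))
P_yx = rng.dirichlet(np.ones(4), size=3)      # rows: P_{Y|X=x}
P_zy = rng.dirichlet(np.ones(5), size=4)      # rows: P_{Z|Y=y}
P_zx = P_yx @ P_zy                            # induced P_{Z|X}

# Data processing inequality (110a): I_alpha(X;Z) <= I_alpha(X;Y).
print(sibson_mi(p_x, P_zx, alpha) <= sibson_mi(p_x, P_yx, alpha) + 1e-12)  # True
```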

F. Proof for Theorem 2

Proof. Let $\mathcal{Y}_1$ and $\mathcal{Y}_2$ be the supports of $Y_1$ and $Y_2$, respectively. For any $(y_1, y_2)\in\mathcal{Y}_1\times\mathcal{Y}_2$, due to the Markov chain $Y_1 - X - Y_2$, the corresponding entry of the conditional probability matrix of $(Y_1, Y_2)$ given $X$ is

$$P(y_1, y_2|x) = P(y_1|x)P(y_2|x, y_1) = P(y_1|x)P(y_2|x). \qquad (118)$$

Therefore, for α ∈ (1,∞)

$$\mathcal{L}_\alpha(X\to Y_1, Y_2) = \max_{P_X} \frac{\alpha}{\alpha-1}\log \sum_{(y_1,y_2)\in\mathcal{Y}_1\times\mathcal{Y}_2} \left(\sum_{x\in\mathcal{X}} P_X(x)\,P^\alpha_{Y_1,Y_2|X}(y_1,y_2|x)\right)^{\frac{1}{\alpha}} \qquad (119a)$$
$$= \max_{P_X} \frac{\alpha}{\alpha-1}\log \sum_{(y_1,y_2)\in\mathcal{Y}_1\times\mathcal{Y}_2} \left(\sum_{x\in\mathcal{X}} P_X(x)\,P^\alpha_{Y_1|X}(y_1|x)\,P^\alpha_{Y_2|X}(y_2|x)\right)^{\frac{1}{\alpha}}. \qquad (119b)$$

Let $K(y_1) = \sum_{x\in\mathcal{X}} P_X(x)P^\alpha_{Y_1|X}(y_1|x)$ for all $y_1\in\mathcal{Y}_1$, so that we can construct a set of distributions over $\mathcal{X}$ as
$$P_X(x|y_1) = \frac{P_X(x)\,P^\alpha_{Y_1|X}(y_1|x)}{K(y_1)}. \qquad (120)$$

Therefore, from (119b), Lα(X → Y1, Y2) can be rewritten as

$$\mathcal{L}_\alpha(X\to Y_1, Y_2)$$
$$= \max_{P_X} \frac{\alpha}{\alpha-1}\log \sum_{(y_1,y_2)\in\mathcal{Y}_1\times\mathcal{Y}_2} \left(\sum_{x\in\mathcal{X}} K(y_1)\,P_X(x|y_1)\,P^\alpha_{Y_2|X}(y_2|x)\right)^{\frac{1}{\alpha}} \qquad (121a)$$
$$= \max_{P_X} \frac{\alpha}{\alpha-1}\log \sum_{(y_1,y_2)\in\mathcal{Y}_1\times\mathcal{Y}_2} \left(\left(\sum_{x\in\mathcal{X}} P_X(x)P^\alpha_{Y_1|X}(y_1|x)\right)\left(\sum_{x\in\mathcal{X}} P_X(x|y_1)P^\alpha_{Y_2|X}(y_2|x)\right)\right)^{\frac{1}{\alpha}} \qquad (121b)$$
$$= \max_{P_X} \frac{\alpha}{\alpha-1}\log \sum_{y_1\in\mathcal{Y}_1}\left(\sum_{x\in\mathcal{X}} P_X(x)P^\alpha_{Y_1|X}(y_1|x)\right)^{\frac{1}{\alpha}} \sum_{y_2\in\mathcal{Y}_2}\left(\sum_{x\in\mathcal{X}} P_X(x|y_1)P^\alpha_{Y_2|X}(y_2|x)\right)^{\frac{1}{\alpha}} \qquad (121c)$$
$$\le \max_{P_X} \frac{\alpha}{\alpha-1}\log\left[\sum_{y_1\in\mathcal{Y}_1}\left(\sum_{x\in\mathcal{X}} P_X(x)P^\alpha_{Y_1|X}(y_1|x)\right)^{\frac{1}{\alpha}} \max_{y_1\in\mathcal{Y}_1}\sum_{y_2\in\mathcal{Y}_2}\left(\sum_{x\in\mathcal{X}} P_X(x|y_1)P^\alpha_{Y_2|X}(y_2|x)\right)^{\frac{1}{\alpha}}\right] \qquad (121d)$$
$$= \max_{P_X} \frac{\alpha}{\alpha-1}\log\left[\sum_{y_1\in\mathcal{Y}_1}\left(\sum_{x\in\mathcal{X}} P_X(x)P^\alpha_{Y_1|X}(y_1|x)\right)^{\frac{1}{\alpha}} \cdot \sum_{y_2\in\mathcal{Y}_2}\left(\sum_{x\in\mathcal{X}} P_X(x|y^*_1)P^\alpha_{Y_2|X}(y_2|x)\right)^{\frac{1}{\alpha}}\right] \qquad (121e)$$
$$\le \max_{P_X} \frac{\alpha}{\alpha-1}\log \sum_{y_1\in\mathcal{Y}_1}\left(\sum_{x\in\mathcal{X}} P_X(x)P^\alpha_{Y_1|X}(y_1|x)\right)^{\frac{1}{\alpha}} + \max_{P_X} \frac{\alpha}{\alpha-1}\log \sum_{y_2\in\mathcal{Y}_2}\left(\sum_{x\in\mathcal{X}} P_X(x)P^\alpha_{Y_2|X}(y_2|x)\right)^{\frac{1}{\alpha}} \qquad (121f)$$
$$= \mathcal{L}_\alpha(X\to Y_1) + \mathcal{L}_\alpha(X\to Y_2), \qquad (121g)$$

where $y^*_1$ in (121e) is the optimal $y_1$ achieving the maximum in (121d). Therefore, the equality in (121d) holds if and only if, for all $y_1\in\mathcal{Y}_1$,

$$\sum_{y_2\in\mathcal{Y}_2}\left(\sum_{x\in\mathcal{X}} P_X(x|y_1)P^\alpha_{Y_2|X}(y_2|x)\right)^{\frac{1}{\alpha}} = \sum_{y_2\in\mathcal{Y}_2}\left(\sum_{x\in\mathcal{X}} P_X(x|y^*_1)P^\alpha_{Y_2|X}(y_2|x)\right)^{\frac{1}{\alpha}}; \qquad (122)$$

and the equality in (121f) holds if and only if the optimal distributions $P^*_X$ and $P^{**}_X$ of the two corresponding maximizations in (121f) satisfy, for all $x\in\mathcal{X}$,
$$P^{**}_X(x) = \frac{P^*_X(x)\,P^\alpha_{Y_1|X}(y^*_1|x)}{\sum_{x\in\mathcal{X}} P^*_X(x)\,P^\alpha_{Y_1|X}(y^*_1|x)}. \qquad (123)$$

Now we consider $\alpha = 1$. For $Y_1 - X - Y_2$, we have
$$I(Y_2; X, Y_1) = I(Y_2;Y_1) + I(Y_2;X|Y_1) = I(Y_2;X) + I(Y_2;Y_1|X) = I(Y_2;X), \qquad (124)$$
where the last equality uses $I(Y_2;Y_1|X) = 0$ for the Markov chain $Y_1 - X - Y_2$. Since $I(Y_2;Y_1)\ge 0$, it follows that
$$I(Y_2;X|Y_1) \le I(Y_2;X). \qquad (125)$$
Thus, from Theorem 1 together with the chain rule $I(X;Y_1,Y_2) = I(X;Y_1) + I(X;Y_2|Y_1)$ and (125), we have for $\alpha = 1$,
$$\mathcal{L}_1(X\to Y_1, Y_2) = I(X;Y_1, Y_2) \le I(X;Y_1) + I(X;Y_2) = \mathcal{L}_1(X\to Y_1) + \mathcal{L}_1(X\to Y_2). \qquad (126)$$

For $\alpha = \infty$,
$$\mathcal{L}_\infty(X\to Y_1, Y_2) = \log\sum_{(y_1,y_2)\in\mathcal{Y}_1\times\mathcal{Y}_2} \max_{x\in\mathcal{X}} P(y_1|x)P(y_2|x) \qquad (127)$$
$$\le \log\sum_{(y_1,y_2)\in\mathcal{Y}_1\times\mathcal{Y}_2} \left(\max_{x\in\mathcal{X}} P(y_1|x)\right)\left(\max_{x\in\mathcal{X}} P(y_2|x)\right) \qquad (128)$$
$$= \log\sum_{y_1\in\mathcal{Y}_1}\max_{x\in\mathcal{X}} P(y_1|x) + \log\sum_{y_2\in\mathcal{Y}_2}\max_{x\in\mathcal{X}} P(y_2|x) \qquad (129)$$
$$= \mathcal{L}_\infty(X\to Y_1) + \mathcal{L}_\infty(X\to Y_2). \qquad (130)$$
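The composition bound (130) can also be checked numerically; the sketch below (with assumed random channels) builds the product channel induced by $Y_1 - X - Y_2$ and compares maximal leakages.

```python
import numpy as np

def max_leakage(W):
    # Maximal leakage L_inf(X -> Y) = log sum_y max_x W(y|x), for a channel matrix W (rows: x).
    return np.log(W.max(axis=0).sum())

rng = np.random.default_rng(1)
P_y1x = rng.dirichlet(np.ones(3), size=4)          # P_{Y1|X}: 4 inputs, 3 outputs
P_y2x = rng.dirichlet(np.ones(5), size=4)          # P_{Y2|X}: 4 inputs, 5 outputs

# Joint channel P_{Y1,Y2|X}(y1,y2|x) = P(y1|x)P(y2|x) under Y1 - X - Y2.
P_joint = np.einsum('xa,xb->xab', P_y1x, P_y2x).reshape(4, -1)

# Composition bound (130): L_inf(X -> Y1,Y2) <= L_inf(X -> Y1) + L_inf(X -> Y2).
print(max_leakage(P_joint) <= max_leakage(P_y1x) + max_leakage(P_y2x) + 1e-12)  # True
```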

G. Proof for Lemma 2

Proof. Let $\mathbf{q}$ be a probability distribution of $X$, so that the function $k_\alpha$ is expressed as
$$k_\alpha(\mathbf{q}, \mathbf{W}) = \sum_j\left(\sum_i q_i W_{ij}^\alpha\right)^{1/\alpha}, \qquad (131)$$

and the first partial derivative with respect to $W_{kl}$, for all $k, l$, is
$$\frac{\partial k_\alpha(\mathbf{q},\mathbf{W})}{\partial W_{kl}} = \left(\sum_i q_i W_{il}^\alpha\right)^{\frac{1}{\alpha}-1}\left(q_k W_{kl}^{\alpha-1}\right), \qquad (132)$$

while the second derivatives are
$$\frac{\partial^2 k_\alpha(\mathbf{q},\mathbf{W})}{\partial W_{kl}^2} = (\alpha-1)\,q_k W_{kl}^{\alpha-2}\left(\sum_{i\ne k} q_i W_{il}^\alpha\right)\left(\sum_i q_i W_{il}^\alpha\right)^{\frac{1}{\alpha}-2}, \qquad (133a)$$
$$\frac{\partial^2 k_\alpha(\mathbf{q},\mathbf{W})}{\partial W_{kl}\,\partial W_{ml}} = -(\alpha-1)\,q_k q_m\left(W_{kl}W_{ml}\right)^{\alpha-1}\left(\sum_i q_i W_{il}^\alpha\right)^{\frac{1}{\alpha}-2}, \quad m\ne k. \qquad (133b)$$


Therefore, at $\mathbf{W} = \mathbf{W}_0$ the derivatives simplify to
$$\left.\frac{\partial k_\alpha(\mathbf{q},\mathbf{W})}{\partial W_{kl}}\right|_{\mathbf{W}_0} = q_k, \qquad (134a)$$
$$\left.\frac{\partial^2 k_\alpha(\mathbf{q},\mathbf{W})}{\partial W_{kl}^2}\right|_{\mathbf{W}_0} = (\alpha-1)\,q_k(1-q_k)\,w_{0l}^{-1}, \qquad (134b)$$
$$\left.\frac{\partial^2 k_\alpha(\mathbf{q},\mathbf{W})}{\partial W_{kl}\,\partial W_{ml}}\right|_{\mathbf{W}_0} = -(\alpha-1)\,q_k q_m\,w_{0l}^{-1}, \quad m\ne k, \qquad (134c)$$

such that for the $\mathbf{W}$ in (25), the function $k_\alpha(\mathbf{q},\mathbf{W})$ can be expressed as
$$k_\alpha(\mathbf{q},\mathbf{W}) = k_\alpha(\mathbf{q},\mathbf{W}_0) + \sum_l\sum_k q_k\Theta_{kl} + \frac{1}{2}\sum_l\sum_k(\alpha-1)\,q_k(1-q_k)\,w_{0l}^{-1}\Theta_{kl}^2 - \frac{1}{2}\sum_l\sum_{(m,k):\,m\ne k}(\alpha-1)\,q_k q_m\,w_{0l}^{-1}\Theta_{kl}\Theta_{ml} + o\!\left(\Bigl(\max_{i,j}\Theta_{ij}\Bigr)^{2}\right) \qquad (135a)$$
$$= 1 + 0 + \frac{\alpha-1}{2}\left(\sum_l\sum_k\frac{q_k}{w_{0l}}\Theta_{kl}^2 - \sum_l\frac{\bigl(\sum_k q_k\Theta_{kl}\bigr)^2}{w_{0l}}\right) + o(\rho^2). \qquad (135b)$$

Using $\ln(1+x) = x + O(x^2)$ for $x$ around $0$, where $\ln$ denotes the natural logarithm, we obtain a quadratic approximation of $I^{\mathrm{S}}_\alpha(\mathbf{W})$ in the high-privacy regime, i.e., for $\mathbf{W}$ in $\mathcal{W}_{\mathrm{hp}}$ in (25), as

$$I^{\mathrm{S}}_\alpha(\mathbf{q},\mathbf{W}) \approx \frac{\alpha-1}{2}\left(\sum_l\sum_k\frac{q_k}{w_{0l}}\Theta_{kl}^2 - \sum_l\frac{\bigl(\sum_k q_k\Theta_{kl}\bigr)^2}{w_{0l}}\right)\ \text{(nats)}, \qquad (136)$$

such that for $\mathbf{W}$ belonging to $\mathcal{W}_{\mathrm{hp}}$, $\mathcal{L}_\alpha(\mathbf{W})$ can be expressed as (29).
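A rough numerical check of the quadratic approximation (136) is sketched below, under the assumptions (made here for illustration) that $\mathbf{W}_0$ has identical rows $w_0$ and that the perturbation $\Theta$ has zero row sums so that $\mathbf{W}$ stays row-stochastic.

```python
import numpy as np

def sibson_mi_nats(q, W, alpha):
    # Sibson mutual information in nats.
    return alpha / (alpha - 1) * np.log(((q @ W**alpha) ** (1 / alpha)).sum())

alpha, rho = 2.5, 1e-3
w0 = np.array([0.3, 0.7])                      # common row of W0 (assumed example)
q = np.array([0.4, 0.6])
Theta = rho * np.array([[1.0, -1.0],
                        [-1.0, 1.0]])          # zero row sums keep W row-stochastic
W = np.tile(w0, (2, 1)) + Theta

exact = sibson_mi_nats(q, W, alpha)
quad = (alpha - 1) / 2 * (
    ((q[:, None] * Theta**2) / w0).sum() - ((q @ Theta)**2 / w0).sum()
)   # right-hand side of (136)
print(exact, quad)   # the two values should agree up to o(rho^2)
```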

At $\mathbf{W} = \mathbf{I}$ the derivatives in (132) and (133) become
$$\left.\frac{\partial k_\alpha(\mathbf{q},\mathbf{W})}{\partial W_{kl}}\right|_{\mathbf{I}} = \begin{cases} q_k^{\frac{1}{\alpha}} & k = l,\\ 0 & k\ne l,\end{cases} \qquad (137a)$$
$$\left.\frac{\partial^2 k_\alpha(\mathbf{q},\mathbf{W})}{\partial W_{kl}^2}\right|_{\mathbf{I}} = \left.\frac{\partial^2 k_\alpha(\mathbf{q},\mathbf{W})}{\partial W_{kl}\,\partial W_{ml}}\right|_{\mathbf{I}} = 0, \qquad (137b)$$

such that for the W in (27), the function kα(q,W) can be expressed as

$$k_\alpha(\mathbf{q},\mathbf{W}) = \sum_i q_i^{\frac{1}{\alpha}} + \sum_i q_i^{\frac{1}{\alpha}}\Theta_{ii} + 0 + o(\rho^2) \qquad (138a)$$
$$= \sum_i q_i^{\frac{1}{\alpha}} W_{ii} + o(\rho^2), \qquad (138b)$$

and therefore, for W belonging to Whu, Lα(W) can be expressed as

$$\mathcal{L}_\alpha(\mathbf{W}) = \max_{\mathbf{q}} \frac{\alpha}{\alpha-1}\log\left(\sum_i q_i^{\frac{1}{\alpha}} W_{ii}\right) + o(\rho^2). \qquad (139)$$

For $\alpha > 1$, to attain the optimal solution $\mathbf{q}^*$ of the maximization in (139), it suffices to solve the following convex optimization problem:

$$\max_{\mathbf{q}}\ \sum_i q_i^{\frac{1}{\alpha}} W_{ii} \qquad (140a)$$
$$\text{s.t.}\ \sum_i q_i = 1, \qquad (140b)$$
$$\qquad q_i \ge 0 \ \text{for all } i. \qquad (140c)$$

Let µ be the Lagrange multiplier for the equality constraint in (140), such that the Lagrange function is

$$L(\mathbf{q},\mu) = \sum_i q_i^{\frac{1}{\alpha}} W_{ii} + \mu\left(\sum_i q_i - 1\right). \qquad (141)$$


From the KKT conditions, $\mathbf{q}^*$ sets the partial derivatives of the Lagrange function (141) to zero, i.e., for all $i$,

$$\left.\frac{\partial L(\mathbf{q},\mu)}{\partial q_i}\right|_{\mathbf{q}^*} = \frac{1}{\alpha}W_{ii}\,(q_i^*)^{\frac{1}{\alpha}-1} + \mu = 0. \qquad (142)$$

Combining with the equality constraint in (140), we attain the optimal solution as

$$q_i^* = \frac{W_{ii}^{\frac{\alpha}{\alpha-1}}}{\sum_i W_{ii}^{\frac{\alpha}{\alpha-1}}} \ \text{for all } i, \qquad (143)$$

such that the maximal α-leakage in (139) can be further simplified to (30).
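The closed form (143) can be verified numerically; the sketch below compares it against a direct maximization of (140) for an assumed diagonal $W_{ii}$ and $\alpha$ (the softmax parametrization of the simplex is only one convenient choice).

```python
import numpy as np
from scipy.optimize import minimize

alpha = 2.0
W_diag = np.array([0.9, 0.8, 0.95])   # assumed example values of W_ii

# Closed form (143).
q_closed = W_diag**(alpha / (alpha - 1))
q_closed /= q_closed.sum()

# Direct numerical maximization of (140a) over the simplex (softmax parametrization).
def neg_obj(z):
    q = np.exp(z) / np.exp(z).sum()
    return -(q**(1 / alpha) * W_diag).sum()

q_num = np.exp(minimize(neg_obj, np.zeros(3)).x)
q_num /= q_num.sum()
print(np.allclose(q_closed, q_num, atol=1e-4))   # True
```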

APPENDIX B
PROOFS FOR SECTION IV

A. Proof for Theorem 3

Proof. Define the convex function

$$f_\alpha(t) = \frac{1}{\alpha-1}(t^\alpha - 1); \qquad (144)$$

then the corresponding $f$-divergence $D_{f_\alpha}(P\|Q)$, which is the Hellinger divergence of order $\alpha$ [73], is given by

$$D_{f_\alpha}(P\|Q) = \frac{1}{\alpha-1}\left[\int (dP)^\alpha (dQ)^{1-\alpha} - 1\right]. \qquad (145)$$

Therefore, the Renyi divergence can be written in terms of the Hellinger divergence as

$$D_\alpha(P\|Q) = \frac{1}{\alpha-1}\log\bigl(1 + (\alpha-1)D_{f_\alpha}(P\|Q)\bigr). \qquad (146)$$
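The identity (146) is easy to check numerically for discrete distributions; the sketch below uses assumed example distributions.

```python
import numpy as np

alpha = 1.7
P = np.array([0.2, 0.5, 0.3])   # assumed example distributions
Q = np.array([0.4, 0.4, 0.2])

renyi = np.log((P**alpha * Q**(1 - alpha)).sum()) / (alpha - 1)          # D_alpha(P||Q)
hellinger = ((P**alpha * Q**(1 - alpha)).sum() - 1) / (alpha - 1)        # D_{f_alpha}(P||Q), as in (145)
print(np.isclose(renyi, np.log(1 + (alpha - 1) * hellinger) / (alpha - 1)))  # True, as in (146)
```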

By the definition of Sibson mutual information [63, Def. 4], the maximal α-leakage can be written as

$$\mathcal{L}_\alpha(X\to Y) = \sup_{P_X} I^{\mathrm{S}}_\alpha(X;Y) = \sup_{P_X}\inf_{Q_Y} D_\alpha(P_{XY}\|P_X\times Q_Y), \qquad (147)$$

where $D_\alpha$ is the Rényi divergence of order $\alpha$. Since $z\mapsto \frac{1}{\alpha-1}\log(1+(\alpha-1)z)$ is monotonically increasing in $z$ for $\alpha>1$,

we can write
$$\mathcal{L}_\alpha(X\to Y) = \frac{1}{\alpha-1}\log\bigl(1 + (\alpha-1)\mathcal{L}_{f_\alpha}(X\to Y)\bigr). \qquad (148)$$

That is, maximal α-leakage is a monotonic function of the $f$-divergence-based leakage defined in (37).

B. Proof for Theorem 3

Proof: We have

$$\mathrm{PUT}_{\mathrm{HD},f}(D) = \inf_{P_{Y|X}:\,d(X,Y)\le D}\ \sup_{P_X}\ \inf_{Q_Y}\ D_f(P_{XY}\|P_X\times Q_Y) \qquad (149)$$
$$= \inf_{Q_Y}\ \sup_{P_X}\ \inf_{P_{Y|X}:\,d(X,Y)\le D}\ D_f(P_{XY}\|P_X\times Q_Y) \qquad (150)$$
$$= \inf_{Q_Y}\ \sup_{P_X}\ \inf_{P_{Y|X}:\,d(X,Y)\le D}\ \int dP_X(x)\, D_f(P_{Y|X=x}\|Q_Y) \qquad (151)$$
$$= \inf_{Q_Y}\ \sup_{P_X}\ \int dP_X(x)\inf_{P_{Y|X=x}:\,d(x,Y)\le D} D_f(P_{Y|X=x}\|Q_Y) \qquad (152)$$
$$= \inf_{Q_Y}\ \sup_{x}\ \inf_{P_{Y|X=x}:\,Y\in B_D(x)}\ \int dQ_Y\, f\!\left(\frac{dP_{Y|X=x}}{dQ_Y}\right) \qquad (153)$$
$$= \inf_{Q_Y}\ \sup_{x}\ \inf_{P_{Y|X=x}:\,Y\in B_D(x)}\left[\int_{B_D(x)} dQ_Y\, f\!\left(\frac{dP_{Y|X=x}}{dQ_Y}\right) + \int_{B_D(x)^c} dQ_Y\, f(0)\right] \qquad (154)$$
$$= \inf_{Q_Y}\ \sup_{x}\ \inf_{P_{Y|X=x}:\,Y\in B_D(x)}\left[Q_Y(B_D(x))\int_{B_D(x)} \frac{dQ_Y}{Q_Y(B_D(x))}\, f\!\left(\frac{dP_{Y|X=x}}{dQ_Y}\right) + Q_Y(B_D(x)^c)\,f(0)\right] \qquad (155)$$
$$= \inf_{Q_Y}\ \sup_{x}\left[Q_Y(B_D(x))\, f\!\left(Q_Y(B_D(x))^{-1}\right) + \bigl(1 - Q_Y(B_D(x))\bigr)f(0)\right] \qquad (156)$$
$$= \inf_{Q_Y}\ \sup_{x}\ g\bigl(Q_Y(B_D(x))\bigr), \qquad (157)$$
where
• (150) follows from the fact that $D_f(P_{XY}\|P_X\times Q_Y)$ is linear in $P_X$ for fixed $(P_{Y|X}, Q_Y)$ and convex in $(P_{Y|X}, Q_Y)$ for fixed $P_X$,
• (156) follows from the convexity of $f$ and Jensen's inequality,
• in (157) we have defined

g(q) = qf(q−1) + (1− q)f(0). (158)

Note that a mechanism will achieve equality in (156) if

$$\frac{dP_{Y|X=x}}{dQ_Y}(y) = \frac{\mathbf{1}(d(x,y)\le D)}{Q_Y(B_D(x))}. \qquad (159)$$

To simplify (157), we claim g is non-increasing. Indeed,

g′(q) = f(q−1)− q−1f ′(q−1)− f(0). (160)

Since $f$ is convex, for all $s, t$ we have
$$f(t) - f(s) \le (t-s)f'(t). \qquad (161)$$

Setting $t = q^{-1}$ and $s = 0$, this gives
$$f(q^{-1}) - f(0) \le q^{-1}f'(q^{-1}), \qquad (162)$$

from which we conclude $g'(q) \le 0$. Therefore,
$$\mathrm{PUT}_{\mathrm{HD},f}(D) = g(q^\star), \qquad (163)$$

and the optimal mechanism is given by (44).
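For a concrete (assumed) choice $f = f_\alpha$ from (144), the sketch below evaluates $g$ in (158) on a grid and confirms it is non-increasing, consistent with the argument leading to (163).

```python
import numpy as np

alpha = 2.0
f = lambda t: (t**alpha - 1) / (alpha - 1)          # Hellinger-type f from (144)

q = np.linspace(0.05, 1.0, 200)
g = q * f(1.0 / q) + (1.0 - q) * f(0.0)             # g(q) from (158)
print(np.all(np.diff(g) <= 1e-12))                  # True: g is non-increasing
```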

C. Proof for Theorem 4

Proof. For the PUT problem in (50), we analyze the optimal mechanisms for the different parameter triples $(p, D, \alpha)$ as follows.

1. If $D = 0$, the only feasible solution is $\rho_1 = \rho_2 = 0$, so the optimal mechanism is the identity matrix, with optimal value $1$ for $\alpha > 1$ and $H(p)$ for $\alpha = 1$. In addition, if $D = 0$, $\alpha = 1$ for all $p\in[0,1]$.

2. If $D \ge \min\{p, 1-p\}$, for all $\alpha\in[1,\infty]$ the optimal value is $0$,^9 with the optimal mechanism being either
$$\begin{bmatrix} 0 & 1\\ 0 & 1\end{bmatrix} \ \text{for } p\ge 0.5 \quad\text{or}\quad \begin{bmatrix} 1 & 0\\ 1 & 0\end{bmatrix} \ \text{for } p\le 0.5. \qquad (164)$$

Thus, if $D \ge \min\{p, 1-p\}$, $\alpha = 1$ for all $p\in[0,1]$. Note that for $p = 0$ or $p = 1$, $D\ge\min\{p,1-p\} \Leftrightarrow D\ge 0$, so that if $p = 0, 1$, $\alpha = 1$ for all $D\ge 0$; for $p = 0.5$, any rank-1 row-stochastic matrix is optimal.

3. For $0 < D < \min\{p, 1-p\}$ (so $p\in(0,1)$):
a. If $\alpha = 1$, referring to the solution of the rate-distortion problem in [70, Thm. 10.3.1], the optimal value is $H(p) - H(D)$ with the optimal mechanism
$$\begin{bmatrix} \dfrac{(1-p-D)(1-D)}{(1-2D)(1-p)} & \dfrac{(p-D)D}{(1-2D)(1-p)}\\[2mm] \dfrac{(1-p-D)D}{(1-2D)p} & \dfrac{(p-D)(1-D)}{(1-2D)p}\end{bmatrix}. \qquad (165)$$

It can be easily verified that all four entries in (165) are non-zero.
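As a numerical check (with assumed values $p = 0.3$, $D = 0.1$), the sketch below verifies that the mechanism in (165) meets the Hamming distortion constraint with equality and achieves mutual information $H(p) - H(D)$.

```python
import numpy as np

p, D = 0.3, 0.1
# Mechanism (165); row i is P_{Y|X=i}, with P(X=0) = 1-p.
W = np.array([[(1-p-D)*(1-D), (p-D)*D],
              [(1-p-D)*D,     (p-D)*(1-D)]]) / ((1-2*D) * np.array([[1-p], [p]]))
px = np.array([1-p, p])

H = lambda t: -t*np.log2(t) - (1-t)*np.log2(1-t)    # binary entropy in bits
pxy = px[:, None] * W                               # joint distribution P_{XY}
py = pxy.sum(axis=0)
mi = (pxy * np.log2(pxy / (px[:, None] * py))).sum()

print(np.isclose(mi, H(p) - H(D)))                  # True
print(np.isclose(pxy[0, 1] + pxy[1, 0], D))         # expected Hamming distortion equals D
```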

b. If $\alpha = \infty$, the problem in (50) is a linear program, and it can be proved that the optimal value is
$$\log\left(2 - \frac{D}{\min\{1-p,\ p\}}\right), \qquad (166)$$

with the optimal mechanism being either
$$\begin{bmatrix} 1 & 0\\ \frac{D}{p} & \frac{p-D}{p}\end{bmatrix} \ \text{for } p\le 0.5 \quad\text{or}\quad \begin{bmatrix} \frac{1-p-D}{1-p} & \frac{D}{1-p}\\ 0 & 1\end{bmatrix} \ \text{for } p\ge 0.5. \qquad (167)$$

It is observed that for $\alpha = \infty$ the optimal mechanisms in (167) have at least one zero entry, and therefore at least one input symbol is fully revealed by an output symbol. We refer to such mechanisms as Z-mechanisms.
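The sketch below (an assumed $p \le 0.5$ example) confirms that the Z-mechanism in (167) attains the value in (166).

```python
import numpy as np

p, D = 0.3, 0.1
W = np.array([[1.0,    0.0],
              [D / p, (p - D) / p]])               # Z-mechanism from (167) for p <= 0.5
max_leakage = np.log(W.max(axis=0).sum())          # L_inf = log sum_y max_x W(y|x)
print(np.isclose(max_leakage, np.log(2 - D / min(p, 1 - p))))   # True, matching (166)
```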

^9 Sibson mutual information is non-negative, so that maximal α-leakage is non-negative for α ∈ [1,∞].


c. For $\alpha\in(1,\infty)$, referring to Corollary 2, to attain the optimal solution of the problem in (50) it suffices to solve the following convex optimization problem:

$$\min_{\rho_1,\rho_2}\ \max_{q\in[0,1]}\ \bigl((1-q)(1-\rho_1)^\alpha + q\rho_2^\alpha\bigr)^{\frac{1}{\alpha}} + \bigl((1-q)\rho_1^\alpha + q(1-\rho_2)^\alpha\bigr)^{\frac{1}{\alpha}} \qquad (168a)$$
$$\text{s.t.}\ (1-p)\rho_1 + p\rho_2 \le D, \qquad (168b)$$
$$\qquad \rho_1, \rho_2 \ge 0, \qquad (168c)$$

where the simplification of (50c) to (168c) is justified by the fact that, for $D < \min\{p, 1-p\}$, the inequality in (168b) implies $\rho_1 + \rho_2 < 1$. Referring to Definition 3, the objective function in (168a) is concave in $q$; using the Karush–Kuhn–Tucker (KKT) conditions, we obtain the optimal value of the maximization in (168a) as

$$\bigl((1-\rho_1)^\alpha(1-\rho_2)^\alpha - \rho_1^\alpha\rho_2^\alpha\bigr)^{\frac{1}{\alpha}}\left(\bigl((1-\rho_1)^\alpha - \rho_2^\alpha\bigr)^{\frac{1}{1-\alpha}} + \bigl((1-\rho_2)^\alpha - \rho_1^\alpha\bigr)^{\frac{1}{1-\alpha}}\right)^{\frac{\alpha-1}{\alpha}}, \qquad (169)$$

which we denote by $f(\rho_1,\rho_2)$. For simpler computation, let $g(\rho_1,\rho_2) \triangleq (f(\rho_1,\rho_2))^{\frac{\alpha}{\alpha-1}}$ for $\alpha\in(1,\infty)$, so that solving (168) is equivalent to solving

$$\min_{\rho_1,\rho_2}\ g(\rho_1,\rho_2) = \bigl((1-\rho_1)^\alpha(1-\rho_2)^\alpha - \rho_1^\alpha\rho_2^\alpha\bigr)^{\frac{1}{\alpha-1}}\left(\bigl((1-\rho_1)^\alpha - \rho_2^\alpha\bigr)^{\frac{1}{1-\alpha}} + \bigl((1-\rho_2)^\alpha - \rho_1^\alpha\bigr)^{\frac{1}{1-\alpha}}\right) \qquad (170a)$$
$$\text{s.t.}\ (1-p)\rho_1 + p\rho_2 \le D, \qquad (170b)$$
$$\qquad \rho_1, \rho_2 \ge 0. \qquad (170c)$$

Let $(\rho_1^*, \rho_2^*)$ denote the optimal solution of (170). In this convex program:

(1) $g(\rho_1,\rho_2)$ is symmetric and convex in $(\rho_1,\rho_2)$, so that $g(\rho_1,\rho_2)$ is Schur convex in $(\rho_1,\rho_2)$. For a Schur convex function, the minimum is attained when all variables are equal. Therefore, for $p = 0.5$, $\rho_1^* = \rho_2^* = D$. In addition, if $p = 0.5$, for $\alpha\in(1,\infty)$ the optimal solution is invariant in $\alpha$ for all $D\in[0, 0.5]$.

(2) The convex function $g(\rho_1,\rho_2)$ attains its minimal value $0$ if and only if $\rho_1 + \rho_2 = 1$ and its maximal value $2$ if and only if $(\rho_1,\rho_2) = (0,0)$ or $(1,1)$, so that $g(\rho_1,\rho_2)$ is monotonically decreasing in $\rho_1, \rho_2$ over the feasible region determined by (170b)-(170c). Therefore, the optimal solution of (170) attains the constraint (170b) with equality. Thus, $(1-p)\rho_1^* + p\rho_2^* = D$, and solving the problem in (170) is equivalent to solving the one-variable problem

$$\min_{\rho_1}\ g\!\left(\rho_1, \frac{D-(1-p)\rho_1}{p}\right) \qquad (171a)$$
$$\text{s.t.}\ 0 \le \rho_1 \le \frac{D}{1-p}, \qquad (171b)$$

where the condition (171b) follows from (170c). Therefore, $\rho_1^* = 0$ (resp. $\rho_1^* = \frac{D}{1-p}$) if and only if the convex function $g\bigl(\rho_1, \frac{D-(1-p)\rho_1}{p}\bigr)$ is monotonically increasing (resp. decreasing) on $\bigl[0, \frac{D}{1-p}\bigr]$; a sufficient and necessary condition is that its derivative at $\rho_1 = 0$ (resp. $\rho_1 = \frac{D}{1-p}$) is non-negative (resp. non-positive), i.e., the condition in (a.) (resp. (b.)) of (55). Due to

$$g\!\left(0, \frac{D}{p}\right) = 1 + \left(\frac{\bigl(1-\frac{D}{p}\bigr)^\alpha}{1-\bigl(\frac{D}{p}\bigr)^\alpha}\right)^{\frac{1}{\alpha-1}}, \qquad (172a)$$
$$g\!\left(\frac{D}{1-p}, 0\right) = 1 + \left(\frac{\bigl(1-\frac{D}{1-p}\bigr)^\alpha}{1-\bigl(\frac{D}{1-p}\bigr)^\alpha}\right)^{\frac{1}{\alpha-1}}, \qquad (172b)$$

we consider the function $k(x) = \frac{(1-x)^\alpha}{1-x^\alpha}$. Since $k(x)$ is monotonically decreasing on $[0,1]$ for $\alpha > 1$, we obtain a necessary condition for $\rho_1^* = 0$ (resp. $\rho_1^* = \frac{D}{1-p}$), namely $p < 0.5$ (resp. $p > 0.5$).

(3) $g(\rho_1,\rho_2)$ is symmetric in $(\rho_1,\rho_2)$, so that for any given $D$, if $p < 0.5$ then $\rho_2^* > \rho_1^*$, and if $p > 0.5$ then $\rho_1^* > \rho_2^*$. That is, in the optimization problem (171), if $p < 0.5$ (resp. $p > 0.5$), the optimal solution satisfies $0\le\rho_1^* < D$ (resp. $D < \rho_1^* \le \frac{D}{1-p}$). For the convex optimization problem in (171), let $\lambda_1, \lambda_2 \ge 0$ be the Lagrange multipliers, such that the Lagrange function is


that the Lagrange function is

L(ρ1, λ1, λ2) = g

(ρ1,

D − (1− p)ρ1p

)+ λ1(0− ρ1) + λ2

(ρ1 −

D

1− p

). (173)

Let $\lambda_1^*, \lambda_2^*$ denote the optimal solutions of the dual of (171). From the KKT conditions, the optimal $\rho_1^*, \lambda_1^*, \lambda_2^*$ satisfy

$$\left.\frac{dL(\rho_1,\lambda_1,\lambda_2)}{d\rho_1}\right|_{\rho_1^*} = \left.\frac{d\,g\bigl(\rho_1, \frac{D-(1-p)\rho_1}{p}\bigr)}{d\rho_1}\right|_{\rho_1^*} - \lambda_1 + \lambda_2 = 0, \qquad (174a)$$
$$\lambda_1^*(0 - \rho_1^*) = 0, \qquad (174b)$$
$$\lambda_2^*\left(\rho_1^* - \frac{D}{1-p}\right) = 0. \qquad (174c)$$

Therefore, if $\rho_1^* \ne 0, \frac{D}{1-p}$, we have $\lambda_1^* = \lambda_2^* = 0$, so that the optimal solution $\rho_1^*$ satisfies (56).

Therefore, if $p = 0.5$, for $\alpha\in(1,\infty)$ the optimal mechanism is
$$\begin{bmatrix} 1-D & D\\ D & 1-D\end{bmatrix}, \qquad (175)$$

which is invariant in $\alpha$ for all $D\in[0,0.5]$; and for $p < 0.5$ (resp. $p > 0.5$), the optimal mechanism is
$$\begin{bmatrix} 1 & 0\\ \frac{D}{p} & \frac{p-D}{p}\end{bmatrix} \quad\left(\text{resp. } \begin{bmatrix} \frac{1-p-D}{1-p} & \frac{D}{1-p}\\ 0 & 1\end{bmatrix}\right), \qquad (176)$$

and it remains the same as long as $\alpha$ satisfies condition (a.) (resp. (b.)) in (55).
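To illustrate case (c), the sketch below numerically solves the one-variable problem (171) for an assumed triple $(p, D, \alpha)$ and checks the location of $\rho_1^*$ described in item (3); the SciPy call is only one possible way to perform the minimization.

```python
import numpy as np
from scipy.optimize import minimize_scalar

p, D, alpha = 0.3, 0.1, 2.0      # assumed example triple

def g(r1, r2):
    # g(rho1, rho2) as in (170a)
    A = ((1 - r1)**alpha * (1 - r2)**alpha - r1**alpha * r2**alpha) ** (1 / (alpha - 1))
    B = ((1 - r1)**alpha - r2**alpha) ** (1 / (1 - alpha)) \
        + ((1 - r2)**alpha - r1**alpha) ** (1 / (1 - alpha))
    return A * B

# One-variable objective of (171): rho2 is eliminated via (1-p) rho1 + p rho2 = D.
obj = lambda r1: g(r1, (D - (1 - p) * r1) / p)
res = minimize_scalar(obj, bounds=(0.0, D / (1 - p)), method='bounded')
print(0.0 <= res.x < D)          # expected True for p < 0.5, per item (3)
```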

ACKNOWLEDGMENT

The authors would like to thank Prof. Vincent Y. F. Tan from National University of Singapore for many valuable discussions.

REFERENCES

[1] National Science and Technology Council Networking and Information Technology Research and Development Program, "National privacy research strategy," Executive Office of the President of The United States, Tech. Rep., June 2016.
[2] T. Dalenius, "Finding a needle in a haystack - or identifying anonymous census records," J. Official Stats., vol. 2, no. 3, pp. 329–336, 1986.
[3] A. Dobra, S. Fienberg, and M. Trottini, Assessing the Risk of Disclosure of Confidential Categorical Data. Oxford University Press, 2000, vol. 7, pp. 125–144.
[4] S. E. Fienberg, "Datamining and disclosure limitation for categorical statistical databases," in Proc. 4th IEEE Intl Conf. Workshop on Privacy and Security Aspects of Data Mining. Nova Science Publishing, 2004, pp. 1–12.
[5] D. B. Rubin, "Discussion: Statistical disclosure limitation," J. Official Stats., vol. 9, no. 2, pp. 461–468, 1993.
[6] J. Domingo-Ferrer, A. Oganian, and V. Torra, "Information-theoretic disclosure risk measures in statistical disclosure control of tabular data," in Proc. 14th Intl. Conf. Scientific and Statistical Database Management. IEEE Computer Society, 2002, pp. 227–231.
[7] A. Hundepool, J. Domingo-Ferrer, L. Franconi, S. Giessing, E. S. Nordholt, K. Spicer, and P.-P. De Wolf, Statistical disclosure control. John Wiley & Sons, 2012.
[8] L. Sweeney, "k-anonymity: A model for protecting privacy," Intl. J. Uncertainty, Fuzziness, and Knowledge-based Systems, vol. 10, no. 5, pp. 557–570, 2002.
[9] P. Samarati and L. Sweeney, "Generalizing data to provide anonymity when disclosing information (abstract)," in Proc. Prin. Database Sys., Seattle, Washington, May 1998.
[10] ——, "Protecting privacy when disclosing information: k-anonymity and its enforcement through generalization and specialization," in Technical Report SRI-CSL-98-04, SRI Intl., 1998.
[11] D. Agrawal and C. Aggarwal, "On the design and quantification of privacy preserving data mining algorithms," in Proc. 20th Symp. Principles of Database Systems, Santa Barbara, CA, May 2001.
[12] G. Aggarwal, T. Feder, K. Kenthapadi, S. Khuller, R. Panigraphy, D. Thomas, and A. Zhu, "Achieving anonymity via clustering," in Proc. Symp. Principles Database Sys., Dallas, TX, Jun. 2006.
[13] C. C. Aggarwal, "On k-anonymity and the curse of dimensionality," in Proceedings of the 31st International Conference on Very Large Data Bases. ACM, 2005, pp. 901–909.
[14] C. Dwork, "Differential privacy," in Proc. 33rd Intl. Colloq. Automata, Lang., Prog., Venice, Italy, Jul. 2006.
[15] ——, "Differential privacy: A survey of results," in Theory and Applications of Models of Computation: Lecture Notes in Computer Science. New York: Springer, Apr. 2008.
[16] C. Dwork and A. Roth, "The algorithmic foundations of differential privacy," Found. Trends Theor. Comput. Sci., vol. 9, no. 3–4, pp. 211–407, Aug. 2014. [Online]. Available: http://dx.doi.org/10.1561/0400000042
[17] H. Yamamoto, "A source coding problem for sources with additional outputs to keep secret from the receiver or wiretappers," IEEE Trans. Inform. Theory, vol. 29, no. 6, pp. 918–923, Nov. 1983.
[18] D. Rebollo-Monedero, J. Forne, and J. Domingo-Ferrer, "From t-closeness-like privacy to postrandomization via information theory," IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 11, pp. 1623–1636, Nov. 2010.
[19] D. Varodayan and A. Khisti, "Smart meter privacy using a rechargeable battery: Minimizing the rate of information leakage," in 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2011, pp. 1932–1935.
[20] L. Sankar, S. K. Kar, R. Tandon, and H. V. Poor, "Competitive privacy in the smart grid: An information-theoretic approach," in Smart Grid Communications, Brussels, Belgium, Oct. 2011.
[21] L. Sankar, S. R. Rajagopalan, and H. V. Poor, "Utility-privacy tradeoffs in databases: An information-theoretic approach," IEEE Transactions on Information Forensics and Security, vol. 8, no. 6, pp. 838–852, 2013.
[22] L. Sankar, S. R. Rajagopalan, S. Mohajer, and H. V. Poor, "Smart meter privacy: A theoretical framework," IEEE Transactions on Smart Grid, vol. 4, no. 2, pp. 837–846, 2013.
[23] J. Liao, L. Sankar, V. Y. F. Tan, and F. du Pin Calmon, "Hypothesis testing in the high privacy limit," in Allerton Conference 2016, Monticello, IL, 2016.
[24] F. P. Calmon, M. Varia, and M. Medard, "On information-theoretic metrics for symmetric-key encryption and privacy," in Proc. 52nd Annual Allerton Conf. on Commun., Control, and Comput., 2014.
[25] F. P. Calmon, A. Makhdoumi, and M. Medard, "Fundamental limits of perfect privacy," in 2015 IEEE International Symposium on Information Theory (ISIT), June 2015, pp. 1796–1800.
[26] S. Asoodeh, F. Alajaji, and T. Linder, "Notes on information-theoretic privacy," in 2014 52nd Annual Allerton Conference on Communication, Control, and Computing (Allerton), Sept 2014, pp. 1272–1278.
[27] F. P. Calmon, A. Makhdoumi, and M. Medard, "Fundamental limits of perfect privacy," in Proc. International Symp. on Info. Theory, 2015.
[28] Y. O. Basciftci, Y. Wang, and P. Ishwar, "On privacy-utility tradeoffs for constrained data release mechanisms," in 2016 Information Theory and Applications Workshop (ITA), Jan 2016, pp. 1–6.
[29] K. Kalantari, O. Kosut, and L. Sankar, "On the fine asymptotics of information theoretic privacy," in 2016 54th Annual Allerton Conference on Communication, Control, and Computing (Allerton), Sept 2016, pp. 532–539.
[30] K. Kalantari, L. Sankar, and O. Kosut, "On information-theoretic privacy with general distortion cost functions," in 2017 IEEE International Symposium on Information Theory (ISIT), June 2017, pp. 2865–2869.
[31] K. Kalantari, O. Kosut, and L. Sankar, "Information-theoretic privacy with general distortion constraints," Aug. 2017, arXiv:1708.05468.
[32] S. Asoodeh, F. Alajaji, and T. Linder, "On maximal correlation, mutual information and data privacy," in Information Theory (CWIT), 2015 IEEE 14th Canadian Workshop on, July 2015, pp. 27–31.
[33] S. Asoodeh, M. Diaz, F. Alajaji, and T. Linder, "Information extraction under privacy constraints," Information, vol. 7, no. 1, p. 15, 2016.
[34] S. Asoodeh, F. Alajaji, and T. Linder, "Privacy-aware MMSE estimation," in 2016 IEEE International Symposium on Information Theory (ISIT), July 2016, pp. 1989–1993.
[35] B. Moraffah and L. Sankar, "Privacy-guaranteed two-agent interactions using information-theoretic mechanisms," IEEE Transactions on Information Forensics and Security, vol. 12, no. 9, pp. 2168–2183, Sept 2017.
[36] I. Issa, S. Kamath, and A. B. Wagner, "An operational measure of information leakage," in 2016 Annual Conference on Information Science and Systems, CISS 2016, Princeton, NJ, USA, March 16-18, 2016, 2016, pp. 234–239. [Online]. Available: http://dx.doi.org/10.1109/CISS.2016.7460507
[37] J. Duchi, M. Jordan, and M. Wainwright, "Local privacy and statistical minimax rates," in Foundations of Computer Science (FOCS), 2013 IEEE 54th Annual Symposium on. IEEE, 2013, pp. 429–438.
[38] I. Issa and A. B. Wagner, "Operational definitions for some common information leakage metrics," in 2017 IEEE International Symposium on Information Theory (ISIT), June 2017, pp. 769–773.
[39] S. E. Fienberg, A. Rinaldo, and X. Yang, Differential Privacy and the Risk-Utility Tradeoff for Multi-dimensional Contingency Tables. Berlin, Heidelberg: Springer Berlin Heidelberg, 2010, pp. 187–199. [Online]. Available: https://doi.org/10.1007/978-3-642-15838-4_17
[40] Y. Wang, J. Lee, and D. Kifer, "Differentially private hypothesis testing, revisited," arXiv preprint arXiv:1511.03376, 2015.
[41] C. Uhler, A. Slavkovic, and S. E. Fienberg, "Privacy-preserving data sharing for genome-wide association studies," The Journal of Privacy and Confidentiality, vol. 5, no. 1, p. 137, 2013.
[42] F. Yu, S. E. Fienberg, A. B. Slavkovic, and C. Uhler, "Scalable privacy-preserving data sharing methodology for genome-wide association studies," Journal of Biomedical Informatics, vol. 50, pp. 133–141, 2014.
[43] V. Karwa and A. Slavkovic, "Inference using noisy degrees: Differentially private β-model and synthetic graphs," The Annals of Statistics, vol. 44, no. 1, pp. 87–112, 2016.
[44] J. Duchi, M. J. Wainwright, and M. I. Jordan, "Local privacy and minimax bounds: Sharp rates for probability estimation," in Advances in Neural Information Processing Systems, 2013, pp. 1529–1537.
[45] J. Duchi, M. Wainwright, and M. Jordan, "Minimax optimal procedures for locally private estimation," arXiv preprint arXiv:1604.02390, 2016.
[46] P. Kairouz, K. Bonawitz, and D. Ramage, "Discrete distribution estimation under local privacy," in Proceedings of the 33rd International Conference on International Conference on Machine Learning - Volume 48, ser. ICML'16. JMLR.org, 2016, pp. 2436–2444. [Online]. Available: http://dl.acm.org/citation.cfm?id=3045390.3045647
[47] M. Ye and A. Barg, "Optimal schemes for discrete distribution estimation under local differential privacy," in 2017 IEEE International Symposium on Information Theory (ISIT), June 2017, pp. 759–763.
[48] T. Steinke and J. Ullman, "Between pure and approximate differential privacy," Journal of Privacy and Confidentiality, vol. 7, no. 2, 2017.
[49] R. Rogers, A. Roth, A. Smith, and O. Thakkar, "Max-information, differential privacy, and post-selection hypothesis testing," in Foundations of Computer Science (FOCS), 2016 IEEE 57th Annual Symposium on. IEEE, 2016, pp. 487–494.
[50] J. Liao, L. Sankar, V. Y. Tan, and F. P. Calmon, "Hypothesis testing under maximal leakage privacy constraints," in IEEE Int. Sym. on Inf. Theory (ISIT), 2017.
[51] Z. Montazeri, A. Houmansadr, and H. Pishro-Nik, "Achieving perfect location privacy in Markov models using anonymization," in International Symposium on Information Theory and Its Applications, 2016.
[52] P. Kairouz, K. Bonawitz, and D. Ramage, "Discrete distribution estimation under local privacy," arXiv:1602.07387 [stat.ML], 2016.
[53] S. Asoodeh, M. Diaz, F. Alajaji, and T. Linder, "Estimation efficiency under privacy constraints," arXiv:1707.02409 [cs.IT], 2017.
[54] M. Hayashi, "Exponential decreasing rate of leaked information in universal random privacy amplification," IEEE Transactions on Information Theory, vol. 57, no. 6, pp. 3989–4001, 2011.
[55] L. Sankar, S. R. Rajagopalan, and H. V. Poor, "Utility-privacy tradeoffs in databases: An information-theoretic approach," IEEE Trans. on Inform. For. and Sec., vol. 8, no. 6, pp. 838–852, 2013.
[56] P. Kairouz, S. Oh, and P. Viswanath, "Extremal mechanisms for local differential privacy," in Advances in Neural Information Processing Systems, 2014.
[57] M. Gaboardi, R. Rogers, and S. Vadhan, "Differentially private chi-squared hypothesis testing: Goodness of fit and independence testing," arXiv:1602.03090 [math.ST], 2016.
[58] F. du Pin Calmon and N. Fawaz, "Privacy against statistical inference," in 50th Annual Allerton Conference on Communication, Control, and Computing, 2012.
[59] N. Merhav and M. Feder, "Universal prediction," IEEE Transactions on Information Theory, vol. 44, no. 6, pp. 2124–2147, Oct 1998.
[60] T. A. Courtade and R. D. Wesel, "Multiterminal source coding with an entropy-based distortion measure," in IEEE International Symposium on Information Theory Proceedings, July 2011, pp. 2040–2044.
[61] T. A. Courtade and T. Weissman, "Multiterminal source coding under logarithmic loss," IEEE Transactions on Information Theory, vol. 60, no. 1, pp. 740–761, Jan 2014.
[62] I. Issa, S. Kamath, and A. B. Wagner, "An operational measure of information leakage," in 2016 Annual Conference on Information Science and Systems (CISS), 2016.
[63] S. Verdu, "α-mutual information," in 2015 Information Theory and Applications Workshop (ITA), 2015.
[64] A. Renyi, "On measures of entropy and information," in Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability. The Regents of the University of California, 1961, pp. 547–561.
[65] R. Sibson, "Information radius," Zeitschrift fur Wahrscheinlichkeitstheorie und Verwandte Gebiete, vol. 14, no. 2, pp. 149–160, 1969.
[66] S. Arimoto, "Information measures and capacity of order α for discrete memoryless channels," in Colloquia Mathematica Societatis Janos Bolyai, Kestheley, Hungary, 1975, pp. 41–52.
[67] S.-W. Ho and S. Verdu, "Convexity/concavity of Renyi entropy and α-mutual information," in 2015 IEEE International Symposium on Information Theory (ISIT), 2015.
[68] I. Sason and S. Verdu, "f-divergence inequalities."
[69] Y. Polyanskiy and S. Verdu, "Arimoto channel coding converse and Renyi divergence," in Communication, Control, and Computing (Allerton), 2010 48th Annual Allerton Conference on. IEEE, 2010, pp. 1327–1333.
[70] T. M. Cover and J. A. Thomas, Elements of Information Theory, 2nd ed. Wiley-Interscience, 2006.
[71] S. Boyd and L. Vandenberghe, Convex Optimization. Cambridge University Press, 2014.
[72] T. Van Erven and P. Harremos, "Renyi divergence and Kullback-Leibler divergence," IEEE Transactions on Information Theory, vol. 60, no. 7, pp. 3797–3820, 2014.
[73] F. Liese and I. Vajda, "On divergences and informations in statistics and information theory," IEEE Transactions on Information Theory, vol. 52, no. 10, pp. 4394–4412, Oct 2006.