
PoolView: Stream Privacy for Grassroots Participatory Sensing

Raghu K. Ganti, Nam Pham, Yu-En Tsai, and Tarek F. Abdelzaher
Department of Computer Science, University of Illinois, Urbana-Champaign

rganti2,nampham2,ytsai20,[email protected]

ABSTRACT

This paper develops mathematical foundations and architectural components for providing privacy guarantees on stream data in grassroots participatory sensing applications, where groups of participants use privately-owned sensors to collectively measure aggregate phenomena of mutual interest. Grassroots applications refer to those initiated by members of the community themselves as opposed to by some governing or official entities. The potential lack of a hierarchical trust structure in such applications makes it harder to enforce privacy. To address this problem, we develop a privacy-preserving architecture, called PoolView, that relies on data perturbation on the client-side to ensure individuals' privacy and uses community-wide reconstruction techniques to compute the aggregate information of interest. PoolView allows arbitrary parties to start new services, called pools, to compute new types of aggregate information for their clients. Both the client-side and server-side components of PoolView are implemented and available for download, including the data perturbation and reconstruction components. Two simple sensing services are developed for illustration; one computes traffic statistics from subscriber GPS data and the other computes weight statistics for a particular diet. Evaluation, using actual data traces collected by the authors, demonstrates the privacy-preserving aggregation functionality in PoolView.

Categories and Subject Descriptors

G.3 [Mathematics of Computing]: Probability and Statistics—Time series analysis; K.4.1 [Computing Milieux]: Computers and Society—Privacy

General Terms

Algorithms, Design, Experimentation

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
SenSys'08, November 5–7, 2008, Raleigh, North Carolina, USA.
Copyright 2008 ACM 978-1-59593-990-6/08/11 ...$5.00.

Keywords

Privacy, Architecture, Data perturbation, Stream privacy, Grassroots participatory sensing

1. INTRODUCTION

Much of the past sensor networks research focused on networking issues; a scope naturally suggested by the name of the discipline. Another very important aspect of distributed sensing, however, is data management. In this paper, we focus on privacy as a category of data management concerns in emerging applications. Our work is motivated by the recent surge in distributed collection of data by self-selected participants for the purpose of characterizing aggregate real-world properties, computing community statistics, or mapping physical phenomena of mutual interest. This type of application has recently been called participatory sensing [5]. Examples of such applications include CarTel [22], BikeNet [10], MMM2 [8], and ImageScape [32].

This paper presents an architecture, mathematical foundations, and a service implementation to enable grassroots participatory sensing applications. We consider communities of individuals with sensors collecting streams of private data for personal reasons. These data could also be of value if shared with the community for fusion purposes to compute aggregate metrics of mutual interest. One main problem in such applications is privacy. This problem motivates our work.

In this paper, we address privacy assurances in the absence of a trust hierarchy. We rely on data perturbation at the data source to empower clients to ensure the privacy of their data themselves, using tools that perturb such data prior to sharing for aggregation purposes. Privacy approaches, including data perturbation, are generally met with criticism for several good reasons. First, it has been repeatedly shown that adding random noise to data does not protect privacy [24, 21]. It is generally easy to reconstruct data from noisy measurements, unless the noise is so large that utility cannot be attained from sharing the noisy data. Second, anonymity (another approach to privacy) does not help either. Anonymized GPS data still reveals the identity of the user. Withholding location data in a radius around home can be a solution, but opting to withhold, in itself, may reveal information. Moreover, in a sparsely deployed network, the radius would have to be very large to truly anonymize the data. A third question is whether the assumption of lack of a centralized trusted entity is justified. After all, we already entrust our cell phone providers with a significant amount of information. It should not be difficult to provide added-value services that benefit from the current (fairly extensive) trust model.

Below, we address the aforementioned questions prior to presenting our approach. The paper addresses the problem of privacy in time-series data¹. The fundamental insight as to why perturbation techniques do not protect privacy is correlation among different pieces of data or between data and context (e.g., identity of owner). The authors take the small step of addressing correlations within a data stream. We show that with proper tools, non-expert users can generate appropriately-correlated application-specific noise in the absence of trust, such that the data of these individuals cannot be reconstructed correctly, but community aggregates can still be computed with accuracy. We further explain how non-expert users might be able to generate the appropriate application-specific noise without trusting external parties related to that specific application. Observe that the inability to reconstruct actual user data largely obviates the need for anonymity. We acknowledge that our solutions are not needed for scenarios where a hierarchy of trust exists. In contrast to such scenarios, in this paper, we are interested in providing a way for individuals in the community to collect information from their peers, such as "how well does this or that diet or exercise routine work" or "what patterns of energy use at home really worked for you to reduce your energy bill?" Obviating the requirement to find a mutually trusted entity before data are collected is a way to encourage the proliferation of grassroots participatory sensing applications.

We adopt a client-server architecture, called PoolView, where clients share (perturbed) private sensory data and servers (called pools) aggregate such data into useful information made available to the community. PoolView presents a simple API for individuals to set up new pools the way they might set up a wiki or discussion group. Simple APIs are also provided for clients to subscribe to pools and export their data. Interactions between clients and servers rely on a common data-stream abstraction. A stream allows an individual to share a sequence of (perturbed) data measurements such as weight values or GPS coordinates (a logged trip). One main goal we address in this paper is to compute perturbation such that (i) it preserves the privacy of application-specific data streams against common reconstruction algorithms, (ii) it allows computation of community aggregates within proven accuracy bounds, and (iii) the perturbation (which may be application-specific) can be applied by non-expert users without having to trust the application. Hence, any person can propose a custom statistic and set up a pool to collect (perturbed) data from non-expert peers, who can verify independently that they are applying the "right" (application-specific) perturbation to preserve their privacy before sharing their data.

As alluded to above, ensuring privacy of data streams via perturbation techniques is complicated by the existence of correlation among subsequent data values in time-series data. Such correlations can, in general, be leveraged to attack the privacy of the stream. For example, sharing a single data value representing one's weight, perturbed by adding a random number between -2000 and 2000 pounds, will usually not reveal much about the real weight. On the other hand, sharing the current weight value every day, perturbed by a different random number, makes it possible to guess the weight progressively more accurately simply by averaging the sequence to cancel out noise. Perturbing the sequence by adding the same random number every day does not work either, because it will reveal the trend in weight measurements over time (e.g., how much weight the individual loses or gains every day). Our goal is to hide both the actual value and trend of a given individual's data series, while allowing such statistics to be computed over a community. Hence, for example, a community of weight watchers can record their weights as measured on a particular diet, allowing weight-loss statistics (such as average weight loss and standard deviation of loss) to be computed as a function of time on the diet.

¹We do not yet consider multimedia sensor data in this paper.
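To make the averaging attack concrete, here is a small numeric sketch (our own illustration, not code from PoolView; the weight value and noise range are invented):

```python
import numpy as np

rng = np.random.default_rng(0)
true_weight = 180.0  # hypothetical real weight, in pounds

# Share the weight repeatedly, each time adding fresh independent noise
# drawn uniformly from [-2000, 2000] pounds.
days = 10_000
shared = true_weight + rng.uniform(-2000, 2000, size=days)

# Any single shared value reveals almost nothing about the real weight...
single_value = shared[0]

# ...but averaging the sequence cancels the noise and exposes the secret.
estimate = shared.mean()
mean_error = abs(estimate - true_weight)
```

With this many samples, the standard error of the mean is roughly 2000/sqrt(3 · 10000), about 11.5 pounds, and it keeps shrinking as more values are shared; this is exactly the leak described above.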

To instantiate the architecture, we have implemented (code available at [30]) and deployed two PoolView services (pools), one for computing the average weight of a self-selected community (e.g., all those on a particular diet), and another for computing traffic statistics in a privacy-preserving fashion. We present data from the above two case studies, collected by the authors.

The rest of this paper is organized as follows. Section 2 describes the perturbation techniques that we develop for sharing time-series data in a privacy-preserving manner. Section 3 describes PoolView, our privacy-centric architecture for participatory sensing. We discuss the results from the two case studies in Section 4. Related work is presented in Section 5. Finally, we conclude the paper in Section 6 and discuss directions for future research.

2. DATA PERTURBATION

Consider a participatory sensing application where users collect data that are then shared (in a perturbed form) to compute community statistics. The reader may assume an application where statistics are computed after the fact (such as average traffic or average energy consumption statistics), or where they evolve very slowly (such as weight statistics).

In this section, we provide the mathematical foundations needed for perturbing time-series data in grassroots participatory sensing applications. Our perturbation problem is defined as follows. Perturb a user's sequence of data values such that (i) the individual data items and their trend (i.e., their changes with time) cannot be estimated without large error, whereas (ii) the distribution of community data at any point in time, as well as the average community data trend, can be estimated with high accuracy.

For instance, in the weight-watchers example, it may be desired to find the average weight loss trend as well as the distribution of weight loss as a function of time on the diet. This is to be accomplished without being able to reconstruct any individual's weight and weight trend. For another example, it may be desired to compute the average traffic speed on a given city street, as well as the speed variance (i.e., the degree to which traffic is "stop-and-go"), using speed data contributed by individuals, without being able to reconstruct any individual's speed and acceleration curves.

Examples of data perturbation techniques can be found in [3, 2, 12]. The general idea is to add random noise with a known distribution to the user's data, after which a reconstruction algorithm is used to estimate the distribution of the original data. Early approaches relied on adding independent random noise, which was shown to be inadequate.



For example, a special technique based on random matrix theory has been proposed in [24] to recover the user data with high accuracy. Later approaches considered hiding individual data values collected from different private parties, taking into account that data from different individuals may be correlated [21]². However, they do not make assumptions on the model describing the evolution of data values from a given party over time, which can be used to jeopardize the privacy of data streams. By developing a perturbation technique that specifically considers the data evolution model, we show that it is strong against attacks that extract regularities in correlated data, such as spectral filtering [24] and Principal Component Analysis (PCA) [21].

2.1 A General Overview

We show in this paper that privacy of time-series data (measuring some phenomenon, P) can be preserved if the noise used to perturb the data is itself generated from a process that approximately models the phenomenon. For instance, in the weight watchers example, we may have an intuitive feel for the time scales and ranges of weight evolution when humans gain or lose weight. Hence, a noise model can be constructed that exports realistic-looking parameters for both the direction and time-constant of weight changes. We can think of this noise as the (possibly scaled) output of a virtual user. For now, let us not worry about who actually comes up with the noise model and what the trust implications are. Later, we shall revisit this issue in depth (Section 2.9).

Once the noise model is available (noise model generation is described at the end of this section), its structure and the probability distributions of all parameters are shared with the community. By choosing random values for these parameters from the specified distributions, it is possible, for example, to generate arbitrary weight curves of virtual people showing weight gain or loss. A real user can then add their true weight curve to that of one or several locally generated virtual users obtained from the noise model. The actual model parameters used to generate the noise are kept private. The resulting perturbed stream is shared with the pool, where it can be aggregated with that of others in the community. Since the distributions of noise model parameters are statistically known, it is possible to estimate the sum, average, and distribution of added noise (of the entire community) as a function of time. Subtracting that known average noise time series from the sum of perturbed community curves will thus yield the true community trend. The distribution of community data at a given time can similarly be determined, since the distribution of noise (i.e., data from virtual users) is known. The estimate improves with community size.

A useful refinement of the above technique is to separate, in the virtual user model, parameters that are inputs from those that express intrinsic properties of the model. For example, food intake may be an input parameter of a virtual user model. Inputs can be time-varying. Our perturbation algorithm allows changing the values of input model parameters with time. Since the input fed to the virtual users is not shared, it becomes very hard to extract real user data from added noise (i.e., virtual user) curves.

²Since the data correlation is across individuals of the population, we will not compare it with our work in the evaluation section.

One last question relates to the issue of trust. Earlier, we motivated perturbation approaches in part by the lack of a central trusted party that would otherwise be able to privately collect real unperturbed data and compute the needed statistics. Given that non-experts cannot be asked to program noise models for each new application (or even be expected to know what these models are), and since they cannot trust the data collection party, where does the noise (i.e., the virtual user) model come from, and how does a non-expert client know that the model is not fake? Obtaining the noise model from an untrusted party is risky. If the party is malicious, it could send a "bad" model that is, say, a constant, or a very fast-changing function (that can be easily separated from real data using a low-pass filter), or perhaps a function with a very small range that perturbs real data by only a negligible amount. Such noise models will not perturb data in a way that protects privacy.

The answer comes from the requirement, stated earlier, that the noise added be an approximation of the real phenomenon. Incidentally, observe that the above requirement does not mean that the noise curve be similar to the user data curve. It only means that both come from a model of the same structure but different parameters. Hence, in the weight example, it could be that the user is losing weight whereas the noise added is a curve that shows weight gain. Both curves come from the same model structure (e.g., a first-order dynamic system that responds to food intake with a gain and time-constant). The models would have different parameters (a different gain, a different time-constant, and, importantly, a different input modeling the time-varying food intake).

With the above in mind, we allow the server (that is untrusted with our private data) to announce the used noise model structure and parameter distributions to the community of users. The model announced by the server can be trusted by a user only if that user's own data could have been generated from some parameter instantiation of that model with a non-trivial probability. This can be tested locally by a curve-fitting tool on the user's side the first time the user uses the pool. Such a general tool is part of the client-side PoolView code distribution. Informally, a noise model structure and parameter distributions are accepted by a user only if (i) the curve-fitting error for the user's own data is not too large and (ii) the identified model parameter values for the user's data (that result from curve fitting) are not too improbable (given the probability distributions of model parameters).

In the rest of this section, we formalize the notions of perturbation (Section 2.2), reconstruction (Sections 2.3, 2.4, 2.5, 2.6), and model validation (Section 2.7) discussed above. We prove properties of the approach, such as the degree of privacy achieved and the community reconstruction error.

2.2 Data Perturbation Algorithm

Consider a particular application where a pool (an aggregation server) collects data from a community to perform statistics. To describe the perturbation algorithm, let N be the number of users in the community. Let M be the number of data points sent to the aggregation server by each user (we assume this to be the same across users for notational simplicity, but the algorithm does not depend on that). Let x^i = (x^i_1, x^i_2, . . . , x^i_M), n^i = (n^i_1, n^i_2, . . . , n^i_M), and y^i = (y^i_1, y^i_2, . . . , y^i_M) represent the data stream, noise,



and perturbed data shared by user i, respectively. At time instant k, let f_k(x) be the empirical community distribution, f^e_k(x) be the exact community distribution, f_k(n) be the noise distribution, and f_k(y) be the perturbed community distribution. The exact community distribution is the distribution of a community with infinite population. In reality, no community is infinite. Therefore, we use the notion of the empirical community distribution to refer to the distribution of a true community with limited population. The notion of the exact community distribution is useful when the reconstruction error of a small community is considered.

Most user data streams can be generated according to either linear or non-linear discrete models. In general, a model can be written as a discrete function of an index k (e.g., time, distance), application-dependent parameters θ, and inputs u, and is denoted g(k, θ, u). Notice that θ is a fixed-length parameter vector characterizing the model, while u is a vector of length M characterizing the input to the model at each instant.

For example, in the Diet Tracker application, according to one article [26], the weight of a dieting user over time can be roughly approximated by three parameters: λ_k, β, and W_0. Here β is the body metabolism coefficient, W_0 is the initial weight of the person right before dieting, and λ_k is the average calorie intake of that person on day k. The weight W(k) of a dieting user on day k of the diet is characterized by a non-linear equation:

W(k) = W(k − 1) + λ_k + βW(k − 1)^(3/4)   (1)
W(0) = W_0   (2)

While we do not assert the validity of the above diet model, we shall use it for illustration. In this example, the model parameter vector is θ = (β, W_0) and the input to the model is u = (λ_1, λ_2, . . . , λ_M). From Equations (1, 2), given θ and u, one can generate the weight of a user with high accuracy. While the model for a dieting person is not private, and the probability distributions of weight parameters over a large population can be approximately hypothesized, it is desired to hide the parameters θ and u of any given user from being estimated. This protects an individual user's privacy.
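As a quick illustration, the model in Equations (1)-(2) can be simulated directly. The parameter values below (W_0, β, and the constant λ_k) are invented for the sketch and are not taken from [26]:

```python
def weight_curve(W0, beta, lam):
    """Simulate W(k) = W(k-1) + lam[k] + beta * W(k-1)**(3/4), with W(0) = W0."""
    W = [W0]
    for k in range(1, len(lam)):
        W.append(W[-1] + lam[k] + beta * W[-1] ** 0.75)
    return W

# A hypothetical dieter: 180 lb starting weight, constant intake term,
# negative metabolism coefficient.
curve = weight_curve(W0=180.0, beta=-0.04, lam=[1.5] * 60)
```

With these made-up values the intake term (1.5) is smaller than the metabolism term (0.04 · 180^(3/4) ≈ 2.0) at the starting weight, so the simulated user slowly loses weight until the two terms balance.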

Once the model for the shared data is known, the entire data stream of user i can be represented as a pair of parameter vectors (θ^i, u^i). We can assume that, for the community, θ^i is drawn from a probability distribution f_θ(θ) and u^i is drawn from another probability distribution f_u(u). Both the real distributions f_θ(θ) and f_u(u) are unknown to the aggregation server.

The distributions f_θ(θ) and f_u(u) are important since they characterize typical data streams of the users in a community. To generate noise with the same model as the data, the parameters θ and u are required. Because the real distributions f_θ(θ) and f_u(u) are unknown, approximate distributions f^n_θ(θ) and f^n_u(u) are used to generate θ and u, respectively.

In short, given the data x^i = (x^i_1, x^i_2, . . . , x^i_M), the model g(k, θ, u), and the approximated distributions f^n_θ(θ), f^n_u(u), the perturbed data for user i is generated as follows:

• Generate samples θ^i_n and u^i_n from the distributions f^n_θ(θ) and f^n_u(u), respectively.

• Generate the noise stream n^i = (n^i_1, n^i_2, . . . , n^i_M), where n^i_k = g(k, θ^i_n, u^i_n).

• Generate the perturbed data by adding the noise stream to the data stream: y^i = x^i + n^i.

To achieve better privacy, a scaled version of the noise may be added to the data, in which case the perturbed data becomes y^i = x^i + A·n^i, where A is either a random variable chosen from a known distribution f_A(A) or a constant. However, the choice of f_A(A) (if A is a random variable) or the value of A (if A is a constant) must be the same for all users in the community, so that the aggregation server is able to reconstruct the community distribution. In the situation that a scaling factor is used, the parameters associated with A are provided by the aggregation server along with the model and the model's parameters.
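The steps above, plus the optional scaling factor A, can be sketched as follows. This is our own illustration: the diet model serves as g(k, θ, u), and all parameter distributions are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)

def diet_model(W0, beta, lam):
    # Same structure as Equations (1)-(2): the shared data model g(k, theta, u)
    W = [W0]
    for k in range(1, len(lam)):
        W.append(W[-1] + lam[k] + beta * W[-1] ** 0.75)
    return np.array(W)

def perturb(x, A=1.0):
    """y^i = x^i + A * n^i, with n^i a virtual-user curve drawn from
    (invented) approximate distributions f^n_theta and f^n_u."""
    M = len(x)
    W0_n = rng.uniform(120.0, 220.0)        # theta_n ~ f^n_theta (kept private)
    beta_n = rng.uniform(-0.06, -0.02)
    lam_n = rng.uniform(0.5, 2.5, size=M)   # u_n ~ f^n_u (kept private)
    noise = diet_model(W0_n, beta_n, lam_n)
    return x + A * noise

x = diet_model(180.0, -0.04, np.full(60, 1.5))  # the user's real stream
y = perturb(x)                                  # what is actually shared
```

Only y leaves the client; the sampled virtual-user parameters never do.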

2.3 Reconstruction of Community Average

In this section, we consider a simple case where the aggregation server is interested in estimating the community average at a certain time instant k. Since the parameter distributions (f^n_θ(θ), f^n_u(u), f_A(A)) and the model g(k, θ, u) are known, the noise distribution at an arbitrary time instant k can be accurately calculated. All users use the same data model structure and parameter distributions to generate their noise streams. Therefore, the noise distribution at any time instant k is the same for all users and is denoted f_k(n).

Upon receiving the perturbed data y^i from all users, the aggregation server calculates the empirical average of the community data at time k as PA_k = (1/N) Σ_{i=1}^{N} y^i_k. By the law of large numbers, if the number of users in the community (N) is large enough, the empirical value is equal to the expected value of the community perturbed data, E[y_k]. We can write E[y_k] as follows:

E[y_k] = E[x_k + A·n_k]   (3)
       = E[x_k] + E[A]E[n_k]   (4)

Because the distributions of A and n_k are known to the aggregation server, E[A] and E[n_k] can be computed. Therefore, the community average at time k can be estimated as E[x_k] = PA_k − E[A]E[n_k]. Note that Equation (4) follows from Equation (3) because A is either a constant or a random variable that is independent of n_k.

In the special case that A is chosen as a zero mean randomvariable, the estimated community average at time k is alsothe average of the perturbed data at time k. In other words,the server simply averages the perturbed data to get (a goodestimate of) the true community average.
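A minimal end-to-end sketch of this reconstruction (our illustration, reusing a hypothetical diet model as the shared noise model; the server estimates E[n_k] by Monte Carlo from the public parameter distributions):

```python
import numpy as np

rng = np.random.default_rng(2)
M, N = 60, 500  # stream length and community size (illustrative)

def diet_model(W0, beta, lam):
    W = [W0]
    for k in range(1, len(lam)):
        W.append(W[-1] + lam[k] + beta * W[-1] ** 0.75)
    return np.array(W)

def virtual_user():
    # Draw a noise curve from the publicly announced (hypothetical) distributions
    return diet_model(rng.uniform(120.0, 220.0), rng.uniform(-0.06, -0.02),
                      rng.uniform(0.5, 2.5, size=M))

# Real community data (parameters invented for the sketch)
real = np.array([diet_model(rng.uniform(150.0, 210.0), -0.04,
                            np.full(M, rng.uniform(1.0, 2.0)))
                 for _ in range(N)])

# Each client shares y^i = x^i + n^i  (A = 1, so E[A] = 1)
perturbed = real + np.array([virtual_user() for _ in range(N)])

# Server side: E[x_k] = PA_k - E[A] * E[n_k]
PA = perturbed.mean(axis=0)
E_n = np.mean([virtual_user() for _ in range(5000)], axis=0)
estimated_avg = PA - E_n
```

The estimate should track the true community average `real.mean(axis=0)` and, per the law of large numbers, the residual shrinks as N grows, even though no individual curve can be recovered from its perturbed version.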

2.4 Reconstruction of Community Distributions

We will now describe the more general problem of reconstructing the distribution (as opposed to the average) of community data at a given point in time. At time instant k, the perturbed data of each user is the sum of the actual data and the noise: y^i_k = x^i_k + n^i_k. Thus the distribution of the perturbed data, f_k(y), is the convolution of the community distribution f_k(x) and the noise distribution f_k(n):

f_k(y) = f_k(n) ∗ f_k(x)   (5)

All the distributions in Equation (5) can be discretized as:

f_k(n) = (f_n(0), f_n(1), . . . , f_n(L))
f_k(x) = (f_x(0), f_x(1), . . . , f_x(L))
f_k(y) = (f_y(0), f_y(1), . . . , f_y(2L))



Equation (5) can then be rewritten as:

f_y(m) = Σ_{k=−∞}^{∞} f_x(k) f_n(m − k)   (6)

Since convolution is a linear operator, Equation (6) can be written as

f_k(y) = H f_k(x)   (7)

where H is a (2L + 1) × (L + 1) Toeplitz matrix, also called the blurring kernel, constructed from elements of the discrete distribution f_k(n) as:

H = [ f_n(0)    0        0       . . .   0          ]
    [ f_n(1)   f_n(0)    0       . . .   0          ]
    [ f_n(2)   f_n(1)   f_n(0)   . . .   0          ]
    [ . . .    . . .    . . .    . . .   . . .      ]
    [ 0        0        . . .    f_n(L)  f_n(L − 1) ]
    [ 0        0        0        . . .   f_n(L)     ]   (8)

In Equation (7), f_k(x) is the community distribution at time k that needs to be estimated, H is known, and f_k(y) is the empirical perturbed data distribution. This problem is well known in the literature as the deconvolution problem. Several algorithms have been developed to solve it, and they can be categorized into two classes. The first is a set of iterative algorithms, such as the Richardson-Lucy algorithm, the EM algorithm, and the Poisson MAP method. The second class of algorithms is non-iterative; examples include Tikhonov-Miller restoration and SECB restoration. None of the iterative algorithms give bounds on the reconstruction error, while the non-iterative algorithms, supported by well-defined mathematical optimization methods, give upper bounds on the reconstruction error. In this paper, the Tikhonov-Miller restoration method is employed to compute the community distribution.
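The blurring-kernel construction in Equation (8) can be checked numerically: multiplying by H reproduces the convolution in Equation (5). A small sketch with invented three-point distributions (L = 2):

```python
import numpy as np

fn = np.array([0.2, 0.5, 0.3])  # discretized f_k(n), hypothetical values
fx = np.array([0.1, 0.6, 0.3])  # discretized f_k(x), hypothetical values
L = len(fn) - 1

# Build H as in Equation (8): column j is fn shifted down by j rows.
H = np.zeros((2 * L + 1, L + 1))
for j in range(L + 1):
    H[j:j + L + 1, j] = fn

fy_matrix = H @ fx             # Equation (7)
fy_conv = np.convolve(fn, fx)  # Equations (5)/(6)
```

Both computations yield the same length-(2L + 1) distribution, confirming that entry m of H f_k(x) equals Σ_j f_n(m − j) f_x(j).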

The Tikhonov-Miller restoration [34] requires an a priori bound ε on the L2 norm of the noise, together with an a priori bound M on the L2 norm of the community distribution:

||H f̃k(x) − fk(y)||2 ≤ ε   (9)

||(HT H)^(−ν) f̃k(x)||2 ≤ M   (10)

Here, f̃k(x) denotes the true community distribution.

Throughout this paper, || · ||p denotes the Lp(R) norm of a vector. The optimal solution fk(x) is chosen to minimize the regularized quadratic functional:

||H fk(x) − fk(y)||2² + (ε/M)² ||(HT H)^(−ν) fk(x)||2²   (11)

The ratio λ = ε/M is called the regularization coefficient; it governs the relative importance of the error term and the regularization term [23].

By minimizing Equation (11), the exact expression for the optimal solution f*k(x) can be found:

f*k(x) = (QT)^(−1) HT fk(y)   (12)

QT = HT H + (ε/M)² (HT H)^(−2ν)   (13)

Equations (12) and (13) are used in the aggregation server to reconstruct the community distribution. The parameters ε, M, and ν are tuned empirically to obtain a small reconstruction error. The optimal solution f*k(x) may not form a proper probability distribution, so it is normalized. The relation between these parameters and the reconstruction error is analyzed in the following section.
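To make the reconstruction step concrete, here is a minimal pure-Python sketch of Equations (12)-(13), specialized to ν = 0 so that (HT H)^(−2ν) reduces to the identity. The toy H, fk(y), and λ values below are hypothetical, not taken from the paper's data:

```python
# Sketch of the Tikhonov restoration step (Eqs. 12-13) with nu = 0, so
# Q_T = H^T H + (eps/M)^2 * I. Pure Python; toy inputs, not paper data.

def transpose(A):
    return [list(row) for row in zip(*A)]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    aug = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[piv] = aug[piv], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= f * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (aug[r][n] - sum(aug[r][c] * x[c]
                                for c in range(r + 1, n))) / aug[r][r]
    return x

def tikhonov_restore(H, fy, lam):
    """f*_k(x) = (H^T H + lam^2 I)^{-1} H^T f_k(y)  (Eqs. 12-13, nu = 0)."""
    Ht = transpose(H)
    Q = matmul(Ht, H)
    for i in range(len(Q)):
        Q[i][i] += lam * lam
    rhs = [sum(Ht[i][j] * fy[j] for j in range(len(fy)))
           for i in range(len(Ht))]
    fx = solve(Q, rhs)
    s = sum(fx)                      # renormalize to a proper distribution
    return [v / s for v in fx]

# Toy example: noise kernel fn = [0.2, 0.5, 0.3], true fx = [0.1, 0.6, 0.3].
H = [[0.2, 0.0, 0.0],
     [0.5, 0.2, 0.0],
     [0.3, 0.5, 0.2],
     [0.0, 0.3, 0.5],
     [0.0, 0.0, 0.3]]
fy = [0.02, 0.17, 0.39, 0.33, 0.09]   # convolution of fn and true fx
fx_hat = tikhonov_restore(H, fy, lam=1e-3)
```

With a small λ and a well-conditioned H, the recovered fx_hat is close to the true community distribution; in practice the regularization matters precisely when H is ill-conditioned.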

2.5 Error Bound on Community Distribution Reconstruction

If all the constraints in Equations (9) and (10) are satisfied, the error bound of the reconstruction is given as:

||f̃k(x) − f*k(x)||2 ≤ √2 {A(ν)}^(−1/2) M^(1/(1+2ν)) ε^(2ν/(1+2ν))   (14)

A(ν) = (2ν)^(1/(1+2ν)) + (2ν)^(−2ν/(1+2ν))   (15)

The reconstruction error bound in Equation (14) depends on ε, M, and ν. From Equation (14), we observe that the larger the ε, the larger the upper bound on the reconstruction error. We can rewrite Equation (9) as follows to observe the trade-off between the noise variance and the reconstruction error:

||H f̃k(x) − fk(y)||2 = ||fk(n) ∗ f̃k(x) − fk(n) ∗ fk(x)||2   (16)

                     = ||fk(n) ∗ (f̃k(x) − fk(x))||2   (17)

                     ≤ ||fk(n)||2 ||f̃k(x) − fk(x)||1   (18)

Equation (18) is obtained from Equation (17) using Young's inequality. Note that ε is chosen so that Equation (9) is satisfied; therefore, we can tighten the condition on ε such that:

||H f̃k(x) − fk(y)||2 ≤ ε ≤ ||fk(n)||2 ||f̃k(x) − fk(x)||1   (19)

Then the error bound in Equation (14) can be written as:

||f̃k(x) − f*k(x)||2 ≤ √2 {A(ν)}^(−1/2) × M^(1/(1+2ν)) × ||fk(n)||2^(2ν/(1+2ν)) × ||f̃k(x) − fk(x)||1^(2ν/(1+2ν))   (20)

This equation gives a good approximation of the community reconstruction error based on the noise distribution. The term ||fk(n)||2 is the noise energy: the sum of the noise from all users at time instance k. The term ||f̃k(x) − fk(x)||1 is the sum of the differences between the true community distribution at time k and the empirical distribution constructed from all the community data points at time k, which is very small. We call this term the community sampling error. It depends on the number of users in the community, N: a larger N implies a smaller community sampling error, and vice versa. Thus, the larger the noise energy, the higher the reconstruction error; and the larger the number of users in the community, the lower the reconstruction error. Hence, for a large enough community, a good compromise can be achieved between privacy (which we relate to noise energy) and reconstruction error.

In Section 2.3, we mentioned that privacy can be improved by multiplying the noise by a factor A. For simplicity, first consider the case when A is a constant. Note that ||A fk(n)||2 = A ||fk(n)||2. Therefore, the reconstruction error bound in Equation (20) is scaled by a factor of A^(2ν/(1+2ν)). If A is a finite random variable, then the bound becomes ||A fk(n)||2 = δ(A) ||fk(n)||2, where δ(A) is a number whose value depends on the distribution of A. In both cases, the reconstruction error is scaled by a factor that can be estimated.
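The scaling behavior just described can be checked numerically. The short sketch below (with hypothetical parameter values, not the paper's) evaluates the right-hand side of Equation (20) and confirms that amplifying the noise by A = 10 scales the bound by exactly 10^(2ν/(1+2ν)):

```python
# Numeric illustration (hypothetical values) of how the error bound of
# Eq. (20) scales with the noise amplification factor A.
import math

def A_nu(nu):
    """A(nu) from Eq. (15)."""
    return ((2 * nu) ** (1 / (1 + 2 * nu))
            + (2 * nu) ** (-2 * nu / (1 + 2 * nu)))

def error_bound(nu, M, noise_norm, sampling_err):
    """Right-hand side of Eq. (20)."""
    p = 2 * nu / (1 + 2 * nu)
    return (math.sqrt(2) * A_nu(nu) ** -0.5
            * M ** (1 / (1 + 2 * nu))
            * noise_norm ** p * sampling_err ** p)

nu, M, n2, s1 = 2.0, 0.01, 1.5, 0.5     # hypothetical parameter values
b1 = error_bound(nu, M, n2, s1)
b10 = error_bound(nu, M, 10 * n2, s1)   # noise scaled by A = 10

# The bound grows as A**(2*nu/(1+2*nu)), here 10**0.8.
assert abs(b10 / b1 - 10 ** (2 * nu / (1 + 2 * nu))) < 1e-9
```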


2.6 Privacy and User Data Reconstruction

In this section, we analyze the level of user privacy achieved using our proposed perturbation algorithm. Many methods have been proposed to measure privacy. For example, in [3], privacy is measured in terms of confidence intervals. In [2], a measure of privacy using mutual information is suggested. However, these methods take into account neither the data model nor the exploitation method. In fact, privacy breaches differ depending on the exploitation method employed by the adversary. In this paper, we quantify privacy by analyzing the degree to which actual user data can be estimated from perturbed data using methods that take advantage of data correlations, such as PCA and spectral filtering. Other possible estimation methods, such as Minimum Mean Squared Error (MMSE) estimation, are also discussed.

First, consider traditional filtering methods such as PCA and spectral filtering. These methods rest on the assumptions that (i) the additive noise is time independent and independent of the user data, and (ii) the noise variance is small compared to the signal variance. With our proposed perturbation scheme, both assumptions are violated. The noise is generated from a known model but with unknown parameters; thus, noise points are correlated. In addition, the noise and the user data are generated from the same model. The noise variance is not necessarily small, since it can be amplified by a factor A as discussed above. The filtering techniques require prior knowledge about the noise in order to do accurate estimation. In our scheme, on the other hand, one does not know the noise distribution for any single user, since it is a function of the specific model parameters chosen, which remain private. We only know the distribution of such parameters over the community, but not their specific instances for any given user. Therefore, it is expected that the filtering techniques cannot reveal the user data with high accuracy. It is empirically shown in Section 4 that the PCA method is not successful in reconstructing user data. Other methods (not shown) yield similarly poor reconstruction. Very little information is breached.

A better way to estimate user data is to estimate the user parameters only, since the model is known. MMSE is one of the most common methods to estimate parameters given the model. Assume that the user data is generated by g(k, θx, ux), the noise by an approximated model ga(k, θn, un), and the perturbed data is y = g(k, θx, ux) + ga(k, θn, un). Then, parameter estimation using MMSE is defined as follows:

(θ*, u*) = argmin(θ,u) ||y − g(k, θ, u)||2   (21)

We consider the case when the noise is generated by a well-approximated model. In this case, the perturbed data has the same dynamics as the user data and hence can be approximated by ŷ = g(k, θy, uy) with a very small error δ = ||y − ŷ||. Then the optimal solution for Equation (21) is (θ*, u*) = (θy, uy). Thus, the error between the user parameters and the MMSE estimate is ||(θy, uy) − (θx, ux)||. In order to make this error large, the set of possible noise streams must be large. If the above assumption holds and the data model is well approximated, then our approach is robust to MMSE attacks.

However, if an ill-approximated data model (or a totally different model) is used to generate the noise stream, user data may be revealed. A malicious server may use this method to exploit the user parameters. In this case, instead of using the MMSE method defined above, the malicious server may use a slightly different estimation, such as:

(θ*, u*, θn*, un*) = argmin(θ,u,θn,un) ||y − g(k, θ, u) − ga(k, θn, un)||2   (22)

In the estimation above, the malicious server tries to estimate both the parameters of the client and the parameters of the noise. Because there is a modeling mismatch, the above estimation may give a good approximation of the user data.

Similarly, a malicious server may send a good data model but with a very small set of parameters. With this type of attack, using the parameter distributions sent by the server, the client can only generate a very small set of noise streams. Thus, the server can extract user data from the perturbed data stream, since the noise is predictable. We devised a method, usable on the user side, to effectively verify that the noise model announced by the server and its parameters are adequate. This is discussed in Section 2.7. Users may choose not to share their data if bad noise models or bad parameter distributions are detected, as shown next.

2.7 Model and Parameter Verification

In Section 2.6, we showed that a malicious server can deliberately announce a "bad" noise model in order to reveal user data using special estimation techniques. In what follows, we formalize the method to detect whether a model and its parameter distribution are malicious. The detection is based on the following observation: the user data should be a typical realization of the model, which also means that the probability of the user data's parameters, as sampled from the noise model parameter distribution, is high.

Given the model g(k, θ, u), the distributions fn(θ) and fn(u) (both publicly announced to the community by the aggregation server), and user data xi, we perform parameter verification as follows.

We estimate the user data parameters by minimizing the following quadratic function:

(θi, ui) = argmin(θ,u) ||xi − g(k, θ, u)||2   (23)

The minimization problem in Equation (23) can be solved numerically by algorithms such as gradient descent or quasi-Newton methods. A numeric solver with a simple (one-click) user API is included in the client-side PoolView code. Observe that if the given model is an ill approximation of the data model, then the error in this estimation is high. Thus, to check the validity of the model, it is required that the estimation error be less than a predefined threshold p1, which depends on the mean and variance of the user data (i.e., ||xi − g(k, θi, ui)|| ≤ p1). Using the triangle inequality, it can easily be shown that:

min(| ||xi|| − ||g(k)||min |, | ||xi|| − ||g(k)||max |) ≤ ||xi − g(k, θi, ui)|| ≤ p1   (24)

In Equation (24), ||g(k)||min and ||g(k)||max are the minimum and maximum norms of g(k) over all possible noise curves. For each user, ||xi||, ||g(k)||min, and ||g(k)||max are known. Hence this equation gives the user a lower bound for p1. It is empirically shown that a good bound is usually within 10% of this estimated bound.
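The verification procedure can be sketched as follows. This is a toy illustration, not the PoolView solver: the linear model g(k, θ, u) = θ·k + u, the grid search standing in for gradient descent, and the thresholds are all hypothetical:

```python
# Sketch of client-side verification (Section 2.7) for a hypothetical
# linear model g(k, theta, u) = theta*k + u. Step 1 fits the user data
# (a grid-search stand-in for Eq. 23); step 2 checks the fit error
# against p1; step 3 checks the parameter probability against p2.
import math

def fit_params(x, thetas, us):
    """Grid-search version of Eq. (23): argmin ||x - g(k, theta, u)||."""
    best = None
    for th in thetas:
        for u in us:
            err = math.sqrt(sum((xk - (th * k + u)) ** 2
                                for k, xk in enumerate(x)))
            if best is None or err < best[0]:
                best = (err, th, u)
    return best

def verify(x, thetas, us, density, p1, p2):
    err, th, u = fit_params(x, thetas, us)
    if err > p1:                      # model does not explain the data
        return False
    return density(th, u) >= p2       # parameters not in the tail

# Hypothetical user stream generated with theta = 0.5, u = 1.0, and a
# uniform parameter density over theta in [0, 1], u in [0, 2].
x = [0.5 * k + 1.0 for k in range(10)]
grid = [i / 20 for i in range(21)]
uniform = lambda th, u: 0.5 if 0 <= th <= 1 and 0 <= u <= 2 else 0.0
ok = verify(x, thetas=grid, us=[2 * g for g in grid], density=uniform,
            p1=0.1, p2=0.1)
```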


Finding noise model parameters that approximate the user data is not sufficient. The parameters obtained should not lie at the very tail of the parameter distribution sent by the server. If this happens, it would mean either that the user is anomalous (e.g., a very overweight person) or that the server is not sending a representative distribution. The validity of the parameter distribution is checked by verifying that the probability of (θi, ui) is larger than a threshold (i.e., P(θi, ui) = fn(θi) fn(ui) ≥ p2).

A disadvantage of this approach is that individuals whose data are anomalous will not know to trust the server even if the server is trustworthy. In a social setting, however, such individuals may be able to disambiguate the situation by contacting their peers.

Further, the server can attempt to breach privacy by sequentially proposing different noise models to the user. In such a case, the user can apply the simple policy of not accepting multiple noise models from a given server for a particular data stream.

Table 1 summarizes the various attacks (that breach privacy) and the countermeasures proposed in this paper.

Attacks                                              | Countermeasures
-----------------------------------------------------|---------------------------------------------------
Reconstruction attacks using estimation techniques   | Use a noise model similar to the data model
(e.g., PCA)                                          |
Malicious noise models from server                   | Reject models that do not fit data on client side
Sequential noise models from server                  | Limit noise model updates accepted by client
Malicious data from client                           | Reject outliers at the server

Table 1: Summary of attacks and countermeasures

2.8 Context Privacy

Thus far, we have developed a general data perturbation technique that preserves an individual user's privacy for time-series data. Privacy is preserved in the sense that the values and the trend of the user's data are not revealed (hard to infer). Another important aspect of privacy is context privacy. For example, data measurements are associated with a given time and place. Sharing data (e.g., on city traffic), even in perturbed form, still puts the user at a particular location at a given time. In this paper, we do not contribute to context privacy research. Traditional approaches to the problem typically rely on omitting some fields from the shared data, or on not answering queries for data (even in perturbed form) unless they are appropriately broad. For example, a user may reveal that they were on a particular city street at 11am on a Wednesday but not reveal which Wednesday it was. This could be enough to achieve a level of privacy and at the same time satisfy the statistical needs of the aggregation server, say, if the statistic being computed is traffic density as a function of time of day and day of week. The policy used for blanking out parts of the shared data fields to protect context is independent of our techniques to protect data. Context privacy policies are beyond the scope of this work.

2.9 Noise Model Generation

At the beginning of this section, we mentioned that the noise model must be similar to the data model to avoid reconstruction attacks. This seems problematic, since the whole purpose of measuring data is often to learn about the unknown, which appears inconsistent with having an a priori data model. There is a slight fallacy in this argument. First, observe that in most scientific pursuits there is a hypothesis that scientists try to prove or disprove. This means that some models and expectations already exist, and these can be used to come up with a noise model. For example, there are well-known models for vehicular traffic [19], which can be used to generate the noise parameter distribution fn(θ). Methods that extract models from sensor network data, such as the one presented in [17], can also be used. Second, in most cases (even those not related to science), we need expectations on the shape of the data, at least so that we can tell whether the sensor is working correctly. This means that we already have a mental model of what is probable and what is improbable. These expectations can be translated into a noise model. Further, a given model can evolve over time: as more is learned about the measured phenomenon, clients may accept a small number of noise model updates.

3. ARCHITECTURE AND IMPLEMENTATION

In this section, we present the architecture we developed for participatory sensing applications and explain how the theory described in the previous section fits into it. We then describe the implementation of our PoolView services.

3.1 General Architecture

The centerpiece of our architecture for participatory sensing is the privacy firewall, which controls the release of a user's private data.

A data owner will typically store their private data inside the firewall, which introduces a need for a private storage layer. The owner may be in possession of multiple (trusted) sensors that contribute data to her private storage. These devices collectively form the sensing layer. In the context of this paper, we assume that the sensors an individual owns generate time-series data. Thus, for a user i, a given sensor generates xi = (x_1^i, x_2^i, . . . , x_M^i). For example, a person on a diet might use a Bluetooth scale that associates with their cell phone to upload daily weight measurements, which are then stored in the user's private storage (a prototype of this system was built by Motorola). Outside the privacy firewall are the aggregation services that collect perturbed data from multiple users to compute community information. These services form the information distillation layer.

The basic function of the privacy firewall is to screen or perturb user data so as to preserve the privacy of the data streams that the user owns. A privacy table is the central data structure of the firewall. It can be thought of as a two-dimensional array whose dimensions are (i) aggregation services and (ii) data types. A cell corresponding to a given service and data type contains a pointer to the corresponding perturbation model.
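As an illustration, the privacy table can be viewed as a map keyed by (service, data type). The service names and model identifiers below are hypothetical placeholders, not part of PoolView's actual configuration:

```python
# Minimal sketch of the privacy table: a two-dimensional map from
# (aggregation service, data type) to the perturbation model applied
# when that service requests that data type. Names are hypothetical.
privacy_table = {
    ("traffic-analyzer", "gps-speed"): {"model": "sum-of-sinusoids",
                                        "params": "private"},
    ("diet-tracker", "weight"): {"model": "weight-trend",
                                 "params": "private"},
}

def perturbation_model(service, data_type):
    """Look up the perturbation model; None means do not share."""
    return privacy_table.get((service, data_type))

model = perturbation_model("traffic-analyzer", "gps-speed")
```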

To estimate the community distribution fk(x), an aggregation server needs the perturbed distribution fk(y) and the blurring (noise) kernel H. Since each client shares the perturbed time-series data (i.e., yi), the server knows fk(y). Also, recall from Section 2 that the blurring kernel H is obtained from the noise distribution fk(n). Since the noise distribution is known to the server, the server has the requisite information to estimate the community distribution, as described in Section 2.

Our service architecture is pictorially represented in Figure 1. The figure depicts the four layers of the architecture (namely, the sensor layer, the private storage layer, the privacy firewall layer, and the information distillation layer). It is useful to divide sensors in the sensor layer into those that have Internet connectivity, such as sensors in 3G cell phones, and those that do not. The latter would typically use proprietary or custom protocols to interface with a gateway that has Internet connectivity. Such protocols may support disruption-tolerant operation, where data accumulates while the sensor is disconnected and is uploaded in bulk when the sensor and gateway come into contact. This, for example, is the case with the Smart Attire sensors described in [14], as well as with the vehicular study reported in the evaluation section.

Figure 1: PoolView architecture

3.2 PoolView Implementation

In this section, we describe the implementation details of our architecture. We implemented the architecture and instantiated it by deploying two applications: one that computes traffic statistics and one that computes weight statistics.

The fundamental tie that links the layers of the information distillation architecture together is our data stream abstraction. A PoolView data stream is a generic time-indexed collection of data items from a given source. The stream is characterized by a data owner (the source), data object type, location, and (start and end) time of collection, as well as a sampling frequency if applicable. A stream may contain one or more data values. These stream attributes are identified using XML tags.

The interfaces between layers in our architecture are implemented over HTTP. Namely, Internet-ready sensors and gateways in the sensor layer employ HTTP PUT commands to deposit data in storage servers. Note that, in this context, the storage server can be a simple laptop that the user owns. Users access information distillation servers using regular HTTP GET commands. In exchange for information, a distillation server requests relevant (perturbed) user data by issuing an HTTP POST query to the user's storage server. Users can also register with a distillation server to be polled for new data periodically, or have the distillation server poll them on demand (e.g., each time their storage server is updated). The POST query is essentially an SQL query expressed in XML, to remove implementation dependencies between client and server.

The HTTP POST query is intercepted by the privacy firewall, which perturbs the requested data before sharing it. This arrangement allows us to use standard web servers to implement information distillation, including the perturbation and aggregation functions. A new HTTP data type is defined that characterizes a PoolView stream. When the web server receives a GET, POST, or PUT command on such a data type, it invokes a (plug-in) handler for that data type. Hence, all that is needed to implement the scheme is to write the plug-ins. For example, in an Apache server, this corresponds to writing a module that handles the PoolView stream data type. In our current implementation, we use an Apache HTTP server with the PUT and POST handlers written as modules in Perl. The privacy firewall components are also implemented as Perl modules. The mathematically intensive operations on the client side, such as those that require curve fitting and probabilistic estimation, are implemented in SciLab [33]. Java wrappers for SciLab are used to integrate the SciLab modules with Perl. Our implementation of the data storage server includes, in addition to the HTTP server, a MySQL database server for the storage and retrieval of personal data. A detailed document describing the protocol is available for reference purposes [31].

4. CASE STUDIES

In this evaluation section, we present two case studies with PoolView services: a traffic analyzer and a diet tracker.

In both case studies, when an individual user connects to the information distillation server of the corresponding service for community statistics, the server sends an HTTP POST request to the user's personal storage server asking for the requisite data. The request is intercepted by the user's privacy firewall, which validates it by first authenticating that it comes from the correct server and then checking that the server has valid access rights to the requested data. Data are then shared in perturbed form. We first discuss the results from the traffic analyzer case study, followed by the diet tracker.

4.1 Traffic Analyzer

The traffic analyzer case study is motivated by the growing deployment of GPS devices that provide location and speed information for the vehicles in which they are installed. Such data can be used to analyze traffic patterns in a given community (e.g., the average speed on a given street between 8am and 9am). Analysis of patterns such as rush hour traffic, off-peak traffic, average delays between key points in the city as a function of time of day and day of the week, and average speeding statistics on selected streets can shed light on traffic safety and congestion, both at a given point in time and historically over a large time interval.


With the above in mind, note that the aim of this evaluation section is to study the performance of the perturbation techniques; it is not the goal of this paper to study traffic comprehensively in a given city. Hence, we picked two main streets whose traffic characteristics we study for illustration. To emulate a community of users, we drove on these streets multiple times (in our experiments, the authors took turns driving these streets at different times of day). We collected data for a community of 30 users. We used a Garmin Legend [15] GPS device to collect location data. The device returns a track of GPS coordinates. The sampling frequency used in our experiments was one sample every 15 seconds. Each trip represented a different user for our experimentation purposes. The stretch of each of the two roads driven was about 1.3 miles. Data were collected in the morning between 10 am and 12 noon, as well as in the evening between 4 pm and 6 pm.

In a more densely deployed system, the assumption is that data will be naturally available from different users driving on these city streets at different times of day over a period of weeks. Such data may then be shared retroactively for different application purposes. For example, individuals interested in collecting data on traffic enforcement might collect and share speeding statistics on the different city streets or freeways they travel (e.g., what percentage of the time, where, and by how much traffic speed exceeds posted signage). Such statistics may come in handy when an individual travels to a new destination. Since speeding is a private matter, perturbation techniques are applied prior to sharing.

For the purposes of this paper, we call the two streets from which we collected data Green Street and University Avenue. The aggregation server divides city streets into small segments of equal length. The average speed on each segment is calculated from perturbed user data.

4.1.1 Generating the Noise Model

In order to employ our perturbation scheme, we need a noise model. Since the GPS data is collected at a very low frequency (one sample every 15 seconds), speed may change dramatically between consecutive data points. Figure 2 shows the real speed curve of one user on Green Street in the morning. We model the speed curve of each user as the sum of several sinusoidal signals (observe that any waveform can be expressed as a sum of sinusoids via the Fourier transform). For simplicity, we use six sinusoids that represent the common harmonics present in the natural speed variations of city traffic. The noise model is therefore as follows:

f(k) = a0 + Σ_{i=1}^{6} ai sin(bi·k + ci)   (25)

The speed model in Equation (25) is characterized by 19 parameters. Once the model for the speed is obtained, we need to model the distribution of all 19 parameters such that the speed streams generated by the model have the same dynamics as the real speed curves. The service developer collects a few speed measurements empirically (which is what we did), takes that small number of real speed curves, and uses MMSE curve fitting to find the range of each parameter. This is the approach we used to obtain the distribution of the parameters. The distribution of each parameter was then chosen to be uniform within the obtained range. A sample speed curve is shown in Figure 2.
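Client-side noise generation for this model can be sketched as follows; the parameter ranges below are hypothetical stand-ins for the uniform ranges a server would announce:

```python
# Sketch of client-side noise generation for the traffic service: each
# user privately samples the 19 parameters of the sinusoidal model of
# Eq. (25) from server-announced ranges, then perturbs the speed stream.
# The ranges and speed values here are hypothetical.
import math
import random

def make_noise_stream(n_samples, a0_range, a_range, b_range, c_range,
                      rng=random):
    """Generate one noise stream n(k) = a0 + sum_i ai*sin(bi*k + ci)."""
    a0 = rng.uniform(*a0_range)
    params = [(rng.uniform(*a_range), rng.uniform(*b_range),
               rng.uniform(*c_range)) for _ in range(6)]
    return [a0 + sum(a * math.sin(b * k + c) for a, b, c in params)
            for k in range(n_samples)]

random.seed(42)
speeds = [28.0, 27.5, 26.0, 24.0, 25.5, 27.0]          # real speeds (mph)
noise = make_noise_stream(len(speeds), (-2, 2), (0, 3), (0.1, 1.0),
                          (0, 2 * math.pi))
perturbed = [x + n for x, n in zip(speeds, noise)]      # shared with server
```

Only the perturbed list leaves the privacy firewall; the sampled parameters and the noise stream itself remain private to the client.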

Having produced an approximate noise model, the aggregation server announces the model information (structure and parameter distribution) to the users. Participating users use this information to choose their private noise parameters and generate their noise streams using client-side software (which includes a generic function generator in the privacy firewall). Each user's individual speed data is perturbed by the generated noise and sent to the aggregation server when the user connects to it. Typical perturbed data is shown in Figure 2.

Figure 2: Graph showing the real speed, noise, and perturbed speed curves for a single user

Figure 3: Graph showing the reconstructed community average speed vs. distance for a population of 17 users

4.1.2 Reconstruction Accuracy

In order to compute the community average, the noise distribution at each time instance k must be available to the aggregation server. Obtaining the exact noise distribution at each time k given the parameters ai, bi, and ci can be difficult. Therefore, we approximate the noise distributions by generating a number of noise curves (10,000 samples) following the model in Equation (25) and computing normalized histograms of the noise values at each time instance. These histograms are approximations of the noise distributions.
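This Monte Carlo step can be sketched as follows; for brevity, a single-sinusoid model stands in for the six-sinusoid model of Equation (25), and the parameter ranges and bin edges are hypothetical:

```python
# Sketch of the server-side Monte Carlo step: approximate the noise
# distribution at each time instance k by sampling many noise curves
# from the announced model and histogramming the values per k.
# Simplified one-sinusoid model; all ranges are hypothetical.
import math
import random

def sample_noise_curve(n_samples, rng):
    a = rng.uniform(0, 3)
    b = rng.uniform(0.1, 1.0)
    c = rng.uniform(0, 2 * math.pi)
    return [a * math.sin(b * k + c) for k in range(n_samples)]

def noise_histograms(n_samples, n_curves, lo, hi, n_bins, seed=0):
    """Return, for each k, a normalized histogram of noise values."""
    rng = random.Random(seed)
    hists = [[0] * n_bins for _ in range(n_samples)]
    width = (hi - lo) / n_bins
    for _ in range(n_curves):
        for k, v in enumerate(sample_noise_curve(n_samples, rng)):
            b = min(max(int((v - lo) / width), 0), n_bins - 1)
            hists[k][b] += 1
    return [[cnt / n_curves for cnt in h] for h in hists]

hists = noise_histograms(n_samples=5, n_curves=10000, lo=-3.0, hi=3.0,
                         n_bins=12)
```

Each per-k histogram discretizes fk(n) and therefore directly supplies the entries of the blurring kernel H used in the reconstruction.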

To show the reconstruction accuracy of the community reconstruction method developed in Section 2.3, the computed community average speed curve for each street is presented in Figure 3. Even with a very small community population (17 users), the community average reconstruction still provides a fairly accurate estimate (the average error at each point is 1.94 mph).

Next, we plot the community average reconstruction accuracy versus the scaling factor A and the community population N, as shown in Figure 4(a). First, we examine the reconstruction error with respect to the scaling factor A, chosen from {1, 10, . . . , 100}. It is theoretically shown in Section 2.3 that the reconstruction error increases linearly with A^(2ν/(1+2ν)); thus, we should expect a linear error curve. This is verified in Figure 4(a). The errors computed in this paper are normalized by dividing the mean squared error by the number of data points, which can be interpreted as the average error per reconstructed point. In this experiment, when A = 80, the normalized error is 8 mph, which is about one fourth of the average speed (30 mph). This may be unacceptable in some applications; thus, the scaling factor must be chosen with care.

Now, we compare the theoretical error bound derived in Section 2.5 with the empirical reconstruction error found above. Typical values for the reconstruction parameters are ν = 2, M = 0.01, and ||f̃k(x) − fk(x)||1 = 0.5. The noise norm ||fk(n)|| can be bounded by ||fk(n)||max = 1.508 (since we know the distribution of the noise at all time instances). The upper bound on the normalized error is also presented in Figure 4(a).

Figure 4: Figures showing reconstruction error vs. scaling factor A (a) and population of community (b)

Next, we examine the reconstruction error versus the community population. Since our actual collected data was limited, we emulated additional user data by taking random linear combinations of data from real users. Figure 4(b) shows the normalized reconstruction error versus the community population. In this experiment, the scaling factor is fixed at A = 1. We observe that the error decreases exponentially with the number of users. In addition, the normalized error for a population N = 100 is about 0.05, which means the average error for each reconstructed point is 0.05 mph. This very small error suggests that our proposed reconstruction method can be used even in a small community. In the above graphs, we plot the reconstruction errors for both Green Street and University Avenue.

In our next experiment, we compute the reconstructed community speed distribution at a given location on University Avenue. In order to estimate the distribution with high accuracy, the community population must be large. Therefore, we emulated additional user data using the same method as in our previous experiment. The real community speed distribution is shown in Figure 5(a). The reconstruction method discussed in Section 2.4 is used to estimate the community speed distribution from the perturbed community data, with the result shown in Figure 5(b).

Figure 5: Figures showing the real (a) and reconstructed (b) community speed distributions for a population of 200 users

4.1.3 A Privacy Evaluation

In this section, we analyze the degree to which an individual user's data can be revealed in our scheme. Specifically, we choose the PCA method to obtain an estimate of the original user data from the perturbed data. The PCA method is usually very effective at reconstructing original data from data perturbed with additive noise. Figure 6 shows the real speed data of one user, the perturbed data, and the reconstructed data for the perturbation method developed in this paper. We observe that the reconstructed data is closer to the perturbed data and has very little correlation with the real data. We conclude from this plot that PCA is not an effective exploitation method against our perturbation scheme. Figure 7 shows what happens if white noise is used to perturb the data instead of the technique proposed in this paper. As the figure shows, in this case PCA can reconstruct the original data curve accurately. This demonstrates the contribution of our proposed noise model generation.

Figure 6: PCA-based reconstruction of average speed for one user when noise based on our method is used for perturbation (speed in mph vs. distance; curves: original, PCA reconstruction, perturbed using our scheme).

For a more accurate evaluation, PCA is applied to the data of all users on both Green Street and University Avenue. Then the standard deviation of the errors, which are


Figure 7: PCA-based reconstruction of average speed for one user when white noise is used for perturbation (speed in mph vs. distance in meters; curves: original, PCA reconstruction, perturbed using white noise).

the difference between the reconstructed data (using PCA) and the real speed data, are calculated. The standard deviations for Green Street in the morning and evening were 12.56 and 12.68 mph, respectively, and those for University Avenue in the morning and evening were 15.56 and 10.47 mph, respectively. We notice that the average error in all cases is more than 10 mph, which is high in comparison with the average speed of 40 mph.

4.1.4 Coping with Malicious Servers

This section evaluates the techniques we developed in Section 2.7 to deal with malicious servers. A malicious server is one that “cheats” by announcing a poor noise model in an attempt to obtain poorly perturbed user data such that user privacy can be violated. Since the server shares both the noise model structure and parameter distributions, “cheating” can occur either by sending the wrong noise model (a model of an incompatible structure) or by sending a good model with bad parameter distributions (so that the noise curve can be easily estimated).

First, we consider an instance of a malicious server that sends a wrong noise model (for traffic data). Assume that the model sent by the server to users is a linear one, y(k) = ak + b, where a and b are two random variables uniformly distributed between -0.1 and 0.1. With this linear model, the server can easily compute an individual user's data trends. At the user side, the model is checked using the malicious server detection method discussed in Section 2.6. It is important for the user to choose an appropriate threshold p1 (the acceptable fitting error). Too small a p1 may cause all servers to be rejected (including good ones). Too large a p1 may cause malicious server noise models to be accepted. In Figure 8(a), the acceptance rate of both malicious and good servers is plotted against the threshold p1. We observe from this figure that a safe range in this case is 0.1 ≤ p1 ≤ 0.3. We can also estimate p1 using the method proposed in Section 2.7. For the “good” model, ||g(k)||min and ||g(k)||max can be computed since we know the range of all parameters: ||g(k)||min = 0.616 and ||g(k)||max = 1.508. The data of the user in this experiment has norm ||x|| = 1.269. Thus, p1 = 0.23 is a good estimate.
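Section 2.6's exact test is not reproduced here; the sketch below captures the idea we evaluate: least-squares-fit the announced model family to the user's own trace and accept only if the normalized fitting error is within p1, so that an expressive (good) model passes while a rigid (e.g., linear) one is rejected. Names are ours:

```python
import numpy as np

def check_server_model(x, model_basis, p1):
    """User-side check of a server-announced noise model: fit the
    model family to the user's own data and accept only if the
    normalized fitting error is at most p1, i.e. the model is
    expressive enough for its noise to mask data like ours.

    x: user data, shape (n_samples,).
    model_basis: (n_samples, n_params) design matrix of the
    announced model family (a column per basis function)."""
    coef, *_ = np.linalg.lstsq(model_basis, x, rcond=None)
    residual = x - model_basis @ coef
    err = np.linalg.norm(residual) / np.linalg.norm(x)
    return err <= p1
```

For example, a zero-mean sinusoidal trace is poorly fit by the linear family y(k) = ak + b (normalized error near 1), so a server announcing it is rejected at p1 = 0.23, while a sinusoidal family of the right period is accepted.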

Second, consider a malicious server that sends a noise model of acceptable structure but with a bad parameter

Figure 8: Coping with a malicious server: (a) acceptance rate vs. threshold p1, and (b) acceptance rate vs. log-probability threshold p2, for both a malicious and a good server.

distribution. In this experiment, the distribution of each of the 19 parameters of our multi-sinusoid noise model is chosen to be Gaussian with the same mean as in a “good” model, but with a very small variance, σ = 0.1. Because σ is very small, the parameters drawn from this distribution are almost the same as their means. Hence, the noise curve can be easily predicted in most cases. Figure 8(b) shows the acceptance rate of the good and malicious servers versus the value of threshold p2 (on the probability that the user's data may have come from the server-supplied noise model). For computational convenience, the log of the actual probability is used as the threshold. A lower threshold is more permissive in that it accepts models that do not fit user data with high probability. From the figure, the safe range for the threshold is −50 ≤ p2 ≤ −5, which is very wide. Thus, choosing a good p2 is easier than choosing a good p1.
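The second check can be caricatured as follows. Section 2.7's exact probability computation is not reproduced here; we substitute a Monte Carlo kernel-density score of our own design: sample noise curves from the announced parameter distribution and score how plausible data like the user's is under them. A near-deterministic parameter distribution concentrates all curves in one spot, so the score collapses:

```python
import numpy as np

def log_plausibility(x, sample_noise, n_draws=500, rng=None):
    """Monte Carlo stand-in for the p2 check: estimate the log of a
    kernel-density score of the user's trace x under noise curves
    sampled from the server's announced parameter distribution.
    The server is accepted if the score is at least p2.

    sample_noise: callable(rng) -> one noise curve, same shape as x."""
    rng = np.random.default_rng(rng)
    draws = np.stack([sample_noise(rng) for _ in range(n_draws)])
    # Bandwidth from the internal spread of the sampled curves: a
    # spiky (near-deterministic) distribution gets a tiny bandwidth,
    # so typical user data scores very low under it.
    center = draws.mean(axis=0)
    spread = ((draws - center) ** 2).sum(axis=1).mean() + 1e-9
    sq = ((draws - x) ** 2).sum(axis=1)
    return float(np.log(np.mean(np.exp(-sq / (2.0 * spread))) + 1e-300))
```

With a wide (good) parameter distribution the score stays moderate, while a tiny-variance (malicious) distribution drives it far below any reasonable p2, mirroring the wide safe range observed in Figure 8(b).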

4.1.5 Coping with Malicious Users

Finally, we analyze the effect of malicious users on the accuracy of community average reconstruction. Observe that there is fundamentally no way to ascertain that the user-supplied sensory data is accurate. In the weight-watchers case, for instance, even if the scale could somehow authenticate the user, and even if the system could authenticate the scale, there is nothing to prevent the user from climbing on the scale with a laptop or other materials, causing the reading to be incorrect. The system will work only if some motivation exists in the community to find out the real community data. We assume that such a motivation exists for a group of self-selected participants genuinely interested in the overall statistic. The question is: how many malicious users (who purposely falsify their data) can the statistic withstand before becoming too inaccurate?

For this purpose, we generate a large community (N = 1000) and vary the number of malicious users. Each malicious user generates their data according to a uniform distribution between 0 and Range. We are also interested in how the range of the malicious data affects the reconstruction accuracy. Figure 9 plots the reconstruction error versus the percentage of malicious users for different ranges. The results show that the reconstruction error increases linearly, but very slowly, with the percentage of malicious users. In addition, the range of the malicious data has no effect on the reconstruction error. Thus, malicious users have very little impact on the overall community reconstruction.
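A toy version of this experiment is sketched below. It looks only at how injected uniform data shifts a plain community average; the paper's full method reconstructs from perturbed data and is distribution-based, so this illustrates just the aggregation step (names are ours):

```python
import numpy as np

def mean_shift_from_malicious(honest, n_malicious, data_range, rng=None):
    """Toy model of how malicious users bias a community average:
    append n_malicious readings drawn uniformly from [0, data_range]
    to the honest data and return the absolute shift of the mean.

    honest: array of honest readings, shape (n_honest,)."""
    rng = np.random.default_rng(rng)
    fake = rng.uniform(0.0, data_range, n_malicious)
    combined = np.concatenate([honest, fake])
    return abs(combined.mean() - honest.mean())
```

The shift is roughly (m / (N + m)) times the gap between the malicious mean and the honest mean, which is why the error in Figure 9 grows linearly but slowly with the malicious fraction.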


Figure 9: Normalized reconstruction error vs. number of malicious users, for Range = 20, 40, 60, and 80.

4.2 Diet Tracker

The diet tracker case study is motivated by the numerous weight-watchers and diet communities that exist today. An individual on a particular diet monitors her weight on a periodic basis, perhaps by taking a weight measurement once a day. This individual would likely be interested in comparing her weight loss to that of other people on a diet in order to get feedback regarding the effectiveness of the diet program she is following. However, the person would like to do so in a manner that keeps her weight data private.

In the Traffic Analyzer application, to the best of the authors' knowledge, there is no good speed model for a vehicle on a city road. Thus, the speed is modeled in a semi-empirical way. In many other applications, however, accurate data models are well known and hence can be used to provide more privacy. The Diet Tracker application is one such example. Several models for weight loss and dieting have been proposed in existing literature [26, 13, 4, 6]. We adopt the model proposed in [26], which is a non-linear model described by Equations (1) and (2). These equations are used to generate the noise stream.

In our deployment, we recorded the weight of a single user once each day over the course of sixty days. We generate the parameters for a typical user based on the data from our deployment and use these to emulate multiple users.

The parameters for this model include λ, β, and W0. The ranges of λ and β can be found in [26]. The range of the initial weight W0 can be taken as the weight of a normal adult, which is from 80 pounds to 210 pounds. The simplest distribution for these parameters is uniform within their respective ranges. Samples of the real weight data, the noise, and the perturbed data are shown in Figure 10.

In this application, we demonstrate a different way of perturbing the user data, but use the same algorithm to reconstruct the community distribution. Given the generated noise n and the data x, the perturbed data is generated as y = Ax + Bn + C. In this type of perturbation, A, B, and C are random variables whose distributions are known to the aggregation server and the users. The reconstruction of the community distribution can be done in a two-step process:

• Reconstruct the distribution of Ax by considering Bn + C as noise, then compute the distribution of log(Ax).

Figure 10: Real weight, noise, and perturbed weight of a single user (weight in pounds vs. time in days).

Figure 11: Results of the PCA reconstruction scheme on a single user (real, perturbed, and reconstructed weight in pounds vs. time in days).

• Because log(Ax) = log(A) + log(x), we can reconstruct the distribution of log(x) using the distribution of log(Ax) found above and the distribution of log(A). Finally, compute the distribution of x from the distribution of log(x).

Note that the transformation of random variables by log and exp is straightforward because both functions are monotonic. The reconstruction method used in each step is the same as the method discussed in Section 2.4. Figures 12(a) and 12(b) plot the original weight distribution and the reconstructed weight distribution using the above method, respectively. In this experiment, we use the same method described in the Traffic Analyzer application to generate a large community (500 users). For simplicity, we choose C = 0; A and B are drawn from a uniform distribution between 0 and 10. We observe from the figures that the reconstructed community distribution is very close to the real distribution, which suggests that the two-step reconstruction is also a good reconstruction method.
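For concreteness, the perturbation used in this experiment can be sketched as below (C = 0, with A and B uniform on (0, 10) as described above; the drawn values of A and B are private to the user while their distributions are public). The log identity underlying step two holds elementwise, since A and x are positive:

```python
import numpy as np

def perturb_weight(x, noise, rng=None):
    """Diet Tracker perturbation y = A*x + B*n + C with C = 0.

    x: weight trace; noise: model-generated noise trace.
    A and B are drawn once per user from U(0, 10); only their
    distributions are shared with the aggregation server."""
    rng = np.random.default_rng(rng)
    a = rng.uniform(0.0, 10.0)
    b = rng.uniform(0.0, 10.0)
    return a * x + b * noise
```

Since log(Ax) = log(A) + log(x), the distribution of log(x) can be recovered by deconvolving log(A) out of log(Ax), then mapped back through exp.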

We observe from Figure 10 that the perturbed data contains a number of high-frequency components, so it is natural to ask whether the user data can be revealed using filtering techniques. We apply the PCA reconstruction method (the same method used in the Traffic Analyzer application) to reconstruct an individual user's data. In order to employ PCA, we generated a virtual community containing 1000 users, where each user sends their perturbed data to the


Figure 12: Real (a) and reconstructed (b) community weight distributions (probability vs. weight in pounds).

aggregation server. Figure 11 shows the real weight data, the perturbed weight, and the weight reconstructed using PCA for a single user. The result shows that the reconstructed curve follows the perturbed data rather than the real data. Thus, filtering techniques again do not work against our perturbation scheme.

In conclusion, the empirical studies in this section confirm the robustness of our perturbation technique. In the two applications, the server successfully recovered the community information (the average and the distribution), while user privacy was preserved against both traditional attacks (filtering) and specialized attacks (MMSE). Our proposed techniques also provide a means to detect malicious servers and offer flexibility by supporting multiple ways of perturbing user data.

5. RELATED WORK

Participatory sensing applications have recently been described as an important emerging category of future sensing systems [1]. Early applications have already been published, including a participatory sensor network to search for and rescue hikers in the mountains [20], vehicular sensor networks such as CarTel [22], which deployed sensor nodes in cars, sensor networks embedded in client attire [14], cyclist networks (BikeNet) [10], cellphone camera networks for sharing diet-related images (ImageScape) [32], and cellphone networks for media sharing (MMM2) [8]. These applications, however, do not explicitly address privacy concerns.

An architecture for participatory sensing, called Partisans, has been proposed in [29]. In that paper, the main challenges addressed are those of data verifiability and privacy. In contrast to our work, the approach assumes a trusted third party. A similar trust model was assumed in [25].

Several privacy-preserving data perturbation, analysis, and mining techniques have been developed in past literature. We classify past work into three broad categories: (i) random perturbation, (ii) randomized response, and (iii) secure multi-party computation. The techniques presented below can be leveraged in future incarnations of our architecture.

One of the first privacy-preserving data perturbation techniques was proposed in [3]. In this technique, each client has a single numerical data item xi, perturbed by adding a random number ri drawn independently from a known distribution. The distribution of the community can be reconstructed using Bayes' rule to estimate a posterior distribution function. The work in [3] was extended by the authors of [2], where a reconstruction algorithm was proposed that converges to the maximum likelihood estimate of the original distribution. Several papers [24, 21, 28] extended the technique presented in [3]. These papers show that privacy breaches occur under certain conditions when the randomized perturbation approach is used, and they develop solutions to prevent such breaches. However, these methods do not address the problem of privacy in time-series data.
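The scheme of [3] and its iterative reconstruction can be sketched as follows (a binned version; variable names and the Gaussian noise choice are illustrative, not from the original papers):

```python
import numpy as np

def additive_perturb(x, noise_std, rng=None):
    """Each client releases x_i + r_i, with r_i drawn i.i.d. from a
    publicly known distribution (Gaussian here), as in [3]."""
    rng = np.random.default_rng(rng)
    return x + rng.normal(0.0, noise_std, size=np.shape(x))

def reconstruct_distribution(y, bins, noise_pdf, n_iter=50):
    """Iterative Bayesian reconstruction of the community
    distribution from perturbed values y: the scheme of [3], which
    [2] showed converges to the maximum likelihood estimate of the
    original distribution. noise_pdf: vectorized noise density."""
    centers = (bins[:-1] + bins[1:]) / 2.0
    f = np.full(len(centers), 1.0 / len(centers))  # uniform prior
    # Noise density evaluated at y_i - a for every (sample, bin) pair.
    lik = noise_pdf(y[:, None] - centers[None, :])
    for _ in range(n_iter):
        post = lik * f                        # Bayes numerator
        post /= post.sum(axis=1, keepdims=True)
        f = post.mean(axis=0)                 # updated estimate
    return centers, f
```

Each iteration averages the per-sample posteriors to refine the prior; the fixed point deconvolves the known noise out of the observed mixture.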

A technique for privacy-preserving data clustering was developed in [27]. In that paper, a rotation-based data perturbation function is used to hide individual data. Due to the nature of the rotation matrix, the clusters are computed correctly despite randomly displacing the individual data points. It is also possible to do classification and association-rule mining (see footnote 3) in a privacy-preserving manner. For example, a privacy-preserving association-rule mining algorithm is presented in [12]. A geometric rotation-based approach for data classification is presented in [7]. This work does not address time-series data.

Randomized response techniques were first introduced by Warner [35] as early as 1965 to estimate the percentage of people in a given population that have a sensitive attribute X. The idea behind these models is to ask questions that do not reveal any private information. This idea was extended in [11] to preserve privacy while mining categorical data (instead of numerical data), and to multiple-attribute data sets in [9]. The randomized response technique does not use perturbation to achieve privacy, and the above work does not address privacy in time-series data.

Secure multi-party computation addresses computing functions of private variables, where each member of the community knows (and must keep private) one of the variables. A comprehensive treatise on the basic results of secure multi-party computation is presented in [16]. The problem with this approach is its significant overhead: it requires a large number of pairwise exchanges between users in the community. Such exchanges do not scale when users are not simultaneously available to complete the computation, or when the number of users involved changes dynamically.

6. CONCLUSIONS

In this paper, we presented an architecture and perturbation algorithms for stream privacy. The architecture ensures the privacy of individual user data while allowing community statistics to be constructed. It is geared toward participatory sensing applications where community members may set up data aggregation services to compute statistics of interest. In such scenarios, the existence of mutual trust or a trust hierarchy cannot always be assumed. Hence, our data perturbation techniques allow users to perturb private measurements before sharing. The techniques address the special requirements of time-series data; namely, the fact that data are correlated. Correlation makes it possible to attack privacy. A correlated noise model is proposed and implemented. It is shown that community data can be reconstructed accurately while individual user data cannot. In future research, we shall extend the architecture to

Footnote 3: We refer the reader to Chapters 5 and 6 of Han and Kamber [18] for the definitions of classification and association-rule mining.

293

Page 14: PoolView: stream privacy for grassroots participatory sensing

address privacy issues when multi-modal data are shared. In other words, multiple streams are shared by each client, and such streams may be mutually correlated. Context privacy (not addressed in this work) will also be investigated. Further, we will address the problem of model checking for clients with data outliers.

7. ACKNOWLEDGMENTS

The authors thank the shepherd, Professor Nirupama Bulusu, and the anonymous reviewers for providing valuable feedback that improved the paper. The work reported in this paper is funded in part by Microsoft Research and NSF grants CNS 06-26342, CNS 06-15318, and DNS 05-54759.

8. REFERENCES

[1] T. Abdelzaher et al. Mobiscopes for human spaces. IEEE Pervasive Computing, 6(2):20–29, 2007.

[2] D. Agrawal and C. C. Aggarwal. On the design and quantification of privacy preserving data mining algorithms. In Proc. of ACM Principles of Database Systems, pages 247–255, 2001.
[3] R. Agrawal and R. Srikant. Privacy preserving data mining. In Proc. of ACM Conf. on Management of Data, pages 439–450, May 2000.
[4] S. S. Alpert. A two-reservoir energy model of the human body. The American Journal of Clinical Nutrition, 32(8):1710–1718, 1979.
[5] J. Burke et al. Participatory sensing. Workshop on World-Sensor-Web, co-located with ACM SenSys, 2006.
[6] C. C. Chow and K. D. Hall. The dynamics of human body weight change. PLoS Computational Biology, 4(3):e1000045, March 2008.
[7] K. Chen and L. Liu. Privacy preserving data classification with rotation perturbation. In Proc. of IEEE International Conference on Data Mining, pages 589–592, 2005.
[8] M. Davis et al. MMM2: Mobile media metadata for media sharing. In CHI Extended Abstracts on Human Factors in Computing Systems, pages 1335–1338, 2005.
[9] W. Du and Z. Zhan. Using randomized response techniques for privacy-preserving data mining. In Proc. of ACM SIGKDD Conf., pages 505–510, 2003.
[10] S. B. Eisenman et al. The BikeNet mobile sensing system for cyclist experience mapping. In Proc. of SenSys, November 2007.
[11] A. Evfimievski. Randomization in privacy preserving data mining. ACM SIGKDD Explorations Newsletter, 4(2):43–48, December 2002.
[12] A. Evfimievski, J. Gehrke, and R. Srikant. Limiting privacy breaches in privacy preserving data mining. In Proceedings of the SIGMOD/PODS Conference, pages 211–222, 2003.
[13] G. B. Forbes. Weight loss during fasting: Implications for the obese. The American Journal of Clinical Nutrition, 23(9):1212–1219, September 1970.
[14] R. K. Ganti, P. Jayachandran, T. F. Abdelzaher, and J. A. Stankovic. SATIRE: A software architecture for smart attire. In Proc. of ACM MobiSys, pages 110–123, 2006.
[15] Garmin eTrex Legend. www8.garmin.com/products/etrexlegend.
[16] O. Goldreich. Secure multi-party computation (draft). Technical report, Weizmann Institute of Science, 2002.
[17] C. Guestrin et al. Distributed regression: An efficient framework for modeling sensor network data. In Proc. of IPSN '04, pages 1–10, April 2004.
[18] J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, second edition, 2006.
[19] M. Herty, A. Klar, and A. K. Singh. An ODE traffic network model. J. Comput. Appl. Math., 203(2):419–436, 2007.
[20] J.-H. Huang, S. Amjad, and S. Mishra. CenWits: A sensor-based loosely coupled search and rescue system using witnesses. In Proc. of SenSys, pages 180–191, 2005.
[21] Z. Huang, W. Du, and B. Chen. Deriving private information from randomized data. In Proc. of ACM SIGMOD Conference, pages 37–48, June 2005.
[22] B. Hull et al. CarTel: A distributed mobile sensor computing system. In Proc. of SenSys, pages 125–138, 2006.
[23] M. G. Kang and A. K. Katsaggelos. General choice of the regularization functional in regularized image restoration. IEEE Transactions on Image Processing, 4(5):594–602, May 1995.
[24] H. Kargupta, S. Datta, Q. Wang, and K. Sivakumar. On the privacy preserving properties of random data perturbation techniques. In Proc. of IEEE ICDM, pages 99–106, 2003.
[25] A. Krause, E. Horvitz, A. Kansal, and F. Zhao. Toward community sensing. In Proc. of IPSN, 2008.
[26] R. E. Mickens, D. N. Brewley, and M. L. Russell. A model of dieting. SIAM Review, 40(3):667–672, September 1998.
[27] S. R. M. Oliveira and O. R. Zaiane. Privacy preservation when sharing data for clustering. In Proc. of International Workshop on Secure Data Management in a Connected World, pages 67–82, August 2004.
[28] S. Papadimitriou, F. Li, G. Kollios, and P. S. Yu. Time series compressibility and privacy. In Proc. of VLDB '07, pages 459–470, September 2007.
[29] A. Parker et al. Network system challenges in selective sharing and verification for personal, social, and urban-scale sensing applications. In Proceedings of HotNets-V, pages 37–42, 2006.
[30] PoolView. http://smart-attire.cs.uiuc.edu/poolview/.
[31] PoolView Protocol Specifications. http://smart-attire.cs.uiuc.edu/poolview/files/fdtp.pdf.
[32] S. Reddy et al. Image browsing, processing, and clustering for participatory sensing: Lessons from a DietSense prototype. In Proc. of EmNets, pages 13–17, 2007.
[33] SciLab. www.scilab.org.
[34] A. N. Tikhonov and V. Y. Arsenin. Solutions of Ill-Posed Problems. V. H. Winston and Sons, 1977.
[35] S. L. Warner. Randomized response: A survey technique for eliminating evasive answer bias. Journal of the American Statistical Association, 60(309):63–69, March 1965.
