
PRIVGUARD: Privacy Regulation Compliance Made Easier

Lun Wang (UC Berkeley), Usmann Khan (Georgia Tech), Joseph Near (University of Vermont), Qi Pang (Zhejiang University), Jithendaraa Subramanian (NIT Tiruchirappalli), Neel Somani (UC Berkeley), Peng Gao (Virginia Tech), Andrew Low (UC Berkeley), Dawn Song (UC Berkeley)

Abstract

Continuous compliance with privacy regulations, such as GDPR and CCPA, has become a costly burden for companies from small start-ups to business giants. The culprit is the heavy reliance on human auditing in today's compliance process, which is expensive, slow, and error-prone. To address the issue, we propose PRIVGUARD, a novel system design that reduces the required human participation and improves the productivity of the compliance process. PRIVGUARD mainly consists of two components: (1) PRIVANALYZER, a static analyzer based on abstract interpretation for partly enforcing privacy regulations, and (2) a set of components providing strong security protection for the data throughout its life cycle. To validate the effectiveness of this approach, we prototype PRIVGUARD and integrate it into an industrial-level data governance platform. Our case studies and evaluation show that PRIVGUARD can correctly enforce the encoded privacy policies on real-world programs with reasonable performance overhead.

1 Introduction

With the advent of privacy regulations such as the EU's General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA), unprecedented emphasis is put on the protection of user data. This is a positive development for data subjects, but it presents major challenges for compliance. Today's compliance paradigm relies heavily on human auditing and is problematic in two respects. First, it is expensive to hire and train data protection personnel and to rely on manual effort to monitor compliance. According to a report from Forbes [12], GDPR cost US Fortune 500 companies $7.8 billion as of May 25th, 2018. Another report from DataGrail [4] shows that 74% of small- or mid-sized organizations spent more than $100,000 to prepare for continuous compliance with GDPR and CCPA. Second, human auditing is slow and error-prone. The inefficiency of compliance impedes the effective use of data and hinders productivity. Errors made by compliance officers can harm data subjects and result in legal liability.

An ideal solution would enable data curators to easily ensure fine-grained compliance with minimal human participation and to adapt quickly to changes in privacy regulations. A significant amount of academic work seeks to address this challenge [2, 30, 32, 42, 43, 51, 53, 55, 57]. The European ICT PrimeLife project proposes to encode regulations using the PrimeLife Policy Language (PPL) [55] and enforce them by matching the policies against user-specified privacy preferences and triggering obligatory actions when specific behaviors are detected. A-PPL [30] extends the PPL language by adding accountability rules. These two pioneering works play important roles in the exploration of efficient policy compliance. However, as they focus on Web 2.0 applications, they provide limited support for fine-grained privacy requirement compliance in complex data analysis tasks. The SPECIAL project [2] partly inherits the design of the PPL project and, as a result, suffers from similar limitations. The closest work to ours is by Sen et al. [53], who proposed a formal language (LEGALEASE) for privacy policies and a system (GROK) to enforce them. However, GROK uses heuristics to decide whether an analysis process complies with a policy, and human auditing is required to catch false negatives. Thus, effective compliance with privacy regulations at scale remains an important challenge.

PRIVGUARD: Facilitating Compliance. This paper describes a principled data analysis framework called PRIVGUARD that reduces human participation in the compliance process. PRIVGUARD works in a five-step pipeline under the protection of cryptographic tools and trusted execution environments (TEEs).


First, data protection officers (DPOs), legal experts, and domain experts collaboratively translate privacy regulations into a machine-readable policy language. The translation process is application-specific and requires domain knowledge of both the application and the privacy regulation (e.g., mapping legal concepts to concrete fields). The encoded policy is referred to as the base policy. Encoding the base policy is the step that requires the most human effort in PRIVGUARD's workflow.

Second, before the data is collected, the data subjects are aided by a client-side API to specify their privacy preferences. They can either directly accept the base policy or add their own privacy requirements. The privacy preferences are collected together with the data.

Third, data analysts submit programs to analyze the collected data. Analysts are required to submit a corresponding guard policy, no weaker than the base policy, along with their program. Only data whose policy is no stronger than the guard policy can be used.

Fourth, our proposed static analyzer, PRIVANALYZER, examines the analysis program to confirm its compliance with the guard policy. At the same time, the subset of the data whose privacy preferences are no stronger than the guard policy is loaded to conduct the actual analysis.

Finally, depending on the output of PRIVANALYZER, the result will be either declassified to the analyst or guarded by the remaining unsatisfied privacy requirements (called a residual policy).

Extension of LEGALEASE: Encoding Policies. PRIVGUARD is designed to be compatible with many machine-readable policy languages such as [30, 55]. In this work, we instantiate our implementation with LEGALEASE [53] due to its readability and extensibility. We extend LEGALEASE [53] by providing new attribute types, including attributes requiring the use of privacy-enhancing technologies like differential privacy.

PRIVANALYZER: Enforcing Policies. The core component of PRIVGUARD is PRIVANALYZER, a static analyzer that checks the compliance of an analysis program with a privacy policy. PRIVANALYZER performs static analysis of the programs submitted by analysts to check their compliance with the corresponding guard policies.

In contrast to previous approaches relying on access control [28] or manual verification [2, 36, 53], PRIVANALYZER is a novel policy enforcement mechanism based on abstract interpretation [49]. PRIVANALYZER does not rely on heuristics to infer policy or program semantics, and it provides provable soundness for some properties. PRIVANALYZER examines only the program and the policy (not the data), so its use does not reveal the content of the data it protects. Our approach works for general-purpose programming languages, including those with complex control flow, loops, and imperative features. Thus, PRIVANALYZER is able to analyze programs as written by analysts, so analysts do not need to learn a new programming language or change their workflows. We instantiate our implementation with Python, one of the most widely used programming languages for data analysis.

We implemented PRIVANALYZER in about 1400 lines of Python and integrated it into an industrial-level data governance platform to prototype PRIVGUARD. We evaluated the prototype experimentally on 23 open-source Python programs that perform data analytics and machine learning tasks. The selected programs leverage popular libraries like Pandas, PySpark, TensorFlow, PyTorch, Scikit-learn, and more. Our results demonstrate that PRIVGUARD is scalable and capable of analyzing unmodified Python programs, including programs that make extensive use of external libraries.

Contributions. In brief, this paper makes the following contributions.

• We propose PRIVGUARD, a novel framework for privacy regulation compliance that minimizes human effort.

• We propose PRIVANALYZER, a static analyzer based on abstract interpretation for enforcing privacy policies on unmodified analysis programs.

• We implemented PRIVANALYZER for LEGALEASE and Python in about 1400 lines of Python. Our implementation supports commonly-used analysis libraries such as Pandas, Scikit-learn, etc.

• We prototyped PRIVGUARD by integrating PRIVANALYZER into PARCEL, an industrial-level data governance platform. We simulated the execution of PRIVGUARD with up to one million clients; the results show that PRIVGUARD incurs about two minutes of overhead when handling one million clients.

2 PRIVGUARD Overview

In this section, we outline the design and implementation of PRIVGUARD. We first walk through PRIVGUARD using a toy example and then introduce the system design and implementation. Lastly, the threat model and the security of PRIVGUARD are discussed.

2.1 A Toy Example

We use a toy example to demonstrate the workflow of PRIVGUARD and to introduce its main components. A company launches a mobile application and collects user data to help make informed business decisions. To facilitate compliance with privacy regulations, the company deploys PRIVGUARD to protect the collected data.

First, the DPOs, legal experts, and domain experts encode two requirements in the base policy: (1) minors' data shall not be used in any analysis; (2) any statistics on the data shall be protected using differential privacy.
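For concreteness, this base policy could be sketched in the LEGALEASE-style syntax of Section 3.2 roughly as follows; the age threshold and differential privacy parameters are illustrative choices for this toy example, not values fixed by the paper:

ALLOW FILTER age >= 18
  AND PRIVACY DP (1.0, 1e-5)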

Second, the privacy preferences are collected from the data subjects together with the data. Some data subjects (Group 1) trust the company and directly accept the base policy. Some (Group 2) are more cautious and want their zip codes to be redacted before analysis. The others (Group 3) do not trust the company and do not want their data to be used for any purpose other than legitimate interest.

Third, a data analyst wants to survey the user age distribution. She specifies a guard policy requiring that, in addition to the base policy, zip codes shall not be used in the analysis either. The analyst submits a program calculating the user age histogram to PRIVGUARD. She remembers to filter out all records of minors and to redact the zip code field, but forgets to protect the output with differential privacy.

Fourth, PRIVGUARD uses PRIVANALYZER to examine the privacy preferences and loads the data of Groups 1 and 2 into the TEE, as their privacy preferences are no stricter than the guard policy. PRIVGUARD runs the program and saves the resulting histogram. However, after examining the program and the guard policy, PRIVGUARD finds that the program fails to protect the histogram with differential privacy. Thus, the histogram is encrypted, dumped to the storage layer, and guarded by a residual policy indicating that differential privacy must be applied before the result can be declassified.
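In the same illustrative syntax, the residual policy attached to the stored histogram would retain only the unsatisfied requirement, roughly:

ALLOW PRIVACY DP (1.0, 1e-5)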

Lastly, PRIVGUARD outputs the residual policy to the analyst. The analyst, after checking the residual policy, submits a program which adds noise to the histogram to satisfy differential privacy. PRIVGUARD then decrypts the histogram, loads it into the TEE, and executes the program to add noise to it. This time, PRIVGUARD finds that all the requirements in the guard policy are satisfied, so it declassifies the histogram to the analyst.

2.2 System Design

Base Policy Encoding. Encoding the base policy is the step with the most human participation in PRIVGUARD's workflow. The base policy should be designed collaboratively by DPOs, legal experts, and domain experts to accurately reflect the minimum requirements of the privacy regulation. Note that only one base policy is needed for each data collection, and it can be reused throughout all analyses on the data.

(a) General encoding:

ALLOW ROLE
  Physician
ALLOW SCHEMA
  HealthInformation
  AND FILTER age < 90
  AND REDACT zip

(b) Concrete encoding:

ALLOW ROLE
  Physician
ALLOW SCHEMA
  SerumCholestoral
  AND FILTER age < 90

Figure 1: Encoding of several HIPAA requirements.

The purpose of the base policy is two-fold. First, the text version of the base policy is presented to data subjects as the minimum privacy standard before they opt in to the data collection. Second, data analysts need to understand the base policy before conducting analysis. If their analysis satisfies a stricter privacy standard, they can specify their own guard policy to take advantage of more user data.

We demonstrate the encoding process using a subset of the HIPAA safe harbor method¹ (Figure 1). The DPOs and legal experts first encode the regulation in a general way, without considering the concrete data format. As shown in Figure 1a, the first clause (lines 1-2) allows the patient's physician to check his or her data. The second clause (lines 3-6) represents some safe harbor requirements: health information may be released if the subject is under 90 years old and zip codes are removed. Then the DPOs and domain experts map the encoding to a concrete data collection by introducing real schemas and removing unnecessary requirements. For example, in Figure 1b, HealthInformation is replaced with a concrete column name in the dataset, SerumCholestoral, and the last requirement is removed because the dataset does not contain zip codes.

Data & Privacy Preference Collection. Besides the base policy, the data subjects can also specify additional privacy preferences to exercise their rights to restrict processing. These privacy preferences are sent to the data curator along with the data, where they are kept together in the storage layer. To defend against attacks during transmission and storage, the data is encrypted before being sent to the data curator. The decryption key is delegated to a key manager for future decryption (Section 2.3).

A natural question is: "How much expertise is needed to specify privacy preferences in LEGALEASE?" Sen et al. [53] conducted a survey targeting DPOs and found that the majority were able to correctly encode policies after training. To complement their survey and better understand how much expertise is needed, we conducted an online survey targeting general users without training.

¹ Recent research has shown that the approach prescribed in HIPAA does not really protect the privacy of individuals. In the future, we expect that many data subjects will add a PRIVACY attribute requiring the use of a provable privacy technology like differential privacy.


The survey reveals two interesting facts: (1) there is a significant positive correlation between the difficulty of understanding and encoding LEGALEASE policies and the user's familiarity with other programming languages; (2) most users cannot correctly understand privacy techniques such as differential privacy without training. Based on these observations, we strongly recommend that users without programming experience directly accept the base policy instead of encoding their own. Although out of scope, we consider it an important future direction to simplify privacy preference specification by developing more user-friendly UIs and translation tools. The details of the survey are deferred to Appendix B.

Analysis Initialization. To initialize an analysis task, the analyst needs to submit (1) the analysis program and (2) a guard policy to PRIVANALYZER. A guard policy should be no weaker than the base policy in order to satisfy the minimum privacy requirements. The stricter the guard policy is, the more data can be used for analysis.

PRIVANALYZER Analysis. After receiving the submission, PRIVANALYZER loads the privacy preferences from the storage layer and compares them with the guard policy. Only the data with preferences no stricter than the guard policy is loaded into the TEE, decrypted using keys from the key manager, and merged to prepare for the actual analysis. Meanwhile, PRIVANALYZER (1) checks that the guard policy is no weaker than the base policy and (2) examines the guard policy and the program to generate the residual policy. To make sure the static analysis runs correctly, PRIVANALYZER is protected inside a trusted execution environment (TEE).
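The "no weaker" and "no stricter" comparisons above are partial-order checks over policies. The following toy model treats a policy as a set of required clauses, so that one policy is no stronger than another exactly when its requirements are a subset of the other's; the real comparison walks the attribute lattices of Section 3.2, so this is only an intuition-level sketch:

from dataclasses import dataclass

@dataclass(frozen=True)
class Policy:
    # Toy stand-in for a LEGALEASE policy: the set of clauses it requires.
    clauses: frozenset

    def is_no_stronger_than(self, other: "Policy") -> bool:
        # Subset of requirements => no stronger (illustrative ordering only).
        return self.clauses <= other.clauses

base  = Policy(frozenset({"FILTER age >= 18", "PRIVACY DP"}))
guard = Policy(frozenset({"FILTER age >= 18", "PRIVACY DP", "REDACT zip"}))
pref  = Policy(frozenset({"FILTER age >= 18", "PRIVACY DP"}))  # a subject who accepted the base policy

assert base.is_no_stronger_than(guard)   # the guard policy is no weaker than the base policy
assert pref.is_no_stronger_than(guard)   # this subject's data may be loaded for the analysis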

The compliance enforcement hinges on three checks: (1) the guard policy is no weaker than the base policy; (2) only data with privacy preferences no stronger than the guard policy is used; (3) the guard policy must be satisfied before declassification. For (3), the guard policy can be satisfied either by a single program or by multiple programs applied sequentially to the data. This design endows PRIVGUARD with the ability to enforce privacy policies in multi-step analyses.

Execution & Declassification. After PRIVANALYZER finishes its analysis, the submitted program is executed on the decrypted data inside the TEE. If the residual policy generated in the previous step is empty, the result can be declassified to the analyst. Otherwise, the output is encrypted again and stored in the storage layer, protected by the residual policy.

Attentive readers might ask: "Why does PRIVGUARD not directly reject programs that fail to comply with the guard policy?" This design choice is motivated by two considerations.

Figure 2: PRIVGUARD prototype infrastructure. White: data/policies; green: analysis programs; blue: off-the-shelf components; yellow: newly-designed components.

First, it is not always possible to obtain an empty residual policy when the guard policy contains ROLE or PURPOSE attributes. These attributes are satisfied by human auditing after the actual data analysis. Second, PRIVGUARD is designed to be compatible with multi-step analysis, a common case in real-world product pipelines. In multi-step analysis, it is likely that privacy requirements are satisfied in different steps.

2.3 System Security

In this section, we present the threat model and demonstrate how PRIVGUARD is secured under it.

Threat Model. Our setting involves four parties: (1) data subjects (e.g., users), (2) a data curator (e.g., web service providers, banks, or hospitals), (3) data analysts (e.g., employees of the data curator), and (4) untrusted third parties (e.g., external attackers). Data is collected from data subjects, managed by the data curator, and analyzed by data analysts. Both the data subjects and the data curator would like to comply with privacy regulations, either to protect their own data or to avoid legal or reputational risk. The data analysts, however, are honest but reckless, and might unintentionally write programs that violate privacy regulations. The only way that a data analyst can interact with the data is to submit analysis programs and check the output. The untrusted third parties might actively launch attacks to steal the data or interfere with the compliance process. A concrete example is a cloud provider that hosts a small company's service or data. We protect data confidentiality and execution integrity from third parties under the following two assumptions. First, we assume that the untrusted third parties cannot submit analysis programs to PRIVANALYZER or compromise insiders to do so. Second, we assume that the untrusted third parties fit the threat model of the chosen TEE, so that they cannot break the security guarantees of the TEE.


Security Measures. PRIVGUARD takes the following measures to defend against untrusted third parties and establish a secure workflow under the above threat model. First, data is encrypted locally by the data subjects before being transmitted to the data curator. The decryption key is delegated to the key manager, so no one can touch the data, intentionally or carelessly, without asking the key manager or the data subject for the decryption key. To bind data and policy in an immutable way, the encrypted data contains a hash value of the corresponding policy. Second, all transmission channels satisfy transport layer security standards (TLS 1.3). Third, PRIVANALYZER runs inside a TEE to guarantee the integrity of the static analysis. The key manager can remotely attest that PRIVANALYZER correctly examines the program and the policies before issuing the decryption key. Fourth, data decryption and analysis program execution are protected inside the TEE as well.
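As an illustration of the first measure (local encryption with the policy bound to the ciphertext), the sketch below uses AES-256-GCM, the algorithm used in our prototype, and binds a hash of the policy as authenticated associated data. The exact message format is not specified here, so this is one plausible realization rather than the actual client code:

import hashlib
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

def encrypt_with_policy(data: bytes, policy_text: str) -> dict:
    # The key is delegated to the key manager after encryption.
    key = AESGCM.generate_key(bit_length=256)
    nonce = os.urandom(12)
    policy_hash = hashlib.sha256(policy_text.encode()).digest()
    # The policy hash is authenticated together with the data, so neither can be
    # swapped without decryption failing inside the TEE.
    ciphertext = AESGCM(key).encrypt(nonce, data, policy_hash)
    return {"nonce": nonce, "ciphertext": ciphertext,
            "policy_hash": policy_hash, "key_for_key_manager": key}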

The security of PRIVGUARD against untrusted third parties rests on the following sources of trust. First, the confidentiality and integrity of data are preserved inside the TEE and TLS channels. Second, the integrity of code execution is preserved inside the TEE; remote attestation can correctly and securely report the execution output. Third, the key manager is completely trusted, so the confidentiality of decryption keys is preserved. The design of trusted key managers is orthogonal to the focus of this paper; potential solutions include a key manager inside a TEE or a decentralized key manager [44].

3 PRIVANALYZER: Static Analysis for Enforcing Privacy Policies

This section describes PRIVANALYZER, a static analyzer for enforcing the privacy policies tracked by PRIVGUARD. We first review the LEGALEASE policy language [53], which we use to encode policies formally, and then describe how to statically enforce them. The formal model is deferred to Appendix A.

3.1 Background & Design Challenges

LEGALEASE is one example of a growing body of work that has explored formal languages for encoding privacy policies [31, 32, 36, 39, 42, 43, 51, 53, 56, 57]. A complete discussion of related work appears in Section 5. We adopt LEGALEASE to express PRIVGUARD policies due to its expressive power, formal semantics, and extensibility.

Sen et al. [53] developed a system called GROK that combines static and dynamic analyses to enforce LEGALEASE policies.

attr ∈ {ROLE, SCHEMA, PRIVACY, FILTER, REDACT, PURPOSE}
A ∈ attribute ::= attr attrValue
C ∈ policy clause ::= A | A AND C | A OR C
P ∈ policy ::= (ALLOW C)+

Figure 3: Policy language surface syntax

GROK constructs a data dependency graph which encodes all flows of sensitive data, then applies a set of inference rules to check that each node in the graph satisfies the policy. GROK combines analysis of system logs with limited static analysis to construct the graph.

The GROK approach presents two challenges. First, the approach is a heuristic: it examines syntactic properties of the program and individual executions of the program (via system logs), and thus may miss policy violations due to implicit flows [37, 48, 50, 59]. Second, the GROK approach requires making the entire dataflow graph explicit; in large systems with many data flows, constructing this graph may be intractable.

PRIVANALYZER is designed as an alternative to address both challenges. It uses static analysis based on abstract interpretation instead of GROK's heuristic analysis and avoids making the dataflow graph explicit by constructing composable residual policies.

3.2 Policy Syntax & Semantics

PRIVANALYZER enforces privacy policies specified in LEGALEASE [53], a framework for expressing policies using attributes of various types. Attributes are organized in concept lattices [3], which provide a partial order on attribute values. We express policies according to the grammar in Figure 3 (a slightly different syntax from that of LEGALEASE). A policy consists of a top-level ALLOW keyword followed by clauses separated by AND (for conjunction) and OR (for disjunction). For example, the following simple policy specifies that doctors or researchers may examine analysis results, as long as the records of minors are not used in the analysis:

ALLOW (ROLE Doctor OR ROLE Researcher)
  AND FILTER age >= 18

Sen et al. [53] define the formal semantics of LEGALEASE policies using a set of inference rules and the partial ordering given by each attribute's concept lattice. We take the same approach, but use a new attribute framework based on abstract domains [49] instead of concept lattices. Our approach enables PRIVPOLICY to encode policies with far more expressive requirements, like row-based access control and the use of privacy-enhancing technologies, as described below.


Attributes. Attributes are the basic building blocks in LEGALEASE. Sen et al. [53] describe a set of useful attributes. We extend this set with two new ones: FILTER encodes row-based access control requirements, and PRIVACY requires the use of privacy-enhancing technologies.

Role. The ROLE attribute controls who may examine the contents of the data. Roles are organized into partially ordered hierarchies. A particular individual may have many roles, and a particular role specification may represent many individuals. For example, the doctor role may represent doctors with many different specialties. The following policy says that only individuals with the role Oncologist may examine the data it covers:

ALLOW ROLE Oncologist

Schema. The SCHEMA attribute controls which columns of the data may be examined. For example, the following policy allows oncologists to examine the age and condition columns, but no others:

ALLOW ROLE Oncologist
  AND SCHEMA age, condition

Privacy. The PRIVACY attribute controls how the data may be used by requiring the use of privacy-enhancing technologies. As a representative sample of the spectrum of available mechanisms, our implementation supports the following (with easy additions): (1) de-identification (or pseudonymization); (2) aggregation; (3) k-anonymity [54]; (4) ℓ-diversity [41]; (5) t-closeness [40]; (6) differential privacy [33, 34]. For example, the following policy allows oncologists to examine the age and condition columns under the protection of differential privacy with a certain privacy budget:

ALLOW ROLE Oncologist
  AND SCHEMA age, condition
  AND PRIVACY DP (1.0, 1e-5)

Filter. The FILTER attribute allows policies to specify that certain data items must be excluded from the analysis. For example, the following policy says that oncologists may examine the age and condition of individuals over the age of 18, with differential privacy:

ALLOW ROLE Oncologist
  AND SCHEMA age, condition
  AND PRIVACY DP (1.0, 1e-5)
  AND FILTER age > 18

Redact. The REDACT attribute allows policies to require the partial or complete redaction of information in a column. For example, the following policy requires analyses to redact the last 3 digits of ZIP codes (e.g., by replacing them with stars). The (2:) notation is taken from Python's slice syntax and indicates the substring from the third character to the end of the string.

ALLOW ROLE Oncologist
  AND SCHEMA age, condition
  AND PRIVACY DP (1.0, 1e-5)
  AND FILTER age > 18
  AND REDACT zip (2:)

Figure 4: PRIVANALYZER Overview. PRIVANALYZER inputs an analysis program and policies, and produces residual policies; it can be applied repeatedly (dashed line) for multi-step analyses.

Purpose. The PURPOSE attribute allows the policy to restrict the purposes for which data may be analyzed. For example, the following policy allows the use of age and medical condition for public interest purposes, with all of the above requirements:

ALLOW ROLE Oncologist
  AND SCHEMA age, condition
  AND PRIVACY DP (1.0, 1e-5)
  AND FILTER age > 18
  AND REDACT zip (2:)
  AND PURPOSE PublicInterest

3.3 PRIVANALYZER Overview

PRIVANALYZER performs its static analysis via abstract interpretation [49], a general framework for sound analysis of programs. Abstract interpretation works by running the program on abstract values instead of concrete (regular) values. Abstract values are organized into abstract domains: partially ordered sets of abstract values which can represent all possible concrete values in the programming language. An abstract value usually represents a specific property shared by the concrete values it represents. In PRIVANALYZER, abstract values are based on the abstract domains described earlier.

Our approach to static analysis is based on a novel instantiation of the abstract interpretation framework, in which we encode policies as abstract values. The approach is summarized in Figure 4. The use of abstract interpretation allows us to construct the static analysis systematically, ensuring its correspondence with the intended semantics of attribute values.

Analyzing Python Programs. The typical approach to abstract interpretation is to build an abstract interpreter that computes with abstract values. For a complex, general-purpose language like Python, this approach requires significant engineering work. Rather than building an abstract interpreter from scratch, we re-use the standard Python interpreter to perform abstract interpretation.


We embed abstract values with attached privacy policies as Python objects and define operations over abstract values as methods on these objects.

For example, the Pandas library defines operations on concrete dataframes; PRIVANALYZER defines the AbsDataFrame class for abstract dataframes. The AbsDataFrame class has the same interface as the Pandas DataFrame class, but its methods are redefined to compute on abstract values with attached policies. We call the redefined method a function summary, since it summarizes the policy-relevant behavior of the original method. For example, the Pandas indexing function __getitem__ is used for filtering, so PRIVANALYZER's function summary for this function removes satisfied FILTER attributes from the policy.

def __getitem__(self, key):
    ...
    if isinstance(key, AbsIndicatorSeries):
        # `runFilter` removes satisfied FILTER attributes
        newPolicy = self.policy.runFilter(...)
        return Dataframe(..., newPolicy)
    ...

Multi-step Analyses & Residual Policies. As shown in Figure 4, the output of PRIVANALYZER is a residual policy. A residual policy is a new policy for the program's concrete output; it contains the requirements not yet satisfied by the analysis program. For a multi-step analysis, each step of the analysis can be fed into PRIVANALYZER as a separate analysis program, and the residual policies of one step become the input policies for the next step. PRIVANALYZER is compositional: if multiple analyses are merged into a single analysis program, then the final residual policy PRIVANALYZER returns for the multi-step analysis will be at least as restrictive as the one for the single-step version. The use of residual policies in PRIVGUARD enables compositional analyses without requiring explicit construction of a global dataflow graph, addressing the challenge of GROK [53] mentioned earlier.
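A hypothetical driver loop illustrates how residual policies thread a multi-step analysis through PRIVANALYZER; the analyze argument is a stand-in for PRIVANALYZER's actual entry point, not its real interface:

def run_multistep(programs, input_policies, analyze):
    """Feed each step's residual policies into the next step; declassify only when empty.

    `analyze` is a placeholder: it abstractly interprets one program against the
    current policies and returns the residual policies of the program's outputs.
    """
    policies = input_policies
    for program in programs:
        policies = analyze(program, policies)
    return "declassify" if not policies else ("guard with residual policy", policies)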

Handling libraries. Scaling to large programs is a major challenge for many static analyses, including abstract interpreters. Libraries often present the biggest challenge, since they tend to be large and complex; it may be impossible to analyze even a fairly small target program if the program depends on a large library. This is certainly true in our setting (Python programs for data processing), where programs typically leverage large libraries like pandas (327,007 lines of code), scikit-learn (184,144 lines of code), PyTorch (981,753 lines of code), and TensorFlow (2,490,119 lines of code). Worse, many libraries are written in a mix of languages (e.g., Python and C/C++) for performance reasons, so an analysis for each of these languages would be needed.

[Bar chart omitted: number of programs, out of 200 surveyed Kaggle programs, using each library (NumPy, pandas, scikit-learn, SciPy, XGBoost, LightGBM, datetime, Keras, TensorFlow, and other libraries).]

Figure 5: Python library frequency statistics. We summarize the most frequently used libraries.


Our solution is to develop specifications of the abstract functionality of these libraries, like the AbsDataFrame example shown earlier, in the form of function summaries. We use the function summaries during analysis instead of the concrete implementation of the library itself. This approach allows PRIVGUARD to enforce policies even for analysis programs that leverage extremely large libraries written in multiple languages.

Our approach to handling libraries requires a domain expert with knowledge of the library to implement its specification. In our experience, the data science community has largely agreed upon a core set of important libraries which are commonly used (e.g., NumPy, pandas, scikit-learn), so providing specifications for a small number of libraries is sufficient to handle most programs. To validate this conjecture empirically, we randomly selected 200 programs from the Kaggle platform and counted the libraries they use (Figure 5). The results confirm that most data analysis programs use similar libraries. We have already implemented specifications for the most frequently used libraries (Section 4). Fortunately, the abstract behavior of a library function tends to be simpler than its concrete behavior. We have implemented 380 function summaries, mainly for NumPy, pandas, and scikit-learn, and are actively working on adding more function summaries for various libraries.

We require correct specifications to rigorously enforce privacy policies. An illustrative example of the importance of correctly implementing specifications is the renaming function. A cunning inside attacker may try to bypass the static analysis by renaming sensitive columns. A correct specification, which renames the columns in both the schema and the privacy clauses, mitigates such attacks. To reduce the risk of such errors, function summaries should be open-sourced so the community can help check their correctness.
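A sketch of what a correct function summary for column renaming might look like; the AbsDataFrame fields and the policy's rename_columns helper shown here are assumptions for illustration, not the actual implementation:

class AbsDataFrame:
    """Abstract dataframe carrying a schema and an attached policy (illustrative fields)."""
    def __init__(self, schema, policy):
        self.schema, self.policy = schema, policy

    def rename(self, columns):
        # Rename columns consistently in BOTH the schema and the policy clauses
        # (SCHEMA, FILTER, REDACT, ...); otherwise an attacker could rename a
        # sensitive column and let it slip past the policy.
        new_schema = [columns.get(c, c) for c in self.schema]
        new_policy = self.policy.rename_columns(columns)   # assumed policy helper
        return AbsDataFrame(new_schema, new_policy)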


Comparison with dynamic approaches. Our choice of a static analysis for PRIVANALYZER is motivated by two major advantages over dynamic approaches: (1) the ability to handle implicit data flows, and (2) the goal of adding minimal run-time overhead. The ability to detect implicit flows is a major advantage of static analysis systems [37, 48, 50, 59], including PRIVANALYZER. Unlike dynamic approaches, PRIVANALYZER cannot be defeated by complex control flow designed to obfuscate execution. For example, the data subject might specify the policy ALLOW REDACT name (1:), which requires redacting most of the name column. An analyst might write the following program:

if data.name == 'Alice':
    return 1
else:
    return 2

This program clearly violates the policy, even though it does not return the value of data.name directly. This kind of violation is due to an implicit flow from the name column to the return value. A return value of 1 allows the analyst to confirm with certainty that the data subject's name is Alice. This kind of implicit flow presents a significant challenge for dynamic analyses, because dynamic analyses only execute one branch of each conditional and can draw no conclusions about the branch not taken. A dynamic analysis must either place significant restrictions on the use of sensitive values in conditionals, or allow for unsoundness due to implicit flows.

Static analyzers like PRIVANALYZER, on the other hand, can perform a worst-case analysis that inspects both branches. PRIVANALYZER's analysis executes both branches with the abstract interpreter and returns the worst-case outcome of the two. For loops with no bound on the number of iterations, the analysis results represent the worst-case outcome, no matter how many iterations execute at runtime. This power comes at the expense of a potential loss of precision: the analysis may reject programs that are actually safe to run. Our evaluation suggests, however, that PRIVANALYZER's analysis is sufficiently precise for programs that perform data analyses. Static analysis tools like PRIVANALYZER do not require the policy specification to be aware of implicit flows, as the analysis accounts for both explicit and implicit flows in its results.
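Conceptually, the worst-case treatment of a conditional corresponds to interpreting both branches and joining their abstract results at the least upper bound. The sketch below uses a set-of-requirements toy model of residual policies to make the join concrete; it is illustrative only, not PRIVANALYZER's implementation:

def abstract_if(then_result, else_result):
    """Join the abstract outcomes of both branches of a conditional.

    Each outcome is modeled as the set of policy requirements still unsatisfied
    on that path; the join keeps the union, i.e., the most restrictive residual
    policy, so implicit flows through the branch condition are never missed.
    """
    return then_result | else_result

# Example: one branch satisfied the FILTER requirement, the other did not,
# so FILTER remains in the joined (worst-case) residual policy.
abstract_if({"PRIVACY DP"}, {"PRIVACY DP", "FILTER age >= 18"})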

3.4 PRIVANALYZER by Example

The input to PRIVANALYZER is a single analysis program, plus all of the policies of the data files it processes. For each of the program's outputs, PRIVANALYZER produces a residual policy. After running the analysis, PRIVGUARD will attach each of these residual policies to the appropriate concrete output.

PRIVANALYZER works by performing abstract interpretation, where the inputs to the program are abstract values containing representations of the associated policies. The output of this process is a set of abstract values containing representations of the residual policies.

A complete example of this process is summarized in Figure 6. The analysis is a Python program adapted from an open-source analysis submitted to Kaggle [9]. Important locations in the program are labeled with numbers (e.g., 1) along with the associated residual policy that PRIVANALYZER computes at that program location.

Reading Data into Abstract Values. The program begins by reading in a CSV containing the sensitive data (Fig. 6, 1). PRIVANALYZER's abstract interpretation library redefines read_csv to return an abstract dataframe containing the policy associated with the data being read. At location 1, the variable df thus contains the full policy for the program's input. In this example, the policy allows the program to use the "age," "credit," and "duration" columns, requires filtering out the data of minors, and requires de-identification of the results.

Mixing Concrete and Abstract Values. The next part of the program defines some constants, which PRIVANALYZER represents with concrete values lacking attached policies. Then, the program drops one of the input columns (Fig. 6, 2); this action does not change the policy, because the columns being dropped are not sufficient to satisfy any of the policy's requirements, so the df variable is unchanged.

Satisfying FILTER Requirements. The next statement (Fig. 6, 3) performs a filtering operation. The PRIVANALYZER library redefines this operation to eliminate the appropriate FILTER requirements from the policy; since this filtering operation removes individuals below the age of 25, it satisfies the FILTER requirement in the policy, and the new value of df is an abstract dataframe whose policy no longer has this requirement.

Handling Loops. The next part of the program contains a for loop (Fig. 6, 4). Loops are traditionally a big challenge for static analyzers. PRIVANALYZER is designed to take advantage of loops over concrete values, like this one, by simply executing them concretely. PRIVANALYZER can also handle loops over abstract values (described later in this section), but these were relatively rare in our case studies.

Libraries and Black-box Operations. The next pieces of code (Fig. 6, 5 and 6) first take the log of each feature, then scale the features. Both of these operations impact the scale of feature values.


Figure 6: Example Abstract Interpretation with PRIVANALYZER.

After these operations, it becomes impossible to satisfy policy requirements like FILTER, because the original data values have been lost. For lossy operations like these, which we call black-box operations (detailed below), PRIVANALYZER is designed to raise an alarm if value-dependent requirements (like FILTER) remain in the policy.

Training a Model. The final piece of code (Fig. 6, 7 and 8) uses the KMeans implementation in scikit-learn to perform clustering. We summarize this method to specify that it satisfies the de-identification requirements in the policy. The result in our example is an empty residual policy, which would allow the analyst to view the results.

3.5 Challenging Language Features

We now describe the approach taken in PRIVANALYZER for several challenging language features.

Conditionals. Conditionals that depend on abstract values require the abstract interpreter to run both branches and compute the upper bound of the two results. Since Python does not allow redefining if statements, we add a pre-processing step to PRIVGUARD that transforms conditionals so that both branches are run.

Loops. Loops are traditionally the most challenging construct for abstract interpreters to handle. Fortunately, loops in Python programs for data analytics often fall into restricted classes, like the ones in the example of Figure 6. Both loops in that example are over constant values, so our abstract interpreter can simply run each loop body as many times as the constant requires.

Loops over abstract values are more challenging, and the simple approach may never terminate. To address this situation, we define a widening operator [49] for each of the abstract domains used in PRIVANALYZER. Widening operators force the loop to arrive at a fixpoint; in our example, widening corresponds to assuming that the loop body will be executed over the whole dataframe.

Aliasing. Another challenge for abstract interpretation comes from aliasing, where two variables point to the same value. Sometimes it is impossible for the analysis to determine which abstract value a variable references. In this case, it is also impossible to determine the outcome of side effects on the variable.

Our approach of re-using the existing Python interpreter helps address this challenge: in PRIVANALYZER, all variable references are evaluated concretely. In most cases, references are to concrete objects, so the analysis corresponds exactly to concrete execution. In a few cases, however, this approach leads to a less precise analysis. For example, if a variable is re-assigned in both branches of a conditional, PRIVANALYZER must assume that the worst-case abstract value (i.e., the one with the most restrictive policy) is assigned to the variable in both cases. This approach works well in our setting, where conditionals and aliasing are both relatively rare.

3.6 Attribute Enforcement

We now describe some attribute-specific details of our compliance analysis.

Schema, Filter, and Redact. The SCHEMA, FILTER, and REDACT attributes can be defined formally, and compliance can be checked by PRIVANALYZER. In our implementation, the relevant function summaries remove an attribute from the privacy policy if the library's concrete implementation satisfies the corresponding requirement. Our summaries thus implement abstract interpretation for these functions. Note that PRIVANALYZER assumes that functions without summaries do not satisfy any policy requirements. PRIVANALYZER is therefore incomplete: some programs may be rejected (due to insufficient function summaries) despite satisfying the relevant policies.


Privacy. The PRIVACY attribute is also checked by PRIVANALYZER. Analysis programs can satisfy de-identification requirements by calling functions that remove identifying information (e.g., by aggregating records or training machine learning models). Programs can satisfy k-anonymity², ℓ-diversity², t-closeness, or differential privacy requirements by calling specific functions that provide these properties. Our function summaries include representative implementations from the current literature: the IBM Differential Privacy Library [38], a k-Anonymity Library [16], and Google's TensorFlow Privacy library [26].

There are two subtleties when enforcing differential privacy attributes. First, programs that satisfy differential privacy also need to track the privacy budget [34]. By default, PRIVGUARD tracks a single global cumulative privacy cost (values for ε and δ) for each source of submitted data, and rejects new analysis programs after the privacy cost exceeds the budget. PRIVANALYZER reports the privacy cost of a single analysis program, allowing PRIVGUARD to update the global privacy cost. A single global privacy budget may be quickly exhausted in a setting with many analysts performing different analyses. One solution to this problem is to generate differentially private synthetic data, which can be used in later analyses without further privacy cost. The High-Dimensional Matrix Mechanism (HDMM) [45] is one example of an algorithm for this purpose, used by the US Census Bureau to release differentially private data. In PRIVGUARD, arbitrarily many additional analyses can be performed on the output of algorithms like HDMM without using up the privacy budget. Another solution is fine-grained budgeting, at the record level (as in ProPer [35]) or at a statically defined "region" level (as in UniTraX [47]). The first is more precise, but requires silently dropping records for which the privacy budget is exhausted, leading to biased results. Both approaches allow for continuous analysis of fresh data in growing databases (e.g., running a specific query workload every day on just the new data obtained that day). Second, to calculate the privacy budget, PRIVGUARD initializes a variable to track the sensitivity of the pre-processing steps that precede the differentially private function.

² k-Anonymity and ℓ-diversity are vulnerable to disclosure attacks, as pointed out in [40]: k-anonymity is vulnerable to homogeneity and background-knowledge attacks, and ℓ-diversity suffers from skewness and similarity attacks. We strongly encourage using t-closeness or differential privacy for stronger protection. PRIVGUARD provides the weaker approaches only for compatibility purposes.

The pre-processing function summaries should manipulate this variable to specify their influence on the sensitivity. If such a specification is absent in any function before the differentially private function, PRIVGUARD throws a warning and treats the differential privacy requirement as unsatisfied.

Role. ROLE attributes are enforced by authentication techniques such as passwords, two-factor authentication, or even biometric authentication. In addition, ROLE attributes are also recorded in an auditable system log, described in the next paragraph, and the analysts and the data curators are held accountable for fake identities.

Purpose. The PURPOSE attribute is inherently informal. Thus, we take an accountability-based approach to compliance checking for purposes. Analysts specify their purposes when submitting the analysis program, and may specify an invalid purpose unintentionally or maliciously. These purposes are used by PRIVANALYZER to satisfy PURPOSE requirements. PRIVGUARD produces an audit log recording analysts, analysis programs, and claimed purposes. Thus, all analyses performed in the system can be verified after the fact, and analysts can be held legally accountable for claiming invalid purposes.
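Returning to the differential privacy subtleties above, the global budget accounting can be sketched as follows; the class and the parameter values are illustrative, not PRIVGUARD's actual bookkeeping code:

class PrivacyBudget:
    """Tracks a single global cumulative (epsilon, delta) privacy cost per data source."""
    def __init__(self, eps_budget: float, delta_budget: float):
        self.eps_budget, self.delta_budget = eps_budget, delta_budget
        self.eps_spent, self.delta_spent = 0.0, 0.0

    def charge(self, eps: float, delta: float) -> None:
        # Reject a new analysis program once its reported cost would exceed the budget.
        if (self.eps_spent + eps > self.eps_budget
                or self.delta_spent + delta > self.delta_budget):
            raise RuntimeError("privacy budget exhausted; analysis rejected")
        self.eps_spent += eps
        self.delta_spent += delta

# Example: PRIVANALYZER reports the cost of one analysis, and PRIVGUARD charges it.
budget = PrivacyBudget(eps_budget=10.0, delta_budget=1e-4)
budget.charge(1.0, 1e-5)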

4 Evaluation

The evaluation is designed to demonstrate that (1) PRIVANALYZER supports commonly-used libraries for data analytics and can analyze real-world programs, and (2) PRIVGUARD is lightweight and scalable. To this end, we (1) test PRIVANALYZER using 23 real-world analysis programs drawn from the Kaggle contest platform, and (2) measure the overhead of PRIVGUARD using a subset of these programs. The results show that PRIVGUARD can correctly enforce PRIVPOLICY policies on these programs with reasonable performance overhead.

4.1 Experiment Setup

We implemented PRIVANALYZER in about 1400 lines of Python and integrated it into an industrial-level data governance platform, Parcel [1], to prototype PRIVGUARD. We instantiated our implementation with the InterPlanetary File System (IPFS) for the storage layer, AES-256-GCM as the encryption algorithm, and AMD SEV as the TEE.

To evaluate PRIVGUARD's static analysis on real-world programs, we collected analysis programs for 23 different tasks from Kaggle, one of the most well-known platforms for data analysis contests. These programs analyze sensitive data for tasks such as fraud detection [15] and transaction prediction [22]. We selected these programs as case studies to demonstrate PRIVGUARD's ability to analyze real-world analysis programs and support commonly-used libraries.


Index | Application Type | # Features | # Samples | # LoC | External libraries
1 | Fraud Detection - Random [15] | 435 | 590541 | 157 | LightGBM, NumPy, pandas, scikit-learn
2 | Fraud Detection - Top [15] | 435 | 590541 | 157 | LightGBM, NumPy, pandas, scikit-learn
3 | Merchant Recommendation [11] | 41 | 201917 | 199 | LightGBM, NumPy, pandas, scikit-learn
4 | Customer Satisfaction Prediction [21] | 370 | 76020 | 104 | NumPy, pandas, scikit-learn, XGBoost
5 | Customer Transaction Prediction - Random [22] | 201 | 200000 | 89 | NumPy, pandas, scikit-learn
6 | Customer Transaction Prediction - Top [22] | 201 | 200000 | 89 | NumPy, pandas, scikit-learn
7 | Bank Customer Classification [6] | 13 | 10000 | 276 | NumPy, pandas, scikit-learn
8 | Bank Customer Segmentation [7] | 9 | 1000 | 75 | NumPy, pandas, scikit-learn
9 | Credit Risk Analysis [9] | 9 | 1000 | 57 | NumPy, pandas, sklearn
10 | Bank Customer Churn Prediction [5] | 13 | 10000 | 169 | NumPy, pandas, SciPy, scikit-learn
11 | Heart Disease Causal Inference [14] | 14 | 303 | 83 | NumPy, pandas, SHAP, scikit-learn
12 | Classify Forest Categories [8] | 52 | 1000 | 50 | NumPy, pandas, PySpark
13 | PyTorch-Simple LSTM [23] | 45 | 1780832 | 178 | NumPy, pandas, Keras, PyTorch
14 | Tensorflow-Solve Titanic [25] | 7 | 891 | 163 | NumPy, pandas, scikit-learn, Tensorflow
15 | Earthquake Prediction - Top [17] | 2 | 1000 | 132 | NumPy, pandas, tsfresh, librosa, pywt, SciPy
16 | Display Advertising - Top [10] | 41 | 1000 | 60 | math
17 | Fraud Detection - Top [24] | 8 | 1000 | 106 | NumPy, pandas, Keras, Tensorflow
18 | Restaurant Revenue Prediction - Top [20] | 42 | 1000 | 115 | NumPy, pandas, FastAI, scikit-learn
19 | NFL Analytics - Top [19] | 18 | 1000 | 152 | NumPy, pandas, SciPy, scikit-learn
20 | NCAA Prediction - Top [13] | 34 | 1000 | 561 | NumPy, pandas, Pymc3
21 | Home Value Prediction - Top [29] | 58 | 1000 | 272 | NumPy, pandas, sklearn, XGBoost
22 | Malware Prediction - Top [18] | 83 | 1000 | 194 | NumPy, pandas
23 | Web Traffic Forecasting - Top [27] | 551 | 1000 | 346 | NumPy, pandas

Table 1: Case Study Programs. # LoC = Lines of Code. The "Random" suffix means the program was chosen randomly from the contest; the "Top" suffix means the program was chosen from the top-ranked programs on the leaderboard.

These case studies were chosen to be representative of the programs written by data scientists during day-to-day operations at many different kinds of organizations. We surveyed 100 Kaggle programs at random and found that approximately 85% of the programs are less than 300 lines of code (after removing blank lines). Correspondingly, our case studies range between 50 and 276 lines of code, total exactly 1600 lines of code, and include both programs picked randomly from Kaggle notebooks and top-ranked programs on the contest leaderboards. As shown in Table 1, these programs use a variety of external libraries, including widely used ones like pandas, PySpark, TensorFlow, PyTorch, scikit-learn, and XGBoost.

As the first step of the evaluation, we use PRIVANALYZER to analyze the collected programs listed in Table 1. In this experiment, we manually designed an appropriate LEGALEASE policy for each program and attached it to each of the datasets. For each program, we recorded both the time to run on the dataset and the time for PRIVANALYZER to analyze the program. We also manually checked that the analysis result output by PRIVANALYZER was correct. All the experiments were run on an Ubuntu 18.04 LTS server with 32 AMD Opteron(TM) 6212 processors and 512GB of RAM. The results appear in Table 2. As the second step of the evaluation, we picked 7 case studies with open-source datasets, ran them on the PRIVGUARD prototype, and measured the system overhead.

[Bar chart omitted: running time in seconds per program ID, broken down into analysis overhead, merge overhead, scan overhead, and Parcel overhead.]

Figure 7: System overhead of the PRIVGUARD prototype with one million simulated users.

4.2 Results

Support for Real-World Programs. Our experiment demonstrates PRIVGUARD's ability to analyze the kinds of analysis programs commonly written to process data in organizations. The results in Table 2 show that the static analysis took just a second or two for most programs, with three outliers taking 3.32, 4.78, and 6.84 seconds. The reason for the outliers is described in the next paragraph.

As in other abstract interpretation and symbolic execution frameworks, we expect that conditionals, loops, and other control-flow constructs will have a bigger effect on analysis time than program length. Fortunately, programs for data analytics and machine learning tend not to make extensive use of these constructs, especially compared to traditional programs.


Index | Exec Time (s) | Analysis Time (s) | Overhead | Soundness
1 | 12571.01 | 1.41 | 1.12e-2% | ✓
2 | 19951.10 | 3.32 | 1.66e-2% | ✓
3 | 16762.65 | 1.18 | 7.04e-3% | ✓
4 | 151.72 | 1.22 | 8.04e-1% | ✓
5 | 17.14 | 1.08 | 6.30% | ✓
6 | 33.71 | 0.96 | 2.84% | ✓
7 | 32.66 | 2.03 | 6.22% | ✓
8 | 86.82 | 2.19 | 2.52% | ✓
9 | 4.65 | 1.01 | 21.72% | ✓
10 | 295.16 | 1.29 | 4.37e-1% | ✓
11 | 3.99 | 1.00 | 25.06% | ✓
12 | 1017.83 | 1.01 | 1.00e-1% | ✓
13 | 717.58 | 6.84 | 9.53e-1% | ✓
14 | 13.33 | 4.78 | 35.86% | ✓
15 | 217.36 | 2.26 | 1.04% | ✓
16 | 3.60 | 1.20 | 33.33% | ✓
17 | 5.19 | 1.37 | 26.40% | ✓
18 | 47.12 | 1.57 | 3.33% | ✓
19 | 202.96 | 1.33 | 6.55e-1% | ✓
20 | 59.83 | 1.66 | 2.77% | ✓
21 | 54.44 | 2.55 | 4.68% | ✓
22 | 51.36 | 1.23 | 2.39% | ✓
23 | 45.58 | 2.45 | 5.37% | ✓

Table 2: Execution Time vs. Analysis Time. The indexof case studies is the same as in Table 1.

Fortunately, programs for data analytics and machine learning tend not to make extensive use of these constructs, especially compared to traditional programs. Instead, they tend to use constructs provided by libraries, like the query features defined in pandas or the model construction classes provided by scikit-learn. The outliers mentioned above (case studies 2, 13, and 14) contain relatively heavy use of conditionals, and as a result, their analysis took slightly longer than the other programs. These results suggest that PRIVGUARD will scale to even longer programs for data analytics and machine learning, especially if those programs follow the same pattern of favoring library use over traditional control-flow constructs.

Table 2 reports the analysis performance overhead for all 23 case studies: the ratio of the time taken for static analysis to the native execution time of the original program. The results show that this overhead is negligible. For case-study programs that take a significant time to run, the performance overhead of deploying PRIVGUARD is typically less than 1%. For faster-running programs, the absolute overhead is similar, just a second or two in most cases, but this represents a larger relative change when the program's execution time is small. The maximum relative performance overhead in our experiments was about 35%, for a program taking only 13.33 seconds.
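As a concrete check of the overhead column, consider case study 1 in Table 2:

overhead = analysis time / execution time = 1.41 s / 12571.01 s ≈ 1.12 × 10^−4 ≈ 1.12e−2%.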

Overall Performance Overhead and Scalability. We also evaluate 7 case studies on our prototype implementation and measure the total overhead of the PRIVGUARD system. The results appear in Figures 7 and 8. For each case study, we synthesize one million random policies by combining possible attributes and varying the parameters in the attributes, simulating one million data subjects' privacy preferences. The results show that the performance overhead for ingesting one million policies is under 150 seconds. Concretely, over half of the overhead is Parcel's system overhead, such as data uploading, data storage, and data encryption. Data ingestion takes about one-third of the overhead, and the static analysis takes less than 10 seconds.

We also benchmark the overhead with different numbers of users, as shown in Figure 8. Parcel overhead refers to the overhead incurred by the Parcel platform, such as data loading and transmission. Scan overhead refers to the time spent finding the policies no stricter than the guard policy. Merge overhead refers to the time used to merge the datasets inside the TEE. Analysis overhead refers to the overhead of running PRIVANALYZER. As shown in Figure 8, Parcel overhead, scan overhead, and merge overhead are relatively stable when the number of users is small and then scale linearly with the number of users. Note that the x-axis uses a log10 scale, so the curves look exponential even though the underlying growth is linear in the number of users. For all experiments except the static analysis, Overhead = O(#Users) + O(1). Not surprisingly, the analysis overhead is almost constant for a fixed program. The results show that PRIVGUARD is scalable to a large number of users and datasets.
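To make the scan step concrete, the sketch below models each data subject's policy as a set of (attribute, parameter) requirements and treats "no stricter than the guard policy" as set containment. The real LEGALEASE ordering is a lattice over attribute values, so this encoding and the helper names are simplifying assumptions for illustration only.

from typing import FrozenSet, List, Tuple

Policy = FrozenSet[Tuple[str, str]]  # e.g. {("FILTER", "age >= 18"), ("PRIVACY", "DP(1.0)")}

def no_stricter_than(policy: Policy, guard: Policy) -> bool:
    # Simplified ordering: a policy is no stricter than the guard if every
    # requirement it imposes is already promised by the guard policy.
    return policy <= guard

def scan(policies: List[Policy], guard: Policy) -> List[int]:
    """Return the indices of data subjects whose policies the guard satisfies."""
    return [i for i, p in enumerate(policies) if no_stricter_than(p, guard)]

# Example: two subjects, one requiring only a FILTER, one also requiring DP.
guard = frozenset({("FILTER", "age >= 18"), ("ROLE", "analyst")})
subjects = [frozenset({("FILTER", "age >= 18")}),
            frozenset({("FILTER", "age >= 18"), ("PRIVACY", "DP(1.0)")})]
print(scan(subjects, guard))  # [0]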

5 Related Work

Related work falls into two categories: (1) formalization and (2) enforcement of privacy policies.

Privacy Policy Formalism. Tschantz et al. [56] use a modified Markov Decision Process to formalize purpose restrictions in privacy policies. Chowdhury et al. [31] present a policy language based on a subset of FOTL capturing the requirements of HIPAA. Lam et al. [39] prove that for any privacy policy that conforms to patterns evident in HIPAA, there exists a finite representative hospital database that illustrates how the law applies in all possible hospitals. Gerl et al. [36] introduce LPL, an extensible Layered Privacy Language that can express and enforce new privacy properties such as user consent. Trabelsi et al. [55] propose the PPL sticky policy based on XACML to express and process privacy policies in Web 2.0. Azraoui et al. [30] focus on the accountability side of privacy policies and extend PPL to A-PPL.


Figure 8: PRIVGUARD overhead details for case studies 1, 4, 8, 9, 10, 11, and 12 as the number of users grows from 10^0 to 10^6 (x-axis on a log scale; y-axis: running time in seconds): (a) Parcel overhead, (b) scan overhead, (c) merge overhead, (d) analysis overhead. The scalability is linear (note the logarithmic scale).

Privacy Policy Compliance Enforcement. Going beyond the formalization of privacy regulations, recent research also explores techniques to enforce formalized privacy policies. Chowdhury et al. [32] propose temporal mode-checking for run-time monitoring of privacy policies. Sen et al. [53] introduce GROK, a data inventory for Map-Reduce-like big data systems. PODS/SOLID [43] focuses on returning control over data to its owners. In the PPL policy engine [55], the policy decision point (PDP) matches the data curator's privacy policy against data subjects' privacy preferences to decide compliance; the privacy policy is then enforced by the policy enforcement point. Compared with our work, the PPL policy engine provides limited support for fine-grained privacy compliance in complex data analysis tasks because its enforcement engine relies on direct trigger-to-action translation. In addition, PPL does not provide a rigorous soundness proof. Similar differences exist in its extension A-PPL [30] and the SPECIAL project [2]. Our work provides an enforcement mechanism necessary to address these issues and can be seen as a first step towards meeting the ambitious challenge posed by Maniatis et al. [42].

6 Limitations

We would like to note several limitations of PRIVGUARD, and we consider mitigating them to be important future directions. First, PRIVGUARD is vulnerable to insider attacks. In our threat model, we assume the data analysts are honest but reckless and might violate privacy regulations unintentionally. Such a threat model should be enough to capture most real-world use cases. However, defending against malicious analysts is much more challenging. Because PRIVANALYZER is implemented as a Python library, it is possible to craft malicious programs that evade its analysis. For example, a malicious program might dynamically redefine PRIVANALYZER's runFilter function (used in our earlier example) to always report that the policy has been satisfied. A syntactic analysis that detects the use of dynamic language features before the program is loaded could address this issue. However, it is challenging to detect all such attacks due to the large number of dynamic features in Python.
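A minimal sketch of such a syntactic pre-check, using Python's standard ast module, is shown below. The set of flagged constructs and the privanalyzer module name in the example input are our own illustrative choices; a production check would need a much broader (and still incomplete) list of dynamic features.

import ast

# Constructs we conservatively treat as "dynamic" features worth flagging.
SUSPICIOUS_CALLS = {"exec", "eval", "setattr", "getattr", "globals", "vars", "__import__"}

def find_dynamic_features(source: str) -> list:
    """Return human-readable warnings for dynamic language features in source."""
    warnings = []
    tree = ast.parse(source)
    for node in ast.walk(tree):
        # Direct calls to exec/eval/setattr/... by name.
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name) \
                and node.func.id in SUSPICIOUS_CALLS:
            warnings.append(f"line {node.lineno}: call to {node.func.id}")
        # Assignments to attributes of another object or module, e.g. monkey-patching
        # privanalyzer.runFilter = lambda *a, **k: True
        elif isinstance(node, ast.Assign):
            for target in node.targets:
                if isinstance(target, ast.Attribute):
                    warnings.append(f"line {node.lineno}: attribute assignment to "
                                    f"{ast.unparse(target)}")
    return warnings

malicious = (
    "import privanalyzer\n"
    "privanalyzer.runFilter = lambda *args, **kwargs: True\n"
)
for warning in find_dynamic_features(malicious):
    print(warning)  # line 2: attribute assignment to privanalyzer.runFilter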

Second, many PURPOSE attributes cannot be automatically enforced by PRIVGUARD because they are not related to program properties. For example, whether a program represents a "legitimate interest" can only be judged by a human, and thus cannot be decided by any fully automated system. To address this challenge, we log these attributes and make the log accessible for human auditing. We emphasize that our goal is to minimize, rather than eliminate, human effort in the compliance process.
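A minimal sketch of this logging approach is shown below; the logger name, attribute set, and record format are hypothetical.

import json
import logging
from datetime import datetime, timezone

logging.basicConfig(level=logging.INFO)
audit_log = logging.getLogger("privguard.audit")  # hypothetical logger name

# Attributes whose satisfaction cannot be checked from program properties alone.
HUMAN_REVIEW_ATTRIBUTES = {"PURPOSE"}

def record_for_audit(program_id: str, attribute: str, value: str) -> None:
    """Log attributes (e.g. PURPOSE: LegitimateInterest) for later human auditing."""
    if attribute in HUMAN_REVIEW_ATTRIBUTES:
        audit_log.info(json.dumps({
            "time": datetime.now(timezone.utc).isoformat(),
            "program": program_id,
            "attribute": attribute,
            "value": value,
            "status": "needs-human-review",
        }))

record_for_audit("case-study-11", "PURPOSE", "LegitimateInterest")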

Third, PRIVGUARD relies on TEEs such as AMD SEV to defend against untrusted third parties. However, recent studies have identified several vulnerabilities in mainstream TEEs [46, 58], which weaken their protection against malicious third parties. Although these exploits are out of scope, we mention them so that potential users are aware of the risk.

7 Conclusion & Future Work

In this paper, we propose PRIVGUARD, a framework for facilitating privacy regulation compliance. The core component is PRIVANALYZER, a static analyzer supporting compliance verification between a program and a policy. We prototype PRIVGUARD on Parcel, an industrial-level data governance platform. We believe that PRIVGUARD has the potential to significantly reduce the cost of privacy regulation compliance.

There are several directions we would like to pursue for future versions of PRIVGUARD. First, we would like to further improve the usability of PRIVGUARD's API with HCI requirements in mind, so that non-experts can easily specify their own privacy preferences. Second, PRIVGUARD currently adopts a one-shot consent strategy, which covers most current application scenarios but has several defects, as pointed out in [52]. This limitation can be addressed by allowing the data curator to ask data subjects for dynamic consent after data collection, as depicted in [52].


References

[1] Oasis Labs Parcel SDK. https://www.oasislabs.com/parcelsdk. Accessed: 2021-02-02.

[2] SPECIAL: Scalable policy-aware linked data architecture for privacy, transparency and compliance. https://specialprivacy.ercim.eu/. Accessed: 2021-05-04.

[3] Formal concept analysis. https://en.wikipedia.org/wiki/Formal_concept_analysis, 2019. Online; accessed 30 May 2019.

[4] The age of privacy: The cost of continuous compliance. https://datagrail.io/downloads/GDPR-CCPA-cost-report.pdf, 2020. Online; accessed 30 July 2020.

[5] Bank customer churn prediction. https://www.kaggle.com/bandiang2/prediction-of-customer-churn-at-a-bank, 2020. Online; accessed 9 January 2020.

[6] Bank customer classification random forest. https://www.kaggle.com/taronzakaryan/bank-customer-classification-random-forest, 2020. Online; accessed 9 January 2020.

[7] Bank customer segmentation. https://www.kaggle.com/paulinan/bank-customer-segmentation, 2020. Online; accessed 9 January 2020.

[8] Classify forest categories. https://www.kaggle.com/suyashgulati/using-pyspark-randomforest-crossvalidatn-gridsrch, 2020. Online; accessed 9 January 2020.

[9] Credit risk analysis. https://www.kaggle.com/damaradiprabowo/clustering-german-credit-data, 2020. Online; accessed 9 January 2020.

[10] Display advertising challenge. https://www.kaggle.com/c/criteo-display-ad-challenge/discussion/10322, 2020. Online; accessed 9 January 2020.

[11] Elo merchant category recommendation. https://www.kaggle.com/c/elo-merchant-category-recommendation, 2020. Online; accessed 9 January 2020.

[12] The GDPR racket: Who's making money from this $9bn business shakedown. https://www.forbes.com/sites/oliversmith/2018/05/02/the-gdpr-racket-whos-making-money-from-this-9bn-business-shakedown/#54c0702034a2, 2020. Online; accessed 30 July 2020.

[13] Google Cloud & NCAA ML competition 2019 - Women's. https://www.kaggle.com/c/womens-machine-learning-competition-2019/discussion/90156, 2020. Online; accessed 9 January 2020.

[14] Heart disease causal inference. https://www.kaggle.com/tentotheminus9/what-causes-heart-disease-explaining-the-model, 2020. Online; accessed 9 January 2020.

[15] IEEE CIS fraud detection. https://www.kaggle.com/c/ieee-fraud-detection, 2020. Online; accessed 9 January 2020.

[16] K-anonymity library. https://github.com/KENNN/k-anonymity, 2020. Online; accessed 02 May 2020.

[17] LANL earthquake prediction. https://www.kaggle.com/c/LANL-Earthquake-Prediction/discussion/94390, 2020. Online; accessed 9 January 2020.

[18] Microsoft malware prediction. https://www.kaggle.com/shrutimechlearn/large-data-loading-trick-with-ms-malware-data, 2020. Online; accessed 9 January 2020.

[19] NFL punt analytics competition. https://www.kaggle.com/c/NFL-Punt-Analytics-Competition/discussion/78041#latest-663557, 2020. Online; accessed 9 January 2020.

[20] Restaurant revenue prediction. https://www.kaggle.com/jquesadar/restaurant-revenue-1st-place-solution, 2020. Online; accessed 9 January 2020.

[21] Santander customer satisfaction. https://www.kaggle.com/c/santander-customer-satisfaction, 2020. Online; accessed 9 January 2020.

[22] Santander customer transaction prediction. https://www.kaggle.com/c/santander-customer-transaction-prediction, 2020. Online; accessed 9 January 2020.


[23] Simple LSTM - PyTorch version. https://www.kaggle.com/bminixhofer/simple-lstm-pytorch-version, 2020. Online; accessed 9 January 2020.

[24] TalkingData AdTracking fraud detection challenge. https://www.kaggle.com/c/talkingdata-adtracking-fraud-detection/discussion/56283, 2020. Online; accessed 9 January 2020.

[25] TensorFlow - deep learning to solve Titanic. https://www.kaggle.com/linxinzhe/tensorflow-deep-learning-to-solve-titanic, 2020. Online; accessed 9 January 2020.

[26] TensorFlow Privacy. https://github.com/tensorflow/privacy, 2020. Online; accessed 03 May 2020.

[27] Web traffic time series forecasting. https://www.kaggle.com/muonneutrino/wikipedia-traffic-data-exploration, 2020. Online; accessed 9 January 2020.

[28] What is IAM? https://docs.aws.amazon.com/IAM/latest/UserGuide/introduction.html, 2020. Online; accessed 29 April 2020.

[29] Zillow Prize: Zillow's home value prediction (Zestimate). https://www.kaggle.com/sudalairajkumar/simple-exploration-notebook-zillow-prize, 2020. Online; accessed 9 January 2020.

[30] Monir Azraoui, Kaoutar Elkhiyaoui, Melek Önen, Karin Bernsmed, Anderson Santana De Oliveira, and Jakub Sendor. A-PPL: An accountability policy language. In Data Privacy Management, Autonomous Spontaneous Security, and Security Assurance, pages 319–326. Springer, 2014.

[31] Omar Chowdhury, Andreas Gampe, Jianwei Niu, Jeffery von Ronne, Jared Bennatt, Anupam Datta, Limin Jia, and William H Winsborough. Privacy promises that can be kept: A policy analysis method with application to the HIPAA privacy rule. In Proceedings of the 18th ACM Symposium on Access Control Models and Technologies, pages 3–14. ACM, 2013.

[32] Omar Chowdhury, Limin Jia, Deepak Garg, and Anupam Datta. Temporal mode-checking for run-time monitoring of privacy policies. In International Conference on Computer Aided Verification, pages 131–149. Springer, 2014.

[33] Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography Conference, pages 265–284. Springer, 2006.

[34] Cynthia Dwork, Aaron Roth, et al. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science, 9(3–4):211–407, 2014.

[35] Hamid Ebadi, David Sands, and Gerardo Schneider. Differential privacy: Now it's getting personal. ACM SIGPLAN Notices, 50(1):69–81, 2015.

[36] Armin Gerl, Nadia Bennani, Harald Kosch, and Lionel Brunie. LPL, towards a GDPR-compliant privacy language: Formal definition and usage. In Transactions on Large-Scale Data- and Knowledge-Centered Systems XXXVII, pages 41–80. Springer, 2018.

[37] Daniel B Giffin, Amit Levy, Deian Stefan, David Terei, David Mazières, John C Mitchell, and Alejandro Russo. Hails: Protecting data privacy in untrusted web applications. In 10th USENIX Symposium on Operating Systems Design and Implementation (OSDI 12), pages 47–60, 2012.

[38] Naoise Holohan, Stefano Braghin, Pól Mac Aonghusa, and Killian Levacher. Diffprivlib: The IBM differential privacy library. arXiv preprint arXiv:1907.02444, 2019.

[39] Peifung E Lam, John C Mitchell, Andre Scedrov, Sharada Sundaram, and Frank Wang. Declarative privacy policy: Finite models and attribute-based encryption. In Proceedings of the 2nd ACM SIGHIT International Health Informatics Symposium, pages 323–332. ACM, 2012.

[40] Ninghui Li, Tiancheng Li, and Suresh Venkatasubramanian. t-closeness: Privacy beyond k-anonymity and l-diversity. In 2007 IEEE 23rd International Conference on Data Engineering, pages 106–115. IEEE, 2007.

[41] Ashwin Machanavajjhala, Daniel Kifer, Johannes Gehrke, and Muthuramakrishnan Venkitasubramaniam. l-diversity: Privacy beyond k-anonymity. ACM Transactions on Knowledge Discovery from Data (TKDD), 1(1):3–es, 2007.

[42] Petros Maniatis, Devdatta Akhawe, Kevin R Fall, Elaine Shi, and Dawn Song. Do you know where your data are? Secure data capsules for deployable data protection. In HotOS, volume 7, pages 193–205, 2011.

[43] Essam Mansour, Andrei Vlad Sambra, Sandro Hawke, Maged Zereba, Sarven Capadisli, Abdurrahman Ghanem, Ashraf Aboulnaga, and Tim Berners-Lee. A demonstration of the Solid platform for social web applications. In Proceedings of the 25th International Conference Companion on World Wide Web, pages 223–226. International World Wide Web Conferences Steering Committee, 2016.

[44] Sai Krishna Deepak Maram, Fan Zhang, Lun Wang, Andrew Low, Yupeng Zhang, Ari Juels, and Dawn Song. CHURP: Dynamic-committee proactive secret sharing. In Proceedings of the 2019 ACM SIGSAC Conference on Computer and Communications Security, pages 2369–2386, 2019.

[45] Ryan McKenna, Gerome Miklau, Michael Hay, and Ashwin Machanavajjhala. Optimizing error of high-dimensional statistical queries under differential privacy. Proceedings of the VLDB Endowment, 11(10):1206–1219, 2018.

[46] Mathias Morbitzer, Sergej Proskurin, Martin Radev, Marko Dorfhuber, and Erick Quintanar Salas. SEVerity: Code injection attacks against encrypted virtual machines. arXiv preprint arXiv:2105.13824, 2021.

[47] Reinhard Munz, Fabienne Eigner, Matteo Maffei, Paul Francis, and Deepak Garg. UniTraX: Protecting data privacy with discoverable biases. In International Conference on Principles of Security and Trust, pages 278–299. Springer, Cham, 2018.

[48] Andrew C Myers. JFlow: Practical mostly-static information flow control. In Proceedings of the 26th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, pages 228–241, 1999.

[49] Flemming Nielson, Hanne R Nielson, and Chris Hankin. Principles of Program Analysis. Springer, 2015.

[50] Andrei Sabelfeld and Alejandro Russo. From dynamic to static and back: Riding the roller coaster of information-flow control research. In International Andrei Ershov Memorial Conference on Perspectives of System Informatics, pages 352–365. Springer, 2009.

[51] Stefan Saroiu, Alec Wolman, and Sharad Agarwal. Policy-carrying data: A privacy abstraction for attaching terms of service to mobile data. In Proceedings of the 16th International Workshop on Mobile Computing Systems and Applications, pages 129–134, 2015.

[52] Eva Schlehahn, Patrick Murmann, Farzaneh Karegar, and Simone Fischer-Hübner. Opportunities and challenges of dynamic consent in commercial big data analytics. In IFIP International Summer School on Privacy and Identity Management, pages 29–44. Springer, 2019.

[53] Shayak Sen, Saikat Guha, Anupam Datta, Sriram K Rajamani, Janice Tsai, and Jeannette M Wing. Bootstrapping privacy compliance in big data systems. In 2014 IEEE Symposium on Security and Privacy, pages 327–342. IEEE, 2014.

[54] Latanya Sweeney. k-anonymity: A model for protecting privacy. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 10(05):557–570, 2002.

[55] Slim Trabelsi, Jakub Sendor, and Stefanie Reinicke. PPL: PrimeLife privacy policy engine. In 2011 IEEE International Symposium on Policies for Distributed Systems and Networks, pages 184–185. IEEE, 2011.

[56] Michael Carl Tschantz, Anupam Datta, and Jeannette M Wing. Formalizing and enforcing purpose restrictions in privacy policies. In 2012 IEEE Symposium on Security and Privacy, pages 176–190. IEEE, 2012.

[57] Frank Wang, Ronny Ko, and James Mickens. Riverbed: Enforcing user-defined privacy constraints in distributed web services. In 16th USENIX Symposium on Networked Systems Design and Implementation (NSDI 19), pages 615–630, 2019.

[58] Luca Wilke, Jan Wichelmann, Florian Sieck, and Thomas Eisenbarth. undeSErVed trust: Exploiting permutation-agnostic remote attestation. In 2021 IEEE Security and Privacy Workshops (SPW), pages 456–466, 2021.

[59] Lantian Zheng and Andrew C Myers. Dynamic security labels and static information flow control. International Journal of Information Security, 6(2-3):67–84, 2007.


A Formal Model of PRIVANALYZER

In this section, we formally present the technique for proving the soundness of our static analysis. We use the FILTER attribute as an example to demonstrate the technique in the context of a simple model of a programming language. As described in Section 3, Python is a more dynamic language than our model, and these dynamic features may represent possible side channels for malicious adversaries.

A.1 Abstract Domains

We provide the formal definitions of the abstract domains used in PRIVGUARD in this section. Formally, given a concrete domain D, we define the following components:

• an abstract domain ⟨D♯, ⊑⟩ containing abstract values (D♯), which represent classes of concrete values, and a partial ordering (⊑) on those values;

• an abstraction function α : D → D♯ mapping concrete values to abstract values;

• a concretization function γ : D♯ → P(D) mapping abstract values to sets of concrete values.

For each attribute type, we define D♯, ⊑, α, and γ. We define each abstract domain in terms of a tabular data format approximating a pandas dataframe (which we denote df).

As described earlier, some kinds of loops may cause the abstract interpreter to loop forever. To address this challenge, we adopt the standard approach of using a widening operator [49], denoted ∇, in place of the standard partial ordering operator ⊑. Unlike the partial ordering, the widening operator is guaranteed to stabilize when applied repeatedly in a loop. Finite abstract domains do not require a widening operator; for infinite domains (like the interval domain used in FILTER attributes), we adopt the standard widening operator for the underlying domain (e.g., widening for intervals [49]).

Filter Attributes. We track filtering done by analysis programs using an interval domain [49], which is commonly used in the abstract interpretation literature. Abstract dataframes in this domain (denoted D♯) associate an interval (denoted I) with each column ci, and analysis results are guaranteed to lie within the interval. In addition to known intervals (i.e., (n1, n2)), our set of intervals includes ⊤ (i.e., the interval (−∞, ∞)) and ⊥ (i.e., the interval containing no numbers). Our interval domain works for dataframe columns containing integers or real numbers; our formalization below uses R∞ to denote the set of real numbers extended with infinity.
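For concreteness, the standard widening operator for intervals [49] can be written as:

(l1, u1) ∇ (l2, u2) = (l3, u3), where l3 = −∞ if l2 < l1 and l3 = l1 otherwise, and u3 = +∞ if u2 > u1 and u3 = u1 otherwise.

Intuitively, any bound that is still moving after an iteration is immediately pushed to infinity, so repeated application stabilizes.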

i ∈ IR = (R∞ × R∞) ∪ {⊤, ⊥}
df♯ ∈ D♯ = (c1 : IR) × ... × (cn : IR)

f ∈ field    m ∈ int    s ∈ schema    x ∈ dataframes

ϕ ∈ filter ::= c < m | c > m
e ∈ expr ::= x | filter(ϕ, e) | project(s, e) | redact(c, n, n, e) | join(e, e) | union(e, e) | dpCount(ε, δ, e)

Figure 9: Program surface syntax

eval(ρ, x) = ρ(x)
eval(ρ, filter(ϕ, e)) = σϕ eval(ρ, e)
eval(ρ, project(s, e)) = Πs eval(ρ, e)
eval(ρ, redact(c, n1, n2, e)) = {c : stars(s, n1, n2) | c : s ∈ eval(ρ, e)}
eval(ρ, join(e1, e2)) = eval(ρ, e1) ⋈ eval(ρ, e2)
eval(ρ, union(e1, e2)) = eval(ρ, e1) ∪ eval(ρ, e2)

Figure 10: Concrete interpreter for the language in Figure 9.
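For readers who prefer code, a simplified pandas rendering of this concrete interpreter is sketched below. The tuple-based expression encoding and the stars helper are our own illustrative choices, and dpCount is omitted, as in Figure 10.

import pandas as pd

def stars(s: str, n1: int, n2: int) -> str:
    # Redact characters in positions [n1, n2) with '*'.
    return s[:n1] + "*" * len(s[n1:n2]) + s[n2:]

def eval_concrete(env: dict, e: tuple) -> pd.DataFrame:
    """Concrete interpreter for the mini-language of Figure 9 (dpCount omitted)."""
    op = e[0]
    if op == "var":                      # x
        return env[e[1]]
    if op == "filter":                   # filter(phi, e): phi = (col, "<" or ">", m)
        col, cmp_, m = e[1]
        df = eval_concrete(env, e[2])
        return df[df[col] < m] if cmp_ == "<" else df[df[col] > m]
    if op == "project":                  # project(s, e)
        return eval_concrete(env, e[2])[list(e[1])]
    if op == "redact":                   # redact(c, n1, n2, e)
        col, n1, n2 = e[1], e[2], e[3]
        df = eval_concrete(env, e[4]).copy()
        df[col] = df[col].map(lambda s: stars(s, n1, n2))
        return df
    if op == "join":                     # join(e1, e2): natural join on shared columns
        return eval_concrete(env, e[1]).merge(eval_concrete(env, e[2]))
    if op == "union":                    # union(e1, e2)
        return pd.concat([eval_concrete(env, e[1]), eval_concrete(env, e[2])])
    raise ValueError(f"unknown expression: {op}")

env = {"x": pd.DataFrame({"age": [15, 30, 42], "zip": ["94704", "05401", "24060"]})}
print(eval_concrete(env, ("filter", ("age", ">", 18), ("var", "x"))))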

For ease of presentation, and without loss of generality, we define α and γ in terms of dataframes with a single column c. We denote the values in column c by df.c.

c : ⊥ ⊑ D♯
D♯ ⊑ c : ⊤
c : (n1, n2) ⊑ c : (n3, n4) if n1 ≥ n3 ∧ n2 ≤ n4
α(df) = c : (min(df.c), max(df.c))
γ(c : ⊤) = D
γ(c : ⊥) = {}
γ(c : (n1, n2)) = {df | ∀v ∈ df.c. n1 ≤ v ≤ n2}
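The single-column interval domain above can be transcribed almost directly into code. The sketch below is our own illustration of ⊑, α, and membership in γ, not PRIVANALYZER's implementation; infinite bounds stand in for ⊤.

import math
from dataclasses import dataclass

import pandas as pd

@dataclass(frozen=True)
class Interval:
    """Abstract value for one column c: a closed interval; (-inf, inf) plays the role of TOP."""
    lo: float = -math.inf
    hi: float = math.inf

BOTTOM = Interval(math.inf, -math.inf)  # empty interval: represents no concrete values

def leq(a: Interval, b: Interval) -> bool:
    # c:(n1,n2) ⊑ c:(n3,n4) iff n1 >= n3 and n2 <= n4 (smaller intervals are lower).
    return a.lo >= b.lo and a.hi <= b.hi

def alpha(df: pd.DataFrame, c: str) -> Interval:
    # α(df) = c:(min(df.c), max(df.c))
    return Interval(float(df[c].min()), float(df[c].max()))

def gamma_contains(a: Interval, df: pd.DataFrame, c: str) -> bool:
    # df ∈ γ(c:(lo,hi)) iff every value in column c lies within [lo, hi].
    return bool(((df[c] >= a.lo) & (df[c] <= a.hi)).all())

ages = pd.DataFrame({"age": [21, 35, 60]})
abstract_ages = alpha(ages, "age")            # Interval(21.0, 60.0)
assert gamma_contains(abstract_ages, ages, "age")
assert leq(abstract_ages, Interval(18, math.inf))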

A.2 Soundness

A sound analysis by abstract interpretation requires defining the following:

• A programming language of expressions e ∈ Expr. We define a simple language for dataframes, inspired by pandas, in Figure 9.

• A concrete interpreter eval : Env × Expr → D specifying the semantics of the programming language on concrete values. We define the concrete interpreter for our simple language in Figure 10.

• An abstract interpreter eval♯ : Env♯ × Expr → D♯ specifying the semantics of the programming language on abstract values. An example for FILTER attributes appears in Figure 11.

The concrete interpreter eval computes the concrete result of a program e in the context of an environment mapping variables to concrete values. The abstract interpreter eval♯ computes an output policy of a program e in the context of an abstract environment mapping variables to their policies. The program satisfies its input policies if it has at least one empty clause (i.e., a satisfied clause) in its output policy.


eval♯(ρ, x) = ρ(x)
eval♯(ρ, filter(ϕ, e)) = eval♯(ρ, e) − interval(ϕ)
eval♯(ρ, project(s, e)) = eval♯(ρ, e)
eval♯(ρ, redact(c, n1, n2, e)) = eval♯(ρ, e)
eval♯(ρ, join(e1, e2)) = eval♯(ρ, e1) ⊔ eval♯(ρ, e2)
eval♯(ρ, union(e1, e2)) = eval♯(ρ, e1) ⊔ eval♯(ρ, e2)

interval(c < m) = c : (−∞, m)
interval(c > m) = c : (m, ∞)

c : (l1, u1) − c : (l2, u2) = c : (l3, u3)
where l3 = −∞ when l1 ≤ l2, and l1 otherwise
      u3 = ∞ when u1 ≥ u2, and u1 otherwise

Figure 11: Abstract interpreter for FILTER attributes
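Using the same tuple-based expression encoding as in the concrete-interpreter sketch above, and restricting attention to a single column c, the FILTER abstract interpreter of Figure 11 can be sketched as follows. The residual interval computed for filter follows the "−" rule defined above; this is again an illustration rather than PRIVANALYZER's code.

import math

def interval_of(phi):
    # interval(c < m) = (-inf, m); interval(c > m) = (m, inf)
    col, cmp_, m = phi
    return (col, (-math.inf, m)) if cmp_ == "<" else (col, (m, math.inf))

def subtract(a, b):
    # c:(l1,u1) - c:(l2,u2): drop each policy bound that the program's filter already enforces.
    (l1, u1), (l2, u2) = a, b
    return (-math.inf if l1 <= l2 else l1, math.inf if u1 >= u2 else u1)

def join_abs(a, b):
    # Least upper bound of two intervals (used for join and union).
    return (min(a[0], b[0]), max(a[1], b[1]))

def eval_abstract(env, e):
    """Abstract interpreter for FILTER attributes over the mini-language of Figure 9."""
    op = e[0]
    if op == "var":
        return env[e[1]]                       # the policy's residual interval for column c
    if op == "filter":
        _, bound = interval_of(e[1])
        return subtract(eval_abstract(env, e[2]), bound)
    if op in ("project", "redact"):
        return eval_abstract(env, e[-1])
    if op in ("join", "union"):
        return join_abs(eval_abstract(env, e[1]), eval_abstract(env, e[2]))
    raise ValueError(f"unknown expression: {op}")

# Policy on x: rows must satisfy age > 18, i.e. residual requirement (18, inf).
env = {"x": (18, math.inf)}
print(eval_abstract(env, ("filter", ("age", ">", 18), ("var", "x"))))  # (-inf, inf): requirement discharged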

The soundness property for the abstract interpreter says that the concrete result of evaluating a program e is contained in the class of values represented by the result of evaluating the same program using the abstract interpreter.

Theorem 1 (Soundness). For all environments ρ and expressions e, the abstract interpreter eval♯ is a sound approximation of the concrete interpreter eval:

eval(ρ, e) ∈ γ(eval♯(α(ρ), e))

where the abstract environment α(ρ) is obtained by abstracting each value in the concrete environment ρ (i.e., α(ρ)(x) = α(ρ(x))).

Soundness can be proven for each abstract domain separately. In each case, the proof of soundness proceeds by induction on e, with case analysis on the kind of abstract value returned by uses of eval♯ on subterms.

Soundness for Filter Attributes. We present the abstract interpreter for the filter abstract domain in Figure 11. The interesting case of this interpreter is the one for filter expressions, which converts the filter predicate ϕ to an interval and returns an abstract value derived from the meet of this interval and the recursive call. We prove the soundness of the interpreter as follows.

Proof of soundness. By induction on e. We consider the (representative) case where e = filter(ϕ, e′). By the inductive hypothesis, we have that eval(ρ, e′) ∈ γ(eval♯(α(ρ), e′)). Let interval(ϕ) = (n1, n2) and eval♯(α(ρ), e′) = c : (n3, n4). We want to show (by the definitions of eval and eval♯):

eval(ρ, filter(ϕ, e′)) ∈ γ(eval♯(α(ρ), filter(ϕ, e′)))
⟺ σϕ eval(ρ, e′) ∈ γ(eval♯(α(ρ), e′) − interval(ϕ))
⟺ σ_{n1 ≤ c ≤ n2} eval(ρ, e′) ∈ γ(c : (n1, n2) ⊓ eval♯(α(ρ), e′))
⟺ σ_{n1 ≤ c ≤ n2} eval(ρ, e′) ∈ γ(c : (max(n1, n3), min(n2, n4)))
⟺ σ_{n1 ≤ c ≤ n2} eval(ρ, e′) ∈ {df | ∀v ∈ df.c. max(n1, n3) ≤ v ≤ min(n2, n4)}

which holds by the inductive hypothesis and the semantics of selection in relational algebra.

B Usability Survey

To complement the survey in [53] targeting privacy champions, we conducted an online survey targeting general users to obtain a preliminary understanding of how much expertise is needed to understand or encode privacy preferences. The survey was granted an IRB exemption by the Office for Protection of Human Subjects under category 2 of the Federal and/or UC Berkeley requirements.

We recruited 30 participants in total, among whom 7 have no background in programming (Group A), 2 have programmed in one language (Group B), 15 have programmed in two or more languages (Group C), and 6 self-identify as experts in programming languages (Group D). The survey comprises 8 questions in total: the first three test understanding of privacy policies, and the latter five ask participants to choose the correct option from 4 possible choices. Group A made 38.1% (8/21) mistakes when understanding and 31.4% (11/35) mistakes when selecting. Group B made 16.7% (1/6) mistakes when understanding and 20.0% (2/10) mistakes when selecting. Group C made 15.6% (7/45) mistakes when understanding and 17.3% (13/75) mistakes when selecting. Group D made 11.1% (2/18) mistakes when understanding and 13.3% (4/30) mistakes when selecting. In addition, each question has a different focus; for example, Question 2 focuses on understanding ROLE and PURPOSE attributes.

We observe several interesting facts in the survey results. First, there is a large gap in accuracy between Group A and Group B. This indicates that it might not be trivial for users without programming experience to specify their privacy preferences in LEGALEASE directly. Although out of the scope of this paper, we consider it an important future direction to simplify this process for Group A users through a more user-friendly API or machine-learning-based translation tools. This also shows that any programming experience helps with understanding and encoding in LEGALEASE. Second, there is no obvious accuracy gap between Group B and Group C, and Group D has better accuracy than both. Third, it is hard for all groups to answer questions related to PRIVACY attributes. The difficulty stems from the difficulty of understanding privacy techniques such as differential privacy.

Ethical Considerations. The survey was posted as a public questionnaire on Twitter and WeChat with informed consent. Participants opted into the survey voluntarily. To fully respect participants' privacy, we do not collect any personally identifiable information from them; only the answers to the questionnaire are collected.