Security, privacy and trust: why and how might we control access to research data? Paul Burton, Rebecca Wilson. University of Bristol, D2K Research Program. NISO Symposium, Denver, 11th September 2016.
Transcript
Page 1: Burton - Security, Privacy and Trust

Security, privacy and trust: why and how might we control access to research data?

Paul Burton, Rebecca Wilson

University of Bristol, D2K Research Program

NISO Symposium, Denver

11th September, 2016

Page 2

• Perhaps the most important message is in the title:

• This is a complex challenge involving science, technology, governance and other fundamental social issues

• No single solution will be adequate

• True transdisciplinary programs of work are essential

• Even the most complex and sophisticated of solutions will never offer fully effective exploitation of available data with zero risk of mistakes in managing data or of malign interference

Security, privacy and trust

Page 3

From the Preface:

“Our view has always been that anonymisation is a heavily context-dependent process and only by considering the data and its environment as a total system (which we call the data situation), can one come to a well informed decision about whether and what anonymisation is needed.”

Page 4

Controlling access to research data (security): why and when?

Page 5

• Who might share data?

• Distinct generator and user

• Share data across a consortium

• How is ‘sharing’ achieved?

• Physically transfer data to a user

• Provide access to analyse

• Analysis on-site

• Remote analysis

• Federated analysis

• All interpretations valid and important

What does “sharing” research microdata mean?

Page 6

• Management of intellectual property invested and held in data – most areas of research

• Legal, ethical and other governance stipulations to protect the welfare of research participants – particularly in health/social/biomedical research

• Disclosure of identity

• Disclosure of associated information

• Balance between these and the societal benefits of streamlined comprehensive data access – which is evolving rapidly with time and social context

Why control research data at all?

Page 7

The Research Data Pipeline: when are data “at risk”? [figure]

Page 8

The Research Data Pipeline: when are data “at risk”? [figure, highlighting dissemination and evidence-based action]


Page 10

• Risks associated with: storage; transmission; use

• Accidental vs deliberate violations

• Direct vs inferential disclosure

• Risks and remedies lie in the nature of the data themselves and the contextual environment in which the data are to be used – and potentially misused

• Mark Elliot, Elaine Mackey, Keith Spicer and Caroline Tudor, The Anonymisation Decision-Making Framework, 2016

Issues to consider

Page 11

• Consider the user(s)

• Are they bona fide researchers?

• Consider the application – for example:

• Does it violate (or potentially violate) any of the ethical permissions granted to the study or any of the consent forms signed by the participants or their guardians?

• Is there a risk it may produce information that may allow individual cohort members to be identified?

How to implement control in practice? UK MetaDAC and ALSPAC as illustrative examples

Page 12

• Administrative and research data held separately

• Hard copies of data held in locked storage

• Electronic data held on password protected systems with access restricted to those who really need it

• All electronic data held in encrypted form

• Extensive QC (security of quality)

• Multiple back-ups (security of existence)

Managing the data and the data environment

Page 13

• All data pseudonymised before release

• If pseudonymisation is scientifically impossible, data can be analysed ‘on site’

• All data transfers encrypted using standard protocols

• All linked data released on study-specific IDs

• Explicit acknowledgment that no system can guarantee a zero risk of disclosure or misuse of data

Managing the data and the data environment

Page 14

• A strong underpinning governance structure is essential.

• The EAGDA (Expert Advisory Group on Data Access) 2015 report considered, amongst many other key issues, that:

EAGDA report, 2015

• Governance must be proportionate and context appropriate

• Must be transparent, auditable and appealable

• Need mutual trust and respect amongst stakeholders

Page 15

• Applicants for data through MetaDAC or from ALSPAC sign up agreeing to governance documents which include statements such as:

• Applicants are reminded that the Terms and Conditions for the cohort explicitly forbid any attempt to identify individuals or to compromise or otherwise infringe the confidentiality of information on data subjects and their right to privacy.

• Do you understand that you must not pass on any data or samples awarded, or any derived variables or genotypes generated by this application to a third party (i.e. to anybody that is not included in this list of applicants on this project, nor is a direct employee of one of these applicants)?

How to implement control in practice? UK MetaDAC and ALSPAC as illustrative examples

Page 16

Role of encryption and other technology-based forms of privacy protection in “open science”

Page 17

• When research data are very sensitive, or are seen as having particular intellectual property value, can we develop technology-based solutions that facilitate access to microdata by enhancing privacy protection, so that all intellectual property and governance constraints are met in full while lowering the governance bar? This can promote open science by easing and/or speeding up access requests

• Should be seen as an additional component to be applied on top of a data access and governance system that is already well founded

• EAGDA report emphasises sustainability

Privacy protection in “open science”

Page 18

2009: The DataSHIELD challenge

[Figure: single-site DataSHIELD, with a Data Computer (DC) linked to an Analysis Computer (AC), and multi-site DataSHIELD with horizontally partitioned data]

Given that microdata are scientifically critical and yet potentially sensitive, can we ensure that the information driving analysis of the data at each centre only ever emerges from the firewall in non-disclosive form? (i) encryption (trivial and non-trivial); (ii) low-dimensional (ideally sufficient) statistics

Page 19

• One-step analyses: e.g. ds.table2D – request non-disclosive output from all sources

• Multi-step analyses: e.g. ds.lexis – set up and then request

• Iterative analyses: e.g. ds.glm – parallel processes linked together by non-identifying summary statistics – e.g. for glm: score vectors and information matrices

• Can be used as equivalent to full individual-level analysis or to study-level meta-analysis

The DataSHIELD solution
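The iterative ds.glm mechanics can be illustrated outside DataSHIELD itself. The sketch below is Python/numpy rather than the R/Opal stack, and every function name in it is illustrative, not part of the DataSHIELD API: each simulated study computes only its score vector and information matrix behind its own firewall, and the analysis computer sums these non-identifying summaries before taking each Newton-Raphson step.

```python
import numpy as np

def local_score_info(X, y, beta):
    """Run INSIDE each study's firewall: only the score vector and
    information matrix (low-dimensional summaries) are returned."""
    p = 1.0 / (1.0 + np.exp(-X @ beta))          # fitted probabilities
    score = X.T @ (y - p)                        # score vector
    info = X.T @ (X * (p * (1.0 - p))[:, None])  # information matrix
    return score, info

def pooled_glm(studies, n_params, n_iter=25):
    """Run on the analysis computer: sum the summaries from every
    study, then take a Newton-Raphson update step."""
    beta = np.zeros(n_params)
    for _ in range(n_iter):
        summaries = [local_score_info(X, y, beta) for X, y in studies]
        score = sum(s for s, _ in summaries)
        info = sum(i for _, i in summaries)
        beta = beta + np.linalg.solve(info, score)
    return beta

# Three simulated studies holding horizontally partitioned data
rng = np.random.default_rng(1)
true_beta = np.array([-0.3, 0.5])
studies = []
for _ in range(3):
    X = np.column_stack([np.ones(400), rng.normal(size=400)])
    y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ true_beta)))
    studies.append((X, y))

beta_pooled = pooled_glm(studies, 2)

# Equivalent to one conventional fit on the physically combined data
X_all = np.vstack([X for X, _ in studies])
y_all = np.concatenate([y for _, y in studies])
beta_direct = pooled_glm([(X_all, y_all)], 2)
```

Because per-study scores and information matrices add exactly, the pooled estimates coincide with a conventional analysis of the physically combined data, which is the equivalence demonstrated on Page 26.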

Page 20

DataSHIELD

b.vector <- c(0, 0, 0, 0)
glm(cc ~ 1 + BMI + BMI.456 + SNP, family = binomial, start = b.vector, maxit = 1)

Analysis commands (1)

Page 21

[Figure: in response to the first command, each study returns non-disclosive Summary Statistics (1) – a score vector and an information matrix; e.g. Study 5 returns [36, 487.2951, 487.2951, 149]]

DataSHIELD

Page 22

[Figure: the analysis computer sums (Σ) the information matrices and score vectors returned by each study as Summary Statistics (1); e.g. Study 5’s [36, 487.2951, 487.2951, 149]]

DataSHIELD

Page 23

b.vector <- c(-0.322, 0.0223, 0.0391, 0.535)
glm(cc ~ 1 + BMI + BMI.456 + SNP, family = binomial, start = b.vector, maxit = 1)

Analysis commands (2)

DataSHIELD

Page 24

and so on .....

Page 25

Final parameter estimates: updated parameters (4)

Parameter   Σ coefficient estimate   Std error
Intercept   -0.3296                  0.02838
BMI          0.02300                 0.00621
BMI.456      0.04126                 0.01140
SNP          0.5517                  0.03295

DataSHIELD

Page 26

Does it work?

DataSHIELD analysis
Parameter     Coefficient   Standard error
b_intercept   -0.3296       0.02838
b_BMI          0.02300      0.00621
b_BMI.456      0.04126      0.01140
b_SNP          0.5517       0.03295

Direct conventional analysis
Coefficients:  Estimate   Std. error
(Intercept)   -0.32956    0.02838
BMI            0.02300    0.00621
BMI.456        0.04126    0.01140
SNP            0.55173    0.03295

Page 27

Individual-level data are never transmitted to, or seen by, the statistician in charge, or by anybody outside the original centre in which they are stored.

[Figure: an analysis client (client-side functions) connects through web services and the BioSHaRE web site to Opal data servers (server-side functions, running R) for the Finrisk, Prevend and 1958BC studies]

DataSHIELD: current implementation for horizontally partitioned data

Page 28

[Figure: an analysis computer connects through web services to Opal data computers holding linked NHS, ALSPAC and Education data]

Regression coefficients = (XᵀX)⁻¹XᵀY, so we need to calculate XᵀX:

X_AᵀX_A   X_AᵀX_B   X_AᵀX_C
X_AᵀX_B   X_BᵀX_B   X_BᵀX_C
X_AᵀX_C   X_BᵀX_C   X_CᵀX_C

where each cross-source block is a sum over individuals:
X_AᵀX_B = X_A1·X_B1 + X_A2·X_B2 + X_A3·X_B3 + …

DataSHIELD: current implementation for vertically partitioned (linked) data
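The block decomposition of XᵀX can be checked directly. A small numpy sketch follows, with simulated data and hypothetical single-variable sources standing in for the linked NHS/ALSPAC/Education holdings:

```python
import numpy as np

rng = np.random.default_rng(0)
# Three sources each hold one variable (column) for the SAME 7 individuals
X_A = rng.normal(size=(7, 1))
X_B = rng.normal(size=(7, 1))
X_C = rng.normal(size=(7, 1))

# The fully linked design matrix, which no single source ever holds
X = np.hstack([X_A, X_B, X_C])

# X'X assembled purely from within-source and pairwise cross-products
XtX = np.block([
    [X_A.T @ X_A, X_A.T @ X_B, X_A.T @ X_C],
    [(X_A.T @ X_B).T, X_B.T @ X_B, X_B.T @ X_C],
    [(X_A.T @ X_C).T, (X_B.T @ X_C).T, X_C.T @ X_C],
])
```

The assembled block matrix equals X.T @ X computed on the full linked data, so regression needs only the pairwise cross-products, never the pooled microdata.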

Page 29

[Figure: as Page 28, but each data computer now applies its own private encryption matrix (M_A, M_B, M_C) before anything leaves its firewall]

Regression coefficients = (XᵀX)⁻¹XᵀY; we still need every cross-source block of XᵀX, e.g.
X_AᵀX_B = X_A1·X_B1 + X_A2·X_B2 + X_A3·X_B3 + …
but now without revealing X_A or X_B themselves.

DataSHIELD: current implementation for vertically partitioned (linked) data

plain.text.vector.A: 0 1 1 1 0 0 1
plain.text.vector.N: 1 1 0 1 0 0 1

encryption.matrix
          [,1]      [,2]      [,3]
[1,] -1.444769  2.495677 -5.322736
[2,] -1.355529 -9.369041  2.687347
[3,]  4.603762 -3.622044 -2.817478

occluded.matrix.A
           [,1] [,2]       [,3]
[1,] -1.4546711    0  4.0722205
[2,]  6.4809785    1 -4.5814726
[3,]  4.4954801    1 -8.7036260
[4,]  0.1995684    1 -8.6872205
[5,] -6.4060220    0 -6.6471777
[6,] -0.5164345    0 -0.2564673
[7,] -5.8981933    1 -8.5032852

Page 30

[Figure: the analysis computer combines encrypted cross-products from the Opal data computers (NHS, ALSPAC, Education)]

Regression coefficients = (XᵀX)⁻¹XᵀY; we need the cross-source block X_AᵀX_B.

Source A releases M_A X_Aᵀ and source B releases X_B M_B, each masked by its own private encryption matrix. The product M_A X_Aᵀ X_B M_B can therefore be formed and shared safely, and applying the sources’ inverses recovers exactly the block required:

(M_A)⁻¹ M_A X_Aᵀ X_B M_B (M_B)⁻¹ = X_AᵀX_B

DataSHIELD: current implementation for vertically partitioned (linked) data

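The masking identity (M_A)⁻¹ M_A X_Aᵀ X_B M_B (M_B)⁻¹ = X_AᵀX_B is easy to verify numerically. A numpy sketch, with random data and random masks standing in for the sources’ private encryption matrices:

```python
import numpy as np

rng = np.random.default_rng(7)
X_A = rng.normal(size=(7, 2))  # source A's variables, 7 shared individuals
X_B = rng.normal(size=(7, 2))  # source B's variables, same individuals

# Each source keeps its own random (almost surely invertible) mask private
M_A = rng.normal(size=(2, 2))
M_B = rng.normal(size=(2, 2))

# Only the masked quantities M_A X_A' and X_B M_B cross a firewall
masked_product = (M_A @ X_A.T) @ (X_B @ M_B)  # = M_A X_A' X_B M_B

# Applying the private inverses recovers exactly the needed block of X'X
recovered = np.linalg.inv(M_A) @ masked_product @ np.linalg.inv(M_B)
```

The recovered block matches X_A.T @ X_B computed directly, even though neither source ever released its unmasked data.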

Page 31

The core DataSHIELD Development Team

Page 32

• Becca Wilson – if people want to know technical details about DataSHIELD or methods for secure data sharing/analysis, see 12th September, 11:30–13:00: Secure Multiparty Computation for Statistical Analysis of Private Data

• Demetris Avraam – Poster #12, RDA poster session, Wednesday, Thursday and Friday: DataSHIELD: a method for privacy-protected analysis of individual-level data

• RDA Working Group for Data Security and Trust – RDA 8th Plenary, in the same venue as the NISO symposium; session on 17th September, 11:00–12:30. We are running a survey to gather information on current data security practices in our community; the survey is available at www.bit.ly/dash-ing (see below)

• Data to Knowledge (D2K) Research Group – contact details: @Data2Knowledge; there is also a “contact us” page on www.datashield.ac.uk

• Setting up a professional community for stakeholders in health data sharing – DASH-ING: DAta Sharing for Health - INnovation Group. The website will initially (and temporarily) be at www.bit.ly/dash-ing; this webpage contains the questions for the RDA survey and a link to join the professional community

Additional opportunities for interaction

Page 33

THANK YOU FOR LISTENING

Page 34

> plain.text.vector.L
[1] 0 1 1 1 0 0 1
> plain.text.vector.N
[1] 1 1 0 1 0 0 1
> sum(plain.text.vector.L * plain.text.matrix.N)
[1] 3
> t(matrix(plain.text.vector.L)) %*% matrix(plain.text.vector.N)
     [,1]
[1,]    3

How does matrix-based encryption work?

Page 35

> plain.text.vector.L
[1] 0 1 1 1 0 0 1
> occluded.matrix.L
           [,1] [,2]       [,3]
[1,] -1.4546711    0  4.0722205
[2,]  6.4809785    1 -4.5814726
[3,]  4.4954801    1 -8.7036260
[4,]  0.1995684    1 -8.6872205
[5,] -6.4060220    0 -6.6471777
[6,] -0.5164345    0 -0.2564673
[7,] -5.8981933    1 -8.5032852
> e.mat.L
          [,1]      [,2]      [,3]
[1,] -1.444769  2.495677 -5.322736
[2,] -1.355529 -9.369041  2.687347
[3,]  4.603762 -3.622044 -2.817478
> e.mat.L %*% occluded.matrix.L
          [,1]      [,2]      [,3]      [,4]       [,5]        [,6]       [,7]
[1,] -19.57369  17.51813  42.32785  48.44713  44.636397  2.11123627  56.277949
[2,]  12.91532 -30.46620 -38.85246 -32.98514  -9.179719  0.01082581 -24.225142
[3,] -18.17035  39.12303  41.59635  21.77277 -10.763524 -1.65495077  -6.818104

How does matrix-based encryption work?

Page 36

> e.mat.L %*% occluded.matrix.L
          [,1]      [,2]      [,3]      [,4]       [,5]        [,6]       [,7]
[1,] -19.57369  17.51813  42.32785  48.44713  44.636397  2.11123627  56.277949
[2,]  12.91532 -30.46620 -38.85246 -32.98514  -9.179719  0.01082581 -24.225142
[3,] -18.17035  39.12303  41.59635  21.77277 -10.763524 -1.65495077  -6.818104
> plain.text.vector.N
[1] 1 1 0 1 0 0 1
> e.mat.L %*% occluded.matrix.L %*% plain.text.matrix.N
          [,1]
[1,] 102.66952
[2,] -74.76116
[3,]  35.90735
> inv.e.mat.L %*% e.mat.L %*% occluded.matrix.L %*% plain.text.matrix.N
           [,1]
[1,]  -0.6723174
[2,]   3.0000000
[3,] -17.6997578
> sum(plain.text.vector.L * plain.text.matrix.N)
[1] 3

How does matrix-based encryption work?
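The occlude, encrypt, multiply and decrypt cycle shown in the R console output on Pages 34–36 can be reproduced in a few lines. A numpy sketch, using the same plain-text vectors but fresh random occlusion rows and a fresh random encryption matrix (so the numbers differ from the slides):

```python
import numpy as np

rng = np.random.default_rng(42)
L = np.array([0, 1, 1, 1, 0, 0, 1], dtype=float)  # first party's plain text
N = np.array([1, 1, 0, 1, 0, 0, 1], dtype=float)  # second party's plain text

# Occlude: bury L as the middle row of a 3x7 matrix between random rows
occluded = np.vstack([rng.normal(size=7), L, rng.normal(size=7)])

# Encrypt with a private, invertible 3x3 matrix; only this leaves the firewall
e_mat = rng.normal(size=(3, 3))
encrypted = e_mat @ occluded

# The second party multiplies by its own vector and returns a length-3 result
returned = encrypted @ N

# Decrypting locally reveals the inner product L.N in the occluded row;
# the other two entries are uninformative noise
decrypted = np.linalg.inv(e_mat) @ returned
```

As on Page 36, only the row where L was hidden carries meaning: it equals the inner product L·N = 3, while the surrounding entries are noise.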

Page 37

> plain.text.vector.L
[1] 0 1 1 1 0 0 1
> e.mat.1
        [,1]
[1,] 7.13763
> e.mat.1 %*% t(matrix(plain.text.vector.L))
     [,1]    [,2]    [,3]    [,4] [,5] [,6]    [,7]
[1,]    0 7.13763 7.13763 7.13763    0    0 7.13763
> e.mat.1 %*% t(matrix(plain.text.vector.L)) %*% plain.text.matrix.N
         [,1]
[1,] 21.41289
> (1/e.mat.1) * e.mat.1 %*% t(matrix(plain.text.vector.L)) %*% plain.text.matrix.N
     [,1]
[1,]    3

Why do we need to occlude the original plain text vector?
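The answer to this question can be demonstrated directly: without occlusion, the released vector e * L betrays L. A numpy sketch, using the slide's vector and its e.mat.1 value as the scalar key:

```python
import numpy as np

L = np.array([0, 1, 1, 1, 0, 0, 1], dtype=float)
e = 7.13763  # scalar 'key' (the slide's e.mat.1 value), supposedly private

# Without occlusion, what leaves the firewall is simply e * L
released = e * L

# The zero pattern of L leaks immediately...
assert np.array_equal(released != 0, L != 0)

# ...and because L is binary, scaling by the smallest non-zero entry
# recovers L exactly without ever learning e
recovered = released / released[released != 0].min()
assert np.array_equal(recovered, L)
```

Hiding L among random rows before encrypting, as on Pages 35 and 36, removes exactly this leak: an observer of the encrypted matrix cannot tell which linear combination of rows is the plain text.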

Page 38

• Is there a significant risk of upsetting or alienating cohort members, or of reducing their willingness to remain as active participants?

• Does it address topics that fall within the acknowledged scientific remit of the cohort?

• Is access requested to an infinite resource (data or cell-line DNA) or to a depletable resource? If non-depletable, NO assessment is made of the science underpinning the application

UK MetaDAC and ALSPAC as illustrative examples

Page 39

• Wish to control access

• Undesirable loss of intellectual property

• Violation of legal, ethical and other governance stipulations – particularly disclosure of identity and/or associated information

• Wish to ensure data used widely for scientific research and that access procedures are streamlined

• Data and contextual “data environment” both crucial and control systems are typically complex and multi-faceted

How should access be controlled?