The Role of Agents in Distributed Data Mining: Issues and Benefits Josenildo Costa da Silva 1 , Matthias Klusch 1 , Stefano Lodi 2 , Gianluca Moro 2 , Claudio Sartori 2 1 Deduction and Multiagent Systems, German Research Center for Artificial Intelligence, Saarbruecken, Germany 2 Department of Electronics, Computer Science and Systems, University of Bologna, Bologna, Italy

Feb 22, 2016

Page 1: The Role of Agents in Distributed Data Mining: Issues and Benefits

The Role of Agents in Distributed Data Mining: Issues and Benefits

Josenildo Costa da Silva 1, Matthias Klusch 1, Stefano Lodi 2, Gianluca Moro 2, Claudio Sartori 2

1 Deduction and Multiagent Systems, German Research Center for Artificial Intelligence, Saarbruecken, Germany
2 Department of Electronics, Computer Science and Systems, University of Bologna, Bologna, Italy

Page 2: The Role of Agents in Distributed Data Mining: Issues and Benefits

22/04/23, AgentLink III: TFG1 IIA4WE, Roma

Distributed Data Mining (DDM)

• Data sets
  – Massive
  – Inherently distributed
• Networks
  – Limited bandwidth
  – Limited computing resources at nodes
• Privacy and security
  – Sensitive data
  – Share goals, not data

Page 3: The Role of Agents in Distributed Data Mining: Issues and Benefits

Centralized solution

• Apply traditional DM algorithms to data retrieved from different sources and stored in a data warehouse
• May be impractical or even impossible for some business settings
  – Autonomy of data sources
  – Data privacy
  – Scalability (~TB/d)

Page 4: The Role of Agents in Distributed Data Mining: Issues and Benefits

Agents and DDM

• DDM exploits distributed processing and problem decomposability
• Is there any real added value of using concepts from agent technology in DDM?
  – Few DDM algorithms use agents
  – Evidence that cooperation among distributed DM processes may allow effective mining even without centralized control
  – Autonomy, adaptivity, deliberative reasoning naturally fit into the DDM framework

Page 5: The Role of Agents in Distributed Data Mining: Issues and Benefits

State of the Art

• BODHI
  – Mobile agent platform/framework for collective DM on heterogeneous sites
• PADMA
  – Clustering homogeneous sites
  – Agent-based text classification/visualization
• JAM
  – Metalearning, classifiers
• Papyrus
  – Wide area DDM over clusters
  – Move data/models/results to minimize network load

Page 6: The Role of Agents in Distributed Data Mining: Issues and Benefits

Agents for DDM (pros)

• Autonomy of data sources
• Scalability of DM to massive distributed data
• Multi-strategy DDM
• Collaborative DM

Page 7: The Role of Agents in Distributed Data Mining: Issues and Benefits

Agents for DDM (against)

• Need to enforce minimal privileges at a data source
  – Unsolicited access to sensitive data
  – Eavesdropping
  – Data tampering
  – Denial of service attacks

Page 8: The Role of Agents in Distributed Data Mining: Issues and Benefits

The Inference Problem

• Work in statistical DB (mid-1970s)
• Integration/aggregation at the summary level is inherent in DDM
  – Sensitive data can be inferred, to a certain extent and with some probability, even from partial integration (inference problem)
  – Existing DDM systems are not capable of coping with the inference problem

Page 9: The Role of Agents in Distributed Data Mining: Issues and Benefits

Data Clustering

• Popular problem
  – Statistics (cluster analysis)
  – Pattern Recognition
  – Data Mining
• Decompose multivariate data set into groups of objects
  – Homogeneity within groups
  – Separation between groups

Page 10: The Role of Agents in Distributed Data Mining: Issues and Benefits

DE-clustering

• Clustering based on non-parametric density estimation
  – Construct an estimate of the probability density function from the data set
  – Objects “attracted” by a local maximum of the estimate belong to the same cluster
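The two steps above can be sketched in one dimension as follows. This is a minimal illustration, not the authors' implementation: the Gaussian kernel, the fixed-step hill climbing, and the names `density`, `climb`, `de_cluster`, and `merge_tol` are all assumptions made for the example.

```python
import math

def density(x, data, h):
    """Unnormalized Gaussian kernel density estimate at x (1-D)."""
    return sum(math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in data)

def climb(x, data, h, step=0.01, max_iter=10000):
    """Move x uphill in small fixed steps until neither neighbour is higher."""
    for _ in range(max_iter):
        here = density(x, data, h)
        left, right = density(x - step, data, h), density(x + step, data, h)
        if left <= here >= right:
            break
        x = x - step if left > right else x + step
    return x

def de_cluster(data, h, merge_tol=0.1):
    """Label each object by the local maximum that attracts it."""
    peaks, labels = [], []
    for xi in data:
        peak = climb(xi, data, h)
        for k, p in enumerate(peaks):
            if abs(p - peak) < merge_tol:   # same maximum, same cluster
                labels.append(k)
                break
        else:
            peaks.append(peak)
            labels.append(len(peaks) - 1)
    return labels, peaks

data = [1.0, 1.2, 0.9, 5.0, 5.1, 4.8]
labels, peaks = de_cluster(data, h=0.5)
print(labels, peaks)
```

Objects that climb to the same maximum receive the same label, so the two well-separated groups in `data` end up in two clusters.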

Page 11: The Role of Agents in Distributed Data Mining: Issues and Benefits

Kernel Density Estimation

• The higher the number of data objects in the neighbourhood of x, the higher the density at x
• A data object xi exerts more influence on the value of the estimate at x than any data object farther from x than xi
• The influence of data objects is radial

Page 12: The Role of Agents in Distributed Data Mining: Issues and Benefits

Formalizing Density Estimators

• The density estimate at a space object x is proportional to a sum of weights
• The sum consists of one weight for every data object
• Weight is a monotonically decreasing function (the kernel) of the distance between x and xi scaled by a factor h (the window width)
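A small sketch of the sum-of-weights description above, in one dimension. The slides only say the estimate is "proportional to" the sum, so the 1/(N·h) normalization used here is an assumption, as are the function names and the sample data.

```python
import math

def uniform_kernel(u):
    """Square pulse: weight 0.5 inside |u| <= 1, zero outside."""
    return 0.5 if abs(u) <= 1 else 0.0

def gaussian_kernel(u):
    """Standard normal density."""
    return math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)

def kde(x, data, h, kernel):
    """Sum of kernel weights, one per data object, scaled by 1/(N*h)."""
    return sum(kernel(abs(x - xi) / h) for xi in data) / (len(data) * h)

data = [1000.0, 1200.0, 4000.0]
print(kde(1100.0, data, 250.0, uniform_kernel))
print(kde(1100.0, data, 250.0, gaussian_kernel))
```

Each data object contributes one weight that depends only on its distance from x divided by the window width h, so nearby objects dominate the estimate.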

Page 13: The Role of Agents in Distributed Data Mining: Issues and Benefits

Kernel Functions

• Uniform kernel

[Plot: uniform kernel; horizontal axis from -3 to 3, vertical axis from 0 to 0.5]

Page 14: The Role of Agents in Distributed Data Mining: Issues and Benefits

Kernel Functions

• Triangular kernel

[Plot: triangular kernel; horizontal axis from -3 to 3, vertical axis from 0 to 1]

Page 15: The Role of Agents in Distributed Data Mining: Issues and Benefits

Kernel Functions

• Epanechnikov’s kernel

[Plot: Epanechnikov’s kernel; horizontal axis from -3 to 3, vertical axis from 0 to 0.3]

Page 16: The Role of Agents in Distributed Data Mining: Issues and Benefits

Kernel Functions

• Gaussian kernel

[Plot: Gaussian kernel; horizontal axis from -3 to 3, vertical axis from 0 to 0.4]
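The four kernels named on these slides are only recoverable as plots. For reference, common textbook definitions are given below. They are an assumption, chosen because their peak values (0.5, 1, about 0.335, about 0.399) are consistent with the vertical axis ranges of the plots; the Epanechnikov form is the variant scaled to unit variance.

```latex
K_{\mathrm{uniform}}(u) =
  \begin{cases} \tfrac{1}{2} & |u| \le 1 \\ 0 & \text{otherwise} \end{cases}
\qquad
K_{\mathrm{triangular}}(u) =
  \begin{cases} 1 - |u| & |u| \le 1 \\ 0 & \text{otherwise} \end{cases}

K_{\mathrm{Epanechnikov}}(u) =
  \begin{cases} \dfrac{3}{4\sqrt{5}}\left(1 - \dfrac{u^{2}}{5}\right) & |u| \le \sqrt{5} \\ 0 & \text{otherwise} \end{cases}
\qquad
K_{\mathrm{Gaussian}}(u) = \frac{1}{\sqrt{2\pi}}\, e^{-u^{2}/2}
```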

Page 17: The Role of Agents in Distributed Data Mining: Issues and Benefits

Example (1/2)

• Uniform kernel, h=250

[Plot: horizontal axis from 2000 to 10000, vertical axis from 0.0001 to 0.0005]

Page 18: The Role of Agents in Distributed Data Mining: Issues and Benefits

Example (2/2)

• Gaussian kernel, h=250

[Plot: horizontal axis from 2000 to 10000, vertical axis from 0.0001 to 0.0005]

Page 19: The Role of Agents in Distributed Data Mining: Issues and Benefits

Distributed Data Clustering (1/2)

• Clustering algorithm A( )
• Homogeneous distributed data clustering problem for A:
  – Data set S
  – Sites Lj
  – Lj stores data set Dj with $\bigcup_{j=1}^{M} D_j = S$

Page 20: The Role of Agents in Distributed Data Mining: Issues and Benefits

Distributed Data Clustering (2/2)

• Problem: find clustering Cj in the data space of Lj such that:
  – Cj agrees with A(S) (correctness requirement)
  – Time/communication costs are minimized (efficiency requirement)
  – The size of data transferred out of the data space of any Lj is minimized (privacy requirement)

Page 21: The Role of Agents in Distributed Data Mining: Issues and Benefits

Traditional (centralized) solution

• Gather all local data sets into one centralized repository (e.g., a data warehouse)
• Run A( ) on the centralized data set
• Unsatisfied privacy requirement
• Unsatisfied efficiency requirement for some A( )

Page 22: The Role of Agents in Distributed Data Mining: Issues and Benefits

Sampling

• Goal: given some class of functions, represent every member as a sampling series built from:
  – a collection of sampling points
  – some set of suitable expansion functions
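The formulas on this slide were lost in extraction. A generic sampling series, written with assumed symbols ($x_k$ for the sampling points and $\psi_k$ for the expansion functions), has the form:

```latex
f(x) = \sum_{k} f(x_k)\, \psi_k(x)
```

Each term multiplies one sample value $f(x_k)$ by the expansion function attached to that sampling point.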

Page 23: The Role of Agents in Distributed Data Mining: Issues and Benefits

Example

• The class of polynomials of degree [symbol lost] 1
  – Sampling points
  – Expansion functions
• Finite sum
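One standard way to realize this example is Lagrange interpolation: a polynomial of degree at most n-1 equals the finite sum of its n samples times the Lagrange basis polynomials. The slide's own sampling points and expansion functions are not preserved, so the choice of Lagrange polynomials and the names below are assumptions.

```python
def lagrange_basis(points, k, x):
    """k-th Lagrange polynomial: 1 at points[k], 0 at every other point."""
    value = 1.0
    for m, xm in enumerate(points):
        if m != k:
            value *= (x - xm) / (points[k] - xm)
    return value

def sampling_series(f, points, x):
    """Finite sum of samples f(x_k) times expansion functions."""
    return sum(f(xk) * lagrange_basis(points, k, x)
               for k, xk in enumerate(points))

def poly(x):
    return 2.0 * x * x - 3.0 * x + 1.0   # degree 2, so three samples suffice

points = [0.0, 1.0, 2.0]
print(sampling_series(poly, points, 1.5), poly(1.5))
```

Three samples reproduce the degree-2 polynomial exactly at any x, which is the sense in which the sum is finite.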

Page 24: The Role of Agents in Distributed Data Mining: Issues and Benefits

Band-limited Functions

• Function f of one real variable
• Range of frequencies of a function f: the support of the Fourier transform of f
• Any function whose range of frequencies is confined to a bounded set B is called band-limited to B (the band-region)

Page 25: The Role of Agents in Distributed Data Mining: Issues and Benefits

Example: sinc function

[Plots: first panel with horizontal axis from -6 to 6 and vertical axis from -0.4 to 1; second panel with horizontal axis from -7.5 to 7.5 and vertical axis from 0.2 to 1]

Page 26: The Role of Agents in Distributed Data Mining: Issues and Benefits

Sampling Theorem

• If f is band-limited with band-region [formula not preserved], then [formula not preserved]
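The slide's formulas are not preserved. The classical one-dimensional statement is that a function with no frequencies above W/2 is determined by samples spaced 1/W apart and can be rebuilt as a series of shifted sinc functions. The sketch below checks this numerically for an assumed test function, sinc(t) squared, whose frequencies stay below 1, using a slightly finer spacing of 0.4 and a truncated series. All names are illustrative.

```python
import math

def sinc(u):
    """Normalized sinc: sin(pi*u)/(pi*u), with sinc(0) = 1."""
    return 1.0 if u == 0 else math.sin(math.pi * u) / (math.pi * u)

def f(t):
    """sinc(t)**2 has no frequencies above 1, so it is band-limited."""
    return sinc(t) ** 2

def reconstruct(t, spacing, n_terms):
    """Truncated sampling series from samples taken at k * spacing."""
    return sum(f(k * spacing) * sinc((t - k * spacing) / spacing)
               for k in range(-n_terms, n_terms + 1))

approx = reconstruct(0.3, spacing=0.4, n_terms=200)
print(approx, f(0.3))
```

The two printed values agree closely; the small remaining gap comes from cutting the infinite series off after a finite number of terms.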

Page 27: The Role of Agents in Distributed Data Mining: Issues and Benefits

Sampling Theorem (scaled multidimensional version)

• Let [three definitions not preserved], where [symbol lost] is the [index lost]-th component of a vector
• If f is band-limited to B then [formula not preserved]

Page 28: The Role of Agents in Distributed Data Mining: Issues and Benefits

Sampling Density Estimates (1/4)

• Additivity of density estimates of a distributed data set
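Additivity follows from the sum-of-weights form of the estimate: when the local data sets Dj partition S and all sites use the same kernel and window width, the unnormalized global estimate is the sum of the local ones. A quick numeric check, with a Gaussian kernel and made-up data as assumptions:

```python
import math

def local_estimate(x, data, h):
    """Unnormalized Gaussian kernel sum over one site's data."""
    return sum(math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in data)

sites = [[1.0, 1.5, 2.0], [6.0, 6.2], [1.8, 6.1, 9.0]]   # D_1, D_2, D_3
whole = [xi for d in sites for xi in d]                    # S = union of the D_j
h, x = 0.7, 1.9

global_direct = local_estimate(x, whole, h)
global_by_sum = sum(local_estimate(x, d, h) for d in sites)
print(global_direct, global_by_sum)
```

If each site normalized by its own data set size the sums would no longer match, which is why the sketch keeps the estimates unnormalized.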

Page 29: The Role of Agents in Distributed Data Mining: Issues and Benefits

Sampling Density Estimates (2/4)

• The sampling series of the density estimate is also additive

Page 30: The Role of Agents in Distributed Data Mining: Issues and Benefits

Sampling Density Estimates (3/4)

• Truncation errors
  – The support of a kernel function is not bounded in general
• Aliasing errors
  – The support of the Fourier transform of a kernel function is not bounded in general, so kernel functions are not band-limited

Page 31: The Role of Agents in Distributed Data Mining: Issues and Benefits

Sampling Density Estimates (4/4)

• The sampling series of a density estimate can only be approximated
• Trade-off between the number of samples and accuracy
  – Define a minimal multidimensional rectangle outside which samples are negligible
  – Define a vector of sampling intervals such that the aliasing error is negligible

Page 32: The Role of Agents in Distributed Data Mining: Issues and Benefits

The KDEC scheme

• Every site Lj:
  – Computes a local density estimate $\hat{\varphi}[D_j]$ of its data Dj
  – Samples $\hat{\varphi}[D_j]$
  – Sends the samples to H
  – Waits for the samples of the global density estimate $\hat{\varphi}[S]$
  – Reconstructs the global density estimate from its samples
  – Applies DE-clustering to Dj and $\hat{\varphi}[S]$
• Helper H:
  – Waits for the samples of local density estimates $\hat{\varphi}[D_j]$
  – Sums the samples in order
  – Sends the summation back to each Lj
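The exchange between sites and helper can be simulated in a single process. The sketch below is a one-dimensional illustration under stated assumptions (Gaussian kernel, unnormalized estimates, a shared regular grid, sinc reconstruction); the function names, data, and parameters are invented for the example and are not taken from the slides.

```python
import math

def gaussian(u):
    return math.exp(-0.5 * u * u)

def local_samples(data, h, grid):
    """Site step: sample the local density estimate at the agreed grid points."""
    return [sum(gaussian((g - xi) / h) for xi in data) for g in grid]

def helper_sum(all_samples):
    """Helper step: add the sample vectors position by position."""
    return [sum(column) for column in zip(*all_samples)]

def sinc(u):
    return 1.0 if u == 0 else math.sin(math.pi * u) / (math.pi * u)

def reconstruct(x, grid, samples, tau):
    """Site step: approximate the global estimate from its samples."""
    return sum(s * sinc((x - g) / tau) for g, s in zip(grid, samples))

h, tau = 1.0, 0.5                            # window width and sampling interval
grid = [-5.0 + tau * k for k in range(41)]   # agreed sampling region [-5, 15]
sites = [[0.0, 0.4, 1.1], [9.0, 9.5], [0.7, 10.2]]

global_samples = helper_sum([local_samples(d, h, grid) for d in sites])
exact = sum(gaussian((2.3 - xi) / h) for d in sites for xi in d)
approx = reconstruct(2.3, grid, global_samples, tau)
print(exact, approx)
```

Only the 41 sample values per site cross the network in this sketch; the data objects stay local, and each site can then run DE-clustering on its own data against the reconstructed global estimate.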

Page 33: The Role of Agents in Distributed Data Mining: Issues and Benefits

The KDEC scheme

[Diagram: Helper, Site1, Site2]

Page 34: The Role of Agents in Distributed Data Mining: Issues and Benefits

The KDEC scheme

[Figure: Helper with plots over axes x and y; only axis ticks survive in the transcript]

Page 35: The Role of Agents in Distributed Data Mining: Issues and Benefits

The KDEC scheme

[Figure: Helper with plots over axes x and y; only axis ticks survive in the transcript]

Page 36: The Role of Agents in Distributed Data Mining: Issues and Benefits

The KDEC scheme

[Figure: Helper with plots over axes x and y; only axis ticks survive in the transcript]

Page 37: The Role of Agents in Distributed Data Mining: Issues and Benefits

The KDEC scheme

[Figure: Helper with several plots over axes x and y; only axis ticks survive in the transcript]

Page 38: The Role of Agents in Distributed Data Mining: Issues and Benefits

Properties of the approach

• Communication complexity depends only on the number of samples
• Data objects are never transmitted over the network
• Local clusters are close to global clusters which can be obtained using DE-cluster
• Time complexity does not exceed the time complexity of centralized DE-clustering

Page 39: The Role of Agents in Distributed Data Mining: Issues and Benefits

Window width and sampling frequency

• Good estimates when h is not less than a small multiple of the smallest distance between objects
• As [condition not preserved], the number of samples rarely exceeds the number of data points

Page 40: The Role of Agents in Distributed Data Mining: Issues and Benefits

Complexity

• Site j
  – Sampling: O(q(N) Sam)
  – DE-cluster: O(|Dj| q(Dj))
• Helper
  – Summation of samples: O(Sam)
• Communication
  – Time: O(Sam)
  – Volume: O(M Sam)

Page 41: The Role of Agents in Distributed Data Mining: Issues and Benefits

Complexity (centralized approach)

• Site j
  – Transmission/reception of data objects: O(|Dj|)
• Helper
  – Global DE-clustering: O(N q(N))
• Communication:
  – Time: O(N)
  – Volume: O(N)

Page 42: The Role of Agents in Distributed Data Mining: Issues and Benefits

Stationary agent-based KDEC

• The helper engages site agents to agree on:
  – Kernel function
  – Window width
  – Sampling frequencies
  – Sampling region
• The global sampled form of the estimate is computed in a single stage

Page 43: The Role of Agents in Distributed Data Mining: Issues and Benefits

Mobile agent-based KDEC

• At site Ln the visiting agent:
  – Negotiates kernel function, window width, sampling frequencies, sampling region
  – Carries the sum of samples collected at Lm, m<n, in its data space
• The global sampled form of the estimate is returned to the interested agents
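The running sum carried by the visiting agent can be sketched as a loop over an itinerary. This is a single-process illustration only: negotiation is skipped (the kernel, window width, and grid are fixed up front), and the Gaussian kernel, names, and data are assumptions.

```python
import math

def sample_local(data, h, grid):
    """Samples of one site's (unnormalized, Gaussian) density estimate."""
    return [sum(math.exp(-0.5 * ((g - xi) / h) ** 2) for xi in data)
            for g in grid]

def tour(itinerary, h, grid):
    """The agent visits sites in order and carries only the running sum."""
    carried = [0.0] * len(grid)
    for data in itinerary:                      # data never leaves the site
        local = sample_local(data, h, grid)
        carried = [c + s for c, s in zip(carried, local)]
    return carried

grid = [0.5 * k for k in range(21)]             # hypothetical agreed grid on [0, 10]
itinerary = [[1.0, 1.3], [7.5], [1.1, 8.0, 8.2]]
global_samples = tour(itinerary, 1.0, grid)
print(global_samples[:3])
```

After the last site the carried vector equals the sampled global estimate, which is what gets returned to the interested agents.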

Page 44: The Role of Agents in Distributed Data Mining: Issues and Benefits

A Hierarchical Scheme

• Additivity allows the scheme to be extended to trees of arbitrary arity
• Local sampled density estimates are propagated upwards in partial sums, until the global sampled DE is computed at the root and returned to the leaves
• May provide more protection against disclosure of DEs
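The upward propagation of partial sums can be sketched as a recursion over a tree. The nested-list representation, the function name `sum_up`, and the sample values are hypothetical and chosen only to keep the example short.

```python
def sum_up(node):
    """Return the partial sum of sample vectors for the subtree at node.

    A leaf is a list of floats (a site's sample vector); an inner node is a
    list of child nodes.
    """
    if all(isinstance(v, float) for v in node):
        return list(node)
    partials = [sum_up(child) for child in node]
    return [sum(column) for column in zip(*partials)]

# Hypothetical tree: the root has two inner nodes with 2 and 3 leaves.
tree = [
    [[0.1, 0.4, 0.0], [0.0, 0.2, 0.3]],
    [[0.5, 0.0, 0.1], [0.0, 0.0, 0.2], [0.1, 0.1, 0.1]],
]
root_samples = sum_up(tree)
print(root_samples)
```

Each inner node sees only the partial sums of its children, not the individual leaf vectors below them, which is one way to read the slide's remark about protection against disclosure.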

Page 45: The Role of Agents in Distributed Data Mining: Issues and Benefits

Inference and Trustworthiness

• Inference problem for kernel density estimates
  – Goal of inference attacks: exploit information contained in a density estimate to infer the data objects
• Trustworthiness of helpers
  – Trustworthy helper: no bit of information written to memory by a process for the Helper procedure is sent to a system peripheral by a different process

Page 46: The Role of Agents in Distributed Data Mining: Issues and Benefits

Inference Attacks on Kernel Density Estimates

• Let g be a function extensionally equal to a density estimate [formulas not preserved]
• For example, g is the reconstructed density estimate (sampling series)

Page 47: The Role of Agents in Distributed Data Mining: Issues and Benefits

Inference Attacks on Kernel Density Estimates

• Simple strategy: search the density estimate or its derivatives for discontinuities
• Example: the kernel is the square pulse
  – For each pair of projections of objects on an axis there is a pair of projections of discontinuities on that axis having the same distance as the objects’ projections
  – If h is known then the objects can be inferred easily
• If the kernel has discontinuous derivatives, then the same technique applies to the derivatives

Page 48: The Role of Agents in Distributed Data Mining: Issues and Benefits

Inference Attacks on Kernel Density Estimates

[Plot: density estimate with h=250; horizontal axis from 500 to 2000, vertical axis from 0.0001 to 0.0005]

• If g is not continuous at x, an object lies at a position determined by x and h [expression not preserved]
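A minimal sketch of the discontinuity search in one dimension. It assumes the square-pulse kernel gives weight 1 to every object within distance h of x, so the estimate jumps upward at x exactly when an object at x + h enters the window; that convention, the grid scan, and all names and data are assumptions for illustration.

```python
def pulse_estimate(x, data, h):
    """Unnormalized estimate with a square-pulse kernel of half-width h."""
    return sum(1 for xi in data if abs(x - xi) <= h)

def infer_objects(g, lo, hi, h, step=1.0):
    """Scan g on a grid; each upward jump at x suggests an object at x + h."""
    found = []
    x, prev = lo, g(lo)
    while x < hi:
        x += step
        cur = g(x)
        # A jump of size m > 0 means m objects entered the window [x-h, x+h].
        for _ in range(cur - prev):
            found.append(x + h)
        prev = cur
    return found

secret = [400.0, 650.0, 1500.0]      # known only to the data owner
h = 250.0

def g(x):
    """What the attacker can evaluate: the density estimate, not the data."""
    return pulse_estimate(x, secret, h)

guesses = infer_objects(g, 0.0, 2000.0, h)
print(guesses)
```

The attacker uses only g and h, and recovers each object to within the scan step. Downward jumps carry the same information (an object at x - h) and are ignored here for brevity.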

Page 49: The Role of Agents in Distributed Data Mining: Issues and Benefits

Inference Attacks on Kernel Density Estimates

• If the kernel is infinitely differentiable the problem is more difficult
• Select space objects and attempt to solve a nonlinear system of equations

Page 50: The Role of Agents in Distributed Data Mining: Issues and Benefits

Attack Scenarios

• Single-site attack
  – One of the sites attempts to infer the data objects from the global density estimate
  – Unable to associate a specific data object to a specific site
• Site coalition attack
  – A coalition computes the sum of the density estimates of all the other sites as a difference
  – Special case: the coalition includes all sites but one; the attack then potentially reveals the data objects at that site
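The coalition's difference computation is a position-by-position subtraction on the shared sample grid. The sample vectors below are made up, and the function name is hypothetical.

```python
def coalition_difference(global_samples, coalition_samples):
    """Subtract the coalition's own local samples from the global samples."""
    own = [sum(column) for column in zip(*coalition_samples)]
    return [g - o for g, o in zip(global_samples, own)]

# Hypothetical sampled local estimates of three sites on a shared grid.
site_a = [0.2, 0.5, 0.1, 0.0]
site_b = [0.0, 0.1, 0.4, 0.3]
victim = [0.6, 0.0, 0.0, 0.2]
global_samples = [a + b + v for a, b, v in zip(site_a, site_b, victim)]

recovered = coalition_difference(global_samples, [site_a, site_b])
print(recovered)
```

With all sites but one in the coalition, the difference is exactly the remaining site's sampled estimate, which can then be targeted by the inference attacks described earlier.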

Page 51: The Role of Agents in Distributed Data Mining: Issues and Benefits

Untrustworthy Helpers

• Reputation: binary random variable with probabilities p and 1-p
  – p is the probability that the helper is untrustworthy
  – If the agent community supports referrals about an agent's reputation as a helper, then an agent might know per-agent probabilities

Page 52: The Role of Agents in Distributed Data Mining: Issues and Benefits

Conclusions

• Agent technology in DDM
  – Preserves the autonomy of data sources and scalability of the DM step
  – Privacy protection (inherent in most DDM approaches) may be less effective
• Agent-based distributed density estimate computation & clustering
  – Scalable
  – Implementation based on mobile agents
  – Could be vulnerable to inference attacks on density estimates perpetrated by coalitions

Page 53: The Role of Agents in Distributed Data Mining: Issues and Benefits

Work in progress (1)

• Inference attacks on sampled density estimates by solving the corresponding system of NL equations
  – Globally convergent inexact Newton methods with constraints [Bellavia, Macconi, Morini 2003]
  – Gradient method

Page 54: The Role of Agents in Distributed Data Mining: Issues and Benefits

Work in progress (2)

• Inference attacks on sampled density estimates by iteratively reducing the dimensionality of the system of NL equations
  – pointhunt algorithm
  – Proceeds iteratively by
    • selecting a point x close to the “border” of the density estimate
    • guessing an object such that the object is the only contributor to density estimate at x

Page 55: The Role of Agents in Distributed Data Mining: Issues and Benefits

Work in progress (3)

• Ontology and protocol
• Algorithms for deliberative participation based on trustworthiness referrals
• Probabilities that a local density estimate may be learned by another agent
  – Probability that no other agent learns the density estimate
  – Probability that at least k other agents learn the density estimate
  – Probability that exactly k other agents learn the density estimate

Page 56: The Role of Agents in Distributed Data Mining: Issues and Benefits

Future Work

• Algorithms for the negotiation of parameters
• Formalization of errors
  – Bounds on aliasing errors
  – Clustering errors (e.g., using an index of partition difference)

Page 57: The Role of Agents in Distributed Data Mining: Issues and Benefits


Thanks!