The Role of Agents in Distributed Data Mining: Issues and Benefits Josenildo Costa da Silva 1 , Matthias Klusch 1 , Stefano Lodi 2 , Gianluca Moro 2 , Claudio Sartori 2 1 Deduction and Multiagent Systems, German Research Center for Artificial Intelligence, Saarbruecken, Germany 2 Department of Electronics, Computer Science and Systems, University of Bologna, Bologna, Italy
18/04/23, AgentLink III: TFG1 IIA4WE, Roma
Distributed Data Mining (DDM)
• Data sets
  – Massive
  – Inherently distributed
• Networks
  – Limited bandwidth
  – Limited computing resources at nodes
• Privacy and security
  – Sensitive data
  – Share goals, not data
Centralized solution
• Apply traditional DM algorithms to data retrieved from different sources and stored in a data warehouse
• May be impractical or even impossible for some business settings
  – Autonomy of data sources
  – Data privacy
  – Scalability (~TB/d)
Agents and DDM
• DDM exploits distributed processing and problem decomposability
• Is there any real added value in using concepts from agent technology in DDM?
  – Few DDM algorithms use agents
  – Evidence that cooperation among distributed DM processes may allow effective mining even without centralized control
  – Autonomy, adaptivity, deliberative reasoning naturally fit into the DDM framework
State of the Art
• BODHI
  – Mobile agent platform/framework for collective DM on heterogeneous sites
• PADMA
  – Clustering homogeneous sites
  – Agent-based text classification/visualization
• JAM
  – Metalearning, classifiers
• Papyrus
  – Wide area DDM over clusters
  – Move data/models/results to minimize network load
Agents for DDM (pros)
• Autonomy of data sources
• Scalability of DM to massive
• Additivity of density estimates of a distributed data set
Sampling Density Estimates (2/4)
• The sampling series of the density estimate is also additive
[Equations not preserved in this transcript]
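Since the slide's equations did not survive, the following is a hedged reconstruction of what additivity means here, written for one dimension. The symbols are assumptions and not taken from the slides: an unnormalised kernel estimate \hat{\varphi}[D] with kernel K and window width h, a sampling interval \tau, and the standard Whittaker-Shannon sampling series. The authors' exact notation and normalisation may differ.

```latex
% Kernel estimate of a data set D (normalisation constant omitted)
\hat{\varphi}[D](x) = \sum_{x_i \in D} K\!\left(\frac{x - x_i}{h}\right)

% Additivity over a partition D = D_1 \cup \dots \cup D_M
\hat{\varphi}[D](x) = \sum_{j=1}^{M} \hat{\varphi}[D_j](x)

% Sampling series with sampling interval \tau
(S_\tau f)(x) = \sum_{z \in \mathbb{Z}} f(z\tau)\,
                \operatorname{sinc}\!\left(\frac{x - z\tau}{\tau}\right),
\qquad \operatorname{sinc}(t) = \frac{\sin(\pi t)}{\pi t}

% S_\tau is linear in f, so the sampling series inherits additivity
S_\tau\,\hat{\varphi}[D] = \sum_{j=1}^{M} S_\tau\,\hat{\varphi}[D_j]
```

The last line is the property the scheme relies on: summing the sites' samples position by position gives the samples of the global estimate.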
Sampling Density Estimates (3/4)
• Truncation errors
  – The support of a kernel function is not bounded in general
• Aliasing errors
  – The support of the Fourier transform of a kernel function is not bounded in general, so kernel functions are not band-limited
Sampling Density Estimates (4/4)
• The sampling series of a density estimate can only be approximated
• Trade-off between the number of samples and accuracy
  – Define a minimal multidimensional rectangle outside which samples are negligible
  – Define a vector of sampling intervals such that the aliasing error is negligible
The KDEC scheme
• Every site Lj:
  – Computes a local density estimate of its data Dj
  – Samples the local density estimate
  – Sends the samples to H
  – Waits for the samples of the global density estimate
  – Reconstructs the global density estimate from its samples
  – Applies DE-clustering to Dj and the reconstructed global density estimate
• Helper H:
  – Waits for the samples of local density estimates
  – Sums the samples index by index
  – Sends the summation back to each Lj
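The message flow of the scheme can be sketched in a few lines of Python. This is a minimal one-dimensional illustration under stated assumptions, not the authors' implementation: the Gaussian kernel, the unnormalised estimate, the grid bounds, and all function names (`local_samples`, `helper_sum`, `reconstruct`) are hypothetical, and the final DE-clustering step is left out.

```python
import math

def gaussian_kernel(u):
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def make_grid(lo, hi, tau):
    """Shared sampling grid: points lo, lo + tau, ..., hi."""
    n = int(round((hi - lo) / tau)) + 1
    return [lo + i * tau for i in range(n)]

def local_samples(data, grid, h):
    """Site L_j: sample the (unnormalised) local kernel estimate on the grid."""
    return [sum(gaussian_kernel((g - x) / h) for x in data) for g in grid]

def helper_sum(all_samples):
    """Helper H: add the sample vectors position by position."""
    return [sum(column) for column in zip(*all_samples)]

def sinc(t):
    return 1.0 if t == 0 else math.sin(math.pi * t) / (math.pi * t)

def reconstruct(samples, grid, tau, x):
    """Site L_j: truncated sampling series evaluated at x."""
    return sum(s * sinc((x - g) / tau) for s, g in zip(samples, grid))

# Hypothetical data held at two sites; raw objects never leave a site.
sites = {"L1": [0.9, 1.0, 1.2], "L2": [4.8, 5.0, 5.1, 5.3]}
h, tau = 0.5, 0.25
grid = make_grid(-1.0, 7.0, tau)

global_samples = helper_sum([local_samples(d, grid, h) for d in sites.values()])
value_between_grid_points = reconstruct(global_samples, grid, tau, 1.1)
```

Only the sample vectors cross the network. Each site can then evaluate the reconstructed global estimate wherever its own clustering step needs it.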
The KDEC scheme
[Figure: diagram of the Helper with Site1 and Site2]
[Figures, slides 34 to 37 of "The KDEC scheme": plots at the Helper over axes x and y; images not preserved in this transcript]
Properties of the approach
• Communication complexity depends only on the number of samples
• Data objects are never transmitted over the network
• Local clusters are close to global clusters which can be obtained using DE-cluster
• Time complexity does not exceed the time complexity of centralized DE-clustering
Window width and sampling frequency
• Good estimates when h is not less than a small multiple of the smallest distance between objects
• As [condition not preserved in this transcript], the number of samples rarely exceeds the number of data points
Complexity
• Site j
  – Sampling: O(q(N) Sam)
  – DE-cluster: O(|Dj| q(Dj))
• Helper
  – Summation of samples: O(Sam)
• Communication
  – Time: O(Sam)
  – Volume: O(M Sam)
Complexity (centralized approach)
• Site j
  – Transmission/reception of data objects: O(|Dj|)
• Helper
  – Global DE-clustering: O(N q(N))
• Communication
  – Time: O(N)
  – Volume: O(N)
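To make the two communication volumes concrete, here is a small arithmetic sketch. It assumes that M is the number of sites, Sam the number of samples per estimate, and N the total number of data objects, which is how the slides appear to use these symbols. The site sizes and sample count below are invented for illustration only.

```python
def kdec_volume(num_sites, num_samples):
    """O(M * Sam): each of the M sites ships one vector of Sam samples."""
    return num_sites * num_samples

def centralized_volume(site_sizes):
    """O(N): every data object is shipped once to the central node."""
    return sum(site_sizes)

site_sizes = [200_000, 350_000, 450_000]  # hypothetical |D_j| for three sites
num_samples = 10_000                      # hypothetical Sam

kdec = kdec_volume(len(site_sizes), num_samples)
central = centralized_volume(site_sizes)
print(f"KDEC: {kdec} values, centralized: {central} objects")
```

The sampled scheme pays off whenever M times Sam stays well below N, and its cost does not grow when sites accumulate more data on the same sampling region.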
Stationary agent-based KDEC
• The helper engages site agents to agree on:
  – Kernel function
  – Window width
  – Sampling frequencies
  – Sampling region
• The global sampled form of the estimate is computed in a single stage
Mobile agent-based KDEC
• At site Ln the visiting agent:
  – Negotiates kernel function, window width, sampling frequencies, sampling region
  – Carries the sum of samples collected at Lm, m<n, in its data space
• The global sampled form of the estimate is returned to the interested agents
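The itinerary of the visiting agent can be sketched as a running sum. This is a hedged illustration, not the authors' code: the `Site` and `VisitingAgent` classes, the Gaussian kernel, and the fixed grid are hypothetical, and the negotiation step is replaced by parameters fixed in advance.

```python
import math

def gaussian(u):
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

class Site:
    """Hypothetical site L_n: keeps its data private and only exposes samples."""
    def __init__(self, data):
        self._data = data

    def sample(self, grid, h):
        return [sum(gaussian((g - x) / h) for x in self._data) for g in grid]

class VisitingAgent:
    """Carries the sum of the samples collected at the sites visited so far."""
    def __init__(self, grid, h):
        self.grid = grid
        self.h = h
        self.running_sum = [0.0] * len(grid)

    def visit(self, site):
        local = site.sample(self.grid, self.h)
        self.running_sum = [a + b for a, b in zip(self.running_sum, local)]

grid = [i * 0.5 for i in range(-4, 17)]        # assumed sampling region [-2, 8]
sites = [Site([1.0, 1.5]), Site([4.0]), Site([6.0, 6.5, 7.0])]

agent = VisitingAgent(grid, h=0.75)
for site in sites:                             # itinerary L1, L2, L3
    agent.visit(site)
global_samples = agent.running_sum
```

Note that every site after the first can see the partial sum the agent carries, which is relevant to the disclosure concerns discussed on the later slides.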
A Hierarchical Scheme
• Additivity allows the scheme to be extended to trees of arbitrary arity
• Local sampled density estimates are propagated upwards in partial sums, until the global sampled DE is computed at the root and returned to the leaves
• May provide more protection against disclosure of DEs
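The upward propagation of partial sums can be sketched as a recursive aggregation. This is a minimal sketch under assumptions: the tree is a plain nested dictionary (a hypothetical representation), leaves hold already-sampled local estimates, and the step that returns the result to the leaves is omitted.

```python
def aggregate(node):
    """Return the partial sum of sample vectors for the subtree rooted at node."""
    total = list(node.get("samples", []))
    for child in node.get("children", []):
        part = aggregate(child)
        total = part if not total else [a + b for a, b in zip(total, part)]
    return total

# Hypothetical tree: the root has two inner nodes, with arity 2 and 3.
tree = {"children": [
    {"children": [{"samples": [1.0, 2.0, 0.0]},
                  {"samples": [0.5, 0.5, 0.5]}]},
    {"children": [{"samples": [0.0, 1.0, 3.0]},
                  {"samples": [2.0, 0.0, 0.0]},
                  {"samples": [0.5, 0.5, 0.5]}]},
]}

global_samples = aggregate(tree)
```

An inner node only ever sees the partial sums of its children, never the individual leaf estimates below them, which is one plausible reading of the extra protection mentioned above.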
Inference and Trustworthiness
• Inference problem for kernel density estimates
  – Goal of inference attacks: exploit information contained in a density estimate to infer the data objects
• Trustworthiness of helpers
  – A helper is trustworthy if no bit of information written to memory by a process for the Helper procedure is sent to a system peripheral by a different process
Inference Attacks on Kernel Density Estimates
• Let g be extensionally equal to a density estimate [defining equation not preserved in this transcript]
• For example, g is the reconstructed density estimate (sampling series)
Inference Attacks on Kernel Density Estimates
• Simple strategy: Search the density estimate or its derivatives for discontinuities
• Example: The kernel is the square pulse
  – For each pair of projections of objects on an axis there is a pair of projections of discontinuities on that axis having the same distance as the objects’ projections
  – If h is known then the objects can be inferred easily
• If the kernel has discontinuous derivatives, then the same technique applies to the derivatives
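The square-pulse case can be illustrated with a short one-dimensional sketch. The assumptions are mine, not the slides': the kernel has support [-h, h], the attacker can evaluate g anywhere and knows h, and objects lie on a grid fine enough for the scan to resolve. The names `box_estimate` and `infer_objects` are hypothetical.

```python
def box_estimate(data, h):
    """Density estimate with a square-pulse kernel of half-width h."""
    n = len(data)
    def g(x):
        return sum(1.0 for xi in data if abs(x - xi) <= h) / (2.0 * h * n)
    return g

def infer_objects(g, h, lo, hi, step):
    """Scan [lo, hi]; each upward jump of g at x betrays an object at x + h."""
    found = []
    steps = int(round((hi - lo) / step))
    prev = g(lo)
    for i in range(1, steps + 1):
        x = lo + i * step
        cur = g(x)
        if cur > prev + 1e-12:   # rising edge: an object entered the window
            found.append(x + h)
        prev = cur
    return found

secret = [700, 1000, 1450]        # hypothetical private objects
g = box_estimate(secret, h=250)   # what the attacker is allowed to see
recovered = infer_objects(g, h=250, lo=0, hi=2000, step=1)
```

The scan misses an object whenever a rising and a falling edge coincide within one step, and its precision is bounded by the step size, so this is a lower bound on what an attacker could do rather than a complete attack.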
Inference Attacks on Kernel Density Estimates
[Figure: density estimate plot, horizontal axis 500 to 2000, vertical axis 0.0001 to 0.0005, h=250]
• If g is not continuous at x, an object lies at [expression not preserved in this transcript]
Inference Attacks on Kernel Density Estimates
• If the kernel is infinitely differentiable the problem is more difficult
• Select space objects and attempt to solve a nonlinear system of equations
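One way to read this bullet is sketched below as a worked system of equations. Everything in it is an assumption for illustration: the attacker picks m evaluation points y_1, ..., y_m, knows the kernel K and the window width h, and treats the N data objects x_1, ..., x_N as unknowns. The normalisation constant of the estimate is omitted.

```latex
% One equation per chosen evaluation point y_k
g(y_k) = \sum_{i=1}^{N} K\!\left(\frac{y_k - x_i}{h}\right),
\qquad k = 1, \dots, m

% Unknowns: x_1, \dots, x_N \in \mathbb{R}^d, i.e. N d scalar unknowns,
% so at least m \ge N d evaluation points are needed for the system
% to have a chance of determining them.
```

With a smooth kernel such as the Gaussian there are no discontinuities to search for, so the attacker has to solve this system numerically.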
Attack Scenarios
• Single-site attack
  – One of the sites attempts to infer the data objects from the global density estimate
  – Unable to associate a specific data object to a specific site
• Site coalition attack
  – A coalition computes the sum of the density estimates of all the other sites as a difference
  – Special case: the coalition includes all sites but one, so the attack potentially reveals the data objects at that site
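The difference computed by a coalition is simple enough to show directly. This sketch assumes that every site receives the global sample vector and that coalition members pool their own local sample vectors. The function name and the sample values are hypothetical.

```python
def coalition_difference(global_samples, coalition_local_samples):
    """Subtract the coalition's own samples from the global samples."""
    remainder = list(global_samples)
    for local in coalition_local_samples:
        remainder = [r - s for r, s in zip(remainder, local)]
    return remainder

# Hypothetical sampled local estimates of three sites.
site_a = [1.0, 0.5, 0.0]
site_b = [0.0, 0.5, 1.0]
site_c = [0.25, 0.25, 0.25]   # the victim, outside the coalition
global_samples = [a + b + c for a, b, c in zip(site_a, site_b, site_c)]

# Coalition {A, B} covers all sites but one.
leaked = coalition_difference(global_samples, [site_a, site_b])
```

Here `leaked` equals the victim's sampled local estimate, which can then be attacked like any other density estimate.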
Untrustworthy Helpers
• Reputation: binary random variable with probabilities p and 1-p
  – p is the probability that the helper is untrustworthy
  – If the agent community supports referrals about an agent's reputation as a helper, then an agent might know per-agent probabilities
Conclusions
• Agent technology in DDM
  – Preserves the autonomy of data sources and scalability of the DM step
  – Privacy protection (inherent in most DDM approaches) may be less effective
• Agent-based distributed density estimate computation & clustering
  – Scalable
  – Implementation based on mobile agents
  – Could be vulnerable to inference attacks on density estimates perpetrated by coalitions
Work in progress (1)
• Inference attacks on sampled density estimates by solving the corresponding system of nonlinear (NL) equations
  – Globally convergent inexact Newton methods with constraints [Bellavia, Macconi, Morini 2003]
  – Gradient method
Work in progress (2)
• Inference attacks on sampled density estimates by iteratively reducing the dimensionality of the system of NL equations
  – pointhunt algorithm
  – Proceeds iteratively by
    • selecting a point x close to the “border” of the density estimate
    • guessing an object such that the object is the only contributor to the density estimate at x
Work in progress (3)
• Ontology and protocol
• Algorithms for deliberative participation based on trustworthiness referrals
• Probabilities that a local density estimate may be learned by another agent
  – Probability that no other agent learns the density estimate
  – Probability that at least k other agents learn the density estimate
  – Probability that exactly k other agents learn the density estimate
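The slides do not say how these probabilities are to be computed. As a purely illustrative sketch, assume that each of n other agents learns the local density estimate independently with the same probability p. The three quantities then follow the binomial distribution. Both the independence assumption and the function names are mine, not the authors'.

```python
from math import comb

def p_exactly(n, k, p):
    """Probability that exactly k of n agents learn the estimate."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def p_none(n, p):
    """Probability that no other agent learns the estimate."""
    return (1 - p)**n

def p_at_least(n, k, p):
    """Probability that at least k of n agents learn the estimate."""
    return sum(p_exactly(n, j, p) for j in range(k, n + 1))

n, p = 4, 0.5   # hypothetical: four other agents, each with probability 0.5
summary = (p_none(n, p), p_exactly(n, 2, p), p_at_least(n, 1, p))
```

With per-agent probabilities obtained from referrals, the equal-p assumption would have to be dropped and the products computed agent by agent.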
Future Work
• Algorithms for the negotiation of parameters
• Formalization of errors
  – Bounds on aliasing errors
  – Clustering errors (e.g., using an index