Mining Distributed Databases Raj Bhatnagar University of Cincinnati

Dec 20, 2015

Page 1

Mining Distributed Databases

Raj Bhatnagar

University of Cincinnati

Page 2

Distributed Databases

$$D = D_1 \bowtie D_2 \bowtie \cdots \bowtie D_n$$

- D is implicitly specified

Goal: Discover patterns in the implicit D, using the explicit Di’s

[Figure: geographically distributed nodes D1 (A, B, C), D2 (C, D, E), ..., Dn (A, E, G)]

Limitations:
- Can’t move Di’s to a common site
  - Size / communication cost / privacy
- Can’t update local databases
- Can’t send actual data tuples

Page 3

Explicit and Implicit Databases

Explicit Component Databases

Node 1      Node 2      Node 3
A  B  C     D  C  F     A  E  C
1  2  2     1  2  2     1  1  2
1  6  2     1  1  1     1  2  1
1  6  1     1  1  3     2  2  1
2  6  1     -  -  -     -  -  -

SharedSet

A  C
1  1
1  2
2  1
2  2

Implicit Database

A  B  C  D  E  F
1  6  1  1  2  1
1  6  1  1  2  3
1  6  2  1  1  2
1  2  2  1  1  2
2  6  1  1  2  1
2  6  1  1  2  3
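The relationship between the explicit and implicit databases can be checked with a small sketch. It assumes the implicit database is the natural join of the three node tables on their common attributes, which is a reading of the slide rather than something it states; `natural_join` is a hypothetical helper, and the rows are the values read from the Node 1, Node 2 and Node 3 tables.

```python
from itertools import product

# Explicit component databases from the slide, as attribute -> value rows.
node1 = [dict(zip("ABC", r)) for r in [(1, 2, 2), (1, 6, 2), (1, 6, 1), (2, 6, 1)]]
node2 = [dict(zip("DCF", r)) for r in [(1, 2, 2), (1, 1, 1), (1, 1, 3)]]
node3 = [dict(zip("AEC", r)) for r in [(1, 1, 2), (1, 2, 1), (2, 2, 1)]]


def natural_join(left, right):
    """Join two relations on every attribute name they have in common (hypothetical helper)."""
    joined = []
    for l_row, r_row in product(left, right):
        common = l_row.keys() & r_row.keys()
        if all(l_row[a] == r_row[a] for a in common):
            joined.append({**l_row, **r_row})
    return joined


implicit_d = natural_join(natural_join(node1, node2), node3)
rows = sorted(tuple(t[a] for a in "ABCDEF") for t in implicit_d)
for row in rows:
    print(row)
```

Under that assumption the join yields the six tuples listed for the implicit database. In the distributed setting this join is never materialized; the sketch only shows what the implicit D would contain.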

Page 4

Decomposition of Computations

- Since D is implicit,
  - For a computation:
    $$R = F(D)$$
  - Decompose F into G and g’s:
    $$R = G[g_1(D_1), g_2(D_2), \ldots, g_n(D_n)]$$
- Decomposition depends on
  - F
  - Di’s
  - Set of shared attributes

[Figure: D1 (A, B, C), D2 (C, D, E), ..., Dn (A, E, G)]

Page 5

Decomposition of Computations

Computational primitives

– Arithmetic primitives
  • Count of tuples in implicit D
  • Mean value of an attribute in D
  • Informational entropy for a subset of D
  • Covariance matrix for D
– Non-numeric primitives
  • Median value of an attribute in D
  • Sorting subsets of tuples in D

Page 6

Decomposition of Computations

• Computational cost of decomposition
  – Communication cost
    • Number of messages exchanged
  – Number of database queries
• Who does the decomposition?
  – Algorithm itself, at run time
  – Depending on the nature of overlap in Di’s

Page 7

Count All Tuples in Implicit D

$$R = \#\mathrm{tuples}(D)$$

Can be decomposed as:

$$R = \sum_{j=1}^{m} \prod_{i=1}^{n} \left( N(D_i) \right)_{cond_j}$$

– $cond_j$: j-th tuple in Shareds
– m: number of tuples in Shareds
– n: number of participating databases (Di’s)
– $(N(D_i))_{cond_j}$: count of tuples in Di satisfying $cond_j$
– Local computation: $g_i(D_i) = (N(D_i))_{cond_j}$
– G is a sum-of-products

Shareds

A  C
1  1
1  2
2  1
2  2

l attributes; k values each; $k^l$ tuples
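The sum-of-products count can be sketched on the three-node example (Node 1 with A, B, C; Node 2 with D, C, F; Node 3 with A, E, C; shared attributes A and C). The names `local_count` and `count_implicit` are hypothetical, and the sketch assumes each node applies only the part of a shared condition that mentions attributes it actually holds.

```python
from itertools import product
from math import prod

# Explicit databases from the three-node example: (attribute names, rows).
NODES = [
    ("ABC", [(1, 2, 2), (1, 6, 2), (1, 6, 1), (2, 6, 1)]),  # Node 1
    ("DCF", [(1, 2, 2), (1, 1, 1), (1, 1, 3)]),              # Node 2
    ("AEC", [(1, 1, 2), (1, 2, 1), (2, 2, 1)]),              # Node 3
]
# Shareds: every combination of values for the shared attributes A and C.
SHAREDS = [dict(zip("AC", combo)) for combo in product((1, 2), repeat=2)]


def local_count(attrs, rows, cond):
    """g_i: number of local rows agreeing with cond on the attributes held locally."""
    checks = [(attrs.index(a), v) for a, v in cond.items() if a in attrs]
    return sum(all(row[i] == v for i, v in checks) for row in rows)


def count_implicit(nodes, shareds):
    """G: sum over shared conditions of the product of the local counts."""
    return sum(
        prod(local_count(attrs, rows, cond) for attrs, rows in nodes)
        for cond in shareds
    )


print(count_implicit(NODES, SHAREDS))  # 6
```

Only counts cross node boundaries here; no tuple leaves its node, which matches the stated limitation that actual data tuples cannot be sent.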

Page 8

Implementing Decomposed Computations

[Figure, two panels over the databases D1 (A, B, C), D2 (C, D, E), Dn (A, E, G) and Dx.
Stationary Agents: an agent (A) at each database; label "Messages".
Mobile Agents: label "Aglet".]

Page 9

Implementation of Count(D)

Stationary Agents
- Request / Send Summaries
- Simple SQL interface
  - 1 count / message
  - l attributes having k values each
  - Messages exchanged: $n * k^l$
- Query-code interface
  - $k^l$ counts / message
  - l attributes having k values each
  - Messages exchanged: n

Mobile Agents:
- Number of hops: n

[Figure: D1 (A, B, C), D2 (C, D, E), ..., Dn (A, E, G)]

Shareds

A  C
1  1
1  2
2  1
2  2

l attributes; k values each; $k^l$ tuples
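The message counts for Count(D) can be tabulated with a short sketch. It assumes the readings n * k^l messages for the simple SQL interface and n messages (or hops) for the query-code interface and the mobile agent; `count_messages` is a hypothetical name.

```python
def count_messages(n, k, l):
    """Messages or hops for Count(D) under each interface (formulas as read from the slide)."""
    shared_tuples = k ** l  # one shared condition per tuple of Shareds
    return {
        "simple_sql": n * shared_tuples,  # one count per message
        "query_code": n,                  # k^l counts packed into one message per node
        "mobile_agent_hops": n,           # the agent visits each node once
    }


# Three nodes, two shared attributes with two values each.
print(count_messages(n=3, k=2, l=2))
```

None of these quantities depends on how many tuples each database holds, only on n, k and l.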

Page 10

Implementation of Count(D-test)

$$Count_{test} = \sum_{j} \prod_{i=1}^{n} \left( N(D_i) \right)_{cond_j \text{ and } test}$$

Stationary Agents
- Simple SQL interface
  - Messages exchanged: $n * k^l$
- Query-code interface
  - Messages exchanged: n

Mobile Agents:
- Number of hops: n

Shareds

A  C
1  1
1  2
2  1
2  2

l attributes; k values each; $k^l$ tuples
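A count restricted by a test can be sketched the same way on the three-node example. The assumption here is that the test is a conjunction of attribute = value conditions and that each node applies the conditions on attributes it holds; `count_with_test` and `local_count` are hypothetical names.

```python
from itertools import product
from math import prod

# Explicit databases from the three-node example: (attribute names, rows).
NODES = [
    ("ABC", [(1, 2, 2), (1, 6, 2), (1, 6, 1), (2, 6, 1)]),  # Node 1
    ("DCF", [(1, 2, 2), (1, 1, 1), (1, 1, 3)]),              # Node 2
    ("AEC", [(1, 1, 2), (1, 2, 1), (2, 2, 1)]),              # Node 3
]
SHAREDS = [dict(zip("AC", combo)) for combo in product((1, 2), repeat=2)]


def local_count(attrs, rows, required):
    """Number of local rows matching the required values on locally held attributes."""
    checks = [(attrs.index(a), v) for a, v in required.items() if a in attrs]
    return sum(all(row[i] == v for i, v in checks) for row in rows)


def count_with_test(nodes, shareds, test):
    """Sum over shared conditions of the product of local counts under (cond and test)."""
    total = 0
    for cond in shareds:
        # A shared condition that contradicts the test contributes nothing.
        if any(a in cond and cond[a] != v for a, v in test.items()):
            continue
        required = {**cond, **test}
        total += prod(local_count(attrs, rows, required) for attrs, rows in nodes)
    return total


print(count_with_test(NODES, SHAREDS, {"B": 6}))  # 5
```

With an empty test the function reduces to the plain count of tuples in D.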

Page 11

Average Value of an attribute in D

Compute counts for each value of an attribute:

$$Avg_C = \frac{1}{N_{total}} * \sum_{i=1}^{n} \left( N(C_i) * C_i \right)$$

Stationary Agents
- Simple SQL interface
  - Messages exchanged: $(k+1) * n * k^l$ (1 integer/message)
- Query-code interface
  - Messages exchanged: n ($k^l * (k+1)$ integers/message)

Mobile Agents:
- Number of hops: n
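The average can be built from decomposed counts alone, as a sketch on the three-node example shows. It assumes the sum in the formula runs over the distinct values of the attribute; `decomposed_count` and `average` are hypothetical names.

```python
from fractions import Fraction
from itertools import product
from math import prod

# Explicit databases from the three-node example: (attribute names, rows).
NODES = [
    ("ABC", [(1, 2, 2), (1, 6, 2), (1, 6, 1), (2, 6, 1)]),  # Node 1
    ("DCF", [(1, 2, 2), (1, 1, 1), (1, 1, 3)]),              # Node 2
    ("AEC", [(1, 1, 2), (1, 2, 1), (2, 2, 1)]),              # Node 3
]
SHAREDS = [dict(zip("AC", combo)) for combo in product((1, 2), repeat=2)]


def decomposed_count(test):
    """Count of tuples in the implicit D satisfying test, using local counts only."""
    total = 0
    for cond in SHAREDS:
        if any(a in cond and cond[a] != v for a, v in test.items()):
            continue
        required = {**cond, **test}
        total += prod(
            sum(all(row[attrs.index(a)] == v for a, v in required.items() if a in attrs)
                for row in rows)
            for attrs, rows in NODES
        )
    return total


def average(attribute, attribute_values):
    """Avg = (1 / N_total) * sum over values of N(value) * value."""
    n_total = decomposed_count({})
    weighted = sum(v * decomposed_count({attribute: v}) for v in attribute_values)
    return Fraction(weighted, n_total)


print(average("B", [2, 6]))  # 16/3
```

The k value counts plus the total count are the k + 1 integers per shared condition that appear in the message counts.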

Page 12

Exception Tuples

• Database of interest may exclude some tuples of D
• Learning site keeps a relation E of exception tuples
  – E may have explicit tuples
  – E may have rules to generate exception tuples

$$R = \sum_{j=1}^{m} \left( \prod_{i=1}^{n} \left( N(D_i) \right)_{Cond_j} - \left( N(E) \right)_{Cond_j} \right)$$

Explicit Databases

Node 1      Node 2      Node 3
A  B  C     D  C  F     A  E  C
1  2  2     1  2  2     1  1  2
1  6  2     1  1  1     1  2  1
1  6  1     1  1  3     2  2  1
2  6  1     -  -  -     -  -  -

SharedSet

A  C
1  1
1  2
2  1
2  2

Exceptions

B  E
2  3
-  -
-  -
-  -

Page 13

Computing Informational Entropy

Consists of various counts only:

$$E = \sum_{b} \frac{N_b}{N} * \left( - \sum_{c} \frac{N_{bc}}{N_b} \log_2 \frac{N_{bc}}{N_b} \right)$$

Stationary agent/Simple SQL interface:
- Messages exchanged: $n * k^l * k^2$ (1 integer/message)

Stationary agent/Query-code interface:
- Messages exchanged: n ($k^l * k^2$ integers/message)

Mobile agent:
- Number of hops: n

[Number of messages/hops is independent of the size of D]
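Because entropy consists of counts only, it can be computed once the counts have been gathered. The sketch below assumes the usual weighted form, in which N_b is the number of tuples with value b of the tested attribute and N_bc the number of those that also fall in class c; `entropy_from_counts` is a hypothetical name. The example counts are for attribute B against attribute E in the three-node example.

```python
from math import log2


def entropy_from_counts(counts):
    """counts[b][c] = N_bc; returns sum_b (N_b / N) * (-sum_c (N_bc / N_b) * log2(N_bc / N_b))."""
    n = sum(sum(row.values()) for row in counts.values())
    total = 0.0
    for row in counts.values():
        n_b = sum(row.values())
        if n_b == 0:
            continue  # a value that never occurs contributes nothing
        inner = -sum(
            (n_bc / n_b) * log2(n_bc / n_b) for n_bc in row.values() if n_bc > 0
        )
        total += (n_b / n) * inner
    return total


# N_bc for b in {B=2, B=6} and c in {E=1, E=2}, obtained as decomposed counts.
counts = {2: {1: 1, 2: 0}, 6: {1: 1, 2: 4}}
print(round(entropy_from_counts(counts), 4))
```

The k * k table of counts per shared condition is where the k^2 factor in the message sizes comes from.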

Page 14

Decomposition of Algorithms

• Arithmetic primitives are 1-step decompositions
  – Counts, averages, entropy
• Algorithms involve
  – Arithmetic primitives
  – Non-numeric primitives
  – Control structure
• Decomposition studied for
  – Decision tree induction algorithm
  – Mining of association rules
• Control structure is unaltered
• Primitive computations are decomposed

[Figure: D1 (A, B, C), D2 (C, D, E), Dn (A, E, G) and Dx]

• Learner Node
  • Control structure
  • Decomposition
  • Composition

Page 15

Building a Decision Tree

To induce a decision tree having:

- d levels; m attributes in n databases; l shared attributes
- k values/attribute

Stationary agent/Simple SQL interface:
- Messages exchanged: $[md - d^2/2] * [n * k^l * k^2]$ (1 integer/message)

Stationary agent/Query-code interface:
- Messages exchanged: $[md - d^2/2] * [n]$ ($k^l * k^2$ integers/message)

Mobile agent:
- $[md - d^2/2] * [n]$ hops

[Number of messages/hops is independent of the size of D]
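The decision-tree costs can be evaluated for concrete parameters with a short sketch. It assumes the first bracket reads md - d^2/2 and the second n * k^l * k^2 (simple SQL) or n (query-code and mobile agent); `decision_tree_messages` is a hypothetical name.

```python
def decision_tree_messages(d, m, n, k, l):
    """Message and hop counts for inducing a d-level tree (formulas as read from the slide)."""
    entropy_evaluations = m * d - d ** 2 / 2  # assumed reading of the first bracket
    return {
        "simple_sql_messages": entropy_evaluations * n * k ** l * k ** 2,
        "query_code_messages": entropy_evaluations * n,
        "mobile_agent_hops": entropy_evaluations * n,
        "integers_per_query_code_message": k ** l * k ** 2,
    }


# Illustrative parameters only: 4 levels, 10 attributes, 3 databases,
# 2 values per attribute, 2 shared attributes.
costs = decision_tree_messages(d=4, m=10, n=3, k=2, l=2)
for name, value in costs.items():
    print(name, value)
```

No term involves the number of tuples stored at any node, which is the point of the closing remark on the slide.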

Page 16

Mining Association Rules

Main operations:

- Enumerate item-sets

- Compute support/confidence

- Basic computation: Count-of-tuples

Communication Complexity:

- m (avg.) item sets at each level of enumeration tree

- j levels of enumeration tree

- Query-code can count for all item sets at a level simultaneously

- Therefore, we need:

  - Number of counts needed: $2 * m * j$
  - $j * n$ messages, or $j * n$ hops
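Since the basic computation is a count of tuples, support and confidence for a rule follow from decomposed counts. The sketch below uses the three-node example (Node 1 with A, B, C; Node 2 with D, C, F; Node 3 with A, E, C; shared attributes A and C) and treats attribute = value pairs as items, which is an assumption; `decomposed_count` and `support_and_confidence` are hypothetical names.

```python
from fractions import Fraction
from itertools import product
from math import prod

# Explicit databases from the three-node example: (attribute names, rows).
NODES = [
    ("ABC", [(1, 2, 2), (1, 6, 2), (1, 6, 1), (2, 6, 1)]),  # Node 1
    ("DCF", [(1, 2, 2), (1, 1, 1), (1, 1, 3)]),              # Node 2
    ("AEC", [(1, 1, 2), (1, 2, 1), (2, 2, 1)]),              # Node 3
]
SHAREDS = [dict(zip("AC", combo)) for combo in product((1, 2), repeat=2)]


def decomposed_count(items):
    """Count of tuples in the implicit D containing all items, using local counts only."""
    total = 0
    for cond in SHAREDS:
        if any(a in cond and cond[a] != v for a, v in items.items()):
            continue
        required = {**cond, **items}
        total += prod(
            sum(all(row[attrs.index(a)] == v for a, v in required.items() if a in attrs)
                for row in rows)
            for attrs, rows in NODES
        )
    return total


def support_and_confidence(antecedent, consequent):
    """Support and confidence of antecedent => consequent (antecedent must occur in D)."""
    n_total = decomposed_count({})
    n_antecedent = decomposed_count(antecedent)
    n_both = decomposed_count({**antecedent, **consequent})
    return Fraction(n_both, n_total), Fraction(n_both, n_antecedent)


print(support_and_confidence({"A": 1}, {"E": 2}))
```

Each rule needs two counts, one for the antecedent and one for the full item set, in addition to the total.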

Page 17

More Complex Computations

• Covariance matrix for D
  – Useful for eigen vectors/principal components
  – Needs second order moments:
    $$\sum_{t \in D} x_t y_t$$
• Graph/Network algorithms
  – Each node has part of a graph
  – Some nodes are shared
    • Determine MST
    • Paths of Min/Max flow
    • Flow patterns

Page 18

Sum of Products

• Sum of products for two attributes:
  $$\sum_{t \in D} x_t y_t$$
• There are six different ways in which x and y may be distributed
• Each requires a different decomposition
  – Case 1: x same as y; and x belongs to the SharedSet.
    $$\sum_{j} x_j^2 * Count(x_j \text{ in } D)$$
  – Case 2: x same as y; and x does not belong to the SharedSet.
    $$\sum_{k} Avg(x^2 \text{ for } cond_k) * Count(cond_k)$$
  – Case 3: x and y both belong to the SharedSet.
    $$\sum_{k} x_k * y_k * Count(cond_k \text{ in Shared})$$

Page 19

Sum of Products

  – Case 4: x belongs to SharedSet and y does not.
    $$\sum_{j} x_j * y * Count(x = x_j)$$
    with y averaged over the tuples where $x = x_j$
  – Case 5: x, y don’t belong to the SharedSet and reside on different nodes.
    • For each tuple t in SharedSet, obtain $x(t)$, $y(t)$
    • and then
      $$\sum_{t} (Sum(x(t))) * (Sum(y(t)))$$
  – Case 6: x, y don’t belong to the SharedSet and reside on the same node.
    $$\sum_{t} Prod(t) * Count(t)$$
    where Prod(t) is the average of the product of x and y for cond-t of SharedSet
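Case 5 can be sketched on a three-node example in which Node 1 holds A, B, C, Node 2 holds D, C, F, Node 3 holds A, E, C, and A and C are shared. The sketch takes x = B (Node 1 only) and y = F (Node 2 only). It assumes that nodes holding neither x nor y contribute their local count for each shared tuple, a factor the formula does not show explicitly; `sum_of_products_case5` is a hypothetical name.

```python
from itertools import product

# Explicit databases from the three-node example: (attribute names, rows).
NODES = [
    ("ABC", [(1, 2, 2), (1, 6, 2), (1, 6, 1), (2, 6, 1)]),  # Node 1
    ("DCF", [(1, 2, 2), (1, 1, 1), (1, 1, 3)]),              # Node 2
    ("AEC", [(1, 1, 2), (1, 2, 1), (2, 2, 1)]),              # Node 3
]
SHAREDS = [dict(zip("AC", combo)) for combo in product((1, 2), repeat=2)]


def sum_of_products_case5(x, y):
    """Sum of x*y over the implicit D when x and y are unshared and on different nodes."""
    total = 0
    for cond in SHAREDS:
        term = 1
        for attrs, rows in NODES:
            # Local rows that agree with the shared tuple on locally held attributes.
            local = [
                row for row in rows
                if all(row[attrs.index(a)] == v for a, v in cond.items() if a in attrs)
            ]
            if x in attrs:
                term *= sum(row[attrs.index(x)] for row in local)  # Sum(x(t))
            elif y in attrs:
                term *= sum(row[attrs.index(y)] for row in local)  # Sum(y(t))
            else:
                term *= len(local)  # assumed: other nodes contribute their count
        total += term
    return total


print(sum_of_products_case5("B", "F"))  # 64
```

The result agrees with summing B * F directly over the six tuples of the joined database (6 + 18 + 12 + 4 + 6 + 18).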

Page 20

Self-decomposing Algorithms

• Easy decomposability of arithmetic primitives
  – Average/Covariance matrix/Entropy
• Control structure of algorithms is not altered
  – More gains possible, by altering control structure
• Decomposition is driven by the set of shared attributes
• Algorithm can determine shared attributes in n messages/hops
• Algorithms decompose in accordance with attribute sharing
  – No human intervention needed
• Message complexity is independent of sizes of databases
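Determining the shared attributes needs only the attribute names held at each node, one message (or hop) per node. A minimal sketch, assuming an attribute counts as shared when it appears at more than one node; `shared_attributes` is a hypothetical name and the schemas are those of a three-node example.

```python
from collections import Counter


def shared_attributes(schemas):
    """Attributes that appear in the schema of more than one node."""
    seen = Counter(a for attrs in schemas for a in set(attrs))
    return sorted(a for a, count in seen.items() if count > 1)


# One schema reply per node, so n = 3 messages.
schemas = ["ABC", "DCF", "AEC"]
print(shared_attributes(schemas))  # ['A', 'C']
```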

Page 21

Continuing Work

Determine patterns of flow in a network
– Communication network traffic
– Geographic/economic flows

[Figure: network of nodes, each labeled "Local flow data"]