Distributed Linear Programming and Resource Management for Data Mining in Distributed Environments
Haimonti Dutta (1) and Hillol Kargupta (2)
(1) Center for Computational Learning Systems (CCLS), Columbia University, NY, USA.
(2) University of Maryland, Baltimore County, Baltimore, MD. Also affiliated with Agnik, LLC, Columbia, MD.
Motivation
Support Vector (Kernel) Regression: An Illustration
Support Vector Kernel Regression
Find a function f(x) = y that fits a set of example data points
The problem can be phrased as a constrained optimization task
It is solved using a standard LP solver
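One way such a regression can be written as a linear program is sketched below. It follows the general 1-norm style of the knowledge-based kernel approximation work cited in these slides; it is not taken from the slides themselves. All symbols are assumptions introduced for illustration: A is the matrix of example points, y the vector of targets, K a kernel, α and b the model parameters, s a vector of slacks bounding the absolute residuals, a a vector bounding |α|, e a vector of ones, and C > 0 a trade-off parameter.

```latex
\begin{aligned}
\min_{\alpha,\, b,\, s,\, a}\quad & e^{\top} a + C\, e^{\top} s \\
\text{subject to}\quad & -s \le K(A, A^{\top})\,\alpha + b\,e - y \le s, \\
& -a \le \alpha \le a .
\end{aligned}
```

Both the objective and the constraints are linear in (α, b, s, a), which is why a standard LP solver suffices. The fitted function would then be f(x) = K(x, A^T) α + b.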
Motivation contd.
Knowledge Based Kernel Regression
In addition to sample points, give advice:
  If (x ≥ 3) and (x ≤ 5) Then (y ≥ 5)
Rules add constraints about regions
Constraints are added to the LP, and a new solution (with advice constraints) can be constructed
Fung, Mangasarian and Shavlik, “Knowledge Based Support Vector Machine Classifiers”, NIPS, 2002.
Mangasarian, Shavlik and Wild, “Knowledge Based Kernel Approximation”, JMLR, 5, 1127 – 1141, 2005.
Figure adapted from Maclin, Shavlik, Walker and Torrey, “Knowledge-based Support Vector Regression for Reinforcement Learning”, IJCAI, 2005.
Distributed Data Mining Applications – An example of Scientific Data Mining in Astronomy
Distributed data and computing resources on the National Virtual Observatory
P2P Data Mining on homogeneously partitioned sky survey
H. Dutta, “Empowering Scientific Discovery by Distributed Data Mining on the Grid Infrastructure”, Ph.D. Thesis, UMBC, Maryland, 2007.
Need for distributed optimization strategies
Road Map
Motivation
Related Work
Framing a Linear Programming problem
The simplex algorithm
The distributed simplex algorithm
Experimental Results
Conclusion and Directions of Future Work
Related Work
Resource Discovery in Distributed Environments
Iamnitchi, “Resource Discovery in Large Resource Sharing Environments”, Ph.D. Thesis, University of Chicago, 2003.
Livny and Solomon, “Matchmaking: Distributed Resource Management for High Throughput Computing”, HPDC, 1998.
Optimization Techniques
Yarmish, “Distributed Implementation of the Simplex Method”, Ph.D. Thesis, CIS, Polytechnic University, 2001.
Hall and McKinnon, “Update Procedures for Parallel Revised Simplex Methods”, Tech Report, University of Edinburgh, UK, 1992.
Craig and Reed, “Hypercube Implementation of the Simplex Algorithm”, ACM, pages 1473–1482, 1998.
The Optimization Problem
Assumptions:
n nodes in the network
The network is static
Dataset $D_i$ at node i
Processing cost at the i-th node: $\nu_i$ per record
Transportation cost between nodes i and j: $\mu_{ij}$
Amount of data transferred between nodes i and j: $x_{ij}$
Cost function: $Z = \sum_{i,j} (\mu_{ij} x_{ij} + \nu_i x_{ij}) = \sum_{i,j} c_{ij} x_{ij}$, where $c_{ij} = \mu_{ij} + \nu_i$
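The cost function can be evaluated directly once the costs and a transfer plan are known. The following is a minimal sketch with made-up numbers: the three-node network, the cost values, and the function name `total_cost` are all hypothetical and serve only to show how the two cost terms combine into a single coefficient per link.

```python
def total_cost(mu, nu, x):
    """Return Z = sum over (i, j) of (mu[i][j] + nu[i]) * x[i][j].

    mu[i][j]: transportation cost between nodes i and j
    nu[i]:    processing cost per record at node i
    x[i][j]:  amount of data transferred between nodes i and j
    """
    n = len(nu)
    return sum((mu[i][j] + nu[i]) * x[i][j] for i in range(n) for j in range(n))


# Hypothetical 3-node network (values invented for illustration).
mu = [[0, 2, 4],
      [2, 0, 1],
      [4, 1, 0]]
nu = [1, 3, 2]
x = [[0, 10, 0],   # node 0 ships 10 units to node 1
     [0, 0, 5],    # node 1 ships 5 units to node 2
     [0, 0, 0]]

Z = total_cost(mu, nu, x)
print(Z)  # (2 + 1) * 10 + (1 + 3) * 5 = 50
```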
Framing the Linear Programming Problem: An illustration
The Steps of the Simplex Algorithm (Dantzig)
Obtain a canonical representation (introduce slack variables)
Find a column pivot
Find a row pivot
Perform Gauss-Jordan elimination
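As an instance of the first step, the tableau used in the illustration that follows appears to correspond to the LP below. The objective and the direction of the inequalities are read off the tableau rows, so this is an inferred reconstruction rather than something stated on the slides.

```latex
\begin{aligned}
\text{maximize}\quad & z = x_1 + 2x_2 - x_3 \\
\text{subject to}\quad & 2x_1 + x_2 + x_3 \le 14, \\
& 4x_1 + 2x_2 + 3x_3 \le 28, \\
& 2x_1 + 5x_2 + 5x_3 \le 30, \qquad x_1, x_2, x_3 \ge 0 .
\end{aligned}
```

Introducing one slack variable per inequality gives the canonical form:

```latex
\begin{aligned}
2x_1 + x_2 + x_3 + s_1 &= 14, \\
4x_1 + 2x_2 + 3x_3 + s_2 &= 28, \\
2x_1 + 5x_2 + 5x_3 + s_3 &= 30, \\
z - x_1 - 2x_2 + x_3 &= 0, \qquad x_1, x_2, x_3, s_1, s_2, s_3 \ge 0 .
\end{aligned}
```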
The simplex tableau and iterations
Canonical Representation

x1    x2    x3    s1    s2    s3    B      Ratio (B / x2 entry)
 2     1     1     1     0     0    14     14/1 = 14
 4     2     3     0     1     0    28     28/2 = 14
 2     5     5     0     0     1    30     30/5 = 6
-1    -2     1     0     0     0     0

Pivot Column: x2
Pivot Row: third row (smallest ratio, 30/5 = 6)
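The pivot choice in the tableau above can be sketched in a few lines. The slides do not state which selection rule is used; this sketch assumes the common textbook rule (most negative entry in the objective row for the column, smallest ratio for the row), which agrees with the ratios shown. The function name `choose_pivot` is hypothetical.

```python
from fractions import Fraction


def choose_pivot(tableau):
    """Return (row, col) of the pivot, or None if no objective entry is negative.

    The last row of the tableau is the objective row; the last column is B.
    """
    objective = tableau[-1][:-1]
    # Column pivot: most negative objective entry (assumed rule).
    col = min(range(len(objective)), key=lambda j: objective[j])
    if objective[col] >= 0:
        return None
    # Row pivot: smallest ratio B / entry over rows with a positive entry.
    ratios = [(Fraction(r[-1], r[col]), i)
              for i, r in enumerate(tableau[:-1]) if r[col] > 0]
    if not ratios:
        raise ValueError("LP is unbounded in the chosen column")
    return min(ratios)[1], col


tableau = [
    [2, 1, 1, 1, 0, 0, 14],
    [4, 2, 3, 0, 1, 0, 28],
    [2, 5, 5, 0, 0, 1, 30],
    [-1, -2, 1, 0, 0, 0, 0],
]
print(choose_pivot(tableau))  # (2, 1): third row, column x2
```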
Simplex iterations contd.
Perform Gauss-Jordan Elimination

x1     x2    x3    s1    s2    s3      B
8/5    0     0     1     0     -1/5    8
16/5   0     1     0     1     -2/5    16
2/5    1     1     0     0     1/5     6
-1/5   0     3     0     0     2/5     12

The Final Tableau

x1    x2    x3       s1    s2      s3      B
0     0     -1/2     1     -1/2    0       0
1     0     5/16     0     5/16    -1/8    5
0     1     7/8      0     -1/8    1/4     4
0     0     49/16    0     1/16    3/8     13
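The iterations above can be reproduced with a short tableau simplex in exact rational arithmetic. This is a sketch under assumptions, not the authors' implementation. The column rule (most negative objective entry) is assumed. The second iteration has a tie in the ratio test (both of the first two rows give 5), and ties are broken here toward the later row because that choice reproduces the final tableau on the slide. The names `pivot` and `simplex` are hypothetical.

```python
from fractions import Fraction


def pivot(tableau, row, col):
    """Gauss-Jordan step: scale the pivot row to 1, then clear the pivot column."""
    p = tableau[row][col]
    tableau[row] = [v / p for v in tableau[row]]
    for i in range(len(tableau)):
        if i != row:
            factor = tableau[i][col]
            tableau[i] = [a - factor * b for a, b in zip(tableau[i], tableau[row])]


def simplex(rows):
    """Run tableau simplex; the last row is the objective, the last column is B."""
    tableau = [[Fraction(v) for v in r] for r in rows]
    while True:
        objective = tableau[-1][:-1]
        col = min(range(len(objective)), key=lambda j: objective[j])
        if objective[col] >= 0:
            return tableau  # no negative objective entry left
        # Ratio test; (-i) makes ties go to the later row (assumption, see lead-in).
        ratios = [(r[-1] / r[col], -i)
                  for i, r in enumerate(tableau[:-1]) if r[col] > 0]
        if not ratios:
            raise ValueError("LP is unbounded")
        row = -min(ratios)[1]
        pivot(tableau, row, col)


initial = [
    [2, 1, 1, 1, 0, 0, 14],
    [4, 2, 3, 0, 1, 0, 28],
    [2, 5, 5, 0, 0, 1, 30],
    [-1, -2, 1, 0, 0, 0, 0],
]
final = simplex(initial)
for r in final:
    print([str(v) for v in r])
```

Reading the final tableau, x1 = 5 and x2 = 4 are basic, x3 = 0, and the objective value is 13.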
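The Distributed Problem – An Example
Figure: a network of five nodes (Node 1 to Node 5); each node is annotated with its local constraints and storage capacity.
Node 1 (300 GB): x12 + x15 + x14 + 2x25 ≤ 300; x12 + 2x15 - x25 = 2
Node 2 (600 GB): x12 + x23 + x25 ≤ 600; 2x25 - x12 - x23 = 4
Node 5 (300 GB): x15 + x25 + x45 ≤ 300; x25 - 2x15 - x45 = 5
Node 4 (300 GB): x34 + 8x25 ≤ 300
Node 3 (300 GB): x23 + x34 ≤ 300
Each site observes different constraints, but wants to solve the same objective function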
Protocol Push-Min (gossip based)
Minimum estimation problem
Iteration t-1: values {m_r} are sent to node i
$m_i^t = \min(\{m_r\},\ \text{current row pivot})$
Termination: all nodes have exactly the same minimum value
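The update rule can be illustrated with a small single-process simulation. This is a sketch, not the protocol as implemented by the authors. It assumes each node starts from its own local row-pivot ratio and, in every round, pushes its current estimate to one uniformly chosen other node. The simulator detects termination by comparing against the true minimum, which a real network could not do directly. The function name `push_min` and the starting values are hypothetical.

```python
import random


def push_min(initial_values, rng, max_rounds=1000):
    """Gossip until every node holds the global minimum; return (values, rounds)."""
    values = list(initial_values)
    n = len(values)
    target = min(values)  # simulator-only knowledge, used to detect termination
    rounds = 0
    while any(v != target for v in values) and rounds < max_rounds:
        inbox = [[] for _ in range(n)]
        for i in range(n):
            # Node i pushes its current estimate to one random other node.
            j = rng.choice([k for k in range(n) if k != i])
            inbox[j].append(values[i])
        # m_i^t = min(received values, own current value)
        values = [min([values[i]] + inbox[i]) for i in range(n)]
        rounds += 1
    return values, rounds


# Hypothetical local row-pivot ratios at five nodes.
local_ratios = [14.0, 14.0, 6.0, 9.5, 21.0]
values, rounds = push_min(local_ratios, random.Random(0))
print(values, rounds)
```

Every node ends with the same minimum (6.0 here), so all nodes agree on the same pivot row.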
Analysis of Protocol Push-Min
Based on the spread of an epidemic in a large population
Susceptible, infected and dead nodes
The “epidemic” spreads exponentially fast
Figure: the five-node network (Node 1 to Node 5)
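The exponential spread can be seen in a toy simulation. The sketch below is a simplification, not the analysis from the slides: it models only susceptible and infected nodes (no "dead" nodes that stop spreading), and each infected node contacts one uniformly random node per round. The function name `rounds_to_infect_all` and the network size are hypothetical.

```python
import random


def rounds_to_infect_all(n, rng):
    """Count push-gossip rounds until all n nodes are infected, starting from one."""
    infected = [False] * n
    infected[0] = True
    rounds = 0
    while not all(infected):
        # Every node infected at the start of the round contacts one random node.
        contacted = {rng.randrange(n) for i in range(n) if infected[i]}
        for j in contacted:
            infected[j] = True
        rounds += 1
    return rounds


rounds = rounds_to_infect_all(1024, random.Random(1))
print(rounds)
```

Because the infected set can at most double in each round, at least log2(1024) = 10 rounds are needed. In practice the count stays within a small multiple of log n, far below n.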
Comments and Discussions
Assume η nodes in the network
Communication complexity is O(number of simplex iterations × η)
In the worst case, simplex may require an exponential number of iterations
For most practical purposes it is λm (λ < 4)
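A back-of-the-envelope estimate follows from these two bullets. The sketch below takes the O(·) bound with a constant of 1 and assumes m is the number of constraints, which the slide does not define. The function names and the example numbers are hypothetical.

```python
def practical_iterations(m, lam=3):
    """Rough simplex iteration count lam * m, with lam < 4 as on the slide.

    Assumption: m is the number of constraints.
    """
    if not 0 < lam < 4:
        raise ValueError("lam is expected to satisfy 0 < lam < 4")
    return lam * m


def estimated_messages(num_nodes, simplex_iterations):
    """Communication estimate: iterations times number of nodes (constant taken as 1)."""
    return num_nodes * simplex_iterations


iterations = practical_iterations(m=10)            # 3 * 10 = 30
print(estimated_messages(5, iterations))           # 5 * 30 = 150
```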
Experimental Results
Artificial data set
Simulated constraint matrices at each node
Used the Distributed Data Mining Toolkit (DDMT), developed at the University of Maryland, Baltimore County (UMBC), for simulating the network structure
Two different metrics for evaluation:
  Total Communication Cost in the network (TCC)
  Average Communication Cost per Node (ACCN)
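The two metrics can be computed from per-node counts. The slides do not give formal definitions, so this sketch assumes that cost is measured in messages sent and that ACCN is TCC divided by the number of nodes. The function name and the sample counts are hypothetical.

```python
def communication_metrics(messages_per_node):
    """Return (TCC, ACCN) under the assumed definitions.

    TCC:  total number of messages sent in the network
    ACCN: TCC divided by the number of nodes
    """
    tcc = sum(messages_per_node)
    accn = tcc / len(messages_per_node)
    return tcc, accn


# Hypothetical message counts for a 4-node run.
tcc, accn = communication_metrics([12, 8, 10, 10])
print(tcc, accn)  # 40 10.0
```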
Communication Cost
Figure: Average Communication Cost per Node versus number of nodes in the network
More Experimental Results
Figure: TCC versus number of variables at each node
Figure: TCC versus number of constraints at each node
Conclusions and Future Work
Resource management and pattern recognition present formidable challenges on distributed systems
We present a distributed algorithm for resource management based on the simplex algorithm
We test our algorithm on simulated data
Future Work
Incorporation of the dynamics of the network
Testing the algorithm on a real distributed network
Effect of the size and structure of the network on the mining results
Examine the trade-off between accuracy and communication cost incurred before and after using distributed simplex on a mining task such as classification or clustering
Selected Bibliography
G. B. Dantzig, “Linear Programming and Extensions”, Princeton University Press, Princeton, NJ, 1963.
Kargupta and Chan, “Advances in Distributed and Parallel Knowledge Discovery”, AAAI Press, Menlo Park, CA, 2000.
A. L. Turinsky, “Balancing Cost and Accuracy in Distributed Data Mining”, Ph.D. Thesis, University of Illinois at Chicago, 2002.
Haimonti Dutta, “Empowering Scientific Discovery by Distributed Data Mining on the Grid Infrastructure”, Ph.D. Thesis, UMBC, 2007.
Mangasarian, “Mathematical Programming in Data Mining”, DMKD, Vol 42, pg 183–201, 1997.