

1

Support Cluster Machine: Paper from ICML 2007

Read by Haiqin Yang

2007-10-18

This paper, Support Cluster Machine, was written by Bin Li, Mingmin Chi, Jianping Fan, and Xiangyang Xue, and was published at ICML in 2007.

2

Outline

Background and Motivation

Support Cluster Machine - SCM

Kernel in SCM

Experiments

An Interesting Application: Privacy-preserving Data Mining

Discussions

3

Background and Motivation

Large-scale classification problem: existing approaches

Decomposition methods: Osuna et al., 1997; Joachims, 1999; Platt, 1999; Collobert & Bengio, 2001; Keerthi et al., 2001

Incremental algorithms: Cauwenberghs & Poggio, 2000; Fung & Mangasarian, 2002; Laskov et al., 2006

Parallel techniques: Collobert et al., 2001; Graf et al., 2004

Approximate formulations: Fung & Mangasarian, 2001; Lee & Mangasarian, 2001

Choosing representatives: active learning (Schohn & Cohn, 2000); Cluster-Based SVM (Yu et al., 2003); Core Vector Machine (CVM; Tsang et al., 2005); Clustering SVM (Boley & Cao, 2004)

4

Support Cluster Machine - SCM

Given training samples (x_i, y_i), i = 1, ..., n, with x_i in R^d and labels y_i in {-1, +1}:

Procedure: cluster each class into Gaussian components and train an SVM-style machine whose training units are those components (the support clusters); a sketch follows.
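A minimal sketch of that pipeline, assuming scikit-learn's GaussianMixture in place of the paper's TOD/EM clustering, using the probability product kernel defined on the kernel slide below, and omitting the cluster-size weighting that the full SCM dual applies:

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.svm import SVC

def gaussian_pp_kernel(mu1, S1, mu2, S2):
    # Probability product kernel (rho = 1) between two Gaussians:
    # integral of N(x; mu1, S1) * N(x; mu2, S2) dx = N(mu1 - mu2; 0, S1 + S2).
    S = S1 + S2
    diff = mu1 - mu2
    d = mu1.size
    norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(S))
    return np.exp(-0.5 * diff @ np.linalg.solve(S, diff)) / norm

def fit_scm(X_pos, X_neg, k=25, C=1.0):
    # Step 1: summarize each class by k Gaussian components (the support clusters).
    comps, labels = [], []
    for X, y in ((X_pos, +1), (X_neg, -1)):
        gmm = GaussianMixture(n_components=k, covariance_type="full").fit(X)
        for mu, S in zip(gmm.means_, gmm.covariances_):
            comps.append((mu, S))
            labels.append(y)
    # Step 2: train an SVM over the clusters with a precomputed Gram matrix.
    G = np.array([[gaussian_pp_kernel(m1, S1, m2, S2) for m2, S2 in comps]
                  for m1, S1 in comps])
    svm = SVC(C=C, kernel="precomputed").fit(G, labels)
    return comps, svm

def predict_scm(comps, svm, X_test):
    # Step 3: a test vector is a degenerate Gaussian with zero covariance.
    Z = np.zeros((X_test.shape[1], X_test.shape[1]))
    K = np.array([[gaussian_pp_kernel(x, Z, mu, S) for mu, S in comps]
                  for x in X_test])
    return svm.predict(K)
```

Training cost then scales with the number of clusters rather than the number of samples, which is the point of the method for large-scale problems.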

5

SCM Solution

Dual representation

Decision function
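The dual and decision function appeared as figures on the original slide. By analogy with the standard soft-margin SVM dual, with cluster distributions p_i in place of training vectors (the paper additionally scales the box constraints by cluster size, omitted in this sketch), they plausibly take the form:

\[
\max_{\alpha}\ \sum_{i=1}^{N} \alpha_i - \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j K(p_i, p_j)
\quad \text{s.t.} \quad \sum_{i=1}^{N} \alpha_i y_i = 0,\ \ 0 \le \alpha_i \le C
\]

\[
f(\mathbf{x}) = \operatorname{sign}\Big(\sum_{i=1}^{N} \alpha_i y_i K(p_i, \delta_{\mathbf{x}}) + b\Big)
\]

where N is the number of clusters and \(\delta_{\mathbf{x}}\) treats the test vector as a point-mass distribution.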

6

Kernel

Probability product kernel

Under the Gaussian assumption, i.e., each cluster modeled as a Gaussian density, the kernel integral has a closed form, shown below.
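The formulas were figures on the original slide; the probability product kernel (in its rho = 1, expected-likelihood form) and its standard Gaussian closed form are:

\[
K(p, p') = \int p(\mathbf{x})\, p'(\mathbf{x})\, d\mathbf{x}
\]

\[
p = \mathcal{N}(\boldsymbol{\mu}, \Sigma),\ p' = \mathcal{N}(\boldsymbol{\mu}', \Sigma')
\ \Longrightarrow\
K(p, p') = (2\pi)^{-d/2}\, \lvert \Sigma + \Sigma' \rvert^{-1/2} \exp\!\Big(-\tfrac{1}{2} (\boldsymbol{\mu} - \boldsymbol{\mu}')^{\top} (\Sigma + \Sigma')^{-1} (\boldsymbol{\mu} - \boldsymbol{\mu}')\Big)
\]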

7

Kernel Property I

Decision function

Property II
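The property statements are not in the transcript; a hedged reconstruction consistent with the closed form above: for Property I, treating a test vector as a point mass \(\delta_{\mathbf{x}}\) reduces the kernel to a Gaussian evaluation, so the decision function applies directly to plain vectors; for Property II, equal spherical covariances reduce the kernel to a scaled RBF kernel:

\[
K\big(\mathcal{N}(\boldsymbol{\mu}, \Sigma), \delta_{\mathbf{x}}\big) = \mathcal{N}(\mathbf{x}; \boldsymbol{\mu}, \Sigma)
\qquad
K\big(\mathcal{N}(\boldsymbol{\mu}, \sigma^2 I), \mathcal{N}(\boldsymbol{\mu}', \sigma^2 I)\big) = (4\pi\sigma^2)^{-d/2} \exp\!\Big(-\frac{\lVert \boldsymbol{\mu} - \boldsymbol{\mu}' \rVert^{2}}{4\sigma^{2}}\Big)
\]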

8

Experiments

Datasets: Toydata; MNIST (handwritten digits '0'-'9' classification); Adult (privacy-preserving dataset)

Clustering algorithms: Threshold Order Dependent (TOD) and the EM algorithm (a sketch of TOD follows below)

Classification methods: libSVM, SVMTorch, SVMlight, CVM (Core Vector Machine), SCM

Model selection

CPU: 3.0 GHz
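TOD clustering is a simple one-pass, order-dependent scheme. A minimal sketch of the commonly described variant, where each sample joins the nearest existing cluster if it falls within a distance threshold and otherwise seeds a new one (the paper's exact variant may differ):

```python
import numpy as np

def tod_cluster(X, threshold):
    """One-pass Threshold Order Dependent (TOD) clustering sketch.

    Each sample joins the nearest existing cluster if the distance to
    that cluster's running mean is within `threshold`; otherwise it
    seeds a new cluster. The result depends on presentation order.
    """
    centers, counts, assign = [], [], []
    for x in X:
        if centers:
            dists = [np.linalg.norm(x - c) for c in centers]
            j = int(np.argmin(dists))
            if dists[j] <= threshold:
                counts[j] += 1
                centers[j] += (x - centers[j]) / counts[j]  # update running mean
                assign.append(j)
                continue
        centers.append(x.astype(float))  # start a new cluster at this sample
        counts.append(1)
        assign.append(len(centers) - 1)
    return np.array(centers), np.array(assign)
```

With the threshold chosen by model selection, the cluster means, per-cluster covariances, and counts (as priors) supply the Gaussian components that SCM consumes.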

9

Toydata

Samples: 2,500 per class, generated from a mixture of Gaussians

Clustering algorithm: TOD; clustering results: 25 positive and 25 negative clusters

10

MNIST

Data description: 10 classes, handwritten digits '0'-'9'; 60,000 training samples (about 6,000 per class); 10,000 testing samples

Construct 45 binary classifiers, one per pair of digits (10 x 9 / 2 = 45)

Results: 25 clusters for the EM algorithm

11

MNIST

Test results for TOD algorithm

12

Privacy-preserving Data Mining: Inter-Enterprise Data Mining

Problem: Two parties owning confidential databases wish to build a decision-tree classifier on the union of their databases, without revealing any unnecessary information.

Horizontally partitioned: records (users) split across companies. Example: a credit card fraud detection model.

Vertically partitioned: attributes split across companies. Example: associations across websites.

13

Privacy-preserving Data Mining: Randomization Approach

[Diagram] Original records (50 | 40K | ..., 30 | 70K | ...) pass through a randomizer, which publishes perturbed records (65 | 20K | ..., 25 | 60K | ...); the distributions of Age and Salary are then reconstructed from the perturbed data and fed to data mining algorithms to build the model.
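As a sketch of this randomization approach (Agrawal & Srikant's scheme): each party perturbs values with additive noise from a known distribution, and the miner reconstructs the original distribution by iterative Bayesian updating. The noise range, bin grid, and iteration count below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Original ages (never revealed) and the additive uniform randomizer.
ages = rng.normal(35.0, 10.0, size=5000).clip(15, 80)
randomized = ages + rng.uniform(-20, 20, size=ages.size)  # published values

# Discretize the support of Age into bins.
edges = np.linspace(-10.0, 105.0, 116)
centers = (edges[:-1] + edges[1:]) / 2
width = centers[1] - centers[0]

def noise_pdf(y):
    # Density of the Uniform(-20, 20) noise, known to the miner.
    return np.where(np.abs(y) <= 20, 1.0 / 40.0, 0.0)

# Iterative Bayesian reconstruction of the Age distribution:
# f_{t+1}(a) = (1/n) * sum_i f_Y(w_i - a) f_t(a) / sum_z f_Y(w_i - z) f_t(z)
f = np.full(centers.size, 1.0 / (centers.size * width))  # flat initial guess
lik = noise_pdf(randomized[:, None] - centers[None, :])  # lik[i, j] = f_Y(w_i - a_j)
for _ in range(50):
    posterior = lik * f                                   # per-record posterior
    posterior /= posterior.sum(axis=1, keepdims=True) + 1e-12
    f = posterior.mean(axis=0) / width                    # average back to a density

# `f` now approximates the density of the original ages on `centers`.
```

The reconstruction recovers only the aggregate distribution, never individual records, which is the privacy guarantee the approach relies on.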

14

Classification Example

Age | Salary | Repeat Visitor?
23  | 50K    | Repeat
17  | 30K    | Repeat
43  | 40K    | Repeat
68  | 50K    | Single
32  | 70K    | Single
20  | 20K    | Repeat

Decision tree learned from this data:

Age < 25?
  Yes -> Repeat
  No  -> Salary < 50K?
         Yes -> Repeat
         No  -> Single

15

Privacy-preserving Dataset: Adult

Data description: training samples: 30,162; testing samples: 15,060; percentage of positive samples: 24.78%

Procedure: horizontally partition the data into three subsets (parties); cluster each subset with the TOD algorithm; obtain three positive and three negative GMMs; combine them into one positive and one negative GMM with modified priors (see below); classify with SCM
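One natural reading of "modified priors" (an assumption; the slide's formula is not in the transcript): rescale each party's component weights by that party's share of the class, so the combined class-conditional density remains a valid mixture:

\[
p_{\pm}(\mathbf{x}) = \sum_{m=1}^{3} \frac{n_m^{\pm}}{n^{\pm}} \sum_{k} \pi_{mk}^{\pm}\, \mathcal{N}\big(\mathbf{x};\, \boldsymbol{\mu}_{mk}^{\pm},\, \Sigma_{mk}^{\pm}\big)
\]

where \(n_m^{\pm}\) is the number of class-\(\pm\) samples held by party m, \(n^{\pm} = \sum_m n_m^{\pm}\), and \((\pi_{mk}^{\pm}, \boldsymbol{\mu}_{mk}^{\pm}, \Sigma_{mk}^{\pm})\) are party m's GMM parameters.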

16

Privacy-preserving Dataset: Adult

Partition results

Experimental results

17

Discussions

Problems addressed:

Large-scale problems: downsample by clustering, then train the classifier on the clusters

Privacy-preserving problems: individual information stays hidden inside cluster-level statistics

Differences from other methods:

Training units are generative models, while testing units are vectors

Training units carry complete statistical information

Only one parameter for model selection

Easy implementation

Generalization ability is not yet clear, whereas the RBF kernel in SVM has the property that a larger width leads to a lower VC dimension

18

Discussions

Advantages of using priors and covariances

19

Thank you!
