Applying Support Vector Machines to Imbalanced Datasets Authors: Rehan Akbani, Stephen Kwek (University of Texas at San Antonio, USA) Nathalie Japkowicz (University of Ottawa, Canada) Published: European Conference on Machine Learning (ECML), 2004 Presenter: Rehan Akbani Home Page: http://www.cs.utsa.edu/~rakbani/
Motivation: Imbalanced datasets are datasets where the negative instances far outnumber the positive instances (or vice versa). Naturally occurring imbalanced datasets include gene profiling, medical diagnosis, and credit card fraud detection. Ratios of negative to positive instances of 100 to 1 are not uncommon.
Transcript
Applying Support Vector Machines to Imbalanced Datasets
Authors: Rehan Akbani, Stephen Kwek (University of Texas at San Antonio, USA), Nathalie Japkowicz (University of Ottawa, Canada)
Published: European Conference on Machine Learning (ECML), 2004
Outline: Motivation and Problem Definition; Key Issues; Support Vector Machines Background; Problem in Detail; Traditional Approaches to Solve the Problem; Our Approach; Results and Conclusions; Future Work and Suggested Improvements
Motivation
Imbalanced datasets are datasets where the negative instances far outnumber the positive instances (or vice versa).
Ratios of negative to positive instances of 100 to 1 are not uncommon.
Key Issues
Traditional algorithms such as SVMs, decision trees, and neural networks perform poorly with imbalanced data.
Accuracy is not a good metric to measure performance.
Need to improve traditional algorithms so that they can handle imbalanced data.
Need to define other metrics to measure performance.
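To illustrate why accuracy misleads on imbalanced data, here is a minimal sketch that scores a classifier labelling every instance as negative on a 100-to-1 dataset. The function names and the choice of the geometric mean of sensitivity and specificity (g-mean) as the alternative metric are assumptions made for this example; the slides at this point do not name a specific metric.

```python
import math

def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, tn, fn) for a binary labelling."""
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        if t == positive and p == positive:
            tp += 1
        elif t == positive:
            fn += 1
        elif p == positive:
            fp += 1
        else:
            tn += 1
    return tp, fp, tn, fn

def accuracy(y_true, y_pred):
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + fp + tn + fn)

def sensitivity(y_true, y_pred):
    # Fraction of positive instances that were found.
    tp, _, _, fn = confusion_counts(y_true, y_pred)
    return tp / (tp + fn) if (tp + fn) else 0.0

def specificity(y_true, y_pred):
    # Fraction of negative instances that were correctly rejected.
    _, fp, tn, _ = confusion_counts(y_true, y_pred)
    return tn / (tn + fp) if (tn + fp) else 0.0

def g_mean(y_true, y_pred):
    # Geometric mean: zero whenever either class is entirely missed.
    return math.sqrt(sensitivity(y_true, y_pred) * specificity(y_true, y_pred))

# 10 positives and 1000 negatives, with everything predicted negative.
y_true = [1] * 10 + [0] * 1000
y_pred = [0] * 1010
print(accuracy(y_true, y_pred), g_mean(y_true, y_pred))
```

The all-negative classifier is correct on 1000 of 1010 instances (about 99% accuracy) yet finds no positive instance, so its g-mean is zero.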
Support Vector Machines Background
Find the maximum margin boundary that separates the green and red instances.
Support Vector Machines – Support Vectors
Circled instances are support vectors.
Support Vector Machines Kernels
Kernels allow non-linear separation of instances, e.g. the Gaussian kernel.
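As a small illustration of the Gaussian kernel mentioned above, the sketch below computes K(x, z) = exp(-gamma * ||x - z||^2) for two points. The function name and the `gamma` parameterisation are assumptions for this example; the slides do not give the exact form used.

```python
import math

def gaussian_kernel(x, z, gamma=1.0):
    """Gaussian (RBF) kernel between two equal-length numeric vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

# Identical points give the maximum similarity of 1; similarity
# decays towards 0 as the points move apart.
print(gaussian_kernel([0.0, 0.0], [0.0, 0.0]))
print(gaussian_kernel([0.0, 0.0], [3.0, 4.0], gamma=0.1))
```

A larger `gamma` makes the similarity fall off faster with distance, which allows a more flexible (and more easily overfitted) boundary.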
Effects of Imbalance on SVM
1. Positive (minority) instances lie further away from the “ideal” boundary.
Effects of Imbalance on SVM
2. Support vector ratio is imbalanced.
Support vectors are shown in red.
Effects of Imbalance on SVM
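3. Weakness of Soft-Margins. Minimize the primal Lagrangian:
$$L_p = \frac{\|w\|^2}{2} + C \sum_{i=1}^{n} \xi_i - \sum_{i=1}^{n} \alpha_i \left[ y_i (w \cdot x_i + b) - 1 + \xi_i \right] - \sum_{i=1}^{n} r_i \xi_i$$
Compromise between minimization of total error and maximization of margin.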
Effects of Imbalance on SVM
The margin is maximized at the cost of a small total error: with few positive instances, misclassifying them adds little to the total error.
Traditional Approaches
Oversample the minority class or undersample the majority class.
The sample is no longer random: its distribution no longer approximates the target distribution. Defense: the sample was biased to begin with.
With undersampling, we are discarding instances that may contain valuable information.
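The two resampling strategies above can be sketched as follows. This is a minimal illustration using plain random duplication and random removal; the function names are hypothetical, and it assumes the majority class is at least as large as the minority class.

```python
import random

def random_oversample(minority, majority, rng):
    """Duplicate randomly chosen minority instances until both classes match in size."""
    n_extra = len(majority) - len(minority)
    extra = [rng.choice(minority) for _ in range(n_extra)]
    return minority + extra, majority

def random_undersample(minority, majority, rng):
    """Keep a random subset of the majority class the size of the minority class."""
    kept = rng.sample(majority, len(minority))
    return minority, kept

rng = random.Random(0)
positives = [[0.1 * i, 1.0] for i in range(5)]      # 5 minority instances
negatives = [[0.1 * i, -1.0] for i in range(500)]   # 500 majority instances

over_pos, over_neg = random_oversample(positives, negatives, rng)
under_pos, under_neg = random_undersample(positives, negatives, rng)
print(len(over_pos), len(over_neg))     # both classes now have 500 instances
print(len(under_pos), len(under_neg))   # both classes now have 5 instances
```

In this example undersampling throws away 495 of the 500 negative instances, which is the information loss the slide warns about.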
Problem with Undersampling
[Figure: learned boundary before and after undersampling]
After undersampling, the learned plane lies closer to the ideal plane, but its orientation is no longer as accurate.
Our Approach – SMOTE with Different Costs (SDC)
Do not undersample the majority class in order to retain all the information.
Use Synthetic Minority Oversampling TEchnique (SMOTE) (Chawla et al, 2002).
Use Different Error Costs (DEC) to push the boundary away from positive instances (Veropoulos et al, 1999).
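The core idea of SMOTE is to create new minority instances by interpolating between a minority instance and one of its nearest minority neighbours. The sketch below is a simplified illustration of that idea, not the reference implementation from Chawla et al.: it picks the base instance at random instead of iterating over every minority instance, and the function names and the default `k` are assumptions. It needs at least two minority instances.

```python
import random

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def smote_sketch(minority, n_synthetic, k=5, rng=None):
    """Generate n_synthetic points on segments joining minority instances to near neighbours."""
    rng = rng or random.Random()
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # The k nearest minority neighbours of the base, excluding the base itself.
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: squared_distance(base, p))[:k]
        neighbour = rng.choice(neighbours)
        # A random point on the segment from base towards the chosen neighbour.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic

positives = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote_sketch(positives, n_synthetic=6, k=2, rng=random.Random(0))
print(len(new_points))
```

Unlike plain duplication, the synthetic points are not copies of existing instances, so they spread the minority class over the region between its members.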
Effect of DEC
[Figure: learned boundary before DEC and after DEC]
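For reference, the DEC idea can be written as a sketch in which the single penalty C of the soft-margin primal Lagrangian is split into one penalty for errors on positive instances and one for errors on negative instances. The symbols w, b, ξ_i, α_i and r_i are those of the soft-margin formulation; the names C^+ and C^- are notation chosen here for illustration.

```latex
\[
L_p = \frac{\|w\|^2}{2}
      + C^{+} \sum_{\{i \mid y_i = +1\}} \xi_i
      + C^{-} \sum_{\{j \mid y_j = -1\}} \xi_j
      - \sum_{i=1}^{n} \alpha_i \left[ y_i (w \cdot x_i + b) - 1 + \xi_i \right]
      - \sum_{i=1}^{n} r_i \xi_i
\]
```

Choosing C^+ larger than C^- makes errors on the positive (minority) class more expensive, which pushes the learned boundary away from the positive instances.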
Effect of SMOTE and DEC – (SDC)
[Figure: learned boundary after DEC alone and after SMOTE and DEC]
Experiments
Used 10 different UCI datasets. Compared with four other algorithms: