Nearest Neighbor Editing and Condensing Techniques
David Claus and Christoph F. Eick

Organization
1. Nearest Neighbor Revisited
2. Condensing Techniques
3. Proximity Graphs and Decision Boundaries
4. Editing Techniques

Last updated: Nov. 7, 2013
Nearest Neighbour Issues

• Expensive
  – To determine the nearest neighbour of a query point q, the distance to all N training examples must be computed
  + Pre-sort training examples into fast data structures (kd-trees); see the sketch below
  + Compute only an approximate distance (LSH)
  + Remove redundant data (condensing)
• Storage Requirements
  – Must store all training data P
  + Remove redundant data (condensing)
  – Pre-sorting often increases the storage requirements
• High Dimensional Data
  – "Curse of Dimensionality"
  – Required amount of training data increases exponentially with dimension
  – Computational cost also increases dramatically
  – Partitioning techniques degrade to linear search in high dimensions
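Why pre-sorting into a kd-tree helps: a minimal sketch comparing the brute-force scan over all N training examples with a query against scipy.spatial.cKDTree. The data shapes and sizes are illustrative assumptions, not taken from the slides.

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 3))    # N training examples in d dimensions (illustrative sizes)
q = rng.normal(size=3)              # a single query point

# Brute force: compute the distance to all N training examples, O(N*d) per query.
brute_idx = int(np.argmin(np.linalg.norm(X - q, axis=1)))

# Pre-sorted structure: build once, then low-dimensional queries are typically much faster.
tree = cKDTree(X)
dist, tree_idx = tree.query(q, k=1)
assert brute_idx == tree_idx
```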
Exact Nearest Neighbour

• Asymptotic error (infinite sample size) is less than twice the Bayes classification error
  – Requires a lot of training data
• Expensive for high-dimensional data (d > 20?)
• O(Nd) complexity for both storage and query time
  – N is the number of training examples, d is the dimension of each sample (a brute-force 1-NN rule is sketched below)
  – This can be reduced through dataset editing/condensing
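A minimal sketch of the brute-force 1-NN rule that the O(Nd) storage and query figures refer to; the helper name nn_classify is mine, not from the slides.

```python
import numpy as np

def nn_classify(train_X, train_y, query):
    """Brute-force 1-NN rule: one query costs N distance evaluations in d
    dimensions (O(N*d)), and the full training set must be stored."""
    dists = np.linalg.norm(train_X - query, axis=1)  # N distances
    return train_y[np.argmin(dists)]
```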
Decision Regions

Each cell contains one sample, and every location within the cell is closer to that sample than to any other sample. A Voronoi diagram divides the space into such cells.

Every query point will be assigned the classification of the sample within that cell. The decision boundary separates the class regions based on the 1-NN decision rule. Knowledge of this boundary is sufficient to classify new points. The boundary itself is rarely computed; many algorithms instead seek to retain only those points necessary to generate an identical boundary.
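A small sketch of the link between the Voronoi diagram and the 1-NN decision boundary, using scipy.spatial.Voronoi on a toy labelled point set (the data is an illustrative assumption): the boundary is the union of Voronoi ridges whose two generating samples carry different class labels.

```python
import numpy as np
from scipy.spatial import Voronoi

# Toy 2-D training samples with class labels (illustrative data).
X = np.array([[0.0, 0.0], [1.0, 0.2], [0.2, 1.1], [1.3, 1.0], [2.0, 0.1]])
y = np.array([0, 0, 1, 1, 0])

vor = Voronoi(X)
# Each Voronoi ridge separates the cells of two training samples; the 1-NN decision
# boundary is exactly the union of ridges whose two samples have different labels.
boundary_pairs = [tuple(pair) for pair in vor.ridge_points if y[pair[0]] != y[pair[1]]]
print(boundary_pairs)
```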
Condensing

• Aim is to reduce the number of training samples
• Retain only the samples that are needed to define the decision boundary
• This is reminiscent of a Support Vector Machine
• Decision Boundary Consistent – a subset whose nearest neighbour decision boundary is identical to the boundary of the entire training set
• Consistent Set – a subset of the training data that correctly classifies all of the original training data (a consistency check is sketched below)
• Minimum Consistent Set – smallest consistent set

[Figures: original data, condensed data, minimum consistent set]
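A minimal check of the "consistent set" property defined above, assuming Euclidean distance; the helper name is_consistent is my own.

```python
import numpy as np

def is_consistent(subset_X, subset_y, X, y):
    """True iff the subset, used as a 1-NN reference set, classifies every
    sample of the original training set (X, y) correctly."""
    for xi, yi in zip(X, y):
        nearest = np.argmin(np.linalg.norm(subset_X - xi, axis=1))
        if subset_y[nearest] != yi:
            return False
    return True
```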
Condensing: Condensed Nearest Neighbour (CNN)

• Hart 1968
  – Incremental
  – Order dependent
  – Neither minimal nor decision boundary consistent
  – O(n³) for the brute-force method
  – Can follow up with reduced NN [Gates 1972]: remove a sample if doing so does not cause any incorrect classifications

1. Initialize the subset with a single training example
2. Classify all remaining samples using the subset, and transfer any incorrectly classified samples to the subset
3. Return to 2 until no transfers occur or the subset is full

Produces a consistent set (a sketch of this loop follows below).
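A minimal sketch of Hart's CNN loop as described above, assuming Euclidean distance and integer class labels; the function name condensed_nn and the random processing order (which determines the order dependence) are my own choices.

```python
import numpy as np

def condensed_nn(X, y, seed=None):
    """Hart's Condensed Nearest Neighbour: grow a subset until it classifies
    every remaining training sample correctly under the 1-NN rule."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))          # processing order -> order-dependent result
    keep = [order[0]]                        # 1. start with a single training example
    changed = True
    while changed:                           # 3. repeat until no transfers occur
        changed = False
        for i in order:
            if i in keep:
                continue
            nearest = keep[np.argmin(np.linalg.norm(X[keep] - X[i], axis=1))]
            if y[nearest] != y[i]:           # 2. transfer misclassified samples
                keep.append(i)
                changed = True
    return np.array(keep)
```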
Proximity Graphs: Gabriel Graph

[Figures: an example Gabriel graph, and a point pair that is not a Gabriel edge]
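The slides only show Gabriel-graph figures, so the sketch below assumes the standard definition: two samples are Gabriel neighbours iff the ball having their connecting segment as diameter contains no other sample. The helper name is_gabriel_edge is mine.

```python
import numpy as np

def is_gabriel_edge(p, q, points):
    """Gabriel test: p and q are neighbours iff the ball whose diameter is the
    segment pq contains no other sample."""
    centre = (p + q) / 2.0
    radius = np.linalg.norm(p - q) / 2.0
    for x in points:
        if np.array_equal(x, p) or np.array_equal(x, q):
            continue
        if np.linalg.norm(x - centre) < radius:
            return False
    return True
```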
Proximity Graphs: RNG

• The Relative Neighbourhood Graph (RNG) is a subgraph of the Gabriel graph
• Two points are neighbours if the "lune" defined by the intersection of their radial spheres is empty (see the sketch below)
• Further reduces the number of neighbours
• Decision boundary changes are often drastic, and not guaranteed to be training set consistent

[Figures: Gabriel-edited set; RNG-edited set – not consistent]
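A minimal version of the lune test described above (no other sample x may satisfy max(d(p,x), d(q,x)) < d(p,q)); is_rng_edge is my own name, and Euclidean distance is assumed.

```python
import numpy as np

def is_rng_edge(p, q, points):
    """RNG test: p and q are neighbours iff no other sample lies in the lune,
    i.e. no x has max(d(p,x), d(q,x)) < d(p,q)."""
    d_pq = np.linalg.norm(p - q)
    for x in points:
        if np.array_equal(x, p) or np.array_equal(x, q):
            continue
        if max(np.linalg.norm(p - x), np.linalg.norm(q - x)) < d_pq:
            return False
    return True
```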
Dataset Reduction: Editing

• Training data may contain noise and overlapping classes
  – starting to make assumptions about the underlying distributions
• Editing seeks to remove noisy points and produce smooth decision boundaries – often by retaining points far from the decision boundaries
• Results in homogeneous clusters of points
Wilson Editing

• Wilson 1972
• Remove points that do not agree with the majority of their k nearest neighbours (sketched below)

[Figures: original data vs. Wilson editing with k=7, for the earlier example and for overlapping classes]
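A minimal sketch of Wilson editing as stated above: drop a sample whenever its class loses the majority vote among its k nearest neighbours. The name wilson_edit is mine, and non-negative integer class labels plus Euclidean distance are assumed.

```python
import numpy as np

def wilson_edit(X, y, k=7):
    """Return the indices of samples that survive Wilson editing."""
    keep = []
    for i in range(len(X)):
        dists = np.linalg.norm(X - X[i], axis=1)
        dists[i] = np.inf                        # exclude the point itself
        neighbours = np.argsort(dists)[:k]
        votes = np.bincount(y[neighbours])       # assumes integer class labels
        if votes.argmax() == y[i]:               # keep only majority-agreeing points
            keep.append(i)
    return np.array(keep)
```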
Multi-edit

• Multi-edit [Devijver & Kittler '79]
  – Repeatedly apply Wilson editing to random partitions
  – Classify with the 1-NN rule
• Approximates the error rate of the Bayes decision rule

1. Diffusion: divide the data into N ≥ 3 random subsets
2. Classification: classify S_i using the 1-NN rule with S_((i+1) mod N) as the training set (i = 1..N)
3. Editing: discard all samples incorrectly classified in (2)
4. Confusion: pool all remaining samples into a new data set
5. Termination: if the last I iterations produced no editing, then end; otherwise go to (1)

A sketch of this loop follows below.

[Figure: multi-edit, 8 iterations – last 3 the same]
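A minimal sketch of the five steps above; multi_edit and its parameter names are my own, Euclidean distance is assumed, and the stopping rule uses a fixed number of consecutive passes with no editing.

```python
import numpy as np

def multi_edit(X, y, n_subsets=3, quiet_iterations=3, seed=None):
    """Multi-edit sketch: diffuse, classify each subset against the next one
    with 1-NN, discard errors, pool survivors, stop after several quiet passes."""
    rng = np.random.default_rng(seed)
    idx = np.arange(len(X))
    quiet = 0
    while quiet < quiet_iterations and len(idx) > n_subsets:
        parts = np.array_split(rng.permutation(idx), n_subsets)   # 1. diffusion
        survivors = []
        for i, part in enumerate(parts):
            ref = parts[(i + 1) % n_subsets]                      # 2. classification vs next subset
            for j in part:
                nearest = ref[np.argmin(np.linalg.norm(X[ref] - X[j], axis=1))]
                if y[nearest] == y[j]:
                    survivors.append(j)                           # 3. editing: keep correct samples
        quiet = quiet + 1 if len(survivors) == len(idx) else 0    # 5. termination test
        idx = np.array(survivors, dtype=int)                      # 4. confusion: pool survivors
    return idx
```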
Combined Editing/Condensing

• First edit the data to remove noise and smooth the boundary
• Then condense to obtain a smaller subset (a combined usage sketch follows below)
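A short usage sketch chaining the hypothetical wilson_edit and condensed_nn helpers from the earlier sketches (they must be in scope); the toy data is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))                          # toy two-class data (illustrative)
y = (X[:, 0] + rng.normal(scale=0.5, size=500) > 0).astype(int)

kept = wilson_edit(X, y, k=7)                          # editing: remove noisy points (sketch above)
subset = condensed_nn(X[kept], y[kept])                # condensing: keep boundary points (sketch above)
X_small, y_small = X[kept][subset], y[kept][subset]
print(len(X), "->", len(kept), "->", len(X_small))
```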
Where are we with respect to NN?

• Simple method, pretty powerful rule
• Very popular in text mining (seems to work well for this task)
• Can be made to run fast
• Requires a lot of training data
• Edit to reduce noise and class overlap
• Condense to remove data that are not needed
Problems when using k-NN in Practice

• What distance measure to use?
  – Often Euclidean distance is used
  – Locally adaptive metrics
  – More complicated with non-numeric data, or when different dimensions have different scales
• Choice of k?
  – Cross-validation (sketched below)
  – 1-NN often performs well in practice
  – k-NN needed for overlapping classes
  – Re-label all data according to k-NN, then classify with 1-NN
  – Reduce the k-NN problem to 1-NN through dataset editing
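A minimal sketch of choosing k by cross-validation with scikit-learn; the dataset (Iris) and the candidate values of k are illustrative assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)   # stand-in dataset (illustrative)

# Score each candidate k by 5-fold cross-validation and keep the best mean accuracy.
scores = {k: cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
          for k in (1, 3, 5, 7, 9, 11)}
best_k = max(scores, key=scores.get)
print(scores, "best k:", best_k)
```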