Accelerating Dynamic Time Warping Clustering with a Novel Admissible Pruning Strategy

Nurjahan Begum, Liudmila Ulanova, Jun Wang¹ and Eamonn Keogh
University of California, Riverside
¹ UT Dallas

Motivation of DTW Clustering

[Figure: time series for the hashtags #kanyewest, #Michael, #MichaelJackson and #taylorswift over 0 to 120 hours. Annotations: "Synonym Discovery?", "Association Discovery?", "I'mma let you finish".]

[Figure: Bos taurus, Hyperoodon ampullatus and Talpa europaea grouped under DTW and under ED. Label: Cetartiodactyla.]

Why is DTW Clustering Hard?

Scalability Issue: DTW is not a metric, and is therefore very difficult to index.
Quality Issue: We need a clustering algorithm that is insensitive to outliers.

[Figure: thirteen numbered points (1 to 13). Annotations: "Mislabeled by k-means", "Outlier".]

Why Existing Work is not the Answer?

Observation 1: The convergence of DTW and Euclidean distance results for increasing data sizes.
Observation 2: The increasing effectiveness of lower-bounding pruning for increasing data sizes.

[Figure: 1-NN error rate vs. size of training set (0 to 2000). Curves: Euclidean, DTW.]
[Figure: Rand Index vs. dataset size (0 to 2000). Curves: DTW, Euclidean.]

Neither of these two observations helps!

Density Peaks (DP) Algorithm

3 steps:
1. Density Calculation
2. NN within Higher Density List Calculation
3. Cluster Assignment

Local density of item i:

$\rho_i = \sum_j \chi(d_{ij} - d_c)$

where $\chi(x) = 1$ if $x < 0$ and $0$ otherwise, and $d_c$ is the cutoff distance.

[Figure: the thirteen numbered points annotated with their local densities ρ. Annotations: "Elements with higher density", "Item 1's cluster label = item 3's cluster label".]

TADPole: Our Proposed Algorithm

Pruning During Local Density Computation
[Figure, panels A) to D): for a pair (i, j), the distance D_ij is shown against LB Matrix(i, j), UB Matrix(i, j) and the cutoff d_c. In panel A), D_ij = 0.]

Pruning During NN Distance Calculation From Higher Density List
[Figure, panels A) to D): for item i and candidates j_1 to j_4, the distances D_1 to D_4 are shown against LB Matrix(i, j_k) and UB Matrix(i, j_k).]

How Effective is TADPole's Pruning?

[Figure: Distance Calculations vs. number of objects (0 to 3500), in two panels: Absolute Number (×10^6) and Percentage (0 to 100). Curves: Brute force, TADPole.]

DP: 9 hours. TADPole: 9 minutes.

Distance Computation Ordering: Anytime TADPole

[Figure: Rand Index (0.4 to 1) vs. Distance Computation Percentage (0 to 100%). Curves: Euclidean Distance, Oracle Order, TADPole Order, Random Order. Annotation: "This reflects the 90% of DTW calculations that were admissibly pruned".]
[Figure: Zoom-In of Above Figure (0 to 10%). Curves: Oracle Order, Random Order, TADPole Order. Annotation: "This reflects the 10% of DTW calculations that were calculated in anytime ordering".]

How 'good' are TADPole Clusters?

Case Study 1: Electromagnetic Articulograph

[Figure: articulograph traces, axes Y and Z, items 1 to 7.]
[Figure: Rand Index (0.84 to 1) vs. Distance Computation Percentage. Curves: Euclidean Distance, Oracle Order, Random Order, TADPole Order.]

Pruning: 94%

Case Study 2: Pulsus Dataset

[Figure: oximeter schematic. Labels: Oximeter, Vein, Artery, Photo Detector, LED.]
[Figure, panels A) to F): recordings for Patient 639, Patient 523, Patient 618 and Patient 2975918 (0 to 60), and Power Spectral Density vs. Frequency with Normalized Respiration Rate and Normalized Heart Rate marked. Class labels: Suspected Pulsus, Severe Pulsus, Healthy.]
[Figure: PPG for Non-Severe Pulsus and Severe Pulsus (200 to 1800).]
Pruning: 88%

Reproducibility

All the code and datasets used in this paper are publicly available at: www.cs.ucr.edu/~nbegu001/SpeededClusteringDTW
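
Illustrative sketch

The three Density Peaks steps and the bound-based pruning of the local density computation named on this poster can be sketched roughly as below. This is a minimal illustration, not the authors' released code. The function names (`local_density_pruned`, `density_peaks`), the toy one-dimensional data, the toy bounds, the exact pruning cases, and the choice of centers by the largest ρ·δ are all assumptions made for the example. The only requirement on the bounds is admissibility: lower bound ≤ true distance ≤ upper bound.

```python
def local_density_pruned(n, lb, ub, dist_fn, d_c):
    """Step 1 (density) with admissible pruning.

    Assumes lb[i][j] <= true distance <= ub[i][j].
    Returns (rho, number of full distance computations).
    """
    rho = [0] * n
    computed = 0
    for i in range(n):
        for j in range(i + 1, n):      # i == j is skipped: that distance is 0
            if lb[i][j] >= d_c:
                continue               # certainly not within d_c: prune
            if ub[i][j] < d_c:
                within = True          # certainly within d_c: count, no computation
            else:
                computed += 1          # d_c falls between the bounds: compute
                within = dist_fn(i, j) < d_c
            if within:
                rho[i] += 1
                rho[j] += 1
    return rho, computed


def density_peaks(dist, rho, k):
    """Steps 2 and 3 on a full distance matrix (no pruning here, for clarity)."""
    n = len(dist)
    order = sorted(range(n), key=lambda i: -rho[i])   # decreasing density
    delta = [0.0] * n                                 # distance to NN of higher density
    nn = [-1] * n                                     # that neighbour's index
    for pos, i in enumerate(order):
        if pos == 0:
            delta[i] = max(dist[i])    # densest item has no higher-density neighbour
            continue
        j = min(order[:pos], key=lambda h: dist[i][h])
        delta[i], nn[i] = dist[i][j], j
    # Assumed center selection: the k items with the largest rho * delta.
    gamma = [rho[i] * delta[i] for i in range(n)]
    gamma[order[0]] = float("inf")     # the densest item is always a center
    centers = sorted(range(n), key=lambda i: -gamma[i])[:k]
    labels = [-1] * n
    for c, i in enumerate(centers):
        labels[i] = c
    for i in order:                    # neighbours are labeled before their followers
        if labels[i] == -1:
            labels[i] = labels[nn[i]]
    return delta, labels


# Toy data: absolute difference stands in for DTW; bounds are scaled copies.
points = [0, 1, 2, 50, 51, 52]
n = len(points)
dist = [[abs(a - b) for b in points] for a in points]
lb = [[0.5 * d for d in row] for row in dist]   # hypothetical lower bound
ub = [[1.2 * d for d in row] for row in dist]   # hypothetical upper bound
rho, computed = local_density_pruned(n, lb, ub, lambda i, j: dist[i][j], d_c=1.5)
delta, labels = density_peaks(dist, rho, k=2)
print(rho, computed, labels)
```

In the cluster assignment loop, each non-center item inherits the label of its nearest neighbour of higher density, which is the relationship the poster annotates as "Item 1's cluster label = item 3's cluster label". On the toy data, only the pairs whose bounds straddle d_c trigger a full distance computation; all others are decided from the bounds alone.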