Damascene: Highly Parallel Image Contour Detection
Bryan Catanzaro, Narayanan Sundaram, Bor-Yiing Su, Yunsup Lee, Mark Murphy, Kurt Keutzer

Image Contour Detection
Image contour detection is fundamental to image segmentation and many other computer vision problems.
[Figure: Image; Human Generated Contours; Machine Generated Contours]

gPb Algorithm: Current Leader
global Probability of boundary [Maire, Arbelaez, Fowlkes, Malik, CVPR 2008]
Currently, the most accurate image contour detector.
5.8 mins per small image (481 by 321) limits its applicability:
Too slow for interactive photo editing
Too slow even for Image Retrieval

Program Flow
Image -> Convert Color Space -> Textons: K-means -> Local Cues -> Combine -> Non-max Suppression -> Intervening Contour -> Generalized Eigensolver -> Oriented Energy Combination -> Global Pb: Combine, Normalize -> Contours

Stage                          CPU         GPU          Speedup
Convert Color Space            0.09 s      0.001 s      90x
Textons: K-means               8.58 s      0.169 s      51x
Local Cues                     53.18 s     0.78 s       68x
Combine                        277 ms      0.833 ms     332x
Non-max Suppression            130 ms      0.31 ms      419x
Intervening Contour            6.32 s      0.034 s      185x
Generalized Eigensolver        151.2 s     0.81 s       186x
Oriented Energy Combination    2300 ms     16.5 ms      140x
Global Pb: Combine, Normalize  437 ms      0.241 ms     1813x

Local Cues: radius R for each channel at three scales
BG      CGA     CGB     TG
R=3     R=5     R=5     R=5
R=5     R=10    R=10    R=10
R=10    R=20    R=20    R=20

Overall Performance
Image Size: 481 by 321, 154401 pixels in total
CPU/GPU Runtime: 236.7s/2.081s
Speedup: 114x
[Figure: CPU Runtime and GPU Runtime broken down by stage: Convert, Kmeans, Local Cues, Combine, Nonmax, Intervening, Eigensolver, OE, Combine]

Precision-Recall Graph
We achieve slightly better accuracy on the Berkeley Segmentation Dataset, comparing to human segmented "ground truth".
F-measure 0.70 for both
[Figure: Precision (0 to 1) against Recall (0 to 1); curves labeled CVPR 2008 and damascene]

Platform: Nvidia GTX200 Series
Specifications: GTX280
Processors            30 @ 1.3 GHz
Physical SIMD Width   8
SP GFLOPS             933
Memory Bandwidth      141.7 GB/s
Register File         1.875 MB
Local Store           480 kB
Memory                1 GB

Conclusion
• Damascene provides highest quality image contour detection at user acceptable rates
• It demonstrates the transformational speedup potential of manycore architectures
• Damascene was enabled by the collaborative environment at the Berkeley UPCRC
• Future work will generalize Damascene into a case study for application and programming frameworks (stay tuned)
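
The per-stage timings can be cross-checked with a short calculation. The sketch below is not part of Damascene: the names `STAGES` and `stage_report` are made up for illustration, and the numbers are simply the CPU/GPU times transcribed from the poster and converted to seconds. It recomputes each speedup and each stage's share of the summed runtime. Because the printed times are rounded, the recomputed speedups should only be expected to land within about one unit of the printed ones.

```python
# Per-stage timings transcribed from the poster: (name, CPU s, GPU s, printed speedup).
# Millisecond entries on the poster are converted to seconds here.
STAGES = [
    ("Convert Color Space",           0.09,   0.001,     90),
    ("Textons: K-means",              8.58,   0.169,     51),
    ("Local Cues",                    53.18,  0.78,      68),
    ("Combine",                       0.277,  0.000833,  332),
    ("Non-max Suppression",           0.130,  0.00031,   419),
    ("Intervening Contour",           6.32,   0.034,     185),
    ("Generalized Eigensolver",       151.2,  0.81,      186),
    ("Oriented Energy Combination",   2.3,    0.0165,    140),
    ("Global Pb: Combine, Normalize", 0.437,  0.000241,  1813),
]

# Overall runtimes as printed on the poster (seconds).
TOTAL_CPU_S = 236.7
TOTAL_GPU_S = 2.081


def stage_report(stages):
    """Return one dict per stage with recomputed speedup and runtime shares.

    Shares are relative to the sum of the listed stages, not to the
    overall runtimes printed on the poster.
    """
    cpu_sum = sum(cpu for _, cpu, _, _ in stages)
    gpu_sum = sum(gpu for _, _, gpu, _ in stages)
    report = []
    for name, cpu, gpu, printed in stages:
        report.append({
            "stage": name,
            "speedup": cpu / gpu,
            "printed_speedup": printed,
            "cpu_share": cpu / cpu_sum,
            "gpu_share": gpu / gpu_sum,
        })
    return report


if __name__ == "__main__":
    for row in stage_report(STAGES):
        print(f"{row['stage']:<30} {row['speedup']:8.1f}x "
              f"(printed {row['printed_speedup']}x)  "
              f"CPU {row['cpu_share']:6.1%}  GPU {row['gpu_share']:6.1%}")
    print(f"Overall: {TOTAL_CPU_S / TOTAL_GPU_S:.1f}x")
```

Two things stand out in the output. The eigensolver accounts for roughly two thirds of the summed CPU time, while on the GPU the eigensolver and the local cues are nearly equal and together make up most of the runtime. The listed stages also sum to less than the overall 236.7 s and 2.081 s; the poster does not say what makes up the remainder.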