Detecting Text in Natural Image with Connectionist Text Proposal Network

Zhi Tian 1, Weilin Huang 1,2, Tong He 1, Pan He 1, and Yu Qiao 1,3
1 Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, China
2 University of Oxford, UK
3 The Chinese University of Hong Kong, China

Motivation
● Current bottom-up approaches are complicated, with weak robustness and reliability, and accumulate errors across steps.
● State-of-the-art object detectors are powerful, but not accurate enough for text localisation.
● Our insight: fill the gap between general object detection (e.g., RPN [1]) and text detection.

Connectionist Text Proposal Network (CTPN)
Figures: CTPN architecture; CTPN proposals.
● Detects text in sequences of fine-scale proposals
● Recurrently connects sequential proposals with a BLSTM
● Jointly predicts text/non-text scores, y-axis coordinates, and refinement offsets

Recurrent Connectionist Text Proposals
Figure. Top: CTPN without recurrent connection. Bottom: with recurrent connection.
● The RNN layer connects sequential proposals directly in the convolutional layer
● The in-network recurrent architecture is end-to-end trainable
● Detects highly ambiguous text, and reduces false detections considerably

Side-Refinement
Figure. Red Box: with side-refinement.
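The joint y-axis coordinate prediction above uses a relative parameterisation: each fine-scale proposal regresses its vertical centre and height against a reference anchor, via the offsets v_c = (c_y − c_y^a)/h^a and v_h = log(h/h^a) given in the CTPN paper. A minimal decoding sketch (function and variable names are ours, not from the paper):

```python
import math

def decode_vertical(v_c, v_h, cy_anchor, h_anchor):
    """Invert CTPN's relative vertical parameterisation:
    v_c = (cy - cy_anchor) / h_anchor,  v_h = log(h / h_anchor).
    Returns the proposal's absolute y-centre and height."""
    cy = v_c * h_anchor + cy_anchor
    h = math.exp(v_h) * h_anchor
    return cy, h

# Zero offsets reproduce the anchor itself:
print(decode_vertical(0.0, 0.0, 50.0, 20.0))  # -> (50.0, 20.0)
```

Because only the vertical geometry is regressed (the horizontal extent is fixed at 16 pixels per proposal), each anchor needs just these two offsets plus a text score.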
Yellow Box: without side-refinement.
● Predicts offsets for the side-proposals, rectifying the two horizontal sides of a text line
● Further improves localisation accuracy
● A joint prediction, not a post-processing step

Detecting Text in Fine-Scale Proposals
Figures: RPN proposals vs. CTPN proposals.
● Slide a 3x3 window through the Conv5 feature maps
● A set of vertical text anchors is used at each window position
● Output a sequence of fixed 16-pixel-width proposals

Advantages:
● Encodes rich context information
● Improves localisation accuracy
● Generalises to multiple scales, aspect ratios, and languages
● Uses a single-scale image

Experimental Results

Reference:
[1] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. NIPS, 2015.
[2] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. ICLR, 2015.
[3] P. He, W. Huang, Y. Qiao, C. C. Loy, and X. Tang. Reading scene text in deep convolutional sequences. AAAI, 2016.

Zhi Tian, Weilin Huang, Tong He, Pan He, and Yu Qiao
European Conference on Computer Vision (ECCV), 2016
Online demo: textdet.com

Summary
Figure. Red Box: CTPN detection. Yellow Box: ground truth.
● Trained on 3K images in English and Chinese; generalises well to other languages (e.g., Korean)
● The fine-scale strategy improves precision, while the RNN increases both recall and precision
● Achieves F-measures of 0.88 and 0.61 on the ICDAR 2013 and ICDAR 2015 benchmarks, respectively
● Computationally efficient, with 0.14 s/image GPU time (scale = 600)
● Strong capability for detecting very small-size text
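The side-refinement step described above regresses a single horizontal offset per side-proposal, parameterised in the paper as o = (x_side − c_x^a)/w^a with the fixed anchor width w^a = 16. A small decoding sketch (names are ours), assuming that parameterisation:

```python
ANCHOR_WIDTH = 16.0  # fixed proposal width in pixels, as in CTPN

def refine_side(o, cx_anchor, w_anchor=ANCHOR_WIDTH):
    """Invert the side-refinement parameterisation
    o = (x_side - cx_anchor) / w_anchor, giving the refined
    horizontal boundary (left or right) of a text line."""
    return o * w_anchor + cx_anchor

# An offset of +0.5 anchor-widths shifts the side 8 px to the right:
print(refine_side(0.5, 100.0))  # -> 108.0
```

Applying this to the two boundary proposals of a connected text line is what rectifies its horizontal sides, rather than rounding them to the 16-pixel proposal grid.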