Detecting Text in Natural Image with Connectionist Text Proposal Network

Zhi Tian 1, Weilin Huang 1,2, Tong He 1, Pan He 1, and Yu Qiao 1,3
1 Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, China
2 University of Oxford, UK
3 The Chinese University of Hong Kong, China

Motivation
● Current bottom-up approaches are complicated, with weak robustness and reliability, and accumulate errors across steps.
● State-of-the-art object detectors are powerful, but not accurate enough for text localisation.
● Our insight: fill the gap between general object detection (e.g., RPN [1]) and text detection.

Connectionist Text Proposal Network (CTPN)
Figures: CTPN architecture; CTPN proposals.
● Detects text in sequences of fine-scale proposals
● Recurrently connects sequential proposals with a BLSTM
● Jointly predicts text/non-text scores, y-axis coordinates, and refinement offsets

Recurrent Connectionist Text Proposals
Figure. Top: CTPN without recurrent connection. Bottom: with recurrent connection.
● The RNN layer connects sequential proposals directly in the convolutional layer
● The in-network recurrent architecture is end-to-end trainable
● Detects highly ambiguous text, and reduces false detections considerably

Side-Refinement
Figure. Red Box: with side-refinement.
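The joint y-axis coordinate prediction above uses a relative parameterisation: each fine-scale proposal regresses its vertical centre and height against a reference anchor, via the offsets v_c = (c_y − c_y^a)/h^a and v_h = log(h/h^a) given in the CTPN paper. A minimal decoding sketch (function and variable names are ours, not from the paper):

```python
import math

def decode_vertical(v_c, v_h, cy_anchor, h_anchor):
    """Invert CTPN's relative vertical parameterisation:
    v_c = (cy - cy_anchor) / h_anchor,  v_h = log(h / h_anchor).
    Returns the proposal's absolute y-centre and height."""
    cy = v_c * h_anchor + cy_anchor
    h = math.exp(v_h) * h_anchor
    return cy, h

# Zero offsets reproduce the anchor itself:
print(decode_vertical(0.0, 0.0, 50.0, 20.0))  # -> (50.0, 20.0)
```

Because only the vertical geometry is regressed (the horizontal extent is fixed at 16 pixels per proposal), each anchor needs just these two offsets plus a text score.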
Yellow Box: without side-refinement.
● Predicts offsets for the side-proposals, rectifying the two horizontal sides of a text line
● Further improves localisation accuracy
● A joint prediction, not a post-processing step

Detecting Text in Fine-Scale Proposals
Figures: RPN proposals vs. CTPN proposals.
● Slide a 3x3 window through the Conv5 feature maps
● A set of vertical text anchors is used at each window position
● Output a sequence of fixed 16-pixel-width proposals

Advantages:
● Encodes rich context information
● Improves localisation accuracy
● Generalises to multiple scales, aspect ratios, and languages
● Uses a single-scale image

Experimental Results

Reference:
[1] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. NIPS, 2015.
[2] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. ICLR, 2015.
[3] P. He, W. Huang, Y. Qiao, C. C. Loy, and X. Tang. Reading scene text in deep convolutional sequences. AAAI, 2016.

Zhi Tian, Weilin Huang, Tong He, Pan He, and Yu Qiao
European Conference on Computer Vision (ECCV), 2016
Online demo: textdet.com

Summary
Figure. Red Box: CTPN detection. Yellow Box: ground truth.
● Trained on 3K images in English and Chinese; generalises well to other languages (e.g., Korean)
● The fine-scale strategy improves precision, while the RNN increases both recall and precision
● Achieves F-measures of 0.88 and 0.61 on the ICDAR 2013 and ICDAR 2015 benchmarks, respectively
● Computationally efficient, with 0.14 s/image GPU time (scale = 600)
● Strong capability for detecting very small-size text
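The side-refinement step described above regresses a single horizontal offset per side-proposal, parameterised in the paper as o = (x_side − c_x^a)/w^a with the fixed anchor width w^a = 16. A small decoding sketch (names are ours), assuming that parameterisation:

```python
ANCHOR_WIDTH = 16.0  # fixed proposal width in pixels, as in CTPN

def refine_side(o, cx_anchor, w_anchor=ANCHOR_WIDTH):
    """Invert the side-refinement parameterisation
    o = (x_side - cx_anchor) / w_anchor, giving the refined
    horizontal boundary (left or right) of a text line."""
    return o * w_anchor + cx_anchor

# An offset of +0.5 anchor-widths shifts the side 8 px to the right:
print(refine_side(0.5, 100.0))  # -> 108.0
```

Applying this to the two boundary proposals of a connected text line is what rectifies its horizontal sides, rather than rounding them to the 16-pixel proposal grid.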