Detecting Text in Natural Image with Connectionist Text Proposal Network

Zhi Tian1, Weilin Huang1,2, Tong He1, Pan He1, and Yu Qiao1,3
1 Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, China
2 University of Oxford, UK
3 The Chinese University of Hong Kong, China

Insight

Motivation
● Current bottom-up approaches are complicated, with weak robustness and reliability, and accumulated errors.
● State-of-the-art object detectors are powerful, but not accurate for text localisation.
● Fill the gap between general object detection (e.g., RPN [1]) and text detection.

Connectionist Text Proposal Network (CTPN)

CTPN Architecture

CTPN Proposals

● Recurrently connect sequential proposals by BLSTM
● Jointly predict text scores, y-axis coordinates, and refinement offsets
● Detect text in sequences of fine-scale proposals
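The y-axis prediction above regresses only the vertical extent of each proposal relative to a reference anchor (a centre offset and a log-space height). A minimal sketch of decoding those relative values back to absolute edges; the function name is my own:

```python
import math

def decode_vertical(v_c, v_h, anchor_cy, anchor_h):
    """Recover a proposal's vertical extent from relative predictions.

    v_c is a centre offset measured in anchor heights, and v_h is a
    log-space height ratio, both relative to the reference anchor.
    """
    cy = v_c * anchor_h + anchor_cy    # absolute y-centre
    h = anchor_h * math.exp(v_h)       # absolute height
    return cy - h / 2.0, cy + h / 2.0  # top and bottom edges
```

With zero predicted offsets the anchor is returned unchanged, e.g. an anchor centred at y=100 with height 50 decodes to edges (75, 125).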

Recurrent Connectionist Text Proposals

Top: CTPN without recurrent connection. Bottom: with recurrent connection

● RNN layer connects sequential proposals directly in convolutional layer
● In-network recurrent architecture is end-to-end trainable
● Detect highly ambiguous text, and reduce false detections considerably
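A toy sketch of how a bidirectional recurrence gives each sliding-window proposal context from both neighbours. A plain tanh RNN in NumPy stands in for the BLSTM here; all names and sizes are illustrative, not the paper's implementation:

```python
import numpy as np

def bidirectional_rnn(seq, hidden=4, seed=0):
    """Toy bidirectional RNN over a sequence of window features.

    seq: (T, D) array, one feature vector per sliding-window position.
    Returns (T, 2*hidden): forward and backward states concatenated,
    so each proposal's representation sees context from both sides.
    """
    rng = np.random.default_rng(seed)
    T, D = seq.shape
    Wx = rng.standard_normal((D, hidden)) * 0.1
    Wh = rng.standard_normal((hidden, hidden)) * 0.1

    def run(xs):
        h = np.zeros(hidden)
        out = []
        for x in xs:
            h = np.tanh(x @ Wx + h @ Wh)  # recurrent state update
            out.append(h)
        return np.stack(out)

    fwd = run(seq)                # left-to-right pass
    bwd = run(seq[::-1])[::-1]    # right-to-left pass, re-aligned
    return np.concatenate([fwd, bwd], axis=1)
```

Because the recurrence runs over feature-map columns inside the network, it is differentiable end to end, which is what makes the in-network architecture trainable jointly with the convolutional layers.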

Side-Refinement

Red Box: with side-refinement. Yellow Box: without side-refinement

● Predict offsets for side-proposals - horizontal sides rectification
● Further improve localisation accuracy
● Joint predictions - not a post-processing step
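Applying the side-refinement can be sketched in one line: the predicted offset is relative to the anchor width (16 px in CTPN), so the refined horizontal side is offset * w_a + c_x^a. The function name is illustrative:

```python
def refine_side(offset, anchor_cx, anchor_w=16.0):
    """Apply a predicted side-refinement offset.

    offset: predicted horizontal shift, in units of anchor width.
    anchor_cx: x-centre of the side anchor.
    Returns the refined x-coordinate of the text-line side.
    """
    return offset * anchor_w + anchor_cx
```

Since the offset is predicted jointly with the text score and y-coordinates, this rectification needs no separate post-processing pass.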

Detecting Text in Fine-Scale Proposals

RPN Proposals CTPN Proposals

● Slide a 3x3 window through Conv5
● Text anchors are used for each window
● Output a sequence of 16-pixel-width proposals

Details:
● Improve localisation accuracy
● Generalise to multiple scales, aspects, and languages
● Using single-scale image

Advantages:
● Encode rich context information
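The fine-scale strategy above can be sketched as an anchor generator: every anchor is exactly 16 pixels wide (matching the Conv5 stride), and only the height varies, so the network never has to regress a proposal's x-extent. The exact height set is illustrative here, a roughly geometric ladder in the spirit of the paper's k vertical anchors:

```python
def fine_scale_anchors(feat_width, stride=16,
                       heights=(11, 16, 23, 33, 48, 68, 97, 139, 198, 283)):
    """Generate fixed-width text anchors for one row of Conv5 positions.

    feat_width: number of columns in the conv feature map.
    Returns a list of (x_left, x_right, height) triples; every anchor
    spans exactly `stride` pixels horizontally.
    """
    anchors = []
    for col in range(feat_width):
        x_left = col * stride
        for h in heights:
            anchors.append((x_left, x_left + stride, h))
    return anchors
```

A text line is then detected as a dense sequence of these fine-scale proposals rather than one loose box, which is what sharpens localisation compared with RPN-style anchors.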

Experimental Results

References:
[1] S. Ren, K. He, R. Girshick, and J. Sun: Faster R-CNN: Towards real-time object detection with region proposal networks, NIPS, 2015.
[2] K. Simonyan, A. Zisserman: Very deep convolutional networks for large-scale image recognition, ICLR, 2015.
[3] P. He, W. Huang, Y. Qiao, C. C. Loy, and X. Tang: Reading scene text in deep convolutional sequences, AAAI, 2016.

Zhi Tian, Weilin Huang, Tong He, Pan He, and Yu Qiao European Conference on Computer Vision (ECCV), 2016

Online demo: textdet.com

Summary:

Red Box: CTPN detection. Yellow Box: ground truth

● Trained on 3K images in English and Chinese, generalises well to others (e.g., Korean)
● Fine-scale strategy improves Precision, while using RNN increases Recall and Precision
● Obtain 0.88 and 0.61 F-measures on the ICDAR 2013 and 2015, respectively
● Computationally efficient, with 0.14s/image GPU time (scale=600)
● Strong capability for detecting very small-size text