Pattern Analysis and Applications ISSN 1433-7541 Pattern Anal Applic DOI 10.1007/s10044-014-0422-6
Context-aware television-internet mash-ups using logo detection and character recognition
Arpan Pal, Tanushyam Chattopadhyay, Aniruddha Sinha & Ramjee Prasad
INDUSTRIAL AND COMMERCIAL APPLICATION
Context-aware television-internet mash-ups using logo detection and character recognition
Arpan Pal · Tanushyam Chattopadhyay · Aniruddha Sinha · Ramjee Prasad
Received: 20 December 2012 / Accepted: 20 October 2014
© Springer-Verlag London 2014
Abstract Television can be a prime candidate for bringing internet to the masses in an affordable manner, especially in developing nations. One such available system, called the home infotainment platform (HIP), uses an over-the-top box to provide a low-cost and affordable solution. However, a user study from HIP suggests that the experience of browsing the internet on TV in the traditional way is not satisfactory. In this paper, we introduce the novel concept of context-aware television implemented on HIP, where we extract TV program contexts like identity and content using the image processing techniques of logo detection and character recognition. There can be innovative internet-TV mash-up applications using such contexts. The techniques are especially useful for deriving the contexts from analog broadcast TV content that is prevalent in countries like India. The algorithms are designed in a lightweight manner so that they can run efficiently on a low-cost, resource-constrained platform like HIP. Experimental results with live Indian TV channel data show acceptable accuracy for the proposed systems with low computational complexity.
Keywords Smart TV · Connected TV · TV-internet mash-up · Template matching · Optical character recognition
1 Introduction
As we embrace ubiquitous computing technology, there is a visible trend across the world of moving from the personal computer (PC) towards mobile phones, tablets and TVs as the preferred set of ubiquitous screens in our lives.
However, market studies in India reveal some interesting
facts. According to studies by Indian Marketing Research
Bureau (IMRB) [29], in 2009 there were 87 million PC-literate people in India (out of a total population of 818 million above the age of 12) and 63 million internet users. Only
30 % of these users accessed internet from home PCs.
A sizeable 37 % of users accessed the internet from cyber cafes, and only 4 % accessed it from alternate devices like mobiles. More recent studies by the International
Telecommunication Union (ITU) indicate [30] that in
2010, household computer penetration in India was only
6.1 % and household internet penetration was only 4.2 %.
This brings out a clear picture of the digital divide that exists in India, where only a small proportion of the population has access to PCs or the internet due to cost, skill, usability and other issues. Reference [30] also indicates
that there is 61.4 % mobile phone penetration in India in
2010. Another similar report [31] states that India had about 812 million mobile subscribers in 2011; however, only 26.3 million of them were active mobile internet users.
This can be attributed to the fact that the majority of mobile phones in India are low-end devices with small screens, which limits the volume and quality of information that can be disseminated as well as the overall end-user experience.
A. Pal (✉) · T. Chattopadhyay · A. Sinha
Innovation Lab, Tata Consultancy Services, Kolkata, India
e-mail: [email protected]
T. Chattopadhyay
e-mail: [email protected]
A. Sinha
e-mail: [email protected]
R. Prasad
CTIF, Aalborg University, Aalborg, Denmark
e-mail: [email protected]
Tablets, though they have a larger screen size and a nice touch-screen experience, are not yet available at an affordable price. Similar digital divide pictures emerge from other developing countries as well [32, 33].
On the other hand, the number of television sets used in India has reached more than 60 % of homes (158 million households in 2011; http://en.wikipedia.org/wiki/Television_in_India). In this context, if we could make the television connected to the internet world in a low-cost manner, it has the potential of becoming the "ubiquitous computing screen" for the home, helping to bring down the above-mentioned digital divide, because it already has high penetration and a large display screen capable of providing an acceptable user experience.
The authors have already introduced such an internet-
connected platform called home infotainment platform
(HIP), which uses TV as a display, is affordable and can be
deployed for masses. HIP had, among other applications, an internet browser [34]. However, a user study of the HIP (internal technical report) revealed that very few users liked a separate internet browsing experience on TV; they do not want to watch TV and browse the internet simultaneously, as this is probably distracting. Hence there is a need for novel approaches that blend the internet experience and the TV experience together, which in turn needs automatic understanding of TV program context.
Understanding the basic TV context (what channel is
being watched and what the content of the program is) is
quite simple for digital TV broadcast like IPTV using
metadata provided in the digital TV stream [1]. But, in
developing countries, IPTV penetration is almost zero and
even the penetration of other kinds of digital TV like digital cable or direct-to-home (DTH) satellite is also quite low (less than 10 % in India). Even for the small percentage of digital cable or DTH satellite coverage, the content is not really programmed to have context metadata suitable for interactivity, as there is no return path for the interactivity. This is mainly due to the need for keeping content compatibility with the legacy analog system; additionally, cost and infrastructure issues also play a role. HIP, by its inherent internet-enabled architecture, has no issues with the return path and has the capability to blend video with graphics. Hence, it is worthwhile to explore the possibility of providing context-aware TV-internet mash-up-based applications on HIP. In
Fig. 1, we provide a setup in which HIP can be used to
provide such applications. To keep the cost low, HIP has
been designed with limited computing power and memory,
hence it is important to create context extraction algorithms
that are computationally lightweight.
Quite a few interesting applications can be created using
context-aware TV-internet mash-ups. There can be three different context types, namely channel identity, embedded text in static video pages and embedded text in dynamic video content. In Sect. 2, we provide the background, state-of-the-art study and problem articulation for
each of these three contexts. In Sect. 3, we describe the
proposed system and in Sect. 4, we provide implementation
results followed by discussion—both these sections cover
all the three contexts mentioned above. Finally in Sect. 5,
we summarize and conclude.
2 Background and problem definition
There are two main contexts for TV programs—(a) What
channel is being watched, i.e. TV channel identity and
(b) What video content is being watched. The latter one can
further be sub-divided into two classes—(1) Context in
static pages of TV programs (mainly happens in interactive
TV) and (2) Context in dynamic video contents. Both of these classes of context can be identified from the textual content embedded in the video. We now elaborate more on
the state-of-the-art and problem statement in context of
HIP for each of these three classes of contexts. For all
cases, the basic architecture of context-aware TV-internet
mash-up remains the same and is given in Fig. 1.
Fig. 1 Using HIP for TV-internet mash-ups (RF from the DTH/cable set-top box and A/V feed the video capture & context extraction block; the information mash-up engine pulls from the internet; video and graphics are alpha-blended to produce the A/V output to the television)
2.1 TV channel identity as context
TV-internet mash-up applications like electronic program
guide (EPG), TV audience viewership rating, targeted
advertisement through user viewership profiling, social
networking among people watching the same program,
etc., can benefit from identifying which channel is being
watched [2]. TV audience viewership rating applications
often use audio watermarking and audio signature-based
techniques to identify the channels [2–4]. However, audio
watermarking-based techniques, though real time, need
modification of the content on the broadcaster end. Audio
signature-based techniques do not need content modifica-
tion on the broadcaster end, however, they require sending
captured audio feature data of channel being watched to
back end for offline analytics and hence cannot be per-
formed in real time. Since we are looking at broadcaster-
agnostic real-time TV-internet mash-up kind of applica-
tions, these techniques will not work well. Hence we need to look for alternate techniques that work in real time and are computationally lightweight enough to run on HIP.
In our proposed work, we explore the possibility of
using TV channel logo for channel identification. Each TV
channel broadcast shows its unique logo image at pre-
defined locations of the screen. The identification can be
typically done by doing image template-based matching of
the unique channel logo. Figure 2 gives a screenshot of the
channel logo in a broadcast video.
We looked at the logos of the 92 most popular Indian TV channels and found that they can be classified into seven different types:
1. Static, opaque and rectangular (28 such channels).
2. Static, opaque and non-rectangular (19 such channels).
3. Static, transparent background and opaque foreground
(38 such channels).
4. Static, alpha-blended with video (2 such channels).
5. Non-static, changing colors with time (2 such
channels).
6. Non-static, fixed animated logos (2 such channels).
7. Non-static, randomly animated logos (1 such channel).
The proportion of the channels classified in these seven
types is given in Fig. 3. In our work, we have considered all types of channels except the non-static randomly animated ones, because they do not have a unique signature to detect and their proportion is in any case very small (1 %).
Some related work in this field is described in literature
[5–9]. All the approaches were analyzed, and it was found that the best performance is observed for the approaches described in [5] and [9]. But the approaches taken in [9] involve principal component analysis (PCA) and independent component analysis (ICA), both of which are computationally very expensive and thus difficult to realize on HIP with real-time performance. The approach of [5] works well only for channel type (a)—static, opaque and rectangular logos. Hence there is a need to develop a channel logo recognition algorithm that on one side is lightweight enough to run in real time on HIP and on the other side detects all six types of channel logos considered (type a to type f). There
are solutions available in the market like MythTV (www.mythTV.com), which provides channel-logo-based detection features, but it does not support all types of channel logos, nor does it support the SDTV-resolution PAL TV standard prevalent in India.
The main contribution of the proposed work is fourfold:
1. We propose a design that reduces the processing
overhead by limiting the search space to known
positions of logo and integrates an available light-
weight color template-based matching algorithm to
detect logos.
2. We propose a novel algorithm to automatically declare any portion of the logo to be "don't care", to take care of the non-rectangular, transparent and alpha-blended
Fig. 2 Channel logos in broadcast video
Fig. 3 Channel logo types
static logos (types b, c and d). This makes use of the
fact that static portions of the logo will be time
invariant whereas transparent or alpha-blended por-
tions of the logo will be time varying. It also
innovatively applies radar detection theory as a post-
processing block to improve the accuracy of the
detection under noisy video conditions that are
prevalent in analog video scenarios.
3. To make the logo detection work reliably for non-static
logos (types e and f), we propose creating a sequence
of logo templates covering the whole time variation
cycle of the logo and doing correlation of the captured
video with the set of templates to find the best match.
4. To save on the scarce computing resources, the logo
detection algorithm is not run all the time. The system
uses an innovative blue-screen/blank-screen detection
during channel change as an event to trigger the logo
detection algorithm only after a channel change.
Pixel-by-pixel matching of a test logo against all logos in the template set is computationally inefficient; to address this, we have used a fuzzy multi-factor-based approach for matching the template against the test logo, as described in [40].
2.2 Textual context from static pages in broadcast TV
Active services are value-added interactive services pro-
vided by the DTH (direct-to-home) providers and are
designed based on (DVB-S) digital video broadcasting
standard for satellites. They include education, sports,
online banking, shopping, jobs, matrimony, etc. These
services provide interactivity using short messaging service
(SMS) as the return path. For instance, the consumers can
interact by sending an SMS having a text string displayed
on the TV screen to a predetermined mobile number. For
example, Tata Sky, the leading DTH provider in India (www.tatasky.com), provides services like active mall to download wallpapers and ringtones, active astrology, active matrimony, movies on demand, service subscription, account balance, etc.
As the return path for the traditional DTH boxes is not
available, as part of interactivity, these pages instruct the
user to send some alphanumeric code generated on the TV
screen via SMS from their registered mobiles. This is
illustrated in Fig. 4 with a screenshot of an active page of
Tata Sky, with the text to be sent via SMS marked in red. The system is quite cumbersome from a user experience perspective. A better experience can be provided if the text in the video frame is recognized automatically and the SMS is generated. No prior work was found in the literature on text
recognition in static TV video frames.
In our proposed system, we have presented an optical
character recognition-based approach to extract the SMS
address and text content from video to send SMS auto-
matically by just pressing a hot key in the HIP remote
control. In addition to the complete end-to-end system
implementation, the main contribution is in the design of
an efficient pre-processing scheme consisting of noise
removal, resolution enhancement and touching character
segmentation, after which standard binarization techniques
and open source print OCR tools like GOCR (http://jocr.sourceforge.net/) and Tesseract (http://sourceforge.net/projects/tesseract-ocr/) are used to recover and understand the textual content. There are OCR products like ABBYY Screenshot Reader and ABBYY FineReader also available (http://www.abbyy.com/); however, it was decided to use open source tools to keep the system cost low.
2.3 Textual context from text embedded in broadcast
video
Typically broadcast videos of news channels, business
channels, music channels, education channels and sport
channels carry quite a bit of informative text that is inserted/overlaid on top of the original videos. If this
information can be extracted using optical character rec-
ognition (OCR), related information from web can be
mashed up either with the existing video on TV or can be
pushed into the second-screen devices like mobile phone
and tablets. Figure 5 gives an example screenshot of tex-
tual information inserted in a typical Indian news channel.
A Gartner report suggests that there is quite a bit of potential for new connected-TV widget-based services.1
The survey on the wish list of connected-TV customers shows that there is demand for a service where the user can get additional information from the internet or
Fig. 4 Text embedded in active pages of DTH TV
1 http://blogs.gartner.com/allen_weiner/2009/01/09/ces-day-2-yahoosconnected-tv-looks-strong.
different RSS feeds, related to the news show the customer
is watching over TV. A comprehensive analysis on the pros
and cons of the products on connected TV can be found in
[12]. But none of the above meets the contextual news
mash-up requirement. A nearly similar feature is demon-
strated by Microsoft in international consumer electronics
show (CES) 2008 where the viewers can access the con-
tents on election coverage of CNN.com while watching
CNN’s television broadcast, and possibly even participate
in interactive straw votes for candidates.2 But this solution is IPTV-metadata based and hence does not need textual context extraction.
The main technical challenge for creating the solution
lies in identifying the text area that changes dynamically
against a background of dynamically changing video. The
state-of-the-art shows that the approaches for text locali-
zation can be classified broadly in two types—(1) using
pixel domain information when the input video is in raw
format and (2) using the compressed domain information
when the input video is in compressed format. Since we are
already capturing the raw (UYVY) video as the input for
the proposed system, we focus only on the pixel domain
methods. A comprehensive survey on text localization is
described in [13] where all different techniques in the lit-
erature from 1994 to 2004 have been discussed. It is seen
that the pixel domain approaches are mainly texture-based (TB), and these are further sub-divided into connected component-based (CB) and edge-based (EB) approaches. CB approaches are covered in [14–18] and EB approaches in [19–23]. In [24–26] we get combined CB and EB approaches, whereas [27, 28] combine compressed domain and pixel domain information along with a combination of texture/edge-based methods.
It is typically seen that it is difficult to have one particular method perform well against varying kinds of text and video backgrounds; the hybrid approaches proposed in [27] and [28] seem to perform well in these scenarios. In
this work, we intend to propose an end-to-end system that
can provide these features on HIP. The main contribution
of the work lies in proposing low-computational com-
plexity algorithms for
1. An improved method for localizing the text regions of
the video and then identifying the screen layout for
those text regions, extending the work in [27] and [28].
2. Recognizing the content for each of the regions
containing the text information using novel pre-
processing techniques and Tesseract OCR as stated
in Sect. 2.2.
3. Applying a heuristics-based keyword spotting algorithm, where the heuristics are purely based on observations of the breaking news telecast on Indian news channels.
Some of the contributions mentioned in Sects. 2.1, 2.2 and 2.3 have already been published [35–39].
3 Proposed system
There are three different systems that can be built in an
integrated manner for the three different classes of context,
using the architecture described in Fig. 1. We present these
three systems below.
3.1 TV channel identity as context
The overview of channel logo recognition methodology is
described in Fig. 6 in the context of the overall system
described in Fig. 1. Each step is elaborated in detail below.
3.1.1 Logo template creation
Initially the videos of all channels are recorded to create a
single video file. Manual annotation is performed on the
video file to generate a ground-truth file containing channel
code, start frame number and end frame number. This
Fig. 6 Overview of channel logo recognition (logo template creation, logo matching, logo detection)
Fig. 5 Contextual text embedded in TV video
2 http://www.microsoft.com/presspass/press/2008/jan08/01-06MSMediaroomTVLifePR.mspx.
video is played using a tool that enables the user to select
the region of interest (ROI) of the logo from the video
using the mouse. To aid the user, an ROI suggestion system is provided in the tool, which is introduced below as an innovative extension. The tool takes the annotated ground-truth file as input to generate the logo template file containing the ROI coordinates, the height and width of the ROI and a feature-based template for each channel logo. The feature
considered for the template generation is quantized HSV
values of the pixels in the ROI [5]. To reduce the template
size without affecting the detection performance, 36 levels
of quantization are taken. It should be noted that the input video comes in UYVY format (as per the HIP implementation), so the tool converts this video to HSV.
3.1.2 Method of marking pixels of interest in ROI
The algorithm is based on the principle that the logo region remains invariant amid the varying video. The video buffer for
the ith frame (fi) is used to store the quantized HSV values
of all pixels in the ith frame.
• Compute the run-time average a_i(x, y) of each pixel of the ith frame at coordinate (x, y) as

a_i(x, y) = (a_{i-1}(x, y) · (i - 1) + f_i(x, y)) / i

• Compute the dispersion d_i(x, y) of each pixel of the ith frame as

d_i(x, y) = d_{i-1}(x, y) + |a_i(x, y) - f_i(x, y)|

• Compute the variation v_i(x, y) in pixel value at location (x, y) at the ith frame as

v_i(x, y) = d_i(x, y) / i

• Suggest the pixels having a variation greater than a threshold as out of the logo region:

f_i(x, y) = DON'T CARE ∀ (x, y) : v_i(x, y) > Th_var (1)
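The running-average, dispersion and variation steps above can be sketched as follows. This is a minimal illustration: the function name `dont_care_mask`, the frame list and the threshold name `th_var` are our assumptions, not part of the HIP implementation.

```python
import numpy as np

def dont_care_mask(frames, th_var):
    # frames: list of 2-D arrays of quantized HSV values in the logo ROI,
    # one per captured frame. Pixels whose variation exceeds th_var are
    # suggested as "don't care" (outside the static logo region).
    a = frames[0].astype(float)          # running average a_i(x, y)
    d = np.zeros_like(a)                 # accumulated dispersion d_i(x, y)
    for i, f in enumerate(frames[1:], start=2):
        a = (a * (i - 1) + f) / i        # run-time average update
        d = d + np.abs(a - f)            # dispersion update
    v = d / len(frames)                  # variation v_i(x, y)
    return v > th_var                    # True == don't-care pixel
```

A static logo pixel yields zero variation and is kept; a transparent or alpha-blended pixel varies with the underlying video and is masked out.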
3.1.3 Logo matching
The template of each logo is uploaded to the HIP box.
Inside the box, the captured TV video in the corresponding
ROI is compared with the template using a correlation-coefficient-based approach. The score is always a value in the range 0–1. We consider the logo as a candidate if
the score is greater than a fixed threshold. For noise-free
videos, a fixed threshold arrived at using experimentation
and heuristics works well. However, for noisy videos, we
need to go for statistical processing-based decision logic.
Usually, first the fixed threshold-based algorithm is applied
with threshold kept on the lower side (0.75 in our case) to
arrive at a set of candidate channels with best matching
scores. This normally contains quite a few false positives.
We employ the standard M/N detection approach used in
radar detection theory [10] to reduce the false positives.
The logo scores are generated for every f frames of video,
where f is the averaging window length. A decision algo-
rithm is implemented using N consecutive scores. The
channel that is occurring at least M times out of N is
detected as the recognized channel. We have experimented and arrived at optimal values of M = 5 and N = 9.
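The M/N decision logic borrowed from radar detection theory can be sketched as below; the function name and the stream of per-window best-matching candidates are illustrative assumptions.

```python
from collections import Counter, deque

def mn_detect(candidates, m=5, n=9):
    # candidates: iterable of best-matching channel ids, one per
    # averaging window of f frames. A channel is declared only when it
    # occurs at least m times among the last n windows (M/N detection),
    # which suppresses isolated false positives in noisy video.
    window = deque(maxlen=n)
    for cand in candidates:
        window.append(cand)
        if len(window) == n:
            channel, count = Counter(window).most_common(1)[0]
            if count >= m:
                return channel
    return None   # no channel passed the M/N test
```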
For time-varying logos at fixed locations (logo types e and f), it is observed that the variation follows a fixed pattern over time. It is seen that either the color of the logo goes through a cycle of variation or the image of the logo itself is
animated going through a fixed animation cycle. For both
these cases, instead of taking one image of the logo as the template, we take a series of images of the logo (representing its full variation cycle either in color or in image) as a template set and follow the same methodology as proposed above, followed by some aggregation logic.
Logo detection is a resource-hungry algorithm, as it does pixel-by-pixel matching for correlation. Hence it should be triggered only when there is a channel change. The change in channel is detected using the blue or blank screen that comes during channel transitions. In the proposed system, logo detection runs every 15 s until a channel is detected. Once detected, the next logo detection is triggered only by a channel change event. This frees up useful computing resources on HIP during normal channel viewing, which can be used for the textual context detection described in Sects. 3.2 and 3.3.
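The channel-change-triggered scheduling can be sketched as a simple per-frame state machine; `is_blank` and `match_logo` are hypothetical callbacks standing in for the blank/blue-screen detector and the template matcher.

```python
def track_channel(frames, is_blank, match_logo):
    # frames: iterable of captured video frames. Logo matching runs
    # only while no channel is known; a blank/blue screen (channel
    # change) resets the state so detection is re-triggered.
    channel = None
    for frame in frames:
        if is_blank(frame):
            channel = None               # channel change event
        elif channel is None:
            channel = match_logo(frame)  # may return None (no match yet)
        yield channel
```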
3.2 Textual context from static pages in broadcast TV
The proposed system is implemented using the generic
architecture given in Fig. 1. After collecting a set of images
of active pages, we made the following observations:
• The location of the relevant text region is fixed for a
particular active page.
• There is a considerable contrast difference between the
relevant text region and the background.
• The characters to be recognized are of standard font
type and size.
Based on these observations, we propose a set of steps
for the text recognition algorithm as depicted in Fig. 7.
Each of the steps is elaborated in detail below.
3.2.1 A priori ROI mapping
In this phase the relative position of all relevant text regions
for each active page is manually marked and stored in a
database. First we find the bounding box coordinates for each
ROI in the reference active pages through manual annotation. These manually found ROIs can be used as a priori information, as it was found that the active pages are static.
3.2.2 Pre-processing
Once the ROI is defined manually we can directly give this
ROI to the recognition module of some OCR engine.
However, it is found that there is a lot of blurring and there are artifacts in the ROI that reduce the recognition rate of the OCR. Hence we propose our own pre-processing scheme to
improve the quality of the text image before giving it to a
standard OCR engine for recognition. The pre-processing
scheme is divided into two parts—noise removal and
image enhancement. For noise removal, we do a 5-pixel
moving window average for the Luminance (Y) values.
The image is enhanced using the following steps:
• Apply six tap interpolation filter with filter coefficients
(1, -5, 20, 20, -5, 1) to zoom the ROI two times in
height and width.
• Apply frequency domain low-pass filtering using DCT
on the higher resolution image.
An ICA-based approach could also produce very good results, but we stayed with the above approach to keep the computational complexity low.
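A minimal sketch of the 2× zoom with the six-tap filter, applied separably along rows and then columns. The H.264-style rounding (add 16, shift right by 5 to normalize by the tap sum of 32) is our assumption, as the paper does not state its normalization.

```python
import numpy as np

TAPS = np.array([1, -5, 20, 20, -5, 1], dtype=np.int64)

def upsample_1d(x):
    # Doubles a 1-D luminance signal: even output samples are the
    # originals, odd samples are six-tap half-pel interpolations.
    x = np.asarray(x, dtype=np.int64)
    padded = np.pad(x, (2, 3), mode='edge')
    half = np.convolve(padded, TAPS[::-1], mode='valid')  # one per gap
    half = np.clip((half + 16) >> 5, 0, 255)              # normalize by 32
    out = np.empty(2 * len(x), dtype=np.int64)
    out[0::2] = x
    out[1::2] = half
    return out

def zoom2x(roi):
    # Separable application: rows first, then columns.
    rows = np.stack([upsample_1d(r) for r in roi])
    return np.stack([upsample_1d(c) for c in rows.T]).T
```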
3.2.3 Binarization and touching character segmentation
The output of the pre-processing module is then binarized using an adaptive thresholding algorithm. There are several
ways to achieve binarization so that the foreground and the
background can be separated. However, as both the char-
acters present in the relevant text region as well as the
background are not of a fixed gray level value, adaptive
thresholding is used in this approach for binarization. To obtain the threshold, the popular Otsu's method [11] is used.
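Otsu's method picks the threshold that maximizes the between-class variance of the gray-level histogram; a compact sketch follows (the function name is ours, and real systems may simply use a library implementation).

```python
import numpy as np

def otsu_threshold(gray):
    # gray: 2-D uint8 image. Returns the gray level t maximizing the
    # between-class variance (mu_T * omega(t) - mu(t))^2 /
    # (omega(t) * (1 - omega(t))).
    hist = np.bincount(gray.ravel(), minlength=256).astype(np.float64)
    p = hist / hist.sum()                    # gray-level probabilities
    omega = np.cumsum(p)                     # class-0 probability
    mu = np.cumsum(p * np.arange(256))       # class-0 cumulative mean
    mu_t = mu[-1]                            # global mean
    with np.errstate(divide='ignore', invalid='ignore'):
        sigma_b = (mu_t * omega - mu) ** 2 / (omega * (1.0 - omega))
    return int(np.argmax(np.nan_to_num(sigma_b)))
```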
Once the binarized image is obtained, it is frequently observed that the image consists of a number of touching characters. These touching characters degrade the accuracy
rate of the OCR. Hence the touching character segmenta-
tion is required to improve the performance of the OCR.
We propose an outlier detection-based approach, the steps
of which are as below:
• Find the width of each character. It is assumed that each connected component with a significant width is a character. Let the character width for the ith component be WC_i.
• Find the average character width μ_WC = (1/n) Σ_{i=1}^{n} WC_i, where n is the number of characters in the ROI.
• Find the standard deviation of the character width, σ_WC = STDEV(WC_i).
• Define the threshold on character width as T_WC = μ_WC + 3σ_WC.
• If WC_i > T_WC, mark the ith connected component as a candidate touching character.
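The outlier test above amounts to a 3-sigma rule on component widths; a sketch, with illustrative names:

```python
import numpy as np

def touching_candidates(widths):
    # widths: character widths WC_i of the connected components.
    # Components wider than mu_WC + 3 * sigma_WC are flagged as likely
    # touching characters that need segmentation.
    w = np.asarray(widths, dtype=float)
    t = w.mean() + 3.0 * w.std()            # T_WC = mu_WC + 3*sigma_WC
    return [i for i, wi in enumerate(w) if wi > t]
```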
3.2.4 Automatic detection of the text by the OCR engine
The properly segmented characters obtained as output of the previous module are passed to two standard OCR engines, GOCR and Tesseract, for automatic text detection. Once the text is detected, it is automatically sent as an SMS to the satellite DTH service provider.
3.3 Textual context from text embedded in broadcast
video
The proposed system follows the system design presented
in Fig. 1 and consists of steps given in Fig. 8. Each of the
steps is presented in detail below.
3.3.1 Localization of suspected text regions
We have used the approach of text localization described in
[26]. Our proposed methodology is based on the following assumptions, derived from observations of different news videos.
• Text regions have a high contrast.
• Texts are aligned horizontally.
• Texts have a strong vertical edge with the background.
• Breaking-news text persists in the video for at least 2 s.
Following [26], first we filter out low-contrast compo-
nents based on intensity-based thresholding and mark the
output as Vcont. Then for final text localization, we propose
Fig. 7 Text recognition in static pages (a priori ROI mapping → pre-processing for noise removal and image enhancement → binarization and touching character segmentation → OCR using standard engines)
a low-computational-complexity algorithm that can localize the candidate regions efficiently. The methodology is presented below:
• Count the number of black pixels in each row of Vcont. Let the number of black pixels in the ith row be cnt_black(i).
• Compute the average number avg_black of black pixels in a row as

avg_black = Σ_{i=1}^{ht} cnt_black(i) / ht

where ht is the height of the frame.
• Compute the absolute variation av(i) in the number of black pixels in a row from avg_black for each row as

av(i) = |cnt_black(i) - avg_black|

• Compute the average absolute variation (aav) as

aav = Σ_{i=1}^{ht} av(i) / ht

• Compute the threshold for marking the textual region as TH_txt_reg = avg_black + aav.
• Mark all pixels in the ith row of Vcont as white if

cnt_black(i) < TH_txt_reg (2)
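The row-wise thresholding above can be sketched on a binary contrast map; representing Vcont as a 2-D {0, 1} array with 1 meaning a black (high-contrast) pixel is our assumption.

```python
import numpy as np

def localize_text_rows(v_cont):
    # v_cont: 2-D {0,1} array from the contrast-filtering step,
    # 1 == black (high-contrast) pixel. Rows with fewer black pixels
    # than TH_txt_reg = avg_black + aav are whitened out.
    cnt = v_cont.sum(axis=1).astype(float)   # cnt_black(i) per row
    avg = cnt.mean()                         # avg_black
    aav = np.abs(cnt - avg).mean()           # average absolute variation
    keep = cnt >= avg + aav                  # candidate text rows
    out = v_cont.copy()
    out[~keep] = 0                           # mark non-text rows as white
    return out, keep
```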
3.3.2 Text region confirmation
This portion of the proposed method is based on the assumption that text in breaking news persists for some time. Vcont sometimes contains noise because of high-contrast regions in the video frame. But this noise usually appears in a few isolated frames only and is not present in all the frames in which the breaking-news text is persistent. In a typical video sequence at 30 FPS, one frame is displayed for 33 ms. Assuming breaking news to be persistent for at least 2 s, we filter out all regions that are not persistently present for more than 2 s.
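The 2-s persistence filter can be sketched as a per-pixel run-length counter over per-frame masks; 30 FPS and the function name are our assumptions.

```python
import numpy as np

def confirm_persistent(masks, fps=30, min_sec=2.0):
    # masks: list of per-frame boolean arrays marking candidate text
    # pixels. A pixel is confirmed only if it stays marked for an
    # unbroken run of at least fps * min_sec frames (60 at 30 FPS),
    # which rejects noise appearing in isolated frames.
    need = int(fps * min_sec)
    run = np.zeros(masks[0].shape, dtype=int)
    confirmed = np.zeros(masks[0].shape, dtype=bool)
    for m in masks:
        run = np.where(m, run + 1, 0)   # consecutive-frame counter
        confirmed |= run >= need
    return confirmed
```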
3.3.3 Binarization
Once the pre-processing is done, we compute the vertical and horizontal energy of each sub-block, based on the assumption that blocks with text have high energy levels. The regions with lower energy are marked as black
after they are checked using a threshold value. We first
compute the histogram for all the energy levels in a row,
determine the two major peaks denoting start and end of a
text segment and mark the threshold slightly lower than
the smaller peak. The result obtained contains some false positives, i.e. noise along with the detected text. Hence, we apply some morphological operations and filtering, which enhance the image and give better localization with fewer false positives. The final rectangular binarized image
of the localized text region is fed into the text recognition
block.
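The block-energy computation can be sketched as below. The gradient-magnitude energy measure and the 8×8 block size are illustrative assumptions, and the two-peak histogram thresholding described above is omitted for brevity:

```python
import numpy as np

def block_energy_map(gray, block=8):
    """Sum of absolute horizontal and vertical differences per block,
    a simple proxy for the 'high energy' of text-bearing sub-blocks.

    gray: 2-D luminance image; returns an (H//block, W//block) map.
    Low-energy blocks can then be marked as background (black)."""
    g = gray.astype(np.float64)
    h_energy = np.abs(np.diff(g, axis=1))   # horizontal gradients, (H, W-1)
    v_energy = np.abs(np.diff(g, axis=0))   # vertical gradients, (H-1, W)
    h, w = gray.shape
    hb, wb = h // block, w // block
    e = np.zeros((hb, wb))
    for i in range(hb):
        for j in range(wb):
            ys, xs = i * block, j * block
            e[i, j] = (h_energy[ys:ys + block, xs:xs + block - 1].sum()
                       + v_energy[ys:ys + block - 1, xs:xs + block].sum())
    return e
```

Text regions, with their dense stroke edges, produce markedly higher block energies than flat background.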
3.3.4 Text recognition
For text recognition, we exactly follow the process outlined
in Sect. 3.2 under ‘‘Touching character segmentation’’ and
‘‘Optical character recognition’’. We use the Tesseract
OCR engine.
One advantage of recognizing text from TV videos is that the font variation across different TV channels is very small. So we have applied a modified perfect metric-based method as described in [43] to recognize the textual
context followed by a weighted finite state transducer
(WFST)-based post-processing as described in [41].
3.3.5 Keyword selection
Here we propose an innovative post-processing approach
on the detected text based on the following observed
properties.
• Breaking news always appears in capital letters.
• The font size of breaking news is larger than that of the ticker text.
• It tends to appear in the central to central-bottom part of the screen.
These assumptions are supported by the screenshots of news shows telecast on different news channels, as shown in Fig. 9.
Fig. 8 Text recognition in broadcast video: localization of suspected text regions, confirmation of the text regions using temporal consistency, binarization, text recognition, keyword selection
From these observations, we have used the following
approach to identify the keywords.
• Operate the OCR only in upper case.
• If the number of words in a text line is above a heuristically obtained threshold, we consider it a candidate text line.
• If multiple such text lines are obtained, we choose a line near the bottom.
• Remove the stop words (like a, an, the, for, of, etc.) and
correct the words using a dictionary.
• Concatenate the remaining words to generate the search string for an internet search engine. The selected keywords can be given to internet search engines using web APIs to fetch related news, which can be blended on top of the TV video to create a mash-up between TV and web. Search engines like Google already provide word correction, eliminating the need for dictionary-based correction of keywords.
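The keyword-selection steps above can be sketched as follows. The stop-word list, the (text, y-position) line representation, and the word-count threshold value are illustrative assumptions:

```python
# Hypothetical stop-word list; the paper uses words like a, an, the, for, of.
STOP_WORDS = {"A", "AN", "THE", "FOR", "OF", "IN", "ON", "TO", "AND"}

def select_keywords(ocr_lines, min_words=2):
    """Build a search string from recognized text lines.

    ocr_lines: list of (text, y) pairs, y increasing towards the
    bottom of the screen. Keeps all-caps lines with enough words,
    picks the one nearest the bottom, drops stop words and joins
    the remainder into the search string."""
    candidates = [(text, y) for text, y in ocr_lines
                  if text.isupper() and len(text.split()) >= min_words]
    if not candidates:
        return ""
    text, _ = max(candidates, key=lambda c: c[1])   # line nearest bottom
    words = [w for w in text.split() if w not in STOP_WORDS]
    return " ".join(words)
```

The returned string can then be passed to a search engine web API as described above.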
4 Results and discussion
All three proposed sub-systems were implemented on
HIP under the general architecture outlined in Fig. 1.
Experimental data were collected from Indian live TV
channels to prove the efficacy of the proposed algorithms.
We describe the results obtained for each of the three sub-
systems below followed by a discussion on the results.
4.1 TV channel identity as context
The channel logo recognition module is tested with videos
recorded from around 92 Indian channels. The accuracy of
recognition is measured using two parameters, namely recall (r) and precision (p):

  r = c / (c + m),    p = c / (c + fp)    (3)
where c is total number of correct detections, m is total
number of misses and fp is the total number of false
positives.
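Eq. 3 maps directly to code; a small helper makes the reported figures easy to reproduce:

```python
def recall_precision(correct, misses, false_positives):
    """Eq. 3: r = c / (c + m), p = c / (c + fp),
    where c is the number of correct detections, m the misses
    and fp the false positives."""
    r = correct / (correct + misses)
    p = correct / (correct + false_positives)
    return r, p
```

For example, 96 correct detections with 4 misses give r = 0.96, matching the recall reported below.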
Each recorded video is approximately 10 min in duration. We have used a single logo template for 87 channels and multiple templates (varying from 3 to 5) for the remaining 5 channels, as these logos vary in shape or color over time. The experimental results without machine learning-based time complexity optimization are as below:
• Recall rate r = 0.96, signifying miss = 4 %
• Precision p = 0.95, signifying false positive = 5 %
As seen from the results, the accuracy of the algorithm is quite good. We analyzed the sources of the small recall and precision inaccuracies and found the following explanations:
• Channel logos with a very small number of foreground pixels are missed, in less than 5 % of cases.
• Misses also occur when the channel logo shifts to a different location from its usual position or when the channel logo itself changes. A sample screenshot of the channel logo location shift in the Ten Sports channel is shown in Fig. 10. Sample screenshots of the channel logo color change in the Sony Max channel and the complete change of the Star Plus channel logo are shown in Figs. 11 and 12, respectively.
To explain the false positives, we present the details in the form of a confusion matrix in Table 1. It is evident that most of the channels are mainly confused with DD Ne. The major reason is that the DD Ne channel logo is very small in size, and the false positives can be reduced by removing the DD Ne template from the corpus. The reason Zee Punjabi and Nepal-1 are detected wrongly is that these logos are transparent, so false detection occurs under some conditions of the background video. It does not happen all the time and hence can be mitigated through time averaging.
We also measured the computational complexity of the proposed system; the results for the different parts of the algorithm are shown in Table 2. As seen from the results, we are able to detect the channel within 1.5 s of the channel change, which is quite acceptable from the user experience perspective. However, since logo detection is triggered by channel change, the DSP CPU is available for other tasks when the user is not changing channels.

Fig. 9 Screen shots showing breaking news in four different channels
If we apply the machine learning-based method [40] of template matching, we can further reduce the time complexity by nearly 60 %, as it does not involve pixel-by-pixel matching, and also increase the recognition accuracy to an average recall rate of 1 and a precision of 0.995.
4.1.1 Discussion
We have proposed a logo recognition-based channel identification technique for value-added TV-internet mash-up applications. For logo recognition, we have introduced a template-matching solution where the logo templates are generated offline and the logo recognition is performed on the captured TV video in real time on HIP boxes using the templates. The main contribution of
the proposed work has been fourfold:
a. An algorithm to suggest logo ROI during manual
template generation.
b. An algorithm to handle non-rectangular, transparent and alpha-blended static logos with improved detection accuracy using statistical decision logic.
c. A time sequence-based algorithm to handle non-static logos.
d. Channel change event detection as a trigger to the logo recognition algorithm for reduced computational overhead.
Results of an experimental study of 92 Indian TV channels are presented, showing a recall rate of 96 % and a precision rate of 95 %, which is quite acceptable. The cases where the algorithm fails were analyzed; the failures happen in specific conditions that are handled using a machine learning-based approach [40] to further improve the accuracy.
The time complexity of the algorithm is also profiled, and it is found that a channel can be detected within 1.5 s of a channel change. While this figure is acceptable for the proposed application, there is scope for optimization using the DSP hardware accelerators (color space conversion, correlation and SAD) available on the DaVinci chipset of HIP.
4.2 Textual context from static pages in broadcast TV
Different videos were recorded from the various kinds of DTH active pages available. Screenshots of 10 different frames (only the relevant text region or ROI) are given in Fig. 13a–j. The page contents are manually annotated by storing the actual text (as read by a person) along with the page in a file. The captured video frames are passed through the proposed algorithm and its output (text strings) is stored in another file. The two files are compared to obtain the results.

Fig. 10 Channel logo location shift

Fig. 11 Channel logo color change

Fig. 12 Channel logo image change

Table 1 Confusion matrix for channel logo recognition

Original channel    Detected as
Zee Trendz          DD Ne
Zee Punjabi         TV9 Gujarati
DD News             DD Ne
Nick                DD Ne
Nepal 1             Zee Cinema

Table 2 Time complexity of different algorithm components

Module               Time (ms)
YUV to HSV           321.09
ROI mapping          0.08
Mean SAD matching    293.65
Correlation          293.65
The performance is analyzed by comparing the accuracy
of the available OCR engines (GOCR and Tesseract)
before and after applying the proposed image enhancement
techniques (pre-processing, binarization and touching
character segmentation). The raw textual results are given
in Table 3. The accuracy is calculated from the raw text
outputs using character comparison and is presented
graphically in Fig. 14. From the results, it is evident that
considerable improvement (50 % on average, 80 % in some
cases) is obtained in character recognition after using our
proposed methodology of restricting the ROI and applying
pre-processing and touching character segmentation before
providing the final image to the OCR engine. It is also seen
that Tesseract performs better as an OCR engine compared
to GOCR.
4.2.1 Discussion
We have presented an end-to-end system solution that automates user interaction with DTH active pages by extracting the textual context of the active page screens through text recognition. In addition to the complete end-to-end system implementation, the main contribution is the design of an efficient pre-processing
scheme consisting of noise removal, resolution enhance-
ment and touching character segmentation on which stan-
dard binarization techniques (like Otsu’s) and open source
print OCR tools like GOCR and Tesseract are applied.
From the results, it is quite clear that the proposed pre-
processing schemes improve the text recognition accuracy
quite significantly. Additionally it is seen that Tesseract
OCR performs much better than GOCR. Hence the final
system is implemented on HIP using the proposed pre-
processing algorithms of noise removal, resolution
enhancement and touching character segmentation, along
with Otsu’s binarization scheme and Tesseract OCR.
4.3 Textual context from text embedded in broadcast
video
We have tested the system on 20 channels: 13 news, 4 music-and-movie and 3 sports channels. A number of video sequences of approximately 5 min duration were taken from each channel (633 min in total). The experimental results are analyzed below.

Fig. 13 Different active page text region screenshots (a–j)

Table 3 Raw text outputs from OCR algorithms for different active pages (columns: output of GOCR, output of Tesseract, then GOCR and Tesseract outputs after applying the proposed algorithms)

(a) Sta_ring Govind_. Reem_ _n. RajpaI Yadav. Om Puri. | Starring Guvinda, Rcema Sen, Raipal Yadav, Om Puri. | Starring Govind_. Reem_ _n. RajpaI Yadav. Om Puri. | Starring Guvinda. Reema Sen, Raipal Yadav. Om Puri.
(b) _____ ___ ___ _________ ____ __ __ | Pluww SMS thu fnlluwmg (adn In 56633 | ___ SMS th_ folIcmng cod_ to S__ | Planta SMS tha Iullmmng mda tn 56633
(c) SmS YR SH to | SMS YR SH in 56633 | SmS YR SH to _____ | SMS YR SH to 56533
(d) _m_ BD to _____ | SMS BD to 56633 | SMS BD to S____ | SMS BD to 56633
(e) AM t___o_,_b _q____ | AM 2048eb 141117 | AM tOa_gb _q____ | AM 2048eb 141117
(f) _M_= _ _A___ to Sd___ | SMS: SC 34393 tn 56533 | _M_= _ _A___ to Sd___ | SMS: SC34393 tn 56633
(g) _W _ ' _b _ Ib_lb _a | W6.} 048abl;lbwzIb1a | ___ __Y_b yIbw_Ib_a | WP 2048ab Mlbwzlb 1 a
(h) ADD Ed_J to S____ | ADD Eau to $6633 | ADD Ed_J to S____ | ADD Edu to 56633
(i) AIC STAlUSlS/OUO_ t_;OS;t_ | AIC STATUS25/02/09 19:05:14 | mlC S_ATUSlS/OUO_ t_;OS=tA | A/C STATUS 25/02/09 19:05:14
(j) _ ________'__ | Sub ID 1005681893 | WbID_OOS_B_B__ | Sub ID 1005681893
4.3.1 Accuracy of text localization
We have used THContrast = 32, determined experimentally from the recorded video sequences. The threshold values are chosen so that there are no false negatives, though there may be some false positives; in this way we do not miss any important text region.
video frame and the high-contrast region extracted from the
frame are shown in Fig. 15. In Fig. 16, we show the
improved screenshots for the text localization after noise
cleaning using the proposed methodology. Referring to the
recall and precision measures outlined in Eq. 3, experi-
mental results show a recall rate of 100 % and precision of
78 % for the text localization module. The reason behind a
low precision rate is that we have tuned the parameters and threshold values so that the probability of false negatives (misses) is minimized. The final precision performance can be seen only after applying the text recognition and keyword selection algorithms.
4.3.2 Accuracy of text recognition
Once the text regions are localized, each candidate text row undergoes some processing prior to OCR and is given as input to Tesseract. It is found that in the case of false positives, a number of special characters appear in the OCR output. So we discard candidate texts whose special character/alphabet ratio is > 1. Moreover, our keyword detection method concentrates on capital letters, so we consider only the words in all capitals. It is found that the character-level accuracy of the selected OCR for those cases improves to 86.57 %.
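A sketch of this filtering follows. The exact definition of "special character" is our assumption — here, any non-alphanumeric, non-space character:

```python
def filter_ocr_candidates(lines):
    """Discard OCR outputs whose special-character/alphabet ratio
    exceeds 1 (typical of false-positive regions), then keep only
    the all-capital words of each surviving line as keyword
    candidates."""
    kept = []
    for line in lines:
        alpha = sum(ch.isalpha() for ch in line)
        special = sum((not ch.isalnum()) and (not ch.isspace())
                      for ch in line)
        if alpha == 0 or special / alpha > 1:
            continue                     # likely a false positive
        caps = [w for w in line.split() if w.isalpha() and w.isupper()]
        kept.append(" ".join(caps))
    return [k for k in kept if k]        # drop lines with no caps words
```

Garbage strings dominated by punctuation are rejected before any keyword is formed.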
4.3.3 Accuracy in information retrieval
Limitations of the OCR module can be overcome by having a strong dictionary or language model. In the proposed method, we avoid this constraint, as the Google search engine itself has such a module. So we simply give the output to the Google search engine, which in turn offers the actual text as an option, as shown in Fig. 17. We gave Google the input ''MUMBAI ATTACHED'', the text detected by the OCR, and Google itself offers the actual text ''MUMBAI ATTACKED'' as an option in its ''Did you mean'' prompt. This can be done programmatically using the web APIs provided by Google.
Finally, in Fig. 18, we present a screenshot of the final application, where the ''Mumbai attacked'' text phrase identified by the proposed system is used to search for relevant news on the internet, and one such news item (''The Taj attack took place at 06:00 h'') is superimposed on the TV video using alpha blending on HIP.
4.3.4 Discussion
In this section, we have proposed an end-to-end system on
HIP that provides low-computational complexity algo-
rithms for text localization, text recognition and keyword
selection, leading towards a novel TV-web mash-up application. As seen from the results, the proposed pre-processing algorithms for text region localization in TV news videos give good accuracy (~87 %) in final text recognition, which, when used with the word correction feature of Google, gives almost 100 % accuracy in retrieving relevant news from the web. Finally, we have shown how this information retrieved from the web can be mashed up with the TV video using alpha blending on HIP.

Fig. 14 Performance of different OCR engines before and after the proposed algorithms

Fig. 15 High-contrast regions in the video
With the performance improvements suggested in [41], the system was further tested on the data set described in [42]; this reduces the 1.79 MB memory requirement of Tesseract to only 2268 bytes. The analysis of the improvement in time complexity is reported in [43].
As future work, the same information can also be displayed on the user's second screen, such as a mobile phone or tablet. There is also scope for applying natural language processing (NLP) to regional news channels and providing cross-lingual mash-ups.
5 Conclusion
In this paper, we have presented a novel system for mashing up related data from the internet by understanding the broadcast video context, and we have shown three television applications where it can be applied. We have presented three different novel methodologies for identifying TV video context:
• Low-computational complexity channel identification using logo recognition, and using it for web-based fetching of the electronic program guide for analog TV channels.
• Detecting text in static screens of satellite DTH TV
active pages and using it for an automated mode of
interactivity for the end user. Text detection accuracy is
improved using novel pre-processing techniques.
• Detecting text in the form of breaking news in news TV channels and using it for mashing up relevant news from the web on TV. Text detection accuracy is improved using novel text localization techniques, and computational complexity is reduced using innovative methodologies that exploit unique properties of the ''Breaking News'' text and use search engine text correction features instead of a local dictionary.
Experimental results show that the applications are functional and work with acceptable accuracy. We believe that for developing nations this is the best way to bring the power of the internet to the masses, as the broadcast TV medium is still primarily analog and PC penetration is very poor. This is one of the suggested ways to improve the poor internet interactivity found in the user study discussed in Sect. 1 [44].
Acknowledgments The authors thank Prof. Bidyut Baran Chaudhuri and Prof. Utpal Garain of the Indian Statistical Institute for their kind advice and suggestions on the algorithm development. The authors also thank Chirabrata Bhaumik and Avik Ghose of TCS Innovation Labs for their help in the system implementation of the proposed work on HIP. This work was supported by Innovation Lab, Tata Consultancy Services.
References
1. ITU-T Technical Report (2011) Access to internet-sourced con-
tents. HSTP-IPTV-AISC (2011–03), March 2011
Fig. 16 Text regions after noise cleaning
Fig. 17 Screen shot of the Google search engine with recognized text
as input
Fig. 18 Screen shot of the final application with TV-web mash-up
2. Fink M, Covell M, Baluja S (2006) Social- and interactive-tele-
vision applications based on real-time ambient-audio identifica-
tion. In: Proceedings of EuroITV.
3. Baluja S, Covell M (2006) Content fingerprinting using wavelets,
3rd european conference on visual media production. (CVMP
2006), London
4. Baluja S, Covell M (2008) Waveprint: efficient wavelet-based
audio fingerprinting. Pattern Recognition, 41(11), Elsevier
5. Chattopadhyay T, Agnuru C (2010) Generation of electronic
program guide for RF fed TV channels by recognizing the
channel logo using fuzzy multifactor analysis. In: International
symposium on consumer electronics (ISCE 2010), Germany
6. Esen E, Soysal M, Ates TK, Saracoglu A, Alatan AA (2008) A
fast method for animated TV logo detection. CBMI, June 2008
7. Ekin A, Braspenning E (2006) Spatial detection of TV channel
logos as outliers from the content. In: Proc VCIP, SPIE 2006
8. Wang J, Duan L, Li Z, Liu J, Lu H, Jin JS (2006) A robust
method for TV logo tracking in video streams. ICME, 2006
9. Ozay N, Sankur B (2009) Automatic TV logo detection and
classification in broadcast videos. EUSIPCO, Scotland, 2009
10. Skolnik IM (2002) Introduction to radar systems, 3rd edn.
McGraw-Hill, New York
11. Otsu N (1979) A threshold selection method from gray-level
histograms. IEEE Trans Syst Man Cybernetics 9:1
12. McCracken H (2009) The connected TV: web video comes to
the living room. PC World, Mar 23, 2009
13. Jung K, Kim KI, Jain AK (2004) Text information extraction in
images and video: a survey. Pattern Recognition, Vol. 37, Issue 5,
May 2004
14. Shivakumara P, Trung QP, Chew LT (2009) A gradient differ-
ence based technique for video text detection. In: Proceedings of
10th international conference on document analysis and recog-
nition, 26–29 July 2009
15. Shivakumara P, Phan TQ, Lim TC (2009) A robust wavelet
transform based technique for video text detection. In: Proceed-
ings of 10th international conference on document analysis and
recognition, 26–29 July 2009
16. Emmanouilidis C, Batsalas C, Papamarkos N (2009) Develop-
ment and evaluation of text localization techniques based on
structural texture features and neural classifiers. In: Proceedings
of 10th international conference on document analysis and rec-
ognition, 26–29, pp 1270–1274
17. Jun Y, Lin-Lin H, Xiao LH (2009) Neural network based text
detection in videos using local binary patterns. In: Proceedings
of Chinese conference on pattern recognition, 4–6 Nov 2009,
pp 1–5
18. Zhong J, Jian W, Yu-Ting S (2009) Text detection in video
frames using hybrid features. In: Proceedings of international
conference on machine learning and cybernetics, pp 12–15
19. Ngo CW, Chan CK (2005) Video text detection and segmentation
for optical character recognition. Multimed Syst 10:3
20. Anthimopoulos M, Gatos B, Pratikakis I (2008) A Hybrid system
for text detection in video frames. In: Proceedings of the eighth
IAPR international workshop on document analysis systems,
pp 16–19
21. Shivakumara P, Phan TQ, Lim TC (2009) Video text detection
based on filters and edge features. In: Proceedings of IEEE
international conference on multimedia and expo, June 28–July 3
2009
22. Shivakumara P, Phan TQ, Lim TC (2008) Efficient video text
detection using edge features. In: Proceedings of 19th interna-
tional conference on pattern recognition, 8–11 Dec 2008
23. Shivakumara P, Phan TQ, Lim TC (2008) An efficient edge based
technique for text detection in video Frames. In: Proceedings of
the eighth IAPR international workshop on document analysis
systems, 16–19 Sept 2008
24. Yu S, Wenhong W (2009) Text localization and detection for
news video. In: Proceedings of second international conference
on information and computing science, 21–22 May 2009
25. Su Y, Ji Z, Song X, Hua R (2008) Caption text location with
combined features using SVM. In: Proceedings of 11th IEEE
international conference on communication technology, 10–12
Nov 2008
26. Su Y, Ji Z, Song X, Hua R (2008) Caption text location with
combined features for news videos. In: Proceedings of interna-
tional workshop on geoscience and remote sensing and education
technology and training, 21–22 Dec 2008
27. Chattopadhyay T, Sinha A (2009) Recognition of trademarks
from sports videos for channel hyperlinking in consumer end. In:
Proceeding of the 13th international symposium on consumer
electronics (ISCE’09), Japan, 25–28 May 2009
28. Chattopadhyay T, Chaki A (2010) Identification of trademarks
painted on ground and billboards using compressed domain fea-
tures of H.264 from sports videos. National Conference on
Computer Vision Pattern Recognition, Image Processing and
Graphics (NCVPRIPG), Jaipur, India, Jan 2010
29. Indian Market Research Bureau (2010) I-Cube 2009–10, Feb
2010
30. International Telecommunication Union (ITU) (2011) Measuring
the information society
31. Internet and Mobile Association of India (2011) Report on
Mobile internet in India, Aug 2011
32. International Telecommunication Union (ITU) (2011) The World
in 2011––ICT facts and figures
33. International Telecommunication Union (ITU) (2011), Informa-
tion society statistical profiles–Asia and the Pacific
34. Pal A, Prashant M, Ghose A, Bhaumik C (2010) Home infotain-
ment platform—a ubiquitous access device for masses. In:
Springer communications in computer and information science,
Vol 75, 2010. Ubiquitous computing and multimedia applications,
Also In: Proceedings on ubiquitous computing and multimedia
applications (UCMA), Miyazaki, Japan, March 2010, pp 11–19
35. Chattopadhyay T, Sinha A, Pal A, Pradhan D, Chowdhury SR
(2011) Recognition of channel logos from streamed videos for
value added services in connected TV. IEEE international con-
ference for consumer electronics (ICCE), Las Vegas, USA
36. Chattopadhyay T, Sinha A, Pal A (2011) TV Video context
extraction. IEEE trends and developments in converging tech-
nology towards 2020 (TENCON 2011), Bali, Indonesia, Nov
21–24 2011
37. Chattopadhyay T, Pal A, Garain U (2010) Mash up of breaking
news and contextual web information: a novel service for con-
nected television. In: Proceedings of 19th international confer-
ence on computer communications and networks (ICCCN),
Zurich, Switzerland, Aug 2010
38. Pal A, Sinha A, Chattopadhyay T (2010) Recognition of char-
acters from streaming videos. In: Minoru M (ed) Character rec-
ognition, Sciyo Publications, ISBN: 978-953-307-105-3, Sept
2010
39. Chattopadhyay T, Chaki A, Chattopadhayay D, Chatterjee N, Pal A (2009) A novel value added interactive services for
active pages of DTH set top boxes. Presented in experience
workshop in third international conference on pattern recognition
and machine intelligence (PReMI), New Delhi, India, Dec 2009
40. Chattopadhyay T, Agnuru C (2010) Generation of electronic
program guide for RF fed TV channels by recognizing the
channel logo using fuzzy multifactor analysis. In: Proceedings of
the 14th international symposium on consumer electronics
(ISCE’10), Germany, 7–10 June 2010
41. Chowdhury S, Garain U, Chattopadhyay T (2011) A weighted
finite-state transducer (WFST)-based language model for online
indic script handwriting recognition. In: Proceedings of 11th
international conference on document analysis and recognition
(ICDAR), Beijing, China, Sept 2011
42. Chattopadhyay T, Sengupta S, Sinha A, Rampuria N (2011)
Creation and analysis of a corpus of text rich Indian TV videos.
In: Proceedings of 11th international conference on document
analysis and recognition (ICDAR), Beijing, China, Sept 2011
43. Chattopadhyay T, Chaudhuri BB, Jain R (2012) A novel low
complexity TV video OCR system. In: 21st international con-
ference on pattern recognition (ICPR), Japan, Nov 2012
44. Pal A, Prasad R, Gupta R (2012) A low-cost connected tv plat-
form for emerging markets—requirement analysis through user
study. In: ESTIJ, Dec 2012